[ https://issues.apache.org/jira/browse/LUCENE-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847766#action_12847766 ]
Toke Eskildsen commented on LUCENE-2335:
----------------------------------------

The sort-first-then-resolve-Strings approach is what I did in the proof of concept. Its speed is that of TermInfosReader when it delivers a Term for a given position. If this is too slow for multiple segments, the approach with segment-spanning ordered ordinals could be tried.

As for deprecating stored fields, I guess there's the issue of spatial locality. Wouldn't moving the bytes into the inverted term index bloat it in a way that makes all searches slower?

There's an issue of having multiple terms in the same field for a given document, which also ties into facets. It takes some more logic to handle this, but I think it can be done without excessive memory or processing load: Basically we make two passes, where the first pass determines the optimal packed structure and the second pass fills in the ordinals. This would give us a memory overhead of
{code}
#docs + #references_to_terms + #terms ints
{code}
for very fast facet structure building with support for collator-sorted terms in the facet result. This is basically what we're already doing at Statsbiblioteket; the only real difference is whether the Strings are pulled from the Terms index or from an external structure.

To save RAM, this could be done using PackedInts
{code}
#docs*log2(#references_to_terms) + #references_to_terms*log2(#terms) + #terms*log2(#terms) bits
{code}
but I am afraid that access time would suffer. A hybrid
{code}
#docs*32 + #references_to_terms*32 + #terms*log2(#terms) bits
{code}
would be just as fast for building as the non-packed version and a wee bit slower for the final fetching of the terms.

Of course, just as with fillFields=true searches, the calculated Terms must be extracted at the end. For faceting, this can be quite a load.

The facet-supporting structure is not as simple as the sorting-optimized one. I realize that supporting facets from the start might be quite a large jump.
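
To make the two-pass idea and the three memory estimates above concrete, here is a minimal sketch. It is not Lucene code: the class TwoPassFacetOrdinals, its fields, and the String[][] input are hypothetical stand-ins for reading terms from an index, and the String[] of sorted terms stands in for the "#terms ints" that would point into the term index.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical sketch of the two-pass doc -> term-ordinal structure; not Lucene API. */
final class TwoPassFacetOrdinals {
    final int[] docOffsets;     // #docs + 1 entries: doc d owns refs[docOffsets[d] .. docOffsets[d + 1])
    final int[] refs;           // #references_to_terms entries: ordinals into sortedTerms
    final String[] sortedTerms; // stand-in for #terms ints pointing into the term index

    TwoPassFacetOrdinals(String[][] docTerms, Comparator<String> order) {
        // Pass 1: size the structure (references per document, distinct terms).
        TreeSet<String> distinct = new TreeSet<>(order);
        docOffsets = new int[docTerms.length + 1];
        for (int d = 0; d < docTerms.length; d++) {
            docOffsets[d + 1] = docOffsets[d] + docTerms[d].length;
            distinct.addAll(Arrays.asList(docTerms[d]));
        }
        sortedTerms = distinct.toArray(new String[0]);

        // Pass 2: fill in the ordinals now that every array size is known.
        refs = new int[docOffsets[docTerms.length]];
        for (int d = 0; d < docTerms.length; d++) {
            int pos = docOffsets[d];
            for (String term : docTerms[d]) {
                refs[pos++] = Arrays.binarySearch(sortedTerms, term, order);
            }
        }
    }

    /** Counts term occurrences over the hit documents; the result is indexed by sorted ordinal. */
    int[] countFacets(int[] hitDocIds) {
        int[] counts = new int[sortedTerms.length];
        for (int doc : hitDocIds) {
            for (int i = docOffsets[doc]; i < docOffsets[doc + 1]; i++) {
                counts[refs[i]]++;
            }
        }
        return counts;
    }

    /** Bits needed to address n distinct values, i.e. ceil(log2(n)), with a minimum of 1. */
    static int bits(long n) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(n - 1));
    }

    // The three estimates from the comment, all expressed in bits.
    static long plainBits(long docs, long refs, long terms) {
        return 32 * (docs + refs + terms);
    }

    static long packedBits(long docs, long refs, long terms) {
        return docs * bits(refs) + refs * bits(terms) + terms * bits(terms);
    }

    static long hybridBits(long docs, long refs, long terms) {
        return 32 * docs + 32 * refs + terms * bits(terms);
    }
}
```

Because the ordinals follow the comparator's order, walking the counts array from index 0 upwards yields the facet terms already sorted; a java.text.Collator could presumably be passed as the comparator, since it implements Comparator, though the sketch has only been reasoned through with natural String order.
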
However, if API breaks are required, I guess it would be best to do it as few times as possible?

> optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into field cache
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2335
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>
> Spinoff from java-dev thread "Sorting with little memory: A suggestion", started by Toke Eskildsen.
> When sorting by SortField.STRING we currently ask FieldCache for a StringIndex on that field.
> This can consume tons of RAM when the values are mostly unique (e.g. a title field), as it populates both int[] ords and String[] values.
> But if the index is only one segment, and the search sets fillFields=false, we don't need the String[] values, just the int[] ords. If the app needs to show the fields, it can pull them (for the one page) from stored fields.
> This can be a potent optimization -- a lot of RAM saved -- for optimized indexes.
> When fixing this we must take care to share the int[] ords if some queries use fillFields=true and some use fillFields=false, i.e. FieldCache will be called twice and it should share the int[] ords across those invocations.
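
For illustration, the ords-only sorting described in the issue (compare by int[] ords, resolve Strings only for the page that is shown) might look roughly like the sketch below. OrdOnlySort and the resolve callback are hypothetical; the callback stands in for a stored-field or term lookup, and a real collector would use a priority queue rather than sorting every hit.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.IntFunction;

/** Hypothetical sketch: sort hits by per-document ordinal, resolve Strings for one page only. */
final class OrdOnlySort {
    /**
     * @param ords     ords[docId] is the rank of the document's field value (single segment assumed)
     * @param hits     matching document ids
     * @param pageSize number of results to display
     * @param resolve  assumed lookup of the displayable value for a document id
     */
    static String[] sortedPage(int[] ords, int[] hits, int pageSize, IntFunction<String> resolve) {
        Integer[] sorted = Arrays.stream(hits).boxed().toArray(Integer[]::new);
        // Only ints are compared here; no String[] of field values is kept in memory.
        Arrays.sort(sorted, Comparator.comparingInt((Integer doc) -> ords[doc]));

        int n = Math.min(pageSize, sorted.length);
        String[] page = new String[n];
        for (int i = 0; i < n; i++) {
            page[i] = resolve.apply(sorted[i]); // Strings are fetched for the visible page only
        }
        return page;
    }
}
```

The cost of resolving Strings is then proportional to the page size rather than to the number of documents in the index.
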