[ https://issues.apache.org/jira/browse/LUCENE-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014723#comment-13014723 ]
Michael McCandless commented on LUCENE-3003:
--------------------------------------------

bq. It is inefficient - but I never saw a way around it since the lists are all being built in parallel (due to the fact that we are uninverting).

Lucene's indexer (TermsHashPerField) has precisely this same problem -- every unique term must point to two (well, one if omitTFAP) growable byte arrays. We use "slices" into a single big (paged) byte[], where the first slice is tiny and can only hold like 5 bytes, but then points to the next slice, which is a bit bigger, etc. We could look at refactoring that for this use too... though this is "just" the one-time startup cost.

bq. Another small & easy optimization I hadn't gotten around to yet was to lower the indexIntervalBits and make it configurable.

I did make it configurable in the Lucene class (you can pass it to the ctor), but for Solr I left it using every 128th term.

{quote}
Another small optimization would be to store an array of offsets to length-prefixed byte arrays, rather than a BytesRef[]. At least the values are already in packed byte arrays via PagedBytes.
{quote}

Both FieldCache and docvalues (branch) store an array-of-terms like this (the array of offsets is packed ints). We should also look at using an FST, which'd be the most compact, but the ord -> term lookup cost goes up.

Anyway I think we can pursue these cool ideas on new [future] issues...

> Move UnInvertedField into Lucene core
> -------------------------------------
>
>                 Key: LUCENE-3003
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3003
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>      Attachments: LUCENE-3003.patch, LUCENE-3003.patch,
>                   byte_size_32-bit-openjdk6.txt
>
>
> Solr's UnInvertedField lets you quickly look up all term ords for a
> given doc/field.
> Like FieldCache, it inverts the index to produce this, and creates a
> RAM-resident data structure holding the bits; but, unlike FieldCache,
> it can handle multiple values per doc, and it does not hold the term
> bytes in RAM. Rather, it holds only term ords, and then uses
> TermsEnum to resolve ord -> term.
>
> This is great eg for faceting, where you want to use int ords for all
> of your counting, and then only at the end you need to resolve the
> "top N" ords to their text.
>
> I think this is useful core functionality, and we should move most
> of it into Lucene's core. It's a good complement to FieldCache. For
> this first baby step, I just move it into core and refactor Solr's
> usage of it.
>
> After this, as separate issues, I think there are some things we could
> explore/improve:
>
> * The first pass that allocates lots of tiny byte[] looks like it
>   could be inefficient. Maybe we could use the byte slices from the
>   indexer for this...
> * We can improve the RAM efficiency of the TermIndex: if the codec
>   supports ords, and we are operating on one segment, we should just
>   use it. If not, we can use a more RAM-efficient data structure,
>   eg an FST mapping to the ord.
> * We may be able to improve on the main byte[] representation by
>   using packed ints instead of delta-vInt?
> * Eventually we should fold this ability into docvalues, ie we'd
>   write the byte[] image at indexing time, and then loading would be
>   fast, instead of uninverting

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
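The slice scheme Michael describes -- many per-term byte lists growing in parallel inside one shared byte[], each list starting in a tiny slice that, once full, links to a progressively bigger one -- can be sketched as below. This is a toy illustration, not Lucene's actual TermsHashPerField/ByteBlockPool code: the real implementation marks slice ends with level bytes and copies trailing bytes forward into the new slice, whereas here each slice simply reserves its last four bytes for a forwarding offset. The class and method names (SlicePool, newList, writeByte) are invented for the sketch.

```java
import java.util.Arrays;

/** Toy sketch of slice-based parallel byte lists in one shared byte[]. */
class SlicePool {
  // Slice size per level: the first slice is tiny, later slices grow,
  // so short lists stay cheap and long lists amortize pointer overhead.
  private static final int[] SIZES = {8, 16, 32, 64, 128, 256};

  private byte[] pool = new byte[1024];
  private int used = 0;

  /** Write cursor for one growing byte list. */
  static final class Writer {
    int slice;  // start offset of the current slice
    int pos;    // next write offset within pool
    int level;  // current slice level (index into SIZES)
  }

  private int alloc(int size) {
    if (used + size > pool.length) {
      pool = Arrays.copyOf(pool, Math.max(pool.length * 2, used + size));
    }
    int start = used;
    used += size;
    return start;
  }

  Writer newList() {
    Writer w = new Writer();
    w.level = 0;
    w.slice = alloc(SIZES[0]);
    w.pos = w.slice;
    return w;
  }

  void writeByte(Writer w, byte b) {
    int end = w.slice + SIZES[w.level] - 4;  // last 4 bytes reserved for link
    if (w.pos == end) {
      // Slice full: allocate the next, bigger slice and link the old one to it.
      int nextLevel = Math.min(w.level + 1, SIZES.length - 1);
      int next = alloc(SIZES[nextLevel]);
      pool[end]     = (byte) (next >>> 24);
      pool[end + 1] = (byte) (next >>> 16);
      pool[end + 2] = (byte) (next >>> 8);
      pool[end + 3] = (byte) next;
      w.slice = next;
      w.level = nextLevel;
      w.pos = next;
    }
    pool[w.pos++] = b;
  }

  /** Read len bytes back, following slice links from the head slice. */
  byte[] read(int firstSlice, int len) {
    byte[] out = new byte[len];
    int slice = firstSlice, level = 0, p = firstSlice, i = 0;
    while (i < len) {
      int end = slice + SIZES[level] - 4;
      if (p == end) {  // hit the link: jump to the next slice
        slice = ((pool[end] & 0xFF) << 24) | ((pool[end + 1] & 0xFF) << 16)
              | ((pool[end + 2] & 0xFF) << 8) | (pool[end + 3] & 0xFF);
        level = Math.min(level + 1, SIZES.length - 1);
        p = slice;
      } else {
        out[i++] = pool[p++];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    SlicePool pool = new SlicePool();
    Writer w = pool.newList();
    int first = w.slice;  // remember the head slice before writing
    for (int i = 0; i < 50; i++) pool.writeByte(w, (byte) i);
    byte[] back = pool.read(first, 50);
    System.out.println(back[49]);  // prints 49
  }
}
```

As in the indexer, the one-time cost is the pointer chasing on reads; the win is that thousands of lists grow in parallel with no per-list Object or array-copy overhead, which is exactly the property uninverting needs.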