[ https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558274#comment-13558274 ]
Shai Erera commented on LUCENE-4600:
------------------------------------

Hmm, it occurred to me that maybe your second comparison was between PostCollection and Counting? If so, then while it's indeed interesting, it's puzzling. PostCollection allocates a FixedBitSet for every segment and in the end obtains a DISI from each FBS. As far as I know, DISIs over bitsets are not so cheap, especially when nextDoc() is called, because they need to find the next set bit ... if it is indeed faster, we must get to the bottom of it. It could mean other Collectors could benefit from such a post-collection technique ...

While on that, is a DISI the best way to iterate over a bitset's set bits? I'm looking at OpenBitSetDISI.nextDoc() and it looks much more expensive than FixedBitSet.nextSetBit(). I modified PostCollection to do:

{code}
while (doc < length && (doc = bits.nextSetBit(doc)) != -1) {
  .. the previous code
  ++doc;
}
{code}

And all tests pass with this change too. I wonder if that's faster than DISI.

BTW, while making this change I noticed a slight inefficiency in all 3 Collectors. If the document has no facets, I should have returned, but I forgot the return statement, e.g.:

{code}
if (buf.length == 0) {
  // this document has no facets
  return; // THAT LINE WAS MISSING!
}
{code}

The code is still correct, it just executes some redundant instructions. I'll upload an updated patch with both changes shortly.
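For anyone who wants to try the nextSetBit() loop shape outside of Lucene, here is a minimal, self-contained sketch. It is only an illustration: it uses java.util.BitSet as a stand-in for FixedBitSet (both return -1 from nextSetBit() when no further bit is set), and the class and method names (SetBitIteration, collectSetBits) are made up for this example, not taken from the patch. It also assumes no bit is set at or beyond length, which is what a per-segment bitset sized to the segment would guarantee.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-alone sketch; java.util.BitSet stands in for Lucene's FixedBitSet.
class SetBitIteration {

    // Collects the indexes of all set bits by jumping from one set bit to the next.
    // Assumes no bit is set at or beyond `length`.
    static List<Integer> collectSetBits(BitSet bits, int length) {
        List<Integer> docs = new ArrayList<>();
        int doc = 0;
        while (doc < length && (doc = bits.nextSetBit(doc)) != -1) {
            docs.add(doc); // stands in for the per-document aggregation work
            ++doc;         // move past the bit just visited
        }
        return docs;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet(16);
        hits.set(1);
        hits.set(5);
        hits.set(15);
        System.out.println(collectSetBits(hits, 16)); // prints [1, 5, 15]
    }
}
```

The `doc < length` check comes first so that nextSetBit() is never called with an index past the end after the final `++doc`, which is the case the loop in the comment guards against as well.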
> Explore facets aggregation during documents collection
> ------------------------------------------------------
>
>                 Key: LUCENE-4600
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4600
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>         Attachments: LUCENE-4600-cli.patch, LUCENE-4600.patch,
> LUCENE-4600.patch, LUCENE-4600.patch, LUCENE-4600.patch, LUCENE-4600.patch
>
>
> Today the facet module simply gathers all hits (as a bitset, optionally with
> a float[] to hold scores as well, if you will aggregate them) during
> collection, and then at the end when you call getFacetsResults(), it makes a
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't
> have to tie up transient RAM (fairly small for the bit set but possibly big
> for the float[]).