[jira] [Commented] (LUCENE-4600) Explore facets aggregation during documents collection

Shai Erera (JIRA) Sun, 20 Jan 2013 07:20:15 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-4600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558274#comment-13558274
 ]


Shai Erera commented on LUCENE-4600:
------------------------------------

Hmm, it occurred to me that maybe your second comparison was between 
PostCollection and Counting? If so, then while the result is indeed 
interesting, it's also puzzling. PostCollection allocates a FixedBitSet for 
every segment and in the end obtains a DISI from each FBS. As far as I know, 
DISIs over bitsets are not so cheap, especially when nextDoc() is called, 
because they need to find the next set bit ... if it is indeed faster, we must 
get to the bottom of it. It could mean other Collectors could benefit from such 
a post-collection technique ...

While on that, is a DISI the best way to iterate over a bitset's set bits? I'm 
looking at OpenBitSetDISI.nextDoc() and it looks much more expensive than 
FixedBitSet.nextSetBit(). I modified PostCollection to do:

{code}
while (doc < length && (doc = bits.nextSetBit(doc)) != -1) {
  .. the previous code
  ++doc;
}
{code}

And all tests pass with this change too. I wonder if that's faster than DISI.
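
For readers without the patch at hand, here is a minimal, self-contained sketch 
of the loop shape above. It is only an illustration under stated assumptions: 
it uses java.util.BitSet as a stand-in for Lucene's FixedBitSet so that it runs 
without Lucene on the classpath, it assumes length is the size of the bitset 
(no bits set at or beyond it), and the class and method names are hypothetical 
and not taken from the patch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class SetBitIteration {

  // Hypothetical helper: visits every set bit below 'length' using the same
  // loop shape as the snippet above, and records each visited doc ID.
  // java.util.BitSet stands in for Lucene's FixedBitSet (an assumption);
  // both return -1 from nextSetBit() when no further bit is set.
  static List<Integer> collectSetBits(BitSet bits, int length) {
    List<Integer> docs = new ArrayList<>();
    int doc = 0;
    while (doc < length && (doc = bits.nextSetBit(doc)) != -1) {
      docs.add(doc); // placeholder for the per-document aggregation work
      ++doc;         // step past the current bit before the next lookup
    }
    return docs;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet(16);
    bits.set(1);
    bits.set(7);
    bits.set(15);
    System.out.println(collectSetBits(bits, 16)); // prints [1, 7, 15]
  }
}
```

The doc < length check runs before nextSetBit(), so the loop never asks for a 
bit at or past the end of the set, which matters for bitset implementations 
that do not accept an out-of-range index.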

BTW, while making this change I noticed a slight inefficiency in all 3 
Collectors. If the document has no facets, I should have returned, but I 
forgot the return statement, e.g.:

{code}
    if (buf.length == 0) {
      // this document has no facets
      return; // THAT LINE WAS MISSING!
    }
{code}

The code is still correct, it just executes some redundant instructions. I'll 
upload an updated patch with both changes shortly.
                
> Explore facets aggregation during documents collection
> ------------------------------------------------------
>
>                 Key: LUCENE-4600
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4600
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>         Attachments: LUCENE-4600-cli.patch, LUCENE-4600.patch, 
> LUCENE-4600.patch, LUCENE-4600.patch, LUCENE-4600.patch, LUCENE-4600.patch
>
>
> Today the facet module simply gathers all hits (as a bitset, optionally with 
> a float[] to hold scores as well, if you will aggregate them) during 
> collection, and then at the end when you call getFacetsResults(), it makes a 
> 2nd pass over all those hits doing the actual aggregation.
> We should investigate just aggregating as we collect instead, so we don't 
> have to tie up transient RAM (fairly small for the bit set but possibly big 
> for the float[]).

