[
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573633#comment-13573633
]
Shai Erera commented on LUCENE-4609:
------------------------------------
That will be interesting to test. In order to test it "fairly", we should
either test the decoder (that's what we usually test) through the abstracted
code (i.e. via CategoryListIterator), or, Gilad, if you can, please copy
CountingFacetsCollector and inline the new decoder code in place of the
dgap+vint code. That would be simpler to test, with the least noise.
bq. I'm not sure SimpleIntEncoder was ever used
Mike and I tested it ... at some point :). I don't remember where we posted the
results though, whether it was in email, GTalk or some issue. But I do remember
that the results were worse than DGapVInt's. We were always surprised by how
fast DGapVInt is .. all along we thought VInt was expensive, but it may not
be so expensive ... at least not on the Wikipedia collection.
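For reference, dgap+vint roughly means: take the document's ordinals in sorted
order, store the gap between consecutive values, and write each gap as a
variable-length int (7 data bits per byte, high bit marking continuation). A
minimal round-trip sketch of that idea, not Lucene's actual encoder:

```java
import java.io.ByteArrayOutputStream;

// Sketch of delta-gap + variable-length int (dgap+vint) encoding, in the
// spirit of the facet module's default encoder; NOT the actual Lucene code.
public class DGapVIntSketch {

  // Encode sorted ordinals as gaps; gaps of sorted ords are small, so most
  // vints fit in a single byte.
  public static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      int gap = ord - prev;
      prev = ord;
      while ((gap & ~0x7F) != 0) {     // high bit set = "more bytes follow"
        out.write((gap & 0x7F) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
    }
    return out.toByteArray();
  }

  // Decode numOrds values by reading vints and summing the gaps back up.
  public static int[] decode(byte[] buf, int numOrds) {
    int[] ords = new int[numOrds];
    int pos = 0, prev = 0;
    for (int i = 0; i < numOrds; i++) {
      int b = buf[pos++] & 0xFF, gap = b & 0x7F, shift = 7;
      while ((b & 0x80) != 0) {
        b = buf[pos++] & 0xFF;
        gap |= (b & 0x7F) << shift;
        shift += 7;
      }
      prev += gap;
      ords[i] = prev;
    }
    return ords;
  }
}
```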
bq. the decoding speed is significantly faster
That's good, but Mike and I have already concluded that EncodingSpeed just ..
lies :). It's a micro-benchmark, and while it showed significant improvements
after I moved the encoders to the bulk API, in the real-world scenario it
performed worse. I had to inline and specialize the code even more for it to
beat the previous implementation.
I will be glad if SemiPacked is faster .. but judging from past experience, I
don't get my hopes too high :).
As for this encoding algorithm, it all depends on how many values actually fall
into the 256 range. That's another problem w/ EncodingSpeed -- it uses a
real-world scenario from a crazy application which encoded 2430 ordinals for a
single document! You can see that the encoded values are small, e.g. by
looking at the NOnes bits/int. I suspect that in real life there won't be
many values that fall into that range, at least after some documents have been
indexed, because when a document has a single category per dimension, there is
not much chance that their ordinals will be "close".
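The SemiPacked patch itself isn't quoted in this thread, so purely as a
hypothetical illustration of why the 0..255 gap distribution matters: one
plausible "semi-packed" scheme gives small gaps a single byte and escapes
larger ones to a plain vint behind a 0xFF marker. The escape marker and the
exact layout here are my assumption, not the algorithm in
SemiPackedEncoder.patch:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical semi-packed gap encoding: gaps below 0xFF take one byte,
// larger gaps are written as 0xFF followed by a vint. Illustration only;
// NOT the actual SemiPackedEncoder.patch algorithm.
public class SemiPackedSketch {

  public static byte[] encode(int[] sortedOrds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int ord : sortedOrds) {
      int gap = ord - prev;
      prev = ord;
      if (gap < 0xFF) {
        out.write(gap);              // the hoped-for common case: one byte
      } else {
        out.write(0xFF);             // escape marker, then a plain vint
        while ((gap & ~0x7F) != 0) {
          out.write((gap & 0x7F) | 0x80);
          gap >>>= 7;
        }
        out.write(gap);
      }
    }
    return out.toByteArray();
  }

  public static int[] decode(byte[] buf, int numOrds) {
    int[] ords = new int[numOrds];
    int pos = 0, prev = 0;
    for (int i = 0; i < numOrds; i++) {
      int gap = buf[pos++] & 0xFF;
      if (gap == 0xFF) {             // escaped: read the vint form
        int b, shift = 0;
        gap = 0;
        do {
          b = buf[pos++] & 0xFF;
          gap |= (b & 0x7F) << shift;
          shift += 7;
        } while ((b & 0x80) != 0);
      }
      prev += gap;
      ords[i] = prev;
    }
    return ords;
  }
}
```

Whether this wins over plain dgap+vint depends entirely on how often gaps stay
under 256, which is exactly the distribution question above.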
But .. we should let luceneutil be the judge of that :). So Gilad, can you make
a patch with a SemiPackedCountingCollector? And also modify the default that
FacetCollector.create returns, so that it's easy to compare base (CountingFC) to
comp (SemiPackedCFC). If you want to test the collector, then run
TestDemoFacets (as-is) and CountingFCTest (modified to use the new collector)
to make sure the Collector works.
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch,
> LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write an IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (the default)
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) one that receives
> bitsPerValue up front, for when you e.g. know that you have a small taxonomy
> and the max value you can see, and (2) one that decides the optimal
> bitsPerValue per document and writes it as a header in the byte[] or something.
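Variant (2) from the description above can be sketched as: find the largest
ordinal in the document, derive bitsPerValue from it, write that as a one-byte
header, then bit-pack the values. A minimal sketch under those assumptions
(class and method names are illustrative, not from any patch):

```java
import java.io.ByteArrayOutputStream;

// Sketch of a per-document packed-ints encoding: a one-byte bitsPerValue
// header chosen from the doc's max ordinal, followed by bit-packed values.
// Illustration of the idea in the issue description, not a real patch.
public class PerDocPackedSketch {

  public static byte[] encode(int[] ords) {
    int max = 1;
    for (int v : ords) max = Math.max(max, v);
    int bpv = 32 - Integer.numberOfLeadingZeros(max); // bits for the largest value
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(bpv);                                   // one-byte header
    long acc = 0;
    int nbits = 0;
    for (int v : ords) {
      acc |= ((long) v) << nbits;                     // append bpv bits
      nbits += bpv;
      while (nbits >= 8) {                            // flush whole bytes
        out.write((int) (acc & 0xFF));
        acc >>>= 8;
        nbits -= 8;
      }
    }
    if (nbits > 0) out.write((int) (acc & 0xFF));     // flush the tail
    return out.toByteArray();
  }

  public static int[] decode(byte[] buf, int numValues) {
    int bpv = buf[0];
    long mask = (1L << bpv) - 1;
    int[] vals = new int[numValues];
    long acc = 0;
    int nbits = 0, pos = 1;
    for (int i = 0; i < numValues; i++) {
      while (nbits < bpv) {                           // refill the accumulator
        acc |= (long) (buf[pos++] & 0xFF) << nbits;
        nbits += 8;
      }
      vals[i] = (int) (acc & mask);
      acc >>>= bpv;
      nbits -= bpv;
    }
    return vals;
  }
}
```

Variant (1) would be the same minus the header, with bitsPerValue fixed at
construction time from the known taxonomy size.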
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira