[
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536928#comment-13536928
]
Adrien Grand commented on LUCENE-4609:
--------------------------------------
bq. Attached a PackedEncoder, which is based on PackedInts.
Nice! You could probably improve the memory efficiency and speed of the decoder
by using a ReaderIterator instead of a Reader (see the sketch after this list):
* getReader: consumes the packed array stream and returns an in-memory packed
array,
* getDirectReader: does not consume the whole stream and returns an impl that
uses IndexInput.seek to look up values,
* getReaderIterator: returns a sequential iterator which bulk-decodes values
(the "mem" parameter allows you to control the speed/memory-efficiency
trade-off), so it will be much faster than iterating over the values of
getReader.
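Something like this sketch, for example (the helper class and method names are
made up; the PackedInts calls follow the current 4.x API as far as I remember
it):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.packed.PackedInts;

class PackedAccessSketch {

  // getReader: reads the whole stream into an in-memory packed array,
  // then gives fast random access.
  static long firstValueInMemory(DataInput in) throws IOException {
    PackedInts.Reader reader = PackedInts.getReader(in);
    return reader.get(0);
  }

  // getDirectReader: values stay on disk, every get() seeks into the file.
  static long firstValueDirect(IndexInput in) throws IOException {
    PackedInts.Reader reader = PackedInts.getDirectReader(in);
    return reader.get(0);
  }

  // getReaderIterator: sequential access with bulk decoding; "mem" bounds the
  // decode buffer and controls the speed/memory trade-off.
  static long sumSequential(DataInput in) throws IOException {
    PackedInts.ReaderIterator it =
        PackedInts.getReaderIterator(in, PackedInts.DEFAULT_BUFFER_SIZE);
    long sum = 0;
    for (int i = 0; i < it.size(); ++i) {
      sum += it.next();
    }
    return sum;
  }
}
{code}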
For improved speed, getReaderIterator has a {{next(int count)}} method which
returns several values in a single call; this proved to be faster than calling
{{next()}} once per value. Another option could be to use
PackedInts.Encoder/Decoder directly, similarly to Lucene41PostingsFormat
(packed writers and reader iterators also use them under the hood).
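For example, the bulk-decoding path could look roughly like this (again only a
sketch with a made-up helper name; {{next(int count)}} may return fewer than
count values, so the returned LongsRef has to be consumed via its
offset/length):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.util.LongsRef;
import org.apache.lucene.util.packed.PackedInts;

class BulkDecodeSketch {

  // Sums all values of a packed stream, decoding up to 512 values per call.
  static long sumBulk(DataInput in) throws IOException {
    PackedInts.ReaderIterator it =
        PackedInts.getReaderIterator(in, PackedInts.DEFAULT_BUFFER_SIZE);
    long sum = 0;
    int remaining = it.size();
    while (remaining > 0) {
      // next(count) may return fewer than count values, but at least one.
      LongsRef bulk = it.next(Math.min(remaining, 512));
      for (int i = 0; i < bulk.length; ++i) {
        sum += bulk.longs[bulk.offset + i];
      }
      remaining -= bulk.length;
    }
    return sum;
  }
}
{code}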
bq. This is PForDelta compression (the outliers are encoded separately) I
think? We can test it and see if it helps ... but we weren't so happy with it
for encoding postings
If the packed stream is very large, another option is to split it into blocks
that all have the same number of values (but a different number of bits per
value). This should prevent the whole stream from growing because of rare
extreme values. This is what the stored fields index (with blocks of 1024
values) and Lucene41PostingsFormat (with blocks of 128 values) do. Storing the
min value at the beginning of the block and then only encoding deltas could
help too.
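A rough sketch of what such block-wise encoding could look like (the block
size, header layout and names are made up for illustration, and the values are
assumed to be non-negative ordinals; only the PackedInts calls follow the
actual API):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.packed.PackedInts;

class BlockPackedSketch {

  static final int BLOCK_SIZE = 128; // every block holds the same number of values

  // Writes one block: a small header (min value + bits per delta) followed by
  // BLOCK_SIZE packed deltas against the block minimum.
  static void writeBlock(DataOutput out, long[] values) throws IOException {
    assert values.length == BLOCK_SIZE;
    long min = values[0], max = values[0];
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    final int bitsPerValue = PackedInts.bitsRequired(max - min);
    out.writeVLong(min);         // per-block minimum
    out.writeVInt(bitsPerValue); // per-block bits per value
    PackedInts.Writer writer = PackedInts.getWriterNoHeader(
        out, PackedInts.Format.PACKED, BLOCK_SIZE, bitsPerValue,
        PackedInts.DEFAULT_BUFFER_SIZE);
    for (long v : values) {
      writer.add(v - min); // only deltas against the block minimum get encoded
    }
    writer.finish();
  }
}
{code}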
bq. The header is very large ... really you should only need 1) bpv, and 2)
bytes.length (which I think you already have, via both payloads and DocValues).
If the PackedInts API isn't flexible enough for you to feed it bpv and
bytes.length then let's fix that!
Most PackedInts methods have a "*NoHeader" variant that does the exact same job
without relying on a header at the beginning of the stream (LUCENE-4161); I
think this is what you are looking for. We should probably make this header
stuff opt-in rather than opt-out (by replacing getWriter/Reader/ReaderIterator
with the NoHeader methods and adding a method dedicated to reading/writing a
header).
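For illustration, a sketch of the NoHeader path, where bitsPerValue and the
value count come from the caller (e.g. from the payload/DocValues length plus a
small header of your own) instead of a per-stream header; class and method
names are made up:
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.packed.PackedInts;

class NoHeaderSketch {

  // Writes values without the built-in header; the caller is responsible for
  // remembering valueCount and bitsPerValue.
  static void write(DataOutput out, long[] values, int bitsPerValue) throws IOException {
    PackedInts.Writer writer = PackedInts.getWriterNoHeader(
        out, PackedInts.Format.PACKED, values.length, bitsPerValue,
        PackedInts.DEFAULT_BUFFER_SIZE);
    for (long v : values) {
      writer.add(v);
    }
    writer.finish();
  }

  // Reads them back, with valueCount and bitsPerValue supplied externally.
  static PackedInts.Reader read(DataInput in, int valueCount, int bitsPerValue)
      throws IOException {
    return PackedInts.getReaderNoHeader(
        in, PackedInts.Format.PACKED, PackedInts.VERSION_CURRENT,
        valueCount, bitsPerValue);
  }
}
{code}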
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
> Key: LUCENE-4609
> URL: https://issues.apache.org/jira/browse/LUCENE-4609
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/facet
> Reporter: Shai Erera
> Priority: Minor
> Attachments: LUCENE-4609.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the
> category ordinals. We have several such encoders, including VInt (default),
> and block encoders.
> It would be interesting to implement and benchmark a
> PackedIntsEncoder/Decoder, with potentially two variants: (1) one that
> receives bitsPerValue up front, e.g. when you know that you have a small
> taxonomy and the max value you can see, and (2) one that decides on the
> optimal bitsPerValue for each doc and writes it as a header in the byte[] or
> something.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]