[ 
https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilad Barkai updated LUCENE-4609:
---------------------------------

    Attachment: SemiPackedEncoder.patch

Finally figured out I was doing things completely wrong.. instead of having a 
super smart optimizing code for semi-packed encoder - there's now a strait 
forward semi-packed encoder:
Values smaller than 256 use only one byte (the value itself) and larger values 
are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per 
value (zero marker + 5 of VInt).

The idea, is to pay the penalty for variable length encoding for large values 
which should be less common in a sort-uniq-dgap scenario.

Wrote two versions w and w/o dgap specialization, though I'm not sure how 
useful is the non-specialized code.

I do not currently have the means to run the LuceneUtil (nor the wikipedia 
index with the categories) - but I ran the EncodingSpeed test - and was 
surprised.
While the encoding is on a little worse (or on par) with dgap-vint, the 
decoding speed is significantly faster. The new encode is the only (?!) encoder 
to beat {code}SimpleIntEncoder{code} (which writes plain 32 bits per value) in 
decoding time.

Those values being used in the EncodingSpeed are real scenario, but I'm not 
sure how much they represent a "common" case (e.g wikipedia). 

Mike - could you please try this encoder? I guess it only makes sense to run 
the specialized {code}DGapSemiPackedEncoder{code}.
Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or 
unique). It would be interesting to test it as well. We will pay in more I/O 
and much larger file size (~4 times larger..) but it doesn't mean it will be 
any slower.

Here are the results of the EncodingSpeed test:
{noformat}
Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 
2430) 41152 times.

Encoder                                                        Bits/Int         
 Encode Time                Encode Time          Decode Time                
Decode Time
                                                                              
[milliseconds]        [microsecond / int]       [milliseconds]        
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000         
         190                     1.9000                  165                    
 1.6500
VInt8                                                           18.4955         
         436                     4.3600                  359                    
 3.5900
Sorting(Unique(VInt8))                                          18.4955         
        3557                    35.5702                  314                    
 3.1400
Sorting(Unique(DGap(VInt8)))                                     8.5597         
        3485                    34.8502                  270                    
 2.7000
Sorting(Unique(DGapVInt8))                                       8.5597         
        3434                    34.3402                  192                    
 1.9200
Sorting(Unique(DGap(SemiPacked)))                                8.6453         
        3386                    33.8602                  156                    
 1.5600
Sorting(Unique(DGapSemiPacked))                                  8.6453         
        3397                    33.9702                   99                    
 0.9900
Sorting(Unique(DGap(EightFlags(VInt))))                          4.9679         
        4002                    40.0203                  381                    
 3.8100
Sorting(Unique(DGap(FourFlags(VInt))))                           4.8198         
        3972                    39.7203                  399                    
 3.9900
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                4.5794         
        4448                    44.4803                  645                    
 6.4500
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                4.5794         
        4461                    44.6103                  641                    
 6.4100


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 
1489) 67159 times.

Encoder                                                        Bits/Int         
 Encode Time                Encode Time          Decode Time                
Decode Time
                                                                              
[milliseconds]        [microsecond / int]       [milliseconds]        
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000         
         169                     1.6900                  163                    
 1.6300
VInt8                                                           18.2673         
         421                     4.2100                  374                    
 3.7400
Sorting(Unique(VInt8))                                          18.2673         
        2230                    22.3001                  337                    
 3.3700
Sorting(Unique(DGap(VInt8)))                                     8.9456         
        2257                    22.5701                  273                    
 2.7300
Sorting(Unique(DGapVInt8))                                       8.9456         
        2214                    22.1401                  192                    
 1.9200
Sorting(Unique(DGap(SemiPacked)))                                9.2787         
        2162                    21.6201                  180                    
 1.8000
Sorting(Unique(DGapSemiPacked))                                  9.2787         
        2148                    21.4801                  120                    
 1.2000
Sorting(Unique(DGap(EightFlags(VInt))))                          5.7542         
        2937                    29.3701                  395                    
 3.9500
Sorting(Unique(DGap(FourFlags(VInt))))                           5.5447         
        2768                    27.6801                  407                    
 4.0700
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                5.3566         
        3294                    32.9401                  651                    
 6.5100
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                5.3996         
        3318                    33.1801                  662                    
 6.6200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 
5555555 times.

Encoder                                                        Bits/Int         
 Encode Time                Encode Time          Decode Time                
Decode Time
                                                                              
[milliseconds]        [microsecond / int]       [milliseconds]        
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000         
         316                     3.1600                  318                    
 3.1800
VInt8                                                           20.8889         
         569                     5.6900                  496                    
 4.9600
Sorting(Unique(VInt8))                                          20.8889         
        1196                    11.9600                  488                    
 4.8800
Sorting(Unique(DGap(VInt8)))                                    12.0000         
        1151                    11.5100                  481                    
 4.8100
Sorting(Unique(DGapVInt8))                                      12.0000         
        1100                    11.0000                  352                    
 3.5200
Sorting(Unique(DGap(SemiPacked)))                               14.6667         
        1107                    11.0700                  507                    
 5.0700
Sorting(Unique(DGapSemiPacked))                                 14.6667         
        1037                    10.3700                  439                    
 4.3900
Sorting(Unique(DGap(EightFlags(VInt))))                         10.2222         
        1315                    13.1500                  656                    
 6.5600
Sorting(Unique(DGap(FourFlags(VInt))))                          10.2222         
        1408                    14.0800                  675                    
 6.7500
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                9.7778         
        1617                    16.1700                  990                    
 9.9000
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))               10.2222         
        1708                    17.0800                  992                    
 9.9200


Estimating ~100000000 Integers compression time by
Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 
957) 104493 times.

Encoder                                                        Bits/Int         
 Encode Time                Encode Time          Decode Time                
Decode Time
                                                                              
[milliseconds]        [microsecond / int]       [milliseconds]        
[microsecond / int]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple                                                          32.0000         
         181                     1.8100                  168                    
 1.6800
VInt8                                                           16.5768         
         428                     4.2800                  351                    
 3.5100
Sorting(Unique(VInt8))                                          16.5768         
        1586                    15.8600                  330                    
 3.3000
Sorting(Unique(DGap(VInt8)))                                     8.4848         
        1477                    14.7700                  267                    
 2.6700
Sorting(Unique(DGapVInt8))                                       8.4848         
        1435                    14.3500                  193                    
 1.9300
Sorting(Unique(DGap(SemiPacked)))                                8.6771         
        1424                    14.2400                  159                    
 1.5900
Sorting(Unique(DGapSemiPacked))                                  8.6771         
        1402                    14.0200                  100                    
 1.0000
Sorting(Unique(DGap(EightFlags(VInt))))                          4.4138         
        1603                    16.0300                  376                    
 3.7600
Sorting(Unique(DGap(FourFlags(VInt))))                           4.1797         
        1700                    17.0000                  382                    
 3.8200
Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt)))))                3.8955         
        2011                    20.1100                  602                    
 6.0200
Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt)))))                3.8871         
        2024                    20.2400                  594                    
 5.9400


{noformat}


I have another version which uses two bytes before using the variable length 
(e.g values smaller than 0x10000 are encoded as is in two bytes, other values 
are encoded as two leading zero bytes and the VInt) - but I did not optimize it 
yet, nor I'm sure it's very useful. 
                
> Write a PackedIntsEncoder/Decoder for facets
> --------------------------------------------
>
>                 Key: LUCENE-4609
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4609
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/facet
>            Reporter: Shai Erera
>            Priority: Minor
>         Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, 
> LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch
>
>
> Today the facets API lets you write IntEncoder/Decoder to encode/decode the 
> category ordinals. We have several such encoders, including VInt (default), 
> and block encoders.
> It would be interesting to implement and benchmark a 
> PackedIntsEncoder/Decoder, with potentially two variants: (1) receives 
> bitsPerValue up front, when you e.g. know that you have a small taxonomy and 
> the max value you can see and (2) one that decides for each doc on the 
> optimal bitsPerValue, writes it as a header in the byte[] or something.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to