[ https://issues.apache.org/jira/browse/LUCENE-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573590#comment-13573590 ]
Gilad Barkai edited comment on LUCENE-4609 at 2/7/13 3:29 PM: -------------------------------------------------------------- Finally figured out I was doing things completely wrong.. instead of having a super smart optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder: Values smaller than 256 use only one byte (the value itself) and larger values are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt). The idea, is to pay the penalty for variable length encoding for large values which should be less common in a sort-uniq-dgap scenario. Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized code. I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the {EncodingSpeed} test - and was surprised. While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is significantly faster. The new encode is the only (?!) encoder to beat {SimpleIntEncoder} (which writes plain 32 bits per value) in decoding time. Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much they represent a "common" case (e.g wikipedia). Mike - could you please try this encoder? I guess it only makes sense to run the specialized {DGapSemiPackedEncoder}. Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times larger..) but it doesn't mean it will be any slower. Here are the results of the EncodingSpeed test: {noformat} Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 190 1.9000 165 1.6500 VInt8 18.4955 436 4.3600 359 3.5900 Sorting(Unique(VInt8)) 18.4955 3557 35.5702 314 3.1400 Sorting(Unique(DGap(VInt8))) 8.5597 3485 34.8502 270 2.7000 Sorting(Unique(DGapVInt8)) 8.5597 3434 34.3402 192 1.9200 Sorting(Unique(DGap(SemiPacked))) 8.6453 3386 33.8602 156 1.5600 Sorting(Unique(DGapSemiPacked)) 8.6453 3397 33.9702 99 0.9900 Sorting(Unique(DGap(EightFlags(VInt)))) 4.9679 4002 40.0203 381 3.8100 Sorting(Unique(DGap(FourFlags(VInt)))) 4.8198 3972 39.7203 399 3.9900 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 4.5794 4448 44.4803 645 6.4500 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 4.5794 4461 44.6103 641 6.4100 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 169 1.6900 163 1.6300 VInt8 18.2673 421 4.2100 374 3.7400 Sorting(Unique(VInt8)) 18.2673 2230 22.3001 337 3.3700 Sorting(Unique(DGap(VInt8))) 8.9456 2257 22.5701 273 2.7300 Sorting(Unique(DGapVInt8)) 8.9456 2214 22.1401 192 1.9200 Sorting(Unique(DGap(SemiPacked))) 9.2787 2162 21.6201 180 1.8000 Sorting(Unique(DGapSemiPacked)) 9.2787 2148 21.4801 120 1.2000 Sorting(Unique(DGap(EightFlags(VInt)))) 5.7542 2937 29.3701 395 3.9500 Sorting(Unique(DGap(FourFlags(VInt)))) 5.5447 2768 27.6801 407 4.0700 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 5.3566 3294 32.9401 651 6.5100 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 5.3996 3318 33.1801 662 6.6200 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 316 3.1600 318 3.1800 VInt8 20.8889 569 5.6900 496 4.9600 Sorting(Unique(VInt8)) 20.8889 1196 11.9600 488 4.8800 Sorting(Unique(DGap(VInt8))) 12.0000 1151 11.5100 481 4.8100 Sorting(Unique(DGapVInt8)) 12.0000 1100 11.0000 352 3.5200 Sorting(Unique(DGap(SemiPacked))) 14.6667 1107 11.0700 507 5.0700 Sorting(Unique(DGapSemiPacked)) 14.6667 1037 10.3700 439 4.3900 Sorting(Unique(DGap(EightFlags(VInt)))) 10.2222 1315 13.1500 656 6.5600 Sorting(Unique(DGap(FourFlags(VInt)))) 10.2222 1408 14.0800 675 6.7500 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 9.7778 1617 16.1700 990 9.9000 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 10.2222 1708 17.0800 992 9.9200 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 181 1.8100 168 1.6800 VInt8 16.5768 428 4.2800 351 3.5100 Sorting(Unique(VInt8)) 16.5768 1586 15.8600 330 3.3000 Sorting(Unique(DGap(VInt8))) 8.4848 1477 14.7700 267 2.6700 Sorting(Unique(DGapVInt8)) 8.4848 1435 14.3500 193 1.9300 Sorting(Unique(DGap(SemiPacked))) 8.6771 1424 14.2400 159 1.5900 Sorting(Unique(DGapSemiPacked)) 8.6771 1402 14.0200 100 1.0000 Sorting(Unique(DGap(EightFlags(VInt)))) 4.4138 1603 16.0300 376 3.7600 Sorting(Unique(DGap(FourFlags(VInt)))) 4.1797 1700 17.0000 382 3.8200 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 3.8955 2011 20.1100 602 6.0200 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 3.8871 2024 20.2400 594 5.9400 {noformat} I have another version which uses two bytes before using the variable length (e.g values smaller than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful. was (Author: gilad): Finally figured out I was doing things completely wrong.. instead of having a super smart optimizing code for semi-packed encoder - there's now a strait forward semi-packed encoder: Values smaller than 256 use only one byte (the value itself) and larger values are encoded as VInt plus a leading zero byte. Worst case it can be 6 bytes per value (zero marker + 5 of VInt). The idea, is to pay the penalty for variable length encoding for large values which should be less common in a sort-uniq-dgap scenario. Wrote two versions w and w/o dgap specialization, though I'm not sure how useful is the non-specialized code. I do not currently have the means to run the LuceneUtil (nor the wikipedia index with the categories) - but I ran the EncodingSpeed test - and was surprised. While the encoding is on a little worse (or on par) with dgap-vint, the decoding speed is significantly faster. The new encode is the only (?!) encoder to beat {code}SimpleIntEncoder{code} (which writes plain 32 bits per value) in decoding time. Those values being used in the EncodingSpeed are real scenario, but I'm not sure how much they represent a "common" case (e.g wikipedia). Mike - could you please try this encoder? I guess it only makes sense to run the specialized {code}DGapSemiPackedEncoder{code}. Also, I'm not sure SimpleIntEncoder was ever used (without any sorting, or unique). It would be interesting to test it as well. We will pay in more I/O and much larger file size (~4 times larger..) but it doesn't mean it will be any slower. Here are the results of the EncodingSpeed test: {noformat} Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 3630 (unsorted, length of: 2430) 41152 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 190 1.9000 165 1.6500 VInt8 18.4955 436 4.3600 359 3.5900 Sorting(Unique(VInt8)) 18.4955 3557 35.5702 314 3.1400 Sorting(Unique(DGap(VInt8))) 8.5597 3485 34.8502 270 2.7000 Sorting(Unique(DGapVInt8)) 8.5597 3434 34.3402 192 1.9200 Sorting(Unique(DGap(SemiPacked))) 8.6453 3386 33.8602 156 1.5600 Sorting(Unique(DGapSemiPacked)) 8.6453 3397 33.9702 99 0.9900 Sorting(Unique(DGap(EightFlags(VInt)))) 4.9679 4002 40.0203 381 3.8100 Sorting(Unique(DGap(FourFlags(VInt)))) 4.8198 3972 39.7203 399 3.9900 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 4.5794 4448 44.4803 645 6.4500 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 4.5794 4461 44.6103 641 6.4100 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 9910 (unsorted, length of: 1489) 67159 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 169 1.6900 163 1.6300 VInt8 18.2673 421 4.2100 374 3.7400 Sorting(Unique(VInt8)) 18.2673 2230 22.3001 337 3.3700 Sorting(Unique(DGap(VInt8))) 8.9456 2257 22.5701 273 2.7300 Sorting(Unique(DGapVInt8)) 8.9456 2214 22.1401 192 1.9200 Sorting(Unique(DGap(SemiPacked))) 9.2787 2162 21.6201 180 1.8000 Sorting(Unique(DGapSemiPacked)) 9.2787 2148 21.4801 120 1.2000 Sorting(Unique(DGap(EightFlags(VInt)))) 5.7542 2937 29.3701 395 3.9500 Sorting(Unique(DGap(FourFlags(VInt)))) 5.5447 2768 27.6801 407 4.0700 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 5.3566 3294 32.9401 651 6.5100 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 5.3996 3318 33.1801 662 6.6200 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 10000 (unsorted, length of: 18) 5555555 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 316 3.1600 318 3.1800 VInt8 20.8889 569 5.6900 496 4.9600 Sorting(Unique(VInt8)) 20.8889 1196 11.9600 488 4.8800 Sorting(Unique(DGap(VInt8))) 12.0000 1151 11.5100 481 4.8100 Sorting(Unique(DGapVInt8)) 12.0000 1100 11.0000 352 3.5200 Sorting(Unique(DGap(SemiPacked))) 14.6667 1107 11.0700 507 5.0700 Sorting(Unique(DGapSemiPacked)) 14.6667 1037 10.3700 439 4.3900 Sorting(Unique(DGap(EightFlags(VInt)))) 10.2222 1315 13.1500 656 6.5600 Sorting(Unique(DGap(FourFlags(VInt)))) 10.2222 1408 14.0800 675 6.7500 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 9.7778 1617 16.1700 990 9.9000 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 10.2222 1708 17.0800 992 9.9200 Estimating ~100000000 Integers compression time by Encoding/decoding facets' ID payload of docID = 501871 (unsorted, length of: 957) 104493 times. Encoder Bits/Int Encode Time Encode Time Decode Time Decode Time [milliseconds] [microsecond / int] [milliseconds] [microsecond / int] ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- Simple 32.0000 181 1.8100 168 1.6800 VInt8 16.5768 428 4.2800 351 3.5100 Sorting(Unique(VInt8)) 16.5768 1586 15.8600 330 3.3000 Sorting(Unique(DGap(VInt8))) 8.4848 1477 14.7700 267 2.6700 Sorting(Unique(DGapVInt8)) 8.4848 1435 14.3500 193 1.9300 Sorting(Unique(DGap(SemiPacked))) 8.6771 1424 14.2400 159 1.5900 Sorting(Unique(DGapSemiPacked)) 8.6771 1402 14.0200 100 1.0000 Sorting(Unique(DGap(EightFlags(VInt)))) 4.4138 1603 16.0300 376 3.7600 Sorting(Unique(DGap(FourFlags(VInt)))) 4.1797 1700 17.0000 382 3.8200 Sorting(Unique(DGap(NOnes(3) (FourFlags(VInt))))) 3.8955 2011 20.1100 602 6.0200 Sorting(Unique(DGap(NOnes(4) (FourFlags(VInt))))) 3.8871 2024 20.2400 594 5.9400 {noformat} I have another version which uses two bytes before using the variable length (e.g values smaller than 0x10000 are encoded as is in two bytes, other values are encoded as two leading zero bytes and the VInt) - but I did not optimize it yet, nor I'm sure it's very useful. > Write a PackedIntsEncoder/Decoder for facets > -------------------------------------------- > > Key: LUCENE-4609 > URL: https://issues.apache.org/jira/browse/LUCENE-4609 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet > Reporter: Shai Erera > Priority: Minor > Attachments: LUCENE-4609.patch, LUCENE-4609.patch, LUCENE-4609.patch, > LUCENE-4609.patch, LUCENE-4609.patch, SemiPackedEncoder.patch > > > Today the facets API lets you write IntEncoder/Decoder to encode/decode the > category ordinals. We have several such encoders, including VInt (default), > and block encoders. > It would be interesting to implement and benchmark a > PackedIntsEncoder/Decoder, with potentially two variants: (1) receives > bitsPerValue up front, when you e.g. know that you have a small taxonomy and > the max value you can see and (2) one that decides for each doc on the > optimal bitsPerValue, writes it as a header in the byte[] or something. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org