[ 
https://issues.apache.org/jira/browse/LUCENE-5145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722388#comment-13722388
 ] 

Boaz Leskes edited comment on LUCENE-5145 at 7/29/13 12:23 PM:
---------------------------------------------------------------

While making the above changes I did some measurements which I feel are also
useful to share.

PackedInts trades CPU for better CPU-cache and memory usage. PackedInts gives
you an acceptableOverheadRatio parameter to control that trade-off, but the
parameter is not exposed in the AbstractAppendingLongBuffer family, which is
based on PackedInts. This is especially important when you do not rely on
AbstractAppendingLongBuffer.iterator() to extract your data. Here are some
experiments I ran on my laptop, using BenchmarkAppendLongBufferRead, which is
included in the patch. The program lets you play with different read
strategies and data sizes, and measures reading times.
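As background for the numbers below: a packed layout stores each value in exactly bitsPerValue bits inside 64-bit blocks, paying shift/mask work on every read, while a byte-aligned layout such as Direct8 rounds each value up to 8 bits but reads directly. This is a minimal, self-contained sketch of the packed side, illustrative only and not the actual Packed64 code:

```java
// Sketch of bit-packing values into 64-bit blocks, illustrating the
// shift/mask work a packed reader (e.g. Packed64) does on every get().
// A byte-aligned store (e.g. Direct8) skips this, at roughly 2.6x the
// memory for 3-bit data. Not the real Lucene implementation.
class PackedSketch {
    private final long[] blocks;
    private final int bitsPerValue;

    PackedSketch(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        if (shift + bitsPerValue > 64) {          // value straddles two blocks
            blocks[block + 1] |= value >>> (64 - shift);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {          // pick up the spilled bits
            v |= blocks[block + 1] << (64 - shift);
        }
        return v & ((1L << bitsPerValue) - 1);
    }

    public static void main(String[] args) {
        PackedSketch p = new PackedSketch(500000, 3);
        for (int i = 0; i < 500000; i++) p.set(i, i % 8);
        for (int i = 0; i < 500000; i++) {
            if (p.get(i) != i % 8) throw new AssertionError("mismatch at " + i);
        }
        System.out.println("ok");
    }
}
```

With acceptableOverheadRatio, PackedInts can promote a 3-bit store to an 8-bit one (Direct8) to trade the extra memory for simpler reads; the point of the patch is to let AppendingLongBuffer users make that choice too.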

This is the result of using AppendingDeltaPackedLongBuffer (previously called
AppendingLongBuffer) to sequentially read an array of 500,000 elements via its
get method. The data was uniformly distributed numbers between 0 and 7. The
program performs 10,000 such reads; the total time is the time it took to
perform all of them. The output also shows the number of bits used to store
the elements and the storage class used.
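The measurement itself is straightforward; a stand-in sketch of the loop (a plain long[] replaces the buffer under test, and the pass count is a parameter so it finishes quickly):

```java
// Sketch of the benchmark's measurement loop: repeated full sequential reads
// via single gets, returning total seconds. The long[] is a stand-in for the
// AppendingLongBuffer under test; this is not BenchmarkAppendLongBufferRead.
class ReadBench {
    static double timeSequentialReads(long[] data, int passes) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int pass = 0; pass < passes; pass++) {
            for (int i = 0; i < data.length; i++) {
                checksum += data[i];              // one get() per element
            }
        }
        double total = (System.nanoTime() - start) / 1e9;
        if (checksum == Long.MIN_VALUE) System.out.println(checksum); // keep loop live
        return total;
    }

    public static void main(String[] args) {
        long[] data = new long[500_000];
        for (int i = 0; i < data.length; i++) data[i] = i & 7; // uniform 0..7
        // The numbers quoted below use 10,000 passes; use fewer here.
        int passes = 100;
        double total = timeSequentialReads(data, passes);
        System.out.printf("total: %.2fs avg: %.3fms per pass%n",
                total, total / passes * 1000);
    }
}
```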

------- Storage: DELTA_PACKED, Read: SEQUENTIAL, Read size: 1
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  22.18s avg:  2.22ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:  19.14s avg:  1.91ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)

As you can see, when retrieving elements one by one, the byte-based
implementation is slightly faster. For comparison, here is the new
AppendingPackedLongBuffer with the same setup:

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 1
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  16.69s avg:  1.67ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:  13.47s avg:  1.35ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
Besides being faster overall, it shows the same behavior.

For random reads, the classes display similar behavior:

------- Storage: DELTA_PACKED, Read: RANDOM, Read size: 1
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  23.13s avg:  2.31ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 223.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:  19.38s avg:  1.94ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 521.13kb)

------- Storage: PACKED, Read: RANDOM, Read size: 1
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  19.23s avg:  1.92ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:  15.95s avg:  1.60ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)


Next I looked at the effect of exposing the bulk reads offered by the
PackedInts structures in the AppendingLongBuffer family. Here are some results
from the new packed implementation, this time reading 4 and 16 consecutive
elements in a single read.

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 4
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  11.16s avg:  1.12ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
BULK   GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  24.22s avg:  2.42ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:   8.35s avg:  0.84ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
BULK   GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:   8.44s avg:  0.84ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

------- Storage: PACKED, Read: SEQUENTIAL, Read size: 16
SINGLE GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:   9.63s avg:  0.96ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
BULK   GET:    3 bits ratio 0.00 (i.e.,    3 bits) total time:  12.52s avg:  1.25ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Packed64, 219.76kb)
SINGLE GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:   7.46s avg:  0.75ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)
BULK   GET:    3 bits ratio 7.00 (i.e.,    8 bits) total time:   3.22s avg:  0.32ms, total read: 2500000000 elm (class org.apache.lucene.util.packed.Direct8, 517.13kb)

As you can see, the bulk read API for the Direct8 class starts to pay off when
reading 4 elements or more. For Packed64 it is still slower when reading 16
elements at a time; on my MacBook Air, it only started paying off at 1024
elements. This may be different on systems with larger CPU caches.
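Conceptually, a bulk get decodes a run of consecutive values into a caller-supplied array in one call instead of issuing one get per element. A hedged sketch on a byte-aligned store in the spirit of Direct8 (the signature is hypothetical, modeled on the bulk get this patch adds, not copied from it):

```java
// Byte-aligned store, one byte per value, in the spirit of Direct8.
// The bulk get is a tight copy loop the JIT can optimize well, which is
// why byte-aligned bulk reads pull ahead at read size 4 and up, while the
// packed (Packed64-style) bulk path must decode across 64-bit block
// boundaries and needs much larger runs before it wins.
class ByteStoreSketch {
    private final byte[] values;

    ByteStoreSketch(int valueCount) { values = new byte[valueCount]; }

    void set(int index, long value) { values[index] = (byte) value; }

    long get(int index) { return values[index] & 0xFFL; }

    // Hypothetical bulk get: fills dest[off..off+len) from index onward
    // and returns the number of values read.
    int get(int index, long[] dest, int off, int len) {
        for (int i = 0; i < len; i++) {
            dest[off + i] = values[index + i] & 0xFFL;
        }
        return len;
    }
}
```

Reading 16 elements then becomes one call plus a local loop instead of 16 calls through the buffer's per-element path.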

These tests (and many more I played with) are of course very synthetic. I also
ran the new classes under our implementation of faceted search; the results
were similar.
                
> Added AppendingPackedLongBuffer & extended AbstractAppendingLongBuffer family 
> (customizable compression ratio + bulk retrieval)
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5145
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5145
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Boaz Leskes
>         Attachments: LUCENE-5145.patch
>
>
> Made acceptableOverheadRatio configurable 
> Added bulk get to AbstractAppendingLongBuffer classes, for faster retrieval.
> Introduced a new variant, AppendingPackedLongBuffer which solely relies on 
> PackedInts as a back-end. This new class is useful where people have 
> non-negative numbers with a fairly uniform distribution over a fixed 
> (limited) range. Ex. facets ordinals.
> To distinguish it from AppendingPackedLongBuffer, delta based 
> AppendingLongBuffer was renamed to AppendingDeltaPackedLongBuffer
> Fixed an Issue with NullReader where it didn't respect it's valueCount in 
> bulk gets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
