I think you should do something with the read result in that microbenchmark (a sum is good); otherwise there is a risk that the JIT optimizes away most of that code.
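For example, the read loop in the quoted benchmark could accumulate and print a sum so the reads have an observable side effect. An untested sketch of the tweak:

{code}
long sum = 0;
for (int j = 0, n = valueCount; j < n; ++j) {
  sum += reader.get(offsets[j]);
}
long end = System.nanoTime();
// Printing the sum keeps the JIT from treating the reads as dead code.
System.out.println((end - start) + " sum=" + sum);
{code}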
On May 22, 2012 6:18 PM, "Adrien Grand (JIRA)" <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Adrien Grand updated LUCENE-4062:
> ---------------------------------
>
>     Attachment: LUCENE-4062.patch
>
> Oh right, the cases are swapped; I uploaded the wrong patch. Here is a correct version.
>
> I guess there are two options:
> 1. always writing the most compact form to disk and deserializing it into a fast {{PackedInts.Mutable}} implementation (depending on acceptableOverheadRatio),
> 2. writing a possibly aligned version to disk (depending on acceptableOverheadRatio) and then selecting the fastest PackedInts.Mutable implementation that exactly matches bitsPerValue (with no padding bits).
>
> A third option could be to write the padding bits as well ({{Packed64SingleBlock}} subclasses may have such padding bits), but I really dislike the fact that the on-disk format would then be implementation-dependent.
>
> Option 1 is likely to make deserialization slower since some decoding might occur (the copyData method). On the other hand, option 2 would prevent us from using implementations that add padding bits, such as Packed64SingleBlock21 (one padding bit every three 21-bit integers) or Packed64SingleBlock12 (4 padding bits every five 12-bit integers), but not Packed64SingleBlock{1,2,4} since 64%{1,2,4}=0.
>
> I initially chose option 1 because I think it is nice to have Packed64SingleBlock21 when bitsPerValue is close to 21, since it might be significantly faster than Packed64.
>
> In order to know how much slower deserialization is with option 1, I ran a micro-benchmark which:
> - loads a PackedInts.Reader from a ByteArrayDataInput,
> - performs n random reads on the reader.
>
> Here is the code for the micro-benchmark in case you would like to run it.
> {code}
> import java.util.Random;
>
> import org.apache.lucene.store.ByteArrayDataInput;
> import org.apache.lucene.store.ByteArrayDataOutput;
> import org.apache.lucene.store.DataInput;
> import org.apache.lucene.store.DataOutput;
> import org.apache.lucene.util.packed.PackedInts;
>
> int valueCount = 10000000;
> int bitsPerValue = 21;
> int[] offsets = new int[valueCount];
> Random random = new Random();
> for (int i = 0; i < valueCount; ++i) {
>   offsets[i] = random.nextInt(valueCount);
> }
> byte[] bytes = new byte[valueCount * 4];
> DataOutput out = new ByteArrayDataOutput(bytes);
> PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
> for (int i = 0; i < valueCount; ++i) {
>   writer.add(random.nextInt(1 << bitsPerValue));
> }
> writer.finish();
> for (int i = 0; i < 50; ++i) {
>   long start = System.nanoTime();
>   DataInput in = new ByteArrayDataInput(bytes);
>   PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
>   // PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
>   for (int j = 0, n = valueCount; j < n; ++j) {
>     reader.get(offsets[j]);
>   }
>   long end = System.nanoTime();
>   System.out.println(end - start);
> }
> {code}
>
> I ran this micro-benchmark for bitsPerValue=21 and valueCount = 1,000,000 and 10,000,000. The loading time (n = 0) is 2x to 3x slower with Packed64SingleBlock21. However, as soon as you perform valueCount/4 or more random reads (n >= valueCount/4), the total time is better for Packed64SingleBlock21. When n=valueCount, the total time is even 2x better for Packed64SingleBlock21.
>
> The loading overhead doesn't seem too bad.
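For readers who want to picture the layout being discussed: a rough, untested sketch of a 21-bits-per-value single block (three values per 64-bit long, one padding bit). This is an illustration only, not the code from the patch:

{code}
/** Hypothetical illustration: three 21-bit values per 64-bit block, one padding bit. */
class SingleBlock21Sketch {
  private static final long MASK = (1L << 21) - 1;
  private final long[] blocks;

  SingleBlock21Sketch(int valueCount) {
    blocks = new long[(valueCount + 2) / 3];
  }

  long get(int index) {
    // A single array read and no cross-block shift, unlike Packed64.
    final int shift = (index % 3) * 21;
    return (blocks[index / 3] >>> shift) & MASK;
  }

  void set(int index, long value) {
    final int shift = (index % 3) * 21;
    blocks[index / 3] = (blocks[index / 3] & ~(MASK << shift))
        | ((value & MASK) << shift);
  }
}
{code}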
> However I guess that some people might still be more interested in the loading time than in the query time (for write-intensive applications); in that case they could still request a reader in its compact form (Packed64, or Direct* when bitsPerValue is 8, 16, 32 or 64). If they want to be sure to have a fast reader too, they could also make sure that they used 8, 16, 32 or 64 as the bitsPerValue parameter of {{PackedInts.getWriter}}.
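For illustration, "making sure" could be a one-line rounding step before calling {{PackedInts.getWriter}}; this helper is hypothetical, not part of the patch:

{code}
// Hypothetical helper: round bitsPerValue up to the next aligned width
// (8, 16, 32 or 64) so that a fast Direct* reader can always be selected.
static int alignedBitsPerValue(int bitsPerValue) {
  if (bitsPerValue <= 8)  return 8;
  if (bitsPerValue <= 16) return 16;
  if (bitsPerValue <= 32) return 32;
  return 64;
}
{code}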
> Mike, what do you think?
>
> > More fine-grained control over the packed integer implementation that is chosen
> > --------------------------------------------------------------------------------
> >
> >                 Key: LUCENE-4062
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-4062
> >             Project: Lucene - Java
> >          Issue Type: Improvement
> >          Components: core/other
> >            Reporter: Adrien Grand
> >            Assignee: Michael McCandless
> >            Priority: Minor
> >              Labels: performance
> >             Fix For: 4.1
> >
> >         Attachments: LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch
> >
> > In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and is based on a byte/short/int/long array (Direct*), and another one which packs bits in a memory-efficient manner (Packed*).
> >
> > The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, this is a space loss of (32-21)/32 ≈ 34%.
> >
> > If you are willing to trade some space for speed, you could store three of these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so this can be significantly faster than Packed32 and Packed64, which always read/write two blocks in order to avoid costly branches.
> >
> > I ran some tests, and for 10,000,000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads. The 12-bit version (5 values per block) has the same performance improvement and a 6% memory overhead compared to the packed implementation.
> >
> > In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty reasonable, here is what implementations would be selected:
> > * 1: Packed64SingleBlock1
> > * 2: Packed64SingleBlock2
> > * 3: Packed64SingleBlock3
> > * 4: Packed64SingleBlock4
> > * 5: Packed64SingleBlock5
> > * 6: Packed64SingleBlock6
> > * 7: Direct8
> > * 8: Direct8
> > * 9: Packed64SingleBlock9
> > * 10: Packed64SingleBlock10
> > * 11: Packed64SingleBlock12
> > * 12: Packed64SingleBlock12
> > * 13: Packed64
> > * 14: Direct16
> > * 15: Direct16
> > * 16: Direct16
> > * 17: Packed64
> > * 18: Packed64SingleBlock21
> > * 19: Packed64SingleBlock21
> > * 20: Packed64SingleBlock21
> > * 21: Packed64SingleBlock21
> > * 22: Packed64
> > * 23: Packed64
> > * 24: Packed64
> > * 25: Packed64
> > * 26: Packed64
> > * 27: Direct32
> > * 28: Direct32
> > * 29: Direct32
> > * 30: Direct32
> > * 31: Direct32
> > * 32: Direct32
> > * 33: Packed64
> > * 34: Packed64
> > * 35: Packed64
> > * 36: Packed64
> > * 37: Packed64
> > * 38: Packed64
> > * 39: Packed64
> > * 40: Packed64
> > * 41: Packed64
> > * 42: Packed64
> > * 43: Packed64
> > * 44: Packed64
> > * 45: Packed64
> > * 46: Packed64
> > * 47: Packed64
> > * 48: Packed64
> > * 49: Packed64
> > * 50: Packed64
> > * 51: Packed64
> > * 52: Packed64
> > * 53: Packed64
> > * 54: Direct64
> > * 55: Direct64
> > * 56: Direct64
> > * 57: Direct64
> > * 58: Direct64
> > * 59: Direct64
> > * 60: Direct64
> > * 61: Direct64
> > * 62: Direct64
> >
> > Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the slower Packed64 implementation. Allowing a 50% overhead would prevent the packed implementation from being selected for bits per value under 32. Allowing an overhead of 32 bits per value would make sure that a Direct* implementation is always selected.
> >
> > Next steps would be to:
> > * make Lucene components use this {{getMutable}} method and let users decide which trade-off better suits them,
> > * write a Packed32SingleBlock implementation if necessary (I didn't do it because I have no 32-bit computer to test the performance improvements on).
> >
> > I think this would allow more fine-grained control over the speed/space trade-off; what do you think?
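To make the selection rule above concrete, here is a rough, untested sketch that appears to reproduce the table for {{acceptableOverheadPerValue = 0.2f * bitsPerValue}}. Class and method names are made up for illustration; the patch is authoritative:

{code}
// Hypothetical illustration of the selection rule: pick the fastest
// implementation whose wasted bits per value stay within the acceptable
// overhead. Implementation names mirror the issue text.
public class SelectionSketch {
  static String select(int bitsPerValue, float acceptableOverheadPerValue) {
    // Direct8/16/32/64 are the fastest: try the next aligned width first.
    for (int direct : new int[] {8, 16, 32, 64}) {
      if (bitsPerValue <= direct
          && direct - bitsPerValue <= acceptableOverheadPerValue) {
        return "Direct" + direct;
      }
    }
    // Packed64SingleBlock* next: floor(64 / n) values per 64-bit block.
    for (int n : new int[] {1, 2, 3, 4, 5, 6, 9, 10, 12, 21}) {
      if (bitsPerValue <= n) {
        float effectiveBits = 64f / (64 / n); // bits actually consumed per value
        if (effectiveBits - bitsPerValue <= acceptableOverheadPerValue) {
          return "Packed64SingleBlock" + n;
        }
        break; // larger single-block widths only waste more bits
      }
    }
    return "Packed64"; // fully packed fallback: zero overhead, slowest reads
  }

  public static void main(String[] args) {
    for (int bpv = 1; bpv <= 62; ++bpv) {
      System.out.println(bpv + ": " + select(bpv, 0.2f * bpv));
    }
  }
}
{code}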
