I think you should do something with the read result in that microbenchmark (a sum is good); otherwise there is a risk that the JIT optimizes away most of that code.
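For example, the read loop in the quoted benchmark could accumulate and print a sum so the reads have an observable side effect. An untested sketch of the tweak:

{code}
long sum = 0;
for (int j = 0, n = valueCount; j < n; ++j) {
  sum += reader.get(offsets[j]);
}
long end = System.nanoTime();
// Printing the sum keeps the JIT from treating the reads as dead code.
System.out.println((end - start) + " sum=" + sum);
{code}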
On May 22, 2012 6:18 PM, "Adrien Grand (JIRA)" <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Adrien Grand updated LUCENE-4062:
> ---------------------------------
>
>     Attachment: LUCENE-4062.patch
>
> Oh right, the cases are swapped; I uploaded the wrong patch. Here is a correct version.
>
> I guess there are two options:
> 1. always writing the most compact form to disk and deserializing it into a fast {{PackedInts.Mutable}} implementation (depending on acceptableOverheadRatio),
> 2. writing a possibly aligned version to disk (depending on acceptableOverheadRatio) and then selecting the fastest PackedInts.Mutable implementation that exactly matches bitsPerValue (with no padding bits).
>
> A third option could be to write the padding bits as well ({{Packed64SingleBlock}} subclasses may have such padding bits), but I really dislike the fact that the on-disk format would then be implementation-dependent.
>
> Option 1 is likely to make deserialization slower since some decoding might occur (the copyData method). On the other hand, option 2 would prevent us from using implementations that add padding bits, such as Packed64SingleBlock21 (one padding bit every three 21-bit integers) or Packed64SingleBlock12 (4 padding bits every five 12-bit integers), but not Packed64SingleBlock{1,2,4} since 64%{1,2,4}=0.
>
> I initially chose option 1 because I think it is nice to have Packed64SingleBlock21 when bitsPerValue is close to 21, since it might be significantly faster than Packed64.
>
> In order to know how much slower deserialization is with option 1, I ran a micro-benchmark which:
> - loads a PackedInts.Reader from a ByteArrayDataInput,
> - performs n random reads on the reader.
>
> Here is the code for the micro-benchmark in case you would like to run it.
> {code}
> import java.util.Random;
>
> import org.apache.lucene.store.ByteArrayDataInput;
> import org.apache.lucene.store.ByteArrayDataOutput;
> import org.apache.lucene.store.DataInput;
> import org.apache.lucene.store.DataOutput;
> import org.apache.lucene.util.packed.PackedInts;
>
> int valueCount = 10000000;
> int bitsPerValue = 21;
> int[] offsets = new int[valueCount];
> Random random = new Random();
> for (int i = 0; i < valueCount; ++i) {
>   offsets[i] = random.nextInt(valueCount);
> }
> byte[] bytes = new byte[valueCount * 4];
> DataOutput out = new ByteArrayDataOutput(bytes);
> PackedInts.Writer writer = PackedInts.getWriter(out, valueCount, bitsPerValue);
> for (int i = 0; i < valueCount; ++i) {
>   writer.add(random.nextInt(1 << bitsPerValue));
> }
> writer.finish();
> for (int i = 0; i < 50; ++i) {
>   long start = System.nanoTime();
>   DataInput in = new ByteArrayDataInput(bytes);
>   PackedInts.Reader reader = PackedInts.getReader(in, 0f); // Packed64
>   // PackedInts.Reader reader = PackedInts.getReader(in, 0.1f); // Packed64SingleBlock
>   for (int j = 0, n = valueCount; j < n; ++j) {
>     reader.get(offsets[j]);
>   }
>   long end = System.nanoTime();
>   System.out.println(end - start);
> }
> {code}
>
> I ran this micro-benchmark for bitsPerValue=21 and valueCount = 1,000,000 and 10,000,000. The loading time (n = 0) is 2x to 3x slower with Packed64SingleBlock21. However, as soon as you perform valueCount/4 or more random reads (n >= valueCount/4), the total time is better for Packed64SingleBlock21. When n=valueCount, the total time is even 2x better for Packed64SingleBlock21.
>
> The loading overhead doesn't seem too bad.
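For readers who want to picture the layout being discussed: a rough, untested sketch of a 21-bits-per-value single block (three values per 64-bit long, one padding bit). This is an illustration only, not the code from the patch:

{code}
/** Hypothetical illustration: three 21-bit values per 64-bit block, one padding bit. */
class SingleBlock21Sketch {
  private static final long MASK = (1L << 21) - 1;
  private final long[] blocks;

  SingleBlock21Sketch(int valueCount) {
    blocks = new long[(valueCount + 2) / 3];
  }

  long get(int index) {
    // A single array read and no cross-block shift, unlike Packed64.
    final int shift = (index % 3) * 21;
    return (blocks[index / 3] >>> shift) & MASK;
  }

  void set(int index, long value) {
    final int shift = (index % 3) * 21;
    blocks[index / 3] = (blocks[index / 3] & ~(MASK << shift))
        | ((value & MASK) << shift);
  }
}
{code}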
> However I guess that some people might still be more interested in the loading time than in the query time (for write-intensive applications); in that case they could still request a reader in its compact form (Packed64, or Direct* when bitsPerValue is 8, 16, 32 or 64). If they want to be sure to have a fast reader too, they could also make sure that they used 8, 16, 32 or 64 as the bitsPerValue parameter of {{PackedInts.getWriter}}.
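For illustration, "making sure" could be a one-line rounding step before calling {{PackedInts.getWriter}}; this helper is hypothetical, not part of the patch:

{code}
// Hypothetical helper: round bitsPerValue up to the next aligned width
// (8, 16, 32 or 64) so that a fast Direct* reader can always be selected.
static int alignedBitsPerValue(int bitsPerValue) {
  if (bitsPerValue <= 8)  return 8;
  if (bitsPerValue <= 16) return 16;
  if (bitsPerValue <= 32) return 32;
  return 64;
}
{code}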
> Mike, what do you think?
>
> > More fine-grained control over the packed integer implementation that is chosen
> > --------------------------------------------------------------------------------
> >
> >                 Key: LUCENE-4062
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-4062
> >             Project: Lucene - Java
> >          Issue Type: Improvement
> >          Components: core/other
> >            Reporter: Adrien Grand
> >            Assignee: Michael McCandless
> >            Priority: Minor
> >              Labels: performance
> >             Fix For: 4.1
> >
> >         Attachments: LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch
> >
> > In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and is based on a byte/short/int/long array (Direct*), and another one which packs bits in a memory-efficient manner (Packed*).
> >
> > The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, this is a space loss of (32-21)/32 ≈ 34%.
> >
> > If you are willing to trade some space for speed, you could store three of these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so this can be significantly faster than Packed32 and Packed64, which always read/write two blocks in order to avoid costly branches.
> >
> > I ran some tests, and for 10,000,000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads. The 12-bit version (5 values per block) has the same performance improvement and a 6% memory overhead compared to the packed implementation.
> >
> > In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty reasonable, here is what implementations would be selected:
> > * 1: Packed64SingleBlock1
> > * 2: Packed64SingleBlock2
> > * 3: Packed64SingleBlock3
> > * 4: Packed64SingleBlock4
> > * 5: Packed64SingleBlock5
> > * 6: Packed64SingleBlock6
> > * 7: Direct8
> > * 8: Direct8
> > * 9: Packed64SingleBlock9
> > * 10: Packed64SingleBlock10
> > * 11: Packed64SingleBlock12
> > * 12: Packed64SingleBlock12
> > * 13: Packed64
> > * 14: Direct16
> > * 15: Direct16
> > * 16: Direct16
> > * 17: Packed64
> > * 18: Packed64SingleBlock21
> > * 19: Packed64SingleBlock21
> > * 20: Packed64SingleBlock21
> > * 21: Packed64SingleBlock21
> > * 22: Packed64
> > * 23: Packed64
> > * 24: Packed64
> > * 25: Packed64
> > * 26: Packed64
> > * 27: Direct32
> > * 28: Direct32
> > * 29: Direct32
> > * 30: Direct32
> > * 31: Direct32
> > * 32: Direct32
> > * 33: Packed64
> > * 34: Packed64
> > * 35: Packed64
> > * 36: Packed64
> > * 37: Packed64
> > * 38: Packed64
> > * 39: Packed64
> > * 40: Packed64
> > * 41: Packed64
> > * 42: Packed64
> > * 43: Packed64
> > * 44: Packed64
> > * 45: Packed64
> > * 46: Packed64
> > * 47: Packed64
> > * 48: Packed64
> > * 49: Packed64
> > * 50: Packed64
> > * 51: Packed64
> > * 52: Packed64
> > * 53: Packed64
> > * 54: Direct64
> > * 55: Direct64
> > * 56: Direct64
> > * 57: Direct64
> > * 58: Direct64
> > * 59: Direct64
> > * 60: Direct64
> > * 61: Direct64
> > * 62: Direct64
> >
> > Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the slower Packed64 implementation. Allowing a 50% overhead would prevent the packed implementation from being selected for bits per value under 32. Allowing an overhead of 32 bits per value would make sure that a Direct* implementation is always selected.
> >
> > Next steps would be to:
> > * make Lucene components use this {{getMutable}} method and let users decide which trade-off better suits them,
> > * write a Packed32SingleBlock implementation if necessary (I didn't do it because I have no 32-bit computer to test the performance improvements on).
> >
> > I think this would allow more fine-grained control over the speed/space trade-off; what do you think?
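To make the selection rule above concrete, here is a rough, untested sketch that appears to reproduce the table for {{acceptableOverheadPerValue = 0.2f * bitsPerValue}}. Class and method names are made up for illustration; the patch is authoritative:

{code}
// Hypothetical illustration of the selection rule: pick the fastest
// implementation whose wasted bits per value stay within the acceptable
// overhead. Implementation names mirror the issue text.
public class SelectionSketch {
  static String select(int bitsPerValue, float acceptableOverheadPerValue) {
    // Direct8/16/32/64 are the fastest: try the next aligned width first.
    for (int direct : new int[] {8, 16, 32, 64}) {
      if (bitsPerValue <= direct
          && direct - bitsPerValue <= acceptableOverheadPerValue) {
        return "Direct" + direct;
      }
    }
    // Packed64SingleBlock* next: floor(64 / n) values per 64-bit block.
    for (int n : new int[] {1, 2, 3, 4, 5, 6, 9, 10, 12, 21}) {
      if (bitsPerValue <= n) {
        float effectiveBits = 64f / (64 / n); // bits actually consumed per value
        if (effectiveBits - bitsPerValue <= acceptableOverheadPerValue) {
          return "Packed64SingleBlock" + n;
        }
        break; // larger single-block widths only waste more bits
      }
    }
    return "Packed64"; // fully packed fallback: zero overhead, slowest reads
  }

  public static void main(String[] args) {
    for (int bpv = 1; bpv <= 62; ++bpv) {
      System.out.println(bpv + ": " + select(bpv, 0.2f * bpv));
    }
  }
}
{code}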
