[ https://issues.apache.org/jira/browse/LUCENE-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409056#comment-13409056 ]
Toke Eskildsen commented on LUCENE-4062:
----------------------------------------

One solution later and the mystery is still there: tricking the JVM into reading the old memory value before setting the new one in Direct64 does increase its speed to the expected level on the i7 server mars:
{code:java}
public static long MASK = 0L;

public void set(final int index, final long value) {
  // MASK is 0L, so this stores exactly `value`; the only difference
  // from a plain store is the extra read of values[index].
  values[index] = values[index] & MASK | value;
}
{code}
I am guessing the extra read somehow triggers either a pre-fetch of the relevant 4K page in main memory or marks the cache line as hotter, thus delaying its eviction. Alas, I am not a hardware guy and this is a bit beyond me. Anyway, the trick is not usable in practice, as it slows execution on other machines.

I can get access to another i7 server of the same generation, but it is identical to mars, so I doubt the results would differ. Unless someone steps up with another current-generation i7 server to test on, I think we should file this under pixies.

Also unresolved, but more relevant, is the question of auto-selecting the Mutable implementation. Based on the data so far, the current auto-selector will make sub-optimal choices on i7 machines. Without reliable architecture detection, another way of viewing the issue is whether to optimize the selector towards older machines (padded) or newer ones (contiguous).

I have looked around for a way to determine the CPU and cache size, but so far it seems the only fairly reliable approach is to detect the OS, call some platform-specific native code, and parse the output. Besides the ugliness of that solution, I guess it is disqualified by the extra permissions needed from the security manager.
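(Editor's sketch, not Lucene code: the class and method names below are invented for illustration. It only demonstrates that with MASK == 0L the masked read-modify-write above is semantically identical to a plain store — the speed difference on mars can therefore only come from the extra load, not from a different result.)
{code:java}
public class MaskedWriteDemo {
    static final long MASK = 0L;
    final long[] values;

    MaskedWriteDemo(int size) {
        values = new long[size];
    }

    // Plain store, as in the unmodified Direct64.set
    void setPlain(int index, long value) {
        values[index] = value;
    }

    // Read-modify-write variant from the comment above;
    // with MASK == 0L the old value is masked away entirely.
    void setMasked(int index, long value) {
        values[index] = values[index] & MASK | value;
    }

    public static void main(String[] args) {
        MaskedWriteDemo plain = new MaskedWriteDemo(4);
        MaskedWriteDemo masked = new MaskedWriteDemo(4);
        for (int i = 0; i < 4; i++) {
            plain.setPlain(i, i * 7L);
            masked.setMasked(i, i * 7L);
        }
        // Both arrays hold the same contents.
        System.out.println(java.util.Arrays.equals(plain.values, masked.values));
    }
}
{code}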
> More fine-grained control over the packed integer implementation that is chosen
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-4062
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4062
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/other
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>              Labels: performance
>             Fix For: 4.0, 5.0
>
>         Attachments: LUCENE-4062-2.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, LUCENE-4062.patch, Packed64SingleBlock.java, Packed64Strategy.java, Packed64calc.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedIntsBenchmark.java, PackedZero.java, measurements_te_graphs.pdf, measurements_te_i7.txt, measurements_te_p4.txt, measurements_te_xeon.txt
>
>
> In order to save space, Lucene has two main PackedInts.Mutable implementations: one that is very fast and is based on a byte/short/int/long array (Direct*), and another which packs bits in a memory-efficient manner (Packed*).
> The packed implementation tends to be much slower than the direct one, which discourages some Lucene components from using it. On the other hand, if you store 21-bit integers in a Direct32, this is a space loss of (32-21)/32=35%.
> If you accept trading some space for speed, you can store 3 of these 21-bit integers in a long, resulting in an overhead of 1/3 bit per value. One advantage of this approach is that you never need to read more than one block to read or write a value, so it can be significantly faster than Packed32 and Packed64, which always need to read/write two blocks in order to avoid costly branches.
> I ran some tests, and for 10000000 21-bit values, this implementation takes less than 2% more space and has 44% faster writes and 30% faster reads.
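(Editor's sketch of the single-block scheme described above: three 21-bit values per 64-bit long, 63 bits used and 1 bit wasted per block, so a get or set never crosses a block boundary. Class and method names are invented for illustration, not the actual Packed64SingleBlock API.)
{code:java}
public class SingleBlock21 {
    private static final int VALUES_PER_BLOCK = 3;
    private static final int BITS = 21;
    private static final long VALUE_MASK = (1L << BITS) - 1; // 0x1FFFFF

    private final long[] blocks;

    public SingleBlock21(int valueCount) {
        // Round up to whole blocks of 3 values.
        blocks = new long[(valueCount + VALUES_PER_BLOCK - 1) / VALUES_PER_BLOCK];
    }

    public long get(int index) {
        final int block = index / VALUES_PER_BLOCK;
        final int shift = (index % VALUES_PER_BLOCK) * BITS;
        // Single block read: no branch, no second block.
        return (blocks[block] >>> shift) & VALUE_MASK;
    }

    public void set(int index, long value) {
        final int block = index / VALUES_PER_BLOCK;
        final int shift = (index % VALUES_PER_BLOCK) * BITS;
        // Clear the 21-bit slot, then or in the new value.
        blocks[block] = (blocks[block] & ~(VALUE_MASK << shift))
            | ((value & VALUE_MASK) << shift);
    }

    public static void main(String[] args) {
        SingleBlock21 p = new SingleBlock21(9);
        for (int i = 0; i < 9; i++) {
            p.set(i, (1L << 21) - 1 - i); // store near-maximal 21-bit values
        }
        long sum = 0;
        for (int i = 0; i < 9; i++) {
            sum += p.get(i);
        }
        System.out.println(sum); // 9 * 2097151 - 36
    }
}
{code}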
> The 12-bit version (5 values per block) has the same performance improvement and a 6% memory overhead compared to the packed implementation.
> In order to select the best implementation for a given integer size, I wrote the {{PackedInts.getMutable(valueCount, bitsPerValue, acceptableOverheadPerValue)}} method. This method selects the fastest implementation that has less than {{acceptableOverheadPerValue}} wasted bits per value. For example, if you accept an overhead of 20% ({{acceptableOverheadPerValue = 0.2f * bitsPerValue}}), which is pretty reasonable, here is what implementations would be selected:
> * 1: Packed64SingleBlock1
> * 2: Packed64SingleBlock2
> * 3: Packed64SingleBlock3
> * 4: Packed64SingleBlock4
> * 5: Packed64SingleBlock5
> * 6: Packed64SingleBlock6
> * 7: Direct8
> * 8: Direct8
> * 9: Packed64SingleBlock9
> * 10: Packed64SingleBlock10
> * 11: Packed64SingleBlock12
> * 12: Packed64SingleBlock12
> * 13: Packed64
> * 14: Direct16
> * 15: Direct16
> * 16: Direct16
> * 17: Packed64
> * 18: Packed64SingleBlock21
> * 19: Packed64SingleBlock21
> * 20: Packed64SingleBlock21
> * 21: Packed64SingleBlock21
> * 22: Packed64
> * 23: Packed64
> * 24: Packed64
> * 25: Packed64
> * 26: Packed64
> * 27: Direct32
> * 28: Direct32
> * 29: Direct32
> * 30: Direct32
> * 31: Direct32
> * 32: Direct32
> * 33: Packed64
> * 34: Packed64
> * 35: Packed64
> * 36: Packed64
> * 37: Packed64
> * 38: Packed64
> * 39: Packed64
> * 40: Packed64
> * 41: Packed64
> * 42: Packed64
> * 43: Packed64
> * 44: Packed64
> * 45: Packed64
> * 46: Packed64
> * 47: Packed64
> * 48: Packed64
> * 49: Packed64
> * 50: Packed64
> * 51: Packed64
> * 52: Packed64
> * 53: Packed64
> * 54: Direct64
> * 55: Direct64
> * 56: Direct64
> * 57: Direct64
> * 58: Direct64
> * 59: Direct64
> * 60: Direct64
> * 61: Direct64
> * 62: Direct64
> Under 32 bits per value, only 13, 17 and 22-26 bits per value would still choose the slower Packed64 implementation.
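(Editor's sketch of the overhead-budget idea behind the selection table above. This is a deliberate simplification — it only considers the Direct* widths and falls back to Packed64, ignoring the Packed64SingleBlock variants — and the class and method names are invented, not Lucene's actual getMutable code.)
{code:java}
public class OverheadSelector {
    // Storage widths of the direct implementations: Direct8/16/32/64.
    private static final int[] DIRECT_WIDTHS = {8, 16, 32, 64};

    // Simplified rule: pick the narrowest Direct* whose wasted bits
    // per value (width - bitsPerValue) fit the overhead budget,
    // otherwise fall back to the packed implementation.
    static String select(int bitsPerValue, float acceptableOverheadPerValue) {
        for (int width : DIRECT_WIDTHS) {
            if (width >= bitsPerValue
                    && width - bitsPerValue <= acceptableOverheadPerValue) {
                return "Direct" + width;
            }
        }
        return "Packed64";
    }

    public static void main(String[] args) {
        // 20% overhead budget, as in the example above.
        System.out.println(select(7, 0.2f * 7));   // 1 wasted bit <= 1.4
        System.out.println(select(21, 0.2f * 21)); // Direct32 wastes 11 > 4.2
        System.out.println(select(62, 0.2f * 62)); // 2 wasted bits <= 12.4
    }
}
{code}
Consistent with the table above, 21 bits per value falls through to a packed implementation at a 20% budget, while 7 and 62 bits get Direct8 and Direct64.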
> Allowing a 50% overhead would prevent the packed implementation from being selected for bits per value under 32. Allowing an overhead of 32 bits per value would make sure that a Direct* implementation is always selected.
> Next steps would be to:
> * make Lucene components use this {{getMutable}} method and let users decide what trade-off suits them better,
> * write a Packed32SingleBlock implementation if necessary (I didn't do it because I have no 32-bit computer to test the performance improvements).
> I think this would allow more fine-grained control over the speed/space trade-off. What do you think?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira