[
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795228#action_12795228
]
Toke Eskildsen commented on LUCENE-1990:
----------------------------------------
I have very recently done some experiments with random-access packed arrays of positive
integers (we could also call them unsigned). The basic approach is to put the bits right
after each other in an int[] or long[] and then extract a value by shifting and masking.
Unfortunately I have not been able to determine a single winning strategy for the
space/speed trade-off.
I've tried four simple implementations:
1. Pack the bits right after each other.
2. Pack the bits right after each other but allow only 1, 2, 4, 8, 16 or 32 as
value-bits.
3. Wrap an int[] or long[] and store the values directly (for comparison with 1
& 2).
4. Use int[] directly without any wrapping.
(Code can be found at
http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/src/dk/statsbiblioteket/summa/common/util/bits/
where #1 is BitsArrayPacked, #2 is BitsArrayAligned and #3 is BitsArrayInt)
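For illustration, here is a minimal sketch of the shift-and-mask idea behind #1. This is
not the actual BitsArrayPacked code, just the underlying technique; the class name and the
LSB-first packing order are my own assumptions.
{code}
// Minimal sketch of approach #1: values packed back-to-back in a long[].
// Assumes 1 <= bitsPerValue <= 63 and LSB-first packing within each block.
public class PackedSketch {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  public PackedSketch(int valueCount, int bitsPerValue) {
    this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
  }

  public long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);      // which long holds the first bit
    int offset = (int) (bitPos & 63);      // bit offset inside that long
    long value = blocks[block] >>> offset;
    if (offset + bitsPerValue > 64) {      // value spans two blocks
      value |= blocks[block + 1] << (64 - offset);
    }
    return value & mask;
  }

  public void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int offset = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(mask << offset)) | ((value & mask) << offset);
    if (offset + bitsPerValue > 64) {      // spill the high bits into the next block
      int spill = offset + bitsPerValue - 64;
      blocks[block + 1] = (blocks[block + 1] & ~((1L << spill) - 1))
          | ((value & mask) >>> (64 - offset));
    }
  }
}
{code}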
The obvious benefit of #2 is that the bits for a value are always contained in a single
block (int or long), which considerably reduces the number of operations for set and get.
Unfortunately it also means that as soon as a value greater than 65535 needs to be stored,
the internal representation requires 32 bits/value. In that case there is no space benefit
compared to #3, while the speed penalty remains. A hybrid approach might be considered,
where the implementation is chosen based on the number of bits needed per value.
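To make the block-aligned idea concrete, here is a rough sketch of how #2 could round the
bit count up and read a value without crossing a block boundary. Again, this is not the
actual BitsArrayAligned code; the helper names are made up. A hybrid factory could simply
use the aligned variant while the rounded bit count stays below 32 and fall back to #1 or
#3 otherwise.
{code}
// Sketch of the block-aligned idea behind #2 (not the actual BitsArrayAligned code).
// The bit count is rounded up to 1, 2, 4, 8, 16 or 32 so a value never spans two
// int blocks; the price is that 17-31 needed bits all become 32 bits/value.
public class AlignedSketch {
  static int alignedBits(int bitsNeeded) {
    for (int candidate : new int[] {1, 2, 4, 8, 16, 32}) {
      if (bitsNeeded <= candidate) {
        return candidate;
      }
    }
    throw new IllegalArgumentException("more than 32 bits/value: " + bitsNeeded);
  }

  // Aligned get: no spanning logic, just one shift and one mask.
  static int get(int[] blocks, int bits, int index) {
    int valuesPerBlock = 32 / bits;
    int block = index / valuesPerBlock;
    int offset = (index % valuesPerBlock) * bits;
    int mask = (bits == 32) ? -1 : (1 << bits) - 1;
    return (blocks[block] >>> offset) & mask;
  }
}
{code}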
I've done some performance tests of the implementations on an aging 1.8GHz
single-core laptop (Dell 820). The code can be found at
http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/test/dk/statsbiblioteket/summa/common/util/bits/BitsArrayPerformanceTest.java?revision=2038&view=markup
In the tables below, the measurements for Packed, Aligned, Int, Constant and int[] are
the total time in milliseconds to perform actionCount actions (either reads or writes).
Random numbers (using the same seed) were used for both index and value, and the overhead
of generating the random numbers was subtracted from the measurements. "Constant" is a
dummy implementation that stores nothing and always returns the same value on get; it is
used to measure method-call overhead.
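The shape of the measurement loop is roughly the following. This is a sketch, not the
actual BitsArrayPerformanceTest code; the BitsArray interface, parameter names and seed
handling here are placeholders.
{code}
import java.util.Random;

// Sketch of the measurement idea: generate index/value pairs from a fixed seed,
// time the writes, and subtract the time spent purely on the random numbers.
public class WriteBenchmarkSketch {
  interface BitsArray {            // stand-in for the implementations under test
    void set(int index, int value);
    int get(int index);
  }

  static long timeWrites(BitsArray array, int actionCount,
                         int arrayLength, int valueMax, long seed) {
    // Pass 1: cost of producing the random numbers alone.
    Random random = new Random(seed);
    long t0 = System.currentTimeMillis();
    for (int i = 0; i < actionCount; i++) {
      random.nextInt(arrayLength);           // index
      random.nextInt(valueMax);              // value in [0, valueMax) - close enough here
    }
    long randomOverhead = System.currentTimeMillis() - t0;

    // Pass 2: same seed, hence the same sequence, now including set().
    random = new Random(seed);
    long t1 = System.currentTimeMillis();
    for (int i = 0; i < actionCount; i++) {
      array.set(random.nextInt(arrayLength), random.nextInt(valueMax));
    }
    return System.currentTimeMillis() - t1 - randomOverhead;
  }
}
{code}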
For arrays of 10M values, 6-33 million values can be read or written per second. If this
is within acceptable limits, I'd be happy to try to make a contribution to Lucene based
on the code. However, I probably won't find the time before February or March 2010.
{code}
actionCount  arrayLength  actionType    valueMax  Packed(#1)  Aligned(#2)  Int(#3)  Constant  int[](#4)
   10000000      1000000       write           1         583          416      984       254        635
   10000000      1000000       write          15         594          499     1172       286        843
   10000000      1000000       write         255         604          478     1057       213        656
   10000000      1000000       write         256         734          765     1062       109        729
   10000000      1000000       write       65535        1036          802     1124       417        734
   10000000      1000000       write       65536        1015         1130     1072       172        781
   10000000      1000000       write     2097151        1020         1187     1052       223        614
   10000000      1000000       write  2147483646        1136         1073      839        73        719
   10000000      1000000        read           1         286          203      786       104          0
   10000000      1000000        read          15         291          182      859        78          0
   10000000      1000000        read         255         672          494      989        67          0
   10000000      1000000        read         256         568          729     1104        93          0
   10000000      1000000        read       65535         833          755     1088        99          0
   10000000      1000000        read       65536         947          974     1062       104          0
   10000000      1000000        read     2097151         999          963     1062        88          0
   10000000      1000000        read  2147483646        1349          869     1260        84          0
   10000000     10000000       write           1         833          568     1458       239       1229
   10000000     10000000       write          15        1427         1255     1432       276       1615
   10000000     10000000       write         255        1599         1578     1448       250       1244
   10000000     10000000       write         256        1656         1520     1317       109       1109
   10000000     10000000       write       65535        1734         1630     1385       245       1307
   10000000     10000000       write       65536        1718         1640     1395       182       1208
   10000000     10000000       write     2097151        1807         1781     1447       250       1291
   10000000     10000000       write  2147483646        1718         1599     1281        73       1099
   10000000     10000000        read           1         562          296     1301       109          0
   10000000     10000000        read          15        1187         1198     1322        83          0
   10000000     10000000        read         255        1421         1432     1526        99          0
   10000000     10000000        read         256        1521         1583     1588       104          0
   10000000     10000000        read       65535        1578         1427     1374        31          0
   10000000     10000000        read       65536        1634         1499     1489       113          0
   10000000     10000000        read     2097151        1625         1374     1468       224          0
   10000000     10000000        read  2147483646        1609         1349     1499        83          0
{code}
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license, that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!