[
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802888#action_12802888
]
Toke Eskildsen commented on LUCENE-1990:
----------------------------------------
{quote}
Toke:
Introducing yet another level of indirection and making a
byte/short/int/long-provider detached from the implementation of the packed
values is tempting.
{quote}
{quote}
You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.
{quote}
No, I mean the backing array of ints or longs. For memory, the obvious choice
is int[] or long[], but designing for flexibility calls for an API, which
coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a
DirectInt (which will contain an int[]) or a Persistent (doing mmap or some
other persistence-oriented lookup).
...I should show this in code instead. I'll try and find the time.
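Until then, a minimal sketch of the indirection being described (all class and method names here are invented for illustration, not taken from any patch): the packed-value logic knows only the bit layout, and delegates raw word lookups to a provider interface, so a 4-bit Aligned reader can be backed by an in-memory int[] today or an mmap-based source later.

```java
// Hypothetical sketch: the packed-value implementation is detached from
// the provider of the backing words.
interface WordSource {               // provider of raw 32-bit backing words
    int getWord(int wordIndex);
}

class DirectInt implements WordSource {
    private final int[] words;       // in-memory backing array
    DirectInt(int[] words) { this.words = words; }
    public int getWord(int wordIndex) { return words[wordIndex]; }
}

// 4 bits/value, aligned: 8 values per 32-bit word, no value crosses a word.
class Aligned4 {
    private final WordSource source;
    Aligned4(WordSource source) { this.source = source; }
    long get(long index) {
        int word = source.getWord((int) (index >>> 3)); // index / 8
        int shift = ((int) index & 7) << 2;             // (index % 8) * 4
        return (word >>> shift) & 0xF;
    }
}
```

A Persistent implementation would supply the same WordSource interface over an mmap'ed file, leaving Aligned4 untouched.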
{quote}
Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1?
{quote}
Not in the specific scenario, no.
{quote}
Toke:
The whole 32bit vs. 64bit as backing array does present a bit of a problem with
persistence. We'll be in a situation where the index will be optimized for the
architecture used for building, not the one used for searching. Leaving the
option of a future mmap open means that it is not possible to do a conversion
when retrieving the bits, so I have no solution for this (other than doing
memory-only).
{quote}
{quote}
I'm confused - a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?
Also, what do you mean by optimized for building not searching?
{quote}
When we're using Aligned, the choice of int or long for the backing array
dictates the layout of the persistent bitstream. If the index builder is a
64-bit machine and 7 bits/value are used, the result will be longs, each
containing 9 values.
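To make the arithmetic concrete, here is an illustrative aligned 7-bit lookup over a long[] backing array (names are mine, not the patch's API). Each 64-bit long holds floor(64/7) = 9 values, with the top bit of every long unused:

```java
// Illustrative aligned lookup: 7 bits/value, 9 values per 64-bit block,
// so no value ever crosses a block boundary.
class Aligned7Long {
    private final long[] blocks;
    Aligned7Long(long[] blocks) { this.blocks = blocks; }
    long get(long index) {
        long block = blocks[(int) (index / 9)];  // 9 values per long
        int shift = (int) (index % 9) * 7;       // position within the block
        return (block >>> shift) & 0x7F;         // mask out 7 bits
    }
}
```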
When the structure is read into memory by the searcher, the backing array will
again be long. But if the searcher happens to be a 32-bit machine, performance
will be worse than if ints were used for the backing array.
One way to handle this is to do a conversion when loading into memory. If the
searcher is 64 bit, it will always convert into longs. If the searcher is 32
bit, it will convert into ints if bits/value is <= 32. The conversion is
cheap, so this is no problem in itself.
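A sketch of what that load-time conversion might look like for the 7-bit case (illustrative code, not from any patch): re-pack an aligned stream built as longs (9 values/long) into ints (4 values/int, since floor(32/7) = 4), in one pass over the values.

```java
// Hypothetical load-time re-pack of an aligned 7-bit stream from the
// builder's long layout into an int layout for a 32-bit searcher.
class Repack7 {
    // reads value i from the long layout (9 values per 64-bit block)
    static long getFromLongs(long[] blocks, int index) {
        return (blocks[index / 9] >>> ((index % 9) * 7)) & 0x7F;
    }

    // writes all values into the int layout (4 values per 32-bit block)
    static int[] toIntLayout(long[] blocks, int valueCount) {
        int[] out = new int[(valueCount + 3) / 4];
        for (int i = 0; i < valueCount; i++) {
            out[i / 4] |= (int) getFromLongs(blocks, i) << ((i % 4) * 7);
        }
        return out;
    }
}
```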
However, if we're planning for the future (or for flexibility, depending on
point of view), we would very much like the persistent format to be directly
usable, so that we don't need to load the whole structure into memory. This
rules out conversion and sets us back to step 1: The index will be optimized
for either 32bit or 64 bit searchers.
Oh well, we could always just ignore it and say that Aligned is 64-bit based.
As the long-backed layout is more memory-efficient than an int-backed Aligned,
maybe the slightly smaller number of backing longs will compensate for the
overhead of retrieving longs on a 32-bit machine.
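The memory-efficiency claim is simple arithmetic: with aligned packing, a backing block of w bits holds floor(w/b) values of b bits each and wastes the remainder. A quick illustrative check:

```java
// Wasted bits per backing block under aligned packing: a w-bit block
// holds floor(w/b) values, leaving w - b*floor(w/b) bits unused.
class AlignedWaste {
    static int wastedBitsPerBlock(int blockBits, int bitsPerValue) {
        return blockBits - bitsPerValue * (blockBits / bitsPerValue);
    }
}
```

For 7 bits/value, a long wastes 1 bit per 64; two ints waste 4 bits each, i.e. 8 bits per 64, so the long-backed layout is the denser of the two.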
{quote}
Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 ints?
{quote}
My guess is that the number of values vs. the available level 2 cache will play
a big role here: For a relatively small number of values, the added logic will
be costly. For a larger number of values, the cache-misses will dwarf that.
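For reference, the "added logic" in question looks roughly like this (an illustrative sketch, not anyone's actual implementation): an int[] backing store that supports nbits > 32 by gathering a value piecewise from the two or three ints it straddles.

```java
// Hypothetical int-backed packed reader supporting bitsPerValue > 32.
// A value may straddle 2 or 3 backing ints; the loop gathers it
// piece by piece, LSB-first.
class PackedIntBacked {
    private final int[] blocks;
    private final int bitsPerValue;          // may exceed 32
    PackedIntBacked(int[] blocks, int bitsPerValue) {
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
    }
    long get(long index) {
        long bitPos = index * bitsPerValue;  // absolute bit position
        long result = 0;
        int gathered = 0;
        while (gathered < bitsPerValue) {
            int block = blocks[(int) (bitPos >>> 5)];   // bitPos / 32
            int offset = (int) bitPos & 31;             // bitPos % 32
            int take = Math.min(32 - offset, bitsPerValue - gathered);
            long bits = (block >>> offset) & ((1L << take) - 1);
            result |= bits << gathered;
            gathered += take;
            bitPos += take;
        }
        return result;
    }
}
```

The per-lookup branch and extra shifts are the cost Toke expects to matter for small value counts and to be dwarfed by cache misses for large ones.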
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1990.patch,
> LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license, that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
--
This message is automatically generated by JIRA.