[
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802888#action_12802888
]
Toke Eskildsen commented on LUCENE-1990:
----------------------------------------
{quote}
Toke:
Introducing yet another level of indirection and making a
byte/short/int/long-provider detached from the implementation of the packed
values is tempting.
{quote}
{quote}
You mean the layer that stores the minValue, so that the full range is
supported? I actually think we should absorb that into packed ints,
so it's only one method call per lookup, and specialize the "positive
only" cases to avoid the extra add per lookup.
{quote}
No, I mean the backing array of ints or longs. For memory, the obvious choice
is int[] or long[], but designing for flexibility calls for an API, which
coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a
DirectInt (which will contain an int[]) or a Persistent (doing mmap or some
other persistence-oriented lookup).
...I should show this in code instead. I'll try and find the time.
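Until then, a minimal sketch of the indirection being described (all class and method names here are invented for illustration, not taken from any patch): the packed-value logic knows only the bit layout, and delegates raw word lookups to a provider interface, so a 4-bit Aligned reader can be backed by an in-memory int[] today or an mmap-based source later.

```java
// Hypothetical sketch: the packed-value implementation is detached from
// the provider of the backing words.
interface WordSource {               // provider of raw 32-bit backing words
    int getWord(int wordIndex);
}

class DirectInt implements WordSource {
    private final int[] words;       // in-memory backing array
    DirectInt(int[] words) { this.words = words; }
    public int getWord(int wordIndex) { return words[wordIndex]; }
}

// 4 bits/value, aligned: 8 values per 32-bit word, no value crosses a word.
class Aligned4 {
    private final WordSource source;
    Aligned4(WordSource source) { this.source = source; }
    long get(long index) {
        int word = source.getWord((int) (index >>> 3)); // index / 8
        int shift = ((int) index & 7) << 2;             // (index % 8) * 4
        return (word >>> shift) & 0xF;
    }
}
```

A Persistent implementation would supply the same WordSource interface over an mmap'ed file, leaving Aligned4 untouched.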
{quote}
Yeah, we do eventually want CSF to be updateable, but I don't think we
need this for phase 1?
{quote}
Not in the specific scenario, no.
{quote}
Toke:
The whole 32bit vs. 64bit as backing array does present a bit of a problem with
persistence. We'll be in a situation where the index will be optimized for the
architecture used for building, not the one used for searching. Leaving the
option of a future mmap open means that it is not possible to do a conversion
when retrieving the bits, so I have no solution for this (other than doing
memory-only).
{quote}
{quote}
I'm confused - a future mmap impl shouldn't put pressure on the file
format used by packed ints today? Ie, a future mmap impl can use a
totally different format than the designed-to-be-slurped-into-RAM
format for packed ints, today?
Also, what do you mean by optimized for building not searching?
{quote}
When we're using Aligned, the choice of int or long for the backing array
dictates the layout of the persistent bitstream. If the index builder is a
64-bit machine and 7 bits/value are used, the result will be longs, each
containing 9 values.
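To make the arithmetic concrete, here is an illustrative aligned 7-bit lookup over a long[] backing array (names are mine, not the patch's API). Each 64-bit long holds floor(64/7) = 9 values, with the top bit of every long unused:

```java
// Illustrative aligned lookup: 7 bits/value, 9 values per 64-bit block,
// so no value ever crosses a block boundary.
class Aligned7Long {
    private final long[] blocks;
    Aligned7Long(long[] blocks) { this.blocks = blocks; }
    long get(long index) {
        long block = blocks[(int) (index / 9)];  // 9 values per long
        int shift = (int) (index % 9) * 7;       // position within the block
        return (block >>> shift) & 0x7F;         // mask out 7 bits
    }
}
```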
When the structure is read into memory by the searcher, the backing array will
again be long. But if the searcher happens to be a 32-bit machine, performance
will be worse than if ints were used for the backing array.
One way to handle this is to do a conversion when loading into memory. If the
searcher is 64 bit, it will always convert into longs. If the searcher is 32
bit, it will convert into ints if bits/value is <= 32. The conversion is
cheap, so this is no problem in itself.
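A sketch of what that load-time conversion might look like for the 7-bit case (illustrative code, not from any patch): re-pack an aligned stream built as longs (9 values/long) into ints (4 values/int, since floor(32/7) = 4), in one pass over the values.

```java
// Hypothetical load-time re-pack of an aligned 7-bit stream from the
// builder's long layout into an int layout for a 32-bit searcher.
class Repack7 {
    // reads value i from the long layout (9 values per 64-bit block)
    static long getFromLongs(long[] blocks, int index) {
        return (blocks[index / 9] >>> ((index % 9) * 7)) & 0x7F;
    }

    // writes all values into the int layout (4 values per 32-bit block)
    static int[] toIntLayout(long[] blocks, int valueCount) {
        int[] out = new int[(valueCount + 3) / 4];
        for (int i = 0; i < valueCount; i++) {
            out[i / 4] |= (int) getFromLongs(blocks, i) << ((i % 4) * 7);
        }
        return out;
    }
}
```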
However, if we're planning for the future (or for flexibility, depending on
point of view), we would very much like the persistent format to be directly
usable, so that we don't need to load the whole structure into memory. This
rules out conversion and sets us back to step 1: The index will be optimized
for either 32bit or 64 bit searchers.
Oh well, we could always just ignore it and say that Aligned is 64-bit based.
As the long-backed layout is more memory-efficient than an int-backed Aligned,
maybe the slightly smaller number of backing longs will compensate for the
overhead of retrieving longs on a 32-bit machine.
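The memory-efficiency claim is simple arithmetic: with aligned packing, a backing block of w bits holds floor(w/b) values of b bits each and wastes the remainder. A quick illustrative check:

```java
// Wasted bits per backing block under aligned packing: a w-bit block
// holds floor(w/b) values, leaving w - b*floor(w/b) bits unused.
class AlignedWaste {
    static int wastedBitsPerBlock(int blockBits, int bitsPerValue) {
        return blockBits - bitsPerValue * (blockBits / bitsPerValue);
    }
}
```

For 7 bits/value, a long wastes 1 bit per 64; two ints waste 4 bits each, i.e. 8 bits per 64, so the long-backed layout is the denser of the two.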
{quote}
Note that on 32 bit machines, if there is actually a gain, we can make
a backing store with ints yet still allow for storage of nbits>32? It
"just" means a value may be split across 2 or 3 ints?
{quote}
My guess is that the number of values vs. the available level 2 cache will play
a big role here: For a relatively small number of values, the added logic will
be costly. For a larger number of values, the cache-misses will dwarf that.
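For reference, the "added logic" in question looks roughly like this (an illustrative sketch, not anyone's actual implementation): an int[] backing store that supports nbits > 32 by gathering a value piecewise from the two or three ints it straddles.

```java
// Hypothetical int-backed packed reader supporting bitsPerValue > 32.
// A value may straddle 2 or 3 backing ints; the loop gathers it
// piece by piece, LSB-first.
class PackedIntBacked {
    private final int[] blocks;
    private final int bitsPerValue;          // may exceed 32
    PackedIntBacked(int[] blocks, int bitsPerValue) {
        this.blocks = blocks;
        this.bitsPerValue = bitsPerValue;
    }
    long get(long index) {
        long bitPos = index * bitsPerValue;  // absolute bit position
        long result = 0;
        int gathered = 0;
        while (gathered < bitsPerValue) {
            int block = blocks[(int) (bitPos >>> 5)];   // bitPos / 32
            int offset = (int) bitPos & 31;             // bitPos % 32
            int take = Math.min(32 - offset, bitsPerValue - gathered);
            long bits = (block >>> offset) & ((1L << take) - 1);
            result |= bits << gathered;
            gathered += take;
            bitPos += take;
        }
        return result;
    }
}
```

The per-lookup branch and extra shifts are the cost Toke expects to matter for small value counts and to be dwarfed by cache misses for large ones.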
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1990.patch,
> LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license, that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!
--
This message is automatically generated by JIRA.