[jira] Commented: (LUCENE-1990) Add unsigned packed int impls in oal.util

Michael McCandless (JIRA) Thu, 04 Feb 2010 11:26:51 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829715#action_12829715
 ]


Michael McCandless commented on LUCENE-1990:
--------------------------------------------

{quote}
bq. Also, the maxValue at write time should not have to be known - eg the 
factory API should let me ask for a direct short writer without declaring the 
maxValue I will store.

Since Packed and Aligned need maxValue (or bitsPerValue), this would require 
two distinct methods in the factory, each returning a subset of the possible 
implementations. I find that rather confusing.
{quote}

Maybe the caller just always uses the bitsRequired method to get the
required bit width per value?

Though, when we enable specialized storage of negative values as
well, that'll be a hassle...

OK let's leave it as you must pass the maxValue for now.
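
For reference, a minimal sketch of what a bitsRequired helper could look
like. The class name and the exact handling of maxValue == 0 are
assumptions for illustration, not necessarily what the patch does:

```java
// Hypothetical sketch; the class name is an assumption, not the patch's code.
final class BitsRequiredSketch {
  /** Returns how many bits are needed to store any value in [0, maxValue]. */
  static int bitsRequired(long maxValue) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be >= 0, got " + maxValue);
    }
    // 64 - numberOfLeadingZeros is the position of the highest set bit.
    // Use at least 1 bit so that maxValue == 0 still gets a slot (assumption).
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }
}
```

With this shape the caller would compute the width once, eg
bitsRequired(maxValue), and pass it to the factory.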

{quote}
Speaking of API additions, I find that

{code}
public int getBitsPerValue();
public int size();
public void set(long value);
public void clear();
{code}

are trivial to implement for the known implementations. They open up for things 
like auto-growing to fit higher values by using a delegating wrapper, re-using 
the structure for counting purposes and sorting in-place.
{quote}

I think the first 2 make sense, but I'd rather not pursue the last two
at this time.  Ie, I think this API only needs write-once, and then
read-only.

If we open up random writing (set/clear), with auto-growing, etc.,
that does add complexities to the impl.  EG the backing store can no
longer be final, we'd have to do some locking (or mark the array
volatile) for thread safety, etc.

As far as I can tell... Lucene today doesn't yet need random write to
the packed ints.  The terms dict index and CSF are the two needs I
think we have now.  Someday (when CSF supports writing) we will... but
not yet today?
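
To illustrate the write-once, then read-only shape, here is a rough
sketch of a packed reader over 64 bit blocks. All names are hypothetical
and the layout is only one possible choice, not the patch's
implementation. Because every field is final and nothing is mutated
after the constructor returns, instances can be shared across threads
without locking or volatile:

```java
// Hypothetical sketch of a write-once packed structure; not the patch's code.
final class WriteOncePackedSketch {
  private final long[] blocks;      // final: never replaced or grown
  private final int bitsPerValue;
  private final int size;
  private final long mask;

  /** Packs all values up front; there is no set/clear afterwards. */
  WriteOncePackedSketch(long[] values, int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > 64) {
      throw new IllegalArgumentException("bitsPerValue must be in 1..64");
    }
    this.bitsPerValue = bitsPerValue;
    this.size = values.length;
    this.mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    long totalBits = (long) size * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    for (int i = 0; i < size; i++) {
      long v = values[i];
      if ((v & ~mask) != 0) {
        throw new IllegalArgumentException("value does not fit: " + v);
      }
      long bitPos = (long) i * bitsPerValue;
      int block = (int) (bitPos >>> 6);
      int offset = (int) (bitPos & 63);
      blocks[block] |= v << offset;
      if (offset + bitsPerValue > 64) {
        // The value straddles two blocks; store the high bits in the next one.
        blocks[block + 1] |= v >>> (64 - offset);
      }
    }
  }

  long get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int offset = (int) (bitPos & 63);
    long v = blocks[block] >>> offset;
    if (offset + bitsPerValue > 64) {
      v |= blocks[block + 1] << (64 - offset);
    }
    return v & mask;
  }

  int getBitsPerValue() { return bitsPerValue; }

  int size() { return size; }
}
```

Adding set/clear or auto-growing to something like this is exactly where
the final fields would have to go.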

{quote}
bq. I don't think we need separate PRIORITY and BLOCK_PREFERENCE?  Can't we 
have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is 
really just packed with nbits rounded up to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When 
calling getReader the BLOCK_PREFERENCE should also be removed, as the block 
preference will always be the same as that architecture. Removing the "direct" 
option would require the caller to do some of the logic in some cases: If low 
processing requirements are a priority, direct is preferable and when the 
bitsPerValue is calculated, the caller would have to do the if (bitsPerValue > 
32) bitsPerValue = 64 and so on.
{quote}

(There's a bug in the patch in PackedInts.getReader, where it switches
the block size based on whether the JRE is 64 bit: it's always choosing
64 bit now.)

The "direct" option only applies during writing (ie, you round up to
the nearest native type bit width).  At read time it's just a packed
8/16/32/64.

Hmm... maybe we could just add an optional 2nd arg to bitsRequired, a
boolean eg "roundUpToNative" or something, which if true does that
rounding for you?  (And then go back to caller computes bit width and
passes it in?).
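
A rough sketch of how that optional second argument might behave. The
parameter name roundUpToNative comes from the suggestion above; the
class name and everything else is an assumption:

```java
// Hypothetical sketch; not the patch's actual API.
final class RoundUpSketch {
  /**
   * Bits needed for values in [0, maxValue]. If roundUpToNative is true,
   * the result is rounded up to the next native width: 8, 16, 32 or 64.
   */
  static int bitsRequired(long maxValue, boolean roundUpToNative) {
    if (maxValue < 0) {
      throw new IllegalArgumentException("maxValue must be >= 0, got " + maxValue);
    }
    int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    if (!roundUpToNative) {
      return bits;
    }
    if (bits <= 8) return 8;
    if (bits <= 16) return 16;
    if (bits <= 32) return 32;
    return 64;
  }
}
```

A caller that prefers speed over space would pass true and get a
"direct" width; everyone else passes false and gets the tight packed
width.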


> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
>                 Key: LUCENE-1990
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1990
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1990-te20100122.patch, LUCENE-1990.patch, 
> LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl.  EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage.  FieldCache.StringIndex could as well.  And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs  {
>   long get(long index);
>   void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting.  If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
>   PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!

