[
https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804723#action_12804723
]
Michael McCandless commented on LUCENE-1990:
--------------------------------------------
Good progress!
bq. I think Michael's generated code was meant as a temporary solution, until a
handcrafted version was available.
Actually that was intended to be a fast impl... the switch should be
compiled to a direct lookup (maybe plus a conditional to catch the
"default" case even though it will never happen...ugh). But I like
your impl with no conditional at all. We should test both.
bq. As to whether to use int or long in the interface unsigned packed int, the
only numbers that will probably need to be long in the foreseeable future are
docids.
Also the file offsets into the terms dict, possibly the offsets in RAM
into the terms dict character data (UTF8 byte[]). Also, when we do
column stride fields, we allow storing values > int. I think we
should stick with {{long get(index)}} for now.
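To illustrate why a {{long}} return type costs little even for narrow storage, here is a minimal sketch. Only {{long get(index)}} comes from the discussion; the interface name, the class names, the {{int}} index type and {{size()}} are assumptions made for the example and are not taken from the patch.

```java
/** Hypothetical reader interface; only "long get(index)" comes from the thread. */
interface PackedReaderSketch {
    long get(int index);   // int index is an assumption of this sketch
    int size();
}

/** Direct 16-bit storage: one value per short, read back as an unsigned value. */
final class DirectShortSketch implements PackedReaderSketch {
    private final short[] values;

    DirectShortSketch(short[] values) {
        this.values = values;
    }

    @Override
    public long get(int index) {
        // Mask with a long literal so the short is widened without sign extension.
        return values[index] & 0xFFFFL;
    }

    @Override
    public int size() {
        return values.length;
    }
}

/** Direct 64-bit storage, for values that do not fit in an int (e.g. file offsets). */
final class DirectLongSketch implements PackedReaderSketch {
    private final long[] values;

    DirectLongSketch(long[] values) {
        this.values = values;
    }

    @Override
    public long get(int index) {
        return values[index];
    }

    @Override
    public int size() {
        return values.length;
    }
}
```

Callers see the same {{long}}-returning method whether the backing array is a short[] or a long[], so values larger than an int need no second interface.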
Other comments:
* Maybe we should move all of this under oal.util.packed?
(packedints? ints?)
* I think we should remove getMaxValue() from the Reader interface?
* Why create the IMPLEMENTATION enum? Why not simply return an
[anonymous] instance of Writer?
* Why not store bitsPerValue in the header instead of maxValue? EG
maybe my maxValue is 7000, but because I'm using directShort,
bitsPerValue is 16. Also, the maxValue at write time should not
have to be known -- eg the factory API should let me ask for a
direct short writer without declaring the maxValue I will store.
* I wonder if we should add an optional Object
getDirectBackingArray(). The packed/aligned impls would return
null, but the direct byte/short/int/long impls would return their
array. This would allow callers to specialize upstream impls to
do the direct array lookup without the cast-to-long (like how
FieldComparator now has impls for byte,short,int,long). I suspect
for column stride fields, when sorting by an integer field, on a
32bit arch, this would be a perf win. But: let's wait until we
have CSFs, and we can test whether there really is a gain here....
* I think we shouldn't put a getWriter on every Reader
impl... because it's a one to many mapping? Eg the format written
by PackedWriter can be read by direct byte/short/int/long,
Packed32/64.
* For starters I don't think we should make reader impls that can
read nbits > 31 bits with an int[] backing array. I think long[]
backing array is fine.
* I don't think we need separate PRIORITY and BLOCK_PREFERENCE?
Can't we have a single enum (STORAGE?) with: packed, aligned32,
aligned64? "Direct" is really just packed with nbits rounded up
to 8,16,32,64.
* Aligned32/64 is very wasteful for certain nbits... I like the idea
of "auto" to avoid risk that caller picks a bad combination.
* I think for starters we should not make any reader impls that do
remapping at load time.
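To make a few of the points above concrete (bitsPerValue vs maxValue in the header, "direct" as packed with nbits rounded up, the waste of aligned64 for some nbits, and a long[] backing array), here is a rough sketch. All class and method names are hypothetical and not taken from the patch, and the get/set below uses a conditional for values that straddle two blocks, so it is not the conditional-free impl discussed earlier.

```java
/** Hypothetical helpers; names and layout are assumptions, not the patch's API. */
final class PackedBitsSketch {
    private PackedBitsSketch() {}

    /** Fewest bits that can hold any value in [0, maxValue]; at least 1. */
    static int bitsRequired(long maxValue) {
        if (maxValue < 0) {
            throw new IllegalArgumentException("maxValue must be >= 0");
        }
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    /** "Direct" storage seen as packed with the bit width rounded up to 8, 16, 32 or 64. */
    static int roundUpToDirect(int bitsPerValue) {
        if (bitsPerValue <= 8) return 8;
        if (bitsPerValue <= 16) return 16;
        if (bitsPerValue <= 32) return 32;
        return 64;
    }

    /** Unused bits per 64-bit block when values may not straddle blocks (aligned64). */
    static int aligned64WastedBits(int bitsPerValue) {
        return 64 % bitsPerValue;
    }

    private static long mask(int bitsPerValue) {
        // 1L << 64 would wrap around to 1L << 0, so handle the full-width case separately.
        return bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    }

    /** Writes value at index into a long[] of tightly packed bitsPerValue-wide slots. */
    static void set(long[] blocks, int bitsPerValue, int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which 64-bit block
        int shift = (int) (bitPos & 63);    // bit offset inside that block
        long mask = mask(bitsPerValue);
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        if (shift + bitsPerValue > 64) {
            // The high bits spill into the next block.
            int spill = 64 - shift;         // bits that fit in the first block
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> spill)) | (v >>> spill);
        }
    }

    /** Reads the value at index; returned as long, as in "long get(index)". */
    static long get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & mask(bitsPerValue);
    }
}
```

In this sketch bitsRequired(7000) is 13 while roundUpToDirect(13) is 16, which is the directShort case above: a reader that only sees maxValue in the header cannot tell which width was actually written, whereas bitsPerValue says it directly. Likewise aligned64WastedBits(33) is 31, i.e. nearly half of every block is unused, which is the kind of bad combination an "auto" setting could steer callers away from.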
> Add unsigned packed int impls in oal.util
> -----------------------------------------
>
> Key: LUCENE-1990
> URL: https://issues.apache.org/jira/browse/LUCENE-1990
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1990-te20100122.patch, LUCENE-1990.patch,
> LUCENE-1990_PerformanceMeasurements20100104.zip
>
>
> There are various places in Lucene that could take advantage of an
> efficient packed unsigned int/long impl. EG the terms dict index in
> the standard codec in LUCENE-1458 could substantially reduce its RAM
> usage. FieldCache.StringIndex could as well. And I think "load into
> RAM" codecs like the one in TestExternalCodecs could use this too.
> I'm picturing something very basic like:
> {code}
> interface PackedUnsignedLongs {
> long get(long index);
> void set(long index, long value);
> }
> {code}
> Plus maybe an iterator for getting and maybe also for setting. If it
> helps, most of the usages of this inside Lucene will be "write once"
> so eg the set could make that an assumption/requirement.
> And a factory somewhere:
> {code}
> PackedUnsignedLongs create(int count, long maxValue);
> {code}
> I think we should simply autogen the code (we can start from the
> autogen code in LUCENE-1410), or, if there is a good existing impl
> that has a compatible license that'd be great.
> I don't have time near-term to do this... so if anyone has the itch,
> please jump!