[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

Michael Busch (JIRA) Sun, 03 Jan 2010 02:39:19 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795963#action_12795963
 ]


Michael Busch commented on LUCENE-2186:
---------------------------------------

Great to see progress here, Mike!

{quote}
String fields are stored as the UTF8 byte[]. This patch adds a
BytesRef, which does the same thing as flex's TermRef (we should merge
them).
{quote}

It looks like BytesRef is very similar to Payload? Could you use Payload instead
and extend it with the new String constructor and compare methods?
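
For illustration, here is a minimal sketch of what a byte[] slice with a String constructor and a compare method could look like. The class name ByteSlice and all of its members are hypothetical; this is neither the BytesRef from the patch nor Lucene's Payload class, and it assumes bytes are compared as unsigned values so that the order matches UTF-8 byte order.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical slice of a (possibly shared) byte[]; not Lucene's BytesRef or Payload. */
final class ByteSlice implements Comparable<ByteSlice> {
    final byte[] bytes;
    final int offset;
    final int length;

    /** Wraps an existing array without copying it. */
    ByteSlice(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    private ByteSlice(byte[] bytes) {
        this(bytes, 0, bytes.length);
    }

    /** Convenience constructor: stores the String as its UTF-8 bytes. */
    ByteSlice(String text) {
        this(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Compares byte by byte, treating each byte as unsigned. */
    @Override
    public int compareTo(ByteSlice other) {
        int n = Math.min(length, other.length);
        for (int i = 0; i < n; i++) {
            int a = bytes[offset + i] & 0xFF;
            int b = other.bytes[other.offset + i] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal: the shorter slice sorts first.
        return length - other.length;
    }

    /** Decodes the slice back into a String. */
    String utf8ToString() {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }
}
```

Because the slice only records an offset and a length, many values can share one large byte[] without allocating an object per value until a slice is actually needed.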

{quote}
It handles 3 types of values:
{quote}

So it looks like your approach supports certain "primitive" types out
of the box, such as byte[], float, int, and String? And someone with
custom data types would have to go through the byte[] indirection, as
with payloads today?

The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work
similarly? 
Thinking out loud: Could we then have attributes with
serialize/deserialize methods for primitive types, such as float?
Could we efficiently use such an approach all the way up to
FieldCache? It would be compelling if you could store an attribute as
CSF, or in the posting list, retrieve it from the flex APIs, and also
from the FieldCache. All would be the same API and there would only be
one place that needs to "know" about the encoding (the attribute).
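
A rough sketch of that idea, under stated assumptions: the interface SerializableAttribute and the class FloatBoostAttribute are hypothetical names, and java.io's DataOutput/DataInput stand in for Lucene's IndexOutput/IndexInput so that the example is self-contained. It is not an existing Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical contract: only the attribute knows how it is encoded. */
interface SerializableAttribute {
    void serialize(DataOutput out) throws IOException;
    void deserialize(DataInput in) throws IOException;
}

/** Hypothetical attribute holding a single float, written without a byte[] detour. */
final class FloatBoostAttribute implements SerializableAttribute {
    float boost;

    @Override
    public void serialize(DataOutput out) throws IOException {
        out.writeFloat(boost); // write*() is called directly on the output
    }

    @Override
    public void deserialize(DataInput in) throws IOException {
        boost = in.readFloat();
    }
}

final class AttributeRoundTrip {
    /** Writes the value through one attribute and reads it back through another. */
    static float roundTrip(float value) {
        try {
            FloatBoostAttribute written = new FloatBoostAttribute();
            written.boost = value;
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            written.serialize(new DataOutputStream(buffer));

            FloatBoostAttribute read = new FloatBoostAttribute();
            read.deserialize(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
            return read.boost;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a contract like this, the CSF writer, the posting list writer, and a FieldCache loader would each only hand the attribute an output or input and would never need to know that the value is a float.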

{quote}
Next step is to do basic integration with Lucene, and then compare
sort performance of this vs field cache.
{quote}

Yeah, that's where I got kind of stuck with 1231: We need to figure
out what the public API should look like, through which a user can add
CSF values to the index and retrieve them. The easiest and fastest way
would be to add a dedicated new API. The cleaner one would be to make the whole
Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.

{quote}
There are abstract Writer/Reader classes. The current reader impls
are entirely RAM resident (like field cache), but the API is (I think)
agnostic, ie, one could make an MMAP impl instead.

I think this is the first baby step towards LUCENE-1231. Ie, it
cannot yet update values, and the reading API is fully random-access
by docID (like field cache), not like a posting list, though I
do think we should add an iterator() api (to return flex's DocsEnum)
{quote}

Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?
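
To make the two-reader idea concrete, here is a simplified sketch. The names VarLenColumn and ForwardReader, the one-byte length prefix, and the in-memory byte[] standing in for a file are all assumptions for illustration. The skip list itself is left out: the forward reader simply hops from one length prefix to the next, which is the step a skip list would shorten.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical var-sized column: length-prefixed values in docID order. */
final class VarLenColumn {
    final byte[] data;   // the values themselves (would live on disk)
    final int[] offsets; // per-doc pointers, needed only for random access

    private VarLenColumn(byte[] data, int[] offsets) {
        this.data = data;
        this.offsets = offsets;
    }

    static VarLenColumn write(String[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int[] offsets = new int[values.length];
        for (int doc = 0; doc < values.length; doc++) {
            byte[] utf8 = values[doc].getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 255) {
                throw new IllegalArgumentException("value too long for a 1-byte length prefix");
            }
            offsets[doc] = out.size();
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return new VarLenColumn(out.toByteArray(), offsets);
    }

    static String decode(byte[] data, int entryStart) {
        int len = data[entryStart] & 0xFF;
        return new String(data, entryStart + 1, len, StandardCharsets.UTF_8);
    }

    /** Reader 1: random access by docID, requires the pointer table in RAM. */
    String get(int docID) {
        return decode(data, offsets[docID]);
    }

    /** Reader 2: forward-only access that never touches the pointer table. */
    ForwardReader iterator() {
        return new ForwardReader(data);
    }
}

final class ForwardReader {
    private final byte[] data;
    private int pos = 0;      // start of the next unread entry
    private int current = -1; // start of the current entry
    private int doc = -1;

    ForwardReader(byte[] data) {
        this.data = data;
    }

    /** Moves forward to target; returns false if the column ends first. */
    boolean advance(int target) {
        while (doc < target) {
            if (pos >= data.length) {
                return false;
            }
            current = pos;
            pos += 1 + (data[pos] & 0xFF); // hop over length prefix and value
            doc++;
        }
        return true;
    }

    boolean next() {
        return advance(doc + 1);
    }

    int docID() {
        return doc;
    }

    String value() {
        return VarLenColumn.decode(data, current);
    }
}
```

The random-access reader pays for one int per document in RAM, while the forward reader holds only a position, at the cost of reading through every entry between the current document and the target.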

About updating CSF: I hope we can use parallel indexing for that. In
other words: It should be possible for users to use parallel indexes
to update certain fields, and Lucene should use the same approach
internally to store different "generations" of things like norms and CSFs.

> First cut at column-stride fields (index values storage)
> --------------------------------------------------------
>
>                 Key: LUCENE-2186
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2186
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>         Attachments: LUCENE-2186.patch
>
>
> I created an initial basic impl for storing "index values" (ie
> column-stride value storage).  This is still a work in progress... but
> the approach looks compelling.  I'm posting my current status/patch
> here to get feedback/iterate, etc.
> The code is standalone now, and lives under new package
> oal.index.values (plus some util changes, refactorings) -- I have yet
> to integrate into Lucene so eg you can mark that a given Field's value
> should be stored into the index values, sorting will use these values
> instead of field cache, etc.
> It handles 3 types of values:
>   * Six variants of byte[] per doc, all combinations of fixed vs
>     variable length, and stored either "straight" (good for eg a
>     "title" field), "deref" (good when many docs share the same value,
>     but you won't do any sorting) or "sorted".
>   * Integers (variable bit precision used as necessary, ie this can
>     store byte/short/int/long, and all precisions in between)
>   * Floats (4 or 8 byte precision)
> String fields are stored as the UTF8 byte[].  This patch adds a
> BytesRef, which does the same thing as flex's TermRef (we should merge
> them).
> This patch also adds basic initial impl of PackedInts (LUCENE-1990);
> we can swap that out if/when we get a better impl.
> This storage is dense (like field cache), so it's appropriate when the
> field occurs in all/most docs.  It's just like field cache, except the
> reading API is a get() method invocation, per document.
> Next step is to do basic integration with Lucene, and then compare
> sort performance of this vs field cache.
> For the "sort by String value" case, I think RAM usage & GC load of
> this index values API should be much better than field cache, since
> it does not create object per document (instead shares big long[] and
> byte[] across all docs), and because the values are stored in RAM as
> their UTF8 bytes.
> There are abstract Writer/Reader classes.  The current reader impls
> are entirely RAM resident (like field cache), but the API is (I think)
> agnostic, ie, one could make an MMAP impl instead.
> I think this is the first baby step towards LUCENE-1231.  Ie, it
> cannot yet update values, and the reading API is fully random-access
> by docID (like field cache), not like a posting list, though I
> do think we should add an iterator() api (to return flex's DocsEnum)
> -- eg I think this would be a good way to track avg doc/field length
> for BM25/lnu.ltc scoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
