[ 
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549759#comment-13549759
 ] 

Michael McCandless commented on LUCENE-4620:
--------------------------------------------

Thanks Shai, that new patch worked!

This patch looks great!

It's a little disturbing that every doc must make a new
HashMap<String,BytesRef> at indexing time (seems like a lot of
overhead/objects when the common case just needs to return a single
BytesRef, which could be re-used).  Can we use
Collections.singletonMap when there are no partitions?
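To illustrate the suggestion, here is a hedged sketch (the method name `encodeOrdinals`, the `hasPartitions` flag, and the `"$facets"` field are all hypothetical, not the patch's actual API): in the common no-partitions case, `Collections.singletonMap` hands back a tiny immutable wrapper instead of allocating a fresh `HashMap` per document.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SingletonMapSketch {
    // Hypothetical encode step: map each partition's field name to its
    // encoded ordinal bytes. Names here are illustrative, not Lucene's.
    static Map<String, byte[]> encodeOrdinals(boolean hasPartitions,
                                              String field, byte[] payload) {
        if (!hasPartitions) {
            // Common case: one field -> one payload. singletonMap is a
            // small immutable object, far cheaper than a new HashMap
            // with its backing table, allocated for every indexed doc.
            return Collections.singletonMap(field, payload);
        }
        Map<String, byte[]> m = new HashMap<>();
        m.put(field, payload); // ... plus one entry per partition
        return m;
    }

    public static void main(String[] args) {
        Map<String, byte[]> m =
            encodeOrdinals(false, "$facets", new byte[] {1, 2, 3});
        System.out.println(m.size()); // prints 1
    }
}
```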

The decode API (more important than encode) looks like it reuses the
Bytes/IntsRef, so that's good.

Hmm why do we have VInt8.bytesNeeded?  Who uses that?  I think that's
a dangerous API to have .... it's better to simply encode and then see
how many bytes it took.
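The "encode and then see how many bytes it took" alternative can be sketched like this (a generic unsigned varint over a plain byte[], not the actual VInt8 class): the encode call itself returns the byte count, so there is no separate bytesNeeded prediction that can drift out of sync with the encoder.

```java
public class VIntSketch {
    // Hedged sketch: write value as a 7-bits-per-byte varint into dest
    // starting at offset, and return how many bytes were written. The
    // caller learns the length from the act of encoding itself.
    static int encode(int value, byte[] dest, int offset) {
        int pos = offset;
        while ((value & ~0x7F) != 0) {
            dest[pos++] = (byte) ((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        dest[pos++] = (byte) value; // final byte, high bit clear
        return pos - offset;        // bytes actually written
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5];
        System.out.println(encode(5, buf, 0));   // prints 1
        System.out.println(encode(300, buf, 0)); // prints 2
    }
}
```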

Hmm, it's a little abusive how VInt8.decode changes the offset of the
incoming BytesRef ... I guess this is why you want an upto :)
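One way to avoid mutating the caller's BytesRef is exactly that upto cursor; a hedged sketch (with a plain byte[] standing in for BytesRef, decoding the same varint format as above):

```java
public class VIntDecodeSketch {
    // Track the read position in a local "upto" field instead of
    // advancing the incoming buffer's offset, so the caller's view of
    // the bytes is left untouched.
    int upto;

    int decode(byte[] buf, int start) {
        upto = start;
        int value = 0, shift = 0;
        byte b;
        do {
            b = buf[upto++];              // advance our own cursor only
            value |= (b & 0x7F) << shift; // accumulate 7 bits per byte
            shift += 7;
        } while ((b & 0x80) != 0);        // continuation bit set -> more bytes
        return value;
    }

    public static void main(String[] args) {
        VIntDecodeSketch d = new VIntDecodeSketch();
        byte[] buf = new byte[] {(byte) 0xAC, 2}; // varint encoding of 300
        System.out.println(d.decode(buf, 0));     // prints 300
        System.out.println(d.upto);               // prints 2
    }
}
```

After the call, `upto` tells the caller where the next value starts, without the decode having touched the buffer's own offset.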

Net/net this is great progress over what we have today, so +1!

I ran a quick 10M English Wikipedia test w/ just term queries:
{noformat}
Task    QPS base      StdDev    QPS comp      StdDev                Pct diff
               HighTerm       12.79      (2.4%)       12.56      (1.2%)   -1.8% (  -5% -    1%)
                MedTerm       18.04      (1.8%)       17.77      (0.8%)   -1.5% (  -4% -    1%)
                LowTerm       47.69      (1.1%)       47.56      (1.0%)   -0.3% (  -2% -    1%)
{noformat}

The test only has 3 ords per doc so it's not "typical" ... looks like things got a bit slower (or possibly it's noise).
                
> Explore IntEncoder/Decoder bulk API
> -----------------------------------
>
>                 Key: LUCENE-4620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Shai Erera
>         Attachments: LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
> and decode(int). Originally, we believed that this layer can be useful for 
> other scenarios, but in practice it's used only for writing/reading the 
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like 
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
> can still be streaming (as we don't know in advance how many ints will be 
> written), dunno. Will figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet 
> associations, which can write arbitrary byte[], and so maybe decoding to an 
> IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure 
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
> etc.) and later read, with as little overhead as possible.
