[
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549759#comment-13549759
]
Michael McCandless commented on LUCENE-4620:
--------------------------------------------
Thanks Shai, that new patch worked!
This patch looks great!
It's a little disturbing that every doc must make a new
HashMap<String,BytesRef> at indexing time (seems like a lot of
overhead/objects when the common case just needs to return a single
BytesRef, which could be re-used). Can we use
Collections.singletonMap when there are no partitions?
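For illustration, here's a minimal sketch of the singleton-map idea for the
no-partitions case (the class name, field name, and encodeInto step are my
assumptions, not code from the patch):
{code:java}
import java.util.Collections;
import java.util.Map;
import org.apache.lucene.util.BytesRef;

// Sketch only: reuse one BytesRef across documents and wrap it in
// Collections.singletonMap instead of allocating a new
// HashMap<String,BytesRef> per indexed document.
class OrdinalPayloadSketch {
  private final BytesRef reusableBytes = new BytesRef();

  Map<String, BytesRef> encodeForDoc(String field, int[] ordinals) {
    encodeInto(ordinals, reusableBytes);                   // hypothetical bulk-encode step
    return Collections.singletonMap(field, reusableBytes); // no per-doc HashMap
  }

  private void encodeInto(int[] ordinals, BytesRef out) {
    // Placeholder: the actual encoding is whatever IntEncoder the patch plugs in.
    out.length = 0;
  }
}
{code}
Collections.singletonMap returns an immutable one-entry map, so this only
covers the common case where a doc's ordinals go to exactly one field/partition.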
The decode API (more important than encode) looks like it reuses the
Bytes/IntsRef, so that's good.
Hmm, why do we have VInt8.bytesNeeded? Who uses that? I think that's
a dangerous API to have ... it's better to simply encode and then see
how many bytes it took.
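For example, a standalone sketch of the "encode first, then see how many
bytes it took" pattern (this is plain VInt code I'm writing for illustration,
not the actual VInt8 class or its signatures):
{code:java}
import org.apache.lucene.util.BytesRef;

// Sketch only: append a variable-length int to a BytesRef and return how many
// bytes it actually consumed, instead of precomputing the size up front.
final class VIntAppendSketch {
  static int append(int value, BytesRef out) {
    final int start = out.length;
    while ((value & ~0x7F) != 0) {
      ensureCapacity(out, 1);
      out.bytes[out.offset + out.length++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    ensureCapacity(out, 1);
    out.bytes[out.offset + out.length++] = (byte) value;
    return out.length - start; // bytes written for this value
  }

  private static void ensureCapacity(BytesRef ref, int extra) {
    final int needed = ref.offset + ref.length + extra;
    if (needed > ref.bytes.length) {
      final byte[] bigger = new byte[Math.max(needed, ref.bytes.length * 2)];
      System.arraycopy(ref.bytes, 0, bigger, 0, ref.offset + ref.length);
      ref.bytes = bigger;
    }
  }
}
{code}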
Hmm, it's a little abusive how VInt8.decode changes the offset of the
incoming BytesRef ... I guess this is why you want an upto :)
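And a companion decode sketch that tracks a local upto cursor, leaving the
incoming BytesRef's offset untouched (again, illustration-only code, not the
real VInt8.decode):
{code:java}
import java.util.Arrays;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

// Sketch only: decode all variable-length ints from 'in' into 'out' using a
// local cursor, so in.offset is never mutated as a side effect.
final class VIntDecodeSketch {
  static void decodeAll(BytesRef in, IntsRef out) {
    out.offset = 0;
    out.length = 0;
    int upto = in.offset;                   // local cursor instead of bumping in.offset
    final int end = in.offset + in.length;
    while (upto < end) {
      int value = 0;
      int shift = 0;
      byte b;
      do {
        b = in.bytes[upto++];
        value |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      if (out.length == out.ints.length) {  // grow the reused IntsRef if needed
        out.ints = Arrays.copyOf(out.ints, Math.max(4, out.ints.length * 2));
      }
      out.ints[out.length++] = value;
    }
  }
}
{code}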
Net/net this is great progress over what we have today, so +1!
I ran a quick 10M-doc English Wikipedia test w/ just term queries:
{noformat}
    Task    QPS base      StdDev    QPS comp      StdDev    Pct diff
HighTerm       12.79      (2.4%)       12.56      (1.2%)       -1.8% ( -5% -  1%)
 MedTerm       18.04      (1.8%)       17.77      (0.8%)       -1.5% ( -4% -  1%)
 LowTerm       47.69      (1.1%)       47.56      (1.0%)       -0.3% ( -2% -  1%)
{noformat}
The test only has 3 ords per doc so it's not "typical" ... looks like things
got a bit slower (or possibly it's noise).
> Explore IntEncoder/Decoder bulk API
> -----------------------------------
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int)
> and decode(int). Originally, we believed that this layer could be useful for
> other scenarios, but in practice it's used only for writing/reading the
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef) (a rough sketch of
> those signatures appears after this description). Perhaps the Encoder can
> still be streaming (as we don't know in advance how many ints will be
> written), dunno. We'll figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet
> associations, which can write arbitrary byte[], and so decoding to an
> IntsRef may not make sense. This too we'll figure out as we go. I don't rule
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts
> etc.) and later read, with as little overhead as possible.
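A rough sketch of how that bulk interface pair could be shaped (the method
names come from the description above; the class names and javadoc are my
guesses, and the attached patches may define this differently):
{code:java}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

// Sketch only: one possible shape for the bulk encoder/decoder described in
// the issue. Reusing the output refs across calls is the point of the bulk API.
abstract class BulkIntEncoderSketch {
  /** Encodes all values in {@code values} into {@code buf}, reusing buf.bytes where possible. */
  public abstract void encode(IntsRef values, BytesRef buf);
}

abstract class BulkIntDecoderSketch {
  /** Decodes all values from {@code buf} into {@code values}, reusing values.ints where possible. */
  public abstract void decode(BytesRef buf, IntsRef values);
}
{code}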
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]