[ https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13549961#comment-13549961 ]

Michael McCandless commented on LUCENE-4620:
--------------------------------------------

{quote}
bq. Can we use Collections.singletonMap when there are no partitions?

Done. Note though that BytesRef cannot be reused in the case of 
PerDimensionIndexingParams (i.e. multiple CLPs). This is not the common case, 
but it's not trivial to specialize it. Maybe as a second iteration. I did put a 
TODO in FacetFields to allow reuse.
{quote}

Well, we'd somehow need N BytesRefs to reuse (one per CLP) ... but I
don't think we should worry about that now.
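
For illustration only (hypothetical names, not the actual FacetFields code):
the single-CLP, no-partitions case can hand back a singleton map around one
reusable BytesRef, while the per-dimension case needs a BytesRef per CLP:

{code}
// Sketch only -- illustrative names, not the real FacetFields internals.
private final BytesRef reusableBytes = new BytesRef(32);

Map<String,BytesRef> bytesPerCLP(List<String> clpFields) {
  if (clpFields.size() == 1) {
    // common case: one CLP, no partitions -- reuse a single BytesRef
    return Collections.singletonMap(clpFields.get(0), reusableBytes);
  }
  // PerDimensionIndexingParams: one BytesRef per CLP, no reuse (yet)
  Map<String,BytesRef> map = new HashMap<String,BytesRef>();
  for (String field : clpFields) {
    map.put(field, new BytesRef(32));
  }
  return map;
}
{code}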

It is unfortunate that the common case is often held back by the full
flexibility/generality of the facet module ... sometimes I think we
need a facet-light module.  But maybe if we can get the specialization
done we don't need facet-light ...

{quote}
bq. why do we have VInt8.bytesNeeded? Who uses that?

Currently no one uses it, but it was there and I thought it's a convenient 
API to keep: why encode just to see how many bytes were occupied?
Anyway, neither the encoders nor the decoders use it. I have no strong feelings 
about keeping/removing it, so if you feel it should be removed, I can do it.
{quote}

I think we should remove it: it's a dangerous API because it can
encourage consumers to do things like call bytesNeeded first (to know
how much to grow their buffer, say) followed by encoding.  The slow
part of vInt encoding is all those ifs ...
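
For illustration (hypothetical caller code, approximate signatures): the
tempting pattern is to size the buffer up front and then encode, which walks
the same ifs twice per value:

{code}
// Sketch of the anti-pattern bytesNeeded invites -- not real module code.
int needed = 0;
for (int i = 0; i < values.length; i++) {
  needed += VInt8.bytesNeeded(values[i]);  // branches once per value here...
}
bytes.grow(bytes.offset + bytes.length + needed);
for (int i = 0; i < values.length; i++) {
  VInt8.encode(values[i], bytes);          // ...and branches again here
}
// Cheaper to just grow on demand inside encode and pay the ifs once.
{code}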

{quote}
bq. Hmm, it's a little abusive how VInt8.decode changes the offset of the 
incoming BytesRef

It is, but that's the result of Java's lack of pass-by-reference. I.e., decode 
needs to return two values to the caller: the decoded number and how many bytes 
were read.
Notice that in the previous byte[] variant, the method took a class Position, 
which is horrible. That's why I documented in decode() that it advances 
bytes.offset, so the caller can restore it at the end. For instance, IntDecoder 
restores the offset to the original one when it's done.

On LUCENE-4675 Robert gave me an idea to create a BytesRefIterator, and I 
started to play with it. I.e., it would wrap a BytesRef but add 'pos' and 
'upto' indexes.
The user can modify 'pos' freely, without touching bytes.offset. That 
introduces an object allocation though, and since I'd want to reuse that 
object wherever possible, I think I'll look at it after finishing this issue. 
It already contains too many changes.
{quote}

OK.
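
(For reference, a rough sketch of the save/restore pattern described above --
approximate signatures, not the exact patch code:)

{code}
// Sketch only: decode() advances bytes.offset as it reads, so the caller
// saves the original offset and restores it when done, as IntDecoder does.
int originalOffset = bytes.offset;
int upto = bytes.offset + bytes.length;  // end of the valid region
while (bytes.offset < upto) {
  int value = VInt8.decode(bytes);       // advances bytes.offset past the vint
  ints.ints[ints.length++] = value;      // (capacity checks omitted)
}
bytes.offset = originalOffset;           // restore so the BytesRef looks untouched
{code}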

{quote}
bq. I guess this is why you want an upto

No, I wanted upto because iterating up to bytes.length is incorrect. You need 
to iterate up to offset+length. BytesRefIterator.pos and BytesRefIterator.upto 
solve these cases for me.
{quote}

OK.
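
(A tiny, generic illustration of that bound -- the valid region of a BytesRef
is [offset, offset + length), not [0, length):)

{code}
int sum = 0;
for (int i = bytes.offset; i < bytes.offset + bytes.length; i++) {
  sum += bytes.bytes[i] & 0xFF;  // only the bytes the BytesRef actually covers
}
{code}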

{quote}
bq. looks like things got a bit slower (or possibly it's noise)

First, even if it's not noise, the slowdown IMO is worth the code 
simplification.
{quote}

+1

{quote}
But, I do believe that we'll see gains when there are more than 3 integers to 
encode/decode.
In fact, the facets test package has an EncodingSpeed class which measures the 
time it takes to encode/decode a large number of integers (a few thousand). 
When I compared the results to 4x (i.e. without the patch), the decode time 
seemed to be ~5x faster.
{quote}

Good!  Would be nice to have a real-world biggish-number-of-facets
benchmark ... I'll ponder how to do that w/ luceneutil.

bq. In this patch I added an Ant task "run-encoding-benchmark" which runs this 
class. Want to give it a try on your beast machine? For 4x, you can just copy 
the target to lucene/facet/build.xml; I believe it will work without issues.

OK I'll run it!

> Explore IntEncoder/Decoder bulk API
> -----------------------------------
>
>                 Key: LUCENE-4620
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4620
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Shai Erera
>         Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int) 
> and decode(int). Originally, we believed that this layer can be useful for 
> other scenarios, but in practice it's used only for writing/reading the 
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like 
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder 
> can still be streaming (as we don't know in advance how many ints will be 
> written), dunno. Will figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet 
> associations, which can write arbitrary byte[], and so maybe decoding to an 
> IntsRef won't make sense. This too we'll figure out as we go. I don't rule 
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure 
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts 
> etc.) and later read, with as little overhead as possible.
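
(A rough sketch of what such a bulk API could look like -- purely illustrative,
not the committed signatures:)

{code}
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;

// Illustration only: one possible shape for the bulk encode/decode API.
public abstract class IntEncoder {
  /** Encodes all values in the given IntsRef onto the end of bytes. */
  public abstract void encode(IntsRef values, BytesRef bytes);
}

public abstract class IntDecoder {
  /** Decodes the given bytes into values, overwriting its contents. */
  public abstract void decode(BytesRef bytes, IntsRef values);
}
{code}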
