[
https://issues.apache.org/jira/browse/LUCENE-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552573#comment-13552573
]
Michael McCandless commented on LUCENE-4620:
--------------------------------------------
This change seemed to lose a bit of performance: look at 1/11/2013 on
http://people.apache.org/~mikemccand/lucenebench/TermDateFacets.html
But that tests just one dimension (Date), with only 3 ords per doc,
so I had assumed there simply weren't enough ints being decoded to
see the gains from this bulk decoding.
So, I modified luceneutil to have more facets per doc (avg ~25 ords
per doc across 9 dimensions; 2.5M unique ords), and the results are
still slower:
{noformat}
            Task    QPS base  StdDev    QPS comp  StdDev      Pct diff
        HighTerm        3.62   (2.5%)       3.24   (1.0%)  -10.5% ( -13% -   -7%)
         MedTerm        7.34   (1.7%)       6.78   (0.9%)   -7.6% ( -10% -   -5%)
         LowTerm       14.92   (1.6%)      14.32   (1.2%)   -4.0% (  -6% -   -1%)
        PKLookup      181.47   (4.7%)     183.04   (5.3%)    0.9% (  -8% -   11%)
{noformat}
This is baffling ... not sure what's up. I would expect some gains
given that the micro-benchmark showed sizable decode improvements. It
must somehow be that decode cost is a minor part of facet counting?
(which is not a good sign: it should be a big part of it...)
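For reference, the bulk decoding being benchmarked amounts to something like the following sketch (illustrative only, not the actual patch; plain int[]/byte[] arrays stand in for Lucene's IntsRef/BytesRef, and the class/method names are made up):

```java
// A minimal sketch of bulk VInt encode/decode: pack an int[] into a
// byte[] in one call, and decode it back in one call, instead of
// streaming one int at a time through encode(int)/decode(int).
public class BulkVInt {

    // Encode values[0..count) as variable-length ints (7 bits per byte,
    // high bit set on all but the last byte); returns bytes written.
    public static int encode(int[] values, int count, byte[] out) {
        int upto = 0;
        for (int i = 0; i < count; i++) {
            int v = values[i];
            while ((v & ~0x7F) != 0) {
                out[upto++] = (byte) ((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out[upto++] = (byte) v;
        }
        return upto;
    }

    // Decode length bytes back into out[]; returns ints decoded.
    public static int decode(byte[] bytes, int length, int[] out) {
        int upto = 0, count = 0;
        while (upto < length) {
            byte b = bytes[upto++];
            int v = b & 0x7F;
            int shift = 7;
            while ((b & 0x80) != 0) {
                b = bytes[upto++];
                v |= (b & 0x7F) << shift;
                shift += 7;
            }
            out[count++] = v;
        }
        return count;
    }
}
```

The hoped-for win is that the inner loops stay tight over the whole block, with no per-int virtual call or bounds bookkeeping between ints.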
> Explore IntEncoder/Decoder bulk API
> -----------------------------------
>
> Key: LUCENE-4620
> URL: https://issues.apache.org/jira/browse/LUCENE-4620
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Shai Erera
> Assignee: Shai Erera
> Fix For: 4.1, 5.0
>
> Attachments: LUCENE-4620.patch, LUCENE-4620.patch, LUCENE-4620.patch
>
>
> Today, IntEncoder/Decoder offer a streaming API, where you can encode(int)
> and decode(int). Originally, we believed that this layer could be useful for
> other scenarios, but in practice it's used only for writing/reading the
> category ordinals from payload/DV.
> Therefore, Mike and I would like to explore a bulk API, something like
> encode(IntsRef, BytesRef) and decode(BytesRef, IntsRef). Perhaps the Encoder
> can still be streaming (as we don't know in advance how many ints will be
> written), dunno. Will figure this out as we go.
> One thing to check is whether the bulk API can work w/ e.g. facet
> associations, which can write arbitrary byte[], and so decoding to an
> IntsRef may not make sense. This too we'll figure out as we go. I don't rule
> out that associations will use a different bulk API.
> At the end of the day, the requirement is for someone to be able to configure
> how ordinals are written (i.e. different encoding schemes: VInt, PackedInts
> etc.) and later read, with as little overhead as possible.
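The bulk API proposed above might look roughly like the following sketch (an illustration under assumptions, not the actual patch: the IntsRef/BytesRef classes here are simplified stand-ins for Lucene's, and a trivial fixed-width 4-bytes-per-int scheme stands in for VInt/PackedInts):

```java
// Simplified stand-ins for Lucene's IntsRef/BytesRef (slice views).
class IntsRef { int[] ints = new int[0]; int offset, length; }
class BytesRef { byte[] bytes = new byte[0]; int offset, length; }

// Sketch of the proposed bulk API shape: one call encodes all of a
// document's ordinals, one call decodes them back, so a concrete
// scheme can process the whole block without per-int overhead.
class SimpleIntCodec {

    // Encode values.ints[offset..offset+length) into buf (big-endian,
    // 4 bytes per int); sets buf.offset/length.
    void encode(IntsRef values, BytesRef buf) {
        buf.bytes = new byte[values.length * 4];
        buf.offset = 0;
        int upto = 0;
        for (int i = 0; i < values.length; i++) {
            int v = values.ints[values.offset + i];
            buf.bytes[upto++] = (byte) (v >>> 24);
            buf.bytes[upto++] = (byte) (v >>> 16);
            buf.bytes[upto++] = (byte) (v >>> 8);
            buf.bytes[upto++] = (byte) v;
        }
        buf.length = upto;
    }

    // Decode buf.bytes[offset..offset+length) into values; sets
    // values.offset/length.
    void decode(BytesRef buf, IntsRef values) {
        int count = buf.length / 4;
        values.ints = new int[count];
        values.offset = 0;
        values.length = count;
        for (int i = 0; i < count; i++) {
            int p = buf.offset + i * 4;
            values.ints[i] = ((buf.bytes[p] & 0xFF) << 24)
                           | ((buf.bytes[p + 1] & 0xFF) << 16)
                           | ((buf.bytes[p + 2] & 0xFF) << 8)
                           |  (buf.bytes[p + 3] & 0xFF);
        }
    }
}
```

Pluggable encoders (VInt, PackedInts, etc.) would then only need to implement these two block-level methods.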
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]