[jira] [Commented] (LUCENE-7407) Explore switching doc values to an iterator API

Michael McCandless (JIRA) Fri, 21 Oct 2016 05:42:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594967#comment-15594967
 ]


Michael McCandless commented on LUCENE-7407:
--------------------------------------------

bq. At first blush, it doesn't look like the lucenebench tests cover sorting 
and faceting that well.

bq. For example, I tested function queries (ValueSource) and sorting by 
multiple docvalue fields. Are either of these things tested at all in 
https://home.apache.org/~mikemccand/lucenebench/ ?

bq. Running diverse fields in the same JVM run is esp important to prevent 
hotspot from over-optimizing for a single field cardinality (since different 
cardinalities have different docvalues encodings).

{quote}
How many different numeric fields are concurrently sorted on for 
https://home.apache.org/~mikemccand/lucenebench/ ?
The names suggest just one: " TermQuery (date/time sort)"
If that is actually the case, then you're in danger of hotspot 
over-specializing for that single field/cardinality.
{quote}

These are all good points, all things that I would like to improve
about Lucene's nightly benchmarks
(https://home.apache.org/~mikemccand/lucenebench/).  Patches welcome ;)

I'll try to add some low cardinality faceting/sorting coverage, maybe
using month name and day-of-the-year from the last modified date.

The nightly Wikipedia benchmark facets on the Date field as a hierarchy
(year/month/day), and sorts on "last modified" (seconds resolution, I
think) and title.

I've also long wanted to add highlighters...

bq. A quick test by hand is still more informative than having no information 
at all.

I disagree: it's reckless to run an overly synthetic benchmark and
then present the results as if they mean we should make poor API
tradeoffs.

bq. If one is measuring performance of a faceting change, then isolate it.

In the ideal world, yes, but this is notoriously problematic to do
with Java: hotspot, GC, etc. will all behave very differently if you
are testing a very narrow part of the code.

{quote}
That's an unnecessary personal dig.
I've already put in a lot of effort into benchmarking this, only to have it 
dismissed with hand waves, for cases that may not even be covered (or may be 
understated) by your own benchmarks.
I fully intend to dig into the solr side, but I was waiting until the API 
stabilizes (LUCENE-7462)
I pointed at specific examples that reside entirely in lucene code (the sorting 
examples)
{quote}

My point is that running synthetic benchmarks and misrepresenting
them as "meaningful" is borderline reckless, and certainly nowhere
near as helpful as, say, improving our default codec, profiling and
removing slow spots, removing extra legacy wrappers, etc.  Those are
more positive ways to move our project forward.

Perhaps you feel you have put in a lot of effort here, but from where
I stand I see lots of complaining about how things got slower and little
effort to actually improve the sources.  This issue alone was a
tremendous amount of slogging for me, and I had to switch Solr over
without fully understanding its sources: you or other Solr experts
could have stepped in to help me then.

But why not do that now?  I.e. review my Solr changes or function
queries, etc.?  I could easily have done something silly: it was just
a "rote" cutover to the iterator API.

I think we could nicely optimize the browse-only case, by just using
{{nextDoc}} to step through all doc values for a given field.  Does Solr
do that today?
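
To make the browse-only idea concrete, here is a minimal sketch.  It
does not use the real Lucene classes: {{SimpleNumericDocValues}},
{{ArrayNumericDocValues}} and {{BrowseOnlyFacets}} are hypothetical
stand-ins invented for illustration, and only the {{nextDoc}} /
{{NO_MORE_DOCS}} convention is borrowed from the iterator API.  The
point is just that counting over every document needs no per-document
lookup, only a single forward pass:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical, simplified stand-in for an iterator-style numeric doc values API. */
interface SimpleNumericDocValues {
  /** Sentinel returned by nextDoc() once the iterator is exhausted. */
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Advances to the next document that has a value, or returns NO_MORE_DOCS. */
  int nextDoc();

  /** Value for the current document; only valid after a successful nextDoc(). */
  long longValue();
}

/** Toy in-memory implementation backed by parallel arrays (doc IDs must be ascending). */
final class ArrayNumericDocValues implements SimpleNumericDocValues {
  private final int[] docs;
  private final long[] values;
  private int idx = -1;

  ArrayNumericDocValues(int[] docs, long[] values) {
    if (docs.length != values.length) {
      throw new IllegalArgumentException("docs and values must have the same length");
    }
    this.docs = docs;
    this.values = values;
  }

  @Override
  public int nextDoc() {
    if (idx + 1 >= docs.length) {
      idx = docs.length; // stay exhausted on repeated calls
      return NO_MORE_DOCS;
    }
    idx++;
    return docs[idx];
  }

  @Override
  public long longValue() {
    return values[idx];
  }
}

/** Browse-only counting: one forward pass over every document that has a value. */
final class BrowseOnlyFacets {
  static Map<Long, Integer> countAll(SimpleNumericDocValues dv) {
    Map<Long, Integer> counts = new TreeMap<>();
    while (dv.nextDoc() != SimpleNumericDocValues.NO_MORE_DOCS) {
      counts.merge(dv.longValue(), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Docs 0, 3, 4 and 9 have a value; the other docs are simply never visited.
    SimpleNumericDocValues month =
        new ArrayNumericDocValues(new int[] {0, 3, 4, 9}, new long[] {1L, 2L, 1L, 1L});
    System.out.println(countAll(month));
  }
}
```

Documents without a value never show up in the loop, so no separate
"docs with field" check is needed in this sketch.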

Why not test the patch on LUCENE-7462 to see if that API change helps?

I am not disagreeing that DV access got slower: the Lucene nightly
benchmarks also show that.

Yet look at sort-by-title: at first it got slower, on initial cutover
to iterators, but then thanks to [~jpountz] (thank you Adrien!), it's
now faster than it was before:
https://home.apache.org/~mikemccand/lucenebench/TermTitleSort.html

With more iterations I expect we can do the same thing for the other
dense cases.  An iteration-only API means we can do all sorts of nice
compression improvements not possible with the random access API, we
don't need per-lookup bounds checks, etc.  We should borrow from the
many things we do to compress postings, which have been iterator-only
forever.  And it means the sparse case, as a happy side effect,
gets to improve too.
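
As one illustration of the kind of postings-style trick a forward-only
API permits, here is a sketch of gap (delta) encoding for doc IDs.
This is not how any Lucene codec actually writes doc values;
{{DeltaDocIdIterator}} is a hypothetical class made up for this
example.  Each entry stores only the distance from the previous doc
ID, so it can be decoded front to back with a running sum, but
answering "what is the Nth entry?" would mean decoding everything
before it:

```java
/** Hypothetical sketch: doc IDs stored as gaps, decodable only front to back. */
final class DeltaDocIdIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] gaps;
  private int upto = 0;  // index of the next gap to consume
  private int doc = -1;  // current doc ID; -1 before the first nextDoc()

  DeltaDocIdIterator(int[] gaps) {
    this.gaps = gaps;
  }

  /** Turns strictly ascending doc IDs into gaps relative to the previous doc (starting at -1). */
  static int[] encode(int[] sortedDocs) {
    int[] gaps = new int[sortedDocs.length];
    int prev = -1;
    for (int i = 0; i < sortedDocs.length; i++) {
      if (sortedDocs[i] <= prev) {
        throw new IllegalArgumentException("doc IDs must be strictly ascending");
      }
      gaps[i] = sortedDocs[i] - prev;
      prev = sortedDocs[i];
    }
    return gaps;
  }

  /** Decodes the next doc ID by adding the next gap to the running total. */
  int nextDoc() {
    if (upto == gaps.length) {
      doc = NO_MORE_DOCS;
      return doc;
    }
    doc += gaps[upto++];
    return doc;
  }

  public static void main(String[] args) {
    DeltaDocIdIterator it = new DeltaDocIdIterator(encode(new int[] {0, 3, 4, 9}));
    for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
      System.out.println(d);
    }
  }
}
```

Small gaps need fewer bits than absolute doc IDs, which is where the
compression win would come from in a real encoder; the sketch keeps
plain ints to stay short.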

This could lead to a point in the future where the dense cases perform
better than they did with random access API, like sort-by-title does
already.  We've only just begun down this path, and in just a few
weeks [~jpountz] has already made big gains.


> Explore switching doc values to an iterator API
> -----------------------------------------------
>
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

