[ 
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589400#comment-15589400
 ] 

Yonik Seeley commented on LUCENE-7407:
--------------------------------------

bq. Well, your tests are also using synthetically generated data right? If you 
run performance tests with synthetic data, you draw synthetic conclusions.

Testing with real requests is what *users* should do with their specific 
requests.  We have no single set of such typical requests...  we have too many 
users with too many use cases.
Generalized synthetic conclusions can be superior to a single "real world" use 
case that fails to cover enough scenarios that real users will encounter.  At 
first blush, it doesn't look like the lucenebench tests cover sorting and 
faceting that well.

bq. And while I appreciate your efforts to isolate doc values performance alone 
("finding the root domain"), this is also a rather overly synthetic use case.  
Most queries involve non-trivial cost, and the overall impact to real world use 
cases is what matters here.

I disagree.  If one is measuring performance of a faceting change, then isolate 
it.  Then you can say "this change improved faceting performance on large 
cardinality fields by up to 50%".
If a request is doing other expensive stuff, of course the overall impact will 
be smaller.  One can make the impact arbitrarily small by adding other, more 
expensive stuff to the request.
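To make that dilution concrete, here is a small Amdahl's-law-style sketch with hypothetical timings (the numbers are illustrative, not from any real benchmark): the same 2x faceting speedup shrinks as the rest of the request grows more expensive.

```java
public class ImpactSketch {
    // Overall request speedup when only one component (e.g. faceting)
    // gets faster by `factor`, while the rest of the request is unchanged.
    static double overallSpeedup(double partMs, double restMs, double factor) {
        return (partMs + restMs) / (partMs / factor + restMs);
    }

    public static void main(String[] args) {
        // Facet-dominated request: 50ms faceting + 50ms other work.
        System.out.printf("facet-heavy: %.2fx%n", overallSpeedup(50, 50, 2.0));
        // Same faceting change buried in an expensive request: 50ms + 450ms.
        System.out.printf("facet-light: %.2fx%n", overallSpeedup(50, 450, 2.0));
    }
}
```

The first case shows roughly a 1.33x overall win, the second barely 1.05x, even though the faceting component improved identically in both.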

Also, *some* users out there will experience an impact of that magnitude.  We 
really don't have a single "typical" real-world use case... we have too many 
users trying to do too many crazy things.  Everything that might be considered 
a corner case is often represented by real users who depend on that 
performance.  For example, I've seen plenty of users who try to facet on dozens 
of fields *per request*.

bq. Instead of "quickly testing things by hand"

A quick test by hand is still more informative than having no information at 
all.  The accuracy may be lower, but when I see changes of the size I saw, I 
know that it needs further investigation!
For example, I tested function queries (ValueSource) and sorting by multiple 
docvalue fields.  Are either of these things tested at all in 
https://home.apache.org/~mikemccand/lucenebench/ ?  The test names suggest that 
they are not, but it's hard to tell.
And with regard to the by-hand sorting test, I did follow it up with a more 
thorough test:
https://issues.apache.org/jira/browse/SOLR-9599?focusedCommentId=15584223&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15584223

bq.  please do a more thorough test, discarding warmup iterations, running N 
JVMs (to account for hotspot noise) with M iterations each (to account for 
other JVM noise) with diverse concurrent query types (to prevent hotspot from 
falsely over optimizing), etc.

Yes, I did all that.
Running diverse fields in the same JVM run is especially important to prevent 
hotspot from over-optimizing for a single field cardinality (since different 
cardinalities have different docvalues encodings).
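As a rough sketch of that methodology (warmup iterations discarded, multiple measured iterations, diverse workloads interleaved in one JVM so hotspot sees more than one shape of work) -- the workloads below are trivial stand-ins, not real Lucene queries:

```java
import java.util.Arrays;

public class BenchSketch {
    static volatile long sink;  // keeps the JIT from eliminating the loops

    // Time one workload: run `warmup` untimed iterations first, then
    // report the median of `measured` timed iterations.
    static long medianNanos(Runnable work, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) work.run();
        long[] times = new long[measured];
        for (int i = 0; i < measured; i++) {
            long t0 = System.nanoTime();
            work.run();
            times[i] = System.nanoTime() - t0;
        }
        Arrays.sort(times);
        return times[measured / 2];
    }

    public static void main(String[] args) {
        // Stand-ins for queries against fields of different cardinalities;
        // interleaving them keeps hotspot from specializing for just one.
        Runnable lowCard  = () -> { long s = 0; for (int i = 0; i < 10_000; i++) s += i % 7;  sink = s; };
        Runnable highCard = () -> { long s = 0; for (int i = 0; i < 10_000; i++) s += i * 31L; sink = s; };
        for (Runnable r : new Runnable[] { lowCard, highCard }) r.run();  // mixed warm pass

        System.out.println("low-cardinality median ns:  " + medianNanos(lowCard, 100, 50));
        System.out.println("high-cardinality median ns: " + medianNanos(highCard, 100, 50));
        // A fuller harness would also repeat this across N fresh JVM forks
        // and aggregate, to account for run-to-run compilation noise.
    }
}
```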

How many different numeric fields are concurrently sorted on for 
https://home.apache.org/~mikemccand/lucenebench/ ?
The names suggest just one: "TermQuery (date/time sort)"
If that is actually the case, then you're in danger of hotspot 
over-specializing for that single field/cardinality.

bq.  try to be part of the solution ;)

That's an unnecessary personal dig.
- I've already put a lot of effort into benchmarking this, only to have it 
dismissed with hand waves, for cases that may not even be covered (or may be 
understated) by your own benchmarks.
- I fully intend to dig into the Solr side, but I was waiting until the API 
stabilizes (LUCENE-7462).
- I pointed at specific examples that reside entirely in lucene code (the 
sorting examples)




> Explore switching doc values to an iterator API
> -----------------------------------------------
>
>                 Key: LUCENE-7407
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7407
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>              Labels: docValues
>             Fix For: master (7.0)
>
>         Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
>   * It would make doc values disk usage more of a "you pay for what
>     you actually use", like postings, which is a compelling
>     reduction for sparse usage.
>   * I think codecs could compress better and maybe speed up decoding
>     of doc values, even in the non-sparse case, since the read-time
>     API is more restrictive "forward only" instead of random access.
>   * We could remove {{getDocsWithField}} entirely, since that's
>     implicit in the iteration, and the awkward "return 0 if the
>     document didn't have this field" would go away.
>   * We can remove the annoying thread locals we must make today in
>     {{CodecReader}}, and close the trappy "I accidentally shared a
>     single XXXDocValues instance across threads", since an iterator is
>     inherently "use once".
>   * We could maybe leverage the numerous optimizations we've done for
>     postings over time, since the two problems ("iterate over doc ids
>     and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only.  Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.
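The forward-only read API the quoted issue describes might look roughly like the sketch below. All names here are hypothetical stand-ins, not the actual Lucene API; the point is that docs without the field are simply never returned, so the awkward "return 0 if the document didn't have this field" case disappears, and the cursor state makes each instance inherently use-once.

```java
public class DocValuesIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical iterator-style numeric doc values over a sparse field:
    // only docIDs that actually have a value are ever visited.
    static class SparseNumericDV {
        private final int[] docs;     // ascending docIDs that have the field
        private final long[] values;  // value for each entry in `docs`
        private int pos = -1;

        SparseNumericDV(int[] docs, long[] values) {
            this.docs = docs;
            this.values = values;
        }

        int nextDoc()    { pos++; return docID(); }
        int docID()      { return (pos >= 0 && pos < docs.length) ? docs[pos] : NO_MORE_DOCS; }
        long longValue() { return values[pos]; }
    }

    public static void main(String[] args) {
        // Field present only on docs 2, 7, and 11 of the segment.
        SparseNumericDV dv = new SparseNumericDV(new int[] {2, 7, 11},
                                                 new long[] {42, -1, 9});
        for (int doc = dv.nextDoc(); doc != NO_MORE_DOCS; doc = dv.nextDoc()) {
            System.out.println("doc " + doc + " -> " + dv.longValue());
        }
    }
}
```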



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
