[
https://issues.apache.org/jira/browse/LUCENE-7407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589400#comment-15589400
]
Yonik Seeley commented on LUCENE-7407:
--------------------------------------
bq. Well, your tests are also using synthetically generated data right? If you
run performance tests with synthetic data, you draw synthetic conclusions.
Testing with real requests is what *users* should do with their specific
requests. We have no single set of such typical requests... we have too many
users with too many use cases.
Generalized synthetic conclusions can be superior to a single "real world" use
case that fails to cover enough scenarios that real users will encounter. At
first blush, it doesn't look like the lucenebench tests cover sorting and
faceting that well.
bq. And while I appreciate your efforts to isolate doc values performance alone
("finding the root domain"), this is also a rather overly synthetic use case.
Most queries involve non-trivial cost, and the overall impact to real world use
cases is what matters here.
I disagree. If one is measuring performance of a faceting change, then isolate
it. Then you can say "this change improved faceting performance on large
cardinality fields by up to 50%".
If a request is doing other expensive stuff, of course the overall impact will
be smaller. One can make the impact arbitrarily small by adding other more
expensive stuff to the request.
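To put rough numbers on that dilution, here is a minimal sketch. The timings are made up for illustration (they are not measurements from this issue), and the class and method names are hypothetical.

```java
final class DilutedImpact {
    /**
     * Fractional change in total request time when only the faceting portion
     * changes by facetChange (e.g. 0.5 means faceting takes 50% longer).
     */
    static double overallChange(double facetMs, double otherMs, double facetChange) {
        double before = facetMs + otherMs;
        double after = facetMs * (1.0 + facetChange) + otherMs;
        return (after - before) / before;
    }

    public static void main(String[] args) {
        double facetMs = 10.0;    // hypothetical cost of the faceting step
        double facetChange = 0.5; // hypothetical 50% change in that step
        for (double otherMs : new double[] {0.0, 10.0, 90.0, 990.0}) {
            System.out.printf("other work = %.0f ms -> overall change = %.1f%%%n",
                    otherMs, 100.0 * overallChange(facetMs, otherMs, facetChange));
        }
    }
}
```

With these assumed numbers the same 50% change in faceting shows up as 50%, 25%, 5% and 0.5% of the whole request, which is why the isolated figure is the one worth reporting.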
Also, *some* users out there will experience an impact of that magnitude. We
really don't have a single "typical" real-world use case... we have too many
users trying to do too many crazy things. Everything that might be considered
a corner case is often represented by real users who depend on that
performance. For example, I've seen plenty of users who try to facet on dozens
of fields *per request*.
bq. Instead of "quickly testing things by hand"
A quick test by hand is still more informative than having no information at
all. The accuracy may be lower, but when I see changes of the size I saw, I
know that it needs further investigation!
For example, I tested function queries (ValueSource) and sorting by multiple
docvalue fields. Is either of these things tested at all in
https://home.apache.org/~mikemccand/lucenebench/ ? The test names suggest that
they are not, but it's hard to tell.
And wrt the by-hand sorting test, I did follow it up with a more thorough
test:
https://issues.apache.org/jira/browse/SOLR-9599?focusedCommentId=15584223&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15584223
bq. please do a more thorough test, discarding warmup iterations, running N
JVMs (to account for hotspot noise) with M iterations each (to account for
other JVM noise) with diverse concurrent query types (to prevent hotspot from
falsely over optimizing), etc.
Yes, I did all that.
Running diverse fields in the same JVM run is esp important to prevent hotspot
from over-optimizing for a single field cardinality (since different
cardinalities have different docvalues encodings).
How many different numeric fields are concurrently sorted on for
https://home.apache.org/~mikemccand/lucenebench/ ?
The names suggest just one: " TermQuery (date/time sort)"
If that is actually the case, then you're in danger of hotspot
over-specializing for that single field/cardinality.
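For reference, the shape of such a run can be sketched as follows. This is not the harness I used and it is not lucenebench; the class, the method names and the array-summing workload are all hypothetical stand-ins. Summing a plain array does not exercise real docvalues encodings, so the sketch only shows the structure: interleave several "fields" in one JVM, discard warmup rounds, and summarize the remaining rounds. Covering JVM-to-JVM noise would still mean launching the whole thing N times as separate processes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class InterleavedBench {
    /**
     * Runs every task once per round so that all of them share one JVM and one
     * JIT profile, discards the first warmupRounds rounds, and returns the
     * median elapsed nanoseconds per task over the measured rounds
     * (the upper median when the number of measured rounds is even).
     */
    static Map<String, Long> run(Map<String, LongSupplier> tasks, int warmupRounds, int measuredRounds) {
        if (warmupRounds < 0 || measuredRounds <= 0) {
            throw new IllegalArgumentException("need warmupRounds >= 0 and measuredRounds > 0");
        }
        Map<String, long[]> samples = new LinkedHashMap<>();
        for (String name : tasks.keySet()) {
            samples.put(name, new long[measuredRounds]);
        }
        long sink = 0; // consume results so the JIT cannot drop the work
        for (int round = 0; round < warmupRounds + measuredRounds; round++) {
            for (Map.Entry<String, LongSupplier> task : tasks.entrySet()) {
                long start = System.nanoTime();
                sink += task.getValue().getAsLong();
                long elapsed = System.nanoTime() - start;
                if (round >= warmupRounds) {
                    samples.get(task.getKey())[round - warmupRounds] = elapsed;
                }
            }
        }
        Map<String, Long> medians = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : samples.entrySet()) {
            long[] sorted = e.getValue().clone();
            Arrays.sort(sorted);
            medians.put(e.getKey(), sorted[sorted.length / 2]);
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // practically never true; keeps sink observable
        }
        return medians;
    }

    /** Stand-in workload: sums an array whose values are limited to the given cardinality. */
    static LongSupplier syntheticField(int numDocs, int cardinality) {
        long[] values = new long[numDocs];
        for (int i = 0; i < numDocs; i++) {
            values[i] = i % cardinality;
        }
        return () -> {
            long sum = 0;
            for (long v : values) {
                sum += v;
            }
            return sum;
        };
    }

    public static void main(String[] args) {
        Map<String, LongSupplier> tasks = new LinkedHashMap<>();
        tasks.put("cardinality=10", syntheticField(1_000_000, 10));
        tasks.put("cardinality=1000", syntheticField(1_000_000, 1_000));
        tasks.put("cardinality=100000", syntheticField(1_000_000, 100_000));
        run(tasks, 20, 50).forEach((name, ns) -> System.out.println(name + ": median " + ns + " ns"));
    }
}
```

The point of the round-robin loop is that no single task gets the JIT to itself; a benchmark that sorts on only one field never puts the compiler in that position.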
bq. try to be part of the solution ;)
That's an unnecessary personal dig.
- I've already put in a lot of effort into benchmarking this, only to have it
dismissed with hand waves, for cases that may not even be covered (or may be
understated) by your own benchmarks.
- I fully intend to dig into the solr side, but I was waiting until the API
stabilizes (LUCENE-7462)
- I pointed at specific examples that reside entirely in lucene code (the
sorting examples)
> Explore switching doc values to an iterator API
> -----------------------------------------------
>
> Key: LUCENE-7407
> URL: https://issues.apache.org/jira/browse/LUCENE-7407
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Labels: docValues
> Fix For: master (7.0)
>
> Attachments: LUCENE-7407.patch
>
>
> I think it could be compelling if we restricted doc values to use an
> iterator API at read time, instead of the more general random access
> API we have today:
> * It would make doc values disk usage more of a "you pay for what
> you actually use", like postings, which is a compelling
> reduction for sparse usage.
> * I think codecs could compress better and maybe speed up decoding
> of doc values, even in the non-sparse case, since the read-time
> API is more restrictive "forward only" instead of random access.
> * We could remove {{getDocsWithField}} entirely, since that's
> implicit in the iteration, and the awkward "return 0 if the
> document didn't have this field" would go away.
> * We can remove the annoying thread locals we must make today in
> {{CodecReader}}, and close the trappy "I accidentally shared a
> single XXXDocValues instance across threads", since an iterator is
> inherently "use once".
> * We could maybe leverage the numerous optimizations we've done for
> postings over time, since the two problems ("iterate over doc ids
> and store something interesting for each") are very similar.
> This idea has come up many times in the past, e.g. LUCENE-7253 is a recent
> example, and very early iterations of doc values started with exactly
> this ;)
> However, it's a truly enormous change, likely 7.0 only. Or maybe we
> could have the new iterator APIs also ported to 6.x side by side with
> the deprecated existing random-access APIs.
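
As a rough illustration of the forward-only access pattern the description proposes, here is a minimal self-contained sketch. The interface and method names are hypothetical and are not taken from the attached patch; the real API may look different. It shows two of the points above: only documents that actually have a value are visited, so no separate "docs with field" lookup is needed, and a stored 0 is distinguishable from a missing value.

```java
final class IteratorDocValuesSketch {
    /** Sentinel returned once the iterator is exhausted (an assumption of this sketch). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical forward-only view over the documents that have a value. */
    interface ForwardValues {
        int nextDoc();     // next doc id that has a value, or NO_MORE_DOCS
        long longValue();  // value of the doc most recently returned by nextDoc()
    }

    /** Builds a single-use iterator over parallel arrays of sorted doc ids and values. */
    static ForwardValues sparse(int[] docIds, long[] values) {
        return new ForwardValues() {
            private int pos = -1;

            @Override
            public int nextDoc() {
                if (pos < docIds.length) {
                    pos++;
                }
                return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
            }

            @Override
            public long longValue() {
                return values[pos];
            }
        };
    }

    /** Returns {docsWithValue, sumOfValues}; only docs that have a value are visited. */
    static long[] countAndSum(ForwardValues it) {
        long count = 0;
        long sum = 0;
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            count++;
            sum += it.longValue();
        }
        return new long[] {count, sum};
    }

    public static void main(String[] args) {
        // Three docs out of a notional large index have a value; doc 5 holds a real 0.
        long[] result = countAndSum(sparse(new int[] {2, 5, 9}, new long[] {7L, 0L, 3L}));
        System.out.println("docs with value = " + result[0] + ", sum = " + result[1]);
    }
}
```

Because each iterator carries its own position, it is naturally single-use, which is the property the description relies on for dropping the thread locals.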