[ https://issues.apache.org/jira/browse/LUCENE-7618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man updated LUCENE-7618:
-----------------------------
    Attachment: LUCENE-7618.patch

*TL;DR:* Based on some half-assed benchmarking using junit timings, the results did *NOT* show any improvements -- for some seeds (on my laptop) the new code was actually slower sometimes, leading me to suspect I may have a bug somewhere in either the test or the code, but I haven't had time to dig into it.

----

*Background...*

Just before I went on vacation, I had a conversation with someone who asked a question that can be summarized as:

bq. If I already know the smallest possible value ({{minF}}) in the field {{foo}}, which is faster: {{foo:\[\* TO X\]}} or {{foo:\[minF TO X\]}} ? _(where X will be different in every query)_

My answer, based on my recollection of how simple term-indexed range queries used to work, was that any difference should be negligible -- in the {{\*}} case Lucene can skip the call to {{seekTo(minF)}}, but otherwise execution would be identical after that. For DocValues, I speculated that using {{\*}} should theoretically be faster, because as the code iterates over the DocValues, it would only have to compare each doc's value to {{X}}, and not to {{minF}}.

When I dug into the code, I found that wasn't the case: while query rewriting has a special-case optimization for {{foo:\[\* TO \*\]}}, there are no special-case optimizations for a range query open on only one end -- instead the TwoPhaseIterator created for SortedNumericDocValues assigns Long.MIN_VALUE/Long.MAX_VALUE as needed any time min/max are null, and does 2 comparisons per doc value. Likewise, the codepath for SortedSetDocValues assigns minOrd ({{0}}) and maxOrd ({{values.getValueCount() - 1}}) when the user-specified min/max are null, and again does 2 comparisons per value.

*Theory...*

Adding special-case handling for open ended range queries, to return TwoPhaseIterators that only do one comparison in these special cases, seem(ed|s) like an easy win?
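To make the idea concrete, here's a minimal standalone sketch (not the actual Lucene code, and {{matchesOpenMin}} is a hypothetical name) modeling the per-value check: the existing path substitutes Long.MIN_VALUE/Long.MAX_VALUE for null bounds and always does two comparisons, while the proposed special case for an open-ended lower bound does only one:

```java
public class RangeCheckSketch {

    /** Existing behavior: null bounds become sentinels, two comparisons per value. */
    static boolean matchesTwoComparisons(long value, Long min, Long max) {
        long lo = (min == null) ? Long.MIN_VALUE : min;
        long hi = (max == null) ? Long.MAX_VALUE : max;
        return lo <= value && value <= hi;   // always two comparisons
    }

    /** Hypothetical special case for foo:[* TO max]: one comparison per value. */
    static boolean matchesOpenMin(long value, long max) {
        return value <= max;                 // single comparison
    }

    public static void main(String[] args) {
        long[] values = { Long.MIN_VALUE, -5, 0, 7, 42, Long.MAX_VALUE };
        for (long v : values) {
            // foo:[* TO 10] -- both paths must agree on every value
            if (matchesTwoComparisons(v, null, 10L) != matchesOpenMin(v, 10L)) {
                throw new AssertionError("mismatch for " + v);
            }
        }
        System.out.println("ok");
    }
}
```

The two predicates are equivalent for open-min queries; the question is only whether dropping the extra comparison actually buys anything once the JIT is done with it.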
In particular for the case of SortedSetDocValues: even if the original query has non-null min & max values, after calling {{lookupTerm(...)}} on each we might find that one of them is completely out of range _for individual segments_, and optimize away one comparison on a per-segment level. (An existing optimization already handles the case where the min/max are both completely below/above the range of values in the segment.)

*Attempt...*

Step #1 was writing a {{TestDocValuesRangeQueryOptimizations}} (see patch) against the existing code that built up an index with identical values in 3 different fields (SortedNumeric, SortedSet, and Points) and then ran a lot of randomized, open ended, range queries against that index, one field per test method. Once I had the basics of the test written, I ran the test over and over with the same seed and noted down the junit test times for each run. I then added what seemed like the most straightforward optimizations to {{DocValuesRangeQuery}}, and re-ran the test a few times with the same seed.

*Results...*

The results were not good. Looking back, I didn't even bother to save the tmp file where I had been pasting the junit times -- that's how not good they were. Even accounting for potential "noise" in the results from other stuff running on my laptop, there was no inkling of any speedup in the new code (as compared to the tests against the Points field, which didn't involve any modified code paths), and IIRC the results were occasionally slower. I set all this aside to come back to later, to try and figure out if I'd made any obvious mistakes, but skimming the code today and writing up my recollections, nothing jumps out at me. My best guess is that HotSpot is already optimizing away some of these comparisons, and that the new code just complicates things for HotSpot to the point that it's sometimes slower.
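For reference, the per-segment ord resolution described for SortedSetDocValues above can be modeled with a plain binary search over a segment's sorted term dictionary -- this is a simplified standalone sketch (the class and method names are hypothetical, with {{Arrays.binarySearch}} standing in for {{lookupTerm(...)}}, which encodes a missing term the same way, as {{-insertionPoint - 1}}):

```java
import java.util.Arrays;

public class SegmentOrdRangeSketch {

    /** Resolve a user-supplied min term to the first ord to accept in this segment. */
    static long minOrd(String[] terms, String min) {
        if (min == null) return 0;
        int ord = Arrays.binarySearch(terms, min);
        return (ord >= 0) ? ord : (-ord - 1);   // term absent: first ord >= min
    }

    /** Resolve a user-supplied max term to the last ord to accept in this segment. */
    static long maxOrd(String[] terms, String max) {
        if (max == null) return terms.length - 1;
        int ord = Arrays.binarySearch(terms, max);
        return (ord >= 0) ? ord : (-ord - 2);   // term absent: last ord < max
    }

    public static void main(String[] args) {
        String[] segTerms = { "m", "p", "t" };  // this segment's term dictionary
        // min "a" sorts below every term here, so minOrd clamps to 0 -- the
        // lower-bound comparison is effectively open for this segment and
        // could (in theory) be skipped, even though the query's min is non-null.
        long lo = minOrd(segTerms, "a");
        long hi = maxOrd(segTerms, "q");
        System.out.println(lo + " " + hi);
    }
}
```

For the example segment this prints {{0 1}}: a non-null query min clamps to ord 0, which is exactly the "effectively open ended on this segment" case.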
I don't plan on taking this any further, but I wanted to open the issue to track the idea in case anyone else sees value in pursuing it further.

> Hypothetical perf improvements in DocValuesRangeQuery: reducing comparisons for some queries/segments
> -----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7618
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7618
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Hoss Man
>     Attachments: LUCENE-7618.patch
>
>
> In reviewing the DocValuesRangeQuery code, it occurred to me that there _might_ be some potential performance optimizations possible in a few cases relating to queries that involve explicitly specified open ranges (ie: min or max are null), or in the case of SortedSet: range queries that are *effectively* open ended on particular segments, because the min/max are below/above the minOrd/maxOrd for the segment.
> Since these seemed like semi-common situations (open ended range queries are fairly common in my experience; I'm not sure about the secondary SortedSet "ord" case, but it seemed potentially promising, particularly for fields like incrementing ids or timestamps, where values are added sequentially and likely to be clustered together), I did a bit of experimenting and wanted to post my findings in jira -- patch & details to follow in comments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)