[ 
https://issues.apache.org/jira/browse/SOLR-17182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-17182:
--------------------------------------
    Attachment: gen-random-csv.pl
                run-queries.sh
        Status: Open  (was: Open)

The original thinking in SOLR-16693 was basically:
 * Lucene made some changes:
 ** ExitableDirectoryReader now has overhead even if the QueryTimeout it uses 
is a No-Op
 ** IndexSearcher.searchWithTimeout feature that can use QueryTimeout
 * Therefore Solr should make the following changes:
 ** Use IndexSearcher.searchWithTimeout from SolrIndexSearcher anytime there is 
a QueryTimeout in the request
 ** Stop using ExitableDirectoryReader completely
 *** As a hedge against breaking backcompat: add a system property (that we 
don't plan to support for long) to trigger EDR use
 *** if this sysprop is enabled, IndexSearcher.searchWithTimeout is probably 
redundant so don't use it

My comment at the time, which was forked off into the creation of this jira 
(SOLR-16693), and the specific implementation suggestion I made were:
 * Focused on concerns about the "non-search" code paths in Solr that use 
IndexReader (faceting, spellcheck, etc...) that would no longer respect 
timeAllowed (now more broadly "QueryLimits")
 * Assumed that if we added a way for these code paths to get an EDR if and 
only if the request used QueryLimits, we could still eliminate the system 
property and its conditional logic
 ** Which implicitly builds off the assumption that 
{{IndexSearcher.searchWithTimeout}} is an adequate solution for enforcing 
{{QueryLimits}} in the "search" logic.

But as a result of some ad hoc testing I did after discovering SOLR-17831, I'm 
no longer convinced that {{IndexSearcher.searchWithTimeout}} is (always) an 
adequate solution for enforcing {{QueryLimits}} in the "search" logic.

With the fix for SOLR-17831 in place, using 
{{solr.useExitableDirectoryReader=true}} results in "slow" or "expensive" 
queries – w/o any complex features like faceting, spellcheck, etc... – being 
terminated *much* more quickly/aggressively than 
{{solr.useExitableDirectoryReader=false}} (and the implicit default 
IndexSearcher.searchWithTimeout / TimeLimitingBulkScorer behavior).

(I haven't exhaustively tested this, but IIUC: when & where – and how rarely – 
{{TimeLimitingBulkScorer}} checks the {{QueryTimeout}} means that it takes 
longer and longer to notice the limit has been exceeded as the number of 
documents matched and/or the limit itself increase?)
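
The hypothesis in the parenthetical above can be illustrated with a toy model. This is only a sketch: it assumes the timeout is checked once before each scoring window, and that windows start at 100 documents and grow by 50% each time. The function name and both numbers are illustrative assumptions, not something taken from this issue or verified against Lucene.

```shell
#!/usr/bin/env bash
# Toy model (hypothetical): a scorer that only checks the timeout between
# "windows" of documents, where each window is 50% larger than the last.
# Argument: the number of scored documents at which the limit is exceeded.
# Prints:   the number of documents scored when the limit is first noticed.
first_check_after() {
  local deadline_docs=$1 window=100 scored=0
  while :; do
    # The limit is only consulted here, before the next window is scored.
    if (( scored >= deadline_docs )); then
      echo "$scored"
      return
    fi
    scored=$(( scored + window ))
    window=$(( window + window / 2 ))   # grow the window by 50%
  done
}

# How far past the limit does the toy scorer run for increasingly late limits?
for deadline in 1000 10000 100000; do
  noticed=$(first_check_after "$deadline")
  echo "limit hit at ${deadline} docs, noticed at ${noticed} (overshoot $(( noticed - deadline )))"
done
```

Under these assumptions the gap between consecutive checks keeps growing, so the worst-case overshoot grows with the number of documents matched before the limit is hit, which is the behavior the question above is asking about.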

 
----
 

I'm attaching some quick and dirty scripts I used to compare EDR to the default 
behavior using an index of 1 million synthetic documents and 99 requests of 
complex disjunctions of prefix queries (which was the simplest way I could 
think of to force a complicated "search" w/o using features that would give EDR 
an advantage – like function queries) using {{timeAllowed=50}}
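
For readers without the attachments, here is a rough sketch of what "a disjunction of prefix queries" could look like as a request. It is not the attached {{run-queries.sh}}: the helper name, the two-letter prefixes, the clause count, and the {{rows=0}} parameter are all assumptions made for illustration, and it presumes the indexed terms begin with lowercase letters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build "field:aa* OR field:ab* OR ..." containing the
# requested number of prefix clauses (at most 26*26 distinct prefixes).
prefix_disjunction() {
  local field=$1 count=$2 query="" n=0 a b
  for a in {a..z}; do
    for b in {a..z}; do
      (( n >= count )) && break 2
      # Prepend " OR " only when the query already has a clause.
      query+="${query:+ OR }${field}:${a}${b}*"
      n=$(( n + 1 ))
    done
  done
  printf '%s\n' "$query"
}

prefix_disjunction a_t 3

# Example request (left commented out; assumes the techproducts collection
# from the notes below is running locally):
# curl -sS 'http://localhost:8983/solr/techproducts/select' \
#   --data-urlencode "q=$(prefix_disjunction a_t 50)" \
#   --data-urlencode 'timeAllowed=50' \
#   --data-urlencode 'rows=0'
```

Each prefix clause has to be expanded against the term dictionary, so a wide disjunction is an easy way to make the "search" phase itself expensive without involving faceting or function queries.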

My notes are below, but the bottom line is:
 * using EDR caused the mean QTime to be ~ "50ms" – regardless of the number of 
segments
 * the default behavior caused a mean QTime of ~ "85-150ms" depending on the 
number of segments

I think for people who *really* care about having {{QueryLimits}} enforced, we 
need a (long term supported) option for using {{ExitableDirectoryReader}} 
_everywhere_ in Solr – including {{SolrIndexSearcher}}.

 
{noformat}
# Use techproducts example
#
./solr/packaging/build/dev/bin/solr -e techproducts --no-prompt

# build up 1 million random docs
#
./gen-random-csv.pl 1000000 200000 100w200000 100w2000000 100w2000000 \
  100w2000000 100w2000000 > data.csv

# index them, using commitWithin (more aggressively than we probably need to)
# to force multiple segments
# (in my case i wound up with 48 segments)
#
curl \
  'http://localhost:8983/solr/techproducts/update?commitWithin=1000&fieldnames=id,a_t,b_t,c_t,d_t,e_t,f_t' \
  --data-binary @data.csv -H 'Content-type:application/csv'
{noformat}
{noformat}
# now run a bunch of synthetic (non-cached) queries with a large number of
# disjunctions to see how timeAllowed behaves and how long it takes to time
# out by default ...
#
bash run-queries.sh > 48seg.timeallowed.default.txt
grep QTime 48seg.timeallowed.default.txt \
  | perl -nle 'chomp; /.*?(\d+)/ or die; $sum+=$1; END { print $sum }'
15542
{noformat}
{noformat}
# Re-start solr with EDR enabled and run the same test...
#
./solr/packaging/build/dev/bin/solr -e techproducts --no-prompt --jvm-opts \
  -Dsolr.useExitableDirectoryReader=true
bash run-queries.sh > 48seg.timeallowed.edr.txt
grep QTime 48seg.timeallowed.edr.txt \
  | perl -nle 'chomp; /.*?(\d+)/ or die; $sum+=$1; END { print $sum }'
4961
{noformat}
{noformat}
# Restart solr with solr.autoCommit.maxTime==-1 and reindex everything to try
# and minimize segment counts
#
# In my case, I got it down to 18 segments
#
./solr/packaging/build/dev/bin/solr -e techproducts --no-prompt --jvm-opts \
  -Dsolr.autoCommit.maxTime=-1
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary \
  '<delete><query>*:*</query></delete>'
curl \
  'http://localhost:8983/solr/techproducts/update?commit=true&fieldnames=id,a_t,b_t,c_t,d_t,e_t,f_t' \
  --data-binary @data.csv -H 'Content-type:application/csv'
curl 'http://localhost:8983/solr/techproducts/update?commit=true&optimize=true'
{noformat}
{noformat}
# re-run same timeallowed queries (default behavior, 18 segments)
#
bash run-queries.sh > 18seg.timeallowed.default.txt
grep QTime 18seg.timeallowed.default.txt \
  | perl -nle 'chomp; /.*?(\d+)/ or die; $sum+=$1; END { print $sum }'
8505
{noformat}
{noformat}
# Re-start solr with EDR enabled, and run the same test again on this 18
# segment index...
#
bash run-queries.sh > 18seg.timeallowed.edr.txt
grep QTime 18seg.timeallowed.edr.txt \
  | perl -nle 'chomp; /.*?(\d+)/ or die; $sum+=$1; END { print $sum }'
4872
{noformat}
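
As a rough cross-check, the four QTime totals printed in the notes above can be turned into per-request means. This sketch assumes every one of the 99 requests contributed exactly one QTime line to each total; the function name is just for illustration.

```shell
#!/usr/bin/env bash
# Mean QTime per request, given a summed QTime (ms) and a request count.
mean_qtime() {
  local total=$1 requests=$2
  awk -v t="$total" -v n="$requests" 'BEGIN { printf "%.1f\n", t / n }'
}

# Totals copied from the notes above; 99 requests per run.
echo "48 segments, default: $(mean_qtime 15542 99) ms"
echo "48 segments, EDR:     $(mean_qtime 4961 99) ms"
echo "18 segments, default: $(mean_qtime 8505 99) ms"
echo "18 segments, EDR:     $(mean_qtime 4872 99) ms"
```

With these totals the EDR runs come out at about 50 ms per request on both indexes, while the default runs come out at about 157 ms (48 segments) and 86 ms (18 segments).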

> Eliminate the need for 'solr.useExitableDirectoryReader' sysprop
> ----------------------------------------------------------------
>
>                 Key: SOLR-17182
>                 URL: https://issues.apache.org/jira/browse/SOLR-17182
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Query Limits
>            Reporter: Chris M. Hostetter
>            Assignee: Gus Heck
>            Priority: Major
>         Attachments: gen-random-csv.pl, run-queries.sh
>
>
> As the {{QueryLimit}} functionality in Solr gets beefed up, and supports 
> multiple types of limits, it would be nice if we could find a way to 
> eliminate the need for the {{solr.useExitableDirectoryReader}} sysprop, and 
> instead just have codepaths that use the underlying IndexReader  (like 
> faceting, spellcheck, etc...)  automatically get a reader that enforces the 
> limits if/when limits are in use.


