Isaac Hebsh created SOLR-5611:
---------------------------------
Summary: When documents are uniformly distributed over shards,
enable returning approximated results in distributed query
Key: SOLR-5611
URL: https://issues.apache.org/jira/browse/SOLR-5611
Project: Solr
Issue Type: Improvement
Components: SolrCloud
Reporter: Isaac Hebsh
Fix For: 4.7
Query with rows=1000, which sent to a collection of 100 shards (shard key
behaviour is default - based on hash of the unique key), will generate 100
requests of rows=1000, on each shard.
This results to total number of rows*numShards unique keys to be retrieved.
This behaviour is getting worst as numShards grows.
If the documents are uniformly distributed over the shards, the expected number
of document should be ~ rows/numShards. Obviously, there might be extreme
cases, when all of the top X documents are in a specific shard.
I suggest adding an optional parameter, say approxResults=true, which decides
whether we should limit the rows in the shard requests to rows/numShardsor not.
Moreover, we can add a numeric parameter which increases the limit, to be more
accurate.
For example, the query {{approxResults=true&approxResults.factor=1.5}} will
retrieve 1.5*rows/numShards from each shard. In the case of 100 shards and
rows=1000, each shard will return 15 documents.
Furthermore, this can reduce the problem of deep paging, because the same thing
can be applied there. when requested start=100000, Solr creating shard request
with start=0 and rows=START+ROWS. In the approximated approach, start parameter
(in the shard requests) can be set to 100000/numShards. The idea of the
approxResults.factor creates some difficulties here, though.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]