Isaac Hebsh created SOLR-5611:
---------------------------------

             Summary: When documents are uniformly distributed over shards, 
enable returning approximated results in distributed query
                 Key: SOLR-5611
                 URL: https://issues.apache.org/jira/browse/SOLR-5611
             Project: Solr
          Issue Type: Improvement
          Components: SolrCloud
            Reporter: Isaac Hebsh
             Fix For: 4.7


A query with rows=1000 that is sent to a collection of 100 shards (with the 
default shard key behaviour, based on a hash of the unique key) generates 100 
shard requests, each with rows=1000.
In total, up to rows*numShards unique keys are retrieved. 
This behaviour gets worse as numShards grows.
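
To make the cost concrete, here is a minimal sketch of the arithmetic above. The function name is hypothetical and is not part of Solr; it only multiplies the two numbers from the description.

```python
def total_keys_fetched(rows: int, num_shards: int) -> int:
    """Upper bound on the unique keys gathered by the coordinating node
    when every shard is asked for the full `rows`."""
    return rows * num_shards


# The example from the description: rows=1000 over 100 shards.
print(total_keys_fetched(1000, 100))  # 100000
```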

If the documents are uniformly distributed over the shards, the expected number 
of top documents contributed by each shard is ~ rows/numShards. Obviously, there 
might be extreme cases in which all of the top X documents are in a single shard.
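
One rough way to reason about how often a shard holds more than its expected share is a binomial model. The sketch below assumes that each of the top `rows` documents lands on a given shard independently with probability 1/numShards. That is a modelling assumption for illustration only, not something Solr computes, and the function name is hypothetical.

```python
from math import comb


def prob_shard_exceeds(rows: int, num_shards: int, limit: int) -> float:
    """P(one given shard holds more than `limit` of the top `rows` docs),
    assuming the count on that shard is Binomial(rows, 1/num_shards)."""
    p = 1.0 / num_shards
    # Cumulative probability of holding 0..limit of the top documents.
    cdf = sum(
        comb(rows, k) * p**k * (1 - p) ** (rows - k)
        for k in range(limit + 1)
    )
    return 1.0 - cdf


# Expected share is rows/numShards = 10; compare two per-shard limits.
print(prob_shard_exceeds(1000, 100, 10))
print(prob_shard_exceeds(1000, 100, 15))
```

A larger per-shard limit lowers the chance that a shard has to drop documents that belong in the global top `rows`, which is the accuracy trade-off in question.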

I suggest adding an optional parameter, say approxResults=true, which decides 
whether the rows in the shard requests should be limited to rows/numShards or 
not. Moreover, we can add a numeric parameter which increases the limit, to be 
more accurate.
For example, the query {{approxResults=true&approxResults.factor=1.5}} will 
retrieve 1.5*rows/numShards from each shard. In the case of 100 shards and 
rows=1000, each shard will return 15 documents.
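
A minimal sketch of how the per-shard limit could be derived. The function name, the rounding up, and the cap at `rows` are assumptions made for illustration; the proposal itself only specifies factor*rows/numShards.

```python
import math


def approx_shard_rows(rows: int, num_shards: int, factor: float = 1.0) -> int:
    """Rows to request from each shard under the proposed approxResults mode."""
    # Round to 9 decimals before taking the ceiling so that floating-point
    # noise does not push an exact value up to the next integer.
    per_shard = math.ceil(round(factor * rows / num_shards, 9))
    # Assumption: never ask a shard for more than the exact mode would.
    return min(rows, per_shard)


# The example from the description: 100 shards, rows=1000, factor=1.5.
print(approx_shard_rows(1000, 100, 1.5))  # 15
```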

Furthermore, this can reduce the problem of deep paging, because the same thing 
can be applied there. When start=100000 is requested, Solr creates shard requests 
with start=0 and rows=START+ROWS. In the approximated approach, the start 
parameter (in the shard requests) can be set to 100000/numShards. The idea of 
approxResults.factor creates some difficulties here, though.
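
The two paging strategies can be compared with a small sketch. Both function names are hypothetical, integer division is an assumption, and the factor is deliberately left out because its interaction with `start` is the open question noted above.

```python
def exact_shard_paging(start: int, rows: int) -> tuple[int, int]:
    """Current behaviour as described: each shard returns its first
    start+rows documents."""
    return 0, start + rows


def approx_shard_paging(start: int, rows: int, num_shards: int) -> tuple[int, int]:
    """Proposed approximation: divide both the offset and the window
    evenly across the shards."""
    return start // num_shards, rows // num_shards


# start=100000 from the description; rows=1000 and 100 shards are assumed
# here to match the earlier example.
print(exact_shard_paging(100000, 1000))        # (0, 101000)
print(approx_shard_paging(100000, 1000, 100))  # (1000, 10)
```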



