Hi,

We are experimenting a parallel approach to issue a large OR-Boolean
query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400), against several
solr shards.

The way we are trying is to break the large query into smaller ones,
e.g.,  
the example above can be broken into 10 small queries: keywords:(1 OR 2
OR 3 OR ... OR 1024), keywords:(1025 OR 1026 OR 1027 OR ... OR 2048),
etc

Now each shard will get 10 requests and the master will merge the
results coming back from each shard, similar to the regular distributed
search. 

There are a few issues with this approach, though:

Issue #1: sorting: 
ShardFieldSortedHitQueue.lessThan() checks if two documents are from the
same shard; if yes, it will just compare the orderInShard. However, in
our approach, two documents from the same shard could be results of two
different small queries, in that case, we can't just compare the
orderInShard. There is a simple solution to this issue: just change the
if statement to :
if (docA.shard == docB.shard && docA.sortFieldValues ==
docB.sortFieldValues)

Can someone make this change for Solr1.4 so that we don't have to
customize it? It does NOT impact the normal case.

Issue #2: number of matching documents found:
Since multiple responses from the same shard could contain duplicated
documents, the sum of number of documents found from each response will
be greater than the real number of documents found. 

Unless we ask each shard to return all the matching records for every
small query and then check duplicates, we can't get the accurate number
of documents found.

Issue #3: faceting counts
Due to the same limitation as in issue #2, the faceting count is always
greater than the correct one.

I don't have any idea how this issue can be resolved.

Reply via email to