We are searching strings, not numbers. We need this kind of query
because we have two big indexes, say, a collection of drugs and a
collection of research papers. I first run a query against the drugs
index and get 102400 unique drug names back. Then I need to find all
the research papers that mention one or more of the 102400 drug names,
hence the large OR query. This is a kind of JOIN query between two
indexes, which an article on the Lucid web site comparing databases and
search engines briefly touched on.
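
To make the second step concrete, here is a minimal sketch of turning
the drug names into OR queries. It is plain string handling, not a Solr
API: the helper names (quote_term, build_or_query, chunked) are
hypothetical, the field name "keywords" is taken from the example in the
original message, and quoting each name as a phrase is an assumption
about how multi-word drug names would be matched.

```python
from typing import Iterable, Iterator, List


def quote_term(term: str) -> str:
    """Wrap a term in double quotes, escaping backslashes and quotes."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_or_query(field: str, terms: Iterable[str]) -> str:
    """Build field:("a" OR "b" OR ...) from a list of terms."""
    body = " OR ".join(quote_term(t) for t in terms)
    return f"{field}:({body})"


def chunked(terms: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` terms."""
    for start in range(0, len(terms), size):
        yield terms[start:start + size]


# Hypothetical drug names standing in for the result of the first query.
drug_names = ["aspirin", "acetylsalicylic acid", "ibuprofen"]
small_queries = [build_or_query("keywords", c) for c in chunked(drug_names, 2)]
```

With 102400 names and a chunk size of 1024, chunked would yield 100
small queries.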

I was able to issue 100 small queries in parallel against the Solr
shards and get the results back successfully (even sorted). My custom
code is less than 100 lines, mostly in my
SearchHandler.handleRequestBody. But I have a problem summing up the
correct facet counts, because the facet counts from each shard are not
disjoint.
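
The counting problem can be illustrated with a toy example. This is
only a sketch with made-up data, not my handler code: it assumes each
small query returns a mapping from document id to a single facet value,
and it shows how adding up per-query counts differs from counting each
matching document once.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical results of two small queries: {doc_id: facet_value}.
# "paper2" mentions a drug from each chunk, so it matches both queries.
sub_results: List[Dict[str, str]] = [
    {"paper1": "2008", "paper2": "2009"},
    {"paper2": "2009", "paper3": "2009"},
]


def naive_sum(results: List[Dict[str, str]]) -> Counter:
    """Add the facet counts of each small query; overlaps are counted twice."""
    total: Counter = Counter()
    for hits in results:
        total.update(Counter(hits.values()))
    return total


def deduplicated(results: List[Dict[str, str]]) -> Counter:
    """Merge hits by document id first, then count each document once."""
    merged: Dict[str, str] = {}
    for hits in results:
        merged.update(hits)
    return Counter(merged.values())


print(naive_sum(sub_results))     # "2009" is counted 3 times
print(deduplicated(sub_results))  # "2009" is counted 2 times
```

Counting once per document requires the document ids, which the summed
facet counts alone do not carry.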

Based on the suggestions in two other responses to my question, I think
the master could pass the original large query to each shard. Each
shard would split the large query into 100 smaller disjunctive Lucene
queries, run them against its Lucene index in parallel, and merge the
results. Each shard would then return only 1 (instead of 100) result
set to the master, with disjoint facet counts. It seems the faceting
problem can be solved this way. I would appreciate it if you could let
me know whether this approach is feasible and correct, and which Solr
plug-ins are needed (my guess is a custom parser and query component).
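
A rough sketch of the shard-side flow I have in mind follows. It is a
toy model under stated assumptions, not Solr or Lucene code: the index
is a hypothetical in-memory dict from term to a set of document ids,
facet_of is a hypothetical dict from document id to one facet value, and
the function names are made up for illustration. It says nothing about
which plug-ins would carry this logic.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Set


def run_sub_query(index: Dict[str, Set[int]], terms: List[str]) -> Set[int]:
    """OR together the posting sets for one chunk of terms."""
    matched: Set[int] = set()
    for term in terms:
        matched |= index.get(term, set())
    return matched


def search_shard(index: Dict[str, Set[int]], facet_of: Dict[int, str],
                 terms: List[str], chunk_size: int) -> Counter:
    """Split the large query, run the chunks in parallel, facet once."""
    chunks = [terms[i:i + chunk_size] for i in range(0, len(terms), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partial_sets = list(pool.map(lambda c: run_sub_query(index, c), chunks))
    # Union the document ids so each document is faceted exactly once.
    doc_ids: Set[int] = set().union(*partial_sets)
    return Counter(facet_of[d] for d in doc_ids)


def merge_shards(shard_counts: List[Counter]) -> Counter:
    """Master side: shards hold different documents, so counts just add."""
    total: Counter = Counter()
    for counts in shard_counts:
        total.update(counts)
    return total


# Two hypothetical shards with different documents.
shard_a = ({"aspirin": {1, 2}, "ibuprofen": {2}}, {1: "2008", 2: "2009"})
shard_b = ({"ibuprofen": {3}}, {3: "2009"})
terms = ["aspirin", "ibuprofen"]
merged = merge_shards([search_shard(ix, fc, terms, 1) for ix, fc in (shard_a, shard_b)])
```

In this model the master only adds one counter per shard, because the
de-duplication happens inside each shard before faceting.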

Thanks,

Jeff   



-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]] 
Sent: Thursday, September 24, 2009 10:01 AM
To: [email protected]
Subject: [PMX:FAKE_SENDER] Re: large OR-boolean query


On Sep 23, 2009, at 4:26 PM, Luo, Jeff wrote:

> Hi,
>
> We are experimenting with a parallel approach to issue a large
> OR-Boolean query, e.g., keywords:(1 OR 2 OR 3 OR ... OR 102400),
> against several solr shards.
>
> The way we are trying is to break the large query into smaller ones,
> e.g., the example above can be broken into 100 small queries:
> keywords:(1 OR 2 OR 3 OR ... OR 1024),
> keywords:(1025 OR 1026 OR 1027 OR ... OR 2048), etc.
>
> Now each shard will get 100 requests and the master will merge the
> results coming back from each shard, similar to the regular
> distributed search.


Can you tell us a little bit more about the why/what of this?  Are you  
really searching numbers or are those just for example?  Do you care  
about the score or do you just need to know whether the result is  
there or not?


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search
