Hi Jonathan,

The advantages of the obvious approach you outline are that it is simple, it 
fits into the existing Solr model, and it doesn't require any customization or 
modification to Solr/Lucene Java code.  Unfortunately, it does not scale well.  
We originally tried just what you suggest for our implementation of Collection 
Builder.  For a user's personal collection we had a table that maps the 
collection id to the unique Solr ids.
Then when they wanted to search their collection, we just took their search and 
added a filter query: fq=(id:1 OR id:2 OR ...).  I seem to remember running 
into the limit on the number of boolean clauses allowed (maxBooleanClauses, 
1024 by default).  Even if you raise that limit, there are a number of 
efficiency issues with building, sending, and parsing such a long list of OR 
clauses.
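For concreteness, here is a hedged SolrJ sketch of that naive approach; the 
table, column, and field names are made up for illustration:

    // Hedged sketch of the naive approach in SolrJ: look up the ids for a
    // collection over JDBC and turn them into one big OR filter query.
    // The table, column, and field names are made up for illustration.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class NaiveCollectionSearch {
        public static QueryResponse search(Connection db, SolrServer solr,
                                           int collectionId, String userQuery)
                throws Exception {
            // ids that belong to this collection, from the mapping table
            List<String> ids = new ArrayList<String>();
            PreparedStatement ps = db.prepareStatement(
                "SELECT solr_id FROM collection_items WHERE collection_id = ?");
            ps.setInt(1, collectionId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                ids.add(rs.getString(1));
            }

            // fq=(id:1 OR id:2 OR ...) -- this string grows with the
            // collection and eventually hits the boolean clause limit
            StringBuilder fq = new StringBuilder("id:(");
            for (int i = 0; i < ids.size(); i++) {
                if (i > 0) fq.append(" OR ");
                fq.append(ids.get(i));
            }
            fq.append(")");

            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery(fq.toString());
            return solr.query(q);
        }
    }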

We ended up constructing a separate Solr index where we have a multi-valued 
collection number field. Unfortunately, until incremental field updating is 
implemented, this means that every time someone adds a document to a 
collection, the entire document (including 700KB of OCR) has to be re-indexed 
just to update the collection number field. This approach has allowed us to 
scale up to somewhere under 100,000 documents, but we don't think we can scale 
it much beyond that for various reasons.
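As a very rough sketch of what that re-index step looks like (field names are 
illustrative; the point is that the whole document, OCR and all, has to be 
resent just to add one value to a multi-valued field):

    // Rough sketch of the separate-index approach: a multi-valued
    // "collection" field on each document.  Because there is no
    // incremental field update, adding a document to a collection means
    // resending the whole document, OCR included, with one more value in
    // that field.  Field names are illustrative.
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CollectionFieldUpdater {
        public static void addToCollection(SolrServer solr, String docId,
                                           String ocrText,
                                           List<String> currentCollections,
                                           String newCollectionId)
                throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);
            doc.addField("ocr", ocrText);             // the ~700KB payload
            for (String c : currentCollections) {     // keep existing values
                doc.addField("collection", c);
            }
            doc.addField("collection", newCollectionId);
            solr.add(doc);                            // full re-index of the doc
        }
    }

Query time is then cheap (searching a user's collection is just 
fq=collection:123), but all of the cost has moved to index time.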

I was actually thinking of some kind of custom Lucene/Solr component that 
would, for example, take a query parameter such as &lookitUp=123; the 
component might do a JDBC query against a database or key-value store and 
return the results in some form that would be efficient for Solr/Lucene to 
process. (Of course this assumes that a JDBC query would be more efficient 
than just sending a long list of ids to Solr.)  The other part of the equation 
is mapping the unique Solr ids to internal Lucene ids in order to implement a 
filter query.  I was wondering whether something like the unique-id-to-Lucene-id 
mapper in zoie might be useful, or whether that is too specific to zoie. This 
may be totally off-base, since I haven't looked at the zoie code at all yet.
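To make the idea a bit more concrete, here is a hedged sketch of just the 
lookup part (not working code, and not zoie's API): it pulls the ids for a 
collection over JDBC and builds a Lucene TermsFilter, which avoids the boolean 
clause limit; a custom QParserPlugin or SearchComponent handling a parameter 
like &lookitUp=123 could return the resulting query as a filter. Table, 
column, and field names are invented.

    // Hedged sketch: fetch the unique Solr ids for a collection over JDBC
    // and build a Lucene TermsFilter from them.  TermsFilter sidesteps the
    // boolean clause limit, and the ConstantScoreQuery wrapper gives a
    // Query that a custom parser/component could hand back to Solr as a
    // filter query.  The table, column, and field names are invented.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermsFilter;

    public class CollectionFilterBuilder {
        public static Query build(Connection db, String collectionId)
                throws SQLException {
            TermsFilter filter = new TermsFilter();
            PreparedStatement ps = db.prepareStatement(
                "SELECT solr_id FROM collection_items WHERE collection_id = ?");
            ps.setString(1, collectionId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                // "id" is assumed to be the Solr uniqueKey field
                filter.addTerm(new Term("id", rs.getString(1)));
            }
            return new ConstantScoreQuery(filter);
        }
    }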

In our particular use case, we might be able to build some kind of in-memory 
map after we optimize an index and before we mount it in production. In our 
workflow we update and optimize the index before we release it, and once it is 
released to production there is no indexing or merging taking place on the 
production index, so the internal Lucene ids don't change.
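A bare-bones version of that map, built once after optimize and before the 
index is mounted, might look something like the sketch below (it assumes the 
uniqueKey field is called "id" and that the index really is static afterwards):

    // Hedged sketch: build an in-memory map from the Solr uniqueKey field
    // ("id" is assumed here) to the internal Lucene docid.  This is only
    // safe on a static, optimized index, because internal docids change
    // when segments merge or documents are deleted.
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public class UniqueIdMapper {
        public static Map<String, Integer> build(IndexReader reader)
                throws IOException {
            Map<String, Integer> map = new HashMap<String, Integer>();
            TermEnum terms = reader.terms(new Term("id", ""));
            TermDocs docs = reader.termDocs();
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"id".equals(t.field())) {
                        break;                       // walked past the id field
                    }
                    docs.seek(t);
                    if (docs.next()) {               // uniqueKey: at most one doc
                        map.put(t.text(), docs.doc());
                    }
                } while (terms.next());
            } finally {
                terms.close();
                docs.close();
            }
            return map;
        }
    }

With a map like that in hand, a filter could be built directly from the cached 
docids instead of looking each id up per query.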

Tom



-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Friday, October 15, 2010 1:07 PM
To: solr-user@lucene.apache.org
Subject: RE: filter query from external list of Solr unique IDs

Definitely interested in this. 

The naive obvious approach would be just putting all the IDs in the query. 
Like fq=(id:1 OR id:2 OR ...).  Or making it another clause in the 'q'.  

Can you outline what's wrong with this approach, to make it more clear what's 
needed in a solution?