RE: filter query from external list of Solr unique IDs

2013-06-16 Thread samabhiK
Does anything already exist in Solr 4.3 to meet this use case?

Re: filter query from external list of Solr unique IDs

2010-10-16 Thread Chris Hostetter

: Hoss  mentioned a couple of ideas:
: 1) sub-classing query parser
: 2) Having the app query a database and somehow passing something 
: to Solr or lucene for the filter query

The approach i was referring to is something one of my coworkers did a 
while back (if he's still lurking on the list, maybe he'll speak up)

He implemented a custom SqlFilterQuery class that was constructed from a 
JDBC URL and a SQL statement.  the SqlFilterQuery class rewrote to itself (so 
it was a primitive query class) and returned a Scorer that would:

1) execute the SQL query (which should return a sorted list of uniqueKey 
field values) and retrieve a JDBC iterator (cursor?) over the results.
2) fetch a TermEnum from Lucene for the uniqueKey field
3) use the JDBC Iterator to skip ahead on the TermEnum and for each 
uniqueKey to get the underlying lucene docid, and record it in a DocSet
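
The three steps above amount to a merge between two sorted streams. What follows is a minimal sketch in Python, not the coworker's actual code: a sorted list of (term, docid) pairs stands in for the Lucene TermEnum, a plain iterator stands in for the JDBC cursor, and the names collect_docids and term_dict are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Set, Tuple


def collect_docids(sorted_keys: Iterable[str],
                   term_dict: List[Tuple[str, int]]) -> Set[int]:
    """Merge a sorted stream of uniqueKey values against a sorted term list.

    term_dict is a stand-in for a TermEnum over the uniqueKey field:
    (term, docid) pairs sorted by term, with one docid per term.
    """
    terms = [term for term, _ in term_dict]
    docset: Set[int] = set()
    pos = 0  # current position of the simulated term enumerator
    for key in sorted_keys:
        # "Skip ahead": search forward only, starting at the current position.
        pos = bisect_left(terms, key, lo=pos)
        if pos == len(terms):
            break  # ran off the end of the term dictionary
        if terms[pos] == key:
            docset.add(term_dict[pos][1])
    return docset


if __name__ == "__main__":
    index = [("doc100", 7), ("doc150", 2), ("doc201", 9), ("doc304", 4)]
    cursor = iter(["doc150", "doc199", "doc304"])  # simulated SQL result
    print(sorted(collect_docids(cursor, index)))
```

The sketch only works if the SQL ORDER BY and the term dictionary agree on sort order; here both use Python's string ordering, which is an assumption of the example.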

As i recall, my coworker was using this in a custom RequestHandler, where 
he was then forcibly putting that DocSet in the filterCache so that it 
would be there on future requests, and it would be regenerated by 
autoWarming (the advantage of implementing this logic using the Query 
interface) but it could also be done with a custom cache if you don't want 
these to contend for space in the filterCache.

My point about the query parser was that instead of needing to use a custom 
RequestHandler (or even a custom SearchComponent) to generate this DocSet 
for filtering, you could probably do it using a QParserPlugin -- that way 
you could use a regular fq param to generate the filter.  You could even 
generalize the hell out of it so the SQL itself could be specified at 
request time...



RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process.

If you do this, I'd definitely be interested in using it too. I have the same 
sort of use cases as you. 

It would be interesting to figure out if such a component could be used for 
join-like behavior too, as you included in your original use cases. I'm 
not entirely sure what that would look like, but I have some problems where 
trying to solve them with Solr often runs up against the lack of ability to 
do that. (There are lots of questions on the listserv asking "how do I do a 
join in Solr?" Often the questioners can and should solve their problem 
without a join, but sometimes there really isn't a good solution 
without it.) When I try to spec out some workarounds in my head, I often 
come up against the need to do what you're describing: efficiently run a query 
in Solr limited by a known list of Solr IDs, or in my case, sometimes a 
known (but lengthy) ORed list of facet values for an fq limit. It's really 
the same pattern; it doesn't matter if the field is the id field or 
something else.

I'm not really sure what the API for a component like this meant to support 
'join' kind of behavior would look like, but it would be interesting to think 
about. Maybe it needs to be able to generate the values against an alternate 
solr core, with a specified query against that core, in addition to being able 
to generate with a specified query from a JDBC or kv-store lookup?  Or in some 
cases, even the same solr core -- do a query against the same solr core, take 
one stored field from the results, and use it to filter the result set of a 
subsequent query. 

RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
> You could even
> generalize the hell out of it so the SQL itself could be specified at
> request time...


I think that's missing an argument for which field in the Solr index must 
match the values generated by the SQL? Or maybe it's meant to assume the 
identifier field, but it would be an interesting generalization to allow 
any field. 

 q=solr&fq={!sql field=id}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID 

And then, thinking further, in addition to an external sql, how about 
generalize this to an alternate 'sub' query on the solr index itself?  

q=solr&fq={!join_query field=id =id}foo:bar AND something_else

['subquery' is already taken as a defType, for purposes not entirely suitable 
here... I think? ]

Or even a different core! 

q=solr&fq={!join_query core=different_one field=id on_stored_field=id}foo:bar 
AND something_else

If that could be done as efficiently as reasonable, it would actually solve a 
whole BUNCH of problem cases that come up now and then. (And yes, people would 
have to be cautioned not to immediately reach for this type of solution just 
because they are used to thinking in terms of an RDBMS; but for some problems 
it really would allow things that are just not easily possible otherwise.)
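
To make the proposed semantics concrete, here is a rough sketch in Python of what a join_query filter would compute. It runs over an in-memory list of dicts rather than a Solr index; the function join_filter, its parameters, and the sample documents are all hypothetical and only illustrate the idea of "take one stored field from a sub-query's results and use it to limit the main query".

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


def join_filter(docs: List[Doc],
                sub_match: Callable[[Doc], bool],
                from_field: str,
                to_field: str,
                main_match: Callable[[Doc], bool]) -> List[Doc]:
    """Filter the main query by values taken from a sub-query's results."""
    # Step 1: run the sub-query and collect one stored field from each hit.
    values = {d[from_field] for d in docs if sub_match(d) and from_field in d}
    # Step 2: run the main query, keeping only docs whose to_field is in that set.
    return [d for d in docs if main_match(d) and d.get(to_field) in values]


if __name__ == "__main__":
    docs = [
        {"id": "1", "user": "u1", "doc_id": "10"},  # membership record
        {"id": "10", "title": "solr"},
        {"id": "11", "title": "solr"},
    ]
    hits = join_filter(docs,
                       sub_match=lambda d: d.get("user") == "u1",
                       from_field="doc_id",
                       to_field="id",
                       main_match=lambda d: d.get("title") == "solr")
    print([d["id"] for d in hits])
```

A different core would simply mean that step 1 reads from a second document list.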

RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
Or an alternate, simpler solution that requires the client to do more work 
but gets around the inefficiency and the clause limit of a whole lot of OR 
clauses: provide a qparser that accepts an "in" query. 

fq={!in field=id}100,150,201,304,etc.

This could, it seems from Hoss et al.'s suggestions, be processed much more 
efficiently than id:100 OR id:150 etc., and also without running up against 
the limit on the number of clauses. It could still result in a very large 
HTTP request to Solr though, which may or may not be a problem. 

(At first I wondered whether it's a problem that ALL of the suggested 
solutions will take up a spot in the filter cache, since you're using fq. But 
I think that's actually a benefit rather than a problem: in more cases than 
not it will be convenient for the result to be in the filter cache, since it 
may very well end up being used multiple times.) 
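
The caching argument can be illustrated with a toy cache. This is a simplified sketch in Python, assuming a small LRU cache keyed by the raw fq string; Solr's real filterCache is configured in solrconfig.xml and is not implemented this way. The class FilterCache and the compute callback are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, FrozenSet


class FilterCache:
    """Tiny LRU cache mapping a filter string to its computed doc set."""

    def __init__(self, max_size: int = 2) -> None:
        self._entries: "OrderedDict[str, FrozenSet[int]]" = OrderedDict()
        self.max_size = max_size
        self.misses = 0

    def get(self, fq: str,
            compute: Callable[[str], FrozenSet[int]]) -> FrozenSet[int]:
        if fq in self._entries:
            self._entries.move_to_end(fq)  # mark as most recently used
            return self._entries[fq]
        self.misses += 1
        docset = compute(fq)  # the expensive part, e.g. an ID-list lookup
        self._entries[fq] = docset
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return docset


if __name__ == "__main__":
    cache = FilterCache()
    expensive = lambda fq: frozenset({100, 150, 201})
    cache.get("{!in field=id}100,150,201", expensive)
    cache.get("{!in field=id}100,150,201", expensive)  # served from cache
    print(cache.misses)
```

The second request for the same filter never calls compute, which is the benefit described above; the cost is that large one-off filters can evict more useful entries.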

filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
At the Lucene Revolution conference I asked about efficiently building a filter 
query from an external list of Solr unique ids.

Some use cases I can think of are:
1)  personal sub-collections (in our case a user can create a small subset 
of our 6.5 million doc collection and then run filter queries against it)
2)  tagging documents
3)  access control lists
4)  anything that needs complex relational joins
5)  a sort of alternative to incremental field updating (i.e. update in an 
external database or kv store)
6)  Grant's clustering cluster points and similar apps.

Grant pointed to SOLR-1715, but when I looked on JIRA, there doesn't seem to be 
any work on it yet.

Hoss  mentioned a couple of ideas:
1) sub-classing query parser
2) Having the app query a database and somehow passing something to 
Solr or lucene for the filter query

Can Hoss or someone else point me to more detailed information on what might be 
involved in the two ideas listed above?

Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids 
needed to implement this or is that a separate issue?

Tom Burton-West

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Jonathan Rochkind
Definitely interested in this. 

The naive, obvious approach would be just putting all the IDs in the query. 
Like fq=(id:1 OR id:2 OR).  Or making it another clause in the 'q'.  

Can you outline what's wrong with this approach, to make it more clear what's 
needed in a solution?

Re: filter query from external list of Solr unique IDs

2010-10-15 Thread Yonik Seeley
On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom wrote:
> At the Lucene Revolution conference I asked about efficiently building a 
> filter query from an external list of Solr unique ids.

Yeah, I've thought about a special query parser and query to deal with
this (relatively) efficiently, both from a query perspective and a
memory perspective.

Should be pretty quick to throw together:
- comma separated list of terms (unique ids are a special case of this)
- in the query, store as a single byte array for efficiency
- sort the ids if they aren't already sorted
- do lookups with a term enumerator and skip weighting or anything
else like that
- configurable caching... may, or may not want to cache this big query

That's only part of the stuff you mention, but seems like it would be
useful to a number of people.


RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Demian Katz
The main problem I've encountered with the lots of OR clauses approach is 
that you eventually hit the limit on Boolean clauses and the whole query fails. 
 You can keep raising the limit through the Solr configuration, but there's 
still a ceiling eventually.
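
To show where the OR approach breaks, here is a small Python sketch that builds the filter string on the client and refuses to exceed the clause limit. The value 1024 is Solr's default maxBooleanClauses; the helper build_or_filter is hypothetical, and it does not escape special characters in the ids.

```python
from typing import Sequence

MAX_BOOLEAN_CLAUSES = 1024  # Solr default; assumes solrconfig.xml was not changed


def build_or_filter(field: str, ids: Sequence[str],
                    max_clauses: int = MAX_BOOLEAN_CLAUSES) -> str:
    """Build a naive (field:a OR field:b ...) filter, one clause per id."""
    if len(ids) > max_clauses:
        # Solr would reject the query at this point, so fail early instead.
        raise ValueError(
            f"{len(ids)} clauses exceeds the limit of {max_clauses}")
    return "(" + " OR ".join(f"{field}:{i}" for i in ids) + ")"


if __name__ == "__main__":
    print(build_or_filter("id", ["1", "2", "3"]))
    try:
        build_or_filter("id", [str(n) for n in range(2000)])
    except ValueError as err:
        print("rejected:", err)
```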

- Demian

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Hi Jonathan,

The advantages of the obvious approach you outline are that it is simple, it 
fits into the existing Solr model, and it doesn't require any customization or 
modification of the Solr/Lucene Java code.  Unfortunately, it does not scale 
well.  We originally tried just what you suggest for our implementation of 
Collection Builder.  For a user's personal collection we had a table that maps 
the collection id to the unique Solr ids.
Then when they wanted to search their collection, we just took their search and 
added a filter query of the form fq=(id:1 OR id:2 OR).   I seem to remember 
running into a limit on the number of OR clauses allowed. Even if you can set 
that limit larger, there are a number of efficiency issues.  

We ended up constructing a separate Solr index where we have a multi-valued 
collection number field. Unfortunately, until incremental field updating gets 
implemented, this means that every time someone adds a document to a 
collection, the entire document (including 700KB of OCR) needs to be re-indexed 
just to update the collection number field. This approach has allowed us to 
scale up to a total of something under 100,000 documents, but we don't think we 
can scale it much beyond that for various reasons.

I was actually thinking of some kind of custom Lucene/Solr component that would 
for example take a query parameter such as lookitUp=123 and the component 
might do a JDBC query against a database or kv store and return results in some 
form that would be efficient for Solr/Lucene to process. (Of course this 
assumes that a JDBC query would be more efficient than just sending a long list 
of ids to Solr).  The other part of the equation is mapping the unique Solr ids 
to internal Lucene ids in order to implement a filter query.   I was wondering 
if something like the unique id to Lucene id mapper in zoie might be useful or 
if that is too specific to zoie. This may be totally off-base, since I 
haven't looked at the zoie code at all yet.

In our particular use case, we might be able to build some kind of in-memory 
map after we optimize an index and before we mount it in production. In our 
workflow, we update the index and optimize it before we release it and once it 
is released to production there is no indexing/merging taking place on the 
production index (so the internal Lucene ids don't change.)  
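
A minimal sketch of that in-memory map idea, in Python. It assumes the index is static after release, so a list of unique ids in internal docid order can be read once; the helper names are hypothetical, and a Python int is used as a stand-in for a bit set.

```python
from typing import Dict, Iterable, List


def build_id_map(unique_ids_in_docid_order: List[str]) -> Dict[str, int]:
    """Map each unique Solr id to its internal docid (its list position)."""
    return {uid: docid for docid, uid in enumerate(unique_ids_in_docid_order)}


def ids_to_bitset(id_map: Dict[str, int], wanted: Iterable[str]) -> int:
    """Turn an external id list into a bit set with one bit per docid."""
    bits = 0
    for uid in wanted:
        docid = id_map.get(uid)
        if docid is not None:  # ids missing from the index are ignored
            bits |= 1 << docid
    return bits


def apply_filter(result_docids: Iterable[int], bits: int) -> List[int]:
    """Keep only the search hits whose bit is set in the filter."""
    return [d for d in result_docids if (bits >> d) & 1]


if __name__ == "__main__":
    id_map = build_id_map(["a", "b", "c", "d"])  # docids 0..3
    collection = ids_to_bitset(id_map, ["b", "d", "not-indexed"])
    print(apply_filter([0, 1, 2, 3], collection))
```

The map is only valid while the internal ids stay fixed, which is why it fits the optimize-then-release workflow described above and would need rebuilding after any merge.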


RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Thanks Yonik,

Is this something you might have time to throw together, or an outline of what 
needs to be thrown together?
Is this something that should be asked on the developer's list or discussed in 
SOLR-1715, or does it make the most sense to keep the discussion in this thread?


