Re: filter query from external list of Solr unique IDs

2010-10-16 Thread eks dev
If your index is read-only in production, can you add a mapping of
unique_id -> Lucene docId to your kv store and build filters externally?
That would make the unique key obsolete in your production index, as you would
work at the Lucene doc id level.

That way, you offload the problem to the update/optimize phase. The ugly part
is a lot of updates on your kv-store...

I am not really familiar with Solr, but working directly with Lucene this is
doable, even having a parallel index that has the unique ID as a stored field,
and another index with the indexed fields on the update master, and then having
only this index with indexed fields in production.
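
A minimal sketch of the idea, in plain Python rather than Lucene's Java API.
Every name here (build_id_map, build_filter, matches) is hypothetical, and list
position merely stands in for the Lucene docId. The mapping is built once after
optimize, while doc ids are stable, and a filter is just one bit per document.

```python
# Sketch only: plain Python standing in for a kv store and a Lucene filter.
# All names here are hypothetical; nothing below is a Lucene or Solr API.

def build_id_map(unique_ids_in_doc_order):
    """Build unique_id -> docId once, after optimize, when doc ids are stable.

    The position of each unique id in the list stands in for its Lucene docId.
    """
    return {uid: doc_id for doc_id, uid in enumerate(unique_ids_in_doc_order)}


def build_filter(id_map, wanted_unique_ids, max_doc):
    """Return a bit set (one bit per document) with the wanted docs set."""
    bits = bytearray((max_doc + 7) // 8)
    for uid in wanted_unique_ids:
        doc_id = id_map.get(uid)
        if doc_id is None:
            continue  # id not present in this index snapshot; skip it
        bits[doc_id >> 3] |= 1 << (doc_id & 7)
    return bits


def matches(bits, doc_id):
    """Test whether a doc id passes the filter."""
    return bool(bits[doc_id >> 3] & (1 << (doc_id & 7)))


id_map = build_id_map(["a17", "b02", "c33", "d40"])
user_filter = build_filter(id_map, ["c33", "a17", "zzz"], max_doc=4)
passing = [d for d in range(4) if matches(user_filter, d)]
```

The map is only valid for one index snapshot; it would have to be rebuilt after
every update/optimize cycle, which is the "lot of updates" cost mentioned above.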





On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom wrote:

> Hi Jonathan,
>
> The advantages of the obvious approach you outline are that it is simple,
> it fits into the existing Solr model, it doesn't require any customization
> or modification to Solr/Lucene java code.  Unfortunately, it does not scale
> well.  We originally tried just what you suggest for our implementation of
> Collection Builder.  For a user's personal collection we had a table that
> maps the collection id to the unique Solr ids.
> Then when they wanted to search their collection, we just took their search
> and added a filter query with the fq=(id:1 OR id:2 OR).   I seem to
> remember running into a limit on the number of OR clauses allowed. Even if
> you can set that limit larger, there are a number of efficiency issues.
>
> We ended up constructing a separate Solr index where we have a multi-valued
> collection number field. Unfortunately, until incremental field updating
> gets implemented, this means that every time someone adds a document to a
> collection, the entire document (including 700KB of OCR) needs to be
> re-indexed just to update the collection number field. This approach has
> allowed us to scale up to a total of something under 100,000 documents, but
> we don't think we can scale it much beyond that for various reasons.
>
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. (Of
> course this assumes that a JDBC query would be more efficient than just
> sending a long list of ids to Solr).  The other part of the equation is
> mapping the unique Solr ids to internal Lucene ids in order to implement a
> filter query.   I was wondering if something like the unique id to Lucene id
> mapper in zoie might be useful or if that is too specific to zoie. This
> may be totally off-base, since I haven't looked at the zoie code at all yet.
>
> In our particular use case, we might be able to build some kind of
> in-memory map after we optimize an index and before we mount it in
> production. In our workflow, we update the index and optimize it before we
> release it and once it is released to production there is no
> indexing/merging taking place on the production index (so the internal
> Lucene ids don't change.)
>
> Tom
>
>
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
>
> Definitely interested in this.
>
> The naive obvious approach would be just putting all the ID's in the query.
> Like fq=(id:1 OR id:2 OR).  Or making it another clause in the 'q'.
>
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> 
>


RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. 

If you do this, I'd definitely be interested in using it too. I have the same 
sort of use cases as you. 

It would be interesting to figure out if such a component could be used for 
"join-like behavior" too, as you included in your original use cases. I'm 
not entirely sure what that would look like, but I have some problems which 
trying to solve with Solr often runs up against the lack of ability to do that 
(and there are lots of questions on the listserv asking "how do I do a join in 
solr" -- often the questioners can and should really solve their problem 
without needing to do this, but sometimes there really isn't a good solution 
without it), and when I try to spec out some workarounds in my head, it often 
comes up against the need to do what you're describing, efficiently do a query 
in solr limited by a known list of Solr ids -- or in my case, sometimes a 
known (but lengthy, and "OR"d) list of facet values for an fq limit -- really 
the same pattern there, it doesn't matter if the field is the id field or 
something else)

I'm not really sure what the API for a component like this meant to support 
'join' kind of behavior would look like, but it would be interesting to think 
about. Maybe it needs to be able to generate the values against an alternate 
solr core, with a specified query against that core, in addition to being able 
to generate with a specified query from a JDBC or kv-store lookup?  Or in some 
cases, even the same solr core -- do a query against the same solr core, take 
one stored field from the results, and use it to filter the result set of a 
subsequent query. 
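
To make the two-step pattern concrete, here is a rough sketch in Python. Lists
of dicts stand in for Solr cores, and the function names (search, join_filter)
are made up for illustration; it only shows the shape of the idea, not how a
real component would do it efficiently.

```python
# Sketch only: lists of dicts stand in for Solr cores; names are hypothetical.

def search(core, predicate):
    """Stand-in for running a query against a core."""
    return [doc for doc in core if predicate(doc)]


def join_filter(from_core, from_predicate, from_field,
                to_core, to_field, to_predicate):
    # Step 1: run the sub-query and collect one stored field from its results.
    allowed = {doc[from_field] for doc in search(from_core, from_predicate)}
    # Step 2: run the main query, keeping only docs whose field is in that set.
    return search(to_core,
                  lambda doc: to_predicate(doc) and doc[to_field] in allowed)


collections = [
    {"id": "1", "owner": "tom"},
    {"id": "2", "owner": "tom"},
    {"id": "3", "owner": "jonathan"},
]
documents = [
    {"id": "1", "text": "solr filters"},
    {"id": "2", "text": "lucene internals"},
    {"id": "3", "text": "solr joins"},
]

hits = join_filter(
    collections, lambda d: d["owner"] == "tom", "id",
    documents, "id", lambda d: "solr" in d["text"],
)
```

Passing the same list as both from_core and to_core gives the "same solr core"
case.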


RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
> You could even
>generalize the hell out of it so the SQL itself could be specified at
>request time...

>  q=solr&fq={!sql}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC

I think that's missing the need for an argument for what field in the solr 
index to require to be within the values generated by the SQL? Or maybe it's 
meant to assume the identifier field, but it would be an interesting 
generalization to allow any field. 

 q=solr&fq={!sql field=id}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC
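
Roughly what I have in mind, sketched in Python with sqlite3 standing in for
the external database. The "sql" parser type is the hypothetical one proposed
in this thread, not something Solr ships, and parse_fq / ids_from_sql are
made-up names. The "field" argument defaults to the id field if it is omitted.

```python
import re
import sqlite3

# Sketch only: "{!sql ...}" is the hypothetical parser proposed in this thread.
# sqlite3 stands in for whatever external database holds the id lists.

LOCAL_PARAMS = re.compile(r"^\{!(\w+)((?:\s+\w+=[^\s}]+)*)\s*\}(.*)$", re.S)


def parse_fq(fq):
    """Split '{!type k=v ...}body' into (type, params dict, body)."""
    m = LOCAL_PARAMS.match(fq)
    if m is None:
        raise ValueError("no local params prefix: %r" % fq)
    params = dict(pair.split("=", 1) for pair in m.group(2).split())
    return m.group(1), params, m.group(3).strip()


def ids_from_sql(conn, fq):
    """Run the SQL in the fq body; return (solr field, set of allowed values)."""
    kind, params, sql = parse_fq(fq)
    if kind != "sql":
        raise ValueError("not an sql filter: %r" % fq)
    field = params.get("field", "id")  # assume the identifier field by default
    values = {str(row[0]) for row in conn.execute(sql)}
    return field, values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE USER_MAP (USER INTEGER, ID INTEGER)")
conn.executemany("INSERT INTO USER_MAP VALUES (?, ?)",
                 [(1234, 10), (1234, 30), (9999, 20)])

field, allowed = ids_from_sql(
    conn,
    "{!sql field=id}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC")

docs = [{"id": "10"}, {"id": "20"}, {"id": "30"}]
filtered = [d for d in docs if d[field] in allowed]
```

Accepting SQL at request time would obviously need strict safeguards; the
sketch ignores that entirely.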

And then, thinking further, in addition to an external sql, how about 
generalize this to an alternate 'sub' query on the solr index itself?  

q=solr&fq={!join_query field=id on_stored_field=id}foo:bar AND something_else

['subquery' is already taken as a defType, for purposes not entirely suitable 
here... I think? ]

Or even a different core! 

q=solr&fq={!join_query core=different_one field=id on_stored_field=id}foo:bar AND something_else

If that could be done as efficiently as reasonable, it would actually solve a 
whole BUNCH of problem cases that come up now and then. (And yes, people would 
have to be cautioned not to immediately use this type of solution because they 
are used to thinking in terms of rdbms; but for some problems it really would 
allow things just not easily possible otherwise.)


RE: filter query from external list of Solr unique IDs

2010-10-16 Thread Jonathan Rochkind
Or an alternate simpler solution requiring the client to do more work, but 
getting around the inefficiency and limit on clauses of a whole lot of OR 
clauses: provide a qparser that accepts an "in" query. 

&fq={!in field=id}100,150,201,304,etc.

From Hoss et al's suggestions, it seems this could be processed much more 
efficiently than id:100 OR id:150 etc., and also without running up against the 
limit on number of clauses. It could still result in a very large http 
request to solr though, which may or may not be a problem. 
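
A rough sketch of what such a parser would do, in Python. The "in" parser type
is the hypothetical one suggested above, and parse_in_filter / apply_filter are
made-up names. A real implementation would look each value up in the term index
rather than scan documents; the point here is only that the value list becomes
one set instead of one boolean clause per value.

```python
# Sketch only: "{!in ...}" is the hypothetical parser suggested above.

def parse_in_filter(fq):
    """Parse '{!in field=NAME}v1,v2,...' into (field, frozenset of values)."""
    prefix, sep, body = fq.partition("}")
    if not sep or not prefix.startswith("{!in "):
        raise ValueError("not an 'in' filter: %r" % fq)
    params = dict(p.split("=", 1) for p in prefix[len("{!in "):].split())
    values = frozenset(v.strip() for v in body.split(",") if v.strip())
    return params["field"], values


def apply_filter(docs, field, values):
    # One hash lookup per document, however many values were sent,
    # instead of evaluating one boolean clause per value.
    return [d for d in docs if d.get(field) in values]


field, values = parse_in_filter("{!in field=id}100,150,201,304")
docs = [{"id": "100"}, {"id": "101"}, {"id": "304"}]
kept = apply_filter(docs, field, values)
```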

(At first I was thinking about if it's a problem that ALL these solutions 
suggested will take up a spot in the filter cache since you're using fq-- but I 
think that's actually a benefit rather than a problem, in more cases than not 
it will actually be convenient for the result to be in the filter cache, it 
very well may end up being used multiple times). 
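
To illustrate why that is a benefit, here is a tiny LRU cache in Python standing
in for the filter cache. It is a sketch under my own assumptions (FilterCache
and expensive_lookup are made-up names), not how Solr's cache is implemented:
the expensive resolution of the id list happens once per distinct fq string.

```python
from collections import OrderedDict

# Sketch only: a tiny LRU cache standing in for Solr's filter cache.

class FilterCache:
    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, fq, compute):
        if fq in self.entries:
            self.entries.move_to_end(fq)  # mark as most recently used
            return self.entries[fq]
        self.misses += 1
        result = compute(fq)
        self.entries[fq] = result
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict the least recently used
        return result


def expensive_lookup(fq):
    # Stand-in for resolving a long id list into a set of matching documents.
    return frozenset(fq.split("}", 1)[1].split(","))


cache = FilterCache(max_entries=2)
first = cache.get("{!in field=id}100,150", expensive_lookup)
second = cache.get("{!in field=id}100,150", expensive_lookup)
```

The second call is served from the cache, so the lookup runs only once.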

SolrJ new javabin format

2010-10-16 Thread Shawn Heisey
 The CHANGES.txt file in branch_3x says that the javabin format has 
changed in Solr 3.1, so you need to update SolrJ as well as Solr.  Is 
the SolrJ included in 3.1 compatible with both 3.1 and 1.4.1?  If not, 
that's going to make a graceful upgrade of my replicated distributed 
installation a little harder.


Thanks,
Shawn



Re: SolrJ new javabin format

2010-10-16 Thread Lance Norskog
Please add a JIRA requesting support for both formats from SolrJ.

On Sat, Oct 16, 2010 at 12:02 PM, Shawn Heisey  wrote:
>  The CHANGES.txt file in branch_3x says that the javabin format has changed
> in Solr 3.1, so you need to update SolrJ as well as Solr.  Is the SolrJ
> included in 3.1 compatible with both 3.1 and 1.4.1?  If not, that's going to
> make a graceful upgrade of my replicated distributed installation a little
> harder.
>
> Thanks,
> Shawn
>
>



-- 
Lance Norskog
goks...@gmail.com