Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-15 Thread Julián Arocena
Hi Andy,

Maybe you can look at the Scotas products, www.scotas.com/products. They
handle near-real-time data synchronization between Oracle and Solr, and you
can also consume the data at SQL query time with new operators and
functions, or go directly to Solr.

Bye!

2013/3/12 Andy Lester a...@petdance.com


 On Mar 12, 2013, at 1:21 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:

  How are these sets of flrids created/defined?  (understanding the source
  of the filter information may help inspire alternative suggestions, ie:
  XY Problem)


 It sounds like you're looking for patterns that could potentially
 provide groupings for these FLRIDs.  We've been down that road, too, but
 we don't see how there could be one.  The arbitrariness comes from the fact
 that the lists are maintained by users and can be changed at any time.

 Each book in the database has an FLRID.  Each user can create lists of
 books.  These lists can be modified at any time.

 That looks like this in Oracle:  USER 1-M LIST 1-M LISTDETAIL M-1 TITLE

 The sizes we're talking about:  tens of thousands of users; hundreds of
 thousands of lists, with up to 100,000 items per list; tens of millions of
 listdetail.

 We have a feature that lets the user do a keyword search on books within
 his list.  We can't update the Solr record to keep track of which lists it
 appears on because there may be, say, 20 people every second updating the
 contents of their lists, and those 20 people expect that their next
 search-within-a-list will have those new results.

 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-12 Thread Chris Hostetter

: q=title:dogs AND 
: (flrid:(123 125 139 ... 34823) OR 
:  flrid:(34837 ... 59091) OR 
:  ... OR 
:  flrid:(101294813 ... 103049934))

: The problem with this approach (besides that it's clunky) is that it 
: seems to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back 
: in 50ms or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  
: With 100,000 FLRIDs, that jumps up to about 75000ms.  We want it to be on 
: the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How are these sets of flrids created/defined?  (understanding the source 
of the filter information may help inspire alternative suggestions, ie: 
XY Problem)

: * Have Solr do big ORs as a set operation not as (what we assume is) a 
: naive one-at-a-time matching.

It's not as naive as it may seem - scoring of disjunctions like this isn't 
a matter of asking each doc if it matches each query clause.  What happens 
is that for each segment of the index, each clause of a disjunction is 
asked to check for the first doc it matches in the segment -- which for 
TermQueries like this just means a quick lookup on the TermEnum, and the 
lowest (internal) doc num returned by any of the clauses represents the 
first match of that BooleanQuery.  All of the other clauses are asked 
for their first match, and then ultimately they are all asked to skip ahead to 
their next match, etc...

My point being: I don't think your speed observations are based on the 
number of documents; they're based on the number of query clauses -- which 
unfortunately happens to be the same number in your situation.

: * An efficient way to pass a long set of IDs, or for Solr to be able to 
: pull them from the app's Oracle database.

This can definitely be done, there just isn't a general-purpose turnkey 
solution for it.  The approach you'd need to take is to write a 
PostFilter containing your custom logic for deciding whether a document 
should be in the result set or not, and then generate instances of your 
PostFilter implementation in a QParserPlugin.

Here's a blog post with an example of doing this for an ACL type 
situation, where the parser input specifies a user and a CSV file is 
consulted to get the list of documents the user is allowed to see...

http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/

...you could follow a similar model where, given some input, you generate a 
query to your Oracle DB to return a Set<String> of IDs to consult in the 
collect() method.
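
To give a rough sense of the shape, here is a sketch of that combination -- 
not working code.  The class names, the fetchFlridsForList() Oracle lookup, 
and the per-segment doc->flrid resolution are placeholders you would fill in 
for your own setup, and method signatures vary a bit across Solr versions:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class FlridListQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) { }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // qstr names the user's list; pull its FLRIDs from Oracle once per request
        Set<String> flrids = fetchFlridsForList(qstr);
        return new FlridListPostFilter(flrids);
      }
    };
  }

  // Placeholder: run your Oracle query (JDBC, connection pool, whatever you use)
  // and return the FLRIDs belonging to the named list.
  private Set<String> fetchFlridsForList(String listId) {
    throw new UnsupportedOperationException("Oracle lookup goes here");
  }
}

class FlridListPostFilter extends ExtendedQueryBase implements PostFilter {

  private final Set<String> flrids;

  FlridListPostFilter(Set<String> flrids) { this.flrids = flrids; }

  @Override
  public boolean getCache() { return false; }   // every request has a different set

  @Override
  public int getCost() { return Math.max(super.getCost(), 100); }  // >= 100 runs as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // Only pass docs whose flrid is in the set down to the delegate.
        if (flrids.contains(flridForDoc(doc))) {
          super.collect(doc);
        }
      }
    };
  }

  // Placeholder: resolve a Lucene doc id to its flrid value, typically via the
  // FieldCache / docValues for the flrid field per segment, as the blog post
  // above does for its "acl" field.
  private String flridForDoc(int doc) {
    throw new UnsupportedOperationException("per-segment flrid lookup goes here");
  }
}

You would then register the plugin in solrconfig.xml with a queryParser 
element and send something like fq={!flridlist}12345 -- the parser name and 
what the input means are entirely up to you.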


-Hoss


Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-12 Thread Andy Lester

On Mar 12, 2013, at 1:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 How are these sets of flrids created/defined?  (understanding the source 
 of the filter information may help inspire alternative suggestions, ie: 
 XY Problem)


It sounds like you're looking for patterns that could potentially provide 
groupings for these FLRIDs.  We've been down that road, too, but we don't see 
how there could be one.  The arbitrariness comes from the fact that the lists 
are maintained by users and can be changed at any time.

Each book in the database has an FLRID.  Each user can create lists of books.  
These lists can be modified at any time.  

That looks like this in Oracle:  USER 1-M LIST 1-M LISTDETAIL M-1 TITLE

The sizes we're talking about:  tens of thousands of users; hundreds of 
thousands of lists, with up to 100,000 items per list; tens of millions of 
listdetail.

We have a feature that lets the user do a keyword search on books within his 
list.  We can't update the Solr record to keep track of which lists it appears 
on because there may be, say, 20 people every second updating the contents of 
their lists, and those 20 people expect that their next search-within-a-list 
will have those new results.

Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Andy Lester
We've got an 11,000,000-document index.  Most documents have a unique ID called 
flrid, plus a different ID called solrid that is Solr's PK.  For some 
searches, we need to be able to limit the searches to a subset of documents 
defined by a list of FLRID values.  The list of FLRID values can change between 
every search and it will be rare enough to call it "never" that any two 
searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND 
(flrid:(123 125 139 ... 34823) OR 
 flrid:(34837 ... 59091) OR 
 ... OR 
 flrid:(101294813 ... 103049934))

Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We have 
to subgroup to get past Solr's limitations on the number of terms that can be 
ORed together.

The problem with this approach (besides that it's clunky) is that it seems to 
perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms or so.  
If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000 FLRIDs, 
that jumps up to about 75000ms.  We want it to be on the order of 1000-2000ms at 
most in all cases up to 100,000 FLRIDs.

How can we do this better?

Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No 
improvement.
* Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and 
doing a join between it and the main core, but if we do five or ten searches 
per second, it seems like Solr would die from all the commits.  The set of 
FLRIDs is unique between searches so there is no reuse possible.  (A sketch of 
the join we considered is just below this list.)
* Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, 
so that Solr doesn't have to hit the documents in order to translate 
FLRID -> SolrID to do the matching.
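
For reference, the cross-core join we considered would look roughly like this, 
assuming a side core named "lists" holding (list_id, flrid) rows:

q=title:dogs
fq={!join fromIndex=lists from=flrid to=flrid}list_id:12345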

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull 
them from the app's Oracle database.
* Have Solr do big ORs as a set operation not as (what we assume is) a naive 
one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings 
of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of 
situation a few times, but no answers that I see beyond what we're doing now.

* 
http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* 
http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* 
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* 
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks,
Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
hi Andy,

It seems like a common type of operation and I would also be curious what
others think. My take on this is to create a compressed intbitset and send
it as a query filter, then have the handler decompress/deserialize it, and
use it as a filter query. We have already done experiments with intbitsets
and it is fast to send/receive.

Look at page 20:
http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
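
In rough terms, the sending side could look like the sketch below (plain JDK
classes; java.util.BitSet is just a stand-in for a real compressed intbitset,
and the Solr-side handler that decodes the parameter back into a filter would
be custom code):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;

public class BitsetParam {

  // Pack a set of integer ids into a BitSet, gzip it, and base64-encode it so
  // the whole set travels as a single request parameter. The receiving handler
  // reverses the steps and uses the bits as a filter.
  public static String encode(int[] ids) throws IOException {
    BitSet bits = new BitSet();
    for (int id : ids) {
      bits.set(id);                      // one bit per id
    }
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(buf);
    gz.write(bits.toByteArray());        // long runs of bits compress well
    gz.close();
    return Base64.getUrlEncoder().encodeToString(buf.toByteArray());
  }
}

For a very sparse id space (ids up to ~100M) you would want a real compressed
bitset or delta-encoded ids rather than a plain java.util.BitSet, but the
round trip is the same.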

It is not on my immediate list of tasks, but if you want to help, it can be
done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it "never"
 that any two searches will have the same set of FLRIDs to limit on.

 What we're doing right now is, roughly:

 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))

 Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.

 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.

 How can we do this better?

 Things we've tried or considered:

 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID -> SolrID to do the matching.

 What we're hoping for:

 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seem to be a suboptimal way to do it.

 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.

 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

 Thanks,
 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Walter Underwood
First, terms used to subset the index should be a filter query, not part of the 
main query. That may help, because the filter query terms are not used for 
relevance scoring.
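
That is, keep the keyword part in q and move the FLRID clauses into fq, roughly:

q=title:dogs
fq=flrid:(123 125 139 ... 34823) OR flrid:(34837 ... 59091) OR ...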

Have you done any system profiling? Where is the bottleneck: CPU or disk? There 
is no point in optimising things before you know the bottleneck.

Also, your latency goals may be impossible. Assume roughly one disk access per 
term in the query. You are not going to be able to do 100,000 random access 
disk IOs in 2 seconds, let alone process the results.

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

 hi Andy,
 
 It seems like a common type of operation and I would also be curious what
 others think. My take on this is to create a compressed intbitset and send
 it as a query filter, then have the handler decompress/deserialize it, and
 use it as a filter query. We have already done experiments with intbitsets
 and it is fast to send/receive
 
 look at page 20
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
 it is not on my immediate list of tasks, but if you want to help, it can be
 done sooner
 
 roman
 
 On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it "never"
 that any two searches will have the same set of FLRIDs to limit on.
 
 What we're doing right now is, roughly:
 
q=title:dogs AND
(flrid:(123 125 139  34823) OR
 flrid:(34837 ... 59091) OR
 ... OR
 flrid:(101294813 ... 103049934))
 
 Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.
 
 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
 How can we do this better?
 
 Things we've tried or considered:
 
 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
 The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID -> SolrID to do the matching.
 
 What we're hoping for:
 
 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seem to be a suboptimal way to do it.
 
 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.
 
 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
 Thanks,
 Andy
 
 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
 
 






Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
I think we are speaking of one use case where a user wants to limit the search to
a collection of documents, but there is no unifying (easy) way to select
those documents - besides a long query: id:1 OR id:5 OR id:90...

And no, the latency of several hundred milliseconds is perfectly achievable
with several hundred thousand IDs; you should explore the link...

roman



On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.org wrote:

 First, terms used to subset the index should be a filter query, not part
 of the main query. That may help, because the filter query terms are not
 used for relevance scoring.

 Have you done any system profiling? Where is the bottleneck: CPU or disk?
 There is no point in optimising things before you know the bottleneck.

 Also, your latency goals may be impossible. Assume roughly one disk access
 per term in the query. You are not going to be able to do 100,000 random
 access disk IOs in 2 seconds, let alone process the results.

 wunder

 On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

  hi Andy,
 
  It seems like a common type of operation and I would also be curious what
  others think. My take on this is to create a compressed intbitset and
 send
  it as a query filter, then have the handler decompress/deserialize it,
 and
  use it as a filter query. We have already done experiments with
 intbitsets
  and it is fast to send/receive
 
  look at page 20
 
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
  it is not on my immediate list of tasks, but if you want to help, it can
 be
  done sooner
 
  roman
 
  On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
  We've got an 11,000,000-document index.  Most documents have a unique ID
  called flrid, plus a different ID called solrid that is Solr's PK.
  For
  some searches, we need to be able to limit the searches to a subset of
  documents defined by a list of FLRID values.  The list of FLRID values
 can
  change between every search and it will be rare enough to call it
 never
  that any two searches will have the same set of FLRIDs to limit on.
 
  What we're doing right now is, roughly:
 
 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))
 
  Each of those parenthesized groups can be 1,000 FLRIDs strung together.
  We
  have to subgroup to get past Solr's limitations on the number of terms
 that
  can be ORed together.
 
  The problem with this approach (besides that it's clunky) is that it
 seems
  to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in
 50ms
  or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With
 100,000
  FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
  1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
  How can we do this better?
 
  Things we've tried or considered:
 
  * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.
  No
  improvement.
  * Tried: Putting the FLRIDs into the fq instead of the q.  No
 improvement.
  * Considered: dumping all the FLRIDs for a given search into another
 core
  and doing a join between it and the main core, but if we do five or ten
  searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse
 possible.
  * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
  instead, so that Solr doesn't have to hit the documents in order to
  translate FLRID -> SolrID to do the matching.
 
  What we're hoping for:
 
  * An efficient way to pass a long set of IDs, or for Solr to be able to
  pull them from the app's Oracle database.
  * Have Solr do big ORs as a set operation not as (what we assume is) a
  naive one-at-a-time matching.
  * A way to create a match vector that gets passed to the query, because
  strings of fqs in the query seem to be a suboptimal way to do it.
 
  I've searched SO and the web and found people asking about this type of
  situation a few times, but no answers that I see beyond what we're doing
  now.
 
  *
 
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
  *
 
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
  *
 
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
  *
 
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
  Thanks,
  Andy
 
  --
  Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance