Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-15 Thread Julián Arocena
Hi Andy,

Maybe you can look at the Scotas products, www.scotas.com/products. They
handle near-real-time data synchronization between Oracle and Solr, and you
can also consume the data at SQL query time with new operators and
functions, or go directly to Solr.

Bye!

2013/3/12 Andy Lester a...@petdance.com


 On Mar 12, 2013, at 1:21 PM, Chris Hostetter hossman_luc...@fucit.org
 wrote:

  How are these sets of flrids created/defined?  (understanding the source
  of the filter information may help inspire alternative suggestions, ie:
  XY Problem)


 It sounds like you're looking for patterns that could potentially
 provide groupings for these FLRIDs.  We've been down that road, too, but
 we don't see how there could be one.  The arbitrariness comes from the fact
 that the lists are maintained by users and can be changed at any time.

 Each book in the database has an FLRID.  Each user can create lists of
 books.  These lists can be modified at any time.

 That looks like this in Oracle:  USER 1-M LIST 1-M LISTDETAIL M-1 TITLE

 The sizes we're talking about:  tens of thousands of users; hundreds of
 thousands of lists, with up to 100,000 items per list; tens of millions of
 listdetail.

 We have a feature that lets the user do a keyword search on books within
 his list.  We can't update the Solr record to keep track of which lists it
 appears on because there may be, say, 20 people every second updating the
 contents of their lists, and those 20 people expect that their next
 search-within-a-list will have those new results.

 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-12 Thread Chris Hostetter

: q=title:dogs AND 
: (flrid:(123 125 139 ... 34823) OR 
:  flrid:(34837 ... 59091) OR 
:  ... OR 
:  flrid:(101294813 ... 103049934))

: The problem with this approach (besides that it's clunky) is that it 
: seems to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back 
: in 50ms or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  
: With 100,000 FLRIDs, that jumps up to about 75000ms.  We want it to be on 
: the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How are these sets of flrids created/defined?  (understanding the source 
of the filter information may help inspire alternative suggestions, ie: 
XY Problem)

: * Have Solr do big ORs as a set operation not as (what we assume is) a 
: naive one-at-a-time matching.

It's not as naive as it may seem - scoring of disjunctions like this isn't 
a matter of asking each doc if it matches each query clause.  What happens 
is that for each segment of the index, each clause of a disjunction is 
asked to check for the first doc it matches in the segment -- which for 
TermQueries like this just means a quick lookup on the TermEnum, and the 
lowest (internal) doc num returned by any of the clauses represents the 
first match of that BooleanQuery.  All of the other clauses are asked 
for their first match, and then ultimately they are all asked to skip ahead to 
their next match, etc...

My point being: I don't think your speed observations are based on the 
number of documents; they're based on the number of query clauses -- which 
unfortunately happens to be the same number in your situation.

: * An efficient way to pass a long set of IDs, or for Solr to be able to 
: pull them from the app's Oracle database.

This can definitely be done, there just isn't a general-purpose turnkey 
solution for it.  The approach you'd need to take is to write a 
PostFilter containing your custom logic for deciding whether a document 
should be in the result set or not, and then generate instances of your 
PostFilter implementation in a QParserPlugin.

Here's a blog post with an example of doing this for an ACL type 
situation, where the parser input specifies a user and a CSV file is 
consulted to get the list of documents the user is allowed to see...

http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/

...you could follow a similar model where, given some input, you generate a 
query to your Oracle DB to return a Set<String> of IDs to consult in the 
collect() method.
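
To give a rough sense of the shape, here is a sketch of that combination -- 
not working code.  The class names, the fetchFlridsForList() Oracle lookup, 
and the per-segment doc->flrid resolution are placeholders you would fill in 
for your own setup, and method signatures vary a bit across Solr versions:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class FlridListQParserPlugin extends QParserPlugin {

  @Override
  public void init(NamedList args) { }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        // qstr names the user's list; pull its FLRIDs from Oracle once per request
        Set<String> flrids = fetchFlridsForList(qstr);
        return new FlridListPostFilter(flrids);
      }
    };
  }

  // Placeholder: run your Oracle query (JDBC, connection pool, whatever you use)
  // and return the FLRIDs belonging to the named list.
  private Set<String> fetchFlridsForList(String listId) {
    throw new UnsupportedOperationException("Oracle lookup goes here");
  }
}

class FlridListPostFilter extends ExtendedQueryBase implements PostFilter {

  private final Set<String> flrids;

  FlridListPostFilter(Set<String> flrids) { this.flrids = flrids; }

  @Override
  public boolean getCache() { return false; }   // every request has a different set

  @Override
  public int getCost() { return Math.max(super.getCost(), 100); }  // >= 100 runs as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // Only pass docs whose flrid is in the set down to the delegate.
        if (flrids.contains(flridForDoc(doc))) {
          super.collect(doc);
        }
      }
    };
  }

  // Placeholder: resolve a Lucene doc id to its flrid value, typically via the
  // FieldCache / docValues for the flrid field per segment, as the blog post
  // above does for its "acl" field.
  private String flridForDoc(int doc) {
    throw new UnsupportedOperationException("per-segment flrid lookup goes here");
  }
}

You would then register the plugin in solrconfig.xml with a queryParser 
element and send something like fq={!flridlist}12345 -- the parser name and 
what the input means are entirely up to you.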


-Hoss


Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-12 Thread Andy Lester

On Mar 12, 2013, at 1:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

 How are these sets of flrids created/defined?  (understanding the source 
 of the filter information may help inspire alternative suggestions, ie: 
 XY Problem)


It sounds like you're looking for patterns that could potentially provide 
groupings for these FLRIDs.  We've been down that road, too, but we don't see 
how there could be one.  The arbitrariness comes from the fact that the lists 
are maintained by users and can be changed at any time.

Each book in the database has an FLRID.  Each user can create lists of books.  
These lists can be modified at any time.  

That looks like this in Oracle:  USER 1-M LIST 1-M LISTDETAIL M-1 TITLE

The sizes we're talking about:  tens of thousands of users; hundreds of 
thousands of lists, with up to 100,000 items per list; tens of millions of 
listdetail.

We have a feature that lets the user do a keyword search on books within his 
list.  We can't update the Solr record to keep track of which lists it appears 
on because there may be, say, 20 people every second updating the contents of 
their lists, and those 20 people expect that their next search-within-a-list 
will have those new results.

Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Andy Lester
We've got an 11,000,000-document index.  Most documents have a unique ID called 
flrid, plus a different ID called solrid that is Solr's PK.  For some 
searches, we need to be able to limit the searches to a subset of documents 
defined by a list of FLRID values.  The list of FLRID values can change between 
every search and it will be rare enough to call it "never" that any two 
searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND 
(flrid:(123 125 139 ... 34823) OR 
 flrid:(34837 ... 59091) OR 
 ... OR 
 flrid:(101294813 ... 103049934))

Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We have 
to subgroup to get past Solr's limitations on the number of terms that can be 
ORed together.

The problem with this approach (besides that it's clunky) is that it seems to 
perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms or so.  
If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000 FLRIDs, 
that jumps up to about 75000ms.  We want it to be on the order of 1000-2000ms at 
most in all cases up to 100,000 FLRIDs.

How can we do this better?

Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No 
improvement.
* Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
* Considered: dumping all the FLRIDs for a given search into another core and 
doing a join between it and the main core, but if we do five or ten searches 
per second, it seems like Solr would die from all the commits.  The set of 
FLRIDs is unique between searches so there is no reuse possible.  (A sketch of 
the join we considered is just below this list.)
* Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, 
so that Solr doesn't have to hit the documents in order to translate 
FLRID -> SolrID to do the matching.
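
For reference, the cross-core join we considered would look roughly like this, 
assuming a side core named "lists" holding (list_id, flrid) rows:

q=title:dogs
fq={!join fromIndex=lists from=flrid to=flrid}list_id:12345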

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull 
them from the app's Oracle database.
* Have Solr do big ORs as a set operation not as (what we assume is) a naive 
one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings 
of fqs in the query seem to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of 
situation a few times, but no answers that I see beyond what we're doing now.

* 
http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* 
http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* 
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* 
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks,
Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance



Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
hi Andy,

It seems like a common type of operation and I would also be curious what
others think. My take on this is to create a compressed intbitset and send
it as a query filter, then have the handler decompress/deserialize it, and
use it as a filter query. We have already done experiments with intbitsets
and it is fast to send/receive.

Look at page 20:
http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
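
In rough terms, the sending side could look like the sketch below (plain JDK
classes; java.util.BitSet is just a stand-in for a real compressed intbitset,
and the Solr-side handler that decodes the parameter back into a filter would
be custom code):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPOutputStream;

public class BitsetParam {

  // Pack a set of integer ids into a BitSet, gzip it, and base64-encode it so
  // the whole set travels as a single request parameter. The receiving handler
  // reverses the steps and uses the bits as a filter.
  public static String encode(int[] ids) throws IOException {
    BitSet bits = new BitSet();
    for (int id : ids) {
      bits.set(id);                      // one bit per id
    }
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(buf);
    gz.write(bits.toByteArray());        // long runs of bits compress well
    gz.close();
    return Base64.getUrlEncoder().encodeToString(buf.toByteArray());
  }
}

For a very sparse id space (ids up to ~100M) you would want a real compressed
bitset or delta-encoded ids rather than a plain java.util.BitSet, but the
round trip is the same.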

It is not on my immediate list of tasks, but if you want to help, it can be
done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:

 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it "never"
 that any two searches will have the same set of FLRIDs to limit on.

 What we're doing right now is, roughly:

 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))

 Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.

 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.

 How can we do this better?

 Things we've tried or considered:

 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID -> SolrID to do the matching.

 What we're hoping for:

 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seem to be a suboptimal way to do it.

 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.

 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

 Thanks,
 Andy

 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance




Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Walter Underwood
First, terms used to subset the index should be a filter query, not part of the 
main query. That may help, because the filter query terms are not used for 
relevance scoring.
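
That is, keep the keyword part in q and move the FLRID clauses into fq, roughly:

q=title:dogs
fq=flrid:(123 125 139 ... 34823) OR flrid:(34837 ... 59091) OR ...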

Have you done any system profiling? Where is the bottleneck: CPU or disk? There 
is no point in optimising things before you know the bottleneck.

Also, your latency goals may be impossible. Assume roughly one disk access per 
term in the query. You are not going to be able to do 100,000 random access 
disk IOs in 2 seconds, let alone process the results.

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

 hi Andy,
 
 It seems like a common type of operation and I would also be curious what
 others think. My take on this is to create a compressed intbitset and send
 it as a query filter, then have the handler decompress/deserialize it, and
 use it as a filter query. We have already done experiments with intbitsets
 and it is fast to send/receive
 
 look at page 20
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
 it is not on my immediate list of tasks, but if you want to help, it can be
 done sooner
 
 roman
 
 On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
 We've got an 11,000,000-document index.  Most documents have a unique ID
 called flrid, plus a different ID called solrid that is Solr's PK.  For
 some searches, we need to be able to limit the searches to a subset of
 documents defined by a list of FLRID values.  The list of FLRID values can
 change between every search and it will be rare enough to call it "never"
 that any two searches will have the same set of FLRIDs to limit on.
 
 What we're doing right now is, roughly:
 
q=title:dogs AND
(flrid:(123 125 139  34823) OR
 flrid:(34837 ... 59091) OR
 ... OR
 flrid:(101294813 ... 103049934))
 
 Each of those parenthesized groups can be 1,000 FLRIDs strung together.  We
 have to subgroup to get past Solr's limitations on the number of terms that
 can be ORed together.
 
 The problem with this approach (besides that it's clunky) is that it seems
 to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in 50ms
 or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With 100,000
 FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
 1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
 How can we do this better?
 
 Things we've tried or considered:
 
 * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.  No
 improvement.
 * Tried: Putting the FLRIDs into the fq instead of the q.  No improvement.
 * Considered: dumping all the FLRIDs for a given search into another core
 and doing a join between it and the main core, but if we do five or ten
 searches per second, it seems like Solr would die from all the commits.
 The set of FLRIDs is unique between searches so there is no reuse possible.
 * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
 instead, so that Solr doesn't have to hit the documents in order to
 translate FLRID -> SolrID to do the matching.
 
 What we're hoping for:
 
 * An efficient way to pass a long set of IDs, or for Solr to be able to
 pull them from the app's Oracle database.
 * Have Solr do big ORs as a set operation not as (what we assume is) a
 naive one-at-a-time matching.
 * A way to create a match vector that gets passed to the query, because
 strings of fqs in the query seem to be a suboptimal way to do it.
 
 I've searched SO and the web and found people asking about this type of
 situation a few times, but no answers that I see beyond what we're doing
 now.
 
 *
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
 *
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
 *
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
 *
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
 Thanks,
 Andy
 
 --
 Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
 
 






Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?

2013-03-08 Thread Roman Chyla
I think we are speaking of one use case where a user wants to limit the search to
a collection of documents, but there is no unifying (easy) way to select
those documents - besides a long query: id:1 OR id:5 OR id:90...

And no, the latency of several hundred milliseconds is perfectly achievable
with several hundred thousand IDs; you should explore the link...

roman



On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.org wrote:

 First, terms used to subset the index should be a filter query, not part
 of the main query. That may help, because the filter query terms are not
 used for relevance scoring.

 Have you done any system profiling? Where is the bottleneck: CPU or disk?
 There is no point in optimising things before you know the bottleneck.

 Also, your latency goals may be impossible. Assume roughly one disk access
 per term in the query. You are not going to be able to do 100,000 random
 access disk IOs in 2 seconds, let alone process the results.

 wunder

 On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:

  hi Andy,
 
  It seems like a common type of operation and I would also be curious what
  others think. My take on this is to create a compressed intbitset and
 send
  it as a query filter, then have the handler decompress/deserialize it,
 and
  use it as a filter query. We have already done experiments with
 intbitsets
  and it is fast to send/receive
 
  look at page 20
 
 http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component
 
  it is not on my immediate list of tasks, but if you want to help, it can
 be
  done sooner
 
  roman
 
  On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
 
  We've got an 11,000,000-document index.  Most documents have a unique ID
  called flrid, plus a different ID called solrid that is Solr's PK.
  For
  some searches, we need to be able to limit the searches to a subset of
  documents defined by a list of FLRID values.  The list of FLRID values
 can
  change between every search and it will be rare enough to call it
 never
  that any two searches will have the same set of FLRIDs to limit on.
 
  What we're doing right now is, roughly:
 
 q=title:dogs AND
 (flrid:(123 125 139  34823) OR
  flrid:(34837 ... 59091) OR
  ... OR
  flrid:(101294813 ... 103049934))
 
  Each of those parenthesized groups can be 1,000 FLRIDs strung together.
  We
  have to subgroup to get past Solr's limitations on the number of terms
 that
  can be ORed together.
 
  The problem with this approach (besides that it's clunky) is that it
 seems
  to perform O(N^2) or so.  With 1,000 FLRIDs, the search comes back in
 50ms
  or so.  If we have 10,000 FLRIDs, it comes back in 400-500ms.  With
 100,000
  FLRIDs, that jumps up to about 75000ms.  We want it to be on the order of
  1000-2000ms at most in all cases up to 100,000 FLRIDs.
 
  How can we do this better?
 
  Things we've tried or considered:
 
  * Tried: Using dismax with minimum-match mm:0 to simulate an OR query.
  No
  improvement.
  * Tried: Putting the FLRIDs into the fq instead of the q.  No
 improvement.
  * Considered: dumping all the FLRIDs for a given search into another
 core
  and doing a join between it and the main core, but if we do five or ten
  searches per second, it seems like Solr would die from all the commits.
  The set of FLRIDs is unique between searches so there is no reuse
 possible.
  * Considered: Translating FLRIDs to SolrID and then limiting on SolrID
  instead, so that Solr doesn't have to hit the documents in order to
  translate FLRID -> SolrID to do the matching.
 
  What we're hoping for:
 
  * An efficient way to pass a long set of IDs, or for Solr to be able to
  pull them from the app's Oracle database.
  * Have Solr do big ORs as a set operation not as (what we assume is) a
  naive one-at-a-time matching.
  * A way to create a match vector that gets passed to the query, because
  strings of fqs in the query seem to be a suboptimal way to do it.
 
  I've searched SO and the web and found people asking about this type of
  situation a few times, but no answers that I see beyond what we're doing
  now.
 
  *
 
 http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
  *
 
 http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
  *
 
 http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
  *
 
 http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
 
  Thanks,
  Andy
 
  --
  Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance