Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
Hi Andy, maybe you can look at Scotas products, www.scotas.com/products. They combine near-real-time data synchronization between Oracle and Solr, and you can also consume the data at SQL query time with new operators and functions, or go directly to Solr. Bye!

2013/3/12 Andy Lester a...@petdance.com:
> [...]
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
: q=title:dogs AND
:   (flrid:(123 125 139 34823) OR
:    flrid:(34837 ... 59091) OR
:    ... OR
:    flrid:(101294813 ... 103049934))
:
: The problem with this approach (besides that it's clunky) is that it
: seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back
: in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms.
: With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on
: the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How are these sets of flrids created/defined? (Understanding the source of the filter information may help inspire alternative suggestions, i.e. the XY Problem.)

: * Have Solr do big ORs as a set operation not as (what we assume is) a
:   naive one-at-a-time matching.

It's not as naive as it may seem -- scoring a disjunction like this isn't a matter of asking each doc whether it matches each query clause. What happens is that, for each segment of the index, each clause of the disjunction is asked for the first doc it matches in that segment -- which for TermQueries like these just means a quick lookup on the TermEnum -- and the lowest (internal) doc num returned by any of the clauses is the first match of the whole BooleanQuery. All of the clauses are then asked to skip ahead to their next match, and so on.

My point being: I don't think your speed observations are driven by the number of documents; they're driven by the number of query clauses -- which unfortunately happens to be the same number in your situation.

: * An efficient way to pass a long set of IDs, or for Solr to be able to
:   pull them from the app's Oracle database.

This can definitely be done; there just isn't a general-purpose, turnkey solution for it. The approach you'd need to take is to implement a PostFilter containing your custom logic for deciding whether a document should be in the result set, and then generate instances of your PostFilter implementation in a QParserPlugin. Here's a blog post with an example of doing this for an ACL-type situation, where the parser input specifies a user and a CSV file is consulted to get the list of documents that user is allowed to see...

http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/

...You could follow a similar model where, given some input, you generate a query to your Oracle DB to return a Set<String> of IDs to consult in the collect method.

-Hoss
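Below is a rough, hypothetical sketch of the PostFilter approach Hoss describes, modeled on the pattern in the linked blog post. The class name, the docValues-backed flrid lookup, and the exact API calls (recent Lucene/Solr) are assumptions -- adjust for the Solr version you actually run:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.QueryVisitor;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class FlridSetFilter extends ExtendedQueryBase implements PostFilter {

  private final Set<String> allowed; // e.g. fetched from Oracle by the QParserPlugin

  public FlridSetFilter(Set<String> allowed) { this.allowed = allowed; }

  @Override public boolean getCache() { return false; } // the set is unique per search
  @Override public int getCost() { return 100; }        // cost >= 100 runs it as a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private SortedDocValues flrids;

      @Override
      public void doSetNextReader(LeafReaderContext context) throws IOException {
        super.doSetNextReader(context);
        // Per-segment docValues for the flrid field (must be indexed with docValues)
        flrids = DocValues.getSorted(context.reader(), "flrid");
      }

      @Override
      public void collect(int doc) throws IOException {
        // Pass the doc down the collector chain only if its flrid is in the set
        if (flrids.advanceExact(doc)
            && allowed.contains(flrids.lookupOrd(flrids.ordValue()).utf8ToString())) {
          super.collect(doc);
        }
      }
    };
  }

  @Override public void visit(QueryVisitor visitor) { visitor.visitLeaf(this); }

  @Override public boolean equals(Object o) {
    return o instanceof FlridSetFilter && allowed.equals(((FlridSetFilter) o).allowed);
  }

  @Override public int hashCode() { return allowed.hashCode(); }
}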
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
On Mar 12, 2013, at 1:21 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

> How are these sets of flrids created/defined? (Understanding the source
> of the filter information may help inspire alternative suggestions,
> i.e. the XY Problem.)

It sounds like you're looking for patterns that could potentially provide groupings for these FLRIDs. We've been down that road, too, but we don't see how there could be one. The arbitrariness comes from the fact that the lists are maintained by users and can be changed at any time.

Each book in the database has an FLRID. Each user can create lists of books. These lists can be modified at any time. That looks like this in Oracle:

USER 1-M LIST 1-M LISTDETAIL M-1 TITLE

The sizes we're talking about: tens of thousands of users; hundreds of thousands of lists, with up to 100,000 items per list; tens of millions of LISTDETAIL rows.

We have a feature that lets the user do a keyword search on books within his list. We can't update the Solr record to keep track of which lists it appears on, because there may be, say, 20 people every second updating the contents of their lists, and those 20 people expect that their next search-within-a-list will have those new results.

Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
How can I limit my Solr search to an arbitrary set of 100,000 documents?
We've got an 11,000,000-document index. Most documents have a unique ID called flrid, plus a different ID called solrid that is Solr's PK.

For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search, and it will be rare enough to call it never that any two searches will have the same set of FLRIDs to limit on.

What we're doing right now is, roughly:

q=title:dogs AND
  (flrid:(123 125 139 34823) OR
   flrid:(34837 ... 59091) OR
   ... OR
   flrid:(101294813 ... 103049934))

Each of those parenthesized groups can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together.

The problem with this approach (besides that it's clunky) is that it seems to perform O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.

How can we do this better? Things we've tried or considered:

* Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
* Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
* Considered: Dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches, so there is no reuse possible.
* Considered: Translating FLRIDs to SolrIDs and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to do the FLRID -> SolrID translation for the matching.

What we're hoping for:

* An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
* Have Solr do big ORs as a set operation, not as (what we assume is) naive one-at-a-time matching.
* A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it.

I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.

* http://stackoverflow.com/questions/11938342/solr-search-within-subset-defined-by-list-of-keys
* http://stackoverflow.com/questions/9183898/searching-within-a-subset-of-data-solr
* http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
* http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

Thanks,
Andy

--
Andy Lester = a...@petdance.com = www.petdance.com = AIM:petdance
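For concreteness, here is a hypothetical SolrJ sketch of how a chunked query like the one above might be built; the core URL and the Builder-style client are assumptions (and postdate this 2013 thread):

import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FlridChunkedQuery {

  // Break the FLRID list into groups of 1,000 so each flrid:(...) group
  // stays under Solr's maxBooleanClauses limit (1024 by default).
  static String flridClauses(List<String> flrids) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < flrids.size(); i += 1000) {
      if (i > 0) sb.append(" OR ");
      sb.append("flrid:(")
        .append(String.join(" ", flrids.subList(i, Math.min(i + 1000, flrids.size()))))
        .append(')');
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    List<String> flrids = List.of("123", "125", "139", "34823"); // up to 100,000 in practice
    try (SolrClient solr =
             new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {
      SolrQuery q = new SolrQuery("title:dogs AND (" + flridClauses(flrids) + ")");
      System.out.println(solr.query(q).getResults().getNumFound());
    }
  }
}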
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
hi Andy,

It seems like a common type of operation, and I would also be curious what others think. My take on this is to create a compressed int bitset and send it as a query filter, then have the handler decompress/deserialize it and use it as a filter query. We have already done experiments with int bitsets, and they are fast to send/receive -- look at page 20:

http://www.slideshare.net/lbjay/letting-in-the-light-using-solr-as-an-external-search-component

It is not on my immediate list of tasks, but if you want to help, it can be done sooner.

roman

On Fri, Mar 8, 2013 at 12:10 PM, Andy Lester a...@petdance.com wrote:
> [...]
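A minimal sketch of what the transport side of that idea could look like, assuming integer FLRIDs; the gzip-plus-Base64 encoding is an assumption, not necessarily what the experiments in the slides used. The decode half would live in a custom handler/QParser that maps the bits back onto a filter:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.BitSet;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BitsetFilterParam {

  // Client side: one bit per FLRID, gzipped, then Base64-encoded so the
  // blob can travel as an ordinary HTTP request parameter.
  public static String encode(int[] flrids) throws Exception {
    BitSet bits = new BitSet();
    for (int id : flrids) bits.set(id);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(bits.toByteArray()); // gzip compresses the sparse runs well
    }
    return Base64.getUrlEncoder().encodeToString(out.toByteArray());
  }

  // Handler side: decode back into a bitset, to be turned into a Solr
  // filter (e.g. inside a custom QParserPlugin).
  public static BitSet decode(String param) throws Exception {
    byte[] gzipped = Base64.getUrlDecoder().decode(param);
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
      return BitSet.valueOf(in.readAllBytes());
    }
  }
}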
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
First, terms used to subset the index should go in a filter query, not in the main query. That may help, because filter query terms are not used for relevance scoring.

Have you done any system profiling? Where is the bottleneck: CPU or disk? There is no point in optimising things before you know the bottleneck.

Also, your latency goals may be impossible. Assume roughly one disk access per term in the query. You are not going to be able to do 100,000 random-access disk IOs in 2 seconds, let alone process the results.

wunder

On Mar 8, 2013, at 9:32 AM, Roman Chyla wrote:
> [...]
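Illustratively, Walter's first point would move the FLRID clauses out of q and into fq, e.g. (same placeholder IDs as the original post):

q=title:dogs
fq=flrid:(123 125 139 34823) OR flrid:(34837 ... 59091) OR ... OR flrid:(101294813 ... 103049934)

fq clauses skip relevance scoring and their results are cached, though caching won't help here since the sets almost never repeat -- and Andy reports having already tried fq with no improvement.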
Re: How can I limit my Solr search to an arbitrary set of 100,000 documents?
I think we are speaking of a use case where the user wants to limit the search to a collection of documents, but there is no unifying (easy) way to select those documents -- besides a loong query: id:1 OR id:5 OR id:90...

And no, a latency of several hundred milliseconds is perfectly achievable with several hundred thousand ids; you should explore the link...

roman

On Fri, Mar 8, 2013 at 12:56 PM, Walter Underwood wun...@wunderwood.org wrote:
> [...]