Re: Handling intersection facets of many values
On Wed, 2014-11-19 at 23:53 +0100, Peter Sturge wrote:
> Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't realistically scale to large value sets.

"Large" is extremely relative in Solr Land, but I would be wary of going beyond 10K.

> 127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join from=addr to=dest fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0

Ah! fromIndex. I missed that. Thanks for following up with the full solution.

- Toke Eskildsen, State and University Library, Denmark
Re: Handling intersection facets of many values
If you're willing to write some Java, you can do something more efficient by intersecting two terms enumerations. This works with constant memory for any number of values in two fields: as when intersecting any two sorted lists, you leapfrog between them. I have an example if you're interested (I was finding compounds by indexing shingles and intersecting with regular word terms), but there isn't any support for using it in a query, or as part of Solr; it's just an offline kind of thing you can run against your index.

-Mike

On 11/19/2014 5:53 PM, Peter Sturge wrote:
> Hi Toke,
>
> Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't realistically scale to large value sets.
>
> I've been wrestling with joins this evening and have managed to get them working - and it works very nicely - and across cores (although not shards yet, afaik)!
>
> For anyone looking to do this sort of facet intersecting, here's my query:
>
> 127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join from=addr to=dest fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0
>
> Thanks,
> Peter
>
> On Wed, Nov 19, 2014 at 9:23 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
>> Peter Sturge [peter.stu...@gmail.com] wrote:
>>> I guess you mean take the 1k or so values and build a boolean query from them?
>>
>> Not really. Let me try again:
>>
>> 1) Perform a facet call with facet.limit=-1 on dest to get the relevant dest values. The result will always be 1000 values or fewer. Take those values and construct a filter query a OR b OR c.
>> 2) Perform a facet call on addr with the original query + the newly constructed filter query. The facet response should now contain the intersection.
>>
>> 1000 is a bit close to the default limit for boolean queries, so you might want to raise that.
>>
>>> I'm also looking at creating a custom QueryParser that would build the relevant DocLists, then intersect them and return the values, [...]
>>
>> You are describing a Join in Solr and that would likely solve your problem, but it does not work across cores. Is it possible to have both the addr and dest data in the same core?
>>
>> - Toke Eskildsen
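The intersection of two sorted term lists that Mike describes can be sketched roughly as follows. This is not his Java code, which is not shown in the thread; it is a minimal Python illustration that assumes both inputs are already sorted and de-duplicated, as the terms of a field would be. The function name `intersect_sorted` is made up for the example. It steps one term at a time, whereas a real terms enumeration would more likely seek ahead to the other side's current term.

```python
from typing import Iterable, Iterator


def intersect_sorted(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Yield the values present in both inputs.

    Assumes each input is sorted ascending and contains no duplicates.
    Only the current value of each side is held, so memory use is constant.
    """
    it_left, it_right = iter(left), iter(right)
    a = next(it_left, None)
    b = next(it_right, None)
    while a is not None and b is not None:
        if a == b:
            yield a                      # common term found
            a = next(it_left, None)
            b = next(it_right, None)
        elif a < b:
            a = next(it_left, None)      # left is behind, advance it
        else:
            b = next(it_right, None)     # right is behind, advance it


addr_terms = ["a", "b", "c", "d", "e", "f"]   # hypothetical sorted 'addr' terms
dest_terms = ["a", "c", "e", "g", "z"]        # hypothetical sorted 'dest' terms
common = list(intersect_sorted(addr_terms, dest_terms))
```

Because the function is a generator over iterators, neither list needs to be loaded fully into memory.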
Handling intersection facets of many values
Hi Solr Group,

Got an interesting use case (to me, at least); perhaps someone could give some insight on how best to achieve this?

I've got a core that has about 7 million entries, with a field called 'addr'. By definition, every entry has a unique 'addr' value, so there are 7 million unique values for this field.

I then have another core with ~20 million entries. These have a field called 'dest', and there may be, say, around 800-1000 unique values for 'dest'. There's always a value, but the number of unique values varies.

So, the problem is this: what is the best/only/most efficient way to construct a search whereby I get back an (ideally faceted) list of values for 'dest' that occur in 'addr'?

Can I do this with just faceting (e.g. facet query or similar)? Or do I need grouping?

Note, I don't actually need the documents themselves, only the list of unique values that are the intersection of 'dest' and 'addr'.

Can anyone help shed some light on how best to do this?

Many thanks,
Peter
RE: Handling intersection facets of many values
Peter Sturge [peter.stu...@gmail.com] wrote:

[addr 7M unique, dest 1K unique]

> What is the best/only/most efficient way to construct a search whereby I get back an (ideally faceted) list of values for 'dest' that occur in 'addr'?

I assume the actual values are defined by a query? As the number of possible values in dest is not that large, extracting those first and then using them as a filter when searching for addr seems like a fairly efficient way of solving the problem.

- Toke Eskildsen
Re: Handling intersection facets of many values
Hi Toke,

Thanks for your input. I guess you mean take the 1k or so values and build a boolean query from them? If that's not what you mean, my apologies.

I'd thought of doing that. The trouble I had was that the number of unique values could be 20k, or 15,167, or any arbitrary and potentially high-ish number - it's not really known and can/will change over time. I believe a boolean query with more than 1024 ops can blow up the query, so scalability is a concern.

The other issue is how this would yield the unique facet values - e.g. dest=8.8.8.8 (17) [i.e. 8.8.8.8 is in the 'addr' list and occurs 17 times in entries with a 'dest' field]. In fact, I need the unique value(s) ('8.8.8.8') more than I need the count ('17').

I could get the facet list of 'dest' values, then trawl through each one, but this would be a complicated and time-consuming client-side operation.

I'm also looking at creating a custom QueryParser that would build the relevant DocLists, then intersect them and return the values, but I wouldn't want to reinvent the wheel if possible. Given that facets already build unique term lists, it seems so close - I guess it's like taking two facet lists (1 for addr, 1 for dest), intersecting them and returning the result:

List 1: a b c d e f
List 2: a a g z c c c e
Resultant intersection: a (2) c (3) e (1)

Thanks,
Peter

On Wed, Nov 19, 2014 at 7:16 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
> Peter Sturge [peter.stu...@gmail.com] wrote:
>
> [addr 7M unique, dest 1K unique]
>
>> What is the best/only/most efficient way to construct a search whereby I get back an (ideally faceted) list of values for 'dest' that occur in 'addr'?
>
> I assume the actual values are defined by a query? As the number of possible values in dest is not that large, extracting those first and then using them as a filter when searching for addr seems like a fairly efficient way of solving the problem.
>
> - Toke Eskildsen
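The List 1 / List 2 example in Peter's message can be worked through in a few lines of Python. This only illustrates the result he is after (the shared values plus how often each occurs in 'dest'); it says nothing about how Solr would compute it, and the names `addr_values`, `dest_values` and `intersect_with_counts` are invented for the sketch.

```python
from collections import Counter

addr_values = ["a", "b", "c", "d", "e", "f"]             # List 1: unique 'addr' values
dest_values = ["a", "a", "g", "z", "c", "c", "c", "e"]   # List 2: 'dest' values, with repeats


def intersect_with_counts(addr, dest):
    """Count how often each 'dest' value occurs, keeping only values also in 'addr'."""
    addr_set = set(addr)                                  # O(1) membership checks
    counts = Counter(v for v in dest if v in addr_set)
    return dict(sorted(counts.items()))


result = intersect_with_counts(addr_values, dest_values)
print(result)  # {'a': 2, 'c': 3, 'e': 1}
```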
RE: Handling intersection facets of many values
Peter Sturge [peter.stu...@gmail.com] wrote:
> I guess you mean take the 1k or so values and build a boolean query from them?

Not really. Let me try again:

1) Perform a facet call with facet.limit=-1 on dest to get the relevant dest values. The result will always be 1000 values or fewer. Take those values and construct a filter query a OR b OR c.
2) Perform a facet call on addr with the original query + the newly constructed filter query. The facet response should now contain the intersection.

1000 is a bit close to the default limit for boolean queries, so you might want to raise that.

> I'm also looking at creating a custom QueryParser that would build the relevant DocLists, then intersect them and return the values, [...]

You are describing a Join in Solr and that would likely solve your problem, but it does not work across cores. Is it possible to have both the addr and dest data in the same core?

- Toke Eskildsen
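The filter-building part of Toke's step 1 might look something like the sketch below. It assumes the dest values have already been fetched with a facet call and only shows how to turn them into `a OR b OR c` style filter strings while staying under a clause limit. The function name `build_or_filters`, the `max_clauses` default and the sample values are all hypothetical. Note that Solr intersects multiple fq parameters, so if the values had to be split into several batches, each batch would need its own request and the facet results would have to be merged client-side.

```python
def build_or_filters(field, values, max_clauses=1000):
    """Return 'field:("v1" OR "v2" ...)' strings, each with at most max_clauses values.

    Values are wrapped in double quotes, with backslashes and quotes escaped,
    so that characters such as ':' in a value are not parsed as query syntax.
    """
    quoted = ['"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"' for v in values]
    return [
        f"{field}:({' OR '.join(quoted[i:i + max_clauses])})"
        for i in range(0, len(quoted), max_clauses)
    ]


# Hypothetical 'dest' values, as they might come back from the step 1 facet call.
dest_values = ["8.8.8.8", "10.0.0.1", "10.0.0.2"]

single = build_or_filters("addr", dest_values)                   # one filter, fits the limit
batched = build_or_filters("addr", dest_values, max_clauses=2)   # forced into two batches
```

Each returned string would then be sent as the fq of a facet request on addr, as in step 2.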
Re: Handling intersection facets of many values
Hi Toke,

Yes, the 'lots-of-booleans' thing is a bit prohibitive as it won't realistically scale to large value sets.

I've been wrestling with joins this evening and have managed to get them working - and it works very nicely - and across cores (although not shards yet, afaik)!

For anyone looking to do this sort of facet intersecting, here's my query:

127.0.0.1:8983/solr/net/select?q=*:*&fl=dest&fl=src&facet=true&fq={!join from=addr to=dest fromIndex=targets}*&facet.field=src&facet.field=dest&facet.mincount=1&facet.limit=-1&facet.sort=count&rows=0

Thanks,
Peter

On Wed, Nov 19, 2014 at 9:23 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
> Peter Sturge [peter.stu...@gmail.com] wrote:
>> I guess you mean take the 1k or so values and build a boolean query from them?
>
> Not really. Let me try again:
>
> 1) Perform a facet call with facet.limit=-1 on dest to get the relevant dest values. The result will always be 1000 values or fewer. Take those values and construct a filter query a OR b OR c.
> 2) Perform a facet call on addr with the original query + the newly constructed filter query. The facet response should now contain the intersection.
>
> 1000 is a bit close to the default limit for boolean queries, so you might want to raise that.
>
>> I'm also looking at creating a custom QueryParser that would build the relevant DocLists, then intersect them and return the values, [...]
>
> You are describing a Join in Solr and that would likely solve your problem, but it does not work across cores. Is it possible to have both the addr and dest data in the same core?
>
> - Toke Eskildsen
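For anyone scripting Peter's join query, one way to assemble it is to list the parameters explicitly and let the standard library do the URL encoding. The parameter names and values below are the ones from his message; the `http://` scheme and the variable names are assumptions added for the sketch, and the text after the join's closing brace is reproduced exactly as it appears in the thread.

```python
from urllib.parse import urlencode, parse_qsl

# Parameters from Peter's query; repeated keys (fl, facet.field) are kept as
# separate pairs so that each one becomes its own URL parameter.
params = [
    ("q", "*:*"),
    ("fl", "dest"),
    ("fl", "src"),
    ("facet", "true"),
    # Join from 'addr' in the 'targets' core to 'dest' in the queried core.
    ("fq", "{!join from=addr to=dest fromIndex=targets}*"),
    ("facet.field", "src"),
    ("facet.field", "dest"),
    ("facet.mincount", "1"),
    ("facet.limit", "-1"),
    ("facet.sort", "count"),
    ("rows", "0"),
]

base = "http://127.0.0.1:8983/solr/net/select"   # scheme is an assumption
url = base + "?" + urlencode(params)             # percent-encodes braces, spaces, etc.

# Decoding the query string gives back the original parameter list.
round_trip = parse_qsl(url.split("?", 1)[1])
```

Encoding the parameters this way avoids problems with the braces, spaces and equals signs inside the `{!join ...}` local parameters.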