Re: Percolate feature?
On 01/10/2013 04:12, Otis Gospodnetic wrote:
> Just came across this ancient thread. Charlie, did this end up happening?
> I suspect Wolfgang may be interested, but that's just a wild guess.

Hi Otis & all,

Yes, we're actually planning to talk about it at Lucene Revolution in November and open source it around then - it's called 'Luwak' and we're currently working on a live customer implementation based on it.

> I was curious about your feeling that what you were open-sourcing might be
> a lot faster and more flexible than ES's percolator - can you share more
> about why you have that feeling and whether you've confirmed this?

Difficult to say at present - we've not done a direct comparative test yet and obviously we like our own implementation! It works very well for our clients' use case.

Cheers

Charlie

> Thanks,
> Otis
> --
> Solr & ElasticSearch Support -- http://sematext.com/
> Performance Monitoring -- http://sematext.com/spm
>
> On Mon, Aug 5, 2013 at 6:34 AM, Charlie Hull char...@flax.co.uk wrote:
>> On 03/08/2013 00:50, Mark wrote:
>>> We have a set number of known terms we want to match against.
>>>
>>> In Index:
>>>   "term one"
>>>   "term two"
>>>   "term three"
>>>
>>> I know how to match all terms of a user query against the index but we
>>> would like to know how/if we can match a user's query against all the
>>> terms in the index?
>>>
>>> Search Queries:
>>>   "my search term" = 0 matches
>>>   "my term search one" = 1 match ("term one")
>>>   "some prefix term two" = 1 match ("term two")
>>>   "one two three" = 0 matches
>>>
>>> I can only explain this as almost a reverse search?
>>>
>>> I came across the following from ElasticSearch
>>> (http://www.elasticsearch.org/guide/reference/api/percolate/) and it
>>> sounds like this may accomplish the above but I haven't tested it. I was
>>> wondering if Solr had something similar or an alternative way of
>>> accomplishing this?
>>>
>>> Thanks
>>
>> Hi Mark,
>>
>> We've built something that implements this kind of reverse search for our
>> clients in the media monitoring sector - we're working on releasing the
>> core of this as open source very soon, hopefully in a month or two. It's
>> based on Lucene.
>>
>> Just for reference it's able to apply tens of thousands of stored queries
>> to a document per second (our clients often have very large and complex
>> Boolean strings representing their clients' interests and may monitor
>> hundreds of thousands of news stories every day). It also records the
>> positions of every match. We suspect it's a lot faster and more flexible
>> than Elasticsearch's Percolate feature.
>>
>> Cheers
>>
>> Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
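As a rough illustration of the reverse-search idea described in this message (stored queries applied to each incoming document, with the positions of matches recorded), here is a toy Python sketch. It is not Luwak or Lucene code: `StoredQuery`, `match_document`, and the whitespace tokenizer are hypothetical stand-ins, and each stored query is simplified to "all of these terms must appear" rather than a real Boolean query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredQuery:
    """A registered query, simplified to 'all of these terms must appear'."""
    query_id: str
    required_terms: frozenset


def tokenize(text):
    """Toy analysis (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def match_document(queries, text):
    """Return {query_id: {term: [positions]}} for every query the text satisfies."""
    positions = {}
    for pos, token in enumerate(tokenize(text)):
        positions.setdefault(token, []).append(pos)

    hits = {}
    for q in queries:
        if all(term in positions for term in q.required_terms):
            hits[q.query_id] = {t: positions[t] for t in sorted(q.required_terms)}
    return hits


# Mark's three indexed terms, registered as stored queries.
queries = [
    StoredQuery("q1", frozenset({"term", "one"})),
    StoredQuery("q2", frozenset({"term", "two"})),
    StoredQuery("q3", frozenset({"term", "three"})),
]
result = match_document(queries, "my term search one")
print(result)
```

The incoming text is indexed once into a position map and every stored query is checked against it, which is the "document as query" inversion the thread calls a reverse search.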
Re: Percolate feature?
Just came across this ancient thread. Charlie, did this end up happening? I suspect Wolfgang may be interested, but that's just a wild guess.

I was curious about your feeling that what you were open-sourcing might be a lot faster and more flexible than ES's percolator - can you share more about why do you have that feeling and whether you've confirmed this?

Thanks,
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
Re: Percolate feature?
: Let's talk about the real use case. We are a marketplace that sells
: products that users have listed. For certain popular, high risk or
: restricted keywords we charge the seller an extra fee/ban the listing.
: We now have sellers purposely misspelling their listings to circumvent
: this fee. They will start adding suffixes to their product listings such
: as "Sonies" knowing that it gets indexed down to "Sony" and thus
: matching a user's query for "Sony". Or they will munge together numbers and
: products… "2013Sony". Same thing goes for adding crazy non-ascii
: characters to the front of the keyword "ΒSony". This is obviously a
: problem because we aren't charging for these keywords and more
: importantly it makes our search results look like shit.

: 1) Detect when a certain keyword is in a product title at listing time
: so we may charge the seller. This was my idea of a reverse search
: although sounds like I may have caused too much confusion with that term.

Ok ... with the concrete specifics of your situation in mind, i can think of 2 completely different approaches -- depending on how precise you need to be about your definition of a "match" and how you want to deal with ongoing maintenance as your system evolves...

## Approach #1 - NRT index searching w/custom plugin

Even if you have 1000-5000 of these special queries you need to check, a custom component to execute those 1000-5000 queries should be very fast against a small index where most of the queries won't match anything -- especially if you write a custom component that pre-parses them into Query objects and hangs onto them in memory.

(As a sample data point: With the 32 sample docs from Solr 4.x, I configured a request handler with 5000 unique facet.query defaults using the {!field} qparser. Most of these facet queries didn't match anything, but a handful of them matched one of the same documents. With completely cold caches, these 5000 facet queries had a QTime of 502ms on my laptop -- and that includes parsing all 5000 query strings)

So imagine if you wrote a custom SearchComponent that could read your X special queries from some remote database on init (and re-load them on command) and parse them into Queries which it then holds on to in some kind of data structure that also tracks why you care about them (ie: "charge 10% more", "banned", etc...). At query time, your custom component would filter the main result set of docs against these queries to look for matches that should be reported (along with the metadata about the queries that match) and could also inspect the results of any query that matches, and generate highlighting for each query+doc that matches.

You would then register this custom search component in a special "validation" solr core that is otherwise configured exactly the same as your regular production index. When a client says "here's my Y products i want to add" you would...

1) index those Y products into your validation solr core using softCommit=true&openSearcher=true
2) execute a query using your special search component filtered to just the list of Y unique ids of the products the client just gave you (that way you can handle concurrent requests from different clients w/o false positives)
3) use the results of that query to tell your client things like "product #123 matches 'Sony' so we are charging you more; and product #456 matches 'Porn' so we are rejecting it"
4) only when done, would you re-index those products into your real index.
5) help keep your validating index small by also doing a deleteById on all of that batch of Y docs when you are done validating.

The upside of this approach is that it helps you ensure the validation logic you apply to products when you get them from clients *exactly* matches your real queries, even if your schema analysis evolves over time.
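The validation workflow in steps 1-5 above can be sketched in a few lines of Python. This is only an in-memory stand-in under stated assumptions, not a Solr SearchComponent: `SpecialQuery`, `load_special_queries`, `validate_batch`, the action labels, and the regex-based `analyze` function are all hypothetical, and the whole point of the real approach is that the validation core reuses the production analysis chain instead of an approximation like this one.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialQuery:
    """One pre-parsed validation query plus the reason we care about it."""
    keyword: str
    terms: frozenset
    action: str  # hypothetical labels, e.g. "surcharge" or "reject"


def analyze(text):
    """Toy stand-in for the index-time analysis chain (NOT Solr's analyzers)."""
    return frozenset(re.findall(r"[a-z]+|[0-9]+", text.lower()))


def load_special_queries(rows):
    """Parse (keyword, action) rows once and keep the results in memory."""
    return [SpecialQuery(kw, analyze(kw), action) for kw, action in rows]


def validate_batch(special_queries, products):
    """Check a batch {product_id: title}; return {product_id: [(keyword, action)]}."""
    report = {}
    for product_id, title in products.items():
        title_terms = analyze(title)
        matches = [(q.keyword, q.action) for q in special_queries
                   if q.terms <= title_terms]
        if matches:
            report[product_id] = matches
    return report


special = load_special_queries([("Sony", "surcharge"), ("Porn", "reject")])
batch = {123: "2013Sony Playstation", 456: "Porn DVD", 789: "Wooden chair"}
report = validate_batch(special, batch)
print(report)
```

Restricting the check to the ids in `batch` plays the role of step 2 (filtering to the Y products just submitted); the re-index and deleteById steps have no equivalent in this toy because nothing is persisted.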
The downside is that it's a decent amount of custom plugin code you need to write upfront, and it will get slower if/when the number of special validation queries increases.

## Approach #2 - Approximate things with a reverse search

Build a small index where each document contains the text of one of your special queries copied into multiple fields with a variety of analysis options configured (in particular: i suspect using shingles would be fruitful here). Set up a query structure that uses functions to combine together the scores of many queries against each of those fields -- this might be simple addition, or you might want it to be conditional, ie: maybe you multiply the sum of the scores of some queries against simple fields with the score of a query against a really simple field to eliminate false positives.

Experiment a bit to see what kinds of inputs get you what kinds of scores, and maybe associate a threshold with each document which you index as a numeric field on those docs and then fold that threshold value into your calculation using the {!frange} parser to make sure you only count matches
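To make the shingle-plus-threshold idea in this second approach more concrete, here is a minimal Python sketch. It assumes word shingles of one and two tokens and a simple overlap ratio as the score; the names `shingles`, `build_index`, `score`, and `reverse_search` are invented for illustration, and none of this uses Solr function queries or real Lucene scoring.

```python
def shingles(text, max_size=2):
    """Word shingles (n-grams of 1..max_size tokens) from lowercased text."""
    tokens = text.lower().split()
    out = set()
    for size in range(1, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.add(" ".join(tokens[i:i + size]))
    return out


def build_index(special_queries):
    """Map each special query to its shingles and its per-document threshold."""
    return {text: (shingles(text), threshold)
            for text, threshold in special_queries.items()}


def score(title, doc_shingles):
    """Fraction of the stored query's shingles that also occur in the title."""
    return len(doc_shingles & shingles(title)) / len(doc_shingles)


def reverse_search(index, title):
    """Return {special_query: score} for entries whose score meets their threshold."""
    results = {}
    for text, (doc_shingles, threshold) in index.items():
        s = score(title, doc_shingles)
        if s >= threshold:
            results[text] = s
    return results


# Thresholds are made-up values; in practice they would come from experimentation.
index = build_index({"sony": 1.0, "samsung galaxy": 0.6})
hits = reverse_search(index, "Galaxy Samsung S4")
print(hits)
```

With these made-up thresholds, a title containing both words in the wrong order still clears the bar for "samsung galaxy", while a title with only "Samsung" does not; tuning the threshold per stored query is what trades false positives against misses.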
Re: Percolate feature?
Any ideas?

On Aug 10, 2013, at 6:28 PM, Mark static.void@gmail.com wrote:
> Our schema is pretty basic.. nothing fancy going on here
> [...]
Re: Percolate feature?
> So to reiterate your examples from before, but change the labels a bit and
> add some more converse examples (and ignore the highlighting aspect for a
> moment)...
>
>   doc1 = Sony
>   doc2 = Samsung Galaxy
>   doc3 = Sony Playstation
>
>   queryA = Sony Experia       ... matches only doc1
>   queryB = Sony Playstation 3 ... matches doc3 and doc1
>   queryC = Samsung 52inch LC  ... doesn't match anything
>   queryD = Samsung Galaxy S4  ... matches doc2
>   queryE = Galaxy Samsung S4  ... matches doc2
>
> ...do i still have that correct?

Yes

> 2) if you *do* care about using non-trivial analysis, then you can't use
> the simple termfreq() function, which deals with raw terms -- instead you
> have to use the query() function to ensure that the input is parsed
> appropriately -- but then you have to wrap that function in something that
> will normalize the scores - so in place of termfreq('words','Galaxy')
> you'd want something like...

Yes, we will be using non-trivial analysis. Now here's another twist… what if we don't care about scoring?

Let's talk about the real use case. We are a marketplace that sells products that users have listed. For certain popular, high risk or restricted keywords we charge the seller an extra fee/ban the listing. We now have sellers purposely misspelling their listings to circumvent this fee. They will start adding suffixes to their product listings such as "Sonies" knowing that it gets indexed down to "Sony" and thus matching a user's query for "Sony". Or they will munge together numbers and products… "2013Sony". Same thing goes for adding crazy non-ascii characters to the front of the keyword "ΒSony". This is obviously a problem because we aren't charging for these keywords and more importantly it makes our search results look like shit.

We would like to:

1) Detect when a certain keyword is in a product title at listing time so we may charge the seller. This was my idea of a reverse search although sounds like I may have caused too much confusion with that term.
2) Attempt to autocorrect these titles, hence the need for highlighting so we can try and replace the terms… this of course done outside of Solr via an external service.

Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory) this makes conventional approaches such as regex quite troublesome. Regex is also quite slow, scales horribly, and always needs to be in lockstep with schema changes.

Now knowing this, is there a good way to approach this?

Thanks

On Aug 9, 2013, at 11:56 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
> : I'll look into this. Thanks for the concrete example as I don't even
> : know which classes to start to look at to implement such a feature.
>
> Either roman isn't understanding what you are asking for, or i'm not -- but
> i don't think what roman described will work for you...
>
> : so if your query contains no duplicates and all terms must match, you can
> : be sure that you are collecting docs only when the number of terms matches
> : number of clauses in the query
>
> several of the examples you gave did not match what Roman is describing, as
> i understand it. Most people on this thread seem to be getting confused by
> having their perceptions flipped about what your "data known in advance" is
> vs the "data you get at request time".
>
> You described this...
>
> : Product keyword: Sony
> : Product keyword: Samsung Galaxy
> :
> : We would like to be able to detect given a product title whether or
> : not it
> : matches any known keywords. For a keyword to be matched all of it's
> : terms
> : must be present in the product title given.
> :
> : Product Title: Sony Experia
> : Matches and returns a highlight: <em>Sony</em> Experia
>
> ...suggesting that what you call "product keywords" are the "data you know
> about in advance" and "product titles" are the data you get at request
> time.
>
> So your example of the request time input (ie: query) "Sony Experia"
> matching "data known in advance" (ie: indexed document) "Sony" would not
> work with Roman's example.
>
> To rephrase (what i think i understand is) your goal...
>
>  * you have many (10*3+) documents known in advance
>  * any document D contains a set of words W(D) of varying sizes
>  * any request Q contains a set of words W(Q) of varying sizes
>  * you want a given request Q to match a document D if and only if:
>    - W(D) is a subset of W(Q)
>    - ie: no item exists in W(D) that does not exist in W(Q)
>    - ie: any number of items may exist in W(Q) that are not in W(D)
>
> So to reiterate your examples from before, but change the labels a bit and
> add some more converse examples (and ignore the highlighting aspect for a
> moment)...
>
>   doc1 = Sony
>   doc2 = Samsung Galaxy
>   doc3 = Sony Playstation
>
>   queryA = Sony Experia       ... matches only doc1
>   queryB = Sony Playstation 3 ... matches doc3 and doc1
>   queryC = Samsung 52inch LC  ... doesn't match anything
>   queryD = Samsung Galaxy S4  ... matches doc2
>   queryE = Galaxy Samsung S4  ... matches
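The matching rule restated in this message (a request Q matches a document D if and only if W(D) is a subset of W(Q)) is easy to check against the five example queries with a small Python sketch. The `words` and `matching_docs` helpers are hypothetical names, and the lowercase-and-split analysis is an assumption that ignores stemming and word-delimiter handling entirely.

```python
def words(text):
    """Toy analysis (an assumption): the set of lowercased whitespace tokens."""
    return frozenset(text.lower().split())


def matching_docs(docs, query):
    """Return the ids of documents D for which W(D) is a subset of W(Q)."""
    query_words = words(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if words(text) <= query_words)


docs = {"doc1": "Sony", "doc2": "Samsung Galaxy", "doc3": "Sony Playstation"}
queries = {
    "queryA": "Sony Experia",
    "queryB": "Sony Playstation 3",
    "queryC": "Samsung 52inch LC",
    "queryD": "Samsung Galaxy S4",
    "queryE": "Galaxy Samsung S4",
}
results = {name: matching_docs(docs, q) for name, q in queries.items()}
for name, matched in results.items():
    print(name, matched)
```

Because sets ignore order and repetition, queryE matches doc2 just as queryD does; a variant that cared about term order or duplicate counts would need sequences or multisets instead.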
Re: Percolate feature?
Now we're getting somewhere!

To (over-simplify), you simply want to know if a given listing would match a high-value pattern, either in a clean manner (obvious keywords) or in an unclean manner (e.g., fuzzy keyword matching, stemming, n-grams.)

To a large degree this also depends on how rich and powerful your end-user query support is. So, if the user searches for "sony", "samsung", or "apple", will it match some oddball listing that fuzzily matches those terms?

So... tell us how rich your query interface is. I mean, do you support wildcard, fuzzy query, ngrams (e.g., can they type "son" or "sam" or "app", or... will "sony" match "sonblah-blah")?

Reverse-search may in fact be what you need in this case, since you literally do mean "if I index this document, will it match any of these queries" (but doesn't score a hit on your direct check for whether it is a clean keyword match.)

In your previous examples you only gave clean product titles, not examples of circumventions of simple keyword matches.

-- Jack Krupansky

-----Original Message-----
From: Mark
Sent: Saturday, August 10, 2013 6:24 PM
To: solr-user@lucene.apache.org
Cc: Chris Hostetter
Subject: Re: Percolate feature?

[...]
Re: Percolate feature?
Our schema is pretty basic.. nothing fancy going on here

    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="1"/>
        <filter class="solr.KStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="1" catenateNumbers="1" catenateAll="1"
                preserveOriginal="1"/>
        <filter class="solr.KStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

On Aug 10, 2013, at 3:40 PM, Jack Krupansky j...@basetechnology.com wrote:
> Now we're getting somewhere!
> [...]
Re: Percolate feature?
This _looks_ like simple phrase matching (no slop) and highlighting...

But whenever I think the answer is really simple, it usually means that I'm missing something

Best
Erick

On Thu, Aug 8, 2013 at 11:18 PM, Mark static.void@gmail.com wrote:
> Ok forget the mention of percolate.
>
> We have a large list of known keywords we would like to match against.
>
>   Product keyword: Sony
>   Product keyword: Samsung Galaxy
>
> We would like to be able to detect, given a product title, whether or not
> it matches any known keywords. For a keyword to be matched all of its
> terms must be present in the product title given.
>
>   Product Title: Sony Experia
>   Matches and returns a highlight: <em>Sony</em> Experia
>
>   Product Title: Samsung 52inch LC
>   Does not match
>
>   Product Title: Samsung Galaxy S4
>   Matches and returns a highlight: <em>Samsung Galaxy</em>
>
>   Product Title: Galaxy Samsung S4
>   Matches and returns a highlight: <em>Galaxy Samsung</em>
>
> What would be the best way to approach this?
>
> On Aug 5, 2013, at 7:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
>> : Subject: Percolate feature?
>>
>> can you give a more concrete, realistic example of what you are trying to
>> do? your synthetic hypothetical example is kind of hard to make sense of.
>>
>> your Subject line and comment that the percolate feature of elastic
>> search sounds like what you want seems to have led some people down a
>> path of assuming you want to run these types of queries as documents are
>> indexed -- but that isn't at all clear to me from the way you worded your
>> question other than that.
>>
>> it's also not clear what aspect of the results you really care about --
>> are you only looking for the *number* of documents that match according
>> to your concept of matching, or are you looking for a list of matches?
>> what if multiple documents have all of their terms in the query string --
>> how should they score relative to each other? what if a document contains
>> the same term multiple times, do you expect it to be a match of a query
>> only if that term appears in the query multiple times as well? do you
>> care about the ordering of the terms in the query? the ordering of the
>> terms in the document?
>>
>> Ideally: describe for us what you want to do, w/o assuming
>> solr/elasticsearch/anything specific about the implementation -- just
>> describe your actual use case for us, with several real document/query
>> examples.
>>
>> https://people.apache.org/~hossman/#xyproblem
>> XY Problem
>>
>> Your question appears to be an "XY Problem" ... that is: you are dealing
>> with "X", you are assuming "Y" will help you, and you are asking about
>> "Y" without giving more details about the "X" so that we can understand
>> the full issue. Perhaps the best solution doesn't involve "Y" at all?
>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>
>> -Hoss
Re: Percolate feature?
*All* of the terms in the field must be matched by the query, not vice-versa.

And no, we don't have a query for that out of the box. To implement it, it seems like you would need the total number of terms indexed for a field (for each document).

I guess you could also index start and end tokens and then use query expansion to all possible combinations... messy though.

-Yonik
http://lucidworks.com

On Fri, Aug 9, 2013 at 8:19 AM, Erick Erickson erickerick...@gmail.com wrote:
> This _looks_ like simple phrase matching (no slop) and highlighting...
>
> But whenever I think the answer is really simple, it usually means that I'm missing something
>
> Best
> Erick
>
> On Thu, Aug 8, 2013 at 11:18 PM, Mark static.void@gmail.com wrote:
>> Ok forget the mention of percolate.
>>
>> We have a large list of known keywords we would like to match against.
>>
>> Product keyword: Sony
>> Product keyword: Samsung Galaxy
>>
>> We would like to be able to detect, given a product title, whether or not it matches any known keywords. For a keyword to be matched, all of its terms must be present in the product title given.
>>
>> Product Title: Sony Experia
>> Matches and returns a highlight: <em>Sony</em> Experia
>>
>> Product Title: Samsung 52inch LC
>> Does not match
>>
>> Product Title: Samsung Galaxy S4
>> Matches and returns a highlight: <em>Samsung Galaxy</em>
>>
>> Product Title: Galaxy Samsung S4
>> Matches and returns a highlight: <em>Galaxy Samsung</em>
>>
>> What would be the best way to approach this?
>>
>> On Aug 5, 2013, at 7:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
>>> : Subject: Percolate feature?
>>>
>>> can you give a more concrete, realistic example of what you are trying to do? your synthetic hypothetical example is kind of hard to make sense of.
>>>
>>> your Subject line and comment that the percolate feature of elastic search sounds like what you want seems to have led some people down a path of assuming you want to run these types of queries as documents are indexed -- but that isn't at all clear to me from the way you worded your question other than that.
>>>
>>> it's also not clear what aspect of the results you really care about -- are you only looking for the *number* of documents that match according to your concept of matching, or are you looking for a list of matches? what if multiple documents have all of their terms in the query string -- how should they score relative to each other? what if a document contains the same term multiple times, do you expect it to be a match of a query only if that term appears in the query multiple times as well? do you care about the ordering of the terms in the query? the ordering of the terms in the document?
>>>
>>> Ideally: describe for us what you want to do, w/o assuming solr/elasticsearch/anything specific about the implementation -- just describe your actual use case for us, with several real document/query examples.
>>>
>>> https://people.apache.org/~hossman/#xyproblem
>>> XY Problem
>>>
>>> Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all?
>>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>>
>>> -Hoss
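Mark's examples (a keyword matches when all of its terms appear in the title, in any order, and the matched words come back wrapped in <em> tags) can be read as the following plain-Java sketch. It has no Solr or Lucene dependency, and everything in it is an assumption made for illustration: the class name KeywordHighlighter, the whitespace-and-lowercase tokenization, and the choice to return the whole title with adjacent matched words sharing one <em> pair.

```java
import java.util.*;

/** Hypothetical helper illustrating the matching rule in Mark's examples. */
final class KeywordHighlighter {

    /** Lowercases and splits on whitespace; a stand-in for real analysis. */
    static List<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? Collections.<String>emptyList()
                                 : Arrays.asList(trimmed.split("\\s+"));
    }

    /** Returns the keywords whose terms all occur in the title, in any order. */
    static List<String> matchingKeywords(List<String> keywords, String title) {
        Set<String> titleTerms = new HashSet<>(tokens(title));
        List<String> matches = new ArrayList<>();
        for (String keyword : keywords) {
            List<String> kwTerms = tokens(keyword);
            if (!kwTerms.isEmpty() && titleTerms.containsAll(kwTerms)) {
                matches.add(keyword);
            }
        }
        return matches;
    }

    /** Wraps each run of adjacent matched title words in one <em>...</em> pair. */
    static String highlight(List<String> keywords, String title) {
        Set<String> hitTerms = new HashSet<>();
        for (String keyword : matchingKeywords(keywords, title)) {
            hitTerms.addAll(tokens(keyword));
        }
        StringBuilder out = new StringBuilder();
        boolean open = false;
        for (String word : title.trim().split("\\s+")) {
            boolean hit = hitTerms.contains(word.toLowerCase(Locale.ROOT));
            if (open && !hit) {          // close the current highlighted run
                out.append("</em>");
                open = false;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (hit && !open) {          // start a new highlighted run
                out.append("<em>");
                open = true;
            }
            out.append(word);
        }
        if (open) {
            out.append("</em>");
        }
        return out.toString();
    }
}
```

With the keywords "Sony" and "Samsung Galaxy", the title "Galaxy Samsung S4" comes back as "<em>Galaxy Samsung</em> S4", while "Samsung 52inch LC" is returned unchanged because only one of the two terms of "Samsung Galaxy" is present.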
Re: Percolate feature?
On Aug 9, 2013, at 6:56 AM, Yonik Seeley yo...@lucidworks.com wrote:
> *All* of the terms in the field must be matched by the query, not vice-versa.

Exactly. This is why I was trying to explain it as a reverse search.

I just realized I described it as a *large* list of known keywords when really it's small; no more than 1000. Forgetting about performance, how hard do you think this would be to implement? How should I even start?

Thanks for the input
Re: Percolate feature?
Starting with the presumption that Solr is a search engine for user queries, what exactly would a user query look like? Are you really requiring your users to enter long, carefully constructed, full-length product titles?? What kind of application would force its users to do such a thing?

Put another way, if the user has entered what they consider important terms in their query, why are you so ready to ignore a lot of those terms?

Or, is this simply a case where some old software had a feature that for reasons unknown behaved this way, and you are merely trying to replicate that feature in the name of compatibility without thinking about whether the feature actually makes sense in a modern software environment? (Or, maybe your manager or marketing invented this feature and you're just trying to implement it as stated without trying to decide whether it makes sense?)

The point is that you are making us try to guess what the actual use case is, rather than simply telling us what it is! Please clarify what your use case really is. If you would explain the use case (not some proposed solution), maybe we could offer suggestions for solutions.

Put another way, what exactly do you perceive to be wrong with normal, traditional, simple query matching that causes you to go to such great lengths to avoid using it? IOW, why are you trying to re-invent and re-imagine a wheel that doesn't appear to need to be re-invented or re-imagined? I'm sure you must have some reason for doing that, but why not disclose that reason so that we can use it to understand what you are trying to do?

-- Jack Krupansky
Re: Percolate feature?
On Fri, Aug 9, 2013 at 11:29 AM, Mark static.void@gmail.com wrote:
>> *All* of the terms in the field must be matched by the query, not vice-versa.
>
> Exactly. This is why I was trying to explain it as a reverse search.
>
> I just realized I described it as a *large* list of known keywords when really it's small; no more than 1000. Forgetting about performance, how hard do you think this would be to implement? How should I even start?

not hard, index all terms into a field - make sure there are no duplicates, as you want to count them - then I can imagine at least two options: save the number of terms as a payload together with the terms, or in a second step (in a collector, for example), load the document and count the terms in the field - if they match the query size, you are done

a trivial, naive implementation (as you say 'forget performance') could be:

    searcher.search(query, null, new Collector() {
      ...
      public void collect(int i) throws IOException {
        // load only the stored field that holds the terms
        Document d = reader.document(i, fieldsToLoad);
        // keep the doc only if it has as many terms as the query has clauses
        // (assumes query is a BooleanQuery; queue, score and docBase come from the elided part)
        if (d.getValues(fieldToLoad).length == query.clauses().size()) {
          queue.add(new ScoreDoc(i + docBase, score));
        }
      }
    });

so if your query contains no duplicates and all terms must match, you can be sure that you are collecting docs only when the number of terms matches the number of clauses in the query

roman

> Thanks for the input
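To make the effect of Roman's collector idea concrete, here is a small in-memory sketch in plain Java, with no Lucene involved. It only imitates the two conditions he describes: every query clause must match, and the document's term count must equal the number of clauses. The class and method names (ExactTermCountMatcher, matches, terms) are hypothetical, and term sets are assumed to be duplicate-free as Roman requires.

```java
import java.util.*;

/** Hypothetical in-memory stand-in for the collector check Roman describes. */
final class ExactTermCountMatcher {

    /** True when every query term is in the doc and the doc has no other terms. */
    static boolean matches(Set<String> docTerms, Set<String> queryTerms) {
        // what an all-clauses-required query would guarantee
        boolean allClausesMatch = docTerms.containsAll(queryTerms);
        // what the extra count comparison in the collector adds
        boolean sameSize = docTerms.size() == queryTerms.size();
        return allClausesMatch && sameSize;
    }

    /** Convenience builder for a duplicate-free term set. */
    static Set<String> terms(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }
}
```

Taken together, the two conditions amount to set equality: terms("samsung", "galaxy") matches the query terms("galaxy", "samsung"), but a document holding only "sony" is rejected for the query terms("sony", "experia"), because the required clause "experia" is missing from it.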
Re: Percolate feature?
I'll look into this. Thanks for the concrete example, as I don't even know which classes to start looking at to implement such a feature.
Re: Percolate feature?
All of the query words must match, right? So this is a phrase query in edismax with mm=100%.

We have suggestions for exactly matching a whole field, but you need "samsung galaxy" to match the document "samsung galaxy s4". That means you do not need an exact match on the field.

If you do need that, I have a suggestion, but I don't want to confuse things further.

wunder

On Aug 9, 2013, at 10:01 AM, Mark wrote:
> I'll look into this. Thanks for the concrete example as I don't even know which classes to start to look at to implement such a feature.
Re: Percolate feature?
: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not -- but i don't think what roman described will work for you...

: so if your query contains no duplicates and all terms must match, you can
: be sure that you are collecting docs only when the number of terms matches
: number of clauses in the query

several of the examples you gave did not match what Roman is describing, as i understand it. Most people on this thread seem to be getting confused by having their perceptions flipped about what your "data known in advance" is vs the "data you get at request time".

You described this...

: Product keyword: Sony
: Product keyword: Samsung Galaxy
:
: We would like to be able to detect given a product title whether or not it
: matches any known keywords. For a keyword to be matched all of its terms
: must be present in the product title given.
:
: Product Title: Sony Experia
: Matches and returns a highlight: <em>Sony</em> Experia

...suggesting that what you call "product keywords" are the data you know about in advance and "product titles" are the data you get at request time.

So your example of the request time input (ie: query) "Sony Experia" matching data known in advance (ie: indexed document) "Sony" would not work with Roman's example.

To rephrase (what i think i understand is) your goal...

 * you have many (10^3+) documents known in advance
 * any document D contains a set of words W(D) of varying sizes
 * any request Q contains a set of words W(Q) of varying sizes
 * you want a given request Q to match a document D if and only if:
   - W(D) is a subset of W(Q)
   - ie: no item exists in W(D) that does not exist in W(Q)
   - ie: any number of items may exist in W(Q) that are not in W(D)

So to reiterate your examples from before, but change the labels a bit and add some more converse examples (and ignore the highlighting aspect for a moment)...

  doc1 = "Sony"
  doc2 = "Samsung Galaxy"
  doc3 = "Sony Playstation"

  queryA = "Sony Experia" ... matches only doc1
  queryB = "Sony Playstation 3" ... matches doc3 and doc1
  queryC = "Samsung 52inch LC" ... doesn't match anything
  queryD = "Samsung Galaxy S4" ... matches doc2
  queryE = "Galaxy Samsung S4" ... matches doc2

...do i still have that correct?

A similar question came up in the past, but i can't find my response now so i'll try to recreate it ...

1) if you don't care about using non-trivial analysis (ie: you don't need stemming, or synonyms, etc..), you can do this with some really simple function queries -- assuming you index a field containing the number of "words" in each document, in addition to the words themselves. Assuming your words are in a field named "words" and the number of words is in a field named "words_count", a request for something like "Galaxy Samsung S4" can be represented as...

  q={!frange l=0 u=0}sub(words_count, sum(termfreq('words','Galaxy'), termfreq('words','Samsung'), termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of the words requested, and then you want to subtract that sum from the number of terms in the document -- and then you only want to match documents where the result of that subtraction is 0.

one complexity that comes up, is that you haven't specified:

 * can the list of words in your documents contain duplicates?
 * can the list of words in your query contain duplicates?
 * should a document with duplicate words match only if the query also contains the same word duplicated?

...the answers to those questions make the math more complicated (and are left as an exercise for the reader)

2) if you *do* care about using non-trivial analysis, then you can't use the simple termfreq() function, which deals with raw terms -- instead you have to use the query() function to ensure that the input is parsed appropriately -- but then you have to wrap that function in something that will normalize the scores - so in place of termfreq('words','Galaxy') you'd want something like...

  if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex with duplicate words in the document or duplicate words in the query -- you'd probably have to use a custom similarity to get the scores returned by the query() function to be usable as is in the match equation (and drop the if() function)

As for the highlighting part of the problem -- that becomes much easier -- independent of the queries you use to *match* the documents, you can then specify a hl.q param to specify a much simpler query just containing the basic list of words (as a simple boolean query, all clauses optional) and let it highlight them in your list of words.

-Hoss
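The arithmetic behind option 1 can be checked outside Solr. The sketch below is plain Java that imitates the frange expression (words_count minus the sum of per-word term frequencies, matching only when the result is 0) against Hoss's doc1..doc3 and queryA..queryE. It is an illustration under assumptions, not Solr code: SubsetByCount and its methods are hypothetical names, and lowercasing plus whitespace splitting stands in for whatever analysis the "words" field would really use.

```java
import java.util.*;

/** Hypothetical emulation of the words_count / termfreq arithmetic, outside Solr. */
final class SubsetByCount {

    /** Lowercased whitespace tokens; a stand-in for the field's analysis. */
    static List<String> words(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    /** How often the term occurs in the document's word list (the termfreq role). */
    static int termFreq(List<String> docWords, String term) {
        return Collections.frequency(docWords, term);
    }

    /** Matches when words_count - sum(termfreq of each query word) == 0. */
    static boolean matches(List<String> docWords, List<String> queryWords) {
        int wordsCount = docWords.size();
        int sum = 0;
        for (String q : queryWords) {
            sum += termFreq(docWords, q);
        }
        return wordsCount - sum == 0;
    }

    /** Names of the documents, in map order, that the query matches. */
    static List<String> matchingDocs(Map<String, String> docs, String query) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (matches(words(e.getValue()), words(query))) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

With doc1 = "Sony", doc2 = "Samsung Galaxy" and doc3 = "Sony Playstation" in a LinkedHashMap, "Sony Playstation 3" returns doc1 and doc3, and "Samsung 52inch LC" returns nothing, as in the list above. The duplicate caveat also shows up directly: the query "sony sony" wrongly matches "Sony Playstation", because 2 - (1 + 1) is 0 even though "playstation" never appears in the query.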
Re: Percolate feature?
I thought about that suggested doc/query model, but... Do you really want a query of Sony xbox or Sony ipad or even Sony Samsung to match document Sony? Seems quite odd. -- Jack Krupansky -Original Message- From: Chris Hostetter Sent: Friday, August 09, 2013 2:56 PM To: solr-user@lucene.apache.org Subject: Re: Percolate feature? : I'll look into this. Thanks for the concrete example as I don't even : know which classes to start to look at to implement such a feature. Either roman isn't understanding what you are aksing for, or i'm not -- but i don't think what roman described will work for you... : so if your query contains no duplicates and all terms must match, you can : be sure that you are collecting docs only when the number of terms matches : number of clauses in the query several of the examples you gave did not match what Roman is describing, as i understand it. Most people on this thread seem to be getting confused by having their perceptions flipped about what your data known in advance is vs the data you get at request time. You described this... : Product keyword: Sony : Product keyword: Samsung Galaxy : : We would like to be able to detect given a product title whether or : not it : matches any known keywords. For a keyword to be matched all of it's : terms : must be present in the product title given. : : Product Title: Sony Experia : Matches and returns a highlight: emSony/em Experia ...suggesting that what you call product keywords are the data you know about in advance and product titles are the data you get at request time. So your example of the request time input (ie: query) Sony Experia matching data known in advance (ie: indexed document) Sony would not work with Roman's example. To rephrase (what i think i understand is) your goal... 
* you have many (10*3+) documents known in advance
* any document D contains a set of words W(D) of varying sizes
* any request Q contains a set of words W(Q) of varying sizes
* you want a given request Q to match a document D if and only if:
  - W(D) is a subset of W(Q)
  - ie: no item exists in W(D) that does not exist in W(Q)
  - ie: any number of items may exist in W(Q) that are not in W(D)

So to reiterate your examples from before, but change the labels a bit and add some more converse examples (and ignore the highlighting aspect for a moment)...

doc1 = Sony
doc2 = Samsung Galaxy
doc3 = Sony Playstation

queryA = Sony Experia ... matches only doc1
queryB = Sony Playstation 3 ... matches doc3 and doc1
queryC = Samsung 52inch LC ... doesn't match anything
queryD = Samsung Galaxy S4 ... matches doc2
queryE = Galaxy Samsung S4 ... matches doc2

...do i still have that correct?

A similar question came up in the past, but i can't find my response now so i'll try to recreate it ...

1) if you don't care about using non-trivial analysis (ie: you don't need stemming, or synonyms, etc..), you can do this with some really simple function queries -- assuming you index a field containing the number of words in each document, in addition to the words themselves. Assuming your words are in a field named words and the number of words is in a field named words_count, a request for something like Galaxy Samsung S4 can be represented as...

q={!frange l=0 u=0}sub(words_count, sum(termfreq('words','Galaxy'), termfreq('words','Samsung'), termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of the words requested, and then you want to subtract that sum from the number of terms in the document -- and then you only want to match documents where the result of that subtraction is 0.

one complexity that comes up, is that you haven't specified:

* can the list of words in your documents contain duplicates?
* can the list of words in your query contain duplicates?
* should a document with duplicate words match only if the query also contains the same word duplicated?

...the answers to those questions make the math more complicated (and are left as an exercise for the reader)

2) if you *do* care about using non-trivial analysis, then you can't use the simple termfreq() function, which deals with raw terms -- instead you have to use the query() function to ensure that the input is parsed appropriately -- but then you have to wrap that function in something that will normalize the scores - so in place of termfreq('words','Galaxy') you'd want something like...

if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex with duplicate words in the document or duplicate words in the query -- you'd probably have to use a custom similarity to get the scores returned by the query() function to be usable as is in the match equation (and drop the if() function)

As for the highlighting part of the problem -- that becomes much
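For anyone who wants to check the arithmetic of option 1 outside Solr, here is a minimal Python sketch of the same counting rule. It is not Solr code: the `matches` and `search` helpers and the in-memory `docs` table are hypothetical, and it assumes whitespace tokenization, case-sensitive raw terms, and query words de-duplicated before counting (one possible answer to the duplicate-word questions above).

```python
from collections import Counter

# Hypothetical in-memory stand-in for the indexed documents.
docs = {
    "doc1": ["Sony"],
    "doc2": ["Samsung", "Galaxy"],
    "doc3": ["Sony", "Playstation"],
}

def matches(doc_words, query_words):
    """Mimic sub(words_count, sum(termfreq(...))) == 0 for one document."""
    term_freq = Counter(doc_words)      # per-document term frequencies
    words_count = len(doc_words)        # plays the role of the words_count field
    # Assumption: each distinct query word is counted once.
    matched = sum(term_freq[w] for w in set(query_words))
    return words_count - matched == 0

def search(query):
    """Return the names of all documents whose words are all in the query."""
    words = query.split()
    return sorted(name for name, dw in docs.items() if matches(dw, words))

if __name__ == "__main__":
    for q in ["Sony Experia", "Sony Playstation 3", "Samsung 52inch LC",
              "Samsung Galaxy S4", "Galaxy Samsung S4"]:
        print(q, "->", search(q))
```

A document matches exactly when every one of its words is accounted for by the query, which is the W(D) subset of W(Q) condition; extra query words contribute a term frequency of 0 and so do not affect the result.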
Re: Percolate feature?
On Fri, Aug 9, 2013 at 2:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either roman isn't understanding what you are asking for, or i'm not -- but i don't think what roman described will work for you...

: so if your query contains no duplicates and all terms must match, you can
: be sure that you are collecting docs only when the number of terms matches
: number of clauses in the query

several of the examples you gave did not match what Roman is describing, as i understand it. Most people on this thread seem to be getting confused by having their perceptions flipped about what your data known in advance is vs the data you get at request time.

You described this...

: Product keyword: Sony
: Product keyword: Samsung Galaxy
:
: We would like to be able to detect given a product title whether or not it
: matches any known keywords. For a keyword to be matched all of its terms
: must be present in the product title given.
:
: Product Title: Sony Experia
: Matches and returns a highlight: <em>Sony</em> Experia

...suggesting that what you call product keywords are the data you know about in advance and product titles are the data you get at request time. So your example of the request time input (ie: query) Sony Experia matching data known in advance (ie: indexed document) Sony would not work with Roman's example.

To rephrase (what i think i understand is) your goal...

* you have many (10*3+) documents known in advance
* any document D contains a set of words W(D) of varying sizes
* any request Q contains a set of words W(Q) of varying sizes
* you want a given request Q to match a document D if and only if:
  - W(D) is a subset of W(Q)

aha! this was not what i was understanding! i was assuming W(Q) is a subset of W(D) - or rather, W(Q) === W(D). so now i finally see the reasoning behind it and the use case, which is a VERY interesting one.

roman

  - ie: no item exists in W(D) that does not exist in W(Q)
  - ie: any number of items may exist in W(Q) that are not in W(D)
Re: Percolate feature?
Ok forget the mention of percolate.

We have a large list of known keywords we would like to match against.

Product keyword: Sony
Product keyword: Samsung Galaxy

We would like to be able to detect, given a product title, whether or not it matches any known keywords. For a keyword to be matched, all of its terms must be present in the product title given.

Product Title: Sony Experia
Matches and returns a highlight: <em>Sony</em> Experia

Product Title: Samsung 52inch LC
Does not match

Product Title: Samsung Galaxy S4
Matches and returns a highlight: <em>Samsung Galaxy</em>

Product Title: Galaxy Samsung S4
Matches and returns a highlight: <em>Galaxy Samsung</em>

What would be the best way to approach this?
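To make the requested behaviour concrete, here is a small Python sketch of the keyword-in-title matching with highlighting. It is only an illustration of the requirement, not a Solr feature: the `highlight` function and the `KEYWORDS` list are hypothetical, matching is assumed to be case-insensitive on whitespace-separated tokens, and adjacent matched tokens are merged into a single `<em>` span.

```python
# Hypothetical list of known product keywords.
KEYWORDS = ["Sony", "Samsung Galaxy"]

def highlight(title, keywords=KEYWORDS):
    """Return the title with matched keyword terms wrapped in <em>, or None."""
    tokens = title.split()
    present = {t.lower() for t in tokens}

    # Collect the terms of every keyword whose terms all occur in the title.
    hit_terms = set()
    for keyword in keywords:
        terms = {t.lower() for t in keyword.split()}
        if terms <= present:
            hit_terms |= terms
    if not hit_terms:
        return None  # no keyword matched this title

    # Wrap each run of adjacent matched tokens in one <em>...</em> span.
    out, run = [], []
    for tok in tokens:
        if tok.lower() in hit_terms:
            run.append(tok)
        else:
            if run:
                out.append("<em>" + " ".join(run) + "</em>")
                run = []
            out.append(tok)
    if run:
        out.append("<em>" + " ".join(run) + "</em>")
    return " ".join(out)

if __name__ == "__main__":
    for t in ["Sony Experia", "Samsung 52inch LC", "Galaxy Samsung S4"]:
        print(t, "->", highlight(t))
```

Term order in the title is ignored, so "Galaxy Samsung S4" matches the keyword "Samsung Galaxy" just as "Samsung Galaxy S4" does.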
Re: Percolate feature?
On 03/08/2013 00:50, Mark wrote:

: We have a set number of known terms we want to match against.
:
: In Index:
: term one
: term two
: term three
:
: I know how to match all terms of a user query against the index but we would like to know how/if we can match a user's query against all the terms in the index?
:
: Search Queries:
: my search term = 0 matches
: my term search one = 1 match (term one)
: some prefix term two = 1 match (term two)
: one two three = 0 matches
:
: I can only explain this as almost a reverse search???
:
: I came across the following from ElasticSearch (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds like this may accomplish the above but haven't tested. I was wondering if Solr had something similar or an alternative way of accomplishing this?
:
: Thanks

Hi Mark,

We've built something that implements this kind of reverse search for our clients in the media monitoring sector - we're working on releasing the core of this as open source very soon, hopefully in a month or two. It's based on Lucene.

Just for reference it's able to apply tens of thousands of stored queries to a document per second (our clients often have very large and complex Boolean strings representing their clients' interests and may monitor hundreds of thousands of news stories every day). It also records the positions of every match. We suspect it's a lot faster and more flexible than Elasticsearch's Percolate feature.

Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk
Re: Percolate feature?
: "can match a user's query against all the terms in the index" - that's exactly what Lucene and Solr have done since Day One, for all queries. Percolate actually does the opposite - matches an input document against a registered set of queries - and doesn't match against indexed documents.
:
: Solr does support Lucene's "min should match" feature so that you can specify, say, four query terms and return if at least two match. This is the "mm" parameter.

I don't think you understand me.

Say I only have one document indexed and its contents are "Foo Bar". I want this document returned if and only if the query has the words "Foo" and "Bar" in it. If I use an mm of 100% for "Foo Bar Baz", this document will not be returned because the full user query didn't match. If I use a 0% mm and search "Foo Baz", the document will be returned even though it shouldn't.
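The difference between the two rules can be sketched in a few lines of Python. This is a toy model, not Solr's implementation: `mm_match` and `reverse_match` are hypothetical helpers, terms are whitespace-separated and case-sensitive, and the mm handling is simplified to "round the percentage down, but always require at least one matching query term".

```python
import math

def mm_match(doc, query, mm_percent):
    """Toy min-should-match: enough *query* terms must occur in the document."""
    doc_terms, query_terms = set(doc.split()), set(query.split())
    # Simplifying assumption: percentage rounds down, at least one term required.
    required = max(1, math.floor(len(query_terms) * mm_percent / 100))
    return len(query_terms & doc_terms) >= required

def reverse_match(doc, query):
    """The rule asked for here: every *document* term must occur in the query."""
    return set(doc.split()) <= set(query.split())

if __name__ == "__main__":
    doc = "Foo Bar"
    print(mm_match(doc, "Foo Bar Baz", 100))   # query term Baz is missing from the doc
    print(mm_match(doc, "Foo Baz", 0))         # one shared term is enough
    print(reverse_match(doc, "Foo Bar Baz"))   # all doc terms are in the query
    print(reverse_match(doc, "Foo Baz"))       # doc term Bar is not in the query
```

In this model mm is a threshold on how many query terms the document contains, while the requirement described above is a threshold on how many document terms the query contains, so no single mm value reproduces it.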
Re: Percolate feature?
Fine, then write the query that way:

+foo +bar baz

But it still doesn't sound as if any of this relates to prospective search/percolate.

-- Jack Krupansky
Re: Percolate feature?
Still not understanding. How do I know which words to require while searching? I want to search across all documents and return ones that have all of their terms matched.

: I came across the following from ElasticSearch (http://www.elasticsearch.org/guide/reference/api/percolate/) and it sounds like this may accomplish the above but haven't tested. I was wondering if Solr had something similar or an alternative way of accomplishing this?

Also never said this was Percolate, just looked similar
Re: Percolate feature?
Percolate does not search across documents, it searches across registered queries for a single input document. As such, it still seems irrelevant to your desire to search across all documents. You still haven't explained how you can't do what you want using basic, plain Lucene search.

Now, if all you really want is the ES percolate feature, as said, Solr doesn't have that - if you are sure that percolate really is what you need. But your use case still isn't clearly elaborated to the point where we can at least guess what you really need.

For reference: http://www.elasticsearch.org/guide/reference/api/percolate/

: The percolator allows to register queries against an index, and then send percolate requests which include a doc, and getting back the queries that match on that doc out of the set of registered queries.
:
: Think of it as the reverse operation of indexing and then searching. Instead of sending docs, indexing them, and then running queries. One sends queries, registers them, and then sends docs and finds out which queries match that doc.

But that's rather different from what you asked, wanting to match queries against all terms in the index.

-- Jack Krupansky
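As a rough illustration of the register-then-match flow that the Elasticsearch documentation describes, here is a toy Python sketch. It does not use or imitate the Elasticsearch API: `ToyPercolator`, its methods, and the sample queries are all hypothetical, and each registered "query" is assumed to be a simple all-terms-required conjunction over lowercase whitespace tokens.

```python
class ToyPercolator:
    """Stores queries up front, then reports which ones match a given document."""

    def __init__(self):
        self._queries = {}  # query id -> set of required terms

    def register(self, query_id, required_terms):
        """Register a query that matches when all of its terms are in the doc."""
        self._queries[query_id] = {t.lower() for t in required_terms}

    def percolate(self, doc_text):
        """Return the ids of all registered queries that match this document."""
        doc_terms = {t.lower() for t in doc_text.split()}
        return sorted(qid for qid, terms in self._queries.items()
                      if terms <= doc_terms)

if __name__ == "__main__":
    p = ToyPercolator()
    p.register("q-sony", ["Sony"])
    p.register("q-galaxy", ["Samsung", "Galaxy"])
    for doc in ["Sony Experia", "Samsung Galaxy S4", "Samsung 52inch LC"]:
        print(doc, "->", p.percolate(doc))
```

The roles are reversed compared with ordinary search: the stored items are queries, and the thing sent at request time is a document.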
Re: Percolate feature?
: Subject: Percolate feature?

can you give a more concrete, realistic example of what you are trying to do? your synthetic hypothetical example is kind of hard to make sense of.

your Subject line and comment that the percolate feature of elastic search sounds like what you want seems to have led some people down a path of assuming you want to run these types of queries as documents are indexed -- but that isn't at all clear to me from the way you worded your question other than that.

it's also not clear what aspect of the results you really care about -- are you only looking for the *number* of documents that match according to your concept of matching, or are you looking for a list of matches? what if multiple documents have all of their terms in the query string -- how should they score relative to each other? what if a document contains the same term multiple times, do you expect it to be a match of a query only if that term appears in the query multiple times as well? do you care about the ordering of the terms in the query? the ordering of the terms in the document?

Ideally: describe for us what you want to do, w/o assuming solr/elasticsearch/anything specific about the implementation -- just describe your actual use case for us, with several real document/query examples.

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing with "X", you are assuming "Y" will help you, and you are asking about "Y" without giving more details about the "X" so that we can understand the full issue. Perhaps the best solution doesn't involve "Y" at all?

See Also: http://www.perlmonks.org/index.pl?node_id=542341

-Hoss
Re: Percolate feature?
Cool!

On 08/05/2013 03:34 AM, Charlie Hull wrote:

: We've built something that implements this kind of reverse search for our clients in the media monitoring sector - we're working on releasing the core of this as open source very soon, hopefully in a month or two. It's based on Lucene.
:
: Just for reference it's able to apply tens of thousands of stored queries to a document per second (our clients often have very large and complex Boolean strings representing their clients' interests and may monitor hundreds of thousands of news stories every day). It also records the positions of every match. We suspect it's a lot faster and more flexible than Elasticsearch's Percolate feature.
Re: Percolate feature?
How difficult would it be to write percolate as an UpdateRequestProcessor? Is there a magic hook to parse and run query against single doc?

Regards,
Alex
Re: Percolate feature?
You seem to be mixing a couple of different concepts here. Prospective search or reverse search (sometimes called alerts) is a logistics matter, but how to match terms is completely different.

Solr does not have the exact "percolate" feature of ES, but your examples don't indicate a need for what percolate actually does.

"can match a user's query against all the terms in the index" - that's exactly what Lucene and Solr have done since Day One, for all queries. Percolate actually does the opposite - matches an input document against a registered set of queries - and doesn't match against indexed documents.

Solr does support Lucene's "min should match" feature so that you can specify, say, four query terms and return if at least two match. This is the "mm" parameter.

See: http://wiki.apache.org/solr/ExtendedDisMax#mm_.28Minimum_.27Should.27_Match.29

Try to clarify your requirements... or maybe min-should-match was all you needed?

-- Jack Krupansky