Our schema is pretty basic; nothing fancy going on here:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
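As a quick sanity check of what this index-time chain does to a munged title like "2013Sony", here is a very rough Python sketch. It is not the real Lucene filters: KStem stemming and the KeywordMarker/protected-words step are omitted, and RemoveDuplicatesTokenFilterFactory (which only removes duplicates at the same position) is approximated by simple de-duplication.

```python
import re

def analyze_sketch(text):
    """Rough approximation of the index analyzer above:
    whitespace tokenize -> lowercase -> WordDelimiterFilter with
    generateWordParts=1, generateNumberParts=1, preserveOriginal=1.
    Stemming and protected-word handling are deliberately omitted."""
    tokens = []
    for tok in text.lower().split():
        # Split on letter/digit transitions and non-alphanumerics,
        # as generateWordParts/generateNumberParts would.
        parts = re.findall(r"[a-z]+|[0-9]+", tok)
        for t in [tok] + parts:  # preserveOriginal=1 keeps the raw token
            if t not in tokens:  # stand-in for RemoveDuplicates
                tokens.append(t)
    return tokens

print(analyze_sketch("2013Sony"))          # ['2013sony', '2013', 'sony']
print(analyze_sketch("Sony Playstation"))  # ['sony', 'playstation']
```

This illustrates why the munged listing "2013Sony" still matches a query for "sony": the word-delimiter split emits "sony" as its own term.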
On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" <j...@basetechnology.com> wrote:

> Now we're getting somewhere!
>
> To (over-simplify), you simply want to know if a given "listing" would match
> a high-value pattern, either in a "clean" manner (obvious keywords) or in an
> "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams).
>
> To a large extent this also depends on how rich and powerful your end-user
> query support is. So, if the user searches for "sony", "samsung", or
> "apple", will it match some oddball listing that fuzzily matches those
> terms?
>
> So... tell us, how rich is your query interface? I mean, do you support
> wildcards, fuzzy queries, n-grams (e.g., can they type "son" or "sam" or
> "app", or... will "sony" match "sonblah-blah")?
>
> Reverse search may in fact be what you need in this case, since you
> literally do mean "if I index this document, will it match any of these
> queries" (but it doesn't score a hit on your direct check for whether it is
> a clean keyword match).
>
> In your previous examples you only gave clean product titles, not examples
> of circumventions of simple keyword matches.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Mark
> Sent: Saturday, August 10, 2013 6:24 PM
> To: solr-user@lucene.apache.org
> Cc: Chris Hostetter
> Subject: Re: Percolate feature?
>
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment)...
>>
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>>
>> queryA = "Sony Experia" ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC" ... doesn't match anything
>> queryD = "Samsung Galaxy S4" ... matches doc2
>> queryE = "Galaxy Samsung S4" ... matches doc2
>>
>> ...do I still have that correct?
>
> Yes
>
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores -- so in place of termfreq('words','Galaxy')
>> you'd want something like...
>
> Yes, we will be using non-trivial analysis. Now here's another twist: what
> if we don't care about scoring?
>
> Let's talk about the real use case. We are a marketplace that sells products
> that users have listed. For certain popular, high-risk, or restricted
> keywords we charge the seller an extra fee or ban the listing. We now have
> sellers purposely misspelling their listings to circumvent this fee. They
> will start adding suffixes to their product listings, such as "Sonies",
> knowing that it gets indexed down to "Sony" and thus matches a user's query
> for Sony. Or they will munge together numbers and products: "2013Sony". The
> same thing goes for adding crazy non-ASCII characters to the front of the
> keyword: "ΒSony". This is obviously a problem because we aren't charging for
> these keywords and, more importantly, it makes our search results look like
> shit.
>
> We would like to:
>
> 1) Detect when a certain keyword is in a product title at listing time so we
> can charge the seller. This was my idea of a "reverse search", although it
> sounds like I may have caused too much confusion with that term.
> 2) Attempt to autocorrect these titles, hence the need for highlighting, so
> we can try to replace the terms... this of course is done outside of Solr
> via an external service.
>
> Since we do some stemming (KStemmer) and filtering
> (WordDelimiterFilterFactory), conventional approaches such as regexes are
> quite troublesome. Regex is also quite slow, scales horribly, and always
> needs to be in lockstep with schema changes.
>
> Now knowing this, is there a good way to approach this?
>
> Thanks
>
>
> On Aug 9, 2013, at 11:56 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
>>
>> : I'll look into this. Thanks for the concrete example as I don't even
>> : know which classes to start to look at to implement such a feature.
>>
>> Either Roman isn't understanding what you are asking for, or I'm not -- but
>> I don't think what Roman described will work for you...
>>
>> : > so if your query contains no duplicates and all terms must match, you
>> : > can be sure that you are collecting docs only when the number of terms
>> : > matches the number of clauses in the query
>>
>> several of the examples you gave did not match what Roman is describing,
>> as I understand it. Most people on this thread seem to be getting
>> confused by having their perceptions "flipped" about what your "data known
>> in advance" is vs. the "data you get at request time".
>>
>> You described this...
>>
>> : >>>>> Product keyword: "Sony"
>> : >>>>> Product keyword: "Samsung Galaxy"
>> : >>>>>
>> : >>>>> We would like to be able to detect given a product title whether or
>> : >> not it
>> : >>>>> matches any known keywords. For a keyword to be matched all of its
>> : >> terms
>> : >>>>> must be present in the product title given.
>> : >>>>>
>> : >>>>> Product Title: "Sony Experia"
>> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>>
>> ...suggesting that what you call "product keywords" are the "data you know
>> about in advance" and "product titles" are the data you get at request
>> time.
>>
>> So your example of the "request time" input (ie: query) "Sony Experia"
>> matching "data known in advance" (ie: indexed document) "Sony" would not
>> work with Roman's example.
>>
>> To rephrase (what I think I understand is) your goal...
>>
>> * you have many (10*3+) documents known in advance
>> * any document D contains a set of words W(D) of varying sizes
>> * any request Q contains a set of words W(Q) of varying sizes
>> * you want a given request Q to match a document D if and only if:
>>   - W(D) is a subset of W(Q)
>>   - ie: no item exists in W(D) that does not exist in W(Q)
>>   - ie: any number of items may exist in W(Q) that are not in W(D)
>>
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment)...
>>
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>>
>> queryA = "Sony Experia" ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC" ... doesn't match anything
>> queryD = "Samsung Galaxy S4" ... matches doc2
>> queryE = "Galaxy Samsung S4" ... matches doc2
>>
>> ...do I still have that correct?
>>
>>
>> A similar question came up in the past, but I can't find my response now
>> so I'll try to recreate it...
>>
>>
>> 1) if you don't care about using non-trivial analysis (ie: you don't need
>> stemming, or synonyms, etc.), you can do this with some really simple
>> function queries -- assuming you index a field containing the number of
>> "words" in each document, in addition to the words themselves. Assuming
>> your words are in a field named "words" and the number of words is in a
>> field named "words_count", a request for something like "Galaxy Samsung S4"
>> can be represented as...
>>
>> q={!frange l=0 u=0}sub(words_count,
>>                        sum(termfreq('words','Galaxy'),
>>                            termfreq('words','Samsung'),
>>                            termfreq('words','S4')))
>>
>> ...ie: you want to compute the sum of the term frequencies for each of
>> the words requested, and then you want to subtract that sum from the
>> number of terms in the document -- and then you only want to match
>> documents where the result of that subtraction is 0.
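The frange request above can be built programmatically for an arbitrary word list. A minimal sketch, assuming the "words" and "words_count" field names from the example (the collection path in the URL is a placeholder; real query words would also need escaping, which this skips):

```python
from urllib.parse import urlencode

def subset_match_q(query_words):
    # One termfreq() per query word; a document matches only when every
    # one of its indexed words is accounted for, i.e. words_count minus
    # the summed term frequencies comes out to exactly 0.
    freqs = ",".join("termfreq('words','%s')" % w for w in query_words)
    return "{!frange l=0 u=0}sub(words_count,sum(%s))" % freqs

q = subset_match_q(["Galaxy", "Samsung", "S4"])
print(q)
# {!frange l=0 u=0}sub(words_count,sum(termfreq('words','Galaxy'),termfreq('words','Samsung'),termfreq('words','S4')))

# A full request URL (hypothetical collection name "listings"):
print("/solr/listings/select?" + urlencode({"q": q, "fl": "id,words"}))
```

Given the thread's example docs, this q for "Galaxy Samsung S4" would match doc2 ("Samsung Galaxy"): words_count is 2 and both termfreqs are 1, so the subtraction is 0.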
>>
>> one complexity that comes up is that you haven't specified:
>>
>> * can the list of words in your documents contain duplicates?
>> * can the list of words in your query contain duplicates?
>> * should a document with duplicate words match only if the query also
>>   contains the same word duplicated?
>>
>> ...the answers to those questions make the math more complicated (and are
>> left as an exercise for the reader)
>>
>>
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores -- so in place of termfreq('words','Galaxy')
>> you'd want something like...
>>
>> if(query({!field f=words v='Galaxy'}),1,0)
>>
>> ...but again the math gets much harder if you make things more complex
>> with duplicate words in the document or duplicate words in the query --
>> you'd probably have to use a custom similarity to get the scores returned
>> by the query() function to be usable as-is in the match equation (and drop
>> the "if()" function)
>>
>>
>> As for the highlighting part of the problem -- that becomes much easier --
>> independent of the queries you use to *match* the documents, you can then
>> specify an "hl.q" param with a much simpler query just containing the
>> basic list of words (as a simple boolean query, all clauses optional) and
>> let it highlight them in your list of words.
>>
>> -Hoss
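Putting Hoss's second variant and the hl.q suggestion together, the request for the analyzed case can be sketched the same way. This assumes the "words"/"words_count" fields from the thread and skips escaping of query values; the parameter set is one plausible shape of the request, not a tested recipe:

```python
def analyzed_subset_match_q(query_words):
    # query({!field ...}) runs each word through the field's query-time
    # analyzer (lowercasing, word-delimiter splitting, KStem, synonyms),
    # and if() flattens any positive score to 1, so the sum is a plain
    # count of matched words, comparable against words_count.
    clauses = ",".join(
        "if(query({!field f=words v='%s'}),1,0)" % w for w in query_words)
    return "{!frange l=0 u=0}sub(words_count,sum(%s))" % clauses

words = ["Galaxy", "Samsung", "S4"]
params = {
    "q": analyzed_subset_match_q(words),
    # Highlighting is decoupled from matching: hl.q is a simple
    # all-optional boolean query over the same words.
    "hl": "true",
    "hl.fl": "words",
    "hl.q": "words:(%s)" % " ".join(words),
}
print(params["q"])
```

As Hoss notes, this only holds when neither the document's word list nor the query contains duplicates; with duplicates the 0/1 flattening undercounts and the arithmetic needs rethinking.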