Our schema is pretty basic; nothing fancy going on here:
    
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protected.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
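
For reference, here's a quick way to sanity-check what that chain actually
produces for a suspect title. This is only a sketch against the stock field
analysis handler (assuming it's registered at /analysis/field as in the
example solrconfig; the core name "listings" is made up):

curl "http://localhost:8983/solr/listings/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=2013Sony+Sonies&wt=json&indent=true"

The response lists the tokens emitted after each tokenizer/filter stage, so we
can confirm whether something like "2013Sony" really ends up emitting a "sony"
token once WordDelimiterFilterFactory and KStem have run.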


On Aug 10, 2013, at 3:40 PM, "Jack Krupansky" <j...@basetechnology.com> wrote:

> Now we're getting somewhere!
> 
> To (over-)simplify: you simply want to know whether a given "listing" would
> match a high-value pattern, either in a "clean" manner (obvious keywords) or
> in an "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams).
> 
> To a large extent this also depends on how rich and powerful your end-user
> query support is. So, if the user searches for "sony", "samsung", or "apple",
> will it match some oddball listing that fuzzily matches those terms?
> 
> So... tell us how rich your query interface is. I mean, do you support
> wildcards, fuzzy queries, or n-grams (e.g., can they type "son" or "sam" or
> "app", or will "sony" match "sonblah-blah")?
> 
> Reverse-search may in fact be what you need in this case, since you literally
> do mean "if I index this document, will it match any of these queries?" (even
> when it doesn't score a hit on your direct check for whether it is a clean
> keyword match).
> 
> In your previous examples you only gave clean product titles, not examples of 
> circumventions of simple keyword matches.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Mark
> Sent: Saturday, August 10, 2013 6:24 PM
> To: solr-user@lucene.apache.org
> Cc: Chris Hostetter
> Subject: Re: Percolate feature?
> 
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment)...
>> 
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>> 
>> queryA = "Sony Experia"       ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC"  ... doesn't match anything
>> queryD = "Samsung Galaxy S4"  ... matches doc2
>> queryE = "Galaxy Samsung S4"  ... matches doc2
>> 
>> 
>> ...do I still have that correct?
> 
> Yes
> 
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores - so in place of termfreq('words','Galaxy')
>> you'd want something like...
> 
> 
> Yes we will be using non-trivial analysis. Now here's another twist… what if
> we don't care about scoring?
> 
> 
> Let's talk about the real use case. We are a marketplace that sells products
> that users have listed. For certain popular, high-risk, or restricted keywords
> we charge the seller an extra fee or ban the listing. We now have sellers
> purposely misspelling their listings to circumvent this fee. They will start
> adding suffixes to their product listings such as "Sonies", knowing that it
> gets indexed down to "Sony" and thus matches a user's query for Sony. Or they
> will munge together numbers and products… "2013Sony". The same thing goes for
> adding crazy non-ASCII characters to the front of the keyword, e.g. "ΒSony".
> This is obviously a problem because we aren't charging for these keywords and,
> more importantly, it makes our search results look like shit.
> 
> We would like to:
> 
> 1) Detect when a certain keyword is in a product title at listing time so we
> may charge the seller. This was my idea of a "reverse search", although it
> sounds like I may have caused too much confusion with that term.
> 2) Attempt to autocorrect these titles, hence the need for highlighting so we
> can try to replace the terms… this is of course done outside of Solr via an
> external service.
> 
> Since we do some stemming (KStemmer) and filtering
> (WordDelimiterFilterFactory), conventional approaches such as regex are quite
> troublesome. Regex is also quite slow, scales horribly, and always needs to be
> kept in lockstep with schema changes.
> 
> Now knowing this, is there a good way to approach this?
> 
> Thanks
> 
> 
> On Aug 9, 2013, at 11:56 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> 
>> 
>> : I'll look into this. Thanks for the concrete example as I don't even
>> : know which classes to start to look at to implement such a feature.
>> 
>> Either Roman isn't understanding what you are asking for, or I'm not -- but
>> I don't think what Roman described will work for you...
>> 
>> : > so if your query contains no duplicates and all terms must match, you
>> : > can be sure that you are collecting docs only when the number of terms
>> : > matches the number of clauses in the query
>> 
>> several of the examples you gave did not match what Roman is describing,
>> as I understand it.  Most people on this thread seem to be getting
>> confused by having their perceptions "flipped" about what your "data known
>> in advance" is vs the "data you get at request time".
>> 
>> You described this...
>> 
>> : >>>>> Product keyword:  "Sony"
>> : >>>>> Product keyword:  "Samsung Galaxy"
>> : >>>>>
>> : >>>>> We would like to be able to detect given a product title whether or
>> : >>>>> not it matches any known keywords. For a keyword to be matched all
>> : >>>>> of its terms must be present in the product title given.
>> : >>>>>
>> : >>>>> Product Title: "Sony Experia"
>> : >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"
>> 
>> ...suggesting that what you call "product keywords" are the "data you know
>> about in advance" and "product titles" are the data you get at request
>> time.
>> 
>> So your example of the "request time" input (ie: query) "Sony Experia"
>> matching the "data known in advance" (ie: indexed document) "Sony" would not
>> work with Roman's example.
>> 
>> To rephrase (what I think I understand is) your goal...
>> 
>> * you have many (10*3+) documents known in advance
>> * any document D contains a set of words W(D) of varying sizes
>> * any request Q contains a set of words W(Q) of varying sizes
>> * you want a given request Q to match a document D if and only if:
>>  - W(D) is a subset of W(Q)
>>  - ie: no item exists in W(D) that does not exist in W(Q)
>>  - ie: any number of items may exist in W(Q) that are not in W(D)
>> 
>> So to reiterate your examples from before, but change the "labels" a
>> bit and add some more converse examples (and ignore the "highlighting"
>> aspect for a moment)...
>> 
>> doc1 = "Sony"
>> doc2 = "Samsung Galaxy"
>> doc3 = "Sony Playstation"
>> 
>> queryA = "Sony Experia"       ... matches only doc1
>> queryB = "Sony Playstation 3" ... matches doc3 and doc1
>> queryC = "Samsung 52inch LC"  ... doesn't match anything
>> queryD = "Samsung Galaxy S4"  ... matches doc2
>> queryE = "Galaxy Samsung S4"  ... matches doc2
>> 
>> 
>> ...do I still have that correct?
>> 
>> 
>> A similar question came up in the past, but I can't find my response now
>> so I'll try to recreate it ...
>> 
>> 
>> 1) if you don't care about using non-trivial analysis (ie: you don't need
>> stemming, or synonyms, etc..), you can do this with some
>> really simple function queries -- assuming you index a field containing
>> the number of "words" in each document, in addition to the words
>> themselves.  Assuming your words are in a field named "words" and the
>> number of words is in a field named "words_count", a request for something
>> like "Galaxy Samsung S4" can be represented as...
>> 
>> q={!frange l=0 u=0}sub(words_count,
>>                        sum(termfreq('words','Galaxy'),
>>                            termfreq('words','Samsung'),
>>                            termfreq('words','S4')))
>> 
>> ...ie: you want to compute the sum of the term frequencies for each of
>> the words requested, and then you want to subtract that sum from the
>> number of terms in the document -- and then you only want to match
>> documents where the result of that subtraction is 0.
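
If I'm following #1 correctly, the full request for one of our titles would
look roughly like this -- the core name "listings" is made up, I've skipped
URL-encoding for readability, and the terms passed to termfreq() have to match
the raw indexed terms exactly (e.g. already lowercased if the field lowercases):

  http://localhost:8983/solr/listings/select?fl=id,words
      &q={!frange l=0 u=0}sub(words_count,
                              sum(termfreq('words','galaxy'),
                                  termfreq('words','samsung'),
                                  termfreq('words','s4')))
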
>> 
>> one complexity that comes up is that you haven't specified:
>> 
>> * can the list of words in your documents contain duplicates?
>> * can the list of words in your query contain duplicates?
>> * should a document with duplicate words match only if the query also
>> contains the same word duplicated?
>> 
>> ...the answers to those questions make the math more complicated (and are
>> left as an exercise for the reader)
>> 
>> 
>> 2) if you *do* care about using non-trivial analysis, then you can't use
>> the simple "termfreq()" function, which deals with raw terms -- instead
>> you have to use the "query()" function to ensure that the input is parsed
>> appropriately -- but then you have to wrap that function in something that
>> will normalize the scores - so in place of termfreq('words','Galaxy')
>> you'd want something like...
>> 
>>           if(query({!field f=words v='Galaxy'}),1,0)
>> 
>> ...but again the math gets much harder if you make things more complex
>> with duplicate words in the document or duplicate words in the query -- you'd
>> probably have to use a custom similarity to get the scores returned by the
>> query() function to be usable as is in the match equation (and drop the
>> "if()" function)
>> 
>> 
>> As for the highlighting part of the problem -- that becomes much easier --
>> independent of the queries you use to *match* the documents, you can then
>> specify a "hl.q" param to specify a much simpler query just containing the
>> basic list of words (as a simple boolean query, all clauses optional) and
>> let it highlight them in your list of words.
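
And for the highlighting piece, that would presumably just be extra parameters
on the same request, e.g. (sketch only; assuming the word list lives in the
"words" field):

  hl=true&hl.fl=words&hl.q=words:(Galaxy Samsung S4)

...so the matching logic and the highlight query stay independent.
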
>> 
>> 
>> -Hoss 
> 
