Now we're getting somewhere!

To over-simplify, you want to know whether a given "listing" would match a high-value pattern, either in a "clean" manner (obvious keywords) or in an "unclean" manner (e.g., fuzzy keyword matching, stemming, n-grams).

To a large extent this also depends on how rich and powerful your end-user query support is. So, if the user searches for "sony", "samsung", or "apple", will it match some oddball listing that fuzzily matches those terms?

So... tell us how rich your query interface is. I mean, do you support wildcards, fuzzy queries, n-grams (e.g., can they type "son" or "sam" or "app", or... will "sony" match "sonblah-blah")?

Reverse-search may in fact be what you need in this case, since you literally do mean "if I index this document, will it match any of these queries?" (even when it doesn't score a hit on your direct check for a clean keyword match).

In your previous examples you only gave clean product titles, not examples of circumventions of simple keyword matches.

-- Jack Krupansky

-----Original Message-----
From: Mark
Sent: Saturday, August 10, 2013 6:24 PM
To: solr-user@lucene.apache.org
Cc: Chris Hostetter
Subject: Re: Percolate feature?

> So to reiterate your examples from before, but change the "labels" a
> bit and add some more converse examples (and ignore the "highlighting"
> aspect for a moment)...
>
> doc1 = "Sony"
> doc2 = "Samsung Galaxy"
> doc3 = "Sony Playstation"
>
> queryA = "Sony Experia"       ... matches only doc1
> queryB = "Sony Playstation 3" ... matches doc3 and doc1
> queryC = "Samsung 52inch LC"  ... doesn't match anything
> queryD = "Samsung Galaxy S4"  ... matches doc2
> queryE = "Galaxy Samsung S4"  ... matches doc2
>
> ...do i still have that correct?

Yes

> 2) if you *do* care about using non-trivial analysis, then you can't use
> the simple "termfreq()" function, which deals with raw terms -- instead
> you have to use the "query()" function to ensure that the input is parsed
> appropriately -- but then you have to wrap that function in something that
> will normalize the scores - so in place of termfreq('words','Galaxy')
> you'd want something like...


Yes, we will be using non-trivial analysis. Now here's another twist… what if we don't care about scoring?


Let's talk about the real use case. We are a marketplace that sells products that users have listed. For certain popular, high-risk, or restricted keywords we charge the seller an extra fee or ban the listing. We now have sellers purposely misspelling their listings to circumvent this fee. They will add suffixes to their product listings, such as "Sonies", knowing that it gets indexed down to "Sony" and thus matches a user's query for Sony. Or they will munge together numbers and products… "2013Sony". The same goes for adding crazy non-ASCII characters to the front of the keyword: "ΒSony". This is obviously a problem because we aren't charging for these keywords and, more importantly, it makes our search results look like shit.

We would like to:

1) Detect when a certain keyword is in a product title at listing time so we may charge the seller. This was my idea of a "reverse search", although it sounds like I may have caused too much confusion with that term.

2) Attempt to autocorrect these titles, hence the need for highlighting, so we can try to replace the terms… this is of course done outside of Solr via an external service.

Since we do some stemming (KStemmer) and filtering (WordDelimiterFilterFactory), conventional approaches such as regex are quite troublesome. Regex is also quite slow, scales horribly, and always needs to be kept in lockstep with schema changes.

Now knowing this, is there a good way to approach this?

Thanks


On Aug 9, 2013, at 11:56 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:


: I'll look into this. Thanks for the concrete example as I don't even
: know which classes to start to look at to implement such a feature.

Either Roman isn't understanding what you are asking for, or i'm not -- but i don't think what Roman described will work for you...

: > so if your query contains no duplicates and all terms must match, you can
: > be sure that you are collecting docs only when the number of terms matches
: > number of clauses in the query

several of the examples you gave did not match what Roman is describing,
as i understand it.  Most people on this thread seem to be getting
confused by having their perceptions "flipped" about what your "data known
in advance is" vs the "data you get at request time".

You described this...

: >>>>> Product keyword:  "Sony"
: >>>>> Product keyword:  "Samsung Galaxy"
: >>>>>
: >>>>> We would like to be able to detect given a product title whether or
: >> not it
: >>>>> matches any known keywords. For a keyword to be matched all of it's
: >> terms
: >>>>> must be present in the product title given.
: >>>>>
: >>>>> Product Title: "Sony Experia"
: >>>>> Matches and returns a highlight: "<em>Sony</em> Experia"

...suggesting that what you call "product keywords" are the "data you know
about in advance" and "product titles" are the data you get at request
time.

So your example of the "request time" input (ie: query) "Sony Experia"
matching the "data known in advance" (ie: indexed document) "Sony" would not
work with Roman's example.

To rephrase (what i think i understand is) your goal...

* you have many (10*3+) documents known in advance
* any document D contains a set of words W(D) of varying sizes
* any request Q contains a set of words W(Q) of varying sizes
* you want a given request Q to match a document D if and only if:
  - W(D) is a subset of W(Q)
  - ie: no item exists in W(D) that does not exist in W(Q)
  - ie: any number of items may exist in W(Q) that are not in W(D)

So to reiterate your examples from before, but change the "labels" a
bit and add some more converse examples (and ignore the "highlighting"
aspect for a moment)...

doc1 = "Sony"
doc2 = "Samsung Galaxy"
doc3 = "Sony Playstation"

queryA = "Sony Experia"       ... matches only doc1
queryB = "Sony Playstation 3" ... matches doc3 and doc1
queryC = "Samsung 52inch LC"  ... doesn't match anything
queryD = "Samsung Galaxy S4"  ... matches doc2
queryE = "Galaxy Samsung S4"  ... matches doc2


...do i still have that correct?
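
The rule and examples above can be checked with a small sketch. It is plain Python, not anything Solr does internally; the helper names and the lowercase/whitespace tokenization are assumptions made only for illustration, and stemming or other analysis is ignored.

```python
def words(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return set(text.lower().split())


def matches(doc, query):
    """True when W(D) is a subset of W(Q): every document word is in the query."""
    return words(doc) <= words(query)


# The "data known in advance" from the thread.
docs = {
    "doc1": "Sony",
    "doc2": "Samsung Galaxy",
    "doc3": "Sony Playstation",
}

# The "request time" inputs from the thread.
queries = {
    "queryA": "Sony Experia",
    "queryB": "Sony Playstation 3",
    "queryC": "Samsung 52inch LC",
    "queryD": "Samsung Galaxy S4",
    "queryE": "Galaxy Samsung S4",
}


def matching_docs(query):
    """Names of all documents whose words are all present in the query."""
    return sorted(name for name, text in docs.items() if matches(text, query))


for label, query in queries.items():
    print(label, matching_docs(query))
```

Under these assumptions the sketch reproduces the expected matches listed above, including the empty result for queryC.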


A similar question came up in the past, but i can't find my response now
so i'll try to recreate it ...


1) if you don't care about using non-trivial analysis (ie: you don't need
stemming, or synonyms, etc..), you can do this with some
really simple function queries -- assuming you index a field containing
the number of "words" in each document, in addition to the words
themselves.  Assuming your words are in a field named "words" and the
number of words is in a field named "words_count" a request for something
like "Galaxy Samsung S4" can be represented as...

 q={!frange l=0 u=0}sub(words_count,
                        sum(termfreq('words','Galaxy'),
                            termfreq('words','Samsung'),
                            termfreq('words','S4')))

...ie: you want to compute the sum of the term frequencies for each of
the words requested, and then you want to subtract that sum from the
number of terms in the document -- and then you only want to match
documents where the result of that subtraction is 0.
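
A client would typically have to assemble that q string from the words of each incoming title. The following is a minimal sketch of one way to do it in Python; the function name is hypothetical, the field names "words" and "words_count" are the ones assumed above, and the output has not been run against a live Solr instance.

```python
def subset_match_query(query_words, words_field="words", count_field="words_count"):
    """Build the {!frange} query string for a list of raw (unanalyzed) words.

    Hypothetical helper: it only concatenates strings and does no Solr escaping,
    so it rejects words containing a single quote instead of guessing.
    """
    if not query_words:
        raise ValueError("at least one word is required")
    for word in query_words:
        if "'" in word:
            raise ValueError("single quotes are not handled in this sketch: %r" % word)
    freqs = ",".join(
        "termfreq('%s','%s')" % (words_field, word) for word in query_words
    )
    # Matches only documents where words_count minus the summed frequencies is 0.
    return "{!frange l=0 u=0}sub(%s,sum(%s))" % (count_field, freqs)


print(subset_match_query(["Galaxy", "Samsung", "S4"]))
```

For "Galaxy Samsung S4" this prints the same expression as above, on one line and with balanced parentheses.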

one complexity that comes up, is that you haven't specified:

 * can the list of words in your documents contain duplicates?
 * can the list of words in your query contain duplicates?
 * should a document with duplicate words match only if the query also
contains the same word duplicated?

...the answers to those questions make the math more complicated (and are
left as an exercise for the reader)
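
To see why duplicates matter, the arithmetic can be emulated outside Solr. This is a plain-Python sketch, not Solr code: it assumes "words_count" counts every word occurrence in the document and that termfreq() is evaluated once per word listed in the request, duplicates included.

```python
from collections import Counter


def frange_value(doc_words, query_words):
    """Emulate sub(words_count, sum(termfreq(...), ...)) for one document."""
    term_freq = Counter(doc_words)      # raw term frequencies in the document
    words_count = len(doc_words)        # assumption: counts every occurrence
    return words_count - sum(term_freq[word] for word in query_words)


# No duplicates: the document matches exactly when the value is 0.
print(frange_value(["Samsung", "Galaxy"], ["Galaxy", "Samsung", "S4"]))  # 0

# Duplicate in the request: "Sony" is counted twice, so the value is -1
# and the document is missed even though its only word is in the request.
print(frange_value(["Sony"], ["Sony", "Sony", "Experia"]))               # -1

# Duplicate in the request can also cause a false match: the value is 0
# although "Galaxy" never appears in the request.
print(frange_value(["Samsung", "Galaxy"], ["Samsung", "Samsung"]))       # 0
```

De-duplicating the request words before building the query avoids the last two cases; duplicates inside the indexed documents still need a decision on the questions above.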


2) if you *do* care about using non-trivial analysis, then you can't use
the simple "termfreq()" function, which deals with raw terms -- instead
you have to use the "query()" function to ensure that the input is parsed
appropriately -- but then you have to wrap that function in something that
will normalize the scores - so in place of termfreq('words','Galaxy')
you'd want something like...

           if(query({!field f=words v='Galaxy'}),1,0)

...but again the math gets much harder if you make things more complex
with duplicate words in the document or duplicate words in the query -- you'd
probably have to use a custom similarity to get the scores returned by the
query() function to be usable as is in the match equation (and drop the
"if()" function)
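
Put together, the analyzed variant swaps each termfreq() call for the if(query(...),1,0) wrapper inside the same frange expression. A minimal Python sketch of assembling it follows; the function name is hypothetical, the field names are the ones assumed earlier, duplicates are dropped on the client side as one possible simplification, and the resulting string is untested against a live Solr.

```python
def analyzed_subset_match_query(query_words, words_field="words",
                                count_field="words_count"):
    """Build the frange query using query() so each word goes through analysis.

    Hypothetical helper: no Solr escaping is done, so words containing a
    single quote are rejected rather than silently producing a broken query.
    """
    unique_words = list(dict.fromkeys(query_words))  # drop duplicates, keep order
    if not unique_words:
        raise ValueError("at least one word is required")
    for word in unique_words:
        if "'" in word:
            raise ValueError("single quotes are not handled in this sketch: %r" % word)
    # Each clause contributes 1 if the analyzed word matches the document, else 0.
    clauses = ",".join(
        "if(query({!field f=%s v='%s'}),1,0)" % (words_field, word)
        for word in unique_words
    )
    return "{!frange l=0 u=0}sub(%s,sum(%s))" % (count_field, clauses)


print(analyzed_subset_match_query(["Galaxy", "Samsung", "S4"]))
```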


As for the highlighting part of the problem -- that becomes much easier --
independent of the queries you use to *match* the documents, you can then
specify a "hl.q" param to specify a much simpler query just containing the
basic list of words (as a simple boolean query, all clauses optional) and
let it highlight them in your list of words.
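
As a rough illustration, the request parameters might be put together like this. It is a sketch in Python with a hypothetical helper name; it assumes the field is called "words" and that the default operator for "hl.q" is OR, so that all clauses are optional.

```python
from urllib.parse import urlencode


def highlight_params(match_query, query_words, words_field="words"):
    """Combine the matching query with a simpler hl.q used only for highlighting."""
    return {
        "q": match_query,                 # whatever query decides which docs match
        "hl": "true",                     # turn highlighting on
        "hl.fl": words_field,             # highlight in the list-of-words field
        # Simple boolean query over the same words; optional clauses assume q.op=OR.
        "hl.q": "%s:(%s)" % (words_field, " ".join(query_words)),
    }


match_query = (
    "{!frange l=0 u=0}sub(words_count,"
    "sum(termfreq('words','Galaxy'),"
    "termfreq('words','Samsung'),"
    "termfreq('words','S4')))"
)
params = highlight_params(match_query, ["Galaxy", "Samsung", "S4"])
print(urlencode(params))  # URL-encoded query string for the select request
```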







-Hoss
