: Let's talk about the real use case. We are a marketplace that sells 
: products that users have listed. For certain popular, high-risk, or 
: restricted keywords we charge the seller an extra fee or ban the listing. 
: We now have sellers purposely misspelling their listings to circumvent 
: this fee. They will add suffixes to their product listings such 
: as "Sonies", knowing that it gets stemmed down to "Sony" at index time 
: and thus matches a user's query for Sony. Or they will munge together 
: numbers and products... "2013Sony". The same goes for prepending crazy 
: non-ASCII characters to the keyword: "ΒSony". This is obviously a 
: problem because we aren't charging for these keywords and, more 
: importantly, it makes our search results look like shit.

: 1) Detect when a certain keyword is in a product title at listing time 
: so we may charge the seller. This was my idea of a "reverse search", 
: although it sounds like I may have caused too much confusion with that term.

Ok ... with the concrete specifics of your situation in mind, I can think 
of 2 completely different approaches -- depending on how precise you need 
to be about your definition of a "match" and how you want to deal with 
ongoing maintenance as your system evolves...

## Approach #1 - NRT index & searching w/custom plugin

Even if you have 1000-5000 of these special queries you need to check, 
a custom component executing those 1000-5000 queries should be very fast 
against a small index where most of the queries won't match anything -- 
especially if you write the component so that it pre-parses them into 
Query objects and hangs onto them in memory.

(As a sample data point: with the 32 sample docs from Solr 4.x, I 
configured a request handler with 5000 unique facet.query defaults using 
the {!field} qparser.  Most of these facet queries didn't match anything, 
but a handful matched one of the sample documents.  With completely 
cold caches, these 5000 facet queries had a QTime of 502ms on my laptop -- 
and that includes parsing all 5000 query strings.)

So imagine you wrote a custom SearchComponent that could read your X 
special queries from some remote database on init (and re-load them on 
command) and parse them into Query objects, which it then holds onto in 
some kind of data structure that also tracks why you care about each of 
them (ie: charge 10% more, banned, etc...).  At query time, your custom 
component would filter the main result set of docs against these queries 
to look for matches that should be reported (along with the metadata 
about the queries that match), and could also inspect the results of any 
query that matches and generate highlighting for each query+doc pair that 
matches.  You would then register this custom search component in a 
special "validation" solr core that is otherwise configured exactly the 
same as your regular production index.
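
A bare-bones sketch of what such a component might look like (the class 
name, the "validationMatches" response key, and the rule-loading hook are 
all hypothetical, and error handling is omitted):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.search.Query;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.QParser;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.SyntaxError;

public class ListingValidationComponent extends SearchComponent {

  // each pre-parsed special query, mapped to why you care about it
  // (ie: "charge 10% more", "banned", ...); swapped atomically on re-load
  private volatile Map<Query,String> specialQueries = new HashMap<>();

  /** Re-parse the rules; call on init and whenever a re-load is requested. */
  void loadQueries(Map<String,String> rawRules, SolrQueryRequest req)
      throws SyntaxError {
    // rawRules would come from your remote database: query string => action
    Map<Query,String> parsed = new HashMap<>();
    for (Map.Entry<String,String> rule : rawRules.entrySet()) {
      parsed.put(QParser.getParser(rule.getKey(), null, req).getQuery(),
                 rule.getValue());
    }
    specialQueries = parsed;
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to do before the main query runs
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    SolrIndexSearcher searcher = rb.req.getSearcher();
    DocSet mainResults = rb.getResults().docSet;
    NamedList<Object> report = new NamedList<>();
    for (Map.Entry<Query,String> e : specialQueries.entrySet()) {
      // intersect each special query with the main result set
      DocSet hits = searcher.getDocSet(e.getKey(), mainResults);
      if (hits.size() > 0) {
        report.add(e.getKey().toString(), e.getValue());
      }
    }
    rb.rsp.add("validationMatches", report);
  }

  @Override
  public String getDescription() {
    return "reports which special queries match the current result set";
  }
}
```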

When a client says "here are the Y products I want to add" you would 
(see the SolrJ sketch after this list)...

 1) index those Y products into your validation solr core using 
softCommit=true&openSearcher=true
 2) execute a query using your special search component, filtered to just 
the list of Y unique ids of the products the client just gave you (that 
way you can handle concurrent requests from different clients w/o false 
positives)
 3) use the results of that query to tell your client things like "product 
#123 matches 'Sony' so we are charging you more; and product #456 matches 
'Porn' so we are rejecting it"
 4) only when done would you re-index those products into your "real" 
index.
 5) help keep your "validating" index small by also doing a deleteById on 
all of that batch of Y docs when you are done validating.
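
In SolrJ terms the whole round trip might look something like this (the 
"/validate" handler name and the "validationMatches" response key match 
the hypothetical component sketched above):

```java
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.NamedList;

public class ValidationFlow {
  /** Validate a batch of Y products against the "validation" core. */
  public static NamedList<?> validate(SolrClient validationCore,
                                      List<SolrInputDocument> products,
                                      List<String> ids) throws Exception {
    // 1) index the batch with a soft commit that opens a new searcher
    UpdateRequest update = new UpdateRequest();
    update.add(products);
    update.setParam("softCommit", "true");
    update.setParam("openSearcher", "true");
    update.process(validationCore);

    // 2) hit the handler containing the custom component, filtered to
    //    just this batch's ids so concurrent clients can't collide
    SolrQuery q = new SolrQuery("*:*");
    q.setRequestHandler("/validate");
    q.addFilterQuery("{!terms f=id}" + String.join(",", ids));
    QueryResponse rsp = validationCore.query(q);

    // 5) keep the validation core small: delete the batch once validated
    validationCore.deleteById(ids);
    validationCore.commit();

    // 3) hand the component's report back to the caller, which tells the
    //    client what matched and (step 4) re-indexes the accepted
    //    products into the "real" core
    return (NamedList<?>) rsp.getResponse().get("validationMatches");
  }
}
```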


The upside of this approach is that it helps you ensure the validation 
logic you apply to products when you get them from clients *exactly* 
matches your real queries, even if your schema & analysis evolve over 
time.  The downside is that it's a decent amount of custom plugin code 
you need to write upfront, and it will get slower if/when the number of 
special validation queries increases.

## Approach #2 - Approximate things with a reverse search

Build a small index where each document contains the text of one 
of your special queries copied into multiple fields with a variety of 
analysis options configured (in particular: I suspect using shingles would 
be fruitful here).  Set up a query structure that uses functions to 
combine the scores of many queries against each of those fields -- this 
might be simple addition, or you might want it to be conditional, ie: 
maybe you multiply the sum of the scores of some queries against the 
loosely-analyzed fields by the score of a query against a very strict 
field to eliminate false positives.
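
For example, a sketch only -- the rule_* field names are hypothetical, 
and the $param indirection is standard function-query dereferencing.  
Because the loose scores are multiplied by the strict score, any title 
that can't pass the strict field scores 0 overall:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class RuleQueryBuilder {
  /** Score an incoming product title against several differently-analyzed
      copies of the rule text. */
  public static SolrQuery buildRuleQuery(String title) {
    SolrQuery q = new SolrQuery();
    // sum the "loose" evidence, then multiply by the strict match
    q.setQuery("{!func}product(sum(query($shingles),query($loose)),query($strict))");
    q.set("shingles", "{!field f=rule_shingles v=$title}"); // shingled copy
    q.set("loose",    "{!field f=rule_loose v=$title}");    // permissive analysis
    q.set("strict",   "{!field f=rule_strict v=$title}");   // strict analysis
    q.set("title", title);
    return q;
  }
}
```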

Experiment a bit to see what kinds of inputs get you what kinds of scores, 
and maybe associate a "threshold" with each document, which you index as a 
numeric field on those docs, and then fold that threshold value into your 
calculation using the {!frange} parser to make sure you only count matches 
that score above the "threshold" of the document they match against.
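
Building on the previous sketch, and assuming a numeric "threshold" field 
on each rule document, that might look like:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class ThresholdedRuleQuery {
  /** Like buildRuleQuery, but only matches rule docs whose combined
      score meets the per-document "threshold" they were indexed with. */
  public static SolrQuery build(String title) {
    SolrQuery q = RuleQueryBuilder.buildRuleQuery(title); // sets $shingles etc.
    q.set("score",
        "product(sum(query($shingles),query($loose)),query($strict))");
    q.setQuery("{!frange l=0}sub($score,field(threshold))");
    return q;
  }
}
```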

Then when your client gives you Y documents, send each one against your 
little "reverse" index and see if it matches anything; if so, report back 
to the client what it matched.
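
ie: something like (again, the "keyword" and "action" fields on the rule 
docs are hypothetical):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ReverseSearchCheck {
  /** Report every rule a product title trips in the "reverse" index. */
  public static void check(SolrClient ruleIndex, String productId,
                           String title) throws Exception {
    QueryResponse rsp = ruleIndex.query(ThresholdedRuleQuery.build(title));
    for (SolrDocument rule : rsp.getResults()) {
      System.out.printf("product %s matched '%s' => %s%n", productId,
          rule.getFieldValue("keyword"), rule.getFieldValue("action"));
    }
  }
}
```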


The upside of this approach is that it doesn't require implementing 
any custom java plugins, and it scales better if you expect the number of 
special "queries" to grow w/o bound.  The downside is that it's really 
just giving you a heuristic to indicate that something *might* be a match, 
and you will have to constantly tune & adjust it if/when the analysis or 
query structure of your production search system evolves.



-Hoss
