: Let's talk about the real use case. We are a marketplace that sells
: products that users have listed. For certain popular, high risk or
: restricted keywords we charge the seller an extra fee/ban the listing.
: We now have sellers purposely misspelling their listings to circumvent
: this fee. They will start adding suffixes to their product listings such
: as "Sonies" knowing that it gets indexed down to "Sony" and thus
: matching a user's query for Sony. Or they will munge together numbers and
: products… "2013Sony". Same thing goes for adding crazy non-ascii
: characters to the front of the keyword "ΒSony". This is obviously a
: problem because we aren't charging for these keywords and more
: importantly it makes our search results look like shit.

: 1) Detect when a certain keyword is in a product title at listing time
: so we may charge the seller. This was my idea of a "reverse search"
: although sounds like I may have caused too much confusion with that term.

Ok ... with the concrete specifics of your situation in mind, I can
think of 2 completely different approaches -- depending on how precise
you need to be about your definition of a "match" and how you want to
deal with ongoing maintenance as your system evolves...

## Approach #1 - NRT index & searching w/custom plugin

Even if you have 1000-5000 of these special queries you need to check,
a custom component to execute those 1000-5000 queries should be very
fast against a small index where most of the queries won't match
anything -- especially if you write a custom component that pre-parses
them into Query objects and hangs onto them in memory.

(As a sample data point: with the 32 sample docs from Solr 4.x, I
configured a request handler with 5000 unique facet.query defaults
using the {!field} qparser. Most of these facet queries didn't match
anything, but a handful of them matched one of the sample documents.
With completely cold caches, these 5000 facet queries had a QTime of
502ms on my laptop -- and that includes parsing all 5000 query strings.)

So imagine if you wrote a custom SearchComponent that could read your X
special queries from some remote database on init (and re-load them on
command) and parse them into Queries, which it then holds on to in some
kind of data structure that also tracks why you care about them (ie:
charge 10% more, banned, etc...). At query time, your custom component
would filter the main result set of docs against these queries to look
for matches that should be reported (along with the metadata about the
queries that match), and could also inspect the results of any query
that matches and generate highlighting for each query+doc pair that
matches.

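To make the "hold the parsed queries in memory along with why you care
about them" idea concrete, here is a rough sketch of just the data
structure. It is a toy in plain Python, not a Solr plugin: the real
thing would be a Java SearchComponent holding Lucene Query objects.
SpecialRule, RuleRegistry and analyze() are names made up for
illustration, and analyze() is only a crude stand-in for your
production analysis chain.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class SpecialRule:
    keyword: str  # the restricted term, e.g. "Sony"
    action: str   # why we care about it, e.g. "surcharge" or "ban"


def analyze(text: str) -> Set[str]:
    """Crude stand-in for real analysis: lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class RuleRegistry:
    """Loads the rules once, pre-processes them, and keeps them in memory."""

    def __init__(self, loader: Callable[[], Iterable[SpecialRule]]):
        self._loader = loader  # hypothetical hook, e.g. a database read
        self._rules: List[Tuple[SpecialRule, Set[str]]] = []
        self.reload()

    def reload(self) -> None:
        # "Parse" every rule up front so nothing is parsed per request.
        self._rules = [(rule, analyze(rule.keyword)) for rule in self._loader()]

    def check(self, docs: Dict[str, str]) -> List[Tuple[str, SpecialRule]]:
        """Return (doc id, rule) for every rule whose tokens all occur in a title."""
        hits = []
        for doc_id, title in docs.items():
            tokens = analyze(title)
            for rule, rule_tokens in self._rules:
                if rule_tokens and rule_tokens <= tokens:
                    hits.append((doc_id, rule))
        return hits


registry = RuleRegistry(lambda: [SpecialRule("Sony", "surcharge"),
                                 SpecialRule("Porn", "ban")])
matches = registry.check({"123": "Sony Walkman, boxed", "456": "Garden hose"})
```

Note that this toy analyzer would not catch "Sonies" or "2013Sony",
which is exactly why the real component should run the queries against
an index that uses your production schema and analysis.
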
You would then register this custom search component in a special
"validation" solr core that is otherwise configured exactly the same as
your regular production index. When a client says "here's my Y products
I want to add" you would...

1) index those Y products into your validation solr core using
   softCommit=true&openSearcher=true
2) execute a query using your special search component, filtered to
   just the list of Y unique ids of the products the client just gave
   you (that way you can handle concurrent requests from different
   clients w/o false positives)
3) use the results of that query to tell your client things like
   "product #123 matches 'Sony' so we are charging you more; and
   product #456 matches 'Porn' so we are rejecting it"
4) only when done, re-index those products into your "real" index.
5) help keep your "validation" index small by also doing a deleteById
   on all of that batch of Y docs when you are done validating.

The upside of this approach is that it helps you ensure the validation
logic you apply to products when you get them from clients *exactly*
matches your real queries, even if your schema & analysis evolve over
time. The downside is that it's a decent amount of custom plugin code
you need to write upfront, and it will get slower if/when the number of
special validation queries increases.

## Approach #2 - Approximate things with a reverse search

Build a small index where each document contains the text of one of
your special queries copied into multiple fields with a variety of
analysis options configured (in particular: I suspect using shingles
would be fruitful here). Set up a query structure that uses functions
to combine together the scores of many queries against each of those
fields -- this might be simple addition, or you might want it to be
conditional, ie: maybe you multiply the sum of the scores of some
queries against simple fields by the score of a query against a really
simple field to eliminate false positives.

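As a rough sketch of what such a combined query could look like, the
snippet below builds request params that multiply the sum of two
sub-query scores by the score of a third, using Solr's sum(), product()
and query() functions with $param dereferencing. Everything specific
here is an assumption for illustration: the field names (kw_stemmed,
kw_shingles, kw_exact), the choice of dismax with mm=1 for the
sub-queries, and the core URL. I have not run this against a live Solr.

```python
from urllib.parse import urlencode


def reverse_search_params(title: str) -> dict:
    """Params for querying the small 'reverse' index with one product title."""
    return {
        # The product title is the query text; each sub-query pulls it in via $title.
        "title": title,
        # Hypothetical fields, each holding the special keyword under different analysis.
        "q_stem": "{!dismax qf=kw_stemmed mm=1 v=$title}",
        "q_shingle": "{!dismax qf=kw_shingles mm=1 v=$title}",
        "q_exact": "{!dismax qf=kw_exact mm=1 v=$title}",
        # query() yields 0 for docs its sub-query does not match, so a miss on
        # the kw_exact field zeroes out the whole product.
        "q": "{!func}product(sum(query($q_stem),query($q_shingle)),query($q_exact))",
        "fl": "id,score",
    }


def reverse_search_url(base_url: str, title: str) -> str:
    """Full select URL; base_url is whatever your rules core happens to be."""
    return base_url + "?" + urlencode(reverse_search_params(title))


# Hypothetical core name "rules" on a local Solr.
url = reverse_search_url("http://localhost:8983/solr/rules/select", "2013Sony TV")
```

Keeping each sub-query in its own param makes it easy to swap the
analysis or the combining function while you experiment with scores.
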
Experiment a bit to see what kinds of inputs get you what kinds of
scores, and maybe associate a "threshold" with each document, which you
index as a numeric field on those docs and then fold into your
calculation using the {!frange} parser to make sure you only count
matches that score above the "threshold" of the document they match
against.

Then when your client gives you Y documents, send them one at a time
against your little "reverse" index and see if they match anything; if
so, report back to the client what they matched.

The upside of this approach is that it doesn't require implementing any
custom java plugins, and it scales better if you expect the number of
special "queries" to grow w/o bound. The downside is that it's really
just giving you a heuristic to indicate that something *might* be a
match, and you will have to constantly tune & adjust if/when the
analysis or query structure of your production search system evolves.

-Hoss