Haven't seen this discussed here.
See 7a at the link below:

http://www.asktog.com/columns/062top10ReasonsToNotShop.html

7a talks about searching on a camera site for the "Lowepro 100 AW".

He says this query works:        "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"

Cross-checking with Google shows that the 1st form is indeed much more popular, but the 2nd form is used too. If you're a commerce site, or any site that wants to make it easier for users to find things, you should help them out.

So the discussion question is: what's the best way to handle this?

I guess the somewhat general form of this is that any term in a query might need to be split into 2 terms that are individually indexed (so "100AW" is not indexed, but "100" and "AW" are).
The flip side is that any 2 adjacent terms could be concatenated to form another term that is indexed (so in another universe it might be that passing "100 AW" is not as precise as passing "100AW", but how's the user to know?).
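
To make the two directions concrete, here's a rough sketch in plain Java (no Lucene involved). The class and method names are made up for illustration, and splitting only at letter/digit boundaries is my assumption about what a "sensible" split looks like, not something any library does for you.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class TermVariants {

    // Split a token wherever a digit sits next to a letter:
    // "100AW" becomes ["100", "AW"]; "Lowepro" stays as one piece.
    static List<String> splitAlphaNumeric(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean boundary =
                        (Character.isDigit(prev) && Character.isLetter(c))
                     || (Character.isLetter(prev) && Character.isDigit(c));
                if (boundary) {
                    parts.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(c);
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        return parts;
    }

    // The flip side: join each adjacent pair of tokens, so
    // ["Lowepro", "100", "AW"] yields ["Lowepro100", "100AW"].
    static List<String> concatAdjacent(List<String> tokens) {
        List<String> joined = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            joined.add(tokens.get(i) + tokens.get(i + 1));
        }
        return joined;
    }
}
```

Either direction could be applied at index time or at query time; the sketch only shows the string handling, not where it would plug in.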


In the context of Lucene, ways to handle this seem to be:
- automagically run a fuzzy query (so if a query doesn't work, transform "Lowepro 100AW" to "Lowepro~ 100AW~")
- write a query parser that breaks apart unindexed tokens into ones that are indexed (so "100AW" becomes "100 AW")
- write a tokenizer that inserts dummy tokens for every pair of tokens, so the stream "Lowepro 100 AW" would also have "Lowepro100" and "100AW" inserted, presumably via magic w/ TokenStream.next()
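
For what it's worth, the second option might look something like the sketch below. It's plain Java, not real Lucene code: the Set<String> is a stand-in for the index's term dictionary, QueryRewriter is a hypothetical name, and I'm assuming the query terms have already been lowercased the same way the indexed terms were.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical query-side rewriter, for illustration only.
class QueryRewriter {

    // Stand-in for the set of terms actually present in the index.
    private final Set<String> indexedTerms;

    QueryRewriter(Set<String> indexedTerms) {
        this.indexedTerms = indexedTerms;
    }

    // Keep terms that are indexed; for an unindexed term, look for the first
    // split point where both halves are indexed and use those instead.
    // Terms that cannot be split that way are passed through unchanged.
    List<String> rewrite(List<String> queryTerms) {
        List<String> out = new ArrayList<>();
        for (String term : queryTerms) {
            if (indexedTerms.contains(term)) {
                out.add(term);
                continue;
            }
            boolean split = false;
            for (int i = 1; i < term.length(); i++) {
                String left = term.substring(0, i);
                String right = term.substring(i);
                if (indexedTerms.contains(left) && indexedTerms.contains(right)) {
                    out.add(left);
                    out.add(right);
                    split = true;
                    break;
                }
            }
            if (!split) {
                out.add(term);
            }
        }
        return out;
    }
}
```

With "lowepro", "100" and "aw" in the term set, rewriting ["lowepro", "100aw"] gives ["lowepro", "100", "aw"]. This only handles a split into two pieces and takes the first split that works, which may not be the best one.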


Comments on best way to handle this?






