"Field weights"

Karl Wettin Fri, 14 Dec 2007 09:06:36 -0800

I have an index that contains three sorts of documents:


Car brand
Tire brand
Tire pressure

(Please bear with me, the real index has nothing to do with cars. I am just trying to explain the problem in an alternative domain to avoid NDA conflicts.)

There is a hierarchical composite relationship between these sorts of documents. A document describing "tire pressure" also contains "tire brand" and "car brand". A document describing "tire brand" also contains information about "car brand". A document describing "car brand" contains only that.

The requirement is that the consumer of the API should not have to specify what fields they are searching in. There is no time (nor training data) to implement a hidden Markov model (HMM) tokenizer or something along that path in order to extract possible attributes from the query string. Instead the query string is tokenized once per field and assembled into one huge query. Normally this works fairly well.
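
A minimal sketch of that per-field query assembly, in plain Python rather than against any search library. The field names, the trivial lowercase analyzer, and the `build_query` helper are all assumptions made for illustration; the real index would plug in its own analyzer for each field.

```python
# Hypothetical stand-in analyzer: lowercase and split on whitespace.
def lowercase_analyzer(text):
    return text.lower().split()

# Hypothetical field names, one analyzer per field.
FIELD_ANALYZERS = {
    "car": lowercase_analyzer,
    "tire": lowercase_analyzer,
    "pressure": lowercase_analyzer,
}

def build_query(query_string, analyzers=FIELD_ANALYZERS):
    """Tokenize the query string once per field and join all
    field:token clauses into one query string."""
    clauses = []
    for field, analyze in analyzers.items():
        for token in analyze(query_string):
            clauses.append(f"{field}:{token}")
    return " ".join(clauses)

print(build_query("Saab"))  # car:saab tire:saab pressure:saab
```

Each clause is optional here, so a document matching any one of the fields is a hit; how the hits are ranked is left to the scorer.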


Here are some example documents:

Volvo
Volvo, Michelin
Volvo, Nokian
Volvo, Nokian, 2.2 bars
Volvo, Firestone, 2.4 bars

Saab
Saab, Michelin
Saab, Nokian
Saab, Nokian, 2.1 bars
Saab, Firestone
Saab, Firestone, 2.4 bars
Saab, Firestone, 2.5 bars

If I search for Saab, the top result will be the document representing the car brand "Saab". The query would look like this: "car:saab tire:saab pressure:saab"


But let's say Saab starts manufacturing tires too:

Saab
Saab, Saab tires
Saab, Saab tires, 1.9 bars
Saab, Saab tires, 1.8 bars

If I search for "Saab" I still want the top result to be Saab the car brand. But it no longer is: the match for "Saab, Saab tires" now has a greater score than "Saab", of course.
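
The effect can be illustrated with a deliberately crude toy model that only counts how many query clauses a document matches. This is an assumption-laden sketch, not the scoring formula of any real search library (real scoring also weighs term frequency, field length and so on); the documents and the `toy_score` helper are hypothetical.

```python
# Toy documents: field name -> list of indexed tokens.
DOCS = [
    {"car": ["saab"]},
    {"car": ["saab"], "tire": ["saab", "tires"]},
    {"car": ["saab"], "tire": ["saab", "tires"], "pressure": ["1.9", "bars"]},
]

# The assembled query, one (field, token) clause per field.
QUERY = [("car", "saab"), ("tire", "saab"), ("pressure", "saab")]

def toy_score(doc, query):
    """Count the query clauses whose token occurs in the matching field."""
    return sum(1 for field, token in query if token in doc.get(field, []))

for doc in DOCS:
    print(toy_score(doc, QUERY), doc)
```

In this toy model the pure car brand document matches one clause, while every document that also carries "Saab" as the tire brand matches two, so the car brand document drops below them.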

My idea is to work along the lines of indexing "Saab" in the tire brand and tire pressure fields too. Now searching for Saab will yield a result where the car brand "Saab" is the top result.

However, this will not work as I have different tokenization strategies for each field (stemming and what not). Tokenizing the query string Saab for the field "tire brand" in Swedish might end up as "saa" and will thus not find the token Saab inserted for the document describing the car brand Saab.

I have a couple of experiments in my head that I need to try out, starting with tokenizing query strings per field and using the tokens generated for the field car brand as the query in the tire brand and tire pressure fields too. And vice versa.
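
A rough sketch of the first experiment, under heavy assumptions: the "stemmer" below is a crude stand-in that chops the last character of longer tokens purely to reproduce the "saab" to "saa" mismatch, not a real Swedish stemmer, and all field names and helpers are hypothetical. Only the car-to-other-fields direction is shown; the reverse direction would follow the same pattern.

```python
def plain_analyzer(text):
    return text.lower().split()

def stemming_analyzer(text):
    # Crude stand-in for a stemmer: drop the last character of tokens
    # longer than three characters, so "saab" becomes "saa".
    return [t[:-1] if len(t) > 3 else t for t in text.lower().split()]

ANALYZERS = {
    "car": plain_analyzer,
    "tire": stemming_analyzer,
    "pressure": plain_analyzer,
}

def per_field_clauses(query_string):
    """Baseline: each field is queried only with its own analyzer's tokens."""
    return {(field, token)
            for field, analyze in ANALYZERS.items()
            for token in analyze(query_string)}

def cross_field_clauses(query_string, source_field="car"):
    """Experiment: also query every field with the source field's tokens."""
    clauses = per_field_clauses(query_string)
    for token in ANALYZERS[source_field](query_string):
        for field in ANALYZERS:
            clauses.add((field, token))
    return clauses

def matches(doc, clauses):
    return sorted(c for c in clauses if c[1] in doc.get(c[0], set()))

# Car brand document with its car-field token copied into the other fields.
car_brand_doc = {"car": {"saab"}, "tire": {"saab"}, "pressure": {"saab"}}

print(matches(car_brand_doc, per_field_clauses("Saab")))    # tire clause misses
print(matches(car_brand_doc, cross_field_clauses("Saab")))  # tire clause hits
```

With the baseline clauses the copied token in the tire field is never found, because the query side was stemmed to "saa"; adding the car-analyzed tokens to every field restores the match at the cost of a larger query.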


Any brilliant ideas that might work? Hacky solutions are OK.


--
karl
