: My next target is searches on simple terms such as "doll" which, in google,
: would return documents about, well, "toy dolls", because that's the most
: common usage of the simple term "doll". But in my index it predominantly
: returns documents about CDs with the song "Doll Face", and "My baby doll" in
: them.

If you have good metadata about your documents, then you might get 
satisfying results using something like the edismax parser with appropriate 
weights on various fields -- you could for example say that "matching 
on the product_title field is important, but matching on a category_name 
is much more important" and thus use something like...

    q=doll&qf=product_title^5+category_name^50

...but that only helps you if you have category_name values that match the 
words people are searching for, like "Doll".
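
For what it's worth, here is a minimal sketch (in Python, purely as an 
illustration) of building the parameters for that kind of request. It 
assumes the edismax parser is selected with defType=edismax; the helper 
name build_edismax_params is made up for this example, and the field names 
and weights are just the ones used above.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_weights):
    """Return a URL-encoded parameter string for an edismax query.

    field_weights maps a field name to its boost, e.g. {"category_name": 50}.
    (Hypothetical helper; the field names are assumptions about your schema.)
    """
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{weight}" for field, weight in field_weights.items())
    return urlencode({"defType": "edismax", "q": user_query, "qf": qf})


params = build_edismax_params("doll", {"product_title": 5, "category_name": 50})
print(params)
# prints: defType=edismax&q=doll&qf=product_title%5E5+category_name%5E50
```

The weights themselves are something you would have to tune against your 
own data; 5 and 50 are only there to show the relative emphasis.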

This type of approach doesn't help you in the case where you might have the 
inverse problem: a document (category_name="doll", product_name="My baby") 
showing up first when a user searches for "my baby doll" but the user is 
really trying to find the document (category_name=cd, product_name="my 
baby doll").

It really all depends on your user base and the type of queries you 
expect.

An interesting solution to this problem that I've seen is to pre-process 
the query using a Bayesian classifier to suggest which categories to boost 
on.

Here's a blog on this where the classifier was trained based on the 
keywords & categories of the documents...

http://engineering.wayfair.com/better-lucenesolr-searches-with-a-boost-from-an-external-naive-bayes-classifier/

...but you could also train the classifier using query logs and data about 
what documents users ultimately clicked on (to help you learn that for 
your user base, people who search for "baby" are typically looking for CDs 
not dolls -- or vice versa).
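
A rough sketch of that idea, assuming you can extract (query, clicked 
category) pairs from your logs. The training pairs below are invented, the 
class and function names are hypothetical, and this is not the 
implementation from the blog post: it is just a tiny naive Bayes classifier 
with add-one smoothing whose best guess is turned into a clause you could 
pass as the edismax bq (boost query) parameter.

```python
import math
from collections import Counter, defaultdict


class QueryCategoryClassifier:
    """Tiny multinomial naive Bayes over query words (add-one smoothing)."""

    def __init__(self):
        self.category_counts = Counter()          # training examples per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()

    def train(self, examples):
        for query, category in examples:
            self.category_counts[category] += 1
            for word in query.lower().split():
                self.word_counts[category][word] += 1
                self.vocabulary.add(word)

    def probabilities(self, query):
        """Return {category: P(category | query)} for the trained categories."""
        words = query.lower().split()
        total = sum(self.category_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for category, count in self.category_counts.items():
            n_words = sum(self.word_counts[category].values())
            score = math.log(count / total)  # log prior
            for word in words:
                seen = self.word_counts[category][word]
                score += math.log((seen + 1) / (n_words + vocab_size))
            log_scores[category] = score
        # Normalize in a numerically safe way.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


def boost_query(classifier, query, field="category_name", max_boost=50):
    """Build a bq clause boosting the most likely category (assumed field name)."""
    probs = classifier.probabilities(query)
    best = max(probs, key=probs.get)
    return f'{field}:"{best}"^{round(max_boost * probs[best])}'


# Invented click data: (what the user typed, category of the clicked document).
clicks = [
    ("my baby doll", "cd"), ("doll face", "cd"), ("baby love song", "cd"),
    ("toy doll", "toys"), ("doll house", "toys"), ("barbie doll", "toys"),
]
clf = QueryCategoryClassifier()
clf.train(clicks)
print(boost_query(clf, "my baby doll"))  # boosts the "cd" category
print(boost_query(clf, "doll"))          # boosts the "toys" category
```

Because the clause goes in bq rather than fq, documents in other categories 
can still match; the predicted category just gets a head start in scoring.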


: I'm not directly asking how to solve this as much as I'm asking what
: direction I should be looking in to learn what I need to know to tackle the
: general issue myself.
: 
: Left on my own I would start looking at categorizing the CD's into a facet
: called "music", reasonably doable in my dataset. Then I need to reduce the
: boost-value of the entire facet/category of music unless certain pre-defined
: query terms exist, such as [music, cd, song, listen, dvd, <analyze actual
: user queries to come up with a more exhaustive list>, etc.]. 
: 
: I don't yet know how to do all of this, but after a couple more good books I
: should be "dangerous".
: 
: So the question to this list:
: 
: - Am I on the right track here?  If not, can you point me in a
: direction to go?

-Hoss
