: My next target is searches on simple terms such as "doll" which, in google,
: would return documents about, well, "toy dolls", because that's the most
: common usage of the simple term "doll". But in my index it predominantly
: returns documents about CDs with the song "Doll Face", and "My baby doll" in
: them.

If you have good metadata about your documents, then you might get
satisfying results using something like the edismax parser with
appropriate weights on various fields -- you could, for example, say that
"matching on the product_title field is important, but matching on
category_name is much more important" and thus use something like...

   q=doll&qf=product_title^5+category_name^50

...but that only helps you if you have category_name values that match
the words people are searching for, like "Doll".

This type of approach doesn't help you in the case where you might have
the inverse problem: the document (category_name="doll",
product_title="My baby") showing up first when a user searches for
"my baby doll", but the user is really trying to find the document
(category_name=cd, product_title="my baby doll").

It really all depends on your user base and the type of queries you
expect.

An interesting solution to this problem that I've seen is to pre-process
the query using a Bayesian classifier to suggest which categories to
boost on. Here's a blog post on this, where the classifier was trained on
the keywords & categories of the documents...

http://engineering.wayfair.com/better-lucenesolr-searches-with-a-boost-from-an-external-naive-bayes-classifier/

...but you could also train the classifier using query logs and data
about what documents users ultimately clicked on (to help you learn that,
for your user base, people who search for "baby" are typically looking
for CDs, not dolls -- or vice versa).

: I'm not directly asking how to solve this as much as I'm asking what
: direction I should be looking in to learn what I need to know to tackle the
: general issue myself.
:
: Left on my own I would start looking at categorizing the CD's into a facet
: called "music", reasonably doable in my dataset. Then I need to reduce the
: boost-value of the entire facet/category of music unless certain pre-defined
: query terms exist, such as [music, cd, song, listen, dvd, <analyze actual
: user queries to come up with a more exhaustive list>, etc.].
:
: I don't yet know how to do all of this, but after a couple more good books I
: should be "dangerous".
:
: So the question to this list:
:
: - Am I on the right track here? If not, can you point me in a
: direction to go?

-Hoss
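
To make the classifier idea above a bit more concrete, here is a rough
sketch in Python. It is not the implementation from the linked blog post:
the training pairs, the "doll" and "cd" category values, and the helper
names (train, classify, build_params) are all made up for illustration,
and the boost scaling is an arbitrary choice. It trains a tiny naive
Bayes model on (keywords, category) pairs, guesses the most likely
category for a query, and turns that guess into an edismax "bq" boost
alongside the "qf" weights shown earlier.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlencode

# Hypothetical training data: (keywords, category_name). In practice this
# could come from document metadata or from query logs plus click data.
TRAINING = [
    ("doll face", "cd"),
    ("my baby doll", "cd"),
    ("baby love songs", "cd"),
    ("porcelain doll", "doll"),
    ("rag doll", "doll"),
    ("doll house", "doll"),
    ("toy doll", "doll"),
]


def train(examples):
    """Collect per-category token counts for a multinomial naive Bayes model."""
    token_counts = defaultdict(Counter)
    category_counts = Counter()
    vocabulary = set()
    for text, category in examples:
        category_counts[category] += 1
        for token in text.lower().split():
            token_counts[category][token] += 1
            vocabulary.add(token)
    return token_counts, category_counts, vocabulary


def classify(query, model):
    """Return (category, probability) pairs, most likely category first."""
    token_counts, category_counts, vocabulary = model
    total_examples = sum(category_counts.values())
    log_scores = {}
    for category, example_count in category_counts.items():
        score = math.log(example_count / total_examples)  # prior
        total_tokens = sum(token_counts[category].values())
        for token in query.lower().split():
            # Add-one smoothing so an unseen token never zeroes out a category.
            count = token_counts[category][token] + 1
            score += math.log(count / (total_tokens + len(vocabulary)))
        log_scores[category] = score
    # Convert log scores into probabilities that sum to 1.
    highest = max(log_scores.values())
    weights = {c: math.exp(s - highest) for c, s in log_scores.items()}
    norm = sum(weights.values())
    ranked = [(c, w / norm) for c, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def build_params(query, model, max_boost=50):
    """Build edismax request parameters with a boost on the guessed category."""
    category, probability = classify(query, model)[0]
    return {
        "defType": "edismax",
        "q": query,
        "qf": "product_title^5 category_name^50",
        # Scale the boost by how confident the classifier is (an assumption).
        "bq": f"category_name:{category}^{max_boost * probability:.1f}",
    }


if __name__ == "__main__":
    model = train(TRAINING)
    for query in ("doll", "my baby doll"):
        print(urlencode(build_params(query, model)))
```

With this toy data the bare query "doll" leans toward the "doll"
category while "my baby doll" leans toward "cd", which is the kind of
distinction a fixed set of qf weights can't make on its own. Training on
real query logs and clicks would replace the TRAINING list.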