Visvo.com was originally a categorized, web-wide search. In hindsight
I don't think our approach was the best way to proceed, but here is
what we did.
1) We had a MapReduce job that was run to place urls in a given category.
The actual function for determining a category is arbitrary. We
started with Bayesian methods based on noun phrases matched to hand
built categories, but it could be any function you want as long as it
maps url -> 1+ categories. Our function returned a float for each
category, and the highest-scoring category won.
2) The job picked the best category at one level of the tree, then
reran the function on that category's children. The function returned a
float value; if a child's value was higher than its parent's, the job
continued checking children at the next level, and so on. The idea
behind this was to find the best category in a tree of categories.
3) If a url was in a category, it was considered to be in all of its
parent categories. So let's say a url is in the following category:
/one/two/three/four
It is also considered to be in
/one/two/three
/one/two
/one
In the index we added a custom field called category, and for each url
we stored the category it was assigned to plus all of its parent
categories.
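
A minimal Python sketch of steps 1 to 3, for illustration only; this is
not the original Visvo code. The category tree, the keyword sets, and
the names classify, score, children and with_ancestors are all
hypothetical, and the toy keyword-overlap score merely stands in for
whatever function maps a url to category scores.

```python
# Hypothetical category tree: path -> keywords hinting at that category.
# The keyword sets are made up; any scoring function could replace score().
TREE = {
    "/sports": {"game", "team", "score"},
    "/sports/soccer": {"goal", "league", "striker"},
    "/sports/tennis": {"serve", "racket", "set"},
    "/science": {"research", "study", "theory"},
}


def children(path):
    """Return the direct children of a category path ("" is the root)."""
    depth = path.count("/") + 1
    return [p for p in TREE
            if p.startswith(path + "/") and p.count("/") == depth]


def score(words, path):
    """Toy stand-in scorer: fraction of the category's keywords present."""
    keywords = TREE[path]
    return len(keywords & words) / len(keywords)


def classify(text):
    """Pick the best category per level; descend while a child beats its parent."""
    words = set(text.lower().split())
    best_path, best_score = "", 0.0
    while True:
        kids = children(best_path)
        if not kids:
            return best_path  # reached a leaf
        top = max(kids, key=lambda p: score(words, p))
        top_score = score(words, top)
        if top_score <= best_score:
            return best_path  # no child scores higher than the parent
        best_path, best_score = top, top_score


def with_ancestors(path):
    """Expand /a/b/c into [/a/b/c, /a/b, /a] for the category index field."""
    if not path:
        return []
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]


if __name__ == "__main__":
    page = "the striker scored a late goal for the team in the league game"
    print(with_ancestors(classify(page)))
```

Here an empty string from classify means no category scored above
zero, so the page would stay uncategorized in this sketch.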
The UI allowed keyword searches and also listed the categories as
links. There was some special logic to try to determine a relevant
starting point in the category tree from the query. It was not very
successful, so most searches started at the base of the category tree.
Clicking on a link would run a query like this:
keywords AND category=/one/two/three
This should return categorized results. As I said, it may not be the
best approach, but it is one approach to getting categorized results.
Hope this helps.
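
To show what the category-restricted query above amounts to, here is a
small in-memory Python sketch. The documents, the search helper, and
the whitespace keyword matching are all hypothetical; a real index
would do the matching against the indexed category field instead.

```python
# Hypothetical indexed documents: each stores its assigned category
# plus every parent category in a multi-valued "category" field.
DOCS = [
    {"url": "http://example.com/a", "text": "cheap flights to rome",
     "category": ["/one/two/three/four", "/one/two/three", "/one/two", "/one"]},
    {"url": "http://example.com/b", "text": "cheap hotels in rome",
     "category": ["/one/two", "/one"]},
    {"url": "http://example.com/c", "text": "train timetable",
     "category": ["/one/two/three", "/one/two", "/one"]},
]


def search(keywords, category, docs=DOCS):
    """Return urls whose text contains every keyword AND whose category
    field contains the requested category path (exact match)."""
    wanted = set(keywords.lower().split())
    return [d["url"] for d in docs
            if wanted <= set(d["text"].lower().split())
            and category in d["category"]]


if __name__ == "__main__":
    # Equivalent in spirit to: cheap rome AND category=/one/two/three
    print(search("cheap rome", "/one/two/three"))
```

Because parent categories are stored alongside the assigned one, a
plain exact match on the field is enough; no prefix matching on the
path is needed at query time.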
Dennis
Kenan Azam wrote:
Hi,
I am using Nutch 0.8.1 to do site wide searches. I want certain results to
be boosted more than others for which I have added custom index terms and
boosted them.
However, now I have the need to group results into categories so that
interesting categories are not buried deep in the list.
Has anyone tried to categorize search results? For example, out of 100
results, 20 appear in category 1, 50 appear in category 2, and all
others appear in a third category?
Thanks, Kenan.