[Nutch-dev] Adding title and site to scoring

Andrzej Bialecki Mon, 17 Jan 2005 18:46:18 -0800

Hi,

After analyzing some of the search results from my index of ~10 million pages, I noticed a few strange results. It seems to me that:

* DefaultSimilarity seems to give too much weight to short "content" fields (high tf) and to anchor texts (boost value too high?).

* the title is neither indexed nor tokenized, yet it quite often contains query terms. Currently the title is treated as one of the anchors. IMHO the title is more important than that: it should be made into a separate indexed and tokenized field, and the default query translator (BasicQueryFilter) should take it into account.

* for the url field, it matters whether the query terms occur in the domain name or in the file path of the url. The former is usually more important, because it's more likely to point to a reference site, and IMHO should be boosted separately. The latter usually indicates a reference page. We could differentiate between the two by adding a "domain" field as an unstored, tokenized and indexed field, and by modifying the BasicQueryFilter to use this field to boost reference sites.
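
To make the domain/path distinction concrete, here is a minimal sketch of how a url could be split into the text for the proposed "domain" field and the remaining path text. This is only an illustration: the class UrlPartsSketch and its methods are hypothetical names, not existing Nutch code, and the tokenizer is a crude stand-in for whatever analyzer the index would really use.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Nutch): separates the host part of a url,
// which would feed the proposed "domain" field, from the file path part.
class UrlPartsSketch {

    // Returns the lower-cased host, or "" if the url is malformed or has no host.
    static String domainOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    // Returns the path part of the url, or "" if the url is malformed or has no path.
    static String pathOf(String url) {
        try {
            String path = new URI(url).getPath();
            return path == null ? "" : path;
        } catch (URISyntaxException e) {
            return "";
        }
    }

    // Crude tokenizer (an assumption for this sketch): lower-case the text
    // and split on runs of non-alphanumeric characters.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String url = "http://www.apache.org/docs/lucene/index.html";
        System.out.println(tokens(domainOf(url))); // [www, apache, org]
        System.out.println(tokens(pathOf(url)));   // [docs, lucene, index, html]
    }
}
```

With such a split, a query for "apache" would match the domain tokens of this url (a likely reference site), while a query for "lucene" would match only the path tokens (a likely reference page), so the two cases could be given different boosts.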

Also, to offer more flexibility in searching, I would propose indexing the values of primaryType and secondaryType. This would enable searching for content of a specific MIME type. Currently these fields are only stored, not indexed.
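
As a rough illustration of the values these two fields would carry, the sketch below splits a Content-Type string into its primary and secondary parts. The class MimeTypeSketch is a hypothetical name used only for this example, and I am assuming that primaryType and secondaryType correspond to the two halves of the MIME type.

```java
import java.util.Locale;

// Hypothetical helper (not part of Nutch): splits a Content-Type value such as
// "text/html; charset=UTF-8" into a primary type and a secondary type.
class MimeTypeSketch {

    // Returns a two-element array: { primaryType, secondaryType }.
    // The secondary type is "" when the value contains no '/'.
    static String[] split(String contentType) {
        // Drop any parameters such as "; charset=UTF-8".
        int semi = contentType.indexOf(';');
        String bare = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);

        int slash = bare.indexOf('/');
        if (slash < 0) {
            return new String[] { bare, "" };
        }
        return new String[] { bare.substring(0, slash), bare.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("text/html; charset=UTF-8");
        System.out.println(parts[0] + " / " + parts[1]); // text / html
    }
}
```

If both values were indexed as untokenized terms, a search could be restricted to, say, secondaryType "pdf" without any change to the stored fields.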

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

