Hi,
After analyzing some of the search results from my index of roughly 10 million pages, I noticed a few strange results. It seems to me that:
* the DefaultSimilarity excessively favors short "content" fields (high tf) and anchor texts (is the boost value too high?).
* the title is neither indexed nor tokenized, even though it quite often contains query terms. Currently the title is treated as just one of the anchors. IMHO the title is more important: it should be made into a separate indexed and tokenized field, and the default query translator (BasicQueryFilter) should take it into account.
* for the url field, it matters whether the query terms occur in the domain name or in the file path of the URL. The former is usually more important, because it's more likely to point to a reference site, and IMHO should be boosted separately. The latter usually indicates a reference page. We could differentiate between the two by adding a "domain" field (unstored, tokenized and indexed) and modifying the BasicQueryFilter to use this field to boost reference sites.
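To make the domain/path distinction concrete, here is a minimal sketch in plain Java (no Nutch or Lucene classes involved). The class name UrlFieldSplitter, the helper urlBoost and the boost values are hypothetical and only for illustration. It also uses simple substring matching, whereas a real "domain" field would be tokenized by the analyzer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: splits a URL into the two parts
// that the proposed "domain" field and the existing url field would cover.
final class UrlFieldSplitter {

    /** Returns the lower-cased host of the URL, or "" if it cannot be parsed. */
    static String domainOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    /** Returns the path part of the URL, or "" if it cannot be parsed. */
    static String pathOf(String url) {
        try {
            String path = new URI(url).getPath();
            return path == null ? "" : path;
        } catch (URISyntaxException e) {
            return "";
        }
    }

    /**
     * Illustrative only: adds domainBoost when the term occurs in the host
     * and pathBoost when it occurs in the path. Both boosts are assumed
     * tuning parameters, not values taken from Nutch.
     */
    static float urlBoost(String url, String term, float domainBoost, float pathBoost) {
        String t = term.toLowerCase(Locale.ROOT);
        float boost = 0f;
        if (domainOf(url).contains(t)) {
            boost += domainBoost;
        }
        if (pathOf(url).toLowerCase(Locale.ROOT).contains(t)) {
            boost += pathBoost;
        }
        return boost;
    }
}
```

With a domain boost larger than the path boost, a query term found in the host name outranks the same term found only in the path, which is the behaviour suggested above for reference sites versus reference pages.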
Also, to offer more flexibility in searching, I propose indexing the values of primaryType and secondaryType. This would enable searching for content of a specific MIME type. Currently these fields are only stored, not indexed.
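As a rough illustration of the values that would end up in those two fields, the sketch below splits a content type string into its two halves. It assumes that primaryType and secondaryType correspond to the type and subtype of a MIME content type (e.g. "text" and "html"). The class MimeTypeParts is hypothetical and not part of Nutch.

```java
import java.util.Locale;

// Hypothetical helper: derives the two values that would be indexed,
// assuming primaryType/secondaryType mean MIME type/subtype.
final class MimeTypeParts {

    /**
     * Splits e.g. "text/html; charset=UTF-8" into {"text", "html"}.
     * Parameters after ';' are dropped; a missing subtype yields "".
     */
    static String[] split(String contentType) {
        String s = contentType;
        int semi = s.indexOf(';');
        if (semi >= 0) {
            s = s.substring(0, semi);
        }
        s = s.trim().toLowerCase(Locale.ROOT);
        int slash = s.indexOf('/');
        if (slash < 0) {
            return new String[] { s, "" };
        }
        return new String[] { s.substring(0, slash), s.substring(slash + 1) };
    }
}
```

Once both halves are indexed as untokenized terms, a query could restrict results to, say, a primary type of "application" and a secondary type of "pdf".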
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
