Hi,

I'm afraid I'll have to deal with the ranking over the next few days / the weekend, so perhaps I can contribute some time and work for all of us.

Before I head off in the wrong direction, some questions first:

- Using Luke to look at my indexes, I see a field called <site>.
- After some more checking: there is a query-site-plugin.
-> So the "host" part mentioned by Doug below should already be available.


To take up Wolfgang's note (boosting short URLs), I want to add another plugin that calculates the URL length and stores it in a separate field. Perhaps it makes sense to write a third plugin that stores only the "path" of the URL, so we can use the site, the path and the total length for the ranking. The title might be a candidate for a fourth plugin.
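
A rough sketch of the values such plugins would compute, in plain Java rather than against the Nutch plugin API. The class name UrlFields and its members are made up for illustration; only java.net.URI is a real library class, and the two URLs are the "ikea" examples from the quoted thread below.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper (not part of Nutch): splits a URL into the three
// values discussed above, i.e. site (host), path and total length.
public class UrlFields {
    public final String site;   // host part, lower-cased
    public final String path;   // path part, as given in the URL
    public final int length;    // length of the whole URL string

    private UrlFields(String site, String path, int length) {
        this.site = site;
        this.path = path;
        this.length = length;
    }

    // URI.create throws IllegalArgumentException for malformed input.
    public static UrlFields of(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase(Locale.ROOT);
        String path = uri.getPath() == null ? "" : uri.getPath();
        return new UrlFields(host, path, url.length());
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.ikea.se/something/else.html",
            "http://www.something.se/else/ikea.html"
        };
        for (String url : urls) {
            UrlFields f = UrlFields.of(url);
            // Shows whether the term "ikea" sits in the site or in the path.
            System.out.println(f.site + " | " + f.path + " | " + f.length);
        }
    }
}
```

With the values kept apart like this, a match on "ikea" in the site could be weighted differently from a match in the path, which is the distinction the two example URLs call for.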

My next step would be to extend the query-basic-plugin in two ways:

1.) read the weights from the NutchConf
2.) read the list of fields to use from the NutchConf

As a result, it should be possible to customize the ranking by selecting the plugins and editing the config.
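
To make the idea concrete, here is a minimal sketch of the config lookup. It uses java.util.Properties as a stand-in for NutchConf, and the property names "query.basic.fields" and "query.basic.boost.<field>" are assumptions made up for this example, not existing Nutch settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: reads the list of query fields and one boost per
// field from a configuration object (Properties stands in for NutchConf).
public class FieldBoosts {
    // Assumed property names, invented for illustration.
    static final String FIELDS_KEY = "query.basic.fields";
    static final String BOOST_PREFIX = "query.basic.boost.";

    public static Map<String, Float> load(Properties conf) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        String fields = conf.getProperty(FIELDS_KEY, "");
        for (String name : fields.split(",")) {
            String field = name.trim();
            if (field.isEmpty()) {
                continue; // skip blanks such as a trailing comma
            }
            // Fields without an explicit boost fall back to 1.0.
            String raw = conf.getProperty(BOOST_PREFIX + field, "1.0");
            boosts.put(field, Float.parseFloat(raw));
        }
        return boosts;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(FIELDS_KEY, "url,anchor,site,content");
        conf.setProperty(BOOST_PREFIX + "url", "4.0");
        conf.setProperty(BOOST_PREFIX + "anchor", "2.0");
        conf.setProperty(BOOST_PREFIX + "site", "2.0");
        System.out.println(load(conf));
    }
}
```

The query plugin would then loop over this map instead of a hard-coded list of fields and weights, so changing the ranking would only mean editing the config.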

Is this approach reasonable, or am I thinking too simply?

Michael



Doug Cutting wrote:
Andrzej Bialecki wrote:

Doug Cutting wrote:


NutchSimilarity.lengthNorm() penalizes short content by normalizing all documents with fewer than 1000 content tokens as though they had 1000 content tokens. Is this not sufficient?


Not in my experience. Please consider the following hits (attached in a file), ordered by score, which I got from an index of 5 million pages, mostly from Swedish sites, for the query "apoteket" ("the pharmacy" in Swedish). There is clearly something very wrong with the second hit.


Yes. If that were a "title" match (which it really is), and titles were boosted less than anchors, then this would probably be third or lower.

I don't object to indexing titles in a separate field. They can be high quality, but they can also be spammed more easily than anchors. In any case, separately controlling their boost, length normalization, etc. is probably a good idea.


Ok, I'll prepare a patch for review.


Great! I'm glad more folks are looking at search result quality. This is very important, and not simple.

Example: all other things being equal (i.e. the content and anchors), which url seems to be more representative for the query "ikea":

http://www.ikea.se/something/else.html
http://www.something.se/else/ikea.html

IMHO the first url should be given a higher score. Currently they get the same score.


Agreed. This argues for "host" as a separate indexed field.




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
