To influence ranking, you can use boosting.
You can boost documents in an index filter extension that you implement as a plugin.
The problem is that you cannot change the boost field, which is also stored in the index (unindexed).
So this may cause trouble during dedup and when explaining the ranking.



On 20.03.2005 at 20:38, Michael Nebel wrote:

Hi Piotr,

as I wrote a month ago, I started working on the problem (was it really so long ago? :-( ). But then real life caught me, and when I checked the Nutch code again, many parts had changed. But the plugin I started/copied should still work... perhaps I should give it another try...

Adding the field to the index is quite trivial when you take a short look at Otis and Erik's book "Lucene in Action" (thanks!) or know about Lucene. My next step was to extend the ranking, and that is when I found the mistake in my strategy of adding the url-length to the ranking: a string is not an integer, and the current ranking works with strings. I think my last post caused many smiles :-) For my part, I decided to read the rest of the book before doing more coding.
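
[Editor's sketch] The string-versus-integer pitfall can be illustrated with a small example. One common workaround when a numeric value has to live in a string-valued index term is to left-pad it with zeros, so that string order agrees with numeric order. This helps with sorting and range comparisons on the term; it does not by itself change how hits are scored. The class name, the method name and the width of 10 digits are assumptions made for this illustration and are not taken from Nutch or Lucene.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
class UrlLengthKey {
    // Assumed fixed width: 10 digits is enough for any non-negative int.
    static final int WIDTH = 10;

    // Left-pad with zeros so that lexicographic order equals numeric order.
    static String encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", length);
    }

    public static void main(String[] args) {
        // Plain decimal strings sort character by character, not by value.
        String[] raw = {"9", "10", "100"};
        Arrays.sort(raw);
        System.out.println("raw:    " + Arrays.toString(raw));

        // Zero-padded strings sort in the same order as the numbers they encode.
        String[] padded = {encode(9), encode(10), encode(100)};
        Arrays.sort(padded);
        System.out.println("padded: " + Arrays.toString(padded));
    }
}
```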

Looking at my list, I think I might return to this problem in the middle or at the end of the week. Perhaps we could work together?

Regards

        Michael



Piotr Kosiorowski wrote:

Hello,
I would like to have title and host as separate indexed fields in our installation. As this topic was already discussed on the list over a month ago, I want to make sure that nothing has been implemented so far before I start coding myself. I am working for Sabre Holdings and we are implementing a travel search engine based on Nutch. The project is not at a very advanced stage right now, but I am spending the majority of my time working with Nutch and I would like to say I am really impressed with it. As I was looking through our index I saw a lot of examples where adding host as a separate indexed field would help a lot with relevancy. The title is much more difficult to judge: as Doug wrote, titles can be spammed quite easily, but it would be nice to have a separate parameter to control title matching. I was also thinking about adding special handling for "host" fields, as many companies concatenate parts of their names in the domain name (e.g. http://www.hewlettpackard.com, http://www.arthurandersen.com/ and even http://www.sabreholdings.com/ :) ), but I will see how it works during implementation. What do others think about such a feature? Does it make sense? I think the majority of concatenations would simply find no matching tokens in the host field, so it should not affect search performance heavily.
My plan is to start working on it this week; I will submit a patch when I finish. So is anyone working on it right now, or does anyone have something ready?
Or are there any special things I should consider?
Regards
Piotr Kosiorowski
Michael Nebel wrote:
Hi,

I'm afraid I'll have to deal with the ranking over the next few days / the weekend. So perhaps I can contribute some time and work for all of us.

Before heading in the wrong direction, some questions in advance:

- using luke to look at my indexes I see a field called <site>
- some more checking: there is a query-site-plugin.
-> so the "host" part mentioned by Doug below should be available right now.


To take up the note from Wolfgang (boosting short urls), I want to add another plugin that calculates the url-length and stores it in a separate field. Perhaps it makes sense to write a third plugin that stores only the "path" of the url, so we can use the site, the path and the total length for the ranking. The title might be a candidate for a fourth plugin.

My next step would be to extend the query-basic-plugin in two ways:

1.) read the weights out of the NutchConf
2.) read the used fields out of the NutchConf

As a result, it should be possible to customize the ranking by selecting the plugins and editing the config.
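
[Editor's sketch] The two steps above could look roughly like the following. It uses plain java.util.Properties as a stand-in for NutchConf, and the property names "query.fields" and "query.boost.<field>" are hypothetical, chosen only for this illustration. It is written against current Java rather than the JDK that Nutch targeted at the time.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration; Properties stands in for NutchConf.
class FieldBoostConfig {
    // Reads the list of query fields and a boost per field.
    // Both key names are assumptions, not real Nutch properties.
    static Map<String, Float> load(Properties conf) {
        Map<String, Float> boosts = new LinkedHashMap<>();
        String fields = conf.getProperty("query.fields", "");
        for (String raw : fields.split(",")) {
            String field = raw.trim();
            if (field.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            // A field without an explicit weight gets a neutral boost of 1.0.
            String value = conf.getProperty("query.boost." + field, "1.0");
            boosts.put(field, Float.parseFloat(value));
        }
        return boosts;
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.load(new StringReader(
            "query.fields=url, anchor, content, title\n"
            + "query.boost.url=4.0\n"
            + "query.boost.anchor=2.0\n"));
        // Prints each configured field with its boost, in configuration order.
        for (Map.Entry<String, Float> e : load(conf).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With a layout like this, adding or removing a field from the ranking would only need a config change, provided the matching indexing plugin has filled that field.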

Is this approach reasonable, or am I thinking too simply?

Michael



Doug Cutting wrote:

Andrzej Bialecki wrote:

Doug Cutting wrote:


NutchSimilarity.lengthNorm() penalizes short content by normalizing all documents with fewer than 1000 content tokens as though they had 1000 content tokens. Is this not sufficient?
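
[Editor's sketch] The effect of that floor can be shown in a few lines. The sketch assumes that the underlying norm is 1/sqrt(numTokens), as in Lucene's default similarity of that period; the class and method names are invented for the illustration and this is not the actual NutchSimilarity code.

```java
// Illustration only; not the NutchSimilarity implementation.
class ContentLengthNorm {
    // Floor described in the message above.
    static final int MIN_CONTENT_LENGTH = 1000;

    // Assumed default length norm: 1 / sqrt(number of tokens).
    static float defaultNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Documents shorter than the floor are normalized as if they had
    // MIN_CONTENT_LENGTH tokens, so they gain no advantage from being short.
    static float contentNorm(int numTokens) {
        return defaultNorm(Math.max(numTokens, MIN_CONTENT_LENGTH));
    }

    public static void main(String[] args) {
        int[] lengths = {10, 500, 1000, 4000};
        for (int n : lengths) {
            System.out.println(n + " tokens: default=" + defaultNorm(n)
                + " floored=" + contentNorm(n));
        }
    }
}
```

Under this assumption a 10-token page and a 1000-token page get the same content norm, while pages longer than 1000 tokens are still normalized downwards as usual.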




Not in my experience. Please consider the following hits (attached in a file), ordered by score, which I got from an index of 5 million pages, mostly from Swedish sites, for the query "apoteket" ("the pharmacy" in Swedish). There is clearly something very wrong with the second hit.




Yes. If that were a "title" match (which it really is), and titles were boosted less than anchors, then this would probably be third or lower.

I don't object to indexing titles in a separate field. They can be high quality, but they can also be spammed more easily than anchors. In any case, separately controlling their boost, length normalization, etc. is probably a good idea.




Ok, I'll prepare a patch for review.




Great! I'm glad more folks are looking at search result quality. This is very important, and not simple.

Example: all other things being equal (i.e. the content and anchors), which url seems to be more representative for the query "ikea":

http://www.ikea.se/something/else.html
http://www.something.se/else/ikea.html

IMHO the first url should be given a higher score. Currently they get the same score.




Agreed. This argues for "host" as a separate indexed field.
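
[Editor's sketch] To make the "ikea" example concrete, the following shows how a host-only field would separate the two urls. It uses only java.net.URI from the standard library; the class and method names are made up for the illustration, and splitting the host on dots is an assumed, very simple tokenization rather than what a Nutch plugin would necessarily do.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of a host-only field; not Nutch code.
class HostField {
    // Returns the lower-cased host split on dots, or an empty list if the
    // url has no host. URI.create throws IllegalArgumentException for
    // strings that are not valid URIs.
    static List<String> hostTokens(String url) {
        List<String> tokens = new ArrayList<>();
        String host = URI.create(url).getHost();
        if (host == null) {
            return tokens;
        }
        for (String part : host.toLowerCase(Locale.ROOT).split("\\.")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // True if the query term is one of the host tokens.
    static boolean hostMatches(String url, String term) {
        return hostTokens(url).contains(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.ikea.se/something/else.html",
            "http://www.something.se/else/ikea.html"
        };
        for (String url : urls) {
            System.out.println(url + " host tokens=" + hostTokens(url)
                + " matches 'ikea' in host: " + hostMatches(url, "ikea"));
        }
    }
}
```

Only the first url has "ikea" among its host tokens, so a query clause on such a field, with its own boost, could rank it above the second one.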





---------------------------------------------------------------
company:                http://www.media-style.com
forum:          http://www.text-mining.org
blog:                   http://www.find23.net


