Re: [Nutch-dev] Adding title and site to scoring

Andrzej Bialecki Tue, 22 Mar 2005 16:03:34 -0800

Piotr Kosiorowski wrote:

Hello,
I was reading the code and implementing some features today, and I want to summarize the results as I promised Andrzej and Michael. This email is a bit long, but I promised some details.

Status of related features in the current Nutch codebase:
- The "site" field added by SiteIndexingFilter cannot be used for hostname storage because it is not tokenized. As I understand the purpose of this plugin (limiting answers to a given site), it should not be tokenized, but we need a tokenized host.
- There is a "title" field added by the index-basic plugin, but it is not indexed; it is stored only for display purposes.


Correct.

One comment though on the value of the "site", something that I intended to raise but always kept forgetting... Page.computeDomainID() returns a nice, unique long value, which (when encoded with Character.MAX_RADIX) is always shorter than the host name. This could be used instead of "site", to save some space in the index. The corresponding query plugin would compute the domain ID using the same formula.
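A minimal sketch of the encoding idea, assuming a stand-in hash: the formula inside Page.computeDomainID() is not shown in this thread, so hypotheticalDomainId() below is an invented 64-bit FNV-1a hash used only for illustration. The only part taken from the suggestion above is encoding the long with Character.MAX_RADIX (radix 36).

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class DomainIdSketch {

    // HYPOTHETICAL stand-in for Page.computeDomainID(): a 64-bit FNV-1a hash
    // of the lower-cased host name. The real Nutch formula may differ.
    static long hypotheticalDomainId(String host) {
        long hash = 0xcbf29ce484222325L;           // FNV-1a 64-bit offset basis
        byte[] bytes = host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;                // FNV-1a 64-bit prime
        }
        return hash;
    }

    // Encode the ID in radix 36, the largest radix Long.toString supports.
    static String encode(long id) {
        return Long.toString(id, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        String host = "www.tjorn.se";
        // The same function would be applied at index time and at query time.
        System.out.println(host + " -> " + encode(hypotheticalDomainId(host)));
    }
}
```

A signed long in radix 36 takes at most 14 characters including the sign, so the encoded term has a fixed upper bound on its length regardless of how long the host name is.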


There are two sets of changes required to add host and title fields to
the index and use them during search.

Indexing changes:

    -index-basic plugin:
    I assume index-basic functionality is to be changed to include
indexed,tokenized,unstored "host" and indexed,tokenized,stored "title"
fields and exclude title from "anchor" field.


Yes.

- NutchDocumentAnalyzer:
  - for "host" and "title", use the same analyzer as for "anchor" and "url".

Hmmm... It is not clear to me why the current NutchDocumentAnalyzer uses the AnchorAnalyzer for "url". For the "anchor" field it makes sense, because it sets a gap between the terms to prevent matches across consecutive anchors, but in the case of "url" we only ever have a single value being added. This results in just adding three empty positions at the start of the tokens, e.g. for the url "http://www.tjorn.se/kof/kultur/sagor/" we get:

null_3, http|http-www|http-www-tjorn, www|www-tjorn, tjorn, se,
kof, kultur, sagor

So, I would argue that it doesn't make much sense, and we should fix it to use the ContentAnalyzer for "url".

The same would go for the new fields, "host" and "title", because there is only ever a single value for each of these.

- NutchSimilarity:
  - length normalization should treat host as url and title as anchor for now.

Yes, probably correct... we need to see the results on some well-known cases.

Searching:
- BasicQueryFilter:
  - add host and title fields, handled exactly like all other fields. To start, I will set TITLE_BOOST=1.5 and HOST_BOOST=2 (as host would be matched twice, in the "host" and "url" fields, it will influence the score very much). After implementation I will run some tests to choose boost values that look OK (at least to me).

Regarding the TITLE_BOOST - well, the boost for anchors is 2.0f, why should it be lower for this special anchor, which is the title? You make an assumption here that the author of the page is less to be trusted with the title than the others who link to his page...

Regarding the HOST_BOOST - if you consider the example I gave in the email that started this thread, the reason for treating the host part of urls separately was to increase the quality of results, by boosting up the scoring for sites that are more likely the "reference" sites for the query terms. So, given the query "ikea", and the following urls:

1. "http://www.ikea.se/some/other/name.html"
2. "http://www.some.se/some/other/ikea.html"
3. "http://ikea.some.se/some/other/name.html"

which of the above urls should score the highest?

With the current code, all three would get the same score. With your patch applied, only 1. and 3. would get the same score, the 2. would get a lower score. Now, the interesting question is this: is there any meaningful and generic way to introduce a difference in scoring between 1. and 3.?
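To make the comparison concrete, here is a toy sketch of the effect described above. It is not the Lucene/Nutch scoring formula: it ignores length normalization, tf/idf and every other field, and simply adds a boost for each field whose tokens contain the query term. HOST_BOOST=2 is the value proposed in this thread; URL_BOOST=1 and the tokenizer are assumptions made for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HostBoostSketch {

    static final float URL_BOOST = 1.0f;   // ASSUMED value, for illustration only
    static final float HOST_BOOST = 2.0f;  // initial value proposed in the thread

    // Simplified tokenizer (assumption): lower-case, split on non-alphanumerics.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
    }

    // Toy score: sum of the boosts of the fields that contain the term.
    static float score(String term, String url) {
        String host = URI.create(url).getHost();
        float score = 0f;
        if (tokens(url).contains(term)) {
            score += URL_BOOST;    // the whole url is one field
        }
        if (host != null && tokens(host).contains(term)) {
            score += HOST_BOOST;   // the host part is matched again on its own
        }
        return score;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.ikea.se/some/other/name.html",
            "http://www.some.se/some/other/ikea.html",
            "http://ikea.some.se/some/other/name.html"
        };
        for (String url : urls) {
            System.out.println(score("ikea", url) + "  " + url);
        }
    }
}
```

Under these assumptions urls 1 and 3 receive identical scores and url 2 a lower one, which is exactly the open question: nothing in a plain token match on "host" distinguishes "www.ikea.se" from "ikea.some.se".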

I have already implemented all these changes (not a lot of work once I
figured out what to change) and I will do basic tests tomorrow. After
basic verification of the implementation I will send the patch for
others interested to try, and to comment on the results.

Great. If the patch is not too large, just send it to the list, otherwise you can put it in Bugzilla.

Changes that are introduced by this patch would modify index structure
(addition of new field) and will change default query. I think it should
be possible to use new code with old index (it should behave as old code
as new fields in query would not be present in document), but mixing new
and old segments might be a problem. So I think this change requires
reindexing.

Correct - mixing indexes would be a big no-no. Using the new code with old indexes would lead to different absolute score values.


During implementation I have found two additional ideas:
1) Do not index url (keep it as stored only field) - add separate host
and path fields as indexed  (it will not index protocol, port and some
other parts of url but I am not sure if indexing them makes sense). It
will be easier to control effect of weights and length normalization if
host is not counted twice, but this would require reindexing as some old
fields would be used differently in query - so it will not work as
before with old index.

In some cases you want to select a subset of results by protocol (e.g. https, file, smb, etc). So, it seems to me that you need to keep the protocol around.

Also, I think that in some cases you want to run a phrase match across the whole url, so keeping an index of the whole url would be beneficial.

The fact that terms in "host" and "url" overlap can be adjusted by boosting and different normalization. Please also remember that terms, which are "qualified" with the field name (like in a query "anchor:test") would never match the content in other fields.


2) I do not have any evidence yet, but looking at the data I have a
feeling that the "not host" part of a url is not as important as the current
boost factor for it indicates. Probably it should be treated more like a
title (as it is settable by the page owner and easy to spam). I will look at
the parameters once I have a tested implementation, so I can index the same
segments with different parameters and compare results.

IMHO it's difficult to say anything general about this. The "path" part of the url, in addition to all the terms we can get from it, gives us important information about the nesting level of the current page, and all in all it's somewhat more to be trusted than the page title. Some ranking methods give deeply nested pages a lower score than pages directly linked to the top of the site.
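As a rough illustration of the nesting-level signal, the sketch below counts the non-empty path segments of a url and turns the count into a damping factor. Both depth() and depthFactor() are hypothetical helpers, not part of Nutch, and the 1/(1+depth) curve is just one arbitrary choice of a function that decreases with depth.

```java
import java.net.URI;

public class PathDepthSketch {

    // Number of non-empty path segments, e.g. "/kof/kultur/sagor/" -> 3.
    // A trailing file name counts as a segment too.
    static int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    // HYPOTHETICAL damping: deeper pages get a smaller factor.
    static float depthFactor(int depth) {
        return 1.0f / (1 + depth);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.tjorn.se/",
            "http://www.tjorn.se/kof/kultur/sagor/"
        };
        for (String url : urls) {
            int d = depth(url);
            System.out.println(d + "  " + depthFactor(d) + "  " + url);
        }
    }
}
```

Whether such a factor should be applied at index time (as a document boost) or at query time is left open here, as it is in the discussion.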

Do you think it makes sense to add such functionality? If so I can change these two additional things before posting a patch.

I think the changes related to the "host" field are better understood at this moment than these two. I think you should limit your patch just to the "host" functionality, and we should continue to discuss the other ideas.


--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
