Hello,
I was reading the code and implementing some features today, and I want to
summarize the work as I promised Andrzej and Michael. This email is a bit long, but I promised some details.
Status of related features in current nutch codebase:
- The "site" field added by SiteIndexingFilter cannot be used to store the hostname because it is not tokenized. As I understand the purpose of this plugin (limiting answers to a given site), it should not be tokenized, but we need a tokenized host.
- there is a "title" field added by index-basic plugin but it is not indexed - it is stored only for display purposes.
Correct.
One comment though on the value of the "site", something that I intended to raise but always kept forgetting... Page.computeDomainID() returns a nice, unique long value, which (when encoded with Character.MAX_RADIX) is always shorter than the host name. This could be used instead of "site", to save some space in the index. The corresponding query plugin would compute the domain ID using the same formula.
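To illustrate the encoding idea, here is a minimal Java sketch. The domainId function below is a hypothetical stand-in (a plain 64-bit FNV-1a hash of the lower-cased host) and is not the formula used by Page.computeDomainID(); the class and method names are made up for the example. Only Long.toString(long, int) with Character.MAX_RADIX (36) is the standard library call referred to above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class DomainIdSketch {
    // Hypothetical stand-in for Page.computeDomainID():
    // 64-bit FNV-1a hash over the bytes of the lower-cased host name.
    static long domainId(String host) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Encode the ID in base 36 (Character.MAX_RADIX) to get a compact index term.
    static String encode(long id) {
        return Long.toString(id, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        String host = "www.tjorn.se";
        System.out.println(host + " -> " + encode(domainId(host)));
    }
}
```

A signed long encoded in base 36 takes at most 14 characters including the sign. The indexing side and the query side would both have to call the same ID function so that the terms line up.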
There are two sets of changes required to add host and title fields to the index and use them during search.
Indexing changes:
- index-basic plugin: I assume index-basic is to be changed to include an indexed, tokenized, unstored "host" field and an indexed, tokenized, stored "title" field, and to exclude the title from the "anchor" field.
Yes.
- NutchDocumentAnalyzer:
- for "host" and "title" use the same analyzer as for "anchor" and "url".
Hmmm... It is not clear to me why in the current NutchDocumentAnalyzer the AnchorAnalyzer is used for "url". While for the "anchor" field it makes sense, because it sets a gap between the terms to prevent matches across consecutive anchors, in case of "url" we only ever have a single value being added. This results in just adding three empty positions at the start of the tokens, e.g. for the url "http://www.tjorn.se/kof/kultur/sagor/" we get:
null_3, http|http-www|http-www-tjorn, www|www-tjorn, tjorn, se, kof, kultur, sagor
So, I would argue that it doesn't make much sense, and we should fix it to use the ContentAnalyzer for "url".
The same would go for the new fields, "host" and "title", because there is only ever a single value of each.
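The effect of the gap can be shown with a simplified model in Java. This is not Lucene or Nutch code: the class name, the positions method, and the assumption that every value (including the first) is preceded by the gap are illustrative only, chosen to match the three empty leading positions seen in the token dump above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PositionGapSketch {
    // Assign a position to each token. Assumption: each value, including
    // the first one, is preceded by `gap` empty positions.
    static List<Integer> positions(List<List<String>> values, int gap) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (List<String> tokens : values) {
            pos += gap;                       // skip the empty positions
            for (int i = 0; i < tokens.size(); i++) {
                pos++;
                out.add(pos);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> single = Arrays.asList("kof", "kultur", "sagor");
        // Single value: the gap only shifts every position by the same amount.
        System.out.println(positions(Arrays.asList(single), 3));
        System.out.println(positions(Arrays.asList(single), 0));
        // Two values: the gap separates the last token of one value from
        // the first token of the next, so a phrase cannot span both.
        List<String> a = Arrays.asList("free", "software");
        List<String> b = Arrays.asList("download", "page");
        System.out.println(positions(Arrays.asList(a, b), 3));
    }
}
```

In this model a single-valued field gains nothing from the gap, since relative positions within the value are unchanged; the gap only matters when several values are added to the same field.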
- NutchSimilarity:
- length normalization should treat host as url and title as anchor for now.
Yes, probably correct... we need to see the results on some well-known cases.
Searching:
- BasicQueryFilter -
- add host and title fields, handled exactly like all other fields. To start, I will set TITLE_BOOST=1.5 and HOST_BOOST=2 (the host would be matched twice, in the "host" and "url" fields, so it will influence the score very much). After implementation I will run some tests to choose boost values that look OK (at least to me).
Regarding the TITLE_BOOST - well, the boost for anchors is 2.0f, why should it be lower for this special anchor, which is the title? You make an assumption here that the author of the page is less to be trusted with the title than the others who link to his page...
Regarding the HOST_BOOST - if you consider the example I gave in the email that started this thread, the reason for treating the host part of urls separately was to increase the quality of results, by boosting up the scoring for sites that are more likely the "reference" sites for the query terms. So, given the query "ikea", and the following urls:
1. "http://www.ikea.se/some/other/name.html"
2. "http://www.some.se/some/other/ikea.html"
3. "http://ikea.some.se/some/other/name.html"
which of the above urls should score the highest?
With the current code, all three would get the same score. With your patch applied, 1. and 3. would get the same score, and 2. would get a lower one. Now, the interesting question is this: is there any meaningful and generic way to introduce a difference in scoring between 1. and 3.?
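The distinction between the three urls can be sketched in plain Java. The tokenizer here (lower-case, split on non-alphanumeric characters) is a simplification and not the Nutch analyzer, and the class and method names are hypothetical; the sketch only shows where the term "ikea" occurs, not how scores are computed.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HostMatchSketch {
    // Simplified tokenizer: lower-case, split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        if (text == null) {
            return out;
        }
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True if the term is one of the tokens of the host part of the url.
    static boolean hostMatches(String url, String term) {
        return tokens(URI.create(url).getHost()).contains(term);
    }

    // True if the term is one of the tokens of the path part of the url.
    static boolean pathMatches(String url, String term) {
        return tokens(URI.create(url).getPath()).contains(term);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.ikea.se/some/other/name.html",
            "http://www.some.se/some/other/ikea.html",
            "http://ikea.some.se/some/other/name.html"
        };
        for (String url : urls) {
            System.out.println(url + " host=" + hostMatches(url, "ikea")
                    + " path=" + pathMatches(url, "ikea"));
        }
    }
}
```

Urls 1. and 3. match in the host and url 2. matches only in the path, which is why a separate "host" field can separate 2. from the other two but cannot by itself tell 1. from 3.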
I have already implemented all these changes (not a lot of work, in fact, after figuring out what to change). I will do basic tests tomorrow, and after basic verification of the implementation I will send the patch for others interested to try and to comment on the results.
Great. If the patch is not too large, just send it to the list, otherwise you can put it in Bugzilla.
Changes that are introduced by this patch would modify index structure (addition of new field) and will change default query. I think it should be possible to use new code with old index (it should behave as old code as new fields in query would not be present in document), but mixing new and old segments might be a problem. So I think this change requires reindexing.
Correct - mixing indexes would be a big no-no. Using the new code with old indexes would lead to different absolute score values.
During implementation I have found two additional ideas:
1) Do not index the url (keep it as a stored-only field) and add separate host and path fields as indexed. This would not index the protocol, port and some other parts of the url, but I am not sure if indexing them makes sense. It would be easier to control the effect of weights and length normalization if the host is not counted twice, but this would require reindexing, as some old fields would be used differently in the query, so it would not work as before with an old index.
In some cases you want to select a subset of results by protocol (e.g. https, file, smb, etc). So, it seems to me that you need to keep the protocol around.
Also, I think that in some cases you want to run a phrase match across the whole url, so keeping an index of the whole url would be beneficial.
The fact that terms in "host" and "url" overlap can be adjusted by boosting and different normalization. Please also remember that terms, which are "qualified" with the field name (like in a query "anchor:test") would never match the content in other fields.
2) I do not have any evidence yet, but looking at the data I have a feeling that the "not host" part of a url is not as important as its current boost factor indicates. Probably it should be treated more like a title (as it is settable by the page owner and easy to spam). I will look at the parameters once I have a tested implementation, so I can index the same segments with different parameters and compare results.
IMHO it's difficult to say anything general about this. The "path" part of the url, in addition to all terms we can get from it, gives us important information about the nesting level of the current page, and all in all it's somewhat more to be trusted than the page title. Some ranking methods give deeply nested pages a lower score than pages directly linked to the top of the site.
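As a small illustration of the nesting level, the following Java sketch counts the non-empty segments of the url path. The class and method names are hypothetical, and counting the file name as a level is an arbitrary choice made for the example; how (or whether) such a depth should feed into scoring is left open.

```java
import java.net.URI;

class PathDepthSketch {
    // Count the non-empty path segments of a url, including the file name if any.
    static int depth(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return 0;
        }
        int depth = 0;
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                depth++;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println(depth("http://www.tjorn.se/"));
        System.out.println(depth("http://www.tjorn.se/kof/kultur/sagor/"));
    }
}
```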
Do you think it makes sense to add such functionality? If so I can change these two additional things before posting a patch.
I think the changes related to the "host" field are better understood at this moment than these two. I think you should limit your patch just to the "host" functionality, and we should continue to discuss the other ideas.
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
