Re: [Nutch-dev] Adding title and site to scoring

2005-03-24 Thread Piotr Kosiorowski
Thanks for the detailed explanation. I think it is a reasonable approach - I am planning to do relevancy tests next week so I will describe my findings. Regards, Piotr Doug Cutting wrote: Piotr Kosiorowski wrote: You have not committed the NutchSimilarity class (at least I cannot find a new version in SVN) so

Re: [Nutch-dev] Adding title and site to scoring

2005-03-24 Thread Doug Cutting
Piotr Kosiorowski wrote: You have not committed the NutchSimilarity class (at least I cannot find a new version in SVN) so for host and title the default length normalization is used. Is it on purpose or by accident? It was on purpose, but with uncertainty. Sorry I forgot to mention it. In general, I think w

Re: [Nutch-dev] Adding title and site to scoring

2005-03-24 Thread Piotr Kosiorowski
Hello, Thanks for committing it so quickly. I am happy with your changes except one which I do not understand. You have not committed the NutchSimilarity class (at least I cannot find a new version in SVN) so for host and title the default length normalization is used. Is it on purpose or by accident? Regards

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Doug Cutting
I just applied this patch, somewhat modified. Thanks for providing it! Doug

[Nutch-dev] Re: Bug in existing version of NutchDocumentAnalyzer (Re: [Nutch-dev] Adding title and site to scoring)

2005-03-23 Thread Doug Cutting
Andrzej Bialecki wrote: Could somebody confirm/deny my analysis in the previous post, that the use of ANCHOR_ANALYZER for "url" is wrong, and the CONTENT_ANALYZER should be used instead? I don't think it's helpful, but nor do I think it's particularly harmful. I will remove it. One thing it wo

[Nutch-dev] Bug in existing version of NutchDocumentAnalyzer (Re: [Nutch-dev] Adding title and site to scoring)

2005-03-23 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Hello, I am attaching the patch in "svn diff" format. I hope it is ok - I do [...] Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java === --- src/java/org/apache/nutch/analysis/NutchDocumentA

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Piotr Kosiorowski
Hello, I am attaching the patch in "svn diff" format. I hope it is ok - I do not have a lot of experience with SVN so correct me if I am wrong. The code looks surprisingly simple to me - maybe I have overlooked something. I was able to index two segments today one containing 120,000 pages and

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Stefan Groschupf
To manipulate ranking, you can use boosting. You can boost documents in an index filter extension that you implement as a plugin. The problem is that you cannot change the boosting field that is stored in the index as well (unindexed). So this may cause trouble for dedup and for explaining ranking..

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Doug Cutting
Your changes make good sense. I look forward to seeing the patch. My preference would be to first apply the patch as proposed and then, subsequently, consider your final two points. Thanks! Doug Piotr Kosiorowski wrote: Hello, I was reading the code and implementing some features today and want

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Piotr Kosiorowski
Hello Stefan, I read your mail with interest as I had not thought about such a way of approaching this problem. But after some thought I do not think it can be used to solve this particular problem. Maybe I am wrong but I think boosting in nutch is mainly used to increase or decrease score

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Piotr Kosiorowski
Hello, I was reading the code and implementing some features today and want to summarize it as I promised to Andrzej and Michael - my email is a bit long but I have promised some details. Status of related features in current nutch codebase: - "site" field added by SiteIndexingFilter cannot b

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Stefan Groschupf
Piotr, Lucene offers you the possibility to boost particular fields in a document but this functionality is not used in nutch (as far as I can tell). At first this statement confused me, but after digging in the sources I would say you are right. I missed that until writing my last mail and was thinking

Re: [Nutch-dev] Adding title and site to scoring

2005-03-23 Thread Michael Nebel
Hi Piotr, as I wrote a month ago, I started working on the problem (was it really so long ago :-(. But then real life caught me and when I checked the nutch code again - many parts had changed. But the plugin I started/copied should still work perhaps I should give it another try... Adding

Re: [Nutch-dev] Adding title and site to scoring

2005-03-22 Thread Andrzej Bialecki
Piotr Kosiorowski wrote: Hello, I was reading the code and implementing some features today and want to summarize it as I promised to Andrzej and Michael - my email is a bit long but I have promised some details. Status of related features in current nutch codebase: - "site" field added by Si

Re: [Nutch-dev] Adding title and site to scoring

2005-03-20 Thread Piotr Kosiorowski
Hello, I would like to have title and host as separate indexed fields in our installation. As this topic was already discussed on the list over a month ago I want to make sure that nothing was implemented till now before I start coding myself. I am working for Sabre Holdings and we are implem

Re: [Nutch-dev] Adding title and site to scoring

2005-02-16 Thread Michael Nebel
Hi, I'm afraid, I'll have to deal with the ranking the next days / weekend. So perhaps I can contribute some time and work for all of us. Before taking the wrong way, some questions in advance: - using luke to look at my indexes I see a field called - some more checking: there is a query-site-pl

Re: [Nutch-dev] Adding title and site to scoring

2005-02-04 Thread Doug Cutting
Andrzej Bialecki wrote: Doug Cutting wrote: NutchSimilarity.lengthNorm() penalizes short content by considering all documents with less than 1000 content tokens to be normalized as though they have 1000 content tokens. Is this not sufficient? Not in my experience. Please consider the following hi
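For readers following the length-normalization argument, here is a minimal sketch of the behaviour described above: content shorter than 1000 tokens is normalized as though it had 1000 tokens. It is not the NutchSimilarity source. The class name LengthNormSketch and the constant MIN_CONTENT_TOKENS are hypothetical, and the 1/sqrt(numTokens) formula is assumed from Lucene's default length normalization. Other fields, such as title and host, fall through to that default.

```java
public class LengthNormSketch {
    // Floor taken from the thread: shorter content is treated as this long.
    // The constant name is an assumption made for this illustration.
    static final int MIN_CONTENT_TOKENS = 1000;

    // Returns the length norm for a field with the given token count.
    // Assumes the default Lucene-style norm of 1 / sqrt(numTokens).
    static float lengthNorm(String fieldName, int numTokens) {
        if ("content".equals(fieldName)) {
            // Short content gets no extra credit for being short.
            numTokens = Math.max(numTokens, MIN_CONTENT_TOKENS);
        }
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        int[] lengths = {10, 1000, 4000};
        for (int n : lengths) {
            System.out.println("content, " + n + " tokens: "
                    + lengthNorm("content", n));
            System.out.println("title,   " + n + " tokens: "
                    + lengthNorm("title", n));
        }
    }
}
```

With this floor a 10-token page and a 1000-token page receive the same content norm, while content longer than 1000 tokens is still penalized for length. Whether that is enough to offset high term frequency in short documents is the open question in this exchange.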

Re: [Nutch-dev] Adding title and site to scoring

2005-01-25 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: * the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?). NutchSimilarity.lengthNorm() penalizes short content by considering all documents with less than 1000 content tokens to be

Re: [Nutch-dev] Adding title and site to scoring

2005-01-25 Thread Doug Cutting
Andrzej Bialecki wrote: * the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?). NutchSimilarity.lengthNorm() penalizes short content by considering all documents with less than 1000 content tokens to be normalized as though

[Nutch-dev] Adding title and site to scoring

2005-01-17 Thread Andrzej Bialecki
Hi, After analyzing some of the search results from my index of ~10 million pages, I noticed a few strange results. It seems to me that: * the DefaultSimilarity seems to excessively favor small lengths of "content" (high tf) and anchor texts (too high boost value?). * title is not indexed nor tokenized