Thanks for the detailed explanation. I think it is a reasonable approach - I
am planning to do relevancy tests next week, so I will describe my findings.
Regards,
Piotr
Doug Cutting wrote:
Piotr Kosiorowski wrote:
You have not committed the NutchSimilarity class
(at least I cannot find the new version in SVN) so
Piotr Kosiorowski wrote:
You have not committed the NutchSimilarity class
(at least I cannot find the new version in SVN), so for host and title
the default length normalization is used. Is it on purpose or by accident?
It was on purpose, but with uncertainty. Sorry I forgot to mention it.
In general, I think w
Hello,
Thanks for committing it so quickly. I am happy with your changes except one,
which I do not understand. You have not committed the NutchSimilarity class
(at least I cannot find the new version in SVN), so for host and title
the default length normalization is used. Is it on purpose or by accident?
Regards
I just applied this patch, somewhat modified.
Thanks for providing it!
Doug
Andrzej Bialecki wrote:
Could somebody confirm/deny my analysis in the previous post, that the
use of ANCHOR_ANALYZER for "url" is wrong, and the CONTENT_ANALYZER
should be used instead?
I don't think it's helpful, but nor do I think it's particularly
harmful. I will remove it. One thing it wo
Piotr Kosiorowski wrote:
Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do
[...]
Index: src/java/org/apache/nutch/analysis/NutchDocumentAnalyzer.java
===
--- src/java/org/apache/nutch/analysis/NutchDocumentA
Hello,
I am attaching the patch in "svn diff" format. I hope it is ok - I do
not have a lot of experience with SVN so correct me if I am wrong.
The code looks surprisingly simple to me - maybe I have overlooked
something. I was able to index two segments today, one containing 120,000
pages and
To manipulate ranking, you can use boosting.
You can boost documents in an index filter extension that you implement as a
plugin.
The problem is that you cannot change the boost field, which is also
stored in the index (unindexed).
So this may cause trouble with dedup and with explaining the ranking.
Your changes make good sense. I look forward to seeing the patch.
My preference would be to first apply the patch as proposed and then,
subsequently, consider your final two points.
Thanks!
Doug
Piotr Kosiorowski wrote:
Hello,
I was reading the code and implementing some features today and want
Hello Stefan,
I read your mail with interest, as I had not thought about approaching
this problem in such a way. But after some thought I do not think
it can be used to solve this particular problem.
Maybe I am wrong but I think boosting in nutch is mainly used to
increase or decrease score
Hello,
I was reading the code and implementing some features today and want to
summarize it, as I promised Andrzej and Michael - my email is a bit
long, but I promised some details.
Status of related features in current nutch codebase:
- "site" field added by SiteIndexingFilter cannot b
Piotr,
Lucene offers you the possibility to boost particular fields
in a document but this functionality is not used in nutch (as far as I
can tell).
At first this statement confused me, but after digging in the sources I
would say you are right.
I missed that until writing my last mail and was thinking
Hi Piotr,
as I wrote a month ago, I started working on the problem (was it really
so long ago? :-( ). But then real life caught up with me, and when I checked the
nutch code again, many parts had changed. But the plugin I
started/copied should still work; perhaps I should give it another try...
Adding
Piotr Kosiorowski wrote:
Hello,
I was reading the code and implementing some features today and want to
summarize it, as I promised Andrzej and Michael - my email is a bit
long, but I promised some details.
Status of related features in current nutch codebase:
- "site" field added by Si
Hello,
I would like to have title and host as separate indexed fields in our
installation. As this topic was already discussed on the list over a
month ago, I want to make sure that nothing has been implemented so far
before I start coding myself. I am working for Sabre Holdings and we are
implem
Hi,
I'm afraid I'll have to deal with the ranking over the next few days / weekend.
So perhaps I can contribute some time and work for all of us.
Before taking the wrong way, some questions in advance:
- using luke to look at my indexes I see a field called
- some more checking: there is a query-site-pl
Andrzej Bialecki wrote:
Doug Cutting wrote:
NutchSimilarity.lengthNorm() penalizes short content by considering all
documents with less than 1000 content tokens to be normalized as
though they have 1000 content tokens. Is this not sufficient?
Not in my experience. Please consider the following hi
Doug Cutting wrote:
Andrzej Bialecki wrote:
* the DefaultSimilarity seems to excessively favor small lengths of
"content" (high tf) and anchor texts (too high boost value?).
NutchSimilarity.lengthNorm() penalizes short content by considering all
documents with less than 1000 content tokens to be
Andrzej Bialecki wrote:
* the DefaultSimilarity seems to excessively favor small lengths of
"content" (high tf) and anchor texts (too high boost value?).
NutchSimilarity.lengthNorm() penalizes short content by considering all
documents with less than 1000 content tokens to be normalized as though
Hi,
After analyzing some of the search results from my index of ~10 million pages, I
noticed a few strange results. It seems to me that:
* the DefaultSimilarity seems to excessively favor small lengths of
"content" (high tf) and anchor texts (too high boost value?).
* title is neither indexed nor tokenized
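
To make the length normalization discussed in this thread concrete, here is a minimal standalone sketch. It assumes the default norm is 1/sqrt(numTokens) and that the Nutch variant treats any "content" field shorter than 1000 tokens as if it had 1000 tokens, as described in the quoted messages. The class name, method names and constant are hypothetical illustrations and are not the actual NutchSimilarity source.

```java
// Standalone illustration; it does not depend on Lucene or Nutch.
// All names here are hypothetical and chosen for this sketch only.
final class LengthNormSketch {

    // Assumed floor, taken from the "1000 content tokens" figure quoted in the thread.
    static final int MIN_CONTENT_TOKENS = 1000;

    // Default-style norm: shorter fields get a larger norm, 1 / sqrt(numTokens).
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // Floored norm: a "content" field shorter than the floor is normalized
    // as though it had MIN_CONTENT_TOKENS tokens. Other fields (for example
    // "title" or "host") fall through to the default norm.
    static float flooredLengthNorm(String fieldName, int numTokens) {
        if ("content".equals(fieldName)) {
            numTokens = Math.max(numTokens, MIN_CONTENT_TOKENS);
        }
        return defaultLengthNorm(numTokens);
    }

    public static void main(String[] args) {
        int[] lengths = {10, 100, 1000, 10000};
        for (int n : lengths) {
            System.out.printf("%6d tokens: default=%.4f content=%.4f title=%.4f%n",
                    n,
                    defaultLengthNorm(n),
                    flooredLengthNorm("content", n),
                    flooredLengthNorm("title", n));
        }
    }
}
```

Under these assumptions every "content" field of up to 1000 tokens gets the same norm, so very short pages no longer outrank longer ones on length alone, while short "title" and "host" fields keep the larger default norm. That difference is what the question about default length normalization for host and title refers to.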