Hi Vangelis,
In Nutch 2.x we use a partitioner to distribute URLs. In the reduce phase of
GeneratorJob we take only topN / (reduce count) URLs. We don't choose randomly by
default, but we don't take the ones with the highest score either.
Am I wrong, Sebastian?
Talat
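
A minimal sketch of the partition-then-limit idea described above, written in Python for brevity. This is not Nutch code: the function names, the CRC32 hash, and partitioning by host are assumptions made for illustration, and the order in which entries reach each reducer is deliberately left out.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def partition_by_host(urls, num_reducers):
    """Assign each URL to a reducer index using a stable hash of its host.

    Hypothetical helper: the real partitioner may key on host, domain, or IP.
    """
    partitions = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        idx = zlib.crc32(host.encode("utf-8")) % num_reducers
        partitions[idx].append(url)
    return partitions


def per_reducer_limit(top_n, num_reducers):
    """Each reducer may emit at most topN / (reduce count) URLs."""
    return top_n // num_reducers


def generate(urls, top_n, num_reducers):
    """Keep at most the per-reducer limit from every partition."""
    limit = per_reducer_limit(top_n, num_reducers)
    parts = partition_by_host(urls, num_reducers)
    return {idx: batch[:limit] for idx, batch in parts.items()}
```

One consequence visible in this sketch: if the hosts are spread unevenly over the partitions, the total number of selected URLs can be lower than topN, because a reducer with few URLs cannot hand its unused quota to another one.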
On 22 May 2014 at 18:59, Vangelis karv wrote:
for selection. Domains (hosts) at the start of a
region (mapper input) have the highest chance of getting selected.
I guess that the first line is wrong and should be updated.
Date: Thu, 22 May 2014 21:28:10 +0200
From: wastl.na...@googlemail.com
To: user@nutch.apache.org
Subject: Re: Importance of Score
Hi Vangelis,
Does it choose Urls with the highest score
Yes, it does. Have a look at generatorSortValue(...) in one
(Apache Nutch 2.2.1)
Hi again!
GeneratorJob marks the best topN sites for fetching. Does it choose URLs with
the highest score or random URLs? If it chooses randomly, then what's the point
of the score field?
Thank you!
Hi Vangelis,
Does it choose Urls with the highest score
Yes, it does. Have a look at generatorSortValue(...) in one of the scoring filter
plugins.
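
The idea of a pluggable sort value can be sketched as follows. This is a Python illustration only, not the Nutch API: `score_sort_value`, `select_top`, the `page` dictionaries, and the `init_sort` parameter are hypothetical names chosen for the example. The point is that whatever a filter returns as its sort value decides the ranking, and the highest-ranked entries are taken first.

```python
def score_sort_value(page, init_sort=1.0):
    """Hypothetical filter: rank purely by the stored score."""
    return page["score"] * init_sort


def select_top(pages, limit, sort_value=score_sort_value):
    """Sort pages by the filter's sort value, highest first, and keep `limit`."""
    ranked = sorted(pages, key=sort_value, reverse=True)
    return ranked[:limit]


pages = [
    {"url": "http://a.example/", "score": 0.2},
    {"url": "http://b.example/", "score": 1.5},
    {"url": "http://c.example/", "score": 0.9},
]
best = select_top(pages, 2)  # keeps b.example, then c.example
```

Passing a different `sort_value` function changes which URLs are selected without touching the selection logic itself.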
In case of scoring-opic (activated by default), URLs/docs are simply ranked by
score taken from CrawlDb. But other scoring filters may use different