Re: [Nutch-general] CrawlDbReader TopN

Andrzej Bialecki Wed, 25 Jul 2007 08:33:56 -0700

(Please don't cross-post to multiple lists)

Emmanuel wrote:
> I've been through the code of the CrawlDbReader class. I discovered the
> method "processTopNJob", which uses the classes CrawlDbTopNMapper and
> CrawlDbTopNReducer.
> I'm wondering why we have this function. Is it an old implementation that
> was used before the Generator to get the TopN links to fetch, or is it
> something else?
> I would appreciate it if you gave me your thoughts.


It's not an old method; it's in use. See the synopsis in 
CrawlDbReader.main(). The purpose of this option is to dump the 
top-scoring URLs, together with their scores. This is useful for 
monitoring CrawlDb for potential scoring problems.
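
To illustrate what "top scoring URLs, together with their scores" amounts 
to, here is a minimal standalone Java sketch of top-N selection by score 
using a bounded min-heap. It is not the CrawlDbTopNMapper / 
CrawlDbTopNReducer code and does not use Hadoop; the class name 
TopNSketch, the topN method and the example.com URLs are made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Illustrative only: hypothetical names, not the Nutch implementation.
class TopNSketch {

    // Returns up to n entries with the highest scores, best score first.
    static List<Map.Entry<String, Float>> topN(Map<String, Float> scores, int n) {
        List<Map.Entry<String, Float>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap ordered by score: the head is the weakest entry kept so far.
        PriorityQueue<Map.Entry<String, Float>> heap =
            new PriorityQueue<Map.Entry<String, Float>>(
                (a, b) -> Float.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // drop the current lowest score
            }
        }
        result.addAll(heap);
        // Heap iteration order is unspecified, so sort descending by score.
        result.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://example.com/a", 0.5f);
        scores.put("http://example.com/b", 2.25f);
        scores.put("http://example.com/c", 1.0f);
        // Print "score<TAB>url" for the two best-scoring URLs.
        for (Map.Entry<String, Float> e : topN(scores, 2)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

A dump like this makes outliers easy to spot: if unexpected URLs sit at 
the top with very large scores, that points to a scoring problem.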

> 
> I also found a class that is not used: "CrawlDbDumpReducer" is defined
> but never used or instantiated.
> Don't you think we could remove it from the source code?
> 

Yes, we can remove this class - it's equivalent to IdentityReducer, 
which is used implicitly by this job. This class is a leftover from the 
time when it also contained some filtering code.
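
For readers unfamiliar with the term, "equivalent to IdentityReducer" 
means the reduce step passes every (key, value) pair through unchanged. 
The sketch below shows that behaviour in plain Java without the Hadoop 
API; the class IdentityReduceSketch, the identityReduce method and the 
List used as an output collector are hypothetical stand-ins, not Hadoop 
or Nutch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: hypothetical names, no Hadoop dependency.
class IdentityReduceSketch {

    // Emits one "key<TAB>value" line per input value, with no filtering
    // or aggregation. The out list stands in for an output collector.
    static <K, V> void identityReduce(K key, Iterator<V> values, List<String> out) {
        while (values.hasNext()) {
            out.add(key + "\t" + values.next());
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        identityReduce("http://example.com/", Arrays.asList("v1", "v2").iterator(), out);
        for (String line : out) {
            System.out.println(line);
        }
    }
}
```

A reducer that does only this adds nothing over the framework default, 
which is why the class can go.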


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


