Thanks for the clarification, I missed all these cross links!
You definitely 'are in the know'. :-)
Stefan



On 31.01.2006 at 20:31, Doug Cutting wrote:

Stefan Groschupf wrote:
The call CrawlDb.createJob(...) creates the crawl db update job. In this method the main input folder is defined:
job.addInputDir(new File(crawlDb, CrawlDatum.DB_DIR_NAME));
However, in the update method (lines 48-49) two more input dirs are added. This confuses me: I understand in theory that the parsed data are needed to add fresh URLs to the crawldb, but I'm surprised that both folders are added.

One is from the fetcher, the other from the parser.

The fetcher writes a CrawlDatum for each page fetched, with STATUS_FETCH_*.

The parser writes a CrawlDatum for each link found, with a STATUS_LINKED.
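
Putting those together, update() ends up looking roughly like this (a
sketch from memory, not a copy of the source; check the directory-name
constants in your checkout):

  JobConf job = CrawlDb.createJob(conf, crawlDb);
  // createJob() already added the db dir itself; update() then adds
  // the per-segment fetcher and parser outputs:
  job.addInputDir(new File(segment, CrawlDatum.FETCH_DIR_NAME));  // fetcher
  job.addInputDir(new File(segment, CrawlDatum.PARSE_DIR_NAME));  // parser
  JobClient.runJob(job);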

Secondly, I can't find the code that writes CrawlDatum objects into these folders; instead I found that the FetcherOutputFormat writes ParseImpl and Content into them.

FetcherOutputFormat line 73, and ParseOutputFormat line 107.
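
Both output formats fan one record out into several files, and the
CrawlDatum part gets its own MapFile next to the content and parse data.
Roughly (the variable names here are approximate, not verbatim):

  // FetcherOutputFormat: store the fetch status apart from the Content
  fetchOut.append(key, fetcherOutput.getCrawlDatum());
  // ParseOutputFormat: emit one fresh CrawlDatum per outlink found
  crawlOut.append(new UTF8(toUrl),
      new CrawlDatum(CrawlDatum.STATUS_LINKED, interval));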

I also can't find the code where these objects are converted or merged together.

CrawlDbReducer.reduce().
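
For each URL the reducer sees everything from all three input dirs at
once: the old db entry, any fetch entry, and any number of STATUS_LINKED
entries, and it collapses them into a single new entry. In outline (a
sketch, not the real code):

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
    throws IOException {
    // 1. scan values: keep the old db datum and the fetch datum with
    //    the highest status; tally the STATUS_LINKED contributions
    // 2. pick the result: a successful fetch -> STATUS_DB_FETCHED,
    //    a URL seen only via links -> a new STATUS_DB_UNFETCHED datum,
    //    otherwise keep or adjust the old entry (retries, gone, ...)
    output.collect(key, result);  // exactly one CrawlDatum per URL
  }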

Thirdly, wouldn't it be cleaner to move the adding of these folders into the createJob method as well?

No, the createJob() method is also used by the Injector, where these directories are not appropriate.
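
The Injector first maps the plain-text URL list into CrawlDatum entries
in a temp dir, then reuses createJob() to merge just that one dir into
the db, something like:

  JobConf mergeJob = CrawlDb.createJob(conf, crawlDb);
  mergeJob.addInputDir(tempDir);  // injected CrawlDatums, no segment dirs
  JobClient.runJob(mergeJob);

So the segment dirs really are specific to update().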

Doug




