[Nutch-dev] Interleaved (parallel) fetch cycles

Andrzej Bialecki Thu, 11 May 2006 05:49:59 -0700

Hi,

I'm planning to work on adding support in 0.8 for interleaved fetch cycles.

What this means is that (within some limits) you can generate multiple fetchlists, fetch them at different times, and then update the crawldb, not necessarily in the sequence in which they were generated. You can also generate more fetchlists before any updatedb is run.

This functionality was supported in 0.7.x. When FetchListTool selected a Page for fetching, its next fetch time was pushed 1 week into the future. This was a simple and effective way to prevent the same Pages from ending up on the next fetchlist, while still letting the wait "time out" after 1 week if, e.g., fetching failed, the segment was lost, or whatever. Please note that this method requires modification of WebDB.

If fetching was completed and an updatedb was run, the original fetchTime/fetchInterval could be recovered from a copy of the Page inside the FetcherOutput.
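
To make the 0.7 trick concrete, here is a minimal sketch of the idea, not the actual FetchListTool code. The class name, the map-based stand-in for WebDB and the generate() method are all hypothetical, made up only for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a map of URL -> next fetch time stands in for WebDB.
public class FetchDeferralSketch {
    static final long WEEK_MS = 7L * 24 * 60 * 60 * 1000;

    /** Selects every URL that is due at 'now' and pushes its next fetch time one week ahead. */
    public static List<String> generate(Map<String, Long> nextFetchTime, long now) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> e : nextFetchTime.entrySet()) {
            if (e.getValue() <= now) {
                fetchlist.add(e.getKey());
                // The "mark": this entry is skipped by later generate runs until the week
                // expires or an updatedb restores the real fetch time.
                e.setValue(now + WEEK_MS);
            }
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        Map<String, Long> db = new LinkedHashMap<>();
        db.put("http://a.example/", 0L);
        db.put("http://b.example/", 0L);
        System.out.println(generate(db, 1000L)); // both URLs are due
        System.out.println(generate(db, 2000L)); // nothing is due, both were deferred
    }
}
```

Two generate runs in a row therefore produce disjoint fetchlists, and an entry becomes eligible again after a week even if its segment is never fed back through updatedb.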

Now, in 0.8 we do it differently. We don't modify CrawlDB, so we have no way of recording which CrawlDatums end up on some fetchlist. This means that two "generate" operations run in sequence, without an intervening updatedb, will produce exactly the same fetchlists.

Generator would have to be modified to use the same trick as in 0.7. Unfortunately, this probably means that it will have to run a sort of updatedb, using its output fetchlist to mark entries in CrawlDB. This adds another map-reduce job to an already long-ish job (Generator already uses two map-reduce jobs). This also means that Generator will have to put a lock on CrawlDB for the duration of this job, so that no other "generate" or "updatedb" can update it at the same time.

Then, when running an updatedb, the issue of scores and metadata comes into question. We can imagine now that some other updatedb-s were run in the meantime, not necessarily with earlier fetchlists, so the score and metadata info could actually be newer in the latest CrawlDB than what we have inside the current segment. In such a case, we will get the following in CrawlDbReducer:

* "old" value from CrawlDb (which could be actually newer!). Even if it's old, its fetchTime could be in the future due to the trick described above. We could also get null here, if we just discovered a new page.

* "original" value from CrawlDb, which was recorded in the fetchlist. This, for once, has a true fetch time, and its metadata and score are snapshots of that information at the time of "generate".

* "new" value from Fetcher, with new score / metadata information. We will also get "new" values from redirects, which might not match any of the above values (i.e. they could use unique urls).


* "linked" values from parsers, with score / metadata contributions.

Now, the question is how to update the score, metadata, fetchTime and fetchInterval information. We need a way to determine if the "new" value we have is in fact newer or older than the "old" value. I'm not sure how to do this: fetchTime and fetchInterval could have been modified, so they are not reliable... Perhaps we should add a "generation ID" to CrawlDatum? Anyway, assuming we have a way to know this:

* if "new" is newer than "old", then we take all metadata from "old", overwrite all info with the values from "new", and we keep "new".

* if "new" is older than "old", then we overwrite its metadata with all values from "old". We do the same with fetchTime and fetchInterval. What about the score? I think that for new score calculations we should take the latest available score info from the "old" value.
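
The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: the Datum class, its generationId field and the merge() method are hypothetical and are not the real CrawlDatum or CrawlDbReducer API. Equal generation IDs are treated here as "not newer", which is an arbitrary choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: Datum is a simplified stand-in for CrawlDatum,
// and generationId is the proposed (not existing) field.
public class DatumMergeSketch {
    public static final class Datum {
        public final long generationId;
        public final float score;
        public final long fetchTime;
        public final int fetchInterval;
        public final Map<String, String> metadata;

        public Datum(long generationId, float score, long fetchTime,
                     int fetchInterval, Map<String, String> metadata) {
            this.generationId = generationId;
            this.score = score;
            this.fetchTime = fetchTime;
            this.fetchInterval = fetchInterval;
            this.metadata = metadata;
        }
    }

    /** Merges the CrawlDb value ("old", may be null) with the Fetcher value ("new"). */
    public static Datum merge(Datum old, Datum fresh) {
        if (old == null) {
            return fresh; // newly discovered page, nothing to reconcile
        }
        Map<String, String> meta = new HashMap<>();
        if (fresh.generationId > old.generationId) {
            // "new" is newer: start from old metadata, let "new" overwrite, keep "new".
            meta.putAll(old.metadata);
            meta.putAll(fresh.metadata);
            return new Datum(fresh.generationId, fresh.score,
                             fresh.fetchTime, fresh.fetchInterval, meta);
        }
        // "new" is older: old metadata, fetchTime, fetchInterval and score all win.
        meta.putAll(fresh.metadata);
        meta.putAll(old.metadata);
        return new Datum(old.generationId, old.score,
                         old.fetchTime, old.fetchInterval, meta);
    }
}
```

In both branches the merged metadata is the union of the two maps; only the side that wins on conflicting keys changes.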

Updatedb would also have to lock CrawlDB so that no other updatedb or generate could run while we modify it.
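
One simple way to get such mutual exclusion is a lock file that is created atomically inside the CrawlDB directory. The sketch below is only an assumption about how this could look: it uses the local filesystem via java.nio rather than Nutch's own filesystem layer, and the class name, the ".locked" file name and both methods are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: a lock file guarding a CrawlDB directory on a local filesystem.
public class CrawlDbLockSketch {
    static final String LOCK_NAME = ".locked"; // assumed name, not taken from Nutch

    /** Returns true if the lock was acquired, false if another job already holds it. */
    public static boolean tryLock(Path crawlDbDir) {
        try {
            // createFile is atomic: it fails if the file already exists.
            Files.createFile(crawlDbDir.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock; does nothing if it is not held. */
    public static void unlock(Path crawlDbDir) {
        try {
            Files.deleteIfExists(crawlDbDir.resolve(LOCK_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path db = Files.createTempDirectory("crawldb");
        System.out.println(tryLock(db)); // first job acquires the lock
        System.out.println(tryLock(db)); // a concurrent generate or updatedb is refused
        unlock(db);
    }
}
```

A job that crashes would leave the lock file behind, so a real implementation would also need some way to detect and clear stale locks.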

That's probably all at the moment ... Any comments or suggestions appreciated!


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



