Component fetching during parsing. (vertical crawling)

Ferdy Tue, 20 Jul 2010 05:30:48 -0700

Hello,

We are currently using a heavily modified version of nutch. The main reason for this is that we do not only fetch the urls that the QueueFeeder submits, but also additional resources from urls that are constructed during parsing. For example, let's say the QueueFeeder submits an html page to the fetcher, and after the fetch the page gets parsed. Nothing special so far. However, the parser decides it also needs some images on the page. Perhaps these images link to other html pages, and we might want to fetch these too. All of this is needed to parse information about the particular url we started with. We like to call these extra fetch urls Components, because they are additional resources required to parse the initial html page that was selected for fetching.

At first we tried to solve this "vertical crawling" problem by using multiple crawl cycles. Each crawl simply selects the outlinks that are needed for the parsing of the initial html page. A single inspection can span 2, 3 or 4 cycles (depending on the depth of the inspection's graph). There are several problems with this approach: for one, the crawldb gets cluttered with all these component urls, and secondly, inspection completion times can be very long.

As an alternative we decided to let the parser fetch needed components on-the-fly, so that additional urls are instantly added to the fetcher lists. Every fetched url is either a non-component (the QueueFeeder fed it; start parsing this resource) or a component (the fetcher hands the resource over to the parser that requested it). In order to keep parsers alive we always try to fetch components first, while respecting fetch politeness. A downside of this solution is that the total running time of a fetch task becomes more difficult to anticipate. For example, if you inject and generate 100 urls and they are fetched in a single task, you might end up fetching a total of 1100 urls (assuming each inspection needs 10 components). We found this behaviour to be acceptable.
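
To make the "components first" idea and the 100 versus 1100 arithmetic concrete, here is a rough sketch in Python. It is not our nutch code. The names run_fetch_task, components_of and ten_components are made up for illustration, fetch politeness is left out entirely, and the duplicate check is an assumption added only so the toy loop always terminates.

```python
import heapq
from itertools import count

# Hypothetical priorities: a lower value is fetched first,
# so components always go before urls fed by the QueueFeeder.
COMPONENT, SEED = 0, 1


def run_fetch_task(seeds, components_of):
    """Simulate a single fetch task and return the urls in fetch order.

    seeds         -- urls submitted by the QueueFeeder
    components_of -- hypothetical callback: url -> component urls that the
                     parser requests after that url has been fetched
    """
    order = count()  # tie-breaker: FIFO order within the same priority
    queue = [(SEED, next(order), url) for url in seeds]
    heapq.heapify(queue)
    seen = set(seeds)  # assumption: never fetch the same url twice
    fetched = []
    while queue:
        _, _, url = heapq.heappop(queue)
        fetched.append(url)  # stands in for the real fetch + parse
        for component in components_of(url):
            if component not in seen:
                seen.add(component)
                heapq.heappush(queue, (COMPONENT, next(order), component))
    return fetched


def ten_components(url):
    """Toy parser: every seed page needs 10 components, components need none."""
    if "/" in url:
        return []
    return [f"{url}/component{i}" for i in range(10)]


if __name__ == "__main__":
    seeds = [f"seed{i}" for i in range(100)]
    fetched = run_fetch_task(seeds, ten_components)
    print(len(fetched))  # 100 seeds + 100 * 10 components = 1100
```

In this sketch all ten components of seed0 are fetched before seed1 is touched, which is the behaviour we mean by keeping parsers alive. A real implementation would also have to honour per-host delays when picking the next url.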

Because of our custom version of nutch we cannot easily upgrade to newer versions (we're still using modified fetcher classes from nutch 0.9). Often we end up fixing bugs that have already been fixed by the community. Also, other users might benefit from our changes.

Therefore we propose to redesign our vertical crawling system from scratch for the newer nutch versions, should there be any interest from the community. Perhaps we are not the only ones to have implemented such a system with nutch. So, what are your thoughts about this?


Ferdy.
