
We are currently using a heavily modified version of Nutch. The main reason is that we fetch not only the URLs that the QueueFeeder submits, but also additional resources from URLs that are constructed during parsing. For example, say the QueueFeeder submits an HTML page to the fetcher, and after the fetch the page gets parsed. Nothing special so far. However, the parser decides it also needs some images on the page. Perhaps these images link to other HTML pages, and we might want to fetch these too. All of this is needed to parse information about the particular URL we started with. We call these extra fetch URLs Components, because they are additional resources required to parse the initial HTML page that was selected for fetching.

At first we tried to solve this "vertical crawling" problem by using multiple crawl cycles. Each crawl simply selects the outlinks that are needed for parsing the initial HTML page. A single inspection can span 2, 3 or 4 cycles (depending on the inspection's graph depth). There are several problems with this approach: first, the crawldb gets cluttered with all these component URLs, and second, inspection completion times can be very long.

As an alternative we decided to let the parser fetch the components it needs on the fly, so that additional URLs are added to the fetcher lists immediately. Every fetched URL is either a non-component (the QueueFeeder fed it; start parsing this resource) or a component (the fetcher hands the resource over to the parser that requested it). To keep parsers from waiting, we always try to fetch components first, while still respecting fetch politeness. A downside of this solution is that the total running time of a fetch task becomes harder to predict. For example, if you inject and generate 100 URLs and they are fetched in a single task, you might end up fetching a total of 1100 URLs (100 pages plus 100 x 10 components, assuming each inspection needs 10 components). We found this behaviour to be acceptable.
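
To make the "components first" idea concrete, here is a minimal sketch in Java. It is not our actual fetcher code and not part of the Nutch API: the names ComponentFirstQueue, FetchItem, addSeed, addComponent and totalFetches are all made up for illustration, and per-host politeness delays are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: none of these names exist in Nutch.
final class ComponentFirstQueue {

    /** A URL waiting to be fetched; isComponent marks parser-requested resources. */
    static final class FetchItem {
        final String url;
        final boolean isComponent;

        FetchItem(String url, boolean isComponent) {
            this.url = url;
            this.isComponent = isComponent;
        }
    }

    // Components requested by running parsers (served first).
    private final Deque<FetchItem> components = new ArrayDeque<>();
    // URLs submitted by the feeder (served only when no component is waiting).
    private final Deque<FetchItem> seeds = new ArrayDeque<>();

    void addSeed(String url) {
        seeds.addLast(new FetchItem(url, false));
    }

    void addComponent(String url) {
        components.addLast(new FetchItem(url, true));
    }

    /** Returns the next item to fetch, components first; null if nothing is queued. */
    FetchItem next() {
        FetchItem item = components.pollFirst();
        return item != null ? item : seeds.pollFirst();
    }

    boolean isEmpty() {
        return components.isEmpty() && seeds.isEmpty();
    }

    /** Total fetches if every seed needs the same number of components. */
    static int totalFetches(int seedCount, int componentsPerSeed) {
        return seedCount + seedCount * componentsPerSeed;
    }
}
```

With this sketch, ComponentFirstQueue.totalFetches(100, 10) evaluates to 1100, which matches the example above. A real implementation would also have to apply the per-host politeness rules before handing out an item, which is why a component is not always fetched immediately.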

Because of our custom version of Nutch we cannot easily upgrade to newer versions (we're still using modified fetcher classes from Nutch 0.9). We often end up fixing bugs that have already been fixed by the community. Other users might also benefit from our changes.

Therefore we propose to redesign our vertical crawling system from scratch for the newer Nutch versions, should there be any interest from the community. Perhaps we are not the only ones who have implemented such a system with Nutch. So, what are your thoughts about this?


