Hello,
We are currently using a heavily modified version of Nutch. The main
reason is that we fetch not only the urls that the QueueFeeder submits,
but also additional resources from urls that are constructed during
parsing. For example, say the QueueFeeder submits an HTML page to the
fetcher, and after the fetch the page gets parsed. Nothing special so
far. However, the parser decides it also needs some images on the page.
Perhaps these images link to other HTML pages, and we might want to
fetch these too. All of this is needed to parse information about the
particular url we started with. We call these extra fetch urls
Components, because they are additional resources required to parse the
initial HTML page that was selected for fetching.
At first we tried to solve this "vertical crawling" problem by using
multiple crawl cycles. Each crawl cycle simply selects the outlinks that
are needed for parsing the initial HTML page. A single inspection can
span 2, 3 or 4 cycles (depending on the depth of the inspection's
graph). There are several problems with this approach: first, the
crawldb gets cluttered with all these component urls, and second,
inspection completion times can be very long.
As an alternative we decided to let the parser fetch the needed
components on-the-fly, so that additional urls are instantly added to
the fetcher lists. Every fetched url is either a non-component (the
QueueFeeder fed it; start parsing this resource) or a component (the
fetcher hands the resource over to the parser that requested it). In
order to keep parsers alive we always try to fetch components first,
while still respecting fetch politeness. A downside of this solution is
that the total running time of a fetch task becomes harder to predict.
For example, if you inject and generate 100 urls and they are fetched in
a single task, you might end up fetching a total of 1100 urls (assuming
each inspection needs 10 components). We found this behaviour to be
acceptable.
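To make the "components first" idea a bit more concrete, here is a
minimal sketch in Java. It is not taken from our fetcher code and it
uses no Nutch API: the class ComponentFirstQueue, the FetchItem type and
all method names are hypothetical and exist only for illustration. Fetch
politeness (per-host delays) is deliberately left out. The
expectedFetches helper just reproduces the arithmetic above, assuming
every seed url needs the same number of components.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a components-first fetch queue; not a Nutch class. */
class ComponentFirstQueue {

    /** A url to fetch, flagged as seed (fed by the QueueFeeder) or component. */
    static final class FetchItem {
        final String url;
        final boolean component;

        FetchItem(String url, boolean component) {
            this.url = url;
            this.component = component;
        }
    }

    // Urls submitted by the feeder; fetching one starts a new parse.
    private final Deque<FetchItem> seeds = new ArrayDeque<>();
    // Urls requested by a running parser; these are served first.
    private final Deque<FetchItem> components = new ArrayDeque<>();

    void addSeed(String url) {
        seeds.addLast(new FetchItem(url, false));
    }

    void addComponent(String url) {
        components.addLast(new FetchItem(url, true));
    }

    /** Returns the next item, components before seeds; null if both are empty. */
    FetchItem next() {
        FetchItem c = components.pollFirst();
        return c != null ? c : seeds.pollFirst();
    }

    /** Total fetches if every seed needs exactly componentsPerSeed components. */
    static long expectedFetches(long seedCount, long componentsPerSeed) {
        return seedCount * (1 + componentsPerSeed);
    }
}
```

With 100 seeds and 10 components each, expectedFetches(100, 10) gives
the 1100 mentioned above. Serving the component queue first means a
parser that is already waiting is never stuck behind seed urls that
would only start more parsers.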
Because of our custom version of Nutch we cannot easily upgrade to newer
versions (we're still using modified fetcher classes from Nutch 0.9).
Often we end up fixing bugs that have already been fixed by the
community. Also, other users might benefit from our changes.
Therefore we propose to redesign our vertical crawling system from
scratch for the newer Nutch versions, should there be any interest from
the community. Perhaps we are not the only ones to have implemented such
a system with Nutch. So, what are your thoughts about this?
Ferdy.
- Component fetching during parsing. (vertical crawling) Ferdy