Re: Fetcher does no parsing by default in 1.3

Marek Bachmann Fri, 10 Jun 2011 04:33:29 -0700

I see the reason, you are right. That makes sense :)

I have already been using a script for fetching and updating. Now I have extended it to:


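#!/bin/bash

# Generate a new segment with at most 4000 URLs
./nutch generate crawl/crawldb crawl/segments -topN 4000
# Pick the newest segment directory
newSeg=$(ls -d crawl/segments/2* | tail -1)
echo "$newSeg"

# Fetch, then parse as a separate step, then update the crawldb
./nutch fetch "$newSeg"
./nutch parse "$newSeg"
./nutch updatedb crawl/crawldb "$newSeg"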

It is nearly the same as on http://wiki.apache.org/nutch/NutchTutorial but extended with the parse command between fetch and update.
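
Since the point of the separate parse step is to see which stage failed, here is a rough sketch of the same round that stops at the first failing command and names the stage. It only uses the four nutch commands from the script above. The function names (newest_segment, crawl_round) and the variables NUTCH, CRAWLDB, SEGMENTS and TOPN are my own illustrative choices, not Nutch settings, and the sketch assumes that each nutch command exits non-zero on failure.

```shell
#!/bin/bash
# Sketch only: one generate/fetch/parse/updatedb round that stops at the
# first failing stage. All variable and function names are illustrative.

NUTCH="${NUTCH:-./nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-4000}"

# Print the newest segment directory. Segment names are timestamps, so the
# last one in sorted order is the most recent.
newest_segment() {
  ls -d "$SEGMENTS"/2* | tail -1
}

# Run one round. The return code tells which stage failed:
# 1 = generate (or no segment found), 2 = fetch, 3 = parse, 4 = updatedb.
crawl_round() {
  local seg
  "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" \
    || { echo "generate failed" >&2; return 1; }
  seg=$(newest_segment)
  [ -n "$seg" ] || { echo "no segment found" >&2; return 1; }
  echo "$seg"
  "$NUTCH" fetch "$seg" || { echo "fetch failed" >&2; return 2; }
  "$NUTCH" parse "$seg" || { echo "parse failed" >&2; return 3; }
  "$NUTCH" updatedb "$CRAWLDB" "$seg" || { echo "updatedb failed" >&2; return 4; }
}
```

Calling crawl_round at the end of the script runs one round. If the fetch fails, the parse and updatedb steps are skipped, so the crawldb is not updated from a half-finished segment.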

BTW: It's very nice that in v 1.3 you get informed about the map reduce progress by default. Can you tell me where I can adjust this output behaviour, anyway? :)


On 10.06.2011 12:09, lewis john mcgibbney wrote:

Hi Marek,

One reason for separating the fetching and parsing stages is that if an error
occurred during a fetch that also did the parsing, the error would be
inherently harder to track down and resolve. It could also mean that crawl
data collected during the fetch is lost or damaged.

On the other hand, if we parse the fetched data only after the fetch (run
without parsing) has completed and we then encounter an error, we can assume
that the error lies in the parsing stage and not in the fetching.

I am not sure if there is a way to change this back without hacking some of
your own code... maybe the best way is to use a reliable script.

On Fri, Jun 10, 2011 at 11:01 AM, Marek Bachmann
<[email protected]> wrote:

... and I wonder if there is a way to change this behaviour back to let the
fetcher start the parsing.

The syntax help of the command hasn't been updated it seems:

  root@hrz-vm180:/home/nutchServer/nutch/runtime/local/bin# ./nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]
