I see the reason, you are right. That makes sense :)
I was already using a script for fetching and updating. I have now
extended it to:
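#!/bin/bash
# generate a fetch list with the top 4000 URLs
./nutch generate crawl/crawldb crawl/segments -topN 4000
# pick the newest segment (segment names are timestamps)
newSeg=$(ls -d crawl/segments/2* | tail -1)
echo "$newSeg"
./nutch fetch "$newSeg"
./nutch parse "$newSeg"
./nutch updatedb crawl/crawldb "$newSeg"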
It is nearly the same as the one on http://wiki.apache.org/nutch/NutchTutorial
but with the parse command added between fetch and updatedb.
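
A slightly more defensive variant of this round could look like the sketch below. It is only an illustration, not something taken from the Nutch distribution: the variable names (NUTCH, CRAWLDB, SEGMENTS, TOPN) and the helper functions are made up for this example. It stops at the first failing step, so a broken parse never reaches updatedb, and it refuses to continue if generate did not create a new segment.

```shell
#!/bin/bash
# Sketch: one generate/fetch/parse/updatedb round that stops at the first failure.
# NUTCH, CRAWLDB, SEGMENTS and TOPN are illustrative names, not Nutch settings.
NUTCH="${NUTCH:-./nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-4000}"

# Run one named step and say which one failed.
run_step() {
  local name="$1"; shift
  echo "== $name"
  if ! "$@"; then
    echo "step '$name' failed" >&2
    return 1
  fi
}

# Print the newest segment directory (names are timestamps, so they sort by age).
newest_segment() {
  ls -d "$SEGMENTS"/2* 2>/dev/null | tail -1
}

crawl_round() {
  local before seg
  before="$(newest_segment)"
  run_step generate "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" || return 1
  seg="$(newest_segment)"
  # Do not re-fetch an old segment if generate produced nothing new.
  if [ -z "$seg" ] || [ "$seg" = "$before" ]; then
    echo "generate did not create a new segment" >&2
    return 1
  fi
  echo "segment: $seg"
  run_step fetch    "$NUTCH" fetch "$seg"               || return 1
  run_step parse    "$NUTCH" parse "$seg"               || return 1
  run_step updatedb "$NUTCH" updatedb "$CRAWLDB" "$seg" || return 1
}
```

To use it, add a call to crawl_round at the end of the script (for example `crawl_round || exit 1`) and run it from the directory that contains the nutch launcher.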
BTW: It's very nice that v1.3 reports the MapReduce progress by
default. Can you tell me where I can adjust this output behaviour,
though? :)
On 10.06.2011 12:09, lewis john mcgibbney wrote:
Hi Marek,
One reason for this is that when fetching and parsing run as a single step,
an error during the fetch (which also does the parsing) is inherently harder
to track down and resolve. It could also mean that crawl data collected
during the fetch is lost or damaged in the process.
On the other hand, if we parse the fetched data only after the fetch (run
without parsing) has completed and we then encounter an error, we can assume
that the error is somewhere within the parsing stage and not the fetching.
I am not sure if there is a way to change this back without hacking some
code of your own... maybe the best way is to use a reliable script.
On Fri, Jun 10, 2011 at 11:01 AM, Marek Bachmann
<[email protected]> wrote:
... and I wonder if there is a way to change this behaviour back to let the
fetcher start the parsing.
It seems the usage help of the command hasn't been updated:
root@hrz-vm180:/home/nutchServer/nutch/runtime/local/bin# ./nutch fetch
Usage: Fetcher <segment> [-threads n] [-noParsing]