Hi Manoharam,
You can use the parse command to parse a segment after it has been fetched
with the -noParsing option. The result will be equivalent to running fetch
without the -noParsing option.
In your Nutch installation directory, run bin/nutch with no arguments. It
will print the usage for the parse command.
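A minimal sketch of that two-step flow, reusing the segment variable and flags from the loop quoted below (paths are illustrative; this assumes a working Nutch installation in the current directory):

```shell
# Pick the newest segment, fetch it without parsing, then parse it
# in a separate pass. The segment should end up in the same state as
# if fetch had been run without -noParsing.
seg1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $seg1 -threads 50 -noParsing
bin/nutch parse $seg1
```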
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Thanks.
I do my crawl using the Intranet Recrawl script available in the wiki.
I have put these statements in a loop iterating 10 times.
1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
2. seg1=`ls -d crawl/segments/* | tail -1`
3. bin/nutch fetch $seg1 -threads 50
4. bin/nutch
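Step 2 above works because Nutch names segment directories by creation timestamp, so the lexicographically last entry from ls is the most recently generated segment. A self-contained demonstration with made-up segment names:

```shell
# Segment directories are named by timestamp, so sorting them
# lexicographically (ls's default order) puts the newest one last.
mkdir -p demo/segments/20070530120000 demo/segments/20070531093000
seg1=`ls -d demo/segments/* | tail -1`
echo $seg1   # prints demo/segments/20070531093000
```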
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
Time and again I get this error and as a result the segment remains
incomplete. This wastes one iteration of the for() loop in which I am
doing generate, fetch and update.
Can someone please tell me what are the measures I can take to avoid
this error? And isn't it possible to make some code
If I run the fetcher in non-parsing mode, how can I later parse the
pages so that, when a user searches in the Nutch search engine, the
content of PDF files, etc. shows up in the summary? Please help, or
point me to the proper articles or wiki pages where I can learn this.
On 5/30/07, Doğacan Güney [EMAIL