On 5/31/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> Thanks.
>
> I do my crawl using the Intranet Recrawl script available in the wiki.
> I have put these statements in a loop iterating 10 times.
>
> 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> 2. seg1=`ls -d crawl/segments/* | tail -1`
> 3. bin/nutch fetch $seg1 -threads 50

this will be bin/nutch fetch $seg1 -threads 50 -noParsing

> 4. bin/nutch updatedb crawl/crawldb $seg1
>
> So, to fetch without parsing I need to modify statement 3 to:
> bin/nutch fetch $seg1 -threads 50 -noParsing
>
> Now, where do I put this statement:
> bin/nutch parse $seg1
> In between statement 3 and statement 4?

yes.
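Putting the pieces together, the loop could be sketched like this (a sketch only, assuming it is run from the Nutch installation directory and using the -topN value and thread count from the script above):

```shell
#!/bin/sh
# Recrawl loop: generate a segment, fetch it without parsing,
# parse it as a separate step, then update the crawldb.
for i in 1 2 3 4 5 6 7 8 9 10
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # Pick the segment that generate just created (the newest one).
  seg1=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $seg1 -threads 50 -noParsing
  bin/nutch parse $seg1
  bin/nutch updatedb crawl/crawldb $seg1
done
```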
On 5/31/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> Hi Manoharam,
>
> You can use the parse command to parse a segment after it is fetched with
> the -noParsing option. The result will be equivalent to running fetch
> without the -noParsing option.
>
> In your nutch installation directory, try the command bin/nutch. It will
> give you the usage for the parse command.
>
> Regards,
>
> -vishal.
>
> -----Original Message-----
> From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 31, 2007 11:24 AM
> To: [email protected]
> Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?
>
> If I run the fetcher in non-parsing mode, how can I later parse the pages
> so that, when a user searches in the Nutch search engine, he can see the
> content of PDF files, etc. in the summary? Please help, or point me to
> articles or wiki pages where I can learn this.
>
> On 5/30/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> > > Time and again I get this error and as a result the segment remains
> > > incomplete. This wastes one iteration of the for() loop in which I am
> > > doing generate, fetch and update.
> > >
> > > Can someone please tell me what measures I can take to avoid
> > > this error? And isn't it possible to make some code changes so that
> > > the whole fetch doesn't have to stop suddenly when this error occurs?
> > > Can't we do something in the code so that the fetch still continues,
> > > as in the case of SocketException, where the fetch while(1) loop
> > > continues?
> > >
> > > If it is not possible, please tell me how can I prevent this error
> > > from happening?
> >
> > Are you also parsing during fetch? If you are, I would suggest running
> > Fetcher in non-parsing mode.
> >
> > >
> > > ----- ERROR -----
> > >
> > > fetch of http://telephony/register.asp failed with:
> > > java.lang.OutOfMemoryError: Java heap space
> > > java.lang.NullPointerException
> > > at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> > > at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> > > ......
> > > at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > > fetcher caught:java.lang.NullPointerException
> > > java.lang.NullPointerException
> > > at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> > > at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> > > .......
> > > at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > > fetcher caught:java.lang.NullPointerException
> > > Fetcher: java.io.IOException: Job failed!
> > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> > > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
> > > at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
> > > at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> > > at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
> > >
> >
> >
> > --
> > Doğacan Güney
> >
>
>
--
Doğacan Güney