On 5/31/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> Thanks.
>
> I do my crawl using the Intranet Recrawl script available in the wiki.
> I have put these statements in a loop iterating 10 times.
>
> 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> 2. seg1=`ls -d crawl/segments/* | tail -1`
> 3. bin/nutch fetch $seg1 -threads 50

This will be bin/nutch fetch $seg1 -threads 50 -noParsing

> 4. bin/nutch updatedb crawl/crawldb $seg1
>
> So, to fetch without parsing I need to modify the statement 3 to:-
>
> bin/nutch fetch $seg1 -threads 50 -noParsing.
>
> Now where do I put this statement:-
>
> bin/nutch parse $seg1
>
> In between statement 3 and statement 4?

Yes.

>
> On 5/31/07, Vishal Shah <[EMAIL PROTECTED]> wrote:
> > Hi Manoharam,
> >
> > You can use the parse command to parse a segment after it is fetched with
> > -noParsing option. The result will be equivalent to running fetch without
> > the noparsing option.
> >
> > In your nutch installation directory, try the command bin/nutch. It will
> > give you the usage for the parse command.
> >
> > Regards,
> >
> > -vishal.
> >
> > -----Original Message-----
> > From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, May 31, 2007 11:24 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?
> >
> > If I run fetcher in non-parsing mode how can I later parse the pages
> > so that ultimately when a user searches in the Nutch search engine, he
> > can see the content of PDF files, etc as summary? Please help or point
> > me to proper articles or wiki where I can learn this.
> >
> > On 5/30/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> > > On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> > > > Time and again I get this error and as a result the segment remains
> > > > incomplete. This wastes one iteration of the for() loop in which I am
> > > > doing generate, fetch and update.
> > > >
> > > > Can someone please tell me what are the measures I can take to avoid
> > > > this error? And isn't it possible to make some code changes so that
> > > > the whole fetch doesn't have to stop suddenly when this error occurs.
> > > > Can't we do something in the code so that, the fetch still continues
> > > > like in case of SocketException, in which case the fetch while(1) loop
> > > > continues.
> > > >
> > > > If it is not possible, please tell me how can I prevent this error
> > > > from happening?
> > >
> > > Are you also parsing during fetch? If you are, I would suggest running
> > > Fetcher in non-parsing mode.
> > >
> > > >
> > > > ----- ERROR -----
> > > >
> > > > fetch of http://telephony/register.asp failed with:
> > > > java.lang.OutOfMemoryError: Java heap space
> > > > java.lang.NullPointerException
> > > > at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> > > > at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> > > > ......
> > > > at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > > > fetcher caught:java.lang.NullPointerException
> > > > java.lang.NullPointerException
> > > > at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> > > > at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> > > > .......
> > > > at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > > > fetcher caught:java.lang.NullPointerException
> > > > Fetcher: java.io.IOException: Job failed!
> > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> > > > at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
> > > > at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
> > > > at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> > > > at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
> > > >
> > >
> > > --
> > > Doğacan Güney
> > >
> >
>

--
Doğacan Güney

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
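
For reference, the four steps discussed in this thread, with the separate parse step placed between fetch and updatedb, could be sketched as a small shell function. This is only a sketch under stated assumptions: the `-noParsing` flag and the `parse` command are taken from the messages above, while the names `NUTCH`, `CRAWL_DIR` and `run_round` are hypothetical, and aborting a round at the first failed step is an assumption rather than something the thread prescribes.

```shell
#!/bin/sh
# Hypothetical settings (not from the thread): path to the nutch launcher
# and the crawl directory that holds crawldb/ and segments/.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# One generate/fetch/parse/update round. Returns non-zero as soon as a
# step fails, so a broken segment is not fed into updatedb (an assumption).
run_round() {
    # 1. Generate a fetch list for the top 1000 URLs.
    "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN 1000 || return 1

    # 2. Pick the newest segment (segment names sort chronologically).
    seg=$(ls -d "$CRAWL_DIR"/segments/* | tail -1)

    # 3. Fetch only; parsing is deferred to the next step.
    "$NUTCH" fetch "$seg" -threads 50 -noParsing || return 1

    # 3a. Parse the fetched segment as a separate job.
    "$NUTCH" parse "$seg" || return 1

    # 4. Fold the results back into the crawl database.
    "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$seg" || return 1
}
```

Calling `run_round` ten times in a loop reproduces the original script. Checking its exit status in that loop lets a failed round be logged and skipped instead of silently wasting an iteration.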
