Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
Hi Manoharam,

You can use the parse command to parse a segment after it has been fetched with the -noParsing option. The result will be equivalent to running fetch without -noParsing. In your Nutch installation directory, run bin/nutch with no arguments; it will print the usage for the parse command.

Regards,

-vishal.

-----Original Message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 31, 2007 11:24 AM
To: [EMAIL PROTECTED]
Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?

If I run the fetcher in non-parsing mode, how can I later parse the pages so that, when a user searches with the Nutch search engine, the content of PDF files and other documents shows up in the summary? Please point me to articles or wiki pages where I can learn this.

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
> On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> > Time and again I get this error, and as a result the segment remains
> > incomplete. This wastes one iteration of the for() loop in which I am
> > doing generate, fetch, and update.
> >
> > Can someone please tell me what measures I can take to avoid this
> > error? And isn't it possible to change the code so that the whole
> > fetch doesn't have to stop suddenly when this error occurs? Can't we
> > make the fetch continue, as it does for a SocketException, where the
> > fetch while(1) loop keeps running? If that is not possible, please
> > tell me how I can prevent this error from happening.
>
> Are you also parsing during fetch? If you are, I would suggest running
> Fetcher in non-parsing mode.
>
> > ERROR - fetch of http://telephony/register.asp failed with:
> > java.lang.OutOfMemoryError: Java heap space
> > java.lang.NullPointerException
> >         at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> >         at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> >         ...
> >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > fetcher caught: java.lang.NullPointerException
> > java.lang.NullPointerException
> >         at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
> >         at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
> >         ...
> >         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
> > fetcher caught: java.lang.NullPointerException
> > Fetcher: java.io.IOException: Job failed!
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
> >         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
> >         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
> >         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
> --
> Doğacan Güney

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
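In command form, the two-step flow Vishal describes looks like this (a sketch; the segment path is a hypothetical placeholder for whatever directory the generate step produced):

```shell
# Fetch without parsing, then parse the same segment in a second pass.
# The parsed data ends up in the segment directory, just as it would
# after a parsing fetch.
seg=crawl/segments/20070531112400   # hypothetical segment directory
bin/nutch fetch $seg -noParsing
bin/nutch parse $seg
```

Running bin/nutch with no arguments prints the full command list, including the exact usage of parse.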
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
Thanks. I do my crawl using the Intranet Recrawl script available on the wiki. I have put these statements in a loop iterating 10 times:

1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
2. seg1=`ls -d crawl/segments/* | tail -1`
3. bin/nutch fetch $seg1 -threads 50
4. bin/nutch updatedb crawl/crawldb $seg1

So, to fetch without parsing, I need to change statement 3 to:

bin/nutch fetch $seg1 -threads 50 -noParsing

Now where do I put this statement:

bin/nutch parse $seg1

Between statement 3 and statement 4?

On 5/31/07, Vishal Shah [EMAIL PROTECTED] wrote:
> Hi Manoharam,
>
> You can use the parse command to parse a segment after it has been fetched
> with the -noParsing option. The result will be equivalent to running fetch
> without -noParsing. In your Nutch installation directory, run bin/nutch
> with no arguments; it will print the usage for the parse command.
>
> Regards,
>
> -vishal.
>
> [earlier quoted messages and stack traces trimmed; they appear in full
> above]
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> Thanks. I do my crawl using the Intranet Recrawl script available on the
> wiki. I have put these statements in a loop iterating 10 times:
>
> 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
> 2. seg1=`ls -d crawl/segments/* | tail -1`
> 3. bin/nutch fetch $seg1 -threads 50

This will be: bin/nutch fetch $seg1 -threads 50 -noParsing

> 4. bin/nutch updatedb crawl/crawldb $seg1
>
> So, to fetch without parsing, I need to change statement 3 to:
> bin/nutch fetch $seg1 -threads 50 -noParsing
>
> Now where do I put this statement:
> bin/nutch parse $seg1
>
> Between statement 3 and statement 4?

Yes.

> [earlier quoted messages and stack traces trimmed; they appear in full
> above]

--
Doğacan Güney
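Putting the thread's advice together, the revised loop would look something like the sketch below. The paths, -topN value, thread count, and iteration count are all taken from the poster's own script; only the -noParsing flag and the parse step are new:

```shell
#!/bin/sh
# Recrawl loop with parsing decoupled from fetching, so a parser
# OutOfMemoryError cannot take down the fetch itself.
# Run from the Nutch installation directory; assumes crawl/crawldb exists.
for i in 1 2 3 4 5 6 7 8 9 10
do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  seg1=`ls -d crawl/segments/* | tail -1`       # newest segment
  bin/nutch fetch $seg1 -threads 50 -noParsing  # fetch only
  bin/nutch parse $seg1                         # parse in a separate pass
  bin/nutch updatedb crawl/crawldb $seg1        # fold results into the crawldb
done
```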
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> Time and again I get this error, and as a result the segment remains
> incomplete. This wastes one iteration of the for() loop in which I am
> doing generate, fetch, and update.
>
> Can someone please tell me what measures I can take to avoid this error?
> And isn't it possible to change the code so that the whole fetch doesn't
> have to stop suddenly when this error occurs? Can't we make the fetch
> continue, as it does for a SocketException, where the fetch while(1) loop
> keeps running? If that is not possible, please tell me how I can prevent
> this error from happening.

Are you also parsing during fetch? If you are, I would suggest running
Fetcher in non-parsing mode.

> ERROR - fetch of http://telephony/register.asp failed with:
> java.lang.OutOfMemoryError: Java heap space
> [stack traces trimmed; they are quoted in full earlier in this thread]

--
Doğacan Güney
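Independent of parsing, the direct lever for an OutOfMemoryError is the fetcher JVM's heap size. A sketch, under the assumption that the stock bin/nutch launcher of this era reads a NUTCH_HEAPSIZE environment variable in megabytes (verify against your own copy of the script):

```shell
# Hypothetical mitigation: enlarge the JVM heap before fetching.
# NUTCH_HEAPSIZE is in MB; check your installation's bin/nutch script
# if the setting appears to have no effect.
NUTCH_HEAPSIZE=2000
export NUTCH_HEAPSIZE
bin/nutch fetch $seg1 -threads 50 -noParsing
```

Lowering the thread count also reduces peak memory use, at the cost of throughput.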
Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?
If I run the fetcher in non-parsing mode, how can I later parse the pages so that, when a user searches with the Nutch search engine, the content of PDF files and other documents shows up in the summary? Please point me to articles or wiki pages where I can learn this.

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
> On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
> > Time and again I get this error, and as a result the segment remains
> > incomplete. This wastes one iteration of the for() loop in which I am
> > doing generate, fetch, and update.
> >
> > [rest of the question and the stack traces trimmed; they are quoted in
> > full earlier in this thread]
>
> Are you also parsing during fetch? If you are, I would suggest running
> Fetcher in non-parsing mode.
>
> --
> Doğacan Güney