Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Vishal Shah
Hi Manoharam,

  You can use the parse command to parse a segment after it has been
fetched with the -noParsing option. The result is equivalent to running
fetch without -noParsing.

   In your Nutch installation directory, run the command bin/nutch; it
will print the usage for the parse command.
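For reference, the two-step sequence described above can be sketched as follows. The segment path is a made-up example, and the leading `echo` makes this a dry run that only prints the commands; drop it to execute for real from the Nutch install directory:

```shell
# Dry-run sketch: fetch a segment without parsing, then parse it in a
# separate step. The segment path is hypothetical; substitute your own.
seg=crawl/segments/20070531123456
echo bin/nutch fetch "$seg" -threads 50 -noParsing
echo bin/nutch parse "$seg"
```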

Regards,

-vishal.

-----Original Message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED] 
Sent: Thursday, May 31, 2007 11:24 AM
To: [EMAIL PROTECTED]
Subject: Re: OutOfMemoryError - Why should the while(1) loop stop?

If I run the fetcher in non-parsing mode, how can I later parse the
pages so that when a user searches the Nutch search engine, the content
of PDF files etc. appears in the summaries? Please point me to articles
or wiki pages where I can learn this.

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
  Time and again I get this error, and as a result the segment remains
  incomplete. This wastes one iteration of the for() loop in which I am
  doing generate, fetch and update.
 
  Can someone please tell me what measures I can take to avoid this
  error? Also, isn't it possible to change the code so that the whole
  fetch doesn't have to stop when this error occurs? Can't the fetch
  continue, as it does on a SocketException, where the while(1) loop
  keeps running?
 
  If that is not possible, please tell me how I can prevent this error
  from happening.

 Are you also parsing during fetch? If you are, I would suggest running
 Fetcher in non-parsing mode.

 
  - ERROR -
 
  fetch of http://telephony/register.asp failed with:
  java.lang.OutOfMemoryError: Java heap space
  java.lang.NullPointerException
  at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
  at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
  ..
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
  fetcher caught:java.lang.NullPointerException
  java.lang.NullPointerException
  at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
  at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
  ...
  at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
  fetcher caught:java.lang.NullPointerException
  Fetcher: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
  at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
  at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
  at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
  at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
 


 --
 Doğacan Güney



-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general


Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Manoharam Reddy
Thanks.

I do my crawl using the Intranet Recrawl script available in the wiki.
I have put these statements in a loop iterating 10 times.

 1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
 2. seg1=`ls -d crawl/segments/* | tail -1`
 3. bin/nutch fetch $seg1 -threads 50
 4. bin/nutch updatedb crawl/crawldb $seg1

So, to fetch without parsing, I need to modify statement 3 to:

bin/nutch fetch $seg1 -threads 50 -noParsing

Now where do I put this statement:

bin/nutch parse $seg1

Between statements 3 and 4?

On 5/31/07, Vishal Shah [EMAIL PROTECTED] wrote:
 Hi Manoharam,

   You can use the parse command to parse a segment after it has been
 fetched with the -noParsing option. The result is equivalent to running
 fetch without -noParsing.

    In your Nutch installation directory, run the command bin/nutch; it
 will print the usage for the parse command.

 Regards,

 -vishal.





Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-31 Thread Doğacan Güney
On 5/31/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
 Thanks.

 I do my crawl using the Intranet Recrawl script available in the wiki.
 I have put these statements in a loop iterating 10 times.

  1. bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  2. seg1=`ls -d crawl/segments/* | tail -1`
  3. bin/nutch fetch $seg1 -threads 50

This will be: bin/nutch fetch $seg1 -threads 50 -noParsing

  4. bin/nutch updatedb crawl/crawldb $seg1

 So, to fetch without parsing, I need to modify statement 3 to:

 bin/nutch fetch $seg1 -threads 50 -noParsing

 Now where do I put this statement:

 bin/nutch parse $seg1

 Between statements 3 and 4?

Yes, run parse between the fetch and updatedb steps.
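Putting the thread's advice together, the modified recrawl loop would look roughly like the sketch below. The NUTCH variable defaults to a dry-run echo here so the command sequence can be inspected without a Nutch install; set NUTCH=bin/nutch in a real installation (paths and -topN value are taken from the thread; the iteration count is shortened for illustration):

```shell
# Sketch of the modified recrawl loop: generate, fetch without parsing,
# parse the segment, then update the crawldb. NUTCH defaults to a
# harmless dry-run echo; override with NUTCH=bin/nutch to run for real.
NUTCH=${NUTCH:-echo bin/nutch}
for i in 1 2 3; do                                    # thread uses 10 iterations
  $NUTCH generate crawl/crawldb crawl/segments -topN 1000
  seg1=$(ls -d crawl/segments/* 2>/dev/null | tail -1)  # newest segment
  $NUTCH fetch "$seg1" -threads 50 -noParsing
  $NUTCH parse "$seg1"                                # parse between fetch and updatedb
  $NUTCH updatedb crawl/crawldb "$seg1"
done
```

The `ls -d ... | tail -1` idiom picks the most recently generated segment because Nutch names segments with sortable timestamps.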


 



-- 
Doğacan Güney


Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Doğacan Güney
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
 Time and again I get this error, and as a result the segment remains
 incomplete. This wastes one iteration of the for() loop in which I am
 doing generate, fetch and update.

 Can someone please tell me what measures I can take to avoid this
 error? Also, isn't it possible to change the code so that the whole
 fetch doesn't have to stop when this error occurs? Can't the fetch
 continue, as it does on a SocketException, where the while(1) loop
 keeps running?

 If that is not possible, please tell me how I can prevent this error
 from happening.

Are you also parsing during fetch? If you are, I would suggest running
Fetcher in non-parsing mode.


 - ERROR -

 fetch of http://telephony/register.asp failed with:
 java.lang.OutOfMemoryError: Java heap space
 java.lang.NullPointerException
 at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
 at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
 ..
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
 fetcher caught:java.lang.NullPointerException
 java.lang.NullPointerException
 at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:87)
 at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:125)
 ...
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:115)
 fetcher caught:java.lang.NullPointerException
 Fetcher: java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
   at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
   at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
   at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)



-- 
Doğacan Güney


Re: [Nutch-general] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Manoharam Reddy
If I run the fetcher in non-parsing mode, how can I later parse the
pages so that when a user searches the Nutch search engine, the content
of PDF files etc. appears in the summaries? Please point me to articles
or wiki pages where I can learn this.

On 5/30/07, Doğacan Güney [EMAIL PROTECTED] wrote:
 On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
  Time and again I get this error, and as a result the segment remains
  incomplete. This wastes one iteration of the for() loop in which I am
  doing generate, fetch and update.
 
  Can someone please tell me what measures I can take to avoid this
  error? Also, isn't it possible to change the code so that the whole
  fetch doesn't have to stop when this error occurs? Can't the fetch
  continue, as it does on a SocketException, where the while(1) loop
  keeps running?
 
  If that is not possible, please tell me how I can prevent this error
  from happening.

 Are you also parsing during fetch? If you are, I would suggest running
 Fetcher in non-parsing mode.

 
