help with nutch-site configuration

2013-03-03 Thread Amit Sela
My use case is crawling over ~12MM URLs with depth 1 and indexing them
with Solr.
I use Nutch 1.6 and Solr 3.6.2.
I also use the metatags plugin to fetch each URL's keywords and description.

However, I seem to have issues with fetching and indexing into Solr.
Running on a sample of ~120K URLs results in fetching about half of them
and indexing only ~20K...
After trying some configuration changes that did help (the numbers were lower
before), I'm still stuck at the figures above and not sure what to try next.

If anyone works with this use case and can help, I'd appreciate it.

These are my current configurations:

<name>http.agent.name</name>
<value>MyNutchSpider</value>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)</value>
<name>metatags.names</name>
<value>keywords;Keywords;description;Description</value>
<name>index.parse.md</name>
<value>metatag.keywords,metatag.Keywords,metatag.description,metatag.Description</value>
<name>db.update.additions.allowed</name>
<value>false</value>
<name>generate.count.mode</name>
<value>domain</value>
<name>partition.url.mode</name>
<value>byDomain</value>
<name>fetcher.queue.mode</name>
<value>byDomain</value>
<name>http.redirect.max</name>
<value>30</value>
<name>http.content.limit</name>
<value>262144</value>
<name>db.injector.update</name>
<value>true</value>
<name>parse.filter.urls</name>
<value>true</value>
<name>parse.normalize.urls</name>
<value>true</value>
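
For reference, the depth-1 crawl plus Solr indexing described above maps onto
roughly this Nutch 1.6 command sequence; the crawl directory, seed directory,
segment name and Solr URL below are placeholders, not values taken from this
thread:

  bin/nutch inject crawl/crawldb urls/
  bin/nutch generate crawl/crawldb crawl/segments
  # depth 1 = a single generate/fetch/parse/updatedb round
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/segments/<segment>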

Thanks!


Re: help with nutch-site configuration

2013-03-03 Thread kiran chitturi
Hi Amit,

I do not exactly understand your question. Do you want to know why half of
the URLs are not fetched?

You should take a look at the CrawlDb statistics (readdb -stats), take a dump
of the content, check the URLs which were not fetched, and see what the
protocolStatus of those URLs is. I previously noticed inconsistencies between
the fetch status and the protocolStatus.

AFAIK, only the successfully parsed pages are sent to Solr. If you want to
check further, you can look at the parse status in the dump and at the logs
for any parse errors.
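
A minimal sketch of those checks with the Nutch 1.x command-line tools; the
crawl/ paths and the segment name are placeholders:

  bin/nutch readdb crawl/crawldb -stats       # counts per status: db_fetched, db_unfetched, db_gone, ...
  bin/nutch readdb crawl/crawldb -dump crawldb-dump
  # plain-text dump of every CrawlDb record, including its status and metadata
  bin/nutch readseg -dump crawl/segments/<segment> seg-dump -nocontent -noparsetext
  # segment dump; the crawl_fetch entries carry the per-URL protocol status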


HTH





-- 
Kiran Chitturi


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread neeraj
Kiran,

  Were you able to resolve this issue? I am getting the same error when
fetching a huge number of URLs.

-Neeraj.





Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Sebastian Nagel
Hi Kiran,

there are many possible reasons for the problem. Besides the limit on the
number of processes, there is the thread stack size, in both the Java VM and
the system (see java -Xss and ulimit -s).

I think in local mode there should be only one mapper and consequently only
one thread spent on parsing. So the number of processes/threads is hardly the
problem, provided that you don't run any other number-crunching tasks in
parallel on your desktop.

Luckily, you should be able to retry via bin/nutch parse ...
Then trace the system and the Java process to catch the reason.
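
A quick way to check those limits and retry only the failed parse step; the
segment path is a placeholder, and NUTCH_HEAPSIZE is the heap setting (in MB)
honoured by the stock bin/nutch script:

  ulimit -u                                  # max user processes (threads count against this limit)
  ulimit -s                                  # per-thread stack size
  export NUTCH_HEAPSIZE=2000                 # larger heap for the local JVM before retrying
  bin/nutch parse crawl/segments/<segment>
  jstack <pid-of-the-nutch-java-process>     # inspect live threads while the parse runs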

Sebastian

On 03/02/2013 08:13 PM, kiran chitturi wrote:
 Sorry, I am looking to crawl 400k documents with this crawl. I said 400 in
 my last message.
 
 
 On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi chitturikira...@gmail.com wrote:
 
 Hi!

 I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz.

 Last night I started a crawl in local mode for 5 seeds with the config
 given below. If the crawl goes well, it should fetch a total of 400
 documents. The crawling is done on a single host that we own.

 Config
 -

 fetcher.threads.per.queue - 2
 fetcher.server.delay - 1
 fetcher.throughput.threshold.pages - -1

 crawl script settings
 
 timeLimitFetch- 30
 numThreads - 5
 topN - 1
 mapred.child.java.opts=-Xmx1000m


 I noticed today that the crawl has stopped due to an error, and I found
 the error below in the logs.

 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
 http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:658)
 at
 java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
 at
 java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
 at
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
 at
 java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
 at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
 at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
 at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)



 Did anyone run into the same issue? I am not sure why the new native
 thread cannot be created. The link here [0] says that it might be due to
 the limit on the number of processes in my OS. Will increasing it solve
 the issue?


 [0] - http://ww2.cs.fsu.edu/~czhang/errors.html

 Thanks!

 --
 Kiran Chitturi

 
 
 



Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread kiran chitturi
Thanks, Sebastian, for the suggestions. I got around this by using a lower
value for topN (2000) than 1. I decided to use a lower topN with more
rounds.
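
For reference, one capped round of that loop looks roughly like this; the
paths are placeholders and the topN value is just the one mentioned above:

  bin/nutch generate crawl/crawldb crawl/segments -topN 2000
  bin/nutch fetch crawl/segments/<newest-segment>
  bin/nutch parse crawl/segments/<newest-segment>
  bin/nutch updatedb crawl/crawldb crawl/segments/<newest-segment>
  # repeat for as many rounds as needed to work through the CrawlDb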






-- 
Kiran Chitturi


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Sebastian Nagel
 using low value for topN(2000) than 1
That would mean: you need 200 rounds and also 200 segments for 400k documents.
That's a work-around, not a solution!

If you find the time you should trace the process.
Seems to be either a misconfiguration or even a bug.

Sebastian

 



Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread kiran chitturi
 If you find the time you should trace the process.
 Seems to be either a misconfiguration or even a bug.

I will try to track this down soon with the previous configuration. Right
now, I am just trying to get the data crawled by Monday.

Kiran.






-- 
Kiran Chitturi


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Tejas Patil
I agree with Sebastian. It was a crawl in local mode and not over a
cluster. The intended crawl volume is huge, and if we don't override the
default heap size with some decent value, there is a high possibility of
facing an OOM.





RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread Markus Jelsma
The default heap size of 1G is just enough for a parsing fetcher with 10
threads. The only problem that may arise is very large or complicated PDF files
or very large HTML files. If you generate fetch lists of a reasonable size
there won't be a problem most of the time. And if you want to crawl a lot, then
just generate more small segments.

If there is a bug, it's most likely the parser eating memory and not
releasing it.
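
A sketch of where the heap can be raised for local-mode runs, assuming the
stock bin/nutch script; the values below are only examples:

  export NUTCH_HEAPSIZE=2000      # heap (in MB) for the local Nutch/Hadoop JVM; the default is 1000
  # per-task heap for MapReduce children, e.g. in nutch-site.xml or the crawl script:
  #   mapred.child.java.opts=-Xmx2000m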
 
 


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

2013-03-03 Thread kiran chitturi
Thanks for your suggestions, guys! The big crawl is fetching a large number of
big PDF files.

In the case below, the fetcher took a long time to finish up even though all
files had already been fetched: more than an hour passed between the last fetch
activity and the "finished" message.


 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0
 2013-03-01 20:57:55,288 INFO  fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09


Does fetching a lot of files cause this issue? Should I stick to one thread in
local mode, or use pseudo-distributed mode to improve performance?

What is an acceptable time for the fetcher to finish up after fetching the
files? What exactly happens in this step?

Thanks again!
Kiran.




Re: nutch with cassandra internal network usage

2013-03-03 Thread Roland

Hi all,

I've read the sources ;)
(no, not really all, but enough, I hope)

So, the major difference between the generator and the fetcher is the set of
fields each one loads from the db.
As I had fetcher.store.content=true in the beginning, there was a lot of data
in the content fields.
I run with fetcher.parse=true, and that's why it loads all content during
start-up of the FetcherJob.


I did this in my local 2.1 sources:
Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===
--- src/java/org/apache/nutch/fetcher/FetcherJob.java   (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java   (working copy)
@@ -140,6 +140,8 @@
 if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
   ParserJob parserJob = new ParserJob();
   fields.addAll(parserJob.getFields(job));
+  fields.remove(WebPage.Field.CONTENT); // FIXME
+  fields.remove(WebPage.Field.OUTLINKS); // FIXME
 }
 ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());

 fields.addAll(protocolFactory.getFields());

and now the start-up time of a FetcherJob is about 10 minutes :)

--Roland


On 22.02.2013 10:28, Roland wrote:

Hi Julien,

ok, so thanks for the clarification, I think I have to read the 
sources :)


--Roland

On 22.02.2013 10:10, Julien Nioche wrote:

Hi Roland

My previous email should have started with "The point Alex is making is ..."
and not just "The point is ...".

I don't have an explanation as to why the generator is faster than the
fetcher, as I don't use 2.x at all, but it would definitely be interesting to
find out. The behaviour of the fetcher is how I expect GORA to behave in its
current form, i.e. pull everything - filter - process.

Julien


On 21 February 2013 16:58, Roland rol...@rvh-gmbh.de wrote:


Hi Julien,

the point I personally don't get is: why is generating fast but fetching not?
If it's possible to filter the GeneratorJob at the backend (which I think it
does), shouldn't it be possible to do the same for the fetcher?

--Roland

On 21.02.2013 12:27, Julien Nioche wrote:

  Lewis,

The point is whether the filtering is done on the backend side (e.g. using
queries, indices, etc.) and then passed on to MapReduce via GORA, or, as I
assume from looking at the code, filtered within MapReduce, which means that
all the entries are pulled from the backend anyway.
This makes quite a difference in terms of performance if you think, e.g.,
about a large webtable which would have to be passed entirely to MapReduce
even if only a handful of entries are to be processed.

Makes sense?

Julien


On 21 February 2013 01:52, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

  Those filters are applied only to URLs which have a null GENERATE_MARK,
e.g.

  if (Mark.GENERATE_MARK.checkMark(page) != null) {
    if (GeneratorJob.LOG.isDebugEnabled()) {
      GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
    }
    return;

Therefore filters will be applied to all URLs which have a null
GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM, alx...@aim.com wrote:

  Hi,
Are those filters put on all data selected from hbase, or sent to hbase as
filters to select a subset of all hbase records?

Thanks.
Alex.







-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: user user@nutch.apache.org
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage


Hi Alex,

On Wed, Feb 20, 2013 at 11:54 AM, alx...@aim.com wrote:

  The generator also does not have filters. Its mapper goes over all records
as far as I know. If you use hadoop you can see how many records go as input
to the mappers. Also see this

  I don't think this is true. The GeneratorMapper filters URLs before
selecting them for inclusion, based on the following:
- distance
- URLNormalizer(s)
- URLFilter(s)
in that order.
I am going to start a new thread on improvements to the GeneratorJob
regarding better configuration as it is a crucial stage in the crawl
process.

So the issue here, as you correctly explain, is with the Fetcher obtaining
the URLs which have been marked with a desired batchId. This would be done
via a scanner in Gora.





--
Lewis