Re: nutch with cassandra internal network usage
Hi all,

I've read the sources ;) (no, not really all, but enough, I hope). The major difference between the generator and the fetcher is the set of fields each loads from the db. Since I had fetcher.store.content=true in the beginning, there was a lot of data in the content fields. I run with fetcher.parse=true, and that's why it loads all content during start-up of the FetcherJob. I made this change in my local 2.1 sources:

Index: src/java/org/apache/nutch/fetcher/FetcherJob.java
===
--- src/java/org/apache/nutch/fetcher/FetcherJob.java (revision 1448112)
+++ src/java/org/apache/nutch/fetcher/FetcherJob.java (working copy)
@@ -140,6 +140,8 @@
     if (job.getConfiguration().getBoolean(PARSE_KEY, false)) {
       ParserJob parserJob = new ParserJob();
       fields.addAll(parserJob.getFields(job));
+      fields.remove(WebPage.Field.CONTENT);  // FIXME
+      fields.remove(WebPage.Field.OUTLINKS); // FIXME
     }
     ProtocolFactory protocolFactory = new ProtocolFactory(job.getConfiguration());
     fields.addAll(protocolFactory.getFields());

and now the start-up time of a FetcherJob is about 10 minutes :)

--Roland

On 22.02.2013 10:28, Roland wrote:
Hi Julien,
ok, so thanks for the clarification, I think I have to read the sources :)
--Roland

On 22.02.2013 10:10, Julien Nioche wrote:
Hi Roland
My previous email should have started with "The point Alex is making is ..." and not just "The point is ...". I don't have an explanation as to why the generator is faster than the fetcher, as I don't use 2.x at all, but it would definitely be interesting to find out. The behaviour of the fetcher is how I expect GORA to behave in its current form, i.e. pull everything - filter - process.
Julien

On 21 February 2013 16:58, Roland wrote:
Hi Julien,
the point I personally don't get is: why is generating fast but fetching not? If it's possible to filter the GeneratorJob at the backend (which I think it does), shouldn't it be possible to do the same for the fetcher?
--Roland

On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,
The point is whether the filtering is done on the backend side (e.g. using queries, indices, etc.) and then passed on to MapReduce via GORA, or, as I assume by looking at the code, filtered within MapReduce, which means that all the entries are pulled from the backend anyway. This makes quite a difference in terms of performance if you think e.g. about a large webtable which would have to be passed entirely to MapReduce even if only a handful of entries are to be processed. Makes sense?
Julien

On 21 February 2013 01:52, Lewis John Mcgibbney wrote:
Those filters are applied only to URLs which do not have a null GENERATE_MARK, e.g.

if (Mark.GENERATE_MARK.checkMark(page) != null) {
  if (GeneratorJob.LOG.isDebugEnabled()) {
    GeneratorJob.LOG.debug("Skipping " + url + "; already generated");
  }
  return;

Therefore filters will be applied to all URLs which have a null GENERATE_MARK value.

On Wed, Feb 20, 2013 at 2:45 PM, wrote:
Hi,
Are those filters put on all data selected from hbase, or sent to hbase as filters to select a subset of all hbase records?
Thanks.
Alex.

-Original Message-
From: Lewis John Mcgibbney
To: user
Sent: Wed, Feb 20, 2013 12:56 pm
Subject: Re: nutch with cassandra internal network usage

Hi Alex,
On Wed, Feb 20, 2013 at 11:54 AM, wrote:
The generator also does not have filters. Its mapper goes over all records as far as I know. If you use hadoop you can see how many records go as input to mappers. Also see this

I don't think this is true.
The GeneratorMapper filters URLs before selecting them for inclusion, in this order: distance, URLNormalizer(s), URLFilter(s). I am going to start a new thread on improvements to the GeneratorJob regarding better configuration, as it is a crucial stage in the crawl process. So the issue here, as you correctly explain, is with the Fetcher obtaining the URLs which have been marked with a desired batchId. This would be done via a scanner in Gora.
--
Lewis
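A minimal sketch of the mechanism discussed in this thread, assuming Nutch 2.x's generated WebPage Avro bean and Gora's Query.setFields(): the field set a job declares becomes a column projection on the backend query, which is why dropping WebPage.Field.CONTENT and OUTLINKS (as in Roland's patch) keeps those heavy columns from ever leaving Cassandra/HBase. The class and helper names here are hypothetical, not actual Nutch code.

    import java.util.Collection;
    import java.util.HashSet;

    import org.apache.gora.query.Query;
    import org.apache.gora.store.DataStore;
    import org.apache.nutch.storage.WebPage;

    public class FieldProjectionSketch {

      // Collect only the fields the job really needs, leaving out the heavy ones.
      static Collection<WebPage.Field> requiredFields() {
        Collection<WebPage.Field> fields = new HashSet<WebPage.Field>();
        fields.add(WebPage.Field.STATUS);
        fields.add(WebPage.Field.FETCH_TIME);
        fields.add(WebPage.Field.MARKERS);
        // The point of Roland's patch: keep CONTENT and OUTLINKS out of this set
        // so the backend never ships those columns to the mapper.
        return fields;
      }

      // Apply the field set as a projection on the Gora query.
      static Query<String, WebPage> buildQuery(DataStore<String, WebPage> store) {
        Query<String, WebPage> query = store.newQuery();
        Collection<WebPage.Field> fields = requiredFields();
        String[] names = new String[fields.size()];
        int i = 0;
        for (WebPage.Field f : fields) {
          names[i++] = f.getName();
        }
        query.setFields(names); // only these columns are read from the backend
        return query;
      }
    }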
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks for your suggestions, guys! The big crawl is fetching a large number of big PDF files. For something like the log below, the fetcher took a long time to finish up even though the files were already fetched; it shows more than one hour of idle time:

2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0
2013-03-01 20:57:55,288 INFO fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09

Does fetching a lot of files cause this issue? Should I stick to one thread in local mode or use pseudo-distributed mode to improve performance? What is an acceptable time for the fetcher to finish up after fetching the files? What exactly happens in this step?

Thanks again!
Kiran.

On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma wrote:
> The default heap size of 1G is just enough for a parsing fetcher with 10
> threads. The only problem that may arise is too large and complicated PDF
> files or very large HTML files. If you generate fetch lists of a reasonable
> size there won't be a problem most of the time. And if you want to crawl a
> lot, then just generate more small segments.
>
> If there is a bug it's most likely to be the parser eating memory and not
> releasing it.
RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
The default heap size of 1G is just enough for a parsing fetcher with 10 threads. The only problem that may arise is too large and complicated PDF files or very large HTML files. If you generate fetch lists of a reasonable size there won't be a problem most of the time. And if you want to crawl a lot, then just generate more small segments.

If there is a bug it's most likely to be the parser eating memory and not releasing it.

-Original message-
> From: Tejas Patil
> Sent: Sun 03-Mar-2013 22:19
> To: user@nutch.apache.org
> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
>
> I agree with Sebastian. It was a crawl in local mode and not over a
> cluster. The intended crawl volume is huge and if we don't override the
> default heap size to some decent value, there is a high possibility of
> facing an OOM.
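For reference, a hedged example of raising that default: the bin/nutch script reads the heap limit from the NUTCH_HEAPSIZE environment variable (in MB, default 1000, i.e. the 1G Markus mentions), while mapred.child.java.opts governs Hadoop child tasks in (pseudo-)distributed mode. The segment path below is hypothetical.

    # Raise the heap for a local-mode job (value in MB), then retry the parse step.
    export NUTCH_HEAPSIZE=4000
    bin/nutch parse crawl/segments/20130301194543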
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
I agree with Sebastian. It was a crawl in local mode and not over a cluster. The intended crawl volume is huge and if we don't override the default heap size to some decent value, there is a high possibility of facing an OOM.

On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi wrote:
> > If you find the time you should trace the process.
> > Seems to be either a misconfiguration or even a bug.
>
> I will try to track this down soon with the previous configuration. Right
> now, I am just trying to get data crawled by Monday.
>
> Kiran.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> If you find the time you should trace the process.
> Seems to be either a misconfiguration or even a bug.

I will try to track this down soon with the previous configuration. Right now, I am just trying to get data crawled by Monday.

Kiran.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
> using a low value for topN (2000) than 1

That would mean: you need 200 rounds and also 200 segments for 400k documents. That's a work-around, not a solution! If you find the time you should trace the process. Seems to be either a misconfiguration or even a bug.

Sebastian

On 03/03/2013 09:45 PM, kiran chitturi wrote:
> Thanks Sebastian for the suggestions. I came over this by using a low value
> for topN (2000) than 1. I decided to use a lower value for topN with more
> rounds.
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Thanks Sebastian for the suggestions. I came over this by using a low value for topN (2000) than 1. I decided to use a lower value for topN with more rounds.

On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel wrote:
> Hi Kiran,
>
> there are many possible reasons for the problem. Besides the limits on the
> number of processes, there are the stack size limits in the Java VM and the
> system (see java -Xss and ulimit -s).
>
> I think in local mode there should be only one mapper and consequently only
> one thread spent for parsing. So the number of processes/threads is hardly
> the problem, provided that you don't run any other number-crunching tasks
> in parallel on your desktop.
>
> Luckily, you should be able to retry via "bin/nutch parse ..."
> Then trace the system and the Java process to catch the reason.
>
> Sebastian

--
Kiran Chitturi
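A hedged sketch of the work-around described above (paths hypothetical): capping each round with -topN keeps every segment, and thus every parse run, small. As Sebastian notes, this sidesteps the symptom rather than fixing the cause.

    # Generate a fetch list of at most 2000 URLs per round.
    bin/nutch generate crawl/crawldb crawl/segments -topN 2000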
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Hi Kiran,

there are many possible reasons for the problem. Besides the limits on the number of processes, there are the stack size limits in the Java VM and the system (see java -Xss and ulimit -s).

I think in local mode there should be only one mapper and consequently only one thread spent for parsing. So the number of processes/threads is hardly the problem, provided that you don't run any other number-crunching tasks in parallel on your desktop.

Luckily, you should be able to retry via "bin/nutch parse ..."
Then trace the system and the Java process to catch the reason.

Sebastian

On 03/02/2013 08:13 PM, kiran chitturi wrote:
> Sorry, I am looking to crawl 400k documents with the crawl. I said 400 in
> my last message.
>
> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <chitturikira...@gmail.com> wrote:
>
>> Hi!
>>
>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with Core i5 2.8GHz.
>>
>> Last night I started a crawl in local mode for 5 seeds with the config
>> given below. If the crawl goes well, it should fetch a total of 400
>> documents. The crawling is done on a single host that we own.
>>
>> Config
>> ------
>>
>> fetcher.threads.per.queue - 2
>> fetcher.server.delay - 1
>> fetcher.throughput.threshold.pages - -1
>>
>> crawl script settings
>>
>> timeLimitFetch - 30
>> numThreads - 5
>> topN - 1
>> mapred.child.java.opts=-Xmx1000m
>>
>> I have noticed today that the crawl has stopped due to an error and I have
>> found the error below in the logs.
>>
>> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms):
>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
>>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>         at java.lang.Thread.start0(Native Method)
>>>         at java.lang.Thread.start(Thread.java:658)
>>>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>>>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>>>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>>>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
>>> (END)
>>
>> Did anyone run into the same issue? I am not sure why the new native
>> thread is not being created. The link here [0] says that it might be due
>> to the limit on the number of processes in my OS. Will increasing it
>> solve the issue?
>>
>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
>>
>> Thanks!
>>
>> --
>> Kiran Chitturi
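To make the failure mode above concrete, here is a hedged, purely illustrative Java demo (not Nutch code) that triggers the same error: it keeps starting parked threads until the OS refuses to create another native thread. The count it prints depends on ulimit -u, ulimit -s, and -Xss, per Sebastian's pointers; run it with care, since it briefly exhausts the per-user thread limit.

    import java.util.concurrent.CountDownLatch;

    public class ThreadLimitDemo {
      public static void main(String[] args) {
        final CountDownLatch never = new CountDownLatch(1);
        int count = 0;
        try {
          while (true) {
            Thread t = new Thread(new Runnable() {
              public void run() {
                try {
                  never.await(); // park the thread so it stays alive
                } catch (InterruptedException ignored) {
                }
              }
            });
            t.start();
            count++;
          }
        } catch (OutOfMemoryError e) {
          // The same error seen in ParseUtil: the OS, not the Java heap, ran out.
          System.out.println("Failed after " + count + " threads: " + e.getMessage());
        }
      }
    }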
Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
Kiran,

Were you able to resolve this issue? I am getting the same error when fetching a huge number of URLs.

-Neeraj.
Re: help with nutch-site configuration
Hi Amit,

I do not exactly understand your question. Do you want to know why half of the URLs are not fetched? You need to take a look at the statistics (readdb -stats) and take a dump of the content, then check the URLs which were not fetched and see what the protocolStatus of those URLs is. I previously noticed inconsistency between fetchStatus and protocolStatus.

AFAIK, the successfully parsed pages are sent to Solr. If you want to check more, you can check the parse status in the dump and the logs for any parse errors.

HTH

On Sun, Mar 3, 2013 at 12:22 PM, Amit Sela wrote:
> My use case is crawling over ~12MM URLs with depth 1, and indexing them
> with Solr. I use Nutch 1.6 and Solr 3.6.2. I also use the metatags plugin
> to fetch each URL's keywords and description.
>
> However, I seem to have issues with fetching and indexing into Solr.
> Running on a sample of ~120K URLs results in fetching about half of them
> and indexing ~20K... After trying some configurations that did help but
> got me to the mentioned numbers (it was lower before), I'm kinda lost as
> to what's next.
>
> If anyone works with this use case and can help, I'd appreciate it.

--
Kiran Chitturi
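A hedged example of the checks suggested above, using the standard Nutch 1.x readdb tool (crawl/crawldb and dumpdir are hypothetical paths):

    bin/nutch readdb crawl/crawldb -stats        # status counts: db_fetched, db_unfetched, ...
    bin/nutch readdb crawl/crawldb -dump dumpdir # per-URL records; the datum metadata carries the protocol status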
help with nutch-site configuration
My use case is crawling over ~12MM URLs with depth 1, and indexing them with Solr. I use Nutch 1.6 and Solr 3.6.2. I also use the metatags plugin to fetch each URL's keywords and description.

However, I seem to have issues with fetching and indexing into Solr. Running on a sample of ~120K URLs results in fetching about half of them and indexing ~20K... After trying some configurations that did help but got me to the mentioned numbers (it was lower before), I'm kinda lost as to what's next.

If anyone works with this use case and can help, I'd appreciate it.

These are my current configurations:

http.agent.name = MyNutchSpider
plugin.includes = protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)
metatags.names = keywords;Keywords;description;Description
index.parse.md = metatag.keywords,metatag.Keywords,metatag.description,metatag.Description
db.update.additions.allowed = false
generate.count.mode = domain
partition.url.mode = byDomain
fetcher.queue.mode = byDomain
http.redirect.max = 30
http.content.limit = 262144
db.injector.update = true
parse.filter.urls = true
parse.normalize.urls = true

Thanks!
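These properties would normally live in conf/nutch-site.xml in the standard Hadoop property format; a hedged fragment showing two of the values above:

    <!-- conf/nutch-site.xml fragment with two of the settings listed above -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>MyNutchSpider</value>
      </property>
      <property>
        <name>http.content.limit</name>
        <value>262144</value>
      </property>
    </configuration>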