Thanks for your suggestions, guys! The big crawl is fetching a large number
of big PDF files.

In a run like the one below, the fetcher took a long time to finish up even
though the files were already fetched. It shows more than an hour of elapsed
time.

>
> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0
> 2013-03-01 20:57:55,288 INFO  fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09


Does fetching a lot of files cause this issue? Should I stick to one thread
in local mode, or use pseudo-distributed mode to improve performance?

What is an acceptable time for the fetcher to finish up after fetching the
files? What exactly happens in this step?

Thanks again!
Kiran.



On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <markus.jel...@openindex.io>wrote:

> The default heap size of 1G is just enough for a parsing fetcher with 10
> threads. The only problem that may arise is too-large and complicated PDF
> files or very large HTML files. If you generate fetch lists of a reasonable
> size there won't be a problem most of the time. And if you want to crawl a
> lot, then just generate more small segments.
>
> If there is a bug it's most likely to be the parser eating memory and not
> releasing it.
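As a concrete sketch of the advice above, assuming a Nutch 1.x checkout
(the paths, heap value, and -topN value are placeholders, not
recommendations):

```shell
# Override the 1G default heap for local-mode Nutch jobs.
# bin/nutch reads NUTCH_HEAPSIZE (in MB) in Nutch 1.x.
export NUTCH_HEAPSIZE=2000

# Generate a small fetch list instead of one huge segment;
# repeat generate/fetch/parse per segment to crawl a lot.
bin/nutch generate crawl/crawldb crawl/segments -topN 5000

# Fetch and parse the newest segment.
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT" -threads 10
bin/nutch parse "$SEGMENT"
```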
>
> -----Original message-----
> > From:Tejas Patil <tejas.patil...@gmail.com>
> > Sent: Sun 03-Mar-2013 22:19
> > To: user@nutch.apache.org
> > Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create
> > new native thread
> >
> > I agree with Sebastian. It was a crawl in local mode and not over a
> > cluster. The intended crawl volume is huge, and if we don't override the
> > default heap size to some decent value, there is a high possibility of
> > facing an OOM.
> >
> >
> > On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <
> chitturikira...@gmail.com>wrote:
> >
> > > > If you find the time you should trace the process.
> > > > Seems to be either a misconfiguration or even a bug.
> > >
> > > I will try to track this down soon with the previous configuration. Right
> > > now, I am just trying to get the data crawled by Monday.
> > >
> > > Kiran.
> > >
> > >
> > > > >> Luckily, you should be able to retry via "bin/nutch parse ..."
> > > > >> Then trace the system and the Java process to catch the reason.
> > > > >>
> > > > >> Sebastian
> > > > >>
> > > > >> On 03/02/2013 08:13 PM, kiran chitturi wrote:
> > > > >>> Sorry, I am looking to crawl 400k documents with the crawl. I said
> > > > >>> 400 in my last message.
> > > > >>>
> > > > >>>
> > > > >>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <
> > > > >> chitturikira...@gmail.com>wrote:
> > > > >>>
> > > > >>>> Hi!
> > > > >>>>
> > > > >>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a Core i5
> > > > >>>> 2.8GHz.
> > > > >>>>
> > > > >>>> Last night I started a crawl in local mode for 5 seeds with the
> > > > >>>> config given below. If the crawl goes well, it should fetch a
> > > > >>>> total of 400 documents. The crawling is done on a single host
> > > > >>>> that we own.
> > > > >>>>
> > > > >>>> Config
> > > > >>>> ---------------------
> > > > >>>>
> > > > >>>> fetcher.threads.per.queue - 2
> > > > >>>> fetcher.server.delay - 1
> > > > >>>> fetcher.throughput.threshold.pages - -1
> > > > >>>>
> > > > >>>> crawl script settings
> > > > >>>> ----------------------------
> > > > >>>> timeLimitFetch- 30
> > > > >>>> numThreads - 5
> > > > >>>> topN - 10000
> > > > >>>> mapred.child.java.opts=-Xmx1000m
> > > > >>>>
> > > > >>>>
> > > > >>>> I have noticed today that the crawl has stopped due to an error,
> > > > >>>> and I have found the error below in the logs.
> > > > >>>>
> > > > >>>>> 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms): http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> > > > >>>>> 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
> > > > >>>>> java.lang.OutOfMemoryError: unable to create new native thread
> > > > >>>>>         at java.lang.Thread.start0(Native Method)
> > > > >>>>>         at java.lang.Thread.start(Thread.java:658)
> > > > >>>>>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> > > > >>>>>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> > > > >>>>>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> > > > >>>>>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> > > > >>>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> > > > >>>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> > > > >>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> > > > >>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> > > > >>>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >>>>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> > > > >>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > > > >>>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > > > >>>>> (END)
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> Did anyone run into the same issue? I am not sure why the new
> > > > >>>> native thread is not being created. The link here [0] says that it
> > > > >>>> might be due to the limit on the number of processes in my OS. Will
> > > > >>>> increasing it solve the issue?
> > > > >>>>
> > > > >>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
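As a quick check of the process limit mentioned above (a sketch for a Unix
shell; the exact flags and sysctl keys vary by OS):

```shell
# Per-user process limit; on Linux every Java thread counts against it,
# so a low value can cause "unable to create new native thread".
ulimit -u

# On Mac OS X the corresponding kernel limits (keys are macOS-specific):
sysctl kern.maxproc kern.maxprocperuid 2>/dev/null || true

# Raising the soft limit for the current shell before starting the crawl
# (value is illustrative; the hard limit must allow it):
# ulimit -u 2048
```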
> > > > >>>>
> > > > >>>> Thanks!
> > > > >>>>
> > > > >>>> --
> > > > >>>> Kiran Chitturi
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Kiran Chitturi
> > >
> >
>



-- 
Kiran Chitturi
