Hi Kiran,

Is the 6 min gap consistent across those 5 rounds? With 10k files it takes
~60 minutes to write the segment.
With 2k files, the gap was 6 min. You would need 5 such small rounds to reach
10k in total, so the total gap time would be (5 * 6) = 30 min. That's half of
the time taken for the crawl with 10k !! So in a way, you saved 30 mins by
running small crawls. Something doesn't seem right with the math here.
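
(Quick sanity check, assuming the 6 min gap really is the same in every round:
one 10k segment -> ~60 min of segment writing, i.e. ~0.36 s/doc; five 2k
segments -> 5 * 6 = 30 min in total, i.e. ~0.18 s/doc.)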

Thanks,
Tejas Patil

On Mon, Mar 4, 2013 at 12:45 PM, kiran chitturi
<chitturikira...@gmail.com>wrote:

> Thanks, Sebastian, for the details. This was the bottleneck I had when
> fetching 10k files. Now I have switched to 2k and I have a 6 min gap. It took
> me some time to find the right configuration in local mode.
>
>
>
> On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel
> <wastl.na...@googlemail.com>wrote:
>
> > After all documents are fetched (and possibly parsed) the segment has to be
> > written: finish sorting the data and copy it from the local temp dir
> > (hadoop.tmp.dir) to the segment directory. If IO is a bottleneck this may
> > take a while. It also looks like you have a lot of content!
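> >
> > If that temp dir sits on a slow disk, it can be pointed at a faster one; a
> > minimal sketch (the path and segment name are placeholders, and this assumes
> > the fetch command accepts generic -D options; otherwise the property can be
> > set in conf/nutch-site.xml):
> >
> >   bin/nutch fetch -D hadoop.tmp.dir=/path/to/fast/tmp crawl/segments/<segment> -threads 10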
> >
> > On 03/04/2013 06:03 AM, kiran chitturi wrote:
> > > Thanks for your suggestions, guys! The big crawl is fetching a large
> > > amount of big PDF files.
> > >
> > > For something like the run below, the fetcher took a long time to finish
> > > up, even though the files were already fetched. It shows more than one
> > > hour of elapsed time.
> > >
> > >>
> > >> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0,
> > >> spinWaiting=0, fetchQueues.totalSize=0
> > >> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0
> > >> 2013-03-01 20:57:55,288 INFO  fetcher.Fetcher - Fetcher: finished at
> > >> 2013-03-01 20:57:55, elapsed: 01:34:09
> > >
> > >
> > > Does fetching a lot of files cause this issue? Should I stick to one
> > > thread in local mode or use pseudo-distributed mode to improve
> > > performance?
> > >
> > > What is an acceptable time for the fetcher to finish up after fetching
> > > the files? What exactly happens in this step?
> > >
> > > Thanks again!
> > > Kiran.
> > >
> > >
> > >
> > > On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <
> > markus.jel...@openindex.io>wrote:
> > >
> > >> The default heap size of 1G is just enough for a parsing fetcher with 10
> > >> threads. The only problem that may arise is too large and complicated PDF
> > >> files or very large HTML files. If you generate fetch lists of a
> > >> reasonable size there won't be a problem most of the time. And if you
> > >> want to crawl a lot, then just generate more small segments.
> > >>
> > >> If there is a bug it's most likely to be the parser eating memory and not
> > >> releasing it.
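> > >>
> > >> A minimal sketch of that cycle with small segments (the topN value and
> > >> the paths are only placeholders):
> > >>
> > >>   bin/nutch generate crawl/crawldb crawl/segments -topN 2000
> > >>   bin/nutch fetch crawl/segments/<segment> -threads 5
> > >>   bin/nutch parse crawl/segments/<segment>
> > >>   bin/nutch updatedb crawl/crawldb crawl/segments/<segment>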
> > >>
> > >> -----Original message-----
> > >>> From:Tejas Patil <tejas.patil...@gmail.com>
> > >>> Sent: Sun 03-Mar-2013 22:19
> > >>> To: user@nutch.apache.org
> > >>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create
> > >> new native thread
> > >>>
> > >>> I agree with Sebastian. It was a crawl in local mode and not over a
> > >>> cluster. The intended crawl volume is huge and if we don't override the
> > >>> default heap size to some decent value, there is a high possibility of
> > >>> facing an OOM.
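> > >>>
> > >>> In local mode the heap used by bin/nutch commands can be raised via the
> > >>> NUTCH_HEAPSIZE environment variable, e.g. (the 4000 MB value is only an
> > >>> example):
> > >>>
> > >>>   export NUTCH_HEAPSIZE=4000   # in MB, read by the bin/nutch script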
> > >>>
> > >>>
> > >>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <
> > >> chitturikira...@gmail.com>wrote:
> > >>>
> > >>>>> If you find the time you should trace the process.
> > >>>>> Seems to be either a misconfiguration or even a bug.
> > >>>>
> > >>>> I will try to track this down soon with the previous configuration. Right
> > >>>> now, I am just trying to get data crawled by Monday.
> > >>>>
> > >>>> Kiran.
> > >>>>
> > >>>>
> > >>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..."
> > >>>>>>> Then trace the system and the Java process to catch the reason.
> > >>>>>>>
> > >>>>>>> Sebastian
> > >>>>>>>
> > >>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
> > >>>>>>>> Sorry, I am looking to crawl 400k documents with the crawl. I said
> > >>>>>>>> 400 in my last message.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <
> > >>>>>>> chitturikira...@gmail.com>wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi!
> > >>>>>>>>>
> > >>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a 2.8 GHz Core i5.
> > >>>>>>>>>
> > >>>>>>>>> Last night I started a crawl in local mode for 5 seeds with the config
> > >>>>>>>>> given below. If the crawl goes well, it should fetch a total of 400
> > >>>>>>>>> documents. The crawling is done on a single host that we own.
> > >>>>>>>>>
> > >>>>>>>>> Config
> > >>>>>>>>> ---------------------
> > >>>>>>>>>
> > >>>>>>>>> fetcher.threads.per.queue - 2
> > >>>>>>>>> fetcher.server.delay - 1
> > >>>>>>>>> fetcher.throughput.threshold.pages - -1
> > >>>>>>>>>
> > >>>>>>>>> crawl script settings
> > >>>>>>>>> ----------------------------
> > >>>>>>>>> timeLimitFetch- 30
> > >>>>>>>>> numThreads - 5
> > >>>>>>>>> topN - 10000
> > >>>>>>>>> mapred.child.java.opts=-Xmx1000m
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I noticed today that the crawl has stopped due to an error, and I
> > >>>>>>>>> have found the error below in the logs.
> > >>>>>>>>>
> > >>>>>>>>>> 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
> > >>>>>>>>>> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> > >>>>>>>>>> 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
> > >>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
> > >>>>>>>>>>         at java.lang.Thread.start0(Native Method)
> > >>>>>>>>>>         at java.lang.Thread.start(Thread.java:658)
> > >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
> > >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
> > >>>>>>>>>>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
> > >>>>>>>>>>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
> > >>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
> > >>>>>>>>>>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
> > >>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
> > >>>>>>>>>>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
> > >>>>>>>>>>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
> > >>>>>>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
> > >>>>>>>>>>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> > >>>>>>>>>> (END)
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Did anyone run into the same issue? I am not sure why the new native
> > >>>>>>>>> thread cannot be created. The link here [0] says that it might be due
> > >>>>>>>>> to a limit on the number of processes in my OS. Will increasing it
> > >>>>>>>>> solve the issue?
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
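> > >>>>>>>>>
> > >>>>>>>>> (To check the current limits on Mac OS, something like the following
> > >>>>>>>>> should work; the sysctl names are macOS-specific and this only reads
> > >>>>>>>>> the values, it does not change them:)
> > >>>>>>>>>
> > >>>>>>>>>   ulimit -u                  # max user processes for this shell
> > >>>>>>>>>   sysctl kern.maxproc        # system-wide process limit
> > >>>>>>>>>   sysctl kern.maxprocperuid  # per-user process limit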
> > >>>>>>>>>
> > >>>>>>>>> Thanks!
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Kiran Chitturi
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Kiran Chitturi
> > >>>>
> > >>>
> > >>
> > >
> > >
> > >
> >
> >
>
>
> --
> Kiran Chitturi
>
