Hi Kiran,

Is the 6 mins consistent across those 5 rounds? With 10k files it takes ~60 minutes to write segments. With 2k files, the gap was 6 mins. You will need 5 such small rounds to reach 10k in total, so the total gap time would be (5 * 6) = 30 mins. That's half of the time taken for the 10k crawl!! So in a way, you saved 30 mins by running small crawls. Something doesn't seem right with the math here.
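Spelling the arithmetic out as a throwaway sketch (the 60 min and 6 min gaps are the figures from your mails; the round count is just 10k / 2k, everything else is assumption):

  // Segment-write gap: one 10k-file round vs. five 2k-file rounds.
  public class SegmentWriteMath {
      public static void main(String[] args) {
          int bigRoundGap = 60;        // minutes, observed for 10k files
          int smallRoundGap = 6;       // minutes, observed for 2k files
          int rounds = 10000 / 2000;   // small rounds to cover 10k files

          int totalSmall = rounds * smallRoundGap;
          System.out.println("5 small rounds: " + totalSmall + " min");
          System.out.println("1 big round:    " + bigRoundGap + " min");
          System.out.println("saved:          " + (bigRoundGap - totalSmall) + " min");
      }
  }

A constant 6 min gap per round is the assumption doing all the work there, hence my question.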
Thanks,
Tejas Patil

On Mon, Mar 4, 2013 at 12:45 PM, kiran chitturi <chitturikira...@gmail.com> wrote:

> Thanks Sebastian for the details. This was the bottleneck I had when I was
> fetching 10k files. Now I switched to 2k and I have a 6 mins gap now. It
> took me some time to find the right configuration on the local node.
>
> On Mon, Mar 4, 2013 at 3:33 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>
>> After all documents are fetched (and possibly parsed) the segment has to
>> be written: finish sorting the data and copy it from the local temp dir
>> (hadoop.tmp.dir) to the segment directory. If IO is a bottleneck this may
>> take a while. Also looks like you have a lot of content!
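Since that copy runs out of hadoop.tmp.dir, it is worth checking where it points before blaming the disk. A minimal sketch using the standard Hadoop Configuration API; the override path shown is hypothetical:

  // Print where Hadoop keeps local temp data in local mode; the
  // post-fetch segment write sorts and copies data out of this dir.
  import org.apache.hadoop.conf.Configuration;

  public class TmpDirCheck {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Defaults to /tmp/hadoop-${user.name} unless overridden in
          // core-site.xml (or nutch-site.xml for a Nutch job).
          System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
          // Hypothetical override: point it at a faster disk to speed
          // up the temp-to-segment copy.
          conf.set("hadoop.tmp.dir", "/fast-disk/hadoop-tmp");
      }
  }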
>>
>> On 03/04/2013 06:03 AM, kiran chitturi wrote:
>>> Thanks for your suggestions guys! The big crawl is fetching a large
>>> number of big PDF files.
>>>
>>> For something like below, the fetcher took a lot of time to finish up,
>>> even though the files were fetched. It shows more than one hour of time.
>>>
>>>> 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>> 2013-03-01 19:45:43,217 INFO fetcher.Fetcher - -activeThreads=0
>>>> 2013-03-01 20:57:55,288 INFO fetcher.Fetcher - Fetcher: finished at 2013-03-01 20:57:55, elapsed: 01:34:09
>>>
>>> Does fetching a lot of files cause this issue? Should I stick to one
>>> thread in local mode, or use pseudo-distributed mode to improve
>>> performance?
>>>
>>> What is an acceptable time for the fetcher to finish up after fetching
>>> the files? What exactly happens in this step?
>>>
>>> Thanks again!
>>> Kiran.
>>>
>>> On Sun, Mar 3, 2013 at 4:55 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>>
>>>> The default heap size of 1G is just enough for a parsing fetcher with 10
>>>> threads. The only problem that may arise is too large and complicated
>>>> PDF files or very large HTML files. If you generate fetch lists of a
>>>> reasonable size there won't be a problem most of the time. And if you
>>>> want to crawl a lot, then just generate more small segments.
>>>>
>>>> If there is a bug, it's most likely the parser eating memory and not
>>>> releasing it.
>>>>
>>>> -----Original message-----
>>>>> From: Tejas Patil <tejas.patil...@gmail.com>
>>>>> Sent: Sun 03-Mar-2013 22:19
>>>>> To: user@nutch.apache.org
>>>>> Subject: Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread
>>>>>
>>>>> I agree with Sebastian. It was a crawl in local mode and not over a
>>>>> cluster. The intended crawl volume is huge, and if we don't override
>>>>> the default heap size to some decent value, there is a high possibility
>>>>> of facing an OOM.
>>>>>
>>>>> On Sun, Mar 3, 2013 at 1:04 PM, kiran chitturi <chitturikira...@gmail.com> wrote:
>>>>>
>>>>>>> If you find the time you should trace the process.
>>>>>>> Seems to be either a misconfiguration or even a bug.
>>>>>>
>>>>>> I will try to track this down soon with the previous configuration.
>>>>>> Right now, I am just trying to get data crawled by Monday.
>>>>>>
>>>>>> Kiran.
>>>>>>
>>>>>>> Luckily, you should be able to retry via "bin/nutch parse ..."
>>>>>>> Then trace the system and the Java process to catch the reason.
>>>>>>>
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 03/02/2013 08:13 PM, kiran chitturi wrote:
>>>>>>>> Sorry, I am looking to crawl 400k documents with this crawl. I said
>>>>>>>> 400 in my last message.
>>>>>>>>
>>>>>>>> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <chitturikira...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi!
>>>>>>>>>
>>>>>>>>> I am running Nutch 1.6 on a 4 GB Mac OS desktop with a Core i5 2.8GHz.
>>>>>>>>>
>>>>>>>>> Last night I started a crawl in local mode for 5 seeds with the
>>>>>>>>> config given below. If the crawl goes well, it should fetch a total
>>>>>>>>> of 400 documents. The crawling is done on a single host that we own.
>>>>>>>>>
>>>>>>>>> Config
>>>>>>>>> ---------------------
>>>>>>>>> fetcher.threads.per.queue - 2
>>>>>>>>> fetcher.server.delay - 1
>>>>>>>>> fetcher.throughput.threshold.pages - -1
>>>>>>>>>
>>>>>>>>> crawl script settings
>>>>>>>>> ----------------------------
>>>>>>>>> timeLimitFetch - 30
>>>>>>>>> numThreads - 5
>>>>>>>>> topN - 10000
>>>>>>>>> mapred.child.java.opts=-Xmx1000m
>>>>>>>>>
>>>>>>>>> I noticed today that the crawl stopped due to an error, and I found
>>>>>>>>> the error below in the logs.
>>>>>>>>>
>>>>>>>>>> 2013-03-01 21:45:03,767 INFO parse.ParseSegment - Parsed (0ms): http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
>>>>>>>>>> 2013-03-01 21:45:03,790 WARN mapred.LocalJobRunner - job_local_0001
>>>>>>>>>> java.lang.OutOfMemoryError: unable to create new native thread
>>>>>>>>>>   at java.lang.Thread.start0(Native Method)
>>>>>>>>>>   at java.lang.Thread.start(Thread.java:658)
>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>>>>>>>>>>   at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>>>>>>>>>>   at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>>>>>>>>>>   at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>>>>>>>>>>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>>>>>>>>>>   at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>>>>>>>>>>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>>>>>>>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>>>>>>>>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
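That trace shows ThreadPoolExecutor trying to add a new thread on every submit, which is what a cached-style pool does when all of its workers are busy. A minimal sketch of that failure mode, as an illustration only (this is not Nutch's actual ParseUtil code): parse tasks that never return, fed to a cached pool, grow the thread count until Thread.start() fails with exactly this OOM.

  // Illustration of the failure mode in the trace above: a cached pool
  // adds one thread per task while all existing workers are busy, so
  // hung "parse" tasks pile up live threads until the OS refuses.
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  public class HungParserSimulation {
      public static void main(String[] args) {
          ExecutorService pool = Executors.newCachedThreadPool();
          for (int doc = 0; ; doc++) {
              final int id = doc;
              pool.submit(() -> {
                  // stand-in for a parser stuck on a pathological PDF
                  Thread.sleep(Long.MAX_VALUE);
                  return id;
              });
          }
      }
  }

If timed-out parser threads are never interrupted, every stuck document leaves one more live thread behind, and the per-user thread limit gives out long before the 1G heap does, which would match Markus's suspicion about the parser not releasing things.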
>>>>>>>>>
>>>>>>>>> Did anyone run into the same issue? I am not sure why the new native
>>>>>>>>> thread is not being created. The link [0] says that it might be due
>>>>>>>>> to the limit on the number of processes in my OS. Will increasing it
>>>>>>>>> solve the issue?
>>>>>>>>>
>>>>>>>>> [0] - http://ww2.cs.fsu.edu/~czhang/errors.html
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Kiran Chitturi
>>>>>>
>>>>>> --
>>>>>> Kiran Chitturi
>
> --
> Kiran Chitturi
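On that last question: one way to check whether the per-user process limit really is the ceiling being hit is a small probe that starts threads until the same OOM appears. This is a diagnostic sketch, not a fix; run it in isolation, because it deliberately exhausts the limit while it runs.

  // Probe how many threads this JVM can start before the OS refuses
  // with "unable to create new native thread". The parked threads are
  // released at the end so the process can exit cleanly.
  import java.util.concurrent.CountDownLatch;

  public class ThreadLimitProbe {
      public static void main(String[] args) {
          CountDownLatch done = new CountDownLatch(1);
          int started = 0;
          try {
              while (true) {
                  Thread t = new Thread(() -> {
                      try {
                          done.await();   // park until the probe finishes
                      } catch (InterruptedException ignored) {
                      }
                  });
                  t.start();
                  started++;
              }
          } catch (OutOfMemoryError e) {
              System.out.println("Thread creation failed after " + started + " threads");
          } finally {
              done.countDown();           // let all parked threads exit
          }
      }
  }

Comparing that count before and after raising the OS limit (e.g. via ulimit) should tell you whether raising it actually moves the ceiling.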