My segments are roughly 1.3 GB, 5 GB, and 4.5 GB.
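
For anyone trying to reproduce this, here is a minimal sketch of the merge invocation in the bin/hadoop form Andrzej describes further down. The SegmentMerger class is the standard Nutch tool; the jar name and the paths are placeholders, not taken from this thread:

    # Merge every segment under crawl/segments into one new segment
    # written under crawl/merged (jar name and paths are placeholders).
    bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
        crawl/merged -dir crawl/segments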
2009/6/15 Bartosz Gadzimski <[email protected]>:

> Hello,
>
> Can you look at the size of the merged segments?
>
> If I remember correctly, when I had segment1 = 1 GB and segment2 = 1 GB, the
> new merged segment was something like 5 GB, but I haven't had time to look
> into it.
>
> Thanks,
> Bartosz
>
> Czerwionka Paul wrote:
>
>> Hi Justin,
>>
>> I am running Hadoop in distributed mode and having the same problem.
>>
>> Merging segments just eats up much more temp space than the segments
>> would have taken combined.
>>
>> Paul.
>>
>> On 14.06.2009, at 18:17, MilleBii wrote:
>>
>>> Same here: merging 3 segments of 100k, 100k, and 300k URLs consumed
>>> 200 GB and filled the partition after 18 hours of processing.
>>>
>>> Something strange with this segment merge.
>>>
>>> Conf: PC Dual Core, Vista, Hadoop on a single node.
>>>
>>> Can someone confirm whether installing Hadoop in distributed mode will
>>> fix it? Is there a good config guide for distributed mode?
>>>
>>> 2009/6/12 Justin Yao <[email protected]>:
>>>
>>>> Hi John,
>>>>
>>>> I have no idea about that either.
>>>>
>>>> Justin
>>>>
>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
>>>> <[email protected]> wrote:
>>>>
>>>>> Justin,
>>>>>
>>>>> Thanks for the response.
>>>>>
>>>>> I was having a similar issue: I was trying to merge the segments for
>>>>> the crawls during the month of May, probably around 13-15 GB, and after
>>>>> everything had run it had used around 900 GB of tmp space. Doesn't seem
>>>>> very efficient.
>>>>>
>>>>> I will try this out and see if it changes anything.
>>>>>
>>>>> Do you know if there is any risk in using the following:
>>>>>
>>>>>   <property>
>>>>>     <name>mapred.min.split.size</name>
>>>>>     <value>671088640</value>
>>>>>   </property>
>>>>>
>>>>> as suggested in the article?
>>>>>
>>>>> -John
>>>>>
>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>
>>>>>> Hi John,
>>>>>>
>>>>>> I had the same issue before but never found a solution.
>>>>>> Here is a workaround mentioned by someone on this mailing list; you
>>>>>> may want to give it a try:
>>>>>>
>>>>>> "Seemingly abnormal temp space use by segment merger"
>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>
>>>>>> Regards,
>>>>>> Justin
>>>>>>
>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> OK, so an update on this item.
>>>>>>>
>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>> config just to test it out.
>>>>>>>
>>>>>>> It took forever to get all of the files into the DFS (it was just
>>>>>>> over 80 GB), but it is in there. So I started the SegmentMerge job,
>>>>>>> and it is working flawlessly, still a little slow though.
>>>>>>>
>>>>>>> Also, looking at the stats, the CPUs sometimes go over 20%, but not
>>>>>>> by much and not often, and the disk is very lightly taxed: the peak
>>>>>>> was about 20 MB/sec, and the drives and interface are rated at
>>>>>>> 3 GB/sec, so no issue there.
>>>>>>>
>>>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but when
>>>>>>> I restarted everything it is still only using 2 and 1. Any ideas?
>>>>>>> I made that change in the hadoop-site.xml file, BTW.
>>>>>>>
>>>>>>> -John
>>>>>>>
>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>
>>>>>>>> John Martyniak wrote:
>>>>>>>>
>>>>>>>>> Andrzej,
>>>>>>>>>
>>>>>>>>> I am a little embarrassed asking, but is there a setup guide for
>>>>>>>>> setting up Hadoop for Nutch 1.0, or is it the same process as
>>>>>>>>> setting up for Nutch 0.17 (which I think is the existing guide out
>>>>>>>>> there)?
>>>>>>>>
>>>>>>>> Basically, yes - but that guide is primarily about setting up a
>>>>>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch, and as
>>>>>>>> such the instructions are already slightly outdated. So it's best
>>>>>>>> simply to install a clean Hadoop 0.19.1 according to the instructions
>>>>>>>> on the Hadoop wiki, and then build the nutch*.job file separately.
>>>>>>>>
>>>>>>>>> Also, I have Hadoop already running for some other applications not
>>>>>>>>> associated with Nutch; can I use the same install? I think it is the
>>>>>>>>> same version that Nutch 1.0 uses. Or is it just easier to set it up
>>>>>>>>> using the Nutch config?
>>>>>>>>
>>>>>>>> Yes, it's perfectly OK to use Nutch with an existing Hadoop cluster
>>>>>>>> of the same vintage (which is 0.19.1 in Nutch 1.0). In fact, I would
>>>>>>>> strongly recommend this way, instead of the usual "dirty" way of
>>>>>>>> setting up Nutch by replicating the local build dir ;)
>>>>>>>>
>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>
>>>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>>
>>>>>>>> where className and args identify one of the Nutch command-line
>>>>>>>> tools. You can also modify the bin/nutch script slightly, so that you
>>>>>>>> don't have to specify fully-qualified class names.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Andrzej Bialecki <><
>>>>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>> John Martyniak
>>>>> President
>>>>> Before Dawn Solutions, Inc.
>>>>> 9457 S. University Blvd #266
>>>>> Highlands Ranch, CO 80126
>>>>> o: 877-499-1562 x707
>>>>> f: 877-499-1562
>>>>> c: 303-522-1756
>>>>> e: [email protected]
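
Re the map/reduce task counts John asks about above, here is a sketch of the hadoop-site.xml properties that normally control this on a single-node Hadoop 0.19 setup. The values are only illustrations, not recommendations from this thread: mapred.map.tasks and mapred.reduce.tasks are per-job hints (their stock defaults are 2 and 1, which matches what John is seeing), while the tasktracker maximums cap how many tasks actually run at once on the node and only take effect after the tasktracker is restarted. The mapred.min.split.size value quoted above works out to 640 MiB, i.e. fewer, larger map splits.

    <!-- Per-job hints; the actual number of maps is ultimately decided
         by the input splits. Values here are illustrative. -->
    <property>
      <name>mapred.map.tasks</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>3</value>
    </property>

    <!-- Concurrent task slots per tasktracker (default 2 each);
         restart the tasktracker after changing these. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>7</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>3</value>
    </property>

    <!-- Workaround value quoted above: 671088640 bytes = 640 MiB. -->
    <property>
      <name>mapred.min.split.size</name>
      <value>671088640</value>
    </property>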
