Hi,

> Presumably in hadoop-site.xml as a property/value ?

Indeed.

J.

> On the other hand, I'm asking myself why merging segments... I don't
> fully understand the benefits; perhaps someone can shed some light.
>
> 2009/6/15 Julien Nioche <[email protected]>
>
>> Hi,
>>
>> Have you tried setting *mapred.compress.map.output* to true? This
>> should reduce the amount of temp space required.
>>
>> Julien
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
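To spell that out in hadoop-site.xml terms, the snippet below is what I
have in mind. It is a minimal sketch, untested here: the property names
are the Hadoop 0.19.x ones, and the codec entry is optional (the value
shown is just the stock default).

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <description>Compress intermediate map output, which shrinks the
    spill files written to local temp space during the merge.</description>
  </property>
  <property>
    <!-- Optional: which codec to compress the map output with. -->
    <name>mapred.map.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  </property>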
>> 2009/6/15 czerwionka paul <[email protected]>
>>
>>> Hi Justin,
>>>
>>> I am running Hadoop in distributed mode and having the same problem.
>>>
>>> Merging segments just eats up much more temp space than the segments
>>> would have taken combined.
>>>
>>> paul.
>>>
>>> On 14.06.2009, at 18:17, MilleBii wrote:
>>>
>>>> Same here: merging 3 segments of 100K, 100K, and 300K URLs resulted
>>>> in consuming 200 GB, and the partition was full after 18 hours of
>>>> processing.
>>>>
>>>> Something strange with this segment merge.
>>>>
>>>> Config: dual-core PC, Vista, Hadoop on a single node.
>>>>
>>>> Can someone confirm whether installing Hadoop in distributed mode
>>>> will fix it? Is there a good config guide for the distributed mode?
>>>>
>>>> 2009/6/12 Justin Yao <[email protected]>
>>>>
>>>>> Hi John,
>>>>>
>>>>> I have no idea about that either.
>>>>>
>>>>> Justin
>>>>>
>>>>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Justin,
>>>>>>
>>>>>> Thanks for the response.
>>>>>>
>>>>>> I was having a similar issue: I was trying to merge the segments
>>>>>> for the crawls from the month of May, probably around 13-15 GB,
>>>>>> and after everything had run it had used around 900 GB of tmp
>>>>>> space. That doesn't seem very efficient.
>>>>>>
>>>>>> I will try this out and see if it changes anything.
>>>>>>
>>>>>> Do you know if there is any risk in using the following, as
>>>>>> suggested in the article?
>>>>>>
>>>>>>   <property>
>>>>>>     <name>mapred.min.split.size</name>
>>>>>>     <value>671088640</value>
>>>>>>   </property>
>>>>>>
>>>>>> -John
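For what it's worth, that value is just 640 MB expressed in bytes, so the
main effect is fewer, larger map tasks and therefore fewer intermediate
spill files; the risk is mostly a loss of parallelism, not correctness.
Rough numbers for an input of John's size, assuming the default 64 MB
block-sized splits:

  671088640 bytes / (1024 * 1024) = 640 MB per split
  ~14 GB of input / 640 MB  =>  ~23 map tasks
  ~14 GB of input /  64 MB  => ~224 map tasks (the default behaviour)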
>>>>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I had the same issue before but never found a solution. Here is a
>>>>>>> workaround mentioned by someone on this mailing list; you may give
>>>>>>> it a try:
>>>>>>>
>>>>>>> Seemingly abnormal temp space use by segment merger
>>>>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
>>>>>>>
>>>>>>> Regards,
>>>>>>> Justin
>>>>>>>
>>>>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok, so an update to this item.
>>>>>>>>
>>>>>>>> I did start running Nutch with Hadoop; I am trying a single-node
>>>>>>>> config just to test it out.
>>>>>>>>
>>>>>>>> It took forever to get all of the files into the DFS (it was just
>>>>>>>> over 80 GB), but it is in there. So I started the SegmentMerger
>>>>>>>> job, and it is working flawlessly, still a little slow though.
>>>>>>>>
>>>>>>>> Also, looking at the stats: the CPUs sometimes go over 20%, but
>>>>>>>> not by much and not often, and the disk is very lightly taxed;
>>>>>>>> the peak was about 20 MB/sec and the drives and interface are
>>>>>>>> rated at 3 Gb/sec, so no issue there.
>>>>>>>>
>>>>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3, but
>>>>>>>> when I restarted everything it is still only using 2 and 1. Any
>>>>>>>> ideas? I made that change in the hadoop-site.xml file, BTW.
>>>>>>>>
>>>>>>>> -John
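On the 2-maps/1-reduce question: those happen to be the stock defaults
for mapred.map.tasks and mapred.reduce.tasks, which makes me suspect the
hadoop-site.xml being edited is not the one on the daemons' classpath.
Note also that mapred.map.tasks is only a hint (the actual number of
maps is driven by the input splits), and that what a node runs
concurrently is capped by the tasktracker slot settings, which are read
at daemon startup. A sketch of what I would try, with 0.19-era property
names that are worth double-checking:

  <property>
    <!-- up to 7 map tasks running at once on this node -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <!-- up to 3 reduce tasks running at once on this node -->
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>
  <property>
    <!-- default number of reduces per job, unless the job overrides it -->
    <name>mapred.reduce.tasks</name>
    <value>3</value>
  </property>

Then restart the tasktracker so the new slot maxima are picked up.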
>>>>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
>>>>>>>>
>>>>>>>>> John Martyniak wrote:
>>>>>>>>>
>>>>>>>>>> Andrzej,
>>>>>>>>>>
>>>>>>>>>> I am a little embarrassed asking, but is there a setup guide
>>>>>>>>>> for setting up Hadoop for Nutch 1.0, or is it the same process
>>>>>>>>>> as setting up for Nutch 0.17 (which I think is the existing
>>>>>>>>>> guide out there)?
>>>>>>>>>
>>>>>>>>> Basically, yes - but that guide is primarily about setting up a
>>>>>>>>> Hadoop cluster using the Hadoop pieces distributed with Nutch,
>>>>>>>>> so its instructions are already slightly outdated. It's best to
>>>>>>>>> simply install a clean Hadoop 0.19.1 according to the
>>>>>>>>> instructions on the Hadoop wiki, and then build the nutch*.job
>>>>>>>>> file separately.
>>>>>>>>>
>>>>>>>>>> Also, I have Hadoop already running for some other
>>>>>>>>>> applications, not associated with Nutch; can I use the same
>>>>>>>>>> install? I think it is the same version that Nutch 1.0 uses.
>>>>>>>>>> Or is it just easier to set it up using the Nutch config?
>>>>>>>>>
>>>>>>>>> Yes, it's perfectly ok to use Nutch with an existing Hadoop
>>>>>>>>> cluster of the same vintage (which is 0.19.1 in Nutch 1.0). In
>>>>>>>>> fact, I would strongly recommend this, instead of the usual
>>>>>>>>> "dirty" way of setting up Nutch by replicating the local build
>>>>>>>>> dir ;)
>>>>>>>>>
>>>>>>>>> Just specify the nutch*.job file like this:
>>>>>>>>>
>>>>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
>>>>>>>>>
>>>>>>>>> where className is one of the Nutch command-line tools and args
>>>>>>>>> are its arguments. You can also modify the bin/nutch script
>>>>>>>>> slightly so that you don't have to specify fully-qualified
>>>>>>>>> class names.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Andrzej Bialecki <><
>>>>>>>>> http://www.sigram.com   Contact: info at sigram dot com
>>>>>>
>>>>>> John Martyniak
>>>>>> President
>>>>>> Before Dawn Solutions, Inc.
>>>>>> 9457 S. University Blvd #266
>>>>>> Highlands Ranch, CO 80126
>>>>>> o: 877-499-1562 x707
>>>>>> f: 877-499-1562
>>>>>> c: 303-522-1756
>>>>>> e: [email protected]
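To tie Andrzej's invocation pattern back to the merge that started this
thread: with a standalone Hadoop cluster, the segment merge would be run
along these lines. The paths here are made up for the example; the class
name is the one shipped in Nutch 1.0.

  bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
      crawl/merged_segment -dir crawl/segments

From a plain Nutch checkout, bin/nutch mergesegs should wrap the same
class, if I remember correctly.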
--
DigitalPebble Ltd
http://www.digitalpebble.com