Can someone point me to a guide for creating a Hadoop DFS on a single machine? I'd like to test it out to see how much it speeds up merging. I'm using a ZFS filesystem and have no I/O waits. It seems that once the index reaches the 4GB range, the merge time goes up drastically. My ZFS filesystem is on a SAN and is not the bottleneck.
Thanks in advance!

--- On Mon, 6/15/09, Julien Nioche <[email protected]> wrote:

> From: Julien Nioche <[email protected]>
> Subject: Re: Merge taking forever
> To: [email protected]
> Date: Monday, June 15, 2009, 12:58 PM
>
> Hi,
>
> > Presumably in hadoop-site.xml as a property/value ?
>
> Indeed.
>
> J.
>
> > On the other hand, I'm asking myself why merging segments... I don't
> > fully understand the benefits, if someone can shed some light.
> >
> > 2009/6/15 Julien Nioche <[email protected]>
> >
> > > Hi,
> > >
> > > Have you tried setting *mapred.compress.map.output* to true? This
> > > should reduce the amount of temp space required.
> > >
> > > Julien
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> > >
> > > 2009/6/15 czerwionka paul <[email protected]>
> > >
> > > > hi justin,
> > > >
> > > > i am running hadoop in distributed mode and having the same
> > > > problem.
> > > >
> > > > merging segments just eats up much more temp space than the
> > > > segments would have combined.
> > > >
> > > > paul.
> > > >
> > > > On 14.06.2009, at 18:17, MilleBii wrote:
> > > >
> > > >> Same for merging 3 segments of 100k, 100k, 300k URLs: it
> > > >> resulted in consuming 200GB, and the partition was full after 18
> > > >> hours of processing.
> > > >>
> > > >> Something is strange with this segment merge.
> > > >>
> > > >> Conf: PC Dual Core, Vista, Hadoop on a single node.
> > > >>
> > > >> Can someone confirm if installing Hadoop in distributed mode
> > > >> will fix it? Is there a good config guide for the distributed
> > > >> mode?
> > > >>
> > > >> 2009/6/12 Justin Yao <[email protected]>
> > > >>
> > > >>> Hi John,
> > > >>> I have no idea about that either.
> > > >>> Justin
> > > >>>
> > > >>> On Fri, Jun 12, 2009 at 8:05 AM, John Martyniak
> > > >>> <[email protected]> wrote:
> > > >>>
> > > >>>> Justin,
> > > >>>>
> > > >>>> Thanks for the response.
> > > >>>> I was having a similar issue: I was trying to merge the
> > > >>>> segments for crawls during the month of May, probably around
> > > >>>> 13-15GB, and after everything had run it had used around 900GB
> > > >>>> of temp space. Doesn't seem very efficient.
> > > >>>>
> > > >>>> I will try this out and see if it changes anything.
> > > >>>>
> > > >>>> Do you know if there is any risk in using the following:
> > > >>>>
> > > >>>> <property>
> > > >>>>   <name>mapred.min.split.size</name>
> > > >>>>   <value>671088640</value>
> > > >>>> </property>
> > > >>>>
> > > >>>> as suggested in the article?
> > > >>>>
> > > >>>> -John
> > > >>>>
> > > >>>> On Jun 11, 2009, at 7:25 PM, Justin Yao wrote:
> > > >>>>
> > > >>>>> Hi John,
> > > >>>>>
> > > >>>>> I had the same issue before but never found a solution.
> > > >>>>> Here is a workaround mentioned by someone in this mailing
> > > >>>>> list; you may have a try:
> > > >>>>>
> > > >>>>> Seemingly abnormal temp space use by segment merger
> > > >>>>> http://www.nabble.com/Content%28source-code%29-of-web-pages-crawled-by-nutch-tt23495506.html#a23522569
> > > >>>>>
> > > >>>>> Regards,
> > > >>>>> Justin
> > > >>>>>
> > > >>>>> On Sat, Jun 6, 2009 at 4:09 PM, John Martyniak
> > > >>>>> <[email protected]> wrote:
> > > >>>>>> Ok.
> > > >>>>>>
> > > >>>>>> So an update on this item.
> > > >>>>>>
> > > >>>>>> I did start running Nutch with Hadoop; I am trying a
> > > >>>>>> single-node config just to test it out.
> > > >>>>>>
> > > >>>>>> It took forever to get all of the files into the DFS (it was
> > > >>>>>> just over 80GB), but it is in there. So I started the
> > > >>>>>> SegmentMerge job, and it is working flawlessly, still a
> > > >>>>>> little slow though.
> > > >>>>>>
> > > >>>>>> Also, looking at the stats for the CPUs: they sometimes go
> > > >>>>>> over 20%, but not by much and not often. The disk is very
> > > >>>>>> lightly taxed; the peak was about 20 MB/sec, and the drives
> > > >>>>>> and interface are rated at 3 GB/sec, so no issue there.
> > > >>>>>>
> > > >>>>>> I tried to set the map jobs to 7 and the reduce jobs to 3,
> > > >>>>>> but when I restarted everything it is still only using 2 and
> > > >>>>>> 1. Any ideas? I made that change in the hadoop-site.xml
> > > >>>>>> file, BTW.
> > > >>>>>>
> > > >>>>>> -John
> > > >>>>>>
> > > >>>>>> On Jun 4, 2009, at 10:00 AM, Andrzej Bialecki wrote:
> > > >>>>>>
> > > >>>>>>> John Martyniak wrote:
> > > >>>>>>>
> > > >>>>>>>> Andrzej,
> > > >>>>>>>>
> > > >>>>>>>> I am a little embarrassed asking, but is there a setup
> > > >>>>>>>> guide for setting up Hadoop for Nutch 1.0, or is it the
> > > >>>>>>>> same process as setting up for Nutch 0.17 (which I think
> > > >>>>>>>> is the existing guide out there)?
> > > >>>>>>>
> > > >>>>>>> Basically, yes - but this guide is primarily about the
> > > >>>>>>> setup of a Hadoop cluster using the Hadoop pieces
> > > >>>>>>> distributed with Nutch. As such, these instructions are
> > > >>>>>>> already slightly outdated.
> > > >>>>>>> So it's best simply to install a clean Hadoop 0.19.1
> > > >>>>>>> according to the instructions on the Hadoop wiki, and then
> > > >>>>>>> build the nutch*.job file separately.
> > > >>>>>>>
> > > >>>>>>>> Also, I have Hadoop already running for some other
> > > >>>>>>>> applications not associated with Nutch; can I use the
> > > >>>>>>>> same install? I think it is the same version that Nutch
> > > >>>>>>>> 1.0 uses. Or is it just easier to set it up using the
> > > >>>>>>>> Nutch config?
> > > >>>>>>>
> > > >>>>>>> Yes, it's perfectly OK to use Nutch with an existing
> > > >>>>>>> Hadoop cluster of the same vintage (which is 0.19.1 in
> > > >>>>>>> Nutch 1.0). In fact, I would strongly recommend this way,
> > > >>>>>>> instead of the usual "dirty" way of setting up Nutch by
> > > >>>>>>> replicating the local build dir ;)
> > > >>>>>>>
> > > >>>>>>> Just specify the nutch*.job file like this:
> > > >>>>>>>
> > > >>>>>>>   bin/hadoop jar nutch*.job <className> <args ..>
> > > >>>>>>>
> > > >>>>>>> where className and args are one of the Nutch command-line
> > > >>>>>>> tools and its arguments. You can also modify the bin/nutch
> > > >>>>>>> script slightly, so that you don't have to specify
> > > >>>>>>> fully-qualified class names.
> > > >>>>>>>
> > > >>>>>>> --
> > > >>>>>>> Best regards,
> > > >>>>>>> Andrzej Bialecki <><
> > > >>>>>>>  ___. ___ ___ ___ _ _   __________________________________
> > > >>>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > >>>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > >>>>>>> http://www.sigram.com  Contact: info at sigram dot com
> > > >>>>
> > > >>>> John Martyniak
> > > >>>> President
> > > >>>> Before Dawn Solutions, Inc.
> > > >>>> 9457 S. University Blvd #266
> > > >>>> Highlands Ranch, CO 80126
> > > >>>> o: 877-499-1562 x707
> > > >>>> f: 877-499-1562
> > > >>>> c: 303-522-1756
> > > >>>> e: [email protected]
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
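For reference on the single-machine DFS question that opens the thread: a minimal pseudo-distributed setup for 0.19-era Hadoop needs only a few hadoop-site.xml properties; the hostnames, ports, and values below are illustrative, not from the thread. After editing the file, `bin/hadoop namenode -format` followed by `bin/start-all.sh` brings the DFS up on one machine.

```xml
<configuration>
  <!-- Point the filesystem and jobtracker at the local machine. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <!-- Single node, so replication would only waste space. -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```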
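Julien's suggestion, written out as the hadoop-site.xml fragment he and John are discussing (the property name is as quoted in the thread; in 0.19-era Hadoop it compresses intermediate map output, which is where much of a merge's temp space goes):

```xml
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
```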
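One possible explanation for John seeing only 2 maps and 1 reduce: in Hadoop 0.19, mapred.map.tasks is only a hint (the real map count follows the input splits), the per-node concurrency caps are separate tasktracker properties with a default of 2, and those caps are read at daemon startup, so they need a full stop-all.sh / start-all.sh to take effect. A hedged sketch of the relevant hadoop-site.xml entries, with John's values:

```xml
<!-- Job-level settings; mapred.map.tasks is a hint only. -->
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>

<!-- Per-tasktracker concurrency caps (default 2 each); these
     only apply after the tasktracker daemons are restarted. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```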
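Andrzej's `bin/hadoop jar nutch*.job <className> <args ..>` invocation, made concrete for the segment merge under discussion; the class name is the merger tool shipped in Nutch 1.0, and the HDFS paths are illustrative:

```shell
# Merge all segments under crawl/segments into one output segment
bin/hadoop jar nutch-1.0.job org.apache.nutch.segment.SegmentMerger \
    crawl/merged_segment -dir crawl/segments
```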
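On the mapred.min.split.size value John asks about: 671088640 is exactly 640 MB, i.e. ten default-sized 64 MB HDFS blocks, so the setting simply coarsens input splits; the trade-off is fewer, larger map tasks rather than anything risky to the data itself. A quick sanity check of the arithmetic:

```shell
# 671088640 bytes expressed in MB, and in default 64 MB HDFS blocks
expr 671088640 / 1048576    # 640 (MB)
expr 671088640 / 67108864   # 10 (64 MB blocks)
```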
