Re: Best Practices for Upgrading Hadoop Version?
Michael Noll has a good description of the upgrade process here: http://www.michael-noll.com/blog/2011/08/23/performing-an-hdfs-upgrade-of-an-hadoop-cluster/ It may not quite reflect the versions of Hadoop you plan to upgrade, but it has some good pointers.

Chris

On 30 May 2012 09:12, ramon@accenture.com wrote:

Hi, I did this upgrade on a similar cluster some weeks ago. I used the following method (all commands run as the Hadoop daemons' process owner):
* Stop the cluster.
* Start only HDFS with: start-dfs.sh -upgrade
* At this point the migration has started.
* You can check the status with: hadoop dfsadmin -upgradeProgress status
* Now you can access files for reading.
* If you find any issue you can roll back the migration with: start-dfs.sh -rollback
* If everything seems OK you can mark the upgrade as finalized: hadoop dfsadmin -finalizeUpgrade

-Original Message-
From: Eli Finkelshteyn [mailto:iefin...@gmail.com]
Sent: martes, 29 de mayo de 2012 20:29
To: common-user@hadoop.apache.org
Subject: Best Practices for Upgrading Hadoop Version?

Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it seems either outdated or missing information. Namely, from what I've noticed so far, it doesn't specify what user any of the commands should be run as. Since I'm sure this is something a lot of people have needed to do, is there a better tutorial somewhere for upgrading the Hadoop version in general?

Eli
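The sequence Ramon describes can be condensed into a shell session (a sketch only; run as the user that owns the Hadoop daemons, and note the exact scripts and flags vary between Hadoop versions):

```shell
# 1. Stop the whole cluster (MapReduce and HDFS)
stop-all.sh

# 2. Install the new Hadoop version, then start HDFS only, in upgrade mode
start-dfs.sh -upgrade

# 3. Poll the upgrade until it reports completion
hadoop dfsadmin -upgradeProgress status

# 4a. If something looks wrong, roll back to the pre-upgrade state
# start-dfs.sh -rollback

# 4b. Once you are satisfied, make the upgrade permanent
hadoop dfsadmin -finalizeUpgrade
```

Until -finalizeUpgrade is run, the datanodes retain the previous block layout (via hard links), which is what makes the rollback possible.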
Re: Moving blocks from a datanode
M,

See http://wiki.apache.org/hadoop/FAQ - 3.6, "I want to make a large cluster smaller by taking out a bunch of nodes simultaneously. How can this be done?" This explains how to decommission nodes by moving the data off the existing node. It's fairly easy and painless (just add the node name to the slaves.exclude file and notify dfs), and once the data is off the node you could swap out the disks and then re-introduce the node into the cluster with larger drives (removing the node name from slaves.exclude).

Chris

On 17 May 2012 02:55, Mayuran Yogarajah mayuran.yogara...@casalemedia.com wrote:

Our cluster has several nodes which have smaller disks than the other nodes and as a result fill up quicker. I am looking to move data off these nodes and onto the others. Here is what I am planning to do:
1) On the nodes with smaller disks, set dfs.datanode.du.reserved to a larger value
2) Restart the data nodes
3) Run the balancer
Will this have the desired effect? If there is a better way to accomplish this please let me know.

Thanks, M
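The decommission route Chris describes can be sketched as follows (the hostname and file path are placeholders; the exclude file name depends on your dfs.hosts.exclude setting):

```shell
# On the namenode: list the node(s) to drain in the exclude file
echo "smalldisk-node01.example.com" >> /etc/hadoop/conf/slaves.exclude

# Tell the namenode to re-read its include/exclude lists; the node
# then shows as "Decommission in progress" until its blocks have
# been re-replicated elsewhere
hadoop dfsadmin -refreshNodes

# Check progress
hadoop dfsadmin -report
```

Once the node reports "Decommissioned", swap the disks, remove the entry from slaves.exclude, run -refreshNodes again and restart the datanode.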
Re: collecting CPU, mem, iops of hadoop jobs
Have a look at OpenTSDB (http://opentsdb.net/overview.html), as it does not have the same down-sampling issue as Ganglia, and it stores the metrics in HBase, making it easier to access and process the data. It's also pretty easy to add your own metrics.

Another useful utility is 'collectl' (http://collectl.sourceforge.net/), which I tend to leave running in the background on each node collecting, storing and managing machine metrics locally - it's very lightweight. When I have an issue that requires a metric I forgot to capture with Ganglia, I usually find it in the 'collectl' logs - as long as I get to the logs before they roll, usually a week. This also doesn't have the down-sampling issue, but it doesn't automatically aggregate the data to a central database.

Regards, Chris

On 21 December 2011 01:20, Arun C Murthy a...@hortonworks.com wrote:

Go ahead and open a MR jira (would appreciate a patch too! ;) ). thanks, Arun

On Dec 20, 2011, at 2:55 PM, Patai Sangbutsarakum wrote:

Thanks again Arun, you saved me again.. :-) This is a great starting point for CPU and possibly Mem. For the IOPS, I'd just like to ask whether the tasktracker/datanode collects the number, or whether we should dig into the OS level, e.g. /proc/PID_OF_tt/io? Hope this makes sense. -P

On Tue, Dec 20, 2011 at 1:22 PM, Arun C Murthy a...@hortonworks.com wrote:

Take a look at the JobHistory files produced for each job. With 0.20.205 you get CPU (slot millis). With 0.23 (alpha quality) you get CPU and JVM metrics (GC etc.). I believe you also get Memory, but not IOPS. Arun

On Dec 20, 2011, at 1:11 PM, Patai Sangbutsarakum wrote:

Thanks for the reply, but I don't think the metrics exposed to Ganglia are what I am really looking for. What I am looking for is something like this (but not limited to):

Job__
CPU time: 10204 sec. -- aggregated from all tasknodes
IOPS: 2344 -- aggregated from all datanodes
MEM: 30G -- aggregated
etc.

Job_aaa_bbb
CPU time:
IOPS:
MEM:

Sorry for the ambiguous question. Thanks

On Tue, Dec 20, 2011 at 12:47 PM, He Chen airb...@gmail.com wrote:

You may need Ganglia. It is cluster monitoring software.

On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote:

Hi Hadoopers, We're running Hadoop 0.20 on CentOS 5.5. I am looking for a way to collect the CPU time, memory usage and IOPS of each Hadoop job. What would be a good starting point? A document? An API? Thanks in advance -P
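On the OS-level route Patai mentions, Linux does expose per-process I/O counters under /proc/PID/io, so a first approximation is simply to read that file for the tasktracker (or task JVM) pid. A minimal sketch, here run against the current shell's own pid for illustration:

```shell
# Print the byte counters Linux keeps per process; for a tasktracker
# you would substitute its pid for $$ (requires suitable permissions).
awk '/^(read_bytes|write_bytes):/ { print $1, $2 }' "/proc/$$/io"
```

Summing these across all task pids on a node (and then across nodes) would give the per-job aggregate the thread is after, though note these are cumulative byte counts, not strictly IOPS.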
Re: Distributed sorting using Hadoop
Madhu,

Try working your way through the MapReduce tutorial here: http://hadoop.apache.org/common/docs/r0.20.205.0/mapred_tutorial.html#Example%3A+WordCount+v1.0 That covers most of the concepts you require to do a distributed sort. Search for the word "combiner" in the tutorial to understand combining results on the map side - to reduce cross-cluster traffic. Also work your way through several of the tutorials and videos on working with Hadoop - Google is your friend here. Another good source on the general algorithms is Jimmy Lin's book, referenced on this page: http://www.umiacs.umd.edu/~jimmylin/book.html

Regards, Chris

On 26 November 2011 13:05, madhu_sushmi madhu_sus...@yahoo.com wrote:

Hi, I need to implement distributed sorting using Hadoop. I am quite new to Hadoop and I am getting confused. If I want to implement merge sort, what should my map and reduce be doing? Should all the sorting happen on the reduce side? Please help. This is an urgent requirement. Please guide me.

-- View this message in context: http://old.nabble.com/Distributed-sorting-using-Hadoop-tp32876787p32876787.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
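To see why the reduce side only needs to merge rather than fully sort, the map/shuffle/reduce flow can be mimicked with plain coreutils - a toy illustration only, not Hadoop code:

```shell
# "Map side": each mapper sorts its own chunk of the input
printf 'banana\napple\n'   > chunk1
printf 'cherry\napricot\n' > chunk2
sort chunk1 -o chunk1
sort chunk2 -o chunk2

# "Reduce side": merge the already-sorted runs; sort -m does a pure
# merge pass, which is the cheap operation a reducer performs on the
# sorted map outputs it has fetched
sort -m chunk1 chunk2
# prints: apple, apricot, banana, cherry (one per line)
```

In Hadoop the framework itself sorts map output by key before the reducer sees it, so for a plain distributed sort both map and reduce can be identity functions.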
Re: Running more than one secondary namenode
Jorn,

If you've configured the Name Node fsimage and edit log replication to both NFS and the Secondary Name Node, and regularly back up the fsimage and edit logs, you would do better investing time in understanding exactly how the Name Node builds up its internal database and how it applies its edit logs; 'read the code, Luke'. Then, if you really want to be prepared, you can produce some test scenarios by applying a corruption (one that the Name Node can't handle automatically) to the fsimage or edit logs on a sacrificial system (a VM?) and see if you can recover from it. That way, if you ever get hit with a Name Node corruption you'll be in a much better place to recover most/all of your data. Even with the best setup it can happen if you hit a 'corner case' scenario.

Chris

On 12 October 2011 08:50, Jorn Argelo - Ephorus jorn.arg...@ephorus.com wrote:

Hi all, I was wondering if there are any (technical) issues with running two secondary namenodes on two separate servers rather than just one. Since basically everything stands or falls with a consistent snapshot of the namenode fsimage, I was considering running two secondary namenodes for additional resilience. Has this been done before, or am I being too paranoid? Are there any caveats with doing this?

Thanks, Jorn
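On the backup side of this, a simple belt-and-braces step is to snapshot the name metadata regularly so you always have something to practise recovery against. A sketch, assuming a hypothetical NFS copy of dfs.name.dir mounted at /mnt/nfs/namedir (the paths are placeholders for your own setup):

```shell
# Keep dated tarballs of the namenode metadata (fsimage + edits).
backup_root=/backups/namenode
mkdir -p "$backup_root"
tar czf "$backup_root/namedir-$(date +%Y%m%d-%H%M).tar.gz" \
    -C /mnt/nfs namedir
```

Run from cron, this gives point-in-time copies to restore onto the sacrificial VM when rehearsing the corruption scenarios above.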
Re: Block Size
On 29 September 2011 18:39, lessonz less...@q.com wrote:

I'm new to Hadoop, and I'm trying to understand the implications of a 64M block size in the HDFS. Is there a good reference that enumerates the implications of this decision and its effects on files stored in the system as well as map-reduce jobs? Thanks.

There's a good explanation of HDFS here: http://hadoop.apache.org/common/docs/current/hdfs_design.html

In a nutshell, MapReduce moves the computation to the node that hosts the data (block). As there is an overhead in the startup/teardown of each task, you want to make sure each task has a reasonable amount of data to process, hence the default block size of 64MB. Quite a few users run at larger block sizes, either because it's more efficient for their algorithms or to reduce the overhead on the Name Node: more blocks = more metadata to hold in its in-memory database.

Hope that helps. Chris
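The Name Node overhead point is easy to quantify: the number of blocks, and hence the number of in-memory metadata entries, scales inversely with block size. A quick back-of-envelope calculation:

```shell
# Blocks needed to store 1 TiB of data at different block sizes
filesize_mb=$(( 1024 * 1024 ))                  # 1 TiB expressed in MiB
echo "64MB blocks:  $(( filesize_mb / 64 ))"    # prints 16384
echo "128MB blocks: $(( filesize_mb / 128 ))"   # prints 8192
```

Doubling the block size halves the number of block entries the Name Node must hold in memory, which is one reason larger clusters often run 128MB or 256MB blocks.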
Re: Why inter-rack communication in mapreduce slow?
Elton,

Rapleaf's blog has an interesting post on their experience that's worth a read: http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/

And if you want to get an idea of the interaction between CPU, disk and network, there's nothing like a picture; see slide 9 in that deck, of a very simple Terasort Map/Reduce job. Obviously the real world is very different, but individual Map/Reduce jobs follow a similar pattern. Even doubling the node network performance in this simple example would not get you much improvement, as the job is CPU bound for 50% of the time and only uses the network for roughly 10% of the remaining time.

Chris

On 6 June 2011 16:42, Michael Segel michael_se...@hotmail.com wrote:

Well the problem is pretty basic. Take your typical 1 GbE switch with 42 ports. Each port is capable of doing 1 GbE in each direction across the switch's fabric. Depending on your hardware, that's a fabric of 40GB, shared. Depending on your hardware, you are usually using 1 or maybe 2 ports to 'trunk' to your network's backplane. (To keep this simple, let's just say that it's a 1-2 GbE 'trunk' to your next rack.) So you end up with 1 GbE of traffic from each node trying to communicate with a node on the next rack. So if that's 20 nodes per rack and they all want to communicate... you end up with 20 GbE (each direction) trying to fit through a 1-2 GbE pipe. Think of rush hour in Chicago, or worse, rush hour in Atlanta where people don't know how to drive. :-P The quick fix... spend the 8-10K per switch to get a ToR switch that has 10+ GbE uplink capabilities (usually 4 ports). Then you have at least 10 GbE per rack. JMHO -Mike

To: common-user@hadoop.apache.org Subject: Re: Why inter-rack communication in mapreduce slow? Date: Mon, 6 Jun 2011 11:00:05 -0400 From: dar...@ontrenet.com

IMO, that's right. Because map/reduce/hadoop was originally designed for that kind of text processing purpose (i.e. few stages, low dependency, highly parallel). It's when one tries to solve general-purpose algorithms of modest complexity that map/reduce gets into I/O churning problems.

On Mon, 6 Jun 2011 23:58:53 +1000, elton sky eltonsky9...@gmail.com wrote:

Hi John, Because for map tasks the job tracker tries to assign them to local data nodes, there's not much n/w traffic. Then the only potential issue will be, as you said, the reducers, which copy data from all maps. So in other words, if the application only creates small intermediate output, e.g. grep, wordcount, this jam between racks is not likely to happen, is it?

On Mon, Jun 6, 2011 at 11:40 PM, John Armstrong john.armstr...@ccri.com wrote:

On Mon, 06 Jun 2011 09:34:56 -0400, dar...@ontrenet.com wrote: Yeah, that's a good point. I wonder, though, what the load on the tracker nodes (ports et al.) would be if an inter-rack fiber switch at 10's of Gb/s is getting maxed. Seems to me that if there is that much traffic being migrated across racks, the tracker node (or whatever node it is) would overload first?

It could happen, but I don't think it always would. For example: the tracker is on rack A; it sees that the best place to put reducer R is on rack B; it sees the reducer still needs a few hellabytes from mapper M on rack C; it tells M to send data to R; the switches on B and C get throttled, leaving A free to handle other things. In fact, it almost makes me wonder if an ideal setup is not only to have each of the main control daemons on their own nodes, but to put THOSE nodes on their own rack and keep all the data elsewhere.
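Mike's rush-hour numbers boil down to a simple oversubscription ratio: aggregate node bandwidth per rack divided by uplink bandwidth. Using the figures from his example:

```shell
nodes_per_rack=20   # nodes, each with a 1 GbE link
node_gbe=1
uplink_gbe=2        # 2 x 1 GbE trunk ports to the next rack
echo "oversubscription: $(( nodes_per_rack * node_gbe / uplink_gbe )):1"
# prints: oversubscription: 10:1
```

With the suggested ToR upgrade, even a single 10 GbE uplink cuts the same ratio to 2:1, and 4 x 10 GbE brings it below 1:1, at which point the uplink stops being the bottleneck.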
Re: tips and tools to optimize cluster
Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose precision on the historical data. It also has some neat tricks around the collection and display of data.

Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ), a lightweight Perl script that both captures and compresses the metrics, manages its metrics data files, and then filters and presents the metrics as requested. I find collectl lightweight and useful enough that I set it up to capture everything and leave it running in the background on most systems I build, because by the time you need the measurement data the event is usually in the past and difficult to reproduce. With collectl running I have a week to recognise the event and analyse/save the relevant data file(s); the data files are approx. 21MB/node/day gzipped.

With a little bit of bash, awk or perl scripting you can convert the collectl output into a form easily loadable into Pig. Pig also has User Defined Functions (UDFs) that can import the Hadoop job history, so with some Pig Latin you can marry your infrastructure metrics with your job metrics; a bit like the cluster eating its own dog food.

BTW, watch out for a little gotcha with Ganglia. It doesn't seem to report the full jvm metrics via gmond, although if you output the jvm metrics to file you get a record for each jvm on the node. I haven't looked into it in detail yet, but it looks like Ganglia only reports the last jvm record in each batch. Anyone else seen this?

Chris

On 24 May 2011 01:48, Tom Melendez t...@supertom.com wrote:

Hi Folks, I'm looking for tips, tricks and tools to get at node utilization to optimize our cluster. I want to answer questions like:
- what nodes ran a particular job?
- how long did it take for those nodes to run the tasks for that job?
- how/why did Hadoop pick those nodes to begin with?
More detailed questions like:
- how much memory did the task for the job use on that node?
- average CPU load on that node during the task run
And more aggregate questions like:
- are some nodes favored more than others?
- utilization averages (generally, how many cores on that node are in use, etc.)
There are plenty more that I'm not asking, but you get the point. So, what are you using for this? I see some mentions of Ganglia, so I'll definitely look into that. Anything else? Anything you're using to monitor in real time (like a 'top' across the nodes or something like that)? Any info or war stories greatly appreciated.

Thanks, Tom
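The bash/awk conversion Chris mentions can be as small as normalising the collectl output to tab-separated fields so Pig's default PigStorage('\t') loader can read it. A minimal sketch using made-up sample lines (the real collectl column layout depends on which subsystems you record):

```shell
# Two fake space-separated metric lines standing in for collectl output:
#   date time cpu-busy disk-kb
printf '%s\n' '20110524 00:00:01 12 34' '20110524 00:00:02 56 78' > sample.txt

# Re-emit with tabs between fields; "$1 = $1" forces awk to rebuild
# each record using the output field separator
awk -v OFS='\t' '{ $1 = $1; print }' sample.txt > sample.tsv
cat sample.tsv
```

In Pig this would then load with something like: metrics = LOAD 'sample.tsv' AS (date:chararray, time:chararray, cpu:int, diskkb:int);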
Re: the question of hadoop
2010/9/6 褚 鵬兵 chu_pengb...@hotmail.com wrote:

Hi, my hadoop friends. I have 3 questions about hadoop:
1. The speed between the datanodes. With terabytes of data on one datanode, data transfers from one datanode to another; if the network speed is bad, Hadoop will be slow, I think. I have heard of the gNet architecture in Greenplum; what about hadoop? Is SAS storage + gigabit Ethernet the best answer?
2. The GUI tool. There is a Hive web tool in hadoop, but it is too simple for our business work. If hadoop+hive is designed as a DWH, how should users work with it? Via a CGI/command tool, or a newly developed web GUI tool?
3. A 5-computer Hadoop cluster vs 1 computer running SQL Server 2000. The 5 Hadoop machines are Celeron 2.66G, 1GB memory, Ethernet: namenode + secondarynamenode + 3 datanodes. The SQL Server 2000 machine is also a Celeron 2.66G with 1GB memory. I ran the same select operation over the same 100MB of data: the 5-computer Hadoop cluster took 2mins 30secs; the single SQL Server 2000 machine took 2mins 25secs. So the 5-computer Hadoop cluster is not good. Why? Can anyone give me some advice? Thanks in advance.

Why use Hadoop in preference to a database? At the recent Hadoop User Group (UK) meeting, Andy Kemp from http://www.forward.co.uk/ presented their experience in moving from a MySQL database approach to Hadoop. From my notes of his talk, their system manages 120 million keywords and is updated at a rate of 20GB/day. They originally used a sharded MySQL database but found it couldn't scale to handle the types of queries their users required, e.g. "Can you cluster 17(?) million keyword phrases into thematic groups?". Their calculations indicated that the database approach would take more than a year to handle such a query. Moving to a cluster of 100 Hadoop nodes on Amazon EC2 reduced this time down to 7 hours. The issues then became the costs of storage and of moving the data to and from the cluster.

They then moved to a private VM system with about 30 VMs - I assume the processing took the same time, as I didn't note this down. From there they moved to dedicated hardware, 5 dedicated Hadoop nodes, and achieved better performance than the 30 VMs. Andy's talk, "Hadoop in Context", should be available as a podcast here http://skillsmatter.com/podcast/cloud-grid/hadoop-in-context and would be well worth watching, but when I last looked it hadn't been uploaded yet.

At the same event, Ian Broadhead from http://www.playfish.com/ gave a talk on managing the activity of over 1 million active Internet gamers producing over 50GB of data a day. Their original MySQL system took up to 50 times longer to process their data load than an EC2 cluster of Hadoop nodes. He talked about a typical workload being reduced from 2-3 days (using MySQL) down to 6 hours (using Hadoop). Unfortunately I don't think Ian's talk will appear as a podcast.

However, most presentations during the evening made the point that Hadoop didn't completely replace their databases; it just provided a convenient way to rapidly process large volumes of data, the output from Hadoop processing typically being stored in databases to satisfy general everyday business queries. I think the common theme here was that all of these users had large datasets of the order of 100's of GBs, with multiple views of that data, handling on the order of 10's of millions of updates a day.

I hope that helps. Chris
RE: Question about disk space allocation in hadoop
Some thoughts on how to restrict the temporary data, though I have only tried (a) in anger:

a) Partition your disks into separate HDFS and intermediate-temp partitions of the relevant sizes. This gives a fixed separation but is difficult/impossible to modify on a busy cluster, especially as there may be no way of unloading/recovering the data stored in HDFS if you make a mistake resizing partitions;

b) Implement disk quotas and set relevant hard and soft limits on the relevant root directories for intermediate space. This gives you the flexibility to change the limits when required, but as the limits are per user/group some thought is required as to which user/group the limits apply to. There may also be a performance impact? You could combine this with setting the "dfs.datanode.du.reserved" value in $HADOOP_HOME/conf/hdfs-site.xml to limit HDFS disk usage;

c) Implement the intermediate data space as a loopback file, see: http://wiki.cita.utoronto.ca/mediawiki/index.php/Fake_Fast_Local_Disk This example implements a temporary loopback filesystem on an iSCSI-mounted Lustre filesystem, but the principles are the same. There are some performance benchmarks linked to in section 3. The intermediate temp data space is limited by the size of the loopback file created.

Chris

-Original Message-
From: Yu Li [mailto:car...@gmail.com]
Sent: 30 June 2010 04:11
To: common-user@hadoop.apache.org
Subject: Re: Question about disk space allocation in hadoop

Hi all, Anybody have experience with this? Any comments/suggestions would be highly appreciated. Thanks. Best Regards, Carp

2010/6/29 Yu Li car...@gmail.com:

Hi all, As we all know, machines in a hadoop cluster may be both datanode and tasktracker, so one machine may store both MR job intermediate data and HDFS data. My question is: if we have more than one disk per node, say 4 disks, and would like both the job intermediate data and HDFS data spread across all disks to reduce the I/O load on each single disk, can we draw a line between the space for the local FS and HDFS? For example, restrict the intermediate temp data to occupy no more than 25% of the space on each disk? Thanks in advance. Best Regards, Carp
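For option (b)'s dfs.datanode.du.reserved suggestion: the property takes a byte count reserved per volume for non-HDFS use, so Carp's "no more than 25% per disk" goal maps to reserving 25% of each disk's capacity. A sketch that generates the hdfs-site.xml fragment (the 1 TiB disk size is an assumption for illustration):

```shell
# Reserve 25% of a hypothetical 1 TiB disk, per volume, for non-HDFS
# data (e.g. the MR intermediate files); the value is in bytes.
disk_bytes=$(( 1024 * 1024 * 1024 * 1024 ))
reserved=$(( disk_bytes / 4 ))
cat > hdfs-site-fragment.xml <<EOF
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>${reserved}</value>
</property>
EOF
cat hdfs-site-fragment.xml
```

Note this only stops HDFS from eating the space; it does not stop a large shuffle from exceeding the 25% in the other direction, which is where options (a) and (c) come in.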