MapReduce Jobs being 'stuck' for several hours and then completing
Hi all,

We are using CDH3B4 on our Hadoop cluster. We have jobs kicking off every hour using the streaming API. Each of these jobs used to take 4 to 5 minutes to complete, but since 1pm yesterday they have suddenly started taking 3 to 4 hours. We looked at the data the jobs are working on and it is exactly the same as it always has been. The cluster and its configuration have not been touched since the upgrade to CDH3B4, which was one month ago.

No errors are being reported in any of the logs; the jobs are just taking longer, much longer. One thing I have noticed: when a job just sits there mid-run, I see one consistent entry in the slave log files:

2011-04-28 11:16:07,849 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-04-28 11:16:07,849 INFO org.apache.hadoop.streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]

I see that entry in both map and reduce phases, when the jobs sit idle for many tens of minutes not doing anything. This happens even if there is nothing else running on the cluster.

If anyone can shed some light on this or give me a direction to look into further, it would be much appreciated. Thank you.

Regards,
Abhinay Mehta
Re: Configure Ganglia with Hadoop
A colleague of mine (Ryan Greenhall) and I set up Ganglia on our Hadoop cluster. He has written a summary of what we did to get it to work, which you might find useful: http://forwardtechnology.co.uk/blog/4cc841609f4e6a02114f

Regards,
Abhinay Mehta

On 8 November 2010 15:31, Jonathan Creasy jon.cre...@announcemedia.com wrote:

This is the correct configuration, and there should be nothing more needed. I don't think these configuration changes will take effect on the fly, so you would need to restart the datanode and namenode processes, if I understand correctly. When you browse your Ganglia pages you will see some more metrics (grouped here by prefix):

dfs.FSDirectory: files_deleted

dfs.FSNamesystem: BlockCapacity, BlocksTotal, CapacityRemainingGB, CapacityTotalGB, CapacityUsedGB, CorruptBlocks, ExcessBlocks, FilesTotal, MissingBlocks, PendingDeletionBlocks, PendingReplicationBlocks, ScheduledReplicationBlocks, TotalLoad, UnderReplicatedBlocks

dfs.datanode: blockChecksumOp_avg_time, blockChecksumOp_num_ops, blockReports_avg_time, blockReports_num_ops, block_verification_failures, blocks_read, blocks_removed, blocks_replicated, blocks_verified, blocks_written, bytes_read, bytes_written, copyBlockOp_avg_time, copyBlockOp_num_ops, heartBeats_avg_time, heartBeats_num_ops, readBlockOp_avg_time, readBlockOp_num_ops, readMetadataOp_avg_time, readMetadataOp_num_ops, reads_from_local_client, reads_from_remote_client, replaceBlockOp_avg_time, replaceBlockOp_num_ops, writeBlockOp_avg_time, writeBlockOp_num_ops, writes_from_local_client, writes_from_remote_client

dfs.namenode: AddBlockOps, CreateFileOps, DeleteFileOps, FileInfoOps, FilesAppended, FilesCreated, FilesRenamed, GetBlockLocations, GetListingOps, JournalTransactionsBatchedInSync, SafemodeTime, Syncs_avg_time, Syncs_num_ops, Transactions_avg_time, Transactions_num_ops, blockReport_avg_time, blockReport_num_ops, fsImageLoadTime

jvm.metrics: gcCount, gcTimeMillis, logError, logFatal, logInfo, logWarn, maxMemoryM, memHeapCommittedM, memHeapUsedM, memNonHeapCommittedM, memNonHeapUsedM, threadsBlocked, threadsNew, threadsRunnable, threadsTerminated, threadsTimedWaiting, threadsWaiting

rpc.metrics: NumOpenConnections, RpcProcessingTime_avg_time, RpcProcessingTime_num_ops, RpcQueueTime_avg_time, RpcQueueTime_num_ops, abandonBlock_avg_time, abandonBlock_num_ops, addBlock_avg_time, addBlock_num_ops, blockReceived_avg_time, blockReceived_num_ops, blockReport_avg_time, blockReport_num_ops, callQueueLen, complete_avg_time, complete_num_ops, create_avg_time, create_num_ops, getEditLogSize_avg_time, getEditLogSize_num_ops, getProtocolVersion_avg_time, getProtocolVersion_num_ops, register_avg_time, register_num_ops, rename_avg_time, rename_num_ops, renewLease_avg_time, renewLease_num_ops, rollEditLog_avg_time, rollEditLog_num_ops, rollFsImage_avg_time, rollFsImage_num_ops, sendHeartbeat_avg_time, sendHeartbeat_num_ops, versionRequest_avg_time, versionRequest_num_ops

-Jonathan

On Nov 8, 2010, at 8:34 AM, Shuja Rehman wrote:

Hi,

I have a cluster of 4 machines and want to configure Ganglia for monitoring purposes. I have read the wiki and added the following lines to hadoop-metrics.properties on each machine:

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=10.10.10.2:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=10.10.10.2:8649
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=10.10.10.2:8649
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=10
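One way to confirm the metrics are actually arriving is that gmond serves a full XML dump of everything it knows on its TCP port (8649 in the config above). A hedged sketch of fetching that dump and listing the metric names; the element/attribute layout (METRIC elements with a NAME attribute) is an assumption based on typical gmond output, so verify against your Ganglia version:

```python
import socket
import xml.etree.ElementTree as ET

def dump_gmond(host, port=8649, timeout=5):
    """gmond sends the whole XML dump on connect, then closes the socket."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")

def metric_names(xml_text):
    """Return the NAME attribute of every METRIC element in the dump."""
    root = ET.fromstring(xml_text)
    return [m.get("NAME") for m in root.iter("METRIC")]

# Usage sketch: print(metric_names(dump_gmond("10.10.10.2")))
# If the dfs.*/mapred.* names never appear, the daemons likely were not
# restarted after editing hadoop-metrics.properties.
```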
What are xcievers?
Hi all,

Our Hadoop cluster went down today with the following error:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(a.b.c.d:50010, storageID=DS-131834207-127.0.1.1-50010-1276853028220, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: xceiverCount 257 exceeds the limit of concurrent xcievers 256

I read on some forums that we should increase the default number of xcievers, which is OK, I will do that. But I couldn't find any documentation on what xcievers are and how they get used up. Any info or links to documentation would be much appreciated. Thank you.

Regards,
Abhinay Mehta
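For reference, the limit in that message is controlled by a datanode setting in hdfs-site.xml (the misspelling "xcievers" is in Hadoop itself). A sketch of raising it, using the CDH3-era property name; 4096 is a commonly suggested value, not an official recommendation, and the datanodes need a restart for it to apply:

```
<!-- hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```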
Re: Upgrading Hadoop from CDH3b3 to CDH3
I've got a couple of wines saved up; they could be useful for this upgrade. Thanks for the heads-up.

On 20 October 2010 17:23, Michael Segel michael_se...@hotmail.com wrote:

CDH3B3 is the latest. My suggestion is that you get a large bottle of your favorite alcoholic beverage and enjoy it while you're doing this upgrade.

In B3 you have user hadoop going to hdfs and the introduction of user mapred. So your HDFS stuff has to be owned by hdfs:hadoop and MapReduce stuff owned by mapred:hadoop. (That's group hadoop, btw.) Then you have to make sure that your Unix directories have the correct file permissions, because if they don't, Hadoop fails. Also, since you're not using the start-all and start-hbase scripts, you have to write your own script that uses the /etc/init.d/hadoop* scripts. Then there are ZooKeeper issues...

But when all is said and done, it's more stable than hbase-0.20.3 (Cloudera's earlier release of HBase), and 0.89 seems fairly stable in B3. (We're still testing it and trying to break it.)

HTH
-Mike

Date: Wed, 20 Oct 2010 13:53:11 +0100
Subject: Re: Upgrading Hadoop from CDH3b3 to CDH3
From: abhinay.me...@gmail.com
To: common-user@hadoop.apache.org

Yes, you are right, we obviously have beta2 installed on our cluster too. Thanks.

On 20 October 2010 12:35, ed hadoopn...@gmail.com wrote:

I don't think there is a stable CDH3 yet, although we've been using CDH3B2 and it has been pretty stable for us. (At least I don't see it available on their website, and they JUST announced CDH3B3 last week at HadoopWorld.)

~Ed

On Wed, Oct 20, 2010 at 5:57 AM, Abhinay Mehta abhinay.me...@gmail.com wrote:

Hi all, We currently have Cloudera's Hadoop beta 3 installed on our cluster, and we would like to upgrade to the latest stable release, CDH3. Is there documentation or recommended steps on how to do this?
We found some docs on how to upgrade from CDH2 and CDH3b2 to CDH3b3 here: https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3

Are the same steps recommended to upgrade to CDH3? I'm hoping it's a lot easier to upgrade from beta3 to the latest stable version than that document states.

Thank you.
Abhinay Mehta
Upgrading Hadoop from CDH3b3 to CDH3
Hi all,

We currently have Cloudera's Hadoop beta 3 installed on our cluster, and we would like to upgrade to the latest stable release, CDH3. Is there documentation or recommended steps on how to do this?

We found some docs on how to upgrade from CDH2 and CDH3b2 to CDH3b3 here: https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3

Are the same steps recommended to upgrade to CDH3? I'm hoping it's a lot easier to upgrade from beta3 to the latest stable version than that document states.

Thank you.
Abhinay Mehta
Re: what affects number of reducers launched by hadoop?
Which configuration key controls the maximum number of reduce tasks per node?

On 28 July 2010 20:40, Joe Stein charmal...@allthingshadoop.com wrote:

mapred.tasktracker.reduce.tasks.maximum is how many you want as a ceiling per node.

You need to configure mapred.reduce.tasks to be more than one, as it defaults to 1 (which you are overriding in your code, which is why it works there). This value should be somewhere between 0.95 and 1.75 times the number of maximum tasks per node times the number of data nodes. So if you have 3 data nodes set up with max tasks of 7, then configure this between 25 and 36.

On Wed, Jul 28, 2010 at 3:24 PM, Vitaliy Semochkin vitaliy...@gmail.com wrote:

Hi,

In my cluster mapred.tasktracker.reduce.tasks.maximum = 4, but while monitoring a job in the JobTracker I see only 1 reducer working. First it is in "reduce copy"; can someone please explain what this means? After that it is in "reduce reduce". When I set the number of reduce tasks for a job programmatically to 10 with job.setNumReduceTasks(10), the number of "reduce reduce" reducers increases to 10 and the performance of the application increases as well (the number of reducers never exceeds this). Can someone explain this behavior?

Thanks in advance,
Vitaliy S

-- /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
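The 0.95x to 1.75x rule of thumb from the reply can be written down directly; a small sketch with made-up cluster numbers (the function name is mine, not Hadoop's):

```python
def suggested_reducers(data_nodes, max_reduce_tasks_per_node):
    """Range for mapred.reduce.tasks per the thread's rule of thumb:
    0.95x to 1.75x the total number of reduce slots in the cluster."""
    slots = data_nodes * max_reduce_tasks_per_node
    return int(0.95 * slots), int(1.75 * slots)

# e.g. 4 data nodes with 5 reduce slots each -> 20 slots total,
# so mapred.reduce.tasks somewhere around 19 to 35.
```

The low end (just under one wave of reducers) lets all reduces finish in a single wave; the high end trades a second wave for better load balancing when reduce times are uneven.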