MapReduce Jobs being 'stuck' for several hours and then completing

2011-04-28 Thread Abhinay Mehta
Hi all,

We are using CDH3B4 on the Hadoop cluster.

We have jobs kicking off every hour using the streaming API. Each of these
jobs used to take 4-5 minutes to complete, but since 1pm yesterday they have
suddenly started taking 3-4 hours.

We looked at the data the jobs are working on, and it is exactly the same as
it always has been.
The cluster and its configuration have not been touched since the upgrade to
CDH3B4, which was a month ago.

No errors are being reported in any of the logs; the jobs are just taking
much longer.
One thing I have noticed in the logs: when a job just sits there mid-run, I
see one consistent entry in the slave log files:

2011-04-28 11:16:07,849 INFO org.apache.hadoop.streaming.PipeMapRed:
R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
2011-04-28 11:16:07,849 INFO org.apache.hadoop.streaming.PipeMapRed:
R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]

I see that entry in both map and reduce phases, while the jobs sit idle for
many tens of minutes, not doing anything.
This happens even if there is nothing else running on the cluster.
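
For reference, here is a minimal sketch (it assumes the old org.apache.hadoop.mapred
client API that ships with CDH3; the job id argument is a placeholder) that dumps
per-task progress, start time and last reported state for a job, which can help show
exactly which attempts are the ones sitting idle:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.TaskReport;

    public class TaskProgressDump {
        public static void main(String[] args) throws Exception {
            // Hypothetical usage: hadoop jar taskdump.jar TaskProgressDump job_201104281000_0042
            JobID jobId = JobID.forName(args[0]);
            JobClient client = new JobClient(new JobConf());

            // Print progress, start time and last reported state for every task,
            // so the attempts that sit idle for tens of minutes stand out.
            for (TaskReport r : client.getMapTaskReports(jobId)) {
                System.out.println("MAP    " + r.getTaskID() + " progress=" + r.getProgress()
                        + " started=" + r.getStartTime() + " state=" + r.getState());
            }
            for (TaskReport r : client.getReduceTaskReports(jobId)) {
                System.out.println("REDUCE " + r.getTaskID() + " progress=" + r.getProgress()
                        + " started=" + r.getStartTime() + " state=" + r.getState());
            }
        }
    }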

If anyone can shed some light on this or point me in a direction to
investigate further, it would be much appreciated.

Thank you.

Regards,
Abhinay Mehta


Re: Configure Ganglia with Hadoop

2010-11-08 Thread Abhinay Mehta
A colleague of mine (Ryan Greenhall) and I set up Ganglia on our Hadoop
cluster. He has written a summary of what we did to get it working, which you
might find useful:

http://forwardtechnology.co.uk/blog/4cc841609f4e6a02114f

Regards,
Abhinay Mehta


On 8 November 2010 15:31, Jonathan Creasy jon.cre...@announcemedia.com wrote:

 This is the correct configuration, and there should be nothing more needed.
 I don't think these configuration changes will take effect on the fly,
 so you would need to restart the datanode and namenode processes, if I
 understand correctly.

 When you browse your Ganglia web front-end, you will see some more metrics:

 dfs.FSDirectory.files_deleted
 dfs.FSNamesystem.BlockCapacity
 dfs.FSNamesystem.BlocksTotal
 dfs.FSNamesystem.CapacityRemainingGB
 dfs.FSNamesystem.CapacityTotalGB
 dfs.FSNamesystem.CapacityUsedGB
 dfs.FSNamesystem.CorruptBlocks
 dfs.FSNamesystem.ExcessBlocks
 dfs.FSNamesystem.FilesTotal
 dfs.FSNamesystem.MissingBlocks
 dfs.FSNamesystem.PendingDeletionBlocks
 dfs.FSNamesystem.PendingReplicationBlocks
 dfs.FSNamesystem.ScheduledReplicationBlocks
 dfs.FSNamesystem.TotalLoad
 dfs.FSNamesystem.UnderReplicatedBlocks
 dfs.datanode.blockChecksumOp_avg_time
 dfs.datanode.blockChecksumOp_num_ops
 dfs.datanode.blockReports_avg_time
 dfs.datanode.blockReports_num_ops
 dfs.datanode.block_verification_failures
 dfs.datanode.blocks_read
 dfs.datanode.blocks_removed
 dfs.datanode.blocks_replicated
 dfs.datanode.blocks_verified
 dfs.datanode.blocks_written
 dfs.datanode.bytes_read
 dfs.datanode.bytes_written
 dfs.datanode.copyBlockOp_avg_time
 dfs.datanode.copyBlockOp_num_ops
 dfs.datanode.heartBeats_avg_time
 dfs.datanode.heartBeats_num_ops
 dfs.datanode.readBlockOp_avg_time
 dfs.datanode.readBlockOp_num_ops
 dfs.datanode.readMetadataOp_avg_time
 dfs.datanode.readMetadataOp_num_ops
 dfs.datanode.reads_from_local_client
 dfs.datanode.reads_from_remote_client
 dfs.datanode.replaceBlockOp_avg_time
 dfs.datanode.replaceBlockOp_num_ops
 dfs.datanode.writeBlockOp_avg_time
 dfs.datanode.writeBlockOp_num_ops
 dfs.datanode.writes_from_local_client
 dfs.datanode.writes_from_remote_client
 dfs.namenode.AddBlockOps
 dfs.namenode.CreateFileOps
 dfs.namenode.DeleteFileOps
 dfs.namenode.FileInfoOps
 dfs.namenode.FilesAppended
 dfs.namenode.FilesCreated
 dfs.namenode.FilesRenamed
 dfs.namenode.GetBlockLocations
 dfs.namenode.GetListingOps
 dfs.namenode.JournalTransactionsBatchedInSync
 dfs.namenode.SafemodeTime
 dfs.namenode.Syncs_avg_time
 dfs.namenode.Syncs_num_ops
 dfs.namenode.Transactions_avg_time
 dfs.namenode.Transactions_num_ops
 dfs.namenode.blockReport_avg_time
 dfs.namenode.blockReport_num_ops
 dfs.namenode.fsImageLoadTime
 jvm.metrics.gcCount
 jvm.metrics.gcTimeMillis
 jvm.metrics.logError
 jvm.metrics.logFatal
 jvm.metrics.logInfo
 jvm.metrics.logWarn
 jvm.metrics.maxMemoryM
 jvm.metrics.memHeapCommittedM
 jvm.metrics.memHeapUsedM
 jvm.metrics.memNonHeapCommittedM
 jvm.metrics.memNonHeapUsedM
 jvm.metrics.threadsBlocked
 jvm.metrics.threadsNew
 jvm.metrics.threadsRunnable
 jvm.metrics.threadsTerminated
 jvm.metrics.threadsTimedWaiting
 jvm.metrics.threadsWaiting
 rpc.metrics.NumOpenConnections
 rpc.metrics.RpcProcessingTime_avg_time
 rpc.metrics.RpcProcessingTime_num_ops
 rpc.metrics.RpcQueueTime_avg_time
 rpc.metrics.RpcQueueTime_num_ops
 rpc.metrics.abandonBlock_avg_time
 rpc.metrics.abandonBlock_num_ops
 rpc.metrics.addBlock_avg_time
 rpc.metrics.addBlock_num_ops
 rpc.metrics.blockReceived_avg_time
 rpc.metrics.blockReceived_num_ops
 rpc.metrics.blockReport_avg_time
 rpc.metrics.blockReport_num_ops
 rpc.metrics.callQueueLen
 rpc.metrics.complete_avg_time
 rpc.metrics.complete_num_ops
 rpc.metrics.create_avg_time
 rpc.metrics.create_num_ops
 rpc.metrics.getEditLogSize_avg_time
 rpc.metrics.getEditLogSize_num_ops
 rpc.metrics.getProtocolVersion_avg_time
 rpc.metrics.getProtocolVersion_num_ops
 rpc.metrics.register_avg_time
 rpc.metrics.register_num_ops
 rpc.metrics.rename_avg_time
 rpc.metrics.rename_num_ops
 rpc.metrics.renewLease_avg_time
 rpc.metrics.renewLease_num_ops
 rpc.metrics.rollEditLog_avg_time
 rpc.metrics.rollEditLog_num_ops
 rpc.metrics.rollFsImage_avg_time
 rpc.metrics.rollFsImage_num_ops
 rpc.metrics.sendHeartbeat_avg_time
 rpc.metrics.sendHeartbeat_num_ops
 rpc.metrics.versionRequest_avg_time
 rpc.metrics.versionRequest_num_ops

 -Jonathan

 On Nov 8, 2010, at 8:34 AM, Shuja Rehman wrote:

  Hi
  I have a cluster of 4 machines and want to configure Ganglia for monitoring
  purposes. I have read the wiki and added the following lines to
  hadoop-metrics.properties on each machine.
 
  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  dfs.period=10
  dfs.servers=10.10.10.2:8649
 
  mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  mapred.period=10
  mapred.servers=10.10.10.2:8649
 
  jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  jvm.period=10
  jvm.servers=10.10.10.2:8649
 
  rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
  rpc.period=10
  rpc.servers=10.10.10.2:8649

What are xcievers?

2010-11-07 Thread Abhinay Mehta
Hi all,

Our hadoop cluster went down today with the following error:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(a.b.c.d:50010,
storageID=DS-131834207-127.0.1.1-50010-1276853028220, infoPort=50075,
ipcPort=50020):DataXceiver
java.io.IOException: xceiverCount 257 exceeds the limit of concurrent
xcievers 256

I read on some forums that we should increase the default number of
xcievers, which is fine; I will do that.
But I couldn't find any documentation on what xcievers are and how they get
used up.
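
For reference, the setting usually raised for this error is dfs.datanode.max.xcievers
in each datanode's hdfs-site.xml (the misspelling is part of the property name),
followed by a datanode restart; the 4096 below is only an illustrative value, not a
recommendation for any particular workload:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <!-- current limit per the error above is 256; 4096 is only an example value -->
      <value>4096</value>
    </property>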

Any info or links to documentation would be much appreciated.

Thank you.

Regards,
Abhinay Mehta


Re: Upgrading Hadoop from CDH3b3 to CDH3

2010-10-21 Thread Abhinay Mehta
I've got a couple of wines saved up; they could be useful for this upgrade.
Thanks for the heads-up.

On 20 October 2010 17:23, Michael Segel michael_se...@hotmail.com wrote:


 CDH3B3 is the latest.

 My suggestion is that you get a large bottle of your favorite alcoholic
 beverage and enjoy it while you're doing this upgrade.

 In B3 the hadoop user becomes hdfs, and a new mapred user is introduced.
 So your HDFS data has to be owned by hdfs:hadoop and your MapReduce data by
 mapred:hadoop.
 (That's group hadoop, by the way.)

 Then you have to make sure that your Unix directories have the correct
 permissions, because if they don't, Hadoop fails.

 Also, since you're not using the start-all and start-hbase scripts, you have
 to write your own script that uses the /etc/init.d/hadoop* scripts.

 Then there are zookeeper issues...

 But when all is said and done... it's more stable than hbase-0.20.3
 (Cloudera's earlier release of HBase), and 0.89 seems fairly stable in B3.
 (We're still testing it and trying to break it.)

 HTH

 -Mike


  Date: Wed, 20 Oct 2010 13:53:11 +0100
  Subject: Re: Upgrading Hadoop from CDH3b3 to CDH3
  From: abhinay.me...@gmail.com
  To: common-user@hadoop.apache.org
 
  Yes, you are right; we obviously have beta2 installed on our cluster too.
  Thanks.
 
   On 20 October 2010 12:35, ed hadoopn...@gmail.com wrote:
  
    I don't think there is a stable CDH3 yet, although we've been using CDH3B2
    and it has been pretty stable for us. (At least I don't see it available on
    their website, and they JUST announced CDH3B3 last week at HadoopWorld.)
   
    ~Ed
   
   
    On Wed, Oct 20, 2010 at 5:57 AM, Abhinay Mehta abhinay.me...@gmail.com wrote:
   
     Hi all,
    
     We currently have Cloudera's Hadoop beta 3 installed on our cluster, and we
     would like to upgrade to the latest stable release, CDH3.
     Is there documentation or recommended steps on how to do this?
    
     We found some docs on how to upgrade from CDH2 and CDHb2 to CDHb3 here:
     https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3
     Are the same steps recommended to upgrade to CDH3?
    
     I'm hoping it's a lot easier to upgrade from beta 3 to the latest stable
     version than that document suggests.
    
     Thank you.
     Abhinay Mehta
    
   




Upgrading Hadoop from CDH3b3 to CDH3

2010-10-20 Thread Abhinay Mehta
Hi all,

We currently have Cloudera's Hadoop beta 3 installed on our cluster, and we
would like to upgrade to the latest stable release, CDH3.
Is there documentation or recommended steps on how to do this?

We found some docs on how to upgrade from CDH2 and CDHb2 to CDHb3 here:
https://docs.cloudera.com/display/DOC/Hadoop+Upgrade+from+CDH2+or+CDH3b2+to+CDH3b3
Are the same steps recommended to upgrade to CDH3?

I'm hoping it's a lot easier to upgrade from beta 3 to the latest stable
version than that document suggests.

Thank you.
Abhinay Mehta


Re: what affects number of reducers launched by hadoop?

2010-07-29 Thread Abhinay Mehta
Which configuration key controls the maximum number of tasks per node?


On 28 July 2010 20:40, Joe Stein charmal...@allthingshadoop.com wrote:

 mapred.tasktracker.reduce.tasks.maximum is how many you want as a ceiling
 per node

 You need to configure *mapred.reduce.tasks* to be more than one, as it
 defaults to 1 (which you are overriding in your code, which is why it works
 there).

 This value should be somewhere between .95 and 1.75 times the number of
 maximum tasks per node times the number of data nodes.

 So if you have 3 data nodes and the per-node maximum is set to 7 (21 reduce
 slots in total), configure this to somewhere between 20 and 36.
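
 To make the two knobs concrete, here is a minimal driver sketch (old mapred
 API, matching the job.setNumReduceTasks call quoted below; the class name and
 the input/output path arguments are placeholders):

     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapred.FileInputFormat;
     import org.apache.hadoop.mapred.FileOutputFormat;
     import org.apache.hadoop.mapred.JobClient;
     import org.apache.hadoop.mapred.JobConf;

     public class ReducerCountExample {
         public static void main(String[] args) throws Exception {
             JobConf conf = new JobConf(ReducerCountExample.class);
             conf.setJobName("reducer-count-example");
             // mapred.tasktracker.reduce.tasks.maximum only caps how many reduce
             // tasks run at once on each TaskTracker; the number of reduce tasks
             // the job launches comes from mapred.reduce.tasks (default 1), set
             // here per the 0.95-1.75 guideline above for 3 nodes * 7 slots.
             conf.setNumReduceTasks(20);
             // Equivalent: conf.setInt("mapred.reduce.tasks", 20);
             FileInputFormat.setInputPaths(conf, new Path(args[0]));
             FileOutputFormat.setOutputPath(conf, new Path(args[1]));
             JobClient.runJob(conf); // identity mapper/reducer by default
         }
     }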

 On Wed, Jul 28, 2010 at 3:24 PM, Vitaliy Semochkin vitaliy...@gmail.com wrote:

  Hi,
 
  In my cluster mapred.tasktracker.reduce.tasks.maximum = 4,
  but while monitoring the job in the JobTracker I see only 1 reducer
  working.
 
  First it shows
  reduce > copy - can someone please explain what this means?
 
  After that it shows
  reduce > reduce
 
  When I set the number of reduce tasks for a job programmatically to 10 with
  job.setNumReduceTasks(10);
  the number of reduce > reduce reducers increases to 10, and the
  performance of the application improves as well (the number of reducers
  never exceeds this).
 
  Can someone explain such behavior?
 
  Thanks in Advance,
  Vitaliy S
 



 --

 /*
 Joe Stein
 http://www.linkedin.com/in/charmalloc
 Twitter: @allthingshadoop
 */