Re: Monitoring Slow job.

2011-10-05 Thread patrick sang
Hi all,
Thanks, Vitthal; your reply led me to the solution.
Here is what I did; hopefully it counts as a contribution back to the
community.

1. ./hadoop job -list all |awk '{ if($2==1) print $1 }'
--- to get the list of running JobID

2. ./hadoop job -status JobID
-- file: hdfs://xxx.xx.xxx/zz/aa/job.xml

3. ./hadoop fs -cat hdfs://xxx.xx.xxx/zz/aa/job.xml
here we got mapred.output.dir

4. ./hadoop job -history mapred.output.dir
-- Launched At:

5. Get the time diff between launched at and now.
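Step 5 is the only part that needs real logic; the rest is plumbing. Here is a rough sketch of the time-diff step (the sample "Launched At:" line and the fixed "now" are stand-ins so the snippet runs on its own; GNU date is assumed):

```shell
#!/bin/sh
# Step 5 of the recipe above: turn a "Launched At:" line from
# `hadoop job -history <mapred.output.dir>` into elapsed hours.
# The sample line below stands in for real output; in practice:
#   launched_line=$(hadoop job -history "$dir" | grep '^Launched At:')
launched_line='Launched At: 3-Oct-2011 17:31:17 (0sec)'

# Strip the label and the trailing "(...)", then swap '-' for ' '
# so GNU date can parse "3 Oct 2011 17:31:17".
launched=$(echo "$launched_line" | sed 's/^Launched At: //; s/ (.*)$//' | tr '-' ' ')
start_epoch=$(date -d "$launched" +%s)

# A fixed "now" keeps this example reproducible; use $(date +%s) for real.
now_epoch=$(date -d '3 Oct 2011 20:31:17' +%s)
elapsed_hours=$(( (now_epoch - start_epoch) / 3600 ))
echo "$elapsed_hours"   # 3
```

From there it is a one-line comparison against your X-hours threshold.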

It would be much easier to just scrape
http://jobtracker:50030/jobdetails.jsp?jobid=jobId
or there might be some more elegant way of getting this; it's just how I did
it this time.
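For the scraping route, something like the following could pull the "Running for:" value out of the page. This is only a sketch: the HTML fragment below is a guess at the page layout, which varies by Hadoop version, so the sed pattern may need adjusting.

```shell
#!/bin/sh
# Hypothetical fragment of jobdetails.jsp; in practice fetch it with
#   html=$(curl -s "http://jobtracker:50030/jobdetails.jsp?jobid=$jobid")
html='<b>Started at:</b> Mon Oct 03 17:31:17 PDT 2011<br>
<b>Running for:</b> 3hrs, 12mins, 47sec<br>'

# Pull out the value after the "Running for:" label, up to the next tag.
running_for=$(echo "$html" | sed -n 's/.*Running for:<\/b> \([^<]*\)<.*/\1/p')
echo "$running_for"   # 3hrs, 12mins, 47sec
```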

Cheers,
P

On Mon, Oct 3, 2011 at 5:41 PM, Vitthal Suhas Gogate 
gog...@hortonworks.com wrote:


Monitoring Slow job.

2011-10-03 Thread patrick sang
Hi Hadoopers,

I am writing a script to detect whether any running job has been running
longer than X hours.

So far,  I use
./hadoop job -jt jobtracker:port -list all |awk '{ if($2==1) print $1 }'
--- to get the list of running JobID

I am trying to find a way to get how long a job has been running from its jobId.
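For reference, the awk filter above keys on the second column of `hadoop job -list all`, which is the numeric job state (in 0.20-era Hadoop: 1 = RUNNING, 2 = SUCCEEDED, 3 = FAILED, 4 = PREP, 5 = KILLED). Against fabricated sample rows it behaves like this:

```shell
#!/bin/sh
# Two fabricated rows in the shape of `hadoop job -list all` output:
# JobId, State, StartTime, UserName, Priority (state 1 = RUNNING).
sample='job_201110031726_0001	1	1317688277686	vgogate	NORMAL
job_201110031726_0002	2	1317688300000	vgogate	NORMAL'

# Same filter as above: keep rows whose state column is 1 (RUNNING).
running=$(printf '%s\n' "$sample" | awk '$2 == 1 { print $1 }')
echo "$running"   # job_201110031726_0001
```

Note the third column is the job's start time in epoch milliseconds, which on some versions may already be enough to compute the running duration directly.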

In the web admin page (http://job:50030), we can easily see the duration of
each running job from *Started at:* and *Running for:*.

How do we get such information from the command line?

Hope this makes sense.
Thank you in advance,

-P


Re: Monitoring Slow job.

2011-10-03 Thread Vitthal Suhas Gogate
I am not sure there is an easy way to get what you want on the command line.. one
option is to use the following command, which gives you verbose job history
where you can find the Submit, Launch, and Finish times (including the duration on
the Finished At line).  I am using the hadoop-0.20.205.0 branch, so check whether you
have such an option for the version of hadoop you are using...

I am pasting sample output from my wordcount program:

bin/hadoop job -history job_output_directory_on_hdfs

==

horton-mac:hadoop-0.20.205.0 vgogate$ bin/hadoop job -history output
Warning: $HADOOP_HOME is deprecated.


Hadoop job: 0001_1317688277686_vgogate
=
Job tracker host name: job
job tracker start time: Sun May 16 08:53:51 PDT 1976
User: vgogate
JobName: word count
JobConf:
hdfs://horton-mac.local:54310/tmp/mapred/staging/vgogate/.staging/job_201110031726_0001/job.xml
Submitted At: 3-Oct-2011 17:31:17
Launched At: 3-Oct-2011 17:31:17 (0sec)
Finished At: 3-Oct-2011 17:31:50 (32sec)
Status: SUCCESS
Counters:

|Group Name                 |Counter name                                                       |Map Value|Reduce Value|Total Value|
-----------------------------------------------------------------------------------------------------------------------------------
|Job Counters               |Launched reduce tasks                                              |0        |0           |1          |
|Job Counters               |SLOTS_MILLIS_MAPS                                                  |0        |0           |12,257     |
|Job Counters               |Total time spent by all reduces waiting after reserving slots (ms)|0        |0           |0          |
|Job Counters               |Total time spent by all maps waiting after reserving slots (ms)    |0        |0           |0          |
|Job Counters               |Launched map tasks                                                 |0        |0           |1          |
|Job Counters               |Data-local map tasks                                               |0        |0           |1          |
|Job Counters               |SLOTS_MILLIS_REDUCES                                               |0        |0           |10,082     |
|File Output Format Counters|Bytes Written                                                      |0        |61,192      |61,192     |
|FileSystemCounters         |FILE_BYTES_READ                                                    |0        |70,766      |70,766     |
|FileSystemCounters         |HDFS_BYTES_READ                                                    |112,056  |0           |112,056    |
|FileSystemCounters         |FILE_BYTES_WRITTEN                                                 |92,325   |92,294      |184,619    |
|FileSystemCounters         |HDFS_BYTES_WRITTEN                                                 |0        |61,192      |61,192     |
|File Input Format Counters |Bytes Read                                                         |111,933  |0           |111,933    |
|Map-Reduce Framework       |Reduce input groups                                                |0        |2,411       |2,411      |
|Map-Reduce Framework       |Map output materialized bytes                                      |70,766   |0           |70,766     |
|Map-Reduce Framework       |Combine output records                                             |2,411    |0           |2,411      |
|Map-Reduce Framework       |Map input records                                                  |2,643    |0           |2,643      |
|Map-Reduce Framework       |Reduce shuffle bytes                                               |0        |0           |0          |
|Map-Reduce Framework       |Reduce output records                                              |0        |2,411       |2,411      |
|Map-Reduce Framework       |Spilled Records                                                    |2,411    |2,411       |4,822      |
|Map-Reduce Framework       |Map output bytes                                                   |120,995  |0           |120,995    |
|Map-Reduce Framework       |Combine input records                                              |5,849    |0           |5,849      |
|Map-Reduce Framework       |Map output records                                                 |5,849    |0           |5,849      |
|Map-Reduce Framework       |SPLIT_RAW_BYTES                                                    |123      |0           |123        |
|Map-Reduce Framework       |Reduce input records                                               |0        |2,411       |2,411      |
=

Task Summary

Kind     Total  Successful  Failed  Killed  StartTime            FinishTime

Setup    1      1           0       0       3-Oct-2011 17:31:20  3-Oct-2011 17:31:24 (4sec)
Map      1      1           0       0       3-Oct-2011 17:31:26  3-Oct-2011 17:31:30 (4sec)
Reduce   1      1           0       0       3-Oct-2011 17:31:32  3-Oct-2011 17:31:42 (10sec)
Cleanup  1      1           0       0       3-Oct-2011 17:31:44  3-Oct-2011 17:31:48 (4sec)



Analysis
=

Time taken by best performing map task task_201110031726_0001_m_00: 4sec
Average time taken by map tasks: 4sec
Worse performing map tasks:
TaskId                         Timetaken
task_201110031726_0001_m_00    4sec
The last map task task_201110031726_0001_m_00 finished at (relative to
the Job launch time): 3-Oct-2011 17:31:30 (12sec)

Time taken by best performing shuffle task task_201110031726_0001_r_00:
7sec
Average time taken by shuffle tasks: 7sec
Worse performing shuffle tasks:
TaskId                         Timetaken
task_201110031726_0001_r_00    7sec
The last shuffle task task_201110031726_0001_r_00 finished at (relative
to the Job launch time): 3-Oct-2011 17:31:39 (21sec)

Time taken by best performing reduce task task_201110031726_0001_r_00:
2sec
Average time taken by reduce tasks: 2sec
Worse performing reduce tasks:
TaskId                         Timetaken
task_201110031726_0001_r_00    2sec
The last reduce task task_201110031726_0001_r_00 finished at (relative
to the Job launch time): 3-Oct-2011 17:31:42
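The header lines of that history dump can be grepped directly. For instance, to pull the total duration out of the "Finished At:" line (the sample line below is hardcoded so the snippet runs standalone):

```shell
#!/bin/sh
# Grab the duration from the "Finished At:" header of the history dump.
# Hardcoded sample line; in practice:
#   finished=$(hadoop job -history <output_dir> | grep '^Finished At:')
finished='Finished At: 3-Oct-2011 17:31:50 (32sec)'

# Keep only what sits between the parentheses.
duration=$(echo "$finished" | sed 's/.*(\(.*\))$/\1/')
echo "$duration"   # 32sec
```

For a still-running job that line won't exist yet, so for the original slow-job check you would key off "Launched At:" instead and diff against the current time.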