Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

2015-02-04 Thread Suma Shivaprasad
Thanks for your input. The cluster metrics API gives correct numbers
for the failed/killed apps, matching the RM audit logs, so we
are planning to use that instead.
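
For anyone following along, here is a minimal sketch of reading those counters from the RM REST API. It assumes the RM web services are reachable at the hypothetical address RM_URL and that the response carries appsFailed and appsKilled under clusterMetrics; the sample payload is made up and trimmed to the fields used.

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with your own RM host and port.
RM_URL = "http://rm.example.com:8088"


def fetch_cluster_metrics(rm_url=RM_URL):
    """GET /ws/v1/cluster/metrics from the RM and return the raw JSON text."""
    with urlopen(rm_url + "/ws/v1/cluster/metrics", timeout=10) as resp:
        return resp.read().decode("utf-8")


def failed_killed_counts(metrics_json):
    """Extract the failed/killed application counters from a metrics response."""
    metrics = json.loads(metrics_json)["clusterMetrics"]
    return {"appsFailed": metrics["appsFailed"], "appsKilled": metrics["appsKilled"]}


# Made-up sample response, trimmed to the fields used here.
SAMPLE = (
    '{"clusterMetrics": {"appsSubmitted": 120, "appsCompleted": 115, '
    '"appsRunning": 2, "appsFailed": 1, "appsKilled": 2}}'
)

if __name__ == "__main__":
    print(failed_killed_counts(SAMPLE))
    # Against a live RM: print(failed_killed_counts(fetch_cluster_metrics()))
```

Polling this at a fixed interval and diffing successive values would give a per-interval count comparable to what a Ganglia graph shows.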

Suma

On Wed, Feb 4, 2015 at 12:04 PM, Rohith Sharma K S 
rohithsharm...@huawei.com wrote:

 There are several ways to confirm from YARN the total number of
 Killed/Failed applications in the cluster:
 1. Check the application lists in the RM web UI, or
 2. As admin, list the failed and killed applications with:
 ./yarn application -list -appStates FAILED,KILLED
 3. Use the client APIs.
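
 The same count can also be taken programmatically. The sketch below is one possible approach, not the only one: it uses the RM REST applications list rather than the Java client API, and groups the failed/killed apps by queue so the result can be set against per-queue metrics. RM_URL is a hypothetical address and the sample payload is made up.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Hypothetical ResourceManager address; replace with your own RM host and port.
RM_URL = "http://rm.example.com:8088"


def fetch_apps(states=("FAILED", "KILLED"), rm_url=RM_URL):
    """GET the applications in the given states and return the raw JSON text."""
    url = rm_url + "/ws/v1/cluster/apps?states=" + ",".join(states)
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def count_by_queue_and_state(apps_json):
    """Count applications per (queue, state) pair."""
    # The RM returns {"apps": null} when nothing matches, hence the "or {}".
    apps = json.loads(apps_json).get("apps") or {}
    return Counter((app["queue"], app["state"]) for app in apps.get("app", []))


# Made-up sample response, trimmed to the fields used here.
SAMPLE = json.dumps({"apps": {"app": [
    {"id": "application_1422950000000_0001", "queue": "default", "state": "KILLED"},
    {"id": "application_1422950000000_0002", "queue": "default", "state": "KILLED"},
    {"id": "application_1422950000000_0003", "queue": "etl", "state": "FAILED"},
]}})

if __name__ == "__main__":
    for (queue, state), n in sorted(count_by_queue_and_state(SAMPLE).items()):
        print(queue, state, n)
```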

 Since the metric values displayed in Ganglia are incorrect, I have a few
 doubts:
 1. Is Ganglia pointing to the correct RM cluster?
 2. What method does Ganglia use to retrieve QueueMetrics?
 3. Have you written a client program that retrieves the apps and
 calculates the numbers?


 Thanks & Regards
 Rohith Sharma K S




Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

2015-02-03 Thread Suma Shivaprasad
Using Hadoop 2.4.0. The number of applications running on average is small, ~40-60.
The metrics in Ganglia show around 10-30 apps killed every 5 minutes,
which is very high relative to the apps running at any given time (40-60). The RM
audit logs, though, show 0 failed apps during that hour.
The RM UI also doesn't show any apps in the Applications-Failed tab. The logs
are getting rolled over at a slower rate, every 1-2 hours. I am searching
for "Application Finished - Failed" to find the failed apps. Please let me
know if I am missing something here.
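
The audit-log search described above could be scripted roughly as follows. This is a sketch under assumptions: it expects each audit line to carry tab-separated KEY=VALUE pairs including OPERATION= and APPID=, and it also looks for "Application Finished - Killed", on the assumption that kills are logged under that operation name, since the Ganglia metric in question counts killed apps as well as failed ones. The sample lines are made up.

```python
import re

# Assumed audit-log markers; adjust if your RM log format differs.
FINISHED = re.compile(r"OPERATION=Application Finished - (Failed|Killed)")
APP_ID = re.compile(r"APPID=(application_\d+_\d+)")


def finished_apps(lines):
    """Return {"Failed": set of app IDs, "Killed": set of app IDs}."""
    found = {"Failed": set(), "Killed": set()}
    for line in lines:
        operation = FINISHED.search(line)
        app = APP_ID.search(line)
        if operation and app:
            found[operation.group(1)].add(app.group(1))
    return found


# Made-up sample lines in the assumed format.
SAMPLE_LINES = [
    "2015-02-03 21:01:10,123 INFO RMAuditLogger: USER=suma\t"
    "OPERATION=Application Finished - Killed\tTARGET=RMAppManager\t"
    "RESULT=SUCCESS\tAPPID=application_1422950000000_0007",
    "2015-02-03 21:02:44,456 INFO RMAuditLogger: USER=suma\t"
    "OPERATION=Application Finished - Succeeded\tTARGET=RMAppManager\t"
    "RESULT=SUCCESS\tAPPID=application_1422950000000_0008",
]

if __name__ == "__main__":
    result = finished_apps(SAMPLE_LINES)
    print({outcome: sorted(ids) for outcome, ids in result.items()})
```

Collecting application IDs in sets, rather than counting lines, keeps the totals stable if the same log file is read twice after a rollover.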

Thanks
Suma







RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons

2015-02-03 Thread Rohith Sharma K S
Hi,

Could you give more information? Which version of Hadoop are you using?


 QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. 
 However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs.
I suspect the logs might have been rolled over. Are there more applications
running?

The history of all applications is displayed on the RM web UI (provided the RM
has not been restarted, or RM recovery is enabled). You can check these
application lists.

To find the reason an application was killed/failed, one way is to check the
NodeManager logs as well. There you need to search using the container_id of
the corresponding application.
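
As a rough illustration of that lookup, the sketch below derives the container ID prefix from an application ID and filters NodeManager log lines with it. It assumes container IDs have the form container_<clusterTimestamp>_<appSequence>_<attempt>_<containerNumber>; IDs that carry an extra epoch field would not match. The application ID and log lines are made up.

```python
import re

APP_ID = re.compile(r"^application_(\d+)_(\d+)$")


def container_prefix(app_id):
    """Map application_<ts>_<seq> to the prefix shared by its container IDs."""
    match = APP_ID.match(app_id)
    if not match:
        raise ValueError("not an application ID: " + app_id)
    return "container_{}_{}_".format(match.group(1), match.group(2))


def lines_for_app(app_id, log_lines):
    """Keep only the log lines that mention a container of the given app."""
    prefix = container_prefix(app_id)
    return [line for line in log_lines if prefix in line]


# Made-up NodeManager log lines for illustration.
SAMPLE_LOG = [
    "INFO ContainerImpl: Container container_1422950000000_0042_01_000003 "
    "transitioned from RUNNING to KILLING",
    "INFO ContainerImpl: Container container_1422950000000_0043_01_000001 "
    "transitioned from LOCALIZED to RUNNING",
]

if __name__ == "__main__":
    for line in lines_for_app("application_1422950000000_0042", SAMPLE_LOG):
        print(line)
```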

Thanks & Regards
Rohith Sharma K S



QueueMetrics.AppsKilled/Failed metrics and failure reasons

2015-02-03 Thread Suma Shivaprasad
Hello,

I was trying to debug the reasons for Killed/Failed apps and was checking the
RM logs (RMAuditLogger) for the applications that were killed/failed.
The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100.
However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs. Is it
possible that some logs are missed by AuditLogger, or is it the other way
round and the metrics are being over-reported?

Thanks
Suma