Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Thanks for your inputs. The cluster Metrics API gives correct numbers for the failed/killed apps, and they match the RM audit logs, so we are planning to use that instead.

Suma

On Wed, Feb 4, 2015 at 12:04 PM, Rohith Sharma K S rohithsharm...@huawei.com wrote:

There are several ways to confirm from YARN the total number of Killed/Failed applications in the cluster:

1. Get it from the application lists in the RM web UI, or
2. As admin, use this command to get the failed and killed applications: ./yarn application -list -appStates FAILED,KILLED
3. Use the client APIs.

Since the metrics values displayed in Ganglia are incorrect, I have some doubts:

1. Is Ganglia pointing to the correct RM cluster?
2. What method does Ganglia use to retrieve QueueMetrics?
3. Have you written any client program that retrieves the apps and calculates the numbers?

Thanks & Regards
Rohith Sharma K S

-----Original Message-----
From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
Sent: 04 February 2015 11:03
To: u...@hadoop.apache.org
Cc: yarn-dev@hadoop.apache.org
Subject: Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Using Hadoop 2.4.0. The number of applications running on average is small, ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes, which is very high with respect to the apps running at any given time (40-60). The RM audit logs, though, show 0 failed apps during that hour. The RM UI also doesn't show any apps in the Applications-Failed tab. The logs are getting rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here.

Thanks
Suma
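For anyone wanting to script the comparison described at the top of this message, here is a minimal Python sketch. It assumes the "cluster Metrics API" refers to the RM REST endpoint /ws/v1/cluster/metrics and that its appsFailed and appsKilled fields are counters that only grow while the RM is up; the host, port, and function names are hypothetical. Because the Ganglia figures were being read per 5-minute window, the sketch diffs two snapshots instead of reporting raw totals.

```python
import json
from urllib.request import urlopen


def fetch_cluster_metrics(rm_address):
    """Return the clusterMetrics object from the RM REST API.

    rm_address is a hypothetical "host:port" string, e.g. "rm-host:8088".
    """
    url = "http://%s/ws/v1/cluster/metrics" % rm_address
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))["clusterMetrics"]


def failed_killed_delta(earlier, later):
    """Apps that failed or were killed between two metric snapshots.

    Assumes both counters are cumulative, so the difference is the
    number of apps that reached that state in the interval.
    """
    return {
        "failed": later["appsFailed"] - earlier["appsFailed"],
        "killed": later["appsKilled"] - earlier["appsKilled"],
    }
```

Calling fetch_cluster_metrics twice, five minutes apart, and passing both results to failed_killed_delta gives a number that can be set against the audit log count for the same window. A negative delta would suggest the counters were reset in between, for example by an RM restart.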
Re: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Using Hadoop 2.4.0. The number of applications running on average is small, ~40-60. The metrics in Ganglia show around 10-30 apps killed every 5 minutes, which is very high with respect to the apps running at any given time (40-60). The RM audit logs, though, show 0 failed apps during that hour. The RM UI also doesn't show any apps in the Applications-Failed tab. The logs are getting rolled over at a slower rate, every 1-2 hours. I am searching for "Application Finished - Failed" to find the failed apps. Please let me know if I am missing something here.

Thanks
Suma

On Wed, Feb 4, 2015 at 10:03 AM, Rohith Sharma K S rohithsharm...@huawei.com wrote:

Hi,

Could you give more information: which version of Hadoop are you using?

"QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs."

I suspect that the logs might have been rolled over. Are more applications running? The history of all applications is displayed in the RM web UI (provided the RM has not been restarted, or RM recovery is enabled). You can check those application lists.

To find the reason an application was killed or failed, one way is to check the NodeManager logs as well. There you need to search using the container_id of the corresponding application.

Thanks & Regards
Rohith Sharma K S
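The audit log search described in this message can be sketched in Python as below. This is only an illustration under assumptions: "Application Finished - Failed" is the string quoted in the thread, the "Killed" variant is assumed to follow the same pattern, and application IDs are assumed to look like application_<clusterTimestamp>_<sequence>. The function name is hypothetical, and no particular audit log line layout is relied on beyond those two substrings.

```python
import re
from collections import defaultdict

# Operation strings to look for. The "Failed" one is quoted in the thread;
# the "Killed" one is an assumption that it follows the same naming.
OPERATIONS = {
    "Application Finished - Failed": "FAILED",
    "Application Finished - Killed": "KILLED",
}

# Assumed application ID shape: application_<clusterTimestamp>_<sequence>
APP_ID = re.compile(r"application_\d+_\d+")


def finished_apps(lines):
    """Group distinct application IDs by final state (FAILED / KILLED)."""
    found = defaultdict(set)
    for line in lines:
        for operation, state in OPERATIONS.items():
            if operation in line:
                match = APP_ID.search(line)
                if match:
                    found[state].add(match.group(0))
    # Sets remove repeats of the same app; sort for stable output.
    return {state: sorted(ids) for state, ids in found.items()}
```

Feeding it every rolled-over log file for the window of interest, for example with `finished_apps(open(path))` per file, avoids undercounting when the logs roll every 1-2 hours.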
RE: QueueMetrics.AppsKilled/Failed metrics and failure reasons
Hi,

Could you give more information: which version of Hadoop are you using?

"QueueMetrics.AppsKilled/Failed metrics shows much higher nos i.e ~100. However RMAuditLogger shows 1 or 2 Apps as Killed/Failed in the logs."

I suspect that the logs might have been rolled over. Are more applications running? The history of all applications is displayed in the RM web UI (provided the RM has not been restarted, or RM recovery is enabled). You can check those application lists.

To find the reason an application was killed or failed, one way is to check the NodeManager logs as well. There you need to search using the container_id of the corresponding application.

Thanks & Regards
Rohith Sharma K S

From: Suma Shivaprasad [mailto:sumasai.shivapra...@gmail.com]
Sent: 03 February 2015 21:35
To: u...@hadoop.apache.org; yarn-dev@hadoop.apache.org
Subject: QueueMetrics.AppsKilled/Failed metrics and failure reasons

Hello,

I was trying to debug the reasons for Killed/Failed apps and was checking the RM logs (from RMAuditLogger) for the applications that were killed/failed. The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100. However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs. Is it possible that some logs are missed by the AuditLogger, or is it the other way round and the metrics are being over-reported?

Thanks
Suma
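The NodeManager log check suggested above can be narrowed with a small Python helper. This is a rough sketch that assumes the Hadoop 2.4-era ID layout, where an application application_<clusterTimestamp>_<sequence> runs containers named container_<clusterTimestamp>_<sequence>_<attempt>_<containerSeq>; other releases may format container IDs differently. Both function names are hypothetical.

```python
def container_prefix(app_id):
    """Build the prefix shared by all container IDs of one application.

    Assumes app_id looks like application_<clusterTimestamp>_<sequence>.
    """
    parts = app_id.split("_")
    if len(parts) != 3 or parts[0] != "application":
        raise ValueError("unexpected application ID: %r" % app_id)
    _, cluster_ts, sequence = parts
    return "container_%s_%s_" % (cluster_ts, sequence)


def lines_for_app(lines, app_id):
    """Keep only the log lines that mention a container of app_id."""
    prefix = container_prefix(app_id)
    return [line for line in lines if prefix in line]
```

Running lines_for_app over a NodeManager log with the ID of a failed or killed application leaves only the container-level lines, which is usually where the exit reason would be looked for.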
QueueMetrics.AppsKilled/Failed metrics and failure reasons
Hello,

I was trying to debug the reasons for Killed/Failed apps and was checking the RM logs (from RMAuditLogger) for the applications that were killed/failed. The QueueMetrics.AppsKilled/Failed metrics show much higher numbers, i.e. ~100. However, RMAuditLogger shows only 1 or 2 apps as Killed/Failed in the logs. Is it possible that some logs are missed by the AuditLogger, or is it the other way round and the metrics are being over-reported?

Thanks
Suma