[ https://issues.apache.org/jira/browse/YARN-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Teke updated YARN-10421:
---------------------------------
    Attachment: YARN-10421.003.patch

> Create YarnDiagnosticsService to serve diagnostic queries
> ----------------------------------------------------------
>
>                 Key: YARN-10421
>                 URL: https://issues.apache.org/jira/browse/YARN-10421
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Benjamin Teke
>            Assignee: Benjamin Teke
>            Priority: Major
>         Attachments: YARN-10421.001.patch, YARN-10421.002.patch,
> YARN-10421.003.patch
>
>
> YarnDiagnosticsServlet should run inside the ResourceManager daemon. The
> servlet forks a separate process, which executes a shell/Python/etc. script.
> Based on the use cases listed below, the script collects information,
> bundles it and sends it to UI2. The diagnostic options are the following:
> # Application hanging:
> ** Application logs
> ** Find the hanging container and get multiple Jstacks
> ** ResourceManager logs during job lifecycle
> ** NodeManager logs from NodeManager where the hanging containers of the
> jobs ran
> ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez
> History URL
> # Application failed:
> ** Application logs
> ** ResourceManager logs during job lifecycle
> ** NodeManager logs from NodeManager where the hanging containers of the
> jobs ran
> ** Job configuration from MapReduce HistoryServer, Spark HistoryServer, Tez
> History URL
> ** Job-related metrics like container, attempts
> # Scheduler related issue:
> ** ResourceManager Scheduler logs with DEBUG enabled for 2 minutes
> ** Multiple Jstacks of ResourceManager
> ** YARN and Scheduler Configuration
> ** Cluster Scheduler API _/ws/v1/cluster/scheduler_ and Cluster Nodes API
> _/ws/v1/cluster/nodes_ response
> ** Scheduler Activities _/ws/v1/cluster/scheduler/bulkactivities_ response
> (YARN-10319)
> # ResourceManager / NodeManager daemon fails to start:
> ** ResourceManager and NodeManager out and log files
> ** YARN and Scheduler Configuration
>
> Two new endpoints should be added to the RM web service: one for listing the
> available diagnostic options (_/common-issue/list_), and one for calling a
> selected option with the user-provided parameters (_/common-issue/collect_).
> The service should be transparent to script changes to help with the
> (on-the-fly) extensibility of the diagnostic tool. To split the changes into
> smaller chunks, the implementation behind the _collect_ endpoint is to be
> provided in YARN-10433.
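
For illustration only, the list/collect split described above could be sketched in Python roughly as follows. This is not the code in the attached patches. It assumes a hypothetical convention in which the diagnostic script itself answers a "list" and a "collect" subcommand, so the service never hard-codes the options and new ones can be added by editing the script alone. The option names, the key=value parameter format, and the inline stand-in script are all made up for the example.

```python
import subprocess
import sys


def run_script(script_cmd, *args, timeout=60):
    """Fork the diagnostic script as a separate process and return its stdout."""
    completed = subprocess.run(
        [*script_cmd, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return completed.stdout


def list_options(script_cmd):
    """Ask the script which diagnostic options it supports (hypothetical 'list' subcommand)."""
    output = run_script(script_cmd, "list")
    return [line.strip() for line in output.splitlines() if line.strip()]


def collect(script_cmd, option, params):
    """Run one diagnostic option with user-provided parameters (hypothetical 'collect' subcommand)."""
    if option not in list_options(script_cmd):
        raise ValueError(f"unknown diagnostic option: {option}")
    # Parameters are passed as key=value arguments; this format is an assumption.
    extra = [f"{key}={value}" for key, value in sorted(params.items())]
    return run_script(script_cmd, "collect", option, *extra)


# Stand-in for the real diagnostic script, run through the current interpreter.
# The option names below are hypothetical labels for the four use cases.
DEMO_SCRIPT = r'''
import sys
options = ["application_hanging", "application_failed",
           "scheduler_related_issue", "daemon_start_failure"]
command = sys.argv[1]
if command == "list":
    print("\n".join(options))
elif command == "collect":
    print("collected", sys.argv[2], *sys.argv[3:])
'''
DEMO_CMD = [sys.executable, "-c", DEMO_SCRIPT]

if __name__ == "__main__":
    print(list_options(DEMO_CMD))
    print(collect(DEMO_CMD, "application_failed", {"appId": "application_1_0001"}))
```

Because both functions delegate to the script, adding a fifth option to the stand-in script would make it show up in list_options without touching the service side, which is the extensibility property the description asks for.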