hthuynh2 opened a new pull request #25156: initial commit URL: https://github.com/apache/spark/pull/25156

## What changes were proposed in this pull request?

### Description

This PR adds a framework for monitoring the schedulers and displaying performance metrics in the UI. More specifically, for each event type handled by the DAGScheduler, CoarseGrainedSchedulerBackend, and TaskResultGetter, it keeps track of:

- the number of handled events of this type,
- the number of pending events of this type,
- the average time to handle an event of this type, and
- the total time spent handling all events of this type.

### Design and implementation

**General idea:** For each event type, every time an event of that type is handled, we collect information (e.g. the time taken to handle the event and the number of pending events), and a background thread periodically aggregates this information and displays it in the UI.

**UI explanation**

<img width="1422" alt="Screen Shot 2019-06-18 at 11 08 39 PM" src="https://user-images.githubusercontent.com/15680678/61195997-50592900-a691-11e9-8e8b-5a4faa5ecc05.png">

Explanation of columns:

- **Event Type**: The type of event that the information in this row belongs to
- **Number of Handled Events**: The number of events of this type that have been handled
- **Number of Pending Events**: The number of events of this type that are pending in the queue
- **Processing Speed**: The average time taken to handle one event of this type
- **Total Time**: The total time spent handling all events of this type
- **Timestamp**: The end of the interval during which the information in this row was collected

**Driver Schedulers Summary Metrics table**: This table contains information accumulated since the application started (e.g. "Number of Handled Events" is the total number of events handled since the application started).

**Driver Schedulers Busy Intervals Metrics table**: This table contains information for only the busiest intervals; its purpose is to pinpoint the times when a scheduler is busy. An interval is considered busy if the total time spent handling all events of a type during the interval exceeds 70% of the interval length (this 70% threshold is configurable). The information displayed covers that interval only (e.g. "Number of Handled Events" is the number of events handled during that interval). Note that we only keep the top 5 busiest intervals, ranked by total time (this number 5 is also configurable).

An example of how to read the table, and how it can be useful for monitoring and debugging an application: in the figure above, look at the first row of the "Driver Schedulers Busy Intervals Metrics" table. It contains information about the ReviveOffers events handled by the CoarseGrainedSchedulerBackend. Twelve ReviveOffers events were handled by the CoarseGrainedSchedulerBackend during the interval ending at "2019/06/19 04:03:13". The number of pending ReviveOffers events is N/A (counting events still waiting to be handled by the CoarseGrainedSchedulerBackend is difficult, so it is left as N/A for now). The average time for the CoarseGrainedSchedulerBackend to handle one ReviveOffers event was 763ms, and the total time spent handling all 12 ReviveOffers events during this interval was 9s (here the interval length is 10s). Finally, this interval ends at 2019/06/19 04:03:13. From this information, we can easily see that the CoarseGrainedSchedulerBackend takes too long to handle ReviveOffers events (in some intervals, up to ~700ms for a single event).
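The busy-interval rule above (keep only the top-N intervals whose total handling time exceeds the threshold fraction of the interval length) can be sketched roughly as follows. This is a minimal illustration in Python rather than the PR's Scala; the names `IntervalStats` and `record_interval` are hypothetical, not from the patch.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class IntervalStats:
    # total_time_ms is the only sort key: heapq keeps the smallest at the
    # root, so popping evicts the least busy of the kept intervals.
    total_time_ms: float
    handled: int = field(compare=False)
    avg_time_ms: float = field(compare=False)
    end_timestamp: str = field(compare=False)

def record_interval(top_intervals, stats,
                    interval_ms=10_000, busy_pct=0.70, num_top=5):
    """Keep only the num_top busiest intervals for one event type; an
    interval is 'busy' if its total handling time exceeds busy_pct of
    the interval length (70% of 10s = 7s by default)."""
    if stats.total_time_ms <= busy_pct * interval_ms:
        return                        # not a busy interval; discard it
    heapq.heappush(top_intervals, stats)
    if len(top_intervals) > num_top:
        heapq.heappop(top_intervals)  # evict the least busy kept interval

# The worked example from the table: 12 ReviveOffers events, 763ms average,
# 9s total in a 10s interval -> 9000ms > 7000ms, so the interval is kept.
top = []
record_interval(top, IntervalStats(9000.0, 12, 763.0, "2019/06/19 04:03:13"))
```

With a 10s interval, an interval totaling 6s of handling time (60%) would be discarded, which matches the 70% default threshold described above.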
This is actually an issue mentioned in [SPARK-26755](https://issues.apache.org/jira/browse/SPARK-26755), and this example is the application discussed in [PR #23677](https://github.com/apache/spark/pull/23677). We can see that having these metrics can help identify an application's bottleneck.

### Implementation

This PR introduces 2 new classes:

- **SchedulerEventHandlingMetricTracker**: keeps track of the information for a single event type
- **SchedulerMetricsManager**: manages the SchedulerEventHandlingMetricTracker instances for all event types

### Configurations

- `spark.scheduler.metric.compute.enabled`: whether this feature is enabled (default: false)
- `spark.scheduler.metric.compute.interval`: the interval length (default: 10s)
- `spark.scheduler.metric.compute.numTopBusiestInterval`: the number of busiest intervals to keep (default: 5)
- `spark.scheduler.metric.compute.busyIntervalThreshold`: the threshold (in percent) for an interval to be considered busy (default: 70%)

## How was this patch tested?

This patch was manually tested with some applications. Please let me know if there are any specific tests that need to be done. Thank you!
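As a rough illustration of the two new classes described under Implementation (the real patch implements them in Scala inside the scheduler; this Python analogue and its method names are hypothetical):

```python
import time
from collections import defaultdict

class SchedulerEventHandlingMetricTracker:
    """Accumulates the metrics for one event type."""
    def __init__(self):
        self.handled = 0        # "Number of Handled Events"
        self.pending = 0        # "Number of Pending Events" (N/A for some backends)
        self.total_time_ns = 0  # feeds "Total Time"

    def record(self, duration_ns):
        self.handled += 1
        self.total_time_ns += duration_ns

    @property
    def avg_time_ms(self):      # feeds "Processing Speed"
        return self.total_time_ns / self.handled / 1e6 if self.handled else 0.0

class SchedulerMetricsManager:
    """Maps event-type name -> tracker. In the actual design a background
    thread would periodically snapshot these for the UI tables."""
    def __init__(self):
        self.trackers = defaultdict(SchedulerEventHandlingMetricTracker)

    def time_event(self, event_type, handler, *args, **kwargs):
        """Run a handler for an event and record how long it took."""
        start = time.perf_counter_ns()
        try:
            return handler(*args, **kwargs)
        finally:
            self.trackers[event_type].record(time.perf_counter_ns() - start)
```

Wrapping each handler call this way is what lets the per-type averages and totals in the tables be computed without the handlers themselves knowing about the metrics.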