hthuynh2 opened a new pull request #25156: initial commit
URL: https://github.com/apache/spark/pull/25156
 
 
   ## What changes were proposed in this pull request?
   Description:
   This PR adds a framework for monitoring the schedulers and displaying 
performance metrics in the UI. More specifically, for each event type handled 
by the DAGScheduler, CoarseGrainedSchedulerBackend, and TaskResultGetter, it 
keeps track of the following information: the number of handled events of this 
type, the number of pending events of this type, the average time to handle an 
event of this type, and the total time spent handling all events of this type.
   
   -----------
   Design and implementation:
   General Idea: 
   For each event type, every time we handle an event of this type, we collect 
the relevant information (e.g. the time taken to handle the event, the number 
of pending events, ...), and a background thread periodically aggregates this 
information and displays it in the UI (a minimal sketch follows below).
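   A minimal sketch of this collection loop, assuming a hypothetical 
`MetricsSource` interface; the names here are illustrative, not the exact API 
introduced in this PR:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Assumed interface: anything that can publish a snapshot of its metrics.
trait MetricsSource {
  def snapshotForUI(): Unit
}

object MetricsPoller {
  // Start a scheduled task that snapshots the metrics every `intervalMs`
  // so the UI can render the latest values.
  def start(source: MetricsSource, intervalMs: Long): Unit = {
    val executor = Executors.newSingleThreadScheduledExecutor()
    executor.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = source.snapshotForUI() },
      intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
```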
   
   UI Explanation
   <img width="1422" alt="Screen Shot 2019-06-18 at 11 08 39 PM" 
src="https://user-images.githubusercontent.com/15680678/61195997-50592900-a691-11e9-8e8b-5a4faa5ecc05.png">
   
   Explanation of columns:
   - Event Type: The type of event that the information in this row belongs to
   - Number of Handled Events: The number of events of this type that have 
been handled
   - Number of Pending Events: The number of events of this type that are 
pending in the queue
   - Processing Speed: The average time taken to handle an event of this type
   - Total time: The total time spent handling all events of this type
   - Timestamp: The time at the end of the interval during which the 
information displayed in this row was collected
   
   Driver Schedulers Summary Metrics table
   - This table contains information accumulated since the application started 
(e.g. the “Number of Handled Events” is the total number of events handled 
since the application started)
   
   Driver Schedulers Busy Intervals Metrics table
   - This table contains information for busy intervals only; its purpose is 
to pinpoint the times at which the scheduler is busy. An interval is 
considered busy if the total time spent handling all events of a type during 
the interval is greater than 70% of the interval length (this 70% threshold is 
configurable). The information displayed is collected during that interval 
only (e.g. the “Number of Handled Events” is the number of events handled 
during that interval). Note that we only keep track of the top 5 busiest 
intervals, sorted by “total time” (this number 5 is also configurable); a 
sketch of this logic is shown below.
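   A small sketch of this busy-interval test under those defaults; the 
`IntervalSnapshot` type and its field names are hypothetical, for illustration 
only:

```scala
// Hypothetical snapshot of one event type over one collection interval.
case class IntervalSnapshot(
    eventType: String,
    totalHandlingTimeMs: Long,
    intervalLengthMs: Long) {
  // "Busy" means handling time exceeds the configured fraction of the
  // interval length (70% by default, per the description above).
  def isBusy(threshold: Double = 0.7): Boolean =
    totalHandlingTimeMs > threshold * intervalLengthMs
}

object BusyIntervals {
  // Keep only the top-n busiest intervals, ranked by total handling time
  // (n defaults to 5 and is configurable).
  def topBusiest(snapshots: Seq[IntervalSnapshot],
                 n: Int = 5): Seq[IntervalSnapshot] =
    snapshots.filter(_.isBusy()).sortBy(-_.totalHandlingTimeMs).take(n)
}
```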
   
   An example of how to read the table and how it can be useful for monitoring 
and debugging the application:
   In the figure above, let’s look at the first row of the “Driver Schedulers 
Busy Intervals Metrics” table. This row contains information about the 
ReviveOffers event as handled by the CoarseGrainedSchedulerBackend. 12 
ReviveOffers events were handled by the CoarseGrainedSchedulerBackend during 
the interval that ends at "2019/06/19 04:03:13". The number of pending 
ReviveOffers events is N/A (getting the number of events waiting to be handled 
by the CoarseGrainedSchedulerBackend is difficult, so I leave it as N/A for 
now). The average time for the CoarseGrainedSchedulerBackend to handle a 
ReviveOffers event is 763ms. The total time spent handling all 12 ReviveOffers 
events during this interval is 9s (12 × 763ms ≈ 9.2s; since the interval 
length in this example is 10s, the scheduler spent roughly 90% of the interval 
on ReviveOffers, above the 70% busy threshold). Finally, this interval ends at 
2019/06/19 04:03:13.
   
   From the information above, we can easily see that it takes too long for 
the CoarseGrainedSchedulerBackend to handle ReviveOffers events (in some 
intervals, it takes up to ~700ms to handle just one ReviveOffers event). This 
is actually an issue mentioned in 
[SPARK-26755](https://issues.apache.org/jira/browse/SPARK-26755), and this 
example is the application mentioned in [PR 
#23677](https://github.com/apache/spark/pull/23677). We can see that having 
these metrics can help identify the bottlenecks in an application. 
   
   -----------
   Implementation:
   This PR introduces 2 new classes:
   - SchedulerEventHandlingMetricTracker: keeps track of the information for a 
single event type
   - SchedulerMetricsManager: manages the SchedulerEventHandlingMetricTrackers 
for all event types (a rough sketch follows below)
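   A rough sketch of how these two classes might fit together, assuming the 
tracker keeps simple thread-safe counters per event type; the method names 
here are illustrative, not the PR's exact API:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.{AtomicLong, LongAdder}

// Tracks the metrics for a single event type (counts and handling times).
class SchedulerEventHandlingMetricTracker {
  private val handled = new LongAdder
  private val totalTimeMs = new LongAdder
  private val pending = new AtomicLong

  def eventQueued(): Unit = pending.incrementAndGet()

  def eventHandled(elapsedMs: Long): Unit = {
    pending.decrementAndGet()
    handled.increment()
    totalTimeMs.add(elapsedMs)
  }

  // Average handling time, i.e. the "Processing Speed" column.
  def avgTimeMs: Double = {
    val n = handled.sum()
    if (n == 0) 0.0 else totalTimeMs.sum().toDouble / n
  }
}

// Manages one tracker per event type across the instrumented schedulers.
class SchedulerMetricsManager {
  private val trackers =
    new ConcurrentHashMap[String, SchedulerEventHandlingMetricTracker]()

  def trackerFor(eventType: String): SchedulerEventHandlingMetricTracker =
    trackers.computeIfAbsent(
      eventType, _ => new SchedulerEventHandlingMetricTracker)
}
```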
   ---------------------
   Configurations (an example of setting them follows the list):
   - spark.scheduler.metric.compute.enabled: Whether this feature is enabled 
(default: false)
   - spark.scheduler.metric.compute.interval: The interval length (default: 
10s)
   - spark.scheduler.metric.compute.numTopBusiestInterval: The number of 
busiest intervals to keep (default: 5)
   - spark.scheduler.metric.compute.busyIntervalThreshold: The threshold (in 
percent) for an interval to be considered busy (default: 70%)
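   For example, the feature could be enabled and tuned via SparkConf like this 
(the values shown are illustrative, and the exact value format for the 
threshold, e.g. "70" vs "0.7", is an assumption not confirmed by this PR):

```scala
import org.apache.spark.SparkConf

// Illustrative settings only; keys are the ones listed above.
val conf = new SparkConf()
  .set("spark.scheduler.metric.compute.enabled", "true")
  .set("spark.scheduler.metric.compute.interval", "10s")
  .set("spark.scheduler.metric.compute.numTopBusiestInterval", "5")
  .set("spark.scheduler.metric.compute.busyIntervalThreshold", "70")
```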
   
   ## How was this patch tested?
   This patch was manually tested with some applications. Please let me know 
if there are any specific tests that need to be added.
   
   Thank you!
