Luca Canali created SPARK-28091:
-----------------------------------

             Summary: Extend Spark metrics system with executor plugin metrics
                 Key: SPARK-28091
                 URL: https://issues.apache.org/jira/browse/SPARK-28091
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Luca Canali


This proposes to improve Spark instrumentation by adding a hook for Spark 
executor plugin metrics to the Spark metrics system, which is implemented with 
the Dropwizard/Codahale library.

Context: The Spark metrics system provides a large variety of metrics (see also 
SPARK-26890) that are useful for monitoring and troubleshooting Spark workloads. 
A typical workflow is to sink the metrics to a storage system and build 
dashboards on top of that.

Improvement: The original goal of this work was to add instrumentation for S3 
filesystem access metrics by Spark job. Currently, [[ExecutorSource]] 
instruments HDFS and local filesystem metrics. Rather than extending the code 
there, we propose to add a metrics plugin system, which is more flexible and of 
more general use.

Advantages:
 * The metrics plugin system makes it easy to implement instrumentation for S3 
access by Spark jobs.
 * The metrics plugin system allows for easy extensions of how Spark collects 
HDFS-related workload metrics. This is currently done using the Hadoop 
FileSystem getAllStatistics method, which is deprecated in recent versions of 
Hadoop. Recent versions of Hadoop FileSystem recommend using the 
getGlobalStorageStatistics method instead, which also provides several 
additional metrics. getGlobalStorageStatistics is not available in Hadoop 2.7 
(it was introduced in Hadoop 2.8). A metrics plugin for Spark would give those 
deploying suitable Hadoop versions an easy way to “opt in” to such new API calls.
 * We also have the use case of adding Hadoop filesystem monitoring for a 
custom Hadoop-compliant filesystem used in our organization (EOS using the 
XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
Others may have similar use cases.
 * More generally, this method makes it straightforward to plug filesystem 
and other metrics into the Spark monitoring system. Future work on plugin 
implementations can extend monitoring to measure the usage of external 
resources (OS, filesystem, network, accelerator cards, etc.) that might not 
normally be considered general enough for inclusion in the Apache Spark code, 
but that can nevertheless be useful for specialized use cases, tests or 
troubleshooting.
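
To make the plugin idea above more concrete, here is a minimal, self-contained 
sketch in plain Java. It is not the proposed Spark implementation and it does 
not use the Dropwizard classes: SimpleRegistry, ExecutorMetricsPlugin, 
S3ReadBytesPlugin and the metric name "s3.bytesRead" are all hypothetical names 
chosen for illustration. It only shows the general shape of a hook through 
which user code registers its own gauges with a registry that a sink later 
polls.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class MetricsPluginSketch {

    /** Hypothetical stand-in for a metrics registry (not the Dropwizard class). */
    static final class SimpleRegistry {
        private final Map<String, LongSupplier> gauges = new ConcurrentHashMap<>();

        /** Registers a named gauge; rejects duplicate names. */
        void registerGauge(String name, LongSupplier gauge) {
            if (gauges.putIfAbsent(name, gauge) != null) {
                throw new IllegalArgumentException("Gauge already registered: " + name);
            }
        }

        /** Reads every gauge once, as a sink might do on each reporting period. */
        Map<String, Long> snapshot() {
            Map<String, Long> values = new TreeMap<>();
            gauges.forEach((name, gauge) -> values.put(name, gauge.getAsLong()));
            return values;
        }
    }

    /** Hypothetical plugin hook: the host process calls this once at startup. */
    interface ExecutorMetricsPlugin {
        void registerMetrics(SimpleRegistry registry);
    }

    /** Example plugin exposing a byte counter for a hypothetical S3 read path. */
    static final class S3ReadBytesPlugin implements ExecutorMetricsPlugin {
        private final AtomicLong bytesRead = new AtomicLong();

        @Override
        public void registerMetrics(SimpleRegistry registry) {
            // The gauge reads the live counter each time a snapshot is taken.
            registry.registerGauge("s3.bytesRead", bytesRead::get);
        }

        /** Would be called by the instrumented read path. */
        void recordRead(long bytes) {
            bytesRead.addAndGet(bytes);
        }
    }

    public static void main(String[] args) {
        SimpleRegistry registry = new SimpleRegistry();
        S3ReadBytesPlugin plugin = new S3ReadBytesPlugin();
        plugin.registerMetrics(registry);

        plugin.recordRead(100);
        plugin.recordRead(50);

        // Print the current value of every registered gauge.
        System.out.println(registry.snapshot());
    }
}
```

In this sketch the host code knows only the plugin interface, so a metric for a 
different filesystem or external resource could be added by supplying another 
plugin class, without changing the registry or the code that calls it.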

Implementation:

The proposed implementation is currently a WIP open for comments and 
improvements. It is based on the work on Executor Plugin of SPARK-24918 and 
builds on recent work on extending Spark executor metrics, such as SPARK-25228.

Tests and examples:

So far this has been tested manually by running Spark on YARN and K8S clusters, 
in particular for monitoring S3 and for extending HDFS instrumentation with the 
Hadoop FileSystem “getGlobalStorageStatistics” metrics. An example executor 
metrics plugin and the code used for testing are available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
