Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes application status metrics (related JIRA: SPARK-25394).
Currently, metrics tooling has to scrape the metrics REST API to get metrics such as job delay, stages failed, stages completed, etc. From a devops perspective it is good to standardize on a unified way of gathering metrics. The need came up on the K8s side, where the JMX Prometheus exporter is commonly used to scrape metrics for several components such as Kafka and Cassandra, but the need is not limited to K8s.

Check the comment here <https://github.com/apache/spark/pull/22381#issuecomment-420029771>:

"The rest api is great for UI and consolidated analytics, but monitoring through it is not as straightforward as when the data emits directly from the source like this. There is all kinds of nice context that we get when the data from this spark node is collected directly from the node itself, and not proxied through another collector / reporter. It is easier to build a monitoring data model across the cluster when node, jmx, pod, resource manifests, and spark data all align by virtue of coming from the same collector. Building a similar view of the cluster just from the rest api, as a comparison, is simply harder and quite challenging to do in general purpose terms."

The PR is ok to be merged, but the major concern is the mirroring of the metrics. I think the mirroring is ok, since people may not want to check the UI and may just want to integrate with JMX only (my use case) and gather metrics in Grafana (a common case out there).

Do any of the committers or the community have an opinion on this? Is there agreement on moving forward with this? Note that the addition does not change much and can always be refactored if we come up with a new plan for the metrics story in the future.

Thanks,
Stavros