[ https://issues.apache.org/jira/browse/SPARK-35502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mati updated SPARK-35502:
-------------------------
Description:

Recently we enabled the prometheusServlet configuration in order to get Spark master, worker, driver and executor metrics. We can see and use the master, worker and driver metrics, but we can't see the Spark executor metrics. We are running a Spark Streaming standalone cluster, version 3.0.1, on physical servers.

We took one of our jobs and added the following parameters to its configuration, but could not see executor metrics when curling either the driver or the executor workers of this job.

These are the parameters:

--conf spark.ui.prometheus.enabled=true \
--conf spark.executor.processTreeMetrics.enabled=true

Curl commands (executor worker sparktest-40005):

[00764f](root@sparktest-40005-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[00764f](root@sparktest-40005-prod-chidc2:~)#

Driver of this job - sparktest-40004:

[e35005](root@sparktest-40004-prod-chidc2:~)# curl -s http://localhost:4050/metrics/executors/prometheus | head -n5
[e35005](root@sparktest-40004-prod-chidc2:~)#

Our UI port is 4050.

I understand that the executor Prometheus endpoint is still experimental, which may explain the inconsistent behaviour we see, but is there a plan to fix it? Are there any known issues regarding this?
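The empty curl responses above can be probed a bit more systematically. Below is a minimal diagnostic sketch (not part of Spark; the host and port are taken from this report's configuration) that fetches both the PrometheusServlet endpoint and the experimental executor endpoint and counts how many actual metric samples each one returns:

```python
import urllib.request

# Endpoints from this setup: UI port 4050 (spark.ui.port) and the
# servlet path /metrics/prometheus configured in metrics.properties.
ENDPOINTS = [
    "http://localhost:4050/metrics/prometheus",            # PrometheusServlet sink
    "http://localhost:4050/metrics/executors/prometheus",  # executor endpoint (spark.ui.prometheus.enabled)
]

def count_metric_lines(body: str) -> int:
    """Count non-blank, non-comment lines in a Prometheus text exposition."""
    return sum(1 for line in body.splitlines()
               if line.strip() and not line.lstrip().startswith("#"))

if __name__ == "__main__":
    for url in ENDPOINTS:
        try:
            body = urllib.request.urlopen(url, timeout=5).read().decode("utf-8")
            print(f"{url}: {count_metric_lines(body)} metric lines")
        except OSError as exc:
            print(f"{url}: unreachable ({exc})")
```

A zero count on the executor endpoint while the servlet endpoint reports metrics would narrow the problem down to the executor endpoint itself rather than the metrics system as a whole.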
Environment: metrics.properties
{code:java}
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus

## Below may be removed after finalizing the native Prometheus implementation
#
# Enable Prometheus for driver
#
driver.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_chidc2.pushgateway-address-protocol=http
driver.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
driver.sink.prometheus_chidc2.period=60
driver.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
driver.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001-prod-chidc2.chidc2.outbrain.com
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops

driver.sink.prometheus_nydc1.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
driver.sink.prometheus_nydc1.report-instance-id=false
# Prometheus pushgateway address
driver.sink.prometheus_nydc1.pushgateway-address-protocol=http
driver.sink.prometheus_nydc1.pushgateway-address=pushgateway ...
driver.sink.prometheus_nydc1.period=60
driver.sink.prometheus_nydc1.pushgateway-enable-timestamp=false
driver.sink.prometheus_nydc1.labels=cluster_name=apache-test-v3,datacenter=chidc2
driver.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
driver.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
driver.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
driver.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops

#
# Enable Prometheus for executor
#
executor.sink.prometheus_chidc2.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
executor.sink.prometheus_chidc2.report-instance-id=false
# Prometheus pushgateway address
executor.sink.prometheus_chidc2.pushgateway-address-protocol=http
executor.sink.prometheus_chidc2.pushgateway-address=pushgateway-spark-master-staging-pod.service.chidc2.consul:9091
executor.sink.prometheus_chidc2.period=60
executor.sink.prometheus_chidc2.pushgateway-enable-timestamp=false
executor.sink.prometheus_chidc2.labels=cluster_name=apache-test-v3,datacenter=chidc2
executor.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
executor.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
executor.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
executor.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops

executor.sink.prometheus_nydc1.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
executor.sink.prometheus_nydc1.report-instance-id=false
# Prometheus pushgateway address
executor.sink.prometheus_nydc1.pushgateway-address-protocol=http
executor.sink.prometheus_nydc1.pushgateway-address=pushgateway-spark-master
executor.sink.prometheus_nydc1.period=60
executor.sink.prometheus_nydc1.pushgateway-enable-timestamp=false
executor.sink.prometheus_nydc1.labels=cluster_name=apache-test-v3,datacenter=chidc2
executor.sink.prometheus_chidc2.master-worker-labels=instance=sparktest-40001
executor.sink.prometheus_chidc2.metrics-name-capture-regex=application_(\\S+)_([0-9]+)_cores;application_(\\S+)_([0-9]+)_runtime_ms;(.+)_[0-9]+_executor_(.+)
executor.sink.prometheus_chidc2.metrics-name-replacement=__name__=application_$1_cores,start_time=$2;__name__=application_$1_runtime_ms,start_time=$2;__name__=$1_executor_$2
executor.sink.prometheus_chidc2.metrics-exclude-regex=.*CodeGenerator_.+;.*HiveExternalCatalog_.+;.+executor_filesystem_file_largeRead_ops;.+executor_filesystem_file_read_bytes;.+executor_filesystem_file_read_ops;.+executor_filesystem_file_write_bytes;.+executor_filesystem_file_write_ops
{code}
spark-default.conf
{code:java}
# Configured by Chef via recipe: ob-spark-hadoop::install_v3
## Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Log effective Spark configuration at startup on INFO level
spark.logConf true

spark.ui.port 4050

# spark-extras
#spark.driver.extraClassPath /opt/spark-extras
#spark.executor.extraClassPath /opt/spark-extras

spark.metrics.namespace ${spark.app.name}

# Enable event logs for HistoryServer
spark.eventLog.enabled true
spark.eventLog.dir hdfs:///apps/spark
spark.eventLog.compress true
spark.history.fs.logDirectory hdfs:///apps/spark
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d

spark.master.rest.enabled true
spark.master spark://spark-apache-master-v3-test.service.consul:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.shuffle.compress true
spark.shuffle.spill.compress true
spark.shuffle.service.enabled true
spark.executor.memory 1g

# Spark streaming tunings
spark.streaming.blockInterval 200ms
spark.streaming.kafka.maxRetries 2

# Cleanups
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 3600

# Spark executors
spark.executor.logs.rolling.enableCompression true
spark.executor.logs.rolling.maxRetainedFiles 5
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 100000

# Spark HA configurations
spark.deploy.recoveryMode=ZOOKEEPER test
spark.deploy.zookeeper.dir=/spark

# Spark Prometheus settings for executors
spark.ui.prometheus.enabled true
spark.executor.processTreeMetrics.enabled true

spark.local.dir=/outbrain/Prod/spark/c,/outbrain/Prod/spark/d
{code}


> Spark Executor metrics are not produced/showed
> ----------------------------------------------
>
>                 Key: SPARK-35502
>                 URL: https://issues.apache.org/jira/browse/SPARK-35502
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 3.0.1
>            Reporter: Mati
>            Priority: Major
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org