[jira] [Commented] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics
[ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758100#comment-16758100 ]

Oscar Westra van Holthe - Kind commented on FLINK-11457:
---

Seems sensible. A pity there is no sufficient coordination for initialization (yet?), even though the active jobmanager coordinates the job execution by the taskmanagers.

Edit: adjusted the subject & description to account for the discussion in the [PR|https://github.com/apache/flink/pull/5857] for FLINK-9187. What's left is the part where the current implementation goes wrong.

> PrometheusPushGatewayReporter does not cleanup its metrics
> -----------------------------------------------------------
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
> Issue Type: Bug
> Reporter: Oscar Westra van Holthe - Kind
> Priority: Major
>
> When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.
> My yarn-conf.yaml settings:
> {code:yaml}
> metrics.reporters: promgateway
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: PSMF
> metrics.reporter.promgateway.randomJobNameSuffix: true
> metrics.reporter.promgateway.deleteOnShutdown: true
> metrics.reporter.promgateway.interval: 30 SECONDS
> {code}
> What I expect to happen:
> * when running, the metrics are pushed to the push gateway under a separate label per node (jobmanager/taskmanager)
> * when shutting down, the metrics are deleted from the push gateway
> This last bit does not happen.
> How the job is run:
> {code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}
> How the job is stopped:
> {code}
> YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
> FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
> flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
> echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
> {code}
> Is there anything I'm doing wrong? Anything I can help to fix?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
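For reference, a minimal sketch (not Flink's actual source) of how the push/delete lifecycle is typically wired with the Prometheus Java client. The host and job name are taken from the configuration above; the random suffix stands in for randomJobNameSuffix: true. The point of failure is the last line: if the container is killed before a clean shutdown, the delete call never reaches the gateway and the metrics linger.

{code:java}
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.exporter.PushGateway;

public class PushGatewayLifecycleSketch {
    public static void main(String[] args) throws Exception {
        // Host/port and base job name as in the yarn-conf.yaml above.
        PushGateway gateway = new PushGateway("pushgateway.gcpstg.bolcom.net:9091");
        String jobName = "PSMF" + new java.util.Random().nextInt(Integer.MAX_VALUE);

        CollectorRegistry registry = new CollectorRegistry();
        gateway.push(registry, jobName);   // repeated every reporting interval while running

        // deleteOnShutdown: true should boil down to this call on close().
        // It only happens on a clean shutdown; a killed YARN container skips it.
        gateway.delete(jobName);
    }
}
{code}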
[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics
[ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Description:
When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.

My yarn-conf.yaml settings:
{code:yaml}
metrics.reporters: promgateway
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: PSMF
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
metrics.reporter.promgateway.interval: 30 SECONDS
{code}
What I expect to happen:
* when running, the metrics are pushed to the push gateway under a separate label per node (jobmanager/taskmanager)
* when shutting down, the metrics are deleted from the push gateway
This last bit does not happen.

How the job is run:
{code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}
How the job is stopped:
{code}
YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
{code}
Is there anything I'm doing wrong? Anything I can help to fix?

was:
When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

> PrometheusPushGatewayReporter does not cleanup its metrics
> -----------------------------------------------------------
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
> Issue Type: Bug
> Reporter: Oscar Westra van Holthe - Kind
> Priority: Major
>
> When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.
> My yarn-conf.yaml settings:
> {code:yaml}
> metrics.reporters: promgateway
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: PSMF
> metrics.reporter.promgateway.randomJobNameSuffix: true
> metrics.reporter.promgateway.deleteOnShutdown: true
> metrics.reporter.promgateway.interval: 30 SECONDS
> {code}
> What I expect to happen:
> * when running, the metrics are pushed to the push gateway under a separate label per node (jobmanager/taskmanager)
> * when shutting down, the metrics are deleted from the push gateway
> This last bit does not happen.
> How the job is run:
> {code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}
> How the job is stopped:
> {code}
> YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
> FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
> flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
> echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
> {code}
> Is there anything I'm doing wrong? Anything I can help to fix?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
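Until the shutdown cleanup works, leftover grouping keys can be removed by hand via the push gateway's HTTP API (DELETE /metrics/job/<job>; the gateway answers 202 on success). A hedged sketch; the suffixed job name is a placeholder that first has to be read off the gateway's /metrics page:

{code:java}
import java.net.HttpURLConnection;
import java.net.URL;

public class DeleteLeftoverMetrics {
    public static void main(String[] args) throws Exception {
        // "PSMF1234567" is a hypothetical leftover job name (base name plus random suffix).
        URL url = new URL("http://pushgateway.gcpstg.bolcom.net:9091/metrics/job/PSMF1234567");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("DELETE");
        System.out.println("HTTP " + conn.getResponseCode()); // expect 202 Accepted
    }
}
{code}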
[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics
[ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Description:
When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

was:
When using the PrometheusPushGatewayReporter, one has two options:
* Use a fixed job name, which causes the jobmanager and taskmanager to overwrite each other's metrics (i.e. last write wins, and you lose a lot of metrics)
* Use a random suffix for the job name, which creates a lot of labels that have to be cleaned up manually

The manual cleanup should not be necessary, but is needed nonetheless when using a YARN cluster.

A fix could be to add a suffix to the job name, naming the nodes in a non-random manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, {{my_job_tm2}}, {{my_job_tm3}}, {{my_job_tm4}}, ..., using a counter (not sure if one is available), or some other stable (!) suffix.

Related discussion: FLINK-9187

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

> PrometheusPushGatewayReporter does not cleanup its metrics
> -----------------------------------------------------------
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
> Issue Type: Bug
> Reporter: Oscar Westra van Holthe - Kind
> Priority: Major
>
> When cancelling a job running on a YARN-based cluster and then shutting down the cluster, metrics on the push gateway are not deleted.
>
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics
[ https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Summary: PrometheusPushGatewayReporter does not cleanup its metrics (was: PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels)

> PrometheusPushGatewayReporter does not cleanup its metrics
> -----------------------------------------------------------
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
> Issue Type: Bug
> Reporter: Oscar Westra van Holthe - Kind
> Priority: Major
>
> When using the PrometheusPushGatewayReporter, one has two options:
> * Use a fixed job name, which causes the jobmanager and taskmanager to overwrite each other's metrics (i.e. last write wins, and you lose a lot of metrics)
> * Use a random suffix for the job name, which creates a lot of labels that have to be cleaned up manually
> The manual cleanup should not be necessary, but is needed nonetheless when using a YARN cluster.
> A fix could be to add a suffix to the job name, naming the nodes in a non-random manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, {{my_job_tm2}}, {{my_job_tm3}}, {{my_job_tm4}}, ..., using a counter (not sure if one is available), or some other stable (!) suffix.
> Related discussion: FLINK-9187
>
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (FLINK-11457) PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels
Oscar Westra van Holthe - Kind created FLINK-11457:
---
Summary: PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels
Key: FLINK-11457
URL: https://issues.apache.org/jira/browse/FLINK-11457
Project: Flink
Issue Type: Bug
Reporter: Oscar Westra van Holthe - Kind

When using the PrometheusPushGatewayReporter, one has two options:
* Use a fixed job name, which causes the jobmanager and taskmanager to overwrite each other's metrics (i.e. last write wins, and you lose a lot of metrics)
* Use a random suffix for the job name, which creates a lot of labels that have to be cleaned up manually

The manual cleanup should not be necessary, but is needed nonetheless when using a YARN cluster.

A fix could be to add a suffix to the job name, naming the nodes in a non-random manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, {{my_job_tm2}}, {{my_job_tm3}}, {{my_job_tm4}}, ..., using a counter (not sure if one is available), or some other stable (!) suffix.

Related discussion: FLINK-9187

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the best solution would be.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
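A hedged sketch of the stable-suffix idea above: derive the per-node job name from the node's role and an index instead of a random number, so the grouping keys on the gateway are predictable and can always be cleaned up. The role and index parameters are hypothetical; Flink does not currently hand metric reporters such a counter.

{code:java}
public class StableJobNameSketch {
    enum Role { JM, TM }

    // e.g. jobName("my_job", Role.TM, 2) -> "my_job_tm2"
    static String jobName(String baseName, Role role, int index) {
        return baseName + "_" + role.name().toLowerCase() + index;
    }

    public static void main(String[] args) {
        System.out.println(jobName("my_job", Role.JM, 0));
        for (int i = 0; i < 5; i++) {          // five taskmanagers, as with -yn 5
            System.out.println(jobName("my_job", Role.TM, i));
        }
    }
}
{code}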
[jira] [Commented] (FLINK-6873) Limit the number of open writers in file system connector
[ https://issues.apache.org/jira/browse/FLINK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630562#comment-16630562 ]

Oscar Westra van Holthe - Kind commented on FLINK-6873:
---

What may also be an issue is that the BucketingSink and the newer StreamingFileSink seem to ignore event time. Thus, if your output stream uses a buffer of some sort and your job catches up quickly (processing multiple days' worth of events in a few hours), the sink may end up having too many open files.

> Limit the number of open writers in file system connector
> ----------------------------------------------------------
>
> Key: FLINK-6873
> URL: https://issues.apache.org/jira/browse/FLINK-6873
> Project: Flink
> Issue Type: Improvement
> Components: filesystem-connector, Local Runtime, Streaming Connectors
> Reporter: Mu Kong
> Priority: Major
>
> Mailing list discussion: https://mail.google.com/mail/u/1/#label/MailList%2Fflink-dev/15c869b2a5b20d43
> The following exception occurs when Flink is writing to too many files:
> {code}
> java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:714)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.start(DFSOutputStream.java:2170)
> 	at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1685)
> 	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
> 	at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
> 	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
> 	at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
> 	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
> 	at org.apache.flink.streaming.connectors.fs.StreamWriterBase.open(StreamWriterBase.java:120)
> 	at org.apache.flink.streaming.connectors.fs.StringWriter.open(StringWriter.java:62)
> 	at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.openNewPartFile(BucketingSink.java:545)
> 	at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.invoke(BucketingSink.java:440)
> 	at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:41)
> 	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
> 	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
> 	at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
> 	at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:103)
> 	at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:230)
> 	at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:379)
> {code}
> Letting developers set a maximum number of open files would be great.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
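Since the ticket asks to bound the number of open writers, here is a hedged illustration (not the BucketingSink implementation) of one way to do it: keep writers in an access-ordered map and close the least recently used one whenever the cap is exceeded.

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Caps the number of simultaneously open writers; eviction closes the
// least recently used writer so its file handle and thread are released.
public class BoundedWriterCache<W extends Closeable> {
    private final Map<String, W> open;

    public BoundedWriterCache(final int maxOpenWriters) {
        // An access-order LinkedHashMap gives LRU eviction for free.
        this.open = new LinkedHashMap<String, W>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, W> eldest) {
                if (size() > maxOpenWriters) {
                    try {
                        eldest.getValue().close(); // flush and close the coldest bucket
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                    return true;
                }
                return false;
            }
        };
    }

    public W get(String bucketPath) { return open.get(bucketPath); }

    public void put(String bucketPath, W writer) { open.put(bucketPath, writer); }
}
{code}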