[jira] [Commented] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics

2019-02-01 Thread Oscar Westra van Holthe - Kind (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758100#comment-16758100
 ] 

Oscar Westra van Holthe - Kind commented on FLINK-11457:


Seems sensible. A pity there's no coordination mechanism for initialization 
(yet?), even though the active jobmanager already coordinates the job execution 
by the taskmanagers.

Edit: adjusted the subject & description to account for the discussion in the 
[PR|https://github.com/apache/flink/pull/5857] for FLINK-9187. What's left is 
the part where the current implementation goes wrong.

> PrometheusPushGatewayReporter does not cleanup its metrics
> --
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
>  Issue Type: Bug
>Reporter: Oscar Westra van Holthe - Kind
>Priority: Major
>
> When cancelling a job running on a yarn based cluster and then shutting down 
> the cluster, metrics on the push gateway are not deleted.
> My yarn-conf.yaml settings:
> {code:yaml}
> metrics.reporters: promgateway
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: PSMF
> metrics.reporter.promgateway.randomJobNameSuffix: true
> metrics.reporter.promgateway.deleteOnShutdown: true
> metrics.reporter.promgateway.interval: 30 SECONDS
> {code}
> What I expect to happen:
> * when running, the metrics are pushed to the push gateway under a separate 
> label per node (jobmanager/taskmanager)
> * when shutting down, the metrics are deleted from the push gateway
> This last bit does not happen.
> How the job is run:
> {code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}
> How the job is stopped:
> {code}
> YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
> FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
> flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
> echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
> {code} 
> Is there anything I'm doing wrong? Anything I can do to help fix this?





[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics

2019-02-01 Thread Oscar Westra van Holthe - Kind (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Description: 
When cancelling a job running on a yarn based cluster and then shutting down 
the cluster, metrics on the push gateway are not deleted.

My yarn-conf.yaml settings:
{code:yaml}
metrics.reporters: promgateway
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: PSMF
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: true
metrics.reporter.promgateway.interval: 30 SECONDS
{code}

What I expect to happen:
* when running, the metrics are pushed to the push gateway under a separate label 
per node (jobmanager/taskmanager)
* when shutting down, the metrics are deleted from the push gateway

This last bit does not happen.
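For illustration, a minimal sketch (not the reporter's actual code) of the cleanup 
that {{deleteOnShutdown: true}} is expected to trigger, written against the 
Prometheus Java client ({{simpleclient_pushgateway}}); host and port are taken from 
the config above, and the suffixed job name is made up:
{code:java}
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class PushGatewayCleanupSketch {
    public static void main(String[] args) throws Exception {
        // Address matches the reporter config above (host:port).
        PushGateway gateway = new PushGateway("pushgateway.gcpstg.bolcom.net:9091");

        CollectorRegistry registry = new CollectorRegistry();
        Gauge.build().name("flink_example_up").help("Example metric.").register(registry).set(1);

        // Hypothetical job name: configured jobName plus a random suffix.
        String jobName = "PSMFa1b2c3";

        // While running: metrics are pushed periodically under the (suffixed) job name.
        gateway.pushAdd(registry, jobName);

        // On shutdown: the whole group should be deleted again; if this step is skipped
        // (or never reached), the last pushed values linger on the gateway forever.
        gateway.delete(jobName);
    }
}
{code}
Stale groups can also be removed by hand with the push gateway's HTTP API (a DELETE on 
{{/metrics/job/<jobName>}}), but with {{randomJobNameSuffix: true}} you first have to 
find out which suffix each jobmanager/taskmanager picked.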

How the job is run:
{code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}

How the job is stopped:
{code}
YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
{code} 

Is there anything I'm doing wrong? Anything I can do to help fix this?

  was:
When cancelling a job running on a yarn based cluster and then shutting down 
the cluster, metrics on the push gateway are not deleted.

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the 
best solution would be.


> PrometheusPushGatewayReporter does not cleanup its metrics
> --
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
>  Issue Type: Bug
>Reporter: Oscar Westra van Holthe - Kind
>Priority: Major
>
> When cancelling a job running on a yarn based cluster and then shutting down 
> the cluster, metrics on the push gateway are not deleted.
> My yarn-conf.yaml settings:
> {code:yaml}
> metrics.reporters: promgateway
> metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: PSMF
> metrics.reporter.promgateway.randomJobNameSuffix: true
> metrics.reporter.promgateway.deleteOnShutdown: true
> metrics.reporter.promgateway.interval: 30 SECONDS
> {code}
> What I expect to happen:
> * when running, the metrics are pushed to the push gateway under a separate 
> label per node (jobmanager/taskmanager)
> * when shutting down, the metrics are deleted from the push gateway
> This last bit does not happen.
> How the job is run:
> {code}flink run -m yarn-cluster -yn 5 -ys 2 -yst "$INSTALL_DIRECTORY/app/psmf.jar"{code}
> How the job is stopped:
> {code}
> YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
> FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print $4}')
> flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
> echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
> {code} 
> Is there anything I'm doing wrong? Anything I can do to help fix this?





[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics

2019-02-01 Thread Oscar Westra van Holthe - Kind (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Description: 
When cancelling a job running on a yarn based cluster and then shutting down 
the cluster, metrics on the push gateway are not deleted.

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the 
best solution would be.

  was:
When using the PrometheusPushGatewayReporter, one has two options:
 * Use a fixed job name, which causes the jobmanager and taskmanager to 
overwrite each other's metrics (i.e. last write wins, and you lose a lot of 
metrics)
 * Use a random suffix for the job name, which creates a lot of labels that 
have to be cleaned up manually

The manual cleanup should not be necessary, but happens nonetheless when using 
a yarn cluster.

A fix could be to add a suffix to the job name, naming the nodes in a non-random 
manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, {{my_job_tm2}}, 
{{my_job_tm3}}, ..., using a counter (not sure if one is available), or some 
other stable (!) suffix.

Related discussion: FLINK-9187

 

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the 
best solution would be.


> PrometheusPushGatewayReporter does not cleanup its metrics
> --
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
>  Issue Type: Bug
>Reporter: Oscar Westra van Holthe - Kind
>Priority: Major
>
> When cancelling a job running on a yarn based cluster and then shutting down 
> the cluster, metrics on the push gateway are not deleted.
>  
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what 
> the best solution would be.





[jira] [Updated] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics

2019-02-01 Thread Oscar Westra van Holthe - Kind (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oscar Westra van Holthe - Kind updated FLINK-11457:
---
Summary: PrometheusPushGatewayReporter does not cleanup its metrics  (was: 
PrometheusPushGatewayReporter either overwrites its own metrics or creates too 
many labels)

> PrometheusPushGatewayReporter does not cleanup its metrics
> --
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
>  Issue Type: Bug
>Reporter: Oscar Westra van Holthe - Kind
>Priority: Major
>
> When using the PrometheusPushGatewayReporter, one has two options:
>  * Use a fixed job name, which causes the jobmanager and taskmanager to 
> overwrite each other's metrics (i.e. last write wins, and you lose a lot of 
> metrics)
>  * Use a random suffix for the job name, which creates a lot of labels that 
> have to be cleaned up manually
> The manual cleanup should not be necessary, but happens nonetheless when 
> using a yarn cluster.
> A fix could be to add a suffix to the job name, naming the nodes in a 
> non-random manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, 
> {{my_job_tm2}}, {{my_job_tm3}}, ..., using a counter (not sure if one is 
> available), or some other stable (!) suffix.
> Related discussion: FLINK-9187
>  
> Any thoughts on a solution? I'm happy to implement it, but I'm not sure what 
> the best solution would be.





[jira] [Created] (FLINK-11457) PrometheusPushGatewayReporter either overwrites its own metrics or creates too many labels

2019-01-29 Thread Oscar Westra van Holthe - Kind (JIRA)
Oscar Westra van Holthe - Kind created FLINK-11457:
--

 Summary: PrometheusPushGatewayReporter either overwrites its own 
metrics or creates too many labels
 Key: FLINK-11457
 URL: https://issues.apache.org/jira/browse/FLINK-11457
 Project: Flink
  Issue Type: Bug
Reporter: Oscar Westra van Holthe - Kind


When using the PrometheusPushGatewayReporter, one has two options:
 * Use a fixed job name, which causes the jobmanager and taskmanager to 
overwrite each other's metrics (i.e. last write wins, and you lose a lot of 
metrics)
 * Use a random suffix for the job name, which creates a lot of labels that 
have to be cleaned up manually

The manual cleanup should not be necessary, but happens nonetheless when using 
a yarn cluster.
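
To make the trade-off above concrete, here is a rough sketch of the push gateway's 
grouping semantics using the Prometheus Java client (a hand-written illustration, 
not the Flink reporter's code; the gateway address and metric names are made up): a 
plain push replaces the entire group for a job name, so a shared name means last 
write wins, while per-process suffixes avoid the overwrite but leave one group per 
process that has to be deleted later.
{code:java}
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.PushGateway;

public class GroupingSketch {
    public static void main(String[] args) throws Exception {
        PushGateway gateway = new PushGateway("localhost:9091"); // illustrative address

        CollectorRegistry jobManagerMetrics = new CollectorRegistry();
        Gauge.build().name("jm_example").help("A jobmanager metric.")
                .register(jobManagerMetrics).set(1);

        CollectorRegistry taskManagerMetrics = new CollectorRegistry();
        Gauge.build().name("tm_example").help("A taskmanager metric.")
                .register(taskManagerMetrics).set(42);

        // Option 1, fixed job name: push() issues a PUT, which replaces the whole
        // group "myjob". The second push wipes out the jobmanager metrics.
        gateway.push(jobManagerMetrics, "myjob");
        gateway.push(taskManagerMetrics, "myjob");

        // Option 2, random suffix per process: nothing overwrites anything, but each
        // suffix is its own group that must eventually be deleted, e.g.
        // gateway.delete("myjob_a1b2"), otherwise the metrics stay on the gateway.
        gateway.push(jobManagerMetrics, "myjob_a1b2");
        gateway.push(taskManagerMetrics, "myjob_c3d4");
    }
}
{code}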

A fix could be to add a suffix to the job name, naming the nodes in a non-random 
manner like {{my_job_jm0}}, {{my_job_tm0}}, {{my_job_tm1}}, {{my_job_tm2}}, 
{{my_job_tm3}}, ..., using a counter (not sure if one is available), or some 
other stable (!) suffix.

Related discussion: FLINK-9187

 

Any thoughts on a solution? I'm happy to implement it, but I'm not sure what the 
best solution would be.





[jira] [Commented] (FLINK-6873) Limit the number of open writers in file system connector

2018-09-27 Thread Oscar Westra van Holthe - Kind (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16630562#comment-16630562
 ] 

Oscar Westra van Holthe - Kind commented on FLINK-6873:
---

What may also be an issue is that the BucketingSink and the newer 
StreamingFileSink seem to ignore event time.

Thus, if your output stream uses a buffer of some sort and your job catches up 
quickly (processing multiple days' worth of events in a few hours), the sink may 
end up having too many open files.
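
Not a fix for the event-time question, but as a rough sketch of one partial 
workaround for the open-files side of it, assuming the {{BucketingSink}} from 
{{flink-connector-filesystem}}: closing buckets that have been inactive for a while 
bounds the number of part files kept open while the job catches up (the path, 
bucket format and thresholds below are illustrative).
{code:java}
import org.apache.flink.streaming.connectors.fs.StringWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class BoundedOpenFilesSketch {

    public static BucketingSink<String> createSink() {
        BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/output"); // illustrative path
        sink.setBucketer(new DateTimeBucketer<>("yyyy-MM-dd--HH")); // buckets by processing time, not event time
        sink.setWriter(new StringWriter<>());
        sink.setBatchSize(128L * 1024L * 1024L);      // roll part files at 128 MB
        sink.setInactiveBucketCheckInterval(60_000L); // check for idle buckets every minute...
        sink.setInactiveBucketThreshold(60_000L);     // ...and close buckets idle for a minute
        return sink;
    }
}
{code}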

> Limit the number of open writers in file system connector
> -
>
> Key: FLINK-6873
> URL: https://issues.apache.org/jira/browse/FLINK-6873
> Project: Flink
>  Issue Type: Improvement
>  Components: filesystem-connector, Local Runtime, Streaming Connectors
>Reporter: Mu Kong
>Priority: Major
>
> Mail list discuss:
> https://mail.google.com/mail/u/1/#label/MailList%2Fflink-dev/15c869b2a5b20d43
> Following exception will occur when Flink is writing to too many files:
> {code}
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:714)
> at org.apache.hadoop.hdfs.DFSOutputStream.start(DFSOutputStream.java:2170)
> at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1685)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1689)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1624)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:448)
> at org.apache.hadoop.hdfs.DistributedFileSystem$7.doCall(DistributedFileSystem.java:444)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:459)
> at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:387)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:909)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:890)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:787)
> at org.apache.flink.streaming.connectors.fs.StreamWriterBase.open(StreamWriterBase.java:120)
> at org.apache.flink.streaming.connectors.fs.StringWriter.open(StringWriter.java:62)
> at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.openNewPartFile(BucketingSink.java:545)
> at org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.invoke(BucketingSink.java:440)
> at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:41)
> at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
> at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
> at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
> at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
> at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
> at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:103)
> at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:230)
> at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:379)
> {code}
> Letting developers decide the maximum number of concurrently open files 
> (writers) would be great.
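
As a rough sketch of what such a limit could look like (a hand-written illustration, 
not the connector's actual code): an access-ordered {{LinkedHashMap}} that closes and 
evicts the least recently used writer once a configurable cap is exceeded. The 
{{Closeable}} bound, the cap and the {{getOrOpen}} helper are made up for the example.
{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class BoundedWriterCache<W extends Closeable> {

    private final Map<String, W> openWriters;

    public BoundedWriterCache(int maxOpenWriters) {
        // accessOrder=true makes this an LRU map; removeEldestEntry enforces the cap.
        this.openWriters = new LinkedHashMap<String, W>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, W> eldest) {
                if (size() <= maxOpenWriters) {
                    return false;
                }
                try {
                    eldest.getValue().close(); // flush and close before evicting
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return true;
            }
        };
    }

    /** Returns the writer for a bucket, reopening it if it was evicted earlier. */
    public W getOrOpen(String bucketPath, Function<String, W> opener) {
        W writer = openWriters.get(bucketPath);
        if (writer == null) {
            writer = opener.apply(bucketPath);
            openWriters.put(bucketPath, writer); // may close and evict the LRU writer
        }
        return writer;
    }
}
{code}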


