Custom GenericRecord Serializer with Tuple?

2022-09-25 Thread Hailu, Andreas
Hello!

I have a custom Avro GenericRecord serializer that supports reading and writing 
records without having to pass along the schema with every record by using a 
centralized registry. I've registered it with the execution environment as:

environment.addDefaultKryoSerializer(GenericRecord.class, 
CustomGenericRecordSerializer.class);

We read from a source that provides a Tuple2, map it to a GenericRecord for processing, 
and then sink it. If I disable Kryo generic types, since I don't want GenericRecord to 
fall back to the default Kryo serializer, I encounter the following exception:

Caused by: java.lang.UnsupportedOperationException: Generic types have been 
disabled in the ExecutionConfig and type org.apache.avro.generic.GenericRecord 
is treated as a generic type.
at 
org.apache.flink.api.java.typeutils.GenericTypeInfo.createSerializer(GenericTypeInfo.java:87)
at 
org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:104)
at 
org.apache.flink.api.java.typeutils.TupleTypeInfo.createSerializer(TupleTypeInfo.java:49)
at 
org.apache.flink.optimizer.postpass.JavaApiPostPass.createSerializer(JavaApiPostPass.java:310)
at 
org.apache.flink.optimizer.postpass.JavaApiPostPass.traverseChannel(JavaApiPostPass.java:270)
at 
org.apache.flink.optimizer.postpass.JavaApiPostPass.traverse(JavaApiPostPass.java:96)
at 
org.apache.flink.optimizer.postpass.JavaApiPostPass.postPass(JavaApiPostPass.java:81)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:543)
at org.apache.flink.optimizer.Optimizer.compile(Optimizer.java:404)

It seems the GenericRecord in the Tuple2 gets interpreted as a GenericTypeInfo. 
I expected it to use the CustomGenericRecordSerializer that is registered. 
Have I misunderstood how the serializer registration works? Is there an extra 
step when using these types as part of a Tuple?
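
For reference, here is a minimal sketch of the shape of the pipeline described above 
(the Tuple2 type parameters, the input/output formats, and the map body are stand-ins 
for illustration only, not our actual code):

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    env.addDefaultKryoSerializer(GenericRecord.class, CustomGenericRecordSerializer.class);
    env.getConfig().disableGenericTypes();

    // Source yields tuples whose second field is an Avro GenericRecord.
    DataSet<Tuple2<String, GenericRecord>> input = env.createInput(new TupleRecordInputFormat());

    // Map to the GenericRecord and sink it; the exception above is thrown while
    // the optimizer builds the serializer for the Tuple2 type.
    input.map(t -> t.f1).output(new RecordOutputFormat());

    env.execute("generic-record-pipeline");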

best,
ah






RE: ExecutionMode in ExecutionConfig

2022-09-14 Thread Hailu, Andreas
I can give this a try. Do you know in which Flink version this feature became 
available?

ah

From: zhanghao.c...@outlook.com 
Sent: Wednesday, September 14, 2022 11:10 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: ExecutionMode in ExecutionConfig

Could you try setting 'execution.batch-shuffle-mode' = 'ALL_EXCHANGES_PIPELINED'? 
It looks like the ExecutionMode in ExecutionConfig does not work for the DataStream API.

The default shuffle behavior for the DataStream API in batch mode is 
'ALL_EXCHANGES_BLOCKING', where upstream and downstream tasks run one after the other. 
The pipelined mode, on the other hand, has upstream and downstream tasks run 
simultaneously.
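
For what it's worth, a sketch of how that option can be set (the key and value are the 
ones above; whether they are available depends on the Flink version in use):

    Configuration conf = new Configuration();
    conf.setString("execution.batch-shuffle-mode", "ALL_EXCHANGES_PIPELINED");
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

    // or, equivalently, in flink-conf.yaml:
    // execution.batch-shuffle-mode: ALL_EXCHANGES_PIPELINED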



Best,
Zhanghao Chen

From: Hailu, Andreas <andreas.ha...@gs.com>
Sent: Wednesday, September 14, 2022 21:37
To: zhanghao.c...@outlook.com; user@flink.apache.org
Subject: RE: ExecutionMode in ExecutionConfig


Hi Zhanghao,



That seems different from what I'm referencing, and it's one of my points of 
confusion - the documentation refers to an ExecutionMode of BATCH and STREAMING, 
which the code instead calls the runtime mode, e.g. 
env.setRuntimeMode(RuntimeExecutionMode.BATCH);



I'm referring to the ExecutionMode in the ExecutionConfig e.g. 
env.getConfig().setExecutionMode(ExecutionMode.BATCH)/ 
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED). I'm not able to find 
documentation on this anywhere.







ah



From: zhanghao.c...@outlook.com
Sent: Wednesday, September 14, 2022 1:10 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: ExecutionMode in ExecutionConfig



https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/datastream/execution_mode/ 
gives a comprehensive description of it.







Best,

Zhanghao Chen



From: Hailu, Andreas <andreas.ha...@gs.com>
Sent: Wednesday, September 14, 2022 7:13
To: user@flink.apache.org
Subject: ExecutionMode in ExecutionConfig



Hello,



Is there somewhere I can learn more about the details of the effect of 
ExecutionMode in ExecutionConfig on a job? I am trying to sort out some of the 
details, as it seems to work differently between the DataStream API and the 
deprecated DataSet API.



I've attached a picture of this job graph - I'm reading from a total of 3 data 
sources - the results of 2 are sent to CoGroup (orange rectangle), and the 
other has its records forwarded to a sink after some basic filter + map 
operations (red rectangle).



The DataSet API's job graph has all of the operators RUNNING immediately as we 
desire. However, the DataStream API's job graph only has the DataSource 
operators that are feeding into the CoGroup online, and the remaining operators 
wake up only when the 2 sources have completed. This winds up introducing a lot 
of latency in processing the batch.



Both of these are running in the same environment on the same data with 
identical ExecutionMode configs, just different APIs. I'm attempting to have 
the same behavior between them. I ask about ExecutionMode as I am able to 
replicate this behavior in DataSet by setting the ExecutionMode from the 
default of PIPELINED to BATCH.



Thanks!



best,

ah








RE: ExecutionMode in ExecutionConfig

2022-09-14 Thread Hailu, Andreas
Hi Zhanghao,

That seems different from what I'm referencing, and it's one of my points of 
confusion - the documentation refers to an ExecutionMode of BATCH and STREAMING, 
which the code instead calls the runtime mode, e.g. 
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

I'm referring to the ExecutionMode in the ExecutionConfig e.g. 
env.getConfig().setExecutionMode(ExecutionMode.BATCH)/ 
env.getConfig().setExecutionMode(ExecutionMode.PIPELINED). I'm not able to find 
documentation on this anywhere.



ah

From: zhanghao.c...@outlook.com 
Sent: Wednesday, September 14, 2022 1:10 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: ExecutionMode in ExecutionConfig

https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/datastream/execution_mode/ 
gives a comprehensive description of it.



Best,
Zhanghao Chen
________
From: Hailu, Andreas <andreas.ha...@gs.com>
Sent: Wednesday, September 14, 2022 7:13
To: user@flink.apache.org
Subject: ExecutionMode in ExecutionConfig


Hello,



Is there somewhere I can learn more about the details of the effect of 
ExecutionMode in ExecutionConfig on a job? I am trying to sort out some of the 
details, as it seems to work differently between the DataStream API and the 
deprecated DataSet API.



I've attached a picture of this job graph - I'm reading from a total of 3 data 
sources - the results of 2 are sent to CoGroup (orange rectangle), and the 
other has its records forwarded to a sink after some basic filter + map 
operations (red rectangle).



The DataSet API's job graph has all of the operators RUNNING immediately as we 
desire. However, the DataStream API's job graph only has the DataSource 
operators that are feeding into the CoGroup online, and the remaining operators 
wake up only when the 2 sources have completed. This winds up introducing a lot 
of latency in processing the batch.



Both of these are running in the same environment on the same data with 
identical ExecutionMode configs, just different APIs. I'm attempting to have 
the same behavior between them. I ask about ExecutionMode as I am able to 
replicate this behavior in DataSet by setting the ExecutionMode from the 
default of PIPELINED to BATCH.



Thanks!



best,

ah







RE: Apache Flink - Rest API for num of records in/out

2022-06-07 Thread Hailu, Andreas
Hi M,

We had a similar requirement – we were able to solve for this by:

1. Supply names for the operators we're interested in acquiring metrics for through 
the various name() methods.

2. Use the jobs/:jobid API [1] and find the operators we've named in the 
"vertices" array.

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#jobs-jobid
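
To make this concrete, a sketch (the operator name, host, and job id below are made up):

    // 1) When building the job, name the operators you care about:
    stream.map(new EnrichmentMapper()).name("enrichment-map");

    // 2) After submission, fetch the job details and look the name up in the
    //    "vertices" array; each vertex entry exposes read-records / write-records
    //    and read-bytes / write-bytes:
    //    curl http://<jobmanager-host>:8081/jobs/<jobid>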

ah

From: M Singh 
Sent: Tuesday, June 7, 2022 4:51 PM
To: User-Flink 
Subject: Apache Flink - Rest API for num of records in/out

Hi Folks:

I am trying to find out if I can get the number of records for an operator using 
Flink's REST API. I've checked the docs at 
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/.

I did see some APIs that use a vertex id, but could not find how to get that info 
without already having the vertex ids.

I am using flink 1.14.4.

Can you please let me know how to get that ?

Thanks





RE: FlinkJobNotFoundException

2021-09-30 Thread Hailu, Andreas
Hi Matthias, the log file is quite large (21MB), so mailing it over in its 
entirety may have been a challenge. The file is available here [1], and we're 
of course happy to share any relevant parts of it with the mailing list.

I think since we've shared logs with you in the past, you weren't sent an 
additional welcome email ☺


[1] https://lockbox.gs.com/lockbox/folders/dc2ccacc-f2d2-4d66-a098-461b43e8b65f/

// ah

From: Matthias Pohl 
Sent: Thursday, September 30, 2021 2:57 AM
To: Gusick, Doug S [Engineering] 
Cc: user@flink.apache.org; Erai, Rahul [Engineering] 

Subject: Re: FlinkJobNotFoundException

I didn't receive any email. But we'd rather not do individual support. Please 
share the logs on the mailing list. This way, anyone is able to participate in 
the discussion.

Best,
Matthias

On Wed, Sep 29, 2021 at 8:12 PM Gusick, Doug S <doug.gus...@gs.com> wrote:
Hi Matthias,

Thank you for getting back. We have been looking into upgrading to a newer 
version, but have not completed full testing just yet.

I was unable to find a previous error in the JM logs. You should have received 
an email with details to a “lockbox”. I have uploaded the job manager logs 
there. Please let me know if you need any more information.

Thank you,
Doug

From: Matthias Pohl <matth...@ververica.com>
Sent: Wednesday, September 29, 2021 12:00 PM
To: Gusick, Doug S [Engineering] <doug.gus...@ny.email.gs.com>
Cc: user@flink.apache.org; Erai, Rahul [Engineering] <rahul.e...@ny.email.gs.com>
Subject: Re: FlinkJobNotFoundException

Hi Doug,
thanks for reaching out to the community. First of all, 1.9.2 is quite an old 
Flink version. You might want to consider upgrading to a newer version. The 
community only offers support for the two most recent Flink versions. Newer 
versions might include fixes for your issue.

But back to your actual problem: The logs you're providing only show that some 
job switched into FINISHED state. Is there some error showing up earlier in the 
logs which you might have missed? It would be helpful if you could share the 
complete JobManager logs to get a better understanding of what's going on.

Best,
Matthias

On Wed, Sep 29, 2021 at 3:47 PM Gusick, Doug S <doug.gus...@gs.com> wrote:
Hello,

We are facing an issue with some of our applications that are submitting a high 
volume of jobs to Flink (we are using v1.9.2). We are observing that numerous 
jobs (in this case 44 out of 350+) fail with the same FlinkJobNotFoundException 
within a 45 second timeframe.

From our client logs, this is the exception we can see:


Calc Engine: Caused by: org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job (d0991f0ae712a9df710aa03311a32c8c)]

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

Calc Engine:   at 
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

Calc Engine:   ... 3 more


This is the first job to fail with the above exception. From the JobManager 
logs, we can see that the job goes to FINISHED State, and then we see the 
following exception:

2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink Java 
Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c) switched 
from state RUNNING to FINISHED.
2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the 
JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT 
2021(d0991f0ae712a9df710aa03311a32c8c).
2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39] 
org.apache.flink.yarn.YarnResourceManager - Disconnect job 
manager 
0...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
 for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91] 
org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  - 
Exception occurred in REST handler: 
org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job 

RE: Many S3V4AuthErrorRetryStrategy warn logs while reading/writing from S3

2021-09-24 Thread Hailu, Andreas
Thanks, Robert.

// ah

From: Robert Metzger 
Sent: Wednesday, September 22, 2021 1:49 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: Many S3V4AuthErrorRetryStrategy warn logs while reading/writing 
from S3

Hey Andreas,

This could be related to 
https://github.com/apache/hadoop/pull/110/files#diff-0a2e55a2f79ea4079eb7b77b0dc3ee562b383076fa0ac168894d50c80a95131dR950

I guess in Flink this would be

s3.endpoint: your-endpoint-hostname
Where your-endpoint-hostname is a region-specific endpoint, which you can 
probably look up from the S3 docs.
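
For example, in flink-conf.yaml (the region below is only an example; use whichever 
region your buckets live in):

    s3.endpoint: s3.eu-central-1.amazonaws.com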


On Wed, Sep 22, 2021 at 7:07 PM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Hi,

When reading/writing to and from S3 using the flink-fs-s3-hadoop plugin on 
1.11.2, we observe a lot of these WARN log statements in the logs:

WARN  S3V4AuthErrorRetryStrategy - Attempting to re-send the request to 
s3.amazonaws.com with AWS V4 authentication. To avoid this warning in the 
future, please use region-specific endpoint to access buckets located in 
regions that require V4 signing.

The applications complete successfully, which is great, but I'm not sure what 
the root of the error is and I'm hesitant to silence it through our logging 
configurations. I saw something that looks similar here [1]. Is there a way for 
us to similarly have Flink's AWS S3 client use the V4 strategy to begin with?

[1] 
https://stackoverflow.com/questions/39513518/aws-emr-writing-to-kms-encrypted-s3-parquet-files



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.






Many S3V4AuthErrorRetryStrategy warn logs while reading/writing from S3

2021-09-22 Thread Hailu, Andreas
Hi,

When reading/writing to and from S3 using the flink-fs-s3-hadoop plugin on 
1.11.2, we observe a lot of these WARN log statements in the logs:

WARN  S3V4AuthErrorRetryStrategy - Attempting to re-send the request to 
s3.amazonaws.com with AWS V4 authentication. To avoid this warning in the 
future, please use region-specific endpoint to access buckets located in 
regions that require V4 signing.

The applications complete successfully, which is great, but I'm not sure what 
the root of the error is and I'm hesitant to silence it through our logging 
configurations. I saw something that looks similar here [1]. Is there a way for 
us to similarly have Flink's AWS S3 client use the V4 strategy to begin with?

[1] 
https://stackoverflow.com/questions/39513518/aws-emr-writing-to-kms-encrypted-s3-parquet-files



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.






RE: 1.9 to 1.11 Managed Memory Migration Questions

2021-08-27 Thread Hailu, Andreas [Engineering]
Thanks Caizhi, this was very helpful.

// ah

From: Caizhi Weng 
Sent: Thursday, August 26, 2021 10:41 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: 1.9 to 1.11 Managed Memory Migration Questions

Hi!

I've read the first mail again and discovered that the direct memory OOM occurs 
when the job is writing to the sink, not when data is being transferred between 
tasks through the network.

I'm not familiar with HDFS, but I guess writing to HDFS will require some 
direct memory. Maybe a detailed stack trace will prove me right.

 How does this relate with the overall TaskManager process memory

See [1] for the detailed memory model. In short, task.off-heap.size is part 
of the overall TaskManager process memory and is dedicated to user code 
(by user code I mean UDFs, sources, sinks, etc.). Network off-heap memory is 
only used for shuffling data between tasks and will not help with writing to 
HDFS; managed memory is only used for operators such as joins and 
aggregations.

is there a way to make this scale along with it for jobs that process larger 
batches of data?

There is no way to do so currently, but I don't see why this is needed, because 
for larger batches of data (I suppose you will increase the parallelism for 
larger data) more task managers will be allocated by YARN (if you're not 
increasing the number of slots each task manager provides). All these memory 
settings are per task manager, which means that as the number of task managers 
scales, the total size of off-heap memory naturally scales with it.

What does having a default value of 0 bytes mean?

Most sources and sinks do not use direct memory, so a default of 0 is 
reasonable. Only when the user needs it (for example, in this case you're using 
an HDFS sink) do they need to allocate this part of memory.

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html#detailed-memory-model
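
Put together as flink-conf.yaml entries, the keys touched on in this thread look 
roughly like this (the values are just the ones discussed above, not recommendations):

    taskmanager.memory.process.size: 12288m        # what -ytm maps to on YARN
    taskmanager.memory.task.off-heap.size: 1g      # direct memory for user code, e.g. the HDFS sink
    taskmanager.memory.managed.fraction: 0.4       # default
    taskmanager.memory.network.fraction: 0.1       # default, bounded by ...network.min / ...network.max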

On Fri, Aug 27, 2021 at 5:16 AM, Hailu, Andreas [Engineering] <andreas.ha...@gs.com> wrote:
Hi Caizhi, thanks for responding.

The networking keys you suggested didn't help, but I found that adding 
'taskmanager.memory.task.off-heap.size' with a value of '1g' led to a 
successful job. I can see in this property's documentation [1] that the default 
value is 0 bytes.

Task Off-Heap Memory size for TaskExecutors. This is the size of off heap 
memory (JVM direct memory and native memory) reserved for tasks. The configured 
value will be fully counted when Flink calculates the JVM max direct memory 
size parameter.
0 bytes (default)

From the mem setup page: [2]

The off-heap memory which is allocated by user code should be accounted for in 
task off-heap memory.

A few points of clarity if you would:

1. How does this relate with the overall TaskManager process memory, and is 
there a way to make this scale along with it for jobs that process larger 
batches of data?

2. What does having a default value of 0 bytes mean?

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#taskmanager-memory-task-off-heap-size
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_setup.html#configure-off-heap-memory-direct-or-native

// ah

From: Caizhi Weng <tsreape...@gmail.com>
Sent: Wednesday, August 25, 2021 10:47 PM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
Cc: user@flink.apache.org
Subject: Re: 1.9 to 1.11 Managed Memory Migration Questions

Hi!

Why does this ~30% memory reduction happen?

I don't know how memory is calculated in Flink 1.9, but this 1.11 memory 
allocation result is reasonable. This is because managed memory, network memory 
and JVM overhead memory in 1.11 all have their default sizes or fractions 
(managed memory 40%, network memo

RE: 1.9 to 1.11 Managed Memory Migration Questions

2021-08-26 Thread Hailu, Andreas [Engineering]
Hi Caizhi, thanks for responding.

The networking keys you suggested didn't help, but I found that adding 
'taskmanager.memory.task.off-heap.size' with a value of '1g' led to a 
successful job. I can see in this property's documentation [1] that the default 
value is 0 bytes.

Task Off-Heap Memory size for TaskExecutors. This is the size of off heap 
memory (JVM direct memory and native memory) reserved for tasks. The configured 
value will be fully counted when Flink calculates the JVM max direct memory 
size parameter.
0 bytes (default)

From the mem setup page: [2]

The off-heap memory which is allocated by user code should be accounted for in 
task off-heap memory.

A few points of clarity if you would:

1. How does this relate with the overall TaskManager process memory, and is 
there a way to make this scale along with it for jobs that process larger 
batches of data?

2. What does having a default value of 0 bytes mean?

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html#taskmanager-memory-task-off-heap-size
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/memory/mem_setup.html#configure-off-heap-memory-direct-or-native

// ah

From: Caizhi Weng 
Sent: Wednesday, August 25, 2021 10:47 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: 1.9 to 1.11 Managed Memory Migration Questions

Hi!

Why does this ~30% memory reduction happen?

I don't know how memory is calculated in Flink 1.9, but this 1.11 memory 
allocation result is reasonable. This is because managed memory, network memory 
and JVM overhead memory in 1.11 all have their default sizes or fractions 
(managed memory 40%, network memory 10% with a 1g max, JVM overhead memory 10% 
with a 1g max; see [1]), while heap memory doesn't. So a 5.8 GB heap (about 
12G minus 2G, minus the 40% managed fraction) and 4.3 GB of managed memory 
(about 40%) is explainable.
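
Spelled out roughly for the 12 GB example above (a sketch; the exact numbers also 
depend on the metaspace and framework-memory defaults):

    JVM overhead       = min(10% of 12 GB, 1 GB)              = 1 GB
    total Flink memory = 12 GB - overhead - metaspace         ~ 10.7 GB
    managed memory     = 40% of total Flink memory            ~ 4.3 GB
    network memory     = min(10% of total Flink memory, 1 GB) = 1 GB
    heap               = roughly what remains                 ~ 5.5-5.8 GB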

How would you suggest discerning what properties we should have a look at?

Network shuffle memory now has its own configuration key, which is 
taskmanager.memory.network.fraction (along with ...network.min and 
...network.max). Also see [1] and [2] for more keys related to the task 
manager's memory.

[1] 
https://github.com/apache/flink/blob/release-1.11/flink-core/src/main/java/org/apache/flink/configuration/TaskManagerOptions.java
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_setup_tm.html#detailed-memory-model

On Thu, Aug 26, 2021 at 9:07 AM, Hailu, Andreas [Engineering] <andreas.ha...@gs.com> wrote:
Hi folks,

We’re about half way complete in migrating our YARN batch processing 
applications from Flink 1.9 to 1.11, and are currently tackling the memory 
configuration migrations.

Our test application’s sink failed with the following exception while writing 
to HDFS:

Caused by: java.lang.OutOfMemoryError: Direct buffer memory. The direct 
out-of-memory error has occurred. This can mean two things: either job(s) 
require(s) a larger size of JVM direct memory or there is a direct memory leak. 
The direct memory can be allocated by user code or some of its dependencies. In 
this case 'taskmanager.memory.task.off-heap.size' configuration option should 
be increased. Flink framework and its dependencies also consume the direct 
memory, mostly for network communication. The most of network memory is managed 
by Flink and should not result in out-of-memory error. In certain special 
cases, in particular for jobs with high parallelism, the framework may require 
more direct memory which is not managed by Flink. In this case 
'taskmanager.memory.framework.off-heap.size' configuration option should be 
increased. If the error persists then there is probably a direct memory leak in 
user code or some of its dependencies which has to be investigated and fixed. 
The task executor has to be shutdown...

We submit our applications through a Flink YARN session with -ytm, -yjm, etc. We 
don't have any memory configuration options set aside from 
'taskmanager.network.bounded-blocking-subpartition-type: file', which I see is 
now deprecated and replaced with a new option defaulted to 'file' (which works 
for us!), so nearly everything else is at its default.

We haven’t made any configuration changes yet thus far as w

1.9 to 1.11 Managed Memory Migration Questions

2021-08-25 Thread Hailu, Andreas [Engineering]
Hi folks,

We're about half way complete in migrating our YARN batch processing 
applications from Flink 1.9 to 1.11, and are currently tackling the memory 
configuration migrations.

Our test application's sink failed with the following exception while writing 
to HDFS:

Caused by: java.lang.OutOfMemoryError: Direct buffer memory. The direct 
out-of-memory error has occurred. This can mean two things: either job(s) 
require(s) a larger size of JVM direct memory or there is a direct memory leak. 
The direct memory can be allocated by user code or some of its dependencies. In 
this case 'taskmanager.memory.task.off-heap.size' configuration option should 
be increased. Flink framework and its dependencies also consume the direct 
memory, mostly for network communication. The most of network memory is managed 
by Flink and should not result in out-of-memory error. In certain special 
cases, in particular for jobs with high parallelism, the framework may require 
more direct memory which is not managed by Flink. In this case 
'taskmanager.memory.framework.off-heap.size' configuration option should be 
increased. If the error persists then there is probably a direct memory leak in 
user code or some of its dependencies which has to be investigated and fixed. 
The task executor has to be shutdown...

We submit our applications through a Flink YARN session with -ytm, -yjm, etc. We 
don't have any memory configuration options set aside from 
'taskmanager.network.bounded-blocking-subpartition-type: file', which I see is 
now deprecated and replaced with a new option defaulted to 'file' (which works 
for us!), so nearly everything else is at its default.

We haven't made any configuration changes thus far, as we're still combing 
through the migration instructions, but I did have some questions about what I 
observed.

1. I observed that an application run with "-ytm 12288" on 1.9 receives 
8.47 GB of JVM heap space and 5.95 GB of Flink managed memory (as reported by the 
ApplicationMaster), whereas on 1.11 it receives 5.79 GB of JVM heap and 4.30 GB of 
Flink managed memory. Why does this ~30% memory reduction happen?

2. Piggybacking off point 1, on 1.9 we were not explicitly setting 
off-heap memory parameters. How would you suggest discerning which properties we 
should have a look at?

Best,
Andreas





RE: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-20 Thread Hailu, Andreas [Engineering]
Hi David, I was able to get this working using your suggestion:


1) Deploy a Flink YARN session cluster, noting the host + port of the 
session's JobManager.

2) Submit the Flink job using the session's details, i.e. submit the Flink job 
with the '-m host:port' option (see the sketch below).
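
For the record, the two steps as commands (flag values mirror our earlier per-job 
submission; the host:port is whatever the session's JobManager reports on startup):

$ ./bin/yarn-session.sh -nm app-name -qu queue-name -s 2 -jm 4096m -tm 12288m -d
# note the JobManager host:port printed when the session starts, e.g. host1234:38291
$ ./bin/flink run -m host1234:38291 --class com.class.name.Here -p 2 /path/to/artifact.jar -application-args-go-here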

Thanks for clearing things up.

// ah

From: David Morávek 
Sent: Tuesday, August 17, 2021 4:37 AM
To: Hailu, Andreas [Engineering] 
Cc: Ravichandran, Soorya Prasanna [Engineering] 
; user@flink.apache.org
Subject: Re: Upgrading from Flink on YARN 1.9 to 1.11

Hi Andreas,

the problem here is that the command you're using is starting a per-job cluster 
(which is obvious from the deployment method used, 
"YarnClusterDescriptor.deployJobCluster"). Apparently the `-m yarn-cluster` 
flag is deprecated and no longer supported; I think this is something we should 
completely remove in the near future. Also, this was always supposed to start 
your job in per-job mode, but unfortunately in older versions this was kind of 
simulated using a session cluster, so I'd say it had just worked by accident 
(a.k.a. an "undocumented bug / feature").

What you really want to do is to start a session cluster upfront and then use the 
`yarn-session` deployment target (where you need to provide the YARN application id 
so Flink can search for the active JobManager). This is well documented in the 
YARN section of the docs [1].
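
For example (the application id below is illustrative):

$ ./bin/flink run -t yarn-session -Dyarn.application.id=application_XXXXXXXXXXXXX_0001 --class com.class.name.Here /path/to/artifact.jar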

Can you please try this approach and let me know if that helps?

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/yarn/#session-mode

Best,
D.

On Mon, Aug 16, 2021 at 8:52 PM Hailu, Andreas [Engineering] <andreas.ha...@gs.com> wrote:
Hi David,

You're correct about classpathing problems - thanks for your help in spotting 
them. I was able to get past that exception by removing some conflicting 
packages in my shaded JAR, but I'm seeing something else that's interesting. 
With the 2 threads trying to submit jobs, one of the threads is able to submit 
and process data successfully, while the other thread fails.

Log snippet:
2021-08-16 13:43:12,893 [thread-1] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-16 13:43:12,893 [thread-2] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-16 13:43:12,897 [thread-2] WARN  PluginConfig - The plugins directory 
[plugins] does not exist.
2021-08-16 13:43:12,897 [thread-1] WARN  PluginConfig - The plugins directory 
[plugins] does not exist.
2021-08-16 13:43:13,104 [thread-2] WARN  PluginConfig - The plugins directory 
[plugins] does not exist.
2021-08-16 13:43:13,104 [thread-1] WARN  PluginConfig - The plugins directory 
[plugins] does not exist.
2021-08-16 13:43:20,475 [thread-1] INFO  YarnClusterDescriptor - Adding 
delegation token to the AM container.
2021-08-16 13:43:20,488 [thread-1] INFO  DFSClient - Created 
HDFS_DELEGATION_TOKEN token 56247060 for delp on ha-hdfs:d279536
2021-08-16 13:43:20,512 [thread-1] INFO  TokenCache - Got dt for 
hdfs://d279536; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:d279536, Ident: 
(HDFS_DELEGATION_TOKEN token 56247060 for delp)
2021-08-16 13:43:20,513 [thread-1] INFO  Utils - Attempting to obtain Kerberos 
security token for HBase
2021-08-16 13:43:20,513 [thread-1] INFO  Utils - HBase is not available (not 
packaged with this application): ClassNotFoundException : 
"org.apache.hadoop.hbase.HBaseConfiguration".
2021-08-16 13:43:20,564 [thread-2] WARN  YarnClusterDescriptor - Add job graph 
to local resource fail.
2021-08-16 13:43:20,570 [thread-1] INFO  YarnClusterDescriptor - Submitting 
application master application_1628992879699_11275
2021-08-16 13:43:20,570 [thread-2] ERROR FlowDataBase - Exception running data 
flow for thread-2
org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy 
Yarn job cluster.
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:431)
at 
org.apache.flink.client.deployment.executors.AbstractJobClusterExe

RE: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-16 Thread Hailu, Andreas [Engineering]
If it means anything, our client doesn't require asynchronous job submission.

// ah

From: David Morávek 
Sent: Monday, August 16, 2021 6:28 AM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: Upgrading from Flink on YARN 1.9 to 1.11

Hi Andreas,

Per-job and session deployment modes should not be affected by this FLIP. 
Application mode is just a new deployment mode (where the job driver runs embedded 
within the JM) that co-exists with these two.

From the information you've provided, I'd say your actual problem is this exception:

```
Caused by: java.lang.ExceptionInInitializerError
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
at com.sun.jersey.api.client.Client.init(Client.java:342)
at com.sun.jersey.api.client.Client.access$000(Client.java:118)
at com.sun.jersey.api.client.Client$1.f(Client.java:191)
at com.sun.jersey.api.client.Client$1.f(Client.java:187)
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)
at com.sun.jersey.api.client.Client.<init>(Client.java:187)
    at com.sun.jersey.api.client.Client.<init>(Client.java:170)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:285)
```

I've seen this exception a few times with Hadoop already and it's usually a 
dependency / class-path problem. If you google for this you'll find many 
references.

Best,
D.


On Fri, Aug 13, 2021 at 9:40 PM Hailu, Andreas [Engineering] <andreas.ha...@gs.com> wrote:
Hello folks!

We’re looking to upgrade from 1.9 to 1.11. Our Flink applications run on YARN 
and each have their own clusters, with each application having multiple jobs 
submitted.

Our current submission command looks like this:
$ run -m yarn-cluster --class com.class.name.Here -p 2 -yqu queue-name -ynm 
app-name -yn 1 -ys 2 -yjm 4096 -ytm 12288 /path/to/artifact.jar 
-application-args-go-here

The behavior observed in versions <= 1.9 is the following:

1. A Flink cluster gets deployed to YARN

2. Our application code is run, building graphs and submitting jobs

When we rebuilt and submit using 1.11.2, we now observe the following:

1. Our application code is run, building graph and submitting jobs

2. A Flink cluster gets deployed to YARN once execute() is invoked

I presume that this is a result of FLIP-85 [1] ?

This change in behavior proves to be a problem for us, as our application is 
multi-threaded and each thread submits its own job to the Flink cluster. What 
we see is that the first thread to call execute() submits a job to YARN, and the 
others fail with a ClusterDeploymentException.

2021-08-13 14:47:42,299 [flink-thread-#1] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-13 14:47:42,299 [flink-thread-#2] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-13 14:47:42,304 [flink-thread-#1] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:42,304 [flink-thread-#2] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
Listening for transport dt_socket at address: 5005
2021-08-13 14:47:46,716 [flink-thread-#2] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:46,716 [flink-thread-#1] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:54,820 [flink-thread-#1] INFO  YarnClusterDescriptor - Adding 
delegation token to the AM container.
2021-08-13 14:47:54,837 [flink-thread-#1] INFO  DFSClient - Created 
HDFS_DELEGATION_TOKEN token 56208379 for delp on ha-hdfs:d279536
2021-08-13 14:47:54,860 [flink-thread-#1] INFO  TokenCache - Got dt for 
hdfs://d279536; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:d279536, Ident: 
(HDFS_DELEGATION_TOKEN token 56208379 for user)
2021-08-13 14:47:54,860 [flink-thread-#1] INFO  Utils - Attempting to obtain 
Kerberos security token for HBase
2021-08-13 14:47:54,861 [flink-thread-#1] INFO  Utils - HBase is not available 
(not packaged with this application): ClassNotFoundException : 
"org.apache.hadoop.hbase.HBaseConfiguration".
2021-08-13 14:47:54,901 [flink-thread-#1] INFO  YarnClusterDescriptor - 
Submitting application master application_1628393898291_71530
2021-08-13 14:47:54,904 [flink-thread-#2] ERROR FlowDataBase - Exception 
running data flow for flink-thread-#2
org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy 
Yarn job cluster.
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(Y

Upgrading from Flink on YARN 1.9 to 1.11

2021-08-13 Thread Hailu, Andreas [Engineering]
Hello folks!

We're looking to upgrade from 1.9 to 1.11. Our Flink applications run on YARN 
and each have their own clusters, with each application having multiple jobs 
submitted.

Our current submission command looks like this:
$ run -m yarn-cluster --class com.class.name.Here -p 2 -yqu queue-name -ynm 
app-name -yn 1 -ys 2 -yjm 4096 -ytm 12288 /path/to/artifact.jar 
-application-args-go-here

The behavior observed in versions <= 1.9 is the following:

1. A Flink cluster gets deployed to YARN

2. Our application code is run, building graphs and submitting jobs

When we rebuilt and submit using 1.11.2, we now observe the following:

1. Our application code is run, building graph and submitting jobs

2. A Flink cluster gets deployed to YARN once execute() is invoked

I presume that this is a result of FLIP-85 [1] ?

This change in behavior proves to be a problem for us, as our application is 
multi-threaded and each thread submits its own job to the Flink cluster. What 
we see is that the first thread to call execute() submits a job to YARN, and the 
others fail with a ClusterDeploymentException.

2021-08-13 14:47:42,299 [flink-thread-#1] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-13 14:47:42,299 [flink-thread-#2] INFO  YarnClusterDescriptor - Cluster 
specification: ClusterSpecification{masterMemoryMB=4096, 
taskManagerMemoryMB=18432, slotsPerTaskManager=2}
2021-08-13 14:47:42,304 [flink-thread-#1] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:42,304 [flink-thread-#2] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
Listening for transport dt_socket at address: 5005
2021-08-13 14:47:46,716 [flink-thread-#2] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:46,716 [flink-thread-#1] WARN  PluginConfig - The plugins 
directory [plugins] does not exist.
2021-08-13 14:47:54,820 [flink-thread-#1] INFO  YarnClusterDescriptor - Adding 
delegation token to the AM container.
2021-08-13 14:47:54,837 [flink-thread-#1] INFO  DFSClient - Created 
HDFS_DELEGATION_TOKEN token 56208379 for delp on ha-hdfs:d279536
2021-08-13 14:47:54,860 [flink-thread-#1] INFO  TokenCache - Got dt for 
hdfs://d279536; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:d279536, Ident: 
(HDFS_DELEGATION_TOKEN token 56208379 for user)
2021-08-13 14:47:54,860 [flink-thread-#1] INFO  Utils - Attempting to obtain 
Kerberos security token for HBase
2021-08-13 14:47:54,861 [flink-thread-#1] INFO  Utils - HBase is not available 
(not packaged with this application): ClassNotFoundException : 
"org.apache.hadoop.hbase.HBaseConfiguration".
2021-08-13 14:47:54,901 [flink-thread-#1] INFO  YarnClusterDescriptor - 
Submitting application master application_1628393898291_71530
2021-08-13 14:47:54,904 [flink-thread-#2] ERROR FlowDataBase - Exception 
running data flow for flink-thread-#2
org.apache.flink.client.deployment.ClusterDeploymentException: Could not deploy 
Yarn job cluster.
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:431)
at 
org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
at 
org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:973)
at 
org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:124)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:72)
...
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:826)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2152)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2138)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:919)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
...
Caused by: java.lang.ExceptionInInitializerError
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
at com.sun.jersey.api.client.Client.init(Client.java:342)
at com.sun.jersey.api.client.Client.access$000(Client.java:118)
at com.sun.jersey.api.client.Client$1.f(Client.java:191)
at com.sun.jersey.api.client.Client$1.f(Client.java:187)
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)
at com.sun.jersey.api.client.Client.<init>(Client.java:187)
    at com.sun.jersey.api.client.Client.<init>(Client.java:170)
at 

RE: Unable to use custom AWS credentials provider - 1.9.2

2021-08-09 Thread Hailu, Andreas [Engineering]
Hi Arvid, no. We are leveraging it as part of our application code, but not 
Kinesis – after finding and excluding duplicates of this package in our 
classpath, we are able to submit a job. Thanks.

// ah

From: Arvid Heise 
Sent: Friday, July 30, 2021 1:34 PM
To: Hailu, Andreas [Engineering] 
Cc: Ingo Bürk ; user 
Subject: Re: Unable to use custom AWS credentials provider - 1.9.2

Well, usually the plugins should be properly isolated, but Flink 1.9 is quite old, 
so there is a chance the plugin classloader was not fully isolated. But I also 
have a hard time concluding anything from the small stack trace.

Do you need aws-java-sdk-core because of Kinesis?

On Fri, Jul 30, 2021 at 6:46 PM Hailu, Andreas [Engineering] <andreas.ha...@gs.com> wrote:
Hi Arvid,

Yes, we do have AWSCredentialsProvider in our user JAR. It’s coming from 
aws-java-sdk-core. Must we exclude that, then?

// ah

From: Arvid Heise <ar...@apache.org>
Sent: Friday, July 30, 2021 11:26 AM
To: Ingo Bürk <i...@ververica.com>
Cc: user <user@flink.apache.org>
Subject: Re: Unable to use custom AWS credentials provider - 1.9.2

Can you double-check if you have an AWSCredentialsProvider in your user jar or 
in your flink/lib/? Same for S3AUtils?

On Fri, Jul 30, 2021 at 9:50 AM Ingo Bürk <i...@ververica.com> wrote:
Hi Andreas,

Such an exception can occur if the class in question (your provider) and
the one being checked (AWSCredentialsProvider) were loaded from
different class loaders.

Any chance you can try once with 1.10+ to see if it would work? It does
look like a Flink issue to me, but I'm not sure this can be worked
around in 1.9.

[Initially sent to Andreas directly by accident]


Best
Ingo

On 29.07.21 17:37, Hailu, Andreas [Engineering] wrote:
> Hi team, I’m trying to read and write from and to S3 using a custom AWS
> Credential Provider using Flink v1.9.2 on YARN.
>
>
>
> I followed the instructions to create a plugins directory in our Flink
> distribution location and copy the FS implementation (I’m using
> s3-fs-hadoop) package into it. I have also placed the package that
> contains our custom CredentialsProvider implementation in that same
> directory as well.
>
>
>
> $ ls /flink-1.9.2/plugins/s3-fs-hadoop/
>
> total 20664
>
> 14469 Jun 17 10:57 aws-hadoop-utils-0.0.9.jar  <- contains our custom
> CredentialsProvider class
>
> 21141329 Jul 28 15:43 flink-s3-fs-hadoop-1.9.2.jar
>
>
>
> I’ve placed this directory in the java classpath when running the Flink
> application. I have added the ‘fs.s3a.assumed.role.credentials.provider’
> and ‘fs.s3a.assumed.role.arn’ to our flink-conf.yaml as well. When
> trying to run a basic app that reads a file, I get the following exception:
>
>
>
> Caused by: java.io.IOException: Class class
> com.gs.ep.da.lake.aws.CustomAwsCredentialProvider does not implement
> AWSCredentialsProvider
>
> at
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:400)
>
> at
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:367)
>
> at
> org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:73)
>
>
>
> Have I missed a step here? Do I need to make the packages also available
> in my YARN classpath as well? I saw some discussion that suggest that
> there were some related problems around this that were resolved in v1.10
> [1][2][3].
>
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-14574
>
> [2] https://issues.apache.org/jira/browse/FLINK-13044

RE: Obtain JobManager Web Interface URL

2021-08-02 Thread Hailu, Andreas [Engineering]
Hi Yangze, sure!

After a submitted Flink app is complete, our client app polls the RESTful 
interface to pull job metrics -- operator start/end times, duration, records + 
bytes read/written, etc. All of these metrics are published to a database 
for analytical purposes, both programmatic and ad hoc.

There was no clear exposure of ClusterClient, so we had originally worked 
around this by extending the CliFrontend class with a bit of a façade class 
that grabbed the ClusterClient from the executeProgram() method:

@Override
protected void executeProgram(PackagedProgram program, ClusterClient client, int parallelism)
        throws ProgramMissingJobException, ProgramInvocationException {
    logAndSysout("Starting execution of program");
    // Stash the web interface URL so the rest of the application can reach the REST API.
    System.setProperty(JOB_MANAGER_WEB_INTERFACE_PROPERTY, client.getWebInterfaceURL());
    ...
}

These metrics prove immensely valuable as they help us optimize performance, 
diagnose issues, as well as predict resource requirements for applications.

// ah

-Original Message-
From: Yangze Guo 
Sent: Sunday, August 1, 2021 10:38 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: Obtain JobManager Web Interface URL

AFAIK, the ClusterClient should not be exposed through the public API.
Would you like to explain your use case and why you need to get the web UI 
programmatically?

Best,
Yangze Guo

On Fri, Jul 30, 2021 at 9:54 PM Hailu, Andreas [Engineering] 
 wrote:
>
> Hello Yangze, thanks for responding.
>
> I'm attempting to perform this programmatically on YARN, so looking at a log 
> just won't do :) What's the appropriate way to get an instance of a 
> ClusterClient? Do you know of any examples I can look at?
>
> // ah
>
> -Original Message-
> From: Yangze Guo 
> Sent: Thursday, July 29, 2021 11:17 PM
> To: Hailu, Andreas [Engineering] 
> Cc: user@flink.apache.org
> Subject: Re: Obtain JobManager Web Interface URL
>
> Hi, Hailu
>
> AFAIK, the ClusterClient#getWebInterfaceURL has been available since 1.10.
>
> Regarding the JobManager web interface, it will be printed in the logs when 
> starting a native Kubernetes or Yarn cluster. In standalone mode, it is 
> configured by yourself [1].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/overview/#starting-and-stopping-a-cluster
>
> Best,
> Yangze Guo
>
> On Fri, Jul 30, 2021 at 1:41 AM Hailu, Andreas [Engineering] 
>  wrote:
> >
> > Hi team,
> >
> >
> >
> > Is there a method available to obtain the JobManager’s REST url? We 
> > originally overloaded CliFrontend#executeProgram and nabbed it from the 
> > ClusterClient#getWebInterfaceUrl method, but it seems this method’s 
> > signature has been changed and no longer available as of 1.10.0.
> >
> >
> >
> > Best,
> >
> > Andreas
> >
> >
> >
> >
> > 
> >
> > Your Personal Data: We may collect and process information about you
> > that may be subject to data protection laws. For more information
> > about how we use and disclose your personal data, how we protect
> > your information, our legal basis to use your information, your
> > rights and who you can contact, please refer to:
> > http://www.gs.com/privacy-notices
>
> 
>
> Your Personal Data: We may collect and process information about you
> that may be subject to data protection laws. For more information
> about how we use and disclose your personal data, how we protect your
> information, our legal basis to use your information, your rights and
> who you can contact, please refer to:
> www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


RE: Unable to use custom AWS credentials provider - 1.9.2

2021-07-30 Thread Hailu, Andreas [Engineering]
Hi Arvid,

Yes, we do have AWSCredentialsProvider in our user JAR. It’s coming from 
aws-java-sdk-core. Must we exclude that, then?

// ah

From: Arvid Heise 
Sent: Friday, July 30, 2021 11:26 AM
To: Ingo Bürk 
Cc: user 
Subject: Re: Unable to use custom AWS credentials provider - 1.9.2

Can you double-check if you have an AWSCredentialsProvider in your user jar or 
in your flink/lib/? Same for S3AUtils?

On Fri, Jul 30, 2021 at 9:50 AM Ingo Bürk 
mailto:i...@ververica.com>> wrote:
Hi Andreas,

Such an exception can occur if the class in question (your provider) and
the one being checked (AWSCredentialsProvider) were loaded from
different class loaders.

Any chance you can try once with 1.10+ to see if it would work? It does
look like a Flink issue to me, but I'm not sure this can be worked
around in 1.9.
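
A quick way to confirm the mismatch (an untested sketch in plain Java, nothing 
Flink-specific; the class names are just taken from your stack trace):

    Class<?> provider =
        Class.forName("com.gs.ep.da.lake.aws.CustomAwsCredentialProvider");
    Class<?> iface = com.amazonaws.auth.AWSCredentialsProvider.class;
    System.out.println(provider.getClassLoader());
    System.out.println(iface.getClassLoader());
    // prints false despite "implements AWSCredentialsProvider" if the interface
    // was loaded a second time by a different class loader
    System.out.println(iface.isAssignableFrom(provider));

Run it from the same context in which the S3A factory fails and compare the two loaders.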

[Initially sent to Andreas directly by accident]


Best
Ingo

On 29.07.21 17:37, Hailu, Andreas [Engineering] wrote:
> Hi team, I’m trying to read and write from and to S3 using a custom AWS
> Credential Provider using Flink v1.9.2 on YARN.
>
>
>
> I followed the instructions to create a plugins directory in our Flink
> distribution location and copy the FS implementation (I’m using
> s3-fs-hadoop) package into it. I have also placed the package that
> contains our custom CredentialsProvider implementation in that same
> directory as well.
>
>
>
> $ ls /flink-1.9.2/plugins/s3-fs-hadoop/
>
> total 20664
>
14469 Jun 17 10:57 aws-hadoop-utils-0.0.9.jar <-- contains our custom
> CredentialsProvider class
>
> 21141329 Jul 28 15:43 flink-s3-fs-hadoop-1.9.2.jar
>
>
>
> I’ve placed this directory in the java classpath when running the Flink
> application. I have added the ‘fs.s3a.assumed.role.credentials.provider’
> and ‘fs.s3a.assumed.role.arn’ to our flink-conf.yaml as well. When
> trying to run a basic app that reads a file, I get the following exception:
>
>
>
> Caused by: java.io.IOException: Class class
> com.gs.ep.da.lake.aws.CustomAwsCredentialProvider does not implement
> AWSCredentialsProvider
>
> at
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:400)
>
> at
> org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:367)
>
> at
> org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:73)
>
>
>
> Have I missed a step here? Do I need to make the packages also available
> in my YARN classpath as well? I saw some discussions that suggest
> there were some related problems around this that were resolved in v1.10
> [1][2][3].
>
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-14574
>
> [2] https://issues.apache.org/jira/browse/FLINK-13044
>
> [3] https://issues.apache.org/jira/browse/FLINK-11956
>
>
>
> Best,
>
> Andreas
>
>
>
>
> ---

RE: Obtain JobManager Web Interface URL

2021-07-30 Thread Hailu, Andreas [Engineering]
Hello Yangze, thanks for responding.

I'm attempting to perform this programmatically on YARN, so looking at a log 
just won't do :) What's the appropriate way to get an instance of a 
ClusterClient? Do you know of any examples I can look at?

// ah

-Original Message-
From: Yangze Guo 
Sent: Thursday, July 29, 2021 11:17 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: Obtain JobManager Web Interface URL

Hi, Hailu

AFAIK, the ClusterClient#getWebInterfaceURL has been available since 1.10.

Regarding the JobManager web interface, it will be printed in the logs when 
starting a native Kubernetes or Yarn cluster. In standalone mode, it is 
configured by yourself [1].

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/overview/#starting-and-stopping-a-cluster

Best,
Yangze Guo

On Fri, Jul 30, 2021 at 1:41 AM Hailu, Andreas [Engineering] 
 wrote:
>
> Hi team,
>
>
>
> Is there a method available to obtain the JobManager’s REST url? We 
> originally overloaded CliFrontend#executeProgram and nabbed it from the 
> ClusterClient#getWebInterfaceUrl method, but it seems this method’s signature 
> has been changed and no longer available as of 1.10.0.
>
>
>
> Best,
>
> Andreas
>
>
>
>
> 
>
> Your Personal Data: We may collect and process information about you
> that may be subject to data protection laws. For more information
> about how we use and disclose your personal data, how we protect your
> information, our legal basis to use your information, your rights and
> who you can contact, please refer to:
> http://www.gs.com/privacy-notices



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


Obtain JobManager Web Interface URL

2021-07-29 Thread Hailu, Andreas [Engineering]
Hi team,

Is there a method available to obtain the JobManager's REST url? We originally 
overloaded CliFrontend#executeProgram and nabbed it from the 
ClusterClient#getWebInterfaceUrl method, but it seems this method's signature 
has been changed and no longer available as of 1.10.0.

Best,
Andreas




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


Unable to use custom AWS credentials provider - 1.9.2

2021-07-29 Thread Hailu, Andreas [Engineering]
Hi team, I'm trying to read and write from and to S3 using a custom AWS 
Credential Provider using Flink v1.9.2 on YARN.

I followed the instructions to create a plugins directory in our Flink 
distribution location and copy the FS implementation (I'm using s3-fs-hadoop) 
package into it. I have also placed the package that contains our custom 
CredentialsProvider implementation in that same directory as well.
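
For reference, the provider itself is just a plain implementation of the AWS SDK 
interface, roughly of this shape (an illustrative sketch only, not the actual class):

public class CustomAwsCredentialProvider implements com.amazonaws.auth.AWSCredentialsProvider {
    @Override
    public com.amazonaws.auth.AWSCredentials getCredentials() {
        // resolve credentials here (details omitted)
        return new com.amazonaws.auth.BasicAWSCredentials("<access-key>", "<secret-key>");
    }

    @Override
    public void refresh() {
        // no-op
    }
}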

$ ls /flink-1.9.2/plugins/s3-fs-hadoop/
total 20664
14469 Jun 17 10:57 aws-hadoop-utils-0.0.9.jar <-- contains our custom 
CredentialsProvider class
21141329 Jul 28 15:43 flink-s3-fs-hadoop-1.9.2.jar

I've placed this directory in the java classpath when running the Flink 
application. I have added the 'fs.s3a.assumed.role.credentials.provider' and 
'fs.s3a.assumed.role.arn' to our flink-conf.yaml as well. When trying to run a 
basic app that reads a file, I get the following exception:

Caused by: java.io.IOException: Class class 
com.gs.ep.da.lake.aws.CustomAwsCredentialProvider does not implement 
AWSCredentialsProvider
at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProvider(S3AUtils.java:400)
at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:367)
at 
org.apache.hadoop.fs.s3a.S3ClientFactory$DefaultS3ClientFactory.createS3Client(S3ClientFactory.java:73)

Have I missed a step here? Do I need to make the packages also available in my 
YARN classpath as well? I saw some discussions that suggest there were some 
related problems around this that were resolved in v1.10 [1][2][3].

[1] https://issues.apache.org/jira/browse/FLINK-14574
[2] https://issues.apache.org/jira/browse/FLINK-13044
[3] https://issues.apache.org/jira/browse/FLINK-11956

Best,
Andreas




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: [1.9.2] Flink SSL on YARN - NoSuchFileException

2021-04-26 Thread Hailu, Andreas [Engineering]
Hey Nico, thanks for your reply. I gave this a try and unfortunately had no 
luck.

// ah

-Original Message-
From: Nico Kruber 
Sent: Wednesday, April 21, 2021 1:01 PM
To: user@flink.apache.org
Subject: Re: [1.9.2] Flink SSL on YARN - NoSuchFileException

Hi Andreas,
judging from [1], it should work if you refer to it via

security.ssl.rest.keystore: ./deploy-keys/rest.keystore
security.ssl.rest.truststore: ./deploy-keys/rest.truststore
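
These paths are relative to the working directory of the YARN containers, which is 
where the files shipped via -yt end up. For example (class and jar names below are 
placeholders):

flink run -m yarn-cluster --class com.example.YourJob -yt /home/user/ssl/deploy-keys/ yourJob.jar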


Nico

[1] 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-KAFKA-KEYTAB-Kafkaconsumer-error-Kerberos-td37277.html

On Monday, 19 April 2021 16:45:25 CEST Hailu, Andreas [Engineering] wrote:
> Hi Flink team,
>
> I'm trying to configure Flink on YARN with SSL enabled. I've
> followed the documentation's instructions [1] to generate a Keystore
> and Truststore locally, and added the properties to my flink-conf.yaml.
> security.ssl.rest.keystore: /home/user/ssl/deploy-keys/rest.keystore
> security.ssl.rest.truststore:
> /home/user/ssl/deploy-keys/rest.truststore
>
> I've also added the yarnship option so that the keystore and
> truststore are deployed as suggested in [1].
>
> -m yarn-cluster --class  [...] -yt /home/user/ssl/deploy-keys/
>
> However, starting the Flink cluster results in a NoSuchFileException,
> Caused by: java.nio.file.NoSuchFileException:
> /home/user/ssl/deploy-keys/rest.keystore at
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvide
> r.jav
> a:214) at java.nio.file.Files.newByteChannel(Files.java:361)
> at java.nio.file.Files.newByteChannel(Files.java:407)
> at
> java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider
> .java
> :384) at java.nio.file.Files.newInputStream(Files.java:152)
> at
> org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.ja
> va:26
> 6) at
> org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUti
> ls.ja
> va:392) at
> org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUti
> ls.ja
> va:365) at
> org.apache.flink.runtime.net.SSLUtils.createRestServerSSLEngineFactory
> (SSLU
> tils.java:163) at
> org.apache.flink.runtime.rest.RestServerEndpointConfiguration.fromConf
> igura
> tion(RestServerEndpointConfiguration.java:160)
>
> I'm able to see in launch_container.sh that the shipped directory was
> able to be created successfully:
>
> mkdir -p deploy-keys
> ln -sf
> "/fs/htmp/yarn/local/usercache/delp/appcache/application_1618711298408
> _2664 /filecache/16/rest.truststore" "deploy-keys/rest.truststore"
> mkdir -p deploy-keys ln -sf
> "/fs/htmp/yarn/local/usercache/delp/appcache/application_1618711298408
> _2664 /filecache/13/rest.keystore" "deploy-keys/rest.keystore"
>
> So given the above logs, I tried editing flink-conf.yaml to reflect
> what I
> saw: security.ssl.rest.keystore: deploy-keys/rest.keystore
> security.ssl.rest.truststore: deploy-keys/rest.truststore
>
> But that didn't seem to work, either:
> Caused by: java.nio.file.NoSuchFileException: deploy-keys/rest.truststore
> at
> sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at
> sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvide
> r.jav
> a:214) at java.nio.file.Files.newByteChannel(Files.java:361)
> at java.nio.file.Files.newByteChannel(Files.java:407)
> at
> java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider
> .java
> :384) at java.nio.file.Files.newInputStream(Files.java:152)
> at
> org.apache.flink.runtime.net.SSLUtils.getTrustManagerFactory(SSLUtils.java:
> 233) at
> org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUti
> ls.ja
> va:397) at
> org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUti
> ls.ja
> va:365) at
> org.apache.flink.runtime.net.SSLUtils.createRestClientSSLEngineFactory
> (SSLU
> tils.java:181) at
> org.apache.flink.runtime.rest.RestClientConfiguration.fromConfiguratio
> n(Res
> tClientConfiguration.java:106)
>
> What needs to be done to get the YARN application to point to the
> right keystore and truststore?
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/securi
> ty-ss l.html#tips-for-yarn--mesos-deployment
>
>

RE: [1.9.2] Flink SSL on YARN - NoSuchFileException

2021-04-26 Thread Hailu, Andreas [Engineering]
Hi Arvid, thanks for the reply.

Our stores are world-readable, so I don’t think that it’s an access issue. All 
of our clients have the stores present through a shared mount as well. I’m able 
to see the shipped stores in the directory.info output when pulling the YARN 
logs, and can confirm the account submitting the application has correct 
privileges.

The exception I shared occurs during the cluster deployment phase. Here’s the 
full stacktrace:

2021-04-26 13:37:17,468 [main] ERROR ClusterEntrypoint - Could not start 
cluster entrypoint YarnSessionClusterEntrypoint.
org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint YarnSessionClusterEntrypoint.
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
at 
org.apache.flink.yarn.entrypoint.YarnSessionClusterEntrypoint.main(YarnSessionClusterEntrypoint.java:93)
Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentF
actory.java:257)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
... 2 more
Caused by: org.apache.flink.util.ConfigurationException: Failed to initialize 
SSLEngineFactory for REST server endpoint.
at 
org.apache.flink.runtime.rest.RestServerEndpointConfiguration.fromConfiguration(RestServerEndpointConfiguration.java:162)
at 
org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:54)
at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentF
actory.java:150)
... 9 more
Caused by: java.nio.file.NoSuchFileException: 
/home/user/ssl/deploy-keys/rest.keystore
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at 
java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at 
org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.java:266)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:392)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:365)
   at 
org.apache.flink.runtime.net.SSLUtils.createRestServerSSLEngineFactory(SSLUtils.java:163)
at 
org.apache.flink.runtime.rest.RestServerEndpointConfiguration.fromConfiguration(RestServerEndpointConfiguration.java:160)
... 11 more

Given the number of machines in our YARN compute cluster, we'd really like to 
avoid having to copy the stores to each machine, as that would add 
another configuration step each time a machine is replaced, added, etc. The 
YARN shipping feature is really what we need.

The documentation [1] says that we should be able to ship the stores directly 
from our client:

flink run -m yarn-cluster -yt deploy-keys/ flinkapp.jar

But it doesn’t provide an example of the requisite change made in the 
flink-conf.yaml that supports shipped stores.

If we consider that we have the stores available in a local directory called 
/home/user/ssl/deploy-keys/, and we're shipping the directory through the -yt 
option, what do the values of:

1. security.ssl.rest.keystore

2. security.ssl.rest.truststore

need to be in order for this to work? Happy to share our failed application's 
YARN logs with you if you require them.

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/security-ssl.html#tips-for-yarn--mesos-deployment

// ah

From: Arvid Heise 
Sent: Wednesday, April 21, 2021 1:05 PM
To: Hailu, Andreas [Engineering

[1.9.2] Flink SSL on YARN - NoSuchFileException

2021-04-19 Thread Hailu, Andreas [Engineering]
Hi Flink team,

I'm trying to configure Flink on YARN with SSL enabled. I've followed the 
documentation's instructions [1] to generate a Keystore and Truststore locally, 
and added the properties to my flink-conf.yaml.
security.ssl.rest.keystore: /home/user/ssl/deploy-keys/rest.keystore
security.ssl.rest.truststore: /home/user/ssl/deploy-keys/rest.truststore

I've also added the yarnship option so that the keystore and truststore are 
deployed as suggested in [1].

-m yarn-cluster --class  [...] -yt /home/user/ssl/deploy-keys/

However, starting the Flink cluster results in a NoSuchFileException,
Caused by: java.nio.file.NoSuchFileException: 
/home/user/ssl/deploy-keys/rest.keystore
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at 
java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at 
org.apache.flink.runtime.net.SSLUtils.getKeyManagerFactory(SSLUtils.java:266)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:392)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:365)
at 
org.apache.flink.runtime.net.SSLUtils.createRestServerSSLEngineFactory(SSLUtils.java:163)
at 
org.apache.flink.runtime.rest.RestServerEndpointConfiguration.fromConfiguration(RestServerEndpointConfiguration.java:160)

I'm able to see in launch_container.sh that the shipped directory was able to 
be created successfully:

mkdir -p deploy-keys
ln -sf 
"/fs/htmp/yarn/local/usercache/delp/appcache/application_1618711298408_2664/filecache/16/rest.truststore"
 "deploy-keys/rest.truststore"
mkdir -p deploy-keys
ln -sf 
"/fs/htmp/yarn/local/usercache/delp/appcache/application_1618711298408_2664/filecache/13/rest.keystore"
 "deploy-keys/rest.keystore"

So given the above logs, I tried editing flink-conf.yaml to reflect what I saw:
security.ssl.rest.keystore: deploy-keys/rest.keystore
security.ssl.rest.truststore: deploy-keys/rest.truststore

But that didn't seem to work, either:
Caused by: java.nio.file.NoSuchFileException: deploy-keys/rest.truststore
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at 
sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
at java.nio.file.Files.newByteChannel(Files.java:361)
at java.nio.file.Files.newByteChannel(Files.java:407)
at 
java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
at java.nio.file.Files.newInputStream(Files.java:152)
at 
org.apache.flink.runtime.net.SSLUtils.getTrustManagerFactory(SSLUtils.java:233)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:397)
at 
org.apache.flink.runtime.net.SSLUtils.createRestNettySSLContext(SSLUtils.java:365)
at 
org.apache.flink.runtime.net.SSLUtils.createRestClientSSLEngineFactory(SSLUtils.java:181)
at 
org.apache.flink.runtime.rest.RestClientConfiguration.fromConfiguration(RestClientConfiguration.java:106)

What needs to be done to get the YARN application to point to the right 
keystore and truststore?

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/security-ssl.html#tips-for-yarn--mesos-deployment



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


Understanding blocking behavior

2021-02-16 Thread Hailu, Andreas [Engineering]
Hi folks, I'm trying to get a better understanding of what operations result in 
blocked partitions. I've got a batch-processing job that reads from 2 sources, 
and then performs a series of Maps/Filters/CoGroups all with the same 
parallelism to create a final DataSet to be written to two different Sinks.

The kind of Sink a record in the DataSet is written to is dependent on the 
record's properties, so we use a Map + Filter operation to just pull the 
desired records for the Sink. The latter portion of the graph looks like this:

DataSet -> Map + FilterA (with parallelism P) -> SinkA (with parallelism X)
DataSet -> Map + FilterB (with parallelism P) -> SinkB (with parallelism P-X)
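
In code the tail of the job is roughly the following (a simplified sketch; the 
record type, functions and output formats stand in for our actual ones, with P 
and X as above):

DataSet<Record> results = ...; // produced by the CoGroups

results.map(new PrepareForA()).filter(new BelongsToA()).setParallelism(P)
       .output(hadoopOutputFormatA).setParallelism(X);

results.map(new PrepareForB()).filter(new BelongsToB()).setParallelism(P)
       .output(hadoopOutputFormatB).setParallelism(P - X);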

Parallelisms for the output into SinkA and SinkB are different than the 
parallelism used in the Map + Filter operation in order to control the 
resulting total number of output files. What I observe is that all of the 
records must first be sent to the Map + Filter operators, and only after 
all records are received does the Sink begin to output records. This shows in the 
Flink Dashboard as the Sinks remaining in 'CREATED' states while the Map + 
Filter operators are 'RUNNING'. At scale, where the DataSet may contain 
billions of records, this ends up taking hours. Ideally, the records are 
streamed through to the Sink as they go through the Map + Filter.

Is this blocking behavior due to the fact that the Map + Filter operators must 
re-distribute the records as they're moving to an operator that has a lesser 
parallelism?



Andreas Hailu
Data Lake Engineering | Goldman Sachs




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: org.apache.flink.runtime.client.JobSubmissionException: Job has already been submitted

2021-01-22 Thread Hailu, Andreas [Engineering]
Hi Robert, I appreciate you having a look. I’ll have a closer look and see what 
I can find. Thanks!


// ah

From: Robert Metzger 
Sent: Friday, January 22, 2021 2:41 AM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: org.apache.flink.runtime.client.JobSubmissionException: Job has 
already been submitted

Hey Andreas,
thanks a lot for providing me with the full logs.

The JobManager actually received 2 job submissions.
There are 2 relevant log messages.
1. "Received JobGraph submission xxx"
2. "Submitting job"
1 is logged right after the dispatcher receives the JobGraph, before the 
duplicate submission check is done. 2 is logged once we know that there is no 
duplicate job.

We have the following log messages (which you actually posted in here on the 
list already)

TYPE 1: 2021-01-20 14:06:58,225 INFO  [flink-akka.actor.default-dispatcher-91] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Received 
JobGraph submission 8e1c2fdd68feee100d8fee003efef3e2 (Flink Java Job at Wed Jan 
20 14:01:42 EST 2021).
TYPE 2: 2021-01-20 14:06:58,225 INFO  [flink-akka.actor.default-dispatcher-91] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Submitting job 
8e1c2fdd68feee100d8fee003efef3e2 (Flink Java Job at Wed Jan 20 14:01:42 EST 
2021).
TYPE 1: 2021-01-20 14:08:45,199 INFO  [flink-akka.actor.default-dispatcher-30] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Received 
JobGraph submission 8e1c2fdd68feee100d8fee003efef3e2 (Flink Java Job at Wed Jan 
20 14:01:42 EST 2021).

2021-01-20 14:09:19,981 INFO  [flink-akka.actor.default-dispatcher-90] 
org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the 
JobMaster for job Flink Java Job at Wed Jan 20 14:01:42 EST 
2021(8e1c2fdd68feee100d8fee003efef3e2).


So at 14:06, you are submitting the job, at 14:09 it fails.

In between (14:08) you are trying to submit the job again, which gets 
(rightfully) rejected. It seems that the second submission didn't get logged 
properly in your client.
I don't think there's a bug on Flink's side of things.


On Thu, Jan 21, 2021 at 7:17 PM Hailu, Andreas [Engineering] 
mailto:andreas.ha...@gs.com>> wrote:
Hi Robert,

I sent you an email with instructions to create an account to view the logs 
through our secure repository. I’ve included the JobManager and client 
application logs there.

We have a thread pool that we use to submit multiple jobs in parallel, but 
there's no retry logic in there – if any one job fails, it's an overall failure 
for the entire application. In regards to the timespan between when the job was 
logged to have been initially submitted from the client app logs and when the 
JobManager logs it as being received – we’re submitting a large number of jobs 
as a part of this application. Is it possible that it’s busy processing other 
jobs?

// ah

From: Robert Metzger mailto:rmetz...@apache.org>>
Sent: Thursday, January 21, 2021 10:00 AM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: org.apache.flink.runtime.client.JobSubmissionException: Job has 
already been submitted

Thanks a lot for your message.

Why is there a difference of 5 minutes between the timestamp of the job 
submission from the client to the timestamp on the JobManager where the 
submission is received?
Is there any service / custom logic involved in the job submission? (e.g. a 
proxy in between, that has some retry mechanism, or some custom code that does 
retries?)

Could you provide the full JobManager logs of that timeframe, not just those 
messages filtered for 8e1c2fdd68feee100d8fee003efef3e2?

On Wed, Jan 20, 2021 at 10:20 PM Hailu, Andreas [Engineering] 
mailto:andreas.ha...@gs.com>> wrote:
Hello,

We’re running 1.9.2 on YARN, and are seeing some interesting behavior when 
submitting jobs in a multi-threaded fashion to an application’s Flink cluster. 
The error we see reported in the client application logs is the following:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 8e1c2fdd68feee100d8fee003efef3e2)
   at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
   at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 

RE: org.apache.flink.runtime.client.JobSubmissionException: Job has already been submitted

2021-01-21 Thread Hailu, Andreas [Engineering]
Hi Robert,

I sent you an email with instructions to create an account to view the logs 
through our secure repository. I’ve included the JobManager and client 
application logs there.

We have a thread pool that we use to submit multiple jobs in parallel, but 
there's no retry logic in there – if any one job fails, it's an overall failure 
for the entire application. In regards to the timespan between when the job was 
logged to have been initially submitted from the client app logs and when the 
JobManager logs it as being received – we’re submitting a large number of jobs 
as a part of this application. Is it possible that it’s busy processing other 
jobs?

// ah

From: Robert Metzger 
Sent: Thursday, January 21, 2021 10:00 AM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: org.apache.flink.runtime.client.JobSubmissionException: Job has 
already been submitted

Thanks a lot for your message.

Why is there a difference of 5 minutes between the timestamp of the job 
submission from the client to the timestamp on the JobManager where the 
submission is received?
Is there any service / custom logic involved in the job submission? (e.g. a 
proxy in between, that has some retry mechanism, or some custom code that does 
retries?)

Could you provide the full JobManager logs of that timeframe, not just those 
messages filtered for 8e1c2fdd68feee100d8fee003efef3e2?

On Wed, Jan 20, 2021 at 10:20 PM Hailu, Andreas [Engineering] 
mailto:andreas.ha...@gs.com>> wrote:
Hello,

We’re running 1.9.2 on YARN, and are seeing some interesting behavior when 
submitting jobs in a multi-threaded fashion to an application’s Flink cluster. 
The error we see reported in the client application logs is the following:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 8e1c2fdd68feee100d8fee003efef3e2)
   at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
   at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
submit JobGraph.
   at 
org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:391)
   at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
   at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
   at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
   at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
   at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)
   at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
   at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
   at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
   at 
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
   at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
   at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
   ... 3 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal 
server error., http://fl...@d43723-191.dc.gs.com:37966/user/jobmanager_124>.
2021-01-20 14:07:29,821 INFO  [flink-akka.actor.default-dispatcher-64] 
org.apache.flink.runtime.jobmaster.JobMaster  - Starting 
execution of job Flink Java Job at Wed Jan 20 14:01:42 EST 2021 
(8e1c2fdd68feee100d8fee003efef3e2) under job master id 
.
2021-01-20 14:07:29,821 INFO  [flink-akka.actor.default-dispatcher-64] 
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink Java 
Job at Wed Jan 20 14:01:42 EST 2021 (8e1c2fdd68feee100d8fee003efef3e2) switched 
from state CREATED to RUNNING.
2021-01-20 14:07:29,822 INFO  [flink-akka.actor.default-dispatcher-2] 
org.apache.flink.yarn.YarnResourceManager - Registering job

org.apache.flink.runtime.client.JobSubmissionException: Job has already been submitted

2021-01-20 Thread Hailu, Andreas [Engineering]
Hello,

We're running 1.9.2 on YARN, and are seeing some interesting behavior when 
submitting jobs in a multi-threaded fashion to an application's Flink cluster. 
The error we see reported in the client application logs is the following:

org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 8e1c2fdd68feee100d8fee003efef3e2)
   at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
   at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
   at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
submit JobGraph.
   at 
org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:391)
   at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
   at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
   at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
   at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
   at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:263)
   at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
   at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
   at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
   at 
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
   at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
   at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
   ... 3 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal 
server error., mailto:0...@akka.tcp://fl...@d43723-191.dc.gs.com:37966/user/jobmanager_124%20for%20job%208e1c2fdd68feee100d8fee003efef3e2>.
2021-01-20 14:07:29,822 INFO  [flink-akka.actor.default-dispatcher-64] 
org.apache.flink.yarn.YarnResourceManager - Request slot 
with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, 
directMemoryInMB=-1, nativeMemoryInMB=-1, networkMemoryInMB=-1, 
managedMemoryInMB=-1} for job 8e1c2fdd68feee100d8fee003efef3e2 with allocation 
id 5bca3bde577f93169928e04749b45343.
2021-01-20 14:08:45,199 INFO  [flink-akka.actor.default-dispatcher-30] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Received 
JobGraph submission 8e1c2fdd68feee100d8fee003efef3e2 (Flink Java Job at Wed Jan 
20 14:01:42 EST 2021).
2021-01-20 14:09:19,981 INFO  [flink-akka.actor.default-dispatcher-90] 
org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the 
JobMaster for job Flink Java Job at Wed Jan 20 14:01:42 EST 
2021(8e1c2fdd68feee100d8fee003efef3e2).
2021-01-20 14:09:19,982 INFO  [flink-akka.actor.default-dispatcher-90] 
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink Java 
Job at Wed Jan 20 14:01:42 EST 2021 (8e1c2fdd68feee100d8fee003efef3e2) switched 
from state RUNNING to FAILING.

It would appear that for job ID 8e1c2fdd68feee100d8fee003efef3e2, the cluster 
somehow received the submission request twice? The client log only shows 
a single submission for this job:

2021-01-20 14:01:55,775 [ProductHistory-18359] INFO  RestClusterClient - 
Submitting job 8e1c2fdd68feee100d8fee003efef3e2 (detached: false).

So while the job is submitted a single time, the dispatcher somehow tries to 
perform two submissions resulting in a failure. How does this happen?



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: Distribute Parallelism/Tasks within RichOutputFormat?

2020-12-23 Thread Hailu, Andreas [Engineering]
Thanks Chesnay, Flavio – I believe Flavio’s first recommendation will work well 
enough. I agree that the second approach may be a bit finicky to use long-term.

Cheers.

// ah

From: Chesnay Schepler 
Sent: Wednesday, December 23, 2020 4:07 AM
To: Flavio Pompermaier ; Hailu, Andreas [Engineering] 

Cc: user@flink.apache.org
Subject: Re: Distribute Parallelism/Tasks within RichOutputFormat?

Essentially I see 2 options here:
a) split your output format such that each format is its own sink, and then 
follow Flavio's suggestion to filter the stream and apply each sink to one of 
the streams, with the respective parallelism. This would be the recommended 
approach.
b) modify your (custom?) output format to only create one of the Hadoop output 
formats within open() based on the subtask index, and apply a custom 
partitioner onto the input datastream that routes the elements based on the 
conditions to the respective subtasks. I would not recommend this though, 
because it could be quite a headache maintenance-wise.
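
A rough sketch of option (a), extending Flavio's snippet below with the sinks 
attached (sink implementations, predicates and parallelism values are placeholders):

DataStream<Record> ds = ...;
ds.filter(r -> matchesConditionA(r))
  .addSink(new SinkA()).setParallelism(7);
ds.filter(r -> !matchesConditionA(r))
  .addSink(new SinkB()).setParallelism(3);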

On 12/23/2020 9:53 AM, Flavio Pompermaier wrote:
I'm not an expert on the streaming APIs, but you could try to do something like 
this:

DataStream ds = null;
DataStream ds1 = ds.filter(...).setParallelism(3);
DataStream ds2 = ds.filter(...).setParallelism(7);

Could it fit your needs?

Best,
Flavio

On Wed, Dec 23, 2020 at 3:54 AM Hailu, Andreas [Engineering] 
mailto:andreas.ha...@gs.com>> wrote:
Hi folks,

I’ve got a single RichOutputFormat which is comprised of two 
HadoopOutputFormats, let’s call them A and B, each writing to different HDFS 
directories. If a Record matches a certain condition it’s written using A, 
otherwise it’s written with B. Currently, the parallelism that is set at the 
RichOutputFormat seems to propagates to both A & B – meaning if the parallelism 
set on the RichOutputFormat is 10, output A and B create 10 files even if A 
receives all the records and B receives none.

My app has knowledge about the ratio of records it expects will be sent to 
output A vs output B, and I would ideally like to pass that down through the 
RichOutputFormat. Meaning that if we have a parallelism of 10, and know that 
70% of the Records being sent go to A, I would like to supply A with a 
parallelism of 7 and B with 3.

I’m curious because the current approach can lead to lots of redundant empty 
files, and I’d like to minimize that if possible. Is something like this 
supported?



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>





Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


Distribute Parallelism/Tasks within RichOutputFormat?

2020-12-22 Thread Hailu, Andreas [Engineering]
Hi folks,

I've got a single RichOutputFormat which is comprised of two 
HadoopOutputFormats, let's call them A and B, each writing to different HDFS 
directories. If a Record matches a certain condition it's written using A, 
otherwise it's written with B. Currently, the parallelism that is set at the 
RichOutputFormat seems to propagate to both A & B - meaning if the parallelism 
set on the RichOutputFormat is 10, output A and B create 10 files even if A 
receives all the records and B receives none.

My app has knowledge about the ratio of records it expects will be sent to 
output A vs output B, and I would ideally like to pass that down through the 
RichOutputFormat. Meaning that if we have a parallelism of 10, and know that 
70% of the Records being sent go to A, I would like to supply A with a 
parallelism of 7 and B with 3.

I'm curious because the current approach can lead to lots of redundant empty 
files, and I'd like to minimize that if possible. Is something like this 
supported?



Andreas Hailu
Data Lake Engineering | Goldman Sachs & Co.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

2020-10-26 Thread Hailu, Andreas
Hi Leonard, Chesnay, thanks for having a look. I was able to sort this out - it 
was caused by the default class loading policy changing to child-first, 
introduced in 1.10 through https://issues.apache.org/jira/browse/FLINK-13749. 
Once I changed it back to parent-first, I was able to submit jobs.

Hopefully any other devs who have similar issues will find this thread useful :)
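(For anyone searching later: the flink-conf.yaml setting in question should be 
classloader.resolve-order, i.e. classloader.resolve-order: parent-first.)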

// ah

From: Leonard Xu 
Sent: Friday, October 16, 2020 1:10 AM
To: Chesnay Schepler 
Cc: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

Hi, Chesnay

@Leonard I noticed you handled a similar case on the Chinese ML in July 
<http://apache-flink.147419.n8.nabble.com/flink1-11-td5154.html>, do you have any insights?

The case on the Chinese ML is that the user added jakarta.ws.rs-api-3.0.0-M1.jar 
to Flink/lib, which led to the dependency conflicts; Hailu's case looks different.

Hi @Hailu,
The Hadoop dependency jersey-core-1.9.jar contains the class javax.ws.rs.RuntimeDelegate, 
and the dependency javax.ws.rs:javax.ws.rs-api in your shaded jar also contains the class 
javax.ws.rs.RuntimeDelegate. I suspect the ClassCastException comes from here.

Best
Leonard





On 10/15/2020 7:51 PM, Hailu, Andreas wrote:
Hi Chesnay, no, we haven't changed our Hadoop version. The only changes were 
updating to the 1.11.2 runtime dependencies listed earlier, as well as compiling 
with flink-clients in some of our modules since we were relying on the 
transitive dependency. Our 1.9.2 jobs are still able to run just fine, which is 
interesting.

// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Thursday, October 15, 2020 7:34 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

I'm not aware of any Flink module bundling this class. Note that this class is 
also bundled in jersey-core (which is also on your classpath), so it appears 
that there is a conflict between this jar and your shaded one.
Have you changed the Hadoop version you are using or how you provide them to 
Flink?
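
A quick way to check which jars on your classpath bundle the class (paths are 
placeholders):

unzip -l flink-ingest-refiner-*-fat-shaded.jar | grep 'javax/ws/rs/ext/RuntimeDelegate'
unzip -l jersey-core-1.9.jar | grep 'javax/ws/rs/ext/RuntimeDelegate'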

On 10/14/2020 6:56 PM, Hailu, Andreas wrote:
Hi team! We're trying to upgrade our applications from 1.9.2 to 1.11.2. After 
re-compiling and updating our runtime dependencies to use 1.11.2, we see this 
LinkageError:

Caused by: java.lang.LinkageError: ClassCastException: attempting to 
cast jar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
 to 
jar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
at 
javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:146) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at 
javax.ws.rs.ext.RuntimeDelegate.getInstance(RuntimeDelegate.java:120) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at javax.ws.rs.core.MediaType.valueOf(MediaType.java:179) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.core.header.MediaTypes.<clinit>(MediaTypes.java:64) 
~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
 ~[jersey-core-1.9.jar:1.9]
at com.sun.jersey.api.client.Client.init(Client.java:342) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client.access$000(Client.java:118) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.C

RE: Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

2020-10-15 Thread Hailu, Andreas
Hi Chesnay, no, we haven't changed our Hadoop version. The only changes were 
updating to the 1.11.2 runtime dependencies listed earlier, as well as compiling 
with flink-clients in some of our modules since we were relying on the 
transitive dependency. Our 1.9.2 jobs are still able to run just fine, which is 
interesting.

// ah

From: Chesnay Schepler 
Sent: Thursday, October 15, 2020 7:34 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

I'm not aware of any Flink module bundling this class. Note that this class is 
also bundled in jersey-core (which is also on your classpath), so it appears 
that there is a conflict between this jar and your shaded one.
Have you changed the Hadoop version you are using or how you provide them to 
Flink?

On 10/14/2020 6:56 PM, Hailu, Andreas wrote:
Hi team! We're trying to upgrade our applications from 1.9.2 to 1.11.2. After 
re-compiling and updating our runtime dependencies to use 1.11.2, we see this 
LinkageError:

Caused by: java.lang.LinkageError: ClassCastException: attempting to 
cast jar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
 to 
jar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
at 
javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:146) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at 
javax.ws.rs.ext.RuntimeDelegate.getInstance(RuntimeDelegate.java:120) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at javax.ws.rs.core.MediaType.valueOf(MediaType.java:179) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.core.header.MediaTypes.<clinit>(MediaTypes.java:64) 
~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
 ~[jersey-core-1.9.jar:1.9]
at com.sun.jersey.api.client.Client.init(Client.java:342) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client.access$000(Client.java:118) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client$1.f(Client.java:191) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client$1.f(Client.java:187) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193) 
~[jersey-core-1.9.jar:1.9]
at com.sun.jersey.api.client.Client.<init>(Client.java:187) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client.<init>(Client.java:170) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:285)
 ~[hadoop-yarn-common-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
~[hadoop-common-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:355)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:331)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1002)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:524)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:424)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:973)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:124)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:72)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]

I'll note that the flink-ingest-refiner jar

Runtime Dependency Issues Upgrading to Flink 1.11.2 from 1.9.2

2020-10-14 Thread Hailu, Andreas
Hi team! We're trying to upgrade our applications from 1.9.2 to 1.11.2. After 
re-compiling and updating our runtime dependencies to use 1.11.2, we see this 
LinkageError:

Caused by: java.lang.LinkageError: ClassCastException: attempting to 
castjar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
 to 
jar:file:/local/data/scratch/hailua_p2epdlsuat/flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar!/javax/ws/rs/ext/RuntimeDelegate.class
at 
javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:146) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at 
javax.ws.rs.ext.RuntimeDelegate.getInstance(RuntimeDelegate.java:120) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at javax.ws.rs.core.MediaType.valueOf(MediaType.java:179) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.core.header.MediaTypes.(MediaTypes.java:64) 
~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:182)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.initReaders(MessageBodyFactory.java:175)
 ~[jersey-core-1.9.jar:1.9]
at 
com.sun.jersey.core.spi.factory.MessageBodyFactory.init(MessageBodyFactory.java:162)
 ~[jersey-core-1.9.jar:1.9]
at com.sun.jersey.api.client.Client.init(Client.java:342) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client.access$000(Client.java:118) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client$1.f(Client.java:191) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client$1.f(Client.java:187) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193) 
~[jersey-core-1.9.jar:1.9]
at com.sun.jersey.api.client.Client.(Client.java:187) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at com.sun.jersey.api.client.Client.(Client.java:170) 
~[flink-ingest-refiner-sandbox-SNAPSHOT-fat-shaded.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceInit(TimelineClientImpl.java:285)
 ~[hadoop-yarn-common-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) 
~[hadoop-common-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:355)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:331)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
 ~[hadoop-yarn-client-2.7.3.2.6.3.0-235.jar:?]
at 
org.apache.flink.yarn.YarnClusterDescriptor.startAppMaster(YarnClusterDescriptor.java:1002)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployInternal(YarnClusterDescriptor.java:524)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:424)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:70)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:973)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:124)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:72)
 ~[flink-dist_2.11-1.11.2.jar:1.11.2]

I'll note that the flink-ingest-refiner jar is our shaded JAR application that 
we use to submit jobs.
Looking into what dependencies have changed, on 1.9.2 our runtime dependencies 
from the available artifacts (sourced from one of the many mirrors)  are:

1.   flink-dist_2.11-1.9.2.jar

2.   flink-table-blink_2.11-1.9.2.jar

3.   flink-table_2.11-1.9.2.jar

4.   log4j-1.2.17.jar

5.   slf4j-log4j12-1.7.15.jar

Whereas 1.11.2's dependencies are:

1.   flink-dist_2.11-1.11.2.jar

2.   flink-table-blink_2.11-1.11.2.jar

3.   flink-table_2.11-1.11.2.jar

4.   log4j-1.2-api-2.12.1.jar

5.   log4j-api-2.12.1.jar

6.   log4j-core-2.12.1.jar

7.   log4j-slf4j-impl-2.12.1.jar

RuntimeDelegate comes from the javax.ws.rs:javax.ws.rs-api module, which we use 
internally for some of our REST implementations. I've been trying a few things 
here to no avail such as declaring our 

RE: Blobserver dying mid-application

2020-10-01 Thread Hailu, Andreas
@Chesnay:
I see. I actually had a separate thread with Robert Metzger a while ago regarding 
connection-related issues we’re plagued with at higher parallelisms, and his 
guidance lead us to look into our somaxconn config. We increased it from 128 to 
1024 in early September. We use the same generic JAR for all of our apps, so I 
don’t think JAR size is the cause. Just so I’m clear: when you say Flink 
session cluster – if we have 2 independent Flink applications  A & B with 
JobManagers that just happen to be running on the same DataNode, they don’t 
share Blobservers, right?

In regard to historical behavior, no, I haven’t seen these Blobserver 
connection problems until after the somaxconn config change. From an app 
perspective, the only way these ones are different is that they’re wide rather 
than deep i.e. large # of jobs to submit instead of a small handful of jobs 
with large amounts of data to process. If we have many jobs to submit, as soon 
as one is complete, we’re trying to submit the next.

I saw an example today of an application using 10 TaskManagers w/ 2 slots with 
a total 194 jobs to submit with at most 20 running in parallel fail with the 
same error. I’m happy to try increasing both the concurrent connections and 
backlog to 128 and 2048 respectively, but I still can’t make sense of how a 
backlog of 1,000 connections is being met by 10 Task Managers/20 connections at 
worst.

$ sysctl -a | grep net.core.somaxconn
net.core.somaxconn = 1024
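
For reference, the flink-conf.yaml change I have in mind is along these lines - just a sketch, assuming blob.fetch.num-concurrent and blob.fetch.backlog are the right knobs for the Blobserver:

blob.fetch.num-concurrent: 128   # up from the default of 50
blob.fetch.backlog: 2048         # up from the default of 1000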

// ah

From: Chesnay Schepler 
Sent: Thursday, October 1, 2020 1:41 PM
To: Hailu, Andreas [Engineering] ; Till Rohrmann 

Cc: user@flink.apache.org; Nico Kruber 
Subject: Re: Blobserver dying mid-application

All jobs running in a Flink session cluster talk to the same blob server.

The time when tasks are submitted depends on the job; for streaming jobs all 
tasks are deployed when the job starts running; in case of batch jobs the 
submission can be staggered.

I'm only aware of 2 cases where we transfer data via the blob server:
a) retrieval of jars required for the user code to run  (this is what you see 
in the stack trace)
b) retrieval of TaskInformation, which _should_ only happen if your job is 
quite large, but let's assume it does.

For a) there should be at most numberOfSlots * numberOfTaskExecutors concurrent 
connections, in the worst case of each slot working on a different job, as each 
would download the jars for their respective job. If multiple slots are used 
for the same job at the same time, then the job jar is only retrieved once.

For b) the limit should also be numberOfSlots * numberOfTaskExecutors; it is 
done once per task, and there are only so many tasks that can run at the same 
time.

Thus from what I can tell there should be at most 104 (26 task executors * 2 
slots * 2) concurrent attempts, of which only 54 should land in the backlog.

Did you run into this issue before?
If not, is this application different than your existing applications? Is the 
jar particularly big, the jobs particularly short-running, or more complex than 
others?

One thing to note is that the backlog relies entirely on OS functionality, 
which can be subject to an upper limit enforced by the OS.
The configured backlog size is just a hint to the OS, but it may ignore it; it 
appears that 128 is not an uncommon upper limit, but maybe there are lower 
settings out there.
You can check this limit via sysctl -a | grep net.core.somaxconn
Maybe this value is set to 0, effectively disabling the backlog?

It may also be worthwhile to monitor the number of such connections. (netstat 
-ant | grep -c SYN_REC)

@Nico Do you have any ideas?

On 10/1/2020 6:26 PM, Hailu, Andreas wrote:
Hi Chesnay, Till, thanks for responding.

@Chesnay:
Apologies, I said cores when I meant slots ☺ So a total of 26 Task managers 
with 2 slots each for a grand total of 52 parallelism.

@Till:
For this application, we have a grand total of 78 jobs, with some of them 
demanding more parallelism than others. Each job has multiple operators – 
depending on the size of the data we’re operating on, we could submit 1 whopper 
with 52 parallelism, or multiple smaller jobs submitted in parallel with a sum 
of 52 parallelism. When does a task submission to a `TaskExecutor` take place? 
Is that on job submission or something else? I’m just curious as a parallelism 
of 52 seems on the lower side to breach 1K connections in the queue, unless 
interactions with the Blobserver are much more frequent than I think. Is it 
possible that separate Flink jobs share the same Blobserver? Because we have 
thousands of Flink applications running concurrently in our YARN cluster.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Thursday, October 1, 2020 5:42 AM
To: Till Rohrmann <trohrm...@apache.org>; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
Cc: user@flink.apache.org
Subject: Re: Blobserver dying mi

Blobserver dying mid-application

2020-09-30 Thread Hailu, Andreas
Hello folks, I'm seeing application failures where our Blobserver is refusing 
connections mid application:

2020-09-30 13:56:06,227 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.TaskExecutor- Un-registering 
task and sending final execution state FINISHED to JobManager for task DataSink 
(TextOutputFormat 
(hdfs:/user/p2epda/lake/delp_prod/PROD/APPROVED/staging/datastore/MandateTradingLine/tmp_pipeline_collapse)
 - UTF-8) 3d1890b47f4398d805cf0c1b54286f71.
2020-09-30 13:56:06,423 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable  - Free slot 
TaskSlot(index:0, state:ACTIVE, resource profile: 
ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
networkMemoryInMB=2147483647, managedMemoryInMB=3046}, allocationId: 
e8be16ec74f16c795d95b89cd08f5c37, jobId: e808de0373bd515224434b7ec1efe249).
2020-09-30 13:56:06,424 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.JobLeaderService- Remove job 
e808de0373bd515224434b7ec1efe249 from job leader monitoring.
2020-09-30 13:56:06,424 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.TaskExecutor- Close 
JobManager connection for job e808de0373bd515224434b7ec1efe249.
2020-09-30 13:56:06,426 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.TaskExecutor- Close 
JobManager connection for job e808de0373bd515224434b7ec1efe249.
2020-09-30 13:56:06,426 INFO  [flink-akka.actor.default-dispatcher-18] 
org.apache.flink.runtime.taskexecutor.JobLeaderService- Cannot 
reconnect to job e808de0373bd515224434b7ec1efe249 because it is not registered.
2020-09-30 13:56:09,918 INFO  [CHAIN DataSource (dataset | Read Staging From 
File System | AVRO) -> Map (Map at 
readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at 
validateData(DAXTask.java:97)) -> FlatMap (FlatMap at 
handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at 
collapsePipelineIfRequired(Task.java:160)) (1/1)] 
org.apache.flink.runtime.blob.BlobClient  - Downloading 
48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39
 from d43723-430.dc.gs.com/10.48.128.14:46473 (retry 3)
2020-09-30 13:56:09,920 ERROR [CHAIN DataSource (dataset | Read Staging From 
File System | AVRO) -> Map (Map at 
readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at 
validateData(DAXTask.java:97)) -> FlatMap (FlatMap at 
handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at 
collapsePipelineIfRequired(Task.java:160)) (1/1)] 
org.apache.flink.runtime.blob.BlobClient  - Failed to fetch 
BLOB 
48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39
 from d43723-430.dc.gs.com/10.48.128.14:46473 and store it under 
/fs/htmp/yarn/local/usercache/delp_prod/appcache/application_1599723434208_15328880/blobStore-e2888df1-c7be-4ce6-b6b6-58e7c24a140a/incoming/temp-0004
 Retrying...
2020-09-30 13:56:09,920 INFO  [CHAIN DataSource (dataset | Read Staging From 
File System | AVRO) -> Map (Map at 
readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at 
validateData(DAXTask.java:97)) -> FlatMap (FlatMap at 
handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at 
collapsePipelineIfRequired(Task.java:160)) (1/1)] 
org.apache.flink.runtime.blob.BlobClient  - Downloading 
48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39
 from d43723-430.dc.gs.com/10.48.128.14:46473 (retry 4)
2020-09-30 13:56:09,922 ERROR [CHAIN DataSource (dataset | Read Staging From 
File System | AVRO) -> Map (Map at 
readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at 
validateData(DAXTask.java:97)) -> FlatMap (FlatMap at 
handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at 
collapsePipelineIfRequired(Task.java:160)) (1/1)] 
org.apache.flink.runtime.blob.BlobClient  - Failed to fetch 
BLOB 
48b8ba9f3de039f74483085f90e5ad64/p-0b27cb203799adb76d2a434ed3d64052d832cff3-6915d0cd0fef97f728cd890986b2bf39
 from d43723-430.dc.gs.com/10.48.128.14:46473 and store it under 
/fs/htmp/yarn/local/usercache/delp_prod/appcache/application_1599723434208_15328880/blobStore-e2888df1-c7be-4ce6-b6b6-58e7c24a140a/incoming/temp-0004
 Retrying...
2020-09-30 13:56:09,922 INFO  [CHAIN DataSource (dataset | Read Staging From 
File System | AVRO) -> Map (Map at 
readAvroFileWithFilter(FlinkReadUtils.java:82)) -> Filter (Filter at 
validateData(DAXTask.java:97)) -> FlatMap (FlatMap at 
handleBloomFilter(PreMergeTask.java:187)) -> FlatMap (FlatMap at 
collapsePipelineIfRequired(Task.java:160)) (1/1)] 

RE: JobManager refusing connections when running many jobs in parallel?

2020-08-19 Thread Hailu, Andreas
Hi Robert, following up - I suppose the question distills into: how would 
tuning a timeout resolve connection refusals? I would think timeout-related 
failures may go down if the network is hammered. Connection Refused sounds like 
we’re out of threads or sockets somewhere, no? We’re testing out an increase in 
our sockets’ max connections, but I would like to know your thoughts.
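
For context, the change we're testing is along these lines (a sketch - the exact value is still up for debate on our side, and it would also need to be persisted in /etc/sysctl.conf):

$ sysctl -w net.core.somaxconn=1024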

// ah

From: Hailu, Andreas [Engineering]
Sent: Monday, August 17, 2020 9:51 AM
To: 'Robert Metzger' 
Cc: user@flink.apache.org; Shah, Siddharth [Engineering] 

Subject: RE: JobManager refusing connections when running many jobs in parallel?

Interesting – what is the JobManager submission bounded by? Does it only allow 
a certain number of submissions per second, or is there a number of threads it 
accepts?

// ah

From: Robert Metzger <rmetz...@apache.org>
Sent: Tuesday, August 11, 2020 4:46 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
Cc: user@flink.apache.org; Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
Subject: Re: JobManager refusing connections when running many jobs in parallel?

Thanks for checking.

Your analysis sounds correct. The JM is busy processing job submissions, 
resulting in other submissions not being accepted.

Increasing rest.connection-timeout should resolve your problem.
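
Something along these lines in flink-conf.yaml should do it (just a sketch - the value is in milliseconds, and the right number depends on how long your submissions queue up):

rest.connection-timeout: 60000   # milliseconds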


On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Thanks for pointing this out. We had a look - the nodes in our cluster have a 
cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t 
believe this is the problem.

The connection refused error makes us think it’s some process using a thread 
pool for the JobManager hitting capacity on a port somewhere. This sound 
correct? Is there a config for us to increase the pool size?

From: Robert Metzger <rmetz...@apache.org>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: user@flink.apache.org; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?

Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of 
concurrent connections / sockets. Maybe this thread is helpful: 
https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections
 (there might be better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Hi team,

We’ve observed that when we submit a decent number of jobs in parallel from a 
single Job Master, we encounter job failures due to Connection Refused 
exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s 
seemingly transient, however, as upon several retries the job succeeds. The 
surface level error varies, but digging deeper in stack traces it looks to stem 
from the Job Manager no longer accepting connections.

I’ve included a couple of examples below from failed jobs’ driver logs, with 
different errors stemming from a connection refused error:

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task 
Manager memory - 30 jobs submitted in parallel, each with parallelism of 1
Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url "Job Manager Web Interface" (http://d43723-563.dc.gs.com:41268) 
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concu

RE: JobManager refusing connections when running many jobs in parallel?

2020-08-17 Thread Hailu, Andreas
Interesting – what is the JobManager submission bounded by? Does it only allow 
a certain number of submissions per second, or is there a number of threads it 
accepts?

// ah

From: Robert Metzger 
Sent: Tuesday, August 11, 2020 4:46 AM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org; Shah, Siddharth [Engineering] 

Subject: Re: JobManager refusing connections when running many jobs in parallel?

Thanks for checking.

Your analysis sounds correct. The JM is busy processing job submissions, 
resulting in other submissions not being accepted.

Increasing rest.connection-timeout should resolve your problem.


On Fri, Aug 7, 2020 at 1:59 AM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Thanks for pointing this out. We had a look - the nodes in our cluster have a 
cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t 
believe this is the problem.

The connection refused error makes us think it’s some process using a thread 
pool for the JobManager hitting capacity on a port somewhere. This sound 
correct? Is there a config for us to increase the pool size?

From: Robert Metzger <rmetz...@apache.org>
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: user@flink.apache.org; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?

Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of 
concurrent connections / sockets. Maybe this thread is helpful: 
https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections
 (there might be better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Hi team,

We’ve observed that when we submit a decent number of jobs in parallel from a 
single Job Master, we encounter job failures due to Connection Refused 
exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s 
seemingly transient, however, as upon several retries the job succeeds. The 
surface level error varies, but digging deeper in stack traces it looks to stem 
from the Job Manager no longer accepting connections.

I’ve included a couple of examples below from failed jobs’ driver logs, with 
different errors stemming from a connection refused error:

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task 
Manager memory - 30 jobs submitted in parallel, each with parallelism of 1
Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url "Job Manager Web Interface" (http://d43723-563.dc.gs.com:41268) 
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPro

Re: JobManager refusing connections when running many jobs in parallel?

2020-08-06 Thread Hailu, Andreas
Thanks for pointing this out. We had a look - the nodes in our cluster have a 
cap of 65K open files and we aren’t breaching 50% per metrics, so I don’t 
believe this is the problem.

The connection refused error makes us think it’s some process using a thread 
pool for the JobManager hitting capacity on a port somewhere. This sound 
correct? Is there a config for us to increase the pool size?

From: Robert Metzger 
Sent: Wednesday, July 29, 2020 1:52:53 AM
To: Hailu, Andreas [Engineering]
Cc: user@flink.apache.org; Shah, Siddharth [Engineering]
Subject: Re: JobManager refusing connections when running many jobs in parallel?

Hi Andreas,

Thanks for reaching out .. this should not happen ...
Maybe your operating system has configured low limits for the number of 
concurrent connections / sockets. Maybe this thread is helpful: 
https://stackoverflow.com/questions/923990/why-do-i-get-connection-refused-after-1024-connections
 (there might be better SO threads, I didn't put much effort into searching :) )

On Mon, Jul 27, 2020 at 6:31 PM Hailu, Andreas <andreas.ha...@gs.com> wrote:
Hi team,

We’ve observed that when we submit a decent number of jobs in parallel from a 
single Job Master, we encounter job failures due to Connection Refused 
exceptions. We’ve seen this behavior start at 30 jobs running in parallel. It’s 
seemingly transient, however, as upon several retries the job succeeds. The 
surface level error varies, but digging deeper in stack traces it looks to stem 
from the Job Manager no longer accepting connections.

I’ve included a couple of examples below from failed jobs’ driver logs, with 
different errors stemming from a connection refused error:

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task 
Manager memory - 30 jobs submitted in parallel, each with parallelism of 1
Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url "Job Manager Web Interface" (http://d43723-563.dc.gs.com:41268) 
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
at 
org.apach

JobManager refusing connections when running many jobs in parallel?

2020-07-27 Thread Hailu, Andreas
Hi team,

We've observed that when we submit a decent number of jobs in parallel from a 
single Job Master, we encounter job failures due to Connection Refused 
exceptions. We've seen this behavior start at 30 jobs running in parallel. It's 
seemingly transient, however, as upon several retries the job succeeds. The 
surface level error varies, but digging deeper in stack traces it looks to stem 
from the Job Manager no longer accepting connections.

I've included a couple of examples below from failed jobs' driver logs, with 
different errors stemming from a connection refused error:

First example: 15 Task Managers/2 cores/4096 Job Manager memory/12288 Task 
Manager memory - 30 jobs submitted in parallel, each with parallelism of 1
Job Manager is running @ d43723-563.dc.gs.com: Using job manager web tracking url "Job Manager Web Interface" (http://d43723-563.dc.gs.com:41268) 
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve 
the execution result. (JobID: 1dfef6303cf0e888231d4c57b4b4e0e6)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:255)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
...
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:273)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:341)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
... 1 more
Caused by: java.util.concurrent.CompletionException: 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 16 more
Caused by: 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: d43723-563.dc.gs.com/10.47.126.221:41268
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 6 more
Caused by: java.net.ConnectException: Connection refused

Second example: 30 Task Managers/2 cores/4096 Job Manager memory/12288 Task 
Manager memory - 60 jobs submitted in parallel, each with parallelism of 1
Job Manager is running @ d43723-484.dc.gs.com: Using job manager web tracking 
url 

RE: History Server Not Showing Any Jobs - File Not Found?

2020-07-11 Thread Hailu, Andreas
Thanks for the clarity. To this point you made:
(Note that by configuring "historyserver.web.tmpdir" to some permanent 
directory, subsequent (re)starts of the HistoryServer can re-use this 
directory; so you only have to download things once)

The HistoryServer process in fact deletes this local cache during its shutdown 
hook. Is there a setting we can use so that it doesn't do this?
2020-07-11 11:43:29,527 [HistoryServer shutdown hook] INFO  HistoryServer - 
Removing web dashboard root cache directory 
/local/scratch/flink_historyserver_tmpdir
2020-07-11 11:43:29,536 [HistoryServer shutdown hook] INFO  HistoryServer - 
Stopped history server.

We're attempting to work around the UI becoming unresponsive/crashing the 
browser with a large number of archives (in my testing, that's around 20,000 
archives with Chrome) by persisting the job IDs of our submitted apps and then 
navigating to the job overview page directly, e.g. 
http://(host):(port)/#/job/(jobId)/overview. It would have been really great if 
the server stored archives by the application ID rather than the job ID - 
particularly for apps that potentially submit hundreds of jobs. Tracking one 
application ID (ala Spark) would ease the burden on the dev + ops side. Perhaps 
a feature for the future :)

// ah

From: Chesnay Schepler 
Sent: Tuesday, June 2, 2020 3:55 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

1) It downloads all archives and stores them on disk; the only thing stored in 
memory is the job ID of the archive. There is no hard upper limit; it is mostly 
constrained by disk space / memory. I say mostly, because I'm not sure how well 
the WebUI handles 100k jobs being loaded into the overview.

2) No, there is no retention policy. It is currently expected that an external 
process cleans up archives. If an archive was deleted (from the archive 
directory) the HistoryServer does notice that and also deletes the local copy.
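
As a rough sketch of what such an external cleanup could look like (assuming all archives sit in a single HDFS directory; the path and the keep-count are placeholders, and GNU head is needed for the negative -n): sort the listing by modification time, keep the newest 100,000 archives, and delete the rest in batches:

hdfs dfs -ls /completed-jobs | grep "^-" | sort -k6,7 | head -n -100000 | awk '{print $8}' | xargs -r -n 100 hdfs dfs -rm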

On 01/06/2020 23:05, Hailu, Andreas wrote:
So I created a new HDFS directory with just 1 archive and pointed the server to 
monitor that directory, et voila - I'm able to see the applications in the UI. 
So it must have been really churning trying to fetch all of those initial 
archives :)

I have a couple of follow up questions if you please:

1.   What is the upper limit of the number of archives the history server 
can support? Does it attempt to download every archive and load them all into 
memory?

2.   Retention: we have on the order of 100K applications per day in our 
production environment. Is there any native retention policy? E.g. only keep 
the latest X archives in the dir - or is this something we need to manage 
ourselves?

Thanks.

// ah

From: Hailu, Andreas [Engineering]
Sent: Friday, May 29, 2020 8:46 AM
To: 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Yes, these are all in the same directory, and we're at 67G right now. I'll try 
with incrementally smaller directories and let you know what I find.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Friday, May 29, 2020 3:11 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

oh I'm not using the HistoryServer; I just wrote it ;)
Are these archives all in the same location? So we're roughly looking at 5 GB 
of archives then?

That could indeed "just" be a resource problem. The HistoryServer eagerly 
downloads all archives, and not on-demand.
The next step would be to move some of the archives into a separate HDFS 
directory and try again.

(Note that by configuring "historyserver.web.tmpdir" to some permanent 
directory, subsequent (re)starts of the HistoryServer can re-use this 
directory; so you only have to download things once)
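
For illustration, the kind of flink-conf.yaml setup being described (a sketch - the paths are placeholders, and the refresh interval is in milliseconds):

historyserver.archive.fs.dir: hdfs:///completed-jobs/                    # where job archives are read from
historyserver.web.tmpdir: /local/scratch/flink_historyserver_tmpdir     # permanent local cache, re-used across restarts
historyserver.archive.fs.refresh-interval: 10000                        # how often to poll for new archives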

On 29/05/2020 00:43, Hailu, Andreas wrote:
May I also ask what version of flink-hadoop you're using and the number of jobs 
you're storing the history for? As of writing we have roughly 101,000 
application history files. I'm curious to know if we're encountering some kind 
of resource problem.

// ah

From: Hailu, Andreas [Engineering]
Sent: Thursday, May 28, 2020 12:18 PM
To: 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Okay, I will look further to see if we're mistakenly using a version that's 
pre-2.6.0. However, I don't see flink-shaded-hadoop in my /lib directory for 
flink-1.9.1.

flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar

Are the files within /lib.

// ah

From: C

RE: History Server Not Showing Any Jobs - File Not Found?

2020-06-01 Thread Hailu, Andreas
So I created a new HDFS directory with just 1 archive and pointed the server to 
monitor that directory, et voila - I'm able to see the applications in the UI. 
So it must have been really churning trying to fetch all of those initial 
archives :)

I have a couple of follow up questions if you please:

1.  What is the upper limit of the number of archives the history server 
can support? Does it attempt to download every archive and load them all into 
memory?

2.  Retention: we have on the order of 100K applications per day in our 
production environment. Is there any native retention policy? E.g. only keep 
the latest X archives in the dir - or is this something we need to manage 
ourselves?

Thanks.

// ah

From: Hailu, Andreas [Engineering]
Sent: Friday, May 29, 2020 8:46 AM
To: 'Chesnay Schepler' ; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Yes, these are all in the same directory, and we're at 67G right now. I'll try 
with incrementally smaller directories and let you know what I find.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Friday, May 29, 2020 3:11 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

oh I'm not using the HistoryServer; I just wrote it ;)
Are these archives all in the same location? So we're roughly looking at 5 GB 
of archives then?

That could indeed "just" be a resource problem. The HistoryServer eagerly 
downloads all archives, and not on-demand.
The next step would be to move some of the archives into a separate HDFS 
directory and try again.

(Note that by configuring "historyserver.web.tmpdir" to some permanent 
directory, subsequent (re)starts of the HistoryServer can re-use this 
directory; so you only have to download things once)

On 29/05/2020 00:43, Hailu, Andreas wrote:
May I also ask what version of flink-hadoop you're using and the number of jobs 
you're storing the history for? As of writing we have roughly 101,000 
application history files. I'm curious to know if we're encountering some kind 
of resource problem.

// ah

From: Hailu, Andreas [Engineering]
Sent: Thursday, May 28, 2020 12:18 PM
To: 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Okay, I will look further to see if we're mistakenly using a version that's 
pre-2.6.0. However, I don't see flink-shaded-hadoop in my /lib directory for 
flink-1.9.1.

flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar

Are the files within /lib.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Thursday, May 28, 2020 11:00 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Looks like it is indeed stuck on downloading the archive.

I searched a bit in the Hadoop JIRA and found several similar instances:
https://issues.apache.org/jira/browse/HDFS-6999
https://issues.apache.org/jira/browse/HDFS-7005
https://issues.apache.org/jira/browse/HDFS-7145

It is supposed to be fixed in 2.6.0 though :/

If hadoop is available from the HADOOP_CLASSPATH and flink-shaded-hadoop in 
/lib then you basically don't know what Hadoop version is actually being used,
which could lead to incompatibilities and dependency clashes.
If flink-shaded-hadoop 2.4/2.5 is on the classpath, maybe that is being used 
and runs into HDFS-7005.

On 28/05/2020 16:27, Hailu, Andreas wrote:
Just created a dump, here's what I see:

"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 
tid=0x7f93a5a2c000 nid=0x5692 runnable [0x7f934a0d3000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-29 Thread Hailu, Andreas
Yes, these are all in the same directory, and we're at 67G right now. I'll try 
with incrementally smaller directories and let you know what I find.

// ah

From: Chesnay Schepler 
Sent: Friday, May 29, 2020 3:11 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

oh I'm not using the HistoryServer; I just wrote it ;)
Are these archives all in the same location? So we're roughly looking at 5 GB 
of archives then?

That could indeed "just" be a resource problem. The HistoryServer eagerly 
downloads all archives, and not on-demand.
The next step would be to move some of the archives into a separate HDFS 
directory and try again.

(Note that by configuring "historyserver.web.tmpdir" to some permanent 
directory, subsequent (re)starts of the HistoryServer can re-use this 
directory; so you only have to download things once)

On 29/05/2020 00:43, Hailu, Andreas wrote:
May I also ask what version of flink-hadoop you're using and the number of jobs 
you're storing the history for? As of writing we have roughly 101,000 
application history files. I'm curious to know if we're encountering some kind 
of resource problem.

// ah

From: Hailu, Andreas [Engineering]
Sent: Thursday, May 28, 2020 12:18 PM
To: 'Chesnay Schepler' <ches...@apache.org>; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Okay, I will look further to see if we're mistakenly using a version that's 
pre-2.6.0. However, I don't see flink-shaded-hadoop in my /lib directory for 
flink-1.9.1.

flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar

Are the files within /lib.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Thursday, May 28, 2020 11:00 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Looks like it is indeed stuck on downloading the archive.

I searched a bit in the Hadoop JIRA and found several similar instances:
https://issues.apache.org/jira/browse/HDFS-6999
https://issues.apache.org/jira/browse/HDFS-7005
https://issues.apache.org/jira/browse/HDFS-7145

It is supposed to be fixed in 2.6.0 though :/

If hadoop is available from the HADOOP_CLASSPATH and flink-shaded-hadoop in 
/lib then you basically don't know what Hadoop version is actually being used,
which could lead to incompatibilities and dependency clashes.
If flink-shaded-hadoop 2.4/2.5 is on the classpath, maybe that is being used 
and runs into HDFS-7005.

On 28/05/2020 16:27, Hailu, Andreas wrote:
Just created a dump, here's what I see:

"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 
tid=0x7f93a5a2c000 nid=0x5692 runnable [0x7f934a0d3000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x0005df986960> (a sun.nio.ch.Util$2)
- locked <0x0005df986948> (a java.util.Collections$UnmodifiableSet)
- locked <0x0005df928390> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
  

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-28 Thread Hailu, Andreas
May I also ask what version of flink-hadoop you're using and the number of jobs 
you're storing the history for? As of writing we have roughly 101,000 
application history files. I'm curious to know if we're encountering some kind 
of resource problem.

// ah

From: Hailu, Andreas [Engineering]
Sent: Thursday, May 28, 2020 12:18 PM
To: 'Chesnay Schepler' ; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Okay, I will look further to see if we're mistakenly using a version that's 
pre-2.6.0. However, I don't see flink-shaded-hadoop in my /lib directory for 
flink-1.9.1.

flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar

Are the files within /lib.

// ah

From: Chesnay Schepler <ches...@apache.org>
Sent: Thursday, May 28, 2020 11:00 AM
To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Looks like it is indeed stuck on downloading the archive.

I searched a bit in the Hadoop JIRA and found several similar instances:
https://issues.apache.org/jira/browse/HDFS-6999
https://issues.apache.org/jira/browse/HDFS-7005
https://issues.apache.org/jira/browse/HDFS-7145

It is supposed to be fixed in 2.6.0 though :/

If hadoop is available from the HADOOP_CLASSPATH and flink-shaded-hadoop in 
/lib then you basically don't know what Hadoop version is actually being used,
which could lead to incompatibilities and dependency clashes.
If flink-shaded-hadoop 2.4/2.5 is on the classpath, maybe that is being used 
and runs into HDFS-7005.

On 28/05/2020 16:27, Hailu, Andreas wrote:
Just created a dump, here's what I see:

"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 
tid=0x7f93a5a2c000 nid=0x5692 runnable [0x7f934a0d3000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x0005df986960> (a sun.nio.ch.Util$2)
- locked <0x0005df986948> (a java.util.Collections$UnmodifiableSet)
- locked <0x0005df928390> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:201)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
- locked <0x0005ceade5e0> (a 
org.apache.hadoop.hdfs.RemoteBlockReader2)
at 
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:781)
at 
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:837)
- eliminated <0x0005cead3688> (a 
org.apache.hadoop.hdfs.DFSInputStream)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
- locked <0x0005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:9

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-28 Thread Hailu, Andreas
Okay, I will look further to see if we're mistakenly using a version that's 
pre-2.6.0. However, I don't see flink-shaded-hadoop in my /lib directory for 
flink-1.9.1.

flink-dist_2.11-1.9.1.jar
flink-table-blink_2.11-1.9.1.jar
flink-table_2.11-1.9.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar

Are the files within /lib.

// ah

From: Chesnay Schepler 
Sent: Thursday, May 28, 2020 11:00 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Looks like it is indeed stuck on downloading the archive.

I searched a bit in the Hadoop JIRA and found several similar instances:
https://issues.apache.org/jira/browse/HDFS-6999
https://issues.apache.org/jira/browse/HDFS-7005
https://issues.apache.org/jira/browse/HDFS-7145

It is supposed to be fixed in 2.6.0 though :/

If hadoop is available from the HADOOP_CLASSPATH and flink-shaded-hadoop in 
/lib then you basically don't know what Hadoop version is actually being used,
which could lead to incompatibilities and dependency clashes.
If flink-shaded-hadoop 2.4/2.5 is on the classpath, maybe that is being used 
and runs into HDFS-7005.

On 28/05/2020 16:27, Hailu, Andreas wrote:
Just created a dump, here's what I see:

"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 
tid=0x7f93a5a2c000 nid=0x5692 runnable [0x7f934a0d3000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x0005df986960> (a sun.nio.ch.Util$2)
- locked <0x0005df986948> (a java.util.Collections$UnmodifiableSet)
- locked <0x0005df928390> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:201)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
- locked <0x0005ceade5e0> (a 
org.apache.hadoop.hdfs.RemoteBlockReader2)
at 
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:781)
at 
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:837)
- eliminated <0x0005cead3688> (a 
org.apache.hadoop.hdfs.DFSInputStream)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
- locked <0x0005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:945)
- locked <0x0005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:69)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:91)
at 
org.apache.flink.runtime.history.FsJobArchivist.getArchivedJsons(FsJobArchivist.java:110)
at 
org.apache.flink.runtime.we

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-28 Thread Hailu, Andreas
Just created a dump, here's what I see:

"Flink-HistoryServer-ArchiveFetcher-thread-1" #19 daemon prio=5 os_prio=0 
tid=0x7f93a5a2c000 nid=0x5692 runnable [0x7f934a0d3000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x0005df986960> (a sun.nio.ch.Util$2)
- locked <0x0005df986948> (a java.util.Collections$UnmodifiableSet)
- locked <0x0005df928390> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:201)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:152)
- locked <0x0005ceade5e0> (a 
org.apache.hadoop.hdfs.RemoteBlockReader2)
at 
org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:781)
at 
org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:837)
- eliminated <0x0005cead3688> (a 
org.apache.hadoop.hdfs.DFSInputStream)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:897)
- locked <0x0005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
   at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:945)
- locked <0x0005cead3688> (a org.apache.hadoop.hdfs.DFSInputStream)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:69)
at org.apache.flink.util.IOUtils.copyBytes(IOUtils.java:91)
at 
org.apache.flink.runtime.history.FsJobArchivist.getArchivedJsons(FsJobArchivist.java:110)
at 
org.apache.flink.runtime.webmonitor.history.HistoryServerArchiveFetcher$JobArchiveFetcherTask.run(HistoryServerArchiveFetcher.java:169)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

What problems could including the flink-shaded-hadoop jar introduce?

// ah

From: Chesnay Schepler 
Sent: Thursday, May 28, 2020 9:26 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

If it were a class-loading issue I would think that we'd see an exception of 
some kind. Maybe double-check that flink-shaded-hadoop is not in the lib 
directory. (usually I would ask for the full classpath that the HS is started 
with, but as it turns out this isn't getting logged :( (FLINK-18008))

The fact that overview.json and jobs/overview.json are missing indicates that 
something goes wrong directly on startup. What is supposed to happen is that 
the HS starts, fetches all currently available archives and then creates these 
files.
So it seems like the download gets stuck for some reason.
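
If you want to rule out the HistoryServer itself, a small standalone check along these lines should reproduce the fetch step in isolation. This is only a sketch, assuming the 1.9.x signature FsJobArchivist.getArchivedJsons(Path) is unchanged and that your Hadoop classpath is available to the program; the archive path is just one of the job IDs from your earlier listing:

import java.util.Collection;

import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.history.FsJobArchivist;
import org.apache.flink.runtime.webmonitor.history.ArchivedJson;

// Reads a single archive the same way HistoryServerArchiveFetcher does.
// If this call hangs, the problem is in the HDFS read path rather than
// in the HistoryServer itself.
public class ArchiveReadCheck {
    public static void main(String[] args) throws Exception {
        Path archive = new Path(
            "hdfs:///user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936");

        long start = System.currentTimeMillis();
        Collection<ArchivedJson> entries = FsJobArchivist.getArchivedJsons(archive);
        System.out.println("Fetched " + entries.size() + " entries in "
            + (System.currentTimeMillis() - start) + " ms");
    }
}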

Can you use jstack to create a thread dump, and see what the 
Flink-HistoryServer-ArchiveFetcher is doing?

I will also file a JIRA for adding more logging statements, like when fetching 
starts/stops.

On 27/05/2020 20:57, Hailu, Andreas wrote:
Hi Chesney, apologies for not getting back to you sooner here. So I did what 
you suggested - I downloaded a few files from my jobmanager.archive.fs.dir HDFS 
directory to a locally availab

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-27 Thread Hailu, Andreas
Hi Chesney, apologies for not getting back to you sooner here. So I did what 
you suggested - I downloaded a few files from my jobmanager.archive.fs.dir HDFS 
directory to a locally available directory named 
/local/scratch/hailua_p2epdlsuat/historyserver/archived/. I then changed my 
historyserver.archive.fs.dir to 
file:///local/scratch/hailua_p2epdlsuat/historyserver/archived/ and that seemed 
to work. I'm able to see the history of the applications I downloaded. So this 
points to a problem with sourcing the history from HDFS.

Do you think this could be classpath related? This is what we use for our 
HADOOP_CLASSPATH var:
/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-hdfs/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-hdfs/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-mapreduce/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-mapreduce/lib/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-yarn/*:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop-yarn/lib/*:/gns/software/ep/da/dataproc/dataproc-prod/lakeRmProxy.jar:/gns/software/infra/big-data/hadoop/hdp-2.6.5.0/hadoop/bin::/gns/mw/dbclient/postgres/jdbc/pg-jdbc-9.3.v01/postgresql-9.3-1100-jdbc4.jar

You can see we have references to Hadoop mapred/yarn/hdfs libs in there.

// ah

From: Chesnay Schepler 
Sent: Sunday, May 3, 2020 6:00 PM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

yes, exactly; I want to rule out that (somehow) HDFS is the problem.

I couldn't reproduce the issue locally myself so far.

On 01/05/2020 22:31, Hailu, Andreas wrote:
Hi Chesnay, yes - they were created using Flink 1.9.1 as we've only just 
started to archive them in the past couple of weeks. Could you clarify how you'd 
like me to try local filesystem archives? As in changing jobmanager.archive.fs.dir 
and historyserver.web.tmpdir to the same local directory?

// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Wednesday, April 29, 2020 8:26 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

hmm...let's see if I can reproduce the issue locally.

Are the archives from the same version the history server runs on? (Which I 
supposed would be 1.9.1?)

Just for the sake of narrowing things down, it would also be interesting to 
check if it works with the archives residing in the local filesystem.

On 27/04/2020 18:35, Hailu, Andreas wrote:
bash-4.1$ ls -l /local/scratch/flink_historyserver_tmpdir/
total 8
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:43 
flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:22 
flink-web-history-95b3f928-c60f-4351-9926-766c6ad3ee76

There are just two directories in here. I don't see cache directories from my 
attempts today, which is interesting. Looking a little deeper into them:

bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
total 1756
drwxrwxr-x 2 p2epdlsuat p2epdlsuat 1789952 Apr 21 10:44 jobs
bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9/jobs
total 0
-rw-rw-r-- 1 p2epdlsuat p2epdlsuat 0 Apr 21 10:43 overview.json

There are indeed archives already in HDFS - I've included some in my initial 
mail, but here they are again just for reference:
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...


// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Monday, April 27, 2020 10:28 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

If historyserver.web.tmpdir is not set then java.io.tmpdir is used, so that 
should be fine.

What are the contents of /local/scratch/flink_historyserver_tmpdir?
I assume there are already archives in HDFS?

On 27/04/2020 16:02, Hailu, Andreas wrote:
My machine's /tmp directory is not large enough to support the archived files, 
so I changed my java.io.tmpdir to be in some other location which is 
significantly larger. I hadn't set anything for historyserver.web.tmpdir, so I 
suspect it was still po

RE: History Server Not Showing Any Jobs - File Not Found?

2020-05-01 Thread Hailu, Andreas
Hi Chesnay, yes - they were created using Flink 1.9.1 as we've only just 
started to archive them in the past couple of weeks. Could you clarify how you'd 
like me to try local filesystem archives? As in changing jobmanager.archive.fs.dir 
and historyserver.web.tmpdir to the same local directory?

// ah

From: Chesnay Schepler 
Sent: Wednesday, April 29, 2020 8:26 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

hmm...let's see if I can reproduce the issue locally.

Are the archives from the same version the history server runs on? (Which I 
supposed would be 1.9.1?)

Just for the sake of narrowing things down, it would also be interesting to 
check if it works with the archives residing in the local filesystem.

On 27/04/2020 18:35, Hailu, Andreas wrote:
bash-4.1$ ls -l /local/scratch/flink_historyserver_tmpdir/
total 8
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:43 
flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:22 
flink-web-history-95b3f928-c60f-4351-9926-766c6ad3ee76

There are just two directories in here. I don't see cache directories from my 
attempts today, which is interesting. Looking a little deeper into them:

bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
total 1756
drwxrwxr-x 2 p2epdlsuat p2epdlsuat 1789952 Apr 21 10:44 jobs
bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9/jobs
total 0
-rw-rw-r-- 1 p2epdlsuat p2epdlsuat 0 Apr 21 10:43 overview.json

There are indeed archives already in HDFS - I've included some in my initial 
mail, but here they are again just for reference:
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...


// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Monday, April 27, 2020 10:28 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

If historyserver.web.tmpdir is not set then java.io.tmpdir is used, so that 
should be fine.

What are the contents of /local/scratch/flink_historyserver_tmpdir?
I assume there are already archives in HDFS?

On 27/04/2020 16:02, Hailu, Andreas wrote:
My machine's /tmp directory is not large enough to support the archived files, 
so I changed my java.io.tmpdir to be in some other location which is 
significantly larger. I hadn't set anything for historyserver.web.tmpdir, so I 
suspect it was still pointing at /tmp. I just tried setting 
historyserver.web.tmpdir to the same location as my java.io.tmpdir location, 
but I'm afraid I'm still seeing the following issue:

2020-04-27 09:37:42,904 [nioEventLoopGroup-3-4] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/overview.json from classloader
2020-04-27 09:37:42,906 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader

flink-conf.yaml for reference:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.web.tmpdir: /local/scratch/flink_historyserver_tmpdir/

Did you have anything else in mind when you said pointing somewhere funny?

// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Monday, April 27, 2020 5:56 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?


overview.json is a generated file that is placed in the local directory 
controlled by historyserver.web.tmpdir.

Have you configured this option to point to some non-local filesystem? (Or if 
not, is the java.io.tmpdir property pointing somewhere funny?)
On 24/04/2020 18:24, Hailu, Andreas wrote:
I'm having a further look at the code in HistoryServerStaticFileServerHandler - 
is there an assumption about where overview.json is supposed to be located?

// ah

From: Hailu, Andreas [Engineering]
Sent: Wednesday, April 22, 2020 1:32 PM
To: 'Chesnay Schepler' <mailto:ches...@apache.org>; Hailu, 
Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: History Server Not Showing Any Jobs - Fil

RE: History Server Not Showing Any Jobs - File Not Found?

2020-04-27 Thread Hailu, Andreas
bash-4.1$ ls -l /local/scratch/flink_historyserver_tmpdir/
total 8
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:43 
flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
drwxrwxr-x 3 p2epdlsuat p2epdlsuat 4096 Apr 21 10:22 
flink-web-history-95b3f928-c60f-4351-9926-766c6ad3ee76

There are just two directories in here. I don't see cache directories from my 
attempts today, which is interesting. Looking a little deeper into them:

bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9
total 1756
drwxrwxr-x 2 p2epdlsuat p2epdlsuat 1789952 Apr 21 10:44 jobs
bash-4.1$ ls -lr 
/local/scratch/flink_historyserver_tmpdir/flink-web-history-7fbb97cc-9f38-4844-9bcf-6272fe6828e9/jobs
total 0
-rw-rw-r-- 1 p2epdlsuat p2epdlsuat 0 Apr 21 10:43 overview.json

There are indeed archives already in HDFS - I've included some in my initial 
mail, but here they are again just for reference:
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...


// ah

From: Chesnay Schepler 
Sent: Monday, April 27, 2020 10:28 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

If historyserver.web.tmpdir is not set then java.io.tmpdir is used, so that 
should be fine.

What are the contents of /local/scratch/flink_historyserver_tmpdir?
I assume there are already archives in HDFS?

On 27/04/2020 16:02, Hailu, Andreas wrote:
My machine's /tmp directory is not large enough to support the archived files, 
so I changed my java.io.tmpdir to be in some other location which is 
significantly larger. I hadn't set anything for historyserver.web.tmpdir, so I 
suspect it was still pointing at /tmp. I just tried setting 
historyserver.web.tmpdir to the same location as my java.io.tmpdir location, 
but I'm afraid I'm still seeing the following issue:

2020-04-27 09:37:42,904 [nioEventLoopGroup-3-4] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/overview.json from classloader
2020-04-27 09:37:42,906 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader

flink-conf.yaml for reference:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.web.tmpdir: /local/scratch/flink_historyserver_tmpdir/

Did you have anything else in mind when you said pointing somewhere funny?

// ah

From: Chesnay Schepler <mailto:ches...@apache.org>
Sent: Monday, April 27, 2020 5:56 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?


overview.json is a generated file that is placed in the local directory 
controlled by historyserver.web.tmpdir.

Have you configured this option to point to some non-local filesystem? (Or if 
not, is the java.io.tmpdir property pointing somewhere funny?)
On 24/04/2020 18:24, Hailu, Andreas wrote:
I'm having a further look at the code in HistoryServerStaticFileServerHandler - 
is there an assumption about where overview.json is supposed to be located?

// ah

From: Hailu, Andreas [Engineering]
Sent: Wednesday, April 22, 2020 1:32 PM
To: 'Chesnay Schepler' <mailto:ches...@apache.org>; Hailu, 
Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Hi Chesnay, thanks for responding. We're using Flink 1.9.1. I enabled DEBUG 
level logging and this is something relevant I see:

2020-04-22 13:25:52,566 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - Connecting to datanode 10.79.252.101:1019
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, 
remoteHostTrusted = false
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL client skipping handshake in secured 
configuration with privileged port for addr = /10.79.252.101, datanodeId = 
DatanodeI
nfoWithStorage[10.79.252.101:1019,DS-7f4ec55d-7c5f-4a0e-b817-d9e635480b21,DISK]
2020-04-22 13:25:52,571 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - DFSInputStream has been closed already
2020-04-22 13:25:

RE: History Server Not Showing Any Jobs - File Not Found?

2020-04-27 Thread Hailu, Andreas
My machine's /tmp directory is not large enough to support the archived files, 
so I changed my java.io.tmpdir to be in some other location which is 
significantly larger. I hadn't set anything for historyserver.web.tmpdir, so I 
suspect it was still pointing at /tmp. I just tried setting 
historyserver.web.tmpdir to the same location as my java.io.tmpdir location, 
but I'm afraid I'm still seeing the following issue:

2020-04-27 09:37:42,904 [nioEventLoopGroup-3-4] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/overview.json from classloader
2020-04-27 09:37:42,906 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader

flink-conf.yaml for reference:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
historyserver.web.tmpdir: /local/scratch/flink_historyserver_tmpdir/

Did you have anything else in mind when you said pointing somewhere funny?

// ah

From: Chesnay Schepler 
Sent: Monday, April 27, 2020 5:56 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?


overview.json is a generated file that is placed in the local directory 
controlled by historyserver.web.tmpdir.

Have you configured this option to point to some non-local filesystem? (Or if 
not, is the java.io.tmpdir property pointing somewhere funny?)
On 24/04/2020 18:24, Hailu, Andreas wrote:
I'm having a further look at the code in HistoryServerStaticFileServerHandler - 
is there an assumption about where overview.json is supposed to be located?

// ah

From: Hailu, Andreas [Engineering]
Sent: Wednesday, April 22, 2020 1:32 PM
To: 'Chesnay Schepler' <mailto:ches...@apache.org>; Hailu, 
Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Hi Chesnay, thanks for responding. We're using Flink 1.9.1. I enabled DEBUG 
level logging and this is something relevant I see:

2020-04-22 13:25:52,566 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - Connecting to datanode 10.79.252.101:1019
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, 
remoteHostTrusted = false
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL client skipping handshake in secured 
configuration with privileged port for addr = /10.79.252.101, datanodeId = 
DatanodeI
nfoWithStorage[10.79.252.101:1019,DS-7f4ec55d-7c5f-4a0e-b817-d9e635480b21,DISK]
2020-04-22 13:25:52,571 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - DFSInputStream has been closed already
2020-04-22 13:25:52,573 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader
2020-04-22 13:25:52,576 [IPC Parameter Sending Thread #0] DEBUG 
Client$Connection$3 - IPC Client (1578587450) connection to 
d279536-002.dc.gs.com/10.59.61.87:8020 from d...@gs.com<mailto:d...@gs.com> 
sending #1391

Aside from that, it looks like a lot of logging around datanodes and block 
location metadata. Did I miss something in my classpath, perhaps? If so, do you 
have a suggestion on what I could try?

// ah

From: Chesnay Schepler mailto:ches...@apache.org>>
Sent: Wednesday, April 22, 2020 2:16 AM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Which Flink version are you using?
Have you checked the history server logs after enabling debug logging?

On 21/04/2020 17:16, Hailu, Andreas [Engineering] wrote:
Hi,

I'm trying to set up the History Server, but none of my applications are 
showing up in the Web UI. Looking at the console, I see that all of the calls 
to /overview return the following 404 response: {"errors":["File not found."]}.

I've set up my configuration as follows:

JobManager Archive directory:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
...

History Server will fetch the archived jobs from the same location:
historyserver.archive.f

RE: History Server Not Showing Any Jobs - File Not Found?

2020-04-24 Thread Hailu, Andreas
I'm having a further look at the code in HistoryServerStaticFileServerHandler - 
is there an assumption about where overview.json is supposed to be located?

// ah

From: Hailu, Andreas [Engineering]
Sent: Wednesday, April 22, 2020 1:32 PM
To: 'Chesnay Schepler' ; Hailu, Andreas [Engineering] 
; user@flink.apache.org
Subject: RE: History Server Not Showing Any Jobs - File Not Found?

Hi Chesnay, thanks for responding. We're using Flink 1.9.1. I enabled DEBUG 
level logging and this is something relevant I see:

2020-04-22 13:25:52,566 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - Connecting to datanode 10.79.252.101:1019
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, 
remoteHostTrusted = false
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL client skipping handshake in secured 
configuration with privileged port for addr = /10.79.252.101, datanodeId = 
DatanodeI
nfoWithStorage[10.79.252.101:1019,DS-7f4ec55d-7c5f-4a0e-b817-d9e635480b21,DISK]
2020-04-22 13:25:52,571 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - DFSInputStream has been closed already
2020-04-22 13:25:52,573 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader
2020-04-22 13:25:52,576 [IPC Parameter Sending Thread #0] DEBUG 
Client$Connection$3 - IPC Client (1578587450) connection to 
d279536-002.dc.gs.com/10.59.61.87:8020 from d...@gs.com<mailto:d...@gs.com> 
sending #1391

Aside from that, it looks like a lot of logging around datanodes and block 
location metadata. Did I miss something in my classpath, perhaps? If so, do you 
have a suggestion on what I could try?

// ah

From: Chesnay Schepler mailto:ches...@apache.org>>
Sent: Wednesday, April 22, 2020 2:16 AM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Which Flink version are you using?
Have you checked the history server logs after enabling debug logging?

On 21/04/2020 17:16, Hailu, Andreas [Engineering] wrote:
Hi,

I'm trying to set up the History Server, but none of my applications are 
showing up in the Web UI. Looking at the console, I see that all of the calls 
to /overview return the following 404 response: {"errors":["File not found."]}.

I've set up my configuration as follows:

JobManager Archive directory:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
...

History Server will fetch the archived jobs from the same location:
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/

So I'm able to confirm that there are indeed archived applications that I 
should be able to view in the history server. I'm not able to find out what file 
the overview service is looking for from the repository - any suggestions as to 
what I could look into next?

Best,
Andreas





RE: History Server Not Showing Any Jobs - File Not Found?

2020-04-22 Thread Hailu, Andreas
Hi Chesnay, thanks for responding. We're using Flink 1.9.1. I enabled DEBUG 
level logging and this is something relevant I see:

2020-04-22 13:25:52,566 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - Connecting to datanode 10.79.252.101:1019
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, 
remoteHostTrusted = false
2020-04-22 13:25:52,567 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
SaslDataTransferClient - SASL client skipping handshake in secured 
configuration with privileged port for addr = /10.79.252.101, datanodeId = 
DatanodeI
nfoWithStorage[10.79.252.101:1019,DS-7f4ec55d-7c5f-4a0e-b817-d9e635480b21,DISK]
2020-04-22 13:25:52,571 [Flink-HistoryServer-ArchiveFetcher-thread-1] DEBUG 
DFSInputStream - DFSInputStream has been closed already
2020-04-22 13:25:52,573 [nioEventLoopGroup-3-6] DEBUG 
HistoryServerStaticFileServerHandler - Unable to load requested file 
/jobs/overview.json from classloader
2020-04-22 13:25:52,576 [IPC Parameter Sending Thread #0] DEBUG 
Client$Connection$3 - IPC Client (1578587450) connection to 
d279536-002.dc.gs.com/10.59.61.87:8020 from d...@gs.com sending #1391

Aside from that, it looks like a lot of logging around datanodes and block 
location metadata. Did I miss something in my classpath, perhaps? If so, do you 
have a suggestion on what I could try?

// ah

From: Chesnay Schepler 
Sent: Wednesday, April 22, 2020 2:16 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: History Server Not Showing Any Jobs - File Not Found?

Which Flink version are you using?
Have you checked the history server logs after enabling debug logging?

On 21/04/2020 17:16, Hailu, Andreas [Engineering] wrote:
Hi,

I'm trying to set up the History Server, but none of my applications are 
showing up in the Web UI. Looking at the console, I see that all of the calls 
to /overview return the following 404 response: {"errors":["File not found."]}.

I've set up my configuration as follows:

JobManager Archive directory:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
...

History Server will fetch the archived jobs from the same location:
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/

So I'm able to confirm that there are indeed archived applications that I 
should be able to view in the history server. I'm not able to find out what file 
the overview service is looking for from the repository - any suggestions as to 
what I could look into next?

Best,
Andreas





History Server Not Showing Any Jobs - File Not Found?

2020-04-21 Thread Hailu, Andreas [Engineering]
Hi,

I'm trying to set up the History Server, but none of my applications are 
showing up in the Web UI. Looking at the console, I see that all of the calls 
to /overview return the following 404 response: {"errors":["File not found."]}.

I've set up my configuration as follows:

JobManager Archive directory:
jobmanager.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/
-bash-4.1$ hdfs dfs -ls /user/p2epda/lake/delp_qa/flink_hs
Found 44282 items
-rw-r-   3 delp datalake_admin_dev  50569 2020-03-21 23:17 
/user/p2epda/lake/delp_qa/flink_hs/000144dba9dc0f235768a46b2f26e936
-rw-r-   3 delp datalake_admin_dev  49578 2020-03-03 08:45 
/user/p2epda/lake/delp_qa/flink_hs/000347625d8128ee3fd0b672018e38a5
-rw-r-   3 delp datalake_admin_dev  50842 2020-03-24 15:19 
/user/p2epda/lake/delp_qa/flink_hs/0004be6ce01ba9677d1eb619ad0fa757
...
...

History Server will fetch the archived jobs from the same location:
historyserver.archive.fs.dir: hdfs:///user/p2epda/lake/delp_qa/flink_hs/

So I'm able to confirm that there are indeed archived applications that I 
should be able to view in the history server. I'm not able to find out what file 
the overview service is looking for from the repository - any suggestions as to 
what I could look into next?

Best,
Andreas





RE: Flink Conf "yarn.flink-dist-jar" Question

2020-04-15 Thread Hailu, Andreas [Engineering]
Okay, I’ll continue to watch the JIRAs. Thanks for the update, Till.

// ah

From: Till Rohrmann 
Sent: Wednesday, April 15, 2020 10:51 AM
To: Hailu, Andreas [Engineering] 
Cc: Yang Wang ; tison ; 
user@flink.apache.org
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Andreas,

it looks as if FLINK-13938 and FLINK-14964 won't make it into the 1.10.1 
release because the community is about to start the release process. Since 
FLINK-13938 is a new feature it will be shipped with a major release. There is 
still a bit of time until the 1.11 feature freeze and if Yang Wang has time to 
finish this PR, then we could ship it.

Cheers,
Till

On Wed, Apr 15, 2020 at 3:23 PM Hailu, Andreas [Engineering] 
mailto:andreas.ha...@gs.com>> wrote:
Yang, Tison,

Do we know when some solution for 13938 and 14964 will arrive? Do you think it 
will be in a 1.10.x version?

// ah

From: Hailu, Andreas [Engineering]
Sent: Friday, March 20, 2020 9:19 AM
To: 'Yang Wang' mailto:danrtsey...@gmail.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

This is good to know. As a stopgap measure until a solution between 13938 and 
14964 arrives, we can automate the application staging directory cleanup from 
our client should the process fail. It’s not ideal, but will at least begin to 
manage our users’ quota. I’ll continue to watch the two tickets. Thank you.

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Monday, March 16, 2020 9:37 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu,

Sorry for the late response. If the Flink cluster (e.g. a Yarn application) is stopped directly
by `yarn application -kill`, then the staging directory will be left behind, since the jobmanager
does not get any chance to clean up the staging directory. This may also happen when the
jobmanager crashes and reaches the attempt limit of Yarn.

For FLINK-13938, yes, it is trying to use the Yarn public cache to accelerate 
the container
launch.
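
To illustrate what "public cache" means here (this is a generic YARN sketch, not Flink's actual submission code, and the HDFS path is made up): a LocalResource registered with PUBLIC visibility against a pre-staged, world-readable location is downloaded once per node by the NodeManager and then shared across applications, instead of being uploaded again for every submission.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;

// Registers a pre-staged flink-dist jar as a PUBLIC local resource so the
// NodeManager caches it per node and reuses it across applications.
public class PublicCacheSketch {
    static void addPublicFlinkDist(ContainerLaunchContext ctx, Configuration conf)
            throws Exception {
        // Hypothetical, world-readable, pre-staged location of the dist jar.
        Path dist = new Path("hdfs:///shared/flink/flink-dist_2.11-1.9.1.jar");
        FileStatus status = FileSystem.get(conf).getFileStatus(dist);

        LocalResource resource = LocalResource.newInstance(
            ConverterUtils.getYarnUrlFromPath(dist),
            LocalResourceType.FILE,
            LocalResourceVisibility.PUBLIC, // node-level shared cache
            status.getLen(),
            status.getModificationTime());

        ctx.setLocalResources(Collections.singletonMap("flink-dist.jar", resource));
    }
}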


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> 
wrote on Tue, Mar 10, 2020 at 4:38 AM:
Also may I ask what causes these application ID directories to be left behind? 
Is it a job failure, or can they persist even if the application succeeds? I’d 
like to know so that I can implement my own cleanup in the interim to prevent 
exceeding user disk space quotas.

// ah

From: Hailu, Andreas [Engineering]
Sent: Monday, March 9, 2020 1:20 PM
To: 'Yang Wang' mailto:danrtsey...@gmail.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

Yes, a combination of these two would be very helpful for us. We have a single 
shaded binary which we use to run all of the jobs on our YARN cluster. If we 
could designate a single location in HDFS for that as well, we could also 
greatly benefit from FLINK-13938.

It sounds like a general public cache solution is what’s being called for?

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Sunday, March 8, 2020 10:52 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu, tison,

I created a very similar ticket before to accelerate Flink submission on Yarn[1].
However, we did not reach a consensus in the PR. Maybe it's time to revive the
discussion and try to find a common solution for both tickets[1][2].


[1]. 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=njA3vGYTf0g7Zsog8AiwS4bbXxblOxepBEWUV9W3E0s=>
[2]. 
https://issues.apache.org/jira/browse/FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=9kT1RZkGwWh3MAbc_ZUrsEsmRRfw6VK4rlNIeNxs6GU=>


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 11:21 AM:
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison mailto:wander4...@gmail.com>>
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user

RE: Flink Conf "yarn.flink-dist-jar" Question

2020-04-15 Thread Hailu, Andreas [Engineering]
Yang, Tison,

Do we know when some solution for 13938 and 14964 will arrive? Do you think it 
will be in a 1.10.x version?

// ah

From: Hailu, Andreas [Engineering]
Sent: Friday, March 20, 2020 9:19 AM
To: 'Yang Wang' 
Cc: tison ; user@flink.apache.org
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

This is good to know. As a stopgap measure until a solution between 13938 and 
14964 arrives, we can automate the application staging directory cleanup from 
our client should the process fail. It’s not ideal, but will at least begin to 
manage our users’ quota. I’ll continue to watch the two tickets. Thank you.

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Monday, March 16, 2020 9:37 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu,

Sorry for the late response. If the Flink cluster (e.g. a Yarn application) is stopped directly
by `yarn application -kill`, then the staging directory will be left behind, since the jobmanager
does not get any chance to clean up the staging directory. This may also happen when the
jobmanager crashes and reaches the attempt limit of Yarn.

For FLINK-13938, yes, it is trying to use the Yarn public cache to accelerate 
the container
launch.


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> 
wrote on Tue, Mar 10, 2020 at 4:38 AM:
Also may I ask what causes these application ID directories to be left behind? 
Is it a job failure, or can they persist even if the application succeeds? I’d 
like to know so that I can implement my own cleanup in the interim to prevent 
exceeding user disk space quotas.

// ah

From: Hailu, Andreas [Engineering]
Sent: Monday, March 9, 2020 1:20 PM
To: 'Yang Wang' mailto:danrtsey...@gmail.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

Yes, a combination of these two would be very helpful for us. We have a single 
shaded binary which we use to run all of the jobs on our YARN cluster. If we 
could designate a single location in HDFS for that as well, we could also 
greatly benefit from FLINK-13938.

It sounds like a general public cache solution is what’s being called for?

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Sunday, March 8, 2020 10:52 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu, tison,

I created a very similar ticket before to accelerate Flink submission on Yarn[1].
However, we did not reach a consensus in the PR. Maybe it's time to revive the
discussion and try to find a common solution for both tickets[1][2].


[1]. 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=njA3vGYTf0g7Zsog8AiwS4bbXxblOxepBEWUV9W3E0s=>
[2]. 
https://issues.apache.org/jira/browse/FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=9kT1RZkGwWh3MAbc_ZUrsEsmRRfw6VK4rlNIeNxs6GU=>


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 11:21 AM:
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison mailto:wander4...@gmail.com>>
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

FLINK-13938 seems a bit different from your requirement. The one that fully
matches is 
FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=9sMjDI0I_9Yni5ZWqV8GScK_KBTaA65yK9kBG-LE5_4=X1ZoN456fuc5mNxO6fBzDboEhrI0EHL873LzOd6tnN8=>.
I'd appreciate it if you could share your opinion on the JIRA ticket.

Best,
tison.


tison mailto:wander4...@gmail.com>> wrote on Sat, Mar 7, 2020 at 2:35 AM:
Yes, your requirement is exactly what the community has taken into consideration.
We currently have an open JIRA ticket for the specific feature[1], and work should
happen to loosen the constraint on the flink

RE: Flink Conf "yarn.flink-dist-jar" Question

2020-03-20 Thread Hailu, Andreas
Hi Yang,

This is good to know. As a stopgap measure until a solution between 13938 and 
14964 arrives, we can automate the application staging directory cleanup from 
our client should the process fail. It’s not ideal, but will at least begin to 
manage our users’ quota. I’ll continue to watch the two tickets. Thank you.

// ah

From: Yang Wang 
Sent: Monday, March 16, 2020 9:37 PM
To: Hailu, Andreas [Engineering] 
Cc: tison ; user@flink.apache.org
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu,

Sorry for the late response. If the Flink cluster (e.g. a Yarn application) is stopped directly
by `yarn application -kill`, then the staging directory will be left behind, since the jobmanager
does not get any chance to clean up the staging directory. This may also happen when the
jobmanager crashes and reaches the attempt limit of Yarn.

For FLINK-13938, yes, it is trying to use the Yarn public cache to accelerate 
the container
launch.


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> 
wrote on Tue, Mar 10, 2020 at 4:38 AM:
Also may I ask what causes these application ID directories to be left behind? 
Is it a job failure, or can they persist even if the application succeeds? I’d 
like to know so that I can implement my own cleanup in the interim to prevent 
exceeding user disk space quotas.

// ah

From: Hailu, Andreas [Engineering]
Sent: Monday, March 9, 2020 1:20 PM
To: 'Yang Wang' mailto:danrtsey...@gmail.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

Yes, a combination of these two would be very helpful for us. We have a single 
shaded binary which we use to run all of the jobs on our YARN cluster. If we 
could designate a single location in HDFS for that as well, we could also 
greatly benefit from FLINK-13938.

It sounds like a general public cache solution is what’s being called for?

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Sunday, March 8, 2020 10:52 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu, tison,

I created a very similar ticket before to accelerate Flink submission on Yarn[1].
However, we did not reach a consensus in the PR. Maybe it's time to revive the
discussion and try to find a common solution for both tickets[1][2].


[1]. 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=njA3vGYTf0g7Zsog8AiwS4bbXxblOxepBEWUV9W3E0s=>
[2]. 
https://issues.apache.org/jira/browse/FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=9kT1RZkGwWh3MAbc_ZUrsEsmRRfw6VK4rlNIeNxs6GU=>


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 11:21 AM:
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison mailto:wander4...@gmail.com>>
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

FLINK-13938 seems a bit different from your requirement. The one that fully
matches is 
FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=9sMjDI0I_9Yni5ZWqV8GScK_KBTaA65yK9kBG-LE5_4=X1ZoN456fuc5mNxO6fBzDboEhrI0EHL873LzOd6tnN8=>.
I'd appreciate it if you could share your opinion on the JIRA ticket.

Best,
tison.


tison mailto:wander4...@gmail.com>> wrote on Sat, Mar 7, 2020 at 2:35 AM:
Yes, your requirement is exactly what the community has taken into consideration.
We currently have an open JIRA ticket for the specific feature[1], and work should
happen to loosen the constraint on the flink-jar schema so that DFS locations are
supported.

Best,
tison.

[1] 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=9sMjDI0I_9Yni5ZWqV8GScK_KBTaA65yK9kBG-LE5_4=ediMPoQtcPX7K-5fjXJxE2cPp5OySkzwXYfYj8mDWO0=>


Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 2:03 AM:
Hi,

We noticed that

RE: Flink Conf "yarn.flink-dist-jar" Question

2020-03-09 Thread Hailu, Andreas
Also may I ask what causes these application ID directories to be left behind? 
Is it a job failure, or can they persist even if the application succeeds? I’d 
like to know so that I can implement my own cleanup in the interim to prevent 
exceeding user disk space quotas.

// ah

From: Hailu, Andreas [Engineering]
Sent: Monday, March 9, 2020 1:20 PM
To: 'Yang Wang' 
Cc: tison ; user@flink.apache.org
Subject: RE: Flink Conf "yarn.flink-dist-jar" Question

Hi Yang,

Yes, a combination of these two would be very helpful for us. We have a single 
shaded binary which we use to run all of the jobs on our YARN cluster. If we 
could designate a single location in HDFS for that as well, we could also 
greatly benefit from FLINK-13938.

It sounds like a general public cache solution is what’s being called for?

// ah

From: Yang Wang mailto:danrtsey...@gmail.com>>
Sent: Sunday, March 8, 2020 10:52 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: tison mailto:wander4...@gmail.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu, tison,

I created a very similar ticket before to accelerate Flink submission on Yarn[1].
However, we did not reach a consensus in the PR. Maybe it's time to revive the
discussion and try to find a common solution for both tickets[1][2].


[1]. 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=njA3vGYTf0g7Zsog8AiwS4bbXxblOxepBEWUV9W3E0s=>
[2]. 
https://issues.apache.org/jira/browse/FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=rlD0F8Cr4H0aPlN6O2_K13Q76RFOERSWuJANh4q6X_8=9kT1RZkGwWh3MAbc_ZUrsEsmRRfw6VK4rlNIeNxs6GU=>


Best,
Yang

Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 11:21 AM:
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison mailto:wander4...@gmail.com>>
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

FLINK-13938 seems a bit different from your requirement. The one that fully
matches is 
FLINK-14964<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D14964=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=9sMjDI0I_9Yni5ZWqV8GScK_KBTaA65yK9kBG-LE5_4=X1ZoN456fuc5mNxO6fBzDboEhrI0EHL873LzOd6tnN8=>.
I'd appreciate it if you could share your opinion on the JIRA ticket.

Best,
tison.


tison mailto:wander4...@gmail.com>> wrote on Sat, Mar 7, 2020 at 2:35 AM:
Yes, your requirement is exactly what the community has taken into consideration.
We currently have an open JIRA ticket for the specific feature[1], and work should
happen to loosen the constraint on the flink-jar schema so that DFS locations are
supported.

Best,
tison.

[1] 
https://issues.apache.org/jira/browse/FLINK-13938<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_FLINK-2D13938=DwMFaQ=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4=hRr4SA7BtUvKoMBP6VDhfisy2OJ1ZAzai-pcCC6TFXM=9sMjDI0I_9Yni5ZWqV8GScK_KBTaA65yK9kBG-LE5_4=ediMPoQtcPX7K-5fjXJxE2cPp5OySkzwXYfYj8mDWO0=>


Hailu, Andreas mailto:andreas.ha...@gs.com>> wrote on Sat, Mar 7, 2020 at 2:03 AM:
Hi,

We noticed that every time an application runs, it uploads the flink-dist 
artifact to the /user//.flink HDFS directory. This causes a user disk 
space quota issue, as we submit thousands of apps to our cluster per hour. We had 
a similar problem with our Spark applications where it uploaded the Spark 
Assembly package for every app. Spark provides an argument to use a location in 
HDFS its for applications to leverage so they don’t need to upload them for 
every run, and that was our solution (see “spark.yarn.jar” configuration if 
interested.)

Looking at the Resource Orchestration Frameworks page 
(https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#yarn-flink-dist-jar), 
I see there might be a similar concept through a “yarn.flink-dist-jar” 
configuration option. I wanted to place the flink-dist package we’re using in a 
location in HDFS and configure our jobs to point to it, e.g.

yarn.flink-dist-jar: h

RE: Flink Conf "yarn.flink-dist-jar" Question

2020-03-09 Thread Hailu, Andreas
Hi Yang,

Yes, a combination of these two would be very helpful for us. We have a single 
shaded binary which we use to run all of the jobs on our YARN cluster. If we 
could designate a single location in HDFS for that as well, we could also 
greatly benefit from FLINK-13938.

It sounds like a general public cache solution is what’s being called for?
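A hedged sketch of the configuration that later Flink releases (1.11 onward) introduced for exactly this kind of shared, pre-uploaded location, yarn.provided.lib.dirs; the HDFS paths below are assumptions:

yarn.provided.lib.dirs: hdfs:///flink/flink-1.11.0/lib;hdfs:///flink/flink-1.11.0/plugins

With that option set, the client registers the pre-uploaded jars (flink-dist included) as remote YARN resources instead of uploading them from the client machine for every submission.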

// ah

From: Yang Wang 
Sent: Sunday, March 8, 2020 10:52 PM
To: Hailu, Andreas [Engineering] 
Cc: tison ; user@flink.apache.org
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

Hi Hailu, tison,

I created a very similar ticket before to accelerate Flink submission on 
Yarn[1]. However,
we do not get a consensus in the PR. Maybe it's time to revive the discussion 
and try
to find a common solution for both the two tickets[1][2].


[1]. https://issues.apache.org/jira/browse/FLINK-13938
[2]. https://issues.apache.org/jira/browse/FLINK-14964


Best,
Yang

Hailu, Andreas <andreas.ha...@gs.com> wrote on Sat, Mar 7, 2020, 11:21 AM:
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison mailto:wander4...@gmail.com>>
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

FLINK-13938 seems a bit different than your requirement. The one that totally 
matches is FLINK-14964 (https://issues.apache.org/jira/browse/FLINK-14964). 
I'd appreciate it if you can share your opinion on the JIRA ticket.

Best,
tison.


tison <wander4...@gmail.com> wrote on Sat, Mar 7, 2020, 2:35 AM:
Yes, your requirement has been taken into consideration by the community. We 
currently have an open JIRA ticket for this specific feature[1], and work on 
loosening the constraint on the flink-jar schema to support DFS locations should 
happen.

Best,
tison.

[1] https://issues.apache.org/jira/browse/FLINK-13938


Hailu, Andreas <andreas.ha...@gs.com> wrote on Sat, Mar 7, 2020, 2:03 AM:
Hi,

We noticed that every time an application runs, it uploads the flink-dist 
artifact to the /user//.flink HDFS directory. This causes a user disk 
space quota issue as we submit thousands of apps to our cluster an hour. We had 
a similar problem with our Spark applications where it uploaded the Spark 
Assembly package for every app. Spark provides an argument to use a location in 
HDFS its for applications to leverage so they don’t need to upload them for 
every run, and that was our solution (see “spark.yarn.jar” configuration if 
interested.)

Looking at the Resource Orchestration Frameworks page 
(https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#yarn-flink-dist-jar), 
I see there might be a similar concept through a “yarn.flink-dist-jar” 
configuration option. I wanted to place the flink-dist package we’re using in a 
location in HDFS and configure our jobs to point to it, e.g.

yarn.flink-dist-jar: hdfs:user/delp/.flink/flink-dist_2.11-1.9.1.jar

Am I correct in that this is what I’m looking for? I gave this a try with some 
jobs today, and based on what I’m seeing in the launch_container.sh in our YARN 
application, it still looks like it’s being uploaded:

export 
_FLINK_JAR_PATH="hdfs://d279536/user/delp/.flink/application_1583031705852_117863/flink-dist_2.11-1.9.1.jar"

How can I confirm? Or is this perhaps not config I’m looking for?

Best,
Andreas



Your Personal Data: We may collect and process information about you that may 
be subject to data protection l

RE: Flink Conf "yarn.flink-dist-jar" Question

2020-03-06 Thread Hailu, Andreas
Hi Tison, thanks for the reply. I’ve replied to the ticket. I’ll be watching it 
as well.

// ah

From: tison 
Sent: Friday, March 6, 2020 1:40 PM
To: Hailu, Andreas [Engineering] 
Cc: user@flink.apache.org
Subject: Re: Flink Conf "yarn.flink-dist-jar" Question

FLINK-13938 seems a bit different than your requirement. The one that totally 
matches is FLINK-14964 (https://issues.apache.org/jira/browse/FLINK-14964). 
I'd appreciate it if you can share your opinion on the JIRA ticket.

Best,
tison.


tison <wander4...@gmail.com> wrote on Sat, Mar 7, 2020, 2:35 AM:
Yes, your requirement has been taken into consideration by the community. We 
currently have an open JIRA ticket for this specific feature[1], and work on 
loosening the constraint on the flink-jar schema to support DFS locations should 
happen.

Best,
tison.

[1] https://issues.apache.org/jira/browse/FLINK-13938


Hailu, Andreas <andreas.ha...@gs.com> wrote on Sat, Mar 7, 2020, 2:03 AM:
Hi,

We noticed that every time an application runs, it uploads the flink-dist 
artifact to the /user//.flink HDFS directory. This causes a user disk 
space quota issue as we submit thousands of apps to our cluster an hour. We had 
a similar problem with our Spark applications where it uploaded the Spark 
Assembly package for every app. Spark provides an argument to use a location in 
HDFS its for applications to leverage so they don’t need to upload them for 
every run, and that was our solution (see “spark.yarn.jar” configuration if 
interested.)

Looking at the Resource Orchestration Frameworks page 
(https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#yarn-flink-dist-jar), 
I see there might be a similar concept through a “yarn.flink-dist-jar” 
configuration option. I wanted to place the flink-dist package we’re using in a 
location in HDFS and configure our jobs to point to it, e.g.

yarn.flink-dist-jar: hdfs:user/delp/.flink/flink-dist_2.11-1.9.1.jar

Am I correct in that this is what I’m looking for? I gave this a try with some 
jobs today, and based on what I’m seeing in the launch_container.sh in our YARN 
application, it still looks like it’s being uploaded:

export 
_FLINK_JAR_PATH="hdfs://d279536/user/delp/.flink/application_1583031705852_117863/flink-dist_2.11-1.9.1.jar"

How can I confirm? Or is this perhaps not config I’m looking for?

Best,
Andreas



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


Flink Conf "yarn.flink-dist-jar" Question

2020-03-06 Thread Hailu, Andreas
Hi,

We noticed that every time an application runs, it uploads the flink-dist 
artifact to the /user//.flink HDFS directory. This causes a user disk 
space quota issue as we submit thousands of apps to our cluster an hour. We had 
a similar problem with our Spark applications where it uploaded the Spark 
Assembly package for every app. Spark provides an argument to use a location in 
HDFS its for applications to leverage so they don't need to upload them for 
every run, and that was our solution (see "spark.yarn.jar" configuration if 
interested.)

Looking at the Resource Orchestration Frameworks 
page, 
I see there might be a similar concept through a "yarn.flink-dist-jar" 
configuration option. I wanted to place the flink-dist package we're using in a 
location in HDFS and configure our jobs to point to it, e.g.

yarn.flink-dist-jar: hdfs:user/delp/.flink/flink-dist_2.11-1.9.1.jar

Am I correct in that this is what I'm looking for? I gave this a try with some 
jobs today, and based on what I'm seeing in the launch_container.sh in our YARN 
application, it still looks like it's being uploaded:

export 
_FLINK_JAR_PATH="hdfs://d279536/user/delp/.flink/application_1583031705852_117863/flink-dist_2.11-1.9.1.jar"

How can I confirm? Or is this perhaps not config I'm looking for?

Best,
Andreas



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: Table API: Joining on Tables of Complex Types

2020-02-14 Thread Hailu, Andreas
Hi Timo, Dawid,

This was very helpful - thanks! The Row type seems to only support getting 
fields by their index. Is there a way to get a field by its name like the Row 
class in Spark? Link: 
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Row.html#getAs(java.lang.String)

Our use case is that we're developing a data-processing library for developers 
leveraging our system to refine existing datasets and produce new ones. The 
flow is as follows:

Our library reads Avro/Parquet GenericRecord data files from a source and turns 
them into a Table --> users write a series of operations on this Table to create 
a new resulting Table --> the resulting Table is then transformed and persisted back to 
the file system as Avro GenericRecords in an Avro/Parquet file.

We can map the Row field names to their corresponding indexes by patching the 
AvroRowDeserializationSchema class, but it's the step where we expose 
the Table to our users and then try to persist it where this 
metadata loss occurs. We know what fields the Table must be composed of, but we just 
won't know which index they live in, so Row#getField() isn't quite what we 
need.
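
In the meantime, a minimal sketch of a wrapper that bridges that gap on the caller side, assuming the field order is known (e.g. from the RowTypeInfo); NamedRow and its methods are purely illustrative, not a Flink API:

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.types.Row;

// Illustrative helper (not part of Flink): exposes Row fields by name, Spark-style.
public class NamedRow {
    private final Row row;
    private final Map<String, Integer> indexByName = new HashMap<>();

    public NamedRow(Row row, String[] fieldNames) {
        this.row = row;
        for (int i = 0; i < fieldNames.length; i++) {
            indexByName.put(fieldNames[i], i);
        }
    }

    // e.g. String user = namedRow.getAs("userName");
    @SuppressWarnings("unchecked")
    public <T> T getAs(String fieldName) {
        return (T) row.getField(indexByName.get(fieldName));
    }
}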

// ah

-Original Message-
From: Timo Walther 
Sent: Friday, January 17, 2020 11:29 AM
To: user@flink.apache.org
Subject: Re: Table API: Joining on Tables of Complex Types

Hi Andreas,

if dataset.getType() returns a RowTypeInfo you can ignore this log message. The 
type extractor runs before the ".returns()" but with this method you override 
the old type.

Regards,
Timo


On 15.01.20 15:27, Hailu, Andreas wrote:
> Dawid, this approach looks promising. I'm able to flatten out my Avro
> records into Rows and run simple queries atop of them. I've got a
> question - when I register my Rows as a table, I see the following log
> providing a warning:
>
> /2020-01-14 17:16:43,083 [main] INFO  TypeExtractor - class
> org.apache.flink.types.Row does not contain a getter for field fields/
>
> /2020-01-14 17:16:43,083 [main] INFO  TypeExtractor - class
> org.apache.flink.types.Row does not contain a setter for field fields/
>
> /2020-01-14 17:16:43,084 [main] INFO  TypeExtractor - Class class
> org.apache.flink.types.Row cannot be used as a POJO type because not
> all fields are valid POJO fields, and must be processed as GenericType.
> Please read the Flink documentation on "Data Types & Serialization"
> for details of the effect on performance./
>
> Will this be problematic even now that we've provided TypeInfos for
> the Rows? Performance is something that I'm concerned about as I've
> already introduced a new operation to transform our records to Rows.
>
> *// *ah**
>
> *From:* Hailu, Andreas [Engineering]
> *Sent:* Wednesday, January 8, 2020 12:08 PM
> *To:* 'Dawid Wysakowicz' <mailto:dwysakow...@apache.org>;
> mailto:user@flink.apache.org
> *Cc:* Richards, Adam S [Engineering] <mailto:adam.richa...@ny.email.gs.com>
> *Subject:* RE: Table API: Joining on Tables of Complex Types
>
> Very well - I'll give this a try. Thanks, Dawid.
>
> *// *ah**
>
> *From:* Dawid Wysakowicz  <mailto:dwysakow...@apache.org>>
> *Sent:* Wednesday, January 8, 2020 7:21 AM
> *To:* Hailu, Andreas [Engineering]  <mailto:andreas.ha...@ny.email.gs.com>>; mailto:user@flink.apache.org
> <mailto:user@flink.apache.org>
> *Cc:* Richards, Adam S [Engineering]  <mailto:adam.richa...@ny.email.gs.com>>
> *Subject:* Re: Table API: Joining on Tables of Complex Types
>
> Hi Andreas,
>
> Converting your GenericRecords to Rows would definitely be the safest
> option. You can check how its done in the
> org.apache.flink.formats.avro.AvroRowDeserializationSchema. You can
> reuse the logic from there to write something like:
>
>  DataSet dataset = ...
>
>  dataset.map( /* convert GenericRecord to Row
> */).returns(AvroSchemaConverter.convertToTypeInfo(avroSchemaString));
>
> Another thing you could try is to make sure that GenericRecord is seen
> as an avro type by fink (flink should understand that avro type is a
> complex type):
>
>  dataset.returns(new GenericRecordAvroTypeInfo(/*schema string*/)
>
> than the TableEnvironment should pick it up as a structured type and
> flatten it automatically when registering the Table. Bear in mind the
> returns method is part of SingleInputUdfOperator so you can apply it
> right after some transformation e.g. map/flatMap etc.
>
> Best,
>
> Dawid
>
> On 06/01/2020 18:03, Hailu, Andreas wrote:
>
> Hi David, thanks for getting back.
>
>  From what you've said, I think we'll need to convert our
> GenericRecord into structured types - do you have any references or
> examples I can have a look at? If not, perhaps you could just show
>   

RE: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Hailu, Andreas
Congrats all!

P.S. I noticed in the release notes that the bullet:

[FLINK-14516] The 
non-credit-based network flow control code was removed, along with the 
configuration option taskmanager.network.credit.model. Moving forward, Flink 
will always use credit-based flow control.

Mistakenly links to 
FLINK-13884 ☺


From: Yu Li 
Sent: Wednesday, February 12, 2020 8:31 AM
To: dev ; user ; 
annou...@apache.org
Subject: [ANNOUNCE] Apache Flink 1.10.0 released

The Apache Flink community is very happy to announce the release of Apache 
Flink 1.10.0, which is the latest major release.

Apache Flink® is an open-source stream processing framework for distributed, 
high-performing, always-available, and accurate data streaming applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements for 
this new major release:
https://flink.apache.org/news/2020/02/11/release-1.10.0.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12345845

We would like to thank all contributors of the Apache Flink community who made 
this release possible!

Cheers,
Gary & Yu



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


1.9.2 Release Date?

2020-01-24 Thread Hailu, Andreas
Hi,

Do we have any thoughts on a release date for 1.9.2? I've been eyeing 
FLINK-13184 particularly to 
help alleviate stress on our RM + Name Node and reduce noise/delays due to 
sporadic Task Manager timeouts. We submit thousands of jobs per hour, so this 
looks like it could be a big help.

Best,
Andreas Hailu

The Goldman Sachs Group, Inc. All rights reserved.
See http://www.gs.com/disclaimer/global_email for important risk disclosures, 
conflicts of interest and other terms and conditions relating to this e-mail 
and your reliance on information contained in it.  This message may contain 
confidential or privileged information.  If you are not the intended recipient, 
please advise us immediately and delete this message.  See 
http://www.gs.com/disclaimer/email for further information on confidentiality 
and the risks of non-secure electronic communication.  If you cannot access 
these links, please notify us by reply message and we will send the contents to 
you.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: Table API: Joining on Tables of Complex Types

2020-01-15 Thread Hailu, Andreas
Dawid, this approach looks promising. I'm able to flatten out my Avro records 
into Rows and run simple queries atop of them. I've got a question - when I 
register my Rows as a table, I see the following log providing a warning:

2020-01-14 17:16:43,083 [main] INFO  TypeExtractor - class 
org.apache.flink.types.Row does not contain a getter for field fields
2020-01-14 17:16:43,083 [main] INFO  TypeExtractor - class 
org.apache.flink.types.Row does not contain a setter for field fields
2020-01-14 17:16:43,084 [main] INFO  TypeExtractor - Class class 
org.apache.flink.types.Row cannot be used as a POJO type because not all fields 
are valid POJO fields, and must be processed as GenericType. Please read the 
Flink documentation on "Data Types & Serialization" for details of the effect 
on performance.

Will this be problematic even now that we've provided TypeInfos for the Rows? 
Performance is something that I'm concerned about as I've already introduced a 
new operation to transform our records to Rows.
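
For reference, a minimal sketch of that kind of GenericRecord-to-Row map operation, assuming Flink 1.9's flink-avro utilities and that the Avro schema is available as a string (genericRecords and avroSchemaString are assumed to exist):

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.formats.avro.typeutils.AvroSchemaConverter;
import org.apache.flink.types.Row;

RowTypeInfo rowType = (RowTypeInfo) AvroSchemaConverter.convertToTypeInfo(avroSchemaString);
String[] fieldNames = rowType.getFieldNames();

DataSet<Row> rows = genericRecords
        .map((GenericRecord record) -> {
            Row row = new Row(fieldNames.length);
            for (int i = 0; i < fieldNames.length; i++) {
                // Note: Avro string values arrive as Utf8 and may need toString()
                row.setField(i, record.get(fieldNames[i]));
            }
            return row;
        })
        .returns(rowType);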

// ah

From: Hailu, Andreas [Engineering]
Sent: Wednesday, January 8, 2020 12:08 PM
To: 'Dawid Wysakowicz' ; user@flink.apache.org
Cc: Richards, Adam S [Engineering] 
Subject: RE: Table API: Joining on Tables of Complex Types

Very well - I'll give this a try. Thanks, Dawid.

// ah

From: Dawid Wysakowicz mailto:dwysakow...@apache.org>>
Sent: Wednesday, January 8, 2020 7:21 AM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Cc: Richards, Adam S [Engineering] 
mailto:adam.richa...@ny.email.gs.com>>
Subject: Re: Table API: Joining on Tables of Complex Types


Hi Andreas,

Converting your GenericRecords to Rows would definitely be the safest option. 
You can check how it's done in 
org.apache.flink.formats.avro.AvroRowDeserializationSchema. You can reuse the 
logic from there to write something like:

DataSet<GenericRecord> dataset = ...

dataset.map( /* convert GenericRecord to Row 
*/).returns(AvroSchemaConverter.convertToTypeInfo(avroSchemaString));

Another thing you could try is to make sure that GenericRecord is seen as an 
Avro type by Flink (Flink should understand that the Avro type is a complex type):

dataset.returns(new GenericRecordAvroTypeInfo(/* Avro Schema */));

Then the TableEnvironment should pick it up as a structured type and flatten it 
automatically when registering the Table. Bear in mind the returns method is 
part of SingleInputUdfOperator, so you can apply it right after some 
transformation, e.g. map/flatMap.
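
A small sketch of that second option, assuming the Avro schema string is at hand (variable names are illustrative):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
import org.apache.flink.table.api.Table;

Schema schema = new Schema.Parser().parse(avroSchemaString);

DataSet<GenericRecord> typed = genericRecords
        .map(record -> record)  // any transformation; returns() hangs off the resulting operator
        .returns(new GenericRecordAvroTypeInfo(schema));

Table users = batchTableEnvironment.fromDataSet(typed);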

Best,

Dawid


On 06/01/2020 18:03, Hailu, Andreas wrote:
Hi David, thanks for getting back.

From what you've said, I think we'll need to convert our GenericRecord into 
structured types - do you have any references or examples I can have a look 
at? If not, perhaps you could just show me a basic example of flattening a 
complex object with accessors into a Table of structured types. Or by 
structured types, did you mean Row?

// ah

From: Dawid Wysakowicz <mailto:dwysakow...@apache.org>
Sent: Monday, January 6, 2020 9:32 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Cc: Richards, Adam S [Engineering] 
<mailto:adam.richa...@ny.email.gs.com>
Subject: Re: Table API: Joining on Tables of Complex Types


Hi Andreas,

First of all I would highly recommend converting non-structured types to 
structured types as soon as possible, as it opens more possibilities to optimize 
the plan.

Have you tried:

Table users = 
batchTableEnvironment.fromDataSet(usersDataset).select("getField(f0, userName) 
as userName", "f0")
Table other = 
batchTableEnvironment.fromDataSet(otherDataset).select("getField(f0, userName) 
as user", "f1")

Table result = other.join(users, "user = userName")

You could also check how the 
org.apache.flink.formats.avro.AvroRowDeserializationSchema class is implemented 
which internally converts an avro record to a structured Row.

Hope this helps.

Best,

Dawid
On 03/01/2020 23:16, Hailu, Andreas wrote:
Hi folks,

I'm trying to join two Tables which are composed of complex types, Avro's 
GenericRecord to be exact. I have to use a custom UDF to extract fields out of 
the record and I'm having some trouble on how to do joins on them as I need to 
call this UDF to read what I need. Example below:

batchTableEnvironment.registerFunction("getField", new GRFieldExtractor()); // 
GenericRecord field extractor
Table users = batchTableEnvironment.fromDataSet(usersDataset); // Converting 
from some pre-existing DataSet
Table otherDataset = batchTableEnvironment.fromDataSet(someOtherDataset);
Table userNames = t.select("getField(f0, userName)"); // This is how the UDF is 
used, as GenericRecord is a complex type requiring you to invoke a get() method 
on the f

RE: Table API: Joining on Tables of Complex Types

2020-01-08 Thread Hailu, Andreas
Very well - I'll give this a try. Thanks, Dawid.

// ah

From: Dawid Wysakowicz 
Sent: Wednesday, January 8, 2020 7:21 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Cc: Richards, Adam S [Engineering] 
Subject: Re: Table API: Joining on Tables of Complex Types


Hi Andreas,

Converting your GenericRecords to Rows would definitely be the safest option. 
You can check how it's done in 
org.apache.flink.formats.avro.AvroRowDeserializationSchema. You can reuse the 
logic from there to write something like:

DataSet<GenericRecord> dataset = ...

dataset.map( /* convert GenericRecord to Row 
*/).returns(AvroSchemaConverter.convertToTypeInfo(avroSchemaString));

Another thing you could try is to make sure that GenericRecord is seen as an 
Avro type by Flink (Flink should understand that the Avro type is a complex type):

dataset.returns(new GenericRecordAvroTypeInfo(/* Avro Schema */));

Then the TableEnvironment should pick it up as a structured type and flatten it 
automatically when registering the Table. Bear in mind the returns method is 
part of SingleInputUdfOperator, so you can apply it right after some 
transformation, e.g. map/flatMap.

Best,

Dawid


On 06/01/2020 18:03, Hailu, Andreas wrote:
Hi David, thanks for getting back.

From what you've said, I think we'll need to convert our GenericRecord into 
structured types - do you have any references or examples I can have a look 
at? If not, perhaps you could just show me a basic example of flattening a 
complex object with accessors into a Table of structured types. Or by 
structured types, did you mean Row?

// ah

From: Dawid Wysakowicz <mailto:dwysakow...@apache.org>
Sent: Monday, January 6, 2020 9:32 AM
To: Hailu, Andreas [Engineering] 
<mailto:andreas.ha...@ny.email.gs.com>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Cc: Richards, Adam S [Engineering] 
<mailto:adam.richa...@ny.email.gs.com>
Subject: Re: Table API: Joining on Tables of Complex Types


Hi Andreas,

First of all I would highly recommend converting non-structured types to 
structured types as soon as possible, as it opens more possibilities to optimize 
the plan.

Have you tried:

Table users = 
batchTableEnvironment.fromDataSet(usersDataset).select("getField(f0, userName) 
as userName", "f0")
Table other = 
batchTableEnvironment.fromDataSet(otherDataset).select("getField(f0, userName) 
as user", "f1")

Table result = other.join(users, "user = userName")

You could also check how the 
org.apache.flink.formats.avro.AvroRowDeserializationSchema class is implemented 
which internally converts an avro record to a structured Row.

Hope this helps.

Best,

Dawid
On 03/01/2020 23:16, Hailu, Andreas wrote:
Hi folks,

I'm trying to join two Tables which are composed of complex types, Avro's 
GenericRecord to be exact. I have to use a custom UDF to extract fields out of 
the record and I'm having some trouble on how to do joins on them as I need to 
call this UDF to read what I need. Example below:

batchTableEnvironment.registerFunction("getField", new GRFieldExtractor()); // 
GenericRecord field extractor
Table users = batchTableEnvironment.fromDataSet(usersDataset); // Converting 
from some pre-existing DataSet
Table otherDataset = batchTableEnvironment.fromDataSet(someOtherDataset);
Table userNames = t.select("getField(f0, userName)"); // This is how the UDF is 
used, as GenericRecord is a complex type requiring you to invoke a get() method 
on the field you're interested in. Here we get a get on field 'userName'

I'd like to do something using the Table API similar to the query "SELECT * 
from otherDataset WHERE otherDataset.userName = users.userName". How is this 
done?

Best,
Andreas

The Goldman Sachs Group, Inc. All rights reserved.
See http://www.gs.com/disclaimer/global_email for important risk disclosures, 
conflicts of interest and other terms and conditions relating to this e-mail 
and your reliance on information contained in it.  This message may contain 
confidential or privileged information.  If you are not the intended recipient, 
please advise us immediately and delete this message.  See 
http://www.gs.com/disclaimer/email for further information on confidentiality 
and the risks of non-secure electronic communication.  If you cannot access 
these links, please notify us by reply message and we will send the contents to 
you.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>



Your Personal Data: We may collect and process informat

RE: Table API: Joining on Tables of Complex Types

2020-01-06 Thread Hailu, Andreas
Hi David, thanks for getting back.

From what you've said, I think we'll need to convert our GenericRecord into 
structured types - do you have any references or examples I can have a look 
at? If not, perhaps you could just show me a basic example of flattening a 
complex object with accessors into a Table of structured types. Or by 
structured types, did you mean Row?

// ah

From: Dawid Wysakowicz 
Sent: Monday, January 6, 2020 9:32 AM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Cc: Richards, Adam S [Engineering] 
Subject: Re: Table API: Joining on Tables of Complex Types


Hi Andreas,

First of all I would highly recommend converting non-structured types to 
structured types as soon as possible, as it opens more possibilities to optimize 
the plan.

Have you tried:

Table users = 
batchTableEnvironment.fromDataSet(usersDataset).select("getField(f0, userName) 
as userName", "f0")
Table other = 
batchTableEnvironment.fromDataSet(otherDataset).select("getField(f0, userName) 
as user", "f1")

Table result = other.join(users, "user = userName")

You could also check how the 
org.apache.flink.formats.avro.AvroRowDeserializationSchema class is implemented 
which internally converts an avro record to a structured Row.

Hope this helps.

Best,

Dawid
On 03/01/2020 23:16, Hailu, Andreas wrote:
Hi folks,

I'm trying to join two Tables which are composed of complex types, Avro's 
GenericRecord to be exact. I have to use a custom UDF to extract fields out of 
the record and I'm having some trouble on how to do joins on them as I need to 
call this UDF to read what I need. Example below:

batchTableEnvironment.registerFunction("getField", new GRFieldExtractor()); // 
GenericRecord field extractor
Table users = batchTableEnvironment.fromDataSet(usersDataset); // Converting 
from some pre-existing DataSet
Table otherDataset = batchTableEnvironment.fromDataSet(someOtherDataset);
Table userNames = t.select("getField(f0, userName)"); // This is how the UDF is 
used, as GenericRecord is a complex type requiring you to invoke a get() method 
on the field you're interested in. Here we get a get on field 'userName'

I'd like to do something using the Table API similar to the query "SELECT * 
from otherDataset WHERE otherDataset.userName = users.userName". How is this 
done?

Best,
Andreas

The Goldman Sachs Group, Inc. All rights reserved.
See http://www.gs.com/disclaimer/global_email for important risk disclosures, 
conflicts of interest and other terms and conditions relating to this e-mail 
and your reliance on information contained in it.  This message may contain 
confidential or privileged information.  If you are not the intended recipient, 
please advise us immediately and delete this message.  See 
http://www.gs.com/disclaimer/email for further information on confidentiality 
and the risks of non-secure electronic communication.  If you cannot access 
these links, please notify us by reply message and we will send the contents to 
you.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


Table API: Joining on Tables of Complex Types

2020-01-03 Thread Hailu, Andreas
Hi folks,

I'm trying to join two Tables which are composed of complex types, Avro's 
GenericRecord to be exact. I have to use a custom UDF to extract fields out of 
the record and I'm having some trouble on how to do joins on them as I need to 
call this UDF to read what I need. Example below:

batchTableEnvironment.registerFunction("getField", new GRFieldExtractor()); // 
GenericRecord field extractor
Table users = batchTableEnvironment.fromDataSet(usersDataset); // Converting 
from some pre-existing DataSet
Table otherDataset = batchTableEnvironment.fromDataSet(someOtherDataset);
Table userNames = t.select("getField(f0, userName)"); // This is how the UDF is 
used, as GenericRecord is a complex type requiring you to invoke a get() method 
on the field you're interested in. Here we get a get on field 'userName'
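
For context, a hedged sketch of what such a field-extractor UDF can look like with the old Table API ScalarFunction; the class name matches the snippet above, but the body is an illustration, not the actual implementation, and depending on the planner it may also need explicit type information for the GenericRecord argument:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.table.functions.ScalarFunction;

public class GRFieldExtractor extends ScalarFunction {
    // Invoked from Table API expressions as getField(f0, ...), returning the field as a string.
    public String eval(GenericRecord record, String fieldName) {
        Object value = record.get(fieldName);
        return value == null ? null : value.toString();
    }
}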

I'd like to do something using the Table API similar to the query "SELECT * 
from otherDataset WHERE otherDataset.userName = users.userName". How is this 
done?

Best,
Andreas

The Goldman Sachs Group, Inc. All rights reserved.
See http://www.gs.com/disclaimer/global_email for important risk disclosures, 
conflicts of interest and other terms and conditions relating to this e-mail 
and your reliance on information contained in it.  This message may contain 
confidential or privileged information.  If you are not the intended recipient, 
please advise us immediately and delete this message.  See 
http://www.gs.com/disclaimer/email for further information on confidentiality 
and the risks of non-secure electronic communication.  If you cannot access 
these links, please notify us by reply message and we will send the contents to 
you.




Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-22 Thread Hailu, Andreas
Zhijiang, Piotr, we made this change and it solved our mmap usage problem, so 
we can move forward in our testing. Thanks.

I’m curious – if I’m understanding this change in 1.9 correctly, blocking 
result partitions were being written to mmap which in turn resulted in 
exhausting container memory? This is why we were seeing failures in our 
pipelines which had operators which fed into a CoGroup?

// ah

From: Zhijiang 
Sent: Thursday, November 21, 2019 9:48 PM
To: Hailu, Andreas [Engineering] ; Piotr 
Nowojski 
Cc: user@flink.apache.org
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

The hint of mmap usage below is really helpful to locate this problem. I forgot 
this biggest change for batch job in release-1.9.
The blocking type option can be set to `file` as Piotr suggested to behave 
similar as before. I think it can solve your problem.

--
From:Hailu, Andreas mailto:andreas.ha...@gs.com>>
Send Time:2019 Nov. 21 (Thu.) 23:37
To:Piotr Nowojski mailto:pi...@ververica.com>>
Cc:Zhijiang mailto:wangzhijiang...@aliyun.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org> 
mailto:user@flink.apache.org>>
Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Thanks, Piotr. We’ll rerun our apps today with this and get back to you.

// ah

From: Piotr Nowojski mailto:pi...@data-artisans.com>> 
On Behalf Of Piotr Nowojski
Sent: Thursday, November 21, 2019 10:14 AM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>
Cc: Zhijiang mailto:wangzhijiang...@aliyun.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

I would suspect this:
https://issues.apache.org/jira/browse/FLINK-12070
To be the source of the problems.

There seems to be a hidden configuration option that avoids using memory mapped 
files:

taskmanager.network.bounded-blocking-subpartition-type: file

Could you test if it helps?

Piotrek

On 21 Nov 2019, at 15:22, Hailu, Andreas 
mailto:andreas.ha...@gs.com>> wrote:

Hi Zhijiang,

I looked into the container logs for the failure, and didn’t see any specific 
OutOfMemory errors before it was killed. I ran the application using the same 
config this morning on 1.6.4, and it went through successfully. I took a 
snapshot of the memory usage from the dashboard and can send it to you if you 
like for reference.

What stands out to me as suspicious is that on 1.9.1, the application is using 
nearly 6GB of Mapped memory before it dies, while 1.6.4 uses 0 throughout its 
runtime and succeeds. The JVM heap memory itself never exceeds its capacity, 
peaking at 6.65GB, so it sounds like the problem lies somewhere in the changes 
around mapped memory.

// ah

From: Zhijiang mailto:wangzhijiang...@aliyun.com>>
Sent: Wednesday, November 20, 2019 11:32 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi Andreas,

You are running a batch job, so there should be no native memory used by the RocksDB 
state backend. Then I guess it is either heap memory or direct memory being 
overused. The heap managed memory is mainly used by batch operators and direct 
memory is used by network shuffle. Can you further check whether there are any 
logs to indicate HeapOutOfMemory or DirectOutOfMemory before killed? If the 
used memory exceeds the JVM configuration, it should throw that error. Then we 
can further narrow down the scope. I can not remember the changes of memory 
issues for managed memory or network stack, especially it really spans several 
releases.

Best,
Zhijiang

--
From:Hailu, Andreas mailto:andreas.ha...@gs.com>>
Send Time:2019 Nov. 21 (Thu.) 01:03
To:user@flink.apache.org<mailto:user@flink.apache.org> 
mailto:user@flink.apache.org>>
Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Going through the release notes today - we tried fiddling with the 
taskmanager.memory.fraction option, going as low as 0.1 with unfortunately no 
success. It still leads to the container running beyond physical memory limits.

// ah

From: Hailu, Andreas [Engineering]
Sent: Tuesday, November 19, 2019 6:01 PM
To: 'user@flink.apache.org<mailto:user@flink.apache.org>' 
mailto:user@flink.apache.org>>
Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

We’re in the mi

RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-21 Thread Hailu, Andreas
Thanks, Piotr. We’ll rerun our apps today with this and get back to you.

// ah

From: Piotr Nowojski  On Behalf Of Piotr Nowojski
Sent: Thursday, November 21, 2019 10:14 AM
To: Hailu, Andreas [Engineering] 
Cc: Zhijiang ; user@flink.apache.org
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

I would suspect this:
https://issues.apache.org/jira/browse/FLINK-12070
To be the source of the problems.

There seems to be a hidden configuration option that avoids using memory mapped 
files:

taskmanager.network.bounded-blocking-subpartition-type: file

Could you test if it helps?

Piotrek


On 21 Nov 2019, at 15:22, Hailu, Andreas 
mailto:andreas.ha...@gs.com>> wrote:

Hi Zhijiang,

I looked into the container logs for the failure, and didn’t see any specific 
OutOfMemory errors before it was killed. I ran the application using the same 
config this morning on 1.6.4, and it went through successfully. I took a 
snapshot of the memory usage from the dashboard and can send it to you if you 
like for reference.

What stands out to me as suspicious is that on 1.9.1, the application is using 
nearly 6GB of Mapped memory before it dies, while 1.6.4 uses 0 throughout its 
runtime and succeeds. The JVM heap memory itself never exceeds its capacity, 
peaking at 6.65GB, so it sounds like the problem lies somewhere in the changes 
around mapped memory.

// ah

From: Zhijiang mailto:wangzhijiang...@aliyun.com>>
Sent: Wednesday, November 20, 2019 11:32 PM
To: Hailu, Andreas [Engineering] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi Andreas,

You are running a batch job, so there should be no native memory used by the RocksDB 
state backend. Then I guess it is either heap memory or direct memory being 
overused. The heap managed memory is mainly used by batch operators and direct 
memory is used by network shuffle. Can you further check whether there are any 
logs to indicate HeapOutOfMemory or DirectOutOfMemory before killed? If the 
used memory exceeds the JVM configuration, it should throw that error. Then we 
can further narrow down the scope. I can not remember the changes of memory 
issues for managed memory or network stack, especially it really spans several 
releases.

Best,
Zhijiang

--
From:Hailu, Andreas mailto:andreas.ha...@gs.com>>
Send Time:2019 Nov. 21 (Thu.) 01:03
To:user@flink.apache.org<mailto:user@flink.apache.org> 
mailto:user@flink.apache.org>>
Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Going through the release notes today - we tried fiddling with the 
taskmanager.memory.fraction option, going as low as 0.1 with unfortunately no 
success. It still leads to the container running beyond physical memory limits.

// ah

From: Hailu, Andreas [Engineering]
Sent: Tuesday, November 19, 2019 6:01 PM
To: 'user@flink.apache.org<mailto:user@flink.apache.org>' 
mailto:user@flink.apache.org>>
Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

We’re in the middle of testing the upgrade of our data processing flows from 
Flink 1.6.4 to 1.9.1. We’re seeing that flows which were running just fine on 
1.6.4 now fail on 1.9.1 with the same application resources and input data 
size. It seems that there have been some changes around how the data is sorted 
prior to being fed to the CoGroup operator - this is the error that we 
encounter:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
... 15 more
Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset 
| Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'.
 This indicates that the remote task manager was lost.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
... 1 more
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated 

RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-21 Thread Hailu, Andreas
Hi Zhijiang,

I looked into the container logs for the failure, and didn’t see any specific 
OutOfMemory errors before it was killed. I ran the application using the same 
config this morning on 1.6.4, and it went through successfully. I took a 
snapshot of the memory usage from the dashboard and can send it to you if you 
like for reference.

What stands out to me as suspicious is that on 1.9.1, the application is using 
nearly 6GB of Mapped memory before it dies, while 1.6.4 uses 0 throughout its 
runtime and succeeds. The JVM heap memory itself never exceeds its capacity, 
peaking at 6.65GB, so it sounds like the problem lies somewhere in the changes 
around mapped memory.

// ah

From: Zhijiang 
Sent: Wednesday, November 20, 2019 11:32 PM
To: Hailu, Andreas [Engineering] ; 
user@flink.apache.org
Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi Andreas,

You are running a batch job, so there should be no native memory used by the RocksDB 
state backend. Then I guess it is either heap memory or direct memory being 
overused. The heap managed memory is mainly used by batch operators and direct 
memory is used by network shuffle. Can you further check whether there are any 
logs to indicate HeapOutOfMemory or DirectOutOfMemory before killed? If the 
used memory exceeds the JVM configuration, it should throw that error. Then we 
can further narrow down the scope. I can not remember the changes of memory 
issues for managed memory or network stack, especially it really spans several 
releases.

Best,
Zhijiang

--
From:Hailu, Andreas mailto:andreas.ha...@gs.com>>
Send Time:2019 Nov. 21 (Thu.) 01:03
To:user@flink.apache.org mailto:user@flink.apache.org>>
Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Going through the release notes today - we tried fiddling with the 
taskmanager.memory.fraction option, going as low as 0.1 with unfortunately no 
success. It still leads to the container running beyond physical memory limits.

// ah

From: Hailu, Andreas [Engineering]
Sent: Tuesday, November 19, 2019 6:01 PM
To: 'user@flink.apache.org' 
mailto:user@flink.apache.org>>
Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

We’re in the middle of testing the upgrade of our data processing flows from 
Flink 1.6.4 to 1.9.1. We’re seeing that flows which were running just fine on 
1.6.4 now fail on 1.9.1 with the same application resources and input data 
size. It seems that there have been some changes around how the data is sorted 
prior to being fed to the CoGroup operator - this is the error that we 
encounter:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
... 15 more
Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset 
| Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
... 1 more
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
at 
org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)

I drilled further down into the YARN app logs, and I found that the container 
was running out of physical memory:

2019-11-19 12:49:23,068 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e42_1574076744505_9444_01_04 because: Container 
[pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory 
used; 13.9 GB of 25.2 GB virtual memory used. Killing container.

This is what leads my suspicions as this resourcing configuration worked just 
fine on 1.6.4

I’m working on getting heap dumps of these applications to try and get a better 
understanding of what’s causing the blowup 

RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-20 Thread Hailu, Andreas
Going through the release notes today - we tried fiddling with the 
taskmanager.memory.fraction option, going as low as 0.1 with unfortunately no 
success. It still leads to the container running beyond physical memory limits.

// ah

From: Hailu, Andreas [Engineering]
Sent: Tuesday, November 19, 2019 6:01 PM
To: 'user@flink.apache.org' 
Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

We're in the middle of testing the upgrade of our data processing flows from 
Flink 1.6.4 to 1.9.1. We're seeing that flows which were running just fine on 
1.6.4 now fail on 1.9.1 with the same application resources and input data 
size. It seems that there have been some changes around how the data is sorted 
prior to being fed to the CoGroup operator - this is the error that we 
encounter:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
... 15 more
Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset 
| Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
... 1 more
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
at 
org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)

I drilled further down into the YARN app logs, and I found that the container 
was running out of physical memory:

2019-11-19 12:49:23,068 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e42_1574076744505_9444_01_04 because: Container 
[pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory 
used; 13.9 GB of 25.2 GB virtual memory used. Killing container.

This is what leads my suspicions as this resourcing configuration worked just 
fine on 1.6.4

I'm working on getting heap dumps of these applications to try and get a better 
understanding of what's causing the blowup in physical memory required myself, 
but it would be helpful if anyone knew what relevant changes have been made 
between these versions or where else I could look? There are some features in 
1.9 that we'd like to use in our flows so getting this sorted out, no pun 
intended, is inhibiting us from doing so.

Best,
Andreas



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>


CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-19 Thread Hailu, Andreas
Hi,

We're in the middle of testing the upgrade of our data processing flows from 
Flink 1.6.4 to 1.9.1. We're seeing that flows which were running just fine on 
1.6.4 now fail on 1.9.1 with the same application resources and input data 
size. It seems that there have been some changes around how the data is sorted 
prior to being fed to the CoGroup operator - this is the error that we 
encounter:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at 
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
... 15 more
Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset 
| Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
... 1 more
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger Reading Thread' terminated due to an exception: Lost connection to 
task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that 
the remote task manager was lost.
at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
at 
org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)

I drilled further down into the YARN app logs, and I found that the container 
was running out of physical memory:

2019-11-19 12:49:23,068 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e42_1574076744505_9444_01_04 because: Container 
[pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory 
used; 13.9 GB of 25.2 GB virtual memory used. Killing container.

This is what leads my suspicions as this resourcing configuration worked just 
fine on 1.6.4

I'm working on getting heap dumps of these applications to try and get a better 
understanding of what's causing the blowup in physical memory required myself, 
but it would be helpful if anyone knew what relevant changes have been made 
between these versions or where else I could look? There are some features in 
1.9 that we'd like to use in our flows so getting this sorted out, no pun 
intended, is inhibiting us from doing so.

Best,
Andreas



Your Personal Data: We may collect and process information about you that may 
be subject to data protection laws. For more information about how we use and 
disclose your personal data, how we protect your information, our legal basis 
to use your information, your rights and who you can contact, please refer to: 
www.gs.com/privacy-notices


RE: Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

2019-07-04 Thread Hailu, Andreas
Very well - thank you both.

// ah

From: Haibo Sun 
Sent: Wednesday, July 3, 2019 9:37 PM
To: Hailu, Andreas [Tech] 
Cc: Yitzchak Lieberman ; user@flink.apache.org
Subject: Re:RE: Re:Re: File Naming Pattern from HadoopOutputFormat

Hi, Andreas

I'm glad you have found a solution. If you're interested in option 2 that I talked 
about, you can follow the progress of the issue that Yitzchak mentioned 
(https://issues.apache.org/jira/browse/FLINK-12573) by watching it.

Best,
Haibo

At 2019-07-03 21:11:44, "Hailu, Andreas" 
mailto:andreas.ha...@gs.com>> wrote:

Hi Haibo, Yitzchak, thanks for getting back to me.

The pattern that worked for me was to extend the HadoopOutputFormat class, 
override the open() method, and modify the "mapreduce.output.basename" 
configuration property to match my desired file naming structure.

// ah

From: Haibo Sun mailto:sunhaib...@163.com>>
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman 
mailto:yitzch...@sentinelone.com>>
Cc: Hailu, Andreas [Tech] 
mailto:andreas.ha...@ny.email.gs.com>>; 
user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat


Hi, Andreas

You are right. To meet this requirement, Flink would need to expose an 
interface that allows customizing the filename.

Best,
Haibo

At 2019-07-02 16:33:44, "Yitzchak Lieberman" 
mailto:yitzch...@sentinelone.com>> wrote:
regarding option 2 for parquet:
implementing a bucket assigner won't set the file name, as getBucketId() defines 
the directory for the files when partitioning the data, for example:
/day=20190101/part-1-1
there is an open issue for that: 
https://issues.apache.org/jira/browse/FLINK-12573

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun 
mailto:sunhaib...@163.com>> wrote:
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the  
getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html


  public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {

      public CustomAvroOutputFormat(Path filePath, Class<E> type) {
          super(filePath, type);
      }

      public CustomAvroOutputFormat(Class<E> type) {
          super(type);
      }

      @Override
      public void open(int taskNumber, int numTasks) throws IOException {
          // always create an output directory, even with a single write task
          this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
          super.open(taskNumber, numTasks);
      }

      @Override
      protected String getDirectoryFileName(int taskNumber) {
          // returns a custom filename
          return null;
      }
  }

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, 
StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a 
class that implements the BucketAssigner interface and return a custom file 
name in the getBucketId() method (the value returned by getBucketId() will be 
treated as the file name).

ParquetStreamingFileSinkITCase:  
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

RE: Re:Re: File Naming Pattern from HadoopOutputFormat

2019-07-03 Thread Hailu, Andreas
Hi Haibo, Yitzchak, thanks for getting back to me.

The pattern that worked for me was to extend the HadoopOutputFormat class, 
override the open() method, and modify the "mapreduce.output.basename" 
configuration property to match my desired file naming structure, roughly as sketched below.
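
A minimal sketch of that approach - the class name and UUID prefix here are 
illustrative rather than our exact code, and it assumes HadoopOutputFormat (the 
mapreduce-API wrapper) exposes its Hadoop Configuration via getConfiguration():

import java.io.IOException;
import java.util.UUID;

import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;

public class UuidNamedOutputFormat<K, V> extends HadoopOutputFormat<K, V> {

    public UuidNamedOutputFormat(OutputFormat<K, V> hadoopFormat, Job job) {
        super(hadoopFormat, job);
    }

    @Override
    public void open(int taskNumber, int numTasks) throws IOException {
        // Prefix output files with a UUID, e.g. <uuid>-r-00000.snappy.parquet,
        // by changing the basename that Hadoop's FileOutputFormat uses.
        getConfiguration().set("mapreduce.output.basename", UUID.randomUUID().toString());
        super.open(taskNumber, numTasks);
    }
}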

// ah

From: Haibo Sun 
Sent: Tuesday, July 2, 2019 5:57 AM
To: Yitzchak Lieberman 
Cc: Hailu, Andreas [Tech] ; user@flink.apache.org
Subject: Re:Re: File Naming Pattern from HadoopOutputFormat


Hi, Andreas

You are right. To meet this requirement, Flink would need to expose an 
interface that allows customizing the filename.

Best,
Haibo

At 2019-07-02 16:33:44, "Yitzchak Lieberman" 
mailto:yitzch...@sentinelone.com>> wrote:

regarding option 2 for parquet:
implementing a bucket assigner won't set the file name, as getBucketId() defines 
the directory for the files when partitioning the data, for example:
/day=20190101/part-1-1
there is an open issue for that: 
https://issues.apache.org/jira/browse/FLINK-12573

On Tue, Jul 2, 2019 at 6:18 AM Haibo Sun 
mailto:sunhaib...@163.com>> wrote:
Hi, Andreas

I think the following things may be what you want.

1. For writing Avro, I think you can extend AvroOutputFormat and override the  
getDirectoryFileName() method to customize a file name, as shown below.
The javadoc of AvroOutputFormat: 
https://ci.apache.org/projects/flink/flink-docs-release-1.8/api/java/org/apache/flink/formats/avro/AvroOutputFormat.html


  public static class CustomAvroOutputFormat<E> extends AvroOutputFormat<E> {

      public CustomAvroOutputFormat(Path filePath, Class<E> type) {
          super(filePath, type);
      }

      public CustomAvroOutputFormat(Class<E> type) {
          super(type);
      }

      @Override
      public void open(int taskNumber, int numTasks) throws IOException {
          // always create an output directory, even with a single write task
          this.setOutputDirectoryMode(OutputDirectoryMode.ALWAYS);
          super.open(taskNumber, numTasks);
      }

      @Override
      protected String getDirectoryFileName(int taskNumber) {
          // returns a custom filename
          return null;
      }
  }

2. For writing Parquet, you can refer to ParquetStreamingFileSinkITCase, 
StreamingFileSink#forBulkFormat and DateTimeBucketAssigner. You can create a 
class that implements the BucketAssigner interface and return a custom file 
name in the getBucketId() method (the value returned by getBucketId() will be 
treated as the file name).
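
A minimal sketch of such an assigner, assuming the 1.8 StreamingFileSink API; the 
class name and the fixed UUID bucket id are illustrative, and note that, per the 
FLINK-12573 discussion above, this value ends up as a directory name rather than 
the file name itself:

import java.util.UUID;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class UuidBucketAssigner<IN> implements BucketAssigner<IN, String> {

    private final String bucketId = UUID.randomUUID().toString();

    @Override
    public String getBucketId(IN element, Context context) {
        // Route every element to the same UUID-named bucket
        return bucketId;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}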

ParquetStreamingFileSinkITCase:  
https://github.com/apache/flink/blob/master/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java

StreamingFileSink#forBulkFormat: 
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/StreamingFileSink.java

DateTimeBucketAssigner: 
https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/bucketassigners/DateTimeBuck

File Naming Pattern from HadoopOutputFormat

2019-07-01 Thread Hailu, Andreas
Hello Flink team,

I'm writing Avro and Parquet files to HDFS, and I would like to include a 
UUID as part of the file name.

Our files in HDFS currently follow this pattern:

tmp-r-1.snappy.parquet
tmp-r-2.snappy.parquet
...

I'm using a custom output format which extends RichOutputFormat - is this 
something that is natively supported? If so, could you please recommend how 
this could be done, or share the relevant documentation?

Best,
Andreas


