Re: FileSource may cause akka.pattern.AskTimeoutException, and akka.ask.timeout not workable

2021-06-10 Thread Roman Khachatryan
Hi,

I think you need to increase client.timeout [1].
Regarding the FileSource, it's difficult to say whether it is the
reason. The logs you provided are from the client; JobManager logs
would be helpful.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#client-timeout
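
For illustration, a minimal flink-conf.yaml sketch of the suggestion above. The 600s value is an arbitrary assumption, not a figure recommended anywhere in this thread. As far as I know, web.timeout is interpreted in milliseconds, so the logged value of 100 would not correspond to 360 seconds; please check that against the configuration docs for your Flink version.

```yaml
# flink-conf.yaml (sketch; values are illustrative assumptions)

# Client-side timeout suggested in the reply above.
client.timeout: 600s

# Already present in the logged settings; shown only for context.
akka.ask.timeout: 360s

# Assumption: web.timeout takes milliseconds, so 360 s would be 360000.
web.timeout: 360000
```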

Regards,
Roman

On Thu, Jun 10, 2021 at 6:34 AM 陳樺威 wrote:
>
> Hello all,
>
> Our team encounters akka.pattern.AskTimeoutException when starting the
> JobManager. Based on the error message, we tried setting akka.ask.timeout
> and web.timeout to 360s, but neither had any effect.
>
> We suspect the issue may be caused by FileSource.forRecordFileFormat. The
> application loads files in batch mode to rebuild our historical data. The
> job runs normally on a small batch, but it fails when run over a large
> number of files (around 3 files distributed in 1500 folders).
>
> The Flink application runs on Kubernetes in application mode, and the files
> are stored in Google Cloud Storage.
>
> Our questions are:
> 1. How do we increase akka.ask.timeout correctly in our case?
> 2. Is it caused by FileSource? If so, could you suggest how to prevent it?
>
>
> Our settings are as follows.
> ```
> 2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: kubernetes.container.image, */:**.*.**
> 2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: fs.hdfs.hadoopconfig, /opt/flink/conf
> 2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: taskmanager.numberOfTaskSlots, 4
> 2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: kubernetes.rest-service.exposed.type, ClusterIP
> 2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: high-availability.jobmanager.port, 6123
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: akka.ask.timeout, 360s
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: state.backend.rocksdb.memory.write-buffer-ratio, 0.7
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: metrics.reporter.prom.class, org.apache.flink.metrics.prometheus.PrometheusReporter
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: state.storage.fs.memory-threshold, 1048576
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.checkpointing.unaligned, true
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: web.timeout, 100
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.target, kubernetes-application
> 2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: restart-strategy.fixed-delay.attempts, 2147483647
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: jobmanager.memory.process.size, 8g
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: taskmanager.rpc.port, 6122
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: akka.framesize, 104857600b
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: containerized.master.env.HADOOP_CLASSPATH, /opt/flink/conf:/opt/hadoop-3.1.1/share/hadoop/common/lib/*:/opt/hadoop-3.1.1/share/hadoop/common/*:/opt/hadoop-3.1.1/share/hadoop/hdfs:/opt/hadoop-3.1.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.1.1/share/hadoop/hdfs/*:/opt/hadoop-3.1.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.1.1/share/hadoop/mapreduce/*:/opt/hadoop-3.1.1/share/hadoop/yarn:/opt/hadoop-3.1.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.1.1/share/hadoop/yarn/*:/contrib/capacity-scheduler/*.jar
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.attached, true
> 2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: internal.cluster.execution-mode, NO

FileSource may cause akka.pattern.AskTimeoutException, and akka.ask.timeout not workable

2021-06-09 Thread 陳樺威
Hello all,

Our team encounters akka.pattern.AskTimeoutException when starting the
JobManager. Based on the error message, we tried setting akka.ask.timeout
and web.timeout to 360s, but neither had any effect.

We suspect the issue may be caused by FileSource.forRecordFileFormat. The
application loads files in batch mode to rebuild our historical data.
The job runs normally on a small batch, but it fails when run over a
large number of files (around 3 files distributed in 1500 folders).

The Flink application runs on Kubernetes in application mode, and the files
are stored in Google Cloud Storage.

Our questions are:
1. How do we increase akka.ask.timeout correctly in our case?
2. Is it caused by FileSource? If so, could you suggest how to prevent it?


Our settings are as follows.
```
2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: kubernetes.container.image, */:**.*.**
2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: fs.hdfs.hadoopconfig, /opt/flink/conf
2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: taskmanager.numberOfTaskSlots, 4
2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: kubernetes.rest-service.exposed.type, ClusterIP
2021-06-10 03:44:14,317 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: high-availability.jobmanager.port, 6123
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: akka.ask.timeout, 360s
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: state.backend.rocksdb.memory.write-buffer-ratio, 0.7
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: metrics.reporter.prom.class, org.apache.flink.metrics.prometheus.PrometheusReporter
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: state.storage.fs.memory-threshold, 1048576
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.checkpointing.unaligned, true
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: web.timeout, 100
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.target, kubernetes-application
2021-06-10 03:44:14,318 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: restart-strategy.fixed-delay.attempts, 2147483647
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: jobmanager.memory.process.size, 8g
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: taskmanager.rpc.port, 6122
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: akka.framesize, 104857600b
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: containerized.master.env.HADOOP_CLASSPATH, /opt/flink/conf:/opt/hadoop-3.1.1/share/hadoop/common/lib/*:/opt/hadoop-3.1.1/share/hadoop/common/*:/opt/hadoop-3.1.1/share/hadoop/hdfs:/opt/hadoop-3.1.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.1.1/share/hadoop/hdfs/*:/opt/hadoop-3.1.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.1.1/share/hadoop/mapreduce/*:/opt/hadoop-3.1.1/share/hadoop/yarn:/opt/hadoop-3.1.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.1.1/share/hadoop/yarn/*:/contrib/capacity-scheduler/*.jar
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.attached, true
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: internal.cluster.execution-mode, NORMAL
2021-06-10 03:44:14,319 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: high-availability, org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
2021-06-10 03:44:14,320 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property: execution.checkpointing.externalized-checkpoint-retention, DELETE_ON_CANCELLATION
2021-06-10 03:44:14,320 INFO  org.apache.flink.configuration.GlobalConfiguration   [] - Loading configuration property