Flink job repeated restart failure

2021-03-24 Thread VINAYA KUMAR BENDI
Dear all,

One of the Flink jobs gave below exception and failed. Several attempts to 
restart the job resulted in the same exception and the job failed each time. 
The job started successfully only after changing the file name.

Flink Version: 1.11.2

Exception
2021-03-24 20:13:09,288 INFO  org.apache.kafka.clients.producer.KafkaProducer   
   [] - [Producer clientId=producer-2] Closing the Kafka producer with 
timeoutMillis = 0 ms.
2021-03-24 20:13:09,288 INFO  org.apache.kafka.clients.producer.KafkaProducer   
   [] - [Producer clientId=producer-2] Proceeding to force close the 
producer since pending requests could not be completed within timeout 0 ms.
2021-03-24 20:13:09,304 WARN  org.apache.flink.runtime.taskmanager.Task 
   [] - Flat Map -> async wait operator -> Process -> Sink: Unnamed 
(1/1) (8905142514cb25adbd42980680562d31) switched from RUNNING to FAILED.
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method) 
~[?:1.8.0_252]
at java.io.File.createNewFile(File.java:1012) ~[?:1.8.0_252]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.createSpillingChannel(SpanningWrapper.java:291)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.updateLength(SpanningWrapper.java:178)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.transferFrom(SpanningWrapper.java:111)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNonSpanningRecord(SpillingAdaptiveSpanningRecordDeserializer.java:110)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:85)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:146)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536) 
~[flink-dist_2.12-1.11.2.jar:1.11.2]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2021-03-24 20:13:09,305 INFO  org.apache.flink.runtime.taskmanager.Task 
   [] - Freeing task resources for Flat Map -> async wait operator -> 
Process -> Sink: Unnamed (1/1) (8905142514cb25adbd42980680562d31).
2021-03-24 20:13:09,311 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor   [] - 
Un-registering task and sending final execution state FAILED to JobManager for 
task Flat Map -> async wait operator -> Process -> Sink: Unnamed (1/1) 
8905142514cb25adbd42980680562d31.

File: 
https://github.com/apache/flink/blob/release-1.11.2/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/serialization/SpanningWrapper.java

Related Jira ID: https://issues.apache.org/jira/browse/FLINK-18811

Similar exception mentioned in FLINK-18811 which has a fix in 1.12.0. Though in 
our case, we didn't notice any disk failure. Is there any other reason(s) for 
the above mentioned IOException?

While we are planning to upgrade to the latest Flink version, are there any 
other workaround(s) instead of deploying the job again with a different file 
name?

Kind regards,
Vinaya


Apache Flink Deployment in Multiple AZ

2021-02-11 Thread VINAYA KUMAR BENDI
Hi,

Is Apache Flink (Job Managers, Task Managers) setup in multiple availability 
zones a recommended deployment model?

Kind regards,
Vinaya


Apache Flink Job Manager High CPU with Couchbase

2021-01-28 Thread VINAYA KUMAR BENDI
Hello,
We work in a multinational company that produces diesel engines and is working 
on an IoT platform to analyze engine performance based on sensor data. We are 
using Flink for deploying analytics stream processing jobs. We recently 
integrated these jobs with Couchbase (serving as a Cache) and are monitoring 
the performance of these jobs in our test environment.
Flink Cluster
Two Job Managers (2 cpu, 8 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)
Six Task Managers (4 cpu, 16 GB, Centos 7, 50 GB Disk, Apache Flink 1.9.0)
Couchbase Cluster
Two nodes (4 cpu, 16 GB, Amazon Linux 2, 25 GB Disk, Community Edition 6.5.1 
build 6299)
Couchbase SDK
java-client (3.0.10)
We noticed that after re-deploying our analytics Flink jobs, high CPU usage 
alert is seen on Flink Job Manager.
  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
1831 centos20   0   11.3g   5.8g   6076 S 162.5 77.3  17181:30 java
It was observed that the Job Manager process (1831) comprised of 60.9% (873 out 
of 1433) threads related to couchbase timers of some sort (e.g. cb-timer-1-1, 
cb-events, cb-tracing-1, cb-orphan-1 etc.). This may be the reason for high CPU 
usage.
$ ps -eT | grep 1831 | grep -i cb | wc -l
873

$ ps -eT | grep 1831 | wc -l
1433

$ ps -eT | grep 1831 | grep -i cb
1831 11539 ?01:45:52 cb-timer-1-1
1831 11541 ?00:11:53 cb-events
1831 11542 ?00:10:13 cb-tracing-1
1831 11543 ?00:10:08 cb-orphan-1
1831 11545 ?00:04:47 cb-comp-1
1831 11546 ?00:06:41 cb-comp-2
1831 11547 ?00:34:57 cb-io-kv-5-1
1831 11549 ?00:17:48 cb-io-kv-5-2
1831 24911 ?00:43:07 cb-timer-1-1
1831 24912 ?00:05:34 cb-events
1831 24913 ?00:04:02 cb-tracing-1
1831 24914 ?00:04:02 cb-orphan-1
1831 24915 ?00:02:35 cb-comp-1
1831 24916 ?00:04:17 cb-comp-2
1831 24917 ?00:23:41 cb-io-kv-5-1
1831 24919 ?00:13:08 cb-io-kv-5-2
1831 24966 ?00:44:37 cb-timer-1-1
1831 24967 ?00:05:43 cb-events
...
If there is any issue in connecting to Couchbase from the Flink jobs running on 
Flink task managers then those may result in errors being logged in task 
manager or cause some other issue (e.g. CPU) in task managers. However, I am 
wondering why high CPU is seen on Flink Job Manager instead and not on Flink 
task managers. Please help in understanding possible reasons for creation of 
numerous cb-* threads on Flink Job Manager. If you have any other suggestions 
or commands to better troubleshoot this issue, those are welcome too.
Thank you.
Kind regards,
Vinaya