hi, all

集群信息:
flink版本是1.10,部署在kubernetes上,kubernetes版本为1.17.4,docker版本为19.03, cni使用的是weave。


现象:
作业运行的时候,偶发会出现operation卡住,下游收不到数据,水位线无法更新,反压上游,作业在一段时间会被kill掉的情况。


通过jstack出来的堆栈信息片段如下:


"Map (152/200)" #155 prio=5 os_prio=0 tid=0x00007f67a4076800 nid=0x31f waiting 
on condition [0x00007f66b04ed000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000608f3c600> (a 
java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:231)
        at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:209)
        at 
org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:189)
        at 
org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.requestNewBufferBuilder(ChannelSelectorRecordWriter.java:103)
        at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:145)
        at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:116)
        at 
org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:60)
        at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
        at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
        at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)




有怀疑过是虚拟化网络问题,增加了如下参数,不见效:
taskmanager.network.request-backoff.max: 300000
akka.ask.timeout: 120s
akka.watch.heartbeat.interval: 10s


尝试过调整buffer数量,不见效:
taskmanager.network.memory.floating-buffers-per-gate: 16
taskmanager.network.memory.buffers-per-channel: 6




目前我遭遇到的使用场景说明如上,希望得到一些回复和解答说明,非常感谢。

Looking forward to your reply and help.

Best



回复