[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305035#comment-17305035 ]

Kyle Weaver commented on FLINK-10672:
-------------------------------------

I tried upgrading to Flink 1.12.1 and this problem no longer seems to happen with the default memory and pipelining options.

> Task stuck while writing output to flink
>
>
>                 Key: FLINK-10672
>                 URL: https://issues.apache.org/jira/browse/FLINK-10672
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.4
>         Environment: OS: Debian rodente 4.17
> Flink version: 1.5.4
> ||Key||Value||
> |jobmanager.heap.mb|1024|
> |jobmanager.rpc.address|localhost|
> |jobmanager.rpc.port|6123|
> |metrics.reporter.jmx.class|org.apache.flink.metrics.jmx.JMXReporter|
> |metrics.reporter.jmx.port|9250-9260|
> |metrics.reporters|jmx|
> |parallelism.default|1|
> |rest.port|8081|
> |taskmanager.heap.mb|1024|
> |taskmanager.numberOfTaskSlots|1|
> |web.tmpdir|/tmp/flink-web-bdb73d6c-5b9e-47b5-9ebf-eed0a7c82c26|
>
> h1. Overview
> ||Data Port||All Slots||Free Slots||CPU Cores||Physical Memory||JVM Heap Size||Flink Managed Memory||
> |43501|1|0|12|62.9 GB|922 MB|642 MB|
> h1. Memory
> h2. JVM (Heap/Non-Heap)
> ||Type||Committed||Used||Maximum||
> |Heap|922 MB|575 MB|922 MB|
> |Non-Heap|68.8 MB|64.3 MB|-1 B|
> |Total|991 MB|639 MB|922 MB|
> h2. Outside JVM
> ||Type||Count||Used||Capacity||
> |Direct|3,292|105 MB|105 MB|
> |Mapped|0|0 B|0 B|
> h1. Network
> h2. Memory Segments
> ||Type||Count||
> |Available|3,194|
> |Total|3,278|
> h1. Garbage Collection
> ||Collector||Count||Time||
> |G1_Young_Generation|13|336|
> |G1_Old_Generation|1|21|
>            Reporter: Ankur Goenka
>            Assignee: Yun Gao
>            Priority: Major
>              Labels: beam
>         Attachments: 0.14_all_jobs.jpg, 1uruvakHxBu.png, 3aDKQ24WvKk.png, Po89UGDn58V.png, WithBroadcastJob.png, jmx_dump.json, jmx_dump_detailed.json, jstack_129827.log, jstack_163822.log, jstack_66985.log
>
>
> I am running a fairly complex pipeline with 200+ tasks.
> The pipeline works fine with small data (order of 10kb input) but gets stuck with slightly larger data (300kb input).
>
> The task gets stuck while writing the output to Flink; more specifically, it gets stuck while requesting a memory segment in the local buffer pool. The TaskManager UI shows that it has enough memory and memory segments to work with.
> The relevant stack trace is
> {quote}"grpc-default-executor-0" #138 daemon prio=5 os_prio=0 tid=0x7fedb0163800 nid=0x30b7f in Object.wait() [0x7fedb4f9]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at (C/C++) 0x7fef201c7dae (Unknown Source)
> at (C/C++) 0x7fef1f2aea07 (Unknown Source)
> at (C/C++) 0x7fef1f241cd3 (Unknown Source)
> at java.lang.Object.wait(Native Method)
> - waiting on <0xf6d56450> (a java.util.ArrayDeque)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
> - locked <0xf6d56450> (a java.util.ArrayDeque)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
> at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
> at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:42)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:26)
> at org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80)
> at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction$MyDataReceiver.accept(FlinkExecutableStageFunction.java:230)
> - locked <0xf6a60bd0> (a java.lang.Object)
> at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:81)
> at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:32)
> at org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:139)
> at
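
The thread state in the quoted stack trace (TIMED_WAITING inside Object.wait() on a java.util.ArrayDeque while requesting a memory segment) can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not Flink's LocalBufferPool: the class names BoundedSegmentPool and BufferPoolSketch, the method names, and the timeout handling are all hypothetical. It shows how a writer that draws from a small, fixed pool cannot proceed until someone recycles a segment, regardless of how much memory exists elsewhere in the process.

```java
import java.util.ArrayDeque;

/** Toy model of a bounded segment pool (hypothetical; NOT Flink's LocalBufferPool). */
final class BoundedSegmentPool {
    private final ArrayDeque<byte[]> available = new ArrayDeque<>();

    BoundedSegmentPool(int segments, int segmentSize) {
        for (int i = 0; i < segments; i++) {
            available.add(new byte[segmentSize]);
        }
    }

    /** Waits until a segment is recycled or maxWaitMillis elapses; returns null on timeout. */
    byte[] requestSegment(long maxWaitMillis) {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        synchronized (available) {
            while (available.isEmpty()) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    return null; // gave up: nobody returned a segment in time
                }
                try {
                    // A thread parked here appears as TIMED_WAITING (on object monitor).
                    available.wait(remainingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return null;
                }
            }
            return available.poll();
        }
    }

    /** Returns a segment to the pool and wakes one waiting requester. */
    void recycle(byte[] segment) {
        synchronized (available) {
            available.add(segment);
            available.notify();
        }
    }
}

final class BufferPoolSketch {
    public static void main(String[] args) {
        BoundedSegmentPool pool = new BoundedSegmentPool(1, 32 * 1024);
        byte[] first = pool.requestSegment(100);
        // Nothing recycles the first segment, so this request can only time out.
        byte[] second = pool.requestSegment(100);
        System.out.println("first granted: " + (first != null));
        System.out.println("second granted: " + (second != null));
        pool.recycle(first);
        System.out.println("granted after recycle: " + (pool.requestSegment(100) != null));
    }
}
```

In this model the second request succeeds only after recycle() is called. A writer that retries the timed wait in a loop would look stuck for as long as the downstream side never hands a segment back.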
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17158504#comment-17158504 ]

Yun Gao commented on FLINK-10672:
---------------------------------

Hi [~ibzib], very sorry to disturb you. Do you have any updates from your side on this issue? Since TFX / Beam / Flink have all evolved over several versions, does this issue still exist?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148271#comment-17148271 ]

Peidian Li commented on FLINK-10672:
------------------------------------

I'm sorry for the misinformation. Although I found the same stack trace, mine had a different cause. I introduced incremental cleanup in my job, which resulted in slower job processing.
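
To make the slowdown mentioned in this comment concrete, here is a minimal sketch that assumes "incremental cleanup" means an expiry strategy in which every state access also inspects a few stored entries and drops the expired ones. The comment does not give the actual configuration, so this is a plain-Java toy model rather than Flink code; the class IncrementalTtlMap, its fields, and the logical clock passed as `now` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

/** Toy TTL map with incremental cleanup (hypothetical; not a Flink state backend). */
final class IncrementalTtlMap {
    private final Map<String, Long> writeTime = new HashMap<>();
    // Keys waiting to be inspected; always holds exactly the keys of writeTime.
    private final ArrayDeque<String> scanQueue = new ArrayDeque<>();
    private final long ttl;
    private final int cleanupSize;
    long entriesChecked = 0; // extra work caused by cleanup

    IncrementalTtlMap(long ttl, int cleanupSize) {
        this.ttl = ttl;
        this.cleanupSize = cleanupSize;
    }

    /** Stores the key with the given logical timestamp, then runs one cleanup step. */
    void put(String key, long now) {
        if (writeTime.put(key, now) == null) {
            scanQueue.add(key); // first time this key is seen
        }
        cleanupStep(now);
    }

    /** Inspects at most cleanupSize entries and removes the expired ones. */
    private void cleanupStep(long now) {
        int budget = Math.min(cleanupSize, scanQueue.size());
        for (int i = 0; i < budget; i++) {
            String key = scanQueue.poll();
            entriesChecked++;
            if (now - writeTime.get(key) >= ttl) {
                writeTime.remove(key); // expired: drop it
            } else {
                scanQueue.add(key); // still live: inspect again later
            }
        }
    }

    int size() {
        return writeTime.size();
    }
}

final class IncrementalCleanupSketch {
    public static void main(String[] args) {
        IncrementalTtlMap noCleanup = new IncrementalTtlMap(5, 0);
        IncrementalTtlMap withCleanup = new IncrementalTtlMap(5, 10);
        for (int t = 0; t < 100; t++) {
            noCleanup.put("k" + t, t);
            withCleanup.put("k" + t, t);
        }
        System.out.println("entries kept without cleanup: " + noCleanup.size());
        System.out.println("entries kept with cleanup: " + withCleanup.size());
        System.out.println("extra entries checked: " + withCleanup.entriesChecked);
    }
}
```

The trade-off in this model is that state stays small, but every write pays for up to cleanupSize extra inspections. A slower operator of that kind can cause upstream writers to wait for buffers, which would be consistent with seeing the same stack trace for an unrelated reason.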
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147486#comment-17147486 ]

Yun Gao commented on FLINK-10672:
---------------------------------

[~lipeidian], many thanks for reporting. Are you also using Beam on Flink? Could you also attach the detailed program and exceptions? Thanks a lot.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147323#comment-17147323 ]

Peidian Li commented on FLINK-10672:
------------------------------------

I met the same problem with Flink 1.9. Is there any conclusion about this problem?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101381#comment-17101381 ]

Yun Gao commented on FLINK-10672:
---------------------------------

Hi [~ibzib], sorry, the environment was freed by the administrator because of a security problem. I will re-deploy the environment and send you the access method this week.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17097696#comment-17097696 ]

Kyle Weaver commented on FLINK-10672:
-------------------------------------

Hi Yun, sorry I haven't had much time to look at this lately. If you still have the test environment available, you can send info to ib...@apache.org.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070745#comment-17070745 ]

Yun Gao commented on FLINK-10672:

Hi [~ibzib], I have deployed an environment with Beam 2.14 and TFX 0.14 on a publicly accessible machine. Could you provide an email address so that I can send you the IP and account?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065410#comment-17065410 ]

Yun Gao commented on FLINK-10672:

Hi Kyle, I am very sorry, but I still have some problems reproducing the issue.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065414#comment-17065414 ]

Yun Gao commented on FLINK-10672:

Hi [~ibzib], I am very sorry, but I still have some problems reproducing the issue. I will try to find a publicly accessible machine and reproduce the issue there, so that we can all access the environment. I will concentrate on this issue and get the environment prepared in the next few days.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17065155#comment-17065155 ]

Kyle Weaver commented on FLINK-10672:

Hi [~gaoyunhaii], any updates on this?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973391#comment-16973391 ]

venkata subbarao chunduri commented on FLINK-10672:

Thank you, Yun. Yes, the indefinite loop is not an issue. It was an error in my application: one of the subpartitions was not being read by the partition client, and that led to exhaustion of the buffers assigned per partition on the server.
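The failure mode described in this comment (a subpartition that is never read eventually starves the writer of buffers) can be illustrated with a minimal, self-contained sketch. This is not Flink code: `BufferExhaustionSketch`, `emit`, `read`, and `emitsBeforeStall` are hypothetical names, and a `Semaphore` stands in for the fixed number of buffers assigned to a partition. A timed acquire is used so the demo reports the stall instead of hanging.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of a partition with a fixed buffer budget; not Flink's API. */
final class BufferExhaustionSketch {
    private final Semaphore freeBuffers;
    private final List<Queue<String>> subpartitions = new ArrayList<>();

    BufferExhaustionSketch(int numBuffers, int numSubpartitions) {
        freeBuffers = new Semaphore(numBuffers);
        for (int i = 0; i < numSubpartitions; i++) {
            subpartitions.add(new ArrayDeque<>());
        }
    }

    /** Takes one buffer and queues the record; returns false if no buffer frees up in time. */
    boolean emit(int subpartition, String record, long timeoutMillis) throws InterruptedException {
        if (!freeBuffers.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // every buffer is held by data that nobody has consumed
        }
        subpartitions.get(subpartition).add(record);
        return true;
    }

    /** Consumes one record and returns its buffer to the pool; null if the queue is empty. */
    String read(int subpartition) {
        String record = subpartitions.get(subpartition).poll();
        if (record != null) {
            freeBuffers.release();
        }
        return record;
    }

    /** Writes to subpartition 1, which is never read, and counts emits until the writer stalls. */
    static int emitsBeforeStall(int numBuffers) {
        BufferExhaustionSketch partition = new BufferExhaustionSketch(numBuffers, 2);
        int emitted = 0;
        try {
            while (partition.emit(1, "record-" + emitted, 10)) {
                emitted++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println("emits before stall: " + emitsBeforeStall(4));
    }
}
```

In this model the writer stalls after exactly as many records as there are buffers, and a single `read` on the neglected subpartition is enough to let it proceed again.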
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970692#comment-16970692 ]

Yun Gao commented on FLINK-10672:

Hi venkata, thank you very much for the report! From my side, I think that by design the local input channel should not get stuck: when the source starts producing data again, the source thread (which holds the result partition that the local input channel reads data from) will [notify|https://github.com/apache/flink/blob/6322618bb0f1b7942d86cb1b2b7bc55290d9e330/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L202] the gate about the new data, which then unblocks the consumer task thread from the wait() call. Have you also changed this part of the code when incorporating the RDMA functionality?
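The notify/wait hand-off described in this comment can be sketched in plain Java. This is only an illustration under stated assumptions, not Flink's implementation: `GateNotifySketch`, `notifyChannelNonEmpty`, and `waitForChannel` are hypothetical names standing in for the gate and its callbacks. The point is that the thread calling `wait()` is woken by a different thread (the producer), so the consumer does not depend on itself to be unblocked.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for an input gate; not Flink's SingleInputGate. */
final class GateNotifySketch {
    private final Queue<String> channelsWithData = new ArrayDeque<>();

    /** Called from the producer thread when a channel has new data. */
    void notifyChannelNonEmpty(String channel) {
        synchronized (channelsWithData) {
            channelsWithData.add(channel);
            channelsWithData.notifyAll(); // wakes any consumer blocked in waitForChannel()
        }
    }

    /** Called from the consumer thread; blocks until some channel reports data. */
    String waitForChannel() throws InterruptedException {
        synchronized (channelsWithData) {
            while (channelsWithData.isEmpty()) { // loop guards against spurious wakeups
                channelsWithData.wait();
            }
            return channelsWithData.remove();
        }
    }

    /** Starts a blocked consumer, then notifies it from the calling (producer) thread. */
    static String demo() {
        GateNotifySketch gate = new GateNotifySketch();
        String[] result = new String[1];
        Thread consumer = new Thread(() -> {
            try {
                result[0] = gate.waitForChannel();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            Thread.sleep(50); // give the consumer time to block (not required for correctness)
            gate.notifyChannelNonEmpty("local-channel-0");
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the queue is checked under the same monitor that is used for `wait()` and `notifyAll()`, the notification cannot be lost even if the producer runs before the consumer starts waiting.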
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970360#comment-16970360 ]
venkata subbarao chunduri commented on FLINK-10672:
Could this be caused by the local channel? It looks that way to me; I am facing the same issue. It seems the local channel [can go to wait|https://github.com/apache/flink/blob/6322618bb0f1b7942d86cb1b2b7bc55290d9e330/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/SingleInputGate.java#L539] and may never wake up, because data from the [local channel is pulled in the same thread|https://github.com/apache/flink/blob/6322618bb0f1b7942d86cb1b2b7bc55290d9e330/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L104]. Could you confirm?
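To make the suspected failure mode concrete, here is a minimal Java sketch. It is not Flink code: `TinyBufferPool`, `requestSegment` and `recycle` are hypothetical names that only mimic the general shape of a bounded pool guarded by a monitor with a timed wait, similar to the `LocalBufferPool.requestMemorySegment` frame in the stack trace. Assuming the hypothesis above holds, it shows that a request cannot succeed when the only code that would return a segment runs on the waiting thread itself, whereas a separate consumer thread unblocks it.

```java
import java.util.ArrayDeque;

public class LocalChannelStallSketch {

    /** Hypothetical bounded pool for illustration; not Flink's LocalBufferPool. */
    static final class TinyBufferPool {
        private final ArrayDeque<byte[]> available = new ArrayDeque<>();

        TinyBufferPool(int segments) {
            for (int i = 0; i < segments; i++) {
                available.add(new byte[32]);
            }
        }

        /** Waits up to timeoutMillis for a free segment; returns null on timeout. */
        byte[] requestSegment(long timeoutMillis) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            synchronized (available) {
                while (available.isEmpty()) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        return null; // gave up: nobody recycled a segment in time
                    }
                    available.wait(remaining);
                }
                return available.poll();
            }
        }

        /** Returns a segment to the pool and wakes up any waiting requester. */
        void recycle(byte[] segment) {
            synchronized (available) {
                available.add(segment);
                available.notifyAll();
            }
        }
    }

    /** Producer and consumer share one thread: nothing can recycle while it waits. */
    public static boolean sameThreadRequestSucceeds() throws InterruptedException {
        TinyBufferPool pool = new TinyBufferPool(1);
        byte[] held = pool.requestSegment(100);   // takes the only segment
        byte[] second = pool.requestSegment(200); // the would-be recycler is this thread
        pool.recycle(held);                       // only reached after the wait gave up
        return second != null;
    }

    /** A separate consumer thread recycles the segment, so the waiter wakes up. */
    public static boolean separateThreadRequestSucceeds() throws InterruptedException {
        TinyBufferPool pool = new TinyBufferPool(1);
        byte[] held = pool.requestSegment(100);
        Thread consumer = new Thread(() -> pool.recycle(held));
        consumer.start();
        byte[] second = pool.requestSegment(5000);
        consumer.join();
        return second != null;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("same thread: " + sameThreadRequestSucceeds());
        System.out.println("separate thread: " + separateThreadRequestSucceeds());
    }
}
```

In the sketch the same-thread request times out and reports `false`, while the request served by a separate consumer thread reports `true`. The real pool retries in a loop instead of giving up, which would match a thread that stays in `TIMED_WAITING` indefinitely.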
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953107#comment-16953107 ]
Kyle Weaver commented on FLINK-10672:
This bug was initially filed when jobs started getting stuck with parallelism=1. That was fixed by setting BATCH_FORCED mode. Later, the problem happened again with BATCH_FORCED and parallelism=12. This happened for me on TFX 0.13 and 0.14, but I imagine it would happen with other TFX versions as well. Let me know if you are still unable to reproduce the issue.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953016#comment-16953016 ]
Yun Gao commented on FLINK-10672:
Hi Kyle, thanks very much for the input! I still have two questions. First, the attached jstack logs seem to come from a run with parallelism equal to 1, so does the job also get stuck with parallelism = 1? Second, could you also tell me the versions of TFX/Beam/Tensorflow you used? The job I ran seems to be different from the job corresponding to the attached jstack logs.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952809#comment-16952809 ]
Kyle Weaver commented on FLINK-10672:
You will need to set the task parallelism to reproduce the stuck job. For me it happens when parallelism=12, which is also the number of cores on my machine (this might or might not be a coincidence). Anyway, you can experiment with different parallelism settings and see what happens.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951507#comment-16951507 ]
Yun Gao commented on FLINK-10672:
Hi [~angoenka], very sorry for the long delay; I ran into some package conflict problems. I have run the Chicago taxi test with TFX 0.14, Beam 2.14 and Tensorflow 1.14. However, the job seems to be different from the attached one: as [^0.14_all_jobs.jpg] shows, there are 4 jobs in total and the largest one has only 145 tasks. Could you therefore also tell me the version or branch ID of TFX/Beam/Tensorflow? Also, do I need to change any configuration to reproduce the stuck job (for example, the default dataset)? Thanks very much for the explanation. :)
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946410#comment-16946410 ]
Yun Gao commented on FLINK-10672:
Hi [~angoenka], thanks very much for the input, and very sorry for responding so late; I was away for the National Day holiday last week. I will reproduce the problem as soon as possible.
> Task stuck while writing output to flink > > > Key: FLINK-10672 > URL: https://issues.apache.org/jira/browse/FLINK-10672 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.5.4 > Environment: OS: Debuan rodente 4.17 > Flink version: 1.5.4 > ||Key||Value|| > |jobmanager.heap.mb|1024| > |jobmanager.rpc.address|localhost| > |jobmanager.rpc.port|6123| > |metrics.reporter.jmx.class|org.apache.flink.metrics.jmx.JMXReporter| > |metrics.reporter.jmx.port|9250-9260| > |metrics.reporters|jmx| > |parallelism.default|1| > |rest.port|8081| > |taskmanager.heap.mb|1024| > |taskmanager.numberOfTaskSlots|1| > |web.tmpdir|/tmp/flink-web-bdb73d6c-5b9e-47b5-9ebf-eed0a7c82c26| > > h1. Overview > ||Data Port||All Slots||Free Slots||CPU Cores||Physical Memory||JVM Heap > Size||Flink Managed Memory|| > |43501|1|0|12|62.9 GB|922 MB|642 MB| > h1. Memory > h2. JVM (Heap/Non-Heap) > ||Type||Committed||Used||Maximum|| > |Heap|922 MB|575 MB|922 MB| > |Non-Heap|68.8 MB|64.3 MB|-1 B| > |Total|991 MB|639 MB|922 MB| > h2. Outside JVM > ||Type||Count||Used||Capacity|| > |Direct|3,292|105 MB|105 MB| > |Mapped|0|0 B|0 B| > h1. Network > h2. Memory Segments > ||Type||Count|| > |Available|3,194| > |Total|3,278| > h1. 
Garbage Collection > ||Collector||Count||Time|| > |G1_Young_Generation|13|336| > |G1_Old_Generation|1|21| >Reporter: Ankur Goenka >Assignee: Yun Gao >Priority: Major > Labels: beam > Attachments: 1uruvakHxBu.png, 3aDKQ24WvKk.png, Po89UGDn58V.png, > jmx_dump.json, jmx_dump_detailed.json, jstack_129827.log, jstack_163822.log, > jstack_66985.log > > > I am running a fairly complex pipleline with 200+ task. > The pipeline works fine with small data (order of 10kb input) but gets stuck > with a slightly larger data (300kb input). > > The task gets stuck while writing the output toFlink, more specifically it > gets stuck while requesting memory segment in local buffer pool. The Task > manager UI shows that it has enough memory and memory segments to work with. > The relevant stack trace is > {quote}"grpc-default-executor-0" #138 daemon prio=5 os_prio=0 > tid=0x7fedb0163800 nid=0x30b7f in Object.wait() [0x7fedb4f9] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at (C/C++) 0x7fef201c7dae (Unknown Source) > at (C/C++) 0x7fef1f2aea07 (Unknown Source) > at (C/C++) 0x7fef1f241cd3 (Unknown Source) > at java.lang.Object.wait(Native Method) > - waiting on <0xf6d56450> (a java.util.ArrayDeque) > at > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247) > - locked <0xf6d56450> (a java.util.ArrayDeque) > at > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107) > at > org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65) > at > 
org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35) > at > org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:42) > at > org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:26) > at > org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80) > at > org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35) > at > org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction$MyDataReceiver.accept(FlinkExecutableStageFunction.java:230) > - locked <0xf6a60bd0> (a java.lang.Object) > at > org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:81) > at > org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:32) > at > org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:139) > at >
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942291#comment-16942291 ]
Ankur Goenka commented on FLINK-10672:
I don't have the original graph, as the code has changed. But the problem still exists when running the following steps with parallelism 12.
Set up Flink using [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/setup_beam_on_flink.sh]
Run the pipeline:
# [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/tfdv_analyze_and_validate_portable_beam.sh]
# [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_portable_beam.sh]
You can view the graph by just running it with parallelism 1.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941796#comment-16941796 ]
Maximilian Michels commented on FLINK-10672:
[~angoenka] Do you think you could provide that, or at least the steps to reproduce the behavior?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940226#comment-16940226 ]

Yun Gao commented on FLINK-10672:

Hi [~mxm], thanks very much for the explanation. Would it also be possible to provide the topology of the underlying Flink job and the status of each job vertex when the job gets stuck? With this information, we may be able to determine whether there is a cyclic dependency among the tasks.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939220#comment-16939220 ]

Maximilian Michels commented on FLINK-10672:

Yes, one possible explanation is that the tasks get stuck reading elements from gRPC. The gRPC connections are used to transfer data and to control execution between Flink and the SDK Harness (where language-specific UDF code runs).
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939094#comment-16939094 ]

Yun Gao commented on FLINK-10672:

Hi [~ibzib], sorry, I am not very familiar with Beam, but I see from the [stack log|[^jstack_66985.log]] that there is a gRPC server thread that is also writing data. What is the relationship between the tasks and gRPC? Do the tasks also communicate over gRPC, and could gRPC introduce a cycle between tasks?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938443#comment-16938443 ]

zhijiang commented on FLINK-10672:

I think [~gaoyunhaii] will continue helping to trace this issue.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938147#comment-16938147 ]

Kyle Weaver commented on FLINK-10672:

[~zjwang] I have tested the pipeline on Flink 1.8.1 and Flink 1.9.0 and the issue is still present (even with BATCH_FORCED). Strangely, it didn't happen when I had parallelism set to 11, but did happen when parallelism was 12. However, it is solved if I increase the taskManager memory, as Ankur indicated earlier.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897491#comment-16897491 ]

zhijiang commented on FLINK-10672:

Could you also test whether this happens on release-1.8.1 or the current master branch? We previously fixed some deadlock issues for both pipelined and blocking partitions, which could also cause problems similar to the one you mentioned.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897101#comment-16897101 ]

Maximilian Michels commented on FLINK-10672:

[~zjwang] From the testing that [~angoenka] did, it seemed to be more involved than just backpressure from an upstream task, although this is how it manifests. The attached stack traces do not provide enough evidence to track down the problem. As [~ibzib] reported, the {{BATCH_FORCED}} execution mode did not get rid of the problem. We want to identify the stuck task to make sure this is not a Beam-related problem.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887642#comment-16887642 ]

zhijiang commented on FLINK-10672:

[~ibzib] I think this is a case of back pressure. The producer is blocked in `requestMemorySegment` because the partition data it produced cannot be consumed by the downstream tasks. To find the root cause, the key is to locate the downstream task that triggers the back pressure.

For example, in a topology A-->B-->C-->D, if vertex A has been blocked in `requestMemorySegment` for a long time, we can check the state of the next downstream vertex. If vertex B is also blocked, we continue tracing downstream until we reach a vertex that is no longer blocked, say vertex C in this case. Then we check which parallel subtask of vertex C causes the blockage. Such a task typically has a very full input queue and a nearly or completely empty output queue, and the relevant metrics can help identify it. Once that task is found, we can inspect its stack to confirm where it is stuck. That may reveal the root cause, or you can share further findings with us.
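The tracing procedure described in this comment could be sketched roughly as follows. This is only an illustration in Python, not Flink code: the `Subtask` record, its `blocked_on_request` flag, the two queue-usage fields, and the 0.9 / 0.1 thresholds are all hypothetical stand-ins for what one would read from thread dumps and from the input/output buffer metrics. It also assumes a simple linear chain of vertices, which a real job graph need not be.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Subtask:
    """Hypothetical snapshot of one parallel subtask (not a Flink class)."""
    vertex: str
    index: int
    blocked_on_request: bool  # thread is waiting in requestMemorySegment
    in_queue_usage: float     # 0.0 (empty) .. 1.0 (full), assumed metric
    out_queue_usage: float    # 0.0 (empty) .. 1.0 (full), assumed metric


def first_unblocked_vertex(chain: List[str],
                           subtasks: Dict[str, List[Subtask]],
                           start: str) -> Optional[str]:
    """Walk downstream from `start` and return the first vertex in which
    no subtask is blocked while requesting an output buffer."""
    for vertex in chain[chain.index(start) + 1:]:
        if not any(s.blocked_on_request for s in subtasks[vertex]):
            return vertex
    return None


def suspect_subtasks(tasks: List[Subtask],
                     high: float = 0.9, low: float = 0.1) -> List[Subtask]:
    """Subtasks with a nearly full input queue and a nearly empty output
    queue. The thresholds are arbitrary illustrative values."""
    return [s for s in tasks
            if s.in_queue_usage >= high and s.out_queue_usage <= low]


# Made-up state for the A-->B-->C-->D example from the comment.
chain = ["A", "B", "C", "D"]
subtasks = {
    "A": [Subtask("A", 0, True, 0.0, 1.0)],
    "B": [Subtask("B", 0, True, 1.0, 1.0)],
    "C": [Subtask("C", 0, False, 0.2, 0.0), Subtask("C", 1, False, 1.0, 0.0)],
    "D": [Subtask("D", 0, False, 0.0, 0.0)],
}

culprit_vertex = first_unblocked_vertex(chain, subtasks, "A")
suspects = suspect_subtasks(subtasks[culprit_vertex])
print(culprit_vertex, [(s.vertex, s.index) for s in suspects])
```

The subtasks it flags are only candidates; their stacks still need to be inspected to see where they are actually stuck.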
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887283#comment-16887283 ]

Kyle Weaver commented on FLINK-10672:

[~zjwang] I was on Flink 1.5.6 before. I upgraded to 1.8.1 and it looks like this is still a problem.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887201#comment-16887201 ]

zhijiang commented on FLINK-10672:

[~ibzib] Which fix do you mean? Also, which Flink version were you using when the issue occurred? I have not looked through this issue yet; I will check the jstack output to confirm whether it has already been solved.
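For checking the attached jstack dumps, a small script along these lines could list the threads parked in `LocalBufferPool.requestMemorySegment`. It is a sketch that assumes the usual jstack layout (each thread section starts with the quoted thread name, and sections are separated by blank lines). The sample dump is abridged from the trace quoted in this issue, and its second thread is made up for contrast.

```python
import re
from typing import List

# Frame taken from the stack trace quoted in this issue.
BLOCKING_FRAME = "LocalBufferPool.requestMemorySegment"


def blocked_threads(dump: str, frame: str = BLOCKING_FRAME) -> List[str]:
    """Return the names of threads whose stack contains `frame`."""
    names = []
    # Thread sections in a jstack dump are separated by blank lines.
    for section in re.split(r"\n\s*\n", dump):
        header = re.match(r'\s*"([^"]+)"', section)
        if header and frame in section:
            names.append(header.group(1))
    return names


# Abridged sample; the "main" thread is invented for illustration.
sample = '''"grpc-default-executor-0" #138 daemon prio=5 os_prio=0
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)

"main" #1 prio=5 os_prio=0
   java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.sleep(Native Method)
'''

print(blocked_threads(sample))
```

Against a real dump, the text of one of the attached `jstack_*.log` files would be passed in place of `sample`.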
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886547#comment-16886547 ]

Kyle Weaver commented on FLINK-10672:

[~zjwang] [~pnowojski] I saw you had made previous fixes to this code. Do you know why this might be happening?
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882538#comment-16882538 ]

Kyle Weaver commented on FLINK-10672:

This is still an issue with the TFX preprocess Beam pipeline when the parallelism and number of task slots are increased, even with the execution mode set to BATCH_FORCED.

||Key||Value||
|jobmanager.heap.mb|1024|
|jobmanager.rpc.address|localhost|
|jobmanager.rpc.port|6123|
|parallelism.default|12|
|rest.port|8081|
|taskmanager.heap.mb|1024|
|taskmanager.numberOfTaskSlots|12|
|web.tmpdir|/tmp/flink-web-794ab85f-d04a-4b32-a535-eb00eeeadb98|

h1. Overview
||Data Port||All Slots||Free Slots||CPU Cores||Physical Memory||JVM Heap Size||Flink Managed Memory||
|41199|12|0|12|62.6 GB|922 MB|639 MB|
h1. Memory
h2. JVM (Heap/Non-Heap)
||Type||Committed||Used||Maximum||
|Heap|922 MB|497 MB|922 MB|
|Non-Heap|70.5 MB|65.9 MB|-1 B|
|Total|992 MB|563 MB|922 MB|
h2. Outside JVM
||Type||Count||Used||Capacity||
|Direct|3,286|104 MB|104 MB|
|Mapped|0|0 B|0 B|
h1. Network
h2. Memory Segments
||Type||Count||
|Available|3,203|
|Total|3,278|
h1. Garbage Collection
||Collector||Count||Time||
|G1_Young_Generation|27|871|
|G1_Old_Generation|1|27|
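As a quick sanity check on the Memory Segments figures in the comment above, the arithmetic can be worked through as below. The segment counts are copied from the tables. The 32 KiB segment size is an assumption: it is Flink's default, and the configuration shown does not override it.

```python
# Figures copied from the task manager UI tables in the comment above.
total_segments = 3278
available_segments = 3203

# Assumed: Flink's default memory segment size of 32 KiB.
segment_size_bytes = 32 * 1024

in_use = total_segments - available_segments
in_use_fraction = in_use / total_segments
pool_mib = total_segments * segment_size_bytes / (1024 * 1024)

print(f"{in_use} of {total_segments} segments in use ({in_use_fraction:.1%})")
print(f"network buffer pool is roughly {pool_mib:.1f} MiB")
```

Only 75 segments (about 2.3%) are in use, and the pool size of roughly 102 MiB is in line with the 104 MB of direct memory reported. This suggests the global pool is far from exhausted while the task waits, which matches the reporter's observation that the UI shows enough memory segments.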
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724901#comment-16724901 ]

Maximilian Michels commented on FLINK-10672:

No, [~thw] just set the Fix Version to empty because this is not fixed in 1.5.6.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724576#comment-16724576 ] Ankur Goenka commented on FLINK-10672: -- Is this fixed in 1.5.6? > Task stuck while writing output to flink > > > Key: FLINK-10672 > URL: https://issues.apache.org/jira/browse/FLINK-10672 > Project: Flink > Issue Type: Bug >Affects Versions: 1.5.4 > Environment: OS: Debuan rodente 4.17 > Flink version: 1.5.4 > ||Key||Value|| > |jobmanager.heap.mb|1024| > |jobmanager.rpc.address|localhost| > |jobmanager.rpc.port|6123| > |metrics.reporter.jmx.class|org.apache.flink.metrics.jmx.JMXReporter| > |metrics.reporter.jmx.port|9250-9260| > |metrics.reporters|jmx| > |parallelism.default|1| > |rest.port|8081| > |taskmanager.heap.mb|1024| > |taskmanager.numberOfTaskSlots|1| > |web.tmpdir|/tmp/flink-web-bdb73d6c-5b9e-47b5-9ebf-eed0a7c82c26| > > h1. Overview > ||Data Port||All Slots||Free Slots||CPU Cores||Physical Memory||JVM Heap > Size||Flink Managed Memory|| > |43501|1|0|12|62.9 GB|922 MB|642 MB| > h1. Memory > h2. JVM (Heap/Non-Heap) > ||Type||Committed||Used||Maximum|| > |Heap|922 MB|575 MB|922 MB| > |Non-Heap|68.8 MB|64.3 MB|-1 B| > |Total|991 MB|639 MB|922 MB| > h2. Outside JVM > ||Type||Count||Used||Capacity|| > |Direct|3,292|105 MB|105 MB| > |Mapped|0|0 B|0 B| > h1. Network > h2. Memory Segments > ||Type||Count|| > |Available|3,194| > |Total|3,278| > h1. Garbage Collection > ||Collector||Count||Time|| > |G1_Young_Generation|13|336| > |G1_Old_Generation|1|21| >Reporter: Ankur Goenka >Priority: Major > Labels: beam > Attachments: 1uruvakHxBu.png, 3aDKQ24WvKk.png, Po89UGDn58V.png, > jmx_dump.json, jmx_dump_detailed.json, jstack_129827.log, jstack_163822.log, > jstack_66985.log > > > I am running a fairly complex pipleline with 200+ task. > The pipeline works fine with small data (order of 10kb input) but gets stuck > with a slightly larger data (300kb input). 
>
> The task gets stuck while writing the output to Flink; more specifically, it
> gets stuck while requesting a memory segment in the local buffer pool. The Task
> Manager UI shows that it has enough memory and memory segments to work with.
> The relevant stack trace is
> {quote}"grpc-default-executor-0" #138 daemon prio=5 os_prio=0 tid=0x7fedb0163800 nid=0x30b7f in Object.wait() [0x7fedb4f9]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at (C/C++) 0x7fef201c7dae (Unknown Source)
> at (C/C++) 0x7fef1f2aea07 (Unknown Source)
> at (C/C++) 0x7fef1f241cd3 (Unknown Source)
> at java.lang.Object.wait(Native Method)
> - waiting on <0xf6d56450> (a java.util.ArrayDeque)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:247)
> - locked <0xf6d56450> (a java.util.ArrayDeque)
> at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:204)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:213)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:144)
> at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
> at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
> at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:42)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStagePruningFunction.flatMap(FlinkExecutableStagePruningFunction.java:26)
> at org.apache.flink.runtime.operators.chaining.ChainedFlatMapDriver.collect(ChainedFlatMapDriver.java:80)
> at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
> at org.apache.beam.runners.flink.translation.functions.FlinkExecutableStageFunction$MyDataReceiver.accept(FlinkExecutableStageFunction.java:230)
> - locked <0xf6a60bd0> (a java.lang.Object)
> at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:81)
> at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:32)
> at org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:139)
> at org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:125)
> at
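For readers unfamiliar with this kind of thread dump, the following is a minimal sketch of how a thread ends up reported as TIMED_WAITING while both "waiting on" and "locked" name the same java.util.ArrayDeque. It is not Flink's LocalBufferPool code: the class `BufferPoolWaitSketch`, its method names, the 2-second wait timeout, and the segment size are all assumptions made for illustration.

```java
import java.util.ArrayDeque;

// Hypothetical stand-in for a buffer pool; not the Flink implementation.
class BufferPoolWaitSketch {
    // The queue of free segments doubles as the monitor, like the ArrayDeque in the dump.
    private final ArrayDeque<byte[]> availableSegments = new ArrayDeque<>();

    /** Blocks until a segment is free, waking up periodically to re-check. */
    byte[] requestSegmentBlocking() throws InterruptedException {
        synchronized (availableSegments) {
            while (availableSegments.isEmpty()) {
                // wait(timeout) releases the monitor; the thread shows up as TIMED_WAITING.
                availableSegments.wait(2000); // timeout value is an assumption
            }
            return availableSegments.poll();
        }
    }

    /** Returns a segment to the pool and wakes one waiting requester. */
    void recycle(byte[] segment) {
        synchronized (availableSegments) {
            availableSegments.add(segment);
            availableSegments.notify();
        }
    }

    /** Starts a requester against an empty pool and reports the state it parks in. */
    static Thread.State observeBlockedRequester() {
        BufferPoolWaitSketch pool = new BufferPoolWaitSketch();
        Thread requester = new Thread(() -> {
            try {
                pool.requestSegmentBlocking();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "requester");
        requester.setDaemon(true);
        requester.start();
        try {
            Thread.State state = requester.getState();
            long deadline = System.currentTimeMillis() + 5_000;
            // Poll until the requester has reached wait(timeout), or give up.
            while (state != Thread.State.TIMED_WAITING
                    && System.currentTimeMillis() < deadline) {
                Thread.sleep(10);
                state = requester.getState();
            }
            pool.recycle(new byte[1024]); // unblock the requester again
            requester.join(5_000);
            return state;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing the requester", e);
        }
    }
}
```

In this sketch the requester only makes progress once some other thread calls `recycle`. If nothing ever returns a segment, it stays in the timed wait indefinitely, which is the symptom the dump shows.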
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671319#comment-16671319 ]

Maximilian Michels commented on FLINK-10672:

Thanks for investigating the matter [~angoenka]. This seems like an issue with the deadlock detection code. It's not a blocker for us anymore because we have the {{BATCH_FORCED}} workaround. We will continue to investigate the pipelined execution issue. The TFX pipeline makes heavy use of broadcast variables and there is a suspicion these might not be scheduled correctly under certain conditions.
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16669502#comment-16669502 ]

Ankur Goenka commented on FLINK-10672:

Investigation notes:
 * The pipeline seems to be waiting on writing data. I verified back pressure in a stuck task: the consuming task was waiting on data while the producing task was waiting for a memorySegment to write the data.
 * Many of the broadcasting tasks were in the created state and were not scheduled.
 * Using only the BATCH execution mode did not help.
 * Using the BATCH_FORCED execution mode makes the pipeline work. This suggests that we might not be identifying a breakPipeline correctly.
 * The pipeline also seems to work if we increase the TaskManager memory. This might be because, with more memory, the pipelined buffers can hold all the output in memory before the consuming task starts.
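The reasoning in the investigation notes can be illustrated with a small deterministic model. This is a sketch under stated assumptions, not Flink code: `ExchangeModeSketch` and `emittedBeforeStall` are hypothetical names, one record is assumed to occupy one buffer, and the consumer is assumed never to start while the producer runs (as with the unscheduled tasks described above).

```java
import java.util.ArrayDeque;

// Hypothetical model of a producer writing into a bounded buffer pool.
class ExchangeModeSketch {
    /**
     * Counts how many of {@code records} a producer can emit before it would
     * have to wait for a free buffer, given that no consumer is running.
     */
    static int emittedBeforeStall(int records, int poolSize, boolean blockingExchange) {
        ArrayDeque<Integer> freeBuffers = new ArrayDeque<>();
        for (int i = 0; i < poolSize; i++) {
            freeBuffers.add(i);
        }
        int emitted = 0;
        for (int r = 0; r < records; r++) {
            Integer buffer = freeBuffers.poll();
            if (buffer == null) {
                break; // a real producer would block here waiting for a segment
            }
            if (blockingExchange) {
                // Blocking exchange (modeled): the result is materialized elsewhere,
                // so the buffer goes straight back to the pool.
                freeBuffers.add(buffer);
            }
            // Pipelined exchange (modeled): the buffer stays in flight until a
            // consumer reads it, and in this model no consumer is running.
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(emittedBeforeStall(3, 8, false));   // 3: small input fits
        System.out.println(emittedBeforeStall(30, 8, false));  // 8: producer stalls
        System.out.println(emittedBeforeStall(30, 8, true));   // 30: blocking exchange
        System.out.println(emittedBeforeStall(30, 64, false)); // 30: larger pool
    }
}
```

The model reproduces the three observations qualitatively: small inputs fit in the pool, larger inputs stall a pipelined producer, and either a blocking exchange or a larger pool lets the producer finish. It says nothing about why the consuming tasks were not scheduled in the real job.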
[jira] [Commented] (FLINK-10672) Task stuck while writing output to flink
[ https://issues.apache.org/jira/browse/FLINK-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662937#comment-16662937 ]

Ankur Goenka commented on FLINK-10672:

cc: [~mxm] [~robertwb]