[ 
https://issues.apache.org/jira/browse/FLINK-31036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689584#comment-17689584
 ] 

Matthias Pohl commented on FLINK-31036:
---------------------------------------

> FLINK-26803 added a TM-level lock that is held while a task is 
> initializing, so we see many tasks waiting for locks. I want to look at the 
> logs of the failed tests to analyze what those tasks are doing and what is 
> wrong with this test.

I see. But I still don't understand why the current stacktrace isn't enough 
for that analysis, given that we aren't adding more logs and are fine with 
what the current logs reveal.
The reason I am asking is that we might want to come up with a strategy in 
case the issue cannot be resolved before the RC creation starts (because it 
might not appear that frequently). Reverting FLINK-26803 at that point sounds 
reasonable, considering that it's "only" an improvement. WDYT?

> StateCheckpointedITCase timed out due to deadlock
> -------------------------------------------------
>
>                 Key: FLINK-31036
>                 URL: https://issues.apache.org/jira/browse/FLINK-31036
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.17.0
>            Reporter: Matthias Pohl
>            Assignee: Rui Fan
>            Priority: Blocker
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46023&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10608
> {code}
> "Legacy Source Thread - Source: Custom Source -> Filter (6/12)#69980" #13718026 prio=5 os_prio=0 tid=0x00007f05f44f0800 nid=0x128157 waiting on condition [0x00007f059feef000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000f0a974e8> (a java.util.concurrent.CompletableFuture$Signaller)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>       at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>       at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegmentBlocking(LocalBufferPool.java:384)
>       at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:356)
>       at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewBufferBuilderFromPool(BufferWritingResultPartition.java:414)
>       at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.requestNewUnicastBufferBuilder(BufferWritingResultPartition.java:390)
>       at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.appendUnicastDataForRecordContinuation(BufferWritingResultPartition.java:328)
>       at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.emitRecord(BufferWritingResultPartition.java:161)
>       at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:107)
>       at org.apache.flink.runtime.io.network.api.writer.ChannelSelectorRecordWriter.emit(ChannelSelectorRecordWriter.java:55)
>       at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:105)
>       at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
>       at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
>       at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:59)
>       at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:31)
>       at org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:39)
>       at org.apache.flink.streaming.runtime.io.RecordProcessorUtils$$Lambda$1311/1256184070.accept(Unknown Source)
>       at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
>       at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
>       at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
>       at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:418)
>       at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:513)
>       - locked <0x00000000d55035c0> (a java.lang.Object)
>       at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:103)
>       at org.apache.flink.test.checkpointing.StateCheckpointedITCase$StringGeneratingSourceFunction.run(StateCheckpointedITCase.java:178)
>       - locked <0x00000000d55035c0> (a java.lang.Object)
>       at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
>       at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:67)
>       at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:333)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
