[ 
https://issues.apache.org/jira/browse/FLINK-17194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101125#comment-17101125
 ] 

Gary Yao commented on FLINK-17194:
----------------------------------

I am able to reliably reproduce this issue by setting {{akka.ask.timeout}} to 
{{5s}}, which should be still a generous timeout for a local Flink cluster.
The problem seems to be that sometimes releasing a partition is slow (due to 
file io), and this blocks the TaskExecutor's main thread. I have attached an 
example below. 


{noformat}
Run TPC-DS query 39b ...
{noformat}


{noformat}
2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.ResultPartition [] - 
CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, 
inv_quantity_on_hand) -> 
SourceConversion(table=[default_catalog.default_database.inventory, source: 
[CsvTableSource(read fields: inv_date_sk, inv_item_sk
, inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, 
inv_warehouse_sk, inv_quantity_on_hand]) (3/4) 
(76d0879cdd3bdb851b44f8dbb5b30999): Releasing ResultPartition 
feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 
1 subpartitions].
2020-05-06 19:13:08,445 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - 
Close 
org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0
2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition [] - 
Closed 
org.apache.flink.runtime.io.network.partition.FileChannelBoundedData@201865e0
2020-05-06 19:13:17,771 flink-akka.actor.default-dispatcher-35 DEBUG 
org.apache.flink.runtime.io.network.partition.ResultPartition [] - 
CsvTableSource(read fields: inv_date_sk, inv_item_sk, inv_warehouse_sk, 
inv_quantity_on_hand) -> 
SourceConversion(table=[default_catalog.default_database.inventory, source: 
[CsvTableSource(read fields: inv_date_sk, inv_item_sk
, inv_warehouse_sk, inv_quantity_on_hand)]], fields=[inv_date_sk, inv_item_sk, 
inv_warehouse_sk, inv_quantity_on_hand]) (3/4) 
(76d0879cdd3bdb851b44f8dbb5b30999): Released ResultPartition 
feb9262b7de50f164c061797ec01ba64#2@76d0879cdd3bdb851b44f8dbb5b30999 [BLOCKING, 
1 subpartitions].
{noformat}

Note that it takes more than 9 seconds to release the partition. I have added 
additional debug prints.

I have also managed to invoke jstack at the right time on the TM process. The 
main thread is blocked on deleting {{FileChannelBoundedData#filePath}}.

{noformat}
2020-05-06T19:13:12.4383402Z "flink-akka.actor.default-dispatcher-35" #3555 
prio=5 os_prio=0 tid=0x00007f7fcc071000 nid=0x1f3f9 runnable 
[0x00007f7fd302c000]
2020-05-06T19:13:12.4383983Z    java.lang.Thread.State: RUNNABLE
2020-05-06T19:13:12.4384519Z    at 
sun.nio.fs.UnixNativeDispatcher.unlink0(Native Method)
2020-05-06T19:13:12.4384971Z    at 
sun.nio.fs.UnixNativeDispatcher.unlink(UnixNativeDispatcher.java:146)
2020-05-06T19:13:12.4385465Z    at 
sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:231)
2020-05-06T19:13:12.4386000Z    at 
sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
2020-05-06T19:13:12.4386458Z    at java.nio.file.Files.delete(Files.java:1126)
2020-05-06T19:13:12.4386968Z    at 
org.apache.flink.runtime.io.network.partition.FileChannelBoundedData.close(FileChannelBoundedData.java:93)
2020-05-06T19:13:12.4388088Z    at 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.checkReaderReferencesAndDispose(BoundedBlockingSubpartition.java:247)
2020-05-06T19:13:12.4388765Z    at 
org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.release(BoundedBlockingSubpartition.java:208)
2020-05-06T19:13:12.4389444Z    - locked <0x00000000ff836d78> (a 
java.lang.Object)
2020-05-06T19:13:12.4389905Z    at 
org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:290)
2020-05-06T19:13:12.4390481Z    at 
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartition(ResultPartitionManager.java:80)
2020-05-06T19:13:12.4391118Z    - locked <0x000000009d452b90> (a 
java.util.HashMap)
2020-05-06T19:13:12.4391597Z    at 
org.apache.flink.runtime.io.network.NettyShuffleEnvironment.releasePartitionsLocally(NettyShuffleEnvironment.java:153)
2020-05-06T19:13:12.4392267Z    at 
org.apache.flink.runtime.io.network.partition.TaskExecutorPartitionTrackerImpl.stopTrackingAndReleaseJobPartitions(TaskExecutorPartitionTrackerImpl.java:62)
2020-05-06T19:13:12.4392914Z    at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.releaseOrPromotePartitions(TaskExecutor.java:776)
2020-05-06T19:13:12.4393366Z    at 
sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
2020-05-06T19:13:12.4393813Z    at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-05-06T19:13:12.4394257Z    at 
java.lang.reflect.Method.invoke(Method.java:498)
2020-05-06T19:13:12.4394693Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
2020-05-06T19:13:12.4395202Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199)
2020-05-06T19:13:12.4395686Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
2020-05-06T19:13:12.4396165Z    at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$$Lambda$72/775020844.apply(Unknown
 Source)
2020-05-06T19:13:12.4396606Z    at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
2020-05-06T19:13:12.4397015Z    at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
2020-05-06T19:13:12.4397447Z    at 
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
2020-05-06T19:13:12.4397874Z    at 
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
2020-05-06T19:13:12.4398414Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
2020-05-06T19:13:12.4398879Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399321Z    at 
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
2020-05-06T19:13:12.4399737Z    at 
akka.actor.Actor$class.aroundReceive(Actor.scala:517)
2020-05-06T19:13:12.4400138Z    at 
akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
2020-05-06T19:13:12.4400552Z    at 
akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
2020-05-06T19:13:12.4400930Z    at 
akka.actor.ActorCell.invoke(ActorCell.scala:561)
2020-05-06T19:13:12.4401390Z    at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
2020-05-06T19:13:12.4401763Z    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
2020-05-06T19:13:12.4402135Z    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
2020-05-06T19:13:12.4402540Z    at 
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
2020-05-06T19:13:12.4402984Z    at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
2020-05-06T19:13:12.4403448Z    at 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
2020-05-06T19:13:12.4404096Z    at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{noformat}

Any idea why this is happening now?

cc: [~zjwang] [~trohrmann]



> TPC-DS end-to-end test fails due to missing execution attempt
> -------------------------------------------------------------
>
>                 Key: FLINK-17194
>                 URL: https://issues.apache.org/jira/browse/FLINK-17194
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.11.0
>            Reporter: Chesnay Schepler
>            Assignee: Gary Yao
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7567&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> {code:java}
> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution 
> attempt d6bef26867c04f1c94903b06b60ec55f was not found.
>       at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:389)
>  ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to