[
https://issues.apache.org/jira/browse/FLINK-14429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16956159#comment-16956159
]
liupengcheng commented on FLINK-14429:
--------------------------------------
[~trohrmann] In the current implementation, the client polls for the execution
result; if the job failed, the error message is passed back to the client
through that result, so the client knows it is a failure and throws an
exception containing the error message.
So from that exception we can learn the status and the diagnostics when shutting
down the cluster. I don't agree that this change is a hack; I think it's a
natural fix for the current implementation.
Also, if there is already a better design or refactoring in mind, please let me
know the details. Thanks!
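To make the flow above concrete, here is a minimal sketch assuming Flink's
JobResult API; pollJobResult and the surrounding class are hypothetical
stand-ins for the client's actual polling loop, not the real client code:
{code:java}
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.jobmaster.JobResult;

public class ClientPollingSketch {

    /**
     * Polls until the job reaches a globally terminal state, then converts the
     * result. For a failed job, the JobResult carries the serialized failure
     * cause, and toJobExecutionResult rethrows it as a JobExecutionException;
     * that exception is how the client learns the final status and diagnostics.
     */
    JobExecutionResult waitForJob(JobID jobId, ClassLoader userCodeClassLoader) throws Exception {
        JobResult result = pollJobResult(jobId); // hypothetical polling helper
        return result.toJobExecutionResult(userCodeClassLoader);
    }

    private JobResult pollJobResult(JobID jobId) {
        throw new UnsupportedOperationException("placeholder for the real polling loop");
    }
}
{code}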
> Wrong app final status when running batch job on yarn with non-detached mode
> ----------------------------------------------------------------------------
>
> Key: FLINK-14429
> URL: https://issues.apache.org/jira/browse/FLINK-14429
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.9.0
> Reporter: liupengcheng
> Priority: Minor
> Labels: pull-request-available
> Attachments: image-2019-10-17-16-47-47-038.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Recently, we found that the final application status is wrong when an
> application fails while running a batch job on YARN in non-detached mode: it
> reports SUCCEEDED, but FAILED is what we expect.
> !image-2019-10-17-16-47-47-038.png!
>
> However, the logs and the client reported an error and the job failed (caused by an OOM):
> {code:java}
> 2019-10-10 14:36:21,797 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job TeraSort (d82cbfaae905c695597083b1476e51b8) switched from state FAILING to FAILED.
> org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition a254412fc7464cd4e0fe04ab9e3a6309@8d5afff58c86dd7f5bc78946f0101699 not reachable.
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:168)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:237)
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.setup(SingleInputGate.java:215)
>     at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.setup(InputGateWithMetrics.java:65)
>     at org.apache.flink.runtime.taskmanager.Task.setupPartitionsAndGates(Task.java:866)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:621)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:86)
>     at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:68)
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:165)
>     ... 7 more
> Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'zjy-hadoop-prc-st164.bj/10.152.47.8:45704' has failed. This might indicate that the remote task manager has been lost.
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
>     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:134)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:591)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:508)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>     ... 1 more
> Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: zjy-hadoop-prc-st164.bj/10.152.47.8:45704
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>     at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
>     ... 6 more
> Caused by: java.net.ConnectException: Connection refused
>     ... 10 more
> {code}
>
> I looked into the code and found that this happens because we currently do not
> send the final status and diagnostics to the dispatcher when shutting down the
> cluster. So I propose to pass this information along so that the reported
> status is correct; a rough sketch of the idea is shown below.
>
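> As a hypothetical sketch of that idea (ApplicationStatus, JobResult, and
> ResourceManagerGateway#deregisterApplication exist in Flink; the ShutdownSketch
> class and its wiring are illustrative only, not the actual fix):
> {code:java}
> import org.apache.flink.runtime.clusterframework.ApplicationStatus;
> import org.apache.flink.runtime.jobmaster.JobResult;
> import org.apache.flink.runtime.resourcemanager.ResourceManagerGateway;
>
> class ShutdownSketch {
>
>     private final ResourceManagerGateway resourceManagerGateway;
>
>     ShutdownSketch(ResourceManagerGateway resourceManagerGateway) {
>         this.resourceManagerGateway = resourceManagerGateway;
>     }
>
>     /**
>      * Derives the final application status and diagnostics from the job
>      * result and passes them along during cluster shutdown, instead of
>      * unconditionally reporting SUCCEEDED.
>      */
>     void shutDownCluster(JobResult jobResult) {
>         ApplicationStatus finalStatus = jobResult.isSuccess()
>             ? ApplicationStatus.SUCCEEDED
>             : ApplicationStatus.FAILED;
>
>         String diagnostics = jobResult.getSerializedThrowable()
>             .map(Throwable::getMessage)
>             .orElse(null);
>
>         // In the YARN setup, deregistering the application is what ultimately
>         // determines the final status YARN reports for the app.
>         resourceManagerGateway.deregisterApplication(finalStatus, diagnostics);
>     }
> }
> {code}
>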
--
This message was sent by Atlassian Jira
(v8.3.4#803005)