Re: Flink YARN app terminated before the client receives the result

2020-03-30 Thread Aljoscha Krettek
I think we have to take a step back here. For per-job (YARN) mode, the general problem is that there are two systems that can do shutdown (and other things) and two clients. There is YARN and there is Flink, and Flink is YARN inside YARN, in a way. The solution, I think, is that cancellation fo

Re: Flink YARN app terminated before the client receives the result

2020-03-20 Thread Till Rohrmann
Yes, you are right that `thenAcceptAsync` only breaks the control flow; it does not guarantee that the `RestServer` has actually sent the response to the client. Maybe we also need something similar to FLINK-10309 [1]. The problem I see with this approach is that it makes all RestHandlers statefu
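
To illustrate the `thenAccept` / `thenAcceptAsync` distinction discussed in this thread, here is a minimal, self-contained sketch using only the JDK's `CompletableFuture`. It is not Flink code: the class name `CallbackThreadDemo`, the method `callbackThread` and the `"ACK"` value are hypothetical names chosen for illustration. It shows only that the async variant moves the callback onto another thread; as noted above, that alone says nothing about whether a response has already been written to the client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class, not part of Flink.
class CallbackThreadDemo {

    /** Returns the name of the thread on which the registered callback ran. */
    static String callbackThread(boolean async) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Stands in for the future that completes when cancellation is acknowledged.
            CompletableFuture<String> cancelFuture = new CompletableFuture<>();
            CompletableFuture<String> ranOn = new CompletableFuture<>();

            if (async) {
                // Callback is handed to the executor: the completing thread returns immediately.
                cancelFuture.thenAcceptAsync(
                        ack -> ranOn.complete(Thread.currentThread().getName()), executor);
            } else {
                // Callback runs inline on whichever thread completes the future.
                cancelFuture.thenAccept(
                        ack -> ranOn.complete(Thread.currentThread().getName()));
            }

            // The future is completed after the callback was registered,
            // so the synchronous variant runs on the calling thread.
            cancelFuture.complete("ACK");
            return ranOn.get(5, TimeUnit.SECONDS);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("caller thread:   " + Thread.currentThread().getName());
        System.out.println("thenAccept:      " + callbackThread(false));
        System.out.println("thenAcceptAsync: " + callbackThread(true));
    }
}
```

With the synchronous variant, any follow-up work (such as a shutdown) runs before the completing thread can continue; with the async variant the completing thread continues at once, but nothing orders the follow-up work after the response has actually been delivered.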

Re: Flink YARN app terminated before the client receives the result

2020-03-20 Thread DONG, Weike
Hi Tison & Till, Changing `thenAccept` into `thenAcceptAsync` in MiniDispatcher#cancelJob does not solve the problem in my environment. However, I have found that adding a `Thread.sleep(2000)` before the return of JobCancellationHandler#handleRequest solved the problem (at least the sy

Re: Flink YARN app terminated before the client receives the result

2020-03-17 Thread tison
JIRA created as https://jira.apache.org/jira/browse/FLINK-16637 Best, tison. Till Rohrmann wrote on Tue, Mar 17, 2020 at 5:57 PM: > @Tison could you create an issue to track the problem. Please also link > the uploaded log file for further debugging. > > I think the reason why it worked in Flink 1.9 could h

Re: Flink YARN app terminated before the client receives the result

2020-03-17 Thread Till Rohrmann
@Tison could you create an issue to track the problem? Please also link the uploaded log file for further debugging. I think the reason why it worked in Flink 1.9 could have been that we had an async callback in the longer chain which broke the flow of execution and allowed the response to be sent. T

Re: Flink YARN app terminated before the client receives the result

2020-03-17 Thread DONG, Weike
Hi Tison & Till and all, I have uploaded the client, taskmanager and jobmanager logs to Gist (https://gist.github.com/kylemeow/500b6567368316ec6f5b8f99b469a49f), and I can reproduce this bug every time I try to cancel Flink 1.10 jobs on YARN. Besides, in earlier Flink versions like 1.9, the

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread tison
Edit: previously, after the cancellation, we had a longer call chain to #jobReachedGloballyTerminalState, which archives the job and gracefully shuts down the JM; this might take some time so that ... Best, tison. tison wrote on Tue, Mar 17, 2020 at 10:13 AM: > Hi Weike & Till, > > I agree with Till and it is also

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread tison
Hi Weike & Till, I agree with Till and it is also the analysis from my side. However, it seems that even without FLINK-15116, it is still possible that we complete the cancel future but the cluster gets shut down before it properly delivers the response. There is one thing strange that this be

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread Till Rohrmann
Hi Weike, could you share the complete logs with us? Attachments are being filtered out by the Apache mail server but it works if you upload the logs somewhere (e.g. https://gist.github.com/) and then share the link with us. Ideally you run the cluster with DEBUG log settings. I assume that you a

Re: Flink YARN app terminated before the client receives the result

2020-03-12 Thread DONG, Weike
Hi Yangze and all, I have tried numerous times, and this behavior persists. Below is the tail log of taskmanager.log: 2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:0, state:ACTIVE, re

Re: Flink YARN app terminated before the client receives the result

2020-03-12 Thread Yangze Guo
Would you mind sharing more information about why the task executor is killed? If it is killed by YARN, you might find such info in the YARN NM/RM logs. Best, Yangze Guo On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike wrote: > > Hi, > > Recently I have encountered a strange behavior

Flink YARN app terminated before the client receives the result

2020-03-12 Thread DONG, Weike
Hi, Recently I have encountered a strange behavior of Flink on YARN, which is that when I try to cancel a Flink job running in per-job mode on YARN using commands like "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae" the client happily found and conne