I think we have to take a step back here. For per-job (YARN) mode, the
general problem is that there are two systems that can do shutdown (and
other things) and two clients. There is YARN and there is Flink, and
Flink is YARN inside YARN, in a way. The solution, I think, is that
cancellation fo...
Yes, you are right that `thenAcceptAsync` only breaks the control flow, but
it does not guarantee that the `RestServer` has actually sent the response
to the client. Maybe we also need something similar to FLINK-10309 [1]. The
problem I see with this approach is that it makes all RestHandlers stateful...
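The distinction being discussed above can be seen in plain `CompletableFuture` terms: `thenAccept` may run its callback on the thread that completes the future, continuing the chain synchronously, while `thenAcceptAsync` hands the callback to an executor and thereby breaks the control flow of the completing thread. A minimal, self-contained sketch (not Flink code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncCallbackDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        CompletableFuture<Void> done = new CompletableFuture<>();

        // thenAccept: since it is registered before completion, the callback
        // runs on the thread that calls complete(), i.e. the chain keeps
        // executing synchronously on that thread.
        done.thenAccept(v ->
            System.out.println("thenAccept on: " + Thread.currentThread().getName()));

        // thenAcceptAsync: the callback is handed off to the given executor,
        // breaking the synchronous control flow of the completing thread.
        done.thenAcceptAsync(v ->
            System.out.println("thenAcceptAsync on: " + Thread.currentThread().getName()),
            pool);

        done.complete(null); // completes on the main thread
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

As noted above, neither variant guarantees anything about when the HTTP response actually reaches the client; the hand-off only changes which thread runs the continuation.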
Hi Tison & Till,
Changing `thenAccept` into `thenAcceptAsync` in
MiniDispatcher#cancelJob does not help to solve the problem in my
environment. However, I have found that adding a `Thread.sleep(2000)` before
the return of JobCancellationHandler#handleRequest solved the problem (at
least the sy...
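That the sleep workaround helps points at a race: the cluster begins shutting down as soon as cancellation is acknowledged internally, possibly before the response bytes have left the server. A hypothetical sketch of that race (the names here are illustrative, not Flink's actual API):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the race described above: the "cancel" future
// completes internally before the response write to the client finishes.
public class ShutdownRaceSketch {
    public static void main(String[] args) throws Exception {
        CompletableFuture<String> responseSent = new CompletableFuture<>();

        // Simulated network write of the REST response, which takes a moment.
        Thread responder = new Thread(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(100); // simulated network latency
                responseSent.complete("cancel acknowledged");
            } catch (InterruptedException e) {
                responseSent.completeExceptionally(e);
            }
        });
        responder.start();

        // Shutting down here, without waiting, would tear the server down
        // while the write is still in flight. Waiting (which is effectively
        // what the Thread.sleep(2000) workaround does, bluntly) lets the
        // response leave the server first.
        String sent = responseSent.get(2, TimeUnit.SECONDS);
        System.out.println("shutting down after: " + sent);
    }
}
```

The sleep is of course only a diagnostic aid; a real fix would need the shutdown to be sequenced after the response future, which is what the FLINK-10309 reference above hints at.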
JIRA created as https://jira.apache.org/jira/browse/FLINK-16637
Best,
tison.
Till Rohrmann wrote on Tue, Mar 17, 2020 at 5:57 PM:
@Tison could you create an issue to track the problem. Please also link
the uploaded log file for further debugging.
I think the reason why it worked in Flink 1.9 could have been that we had an
async callback in the longer chain which broke the flow of execution and
allowed the response to be sent. T...
Hi Tison & Till and all,
I have uploaded the client, taskmanager and jobmanager log to Gist (
https://gist.github.com/kylemeow/500b6567368316ec6f5b8f99b469a49f), and I
can reproduce this bug every time I try to cancel Flink 1.10 jobs on
YARN.
Besides, in earlier Flink versions like 1.9, the ...
edit: previously, after the cancellation we had a longer call chain to
#jobReachedGloballyTerminalState, which does the job archiving & graceful JM
shutdown; this might take some time, so that ...
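The point about the longer 1.9-style call chain can be sketched with a plain `CompletableFuture` pipeline (method names are illustrative, not Flink's actual API): an extra stage such as archiving sits between cancellation and shutdown, so the shutdown happens measurably later, incidentally giving the response time to go out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: an extra async stage (like job archiving) between
// cancellation and shutdown delays the shutdown.
public class LongerChainSketch {
    // Stand-in for the archiving step; takes some time on another thread.
    static CompletableFuture<Void> archiveJob() {
        return CompletableFuture.runAsync(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(200); // archiving takes time
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        CompletableFuture<Void> cancelled = CompletableFuture.completedFuture(null);

        cancelled
            .thenCompose(v -> archiveJob()) // the extra stage in the chain
            .thenRun(() -> {
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println("shutdown delayed: " + (ms >= 200));
            })
            .join();
    }
}
```

Relying on such incidental delay is fragile, which would explain why shortening the chain in 1.10 exposed the race.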
Best,
tison.
tison wrote on Tue, Mar 17, 2020 at 10:13 AM:
Hi Weike & Till,
I agree with Till, and it is also the analysis from my side. However, it
seems that even if we don't have FLINK-15116, it is still possible that we
complete the cancel future but the cluster gets shut down before it has
properly delivered the response.
There is one strange thing: this be...
Hi Weike,
could you share the complete logs with us? Attachments are being filtered
out by the Apache mail server but it works if you upload the logs somewhere
(e.g. https://gist.github.com/) and then share the link with us. Ideally
you run the cluster with DEBUG log settings.
I assume that you a...
Hi Yangze and all,
I have tried numerous times, and this behavior persists.
Below is the tail log of taskmanager.log:
2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
TaskSlot(index:0, state:ACTIVE, re...
Would you mind sharing more information about why the task executor
is killed? If it was killed by YARN, you might find such info in the YARN
NM/RM logs.
Best,
Yangze Guo
On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike wrote:
Hi,
Recently I have encountered a strange behavior of Flink on YARN, which is
that when I try to cancel a Flink job running in per-job mode on YARN using
commands like
"cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae",
the client happily found and conne...