Hi Stephan,

thank you for this clarification. I have a slightly related follow-up question: I keep reading that the preferred way to run Flink on YARN is "Flink-job-at-a-time-on-YARN". Can you explain this a little further? Of course, with a separate YARN session per job the jobs are more decoupled, but on the other hand it seems counter-intuitive to start a new Flink cluster for each job.
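For concreteness, my current understanding of per-job mode is that one skips yarn-session.sh entirely and lets "flink run" allocate (and later tear down) its own YARN application, roughly like this (the -y* flags are how I read the 1.x CLI docs, and the resource numbers are just placeholders, not a recommendation):

    # Per-job mode sketch: no standing YARN session; the job brings up
    # its own YARN application and tears it down on completion/cancel.
    flink run -m yarn-cluster \
        -yn 8 \
        -yjm 4096 \
        -ytm 10240 \
        -ys 10 \
        -p 65 -c SomeClass some.jar

If that is right, then "flink cancel" (or the job finishing) would release the YARN resources as well, which would answer part of my own question.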
Best regards,
Konstantin

On 12.07.2016 15:48, Stephan Ewen wrote:
> I think there is a confusion between how Flink thinks about HA and job
> life cycle, and how many users think about it.
>
> Flink thinks that a killing of the YARN session is a failure of the job.
> So as soon as new YARN resources become available, it tries to recover
> the job. Most users think that killing a YARN session is equivalent to
> cancelling the job.
>
> I am unsure if we should start to interpret the killing of a YARN
> session as a cancellation. Do YARN sessions never get killed
> accidentally, or as the result of a YARN-related failure?
>
> Using Flink-job-at-a-time-on-YARN, cancelling the Flink job also shuts
> down the YARN session and hence shuts down everything properly.
>
> Hope that train of thought helps.
>
> On Tue, Jul 12, 2016 at 3:15 PM, Ufuk Celebi <u...@apache.org> wrote:
>
>     Are you running in HA mode? If yes, that's the expected behaviour at
>     the moment, because the ZooKeeper data is only cleaned up on a
>     terminal state (FINISHED, FAILED, CANCELLED). You have to specify
>     separate ZooKeeper root paths via "recovery.zookeeper.path.root".
>     There is an issue which should be fixed for 1.2 to make this
>     configurable in an easy way.
>
>     On Tue, Jul 12, 2016 at 1:28 PM, Konstantin Gregor
>     <konstantin.gre...@tngtech.com> wrote:
>     > Hello everyone,
>     >
>     > I have a question concerning stopping Flink streaming processes
>     > that run in a detached YARN session.
>     >
>     > Here's what we do: We start a YARN session via
>     >     yarn-session.sh -n 8 -d -jm 4096 -tm 10000 -s 10 -qu flink_queue
>     >
>     > Then, we start our Flink streaming application via
>     >     flink run -p 65 -c SomeClass some.jar > /dev/null 2>&1 &
>     >
>     > The problem occurs when we stop the application.
>     > If we stop the Flink application with
>     >     flink cancel <JOB_ID>
>     > and then kill the YARN application with
>     >     yarn application -kill <APPLICATION_ID>
>     > everything is fine.
>     > But when we only kill the YARN application without specifically
>     > cancelling the Flink job first, the Flink job stays lingering on
>     > the machine and uses resources until it is killed manually via
>     > its process ID, which is not what we expected.
>     >
>     > One thing we tried was to stop using ephemeral ports for the
>     > application master: we set yarn.application-master.port to a
>     > fixed port number, but the problem remains: killing the YARN
>     > application does not kill the corresponding Flink job.
>     >
>     > Does anyone have an idea about this? Any help is greatly
>     > appreciated :-)
>     > By the way, our application reads data from a Kafka queue and
>     > writes it into HDFS; maybe this is also important to know.
>     >
>     > Thank you and best regards
>     >
>     > Konstantin
>     >
>     > --
>     > Konstantin Gregor * konstantin.gre...@tngtech.com
>     > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>     > Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>     > Sitz: Unterföhring * Amtsgericht München * HRB 135082

--
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
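P.S. In case it helps other readers of this thread: the separate ZooKeeper root path that Ufuk mentions is set in flink-conf.yaml. A minimal sketch, assuming the 1.0/1.1 key names and with purely placeholder quorum and path values:

    # flink-conf.yaml -- HA settings (key names as of Flink 1.0/1.1)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    # Give each job/session its own root so that state left behind by a
    # killed session is not picked up by the next one:
    recovery.zookeeper.path.root: /flink/my-streaming-job

The important point, as I understand it, is that the root path has to differ per job/session until the easier per-job configuration lands in 1.2.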