Hi Stephan,

thank you for this clarification. I have a slightly related follow-up question: I keep reading that the preferred way to run Flink on YARN is "Flink-job-at-a-time-on-YARN". Can you explain this a little further? Of course, with a separate YARN session per job the jobs are more decoupled, but on the other hand it seems counter-intuitive to start a new Flink cluster for each job.
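For concreteness, my current understanding of per-job mode is that one skips yarn-session.sh entirely and lets "flink run" allocate (and later tear down) its own YARN application, roughly like this (the -y* flags are how I read the 1.x CLI docs, and the resource numbers are just placeholders, not a recommendation):

    # Per-job mode sketch: no standing YARN session; the job brings up
    # its own YARN application and tears it down on completion/cancel.
    flink run -m yarn-cluster \
        -yn 8 \
        -yjm 4096 \
        -ytm 10240 \
        -ys 10 \
        -p 65 -c SomeClass some.jar

If that is right, then "flink cancel" (or the job finishing) would release the YARN resources as well, which would answer part of my own question.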
Best regards,
Konstantin

On 12.07.2016 15:48, Stephan Ewen wrote:
> I think there is a confusion between how Flink thinks about HA and job
> life cycle, and how many users think about it.
>
> Flink thinks that a killing of the YARN session is a failure of the job.
> So as soon as new YARN resources become available, it tries to recover
> the job. Most users think that killing a YARN session is equivalent to
> cancelling the job.
>
> I am unsure if we should start to interpret the killing of a YARN
> session as a cancellation. Do YARN sessions never get killed
> accidentally, or as the result of a YARN-related failure?
>
> Using Flink-job-at-a-time-on-YARN, cancelling the Flink job also shuts
> down the YARN session and hence shuts down everything properly.
>
> Hope that train of thought helps.
>
> On Tue, Jul 12, 2016 at 3:15 PM, Ufuk Celebi <u...@apache.org> wrote:
>
>     Are you running in HA mode? If yes, that's the expected behaviour at
>     the moment, because the ZooKeeper data is only cleaned up on a
>     terminal state (FINISHED, FAILED, CANCELLED). You have to specify
>     separate ZooKeeper root paths via "recovery.zookeeper.path.root".
>     There is an issue which should be fixed for 1.2 to make this
>     configurable in an easy way.
>
>     On Tue, Jul 12, 2016 at 1:28 PM, Konstantin Gregor
>     <konstantin.gre...@tngtech.com> wrote:
>     > Hello everyone,
>     >
>     > I have a question concerning stopping Flink streaming processes
>     > that run in a detached YARN session.
>     >
>     > Here's what we do: We start a YARN session via
>     >     yarn-session.sh -n 8 -d -jm 4096 -tm 10000 -s 10 -qu flink_queue
>     >
>     > Then, we start our Flink streaming application via
>     >     flink run -p 65 -c SomeClass some.jar > /dev/null 2>&1 &
>     >
>     > The problem occurs when we stop the application.
>     > If we stop the Flink application with
>     >     flink cancel <JOB_ID>
>     > and then kill the YARN application with
>     >     yarn application -kill <APPLICATION_ID>
>     > everything is fine.
>     > But when we only kill the YARN application without specifically
>     > cancelling the Flink job first, the Flink job stays lingering on
>     > the machine and uses resources until it is killed manually via
>     > its process ID, which is not what we expected.
>     >
>     > One thing we tried was to stop using ephemeral ports for the
>     > application master: we set yarn.application-master.port to a
>     > fixed port number, but the problem remains: killing the YARN
>     > application does not kill the corresponding Flink job.
>     >
>     > Does anyone have an idea about this? Any help is greatly
>     > appreciated :-)
>     > By the way, our application reads data from a Kafka queue and
>     > writes it into HDFS; maybe this is also important to know.
>     >
>     > Thank you and best regards
>     >
>     > Konstantin
>     >
>     > --
>     > Konstantin Gregor * konstantin.gre...@tngtech.com
>     > TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>     > Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>     > Sitz: Unterföhring * Amtsgericht München * HRB 135082

--
Konstantin Knauf * konstantin.kn...@tngtech.com * +49-174-3413182
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
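P.S. In case it helps other readers of this thread: the separate ZooKeeper root path that Ufuk mentions is set in flink-conf.yaml. A minimal sketch, assuming the 1.0/1.1 key names and with purely placeholder quorum and path values:

    # flink-conf.yaml -- HA settings (key names as of Flink 1.0/1.1)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    # Give each job/session its own root so that state left behind by a
    # killed session is not picked up by the next one:
    recovery.zookeeper.path.root: /flink/my-streaming-job

The important point, as I understand it, is that the root path has to differ per job/session until the easier per-job configuration lands in 1.2.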