Re: Flink Checkpoint on yarn

2016-03-20 Thread Ufuk Celebi
Can you please have a look into the JobManager log file and report which checkpoints are restored? You should see messages from ZooKeeperCompletedCheckpointStore like:
- Found X checkpoints in ZooKeeper
- Initialized with X. Removing all older checkpoints
You can share the complete job manager
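
A minimal sketch of the log check suggested above, assuming the JobManager log is available as plain text lines and that the messages contain the class name and the wording quoted in the message. The helper names and the sample lines are hypothetical; real log lines will carry timestamps and other fields in whatever layout the logging configuration produces.

```python
import re

# Assumption: the message wording matches what is quoted in the thread.
FOUND = re.compile(r"Found (\d+) checkpoints in ZooKeeper")


def checkpoint_store_lines(lines):
    """Keep only lines that mention ZooKeeperCompletedCheckpointStore."""
    return [l.rstrip("\n") for l in lines
            if "ZooKeeperCompletedCheckpointStore" in l]


def found_counts(lines):
    """Return the X of every 'Found X checkpoints in ZooKeeper' message."""
    counts = []
    for line in checkpoint_store_lines(lines):
        match = FOUND.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "INFO ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper\n",
    "INFO SomeOtherClass - unrelated message\n",
    "INFO ZooKeeperCompletedCheckpointStore - Initialized with X. Removing all older checkpoints\n",
]
filtered = checkpoint_store_lines(sample)
counts = found_counts(sample)
```

To run it against a real file, pass `open(path)` in place of `sample`.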

Re: Flink Checkpoint on yarn

2016-03-20 Thread Ufuk Celebi
Yes, the jobs have their own UUID. Although you expect there to be two independent clusters (which makes sense since you started via yarn-cluster), both clusters act as a single one because of the shared ZooKeeper root. What happens in your case is the following (this is also the reason why we

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
OK, so you are submitting multiple jobs, but you submit them with -m yarn-cluster and therefore expect them to start separate YARN clusters. Makes sense and I would expect the same. I think that you can check in the client logs printed to stdout to which cluster the job is submitted. PS: The

Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
Hello, I'm testing the checkpointing functionality with HDFS as a backend. From what I can see, it uses different checkpoint files and resumes the computation from different points rather than from the latest one available. To me this is unexpected behaviour. I log every second, for every worker, a

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
This is the log, filtered to show only messages from ZooKeeperCompletedCheckpointStore. https://gist.github.com/chobeat/0222b31b87df3fa46a23 It looks like it finds only one checkpoint, but I'm not sure whether the different hashes and IDs of the checkpoints are meaningful or not. 2016-03-16 15:33

Re: Flink Checkpoint on yarn

2016-03-19 Thread Ufuk Celebi
Hey Simone, from the logs it looks like multiple jobs have been submitted to the cluster, not just one. The different files correspond to different jobs recovering. The filtered logs show three jobs running/recovering (with IDs 10d8ccae6e87ac56bf763caf4bc4742f, 124f29322f9026ac1b35435d5de9f625,

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Hi Ufuk, does the recovery.zookeeper.path.root property need to be set independently for each job that is run? Doesn't Flink take care of assigning some sort of identification to each job and storing their checkpoints independently? On Thu, Mar 17, 2016 at 11:43 AM, Ufuk Celebi

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Hi Ufuk, I've read the documentation and it's exactly as you say, thanks for the clarification. Assuming one wants to run several jobs in parallel with different users on a secure cluster in HA

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
Actually the test was intended for a single job. The fact that there are more jobs is unexpected and it will be the first thing to verify. Considering these problems we will go for deeper tests with multiple jobs. The logs are collected with "yarn logs" but log aggregation is not properly

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
Yes, but each job runs its own cluster, right? We have to run them on a secure cluster and on a per-user basis, thus we can't run a YARN session but have to run each job independently. On Thu, Mar 17, 2016 at 12:09 PM, Ufuk Celebi wrote: > On Thu, Mar 17, 2016 at 11:51 AM,

Re: Flink Checkpoint on yarn

2016-03-19 Thread Stefano Baghino
> Do you have time to repeat your experiment with different ZooKeeper root paths? We reached the same conclusion and we're running this test right now, thanks. On Thu, Mar 17, 2016 at 12:08 PM, Ufuk Celebi wrote: > Yes, the jobs have their own UUID. > > Although you expect

Re: Flink Checkpoint on yarn

2016-03-19 Thread Simone Robutti
I didn't resubmit the job. Also, the jobs are submitted one by one with -m yarn-master, not with a long-running YARN session, so I don't really know if they could mix up. I will repeat the test with a cleaned state because we saw that killing the job with yarn application -kill left the "flink

Re: Flink Checkpoint on yarn

2016-03-18 Thread Ufuk Celebi
On Thu, Mar 17, 2016 at 11:51 AM, Stefano Baghino wrote:
> does the recovery.zookeeper.path.root property need to be set independently
> for each job that is run?

No, just per cluster.
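
As a rough illustration of "per cluster": when every job is submitted with -m yarn-cluster, each submission starts its own cluster, so each one needs its own ZooKeeper root. The sketch below builds one submission command per cluster name. The helper names and the `/flink/<name>` path layout are hypothetical, and passing the property through the `-yD` dynamic-property flag is an assumption to check against the CLI help of the Flink version in use; setting `recovery.zookeeper.path.root` in each cluster's flink-conf.yaml is the alternative.

```python
def zookeeper_root(cluster_name):
    """Hypothetical layout: one ZooKeeper root node per cluster."""
    return "/flink/" + cluster_name


def submit_command(jar, cluster_name):
    """Build the argument list for one per-job YARN cluster.

    Assumption: '-yD key=value' forwards a dynamic property to the
    YARN cluster started by 'flink run -m yarn-cluster'.
    """
    return [
        "flink", "run", "-m", "yarn-cluster",
        "-yD", "recovery.zookeeper.path.root=" + zookeeper_root(cluster_name),
        jar,
    ]


# Two independent clusters get two distinct roots, so their
# checkpoint stores in ZooKeeper do not overlap.
cmd_a = submit_command("job.jar", "user-a")
cmd_b = submit_command("job.jar", "user-b")
```

The resulting lists can be handed to `subprocess.run` once the flag has been verified.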