[ https://issues.apache.org/jira/browse/FLINK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jared Stehler updated FLINK-6662:
---------------------------------
    Description:
Running flink mesos on 1.3-release branch, I'm seeing the following error on appmaster startup:

2017-05-22 15:32:45.946 [flink-akka.actor.default-dispatcher-17] WARN o.a.flink.mesos.runtime.clusterframework.MesosJobManager - Failed to recover job 088027410f1a628e7dfc59dc23df3ded.
java.lang.Exception: Failed to retrieve the submitted job graph from state handle.
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:186)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(JobManager.scala:536)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply(JobManager.scala:533)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply(JobManager.scala:533)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply$mcV$sp(JobManager.scala:533)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply(JobManager.scala:529)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply(JobManager.scala:529)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.runtime.jobgraph.tasks.JobSnapshottingSettings
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:64)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:305)
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:184)
    ... 15 common frames omitted

  was:
Running flink mesos on 1.3-release branch, I'm seeing the following error on appmaster startup:

sLast login: Sun May 21 19:03:05 on ttys005
sg 10.80.54.119%
~/dev/scratch/flink release-1.3 ● ssg 10.80.54.119 ✓ 11436 12:00:16
zsh: command not found: ssg
~/dev/scratch/flink release-1.3 ● ssh 10.80.54.119 127 ↵ 11437 12:00:16
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-117-generic x86_64)

 * Documentation: https://help.ubuntu.com/

  System information as of Mon May 22 15:00:39 UTC 2017

  System load: 0.0                Processes: 159
  Usage of /: 68.1% of 7.74GB     Users logged in: 0
  Memory usage: 20%               IP address for eth0: 10.80.54.119
  Swap usage: 0%                  IP address for docker0: 172.17.0.1

  Graph this data and manage this system at:
    https://landscape.canonical.com/

  Get cloud support with Ubuntu Advantage Cloud Guest:
    http://www.ubuntu.com/business/services/cloud

31 packages can be updated.
27 updates are security updates.

New release '16.04.2 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Sun May 21 18:44:31 2017 from ip-10-80-48-143.us-west-2.compute.internal
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/
docker/ logs/ lost+found/ singularity-executor/ work/
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/work/
meta/ provisioner/ slaves/
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/
docker/ frameworks/
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/
e87689b9-a83a-449b-9e6b-3e339ead141a-0004/ e87689b9-a83a-449b-9e6b-3e339ead141a-0006/ e87689b9-a83a-449b-9e6b-3e339ead141a-0008/ Singularity/
e87689b9-a83a-449b-9e6b-3e339ead141a-0005/ e87689b9-a83a-449b-9e6b-3e339ead141a-0007/ e87689b9-a83a-449b-9e6b-3e339ead141a-0009/
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors/
1vn1/ 3un1/ 4un1/ 6tn1/ 6un1/ asn1/ cvn1/ hun1/ iin1/ isn1/ jsn1/ jtn1/ ltn1/ ntn1/ osn1/ smn1/ tsn1/ won1/
ubuntu@ip-10-80-54-119:~$ cd /mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors/
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors$ ls 1vn1/
runs
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors$ ls 1vn1/runs/
e24faf7e-9553-4c07-8c6a-e85acdfe88af latest
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors$ find . -name "*1495467133685*"
find: `./asn1/runs/9b6627de-d545-4e34-87e6-f639d94afe47/pdf-service-1495313993-1495314044332-1-10.80.54.119-us_west_2c/logs': Permission denied
find: `./asn1/runs/9b6627de-d545-4e34-87e6-f639d94afe47/pdf-service-1495313993-1495314044332-1-10.80.54.119-us_west_2c/tmp': Permission denied
find: `./iin1/runs/b0bcea6d-4a88-4587-bf34-619aa627338d/deployinator-1494544026-1494956610579-1-10.80.54.119-us_west_2c/logs': Permission denied
find: `./iin1/runs/b0bcea6d-4a88-4587-bf34-619aa627338d/deployinator-1494544026-1494956610579-1-10.80.54.119-us_west_2c/tmp': Permission denied
find: `./ntn1/runs/6483530f-0e0a-4b27-ad89-dd0fede1e3c0/prometheus-1495389842-1495389843120-1-10.80.54.119-us_west_2c/storage': Permission denied
./cvn1/runs/11f8a647-1cad-4881-b30c-9f68a4aa1cc3/flink-mesos-1495467129-1495467133685-1-10.80.54.119-us_west_2c
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors$ cd cvn1/runs/11f8a647-1cad-4881-b30c-9f68a4aa1cc3/flink-mesos-1495467129-1495467133685-1-10.80.54.119-us_west_2c
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors/cvn1/runs/11f8a647-1cad-4881-b30c-9f68a4aa1cc3/flink-mesos-1495467129-1495467133685-1-10.80.54.119-us_west_2c$ ls -al
total 140
drwxr-xr-x 5 root root 4096 May 22 15:33 .
drwxr-xr-x 3 root root 4096 May 22 15:33 ..
drwxr-xr-x 2 root root 4096 May 22 03:07 conf
-rw-r--r-- 1 root root 3752 May 22 15:32 docker.env
-rw-r--r-- 1 root root 948 May 22 15:33 executor.bash.log
-rw-r--r-- 1 root root 12860 May 22 15:33 executor.java.log
-rw-r--r-- 1 root root 752 May 22 15:33 logrotate.status
drwxr-xr-x 2 root root 4096 May 22 15:33 logs
-rw-r--r-- 1 root root 10979 May 22 15:32 runner.sh
-rw-r--r-- 1 root root 79224 May 22 15:33 tail_of_finished_service.log
drwxr-xr-x 2 root root 4096 May 22 15:32 tmp
ubuntu@ip-10-80-54-119:/mnt/mesos/work/slaves/e87689b9-a83a-449b-9e6b-3e339ead141a-S13/frameworks/Singularity/executors/cvn1/runs/11f8a647-1cad-4881-b30c-9f68a4aa1cc3/flink-mesos-1495467129-1495467133685-1-10.80.54.119-us_west_2c$ less tail_of_finished_service.log
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.runtime.jobgraph.tasks.JobSnapshottingSettings
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:64)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:305)
    at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:184)
    ... 15 common frames omitted


> ClassNotFoundException: o.a.f.r.j.t.JobSnapshottingSettings recovering job
> --------------------------------------------------------------------------
>
>                 Key: FLINK-6662
>                 URL: https://issues.apache.org/jira/browse/FLINK-6662
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager, Mesos, State Backends, Checkpointing
>    Affects Versions: 1.3.0
>            Reporter: Jared Stehler
>
> Running flink mesos on 1.3-release branch, I'm seeing the following error on appmaster startup:
> 2017-05-22 15:32:45.946 [flink-akka.actor.default-dispatcher-17] WARN o.a.flink.mesos.runtime.clusterframework.MesosJobManager - Failed to recover job 088027410f1a628e7dfc59dc23df3ded.
> java.lang.Exception: Failed to retrieve the submitted job graph from state handle.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:186)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(JobManager.scala:536)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply(JobManager.scala:533)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1$$anonfun$apply$mcV$sp$1.apply(JobManager.scala:533)
>     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply$mcV$sp(JobManager.scala:533)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply(JobManager.scala:529)
>     at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$1.apply(JobManager.scala:529)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.ClassNotFoundException: org.apache.flink.runtime.jobgraph.tasks.JobSnapshottingSettings
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Class.java:348)
>     at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:64)
>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>     at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
>     at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>     at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:305)
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58)
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:184)
>     ... 15 common frames omitted

--
This message was sent by Atlassian JIRA (v6.3.15#6346)