Paul Brenner created ZEPPELIN-5323:
--------------------------------------
Summary: Interpreter Recovery Does Not Preserve Running Spark Jobs
Key: ZEPPELIN-5323
URL: https://issues.apache.org/jira/browse/ZEPPELIN-5323
Project: Zeppelin
Issue Type: Bug
Reporter: Paul Brenner
We are using Zeppelin 0.10 built from master on March 26th; the most recent commit at that point was 85ed8e2e51e1ea10df38d4710216343efe218d60. We tried to enable interpreter recovery by adding the following to zeppelin-site.xml:
<property>
  <name>zeppelin.recovery.storage.class</name>
  <value>org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage</value>
  <description>RecoveryStorage implementation based on Hadoop FileSystem</description>
</property>
<property>
  <name>zeppelin.recovery.dir</name>
  <value>/user/zeppelin/recovery</value>
  <description>Location where recovery metadata is stored</description>
</property>
When we start up Zeppelin we get no errors. I can start a job running, and I see that {{/user/zeppelin/recovery/spark_paul.recovery}} lists {{spark_paul-anonymous-2G3KV92PG 10.16.41.212:34374}}, so that looks promising.
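For what it's worth, each line in the {{.recovery}} file appears to pair an interpreter group id with the host:port of its still-running remote interpreter process. A minimal sketch of that observed layout (hypothetical helper for illustration only, not Zeppelin's actual parser):

```python
def parse_recovery_entry(line: str) -> dict:
    """Split one recovery-file entry into its parts.

    Hypothetical helper, assuming the observed "<group-id> <host>:<port>"
    layout; not taken from Zeppelin's source.
    """
    group_id, address = line.split()      # whitespace-separated pair
    host, port = address.rsplit(":", 1)   # port follows the last colon
    return {"group": group_id, "host": host, "port": int(port)}

entry = parse_recovery_entry("spark_paul-anonymous-2G3KV92PG 10.16.41.212:34374")
```

If recovery worked as expected, Zeppelin would presumably reconnect to the process at that address on restart instead of spawning a new one.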
When we stop Zeppelin, the interpreter process keeps running, but I see the following happen to the Spark job:
21/04/08 13:42:09 INFO yarn.YarnAllocator: Canceling requests for 262 executor container(s) to have a new desired total 0 executors.
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. zeppelin-212.sec.placeiq.net:36733
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
21/04/08 13:42:09 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
21/04/08 13:42:09 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://nameservice1/user/pbrenner/.sparkStaging/application_1617808481394_4478
21/04/08 13:42:09 INFO util.ShutdownHookManager: Shutdown hook called
Then when we start Zeppelin back up, I see the following on the paragraph that was running:
java.lang.RuntimeException: Interpreter instance org.apache.zeppelin.spark.SparkInterpreter not created
    at org.apache.zeppelin.interpreter.remote.PooledRemoteClient.callRemoteFunction(PooledRemoteClient.java:114)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.callRemoteFunction(RemoteInterpreterProcess.java:99)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:281)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:442)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:71)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
It looks VERY close to working, but somehow Spark jobs are still getting shut down when we shut down Zeppelin. Any ideas?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)