[ https://issues.apache.org/jira/browse/LIVY-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated LIVY-541:
-----------------------------
    Description: 
It appears Livy doesn't differentiate sessions properly in Yarn, causing errors 
when multiple Livy servers are run behind a load balancer for HA and 
performance scaling on the same Hadoop cluster.

Each Livy server uses monotonically incrementing session IDs with a random 
suffix, but the random suffix doesn't appear to be passed to Yarn. The Livy 
server that is further behind in session numbers then hits the error below, 
because it finds that a session with the same number has already finished 
(submitted earlier by a different user on another Livy server, as seen in the 
Yarn RM UI); an illustrative sketch of the tag lookup follows the stack trace:
{code:java}
org.apache.zeppelin.livy.LivyException: Session 197 is finished, appId: null, log: [
    at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2887),
    at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2904),
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511),
    at java.util.concurrent.FutureTask.run(FutureTask.java:266),
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142),
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617),
    at java.lang.Thread.run(Thread.java:748),
    YARN Diagnostics: ,
    java.lang.Exception: No YARN application is found with tag livy-session-197-uveqmqyj in 300 seconds. Please check your cluster status, it is may be very busy.,
    org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
    org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
    org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
    scala.Option.getOrElse(Option.scala:120)
    org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
    org.apache.livy.Utils$$anon$1.run(Utils.scala:94)]
at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:300)
at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:184)
at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:57)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:165)
at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:139)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}
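
For context, here is a minimal sketch of a tag-based lookup against the YARN ResourceManager. This is not Livy's actual implementation; the object, method name, signature and matching logic are assumptions for illustration only. It shows why the lookup can only identify the right application when the full tag, random suffix included, is attached to the submitted YARN app, and why matching on the session number alone can latch onto another Livy server's already finished session:
{code:scala}
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.client.api.YarnClient

// Hypothetical sketch only -- names and matching logic are illustrative
// assumptions, not Livy's code. Per the stack trace above, Livy tags each
// submitted application with something like "livy-session-197-uveqmqyj" and
// SparkYarnApp.getAppIdFromTag then polls the ResourceManager until an
// application carrying that tag shows up.
object TagLookupSketch {
  def findAppIdForSession(client: YarnClient, sessionId: Int, suffix: String): Option[String] = {
    // The full tag: numeric session id plus the per-server random suffix.
    val fullTag = s"livy-session-$sessionId-$suffix"

    // Exact match on the full tag: if only the numeric part of the tag ever
    // reaches YARN, this finds nothing and the polling eventually gives up with
    // "No YARN application is found with tag livy-session-197-uveqmqyj in 300 seconds".
    client.getApplications().asScala
      .find(_.getApplicationTags.asScala.contains(fullTag))
      .map(_.getApplicationId.toString)

    // By contrast, a looser match on the numeric prefix alone, e.g.
    //   .find(_.getApplicationTags.asScala.exists(_.startsWith(s"livy-session-$sessionId")))
    // would pick up whichever "session 197" YARN already knows about, including a
    // finished one submitted by a different user through the other Livy server.
  }
}
{code}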

  was:
It appears Livy doesn't differentiate sessions properly in Yarn causing errors 
when running multiple Livy servers behind a load balancer for HA and 
performance scaling on the same Hadoop cluster.

Each livy server uses monotonically incrementing session IDs with a hash suffix 
but it appears that the hash suffix isn't passed to Yarn which results in the 
following errors on the Livy server which is further behind in session numbers 
because it appears to find the session with the same number has already 
finished (submitted earlier by a different user on another Livy server as seen 
in Yarn RM UI):
{code:java}
org.apache.zeppelin.livy.LivyException: Session 197 is finished, appId: null, 
log: [    at 
org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2887), at 
org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2904),
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), at 
java.util.concurrent.FutureTask.run(FutureTask.java:266), at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142),
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617),
 at java.lang.Thread.run(Thread.java:748), 
YARN Diagnostics: , java.lang.Exception: No YARN application is found with tag 
livy-session-197-uveqmqyj in 300 seconds. Please check your cluster status, it 
is may be very busy., 
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
 
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
 
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
 scala.Option.getOrElse(Option.scala:120) 
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
 org.apache.livy.Utils$$anon$1.run(Utils.scala:94)]
at 
org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:300)
at 
org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:184)
at 
org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:57)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at 
org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:165)
at 
org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:139)
at 
org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at 
org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}


> Multiple Livy servers submitting to Yarn results in LivyException: Session is 
> finished ... No YARN application is found with tag livy-session-197-uveqmqyj 
> in 300 seconds. Please check your cluster status, it is may be very busy
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LIVY-541
>                 URL: https://issues.apache.org/jira/browse/LIVY-541
>             Project: Livy
>          Issue Type: Bug
>          Components: Server
>    Affects Versions: 0.5.0
>         Environment: Hortonworks HDP 2.6
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> It appears Livy doesn't differentiate sessions properly in Yarn, causing 
> errors when multiple Livy servers are run behind a load balancer for HA and 
> performance scaling on the same Hadoop cluster.
> Each Livy server uses monotonically incrementing session IDs with a random 
> suffix, but the random suffix doesn't appear to be passed to Yarn. The Livy 
> server that is further behind in session numbers then hits the error below, 
> because it finds that a session with the same number has already finished 
> (submitted earlier by a different user on another Livy server, as seen in the 
> Yarn RM UI):
> {code:java}
> org.apache.zeppelin.livy.LivyException: Session 197 is finished, appId: null, log: [
>     at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2887),
>     at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2904),
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511),
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266),
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142),
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617),
>     at java.lang.Thread.run(Thread.java:748),
>     YARN Diagnostics: ,
>     java.lang.Exception: No YARN application is found with tag livy-session-197-uveqmqyj in 300 seconds. Please check your cluster status, it is may be very busy.,
>     org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
>     org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
>     org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
>     scala.Option.getOrElse(Option.scala:120)
>     org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
>     org.apache.livy.Utils$$anon$1.run(Utils.scala:94)]
> at org.apache.zeppelin.livy.BaseLivyInterpreter.createSession(BaseLivyInterpreter.java:300)
> at org.apache.zeppelin.livy.BaseLivyInterpreter.initLivySession(BaseLivyInterpreter.java:184)
> at org.apache.zeppelin.livy.LivySharedInterpreter.open(LivySharedInterpreter.java:57)
> at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
> at org.apache.zeppelin.livy.BaseLivyInterpreter.getLivySharedInterpreter(BaseLivyInterpreter.java:165)
> at org.apache.zeppelin.livy.BaseLivyInterpreter.open(BaseLivyInterpreter.java:139)
> at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
> at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:493)
> at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
> at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
