[ https://issues.apache.org/jira/browse/SPARK-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marcelo Vanzin resolved SPARK-16752.
------------------------------------
    Resolution: Invalid

This is not the place to report bugs about the SJS, which is unrelated to the Spark project.

> Spark Job Server not releasing jobs from "running list" even after yarn completes the job
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-16752
>                 URL: https://issues.apache.org/jira/browse/SPARK-16752
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.0, 1.5.0
>         Environment: SJS version 0.6.1 and Spark 1.5.0 running on Yarn-client mode
>            Reporter: Ash Pran
>              Labels: patch
>         Attachments: SJS_JOBS_RUNNING, SJS_JOB_COMP_YARN, SJS_JOB_LOG_CONSOLE, SJS_Limited_Log.txt
>
>
> We are having a strange issue with Spark Job Server (SJS)
> We are using SJS 0.6.1 and Spark 1.5.0 with "yarn-client" mode. The details of settings.sh for SJS is as below
> ********************************************************************
> INSTALL_DIR=$(cd `dirname $0`; pwd -P)
> LOG_DIR=$INSTALL_DIR/logs
> PIDFILE=spark-jobserver.pid
> JOBSERVER_MEMORY=16G
> SPARK_VERSION=1.5.0
> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark
> SPARK_CONF_DIR=$SPARK_HOME/conf
> SCALA_VERSION=2.10.4
> ********************************************************************
> We are using fair scheduling with 2 pools with 50 executors of 1 GB each.
> We are also having max-jobs-per-context set to # of cores, which is 48.
> What we are seeing is for the first 5 minutes or so, it is all good ...the jobs get processed fine.
> After 5 minutes or so, we see these 2 issues happening randomly.
> 1) There are no jobs running in the cluster, completely available, but SJS takes request, but does not submit it to the cluster for almost 3 - 4 minutes and the job will be in "running job" list for that long.
> 2) SJS takes request, submits it to cluster, job gets processed from cluster, but even then, SJS does not move the job to completed list, it keeps it in "running job" list for 3 - 4 minutes before moving it to completed job list and during this time, our application keeps waiting for the response.
> More issue details are documented in the external issue URL given below
> Detailed steps outlined below
> #1 The screenshot (SJS_JOBS_RUNNING) is of running job list.
>    Please look at the 1st row and of the last row, the time submitted for the last job Id in the screenshot (4747ae86-7de3-4819-a29c-2b2c80c568a2) is "16:49:00"
> #2 If you look at 2nd screenshot (SJS_JOB_COMP_YARN) from Spark Yarn cluster, the job was completed at "16:49:25" itself
> #3 The 3rd screenshot (SJS_JOB_LOG_CONSOLE) is coming from the Spark Job Server log, it says the same job completed at "17:13:55"
> So, SJS was basically holding onto the job for more than 14 minutes and kept it in the running job list although Yarn responded back in time.
> Also, please take a look at the SJS log attached for the time period around when this job was submitted.
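For reference, the delay implied by the three timestamps quoted in the report can be worked out directly. The snippet below is only a sketch: it assumes all three times fall on the same day and come from clocks that agree with each other, neither of which the report confirms. The variable names are illustrative.

```python
from datetime import datetime

# Timestamps as quoted in the report (assumed same day, same clock).
FMT = "%H:%M:%S"
submitted = datetime.strptime("16:49:00", FMT)   # SJS_JOBS_RUNNING: time submitted
yarn_done = datetime.strptime("16:49:25", FMT)   # SJS_JOB_COMP_YARN: completed on Yarn
sjs_done = datetime.strptime("17:13:55", FMT)    # SJS_JOB_LOG_CONSOLE: completed per SJS log

# How long the job took on the cluster.
yarn_runtime = yarn_done - submitted
# How long SJS kept the job in the running list after Yarn had finished it.
sjs_lag = sjs_done - yarn_done

print("Runtime on Yarn:", yarn_runtime)
print("SJS lag after Yarn completion:", sjs_lag)
```

Under those assumptions the job ran for 25 seconds on Yarn, and SJS reported completion 24 minutes 30 seconds after Yarn did, which is consistent with the reporter's "more than 14 minutes".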