[ https://issues.apache.org/jira/browse/SPARK-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ash Pran updated SPARK-16752:
-----------------------------
    Attachment: SJS_JOBS_RUNNING
                SJS_JOB_LOG_CONSOLE
                SJS_JOB_COMP_YARN
                SJS_Limited_Log.txt

Please see the attached files for further reference.

> Spark Job Server not releasing jobs from "running list" even after yarn completes the job
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-16752
>                 URL: https://issues.apache.org/jira/browse/SPARK-16752
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.0, 1.5.0
>         Environment: SJS version 0.6.1 and Spark 1.5.0 running on Yarn-client mode
>            Reporter: Ash Pran
>              Labels: patch
>         Attachments: SJS_JOBS_RUNNING, SJS_JOB_COMP_YARN, SJS_JOB_LOG_CONSOLE, SJS_Limited_Log.txt
>
>
> We are having a strange issue with Spark Job Server (SJS).
> We are using SJS 0.6.1 and Spark 1.5.0 in "yarn-client" mode. The settings.sh for SJS is as follows:
> ********************************************************************
> INSTALL_DIR=$(cd `dirname $0`; pwd -P)
> LOG_DIR=$INSTALL_DIR/logs
> PIDFILE=spark-jobserver.pid
> JOBSERVER_MEMORY=16G
> SPARK_VERSION=1.5.0
> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark
> SPARK_CONF_DIR=$SPARK_HOME/conf
> SCALA_VERSION=2.10.4
> ********************************************************************
> We are using fair scheduling with 2 pools and 50 executors of 1 GB each.
> We also have max-jobs-per-context set to the number of cores, which is 48.
> For the first 5 minutes or so everything is fine and the jobs get processed normally.
> After 5 minutes or so, we see these 2 issues happening randomly.
> 1) No jobs are running in the cluster and it is completely available. SJS accepts a request but does not submit it to the cluster for almost 3 - 4 minutes, and the job stays in the "running job" list for that long.
> 2) SJS accepts a request and submits it to the cluster, and the cluster processes the job. Even then, SJS does not move the job to the completed list. It keeps it in the "running job" list for 3 - 4 minutes before moving it to the completed job list, and during this time our application keeps waiting for the response.
> More issue details are documented in the external issue URL given below
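
One way to put numbers on the 3 - 4 minute delay described in the report is to poll SJS for a single job and log how long it keeps reporting the job as running, then compare that with the completion time YARN shows. The sketch below is only an illustration and is not part of the report: the `SJS_URL` default, the `/jobs/<jobId>` endpoint, the top-level "status" field, and the "RUNNING" value are all assumptions that should be checked against the SJS version in use, and `extract_status` and `wait_for_job` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: time how long SJS keeps reporting a job as RUNNING.
# Assumptions (not from the report): the SJS_URL default, the
# /jobs/<jobId> endpoint, and a top-level "status" field in its JSON.

SJS_URL="${SJS_URL:-http://localhost:8090}"

# Extract the value of a "status" field from a JSON response.
# Line-oriented sed match, not a full JSON parser.
extract_status() {
  printf '%s' "$1" \
    | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Poll the job every <interval> seconds (default 5) and log the status
# with elapsed time; stop as soon as the status is no longer RUNNING.
wait_for_job() {
  local job_id="$1" interval="${2:-5}" start now status
  start=$(date +%s)
  while :; do
    status=$(extract_status "$(curl -s "$SJS_URL/jobs/$job_id")")
    now=$(date +%s)
    echo "$(date '+%H:%M:%S') job=$job_id status=${status:-unknown} elapsed=$((now - start))s"
    [ "$status" = "RUNNING" ] || break
    sleep "$interval"
  done
}
```

After sourcing the file, `wait_for_job <jobId> 5` prints one timestamped line per poll. The last RUNNING line, compared with the finish time YARN reports for the same application, gives a rough measure of the lag in issue 2.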