Should I configure a remote interpreter for my notebook so that it runs on the worker?


>> I have a standalone cluster with one master and one worker. I submit jobs
>> through Zeppelin. The master, the worker, and Zeppelin each run in a separate container.
>> # spark home
>> export SPARK_HOME=/usr/local/spark
>> # set hadoop conf dir
>> export HADOOP_CONF_DIR=/opt/hadoop-2.7.3/etc/hadoop
>> # set options to pass spark-submit command
>> export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
>> # driver memory and deploy mode
>> export ZEPPELIN_JAVA_OPTS="-Dspark.driver.memory=7g -Dspark.submit.deployMode=cluster"
>> # master
>> export MASTER="spark://<master>:7077"
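Note that the deploy mode is set twice in the configuration above: once as --deploy-mode in SPARK_SUBMIT_OPTIONS and once as -Dspark.submit.deployMode in ZEPPELIN_JAVA_OPTS. As a rough aid for checking what each string actually carries, here is a minimal Python sketch. The helper names parse_submit_options and parse_java_opts are hypothetical, and the parsing is deliberately simplified (it is not how spark-submit or Zeppelin parse these values).

```python
import shlex


def parse_submit_options(options):
    """Split a SPARK_SUBMIT_OPTIONS-style string into a {flag: value} dict."""
    tokens = shlex.split(options)
    parsed = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            # A flag followed by a non-flag token takes that token as its value.
            if i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                parsed[token] = tokens[i + 1]
                i += 2
                continue
            parsed[token] = None
        i += 1
    return parsed


def parse_java_opts(options):
    """Collect -Dkey=value system properties from a ZEPPELIN_JAVA_OPTS-style string."""
    props = {}
    for token in shlex.split(options):
        if token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
    return props


# The two strings from the configuration quoted above.
submit = parse_submit_options(
    "--packages com.databricks:spark-csv_2.11:1.5.0 --deploy-mode cluster"
)
java = parse_java_opts("-Dspark.driver.memory=7g -Dspark.submit.deployMode=cluster")
print(submit)
print(java)
```

Seeing both values side by side makes it easier to keep them consistent when switching between client and cluster mode.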
>> My notebook code is very simple. It reads a CSV file and writes it back to the
>> /data directory, which was created beforehand:
>> %spark.pyspark
>> def read_input(fin):
>>     '''
>>     Read input file from filesystem and return dataframe
>>     '''
>>     df = sqlContext.read.load(fin, format='com.databricks.spark.csv',
>>                               mode='PERMISSIVE', header='false', inferSchema='true')
>>     return df
>> def write_output(df, fout):
>>     '''
>>     Write dataframe to filesystem
>>     '''
>>     df.write.mode('overwrite').format('com.databricks.spark.csv') \
>>         .options(delimiter=',', header='true').save(fout)
>> data_in = '/data/01.csv'
>> data_out = '/data/02.csv'
>> df = read_input(data_in)
>> newdf = del_columns(df)
>> write_output(newdf, data_out)
>> I set --deploy-mode to *cluster* so that the driver runs on the worker and
>> reads the CSV from the worker's /data directory rather than from the Zeppelin container.
>> When I run the notebook, it complains that
>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar is missing:
>> org.apache.zeppelin.interpreter.InterpreterException:
>> Ivy Default Cache set to: /root/.ivy2/cache
>> The jars for the packages stored in: /root/.ivy2/jars
>> :: loading settings :: url = jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>> com.databricks#spark-csv_2.11 added as a dependency
>> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>>     confs: [default]
>>     found com.databricks#spark-csv_2.11;1.5.0 in central
>>     found org.apache.commons#commons-csv;1.1 in central
>>     found com.univocity#univocity-parsers;1.5.1 in central
>> :: resolution report :: resolve 310ms :: artifacts dl 6ms
>>     :: modules in use:
>>     com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>     com.univocity#univocity-parsers;1.5.1 from central in [default]
>>     org.apache.commons#commons-csv;1.1 from central in [default]
>>     ---------------------------------------------------------------------
>>     |                  |            modules            ||   artifacts   |
>>     |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>>     ---------------------------------------------------------------------
>>     |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
>>     ---------------------------------------------------------------------
>> :: retrieving :: org.apache.spark#spark-submit-parent
>>     confs: [default]
>>     0 artifacts copied, 3 already retrieved (0kB/8ms)
>> Running Spark using the REST application submission protocol.
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> Warning: Master endpoint spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back to legacy submission gateway instead.
>> Ivy Default Cache set to: /root/.ivy2/cache
>> The jars for the packages stored in: /root/.ivy2/jars
>> com.databricks#spark-csv_2.11 added as a dependency
>> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>>     confs: [default]
>>     found com.databricks#spark-csv_2.11;1.5.0 in central
>>     found org.apache.commons#commons-csv;1.1 in central
>>     found com.univocity#univocity-parsers;1.5.1 in central
>> :: resolution report :: resolve 69ms :: artifacts dl 5ms
>>     :: modules in use:
>>     com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>     com.univocity#univocity-parsers;1.5.1 from central in [default]
>>     org.apache.commons#commons-csv;1.1 from central in [default]
>>     ---------------------------------------------------------------------
>>     |                  |            modules            ||   artifacts   |
>>     |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>>     ---------------------------------------------------------------------
>>     |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
>>     ---------------------------------------------------------------------
>> :: retrieving :: org.apache.spark#spark-submit-parent
>>     confs: [default]
>>     0 artifacts copied, 3 already retrieved (0kB/4ms)
>> java.nio.file.NoSuchFileException: /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar
>>     at sun.nio.fs.UnixException.translateToIOException(
>>     at sun.nio.fs.UnixException.rethrowAsIOException(
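In cluster deploy mode on a standalone master the driver is started on a worker, so any file: jar path handed to spark-submit has to exist at the same location on that worker. The NoSuchFileException above is consistent with that. A minimal Python sketch for checking this, meant to be run inside the worker container: missing_local_jars is a hypothetical helper, and it only inspects plain paths and file: URLs, nothing else.

```python
import os
from urllib.parse import urlparse


def missing_local_jars(spark_jars):
    """Return local jar paths from a comma-separated spark.jars value that are absent on this host."""
    missing = []
    seen = set()
    for entry in spark_jars.split(","):
        entry = entry.strip()
        if not entry or entry in seen:
            continue
        seen.add(entry)
        parsed = urlparse(entry)
        # Only plain paths and file: URLs refer to the local filesystem;
        # hdfs://, http:// and similar are skipped.
        if parsed.scheme not in ("", "file"):
            continue
        if not os.path.exists(parsed.path):
            missing.append(parsed.path)
    return missing


# Example with two of the paths that appear in this thread.
jars = (
    "file:/root/.ivy2/jars/com.databricks_spark-csv_2.11-1.5.0.jar,"
    "file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar"
)
print(missing_local_jars(jars))
```

Any path printed here is one the driver will not find when it starts on that host.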
>> So next I copied
>> /opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar into the
>> worker container, restarted the interpreter, and ran the notebook again. It no
>> longer complains about zeppelin-spark_2.11-0.7.1.jar, but I get another exception
>> in the notebook, related to RemoteInterpreterManagedProcess:
>> org.apache.zeppelin.interpreter.InterpreterException:
>> Ivy Default Cache set to: /root/.ivy2/cache
>> The jars for the packages stored in: /root/.ivy2/jars
>> :: loading settings :: url = jar:file:/opt/spark-2.1.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
>> com.databricks#spark-csv_2.11 added as a dependency
>> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>>     confs: [default]
>>     found com.databricks#spark-csv_2.11;1.5.0 in central
>>     found org.apache.commons#commons-csv;1.1 in central
>>     found com.univocity#univocity-parsers;1.5.1 in central
>> :: resolution report :: resolve 277ms :: artifacts dl 7ms
>>     :: modules in use:
>>     com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>     com.univocity#univocity-parsers;1.5.1 from central in [default]
>>     org.apache.commons#commons-csv;1.1 from central in [default]
>>     ---------------------------------------------------------------------
>>     |                  |            modules            ||   artifacts   |
>>     |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>>     ---------------------------------------------------------------------
>>     |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
>>     ---------------------------------------------------------------------
>> :: retrieving :: org.apache.spark#spark-submit-parent
>>     confs: [default]
>>     0 artifacts copied, 3 already retrieved (0kB/8ms)
>> Running Spark using the REST application submission protocol.
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/lib/interpreter/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> Warning: Master endpoint spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077 was not a REST server. Falling back to legacy submission gateway instead.
>> Ivy Default Cache set to: /root/.ivy2/cache
>> The jars for the packages stored in: /root/.ivy2/jars
>> com.databricks#spark-csv_2.11 added as a dependency
>> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>>     confs: [default]
>>     found com.databricks#spark-csv_2.11;1.5.0 in central
>>     found org.apache.commons#commons-csv;1.1 in central
>>     found com.univocity#univocity-parsers;1.5.1 in central
>> :: resolution report :: resolve 66ms :: artifacts dl 5ms
>>     :: modules in use:
>>     com.databricks#spark-csv_2.11;1.5.0 from central in [default]
>>     com.univocity#univocity-parsers;1.5.1 from central in [default]
>>     org.apache.commons#commons-csv;1.1 from central in [default]
>>     ---------------------------------------------------------------------
>>     |                  |            modules            ||   artifacts   |
>>     |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
>>     ---------------------------------------------------------------------
>>     |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
>>     ---------------------------------------------------------------------
>> :: retrieving :: org.apache.spark#spark-submit-parent
>>     confs: [default]
>>     0 artifacts copied, 3 already retrieved (0kB/4ms)
>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(
>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.reference(
>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(
>>     at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(
>>     at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(
>>     at org.apache.zeppelin.notebook.Paragraph.jobRun(
>>     at
>>     at org.apache.zeppelin.scheduler.RemoteScheduler$
>>     at java.util.concurrent.Executors$
>>     at
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(
>>     at java.util.concurrent.ThreadPoolExecutor$
>>     at
>> In the Spark jobs I see an
>> org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer running, and
>> its stderr log complains about a missing log4j configuration under
>> /opt/zeppelin-0.7.1/conf/:
>> Launch Command: "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" "-cp" 
>> "/opt/zeppelin-0.7.1/interpreter/spark/*:/opt/zeppelin-0.7.1/lib/interpreter/*:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar:/usr/local/spark/conf/:/usr/local/spark/jars/*:/opt/hadoop-2.7.3/etc/hadoop:/opt/hadoop-2.7.3/etc/hadoop/*:/opt/hadoop-2.7.3/share/hadoop/common/lib/*:/opt/hadoop-2.7.3/share/hadoop/common/*:/opt/hadoop-2.7.3/share/hadoop/hdfs/*:/opt/hadoop-2.7.3/share/hadoop/hdfs/lib/*:/opt/hadoop-2.7.3/share/hadoop/hdfs/*:/opt/hadoop-2.7.3/share/hadoop/yarn/lib/*:/opt/hadoop-2.7.3/share/hadoop/yarn/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.7.3/share/hadoop/mapreduce/*:/opt/hadoop-2.7.3/share/hadoop/tools/lib/*"
>>  "-Xmx1024M" 
>> "-Dspark.jars=file:/root/.ivy2/jars/com.databricks_spark-csv_2.11-1.5.0.jar,file:/root/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/root/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar,file:/root/.ivy2/jars/com.databricks_spark-csv_2.11-1.5.0.jar,file:/root/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar,file:/root/.ivy2/jars/com.univocity_univocity-parsers-1.5.1.jar,file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar"
>>  "-Dspark.driver.supervise=false"
>>  "-Dspark.driver.extraJavaOptions= -Dfile.encoding=UTF-8 -Dlog4j.configuration=file:///opt/zeppelin-0.7.1/conf/ -Dzeppelin.log.file=/opt/zeppelin-0.7.1/logs/zeppelin-interpreter-spark--zeppelin-sofiane-1-zyfya.log"
>> ""
>>  "-Dspark.submit.deployMode=cluster" 
>> "-Dspark.master=spark://spark-drone-master-sofiane.autoetl.svc.cluster.local:7077"
>> "-Dspark.driver.extraClassPath=::/opt/zeppelin-0.7.1/interpreter/spark/*:/opt/zeppelin-0.7.1/lib/interpreter/*::/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar"
>>  "-Dspark.rpc.askTimeout=10s" "-Dfile.encoding=UTF-8" 
>> "-Dlog4j.configuration=file:///opt/zeppelin-0.7.1/conf/" 
>> "-Dzeppelin.log.file=/opt/zeppelin-0.7.1/logs/zeppelin-interpreter-spark--zeppelin-sofiane-1-zyfya.log"
>>  "org.apache.spark.deploy.worker.DriverWrapper" 
>> "spark://Worker@" 
>> "/usr/local/spark/work/driver-20170503115405-0036/zeppelin-spark_2.11-0.7.1.jar"
>>  "org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer" "46151"
>> ========================================
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/opt/zeppelin-0.7.1/interpreter/spark/zeppelin-spark_2.11-0.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/usr/local/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> log4j:ERROR Could not read configuration file from URL [file:/opt/zeppelin-0.7.1/conf/].
>> /opt/zeppelin-0.7.1/conf/ (No such file or directory)
>>      at Method)
>>      at
>>      at<init>(
>>      at<init>(
>>      at
>>      at
>>      at org.apache.log4j.PropertyConfigurator.doConfigure(
>>      at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(
>>      at org.apache.log4j.LogManager.<clinit>(
>>      at org.slf4j.impl.Log4jLoggerFactory.getLogger(
>>      at org.slf4j.LoggerFactory.getLogger(
>>      at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(
>>      at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(
>>      at org.apache.commons.logging.LogFactory.getLog(
>>      at
>>      at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
>>      at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2373)
>>      at scala.Option.getOrElse(Option.scala:121)
>>      at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2373)
>>      at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:221)
>>      at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:42)
>>      at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> log4j:ERROR Ignoring configuration file [file:/opt/zeppelin-0.7.1/conf/].
>> log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See for more info.
>> Using Spark's default log4j profile: org/apache/spark/
>> 17/05/03 11:54:06 INFO SecurityManager: Changing view acls to: root
>> 17/05/03 11:54:06 INFO SecurityManager: Changing modify acls to: root
>> 17/05/03 11:54:06 INFO SecurityManager: Changing view acls groups to:
>> 17/05/03 11:54:06 INFO SecurityManager: Changing modify acls groups to:
>> 17/05/03 11:54:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
>> 17/05/03 11:54:07 INFO Utils: Successfully started service 'Driver' on port 39770.
>> 17/05/03 11:54:07 INFO WorkerWatcher: Connecting to worker spark://Worker@
>> 17/05/03 11:54:07 INFO TransportClientFactory: Successfully created connection to / after 27 ms (0 ms spent in bootstraps)
>> 17/05/03 11:54:07 INFO WorkerWatcher: Successfully connected to spark://Worker@
>> 17/05/03 11:54:07 INFO RemoteInterpreterServer: Starting remote interpreter server on port 46151
>> The process never finishes, so I have to kill it...
>> What's going on? Is anything wrong with my configuration?
>> Any help is appreciated. I have been struggling with this for a week.
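
The last log line shows RemoteInterpreterServer starting on port 46151 inside the driver, which in cluster mode runs in the worker container. One possible explanation for the hang, offered as an assumption and not a confirmed diagnosis, is that the Zeppelin server waits for that port on a host where nothing is listening. A minimal Python sketch to test reachability from the Zeppelin container: port_reachable is a hypothetical helper, and the host name to probe has to be supplied by whoever runs it.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # 46151 is the port from the log above; replace "localhost" with the
    # worker's host name to probe the worker instead.
    print(port_reachable("localhost", 46151))
```

If the port answers on the worker but not on the Zeppelin host, that would support the assumption above.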

