Lukas Heppe created ZEPPELIN-5666:
-------------------------------------
Summary: Spark Additional jars: Does spark.jars override
spark.jars.packages?
Key: ZEPPELIN-5666
URL: https://issues.apache.org/jira/browse/ZEPPELIN-5666
Project: Zeppelin
Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Lukas Heppe
Hey,
I have the following setup:
Spark 3.1.2 Standalone Cluster (1 Master, 2 Workers)
Zeppelin 0.10.1
SparkInterpreterSetting:
{code:java}
SPARK_HOME /opt/spark (points to spark-3.1.2)
spark.master spark://my-spark-master-host:7077
spark.submit.deployMode client
spark.jars.packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0,eu.europa.ec.joinup.sd-dss:dss-xades:5.9
spark.jars.repositories https://ec.europa.eu/cefdigital/artifact/content/repositories/esignaturedss/
{code}
If I run a cell like
{code:java}
%spark
sc.version {code}
I get the correct output (3.1.2). Cells that compute Pi from the Spark
examples also work.
Likewise, if I use the Spark Cassandra connector, I get the first five rows as
expected.
{code:java}
%spark
import com.datastax.spark.connector._
val rdd = sc.cassandraTable("mykeyspace", "mytable")
println(rdd.take(5).toList) {code}
However, if I try to add a local jar via the spark.jars property as follows
{code:java}
spark.jars file:///absolute/path/to/my/custom/jar
{code}
the jars provided via spark.jars.packages are no longer part of the SparkContext. The
jar is located at the same path on the worker nodes and on the Zeppelin host. If I run
{code:java}
%spark
sc.listJars().foreach(println) {code}
without spark.jars set, I get a long list as expected (jars from the DataStax and
EU repositories). However, if I restart the interpreter and set the spark.jars
option, the cell above prints only my custom jar. The logs show the
following:
{code:java}
INFO [2022-03-04 15:51:17,742] ({FIFOScheduler-interpreter_1815846009-Worker-1}
SparkScala212Interpreter.scala[open]:68) - UserJars:
file:/opt/zeppelin/interpreter/spark/spark-interpreter-0.10.1.jar:file:/opt/path/to/my/jar,
LONG_LIST_OF_JARS_FROM_MAVEN.
...
Added JAR file:///path/to/my/custom/jar at
spark://x.x.x.:xxx/jars/my-custom.jar with timestamp xxx {code}
So it seems the interpreter is aware of all of my jars, but only adds the
ones from the spark.jars property, whereas I would expect all of the jars to be
added. If I omit the spark.jars option, I get an "Added JAR file:///..." entry
for each jar resolved from the spark.jars.packages entry.
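To make the comparison above repeatable, here is a minimal sketch of a helper that reports which Maven coordinates have no matching jar registered on the SparkContext. The name `missingPackages` is hypothetical, and matching a jar by checking whether its path contains the artifact id is only a heuristic; it assumes the resolved jar file names contain the artifact id.

```scala
// Heuristic check: which Maven coordinates have no matching jar registered?
// `coordinates` is the raw spark.jars.packages value (comma-separated),
// `registered` is a list of jar paths/URLs, e.g. the result of sc.listJars().
// NOTE: matching on the artifact id is an assumption, not an exact lookup.
def missingPackages(coordinates: String, registered: Seq[String]): List[String] =
  coordinates
    .split(",")
    .map(_.trim)
    .filter(_.nonEmpty)
    .filter { coord =>
      // group:artifact:version -> the artifact id is the second field
      val parts = coord.split(":")
      val artifact = if (parts.length >= 2) parts(1) else coord
      !registered.exists(_.contains(artifact))
    }
    .toList
```

In a %spark cell this could be called as `missingPackages(sc.getConf.get("spark.jars.packages", ""), sc.listJars())`, assuming the interpreter property is visible in the SparkConf. An empty result would mean every package has at least one matching jar.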
In a previous Zeppelin version (0.8.1), I was able to configure all of this via
the SPARK_SUBMIT_OPTIONS environment variable, like this:
{code:java}
SPARK_SUBMIT_OPTIONS=" ... --jars /abs/path/to/custom --packages cassandraconn,etc.. --repositories additional-repo"
{code}
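For reference, the conversion attempted here relies on the usual correspondence between spark-submit flags and Spark properties (--jars to spark.jars, --packages to spark.jars.packages, --repositories to spark.jars.repositories). A minimal sketch of that mapping follows; the names `flagToProperty` and `submitOptionsToProperties` are hypothetical, and the parser only handles simple whitespace-separated `flag value` pairs for these three flags.

```scala
// Sketch: translate spark-submit style flags into the equivalent Spark properties.
// Only the three flags from this report are handled; everything else is ignored.
val flagToProperty: Map[String, String] = Map(
  "--jars"         -> "spark.jars",
  "--packages"     -> "spark.jars.packages",
  "--repositories" -> "spark.jars.repositories"
)

def submitOptionsToProperties(options: String): Map[String, String] = {
  val tokens = options.trim.split("\\s+").toList
  // Walk the tokens: a known flag followed by its value becomes one property.
  def loop(rest: List[String], acc: Map[String, String]): Map[String, String] =
    rest match {
      case flag :: value :: tail if flagToProperty.contains(flag) =>
        loop(tail, acc + (flagToProperty(flag) -> value))
      case _ :: tail => loop(tail, acc) // unknown token, skip it
      case Nil       => acc
    }
  loop(tokens, Map.empty)
}
```

The resulting map shows which interpreter properties should be set for a given SPARK_SUBMIT_OPTIONS string. It does not explain why the jars from spark.jars.packages disappear once spark.jars is set.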
Is this a bug, or am I translating these options incorrectly?
Thank you!