Hi all,
This week I tried upgrading to Spark 3.5.0, as it contains some fixes
for spark-protobuf that I need for my project. However, my application
no longer runs under Spark 3.5.0.
My build.sbt file is configured as follows:
val sparkV = "3.5.0"
val hadoopV = "3.3.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkV % "provided",
  "org.apache.spark" %% "spark-sql" % sparkV % "provided",
  "org.apache.hadoop" % "hadoop-client" % hadoopV % "provided",
  "org.apache.spark" %% "spark-protobuf" % sparkV
)
I am using sbt-assembly to build a fat JAR, excluding the Spark and
Hadoop JARs to limit the assembled JAR size (a sketch of the relevant
settings is below). Spark and its dependencies are supplied in our
environment by the jars/ directory included in the Spark distribution.
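For reference, the assembly configuration looks roughly like this; the
JAR name and merge strategy here are illustrative, the point is only
that the "provided" dependencies above are not bundled:

// build.sbt (sketch) -- sbt-assembly 1.x
// "provided" dependencies (Spark, Hadoop) are left out of the fat JAR by
// sbt-assembly automatically; only spark-protobuf and its transitive
// dependencies end up in the assembly.
assembly / assemblyJarName := "my-app-assembly.jar"   // placeholder name
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}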
However, when running my application (which uses protobuf-java's
CodedOutputStream for writing delimited protobuf files) with Spark
3.5.0, I now get the following error:
...
Caused by: java.lang.ClassNotFoundException:
com.google.protobuf.CodedOutputStream
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 22 more
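For context, the write path is nothing exotic; it is essentially the
standard length-delimited write via CodedOutputStream. A minimal sketch
(writeDelimited and its arguments are placeholders, not my actual code):

import java.io.FileOutputStream
import com.google.protobuf.{CodedOutputStream, Message}

// Sketch: write one length-delimited protobuf message to a file.
def writeDelimited(msg: Message, path: String): Unit = {
  val fos = new FileOutputStream(path)
  try {
    val out = CodedOutputStream.newInstance(fos)
    out.writeUInt32NoTag(msg.getSerializedSize) // varint length prefix
    msg.writeTo(out)                            // message bytes
    out.flush()                                 // CodedOutputStream buffers internally
  } finally {
    fos.close()
  }
}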
When inspecting the jars/ directory of the newest Spark release
(spark-3.5.0-bin-hadoop3), I noticed that the protobuf-java JAR is no
longer included, whereas it was present in Spark 3.4.1. My code still
compiles because protobuf-java remains a transitive dependency of
spark-core:3.5.0, but since the JAR is no longer shipped in jars/, the
class cannot be found at runtime.
Is this expected/intentional behaviour? I was able to resolve the issue
by manually adding protobuf-java as a dependency of my own project and
including it in the fat JAR (snippet below), but it seems odd that it is
no longer shipped with the latest Spark release. I also could not find
any mention of this change in the release notes or elsewhere, but
perhaps I missed something.
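Concretely, the workaround amounts to adding something like this to
build.sbt (the version below is just the one I matched to what
spark-core 3.5.0 resolves transitively; I have not verified it against
what earlier Spark distributions shipped):

// Workaround (sketch): bundle protobuf-java in the fat JAR ourselves.
// 3.23.4 is a placeholder -- use the version spark-core 3.5.0 pulls in.
libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.23.4"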
Thanks in advance for any help!
Cheers,
Gijs