We developed a Scala library called FV that runs on Spark, and we built Python wrappers for its public API using py4j, the same way Spark itself does. For example, the main object is instantiated like this:
self._java_obj = self._new_java_obj("com.example.FV", self.uid)
and methods on the object are called like this:
def add(self, r):
    self._java_obj.add(r)
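
For completeness, the wrapper class is roughly shaped like the sketch below. This assumes it builds on PySpark's ML JavaWrapper helper (which provides _new_java_obj); the uid handling is simplified here and the names other than FV/add are illustrative.

from pyspark.ml.wrapper import JavaWrapper

class FV(JavaWrapper):
    """Python facade over the Scala object com.example.FV (sketch)."""

    def __init__(self):
        super(FV, self).__init__()
        # In the real code uid comes from Identifiable; hard-coded for the sketch
        self.uid = "FV_python_wrapper"
        # Instantiate com.example.FV on the driver JVM through py4j
        self._java_obj = self._new_java_obj("com.example.FV", self.uid)

    def add(self, r):
        # Delegate to the Scala method FV.add via the py4j gateway
        self._java_obj.add(r)
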
We are experiencing an annoying issue when running pyspark with this external library. We run the pyspark shell like this:
pyspark --repositories <our-own-maven-release-repo> --packages <com.example.FV:latest.release>
When we release a new version that changes the Scala API, things start to randomly break for some users. For example, version 0.44 had a class *DateUtils* (used by class *Utils*, which is in turn used by class *FV* in the method *add*) that was dropped in version 0.45. When version 0.45 was released and a user called the method *add* from the Python API, we got
java.lang.NoClassDefFoundError: Could not initialize class DateUtils
Basically, the Python API runs the method *add*, which still references the class *DateUtils* (v0.44), but when the JVM actually goes to load that class it cannot find it, because the jar that was loaded is v0.45 (as the Ivy log shows when the shell starts up).
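
For what it's worth, a check like the following from the pyspark shell shows which jar the driver JVM actually loaded the Scala class from (just a sketch; fv stands for an instance of our Python wrapper):

# Ask the JVM where the class behind the wrapped object was loaded from.
# getCodeSource() reports the jar the classloader actually used.
location = (fv._java_obj.getClass()
               .getProtectionDomain()
               .getCodeSource()
               .getLocation())
print(location.toString())

If that location points at the v0.45 jar, the JVM side at least agrees with what the Ivy log reports at startup.
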
Do you have any idea what the problem might be? Does py4j perhaps cache something, so that when we upgrade the classes we get this error?
*Alessandro Liparoti*