We developed a Scala library called FV that runs on Spark, and we built Python wrappers for its public API using py4j, the same way Spark itself does. For example, the main object is instantiated like this:
self._java_obj = self._new_java_obj("com.example.FV", self.uid)
and methods on the object are called like this:
def add(self, r):
    self._java_obj.add(r)
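
For completeness, the wrapper class is roughly shaped like the sketch below. This assumes it builds on PySpark's ML JavaWrapper helper (which provides _new_java_obj); the uid handling is simplified here and the names other than FV/add are illustrative.

from pyspark.ml.wrapper import JavaWrapper

class FV(JavaWrapper):
    """Python facade over the Scala object com.example.FV (sketch)."""

    def __init__(self):
        super(FV, self).__init__()
        # In the real code uid comes from Identifiable; hard-coded for the sketch
        self.uid = "FV_python_wrapper"
        # Instantiate com.example.FV on the driver JVM through py4j
        self._java_obj = self._new_java_obj("com.example.FV", self.uid)

    def add(self, r):
        # Delegate to the Scala method FV.add via the py4j gateway
        self._java_obj.add(r)
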
We are experiencing an annoying issue when running pyspark with this external library. We run the pyspark shell like this:
pyspark --repositories <our-own-maven-release-repo> --packages <com.example.FV:latest.release>
When we release a new version that changes the Scala API, things start to randomly break for some users. For example, version 0.44 had a class *DateUtils* (used by class *Utils*, which is in turn used by class *FV* in the method *add*) that was dropped in version 0.45. When version 0.45 was released and a user called the method *add* from the Python API, we got
java.lang.NoClassDefFoundError: Could not initialize class DateUtils
Basically, the Python API runs the method *add*, which still references the class *DateUtils* (v0.44), but when the JVM actually goes to load that class it cannot find it, because the jar that was loaded is v0.45 (as the Ivy log shows when the shell starts up).
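
For what it's worth, a check like the following from the pyspark shell shows which jar the driver JVM actually loaded the Scala class from (just a sketch; fv stands for an instance of our Python wrapper):

# Ask the JVM where the class behind the wrapped object was loaded from.
# getCodeSource() reports the jar the classloader actually used.
location = (fv._java_obj.getClass()
               .getProtectionDomain()
               .getCodeSource()
               .getLocation())
print(location.toString())

If that location points at the v0.45 jar, the JVM side at least agrees with what the Ivy log reports at startup.
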
Do you have any idea what the problem might be? Does py4j perhaps cache something, so that when we upgrade the classes we get this error?
*Alessandro Liparoti*