Hi,
The code we're executing constructs pyspark.ml.Pipeline objects
concurrently in separate python threads.
We observe that the stages fed to the Pipeline objects get cross-wired,
i.e., the stages supplied to a Pipeline object in one thread show up
inside a Pipeline object constructed by a different thread with
different stages.
Things work fine with a single thread, so this looks like a
thread-safety problem in the pyspark.ml.Pipeline class.
A code snippet is below; I'm looking for confirmation that this looks like
a bug. If so, I will file one with a complete Python script that reproduces
the issue.
import threading
import pyspark.ml

class MLThread(threading.Thread):
    def __init__(self, name, pipeline_stages):
        super(MLThread, self).__init__()
        self.name = name
        self.pipeline_stages = pipeline_stages

    def run(self):
        self.pipeline = pyspark.ml.Pipeline(stages=self.pipeline_stages)
        # ... do some work on the pipeline
thread_1 = MLThread("t1", pipeline_stages_1)
thread_2 = MLThread("t2", pipeline_stages_2)
...
thread_n = MLThread("tn", pipeline_stages_n)
# ... launch all threads
Once all threads are done, verify that self.pipeline.getStages() in each
object matches the stages supplied during pipeline object construction.
There are occasions when this does not hold.
This occurs with both Spark 1.6 and 2.x.
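For what it's worth, here is a minimal, self-contained sketch (no pyspark
required) of the kind of hazard that could produce this symptom: if a
constructor's decorator stashes the caller's keyword arguments on the shared
function object rather than on the instance, one thread can overwrite another
thread's arguments between the store and the read. The names FakePipeline and
keyword_only_shared below are illustrative stand-ins, not pyspark's actual
code.

```python
# Illustrative decorator: stashes kwargs on the *function object*, which is
# shared by every caller, instead of on the instance being constructed.
def keyword_only_shared(func):
    def wrapper(self, **kwargs):
        wrapper._input_kwargs = kwargs  # shared mutable state across threads
        return func(self)
    return wrapper

class FakePipeline:
    @keyword_only_shared
    def __init__(self):
        # Reads back whatever kwargs are currently in the shared slot. If
        # another thread ran the wrapper in between, these are *its* kwargs.
        self.stages = FakePipeline.__init__._input_kwargs.get("stages")

# Deterministic interleaving that mimics the race between two threads:
p1 = object.__new__(FakePipeline)  # "thread 1" allocates its object ...
FakePipeline.__init__._input_kwargs = {"stages": ["t1_stage"]}  # ... and
# stores its kwargs, but is preempted before its __init__ body reads them.
p2 = FakePipeline(stages=["t2_stage"])  # "thread 2" constructs fully.
# "Thread 1" resumes its __init__ body and reads the shared slot:
p1.stages = FakePipeline.__init__._input_kwargs.get("stages")

print(p1.stages)  # ['t2_stage'] -- thread 1 got thread 2's stages
```

The interleaving is simulated step by step (rather than with real threads) so
the corruption is reproducible on every run.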
Regards,
Vinayak Joshi