Hi,
The code we're executing constructs pyspark.ml.Pipeline objects
concurrently in separate python threads.
We observe that the stages fed to the Pipeline objects get cross-wired,
i.e., the stages supplied to a Pipeline object in one thread show up
inside a Pipeline object constructed by a different thread with
different stages.
Things work fine with a single thread, so this looks like a
thread-safety problem in the pyspark.ml.Pipeline class.
A code snippet is below; I'm looking for confirmation that this looks like
a bug. If so, I will file one with a complete Python script that reproduces
the issue.
import threading
import pyspark.ml

class MLThread(threading.Thread):
    def __init__(self, name, pipeline_stages):
        super(MLThread, self).__init__()
        self.name = name
        self.pipeline_stages = pipeline_stages

    def run(self):
        self.pipeline = pyspark.ml.Pipeline(stages=self.pipeline_stages)
        # ... do some work on the pipeline
thread_1 = MLThread("t1", pipeline_stages_1)
thread_2 = MLThread("t2", pipeline_stages_2)
...
thread_n = MLThread("tn", pipeline_stages_n)
# ... launch all threads
Once all threads are done, verify that self.pipeline.getStages() in each
object matches the stages supplied during pipeline object construction.
There are occasions when this does not hold.
This occurs with both Spark 1.6 and 2.x.
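For what it's worth, here is a minimal, self-contained sketch (no pyspark
required) of the kind of hazard that could produce this symptom: if a
constructor's decorator stashes the caller's keyword arguments on the shared
function object rather than on the instance, one thread can overwrite another
thread's arguments between the store and the read. The names FakePipeline and
keyword_only_shared below are illustrative stand-ins, not pyspark's actual
code.

```python
# Illustrative decorator: stashes kwargs on the *function object*, which is
# shared by every caller, instead of on the instance being constructed.
def keyword_only_shared(func):
    def wrapper(self, **kwargs):
        wrapper._input_kwargs = kwargs  # shared mutable state across threads
        return func(self)
    return wrapper

class FakePipeline:
    @keyword_only_shared
    def __init__(self):
        # Reads back whatever kwargs are currently in the shared slot. If
        # another thread ran the wrapper in between, these are *its* kwargs.
        self.stages = FakePipeline.__init__._input_kwargs.get("stages")

# Deterministic interleaving that mimics the race between two threads:
p1 = object.__new__(FakePipeline)  # "thread 1" allocates its object ...
FakePipeline.__init__._input_kwargs = {"stages": ["t1_stage"]}  # ... and
# stores its kwargs, but is preempted before its __init__ body reads them.
p2 = FakePipeline(stages=["t2_stage"])  # "thread 2" constructs fully.
# "Thread 1" resumes its __init__ body and reads the shared slot:
p1.stages = FakePipeline.__init__._input_kwargs.get("stages")

print(p1.stages)  # ['t2_stage'] -- thread 1 got thread 2's stages
```

The interleaving is simulated step by step (rather than with real threads) so
the corruption is reproducible on every run.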
Regards,
Vinayak Joshi