[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859737#comment-15859737 ]
Peter D Kirchner commented on SPARK-19348:
------------------------------------------

Per the above, perhaps a fix could address both thread safety and the orphaned modifications to variables in the wrapped function bodies. For instance, in Pipeline.__init__() and setParams(), the fix could remove references to _input_kwargs and instead invoke setParams(stages=stages) within __init__() and _set(stages=stages) within setParams(), respectively. Parallel changes would be needed in all wrapped functions, but the resulting code would be functional, readable, and thread-safe.

A fix for Pipeline alone is insufficient, because multiple pipelines could have stages consisting of instances of the same class, e.g. LogisticRegression. All the classes using @keyword_only need to be addressed by whatever fix is decided upon.

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> -----------------------------------------------------------
>
>                 Key: SPARK-19348
>                 URL: https://issues.apache.org/jira/browse/SPARK-19348
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.6.0, 2.0.0, 2.1.0
>            Reporter: Vinayak Joshi
>         Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate
> Python threads, it is observed that the stages used to construct a pipeline
> object get corrupted, i.e. the stages supplied to a Pipeline object in one
> thread appear inside a different Pipeline object constructed in a different
> thread.
> Things work fine if construction of pyspark.ml.Pipeline objects is
> serialized, so this looks like a thread safety problem with
> pyspark.ml.Pipeline object construction.
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know
> whether other pipeline operations, such as pyspark.ml.pipeline.fit( ),
> are also affected by the underlying cause of this problem.
> That is, whether other pipeline operations like pyspark.ml.pipeline.fit( )
> may be performed in separate threads (on distinct pipeline objects)
> concurrently without any cross contamination between them.
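
As an illustration of the change suggested in the comment above, here is a minimal sketch. It is not pyspark code: ToyPipeline, its _params dict, getStages() and build_concurrently() are hypothetical names made up for this example, and keyword_only here is a simplified stand-in that only rejects positional arguments. The idea shown is that __init__() passes its arguments explicitly to setParams(), which passes them explicitly to _set(), so nothing is stored on a function object shared between threads.

```python
import threading
from functools import wraps


def keyword_only(func):
    """Simplified stand-in (assumption): enforce keyword arguments, store nothing."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper


class ToyPipeline(object):
    """Hypothetical miniature of a class that uses @keyword_only."""

    @keyword_only
    def __init__(self, stages=None):
        if stages is None:
            stages = []  # this default is no longer an orphaned assignment
        self._params = {}
        # Pass the argument explicitly instead of reading _input_kwargs.
        self.setParams(stages=stages)

    @keyword_only
    def setParams(self, stages=None):
        return self._set(stages=stages)

    def _set(self, **kwargs):
        # Only per-instance state is touched here.
        self._params.update(kwargs)
        return self

    def getStages(self):
        return self._params["stages"]


def build_concurrently(n_threads=8):
    """Construct one ToyPipeline per thread; return them keyed by thread index."""
    results = {}

    def worker(i):
        results[i] = ToyPipeline(stages=["stage-%d" % i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling build_concurrently() and checking each object's getStages() against its thread index is one way to confirm that stages do not leak between instances. Because the same explicit-argument pattern would apply to every class decorated with @keyword_only, the sketch covers stage classes as well as the pipeline itself.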