[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley resolved SPARK-19348. --------------------------------------- Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 17195 [https://github.com/apache/spark/pull/17195] > pyspark.ml.Pipeline gets corrupted under multi threaded use > ----------------------------------------------------------- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark > Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 > Reporter: Vinayak Joshi > Assignee: Bryan Cutler > Fix For: 2.1.1, 2.0.3, 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > python threads, it is observed that the stages used to construct a pipeline > object get corrupted i.e the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > if performing other pipeline operations, such as pyspark.ml.pipeline.fit( ) > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.pipeline.fit( ) may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org