[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899979#comment-15899979 ] Apache Spark commented on SPARK-19348:

User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/17193

Key: SPARK-19348
URL: https://issues.apache.org/jira/browse/SPARK-19348
Project: Spark
Issue Type: Bug
Components: ML, PySpark
Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
Reporter: Vinayak Joshi
Assignee: Bryan Cutler
Fix For: 2.2.0
Attachments: pyspark_pipeline_threads.py

When pyspark.ml.Pipeline objects are constructed concurrently in separate Python threads, the stages used to construct a pipeline object get corrupted: the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in another thread.

Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread-safety problem with pyspark.ml.Pipeline object construction.

Confirmed that the problem exists with Spark 1.6.x as well as 2.x.

While the corruption of the Pipeline stages is easily caught, we need to know whether other pipeline operations, such as pyspark.ml.Pipeline.fit(), are also affected by the underlying cause of this problem. That is, whether operations like fit() may be performed concurrently in separate threads (on distinct pipeline objects) without any cross-contamination between them.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
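The failure mode described above can be modeled without Spark at all. The sketch below is a simplified stand-in for the pre-fix @keyword_only decorator, not the real pyspark code: kwargs are parked on a class attribute shared by all threads, and a pair of events forces the racy interleaving deterministically so the corruption is reproducible.

```python
import threading

def keyword_only(func):
    # simplified model of the pre-fix pyspark.ml.util.keyword_only decorator:
    # the caller's kwargs are parked on a CLASS attribute before the body runs
    def wrapper(self, **kwargs):
        type(self)._input_kwargs = kwargs   # shared by every thread!
        return func(self)
    return wrapper

class Pipeline:
    """Illustrative stand-in for pyspark.ml.Pipeline, not the real class."""
    _input_kwargs = {}
    stored = threading.Event()      # victim thread has stored its kwargs
    clobbered = threading.Event()   # another thread has overwritten them

    @keyword_only
    def __init__(self):
        if type(self)._input_kwargs.get("victim"):
            # widen the race window so the outcome is deterministic
            Pipeline.stored.set()
            Pipeline.clobbered.wait(timeout=5)
        # the "real" __init__ reads its arguments back from the class
        self.stages = type(self)._input_kwargs["stages"]

result = {}
t = threading.Thread(
    target=lambda: result.update(p=Pipeline(stages=["lr"], victim=True)))
t.start()
Pipeline.stored.wait(timeout=5)
other = Pipeline(stages=["tokenizer"])  # overwrites the shared kwargs
Pipeline.clobbered.set()
t.join()

# the victim asked for ["lr"] but picked up the other thread's stages
print(result["p"].stages)   # -> ['tokenizer']
```

Serializing construction (as the reporter observed) closes the window between the store and the read, which is why the problem disappears when only one thread builds pipelines at a time.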
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859737#comment-15859737 ] Peter D Kirchner commented on SPARK-19348:

Per the above, perhaps a fix could address both thread safety and the orphaned modifications to variables in the wrapped function bodies. For instance, in Pipeline.__init__() and setParams() the fix could remove references to _input_kwargs and instead invoke setParams(stages=stages) within __init__() and _set(stages=stages) within setParams(), respectively. Parallel changes would be needed in all wrapped functions, but the resulting code would be functional, readable, and threadsafe.

A fix for Pipeline alone is insufficient, because multiple pipelines could have stages consisting of instances of the same class, e.g. LogisticRegression. All the classes using @keyword_only need to be addressed by whatever fix is decided upon.
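The shape suggested in this comment can be sketched as follows. This is a hedged illustration, not the real pyspark code: Params here is a minimal stand-in for pyspark.ml.param.Params, and getStages() is an illustrative accessor. The point is that keyword arguments flow through ordinary parameters rather than through _input_kwargs, so all mutable state is per-instance.

```python
import threading

class Params:
    # minimal stand-in for pyspark.ml.param.Params; _set stores per-instance
    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        self._paramMap.update(kwargs)
        return self

class Pipeline(Params):
    # suggested shape: no _input_kwargs; arguments are passed explicitly
    def __init__(self, stages=None):
        super().__init__()
        self.setParams(stages=stages)

    def setParams(self, stages=None):
        return self._set(stages=stages)

    def getStages(self):
        return self._paramMap["stages"]

# concurrent construction no longer shares any mutable state
made = {}
threads = [
    threading.Thread(target=lambda n=n: made.update({n: Pipeline(stages=[n])}))
    for n in ("a", "b", "c")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(p.getStages() for p in made.values()))  # -> [['a'], ['b'], ['c']]
```

Because nothing is written to a class attribute, there is no window in which one thread can observe another thread's arguments.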
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856922#comment-15856922 ] Peter D Kirchner commented on SPARK-19348:

Two things happen with this wrapper. First, it appears to me (after confirming with some simplified examples) that modifications to incoming arguments made in the body of the wrapped function are lost (to take the pipeline.py example, stages=) when they are not updated in the dictionary that is passed in the calls made from inside the wrapped functions. The decorator explicitly states that it saves the 'actual input arguments' but does not clarify why. It seems as if the wrapped code should either update the dictionary, or the lines of code that have no lasting effect should be deleted.

Second, bryanc has pointed out that passing the kwargs to the wrapped function via a static class variable is thread-unsafe within each of the many ml classes that use the decorator. Passing the kwargs as an instance variable, as bryanc has proposed, seems satisfactory, as would using a thread-local class variable; both require changes to any files using the decorator. Locking in the decorator, if it could be implemented in spite of the nested calls of decorated functions, could confine the changes to the decorator definition. Passing the wrapper's kwargs dictionary as an additional entry in the kwargs dictionary passed to the wrapped function would be threadsafe, but it would change the public API of many functions. It looks possible for a decorator to introspect the wrapped function's parameters, in which case the wrapper could pass the dictionary on one of those keywords (the purpose being to leave the public API intact); then, inside the wrapped function, there would need to be code to detect the wrapper, retrieve the dictionary, and restore the co-opted variable. (As-is, the wrapped functions already have wrapper-specific code in order to function.)
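The instance-variable option discussed here can be sketched as below. This is a simplified illustration of the idea, not the decorator as merged into pyspark/ml/util.py: the kwargs are saved on the object being constructed rather than on the class, so nothing is shared across threads.

```python
import functools
import threading

def keyword_only(func):
    # thread-safe sketch: park the kwargs on the INSTANCE being built,
    # not on a class attribute shared by every thread
    @functools.wraps(func)
    def wrapper(self, **kwargs):
        self._input_kwargs = kwargs
        return func(self)
    return wrapper

class Pipeline:
    # illustrative stand-in, not the real pyspark.ml.Pipeline
    @keyword_only
    def __init__(self):
        kwargs = self._input_kwargs        # read back from *this* object
        self.stages = kwargs.get("stages", [])

built = {}
threads = [
    threading.Thread(target=lambda n=n: built.update({n: Pipeline(stages=[n])}))
    for n in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(built[n].stages == [n] for n in range(8)))  # -> True
```

A thread-local class variable (threading.local) would work similarly; either way, each of the many classes that apply the decorator has to be touched, which is the trade-off against a decorator-local locking scheme.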
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856700#comment-15856700 ] Peter D Kirchner commented on SPARK-19348:

To save folks some time, the keyword_only decorator came into pyspark/ml/util.py with SPARK-4586.
design: https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit
commit: https://github.com/apache/spark/commit/cd4a15366244657c4b7936abe5054754534366f2#diff-dd5670d3fb55faba1859e9778e4026e5
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850726#comment-15850726 ] Apache Spark commented on SPARK-19348:

User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/16782
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850705#comment-15850705 ] Bryan Cutler commented on SPARK-19348:

The problem here is with the @keyword_only decorator that is used in the Pipeline constructor (and in every other ML class as well). It relies on saving the params, stages in this case, to a static class variable. When multiple threads call the wrapped constructor, it becomes a race to read from that static variable before it is overwritten by another thread. I can put up a potential fix, but it affects all of PySpark ML, so it would need to be checked out carefully.

As a workaround, you could just protect the construction of {{Pipeline}} with a shared lock. Other calls, to {{fit}} etc., should be OK since they don't use the keyword_only decorator.
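The suggested workaround might look like the sketch below. Pipeline here is a runnable placeholder so the example is self-contained; in real code it would be pyspark.ml.Pipeline, and make_pipeline is an illustrative helper name.

```python
import threading

class Pipeline:
    # placeholder so the sketch runs standalone; in real code import
    # pyspark.ml.Pipeline instead
    def __init__(self, stages=None):
        self.stages = stages

_pipeline_lock = threading.Lock()  # one lock shared by all threads

def make_pipeline(stages):
    # serialize only the construction; fit() on the resulting distinct
    # objects can still run concurrently
    with _pipeline_lock:
        return Pipeline(stages=stages)

made = []
threads = [threading.Thread(target=lambda n=n: made.append(make_pipeline([n])))
           for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(p.stages for p in made))  # -> [[0], [1], [2], [3]]
```

Only the constructor goes through the lock, since the decorator's shared class variable is touched only during construction.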