[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-03-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899979#comment-15899979
 ] 

Apache Spark commented on SPARK-19348:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/17193

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Vinayak Joshi
>Assignee: Bryan Cutler
> Fix For: 2.2.0
>
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate 
> Python threads, the stages used to construct a pipeline object get 
> corrupted, i.e., the stages supplied to a Pipeline object in one thread 
> appear inside a different Pipeline object constructed in a different 
> thread. 
> Things work fine if construction of pyspark.ml.Pipeline objects is 
> serialized, so this looks like a thread-safety problem with 
> pyspark.ml.Pipeline object construction. 
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While corruption of the Pipeline stages is easily caught, we need to know 
> whether other pipeline operations, such as pyspark.ml.Pipeline.fit(), 
> are also affected by the underlying cause of this problem; that is, 
> whether such operations may be performed in separate threads (on distinct 
> pipeline objects) concurrently without any cross-contamination between them.
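[Editor's note] The interleaving the reporter describes can be simulated without Spark. The sketch below uses a toy `keyword_only` decorator and a `ToyPipeline` class (both hypothetical stand-ins, not Spark code) that stash constructor kwargs on a shared class attribute, the same pattern PySpark used at the time; events force the unlucky interleaving deterministically so the corruption is visible every run.

```python
import threading

# Demo hook used only to widen the race window deterministically.
_pause = None

def keyword_only(func):
    """Toy version of the pre-fix pyspark.ml.util.keyword_only:
    it stashes the caller's kwargs on a CLASS attribute, which
    every thread shares."""
    def wrapper(self, **kwargs):
        type(self)._input_kwargs = kwargs   # write to shared state
        if _pause is not None:
            _pause()                        # another thread may run here
        return func(self)
    return wrapper

class ToyPipeline:
    @keyword_only
    def __init__(self):
        # Read back from shared state -- may already be overwritten.
        self.stages = type(self)._input_kwargs["stages"]

a_in_window = threading.Event()
b_done = threading.Event()
pipelines = {}

def _pause_a():
    a_in_window.set()   # tell thread B we are inside the race window
    b_done.wait()       # let B construct (and clobber) meanwhile

def make_a():
    global _pause
    _pause = _pause_a
    pipelines["a"] = ToyPipeline(stages=["tokenizer_A", "lr_A"])

def make_b():
    global _pause
    a_in_window.wait()
    _pause = None
    pipelines["b"] = ToyPipeline(stages=["tokenizer_B", "lr_B"])
    b_done.set()

ta = threading.Thread(target=make_a)
tb = threading.Thread(target=make_b)
ta.start(); tb.start(); ta.join(); tb.join()
print(pipelines["a"].stages)   # ['tokenizer_B', 'lr_B'] -- A got B's stages
```

Thread A's pipeline ends up holding thread B's stages, which is exactly the cross-contamination reported above.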



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-09 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859737#comment-15859737
 ] 

Peter D Kirchner commented on SPARK-19348:
--

Per the above, perhaps a fix could address both thread safety and the orphaned 
modifications to variables in the wrapped function bodies.  For instance, in 
Pipeline.__init__() and setParams() the fix could remove references to 
_input_kwargs and instead invoke setParams(stages=stages) within __init__() and 
_set(stages=stages) within setParams(), respectively.  Parallel changes would 
be needed in all wrapped functions, but the resulting code would be functional, 
readable, and threadsafe.

A fix for Pipeline alone is insufficient, because multiple pipelines could have 
stages consisting of instances of the same class, e.g. LogisticRegression.  All 
the classes using @keyword_only need to be addressed by whatever fix is decided 
upon.
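[Editor's note] The shape of this proposal can be sketched with simplified stand-ins (the `ToyParams._set` below is a toy, not PySpark's actual Params machinery): arguments flow explicitly through the call chain, so there is no shared `_input_kwargs` to race on.

```python
class ToyParams:
    """Minimal stand-in for pyspark.ml.param.Params; _set just
    stores values on the instance."""
    def _set(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)

class ToyPipeline(ToyParams):
    # No decorator and no _input_kwargs: __init__ calls setParams
    # directly, and setParams calls _set, passing stages explicitly.
    def __init__(self, stages=None):
        self.setParams(stages=stages)

    def setParams(self, stages=None):
        self._set(stages=stages if stages is not None else [])
        return self

p = ToyPipeline(stages=["tokenizer", "lr"])
```

Because every value is passed as an ordinary argument, two threads constructing distinct pipelines cannot see each other's stages.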




[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-07 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856922#comment-15856922
 ] 

Peter D Kirchner commented on SPARK-19348:
--

Two things happen with this wrapper.

First, it appears to me (after confirming with some simplified examples) that 
modifications to incoming arguments made in the body of the wrapped function 
(to take the pipeline.py example, stages=) are lost when they are not updated 
in the dictionary that is passed in the calls made from inside the wrapped 
functions.  The decorator explicitly states that it saves the 'actual input 
arguments' but does not clarify why.  It seems as if the wrapped code should 
either update the dictionary, or the lines of code that have no lasting 
effect should be deleted.
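[Editor's note] This first point can be seen in a toy version of the wrapper (a simplified illustration, not the actual pipeline.py): a default filled in inside the function body never reaches the saved kwargs dictionary, so anything downstream that consumes the dictionary sees the stale value.

```python
def keyword_only(func):
    def wrapper(self, **kwargs):
        type(self)._input_kwargs = kwargs   # the 'actual input arguments'
        return func(self, **kwargs)
    return wrapper

class ToyPipeline:
    @keyword_only
    def __init__(self, stages=None):
        if stages is None:
            stages = []            # local fix-up...
        # ...but the saved dictionary still holds the stale value,
        # so forwarding it loses the fix-up:
        self.forwarded = dict(type(self)._input_kwargs)
        self.stages = stages

p = ToyPipeline(stages=None)
print(p.stages)      # []
print(p.forwarded)   # {'stages': None} -- the fix-up was lost
```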

Second, bryanc has pointed out that passing the kwargs to the wrapped function 
via a static class variable is thread-unsafe within each of the many ml classes 
that use the decorator.  Several fixes seem possible:
- Passing the kwargs as an instance variable, as bryanc has proposed, seems 
satisfactory, as would using a thread-local class variable.  Both require 
changes to any files using the decorator.
- Locking in the decorator, if it could be implemented in spite of the nested 
calls of decorated functions, could confine the changes to the decorator 
definition.
- Passing the wrapper's kwargs dictionary as an additional entry in the kwargs 
dictionary passed to the wrapped function would be threadsafe but would change 
the public API of many functions.
- It looks possible for a decorator to introspect the wrapped function's 
parameters, in which case the wrapper could pass the dictionary on one of those 
keywords (the purpose being to leave the public API intact); inside the wrapped 
function there would then need to be code to detect the wrapper, retrieve the 
dictionary, and restore the co-opted variable.  (As-is, the wrapped functions 
already have wrapper-specific code in order to function.)
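[Editor's note] One of the options mentioned here, a thread-local variable, can be sketched as follows (a hypothetical variant for illustration, not the fix Spark actually shipped): each thread gets its own kwargs slot, so concurrent constructions cannot clobber each other.

```python
import threading

_local = threading.local()   # one kwargs slot per thread

def keyword_only(func):
    """Variant of the decorator that stores the kwargs in
    thread-local storage instead of a shared class attribute."""
    def wrapper(self, **kwargs):
        _local.input_kwargs = kwargs
        return func(self)
    return wrapper

class ToyPipeline:
    @keyword_only
    def __init__(self):
        self.stages = _local.input_kwargs.get("stages", [])

# Build two pipelines concurrently; each thread reads only its own slot.
built = {}
def build(name):
    built[name] = ToyPipeline(stages=[name]).stages

threads = [threading.Thread(target=build, args=(n,)) for n in ("a", "b")]
for t in threads: t.start()
for t in threads: t.join()
```

Whatever the scheduler does, each pipeline keeps the stages its own thread supplied.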




[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-07 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856700#comment-15856700
 ] 

Peter D Kirchner commented on SPARK-19348:
--

To save folks some time: the keyword_only decorator came into 
pyspark/ml/util.py with SPARK-4586.
design:
https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit
commit:
https://github.com/apache/spark/commit/cd4a15366244657c4b7936abe5054754534366f2#diff-dd5670d3fb55faba1859e9778e4026e5




[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850726#comment-15850726
 ] 

Apache Spark commented on SPARK-19348:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/16782




[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-02 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850705#comment-15850705
 ] 

Bryan Cutler commented on SPARK-19348:
--

The problem here is with the @keyword_only decorator that is used in the 
Pipeline constructor (and every other ML class as well).  It relies on saving 
the params (stages, in this case) to a static class variable.  When multiple 
threads call the wrapped constructor, it becomes a race condition to read from 
that static variable before it is overwritten by another thread.  I can put up 
a potential fix, but it affects all of PySpark ML, so it would need to be 
checked out carefully.

As a workaround, you could just protect the construction of {{Pipeline}} with a 
shared lock.  Other calls, to {{fit}} etc., should be OK since they don't use 
that keyword_only decorator.
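[Editor's note] The suggested workaround might look like this in user code. The sketch uses a `FakePipeline` stand-in so it runs without Spark; with PySpark you would pass `pyspark.ml.Pipeline` in its place and call it the same way. `build_pipeline` and `_pipeline_lock` are illustrative names, not Spark APIs.

```python
import threading

_pipeline_lock = threading.Lock()   # shared by all pipeline-building threads

def build_pipeline(pipeline_cls, stages):
    """Serialize construction only; per the comment above, fit() and
    similar calls don't go through @keyword_only and need no lock."""
    with _pipeline_lock:
        return pipeline_cls(stages=stages)

# Stand-in so the sketch runs without Spark installed.
class FakePipeline:
    def __init__(self, stages):
        self.stages = stages

built = {}
def worker(name):
    built[name] = build_pipeline(FakePipeline, stages=[name])

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads: t.start()
for t in threads: t.join()
```

Because only one thread can be inside the constructor at a time, the shared `_input_kwargs` slot is always read by the same thread that wrote it.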
