[ https://issues.apache.org/jira/browse/SPARK-32522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172714#comment-17172714 ]
Dongjoon Hyun commented on SPARK-32522:
---------------------------------------

Thank you for the explanation, [~Ben Smith].

> Using PySpark with a MultilayerPerceptron model gives inconsistent outputs
> if a large amount of data is fed into it and at least one of the model
> outputs is fed to a Python UDF.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32522
>                 URL: https://issues.apache.org/jira/browse/SPARK-32522
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3, 3.0.0
>         Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
>                      or
>                      CentOS 7.6 with Python 3.6.3 and Spark built from master
>            Reporter: Ben Smith
>            Priority: Major
>              Labels: correctness
>         Attachments: model.zip, pyspark-script.py
>
>
> Using PySpark with a MultilayerPerceptron model gives inconsistent outputs
> if a large amount of data is fed into it and at least one of the model
> outputs is fed to a Python UDF.
>
> This data-correctness issue affects both the Spark 2.4 releases and the
> latest master branch.
>
> I do not understand the root cause and cannot reproduce the problem 100%
> of the time, but I have a simplified code sample (attached) that triggers
> the bug regularly. I raised an inquiry on the mailing list as a Spark 2.4
> issue, but nobody suggested a root cause, and I have since reproduced the
> problem on master, so I am now filing a bug here.
>
> During debugging I have narrowed the problem down somewhat; some
> observations I made while doing so:
> * I can reproduce the problem with a very simple MultilayerPerceptron with
> no hidden layers (just 14 features and 2 outputs). I also see it with a
> more complex MultilayerPerceptron model, so I don't think the model
> details are important.
> * I cannot reproduce the problem unless the model output is fed to a
> Python UDF. Removing the UDF yields good model outputs, while including it
> makes the model outputs inconsistent (note that it is not just the Python
> UDF outputs that are inconsistent).
> * I cannot reproduce the problem on minuscule amounts of data or when my
> data is partitioned heavily. 100,000 rows of input in 2 partitions
> triggers the issue most of the time.
> * Some of the bad outputs I get could be explained if certain features
> were zero when they entered the model (they are not zero in my actual
> feature data).
> * I can reproduce the problem on several different servers.
>
> My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can
> also reproduce the issue with code built from the Spark master branch,
> but, strangely, I cannot reproduce it with Spark 2.4.3 and Python 2.7.
> I'm not sure why the Python version would matter.
>
> The attached code sample triggers the problem the vast majority of the
> time when pasted into a PySpark shell. The code generates a DataFrame
> containing 100,000 identical rows, transforms it with a
> MultilayerPerceptron model, and feeds one of the model output columns to
> a simple Python UDF to generate an additional column. The distinct rows
> of the resulting DataFrame are then selected; since all the inputs are
> identical I would expect to get 1 row back, but instead I get many unique
> rows, with the number returned varying each time I run the code. To run
> the code you will need the model files locally; I have attached the model
> as a zip archive, and unzipping it to /tmp should be all you need to do
> to get the code to run.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
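The reproduction described above can be sketched roughly as follows. This is a reconstruction from the issue text rather than the attached pyspark-script.py (which is not reproduced here), so the feature values, column names, and the exact UDF body are assumptions; only the shape of the repro (identical rows, few partitions, model transform, Python UDF, distinct count) comes from the report. The "/tmp/model" path follows the "unzip to /tmp" instruction.

```python
def bucket(p):
    # Trivial Python UDF body (hypothetical stand-in for the UDF in the
    # attached script): bucket the model's prediction column.
    return "high" if p >= 0.5 else "low"


def reproduce():
    # Imports are deferred so the sketch can be read without a Spark
    # installation; calling reproduce() requires a Spark environment with
    # the attached model unzipped to /tmp.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import (
        MultilayerPerceptronClassificationModel,
    )

    spark = SparkSession.builder.getOrCreate()
    model = MultilayerPerceptronClassificationModel.load("/tmp/model")

    # 100,000 identical rows of 14 features, held in only 2 partitions --
    # the report notes that heavy partitioning hides the bug.
    row = (Vectors.dense([0.5] * 14),)
    df = spark.createDataFrame([row] * 100_000, ["features"]).coalesce(2)

    scored = model.transform(df)
    with_udf = scored.withColumn(
        "bucket", udf(bucket, StringType())(scored["prediction"])
    )

    # All inputs are identical, so exactly 1 distinct row is expected; on
    # affected versions this returns a varying number greater than 1.
    return with_udf.distinct().count()
```

On an unaffected setup `reproduce()` should return 1; the bug manifests as a larger, run-to-run varying count.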