[ https://issues.apache.org/jira/browse/SPARK-32522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172714#comment-17172714 ]
Dongjoon Hyun commented on SPARK-32522:
---------------------------------------

Thank you for the explanation, [~Ben Smith].

> Using PySpark with a MultilayerPerceptron model gives inconsistent outputs
> if a large amount of data is fed into it and at least one of the model
> outputs is fed to a Python UDF.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32522
>                 URL: https://issues.apache.org/jira/browse/SPARK-32522
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.3, 3.0.0
>         Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
>                      or
>                      CentOS 7.6 with Python 3.6.3 and Spark built from master
>            Reporter: Ben Smith
>            Priority: Major
>              Labels: correctness
>         Attachments: model.zip, pyspark-script.py
>
>
> Using PySpark with a MultilayerPerceptron model gives inconsistent outputs
> if a large amount of data is fed into it and at least one of the model
> outputs is fed to a Python UDF.
>
> This data-correctness issue affects both the Spark 2.4 releases and the
> latest master branch.
>
> I do not understand the root cause and cannot reproduce the problem 100%
> of the time, but I have a simplified code sample (attached) that triggers
> the bug regularly. I raised an inquiry on the mailing list as a Spark 2.4
> issue, but nobody suggested a root cause, and I have since reproduced the
> problem on master, so I am now filing a bug here.
>
> During debugging I have narrowed the problem down somewhat; some
> observations I made while doing so:
> * I can reproduce the problem with a very simple MultilayerPerceptron with
> no hidden layers (just 14 features and 2 outputs). I also see it with a
> more complex MultilayerPerceptron model, so I don't think the model
> details are important.
> * I cannot reproduce the problem unless the model output is fed to a
> Python UDF. Removing the UDF yields good model outputs, while including it
> makes the model outputs inconsistent (note that it is not just the Python
> UDF outputs that are inconsistent).
> * I cannot reproduce the problem on minuscule amounts of data or when my
> data is partitioned heavily. 100,000 rows of input in 2 partitions
> triggers the issue most of the time.
> * Some of the bad outputs I get could be explained if certain features
> were zero when they entered the model (they are not zero in my actual
> feature data).
> * I can reproduce the problem on several different servers.
>
> My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can
> also reproduce the issue with code built from the Spark master branch,
> but, strangely, I cannot reproduce it with Spark 2.4.3 and Python 2.7.
> I'm not sure why the Python version would matter.
>
> The attached code sample triggers the problem the vast majority of the
> time when pasted into a PySpark shell. The code generates a DataFrame
> containing 100,000 identical rows, transforms it with a
> MultilayerPerceptron model, and feeds one of the model output columns to
> a simple Python UDF to generate an additional column. The distinct rows
> of the resulting DataFrame are then selected; since all the inputs are
> identical I would expect to get 1 row back, but instead I get many unique
> rows, with the number returned varying each time I run the code. To run
> the code you will need the model files locally; I have attached the model
> as a zip archive, and unzipping it to /tmp should be all you need to do
> to get the code to run.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
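The reproduction described above can be sketched roughly as follows. This is a reconstruction from the issue text rather than the attached pyspark-script.py (which is not reproduced here), so the feature values, column names, and the exact UDF body are assumptions; only the shape of the repro (identical rows, few partitions, model transform, Python UDF, distinct count) comes from the report. The "/tmp/model" path follows the "unzip to /tmp" instruction.

```python
def bucket(p):
    # Trivial Python UDF body (hypothetical stand-in for the UDF in the
    # attached script): bucket the model's prediction column.
    return "high" if p >= 0.5 else "low"


def reproduce():
    # Imports are deferred so the sketch can be read without a Spark
    # installation; calling reproduce() requires a Spark environment with
    # the attached model unzipped to /tmp.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import (
        MultilayerPerceptronClassificationModel,
    )

    spark = SparkSession.builder.getOrCreate()
    model = MultilayerPerceptronClassificationModel.load("/tmp/model")

    # 100,000 identical rows of 14 features, held in only 2 partitions --
    # the report notes that heavy partitioning hides the bug.
    row = (Vectors.dense([0.5] * 14),)
    df = spark.createDataFrame([row] * 100_000, ["features"]).coalesce(2)

    scored = model.transform(df)
    with_udf = scored.withColumn(
        "bucket", udf(bucket, StringType())(scored["prediction"])
    )

    # All inputs are identical, so exactly 1 distinct row is expected; on
    # affected versions this returns a varying number greater than 1.
    return with_udf.distinct().count()
```

On an unaffected setup `reproduce()` should return 1; the bug manifests as a larger, run-to-run varying count.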