Jeff gold created SPARK-27519:
---------------------------------

             Summary: Pandas udf corrupting data
                 Key: SPARK-27519
                 URL: https://issues.apache.org/jira/browse/SPARK-27519
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Jeff gold


While using a pandas UDF, I passed the UDF 2 columns: a string and a list of 
lists of strings. An example of the second argument's structure: 
[['1'],['2'],['3']]

But when I read this same value inside the UDF, I receive something like this: 
[['1','2'],['3'],[]]

I checked, and the same row in the table has the list with the correct 
structure; it changed only inside the UDF.
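
To make the difference between the two values concrete, here is a minimal 
plain-Python sketch (no Spark needed) that compares them. The helper names 
inner_lengths and flatten are hypothetical, used only for illustration. It 
shows that the elements and their order are the same and only the boundaries 
between the inner lists differ. This describes the symptom; it does not 
establish the cause.

```python
from itertools import chain

# Values copied from the report above.
expected = [['1'], ['2'], ['3']]      # what the table row contains
observed = [['1', '2'], ['3'], []]    # what the UDF received


def inner_lengths(nested):
    """Return the length of each inner list (hypothetical helper)."""
    return [len(inner) for inner in nested]


def flatten(nested):
    """Concatenate the inner lists into one flat list (hypothetical helper)."""
    return list(chain.from_iterable(nested))


same_elements = flatten(expected) == flatten(observed)
same_boundaries = inner_lengths(expected) == inner_lengths(observed)

print("inner lengths, table:", inner_lengths(expected))
print("inner lengths, UDF:  ", inner_lengths(observed))
print("same flattened elements:", same_elements)
print("same inner boundaries:  ", same_boundaries)
```

A check like this could be run on the value seen inside the UDF and on the 
value read directly from the table to confirm that only the nesting differs.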

 

I don't know why this happens, but I do know it is related to the fact that 
the row was the 10,001st row and the last row in its partition. The pandas 
batch size is 10,000, so that row was sent alone as a second batch. That is 
the only thing that seems to trigger it: having 1 or 2 rows in a second batch 
of the partition. I was also able to reproduce this with a second batch of 2 
rows; in that case the list was unchanged except that an empty list was added 
to the end. 
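
The batching arithmetic described above can be sketched in plain Python. This 
is an illustration only, not Spark's implementation: split_into_batches is a 
hypothetical helper, the row contents are made up, and the limit of 10,000 
mirrors the default of spark.sql.execution.arrow.maxRecordsPerBatch.

```python
def split_into_batches(rows, max_records_per_batch=10_000):
    """Yield consecutive slices of at most max_records_per_batch rows.

    Hypothetical helper that only models how a partition is cut into batches.
    """
    for start in range(0, len(rows), max_records_per_batch):
        yield rows[start:start + max_records_per_batch]


# Hypothetical partition of 10,001 rows: a string and a list of lists of strings.
partition = [(f"id{i}", [['1'], ['2'], ['3']]) for i in range(10_001)]

batches = list(split_into_batches(partition))
batch_sizes = [len(batch) for batch in batches]

# The 10,001st row ends up alone in the second batch.
print(batch_sizes)
```

With 10,002 rows the second batch would hold 2 rows, which matches the other 
case mentioned in the report.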

Hope you can help me understand what is going on, thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
