Robert Joseph Evans created SPARK-35108:
-------------------------------------------

             Summary: Pickle produces incorrect key labels for 
GenericRowWithSchema (data corruption)
                 Key: SPARK-35108
                 URL: https://issues.apache.org/jira/browse/SPARK-35108
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.2, 3.0.1
            Reporter: Robert Joseph Evans
         Attachments: test.py, test.sh

I believe this also shows up in all versions of Spark that pickle the data when
doing a collect from Python.

When you do a collect in Python, the JVM collects the data and converts the
UnsafeRows into GenericRowWithSchema instances before sending them to the
Pickler. The Pickler, by default, tries to dedupe objects using their hashCode
and .equals methods. But .equals and .hashCode for GenericRowWithSchema only
look at the data, not the schema, while the keys written out when a row is
pickled come from the schema.
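The dedupe interaction can be simulated in plain Python. This is only an
illustrative sketch of the mechanism, not Spark code: RowLike, serialize, and
the memo dict are hypothetical stand-ins for GenericRowWithSchema and the
Pickler's memo table.

```python
# Hypothetical stand-in for GenericRowWithSchema: equality and hashing
# consider only the data values, never the field names (the schema).
class RowLike:
    def __init__(self, values, field_names):
        self.values = values
        self.field_names = field_names

    def __eq__(self, other):
        return isinstance(other, RowLike) and self.values == other.values

    def __hash__(self):
        return hash(tuple(self.values))


# Simulated pickler memo: on a dedupe hit it reuses the first serialized
# form, so the second row inherits the FIRST row's keys.
memo = {}

def serialize(row):
    if row in memo:
        return memo[row]
    out = dict(zip(row.field_names, row.values))
    memo[row] = out
    return out


outer = RowLike([1, 2], ["a", "b"])
inner = RowLike([1, 2], ["x", "y"])  # same data, different schema

s1 = serialize(outer)  # {'a': 1, 'b': 2}
s2 = serialize(inner)  # {'a': 1, 'b': 2} -- keys 'x'/'y' are lost
```

Because equality ignores the schema, the memo lookup treats the two rows as
identical and the second row's field names never reach the output.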

This can result in a form of data corruption in a few cases: when a row has the
same number of elements as a struct within that row, or as a sub-struct nested
inside another struct.

If the data values also happen to match, the dedupe kicks in and the keys for
the resulting row or struct can be wrong.

My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
