[ https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Graves updated SPARK-35108:
----------------------------------
    Priority: Blocker  (was: Major)

> Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-35108
>                 URL: https://issues.apache.org/jira/browse/SPARK-35108
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.1, 3.0.2
>            Reporter: Robert Joseph Evans
>            Priority: Blocker
>              Labels: correctness
>         Attachments: test.py, test.sh
>
> I believe this shows up in every version of Spark that pickles the data when doing a collect from Python.
>
> When you call collect() from Python, the JVM performs the collect and converts the UnsafeRows into GenericRowWithSchema instances before sending them to the Pickler. The Pickler, by default, tries to dedupe objects using each object's hashCode and .equals, but .equals and .hashCode for GenericRowWithSchema only look at the data, not the schema. Yet when a row is pickled, the keys written out come from its schema.
>
> This can result in data corruption in cases where a row has the same number of elements as a struct within that row, or as a sub-struct nested within another struct. If the data also happens to be the same, the two are deduplicated and the keys for the resulting row or struct come out wrong.
>
> My repro case is a bit convoluted, but it does happen.
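>
> A hypothetical minimal sketch of the mechanism (not the attached test.py, which is the actual repro): two sibling structs carry identical values under different field names, so their GenericRowWithSchema instances compare equal and the Pickler's memoization can hand the second struct the first one's keys. Column and field names here are made up for illustration.
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import col, struct
>
> spark = SparkSession.builder.master("local[1]").getOrCreate()
>
> # Both structs hold the values (1, 2); only the field names differ.
> df = (
>     spark.createDataFrame([(1, 2)], ["a", "b"])
>     .select(
>         struct(col("a").alias("a"), col("b").alias("b")).alias("s1"),
>         struct(col("a").alias("x"), col("y" if False else "b").alias("y")).alias("s2"),
>     )
> )
>
> row = df.collect()[0]
> print(row.s1.asDict())  # {'a': 1, 'b': 2}
> # On an affected build the memoized pickle of s1 can be reused for s2,
> # so this may print {'a': 1, 'b': 2} instead of the correct {'x': 1, 'y': 2}.
> print(row.s2.asDict())
> {code}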