[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-05-04 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338948#comment-17338948
 ] 

Robert Joseph Evans commented on SPARK-35108:
-

Looks good.  Thanks for the fix.

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-05-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338312#comment-17338312
 ] 

Hyukjin Kwon commented on SPARK-35108:
--

[~revans2] and [~tgraves] can you confirm this that fix addresses your issue?

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-05-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338311#comment-17338311
 ] 

Hyukjin Kwon commented on SPARK-35108:
--

I think this is a duplicate of SPARK-34545.

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325764#comment-17325764
 ] 

Hyukjin Kwon commented on SPARK-35108:
--

Thanks for cc'ing me [~tgraves]. I will take a look early next week if no one 
takes this one.

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-19 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325252#comment-17325252
 ] 

Thomas Graves commented on SPARK-35108:
---

seems like a correctness issue in some cases so marking it as such until to 
investigate.  [~hyukjin.kwon] [~cloud_fan]

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Blocker
>  Labels: correctness
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35108) Pickle produces incorrect key labels for GenericRowWithSchema (data corruption)

2021-04-16 Thread Robert Joseph Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324088#comment-17324088
 ] 

Robert Joseph Evans commented on SPARK-35108:
-

If you have SPARK_HOME set when you run test.sh on a system with 6 cores or 
more it should reproduce the issue.

I was able to mitigate the issue by adding .equals and .hashCode to 
GenericRowWithSchema so it took into account the schema. But we could also try 
to turn off the dedupe or value compare dedupe (Pickler has options to disable 
these things). I am not sure what the proper fix for this would be because the 
code for all of these is shared with other code paths.

> Pickle produces incorrect key labels for GenericRowWithSchema (data 
> corruption)
> ---
>
> Key: SPARK-35108
> URL: https://issues.apache.org/jira/browse/SPARK-35108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Robert Joseph Evans
>Priority: Major
> Attachments: test.py, test.sh
>
>
> I think this also shows up for all versions of Spark that pickle the data 
> when doing a collect from python.
> When you do a collect in python java will do a collect and convert the 
> UnsafeRows into GenericRowWithSchema instances before it sends them to the 
> Pickler. The Pickler, by default, will try to dedupe objects using hashCode 
> and .equals for the object.  But .equals and .hashCode for 
> GenericRowWithSchema only looks at the data, not the schema. But when we 
> pickle the row the keys from the schema are written out.
> This can result in data corruption, sort of, in a few cases where a row has 
> the same number of elements as a struct within the row does, or a sub-struct 
> within another struct. 
> If the data happens to be the same, the keys for the resulting row or struct 
> can be wrong.
> My repro case is a bit convoluted, but it does happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org