[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751757#comment-16751757 ]

ANAND CHINNAKANNAN commented on SPARK-26677:
--------------------------------------------

[~hyukjin.kwon] - Do you know whether the issue is in ParquetFileReader? The file reader appears to mishandle duplicate row values when evaluating the filter. Let me know your thoughts.

> Incorrect results of not(eqNullSafe) when data read from Parquet file
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26677
>                 URL: https://issues.apache.org/jira/browse/SPARK-26677
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>        Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 18.04).
>           Reporter: Michal Kapalka
>           Priority: Blocker
>             Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
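For reference, the documented semantics of eqNullSafe (Spark's null-safe equality, `<=>`) can be sketched without Spark at all, modelling a nullable column as `Option[String]`. This is an illustrative model, not Spark's implementation; the helper `eqNullSafe` below is our own:

```scala
// Illustrative model (not Spark code): null-safe equality over nullable
// values, with None standing in for SQL NULL.
def eqNullSafe(a: Option[String], b: Option[String]): Boolean = (a, b) match {
  case (None, None)       => true   // NULL <=> NULL is true
  case (Some(x), Some(y)) => x == y // ordinary equality when both non-null
  case _                  => false  // exactly one side NULL: not equal
}

// The dataset from the report: two duplicate values plus a null.
val rows = Seq(Some("A"), Some("A"), None)
val kept = rows.filter(r => !eqNullSafe(r, Some("A")))
println(kept) // the null row survives the filter: List(None)
```

By these semantics the null row must always be returned by `not(col("value").eqNullSafe("A"))`, which is why the empty result on Spark 2.4.0 is a correctness bug rather than a semantic ambiguity.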
[jira] [Comment Edited] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748879#comment-16748879 ]

ANAND CHINNAKANNAN edited comment on SPARK-26677 at 1/22/19 6:04 PM:
---------------------------------------------------------------------

I have done the analysis for this bug. Below is the initial analysis:

{code:java}
scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3")
scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show
+-----+
|value|
+-----+
|    B|
| null|
+-----+
{code}

The issue happens only when the column contains duplicate row data. Will do further analysis.

was (Author: anandchinn):
I have done the analysis for this bug. Below is the initial analysis (same repro as above). The issue happens only when the column contains duplicate row data. I will do the research to deep-dive into the code.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748879#comment-16748879 ]

ANAND CHINNAKANNAN commented on SPARK-26677:
--------------------------------------------

I have done the analysis for this bug. Below is the initial analysis:

{code:java}
scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3")
scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show
+-----+
|value|
+-----+
|    B|
| null|
+-----+
{code}

The issue happens only when the column contains duplicate row data. We will do further research to understand the code.
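Since the wrong result appears only when the data is read back from Parquet, one diagnostic worth trying (a suggestion for narrowing the cause, not a confirmed fix) is to rerun the failing query with Parquet filter pushdown disabled via the `spark.sql.parquet.filterPushdown` setting. If the null row reappears, the pushed-down Parquet filter is implicated rather than Spark's own predicate evaluation:

{code:java}
scala> // Diagnostic sketch: disable Parquet filter pushdown and rerun the repro.
scala> spark.conf.set("spark.sql.parquet.filterPushdown", "false")
scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
{code}

Note this only localizes the bug; it is a workaround for testing, not a substitute for fixing the pushdown logic.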