[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756654#comment-16756654 ]

Ryan Blue commented on SPARK-26677:
-----------------------------------

Thanks, sorry about the mistake.

> Incorrect results of not(eqNullSafe) when data read from Parquet file
> ---------------------------------------------------------------------
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 18.04).
> Reporter: Michal Kapalka
> Priority: Blocker
> Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756609#comment-16756609 ]

Dongjoon Hyun commented on SPARK-26677:
---------------------------------------

Hi, [~rdblue]. I moved `2.4.1` from `Fixed Versions` field to `Target Versions` since it's not merged to `branch-2.4` yet.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752619#comment-16752619 ]

Dongjoon Hyun commented on SPARK-26677:
---------------------------------------

Yep. Correct. There were two mixed issues.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752606#comment-16752606 ]

Ryan Blue commented on SPARK-26677:
-----------------------------------

To clarify [~dongjoon]'s comment: All recent versions of Parquet are affected by this {{not(eqNullSafe(...))}} bug. Only Parquet 1.10.0 is affected by PARQUET-1309. This filter bug has been present since Parquet introduced dictionary filtering.
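To illustrate how a dictionary-based row-group filter can go wrong for {{not(eqNullSafe(...))}}, here is a minimal sketch in plain Scala. It is not Parquet's implementation: `ColumnChunk`, `canDropNaive`, `canDropNullAware` and `expected` are hypothetical names, and the sketch assumes that the filter decides from the dictionary's distinct non-null values alone. It also does not model when a column chunk is dictionary-encoded in the first place.

```scala
// Hypothetical, simplified model of a row-group-level dictionary filter.
// Not Parquet's code; every name here is illustrative.
object DictionaryFilterSketch {
  // A column chunk as written: nulls are represented as None.
  final case class ColumnChunk(rows: Seq[Option[String]]) {
    // A dictionary holds only the distinct non-null values.
    def dictionary: Set[String] = rows.flatten.toSet
    def hasNulls: Boolean = rows.contains(None)
  }

  // Null-unaware check for "value is not v": the row group is dropped
  // when every dictionary entry equals v, ignoring any null rows.
  def canDropNaive(chunk: ColumnChunk, v: String): Boolean =
    chunk.dictionary.forall(_ == v)

  // Null-aware check: a row group that holds nulls is never dropped,
  // because a null row satisfies not(value <=> v).
  def canDropNullAware(chunk: ColumnChunk, v: String): Boolean =
    !chunk.hasNulls && chunk.dictionary.forall(_ == v)

  // Row-level semantics of not(value <=> v): keep nulls and values other than v.
  def expected(chunk: ColumnChunk, v: String): Seq[Option[String]] =
    chunk.rows.filterNot(_.contains(v))
}
```

With the rows from the report, `ColumnChunk(Seq(Some("A"), Some("A"), None))`, the null-unaware check would drop the whole row group for `"A"`, while the row-level semantics still expect the null row to be returned.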
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751890#comment-16751890 ]

Dongjoon Hyun commented on SPARK-26677:
---------------------------------------

Thank you, [~anandchinn] and [~hyukjin.kwon]. So, according to the PR and PARQUET-1309, only Parquet 1.10.0 (used in Spark 2.4.0) has this issue.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751767#comment-16751767 ]

Hyukjin Kwon commented on SPARK-26677:
--------------------------------------

Yes, please read the linked PR above.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751757#comment-16751757 ]

ANAND CHINNAKANNAN commented on SPARK-26677:
--------------------------------------------

[~hyukjin.kwon] - Do you know whether the issue comes from ParquetFileReader? The file reader appears to have an issue with overriding duplicate row keys. Let me know your thoughts.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749526#comment-16749526 ]

Hyukjin Kwon commented on SPARK-26677:
--------------------------------------

I'm going to open a PR soon.
[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file
[ https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748879#comment-16748879 ]

ANAND CHINNAKANNAN commented on SPARK-26677:
--------------------------------------------

I have analyzed this bug. Below is the initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");
scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;
+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only if the column has duplicate row data. We are researching the code to understand it.

> Incorrect results of not(eqNullSafe) when data read from Parquet file
> ---------------------------------------------------------------------
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 18.04).
> Reporter: Michal Kapalka
> Priority: Critical