[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-24 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751757#comment-16751757
 ] 

ANAND CHINNAKANNAN commented on SPARK-26677:


[~hyukjin.kwon] - Do you know whether the issue comes from ParquetFileReader? The 
file reader appears to have a problem with overriding duplicate row keys.

Let me know your thoughts. 
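For anyone hitting this in the meantime, a possible mitigation (assuming the rows are being dropped by the pushed-down Parquet filter, which I have not confirmed) is to disable Parquet filter pushdown so Spark evaluates the predicate itself on the decoded rows:

{code:java}
// Hedged workaround sketch, NOT a confirmed root cause: if the wrong result
// comes from the not(eqNullSafe) filter being pushed into the Parquet reader,
// turning pushdown off makes Spark re-evaluate the predicate after decoding.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show()
{code}

This trades scan performance for correctness, so it is only a stopgap until the reader-side bug is fixed.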

 

> Incorrect results of not(eqNullSafe) when data read from Parquet file 
> --
>
> Key: SPARK-26677
> URL: https://issues.apache.org/jira/browse/SPARK-26677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Local installation of Spark on Linux (Java 1.8, Ubuntu 
> 18.04).
>Reporter: Michal Kapalka
>Priority: Blocker
>  Labels: correctness
>
> Example code (spark-shell from Spark 2.4.0):
> {code:java}
> scala> Seq("A", "A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> Running the same with Spark 2.2.0 or 2.3.2 gives the correct result:
> {code:java}
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> Also, with a different input sequence and Spark 2.4.0 we get the correct 
> result:
> {code:java}
> scala> Seq("A", null).toDS.repartition(1).write.parquet("t")
> scala> spark.read.parquet("t").where(not(col("value").eqNullSafe("A"))).show
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN edited comment on SPARK-26677 at 1/22/19 6:04 PM:
-

I have done some analysis of this bug. Below is my initial analysis:
scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only when the column has duplicate row data. I will do 
further analysis.
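To spell out the expected semantics, here is a minimal sketch in plain Scala, modeling null as Option — this only illustrates what {{<=>}} should do, not how Spark implements it:

{code:java}
// eqNullSafe (<=>) treats null as an ordinary value: null <=> null is true,
// null <=> "A" is false. So not(value <=> "A") must keep the null row.
def eqNullSafe(a: Option[String], b: Option[String]): Boolean = a == b

val rows = Seq(Some("A"), Some("B"), None)
val kept = rows.filterNot(v => eqNullSafe(v, Some("A")))
// kept == Seq(Some("B"), None): both B and null survive, matching the
// correct output shown above
{code}

Whatever the Parquet reader does with duplicate values, the null row must survive this filter.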












[jira] [Commented] (SPARK-26677) Incorrect results of not(eqNullSafe) when data read from Parquet file

2019-01-22 Thread ANAND CHINNAKANNAN (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748879#comment-16748879
 ] 

ANAND CHINNAKANNAN commented on SPARK-26677:


I have done some analysis of this bug. Below is my initial analysis:

scala> Seq("A", "B", null).toDS.repartition(1).write.parquet("t3");

scala> spark.read.parquet("t3").where(not(col("value").eqNullSafe("A"))).show;

+-----+
|value|
+-----+
|    B|
| null|
+-----+

The issue happens only when the column has duplicate row data. We will do 
further research to understand the code.

 



