[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results
[ https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017806#comment-17017806 ]

Maxim Gekk commented on SPARK-30530:

[~jlowe] I prepared a fix for the issue. [~hyukjin.kwon] [~cloud_fan] Could you review it, please.

> CSV load followed by "is null" filter produces incorrect results
>
> Key: SPARK-30530
> URL: https://issues.apache.org/jira/browse/SPARK-30530
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Jason Darrell Lowe
> Priority: Major
>
> Trying to filter on is null from values loaded from a CSV file has regressed
> recently and now produces incorrect results.
> Given a CSV file with the contents:
> {noformat:title=floats.csv}
> 100.0,1.0,
> 200.0,,
> 300.0,3.0,
> 1.0,4.0,
> ,4.0,
> 500.0,,
> ,6.0,
> -500.0,50.5
> {noformat}
> Filtering this data for the first column being null should return exactly two
> rows, but it is returning extraneous rows with nulls:
> {noformat}
> scala> val schema = StructType(Array(StructField("floats", FloatType, true),StructField("more_floats", FloatType, true)))
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(floats,FloatType,true), StructField(more_floats,FloatType,true))
> scala> val df = spark.read.schema(schema).csv("floats.csv")
> df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]
> scala> df.filter("floats is null").show
> +------+-----------+
> |floats|more_floats|
> +------+-----------+
> |  null|       null|
> |  null|       null|
> |  null|       null|
> |  null|       null|
> |  null|        4.0|
> |  null|       null|
> |  null|        6.0|
> +------+-----------+
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results
[ https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017290#comment-17017290 ]

Maxim Gekk commented on SPARK-30530:

[~jlowe] Thank you for the bug report. I will take a look at it.
[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results
[ https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017088#comment-17017088 ]

Jason Darrell Lowe commented on SPARK-30530:

The regressed behavior was introduced by this commit:
{noformat}
commit 4e50f0291f032b4a5c0b46ed01fdef14e4cbb050
Author: Maxim Gekk
Date: Thu Jan 16 13:10:08 2020 +0900

    [SPARK-30323][SQL] Support filters pushdown in CSV datasource
{noformat}
[~maxgekk] would you take a look?
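
For reference, the "exactly two rows" expectation in the report can be checked independently of Spark. The following is only a minimal sketch in plain Scala, not Spark's CSV parser: the object name ExpectedNullRows and its helpers are hypothetical, and it assumes that an empty field maps to null (None here) and that the trailing third field on most lines is ignored because the schema declares only two columns.

```scala
// Hypothetical reference check, independent of Spark. It mimics the
// two-column nullable FloatType schema from the report using Option[Float].
object ExpectedNullRows {
  // Contents of floats.csv exactly as given in the report.
  val lines: Seq[String] = Seq(
    "100.0,1.0,",
    "200.0,,",
    "300.0,3.0,",
    "1.0,4.0,",
    ",4.0,",
    "500.0,,",
    ",6.0,",
    "-500.0,50.5"
  )

  // Assumption: an empty field is treated as null (None); anything else is a float.
  def parseField(s: String): Option[Float] =
    if (s.trim.isEmpty) None else Some(s.trim.toFloat)

  // Assumption: only the first two fields are read; any extra trailing field is dropped.
  def parseLine(line: String): (Option[Float], Option[Float]) = {
    val fields = line.split(",", -1) // limit -1 keeps trailing empty fields
    val second = if (fields.length > 1) fields(1) else ""
    (parseField(fields(0)), parseField(second))
  }

  // Rows whose first column is null, i.e. the expected result of "floats is null".
  def nullFirstColumn: Seq[(Option[Float], Option[Float])] =
    lines.map(parseLine).filter(_._1.isEmpty)

  def main(args: Array[String]): Unit =
    nullFirstColumn.foreach(println)
}
```

Under these assumptions only the lines ",4.0," and ",6.0," have an empty first field, so the check yields the two rows (null, 4.0) and (null, 6.0), matching the expectation stated in the report rather than the seven rows shown in its output.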