Hi,

I have an application that filters data with Spark SQL using simple WHERE
clauses. I also want to be able to show the unmatched rows for any filter, so I
wrap the original clause in `NOT()` to get the inverse. Example:

Filter:  username is not null
Inverse filter:  NOT(username is not null)
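
For context, the inverse clause is generated programmatically, roughly like
this (the invert helper below is just illustrative, not my actual code):

pyspark:
> def invert(clause):
>     # wrap the original WHERE clause in NOT() to select the unmatched rows
>     return 'NOT({})'.format(clause)
> clause = 'username is not null'
> matched = spark.sql('select my_id from my_table where ' + clause)
> unmatched = spark.sql('select my_id from my_table where ' + invert(clause))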

This worked fine in Spark 1.6. After upgrading to Spark 2.0.2, the inverse 
filter always returns zero results. It looks like this is a problem with how 
the filter is getting pushed down to Parquet. Specifically, the pushdown 
includes both the “is not null” filter, AND “not(is not null)”, which would 
obviously result in zero matches. An example below:

pyspark:
> x = spark.sql('select my_id from my_table where username is not null')
> y = spark.sql('select my_id from my_table where not(username is not null)')
> x.explain()
== Physical Plan ==
*Project [my_id#6L]
+- *Filter isnotnull(username#91)
   +- *BatchedScan parquet default.my_table[my_id#6L,username#91]
       Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
       PartitionFilters: [], PushedFilters: [IsNotNull(username)],
       ReadSchema: struct<my_id:bigint,username:string>
> y.explain()
== Physical Plan ==
*Project [my_id#6L]
+- *Filter (isnotnull(username#91) && NOT isnotnull(username#91))
   +- *BatchedScan parquet default.my_table[my_id#6L,username#91]
       Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
       PartitionFilters: [],
       PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
       ReadSchema: struct<my_id:bigint,username:string>

For now I'm working around this with the NOT EXISTS support that's new in
Spark 2, but that seems like overkill for a simple inverted filter.
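
Roughly what that workaround looks like, using the same table as above (a
sketch only; it assumes my_id uniquely identifies a row in my_table):

pyspark:
> # anti-join against the matched rows via NOT EXISTS; this is only
> # equivalent to the inverse filter if my_id is unique in my_table
> y = spark.sql('''
>     select my_id from my_table t
>     where not exists (
>         select 1 from my_table m
>         where m.my_id = t.my_id and m.username is not null
>     )
> ''')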

Any help appreciated.

Alexi Kostibas
Engineering
Nuna
650 Townsend Street, Suite 425
San Francisco, CA 94103
