Hi, I have an application where I'm filtering data with Spark SQL using simple WHERE clauses. I also want the ability to show the unmatched rows for any filter, so I wrap the original clause in `NOT()` to get the inverse. Example:
Filter:         username is not null
Inverse filter: NOT(username is not null)

This worked fine in Spark 1.6. After upgrading to Spark 2.0.2, the inverse filter always returns zero results. It looks like a problem with how the filter is pushed down to Parquet: the pushdown includes both the "is not null" filter AND "not(is not null)", which would obviously result in zero matches. An example in pyspark:

> x = spark.sql('select my_id from my_table where username is not null')
> y = spark.sql('select my_id from my_table where not(username is not null)')

> x.explain()
== Physical Plan ==
*Project [my_id#6L]
+- *Filter isnotnull(username#91)
   +- *BatchedScan parquet default.my_table[my_id#6L,username#91]
      Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
      PartitionFilters: [], PushedFilters: [IsNotNull(username)],
      ReadSchema: struct<my_id:bigint,username:string>

> y.explain()
== Physical Plan ==
*Project [my_id#6L]
+- *Filter (isnotnull(username#91) && NOT isnotnull(username#91))
   +- *BatchedScan parquet default.my_table[my_id#6L,username#91]
      Format: ParquetFormat, InputPaths: s3://my-path/my.parquet,
      PartitionFilters: [],
      PushedFilters: [IsNotNull(username), Not(IsNotNull(username))],
      ReadSchema: struct<my_id:bigint,username:string>

Presently I'm working around this with the new NOT EXISTS support in Spark 2, but that seems like overkill. Any help appreciated.

Alexi Kostibas
Engineering
Nuna
650 Townsend Street, Suite 425
San Francisco, CA 94103
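P.S. In case it helps anyone hitting the same thing, here is a rough sketch of the NOT EXISTS workaround, plus an untested alternative that disables Parquet filter pushdown entirely. The sketch assumes `my_id` is a unique row key (the correlated subquery needs one to behave like a plain row filter); adjust for your own schema.

    # Inverse filter via NOT EXISTS. Assumes my_id uniquely identifies rows,
    # so the correlated subquery acts as a per-row predicate.
    y = spark.sql("""
        select t.my_id
        from my_table t
        where not exists (
            select 1 from my_table s
            where s.my_id = t.my_id
              and s.username is not null
        )
    """)

    # Possible alternative (untested): turn off Parquet filter pushdown so the
    # contradictory pushed filters never reach the Parquet reader. This trades
    # away pushdown performance for the whole session.
    spark.conf.set("spark.sql.parquet.filterPushdown", "false")
    y2 = spark.sql("select my_id from my_table where not(username is not null)")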