Hi! First-time poster, long-time reader. I'm wondering if there is a way to let Catalyst know that it doesn't need to repeat a filter on the Spark side after the filter has already been applied by a source implementing PrunedFilteredScan.
This is for a use case in which we accept a filter on a non-existent column that serves as an entry point for our integration with a different system. While the source can correctly handle this filter, the secondary filter applied to the RDD itself wipes out the results, because the column being filtered on does not exist. Specifically, this is in our integration with Solr, where we allow users to pass in a predicate based on "solr_query" (as in `where solr_query='*:*'`). There is no actual column "solr_query", so the subsequent rdd.filter(row.solr_query == "*:*") removes all of the data, since no rows will have that column.

I'm thinking about a few solutions to this, but they all seem a little hacky:

1) Try to manually remove the filter step from the query plan after our source handles the filter.
2) Populate the solr_query field in the returned rows so they all automatically pass.

But I think the real solution is to add a way to create a PrunedFilteredScan which does not reapply filters if the source doesn't want it to, i.e. give PrunedFilteredScan the ability to trust the underlying source that the filter will be accurately applied. Maybe change the API to

PrunedFilteredScan(requiredColumns: Array[String], filters: Array[Filter], reapply: Boolean = true)

where Catalyst can check the reapply value and not add an RDD.filter if it is false.

Thoughts?

Thanks for your time,
Russ
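To make the proposal concrete, here is a rough sketch of what such a trait could look like. This is only an illustration, not the existing API: the trait and member names below (TrustedPrunedFilteredScan, reapplyFilters) are hypothetical, and today's org.apache.spark.sql.sources.PrunedFilteredScan only defines buildScan(requiredColumns, filters).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.{BaseRelation, Filter}

// Hypothetical variant of PrunedFilteredScan: besides building the scan,
// the relation can declare that the pushed-down filters were fully applied
// by the source, so Catalyst should not re-apply them on the returned RDD.
trait TrustedPrunedFilteredScan extends BaseRelation {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

  // When false, the planner would skip adding the post-scan rdd.filter(...)
  // step for the filters that were handed to buildScan.
  def reapplyFilters: Boolean = true
}
```

A per-filter variant (e.g. having the source report which filters it actually handled) might be more flexible than a single boolean, since a source could fully handle a solr_query predicate while still needing Spark to re-check other predicates.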