Github user viirya commented on the pull request: https://github.com/apache/spark/pull/8922#issuecomment-146215917

This implementation only considers the use case of evaluating a single attribute with a UDF and comparing the result with a literal value. We consider only this case because, in the current implementation of `selectFilters` in `DataSourceStrategy`, only predicates involving an attribute and a literal value (e.g., `col = 1`, `col2 > 2`) are selected as candidates for pushdown. Besides, the form `udf(column) = 'ABCDE...'` is common and widely used in our SQL queries that apply UDFs in filter conditions.

Your proposal looks good and is very general. However, I am a little worried about the performance regression introduced by creating a row for each input value and evaluating the predicate on that row. This patch helps us reduce the memory footprint required for loading a lot of data from Parquet files. As for performance, the improvement is not significant, but it is at least competitive with the case of not pushing down.

I agree that this patch introduces additional complexity to the API. If you still think it is not worth it, I will close this PR for now. Thanks for the review and suggestions.
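For context, here is a minimal sketch of the selection behavior described above. It is written in Python for self-containment (the real code is Scala in Spark's `DataSourceStrategy.selectFilters`), and every class and function name below is an illustrative stand-in, not Spark's actual API: only attribute-vs-literal predicates translate into pushable filters, while a UDF-wrapped column falls through, which is the gap this PR targets.

```python
# Hypothetical sketch of the predicate-selection logic; names are
# illustrative stand-ins, not the real Catalyst/Spark classes.
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

@dataclass(frozen=True)
class Attribute:          # a bare column reference, e.g. `col`
    name: str

@dataclass(frozen=True)
class Literal:            # a constant, e.g. 1 or 'ABCDE...'
    value: Any

@dataclass(frozen=True)
class UDF:                # a UDF applied to a column, e.g. udf(col)
    fn: Callable[[Any], Any]
    child: Any

@dataclass(frozen=True)
class EqualTo:            # the predicate `left = right`
    left: Any
    right: Any

def translate(pred) -> Optional[Tuple[str, str, Any]]:
    """Translate one predicate into a pushable data-source filter.

    Only the attribute-vs-literal shape (e.g. col = 1) is pushed down.
    A shape like udf(col) = 'ABCDE...' does not match and is skipped.
    """
    if (isinstance(pred, EqualTo)
            and isinstance(pred.left, Attribute)
            and isinstance(pred.right, Literal)):
        return ("EqualTo", pred.left.name, pred.right.value)
    return None  # not pushable: evaluated after the scan instead

def select_filters(predicates):
    """Keep only the predicates that translated into filters."""
    return [f for f in map(translate, predicates) if f is not None]
```

For example, given `[EqualTo(Attribute("col"), Literal(1)), EqualTo(UDF(str.upper, Attribute("col")), Literal("ABCDE"))]`, only the first predicate survives `select_filters`; the UDF predicate is left to be evaluated row by row after the scan.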