Xiao Li created SPARK-18766:
-------------------------------

                 Summary: Push Down Filter Through BatchEvalPython
                 Key: SPARK-18766
                 URL: https://issues.apache.org/jira/browse/SPARK-18766
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.0.2
            Reporter: Xiao Li
Currently, when users use a Python UDF in a Filter, {{BatchEvalPython}} is always generated below {{FilterExec}}. However, not all of the predicates need to be evaluated after the Python UDF runs; predicates that do not reference the UDF's output can be pushed down through {{BatchEvalPython}}.

{noformat}
>>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")],
...                            ["key", "value"])
>>> from pyspark.sql.functions import udf, col
>>> from pyspark.sql.types import BooleanType
>>> my_filter = udf(lambda a: a < 2, BooleanType())
>>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) &
...                                                  (df.value < "2"))
>>> sel.explain(True)
{noformat}

{noformat}
== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
   +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
      +- Scan ExistingRDD[key#0L,value#1]
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
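The proposed pushdown amounts to partitioning the conjuncts of the Filter condition by whether they reference the attribute produced by {{BatchEvalPython}}. A minimal illustrative sketch of that partitioning (not the actual Catalyst rule; the function name and the `(expression, referenced-columns)` representation are hypothetical):

```python
# Illustrative sketch, NOT Spark's actual optimizer rule: split a
# conjunctive filter condition into predicates that can be evaluated
# below BatchEvalPython (they reference only input columns) and
# predicates that must stay above it (they reference the attribute
# produced by the Python UDF).

def split_conjuncts(predicates, udf_outputs):
    """predicates: list of (expr_string, referenced_columns) pairs;
    udf_outputs: set of attribute names produced by BatchEvalPython."""
    pushable, kept = [], []
    for expr, refs in predicates:
        if refs & udf_outputs:
            kept.append(expr)      # depends on the Python UDF result
        else:
            pushable.append(expr)  # safe to evaluate below BatchEvalPython
    return pushable, kept

# For the plan above: pythonUDF0#9 is the UDF output, so the null check
# and the value comparison are pushable, while the UDF predicate stays.
pushable, kept = split_conjuncts(
    [("isnotnull(value#1)", {"value#1"}),
     ("pythonUDF0#9", {"pythonUDF0#9"}),
     ("(value#1 < 2)", {"value#1"})],
    {"pythonUDF0#9"},
)
```

In this sketch, `pushable` ends up holding the null check and the value comparison, while `kept` retains only the UDF predicate; only the latter would remain in the Filter above {{BatchEvalPython}}.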