Xiao Li created SPARK-18766:
-------------------------------

             Summary: Push Down Filter Through BatchEvalPython
                 Key: SPARK-18766
                 URL: https://issues.apache.org/jira/browse/SPARK-18766
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 2.0.2
            Reporter: Xiao Li


Currently, when users use a Python UDF in {{Filter}}, {{BatchEvalPython}} is 
always generated below {{FilterExec}}. However, not all of the predicates need 
to be evaluated after the Python UDF runs; the predicates that do not depend 
on the UDF's output can be pushed down through {{BatchEvalPython}}.

{noformat}
>>> df = spark.createDataFrame([(1, "1"), (2, "2"), (1, "2"), (1, "2")], ["key", "value"])
>>> from pyspark.sql.functions import udf, col
>>> from pyspark.sql.types import BooleanType
>>> my_filter = udf(lambda a: a < 2, BooleanType())
>>> sel = df.select(col("key"), col("value")).filter((my_filter(col("key"))) & (df.value < "2"))
>>> sel.explain(True)
{noformat}

{noformat}
== Physical Plan ==
*Project [key#0L, value#1]
+- *Filter ((isnotnull(value#1) && pythonUDF0#9) && (value#1 < 2))
   +- BatchEvalPython [<lambda>(key#0L)], [key#0L, value#1, pythonUDF0#9]
      +- Scan ExistingRDD[key#0L,value#1]
{noformat}
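The core of the proposed rule is splitting the filter's conjuncts by whether they reference a Python UDF result. A minimal Spark-independent sketch of that split (the {{needs_udf}} flag is a hypothetical stand-in for Catalyst's check for {{PythonUDF}} references in an expression):

{noformat}
# Sketch of the pushdown idea, independent of Spark's Catalyst optimizer.
# Each predicate is modeled as (name, needs_udf, fn); the real rule would
# instead inspect Catalyst expression trees for PythonUDF attributes.

def split_conjuncts(predicates):
    """Partition filter conjuncts into those evaluable before the
    Python UDF stage (pushable) and those that must run after it."""
    pushable = [p for p in predicates if not p[1]]
    retained = [p for p in predicates if p[1]]
    return pushable, retained

# Conjuncts from the example plan above:
predicates = [
    ("isnotnull(value)", False, lambda row: row["value"] is not None),
    ("my_filter(key)",   True,  lambda row: row["key"] < 2),
    ("value < '2'",      False, lambda row: row["value"] < "2"),
]

pushable, retained = split_conjuncts(predicates)
# pushable conjuncts can run below BatchEvalPython, shrinking the number
# of rows serialized to the Python worker; retained ones stay above it.
{noformat}

With this split, only {{my_filter(key)}} remains above {{BatchEvalPython}}, while the null check and the {{value < "2"}} comparison filter rows before they are shipped to the Python worker.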




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
