GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/23248

    [SPARK-26293][SQL] Cast exception when having python udf in subquery

    ## What changes were proposed in this pull request?
    
    This is a regression introduced by 
https://github.com/apache/spark/pull/22104 at Spark 2.4.0.
    
    When we have Python UDF in subquery, we will hit an exception
    ```
    Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to 
org.apache.spark.sql.catalyst.expressions.PythonUDF
        at scala.collection.immutable.Stream.map(Stream.scala:414)
        at 
org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$2(EvalPythonExec.scala:98)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
    ...
    ```
    
    https://github.com/apache/spark/pull/22104 turned `ExtractPythonUDFs` from 
a physical rule to optimizer rule. However, there is a difference between a 
physical rule and optimizer rule. A physical rule always runs once, an 
optimizer rule may be applied twice on a query tree even the rule is located in 
a batch that only runs once.
    
    For a subquery, the `OptimizeSubqueries` rule will execute the entire 
optimizer on the query plan inside subquery. Later on subquery will be turned 
to joins, and the optimizer rules will be applied to it again.
    
    Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's 
applied twice on a query plan inside subquery, it will produce a malformed 
plan. It extracts Python UDF from Python exec plans.
    
    This PR proposes 2 changes to be double safe:
    1. `ExtractPythonUDFs` should skip python exec plans, to make the rule 
idempotent
    2. `ExtractPythonUDFs` should skip subquery
    
    ## How was this patch tested?
    
    a new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark python

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23248.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23248
    
----
commit 9477fb09b850b981862cb72b0ebdebc5b404a082
Author: Wenchen Fan <wenchen@...>
Date:   2018-12-06T11:16:04Z

    python udf in subquery

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to