Tim Sell created SPARK-17100:
--------------------------------

             Summary: pyspark filter on a udf column after join gives 
java.lang.UnsupportedOperationException
                 Key: SPARK-17100
                 URL: https://issues.apache.org/jira/browse/SPARK-17100
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
         Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
            Reporter: Tim Sell


In pyspark, when filtering on a udf derived column after some join types,
the optimized logical plan results is a java.lang.UnsupportedOperationException.

I could not replicate this in scala code from the shell, just python. It is a 
pyspark regression from spark 1.6.2.

This can be replicated with: bin/spark-submit bug.py

{code:python:title=bug.py}
import pyspark.sql.functions as F
from pyspark.sql import Row, SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName("test").getOrCreate()
    left = spark.createDataFrame([Row(a=1)])
    right = spark.createDataFrame([Row(a=1)])
    df = left.join(right, on='a', how='left_outer')
    df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
    df = df.filter('b = "x"')
    df.explain(extended=True)
{code}

The output is:
{code}
== Parsed Logical Plan ==
'Filter ('b = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Analyzed Logical Plan ==
a: bigint, b: string
Filter (b#8 = x)
+- Project [a#0L, <lambda>(a#0L) AS b#8]
   +- Project [a#0L]
      +- Join LeftOuter, (a#0L = a#3L)
         :- LogicalRDD [a#0L]
         +- LogicalRDD [a#3L]

== Optimized Logical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
<lambda>(input[0, bigint, true])
== Physical Plan ==
java.lang.UnsupportedOperationException: Cannot evaluate expression: 
<lambda>(input[0, bigint, true])
{code}


It fails when the join is:

* how='outer', on=column expression
* how='left_outer', on=string or column expression
* how='right_outer', on=string or column expression

It passes when the join is:

* how='inner', on=string or column expression
* how='outer', on=string

I made some tests to demonstrate each of these.

Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to