[ 
https://issues.apache.org/jira/browse/SPARK-48666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Zheng updated SPARK-48666:
------------------------------
    Description: 
We should avoid pushing down Unevaluable expression as it can cause unexpected 
failures. For example, the code snippet below (assuming there is a table 
{{_t_}} with a partition column {{{_}p{_})}}
{code:java}
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

import pyspark.sql.functions as f

def getdata(p: str) -> str:
    return "data"

NEW_COLUMN = 'new_column'
P_COLUMN = 'p'

f_getdata = f.udf(getdata, StringType())
rows = spark.sql("select * from default.t")

table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))

df = table.alias('t1').join(table.alias('t2'), (f.col(f"t1.{NEW_COLUMN}") == 
f.col(f"t2.{NEW_COLUMN}")), how='inner')

df.show(){code}
will cause an error like:
{code:java}
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: 
getdata(input[0, string, true])#16
    at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    at 
org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
    at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
    at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
    at 
org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
    at 
org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
    at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
 {code}
 
 

  was:
We should avoid pushing down Unevaluable expression as it can cause unexpected 
failures. For example, the code snippet below
{code:java}
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

import pyspark.sql.functions as f

def getdata(p: str) -> str:
    return "data"

NEW_COLUMN = 'new_column'
P_COLUMN = 'p'

f_getdata = f.udf(getdata, StringType())
rows = spark.sql("select * from default.t")

table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))

df = table.alias('t1').join(table.alias('t2'), (f.col(f"t1.{NEW_COLUMN}") == 
f.col(f"t2.{NEW_COLUMN}")), how='inner')

df.show(){code}
will cause an error like:
{code:java}
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: 
getdata(input[0, string, true])#16
    at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    at 
org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
    at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
    at 
org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
    at 
org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
    at 
org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
    at 
org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
 {code}
 
 


> A filter should not be pushed down if it contains Unevaluable expression
> ------------------------------------------------------------------------
>
>                 Key: SPARK-48666
>                 URL: https://issues.apache.org/jira/browse/SPARK-48666
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Wei Zheng
>            Priority: Major
>
> We should avoid pushing down Unevaluable expression as it can cause 
> unexpected failures. For example, the code snippet below (assuming there is a 
> table {{_t_}} with a partition column {{{_}p{_})}}
> {code:java}
> from pyspark import SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType
> import pyspark.sql.functions as f
> def getdata(p: str) -> str:
>     return "data"
> NEW_COLUMN = 'new_column'
> P_COLUMN = 'p'
> f_getdata = f.udf(getdata, StringType())
> rows = spark.sql("select * from default.t")
> table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))
> df = table.alias('t1').join(table.alias('t2'), (f.col(f"t1.{NEW_COLUMN}") == 
> f.col(f"t2.{NEW_COLUMN}")), how='inner')
> df.show(){code}
> will cause an error like:
> {code:java}
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: 
> getdata(input[0, string, true])#16
>     at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>     at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>     at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
>     at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
>     at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
>     at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
>     at 
> org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
>     at 
> org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
>     at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
>     at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to