[ https://issues.apache.org/jira/browse/SPARK-48666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei Zheng updated SPARK-48666:
------------------------------
    Description:
We should avoid pushing down Unevaluable expressions, as doing so can cause unexpected failures. For example, the code snippet below (assuming there is a table {{_t_}} with a partition column {{_p_}})
{code:python}
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
import pyspark.sql.functions as f

# In the pyspark shell `spark` already exists; otherwise obtain a session.
spark = SparkSession.builder.getOrCreate()

def getdata(p: str) -> str:
    return "data"

NEW_COLUMN = 'new_column'
P_COLUMN = 'p'

# Wrap the Python function as a UDF over the partition column.
f_getdata = f.udf(getdata, StringType())
rows = spark.sql("select * from default.t")
table = rows.withColumn(NEW_COLUMN, f_getdata(f.col(P_COLUMN)))

# Self-join on the UDF-derived column.
df = table.alias('t1').join(
    table.alias('t2'),
    (f.col(f"t1.{NEW_COLUMN}") == f.col(f"t2.{NEW_COLUMN}")),
    how='inner')
df.show()
{code}
will cause an error like:
{code:java}
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: getdata(input[0, string, true])#16
    at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:66)
    at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:391)
    at org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:390)
    at org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:71)
    at org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:384)
    at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:166)
    at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:165)
{code}

> A filter should not be pushed down if it contains Unevaluable expression
> ------------------------------------------------------------------------
>
>                 Key: SPARK-48666
>                 URL: https://issues.apache.org/jira/browse/SPARK-48666
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Wei Zheng
>            Priority: Major
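
The stack trace suggests that a predicate of the form IsNotNull over the Python UDF reached partition pruning, where it was evaluated in interpreted mode and failed. The guard the issue title asks for can be sketched with a toy expression tree. This is only an illustration in plain Python, not Spark's actual code: the names Expr, contains_unevaluable, and split_pushable are hypothetical, and the evaluable flag merely stands in for Spark's Unevaluable trait.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Expr:
    """A tiny stand-in for a Catalyst expression node (hypothetical)."""
    name: str
    children: Tuple["Expr", ...] = ()
    evaluable: bool = True  # False models an Unevaluable expression


def contains_unevaluable(expr: Expr) -> bool:
    """Return True if expr or any descendant cannot be evaluated directly."""
    return (not expr.evaluable) or any(
        contains_unevaluable(child) for child in expr.children
    )


def split_pushable(filters: List[Expr]) -> Tuple[List[Expr], List[Expr]]:
    """Split filters into (safe to push down, must stay in the plan)."""
    pushable = [e for e in filters if not contains_unevaluable(e)]
    retained = [e for e in filters if contains_unevaluable(e)]
    return pushable, retained


# Model the two kinds of predicate: one over the raw partition column,
# one over the Python UDF applied to it (as in the failing trace).
p = Expr("p")
udf_call = Expr("getdata", (p,), evaluable=False)
filters = [Expr("IsNotNull", (udf_call,)), Expr("IsNotNull", (p,))]

pushable, retained = split_pushable(filters)
print([f"{e.name}({e.children[0].name})" for e in pushable])
print([f"{e.name}({e.children[0].name})" for e in retained])
```

In this sketch only the predicate over the plain partition column is handed to pruning; the one wrapping the UDF stays in the plan, where it can be evaluated by the regular Python UDF execution path.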