[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634122#comment-15634122 ]
Michael Armbrust commented on SPARK-18254: ------------------------------------------ Is this yet another bug caused by the generic operator push down? Can we turn that off? > UDFs don't see aliased column names > ----------------------------------- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 > Reporter: Nicholas Chammas > Assignee: Davies Liu > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. > (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org