[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-18254:
-------------------------------------
    Description: 

Dunno if I'm misinterpreting something here, but this seems like a bug in how UDFs work, or in how they interface with the optimizer.

Here's a basic reproduction. I'm using {{length_udf()}} just for illustration; it could be any UDF that accesses fields that have been aliased.

{code}
import pyspark
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, struct


def length(full_name):
    # The non-aliased names, FIRST and LAST, show up here, instead of
    # first_name and last_name.
    # print(full_name)
    return len(full_name.first_name) + len(full_name.last_name)


if __name__ == '__main__':
    spark = (
        pyspark.sql.SparkSession.builder
        .getOrCreate())

    length_udf = udf(length)

    names = spark.createDataFrame([
        Row(FIRST='Nick', LAST='Chammas'),
        Row(FIRST='Walter', LAST='Williams'),
    ])

    names_cleaned = (
        names
        .select(
            col('FIRST').alias('first_name'),
            col('LAST').alias('last_name'),
        )
        .withColumn('full_name', struct('first_name', 'last_name'))
        .select('full_name'))

    # We see the schema we expect here.
    names_cleaned.printSchema()

    # However, here we get an AttributeError. length_udf() cannot
    # find first_name or last_name.
    (names_cleaned
        .withColumn('length', length_udf('full_name'))
        .show())
{code}

When I run this I get a long stack trace, but the relevant portions seem to be:

{code}
  File ".../udf-alias.py", line 10, in length
    return len(full_name.first_name) + len(full_name.last_name)
  File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", line 1502, in __getattr__
    raise AttributeError(item)
AttributeError: first_name
{code}

{code}
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", line 1497, in __getattr__
    idx = self.__fields__.index(item)
ValueError: 'first_name' is not in list
{code}

Here are the relevant query plans:

{code}
names_cleaned.explain()

== Physical Plan ==
*Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS full_name#10]
+- Scan ExistingRDD[FIRST#0,LAST#1]
{code}

{code}
(names_cleaned
    .withColumn('length', length_udf('full_name'))
    .explain())

== Physical Plan ==
*Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS full_name#10, pythonUDF0#21 AS length#17]
+- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, pythonUDF0#21]
+- Scan ExistingRDD[FIRST#0,LAST#1]
{code}

  was:
Dunno if I'm misinterpreting something here, but this seems like a bug in how UDFs work, or in how they interface with the optimizer.

Here's a basic reproduction:

[same reproduction, stack traces, and query plans as above]


> UDFs don't see aliased column names
> -----------------------------------
>
>                 Key: SPARK-18254
>                 URL: https://issues.apache.org/jira/browse/SPARK-18254
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.1
>        Environment: Python 3.5, Java 8
>            Reporter: Nicholas Chammas
>


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
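A plain-Python sketch of the failure mode, for anyone triaging this. It uses {{collections.namedtuple}} as a stand-in for {{pyspark.sql.Row}} (the stand-in, its names, and the positional-access workaround are illustrations, not part of the report): after optimization the struct handed to the UDF carries the pre-alias field names {{FIRST}}/{{LAST}}, so attribute access on the aliased names raises, while positional access does not depend on field names at all and may serve as a workaround.

```python
from collections import namedtuple

# Stand-in for the Row the UDF actually receives: per the BatchEvalPython
# plan, the struct is built from FIRST#0 and LAST#1, not the aliases.
OptimizedRow = namedtuple('OptimizedRow', ['FIRST', 'LAST'])
full_name = OptimizedRow(FIRST='Nick', LAST='Chammas')

def length_by_name(row):
    # What the reported UDF does: raises AttributeError, because the
    # row's fields here are named FIRST/LAST, not first_name/last_name.
    return len(row.first_name) + len(row.last_name)

def length_by_position(row):
    # Positional access works regardless of whether the aliases survived.
    return len(row[0]) + len(row[1])

try:
    length_by_name(full_name)
except AttributeError as exc:
    print('AttributeError:', exc)  # mirrors the reported stack trace

print(length_by_position(full_name))  # len('Nick') + len('Chammas') = 11
```

Rewriting the UDF to take the two string columns as separate arguments (e.g. {{length_udf('first_name', 'last_name')}}) would sidestep the struct's field names entirely, at the cost of changing the UDF's signature.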