Nicholas Chammas created SPARK-18254:
----------------------------------------
             Summary: UDFs don't see aliased column names; somehow they get the original names
                 Key: SPARK-18254
                 URL: https://issues.apache.org/jira/browse/SPARK-18254
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.1
         Environment: Python 3.5, Java 8
            Reporter: Nicholas Chammas

Dunno if I'm misinterpreting something here, but this seems like a bug in how UDFs work, or in how they interface with the optimizer.

Here's a basic reproduction:

{code}
import pyspark
from pyspark.sql import Row
from pyspark.sql.functions import udf, col, struct


def length(full_name):
    # The non-aliased names, FIRST and LAST, show up here, instead of
    # first_name and last_name.
    # print(full_name)
    return len(full_name.first_name) + len(full_name.last_name)


if __name__ == '__main__':
    spark = (
        pyspark.sql.SparkSession.builder
        .getOrCreate())

    length_udf = udf(length)

    names = spark.createDataFrame([
        Row(FIRST='Nick', LAST='Chammas'),
        Row(FIRST='Walter', LAST='Williams'),
    ])

    names_cleaned = (
        names
        .select(
            col('FIRST').alias('first_name'),
            col('LAST').alias('last_name'),
        )
        .withColumn('full_name', struct('first_name', 'last_name'))
        .select('full_name'))

    # We see the schema we expect here.
    names_cleaned.printSchema()

    # However, here we get an AttributeError. length_udf() cannot
    # find first_name or last_name.
    (names_cleaned
        .withColumn('length', length_udf('full_name'))
        .show())
{code}
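
A possible stopgap, sketched here without Spark: if the struct really does reach the UDF with its fields named FIRST and LAST, then indexing the fields by position should not depend on which names survive. The `FullName` namedtuple below is only a stand-in for the `pyspark.sql.Row` the UDF receives, and `length_by_position` and `length_by_alias` are hypothetical names used for illustration. This assumes the fields arrive in the order they were passed to `struct()`; I have not confirmed it against the affected Spark version.

```python
from collections import namedtuple

# Stand-in (assumption) for the Row the UDF receives. Per the report, its
# fields carry the original names FIRST and LAST, not the aliases.
FullName = namedtuple('FullName', ['FIRST', 'LAST'])


def length_by_position(full_name):
    # Index by position so the result does not depend on the field names.
    return len(full_name[0]) + len(full_name[1])


def length_by_alias(full_name):
    # Mirrors the UDF in the reproduction: attribute access by alias.
    return len(full_name.first_name) + len(full_name.last_name)


nick = FullName(FIRST='Nick', LAST='Chammas')
print(length_by_position(nick))  # 11

try:
    length_by_alias(nick)
except AttributeError as exc:
    # Same class of failure as in the reproduction.
    print('AttributeError:', exc)
```

This only sidesteps the symptom; it says nothing about why the aliased names are lost in the first place.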