Alberto Bonsanto created SPARK-17760:
----------------------------------------
Summary: DataFrame's pivot doesn't see column created in groupBy
Key: SPARK-17760
URL: https://issues.apache.org/jira/browse/SPARK-17760
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Environment: Databricks community edition, Spark 2.0.0, PySpark, Python 2.
Reporter: Alberto Bonsanto
Priority: Minor

Related to [https://stackoverflow.com/questions/39817993/pivoting-with-missing-values].

I'm not completely sure whether this is a bug or expected behavior. When you {{groupBy}} a column that is generated inside the {{groupBy}} call itself, the {{pivot}} method apparently cannot resolve that column during analysis. E.g.

{code:python}
from pyspark.sql.functions import col, hour, dayofyear

df = (sc.parallelize([(1.0, "2016-03-30 01:00:00"),
                      (30.2, "2015-01-02 03:00:02")])
      .toDF(["amount", "Date"])
      .withColumn("Date", col("Date").cast("timestamp")))

(df.withColumn("hour", hour("Date"))
 .groupBy(dayofyear("Date").alias("date"))
 .pivot("hour").sum("amount").show())
{code}

This raises the following exception:

{quote}
AnalysisException: u'resolved attribute(s) date#140688 missing from dayofyear(date)#140994,hour#140977,sum(`amount`)#140995 in operator !Aggregate \[dayofyear(cast(date#140688 as date))], [dayofyear(cast(date#140688 as date)) AS dayofyear(date)#140994, pivotfirst(hour#140977, sum(`amount`)#140995, 1, 3, 0, 0) AS __pivot_sum(`amount`) AS `sum(``amount``)`#141001\];'
{quote}

To work around it, you have to add the {{date}} column with {{withColumn}} before grouping and pivoting.