Alberto Bonsanto created SPARK-17760:
----------------------------------------

             Summary: DataFrame's pivot doesn't see column created in groupBy
                 Key: SPARK-17760
                 URL: https://issues.apache.org/jira/browse/SPARK-17760
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
         Environment: Databricks Community Edition, Spark 2.0.0, PySpark, Python 2
            Reporter: Alberto Bonsanto
            Priority: Minor


Related to 
[https://stackoverflow.com/questions/39817993/pivoting-with-missing-values]. 
I'm not completely sure if this is a bug or expected behavior.

When you {{groupBy}} a column that is created inside the {{groupBy}} call itself (here via {{alias}}), the {{pivot}} method apparently cannot resolve that column during analysis.

E.g.
{code:python}
from pyspark.sql.functions import col, dayofyear, hour

df = (sc.parallelize([(1.0, "2016-03-30 01:00:00"),
                      (30.2, "2015-01-02 03:00:02")])
        .toDF(["amount", "Date"])
        .withColumn("Date", col("Date").cast("timestamp")))

# Grouping by a column that is derived (and aliased) inside groupBy itself:
(df.withColumn("hour", hour("date"))
   .groupBy(dayofyear("date").alias("date"))
   .pivot("hour")
   .sum("amount")
   .show())
{code}

This throws the following exception:

{quote}
AnalysisException: u'resolved attribute(s) date#140688 missing from 
dayofyear(date)#140994,hour#140977,sum(`amount`)#140995 in operator !Aggregate 
\[dayofyear(cast(date#140688 as date))], [dayofyear(cast(date#140688 as date)) 
AS dayofyear(date)#140994, pivotfirst(hour#140977, sum(`amount`)#140995, 1, 3, 
0, 0) AS __pivot_sum(`amount`) AS `sum(``amount``)`#141001\];'
{quote}

To work around it, you have to add the derived column with {{withColumn}} before grouping and pivoting.
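
A minimal sketch of that workaround, reusing the {{df}} defined above (the {{day}} column name is only illustrative):

{code:python}
from pyspark.sql.functions import dayofyear, hour

# Materialize both derived columns with withColumn first, then group and
# pivot on columns that already exist in the DataFrame.
(df.withColumn("hour", hour("Date"))
   .withColumn("day", dayofyear("Date"))
   .groupBy("day")
   .pivot("hour")
   .sum("amount")
   .show())
{code}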


