Hi
There are two ways of doing it.
1. Using SQL - the query directly creates another DataFrame, and you control
the column names through aliases in the select list.
2. Using methods of the DataFrame object - in that case you map each result
row into a Row object with the field names you want, and then call
createDataFrame again, which infers the schema from those Rows.
Here is the Python code:
Method 1:
userStat = ssc.sql("select userId, sum(rating) total from ratings group by userId")
print userStat.collect()[10]
userStat.printSchema()
Method 2:
from pyspark.sql import Row

# Group the raw ratings DataFrame (the one behind the "ratings" table above),
# then map each result row into a Row with the field names we want;
# createDataFrame infers the schema from those Rows.
userStatDF = ratings.groupBy("userId").sum("rating").map(
    lambda t: Row(userId=t[0], total=t[1]))
userStatDFSchema = ssc.createDataFrame(userStatDF)
print type(userStatDFSchema)
userStatDFSchema.printSchema()
Output:
Row(userId=233, total=478)
root
|-- userId: long (nullable = true)
|-- total: long (nullable = true)
root
|-- total: long (nullable = true)
|-- userId: long (nullable = true)
As you can see, the downside of Method 2 is that the field order is inferred
(the Row fields are most likely stored in a dict under the hood), so the
columns come out ordered alphabetically.
Hope this helps
Best
Ayan
On Tue, Apr 21, 2015 at 6:06 PM, Justin Yip wrote:
> Hello,
>
> I would like to rename a column after aggregation. In the following code, the
> column name is "SUM(_1#179)", is there a way to rename it to a more
> friendly name?
>
> scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
> scala> d.groupBy("_1").sum().printSchema
> root
> |-- _1: integer (nullable = false)
> |-- SUM(_1#179): long (nullable = true)
> |-- SUM(_2#180): long (nullable = true)
>
> Thanks.
>
> Justin
>
> --
> View this message in context: Column renaming after DataFrame.groupBy
> <http://apache-spark-user-list.1001560.n3.nabble.com/Column-renaming-after-DataFrame-groupBy-tp22586.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>
--
Best Regards,
Ayan Guha