Yes, it results in a shuffle.
> On Dec 4, 2015, at 6:04 PM, Stephen Boesch wrote:
>
> @Yu Fengdong: Your approach - specifically the groupBy results in a shuffle
> does it not?
>
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu
Hi there,
I have my data stored in HDFS partitioned by month in Parquet format.
The directory looks like this:
-month=201411
-month=201412
-month=201501
-
I want to compute some aggregates for every timestamp.
How is it possible to achieve that by taking advantage of the existing
@Yu Fengdong: Your approach, specifically the groupBy, results in a
shuffle, does it not?
2015-12-04 2:02 GMT-08:00 Fengdong Yu :
> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
>
>
There are many ways; here is a simple one.
For example, if you want to know how many rows there are for each month:
sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
the output looks like:
month   count
201411  100
201412  200
Hope this helps.
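
For reference, a minimal sketch of the same count as a standalone Spark 1.x application, with a call to explain() so you can check whether the physical plan contains an exchange (shuffle) step. The base path, object name, and local master setting are placeholders I made up for illustration, not values from this thread. The sketch assumes that reading the base directory lets partition discovery expose month as a column.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical object name; not from the thread.
object MonthCounts {
  def main(args: Array[String]): Unit = {
    // Local master is an assumption for trying this out; drop it when using spark-submit.
    val conf = new SparkConf().setAppName("MonthCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Placeholder: the directory that contains month=201411, month=201412, ...
    val basePath = "hdfs:///path/to/data"

    // Partition discovery is assumed to turn the month=... directory names
    // into a "month" column on the resulting DataFrame.
    val counts = sqlContext.read.parquet(basePath).groupBy("month").count()

    // Prints the physical plan; the groupBy shows up as an exchange step
    // between a partial and a final aggregation.
    counts.explain()

    // Prints the per-month row counts.
    counts.show()

    sc.stop()
  }
}
```

Since the count is partially aggregated within each input partition before the exchange, the shuffle should carry roughly one row per month per partition rather than the raw data.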
> On Dec 4, 2015, at 5:53 PM, Yiannis