Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes. it results to a shuffle. > On Dec 4, 2015, at 6:04 PM, Stephen Boesch wrote: > > @Yu Fengdong: Your approach - specifically the groupBy results in a shuffle > does it not? > > 2015-12-04 2:02 GMT-08:00 Fengdong Yu

Avoid Shuffling on Partitioned Data

2015-12-04 Thread Yiannis Gkoufas
Hi there, I have my data stored in HDFS partitioned by month in Parquet format. The directory looks like this: -month=201411 -month=201412 -month=201501 - I want to compute some aggregates for every timestamp. How is it possible to achieve that by taking advantage of the existing

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Stephen Boesch
@Yu Fengdong: Your approach - specifically the groupBy results in a shuffle does it not? 2015-12-04 2:02 GMT-08:00 Fengdong Yu : > There are many ways, one simple is: > > such as: you want to know how many rows for each month: > > >

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways, one simple is: such as: you want to know how many rows for each month: sqlContext.read.parquet(“……../month=*”).select($“month").groupBy($”month”).count the output looks like: monthcount 201411100 201412200 hopes help. > On Dec 4, 2015, at 5:53 PM, Yiannis