Yes, it results in a shuffle.
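To make the trade-off concrete, here is a minimal sketch. It assumes a SQLContext named `sqlContext` and a hypothetical base path `/data/events` containing the `month=...` directories; the `$` column syntax needs the implicits import.

```scala
import sqlContext.implicits._

// Reading the base path lets Spark discover the month=... partitions
// and expose "month" as a column.
val df = sqlContext.read.parquet("/data/events")

// groupBy on the partition column still shuffles, but only the
// per-partition partial aggregates move (roughly one row per month),
// so the shuffle is small.
val counts = df.groupBy($"month").count()

// A filter on the partition column is turned into partition pruning,
// so only the matching month=201411 directory is scanned -- no need
// to issue separate SQL queries per month.
val nov2014 = df.filter($"month" === 201411).count()
```

So the groupBy approach does shuffle, but the shuffle is cheap relative to the scan, and partition pruning still avoids reading months you filter out.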
> On Dec 4, 2015, at 6:04 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> @Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, does it not?
>
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@everstring.com>:
> There are many ways; one simple one is the following.
>
> Say you want to know how many rows there are for each month:
>
> sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
>
> The output looks like:
>
> month   count
> 201411  100
> 201412  200
>
> Hope this helps.
>
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS, partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -....
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing partitioning?
> > One naive way I am considering is issuing multiple SQL queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .....
> >
> > then computing the aggregates on the results of each query and combining them at the end.
> >
> > I think there should be a better way, right?
> >
> > Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org