@Yu Fengdong: Your approach (specifically the groupBy) results in a shuffle, does it not?
2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@everstring.com>:

> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
> sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
>
> the output looks like:
>
> month   count
> 201411  100
> 201412  200
>
> hopes help.
>
>> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
>>
>> Hi there,
>>
>> I have my data stored in HDFS partitioned by month in Parquet format.
>> The directory looks like this:
>>
>> -month=201411
>> -month=201412
>> -month=201501
>> -....
>>
>> I want to compute some aggregates for every timestamp.
>> How is it possible to achieve that by taking advantage of the existing partitioning?
>> One naive way I am thinking is issuing multiple sql queries:
>>
>> SELECT * FROM TABLE WHERE month=201411
>> SELECT * FROM TABLE WHERE month=201412
>> SELECT * FROM TABLE WHERE month=201501
>> .....
>>
>> computing the aggregates on the results of each query and combining them in the end.
>>
>> I think there should be a better way right?
>>
>> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
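For readers following along without a Spark cluster, the shape of the suggested `groupBy($"month").count` aggregation can be sketched with plain Scala collections. This is only a stand-in for the DataFrame API (the rows and month values below are made-up sample data, and a real run would read the partitioned Parquet directories instead):

```scala
// Minimal sketch of the month-count aggregation from the thread,
// using plain Scala collections in place of a Spark DataFrame.
object MonthCountSketch {
  def main(args: Array[String]): Unit = {
    // Each row carries its "month" partition value, as it would after
    // reading .../month=* with Spark's partition discovery.
    val rows = Seq(
      Map("month" -> "201411", "value" -> "a"),
      Map("month" -> "201411", "value" -> "b"),
      Map("month" -> "201412", "value" -> "c")
    )

    // Rough equivalent of .groupBy($"month").count
    val counts: Map[String, Int] =
      rows.groupBy(_("month")).map { case (m, rs) => (m, rs.size) }

    // Print in the same "month count" layout shown in the thread.
    counts.toSeq.sortBy(_._1).foreach { case (m, c) => println(s"$m $c") }
  }
}
```

In Spark itself, the single multi-directory read plus `groupBy` replaces the per-month `SELECT ... WHERE month=...` queries from the original question, at the cost of the shuffle the reply above asks about.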