Yes, it results in a shuffle.
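
You can confirm it from the physical plan. A minimal sketch, assuming the
Spark 1.5-era sqlContext API used below (the ".../" path is a placeholder
for your HDFS location):

    import sqlContext.implicits._

    // "month" becomes a partition column when reading the month=* layout.
    val df = sqlContext.read.parquet(".../month=*")

    // groupBy on the partition column still plans an Exchange (shuffle);
    // explain() prints the physical plan, where the Exchange node appears.
    df.groupBy($"month").count().explain()

The shuffle should be cheap here, though: Spark runs a partial (map-side)
aggregation first, so each task only shuffles one pre-aggregated row per
month.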


> On Dec 4, 2015, at 6:04 PM, Stephen Boesch <java...@gmail.com> wrote:
> 
> @Yu Fengdong: Your approach - specifically the groupBy - results in a 
> shuffle, does it not?
> 
> 2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@everstring.com>:
> There are many ways, one simple is:
> 
> such as: you want to know how many rows for each month:
> 
> sqlContext.read.parquet(".../month=*").select($"month").groupBy($"month").count
> 
> 
> the output looks like:
> 
> month     count
> 201411    100
> 201412    200
> 
> 
> Hope this helps.
> 
> 
> 
> > On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
> >
> > Hi there,
> >
> > I have my data stored in HDFS partitioned by month in Parquet format.
> > The directory looks like this:
> >
> > -month=201411
> > -month=201412
> > -month=201501
> > -....
> >
> > I want to compute some aggregates for every timestamp.
> > How is it possible to achieve that by taking advantage of the existing 
> > partitioning?
> > One naive way I am thinking of is issuing multiple SQL queries:
> >
> > SELECT * FROM TABLE WHERE month=201411
> > SELECT * FROM TABLE WHERE month=201412
> > SELECT * FROM TABLE WHERE month=201501
> > .....
> >
> > computing the aggregates on the results of each query and combining them in 
> > the end.
> >
> > I think there should be a better way, right?
> >
> > Thanks
> 
> 
