@Yu Fengdong: Your approach (specifically the groupBy) results in a shuffle, does it not?
2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@everstring.com>:

> There are many ways, one simple is:
>
> such as: you want to know how many rows for each month:
>
> sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
>
> the output looks like:
>
> month   count
> 201411  100
> 201412  200
>
> hopes help.
>
>> On Dec 4, 2015, at 5:53 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
>>
>> Hi there,
>>
>> I have my data stored in HDFS partitioned by month in Parquet format.
>> The directory looks like this:
>>
>> -month=201411
>> -month=201412
>> -month=201501
>> -....
>>
>> I want to compute some aggregates for every timestamp.
>> How is it possible to achieve that by taking advantage of the existing partitioning?
>> One naive way I am thinking is issuing multiple sql queries:
>>
>> SELECT * FROM TABLE WHERE month=201411
>> SELECT * FROM TABLE WHERE month=201412
>> SELECT * FROM TABLE WHERE month=201501
>> .....
>>
>> computing the aggregates on the results of each query and combining them in the end.
>>
>> I think there should be a better way right?
>>
>> Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
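For readers following along without a Spark cluster, the shape of the suggested `groupBy($"month").count` aggregation can be sketched with plain Scala collections. This is only a stand-in for the DataFrame API (the rows and month values below are made-up sample data, and a real run would read the partitioned Parquet directories instead):

```scala
// Minimal sketch of the month-count aggregation from the thread,
// using plain Scala collections in place of a Spark DataFrame.
object MonthCountSketch {
  def main(args: Array[String]): Unit = {
    // Each row carries its "month" partition value, as it would after
    // reading .../month=* with Spark's partition discovery.
    val rows = Seq(
      Map("month" -> "201411", "value" -> "a"),
      Map("month" -> "201411", "value" -> "b"),
      Map("month" -> "201412", "value" -> "c")
    )

    // Rough equivalent of .groupBy($"month").count
    val counts: Map[String, Int] =
      rows.groupBy(_("month")).map { case (m, rs) => (m, rs.size) }

    // Print in the same "month count" layout shown in the thread.
    counts.toSeq.sortBy(_._1).foreach { case (m, c) => println(s"$m $c") }
  }
}
```

In Spark itself, the single multi-directory read plus `groupBy` replaces the per-month `SELECT ... WHERE month=...` queries from the original question, at the cost of the shuffle the reply above asks about.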