Watch out when loading data from gzipped files. Spark cannot split a gzipped
file for parallel reading, so unless you explicitly repartition the RDD
created from such a file, everything you do on that RDD will run on a
single core.
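
For example, something along these lines (an untested sketch; the path and
partition count are just placeholders):

    val raw = sc.textFile("data/2010/01/01/dat.gz")      // a .gz input arrives as a single partition
    val events = raw.repartition(sc.defaultParallelism)  // shuffle once so later stages can use all cores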


On Wed, Apr 2, 2014 at 8:22 PM, K Koh <den...@gmail.com> wrote:

> Hi,
>
> I want to aggregate (time-stamped) event data at the daily, weekly and
> monthly levels. The data is stored in a directory in data/yyyy/mm/dd/dat.gz
> format. For example:
>
>
> Each dat.gz file contains tuples in (datetime, id, value) format. I can
> perform the aggregation as follows:
>
>
>
> but this code doesn't seem efficient because it doesn't exploit the data
> dependencies in the reduce steps. For example, there is no dependency
> between the data of 2010-01 and that of any other month (2010-02, 2010-03,
> ...), so it would be ideal if we could load one month of data on each node
> once and perform all three (daily, weekly and monthly) aggregations.
>
> I think I could use mapPartitions and a big reducer that performs all three
> aggregations, but I'm not sure that is the right way to go.
>
> Is there a more efficient way to perform these aggregations (by loading the
> data once) while keeping the code modular?
>
> Thanks,
> K
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-aggregate-event-data-at-daily-weekly-monthly-level-tp3673.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
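
As for combining the three aggregations: one option is to parse the
(datetime, id, value) tuples once, cache that RDD, and derive the daily,
weekly and monthly keys from it, so the gzipped files are read only once
while each aggregation stays a separate, modular step. A rough, untested
sketch, assuming comma-separated lines whose timestamp starts with
yyyy-MM-dd:

    val parsed = sc.textFile("data/2010/*/*/dat.gz")
      .repartition(sc.defaultParallelism)            // gzipped inputs are not splittable
      .map { line =>
        val Array(ts, id, value) = line.split(",")   // assumes comma-separated fields
        (ts, id, value.toDouble)
      }
      .cache()                                       // read and parse the files only once

    val daily   = parsed.map { case (ts, id, v) => ((ts.take(10), id), v) }.reduceByKey(_ + _) // key: yyyy-MM-dd
    val monthly = parsed.map { case (ts, id, v) => ((ts.take(7),  id), v) }.reduceByKey(_ + _) // key: yyyy-MM
    // weekly follows the same pattern with a datetime -> week-of-year key function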
