Hi Sean,

Thanks a lot for your answer. That explains it, as I was creating thousands
of RDDs, so I guess the communication overhead was the reason why the Spark
job was freezing. After changing the code to use RDDs of pairs and
aggregateByKey, it works just fine, and quite fast.
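
For reference, here is roughly what the working version looks like (a
simplified sketch, assuming sc is the SparkContext as in spark-shell; the
keys, values and the mean computation are made up, not my actual code):

    import org.apache.spark.SparkContext._  // pair RDD functions (Spark 1.2)

    // A single RDD of (key, value) pairs instead of one RDD per key.
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

    // aggregateByKey folds the values of each key into (sum, count),
    // merging partial results per partition before the shuffle.
    val meanByKey = pairs
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1L),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, count) => sum / count }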

Again, thanks a lot for your help.

Greetings,

Juan

2015-02-18 15:35 GMT+01:00 Sean Owen <so...@cloudera.com>:

> At some level, enough RDDs create hundreds of thousands of tiny
> partitions of data, each of which creates a task for each stage. The
> raw overhead of all the message passing can slow things down a lot. I
> would not design something to use an RDD per key. You would generally
> key by some value you want to divide and filter on, and then use a
> *ByKey operation to do your work.
>
> You don't work with individual RDDs this way, but usually that's good
> news. You usually have a lot more flexibility operating just in pure
> Java / Scala to do whatever you need inside your function.
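>
> For example, a rough sketch of what I mean (made-up data, not your code;
> the records are just (category, amount) pairs in a spark-shell session):
>
>     val records = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
>     val wanted = Set("a", "b")   // the key values you want to keep
>     val totals = records
>       .filter { case (k, _) => wanted.contains(k) }  // divide / filter on the key
>       .reduceByKey(_ + _)                            // per-key work, no RDD per key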
>
> On Wed, Feb 18, 2015 at 2:12 PM, Juan Rodríguez Hortalá
> <juan.rodriguez.hort...@gmail.com> wrote:
> > Hi Paweł,
> >
> > Thanks a lot for your answer. I finally got the program to work by using
> > aggregateByKey, but I was wondering why creating thousands of RDDs doesn't
> > work. I think that could be useful for reusing methods that work on RDDs,
> > for example JavaDoubleRDD.stats() (
> > http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaDoubleRDD.html#stats%28%29
> > ). If the groups are small then I can chain groupBy(), collect(),
> > parallelize() and stats(), but that is quite inefficient because it
> > implies moving data to and from the driver, and it also doesn't scale to
> > big groups; on the other hand, if I use aggregateByKey or a similar
> > function then I cannot use stats(), so I have to reimplement it. In
> > general, I was looking for a way to reuse functions I already have that
> > work on RDDs, applying them to groups of data inside an RDD, because I
> > don't see how to apply them directly to each of the groups in a pair RDD.
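> >
> > For concreteness, this is roughly the kind of reimplementation I mean (a
> > minimal sketch with made-up data): since stats() returns an
> > org.apache.spark.util.StatCounter, that same class can be used as the
> > accumulator of aggregateByKey to get per-key stats without collecting to
> > the driver:
> >
> >     import org.apache.spark.util.StatCounter
> >
> >     val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 5.0)))
> >     val statsByKey = pairs.aggregateByKey(new StatCounter())(
> >       (acc, value) => acc.merge(value),  // fold one value into the accumulator
> >       (s1, s2) => s1.merge(s2))          // merge accumulators across partitions
> >     statsByKey.mapValues(s => (s.mean, s.stdev)).collect().foreach(println)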
> >
> > Again, thanks a lot for your answer,
> >
> > Greetings,
> >
> > Juan Rodriguez
> >
> >
> >
> >
> > 2015-02-18 14:56 GMT+01:00 Paweł Szulc <paul.sz...@gmail.com>:
> >>
> >> Maybe you can avoid grouping with groupByKey altogether? What is
> >> your next step after grouping the elements by key? Are you trying to
> >> reduce the values? If so, then I would recommend using a reducing
> >> function like reduceByKey or aggregateByKey. Those will first reduce
> >> the values for each key locally on each node before doing the actual
> >> IO over the network. There will also be no grouping phase, so you will
> >> not run into memory issues.
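> >>
> >> A quick illustration with made-up data (the point is the local combine
> >> step, not the particular function):
> >>
> >>     val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
> >>
> >>     // groupByKey shuffles every value and keeps whole groups in memory:
> >>     //   pairs.groupByKey().mapValues(_.sum)
> >>
> >>     // reduceByKey combines values per key inside each partition first,
> >>     // so only small partial results go over the network:
> >>     val sums = pairs.reduceByKey(_ + _)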
> >>
> >> Please let me know if that helped
> >>
> >> Pawel Szulc
> >> @rabbitonweb
> >> http://www.rabbitonweb.com
> >>
> >>
> >> On Wed, Feb 18, 2015 at 12:06 PM, Juan Rodríguez Hortalá
> >> <juan.rodriguez.hort...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> I'm writing a Spark program where I want to divide an RDD into
> >>> different groups, but the groups are too big to use groupByKey. To
> >>> cope with that, since I know in advance the list of keys for each
> >>> group, I build a map from the keys to the RDDs that result from
> >>> filtering the input RDD to get the records for the corresponding key.
> >>> This works when I have a small number of keys, but for a big number of
> >>> keys (tens of thousands) the execution gets stuck without issuing any
> >>> new Spark stage. I suspect the reason is that the Spark scheduler is
> >>> not able to handle so many RDDs. Does that make sense? I'm rewriting
> >>> the program to use a single RDD of pairs, with cached partitions, but
> >>> I wanted to be sure I understand the problem here.
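> >>>
> >>> To make it concrete, this is a simplified version of what I am doing
> >>> (the data and keys here are made up):
> >>>
> >>>     val data = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("a", 3.0)))
> >>>     val keys = Seq("a", "b")   // known in advance; tens of thousands in my case
> >>>
> >>>     // One filtered RDD per key, kept in a map on the driver. This is
> >>>     // the part that seems to stop scaling when there are many keys.
> >>>     val rddPerKey = keys.map(k => k -> data.filter(_._1 == k)).toMap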
> >>>
> >>> Thanks a lot in advance,
> >>>
> >>> Greetings,
> >>>
> >>> Juan Rodriguez
> >>
> >>
> >
>
