Hi,

Thanks for your reply.

I basically want to check whether my understanding of what parallelize() does on RDDs is correct. In my case, I create a vertex RDD and an edge RDD and distribute them by calling parallelize(). Does Spark then perform operations on these RDDs in parallel? For example, if I apply groupBy on the edge RDD (grouping by source vertex) and call a function F on the grouped RDD, will F be applied to each group in parallel, and will Spark determine how to do this regardless of the number of groups?

Thanks.

On Tue, Feb 17, 2015 at 5:03 PM, Yifan LI <iamyifa...@gmail.com> wrote:

> Hi Kannan,
>
> I am not sure I have understood what your question is exactly, but maybe
> the reduceByKey or reduceByKeyLocally functionality is better suited to your need.
>
> Best,
> Yifan LI
>
> On 17 Feb 2015, at 17:37, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Hi,
>
> I am working on a Spark application that processes graphs and I am trying
> to do the following:
>
> - group the vertices (key - vertex, value - set of its outgoing edges)
> - distribute each key to separate processes and process them (like a mapper)
> - reduce the results back at the main process
>
> Does the "groupBy" functionality do the distribution by default?
> Do we have to explicitly use RDDs to enable automatic distribution?
>
> It'd be great if you could help me understand these and how to go about
> with the problem.
>
> Thanks.
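P.S. To make the question concrete, here is a minimal sketch of the grouping step I described (assuming edges are (source, destination) pairs; the names, the local master setting, and the per-group function F, here an out-degree count, are all illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object GroupEdgesSketch {
  def main(args: Array[String]): Unit = {
    // Local sketch; on a cluster the master URL would point at it instead.
    val sc = new SparkContext("local[*]", "group-edges-sketch")

    // parallelize() splits the collection into partitions that Spark
    // distributes across executors (or local cores).
    val edges: RDD[(Long, Long)] = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))

    // groupByKey gathers each source vertex's outgoing edges into one group.
    // F (here, counting out-edges) is applied per group, but Spark schedules
    // work per *partition*, not per group, so the parallelism is bounded by
    // the number of partitions rather than the number of groups.
    val outDegrees = edges.groupByKey().mapValues(_.size)

    outDegrees.collect().foreach(println)
    sc.stop()
  }
}
```

My question is essentially whether the mapValues step above is what runs in parallel, and how Spark decides the degree of parallelism for it.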