Hi,

Thanks for your reply.

I basically want to check whether my understanding of what parallelize() on
RDDs does is correct. In my case, I create a vertex RDD and an edge RDD and
distribute them by calling parallelize(). Does Spark then perform subsequent
operations on these RDDs in parallel?
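
To make it concrete, here is roughly what I have in my driver program (sc is
the SparkContext; the data below is made up):

// A minimal sketch of what I mean; names and data are just placeholders.
val vertices = sc.parallelize(Seq(1L, 2L, 3L))                    // vertex RDD
val edges    = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))  // (src, dst) edge RDD
// My understanding: parallelize() splits these collections into partitions
// distributed across the executors, so later transformations on them run
// per-partition in parallel.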

For example, if I apply groupBy on the edge RDD (grouping edges by their
source vertex) and then call a function F on the grouped RDD, will F be
applied to each group in parallel, and will Spark decide how to parallelize
this regardless of the number of groups?
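
As a sketch (F here is just a placeholder for my actual per-group function):

// F is a stand-in for the real per-group computation.
def F(outEdges: Iterable[(Long, Long)]): Int = outEdges.size

val groupedEdges = edges.groupBy { case (src, _) => src }  // RDD[(Long, Iterable[(Long, Long)])]
val results      = groupedEdges.mapValues(F)               // is F run on each group in parallel?
results.collect().foreach(println)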

Thanks.

On Tue, Feb 17, 2015 at 5:03 PM, Yifan LI <iamyifa...@gmail.com> wrote:

> Hi Kannan,
>
> I am not sure I have understood your question exactly, but perhaps the
> reduceByKey or reduceByKeyLocally functionality is better suited to your need.
>
>
>
> Best,
> Yifan LI
>
>
>
>
>
> On 17 Feb 2015, at 17:37, Vijayasarathy Kannan <kvi...@vt.edu> wrote:
>
> Hi,
>
> I am working on a Spark application that processes graphs and I am trying
> to do the following.
>
> - group the vertices (key - vertex, value - set of its outgoing edges)
> - distribute each key to separate processes and process them (like a mapper)
> - reduce the results back at the main process
>
> Does the "groupBy" functionality do the distribution by default?
> Do we have to explicitly use RDDs to enable automatic distribution?
>
> It'd be great if you could help me understand these and how to go about
> with the problem.
>
> Thanks.
>
>
>
