Hi Boromir,

Assuming the tree fits in memory, and what you want is to parallelize
the computation, the 'obvious' way is the following (a Spark sketch
follows the list):

* broadcast the tree T to each worker (ok since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is
(parent,data_at_node)
* aggregate this by key (= parent) -> RDD[(parent, data)]
* map each element (p, data) -> (parent(p), data) using T
* repeat until you have an RDD of size = 1 (assuming T is connected)
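
A minimal sketch of that loop, assuming T is stored as a child -> parent
map, all leaves sit at the deepest level, and the value arrays share one
length (rollUp, parentOf and leafLevel are made-up names):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits (pre-1.3 Spark)
import org.apache.spark.rdd.RDD

// Roll leaf values up to the root, one tree level per iteration.
// parentOf maps child name -> parent name (the root has no entry);
// leafLevel holds (parentName, valuesAtLeaf) pairs for the deepest level.
def rollUp(sc: SparkContext,
           root: String,
           parentOf: Map[String, String],
           leafLevel: RDD[(String, Array[Double])]): Array[Double] = {
  val bc = sc.broadcast(parentOf) // ok since T fits in memory
  var level = leafLevel
  // Sum siblings under each parent, then re-key the sums by the
  // grandparent; after depth-1 rounds everything sits at the root.
  while (!(level.count() == 1 && level.first()._1 == root)) {
    level = level
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
      .map { case (p, data) => (bc.value.getOrElse(p, p), data) }
  }
  level.first()._2
}

(Note that count() launches a job every iteration; if you know the depth
d of T, a plain loop of d-1 rounds avoids that.)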

If T cannot fit in memory, or is very deep, then there are more exotic
techniques, but hopefully this suffices.

Andy


--
http://www.cs.ox.ac.uk/people/andy.twigg/

On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote:

> Hello Folks,
>
> I have been trying to implement a tree reduction algorithm recently in
> spark but could not find suitable parallel operations. Assuming I have a
> general tree like the following -
>
> [tree diagram not shown; per the text below, B has child E, whose
> children are the leaves G and H]
>
> I have to do the following -
> 1) Do some computation at each leaf node to get an array of doubles.
> (This can be precomputed.)
> 2) For each non-leaf node, starting with the root node, compute the
> element-wise sum of these arrays over all child nodes. So to get the
> array for node B, I need the array for E, which is the sum of G + H.
>
> ////////////////////// Start Snippet
> case class Node(name: String, children: Array[Node], values: Array[Double])
>
> // read in the tree here
>
> // A leaf returns its precomputed values; a non-leaf returns the
> // element-wise sum of its children's arrays (no accumulator is needed
> // once each call returns the sum).
> def getSumOfChildren(node: Node): Array[Double] = {
>   if (node.children.isEmpty) {
>     node.values
>   } else {
>     node.children
>       .map(getSumOfChildren)
>       .reduce((a, b) => (a, b).zipped.map(_ + _))
>   }
> }
> ////////////////////////// End Snippet
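>
> For a quick local check of the above (with made-up leaf values for G
> and H, matching the tree fragment above):
>
> val g = Node("G", Array.empty, Array(1.0, 2.0))
> val h = Node("H", Array.empty, Array(3.0, 4.0))
> val e = Node("E", Array(g, h), Array.empty)
> getSumOfChildren(e)  // Array(4.0, 6.0)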
>
> Any pointers on how this can be done in parallel, so as to use all
> cores, would be greatly appreciated.
>
> Thanks,
> Boromir.
>
>
