Hi Boromir,

Assuming the tree fits in memory, and what you want to do is parallelize the computation, the 'obvious' way is the following:
* broadcast the tree T to each worker (OK, since it fits in memory)
* construct an RDD for the deepest level - each element in the RDD is (parent, data_at_node)
* aggregate this by key (= parent) -> RDD[(parent, data)]
* map each element (p, data) -> (parent(p), data) using T
* repeat until you have an RDD of size 1 (assuming T is connected)

If T cannot fit in memory, or is very deep, then there are more exotic techniques, but hopefully this suffices (see the sketch after the quoted message below).

Andy

--
http://www.cs.ox.ac.uk/people/andy.twigg/

On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote:
> Hello Folks,
>
> I have been trying to implement a tree reduction algorithm in Spark
> recently, but could not find suitable parallel operations. Assuming I have
> a general tree like the following -
>
> I have to do the following:
> 1) Do some computation at each leaf node to get an array of doubles. (This
> can be precomputed.)
> 2) For each non-leaf node, starting with the root node, compute the sum of
> these arrays over all child nodes. So to get the array for node B, I need
> the array for E, which is the sum of G + H.
>
> ////////////////////// Start Snippet
> case class Node(name: String, children: Array[Node], values: Array[Double])
>
> // read in the tree here
>
> def getSumOfChildren(node: Node): Array[Double] = {
>   if (node.children.isEmpty) node.values // leaf: values are precomputed
>   else node.children
>     .map(getSumOfChildren) // could use an accumulator here instead
>     .foldLeft(node.values)((acc, vs) => (acc, vs).zipped.map(_ + _))
> }
> ////////////////////////// End Snippet
>
> Any pointers on how this can be done in parallel, using all cores, will be
> greatly appreciated.
>
> Thanks,
> Boromir.
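For concreteness, here is a minimal sketch of the level-by-level scheme above in Spark (Scala). It is a sketch under assumptions, not a drop-in implementation: the names `TreeReduce`, `parentPairs`, and `leaves` are made up for illustration, node names are assumed to be unique ids, all value arrays are assumed to have the same length, and instead of broadcasting the `Node` objects themselves it broadcasts a child -> parent map, which is all the workers need for the re-keying step.

////////////////////// Start Sketch
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD ops such as reduceByKey (Spark 1.x)

object TreeReduce {

  // Same shape as the Node in the snippet above; `name` is assumed unique.
  case class Node(name: String, children: Array[Node], values: Array[Double])

  // One (child, parent) pair per edge of the tree.
  def parentPairs(n: Node): Seq[(String, String)] =
    n.children.flatMap(c => (c.name -> n.name) +: parentPairs(c))

  // All leaves of the tree.
  def leaves(n: Node): Seq[Node] =
    if (n.children.isEmpty) Seq(n) else n.children.flatMap(leaves)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("tree-reduce").setMaster("local[*]")) // local testing

    val root: Node = ??? // read in the tree here

    // "Broadcast the tree": just the child -> parent map, which is all the
    // workers need to re-key an element from a node to its parent.
    val parentOf = sc.broadcast(parentPairs(root).toMap)

    // RDD for the deepest level: (parent, data_at_node) for every leaf.
    var level = sc.parallelize(
      leaves(root).map(l => (parentOf.value.getOrElse(l.name, l.name), l.values)))

    // Aggregate by key, then map each partial sum to its node's parent; the
    // root, which has no parent entry, keeps its own key. Repeat until one
    // element remains. For deep trees, cache or checkpoint each level.
    while (level.count() > 1) {
      level = level
        .reduceByKey((a, b) => (a, b).zipped.map(_ + _)) // element-wise array sum
        .map { case (p, data) =>
          (parentOf.value.getOrElse(p, p), data)         // (parent(p), value_of_p)
        }
    }

    println(level.first()._2.mkString(", ")) // the root's array
    sc.stop()
  }
}
////////////////////////// End Sketch

Each pass through the loop handles one level: the RDD produced by `reduceByKey` holds the summed array for every node on that level, so if you need the per-node arrays (e.g. the array at node B) you can save or collect `level` inside the loop rather than only keeping the final total at the root.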
