Thanks a lot Andy and Debasish, your suggestions were a great help.

On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das <[email protected]> wrote:
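For the archives, here is a minimal sketch of Andy's level-by-level approach. This is an assumption-laden outline, not code from the thread: the names `parentOf`, `leaves`, and `parentB` are illustrative, the `???` placeholders stand for whatever tree-loading code you already have, and it assumes every node has a unique name and all leaves sit at the deepest level, as in Andy's outline.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch only: assumes the tree fits in driver memory and that
// parentOf/leaves are built when the tree is read in.
val sc = new SparkContext(new SparkConf().setAppName("tree-reduce"))

val parentOf: Map[String, String] = ???         // child name -> parent name
val leaves: Seq[(String, Array[Double])] = ???  // (parent, values) at the deepest level

val parentB = sc.broadcast(parentOf)            // broadcast the tree structure

var level: RDD[(String, Array[Double])] = sc.parallelize(leaves)
while (level.count() > 1) {
  level = level
    // sum the child arrays element-wise, per parent
    .reduceByKey((a, b) => (a, b).zipped.map(_ + _))
    // re-key each summed node to its own parent and repeat
    .map { case (name, vs) => (parentB.value.getOrElse(name, name), vs) }
}
// level now holds a single (name, values) pair for the root
```

The `reduceByKey` is what buys the parallelism: each level's sums are computed across all cores/workers, and only the tree structure (not the data) is broadcast.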
> If the tree is too big build it on graphx....but it will need thorough
> analysis so that the partitions are well balanced...
>
> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected]> wrote:
>
>> Hi Boromir,
>>
>> Assuming the tree fits in memory, and what you want to do is parallelize
>> the computation, the 'obvious' way is the following:
>>
>> * broadcast the tree T to each worker (ok since it fits in memory)
>> * construct an RDD for the deepest level - each element in the RDD is
>>   (parent, data_at_node)
>> * aggregate this by key (= parent) -> RDD[(parent, data)]
>> * map each element (p, data) -> (parent(p), data) using T
>> * repeat until you have an RDD of size 1 (assuming T is connected)
>>
>> If T cannot fit in memory, or is very deep, then there are more exotic
>> techniques, but hopefully this suffices.
>>
>> Andy
>>
>> --
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>
>> On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote:
>>
>>> Hello Folks,
>>>
>>> I have been trying to implement a tree reduction algorithm recently in
>>> Spark but could not find suitable parallel operations. Assuming I have a
>>> general tree like the following -
>>>
>>> [inline tree diagram did not survive; nodes A-H referenced below]
>>>
>>> I have to do the following -
>>> 1) Do some computation at each leaf node to get an array of doubles.
>>>    (This can be precomputed.)
>>> 2) For each non-leaf node, starting with the root node, compute the sum
>>>    of these arrays over all child nodes. So to get the array for node B,
>>>    I need to get the array for E, which is the sum of G + H.
>>>
>>> ////////////////////// Start Snippet
>>> case class Node(name: String, children: Array[Node],
>>>                 var values: Array[Double])
>>>
>>> // read in the tree here
>>>
>>> def getSumOfChildren(node: Node): Array[Double] = {
>>>   if (node.children.isEmpty) {   // leaf node
>>>     return node.values
>>>   }
>>>   for (child <- node.children) {
>>>     // can use an accumulator here
>>>     node.values = (node.values, getSumOfChildren(child)).zipped.map(_ + _)
>>>   }
>>>   node.values
>>> }
>>> ////////////////////////// End Snippet
>>>
>>> Any pointers to how this can be done in parallel to use all cores will
>>> be greatly appreciated.
>>>
>>> Thanks,
>>> Boromir.
