Thanks a lot Andy and Debasish, your suggestions were of great help.

On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das <[email protected]>
wrote:

> If the tree is too big, build it on GraphX... but it will need thorough
> analysis so that the partitions are well balanced...
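>
> A rough sketch of that idea (untested; `buildTreeGraph` is a hypothetical
> loader, vertex attributes are assumed to be (depth, valueArray) with edges
> directed child -> parent, and aggregateMessages needs a newer GraphX
> release):
>
> ////////////////////// Start Snippet
> import org.apache.spark.graphx._
>
> // Leaves start with their computed arrays; internal vertices start with
> // zero arrays. One aggregateMessages round per level, deepest first,
> // pushes the sums up to the root.
> var g: Graph[(Int, Array[Double]), Int] = buildTreeGraph() // assumed loader
> val maxDepth = g.vertices.map(_._2._1).max()
> for (level <- maxDepth to 1 by -1) {
>   val sums = g.aggregateMessages[Array[Double]](
>     ctx => if (ctx.srcAttr._1 == level) ctx.sendToDst(ctx.srcAttr._2),
>     (a, b) => (a, b).zipped.map(_ + _)            // merge sibling arrays
>   )
>   g = g.joinVertices(sums) { (_, attr, msg) =>
>     (attr._1, (attr._2, msg).zipped.map(_ + _))   // add children's sum
>   }
> }
> ////////////////////////// End Snippet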
>
> On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg <[email protected]> wrote:
>
>> Hi Boromir,
>>
>> Assuming the tree fits in memory, and what you want to do is parallelize
>> the computation, the 'obvious' way is the following:
>>
>> * broadcast the tree T to each worker (ok since it fits in memory)
>> * construct an RDD for the deepest level - each element in the RDD is
>> (parent,data_at_node)
>> * aggregate this by key (=parent) -> RDD[parent,data]
>> * map each element (p, data) -> (parent(p), data) using T
>> * repeat until you have an RDD of size = 1 (assuming T is connected)
>>
>> If T cannot fit in memory, or is very deep, then there are more exotic
>> techniques, but hopefully this suffices.
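>>
>> A minimal sketch of those steps (untested; `sc`, `depth`, `parentOf` and
>> `deepestLevelRDD` are assumed inputs, and all data is assumed to sit at
>> the deepest level):
>>
>> ////////////////////// Start Snippet
>> import org.apache.spark.SparkContext._   // reduceByKey on pair RDDs
>> import org.apache.spark.rdd.RDD
>>
>> // child -> parent for every non-root node, built from T on the driver
>> val parentOf: Map[String, String] = ???
>> val bcParent = sc.broadcast(parentOf)    // broadcast the tree
>>
>> val sum = (a: Array[Double], b: Array[Double]) => (a, b).zipped.map(_ + _)
>>
>> // Elements are (parentName, dataAtNode) for the deepest level `depth`.
>> var rdd: RDD[(String, Array[Double])] = deepestLevelRDD
>> for (_ <- 1 until depth) {
>>   rdd = rdd
>>     .reduceByKey(sum)                                    // aggregate by parent
>>     .map { case (p, data) => (bcParent.value(p), data) } // move one level up
>> }
>> val total = rdd.reduceByKey(sum)         // a single (root, summedArray) pair
>> ////////////////////////// End Snippet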
>>
>> Andy
>>
>>
>> --
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>
>> On 30 September 2014 14:12, Boromir Widas <[email protected]> wrote:
>>
>>> Hello Folks,
>>>
>>> I have recently been trying to implement a tree-reduction algorithm in
>>> Spark but could not find suitable parallel operations. Assuming I have a
>>> general tree like the following -
>>>
>>> [tree diagram not preserved in the archive: node B has child E, and E
>>> has children G and H]
>>>
>>> I have to do the following -
>>> 1) Do some computation at each leaf node to get an array of doubles.
>>> (This can be precomputed.)
>>> 2) For each non-leaf node, starting with the root node, compute the sum
>>> of these arrays over all child nodes. So to get the array for node B, I
>>> need the array for E, which is the sum of G and H.
>>>
>>> ////////////////////// Start Snippet
>>> case class Node(name: String, children: Array[Node], values: Array[Double])
>>>
>>> // read in the tree here
>>>
>>> // Recursively compute a node's array: a leaf returns its precomputed
>>> // values; a non-leaf returns the element-wise sum of its children.
>>> def getSumOfChildren(node: Node): Array[Double] = {
>>>   if (node.children.isEmpty) {
>>>     node.values
>>>   } else {
>>>     node.children
>>>       .map(getSumOfChildren)
>>>       .reduce((a, b) => (a, b).zipped.map(_ + _))
>>>   }
>>> }
>>> ////////////////////////// End Snippet
>>>
>>> Any pointers to how this can be done in parallel to use all cores will
>>> be greatly appreciated.
>>>
>>> Thanks,
>>> Boromir.
>>>
>>>
>>
>
