Hi, I have been using SystemML for some time and am finding it extremely useful for scaling up my algorithm with Spark. However, there are a few aspects I do not fully understand, and I would like some clarification.
My system configuration: 244 GB RAM, 32 cores.

My Spark configuration:
'spark.executor.cores', '4'
'spark.driver.memory', '80g'
'spark.executor.memory', '20g'
'spark.memory.fraction', '0.75'
'spark.worker.cleanup.enabled', 'true'
'spark.default.parallelism', '1'

I have a process in R which I am trying to port. It is similar to randomForest in that it involves growing trees. In R I parallelize it with parLapply, so that n trees are grown in n parallel processes. I have implemented the algorithm in SystemML in an identical way and run the tree-growing step inside a parfor loop. There are two main issues I am facing:

1. In R, with ncore = 16, I get 30 trees in 10 minutes, but on Spark via SystemML the same process takes 1 hour.
2. I have also noticed that if one tree takes 2 minutes to run, 5 trees take 7-8 minutes, i.e. the runtime grows almost linearly with the number of trees. It seems I am unable to parallelize the process across trees in SystemML.

It would be great if someone could help me out with this.

Thank you,
Rajarshi
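P.S. For reference, here is a minimal sketch of how my parfor loop is structured (simplified; buildTree, X, and y are placeholders for my actual tree-growing function and data):

```dml
# Sketch of the tree-growing loop (placeholder names, not my full script).
numTrees = 30
# Each row of M holds the parameters of one grown tree.
M = matrix(0, rows=numTrees, cols=ncol(X))
parfor(i in 1:numTrees) {
    # buildTree stands in for my actual tree-growing logic;
    # each iteration is independent of the others.
    M[i,] = buildTree(X, y)
}
```

From the parfor documentation I understand the loop also accepts optimizer hints such as par, mode, and opt (e.g. parfor(i in 1:numTrees, par=16, opt=CONSTRAINED)) — should I be setting these explicitly to get the iterations to run in parallel?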