Hi Sean,

I'm trying to increase the CPU usage by running logistic regression on different datasets in parallel. They shouldn't depend on each other. I train several logistic regression models from different column combinations of a main dataset. I processed the combinations in a ParArray in an attempt to increase CPU usage, but it did not help.
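A minimal sketch of the kind of setup described above, with one independent training run per column combination, each submitted from its own thread. Everything here is an assumption for illustration: `trainModel` is a hypothetical placeholder rather than the actual logistic regression call, the column indices are made up, and it uses Scala Futures rather than a ParArray.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelTrainingSketch {
  // Hypothetical stand-in for one training run. In a real job this would
  // train a logistic regression model on the cached data restricted to
  // `columns`; here it only returns a placeholder number.
  def trainModel(columns: Seq[Int]): Double =
    columns.map(_.toDouble).sum

  // Submit one Future per combination so the runs do not wait on each other,
  // then block until all of them have finished.
  def trainAll(combinations: Seq[Seq[Int]]): Seq[(Seq[Int], Double)] = {
    val runs = combinations.map(cols => Future(cols -> trainModel(cols)))
    // Future.sequence keeps the results in the same order as `combinations`.
    Await.result(Future.sequence(runs), 10.minutes)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical column combinations taken from the main dataset.
    val combinations = Seq(Seq(0, 1), Seq(0, 2), Seq(1, 2), Seq(0, 1, 2))
    trainAll(combinations).foreach { case (cols, score) =>
      println(s"columns ${cols.mkString(",")} -> $score")
    }
  }
}
```

With this shape, the caller only sees the results once the slowest run has finished, even though the individual runs complete independently.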
2015-02-20 8:17 GMT-02:00 Sean Owen <so...@cloudera.com>:

> It sounds like your computation just isn't CPU bound, right? Or maybe
> that only some stages are. It's not clear what work you are doing
> beyond the core LR.
>
> Stages don't wait on each other unless one depends on the other. You'd
> have to clarify what you mean by running stages in parallel, like what
> are the interdependencies.
>
> On Fri, Feb 20, 2015 at 10:01 AM, Dirceu Semighini Filho
> <dirceu.semigh...@gmail.com> wrote:
> > Hi all,
> > I'm running Spark 1.2.0 in standalone mode, on different cluster and
> > server sizes. All of my data is cached in memory.
> > Basically I have a mass of data, about 8 GB, with about 37k columns,
> > and I'm running different configs of a BinaryLogisticRegressionBFGS.
> > When I ran Spark on 9 servers (1 master and 8 slaves) with 32 cores
> > each, I noticed that the CPU usage varied from 20% to 50% (counting
> > the CPU usage of the 9 servers in the cluster).
> > First I tried to repartition the RDDs to the same number as the total
> > client cores (256), but that didn't help. Then I tried changing the
> > property *spark.default.parallelism* to the same number (256), but
> > that didn't help to increase the CPU usage either.
> > Looking at the Spark monitoring tool, I saw that some stages took 52s
> > to complete.
> > My last shot was trying to run some tasks in parallel, but when I
> > started running tasks in parallel (4 tasks), the total CPU time spent
> > to complete them increased by about 10%; task parallelism didn't help.
> > Looking at the monitoring tool, I noticed that when running tasks in
> > parallel, the stages complete together: if I have 4 stages running in
> > parallel (A, B, C and D) and A, B and C finish, they will wait for D
> > before all 4 stages are marked as completed. Is that right?
> > Is there any way to improve the CPU usage when running on large servers?
> > Is spending more time when running tasks expected behaviour?
> >
> > Kind Regards,
> > Dirceu