Hi,

I am currently experimenting with linear regression via SGD in Spark MLlib (version 1.2). At this point I need to fine-tune the hyper-parameters, which I do (for now) by an exhaustive grid search over the step size and the number of iterations. Currently I am on a dual-core machine that acts as the master (local mode for now, but I will be adding Spark workers later). In order to maximize throughput I need to run each execution of the linear regression algorithm in parallel.
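For concreteness, here is a minimal sketch of the sequential grid search I am doing now (Python; `train_and_eval` is a placeholder standing in for the actual `LinearRegressionWithSGD.train` call plus the validation-error computation, and the grid values are just examples):

```python
from itertools import product

# Hyper-parameter grid: step sizes and iteration counts to try (example values).
step_sizes = [0.001, 0.01, 0.1, 1.0]
num_iterations = [50, 100, 200]

def train_and_eval(step_size, iterations):
    """Placeholder: would call LinearRegressionWithSGD.train(...)
    and return the validation error of the resulting model."""
    return step_size * iterations  # dummy score, for illustration only

# Exhaustive search: every (stepSize, numIterations) combination is
# trained and evaluated one after another -- nothing runs concurrently.
results = {
    (s, n): train_and_eval(s, n)
    for s, n in product(step_sizes, num_iterations)
}

# Pick the combination with the lowest validation error.
best = min(results, key=results.get)
```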

According to the documentation, it seems that parallel jobs may be scheduled if they are submitted from separate threads [1]. This brings me to my first question: does this mean I am CPU-bound by the Spark master? In other words, is the maximum number of concurrent jobs equal to the maximum number of threads the OS supports?
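To make the question concrete, this is the kind of thread-based submission I understand the job-scheduling docs [1] to suggest (again only a sketch: `train_and_eval` is a placeholder for the real MLlib call, and each call would trigger its own Spark job from its own thread):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

step_sizes = [0.001, 0.01, 0.1, 1.0]
num_iterations = [50, 100, 200]

def train_and_eval(step_size, iterations):
    # Placeholder for LinearRegressionWithSGD.train(...) + evaluation.
    # Spark actions triggered here would be submitted as separate jobs,
    # since each grid point runs in its own thread.
    return step_size * iterations  # dummy score, for illustration only

grid = list(product(step_sizes, num_iterations))

# The pool size bounds how many jobs are in flight at once; results come
# back in grid order because map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda p: train_and_eval(*p), grid))

best = min(zip(grid, scores), key=lambda t: t[1])[0]
```

Is this the intended pattern, or does the scheduler serialize these jobs anyway?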

I searched the mailing list but did not find anything regarding MLlib itself. I even peeked into the new MLlib API that uses pipelines and has support for parameter tuning. However, it looks like each job (each instance of the learning algorithm) is executed in sequence. Can anyone confirm this? This brings me to my second question: is there any example that shows how one can execute MLlib algorithms as parallel jobs?

Finally, is there any general technique I can use to execute an algorithm in a distributed manner using Spark? More specifically, I would like to have several MLlib algorithms run in parallel. Can anyone show me an example of how to do this?

TIA.
Hugo F.

[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html