Hi,

I am currently experimenting with linear regression via SGD in Spark MLlib (version 1.2). At this point I need to fine-tune the hyper-parameters, which I do (for now) by an exhaustive grid search over the step size and the number of iterations. Currently I am on a dual-core machine that acts as the master (local mode for now, but I will be adding Spark workers later). In order to maximize throughput I need to run each execution of the linear regression algorithm in parallel.
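For concreteness, here is a minimal sketch of the sequential grid search I am doing now (Python; `train_and_eval` is a placeholder standing in for the actual `LinearRegressionWithSGD.train` call plus the validation-error computation, and the grid values are just examples):

```python
from itertools import product

# Hyper-parameter grid: step sizes and iteration counts to try (example values).
step_sizes = [0.001, 0.01, 0.1, 1.0]
num_iterations = [50, 100, 200]

def train_and_eval(step_size, iterations):
    """Placeholder: would call LinearRegressionWithSGD.train(...)
    and return the validation error of the resulting model."""
    return step_size * iterations  # dummy score, for illustration only

# Exhaustive search: every (stepSize, numIterations) combination is
# trained and evaluated one after another -- nothing runs concurrently.
results = {
    (s, n): train_and_eval(s, n)
    for s, n in product(step_sizes, num_iterations)
}

# Pick the combination with the lowest validation error.
best = min(results, key=results.get)
```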

According to the documentation, it seems that parallel jobs may be scheduled if they are submitted from separate threads [1]. This brings me to my first question: does this mean I am CPU-bound by the Spark master? In other words, is the maximum number of concurrent jobs equal to the maximum number of threads the OS supports?
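To make the question concrete, this is the kind of thread-based submission I understand the job-scheduling docs [1] to suggest (again only a sketch: `train_and_eval` is a placeholder for the real MLlib call, and each call would trigger its own Spark job from its own thread):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

step_sizes = [0.001, 0.01, 0.1, 1.0]
num_iterations = [50, 100, 200]

def train_and_eval(step_size, iterations):
    # Placeholder for LinearRegressionWithSGD.train(...) + evaluation.
    # Spark actions triggered here would be submitted as separate jobs,
    # since each grid point runs in its own thread.
    return step_size * iterations  # dummy score, for illustration only

grid = list(product(step_sizes, num_iterations))

# The pool size bounds how many jobs are in flight at once; results come
# back in grid order because map preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(lambda p: train_and_eval(*p), grid))

best = min(zip(grid, scores), key=lambda t: t[1])[0]
```

Is this the intended pattern, or does the scheduler serialize these jobs anyway?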

I searched the mailing list but did not find anything regarding MLlib itself. I even peeked into the new MLlib API that uses pipelines and has support for parameter tuning. However, it looks like each job (each instance of the learning algorithm) is executed in sequence. Can anyone confirm this? This brings me to my second question: is there any example that shows how one can execute MLlib algorithms as parallel jobs?

Finally, is there any general technique I can use to execute an algorithm in a distributed manner using Spark? More specifically, I would like to have several MLlib algorithms run in parallel. Can anyone show me an example of how to do this?

TIA.
Hugo F.

[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html