Hi, here's how to get a parallel grid-search pipeline:
package org.apache.spark.ml.pipeline

import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql._

class ParallelGridSearchPipeline extends Pipeline {
  // Fit one PipelineModel per ParamMap, training the models in parallel.
  override def fit(dataset: DataFrame, paramMaps: Array[ParamMap]): Seq[PipelineModel] = {
    paramMaps.par.map(fit(dataset, _)).toVector
  }
}
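The `paramMaps.par.map(fit(dataset, _))` call is what fans the grid out across threads: one model is trained per ParamMap, concurrently. Here's a self-contained sketch of the same fan-out pattern without a Spark dependency (the `Params` case class and toy `fit` function are stand-ins, not real MLlib types), using Futures instead of parallel collections:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelFitSketch {
  // Stand-in for a ParamMap: a (stepSize, numIterations) pair.
  case class Params(stepSize: Int, numIterations: Int)

  // Stand-in for Pipeline.fit(dataset, params): the "model" is just a number.
  def fit(params: Params): Int = params.stepSize * params.numIterations

  // Train one "model" per Params value, all running concurrently on the
  // global thread pool - same shape as paramMaps.par.map(fit(dataset, _)).
  def fitAll(paramMaps: Seq[Params]): Seq[Int] = {
    val futures = paramMaps.map(p => Future(fit(p)))
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

Calling `ParallelFitSketch.fitAll(Seq(Params(2, 100), Params(3, 200)))` returns `Seq(200, 600)`, with the individual fits overlapping in time.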
For this you need to:

1) Make sure you have a lot of RAM, since for each parameter map you need to cache the labeled points in LR. (I've done it a bit differently - first run sequentially for the first param to cache the instances, then run the rest in parallel. You can check the prototype here:
https://issues.apache.org/jira/browse/SPARK-5844?focusedCommentId=14323253&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14323253)

2) Set spark.scheduler.mode="FAIR", otherwise tasks submitted within the same context execute in FIFO mode - so no parallelism.

3) You'll probably also need to configure a pool to use the FAIR scheduler
(http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties)
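For step 3, a pool definition in fairscheduler.xml typically looks like the following (the pool name "gridSearch" is just an example); you point Spark at the file with spark.scheduler.allocation.file and select the pool per thread with sc.setLocalProperty("spark.scheduler.pool", "gridSearch"):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Example pool for the parallel grid-search jobs -->
  <pool name="gridSearch">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```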
What I'm currently looking into is having a fork/join pipeline. I have 2 separate branches in my DAG pipeline - one to process numeric columns and one to process categorical columns - and then everything is merged together with VectorAssembler. But I want these 2 branches to run in parallel. I'm also looking to define a bunch of different cross-validators that use techniques other than grid search (random search cross-validator, Bayesian optimization CV, etc.).

Thanks,
Peter Rudenko

On 2015-06-18 01:58, Xiangrui Meng wrote:
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira <h...@inesctec.pt> wrote:
Hi,
I am currently experimenting with linear regression (SGD) (Spark + MLlib,
ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
do this (for now) by an exhaustive grid search of the step size and the
number of iterations. Currently I am on a dual-core machine that acts as a master
(local mode for now, but I will be adding Spark workers later). In order to
maximize throughput I need to run each execution of the linear regression
algorithm in parallel.
How big is your dataset? If it is small or medium-sized, you might get better
performance by broadcasting the entire dataset and using a single-machine solver
on each worker.
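That suggestion - share one in-memory copy of the data and run an independent single-machine solve per hyper-parameter setting - can be sketched without Spark like this (the toy least-squares solver and all names here are illustrative, not MLlib APIs):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BroadcastSolveSketch {
  // The "broadcast" dataset: one shared, read-only copy held in memory,
  // standing in for a broadcast variable on a worker.
  val data: Vector[(Double, Double)] = Vector((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))

  // Toy single-machine solver: closed-form least-squares slope through the
  // origin. A real solver would use its hyper-parameter; this one ignores it.
  def localSolve(points: Vector[(Double, Double)], regParam: Double): Double = {
    val sxy = points.map { case (x, y) => x * y }.sum
    val sxx = points.map { case (x, _) => x * x }.sum
    sxy / sxx
  }

  // One independent solve per hyper-parameter value, all reading the same
  // shared dataset - no coordination needed between the runs.
  def solveAll(regParams: Seq[Double]): Seq[Double] =
    Await.result(Future.sequence(regParams.map(r => Future(localSolve(data, r)))), 1.minute)
}
```

`solveAll(Seq(0.1, 0.01))` returns `Seq(2.0, 2.0)` here - each hyper-parameter setting gets its own solve over the same shared copy of the data.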
According to the documentation it seems like parallel jobs may be scheduled
if they are executed in separate threads [1]. So this brings me to my first
question: does this mean I am CPU-bound by the Spark master? In other words,
is the maximum number of jobs equal to the maximum number of threads of the OS?
We use the driver to collect model updates. Increasing the number of parallel jobs
also increases the driver load for both communication and computation. I don't
think you need to worry much about the max number of threads, which is usually
much larger than the number of parallel jobs we can actually run.
I searched the mailing list but did not find anything regarding MLlib
itself. I even peeked into the new MLlib API that uses pipelines and has
support for parameter tuning. However, it looks like each job (instance of
the learning algorithm) is executed in sequence. Can anyone confirm this?
This brings me to my second question: is there any example that shows how one
can execute MLlib algorithms as parallel jobs?
The new API is not optimized for performance yet. There is an example
here for k-means:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L393
Finally, is there any general technique I can use to execute an algorithm in
a distributed manner using Spark? More specifically, I would like to have
several MLlib algorithms run in parallel. Can anyone show me an example of
sorts to do this?
TIA.
Hugo F.
[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org