Hi, here's how to get a parallel grid-search pipeline:
package org.apache.spark.ml.pipeline

import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql._

class ParallelGridSearchPipeline extends Pipeline {
  // Fit one PipelineModel per ParamMap, training the models in parallel.
  override def fit(dataset: DataFrame, paramMaps: Array[ParamMap]): Seq[PipelineModel] = {
    paramMaps.par.map(fit(dataset, _)).toVector
  }
}
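The `paramMaps.par.map(fit(dataset, _))` call is what fans the grid out across threads: one model is trained per ParamMap, concurrently. Here's a self-contained sketch of the same fan-out pattern without a Spark dependency (the `Params` case class and toy `fit` function are stand-ins, not real MLlib types), using Futures instead of parallel collections:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelFitSketch {
  // Stand-in for a ParamMap: a (stepSize, numIterations) pair.
  case class Params(stepSize: Int, numIterations: Int)

  // Stand-in for Pipeline.fit(dataset, params): the "model" is just a number.
  def fit(params: Params): Int = params.stepSize * params.numIterations

  // Train one "model" per Params value, all running concurrently on the
  // global thread pool - same shape as paramMaps.par.map(fit(dataset, _)).
  def fitAll(paramMaps: Seq[Params]): Seq[Int] = {
    val futures = paramMaps.map(p => Future(fit(p)))
    Await.result(Future.sequence(futures), 1.minute)
  }
}
```

Calling `ParallelFitSketch.fitAll(Seq(Params(2, 100), Params(3, 200)))` returns `Seq(200, 600)`, with the individual fits overlapping in time.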
For this you need to:

1) Make sure you have a lot of RAM, since for each parameter map you need to cache the labeled points in LR. (I've done it a bit differently - first run sequentially for the first param to cache the instances, then run the rest in parallel. You can check the prototype here:
https://issues.apache.org/jira/browse/SPARK-5844?focusedCommentId=14323253&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14323253)

2) Set spark.scheduler.mode="FAIR", otherwise tasks submitted within the same context execute in FIFO mode - so no parallelism.

3) You'll probably also need to configure a pool to use the FAIR scheduler
(http://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties)
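For step 3, a pool definition in fairscheduler.xml typically looks like the following (the pool name "gridSearch" is just an example); you point Spark at the file with spark.scheduler.allocation.file and select the pool per thread with sc.setLocalProperty("spark.scheduler.pool", "gridSearch"):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Example pool for the parallel grid-search jobs -->
  <pool name="gridSearch">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```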
What I'm currently looking into is having a fork/join pipeline. I have 2 separate branches in my DAG pipeline - one to process numeric columns and one to process categorical columns - and then everything is merged together with VectorAssembler. But I want these 2 branches to run in parallel. I'm also looking to define a bunch of different cross-validators that use techniques other than grid search (random search cross-validator, Bayesian optimization CV, etc.).

Thanks,
Peter Rudenko

On 2015-06-18 01:58, Xiangrui Meng wrote:
On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira <h...@inesctec.pt> wrote:
Hi,
I am currently experimenting with linear regression (SGD) (Spark + MLlib,
ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
do this (for now) by an exhaustive grid search of the step size and the
number of iterations. Currently I am on a dual-core machine that acts as a master
(local mode for now, but I will be adding Spark workers later). In order to
maximize throughput I need to run each execution of the linear regression
algorithm in parallel.
How big is your dataset? If it is small or medium-sized, you might get better
performance by broadcasting the entire dataset and using a single-machine solver
on each worker.
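That suggestion - share one in-memory copy of the data and run an independent single-machine solve per hyper-parameter setting - can be sketched without Spark like this (the toy least-squares solver and all names here are illustrative, not MLlib APIs):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BroadcastSolveSketch {
  // The "broadcast" dataset: one shared, read-only copy held in memory,
  // standing in for a broadcast variable on a worker.
  val data: Vector[(Double, Double)] = Vector((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))

  // Toy single-machine solver: closed-form least-squares slope through the
  // origin. A real solver would use its hyper-parameter; this one ignores it.
  def localSolve(points: Vector[(Double, Double)], regParam: Double): Double = {
    val sxy = points.map { case (x, y) => x * y }.sum
    val sxx = points.map { case (x, _) => x * x }.sum
    sxy / sxx
  }

  // One independent solve per hyper-parameter value, all reading the same
  // shared dataset - no coordination needed between the runs.
  def solveAll(regParams: Seq[Double]): Seq[Double] =
    Await.result(Future.sequence(regParams.map(r => Future(localSolve(data, r)))), 1.minute)
}
```

`solveAll(Seq(0.1, 0.01))` returns `Seq(2.0, 2.0)` here - each hyper-parameter setting gets its own solve over the same shared copy of the data.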
According to the documentation it seems like parallel jobs may be scheduled
if they are executed in separate threads [1]. So this brings me to my first
question: does this mean I am CPU-bound by the Spark master? In other words,
is the maximum number of jobs equal to the maximum number of threads of the OS?
We use the driver to collect model updates. Increasing the number of parallel jobs
also increases the driver load for both communication and computation. I don't
think you need to worry much about the max number of threads, which is usually
much larger than the number of parallel jobs we can actually run.
I searched the mailing list but did not find anything regarding MLlib
itself. I even peeked into the new MLlib API that uses pipelines and has
support for parameter tuning. However, it looks like each job (instance of
the learning algorithm) is executed in sequence. Can anyone confirm this?
This brings me to my second question: is there any example that shows how one
can execute MLlib algorithms as parallel jobs?
The new API is not optimized for performance yet. There is an example
here for k-means:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L393
Finally, is there any general technique I can use to execute an algorithm in
a distributed manner using Spark? More specifically, I would like to have
several MLlib algorithms run in parallel. Can anyone show me an example of
sorts to do this?
TIA.
Hugo F.
[1] https://spark.apache.org/docs/1.2.0/job-scheduling.html
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org