If you have a big cluster, you can trigger training jobs in different
threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
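
For concreteness, here is a minimal sketch of that threaded approach. It is
a sketch only: it reuses the helper names from the quoted code below
(prepareTrainSet, buildModel, preparePredictionDataSet, and predictScores
are Igor's own functions, not MLlib APIs), and it assumes the implicit
global execution context is acceptable for a handful of criteria.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Start one train-and-predict pipeline per criterion in its own thread;
    // each Future submits independent Spark jobs from the driver.
    val futureScores: List[Future[RDD[(Id, Double)]]] =
      criteria2Rules.map { case (criterion, rules) =>
        Future {
          val trainDataSet = prepareTrainSet(criterion, data)
          val model = buildModel(trainDataSet)
          val predictionDataSet = preparePredictionDataSet(criterion, data)
          predictScores(predictionDataSet, model, criterion, rules)
        }
      }

    // Block until every pipeline finishes, then combine the results.
    // sc.union(seq) flattens the lineage instead of nesting union calls.
    val result: RDD[(Id, Double)] =
      sc.union(futureScores.map(f => Await.result(f, Duration.Inf)))

If the concurrent jobs just queue behind one another, setting
spark.scheduler.mode to FAIR can help them share cluster resources instead
of running FIFO.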

On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:

> Good day, Spark team!
> I have to solve a regression problem under a number of different
> restrictions. There is a bunch of criteria, each with its own set of rules;
> I have to build a model and make predictions for each criterion, then
> combine all the results and save them.
> So, my current solution looks like this:
>
>     val criteria2Rules: List[(String, Set[String])] = ...
>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>     criteria2Rules.foreach {
>       case (criterion, rules) =>
>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>         val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
>         result = result.union(predictedScores)
>     }
>
> It works, but it is too slow: GradientBoostedTreesModel training is not
> fast, especially with a large number of features and samples, and my list
> of criteria is also quite long.
> I suppose it could work better if Spark trained the models and made the
> predictions in parallel.
>
> I've tried a more relational style of data manipulation:
>
>     val criteria2RulesRdd: RDD[(String, Set[String])] = ...
>
>     val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>     cartesianCriteriaRules2DataRdd
>       .aggregateByKey(List[Data]())(
>         { case (lst, d) => lst :+ d },
>         { case (lstL, lstR) => lstL ::: lstR }
>       )
>       .map {
>         case ((criterion, rules), scorePredictionDataList) =>
>           val trainSet = ???
>           val model = ???
>           val predictionSet = ???
>           val predictedScores = ???
>       }
>       ...
>
> but it inevitably leads to a situation where one RDD is created inside
> another RDD (a GradientBoostedTreesModel is trained on an
> RDD[LabeledPoint]), and as far as I know nested RDDs are not supported;
> e.g. the toy example below doesn't work:
>
> scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect
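>
> For contrast, the same computation works when the inner data is a plain
> local collection instead of a nested RDD, since a usable SparkContext
> exists only on the driver:
>
> scala> sc.parallelize(1 to 100).map(x => (x, Array(2).map(_ * 2))).collect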
>
> Is there any way to train and predict in parallel with Spark MLlib?
>
> Thank you for your attention!
>
> --
> BR,
> Junior Scala/Python Developer
> Igor L.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
