Re: How to train and predict in parallel via Spark MLlib?

Xiangrui Meng Fri, 19 Feb 2016 12:20:12 -0800

I put a simple example here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3877825096667927/588180/d9d264e39a.html
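
For readers who cannot open the notebook, here is a minimal sketch of the thread-per-job approach suggested earlier in this thread (quoted below). It is not a copy of the linked notebook. The names `ParallelTraining`, `trainAndPredictAll`, `prepareTrainSet`, `preparePredictionSet` and `parallelism` are hypothetical, `Id` is assumed to be `Long`, and the default regression boosting strategy stands in for whatever `buildModel` really does. Each training job is submitted from its own driver thread through a Scala `Future`, so Spark can schedule several of them across the cluster at once.

```scala
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

object ParallelTraining {
  // Assumption: the thread never shows the `Id` type, so Long stands in for it.
  type Id = Long

  def trainAndPredictAll(
      sc: SparkContext,
      criteria: Seq[String],
      prepareTrainSet: String => RDD[LabeledPoint],        // hypothetical helper
      preparePredictionSet: String => RDD[(Id, Vector)],   // hypothetical helper
      parallelism: Int): RDD[(Id, Double)] = {
    // A bounded pool of driver threads; each thread submits its own Spark jobs.
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures: Seq[Future[RDD[(Id, Double)]]] = criteria.map { criterion =>
        Future {
          // Training triggers Spark jobs, so this is the part that overlaps
          // with the training of the other criteria.
          val model = GradientBoostedTrees.train(
            prepareTrainSet(criterion),
            BoostingStrategy.defaultParams("Regression"))
          // Lazy transformation: scoring runs later, when the caller
          // invokes an action on the combined result.
          preparePredictionSet(criterion).mapValues(features => model.predict(features))
        }
      }
      // Block the calling thread until every model has been trained.
      val perCriterion = Await.result(Future.sequence(futures), Duration.Inf)
      // All RDDs are created and combined on the driver, never inside another RDD.
      sc.union(perCriterion)
    } finally {
      pool.shutdown()
    }
  }
}
```

How much the jobs actually overlap depends on free executor capacity. Jobs submitted from different threads are scheduled FIFO by default, and setting `spark.scheduler.mode` to `FAIR` may share resources between them more evenly.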


On Thu, Feb 18, 2016 at 6:47 AM Игорь Ляхов <tand...@gmail.com> wrote:

> Xiangrui, thanks for your answer!
> Could you clarify some details?
> What do you mean by "I can trigger training jobs in different threads on the
> driver"? I have a 4-machine cluster (it will grow in the future), and I want
> to use the machines in parallel for training and predicting.
> Do you have an example? It would be great if you could show me one.
>
> Thanks a lot for your participation!
> --Igor
>
> 2016-02-18 17:24 GMT+03:00 Xiangrui Meng <men...@gmail.com>:
>
>> If you have a big cluster, you can trigger training jobs in different
>> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>>
>> On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:
>>
>>> Good day, Spark team!
>>> I have to solve a regression problem for different restrictions. There is a
>>> bunch of criteria, each with its own rules; I have to build a model and make
>>> predictions for each criterion, then combine all the results and save them.
>>> My current solution looks like this:
>>>
>>>     criteria2Rules: List[(String, Set[String])]
>>>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>>>     criteria2Rules.foreach {
>>>       case (criterion, rules) =>
>>>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>>>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>>>         val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
>>>         result = result.union(predictedScores)
>>>     }
>>>
>>> It works reasonably well, but it is too slow, because training a
>>> GradientBoostedTreesModel is not fast, especially with a large number of
>>> features and samples and a fairly long list of criteria.
>>> I suppose it would work better if Spark trained the models and made the
>>> predictions in parallel.
>>>
>>> I've tried a relational approach to the data:
>>>
>>>     val criteria2RulesRdd: RDD[(String, Set[String])]
>>>
>>>     val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>>>     cartesianCriteriaRules2DataRdd
>>>       .aggregateByKey(List[Data]())(
>>>         { case (lst, tuple) => lst :+ tuple },
>>>         { case (lstL, lstR) => lstL ::: lstR }
>>>       )
>>>       .map {
>>>         case (criteria, rulesSet, scorePredictionDataList) =>
>>>           val trainSet = ???
>>>           val model = ???
>>>           val predictionSet = ???
>>>           val predictedScores = ???
>>>       }
>>>       ...
>>>
>>> but this inevitably leads to one RDD being created inside another RDD
>>> (GradientBoostedTreesModel is trained on an RDD[LabeledPoint]), and as far
>>> as I know that is a bad scenario. For example, the toy example below
>>> doesn't work:
>>> scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect
>>>
>>> Is there any way to use Spark MLlib in a parallel way?
>>>
>>> Thank you for your attention!
>>>
>>> --
>>> BR,
>>> Junior Scala/Python Developer
>>> Igor L.
>
>
> --
> Best regards,
> Игорь Ляхов
>
