Xiangrui, thanks for your answer!
Could you clarify some details?
What do you mean by "I can trigger training jobs in different threads on the
driver"? I have a 4-machine cluster (it will grow in the future), and I would
like to use all of the machines in parallel for training and predicting.
Do you have an example? It would be great if you could show me one.
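
Is it something like the sketch below? This is only my rough, untested guess:
it assumes that submitting jobs to a single shared SparkContext from several
driver-side threads is safe, and it reuses the buildModel/predictScores
helpers from my code quoted further down.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // One Future per criterion: each thread submits its own training and
    // prediction jobs, and Spark schedules them on the cluster concurrently.
    val futures: List[Future[RDD[(Id, Double)]]] =
      criteria2Rules.map { case (criterion, rules) =>
        Future {
          val trainDataSet = prepareTrainSet(criterion, data)
          val model = buildModel(trainDataSet)
          val predictionDataSet = preparePredictionDataSet(criterion, data)
          predictScores(predictionDataSet, model, criterion, rules)
        }
      }

    // Block until all per-criterion jobs finish, then combine the results.
    val result: RDD[(Id, Double)] =
      Await.result(Future.sequence(futures), Duration.Inf).reduce(_ union _)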

Thanks a lot for your participation!
--Igor

2016-02-18 17:24 GMT+03:00 Xiangrui Meng <men...@gmail.com>:

> If you have a big cluster, you can trigger training jobs in different
> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>
> On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:
>
>> Good day, Spark team!
>> I have to solve a regression problem under different restrictions. There is
>> a bunch of criteria, each with its own set of rules; I have to build a model
>> and make predictions for each criterion, then combine all the results and
>> save them.
>> So, now my solution looks like:
>>
>>     val criteria2Rules: List[(String, Set[String])]
>>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>>     criteria2Rules.foreach {
>>       case (criterion, rules) =>
>>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>>         val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
>>         result = result.union(predictedScores)
>>     }
>>
>> It works, but it is too slow: GradientBoostedTreesModel training is not
>> that fast, especially with a large number of features and samples and a
>> fairly long list of criteria.
>> I suppose it could work better if Spark trained the models and made the
>> predictions in parallel.
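>>
>> Independent of any parallelism, since data is reused for every criterion,
>> my guess is that persisting it once before the loop should already help
>> (assuming it fits in the cluster's cache):
>>
>>     data.cache()  // compute data once and reuse it for every criterion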
>>
>> I've also tried a relational style of data manipulation:
>>
>>     val criteria2RulesRdd: RDD[(String, Set[String])]
>>
>>     val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>>     cartesianCriteriaRules2DataRdd
>>       .aggregateByKey(List[Data]())(
>>         { case (lst, d) => lst :+ d },
>>         { case (lstL, lstR) => lstL ::: lstR }
>>       )
>>       .map {
>>         // the key is the (criterion, rules) pair produced by cartesian,
>>         // the value is the data collected for it by aggregateByKey
>>         case ((criterion, rules), scorePredictionDataList) =>
>>           val trainSet = ???
>>           val model = ???
>>           val predictionSet = ???
>>           val predictedScores = ???
>>       }
>>       ...
>>
>> but that inevitably leads to a situation where one RDD has to be produced
>> inside another RDD (GradientBoostedTreesModel is trained on an
>> RDD[LabeledPoint]), and as far as I know that is not supported; e.g. the
>> toy example below doesn't work:
>> scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect
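>>
>> For contrast, the same toy example works once the inner data is a plain
>> local collection computed on the driver, since the RDD-inside-RDD nesting
>> is the only problem:
>>
>> scala> val doubled = Array(2).map(_ * 2)  // local array, no inner RDD
>> scala> sc.parallelize(1 to 100).map(x => (x, doubled)).collect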
>>
>> Is there any way to use Spark MLlib in parallel?
>>
>> Thank you for your attention!
>>
>> --
>> BR,
>> Junior Scala/Python Developer
>> Igor L.
>>


-- 
Best regards,
Igor Lyakhov
