I put a simple example here: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3877825096667927/588180/d9d264e39a.html
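For readers who cannot open the notebook, the idea of triggering training jobs from different threads on the driver can be sketched roughly as below. This is a minimal sketch under stated assumptions and not necessarily what the linked notebook contains: `ParallelTraining`, `scoreAll`, `preparePredictionSet` and `maxConcurrentJobs` are hypothetical names, the two `prepare...` functions stand in for the helpers from the original question, and the per-criterion rules are ignored for brevity. Only the driver uses threads; every RDD is still created on the driver, never inside another RDD.

```scala
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.rdd.RDD

object ParallelTraining {
  // Hypothetical id type; the original post does not say what Id is.
  type Id = Long

  def scoreAll(
      sc: SparkContext,
      criteria2Rules: List[(String, Set[String])],
      prepareTrainSet: String => RDD[LabeledPoint],      // assumed helper
      preparePredictionSet: String => RDD[(Id, Vector)], // assumed helper
      maxConcurrentJobs: Int = 4): RDD[(Id, Double)] = {

    // A bounded pool limits how many training jobs are submitted at once.
    val pool = Executors.newFixedThreadPool(maxConcurrentJobs)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    try {
      // One Future per criterion; each one runs on its own driver thread
      // and submits its Spark jobs to the shared SparkContext.
      val futures = criteria2Rules.map { case (criterion, _) =>
        Future {
          val trainSet = prepareTrainSet(criterion)
          // Training triggers Spark jobs, so this is the part that overlaps.
          val model = GradientBoostedTrees.train(
            trainSet, BoostingStrategy.defaultParams("Regression"))
          // Lazy: predictions are computed when the result is acted upon.
          preparePredictionSet(criterion)
            .mapValues(features => model.predict(features))
        }
      }

      // Block the calling thread until every model has been trained.
      val perCriterion = Await.result(Future.sequence(futures), Duration.Inf)

      // A single union instead of repeated result.union(...) calls.
      sc.union(perCriterion)
    } finally {
      pool.shutdown()
    }
  }
}
```

Whether this actually runs faster depends on the cluster having spare capacity while a single model is training. By default Spark schedules concurrent jobs FIFO, so setting `spark.scheduler.mode` to `FAIR` may give the jobs a more even share of the executors.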

On Thu, Feb 18, 2016 at 6:47 AM Игорь Ляхов <tand...@gmail.com> wrote:

> Xiangrui, thanks for your answer!
> Could you clarify some details?
> What do you mean by "I can trigger training jobs in different threads on
> the driver"? I have a 4-machine cluster (it will grow in the future), and
> I want to use the machines in parallel for training and predicting.
> Do you have an example? It would be great if you could show me one.
>
> Thanks a lot for your participation!
> --Igor
>
> 2016-02-18 17:24 GMT+03:00 Xiangrui Meng <men...@gmail.com>:
>
>> If you have a big cluster, you can trigger training jobs in different
>> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>>
>> On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:
>>
>>> Good day, Spark team!
>>> I have to solve a regression problem under different restrictions.
>>> There is a set of criteria, each with its own rules. For each
>>> criterion I have to build a model and make predictions, then combine
>>> all the predictions and save them.
>>> My current solution looks like this:
>>>
>>> criteria2Rules: List[(String, Set[String])]
>>> var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>>> criteria2Rules.foreach {
>>>   case (criterion, rules) =>
>>>     val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>>>     val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>>     val predictionDataSet = preparePredictionDataSet(criterion, data)
>>>     val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
>>>     result = result.union(predictedScores)
>>> }
>>>
>>> It works reasonably well, but it is too slow because
>>> GradientBoostedTreesModel training is not fast, especially with a
>>> large number of features and samples and a fairly long list of
>>> criteria.
>>> I suppose it would work better if Spark trained the models and made
>>> the predictions in parallel.
>>>
>>> I've tried a relational style of data operation:
>>>
>>> val criteria2RulesRdd: RDD[(String, Set[String])]
>>>
>>> val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>>> cartesianCriteriaRules2DataRdd
>>>   .aggregateByKey(List[Data]())(
>>>     { case (lst, tuple) => lst :+ tuple },
>>>     { case (lstL, lstR) => lstL ::: lstR }
>>>   )
>>>   .map {
>>>     case (criteria, rulesSet, scorePredictionDataList) =>
>>>       val trainSet = ???
>>>       val model = ???
>>>       val predictionSet = ???
>>>       val predictedScores = ???
>>>   }
>>> ...
>>>
>>> but this inevitably leads to a situation where one RDD is created
>>> inside another RDD (GradientBoostedTreesModel is trained on an
>>> RDD[LabeledPoint]), and as far as I know that does not work. For
>>> example, the toy example below fails:
>>>
>>> scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect
>>>
>>> Is there any way to use Spark MLlib in a parallel way?
>>>
>>> Thank you for your attention!
>>>
>>> --
>>> BR,
>>> Junior Scala/Python Developer
>>> Igor L.
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
>
> --
> Best regards,
> Игорь Ляхов