Xiangrui, thanks for your answer! Could you clarify some details? What do you mean by "I can trigger training jobs in different threads on the driver"? I have a 4-machine cluster (it will grow in the future), and I wish to use the machines in parallel for training and predicting. Do you have any example? It would be great if you could show me one.
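Just to check my understanding, is it something like the sketch below? This is only my guess: it reuses the helper functions from my first message (prepareTrainSet, buildModel, preparePredictionDataSet, predictScores) and assumes they are thread-safe, so that each Future can submit its own Spark jobs from a separate driver thread:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// One Future per criterion: each driver thread builds its own model and
// produces its own prediction RDD, so Spark can schedule the resulting
// jobs concurrently across the cluster.
val futures: List[Future[RDD[(Id, Double)]]] = criteria2Rules.map {
  case (criterion, rules) =>
    Future {
      val trainDataSet = prepareTrainSet(criterion, data)
      val model = buildModel(trainDataSet)
      val predictionDataSet = preparePredictionDataSet(criterion, data)
      predictScores(predictionDataSet, model, criterion, rules)
    }
}

// Wait for every job to finish, then combine the per-criterion results.
val result: RDD[(Id, Double)] =
  Await.result(Future.sequence(futures), Duration.Inf).reduce(_ union _)

If that is the idea, I guess I should also set spark.scheduler.mode=FAIR so the concurrent jobs share the executors instead of queueing FIFO, and maybe use a fixed-size thread pool instead of the global ExecutionContext to limit how many models train at once.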
Thanks a lot for your participation!
--Igor

2016-02-18 17:24 GMT+03:00 Xiangrui Meng <men...@gmail.com>:

> If you have a big cluster, you can trigger training jobs in different
> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>
> On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:
>
>> Good day, Spark team!
>> I have to solve a regression problem under different restrictions. There
>> is a bunch of criteria, each with its own set of rules; I have to build a
>> model and make predictions for each criterion, then combine all the
>> results and save them.
>> So my current solution looks like this:
>>
>> val criteria2Rules: List[(String, Set[String])] = ...
>> var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>> criteria2Rules.foreach {
>>   case (criterion, rules) =>
>>     val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>>     val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>     val predictionDataSet = preparePredictionDataSet(criterion, data)
>>     val predictedScores = predictScores(predictionDataSet, model,
>>       criterion, rules)
>>     result = result.union(predictedScores)
>> }
>>
>> It works, but it is too slow: GradientBoostedTreesModel training is not
>> that fast, especially with a large number of features and samples and a
>> fairly long list of criteria.
>> I suppose it could work better if Spark trained the models and made the
>> predictions in parallel.
>>
>> I've tried a relational style of data manipulation:
>>
>> val criteria2RulesRdd: RDD[(String, Set[String])] = ...
>>
>> val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>> cartesianCriteriaRules2DataRdd
>>   .aggregateByKey(List[Data]())(
>>     { case (lst, item) => lst :+ item },
>>     { case (lstL, lstR) => lstL ::: lstR }
>>   )
>>   .map {
>>     case ((criterion, rules), scorePredictionDataList) =>
>>       val trainSet = ???
>>       val model = ???
>>       val predictionSet = ???
>>       val predictedScores = ???
>>   }
>> ...
>>
>> but it inevitably leads to a situation where one RDD is produced inside
>> another RDD (GradientBoostedTreesModel is trained on an
>> RDD[LabeledPoint]), and as far as I know that is a bad scenario; e.g. the
>> toy example below doesn't work:
>>
>> scala> sc.parallelize(1 to 100).map(x => (x,
>>   sc.parallelize(Array(2)).map(_ * 2).collect)).collect
>>
>> Is there any way to use Spark MLlib in a parallel way?
>>
>> Thank you for your attention!
>>
>> --
>> BR,
>> Junior Scala/Python Developer
>> Igor L.
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org

--
Best regards,
Igor Lyakhov