If you have a big cluster, you can trigger training jobs in different threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
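The "different threads on the driver" approach can be sketched with Scala Futures, reusing the helper names from the snippet below (prepareTrainSet, buildModel, preparePredictionDataSet, predictScores, data, criteria2Rules — all assumed to exist as in the original post; this is a sketch, not a tested implementation). SparkContext is thread-safe for job submission, so each Future launches an independent Spark job:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.rdd.RDD

// One Future per criterion: each driver thread submits its own
// train + predict job, and the cluster runs them concurrently.
val futures: Seq[Future[RDD[(Id, Double)]]] =
  criteria2Rules.map { case (criterion, rules) =>
    Future {
      val trainDataSet = prepareTrainSet(criterion, data)
      val model = buildModel(trainDataSet)          // GradientBoostedTrees
      val predictionDataSet = preparePredictionDataSet(criterion, data)
      predictScores(predictionDataSet, model, criterion, rules)
    }
  }

// Wait for all jobs, then union the per-criterion score RDDs in one call
// instead of chaining result.union(...) in a loop.
val perCriterionScores = futures.map(f => Await.result(f, Duration.Inf))
val result: RDD[(Id, Double)] = sc.union(perCriterionScores)
```

With concurrent jobs you may also want to set `spark.scheduler.mode=fair` so one long training job does not starve the others.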
On Thu, Feb 18, 2016, 4:28 AM Igor L. <tand...@gmail.com> wrote:
> Good day, Spark team!
> I have to solve a regression problem under different restrictions. There is
> a bunch of criteria, each with its own rules; I have to build a model and
> make predictions for each, combine all the results, and save them.
> So, my current solution looks like:
>
>     criteria2Rules: List[(String, Set[String])]
>
>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>     criteria2Rules.foreach {
>       case (criterion, rules) =>
>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>         val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
>         result = result.union(predictedScores)
>     }
>
> It works, but it is too slow, because GradientBoostedTreesModel training is
> not fast, especially with a large number of features and samples and a
> fairly long list of criteria.
> I suppose it could work better if Spark trained the models and made the
> predictions in parallel.
>
> I've tried a relational style of data manipulation:
>
>     val criteria2RulesRdd: RDD[(String, Set[String])]
>
>     val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
>     cartesianCriteriaRules2DataRdd
>       .aggregateByKey(List[Data]())(
>         { case (lst, tuple) => lst :+ tuple },
>         { case (lstL, lstR) => lstL ::: lstR }
>       )
>       .map {
>         case ((criterion, rules), scorePredictionDataList) =>
>           val trainSet = ???
>           val model = ???
>           val predictionSet = ???
>           val predictedScores = ???
>       }
>     ...
>
> but it inevitably leads to producing one RDD inside another RDD
> (GradientBoostedTreesModel is trained on an RDD[LabeledPoint]), and as far
> as I know that is a bad scenario; e.g. the toy example below doesn't work:
>
>     scala> sc.parallelize(1 to 100).map(x =>
>       (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect
>
> Is there any way to use Spark MLlib in parallel?
>
> Thank you for your attention!
>
> --
> BR,
> Junior Scala/Python Developer
> Igor L.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.