Good day, Spark team! I have to solve regression problem for different restricitons. There is a bunch of criteria and rules for them, I have to build model and make predictions for each, combine all and save. So, now my solution looks like: criteria2Rules: List[(String, Set[String])] var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]()) criteria2Rules.foreach { case (criterion, rules) => val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data) val model: GradientBoostedTreesModel = buildModel(trainDataSet) val predictionDataSet = preparePredictionDataSet(criterion, data) val predictedScores = predictScores(predictionDataSet, model, criterion, rules) result = result.union(predictedScores) }
It works almost nice, but too slow for the reason GradientBoostedTreesModel training not so fast, especially in case of big amount of features, samples and also quite big list of using criteria. I suppose it could work better, if Spark will train models and make predictions in parallel. I've tried to use a relational way of data operation: val criteria2RulesRdd: RDD[(String, Set[String])] val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd) cartesianCriteriaRules2DataRdd .aggregateByKey(List[Data]())( { case (lst, tuple) => lst :+ tuple }, { case (lstL, lstR) => lstL ::: lstR} ) .map { case (criteria, rulesSet, scorePredictionDataList) => val trainSet = ??? val model = ??? val predictionSet = ??? val predictedScores = ??? } ... but it inevitably brings to situation when one RDD is produced inside another RDD (GradientBoostedTreesModel is trained on RDD[LabeledPoint]) and as far as I know it's a bad scenario, e.g. toy example below doesn't work: scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect. Is there any way to use Spark MLlib in parallel way? Thank u for attention! -- BR, Junior Scala/Python Developer Igor L. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org