How to train and predict in parallel via Spark MLlib?

Igor L. Thu, 18 Feb 2016 01:29:14 -0800

Good day, Spark team!
I have to solve a regression problem under different restrictions. There is a
set of criteria, each with its own rules. For each criterion I have to build a
model and make predictions, then combine all the results and save them.
My current solution looks like this:
    
    criteria2Rules: List[(String, Set[String])]
    var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
    criteria2Rules.foreach {
      case (criterion, rules) =>
        val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion, data)
        val model: GradientBoostedTreesModel = buildModel(trainDataSet)
        val predictionDataSet = preparePredictionDataSet(criterion, data)
        val predictedScores = predictScores(predictionDataSet, model, criterion, rules)
        result = result.union(predictedScores)
    }


It works almost fine, but it is too slow, because GradientBoostedTreesModel
training is not fast, especially with a large number of features and samples
and a fairly long list of criteria.
I suppose it could work better if Spark trained the models and made
predictions in parallel.
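
A minimal sketch of what "in parallel" could mean on the driver side: Spark's
scheduler accepts jobs submitted concurrently from separate threads, so each
criterion can be wrapped in a Scala Future. Everything named here is an
assumption for illustration: trainAndPredict is a hypothetical stand-in for the
prepareTrainSet / buildModel / preparePredictionDataSet / predictScores chain,
it returns a plain Seq instead of an RDD so the sketch runs without a cluster,
and Id is assumed to be Long.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelCriteria {
  type Id = Long // assumption: the real Id type is not shown in this post

  // Hypothetical stand-in for the per-criterion pipeline. In the real job this
  // body would run Spark actions (training, prediction) and return an
  // RDD[(Id, Double)]; here it just produces dummy scores.
  def trainAndPredict(criterion: String, rules: Set[String]): Seq[(Id, Double)] =
    rules.toSeq.sorted.zipWithIndex.map { case (rule, i) =>
      (i.toLong, (criterion.length + rule.length).toDouble)
    }

  def runAll(criteria2Rules: List[(String, Set[String])]): Seq[(Id, Double)] = {
    // One Future per criterion, so the per-criterion work is started
    // concurrently instead of one after another as in foreach.
    val futures = criteria2Rules.map { case (criterion, rules) =>
      Future(trainAndPredict(criterion, rules))
    }
    // Future.sequence preserves the input order; block until all are done,
    // then concatenate the per-criterion results.
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  }
}
```

With real RDDs, the last step would presumably combine the per-criterion RDDs
with a single sc.union(...) call rather than flatten, and the global execution
context would likely be replaced by a fixed-size thread pool to bound the
number of concurrent jobs.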

I've also tried a relational approach to the data:

    val criteria2RulesRdd: RDD[(String, Set[String])]
    
    val cartesianCriteriaRules2DataRdd = criteria2RulesRdd.cartesian(dataRdd)
    cartesianCriteriaRules2DataRdd
      .aggregateByKey(List[Data]())(
        { case (lst, tuple) => lst :+ tuple },
        { case (lstL, lstR) => lstL ::: lstR }
      )
      .map {
        case ((criteria, rulesSet), scorePredictionDataList) =>
          val trainSet = ???
          val model = ???
          val predictionSet = ???
          val predictedScores = ???
      }
      ...

but it inevitably leads to a situation where one RDD is created inside a
transformation of another RDD (GradientBoostedTreesModel is trained on
RDD[LabeledPoint]), and as far as I know that is a bad scenario. For example,
the toy example below doesn't work:
scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_ * 2).collect)).collect

Is there any way to use Spark MLlib in a parallel way?

Thank you for your attention!

--
BR,
Junior Scala/Python Developer
Igor L.



