I've tried to parallelize the separate regressions using

    allResponses.toParArray.map( x => do logistic regression against labels in x )
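Concretely, the parallel version is along these lines (a sketch rather than my exact code: the (sampleId, label) pairing, the join key, and the iteration count are placeholders):

    import scala.collection.parallel._        // brings toParArray into scope
    import org.apache.spark.SparkContext._    // pair-RDD operations such as join (Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint

    // allResponses: one RDD of (sampleId, label) per response set
    // trainingData: the shared (sampleId, features) RDD
    def trainAll(allResponses: Seq[RDD[(Long, Double)]],
                 trainingData: RDD[(Long, Vector)]): Seq[LogisticRegressionModel] =
      allResponses.toParArray.map { labels =>
        // attach this response set's labels to the shared features
        val currentRDD: RDD[LabeledPoint] =
          trainingData.join(labels).map { case (_, (features, label)) =>
            LabeledPoint(label, features)
          }
        // each train() call submits its own Spark job; with toParArray,
        // several jobs are submitted concurrently from different driver threads
        LogisticRegressionWithSGD.train(currentRDD, 100)
      }.seq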
But I start to see messages like

    14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4)
    14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null

and finally

    14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4 times; aborting job

Then

    14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error null; shutting down SparkContext
    14/06/20 10:10:26 ERROR actor.OneForOneStrategy:
    java.lang.UnsupportedOperationException
        at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)

This doesn't happen when I don't use toParArray. I read that Spark is
thread-safe, but I seem to be running into problems. Am I doing something
wrong?

Kyle

On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott <kellr...@soe.ucsc.edu> wrote:
>
> I'm working on a problem learning several different sets of responses
> against the same set of training features. Right now I've written the
> program to cycle through all of the different label sets, attach each one
> to the training data, and run LogisticRegressionWithSGD on it, i.e.
>
>     foreach curResponseSet in allResponses:
>         currentRDD : RDD[LabeledPoint] = curResponseSet joined with trainingData
>         LogisticRegressionWithSGD.train(currentRDD)
>
> Each of the different training runs is independent, so it seems like I
> should be able to parallelize them as well.
> Is there a better way to do this?
>
>
> Kyle
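For concreteness, the quoted pseudocode corresponds roughly to this Scala (same placeholder names and types as the sketch in my follow-up above, with a made-up iteration count):

    import org.apache.spark.SparkContext._   // pair-RDD join (Spark 1.x)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // one sequential training job per response set
    for (curResponseSet <- allResponses) {
      val currentRDD: RDD[LabeledPoint] =
        trainingData.join(curResponseSet).map { case (_, (features, label)) =>
          LabeledPoint(label, features)
        }
      LogisticRegressionWithSGD.train(currentRDD, 100)
    }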