I've tried to parallelize the separate regressions using
allResponses.toParArray.map( x => do logistic regression against the labels in x )
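
Spelled out, the attempt looks roughly like this (a minimal sketch rather than my exact code; the shapes of trainingData and allResponses, the sample-id keys, and the helper name trainAll are placeholders I'm inventing here):

import org.apache.spark.SparkContext._   // pair-RDD functions such as join (Spark 1.x)
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import scala.collection.parallel._        // toParArray extension, if not already in scope

// Assumed shapes (placeholder names, not from my real code):
//   trainingData: RDD[(String, Vector)]                  features keyed by sample id
//   allResponses: Seq[(String, RDD[(String, Double)])]   one label set per response, same keys
def trainAll(trainingData: RDD[(String, Vector)],
             allResponses: Seq[(String, RDD[(String, Double)])]): List[(String, LogisticRegressionModel)] =
  allResponses.toParArray.map { case (responseName, labels) =>
    // each thread of the parallel collection builds its own RDD and submits its own job
    val currentRDD: RDD[LabeledPoint] =
      labels.join(trainingData).map { case (_, (label, features)) =>
        LabeledPoint(label, features)
      }
    (responseName, LogisticRegressionWithSGD.train(currentRDD, 100))
  }.toList
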
But I start to see messages like
14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task 363.0:4)
14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch failure from null
and finally
14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4 times; aborting job

Then
14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor: eventProcesserActor failed due to the error null; shutting down SparkContext
14/06/20 10:10:26 ERROR actor.OneForOneStrategy:
java.lang.UnsupportedOperationException
        at org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
        at org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
        at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)


This doesn't happen when I don't use toParArray. I've read that Spark is
thread-safe, but I seem to be running into problems. Am I doing something
wrong?

Kyle



On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott <kellr...@soe.ucsc.edu>
wrote:

>
> I'm working on a problem learning several different sets of responses
> against the same set of training features. Right now I've written the
> program to cycle through all of the different label sets, attach each one
> to the training data, and run LogisticRegressionWithSGD on it, i.e.
>
> foreach curResponseSet in allResponses:
>      currentRDD : RDD[LabeledPoint] = curResponseSet joined with trainingData
>      LogisticRegressionWithSGD.train(currentRDD)
>
>
> Each of the different training runs is independent, so it seems like I
> should be able to parallelize them as well.
> Is there a better way to do this?
>
>
> Kyle
>
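
For reference, the sequential loop quoted above corresponds roughly to this (same assumed shapes and placeholder names as the sketch at the top of this message):

allResponses.foreach { case (responseName, labels) =>
  val currentRDD: RDD[LabeledPoint] =
    labels.join(trainingData).map { case (_, (label, features)) =>
      LabeledPoint(label, features)
    }
  LogisticRegressionWithSGD.train(currentRDD, 100)   // one job at a time; the driver waits for each
}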
