Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-08 Thread Kyle Ellrott
I changed the

curGraph = curGraph.outerJoinVertices(curMessages)(
  (vid, vertex, message) =>
    vertex.process(message.getOrElse(List[Message]()), ti)
).cache()

to

curGraph = curGraph.outerJoinVertices(curMessages)(
  (vid, vertex, message) => (vertex,
    message.getOrElse(List[Message]()))
).mapVertices( (x, y) => y._1.process( y._2, ti ) ).cache()

So the call to the 'process' method was moved out of the outerJoinVertices
and into a separate mapVertices call, and the problem went away. Now,
'process' is only called once during the correct cycle.
So it would appear that outerJoinVertices caches the closure, to be
recalculated if needed again, while mapVertices actually caches the derived
values.

Is this a bug or a feature?

Kyle



On Sat, Feb 7, 2015 at 11:44 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:

 I'm trying to set up a simple iterative message/update problem in GraphX
 (spark 1.2.0), but I'm running into issues with the caching and
 re-calculation of data. I'm trying to follow the example found in the
 Pregel implementation of materializing and caching messages and graphs and
 then unpersisting them after the next cycle has been done.
 It doesn't seem to be working, because every cycle gets progressively
 slower and it seems as if more and more of the values are being
 re-calculated despite my attempts to cache them.

 The code:
 ```
   var oldMessages : VertexRDD[List[Message]] = null
   var oldGraph : Graph[MyVertex, MyEdge] = null
   curGraph = curGraph.mapVertices((x, y) => y.init())
   for (i <- 0 to cycle_count) {
     val curMessages = curGraph.aggregateMessages[List[Message]](x => {
       //send messages
       .
     },
     (x, y) => {
       //collect messages into lists
       val out = x ++ y
       out
     }
     ).cache()
     curMessages.count()
     val ti = i
     oldGraph = curGraph
     curGraph = curGraph.outerJoinVertices(curMessages)(
       (vid, vertex, message) =>
         vertex.process(message.getOrElse(List[Message]()), ti)
     ).cache()
     curGraph.vertices.count()
     oldGraph.unpersistVertices(blocking = false)
     oldGraph.edges.unpersist(blocking = false)
     oldGraph = curGraph
     if (oldMessages != null) {
       oldMessages.unpersist(blocking = false)
     }
     oldMessages = curMessages
   }
 ```

 The MyVertex.process method takes the list of incoming messages, averages
 them and returns a new MyVertex object. I've also set it up to append the
 cycle number (the second argument) into a log file named after the vertex.
 What ends up getting dumped into the log file for every vertex (in the
 exact same pattern) is
 ```
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 2
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 3
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 4
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 0
 Cycle: 0
 Cycle: 1
 Cycle: 2
 Cycle: 3
 Cycle: 5
 ```

 Any ideas about what I might be doing wrong with the caching, and how I can
 avoid re-calculating so many of the values?


 Kyle





[GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-07 Thread Kyle Ellrott
I'm trying to set up a simple iterative message/update problem in GraphX
(spark 1.2.0), but I'm running into issues with the caching and
re-calculation of data. I'm trying to follow the example found in the
Pregel implementation of materializing and caching messages and graphs and
then unpersisting them after the next cycle has been done.
It doesn't seem to be working, because every cycle gets progressively
slower and it seems as if more and more of the values are being
re-calculated despite my attempts to cache them.

The code:
```
  var oldMessages : VertexRDD[List[Message]] = null
  var oldGraph : Graph[MyVertex, MyEdge] = null
  curGraph = curGraph.mapVertices((x, y) => y.init())
  for (i <- 0 to cycle_count) {
    val curMessages = curGraph.aggregateMessages[List[Message]](x => {
      //send messages
      .
    },
    (x, y) => {
      //collect messages into lists
      val out = x ++ y
      out
    }
    ).cache()
    curMessages.count()
    val ti = i
    oldGraph = curGraph
    curGraph = curGraph.outerJoinVertices(curMessages)(
      (vid, vertex, message) =>
        vertex.process(message.getOrElse(List[Message]()), ti)
    ).cache()
    curGraph.vertices.count()
    oldGraph.unpersistVertices(blocking = false)
    oldGraph.edges.unpersist(blocking = false)
    oldGraph = curGraph
    if (oldMessages != null) {
      oldMessages.unpersist(blocking = false)
    }
    oldMessages = curMessages
  }
```

The MyVertex.process method takes the list of incoming messages, averages
them and returns a new MyVertex object. I've also set it up to append the
cycle number (the second argument) into a log file named after the vertex.
What ends up getting dumped into the log file for every vertex (in the
exact same pattern) is
```
Cycle: 0
Cycle: 1
Cycle: 0
Cycle: 2
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 3
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 2
Cycle: 4
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 2
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 0
Cycle: 0
Cycle: 1
Cycle: 2
Cycle: 3
Cycle: 5
```

Any ideas about what I might be doing wrong with the caching, and how I can
avoid re-calculating so many of the values?
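
For reference, here is a minimal, hypothetical sketch of the vertex class
described above; the field names and the averaging/logging details are
assumptions for illustration, not the actual implementation:

```
import java.io.{File, FileWriter}

// Hypothetical Message and MyVertex types (names and fields assumed).
case class Message(value: Double)

case class MyVertex(id: Long, value: Double) {
  def init(): MyVertex = this

  // Average the incoming message values and log the cycle number.
  // The per-vertex log file is what makes the extra recomputations visible.
  def process(messages: List[Message], cycle: Int): MyVertex = {
    val avg =
      if (messages.isEmpty) value
      else messages.map(_.value).sum / messages.size
    val w = new FileWriter(new File(s"vertex_$id.log"), true)
    try w.write(s"Cycle: $cycle\n") finally w.close()
    copy(value = avg)
  }
}
```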


Kyle


Re: dockerized spark executor on mesos?

2014-12-03 Thread Kyle Ellrott
I'd like to tag a question onto this; has anybody attempted to deploy Spark
under Kubernetes (https://github.com/googlecloudplatform/kubernetes) or
Kubernetes-Mesos (https://github.com/mesosphere/kubernetes-mesos)?
On Wednesday, December 3, 2014, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I'd suggest asking about this on the Mesos list (CCed). As far as I know,
 there was actually some ongoing work for this.

 Matei

  On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote:
 
  Just wondered if anyone had managed to start spark
  jobs on mesos wrapped in a docker container?
 
  At present (i.e. very early testing) I'm able to submit executors
  to mesos via spark-submit easily enough, but they fall over
  as we don't have a JVM on our slaves out of the box.
 
  I can push one out via our CM system if push comes to shove,
  but it'd be nice to have that as part of the job (I'm thinking it might
  be a way to get some of the dependencies deployed too).
 
  bear in mind I'm a total clueless newbie at this so please be gentle
  if I'm doing this completely wrong.
 
  Thanks!
 






Re: Large Task Size?

2014-07-19 Thread Kyle Ellrott
I'm still having trouble with this one.
Watching it, I've noticed that the first time around, the task size is
large, but not terrible (199KB). It's on the second iteration of the
optimization that the task size goes crazy (120MB).

Does anybody have any ideas why this might be happening? Is there any way
that I can view the data being encoded in the task description, so that I
might be able to get an idea why it is blowing up?
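
As a rough aid for that question, one generic way to gauge how large an
object graph becomes when Java-serialized; this is not Spark's own
task-serialization path, just a sketch for spotting a field that balloons
between iterations:

```
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize an object with plain Java serialization and report the byte count.
// Requires the object (and everything it references) to be Serializable.
def serializedSize(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  try out.writeObject(obj) finally out.close()
  bytes.size()
}
```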

The line in question can be found at:
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L157

From the logs:
14/07/19 17:29:35 WARN scheduler.TaskSetManager: Stage 858 contains a task
of very large size (199 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:22 WARN scheduler.TaskSetManager: Stage 857 contains a task
of very large size (185 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:26 WARN scheduler.TaskSetManager: Stage 1029 contains a task
of very large size (185 KB). The maximum recommended task size is 100 KB.
14/07/19 17:30:57 WARN scheduler.TaskSetManager: Stage 1202 contains a task
of very large size (123414 KB). The maximum recommended task size is 100 KB.

From the web server (connecting the stage number to the line number):

Stage Id   Description
858        sample at GroupedGradientDescent.scala:157
857        collect at GroupedGradientDescent.scala:183
1029       collect at GroupedGradientDescent.scala:194
1202       sample at GroupedGradientDescent.scala:157

Kyle



On Tue, Jul 15, 2014 at 2:45 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote:

 Yes, this is a proposed patch to MLLib so that you can use 1 RDD to train
 multiple models at the same time. I am hoping that multiplexing several
 models in the same RDD will be more efficient than trying to get the Spark
 scheduler to manage a few hundred tasks simultaneously.

 I don't think I see stochasticLossHistory being included in the closure
 (please correct me if I'm wrong). It's used once on line 183 to capture the
 loss sums (a local operation on the results of a 'collect' call), and again
 on line 198 to update weightSet, but that's after the loop completes, and
 the memory blowup definitely happens before then.

 Kyle



 On Tue, Jul 15, 2014 at 12:00 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 Ah, I didn't realize this was non-MLLib code. Do you mean to be sending 
 stochasticLossHistory
 in the closure as well?


 On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:

 It uses the standard SquaredL2Updater, and I also tried to broadcast it
 as well.

 The input is an RDD created by taking the union of several inputs that have
 all been run through MLUtils.kFold to produce even more RDDs; for example, I
 might run with 10 different inputs, each with 10 kFolds. I'm pretty certain
 that all of the input RDDs have clean closures. But I'm curious: is there a
 high overhead for running union? Could that create larger task sizes?
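
As a hedged illustration of the input construction described above (the
dataset shapes and names here are assumptions, not the actual code):

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// Build ~10 toy inputs, split each into 10 folds with MLUtils.kFold, and union
// the training sides. The resulting RDD has a deep lineage and many partitions,
// which is the suspected source of the large task sizes.
def buildCombined(sc: SparkContext): RDD[LabeledPoint] = {
  val inputs: Seq[RDD[LabeledPoint]] = (0 until 10).map { i =>
    sc.parallelize((0 until 100).map { j =>
      LabeledPoint(j % 2, Vectors.dense(i.toDouble, j.toDouble))
    })
  }
  val trainingFolds: Seq[RDD[LabeledPoint]] = inputs.flatMap { data =>
    MLUtils.kFold(data, numFolds = 10, seed = 42).map { case (train, _) => train }
  }
  trainingFolds.reduce(_ union _)
}
```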

 Kyle



 On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 I also did a quick glance through the code and couldn't find anything
 worrying that should be included in the task closures. The only possibly
 unsanitary part is the Updater you pass in -- what is your Updater and is
 it possible it's dragging in a significant amount of extra state?


 On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:

  I'm working on a patch to MLLib that allows for multiplexing several
  different model optimizations using the same RDD ( SPARK-2372:
 https://issues.apache.org/jira/browse/SPARK-2372 )

 In testing larger datasets, I've started to see some memory errors (
 java.lang.OutOfMemoryError and exceeds max allowed: spark.akka.frameSize
 errors ).
  My main clue is that Spark will start logging warnings on smaller
  systems like:

 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
 task of very large size (10119 KB). The maximum recommended task size is
 100 KB.

  Looking up stage '2862' leads to a 'sample at
  GroupedGradientDescent.scala:156' call. That code can be seen at

 https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156

  I've looked over the code; I'm broadcasting the larger variables, and
  between the sampler and the combineByKey I wouldn't think there is much data
  being moved over the network, much less a 10MB chunk.

 Any ideas of what this might be a symptom of?

 Kyle








Re: Large Task Size?

2014-07-15 Thread Kyle Ellrott
Yes, this is a proposed patch to MLLib so that you can use 1 RDD to train
multiple models at the same time. I am hoping that multiplexing several
models in the same RDD will be more efficient than trying to get the Spark
scheduler to manage a few hundred tasks simultaneously.

I don't think I see stochasticLossHistory being included in the closure
(please correct me if I'm wrong). It's used once on line 183 to capture the
loss sums (a local operation on the results of a 'collect' call), and again
on line 198 to update weightSet, but that's after the loop completes, and
the memory blowup definitely happens before then.

Kyle



On Tue, Jul 15, 2014 at 12:00 PM, Aaron Davidson ilike...@gmail.com wrote:

 Ah, I didn't realize this was non-MLLib code. Do you mean to be sending 
 stochasticLossHistory
 in the closure as well?


 On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:

 It uses the standard SquaredL2Updater, and I also tried to broadcast it
 as well.

 The input is an RDD created by taking the union of several inputs that have
 all been run through MLUtils.kFold to produce even more RDDs; for example, I
 might run with 10 different inputs, each with 10 kFolds. I'm pretty certain
 that all of the input RDDs have clean closures. But I'm curious: is there a
 high overhead for running union? Could that create larger task sizes?

 Kyle



 On Sat, Jul 12, 2014 at 7:50 PM, Aaron Davidson ilike...@gmail.com
 wrote:

 I also did a quick glance through the code and couldn't find anything
 worrying that should be included in the task closures. The only possibly
 unsanitary part is the Updater you pass in -- what is your Updater and is
 it possible it's dragging in a significant amount of extra state?


 On Sat, Jul 12, 2014 at 7:27 PM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:

 I'm working on a patch to MLLib that allows for multiplexing several
 different model optimizations using the same RDD ( SPARK-2372:
 https://issues.apache.org/jira/browse/SPARK-2372 )

 In testing larger datasets, I've started to see some memory errors (
 java.lang.OutOfMemoryError and exceeds max allowed: spark.akka.frameSize
 errors ).
 My main clue is that Spark will start logging warnings on smaller
 systems like:

 14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a
 task of very large size (10119 KB). The maximum recommended task size is
 100 KB.

 Looking up stage '2862' leads to a 'sample at
 GroupedGradientDescent.scala:156' call. That code can be seen at

 https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156

 I've looked over the code; I'm broadcasting the larger variables, and
 between the sampler and the combineByKey I wouldn't think there is much data
 being moved over the network, much less a 10MB chunk.

 Any ideas of what this might be a symptom of?

 Kyle







Large Task Size?

2014-07-12 Thread Kyle Ellrott
I'm working on a patch to MLLib that allows for multiplexing several
different model optimizations using the same RDD ( SPARK-2372:
https://issues.apache.org/jira/browse/SPARK-2372 )

In testing larger datasets, I've started to see some memory errors (
java.lang.OutOfMemoryError and exceeds max allowed: spark.akka.frameSize
errors ).
My main clue is that Spark will start logging warnings on smaller systems
like:

14/07/12 19:14:46 WARN scheduler.TaskSetManager: Stage 2862 contains a task
of very large size (10119 KB). The maximum recommended task size is 100 KB.

Looking up stage '2862' leads to a 'sample at
GroupedGradientDescent.scala:156' call. That code can be seen at
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala#L156

I've looked over the code; I'm broadcasting the larger variables, and
between the sampler and the combineByKey I wouldn't think there is much data
being moved over the network, much less a 10MB chunk.

Any ideas of what this might be a symptom of?

Kyle


Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
This all seems pretty hackish and a lot of trouble to get around
limitations in mllib.
The big limitation is that right now, the optimization algorithms work on
one large dataset at a time. We need a second set of methods to work on
a large number of medium-sized datasets.
I've started to code a new set of optimization methods to add into mllib.
I've started with GroupedGradientDescent (
https://github.com/kellrott/spark/blob/mllib-grouped/mllib/src/main/scala/org/apache/spark/mllib/grouped/GroupedGradientDescent.scala
)

GroupedGradientDescent is based on GradientDescent, but instead it takes
RDD[(Int, (Double, Vector))] as its data input rather than RDD[(Double,
Vector)]. The Int serves as a key to mark which elements should be grouped
together. This lets you multiplex several dataset optimizations into the
same RDD.
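
As a small sketch of that input format (the function and variable names here
are illustrative, not part of the patch):

```
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Tag each dataset's (label, features) pairs with an Int key so that many
// independent optimization problems can be multiplexed into one RDD.
def multiplex(datasets: Seq[RDD[(Double, Vector)]]): RDD[(Int, (Double, Vector))] = {
  datasets.zipWithIndex
    .map { case (data, id) => data.map(point => (id, point)) }
    .reduce(_ union _)
}
```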

I think I've gotten the GroupedGradientDescent to work correctly. I need to
go up the stack and start adding methods like SVMWithSGD.trainGroup.

Does anybody have any thoughts on this?

Kyle



On Fri, Jun 27, 2014 at 6:36 PM, Xiangrui Meng men...@gmail.com wrote:

 The RDD is cached in only one or two workers. All other executors need
 to fetch its content via network. Since the dataset is not huge, could
 you try this?

 val features: Array[Vector] = ...
 val featuresBc = sc.broadcast(features)
 // parallel loops
 val labels: Array[Double] =
 val rdd = sc.parallelize(0 until 1, 1).flatMap(i =>
   featuresBc.value.view.zip(labels))
 val model = SVMWithSGD.train(rdd)
 models(i) = model

 Using BT broadcast factory would improve the performance of broadcasting.

 Best,
 Xiangrui
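
A hedged, self-contained reading of the snippet quoted above; the loop
structure, dataset sizes, and the LabeledPoint conversion are assumptions for
illustration:

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Broadcast the shared feature matrix once, then rebuild a small training RDD
// from the broadcast for each label set and train one model per set. The outer
// loop could also be driven from a thread pool, as discussed in this thread.
def trainAll(sc: SparkContext,
             features: Array[Vector],
             labelSets: Array[Array[Double]]): Array[SVMModel] = {
  val featuresBc = sc.broadcast(features)
  labelSets.map { labels =>
    val rdd = sc.parallelize(0 until 1, 1).flatMap { _ =>
      featuresBc.value.zip(labels).map { case (f, y) => LabeledPoint(y, f) }
    }
    SVMWithSGD.train(rdd, 100) // numIterations assumed
  }
}
```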

 On Fri, Jun 27, 2014 at 3:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:
  1) I'm using the static SVMWithSGD.train, with no options.
  2) I have about 20,000 features (~5000 samples) that are being attached
 and
  trained against 14,000 different sets of labels (ie I'll be doing 14,000
  different training runs against the same sets of features trying to
 figure
  out which labels can be learned), and I would also like to do cross fold
  validation.
 
  The driver doesn't seem to be using too much memory. I left it as -Xmx8g
 and
  it never complained.
 
  Kyle
 
 
 
  On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Kyle,
 
  A few questions:
 
  1) Did you use `setIntercept(true)`?
  2) How many features?
 
  I'm a little worried about driver's load because the final aggregation
  and weights update happen on the driver. Did you check driver's memory
  usage as well?
 
  Best,
  Xiangrui
 
  On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott kellr...@soe.ucsc.edu
  wrote:
   As far as I can tell there is no data to broadcast (unless there is
   something internal to mllib that needs to be broadcast). I've coalesced
   the input RDDs to keep the number of partitions limited. When running,
   I've tried to get up to 500 concurrent stages, and I've coalesced the
   RDDs down to 2 partitions, so about 1000 tasks.
   Despite having over 500 threads in the threadpool working on mllib
   tasks,
   the total CPU usage never really goes above 150%.
   I've tried increasing 'spark.akka.threads' but that doesn't seem to do
   anything.
  
   My one thought is that, because I'm using MLUtils.kFold to generate the
   RDDs, I have so many tasks working off RDDs that are permutations of the
   original RDDs, and maybe that is creating some sort of dependency
   bottleneck.
  
   Kyle
  
  
   On Thu, Jun 26, 2014 at 6:35 PM, Aaron Davidson ilike...@gmail.com
   wrote:
  
   I don't have specific solutions for you, but the general things to
 try
   are:
  
   - Decrease task size by broadcasting any non-trivial objects.
   - Increase duration of tasks by making them less fine-grained.
  
   How many tasks are you sending? I've seen in the past something like
 25
   seconds for ~10k total medium-sized tasks.
  
  
   On Thu, Jun 26, 2014 at 12:06 PM, Kyle Ellrott 
 kellr...@soe.ucsc.edu
   wrote:
  
   I'm working to set up a calculation that involves calling mllib's
   SVMWithSGD.train several thousand times on different permutations of the
   data. I'm trying to run the separate jobs using a threadpool to dispatch
   the different requests to a Spark context connected to a Mesos cluster,
   using coarse scheduling, and a max of 2000 cores on Spark 1.0.
   Total utilization of the system is terrible. Most of the 'aggregate at
   GradientDescent.scala:178' stages (where mllib spends most of its time)
   take about 3 seconds, but have ~25 seconds of scheduler delay time.
   What kind of things can I do to improve this?
  
   Kyle
  
  
  
 
 



Re: Improving Spark multithreaded performance?

2014-06-27 Thread Kyle Ellrott
1) I'm using the static SVMWithSGD.train, with no options.
2) I have about 20,000 features (~5000 samples) that are being attached and
trained against 14,000 different sets of labels (i.e. I'll be doing 14,000
different training runs against the same sets of features, trying to figure
out which labels can be learned), and I would also like to do cross-fold
validation.

The driver doesn't seem to be using too much memory. I left it as -Xmx8g
and it never complained.

Kyle



On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Kyle,

 A few questions:

 1) Did you use `setIntercept(true)`?
 2) How many features?

 I'm a little worried about driver's load because the final aggregation
 and weights update happen on the driver. Did you check driver's memory
 usage as well?

 Best,
 Xiangrui

 On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:
  As far as I can tell there is no data to broadcast (unless there is
  something internal to mllib that needs to be broadcast). I've coalesced the
  input RDDs to keep the number of partitions limited. When running, I've
  tried to get up to 500 concurrent stages, and I've coalesced the RDDs down
  to 2 partitions, so about 1000 tasks.
  Despite having over 500 threads in the threadpool working on mllib tasks,
  the total CPU usage never really goes above 150%.
  I've tried increasing 'spark.akka.threads' but that doesn't seem to do
  anything.
 
  My one thought is that, because I'm using MLUtils.kFold to generate the
  RDDs, I have so many tasks working off RDDs that are permutations of the
  original RDDs, and maybe that is creating some sort of dependency
  bottleneck.
 
  Kyle
 
 
  On Thu, Jun 26, 2014 at 6:35 PM, Aaron Davidson ilike...@gmail.com
 wrote:
 
  I don't have specific solutions for you, but the general things to try
  are:
 
  - Decrease task size by broadcasting any non-trivial objects.
  - Increase duration of tasks by making them less fine-grained.
 
  How many tasks are you sending? I've seen in the past something like 25
  seconds for ~10k total medium-sized tasks.
 
 
  On Thu, Jun 26, 2014 at 12:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu
  wrote:
 
  I'm working to set up a calculation that involves calling mllib's
  SVMWithSGD.train several thousand times on different permutations of the
  data. I'm trying to run the separate jobs using a threadpool to dispatch
  the different requests to a Spark context connected to a Mesos cluster,
  using coarse scheduling, and a max of 2000 cores on Spark 1.0.
  Total utilization of the system is terrible. Most of the 'aggregate at
  GradientDescent.scala:178' stages (where mllib spends most of its time)
  take about 3 seconds, but have ~25 seconds of scheduler delay time.
  What kind of things can I do to improve this?
 
  Kyle
 
 
 



Improving Spark multithreaded performance?

2014-06-26 Thread Kyle Ellrott
I'm working to set up a calculation that involves calling
mllib's SVMWithSGD.train several thousand times on different permutations
of the data. I'm trying to run the separate jobs using a threadpool to
dispatch the different requests to a Spark context connected to a Mesos
cluster, using coarse scheduling, and a max of 2000 cores on Spark 1.0.
Total utilization of the system is terrible. Most of the 'aggregate at
GradientDescent.scala:178' stages (where mllib spends most of its time) take
about 3 seconds, but have ~25 seconds of scheduler delay time.
What kind of things can I do to improve this?
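
For illustration, a minimal sketch of the thread-pool dispatch pattern
described above, assuming the permuted datasets are already available as RDDs
(names and iteration counts are placeholders):

```
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Submit one independent Spark job per dataset from a fixed-size thread pool.
// SparkContext is thread-safe, so the jobs run concurrently on the cluster.
def trainConcurrently(datasets: Seq[RDD[LabeledPoint]],
                      parallelism: Int = 16): Seq[SVMModel] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val futures = datasets.map(data => Future { SVMWithSGD.train(data, 100) })
  try futures.map(f => Await.result(f, Duration.Inf))
  finally pool.shutdown()
}
```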

Kyle


Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
I've tried to parallelize the separate regressions using
allResponses.toParArray.map( x => do logistic regression against labels in x)
But I start to see messages like
14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task
363.0:4)
14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch
failure from null
and finally
14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4
times; aborting job

Then
14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor:
eventProcesserActor failed due to the error null; shutting down SparkContext
14/06/20 10:10:26 ERROR actor.OneForOneStrategy:
java.lang.UnsupportedOperationException
at
org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
at
org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)


This doesn't happen when I don't use toParArray. I read that Spark was
thread-safe, but I seem to be running into problems. Am I doing something
wrong?

Kyle



On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott kellr...@soe.ucsc.edu
wrote:


 I'm working on a problem learning several different sets of responses
 against the same set of training features. Right now I've written the
 program to cycle through all of the different label sets, attached them to
 the training data and run LogisticRegressionWithSGD on each of them. ie

 foreach curResponseSet in allResponses:
  currentRDD : RDD[LabeledPoints] = curResponseSet joined with
 trainingData
  LogisticRegressionWithSGD.train(currentRDD)


 Each of the different training runs is independent. It seems like I
 should be able to parallelize them as well.
 Is there a better way to do this?


 Kyle



Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
It looks like I was running into
https://issues.apache.org/jira/browse/SPARK-2204
The issue went away when I changed to spark.mesos.coarse.
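
For reference, a minimal sketch of that configuration change (the master URL
is a placeholder):

```
import org.apache.spark.SparkConf

// Enable coarse-grained Mesos mode before creating the SparkContext.
val conf = new SparkConf()
  .setMaster("mesos://host:5050")
  .set("spark.mesos.coarse", "true")
```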

Kyle


On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott kellr...@soe.ucsc.edu
wrote:

 I've tried to parallelize the separate regressions using
 allResponses.toParArray.map( x => do logistic regression against labels in x)
 But I start to see messages like
 14/06/20 10:10:26 WARN scheduler.TaskSetManager: Lost TID 4193 (task
 363.0:4)
 14/06/20 10:10:27 WARN scheduler.TaskSetManager: Loss was due to fetch
 failure from null
 and finally
 14/06/20 10:10:26 ERROR scheduler.TaskSetManager: Task 363.0:4 failed 4
 times; aborting job

 Then
 14/06/20 10:10:26 ERROR scheduler.DAGSchedulerActorSupervisor:
 eventProcesserActor failed due to the error null; shutting down SparkContext
 14/06/20 10:10:26 ERROR actor.OneForOneStrategy:
 java.lang.UnsupportedOperationException
 at
 org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
 at
 org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
  at
 org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)


 This doesn't happen when I don't use toParArray. I read that Spark was
 thread-safe, but I seem to be running into problems. Am I doing something
 wrong?

 Kyle



 On Thu, Jun 19, 2014 at 11:21 AM, Kyle Ellrott kellr...@soe.ucsc.edu
 wrote:


 I'm working on a problem learning several different sets of responses
 against the same set of training features. Right now I've written the
 program to cycle through all of the different label sets, attached them to
 the training data and run LogisticRegressionWithSGD on each of them. ie

 foreach curResponseSet in allResponses:
  currentRDD : RDD[LabeledPoints] = curResponseSet joined with
 trainingData
  LogisticRegressionWithSGD.train(currentRDD)


 Each of the different training runs is independent. It seems like I
 should be able to parallelize them as well.
 Is there a better way to do this?


 Kyle





Parallel LogisticRegression?

2014-06-19 Thread Kyle Ellrott
I'm working on a problem learning several different sets of responses
against the same set of training features. Right now I've written the
program to cycle through all of the different label sets, attach them to
the training data, and run LogisticRegressionWithSGD on each of them, i.e.

foreach curResponseSet in allResponses:
 currentRDD : RDD[LabeledPoints] = curResponseSet joined with
trainingData
 LogisticRegressionWithSGD.train(currentRDD)


Each of the different training runs is independent. It seems like I should
be able to parallelize them as well.
Is there a better way to do this?
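
A hedged Scala rendering of the pseudocode above; the keying of samples by an
id and the iteration count are assumptions:

```
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Join each response set with the shared training features by sample id,
// then train one logistic regression model per response set.
def trainPerResponse(trainingData: RDD[(Long, Vector)],
                     allResponses: Seq[RDD[(Long, Double)]]): Seq[LogisticRegressionModel] = {
  allResponses.map { responses =>
    val currentRDD = responses.join(trainingData)
      .map { case (_, (label, features)) => LabeledPoint(label, features) }
    LogisticRegressionWithSGD.train(currentRDD, 100) // numIterations assumed
  }
}
```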


Kyle


GraphX vertices and connected edges

2014-05-02 Thread Kyle Ellrott
What is the most efficient way to get an RDD of GraphX vertices and their
connected edges? Initially I thought I could use mapReduceTriplets, but I
realized that would neglect vertices that aren't connected to anything.
Would I have to do a mapReduceTriplets and then do a join with all of the
vertices to pick up the ones lost because they aren't part of a triplet?
I'm hoping there is a simpler way I've neglected.
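
A hedged sketch of the join approach described above, using a concrete
Graph[String, Int] to keep the types simple:

```
import org.apache.spark.graphx._

// Gather each vertex's incident edges with mapReduceTriplets, then leftJoin
// against all vertices so isolated vertices come back with an empty edge list.
def verticesWithEdges(graph: Graph[String, Int]): VertexRDD[(String, Array[Edge[Int]])] = {
  val edgesPerVertex: VertexRDD[Array[Edge[Int]]] =
    graph.mapReduceTriplets[Array[Edge[Int]]](
      triplet => {
        val e = Edge(triplet.srcId, triplet.dstId, triplet.attr)
        Iterator((triplet.srcId, Array(e)), (triplet.dstId, Array(e)))
      },
      _ ++ _
    )
  graph.vertices.leftJoin(edgesPerVertex) { (vid, attr, edges) =>
    (attr, edges.getOrElse(Array.empty[Edge[Int]]))
  }
}
```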

Kyle