Re: [GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-08 Thread Kyle Ellrott
that outerJoinVertices caches the closure to be recalculated if needed again while mapVertices actually caches the derived values. Is this a bug or a feature? Kyle On Sat, Feb 7, 2015 at 11:44 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I'm trying to setup a simple iterative message/update

[GraphX] Excessive value recalculations during aggregateMessages cycles

2015-02-07 Thread Kyle Ellrott
I'm trying to set up a simple iterative message/update problem in GraphX (Spark 1.2.0), but I'm running into issues with the caching and recalculation of data. I'm trying to follow the example found in the Pregel implementation of materializing and caching messages and graphs and then
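The pattern the Pregel implementation uses can be sketched roughly as below: cache the new graph and messages each round, force materialization, then unpersist the previous round's data so stale lineage is not recomputed. This is a hypothetical sketch with placeholder vertex/edge types and message logic, not the poster's actual code.

```scala
import org.apache.spark.graphx._

// Iterative aggregateMessages loop, modeled on GraphX's own Pregel loop.
// Double vertex/edge attributes and the sum-of-edge-attrs message are
// placeholders for illustration.
def iterate(graph: Graph[Double, Double], maxIters: Int): Graph[Double, Double] = {
  var g = graph.cache()
  var messages: VertexRDD[Double] =
    g.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.attr), _ + _).cache()
  var i = 0
  while (i < maxIters) {
    val newGraph = g.joinVertices(messages)((id, attr, msg) => attr + msg).cache()
    val newMessages = newGraph
      .aggregateMessages[Double](ctx => ctx.sendToDst(ctx.attr), _ + _)
      .cache()
    newMessages.count()                    // force materialization before unpersisting
    g.unpersistVertices(blocking = false)  // drop the previous round's cached data
    g.edges.unpersist(blocking = false)
    messages.unpersist(blocking = false)
    g = newGraph
    messages = newMessages
    i += 1
  }
  g
}
```

Without the explicit `count()` and unpersist calls, each round's lineage stays live and values can be recalculated from scratch, which appears to be the behavior being reported.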

Re: dockerized spark executor on mesos?

2014-12-03 Thread Kyle Ellrott
I'd like to tag a question onto this; has anybody attempted to deploy spark under Kubernetes https://github.com/googlecloudplatform/kubernetes or Kubernetes mesos ( https://github.com/mesosphere/kubernetes-mesos ) . On Wednesday, December 3, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: I'd

Re: Large Task Size?

2014-07-19 Thread Kyle Ellrott
sample at GroupedGradientDescent.scala:157 Kyle On Tue, Jul 15, 2014 at 2:45 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: Yes, this is a proposed patch to MLLib so that you can use 1 RDD to train multiple models at the same time. I am hoping that by multiplexing several models in the same

Re: Large Task Size?

2014-07-15 Thread Kyle Ellrott
definitely happens before then. Kyle On Tue, Jul 15, 2014 at 12:00 PM, Aaron Davidson ilike...@gmail.com wrote: Ah, I didn't realize this was non-MLLib code. Do you mean to be sending stochasticLossHistory in the closure as well? On Sun, Jul 13, 2014 at 1:05 AM, Kyle Ellrott kellr

Large Task Size?

2014-07-12 Thread Kyle Ellrott
I'm working on a patch to MLLib that allows for multiplexing several different model optimizations using the same RDD ( SPARK-2372: https://issues.apache.org/jira/browse/SPARK-2372 ). In testing larger datasets, I've started to see some memory errors ( java.lang.OutOfMemoryError and exceeds max
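A common cause of the "large task size" warning in this kind of multiplexed setup is capturing all of the model weight vectors in the task closure, so every task ships the whole array. A hedged sketch of the broadcast-based alternative, with assumed names (`data`, `weights`, one weight vector per model id):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Hypothetical sketch: broadcast the weights once per iteration so only a
// small handle travels in each task closure, instead of the full array.
def dotPerExample(sc: SparkContext,
                  data: RDD[(Int, Array[Double])],
                  weights: Array[Array[Double]]): Array[Double] = {
  val bcWeights: Broadcast[Array[Array[Double]]] = sc.broadcast(weights)
  val dots = data.map { case (modelId, features) =>
    val w = bcWeights.value(modelId)   // read from the broadcast, not the closure
    w.zip(features).map { case (a, b) => a * b }.sum
  }.collect()
  bcWeights.unpersist()
  dots
}
```

If the task-size warning points at a sampling or gradient line, inspecting exactly which local variables that closure captures is usually the quickest diagnosis.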

Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
= SVMWithSGD.train(rdd) models(i) = model Using BT broadcast factory would improve the performance of broadcasting. Best, Xiangrui On Fri, Jun 27, 2014 at 3:06 PM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: 1) I'm using the static SVMWithSGD.train, with no options. 2) I have about 20,000 features

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Kyle Ellrott
`setIntercept(true)`? 2) How many features? I'm a little worried about driver's load because the final aggregation and weights update happen on the driver. Did you check driver's memory usage as well? Best, Xiangrui On Fri, Jun 27, 2014 at 8:10 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote

Improving Spark multithreaded performance?

2014-06-26 Thread Kyle Ellrott
I'm working to set up a calculation that involves calling MLlib's SVMWithSGD.train several thousand times on different permutations of the data. I'm trying to run the separate jobs using a threadpool to dispatch the different requests to a spark context connected to a Mesos cluster, using coarse
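The threadpool dispatch described above can be sketched as follows. A single SparkContext is thread-safe for submitting jobs, so independent `SVMWithSGD.train` calls can run concurrently from a fixed pool. The names `labelSets` and `poolSize` are assumptions for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: submit one training job per labeled dataset from a
// fixed-size thread pool sharing one SparkContext.
def trainAll(labelSets: Seq[RDD[LabeledPoint]], poolSize: Int = 8): Seq[SVMModel] = {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(poolSize))
  val futures = labelSets.map { rdd =>
    Future { SVMWithSGD.train(rdd, 100) }   // numIterations = 100, defaults otherwise
  }
  Await.result(Future.sequence(futures), Duration.Inf)
}
```

The pool size bounds how many jobs compete for executors at once; scheduling among them is governed by the cluster manager and Spark's fair/FIFO scheduler setting.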

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I'm working on a problem learning several different sets of responses against the same set of training features. Right now I've written the program to cycle through all of the different label sets, attached them to the training data and run

Re: Parallel LogisticRegression?

2014-06-20 Thread Kyle Ellrott
It looks like I was running into https://issues.apache.org/jira/browse/SPARK-2204 The issues went away when I changed to spark.mesos.coarse. Kyle On Fri, Jun 20, 2014 at 10:36 AM, Kyle Ellrott kellr...@soe.ucsc.edu wrote: I've tried to parallelize the separate regressions using
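The fix mentioned above is a one-line configuration change. In coarse-grained Mesos mode Spark holds long-running executors instead of launching a Mesos task per Spark task, which sidesteps the fine-grained scheduling issue. The master URL here is a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Coarse-grained Mesos mode: keep long-lived executors on each node
// rather than one short-lived Mesos task per Spark task.
val conf = new SparkConf()
  .setMaster("mesos://host:5050")        // placeholder master URL
  .set("spark.mesos.coarse", "true")
val sc = new SparkContext(conf)
```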

Parallel LogisticRegression?

2014-06-19 Thread Kyle Ellrott
I'm working on a problem learning several different sets of responses against the same set of training features. Right now I've written the program to cycle through all of the different label sets, attach them to the training data, and run LogisticRegressionWithSGD on each of them, i.e. foreach
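The serial loop described above might look like the sketch below: each label set is zipped onto the shared feature RDD and one model is trained per set. `features` and `labelSets` are assumed names, and `zip` presumes the RDDs have identical partitioning and element counts.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: train one logistic regression per label set over
// the same shared feature RDD, serially.
def trainEach(features: RDD[Vector],
              labelSets: Seq[RDD[Double]]): Seq[LogisticRegressionModel] =
  labelSets.map { labels =>
    val training = labels.zip(features).map { case (y, x) => LabeledPoint(y, x) }
    LogisticRegressionWithSGD.train(training, 100)   // numIterations = 100
  }
```

Because every iteration scans the same features, caching `features` (or the zipped training sets) is what makes the serial version tolerable; the follow-up messages discuss parallelizing it instead.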

GraphX vertices and connected edges

2014-05-02 Thread Kyle Ellrott
What is the most efficient way to get an RDD of GraphX vertices and their connected edges? Initially I thought I could use mapReduceTriplets, but I realized that would neglect vertices that aren't connected to anything. Would I have to do a mapReduceTriplets and then do a join with all of the vertices to
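One way to answer the question above, sketched in the newer `aggregateMessages` API: collect incident edges per vertex, then left-join back onto `graph.vertices` so isolated vertices survive with an empty edge list. Concrete `Double` attribute types are used here to keep the sketch simple.

```scala
import org.apache.spark.graphx._

// Hypothetical sketch: pair every vertex with its incident edges without
// dropping isolated vertices.
def vertexEdges(graph: Graph[Double, Double]): VertexRDD[(Double, Array[Edge[Double]])] = {
  // Send each edge to both of its endpoints, concatenating per vertex.
  val edgesPerVertex: VertexRDD[Array[Edge[Double]]] =
    graph.aggregateMessages[Array[Edge[Double]]](
      ctx => {
        val e = Array(Edge(ctx.srcId, ctx.dstId, ctx.attr))
        ctx.sendToSrc(e)
        ctx.sendToDst(e)
      },
      _ ++ _)
  // Left join keeps vertices with no edges, giving them an empty array.
  graph.vertices.leftJoin(edgesPerVertex) { (id, attr, edgesOpt) =>
    (attr, edgesOpt.getOrElse(Array.empty[Edge[Double]]))
  }
}
```

The join is exactly the step the question anticipates; the left (outer-style) variant is what preserves the unconnected vertices that a triplet-only pass would miss.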