Re: Improving Spark multithreaded performance?

2014-07-01 Thread Kyle Ellrott
This all seems pretty hackish and a lot of trouble to get around limitations in mllib. The big limitation is that right now, the optimization algorithms work on one large dataset at a time. We need a second set of methods to work on a large number of medium-sized datasets. I've started to code
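A rough illustration of what such a method might look like, not anything that exists in mllib today: key every LabeledPoint by a dataset id, group within a single job, and run a small local optimizer per group. The object name and the toy hinge-loss SGD below are placeholders, a sketch only.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

object PerKeyTraining {
  // Toy local hinge-loss SGD on an in-memory group; stands in for whatever
  // MLlib optimizer would run on each medium-sized dataset.
  def localSgd(points: Iterable[LabeledPoint], numFeatures: Int,
               iterations: Int = 100, stepSize: Double = 0.1): Array[Double] = {
    val w = Array.fill(numFeatures)(0.0)
    for (_ <- 1 to iterations; p <- points) {
      val x = p.features.toArray
      val margin = x.zip(w).map { case (xi, wi) => xi * wi }.sum
      val label = if (p.label > 0.5) 1.0 else -1.0
      if (label * margin < 1.0) {                     // hinge-loss subgradient step
        var j = 0
        while (j < numFeatures) { w(j) += stepSize * label * x(j); j += 1 }
      }
    }
    w
  }

  // keyedData: (datasetId, LabeledPoint) pairs for all datasets at once,
  // trained in one Spark job instead of one job per dataset.
  def trainAll(keyedData: RDD[(Int, LabeledPoint)], numFeatures: Int): RDD[(Int, Array[Double])] =
    keyedData.groupByKey().mapValues(group => localSgd(group, numFeatures))
}
```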

Re: Improving Spark multithreaded performance?

2014-06-27 Thread anoldbrain
to look at.

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Xiangrui Meng
Hi Kyle, A few questions: 1) Did you use `setIntercept(true)`? 2) How many features? I'm a little worried about the driver's load, because the final aggregation and weights update happen on the driver. Did you check the driver's memory usage as well? Best, Xiangrui
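For reference, `setIntercept(true)` is reachable through the non-static SVMWithSGD builder API rather than the static `SVMWithSGD.train` helper. A minimal sketch; the iteration, step-size, and regularization values here are just placeholders:

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `training` is an assumed RDD[LabeledPoint] standing in for one label set.
def trainWithIntercept(training: RDD[LabeledPoint]): SVMModel = {
  val svm = new SVMWithSGD()
  svm.setIntercept(true)            // the option being asked about
  svm.optimizer
    .setNumIterations(100)
    .setStepSize(1.0)
    .setRegParam(0.01)
  svm.run(training)                 // returns an SVMModel
}
```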

Re: Improving Spark multithreaded performance?

2014-06-27 Thread Kyle Ellrott
1) I'm using the static SVMWithSGD.train, with no options. 2) I have about 20,000 features (~5000 samples) that are being attached and trained against 14,000 different sets of labels (i.e. I'll be doing 14,000 different training runs against the same sets of features trying to figure out which
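A sketch of the shape of that workload, with assumed names for the feature and label collections (the actual code presumably differs):

```scala
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// One static SVMWithSGD.train call per label set, all against the same
// ~20,000-dimensional feature vectors; default step size and regularization.
def trainAllLabelSets(features: RDD[(Long, Vector)],       // sampleId -> feature vector
                      labelSets: Seq[Map[Long, Double]],   // one Map per label set
                      numIterations: Int = 100): Seq[SVMModel] =
  labelSets.map { labels =>
    val training = features.flatMap { case (id, v) =>
      labels.get(id).map(y => LabeledPoint(y, v))          // keep only labeled samples
    }
    SVMWithSGD.train(training, numIterations)
  }
```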

Improving Spark multithreaded performance?

2014-06-26 Thread Kyle Ellrott
I'm working to set up a calculation that involves calling mllib's SVMWithSGD.train several thousand times on different permutations of the data. I'm trying to run the separate jobs using a threadpool to dispatch the different requests to a SparkContext connected to a Mesos cluster, using coarse
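The dispatch pattern being described looks roughly like the following sketch; the pool size, iteration count, and `trainingSets` collection are assumptions:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// A fixed thread pool submits many SVMWithSGD.train jobs concurrently
// against one shared SparkContext (the one connected to the Mesos cluster).
def trainConcurrently(trainingSets: Seq[RDD[LabeledPoint]],
                      poolSize: Int = 16): Seq[SVMModel] = {
  val pool = Executors.newFixedThreadPool(poolSize)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = trainingSets.map { rdd =>
      Future { SVMWithSGD.train(rdd, 100) }   // each Future runs a full Spark job
    }
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally pool.shutdown()
}
```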

Re: Improving Spark multithreaded performance?

2014-06-26 Thread Aaron Davidson
I don't have specific solutions for you, but the general things to try are: - Decrease task size by broadcasting any non-trivial objects. - Increase duration of tasks by making them less fine-grained. How many tasks are you sending? I've seen in the past something like 25 seconds for ~10k total
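Both suggestions in sketch form, with assumed names for the large object and the data (the partition count is arbitrary):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Ship a large lookup table once per executor via a broadcast variable instead
// of inside every task closure, and coarsen the partitioning so each task does
// more work per scheduling round trip.
def applySuggestions(sc: SparkContext,
                     bigLookup: Map[Long, Array[Double]],
                     data: RDD[(Long, Double)]): RDD[Double] = {
  val lookupBc: Broadcast[Map[Long, Array[Double]]] = sc.broadcast(bigLookup)
  data
    .coalesce(64)                                         // fewer, longer-running tasks
    .map { case (id, x) => x * lookupBc.value(id).sum }   // read from the broadcast
}
```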