This all seems pretty hackish and a lot of trouble to get around
limitations in MLlib.
The big limitation is that right now, the optimization algorithms work on
one large dataset at a time. We need a second set of methods that work on
a large number of medium-sized datasets.
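As a rough illustration of what such a second set of methods might look like (this is not an existing MLlib API; `MultiModelSketch` and `trainLocal` are hypothetical stand-ins for a local optimizer), one could train one small model per dataset with plain driver-side code, where a real implementation would run the same loop once per partition instead of once per submitted job:

```scala
// Hypothetical sketch, not an MLlib API: train one small linear model per
// dataset with a single local pass of gradient descent. A real "many
// medium-sized datasets" method would run trainLocal once per partition
// rather than launching a separate distributed job per dataset.
object MultiModelSketch {
  type Point = (Double, Array[Double]) // (label, features)

  // Toy local trainer: one sequential SGD pass for a linear model.
  def trainLocal(data: Seq[Point], lr: Double = 0.1): Array[Double] = {
    val dim = data.head._2.length
    val w = Array.fill(dim)(0.0)
    for ((y, x) <- data) {
      val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = pred - y
      var i = 0
      while (i < dim) { w(i) -= lr * err * x(i); i += 1 }
    }
    w
  }

  // Many medium-sized datasets keyed by id; each gets its own model.
  def trainAll(datasets: Map[String, Seq[Point]]): Map[String, Array[Double]] =
    datasets.map { case (id, data) => id -> trainLocal(data) }
}
```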
I've started to code
to look at.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Improving-Spark-multithreaded-performance-tp8359p8411.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Kyle,
A few questions:
1) Did you use `setIntercept(true)`?
2) How many features?
I'm a little worried about the driver's load, because the final aggregation
and weight updates happen on the driver. Did you check the driver's memory
usage as well?
Best,
Xiangrui
On Fri, Jun 27, 2014 at 8:10 AM,
1) I'm using the static SVMWithSGD.train method, with no options.
2) I have about 20,000 features (~5,000 samples) that are being attached and
trained against 14,000 different sets of labels (i.e., I'll be doing 14,000
different training runs against the same set of features, trying to figure
out which
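A minimal, Spark-free sketch of that setup (all names here are hypothetical; in the real job each `fit` call would be `SVMWithSGD.train`, and the shared features would be an RDD or a broadcast variable rather than a local array): build the feature matrix once, share it read-only across all training tasks, and dispatch one task per label set through a thread pool.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap

// Sketch of "same features, many label sets": the feature matrix is built
// once and shared read-only by every training task; only the labels differ
// per task. `fit` is a stub standing in for SVMWithSGD.train.
object SharedFeaturesSketch {
  def fit(features: Array[Array[Double]], labels: Array[Double]): Double =
    labels.sum / labels.length // stub "model": just the label mean

  def trainMany(features: Array[Array[Double]],
                labelSets: Map[Int, Array[Double]],
                threads: Int = 4): Map[Int, Double] = {
    val pool = Executors.newFixedThreadPool(threads)
    val results = TrieMap.empty[Int, Double] // thread-safe result map
    for ((id, labels) <- labelSets) {
      pool.submit(new Runnable {
        def run(): Unit = results(id) = fit(features, labels)
      })
    }
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
    results.toMap
  }
}
```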
I'm working to set up a calculation that involves calling
MLlib's SVMWithSGD.train several thousand times on different permutations
of the data. I'm trying to run the separate jobs using a thread pool to
dispatch the different requests to a SparkContext connected to a Mesos
cluster, using coarse
I don't have specific solutions for you, but the general things to try are:
- Decrease task size by broadcasting any non-trivial objects.
- Increase duration of tasks by making them less fine-grained.
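The first suggestion is about shrinking what each task ships over the wire. A rough, Spark-free way to see the effect (the names below are illustrative; in Spark the small handle would be the `Broadcast` reference returned by `sc.broadcast`, fetched once per executor instead of once per task): compare the serialized size of a closure-captured matrix against a small handle.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Rough illustration of "decrease task size": measure what Java
// serialization would ship with each task. Capturing a large matrix in
// every task closure produces a large payload; capturing only a small
// handle (which is what broadcasting arranges) does not.
object TaskSizeSketch {
  def serializedSize(obj: java.io.Serializable): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size
  }
}
```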
How many tasks are you sending? I've seen in the past something like 25
seconds for ~10k total