I spoke with SK offline about this; it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations each
and taking the total time (vs. my example, which trains a single model for
100 iterations). I'm posting my response here, though, because I think it's
worth documenting:
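
For concreteness, the two setups differ roughly like this (a hypothetical
sketch in Scala, with MLlib's LogisticRegressionWithSGD standing in for
whatever model was actually benchmarked):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // What my numbers measured: one model trained for 100 iterations.
    def timeOne(data: RDD[LabeledPoint]): Long = {
      val start = System.nanoTime()
      LogisticRegressionWithSGD.train(data, 100)
      System.nanoTime() - start
    }

    // What SK measured: 100 models at 100 iterations each -- ~100x the work.
    def timeHundred(data: RDD[LabeledPoint]): Long = {
      val start = System.nanoTime()
      for (_ <- 1 to 100) LogisticRegressionWithSGD.train(data, 100)
      System.nanoTime() - start
    }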

Benchmarking a dataset this small on this many cores is probably not going
to give you any meaningful information about how the algorithms scale to
"real" data problems.

In this case, you've thrown 200 cores at 5.6 KB of data: 200
low-dimensional data points. The overheads of scheduling tasks, shipping
them out to each worker, and the network latency between nodes are
essentially fixed regardless of problem size, and they COMPLETELY dominate
the time spent computing - which in the first two cases is 9-10 flops per
data point, and in the last case is a couple of array lookups and adds per
data point.
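
To put rough numbers on it (back-of-envelope, not a measurement): 200
points at ~10 flops each is about 2,000 floating-point operations total,
which a single core finishes in well under a microsecond, while launching
and scheduling a Spark task typically costs on the order of milliseconds -
and you're paying that per task, across 200 cores.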

It would make a lot more sense to find or generate a dataset that's 10 or
100 GB and see how performance scales there. You can do this with the code
I pasted earlier: just change the second, third, and fourth arguments to an
appropriate number of elements, dimensionality, and number of partitions to
match the number of cores you have on your cluster.
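
I don't have that snippet inline here, but a minimal sketch of the idea,
using MLlib's LogisticRegressionDataGenerator (the sizes below are
placeholder guesses - scale them to your cluster):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.LogisticRegressionDataGenerator

    val sc = new SparkContext(new SparkConf().setAppName("GenBenchmarkData"))

    // Placeholder sizes: ~1e8 rows x 100 features x 8 bytes ~= 80 GB.
    val numExamples   = 100000000  // number of elements
    val numFeatures   = 100        // dimensionality
    val numPartitions = 200        // match your cluster's core count
    val eps           = 3.0        // label noise scale used by the generator

    val data = LogisticRegressionDataGenerator.generateLogisticRDD(
      sc, numExamples, numFeatures, eps, numPartitions)
    data.cache()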

In short, don't use a cluster unless you need one :).

Hope this helps!


On Tue, Sep 2, 2014 at 3:51 PM, SK <skrishna...@gmail.com> wrote:

> The dataset is quite small: 5.6 KB. It has 200 rows and 3 features, and 1
> column of labels. From this dataset, I split 80% for the training set and
> 20% for the test set. The features are integer counts and the labels are
> binary (1/0).
>
> thanks
