I've tried some additional experiments with kmeans and finally got it
working as I expected. In fact, the number of partitions is critical. I had a
data set of 240000x784 with 12 partitions, and in that case the kmeans
algorithm took a very long time (on the order of hours) to converge. When I
changed the number of partitions to 32, the same kmeans (runs = 10, k = 10,
iterations = 300, init = kmeans||) converged in 4 min with 8 cores!
As a comparison, the same problem solved with python scikit-learn takes 21
min on a single core. So spark wins :)

In conclusion, setting the number of partitions correctly is essential. Is
there a rule of thumb for that?
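
For reference, this is roughly how I set the partitioning now (a minimal
sketch against the RDD API, with the same imports as the script further down
the thread; 32 is just the value that worked for me, not a general rule):

    // Request at least 32 partitions when loading the CSV
    // (the second argument to textFile is a minimum, so you may get more).
    val data = sc.textFile("sample.csv", 32)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // Or repartition an already-loaded RDD (this triggers a shuffle):
    val repartitioned = data.repartition(32)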

On Mon, Dec 15, 2014 at 8:55 PM, Xiangrui Meng <men...@gmail.com> wrote:
>
> Please check the number of partitions after sc.textFile. Use
> sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
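>
> A quick way to check what you actually got (a sketch with the RDD API):
>
>     val data = sc.textFile("sample.csv", 8)
>     println(data.partitions.size)   // should be at least 8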
>
> On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> > You just need to use the latest master code without any configuration
> > to get the performance improvement from my PR.
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> >> After some investigation, I learned that I can't compare the kmeans in mllib
> >> with another kmeans implementation directly. The kmeans|| initialization
> >> step takes more time than the algorithm implemented in julia, for example.
> >> There is also the ability to run multiple runs of the kmeans algorithm in
> >> mllib, even though by default the number of runs is 1.
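> >>
> >> For a more direct comparison, the initialization mode and the number of
> >> runs can be set explicitly on the KMeans object (a sketch against the
> >> MLlib RDD-based API, reusing the k and iteration values from my script
> >> further down the thread):
> >>
> >>     import org.apache.spark.mllib.clustering.KMeans
> >>
> >>     val model = new KMeans()
> >>       .setK(500)                            // numClusters from my script
> >>       .setMaxIterations(2)                  // numIterations from my script
> >>       .setInitializationMode(KMeans.RANDOM) // skip the costlier kmeans|| init
> >>       .setRuns(1)                           // a single run (the default)
> >>       .run(data)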
> >>
> >> DB Tsai, can you please tell me the configuration you used for the
> >> improvement you mention in your pull request? I'd like to run the same
> >> benchmark on mnist8m on my computer.
> >>
> >>
> >> Cheers,
> >>
> >>
> >>
> >> On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
> >>>
> >>> Also, are you using the latest master in this experiment? A PR merged
> >>> into master a couple of days ago will speed up k-means three times. See
> >>>
> >>> https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> -------------------------------------------------------
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
> >>>
> >>> On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> >>> wrote:
> >>> > The code is really simple:
> >>> >
> >>> > import org.apache.spark.{SparkConf, SparkContext}
> >>> > import org.apache.spark.mllib.clustering.KMeans
> >>> > import org.apache.spark.mllib.linalg.Vectors
> >>> >
> >>> > object TestKMeans {
> >>> >
> >>> >   def main(args: Array[String]) {
> >>> >
> >>> >     val conf = new SparkConf()
> >>> >       .setAppName("Test KMeans")
> >>> >       .setMaster("local[8]")
> >>> >       .set("spark.executor.memory", "8g")
> >>> >
> >>> >     val sc = new SparkContext(conf)
> >>> >
> >>> >     val numClusters = 500
> >>> >     val numIterations = 2
> >>> >
> >>> >     // Parse the comma-delimited CSV into dense vectors and cache it,
> >>> >     // since k-means makes several passes over the data.
> >>> >     val data = sc.textFile("sample.csv")
> >>> >       .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
> >>> >     data.cache()
> >>> >
> >>> >     val clusters = KMeans.train(data, numClusters, numIterations)
> >>> >
> >>> >     println(clusters.clusterCenters.size)
> >>> >
> >>> >     val wssse = clusters.computeCost(data)
> >>> >     println(s"error : $wssse")
> >>> >   }
> >>> > }
> >>> >
> >>> >
> >>> > For testing purposes, I generated sample random data with julia and
> >>> > stored it in a csv file delimited by commas. The dimensions are
> >>> > 248000 x 384.
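> >>> >
> >>> > An equivalent sketch of the generation step in scala would be something
> >>> > like this (the real data was produced in julia; uniform random values
> >>> > are an assumption):
> >>> >
> >>> >     import java.io.PrintWriter
> >>> >     import scala.util.Random
> >>> >
> >>> >     // Write 248000 rows of 384 comma-separated random doubles.
> >>> >     val out = new PrintWriter("sample.csv")
> >>> >     for (_ <- 0 until 248000) {
> >>> >       out.println(Array.fill(384)(Random.nextDouble).mkString(","))
> >>> >     }
> >>> >     out.close()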
> >>> >
> >>> > In the target application, I will have more than 248k data points to
> >>> > cluster.
> >>> >
> >>> >
> >>> > On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com>
> >>> > wrote:
> >>> >>
> >>> >> Could you post your script to reproduce the results (also how to
> >>> >> generate the dataset)? That will help us to investigate it.
> >>> >>
> >>> >> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com>
> >>> >> wrote:
> >>> >> > Hmm, here I use spark in local mode on my laptop with 8 cores. The
> >>> >> > data is on my local filesystem. Even though there is an overhead due
> >>> >> > to the distributed computation, I find the difference between the
> >>> >> > runtimes of the two implementations really, really huge. Is there a
> >>> >> > benchmark on how well the algorithm implemented in mllib performs?
> >>> >> >
> >>> >> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>> >> >>
> >>> >> >> Spark has much more overhead, since it's set up to distribute the
> >>> >> >> computation. Julia isn't distributed, and so has no such overhead in
> >>> >> >> a completely in-core implementation. You generally use Spark when you
> >>> >> >> have a problem large enough to warrant distributing, or when your data
> >>> >> >> already lives in a distributed store like HDFS.
> >>> >> >>
> >>> >> >> But it's also possible you're not configuring the implementations the
> >>> >> >> same way, yes. There's not enough info here really to say.
> >>> >> >>
> >>> >> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa
> >>> >> >> <jaon...@gmail.com>
> >>> >> >> wrote:
> >>> >> >> > Hi all,
> >>> >> >> >
> >>> >> >> > I'm trying to run clustering with the kmeans algorithm. The size
> >>> >> >> > of my data set is about 240k vectors of dimension 384.
> >>> >> >> >
> >>> >> >> > Solving the problem with the kmeans available in julia (kmeans++)
> >>> >> >> >
> >>> >> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
> >>> >> >> >
> >>> >> >> > takes about 8 minutes on a single core.
> >>> >> >> >
> >>> >> >> > Solving the same problem with spark kmeans|| takes more than 1.5
> >>> >> >> > hours with 8 cores!
> >>> >> >> >
> >>> >> >> > Either they don't implement the same algorithm, or I don't
> >>> >> >> > understand how the kmeans in spark works. Is my data not big enough
> >>> >> >> > to take full advantage of spark? At the very least, I expected the
> >>> >> >> > same runtime.
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > Cheers,
> >>> >> >> >
> >>> >> >> >
> >>> >> >> > Jao
> >>> >> >
> >>> >> >
> >>> >
> >>> >
> >>
> >>
> >
>
