Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui
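Xiangrui's tip can be checked directly: the second argument to `sc.textFile` is `minPartitions`, and `rdd.partitions.length` reports the actual partition count. A minimal sketch, not from the original messages, assuming the same `local[8]` setup and `sample.csv` file used later in this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PartitionCheck").setMaster("local[8]"))

    // Default: the partition count follows the input splits, which can
    // be as low as 1-2 for a small local file, leaving most cores idle.
    val data = sc.textFile("sample.csv")
    println(s"default partitions: ${data.partitions.length}")

    // Passing minPartitions = 8 gives at least 8 partitions, so all
    // 8 local cores can share the per-iteration k-means work.
    val data8 = sc.textFile("sample.csv", 8)
    println(s"with minPartitions = 8: ${data8.partitions.length}")

    sc.stop()
  }
}
```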
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai <dbt...@dbtsai.com> wrote:
> You just need to use the latest master code, without any configuration,
> to get the performance improvement from my PR.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, Dec 8, 2014 at 7:53 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>> After some investigation, I learned that I can't compare the k-means in
>> MLlib with another k-means implementation directly. The kmeans||
>> initialization step takes more time than the algorithm implemented in
>> Julia, for example. MLlib can also perform multiple runs of the k-means
>> algorithm, although by default the number of runs is 1.
>>
>> DB Tsai, can you please tell me the configuration you used for the
>> improvement you mention in your pull request? I'd like to run the same
>> benchmark on mnist8m on my computer.
>>
>> Cheers,
>>
>> On Fri, Dec 5, 2014 at 10:34 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>> Also, are you using the latest master in this experiment? A PR merged
>>> into master a couple of days ago will speed up the k-means three times.
>>> See
>>>
>>> https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> > The code is really simple (imports added for completeness):
>>> >
>>> > import org.apache.spark.{SparkConf, SparkContext}
>>> > import org.apache.spark.mllib.clustering.KMeans
>>> > import org.apache.spark.mllib.linalg.Vectors
>>> >
>>> > object TestKMeans {
>>> >
>>> >   def main(args: Array[String]) {
>>> >
>>> >     val conf = new SparkConf()
>>> >       .setAppName("Test KMeans")
>>> >       .setMaster("local[8]")
>>> >       .set("spark.executor.memory", "8g")
>>> >
>>> >     val sc = new SparkContext(conf)
>>> >
>>> >     val numClusters = 500
>>> >     val numIterations = 2
>>> >
>>> >     val data = sc.textFile("sample.csv").map(x =>
>>> >       Vectors.dense(x.split(',').map(_.toDouble)))
>>> >     data.cache()
>>> >
>>> >     val clusters = KMeans.train(data, numClusters, numIterations)
>>> >
>>> >     println(clusters.clusterCenters.size)
>>> >
>>> >     val wssse = clusters.computeCost(data)
>>> >     println(s"error : $wssse")
>>> >   }
>>> > }
>>> >
>>> > For testing purposes, I generated random sample data with Julia and
>>> > stored it in a comma-delimited CSV file. The dimensions are
>>> > 248000 x 384.
>>> >
>>> > In the target application, I will have more than 248k data points to
>>> > cluster.
>>> >
>>> > On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
>>> >>
>>> >> Could you post your script to reproduce the results (and also how
>>> >> you generate the dataset)? That will help us investigate it.
>>> >>
>>> >> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> >> > Hmm, here I use Spark in local mode on my laptop with 8 cores.
>>> >> > The data is on my local filesystem.
>>> >> > Even though there is an overhead due to the distributed
>>> >> > computation, I found the difference between the runtimes of the
>>> >> > two implementations really, really huge. Is there a benchmark of
>>> >> > how well the algorithm implemented in MLlib performs?
>>> >> >
>>> >> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>> >> >>
>>> >> >> Spark has much more overhead, since it's set up to distribute the
>>> >> >> computation. Julia isn't distributed, and so has no such overhead
>>> >> >> in a completely in-core implementation. You generally use Spark
>>> >> >> when you have a problem large enough to warrant distributing, or
>>> >> >> when your data already lives in a distributed store like HDFS.
>>> >> >>
>>> >> >> But it's also possible you're not configuring the implementations
>>> >> >> the same way, yes. There's not enough info here really to say.
>>> >> >>
>>> >> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > I'm trying to run clustering with the k-means algorithm. My
>>> >> >> > data set is about 240k vectors of dimension 384.
>>> >> >> >
>>> >> >> > Solving the problem with the k-means available in Julia
>>> >> >> > (kmeans++),
>>> >> >> >
>>> >> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>>> >> >> >
>>> >> >> > takes about 8 minutes on a single core.
>>> >> >> >
>>> >> >> > Solving the same problem with Spark's kmeans|| takes more than
>>> >> >> > 1.5 hours with 8 cores!
>>> >> >> >
>>> >> >> > Either they don't implement the same algorithm, or I don't
>>> >> >> > understand how the k-means in Spark works. Is my data not big
>>> >> >> > enough to take full advantage of Spark? At least, I expected
>>> >> >> > the same runtime.
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> >
>>> >> >> > Jao

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
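As a footnote to the thread: Jaonary's finding that the kmeans|| initialization step dominates the runtime can be tested by building the model through the `KMeans` class instead of `KMeans.train`, since the builder exposes the initialization mode and the number of runs that the thread discusses. A sketch against the MLlib 1.x API, not from the original messages (the k, iteration count, and file name mirror the code above; whether `KMeans.RANDOM` makes the benchmark truly comparable to Julia's kmeans++ is an assumption to verify):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansInitBenchmark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("KMeansInitBenchmark").setMaster("local[8]"))

    val data = sc.textFile("sample.csv", 8)
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // KMeans.train defaults to kmeans|| initialization; constructing the
    // model explicitly lets us switch to random initialization and skip
    // the expensive kmeans|| passes that the thread identifies as the
    // bottleneck, isolating the cost of the Lloyd iterations themselves.
    val model = new KMeans()
      .setK(500)
      .setMaxIterations(2)
      .setRuns(1)                              // the default, made explicit
      .setInitializationMode(KMeans.RANDOM)    // vs. KMeans.K_MEANS_PARALLEL
      .run(data)

    println(s"cost with random init: ${model.computeCost(data)}")
    sc.stop()
  }
}
```

Timing the two initialization modes separately on the same cached RDD would show how much of the 1.5 hours is initialization rather than iteration.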