>>> My data set is about 270k vectors, each with about 350 dimensions. If I
>>> set k=500, the job takes about 3 hrs on my cluster. The cluster has 7
>>> executors, each with 8 cores...
>>>
>>> If I set k=5000, which is the required value for my task, the job goes
>>> on forever...
Thanks,
David
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Why-KMeans-with-mllib-is-so-slow-tp20480p22273.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Please check the number of partitions after sc.textFile. Use
sc.textFile("...", 8) to have at least 8 partitions. -Xiangrui
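As a sketch of what that second argument controls (the helper below is hypothetical, not a Spark API): with p partitions, each task processes roughly n/p records, and with 7 executors x 8 cores = 56 task slots, fewer than 56 partitions leaves cores idle during each k-means stage.

```scala
object PartitionSketch {
  // Hypothetical helper: max records per partition when n records are
  // split into p roughly equal partitions (Spark's minPartitions argument
  // is a lower bound on p, not an exact count).
  def recordsPerPartition(n: Long, p: Int): Long = (n + p - 1) / p

  def main(args: Array[String]): Unit = {
    val n = 270000L
    println(recordsPerPartition(n, 8))   // 33750 records per task
    println(recordsPerPartition(n, 56))  // 4822 records per task
  }
}
```

With 56 (or more) partitions every core gets work in each iteration, instead of a handful of large tasks serializing the job.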
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai dbt...@dbtsai.com wrote:
> You just need to use the latest master code without any configuration
> to get performance improvement from my PR.
I've tried some additional experiments with k-means and I finally got it
working as I expected. In fact, the number of partitions is critical. I had
a data set of 24x784 with 12 partitions. In this case the k-means algorithm
took a very long time (on the order of hours) to converge. When I changed
the ...
You just need to use the latest master code without any configuration
to get performance improvement from my PR.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Dec 8, 2014 at 7:53
Hi all,
I'm trying to run clustering with the k-means algorithm. The size of my
data set is about 240k vectors of dimension 384.
Solving the problem with the k-means available in Julia (kmeans++)
http://clusteringjl.readthedocs.org/en/latest/kmeans.html
takes about 8 minutes on a single core.
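For intuition about what both the Julia and MLlib jobs are iterating, here is the core Lloyd step in plain in-memory Scala (a minimal sketch, not MLlib's implementation): assign each point to its nearest center, then recompute each center as the mean of its assigned points.

```scala
object TinyKMeans {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors.
  def dist2(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One Lloyd iteration: group points by nearest center index,
  // then return the mean of each group as the new centers.
  def step(points: Seq[Vec], centers: Seq[Vec]): Seq[Vec] =
    points.groupBy(p => centers.indices.minBy(i => dist2(p, centers(i))))
          .toSeq.sortBy(_._1)
          .map { case (_, ps) =>
            val d = ps.head.length
            Array.tabulate(d)(j => ps.map(_(j)).sum / ps.length)
          }

  def main(args: Array[String]): Unit = {
    val pts = Seq(Array(0.0), Array(1.0), Array(9.0), Array(10.0))
    val out = step(pts, Seq(Array(0.0), Array(10.0)))
    println(out.map(_.mkString(",")).mkString(" | "))  // 0.5 | 9.5
  }
}
```

Each iteration touches every point-center pair, which is why both k and the per-iteration parallelism (partitions, in Spark's case) dominate the runtime.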
Spark has much more overhead, since it's set up to distribute the
computation. Julia isn't distributed, and so has no such overhead in a
completely in-core implementation. You generally use Spark when you
have a problem large enough to warrant distributing, or when your data
already lives in a distributed store.
Could you post your script to reproduce the results (and also how you
generate the dataset)? That will help us investigate it.
On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
Hmm, here I use Spark in local mode on my laptop with 8 cores. The data is
on my local disk.
The code is really simple:
object TestKMeans {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Test KMeans")
      .setMaster("local[8]")
      .set("spark.executor.memory", "8g")
    val sc = new SparkContext(conf)
    val numClusters = 500
    ...
Also, are you using the latest master in this experiment? A PR merged
into master a couple of days ago speeds up k-means about three times.
See
https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1
Sincerely,
DB Tsai