There is a simple example here: https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py. You can take advantage of sparsity by computing the distance via inner products: http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2

-Xiangrui
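The inner-product trick mentioned above expands the squared Euclidean distance as ||x - c||^2 = ||x||^2 - 2<x, c> + ||c||^2, so only the nonzero entries of a sparse point x are ever touched, and ||c||^2 can be precomputed once per center. A minimal local sketch with SciPy sparse rows (the function name is illustrative, not a Spark or MLlib API):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_sq_dist(x, center, center_sq_norm):
    """Squared Euclidean distance between a sparse row x and a dense center,
    expanded as ||x||^2 - 2<x, c> + ||c||^2 so only x's nonzeros are touched."""
    x_sq_norm = x.multiply(x).sum()   # sums over nonzero entries only
    dot = x.dot(center)[0]            # sparse-dense inner product
    return x_sq_norm - 2.0 * dot + center_sq_norm

# toy data: 3 sparse points in 5 dimensions
X = csr_matrix(np.array([[1., 0., 0., 2., 0.],
                         [0., 3., 0., 0., 1.],
                         [1., 0., 0., 2., 1.]]))
center = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
c_sq = float(center @ center)         # precompute once per center

for i in range(X.shape[0]):
    print(i, sparse_sq_dist(X.getrow(i), center, c_sq))  # 0.0, 15.0, 1.0
```

For a 4000 x 174000 matrix this is the difference between touching a handful of nonzeros per row and iterating over 174000 coordinates for every distance computation.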
On Tue, Nov 25, 2014 at 2:39 AM, amin mohebbi <aminn_...@yahoo.com.invalid> wrote:
> I have generated a sparse matrix in Python with the size 4000 x 174000
> (.pkl). The following is a small part of this matrix:
>
>   (0, 45)    1
>   (0, 413)   1
>   (0, 445)   1
>   (0, 107)   4
>   (0, 80)    2
>   (0, 352)   1
>   (0, 157)   1
>   (0, 191)   1
>   (0, 315)   1
>   (0, 395)   4
>   (0, 282)   3
>   (0, 184)   1
>   (0, 403)   1
>   (0, 169)   1
>   (0, 267)   1
>   (0, 148)   1
>   (0, 449)   1
>   (0, 241)   1
>   (0, 303)   1
>   (0, 364)   1
>   (0, 257)   1
>   (0, 372)   1
>   (0, 73)    1
>   (0, 64)    1
>   (0, 427)   1
>   :          :
>   (2, 399)   1
>   (2, 277)   1
>   (2, 229)   1
>   (2, 255)   1
>   (2, 409)   1
>   (2, 355)   1
>   (2, 391)   1
>   (2, 28)    1
>   (2, 384)   1
>   (2, 86)    1
>   (2, 285)   2
>   (2, 166)   1
>   (2, 165)   1
>   (2, 419)   1
>   (2, 367)   2
>   (2, 133)   1
>   (2, 61)    1
>   (2, 434)   1
>   (2, 51)    1
>   (2, 423)   1
>   (2, 398)   1
>   (2, 438)   1
>   (2, 389)   1
>   (2, 26)    1
>   (2, 455)   1
>
> I am new to Spark and would like to cluster this matrix with the k-means
> algorithm. Can anyone explain what kinds of problems I might face?
> Please note that I do not want to use MLlib and would like to write my
> own k-means.
>
> Best Regards
>
> Amin Mohebbi
> PhD candidate in Software Engineering
> at University of Malaysia
> Tel: +60 18 2040 017
> E-Mail: tp025...@ex.apiit.edu.my
> amin_...@me.com

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
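For readers following this thread: since the original poster wants to write their own k-means rather than use MLlib, the core Lloyd iteration can first be sketched locally over SciPy sparse rows, then each step mapped over an RDD as in the linked kmeans.py example. The sketch below is illustrative only (function name and structure are assumptions, not an MLlib or Spark API), and it reuses the expanded-distance trick from the reply above:

```python
import numpy as np
from scipy.sparse import csr_matrix

def kmeans_sparse(X, k, n_iter=20, seed=0):
    """Minimal local sketch of Lloyd's k-means over sparse rows.

    X: csr_matrix of shape (n, d). Centers are kept dense because k is
    small; only X stays sparse, which is what matters at 4000 x 174000.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].toarray()
    for _ in range(n_iter):
        # assignment step, using ||x||^2 - 2<x, c> + ||c||^2
        c_sq = (centers ** 2).sum(axis=1)        # (k,) precomputed norms
        dots = X.dot(centers.T)                  # (n, k) sparse-dense product
        x_sq = np.asarray(X.multiply(X).sum(axis=1))   # (n, 1)
        d2 = x_sq - 2.0 * np.asarray(dots) + c_sq
        labels = d2.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.asarray(X[mask].mean(axis=0)).ravel()
    return labels, centers
```

In a Spark version, the assignment step becomes a map over an RDD of sparse vectors (with the small dense `centers` array broadcast to the workers) and the update step a reduceByKey that sums vectors and counts per cluster; the main problems to expect are the cost of densifying 174000-dimensional vectors by accident, and serialization overhead if the pickled matrix is shipped as one object instead of row-wise.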