Maybe with "MEMORY_ONLY", Spark has to recompute the RDD several times because the partitions don't fit in memory. That makes things run slower.

As a general safe rule, use MEMORY_AND_DISK_SER.
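In code this is a one-line change at the point where the data set is persisted. A minimal sketch, assuming an existing SparkContext sc and a placeholder input path:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): partitions
// that don't fit are dropped and recomputed from the lineage on every pass.
// MEMORY_AND_DISK_SER stores partitions serialized (more compact in memory)
// and spills the overflow to local disk instead of recomputing it.
val points = sc.textFile("hdfs:///path/to/points")   // placeholder path
points.persist(StorageLevel.MEMORY_AND_DISK_SER)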
Guillaume Pitel - President of eXenSa

Prashant Sharma <scrapco...@gmail.com> wrote:

>I think Mahout uses FuzzyKmeans, which is a different algorithm, and it is
>not iterative.
>
>Prashant Sharma
>
>On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>
>Hi, I'm running a benchmark that compares Mahout and SparkML. So far I have
>the following results for k-means (all times in seconds):
>
>Iterations   Elements        Mahout time   Spark time
>10           10,000,000        602             138
>40           10,000,000      1,917             330
>70           10,000,000      3,203             388
>10           100,000,000     1,235           2,226
>40           100,000,000     2,755           6,388
>70           100,000,000     4,107          10,967
>10           1,000,000,000   7,070          25,268
>
>It runs on a YARN cluster with about 40 machines. The elements to cluster
>are randomly generated. When I changed the persistence level from
>MEMORY_ONLY to MEMORY_AND_DISK, Spark started to run faster on the big
>data sets.
>
>What am I missing?
>
>See my benchmarking code in the attachment.
>
>--
>Sincerely yours,
>Egor Pakhomov
>Scala Developer, Yandex
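For context, a k-means run like the one benchmarked above looks roughly like this in MLlib's Scala API with the persistence level set explicitly. A minimal sketch, assuming Spark 1.x, an existing SparkContext sc, and placeholder values for the path, k, and iteration count (not the benchmark's actual settings):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// Parse one space-separated point per line.
val data = sc.textFile("hdfs:///path/to/random-points")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Persist with spill-to-disk so the parsed points are not recomputed
// from the text file on every k-means iteration once they overflow RAM.
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

// k = 10 clusters, 40 iterations (illustrative placeholders).
val model = KMeans.train(data, 10, 40)
println(s"Cost = ${model.computeCost(data)}")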