With "MEMORY_ONLY", Spark probably has to recompute the RDD partitions that don't 
fit in memory on each iteration, which makes things run slower.

As a general safe rule, use MEMORY_AND_DISK_SER.
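For example, a minimal sketch of setting that storage level before training MLlib's 
KMeans (the path, parsing, and parameters here are assumptions for illustration, not 
taken from the benchmark code):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// sc is an existing SparkContext; "points.txt" is assumed to hold one
// space-separated feature vector per line.
val data = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  // Serialized partitions stay in memory and the overflow spills to disk,
  // instead of being dropped and recomputed from the lineage each iteration.
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

val model = KMeans.train(data, 10 /* k */, 40 /* maxIterations */)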



Guillaume Pitel - President of eXenSa 

Prashant Sharma <scrapco...@gmail.com> wrote:

>I think Mahout uses FuzzyKMeans, which is a different algorithm and is not 
>iterative. 
>
>
>Prashant Sharma
>
>
>
>On Tue, Mar 25, 2014 at 6:50 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>
>Hi, I'm running a benchmark that compares Mahout and Spark MLlib. So far I have the 
>following results for k-means:
>Iterations   Elements           Mahout time (s)   Spark time (s)
>10           10,000,000         602               138
>40           10,000,000         1,917             330
>70           10,000,000         3,203             388
>10           100,000,000        1,235             2,226
>40           100,000,000        2,755             6,388
>70           100,000,000        4,107             10,967
>10           1,000,000,000      7,070             25,268
>
>The benchmark runs on a YARN cluster with about 40 machines. The elements to cluster 
>are generated randomly. When I changed the persistence level from MEMORY_ONLY to 
>MEMORY_AND_DISK, Spark started to run faster on the larger data sets.
>
>What am I missing?
>
>See my benchmarking code in the attachment.
>
>
>
>-- 
>
>Sincerely yours
>Egor Pakhomov
>Scala Developer, Yandex
>
>
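For reference, a rough sketch of the Spark side of this kind of benchmark (this is 
not the attached code; the element count, dimensionality, partition count, and k 
below are assumptions for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

import scala.util.Random

object KMeansBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-bench"))

    val numElements = 10000000L // assumed size; matches the smallest run above
    val dims        = 10        // assumed dimensionality
    val partitions  = 400       // assumed; roughly 10 per node on ~40 machines

    // Randomly generated points, persisted so each k-means iteration reads
    // cached data instead of regenerating it from the lineage.
    val data = sc.parallelize(0L until numElements, partitions)
      .map(_ => Vectors.dense(Array.fill(dims)(Random.nextDouble())))
      .persist(StorageLevel.MEMORY_AND_DISK)

    data.count() // materialize the cache before timing

    val start = System.currentTimeMillis()
    KMeans.train(data, 10 /* k, assumed */, 10 /* maxIterations */)
    println(s"Spark k-means took ${(System.currentTimeMillis() - start) / 1000} s")

    sc.stop()
  }
}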
