Hi, I'm benchmarking Spark 1.6 MLlib TF-IDF (reading from HDFS) on a 20GB dataset, and I'm not seeing much scale-up when I increase cores/executors/RAM following the Spark tuning documentation. I suspect I'm missing a trick in my configuration.
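For concreteness, the job is essentially the standard MLlib TF-IDF pipeline. A minimal sketch of what I'm timing is below (the HDFS path is illustrative, not my real one, and minPartitions is the partition count I sweep):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object TfIdfBench {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tfidf-bench"))

    // Corpus path is illustrative; minPartitions is the knob I sweep (2/4/6/8).
    val docs: RDD[Seq[String]] =
      sc.textFile("hdfs:///benchmarks/corpus", minPartitions = 4)
        .map(_.split(" ").toSeq)

    // Term frequencies, then inverse document frequencies.
    val tf: RDD[Vector] = new HashingTF().transform(docs)
    tf.persist(StorageLevel.MEMORY_ONLY) // storage level is one of the knobs I vary

    val idf = new IDF().fit(tf)          // first pass over tf
    val tfidf: RDD[Vector] = idf.transform(tf)

    tfidf.count() // force full evaluation so the timing covers the whole job
    sc.stop()
  }
}
```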
I'm running on a single shared-memory machine (96 cores, 256GB RAM) and testing combinations of:

- number of executors (1, 2, 4, 8)
- cores per executor (1, 2, 4, 8, 12, 24)
- memory per executor (calculated per the Cloudera recommendations)

all within the machine's combined resource limits. I'm also setting the RDD partition count to 2, 4, 6, or 8 (I see the best results at 4 partitions, about 5% better than the worst case). I have also varied or switched the following settings:

- the Kryo serializer
- driver memory
- compression settings
- dynamic allocation
- different storage levels for persisting RDDs

As we increase the cores in the best of these configurations, we still see a running time of 19-20 minutes. Is there anything else I should be configuring to get better scale-up? Are there any documented TF-IDF benchmark results I could compare against to validate, even very approximate or indirect ones?

Any advice would be much appreciated.

Thanks,
Karen
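P.S. In case it helps, here is roughly how the settings above map to my configuration. This is a sketch with illustrative values from one grid point (4 executors x 8 cores), not my exact submit script; the real runs sweep the executor/core/memory grid described above:

```scala
import org.apache.spark.SparkConf

// Illustrative values from one grid point; the real runs sweep the grid.
val conf = new SparkConf()
  .setAppName("tfidf-bench")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "8")
  .set("spark.executor.memory", "24g") // sized per the Cloudera guidance
  // Only effective if set before the driver JVM starts (e.g. via spark-submit).
  .set("spark.driver.memory", "8g")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.dynamicAllocation.enabled", "true") // toggled on/off per run
```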