I just wanted to clarify: when I said you hit your "maximum level of parallelism", I meant that the default number of partitions might not be large enough to take advantage of more hardware, not that there is no way to increase your parallelism. The documentation I linked gives a few suggestions on how to increase the number of partitions.
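To make the partition-count ceiling concrete, here is a toy illustration in plain Python (not Spark itself; the chunk splits and the one-second sleep are arbitrary stand-ins for real per-partition work): with only two partitions, a third worker has nothing to do, so wall time does not improve until the data is split more finely.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    time.sleep(1.0)  # stand-in for ~1 second of real work per partition
    return len(partition)

data = list(range(100))

def run(num_partitions, num_workers):
    """Process `data` split into num_partitions chunks using num_workers workers."""
    chunks = [data[i::num_partitions] for i in range(num_partitions)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(process_partition, chunks))
    return time.time() - start

# With 2 partitions, 2 workers and 3 workers both take ~1s: the extra
# worker sits idle. With 6 partitions, 3 workers finish in ~2s instead
# of the ~3s that 2 workers would need.
```

The same logic applies to executors and RDD partitions: extra cores only help once there are at least that many partitions to schedule onto them.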
-Will

On Mon, Jun 15, 2015 at 5:00 PM, William Briggs <wrbri...@gmail.com> wrote:

> There are a lot of variables to consider. I'm not an expert on Spark, and
> my ML knowledge is rudimentary at best, but here are some questions whose
> answers might help us to help you:
>
>    - What type of Spark cluster are you running (e.g., standalone, Mesos,
>    YARN)?
>    - What does the HTTP UI tell you in terms of the number of stages and
>    tasks, the number of executors, and the task execution time, memory
>    used, and amount of data shuffled over the network?
>
> As I said, I'm not all that familiar with the ML side of Spark, but in
> general, if I were adding more resources and not seeing an improvement,
> here are a few things I would consider:
>
>    1. Is your data set partitioned to allow the parallelism you are
>    seeking? Spark's parallelism comes from processing RDD partitions in
>    parallel, not from processing individual RDD items in parallel; if you
>    don't have enough partitions to take advantage of the extra hardware,
>    you will see no benefit from adding capacity to your cluster.
>    2. Do you have enough Spark executors to process your partitions in
>    parallel? This depends on your configuration and on your cluster type
>    (doubtful this is an issue here, since you are adding more executors
>    and seeing very little benefit).
>    3. Are your partitions small enough (and/or your executor memory
>    configuration large enough) that each partition fits into the memory
>    of an executor? If not, you will be constantly spilling to disk, which
>    will have a severe impact on performance.
>    4. Are you shuffling over the network? If so, how frequently and how
>    much? Are you using efficient serialization (e.g., Kryo) and
>    registering your serialized classes in order to minimize shuffle
>    overhead?
>
> There are plenty more variables, and some very good performance tuning
> documentation <https://spark.apache.org/docs/latest/tuning.html> is
> available.
> Without any more information to go on, my best guess would be that you
> hit your maximum level of parallelism with the addition of the second
> node (and even that was not fully utilized), and thus you see no
> difference when adding a third node.
>
> Regards,
> Will
>
>
> On Mon, Jun 15, 2015 at 1:29 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
>> I am trying to measure how Spark standalone cluster performance scales
>> out with multiple machines. I ran a test of training an SVM model, which
>> is heavy in memory computation. I measured the run time for a Spark
>> standalone cluster of 1–3 nodes; the results are as follows:
>>
>> 1 node: 35 minutes
>>
>> 2 nodes: 30.1 minutes
>>
>> 3 nodes: 30.8 minutes
>>
>> So the speed does not seem to increase much with more machines. I know
>> there is overhead for coordinating tasks among different machines, but it
>> seems to me the overhead is over 30% of the total run time.
>>
>> Is this typical? Does anybody see a significant performance increase
>> with more machines? Is there anything I can tune in my Spark cluster to
>> make it scale out with more machines?
>>
>> Thanks,
>> Ningjun
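For what it's worth, the reported timings can be turned into rough speedup and parallel-efficiency figures (a quick back-of-the-envelope check, nothing more):

```python
# Reported wall-clock times (in minutes) from the thread above
times = {1: 35.0, 2: 30.1, 3: 30.8}

for nodes, t in sorted(times.items()):
    speedup = times[1] / t        # relative to the single-node run
    efficiency = speedup / nodes  # fraction of ideal linear scaling
    print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# -> 1 node(s): speedup 1.00x, efficiency 100%
# -> 2 node(s): speedup 1.16x, efficiency 58%
# -> 3 node(s): speedup 1.14x, efficiency 38%
```

Efficiency already below 60% at two nodes, and falling further at three, is consistent with the guess that the job ran out of partitions to parallelize rather than with pure coordination overhead.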