I just wanted to clarify: when I said you hit your "maximum level of
parallelism", I meant that the default number of partitions might not be
large enough to take advantage of more hardware, not that there is no way
to increase your parallelism. The documentation I linked gives a few
suggestions on how to increase the number of partitions; a quick sketch is
below.
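
For example, here is a minimal sketch of two common ways to do it in Scala
(the HDFS path and the partition count of 48 are placeholders, not
recommendations; a common rule of thumb is 2-3 tasks per CPU core in the
cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("svm-training")
      // Raise the default parallelism Spark uses for shuffled RDDs.
      .set("spark.default.parallelism", "48")
    val sc = new SparkContext(conf)

    // Ask for more partitions up front when loading the data...
    val data = sc.textFile("hdfs:///path/to/training/data", minPartitions = 48)

    // ...or repartition an existing RDD (note: this triggers a shuffle).
    val repartitioned = data.repartition(48)
    println(s"partitions: ${repartitioned.partitions.size}")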

-Will

On Mon, Jun 15, 2015 at 5:00 PM, William Briggs <wrbri...@gmail.com> wrote:

> There are a lot of variables to consider. I'm not an expert on Spark, and
> my ML knowledge is rudimentary at best, but here are some questions whose
> answers might help us to help you:
>
>    - What type of Spark cluster are you running (e.g., Stand-alone,
>    Mesos, YARN)?
>    - What does the Spark web UI tell you in terms of the number of stages /
>    tasks, the number of executors, and the task execution time / memory used /
>    amount of data shuffled over the network?
>
> As I said, I'm not all that familiar with the ML side of Spark, but in
> general, if I were adding more resources and not seeing an improvement,
> here are a few things I would consider:
>
>    1. Is your data set partitioned to allow the parallelism you are
>    seeking? Spark's parallelism comes from processing RDD partitions in
>    parallel, not processing individual RDD items in parallel; if you don't
>    have enough partitions to take advantage of the extra hardware, you will
>    see no benefit from adding capacity to your cluster.
>    2. Do you have enough Spark executors to process your partitions in
>    parallel? This depends on your configuration and on your cluster type
>    (doubtful this is an issue here, since you are adding more executors and
>    seeing very little benefit).
>    3. Are your partitions small enough (and/or your executor memory
>    configuration large enough) so that each partition fits into the memory of
>    an executor? If not, you will be constantly spilling to disk, which will
>    have a severe impact on performance.
>    4. Are you shuffling over the network? If so, how frequently and how
>    much? Are you using efficient serialization (e.g., Kryo) and registering
>    your serialized classes in order to minimize shuffle overhead? (A short
>    config sketch covering points 3 and 4 follows this list.)
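>
> For what it's worth, here is a rough config sketch covering points 3 and
> 4 (the memory size is a placeholder, and MyFeatureVector / MyLabel stand
> in for whatever classes you actually shuffle):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.serializer.KryoSerializer
>
>     val conf = new SparkConf()
>       .setAppName("svm-training")
>       // Point 3: give each executor enough heap for its partitions.
>       .set("spark.executor.memory", "4g")
>       // Point 4: Kryo with registered classes serializes records
>       // compactly instead of writing out full class names.
>       .set("spark.serializer", classOf[KryoSerializer].getName)
>       // Placeholder classes: register the ones your job really shuffles.
>       .registerKryoClasses(Array(classOf[MyFeatureVector], classOf[MyLabel]))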
>
> There are plenty more variables, and some very good performance tuning
> documentation <https://spark.apache.org/docs/latest/tuning.html> is
> available. Without any more information to go on, my best guess would be
> that you hit your maximum level of parallelism with the addition of the
> second node (and even that was not fully utilized), and thus you see no
> difference when adding a third node.
>
> Regards,
> Will
>
>
> On Mon, Jun 15, 2015 at 1:29 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
>>  I am trying to measure how Spark standalone cluster performance scales
>> out with multiple machines. I ran a test training an SVM model, which is
>> heavy on in-memory computation. I measured the run time for a Spark
>> standalone cluster of 1–3 nodes; the results are as follows:
>>
>>
>>
>> 1 node: 35 minutes
>>
>> 2 nodes: 30.1 minutes
>>
>> 3 nodes: 30.8 minutes
>>
>>
>>
>> So the speed does not seem to increase much with more machines. I know
>> there is overhead in coordinating tasks among different machines, but it
>> seems to me the overhead is over 30% of the total run time.
>>
>>
>>
>> Is this typical? Does anybody see a significant performance increase with
>> more machines? Is there anything I can tune in my Spark cluster to make it
>> scale out with more machines?
>>
>>
>>
>> Thanks
>>
>> Ningjun
>>
>>
>>
>
>
