Re: Does spark performance really scale out with multiple machines?

2015-06-15 Thread William Briggs
I just wanted to clarify: when I said you hit your "maximum level of
parallelism", I meant that the default number of partitions might not be
large enough to take advantage of more hardware, not that there is no way
to increase your parallelism. The documentation I linked gives a few
suggestions on how to increase the number of partitions; a rough sketch of
what that can look like is below.
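
For example (just a minimal, untested sketch - the input path, partition
counts, and app name below are placeholders, not details from your job):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("svm-training")
      // Default number of partitions for shuffles when none is specified.
      .set("spark.default.parallelism", "48")
    val sc = new SparkContext(conf)

    // Ask for more partitions up front when loading the data...
    val data = sc.textFile("hdfs:///path/to/training/data", minPartitions = 48)

    // ...or repartition an existing RDD (this does incur a shuffle).
    val repartitioned = data.repartition(48)
    println(s"number of partitions: ${repartitioned.partitions.length}")

The tuning docs suggest a few partitions per CPU core in the cluster as a
starting point, so pick the numbers based on your total core count.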

-Will

On Mon, Jun 15, 2015 at 5:00 PM, William Briggs  wrote:

> There are a lot of variables to consider. I'm not an expert on Spark, and
> my ML knowledge is rudimentary at best, but here are some questions whose
> answers might help us to help you:
>
>- What type of Spark cluster are you running (e.g., Stand-alone,
>Mesos, YARN)?
>- What does the HTTP UI tell you in terms of number of stages / tasks,
>number of executors, and task execution time / memory used / amount of data
>shuffled over the network?
>
> As I said, I'm not all that familiar with the ML side of Spark, but in
> general, if I were adding more resources and not seeing an improvement,
> here are a few things I would consider:
>
>1. Is your data set partitioned to allow the parallelism you are
>seeking? Spark's parallelism comes from processing RDD partitions in
>parallel, not processing individual RDD items in parallel; if you don't
>have enough partitions to take advantage of the extra hardware, you will
>see no benefit from adding capacity to your cluster.
>2. Do you have enough Spark executors to process your partitions in
>parallel? This depends on your configuration and on your cluster type
>(doubtful this is an issue here, since you are adding more executors and
>seeing very little benefit).
>3. Are your partitions small enough (and/or your executor memory
>configuration large enough) so that each partition fits into the memory of
>an executor? If not, you will be constantly spilling to disk, which will
>have a severe impact on performance.
>4. Are you shuffling over the network? If so, how frequently and how
>much? Are you using efficient serialization (e.g., Kryo) and registering
>your serialized classes in order to minimize shuffle overhead?
>
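To make points 3 and 4 above a bit more concrete, here is a rough, untested
sketch - the class being registered, the memory size, and the master URL are
placeholders rather than settings taken from your job:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("svm-training")
      // Kryo is usually faster and more compact than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering the classes you shuffle avoids writing full class names
      // into every serialized record.
      .registerKryoClasses(Array(classOf[org.apache.spark.mllib.regression.LabeledPoint]))

and, on the submit side, give each executor enough memory that its partitions
fit without spilling, for example:

    spark-submit --master spark://your-master:7077 \
      --class com.example.TrainSVM \
      --executor-memory 8G \
      --total-executor-cores 24 \
      trainsvm.jar

The right sizes depend entirely on your data and hardware, so treat these
numbers only as placeholders.
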
> There are plenty more variables, and some very good performance tuning
> documentation is
> available. Without any more information to go on, my best guess would be
> that you hit your maximum level of parallelism with the addition of the
> second node (and even that was not fully utilized), and thus you see no
> difference when adding a third node.
>
> Regards,
> Will
>
>
> On Mon, Jun 15, 2015 at 1:29 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.w...@lexisnexis.com> wrote:
>
>>  I am trying to measure how Spark standalone cluster performance scales
>> out with multiple machines. I ran a test training an SVM model, which is
>> heavy on in-memory computation. I measured the run time for a Spark
>> standalone cluster of 1 – 3 nodes; the results are as follows:
>>
>>
>>
>> 1 node: 35 minutes
>>
>> 2 nodes: 30.1 minutes
>>
>> 3 nodes: 30.8 minutes
>>
>>
>>
>> So the speed does not seem to increase much with more machines. I know
>> there is overhead for coordinating tasks among different machines, but it
>> seems to me the overhead is over 30% of the total run time.
>>
>>
>>
>> Is this typical? Does anybody see a significant performance increase with
>> more machines? Is there anything I can tune in my Spark cluster to make it
>> scale out with more machines?
>>
>>
>>
>> Thanks
>>
>> Ningjun
>>
>>
>>
>
>

