Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread jalafate
Hi David,

Disk IO is a great point, and it is actually one of the reasons my program is
slow. However, I found that the major cause of my program running slow is the
huge garbage collection time: I created too many small objects in the map
procedure, which triggered GC frequently. After I improved my program to
create fewer objects, the performance is much better.
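
For reference, the change was roughly of this shape. This is a simplified
sketch rather than my actual code, and it assumes the RDD (here called
"sentences") already holds tokenized sentences: instead of emitting one small
tuple per pair occurrence inside map, each partition aggregates its counts in
a single mutable map, so far fewer short-lived objects are created for the GC
to clean up.

    import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // Rough sketch (not my actual code): aggregate pair counts per partition
    // in one mutable map instead of emitting a tiny tuple object for every
    // single pair occurrence.
    def countPairs(sentences: RDD[Array[String]]): RDD[((String, String), Long)] =
      sentences.mapPartitions { iter =>
        val counts = mutable.HashMap.empty[(String, String), Long]
        for (words <- iter; i <- words.indices; j <- words.indices if i != j) {
          val key = (words(i), words(j))
          counts(key) = counts.getOrElse(key, 0L) + 1L
        }
        counts.iterator
      }.reduceByKey(_ + _)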

Here are two videos that may help other people who are also struggling to
find the bottleneck of their Spark applications.

1. A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)
http://youtu.be/dmL0N3qfSc8

2. Spark Summit 2014 - Advanced Spark Training - Advanced Spark Internals
and Tuning
http://youtu.be/HG2Yd-3r4-M

I personally learned a lot from the points mentioned in the two videos
above.

In practice, I monitor CPU user time, CPU idle time (if disk IO is the
bottleneck, CPU idle time should be significant), memory usage, network IO,
and garbage collection time per task (which can be found on the Spark web UI).
Ganglia is helpful for monitoring CPU, memory and network IO.
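
If the web UI numbers are not convenient to collect, one option is to log the
same per-task metrics from a listener on the driver. This is only a rough
sketch against the standard SparkListener API, not something taken from my
application:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Rough sketch: print per-task run time and GC time as tasks finish.
    // These are the same metrics the web UI shows in the task table.
    class GcTimeListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        Option(taskEnd.taskMetrics).foreach { m =>
          println(s"stage ${taskEnd.stageId} task ${taskEnd.taskInfo.taskId}: " +
            s"run ${m.executorRunTime} ms, GC ${m.jvmGCTime} ms")
        }
      }
    }

    // Register on the driver's SparkContext before running jobs:
    // sc.addSparkListener(new GcTimeListener)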

Best,
Julaiti




Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread Julaiti Alafate
Hi Mitch,

I think it is normal. Network utilization will be high while some shuffling is
happening. After that, it should come down, while each slave node does the
computation on the partitions assigned to it. At least that is my
understanding.
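
As a rough illustration (not code from the original program) of why the
network is busy only around the shuffle: reduceByKey pre-aggregates within
each partition, so only partial sums cross the network during the shuffle, and
afterwards each node works on its own partitions, which matches a spike
followed by near-zero utilization. Here "pairs" stands in for an assumed
RDD[((String, String), Long)] of pair occurrences:

    // Map-side combine keeps the shuffle light: one partial count per key
    // per partition crosses the network.
    val byReduce = pairs.reduceByKey(_ + _)
    // groupByKey ships every individual occurrence before summing locally,
    // so the shuffle (and the network spike) is much heavier.
    val byGroup  = pairs.groupByKey().mapValues(_.sum)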

Best,
Julaiti


On Tue, Mar 3, 2015 at 2:32 AM, Mitch Gusat  wrote:

> Hi Julaiti,
>
> Have you made progress in discovering the bottleneck below?
>
> While I suspect a configuration setting or program bug, I am intrigued by
> "network utilization is high for several seconds at the beginning, then drop
> close to 0"... Do you know more?
>
> thanks,
> Mitch Gusat (IBM research)
>


Re: Identify the performance bottleneck from hardware prospective

2015-03-05 Thread davidkl
Hello Julaiti,

Maybe I am just asking the obvious :-) but did you check disk IO? Depending
on what you are doing that could be the bottleneck.

In my case none of the HW resources was a bottleneck, but I was using some
distributed features that were blocking execution (e.g. Hazelcast). Could
that be your case as well?

Regards






Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
It would be good if you can share the piece of code that you are using, so
people can suggest how to optimize it further. Also, since you have 20GB of
memory and ~30GB of data, you can try rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
or .persist(StorageLevel.MEMORY_AND_DISK_2). By default only ~12GB of the 20GB
will be usable for storage; you can increase that by setting
spark.storage.memoryFraction.
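
A minimal sketch of that suggestion, where "sentences" stands in for whatever
RDD is being reused:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK_SER keeps partitions serialized in memory (smaller
    // footprint, less GC) and spills whatever does not fit to local disk
    // instead of recomputing it.
    sentences.persist(StorageLevel.MEMORY_AND_DISK_SER)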

Thanks
Best Regards



Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
The raw data is ~30 GB. It consists of 250 million sentences. The total
length of the documents (i.e. the sum of the lengths of all sentences) is 11
billion. I also ran a simple computation to roughly bound the maximum number
of word pairs by summing d * (d - 1) over all sentences, where d is the
length of the sentence. It is about 63 billion.
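
For reference, the estimate was computed along these lines (a sketch;
"sentences" is assumed to be an RDD of tokenized sentences):

    // Sum d * (d - 1) over all sentences, i.e. the number of ordered pairs
    // of distinct word positions per sentence.
    val maxPairs = sentences
      .map { words => val d = words.length.toLong; d * (d - 1) }
      .reduce(_ + _)
    println(s"upper bound on word-pair occurrences: $maxPairs")   // ~63 billion here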

Thanks,
Julaiti




Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Arush Kharbanda
Hi

How big is your dataset?

Thanks
Arush



-- 


*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Julaiti Alafate
Thank you very much for your reply!

My task is to count the number of word pairs in a document. If w1 and w2
occur together in one sentence, the count of the word pair (w1, w2) increases
by 1. So the computational part of this algorithm is simply a two-level
for-loop.
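
A minimal sketch of that computation (my real code differs, and "sentences"
is assumed to be an RDD of tokenized sentences):

    import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

    // The two-level for-loop: every ordered pair of distinct words in a
    // sentence contributes a count of 1.
    val pairCounts = sentences.flatMap { words =>
      for (i <- words.indices; j <- words.indices if i != j)
        yield ((words(i), words(j)), 1L)
    }.reduceByKey(_ + _)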

Since the cluster is monitored by Ganglia, I can easily see that neither CPU
nor network IO is under pressure. The only resource left is memory. In the
"Executors" tab of the Spark web UI, I can see a column named "Memory Used".
It shows that only 6GB of the 20GB memory is used. I understand this measures
the size of the RDDs persisted in memory. So can I at least assume that the
data/objects used in my program do not exceed the memory limit?

My confusion is: why can't my program run faster while there is still spare
memory, CPU time and network bandwidth it could utilize?

Best regards,
Julaiti




Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Akhil Das
What application are you running? Here are a few things:

- You will hit a bottleneck on CPU if you are doing some complex computation
(like parsing JSON, etc.).
- You will hit a bottleneck on memory if the data/objects used in the program
are large (like playing with HashMaps etc. inside your map* operations). Here
you can set spark.executor.memory to a higher number and also change
spark.storage.memoryFraction, whose default value is 0.6 of your executor
memory (see the sketch after this list).
- Network will be a bottleneck if data is not available locally on one of the
workers and hence has to be collected from others, which means a lot of
serialization and data transfer across your cluster.
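
Put together, a rough sketch of those two memory settings (the application
name and the values are only examples, not recommendations):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("word-pair-count")                 // placeholder app name
      .set("spark.executor.memory", "20g")           // heap per executor
      .set("spark.storage.memoryFraction", "0.6")    // share of heap for cached RDDs (default 0.6)
    val sc = new SparkContext(conf)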

Thanks
Best Regards







Identify the performance bottleneck from hardware prospective

2015-02-16 Thread Julaiti Alafate
Hi there,

I am trying to scale up the data size that my application is handling. This
application is running on a cluster with 16 slave nodes. Each slave node has
60GB of memory. It is running in standalone mode. The data comes from HDFS,
which is also on the same local network.

In order to understand how my program is running, I also have Ganglia
installed on the cluster. From previous runs, I know the stage that takes the
longest time to run is counting word pairs (my RDD consists of sentences from
a corpus). My goal is to identify the bottleneck of my application, then
modify my program or hardware configuration accordingly.

Unfortunately, I didn't find much information on Spark monitoring and
optimization topics. Reynold Xin gave a great talk at Spark Summit 2014 on
application tuning from the task perspective. Basically, his focus is on tasks
that are oddly slower than the average. However, it didn't solve my problem,
because in my case there are no such tasks that run much slower than the
others.

So I tried to identify the bottleneck from a hardware perspective. I want to
know what the limiting resource of the cluster is. I think that if the
executors are working hard, either CPU, memory or network bandwidth (or some
combination of them) should be hitting the roof. But Ganglia reports that the
CPU utilization of the cluster is no more than 50%, and network utilization
is high for several seconds at the beginning, then drops close to 0. From the
Spark UI, I can see that the node with maximum memory usage is consuming
around 6GB, while "spark.executor.memory" is set to 20GB.

I am very confused that the program is not running faster while no hardware
resource is in short supply. Could you please give me some hints about what
determines the performance of a Spark application from a hardware
perspective?

Thanks!

Julaiti