Thank you very much for your reply!

My task is to count word-pair co-occurrences in a document: whenever w1 and
w2 occur together in one sentence, the count for the pair (w1, w2) is
incremented by 1. So the computational part of this algorithm is simply a
two-level for loop over the words of each sentence.
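
To make this concrete, here is roughly what that stage looks like (a minimal
sketch; `sentences` is a placeholder name for my actual RDD of sentences):

  import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs (Spark 1.x)

  // sentences: RDD[String], one sentence per element (placeholder name)
  val pairCounts = sentences
    .map(_.split("\\s+").distinct)          // tokenize each sentence
    .flatMap { words =>
      // two-level loop: emit every word pair that co-occurs in the sentence
      for {
        i <- words.indices
        j <- (i + 1) until words.length
      } yield ((words(i), words(j)), 1L)
    }
    .reduceByKey(_ + _)                     // count occurrences of each pair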

Since the cluster is monitored by Ganglia, I can easily see that neither
CPU nor network I/O is under pressure. The only resource left is memory. In
the "Executors" tab of the Spark Web UI, there is a column named "Memory
Used", which shows that only 6GB of the 20GB memory is used. I understand
this measures the size of the RDDs persisted in memory. So can I at least
assume that the data/objects used in my program are not exceeding the
memory limit?
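
(As far as I understand, that column only reflects RDDs that are explicitly
cached, along the lines of this minimal sketch, where `sentences` is again a
placeholder for my RDD; objects created inside map/flatMap closures are not
counted there.)

  import org.apache.spark.storage.StorageLevel

  // Only blocks cached like this show up under "Memory Used" in the
  // Executors tab; temporary objects built inside map/flatMap closures
  // live on the executor heap but are not reflected in that column.
  sentences.persist(StorageLevel.MEMORY_ONLY)
  sentences.count()   // an action is needed before the cached blocks appear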

My confusion is this: why can't my program run faster when there is still
plenty of memory, CPU time and network bandwidth left for it to utilize?

Best regards,
Julaiti


On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> What application are you running? Here are a few things to consider:
>
> - You will hit a bottleneck on CPU if you are doing some complex
> computation (like parsing JSON etc.)
> - You will hit a bottleneck on memory if the data/objects used in the
> program are large (like building and manipulating HashMaps inside your map*
> operations). Here you can set spark.executor.memory to a higher number, and
> you can also change spark.storage.memoryFraction, whose default value is
> 0.6 of your executor memory (see the sketch below).
> - Network will be a bottleneck if data is not available locally on one of
> the workers and hence has to be collected from the others, which means a
> lot of serialization and data transfer across your cluster.
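>
> For example (a minimal sketch; the app name and the 20g / 0.6 values are
> only placeholders to illustrate the two settings):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("WordPairCount")                 // placeholder app name
>     .set("spark.executor.memory", "20g")         // heap size per executor
>     .set("spark.storage.memoryFraction", "0.6")  // fraction kept for cached RDDs
>   val sc = new SparkContext(conf)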
>
> Thanks
> Best Regards
>
> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalaf...@eng.ucsd.edu>
> wrote:
>
>> Hi there,
>>
>> I am trying to scale up the data size that my application is handling.
>> This application is running on a cluster with 16 slave nodes, each with
>> 60GB of memory, in standalone mode. The data comes from HDFS, which is
>> also on the same local network.
>>
>> In order to understand how my program is running, I also have Ganglia
>> installed on the cluster. From previous runs, I know the stage that takes
>> the longest time to run is counting word pairs (my RDD consists of
>> sentences from a corpus). My goal is to identify the bottleneck of my
>> application, then modify my program or hardware configuration accordingly.
>>
>> Unfortunately, I didn't find much information on Spark monitoring and
>> optimization topics. Reynold Xin gave a great talk at Spark Summit 2014 on
>> application tuning from a task perspective. Basically, his focus is on
>> tasks that are oddly slower than average. However, it didn't solve my
>> problem, because in my case there are no tasks that run much slower than
>> the others.
>>
>> So I tried to identify the bottleneck from a hardware perspective. I want
>> to know what the limitation of the cluster is. I think if the executors
>> are working hard, either CPU, memory or network bandwidth (or some
>> combination of them) should be hitting the roof. But Ganglia reports that
>> the CPU utilization of the cluster is no more than 50%, and that network
>> utilization is high for several seconds at the beginning, then drops
>> close to 0. From the Spark UI, I can see that the node with the maximum
>> memory usage is consuming around 6GB, while "spark.executor.memory" is
>> set to 20GB.
>>
>> I am very confused that the program is not running faster while no
>> hardware resource seems to be in short supply. Could you please give me
>> some hints about what determines the performance of a Spark application
>> from a hardware perspective?
>>
>> Thanks!
>>
>> Julaiti
>>
>>
>
