Hi

How big is your dataset?

Thanks
Arush

On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate <jalaf...@eng.ucsd.edu>
wrote:

> Thank you very much for your reply!
>
> My task is to count the number of word pairs in a document. If w1 and w2
> occur together in one sentence, the count of the word pair (w1, w2) is
> incremented by 1. So the computational part of this algorithm is simply a
> two-level for-loop, as in the sketch below.
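>
> Concretely, here is a minimal sketch of that computation (the input path,
> the output path, and the whitespace tokenization are just placeholders):
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.SparkContext._
>
> object WordPairCount {
>   def main(args: Array[String]): Unit = {
>     val sc = new SparkContext(new SparkConf().setAppName("WordPairCount"))
>     // One sentence per line; the path is a placeholder.
>     val sentences = sc.textFile("hdfs:///corpus/sentences.txt")
>
>     val pairCounts = sentences
>       .flatMap { sentence =>
>         // Placeholder tokenization: split on whitespace.
>         val words = sentence.split("\\s+").filter(_.nonEmpty)
>         // The two-level for-loop: emit each unordered pair once.
>         for {
>           i <- words.indices
>           j <- (i + 1) until words.length
>         } yield {
>           val (w1, w2) =
>             if (words(i) <= words(j)) (words(i), words(j))
>             else (words(j), words(i))
>           ((w1, w2), 1)
>         }
>       }
>       .reduceByKey(_ + _)
>
>     pairCounts.saveAsTextFile("hdfs:///corpus/pair-counts")
>     sc.stop()
>   }
> }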
>
> Since the cluster is monitored by Ganglia, I can easily see that neither
> CPU nor network I/O is under pressure. The only parameter left is memory.
> In the "Executors" tab of the Spark Web UI, I can see a column named
> "Memory Used". It shows that only 6GB of the 20GB of memory is used. I
> understand this measures the size of the RDDs persisted in memory. So can
> I at least assume that the data/objects used in my program do not exceed
> the memory limit?
>
> My confusion here is: why can't my program run faster while there is still
> sufficient memory, CPU time, and network bandwidth for it to utilize?
>
> Best regards,
> Julaiti
>
>
> On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> What application are you running? Here are a few things:
>>
>> - You will hit a bottleneck on CPU if you are doing some complex
>> computation (like parsing JSON, etc.)
>> - You will hit a bottleneck on memory if the data/objects used in your
>> program are large (like defining and manipulating HashMaps inside your
>> map* operations). Here you can set spark.executor.memory to a higher
>> number, and you can also change spark.storage.memoryFraction, whose
>> default value is 0.6 of your executor memory (see the sketch after this
>> list).
>> - Network will be a bottleneck if data is not available locally on one of
>> the workers and hence has to be collected from the others, which means a
>> lot of serialization and data transfer across your cluster.
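>>
>> For the memory settings above, a minimal sketch (the values are
>> illustrative, not recommendations):
>>
>> import org.apache.spark.{SparkConf, SparkContext}
>>
>> val conf = new SparkConf()
>>   .setAppName("MyApp") // placeholder name
>>   .set("spark.executor.memory", "20g") // heap size per executor
>>   .set("spark.storage.memoryFraction", "0.6") // fraction of heap for cached RDDs
>> val sc = new SparkContext(conf)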
>>
>> Thanks
>> Best Regards
>>
>> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalaf...@eng.ucsd.edu>
>> wrote:
>>
>>> Hi there,
>>>
>>> I am trying to scale up the data size that my application is handling.
>>> This application is running on a cluster with 16 slave nodes. Each slave
>>> node has 60GB of memory. It is running in standalone mode. The data comes
>>> from HDFS, which is on the same local network.
>>>
>>> In order to understand how my program is running, I also have Ganglia
>>> installed on the cluster. From previous runs, I know the stage that takes
>>> the longest time to run is counting word pairs (my RDD consists of
>>> sentences from a corpus). My goal is to identify the bottleneck of my
>>> application, and then modify my program or hardware configuration
>>> accordingly.
>>>
>>> Unfortunately, I didn't find much information on Spark monitoring and
>>> optimization topics. Reynold Xin gave a great talk at Spark Summit 2014
>>> on application tuning from the task perspective. Basically, his focus was
>>> on tasks that are oddly slower than the average. However, it didn't solve
>>> my problem, because in my case there are no tasks that run much slower
>>> than the others.
>>>
>>> So I tried to identify the bottleneck from the hardware perspective. I
>>> want to know what the limitation of the cluster is. I think if the
>>> executors are working hard, then CPU, memory, or network bandwidth (or
>>> some combination of them) should be hitting the roof. But Ganglia reports
>>> that the CPU utilization of the cluster is no more than 50%, and that
>>> network utilization is high for several seconds at the beginning, then
>>> drops close to 0. From the Spark UI, I can see that the node with the
>>> maximum memory usage is consuming around 6GB, while
>>> "spark.executor.memory" is set to 20GB.
>>>
>>> I am very confused that the program is not running faster while hardware
>>> resources are not in short supply. Could you please give me some hints
>>> about what determines the performance of a Spark application from the
>>> hardware perspective?
>>>
>>> Thanks!
>>>
>>> Julaiti
>>>
>>>
>>
>


-- 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
