Hi,

How big is your dataset?
Thanks
Arush

On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate <jalaf...@eng.ucsd.edu> wrote:

> Thank you very much for your reply!
>
> My task is to count the number of word pairs in a document. If w1 and w2
> occur together in one sentence, the count for the word pair (w1, w2)
> increases by 1. So the computational part of this algorithm is simply a
> two-level for-loop.
>
> Since the cluster is monitored by Ganglia, I can easily see that neither
> the CPU nor network I/O is under pressure. The only parameter left is
> memory. In the "Executors" tab of the Spark Web UI, I can see a column
> named "memory used". It shows that only 6GB of the 20GB of memory is used.
> I understand this measures the size of the RDDs persisted in memory. So
> can I at least assume that the data/objects used in my program do not
> exceed the memory limit?
>
> My confusion is this: why can't my program run faster while there is still
> sufficient memory, CPU time and network bandwidth it could utilize?
>
> Best regards,
> Julaiti
>
>
> On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> What application are you running? Here are a few things to consider:
>>
>> - You will hit a CPU bottleneck if you are doing some complex
>> computation (like parsing JSON, etc.).
>> - You will hit a memory bottleneck if the data/objects used in your
>> program are large (like creating and manipulating HashMaps inside your
>> map* operations). Here you can set spark.executor.memory to a higher
>> number, and you can also change spark.storage.memoryFraction, whose
>> default value is 0.6 of your executor memory.
>> - The network will be a bottleneck if data is not available locally on
>> one of the workers and hence has to be fetched from others, which means
>> a lot of serialization and data transfer across your cluster.
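[Editor's note: the two-level for-loop Julaiti describes can be sketched in plain Python. The helper names here are hypothetical; in the actual Spark job this per-sentence logic would typically sit inside a flatMap, followed by a reduceByKey to sum the pair counts.]

```python
from collections import Counter
from itertools import combinations

def sentence_pairs(sentence):
    """Yield each unordered word pair that co-occurs in one sentence."""
    words = sentence.split()
    for w1, w2 in combinations(words, 2):
        # Sort so (w1, w2) and (w2, w1) count as the same pair.
        yield tuple(sorted((w1, w2)))

def count_word_pairs(sentences):
    """Count co-occurring word pairs across all sentences (the driver-side
    equivalent of flatMap(sentence_pairs) + reduceByKey(add))."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence_pairs(sentence))
    return counts

counts = count_word_pairs(["a b c", "a b"])
print(counts[("a", "b")])  # ("a", "b") occurs in both sentences -> 2
```

Note that a sentence of n words produces n*(n-1)/2 pairs, so shuffle volume grows quadratically with sentence length, which matters when looking for the bottleneck discussed in this thread.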
>> >> Thanks >> Best Regards >> >> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalaf...@eng.ucsd.edu> >> wrote: >> >>> Hi there, >>> >>> I am trying to scale up the data size that my application is handling. >>> This application is running on a cluster with 16 slave nodes. Each slave >>> node has 60GB memory. It is running in standalone mode. The data is coming >>> from HDFS that also in same local network. >>> >>> In order to have an understanding on how my program is running, I also >>> had a Ganglia installed on the cluster. From previous run, I know the stage >>> that taking longest time to run is counting word pairs (my RDD consists of >>> sentences from a corpus). My goal is to identify the bottleneck of my >>> application, then modify my program or hardware configurations according to >>> that. >>> >>> Unfortunately, I didn't find too much information on Spark monitoring >>> and optimization topics. Reynold Xin gave a great talk on Spark Summit 2014 >>> for application tuning from tasks perspective. Basically, his focus is on >>> tasks that oddly slower than the average. However, it didn't solve my >>> problem because there is no such tasks that run way slow than others in my >>> case. >>> >>> So I tried to identify the bottleneck from hardware prospective. I want >>> to know what the limitation of the cluster is. I think if the executers are >>> running hard, either CPU, memory or network bandwidth (or maybe the >>> combinations) is hitting the roof. But Ganglia reports the CPU utilization >>> of cluster is no more than 50%, network utilization is high for several >>> seconds at the beginning, then drop close to 0. From Spark UI, I can see >>> the nodes with maximum memory usage is consuming around 6GB, while >>> "spark.executor.memory" is set to be 20GB. >>> >>> I am very confused that the program is not running fast enough, while >>> hardware resources are not in shortage. 
Could you please give me some hints >>> about what decides the performance of a Spark application from hardware >>> perspective? >>> >>> Thanks! >>> >>> Julaiti >>> >>> >> > -- [image: Sigmoid Analytics] <http://htmlsig.com/www.sigmoidanalytics.com> *Arush Kharbanda* || Technical Teamlead ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
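[Editor's note: for reference, the two settings Akhil mentions are normally set in conf/spark-defaults.conf or passed with --conf at submit time. This is a sketch only; the values below are the ones quoted in this thread (20GB executor memory, the 0.6 default storage fraction), not tuning recommendations.]

```
# conf/spark-defaults.conf (Spark 1.x era, matching this thread)
spark.executor.memory          20g
# Fraction of executor memory reserved for cached/persisted RDDs
# (default 0.6); lowering it leaves more room for task execution.
spark.storage.memoryFraction   0.6
```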