What application are you running? Here's a few things: - You will hit bottleneck on CPU if you are doing some complex computation (like parsing a json etc.) - You will hit bottleneck on Memory if your data/objects used in the program is large (like defining playing with HashMaps etc inside your map* operations), Here you can set spark.executor.memory to a higher number and also you can change the spark.storage.memoryFraction whose default value is 0.6 of your executor memory. - Network will be a bottleneck if data is not available locally on one of the worker and hence it has to collect it from others, which is a lot of Serialization and data transfer across your cluster.
Thanks Best Regards On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalaf...@eng.ucsd.edu> wrote: > Hi there, > > I am trying to scale up the data size that my application is handling. > This application is running on a cluster with 16 slave nodes. Each slave > node has 60GB memory. It is running in standalone mode. The data is coming > from HDFS that also in same local network. > > In order to have an understanding on how my program is running, I also had > a Ganglia installed on the cluster. From previous run, I know the stage > that taking longest time to run is counting word pairs (my RDD consists of > sentences from a corpus). My goal is to identify the bottleneck of my > application, then modify my program or hardware configurations according to > that. > > Unfortunately, I didn't find too much information on Spark monitoring and > optimization topics. Reynold Xin gave a great talk on Spark Summit 2014 for > application tuning from tasks perspective. Basically, his focus is on tasks > that oddly slower than the average. However, it didn't solve my problem > because there is no such tasks that run way slow than others in my case. > > So I tried to identify the bottleneck from hardware prospective. I want to > know what the limitation of the cluster is. I think if the executers are > running hard, either CPU, memory or network bandwidth (or maybe the > combinations) is hitting the roof. But Ganglia reports the CPU utilization > of cluster is no more than 50%, network utilization is high for several > seconds at the beginning, then drop close to 0. From Spark UI, I can see > the nodes with maximum memory usage is consuming around 6GB, while > "spark.executor.memory" is set to be 20GB. > > I am very confused that the program is not running fast enough, while > hardware resources are not in shortage. Could you please give me some hints > about what decides the performance of a Spark application from hardware > perspective? > > Thanks! > > Julaiti > >