The raw data is ~30 GB and consists of 250 million sentences. The total length of the documents (i.e. the sum of the lengths of all sentences) is 11 billion words. I also ran a simple algorithm to roughly count the maximum number of word pairs by summing d * (d - 1) over all sentences, where d is the length of a sentence. The result is about 63 billion.
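Roughly, that estimate amounts to a single map/reduce pass over the corpus. Below is a minimal sketch (the HDFS path, app name, and whitespace tokenization are illustrative assumptions, not the actual job):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: sum d * (d - 1) over all sentences, where d is the
    // number of words in a sentence. Path and names are illustrative.
    val sc = new SparkContext(new SparkConf().setAppName("PairCountEstimate"))
    val sentences = sc.textFile("hdfs:///corpus/sentences.txt") // one sentence per line
    val maxPairs = sentences
      .map(_.split("\\s+").length.toLong) // d: sentence length in words
      .map(d => d * (d - 1))              // ordered word pairs within one sentence
      .reduce(_ + _)
    println(s"Upper bound on word pairs: $maxPairs")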
Thanks,
Julaiti

On Tue, Feb 17, 2015 at 2:44 AM, Arush Kharbanda <ar...@sigmoidanalytics.com>
wrote:

> Hi
>
> How big is your dataset?
>
> Thanks
> Arush
>
> On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate <jalaf...@eng.ucsd.edu>
> wrote:
>
>> Thank you very much for your reply!
>>
>> My task is to count the number of word pairs in a document. If w1 and
>> w2 occur together in one sentence, the count of the word pair (w1, w2)
>> increases by 1. So the computational part of this algorithm is simply a
>> two-level for-loop (a sketch of this step appears at the end of this
>> thread).
>>
>> Since the cluster is monitored by Ganglia, I can easily see that
>> neither CPU nor network I/O is under pressure. The only resource left
>> is memory. In the "Executors" tab of the Spark web UI, I can see a
>> column named "memory used". It showed that only 6 GB of the 20 GB of
>> memory is used. I understand this measures the size of the RDDs
>> persisted in memory. So can I at least assume that the data/objects
>> used in my program do not exceed the memory limit?
>>
>> My confusion here is: why can't my program run faster while there is
>> still sufficient memory, CPU time and network bandwidth it can utilize?
>>
>> Best regards,
>> Julaiti
>>
>> On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> What application are you running? Here are a few things:
>>>
>>> - You will hit a bottleneck on CPU if you are doing some complex
>>> computation (like parsing JSON, etc.).
>>> - You will hit a bottleneck on memory if the data/objects used in your
>>> program are large (like playing with HashMaps etc. inside your map*
>>> operations). Here you can set spark.executor.memory to a higher
>>> number, and you can also change spark.storage.memoryFraction, whose
>>> default value is 0.6 of your executor memory (a configuration sketch
>>> appears at the end of this thread).
>>> - Network will be a bottleneck if data is not available locally on one
>>> of the workers and hence has to be collected from the others, which
>>> means a lot of serialization and data transfer across your cluster.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate
>>> <jalaf...@eng.ucsd.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I am trying to scale up the data size that my application is
>>>> handling. The application is running on a cluster with 16 slave
>>>> nodes, each with 60 GB of memory, in standalone mode. The data comes
>>>> from HDFS, which is also on the same local network.
>>>>
>>>> In order to understand how my program is running, I also have Ganglia
>>>> installed on the cluster. From previous runs, I know the stage that
>>>> takes the longest time is counting word pairs (my RDD consists of
>>>> sentences from a corpus). My goal is to identify the bottleneck of my
>>>> application, then modify my program or hardware configuration
>>>> accordingly.
>>>>
>>>> Unfortunately, I didn't find much information on Spark monitoring and
>>>> optimization topics. Reynold Xin gave a great talk at Spark Summit
>>>> 2014 on application tuning from the tasks perspective. Basically, his
>>>> focus is on tasks that are oddly slower than the average. However, it
>>>> didn't solve my problem, because in my case there are no tasks that
>>>> run much slower than the others.
>>>>
>>>> So I tried to identify the bottleneck from the hardware perspective.
>>>> I want to know what the limitation of the cluster is. I think if the
>>>> executors are working hard, either CPU, memory or network bandwidth
>>>> (or some combination of them) should be hitting the roof.
>>>> But Ganglia reports that the CPU utilization of the cluster is no
>>>> more than 50%, and network utilization is high for several seconds at
>>>> the beginning, then drops close to 0. From the Spark UI, I can see
>>>> that the node with the maximum memory usage is consuming around 6 GB,
>>>> while "spark.executor.memory" is set to 20 GB.
>>>>
>>>> I am very confused that the program is not running faster while none
>>>> of the hardware resources is in short supply. Could you please give
>>>> me some hints about what decides the performance of a Spark
>>>> application from the hardware perspective?
>>>>
>>>> Thanks!
>>>>
>>>> Julaiti
>
> --
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
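For reference, a minimal sketch of the pair-counting step discussed in the thread, written as a Spark transformation rather than an explicit two-level loop (the RDD name and tokenized input are assumptions; this is not the original application code):

    import org.apache.spark.SparkContext._ // reduceByKey on pair RDDs in Spark 1.x
    import org.apache.spark.rdd.RDD

    // Sketch: for each sentence, emit every ordered pair of distinct words
    // that co-occur in it, then sum the counts per pair with a shuffle.
    def countPairs(sentences: RDD[Array[String]]): RDD[((String, String), Long)] =
      sentences.flatMap { tokens =>
        for {
          w1 <- tokens
          w2 <- tokens
          if w1 != w2
        } yield ((w1, w2), 1L)
      }.reduceByKey(_ + _)

The inner for-comprehension is the two-level for-loop described above; reduceByKey then aggregates the counts across all sentences.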
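And the two settings Akhil mentions can be passed through SparkConf; a sketch with illustrative values only, not tuning recommendations:

    import org.apache.spark.SparkConf

    // Illustrative values. spark.storage.memoryFraction defaults to 0.6 of
    // executor memory; lowering it leaves more room for computation.
    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")
      .set("spark.storage.memoryFraction", "0.4")

Equivalently, both can be set with --conf flags on spark-submit.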