The raw data is ~30 GB. It consists of 250 million sentences. The total
length of the documents (i.e. the sum of the lengths of all sentences) is 11
billion words. I also ran a simple algorithm to roughly count the maximum
number of word pairs by summing d * (d - 1) over all sentences, where d is
the length of the sentence. It comes to about 63 billion.
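
For reference, the counting was along these lines (a rough sketch, not my
exact code; "sentences" is assumed to be an RDD[Array[String]] of tokenized
sentences, and the names are only illustrative):

  // total length of the documents: sum of all sentence lengths (~11 billion)
  val totalLength = sentences.map(_.length.toLong).reduce(_ + _)

  // maximum number of word pairs: sum of d * (d - 1) per sentence (~63 billion)
  val maxPairs = sentences.map { words =>
    val d = words.length.toLong
    d * (d - 1)
  }.reduce(_ + _)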

Thanks,
Julaiti


On Tue, Feb 17, 2015 at 2:44 AM, Arush Kharbanda <ar...@sigmoidanalytics.com> wrote:

> Hi
>
> How big is your dataset?
>
> Thanks
> Arush
>
> On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate <jalaf...@eng.ucsd.edu>
> wrote:
>
>> Thank you very much for your reply!
>>
>> My task is to count the number of word pairs in a document. If w1 and w2
>> occur together in one sentence, the count of the word pair (w1, w2) is
>> incremented by 1. So the computational part of this algorithm is simply a
>> two-level for-loop over the words of each sentence.
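>>
>> To make that concrete, the core is roughly the following (a simplified
>> sketch; "sentences" is again assumed to be an RDD[Array[String]] of
>> tokenized sentences):
>>
>>   val pairCounts = sentences.flatMap { words =>
>>     for {
>>       w1 <- words
>>       w2 <- words
>>       if w1 != w2
>>     } yield ((w1, w2), 1L)   // emit every ordered pair within the sentence
>>   }.reduceByKey(_ + _)       // may need import org.apache.spark.SparkContext._ on Spark < 1.3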
>>
>> Since the cluster is monitored by Ganglia, I can easily see that neither
>> CPU nor network IO is under pressure. The only parameter left is memory. In
>> the "Executors" tab of the Spark Web UI, I can see a column named "memory
>> used". It shows that only 6GB of the 20GB memory is used. I understand this
>> is measuring the size of the RDDs persisted in memory. So can I at least
>> assume the data/objects I use in my program are not exceeding the memory
>> limit?
>>
>> My confusion here is: why can't my program run faster while there is
>> still sufficient memory, CPU time and network bandwidth it can utilize?
>>
>> Best regards,
>> Julaiti
>>
>>
>> On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> What application are you running? Here are a few things:
>>>
>>> - You will hit a bottleneck on CPU if you are doing some complex
>>> computation (like parsing JSON, etc.).
>>> - You will hit a bottleneck on memory if the data/objects used in your
>>> program are large (like building and manipulating HashMaps inside your map*
>>> operations). Here you can set spark.executor.memory to a higher number, and
>>> you can also change spark.storage.memoryFraction, whose default value is
>>> 0.6 of your executor memory (see the sketch after this list).
>>> - Network will be a bottleneck if data is not available locally on one of
>>> the workers and hence has to be fetched from others, which means a lot of
>>> serialization and data transfer across your cluster.
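>>>
>>> For example, something like this (a minimal sketch; the values are only
>>> illustrative and should be tuned for your cluster):
>>>
>>>   import org.apache.spark.SparkConf
>>>
>>>   val conf = new SparkConf()
>>>     .set("spark.executor.memory", "20g")        // per-executor heap size
>>>     .set("spark.storage.memoryFraction", "0.6") // fraction usable for cached RDDs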
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate <jalaf...@eng.ucsd.edu> wrote:
>>>
>>>> Hi there,
>>>>
>>>> I am trying to scale up the data size that my application is handling.
>>>> This application is running on a cluster with 16 slave nodes. Each slave
>>>> node has 60GB of memory. It is running in standalone mode. The data comes
>>>> from HDFS, which is also on the same local network.
>>>>
>>>> In order to understand how my program is running, I also have Ganglia
>>>> installed on the cluster. From previous runs, I know the stage that takes
>>>> the longest time is counting word pairs (my RDD consists of sentences from
>>>> a corpus). My goal is to identify the bottleneck of my application, then
>>>> modify my program or hardware configuration accordingly.
>>>>
>>>> Unfortunately, I didn't find much information on Spark monitoring and
>>>> optimization topics. Reynold Xin gave a great talk at Spark Summit 2014 on
>>>> application tuning from the tasks perspective. Basically, his focus is on
>>>> tasks that are oddly slower than the average. However, it didn't solve my
>>>> problem because in my case there are no such tasks that run much slower
>>>> than the others.
>>>>
>>>> So I tried to identify the bottleneck from a hardware perspective. I want
>>>> to know what the limitation of the cluster is. I think if the executors
>>>> were running hard, either CPU, memory or network bandwidth (or some
>>>> combination of them) would be hitting the roof. But Ganglia reports that
>>>> the CPU utilization of the cluster is no more than 50%, and network
>>>> utilization is high for several seconds at the beginning, then drops close
>>>> to 0. From the Spark UI, I can see that the node with the maximum memory
>>>> usage is consuming around 6GB, while "spark.executor.memory" is set to
>>>> 20GB.
>>>>
>>>> I am very confused that the program is not running fast enough while
>>>> hardware resources are not in short supply. Could you please give me some
>>>> hints about what decides the performance of a Spark application from a
>>>> hardware perspective?
>>>>
>>>> Thanks!
>>>>
>>>> Julaiti
>>>>
>>>>
>>>
>>
>
>
> --
>
> *Arush Kharbanda* || Technical Teamlead
>
> ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
>
