Hi Mitch,
I think it is normal. The network utilization will be high while a
shuffle is happening. After that, the network utilization
should come down, while each slave node does the computation on the
partitions assigned to it. At least that is my understanding.
Best,
Hi David,
That is a great point. It is actually one of the reasons that my program is
slow. I found that the major cause of my program running slow is the huge
garbage collection time. I created too many small objects in the map
procedure, which triggered GC frequently. After I improved my
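One common way to cut that small-object churn is to aggregate pair counts into a single map per partition instead of emitting one short-lived tuple per match. A minimal plain-Python sketch of that idea (the helper name and the Spark usage in the comment are assumptions, not the poster's actual code):

```python
from collections import defaultdict
from itertools import combinations

def count_pairs_in_partition(sentences):
    """Aggregate word-pair counts into one dict per partition,
    instead of allocating a (pair, 1) tuple for every match."""
    counts = defaultdict(int)
    for sentence in sentences:
        for w1, w2 in combinations(sentence, 2):
            counts[(w1, w2)] += 1
    return counts.items()

# In Spark this would run via rdd.mapPartitions(count_pairs_in_partition)
# followed by reduceByKey; here we just call it directly on sample data.
pairs = dict(count_pairs_in_partition([["a", "b"], ["a", "b", "c"]]))
```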
Hello Julaiti,
Maybe I am just asking the obvious :-) but did you check disk IO? Depending
on what you are doing, that could be the bottleneck.
In my case none of the HW resources was a bottleneck, but I was using some
distributed features that were blocking execution (e.g. Hazelcast). Could
that be
It would be good if you could share the piece of code that you are using, so
people can suggest how to optimize it further.
Also, since you have 20 GB of memory and ~30 GB of data, you can try
doing a rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
or
The raw data is ~30 GB. It consists of 250 million sentences. The total
length of the documents (i.e. the sum of the lengths of all sentences) is 11
billion. I also ran a simple algorithm to roughly count the maximum number
of word pairs by summing up d * (d - 1) over all sentences, where d is
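That estimate can be written in a few lines of plain Python (assuming d is the number of words in each sentence; the function name is just for illustration):

```python
def estimated_pairs(sentences):
    # Upper bound on ordered word pairs: sum d * (d - 1) over all
    # sentences, where d is the number of words in the sentence.
    return sum(len(s) * (len(s) - 1) for s in sentences)

# 3 words -> 3 * 2 = 6 ordered pairs; 1 word -> 0 pairs.
total = estimated_pairs([["the", "cat", "sat"], ["hi"]])  # 6
```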
What application are you running? Here are a few things:
- You will hit a bottleneck on CPU if you are doing some complex computation
(like parsing JSON, etc.)
- You will hit a bottleneck on memory if the data/objects used in your
program are large (like playing with HashMaps etc. inside your
Thank you very much for your reply!
My task is to count the number of word pairs in a document. If w1 and w2
occur together in one sentence, the count of the word pair (w1, w2)
increases by 1. So the computational part of this algorithm is simply a
two-level for-loop.
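The two-level loop described above can be sketched in plain Python like this (a minimal illustration of the counting rule, not the poster's actual Spark code; the function name is assumed):

```python
from collections import Counter

def word_pair_counts(document):
    """For each sentence, every pair (w1, w2) that occurs together
    in that sentence gains 1 -- a simple two-level for-loop."""
    counts = Counter()
    for sentence in document:
        for i, w1 in enumerate(sentence):
            for w2 in sentence[i + 1:]:
                counts[(w1, w2)] += 1
    return counts

counts = word_pair_counts([["spark", "is", "fast"], ["spark", "is"]])
```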
Since the cluster is
Hi
How big is your dataset?
Thanks
Arush
On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate jalaf...@eng.ucsd.edu
wrote: