Hi Jestin,
I need to expand on this in the Spark notes.
Spark can handle data locality itself, but if the Spark executors run on
different nodes than the HDFS DataNodes, there is always the network between
them, which hurts performance compared to co-locating the Spark and HDFS
nodes.
These are mere details; running on YARN mainly buys you more resource-management
features than the Standalone mode. You can research more about it on Google.
Yong
From: Jestin Ma <jestinwith.a...@gmail.com>
Sent: Tuesday, August 2, 2016 7:11 PM
To: Jacek Laskowski
Cc: Nikolay Zhebet; Andrew Ehrlich; user
Subject: Re: Tuning level of Parallelism
Hi Jestin,
Which of your actions is the bottleneck? Is it the groupBy, the count, or the
join? Or all of them? It may help to tune the most time-consuming task first.
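One quick way to check, besides comparing stage durations in the Spark UI, is
to force each action separately and time it. A rough sketch (`df` and `other`
are placeholders for your actual DataFrames, and "key" is a made-up column):

  // Hypothetical sketch: time each action in isolation to see which dominates.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  time("groupBy + count") { df.groupBy("key").count().count() }
  time("outer join")      { df.join(other, Seq("key"), "outer").count() }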
On Monday, August 1, 2016, Nikolay Zhebet wrote:
Yes, Spark is always trying to deliver the snippet of code to the data (not
vice versa). But you should realize that if you run a groupBy or a join on a
large dataset, you always have to move the temporary, locally grouped data from
one worker node to another (this is the shuffle operation).
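You can see that shuffle directly in the physical plan. A minimal sketch (the
DataFrame and column name are made up):

  // groupBy inserts an Exchange (shuffle) step into the physical plan
  df.groupBy("key").count().explain()
  // look for "Exchange hashpartitioning(key, ...)" in the printed plan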
Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
conflicting sources on using YARN for Spark.
Reynold said that Spark workers automatically take care of data locality
here:
Hi.
Maybe "data locality" can help you.
If you use groupBy and joins, then most likely you will see a lot of network
operations, and this can be very slow. You can try to prepare and transform
your data in a way that minimizes moving temporary data between worker nodes.
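One concrete example of that idea (a sketch, not something from this thread;
`large` and `small` are hypothetical DataFrames): if one side of the join fits
in executor memory, broadcasting it avoids shuffling the big side entirely.

  import org.apache.spark.sql.functions.broadcast

  // Ship the small table to every executor, so the large side is joined
  // in place with no shuffle of the big input.
  val joined = large.join(broadcast(small), Seq("key"), "left_outer")
  // Note: a full outer join cannot use a broadcast join, so Spark would
  // fall back to a shuffle-based join in that case.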
It seems that the number of tasks being this large does not matter. Each task
defaults to the HDFS block size of 128 MB, which I've heard is fine. I've tried
tuning the block (task) size to be larger and smaller, to no avail.
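One thing worth noting: the 128 MB block size only sets the parallelism of the
read stage. The number of partitions after the groupBy/join shuffle is
controlled separately by spark.sql.shuffle.partitions (200 by default), so it
can be tuned independently; something like:

  // The read-side task count comes from the HDFS block size, but the
  // shuffle-side parallelism for groupBy/join is a separate setting:
  spark.conf.set("spark.sql.shuffle.partitions", "400")  // 400 is just an example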
I tried coalescing to 50 but that introduced large data skew and
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50)
placed right after loading the data. It will probably either run faster or
crash with out-of-memory errors.
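Concretely, something like this (the path and format are placeholders):

  // Reduce to 50 partitions immediately after the load; coalesce merges
  // existing partitions rather than triggering a full shuffle.
  val df = spark.read.parquet("hdfs:///path/to/input").coalesce(50)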
> On Jul 29, 2016, at 9:02 AM, Jestin Ma wrote:
I am processing ~2 TB of HDFS data using DataFrames. The size of a task equals
the block size specified by HDFS, which happens to be 128 MB, leading to about
15000 tasks.
I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
I'm performing a groupBy, a count, and an outer join with another DataFrame.
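Roughly, the job shape is this (paths, column names, and the exact ordering of
the steps are placeholders for what's described above):

  // Rough sketch of the job; all names are made up.
  val df    = spark.read.parquet("hdfs:///data/2tb-input")  // ~15000 tasks at 128 MB blocks
  val other = spark.read.parquet("hdfs:///data/other-side")

  val counts = df.groupBy("key").count()
  val joined = counts.join(other, Seq("key"), "outer")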