Hi Nikolay,

I'm looking at data locality improvements for Spark, and I have conflicting sources on using YARN for Spark.
Reynold said that Spark workers automatically take care of data locality here:
https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS

However, I've read elsewhere (https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/) that Spark on YARN increases data locality because YARN tries to place tasks next to HDFS blocks.

Can anyone verify/support one side or the other?

Thank you,
Jestin

On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <phpap...@gmail.com> wrote:

> Hi.
> Maybe "data locality" can help you.
> If you use groupBy and joins, then most likely you will see a lot of
> network operations. This can be very slow. You can try to prepare and
> transform your data in a way that minimizes moving intermediate data
> between worker nodes.
>
> Try googling "Data locality in Hadoop".
>
> 2016-08-01 4:41 GMT+03:00 Jestin Ma <jestinwith.a...@gmail.com>:
>
>> It seems that the number of tasks being this large does not matter. Each
>> task defaults to the HDFS block size of 128 MB, which I've heard is OK.
>> I've tried tuning the block (task) size to be larger and smaller, to no
>> avail.
>>
>> I tried coalescing to 50, but that introduced large data skew and slowed
>> down my job a lot.
>>
>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <and...@aehrlich.com> wrote:
>>
>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>> .coalesce(50) placed right after loading the data. It will probably
>>> either run faster or crash with out-of-memory errors.
>>>
>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
>>>
>>> I am processing ~2 TB of HDFS data using DataFrames. The size of a task
>>> is equal to the block size specified by HDFS, which happens to be 128 MB,
>>> leading to about 15000 tasks.
>>>
>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>> I'm performing groupBy, count, and an outer join with another DataFrame
>>> of ~200 MB (~80 MB cached, though I don't need to cache it), then saving
>>> to disk.
>>>
>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>
>>> I read in the Spark Tuning guide that:
>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>
>>> This means that I should have about 30-50 tasks instead of 15000, and
>>> each task would be much bigger. Is my understanding correct, and is this
>>> suggested? I've read from different sources to decrease or increase
>>> parallelism, or even keep it at the default.
>>>
>>> Thank you for your help,
>>> Jestin
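
A note on the locality question: the two sources need not conflict, since once executors are running, Spark's own scheduler tries to assign each task to an executor that holds the relevant HDFS block. A minimal sketch of the setting that controls how long it waits for such a data-local slot, assuming Spark 2.x and Parquet input; the path and the wait value are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-check")
      // How long the scheduler waits for a data-local (NODE_LOCAL) slot before
      // falling back to a less local level; the default is 3s.
      .config("spark.locality.wait", "6s")
      .getOrCreate()

    // Hypothetical input path.
    val df = spark.read.parquet("hdfs:///path/to/input")
    println(df.count())

After a run, the "Locality Level" column on the stage page of the Spark UI shows how many tasks actually ran NODE_LOCAL vs RACK_LOCAL vs ANY, which is the quickest way to see whether locality is really the bottleneck.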
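
On Nikolay's point about joins causing network traffic: with one side at only ~200 MB, a broadcast join avoids shuffling the large side for the join at all. A rough sketch, assuming Spark 2.x, Parquet inputs, and a made-up join key "id"; the paths and the key are not from the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("join-sketch").getOrCreate()

    val large = spark.read.parquet("hdfs:///data/large")   // the ~2 TB side (placeholder path)
    val small = spark.read.parquet("hdfs:///data/small")   // the ~200 MB side (placeholder path)

    // broadcast() ships the small DataFrame to every executor once, so the
    // large side is joined map-side and is never shuffled for the join itself.
    val joined = large.join(broadcast(small), Seq("id"), "left_outer")

    joined.write.parquet("hdfs:///data/out")                // placeholder output path

Alternatively, raising spark.sql.autoBroadcastJoinThreshold (default ~10 MB) above the small table's size lets the optimizer choose the broadcast join on its own.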
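
On the task-count question: the ~15000 figure comes from the input splits (2 TB / 128 MB), while the "2-3 tasks per CPU core" advice is usually applied to shuffle parallelism; with 5 x 16 = 80 cores that works out to roughly 160-240 partitions rather than 30-50. A sketch of the two knobs discussed above, reusing the spark and df names from the earlier sketches; the numbers are illustrative only:

    // How many partitions DataFrame shuffles (groupBy, join) produce; default is 200.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    // coalesce() only merges existing partitions without a shuffle, which is why
    // jumping straight to 50 can leave a few very uneven, very large partitions;
    // repartition() does a full shuffle but redistributes the data evenly.
    val narrowed   = df.coalesce(200)      // cheap, but inherits any existing skew
    val rebalanced = df.repartition(200)   // full shuffle, evenly sized partitions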