Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
conflicting sources on using YARN for Spark.

Reynold said that Spark workers automatically take care of data locality
here:
https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS

However, I've read elsewhere (
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
that Spark on YARN increases data locality because YARN tries to place
tasks next to HDFS blocks.

Can anyone verify/support one side or the other?
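
For reference, one way to check this empirically is the "Locality Level"
column in the Spark UI (PROCESS_LOCAL / NODE_LOCAL / RACK_LOCAL / ANY);
spark.locality.wait controls how long the scheduler waits for a local slot
before falling back. A rough sketch of what I mean (Spark 2.x; the path,
format, and wait value are just placeholders):

    import org.apache.spark.sql.SparkSession

    // Sketch only: the input path/format and the wait value are placeholders.
    val spark = SparkSession.builder()
      .appName("locality-check")
      // How long the scheduler waits for a NODE_LOCAL slot before
      // falling back to RACK_LOCAL / ANY (default is 3s).
      .config("spark.locality.wait", "10s")
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///path/to/data")
    df.count()  // then check the "Locality Level" column for this stage in the UI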

Thank you,
Jestin

On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet <phpap...@gmail.com> wrote:

> Hi.
> Maybe "data locality" can help you here.
> If you use groupBy and joins, then most likely you will see a lot of
> network operations. This can be very slow. You can try to prepare and
> transform your data in a way that minimizes shipping intermediate data
> between worker nodes.
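>
> For example, something along these lines (the column name "id", the
> partition count, and the paths are only illustrative): repartition both
> sides by the join key so the shuffle happens once up front, and later
> joins/aggregations on that key can often reuse the partitioning instead
> of shuffling again.
>
>     import org.apache.spark.sql.functions.col
>
>     // Sketch: co-partition both inputs on a hypothetical join key "id";
>     // assumes an existing SparkSession `spark`.
>     val left = spark.read.parquet("hdfs:///path/left")
>       .repartition(200, col("id"))
>     val right = spark.read.parquet("hdfs:///path/right")
>       .repartition(200, col("id"))
>     val joined = left.join(right, Seq("id"), "outer")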
>
> Try google in this way "Data locality in Hadoop"
>
>
> 2016-08-01 4:41 GMT+03:00 Jestin Ma <jestinwith.a...@gmail.com>:
>
>> It seems that having this many tasks doesn't matter much. Each task
>> defaults to one HDFS block (128 MB), which I've heard is fine. I've tried
>> tuning the block (task) size both larger and smaller, to no avail.
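>>
>> (For reference, on Spark 2.0 one way to change the read-task size without
>> touching the HDFS block size itself is spark.sql.files.maxPartitionBytes;
>> rough sketch below, the 256 MB value is arbitrary:)
>>
>>     // Sketch: let each read task cover up to ~256 MB of input (Spark 2.0+);
>>     // assumes an existing SparkSession `spark`.
>>     spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
>>     val df = spark.read.parquet("hdfs:///path/to/data")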
>>
>> I tried coalescing to 50 but that introduced large data skew and slowed
>> down my job a lot.
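>>
>> (If it helps: since coalesce merges existing partitions without a shuffle,
>> I believe repartition(50), a full shuffle into evenly sized partitions,
>> would be the variant to compare against. Sketch only, the path is a
>> placeholder:)
>>
>>     // Sketch: full shuffle into 50 evenly sized partitions, unlike coalesce;
>>     // assumes an existing SparkSession `spark`.
>>     val balanced = spark.read.parquet("hdfs:///path/to/data").repartition(50)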
>>
>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich <and...@aehrlich.com>
>> wrote:
>>
>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>> .coalesce(50) placed right after loading the data. It will probably either
>>> run faster or crash with out of memory errors.
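>>>
>>> Roughly this, i.e. before any groupBy/join (the path and format are just
>>> placeholders):
>>>
>>>     // Sketch: cut the 15000 input partitions down to 50 right after the read;
>>>     // assumes an existing SparkSession `spark`.
>>>     val df = spark.read.parquet("hdfs:///path/to/data").coalesce(50)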
>>>
>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com>
>>> wrote:
>>>
>>> I am processing ~2 TB of HDFS data using DataFrames. The size of a task
>>> equals the HDFS block size, which happens to be 128 MB, leading to about
>>> 15000 tasks.
>>>
>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>> to disk.
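>>>
>>> For concreteness, the job is shaped roughly like this (the column name,
>>> paths, and Parquet format are placeholders, not the real schema):
>>>
>>>     // Sketch of the pipeline described above; "key" is a hypothetical column
>>>     // and `spark` an existing SparkSession.
>>>     val big   = spark.read.parquet("hdfs:///path/big")    // ~2 TB
>>>     val small = spark.read.parquet("hdfs:///path/small")  // ~200 MB
>>>
>>>     val counts = big.groupBy("key").count()
>>>     val result = counts.join(small, Seq("key"), "outer")
>>>     result.write.parquet("hdfs:///path/out")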
>>>
>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>
>>> I read on the Spark Tuning guide that:
>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>
>>> This means that I should have roughly 160-240 tasks (80 cores x 2-3)
>>> instead of 15000, and each task would be much bigger in size. Is my
>>> understanding correct, and is this suggested? I've read from different
>>> sources to decrease or increase parallelism, or even keep it default.
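>>>
>>> (As I understand it, for DataFrame groupBy/join the number of post-shuffle
>>> tasks is governed by spark.sql.shuffle.partitions, default 200, while the
>>> 15000 comes from the input blocks. Sketch of setting it on Spark 2.x, the
>>> exact value being the open question:)
>>>
>>>     // Sketch: ~2-3 tasks per core over 80 cores; assumes an existing
>>>     // SparkSession `spark`.
>>>     spark.conf.set("spark.sql.shuffle.partitions", "200")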
>>>
>>> Thank you for your help,
>>> Jestin
>>>
>>>
>>>
>>
>
