Re: Help with Shuffle Read performance

Sungwoo Park Thu, 29 Sep 2022 23:26:23 -0700

Hi Leszek,

For running YARN on Kubernetes and then running Spark on YARN, is there a
lot of overhead for maintaining YARN on Kubernetes? I thought people
usually want to move from YARN to Kubernetes because of the overhead of
maintaining Hadoop.


Thanks,

--- Sungwoo


On Fri, Sep 30, 2022 at 1:37 PM Leszek Reimus <leszek.rei...@gmail.com>
wrote:

> Hi Everyone,
>
> To add my 2 cents here:
>
> The advantage of containers, to me, is that they leave the host system
> pristine and clean, allowing standardized devops deployment of hardware for
> any purpose. Way back, when using bare metal / ansible, reusing hw always
> involved a full reformat of the base system. This alone is worth the ~1-2%
> performance tax cgroup containers have.
>
> The advantage of kubernetes is more on the deployment side of things:
> unified deployment scripts that can be written by devs. The same deployment
> yaml (or helm chart) can be used on local Dev Env / QA / Integration Env and
> finally Prod (with some tweaks).
>
> Depending on the networking CNI and storage backend, Kubernetes can have
> very close to bare-metal performance. In the end it is always a
> trade-off. You gain some, you pay with extra overhead.
>
> I'm running YARN on kubernetes and mostly run Spark on top of YARN (some
> legacy MapReduce jobs too, though). I find it much more manageable to
> allocate larger memory/cpu chunks to YARN pods and then run an auto-scaler
> to scale out YARN if needed than to manage individual memory/cpu
> requirements in a Spark on Kubernetes deployment.
>
> As far as I tested, Spark on Kubernetes is immature where reliability is
> concerned (or maybe our homegrown k8s does not do fencing/STONITH well
> yet). When a node dies / goes down, I find executors not getting
> rescheduled to other nodes - the driver just gets stuck waiting for the
> executors to come back. This does not happen on a YARN / Standalone
> deployment (even when run on the same k8s cluster).
>
> Sincerely,
>
> Leszek Reimus
>
>
>
>
> On Thu, Sep 29, 2022 at 7:06 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Don't containers ultimately run on systems, and isn't the only advantage
>> of containers that you can utilise system resources better by
>> micromanaging the jobs running in them? Some say that containers have
>> their own binaries, which isolates the environment, but that is a lie,
>> because in a Kubernetes environment that is running your Spark jobs you
>> will have the same environment for all your kubes.
>>
>> And as you can see, there are several other issues (configuration, disk
>> mounting, security, etc.) to handle as overhead as well.
>>
>> And the entire goal of all those added configurations is that someone in
>> your devops team feels using containers makes things more interesting
>> without any real added advantage to large volume jobs.
>>
>> But I may be wrong, and perhaps we need data, and not personal attacks
>> like the other person in the thread did.
>>
>> In case anyone does not know, EMR does run on containers as well, and in
>> EMR running on EC2 nodes you can put all your binaries in containers and
>> use those for running your jobs.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Thu, Sep 29, 2022 at 7:46 PM Vladimir Prus <vladimir.p...@gmail.com>
>> wrote:
>>
>>> Igor,
>>>
>>> what exact instance types do you use? Unless you use local instance
>>> storage and have actually configured your Kubernetes and Spark to use
>>> instance storage, your 30x30 exchange can run into EBS IOPS limits. You can
>>> investigate that by going to an instance, then to its volume, and looking
>>> at the monitoring charts.
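
[A minimal sketch of the "configure Spark to use instance storage" step,
assuming the Spark on Kubernetes convention that an executor volume whose
name starts with "spark-local-dir-" is used for shuffle and spill data. The
helper name local_dir_conf and the path /mnt/nvme0 are hypothetical; the
thread does not say where instance storage is mounted, and the node itself
must already have the disk formatted and mounted there.]

```python
def local_dir_conf(volume_suffix, host_path, mount_path):
    """Build executor properties for a hostPath volume used as Spark local storage.

    The volume name is prefixed with "spark-local-dir-" so that Spark on
    Kubernetes treats the mount as a local (shuffle/spill) directory.
    """
    name = f"spark-local-dir-{volume_suffix}"
    prefix = f"spark.kubernetes.executor.volumes.hostPath.{name}"
    return {
        f"{prefix}.mount.path": mount_path,   # path inside the executor pod
        f"{prefix}.options.path": host_path,  # path on the Kubernetes node
    }


# Hypothetical mount point for a local NVMe instance-store disk.
conf = local_dir_conf("1", "/mnt/nvme0", "/mnt/nvme0")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

[The printed lines are meant to be appended to a spark-submit command. If
shuffle files stay on an EBS-backed root volume instead, the IOPS limits
mentioned above still apply.]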
>>>
>>> Another thought is that you're essentially giving 4GB per core. That
>>> sounds pretty low, in my experience.
>>>
>>>
>>>
>>> On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria <igor.calab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I'm running spark 3.2 on kubernetes and have a job with a decently
>>>> sized shuffle of almost 4TB. The relevant cluster config is as follows:
>>>>
>>>> - 30 executors, 16 physical cores each, configured as 32 cores for Spark
>>>> - 128 GB RAM
>>>> - shuffle.partitions is 18k, which gives me tasks of around 150~180MB
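
[A back-of-the-envelope sketch of what the figures in this list imply. It
assumes the 32 cores and 128 GB are per executor, which matches the "4GB per
core" remark elsewhere in the thread; every derived number is an estimate,
not a measurement.]

```python
# Figures quoted in the list above.
executors = 30
cores_per_executor = 32        # Spark task slots (16 physical cores each)
memory_gb_per_executor = 128   # assumed to be per executor
shuffle_partitions = 18_000
task_size_mb = (150, 180)      # reported size range of a reduce task

# Derived estimates.
task_slots = executors * cores_per_executor                       # concurrent tasks
memory_per_slot_gb = memory_gb_per_executor / cores_per_executor  # RAM per task slot
waves = shuffle_partitions / task_slots                           # reduce-task waves
shuffle_tb = tuple(shuffle_partitions * mb / 1_000_000 for mb in task_size_mb)
peer_pairs = executors * (executors - 1)                          # directed fetch pairs

print(f"task slots:        {task_slots}")
print(f"memory per slot:   {memory_per_slot_gb} GB")
print(f"reduce waves:      {waves}")
print(f"shuffle read size: {shuffle_tb[0]:.2f} to {shuffle_tb[1]:.2f} TB")
print(f"executor pairs:    {peer_pairs}")
```

[Each reduce task may fetch blocks from every other executor, so the number
of directed executor pairs is a rough proxy for how many connections are in
play during the shuffle read.]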
>>>>
>>>> The job runs fine, but I'm bothered by how underutilized the cluster
>>>> gets during the reduce phase. During the map phase (reading data from S3
>>>> and writing the shuffle data), CPU usage, disk throughput and network
>>>> usage are as expected, but during the reduce phase they get really low.
>>>> It seems the main bottleneck is reading shuffle data from other nodes:
>>>> task statistics report values ranging from 25s to several minutes (the
>>>> task sizes are really close; they aren't skewed). I've tried increasing
>>>> "spark.reducer.maxSizeInFlight" and
>>>> "spark.shuffle.io.numConnectionsPerPeer", and it did improve performance
>>>> a little, but not enough to saturate the cluster resources.
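
[A minimal sketch of how the two settings named above could be passed to
spark-submit. The values 96m and 2 are assumptions chosen only to sit above
the Spark defaults (48m and 1); the thread does not say which values were
actually tried. The helper name to_submit_args is hypothetical.]

```python
# Hypothetical values, shown only to illustrate the shape of the change.
shuffle_read_conf = {
    "spark.reducer.maxSizeInFlight": "96m",         # Spark default: 48m
    "spark.shuffle.io.numConnectionsPerPeer": "2",  # Spark default: 1
}


def to_submit_args(conf):
    """Flatten a config mapping into spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(shuffle_read_conf)))
```

[Raising maxSizeInFlight increases the fetch buffer memory used by each
reduce task, so it trades executor memory for fewer, larger fetch requests.]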
>>>>
>>>> Did I miss some more tuning parameters that could help?
>>>> One obvious thing would be to scale the machines up vertically and use
>>>> fewer nodes to minimize traffic, but 30 nodes doesn't seem like much even
>>>> considering 30x30 connections.
>>>>
>>>> Thanks in advance!
>>>>
>>>>
>>>
>>> --
>>> Vladimir Prus
>>> http://vladimirprus.com
>>>
>>
>
> --
> --------------
> "It is the common fate of the indolent to see their rights become a prey
> to the active. The condition upon which God hath given liberty to man is
> eternal vigilance; which condition if he break, servitude is at once the
> consequence of his crime and the punishment of his guilt." - John Philpot
> Curran: Speech upon the Right of Election, 1790.
>
