Hi Leszek,

When you run YARN on Kubernetes and then run Spark on YARN, is there a lot of overhead in maintaining YARN on Kubernetes? I thought people usually want to move from YARN to Kubernetes because of the overhead of maintaining Hadoop.

Thanks,

--- Sungwoo

On Fri, Sep 30, 2022 at 1:37 PM Leszek Reimus <leszek.rei...@gmail.com> wrote:

> Hi Everyone,
>
> To add my 2 cents here:
>
> The advantage of containers, to me, is that they leave the host system
> pristine and clean, allowing standardized devops deployment of hardware for
> any purpose. Way back, when using bare metal / ansible, reusing hw always
> involved a full reformat of the base system. This alone is worth the ~1-2%
> performance tax cgroup containers have.
>
> The advantage of Kubernetes is more on the deployment side of things:
> unified deployment scripts that can be written by devs. The same deployment
> yaml (or helm chart) can be used on local Dev Env / QA / Integration Env and
> finally Prod (with some tweaks).
>
> Depending on the networking CNI and storage backend, Kubernetes can have
> very close to bare-metal performance. In the end it is always a trade-off.
> You gain some, you pay with extra overhead.
>
> I'm running YARN on Kubernetes and mostly run Spark on top of YARN (some
> legacy MapReduce jobs too, though). I find it much more manageable to
> allocate larger memory/cpu chunks to YARN pods and then have the
> auto-scaler scale out YARN if needed than to manage individual memory/cpu
> requirements in a Spark on Kubernetes deployment.
>
> As far as I tested, Spark on Kubernetes is immature where reliability is
> concerned (or maybe our homegrown k8s does not do fencing/STONITH well
> yet). When a node dies / goes down, I find executors not getting
> rescheduled to other nodes - the driver just gets stuck waiting for the
> executors to come back. This does not happen on a YARN / Standalone
> deployment (even when run on the same k8s cluster).
>
> Sincerely,
>
> Leszek Reimus
>
> On Thu, Sep 29, 2022 at 7:06 PM Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>> Hi,
>>
>> don't containers finally run on systems, and the only advantage of
>> containers is that you can do better utilisation of system resources by
>> micro management of jobs running in it? Some say that containers have their
>> own binaries which isolate the environment, but that is a lie, because in a
>> Kubernetes environment that is running your Spark jobs you will have the
>> same environment for all your kubes.
>>
>> And as you can see there are several other configuration, disk mounting,
>> security, etc. issues to handle as an overhead as well.
>>
>> And the entire goal of all those added configurations is that someone in
>> your devops team feels using containers makes things more interesting
>> without any real added advantage to large volume jobs.
>>
>> But I may be wrong, and perhaps we need data, and not personal attacks
>> like the other person in the thread did.
>>
>> In case anyone does not know, EMR does run on containers as well, and in
>> EMR running on EC2 nodes you can put all your binaries in containers and
>> use those for running your jobs.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Thu, Sep 29, 2022 at 7:46 PM Vladimir Prus <vladimir.p...@gmail.com>
>> wrote:
>>
>>> Igor,
>>>
>>> what exact instance types do you use? Unless you use local instance
>>> storage and have actually configured your Kubernetes and Spark to use
>>> instance storage, your 30x30 exchange can run into EBS IOPS limits. You can
>>> investigate that by going to an instance, then to its volume, and looking
>>> at the monitoring charts.
>>>
>>> Another thought is that you're essentially giving 4GB per core. That
>>> sounds pretty low, in my experience.
>>>
>>> On Thu, Sep 29, 2022 at 9:13 PM Igor Calabria <igor.calab...@gmail.com>
>>> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> I'm running spark 3.2 on kubernetes and have a job with a decently
>>>> sized shuffle of almost 4TB. The relevant cluster config is as follows:
>>>>
>>>> - 30 Executors. 16 physical cores, configured with 32 Cores for spark
>>>> - 128 GB RAM
>>>> - shuffle.partitions is 18k which gives me tasks of around 150~180MB
>>>>
>>>> The job runs fine but I'm bothered by how underutilized the cluster
>>>> gets during the reduce phase. During the map (reading data from s3 and
>>>> writing the shuffle data) CPU usage, disk throughput and network usage are
>>>> as expected, but during the reduce phase they get really low. It seems the
>>>> main bottleneck is reading shuffle data from other nodes: task statistics
>>>> report values ranging from 25s to several minutes (the task sizes are
>>>> really close, they aren't skewed). I've tried increasing
>>>> "spark.reducer.maxSizeInFlight" and
>>>> "spark.shuffle.io.numConnectionsPerPeer" and it did improve performance by
>>>> a little, but not enough to saturate the cluster resources.
>>>>
>>>> Did I miss some more tuning parameters that could help?
>>>> One obvious thing would be to vertically increase the machines and use
>>>> fewer nodes to minimize traffic, but 30 nodes doesn't seem like much even
>>>> considering 30x30 connections.
>>>>
>>>> Thanks in advance!
>>>>
>>>
>>> --
>>> Vladimir Prus
>>> http://vladimirprus.com
>>>
>
> --
> --------------
> "It is the common fate of the indolent to see their rights become a prey
> to the active. The condition upon which God hath given liberty to man is
> eternal vigilance; which condition if he break, servitude is at once the
> consequence of his crime and the punishment of his guilt." - John Philpot
> Curran: Speech upon the Right of Election, 1790.
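
For reference, the figures quoted in the thread (Igor's cluster description and Vladimir's "4GB per core" remark) can be sanity-checked with a few lines of Python. This is only a rough sketch: the helper name `shuffle_estimate` is made up for illustration, "almost 4TB" is rounded up to 4 TB, decimal units are assumed, and the 128 GB is assumed to be per executor (which is what makes Vladimir's 4GB-per-core figure work out).

```python
# Back-of-the-envelope numbers for the cluster described in the thread.
# Assumptions (not stated exactly in the thread): the shuffle is rounded
# up to 4 TB, units are decimal, and 128 GB RAM is per executor.

TB = 1000 ** 4
GB = 1000 ** 3
MB = 1000 ** 2


def shuffle_estimate(shuffle_bytes, partitions, executors,
                     cores_per_executor, ram_per_executor_bytes):
    """Hypothetical helper: derive rough per-task and per-core figures."""
    slots = executors * cores_per_executor
    return {
        # Number of tasks that can run at the same time.
        "task_slots": slots,
        # Average shuffle data read by one reduce task.
        "bytes_per_task": shuffle_bytes / partitions,
        # Executor memory available per configured core.
        "ram_per_core": ram_per_executor_bytes / cores_per_executor,
        # How many rounds of tasks are needed (ceiling division).
        "task_waves": -(-partitions // slots),
        # Upper bound on executor-to-executor fetch pairs ("30x30").
        "peer_pairs": executors * executors,
    }


est = shuffle_estimate(
    shuffle_bytes=4 * TB,
    partitions=18_000,
    executors=30,
    cores_per_executor=32,
    ram_per_executor_bytes=128 * GB,
)

print(f"task slots:     {est['task_slots']}")
print(f"MB per task:    {est['bytes_per_task'] / MB:.0f}")
print(f"GB per core:    {est['ram_per_core'] / GB:.0f}")
print(f"task waves:     {est['task_waves']}")
print(f"peer pairs:     {est['peer_pairs']}")
```

With the 4 TB rounding, the per-task figure comes out somewhat above the 150~180MB Igor reports, which is expected given that his shuffle is "almost" 4TB and the thread does not say how either number was measured. The memory-per-core result matches Vladimir's 4GB estimate.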