Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-23 Thread Mich Talebzadeh
Thanks Julien for the further info. I have spent a few days of free time working on PySpark on Kubernetes, both on minikube and on Google Cloud Platform (GCP), which provides Spark on Google Kubernetes Engine (GKE). Frankly, my work on k8s has been a bit disappointing. In GCP the only available and supported

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-23 Thread Julien Laurenceau
Hi, Good question! It depends very much on your jobs and your developer team. The things that differ most, in my view, are: 1/ Data locality & fast reads: if your data are stored in an HDFS cluster (not HCFS) and your Spark compute nodes are allowed to run on the Hadoop nodes, then definitely use Yarn to

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-06 Thread Mich Talebzadeh
I had a chance to look at this paper. I have reservations about this benchmark. They used Google Dataproc, with which you can create a cluster with Hadoop and Spark (they used Spark 3) and decide on the number of worker nodes. This is the layout of their setup: Setup This benchmark

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
It is true that the original idea of Yarn on Hdfs came from data affinity. However, nowadays the separation of storage from the compute layer is very common. They do not allude to data affinity (say using Hadoop clusters). They refer to storage in Cloud and they refer to use of SSDs etc. I know

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Christian Pfarr
Does anyone know where the data for this benchmark was stored? Spark on YARN gets its performance from data locality via co-location of the YARN NodeManager and the HDFS DataNode, not from the job scheduler, right? Regards, z0ltrix

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
Thanks Aditya for the link. I will have a look. Cheers

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Madaditya .Maddy
I came across an article by Data Mechanics that benchmarked Spark on K8s vs YARN. Link: https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn Regards, Aditya

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
Thanks Yuri. Those are very valid points. Let me clarify my point. Let us assume that we will be using Yarn versus K8s to do the same job. Spark-submit will use Yarn in the first instance and will then switch to using k8s for the same task. 1. Have there been such benchmarks? 2. When should I

Re: Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
I'm not a big expert on Spark, but I don't really understand what you are going to compare, and how. Reading from and writing to HDFS? How is that related to YARN and k8s? These are resource managers (YARN: Yet Another Resource Negotiator): what and how much to allocate, and when (CPU, RAM). Local Disk

Bechmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Mich Talebzadeh
I was curious to know whether there are any benchmarks comparing Spark on Yarn with Spark on Kubernetes. This question arose because traditionally in Google Cloud we have been using Spark on Dataproc clusters. Dataproc provides Spark, Hadoop plus