I'm not a big Spark expert, but I don't quite understand what you intend to compare, or how. Reading and writing to and from HDFS? And how does that relate to YARN and k8s? Both are resource managers (YARN = Yet Another Resource Negotiator): they decide what resources (CPU, RAM) to allocate, how much, and when. Local disk spilling? That depends on disk throughput. So what exactly are you going to measure?
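To make the question concrete: whatever the scheduler, a comparison only means something if you time the same job end to end on YARN and on k8s under the same resource settings. A minimal sketch of that kind of wall-clock measurement is below; the workload here is a pure-Python stand-in, and on a real cluster you would pass a Spark action (e.g. a DataFrame write to HDFS or GCS) instead:

```python
import time

def time_stage(label, fn, *args, **kwargs):
    """Run fn once and report wall-clock seconds -- a crude proxy for
    comparing the same Spark job on YARN vs Kubernetes."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Stand-in workload; on a real cluster you would time a Spark action,
# e.g. time_stage("write", lambda: df.write.parquet("hdfs:///out"))
_, secs = time_stage("sort-1e6", lambda: sorted(range(1_000_000, 0, -1)))
```

Run the same harness on both clusters (and repeat several times to average out variance) before drawing any conclusions.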
Best regards

> On 5 Jul 2021, at 20:43, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> I was curious to know if there are benchmarks around comparing Spark on YARN with Spark on Kubernetes.
>
> This question arose because traditionally in Google Cloud we have been using Spark on Dataproc clusters. Dataproc provides Spark, Hadoop and other components (optional installs) for data and analytics processing. It is PaaS.
>
> Now they have GKE clusters as well, and have also introduced Apache Spark with Cloud Dataproc on Kubernetes, which allows one to submit Spark jobs to k8s using Dataproc as a stub platform, as below, from the cloud console or locally:
>
> gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" gs://bucket/testme.py --region="europe-west2" --py-files gs://bucket/DSBQ.zip
> Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
> Waiting for job output...
>
> At the moment it is a struggle to see what merits using k8s instead of Dataproc, bar notebooks etc. Actually, there is not much literature around on PySpark on k8s.
>
> For me, Spark on bare metal is the preferred option, as I cannot see how one can pigeonhole Spark into a container and keep it performant, but I may be totally wrong.
>
> Thanks
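For reference, besides the Dataproc stub quoted above, Spark can also be submitted natively to a Kubernetes API server with spark-submit. A sketch of the equivalent native submission follows; the API-server host, namespace, image name and executor count are placeholders, not values from the original thread:

```shell
# Native Spark-on-Kubernetes submission (Spark 3.x).
# <k8s-apiserver-host> and <registry> are placeholders.
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:6443 \
  --deploy-mode cluster \
  --name testme \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=<registry>/spark-py:3.1.2 \
  --conf spark.executor.instances=3 \
  gs://bucket/testme.py
```

The driver then runs as a pod in the given namespace, which is the main operational difference from the YARN application-master model.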