Does anyone know where the data for this benchmark was stored?

Spark on YARN gets its performance from data locality, through co-location of the YARN NodeManager and the HDFS DataNode on the same hosts, not from the job scheduler, right?

Regards,
z0ltrix

\-------- Original Message --------
On 5 July 2021, 21:27, Madaditya .Maddy wrote:

> I came across an article by Datamechanics that benchmarked Spark on k8s vs YARN.
>
> Link:
> https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn
>
> \-Regards
> Aditya
>
> On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh <[mich.talebza...@gmail.com][mich.talebzadeh_gmail.com]> wrote:
>
> > Thanks Yuri. Those are very valid points.
> >
> > Let me clarify my point. Let us assume that we run the same job on Yarn and on K8s. Spark-submit will use Yarn in the first instance and will then switch to k8s for the same task.
> >
> > 1. Have there been such benchmarks?
> > 2. When should I choose PaaS versus k8s, for example for small to medium size jobs?
> > 3. I can see the flexibility of running Spark on Dataproc, yet some may argue that k8s is the way forward.
> > 4. Bear in mind that I am only considering Spark. For example, for Kafka and Zookeeper we opt for Docker containers, as they each do a single function.
> >
> > Cheers,
> >
> > Mich
> >
> > ![uc_export_download_id_1-q7RFGRfLMObPuQPWSd9sl_H1UPNFaIZ_revid_0B1BiUVX33unjMWtVUWpINWFCd0ZQTlhTRHpGckh4Wlg4RG80PQ][][view my Linkedin profile][]
> >
> > **Disclaimer:** Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
> >
> > On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" <[yur...@gmail.com][yurkao_gmail.com]> wrote:
> >
> > > I am not a big expert on Spark, but I don't really understand what you are going to compare, and how. Reading and writing to and from HDFS? How does that relate to YARN and k8s? These are resource managers (YARN: Yet Another Resource Negotiator): they decide what and how much to allocate, and when (CPU, RAM).
> > >
> > > Local disk spilling? That depends on disk throughput…
> > >
> > > So what are you going to measure?
> > >
> > > Best regards
> > >
> > > > On 5 Jul 2021, at 20:43, Mich Talebzadeh <[mich.talebza...@gmail.com][mich.talebzadeh_gmail.com]> wrote:
> > > >
> > > > I was curious to know whether there are benchmarks comparing Spark on Yarn with Spark on Kubernetes.
> > > >
> > > > This question arose because traditionally in Google Cloud we have been using Spark on Dataproc clusters. [Dataproc][Dataproc] provides Spark, Hadoop plus others (optional install) for data and analytic processing. It is PaaS.
> > > >
> > > > Now they have GKE clusters as well and have also introduced [Apache Spark with Cloud Dataproc on Kubernetes][], which allows one to submit Spark jobs to k8s, using a Dataproc stub as the platform to submit the job, as below from the cloud console or locally:
> > > >
> > > >     gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" gs://bucket/testme.py --region="europe-west2" --py-files gs://bucket/DSBQ.zip
> > > >     Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
> > > >     Waiting for job output...
> > > >
> > > > At the moment it is a struggle to see what merits using k8s instead of Dataproc, bar notebooks etc. Actually, there is not much literature around on PySpark on k8s.
> > > >
> > > > For me Spark on bare metal is the preferred option, as I cannot see how one can pigeonhole Spark into a container and make it performant, but I may be totally wrong.
> > > >
> > > > Thanks

[mich.talebzadeh_gmail.com]: mailto:mich.talebza...@gmail.com
[uc_export_download_id_1-q7RFGRfLMObPuQPWSd9sl_H1UPNFaIZ_revid_0B1BiUVX33unjMWtVUWpINWFCd0ZQTlhTRHpGckh4Wlg4RG80PQ]: https://docs.google.com/uc?export=download&id=1-q7RFGRfLMObPuQPWSd9sl_H1UPNFaIZ&revid=0B1BiUVX33unjMWtVUWpINWFCd0ZQTlhTRHpGckh4Wlg4RG80PQ
[view my Linkedin profile]: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
[yurkao_gmail.com]: mailto:yur...@gmail.com
[Dataproc]: https://cloud.google.com/dataproc
[Apache Spark with Cloud Dataproc on Kubernetes]: https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes