Does anyone know where the data for this benchmark was stored?

Spark on YARN gets its performance from data locality, i.e. from co-locating 
the YARN NodeManager and the HDFS DataNode on the same hosts, not from the 
job scheduler, right?

Regards,

z0ltrix

\-------- Original Message --------
On 5 July 2021, 21:27, Madaditya .Maddy wrote:

>
> I came across an article by Data Mechanics that benchmarked Spark on k8s 
> against YARN.
>
>
>
>
> Link : 
> https://www.datamechanics.co/blog-post/apache-spark-performance-benchmarks-show-kubernetes-has-caught-up-with-yarn
>
>
>
>
> \-Regards
>
> Aditya
>
>
>
>
> On Mon, Jul 5, 2021, 23:49 Mich Talebzadeh 
> <[mich.talebza...@gmail.com][mich.talebzadeh_gmail.com]> wrote:
>
>
> > Thanks Yuri. Those are very valid points.
> >
> > Let me clarify my point. Let us assume that we run the same job on YARN 
> > and on k8s: spark-submit uses YARN in the first instance and then 
> > switches to k8s for the same task.
> >
> > 1.  Have there been such benchmarks?
> > 2.  When should I choose PaaS over k8s, for example for small to 
> > medium-sized jobs?
> > 3.  I can see the flexibility of running Spark on Dataproc, but some may 
> > argue that k8s is the way forward.
> > 4.  Bear in mind that I am only considering Spark. For Kafka and 
> > ZooKeeper, for example, we opt for Docker containers, as each does a 
> > single job.
> >
> >
> >
> >
> > Cheers,
> >
> >
> >
> >
> > Mich
> >
> >
> >
> >
> > [view my Linkedin profile][]
> >
> > **Disclaimer:** Use it at your own risk. Any and all responsibility for any 
> > loss, damage or destruction of data or any other property which may arise 
> > from relying on this email's technical content is explicitly disclaimed. 
> > The author will in no case be liable for any monetary damages arising from 
> > such loss, damage or destruction.
> >
> >
> >
> >
> >
> >
> >
> > On Mon, 5 Jul 2021 at 19:06, "Yuri Oleynikov (יורי אולייניקוב)" 
> > <[yur...@gmail.com][yurkao_gmail.com]> wrote:
> >
> >
> > > I am no great expert on Spark, but I do not really understand what you 
> > > are going to compare, and how. Reading from and writing to HDFS? How is 
> > > that related to YARN and k8s? These are resource managers (YARN stands 
> > > for Yet Another Resource Negotiator): they decide what and how much to 
> > > allocate, and when (CPU, RAM).
> > >
> > > Local disk spilling? That depends on disk throughput.
> > >
> > > So what are you going to measure?
> > >
> > > Best regards
> > >
> > >
> > >
> > >
> > > > On 5 Jul 2021, at 20:43, Mich Talebzadeh 
> > > > <[mich.talebza...@gmail.com][mich.talebzadeh_gmail.com]> wrote:
> > > >
> > > > I was curious to know whether there are any benchmarks comparing Spark 
> > > > on YARN with Spark on Kubernetes.
> > > >
> > > >
> > > >
> > > >
> > > > This question arose because traditionally in Google Cloud we have been 
> > > > using Spark on Dataproc clusters. [Dataproc][Dataproc] provides Spark, 
> > > > Hadoop plus others (optional install) for data and analytic processing. 
> > > > It is a PaaS offering.
> > > >
> > > >
> > > >
> > > >
> > > > Now they have GKE clusters as well, and have also introduced [Apache Spark 
> > > > with Cloud Dataproc on Kubernetes][], which allows one to submit Spark 
> > > > jobs to k8s, using a Dataproc stub as the platform for submitting the 
> > > > job, as below, from the cloud console or a local machine:
> > > >
> > > >
> > > >
> > > >
> > > >     gcloud dataproc jobs submit pyspark --cluster="dataproc-for-gke" \
> > > >         gs://bucket/testme.py --region="europe-west2" \
> > > >         --py-files gs://bucket/DSBQ.zip
> > > >     Job [e5fc19b62cf744f0b13f3e6d9cc66c19] submitted.
> > > >     Waiting for job output...
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > At the moment I struggle to see what merit there is in using k8s instead 
> > > > of Dataproc, bar notebooks etc. In fact, there is not much literature 
> > > > around on PySpark on k8s.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > For me, Spark on bare metal is the preferred option, as I cannot see how 
> > > > one can pigeonhole Spark into a container and make it performant, but I 
> > > > may be totally wrong.
> > > >
> > > >
> > > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > >
> > > >


[mich.talebzadeh_gmail.com]: mailto:mich.talebza...@gmail.com
[view my Linkedin profile]: 
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
[yurkao_gmail.com]: mailto:yur...@gmail.com
[Dataproc]: https://cloud.google.com/dataproc
[Apache Spark with Cloud Dataproc on Kubernetes]: 
https://cloud.google.com/blog/products/data-analytics/modernize-apache-spark-with-cloud-dataproc-on-kubernetes
