Regarding the part about making the Spark writer faster: if you are (or can be) on Databricks, check this out. It is fresh out of the oven at Databricks.
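For reference, the feature the blog announces (Liquid Clustering on Delta tables) is declared at table creation. A minimal, untested sketch, assuming a Databricks runtime with Delta Lake and a hypothetical table/column:

```python
# Databricks-only sketch (requires a Databricks Spark session; not runnable
# standalone). Table and column names are hypothetical.
spark.sql("""
  CREATE TABLE events (ts TIMESTAMP, user_id BIGINT)
  USING DELTA
  CLUSTER BY (user_id)
""")
```

Clustering replaces hand-tuned partitioning/ZORDER, which is where the write-side speedup the blog describes comes from.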
https://www.databricks.com/blog/announcing-general-availability-liquid-clustering

________________________________
From: Gera Shegalov <ger...@gmail.com>
Sent: Wednesday, May 29, 2024 7:57:56 AM
To: Prem Sahoo <prem.re...@gmail.com>
Cc: eab...@163.com <eab...@163.com>; Vibhor Gupta <vibhor.gu...@walmart.com>; user @spark <user@spark.apache.org>
Subject: Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

I agree with the previous answers that (if requirements allow it) it is much easier to just orchestrate a copy, either in the same app or synced externally. A long time ago, and not for a Spark app, we solved a similar use case via https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-hdfs/ViewFs.html#Multi-Filesystem_I.2F0_with_Nfly_Mount_Points . It may work with Spark because it sits underneath the FileSystem API.

On Tue, May 21, 2024 at 10:03 PM Prem Sahoo <prem.re...@gmail.com> wrote:

I am looking for a writer/committer optimization which can make the Spark write faster.

On Tue, May 21, 2024 at 9:15 PM eab...@163.com <eab...@163.com> wrote:

Hi,

I think you should write to HDFS, then copy the files (Parquet or ORC) from HDFS to MinIO.

________________________________
eabour

From: Prem Sahoo
Date: 2024-05-22 00:38
To: Vibhor Gupta; user
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way

On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <prem.re...@gmail.com> wrote:

Hello Vibhor,

Thanks for the suggestion. I am looking for other alternatives where the same dataframe can be written to two destinations without re-execution and without cache or persist.

Can someone help me with scenario 2? How can I make Spark write to MinIO faster?
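The Nfly mount point Gera mentions replicates every write to all URIs listed for the mount, so a single Spark write can land in both stores. A minimal `core-site.xml` sketch under stated assumptions (cluster name, namenode address, and bucket are hypothetical; MinIO is reached through the s3a connector):

```xml
<!-- Sketch only, not a tested configuration.
     An Nfly link replicates writes under /data to both filesystems. -->
<property>
  <name>fs.viewfs.mounttable.global.linkNfly../data</name>
  <value>hdfs://namenode:8020/data,s3a://bucket/data</value>
</property>
<!-- Make the federated view the default filesystem. -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://global/</value>
</property>
```

With this in place the application writes once to `viewfs://global/data/...` and the Nfly layer fans the I/O out; whether the double network write is faster than a write-then-copy depends on the cluster's bandwidth to MinIO.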
Sent from my iPhone

On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:

Hi Prem,

You can try writing to HDFS, then reading from HDFS and writing to MinIO. This will prevent duplicate transformation.

You can also try persisting the dataframe using the DISK_ONLY level.

Regards,
Vibhor

From: Prem Sahoo <prem.re...@gmail.com>
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list <d...@spark.apache.org>
Subject: EXT: Dual Write to HDFS and MinIO in faster way

EXTERNAL: Report suspicious emails to Email Abuse.

Hello Team,

I am planning to write to two datasources at the same time.

Scenario 1: Write the same dataframe to HDFS and MinIO without re-executing the transformations and without cache(). Today I read a Parquet file, do a few transformations, and write to HDFS and MinIO; for both writes, Spark executes the transformations again. How can we avoid this re-execution without cache()/persist()?

Scenario 2: I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes. Is there any way to make this write faster? I don't want to repartition before writing, as repartitioning has the overhead of a shuffle.

Please provide some inputs.
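The persist-then-dual-write pattern suggested above can be sketched in PySpark. This is an untested sketch that needs a running cluster with both filesystems configured; the paths, session setup, and filter are illustrative, not from the thread:

```python
# Sketch, assuming a cluster where hdfs:// and s3a:// (MinIO) are both reachable.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-write").getOrCreate()

df = (spark.read.parquet("hdfs://namenode:8020/in/events")  # hypothetical input
        .filter("ts IS NOT NULL"))                          # example transform

# Materialize the transformed data once on executor disk so the second
# write replays from disk instead of re-running the lineage.
df.persist(StorageLevel.DISK_ONLY)
df.write.mode("overwrite").parquet("hdfs://namenode:8020/out/events")
df.write.mode("overwrite").parquet("s3a://bucket/out/events")  # MinIO via s3a
df.unpersist()
```

The trade-off: DISK_ONLY persist pays one extra serialize-to-disk pass but avoids recomputing the transformations for the second sink, which is usually the cheaper side when the lineage includes wide operations.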