I am looking for a writer/committer optimization that can make the Spark write faster.
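For context, the kind of committer settings I have been looking at are the S3A committers from the Hadoop cloud-integration work (this assumes Spark is deployed with the `spark-hadoop-cloud` module and Hadoop 3.x; MinIO is S3-compatible, so the "magic" committer should apply). A sketch of the `spark-defaults.conf` entries:

```properties
# Hypothetical spark-defaults.conf fragment -- requires the
# spark-hadoop-cloud module on the classpath and Hadoop 3.x S3A.
# The magic committer avoids the slow rename-based commit on object stores.
spark.hadoop.fs.s3a.committer.name               magic
spark.hadoop.fs.s3a.committer.magic.enabled      true
spark.sql.sources.commitProtocolClass            org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class         org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter
```

The rename-free commit matters on object stores like MinIO because a directory rename there is a copy-and-delete of every object, which the default FileOutputCommitter relies on.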
On Tue, May 21, 2024 at 9:15 PM eab...@163.com <eab...@163.com> wrote:
> Hi,
> I think you should write to HDFS, then copy the file (parquet or orc) from HDFS to MinIO.
>
> ------------------------------
> eabour
>
> *From:* Prem Sahoo <prem.re...@gmail.com>
> *Date:* 2024-05-22 00:38
> *To:* Vibhor Gupta <vibhor.gu...@walmart.com>; user <user@spark.apache.org>
> *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way
>
> On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>> Hello Vibhor,
>> Thanks for the suggestion.
>> I am looking for other alternatives where the same dataframe can be written to two destinations without re-execution and without cache() or persist().
>>
>> Can someone help me with scenario 2?
>> How can I make Spark write to MinIO faster?
>> Sent from my iPhone
>>
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:
>>
>> Hi Prem,
>>
>> You can try writing to HDFS, then reading from HDFS and writing to MinIO.
>>
>> This will prevent duplicate transformation.
>>
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>
>> Regards,
>> Vibhor
>>
>> *From:* Prem Sahoo <prem.re...@gmail.com>
>> *Date:* Tuesday, 21 May 2024 at 8:16 AM
>> *To:* Spark dev list <d...@spark.apache.org>
>> *Subject:* EXT: Dual Write to HDFS and MinIO in faster way
>>
>> *EXTERNAL:* Report suspicious emails to *Email Abuse.*
>>
>> Hello Team,
>>
>> I am planning to write to two data sources at the same time.
>>
>> Scenario 1:
>>
>> Write the same dataframe to HDFS and MinIO without re-executing the transformations and without cache(). How can we make this faster?
>>
>> Read a parquet file, do a few transformations, and write to both HDFS and MinIO.
>>
>> Here, for both writes, Spark needs to execute the transformations again. Do we know how to avoid re-execution of the transformations without cache()/persist()?
>>
>> Scenario 2:
>>
>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>> Do we have any way to make this write faster?
>>
>> I don't want to repartition and write, since repartition adds the overhead of shuffling.
>>
>> Please provide some inputs.
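For scenario 2, the S3A client settings I am planning to experiment with for the MinIO write path. Property names are from the Hadoop 3.x S3A documentation; the values are hypothetical starting points to tune, not benchmarked recommendations:

```properties
# Buffer multipart uploads on local disk and raise upload parallelism.
spark.hadoop.fs.s3a.fast.upload.buffer     disk
spark.hadoop.fs.s3a.multipart.size         64M
spark.hadoop.fs.s3a.threads.max            64
spark.hadoop.fs.s3a.connection.maximum     96
```

Larger multipart sizes mean fewer PUT requests per file, while more threads and connections let several parts upload concurrently; the right balance depends on the MinIO cluster's network and disk throughput.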