Re: Re: EXT: Dual Write to HDFS and MinIO in faster way
Regarding the "making the Spark writer fast" part: if you are (or can be) on Databricks, check this out. It is just out of the oven at Databricks.
https://www.databricks.com/blog/announcing-general-availability-liquid-clustering

From: Gera Shegalov
Sent: Wednesday, May 29, 2024 7:57 AM
To: Prem Sahoo
Cc: eab...@163.com; Vibhor Gupta; user@spark
Subject: Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

I agree with the previous answers that (if requirements allow it) it is much easier to just orchestrate a copy, either in the same app or synced externally. A long time ago, and not for a Spark app, we solved a similar use case via
https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-hdfs/ViewFs.html#Multi-Filesystem_I.2F0_with_Nfly_Mount_Points
It may work with Spark because it sits underneath the FileSystem API.

On Tue, May 21, 2024 at 10:03 PM Prem Sahoo <prem.re...@gmail.com> wrote:
> I am looking for a writer/committer optimization which can make the Spark write faster.

On Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote:
> Hi,
> I think you should write to HDFS, then copy the files (Parquet or ORC) from HDFS to MinIO.
> eabour

From: Prem Sahoo <prem.re...@gmail.com>
Date: 2024-05-22 00:38
To: Vibhor Gupta <vibhor.gu...@walmart.com>; user <user@spark.apache.org>
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way

On Tue, May 21, 2024 at 6:58 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> Hello Vibhor,
> Thanks for the suggestion. I am looking for other alternatives where the same dataframe can be written to two destinations without re-execution and without cache or persist.
> Can someone help me with scenario 2? How can I make Spark write to MinIO faster?

On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:

Hi Prem,

You can try writing to HDFS, then reading from HDFS and writing to MinIO. This will prevent duplicate transformation.

You can also try persisting the dataframe using the DISK_ONLY level.

Regards,
Vibhor

From: Prem Sahoo
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list <d...@spark.apache.org>
Subject: EXT: Dual Write to HDFS and MinIO in faster way

Hello Team,

I am planning to write to two data sources at the same time.

Scenario 1:
Writing the same dataframe to HDFS and MinIO without re-executing the transformations and without cache(). How can we make it faster?
Read the Parquet file, do a few transformations, and write to HDFS and MinIO. Here, for both writes, Spark needs to execute the transformations again. Do we know how we can avoid re-execution of the transformations without cache()/persist()?

Scenario 2:
I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes. Do we have any way to make this write faster? I don't want to repartition and write, as repartition has the overhead of shuffling.

Please provide some inputs.
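For reference, the Nfly mount point Gera links to is configured in core-site.xml along these lines. The mount-table name, parameter, path, and URIs below are illustrative; check the linked ViewFs documentation for the exact syntax and options:

```xml
<!-- core-site.xml: an Nfly mount replicating writes to /data across two filesystems -->
<property>
  <!-- minReplication=2 means a write must succeed on both targets -->
  <name>fs.viewfs.mounttable.global.linkNfly.minReplication=2./data</name>
  <value>hdfs://namenode:8020/data,s3a://bucket/data</value>
</property>
```

With this in place, a client writing through `viewfs://global/data` fans the write out to both URIs at the FileSystem layer, so the Spark job itself only writes once.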
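On the writer/committer optimization Prem asks about: with an S3-compatible store such as MinIO, the Hadoop S3A committers avoid the slow rename-based commit of the default FileOutputCommitter. A sketch of the relevant spark-defaults settings, assuming the spark-hadoop-cloud module is on the classpath (values are illustrative; see the Hadoop S3A committer documentation):

```
spark.hadoop.fs.s3a.committer.name        magic
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter
spark.hadoop.fs.s3a.fast.upload           true
spark.hadoop.fs.s3a.connection.maximum    200
```

The magic committer writes task output as multipart uploads that are only completed at job commit, eliminating the copy-and-delete "rename" that dominates commit time on object stores.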