Hello Vibhor,
Thanks for the suggestion.
I am looking for other alternatives where the same dataframe can be written to
two destinations without re-execution and without cache() or persist().

Can someone help me with scenario 2?
How can we make Spark write to MinIO faster?
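For what it's worth, the two approaches suggested below could be sketched roughly like this in PySpark. This is only a minimal sketch: the paths, bucket name, and the filter transformation are placeholders I made up, not anything from the thread, and it assumes the S3A connector is configured for MinIO.

```python
# Sketch of the two suggested approaches. All paths are hypothetical placeholders.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-write-sketch").getOrCreate()

# Stand-in for "read the parquet file and do a few transformations".
df = (spark.read.parquet("hdfs:///data/input")
      .filter("amount > 0"))  # placeholder transformation

# Option 1: write once to HDFS, then re-read the written files and copy
# them to MinIO (via the S3A connector), so the lineage runs only once.
df.write.mode("overwrite").parquet("hdfs:///data/output")
(spark.read.parquet("hdfs:///data/output")
      .write.mode("overwrite").parquet("s3a://bucket/output"))

# Option 2: persist the dataframe to disk so both writes reuse the
# materialized data instead of re-running the transformations.
df.persist(StorageLevel.DISK_ONLY)
df.write.mode("overwrite").parquet("hdfs:///data/output")
df.write.mode("overwrite").parquet("s3a://bucket/output")
df.unpersist()
```

Either way the data is materialized exactly once; the trade-off is an extra HDFS read in option 1 versus local disk spill in option 2.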

> On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:
> 
> 
> Hi Prem,
>  
> You can try to write to HDFS then read from HDFS and write to MinIO.
>  
> This will avoid re-executing the transformations.
>  
> You can also try persisting the dataframe using the DISK_ONLY level.
>  
> Regards,
> Vibhor
> From: Prem Sahoo <prem.re...@gmail.com>
> Date: Tuesday, 21 May 2024 at 8:16 AM
> To: Spark dev list <dev@spark.apache.org>
> Subject: EXT: Dual Write to HDFS and MinIO in faster way
> 
> 
> 
> Hello Team,
> I am planning to write to two datasources at the same time.
>  
> Scenario:-
>  
> Writing the same dataframe to HDFS and MinIO without re-executing the 
> transformations and without cache(). How can we make it faster?
>  
> Read a parquet file, do a few transformations, and write the result to HDFS 
> and MinIO.
>  
> Here, for both writes Spark needs to execute the transformations again. Do we 
> know how we can avoid re-execution of the transformations without 
> cache()/persist()?
>  
> Scenario 2:-
> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 mins.
> Do we have any way to make this write faster?
>  
> I don't want to repartition and write, as repartitioning has the overhead of 
> shuffling.
>  
> Please provide some inputs. 
>  
>  
