Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-20 Thread Vibhor Gupta
Hi Prem,

You can write to HDFS first, then read the result back from HDFS and write it to MinIO.

This avoids re-executing the transformations for the second write.

You can also try persisting the dataframe at the DISK_ONLY storage level.
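
A rough sketch of both options (the paths and session setup are placeholders, adjust for your cluster):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("dual-write").getOrCreate()

    // Placeholder paths.
    val hdfsOut  = "hdfs:///data/out/parquet"
    val minioOut = "s3a://my-bucket/data/out/parquet"

    val df = spark.read.parquet("hdfs:///data/in/parquet")
      .filter(col("value") > 0) // stand-in for your transformations

    // Option 1: write once to HDFS, then re-read the materialized files
    // and write them to MinIO. The transformations run only once; the
    // second job just copies finished Parquet data.
    df.write.mode("overwrite").parquet(hdfsOut)
    spark.read.parquet(hdfsOut).write.mode("overwrite").parquet(minioOut)

    // Option 2 (alternative to option 1): persist the computed partitions
    // to local disk so the second write reuses them instead of re-running
    // the lineage.
    df.persist(StorageLevel.DISK_ONLY)
    df.write.mode("overwrite").parquet(hdfsOut)
    df.write.mode("overwrite").parquet(minioOut)
    df.unpersist()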

Regards,
Vibhor
From: Prem Sahoo 
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list 
Subject: EXT: Dual Write to HDFS and MinIO in faster way
Hello Team,
I am planning to write to two data sources at the same time.

Scenario 1:-

I read a Parquet file, apply a few transformations, and write the resulting 
dataframe to both HDFS and MinIO, without cache().

Each write is a separate action, so Spark executes the transformations once 
per write. How can we avoid re-executing the transformations without 
cache()/persist()?
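
For concreteness, the pattern is roughly this (paths and the transformation are placeholders for mine):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().getOrCreate()

    val df = spark.read.parquet("hdfs:///data/in")
      .withColumn("flag", col("value") > 0) // stand-in for the transformations

    // Each write below is its own action, so the lineage above runs twice.
    df.write.mode("overwrite").parquet("hdfs:///data/out")
    df.write.mode("overwrite").parquet("s3a://my-bucket/data/out")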

Scenario 2:-
I am writing 3.2 GB of data to HDFS and MinIO, and it takes ~6 minutes.
Is there any way to make this write faster?

I don't want to repartition before writing, because repartitioning adds the 
overhead of a shuffle.
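
For what it is worth, the kind of tuning I am asking about looks like the sketch below. These settings are my assumption of the right knobs for MinIO over s3a:// (the magic committer also needs the spark-hadoop-cloud module on the classpath); please correct me if they are the wrong ones:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("minio-write")
      // Placeholder MinIO endpoint, with path-style access.
      .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      // S3A "magic" committer: commits via multipart-upload completion
      // instead of slow copy/rename on the object store.
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      // Allow more concurrent uploads per executor.
      .config("spark.hadoop.fs.s3a.connection.maximum", "100")
      .config("spark.hadoop.fs.s3a.threads.max", "64")
      .getOrCreate()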

Please provide some inputs.




Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
Hello Vibhor,
Thanks for the suggestion.
I am looking for an alternative where the same dataframe can be written to two 
destinations without re-execution and without cache() or persist().

Can someone help me with Scenario 2?
How can I make Spark write to MinIO faster?
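
One alternative I am evaluating (my own assumption, not something suggested so far): write the output once, then copy the finished files to the second store with the Hadoop FileSystem API, so nothing is recomputed and nothing is persisted. A sketch, with placeholder paths:

    import org.apache.hadoop.fs.{FileUtil, Path}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val df = spark.read.parquet("hdfs:///data/in") // plus transformations

    // Single execution of the lineage.
    df.write.mode("overwrite").parquet("hdfs:///data/out")

    // File-level copy of the already-materialized output; no Spark job,
    // no recomputation. Note this copy runs single-threaded on the
    // driver; `hadoop distcp` does the same copy in parallel.
    val conf = spark.sparkContext.hadoopConfiguration
    val src  = new Path("hdfs:///data/out")
    val dst  = new Path("s3a://my-bucket/data/out")
    FileUtil.copy(src.getFileSystem(conf), src,
                  dst.getFileSystem(conf), dst,
                  false /* deleteSource */, conf)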

> On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:
> [...]


Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc]

This is a question for the user list <https://spark.apache.org/community.html> 
or for Stack Overflow 
<https://stackoverflow.com/questions/tagged/apache-spark>. The dev list is for 
discussions related to the development of Spark itself.

Nick


> On May 21, 2024, at 6:58 AM, Prem Sahoo  wrote:
> [...]