[dev list to bcc]

This is a question for the user list <https://spark.apache.org/community.html> 
or for Stack Overflow 
<https://stackoverflow.com/questions/tagged/apache-spark>. The dev list is for 
discussions related to the development of Spark itself.

Nick


> On May 21, 2024, at 6:58 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> 
> Hello Vibhor,
> Thanks for the suggestion.
> I am looking for an alternative where the same dataframe can be written to 
> two destinations without re-execution and without cache() or persist().
> 
> Can someone help me with scenario 2?
> How can we make Spark write to MinIO faster?
> Sent from my iPhone
> 
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta <vibhor.gu...@walmart.com> wrote:
>> 
>> 
>> Hi Prem,
>>  
>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>  
>> This will prevent duplicate transformation.
>>  
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>  
>> Regards,
>> Vibhor
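
(The "write to HDFS, then read back and copy to MinIO" suggestion avoids recomputation because the second write consumes materialized bytes rather than re-running the dataframe's lineage. A minimal plain-Python sketch of that materialize-then-fan-out pattern, with local files standing in for HDFS/MinIO and a made-up `transform` function; this is not Spark API, just the shape of the idea:)

```python
import os
import shutil
import tempfile

calls = {"transform": 0}

def transform(rows):
    """Stand-in for an expensive Spark transformation chain; counts invocations."""
    calls["transform"] += 1
    return [r * 2 for r in rows]

tmp = tempfile.mkdtemp()
primary = os.path.join(tmp, "primary.txt")      # stands in for the HDFS sink
secondary = os.path.join(tmp, "secondary.txt")  # stands in for the MinIO sink

# A naive dual write would call transform(data) once per sink, re-running
# the whole lineage twice. Instead: compute once, write to the primary
# sink, then fan out to the second sink as a plain byte-level copy.
data = [1, 2, 3]
with open(primary, "w") as f:
    f.write("\n".join(str(x) for x in transform(data)))

shutil.copyfile(primary, secondary)  # second sink: copy only, no recompute

print(calls["transform"])  # -> 1: the transformation ran exactly once
```

(In Spark terms, the first write plays the role of `df.write.parquet(hdfs_path)`, and the copy plays the role of reading that parquet back, or using a distcp-style copy, to populate MinIO.)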
>> From: Prem Sahoo <prem.re...@gmail.com>
>> Date: Tuesday, 21 May 2024 at 8:16 AM
>> To: Spark dev list <dev@spark.apache.org>
>> Subject: EXT: Dual Write to HDFS and MinIO in faster way
>> 
>> Hello Team,
>> I am planning to write to two datasources at the same time.
>>  
>> Scenario:-
>>  
>> Writing the same dataframe to HDFS and MinIO without re-executing the 
>> transformations and without cache(). How can we make it faster?
>>  
>> Read the parquet file and do a few transformations and write to HDFS and 
>> MinIO.
>>  
>> Here, for both writes, Spark needs to execute the transformations again. 
>> Do we know how we can avoid re-executing the transformations without 
>> cache()/persist()?
>>  
>> Scenario2 :-
>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>> Is there any way to make this write faster?
>>  
>> I don't want to repartition before writing, as repartitioning has the 
>> overhead of a shuffle.
>>  
>> Please provide some inputs. 
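
(One documented lever for scenario 2: if the MinIO write goes through Hadoop's S3A connector, the rename-based commit phase is often the bottleneck, and the S3A committers are the usual fix. A sketch of the relevant `spark-defaults.conf` settings, based on the Spark cloud-integration and Hadoop S3A committer docs; verify the exact property names and classes against your Hadoop and Spark versions:)

```
spark.hadoop.fs.s3a.committer.name              magic
spark.hadoop.fs.s3a.committer.magic.enabled     true
spark.sql.sources.commitProtocolClass           org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class        org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```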
