Regarding making the Spark writer faster: if you are (or can be) on Databricks,
check this out. It is fresh out of the oven at Databricks.

https://www.databricks.com/blog/announcing-general-availability-liquid-clustering
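
For context, liquid clustering is declared with CLUSTER BY when the Delta table
is created; a minimal sketch (table and column names are made up, and a runtime
that supports liquid clustering is assumed):

    spark.sql("""
      CREATE TABLE events (event_id BIGINT, event_ts TIMESTAMP, payload STRING)
      USING DELTA
      CLUSTER BY (event_ts)  -- clustering key; pick columns you filter on
    """)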



________________________________
From: Gera Shegalov <ger...@gmail.com>
Sent: Wednesday, May 29, 2024 7:57:56 am
To: Prem Sahoo <prem.re...@gmail.com>
Cc: eab...@163.com <eab...@163.com>; Vibhor Gupta <vibhor.gu...@walmart.com>; 
user @spark <user@spark.apache.org>
Subject: Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

I agree with the previous answers that (if requirements allow it) it is much 
easier to just orchestrate a copy either in the same app or sync externally.

A long time ago, and not for a Spark app, we solved a similar use case via
https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-hdfs/ViewFs.html#Multi-Filesystem_I.2F0_with_Nfly_Mount_Points
. It may work with Spark because it sits underneath the FileSystem API ...
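
Roughly, an Nfly mount wired up through Spark's Hadoop configuration might look
like the sketch below (mount-table name, URIs, and paths are made up; check the
linked doc for the exact property syntax). Assumes spark and df are already in
scope:

    // Every write to viewfs://nfly/data fans out to all filesystems underneath.
    val hconf = spark.sparkContext.hadoopConfiguration
    hconf.set(
      "fs.viewfs.mounttable.nfly.linkNfly../data",
      "hdfs://namenode:8020/data,s3a://mybucket/data")

    df.write.parquet("viewfs://nfly/data/out")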



On Tue, May 21, 2024 at 10:03 PM Prem Sahoo 
<prem.re...@gmail.com> wrote:
I am looking for a writer/committer optimization which can make the Spark write
faster.
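
For context, the kind of committer change I mean is e.g. the S3A "magic"
committer, which avoids the rename-as-copy penalty of the default
FileOutputCommitter on object stores. A rough sketch, assuming the
spark-hadoop-cloud module is on the classpath (property names are from the
Spark/hadoop-aws cloud-integration docs):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Use the S3A "magic" committer instead of rename-based commits.
      .config("spark.hadoop.fs.s3a.committer.name", "magic")
      .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
      // Bind Spark's commit protocol to the Hadoop PathOutputCommitter.
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()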

On Tue, May 21, 2024 at 9:15 PM eab...@163.com <eab...@163.com> wrote:
Hi,
    I think you should write to HDFS first and then copy the files (Parquet or
ORC) from HDFS to MinIO.
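
For the copy step, hadoop distcp can move the files in parallel; something like
this (endpoint, bucket, and paths are placeholders; the s3a properties are the
usual hadoop-aws settings for MinIO):

    hadoop distcp \
      -Dfs.s3a.endpoint=http://minio.example.com:9000 \
      -Dfs.s3a.path.style.access=true \
      hdfs://namenode:8020/warehouse/out s3a://mybucket/out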

________________________________
eabour

From: Prem Sahoo <prem.re...@gmail.com>
Date: 2024-05-22 00:38
To: Vibhor Gupta <vibhor.gu...@walmart.com>; user <user@spark.apache.org>
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way


On Tue, May 21, 2024 at 6:58 AM Prem Sahoo 
<prem.re...@gmail.com> wrote:
Hello Vibhor,
Thanks for the suggestion.
I am looking for other alternatives where the same DataFrame can be written to
two destinations without re-execution and without cache or persist.

Can someone help me with scenario 2?
How can I make Spark write to MinIO faster?
Sent from my iPhone

On May 21, 2024, at 1:18 AM, Vibhor Gupta 
<vibhor.gu...@walmart.com> wrote:


Hi Prem,

You can try writing to HDFS first, then reading it back from HDFS and writing
to MinIO.

This avoids re-executing the transformations for the second write.

You can also try persisting the DataFrame using the DISK_ONLY storage level.
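
A sketch of both options (paths and the filter are placeholders for your real
job):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.storage.StorageLevel

    val df = spark.read.parquet("hdfs:///warehouse/in")
      .filter(col("event_ts") > "2024-01-01")  // stand-in for your transformations

    // Option 1: materialize once on HDFS, then re-read for the MinIO copy.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/out")
    spark.read.parquet("hdfs:///warehouse/out")
      .write.mode("overwrite").parquet("s3a://mybucket/out")

    // Option 2 (alternative to option 1): keep computed partitions on local
    // disk so the second write does not recompute the lineage.
    df.persist(StorageLevel.DISK_ONLY)
    df.write.mode("overwrite").parquet("hdfs:///warehouse/out2")
    df.write.mode("overwrite").parquet("s3a://mybucket/out2")
    df.unpersist()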

Regards,
Vibhor
From: Prem Sahoo <prem.re...@gmail.com>
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list <d...@spark.apache.org>
Subject: EXT: Dual Write to HDFS and MinIO in faster way
Hello Team,
I am planning to write to two data sources at the same time.

Scenario 1:-

Write the same DataFrame to HDFS and MinIO without re-executing the
transformations and without cache(). How can we make this faster?

We read a Parquet file, apply a few transformations, and write the result to
both HDFS and MinIO.

Here, for both writes, Spark executes the transformations again. How can we
avoid re-executing the transformations without cache()/persist()?

Scenario 2:-
I am writing 3.2 GB of data to HDFS and MinIO, and it takes ~6 minutes.
Is there any way to make this write faster?

I don't want to repartition before writing, since repartitioning has the
overhead of a shuffle.
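
For example, would tuning the S3A upload path on the MinIO side help? A sketch
of what I mean (property names are standard hadoop-aws settings; the values are
guesses I would benchmark, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // Buffer multipart uploads on local disk instead of the heap.
      .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
      // Larger parts and more parallel uploads per output stream.
      .config("spark.hadoop.fs.s3a.multipart.size", "128M")
      .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "8")
      // More connections/threads for concurrent part uploads.
      .config("spark.hadoop.fs.s3a.threads.max", "64")
      .config("spark.hadoop.fs.s3a.connection.maximum", "128")
      .getOrCreate()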

Please provide some inputs.


