Hello Team,
I am planning to write to two datasources at the same time.

Scenario 1:-

Writing the same dataframe to HDFS and MinIO without re-executing the
transformations and without cache(). How can we make this faster?

Read a parquet file, apply a few transformations, and write the result to
both HDFS and MinIO.

Here, for each of the two writes Spark needs to execute the transformations
again. Is there a way to avoid re-execution of the transformations without
cache()/persist()?
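One common pattern for this (a sketch, not the only answer) is to materialize the transformed result once, by writing it to the first sink, and then copy the already-written files to the second sink as plain I/O, so Spark never runs the DAG a second time. The snippet below illustrates the idea with local directories standing in for HDFS and MinIO; `transform`, the paths, and the file names are made up for illustration:

```python
import shutil
import tempfile
from pathlib import Path

def transform(records):
    # stand-in for the real Spark transformations
    return [r.upper() for r in records]

base = Path(tempfile.mkdtemp())
hdfs_dir = base / "hdfs_out"    # stand-in for the HDFS output path
minio_dir = base / "minio_out"  # stand-in for the MinIO output path
hdfs_dir.mkdir()

# the transformation executes exactly once, for the first write
result = transform(["a", "b", "c"])
(hdfs_dir / "part-00000").write_text("\n".join(result))

# the second sink receives a byte-level copy of the materialized files,
# so nothing is recomputed
shutil.copytree(hdfs_dir, minio_dir)
```

In Spark terms the equivalent would be `df.write.parquet(<hdfs_path>)` followed by a bulk copy such as `hadoop distcp` from HDFS to the MinIO bucket (MinIO is S3-compatible, so an `s3a://` destination works): the lineage executes once, and the second "write" is pure data movement.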

Scenario 2:-
I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
Is there any way to make this write faster?

I don't want to repartition before writing, since repartition has the
overhead of shuffling.
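Before tuning partitions it may help to sanity-check the effective throughput those numbers imply; if it is far below what the network or disks can sustain, the bottleneck is more likely write parallelism or recomputation than shuffle cost. A back-of-the-envelope calculation, assuming the ~6 minutes covers 3.2 GiB written to each of the two sinks:

```python
# effective write rate implied by "3.2G in ~6 minutes" to two sinks
size_mib = 3.2 * 1024          # 3.2 GiB expressed in MiB
seconds = 6 * 60               # ~6 minutes
per_sink = size_mib / seconds  # effective rate per destination

print(f"{per_sink:.1f} MiB/s per sink, "
      f"{2 * per_sink:.1f} MiB/s aggregate")
# → 9.1 MiB/s per sink, 18.2 MiB/s aggregate
```

~9 MiB/s per sink is modest for a cluster, which suggests the time is dominated by re-running the transformations for the second write or by too few concurrent writer tasks, rather than by raw I/O.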

Please provide some inputs.
