You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
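As a concrete illustration, here is one way those two connector properties could be passed to Spark as spark.hadoop.* configuration (a hedged sketch: the property names come from this thread, but the values of 30 are purely illustrative, not tuned recommendations):

```python
# Sketch: the GCS connector batch settings suggested above, expressed as
# the spark.hadoop.* properties you would hand to spark-submit or SparkConf.
# The values (30) are illustrative, not tuned recommendations.
gcs_tuning = {
    # Number of threads used to execute GCS batch requests
    "spark.hadoop.fs.gs.batch.threads": "30",
    # Maximum number of requests packed into a single batch request
    "spark.hadoop.fs.gs.max.requests.per.batch": "30",
}

# Equivalent spark-submit flags:
flags = " ".join(f"--conf {k}={v}" for k, v in gcs_tuning.items())
print(flags)
```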
The definitions for these flags are available here:
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md

On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
> Hive table on top of this.
>
> [image: image.png]
>
> When I do a dynamic partition update with Spark, it creates the new files
> in a staging area, as shown here. But the GCS blob renaming takes a lot
> of time. I have partitions based on dates, and I need to update around
> 3 years of data. It usually takes 3 hours to finish the process. Is there
> any way to speed this up?
>
> With Best Regards,
>
> Dipayan Dev
>
> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> So you are using GCP, and your Hive is installed on Dataproc, which
>> happens to run your Spark as well. Is that correct?
>>
>> What version of Hive are you using?
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Of late, I have encountered an issue where I have to overwrite a lot of
>>> partitions of a Hive table through Spark.
>>> It looks like writing to the hive_staging_directory takes 25% of the
>>> total time, whereas 75% or more of the time goes into moving the ORC
>>> files from the staging directory to the final partitioned directory
>>> structure.
>>>
>>> I found some references suggesting this config during the Spark write:
>>>
>>> mapreduce.fileoutputcommitter.algorithm.version = 2
>>>
>>> However, it's also mentioned that this is not safe, as a partial job
>>> failure might cause data loss.
>>>
>>> Is there any suggestion on the pros and cons of using this version? Or
>>> is there any ongoing Spark feature development to address this issue?
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
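For reference, the committer setting discussed in the quoted question can be passed to Spark the same way as any Hadoop property (a hedged sketch: the property name and its trade-off come from the thread above; whether v2 is acceptable depends on your tolerance for partially visible output on job failure):

```python
# Sketch: enabling the v2 FileOutputCommitter algorithm discussed above.
# v2 moves task output into the final location at task commit instead of
# job commit, skipping the second round of renames -- but, as noted in
# the thread, a partially failed job can leave partial data visible.
committer_conf = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}

# Equivalent spark-submit flag:
flag = " ".join(f"--conf {k}={v}" for k, v in committer_conf.items())
print(flag)
```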