Did you check whether mapreduce.fileoutputcommitter.algorithm.version=2 is supported on GCS? IIRC it wasn't, but you could confirm with GCP support.
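
For reference, here is a minimal sketch of how the settings discussed in this thread could be turned into spark-submit flags. It relies on Spark's convention of copying any property prefixed with "spark.hadoop." into the job's Hadoop configuration. The helper name to_spark_conf_args and the numeric values are assumptions for illustration only, not tuned recommendations, and passing the committer setting does not confirm that the v2 algorithm is supported on GCS.

```python
# Sketch: build spark-submit "--conf" flags from Hadoop-level settings.
# Spark copies properties prefixed with "spark.hadoop." into the Hadoop
# Configuration used by the job.


def to_spark_conf_args(hadoop_settings):
    """Return a flat list of spark-submit arguments for the given settings.

    This is a hypothetical helper, not part of Spark or the GCS connector.
    """
    args = []
    # Sort keys so the generated command line is deterministic.
    for key, value in sorted(hadoop_settings.items()):
        args.extend(["--conf", f"spark.hadoop.{key}={value}"])
    return args


# Placeholder values (assumptions): measure before settling on any of them.
settings = {
    "mapreduce.fileoutputcommitter.algorithm.version": 2,
    "fs.gs.batch.threads": 30,
    "fs.gs.max.requests.per.batch": 30,
}

if __name__ == "__main__":
    # Prints the flags on one line, ready to paste after "spark-submit".
    print(" ".join(to_spark_conf_args(settings)))
```

Keeping the settings in one dictionary makes it easy to change a single value between runs and compare how long the rename phase takes.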

On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I tried the v2 algorithm, but it's not improving the runtime. What’s the
> best practice in Dataproc for dynamic updates in Spark?
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests.per.batch.
>>
>> The definitions for these flags are available here -
>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>
>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>>> Hive table on top of this.
>>> [image: image.png]
>>> When I do a dynamic partition update in Spark, it creates the new file
>>> in a staging area as shown here.
>>> But the GCS blob renaming takes a lot of time. I have a partition based
>>> on dates and I need to update around 3 years of data. It usually takes 3
>>> hours to finish the process. Is there any way to speed this up?
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> So you are using GCP and your Hive is installed on Dataproc, which
>>>> happens to run your Spark as well. Is that correct?
>>>>
>>>> What version of Hive are you using?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Of late, I have run into an issue where I have to overwrite a lot
>>>>> of partitions of a Hive table through Spark. It looks like writing to
>>>>> hive_staging_directory takes 25% of the total time, whereas 75% or
>>>>> more of the time goes into moving the ORC files from the staging
>>>>> directory to the final partitioned directory structure.
>>>>>
>>>>> I found a reference that recommends using this config during
>>>>> the Spark write:
>>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>
>>>>> However, it also mentions that this is not safe, as a partial job
>>>>> failure might cause data loss.
>>>>>
>>>>> Are there any suggestions on the pros and cons of using this version?
>>>>> Or any ongoing Spark feature development to address this issue?
>>>>>
>>>>> With Best Regards,
>>>>>
>>>>> Dipayan Dev
>>>>>
> --
>
> With Best Regards,
>
> Dipayan Dev
> Author of *Deep Learning with Hadoop*
> M.Tech (AI), IISc, Bangalore