You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
A rename on GCS is a copy plus a delete per object, so the move out of the
staging directory is dominated by batched metadata requests; these two flags
control how many requests go into each batch and how many threads issue them.

The definitions for these flags are available here -
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
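
For example, something along these lines when building the session (an
untested sketch; the app name and the values are illustrative, and the right
numbers depend on your workload and connector version):

  import org.apache.spark.sql.SparkSession

  // Sketch only: raise the GCS connector's batch parallelism for the
  // FileSystem used by this Spark application.
  val spark = SparkSession.builder()
    .appName("partition-overwrite") // illustrative name
    .getOrCreate()

  val hadoopConf = spark.sparkContext.hadoopConfiguration
  // More threads executing batched metadata requests in parallel...
  hadoopConf.setInt("fs.gs.batch.threads", 32)
  // ...and more requests packed into each batch.
  hadoopConf.setInt("fs.gs.max.requests.per.batch", 30)

The same settings can be passed on spark-submit with the spark.hadoop.
prefix, e.g. --conf spark.hadoop.fs.gs.batch.threads=32.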

On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
> Hive table on top of this.
> When I do a dynamic partition update from Spark, it first creates the new
> files in a staging area.
> But the GCS blob renaming from the staging area to the final partitioned
> path takes a lot of time. The table is partitioned by date and I need to
> update around 3 years of data; it usually takes 3 hours to finish. Is
> there any way to speed this up?
> With Best Regards,
>
> Dipayan Dev
>
> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> So you are using GCP, and your Hive is installed on Dataproc, which also
>> runs your Spark. Is that correct?
>>
>> What version of Hive are you using?
>>
>> HTH
>>
>>
>> Mich Talebzadeh,
>> Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>> London
>> United Kingdom
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Of late, I have been running into an issue where I have to overwrite a
>>> lot of partitions of a Hive table through Spark. Writing to the
>>> hive_staging_directory takes about 25% of the total time, whereas 75% or
>>> more goes into moving the ORC files from the staging directory to the
>>> final partitioned directory structure.
>>>
>>> I have seen it suggested to use this config during the Spark write:
>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>
>>> However, it's also mentioned that this is not safe, as a partial job
>>> failure might cause data loss.
>>>
>>> Are there any suggestions on the pros and cons of using this version, or
>>> is there any ongoing Spark feature development to address this issue?
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>
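
On the *mapreduce.fileoutputcommitter.algorithm.version = 2* question in the
quoted thread: it can be enabled per job. A minimal sketch, reusing the
SparkSession from above:

  // Opt in to the v2 commit algorithm for this job only. Task output is
  // moved to the final location at task commit rather than at job commit,
  // which removes the second rename pass. The trade-off is that a job
  // failing part-way can leave partial output visible in the destination;
  // there is no atomic job-level commit to roll back.
  spark.sparkContext.hadoopConfiguration
    .setInt("mapreduce.fileoutputcommitter.algorithm.version", 2)

Equivalently on spark-submit:
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2. That
trade-off is exactly the data-loss risk mentioned in the thread.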
