Re: Spark File Output Committer algorithm for GCS

Yeachan Park Mon, 17 Jul 2023 07:26:10 -0700

Did you check whether mapreduce.fileoutputcommitter.algorithm.version=2 is
supported on GCS? IIRC it wasn't, but you could confirm with GCP support.
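
For reference, the properties mentioned in this thread can be passed at submit time. Below is a minimal sketch, assuming the job is launched with spark-submit; the numeric values and the script name job.py are placeholders rather than recommendations. It relies on Spark copying any property prefixed with "spark.hadoop." into the job's Hadoop configuration.

```python
# Sketch: build spark-submit "--conf" arguments for the settings
# discussed in this thread. Values below are placeholders (assumptions).

def spark_conf_args(hadoop_settings):
    """Turn Hadoop properties into spark-submit --conf arguments.

    Spark forwards properties prefixed with "spark.hadoop." to the
    Hadoop Configuration used by the job.
    """
    args = []
    for key, value in sorted(hadoop_settings.items()):
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


settings = {
    "mapreduce.fileoutputcommitter.algorithm.version": 2,
    "fs.gs.batch.threads": 30,           # placeholder value
    "fs.gs.max.requests.per.batch": 30,  # placeholder value
}

# "job.py" is a hypothetical application script.
print(" ".join(["spark-submit"] + spark_conf_args(settings) + ["job.py"]))
```

Setting the property this way only changes the configuration the job sees; whether the v2 algorithm actually helps on GCS is still the open question above.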



On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Thanks Jay,
>
> I will try that option.
>
> Any insight on the file committer algorithms?
>
> I tried the v2 algorithm, but it's not improving the runtime. What's the best
> practice in Dataproc for dynamic partition updates in Spark?
>
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com> wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests.per.batch.
>>
>> The definitions for these flags are available here -
>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
>>
>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> No, I am using Spark 2.4 to update the GCS partitions. I have a managed
>>> Hive table on top of this.
>>> When I do a dynamic partition update from Spark, it creates the new files
>>> in a staging area, as shown here.
>>> [image: image.png]
>>> But the GCS blob renaming takes a lot of time. The table is partitioned
>>> by date and I need to update around 3 years of data. It usually takes 3
>>> hours to finish the process. Is there any way to speed this up?
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>>
>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> So you are using GCP and your Hive is installed on Dataproc which
>>>> happens to run your Spark as well. Is that correct?
>>>>
>>>> What version of Hive are you using?
>>>>
>>>> HTH
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>> Solutions Architect/Engineering Lead
>>>> Palantir Technologies Limited
>>>> London
>>>> United Kingdom
>>>>
>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Lately, I have run into an issue when overwriting a large number of
>>>>> partitions of a Hive table through Spark. It looks like writing to
>>>>> hive_staging_directory takes about 25% of the total time, whereas 75% or
>>>>> more goes into moving the ORC files from the staging directory to the
>>>>> final partitioned directory structure.
>>>>>
>>>>> I found a reference that suggests setting this config during the
>>>>> Spark write:
>>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>
>>>>> However, it also mentions that this is not safe, as a partial job
>>>>> failure might cause data loss.
>>>>>
>>>>> Is there any suggestion on the pros and cons of using this version? Or
>>>>> any ongoing Spark feature development to address this issue?
>>>>>
>>>>>
>>>>>
>>>>> With Best Regards,
>>>>>
>>>>> Dipayan Dev
>>>>>
>>>> --
>
>
>
> With Best Regards,
>
> Dipayan Dev
> Author of *Deep Learning with Hadoop
> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
> M.Tech (AI), IISc, Bangalore
>
