Re: Spark File Output Committer algorithm for GCS

Mich Talebzadeh Mon, 17 Jul 2023 07:45:04 -0700

You said this Hive table was a managed table partitioned by date -->${TODAY}


How do you define your Hive managed table?

HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 17 Jul 2023 at 15:29, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> It is supported, or at least it doesn't error out for me. But the job took
> around 4 hours to finish.
>
> Interestingly, it took only 10 minutes to write the output to the staging
> directory; the rest of the time went into renaming the objects. That's the
> concern.
>
> This looks like a known issue with how Spark behaves on GCS, but I haven't
> found a workaround for it.
>
>
> On Mon, 17 Jul 2023 at 7:55 PM, Yeachan Park <yeachan...@gmail.com> wrote:
>
>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is
>> supported on GCS? IIRC it wasn't, but you could check with GCP support
>>
>>
>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev <dev.dipaya...@gmail.com>
>> wrote:
>>
>>> Thanks Jay,
>>>
>>> I will try that option.
>>>
>>> Any insight on the file committer algorithms?
>>>
>>> I tried the v2 algorithm but it's not improving the runtime. What's the
>>> best practice in Dataproc for dynamic partition updates in Spark?
>>>
>>>
>>> On Mon, 17 Jul 2023 at 7:05 PM, Jay <jayadeep.jayara...@gmail.com>
>>> wrote:
>>>
>>>> You can try increasing fs.gs.batch.threads and
>>>> fs.gs.max.requests.per.batch.
>>>>
>>>> The definitions for these flags are available here -
>>>> https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
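
One way to pass the two connector flags mentioned above to a Spark job is
through spark-submit, using Spark's "spark.hadoop." prefix for Hadoop
properties. The sketch below only builds the command-line arguments. The
helper name gcs_conf_flags and the numeric values are assumptions made for
illustration, not tuned recommendations; check the linked CONFIGURATION.md
for the defaults of your connector version.

```python
# Sketch: turn GCS connector Hadoop properties into spark-submit --conf flags.
# gcs_conf_flags is a hypothetical helper; the values below are placeholders.


def gcs_conf_flags(hadoop_props):
    """Return spark-submit arguments that set each Hadoop property.

    Spark copies any key that starts with "spark.hadoop." into the Hadoop
    Configuration, so the connector sees it under the bare property name.
    """
    flags = []
    for key, value in sorted(hadoop_props.items()):
        flags += ["--conf", f"spark.hadoop.{key}={value}"]
    return flags


props = {
    "fs.gs.batch.threads": 32,           # assumed value, for illustration only
    "fs.gs.max.requests.per.batch": 30,  # assumed value, for illustration only
}

# Print the flags so they can be pasted after "spark-submit".
print(" ".join(gcs_conf_flags(props)))
```

The same keys can also be set on the session builder or in cluster
properties; the command-line form is shown only because it needs no code
change in the job.
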
>>>>
>>>> On Mon, 17 Jul 2023 at 14:59, Dipayan Dev <dev.dipaya...@gmail.com>
>>>> wrote:
>>>>
>>>>> No, I am using Spark 2.4 to update the GCS partitions. I have a
>>>>> managed Hive table on top of this.
>>>>> [image: image.png]
>>>>> When I do a dynamic partition update from Spark, it creates the new files
>>>>> in a staging area, as shown here.
>>>>> But the GCS blob renaming takes a lot of time. The table is partitioned
>>>>> by date and I need to update around 3 years of data. It usually
>>>>> takes 3 hours to finish the process. Is there any way to speed this up?
>>>>> With Best Regards,
>>>>>
>>>>> Dipayan Dev
>>>>>
>>>>> On Mon, Jul 17, 2023 at 1:53 PM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> So you are using GCP and your Hive is installed on Dataproc which
>>>>>> happens to run your Spark as well. Is that correct?
>>>>>>
>>>>>> What version of Hive are you using?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>
>>>>>> On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> Of late, I have run into an issue when overwriting a lot of
>>>>>>> partitions of a Hive table through Spark. It looks like writing to the
>>>>>>> hive_staging_directory takes 25% of the total time, whereas 75% or more
>>>>>>> of the time goes into moving the ORC files from the staging directory to
>>>>>>> the final partitioned directory structure.
>>>>>>>
>>>>>>> I found a reference that suggests setting this config during
>>>>>>> the Spark write:
>>>>>>> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>>>>>>>
>>>>>>> However, the same reference says this is not safe, as a partial job
>>>>>>> failure might cause data loss.
>>>>>>>
>>>>>>> Is there any suggestion on the pros and cons of using this version?
>>>>>>> Or any ongoing Spark feature development to address this issue?
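
The trade-off asked about above can be illustrated with a small local
simulation. In the v1 algorithm, task commit parks output under a temporary
directory and job commit later moves everything into the destination; in v2,
task commit moves files directly into the destination. The sketch below is a
simplified model under stated assumptions: the directory layout, function
names and file names are hypothetical, and it runs on a local filesystem,
where rename is cheap. On an object store such as GCS a rename is carried
out as a copy plus delete per object, which is why this phase can dominate.
Whether the Hive staging move in this thread goes through this committer at
all is not established here.

```python
import os
import shutil
import tempfile


def commit_task(dest, task_id, files, version):
    """Write a task's files, then commit them the way v1 or v2 would."""
    attempt = os.path.join(dest, "_temporary", "attempt_" + task_id)
    os.makedirs(attempt)
    for name in files:
        with open(os.path.join(attempt, name), "w") as handle:
            handle.write("data")
    if version == 1:
        # v1: park the output in a per-task directory that readers ignore.
        os.rename(attempt, os.path.join(dest, "_temporary", "task_" + task_id))
    else:
        # v2: move every file straight into the destination, visible at once.
        for name in files:
            os.rename(os.path.join(attempt, name), os.path.join(dest, name))
        os.rmdir(attempt)


def commit_job(dest, version):
    """v1 moves all parked task output serially; v2 only cleans up."""
    temp = os.path.join(dest, "_temporary")
    if version == 1:
        for task_dir in sorted(os.listdir(temp)):
            if not task_dir.startswith("task_"):
                continue  # skip leftovers from uncommitted attempts
            for name in os.listdir(os.path.join(temp, task_dir)):
                os.rename(os.path.join(temp, task_dir, name),
                          os.path.join(dest, name))
    shutil.rmtree(temp)


def visible_files(dest):
    """Files a reader of the destination would see (hidden dirs excluded)."""
    return sorted(n for n in os.listdir(dest) if not n.startswith("_"))


def run_until_failure(version):
    """Commit one task, then pretend the job dies before job commit."""
    dest = tempfile.mkdtemp()
    commit_task(dest, "0", ["part-0.orc"], version)
    return visible_files(dest)


def run_to_completion(version):
    """Commit two tasks and the job; both versions end in the same state."""
    dest = tempfile.mkdtemp()
    commit_task(dest, "0", ["part-0.orc"], version)
    commit_task(dest, "1", ["part-1.orc"], version)
    commit_job(dest, version)
    return visible_files(dest)


print(run_until_failure(1))  # [] : nothing visible until job commit
print(run_until_failure(2))  # ['part-0.orc'] : partial output is visible
print(run_to_completion(1) == run_to_completion(2))  # True
```

In this model v2 saves the serial move at job commit, at the cost of
exposing partial output if the job fails midway, which matches the safety
concern raised above.
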
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> With Best Regards,
>>>>>>>
>>>>>>> Dipayan Dev
>>>>>>>
>>>>>> --
>>>
>>>
>>>
>>> With Best Regards,
>>>
>>> Dipayan Dev
>>> Author of *Deep Learning with Hadoop
>>> <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
>>> M.Tech (AI), IISc, Bangalore
>>>
>> --
>
