So you are using GCP, and your Hive is installed on Dataproc, which also
runs your Spark. Is that correct?

What version of Hive are you using?

HTH


Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


View my LinkedIn profile:
https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/

https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 17 Jul 2023 at 09:16, Dipayan Dev <dev.dipaya...@gmail.com> wrote:

> Hi All,
>
> Of late, I have encountered an issue where I have to overwrite a lot of
> partitions of a Hive table through Spark. It looks like writing to the
> hive_staging_directory takes 25% of the total time, while 75% or more of
> the time goes into moving the ORC files from the staging directory to the
> final partitioned directory structure.
>
> I came across some references suggesting the use of this config during
> the Spark write:
> *mapreduce.fileoutputcommitter.algorithm.version = 2*
>
> However, it's also mentioned that this is not safe, as a partial job
> failure might cause data loss.
>
> Are there any suggestions on the pros and cons of using this version? Or
> is there any ongoing Spark feature development to address this issue?
>
>
>
> With Best Regards,
>
> Dipayan Dev
>
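
For reference, the committer setting mentioned in the quoted message is
normally passed through the Spark session's Hadoop configuration. Below is
a minimal sketch, assuming Spark 3.x with Hive support on Dataproc; the
database and table names are hypothetical:

  import org.apache.spark.sql.SparkSession

  object PartitionOverwriteSketch {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("partition-overwrite-sketch")
        .enableHiveSupport()
        // v2 lets tasks commit their files directly into the final output
        // directory, skipping the job-level rename of the whole staging
        // tree. The trade-off raised above: a failed job can leave partial
        // output behind instead of being cleaned up atomically.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        // Limits the overwrite to partitions present in the incoming
        // DataFrame (Spark 2.3+) rather than the whole table.
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .getOrCreate()

      // Hypothetical source; any DataFrame matching the target's schema works.
      val df = spark.table("staging_db.events_updates")

      // Overwrite the matching partitions of the partitioned ORC table.
      df.write
        .mode("overwrite")
        .insertInto("warehouse_db.events_partitioned")

      spark.stop()
    }
  }

One caveat worth noting: if the warehouse sits on GCS, a directory "rename"
is really a copy-and-delete of every object, which is a large part of why
the commit phase dominates on Dataproc regardless of the committer version.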
