Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
It does help performance, but not significantly. I am just wondering: once Spark creates that staging directory along with the SUCCESS file, can we simply run a gsutil rsync to move these files to the original directory? Has anyone tried this approach or does anyone foresee any concerns? On Mon, 17 Jul 2023 at
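A minimal sketch of what that manual sync might look like, assuming the gsutil CLI is available on the driver and is invoked from Scala via sys.process; the staging and target paths below are hypothetical placeholders, not the paths Spark actually generates:

    import scala.sys.process._

    // Hypothetical paths; the real staging directory is created by Spark/Hive at write time.
    val stagingDir = "gs://my-bucket/warehouse/my_table/.spark-staging-output"
    val targetDir  = "gs://my-bucket/warehouse/my_table"

    // Only sync once the _SUCCESS marker exists, so a partial write is never copied.
    val markerExists = Seq("gsutil", "-q", "stat", s"$stagingDir/_SUCCESS").! == 0
    if (markerExists) {
      // gsutil rsync copies objects in parallel (-m) and recurses into partition dirs (-r).
      val exitCode = Seq("gsutil", "-m", "rsync", "-r", stagingDir, targetDir).!
      require(exitCode == 0, s"gsutil rsync failed with exit code $exitCode")
    }

Whether this is safe depends on nothing else reading the target directory mid-sync, which is exactly the atomicity the committer's rename step is meant to provide.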

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Mich Talebzadeh
Spark has no role in creating that Hive staging directory. That directory belongs to Hive; Spark simply does ETL there, loading into the Hive managed table in your case, which ends up in the staging directory. I suggest that you review your design and use an external Hive table with an explicit location on
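For reference, a sketch of the external-table approach being suggested, assuming a SparkSession named spark with Hive support enabled; the table, columns, and bucket are hypothetical:

    // External Hive table with an explicit GCS location, so the final data path is
    // under your control rather than inside the Hive-managed warehouse/staging area.
    spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS sales_ext (
        order_id BIGINT,
        amount   DOUBLE
      )
      PARTITIONED BY (ds STRING)
      STORED AS PARQUET
      LOCATION 'gs://my-bucket/warehouse/sales_ext'
    """)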

Re: Spark File Output Committer algorithm for GCS

2023-07-18 Thread Dipayan Dev
Hi Mich, OK, my use case is a bit different. I have a Hive table partitioned by date and need to do dynamic partition updates (insert overwrite) daily for the last 30 days of partitions. The ETL inside the staging directories completes in barely 5 minutes, but then the renaming takes a lot of time as
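For context, a sketch of the write pattern being described, using Spark's dynamic partition overwrite mode and assuming a SparkSession named spark with Hive support; the table and column names are hypothetical:

    import org.apache.spark.sql.functions.{col, current_date, date_sub, to_date}

    // Overwrite only the partitions present in the incoming data (last 30 days),
    // leaving all older partitions untouched.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.conf.set("hive.exec.dynamic.partition", "true")
    spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

    val updates = spark.table("staging_updates")
      .filter(to_date(col("ds")) >= date_sub(current_date(), 30))

    updates.write
      .mode("overwrite")
      .insertInto("sales_table")  // partition column ds must be the last column

The slow step on GCS is the per-object "rename" of the committed files into these partition directories, since GCS renames are copy-then-delete rather than a cheap metadata move.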