Hi all,

Running on Dataproc image versions 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert new (time) partitions into existing Hive tables. But we see too many failures coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles, which is where the driver moves the successfully written files from the staging directory to the final directory. For some reason, the underlying FS implementation - GoogleCloudStorageImpl in this case - fails to move at least one file, and the exception is propagated all the way up. We see many different failures - from hadoop.FileSystem.mkdirs, rename, etc. - all coming from Hive.replaceFiles(). I guess FS failures are to be expected, but there is no try/catch/retry mechanism anywhere in org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions. Is the FS implementation expected to handle that? Because the GCS connector does not seem to do it :) As an immediate mitigation, we ended up patching and rebuilding the hive-exec jar (adding try/catch/retry), while our platform teams reach out to GCP support.
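For reference, here is a minimal sketch of the kind of retry wrapper we put around the FileSystem calls - hypothetical names (FsRetry, withRetry, moveWithRetry) and retry counts for illustration, not our exact hive-exec patch:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical retry helper, similar in spirit to our hive-exec patch.
    public final class FsRetry {

        @FunctionalInterface
        interface FsOp {
            boolean run() throws IOException;
        }

        // Retries a FileSystem operation a few times with linear backoff.
        static boolean withRetry(String what, FsOp op, int maxAttempts) throws IOException {
            IOException last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    if (op.run()) {
                        return true;
                    }
                    // Operation returned false (e.g. rename rejected); treat as retryable.
                } catch (IOException e) {
                    last = e;
                    System.err.println("Retryable failure on " + what
                        + " (attempt " + attempt + "): " + e.getMessage());
                }
                try {
                    Thread.sleep(1000L * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            if (last != null) {
                throw last;
            }
            return false;
        }

        // Example: the calls Hive.replaceFiles() makes, wrapped in retries.
        static void moveWithRetry(FileSystem fs, Path src, Path dst) throws IOException {
            withRetry("mkdirs " + dst.getParent(), () -> fs.mkdirs(dst.getParent()), 3);
            withRetry("rename " + src + " -> " + dst, () -> fs.rename(src, dst), 3);
        }
    }

We retry on a false return as well as on exceptions, since the GCS failures we saw were transient and rename can report failure either way.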
I know this is more of a Hive issue than a Spark one, but I still wonder whether anybody has encountered this, or something similar?

Thanks,
Shay