Hi all,

Running on Dataproc 2.0/1.3/1.4, we use the INSERT OVERWRITE command to insert
new (time) partitions into existing Hive tables, but we see many failures
coming from org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles, which is
where the driver moves the successfully written files from the staging
directory to the final directory.
For some reason, the underlying FS implementation - GoogleCloudStorageImpl in
this case - fails to move at least one file, and the exception is propagated
all the way up. We see many different failures - hadoop.FileSystem.mkdirs,
rename, etc. - all originating in Hive.replaceFiles().
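
For context, here is a minimal sketch of the kind of write that triggers this
(table and column names are made up, not our actual schema):

    // Dynamic partition overwrite into an existing Hive table. On commit the
    // driver goes through Hive.loadDynamicPartitions -> Hive.replaceFiles to
    // move the written files from the staging dir to the final partition dir.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("overwrite-partition-example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql(
      """INSERT OVERWRITE TABLE events PARTITION (dt)
        |SELECT id, payload, dt
        |FROM staging_events
        |WHERE dt = '2024-01-01'""".stripMargin)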
I understand that FS failures are to be expected, but nowhere in
org.apache.hadoop.hive.ql.metadata.Hive.loadDynamicPartitions is there a
try/catch/retry mechanism. Is the FS implementation expected to handle
retries itself? The GCS connector does not seem to do that :)
We ended up patching and rebuilding the hive-exec jar as an immediate
mitigation (try/catch/retry), while our platform teams are reaching out to
GCP support.
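
For anyone curious, the patch is conceptually just a retry wrapper around the
file move. Below is a rough Scala sketch of the idea only - the real change is
in the Java code of Hive.replaceFiles, and the attempt count and backoff here
are arbitrary values:

    import java.io.IOException
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Retry a FileSystem.rename a few times before giving up.
    def renameWithRetry(fs: FileSystem, src: Path, dst: Path,
                        maxAttempts: Int = 3, backoffMs: Long = 2000L): Unit = {
      var attempt = 0
      var done = false
      while (!done) {
        attempt += 1
        try {
          if (!fs.rename(src, dst)) {
            throw new IOException(s"rename returned false: $src -> $dst")
          }
          done = true
        } catch {
          case e: IOException if attempt < maxAttempts =>
            // Transient GCS errors frequently succeed on a later attempt.
            Thread.sleep(backoffMs * attempt)
        }
      }
    }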

I know this is more of a Hive issue than a Spark one, but I still wonder
whether anybody has encountered this, or something similar?

Thanks,
Shay
