[ https://issues.apache.org/jira/browse/HIVE-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668735#comment-15668735 ]
Sahil Takiar commented on HIVE-15215:
-------------------------------------

Here is the code that triggers the file-by-file delete (inside the {{Hive.java}} class):

{code}
replaceFiles(...) {
  ...
  FileSystem fs2 = oldPath.getFileSystem(conf);
  if (fs2.exists(oldPath)) {
    // Do not delete oldPath if:
    // - destf is subdir of oldPath
    //if ( !(fs2.equals(destf.getFileSystem(conf)) && FileUtils.isSubDir(oldPath, destf, fs2)))
    isOldPathUnderDestf = FileUtils.isSubDir(oldPath, destf, fs2);
    if (isOldPathUnderDestf) {
      // if oldPath is destf or its subdir, it should definitely be deleted, otherwise its
      // existing content might result in incorrect (extra) data.
      // But not sure why we changed not to delete the oldPath in HIVE-8750 if it is
      // not the destf or its subdir?
      oldPathDeleted = FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
    }
  }
  ...
}
{code}

> Files on S3 are deleted one by one in INSERT OVERWRITE queries
> --------------------------------------------------------------
>
>                 Key: HIVE-15215
>                 URL: https://issues.apache.org/jira/browse/HIVE-15215
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>            Reporter: Sahil Takiar
>
> When running {{INSERT OVERWRITE}} queries the files to overwrite are deleted
> one by one. The reason is that, by default, hive.exec.stagingdir is inside
> the target table directory.
> Ideally Hive would just delete the entire table directory, but it can't do
> that since the staging data is also inside the directory. Instead it deletes
> each file one by one, which is very slow.
> There are a few ways to fix this:
> 1: Move the staging directory outside the table location. This can be done by
> setting hive.exec.stagingdir to a different location when running on S3. It
> would be nice if users didn't have to explicitly set this when running on S3
> and things just worked out-of-the-box. My understanding is that
> hive.exec.stagingdir was only added to support HDFS encryption zones.
> Since S3 doesn't have encryption zones, there should be no problem with
> using the value of hive.exec.scratchdir to store all intermediate data
> instead.
> 2: Multi-thread the delete operations
> 3: See if the {{S3AFileSystem}} can expose some type of bulk delete op

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
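Option 2 above (multi-threading the delete operations) could be sketched roughly as follows. This is a hypothetical illustration, not Hive's implementation: local {{java.nio}} files stand in for the Hadoop {{FileSystem}}/{{S3AFileSystem}} calls, and the names {{ParallelDelete}} and {{deleteInParallel}} are invented for the example.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only: deletes files under a directory concurrently so
// that the slow per-object DELETE round trips (one per key on S3) overlap
// instead of running serially, as they do in the file-by-file loop above.
public class ParallelDelete {

    static void deleteInParallel(Path dir, int threads) throws Exception {
        // Collect the children first, then fan the deletes out over a pool.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path f : stream) {
                files.add(f);
            }
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Object>> pending = new ArrayList<>();
            for (Path f : files) {
                pending.add(pool.submit(() -> {
                    Files.delete(f); // would be one remote DELETE per key on S3
                    return null;
                }));
            }
            for (Future<Object> p : pending) {
                p.get(); // re-throws the first delete failure, if any
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("overwrite-target");
        for (int i = 0; i < 8; i++) {
            Files.createFile(dir.resolve("part-" + i));
        }
        deleteInParallel(dir, 4);
        try (DirectoryStream<Path> left = Files.newDirectoryStream(dir)) {
            System.out.println("files left: " + left.iterator().hasNext());
        } // prints "files left: false"
    }
}
```

A real fix inside Hive would presumably parallelize {{FileUtils.trashFilesUnderDir}} (or batch its calls) rather than use local files like this, but the blocking fan-out/join shape would be similar.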