[ https://issues.apache.org/jira/browse/HIVE-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar updated HIVE-15093:
--------------------------------
    Status: Patch Available  (was: Open)

> For S3-to-S3 renames, files should be moved individually rather than at a directory level
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-15093
>                 URL: https://issues.apache.org/jira/browse/HIVE-15093
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive
>    Affects Versions: 2.1.0
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>         Attachments: HIVE-15093.1.patch
>
>
> Hive's MoveTask uses the Hive.moveFile method to move data within a
> distributed filesystem as well as within blobstore filesystems.
> If the move is done within the same filesystem:
> 1: If the source path is a subdirectory of the destination path, files are
> moved one by one using a threadpool of workers.
> 2: If the source path is not a subdirectory of the destination path, a single
> rename operation is used to move the entire directory.
> The second option may not work well on blobstores such as S3. There, renames
> are not metadata operations and require copying all the data. Client
> connectors to blobstores may not rename directories efficiently. In the worst
> case, the connector copies each file one by one, sequentially, rather than
> using a threadpool of workers to copy the data (e.g. HADOOP-13600).
> Hive already has code to rename files using a threadpool of workers, but it
> only runs in case 1.
> This JIRA aims to modify the code so that case 1 is triggered when copying
> within a blobstore. The focus is on copies within a blobstore because
> needToCopy returns true if the src and target filesystems are different,
> in which case a different code path is triggered.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
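
The per-file approach described in case 1 can be sketched roughly as follows. This is only an illustration, written in Python for brevity rather than Hive's Java, and it uses the local filesystem as a stand-in for a blobstore. The names `move_files_individually`, `demo`, and `max_workers` are hypothetical and are not taken from Hive or from HIVE-15093.1.patch.

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


def move_files_individually(src_dir, dest_dir, max_workers=4):
    """Move each regular file in src_dir into dest_dir, one task per file.

    Hypothetical helper: a local-filesystem stand-in for issuing one
    rename per file against a blobstore instead of one directory rename.
    """
    os.makedirs(dest_dir, exist_ok=True)
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                shutil.move,
                os.path.join(src_dir, name),
                os.path.join(dest_dir, name),
            )
            for name in names
        ]
        for future in futures:
            future.result()  # re-raise the first failure, if any
    return names


def demo():
    """Create three files in a temporary source directory and move them."""
    root = tempfile.mkdtemp()
    try:
        src = os.path.join(root, "staging")
        dest = os.path.join(root, "table")
        os.makedirs(src)
        for i in range(3):
            with open(os.path.join(src, f"part-{i}"), "w") as fh:
                fh.write(f"row {i}\n")
        moved = move_files_individually(src, dest)
        return moved, sorted(os.listdir(dest)), os.listdir(src)
    finally:
        shutil.rmtree(root)


if __name__ == "__main__":
    print(demo())
```

On a store where a rename is really a copy, issuing the per-file operations from a pool lets the copies overlap, whereas a single directory-level rename leaves the ordering to the connector.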