[ https://issues.apache.org/jira/browse/HDFS-14978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang reassigned HDFS-14978:
--------------------------------------

    Assignee: Aravindan Vijayan  (was: Wei-Chiu Chuang)

> In-place Erasure Coding Conversion
> ----------------------------------
>
>                 Key: HDFS-14978
>                 URL: https://issues.apache.org/jira/browse/HDFS-14978
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: erasure-coding
>    Affects Versions: 3.0.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Aravindan Vijayan
>            Priority: Major
>         Attachments: In-place Erasure Coding Conversion.pdf
>
>
> HDFS Erasure Coding is a new feature added in Apache Hadoop 3.0. It uses
> encoding algorithms to reduce disk space usage while retaining the redundancy
> necessary for data recovery. It was a huge amount of work, but it is only now
> being adopted, almost 2 years later.
> One usability problem that’s blocking users from adopting HDFS Erasure Coding
> is that existing replicated files have to be copied to an EC-enabled
> directory explicitly. Renaming a file/directory to an EC-enabled directory
> does not automatically convert the blocks. Therefore users typically perform
> the following steps to erasure-code existing files:
> {noformat}
> Create $tmp directory, set an EC policy on it
> Distcp $src to $tmp
> Delete $src (rm -rf $src)
> mv $tmp $src
> {noformat}
> There are several reasons why this is not popular:
> * Complex. The process involves several steps: distcp data to a temporary
> destination; delete the source file; move the destination to the source path.
> * Availability. There is a short period where nothing exists at the source
> path, and jobs may fail unexpectedly.
> * Overhead. During the copy phase, there is a point in time where all of the
> source and destination files exist at the same time, exhausting disk space.
> * Not snapshot-friendly. If a snapshot is taken prior to performing the
> conversion, the source (replicated) files will be preserved in the cluster
> too. Therefore, the conversion actually increases storage space usage.
> * Not management-friendly. This approach changes the file's inode number,
> modification time and access time. Erasure coded files are supposed to store
> cold data, but this conversion makes data “hot” again.
> * Bulky. It’s either all or nothing. The directory may be partially erasure
> coded, but this approach simply erasure-codes everything again.
> To ease data management, we should offer a utility tool to convert replicated
> files to erasure coded files in-place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
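
For reference, the manual copy-based workflow quoted in the issue description (create $tmp, distcp, delete, mv) could be sketched roughly as below. This is only an illustration of the existing workaround, not the in-place tool the issue proposes. The function names, the example paths, and the choice of the RS-6-3-1024k policy are assumptions made for the sketch; it also assumes the policy is already enabled on the cluster, and it passes -update -skipcrccheck to distcp on the assumption that default checksums of replicated and striped files will not match. The sketch only builds and prints the commands unless dry_run is turned off.

```python
import shlex
import subprocess


def build_conversion_plan(src, tmp, policy="RS-6-3-1024k"):
    """Return the manual conversion steps as a list of argv lists.

    Hypothetical helper: `src`, `tmp` and `policy` are placeholders,
    and the EC policy is assumed to be enabled already.
    """
    return [
        # 1. Create $tmp and set an EC policy on it.
        ["hdfs", "dfs", "-mkdir", "-p", tmp],
        ["hdfs", "ec", "-setPolicy", "-path", tmp, "-policy", policy],
        # 2. Distcp $src to $tmp (-update copies the contents of src
        #    into the existing tmp directory).
        ["hadoop", "distcp", "-update", "-skipcrccheck", src, tmp],
        # 3. Delete $src. Nothing exists at the source path from here
        #    until step 4 completes.
        ["hdfs", "dfs", "-rm", "-r", src],
        # 4. Move $tmp to $src.
        ["hdfs", "dfs", "-mv", tmp, src],
    ]


def run_plan(plan, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for argv in plan:
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Example paths are placeholders; dry run only prints the commands.
    run_plan(build_conversion_plan("/data/warehouse", "/tmp/warehouse_ec"))
```

Laid out this way, the drawbacks listed in the description are easy to see: both copies coexist after step 2, and the source path is empty between steps 3 and 4.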