[ https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250993#comment-15250993 ]
Zhe Zhang commented on HDFS-9806:
---------------------------------

Thanks [~chris.douglas] for proposing the work and [~virajith] for the HDFS-9809 patch.

I wonder if we can simplify the problem to "allowing some HDFS *files* to be provided by an external storage system", while still satisfying most requirements. In most production clusters, over 95% of files are single-block, so having the staging logic at the file (vs. block) level should achieve most of the benefits of _using a small HDFS cluster to present a large amount of data_. I had some discussions with [~eddyxu] about the ideas below.

Let's first assume that we already have (or are going to have) an {{o.a.h.FileSystem}} connector to the external storage system -- S3, ADL, Aliyun, GCS, etc. I did a very small PoC patch based on {{ViewFS}}. The idea is simply to have a {{smallFS}} and a {{bigFS}}, and to use the {{smallFS}} (which is always an HDFS) as a cache. Writes always land on {{smallFS}} first -- hence supporting HDFS-level consistency semantics such as hflush. Different write-back policies to {{bigFS}} can be added, such as write-through, write-back, and 30-second periodic flushing as in Linux. Read operations try {{smallFS}} first and fall back to {{bigFS}} on a miss (also storing a copy in {{smallFS}} on the miss). If we want the stronger guarantee of _using HDFS as the storage platform_, we can also stage the data into {{smallFS}} first and then serve it to the application, like the Linux page cache.

{code}
public class CacheFileSystem extends FileSystem {
  private final ChRootedFileSystem smallFS;
  private final ChRootedFileSystem bigFS;
  ...
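  // A hedged sketch of the read path described above: try smallFS first,
  // and on a miss stage a copy into smallFS before serving it, so that
  // subsequent reads hit the cache. copyIntoSmallFS() is a hypothetical
  // helper (it could be built on o.a.h.fs.FileUtil.copy); it is not part
  // of the PoC patch.
  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    try {
      return smallFS.open(f, bufferSize);
    } catch (FileNotFoundException e) {
      // Miss in the cache: copy the file from bigFS into smallFS, then
      // serve the read from smallFS, like a page-cache fill.
      copyIntoSmallFS(f);
      return smallFS.open(f, bufferSize);
    }
  }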
  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    try {
      return smallFS.getFileStatus(f);
    } catch (FileNotFoundException e) {
      return bigFS.getFileStatus(f);
    }
  }

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    return smallFS.create(f, permission, overwrite, bufferSize, replication,
        blockSize, progress);
  }
  ...
}
{code}

Of course, if you want to present a {{DistributedFileSystem}} to applications, the logic will be more complex (e.g. more wrapping of the output from {{bigFS}}). But I think it's still simpler than reaching into NN and DN internals. This is basically how Alluxio / Tachyon handles its caching logic. Thoughts?

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>
> In addition to heterogeneous media, many applications work with heterogeneous
> storage systems. The guarantees and semantics provided by these systems are
> often similar, but not identical to, those of
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
> Any client accessing multiple storage systems is responsible for reasoning
> about each system independently, and must propagate and renew credentials for
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to
> immutable file regions, opaque IDs, or other tokens that represent a
> consistent view of the data. While correctness for arbitrary operations
> requires careful coordination between stores, in practice we can provide
> workable semantics with weaker guarantees.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)