[ 
https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250993#comment-15250993
 ] 

Zhe Zhang commented on HDFS-9806:
---------------------------------

Thanks [~chris.douglas] for proposing the work and [~virajith] for the 
HDFS-9809 patch.

I wonder if we can simplify the problem to "allowing some HDFS *files* to be 
provided by an external storage system", while still satisfying most 
requirements. In most production clusters, over 95% of files are single-block, 
so having the staging logic at the file (vs. block) level should achieve most 
of the benefits of _using a small HDFS cluster to present a large amount of 
data_. I had some discussions with [~eddyxu] about the ideas below:

Let's first assume that we already have (or are going to have) an 
{{o.a.h.FileSystem}} connector to the external storage system -- S3, ADL, 
Aliyun, GCS, etc. I did a very small PoC patch based on {{ViewFS}}. The idea 
is simply to have a {{smallFS}} and a {{bigFS}}, and to use the {{smallFS}} 
(which is always an HDFS) as a cache. Writes always land on {{smallFS}} first 
-- hence supporting HDFS-level consistency guarantees such as hflush. 
Different write-back policies to {{bigFS}} can be added, such as 
write-through, write-back, and periodic (e.g. 30-second) flushing as in 
Linux. Read operations try {{smallFS}} first and fall back to {{bigFS}} on a 
miss (also storing a copy in {{smallFS}} on the miss). If we want the strong 
guarantee of _using HDFS as the storage platform_, we can also stage the data 
into {{smallFS}} first and then serve it to the application, like the Linux 
page cache.
{code}
/** PoC: a caching FileSystem that layers an HDFS (smallFS) over an
 *  external store (bigFS). */
public class CacheFileSystem extends FileSystem {
  private final ChRootedFileSystem smallFS;
  private final ChRootedFileSystem bigFS;
  ...
  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    // Try the HDFS cache first; fall back to the external store on a miss.
    try {
      return smallFS.getFileStatus(f);
    } catch (FileNotFoundException e) {
      return bigFS.getFileStatus(f);
    }
  }

  @Override
  public FSDataOutputStream create(Path f,
      FsPermission permission,
      boolean overwrite,
      int bufferSize,
      short replication,
      long blockSize,
      Progressable progress) throws IOException {
    // Writes always land on the HDFS cache first.
    return smallFS.create(f, permission, overwrite,
        bufferSize, replication, blockSize, progress);
  }
  ...
}
{code}
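
As a rough illustration of the read path (this part is not in the PoC above, 
so take it as a sketch only), {{open}} could promote a missing file from 
{{bigFS}} into {{smallFS}} and then serve it from there, so later reads hit 
the cache:
{code}
  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    try {
      // Fast path: the file is already cached in the small HDFS cluster.
      return smallFS.open(f, bufferSize);
    } catch (FileNotFoundException e) {
      // Miss: copy the file from bigFS into smallFS, then serve it from
      // smallFS -- analogous to populating the Linux page cache.
      FileUtil.copy(bigFS, f, smallFS, f, false /* deleteSource */, getConf());
      return smallFS.open(f, bufferSize);
    }
  }
{code}
The synchronous whole-file copy is only for illustration; the promotion could 
be asynchronous, or limited to the byte ranges actually read.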

Of course if you want to present a {{DistributedFileSystem}} to applications, 
the logic will be more complex (e.g. more wrapping on the output from 
{{bigFS}}). But I think it's still simpler than breaking into NN and DN 
internals. This is basically how Alluxio / Tachyon handles the caching logic.
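
To give a flavor of where the extra logic shows up even at the plain 
{{FileSystem}} level, directory listings already need a merged view of the 
two namespaces. A sketch, continuing the {{CacheFileSystem}} class above 
(java.util imports omitted, untested):
{code}
  @Override
  public FileStatus[] listStatus(Path f) throws IOException {
    // Merge the two namespaces; the smallFS (cached) entry wins when a file
    // exists in both. For a DistributedFileSystem facade, the statuses coming
    // from bigFS would additionally need to be wrapped to look like HDFS ones.
    Map<String, FileStatus> merged = new LinkedHashMap<>();
    int misses = 0;
    try {
      for (FileStatus st : bigFS.listStatus(f)) {
        merged.put(st.getPath().getName(), st);
      }
    } catch (FileNotFoundException e) {
      misses++;
    }
    try {
      for (FileStatus st : smallFS.listStatus(f)) {
        merged.put(st.getPath().getName(), st);
      }
    } catch (FileNotFoundException e) {
      misses++;
    }
    if (misses == 2) {
      throw new FileNotFoundException(f.toString());
    }
    return merged.values().toArray(new FileStatus[merged.size()]);
  }
{code}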

Thoughts?

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>
> In addition to heterogeneous media, many applications work with heterogeneous 
> storage systems. The guarantees and semantics provided by these systems are 
> often similar, but not identical to those of 
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
>  Any client accessing multiple storage systems is responsible for reasoning 
> about each system independently, and must propagate and renew credentials for 
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to 
> immutable file regions, opaque IDs, or other tokens that represent a 
> consistent view of the data. While correctness for arbitrary operations 
> requires careful coordination between stores, in practice we can provide 
> workable semantics with weaker guarantees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
