Hi Tianyi HE,

Thanks for sharing this!

This reminds me of the httpfs daemon. This daemon basically sits in front of an HDFS cluster and accepts requests, which it serves by forwarding them to the underlying HDFS instance. There is some documentation about it here:
https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/index.html
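To give a rough idea of what a client request to httpfs looks like, here is a minimal Python sketch that only builds the request URL. httpfs speaks the WebHDFS REST API under the /webhdfs/v1 prefix. The host name httpfs.example.com, the user alice, and the helper name httpfs_url are made up for illustration, and 14000 is assumed to be the port the daemon listens on (the usual default; your deployment may differ).

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style REST URL for an httpfs gateway.

    host, user and port are assumptions about the deployment;
    op is a WebHDFS operation name such as LISTSTATUS or OPEN.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # Every request carries the operation; user.name is used for
    # simple (pseudo) authentication.
    query = urlencode({"op": op, "user.name": user, **params})
    # quote() escapes characters such as spaces but keeps '/' intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # List a directory through the (hypothetical) gateway.
    print(httpfs_url("httpfs.example.com", "/user/alice", "LISTSTATUS", "alice"))
```

Any language with an HTTP client can issue a GET on such a URL, which is why no thick client is needed.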
Since httpfs uses an org.apache.hadoop.fs.FileSystem instance, it seems like you could plug in the org.apache.hadoop.fs.viewfs.ViewFileSystem class and be up and running with federation. I haven't tried this, but I would expect it to work, unless there are bugs in ViewFS itself.

The big advantage of httpfs is that it provides a webhdfs-style REST interface. As you said, this kind of interface makes it simple to use any language with REST bindings, without worrying about using a thick client.

The big disadvantage of httpfs is that you must move both metadata and data operations through the httpfs daemon. This could become a performance bottleneck, and it seems like you are concerned about this bottleneck.

We also have webhdfs. Unlike httpfs, webhdfs doesn't require all the data to move through its daemon. With webhdfs, the client talks to DataNodes directly.

I wonder if extending httpfs or webhdfs would be a better approach than starting from scratch. There is a maintenance burden for adding new services and daemons. This was our motivation for removing hftp, for example. It's certainly something to think about.

best,
Colin

On Thu, Apr 28, 2016, at 17:55, 何天一 wrote:
> Hey guys,
>
> My associates have investigated HDFS federation recently, which turns
> out to be quite a good solution for improving scalability on the
> NameNode/DataNode side.
>
> However, we encountered some problems on the client side, since:
> A) For historical reasons, we use clients in multiple languages to
> access HDFS (e.g. python-snakebite, or perhaps libhdfs++). So we either
> implement multiple versions of ViewFS or give up the consistent view
> (which can be confusing to users).
> B) We have hadoop client configuration deployed on client nodes, which
> we do not have control over. Also, releasing a new configuration could
> be a really heavy operation, because it needs to be pushed to several
> thousand nodes while maintaining consistency (say a node is down
> throughout the operation, then comes back online; it could still
> possess a stale version of the configuration).
>
> So we intended to explore another solution to these problems, and came
> up with a proxy model.
> That is, build an RPC proxy in front of the NameNodes.
> All clients talk to the proxy when they need to consult a NameNode, then
> the proxy decides which NameNode the request should go to according to
> the mount table.
> This solved our problem. All clients are seamlessly upgraded with
> federation support.
> We open sourced the proxy recently: https://github.com/bytedance/nnproxy
> (BTW, all kinds of feedback are welcome)
>
> But there are still a few issues. For example, several modifications
> need to be made inside hadoop ipc to support rpc forwarding. We released
> the corresponding patches with the nnproxy project (
> https://github.com/bytedance/nnproxy/tree/master/hadoop-patches). But it
> would be better to have these merged into apache trunk. Does anyone
> think it's worthwhile?
>
>
> --
> Cheers,
> Tianyi HE
> (+86) 185 0042 4096