Re: Another thought on client-side support of HDFS federation

Colin McCabe Mon, 02 May 2016 10:32:48 -0700

Hi Tianyi HE,

Thanks for sharing this!  This reminds me of the httpfs daemon.  This
daemon basically sits in front of an HDFS cluster and accepts requests,
which it serves by forwarding them to the underlying HDFS instance. 
There is some documentation about it here:
https://hadoop.apache.org/docs/stable/hadoop-hdfs-httpfs/index.html

Since httpfs uses an org.apache.hadoop.fs.FileSystem instance, it seems
like you could plug in the org.apache.hadoop.fs.viewfs.ViewFileSystem class
and be up and running with federation.  I haven't tried this, but I
would expect that it would work, unless there are bugs in ViewFS itself.
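
To make the mount-table idea concrete, here is a rough Python sketch of
the kind of path resolution a ViewFS-style mount table performs.  This is
not the ViewFS implementation; the NameNode hostnames, the MOUNT_TABLE
contents, and the resolve() helper are all made up for illustration, and
the longest-prefix rule is just one reasonable way to pick a mount point.

```python
# Hypothetical mount table: path prefix -> NameNode URI (names are made up).
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com:8020",
    "/data": "hdfs://nn2.example.com:8020",
}


def resolve(path: str, table: dict) -> str:
    """Return the full URI for `path` using the longest matching mount point."""
    best = None
    for mount in table:
        prefix = mount.rstrip("/")
        # Match whole path components only, so "/userdata" does not hit "/user".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best.rstrip("/")):
                best = mount
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    return table[best] + path


print(resolve("/user/alice/f.txt", MOUNT_TABLE))
# prints hdfs://nn1.example.com:8020/user/alice/f.txt
```

A path that no mount point covers raises KeyError here; a real deployment
would need to decide what to do in that case.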

The big advantage of httpfs is that it provides a webhdfs-style REST
interface.  As you said, this kind of interface makes it simple to use
any language with REST bindings, without worrying about using a thick
client.
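
As a small illustration of how thin such a client can be, here is a
Python sketch that only builds a webhdfs-style URL.  The /webhdfs/v1
prefix and the op and user.name query parameters come from the WebHDFS
REST API; the hostname, the port, and the webhdfs_url() helper are
assumptions made for the example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build a webhdfs-style REST URL for an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical endpoint; user.name is passed via ** because it contains a dot.
url = webhdfs_url("httpfs.example.com", 14000, "/user/alice/data.txt",
                  "OPEN", **{"user.name": "alice"})
print(url)
# prints http://httpfs.example.com:14000/webhdfs/v1/user/alice/data.txt?op=OPEN&user.name=alice
```

Any HTTP library can then issue a GET against that URL, which is the
whole client as far as reads are concerned.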

The big disadvantage of httpfs is that you must move both metadata and
data operations through the httpfs daemon.  This could become a
performance bottleneck.  It seems like you are concerned about this
bottleneck.

We also have webhdfs.  Unlike httpfs, webhdfs doesn't funnel all the
data through a single daemon.  With webhdfs, the NameNode redirects the
client, which then talks to DataNodes directly.

I wonder if extending httpfs or webhdfs would be a better approach than
starting from scratch.  There is a maintenance burden for adding new
services and daemons.  This was our motivation for removing hftp, for
example.  It's certainly something to think about.

best,
Colin

On Thu, Apr 28, 2016, at 17:55, 何天一 wrote:
> Hey guys,
> 
> My associates have investigated HDFS federation recently, which turns
> out to be quite a good solution for improving scalability on the
> NameNode/DataNode side.
> 
> However, we encountered some problems on the client side:
> A) For historical reasons, we use clients in multiple languages to access
> HDFS (e.g. python-snakebite, or perhaps libhdfs++). So we either
> implement multiple versions of ViewFS or give up the consistent view
> (which can be confusing to users).
> B) We have Hadoop client configuration deployed on client nodes, which we
> do not have control over. Also, releasing a new configuration could be a
> really heavy operation because it needs to be pushed to several thousand
> nodes while maintaining consistency (say a node is down throughout
> the operation, then comes back online; it could still hold a stale
> version of the configuration).
> 
> So we explored another solution to these problems, and came up
> with a proxy model.
> That is, build an RPC proxy in front of the NameNodes.
> All clients talk to the proxy when they need to consult a NameNode, then
> the proxy decides which NameNode the request should go to according to
> the mount table.
> This solved our problem. All clients are seamlessly upgraded with
> federation support.
> We open sourced the proxy recently: https://github.com/bytedance/nnproxy
> (BTW, all kinds of feedback are welcome)
> 
> But there are still a few issues. For example, several modifications
> need to be made inside Hadoop IPC to support RPC forwarding. We released
> the corresponding patches with the nnproxy project (
> https://github.com/bytedance/nnproxy/tree/master/hadoop-patches). But it
> would be better to have these merged into Apache trunk. Does anyone think
> it's worthwhile?
> 
> 
> -- 
> Cheers,
> Tianyi HE
> (+86) 185 0042 4096
