Hi everyone, I'd like to start a thread to discuss merging HDFS-8707, aka libhdfs++, into trunk. I originally sent a similar email out last October, but it sounds like it was buried by discussions about other feature merges that were going on at the time.
libhdfs++ is an HDFS client written in C++, designed to be used in applications written in non-JVM languages. In its current state it supports Kerberos-authenticated reads from HDFS and has been used in production clusters for over a year, so it has had a significant amount of burn-in time. The HDFS-8707 branch has been around for about two years now, so I'd like to hear people's thoughts on what it would take to merge the current branch and handle writes and encrypted reads in a new one.

Current notable features:

- A libhdfs/libhdfs3-compatible C API that lets libhdfs++ serve as a drop-in replacement for clients that only need read support (until libhdfs++ also supports writes). There's a short example at the end of this email.

- An asynchronous C++ API, with synchronous shims on top for client applications that want to do blocking operations (sketched at the end of this email). Internally a single thread (optionally more) uses select/epoll by way of boost::asio to watch thousands of sockets without the overhead of spawning threads to emulate async operation.

- Kerberos/SASL authentication support.

- HA namenode support.

- A set of utility programs that mirror the HDFS CLI utilities, e.g. "./hdfs dfs -chmod". The major benefit of these is that tool startup time is roughly three orders of magnitude faster (<1 ms vs. hundreds of ms) and they occupy a lot less memory since they aren't dealing with the JVM. This makes it possible to write a simple bash script that stats a file, applies some rules to the result, and decides if it should move it, in a way that scales to thousands of files without being penalized with O(N) JVM startups.

- Cancelable reads (also sketched below). These have proven to be very useful in multiuser applications that (pre)fetch large blocks of data but need to remain responsive for interactive users. Rather than waiting for a large and/or slow read to finish, a canceled read returns immediately and the associated resources (buffer, file descriptor) become available for the rest of the application to use.

There are a couple of known issues: the doc build isn't integrated with the rest of Hadoop, and the public API headers aren't being exported when building a distribution. A short-term workaround for the missing docs is to go through the libhdfs(3)-compatible API and use the libhdfs docs. Other than a few modifications to the pom files to integrate the build, the changes are isolated to a new directory, so the chance of causing regressions in the rest of the code is minimal.

Please share your thoughts, thanks!
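
P.S. A few sketches for anyone who hasn't looked at the branch. First, to make the "drop-in replacement" point concrete, here is a minimal read through the libhdfs-compatible C API. These are the standard hdfs.h entry points the JNI-based libhdfs already exposes; the only thing that changes is the library you link against. Error handling is mostly omitted for brevity.

    #include <fcntl.h>   // O_RDONLY
    #include <stdio.h>
    #include "hdfs.h"    // same header the JNI-based libhdfs ships

    int main() {
      // Connect using the defaultFS from the loaded configuration.
      hdfsFS fs = hdfsConnect("default", 0);
      if (!fs) return 1;

      // bufferSize/replication/blocksize of 0 mean "use the defaults".
      hdfsFile file = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
      if (!file) return 1;

      char buf[4096];
      tSize nread = hdfsRead(fs, file, buf, sizeof(buf));
      printf("read %d bytes\n", (int)nread);

      hdfsCloseFile(fs, file);
      hdfsDisconnect(fs);
      return 0;
    }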
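Second, a rough sketch of the async C++ side. I'm writing this from memory rather than from the headers, so treat the names (hdfs::FileHandle, PositionRead, the Status-plus-callback shape) as approximate; the point is the callback style, and the fact that the synchronous shim is essentially a future parked on the callback:

    // Approximate shape of an async positional read in libhdfs++; exact
    // type and method names may differ from the branch. The single
    // asio-driven IO thread owns all the sockets, and completion is
    // delivered by invoking the callback, not by blocking the caller.
    #include <future>
    #include <memory>
    #include <cstdio>
    #include "hdfspp/hdfspp.h"  // header name from memory

    void read_block(hdfs::FileHandle *file, char *buf, size_t len, off_t offset) {
      auto done = std::make_shared<std::promise<size_t>>();

      file->PositionRead(buf, len, offset,
          [done](const hdfs::Status &status, size_t bytes_read) {
            // Runs on the IO thread when the read completes, fails, or is canceled.
            if (!status.ok())
              std::fprintf(stderr, "read failed\n");
            done->set_value(bytes_read);
          });

      // The synchronous shim amounts to exactly this: block the calling
      // thread on a future until the callback fires.
      size_t n = done->get_future().get();
      std::printf("read %zu bytes\n", n);
    }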
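Finally, cancellation. Again, approximate names, but the idea is that an interactive thread can bail out of an in-flight read and immediately reclaim the associated resources:

    // Hypothetical sketch: a coordinator/UI thread cancels a slow prefetch.
    // Canceled operations complete promptly with a "canceled" Status
    // instead of their data.
    void cancel_slow_prefetch(hdfs::FileHandle *file) {
      file->CancelOperations();  // in-flight reads on this handle return immediately
      // Once the callbacks have fired, the buffer and the underlying
      // socket/file descriptor are free for the rest of the application.
    }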