Hi everyone, I'd like to start a thread to discuss merging HDFS-8707, aka libhdfs++, into trunk. I originally sent a similar email out last October, but it sounds like it was buried by discussions about other feature merges that were going on at the time.
libhdfs++ is an HDFS client written in C++, designed to be used in applications written in non-JVM languages. In its current state it supports Kerberos-authenticated reads from HDFS and has been used in production clusters for over a year, so it has had a significant amount of burn-in time. The HDFS-8707 branch has been around for about two years now, so I'd like to hear people's thoughts on what it would take to merge the current branch and handle writes and encrypted reads in a new one.

Current notable features:

- A libhdfs/libhdfs3-compatible C API that lets libhdfs++ serve as a drop-in replacement for clients that only need read support (until libhdfs++ also supports writes). There's a short example at the end of this email.

- An asynchronous C++ API, with synchronous shims on top for client applications that want to do blocking operations (sketched at the end of this email). Internally a single thread (optionally more) uses select/epoll by way of boost::asio to watch thousands of sockets without the overhead of spawning threads to emulate async operation.

- Kerberos/SASL authentication support.

- HA namenode support.

- A set of utility programs that mirror the HDFS CLI utilities, e.g. "./hdfs dfs -chmod". The major benefit of these is that tool startup time is roughly three orders of magnitude faster (<1 ms vs. hundreds of ms) and they occupy a lot less memory since they aren't dealing with the JVM. This makes it possible to write a simple bash script that stats a file, applies some rules to the result, and decides if it should move it, in a way that scales to thousands of files without being penalized with O(N) JVM startups.

- Cancelable reads (also sketched below). These have proven to be very useful in multiuser applications that (pre)fetch large blocks of data but need to remain responsive for interactive users. Rather than waiting for a large and/or slow read to finish, a canceled read returns immediately and the associated resources (buffer, file descriptor) become available for the rest of the application to use.

There are a couple of known issues: the doc build isn't integrated with the rest of Hadoop, and the public API headers aren't being exported when building a distribution. A short-term workaround for the missing docs is to go through the libhdfs(3)-compatible API and use the libhdfs docs. Other than a few modifications to the pom files to integrate the build, the changes are isolated to a new directory, so the chance of causing regressions in the rest of the code is minimal.

Please share your thoughts, thanks!
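
P.S. A few sketches for anyone who hasn't looked at the branch. First, to make the "drop-in replacement" point concrete, here is a minimal read through the libhdfs-compatible C API. These are the standard hdfs.h entry points the JNI-based libhdfs already exposes; the only thing that changes is the library you link against. Error handling is mostly omitted for brevity.

    #include <fcntl.h>   // O_RDONLY
    #include <stdio.h>
    #include "hdfs.h"    // same header the JNI-based libhdfs ships

    int main() {
      // Connect using the defaultFS from the loaded configuration.
      hdfsFS fs = hdfsConnect("default", 0);
      if (!fs) return 1;

      // bufferSize/replication/blocksize of 0 mean "use the defaults".
      hdfsFile file = hdfsOpenFile(fs, "/tmp/example.txt", O_RDONLY, 0, 0, 0);
      if (!file) return 1;

      char buf[4096];
      tSize nread = hdfsRead(fs, file, buf, sizeof(buf));
      printf("read %d bytes\n", (int)nread);

      hdfsCloseFile(fs, file);
      hdfsDisconnect(fs);
      return 0;
    }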
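Second, a rough sketch of the async C++ side. I'm writing this from memory rather than from the headers, so treat the names (hdfs::FileHandle, PositionRead, the Status-plus-callback shape) as approximate; the point is the callback style, and the fact that the synchronous shim is essentially a future parked on the callback:

    // Approximate shape of an async positional read in libhdfs++; exact
    // type and method names may differ from the branch. The single
    // asio-driven IO thread owns all the sockets, and completion is
    // delivered by invoking the callback, not by blocking the caller.
    #include <future>
    #include <memory>
    #include <cstdio>
    #include "hdfspp/hdfspp.h"  // header name from memory

    void read_block(hdfs::FileHandle *file, char *buf, size_t len, off_t offset) {
      auto done = std::make_shared<std::promise<size_t>>();

      file->PositionRead(buf, len, offset,
          [done](const hdfs::Status &status, size_t bytes_read) {
            // Runs on the IO thread when the read completes, fails, or is canceled.
            if (!status.ok())
              std::fprintf(stderr, "read failed\n");
            done->set_value(bytes_read);
          });

      // The synchronous shim amounts to exactly this: block the calling
      // thread on a future until the callback fires.
      size_t n = done->get_future().get();
      std::printf("read %zu bytes\n", n);
    }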
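Finally, cancellation. Again, approximate names, but the idea is that an interactive thread can bail out of an in-flight read and immediately reclaim the associated resources:

    // Hypothetical sketch: a coordinator/UI thread cancels a slow prefetch.
    // Canceled operations complete promptly with a "canceled" Status
    // instead of their data.
    void cancel_slow_prefetch(hdfs::FileHandle *file) {
      file->CancelOperations();  // in-flight reads on this handle return immediately
      // Once the callbacks have fired, the buffer and the underlying
      // socket/file descriptor are free for the rest of the application.
    }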