+1 Let's get this done. We've had many false starts on a native HDFS client. This is a good base to build on. -C
On Wed, Feb 28, 2018 at 9:55 AM, Jim Clampffer <james.clampf...@gmail.com> wrote:
> Hi everyone,
>
> I'd like to start a thread to discuss merging the HDFS-8707 branch, a.k.a.
> libhdfs++, into trunk. I originally sent a similar email out last October,
> but it sounds like it was buried by discussions about other feature merges
> that were going on at the time.
>
> libhdfs++ is an HDFS client written in C++, designed to be used in
> applications written in non-JVM-based languages. In its current state it
> supports Kerberos-authenticated reads from HDFS and has been used in
> production clusters for over a year, so it has had a significant amount of
> burn-in time. The HDFS-8707 branch has been around for about 2 years now,
> so I'd like to know people's thoughts on what it would take to merge the
> current branch and handle writes and encrypted reads in a new one.
>
> Current notable features:
> - A libhdfs/libhdfs3-compatible C API that allows libhdfs++ to serve as a
>   drop-in replacement for clients that only need read support (until
>   libhdfs++ also supports writes).
> - An asynchronous C++ API with synchronous shims on top if the client
>   application wants to do blocking operations. Internally a single thread
>   (optionally more) uses select/epoll by way of boost::asio to watch
>   thousands of sockets without the overhead of spawning threads to emulate
>   async operation.
> - Kerberos/SASL authentication support.
> - HA namenode support.
> - A set of utility programs that mirror the HDFS CLI utilities, e.g.
>   "./hdfs dfs -chmod". The major benefit of these is that tool startup
>   time is ~3 orders of magnitude faster (<1 ms vs. hundreds of ms) and the
>   tools occupy a lot less memory since they aren't dealing with the JVM.
>   This makes it possible to do things like write a simple bash script that
>   stats a file, applies some rules to the result, and decides whether to
>   move it, in a way that scales to thousands of files without being
>   penalized with O(N) JVM startups.
> - Cancelable reads. This has proven to be very useful in multiuser
>   applications that (pre)fetch large blocks of data but need to remain
>   responsive for interactive users. Rather than waiting for a large and/or
>   slow read to finish, the read returns immediately and the associated
>   resources (buffer, file descriptor) become available for the rest of the
>   application to use.
>
> There are a couple of known issues: the doc build isn't integrated with
> the rest of Hadoop, and the public API headers aren't being exported when
> building a distribution. A short-term solution for the missing docs is to
> go through the libhdfs(3)-compatible API and use the libhdfs docs. Other
> than a few modifications to the pom files to integrate the build, the
> changes are isolated to a new directory, so the chance of causing any
> regressions in the rest of the code is minimal.
>
> Please share your thoughts, thanks!