1. I think a new interface would be good, as FileContext could do the
   same thing.
2. Using PathCapabilities probes should still be mandatory, as for
   FileContext it would depend on the back end (see the sketch after this
   list).
3. Whoever does this gets to specify what the API does and write the
   contract tests. Saying "just do what HDFS does" isn't enough, as it's
   not always clear even the HDFS team knows how much of that behaviour is
   intentional (rename, anyone?).
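
As a straw man, here is a minimal sketch of such an interface plus its
capability probe; the interface name and the capability string are mine
and purely illustrative, not an agreed API:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical interface; name and capability string are placeholders. */
    public interface LeaseRecoverable {
      /** Capability string to probe through hasPathCapability(). */
      String LEASE_RECOVERY = "fs.capability.lease.recovery";

      /** Start lease recovery; returns true once the file is closed. */
      boolean recoverLease(Path file) throws IOException;
    }

    // Caller side: probe the capability before downcasting.
    static boolean tryRecoverLease(FileSystem fs, Path path) throws IOException {
      if (fs.hasPathCapability(path, LeaseRecoverable.LEASE_RECOVERY)
          && fs instanceof LeaseRecoverable) {
        return ((LeaseRecoverable) fs).recoverLease(path);
      }
      return false;
    }

Probing per path rather than per class leaves room for stores where
support varies by back end, which is exactly the FileContext situation in
point 2.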


For any new API (a better rename, a better delete, ...) I would normally
insist on making it cloud friendly, with an extensible builder API and an
emphasis on asynchronous IO (see the openFile() example below). However,
this is existing code and does target HDFS and Ozone; pulling the existing
APIs up into a new interface seems the right thing to do here.
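
For reference, the openFile() builder in recent releases shows that
pattern: options go in through opt()/must(), and build() returns a future,
so an object store can start the open asynchronously. (The read policy
option name here is the 3.3.5 one; opt() keys a release doesn't recognise
are ignored.)

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.functional.FutureIO;

    static void readRandomly(FileSystem fs, Path path) throws IOException {
      try (FSDataInputStream in = FutureIO.awaitFuture(
          fs.openFile(path)
              .opt("fs.option.openfile.read.policy", "random")
              .build())) {
        in.seek(1024);   // assumes the file is longer than 1 KB
        in.read();
      }
    }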

I have a WIP project to build a shim library offering new FS APIs to older
Hadoop releases by way of reflection, so that we can get new APIs taken up
across projects where we cannot choreograph version updates across the
entire stack (hello parquet, spark, ...). My goal is to actually make this
a Hadoop-managed project, with its own release schedule. You could add an
equivalent of the new interface in here, which would then use reflection
behind the scenes to invoke the underlying HDFS methods when the FS client
has them; a sketch of that reflection path is below the link.

https://github.com/steveloughran/fs-api-shim
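
The reflective binding is roughly this shape; a hypothetical sketch only,
as the real shim would cache the Method lookup and do fuller error
translation:

    import java.io.IOException;
    import java.lang.reflect.InvocationTargetException;
    import java.lang.reflect.Method;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public final class RecoverLeaseShim {
      private RecoverLeaseShim() {}

      /** Invoke recoverLease() reflectively if the FS client has it. */
      public static boolean recoverLease(FileSystem fs, Path path)
          throws IOException {
        try {
          Method m = fs.getClass().getMethod("recoverLease", Path.class);
          return (Boolean) m.invoke(fs, path);
        } catch (NoSuchMethodException e) {
          throw new UnsupportedOperationException(
              "recoverLease not available on " + fs.getUri(), e);
        } catch (IllegalAccessException e) {
          throw new IOException(e);
        } catch (InvocationTargetException e) {
          // unwrap and rethrow the underlying filesystem exception
          Throwable cause = e.getCause();
          throw cause instanceof IOException
              ? (IOException) cause
              : new IOException(cause);
        }
      }
    }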

I've just added the vector IO API there; the next step is to copy over a
lot of the contract tests from hadoop-common and apply them through the
shim, to hadoop 3.2 and 3.3.0-3.3.5 (a sketch of what such a test looks
like is below). That testing against many backends is actually as tricky
as the reflection itself. However, without this library it is going to
take a long, long time for the open source applications to pick up the
higher-performance, cloud-ready APIs. Yes, those of us who can build the
entire stack can do it, but that gradually adds more divergence from the
open source libraries, reduces the test coverage overall and only
increases maintenance costs over time.
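
For the curious, a shim contract test would look roughly like this,
reusing the standard hadoop-common contract machinery; RecoverLeaseShim is
the hypothetical class sketched above:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.contract.AbstractFSContract;
    import org.apache.hadoop.fs.contract.AbstractFSContractTestBase;
    import org.apache.hadoop.fs.contract.ContractTestUtils;
    import org.apache.hadoop.fs.contract.hdfs.HDFSContract;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;

    public class ITestShimRecoverLease extends AbstractFSContractTestBase {

      @BeforeClass
      public static void setupCluster() throws IOException {
        HDFSContract.createCluster();
      }

      @AfterClass
      public static void teardownCluster() throws IOException {
        HDFSContract.destroyCluster();
      }

      @Override
      protected AbstractFSContract createContract(Configuration conf) {
        // swap in another store's contract to run the same test there
        return new HDFSContract(conf);
      }

      @Test
      public void testRecoverLeaseOnClosedFile() throws Throwable {
        FileSystem fs = getFileSystem();
        Path path = path("testRecoverLeaseOnClosedFile");
        ContractTestUtils.touch(fs, path);
        assertTrue("lease not recovered",
            RecoverLeaseShim.recoverLease(fs, path));
      }
    }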

steve

On Thu, 16 Mar 2023 at 20:56, Wei-Chiu Chuang <weic...@apache.org> wrote:

> Hi,
>
> Stephen and I are working on a project to make HBase run on Ozone.
>
> HBase, born out of the Hadoop project, depends on a number of
> HDFS-specific APIs, including recoverLease() and isInSafeMode(). The
> HBase community [1] strongly voiced that they don't want the project to
> have a direct dependency on additional FS implementations, due to
> dependency and vulnerability management concerns.
>
> To make this project successful, we're exploring options to push these
> APIs up to the FileSystem abstraction. Eventually, it would make HBase
> FS-implementation agnostic, and perhaps enable HBase to support other
> storage systems in the future.
>
> We'd use the PathCapabilities API to probe whether the underlying FS
> implementation supports these APIs, and would then invoke the
> corresponding FileSystem APIs. This is straightforward, but the
> FileSystem class would become bloated.
>
> Another option is to create a "RecoverableFileSystem" interface, and have
> both DistributedFileSystem (HDFS) and RootedOzoneFileSystem (Ozone)
> implement it. This way the impact on the Hadoop project and the
> FileSystem abstraction is even smaller.
>
> Thoughts?
>
> [1] https://lists.apache.org/thread/tcrp8vxxs3z12y36mpzx35txhpp7tvxv
>
