Steve Loughran and I have been discussing on Stack Overflow <https://stackoverflow.com/q/73503205> a way forward for removing the Winutils requirement from the local `FileSystem` implementations.

Hadoop's `FileSystem` API carries a lot of *nix assumptions which originally made it impossible to implement local file system access in pure Java. The current implementation essentially spawns shell processes that invoke *nix commands, e.g. to read permissions. To get this working on Windows, Steve created Winutils <https://github.com/steveloughran/winutils>, a sort of Windows back-door subsystem of binary executables which must be installed separately (think a tiny .NET) and which Hadoop can invoke as a substitute for the *nix calls. At the time it was no doubt a nifty quick workaround, but as a long-term solution it is horrible (for a long list of reasons which everyone here already knows, so I won't go into them; see HADOOP-13223 <https://issues.apache.org/jira/browse/HADOOP-13223> and HADOOP-17839 <https://issues.apache.org/jira/browse/HADOOP-17839>). There should be no need to install a separate set of executables maintained by a third party just to get Spark to write output to a local file on a Windows laptop, for example.

I have created the GlobalMentor Hadoop Bare Naked Local FileSystem <https://github.com/globalmentor/hadoop-bare-naked-local-fs>, an implementation of `FileSystem` for the local file system that extends `LocalFileSystem`/`RawLocalFileSystem` and "undoes" the Winutils code by using pure Java API calls instead. It is available on Maven, and using it with Spark is as simple as including it as a dependency and specifying the implementation in the configuration, e.g. programmatically:

```java
SparkSession spark = SparkSession.builder().appName("Foo Bar").master("local").getOrCreate();
spark.sparkContext().hadoopConfiguration().setClass("fs.file.impl", BareLocalFileSystem.class, FileSystem.class);
```
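The same setting should also work through Spark configuration rather than code, e.g. a `spark.hadoop.fs.file.impl` entry in `spark-defaults.conf` set to the fully qualified `BareLocalFileSystem` class name, since Spark copies `spark.hadoop.*` properties into its Hadoop configuration.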

But Bare Naked Local File System is not the end of the story.

 * Bare Naked Local File System v0.1.0 doesn't (yet) support symlinks
   or the sticky bit.
 * But the bigger issue is how to excise Winutils completely from the
   existing Hadoop code. Winutils assumptions are hard-coded at a low
   level across various classes, even in code that has nothing to do
   with the file system. The startup configuration for example calls
   `StringUtils.equalsIgnoreCase("true", valueString)`, which loads the
   `StringUtils` class, which has a static reference to `Shell`, which
   has a static block that checks for `WINUTILS_EXE`. (A simplified
   sketch of this static-initialization chain appears after this list.)
 * For the most part there should no longer even be a need for anything
   but direct Java API access for the local file system (a sketch of
   such calls also appears after this list). But muddling things
   further, the existing `RawLocalFileSystem` implementation has /four/
   ways to access the local file system: Winutils, JNI calls, shell
   access, and a "new" approach using "stat". The "stat" approach has
   been switched off with a hard-coded `useDeprecatedFileStatus = true`
   because of HADOOP-9652
   <https://issues.apache.org/jira/browse/HADOOP-9652>.
 * Local file access is not contained within `RawLocalFileSystem` but
   is scattered across other classes; `FileUtil.readLink()` for example
   (which `RawLocalFileSystem` calls because of the deprecation issue
   above) uses the shell approach without any option to change it.
   (This implementation-specific decision should have been contained
   within the `FileSystem` implementation itself.)
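
To make the coupling in the second bullet concrete, here is a minimal sketch of how such a static-initialization chain drags in the Winutils check. The `Shell` and `StringUtils` classes below are simplified stand-ins of my own, not the actual Hadoop sources, and the winutils probe is reduced to a print statement:

```java
/**
 * Simplified stand-ins (not the actual Hadoop sources) showing how a static
 * initializer chain pulls in the Winutils check: merely touching a utility
 * class whose static state references Shell runs Shell's static block.
 */
class Shell { // stand-in for org.apache.hadoop.util.Shell
  // Computed at run time, so referencing it triggers class initialization.
  static final boolean WINDOWS = System.getProperty("os.name").startsWith("Windows");

  static final String WINUTILS_EXE = "winutils.exe";

  static { // Hadoop's real static block probes for winutils.exe on Windows.
    System.out.println("Shell initialized; would probe for " + WINUTILS_EXE
        + " (WINDOWS=" + WINDOWS + ")");
  }
}

class StringUtils { // stand-in for org.apache.hadoop.util.StringUtils
  // A static reference to Shell: loading StringUtils now initializes Shell too.
  static final boolean ON_WINDOWS = Shell.WINDOWS;

  static boolean equalsIgnoreCase(String a, String b) {
    return a != null && a.equalsIgnoreCase(b);
  }
}

public class InitChainDemo {
  public static void main(String[] args) {
    // A configuration check that has nothing to do with the file system
    // still ends up running the winutils probe via StringUtils -> Shell.
    StringUtils.equalsIgnoreCase("true", "TRUE");
  }
}
```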
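
And as for what "direct Java API access" could look like, here is a rough sketch using only `java.nio.file`. It illustrates the general approach (reading permissions, owner, group, and symlink targets without forking a shell or calling winutils); it is not the actual Bare Naked Local FileSystem implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFileAttributes;
import java.nio.file.attribute.PosixFilePermissions;

public class PureJavaFileStatusDemo {
  public static void main(String[] args) throws IOException {
    Path path = Paths.get(args.length > 0 ? args[0] : ".");

    // Permissions, owner, and group without forking `ls`/`stat` or calling winutils.
    PosixFileAttributeView view = Files.getFileAttributeView(
        path, PosixFileAttributeView.class, LinkOption.NOFOLLOW_LINKS);
    if (view != null) { // the POSIX view is unavailable on typical Windows file systems
      PosixFileAttributes attrs = view.readAttributes();
      System.out.println("owner: " + attrs.owner().getName());
      System.out.println("group: " + attrs.group().getName());
      System.out.println("perms: " + PosixFilePermissions.toString(attrs.permissions()));
    } else {
      System.out.println("readable=" + Files.isReadable(path)
          + " writable=" + Files.isWritable(path)
          + " executable=" + Files.isExecutable(path));
    }

    // Symlink resolution without shelling out (cf. `FileUtil.readLink()`).
    if (Files.isSymbolicLink(path)) {
      System.out.println("link target: " + Files.readSymbolicLink(path));
    }
  }
}
```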

In short, it's a mess that has accumulated over the years and keeps getting worse, charging high interest on what at first was a small, self-contained piece of technical debt.

I would welcome the opportunity to clean up this mess. I'm probably as qualified as anyone to make the changes. This is one of my areas of expertise: I was designing a full abstract file system interface (with pure-Java from-scratch implementations for the local file system, Subversion, and WebDAV; even the WebDAV HTTP implementation was from scratch) around the time Apache Nutch was getting off the ground. Most recently I've worked on the Hadoop `FileSystem` API while contracting for LinkedIn, discovering (what I consider to be) a huge bug in `ViewFileSystem`, HADOOP-18525 <https://issues.apache.org/jira/browse/HADOOP-18525>.

The cleanup should be done in several stages (e.g. consolidating Winutils access; replacing code with pure Java API calls; undeprecating the new `Stat` code and relegating it to a different class; etc.). Unfortunately it's not financially feasible for me to sit here for several months and revamp the Hadoop `FileSystem` subsystem for fun (even though I wish I could). Perhaps there is a job opening at a company related to Hadoop that would be interested in hiring me and devoting a certain percentage of my time to fixing local `FileSystem` access. If so, let me know where I should send my resume <https://www.garretwilson.com/about/resume>.

Otherwise, let me know if you have any ideas for a way forward. If there proves to be interest in the GlobalMentor Hadoop Bare Naked Local FileSystem <https://github.com/globalmentor/hadoop-bare-naked-local-fs> on GitHub, I'll try to maintain and improve it, but really what needs to be revamped is the Hadoop codebase itself. I'll be happy when Hadoop is fixed so that both Steve's code and my code are no longer needed.

Garret
