Hi everyone,

On Spark 2.2.0, if you wanted to create a custom file system implementation, you just extended org.apache.hadoop.fs.FileSystem and put the canonical name of the custom class in the file src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem.
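For reference, the custom filesystem boils down to something like the sketch below (the package, the class name and the "customfs" scheme are placeholders for my real implementation):

  // Minimal sketch only; com.example.fs.CustomFileSystem and "customfs" are placeholder names.
  package com.example.fs

  import java.net.URI
  import org.apache.hadoop.fs.{FSDataInputStream, FSDataOutputStream, FileStatus, FileSystem, Path}
  import org.apache.hadoop.fs.permission.FsPermission
  import org.apache.hadoop.util.Progressable

  class CustomFileSystem extends FileSystem {
    // Scheme under which the ServiceLoader mechanism registers this FileSystem.
    override def getScheme(): String = "customfs"
    override def getUri(): URI = URI.create("customfs:///")

    // Abstract methods required by FileSystem; the real logic is omitted here.
    override def open(f: Path, bufferSize: Int): FSDataInputStream = ???
    override def create(f: Path, permission: FsPermission, overwrite: Boolean, bufferSize: Int,
        replication: Short, blockSize: Long, progress: Progressable): FSDataOutputStream = ???
    override def append(f: Path, bufferSize: Int, progress: Progressable): FSDataOutputStream = ???
    override def rename(src: Path, dst: Path): Boolean = ???
    override def delete(f: Path, recursive: Boolean): Boolean = ???
    override def listStatus(f: Path): Array[FileStatus] = ???
    override def setWorkingDirectory(newDir: Path): Unit = ()
    override def getWorkingDirectory(): Path = new Path("customfs:///")
    override def mkdirs(f: Path, permission: FsPermission): Boolean = ???
    override def getFileStatus(f: Path): FileStatus = ???
  }

and the services file just lists that class:

  # src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem
  com.example.fs.CustomFileSystem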
Once you imported that jar as a dependency in your spark-submit application, the custom scheme was loaded automatically, and you could start using it right away, e.g. ds.load("customfs://path").

But on Spark 2.4.0 this no longer seems to work. If you do exactly the same thing, you get an error like "No FileSystem for customfs". The only way I got it working on 2.4.0 was by explicitly setting the Spark property spark.hadoop.fs.customfs.impl (rough example at the bottom of this mail).

Do you consider this a bug, or is it an intentional change that should be documented somewhere?

Btw, digging a little into this, the cause seems to be that the FileSystem is now initialized before the actual dependencies are downloaded from the Maven repo (see here <https://github.com/apache/spark/blob/v2.4.0/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L66>). Since that initialization loads the available filesystems only once, at that point, the filesystems in the downloaded jars are not taken into account.

Thanks.
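P.S. For completeness, this is roughly how I submit the job now; the package coordinates and class name are placeholders for my actual artifact:

  spark-submit \
    --packages com.example:customfs:1.0.0 \
    --conf spark.hadoop.fs.customfs.impl=com.example.fs.CustomFileSystem \
    my-app.jar

and with that property set, reads like spark.read.load("customfs://path") work again inside the application.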