I'm staring at https://issues.apache.org/jira/browse/HADOOP-17372 and a
stack trace which claims that a com.amazonaws class doesn't implement an
interface which it very much does

2020-11-10 05:27:33,517 [ScalaTest-main-running-S3DataFrameExampleSuite]
WARN  fs.FileSystem (FileSystem.java:createFileSystem(3466)) - Failed to
initialize fileystem s3a://stevel-ireland: java.io.IOException: Class class
com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not
implement AWSCredentialsProvider
- DataFrames *** FAILED ***
  org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.io.IOException: Class class
com.amazonaws.auth.EnvironmentVariableCredentialsProvider does not
implement AWSCredentialsProvider;
  at
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)

This is happening because Hive wants to instantiate the FS for the cluster
filesystem (full stack trace in the JIRA for the curious):

FileSystem.get(startSs.sessionConf);


The cluster FS is set to S3, so the s3a code builds up its list of
credential providers via a configuration lookup:

conf.getClasses("fs.s3a.aws.credentials.provider",
    org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider.class,
    org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.class,
    com.amazonaws.auth.EnvironmentVariableCredentialsProvider.class,
    org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider.class);

followed by a validation that whatever was loaded can be passed into the
AWS SDK

if (!AWSCredentialsProvider.class.isAssignableFrom(credClass)) {
  throw new IOException("Class " + credClass + " " + NOT_AWS_PROVIDER);
}

What appears to be happening is that the credential provider lookup goes
through a Configuration built from the HiveConf, which resolves class names
with the context classloader that was current when that conf was created.
So the AWS SDK class EnvironmentVariableCredentialsProvider is loaded by
Hive's isolated classloader. But S3AFileSystem, being org.apache.hadoop
code, is loaded by the base classloader, along with the
AWSCredentialsProvider interface it validates against. Two classloaders
means two distinct Class objects for the same interface, so the
isAssignableFrom() check concludes that
EnvironmentVariableCredentialsProvider doesn't implement the credential
provider API.
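
To make the failure mode concrete, here is a small standalone demo (not
Hadoop code; the jar path is a placeholder, and it assumes the SDK jar is
also on the demo's own classpath, as it is on the cluster's) showing that a
class loaded through an isolated classloader is never assignable to the
same-named interface loaded through the base classloader:

import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderMismatchDemo {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at a real aws-java-sdk-core jar.
    URL sdkJar = new URL("file:///path/to/aws-java-sdk-core.jar");

    // The "base" loader: whatever loaded this class (in the real scenario,
    // the loader that holds S3AFileSystem and the AWS SDK).
    ClassLoader base = ClassLoaderMismatchDemo.class.getClassLoader();

    // An "isolated" loader with no application parent, so it defines its
    // own copies of the SDK classes from the jar.
    URLClassLoader isolated = new URLClassLoader(new URL[]{sdkJar}, null);

    Class<?> interfaceFromBase =
        base.loadClass("com.amazonaws.auth.AWSCredentialsProvider");
    Class<?> providerFromIsolated = isolated.loadClass(
        "com.amazonaws.auth.EnvironmentVariableCredentialsProvider");

    // Same class names, different defining loaders: prints "false", which
    // is exactly the condition the s3a validation trips over.
    System.out.println(
        interfaceFromBase.isAssignableFrom(providerFromIsolated));
  }
}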

What to do?

I could make this specific issue evaporate by just subclassing the AWS SDK
credential providers somewhere under o.a.h.fs.s3a and putting those
subclasses on the default list, but that leaves the issue lurking for
everyone else and for the other configuration-driven extension points.
Anyone who uses the plugin options of the S3A and ABFS connectors MUST use
a class whose name begins with org.apache.hadoop, or they won't be able to
init Hive.
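
For the record, the subclassing trick would be no more than something like
this (the class name is hypothetical, nothing that exists in Hadoop today;
it relies on the isolation rules treating org.apache.hadoop classes as
shared, which is what the MUST above already relies on):

package org.apache.hadoop.fs.s3a.auth;

import com.amazonaws.auth.EnvironmentVariableCredentialsProvider;

/**
 * Does nothing except give the SDK provider an org.apache.hadoop name, so
 * it ends up loaded by the same classloader as S3AFileSystem and the
 * isAssignableFrom() check only ever sees classes from one loader.
 */
public class EnvVariableCredentialsProvider
    extends EnvironmentVariableCredentialsProvider {
}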

Alternatively, I could ignore the context classloader and make the
Configuration.getClasses() method use whatever classloader loaded the
actual S3AFileSystem class. I worry that if I do that, something else will
go horribly wrong somewhere completely random in the future, which is what
anything going near classloaders inevitably does, at some point.
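
In code, that second option amounts to pinning the classloader on the
Configuration before the lookup. Configuration.setClassLoader() already
exists; the open question is where such a pin would live and what else it
would break. A rough sketch rather than a patch (the wrapper class and
method here are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.s3a.S3AFileSystem;

final class PinnedConfiguration {
  static Configuration pinToS3ALoader(Configuration source) {
    Configuration conf = new Configuration(source);
    // Resolve class names with the loader that defined S3AFileSystem,
    // ignoring whatever context classloader the HiveConf captured.
    conf.setClassLoader(S3AFileSystem.class.getClassLoader());
    return conf;
  }
}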

Suggestions?
