Hi Nick,

You should check which Spark version "latest" actually points to, find out
which Hadoop version "spark:latest" was built on top of, and then check
the compatibility of that Hadoop version with the Azure libraries. In the
past, I used the following Dockerfile to experiment:

FROM gcr.io/spark-operator/spark:v3.0.0
USER root
ADD https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.7/hadoop-azure-2.7.7.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.8.0/azure-storage-blob-12.8.0.jar /opt/spark/jars/
ADD https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.8.0/azure-storage-common-12.8.0.jar /opt/spark/jars/
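
If you are unsure which Hadoop version a given Spark image bundles, one
quick sanity check (a minimal sketch, assuming you can open spark-shell
inside the container) is to print the Hadoop version the Spark build was
compiled against:

import org.apache.hadoop.util.VersionInfo
// Prints the Hadoop version this Spark distribution was built against,
// e.g. something like "2.7.4" for a build compiled against Hadoop 2.7.x.
println(VersionInfo.getVersion)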


And the following properties:
spark.hadoop.fs.wasb.impl org.apache.hadoop.fs.azure.NativeAzureFileSystem
spark.hadoop.fs.AbstractFileSystem.wasb.impl org.apache.hadoop.fs.azure.Wasb
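
For illustration, here is a minimal sketch of setting the same properties
programmatically on a SparkSession and reading through wasb://. The account
name "myaccount", the container "mycontainer", and the access key are
hypothetical placeholders, not values from your setup:

import org.apache.spark.sql.SparkSession

// Minimal sketch; "myaccount", "mycontainer" and the key value below
// are placeholders to be replaced with your own storage details.
val spark = SparkSession.builder()
  .appName("wasb-example")
  .config("spark.hadoop.fs.wasb.impl",
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.wasb.impl",
    "org.apache.hadoop.fs.azure.Wasb")
  // hadoop-azure reads the storage account key from this property.
  .config("spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
    "<storage-account-key>")
  .getOrCreate()

// Read a text file through the wasb:// scheme.
val df = spark.read.text("wasb://mycontainer@myaccount.blob.core.windows.net/some/path.txt")
df.show()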


Good luck,

Pol Santamaria

On Fri, Apr 16, 2021 at 3:40 PM Nick Stenroos-Dam <n...@project.bi> wrote:

> Hello
>
>
>
> I am trying to load the Hadoop-Azure driver in Apache Spark, but so far I
> have failed.
>
> The plan is to include the required files in the docker image, as we plan
> on using a Client-mode SparkSession.
>
>
>
> My current Dockerfile looks like this:
> ------------------------------
>
> FROM spark:latest
>
>
>
> COPY *.jar $SPARK_HOME/jars
>
>
>
> ENV SPARK_EXTRA_CLASSPATH="$SPARK_HOME/jars/hadoop-azure-3.2.0.jar:$SPARK_HOME/jars/azure-keyvault-core-1.2.4.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/jetty-util-ajax-9.3.24.v20180605.jar:$SPARK_HOME/jars/wildfly-openssl-2.1.3.Final.jar"
>
> ENV HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
> ------------------------------
>
>
>
> In the directory I have the following dependencies:
>
> hadoop-azure-3.2.0.jar
>
> azure-storage-8.6.6.jar
>
> azure-keyvault-core-1.2.4.jar
>
> jetty-util-ajax-9.3.24.v20180605.jar
>
> wildfly-openssl-2.1.3.Final.jar
>
>
>
> (I have validated that these files are part of the image and located where
> I expect (/opt/spark/jars))
>
>
>
> When looking at the Spark UI under Environment, I can't see any sign that
> hadoop-azure has been loaded.
>
> In addition, when I try to read a file using the wasb:// scheme, I get the
> following error:
>
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
>
