[
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272906#comment-16272906
]
Stavros Kontopoulos edited comment on SPARK-22657 at 11/30/17 5:02 PM:
-----------------------------------------------------------------------
The bug is not about s3n. It's about how dependencies are put on the class path
and what that means for the Hadoop libraries. It's also a matter of how you
deploy your jar and what that jar includes. So when I craft my jar I should be
careful to leave out dependencies that may already be available in my
environment. That might be the case for some environments such as YARN (EMR,
if I recall correctly, sets all of that up for you already), but it might not
be the case for others, including standalone for example. At the very least,
people should be able to find information about these limitations. Another
workaround is to put the hadoop-aws jar on the class path in the spark-env.sh
file.
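A minimal sketch of that spark-env.sh workaround (the jar paths and versions below are illustrative placeholders, not from the original report; adjust them to your installation):

```shell
# spark-env.sh (sketch; jar locations and versions are assumptions)
# Prepend hadoop-aws and its matching AWS SDK jar so that spark-submit sees
# the s3n/s3a FileSystem implementations before any user jars are resolved.
export SPARK_DIST_CLASSPATH="/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.6.5.jar:/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar:$SPARK_DIST_CLASSPATH"
```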
That way spark-submit will find it early enough and the Hadoop libraries will
pick up the proper fs implementations. I do insist, though, that the Hadoop
libraries impose restrictions that could be avoided by using another library
to download files. Also, caching things in static fields wouldn't be my first
choice. If you ask me, that Hadoop code in FileSystem.java could reload fs
implementations from the classloader when an fs implementation is not found
(but the classpath has been updated), instead of just looking at the static
field there and failing.
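To illustrate the static-caching pattern being criticized here, a minimal self-contained sketch (toy code, not the actual Hadoop FileSystem class): a map filled exactly once, when the class is initialized, never sees implementations that appear on the classpath afterwards.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Toy model of the pattern: a static map populated once at class-load time.
// Anything added to the classpath after this class initializes is never
// scanned again, which is the behavior the comment above objects to.
public class StaticCacheDemo {
    // Stand-in for FileSystem's static scheme-to-implementation cache.
    private static final Map<String, String> SCHEMES = new HashMap<>();

    static {
        // Only the schemes visible when the class is first initialized
        // end up in the map.
        SCHEMES.put("file", "LocalFileSystem");
        SCHEMES.put("hdfs", "DistributedFileSystem");
    }

    static String lookup(String scheme) throws IOException {
        String impl = SCHEMES.get(scheme);
        if (impl == null) {
            // Mirrors the error text from the reported exception.
            throw new IOException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        try {
            System.out.println(lookup("hdfs")); // found: cached at init time
            lookup("s3n");                      // never registered, so it fails
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Even if an s3n implementation is added to the classpath after class initialization, `lookup("s3n")` keeps failing, because nothing ever refreshes `SCHEMES`.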
> Hadoop fs implementation classes are not loaded if they are part of the app
> jar or other jar when --packages flag is used
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-22657
> URL: https://issues.apache.org/jira/browse/SPARK-22657
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: Stavros Kontopoulos
>
> To reproduce this issue run:
> ./bin/spark-submit --master mesos://leader.mesos:5050 \
>   --packages com.github.scopt:scopt_2.11:3.5.0 \
>   --conf spark.cores.max=8 \
>   --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
>   --conf spark.mesos.executor.docker.forcePullImage=true \
>   --class S3Job \
>   http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
>   --readUrl s3n://arand-sandbox-mesosphere/big.txt \
>   --writeUrl s3n://arand-sandbox-mesosphere/linecount.out
> within a container created from the
> mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image.
> You will get: "Exception in thread "main" java.io.IOException: No FileSystem
> for scheme: s3n"
> This can be reproduced with local[*] as well; there is no need to use Mesos,
> this is not a Mesos bug.
> The specific spark job used above can be found here:
> https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
>
> Can be built with sbt assembly in that dir.
> Using this code:
> https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb
> at the beginning of the main method, you get the following output:
> https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
> (Use
> http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
> to get the modified job.)
> The job works fine if --packages is not used.
> The commit that introduced this issue (before it, things work as expected):
> 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob
> support for resources adding to Spark (5 months ago) <jerryshao>
> Thu, 6 Jul 2017 15:32:49 +0800
> The exception comes from here:
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311
> https://github.com/apache/spark/pull/18235/files, check line 950; this is
> where a FileSystem is first created.
> The FileSystem class is initialized there, before the main method of the
> Spark job is launched. The reason is that the --packages logic uses the
> Hadoop libraries to download files.
> Maven resolution happens before the app jar and the resolved jars are added
> to the classpath. So at the moment the FileSystem static members are first
> initialized and filled (SERVICE_FILE_SYSTEMS, populated when the first
> FileSystem instance is created), there is no s3n implementation to add to
> the static map.
> Later, in the Spark job's main method, when we try to access the s3n
> filesystem (creating a second FileSystem), we get the exception. At this
> point the app jar, which contains the s3n implementation, is on the class
> path, but that scheme was never loaded into the static map of the
> FileSystem class.
> hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the
> problem is the static map, which is filled once and only once.
> That's why we see two prints of the map contents in the output (gist) above
> when --packages is used. The first print is before creating the s3n
> filesystem; we use reflection there to get the static map's entries. When
> --packages is not used, that map is empty before creating the s3n
> filesystem, since up to that point the FileSystem class has not yet been
> loaded by the classloader.
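The reflection trick the gist uses to print the static map's entries can be sketched as follows. This is a hedged, self-contained illustration: it inspects a toy nested Registry class rather than org.apache.hadoop.fs.FileSystem, so it runs without Hadoop on the classpath; the class and field names below are stand-ins.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class InspectStaticMap {
    // Toy stand-in for org.apache.hadoop.fs.FileSystem, so this snippet
    // runs without Hadoop jars on the classpath.
    static class Registry {
        private static final Map<String, String> SERVICE_FILE_SYSTEMS = new HashMap<>();
        static {
            SERVICE_FILE_SYSTEMS.put("file", "LocalFileSystem");
        }
    }

    // Read a private static map field off a class via reflection.
    @SuppressWarnings("unchecked")
    static Map<String, String> dump(Class<?> cls, String fieldName) throws Exception {
        Field f = cls.getDeclaredField(fieldName);
        f.setAccessible(true);                    // bypass the private modifier
        return (Map<String, String>) f.get(null); // static field: null receiver
    }

    public static void main(String[] args) throws Exception {
        // Against real Hadoop this would be
        //   dump(Class.forName("org.apache.hadoop.fs.FileSystem"), "SERVICE_FILE_SYSTEMS")
        System.out.println(dump(Registry.class, "SERVICE_FILE_SYSTEMS"));
    }
}
```

Calling this before and after the first FileSystem is created shows exactly which schemes made it into the cache, which is how the two prints in the gist output were produced.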
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)