[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266674#comment-16266674 ]

Steve Loughran commented on SPARK-22587:
----------------------------------------

Jerry had already pulled me in for this; it's one of those little "pits of 
semantics" you can get pulled into, "the incident pit" as they call it in SCUBA.

Summary: you need a case-insensitive check for scheme and userInfo too, maybe 
even port.
# Allow for null though.
# Consider using {{FileSystem.makeQualified(path)}} as the safety check; see 
the sketch just below.
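To illustrate #2, a minimal sketch (in Scala; the {{Configuration}} and path 
values are made up for illustration) of letting {{makeQualified()}} do the 
validation, since it calls {{checkPath()}} internally and throws on a mismatch:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()                    // picks up fs.defaultFS from core-site.xml
val defaultFs = FileSystem.get(conf)              // the fs.defaultFS filesystem
val srcPath = new Path("wasb://XXX/tmp/test.py")  // placeholder path from the report

// makeQualified() calls checkPath() internally, so this throws
// IllegalArgumentException ("Wrong FS: ...") if srcPath belongs to a
// different scheme/authority than defaultFs: the exact failure in this issue.
val qualified: Path = defaultFs.makeQualified(srcPath)
{code}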

What does Hadoop get up to?
* {{FileSystem.checkPath}} does a full check of (scheme, authority), with the auth 
of the canonicalized URI (including default ports), so hdfs://namenode/ and 
hdfs://namenode:9820/ refer to the same FS. That code dates from 2008, so it 
should be considered normative.
* {{S3AFileSystem.checkPath}} only looks at hostnames, because it tries to strip 
out user:password from Paths for security reasons.
* Wasb uses {{FileSystem.checkPath}}, but does some hacks to also handle an older 
scheme of "asv". I wouldn't worry about that little detail though.
* {{AbstractFileSystem.checkPath}} (the FileContext implementation code) 
doesn't check auth; it looks at host and notes that on a file:// 
reference the host may be null. Raises {{InvalidPathException}} (a subclass of 
{{IllegalArgumentException}}) if it's unhappy.

Overall then: check auth with an {{equalsIgnoreCase()}} and allow for null. Worry 
about default ports if you really want to.
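For illustration, a minimal sketch of that comparison ({{sameFileSystem}} is a 
made-up helper name, not existing Spark or Hadoop code; default-port 
canonicalization is deliberately left out):

{code}
import java.net.URI

// Made-up helper: case-insensitive comparison of scheme and authority,
// where a null component on either side counts as a match (an unqualified
// path inherits those parts from the default filesystem).
def sameFileSystem(a: URI, b: URI): Boolean = {
  def matches(x: String, y: String): Boolean =
    x == null || y == null || x.equalsIgnoreCase(y)
  matches(a.getScheme, b.getScheme) && matches(a.getAuthority, b.getAuthority)
}

// sameFileSystem(new URI("wasb://XXX"), new URI("WASB://xxx"))  => true
// sameFileSystem(new URI("wasb://XXX"), new URI("wasb://YYY"))  => false
{code}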

Filed HADOOP-15070 to cover this whole area better in docs & tests, should make 
the FileContext/FileSystem checks consistent and raise the same 
InvalidPathException.

One thing to consider is adding to the FS APIs some predicate 
{{isFileSystemPath(Path p)}} to do the validation without the overhead of 
exception throwing, implemented in one single (consistent) place. It wouldn't 
be there until Hadoop 3.1 though, so it's of no immediate benefit.
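Purely hypothetical, but such a predicate might look something like this 
(nothing like it exists in any Hadoop release today, and it's sketched as a 
standalone helper rather than a FileSystem method):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch of the proposed predicate: answers "does this path
// belong to this filesystem?" without any exception throwing.
def isFileSystemPath(fs: FileSystem, p: Path): Boolean = {
  val pathUri = p.toUri
  val fsUri = fs.getUri
  def matches(x: String, y: String): Boolean =
    x == null || x.equalsIgnoreCase(y)  // null path component: unqualified, so it matches
  matches(pathUri.getScheme, fsUri.getScheme) &&
    matches(pathUri.getAuthority, fsUri.getAuthority)
}
{code}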

thank you for bringing this undocumented, unspecified, untested and 
inconsistent logic to my attention :)


> Spark job fails if fs.defaultFS and application jar are different url
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22587
>                 URL: https://issues.apache.org/jira/browse/SPARK-22587
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.6.3
>            Reporter: Prabhu Joseph
>
> Spark Job fails if fs.defaultFS and the url where the application jar resides 
> are different while having the same scheme:
> spark-submit  --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> core-site.xml's fs.defaultFS is set to wasb://YYY. Hadoop listing (hadoop 
> fs -ls) works for both urls XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: wasb://XXX/tmp/test.py, expected: wasb://YYY
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
> at org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485)
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396)
> at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
> at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660)
> at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248)
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307)
> at org.apache.spark.deploy.yarn.Client.main(Client.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> The code in Client.copyFileToRemote tries to resolve the path of the application 
> jar (XXX) from the FileSystem object created using the fs.defaultFS url (YYY) 
> instead of the actual url of the application jar:
> {code}
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> {code}
> getFileSystem will create the filesystem based on the url of the path, so this 
> is fine. But the lines below try to qualify the srcPath (XXX url) against the 
> destFs (YYY url), and so it fails:
> {code}
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)
> {code}


