[ https://issues.apache.org/jira/browse/SPARK-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Or updated SPARK-1900:
-----------------------------
    Description:
If I run the following on a YARN cluster
{code}
bin/spark-submit sheep.py --master yarn-client
{code}
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
{code}
bin/spark-submit file:/path/to/sheep.py --master yarn-client
{code}
However, this also fails. This time it is because Python does not understand URI schemes.

This PR fixes this by automatically resolving all paths passed as command-line arguments to `spark-submit`. This has the added benefit of keeping file and jar paths consistent across different cluster modes.

  was:
If I run the following on a YARN cluster
{code}
bin/spark-submit sheep.py --master yarn-client
{code}
it fails because of a mismatch in paths: `spark-submit` thinks that {code}sheep.py{code} resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
{code}
bin/spark-submit file:/path/to/sheep.py --master yarn-client
{code}
However, this also fails. This time it is because Python does not understand URI schemes.

This PR fixes this by automatically resolving all paths passed as command-line arguments to `spark-submit`. This has the added benefit of keeping file and jar paths consistent across different cluster modes.
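To illustrate the kind of path resolution described above, here is a minimal Python sketch. It is not the code from the PR: the helper names `resolve_uri` and `to_local_path` are hypothetical, and the sketch assumes POSIX-style paths. The idea is that a path with no scheme is treated as a local file and given an explicit `file:` URI, while a `file:` URI is turned back into a plain path before it is handed to Python, which does not understand URI schemes.

```python
import os
from pathlib import Path
from urllib.parse import unquote, urlparse


def resolve_uri(path: str) -> str:
    """Return `path` as a URI (hypothetical helper, not Spark's API).

    A path that already carries a scheme (file:, hdfs:, ...) is returned
    unchanged. A bare path is assumed to be local and is converted to an
    absolute file: URI. Assumes POSIX paths: a Windows path such as
    C:\\x would be misread as having the scheme "c".
    """
    if urlparse(path).scheme:
        return path
    return Path(os.path.abspath(path)).as_uri()


def to_local_path(uri: str) -> str:
    """Strip the file: scheme so the result can be passed to Python."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"not a local file: {uri}")
    return unquote(parsed.path)


if __name__ == "__main__":
    for arg in ["sheep.py", "file:/path/to/sheep.py", "/tmp/sheep.py"]:
        uri = resolve_uri(arg)
        print(arg, "->", uri, "->", to_local_path(uri))
```

With helpers like these, every cluster mode would see the same explicit URI for a given argument, and only the final step that launches Python would need the plain path.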
> Fix running PySpark files on YARN
> ---------------------------------
>
>                 Key: SPARK-1900
>                 URL: https://issues.apache.org/jira/browse/SPARK-1900
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.0
>            Reporter: Andrew Or
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> If I run the following on a YARN cluster
> {code}
> bin/spark-submit sheep.py --master yarn-client
> {code}
> it fails because of a mismatch in paths: `spark-submit` thinks that
> `sheep.py` resides on HDFS, and balks when it can't find the file there. A
> natural workaround is to add the `file:` prefix to the file:
> {code}
> bin/spark-submit file:/path/to/sheep.py --master yarn-client
> {code}
> However, this also fails. This time it is because Python does not understand
> URI schemes.
> This PR fixes this by automatically resolving all paths passed as
> command-line arguments to `spark-submit`. This has the added benefit of
> keeping file and jar paths consistent across different cluster modes.

--
This message was sent by Atlassian JIRA
(v6.2#6252)