[GitHub] spark pull request #21267: [SPARK-21945][YARN][PYTHON] Make --py-files work ...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21267

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r188144573

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,22 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
+                        self._python_includes.append(filename)
+                        sys.path.insert(1, filepath)
+                except Exception:
+                    from pyspark import util
+                    warnings.warn(
```
--- End diff --

Likewise, I checked the warning manually:

```
.../pyspark/context.py:229: RuntimeWarning: Failed to add file [/home/spark/tmp.py] specified in 'spark.submit.pyFiles' to Python path:
  ...
  /usr/lib64/python27.zip
  /usr/lib64/python2.7
  ...
```
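For readers unfamiliar with how such warnings surface, here is a small self-contained sketch (the message text is copied from the discussion and the file path is illustrative) of how a `warnings.warn(..., RuntimeWarning)` call like the one in the patch can be observed and tested:

```python
import warnings

# Capture warnings instead of printing them, so the behavior is checkable.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.warn(
        "Failed to add file [/tmp/example.py] specified in "
        "'spark.submit.pyFiles' to Python path", RuntimeWarning)

# When not captured, Python prints this as "<file>:<line>: RuntimeWarning: ...",
# which matches the shape of the console output quoted above.
print(caught[0].category.__name__, str(caught[0].message))
```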
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r188037259

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
+                        self._python_includes.append(filename)
+                        sys.path.insert(1, filepath)
+                except Exception as e:
+                    from pyspark import util
+                    warnings.warn(
+                        "Python file [%s] specified in 'spark.submit.pyFiles' failed "
```
--- End diff --

Simplify this message? `"Failed to add file [%s] specified in 'spark.submit.pyFiles' to Python path:\n  %s"`
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187274825

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

Yup, they are only missing on the driver side in this mode specifically. Yarn doesn't add them since `spark.files` is not set, if I understood correctly. They are specially handled in the submit case, but the shell case seems missing. I described this a bit in the PR description too:

> In case of Yarn client and cluster with _submit_, these are manually being handled. In particular https://github.com/apache/spark/pull/6360 added most of the logic. In this case, the Python path looks manually set via, for example, `deploy.PythonRunner`. We don't use `spark.files` here.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187274278

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

Are 'spark.submit.pyFiles' files only missing on the driver side? I mean, if they are not added by `SparkContext.addFile`, shouldn't they also be missing on executors?
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187264822

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

I don't think so, but that's already being done in the other cluster/client modes. The copies are made via `addFile` in the other modes; the files just aren't copied in this specific case. I think we'd better copy consistently.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187259682

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
```
--- End diff --

Oh, I see.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187259493

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

For file types in `PACKAGE_EXTENSIONS`, do we need to copy?
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187216038

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

That's the initial approach I tried. The thing is, the configuration can point to a .py file, which needs its parent directory (not the .py file itself) on the path, and adding that directory would also pick up any other .py files in it.
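The point above can be demonstrated in plain Python, independent of Spark. This sketch (file and module names are made up) shows that putting a .py file's own path on `sys.path` does nothing, while putting its parent directory on the path works but also exposes sibling .py files:

```python
import os
import sys
import tempfile

root = tempfile.mkdtemp()
# Write the file the user asked for, plus an unrelated sibling in the same dir.
for name in ("tmp.py", "sibling.py"):
    with open(os.path.join(root, name), "w") as f:
        f.write("def testtest():\n    return 1\n")

# Adding the .py file itself to sys.path: the import machinery has no
# hook for bare .py paths, so the import fails.
sys.path.insert(1, os.path.join(root, "tmp.py"))
try:
    import tmp
    imported_via_file = True
except ImportError:
    imported_via_file = False

# Adding the parent directory works...
sys.path.insert(1, root)
import tmp
import sibling  # ...but the sibling file leaks onto the path too.

print(imported_via_file, tmp.testtest(), sibling.testtest())
```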
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r187133079

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
```
--- End diff --

Is this copy necessary? Couldn't you just add `path` to `sys.path` (instead of adding `filepath`) and that would solve the problem?
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r186920316

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
+                        self._python_includes.append(filename)
+                        sys.path.insert(1, filepath)
+                except Exception as e:
+                    from pyspark import util
+                    warnings.warn(
```
--- End diff --

BTW, this should now be safer in any case, since we no longer put non-existent files on the path and we print warnings instead.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r186673789

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
```
--- End diff --

The root directory is added to the path above; a .py file needs its parent directory on the path.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r186670486

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
```
--- End diff --

Am I missing anything? It looks like `PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')`, so `.py` doesn't seem to be in there?
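The distinction matters because of how Python's import machinery treats the two kinds of entries: archives like .zip can sit on `sys.path` directly (handled by the built-in zipimport mechanism), whereas a bare .py file cannot, which is why only `PACKAGE_EXTENSIONS` entries get the `sys.path.insert` above. A small illustration (archive and module names are made up):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip archive containing a single module.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "dep.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("zipped_mod.py", "def answer():\n    return 42\n")

# The archive itself goes on sys.path; zipimport makes its contents importable.
sys.path.insert(1, zip_path)
import zipped_mod

print(zipped_mod.answer())
```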
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21267#discussion_r186650331

--- Diff: python/pyspark/context.py ---
```diff
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
+                        self._python_includes.append(filename)
+                        sys.path.insert(1, filepath)
+                except Exception as e:
+                    from pyspark import util
+                    warnings.warn(
```
--- End diff --

The log was also tested manually:

```
.../python/pyspark/context.py:230: RuntimeWarning: Python file [/home/spark/tmp.py] specified in 'spark.submit.pyFiles' failed to be added in the Python path, excluding this in the Python path.
: ...
```
GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/21267

[SPARK-21945][YARN][PYTHON] Make --py-files work in PySpark shell in Yarn client mode

## What changes were proposed in this pull request?

### Problem

When we run _PySpark shell with Yarn client mode_, specified `--py-files` are not recognised on the _driver side_. Here are the steps I took to check:

```bash
$ cat /home/spark/tmp.py
def testtest():
    return 1
```

```bash
$ ./bin/pyspark --master yarn --deploy-mode client --py-files /home/spark/tmp.py
```

```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()  # executor side
[1]
>>> test()  # driver side
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test
ImportError: No module named tmp
```

### How it happened?

Unlike Yarn cluster and client mode with Spark submit, in Yarn client mode with PySpark shell specifically:

1. It first runs a Python shell via:
   https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L158
   as pointed out by @tgravescs in the JIRA.

2. This triggers shell.py and submits another application to launch a Py4J gateway:
   https://github.com/apache/spark/blob/209b9361ac8a4410ff797cff1115e1888e2f7e66/python/pyspark/java_gateway.py#L45-L60

3. It runs a Py4J gateway:
   https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L425

4. It copies --py-files into a local temp directory:
   https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L365-L376
   and then these paths are set in `spark.submit.pyFiles`.

5. The Py4J JVM is launched and then the Python paths are set via:
   https://github.com/apache/spark/blob/7013eea11cb32b1e0038dc751c485da5c94a484b/python/pyspark/context.py#L209-L216
   However, these are not actually set, because those files were copied into a tmp directory in step 4, whereas this code path looks under `SparkFiles.getRootDirectory()`, where files are stored only when `SparkContext.addFile()` is called.

In other cluster modes, `spark.files` is set via:
https://github.com/apache/spark/blob/3cb82047f2f51af553df09b9323796af507d36f8/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L554-L555
and those files are explicitly added via:
https://github.com/apache/spark/blob/ecb8b383af1cf1b67f3111c148229e00c9c17c40/core/src/main/scala/org/apache/spark/SparkContext.scala#L395
So we are fine in other modes.

In case of Yarn client and cluster with _submit_, these are manually being handled. In particular https://github.com/apache/spark/pull/6360 added most of the logic. In this case, the Python path looks manually set via, for example, `deploy.PythonRunner`. We don't use `spark.files` here.

### How does the PR fix the problem?

I tried to make the approach as isolated as possible: simply copy .py or .zip files into `SparkFiles.getRootDirectory()` on the driver side if they are not already there. Another possible way is to set `spark.files`, but that does unnecessary extra work and sounds a bit invasive.

### Before

```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in test
ImportError: No module named tmp
```

### After

```python
>>> def test():
...     import tmp
...     return tmp.testtest()
...
>>> spark.range(1).rdd.map(lambda _: test()).collect()
[1]
>>> test()
1
```

## How was this patch tested?

I manually tested in standalone and yarn cluster with PySpark shell. .zip and .py files were also tested with similar steps to the above. It's difficult to add a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-21945

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21267.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21267

----

commit 68be3baef22d8b7aa58a432cb5bd12437c07feb7
Author: hyukjinkwon
Date:   2018-05-08T07:36:31Z

    Make --py-files work in PySpark shell in
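For reference, the core idea of the change discussed in this thread can be sketched in standalone Python. This is a simplified stand-in, not the actual pyspark code: `add_py_files`, `root_dir`, and the demo module name `mydep` are all illustrative, with temp directories standing in for `SparkFiles.getRootDirectory()`:

```python
import os
import shutil
import sys
import tempfile
import warnings

PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar')

def add_py_files(py_files_conf, root_dir, python_includes):
    """Ensure each file listed in a 'spark.submit.pyFiles'-style string exists
    under root_dir (copying it there if needed), then expose it to imports."""
    for path in py_files_conf.split(","):
        if path != "":
            (dirname, filename) = os.path.split(path)
            try:
                filepath = os.path.join(root_dir, filename)
                if not os.path.exists(filepath):
                    # The file was not placed here via addFile; copy it in.
                    shutil.copyfile(path, filepath)
                if filename[-4:].lower() in PACKAGE_EXTENSIONS:
                    # Archives go on sys.path directly; plain .py files become
                    # importable because root_dir itself is already on sys.path.
                    python_includes.append(filename)
                    sys.path.insert(1, filepath)
            except Exception:
                warnings.warn(
                    "Failed to add file [%s] specified in 'spark.submit.pyFiles' "
                    "to Python path" % path, RuntimeWarning)

# Demo: root stands in for SparkFiles.getRootDirectory(), already on sys.path.
root = tempfile.mkdtemp()
sys.path.insert(1, root)

# A .py file living elsewhere, as after spark-submit's temp-dir copy.
src_dir = tempfile.mkdtemp()
src = os.path.join(src_dir, "mydep.py")
with open(src, "w") as f:
    f.write("def testtest():\n    return 1\n")

includes = []
add_py_files(src, root, includes)

import mydep  # works on the "driver" now
print(mydep.testtest())
```

Note how the copy (rather than putting `src_dir` on `sys.path`) avoids dragging in unrelated sibling .py files, which was the concern raised in the review.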