Marcelo Vanzin created SPARK-17387:
--------------------------------------

             Summary: Creating SparkContext() from python without spark-submit ignores user conf
                 Key: SPARK-17387
                 URL: https://issues.apache.org/jira/browse/SPARK-17387
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Marcelo Vanzin
            Priority: Minor


Consider the following scenario: a user runs a Python application not through 
spark-submit, but by adding the pyspark module to PYTHONPATH and manually creating 
a SparkContext. Something like this:

{noformat}
$ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyspark import SparkContext
>>> from pyspark import SparkConf
>>> conf = SparkConf().set("spark.driver.memory", "4g")
>>> sc = SparkContext(conf=conf)
{noformat}

If you look at the JVM launched by the pyspark code, it ignores the user's 
configuration:

{noformat}
$ ps ax | grep $(pgrep -f SparkSubmit)
12283 pts/2    Sl+    0:03 /apps/java7/bin/java -cp ... -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
{noformat}

Note the "1g" of memory. If instead you use "pyspark", you get the correct "4g" 
in the JVM.

This also affects other configs; for example, you can't really add jars to the 
driver's classpath using "spark.jars".
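
For instance (the jar path below is just an example), setting {{spark.jars}} this 
way has no effect on the driver's classpath, because the gateway JVM is launched 
without ever seeing the conf:

{noformat}
>>> from pyspark import SparkContext, SparkConf
>>> conf = SparkConf().set("spark.jars", "/path/to/my-lib.jar")
>>> sc = SparkContext(conf=conf)  # my-lib.jar does not end up on the driver's classpath
{noformat}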

You can work around this by setting the undocumented {{PYSPARK_SUBMIT_ARGS}} env 
variable that Spark itself uses:

{noformat}
$ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf spark.driver.memory=4g"
>>> from pyspark import SparkContext
>>> sc = SparkContext()
{noformat}

But it would be nicer if the configs were automatically propagated.

BTW the reason for this is that the {{launch_gateway}} function used to start 
the JVM does not take any parameters, and the only place it reads Spark arguments 
from is that env variable.
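
Just to illustrate what propagating the conf could look like (a rough sketch only, 
not the actual pyspark code; {{_build_submit_args}} is a made-up helper), the 
launcher could translate the SparkConf into {{--conf}} arguments for spark-submit 
before starting the gateway JVM:

{noformat}
# Illustrative sketch: turn a SparkConf into spark-submit arguments. The
# "pyspark-shell" resource goes first, same as in the workaround above.
def _build_submit_args(conf):
    args = ["pyspark-shell"]
    if conf is not None:
        # SparkConf.getAll() returns the user's settings as (key, value) pairs
        for k, v in conf.getAll():
            args += ["--conf", "%s=%s" % (k, v)]
    return " ".join(args)

# launch_gateway() could then do something along these lines before spawning
# spark-submit:
# os.environ["PYSPARK_SUBMIT_ARGS"] = _build_submit_args(conf)
{noformat}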



