[ https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131945#comment-16131945 ]
Jakub Nowacki edited comment on SPARK-21752 at 8/18/17 9:11 AM:
----------------------------------------------------------------

[~skonto] What you are doing is in fact starting pyspark ({{shell.py}}) manually inside Jupyter, which creates the SparkSession itself, so what I wrote above has no effect, as it is the same as running the pyspark command. A more Pythonic way of installing it is to add the modules from the bundled {{python}} folder to PYTHONPATH (e.g. http://sigdelta.com/blog/how-to-install-pyspark-locally/), which is very similar to what happens when you install via {{pip}}/{{conda}}. Also, I am referring to a plain Python kernel in Jupyter (or any other Python interpreter) started without executing {{shell.py}}. BTW, you can create kernels in Jupyter (e.g. https://gist.github.com/cogfor/903c911c9b1963dcd530bbc0b9d9f0ce) that work like the pyspark shell, similar to your setup.

While I understand that using {{master}} or {{spark.jars.packages}} in the config is not desired behavior, I'd like to work out a preferred way of passing configuration options to SparkSession, especially for notebook users. In my experience, many options other than {{master}} and {{spark.jars.packages}} work quite well with the SparkSession config, e.g. {{spark.executor.memory}}, which sometimes needs to be tuned for specific jobs; for generic jobs I rely on the defaults, which I tune per cluster. So my question is: when we need to add custom configuration to a PySpark submission, should interactive Python users:
# add *all* configuration to {{PYSPARK_SUBMIT_ARGS}};
# put some configuration, like {{master}} or {{packages}}, in {{PYSPARK_SUBMIT_ARGS}}, while the rest is passed in the SparkSession config (ideally documenting which options fall into each group); or
# should we fix something in SparkSession creation so that the SparkSession config is as effective as {{PYSPARK_SUBMIT_ARGS}}?

Also, sometimes we know that a particular job (not interactive, run by {{spark-submit}}) requires more executor memory or a different number of partitions. Could we use the SparkSession config in this case, or should each of these tuned parameters be passed via {{spark-submit}} arguments?

I'm happy to extend the documentation with such a section for Python users, as I don't think it is clear currently, and it would be very useful.
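For illustration, a minimal sketch of the mixed approach (option 2 above) in a plain Python kernel; the Mongo connector coordinates are just the example package from this issue, and the app name and memory setting are arbitrary placeholders:

{code}
import os
from pyspark.sql import SparkSession

# Options that spark-submit has to see (master, --packages) must be set in
# PYSPARK_SUBMIT_ARGS before the first SparkSession/SparkContext is created,
# because the JVM gateway is launched with these arguments.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--master local[*] '
    '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 '
    'pyspark-shell'
)

# Other tuning options, e.g. executor memory, seem to be picked up fine
# when passed via the builder config.
spark = SparkSession.builder \
    .appName('notebook-session') \
    .config('spark.executor.memory', '4g') \
    .getOrCreate()
{code}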
> Config spark.jars.packages is ignored in SparkSession config
> ------------------------------------------------------------
>
>                 Key: SPARK-21752
>                 URL: https://issues.apache.org/jira/browse/SPARK-21752
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Jakub Nowacki
>
> If I set the config key {{spark.jars.packages}} using the {{SparkSession}} builder as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
>     .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
>     .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
>     .getOrCreate()
> {code}
> the SparkSession gets created but no package download logs are printed, and if I use the classes that should have been loaded (the Mongo connector in this case, but it's the same for other packages), I get {{java.lang.ClassNotFoundException}} for the missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using a {{SparkConf}} object works fine as well, e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
>     .appName('test-mongo')\
>     .master('local[*]')\
>     .config(conf=conf)\
>     .getOrCreate()
> {code}
> The above is in Python, but I've seen the same behavior in other languages, though I didn't check R. I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the {{SparkSession}} builder config.
> Note that this concerns creating a new {{SparkSession}}, as pulling new packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this will only work with bare Python, Scala or Java, and not in {{pyspark}} or {{spark-shell}}, as they create the session automatically; in that case one would need to use the {{--packages}} option.
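A rough way to check which of the above approaches actually resolved the packages (a sketch only; it assumes spark-submit merges resolved {{--packages}} coordinates into {{spark.jars}}, which matches my experience rather than documented behavior):

{code}
# After getOrCreate(): when --packages was handled by spark-submit, the
# resolved jar paths typically show up under spark.jars; when the builder
# config was silently ignored, spark.jars.packages may be set in the conf
# while no jars were actually fetched.
conf = spark.sparkContext.getConf()
print(conf.get("spark.jars.packages", "<not set>"))
print(conf.get("spark.jars", "<empty>"))
{code}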