https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
The code shows that Spark builds the Python path from SPARK_HOME whenever it is set. On my worker nodes, SPARK_HOME is set in .bashrc and points at the preinstalled 2.2.1 path. I don't want to change the worker node configuration, so is there any way to override the order? (One possible workaround is sketched after the quoted thread.)

Jianshi

On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin <van...@cloudera.com> wrote:
> Normally the version of Spark installed on the cluster does not
> matter, since Spark is uploaded from your gateway machine to YARN by
> default.
>
> You probably have some configuration (in spark-defaults.conf) that
> tells YARN to use a cached copy. Get rid of that configuration, and
> you can use whatever version you like.
>
> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >
> > Hi,
> >
> > I have a problem using multiple versions of PySpark on YARN: the driver
> > and worker nodes all have Spark 2.2.1 preinstalled for production
> > tasks, and I want to use 2.3.2 for my personal EDA.
> >
> > I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(), but
> > on the worker node the PYTHONPATH still uses the system SPARK_HOME.
> >
> > Does anyone know how to override the PYTHONPATH on worker nodes?
> >
> > Here's the error message:
> >>
> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
> >> : org.apache.spark.SparkException: Job aborted due to stage failure:
> >> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
> >> in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2):
> >> org.apache.spark.SparkException:
> >> Error from python worker:
> >>   Traceback (most recent call last):
> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
> >>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
> >>       __import__(pkg_name)
> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
> >>   ModuleNotFoundError: No module named 'py4j'
> >> PYTHONPATH was:
> >>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
> >
> > And here's how I started the PySpark session in Jupyter:
> >>
> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
> >> %env PYSPARK_PYTHON=/usr/bin/python3
> >> import findspark
> >> findspark.init()
> >> import pyspark
> >> sparkConf = pyspark.SparkConf()
> >> sparkConf.setAll([
> >>     ('spark.cores.max', '96')
> >>     ,('spark.driver.memory', '2g')
> >>     ,('spark.executor.cores', '4')
> >>     ,('spark.executor.instances', '2')
> >>     ,('spark.executor.memory', '4g')
> >>     ,('spark.network.timeout', '800')
> >>     ,('spark.scheduler.mode', 'FAIR')
> >>     ,('spark.shuffle.service.enabled', 'true')
> >>     ,('spark.dynamicAllocation.enabled', 'true')
> >> ])
> >> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", conf=sparkConf, pyFiles=py_files)
> >
> > Thanks,
> > --
> > Jianshi Huang
>
> --
> Marcelo

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
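One possible way to override the order, sketched below and not confirmed anywhere in this thread: ship the 2.3.2 pyspark.zip and py4j zip with the job and set PYTHONPATH explicitly for the YARN application master and executors via the spark.yarn.appMasterEnv.* and spark.executorEnv.* settings. The HDFS location of pyspark.zip is hypothetical, and the sketch assumes that files passed through pyFiles= end up in each container's working directory; the worker's SPARK_HOME-derived entries may still be prepended by Spark itself, so this may not be sufficient on its own.

import pyspark

# Hypothetical: the 2.3.2 pyspark.zip uploaded to HDFS alongside the py4j zip already used above.
PY4J_ZIP = 'hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip'
PYSPARK_ZIP = 'hdfs://emr-header-1.cluster-68492:9000/lib/pyspark.zip'

conf = pyspark.SparkConf()
conf.setAll([
    # Files passed via pyFiles= are distributed to the containers, so relative
    # names are used here (an assumption to verify on this cluster).
    ('spark.executorEnv.PYTHONPATH', 'pyspark.zip:py4j-0.10.7-src.zip'),
    ('spark.yarn.appMasterEnv.PYTHONPATH', 'pyspark.zip:py4j-0.10.7-src.zip'),
    # Optionally point the executors' SPARK_HOME away from /usr/lib/spark-current;
    # this only helps if a 2.3.2 install actually exists at that path on the workers.
    # ('spark.executorEnv.SPARK_HOME', '/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7'),
])

sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
                          conf=conf, pyFiles=[PYSPARK_ZIP, PY4J_ZIP])

Separately, as Marcelo suggests, it is worth checking spark-defaults.conf on the gateway for spark.yarn.jars or spark.yarn.archive entries that pin every job to a cached copy of the 2.2.1 Spark jars.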