Hi,

I have a problem using multiple versions of PySpark on YARN. The driver and
worker nodes all have Spark 2.2.1 preinstalled for production jobs, and I
want to use 2.3.2 for my personal EDA.

I've tried both the 'pyFiles=' option and SparkContext.addPyFile(), but on
the worker nodes the PYTHONPATH still points at the system SPARK_HOME.

Does anyone know how to override the PYTHONPATH on the worker nodes?
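
For reference, the addPyFile attempt looked roughly like this (only the py4j
zip path is real, the same one I staged on HDFS and use in the Jupyter snippet
further down; the pyspark.zip path is just illustrative):

> # Rough sketch of the addPyFile attempt: ship the 2.3.2 zips to the executors
> # at runtime. Only the py4j path is real; the pyspark.zip path is illustrative.
> # (sc is the SparkContext created in the Jupyter snippet below.)
> sc.addPyFile('hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip')
> sc.addPyFile('hdfs://emr-header-1.cluster-68492:9000/lib/pyspark.zip')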

Here's the error message:

>
> Py4JJavaError: An error occurred while calling o75.collectToPython.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
> Error from python worker:
> Traceback (most recent call last):
>   File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>   File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>     __import__(pkg_name)
>   File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>   File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
> ModuleNotFoundError: No module named 'py4j'
> PYTHONPATH was:
> /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar


And here's how I started the PySpark session in Jupyter:

>
> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
> %env PYSPARK_PYTHON=/usr/bin/python3
> import findspark
> findspark.init()
> import pyspark
> sparkConf = pyspark.SparkConf()
> sparkConf.setAll([
>     ('spark.cores.max', '96')
>     ,('spark.driver.memory', '2g')
>     ,('spark.executor.cores', '4')
>     ,('spark.executor.instances', '2')
>     ,('spark.executor.memory', '4g')
>     ,('spark.network.timeout', '800')
>     ,('spark.scheduler.mode', 'FAIR')
>     ,('spark.shuffle.service.enabled', 'true')
>     ,('spark.dynamicAllocation.enabled', 'true')
> ])
> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
> conf=sparkConf, pyFiles=py_files)
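
One direction I'm considering, in case it helps anyone point me at the right
knob, is forcing the executor environment explicitly instead of relying on
pyFiles. This is untested; the property names are just my reading of the YARN
configuration docs, and the pyspark.zip HDFS path is illustrative:

> # Untested sketch: ship the 2.3.2 zips via spark.yarn.dist.files (they land in
> # each container's working directory) and point PYTHONPATH at them there.
> sparkConf.set('spark.yarn.dist.files',
>               'hdfs://emr-header-1.cluster-68492:9000/lib/pyspark.zip,'
>               'hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip')
> sparkConf.set('spark.executorEnv.PYTHONPATH', './pyspark.zip:./py4j-0.10.7-src.zip')
> sparkConf.set('spark.yarn.appMasterEnv.PYTHONPATH', './pyspark.zip:./py4j-0.10.7-src.zip')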

Thanks,
-- 
Jianshi Huang
