Maybe this can help.

https://stackoverflow.com/questions/32959723/set-python-path-for-spark-worker
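
One approach along those lines is to point the executors at the right Python libs explicitly through the Spark conf instead of relying on the workers' SPARK_HOME. A minimal sketch, assuming the 2.3.2 pyspark.zip and py4j-0.10.7-src.zip have been shipped to the executors (e.g. via spark.yarn.dist.files) and land in each container's working directory; the archive names come from the message below, and whether this wins over the preinstalled install depends on the cluster setup:

    import pyspark

    conf = pyspark.SparkConf()
    # spark.executorEnv.* sets environment variables for the executors, so
    # the Python workers they launch see this PYTHONPATH instead of the one
    # derived from the preinstalled SPARK_HOME.
    conf.set('spark.executorEnv.PYTHONPATH',
             './py4j-0.10.7-src.zip:./pyspark.zip')
    # On YARN the application master needs the same override.
    conf.set('spark.yarn.appMasterEnv.PYTHONPATH',
             './py4j-0.10.7-src.zip:./pyspark.zip')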



On 04/10/2018 12:19 PM, Jianshi Huang wrote:
Hi,

I have a problem using multiple versions of PySpark on YARN. The driver and worker nodes are all preinstalled with Spark 2.2.1 for production tasks, and I want to use 2.3.2 for my personal EDA.

I've tried both the 'pyFiles=' option and SparkContext.addPyFile(); however, on the worker nodes the PYTHONPATH still points at the system SPARK_HOME.

Does anyone know how to override the PYTHONPATH on the worker nodes?
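
For reference, the addPyFile route looks roughly like this (a sketch; the appName/master here are placeholders, the HDFS path is the one from the Jupyter snippet further down). It ships the archive and gets it onto sys.path inside the Python workers, but, as the PYTHONPATH in the error below shows, the path the workers are launched with still comes from the executors' local /usr/lib/spark-current install:

    import pyspark

    # Placeholder context; in practice this is the Jupyter session below.
    sc = pyspark.SparkContext(appName="eda-test", master="yarn")
    # Distribute the py4j sources to every executor.
    sc.addPyFile('hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip')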

Here's the error message:


    Py4JJavaError: An error occurred while calling o75.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure:
    Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
    in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2):
    org.apache.spark.SparkException:
    Error from python worker:
      Traceback (most recent call last):
        File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
          mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
        File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
          __import__(pkg_name)
        File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
        File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
      ModuleNotFoundError: No module named 'py4j'
    PYTHONPATH was:
      /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar


And here's how I started the PySpark session in Jupyter:


    %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
    %env PYSPARK_PYTHON=/usr/bin/python3

    import findspark
    findspark.init()

    import pyspark

    sparkConf = pyspark.SparkConf()
    sparkConf.setAll([
        ('spark.cores.max', '96'),
        ('spark.driver.memory', '2g'),
        ('spark.executor.cores', '4'),
        ('spark.executor.instances', '2'),
        ('spark.executor.memory', '4g'),
        ('spark.network.timeout', '800'),
        ('spark.scheduler.mode', 'FAIR'),
        ('spark.shuffle.service.enabled', 'true'),
        ('spark.dynamicAllocation.enabled', 'true'),
    ])

    py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']

    sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
                              conf=sparkConf, pyFiles=py_files)
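
For reference, even a trivial probe job like the following (reusing the sc created above) fails the same way, since it also needs a Python worker on the executors; if the environment were right it would report which pyspark/py4j the executors actually import:

    # Run one tiny task and report where pyspark/py4j were imported from on
    # the executor, plus the Python version used there.
    def probe(_):
        import sys
        import pyspark
        import py4j
        return (pyspark.__file__, py4j.__file__, sys.version)

    print(sc.parallelize([0], numSlices=1).map(probe).collect())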



Thanks,
--
Jianshi Huang


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
