Sorry, I can't help you if that doesn't work. Your YARN RM really should not
have SPARK_HOME set if you want to use more than one Spark version.

On Thu, Oct 4, 2018 at 9:54 PM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> Hi Marcelo,
>
> I see what you mean. I tried it, but I still get the same error message:
>
>> Error from python worker:
>>   Traceback (most recent call last):
>>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>>       __import__(pkg_name)
>>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
>>   ModuleNotFoundError: No module named 'py4j'
>> PYTHONPATH was:
>>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk3/yarn/usercache/jianshi.huang/filecache/134/__spark_libs__8468485589501316413.zip/spark-core_2.11-2.3.2.jar
>
> On Fri, Oct 5, 2018 at 1:25 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
>> expanded by the shell).
>>
>> But it's really weird to be setting SPARK_HOME in the environment of
>> your node managers. YARN shouldn't need to know about that.
>>
>> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>> >
>> > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
>> >
>> > That code shows that Spark builds the Python path from SPARK_HOME whenever
>> > it is set, and on my worker nodes SPARK_HOME is set in .bashrc to the path
>> > of the pre-installed 2.2.1.
>> >
>> > I don't want to make any changes to the worker node configuration, so is
>> > there any way to override that lookup order?
>> >
>> > Jianshi
>> >
>> > On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>
>> >> Normally the version of Spark installed on the cluster does not matter,
>> >> since Spark is uploaded from your gateway machine to YARN by default.
>> >>
>> >> You probably have some configuration (in spark-defaults.conf) that tells
>> >> YARN to use a cached copy. Get rid of that configuration, and you can use
>> >> whatever version you like.
>> >>
>> >> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a problem using multiple versions of PySpark on YARN. The driver
>> >> > and worker nodes are all preinstalled with Spark 2.2.1 for production
>> >> > jobs, and I want to use 2.3.2 for my personal EDA.
>> >> >
>> >> > I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(), but
>> >> > on the worker nodes the PYTHONPATH still points at the system SPARK_HOME.
>> >> >
>> >> > Does anyone know how to override the PYTHONPATH on the worker nodes?
>> >> >
>> >> > Here's the error message:
>> >> >
>> >> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> >> >> : org.apache.spark.SparkException: Job aborted due to stage failure:
>> >> >> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
>> >> >> in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2):
>> >> >> org.apache.spark.SparkException:
>> >> >> Error from python worker:
>> >> >>   Traceback (most recent call last):
>> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>> >> >>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>> >> >>       __import__(pkg_name)
>> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
>> >> >>   ModuleNotFoundError: No module named 'py4j'
>> >> >> PYTHONPATH was:
>> >> >>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>> >> >
>> >> > And here's how I start the PySpark session in Jupyter:
>> >> >
>> >> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>> >> >> %env PYSPARK_PYTHON=/usr/bin/python3
>> >> >> import findspark
>> >> >> findspark.init()
>> >> >> import pyspark
>> >> >> sparkConf = pyspark.SparkConf()
>> >> >> sparkConf.setAll([
>> >> >>     ('spark.cores.max', '96')
>> >> >>     ,('spark.driver.memory', '2g')
>> >> >>     ,('spark.executor.cores', '4')
>> >> >>     ,('spark.executor.instances', '2')
>> >> >>     ,('spark.executor.memory', '4g')
>> >> >>     ,('spark.network.timeout', '800')
>> >> >>     ,('spark.scheduler.mode', 'FAIR')
>> >> >>     ,('spark.shuffle.service.enabled', 'true')
>> >> >>     ,('spark.dynamicAllocation.enabled', 'true')
>> >> >> ])
>> >> >> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>> >> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
>> >> >>                           conf=sparkConf, pyFiles=py_files)
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Jianshi Huang
>> >>
>> >> --
>> >> Marcelo
>> >
>> > --
>> > Jianshi Huang
>> >
>> > LinkedIn: jianshi
>> > Twitter: @jshuang
>> > Github & Blog: http://huangjs.github.com/
>>
>> --
>> Marcelo
>
> --
> Jianshi Huang
--
Marcelo
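
For readers who land on this thread with the same ModuleNotFoundError: below is a
minimal, untested sketch of the direction Marcelo suggests, written against the
Jupyter setup quoted above. The idea is to stop executors from inheriting the
node's pre-installed SPARK_HOME and to ship py4j (and the matching pyspark zip)
with the job. The second HDFS path, the PYTHONPATH value, and the idea of
shipping pyspark.zip this way are assumptions for illustration, not something
confirmed in the thread.

    # Hypothetical variation on the session setup from the thread (untested sketch).
    # Assumption: the 2.3.2 pyspark.zip has been uploaded to HDFS next to the
    # py4j-0.10.7-src.zip that the thread already uses.
    import pyspark

    conf = pyspark.SparkConf()
    conf.setAll([
        # Point the executor-side environment away from the node's .bashrc
        # SPARK_HOME, per Marcelo's suggestion; the literal '$PWD' is left
        # unexpanded here and resolves to the YARN container's working
        # directory, where files shipped with the job are localized.
        ('spark.executorEnv.SPARK_HOME', '$PWD'),
        # Make the Python workers look at the shipped zips first.
        # (This PYTHONPATH value is an assumption, not confirmed in the thread.)
        ('spark.executorEnv.PYTHONPATH', 'pyspark.zip:py4j-0.10.7-src.zip'),
    ])

    py_files = [
        'hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip',
        # Hypothetical: the matching 2.3.2 pyspark.zip, uploaded the same way.
        'hdfs://emr-header-1.cluster-68492:9000/lib/pyspark.zip',
    ]

    sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
                              conf=conf, pyFiles=py_files)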
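
And a quick way to check Marcelo's "cached copy" point from inside an existing
session, assuming a live SparkContext named sc: spark.yarn.archive and
spark.yarn.jars are the usual properties that point YARN at a pre-staged copy of
the Spark jars instead of uploading the ones from the local SPARK_HOME, so if
either comes back set it is worth looking at spark-defaults.conf.

    # Minimal check for a cached Spark distribution configured in spark-defaults.conf.
    # Assumes an already-created SparkContext `sc` (e.g. the one from the thread).
    for key in ('spark.yarn.archive', 'spark.yarn.jars'):
        print(key, '=', sc.getConf().get(key, '<not set>'))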