Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

Marcelo Vanzin Fri, 05 Oct 2018 08:59:59 -0700

Sorry, I can't help you if that doesn't work. Your YARN RM really
should not have SPARK_HOME set if you want to use more than one Spark
version.
On Thu, Oct 4, 2018 at 9:54 PM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> Hi Marcelo,
>
> I see what you mean. I tried it but still got the same error message.
>
>> Error from python worker:
>>   Traceback (most recent call last):
>>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>>       __import__(pkg_name)
>>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
>>   ModuleNotFoundError: No module named 'py4j'
>> PYTHONPATH was:
>>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk3/yarn/usercache/jianshi.huang/filecache/134/__spark_libs__8468485589501316413.zip/spark-core_2.11-2.3.2.jar
>
>
> On Fri, Oct 5, 2018 at 1:25 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>>
>> Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get
>> expanded by the shell).
>>
>> But it's really weird to be setting SPARK_HOME in the environment of
>> your node managers. YARN shouldn't need to know about that.
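
The setting suggested above can also be supplied from the notebook instead of the command line. What follows is only a minimal sketch, assuming the value is passed through SparkConf; `executor_env_overrides` is a hypothetical helper name, not part of Spark. Python does not expand `$PWD` inside a string literal, so the value reaches Spark unchanged and no shell quoting is involved. Elsewhere in this thread the approach is reported not to have fixed the error, so treat it as something to try rather than a confirmed fix.

```python
def executor_env_overrides(spark_home="$PWD"):
    """Hypothetical helper: build (key, value) pairs for SparkConf.setAll().

    spark.executorEnv.<NAME> sets environment variable <NAME> for executors.
    The literal "$PWD" is kept as-is; Python performs no shell expansion.
    """
    return [("spark.executorEnv.SPARK_HOME", spark_home)]


overrides = executor_env_overrides()
print(overrides)
```

In the notebook this would be applied with `sparkConf.setAll(overrides)` before the SparkContext is created.
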
>> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>> >
>> > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31
>> >
>> > The code shows that Spark builds the Python path from SPARK_HOME when it
>> > is set. On my worker nodes, SPARK_HOME is set in .bashrc and points to
>> > the pre-installed 2.2.1 path.
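
To make the behaviour described above concrete, here is a rough Python paraphrase of the linked Scala logic. It is a sketch reconstructed from the PYTHONPATH shown in the error output, not a copy of Spark's code; `spark_python_path` and the shortened jar path are hypothetical. The point it illustrates is that the py4j file name comes from the Spark version launching the worker (2.3.2 here), while the directory comes from SPARK_HOME in the worker's environment.

```python
import posixpath


def spark_python_path(env, spark_jar, py4j_zip="py4j-0.10.7-src.zip"):
    """Hypothetical paraphrase of how the worker PYTHONPATH is assembled."""
    entries = []
    spark_home = env.get("SPARK_HOME")
    if spark_home is not None:
        # Directory is taken from the worker's SPARK_HOME; the py4j file
        # name is fixed by the launching Spark version, not by what is
        # actually installed under SPARK_HOME.
        entries.append(posixpath.join(spark_home, "python", "lib", "pyspark.zip"))
        entries.append(posixpath.join(spark_home, "python", "lib", py4j_zip))
    # The jar containing Spark's own classes is appended last.
    entries.append(spark_jar)
    return ":".join(entries)


# Placeholder jar path; the real one lives under the YARN filecache.
worker_env = {"SPARK_HOME": "/usr/lib/spark-current"}
print(spark_python_path(worker_env, "__spark_libs__.zip/spark-core_2.11-2.3.2.jar"))
```

If the pre-installed 2.2.1 tree ships a differently named py4j zip, the second entry would point at a file that does not exist, which would be consistent with the `No module named 'py4j'` error. That is an inference from the output, not something confirmed in the thread.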
>> >
>> > I don't want to change the worker node configuration, so is there any
>> > way to override the order?
>> >
>> > Jianshi
>> >
>> > On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin <van...@cloudera.com> wrote:
>> >>
>> >> Normally the version of Spark installed on the cluster does not
>> >> matter, since Spark is uploaded from your gateway machine to YARN by
>> >> default.
>> >>
>> >> You probably have some configuration (in spark-defaults.conf) that
>> >> tells YARN to use a cached copy. Get rid of that configuration, and
>> >> you can use whatever version you like.
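
One way to check for the kind of configuration mentioned above is to scan spark-defaults.conf for settings that point YARN at a pre-staged copy of Spark. The sketch below assumes the relevant keys are `spark.yarn.archive` and `spark.yarn.jars`; the thread does not name them, so that choice is an assumption. `find_cached_spark_settings` and the sample file contents are hypothetical.

```python
import re


def find_cached_spark_settings(conf_text,
                               keys=("spark.yarn.archive", "spark.yarn.jars")):
    """Return the subset of `keys` that are set in spark-defaults.conf text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Key and value are separated by whitespace and/or '='.
        parts = re.split(r"[=\s]+", line, maxsplit=1)
        if len(parts) == 2 and parts[0] in keys:
            found[parts[0]] = parts[1].strip()
    return found


# Hypothetical file contents, for illustration only.
sample = """
# spark-defaults.conf
spark.master            yarn
spark.yarn.archive      hdfs:///spark/2.2.1/spark-libs.zip
"""
print(find_cached_spark_settings(sample))
```

An empty result would suggest that Spark is being uploaded from the gateway machine, as described above.
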
>> >> On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > I have a problem using multiple versions of PySpark on YARN. The driver
>> >> > and worker nodes all have Spark 2.2.1 preinstalled for production
>> >> > tasks, and I want to use 2.3.2 for my personal EDA.
>> >> >
>> >> > I've tried both the 'pyFiles=' option and SparkContext.addPyFile();
>> >> > however, on the worker nodes the PYTHONPATH still uses the system
>> >> > SPARK_HOME.
>> >> >
>> >> > Does anyone know how to override the PYTHONPATH on worker nodes?
>> >> >
>> >> > Here's the error message:
>> >> >>
>> >> >>
>> >> >> Py4JJavaError: An error occurred while calling o75.collectToPython.
>> >> >> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, emr-worker-8.cluster-68492, executor 2): org.apache.spark.SparkException:
>> >> >> Error from python worker:
>> >> >>   Traceback (most recent call last):
>> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in _run_module_as_main
>> >> >>       mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
>> >> >>     File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 109, in _get_module_details
>> >> >>       __import__(pkg_name)
>> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/__init__.py", line 46, in <module>
>> >> >>     File "/usr/lib/spark-current/python/lib/pyspark.zip/pyspark/context.py", line 29, in <module>
>> >> >>   ModuleNotFoundError: No module named 'py4j'
>> >> >> PYTHONPATH was:
>> >> >>   /usr/lib/spark-current/python/lib/pyspark.zip:/usr/lib/spark-current/python/lib/py4j-0.10.7-src.zip:/mnt/disk1/yarn/usercache/jianshi.huang/filecache/130/__spark_libs__5227988272944669714.zip/spark-core_2.11-2.3.2.jar
>> >> >
>> >> >
>> >> > And here's how I started the PySpark session in Jupyter:
>> >> >>
>> >> >>
>> >> >> %env SPARK_HOME=/opt/apps/ecm/service/spark/2.3.2-bin-hadoop2.7
>> >> >> %env PYSPARK_PYTHON=/usr/bin/python3
>> >> >> import findspark
>> >> >> findspark.init()
>> >> >> import pyspark
>> >> >> sparkConf = pyspark.SparkConf()
>> >> >> sparkConf.setAll([
>> >> >>     ('spark.cores.max', '96')
>> >> >>     ,('spark.driver.memory', '2g')
>> >> >>     ,('spark.executor.cores', '4')
>> >> >>     ,('spark.executor.instances', '2')
>> >> >>     ,('spark.executor.memory', '4g')
>> >> >>     ,('spark.network.timeout', '800')
>> >> >>     ,('spark.scheduler.mode', 'FAIR')
>> >> >>     ,('spark.shuffle.service.enabled', 'true')
>> >> >>     ,('spark.dynamicAllocation.enabled', 'true')
>> >> >> ])
>> >> >> py_files = ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip']
>> >> >> sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client",
>> >> >>                           conf=sparkConf, pyFiles=py_files)
>> >> >>
>> >> >
>> >> >
>> >> > Thanks,
>> >> > --
>> >> > Jianshi Huang
>> >> >
>> >>
>> >>
>> >> --
>> >> Marcelo
>> >
>> >
>> >
>> > --
>> > Jianshi Huang
>> >
>> > LinkedIn: jianshi
>> > Twitter: @jshuang
>> > Github & Blog: http://huangjs.github.com/
>>
>>
>>
>> --
>> Marcelo
>
>
>
> --
> Jianshi Huang
>



-- 
Marcelo

