Hey Mich,

Thanks for the detailed response. I get most of these options.

However, what we are trying to do is avoid having to upload the Spark config
files and the pyspark.zip archive to the cluster every time we execute the
job with spark-submit. Here is the code that does the upload:
https://github.com/apache/spark/blob/bacdb3b5fec9783f46042764eeee80eb2a0f5702/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L813

Wondering if there is a way to skip uploading the configs. Uploading the
pyspark.zip file can be skipped by setting
PYSPARK_ARCHIVES_PATH=local://....
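
For context, a rough sketch of the pyspark.zip workaround (the /opt/spark
paths and the py4j version are from our install, and this assumes Spark is
laid out the same way on every cluster node; the hdfs:// job path is just a
placeholder):

export PYSPARK_ARCHIVES_PATH="local:///opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip,local:///opt/spark/spark-3.5.0-bin-hadoop3/python/lib/py4j-0.10.9.7-src.zip"
spark-submit --master yarn --deploy-mode cluster hdfs:///path/to/job.py

What we have not found is an equivalent switch for the __spark_conf__
archive.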

On Mon, Dec 11, 2023 at 5:15 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Eugene,
>
> With regard to your points
>
> What are the PYTHONPATH and SPARK_HOME env variables in your script?
>
> OK, let us look at a typical Spark project structure of mine:
>
> - project_root
>   |-- README.md
>   |-- __init__.py
>   |-- conf
>   |   |-- (configuration files for Spark)
>   |-- deployment
>   |   |-- deployment.yaml
>   |-- design
>   |   |-- (design-related files or documentation)
>   |-- othermisc
>   |   |-- (other miscellaneous files)
>   |-- sparkutils
>   |   |-- (utility modules or scripts specific to Spark)
>   |-- src
>       |-- (main source code for your Spark application)
>
> If you want Spark to recognize modules from the sparkutils directory or
> any other directories within your project, you can include those
> directories in the PYTHONPATH.
>
> For example, if you want to include the sparkutils directory:
>
> export PYTHONPATH=/path/to/project_root/sparkutils:$PYTHONPATH
>
> To recap, the ${PYTHONPATH} variable is primarily used to specify
> additional directories where Python should look for modules and packages.
> In the context of Spark, it is typically used to include directories
> containing custom Python code or modules that your Spark application
> depends on.
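>
> For example, to make that directory visible inside the YARN containers as
> well, a sketch (this assumes the same path also exists on the cluster
> nodes, and my_app.py is just a placeholder):
>
> export PYTHONPATH=/path/to/project_root/sparkutils:$PYTHONPATH
> spark-submit --master yarn --deploy-mode cluster \
>    --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
>    --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
>    my_app.py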
>
> With regard to
>
> The --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" configuration
> option in Spark is used when submitting a Spark application to run on YARN:
>
>    - --conf: This is used to specify Spark configuration properties when
>      submitting a Spark application.
>
>    - spark.yarn.appMasterEnv.SPARK_HOME: This is a Spark configuration
>      property that defines the value of the SPARK_HOME environment variable
>      for the Spark application's Application Master (the process responsible
>      for managing the execution of tasks on a YARN cluster).
>
>    - $SPARK_HOME: This holds the path to the Spark installation directory.
>
> This configuration sets the SPARK_HOME environment variable for the
> Spark Application Master when the application is running on YARN. This is
> important because the Spark Application Master needs to know the location
> of the Spark installation directory (SPARK_HOME) to configure and manage
> the Spark application's execution on the YARN cluster.
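>
> Put together, a minimal sketch (my_app.py is just a placeholder, and
> SPARK_HOME should point at the Spark installation as seen by the cluster
> nodes):
>
>         spark-submit --master yarn --deploy-mode cluster \
>            --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
>            my_app.py
>
> HTH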
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 11 Dec 2023 at 01:43, Eugene Miretsky <eug...@badal.io> wrote:
>
>> Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't
>> understand a few things:
>>
>> 1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty,
>> pyspark.zip is uploaded from the local SPARK_HOME. If it is set to
>> "local://" the upload is skipped. I would expect the latter to be the
>> default. What's the use case for uploading the local pyspark.zip every
>> time?
>> 2) It seems like the localConfigs are meant to be copied every time
>> (code). What's the use case for that? Why not just use the cluster config?
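>>
>> For reference, a rough sketch of the invocation that now works for us (the
>> hdfs:// locations are placeholders for wherever pyspark.zip, the py4j
>> archive and the job file were pre-staged):
>>
>> PYSPARK_ARCHIVES_PATH="hdfs:///apps/spark/pyspark.zip,hdfs:///apps/spark/py4j-0.10.9.7-src.zip" \
>>   spark-submit --master yarn --deploy-mode cluster hdfs:///apps/jobs/foo.py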
>>
>>
>>
>> On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky <eug...@badal.io> wrote:
>>
>>> Thanks Mich,
>>>
>>> Tried this and still getting
>>> INFO Client: "Uploading resource
>>> file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
>>> hdfs:/<some path>". It is also doing it for (py4j-0.10.9.7-src.zip and
>>> __spark_conf__<some_id>.zip). It is working now because I enabled direct
>>> access to HDFS to allow copying the files. But ideally I would like to not
>>> have to copy any files directly to HDFS.
>>>
>>> 1) We would expect pyspark as well as the relevant configs to already be
>>> available on the cluster - why are they being copied over? (we can always
>>> provide the extra libraries needed using py-files the way you did)
>>> 2) If we wanted users to be able to use custom pyspark, we would rather
>>> just copy the file HDFS/GCS in other ways, and let users reference it in
>>> their job
>>> 3) What are the PYTHONPATH and SPARK_HOME env variables in your script?
>>> Are they local paths, or paths on the spark cluster?
>>>
>>> On Fri, Nov 17, 2023 at 8:57 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> How are you submitting your Spark job from your client?
>>>>
>>>> Your files can either be on HDFS or an HCFS such as gs, s3, etc.
>>>>
>>>> With reference to '--py-files hdfs://yarn-master-url hdfs://foo.py', I
>>>> assume you want something like this:
>>>>
>>>>         spark-submit --verbose \
>>>>            --deploy-mode cluster \
>>>>            --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
>>>>            --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
>>>>            --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
>>>>            --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
>>>>            --conf "spark.driver.memory"=4G \
>>>>            --conf "spark.executor.memory"=4G \
>>>>            --conf "spark.executor.instances"=4 \
>>>>            --conf "spark.executor.cores"=2 \
>>>>            $CODE_DIRECTORY_CLOUD/${APPLICATION}
>>>>
>>>> In my case I define $CODE_DIRECTORY_CLOUD as below on Google Cloud
>>>> Storage:
>>>>
>>>> CODE_DIRECTORY="/home/hduser/dba/bin/python/"
>>>> CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
>>>> cd $CODE_DIRECTORY
>>>> [ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
>>>> echo `date` ", ===> creating source zip directory from  ${source_code}"
>>>> # zip needs to be done at root directory of code
>>>> zip -rq ${source_code}.zip ${source_code}
>>>> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
>>>> gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION}
>>>> $CODE_DIRECTORY_CLOUD
>>>>
>>>> So, in summary, I create a zip file of my project, copy it across to
>>>> cloud storage, put the application (the .py file) there as well, and
>>>> reference both in spark-submit.
>>>>
>>>> I trust this answers your question.
>>>>
>>>> HTH
>>>>
>>>>
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist, Solutions Architect & Engineer
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky <eug...@badal.io.invalid>
>>>> wrote:
>>>>
>>>>> Hey All,
>>>>>
>>>>> We are running PySpark spark-submit from a client outside the cluster.
>>>>> The client has network connectivity only to the YARN master, not the HDFS
>>>>> datanodes. How can we submit the jobs? The idea would be to preload all
>>>>> the dependencies (job code, libraries, etc.) to HDFS, and just submit the
>>>>> job from the client.
>>>>>
>>>>> We tried something like this
>>>>> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit
>>>>> --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url
>>>>> hdfs://foo.py'
>>>>>
>>>>> The error we are getting is
>>>>> "
>>>>>
>>>>> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout
>>>>> while waiting for channel to be ready for connect. ch :
>>>>> java.nio.channels.SocketChannel[connection-pending remote=/
>>>>> 10.117.110.19:9866]
>>>>>
>>>>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
>>>>> /user/users/.sparkStaging/application_1698216436656_0104/
>>>>> *spark_conf.zip* could only be written to 0 of the 1 minReplication
>>>>> nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this
>>>>> operation.
>>>>> "
>>>>>
>>>>> A few questions:
>>>>> 1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf
>>>>> files? Why would the client send them to the cluster? (The cluster
>>>>> already has all that info - this would make sense in client mode, but not
>>>>> in cluster mode.)
>>>>> 2) Is it possible to use spark-submit without HDFS access?
>>>>> 3) How would we fix this?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>>
>>>>>
>>>
>>>

-- 

*Eugene Miretsky*
Managing Partner |  Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile:  416-568-9245
email:     eug...@badal.io <zb...@badal.io>
