Thanks Mich,

Tried this and am still getting
INF Client: "Uploading resource
file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
hdfs:/<some path>". It is also doing the same for py4j-0.10.9.7-src.zip and
__spark_conf__<some_id>.zip. It is working now because I enabled direct
access to HDFS to allow copying the files, but ideally I would not have to
copy any files directly to HDFS from the client at all.
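
For context, here is roughly what I was hoping the submit could look like
(purely a sketch with made-up hdfs:///libs and hdfs:///apps paths; I am
assuming PYSPARK_ARCHIVES_PATH and spark.yarn.archive can point at archives
that were staged to HDFS once, out of band, so the client does not upload
them on every submit):

   # One-off staging, run from a host that does have HDFS access
   # (paths below are hypothetical)
   hdfs dfs -put $SPARK_HOME/python/lib/pyspark.zip hdfs:///libs/
   hdfs dfs -put $SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip hdfs:///libs/
   # spark-libs.zip would be a zip of $SPARK_HOME/jars, also uploaded once
   hdfs dfs -put spark-libs.zip hdfs:///libs/

   # Per-job submit from the client, referencing the pre-staged archives
   export PYSPARK_ARCHIVES_PATH=hdfs:///libs/pyspark.zip,hdfs:///libs/py4j-0.10.9.7-src.zip
   spark-submit --verbose \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.archive=hdfs:///libs/spark-libs.zip \
      --py-files hdfs:///apps/myjob/deps.zip \
      hdfs:///apps/myjob/foo.py

My (possibly wrong) understanding is that __spark_conf__<some_id>.zip is
generated by the client on every submit, since it carries the submit-time
configuration for the application master, so that one may not be avoidable.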

1) We would expect pyspark as well as the relevant configs to already be
available on the cluster, so why are they being copied over? (We can always
provide the extra libraries we need via --py-files, the way you did.)
2) If we wanted users to be able to use a custom pyspark, we would rather
copy the file to HDFS/GCS some other way and let users reference it from
their job (a rough sketch of what I mean follows these questions).
3) What are the PYTHONPATH and SPARK_HOME environment variables in your
script? Are they local paths, or paths on the Spark cluster?
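
On (2), this is the kind of thing I had in mind for a custom pyspark: stage
the zip on GCS/HDFS once, outside spark-submit, and have the job reference
it. The bucket and paths below are made up, and I am assuming
PYSPARK_ARCHIVES_PATH accepts gs:// paths the same way it accepts hdfs://
ones:

   # pyspark-custom.zip and the py4j zip are uploaded to GCS once, out of band
   PYSPARK_ARCHIVES_PATH=gs://my-bucket/libs/pyspark-custom.zip,gs://my-bucket/libs/py4j-0.10.9.7-src.zip \
   spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --py-files gs://my-bucket/apps/deps.zip \
      gs://my-bucket/apps/foo.py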

On Fri, Nov 17, 2023 at 8:57 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> How are you submitting your spark job from your client?
>
> Your files can either be on HDFS or HCFS such as gs, s3 etc.
>
> With reference to '--py-files hdfs://yarn-master-url hdfs://foo.py', I
> assume you want your spark-submit to look something like this:
>
>         spark-submit --verbose \
>            --deploy-mode cluster \
>            --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
>            --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
>            --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
>            --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
>            --conf "spark.driver.memory"=4G \
>            --conf "spark.executor.memory"=4G \
>            --conf "spark.num.executors"=4 \
>            --conf "spark.executor.cores"=2 \
>            $CODE_DIRECTORY_CLOUD/${APPLICATION}
>
> In my case I define $CODE_DIRECTORY_CLOUD as below, on Google Cloud Storage:
>
> CODE_DIRECTORY="/home/hduser/dba/bin/python/"
> CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
> cd $CODE_DIRECTORY
> [ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
> echo `date` ", ===> creating source zip directory from  ${source_code}"
> # zip needs to be done at root directory of code
> zip -rq ${source_code}.zip ${source_code}
> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
> gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION} \
>   $CODE_DIRECTORY_CLOUD
>
> So in summary, I create a zip file of my project, copy it across to the
> cloud storage, put the application (the .py file) there as well, and then
> reference both in spark-submit.
>
> I trust this answers your question.
>
> HTH
>
>
>
> Mich Talebzadeh,
> Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky <eug...@badal.io.invalid>
> wrote:
>
>> Hey All,
>>
>> We are running PySpark spark-submit from a client outside the cluster.
>> The client has network connectivity only to the Yarn Master, not the HDFS
>> Datanodes. How can we submit the jobs? The idea would be to preload all the
>> dependencies (job code, libraries, etc) to HDFS, and just submit the job
>> from the client.
>>
>> We tried something like this
>> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
>> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'
>>
>> The error we are getting is
>> "
>>
>> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
>> waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending remote=/
>> 10.117.110.19:9866]
>>
>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
>> /user/users/.sparkStaging/application_1698216436656_0104/*spark_conf.zip*
>> could only be written to 0 of the 1 minReplication nodes. There are 2
>> datanode(s) running and 2 node(s) are excluded in this operation.
>> "
>>
>> A few questions:
>> 1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf
>> files? Why would the client send them to the cluster? (The cluster
>> already has all that info; this would make sense in client mode, but not
>> in cluster mode.)
>> 2) Is it possible to use spark-submit without HDFS access?
>> 3) How would we fix this?
>>
>> Cheers,
>> Eugene
>>
>> --
>>
>> *Eugene Miretsky*
>> Managing Partner |  Badal.io | Book a meeting /w me!
>> <http://calendly.com/eugene-badal>
>> mobile:  416-568-9245
>> email:     eug...@badal.io <zb...@badal.io>
>>
>

-- 

*Eugene Miretsky*
Managing Partner |  Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile:  416-568-9245
email:     eug...@badal.io <zb...@badal.io>
