Thanks Mich,

Tried this and I am still getting INFO Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/<some path>". The same thing happens for py4j-0.10.9.7-src.zip and __spark_conf__<some_id>.zip. It is working now only because I enabled direct access to HDFS so the files can be copied, but ideally I would like to not have to copy any files to HDFS from the client at all.
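To make this concrete, the shape of submit we are aiming for is roughly the one below, with the large archives already staged on HDFS so the client does not have to upload them (the hdfs:/// paths are placeholders, not our real ones):

# pyspark.zip and the py4j zip pre-staged on HDFS, so the client does not upload them
export PYSPARK_ARCHIVES_PATH=hdfs:///spark/pyspark.zip,hdfs:///spark/py4j-0.10.9.7-src.zip
# Spark jars archive also pre-staged (spark.yarn.archive) instead of being zipped and uploaded per job
spark-submit --verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.archive=hdfs:///spark/spark-libs.zip \
--py-files hdfs:///apps/our_libs.zip \
hdfs:///apps/our_job.py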
A few questions on your setup:

1) We would expect pyspark as well as the relevant configs to already be available on the cluster - why are they being copied over? (We can always provide the extra libraries needed using --py-files the way you did.)
2) If we wanted users to be able to use a custom pyspark, we would rather copy the file to HDFS/GCS in some other way and let users reference it in their job.
3) What are the PYTHONPATH and SPARK_HOME env variables in your script? Are they local paths on the client, or paths on the Spark cluster?
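To illustrate what we mean by that last question (the cluster-side path below is made up): on our client SPARK_HOME is /opt/spark/spark-3.5.0-bin-hadoop3, so forwarding the client values as in

--conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
--conf "spark.yarn.appMasterEnv.PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip" \
--conf "spark.executorEnv.PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip" \

only makes sense if the same directories also exist on every cluster node; if the cluster has its own install (say /usr/lib/spark), those variables would have to point there instead.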
On Fri, Nov 17, 2023 at 8:57 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> How are you submitting your spark job from your client?
>
> Your files can either be on HDFS or HCFS such as gs, s3 etc.
>
> With reference to --py-files hdfs://yarn-master-url hdfs://foo.py, I
> assume you want your spark-submit to look something like this:
>
> spark-submit --verbose \
> --deploy-mode cluster \
> --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
> --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
> --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
> --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
> --conf "spark.driver.memory"=4G \
> --conf "spark.executor.memory"=4G \
> --conf "spark.executor.instances"=4 \
> --conf "spark.executor.cores"=2 \
> $CODE_DIRECTORY_CLOUD/${APPLICATION}
>
> In my case I define $CODE_DIRECTORY_CLOUD as below, on Google Cloud Storage:
>
> CODE_DIRECTORY="/home/hduser/dba/bin/python/"
> CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
> cd $CODE_DIRECTORY
> [ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
> echo `date` ", ===> creating source zip directory from ${source_code}"
> # zip needs to be done at root directory of code
> zip -rq ${source_code}.zip ${source_code}
> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
> gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
>
> So in summary I create a zip file of my project, copy it across to the
> cloud storage, put the application (py file) there as well, and use
> them in spark-submit.
>
> I trust this answers your question.
>
> HTH
>
> Mich Talebzadeh,
> Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky <eug...@badal.io.invalid> wrote:
>
>> Hey All,
>>
>> We are running Pyspark spark-submit from a client outside the cluster.
>> The client has network connectivity only to the Yarn Master, not the HDFS
>> Datanodes. How can we submit the jobs? The idea would be to preload all the
>> dependencies (job code, libraries, etc.) to HDFS, and just submit the job
>> from the client.
>>
>> We tried something like this:
>> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master
>> yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'
>>
>> The error we are getting is
>> "
>> org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while
>> waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
>>
>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
>> /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip
>> could only be written to 0 of the 1 minReplication nodes. There are 2
>> datanode(s) running and 2 node(s) are excluded in this operation.
>> "
>>
>> A few questions:
>> 1) What are the spark_conf.zip files? Is it the hive-site/yarn-site conf
>> files? Why would the client send them to the cluster? (The cluster already
>> has all that info - this would make sense in client mode, but not cluster
>> mode.)
>> 2) Is it possible to use spark-submit without HDFS access?
>> 3) How would we fix this?
>>
>> Cheers,
>> Eugene
>>
>> --
>> *Eugene Miretsky*
>> Managing Partner | Badal.io | Book a meeting /w me!
>> <http://calendly.com/eugene-badal>
>> mobile: 416-568-9245
>> email: eug...@badal.io <zb...@badal.io>
>>

--
*Eugene Miretsky*
Managing Partner | Badal.io | Book a meeting /w me!
<http://calendly.com/eugene-badal>
mobile: 416-568-9245
email: eug...@badal.io <zb...@badal.io>