I am not 100% sure, but I do not think this works - the driver would need access to HDFS.
What you could try (though I have not tested it in your scenario):
- host the zip file on an https server and use that URL (I would recommend against it, though, for various reasons such as reliability) - see the sketch below
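
Roughly, a minimal sketch of that approach - untested, and the server URL is a hypothetical placeholder (the hdfs:// path is the one from your command):

    # all URLs below are placeholders
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --py-files https://your-server.example.com/pyspark-deps.zip \
      hdfs://foo.py

Spark's dependency handling accepts http/https/ftp URIs for --py-files in addition to hdfs, so the archive would be fetched from the web server rather than from the datanodes.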


On 15.11.2023 at 22:33, Eugene Miretsky <eug...@badal.io.invalid> wrote:


Hey All, 

We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not to the HDFS datanodes. How can we submit jobs in this setup? The idea would be to preload all the dependencies (job code, libraries, etc.) to HDFS and just submit the job from the client.

We tried something like this:

    PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py

The errors we are getting are:

org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation.


A few questions:
1) What are the spark_conf.zip files? Are they the hive-site/yarn-site conf files? Why would the client send them to the cluster? (The cluster already has all that info - this would make sense in client mode, but not in cluster mode.)
2) Is it possible to use spark-submit without HDFS access? 
3) How would we fix this?  

Cheers,
Eugene

--

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me! 
mobile 416-568-9245
email    eug...@badal.io
