Hi Eugene,
  As the logs indicate, when executing spark-submit, Spark packages spark/conf 
and uploads it to HDFS, along with spark/jars. These files are uploaded to HDFS 
unless you configure another object store (such as OSS) as the staging 
filesystem. To do so, you'll need to modify the filesystem configuration, for 
instance fs.oss.impl, in hdfs-site.xml.
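
A minimal sketch of what the submit side might then look like (the bucket 
name, endpoint, and staging path below are hypothetical placeholders for 
Alibaba Cloud OSS, and it assumes the OSS Hadoop connector jars are available 
on the classpath):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.yarn.stagingDir=oss://my-bucket/spark-staging \
      --conf spark.hadoop.fs.oss.impl=org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem \
      --conf spark.hadoop.fs.oss.endpoint=oss-cn-hangzhou.aliyuncs.com \
      foo.py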



eabour
 
From: Eugene Miretsky
Date: 2023-11-16 09:58
To: eab...@163.com
CC: Eugene Miretsky; user @spark
Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS
Hey! 

Thanks for the response. 

We are getting the error because there is no network connectivity to the data 
nodes - that's expected. 

What I am trying to find out is WHY we need access to the data nodes, and if 
there is a way to submit a job without it. 

Cheers,
Eugene

On Wed, Nov 15, 2023 at 7:32 PM eab...@163.com <eab...@163.com> wrote:
Hi Eugene,
    I think you should check whether the HDFS service is running properly. From 
the logs, it appears that there are two datanodes in HDFS, but neither of them 
is healthy. Please investigate why the datanodes are not functioning properly; 
the issue might be due to insufficient disk space.
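
For a quick check (assuming shell access to a node with the HDFS client 
configured), these standard commands report each datanode's state and 
remaining capacity:

    hdfs dfsadmin -report
    hdfs dfs -df -h /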



eabour
 
From: Eugene Miretsky
Date: 2023-11-16 05:31
To: user
Subject: Spark-submit without access to HDFS
Hey All, 

We are running Pyspark spark-submit from a client outside the cluster. The 
client has network connectivity only to the Yarn Master, not the HDFS 
Datanodes. How can we submit the jobs? The idea would be to preload all the 
dependencies (job code, libraries, etc.) to HDFS, and just submit the job from 
the client. 
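
(For reference, a preload step like that, run once from a host inside the 
cluster that does have datanode access, might look like the following; the 
target path is a placeholder:

    hdfs dfs -mkdir -p hdfs://some-path/deps
    hdfs dfs -put pyspark.zip foo.py hdfs://some-path/deps/
)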

We tried something like this:

PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files hdfs://yarn-master-url \
  hdfs://foo.py

The error we are getting is:
"
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could 
only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) 
running and 2 node(s) are excluded in this operation.
" 

A few questions 
1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf files? 
Why would the client send them to the cluster? (The cluster already has all 
that info; this would make sense in client mode, but not in cluster mode.)
2) Is it possible to use spark-submit without HDFS access? 
3) How would we fix this?  

Cheers,
Eugene

-- 

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me! 
mobile:  416-568-9245
email:     eug...@badal.io


