Re: [EXTERNAL] Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Eugene Miretsky
Hey Mich, Thanks for the detailed response. I get most of these options. However, what we are trying to do is avoid having to upload the source configs and pyspark.zip files to the cluster every time we execute the job using spark-submit. Here is the code that does it: https://github.com/apache/s
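A minimal sketch of the one-time staging this is aiming at, assuming hypothetical HDFS paths (the PYSPARK_ARCHIVES_PATH mechanism that consumes these archives is confirmed elsewhere in the thread):

    # Stage the PySpark runtime archives on HDFS once:
    hdfs dfs -mkdir -p /spark/archives
    hdfs dfs -put $SPARK_HOME/python/lib/pyspark.zip /spark/archives/
    hdfs dfs -put $SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip /spark/archives/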

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-11 Thread Mich Talebzadeh
Hi Eugene, With regard to your points: What are the PYTHONPATH and SPARK_HOME env variables in your script? OK, let us look at a typical Spark project structure of mine - project_root |-- README.md |-- __init__.py |-- conf | |-- (configuration files for Spark) |-- deployment | |-- d
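For reference, a client-side environment for a layout like this might look as follows; the install path matches the log line quoted elsewhere in the thread, and the rest is an illustrative assumption:

    export SPARK_HOME=/opt/spark/spark-3.5.0-bin-hadoop3
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH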

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't understand a few things: 1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip is uploaded from the local SPARK_HOME. If it is set to "local://" the upload is skipped. I would expect the latter to be the default
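A sketch of the three behaviours described here, with hypothetical archive paths:

    # unset        -> pyspark.zip is uploaded from the local SPARK_HOME
    # local://...  -> the upload is skipped; files must already exist on every node
    # hdfs://...   -> the cluster fetches the archives itself, no client upload
    export PYSPARK_ARCHIVES_PATH=hdfs:///spark/archives/pyspark.zip,hdfs:///spark/archives/py4j-0.10.9.7-src.zip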

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Thanks Mich, Tried this and still getting INFO Client: "Uploading resource file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip -> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and __spark_conf__.zip. It is working now because I enabled direct access to HDFS to allow copying t
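For the jar side of the same upload, spark.yarn.archive can point YARN at a pre-staged archive of Spark's jars; a sketch with a hypothetical path (note the __spark_conf__.zip seen in the log is still assembled and shipped by the client):

    spark-submit --master yarn --deploy-mode cluster \
      --conf spark.yarn.archive=hdfs:///spark/archives/spark-libs.zip \
      hdfs:///apps/foo.py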

Re: Spark-submit without access to HDFS

2023-11-17 Thread Mich Talebzadeh
Hi, How are you submitting your spark job from your client? Your files can either be on HDFS or HCFS such as gs, s3 etc. With reference to --py-files hdfs://yarn-master-url hdfs://foo.py, I assume you want your spark-submit --verbose \ --deploy-mode cluster \ --co
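A fuller version of the submit command sketched here, with hypothetical paths:

    spark-submit --verbose \
      --master yarn \
      --deploy-mode cluster \
      --py-files hdfs:///apps/foo_deps.zip \
      hdfs:///apps/foo.py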

Re: Spark-submit without access to HDFS

2023-11-16 Thread Jörn Franke
I am not 100% sure but I do not think this works - the driver would need access to HDFS. What you could try (have not tested it though in your scenario): - use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html - host the zip file on an https server and use that url (I would
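On the second suggestion: spark-submit accepts http/https URIs for --py-files and for the application file, so an untested sketch with a hypothetical host would be:

    spark-submit --master yarn --deploy-mode cluster \
      --py-files https://artifacts.example.com/deps.zip \
      https://artifacts.example.com/job.py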

Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
hdfs-site.xml, for instance, fs.oss.impl, etc. eabour From: Eugene Miretsky Date: 2023-11-16 09:58 To: eab...@163.com CC: Eugene Miretsky; user @spark Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS Hey! Thanks for the response. We are getting the error because there is no ne
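Filesystem implementations such as fs.oss.impl can also be passed per job via Spark's spark.hadoop.* passthrough rather than editing hdfs-site.xml; a sketch, assuming the Aliyun OSS connector class shipped with hadoop-aliyun:

    spark-submit \
      --conf spark.hadoop.fs.oss.impl=org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem \
      ...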

Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
functioning properly. > It seems that the issue might be due to insufficient disk space. > > -- > eabour > > > *From:* Eugene Miretsky > *Date:* 2023-11-16 05:31 > *To:* user > *Subject:* Spark-submit without access to HDFS > Hey All,

Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
to insufficient disk space. eabour From: Eugene Miretsky Date: 2023-11-16 05:31 To: user Subject: Spark-submit without access to HDFS Hey All, We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not the HDFS

Spark-submit without access to HDFS

2023-11-15 Thread Eugene Miretsky
Hey All, We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not the HDFS DataNodes. How can we submit the jobs? The idea would be to preload all the dependencies (job code, libraries, etc.) to HDFS, and just submit the
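An end-to-end sketch of that idea, with hypothetical paths; note the staging step itself has to run from a host that can reach the HDFS DataNodes:

    # One-time staging, from inside the cluster or a host with HDFS access:
    hdfs dfs -mkdir -p /apps/myjob
    hdfs dfs -put deps.zip main.py /apps/myjob/
    # Submit referencing only the pre-staged HDFS paths:
    spark-submit --master yarn --deploy-mode cluster \
      --py-files hdfs:///apps/myjob/deps.zip \
      hdfs:///apps/myjob/main.py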