Hey Mich,
Thanks for the detailed response. I get most of these options.
However, what we are trying to do is avoid having to upload the source
configs and pyspark.zip files to the cluster every time we execute the job
using spark-submit. Here is the code that does it:
https://github.com/apache/s
Hi Eugene,
With regard to your points:
What are the PYTHONPATH and SPARK_HOME env variables in your script?
OK, let us look at a typical Spark project structure of mine
- project_root
|-- README.md
|-- __init__.py
|-- conf
| |-- (configuration files for Spark)
|-- deployment
| |-- d
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't
understand a few things:
1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip
is uploaded from the local SPARK_HOME. If it is set to "local://", the
upload is skipped. I would expect the latter to be the default
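For anyone following along, here is a minimal sketch of the approach that worked (the namenode address and HDFS paths are placeholders, not the actual cluster values; the archives must be staged on HDFS once beforehand):

```shell
# One-time: stage the Python archives on HDFS so spark-submit does not
# re-upload them from the local SPARK_HOME on every submission.
#   hdfs dfs -mkdir -p /spark/archives
#   hdfs dfs -put $SPARK_HOME/python/lib/pyspark.zip /spark/archives/
#   hdfs dfs -put $SPARK_HOME/python/lib/py4j-0.10.9.7-src.zip /spark/archives/

# Point spark-submit at the pre-staged copies.
export PYSPARK_ARCHIVES_PATH="hdfs://namenode:8020/spark/archives/pyspark.zip,hdfs://namenode:8020/spark/archives/py4j-0.10.9.7-src.zip"

spark-submit --master yarn --deploy-mode cluster my_job.py
```

As noted later in this thread, the client may still need some HDFS access for the remaining uploads (e.g. __spark_conf__.zip), so this is a partial fix rather than a complete one.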
Thanks Mich,
Tried this and still getting
INFO Client: "Uploading resource
file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and
__spark_conf__.zip. It is working now because I enabled direct
access to HDFS to allow copying t
Hi,
How are you submitting your spark job from your client?
Your files can either be on HDFS or HCFS such as gs, s3 etc.
With reference to --py-files hdfs://yarn-master-url hdfs://foo.py, I
assume you want your
spark-submit --verbose \
--deploy-mode cluster \
--co
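For completeness, a sketch of the kind of submission I have in mind (the application name, namenode address, and file paths below are placeholders, not values from your setup):

```shell
# Submit in cluster mode, referencing dependencies already on HDFS
# instead of uploading them from the client.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name my_app \
  --py-files hdfs://namenode:8020/apps/deps.zip \
  hdfs://namenode:8020/apps/foo.py
```

In cluster mode the driver runs inside the cluster, so it is the YARN nodes, not the submitting client, that resolve those hdfs:// URIs.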
I am not 100% sure, but I do not think this works - the driver would need
access to HDFS.
What you could try (have not tested it in your scenario, though):
- use Spark Connect: https://spark.apache.org/docs/latest/spark-connect-overview.html
- host the zip file on a https server and use that url (I would
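As a rough sketch of the Spark Connect route (untested in this scenario; the host name and port are placeholders, and it requires a Spark Connect server running on the cluster side, plus pyspark>=3.4 with the connect extras installed on the client):

```python
from pyspark.sql import SparkSession

# The client only needs network access to the Spark Connect server
# (default port 15002), not to the HDFS datanodes - the actual work,
# including reading from HDFS, happens on the cluster side.
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

df = spark.range(10)
print(df.count())
```

This sidesteps spark-submit entirely, which is why it may fit a client that can only reach one endpoint in the cluster.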
hdfs-site.xml, for instance,
fs.oss.impl, etc.
eabour
From: Eugene Miretsky
Date: 2023-11-16 09:58
To: eab...@163.com
CC: Eugene Miretsky; user @spark
Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS
Hey!
Thanks for the response.
We are getting the error because there is no ne
functioning properly.
> It seems that the issue might be due to insufficient disk space.
>
> --
> eabour
>
>
> *From:* Eugene Miretsky
> *Date:* 2023-11-16 05:31
> *To:* user
> *Subject:* Spark-submit without access to HDFS
> Hey All,
Hey All,
We are running Pyspark spark-submit from a client outside the cluster. The
client has network connectivity only to the Yarn Master, not the HDFS
Datanodes. How can we submit the jobs? The idea would be to preload all the
dependencies (job code, libraries, etc) to HDFS, and just submit the