I personally use a zip file and pass the application name (in your case
main.py) as the last argument, like below.

APPLICATION is your main.py. It does not need to be called main.py; it
could be anything, for example testpython.py.

CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"   ## for S3, replace gs:// with s3://
# the zip needs to be created at the root directory of the code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD   ## for S3, replace gsutil cp with aws s3 cp
gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
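For the S3 route mentioned in the comments above, the equivalent upload
step would look something like this (the bucket name here is illustrative,
not from the original; the pattern is the same: zip the code, upload the
zip, upload the entry-point script):

CODE_DIRECTORY_CLOUD="s3://your-bucket/codes"   ## illustrative bucket/prefix
zip -rq ${source_code}.zip ${source_code}       ## run from the parent of the code directory
aws s3 cp ${source_code}.zip $CODE_DIRECTORY_CLOUD/
aws s3 cp ${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD/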

Your spark-submit command; note the --py-files zip and the application
script passed as the last argument:

spark-submit --verbose \
    --properties-file ${property_file} \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --deploy-mode cluster \
    --name $APPNAME \
    --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
    --conf spark.kubernetes.namespace=$NAMESPACE \
    --conf spark.network.timeout=300 \
    --conf spark.kubernetes.allocation.batch.size=3 \
    --conf spark.kubernetes.allocation.batch.delay=1 \
    --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.driver.pod.name=$APPNAME \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
    --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
    --conf spark.dynamicAllocation.executorIdleTimeout=30s \
    --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
    --conf spark.dynamicAllocation.minExecutors=0 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.driver.cores=3 \
    --conf spark.executor.cores=3 \
    --conf spark.driver.memory=1024m \
    --conf spark.executor.memory=1024m \
    $CODE_DIRECTORY_CLOUD/${APPLICATION}
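Applied to the layout in your question (application code in one S3 path, a
common module in another), a minimal sketch of the same idea, using your
own paths purely for illustration: everything except the entry point goes
through --py-files as a comma-separated list, and the entry point itself is
passed as the last argument rather than inside --py-files:

spark-submit --verbose \
    --deploy-mode cluster \
    --py-files s3://some_path/appl_src.py,s3://a_different_path/common.py \
    s3://some_path/main.py

The same ordering applies if these strings are supplied as job parameters
rather than on a command line.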

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Tue, 5 Mar 2024 at 16:15, Pedro, Chuck <cpe...@travelers.com.invalid>
wrote:

> Hi all,
>
>
>
> I am working in Databricks. When I submit a spark job with the --py-files
> argument, it seems the first two are read in but the third is ignored.
>
>
>
> "--py-files",
>
> "s3://some_path/appl_src.py",
>
> "s3://some_path/main.py",
>
> "s3://a_different_path/common.py",
>
>
>
> I can see the first two acknowledged in the Log4j but not the third.
>
>
>
> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
>
> 24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...
>
>
>
> As a result, the job fails because appl_src.py is importing from common.py
> but can’t find it.
>
>
>
> I posted to both Databricks community here
> <https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62361#M31953>
> and Stack Overflow here
> <https://stackoverflow.com/questions/78077822/databricks-spark-submit-getting-error-with-py-files>
> but did not get a response.
>
>
>
> I’m aware that we could use a .zip file, so I tried zipping the first two
> arguments but then got a totally different error:
>
>
>
> “Exception in thread "main" org.apache.spark.SparkException: Failed to get
> main class in JAR with error 'null'.  Please specify one with --class.”
>
>
>
> Basically I just want the application code in one s3 path and a “common”
> utilities package in another path. Thanks for your help.
>
>
>
>
>
>
>
> *Kind regards,*
>
> Chuck Pedro
>
>
>
>
