Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
Sorry, I forgot. The example below is for YARN mode.

If your application code consists mainly of Python files and does not
require a separate virtual environment with specific dependencies, you can
use the --py-files argument with spark-submit. Adjust the memory settings
and the number of executors to suit your workload:

spark-submit --verbose \
   --master yarn \
   --deploy-mode cluster \
   --name $APPNAME \
   --driver-memory 1g \
   --executor-memory 1g \
   --num-executors 2 \
   --py-files ${build_directory}/source_code.zip \
   $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script

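As a rough sketch, assuming the Python modules live under ${build_directory}
in a package directory (hypothetically called myutils here), the zip can be
built as below; whatever sits at the top level of the zip becomes importable
on the driver and executors:

# hypothetical layout under ${build_directory}:
#   myutils/__init__.py
#   myutils/helpers.py
cd ${build_directory}
zip -rq source_code.zip myutils
# my_application_entry_point.py can then do: from myutils import helpers
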
For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a
separate virtual environment, you can instead use the --conf
spark.yarn.dist.archives argument:

spark-submit --verbose \
   --master yarn \
   --deploy-mode cluster \
   --name $APPNAME \
   --driver-memory 1g \
   --executor-memory 1g \
   --num-executors 2 \
   --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \
   $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script

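A minimal sketch of building that archive, assuming a plain virtualenv and
the venv-pack tool (conda-pack is the equivalent for conda environments);
the requirements.txt file is hypothetical:

python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack
pip install -r requirements.txt     # hypothetical list of your dependencies
venv-pack -o pyspark_venv.tar.gz    # the archive passed to spark.yarn.dist.archives
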
Explanation:

   - --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv: this
   tells Spark to distribute your virtual environment archive
   (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part is
   the name (a symlink) under which the unpacked archive appears in each
   container's working directory. To have the job actually run on that
   interpreter, point PYSPARK_PYTHON at it; see the sketch after this list.
   - You do not need --py-files here, because the virtual environment archive
   already contains all necessary dependencies.
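
A hedged sketch of the extra settings this usually needs in cluster mode;
the ./pyspark_venv path matches the #pyspark_venv alias above, and these
lines would be added to the spark-submit command shown earlier:

   --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
   --conf spark.executorEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \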

Choosing the best approach:

The choice depends on your project setup:

   - No Separate Virtual Environment: Use  --py-files if your application
   code consists mainly of Python files and doesn't require a separate virtual
   environment.
   - Separate Virtual Environment: Use --conf spark.yarn.dist.archives if
   you manage dependencies in a separate virtual environment archive.

HTH
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, "one test result is worth one-thousand expert opinions"
(Wernher von Braun).


Re: It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Mich Talebzadeh
I personally use a zip file and pass the application name (in your case
main.py) as the last argument, like below.

APPLICATION is your main.py. It does not need to be called main.py; it
could be anything, for example testpython.py.

CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"   ## replace gs with s3
# zip needs to be done at the root directory of the code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD   ## replace gsutil with aws s3
gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD

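For orientation, a hypothetical layout this assumes (the package and module
names below are made up for illustration; the project directory needs to be
a proper package so its modules are importable from the zip):

# hypothetical layout, with source_code="spark_on_gke"
# spark_on_gke/
#   __init__.py
#   src/testpython.py        <- APPLICATION, uploaded separately above
#   utils/__init__.py
#   utils/helpers.py
# after --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip, the entry point
# can import the zipped modules, e.g. from spark_on_gke.utils import helpers
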
your spark job

spark-submit --verbose \
   --properties-file ${property_file} \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name $APPNAME \
   --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
   --conf spark.kubernetes.namespace=$NAMESPACE \
   --conf spark.network.timeout=300 \
   --conf spark.kubernetes.allocation.batch.size=3 \
   --conf spark.kubernetes.allocation.batch.delay=1 \
   --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.driver.pod.name=$APPNAME \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
   --conf spark.dynamicAllocation.executorIdleTimeout=30s \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
   --conf spark.dynamicAllocation.minExecutors=0 \
   --conf spark.dynamicAllocation.maxExecutors=20 \
   --conf spark.driver.cores=3 \
   --conf spark.executor.cores=3 \
   --conf spark.driver.memory=1024m \
   --conf spark.executor.memory=1024m \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}

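On the subject line itself: --py-files is a single option that takes one
comma-separated list of .zip, .egg or .py files (no spaces), and the main
script is then passed separately as the application resource. So, reusing
the paths from the original question, the relevant part of the command
would look roughly like this:

   --py-files s3://some_path/appl_src.py,s3://a_different_path/common.py \
   s3://some_path/main.py
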
HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed. It is essential to note that, as with
any advice, "one test result is worth one-thousand expert opinions"
(Wernher von Braun).



It seems --py-files only takes the first two arguments. Can someone please confirm?

2024-03-05 Thread Pedro, Chuck
Hi all,

I am working in Databricks. When I submit a Spark job with the --py-files
argument, it seems the first two are read in but the third is ignored.

"--py-files",
"s3://some_path/appl_src.py",
"s3://some_path/main.py",
"s3://a_different_path/common.py",

I can see the first two acknowledged in the Log4j but not the third.

24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...

As a result, the job fails because appl_src.py is importing from common.py but 
can't find it.

I posted to both the Databricks community and Stack Overflow but did not
get a response.

I'm aware that we could use a .zip file, so I tried zipping the first two 
arguments but then got a totally different error:

"Exception in thread "main" org.apache.spark.SparkException: Failed to get main 
class in JAR with error 'null'.  Please specify one with --class."

Basically I just want the application code in one s3 path and a "common" 
utilities package in another path. Thanks for your help.



Kind regards,
Chuck Pedro


