weixi62961 commented on issue #5380:
URL: https://github.com/apache/kyuubi/issues/5380#issuecomment-1780920306

   ### Summary
   Let's summarize the status of PySpark batch jobs. In my opinion, PySpark jobs should be submitted via post-batches with pre-existing resources rather than by uploading resources.
   
   ### Problem
   - As of Kyuubi 1.8.0, the REST API supports creating batches in two ways:
        - 
https://kyuubi.readthedocs.io/en/master/client/rest/rest_api.html#post-batches
        - 
https://kyuubi.readthedocs.io/en/master/client/rest/rest_api.html#post-batches-with-uploading-resource
   - Currently, post-batches-with-uploading-resource can only upload a single resource file, which is not enough for typical PySpark jobs (see the sketch after the link below).
        - 
https://github.com/apache/kyuubi/blob/ceb84537835b61776629d3f1f23dade862059e1e/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/v1/BatchesResource.scala#L177
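   A minimal sketch of the uploading-resource call; the multipart part names `batchRequest` and `resourceFile` are assumptions based on the `BatchesResource` code linked above:
   ```bash
   # Sketch of post-batches-with-uploading-resource; the part names
   # batchRequest and resourceFile are assumed from the linked
   # BatchesResource code. Only a single resourceFile part is accepted,
   # so dependency zips for --py-files cannot be shipped this way.
   curl -X POST \
   -F 'batchRequest={"batchType":"PYSPARK","conf":{"spark.master":"yarn","spark.submit.deployMode":"cluster"}};type=application/json' \
   -F 'resourceFile=@main.py' \
   http://localhost:10099/api/v1/batches
   ```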
 
   - PySpark jobs are different from jar-based Spark jobs, which are usually packaged as an uber-jar/fat-jar containing all dependencies. A typical PySpark job requires not only the main resource file but also dependency zip packages (`--py-files`). spark-submit examples follow:
   ```bash
   # Jar-based Spark job
   $SPARK_HOME/bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   fat-jar-with-dependencies.jar
   
   # PySpark job
   $SPARK_HOME/bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --py-files dependency.zip \
   main.py
   ```
   
   ### Workaround
   - For PySpark jobs, it is recommended to submit via the post-batches API with pre-existing resources instead of uploading resources.
   - First, upload the resource file and dependency zip packages to a location accessible to all nodes, such as HDFS or NFS, then submit the job through the post-batches API. Here is an example using HDFS; a sketch of the upload step follows.
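   A minimal sketch of the upload step; the local file names `main.py` and `dependency.zip` are assumptions matching the JSON below:
   ```bash
   # Upload the main script and the dependency zip to an HDFS location
   # readable by all cluster nodes (paths match the JSON example below).
   hdfs dfs -mkdir -p /tmp/upload
   hdfs dfs -put -f main.py dependency.zip /tmp/upload/
   ```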
   - JSON parameters in pretty format:
   ```json
   {
        "batchType": "PYSPARK",
        "resource": "hdfs:/tmp/upload/main.py",
        "conf": {
                "spark.master": "yarn",
                "spark.submit.deployMode": "cluster",
                "spark.submit.pyFiles": "hdfs:/tmp/upload/dependency.zip"
        }
   }
   ```
   - curl command (it works!):
   ```bash
   curl -H "Content-Type: application/json" \
   -X POST \
   -d '{"batchType":"PYSPARK","resource":"hdfs:/tmp/upload/main.py","conf":{"spark.master":"yarn","spark.submit.deployMode":"cluster","spark.submit.pyFiles":"hdfs:/tmp/upload/dependency.zip"}}' \
   http://localhost:10099/api/v1/batches
   ```
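   - After submission, the batch can be checked via the GET batches endpoint; `<batchId>` below is a placeholder for the id returned in the POST response:
   ```bash
   # Fetch the created batch's info and state; replace <batchId> with the
   # id field from the POST response above.
   curl http://localhost:10099/api/v1/batches/<batchId>
   ```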
   
   WDYT? @bowenliang123 cc @pan3793
   
   For more PySpark job submission cases, please refer to my small project: [pyspark_submit_sample](https://github.com/weixi62961/pyspark_submit_sample).

