weixi62961 commented on issue #5380:
URL: https://github.com/apache/kyuubi/issues/5380#issuecomment-1780920306
### Summary
Let's summarize the status of PySpark batch jobs. In my opinion, for PySpark jobs it is recommended to submit via post-batches with existing resources instead of uploading resources.
### Problem
- Up to Kyuubi 1.8.0, the REST API supports creating batches in two ways:
-
https://kyuubi.readthedocs.io/en/master/client/rest/rest_api.html#post-batches
-
https://kyuubi.readthedocs.io/en/master/client/rest/rest_api.html#post-batches-with-uploading-resource
- Currently, uploading-resource can only upload a single resource file, which is not enough for typical PySpark jobs.
-
https://github.com/apache/kyuubi/blob/ceb84537835b61776629d3f1f23dade862059e1e/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/v1/BatchesResource.scala#L177
- PySpark jobs are different from JarSpark jobs, which are usually an uber-jar/fat-jar with all dependencies bundled in. A typical PySpark job requires not only the resource file but also dependency zip packages (`--py-files`). `spark-submit` examples as follows:
```bash
# JarSpark job
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  fat-jar-with-dependencies.jar

# PySpark job
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files dependency.zip \
  main.py
```
### Workaround
- For PySpark jobs, it is recommended to submit via the post-batches API with existing resources instead of uploading resources.
- First, upload the resource file and the dependency zip packages to a location that is accessible to all nodes, such as HDFS or NFS. Then submit the job through the post-batches API. Here is an example using HDFS.
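The upload step might look like the following (a sketch, assuming the `hadoop` CLI is on `PATH` and the cluster is reachable; the target directory matches the JSON example):

```shell
# Upload the entry script and the zipped dependencies to HDFS so that
# all YARN nodes can read them.
UPLOAD_DIR=hdfs:/tmp/upload

if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p "$UPLOAD_DIR"
  hadoop fs -put -f main.py dependency.zip "$UPLOAD_DIR"
else
  echo "hadoop CLI not found; skipping upload"
fi
```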
- JSON parameters in pretty format:
```json
{
  "batchType": "PYSPARK",
  "resource": "hdfs:/tmp/upload/main.py",
  "conf": {
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster",
    "spark.submit.pyFiles": "hdfs:/tmp/upload/dependency.zip"
  }
}
```
- curl command (it works!):
```bash
curl -H "Content-Type: application/json" \
  -X POST \
  -d '{"batchType":"PYSPARK","resource":"hdfs:/tmp/upload/main.py","conf":{"spark.master":"yarn","spark.submit.deployMode":"cluster","spark.submit.pyFiles":"hdfs:/tmp/upload/dependency.zip"}}' \
  http://localhost:10099/api/v1/batches
```
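The same request can also be built from Python with only the standard library (a sketch; the host/port and HDFS paths come from the example above and should be adjusted for a real deployment):

```python
import json
import urllib.request

# Build the same POST /api/v1/batches request as the curl command above.
KYUUBI_URL = "http://localhost:10099/api/v1/batches"

payload = {
    "batchType": "PYSPARK",
    "resource": "hdfs:/tmp/upload/main.py",
    "conf": {
        "spark.master": "yarn",
        "spark.submit.deployMode": "cluster",
        "spark.submit.pyFiles": "hdfs:/tmp/upload/dependency.zip",
    },
}

req = urllib.request.Request(
    KYUUBI_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to actually submit against a running Kyuubi server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```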
WDYT? @bowenliang123 cc @pan3793
For more PySpark job submission cases, please refer to my small project: [pyspark_submit_sample](https://github.com/weixi62961/pyspark_submit_sample).