xy720 opened a new issue #4101:
URL: https://github.com/apache/incubator-doris/issues/4101
**Motivation**
We recently introduced Spark Load, which currently has to upload many jar
packages to the Yarn cluster before each load. These jar packages include
`$DORIS_HOME/lib/palo-fe.jar` (the Dpp runtime dependency) and all jars in the
`$SPARK_HOME/jars` folder (the Spark dependencies). The upload usually takes
2 to 3 minutes.
Currently, these jars are uploaded to temporary directories in HDFS.
`palo-fe.jar` is uploaded to `{working_dir}/jobs/DB_ID/LABEL/JOB_ID/configs`.
The other jars are packaged into a zip file and uploaded to
`{stage_dir}/APPLICATION_ID/__spark_lib__.zip`.
In most cases, the jar packages uploaded by two different loads are exactly
the same, so we don't have to upload them every time. In addition, the jar
packages should be stored in one directory so that we can manage them easily.
Moreover, we can put all jars into a zip file in the compile phase and upload
it to a specified remote repository before load.
Therefore, I propose creating a repository in HDFS for all dependencies of
Spark Load.
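The packaging step mentioned above could be sketched roughly as follows. This is only an illustration, not the actual Doris build logic: the function name `build_lib_archive`, its parameters, and the choice to place every jar at the top level of the archive are all assumptions.

```python
import zipfile
from pathlib import Path


def build_lib_archive(jar_dirs, extra_jars, archive_path):
    """Bundle every *.jar found in jar_dirs, plus extra_jars, into one zip.

    Hypothetical helper: jar_dirs would be something like $SPARK_HOME/jars,
    extra_jars something like the Dpp jar. Returns the sorted entry names.
    """
    jars = {}
    for jar_dir in jar_dirs:
        for jar in sorted(Path(jar_dir).glob("*.jar")):
            # Keep the first jar seen for a given file name.
            jars.setdefault(jar.name, jar)
    for jar in extra_jars:
        jar = Path(jar)
        # Explicitly listed jars win over directory scans on a name clash.
        jars[jar.name] = jar

    # Jars are already compressed, so store them without recompressing.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as archive:
        for name in sorted(jars):
            archive.write(jars[name], arcname=name)
    return sorted(jars)
```

Non-jar files in the scanned directories are skipped, so the archive only contains what Spark needs on its classpath.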
**The repository structure**
```
Repository/
|-lib_{version}.zip
| {All spark dependencies}
| |-roaringbitmap.jar
| |-activation-1.1.1.jar
| |-aircompressor-0.10.jar
| |-...
| {All dpp dependencies}
| |-spark-dpp.jar
|-lib_{version}.zip
|-lib_{version}.zip
|-...
```
The `Repository/` directory is the parent directory of all zip files. When we
submit a Spark Load job, FE compares the version of the local zip file with the
versions of the remote zip files, and uploads the local file only when no
remote zip file with the matching version is found.
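The version check described above might look like the following minimal sketch. It assumes the version is an opaque string embedded in the file name as `lib_{version}.zip` and that the listing of the remote `Repository/` directory is already available as a list of file names. The issue does not specify how the version is derived, and the helper names here are hypothetical.

```python
def archive_name(version):
    """File name of the dependency archive for a given version string."""
    return f"lib_{version}.zip"


def needs_upload(local_version, remote_files):
    """Return True when the repository has no archive for local_version.

    remote_files is the list of file names found under Repository/
    (how that listing is obtained from HDFS is outside this sketch).
    """
    return archive_name(local_version) not in set(remote_files)


# Usage with made-up version numbers:
remote = ["lib_0.12.0.zip", "lib_0.13.0.zip"]
print(needs_upload("0.13.0", remote))  # False: matching archive already exists
print(needs_upload("0.14.0", remote))  # True: archive must be uploaded first
```

Comparing by exact file name keeps the check to a single directory listing, with no need to download or open any remote archive.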
Note that `spark-dpp.jar` is built by the spark-dpp sub-module. The
difference between `palo-fe.jar` and `spark-dpp.jar` is that `spark-dpp.jar`
also contains the third-party libraries that `palo-fe.jar` depends on. See
issue #4098 for details about the multi-module layout of fe.
Meanwhile, we can set the `AppResourceHdfsPath` argument of spark-submit to
the lib.zip file. Spark will analyze it and find the entry point of the main
class.