JiaLiangC opened a new pull request, #1226:
URL: https://github.com/apache/bigtop/pull/1226
### Description of PR
Why compilation is slow: the Hadoop project has many modules, and because they
cannot currently be compiled in parallel, the build is slow. For instance, a
first compilation of Hadoop can take several hours, and even with dependencies
already cached in the local Maven repository, a subsequent compilation still
takes close to 40 minutes.
How to solve it: use mvn dependency:tree and maven-to-plantuml to analyze the
dependencies between the modules of this multi-module Maven project and find
the ones that prevent parallel compilation.
Download maven-to-plantuml:
wget https://github.com/phxql/maven-to-plantuml/releases/download/v1.0/maven-to-plantuml-1.0.jar
Generate a dependency tree:
mvn dependency:tree > dep.txt
Generate a UML diagram from the dependency tree:
java -jar maven-to-plantuml-1.0.jar --input dep.txt --output dep.puml
For more information, visit: [maven-to-plantuml GitHub
repository](https://github.com/phxql/maven-to-plantuml/tree/master).
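
As a rough text-only complement to the diagram, the direct edges between Hadoop's own modules can be pulled out of dep.txt. This is a sketch, assuming the usual `[INFO]`-prefixed console output of `mvn dependency:tree` and that the project's modules share the group ID `org.apache.hadoop`; the function name `hadoop_module_edges` is made up for illustration.

```shell
# Print "module -> dependency" for every direct dependency on another
# org.apache.hadoop artifact found in `mvn dependency:tree` console output.
# Assumption: lines look like "[INFO] +- group:artifact:type:version:scope".
hadoop_module_edges() {
  awk '
    # A project root line has no tree prefix: "[INFO] group:artifact:type:version"
    /^\[INFO\] [A-Za-z0-9_.-]+:[A-Za-z0-9_.-]+:/ {
      split($2, gav, ":"); root = gav[2]; next
    }
    # Direct dependencies follow "[INFO] " immediately with "+- " or "\- ";
    # transitive ones are indented further and are skipped here.
    /^\[INFO\] [+\\]- org\.apache\.hadoop:/ {
      split($3, gav, ":"); print root " -> " gav[2]
    }
  ' "$@"
}
```

Running `hadoop_module_edges dep.txt | sort -u` would then list one edge per line, which can be easier to search than a large diagram.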
Hadoop Parallel Compilation Submission Logic
Reasons for Parallel Compilation Failure
In sequential compilation, modules are compiled one at a time in module order,
so the output of every earlier module exists before a later module needs it,
and no errors occur.
In parallel compilation, modules are compiled concurrently, and the order is
determined only by the declared inter-module dependencies. If Module A depends
on Module B, then Module B is compiled before Module A; modules with no
declared dependency between them may be compiled at the same time.
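
The ordering rule described here is a topological sort over the declared dependencies. As a loose analogy only (this is not Maven's scheduler), the standard `tsort` utility shows the effect; the module names are placeholders.

```shell
# Each input pair reads "dependency dependent": the left item must be built
# before the right one. Module names are placeholders, not real Hadoop modules.
build_order() {
  printf '%s\n' \
    'module-b module-a' \
    'module-c module-b' | tsort
}

build_order
```

Because module-a depends on module-b, which depends on module-c, the only valid order is module-c, module-b, module-a. Two modules with no path between them in such a graph carry no ordering constraint, and those are the ones a parallel build is free to run at the same time.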
When Hadoop is compiled in parallel, for example hadoop-yarn-project, the
declared dependencies between modules are correct, so compilation itself is
ordered properly. The issue arises at the dist packaging stage, which bundles
the output of all the other compiled modules.
Behavior of hadoop-yarn-project in Serial Compilation:
In serial compilation, Maven compiles the modules listed in the pom one by one,
in sequence, and compiles hadoop-yarn-project only after all of them are done.
During the prepare-package phase, maven-assembly-plugin runs and repackages
all module artifacts according to the description in
hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml.
Behavior of hadoop-yarn-project in Parallel Compilation:
Parallel compilation orders modules by the dependencies among them. Modules
that do not depend on each other through a dependency declaration are
compiled in parallel. Following the dependencies declared in the pom of
hadoop-yarn-project, Maven compiles those dependencies first and then
hadoop-yarn-project, which runs its maven-assembly-plugin execution.
However, not every module whose files are needed for packaging by
hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml is
declared as a dependency of hadoop-yarn-project. Therefore, when
hadoop-yarn-project is compiled and maven-assembly-plugin runs, some required
modules may not have been built yet, which causes errors in parallel compilation.
Solution:
The solution is relatively straightforward: collect every module referenced in
hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml and
declare each one as a dependency in the pom of hadoop-yarn-project. Maven then
builds all of them before the assembly step runs.
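
One way to carry out that collection step is to compare the artifacts the descriptor names with the artifactIds already present in the pom. The sketch below assumes the descriptor lists modules as `<include>groupId:artifactId</include>` entries with each tag on a single line, and that `grep -o` is available; the function name `missing_assembly_deps` is illustrative, and the matching is deliberately crude text processing rather than real XML parsing.

```shell
# Print artifactIds that the assembly descriptor includes but that never
# appear as an <artifactId> in the pom. Crude text matching, not an XML parser.
missing_assembly_deps() {
  descriptor=$1
  pom=$2
  tmp=$(mktemp -d) || return 1

  # "<include>group:artifact</include>" -> "artifact"
  # (includes without a colon, such as file patterns, are ignored)
  grep -o '<include>[^<:]*:[^<]*</include>' "$descriptor" |
    sed -e 's/<[^>]*>//g' -e 's/^[^:]*://' | sort -u > "$tmp/wanted"

  # "<artifactId>artifact</artifactId>" -> "artifact"
  grep -o '<artifactId>[^<]*</artifactId>' "$pom" |
    sed -e 's/<[^>]*>//g' | sort -u > "$tmp/declared"

  # Lines only in the first file: wanted by the assembly, not declared.
  comm -23 "$tmp/wanted" "$tmp/declared"
  rm -rf "$tmp"
}
```

Called with the descriptor path above as the first argument and the hadoop-yarn-project pom as the second, each name it prints is a candidate for a new dependency entry. The result still needs a manual check, since the pom may mention an artifactId in a place other than its dependencies.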
### How was this patch tested?
manual test

### For code changes:
- [ ] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'BIGTOP-3638. Your PR title ...')?
- [ ] Make sure that newly added files do not have any licensing issues.
When in doubt refer to https://www.apache.org/licenses/