This is an automated email from the ASF dual-hosted git repository.
felixybw pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-gluten.git
The following commit(s) were added to refs/heads/main by this push:
new f1d664bc39 [VL] Update document of build gluten in Docker (#8459)
f1d664bc39 is described below
commit f1d664bc397cc049933060487504bc1c6979a0ba
Author: BInwei Yang <[email protected]>
AuthorDate: Wed Jan 8 18:29:15 2025 -0800
[VL] Update document of build gluten in Docker (#8459)
Add the details to build Gluten in docker.
---
docs/developers/velox-backend-build-in-docker.md | 67 ++++++++++++++++++++----
docs/get-started/Velox.md | 18 ++++---
2 files changed, 67 insertions(+), 18 deletions(-)
diff --git a/docs/developers/velox-backend-build-in-docker.md b/docs/developers/velox-backend-build-in-docker.md
index 4820c7cdc7..4d5a32767f 100755
--- a/docs/developers/velox-backend-build-in-docker.md
+++ b/docs/developers/velox-backend-build-in-docker.md
@@ -5,17 +5,64 @@ nav_order: 7
parent: Developer Overview
---
-Currently, Centos-7/8/9 and Ubuntu 20.04/22.04 are supported to build Gluten Velox backend. Please refer to
-`.github/workflows/velox_weekly.yml` to install required tools before the build.
+Currently, there are two ways to build Gluten: static linking or dynamic linking.
-There are two docker images with almost all dependencies installed, respective for static build and dynamic build.
-The according Dockerfiles are respectively `Dockerfile.centos7-static-build` and `Dockerfile.centos8-dynamic-build`
-under `dev/docker/`.
+# Static link
+The static link approach builds all dependency libraries for both Velox and Gluten with vcpkg. It then statically links these libraries into libvelox.so and libgluten.so, which makes it possible to build Gluten on *any* x86 Linux OS with 64G of memory. However, it has only been verified on Centos-7/8/9 and Ubuntu 20.04/22.04. Please submit an issue if it fails on your OS.
-```shell
-# For static build on centos-7.
-docker pull apache/gluten:vcpkg-centos-7
+The libraries below are the only dependencies required on the target system. They are essential libraries pre-installed on every Linux OS.
+```
+linux-vdso.so.1
+librt.so.1
+libpthread.so.0
+libdl.so.2
+libm.so.6
+libc.so.6
+/lib64/ld-linux-x86-64.so.2
+```
+
+The 'dockerfile' to build the Gluten jar:
+
+```
+FROM apache/gluten:vcpkg-centos-7
-# For dynamic build on centos-8.
-docker pull apache/gluten:centos-8 (dynamic build)
+# Build Gluten Jar
+RUN source /opt/rh/devtoolset-11/enable && \
+ git clone https://github.com/apache/incubator-gluten.git && \
+ cd incubator-gluten && \
+ ./dev/builddeps-veloxbe.sh --run_setup_script=OFF --enable_s3=ON --enable_gcs=ON --enable_abfs=ON --enable_vcpkg=ON --build_arrow=OFF && \
+ mvn clean package -Pbackends-velox -Pceleborn -Piceberg -Pdelta -Pspark-3.4 -DskipTests
+```
+`enable_vcpkg=ON` enables static linking. The vcpkg packages are pre-installed in the vcpkg-centos-7 image and are reused automatically. The image is maintained by the Gluten community.
+
+The following command builds the Gluten jar in the 'glutenimage' image:
+```
+docker build -t glutenimage -f dockerfile .
+```
+The Gluten jar can be copied from glutenimage:/incubator-gluten/package/target/gluten-velox-bundle-*.jar
+
+# Dynamic link
+The dynamic link approach requires the dependency libraries to be installed on the system. It then dynamically links these .so files into libvelox.so and libgluten.so. Currently, Centos-7/8/9 and
+ Ubuntu 20.04/22.04 are supported for building the Gluten Velox backend dynamically.
+
+The 'dockerfile' to build the Gluten jar:
+
+```
+FROM apache/gluten:centos-8
+
+# Build Gluten Jar
+RUN source /opt/rh/devtoolset-11/enable && \
+ git clone https://github.com/apache/incubator-gluten.git && \
+ cd incubator-gluten && \
+ ./dev/builddeps-veloxbe.sh --run_setup_script=ON --enable_hdfs=ON --enable_vcpkg=OFF --build_arrow=OFF && \
+ mvn clean package -Pbackends-velox -Pceleborn -Piceberg -Pdelta -Pspark-3.4 -DskipTests && \
+ ./dev/build-thirdparty.sh
+```
+`enable_vcpkg=OFF` enables dynamic linking. Some of the shared libraries are pre-installed in the image. You need to specify `--run_setup_script=ON` to install the rest of them. `build-thirdparty.sh` then packages all dependency libraries into a jar.
+Please note that the image is based on centos-8. Building and deploying the jar on other OSes is risky.
+
+The following command builds the Gluten jar in the 'glutenimage' image:
+```
+docker build -t glutenimage -f dockerfile .
```
+The Gluten jars can be copied from glutenimage:/incubator-gluten/package/target/gluten-velox-bundle-*.jar and glutenimage:/incubator-gluten/package/target/gluten-thirdparty-lib-*.jar
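Note on the copy step described in the doc above: `docker cp` operates on containers rather than images and does not expand wildcards, so one way to retrieve the jars is to create a stopped container from the image and copy the whole `package/target/` directory. The following is only a minimal sketch under those assumptions; the function name `copy_gluten_jars` and the default output directory `./gluten-jars` are hypothetical and not part of the commit.

```shell
# Sketch only: copy the built jars out of the image produced by `docker build`.
# Assumptions: the image is tagged "glutenimage" and the jars live under
# /incubator-gluten/package/target/ as stated in the doc. The function name
# and output directory are illustrative, not part of Gluten.
copy_gluten_jars() (
  set -e
  image="${1:-glutenimage}"
  out_dir="${2:-./gluten-jars}"

  # docker cp needs a container, so create a stopped one from the image.
  cid="$(docker create "$image")"
  mkdir -p "$out_dir"

  # A source path ending in "/." copies the directory contents;
  # wildcards such as gluten-velox-bundle-*.jar are not supported by docker cp.
  docker cp "$cid:/incubator-gluten/package/target/." "$out_dir"

  # Remove the temporary container and list the jars that were copied.
  docker rm "$cid" >/dev/null
  ls "$out_dir"/gluten-*.jar
)
```

It could be invoked as `copy_gluten_jars glutenimage ./gluten-jars`. The function body runs in a subshell, so `set -e` and the variables do not leak into the calling shell.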
diff --git a/docs/get-started/Velox.md b/docs/get-started/Velox.md
index 48bca9a6d3..863e996796 100644
--- a/docs/get-started/Velox.md
+++ b/docs/get-started/Velox.md
@@ -16,8 +16,7 @@ parent: Getting-Started
# Prerequisite
-Currently, Gluten+Velox backend is only tested on **Ubuntu20.04/Ubuntu22.04/Centos7/Centos8**.
-Other kinds of OS support are still in progress. The long term goal is to support several common OS and conda env deployment.
+Currently, with static build, the Gluten+Velox backend supports all Linux OSes, but is only tested on **Ubuntu20.04/Ubuntu22.04/Centos7/Centos8**. With dynamic build, the Gluten+Velox backend supports **Ubuntu20.04/Ubuntu22.04/Centos7/Centos8** and their variants.
Currently, the officially supported Spark versions are 3.2.2, 3.3.1, 3.4.3 and 3.5.1.
@@ -103,20 +102,23 @@ mvn clean package -Pbackends-velox -Pceleborn -Puniffle -Pspark-3.4 -DskipTests
mvn clean package -Pbackends-velox -Pceleborn -Puniffle -Pspark-3.5 -DskipTests
```
-Notes: Building Velox may fail caused by OOM. You can prevent this failure by adjusting `NUM_THREADS` (e.g., `export NUM_THREADS=4`) before building Gluten/Velox.
+Notes: Building Velox may fail due to OOM. You can prevent this failure by lowering `NUM_THREADS` (e.g., `export NUM_THREADS=4`) before building Gluten/Velox. The recommended minimum memory size is 64G.
After the above build process, the Jar file will be generated under `package/target/`.
+Alternatively, you may refer to [build in docker](docs/developers/velox-backend-build-in-docker.md) to build the Gluten jar in Docker.
+
## Dependency library deployment
With build option `enable_vcpkg=ON`, all dependency libraries will be statically linked to `libvelox.so` and `libgluten.so` which are packed into the gluten-jar.
In this way, only the gluten-jar is needed to add to `spark.<driver|executor>.extraClassPath` and spark will deploy the jar to each worker node. It's better to build
-the static version using a clean docker image without any extra libraries installed. On host with some libraries like jemalloc installed, the script may crash with
-odd message. You may need to uninstall those libraries to get a clean host. We strongly recommend user to build Gluten in this way to avoid dependency lacking issue.
+the static version using a clean docker image without any extra libraries installed (see [build in docker](docs/developers/velox-backend-build-in-docker.md)). On a host with
+some libraries like jemalloc installed, the script may crash with an odd message. You may need to uninstall those libraries to get a clean host. We **strongly recommend** that users build Gluten in this way to avoid missing-dependency issues.
-With build option `enable_vcpkg=OFF`, not all dependency libraries will be statically linked. You need to separately execute `./dev/build-thirdparty.sh` to pack required
-shared libraries into another jar named `gluten-thirdparty-lib-$LINUX_OS-$VERSION-$ARCH.jar`. Then you need to add the jar to Spark config `extraClassPath` and set
-`spark.gluten.loadLibFromJar=true`. Otherwise, you need to install required shared libraries on each worker node. You may find the libraries list from the third-party jar.
+With build option `enable_vcpkg=OFF`, not all dependency libraries are statically linked; the remaining ones are linked dynamically. After building, you need to separately execute `./dev/build-thirdparty.sh` to
+pack the required shared libraries into another jar named `gluten-thirdparty-lib-$LINUX_OS-$VERSION-$ARCH.jar`. Then you need to add the jar to the Spark config `extraClassPath` and
+set `spark.gluten.loadLibFromJar=true`. Otherwise, you need to install the required shared libraries with **exactly the same versions** on each worker node. You may find the
+list of libraries in the third-party jar.
## HDFS support