This is an automated email from the ASF dual-hosted git repository.
weichiu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/ozone-site.git
The following commit(s) were added to refs/heads/master by this push:
new c975a6719 HDDS-14303. Updating spark3 user guide (#358)
c975a6719 is described below
commit c975a6719072e801736ba93b71b18570c1b7d295
Author: Jason O'Sullivan <[email protected]>
AuthorDate: Fri Mar 6 18:37:34 2026 +0000
HDDS-14303. Updating spark3 user guide (#358)
---
docs/04-user-guide/02-integrations/06-spark.md | 195 +++++++++++++++++++-
.../04-user-guide/03-integrations/06-spark.md | 196 ++++++++++++++++++++-
2 files changed, 385 insertions(+), 6 deletions(-)
diff --git a/docs/04-user-guide/02-integrations/06-spark.md b/docs/04-user-guide/02-integrations/06-spark.md
index 10f30e688..c1ff89d2f 100644
--- a/docs/04-user-guide/02-integrations/06-spark.md
+++ b/docs/04-user-guide/02-integrations/06-spark.md
@@ -1,8 +1,195 @@
---
-draft: true
+sidebar_label: Spark
---
-# Spark
+# Using Apache Spark with Ozone
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
-**TODO:** Uncomment link to this page in src/pages/index.js
+[Apache Spark](https://spark.apache.org/) is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read data from and write data to Ozone clusters using familiar Spark APIs.
+
+:::note
+This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.2.0.
+:::
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem connector, which allows access using the `ofs://` URI scheme.
+Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol, which is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
+
+The older `o3fs://` scheme is supported for legacy compatibility but is not recommended for new deployments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and `RDD` APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-*.jar` must be available on the Spark driver and executor classpath.
+3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs.
+4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster.
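+
+If you prefer not to place the JARs on a cluster-wide classpath, Spark can also ship them per application. A minimal `spark-defaults.conf` sketch (the local JAR paths below are illustrative placeholders):
+
+```properties
+# Distribute the Ozone connector and Hadoop 3.4.x JARs with each job
+spark.jars=/path/to/ozone-filesystem-hadoop3-2.2.0.jar,/path/to/hadoop-common-3.4.2.jar
+```
+
+Depending on your deployment, classpath ordering issues may still make placing the JARs directly in Spark's `jars/` directory the more reliable option.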
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../client-interfaces/ofs#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+```
+
+### 3. Security (Kerberos)
+
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone.
+
+Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
+
+```properties
+# For YARN deployments on Spark 3+
+spark.kerberos.access.hadoopFileSystems=ofs://ozone1/
+```
+
+Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`).
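+
+For long-running or scheduled jobs, a ticket obtained via `kinit` eventually expires. Spark can manage renewal itself when given a principal and keytab; a minimal sketch (the principal and keytab path are placeholders):
+
+```properties
+spark.kerberos.principal=spark-user@EXAMPLE.COM
+spark.kerberos.keytab=/etc/security/keytabs/spark-user.keytab
+```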
+
+## Usage Examples
+
+You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem.
+
+**URI Format:** `ofs://<om-service-id>/<volume>/<bucket>/path/to/key`
+
+### Reading Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+// Read a CSV file from Ozone
+val df = spark.read.format("csv")
+ .option("header", "true")
+ .option("inferSchema", "true")
+ .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+```
+
+### Writing Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+// Assume 'df' is a DataFrame you want to write
+val data = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3))
+val df = spark.createDataFrame(data).toDF("name", "id")
+
+// Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite")
+ .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+```
+
+### Reading Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+# Read a CSV file from Ozone
+df = spark.read.format("csv") \
+ .option("header", "true") \
+ .option("inferSchema", "true") \
+ .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+```
+
+### Writing Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+# Assume 'df' is a DataFrame you want to write
+data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
+columns = ["name", "id"]
+df = spark.createDataFrame(data, columns)
+
+# Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite") \
+ .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+```
+
+## Spark on Kubernetes
+
+The recommended approach for running Spark on Kubernetes with Ozone is to bake the `ozone-filesystem-hadoop3-*.jar`, the `hadoop-common-3.4.x.jar` (if using Ozone 2.1.0+), and `core-site.xml` directly into a custom Spark image.
+
+### Build a Custom Spark Image
+
+Place the Ozone client JAR and Hadoop compatibility JAR in `/opt/spark/jars/`, which is on the default Spark classpath, and `core-site.xml` in `/opt/spark/conf/`:
+
+```dockerfile
+FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu
+
+USER root
+
+ADD https://repo1.maven.org/maven2/org/apache/ozone/ozone-filesystem-hadoop3/2.2.0/ozone-filesystem-hadoop3-2.2.0.jar \
+  /opt/spark/jars/
+
+# Ozone 2.1.0+ requires Hadoop 3.4.x classes (HDDS-13574).
+# Add alongside (not replacing) Spark's bundled hadoop-common-3.3.4.jar.
+ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.4.2/hadoop-common-3.4.2.jar \
+  /opt/spark/jars/
+
+COPY core-site.xml /opt/spark/conf/core-site.xml
+COPY ozone_write.py /opt/spark/work-dir/ozone_write.py
+
+USER spark
+```
+
+Where `core-site.xml` contains at minimum:
+
+```xml
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<configuration>
+ <property>
+ <name>fs.ofs.impl</name>
+ <value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
+ </property>
+ <property>
+ <name>ozone.om.address</name>
+ <value>om-host.example.com:9862</value>
+ </property>
+</configuration>
+```
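+
+If your Ozone cluster runs OM in high availability mode, the single `ozone.om.address` entry is replaced by service-ID based properties. A sketch assuming a service ID of `ozone1` with three OM nodes (hostnames are placeholders; repeat the address property for `om2` and `om3`):
+
+```xml
+<property>
+  <name>ozone.om.service.ids</name>
+  <value>ozone1</value>
+</property>
+<property>
+  <name>ozone.om.nodes.ozone1</name>
+  <value>om1,om2,om3</value>
+</property>
+<property>
+  <name>ozone.om.address.ozone1.om1</name>
+  <value>om1.example.com:9862</value>
+</property>
+```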
+
+### Submit a Spark Job
+
+```bash
+./bin/spark-submit \
+ --master k8s://https://YOUR_KUBERNETES_API_SERVER:6443 \
+ --deploy-mode cluster \
+ --name spark-ozone-example \
+ --conf spark.executor.instances=2 \
+ --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \
+ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
+ --conf spark.kubernetes.namespace=YOUR_NAMESPACE \
+ local:///opt/spark/work-dir/ozone_write.py
+```
+
+Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values.
+
+## Using the S3A Protocol
+
+Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
+
+For configuration details, refer to the [S3A documentation](../client-interfaces/s3a).
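+
+As a starting point, the following Spark properties direct `s3a://` traffic at an Ozone S3 Gateway (the endpoint host and credentials are placeholders; the linked S3A page is authoritative):
+
+```properties
+spark.hadoop.fs.s3a.endpoint=http://s3g-host.example.com:9878
+spark.hadoop.fs.s3a.access.key=YOUR_ACCESS_KEY
+spark.hadoop.fs.s3a.secret.key=YOUR_SECRET_KEY
+spark.hadoop.fs.s3a.path.style.access=true
+```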
diff --git a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md
index 5d0235c29..94e4459da 100644
--- a/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md
+++ b/versioned_docs/version-2.1.0/04-user-guide/03-integrations/06-spark.md
@@ -1,3 +1,195 @@
-# Spark
+---
+sidebar_label: Spark
+---
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+[Apache Spark](https://spark.apache.org/) is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read data from and write data to Ozone clusters using familiar Spark APIs.
+
+:::note
+This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0.
+:::
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem connector, which allows access using the `ofs://` URI scheme.
+Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol, which is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
+
+The older `o3fs://` scheme is supported for legacy compatibility but is not recommended for new deployments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and `RDD` APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-client-*.jar` must be available on the Spark driver and executor classpath.
+3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs.
+4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster.
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../client-interfaces/ofs#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+```
+
+### 3. Security (Kerberos)
+
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone.
+
+Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
+
+```properties
+# For YARN deployments on Spark 3+
+spark.kerberos.access.hadoopFileSystems=ofs://ozone1/
+```
+
+Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`).
+
+## Usage Examples
+
+You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem.
+
+**URI Format:** `ofs://<om-service-id>/<volume>/<bucket>/path/to/key`
+
+### Reading Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+// Read a CSV file from Ozone
+val df = spark.read.format("csv")
+ .option("header", "true")
+ .option("inferSchema", "true")
+ .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+```
+
+### Writing Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+// Assume 'df' is a DataFrame you want to write
+val data = Seq(("Alice", 1), ("Bob", 2), ("Charlie", 3))
+val df = spark.createDataFrame(data).toDF("name", "id")
+
+// Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite")
+ .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+```
+
+### Reading Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+# Read a CSV file from Ozone
+df = spark.read.format("csv") \
+ .option("header", "true") \
+ .option("inferSchema", "true") \
+ .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+```
+
+### Writing Data (Python)
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()
+
+# Assume 'df' is a DataFrame you want to write
+data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
+columns = ["name", "id"]
+df = spark.createDataFrame(data, columns)
+
+# Write DataFrame to Ozone as Parquet files
+df.write.mode("overwrite") \
+ .parquet("ofs://ozone1/volume1/bucket1/output/users.parquet")
+```
+
+## Spark on Kubernetes
+
+The recommended approach for running Spark on Kubernetes with Ozone is to bake the `ozone-filesystem-hadoop3-client-*.jar`, the `hadoop-common-3.4.x.jar` (if using Ozone 2.1.0+), and `core-site.xml` directly into a custom Spark image.
+
+### Build a Custom Spark Image
+
+Place the Ozone client JAR and Hadoop compatibility JAR in `/opt/spark/jars/`, which is on the default Spark classpath, and `core-site.xml` in `/opt/spark/conf/`:
+
+```dockerfile
+FROM apache/spark:3.5.8-scala2.12-java11-python3-ubuntu
+
+USER root
+
+ADD https://repo1.maven.org/maven2/org/apache/ozone/ozone-filesystem-hadoop3-client/2.1.0/ozone-filesystem-hadoop3-client-2.1.0.jar \
+  /opt/spark/jars/
+
+# Ozone 2.1.0+ requires Hadoop 3.4.x classes (HDDS-13574).
+# Add alongside (not replacing) Spark's bundled hadoop-common-3.3.4.jar.
+ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/3.4.2/hadoop-common-3.4.2.jar \
+  /opt/spark/jars/
+
+COPY core-site.xml /opt/spark/conf/core-site.xml
+COPY ozone_write.py /opt/spark/work-dir/ozone_write.py
+
+USER spark
+```
+
+Where `core-site.xml` contains at minimum:
+
+```xml
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<configuration>
+ <property>
+ <name>fs.ofs.impl</name>
+ <value>org.apache.hadoop.fs.ozone.RootedOzoneFileSystem</value>
+ </property>
+ <property>
+ <name>ozone.om.address</name>
+ <value>om-host.example.com:9862</value>
+ </property>
+</configuration>
+```
+
+### Submit a Spark Job
+
+```bash
+./bin/spark-submit \
+ --master k8s://https://YOUR_KUBERNETES_API_SERVER:6443 \
+ --deploy-mode cluster \
+ --name spark-ozone-example \
+ --conf spark.executor.instances=2 \
+ --conf spark.kubernetes.container.image=YOUR_REPO/spark-ozone:latest \
+ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
+ --conf spark.kubernetes.namespace=YOUR_NAMESPACE \
+ local:///opt/spark/work-dir/ozone_write.py
+```
+
+Replace `YOUR_KUBERNETES_API_SERVER`, `YOUR_REPO`, and `YOUR_NAMESPACE` with your environment values.
+
+## Using the S3A Protocol
+
+Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol. This is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
+
+For configuration details, refer to the [S3A documentation](../client-interfaces/s3a).