jojochuang commented on code in PR #358:
URL: https://github.com/apache/ozone-site/pull/358#discussion_r2891511545
##########
docs/04-user-guide/02-integrations/06-spark.md:
##########
@@ -1,8 +1,203 @@
 ---
-draft: true
+sidebar_label: Spark
 ---
 
-# Spark
+# Using Apache Spark with Ozone
 
-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
-**TODO:** Uncomment link to this page in src/pages/index.js
+[Apache Spark](https://spark.apache.org/) is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+:::note
+This guide covers Apache Spark 3.x. Examples were tested with Spark 3.5.x and Apache Ozone 2.1.0.
+:::
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem connector, which allows access using the `ofs://` URI scheme.
+Spark can also access Ozone through the S3 Gateway using the `s3a://` protocol, which is useful for porting existing cloud-native Spark applications to Ozone without changing application code.
+
+The older `o3fs://` scheme is supported for legacy compatibility but is not recommended for new deployments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and `RDD` APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-client-*.jar` must be available on the Spark driver and executor classpath.
+3. **Hadoop 3.4.x runtime (Ozone 2.1.0+):** Ozone 2.1.0 removed bundled copies of several Hadoop classes (`LeaseRecoverable`, `SafeMode`, `SafeModeAction`) and now requires them from the runtime classpath ([HDDS-13574](https://issues.apache.org/jira/browse/HDDS-13574)). Since Spark 3.5.x ships with Hadoop 3.3.4, you must add `hadoop-common-3.4.x.jar` to the Spark classpath alongside the existing Hadoop JARs.
+4. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml` and potentially `ozone-site.xml`) to connect to the Ozone cluster.
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../client-interfaces/ofs#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+```
+
+### 3. Security (Kerberos)
+
+If your Ozone and Spark clusters are Kerberos-enabled, Spark needs permission to obtain delegation tokens for Ozone.
+
+Configure the following property in `spark-defaults.conf` or via `--conf`, specifying your Ozone filesystem URI:
+
+```properties
+# For YARN deployments in Spark 3+
+spark.kerberos.access.hadoopFileSystems=ofs://ozone1/
+```
+
+Replace `ozone1` with your OM Service ID. Ensure the user running the Spark job has a valid Kerberos ticket (`kinit`).
+
+## Usage Examples
+
+You can read and write data using `ofs://` URIs like any other Hadoop-compatible filesystem.
+
+**URI Format:** `ofs://<om-service-id>/<volume>/<bucket>/path/to/key`
+
+### Reading Data (Scala)
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder.appName("Ozone Spark Read Example").getOrCreate()
+
+// Read a CSV file from Ozone
+val df = spark.read.format("csv")
+  .option("header", "true")
+  .option("inferSchema", "true")
+  .load("ofs://ozone1/volume1/bucket1/input/data.csv")
+
+df.show()
+
+spark.stop()

Review Comment:
   spark.stop() isn't needed after Ozone 2.0: https://issues.apache.org/jira/browse/HDDS-10564
   It was an Ozone bug.
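A natural companion to the page's read example would be a write sketch in the same style. The snippet below is not part of the patch as quoted; it reuses the read example's `ozone1` OM Service ID and volume/bucket paths as assumptions, and writes Parquet back to Ozone. Following the review comment above (HDDS-10564), the explicit `spark.stop()` is omitted.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical write example mirroring the read example in the patch.
// Assumes the same `ozone1` OM Service ID and a bucket the job's user can write to.
val spark = SparkSession.builder.appName("Ozone Spark Write Example").getOrCreate()

// Read the same CSV input used in the read example
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("ofs://ozone1/volume1/bucket1/input/data.csv")

// Write the DataFrame back to Ozone as Parquet; `overwrite` replaces any
// previous output under the target path
df.write.mode("overwrite")
  .parquet("ofs://ozone1/volume1/bucket1/output/data.parquet")
```

Parquet is a common choice here because Spark's columnar reader can then prune columns and row groups on subsequent reads from Ozone.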
##########
docs/04-user-guide/02-integrations/06-spark.md:
##########
@@ -1,8 +1,203 @@
+2. **Ozone Client JARs:** The `ozone-filesystem-hadoop3-client-*.jar` must be available on the Spark driver and executor classpath.

Review Comment:
   We removed the ozone-filesystem-hadoop3-client jar in Ozone 2.2 (to be released) and ozone-filesystem-hadoop3 will just work. Let's update all references of ozone-filesystem-hadoop3-client to ozone-filesystem-hadoop3 in this doc, because it is for Ozone 2.2. For the Ozone 2.1 page, we'll leave it as is.
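For context on the jar prerequisites being discussed, one way they might be wired up in `spark-defaults.conf` is sketched below. This is illustrative only: the install paths and version numbers are assumptions, not part of the patch, and the connector jar name depends on the Ozone release (`ozone-filesystem-hadoop3-client-*.jar` on Ozone 2.1; per the review comment, `ozone-filesystem-hadoop3-*.jar` from Ozone 2.2 on).

```properties
# Illustrative sketch; adjust paths, versions, and jar name for your installation.
# Ozone filesystem connector plus the Hadoop 3.4.x runtime jar required by
# HDDS-13574 (Spark 3.5.x bundles only Hadoop 3.3.4).
spark.driver.extraClassPath=/opt/ozone/share/ozone/lib/ozone-filesystem-hadoop3-client-2.1.0.jar:/opt/hadoop/share/hadoop/common/hadoop-common-3.4.1.jar
spark.executor.extraClassPath=/opt/ozone/share/ozone/lib/ozone-filesystem-hadoop3-client-2.1.0.jar:/opt/hadoop/share/hadoop/common/hadoop-common-3.4.1.jar
```

The same jars could instead be passed per job via `spark-submit --jars`; the classpath properties are simply the cluster-wide variant of that choice.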
