SaketaChalamchala commented on code in PR #243:
URL: https://github.com/apache/ozone-site/pull/243#discussion_r2748629417
##########
docs/04-user-guide/03-integrations/06-spark.md:
##########
@@ -1,3 +1,166 @@
-# Spark
+---
+sidebar_label: Spark
+---

-**TODO:** File a subtask under [HDDS-9858](https://issues.apache.org/jira/browse/HDDS-9858) and complete this page or section.
+# Using Apache Spark with Ozone
+
+Apache Spark is a widely used unified analytics engine for large-scale data processing. Ozone can serve as a scalable storage layer for Spark applications, allowing you to read and write data directly from/to Ozone clusters using familiar Spark APIs.
+
+## Overview
+
+Spark interacts with Ozone primarily through the OzoneFileSystem (ofs) connector, which allows access using the `ofs://` URI scheme. You can also use the older `o3fs://` scheme, though `ofs://` is generally recommended, especially in CDP environments.
+
+Key benefits include:
+
+- Storing large datasets generated or consumed by Spark jobs directly in Ozone.
+- Leveraging Ozone's scalability and object storage features for Spark workloads.
+- Using standard Spark DataFrame and RDD APIs to interact with Ozone data.
+
+## Prerequisites
+
+1. **Ozone Cluster:** A running Ozone cluster.
+2. **Ozone Client JARs:** The `hadoop-ozone-filesystem-hadoop3.jar` must be available on the Spark driver and executor classpath.
+3. **Configuration:** Spark needs access to Ozone configuration (`core-site.xml`and potentially`ozone-site.xml`) to connect to the Ozone cluster.
+
+## Configuration
+
+### 1. Core Site (`core-site.xml`)
+
+For `core-site.xml` configuration, refer to the [Ozone File System (ofs) Configuration section](../01-client-interfaces/02-ofs.md#configuration).
+
+### 2. Spark Configuration (`spark-defaults.conf` or `--conf`)
+
+While Spark often picks up settings from `core-site.xml` on the classpath, explicitly setting the implementation can sometimes be necessary:
+
+```properties
+spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
+spark.hadoop.fs.o3fs.impl=org.apache.hadoop.fs.ozone.OzoneFileSystem

Review Comment:
Same as the previous comment: we want to move away from the `o3fs` protocol and encourage use of the `ofs` protocol.
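
A possible shape for the revised block, assuming the fix is simply to drop the `o3fs` line and keep the `ofs` mapping from the diff unchanged (a sketch of the suggestion, not the author's final revision):

```properties
# Register only the rooted ofs:// connector; the o3fs mapping is left out on purpose.
spark.hadoop.fs.ofs.impl=org.apache.hadoop.fs.ozone.RootedOzoneFileSystem
```

With this change the snippet would point readers at the `ofs://` scheme only, which matches the direction requested in the comment.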
