This is an automated email from the ASF dual-hosted git repository.
yufei pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/polaris.git
The following commit(s) were added to refs/heads/main by this push:
new 3b8aaffc3 Site: Add Polaris Spark client webpage under unreleased (#1503)
3b8aaffc3 is described below
commit 3b8aaffc3e19d71078d22a7ae23b33ec2795c617
Author: gh-yzou <[email protected]>
AuthorDate: Fri May 2 17:44:20 2025 -0700
Site: Add Polaris Spark client webpage under unreleased (#1503)
---
plugins/spark/README.md | 17 ++-
.../in-dev/unreleased/polaris-spark-client.md | 130 +++++++++++++++++++++
2 files changed, 141 insertions(+), 6 deletions(-)
diff --git a/plugins/spark/README.md b/plugins/spark/README.md
index 0340ea9b7..66d4c2983 100644
--- a/plugins/spark/README.md
+++ b/plugins/spark/README.md
@@ -30,6 +30,12 @@ and depends on iceberg-spark-runtime 1.8.1.
# Build Plugin Jar
A task createPolarisSparkJar is added to build a jar for the Polaris Spark plugin; the jar is named as:
+`polaris-iceberg-<icebergVersion>-spark-runtime-<sparkVersion>_<scalaVersion>-<polarisVersion>.jar`. For example:
+`polaris-iceberg-1.8.1-spark-runtime-3.5_2.12-0.10.0-beta-incubating-SNAPSHOT.jar`.
+
+- `./gradlew :polaris-spark-3.5_2.12:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.12.
+- `./gradlew :polaris-spark-3.5_2.13:createPolarisSparkJar` -- build jar for Spark 3.5 with Scala version 2.13.
+
The resulting jar is located at plugins/spark/v3.5/build/<scala_version>/libs after the build.
# Start Spark with Local Polaris Service using built Jar
@@ -51,13 +57,12 @@ bin/spark-shell \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.<catalog-name>.warehouse=<catalog-name> \
---conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=true \
+--conf spark.sql.catalog.<catalog-name>.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.<catalog-name>=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.<catalog-name>.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.<catalog-name>.credential="root:secret" \
--conf spark.sql.catalog.<catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.<catalog-name>.token-refresh-enabled=true \
---conf spark.sql.catalog.<catalog-name>.type=rest \
--conf spark.sql.sources.useV1SourceList=''
```
@@ -72,13 +77,12 @@ bin/spark-shell \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.polaris.warehouse=<catalog-name> \
---conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=true \
+--conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation=vended-credentials \
--conf spark.sql.catalog.polaris=org.apache.polaris.spark.SparkCatalog \
--conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
--conf spark.sql.catalog.polaris.credential="root:secret" \
--conf spark.sql.catalog.polaris.scope='PRINCIPAL_ROLE:ALL' \
--conf spark.sql.catalog.polaris.token-refresh-enabled=true \
---conf spark.sql.catalog.polaris.type=rest \
--conf spark.sql.sources.useV1SourceList=''
```
@@ -86,10 +90,11 @@ bin/spark-shell \
The Polaris Spark client supports catalog management for both Iceberg and Delta tables; it routes all Iceberg table
requests to the Iceberg REST endpoints, and routes all Delta table requests to the Generic Table REST endpoints.
-Following describes the current limitations of the Polaris Spark client:
+The Spark client requires at least Delta 3.2.1 to work with Delta tables, which in turn requires Apache Spark 3.5.3 or later.
+The following describes the current functional limitations of the Polaris Spark client:
1) Create table as select (CTAS) is not supported for Delta tables. As a result, the `saveAsTable` method of `DataFrame`
   is also not supported, since it relies on the CTAS support.
2) Create a Delta table without explicit location is not supported.
3) Rename a Delta table is not supported.
4) ALTER TABLE ... SET LOCATION/SET FILEFORMAT/ADD PARTITION is not supported for Delta tables.
-5) For other non-iceberg tables like csv, there is no specific guarantee provided today.
+5) Other non-Iceberg table formats, such as CSV, are not supported today.
diff --git a/site/content/in-dev/unreleased/polaris-spark-client.md b/site/content/in-dev/unreleased/polaris-spark-client.md
new file mode 100644
index 000000000..46796cdc6
--- /dev/null
+++ b/site/content/in-dev/unreleased/polaris-spark-client.md
@@ -0,0 +1,130 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: Polaris Spark Client
+type: docs
+weight: 650
+---
+
+Apache Polaris now provides Catalog support for Generic Tables (non-Iceberg tables); please check out
+the [Catalog API Spec]({{% ref "polaris-catalog-service" %}}) for the Generic Table API specs.
+
+Along with the Generic Table Catalog support, Polaris is also releasing a Spark client, which helps
+provide an end-to-end solution for Apache Spark to manage Delta tables using Polaris.
+
+Note that the Polaris Spark client is able to handle both Iceberg and Delta tables, not just Delta.
+
+This page documents how to connect Spark with a Polaris service using the Polaris Spark client.
+
+## Quick Start with Local Polaris service
+If you want to quickly try out the functionality with a local Polaris service, simply check out the Polaris repo
+and follow the instructions in the Spark plugin getting-started
+[README](https://github.com/apache/polaris/blob/main/plugins/spark/v3.5/getting-started/README.md).
+
+Check out the Polaris repo:
+```shell
+cd ~
+git clone https://github.com/apache/polaris.git
+```
+
+## Start Spark against a deployed Polaris service
+Before starting, ensure that the deployed Polaris service supports Generic Tables, and that Spark 3.5 (version 3.5.3 or later) is installed.
+Spark 3.5.5 is recommended; you can follow the instructions below to get a Spark 3.5.5 distribution.
+```shell
+cd ~
+wget https://archive.apache.org/dist/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
+mkdir spark-3.5
+tar xzvf spark-3.5.5-bin-hadoop3.tgz -C spark-3.5 --strip-components=1
+cd spark-3.5
+```
+
+### Connecting with Spark using the Polaris Spark client
+The following CLI command can be used to start Spark with a connection to the deployed Polaris service using
+a released Polaris Spark client.
+
+```shell
+bin/spark-shell \
+--packages <polaris-spark-client-package>,org.apache.hadoop:hadoop-aws:3.4.0,io.delta:delta-spark_2.12:3.3.1 \
+--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension \
+--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
+--conf spark.sql.catalog.<spark-catalog-name>.warehouse=<polaris-catalog-name> \
+--conf spark.sql.catalog.<spark-catalog-name>.header.X-Iceberg-Access-Delegation=vended-credentials \
+--conf spark.sql.catalog.<spark-catalog-name>=org.apache.polaris.spark.SparkCatalog \
+--conf spark.sql.catalog.<spark-catalog-name>.uri=<polaris-service-uri> \
+--conf spark.sql.catalog.<spark-catalog-name>.credential='<client-id>:<client-secret>' \
+--conf spark.sql.catalog.<spark-catalog-name>.scope='PRINCIPAL_ROLE:ALL' \
+--conf spark.sql.catalog.<spark-catalog-name>.token-refresh-enabled=true
+```
+Assuming the released Polaris Spark client you want to use is `org.apache.polaris:polaris-iceberg-1.8.1-spark-runtime-3.5_2.12:1.0.0`,
+replace the `polaris-spark-client-package` field with that release.
+
+The `spark-catalog-name` is the catalog name you will use with Spark, and `polaris-catalog-name` is the catalog name used
+by the Polaris service; for simplicity, you can use the same name for both.
+
+Replace `polaris-service-uri` with the URI of the deployed Polaris service. For example, with a locally deployed
+Polaris service, the URI would be `http://localhost:8181/api/catalog`.
+
+For `client-id` and `client-secret` values, you can refer to [Using Polaris]({{% ref "getting-started/using-polaris" %}})
+for more details.
+
+You can also start the connection by programmatically initializing a SparkSession; the following is an example with PySpark:
+```python
+from pyspark.sql import SparkSession
+
+spark = (
+    SparkSession.builder
+    .config("spark.jars.packages", "<polaris-spark-client-package>,org.apache.hadoop:hadoop-aws:3.3.4,io.delta:delta-spark_2.12:3.3.1")
+    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
+    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension")
+    .config("spark.sql.catalog.<spark-catalog-name>", "org.apache.polaris.spark.SparkCatalog")
+    .config("spark.sql.catalog.<spark-catalog-name>.uri", "<polaris-service-uri>")
+    .config("spark.sql.catalog.<spark-catalog-name>.token-refresh-enabled", "true")
+    .config("spark.sql.catalog.<spark-catalog-name>.credential", "<client-id>:<client-secret>")
+    .config("spark.sql.catalog.<spark-catalog-name>.warehouse", "<polaris-catalog-name>")
+    .config("spark.sql.catalog.<spark-catalog-name>.scope", "PRINCIPAL_ROLE:ALL")
+    .config("spark.sql.catalog.<spark-catalog-name>.header.X-Iceberg-Access-Delegation", "vended-credentials")
+    .getOrCreate()
+)
+```
+As with the CLI command, make sure the corresponding fields are replaced correctly.
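+
+For illustration only, the following shows the same initialization with the placeholders filled in for a locally deployed
+Polaris service: the example client package named above, the local URI `http://localhost:8181/api/catalog`, the
+`root:secret` credentials from the local setup, and `polaris` as both the Spark and Polaris catalog name.
+```python
+from pyspark.sql import SparkSession
+
+# Illustrative values only -- substitute your own deployment's package, URI,
+# credentials, and catalog names.
+spark = (
+    SparkSession.builder
+    .config("spark.jars.packages", "org.apache.polaris:polaris-iceberg-1.8.1-spark-runtime-3.5_2.12:1.0.0,org.apache.hadoop:hadoop-aws:3.3.4,io.delta:delta-spark_2.12:3.3.1")
+    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
+    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension")
+    .config("spark.sql.catalog.polaris", "org.apache.polaris.spark.SparkCatalog")
+    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
+    .config("spark.sql.catalog.polaris.token-refresh-enabled", "true")
+    .config("spark.sql.catalog.polaris.credential", "root:secret")
+    .config("spark.sql.catalog.polaris.warehouse", "polaris")
+    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
+    .config("spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation", "vended-credentials")
+    .getOrCreate()
+)
+```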
+
+### Create tables with Spark
+After Spark is started, you can use it to create and access Iceberg and Delta tables. For example:
+```python
+spark.sql("USE polaris")
+spark.sql("CREATE NAMESPACE IF NOT EXISTS DELTA_NS")
+spark.sql("CREATE NAMESPACE IF NOT EXISTS DELTA_NS.PUBLIC")
+spark.sql("USE NAMESPACE DELTA_NS.PUBLIC")
+spark.sql("""CREATE TABLE IF NOT EXISTS PEOPLE (
+ id int, name string)
+USING delta LOCATION 'file:///tmp/var/delta_tables/people';
+""")
+```
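+
+The same session handles both table formats. Below is a minimal follow-up sketch, assuming the session above and that
+basic reads and writes are available for the Delta table (only the operations listed under Limitations are excluded);
+the `PEOPLE_ICEBERG` table name is hypothetical and used purely for illustration:
+```python
+# Write to and read back from the Delta table created above.
+spark.sql("INSERT INTO PEOPLE VALUES (1, 'anna'), (2, 'bob')")
+spark.sql("SELECT * FROM PEOPLE").show()
+
+# Iceberg tables are managed through the same Polaris catalog; no explicit LOCATION
+# is needed because Iceberg requests are routed to the Iceberg REST endpoints.
+spark.sql("""CREATE TABLE IF NOT EXISTS PEOPLE_ICEBERG (
+  id int, name string)
+USING iceberg
+""")
+spark.sql("INSERT INTO PEOPLE_ICEBERG SELECT id, name FROM PEOPLE")
+spark.sql("SELECT * FROM PEOPLE_ICEBERG").show()
+```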
+
+## Connecting with Spark using local Polaris Spark client jar
+If you would like to use a version of the Spark client that has not yet been released, you can
+build a Spark client jar locally from source. Please check out the Polaris repo and refer to the Spark plugin
+[README](https://github.com/apache/polaris/blob/main/plugins/spark/README.md) for detailed instructions.
+
+## Limitations
+The Polaris Spark client has the following functional limitations:
+1) Create table as select (CTAS) is not supported for Delta tables. As a result, the `saveAsTable` method of `DataFrame`
+   is also not supported, since it relies on the CTAS support (a two-step alternative is sketched after this list).
+2) Creating a Delta table without an explicit location is not supported.
+3) Renaming a Delta table is not supported.
+4) ALTER TABLE ... SET LOCATION/SET FILEFORMAT/ADD PARTITION is not supported for Delta tables.
+5) Other non-Iceberg table formats, such as CSV, are not supported.
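+
+Since CTAS and `saveAsTable` are not available for Delta tables, one possible pattern is to create the table first
+(with an explicit location, per limitation 2) and then load the data into it. The sketch below illustrates this under
+the assumption that a plain `INSERT INTO ... SELECT` remains available for Delta tables; the DataFrame, temporary view,
+and location are hypothetical values used for illustration:
+```python
+# Hypothetical source data registered as a temporary view (session-local, not routed to Polaris).
+source_df = spark.createDataFrame([(1, "anna"), (2, "bob")], ["id", "name"])
+source_df.createOrReplaceTempView("people_src")
+
+# Step 1: create the Delta table with an explicit location (see limitation 2).
+spark.sql("""CREATE TABLE IF NOT EXISTS PEOPLE (
+  id int, name string)
+USING delta LOCATION 'file:///tmp/var/delta_tables/people'
+""")
+
+# Step 2: load the data with a plain INSERT instead of CTAS / saveAsTable.
+spark.sql("INSERT INTO PEOPLE SELECT id, name FROM people_src")
+```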