snazy commented on code in PR #3022:
URL: https://github.com/apache/polaris/pull/3022#discussion_r2517114660
##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,147 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Getting Started with Apache Polaris and Ceph
+
+## Overview
+
+This guide describes how to spin up a **single-node Ceph cluster** with **RADOS Gateway (RGW)** for S3-compatible storage and configure it for use by **Polaris**.
+
+This example cluster is configured for basic access key authentication only.
+It does not include STS (Security Token Service) or temporary credentials.
+All access to the Ceph RGW (RADOS Gateway) and Polaris integration uses static S3-style credentials (as configured via radosgw-admin user create).
+
+Spark is used as a query engine. This example assumes a local Spark installation.
+See the [Spark Notebooks Example](../spark/README.md) for a more advanced Spark setup.
+
+## Starting the Example
+
+Before starting the Ceph + Polaris stack, you’ll need to configure environment variables that define network settings, credentials, and cluster IDs.
+
+The services are started **in sequence**:
+1. Monitor + Manager
+2. OSD
+3. RGW
+4. Polaris
+
+Note: this example pulls the `apache/polaris:latest` image, but assumes the image is `1.2.0-incubating` or later.
+
+### 1. Copy the example environment file
+```shell
+cp .env.example .env
+```
+
+### 2. Start monitor and manager
+```shell
+docker compose up -d mon1 mgr
+```
+
+### 3. Start OSD
+```shell
+docker compose up -d osd1
+```
+
+### 4. Start RGW
+```shell
+docker compose up -d rgw1
+```
+#### Check status
+```shell
+docker exec --interactive --tty ceph-mon1-1 ceph -s
+```
+You should see something like:
+```yaml
+cluster:
+  id: b2f59c4b-5f14-4f8c-a9b7-3b7998c76a0e
+  health: HEALTH_WARN
+          mon is allowing insecure global_id reclaim
+          1 monitors have not enabled msgr2
+          6 pool(s) have no replicas configured
+
+services:
+  mon: 1 daemons, quorum mon1 (age 49m)
+  mgr: mgr(active, since 94m)
+  osd: 1 osds: 1 up (since 36m), 1 in (since 93m)
+  rgw: 1 daemon active (1 hosts, 1 zones)
+```
+
+### 5. Create bucket for Polaris storage
+```shell
+docker compose up -d setup_bucket
+```
+
+### 6. Run Polaris service
+```shell
+docker compose up -d polaris
+```
+
+### 7. Setup polaris catalog
+```shell
+docker compose up -d polaris-setup
+```
+
+## Connecting From Spark
+
+```shell
+bin/spark-sql \
+  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0,org.apache.iceberg:iceberg-aws-bundle:1.9.0 \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.polaris.type=rest \
+  --conf spark.sql.catalog.polaris.io-impl="org.apache.iceberg.aws.s3.S3FileIO" \
+  --conf spark.sql.catalog.polaris.uri=http://polaris:8181/api/catalog \
+  --conf spark.sql.catalog.polaris.token-refresh-enabled=true \
+  --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog \
+  --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \
+  --conf spark.sql.catalog.polaris.credential=root:s3cr3t \
+  --conf spark.sql.catalog.polaris.client.region=irrelevant \
+  --conf spark.sql.catalog.polaris.s3.access-key-id=$RGW_ACCESS_KEY \
+  --conf spark.sql.catalog.polaris.s3.secret-access-key=$RGW_SECRET_KEY

Review Comment:
   The keys would be empty, because both variables aren't available in the shell. Better replace with the actual values for simplicity.
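   For example, the two lines could read as follows (an illustrative sketch, not tested here; the key values are the ones exported later in this thread, and the assumption is that they match the RGW user created by this compose setup):
   ```shell
     --conf spark.sql.catalog.polaris.s3.access-key-id=POLARIS123ACCESS \
     --conf spark.sql.catalog.polaris.s3.secret-access-key=POLARIS456SECRET
   ```
   Alternatively, the README could tell the reader to export both variables first (or `source .env`, assuming the `.env` file defines them) before launching `spark-sql`.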

##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,147 @@
+## Connecting From Spark
+
+```shell
+bin/spark-sql \
+  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0,org.apache.iceberg:iceberg-aws-bundle:1.9.0 \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.polaris.type=rest \
+  --conf spark.sql.catalog.polaris.io-impl="org.apache.iceberg.aws.s3.S3FileIO" \
+  --conf spark.sql.catalog.polaris.uri=http://polaris:8181/api/catalog \

Review Comment:
   The host cannot be resolved. Should be
   ```suggestion
   --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
   ```
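   A quick sanity check from the host (assuming the compose file publishes Polaris on host port 8181) is to hit the Iceberg REST config endpoint; any HTTP response, even a 401, shows the port mapping works, whereas the `polaris` hostname only resolves inside the compose network:
   ```shell
   curl -i "http://localhost:8181/api/catalog/v1/config?warehouse=quickstart_catalog"
   ```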

##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,145 @@
+## Starting the Example
+
+Before starting the Ceph + Polaris stack, you’ll need to configure environment variables that define network settings, credentials, and cluster IDs.
+
+Copy the example environment file:
+```shell
+mv getting-started/ceph/.env.example getting-started/ceph/.env
+```
+
+The services are started **in sequence**:
+1. Monitor + Manager
+2. OSD
+3. RGW
+4. Polaris
+
+Note: this example pulls the `apache/polaris:latest` image, but assumes the image is `1.2.0-incubating` or later.
+
+
+### 1. Start monitor and manager
+```shell
+docker compose up -d mon1 mgr

Review Comment:
   The Docker/Podman part LGTM now.

##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,152 @@
+## Starting the Example
+
+Before starting the Ceph + Polaris stack, you’ll need to configure environment variables that define network settings, credentials, and cluster IDs.
+
+The services are started **in sequence**:
+1. Monitor + Manager
+2. OSD
+3. RGW
+4. Polaris
+
+Note: this example pulls the `apache/polaris:latest` image, but assumes the image is `1.2.0-incubating` or later.
+
+### 1. Copy the example environment file
+```shell
+cp .env.example .env
+```
+
+### 2. Prepare Network
+```shell
+# Optional: force runtime (docker or podman)
+export RUNTIME=docker
+
+./getting-started/ceph/prepare-network.sh
+```
+
+### 3. Start monitor and manager
+```shell
+docker compose up -d mon1 mgr
+```
+
+### 4. Start OSD
+```shell
+docker compose up -d osd1
+```
+
+### 5. Start RGW
+```shell
+docker compose up -d rgw1
+```
+#### Check status
+```shell
+docker exec --interactive --tty ceph-mon1-1 ceph -s
+```
+You should see something like:
+```yaml
+cluster:
+  id: b2f59c4b-5f14-4f8c-a9b7-3b7998c76a0e
+  health: HEALTH_WARN
+          mon is allowing insecure global_id reclaim
+          1 monitors have not enabled msgr2
+          6 pool(s) have no replicas configured
+
+services:
+  mon: 1 daemons, quorum mon1 (age 49m)
+  mgr: mgr(active, since 94m)
+  osd: 1 osds: 1 up (since 36m), 1 in (since 93m)
+  rgw: 1 daemon active (1 hosts, 1 zones)
+```
+
+### 6. Create bucket for Polaris storage
+```shell
+docker compose up -d setup_bucket
+```
+
+### 7. Run Polaris service
+```shell
+docker compose up -d polaris
+```
+
+### 8. Setup polaris catalog
+```shell
+docker compose up -d polaris-setup
+```
+
+## Connecting From Spark
+
+```shell
+bin/spark-sql \
+  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0,org.apache.iceberg:iceberg-aws-bundle:1.9.0 \
+  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+  --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
+  --conf spark.sql.catalog.polaris.type=rest \
+  --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
+  --conf spark.sql.catalog.polaris.token-refresh-enabled=false \
+  --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog \
+  --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \
+  --conf spark.sql.catalog.polaris.credential=root:s3cr3t \
+  --conf spark.sql.catalog.polaris.client.region=irrelevant
+```
+
+Note: `s3cr3t` is defined as the password for the `root` user in the `docker-compose.yml` file.
+
+Note: The `client.region` configuration is required for the AWS S3 client to work, but it is not used in this example
+since Ceph does not require a specific region.
+
+## Running Queries
+
+Run inside the Spark SQL shell:
+
+```
+spark-sql (default)> use polaris;
+Time taken: 0.837 seconds
+
+spark-sql ()> create namespace ns;
+Time taken: 0.374 seconds
+
+spark-sql ()> create table ns.t1 as select 'abc';

Review Comment:
   Still not working for me:
   ```
   $ export RGW_ACCESS_KEY=POLARIS123ACCESS # Access key for Polaris S3 user
   $ export RGW_SECRET_KEY=POLARIS456SECRET # Secret key for Polaris S3 user
   $ spark-sql \
       --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.0,org.apache.iceberg:iceberg-aws-bundle:1.9.0 \
       --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
       --conf spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog \
       --conf spark.sql.catalog.polaris.type=rest \
       --conf spark.sql.catalog.polaris.io-impl="org.apache.iceberg.aws.s3.S3FileIO" \
       --conf spark.sql.catalog.polaris.uri=http://localhost:8181/api/catalog \
       --conf spark.sql.catalog.polaris.token-refresh-enabled=true \
       --conf spark.sql.catalog.polaris.warehouse=quickstart_catalog \
       --conf spark.sql.catalog.polaris.scope=PRINCIPAL_ROLE:ALL \
       --conf spark.sql.catalog.polaris.credential=root:s3cr3t \
       --conf spark.sql.catalog.polaris.client.region=irrelevant \
       --conf spark.sql.catalog.polaris.s3.access-key-id=$RGW_ACCESS_KEY \
       --conf spark.sql.catalog.polaris.s3.secret-access-key=$RGW_SECRET_KEY
   25/11/12 07:50:56 WARN Utils: Your hostname, shark resolves to a loopback address: 127.0.1.1; using 192.168.x.x instead (on interface enp14s0)
   25/11/12 07:50:56 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
   :: loading settings :: url = jar:file:/home/snazy/.sdkman/candidates/spark/3.5.3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
   Ivy Default Cache set to: /home/snazy/.ivy2/cache
   The jars for the packages stored in: /home/snazy/.ivy2/jars
   org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
   org.apache.iceberg#iceberg-aws-bundle added as a dependency
   :: resolving dependencies :: org.apache.spark#spark-submit-parent-2fdcef36-748e-42b7-815e-6aac08972a3c;1.0
       confs: [default]
       found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.9.0 in central
       found org.apache.iceberg#iceberg-aws-bundle;1.9.0 in central
   :: resolution report :: resolve 56ms :: artifacts dl 1ms
       :: modules in use:
       org.apache.iceberg#iceberg-aws-bundle;1.9.0 from central in [default]
       org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.9.0 from central in [default]
   ---------------------------------------------------------------------
   |                  |            modules            ||   artifacts   |
   |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
   ---------------------------------------------------------------------
   |      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
   ---------------------------------------------------------------------
   :: retrieving :: org.apache.spark#spark-submit-parent-2fdcef36-748e-42b7-815e-6aac08972a3c
       confs: [default]
       0 artifacts copied, 2 already retrieved (0kB/3ms)
   25/11/12 07:50:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   25/11/12 07:50:57 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
   25/11/12 07:50:57 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
   25/11/12 07:50:58 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
   25/11/12 07:50:58 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
   Spark Web UI available at http://x.x.x.x:4040
   Spark master: local[*], Application Id: local-1762930257016
   spark-sql (default)> use polaris;
   25/11/12 07:51:01 WARN AuthManagers: Inferring rest.auth.type=oauth2 since property credential was provided. Please explicitly set rest.auth.type to avoid this warning.
   25/11/12 07:51:01 WARN OAuth2Manager: Iceberg REST client is missing the OAuth2 server URI configuration and defaults to http://localhost:8181/api/catalog/v1/oauth/tokens. This automatic fallback will be removed in a future Iceberg release. It is recommended to configure the OAuth2 endpoint using the 'oauth2-server-uri' property to be prepared. This warning will disappear if the OAuth2 endpoint is explicitly configured. See https://github.com/apache/iceberg/issues/10537
   25/11/12 07:51:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
   Time taken: 0.566 seconds
   spark-sql ()> create namespace ns;
   [SCHEMA_ALREADY_EXISTS] Cannot create schema `ns` because it already exists. Choose a different name, drop the existing schema, or add the IF NOT EXISTS clause to tolerate pre-existing schema.
   spark-sql ()> create table ns.t1 as select 'abc';
   25/11/12 07:51:06 ERROR SparkSQLDriver: Failed in [create table ns.t1 as select 'abc']
   java.lang.IllegalArgumentException: Credential vending was requested for table ns.t1, but no credentials are available
       at org.apache.iceberg.rest.ErrorHandlers$DefaultErrorHandler.accept(ErrorHandlers.java:230)
       at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:123)
       at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:107)
       at org.apache.iceberg.rest.HTTPClient.throwFailure(HTTPClient.java:215)
       at org.apache.iceberg.rest.HTTPClient.execute(HTTPClient.java:299)
       at org.apache.iceberg.rest.BaseHTTPClient.post(BaseHTTPClient.java:88)
       at org.apache.iceberg.rest.RESTSessionCatalog$Builder.stageCreate(RESTSessionCatalog.java:921)
       at org.apache.iceberg.rest.RESTSessionCatalog$Builder.createTransaction(RESTSessionCatalog.java:799)
       at org.apache.iceberg.CachingCatalog$CachingTableBuilder.createTransaction(CachingCatalog.java:282)
       at org.apache.iceberg.spark.SparkCatalog.stageCreate(SparkCatalog.java:265)
       at org.apache.spark.sql.connector.catalog.StagingTableCatalog.stageCreate(StagingTableCatalog.java:94)
       at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:121)
       at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
       at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
       at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
       at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
       at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
       at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
       at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
       at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
       at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
       at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
       at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
       at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
       at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
       at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
       at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
       at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
       at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
       at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
       at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
       at org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691)
       at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
       at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
       at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713)
       at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744)
       at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:68)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:501)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:619)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:613)
       at scala.collection.Iterator.foreach(Iterator.scala:943)
       at scala.collection.Iterator.foreach$(Iterator.scala:943)
       at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
       at scala.collection.IterableLike.foreach(IterableLike.scala:74)
       at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
       at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:613)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:310)
       at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
       at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
       at java.base/java.lang.reflect.Method.invoke(Method.java:580)
       at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
       at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
       at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
       at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
       at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
       at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
       at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
       at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```
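   Reading the trace: the failure is returned by Polaris itself during `stageCreate`, so the client-side `s3.access-key-id`/`s3.secret-access-key` settings never enter the picture. The Iceberg REST client asked the server for vended credentials (the `X-Iceberg-Access-Delegation` request header), and this static-key catalog has nothing to vend. One untested idea, purely a sketch: stop requesting delegation and hand S3FileIO the RGW endpoint and keys directly. The header override relies on Iceberg's generic `header.` catalog-property prefix, and the endpoint/port is an assumption about this compose setup:
   ```shell
   # Sketch: additional lines for the spark-sql invocation above (untested).
   # - Empty X-Iceberg-Access-Delegation: assumes an empty value stops the
   #   vended-credentials request; Iceberg's `header.` prefix sets raw headers.
   # - Endpoint/port and path-style access: assumptions about this RGW setup.
     --conf spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation= \
     --conf spark.sql.catalog.polaris.s3.endpoint=http://localhost:8080 \
     --conf spark.sql.catalog.polaris.s3.path-style-access=true \
   ```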

##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,147 @@
+### 7. Setup polaris catalog
+```shell
+docker compose up -d polaris-setup
+```
+
+## Connecting From Spark

Review Comment:
   ```suggestion
   ## 8. Connecting From Spark
   ```

##########
getting-started/ceph/README.md:
##########
@@ -0,0 +1,147 @@
+Note: `s3cr3t` is defined as the password for the `root` user in the `docker-compose.yml` file.
+
+Note: The `client.region` configuration is required for the AWS S3 client to work, but it is not used in this example
+since Ceph does not require a specific region.
+
+## Running Queries

Review Comment:
   ```suggestion
   ## 9. Running Queries
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
