This is an automated email from the ASF dual-hosted git repository.
openinx pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new 9a80238690 Dell: Add document. (#4993)
9a80238690 is described below
commit 9a8023869006b0047731140cd81cdf07570d5baa
Author: Xia <[email protected]>
AuthorDate: Wed Sep 7 14:03:28 2022 +0800
Dell: Add document. (#4993)
---
docs/dell.md | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 135 insertions(+)
diff --git a/docs/dell.md b/docs/dell.md
new file mode 100644
index 0000000000..173f2d18ef
--- /dev/null
+++ b/docs/dell.md
@@ -0,0 +1,135 @@
+---
+title: "Dell"
+url: dell
+menu:
+ main:
+ parent: Integrations
+ weight: 0
+---
+<!--
+ - Licensed to the Apache Software Foundation (ASF) under one or more
+ - contributor license agreements. See the NOTICE file distributed with
+ - this work for additional information regarding copyright ownership.
+ - The ASF licenses this file to You under the Apache License, Version 2.0
+ - (the "License"); you may not use this file except in compliance with
+ - the License. You may obtain a copy of the License at
+ -
+ - http://www.apache.org/licenses/LICENSE-2.0
+ -
+ - Unless required by applicable law or agreed to in writing, software
+ - distributed under the License is distributed on an "AS IS" BASIS,
+ - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ - See the License for the specific language governing permissions and
+ - limitations under the License.
+ -->
+
+
+# Iceberg Dell Integration
+
+## Dell ECS Integration
+
+Iceberg can be used with Dell's Enterprise Object Storage (ECS) through the ECS catalog, available since Iceberg 0.15.0.
+
+See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
+
+### Parameters
+
+When using Dell ECS with Iceberg, these configuration parameters are required:
+
+| Name | Description |
+| ------------------------ | --------------------------------- |
+| ecs.s3.endpoint | ECS S3 service endpoint |
+| ecs.s3.access-key-id     | ECS username                      |
+| ecs.s3.secret-access-key | S3 secret key of the ECS user     |
+| warehouse | The location of data and metadata |
+
+The `warehouse` parameter should use one of the following formats:
+
+| Example                    | Description                                                      |
+| -------------------------- | ---------------------------------------------------------------- |
+| ecs://bucket-a             | Use the whole bucket as the data                                 |
+| ecs://bucket-a/            | Use the whole bucket as the data. The last `/` is ignored.       |
+| ecs://bucket-a/namespace-a | Use a prefix to access the data only in this specific namespace  |
+
+The Iceberg `runtime` jar supports different versions of Spark and Flink. You should pick the correct version.
+
+Even though the [Dell ECS client](https://github.com/EMCECS/ecs-object-client-java) jar is backward compatible, Dell EMC still recommends using the latest version of the client.
+
+### Spark
+
+To use the Dell ECS catalog with Spark 3.2.1, you should start a Spark SQL session like the following:
+
+```bash
+ICEBERG_VERSION=0.15.0
+SPARK_VERSION=3.2_2.12
+ECS_CLIENT_VERSION=3.3.2
+
+DEPENDENCIES="org.apache.iceberg:iceberg-spark-runtime-${SPARK_VERSION}:${ICEBERG_VERSION},\
+org.apache.iceberg:iceberg-dell:${ICEBERG_VERSION},\
+com.emc.ecs:object-client-bundle:${ECS_CLIENT_VERSION}"
+
+spark-sql --packages ${DEPENDENCIES} \
+    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=ecs://bucket-a/namespace-a \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.dell.ecs.EcsCatalog \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.endpoint=http://10.x.x.x:9020 \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.access-key-id=<Your-ecs-s3-access-key> \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.secret-access-key=<Your-ecs-s3-secret-access-key>
+```
+
+Then, use `my_catalog` to access the data in ECS. You can use `SHOW NAMESPACES IN my_catalog` and `SHOW TABLES IN my_catalog` to fetch the namespaces and tables of the catalog.
+
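+For instance, a minimal session might look like the following sketch (not part of the original doc; `db_a` and `table_a` are placeholder names):
+
+```SQL
+-- List what the catalog contains, then create and query a table.
+SHOW NAMESPACES IN my_catalog;
+CREATE NAMESPACE IF NOT EXISTS my_catalog.db_a;
+CREATE TABLE my_catalog.db_a.table_a (id BIGINT, data STRING) USING iceberg;
+INSERT INTO my_catalog.db_a.table_a VALUES (1, 'a'), (2, 'b');
+SELECT * FROM my_catalog.db_a.table_a;
+```
+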
+Known issues of catalog usage:
+
+1. `SparkSession.catalog` does not access third-party Spark catalogs in either Python or Scala, so use SQL DDL statements (as in the example above) to list all namespaces and tables.
+
+
+### Flink
+
+To use the Dell ECS catalog with Flink, you must first prepare a Flink environment.
+
+```bash
+# HADOOP_HOME is your hadoop root directory after unpacking the binary package.
+export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
+
+# download Iceberg dependency
+MAVEN_URL=https://repo1.maven.org/maven2
+ICEBERG_VERSION=0.15.0
+FLINK_VERSION=1.14
+wget ${MAVEN_URL}/org/apache/iceberg/iceberg-flink-runtime-${FLINK_VERSION}/${ICEBERG_VERSION}/iceberg-flink-runtime-${FLINK_VERSION}-${ICEBERG_VERSION}.jar
+wget ${MAVEN_URL}/org/apache/iceberg/iceberg-dell/${ICEBERG_VERSION}/iceberg-dell-${ICEBERG_VERSION}.jar
+
+# download ECS object client
+ECS_CLIENT_VERSION=3.3.2
+wget ${MAVEN_URL}/com/emc/ecs/object-client-bundle/${ECS_CLIENT_VERSION}/object-client-bundle-${ECS_CLIENT_VERSION}.jar
+
+# open the SQL client.
+/path/to/bin/sql-client.sh embedded \
+ -j iceberg-flink-runtime-${FLINK_VERSION}-${ICEBERG_VERSION}.jar \
+ -j iceberg-dell-${ICEBERG_VERSION}.jar \
+ -j object-client-bundle-${ECS_CLIENT_VERSION}.jar \
+ shell
+```
+
+Then, use Flink SQL to create a catalog named `my_catalog`:
+
+```SQL
+CREATE CATALOG my_catalog WITH (
+  'type' = 'iceberg',
+  'warehouse' = 'ecs://bucket-a/namespace-a',
+  'catalog-impl' = 'org.apache.iceberg.dell.ecs.EcsCatalog',
+  'ecs.s3.endpoint' = 'http://10.x.x.x:9020',
+  'ecs.s3.access-key-id' = '<Your-ecs-s3-access-key>',
+  'ecs.s3.secret-access-key' = '<Your-ecs-s3-secret-access-key>');
+```
+
+Then, you can run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
+
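+A minimal sketch of a session (names are placeholders, not part of the original doc):
+
+```SQL
+-- Switch to the ECS-backed catalog, then create and query a table.
+USE CATALOG my_catalog;
+CREATE DATABASE IF NOT EXISTS db_a;
+CREATE TABLE IF NOT EXISTS db_a.table_a (id BIGINT, data STRING);
+INSERT INTO db_a.table_a VALUES (1, 'a'), (2, 'b');
+SELECT * FROM db_a.table_a;
+```
+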
+### Limitations
+
+When you use the catalog with Dell ECS, be aware of the following limitations:
+
+1. `RENAME` statements are supported without any additional protection. Before you rename a table, you must make sure that all commits to the original table have finished.
+2. `RENAME` statements only rename the table in the catalog without moving any data files, which can leave a table's data stored in a path outside of the configured warehouse path (see the example below).
+3. The CAS operations used by table commits are based on the checksum of the object, so there is a very small probability of a checksum conflict.
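+
+For example, here is a hypothetical Flink SQL illustration of the second limitation, continuing the session sketched above (names are placeholders):
+
+```SQL
+-- Renaming only updates the catalog entry; the existing data files stay
+-- at the old table location, which may now fall outside the path implied
+-- by the new table name.
+ALTER TABLE db_a.table_a RENAME TO table_b;
+```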