openinx commented on a change in pull request #1464: URL: https://github.com/apache/iceberg/pull/1464#discussion_r494105478
########## File path: site/docs/flink.md ########## @@ -0,0 +1,296 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Flink + +Apache Iceberg support both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API to write records into iceberg table. Currently, +we only integrate iceberg with apache flink 1.11.x . + +| Feature support | Flink 1.11.0 | Notes | +|------------------------------------------------------------------------|--------------------|--------------------------------------------------------| +| [SQL create catalog](#creating-catalogs-and-using-catalogs) | ✔️ | | +| [SQL create database](#create-database) | ✔️ | | +| [SQL create table](#create-table) | ✔️ | | +| [SQL alter table](#alter-table) | ✔️ | Only support altering table properties, Columns/PartitionKey changes are not supported now| +| [SQL drop_table](#drop-table) | ✔️ | | +| [SQL select](#querying-with-sql) | ️ | | +| [SQL insert into](#insert-into) | ✔️ ️ | Support both streaming and batch mode | +| [SQL insert overwrite](#insert-overwrite) | ✔️ ️ | | +| [DataStream read](#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](#appending-data) | ✔️ ️ | | +| [DataStream overwrite](#overwrite-data) | ✔️ ️ | | +| [Metadata tables](#inspecting-tables) | ️ | | + +## Preparation + +To create iceberg table in flink, we recommend to use [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html) because it's easier for users to understand the concepts. + +Step.1 Downloading the flink 1.11.x binary package from the apache flink [download page](https://flink.apache.org/downloads.html). We now use scala 2.12 to archive the apache iceberg-flink-runtime jar, so it's recommended to use flink 1.11 bundled with scala 2.12. + +```bash +wget https://downloads.apache.org/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.12.tgz +tar xzvf flink-1.11.1-bin-scala_2.12.tgz +``` + +Step.2 Start a standalone flink cluster within hadoop environment. + +```bash +# HADOOP_HOME is your hadoop root directory after unpack the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +# Start the flink standalone cluster +./bin/start-cluster.sh +``` + +Step.3 Start the flink SQL client. + +We've created a separate `flink-runtime` module in iceberg project to generate a bundled jar, which could be loaded by flink SQL client directly. + +If we want to build the `flink-runtime` bundled jar manually, please just build the `iceberg` project and it will generate the jar under `<iceberg-root-dir>/flink-runtime/build/libs`. Of course, we could also download the `flink-runtime` jar from the [apache official repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink/). Review comment: I expected that the iceberg-flink-runtime will be deployed here once we released the 0.10.0 , although we have not accomplished it. OK, the path should be changed from `iceberg-flink` to `iceberg-flink-runtime` . ########## File path: site/docs/flink.md ########## @@ -0,0 +1,296 @@ +<!-- + - Licensed to the Apache Software Foundation (ASF) under one or more + - contributor license agreements. See the NOTICE file distributed with + - this work for additional information regarding copyright ownership. + - The ASF licenses this file to You under the Apache License, Version 2.0 + - (the "License"); you may not use this file except in compliance with + - the License. You may obtain a copy of the License at + - + - http://www.apache.org/licenses/LICENSE-2.0 + - + - Unless required by applicable law or agreed to in writing, software + - distributed under the License is distributed on an "AS IS" BASIS, + - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + - See the License for the specific language governing permissions and + - limitations under the License. + --> + +# Flink + +Apache Iceberg support both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API to write records into iceberg table. Currently, +we only integrate iceberg with apache flink 1.11.x . + +| Feature support | Flink 1.11.0 | Notes | +|------------------------------------------------------------------------|--------------------|--------------------------------------------------------| +| [SQL create catalog](#creating-catalogs-and-using-catalogs) | ✔️ | | +| [SQL create database](#create-database) | ✔️ | | +| [SQL create table](#create-table) | ✔️ | | +| [SQL alter table](#alter-table) | ✔️ | Only support altering table properties, Columns/PartitionKey changes are not supported now| +| [SQL drop_table](#drop-table) | ✔️ | | +| [SQL select](#querying-with-sql) | ️ | | +| [SQL insert into](#insert-into) | ✔️ ️ | Support both streaming and batch mode | +| [SQL insert overwrite](#insert-overwrite) | ✔️ ️ | | +| [DataStream read](#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](#appending-data) | ✔️ ️ | | +| [DataStream overwrite](#overwrite-data) | ✔️ ️ | | +| [Metadata tables](#inspecting-tables) | ️ | | + +## Preparation + +To create iceberg table in flink, we recommend to use [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html) because it's easier for users to understand the concepts. + +Step.1 Downloading the flink 1.11.x binary package from the apache flink [download page](https://flink.apache.org/downloads.html). We now use scala 2.12 to archive the apache iceberg-flink-runtime jar, so it's recommended to use flink 1.11 bundled with scala 2.12. + +```bash +wget https://downloads.apache.org/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.12.tgz +tar xzvf flink-1.11.1-bin-scala_2.12.tgz +``` + +Step.2 Start a standalone flink cluster within hadoop environment. + +```bash +# HADOOP_HOME is your hadoop root directory after unpack the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +# Start the flink standalone cluster +./bin/start-cluster.sh +``` + +Step.3 Start the flink SQL client. + +We've created a separate `flink-runtime` module in iceberg project to generate a bundled jar, which could be loaded by flink SQL client directly. + +If we want to build the `flink-runtime` bundled jar manually, please just build the `iceberg` project and it will generate the jar under `<iceberg-root-dir>/flink-runtime/build/libs`. Of course, we could also download the `flink-runtime` jar from the [apache official repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/). + +```bash +# HADOOP_HOME is your hadoop root directory after unpack the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +./bin/sql-client.sh embedded -j <flink-runtime-directory>/iceberg-flink-runtime-xxx.jar shell +``` + +By default, iceberg has included hadoop jars for hadoop catalog. If we want to use hive catalog, we will need to load the hive jars when opening the flink sql client. Fortunately, apache flink has provided a [bundled hive jar](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/#using-bundled-hive-jar) for sql client. So we could open the sql client +as the following: + +```bash +# HADOOP_HOME is your hadoop root directory after unpack the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +# wget the flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar from the above bundled jar URL firstly. + +# open the SQL client. +./bin/sql-client.sh embedded \ + -j <flink-runtime-directory>/iceberg-flink-runtime-xxx.jar \ + -j <hive-bundlded-jar-directory>/flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar \ + shell +``` + +## Creating catalogs and using catalogs. + +Flink 1.11 support to create catalogs by using flink sql. + +This creates an iceberg catalog named `hive_catalog` that loads tables from a hive metastore: + +```sql +CREATE CATALOG hive_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hive', + 'uri'='thrift://localhost:9083', + 'clients'='5', + 'property-version'='1' +); Review comment: It's necessary to fix this issue https://github.com/apache/iceberg/issues/1437 before we get this pr merged, because when i create the hive_catalog, it will try to create the `default` database but failed to : ``` Flink SQL> CREATE CATALOG hive_catalog WITH ( > 'type'='iceberg', > 'catalog-type'='hive', > 'uri'='thrift://localhost:9083', > 'clients'='5', > 'property-version'='1' > ); 2020-09-24 18:24:21,111 WARN org.apache.flink.table.client.cli.CliClient [] - Could not execute SQL statement. org.apache.flink.table.client.gateway.SqlExecutionException: Could not execute statement: CREATE CATALOG hive_catalog WITH ( 'type'='iceberg', 'catalog-type'='hive', 'uri'='thrift://localhost:9083', 'clients'='5', 'property-version'='1' ) at org.apache.flink.table.client.gateway.local.LocalExecutor.executeSql(LocalExecutor.java:362) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.cli.CliClient.callDdl(CliClient.java:642) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.cli.CliClient.callDdl(CliClient.java:637) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.cli.CliClient.callCommand(CliClient.java:357) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at java.util.Optional.ifPresent(Optional.java:159) [?:1.8.0_221] at org.apache.flink.table.client.cli.CliClient.open(CliClient.java:212) [flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:142) [flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.SqlClient.start(SqlClient.java:114) [flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.SqlClient.main(SqlClient.java:201) [flink-sql-client_2.12-1.11.1.jar:1.11.1] Caused by: java.lang.IllegalArgumentException: Can not create a Path from a null string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:159) ~[hadoop-common-2.9.2.jar:?] at org.apache.hadoop.fs.Path.<init>(Path.java:175) ~[hadoop-common-2.9.2.jar:?] at org.apache.hadoop.fs.Path.<init>(Path.java:110) ~[hadoop-common-2.9.2.jar:?] at org.apache.iceberg.hive.HiveCatalog.convertToDatabase(HiveCatalog.java:459) ~[?:?] at org.apache.iceberg.hive.HiveCatalog.lambda$createNamespace$5(HiveCatalog.java:214) ~[?:?] at org.apache.iceberg.hive.ClientPool.run(ClientPool.java:54) ~[?:?] at org.apache.iceberg.hive.HiveCatalog.createNamespace(HiveCatalog.java:213) ~[?:?] at org.apache.iceberg.flink.FlinkCatalog.createDatabase(FlinkCatalog.java:195) ~[?:?] at org.apache.iceberg.flink.FlinkCatalog.open(FlinkCatalog.java:116) ~[?:?] at org.apache.flink.table.catalog.CatalogManager.registerCatalog(CatalogManager.java:191) ~[flink-table-blink_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.api.internal.TableEnvironmentImpl.createCatalog(TableEnvironmentImpl.java:1086) ~[flink-table-blink_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeOperation(TableEnvironmentImpl.java:1019) ~[flink-table-blink_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:690) ~[flink-table-blink_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$executeSql$7(LocalExecutor.java:360) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.gateway.local.ExecutionContext.wrapClassLoader(ExecutionContext.java:255) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] at org.apache.flink.table.client.gateway.local.LocalExecutor.executeSql(LocalExecutor.java:360) ~[flink-sql-client_2.12-1.11.1.jar:1.11.1] ... 8 more ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org