marton-bod commented on a change in pull request #2544:
URL: https://github.com/apache/iceberg/pull/2544#discussion_r623796870



##########
File path: site/docs/hive.md
##########
@@ -17,113 +17,286 @@
 
 # Hive
 
-## Hive read support
-Iceberg supports the reading of Iceberg tables from 
[Hive](https://hive.apache.org) by using a 
[StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers).
 Please note that only Hive 2.x versions are currently supported.
+Iceberg supports reading and writing Iceberg tables through 
[Hive](https://hive.apache.org) by using a 
[StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers).
+Here is the current compatibility matrix for Iceberg Hive support: 
 
-### Table creation
-This section explains the various steps needed in order to overlay a Hive 
table "on top of" an existing Iceberg table. Iceberg tables are created using 
either a 
[`Catalog`](./javadoc/master/index.html?org/apache/iceberg/catalog/Catalog.html)
 or an implementation of the 
[`Tables`](./javadoc/master/index.html?org/apache/iceberg/Tables.html) 
interface and Hive needs to be configured accordingly to read data from these 
different types of table.
+| Feature                  | Hive 2.x               | Hive 3.1.2             |
+| ------------------------ | ---------------------- | ---------------------- |
+| CREATE EXTERNAL TABLE    | ✔️                     | ✔️                     |
+| CREATE TABLE             | ✔️                     | ✔️                     |
+| DROP TABLE               | ✔️                     | ✔️                     |
+| SELECT                   | ✔️ (MapReduce and Tez) | ✔️ (MapReduce and Tez) |
+| INSERT INTO              | ✔️ (MapReduce only)    | ✔️ (MapReduce only)    |
 
-#### Add the Iceberg Hive Runtime jar file to the Hive classpath
-Regardless of the table type, the `HiveIcebergStorageHandler` and supporting 
classes need to be made available on Hive's classpath. These are provided by 
the `iceberg-hive-runtime` jar file. For example, if using the Hive shell, this 
can be achieved by issuing a statement like so:
-```sql
-add jar /path/to/iceberg-hive-runtime.jar;
-```
-There are many others ways to achieve this including adding the jar file to 
Hive's auxiliary classpath (so it is available by default) - please refer to 
Hive's documentation for more information.
+## Enabling Iceberg support in Hive
 
-#### Using Hadoop Tables
-Iceberg tables created using `HadoopTables` are stored entirely in a directory 
in a filesystem like HDFS.
+### Loading runtime jar
 
-##### Create an Iceberg table
-The first step is to create an Iceberg table using the Spark/Java/Python API 
and `HadoopTables`. For the purposes of this documentation we will assume that 
the table is called `table_a` and that the table location is 
`hdfs://some_path/table_a`.
+To enable Iceberg support in Hive, the `HiveIcebergStorageHandler` and 
supporting classes need to be made available on Hive's classpath. 
+These are provided by the `iceberg-hive-runtime` jar file. 
+For example, if using the Hive shell, this can be achieved by issuing a 
statement like so:
 
-##### Create a Hive table
-Now overlay a Hive table on top of this Iceberg table by issuing Hive DDL like 
so:
-```sql
-CREATE EXTERNAL TABLE table_a 
-STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
-LOCATION 'hdfs://some_bucket/some_path/table_a';
 ```
-
-#### Query the Iceberg table via Hive
-You should now be able to issue Hive SQL `SELECT` queries using the above 
table and see the results returned from the underlying Iceberg table.
-```sql
-SELECT * from table_a;
+add jar /path/to/iceberg-hive-runtime.jar;
 ```
 
-#### Using Hive Catalog
-Iceberg tables created using `HiveCatalog` are automatically registered with 
Hive.
+There are many other ways to achieve this, including adding the jar file to Hive's auxiliary classpath so it is available by default.
+Please refer to Hive's documentation for more information.
 
-##### Create an Iceberg table
-The first step is to create an Iceberg table using the Spark/Java/Python API 
and `HiveCatalog`. For the purposes of this documentation we will assume that 
the table is called `table_b` and that the table location is 
`s3://some_path/table_b`. In order for Iceberg to correctly set up the Hive 
table for querying some configuration values need to be set, the two options 
for this are described below - you can use either or the other depending on 
your use case.
+### Enabling support
+
+#### Hadoop configuration
+
+The value `iceberg.engine.hive.enabled` needs to be set to `true` in the Hadoop configuration of the runtime environment.
+For example, it can be set in the relevant `hive-site.xml` on the classpath of the application creating or modifying (altering, inserting etc.) the table.
+You can also do it programmatically like so:
 
-##### Hive Configuration
-The value `iceberg.engine.hive.enabled` needs to be set to `true` and added to 
the Hive configuration file on the classpath of the application creating or 
modifying (altering, inserting etc.) the table. This can be done by modifying 
the relevant `hive-site.xml`. Alternatively this can be done programmatically 
like so:
 ```java
 Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();
-hadoopConfiguration.set(ConfigProperties.ENGINE_HIVE_ENABLED, "true"); 
//iceberg.engine.hive.enabled=true
+hadoopConfiguration.set(ConfigProperties.ENGINE_HIVE_ENABLED, "true"); // iceberg.engine.hive.enabled=true
 HiveCatalog catalog = new HiveCatalog(hadoopConfiguration);
 ...
 catalog.createTable(tableId, schema, spec);
 ```
 
-##### Table Property Configuration
-The property `engine.hive.enabled` needs to be set to `true` and added to the 
table properties when creating the Iceberg table. This can be done like so:
+#### Table property configuration
+
+Alternatively, the property `engine.hive.enabled` can be set to `true` and 
added to the table properties when creating the Iceberg table. 
+Here is an example of doing it programmatically:
+
 ```java
-    Map<String, String> tableProperties = new HashMap<String, String>();
-    tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); 
//engine.hive.enabled=true
-    catalog.createTable(tableId, schema, spec, tableProperties);
+Catalog catalog = ...;
+Map<String, String> tableProperties = Maps.newHashMap();
+tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); // engine.hive.enabled=true
+catalog.createTable(tableId, schema, spec, tableProperties);
 ```
 
-#### Query the Iceberg table via Hive
-In order to query a Hive table created by either of the HiveCatalog methods 
described above you need to first set a Hive configuration value like so:
-```sql
-SET iceberg.mr.catalog=hive;
+The table-level configuration overrides the global Hadoop configuration.
+
+## Iceberg and Hive catalog compatibility
+
+### Global Hive catalog
+
+From the Hive engine's perspective, there is only one global data catalog, which is the Hive metastore defined in the Hadoop configuration of the runtime environment.
+In contrast, Iceberg supports multiple different data catalog types, such as Hive, Hadoop, and AWS Glue, and also allows any custom catalog implementation.
+Users might want to read tables in another catalog through the Hive engine, or perform cross-catalog operations such as joins.
+
+Iceberg handles this issue in the following way:
+
+1. All tables created by Iceberg's `HiveCatalog` with the Hive engine feature enabled are automatically visible to the Hive engine.
+2. For Iceberg tables created in other catalogs, the catalog information is registered through the Hadoop configuration.
+A Hive external table overlay needs to be created in the Hive metastore, 
+and the actual catalog name is recorded as part of the overlay table properties. 
+See the [CREATE EXTERNAL TABLE](#create-external-table) section for more details.
+
+### Custom Iceberg catalogs
+
+To globally register different catalogs, set the following Hadoop 
configurations:
+
+| Config key                                    | Description                                                  |
+| --------------------------------------------- | ------------------------------------------------------------ |
+| iceberg.catalog.<catalog_name\>.type          | type of catalog: `hive`, `hadoop` or `custom`                |
+| iceberg.catalog.<catalog_name\>.catalog-impl  | catalog implementation, must not be null if type is `custom` |
+| iceberg.catalog.<catalog_name\>.<key\>        | any config key and value pairs for the catalog               |
+
+Here are some examples using Hive CLI:
+
+Register a `HiveCatalog` called `another_hive`:
+
+```
+SET iceberg.catalog.another_hive.type=hive;
+SET iceberg.catalog.another_hive.uri=thrift://example.com:9083;
+SET iceberg.catalog.another_hive.clients=10;
+SET iceberg.catalog.another_hive.warehouse=hdfs://example.com:8020/warehouse;
+```
+
+Register a `HadoopCatalog` called `hadoop`:
+
+```
+SET iceberg.catalog.hadoop.type=hadoop;
+SET iceberg.catalog.hadoop.warehouse=hdfs://example.com:8020/warehouse;
+```
+
+Register an AWS `GlueCatalog` called `glue`:
+
+```
+SET iceberg.catalog.glue.type=custom;
+SET iceberg.catalog.glue.catalog-impl=org.apache.iceberg.aws.GlueCatalog;
+SET iceberg.catalog.glue.warehouse=s3://my-bucket/my/key/prefix;
+SET iceberg.catalog.glue.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager;
+SET iceberg.catalog.glue.lock.table=myGlueLockTable;
 ```
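+
+Once registered, a catalog can be referenced by name through the `iceberg.catalog` table property when creating overlays. As a hedged sketch (the database and table names here are placeholders), an overlay for a table in the `glue` catalog registered above might look like:
+
+```sql
+CREATE EXTERNAL TABLE database_a.table_a
+STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+TBLPROPERTIES ('iceberg.catalog'='glue');
+```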
-You should now be able to issue Hive SQL `SELECT` queries using the above 
table and see the results returned from the underlying Iceberg table.
+
+## DDL Commands
+
+### CREATE EXTERNAL TABLE
+
+The `CREATE EXTERNAL TABLE` command is used to overlay a Hive table "on top 
of" an existing Iceberg table. 
+Iceberg tables are created using either a 
[`Catalog`](./javadoc/master/index.html?org/apache/iceberg/catalog/Catalog.html),
+or an implementation of the 
[`Tables`](./javadoc/master/index.html?org/apache/iceberg/Tables.html) 
interface,
+and Hive needs to be configured accordingly to operate on these different types of tables.
+
+#### Hive catalog tables
+
+As described before, tables created by the `HiveCatalog` with the Hive engine feature enabled are directly visible to the Hive engine, so there is no need to create an overlay.
+
+#### Custom catalog tables
+
+For a table in a registered catalog, specify the catalog name in the statement using the table property `iceberg.catalog`.
+For example, the SQL below creates an overlay for a table in a `hadoop` type 
catalog named `hadoop_cat`:
+
 ```sql
-SELECT * from table_b;
+SET iceberg.catalog.hadoop_cat.type=hadoop;
+SET iceberg.catalog.hadoop_cat.warehouse=hdfs://example.com:8020/hadoop_cat;
+
+CREATE EXTERNAL TABLE database_a.table_a
+STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+TBLPROPERTIES ('iceberg.catalog'='hadoop_cat');
 ```
 
-#### Using Hadoop Catalog
-Iceberg tables created using `HadoopCatalog` are stored entirely in a 
directory in a filesystem like HDFS.
+When `iceberg.catalog` is missing from both the table properties and the global Hadoop configuration, `HiveCatalog` is used by default.
+
+#### Path-based Hadoop tables
 
-##### Create an Iceberg table
-The first step is to create an Iceberg table using the Spark/Java/Python API 
and `HadoopCatalog`. For the purposes of this documentation we will assume that 
the fully qualified table identifier is `database_a.table_c` and that the 
Hadoop Catalog warehouse location is 
`hdfs://some_bucket/path_to_hadoop_warehouse`. Iceberg will therefore create 
the table at the location 
`hdfs://some_bucket/path_to_hadoop_warehouse/database_a/table_c`.
+Iceberg tables created using `HadoopTables` are stored entirely in a directory 
in a filesystem like HDFS.
+These tables are considered to have no catalog. 
+To indicate that, set the `iceberg.catalog` property to `location_based_table`. For example:
 
-##### Create a Hive table
-Now overlay a Hive table on top of this Iceberg table by issuing Hive DDL like 
so:
 ```sql
-CREATE EXTERNAL TABLE database_a.table_c 
+CREATE EXTERNAL TABLE table_a 
 STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' 
-LOCATION 'hdfs://some_bucket/path_to_hadoop_warehouse/database_a/table_c'
-TBLPROPERTIES (
-  'iceberg.mr.catalog'='hadoop', 
-  
'iceberg.mr.catalog.hadoop.warehouse.location'='hdfs://some_bucket/path_to_hadoop_warehouse')
-;
+LOCATION 'hdfs://some_bucket/some_path/table_a'
+TBLPROPERTIES ('iceberg.catalog'='location_based_table');
 ```
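+
+For reference, a path-based table like the one above could have been created through the Iceberg API using `HadoopTables`. Below is a hedged Java sketch; the schema and the HDFS path are illustrative assumptions:
+
+```java
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.hadoop.HadoopTables;
+import org.apache.iceberg.types.Types;
+
+// placeholder schema and location for illustration
+Schema schema = new Schema(
+    Types.NestedField.required(1, "id", Types.LongType.get()),
+    Types.NestedField.optional(2, "name", Types.StringType.get()));
+
+HadoopTables tables = new HadoopTables(new Configuration());
+Table table = tables.create(schema, PartitionSpec.unpartitioned(),
+    "hdfs://some_bucket/some_path/table_a");
+```
+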
-Note that the Hive database and table name *must* match the values used in the 
Iceberg `TableIdentifier` when the table was created. 
 
-It is possible to omit either or both of the table properties but instead you 
will then need to set these when reading from the table. Generally it is 
recommended to set them at table creation time so you can query tables created 
by different catalogs. 
+### CREATE TABLE
+
+Hive also supports directly creating a new Iceberg table through the `CREATE TABLE` statement. For example:
+
+```sql
+CREATE TABLE database_a.table_a (
+  id bigint, name string
+) PARTITIONED BY (
+  dept string
+) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
+```
+
+!!! Note
+    This creates an unpartitioned HMS table, while the underlying Iceberg 
table is partitioned.
+
+!!! Note
+    Due to the limitations of the Hive `PARTITIONED BY` syntax, you can currently only partition by columns, 
+    which is translated to the Iceberg identity partition transform.
+    You cannot partition by other Iceberg partition transforms such as `days(timestamp)`.
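+
+If other partition transforms are needed, the table can instead be created through the Iceberg API and then used from Hive as described above. Below is a hedged Java sketch of building such a partition spec; the schema and field names are illustrative assumptions:
+
+```java
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.types.Types;
+
+// placeholder schema for illustration
+Schema schema = new Schema(
+    Types.NestedField.required(1, "id", Types.LongType.get()),
+    Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()));
+
+// day() applies the Iceberg "days" transform, which Hive DDL cannot express
+PartitionSpec spec = PartitionSpec.builderFor(schema)
+    .day("event_time")
+    .build();
+```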
+
+The following Hive types map directly to Iceberg types:
+
+- boolean
+- float
+- double
+- integer
+- long
+- decimal
+- string
+- binary
+- date
+- timestamp (maps to Iceberg timestamp without timezone)
+- timestamplocaltz (Hive 3 only, maps to Iceberg timestamp with timezone)
+- struct
+- map
+- list
+
+The following Hive types are not supported by Iceberg:
+
+- interval_year_month
+- interval_day_time
+- union
+
+The following Hive types do not map directly to Iceberg types, but can be auto-converted:
+
+| Hive type  | Iceberg type |
+| ---------- | ------------ |
+| byte       | integer      |
+| short      | integer      |
+| char       | string       |
+| varchar    | string       |
+
+You can enable this feature through Hadoop configuration (not enabled by default):
+
+| Config key                        | Default | Description                                         |
+| --------------------------------- | ------- | --------------------------------------------------- |
+| iceberg.mr.schema.auto.conversion | false   | if CREATE TABLE should perform type auto-conversion |
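+
+As a hedged sketch (the table and column names are placeholders), auto-conversion could be enabled and used like so:
+
+```sql
+SET iceberg.mr.schema.auto.conversion=true;
+
+CREATE TABLE database_a.table_b (
+  code tinyint,      -- auto-converted to Iceberg integer
+  label varchar(64)  -- auto-converted to Iceberg string
+) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
+```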
+
+### DROP TABLE
+
+Tables can be dropped using the `DROP TABLE` command:
 
-#### Query the Iceberg table via Hive
-You should now be able to issue Hive SQL `SELECT` queries using the above 
table and see the results returned from the underlying Iceberg table.
 ```sql
-SELECT * from database_a.table_c;
+DROP TABLE [IF EXISTS] table_name [PURGE];
 ```
 
-### Features
+You can configure global purge behavior through Hadoop configuration:
+
+| Config key           | Default | Description                                                     |
+| -------------------- | ------- | --------------------------------------------------------------- |
+| external.table.purge | true    | if all data and metadata should be purged in a table by default |
+
+Each table's default purge behavior can be further configured through table 
properties:
 
-#### Predicate pushdown
-Pushdown of the Hive SQL `WHERE` clause has been implemented so that these 
filters are used at the Iceberg TableScan level as well as by the Parquet and 
ORC Readers.
+| Property   | Default | Description                                                       |
+| ---------- | ------- | ----------------------------------------------------------------- |
+| gc.enabled | true    | if all data and metadata should be purged in the table by default |
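+
+As a hedged example with placeholder identifiers, purging could be disabled for a single table at creation time:
+
+```sql
+CREATE TABLE database_a.table_c (
+  id bigint
+) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
+TBLPROPERTIES ('external.table.purge'='false');
+
+-- DROP TABLE now removes the table definition but keeps the data and metadata files
+DROP TABLE database_a.table_c;
+```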

Review comment:
       Yep, it's merged, so now they should change in tandem in these two 
scenarios:
   
   - Changing `gc.enabled` on the Iceberg table via `UpdateProperties` -> 
`external.table.purge` updated on HMS table accordingly
   - Setting `external.table.purge` as a table prop during Hive CREATE TABLE -> 
`gc.enabled` is pushed down accordingly to the Iceberg properties as well
   
    What's missing is the Hive `ALTER TABLE SET TBLPROPERTIES` scenario. In this case, changing `external.table.purge` via this command does not update `gc.enabled` on the Iceberg table. This is a limitation of Hive 3.1.2, because the `HiveMetaHook` doesn't have all the hooks for alter tables (which we have recently merged into Hive master).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
