This is an automated email from the ASF dual-hosted git repository.
fokko pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/main by this push:
new 51061511ed Docs: Update features for Hive 4.0 (#10162)
51061511ed is described below
commit 51061511edd953cc0456b7e2b8e8c7dff7567795
Author: Sourabh Badhya <[email protected]>
AuthorDate: Fri May 3 15:24:24 2024 +0530
Docs: Update features for Hive 4.0 (#10162)
Co-authored-by: Sourabh Badhya <[email protected]>
---
docs/docs/hive.md | 297 ++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 276 insertions(+), 21 deletions(-)
diff --git a/docs/docs/hive.md b/docs/docs/hive.md
index a8df154405..69c0f62795 100644
--- a/docs/docs/hive.md
+++ b/docs/docs/hive.md
@@ -24,6 +24,21 @@ Iceberg supports reading and writing Iceberg tables through
[Hive](https://hive.
a
[StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers).
## Feature support
+The following features matrix illustrates the support for different features
across Hive releases for Iceberg tables -
+
+| Feature support | Hive 2 / 3
| Hive 4 |
+|-----------------------------------------------------------------|------------|--------|
+| [SQL create table](#create-table) | ✔️
| ✔️ |
+| [SQL create table as select (CTAS)](#create-table-as-select) | ✔️
| ✔️ |
+| [SQL create table like table (CTLT)](#create-table-like-table) | ✔️
| ✔️ |
+| [SQL drop table](#drop-table) | ✔️
| ✔️ |
+| [SQL insert into](#insert-into) | ✔️
| ✔️ |
+| [SQL insert overwrite](#insert-overwrite) | ✔️
| ✔️ |
+| [SQL delete from](#delete-from) |
| ✔️ |
+| [SQL update](#update) |
| ✔️ |
+| [SQL merge into](#merge-into) |
| ✔️ |
+| [Branches and tags](#branches-and-tags) |
| ✔️ |
+
Iceberg compatibility with Hive 2.x and Hive 3.1.2/3 supports the following
features:
* Creating a table
@@ -34,16 +49,7 @@ Iceberg compatibility with Hive 2.x and Hive 3.1.2/3
supports the following feat
!!! warning
DML operations work only with MapReduce execution engine.
-The HiveCatalog supports the following additional features with Hive version
4.0.0-alpha-2 and above:
-
-* Altering a table with expiring snapshots.
-* Create a table like an existing table (CTLT table)
-* Support adding parquet compression type via Table properties [Compression
types](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration)
-* Altering a table metadata location
-* Supporting table rollback
-* Honors sort orders on existing tables when writing a table [Sort orders
specification](../../spec.md#sort-orders)
-
-With Hive version 4.0.0-alpha-1 and above, the Iceberg integration when using
HiveCatalog supports the following additional features:
+Hive supports the following additional features with Hive version 4.0.0 and
above:
* Creating an Iceberg identity-partitioned table
* Creating an Iceberg table with any partition spec, including the various
transforms supported by Iceberg
@@ -51,13 +57,31 @@ With Hive version 4.0.0-alpha-1 and above, the Iceberg
integration when using Hi
* Altering a table while keeping Iceberg and Hive schemas in sync
* Altering the partition schema (updating columns)
* Altering the partition schema by specifying partition transforms
-* Truncating a table
+* Truncating a table / partition, dropping a partition.
* Migrating tables in Avro, Parquet, or ORC (Non-ACID) format to Iceberg
-* Reading the schema of a table
-* Querying Iceberg metadata tables
-* Time travel applications
-* Inserting into a table (INSERT INTO)
-* Inserting data overwriting existing data (INSERT OVERWRITE)
+* Reading the schema of a table.
+* Querying Iceberg metadata tables.
+* Time travel applications.
+* Inserting into a table / partition (INSERT INTO).
+* Inserting data overwriting existing data (INSERT OVERWRITE) in a table /
partition.
+* Copy-on-write support for delete, update and merge queries, CRUD support for
Iceberg V1 tables.
+* Altering a table with expiring snapshots.
+* Create a table like an existing table (CTLT table)
+* Support adding parquet compression type via Table properties [Compression
types](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration)
+* Altering a table metadata location.
+* Supporting table rollback.
+* Honors sort orders on existing tables when writing a table [Sort orders
specification](../../spec.md#sort-orders)
+* Creating, writing to and dropping an Iceberg branch / tag.
+* Allowing expire snapshots by Snapshot ID, by time range, by retention of
last N snapshots and using table properties.
+* Set current snapshot using snapshot ID for an Iceberg table.
+* Support for renaming an Iceberg table.
+* Altering a table to convert to an Iceberg table.
+* Fast forwarding, cherry-picking commit to an Iceberg branch.
+* Creating a branch from an Iceberg tag.
+* Set current snapshot using branch/tag for an Iceberg table.
+* Delete orphan files for an Iceberg table.
+* Allow full table compaction of Iceberg tables.
+* Support of showing partition information for Iceberg tables (SHOW
PARTITIONS).
!!! warning
DML operations work only with Tez execution engine.
@@ -67,6 +91,10 @@ With Hive version 4.0.0-alpha-1 and above, the Iceberg
integration when using Hi
Hive 4 comes with `hive-iceberg` that ships Iceberg, so no additional
downloads or jars are needed. For older versions of Hive a runtime jar has to
be added.
+### Hive 4.0.0
+
+Hive 4.0.0 comes with the Iceberg 1.4.3 included.
+
### Hive 4.0.0-beta-1
Hive 4.0.0-beta-1 comes with the Iceberg 1.3.0 included.
@@ -388,6 +416,11 @@ ALTER TABLE t SET TBLPROPERTIES('...'='...');
The Hive table schema is kept in sync with the Iceberg table. If an outside
source (Impala/Spark/Java API/etc)
changes the schema, the Hive table immediately reflects the changes. You alter
the table schema using Hive commands:
+* Rename a table
+```sql
+ALTER TABLE orders RENAME TO renamed_orders;
+```
+
* Add a column
```sql
ALTER TABLE orders ADD COLUMNS (nickname string);
@@ -431,12 +464,121 @@ ALTER TABLE t SET TBLPROPERTIES
('storage_handler'='org.apache.iceberg.mr.hive.H
During the migration the data files are not changed, only the appropriate
Iceberg metadata files are created.
After the migration, handle the table as a normal Iceberg table.
+#### Drop partitions
+You can drop partitions based on a single / multiple partition specification
using the following commands:
+```sql
+ALTER TABLE orders DROP PARTITION (buy_date == '2023-01-01', market_price >
1000), PARTITION (buy_date == '2024-01-01', market_price <= 2000);
+```
+The partition specification supports only identity-partition columns.
Transform columns in partition specification are not supported.
+
+#### Branches and tags
+
+`ALTER TABLE ... CREATE BRANCH`
+
+Branches can be created via the CREATE BRANCH statement with the following
options:
+
+* Create a branch using default properties.
+* Create a branch at a specific snapshot ID.
+* Create a branch using system time.
+* Create a branch with a specified number of snapshot retentions.
+* Create a branch using specific tag.
+
+```sql
+-- CREATE branch1 with default properties.
+ALTER TABLE test CREATE BRANCH branch1;
+
+-- CREATE branch1 at a specific snapshot ID.
+ALTER TABLE test CREATE BRANCH branch1 FOR SYSTEM_VERSION AS OF
3369973735913135680;
+
+-- CREATE branch1 using system time.
+ALTER TABLE test CREATE BRANCH branch1 FOR SYSTEM_TIME AS OF '2023-09-16
09:46:38.939 Etc/UTC';
+
+-- CREATE branch1 with a specified number of snapshot retentions.
+ALTER TABLE test CREATE BRANCH branch1 FOR SYSTEM_VERSION AS OF
3369973735913135680 WITH SNAPSHOT RETENTION 5 SNAPSHOTS;
+
+-- CREATE branch1 using a specific tag.
+ALTER TABLE test CREATE BRANCH branch1 FOR TAG AS OF tag1;
+```
+
+`ALTER TABLE ... CREATE TAG`
+
+Tags can be created via the CREATE TAG statement with the following options:
+
+* Create a tag using default properties.
+* Create a tag at a specific snapshot ID.
+* Create a tag using system time.
+
+```sql
+-- CREATE tag1 with default properties.
+ALTER TABLE test CREATE TAG tag1;
+
+-- CREATE tag1 at a specific snapshot ID.
+ALTER TABLE test CREATE TAG tag1 FOR SYSTEM_VERSION AS OF 3369973735913135680;
+
+-- CREATE tag1 using system time.
+ALTER TABLE test CREATE TAG tag1 FOR SYSTEM_TIME AS OF '2023-09-16
09:46:38.939 Etc/UTC';
+```
+
+`ALTER TABLE ... DROP BRANCH`
+
+Branches can be dropped via the DROP BRANCH statement with the following
options:
+
+* Do not fail if the branch does not exist with IF EXISTS
+
+```sql
+-- DROP branch1
+ALTER TABLE test DROP BRANCH branch1;
+
+-- DROP branch1 IF EXISTS
+ALTER TABLE test DROP BRANCH IF EXISTS branch1;
+```
+
+`ALTER TABLE ... DROP TAG`
+
+Tags can be dropped via the DROP TAG statement with the following options:
+
+* Do not fail if the tag does not exist with IF EXISTS
+
+```sql
+-- DROP tag1
+ALTER TABLE test DROP TAG tag1;
+
+-- DROP tag1 IF EXISTS
+ALTER TABLE test DROP TAG IF EXISTS tag1;
+```
+
+`ALTER TABLE ... EXECUTE FAST-FORWARD`
+
+An iceberg branch which is an ancestor of another branch can be fast-forwarded
to the state of the other branch.
+
+```sql
+-- This fast-forwards the branch1 to the state of main branch of the Iceberg
table.
+ALTER table test EXECUTE FAST-FORWARD 'branch1' 'main';
+
+-- This fast-forwards the branch1 to the state of branch2.
+ALTER table test EXECUTE FAST-FORWARD 'branch1' 'branch2';
+```
+
+#### `ALTER TABLE ... EXECUTE CHERRY-PICK`
+
+Cherry-pick of a snapshot requires the ID of the snapshot. Cherry-pick of
snapshots as of now is supported only on the main branch of an Iceberg table.
+
+```sql
+ ALTER table test EXECUTE CHERRY-PICK 8602659039622823857;
+```
+
### TRUNCATE TABLE
The following command truncates the Iceberg table:
```sql
TRUNCATE TABLE t;
```
-Using a partition specification is not allowed.
+
+#### TRUNCATE TABLE ... PARTITION
+The following command truncates the partition in an Iceberg table:
+```sql
+TRUNCATE TABLE orders PARTITION (customer_id = 1, first_name = 'John');
+```
+The partition specification supports only identity-partition columns.
Transform columns in partition specification are not supported.
### DROP TABLE
@@ -476,6 +618,12 @@ Here are the features highlights for Iceberg Hive read
support:
Some of the advanced / little used optimizations are not yet implemented for
Iceberg tables, so you should check your individual queries.
Also currently the statistics stored in the MetaStore are used for query
planning. This is something we are planning to improve in the future.
+Hive 4 supports select operations on branches which also work similar to the
table level select operations. However, the branch must be provided as follows
-
+```sql
+-- Branches should be specified as
<database_name>.<table_name>.branch_<branch_name>
+SELECT * FROM default.test.branch_branch1;
+```
+
### INSERT INTO
Hive supports the standard single-table INSERT INTO operation:
@@ -491,6 +639,15 @@ Multi-table insert is also supported, but it will not be
atomic. Commits occur o
Partial changes will be visible during the commit process and failures can
leave partial changes committed.
Changes within a single table will remain atomic.
+Insert-into operations on branches also work similar to the table level select
operations. However, the branch must be provided as follows -
+```sql
+-- Branches should be specified as
<database_name>.<table_name>.branch_<branch_name>
+INSERT INTO default.test.branch_branch1
+VALUES ('a', 1);
+INSERT INTO default.test.branch_branch1
+SELECT...;
+```
+
Here is an example of inserting into multiple tables at once in Hive SQL:
```sql
@@ -499,6 +656,18 @@ FROM customers
INSERT INTO target2 SELECT last_name, customer_id;
```
+#### INSERT INTO ... PARTITION
+
+Hive 4 supports partition-level INSERT INTO operation:
+
+```sql
+INSERT INTO table_a PARTITION (customer_id = 1, first_name = 'John')
+VALUES (1,2);
+INSERT INTO table_a PARTITION (customer_id = 1, first_name = 'John')
+SELECT...;
+```
+The partition specification supports only identity-partition columns.
Transform columns in partition specification are not supported.
+
### INSERT OVERWRITE
INSERT OVERWRITE can replace data in the table with the result of a query.
Overwrites are atomic operations for Iceberg tables.
For nonpartitioned tables the content of the table is always removed. For
partitioned tables the partitions
@@ -507,6 +676,69 @@ that have rows produced by the SELECT query will be
replaced.
INSERT OVERWRITE TABLE target SELECT * FROM source;
```
+#### INSERT OVERWRITE ... PARTITION
+
+Hive 4 supports partition-level INSERT OVERWRITE operation:
+
+```sql
+INSERT OVERWRITE TABLE target PARTITION (customer_id = 1, first_name = 'John')
SELECT * FROM source;
+```
+The partition specification supports only identity-partition columns.
Transform columns in partition specification are not supported.
+
+### DELETE FROM
+
+Hive 4 supports DELETE FROM queries to remove data from tables.
+
+Delete queries accept a filter to match rows to delete.
+
+```sql
+DELETE FROM target WHERE id > 1 AND id < 10;
+
+DELETE FROM target WHERE id IN (SELECT id FROM source);
+
+DELETE FROM target WHERE id IN (SELECT min(customer_id) FROM source);
+```
+If the delete filter matches entire partitions of the table, Iceberg will
perform a metadata-only delete. If the filter matches individual rows of a
table, then Iceberg will rewrite only the affected data files.
+
+### UPDATE
+
+Hive 4 supports UPDATE queries which accept a filter to match rows to update.
+
+```sql
+UPDATE target SET first_name = 'Raj' WHERE id > 1 AND id < 10;
+
+UPDATE target SET first_name = 'Raj' WHERE id IN (SELECT id FROM source);
+
+UPDATE target SET first_name = 'Raj' WHERE id IN (SELECT min(customer_id) FROM
source);
+```
+For more complex row-level updates based on incoming data, see the section on
MERGE INTO.
+
+### MERGE INTO
+
+Hive 4 added support for MERGE INTO queries that can express row-level updates.
+
+MERGE INTO updates a table, called the target table, using a set of updates
from another query, called the source. The update for a row in the target table
is found using the ON clause that is like a join condition.
+
+```sql
+MERGE INTO target AS t -- a target table
+USING source s -- the source updates
+ON t.id = s.id -- condition to find updates for target rows
+WHEN ... -- updates
+```
+
+Updates to rows in the target table are listed using WHEN MATCHED ... THEN
.... Multiple MATCHED clauses can be added with conditions that determine when
each match should be applied. The first matching expression is used.
+```sql
+WHEN MATCHED AND s.op = 'delete' THEN DELETE
+WHEN MATCHED AND t.count IS NULL AND s.op = 'increment' THEN UPDATE SET
t.count = 0
+WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1
+```
+
+Source rows (updates) that do not match can be inserted:
+```sql
+WHEN NOT MATCHED THEN INSERT VALUES (s.a, s.b, s.c)
+```
+Only one record in the source data can update any given row of the target
table, or else an error will be thrown.
+
### QUERYING METADATA TABLES
Hive supports querying of the Iceberg Metadata tables. The tables could be
used as normal
Hive tables, so it is possible to use projections / joins / filters / etc.
@@ -514,11 +746,20 @@ To reference a metadata table the full name of the table
should be used, like:
<DB_NAME>.<TABLE_NAME>.<METADATA_TABLE_NAME>.
Currently the following metadata tables are available in Hive:
-* files
-* entries
+
+* all_data_files
+* all_delete_files
+* all_entries all_files
+* all_manifests
+* data_files
+* delete_files
+* entries
+* files
+* manifests
+* metadata_log_entries
+* partitions
+* refs
* snapshots
-* manifests
-* partitions
```sql
SELECT * FROM default.table_a.files;
@@ -594,3 +835,17 @@ Rollback to a specific snapshot ID
```sql
ALTER TABLE ice_t EXECUTE ROLLBACK(1111);
```
+
+### Compaction
+
+Hive 4 supports full table compaction of Iceberg tables using the following
commands:
+* Using the `ALTER TABLE ... COMPACT` syntax
+* Using the `OPTIMIZE TABLE ... REWRITE DATA` syntax
+```sql
+-- Using the ALTER TABLE ... COMPACT syntax
+ALTER TABLE t COMPACT 'major';
+
+-- Using the OPTIMIZE TABLE ... REWRITE DATA syntax
+OPTIMIZE TABLE t REWRITE DATA;
+```
+Both these syntax have the same effect of performing full table compaction on
an Iceberg table.