Hi,
My team has been using the custom catalog along with atomic metadata
updates, but we never migrated existing Iceberg tables onto it. We also
haven't turned on integration with the Hive catalog, so I'm not sure how
easy it is to plug in there (I think there was some recent work on
that?). DynamoDB provides a local mock, which you could combine with
s3mock (check the Iceberg tests) to try it out:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
The only odd things we've run into with DynamoDB are:
1. It seems like we get rate limited by DynamoDB pretty hard when first
writing to a new table, until the limits are adjusted (potentially AWS
dynamically adjusting DynamoDB's internal partitions?).
2. Make sure to page scans if you have a lot of values when doing lists
(we haven't enabled catalog listing yet, but we've run into this before).
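On point 2: a DynamoDB Scan response is capped at 1 MB, and when more data
remains it carries a LastEvaluatedKey that has to be fed back as
ExclusiveStartKey until it's absent. A minimal sketch of that loop in Python
(the client here is any object with a boto3-style scan method; the fake below
just stands in for DynamoDB so the loop itself is visible):

```python
def scan_all(client, table_name):
    """Collect every item from a DynamoDB table, following pagination.

    DynamoDB caps each Scan response at 1 MB; while more data remains,
    the response carries a LastEvaluatedKey that must be passed back
    as ExclusiveStartKey on the next call.
    """
    items = []
    kwargs = {"TableName": table_name}
    while True:
        resp = client.scan(**kwargs)
        items.extend(resp.get("Items", []))
        last_key = resp.get("LastEvaluatedKey")
        if last_key is None:
            return items
        kwargs["ExclusiveStartKey"] = last_key


class FakePagedClient:
    """Stand-in for a boto3 DynamoDB client that serves fixed pages."""

    def __init__(self, pages):
        self.pages = pages

    def scan(self, TableName, ExclusiveStartKey=None):
        start = ExclusiveStartKey or 0
        resp = {"Items": self.pages[start]}
        if start + 1 < len(self.pages):
            resp["LastEvaluatedKey"] = start + 1
        return resp


client = FakePagedClient([[{"t": "a"}], [{"t": "b"}], [{"t": "c"}]])
print(scan_all(client, "iceberg_catalog"))  # items from all three pages
```

Forgetting the loop silently truncates listings once the catalog table
crosses the 1 MB page boundary, which is exactly the failure mode above.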
We chose DynamoDB because we were already using it for other use cases.
I'm not sure if it's the best AWS-provided option for atomic changes.
John
On 11/19/20 10:07 AM, Marko Babic wrote:
Hi everyone,
At my org we’ve spun up a few Iceberg tables on top of S3 without a
metastore (conscious of the consequences) and we’ve arrived at the
point that we need to support concurrent writes. :) I was hoping to
get some advice as to what the best way to integrate an existing
Iceberg table into a Hive Metastore or an alternative might be. We’re
still relatively early in our adoption of Iceberg and have no real
prior experience with Hive so I don’t know what I don’t know.
Some options we’re weighing:
- Existing tables aren’t so big that the moral equivalent of "CREATE
TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the
question, but we’d prefer to not have to read + rewrite everything. We
also have stateful readers (tracking which snapshots they have
previously read) and preserving table history would make life easier.
- Doing something along the lines of the following and importing the
tables into Hive as external tables looks like it should work given my
understanding of how Iceberg is using HMS, but I don’t know if it’s
encouraged and I haven’t done diligence to understand potential
consequences:
```
hive> CREATE EXTERNAL TABLE `existing_table` (...)
LOCATION
  's3://existing-table/'
-- serde, input/output formats omitted
TBLPROPERTIES (
  -- Assuming the latest metadata file for the Hadoop table is
  -- v99.metadata.json, rename it to 00099-uuid.metadata.json so that
  -- BaseMetastoreTableOperations can correctly parse the version number.
  'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
  'table_type'='ICEBERG'
)
```
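For what it's worth, the rename in the comment matters because the
metastore-style table operations derive the version from the zero-padded
integer at the start of the metadata file name (everything before the first
"-"). A rough illustration of that naming convention, in Python rather than
Iceberg's Java (parse_version is my name for the helper, not Iceberg's):

```python
import os
import re


def parse_version(metadata_location):
    """Extract the version from a metastore-style metadata file name.

    Metastore-tracked Iceberg metadata files are named like
    00099-<uuid>.metadata.json; the version is the leading integer.
    Returns -1 when the name doesn't match, treating Hadoop-style
    names like v99.metadata.json as having no parseable version.
    """
    file_name = os.path.basename(metadata_location)
    match = re.match(r"(\d+)-", file_name)
    return int(match.group(1)) if match else -1


print(parse_version("s3://existing-table/metadata/00099-uuid.metadata.json"))  # 99
print(parse_version("s3://existing-table/metadata/v99.metadata.json"))  # -1
```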
- Others seem to have had success implementing + maintaining a
custom catalog (https://iceberg.apache.org/custom-catalog/) backed by
e.g. DynamoDB
for atomic metadata updates, which could appeal to us. Seems like
migration in this case consists of implementing the catalog and
plopping the latest metadata into the backing store. Are custom
catalogs more of an escape hatch when HMS can’t be used, or would that
maybe be a reasonable way forward if we find we don’t want to maintain
+ operate on top of HMS?
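The atomicity a DynamoDB-backed catalog needs boils down to a
compare-and-swap on the metadata pointer: a commit succeeds only if the
current pointer still equals the one the writer read before producing its new
metadata file. In DynamoDB that would be a PutItem/UpdateItem with a
ConditionExpression; the sketch below shows just the protocol against an
in-memory dict (all names are mine for illustration, not from any Iceberg
catalog implementation):

```python
import threading


class InMemoryCatalog:
    """Toy catalog mapping table name -> current metadata file location.

    Stands in for a DynamoDB table; there, the same check would be a
    conditional write on the metadata_location attribute.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._pointers = {}

    def commit(self, table, expected_location, new_location):
        """Atomically swap the metadata pointer.

        Fails if another writer committed after we read
        expected_location, so the loser can refresh and retry.
        """
        with self._lock:
            current = self._pointers.get(table)
            if current != expected_location:
                raise RuntimeError(
                    f"commit conflict: expected {expected_location}, found {current}"
                )
            self._pointers[table] = new_location


catalog = InMemoryCatalog()
catalog.commit("db.events", None, "s3://t/metadata/00001-a.metadata.json")
catalog.commit("db.events", "s3://t/metadata/00001-a.metadata.json",
               "s3://t/metadata/00002-b.metadata.json")
```

A third commit that still expects 00001-a would raise here, which is the
conflict a concurrent writer is supposed to see before retrying.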
Apologies if this was discussed or documented somewhere else and I’ve
missed it.
Thanks!
Marko