Hi,

My team has been using a custom catalog with atomic metadata updates, but we never migrated existing Iceberg tables onto it. We also haven't turned on integration with the Hive catalog, so I'm not sure how easy it is to plug in there (I think there was some recent work on that?). DynamoDB provides a local mock which you could combine with S3Mock (check the Iceberg tests) to try it out: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.html
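
If it helps, here's a rough sketch of pointing the AWS SDK v2 client at DynamoDB Local in a test; the endpoint, region, and credentials are just placeholders, not anything from our setup:

```
import java.net.URI;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

public class DynamoLocalSmokeTest {
  public static void main(String[] args) {
    // DynamoDB Local listens on localhost:8000 by default; credentials can be anything.
    DynamoDbClient dynamo = DynamoDbClient.builder()
        .endpointOverride(URI.create("http://localhost:8000"))
        .region(Region.US_EAST_1)
        .credentialsProvider(
            StaticCredentialsProvider.create(AwsBasicCredentials.create("fake", "fake")))
        .build();

    // Sanity check that the local endpoint is reachable before running catalog tests.
    System.out.println("Tables: " + dynamo.listTables().tableNames());
  }
}
```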

The only weird things we've run into with Dynamo are:
1. We seem to get rate limited pretty hard when first writing to a new table, until the limits get adjusted (potentially AWS dynamically splitting Dynamo's internal partitions?).
2. Make sure to paginate scans if you have a lot of values when doing lists (we haven't enabled catalog listing yet, but we've run into this before); there's a rough sketch of what paging looks like below.
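
For the paging point, something like the SDK v2 scan paginator is what I mean; the table and attribute names here are made up for illustration:

```
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;

public class PagedCatalogListing {
  public static void main(String[] args) {
    DynamoDbClient dynamo = DynamoDbClient.create();

    // scanPaginator transparently follows LastEvaluatedKey across pages, so listings
    // bigger than the 1 MB per-scan limit still come back complete.
    dynamo.scanPaginator(ScanRequest.builder()
            .tableName("iceberg_catalog")               // hypothetical catalog table
            .projectionExpression("table_identifier")   // hypothetical attribute
            .build())
        .items()
        .forEach(item -> System.out.println(item.get("table_identifier").s()));
  }
}
```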

We chose Dynamo because we were already using it for other use cases. I'm not sure it's the best AWS-provided option for atomic changes.

John

On 11/19/20 10:07 AM, Marko Babic wrote:
Hi everyone,

At my org we’ve spun up a few Iceberg tables on top of S3 without a metastore (conscious of the consequences) and we’ve arrived at the point that we need to support concurrent writes. :) I was hoping to get some advice as to what the best way to integrate an existing Iceberg table into a Hive Metastore or an alternative might be. We’re still relatively early in our adoption of Iceberg and have no real prior experience with Hive so I don’t know what I don’t know.

Some options we’re weighing:

  - Existing tables aren’t so big that the moral equivalent of "CREATE TABLE hive.db.table … AS SELECT * FROM hadoop.table" is out of the question, but we’d prefer to not have to read + rewrite everything. We also have stateful readers (tracking which snapshots they have previously read) and preserving table history would make life easier.

  - Doing something along the lines of the following and importing the tables into Hive as external tables looks like it should work given my understanding of how Iceberg uses HMS, but I don’t know if it’s encouraged and I haven’t done the due diligence to understand potential consequences:

```
hive> CREATE EXTERNAL TABLE `existing_table` (...)
LOCATION
  's3://existing-table/'
-- serde, input/output formats omitted
TBLPROPERTIES (
  -- Assuming the latest metadata file for the Hadoop table is v99.metadata.json, rename it to
  -- 00099-uuid.metadata.json so that BaseMetastoreTableOperations can correctly parse the version number.
  'metadata_location'='s3://existing-table/metadata/00099-uuid.metadata.json',
  'table_type'='ICEBERG'
)
```
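
If we go this route I imagine we’d sanity-check the registration by loading the table both directly by location and through the Hive catalog, then comparing current snapshots; a rough sketch, assuming an Iceberg release that has CatalogUtil.loadCatalog (the metastore URI and identifiers are made up):

```
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopTables;

public class CheckRegistration {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Load the existing metastore-less table directly by location.
    Table hadoopTable = new HadoopTables(conf).load("s3://existing-table/");

    // Load the same table through the Hive catalog it was registered in.
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://hive-metastore:9083");  // placeholder metastore URI
    Catalog hive = CatalogUtil.loadCatalog(
        "org.apache.iceberg.hive.HiveCatalog", "hive", props, conf);
    Table hiveTable = hive.loadTable(TableIdentifier.of("db", "existing_table"));

    // Both should report the same current snapshot if the registration worked.
    System.out.println(hadoopTable.currentSnapshot().snapshotId());
    System.out.println(hiveTable.currentSnapshot().snapshotId());
  }
}
```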

  - Others seem to have had success implementing + maintaining a custom catalog (https://iceberg.apache.org/custom-catalog/) backed by e.g. DynamoDB for atomic metadata updates, which could appeal to us. Seems like migration in this case consists of implementing the catalog and plopping the latest metadata into the backing store. Are custom catalogs more of an escape hatch when HMS can’t be used, or would that maybe be a reasonable way forward if we find we don’t want to maintain + operate on top of HMS?
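
For what it’s worth, my mental model of the atomic piece is a conditional write on the current metadata location, i.e. the compare-and-swap a custom TableOperations.doCommit() would do; a very rough sketch with the AWS SDK v2, where the DynamoDB table and attribute names are assumptions on my part:

```
import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class DynamoMetadataPointer {
  private final DynamoDbClient dynamo;
  private final String catalogTable;  // hypothetical DynamoDB table with one item per Iceberg table

  public DynamoMetadataPointer(DynamoDbClient dynamo, String catalogTable) {
    this.dynamo = dynamo;
    this.catalogTable = catalogTable;
  }

  // Swap the metadata pointer only if it still points at the metadata file the committer
  // read. A ConditionalCheckFailedException means someone else committed first, which a
  // custom TableOperations would translate into CommitFailedException so Iceberg retries
  // against the refreshed metadata.
  public void swapMetadataLocation(String tableIdentifier, String expectedLocation, String newLocation) {
    Map<String, AttributeValue> key = new HashMap<>();
    key.put("table_identifier", AttributeValue.builder().s(tableIdentifier).build());

    Map<String, AttributeValue> values = new HashMap<>();
    values.put(":expected", AttributeValue.builder().s(expectedLocation).build());
    values.put(":new", AttributeValue.builder().s(newLocation).build());

    dynamo.updateItem(UpdateItemRequest.builder()
        .tableName(catalogTable)
        .key(key)
        .updateExpression("SET metadata_location = :new")
        .conditionExpression("metadata_location = :expected")
        .expressionAttributeValues(values)
        .build());
  }
}
```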

Apologies if this was discussed or documented somewhere else and I’ve missed it.

Thanks!

Marko
