AWS DynamoDB Catalog Support

2021-06-08 Thread Ye, Jack
Hi everyone, Because of the many requests we have received for an AWS DynamoDB catalog implementation, I have put up PR https://github.com/apache/iceberg/pull/2688 showing the DynamoDbCatalog that we designed and implemented. I have heard of a few internal implementations based on DynamoDB in

Re: FlinkSink UIDs problem

2021-06-08 Thread Steven Wu
Igor, I think your diagnosis is spot on. Regarding the workaround, I guess there are two ways: (1) set pipeline.auto-generate-uids=true, which is probably not what you are looking for; (2) avoid the FlinkSink builder and write your own glue code. As for the fix, we can probably add a `uid` method to the FlinkS
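The first workaround Steven mentions is a single Flink configuration entry. A sketch of the relevant flink-conf.yaml line (the key is standard Flink configuration; whether it is acceptable depends on how strictly you need stable operator UIDs):

```yaml
# Let Flink auto-generate UIDs for operators that do not set one explicitly.
# Caveat: auto-generated UIDs can change whenever the job graph changes,
# which can break state restoration from savepoints -- the very guarantee
# the original poster was trying to enforce by disabling this.
pipeline.auto-generate-uids: true
```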

Re: Consistency problems with Iceberg + EMRFS

2021-06-08 Thread Jack Ye
There are 2 potential root causes I see: 1. you might be using EMRFS with DynamoDB enabled to check consistency, which can lead to DynamoDB and S3 going out of sync. The quick solution is to just delete the DynamoDB consistency table, and the next read/write will recreate and resync it. After all, EMRFS
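Jack's quick fix, dropping the EMRFS consistent-view table, can be done with the AWS CLI. A hedged sketch: `EmrFSMetadata` is the EMRFS default table name, so substitute yours if the cluster was configured with a different `fs.s3.consistent.metadata.tableName`:

```shell
# Drop the EMRFS consistent-view metadata table; the next read/write will
# recreate and resync it. Assumes the default table name -- check
# fs.s3.consistent.metadata.tableName in your emrfs-site configuration first.
aws dynamodb delete-table --table-name EmrFSMetadata
```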

Re: Consistency problems with Iceberg + EMRFS

2021-06-08 Thread Ryan Blue
Hi Scott, I'm not quite sure what's happening here, but I should at least note that we didn't intend for HDFS tables to be used with S3. HDFS tables use an atomic rename in the file system to ensure that only one committer "wins" to produce a given version of the table metadata. In S3, renames are

Consistency problems with Iceberg + EMRFS

2021-06-08 Thread Scott Kruger
We’re using the Iceberg API (0.11.1) over raw parquet data in S3/EMRFS, basically just using the table API to issue overwrites/appends. Everything works great for the most part, but we’ve recently started to have problems with the iceberg metadata directory going out of sync. See the following

Updating column type from timestamp to timestamptz

2021-06-08 Thread Huadong Liu
Hi, I have a hadoop table created from the Iceberg Java API that uses the timestamp type for a column. Spark cannot work with the table because of that. *java.lang.UnsupportedOperationException: Spark does not support timestamp without time zone fields* I tried *sql("ALTER TABLE table ALTER COLU

Re: AssertJ for Assertions

2021-06-08 Thread Ryan Blue
I mostly agree with Jack. I think that if I were starting a new project, I'd probably want to use the new assertions because they are readable, appear to be type-specific, and have some nice helpers for some types. But I don't see a lot of benefit to moving over to them right now and it would be a

FlinkSink UIDs problem

2021-06-08 Thread Igor Basov
Hello, I'm using Flink 1.11 with Iceberg 0.11. I use `pipeline.auto-generate-uids: false` in my Flink configuration to enforce assigning UIDs to operators, so that the job could be safely stopped and the state restored from the latest checkpoint. But when I use Iceberg FlinkSink I get error: Cause

Re: AssertJ for Assertions

2021-06-08 Thread Jack Ye
I would be on the more cautious side when introducing a new test utils library. Based on the PR, we are mostly changing things like Assert.assertEquals to another syntax, but that syntax is not complex in the first place. If we introduce AssertJ, there will be a mixture of 2 syntaxes, which is conf

AssertJ for Assertions

2021-06-08 Thread Eduard Tudenhoefner
Hi everyone, I was wondering what the appetite would be for introducing AssertJ to the project? I believe it's a really good testing library that makes writing assertions much more intuitive, as the assertions are written in kind-of a fluent way. The test code ends
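For readers unfamiliar with the fluent style Eduard is referring to, here is a minimal self-contained sketch of the idiom. This is an illustration of the pattern only, not AssertJ itself (the real entry point is `org.assertj.core.api.Assertions.assertThat`); all class and method names below are hypothetical:

```java
// Minimal illustration of fluent assertions, in the spirit of AssertJ.
// Not AssertJ -- just the method-chaining idiom the proposal is about.
public class FluentDemo {
    static final class Check<T> {
        private final T actual;
        private Check(T actual) { this.actual = actual; }

        Check<T> isNotNull() {
            if (actual == null) {
                throw new AssertionError("expected a non-null value");
            }
            return this;
        }

        Check<T> isEqualTo(T expected) {
            if (!java.util.Objects.equals(actual, expected)) {
                throw new AssertionError("expected " + expected + " but was " + actual);
            }
            return this; // returning `this` is what makes the style "fluent"
        }
    }

    static <T> Check<T> that(T actual) { return new Check<>(actual); }

    public static void main(String[] args) {
        // JUnit style:  Assert.assertEquals("iceberg", tableName);
        // Fluent style: assertThat(tableName).isNotNull().isEqualTo("iceberg");
        that("iceberg").isNotNull().isEqualTo("iceberg");
        System.out.println("ok");
    }
}
```

The chaining reads left to right and fails with a descriptive message, which is the readability benefit the thread is weighing against introducing a second assertion syntax.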

Re: question about the iceberg manifest/manifest list/metadata api

2021-06-08 Thread Zoltán Borók-Nagy
Hey Yong, I've created a design doc about write support: https://docs.google.com/document/d/1_KL0YptDKwhiXvJyx4Vb-yZjggrPQAW2yjeGV4C0vMU/edit We don't have an upstream release of Impala that supports Iceberg, but you can check out and build Impala master: https://cwiki.apache.org/confluence/displa