Re: [DISCUSS] Spark 3.1 support?

2023-04-26 Thread Walaa Eldin Moustafa
Yes, that sounds like a good compromise. Initially I was looking at deprecation guidelines in [1], but I see you are referring to [2]. [1] https://iceberg.apache.org/contribute/ [2] https://iceberg.apache.org/multi-engine-support/#current-engine-version-lifecycle-status On Wed, Apr 26, 2023 at 8:

Re: Sequence number for ContentFiles

2023-04-26 Thread Steven Wu
> Will a clock skew cause any issues w.r.t. relying on the snapshot commit time? I think we allow a mismatch up to a minute in TableMetadata. This is probably not a problem. Typically the max allowed misalignment is much longer than 1 minute. > We also planned to expose file sequence number (diff

Re: [DISCUSS] Switch to JDK 11 for releases?

2023-04-26 Thread Jack Ye
Added issue https://github.com/apache/iceberg/issues/7440 -Jack On Wed, Apr 26, 2023 at 8:28 AM Anton Okolnychyi wrote: > Is anyone interested in giving it a try? > > - Anton > > On Apr 25, 2023, at 4:13 PM, Ryan Blue wrote: > > Yeah, I do like the idea of using --release. We'll need to test it

Re: Seeking Input on Handling Ambiguity in Generating Changelogs

2023-04-26 Thread Jack Ye
To follow up on what we discussed in the community sync, I had the misunderstanding that this will throw an exception even when we are only requesting deletes and inserts. Looks like users who want to consume those kinds of changes will not be affected, and what Anton proposes about using the WAP br

Re: SQL Syntax for Time Travel on a Branch?

2023-04-26 Thread Jack Ye
We probably want to document these two different behaviors, and what we think is the correct expected behavior, on the website. The question about time travel in a branch comes up quite often now that the related feature is publicly released. If some users really want the ancestor-based behavior, it is t

Re: Sequence number for ContentFiles

2023-04-26 Thread Jack Ye
+1 for using file sequence number. This work has been discussed for a long time but never got picked up; it would be great if someone could drive it to completion. -Jack On Wed, Apr 26, 2023 at 12:03 PM Anton Okolnychyi wrote: > Will a clock skew cause any issues w.r.t. relying on the snapshot commi

Re: Sequence number for ContentFiles

2023-04-26 Thread Anton Okolnychyi
Will a clock skew cause any issues w.r.t. relying on the snapshot commit time? I think we allow a mismatch up to a minute in TableMetadata. We also planned to expose file sequence number (different from data sequence number). I believe you could look up the snapshot using that info. https://iceberg.

Re: Sequence number for ContentFiles

2023-04-26 Thread Steven Wu
Piggybacking on this thread since we are discussing exposing more metadata in ContentFile or FileScanTask. Flink source watermark alignment can potentially leverage the snapshot timestamp (when data files are committed/appended to the table). Is it reasonable to expose some snapshot metadata in the F

Re: Improve Change Data Capture Use Case for Iceberg

2023-04-26 Thread Anton Okolnychyi
Thanks for starting a thread, Jack! I have yet to go through the proposal. I recently came across a similar idea in BigQuery, which relies on a staleness threshold: https://cloud.google.com/blog/products/data-analytics/bigquery-gains-change-data-capture-functionality/

Improve Change Data Capture Use Case for Iceberg

2023-04-26 Thread Jack Ye
Hi everyone, As we discussed in the community sync, it looks like we have some general interest in improving the CDC streaming process. Dan mentioned that Ryan has a proposal about an alternative CDC approach that has an accumulated changelog that is periodically synced to a target table. I have

Re: Sequence number for ContentFiles

2023-04-26 Thread Anton Okolnychyi
My initial thinking is that exposing sequence numbers on ContentFile is preferable (we would get it for free in scan tasks). That said, I’ll need to see how complicated the implementation would be. Exposing it on ContentScanTask is a viable alternative. However, we already have a precedent for a

Re: Sequence number for ContentFiles

2023-04-26 Thread Ryan Blue
Exposing sequence number makes sense for use cases like this. I also like the idea of exposing it through FileScanTask. That might be easier than trying to add it to ContentFile. Anton, what do you think about adding it to FileScanTask? On Wed, Apr 26, 2023 at 7:50 AM Anton Okolnychyi wrote: >

Re: SQL Syntax for Time Travel on a Branch?

2023-04-26 Thread Ryan Blue
> Just to make this explicit: does "history" here refer to the "snapshot-log" in the spec? Yes, that's correct. The snapshot log is the history of which snapshots were current. We don't keep history for other branches right now. I'm not sure that we would want to. > Given how time-travel is currently defined,

Re: Seeking Input on Handling Ambiguity in Generating Changelogs

2023-04-26 Thread Anton Okolnychyi
If I understand correctly, duplicate IDs are only a problem if we want to compute pre- and post-update images. If we are not trying to rebuild updates, it should not matter, which is good as most use cases can be solved with only DELETE and INSERT. My initial thought was to do our best and come

Re: Support create table like for Iceberg table?

2023-04-26 Thread John Zhuge
On Tue, Apr 25, 2023 at 4:07 PM Ryan Blue wrote: > Pucheng, what engine are you interested in? > > This works fine in Trino: CREATE TABLE table_copy (LIKE source_table > INCLUDING PROPERTIES) > > I don’t know if it works in Hive, and last time I checked it was not > implemented for DSv2 in Spark.

Re: SQL Syntax for Time Travel on a Branch?

2023-04-26 Thread Micah Kornfield
Thanks for the replies, Ryan and Amogh. > Time travel relies on history which captures all the changes on the main > table state. Just to make this explicit: does "history" here refer to the "snapshot-log" in the spec? > We decided the first option is easier to understand and is what people > expect. That way if

Re: [DISCUSS] Switch to JDK 11 for releases?

2023-04-26 Thread Anton Okolnychyi
Is anyone interested in giving it a try? - Anton > On Apr 25, 2023, at 4:13 PM, Ryan Blue wrote: > > Yeah, I do like the idea of using --release. We'll need to test it with those > platforms though. > > On Tue, Apr 25, 2023 at 12:04 PM Anton Okolnychyi > wrote: > It would be interesting to he

Re: Support create table like for Iceberg table?

2023-04-26 Thread Anton Okolnychyi
Pucheng, you mentioned you want to reuse existing data in the new table? Branching Iceberg table state can lead to unexpected situations as there will be multiple pointers in the catalog to the same state, which can eventually corrupt the table. Isn’t CREATE TABLE LIKE supposed to just reuse the

Re: [DISCUSS] Spark 3.1 support?

2023-04-26 Thread Anton Okolnychyi
Got it. Given that quite a few folks still use 3.1, I don’t think we would remove it unless the branch becomes inactive. Marking it as deprecated would allow us to indicate that it may not be as up-to-date and complete as other versions and some performance enhancements or even minor bug fixe

Re: Sequence number for ContentFiles

2023-04-26 Thread Anton Okolnychyi
It is actually my bad for not following up on that after #5913 and #6002. I’ll take a look at #5760 referenced below by the end of this week. The plan was to expose sequence numbers on ContentFile. It is needed in a number of use cases. - Anton > On Apr 26, 2023, at 4:55 AM, Gabor Kaszab wrote:

Re: Support create table like for Iceberg table?

2023-04-26 Thread Zoltán Borók-Nagy
As a reference, Impala can also do Hive-style CREATE TABLE x LIKE y for Iceberg tables. You can see various examples at https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/iceberg-create-table-like-table.test - Zoltan On Wed, Apr 26, 2023 at 4:10 AM

Sequence number for ContentFiles

2023-04-26 Thread Gabor Kaszab
Hey Iceberg Community, I know there has been a discussion previously about exposing the sequence number on a ContentFile level, but if I'm not mistaken that conversation didn't end with a consensus. I found some relevant PRs that have been open for a while: https://github.com/apache/iceberg/pull/57