Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Jungtaek Lim
Hi Jingsong, Yeah, auto compaction (and auto vacuum) is probably the ideal state we want to reach. That requires "additional latency" when finishing a batch (assuming compaction runs as a synchronous task), though it won't matter much with a proper interval, as end users would expect the latency on a

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Jingsong Li
Thanks Jungtaek for starting this discussion. Our team wants to ingest data into an Iceberg table at one-minute intervals. This frequency can also lead to a large number of small files. Auto compaction (rewriting manifests and data files) in the streaming sink (writer) looks wonderful.

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Jungtaek Lim
> If your commit is taking 13 seconds to produce a retry, then it must be that you have a lot of metadata. That makes sense if you're producing a new snapshot every 10 seconds. It could be that at that rate, you have a large number of manifests and a large number of snapshots, causing a large metad

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Ryan Blue
Thanks for the PR! We will review it soon. The RewriteManifestAction is a parallel way to rewrite manifests and then commit them using the RewriteManifests table operation. The idea here is that the metadata tree (manifest list and manifest files) functions as an index over data in the table. Some

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Ryan Blue
I verified that the rewrite action will correctly avoid filtering manifests that can't contain the deleted files, so retries should be okay. (The relevant part of the code is in ManifestFilterManager

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-28 Thread Jungtaek Lim
The repeated failure to commit the result of rewriting data files isn't a matter of the number of retries, but of the latency of constructing a new commit when a conflict happens. I haven't looked into the details of how the new commit is constructed to reflect the new metadata, but it too

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-27 Thread Jungtaek Lim
I'd love to contribute documentation about the actions - I just need some time to understand the need for some actions (like RewriteManifestAction). I just submitted a PR for the Structured Streaming sink [1]. I mentioned expireSnapshots() there with a link to the Javadoc page, but it'd be nice if there's als

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-27 Thread Ryan Blue
> seems to fail on high rate writing streaming query being run on the other side This kind of situation is where you'd want to tune the number of retries for a table. That's a likely source of the problem. We can also check to make sure we're being smart about conflict detection. A rewrite needs t
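
As a rough illustration of the retry tuning mentioned above, here is a toy Python model of an optimistic commit loop. The property names `commit.retry.num-retries`, `commit.retry.min-wait-ms` and `commit.retry.max-wait-ms` are Iceberg table properties; the values, the `CommitConflict` exception, the helper functions, and the doubling backoff are assumptions made for this sketch, not Iceberg's actual implementation.

```python
# Iceberg table property names; the values are illustrative, not recommendations.
RETRY_PROPS = {
    "commit.retry.num-retries": 10,
    "commit.retry.min-wait-ms": 100,
    "commit.retry.max-wait-ms": 60_000,
}


class CommitConflict(Exception):
    """Hypothetical error: another writer committed first."""


def backoff_ms(attempt, props):
    """Wait before retry number `attempt` (0-based): doubles each time, capped."""
    wait = props["commit.retry.min-wait-ms"] * (2 ** attempt)
    return min(wait, props["commit.retry.max-wait-ms"])


def commit_with_retries(try_commit, props, sleep=lambda ms: None):
    """Call try_commit() until it succeeds or the retries are exhausted.

    Returns the number of attempts made. `sleep` is injectable so the
    sketch runs instantly; a real caller would actually wait.
    """
    retries = props["commit.retry.num-retries"]
    for attempt in range(retries + 1):
        try:
            try_commit()
            return attempt + 1
        except CommitConflict:
            if attempt == retries:
                raise  # out of retries: surface the conflict to the caller
            sleep(backoff_ms(attempt, props))


def make_flaky_commit(conflicts):
    """Hypothetical commit that conflicts `conflicts` times, then succeeds."""
    state = {"left": conflicts}

    def try_commit():
        if state["left"] > 0:
            state["left"] -= 1
            raise CommitConflict()

    return try_commit
```

In this model, `commit_with_retries(make_flaky_commit(3), RETRY_PROPS)` succeeds on the fourth attempt. Raising the retry count only helps if each attempt can be rebuilt and committed before the next conflicting commit lands.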

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-27 Thread Jungtaek Lim
Thanks for the quick response! And yes, I also experimented with expireSnapshots() and it looked good. I can imagine some alternative conditions on expiring snapshots (like adjusting "granularity" between snapshots instead of removing all snapshots before the specific timestamp), but for n

Re: Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-27 Thread Ryan Blue
Hi Jungtaek, That setting controls whether Iceberg cleans up old copies of the table metadata file. The metadata file holds references to all of the table's snapshots (that have not expired) and is self-contained. No operations need to access previous metadata files. Those aren't typically that la
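
To make the effect of the setting concrete, here is a toy Python model of pruning old table metadata files after a commit. The two property names are Iceberg table properties; the file names, the `prune_previous_metadata` helper, and the fallback defaults in the code are assumptions for this sketch, not Iceberg's code. Only old copies of the metadata file are affected here; snapshots, manifests, and data files are cleaned up separately (for example with expireSnapshots()).

```python
# Iceberg table property names; the values are illustrative.
TABLE_PROPS = {
    "write.metadata.delete-after-commit.enabled": "true",
    "write.metadata.previous-versions-max": "3",
}


def prune_previous_metadata(previous, props):
    """Split previous metadata files into (kept, deleted) after a commit.

    `previous` lists older metadata files, oldest first, and excludes the
    current metadata file, which is always retained.
    """
    # Fallback values below are assumed defaults for this sketch.
    enabled = props.get("write.metadata.delete-after-commit.enabled", "false") == "true"
    keep = int(props.get("write.metadata.previous-versions-max", "100"))
    if not enabled or len(previous) <= keep:
        return list(previous), []
    cut = len(previous) - keep
    return list(previous[cut:]), list(previous[:cut])


# Hypothetical file names for five earlier table versions.
previous_files = [f"v{i}.metadata.json" for i in range(1, 6)]
kept, deleted = prune_previous_metadata(previous_files, TABLE_PROPS)
```

With the values above, the two oldest files are deleted and the three newest previous versions are kept; with the setting disabled, nothing is removed and the metadata directory keeps growing with every commit.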

Effect of enabling 'write.metadata.delete-after-commit.enabled'

2020-07-27 Thread Jungtaek Lim
Hi devs, I'm experimenting with Apache Iceberg as a Structured Streaming sink - I plan to experiment with the source as well, but I see the PR is still in review. It seems that "fast append" helps a lot in keeping commit latency reasonable, though the metadata directory grows too fast. I found the