I'd love to contribute documentation about the actions - I just need some time to understand the needs for some of them (like RewriteManifestsAction).
I just submitted a PR for the structured streaming sink [1]. I mentioned expireSnapshots() there with a link to the javadoc page, but it would be nice to also have a code example on the API page, since it isn't bound to Spark's case (any fast-changing case, including Flink streaming, would need this as well). A rough sketch of what such an example could look like is included further down the thread.

1. https://github.com/apache/iceberg/pull/1261

On Tue, Jul 28, 2020 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

> > seems to fail on high rate writing streaming query being run on the other side
>
> This kind of situation is where you'd want to tune the number of retries for a table. That's a likely source of the problem. We can also check to make sure we're being smart about conflict detection. A rewrite needs to scan any manifests that might have data files that conflict, which is why retries can take a little while. The farther the rewrite is from active data partitions, the better. And we can double-check to make sure we're using the manifest file partition ranges to avoid doing unnecessary work.
>
> > from the end user's point of view, all actions are not documented
>
> Yes, we need to add documentation for the actions. If you're interested, feel free to open PRs! The actions are fairly new, so we don't yet have docs for them.
>
> Same with the streaming sink: we just need someone to write up docs and contribute them. We don't use the streaming sink, so I've unfortunately overlooked it.
>
> On Mon, Jul 27, 2020 at 3:25 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks for the quick response!
>>
>> And yes, I also experimented with expireSnapshots() and it looked good. I can imagine some alternative conditions for expiring snapshots (like adjusting the "granularity" between snapshots instead of removing all snapshots before a specific timestamp), but for now it's just an idea and I don't have real-world needs to back it up.
>>
>> I also went through RewriteDataFilesAction and it looked good as well. There's an existing GitHub issue to make the action more intelligent, which is valid and good to add. One thing I noticed is that it's a fairly time-consuming task (expected, for sure; not a problem), and it seems to fail while a high-rate streaming write query is running on the other side (this is a concern). I would guess the action only touches old snapshots, so no conflict should be expected against fast appends. It would be nice to know whether this is expected behavior and it's recommended to stop all writes before running the action, or whether it sounds like a bug.
>>
>> I haven't gone through RewriteManifestsAction yet; for now I'm only curious about the need for it. I'm eager to experiment with the streaming source which is in review - I don't know enough about the details of Iceberg to be qualified to participate in the review. I'd rather play with it when it's available and use it as a chance to learn about Iceberg itself.
>>
>> By the way, from the end user's point of view, none of the actions are documented - even the structured streaming sink is not documented, and I had to go through the code. While I think it's obvious to document a streaming sink in the Spark docs (I wonder why the documentation was missed), would we want to document the actions as well? These actions look like they are still evolving, so I'm wondering whether we're waiting for them to stabilize, or whether the documentation was just missed.
>>
>> Thanks,
>> Jungtaek Lim
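A minimal sketch of the maintenance calls discussed above, assuming the Java APIs as of roughly Iceberg 0.9: expireSnapshots() on the core Table API, the commit.retry.num-retries table property Ryan refers to for tuning commit retries, and the Spark-based Actions entry point for rewriteDataFiles(). The table location and the specific values are hypothetical illustrations, not recommendations, and Actions.forTable() needs an active SparkSession:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.Actions;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class TableMaintenance {
      public static void main(String[] args) {
        // Load the table; the location is a hypothetical example.
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://nn:8020/warehouse/db/events");

        // Raise the commit retry count so that maintenance commits can
        // survive contention with a high-rate streaming writer (default: 4).
        table.updateProperties()
            .set("commit.retry.num-retries", "10")
            .commit();

        // Expire snapshots older than 24 hours to keep metadata small.
        // Note: this limits how far back time travel can go.
        long cutoff = System.currentTimeMillis() - 24 * 60 * 60 * 1000L;
        table.expireSnapshots()
            .expireOlderThan(cutoff)
            .commit();

        // Compact the small files written by the streaming sink. Actions is
        // the Spark-based entry point and requires an active SparkSession.
        Actions.forTable(table)
            .rewriteDataFiles()
            .targetSizeInBytes(512 * 1024 * 1024L) // 512 MB target file size
            .execute();
      }
    }

Running expireSnapshots() and rewriteDataFiles() on a schedule separate from the streaming writer, against partitions the writer is no longer touching, should reduce the commit conflicts described above.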
>>
>> On Tue, Jul 28, 2020 at 2:45 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> That setting controls whether Iceberg cleans up old copies of the table metadata file. The metadata file holds references to all of the table's snapshots (those that have not expired) and is self-contained. No operations need to access previous metadata files.
>>>
>>> Those aren't typically that large, but they can be when streaming data, because you create a lot of versions. For streaming, I'd recommend turning it on and making sure you're running `expireSnapshots()` regularly to prune old table versions -- although expiring snapshots will remove them from table metadata and limit how far back you can time travel.
>>>
>>> On Mon, Jul 27, 2020 at 4:33 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm experimenting with Apache Iceberg as a Structured Streaming sink - I plan to experiment with the source as well, but I see the PR is still in review.
>>>>
>>>> It seems that "fast append" helps quite a bit to keep commit latency reasonable, though the metadata directory grows too fast. I found the option 'write.metadata.delete-after-commit.enabled' (false by default), enabled it, and the overall size looks fine afterwards.
>>>>
>>>> That said, given the option is false by default, I'm wondering what would be impacted by turning it on. My understanding is that it doesn't affect time travel (as that refers to a snapshot), and restoring is also done from a snapshot, so I'm not sure what to consider when turning the option on.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
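For the metadata-file growth discussed at the bottom of the thread, a minimal sketch of the cleanup Ryan recommends, assuming the table properties write.metadata.delete-after-commit.enabled and the related write.metadata.previous-versions-max (which defaults to 100); the retention value here is a hypothetical illustration:

    import org.apache.iceberg.Table;

    public class MetadataCleanup {
      // Enable automatic deletion of old metadata.json files after each
      // commit, so a streaming writer doesn't accumulate one metadata file
      // per commit indefinitely.
      public static void enableMetadataCleanup(Table table) {
        table.updateProperties()
            .set("write.metadata.delete-after-commit.enabled", "true")
            // Keep only the 10 most recent previous metadata files
            // (the property's default is 100).
            .set("write.metadata.previous-versions-max", "10")
            .commit();
      }
    }

This only prunes old metadata.json versions; expiring snapshots (as in the earlier sketch) is still needed to shrink the snapshot list inside the current metadata file.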