I'd love to contribute documentation about the actions - I just need some time to understand the needs for some of them (like RewriteManifestsAction).
I just submitted a PR for the structured streaming sink [1]. I mentioned expireSnapshots() there with a link to the javadoc page, but it would be nice to also have a code example on the API page, since it isn't bound to Spark's case (any fast-changing case, including Flink streaming, would need this as well). A rough sketch of what such an example could look like is included further down the thread.

1. https://github.com/apache/iceberg/pull/1261

On Tue, Jul 28, 2020 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:

> > seems to fail on high rate writing streaming query being run on the other side
>
> This kind of situation is where you'd want to tune the number of retries for a table. That's a likely source of the problem. We can also check to make sure we're being smart about conflict detection. A rewrite needs to scan any manifests that might have data files that conflict, which is why retries can take a little while. The farther the rewrite is from active data partitions, the better. And we can double-check to make sure we're using the manifest file partition ranges to avoid doing unnecessary work.
>
> > from the end user's point of view, all actions are not documented
>
> Yes, we need to add documentation for the actions. If you're interested, feel free to open PRs! The actions are fairly new, so we don't yet have docs for them.
>
> Same with the streaming sink: we just need someone to write up docs and contribute them. We don't use the streaming sink, so I've unfortunately overlooked it.
>
> On Mon, Jul 27, 2020 at 3:25 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>
>> Thanks for the quick response!
>>
>> And yes, I also experimented with expireSnapshots() and it looked good. I can imagine some alternative conditions for expiring snapshots (like adjusting the "granularity" between snapshots instead of removing all snapshots before a specific timestamp), but for now it's just an idea and I don't have real-world needs to back it up.
>>
>> I also went through RewriteDataFilesAction and it looked good as well. There's an existing GitHub issue to make the action more intelligent, which is valid and good to add. One thing I noticed is that it's a fairly time-consuming task (expected, for sure; not a problem), and it seems to fail while a high-rate streaming write query is running on the other side (this is a concern). I would guess the action only touches old snapshots, so no conflict should be expected against fast appends. It would be nice to know whether this is expected behavior and it's recommended to stop all writes before running the action, or whether it sounds like a bug.
>>
>> I haven't gone through RewriteManifestsAction yet; for now I'm only curious about the need for it. I'm eager to experiment with the streaming source which is in review - I don't know enough about the details of Iceberg to be qualified to participate in the review. I'd rather play with it when it's available and use it as a chance to learn about Iceberg itself.
>>
>> By the way, from the end user's point of view, none of the actions are documented - even the structured streaming sink is not documented, and I had to go through the code. While I think it's obvious to document a streaming sink in the Spark docs (I wonder why the documentation was missed), would we want to document the actions as well? These actions look like they are still evolving, so I'm wondering whether we're waiting for them to stabilize, or whether the documentation was just missed.
>>
>> Thanks,
>> Jungtaek Lim
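A minimal sketch of the maintenance calls discussed above, assuming the Java APIs as of roughly Iceberg 0.9: expireSnapshots() on the core Table API, the commit.retry.num-retries table property Ryan refers to for tuning commit retries, and the Spark-based Actions entry point for rewriteDataFiles(). The table location and the specific values are hypothetical illustrations, not recommendations, and Actions.forTable() needs an active SparkSession:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.Actions;
    import org.apache.iceberg.hadoop.HadoopTables;

    public class TableMaintenance {
      public static void main(String[] args) {
        // Load the table; the location is a hypothetical example.
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://nn:8020/warehouse/db/events");

        // Raise the commit retry count so that maintenance commits can
        // survive contention with a high-rate streaming writer (default: 4).
        table.updateProperties()
            .set("commit.retry.num-retries", "10")
            .commit();

        // Expire snapshots older than 24 hours to keep metadata small.
        // Note: this limits how far back time travel can go.
        long cutoff = System.currentTimeMillis() - 24 * 60 * 60 * 1000L;
        table.expireSnapshots()
            .expireOlderThan(cutoff)
            .commit();

        // Compact the small files written by the streaming sink. Actions is
        // the Spark-based entry point and requires an active SparkSession.
        Actions.forTable(table)
            .rewriteDataFiles()
            .targetSizeInBytes(512 * 1024 * 1024L) // 512 MB target file size
            .execute();
      }
    }

Running expireSnapshots() and rewriteDataFiles() on a schedule separate from the streaming writer, against partitions the writer is no longer touching, should reduce the commit conflicts described above.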
>>
>> On Tue, Jul 28, 2020 at 2:45 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> That setting controls whether Iceberg cleans up old copies of the table metadata file. The metadata file holds references to all of the table's snapshots (those that have not expired) and is self-contained. No operations need to access previous metadata files.
>>>
>>> Those aren't typically that large, but they can be when streaming data, because you create a lot of versions. For streaming, I'd recommend turning it on and making sure you're running `expireSnapshots()` regularly to prune old table versions -- although expiring snapshots will remove them from table metadata and limit how far back you can time travel.
>>>
>>> On Mon, Jul 27, 2020 at 4:33 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Hi devs,
>>>>
>>>> I'm experimenting with Apache Iceberg as a Structured Streaming sink - I plan to experiment with the source as well, but I see the PR is still in review.
>>>>
>>>> It seems that "fast append" helps quite a bit to keep commit latency reasonable, though the metadata directory grows too fast. I found the option 'write.metadata.delete-after-commit.enabled' (false by default), enabled it, and the overall size looks fine afterwards.
>>>>
>>>> That said, given the option is false by default, I'm wondering what would be impacted by turning it on. My understanding is that it doesn't affect time travel (as that refers to a snapshot), and restoring is also done from a snapshot, so I'm not sure what to consider when turning the option on.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
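For the metadata-file growth discussed at the bottom of the thread, a minimal sketch of the cleanup Ryan recommends, assuming the table properties write.metadata.delete-after-commit.enabled and the related write.metadata.previous-versions-max (which defaults to 100); the retention value here is a hypothetical illustration:

    import org.apache.iceberg.Table;

    public class MetadataCleanup {
      // Enable automatic deletion of old metadata.json files after each
      // commit, so a streaming writer doesn't accumulate one metadata file
      // per commit indefinitely.
      public static void enableMetadataCleanup(Table table) {
        table.updateProperties()
            .set("write.metadata.delete-after-commit.enabled", "true")
            // Keep only the 10 most recent previous metadata files
            // (the property's default is 100).
            .set("write.metadata.previous-versions-max", "10")
            .commit();
      }
    }

This only prunes old metadata.json versions; expiring snapshots (as in the earlier sketch) is still needed to shrink the snapshot list inside the current metadata file.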