Hi, Luis,

1. The cleaning policy is configurable at both the global and the topic
level. The global one is named log.cleanup.policy and the topic-level one
is named cleanup.policy, i.e. the same name with the "log." prefix
stripped. We can probably do the same for the new configs.
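For illustration only (the exact names are still to be decided in the KIP;
"compaction.strategy" below is just a placeholder), the pairing could look
like:

  # broker level (server.properties)
  log.cleanup.policy=compact
  log.compaction.strategy=timestamp    # hypothetical new config

  # topic level: same names with the "log." prefix stripped
  cleanup.policy=compact
  compaction.strategy=timestamp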

2. Since this KIP may require an admin to configure a larger dedup buffer
size, it would be useful to document this impact in the wiki and the
release notes.
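To make the sizing impact concrete, here is a back-of-envelope
calculation, assuming the cleaner's map keeps a 16-byte key hash plus an
8-byte offset per entry today (the exact layout may differ):

  entry size today:    16 (hash) + 8 (offset)            = 24 bytes
  entry size with KIP: 16 (hash) + 8 (offset) + 8 (long) = 32 bytes

  keys per 128 MB buffer today:    134217728 / 24 ~= 5.6M
  keys per 128 MB buffer with KIP: 134217728 / 32  = 4.2M  (~25% fewer)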

3. Yes, it's unlikely for the last message to be removed in the current
implementation since we never clean the active segment. However, in theory,
this can happen. So it would be useful to guard this explicitly.
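As a minimal sketch of such a guard (a hypothetical helper in Java, not
the actual LogCleaner code), assuming the cleaner knows the partition's
last offset when it decides whether to retain a record:

  // Hypothetical guard: never discard the record holding the partition's
  // last offset, so that offset is never reused for a different message.
  final class LastOffsetGuard {
      static boolean shouldRetain(long recordOffset,
                                  long lastOffsetInPartition,
                                  boolean retainedByCompaction) {
          if (recordOffset == lastOffsetInPartition) {
              return true; // always keep the last message
          }
          return retainedByCompaction; // normal compaction decision otherwise
      }
  }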

4. Just thought about another issue. We probably want to be a bit careful
with key deletion. Currently, one can delete a key by sending a message
with a delete tombstone (a null payload). To prevent a reader from missing
a deletion if it's removed too quickly, we depend on a configuration
log.cleaner.delete.retention.ms (defaults to 1 day). The delete tombstone
will only be physically removed from the log after that amount of time.
The expectation is that a reader that has consumed a message should finish
reading to the end of the log within that configured time.

With the new strategy, we have similar, but slightly different problems.
The first problem is that the delete tombstone may be delivered to a
consumer earlier, in offset order, than an outdated record for the same
key. In order not to take the outdated record, the consumer should cache
the delete tombstone for some configured amount of time (a sketch of such
a cache follows below). We can probably piggyback this on
log.cleaner.delete.retention.ms, but we need to document it.

The second problem is that once the delete tombstone is physically removed
from the log, how can we prevent outdated records from being added
(otherwise, they will never be garbage collected)? Not sure what the best
way to do this is. One possible way is to push this back to the
application and require the user not to publish outdated records after
log.cleaner.delete.retention.ms.
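For the first problem, a minimal consumer-side sketch of that caching idea
(a hypothetical helper in Java; the key type, the use of record timestamps
and how the application learns log.cleaner.delete.retention.ms are all
assumptions):

  import java.util.HashMap;
  import java.util.Map;

  // Hypothetical cache: remember tombstoned keys for the delete-retention
  // window and drop any record for the same key that is older than the
  // tombstone.
  final class TombstoneCache {
      private final Map<String, Long> tombstoneTimes = new HashMap<>();
      private final long retentionMs; // mirrors log.cleaner.delete.retention.ms

      TombstoneCache(long retentionMs) {
          this.retentionMs = retentionMs;
      }

      /** Returns true if the record should be applied, false if skipped. */
      boolean accept(String key, byte[] value, long recordTsMs, long nowMs) {
          // expire tombstones older than the retention window
          tombstoneTimes.values().removeIf(ts -> nowMs - ts > retentionMs);
          if (value == null) { // a delete tombstone
              tombstoneTimes.put(key, recordTsMs);
              return true;
          }
          Long tombstoneTs = tombstoneTimes.get(key);
          // skip records that are not newer than a cached tombstone
          return tombstoneTs == null || recordTsMs > tombstoneTs;
      }
  }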

Thanks,

Jun

On Wed, Jul 4, 2018 at 11:11 AM, Luís Cabral <luis_cab...@yahoo.com.invalid>
wrote:

> Hi Jun,
>
> -:  1. I guess both new configurations will be at the topic level?
>
> They will exist in the global configuration, at the very least.
> I would like to have them on the topic level as well, but there is an
> inconsistency between the cleanup/compaction properties that exist “only
> globally” vs “globally + per topic”.
> I haven’t gotten around to investigating why, and whether that reason
> would also apply to the properties I’m suggesting. At first glance they
> seem to belong with the properties that are “only globally” configured,
> but Guozhang has written earlier with a suggestion of a compaction
> property that works for both (though I haven’t had the time to look into
> it yet, unfortunately).
>
> -:  2. Since the log cleaner now needs to keep both the offset and another
> long (say timestamp) in the de-dup map, it reduces the number of keys that
> we can keep in the map and thus may require more rounds of cleaning. This
> is probably not a big issue, but it will be useful to document this impact
> in the KIP.
>
> As a reader, I tend to prefer brief documentation on new features (they
> tend to be too many for me to find the willpower to read a 200-page essay
> about each one), so that influences me to avoid writing about every
> micro-impact that may exist, and to simply leave it to be inferred (as
> you have just done).
> But I also don’t feel strongly enough about it to argue either way. So,
> after reading my argument, if you still insist, I’ll happily add it there.
>
> -: 3. With the new cleaning strategy, we want to be a bit careful with
> removing the last message in a partition (which is possible now). We need
> to preserve the offset of the last message so that we don't reuse the
> offset for a different message. One way is to simply never remove the last
> message. Another way (suggested by Jason) is to create an empty message
> batch.
>
> That is a good point, but isn’t the last message always kept regardless?
> In all of my tests with this approach, I have never seen it removed.
> This is not because I made it so while changing the code; it was simply
> like this before, which made me smile!
> Given these results, I just *assumed* (oops) that these scenarios
> represented the reality, so the compaction would only affect the “tail”,
> while the “head” remained untouched. Now that you say it’s possible that
> the last message actually gets overwritten somehow, I guess a new bullet
> point will have to be added to the KIP for this (after I’ve found the
> time to review the portion of the code that enacts this behaviour).
>
> Kind Regards,
> Luís Cabral
>
> From: Jun Rao
> Sent: 03 July 2018 23:58
> To: dev
> Subject: Re: [VOTE] KIP-280: Enhanced log compaction
>
> Hi, Luis,
>
> Thanks for the KIP. Overall, this seems a useful KIP. A few comments below.
>
> 1. I guess both new configurations will be at the topic level?
> 2. Since the log cleaner now needs to keep both the offset and another long
> (say timestamp) in the de-dup map, it reduces the number of keys that we
> can keep in the map and thus may require more rounds of cleaning. This is
> probably not a big issue, but it will be useful to document this impact in
> the KIP.
> 3. With the new cleaning strategy, we want to be a bit careful with
> removing the last message in a partition (which is possible now). We need
> to preserve the offset of the last message so that we don't reuse the
> offset for a different message. One way is to simply never remove the last
> message. Another way (suggested by Jason) is to create an empty message
> batch.
>
> Jun
>
> On Sat, Jun 9, 2018 at 12:39 AM, Luís Cabral <luis_cab...@yahoo.com.invalid>
> wrote:
>
> > Hi all,
> >
> > Any takers on having a look at this KIP and voting on it?
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> >
> > Cheers,
> > Luis
> >
>
>
