Hello Jun,
Thanks for your feedback. I agree on #3 that it's worth adding a special
check to never delete the last message: although unlikely, it is still
possible that a new active segment gets rolled but contains no data yet, in
which case the actual last message would sit in a "compact-able" segment.

For the second part of #4 you raised, maybe we could educate users to set
"message.timestamp.difference.max.ms" (whose default is currently
Long.MAX_VALUE) to be no larger than "log.cleaner.delete.retention.ms"? A
more aggressive approach would be to change the default value of the former
to the value of the latter if:

1. cleanup.policy = compact OR compact,delete
2. log.cleaner.compaction.strategy != offset

Because in this case I don't think it makes sense to allow users to send any
data older than "log.cleaner.delete.retention.ms". WDYT?
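
Here is a rough sketch of that defaulting rule (again, the names are
illustrative and this is not the actual config-resolution code):

  // Hypothetical helper: resolve the effective message.timestamp.difference.max.ms.
  long effectiveMaxTimestampDiffMs(java.util.Set<String> cleanupPolicies,
                                   String compactionStrategy,
                                   Long explicitMaxTimestampDiffMs,
                                   long deleteRetentionMs) {
      // An explicitly configured value always wins.
      if (explicitMaxTimestampDiffMs != null) {
          return explicitMaxTimestampDiffMs;
      }
      boolean compacted = cleanupPolicies.contains("compact");           // compact or compact,delete
      boolean nonOffsetStrategy = !"offset".equals(compactionStrategy);  // e.g. timestamp / header
      if (compacted && nonOffsetStrategy) {
          // Default to log.cleaner.delete.retention.ms so that records older than
          // the tombstone retention window are rejected at produce time.
          return deleteRetentionMs;
      }
      return Long.MAX_VALUE;  // current default of message.timestamp.difference.max.ms
  }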


Guozhang


On Fri, Jul 6, 2018 at 11:51 AM, Jun Rao <j...@confluent.io> wrote:

> Hi, Luis,
>
> 1. The cleaning policy is configurable at both global and topic level. The
> global one has the name log.cleanup.policy and the topic level has the name
> cleanup.policy by just stripping the log prefix. We can probably do the
> same for the new configs.
>
> 2. Since this KIP may require an admin to configure a larger dedup buffer
> size, it would be useful to document this impact in the wiki and the
> release notes.
>
> 3. Yes, it's unlikely for the last message to be removed in the current
> implementation since we never clean the active segment. However, in theory,
> this can happen. So it would be useful to guard this explicitly.
>
> 4. Just thought about another issue. We probably want to be a bit careful
> with key deletion. Currently, one can delete a key by sending a message
> with a delete tombstone (a null payload). To prevent a reader from missing
> a deletion if it's removed too quickly, we depend on a configuration
> log.cleaner.delete.retention.ms (defaults to 1 day). The delete tombstone
> will only be physically removed from the log after that amount of time. The
> expectation is that a reader should finish reading to the end of the log
> after consuming a message within that configured time. With the new
> strategy, we have similar, but slightly different problems. The first
> problem is that the delete tombstone may be delivered earlier than an
> outdated record in offset order to a consumer. In order for the consumer
> not to take the outdated record, the consumer should cache the deletion
> tombstone for some configured amount of time. We can probably piggyback this
> on log.cleaner.delete.retention.ms, but we need to document this. The
> second problem is that once the delete tombstone is physically removed from
> the log, how can we prevent outdated records from being added (otherwise, they
> will never be garbage collected)? Not sure what's the best way to do this.
> One possible way is to push this back to the application and require the
> user not to publish outdated records after log.cleaner.delete.retention.ms.
>
> Thanks,
>
> Jun
>
> On Wed, Jul 4, 2018 at 11:11 AM, Luís Cabral <luis_cab...@yahoo.com.invalid> wrote:
>
> > Hi Jun,
> >
> > -:  1. I guess both new configurations will be at the topic level?
> >
> > They will exist in the global configuration, at the very least.
> > I would like to have them on the topic level as well, but there is an
> > inconsistency between the cleanup/compaction properties that exist “only
> > globally” vs “globally + per topic”.
> > I haven’t gotten around to investigating why, and if that reason would
> > then also impact the properties I’m suggesting. At first glance they seem
> > to belong with the properties that are "only globally” configured, but
> > Guozhang has written earlier with a suggestion of a compaction property
> > that works for both (though I haven’t had the time to look into it yet,
> > unfortunately).
> >
> > -:  2. Since the log cleaner now needs to keep both the offset and another
> > long (say timestamp) in the de-dup map, it reduces the number of keys that
> > we can keep in the map and thus may require more rounds of cleaning. This
> > is probably not a big issue, but it will be useful to document this impact
> > in the KIP.
> >
> > As a reader, I tend to prefer brief documentation on new features (they
> > tend to be too many for me to find the willpower to read a 200-page essay
> > about each one), so that influences me to avoid writing about every
> > micro-impact that may exist, and simply leave it inferred (as you have
> > just done).
> > But I also don’t feel strongly enough about it to argue either way. So,
> > after reading my argument, if you still insist, I’ll happily add this
> > there.
> >
> > -: 3. With the new cleaning strategy, we want to be a bit careful with
> > removing the last message in a partition (which is possible now). We need
> > to preserve the offset of the last message so that we don't reuse the
> > offset for a different message. One way is to simply never remove the last
> > message. Another way (suggested by Jason) is to create an empty message
> > batch.
> >
> > That is a good point, but isn’t the last message always kept regardless?
> > In all of my tests with this approach, never have I seen it being removed.
> > This is not because I made it so while changing the code; it was simply
> > like this before, which made me smile!
> > Given these results, I just *assumed* (oops) that these scenarios
> > represented the reality, so the compaction would only affect the “tail”,
> > while the “head” remained untouched. Now that you say it's possible that
> > the last message actually gets overwritten somehow, I guess a new bullet
> > point will have to be added to the KIP for this (after I’ve found the time
> > to review the portion of the code that enacts this behaviour).
> >
> > Kind Regards,
> > Luís Cabral
> >
> > From: Jun Rao
> > Sent: 03 July 2018 23:58
> > To: dev
> > Subject: Re: [VOTE] KIP-280: Enhanced log compaction
> >
> > Hi, Luis,
> >
> > Thanks for the KIP. Overall, this seems a useful KIP. A few comments below.
> >
> > 1. I guess both new configurations will be at the topic level?
> > 2. Since the log cleaner now needs to keep both the offset and another long
> > (say timestamp) in the de-dup map, it reduces the number of keys that we
> > can keep in the map and thus may require more rounds of cleaning. This is
> > probably not a big issue, but it will be useful to document this impact in
> > the KIP.
> > 3. With the new cleaning strategy, we want to be a bit careful with
> > removing the last message in a partition (which is possible now). We need
> > to preserve the offset of the last message so that we don't reuse the
> > offset for a different message. One way is to simply never remove the last
> > message. Another way (suggested by Jason) is to create an empty message
> > batch.
> >
> > Jun
> >
> > On Sat, Jun 9, 2018 at 12:39 AM, Luís Cabral <luis_cab...@yahoo.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > Any takers on having a look at this KIP and voting on it?
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > >
> > > Cheers,
> > > Luis
> > >
> >
> >
>



-- 
-- Guozhang
