Hi Luís,

Can you please clarify how the header value has to be encoded when the
log compaction strategy is 'header'? As far as I can see, the current PR
reads a varLong in CleanerCache.extractVersion, but reads a String and
uses toLong in Cleaner.extractVersion, while the KIP says no more than
'the header value (which must be of type "long")'.
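For illustration, the two decodings described above would disagree on the same header bytes. This is only a sketch (the class and method names are hypothetical; the zig-zag varlong layout follows Kafka's ByteUtils, but the snippet is not taken from the PR):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HeaderDecode {

    // Variant A (CleanerCache.extractVersion, per the PR): a zig-zag
    // encoded varlong, matching Kafka's ByteUtils.readVarlong.
    static long readVarlong(ByteBuffer buf) {
        long raw = 0;
        int shift = 0;
        byte b;
        while (((b = buf.get()) & 0x80) != 0) {
            raw |= (long) (b & 0x7f) << shift;
            shift += 7;
        }
        raw |= (long) b << shift;
        return (raw >>> 1) ^ -(raw & 1); // zig-zag decode
    }

    // Variant B (Cleaner.extractVersion, per the PR): the header bytes
    // read as a UTF-8 string and parsed with toLong.
    static long readStringLong(byte[] bytes) {
        return Long.parseLong(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] header = {0x31}; // the single byte '1'
        System.out.println(readVarlong(ByteBuffer.wrap(header))); // -25
        System.out.println(readStringLong(header));               // 1
    }
}
```

The same one-byte header yields -25 under the varlong reading and 1 under the string reading, so the two code paths cannot both be right; the KIP would need to pin down one encoding.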

Otherwise +1 for the KIP

As for the current implementation: it seems that in the Cleaner class
the header key "version" is hardwired.

Andras



On Fri, Jul 6, 2018 at 10:36 PM Jun Rao <j...@confluent.io> wrote:

> Hi, Guozhang,
>
> For #4, what you suggested could make sense for timestamp-based de-dup,
> but I'm not sure how general it is since the KIP also supports de-dup
> based on headers.
>
> Thanks,
>
> Jun
>
> On Fri, Jul 6, 2018 at 1:12 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Jun,
> > Thanks for your feedback. I'd agree on #3 that it's worth adding a
> > special check to not delete the last message, since although unlikely,
> > it is still possible that a new active segment gets rolled out but
> > contains no data yet, and hence the actual last message in this case
> > would be in a "compact-able" segment.
> >
> > For the second part of #4 you raised, maybe we could educate users to
> > set "message.timestamp.difference.max.ms" to be no larger than
> > "log.cleaner.delete.retention.ms" (its default value is Long.MAX_VALUE)?
> > A more aggressive approach would be changing the default value of the
> > former to be the value of the latter if:
> >
> > 1. cleanup.policy = compact OR compact,delete
> > 2. log.cleaner.compaction.strategy != offset
> >
> > Because in this case I think it makes sense to really allow users to
> > send any data longer than "log.cleaner.delete.retention.ms", WDYT?
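(Editor's sketch: the guard suggested above, expressed as broker configuration. The property names are as written in this thread and the KIP draft; the values are illustrative, not defaults.)

```properties
# Hypothetical broker config: with a non-offset compaction strategy,
# cap the allowed timestamp skew at the tombstone retention window.
log.cleanup.policy=compact
log.cleaner.compaction.strategy=timestamp
log.cleaner.delete.retention.ms=86400000
# Keep this no larger than log.cleaner.delete.retention.ms:
message.timestamp.difference.max.ms=86400000
```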
> >
> >
> > Guozhang
> >
> >
> > On Fri, Jul 6, 2018 at 11:51 AM, Jun Rao <j...@confluent.io> wrote:
> >
> > > Hi, Luis,
> > >
> > > 1. The cleaning policy is configurable at both the global and topic
> > > levels. The global one has the name log.cleanup.policy and the topic
> > > level has the name cleanup.policy, just stripping the "log" prefix.
> > > We can probably do the same for the new configs.
> > >
> > > 2. Since this KIP may require an admin to configure a larger dedup
> > > buffer size, it would be useful to document this impact in the wiki
> > > and the release notes.
> > >
> > > 3. Yes, it's unlikely for the last message to be removed in the
> > > current implementation since we never clean the active segment.
> > > However, in theory, this can happen. So it would be useful to guard
> > > against this explicitly.
> > >
> > > 4. Just thought about another issue. We probably want to be a bit
> > > careful with key deletion. Currently, one can delete a key by sending
> > > a message with a delete tombstone (a null payload). To prevent a
> > > reader from missing a deletion if it's removed too quickly, we depend
> > > on a configuration, log.cleaner.delete.retention.ms (defaults to 1
> > > day). The delete tombstone will only be physically removed from the
> > > log after that amount of time. The expectation is that a reader
> > > should finish reading to the end of the log within that configured
> > > time after consuming a message. With the new strategy, we have
> > > similar, but slightly different problems. The first problem is that
> > > the delete tombstone may be delivered to a consumer earlier, in
> > > offset order, than an outdated record. In order for the consumer not
> > > to take the outdated record, the consumer should cache the deletion
> > > tombstone for some configured amount of time. We can probably
> > > piggyback this on log.cleaner.delete.retention.ms, but we need to
> > > document this. The second problem is that once the delete tombstone
> > > is physically removed from the log, how can we prevent outdated
> > > records from being added (otherwise, they will never be garbage
> > > collected)? Not sure what's the best way to do this. One possible way
> > > is to push this back to the application and require the user not to
> > > publish outdated records after log.cleaner.delete.retention.ms.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Jul 4, 2018 at 11:11 AM, Luís Cabral
> > > <luis_cab...@yahoo.com.invalid> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > -:  1. I guess both new configurations will be at the topic level?
> > > >
> > > > They will exist in the global configuration, at the very least.
> > > > I would like to have them on the topic level as well, but there is
> > > > an inconsistency between the cleanup/compaction properties that
> > > > exist “only globally” vs “globally + per topic”.
> > > > I haven’t gotten around to investigating why, and whether that
> > > > reason would also impact the properties I’m suggesting. At first
> > > > glance they seem to belong with the properties that are “only
> > > > globally” configured, but Guozhang has written earlier with a
> > > > suggestion of a compaction property that works for both (though I
> > > > haven’t had the time to look into it yet, unfortunately).
> > > >
> > > > -:  2. Since the log cleaner now needs to keep both the offset and
> > > > another long (say timestamp) in the de-dup map, it reduces the
> > > > number of keys that we can keep in the map and thus may require
> > > > more rounds of cleaning. This is probably not a big issue, but it
> > > > will be useful to document this impact in the KIP.
> > > >
> > > > As a reader, I tend to prefer brief documentation on new features
> > > > (they tend to be too many for me to find the willpower to read a
> > > > 200-page essay about each one), so that influences me to avoid
> > > > writing about every micro-impact that may exist, and simply leave
> > > > it inferred (as you have just done).
> > > > But I also don’t feel strongly enough about it to argue either way.
> > > > So, after reading my argument, if you still insist, I’ll happily
> > > > add this there.
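(Editor's sketch of the arithmetic behind point 2, assuming the de-dup map keeps a 16-byte key hash plus an 8-byte offset per entry, as Kafka's SkimpyOffsetMap does: adding a second 8-byte long per entry shrinks the map's capacity by a quarter.)

```java
public class DedupMapCapacity {
    // Assumed entry layout: 16-byte MD5 hash of the key + 8-byte offset;
    // KIP-280 would add one more 8-byte long (timestamp or header version).
    static long maxEntries(long bufferBytes, int bytesPerEntry) {
        return bufferBytes / bytesPerEntry;
    }

    public static void main(String[] args) {
        long buffer = 128L * 1024 * 1024; // log.cleaner.dedupe.buffer.size default
        System.out.println("keys before: " + maxEntries(buffer, 16 + 8));     // 5,592,405
        System.out.println("keys after:  " + maxEntries(buffer, 16 + 8 + 8)); // 4,194,304
    }
}
```

So roughly 25% fewer keys fit per cleaning round at the same buffer size, which is the extra-rounds impact Jun mentions.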
> > > >
> > > > -: 3. With the new cleaning strategy, we want to be a bit careful
> > > > with removing the last message in a partition (which is possible
> > > > now). We need to preserve the offset of the last message so that we
> > > > don't reuse the offset for a different message. One way is to
> > > > simply never remove the last message. Another way (suggested by
> > > > Jason) is to create an empty message batch.
> > > >
> > > > That is a good point, but isn’t the last message always kept
> > > > regardless?
> > > > In all of my tests with this approach, never have I seen it being
> > > > removed. This is not because I made it so while changing the code,
> > > > it was simply like this before, which made me smile!
> > > > Given these results, I just *assumed* (oops) that these scenarios
> > > > represented the reality, so the compaction would only affect the
> > > > “tail”, while the “head” remained untouched. Now that you say it’s
> > > > possible that the last message actually gets overwritten somehow, I
> > > > guess a new bullet point will have to be added to the KIP for this
> > > > (after I’ve found the time to review the portion of the code that
> > > > enacts this behaviour).
> > > >
> > > > Kind Regards,
> > > > Luís Cabral
> > > >
> > > > From: Jun Rao
> > > > Sent: 03 July 2018 23:58
> > > > To: dev
> > > > Subject: Re: [VOTE] KIP-280: Enhanced log compaction
> > > >
> > > > Hi, Luis,
> > > >
> > > > Thanks for the KIP. Overall, this seems a useful KIP. A few
> > > > comments below.
> > > >
> > > > 1. I guess both new configurations will be at the topic level?
> > > > 2. Since the log cleaner now needs to keep both the offset and
> > > > another long (say timestamp) in the de-dup map, it reduces the
> > > > number of keys that we can keep in the map and thus may require
> > > > more rounds of cleaning. This is probably not a big issue, but it
> > > > will be useful to document this impact in the KIP.
> > > > 3. With the new cleaning strategy, we want to be a bit careful with
> > > > removing the last message in a partition (which is possible now).
> > > > We need to preserve the offset of the last message so that we don't
> > > > reuse the offset for a different message. One way is to simply
> > > > never remove the last message. Another way (suggested by Jason) is
> > > > to create an empty message batch.
> > > >
> > > > Jun
> > > >
> > > > On Sat, Jun 9, 2018 at 12:39 AM, Luís Cabral
> > > > <luis_cab...@yahoo.com.invalid> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Any takers on having a look at this KIP and voting on it?
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
> > > > >
> > > > > Cheers,
> > > > > Luis
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>
