Hi Rajini,

Thanks for pointing that out; it looks like exactly what I had in mind. I wasn't able to find it by googling.

Jan


On 08/06/2018 12:31 PM, Rajini Sivaram wrote:
Can you take a look at KIP-280:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction
?

On Mon, Aug 6, 2018 at 10:55 AM, Jan Lukavský <je...@seznam.cz> wrote:

Hi,

I have a question about log compaction. LogCleaner's JavaDoc states that:

{quote}

A message with key K and offset O is obsolete if there exists a message
with key K and offset O' such that O < O'.

{/quote}

That works fine if messages arrive "in order", i.e. with timestamps
assigned at log-append time (apart from possible clock-synchronization
problems during leader rebalance). But if a topic may contain late
messages (because the producer explicitly assigns a timestamp to each
message), then compacting purely by offset might keep a message with an
older timestamp in the log in place of a newer one. Is this intentional?
Would it be possible to relax this so that log compaction prefers the
message's timestamp over its offset? What if the behavior of the
LogCleaner were changed to something like this:

{quote}

A message with key K, timestamp T1 and offset O1 is obsolete if there
exists a message with key K, timestamp T2 and offset O2 such that T1 < T2,
or T1 = T2 and O1 < O2.

{/quote}
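
To make the difference between the two rules concrete, here is a minimal
sketch in Java. It is not the LogCleaner's actual code: the Msg class, the
compact helper and the sample data are all hypothetical and exist only to
illustrate the two definitions of "obsolete" quoted above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {

    // Hypothetical minimal message; not Kafka's real record type.
    static final class Msg {
        final String key;
        final long timestamp;
        final long offset;
        final String value;

        Msg(String key, long timestamp, long offset, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.offset = offset;
            this.value = value;
        }
    }

    // Current rule: a is obsolete if b has the same key and a higher offset.
    static boolean obsoleteByOffset(Msg a, Msg b) {
        return a.key.equals(b.key) && a.offset < b.offset;
    }

    // Proposed rule: a is obsolete if b has the same key and a newer
    // timestamp, with the offset used only to break timestamp ties.
    static boolean obsoleteByTimestamp(Msg a, Msg b) {
        return a.key.equals(b.key)
                && (a.timestamp < b.timestamp
                    || (a.timestamp == b.timestamp && a.offset < b.offset));
    }

    // Keeps one surviving message per key under the chosen rule.
    static Map<String, Msg> compact(List<Msg> log, boolean byTimestamp) {
        Map<String, Msg> survivors = new LinkedHashMap<>();
        for (Msg m : log) {
            Msg kept = survivors.get(m.key);
            boolean replace = kept == null
                    || (byTimestamp ? obsoleteByTimestamp(kept, m)
                                    : obsoleteByOffset(kept, m));
            if (replace) {
                survivors.put(m.key, m);
            }
        }
        return survivors;
    }

    public static void main(String[] args) {
        // A late message: appended second (offset 1) but carrying an
        // older producer-assigned timestamp (100 < 200).
        List<Msg> log = List.of(
                new Msg("k", 200L, 0L, "newer"),
                new Msg("k", 100L, 1L, "late"));

        System.out.println(compact(log, false).get("k").value); // prints "late"
        System.out.println(compact(log, true).get("k").value);  // prints "newer"
    }
}
```

With the offset rule the late message wins because it was appended last;
with the timestamp rule the message carrying the newer timestamp survives.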

I'm aware that this would be much more complicated (because the clock
synchronization problem would have to be resolved), but this definition
seems better aligned with the time characteristics of the data. Should I
try to create a KIP, or has this already been discussed and considered an
unwanted (or even impossible) feature?

Thanks for any comments,

  Jan


