[ https://issues.apache.org/jira/browse/KAFKA-70?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102303#comment-13102303 ]
Jay Kreps commented on KAFKA-70:
--------------------------------

I agree the delete logic is correct. Log.scala keeps all segments in sorted order and enforces that segments can only be deleted from the front of that list (it uses takeWhile), so this should guarantee that segments are deleted from oldest to newest. When we take the patch we should add a comment to our existing code to the effect that the list is guaranteed to be in sorted order, since that is kind of implicit now but it is important.

Prashanth, one corner case here is when log.retention.size is smaller than the segment size (log.file.size). One thing to note is that if the currently active segment is marked for deletion, it will dutifully be deleted and a new, empty segment created, even if that segment is not full. Cleanup based on modification time always guarantees that the last message happened AT LEAST log.retention.hours ago, because it uses the modification time of the file, so the client always has at least that much time to pick up their messages. This gets weird, though, when we delete the active segment based on size. It looks to me that if I have a segment size of 100MB but a log.retention.size of 50MB, then when my single-segment log reaches 50MB we will delete the active segment, effectively all messages in the log, even though it likely contains as-yet undelivered messages which may have arrived only milliseconds ago. I think this will be confusing for people.

Two ways to fix it: one is to require that log.retention.size > log.file.size; the other is to change the delete logic so that it deletes all but the most recent segment (see the sketch below). Another way to say this is that since our granularity of cleanup is log.file.size, we have to either overshoot or undershoot the given retention size by as much as log.file.size - 1. It is probably safer to overshoot rather than undershoot what they give. Another alternative would be to phrase the retention configuration in terms of the number of files rather than the size in bytes, which would be more exact, but I think the way you have done it is more intuitive.

Another feature request: would it be possible to add a per-topic config similar to topic.log.retention.hours (say topic.log.retention.size) to allow configuring the retention size at the topic level? If not, I think we can take a patch that fixes the above issue and just open a bug for that.

> Introduce retention setting that depends on space
> -------------------------------------------------
>
>                 Key: KAFKA-70
>                 URL: https://issues.apache.org/jira/browse/KAFKA-70
>             Project: Kafka
>          Issue Type: New Feature
>    Affects Versions: 0.8
>              Labels: log, roll
>         Attachments: KAFKA-70.patch
>
>
> Currently there is this setting:
> log.retention.hours
> Introduce:
> log.retention.size
> The semantics would be: when either the size limit or the time limit is reached, the oldest messages will be deleted.
> This does not break backward compatibility and would make the system robust under scenarios where message size is not deterministic over time.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
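For illustration, here is a minimal sketch of the "delete all but the most recent segment" approach discussed in the comment above. The Segment case class, object, and method names are hypothetical and simplified for this example; this is not the actual Log.scala code or the attached patch.

{code:scala}
// Hypothetical, simplified sketch of size-based cleanup that never deletes the
// newest (active) segment; not the actual Log.scala code or the attached patch.
case class Segment(file: String, size: Long)

object SizeBasedRetention {

  /**
   * Given the log's segments in sorted order (oldest first), return the prefix of
   * segments that can be deleted while keeping the retained size >= retentionSize.
   * The newest (active) segment is never a candidate, so even when
   * log.retention.size < log.file.size a single-segment log is not wiped out.
   */
  def segmentsToDelete(segments: Seq[Segment], retentionSize: Long): Seq[Segment] = {
    // Never consider the most recent segment for deletion.
    val candidates = segments.dropRight(1)
    var excess = segments.map(_.size).sum - retentionSize
    // takeWhile only ever removes a prefix of the sorted list, so deletion proceeds
    // strictly from oldest to newest, and it stops before the retained size would
    // drop below retentionSize (overshooting the limit rather than undershooting it).
    candidates.takeWhile { seg =>
      if (excess - seg.size >= 0) { excess -= seg.size; true }
      else false
    }
  }
}
{code}

For example, with a single 50MB segment and retentionSize = 50MB the candidate list is empty and nothing is deleted; with segments of 100MB, 100MB, and 40MB and retentionSize = 120MB only the oldest 100MB segment is deleted, leaving 140MB on disk, an overshoot of less than one segment size.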