For log.retention.bytes.per.topic and log.retention.hours.per.topic, the
current interpretation is that those are tight bounds. In other words, a
segment is deleted only when one of those thresholds is violated. To further
satisfy log.retention.bytes.global, the per-topic thresholds may no longer be
tight, i.e., we may need to delete a segment even when no per-topic threshold
is violated.
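To make the distinction concrete, here is a minimal sketch of the two interpretations (illustrative Python, not Kafka code; all function and parameter names are hypothetical):

```python
# Illustrative sketch only; names are hypothetical, not actual Kafka code.

def violates_per_topic(topic_size_bytes, topic_age_hours,
                       retention_bytes, retention_hours):
    """Tight bound: a segment is deleted only when one of the
    per-topic thresholds is actually violated."""
    return (topic_size_bytes > retention_bytes
            or topic_age_hours > retention_hours)


def must_delete(topic_size_bytes, topic_age_hours,
                retention_bytes, retention_hours,
                total_size_bytes, global_retention_bytes):
    """With a global threshold, the per-topic bounds are no longer
    tight: deletion can be forced even when no per-topic threshold
    is violated, purely because the aggregate is over the limit."""
    return (violates_per_topic(topic_size_bytes, topic_age_hours,
                               retention_bytes, retention_hours)
            or total_size_bytes > global_retention_bytes)
```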
Thanks,

Jun

On Tue, May 27, 2014 at 12:22 AM, András Serény
<sereny.and...@gravityrd.com> wrote:

> No, I think more specific settings should get a chance first. I'm
> suggesting that, provided there is a segment rolled for a topic, a
> violation of *any* of log.retention.bytes.per.topic,
> log.retention.hours.per.topic, or a future log.retention.bytes.global
> would cause segments to be deleted.
>
> As far as I understand, the current logic says
>
> (1)
> for each topic, if there is a segment already rolled {
>   mark segments eligible for deletion due to
>     log.retention.hours.for.this.topic
>   if log.retention.bytes.for.this.topic is still violated, mark
>     segments eligible for deletion due to
>     log.retention.bytes.for.this.topic
> }
>
> After this cleanup cycle, there could be another one, taking into
> account the global threshold. For instance, something along the lines
> of
>
> (2)
> if after (1) log.retention.bytes.global is still violated, for each
> topic, if there is a segment already rolled {
>   calculate the required size for this topic (e.g. the proportional
>     size, or simply (full size - threshold)/#topics?)
>   mark segments exceeding the required size for deletion
> }
>
> Regards,
> András
>
> On 5/23/2014 4:46 PM, Jun Rao wrote:
>
>> Yes, that's possible. There is a default log.retention.bytes for
>> every topic. By introducing a global threshold, we may have to delete
>> data from logs whose size is smaller than log.retention.bytes. So,
>> are you saying that the global threshold has precedence?
>>
>> Thanks,
>>
>> Jun
>>
>> On Fri, May 23, 2014 at 2:26 AM, András Serény
>> <sereny.and...@gravityrd.com> wrote:
>>
>>> Hi Kafka users,
>>>
>>> this feature would also be very useful for us. With lots of topics
>>> of different volume (and as they grow in number), it could become
>>> tedious to maintain topic-level settings.
>>>
>>> As a start, I think uniform reduction is a good idea.
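The two-pass cleanup sketched in (1) and (2) above could be made concrete roughly as follows (illustrative Python, not Kafka code; the data structures are hypothetical, and the hours check in pass (1) is omitted for brevity):

```python
# Illustrative sketch of the two-pass cleanup from (1) and (2).
# A "topic" is a list of rolled segment sizes, oldest first; the active
# (unrolled) segment is assumed to be excluded already.

def cleanup(topics, per_topic_bytes, global_bytes):
    """topics: dict of name -> list of rolled segment sizes, oldest first.
    Pass (1): enforce each topic's byte threshold.
    Pass (2): if the global threshold is still violated, shrink every
    topic by the evenly split excess, (full size - threshold)/#topics."""
    deleted = []

    # Pass (1): per-topic threshold (hours check omitted for brevity)
    for name, segments in topics.items():
        while segments and sum(segments) > per_topic_bytes[name]:
            deleted.append((name, segments.pop(0)))

    # Pass (2): global threshold
    total = sum(sum(s) for s in topics.values())
    if total > global_bytes:
        excess_per_topic = (total - global_bytes) / len(topics)
        for name, segments in topics.items():
            target = sum(segments) - excess_per_topic
            while segments and sum(segments) > target:
                deleted.append((name, segments.pop(0)))

    return deleted
```

The evenly split excess is only one of the options mentioned above; a proportional split would shrink large topics more than small ones.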
>>> Logs wouldn't be retained as long as you want, but that's already
>>> the case when a log.retention.bytes setting is specified. As for
>>> early rolling, I don't think it's necessary: currently, if there is
>>> no log segment eligible for deletion, the log.retention.bytes and
>>> log.retention.hours settings won't kick in, so it's possible to
>>> exceed these limits, which is completely fine (please correct me if
>>> I'm mistaken here).
>>>
>>> All in all, introducing a global threshold doesn't seem to induce a
>>> considerable change in the current retention logic.
>>>
>>> Regards,
>>> András
>>>
>>> On 5/8/2014 2:00 AM, vinh wrote:
>>>
>>>> Agreed… a global knob is a bit tricky, for exactly the reason
>>>> you've identified. Perhaps the problem could be simplified, though,
>>>> by considering the context and purpose of Kafka. I would use a
>>>> persistent message queue because I want to guarantee that
>>>> data/messages don't get lost. But since Kafka is not meant to be a
>>>> long-term storage solution (other products can be used for that), I
>>>> would clarify that guarantee to apply only to the most recent
>>>> messages, up until a certain configured threshold (i.e. max 24 hrs,
>>>> max 500GB, etc.). Once those thresholds are reached, old messages
>>>> are deleted first.
>>>>
>>>> To ensure no message loss (up to a limit), I must ensure Kafka is
>>>> highly available. There's a small chance that the message deletion
>>>> rate is the same as the receive rate: for example, when the
>>>> incoming volume is so high that the size threshold is reached
>>>> before the time threshold. But I may be OK with that, because if
>>>> Kafka goes down, it can cause upstream applications to fail. This
>>>> can result in higher losses overall, and particularly of the most
>>>> *recent* messages.
>>>>
>>>> In other words, in a persistent but ephemeral message queue, I
>>>> would give higher precedence to recent messages over older ones.
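The point above that retention only ever considers rolled segments, so the configured limits can legitimately be exceeded, could be sketched like this (illustrative Python, not Kafka code; the last list entry stands in for the active segment):

```python
# Illustrative sketch: only rolled segments are eligible for deletion,
# so the active segment can push a log past its configured byte limit.

def deletable(segments, retention_bytes):
    """segments: sizes oldest -> newest; the last entry is the active
    segment and is never deleted, even if the limit is exceeded."""
    if not segments:
        return []
    rolled = segments[:-1]  # the active segment is excluded up front
    out = []
    total = sum(segments)
    for size in rolled:
        if total <= retention_bytes:
            break
        out.append(size)
        total -= size
    return out
```

With a single (active) segment nothing is eligible, so the limit simply isn't enforced, matching the behavior described above.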
>>>> On the flip side, by allowing Kafka to go down when a disk is
>>>> full, applications are forced to deal with the issue. This adds
>>>> complexity to apps, but perhaps it's not a bad thing. After all,
>>>> for scalability, all apps should be designed to handle failure.
>>>>
>>>> Having said that, the next step is to decide which messages to
>>>> delete first. I believe that's a separate issue and has its own
>>>> complexities, too.
>>>>
>>>> The main idea, though, is that a global knob would provide
>>>> flexibility, even if not used. From an operations perspective, if
>>>> we can't ensure HA for all applications/components, it would be
>>>> good if we can for at least some of the core ones, like Kafka.
>>>> This is much easier said than done, though.
>>>>
>>>> On May 5, 2014, at 9:16 AM, Jun Rao <jun...@gmail.com> wrote:
>>>>
>>>>> Yes, your understanding is correct. A global knob that controls
>>>>> aggregate log size may make sense. What would be the expected
>>>>> behavior when that limit is reached? Would you reduce the
>>>>> retention uniformly across all topics? Then it just means that
>>>>> some of the logs may not be retained as long as you want. Also,
>>>>> we need to think through what happens when every log has only 1
>>>>> segment left and yet the total size still exceeds the limit. Do
>>>>> we roll log segments early?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jun
>>>>>
>>>>> On Sun, May 4, 2014 at 4:31 AM, vinh <v...@loggly.com> wrote:
>>>>>
>>>>>> Thanks Jun. So if I understand this correctly, there really is
>>>>>> no master property to control the total aggregate size of all
>>>>>> Kafka data files on a broker.
>>>>>>
>>>>>> log.retention.size and log.file.size are great for managing data
>>>>>> at the application level. In our case, application needs change
>>>>>> frequently, and performance itself is an ever-evolving feature.
>>>>>> This means various configs are constantly changing, like topics,
>>>>>> # of partitions, etc.
>>>>>>
>>>>>> What rarely changes, though, is provisioned hardware resources.
>>>>>> So a setting to control the total aggregate size of Kafka logs
>>>>>> (or persisted data, for better clarity) would definitely
>>>>>> simplify things at an operational level, regardless of what
>>>>>> happens at the application level.
>>>>>>
>>>>>> On May 2, 2014, at 7:49 AM, Jun Rao <jun...@gmail.com> wrote:
>>>>>>
>>>>>>> log.retention.size controls the total size in a log dir (per
>>>>>>> partition). log.file.size controls the size of each log segment
>>>>>>> in the log dir.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jun
>>>>>>>
>>>>>>> On Thu, May 1, 2014 at 9:31 PM, vinh <v...@loggly.com> wrote:
>>>>>>>
>>>>>>>> In the 0.7 docs, the descriptions for log.retention.size and
>>>>>>>> log.file.size sound very much the same. In particular, that
>>>>>>>> they apply to a single log file (or log segment file).
>>>>>>>>
>>>>>>>> http://kafka.apache.org/07/configuration.html
>>>>>>>>
>>>>>>>> I'm beginning to think there is no setting to control the max
>>>>>>>> aggregate size of all logs. If this is correct, what would be
>>>>>>>> a good approach to enforce this requirement? In my particular
>>>>>>>> scenario, I have a lot of data being written to Kafka at a
>>>>>>>> very high rate, so a 1TB disk can easily be filled up in 24
>>>>>>>> hrs or so. One option is to add more Kafka brokers to add more
>>>>>>>> disk space to the pool, but I'd like to avoid that and see if
>>>>>>>> I can simply configure Kafka to not write more than 1TB
>>>>>>>> aggregate. Otherwise, Kafka will OOM and kill itself, and
>>>>>>>> possibly crash the node itself because the disk is full.
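Given that log.retention.size caps one partition's log and log.file.size caps one segment, their ratio gives roughly how many full segments a partition keeps on disk. A quick back-of-the-envelope check with the values from this thread (illustrative Python, not Kafka code):

```python
# Illustrative arithmetic only, not Kafka code.
# log.file.size caps one segment; log.retention.size caps one
# partition's log, so the ratio approximates segments kept on disk.

def segments_retained(retention_size, file_size):
    """Roughly how many full segments one partition keeps."""
    return retention_size // file_size

# With the thread's values: 100 GB retention / 512 MB segments
# -> about 200 full segments per partition.
```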
>>>>>>>> On May 1, 2014, at 9:21 PM, vinh <v...@loggly.com> wrote:
>>>>>>>>
>>>>>>>>> Using Kafka 0.7.2, I have the following in server.properties:
>>>>>>>>>
>>>>>>>>> log.retention.hours=48
>>>>>>>>> log.retention.size=107374182400
>>>>>>>>> log.file.size=536870912
>>>>>>>>>
>>>>>>>>> My interpretation of this is:
>>>>>>>>> a) a single log segment file over 48 hrs old will be deleted
>>>>>>>>> b) the total combined size of *all* logs is 100GB
>>>>>>>>> c) a single log segment file is limited to 500MB in size
>>>>>>>>>    before a new segment file is spawned
>>>>>>>>> d) a "log file" can be composed of many "log segment files"
>>>>>>>>>
>>>>>>>>> But even after setting the above, I find that the total
>>>>>>>>> combined size of all Kafka logs on disk is 200GB right now.
>>>>>>>>> Isn't log.retention.size supposed to limit it to 100GB? Am I
>>>>>>>>> missing something? The docs are not really clear, especially
>>>>>>>>> when it comes to distinguishing between a "log file" and a
>>>>>>>>> "log segment file".
>>>>>>>>>
>>>>>>>>> I have disk monitoring. But like anything else in software,
>>>>>>>>> even monitoring can fail. Via configuration, I'd like to make
>>>>>>>>> sure that Kafka does not write more than the available disk
>>>>>>>>> space. Or something like log4j, where I can set a max number
>>>>>>>>> of log files and the max size per file, which essentially
>>>>>>>>> allows me to set a max aggregate size limit across all logs.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> -Vinh
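One plausible explanation for the 200GB observation, given that log.retention.size is enforced per partition (per log dir) rather than per broker, is that the aggregate simply scales with the partition count. An illustrative calculation (Python; not Kafka code):

```python
# Illustrative arithmetic, not Kafka code: log.retention.size bounds one
# partition's log dir, so the broker-wide aggregate scales with the
# number of partitions hosted.

LOG_RETENTION_SIZE = 107374182400  # 100 GB, from the config above


def max_aggregate_bytes(num_partitions,
                        retention_size=LOG_RETENTION_SIZE):
    """Upper bound on total log size across all partitions on a broker."""
    return num_partitions * retention_size

# Two partitions on the broker would already be enough to explain
# 200GB on disk under a 100GB per-partition setting.
```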