Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-20 Thread Gwen Shapira
Tom, Documentation improvements are always welcome. The docs are in /docs under the main repository, just send a PR for trunk and we are good :) Segment sizes - I have some objections, but this can be discussed in its own thread. I feel like I did enough hijacking and Eric may get annoyed at some

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-20 Thread Tom Crayford
Hi, From our perspective (running thousands of Kafka clusters), the main issues we see with compacted topics *aren't* disk space usage, or IO utilization of the log cleaner. Size matters a *lot* to the usability of consumers bootstrapping from the beginning - in fact we've been debating tuning o

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-19 Thread Jay Kreps
Hey Gwen, Yeah specifying in bytes versus the utilization percent would have been easier to implement. The argument against that is that basically users are super terrible at predicting and updating data sizes as stuff grows and you'd have to really set this then for each individual log perhaps? C

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-19 Thread Gwen Shapira
No, you are right that mapping dirty-bytes to dirty-map sizes is non-trivial. I think it would be good to discuss an alternative approach, but this is probably the wrong thread :) On Thu, May 19, 2016 at 4:36 AM, Ben Stopford wrote: > Hmm. Suffice to say, this isn’t an easy thing to tune, so I w

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-19 Thread Ben Stopford
Hmm. Suffice to say, this isn’t an easy thing to tune, so I would agree that a more holistic solution, which tuned itself to total disk availability, might be quite useful :) If we took the min.dirty.bytes route, and defaulted it to the segment size, that would work well for distributions where

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-18 Thread Gwen Shapira
Oops :) The docs are definitely not doing the feature any favors, but I didn't mean to imply the feature is thoughtless. Here's the thing I'm not getting: You are trading off disk space for IO efficiency. That's reasonable. But why not allow users to specify space in bytes? Basically tell the Log

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-18 Thread Jay Kreps
So in summary we never considered this a mechanism to give the consumer time to consume prior to compaction, just a mechanism to control space wastage. It sort of accidentally gives you that but it's super hard to reason about it as an SLA since it is relative to the log size rather than absolute.
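
A minimal sketch of why a ratio-based threshold is hard to treat as an SLA, assuming the dirty ratio is dirty bytes divided by total log bytes (clean plus dirty). The function names below are hypothetical illustrations, not Kafka APIs, and the sizes are made-up inputs:

```python
def dirty_ratio(clean_bytes: int, dirty_bytes: int) -> float:
    """Fraction of the log that is dirty (not yet compacted)."""
    total = clean_bytes + dirty_bytes
    return dirty_bytes / total if total else 0.0


def dirty_bytes_at_threshold(clean_bytes: int, min_cleanable_ratio: float) -> float:
    """Dirty bytes that must accumulate before the ratio is reached.

    Solves dirty / (clean + dirty) = r for dirty: dirty = r * clean / (1 - r).
    """
    if not 0.0 <= min_cleanable_ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    return min_cleanable_ratio * clean_bytes / (1.0 - min_cleanable_ratio)


GIB = 1024 ** 3

# Same ratio, very different absolute amounts of uncompacted data.
for clean in (1 * GIB, 100 * GIB):
    needed = dirty_bytes_at_threshold(clean, 0.5)
    print(f"clean={clean / GIB:.0f} GiB -> dirty needed={needed / GIB:.0f} GiB")
```

With a ratio of 0.5, a log holding 1 GiB of clean data becomes cleanable after about 1 GiB of new writes, while a log holding 100 GiB of clean data waits for about 100 GiB. How long a consumer has before compaction therefore depends on the log size and write rate, not on any fixed bound.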

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-18 Thread Jay Kreps
The sad part is I actually did think pretty hard about how to configure that stuff so I guess *I* think the config makes sense! Clearly trying to prevent my being shot :-) I agree the name could be improved and the documentation is quite spartan--no guidance at all on how to set it or what it trad

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-18 Thread Gwen Shapira
Interesting! This needs to be double checked by someone with more experience, but reading the code, it looks like "log.cleaner.min.cleanable.ratio" controls *just* the second property, and I'm not even convinced about that. A few facts: 1. Each cleaner thread cleans one log at a time. It always go

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-18 Thread Ben Stopford
Generally, this seems like a sensible proposal to me. Regarding (1): time and message count seem sensible. I can’t think of a specific use case for bytes but it seems like there could be one. Regarding (2): The setting log.cleaner.min.cleanable.ratio currently seems to have two uses. It con

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-17 Thread Gwen Shapira
and Spark's implementation is another good reason to allow compaction lag. I'm convinced :) We need to decide: 1) Do we need just .ms config, or anything else? Consumer lag is measured (and monitored) in messages, so if we need this feature to somehow work in tandem with consumer lag monito

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-17 Thread Eric Wasserman
James, Your pictures do an excellent job of illustrating my point. My mention of the additional "10's of minutes to hours" refers to how far after the original target checkpoint (T1 in your diagram) one may need to go to get to a checkpoint where all partitions of all topics are in the uncompac

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread James Cheng
> On May 16, 2016, at 9:21 PM, Eric Wasserman wrote: > > Gwen, > > For simplicity, the example I gave in the gist is for a single table with a > single partition. The salient point is that even for a single topic with one > partition there is no guarantee without the feature that one will be

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread James Cheng
We would find this KIP very useful. Our particular use case falls into the "application mistakes" portion of the KIP. We are storing source of truth data in log compacted topics, similar to the Confluent Schema Registry. One situation we had recently was a misbehaving application. It sent data

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Gwen Shapira
I see what you mean, Eric. I was unclear on the specifics of your architecture. It sounds like you have a table somewhere that maps checkpoints to lists of . In that case it is indeed useful to know that if the checkpoint was written N ms ago, you will be able to find the exact offsets by looking

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Eric Wasserman
Gwen, For simplicity, the example I gave in the gist is for a single table with a single partition. The salient point is that even for a single topic with one partition there is no guarantee without the feature that one will be able to restore some particular checkpoint as the offset indicated

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Jay Kreps
Yeah I think I gave a scenario but that is not the same as a concrete use case. I think the question you have is how common is it that people care about this and what concrete things would you build where you had this requirement? I think that would be good to figure out. I think the issue with th

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Gwen Shapira
I agree that log.cleaner.min.compaction.lag.ms gives slightly more flexibility for potentially-lagging consumers than tuning segment.roll.ms for the exact same scenario. If more people think that the use-case of "consumer which must see every single record, is running on a compacted topic, and is

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Jay Kreps
I think it would be good to hammer out some of the practical use cases--I definitely share your disdain for adding more configs. Here is my sort of theoretical understanding of why you might want this. As you say a consumer bootstrapping itself in the compacted part of the log isn't actually trave

Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Gwen Shapira
Hi Eric, Thank you for submitting this improvement suggestion. Do you mind clarifying the use-case for me? Looking at your gist: https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46 If my consumer started reading all the CDC topics from the very beginning in which they were created, without

[DISCUSS] KIP-58 - Make Log Compaction Point Configurable

2016-05-16 Thread Eric Wasserman
I would like to begin discussion on KIP-58 The KIP is here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable Jira: https://issues.apache.org/jira/browse/KAFKA-1981 Pull Request: https://github.com/apache/kafka/pull/1168 Thanks, Eric