Tom,
Documentation improvements are always welcome. The docs are in /docs under
the main repository; just send a PR for trunk and we are good :)
Segment sizes - I have some objections, but this can be discussed in its
own thread. I feel like I did enough hijacking and Eric may get annoyed at
some
Hi,
From our perspective (running thousands of Kafka clusters), the main issues
we see with compacted topics *aren't* disk space usage, or IO utilization
of the log cleaner.
Size matters a *lot* to the usability of consumers bootstrapping from the
beginning - in fact we've been debating tuning o
Hey Gwen,
Yeah specifying in bytes versus the utilization percent would have been
easier to implement. The argument against that is that basically users are
super terrible at predicting and updating data sizes as stuff grows, and
you'd then have to really set this for each individual log, perhaps?
C
No, you are right that mapping dirty-bytes to dirty-map sizes is
non-trivial. I think it would be good to discuss an alternative approach,
but this is probably the wrong thread :)
On Thu, May 19, 2016 at 4:36 AM, Ben Stopford wrote:
> Hmm. Suffice to say, this isn’t an easy thing to tune, so I w
Hmm. Suffice to say, this isn’t an easy thing to tune, so I would agree that a
more holistic solution, which tuned itself to total disk availability, might be
quite useful :)
If we took the min.dirty.bytes route, and defaulted it to the segment size,
that would work well for distributions where
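(Illustrative sketch only: min.dirty.bytes is not an existing Kafka config. Assuming it
worked the way Ben describes, with the threshold defaulting to the segment size, the
eligibility check might look roughly like this.)

    // Hypothetical sketch: "min.dirty.bytes" does not exist in Kafka today.
    // Idea: skip cleaning until at least this many dirty bytes have accumulated,
    // defaulting the threshold to the segment size when nothing is configured.
    public final class MinDirtyBytesCheck {

        static boolean eligibleForCleaning(long dirtyBytes, Long minDirtyBytes, long segmentSizeBytes) {
            long threshold = (minDirtyBytes != null) ? minDirtyBytes : segmentSizeBytes;
            return dirtyBytes >= threshold;
        }

        public static void main(String[] args) {
            // 512 MB dirty, no explicit setting, 1 GB segments -> not yet worth cleaning.
            System.out.println(eligibleForCleaning(512L << 20, null, 1L << 30)); // false
            // 2 GB dirty with the same defaults -> cleanable.
            System.out.println(eligibleForCleaning(2L << 30, null, 1L << 30));   // true
        }
    }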
Oops :)
The docs are definitely not doing the feature any favors, but I didn't mean
to imply the feature is thoughtless.
Here's the thing I'm not getting: You are trading off disk space for IO
efficiency. That's reasonable. But why not allow users to specify space in
bytes?
Basically tell the Log
So in summary we never considered this a mechanism to give the consumer
time to consume prior to compaction, just a mechanism to control space
wastage. It sort of accidentally gives you that but it's super hard to
reason about it as an SLA since it is relative to the log size rather than
absolute.
The sad part is I actually did think pretty hard about how to configure
that stuff so I guess *I* think the config makes sense! Clearly trying to
prevent my being shot :-)
I agree the name could be improved and the documentation is quite
spartan--no guidance at all on how to set it or what it trad
Interesting!
This needs to be double-checked by someone with more experience, but
reading the code, it looks like "log.cleaner.min.cleanable.ratio"
controls *just* the second property, and I'm not even convinced about
that.
Few facts:
1. Each cleaner thread cleans one log at a time. It always go
Generally, this seems like a sensible proposal to me.
Regarding (1): time and message count seem sensible. I can’t think of a
specific use case for bytes but it seems like there could be one.
Regarding (2):
The setting log.cleaner.min.cleanable.ratio currently seems to have two uses.
It con
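(For context on the ratio itself: the cleaner treats a log as cleanable once the fraction of
dirty bytes in it crosses the configured ratio, 0.5 by default. A rough sketch of that check,
not the actual LogCleaner code, is below.)

    // Rough sketch of the log.cleaner.min.cleanable.ratio check; the real
    // LogCleaner is written in Scala and tracks clean/dirty byte counts per log.
    public final class CleanableRatioCheck {

        static double cleanableRatio(long cleanBytes, long dirtyBytes) {
            long totalBytes = cleanBytes + dirtyBytes;
            return totalBytes == 0 ? 0.0 : (double) dirtyBytes / totalBytes;
        }

        static boolean shouldClean(long cleanBytes, long dirtyBytes, double minCleanableRatio) {
            return cleanableRatio(cleanBytes, dirtyBytes) > minCleanableRatio;
        }

        public static void main(String[] args) {
            // With the default ratio of 0.5: 4 GB clean / 1 GB dirty (0.2) is left
            // alone, while 4 GB clean / 6 GB dirty (0.6) becomes a cleaning candidate.
            System.out.println(shouldClean(4L << 30, 1L << 30, 0.5)); // false
            System.out.println(shouldClean(4L << 30, 6L << 30, 0.5)); // true
        }
    }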
and Spark's implementation is another good reason to allow compaction lag.
I'm convinced :)
We need to decide:
1) Do we need just a .ms config, or anything else? Consumer lag is
measured (and monitored) in messages, so if we need this feature to
somehow work in tandem with consumer lag monito
James,
Your pictures do an excellent job of illustrating my point.
My mention of the additional "10's of minutes to hours" refers to how far after
the original target checkpoint (T1 in your diagram) one may need to go to get to
a checkpoint where all partitions of all topics are in the uncompac
> On May 16, 2016, at 9:21 PM, Eric Wasserman wrote:
>
> Gwen,
>
> For simplicity, the example I gave in the gist is for a single table with a
> single partition. The salient point is that even for a single topic with one
> partition there is no guarantee without the feature that one will be
We would find this KIP very useful. Our particular use case falls into the
"application mistakes" portion of the KIP.
We are storing source-of-truth data in log-compacted topics, similar to the
Confluent Schema Registry. One situation we had recently was a misbehaving
application. It sent data
I see what you mean, Eric.
I was unclear on the specifics of your architecture. It sounds like
you have a table somewhere that maps checkpoints to lists of per-partition
offsets.
In that case it is indeed useful to know that if the checkpoint was
written N ms ago, you will be able to find the exact offsets by
looking
Gwen,
For simplicity, the example I gave in the gist is for a single table with a
single partition. The salient point is that even for a single topic with one
partition there is no guarantee without the feature that one will be able to
restore some particular checkpoint as the offset indicated
Yeah I think I gave a scenario but that is not the same as a concrete use
case. I think the question you have is how common it is that people care
about this, and what concrete things you would build where you had this
requirement. I think that would be good to figure out.
I think the issue with th
I agree that log.cleaner.min.compaction.lag.ms gives slightly more
flexibility for potentially-lagging consumers than tuning
segment.roll.ms for the exact same scenario.
If more people think that the use-case of "consumer which must see
every single record, is running on a compacted topic, and is
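(To make the comparison above concrete, a sketch under the assumption that the proposed lag
keys off record timestamps; this is not Kafka source code. segment.roll.ms only controls when
a segment closes, while a min.compaction.lag.ms-style rule holds a segment back from compaction
as long as it still contains records newer than the lag, regardless of when the segment rolled.)

    // Illustration of a min.compaction.lag.ms-style rule: a segment stays out of
    // compaction while its newest record is younger than the configured lag.
    public final class CompactionLagCheck {

        static boolean compactable(long largestRecordTimestampMs, long nowMs, long minCompactionLagMs) {
            return nowMs - largestRecordTimestampMs >= minCompactionLagMs;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            long oneHour = 60L * 60 * 1000;
            // Newest record is 10 minutes old, lag is 1 hour -> still protected.
            System.out.println(compactable(now - 10 * 60 * 1000, now, oneHour)); // false
            // Newest record is 2 hours old -> eligible for compaction.
            System.out.println(compactable(now - 2 * oneHour, now, oneHour));    // true
        }
    }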
I think it would be good to hammer out some of the practical use cases--I
definitely share your disdain for adding more configs. Here is my sort of
theoretical understanding of why you might want this.
As you say a consumer bootstrapping itself in the compacted part of the log
isn't actually trave
Hi Eric,
Thank you for submitting this improvement suggestion.
Do you mind clarifying the use-case for me?
Looking at your gist: https://gist.github.com/ewasserman/f8c892c2e7a9cf26ee46
If my consumer started reading all the CDC topics from the very
beginning, from the time they were created, without
I would like to begin discussion on KIP-58.
The KIP is here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-58+-+Make+Log+Compaction+Point+Configurable
Jira: https://issues.apache.org/jira/browse/KAFKA-1981
Pull Request: https://github.com/apache/kafka/pull/1168
Thanks,
Eric
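(For readers skimming the KIP, a minimal illustrative sketch of the kind of per-topic override
the proposal describes; the topic-level name min.compaction.lag.ms is assumed from the KIP,
with log.cleaner.min.compaction.lag.ms as the broker-level counterpart mentioned elsewhere in
the thread.)

    import java.util.Properties;

    // Illustrative only: per-topic overrides one might set if the KIP's proposed
    // config lands as described. Names are assumptions taken from the KIP text.
    public final class Kip58ConfigSketch {
        public static void main(String[] args) {
            Properties topicConfig = new Properties();
            topicConfig.put("cleanup.policy", "compact");
            // Keep records out of compaction until they are at least 24 hours old,
            // giving lagging or bootstrapping consumers a predictable window.
            topicConfig.put("min.compaction.lag.ms", Long.toString(24L * 60 * 60 * 1000));
            topicConfig.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }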