[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111237#comment-16111237 ]
Sumant Tambe edited comment on KAFKA-5621 at 8/2/17 4:38 PM:
-
Wait, what? Blocking indefinitely? Where? No, no.
On one hand, I agree with [~apurva]'s point about not exposing the batching
aspects in even more configs. On the other hand,
{{message.max.delivery.wait.ms}} could be confusing because it would be
interpreted as an end-to-end timeout: from batch-ready to the broker receiving
it, or worse, from calling send to the broker receiving it. An abstract name
like that is open to interpretation. Therefore, I personally prefer
{{batch.expiry.ms}} because it does what it says: it expires batches (in the
accumulator). IMO, batching is a well-known concept in big-data infrastructure:
Kafka and Spark, to name a few.
So let's assume we choose a name X for the accumulator timeout. The next
question is what its default should be. We can either 1. support
backwards-compatible behavior by setting X=request.timeout.ms, 2. pick some
other agreed-upon value (e.g., retries * request.timeout.ms), or 3. use
MAX_LONG in the interest of letting Kafka do the work to the maximum possible
extent.
I'm OK with 1 and 2. The concern I have about #3 is that "maximum extent" is
not forever. It's literally an extreme.
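To make the three candidate defaults concrete, here is a rough sketch of what the producer configuration could look like. Note that {{batch.expiry.ms}} is only the name proposed in this discussion, not an existing producer config, and the broker address, timeout, and retry values are made up for illustration:
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchExpiryConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        // "batch.expiry.ms" is the name proposed in this thread, not a released config.
        // Option 1: backwards-compatible default, X = request.timeout.ms
        props.put("batch.expiry.ms", 30000);
        // Option 2: some other agreed-upon value, e.g. retries * request.timeout.ms
        // props.put("batch.expiry.ms", 3 * 30000);
        // Option 3: MAX_LONG, i.e. effectively never expire batches in the accumulator
        // props.put("batch.expiry.ms", Long.MAX_VALUE);

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}
{code}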
Following is an attempt to answer [~ijuma]'s question.
There are at least three classes of applications I consider when I think about
producer timeouts: 1. real-time apps (periodic healthcheck producer,
temperature-sensor producer), 2. fail-fast apps (e.g., KMM), and 3. best-effort
apps (i.e., everything else; further sub-categorization is possible here).
A real-time app has a soft upper bound on *both* message delivery and failure
of message delivery; in both cases, it wants to know. Such an app does *not*
close the producer on the first error because there's more data lined up right
behind. It's OK to lose a few temperature samples, so it simply drops the
failed one and moves on. Maybe when the drop rate reaches something like 70% it
would close the producer. It may use acks=0. For this class, X=MAX_LONG is not
suitable.
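As a hedged illustration of this drop-and-move-on pattern (the topic name, sampling period, and 70% threshold are all placeholders, not anything prescribed in this thread):
{code:java}
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TemperatureSensorProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "0");   // fire-and-forget; a few lost samples are acceptable

        AtomicLong sent = new AtomicLong();
        AtomicLong dropped = new AtomicLong();

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        while (true) {
            String sample = String.valueOf(20.0 + Math.random());   // stand-in for a real sensor read
            sent.incrementAndGet();
            producer.send(new ProducerRecord<>("temperature", sample), (metadata, exception) -> {
                if (exception != null) {
                    dropped.incrementAndGet();   // drop the sample and move on; do not close on the first error
                }
            });
            // Only give up once the drop rate becomes extreme (here, roughly 70%).
            if (sent.get() > 100 && dropped.get() * 10 >= sent.get() * 7) {
                break;
            }
            Thread.sleep(1000);   // periodic sample
        }
        producer.close();
    }
}
{code}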
A fail-fast app does not have an upper bound on when the data gets to the
broker; it's ready to wait a long time to give every message a chance. But if
it's not making progress on a partition, it needs to know ASAP. That is, it has
a bound on the failure notification. Such an app (IMO) would either a. close
the producer if ordering is important (that's how we run KMM at LinkedIn), or
b. put the failed message at the back of the queue for a later attempt, with
obvious reordering. In both cases, X=MAX_LONG is not suitable.
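A hedged sketch of option (a), closing on the first failure to preserve ordering. The topic name and record source are placeholders, and this is not how KMM itself is implemented:
{code:java}
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FailFastMirrorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);   // keep ordering strict

        AtomicBoolean failed = new AtomicBoolean(false);
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        String[] mirrored = {"a", "b", "c"};   // stand-in for records consumed from the source cluster
        for (String value : mirrored) {
            if (failed.get()) break;           // stop producing as soon as a partition stops making progress
            producer.send(new ProducerRecord<>("mirror-topic", value), (metadata, exception) -> {
                if (exception != null) {
                    failed.set(true);          // fail fast: the operator needs to know ASAP
                }
            });
        }
        // Option a: close the producer so no later message can jump ahead of the failed one.
        producer.close();
        if (failed.get()) System.exit(1);
    }
}
{code}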
Finally, best effort: it has no bounds on success or failure notification.
Frankly, I don't know which app really fits this bill. I've heard about apps
configuring retries=MAX_LONG; I'll just say that I agree with [~becket_qin]'s
opinion that this is considered bad.
+1 for announcing KIP-91. I'll try to capture the discussion in this thread in
the KIP before announcing. Do you want me to do that, or should I just link to
this thread?