[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111237#comment-16111237 ]

Sumant Tambe edited comment on KAFKA-5621 at 8/2/17 4:38 PM:
-------------------------------------------------------------

Wait, what? Blocking indefinitely? Where? No, no.

On one hand, I agree with [~apurva]'s point about not exposing the batching 
aspects in even more configs. On the other hand, 
{{message.max.delivery.wait.ms}} could be confusing because it would be 
interpreted as an end-to-end timeout: from batch-ready to the broker receiving 
it, or worse, from calling send to the broker receiving it. An abstract name 
like that is open to interpretation. Therefore, I personally prefer 
{{batch.expiry.ms}} because it does what it says: it expires batches (in the 
accumulator). IMO, batching is a well-known concept in big-data 
infrastructure: Kafka and Spark, to name a few. 

So let's assume we choose a name X for the accumulator timeout. The next 
question is what its default should be. We can either 1. support 
backwards-compatible behavior by setting X=request.timeout.ms, 2. pick some 
other agreed-upon value (e.g., retries * request.timeout.ms), or 3. use 
MAX_LONG in the interest of letting Kafka do the work to the maximum possible 
extent.
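
To make the three options concrete, here is a minimal sketch in 
producer-config terms. Note that {{batch.expiry.ms}} is only the name proposed 
in this thread, not an existing producer config, so treat this as illustrative 
only:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BatchExpiryDefaults {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
        props.put(ProducerConfig.RETRIES_CONFIG, 3);

        // Option 1: backwards-compatible, X = request.timeout.ms
        props.put("batch.expiry.ms", 30000);
        // Option 2: some other agreed-upon value, e.g., retries * request.timeout.ms
        props.put("batch.expiry.ms", 3 * 30000);
        // Option 3: MAX_LONG, i.e., never expire batches in the accumulator
        props.put("batch.expiry.ms", Long.MAX_VALUE);
    }
}
{code}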

I'm ok with 1 and 2. The concern I have about #3 is that "maximum extent" is 
not forever. It's literally an extreme. 

Following is an attempt to answer [~ijuma]'s question.

There are at least three classes of applications I consider when I think about 
producer timeouts: 1. real-time apps (periodic healthcheck producer, 
temperature-sensor producer), 2. fail-fast apps (e.g., KMM), and 3. 
best-effort apps (i.e., everything else; further sub-categorization is 
possible here). 

A real-time app has a soft upper bound on *both* message delivery and failure 
of message delivery. In both cases, it wants to know. Such an app does *not* 
close the producer on the first error because there's more data lined up right 
behind. It's ok to lose a few samples of temperature, so it simply drops them 
and moves on. Maybe when the drop rate reaches something like 70% it would 
close the producer. It may use acks=0. In this case, X=MAX_LONG is not 
suitable.
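
A minimal sketch of that pattern follows. The topic name, the sampling call, 
and the 70% threshold are all illustrative, not from any real app:

{code:java}
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class TemperatureProducer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "0"); // losing a few samples is acceptable
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        AtomicLong sent = new AtomicLong();
        AtomicLong dropped = new AtomicLong();

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                sent.incrementAndGet();
                producer.send(new ProducerRecord<>("temperature", readSensor()), (metadata, e) -> {
                    if (e != null) {
                        dropped.incrementAndGet(); // drop the sample and move on
                    }
                });
                // Give up only when the drop rate becomes extreme (~70% here).
                if (sent.get() > 100 && dropped.get() * 100 / sent.get() > 70) {
                    break; // try-with-resources closes the producer
                }
                Thread.sleep(1000); // periodic sample
            }
        }
    }

    private static String readSensor() { return "25.0"; } // stand-in for a real sensor
}
{code}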

A fail-fast app does not have an upper bound on when the data gets to the 
broker. It's ready to wait for a long time to give every message a chance. 
But if it's not making progress on a partition, it needs to know asap. That 
is, it has a bound on the failure notification. Such an app (IMO) would 
either a. close the producer if ordering is important (that's how we run KMM 
at LinkedIn, as sketched below), or b. re-queue the failed message at the 
back of the queue for a later attempt, with obvious reordering. In both 
cases, X=MAX_LONG is not suitable.
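
Here is a sketch of option a. {{FailFastSender}} and its flag are hypothetical 
helpers; the point is only the close-on-first-failure idea. The flag is set in 
the callback but acted on at the next send, which avoids closing the producer 
from its own I/O thread:

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FailFastSender {
    private final AtomicBoolean failed = new AtomicBoolean(false);

    // Returns false once any send has failed. The caller must then stop,
    // so no later message can overtake a failed one and break ordering.
    public boolean send(Producer<byte[], byte[]> producer,
                        ProducerRecord<byte[], byte[]> record) {
        if (failed.get()) {
            producer.close(); // ordering matters: stop before sending more
            return false;
        }
        producer.send(record, (metadata, e) -> {
            if (e != null) {
                failed.set(true); // a partition stopped making progress
            }
        });
        return true;
    }
}
{code}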

Now comes Best Effort: it has no bounds on success or failure notification. 
Frankly, I don't know which app really fits this bill. I've heard about apps 
configuring retries=MAX_LONG. I'll just say that I agree with 
[~becket_qin]'s opinion that it's considered bad. 
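
For reference, what that configuration amounts to; {{retries}} is an int 
config, so MAX_LONG in practice means {{Integer.MAX_VALUE}}:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class BestEffortConfig {
    public static Properties props() {
        Properties props = new Properties();
        // Effectively "retry forever": a stuck partition can stall delivery
        // indefinitely, which is why this is considered bad practice.
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        return props;
    }
}
{code}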

+1 for announcing KIP-91. I'll try to capture the discussion in this thread in 
the KIP before announcing. Do you guys want me to do that? Or should I just 
link to this thread?



> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>             Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is 
> raised to the user.
> It might be better for the producer to retry the expired batch up to the 
> configured number of retries. This is more intuitive from the user's point 
> of view. 
> Further the proposed behavior makes it easier for applications like mirror 
> maker to provide ordering guarantees even when batches expire. Today, they 
> would resend the expired batch and it would get added to the back of the 
> queue, causing the output ordering to be different from the input ordering.


