[jira] [Created] (KAFKA-6701) synchronize Log modification between delete cleanup and async delete
Sumant Tambe created KAFKA-6701:
-----------------------------------

             Summary: synchronize Log modification between delete cleanup and async delete
                 Key: KAFKA-6701
                 URL: https://issues.apache.org/jira/browse/KAFKA-6701
             Project: Kafka
          Issue Type: Bug
            Reporter: Sumant Tambe
            Assignee: Sumant Tambe

Kafka broker crashes without any evident disk failures.

From [~becket_qin]: This looks like a bug in Kafka when topic deletion and log retention cleanup happen at the same time: the log retention cleanup may see a ClosedChannelException after the log has been renamed for async deletion. The root cause is that topic deletion should have set the isClosed flag of the partition log to true, and retention should not attempt to delete old log segments when the log is closed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
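The fix [~becket_qin] describes can be sketched as guarding segment deletion with a closed flag. This is an illustrative sketch only: the class and method names (SimpleLog, deleteOldSegments) are hypothetical and stand in for the actual broker code in kafka.log.Log, which is Scala.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical miniature of a partition log, showing only the proposed
// interaction: topic deletion sets a closed flag under a lock, and the
// retention thread checks that flag before deleting old segments.
class SimpleLog {
    private final Deque<String> segments = new ArrayDeque<>();
    private boolean closed = false;           // set by topic deletion
    private final Object lock = new Object(); // serializes close vs. cleanup

    void append(String segment) {
        synchronized (lock) {
            segments.addLast(segment);
        }
    }

    /** Called by topic deletion before the log is renamed for async delete. */
    void close() {
        synchronized (lock) {
            closed = true;
        }
    }

    /** Called by the retention thread; returns the number of segments deleted. */
    int deleteOldSegments() {
        synchronized (lock) {
            if (closed) {
                // Log already renamed for async deletion; skipping here avoids
                // the ClosedChannelException seen in the bug report.
                return 0;
            }
            // Retain the active (last) segment, delete the rest.
            int n = segments.size() > 1 ? segments.size() - 1 : 0;
            for (int i = 0; i < n; i++) {
                segments.removeFirst();
            }
            return n;
        }
    }
}
```

The key point is that both paths take the same lock, so retention can never observe a log that is mid-deletion.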
[jira] [Commented] (KAFKA-5886) Introduce delivery.timeout.ms producer config (KIP-91)
[ https://issues.apache.org/jira/browse/KAFKA-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188707#comment-16188707 ]

Sumant Tambe commented on KAFKA-5886:
-------------------------------------

The excellent KIP write-up is courtesy of [~jkoshy].

> Introduce delivery.timeout.ms producer config (KIP-91)
> ------------------------------------------------------
>
>                 Key: KAFKA-5886
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5886
>             Project: Kafka
>          Issue Type: Improvement
>          Components: producer
>            Reporter: Sumant Tambe
>            Assignee: Sumant Tambe
>             Fix For: 1.0.0
>
> We propose adding a new timeout, delivery.timeout.ms. The window of enforcement includes batching in the accumulator, retries, and the in-flight segments of the batch. With this config, the user has a guaranteed upper bound on when a record will either get sent, fail, or expire, measured from the point when send returns. In other words, we no longer overload request.timeout.ms to act as a weak proxy for an accumulator timeout, and instead introduce an explicit timeout that users can rely on without exposing any internals of the producer such as the accumulator.
> See [KIP-91|https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer] for more details.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (KAFKA-5886) Introduce delivery.timeout.ms producer config (KIP-91)
[ https://issues.apache.org/jira/browse/KAFKA-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16188300#comment-16188300 ]

Sumant Tambe commented on KAFKA-5886:
-------------------------------------

[~guozhang] I'm resuming this today. I'll keep you posted.

[~twbecker] With this KIP you can configure `delivery.timeout.ms` to a very large value and reduce `request.timeout.ms` to a much smaller value. Since the default `retries` is MAX_LONG, it should help your situation. What do you think?
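The configuration suggested above can be sketched as plain producer properties. The numeric values here are illustrative choices, not recommendations; only the config key names come from the KIP discussion.

```java
import java.util.Properties;

public class ProducerTimeoutConfig {
    public static Properties build() {
        Properties props = new Properties();
        // Large overall delivery bound: covers batching in the accumulator,
        // retries, and in-flight requests (illustrative value).
        props.setProperty("delivery.timeout.ms", "120000");
        // Much smaller per-request timeout so an individual request fails
        // fast and the batch is retried within the delivery bound.
        props.setProperty("request.timeout.ms", "10000");
        // Effectively unlimited retries; delivery.timeout.ms becomes the
        // real upper bound on how long a record can remain unresolved.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        return props;
    }
}
```

These properties would then be passed to the KafkaProducer constructor along with the usual serializer and bootstrap settings.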
[jira] [Commented] (KAFKA-5886) Introduce delivery.timeout.ms producer config (KIP-91)
[ https://issues.apache.org/jira/browse/KAFKA-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186378#comment-16186378 ]

Sumant Tambe commented on KAFKA-5886:
-------------------------------------

I think it's unlikely that this would merge in 1.0.0. It's a complex patch, and there are a few issues regarding tests that I've not had a chance to resolve. I plan to resume work next week.

I'm familiar with the heavy-producer-load case; that's how we ran into the expiry issue. We have an internal patched version to work around it.
[jira] [Created] (KAFKA-5886) Implement KIP-91
Sumant Tambe created KAFKA-5886:
-----------------------------------

             Summary: Implement KIP-91
                 Key: KAFKA-5886
                 URL: https://issues.apache.org/jira/browse/KAFKA-5886
             Project: Kafka
          Issue Type: Improvement
          Components: producer
            Reporter: Sumant Tambe
            Assignee: Sumant Tambe
             Fix For: 1.0.0

Implement [KIP-91|https://cwiki.apache.org/confluence/display/KAFKA/KIP-91+Provide+Intuitive+User+Timeouts+in+The+Producer]
[jira] [Commented] (KAFKA-5621) The producer should retry expired batches when retries are enabled
[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16113590#comment-16113590 ]

Sumant Tambe commented on KAFKA-5621:
-------------------------------------

Thread closed. Please share your thoughts on the KIP-91 [DISCUSS] thread instead.

> The producer should retry expired batches when retries are enabled
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5621
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5621
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>             Fix For: 1.0.0
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is raised to the user.
> It might be better for the producer to retry the expired batch, up to the configured number of retries, rather than failing it. This is more intuitive from the user's point of view.
> Further, the proposed behavior makes it easier for applications like mirror maker to provide ordering guarantees even when batches expire. Today, they would resend the expired batch and it would get added to the back of the queue, causing the output ordering to differ from the input ordering.
[jira] [Comment Edited] (KAFKA-5621) The producer should retry expired batches when retries are enabled
[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111237#comment-16111237 ]

Sumant Tambe edited comment on KAFKA-5621 at 8/2/17 4:38 PM:
-------------------------------------------------------------

Wait, what? Blocking indefinitely? Where? No, no.

On the one hand, I agree with [~apurva]'s point about not exposing the batching aspects in even more configs. On the other hand, {{message.max.delivery.wait.ms}} could be confusing because it would be interpreted as an end-to-end timeout: from batch-ready to the broker receiving it, or worse, from calling send to the broker receiving it. An abstract name like that is open to interpretation. Therefore, I personally prefer {{batch.expiry.ms}}, as it does what it says: it expires batches (in the accumulator). IMO, batching is a well-known concept in big-data infrastructure: Kafka and Spark, to name a few.

So let's assume we choose a name X for the accumulator timeout. Now the question is what its default should be. We can either:
1. support backwards-compatible behavior by setting X=request.timeout.ms, or
2. pick some other agreed-upon value (e.g., retries * request.timeout.ms), or
3. use MAX_LONG in the interest of letting Kafka do the work to the maximum possible extent.

I'm ok with 1 and 2. The concern I have about #3 is that "maximum extent" is not forever; it's literally an extreme.

The following is an attempt to answer [~ijuma]'s question. There are at least three classes of application I consider when I think about producer timeouts:
1. real-time apps (periodic healthcheck producer, temperature-sensor producer)
2. fail-fast apps (e.g., KMM)
3. best-effort apps (i.e., everything else; further sub-categorization is possible here)

A real-time app has a soft upper bound on *both* message delivery and failure of message delivery; in both cases, it wants to know. Such an app does *not* close the producer on the first error, because there's more data lined up right behind. It's ok to lose a few samples of temperature, so it simply drops them and moves on. Maybe when the drop rate is something like 70% it would close the producer. It may use acks=0. In this case, X=MAX_LONG is not suitable.

A fail-fast app does not have an upper bound on when the data gets to the broker; it's ready to wait a long time to give every message a chance. But if it's not making progress on a partition, it needs to know asap. That is, it has a bound on the failure notification. Such an app (IMO) would either
a. close the producer if ordering is important (that's how we run KMM at LinkedIn), or
b. queue the batch at the back of the queue for a later attempt, with obvious reordering.
In both cases, X=MAX_LONG is not suitable.

Now comes best effort: it has no bounds on success or failure notification. Frankly, I don't know which app really fits this bill. I've heard about apps configuring retries=MAX_LONG; I'll just say that I agree with [~becket_qin]'s opinion that this is considered bad.

+1 for announcing KIP-91. I'll try to capture the discussion in this thread in the KIP before announcing. Do you guys want me to do that? Or should I just link to this thread?
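The three candidate defaults for the accumulator timeout X discussed above can be written down concretely. The helper names and the sample values below are purely illustrative, not actual Kafka defaults.

```java
// Hypothetical helpers contrasting the three proposed defaults for the
// accumulator timeout X; none of this is Kafka client code.
class AccumulatorTimeoutDefaults {
    // Option 1: backwards compatible -- X = request.timeout.ms
    static long option1(long requestTimeoutMs) {
        return requestTimeoutMs;
    }

    // Option 2: some other agreed-upon value -- e.g., retries * request.timeout.ms
    static long option2(long requestTimeoutMs, int retries) {
        return (long) retries * requestTimeoutMs;
    }

    // Option 3: MAX_LONG -- let Kafka do the work "to the maximum possible
    // extent"; the comment above objects that this extreme suits neither
    // real-time nor fail-fast applications.
    static long option3() {
        return Long.MAX_VALUE;
    }
}
```

For example, with request.timeout.ms=30000 and retries=3, option 2 yields a 90-second accumulator timeout, versus 30 seconds for option 1.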
[jira] [Commented] (KAFKA-5621) The producer should retry expired batches when retries are enabled
[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103460#comment-16103460 ]

Sumant Tambe commented on KAFKA-5621:
-------------------------------------

{{queue.time.ms}} sounds a lot like {{linger.ms}} to me; it may be confusing for those who are new to Kafka. I'm naturally biased towards {{batch.expiry.ms}}, which is proposed in KIP-91. For backwards compatibility, the default should be equal to {{request.timeout.ms}}, I think.
[jira] [Commented] (KAFKA-5621) The producer should retry expired batches when retries are enabled
[ https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100349#comment-16100349 ]

Sumant Tambe commented on KAFKA-5621:
-------------------------------------

[~ijuma], [~apurva], it looks like this proposal will change the notional accumulator timeout from request.timeout.ms (r.t.m) to (r.t.m * retries). The bound on the send path of a record, from the point when send returns, is up to linger.ms + r.t.m * retries + retries * (r.t.m + retry.backoff.ms). So the time of the send-failure notification could potentially double, as two full retry cycles will be applied.

I prefer an explicit name for the accumulator timeout (we proposed batch.expiry.ms), but I'm well aware that adding new configs makes things harder for the end user. I also prefer to avoid overloading configs to do multiple jobs; that's confusing too. Applying retries to the accumulator is not intuitive to me; it seems to jack up the accumulator time by a multiple. Having said that, if there is strong agreement on not adding a new config, I'm fine with the trick proposed here.
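A worked instance of the bound quoted above may help. The parameter values here are illustrative only, not Kafka defaults:

```java
public class SendPathBound {
    /**
     * The bound from the comment above:
     * linger.ms + r.t.m * retries + retries * (r.t.m + retry.backoff.ms)
     */
    static long boundMs(long lingerMs, long requestTimeoutMs,
                        int retries, long retryBackoffMs) {
        return lingerMs
                + requestTimeoutMs * retries
                + (long) retries * (requestTimeoutMs + retryBackoffMs);
    }

    public static void main(String[] args) {
        // Illustrative values: linger.ms=5000, request.timeout.ms=30000,
        // retries=2, retry.backoff.ms=100.
        // 5000 + 30000*2 + 2*(30000 + 100) = 125200 ms
        System.out.println(boundMs(5_000, 30_000, 2, 100));
    }
}
```

The middle term (r.t.m * retries) is the part this proposal would fold into the accumulator timeout, which is why the failure notification can arrive roughly twice as late as before.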