[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118925#comment-14118925 ]

Guozhang Wang commented on KAFKA-998:
-------------------------------------

This ticket can be closed as Won't Fix, since we are moving to the new producer now.

> Producer should not retry on non-recoverable error codes
> --------------------------------------------------------
>
>                 Key: KAFKA-998
>                 URL: https://issues.apache.org/jira/browse/KAFKA-998
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.0, 0.8.1
>            Reporter: Joel Koshy
>            Assignee: Guozhang Wang
>         Attachments: KAFKA-998.v1.patch
>
>
> Based on a discussion with Guozhang. The producer currently retries on all
> error codes (including messagesizetoolarge which is pointless to retry on).
> This can slow down the producer unnecessarily.
> If at all we want to retry on that error code we would need to retry with a
> smaller batch size, but that's a separate discussion.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779636#comment-13779636 ]

Jun Rao commented on KAFKA-998:
-------------------------------

My feeling is that it may not be very easy to do a quick fix. Currently, the cause exceptions are eaten at several places just so that we can pass back unsuccessfully sent messages.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
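Jun's point is that only the list of unsent messages is passed back, so the exception behind each failure is lost. Below is a minimal Python sketch of the alternative, which returns each failed message paired with its cause. `send_one()`, `Failure`, `MAX_MESSAGE_BYTES` and `MessageSizeTooLargeError` are hypothetical stand-ins, not Kafka API.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

MAX_MESSAGE_BYTES = 16  # hypothetical size limit, kept tiny for illustration


class MessageSizeTooLargeError(Exception):
    """Hypothetical stand-in for Kafka's MessageSizeTooLargeException."""


@dataclass
class Failure:
    message: bytes
    cause: Optional[Exception]  # the exception that made this send fail


def send_one(message: bytes) -> None:
    """Hypothetical per-message send that rejects oversized payloads."""
    if len(message) > MAX_MESSAGE_BYTES:
        raise MessageSizeTooLargeError(f"{len(message)} > {MAX_MESSAGE_BYTES} bytes")


def send_batch(messages: Sequence[bytes]) -> List[Failure]:
    """Return the unsent messages together with their causes instead of eating the exceptions."""
    failures: List[Failure] = []
    for message in messages:
        try:
            send_one(message)
        except Exception as e:  # record the cause alongside the message
            failures.append(Failure(message, e))
    return failures
```

A caller can then check `isinstance(f.cause, MessageSizeTooLargeError)` for each returned failure and decide not to retry those messages.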
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778974#comment-13778974 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

[~junrao] Would it be an easier short-term fix to at least set the root cause on the FailedToSendMessageException, so that we could see the MessageSizeTooLargeException as the cause of the FTSME? Or is that not easy to do?
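Jason's short-term fix amounts to exception chaining. In Java this would use the `Throwable(String, Throwable)` constructor and `getCause()`. The Python sketch below uses `raise ... from`, which sets `__cause__`. Both exception classes and `send_with_retries()` are hypothetical stand-ins. The loop still retries every error, as the 0.8 producer does; the only change is that the root cause survives.

```python
class MessageSizeTooLargeError(Exception):
    """Hypothetical stand-in for Kafka's MessageSizeTooLargeException."""


class FailedToSendMessageError(Exception):
    """Hypothetical stand-in for Kafka's FailedToSendMessageException."""


def send_with_retries(attempt, max_retries: int) -> None:
    """Call attempt() up to max_retries + 1 times, chaining the last error as the cause."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            attempt()
            return
        except Exception as e:
            last_error = e
    # 'raise ... from' sets __cause__, so the caller can see the root cause.
    raise FailedToSendMessageError(
        f"Failed to send messages after {max_retries + 1} attempts."
    ) from last_error


def oversized() -> None:
    raise MessageSizeTooLargeError("message is larger than the broker limit")


try:
    send_with_retries(oversized, max_retries=3)
except FailedToSendMessageError as ftsme:
    print(type(ftsme.__cause__).__name__)  # prints MessageSizeTooLargeError
```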
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778894#comment-13778894 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

Jun,

Yeah, sorry, I forgot I had filed KAFKA-1025 to address my concerns about exposing recoverability.

Thanks,

Jason
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778883#comment-13778883 ]

Jun Rao commented on KAFKA-998:
-------------------------------

Jason,

This patch just won't retry sending the data when hitting a MessageSizeTooLargeException. It doesn't really address your main concern, which is that the caller doesn't know the real cause of the failure. Addressing this issue completely will need some more thought about the producer logic, and the changes required may be non-trivial. So, I am not sure if we should do this in 0.8.
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778822#comment-13778822 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

[~fancyrao] I'd love to see this in 0.8. A message-too-large failure currently reaches the producer client as a FailedToSendMessageException, which makes it indistinguishable from any other kind of failure, including those for which subdividing the batch and retrying are not viable options. The workaround described is not workable in practice, since the FTSME does not include any root-cause information (a simple causedBy() method might help in that regard)!
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765633#comment-13765633 ]

Jun Rao commented on KAFKA-998:
-------------------------------

Thanks for the patch. Looks good overall. Just one comment:

10. ErrorMapping.fatalException(): Should we rename it to unrecoverableException? MessageSizeTooLarge doesn't seem like a fatal exception.

I am not sure if it's worth patching this in 0.8. The workaround is to reduce the batch size, as well as reducing the number of retries and the retry interval.
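Jun's rename separates "should not be retried" from "fatal". Below is a minimal Python sketch of such a predicate, assuming error codes arrive as plain integers. The numeric values are believed to match the Kafka wire-protocol error codes but should be checked against `ErrorMapping`. Treating only MessageSizeTooLarge as unrecoverable is an illustrative policy, not necessarily what the patch does.

```python
from enum import IntEnum


class ErrorCode(IntEnum):
    # Assumed to match Kafka wire-protocol error codes; verify against ErrorMapping.
    NO_ERROR = 0
    LEADER_NOT_AVAILABLE = 5
    NOT_LEADER_FOR_PARTITION = 6
    REQUEST_TIMED_OUT = 7
    MESSAGE_SIZE_TOO_LARGE = 10


# Hypothetical policy: errors that resending the same request cannot fix.
UNRECOVERABLE = frozenset({ErrorCode.MESSAGE_SIZE_TOO_LARGE})


def unrecoverable_exception(code: int) -> bool:
    """Sketch of the renamed ErrorMapping.fatalException(): True means 'do not retry'."""
    return code in UNRECOVERABLE  # IntEnum members compare and hash like ints
```

Leader changes and request timeouts stay retryable under this policy because they can clear up on their own. An oversized message fails the same way on every attempt.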
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13761100#comment-13761100 ]

Neha Narkhede commented on KAFKA-998:
-------------------------------------

>> Do people think this is small and important enough to apply to 0.8?

+1. Guozhang, do you mind submitting a Review Board request?
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759257#comment-13759257 ]

Joel Koshy commented on KAFKA-998:
----------------------------------

Oh, I thought this was for 0.8; it does apply on trunk. Do people think this is small and important enough to apply to 0.8?

Another comment after thinking about the patch: in dispatchSerializedData, would it be better to just drop data that have hit the message size limit? That way, there is no need to return needRetry, so the dispatchSerializedData signature remains the same. The disadvantage is that we won't propagate a FailedToSendMessageException for such messages to the caller. For the producer in async mode that is probably fine, since right now the caller can't really do much with that exception. In sync mode the caller could perhaps decide to send fewer messages at once, but even then we don't really say which topics/messages hit the message size limit, so I think dropping is fine in that case as well. Furthermore, this would be covered by KAFKA-1026 to a large degree.
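Joel's alternative keeps dispatchSerializedData returning only the data worth retrying. Data that hit the size limit is logged and dropped. Below is a simplified Python sketch. The function name mirrors the Scala method, but its signature, the per-partition dictionaries and the log text are hypothetical. Only failed partitions are assumed to appear in `errors_by_partition`.

```python
import logging
from typing import Dict, List

MESSAGE_SIZE_TOO_LARGE = 10  # assumed Kafka protocol code for an oversized message

log = logging.getLogger("producer-sketch")


def dispatch_serialized_data(
    errors_by_partition: Dict[str, int],
    data_by_partition: Dict[str, List[bytes]],
) -> List[bytes]:
    """Hypothetical sketch: return only the data worth retrying.

    Partitions that failed with MESSAGE_SIZE_TOO_LARGE are logged and dropped,
    so the caller's contract (a list of messages to retry) stays unchanged.
    """
    to_retry: List[bytes] = []
    for partition, code in errors_by_partition.items():
        if code == MESSAGE_SIZE_TOO_LARGE:
            log.warning("Dropping %d message(s) to %s due to message size limit",
                        len(data_by_partition[partition]), partition)
        else:
            to_retry.extend(data_by_partition[partition])
    return to_retry
```

The trade-off Joel names is visible here: the caller never learns that the dropped messages were lost, apart from the log line.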
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758659#comment-13758659 ]

Guozhang Wang commented on KAFKA-998:
-------------------------------------

Thanks for the patch, Joel. Do you mean rebase on 0.8 (it was originally on trunk)?
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758450#comment-13758450 ]

Joel Koshy commented on KAFKA-998:
----------------------------------

Apologies for the late review. A couple of comments:

* I think this could reset needRetry back to false if subsequent partitions in the iteration do need a retry:
  needRetry = needRetry && !fatalException(topicPartitionAndError._2)
  The logic is actually a bit confusing. Instead, it might be clearer to just do: failedTopicPartitions.exists()
* Can you enhance the logging a bit to indicate that there were fatal sends that will not be retried? E.g., "Dropping messages to topic x due to message size limit..." or something like that.
* Can you rebase?
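The accumulated flag and an exists() check disagree as soon as one partition fails fatally and another fails transiently. Below is a Python sketch of both. It assumes needRetry starts as true and uses `any()` in place of Scala's `exists()`. Which codes count as fatal is an illustrative assumption.

```python
from typing import Iterable

MESSAGE_SIZE_TOO_LARGE = 10   # assumed protocol code, treated as fatal here
NOT_LEADER_FOR_PARTITION = 6  # assumed protocol code, treated as retryable here


def fatal(code: int) -> bool:
    return code == MESSAGE_SIZE_TOO_LARGE


def need_retry_accumulated(codes: Iterable[int]) -> bool:
    # Mirrors needRetry = needRetry && !fatalException(err), starting from True:
    # once any fatal error is seen the flag sticks at False, even if other
    # failed partitions could still succeed on a retry.
    need_retry = True
    for code in codes:
        need_retry = need_retry and not fatal(code)
    return need_retry


def need_retry_exists(codes: Iterable[int]) -> bool:
    # Mirrors failedTopicPartitions.exists(...): retry if ANY failure is non-fatal.
    return any(not fatal(code) for code in codes)
```

With failures `[NOT_LEADER_FOR_PARTITION, MESSAGE_SIZE_TOO_LARGE]`, the accumulated version gives up on the retryable partition, while the `any()` version still retries it.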
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750847#comment-13750847 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

OK, I filed KAFKA-1025 to track the issue of reasoning about whether a failed send should be recoverable.
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750789#comment-13750789 ]

Guozhang Wang commented on KAFKA-998:
-------------------------------------

Hello Jason,

I think what you need is to return more information to the caller of Producer.send(), since currently it only throws a FailedToSendMessageException: "Failed to send messages after #, tries." For this case, I think it is better for you to create a separate JIRA.

As for dynamically adjusting the batch size upon receiving a MessageSizeTooLargeException, I will file a separate JIRA for this.
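One way to adjust the batch size dynamically, as mentioned for the separate JIRA, is to split a rejected batch in half and retry each half. Below is a Python sketch. It assumes, for illustration, that the size check applies to the batch as a whole, and that `send` is a hypothetical callback that raises a hypothetical `MessageSizeTooLargeError`. A single message that is too large on its own can never succeed, so it is reported instead of retried.

```python
from typing import Callable, List, Sequence


class MessageSizeTooLargeError(Exception):
    """Hypothetical stand-in for a MessageSizeTooLargeException from the broker."""


def send_splitting(batch: Sequence[bytes],
                   send: Callable[[Sequence[bytes]], None]) -> List[bytes]:
    """Send batch; on a size error, retry each half separately.

    Returns the single messages that are too large on their own.
    """
    if not batch:
        return []
    try:
        send(batch)
        return []
    except MessageSizeTooLargeError:
        if len(batch) == 1:
            return [batch[0]]  # cannot shrink further: report, don't retry
        mid = len(batch) // 2
        return send_splitting(batch[:mid], send) + send_splitting(batch[mid:], send)
```

Halving keeps the number of extra round trips logarithmic in the batch size for each oversized region, at the cost of possibly sending smaller batches than necessary.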
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750676#comment-13750676 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

Also, a producer can throw QueueFullException. From the client's point of view, it would make sense for this to be a retryable situation as well (depending on load). Thus, it might make sense for QueueFullException to be a subclass of FailedToSendException, and more likely a subclass of RetriesExhaustedFailedToSendException (or whatever that ends up being renamed to).
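If QueueFullException sat under a "retryable" base class, as proposed, a caller could handle it with the same catch clause as other transient failures. Below is a Python sketch of that hierarchy. All class names are hypothetical stand-ins for Jason's proposal, not Kafka's actual classes, and the fixed backoff is an illustrative choice.

```python
import time
from typing import Callable


class FailedToSendError(Exception):
    """Hypothetical base class for all send failures."""


class RetriesExhaustedFailedToSendError(FailedToSendError):
    """Hypothetical: transient failure; the send may succeed if tried again later."""


class QueueFullError(RetriesExhaustedFailedToSendError):
    """Hypothetical: the async producer's queue was full, so the event was not enqueued."""


def send_until_accepted(send: Callable[[], None], attempts: int, backoff_s: float) -> bool:
    """Retry transient failures with a fixed backoff; other failures propagate."""
    for i in range(attempts):
        try:
            send()
            return True
        except RetriesExhaustedFailedToSendError:  # also catches QueueFullError
            if i + 1 < attempts:
                time.sleep(backoff_s)
    return False
```

Any `FailedToSendError` that is not in the retryable branch escapes the loop immediately, so the caller sees unrecoverable failures without waiting out the backoff.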
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749876#comment-13749876 ]

Jason Rosenberg commented on KAFKA-998:
---------------------------------------

The caller of Producer.send() should also be able to tell whether a send failure is recoverable (i.e., might succeed with more retries). Otherwise, it may be hard for the client developer to guess the right value for message.send.max.retries, since a transient error, like a restarting broker, could take an unknown amount of time to clear.

If I want to implement guaranteed semantics, then the client needs information on whether to keep retrying a message or give up. This could be done by having Producer.send() throw different exception types (e.g., different versions of FailedToSendMessageException), such as UnrecoverableFailedToSendException or RetriesExhaustedFailedToSendException (perhaps with shorter names). These could both be subclasses of FailedToSendException.

Another approach might be to have the FailedToSendException carry information such as how many retries were attempted and whether the message might be recoverable with more retries. It should also wrap the root cause, so that debugging is possible.
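The two approaches above can be combined: the base exception carries the retry count, a recoverable flag and the root cause, and the two subclasses fix the flag. Below is a Python sketch. The class names follow the comment's suggestions but are hypothetical. `should_keep_retrying()` and its overall budget are an illustrative caller-side policy.

```python
from typing import Optional


class FailedToSendError(Exception):
    """Hypothetical base send failure carrying the details requested above."""

    def __init__(self, msg: str, retries_attempted: int, recoverable: bool,
                 cause: Optional[BaseException] = None):
        super().__init__(msg)
        self.retries_attempted = retries_attempted
        self.recoverable = recoverable  # might more retries succeed?
        self.__cause__ = cause          # wrap the root cause for debugging


class UnrecoverableFailedToSendError(FailedToSendError):
    def __init__(self, msg: str, retries_attempted: int,
                 cause: Optional[BaseException] = None):
        super().__init__(msg, retries_attempted, recoverable=False, cause=cause)


class RetriesExhaustedFailedToSendError(FailedToSendError):
    def __init__(self, msg: str, retries_attempted: int,
                 cause: Optional[BaseException] = None):
        super().__init__(msg, retries_attempted, recoverable=True, cause=cause)


def should_keep_retrying(err: FailedToSendError, total_budget: int) -> bool:
    """Caller-side policy sketch: stop on unrecoverable errors or once a budget is spent."""
    return err.recoverable and err.retries_attempted < total_budget
```

With this shape, the client no longer has to guess a single value for message.send.max.retries up front. It can resend recoverable failures until its own budget runs out.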
[jira] [Commented] (KAFKA-998) Producer should not retry on non-recoverable error codes
[ https://issues.apache.org/jira/browse/KAFKA-998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729681#comment-13729681 ]

Guozhang Wang commented on KAFKA-998:
-------------------------------------

Approach proposal:

1. Pass the errorCode from send to dispatchSerializedData, so that it is handled along with the outstandingProduceRequests.

2. When outstandingProduceRequests.size > 0, check the corresponding error code. If it indicates a non-avoidable error such as MessageSizeTooLarge, break out of the while loop immediately.

Doing so would also make it easier in the future to build a smarter producer client that dynamically shrinks the batch size.
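The proposed loop can be sketched in Python as follows. `dispatch` is a hypothetical callback that sends the outstanding requests and returns `{partition: error code}` for those that failed. The set of non-avoidable codes is an assumption. This version breaks only when every remaining error is non-avoidable, so one oversized message cannot stop retries for other partitions (the concern Joel raised in his review).

```python
from typing import Callable, Dict

MESSAGE_SIZE_TOO_LARGE = 10               # assumed Kafka protocol error code
NON_AVOIDABLE = {MESSAGE_SIZE_TOO_LARGE}  # hypothetical "do not retry" codes


def send_with_early_exit(outstanding: Dict[str, bytes],
                         dispatch: Callable[[Dict[str, bytes]], Dict[str, int]],
                         max_retries: int) -> Dict[str, int]:
    """Hypothetical retry loop; returns the error codes of partitions that never succeeded."""
    failed: Dict[str, int] = {}
    attempts_left = max_retries + 1
    while attempts_left > 0 and outstanding:
        attempts_left -= 1
        failed = dispatch(outstanding)  # error codes travel with the failed requests
        outstanding = {p: outstanding[p] for p in failed}
        # Step 2 of the proposal: once every remaining error is non-avoidable,
        # leave the loop instead of spending the remaining retries.
        if failed and all(code in NON_AVOIDABLE for code in failed.values()):
            break
    return failed
```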