Hi all,

We are using Samza (0.10.0) in our system and recently ran into a problem: the Kafka broker was unstable for a few moments, and our Samza tasks got exceptions while trying to write messages to Kafka. After that, they went into a very long retry loop (Integer.MAX_VALUE times).

The repeated warning lines we are getting in the container logs are:

.
.
WARN [2016-05-23 06:41:36,645] [U:260,F:293,T:552,M:2,267] producer.internals.Sender:[Sender:completeBatch:257] - [kafka-producer-network-thread | samza_producer-job4-1-1463686278936-2] - Got error produce response with correlation id 5888322 on topic-partition Topic3-0, retrying (2144537752 attempts left). Error: CORRUPT_MESSAGE
.
.

We experimented with setting the Kafka producer 'retries' configuration to a smaller number, but it appears that Samza does not permit overriding this parameter. On top of that, there is additional Samza-level retry logic that re-sends the message if Kafka fails with a 'RetriableException'.

May I know the reason for disallowing this override? Additionally, what is the recommended way to handle such situations?

I would have thought that a possible policy would be this: if samza-kafka is still unable to send the message after K Kafka retries (K configured by the user), it throws an exception out to user land and lets the user decide what to do. In our case we would have chosen to kill the container and have the Samza YARN application master request a new one from YARN.

There seem to be at least a couple of bugs related to this already open:

1. https://issues.apache.org/jira/browse/SAMZA-610
2. https://issues.apache.org/jira/browse/SAMZA-911

cheers,
gaurav
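
P.S. To make the policy described above concrete, here is a rough Java sketch of a bounded retry that surfaces the failure to the caller after K attempts. It is only an illustration and does not reflect how Samza or the Kafka producer is implemented: the class BoundedRetrySketch, the method sendWithBoundedRetries and the exception RetriesExhaustedException are all hypothetical names, and the send operation is modelled as a plain Callable rather than a real producer call.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch only: none of these names exist in Samza or Kafka.
final class BoundedRetrySketch {

    /** Hypothetical exception handed to user code once all retries are used up. */
    public static class RetriesExhaustedException extends RuntimeException {
        public RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Runs the send operation once, then retries it up to maxRetries more times.
     * If every attempt fails, the last failure is wrapped and rethrown so the
     * caller can decide what to do (for example, shut the container down).
     */
    public static <T> T sendWithBoundedRetries(Callable<T> send, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return send.call();
            } catch (Exception e) {
                // A real implementation would retry only on retriable errors
                // and would probably back off between attempts.
                last = e;
            }
        }
        throw new RetriesExhaustedException(
                "send failed after " + (maxRetries + 1) + " attempts", last);
    }
}
```

With something along these lines, user code could catch RetriesExhaustedException and fail the task, instead of the producer retrying for an effectively unbounded time.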