[jira] [Created] (KAFKA-10228) producer: NETWORK_EXCEPTION is thrown instead of a request timeout

2020-07-02 Thread Christian Becker (Jira)
Christian Becker created KAFKA-10228:


 Summary: producer: NETWORK_EXCEPTION is thrown instead of a 
request timeout
 Key: KAFKA-10228
 URL: https://issues.apache.org/jira/browse/KAFKA-10228
 Project: Kafka
  Issue Type: Improvement
  Components: clients
Affects Versions: 2.3.1
Reporter: Christian Becker


We're currently seeing an issue with the Java client (producer) when producing a message 
runs into a timeout: a NETWORK_EXCEPTION is surfaced instead of a timeout 
exception.

*Situation and relevant code:*

Config
{code:java}
request.timeout.ms: 200
retries: 3
acks: all{code}
{code:java}
for (UnpublishedEvent event : unpublishedEvents) {
    // KafkaTemplate.send(...) returns a ListenableFuture<SendResult<K, V>>;
    // key/value types are assumed to be String here
    ListenableFuture<SendResult<String, String>> future;
    future = kafkaTemplate.send(new ProducerRecord<>(event.getTopic(),
            event.getKafkaKey(), event.getPayload()));
    futures.add(future.completable());
}

CompletableFuture.allOf(futures.stream().toArray(CompletableFuture[]::new)).join();{code}
We're using the KafkaTemplate from Spring Boot here, but that shouldn't matter, as 
it's merely a thin wrapper around the producer. We use it to submit batches of messages to be sent.
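
For reference, roughly the same setup against the plain Java producer would look like the sketch below (bootstrap address and serializers are placeholders, not our actual configuration; in our case these values come from Spring properties):
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

// Illustrative equivalent of the configuration above on a plain producer.
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 200); // the low timeout in question
props.put(ProducerConfig.RETRIES_CONFIG, 3);
props.put(ProducerConfig.ACKS_CONFIG, "all");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);{code}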

200ms later, we can see the following in the logs:
{code:java}
[Producer clientId=producer-1] Received invalid metadata error in produce request on partition events-6 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
[Producer clientId=producer-1] Got error produce response with correlation id 3094 on topic-partition events-6, retrying (2 attempts left). Error: NETWORK_EXCEPTION {code}
This was unexpected and sent us hunting across the infrastructure for possible 
connection issues, but we found none.

Side note: In some cases the retries worked and the messages were successfully 
produced.

Only after many hours of heavy debugging did we notice that the error might be 
related to the low timeout setting. We've since removed that setting, as it was a 
remnant from the past and no longer valid for our use case. However, to save other 
people from hitting the same issue and to simplify future debugging, some form of 
timeout exception should be thrown instead.
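
To illustrate what callers have to do today, here is a minimal sketch against the plain producer API ({{producer}} and {{record}} are assumed to exist, and {{handleTimeout}}/{{handleOtherError}} are hypothetical helpers):
{code:java}
import org.apache.kafka.common.errors.NetworkException;
import org.apache.kafka.common.errors.TimeoutException;

// Sketch only: with a very low request.timeout.ms, a timed-out produce request
// currently surfaces as a NetworkException ("The server disconnected before a
// response was received"), so callers have to treat it as a possible timeout.
// With the suggested change, the TimeoutException branch alone would suffice.
producer.send(record, (metadata, exception) -> {
    if (exception == null) {
        return; // produced successfully
    }
    if (exception instanceof TimeoutException) {
        handleTimeout(exception);       // a clearly reported timeout
    } else if (exception instanceof NetworkException) {
        handleTimeout(exception);       // may actually be the request timeout
    } else {
        handleOtherError(exception);    // other produce errors
    }
});{code}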



[jira] [Created] (KAFKA-8709) hard fail on "Unknown group metadata version"

2019-07-24 Thread Christian Becker (JIRA)
Christian Becker created KAFKA-8709:
---

 Summary: hard fail on "Unknown group metadata version"
 Key: KAFKA-8709
 URL: https://issues.apache.org/jira/browse/KAFKA-8709
 Project: Kafka
  Issue Type: Improvement
Reporter: Christian Becker


We attempted an upgrade from 2.2 to 2.3, and a rollback was then done after 
{{inter.broker.protocol}} had already been changed. (We know this shouldn't be done, but it 
happened.)

After downgrading to 2.2 again, some {{__consumer_offsets}} partitions fail to 
load with the message {{Unknown group metadata version 3}}. The broker then 
continues its startup, but the affected consumer groups no longer exist. As a result, the 
consumers start at their configured OLDEST or NEWEST position and begin 
committing their offsets again.
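
(In Java-client terms, that reset position is the consumers' {{auto.offset.reset}} setting; a minimal illustration, not our actual configuration:)
{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Illustrative only: the OLDEST/NEWEST position mentioned above maps to
// auto.offset.reset on the Java consumer.
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // OLDEST
// or: consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest"); // NEWEST{code}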

However, on subsequent restarts of the brokers the {{Unknown group metadata 
version}} exception remains, so the same failure happens over and over 
again.

 

In order to prevent this, I'd suggest an updated flow when loading the offsets (see the sketch after this list):
- loading should continue reading the __consumer_offsets partition to see 
if a subsequent, readable offset commit exists
- if no "valid" offset can be found, throw the existing exception to let the 
operator know about the situation
- if a valid offset can be found, continue as normal
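
A rough sketch of the suggested loading behaviour (purely illustrative Java; the real logic lives in the broker's group metadata loading code, and all names below are made up):
{code:java}
import org.apache.kafka.common.KafkaException;

// Sketch of skip-and-continue loading of a __consumer_offsets partition.
// offsetsPartitionRecords, parseGroupMetadata and applyGroupMetadata are
// hypothetical placeholders; only the control flow matters here.
KafkaException firstError = null;
boolean validRecordFound = false;

for (byte[] record : offsetsPartitionRecords) {
    try {
        applyGroupMetadata(parseGroupMetadata(record));
        validRecordFound = true;
    } catch (KafkaException e) { // e.g. "Unknown group metadata version 3"
        if (firstError == null) {
            firstError = e;
        }
        // keep scanning instead of giving up on the whole partition
    }
}

if (firstError != null && !validRecordFound) {
    // nothing readable after the corrupt records: surface the problem as today
    throw firstError;
}
// otherwise: log firstError (if any) and continue with the recovered state{code}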

 

This would cause the following sequence of events:
1. corrupted offsets are written
2. broker restart
2a. broker loads offset partition
2b. {{KafkaException}} when loading the offset partition
2c. no "valid" offset is found after the "corrupt" record
2d. offsets reset
3. consumergroups are recreated and "valid" offsets are appended
4. broker restart
4a. broker loads offset partition
4b.  {{KafkaException}} when loading the offset partition
4c. "valid" offset is found after the "corrupted" ones
5. consumergroups still have their latest offset

It's a special case that this happened after some human error, but it also 
poses a danger for situations where the offsets get corrupted for 
some unrelated reason. Losing the offsets is a very serious situation and there 
should be safeguards against it, especially when offsets might still be 
recoverable. With this improvement the offsets would still be lost once, but 
the broker would be able to recover automatically over time, and after compaction the 
corrupted records would be gone. (In our case this caused serious confusion, as 
we lost the offsets multiple times: the error message {{Error loading 
offsets from}} implies that the corrupted data has been deleted and the 
situation is therefore recovered, whereas in reality it remains an issue until 
the corrupt data is gone for good, which might take a long time.)

In our case we seem to have evicted the broken records by temporarily setting 
the segment time to a very low value and deactivating compaction:
{code:bash}
/opt/kafka/bin/kafka-topics.sh --alter --config segment.ms=90 --topic __consumer_offsets --zookeeper localhost:2181
/opt/kafka/bin/kafka-topics.sh --alter --config cleanup.policy=delete --topic __consumer_offsets --zookeeper localhost:2181
< wait for the cleaner to clean up >
/opt/kafka/bin/kafka-topics.sh --alter --config segment.ms=60480 --topic __consumer_offsets --zookeeper localhost:2181
/opt/kafka/bin/kafka-topics.sh --alter --config cleanup.policy=compact --topic __consumer_offsets --zookeeper localhost:2181{code}
 


