Thanks, Evan, for getting back, and great that it was resolved by configuration 
tweaks!

—
Alexey

> On 16 Sep 2022, at 16:46, Evan Galpin <egal...@apache.org> wrote:
> 
> Following up to close the loop.  I believe the Kafka errors I was seeing were 
> a red herring.  The actual root cause was worker nodes running out of memory; 
> as a result, Kafka producers had difficulty competing for resources amid GC 
> thrashing.  Increasing the worker node size so that there are no longer 
> OOMKills has removed any Kafka "issue".
> 
> Thanks all for your time and willingness to help.
> 
> Evan
> 
> On Tue, Sep 13, 2022 at 12:33 PM Evan Galpin <egal...@apache.org> wrote:
> There are a few related log lines, but there isn't a full stacktrace, as the 
> info originates from a logger statement [1] as opposed to a thrown exception. 
> The related log lines are like so:
> 
> org.apache.kafka.clients.NetworkClient [Producer clientId=producer-109] 
> Disconnecting from node 10 due to socket connection setup timeout. The 
> timeout value is 11436 ms.    [2]
> 
> and
> 
> org.apache.kafka.clients.NetworkClient [Producer clientId=producer-109] Node 
> 10 disconnected.    [3]
> 
> [1] https://github.com/apache/kafka/blob/f653cb7b5889fd619ab0e6a25216bd981a9d82bf/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L402
> [2] https://github.com/apache/kafka/blob/1135f22eaf404fdf76489302648199578876c4ac/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L820
> [3] https://github.com/apache/kafka/blob/1135f22eaf404fdf76489302648199578876c4ac/clients/src/main/java/org/apache/kafka/clients/NetworkClient.java#L937
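As an aside on the odd-looking 11436 ms timeout in [2]: values like that are consistent with an exponential backoff with random jitter applied to the socket connection setup timeout (10 s by default). A rough sketch of that mechanism follows; the base, cap, multiplier, and jitter constants here are assumptions for illustration, not values confirmed from the Kafka source:

```java
import java.util.concurrent.ThreadLocalRandom;

public class SocketSetupBackoffSketch {
    // Assumed constants: 10 s base, 30 s cap, doubling per failed
    // attempt, and +/-20% random jitter.
    static final long BASE_MS = 10_000;
    static final long MAX_MS = 30_000;
    static final double JITTER = 0.2;

    // Exponential backoff capped at MAX_MS, then scaled by a random
    // jitter factor in [1 - JITTER, 1 + JITTER).
    static long timeoutMs(int failedAttempts) {
        double term = Math.min((double) MAX_MS, BASE_MS * Math.pow(2, failedAttempts));
        double factor = ThreadLocalRandom.current().nextDouble(1 - JITTER, 1 + JITTER);
        return (long) (term * factor);
    }

    public static void main(String[] args) {
        // A first attempt lands somewhere in [8000, 12000) ms, which is
        // how a non-round value like 11436 ms can show up in the logs.
        System.out.println("first-attempt timeout: " + timeoutMs(0) + " ms");
    }
}
```

The point is only that the non-round number is expected noise from jitter, not itself a symptom.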
> On Tue, Sep 13, 2022 at 12:17 PM Alexey Romanenko <aromanenko....@gmail.com> wrote:
> Do you have by any chance the full stacktrace of this error?
> 
> —
> Alexey
> 
>> On 13 Sep 2022, at 18:05, Evan Galpin <egal...@apache.org> wrote:
>> 
>> Ya, likewise, I'd expect this to be handled in the Kafka code without the 
>> need for special handling by Beam.  I'll reach out to the Kafka mailing list 
>> as well and try to get a better understanding of the root issue.  Thanks for 
>> your time so far, John! I'll ping this thread with any interesting findings 
>> or insights.
>> 
>> Thanks,
>> Evan
>> 
>> On Tue, Sep 13, 2022 at 11:38 AM John Casey via user <user@beam.apache.org> wrote:
>> In principle yes, but I don't see any Beam level code to handle that. I'm a 
>> bit surprised it isn't handled in the Kafka producer layer itself.
>> 
>> On Tue, Sep 13, 2022 at 11:15 AM Evan Galpin <egal...@apache.org> wrote:
>> I'm not certain from the logs where the disconnect is starting.  I have 
>> seen TimeoutExceptions like those mentioned in the SO issue you linked, so if 
>> we assume it's starting from the Kafka cluster side, my concern is that the 
>> producers don't seem to be able to gracefully recover.  Given that 
>> restarting the pipeline (in this case, in Dataflow) makes the issue go away, 
>> I'm under the impression that producer clients in KafkaIO#write can get into 
>> a state that they're not able to recover from after experiencing a 
>> disconnect.  Is graceful recovery after cluster unavailability something 
>> that would be expected to be supported by KafkaIO today?
>> 
>> Thanks,
>> Evan
>> 
>> On Tue, Sep 13, 2022 at 11:07 AM John Casey via user <user@beam.apache.org> wrote:
>> Googling that error message returned 
>> https://stackoverflow.com/questions/71077394/kafka-producer-resetting-the-last-seen-epoch-of-partition-resulting-in-timeout
>> and 
>> https://github.com/a0x8o/kafka/blob/master/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L402
>> 
>> Which suggests that there is some sort of disconnect happening between your 
>> pipeline and your Kafka instance.
>> 
>> Do you see any logs when this disconnect starts, on the Beam or Kafka side 
>> of things?
>> 
>> On Tue, Sep 13, 2022 at 10:38 AM Evan Galpin <egal...@apache.org> wrote:
>> Thanks for the quick reply, John!  I should also add that the root issue is 
>> not so much the logging, but rather that these log messages seem to be 
>> correlated with periods where producers are not able to publish data to 
>> Kafka.  The issue of not being able to publish data does not seem to resolve 
>> until restarting or updating the pipeline.
>> 
>> Here's my publisher config map:
>> 
>>     .withProducerConfigUpdates(
>>         Map.ofEntries(
>>             Map.entry(
>>                 ProducerConfig.PARTITIONER_CLASS_CONFIG, DefaultPartitioner.class),
>>             Map.entry(
>>                 ProducerConfig.COMPRESSION_TYPE_CONFIG, CompressionType.GZIP.name),
>>             Map.entry(
>>                 CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, SecurityProtocol.SASL_SSL.name),
>>             Map.entry(
>>                 SaslConfigs.SASL_MECHANISM, PlainSaslServer.PLAIN_MECHANISM),
>>             Map.entry(
>>                 SaslConfigs.SASL_JAAS_CONFIG,
>>                 "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<api_key>\" password=\"<api_secret>\";")))
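For anyone cross-checking against the Kafka producer configuration docs, the config-update map above should translate to roughly the raw property keys below. This is a sketch: the spelled-out partitioner class name is my assumption of what `DefaultPartitioner.class` resolves to, and the credential values remain placeholders as in the snippet above.

```java
import java.util.Map;

public class ProducerPropsSketch {
    // The same settings expressed as plain Kafka producer property keys.
    // <api_key>/<api_secret> are placeholders, not real credentials.
    static final Map<String, String> PROPS = Map.of(
        "partitioner.class",
        "org.apache.kafka.clients.producer.internals.DefaultPartitioner",
        "compression.type", "gzip",
        "security.protocol", "SASL_SSL",
        "sasl.mechanism", "PLAIN",
        "sasl.jaas.config",
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"<api_key>\" password=\"<api_secret>\";");

    public static void main(String[] args) {
        PROPS.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```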
>> 
>> Thanks,
>> Evan
>> 
>> On Tue, Sep 13, 2022 at 10:30 AM John Casey <johnjca...@google.com> wrote:
>> Hi Evan,
>> 
>> I haven't seen this before. Can you share your Kafka write configuration, 
>> and any other stack traces that could be relevant? 
>> 
>> John
>> 
>> On Tue, Sep 13, 2022 at 10:23 AM Evan Galpin <egal...@apache.org> wrote:
>> Hi all,
>> 
>> I've recently started using the KafkaIO connector as a sink, and am new to 
>> Kafka in general.  My Kafka clusters are hosted by Confluent Cloud.  I'm 
>> using Beam SDK 2.41.0.  At least daily, the producers in my Beam pipeline 
>> get stuck in a loop, frantically logging this message:
>> 
>> Node n disconnected.
>> 
>> Resetting the last seen epoch of partition <topic-partition> to x since the 
>> associated topicId changed from null to <topic_id>
>> 
>> Updating the running pipeline "resolves" the issue, I believe because doing 
>> so recreates the Kafka producer clients, but it seems that, as-is, the 
>> KafkaIO producer clients are not resilient to node disconnects.  Might I be 
>> missing a configuration option, or are there any known issues like this?
>> 
>> Thanks,
>> Evan
> 
