[ 
https://issues.apache.org/jira/browse/KAFKA-6260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266883#comment-16266883
 ] 

Seweryn Habdank-Wojewodzki commented on KAFKA-6260:
---------------------------------------------------

[~ijuma] Include some reproducible example is pretty hard for me, as we have 
some servers and at two of them this happend (continuously). 
The other are working well, but the definitive difference is load. At those two 
we have pretty high load in comparison to others.
Also the problem is that our Kafka uses SSL and ACLs, which are changing timing 
and this part I cannot attach.
Kafka and Log4j settings I had already included.

Kafka code it self is pretty easy. It looks like (a bit compacted, as we have 
it spread in some methods):

{code}
StreamsBuilder kBuilder = new StreamsBuilder();
this.kafkaAndStreamsProperties = kafkaAndStreamsProperties;

this.kafkaAndStreamsProperties.put( 
StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass() );
this.kafkaAndStreamsProperties.put( 
StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass() );

StreamsConfig streamsConfig = new StreamsConfig( kafkaAndStreamsProperties );
KafkaInputConfiguration kic = new KafkaInputConfiguration();

final String inTopicName = kic.getInputTopicParameters().getTopicName();

KStream<String, String> stringInput = kBuilder.stream( inTopicName );

stringInput.foreach( ( k, v ) -> {
        try {
                MyRecordRto lrr = objectMapper.readValue( v, MyRecordRto.class 
);
                
                // it could take much time as sometimes we are wating here 
until the processors are finishing own work
                // at high load here we could wait up to minute
                myStreamProcessor.process( lrr );
                
        } catch ( IOException pe ) {
                LOG.warn( "Parsing JSON encountered error: {}", pe.getMessage() 
);
                LOG.trace( "Error: {}", pe );
        } catch ( Exception e ) {
                LOG.warn( "Processing message encountered error: {}", 
e.getMessage() );
                LOG.trace( "Error: {}", e );
        }
} );

KafkaStreams streams = new KafkaStreams( kBuilder.build(), streamsConfig );
streams.setUncaughtExceptionHandler( ( Thread t, Throwable e ) -> {
                        synchronized ( this ) {
                                LOG.fatal( "Caught unhandled exception: {}; {} 
in thread {}",
                                                e.getMessage(), 
e.getStackTrace(), t.getName() );
                                this.stop( 5L );
                                // seems to be asymmetric shutdown hook shall 
not contains System.exit()
                                // for v. 0.11.0.0 see official report 
KAFKA-4366
                                // but setUncaughtExceptionHandler without 
System.exit() hangs.
                                System.exit( 1 );
                        }
                }
);
streams.cleanUp();
streams.start();
{code}

I have new trace level logs. I will rework them and attach.


> AbstractCoordinator not clearly handles NULL Exception
> ------------------------------------------------------
>
>                 Key: KAFKA-6260
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6260
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>         Environment: RedHat Linux
>            Reporter: Seweryn Habdank-Wojewodzki
>
> The error reporting is not clear. But it seems that Kafka Heartbeat shuts 
> down application due to NULL exception caused by "fake" disconnections.
> One more comment. We are processing messages in the stream, but sometimes we 
> have to block processing for minutes, as consumers are not handling too much 
> load. Is it possibble that when stream is waiting, then heartbeat is as well 
> blocked?
> Can you check that?
> {code}
> 2017-11-23 23:54:47 DEBUG AbstractCoordinator:177 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Received successful Heartbeat response
> 2017-11-23 23:54:50 DEBUG AbstractCoordinator:183 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Sending Heartbeat request to coordinator 
> cljp01.eb.lan.at:9093 (id: 2147483646 rack: null)
> 2017-11-23 23:54:50 TRACE NetworkClient:135 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Sending HEARTBEAT 
> {group_id=kafka-endpoint,generation_id=3834,member_id=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer-94f18be5-e49a-4817-9e5a-fe82a64e0b08}
>  with correlation id 24 to node 2147483646
> 2017-11-23 23:54:50 TRACE NetworkClient:135 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Completed receive from node 2147483646 for HEARTBEAT 
> with correlation id 24, received {throttle_time_ms=0,error_code=0}
> 2017-11-23 23:54:50 DEBUG AbstractCoordinator:177 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Received successful Heartbeat response
> 2017-11-23 23:54:52 DEBUG NetworkClient:183 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Disconnecting from node 1 due to request timeout.
> 2017-11-23 23:54:52 TRACE NetworkClient:135 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Cancelled request 
> {replica_id=-1,max_wait_time=6000,min_bytes=1,max_bytes=52428800,isolation_level=0,topics=[{topic=clj_internal_topic,partitions=[{partition=6,fetch_offset=211558395,log_start_offset=-1,max_bytes=1048576},{partition=8,fetch_offset=210178209,log_start_offset=-1,max_bytes=1048576},{partition=0,fetch_offset=209353523,log_start_offset=-1,max_bytes=1048576},{partition=2,fetch_offset=209291462,log_start_offset=-1,max_bytes=1048576},{partition=4,fetch_offset=210728595,log_start_offset=-1,max_bytes=1048576}]}]}
>  with correlation id 21 due to node 1 being disconnected
> 2017-11-23 23:54:52 DEBUG ConsumerNetworkClient:195 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Cancelled FETCH request RequestHeader(apiKey=FETCH, 
> apiVersion=6, 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  correlationId=21) with correlation id 21 due to node 1 being disconnected
> 2017-11-23 23:54:52 DEBUG Fetcher:195 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Fetch request 
> {clj_internal_topic-6=(offset=211558395, logStartOffset=-1, 
> maxBytes=1048576), clj_internal_topic-8=(offset=210178209, logStartOffset=-1, 
> maxBytes=1048576), clj_internal_topic-0=(offset=209353523, logStartOffset=-1, 
> maxBytes=1048576), clj_internal_topic-2=(offset=209291462, logStartOffset=-1, 
> maxBytes=1048576), clj_internal_topic-4=(offset=210728595, logStartOffset=-1, 
> maxBytes=1048576)} to cljp01.eb.lan.at:9093 (id: 1 rack: DC-1) failed 
> org.apache.kafka.common.errors.DisconnectException: null
> 2017-11-23 23:54:52 TRACE NetworkClient:123 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Found least loaded node cljp01.eb.lan.at:9093 (id: 1 
> rack: DC-1)
> 2017-11-23 23:54:52 DEBUG NetworkClient:183 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Initialize connection to node cljp01.eb.lan.at:9093 
> (id: 1 rack: DC-1) for sending metadata request
> 2017-11-23 23:54:52 DEBUG NetworkClient:183 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Initiating connection to node cljp01.eb.lan.at:9093 
> (id: 1 rack: DC-1)
> 2017-11-23 23:54:52 ERROR AbstractCoordinator:296 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Heartbeat thread for group kafka-endpoint failed due 
> to unexpected error
> java.lang.NullPointerException: null
>         at 
> org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:436) 
> ~[my-kafka-endpoint.jar:?]
>         at org.apache.kafka.common.network.Selector.poll(Selector.java:395) 
> ~[my-kafka-endpoint.jar:?]
>         at 
> org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:460) 
> ~[my-kafka-endpoint.jar:?]
>         at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:238)
>  ~[my-kafka-endpoint.jar:?]
>         at 
> org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:275)
>  ~[my-kafka-endpoint.jar:?]
>         at 
> org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:934)
>  [my-kafka-endpoint.jar:?]
> 2017-11-23 23:54:52 DEBUG AbstractCoordinator:177 - [Consumer 
> clientId=kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer,
>  groupId=kafka-endpoint] Heartbeat thread has closed
> 2017-11-23 23:55:16 INFO  KafkaElasticsearchEndpoint:61 - Got shutdown 
> requests ...
> 2017-11-23 23:55:16 INFO  StreamProcessor:106 - Streams closing ...
> 2017-11-23 23:55:16 INFO  StreamProcessor:111 - with 5 [s] timeout
> 2017-11-23 23:55:16 DEBUG KafkaStreams:183 - stream-client 
> [kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31]Stopping Streams client 
> with timeoutMillis = 5000 ms.
> 2017-11-23 23:55:16 INFO  KafkaStreams:346 - stream-client 
> [kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31]State transition from 
> RUNNING to PENDING_SHUTDOWN
> 2017-11-23 23:55:16 INFO  StreamThread:336 - stream-thread 
> [kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1] Informed 
> to shut down
> 2017-11-23 23:55:16 INFO  StreamThread:346 - stream-thread 
> [kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1] State 
> transition from RUNNING to PENDING_SHUTDOWN
> {code}
> Kafka settings:
> {code}
>         auto.commit.interval.ms = 5000
>         auto.offset.reset = earliest
>         bootstrap.servers = [xxx:9093, yyy:9093]
>         check.crcs = true
>         client.id = 
> kafka-endpoint-be51569b-8795-4709-8ec8-28c9cd099a31-StreamThread-1-consumer
>         connections.max.idle.ms = 540000
>         enable.auto.commit = false
>         exclude.internal.topics = true
>         fetch.max.bytes = 52428800
>         fetch.max.wait.ms = 6000
>         fetch.min.bytes = 1
>         group.id = kafka-endpoint
>         heartbeat.interval.ms = 3000
>         interceptor.classes = null
>         internal.leave.group.on.close = false
>         isolation.level = read_uncommitted
>         key.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
>         max.partition.fetch.bytes = 1048576
>         max.poll.interval.ms = 10000000
>         max.poll.records = 2000
>         metadata.max.age.ms = 300000
>         metric.reporters = []
>         metrics.num.samples = 2
>         metrics.recording.level = INFO
>         metrics.sample.window.ms = 30000
>         partition.assignment.strategy = 
> [org.apache.kafka.streams.processor.internals.StreamPartitionAssignor]
>         receive.buffer.bytes = 65536
>         reconnect.backoff.max.ms = 1000
>         reconnect.backoff.ms = 50
>         request.timeout.ms = 7000
>         retry.backoff.ms = 100
>         sasl.jaas.config = null
>         sasl.kerberos.kinit.cmd = /usr/bin/kinit
>         sasl.kerberos.min.time.before.relogin = 60000
>         sasl.kerberos.service.name = null
>         sasl.kerberos.ticket.renew.jitter = 0.05
>         sasl.kerberos.ticket.renew.window.factor = 0.8
>         sasl.mechanism = GSSAPI
>         security.protocol = SSL
>         send.buffer.bytes = 131072
>         session.timeout.ms = 6000
>         ssl.cipher.suites = null
>         ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>         ssl.endpoint.identification.algorithm = null
>         ssl.key.password = [hidden]
>         ssl.keymanager.algorithm = SunX509
>         ssl.keystore.location = /data/my/etc/kafka/client-keystore
>         ssl.keystore.password = [hidden]
>         ssl.keystore.type = JKS
>         ssl.protocol = TLS
>         ssl.provider = null
>         ssl.secure.random.implementation = SHA1PRNG
>         ssl.trustmanager.algorithm = PKIX
>         ssl.truststore.location = /data/my/etc/kafka/truststore
>         ssl.truststore.password = [hidden]
>         ssl.truststore.type = JKS
>         value.deserializer = class 
> org.apache.kafka.common.serialization.ByteArrayDeserializer
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to