I searched for 'stream-thread failed to' in StreamThread100.log but didn't find any occurrence:
StreamThread100.log:org.apache.kafka.streams.errors.StreamsException: stream-thread failed to suspend stream tasks

Can you double check?

On Fri, Sep 15, 2017 at 8:09 PM, dev loper <spark...@gmail.com> wrote:
> Hi All,
>
> @Bill,
>
> I will reduce MAX_POLL_RECORDS to 500/1000 and share the results shortly.
>
> @Ted,
>
> Yes, I reduced MAX_POLL_RECORDS_CONFIG from 50000 to 5000. It was not a
> typo. Do you think 50000 is way too high for a Kafka Streams application?
> The Spark application which I am trying to replace with Kafka Streams was
> processing 250000 messages per 5-second batch; that was the reason I set
> MAX_POLL_RECORDS_CONFIG to 50000.
>
> I don't think I have hit KAFKA-5397, since I couldn't find any instance of
> "org.apache.kafka.streams.errors.LockException" in my logs. I could see
> the exceptions below, but they are triggered long after the application
> stopped consuming any messages:
>
> StreamThread100.log:org.apache.kafka.streams.errors.StreamsException:
> stream-thread failed to suspend stream tasks
> User provided listener
> org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener
> for group myKafka-kafkareplica101Sept08 failed on partition revocation
>
> @Damian, I figured out the pattern below; I don't know whether it helps.
> Did the StreamThread logs which I shared help?
>
> I couldn't figure out why the consumers are getting closed while
> partitions are being allocated, and why they suddenly stop without any
> apparent reason. If you look at the pattern:
>
> 1) Consumer is created with the config values supplied by the application.
> 2) Adding sensors.
> 3) Fetching API version and initiating connections to Kafka brokers
>    (NetworkClient.java).
> 4) Sending metadata request and recorded API version from broker
>    (NetworkClient.java).
> 5) Updated cluster metadata version (some incremental version) to cluster.
> 6) Discovered coordinator (one of the Kafka brokers).
> 7) Initiating connection to coordinator (one of the Kafka brokers).
> 8) Fetching committed offsets for partitions
>    (ConsumerCoordinator.java:826).
> 9) Recorded API versions for node.
> 10) Resetting offset for partition (for all the partitions).
> 11) Handling ListOffsetResponse (for all the partitions).
> 12) Removing sensors (I couldn't see any exceptions though).
> 13) consumer.KafkaConsumer (KafkaConsumer.java:1617) - The Kafka consumer
>     has closed. (Again I couldn't see any exceptions, and I couldn't
>     figure out why the KafkaConsumer closed; no reason is given in the
>     logs.)
> 14) The Kafka Streams application goes back to step 1 and creates the
>     consumer again. (The whole process is repeated, but 90% of the time
>     the application never recovers.)
> 15) Since I introduced the limit MAX_POLL_INTERVAL_MS_CONFIG = 60000
>     (previously Integer.MAX_VALUE), I could see the exception below after
>     60 seconds of consumer retry:
>
> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
> be completed since the group has already rebalanced
>
> I am not sure what could be wrong on my side and how I can resolve this
> issue so that my application starts processing messages consistently.
>
> Thanks
> Dev
>
> On Sat, Sep 16, 2017 at 2:02 AM, Bill Bejeck <b...@confluent.io> wrote:
>
>> Hi,
>>
>> Could you set MAX_POLL_RECORDS to something lower like 500 or 1000 and
>> try again?
>>
>> Thanks,
>> Bill
>>
>> On Fri, Sep 15, 2017 at 3:40 PM, dev loper <spark...@gmail.com> wrote:
>>
>> > Hi Damian,
>> >
>> > I have repeated my tests with a slight configuration change.
>> > The logs now captured for the "StreamThread" keyword have more
>> > relevant entries than the logs which I shared previously. I started
>> > the application on instances 100, 101 and 102 simultaneously with the
>> > configuration below:
>> >
>> > 1) Reduced MAX_POLL_RECORDS_CONFIG to 5000 (previously 50000)
>> > 2) Reduced MAX_POLL_INTERVAL_MS_CONFIG to 60000 (previously
>> >    Integer.MAX_VALUE)
>> >
>> > When the application started, all three instances began processing,
>> > and for the first few minutes everything went well. After that I could
>> > see the "StreamThread100" consumer going for a toss: it kept closing
>> > and re-creating consumers for a while, with exactly the pattern of
>> > logs I mentioned in my previous email, and after some time
>> > "StreamThread100" stopped processing messages with the exception
>> > below, while the other two instances continued processing without any
>> > issues.
>> >
>> > org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
>> > be completed since the group has already rebalanced and assigned the
>> > partitions to another member. This means that the time between
>> > subsequent calls to poll() was longer than the configured
>> > max.poll.interval.ms, which typically implies that the poll loop is
>> > spending too much time message processing. You can address this either
>> > by increasing the session timeout or by reducing the maximum size of
>> > batches returned in poll() with max.poll.records.
>> >
>> > I think that since the consumers were starting and stopping, no poll
>> > was made from the application. Since I reduced
>> > MAX_POLL_INTERVAL_MS_CONFIG to 60000 while the consumers were being
>> > closed and restarted, this might have resulted in the
>> > CommitFailedException due to the non-availability of processing
>> > threads.
>> >
>> > After some time the issue propagated to the other servers. I have
>> > attached the relevant logs with this mail. Kindly go through them and
>> > let me know how I can solve this issue.
>> >
>> > On Fri, Sep 15, 2017 at 10:33 PM, dev loper <spark...@gmail.com> wrote:
>> >
>> >> Hi Ted,
>> >>
>> >> What should I be looking for in the broker logs? I haven't looked at
>> >> the broker side, since my Spark application, processing the same
>> >> topic with a different group id, is able to process well.
>> >>
>> >> On Fri, Sep 15, 2017 at 3:30 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >>> Is there some clue in the broker logs?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Thu, Sep 14, 2017 at 11:19 PM, dev loper <spark...@gmail.com>
>> >>> wrote:
>> >>>
>> >>>> Dear Kafka Users,
>> >>>>
>> >>>> I am fairly new to Kafka Streams. I have deployed two Kafka 0.11
>> >>>> brokers on AWS M3.xlarge instances. I have created a topic with 36
>> >>>> partitions, and a separate application writes to this topic at a
>> >>>> rate of 10000 messages per second. I have three AWS M4.xlarge
>> >>>> instances where my Kafka Streams application runs, consuming the
>> >>>> messages produced by the other application. The application starts
>> >>>> up fine and processes messages on the first instance, but when I
>> >>>> start the same application on the other instances it does not start
>> >>>> processing messages, even though the process is alive. The other
>> >>>> instances also take a long time to start.
>> >>>>
>> >>>> Apart from the first instance, on the other instances I could see
>> >>>> the consumer being added and removed repeatedly, and I couldn't see
>> >>>> any message processing at all.
>> >>>> I have attached the detailed logs where this behavior is observed.
>> >>>>
>> >>>> The consumer starts on these instances with the log below and is
>> >>>> then stopped (detailed logs attached):
>> >>>>
>> >>>> INFO | 21:59:30 | consumer.ConsumerConfig (AbstractConfig.java:223) -
>> >>>> ConsumerConfig values:
>> >>>>     auto.commit.interval.ms = 5000
>> >>>>     auto.offset.reset = latest
>> >>>>     bootstrap.servers = [l-mykafkainstancekafka5101:9092,
>> >>>>         l-mykafkainstancekafka5102:9092]
>> >>>>     check.crcs = true
>> >>>>     client.id =
>> >>>>     connections.max.idle.ms = 540000
>> >>>>     enable.auto.commit = false
>> >>>>     exclude.internal.topics = true
>> >>>>     fetch.max.bytes = 52428800
>> >>>>     fetch.max.wait.ms = 500
>> >>>>     fetch.min.bytes = 1
>> >>>>     group.id = myKafka-kafkareplica101Sept08
>> >>>>     heartbeat.interval.ms = 3000
>> >>>>     interceptor.classes = null
>> >>>>     internal.leave.group.on.close = true
>> >>>>     isolation.level = read_uncommitted
>> >>>>     key.deserializer = class mx.july.jmx.proximity.kafka.KafkaKryoCodec
>> >>>>     max.partition.fetch.bytes = 1048576
>> >>>>     max.poll.interval.ms = 300000
>> >>>>     max.poll.records = 500
>> >>>>     metadata.max.age.ms = 300000
>> >>>>     metric.reporters = []
>> >>>>     metrics.num.samples = 2
>> >>>>     metrics.recording.level = INFO
>> >>>>     metrics.sample.window.ms = 30000
>> >>>>     partition.assignment.strategy = [class
>> >>>>         org.apache.kafka.clients.consumer.RangeAssignor]
>> >>>>     receive.buffer.bytes = 65536
>> >>>>     reconnect.backoff.max.ms = 1000
>> >>>>     reconnect.backoff.ms = 50
>> >>>>     request.timeout.ms = 305000
>> >>>>     retry.backoff.ms = 100
>> >>>>     sasl.jaas.config = null
>> >>>>     sasl.kerberos.kinit.cmd = /usr/bin/kinit
>> >>>>     sasl.kerberos.min.time.before.relogin = 60000
>> >>>>     sasl.kerberos.service.name = null
>> >>>>     sasl.kerberos.ticket.renew.jitter = 0.05
>> >>>>     sasl.kerberos.ticket.renew.window.factor = 0.8
>> >>>>     sasl.mechanism = GSSAPI
>> >>>>     security.protocol = PLAINTEXT
>> >>>>     send.buffer.bytes = 131072
>> >>>>     session.timeout.ms = 10000
>> >>>>     ssl.cipher.suites = null
>> >>>>     ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>> >>>>     ssl.endpoint.identification.algorithm = null
>> >>>>     ssl.key.password = null
>> >>>>     ssl.keymanager.algorithm = SunX509
>> >>>>     ssl.keystore.location = null
>> >>>>     ssl.keystore.password = null
>> >>>>     ssl.keystore.type = JKS
>> >>>>     ssl.protocol = TLS
>> >>>>     ssl.provider = null
>> >>>>     ssl.secure.random.implementation = null
>> >>>>     ssl.trustmanager.algorithm = PKIX
>> >>>>     ssl.truststore.location = null
>> >>>>     ssl.truststore.password = null
>> >>>>     ssl.truststore.type = JKS
>> >>>>     value.deserializer = class my.dev.MessageUpdateCodec
>> >>>>
>> >>>> DEBUG | 21:59:30 | consumer.KafkaConsumer (KafkaConsumer.java:1617) -
>> >>>> The Kafka consumer has closed.
>> >>>>
>> >>>> ...and the whole process repeats.
>> >>>>
>> >>>> Below you can find my startup code for Kafka Streams and the
>> >>>> parameters which I have configured for the application:
>> >>>>
>> >>>> private static Properties settings = new Properties();
>> >>>> settings.put(StreamsConfig.APPLICATION_ID_CONFIG,
>> >>>>         "mykafkastreamsapplication");
>> >>>> settings.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
>> >>>> settings.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000");
>> >>>> settings.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");
>> >>>> settings.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG,
>> >>>>         Integer.MAX_VALUE);
>> >>>> settings.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10000");
>> >>>> settings.put(ConsumerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, "60000");
>> >>>>
>> >>>> KStreamBuilder builder = new KStreamBuilder();
>> >>>> KafkaStreams streams = new KafkaStreams(builder, settings);
>> >>>> builder.addSource(.....
>> >>>>     .addProcessor .............
>> >>>>     .addProcessor ........
>> >>>>     .addStateStore(...............
>> >>>>     ....).persistent().build(), "my processor")
>> >>>>     .addSink ..............
>> >>>>     .addSink ..............
>> >>>> streams.start();
>> >>>>
>> >>>> And I am using a simple processor to process my logic:
>> >>>>
>> >>>> public class InfoProcessor extends AbstractProcessor<Key, Update> {
>> >>>>     private static Logger logger = Logger.getLogger(InfoProcessor.class);
>> >>>>     private ProcessorContext context;
>> >>>>     private KeyValueStore<Key, Info> infoStore;
>> >>>>
>> >>>>     @Override
>> >>>>     @SuppressWarnings("unchecked")
>> >>>>     public void init(ProcessorContext context) {
>> >>>>         this.context = context;
>> >>>>         this.context.schedule(Constants.BATCH_DURATION_SECONDS * 1000);
>> >>>>         infoStore = (KeyValueStore<Key, Info>)
>> >>>>                 context.getStateStore("InfoStore");
>> >>>>     }
>> >>>>
>> >>>>     @Override
>> >>>>     public void process(Key key, Update update) {
>> >>>>         try {
>> >>>>             if (key != null && update != null) {
>> >>>>                 Info info = infoStore.get(key);
>> >>>>                 // merge logic
>> >>>>                 infoStore.put(key, info);
>> >>>>             }
>> >>>>         } catch (Exception e) {
>> >>>>             logger.error(e.getMessage(), e);
>> >>>>         }
>> >>>>         context.commit();
>> >>>>     }
>> >>>>
>> >>>>     @Override
>> >>>>     public void punctuate(long timestamp) {
>> >>>>         try {
>> >>>>             KeyValueIterator<Key, Info> iter = this.infoStore.all();
>> >>>>             while (iter.hasNext()) {
>> >>>>                 // processing logic
>> >>>>             }
>> >>>>             iter.close();
>> >>>>             context.commit();
>> >>>>         } catch (Exception e) {
>> >>>>             logger.error(e.getMessage(), e);
>> >>>>         }
>> >>>>     }
>> >>>> }
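[Editor's note: one small hardening worth flagging in the punctuate() method above: if the processing loop throws, iter.close() is skipped, so the store iterator is never released. Moving the close into a finally block guarantees it runs. The sketch below demonstrates the pattern in plain Java; a CloseableIterator stand-in is assumed in place of Streams' KeyValueIterator so the example runs without Kafka on the classpath.]

```java
import java.util.Iterator;
import java.util.List;

// Hedged sketch of the close-in-finally pattern for punctuate().
// CloseableIterator is a stand-in for Kafka Streams' KeyValueIterator,
// which likewise extends both Iterator and AutoCloseable.
public class CloseInFinally {
    interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
        @Override void close(); // narrowed: no checked exception
    }

    static <T> int drain(CloseableIterator<T> iter) {
        int count = 0;
        try {
            while (iter.hasNext()) {
                iter.next();
                count++; // processing logic would go here
            }
        } finally {
            iter.close(); // runs even if the loop above throws
        }
        return count;
    }

    // Wrap a list iterator so the example is self-contained.
    static CloseableIterator<String> over(List<String> items) {
        Iterator<String> it = items.iterator();
        return new CloseableIterator<String>() {
            public boolean hasNext() { return it.hasNext(); }
            public String next() { return it.next(); }
            public void close() { /* release underlying resources here */ }
        };
    }

    public static void main(String[] args) {
        System.out.println(drain(over(List.of("a", "b", "c")))); // prints 3
    }
}
```

In the real processor, the same shape would be `KeyValueIterator<Key, Info> iter = infoStore.all(); try { ... } finally { iter.close(); }` inside the existing try/catch.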