That'd be great, as I'm not familiar with the protocol there.

On Wed, May 3, 2017 at 8:41 AM Eno Thereska <eno.there...@gmail.com> wrote:
> Cool, thanks, shall we open a JIRA?
>
> Eno
>
> On 3 May 2017, at 16:16, João Peixoto <joao.harti...@gmail.com> wrote:
> >
> > Actually I need to apologize, I pasted the wrong issue. I meant to paste
> > https://github.com/facebook/rocksdb/issues/261.
> >
> > RocksDB did not produce a crash report since it didn't actually crash. I
> > performed thread dumps on stale and not-stale instances, which revealed
> > the common behavior, and I collect and plot several Kafka metrics,
> > including "punctuate" durations, so I know it took a long time and
> > eventually finished.
> >
> > Joao
> >
> > On Wed, May 3, 2017 at 6:22 AM Eno Thereska <eno.there...@gmail.com>
> > wrote:
> >
> >> Hi there,
> >>
> >> Thanks for double checking. Does RocksDB actually crash or produce a
> >> crash dump? I'm curious how you know that the issue is
> >> https://github.com/facebook/rocksdb/issues/1121, so just double
> >> checking with you.
> >>
> >> If that's indeed the case, do you mind opening a JIRA (a copy-paste of
> >> the below should suffice)? Alternatively, let us know and we'll open
> >> it. Sounds like we should handle this better.
> >>
> >> Thanks,
> >> Eno
> >>
> >>> On May 3, 2017, at 5:49 AM, João Peixoto <joao.harti...@gmail.com>
> >>> wrote:
> >>>
> >>> I believe I found the root cause of my problem. I seem to have hit
> >>> this RocksDB bug: https://github.com/facebook/rocksdb/issues/1121
> >>>
> >>> In my stream configuration I have a custom transformer used for
> >>> deduplicating records, heavily inspired by the
> >>> EventDeduplicationLambdaIntegrationTest
> >>> <https://github.com/confluentinc/examples/blob/3.2.x/kafka-streams/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java#L161>
> >>> but adjusted to my use case, with special emphasis on the "punctuate"
> >>> method.
> >>>
> >>> All the stale instances had the main stream thread "RUNNING" the
> >>> "punctuate" method of this transformer, which in turn was running
> >>> RocksDB's "seekToFirst".
> >>>
> >>> Also, during my debugging, one such instance finished the "punctuate"
> >>> method, which took ~11h, exactly the time the instance had been stuck
> >>> for. Changing the backing state store from "persistent" to "inMemory"
> >>> solved my issue; after several days running, there have been no stuck
> >>> instances.
> >>>
> >>> This leads me to ask: shouldn't Kafka detect such a situation fairly
> >>> quickly, instead of just stopping polling? My guess is that the
> >>> heartbeat thread, which is now separate, continues working fine;
> >>> since by definition the stream runs a message through the whole
> >>> pipeline, this step probably just looked like it was VERY slow. Not
> >>> sure what the best approach here would be.
> >>>
> >>> PS The linked code clearly states "This code is for demonstration
> >>> purposes and was not tested for production usage", so that's on me.
> >>>
> >>> On Tue, May 2, 2017 at 11:20 AM Matthias J. Sax
> >>> <matth...@confluent.io> wrote:
> >>>
> >>>> Did you check the logs? Maybe you need to increase the log level to
> >>>> DEBUG to get some more information.
> >>>>
> >>>> Did you double-check committed offsets via
> >>>> bin/kafka-consumer-groups.sh?
> >>>>
> >>>> -Matthias
> >>>>
> >>>> On 4/28/17 9:22 AM, João Peixoto wrote:
> >>>>> My stream gets stale after a while and it simply does not receive
> >>>>> any new messages, i.e. it does not poll.
> >>>>>
> >>>>> I'm using Kafka Streams 0.10.2.1 (the same happens with 0.10.2.0)
> >>>>> and the brokers are running 0.10.1.1.
> >>>>>
> >>>>> The stream state is RUNNING and there are no exceptions in the logs.
> >>>>>
> >>>>> Looking at the JMX metrics, the threads are there and running, just
> >>>>> not doing anything. The metric "consumer-coordinator-metrics /
> >>>>> heartbeat-response-time-max" (the max time taken to receive a
> >>>>> response to a heartbeat request) reads 43,361 seconds (almost 12
> >>>>> hours), which is consistent with the duration of the hang.
> >>>>> Shouldn't this trigger a failure somehow?
> >>>>>
> >>>>> The stream configuration looks something like this:
> >>>>>
> >>>>> Properties props = new Properties();
> >>>>> props.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
> >>>>>     CustomTimestampExtractor.class.getName());
> >>>>> props.put(StreamsConfig.APPLICATION_ID_CONFIG, streamName);
> >>>>> props.put(StreamsConfig.CLIENT_ID_CONFIG, streamName);
> >>>>> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG,
> >>>>>     myConfig.getBrokerList());
> >>>>> props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG,
> >>>>>     Serdes.String().getClass().getName());
> >>>>> props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG,
> >>>>>     Serdes.ByteArray().getClass().getName());
> >>>>> props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG,
> >>>>>     myConfig.getCommitIntervalMs()); // 5000
> >>>>> props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
> >>>>> props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG,
> >>>>>     myConfig.getStreamThreadsCount()); // 1
> >>>>> props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG,
> >>>>>     myConfig.getMaxCacheBytes()); // 524_288_000L
> >>>>> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
> >>>>> props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
> >>>>>
> >>>>> The stream LEFT JOINs 2 topics, one of them being a KTable, and
> >>>>> outputs to another topic.
> >>>>>
> >>>>> Thanks in advance for the help!
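---

For anyone hitting this thread later: here is a minimal sketch of the kind of
deduplication transformer being described above, against the 0.10.2 Transformer
API. The store name ("seen-ids"), retention period, and types are illustrative,
not taken from João's code. The point of interest is punctuate(): on a
persistent (RocksDB-backed) store, all() opens an iterator that seeks to the
first key and scans the entire store, which is exactly the kind of call that
can occupy the single stream thread for hours while the separate heartbeat
thread (KIP-62) keeps the group membership alive, so the instance stays
RUNNING but never polls.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Illustrative sketch only; loosely follows the pattern of the
// EventDeduplicationLambdaIntegrationTest linked above.
public class DeduplicationTransformer
    implements Transformer<String, byte[], KeyValue<String, byte[]>> {

  private static final long RETENTION_MS = 60 * 60 * 1000L; // illustrative

  private ProcessorContext context;
  private KeyValueStore<String, Long> seenIds;

  @Override
  @SuppressWarnings("unchecked")
  public void init(final ProcessorContext context) {
    this.context = context;
    // "seen-ids" is assumed to have been added to the topology beforehand.
    this.seenIds = (KeyValueStore<String, Long>) context.getStateStore("seen-ids");
    context.schedule(60_000L); // punctuate roughly every minute (0.10.2 API)
  }

  @Override
  public KeyValue<String, byte[]> transform(final String key, final byte[] value) {
    if (seenIds.get(key) != null) {
      return null; // duplicate within the retention window: drop it
    }
    seenIds.put(key, context.timestamp());
    return KeyValue.pair(key, value);
  }

  // The step under discussion: all() on a RocksDB store seeks to the first
  // key and iterates the whole store. Since punctuate() runs on the stream
  // thread in 0.10.2, a large store blocks poll() for the full scan.
  @Override
  public KeyValue<String, byte[]> punctuate(final long timestamp) {
    try (KeyValueIterator<String, Long> it = seenIds.all()) {
      while (it.hasNext()) {
        final KeyValue<String, Long> entry = it.next();
        if (entry.value != null && timestamp - entry.value > RETENTION_MS) {
          seenIds.delete(entry.key); // purge expired ids
        }
      }
    }
    return null; // nothing to emit from punctuate
  }

  @Override
  public void close() {}
}

One more detail that may explain why nothing ever failed: I believe Kafka
Streams overrides the consumer's max.poll.interval.ms to a very large value by
default, so a punctuate() that takes ~11h never trips the poll-interval check,
which matches the symptom reported above.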