Thanks for confirming the fix. On Mon, Oct 22, 2018 at 7:48 AM Eduardo Soldera < [email protected]> wrote:
> Hi Raghu, Dataflow throws an exception if Kafka fails now and it recovers > after Kafka is available. > > Regards > > Em sex, 19 de out de 2018 às 14:01, Raghu Angadi <[email protected]> > escreveu: > >> On Fri, Oct 19, 2018 at 6:54 AM Eduardo Soldera < >> [email protected]> wrote: >> >>> Hi Raghu, just a quick update. We were waiting for Spotify's Scio to >>> update to Beam 2.7. We've just deployed the pipeling sucessfully. Just for >>> letting you know, I tried to use the workaround code snipped, but Dataflow >>> wouldn't recover after a Kafka unavailability. >>> >> >> Thanks for the update. The workaround helps only if KafkaClient itself >> can recover when try to read again. I guess some of those exceptions are >> are not recoverable. >> >> Please let us know how the actual fix works. >> >> Thanks. >> Raghu. >> >> >>> >>> Thanks for your help. >>> >>> Regards >>> >>> Em qua, 19 de set de 2018 às 15:37, Raghu Angadi <[email protected]> >>> escreveu: >>> >>>> >>>> >>>> On Wed, Sep 19, 2018 at 11:24 AM Juan Carlos Garcia < >>>> [email protected]> wrote: >>>> >>>>> Sorry I hit the send button to fast... The error occurs in the worker. >>>>> >>>> >>>> Np. Just one more comment on it: it is a very important >>>> design/correctness decision to for runner to decide how to handle >>>> persistent errors in a streaming pipeline. Dataflow keeps failing since >>>> there is no solution to restart a pipeline from scratch without losing >>>> exactly-once guarantees. It lets user decide if the pipeline needs to be >>>> 'upgraded'. >>>> >>>> Raghu. >>>> >>>>> >>>>> Juan Carlos Garcia <[email protected]> schrieb am Mi., 19. Sep. >>>>> 2018, 20:22: >>>>> >>>>>> Sorry for hijacking the thread, we are running Spark on top of Yarn, >>>>>> yarn retries multiple times until it reachs it max attempt and then gives >>>>>> up. >>>>>> >>>>>> Raghu Angadi <[email protected]> schrieb am Mi., 19. Sep. 2018, >>>>>> 18:58: >>>>>> >>>>>>> On Wed, Sep 19, 2018 at 7:26 AM Juan Carlos Garcia < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Don't know if its related, but we have seen our pipeline dying >>>>>>>> (using SparkRunner) when there is problem with Kafka (network >>>>>>>> interruptions), errors like: >>>>>>>> org.apache.kafka.common.errors.TimeoutException: Timeout expired while >>>>>>>> fetching topic metadata >>>>>>>> >>>>>>>> Maybe this will fix it as well, thanks Raghu for the hint about >>>>>>>> *withConsumerFactoryFn.* >>>>>>>> >>>>>>> >>>>>>> Wouldn't that be retried by the SparkRunner if it happens on the >>>>>>> worker? or does it happen while launching the pipeline on the client? >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Sep 19, 2018 at 3:29 PM Eduardo Soldera < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hi Raghu, thank you. >>>>>>>>> >>>>>>>>> I'm not sure though what to pass as an argument: >>>>>>>>> >>>>>>>>> KafkaIO.read[String,String]() >>>>>>>>> .withBootstrapServers(server) >>>>>>>>> .withTopic(topic) >>>>>>>>> .withKeyDeserializer(classOf[StringDeserializer]) >>>>>>>>> .withValueDeserializer(classOf[StringDeserializer]) >>>>>>>>> .withConsumerFactoryFn(new >>>>>>>>> KafkaExecutor.ConsumerFactoryFn(????????????????)) >>>>>>>>> .updateConsumerProperties(properties) >>>>>>>>> .commitOffsetsInFinalize() >>>>>>>>> .withoutMetadata() >>>>>>>>> >>>>>>>>> >>>>>>>>> Regards >>>>>>>>> >>>>>>>>> >>>>>>>>> Em ter, 18 de set de 2018 às 21:15, Raghu Angadi < >>>>>>>>> [email protected]> escreveu: >>>>>>>>> >>>>>>>>>> Hi Eduardo, >>>>>>>>>> >>>>>>>>>> There another work around you can try without having to wait for >>>>>>>>>> 2.7.0 release: Use a wrapper to catch exception from >>>>>>>>>> KafkaConsumer#poll() >>>>>>>>>> and pass the wrapper to withConsumerFactoryFn() for KafkIO reader >>>>>>>>>> [1]. >>>>>>>>>> >>>>>>>>>> Using something like (such a wrapper is used in KafkasIO tests >>>>>>>>>> [2]): >>>>>>>>>> private static class ConsumerFactoryFn >>>>>>>>>> implements SerializableFunction<Map<String, >>>>>>>>>> Object>, Consumer<byte[], byte[]>> { >>>>>>>>>> @Override >>>>>>>>>> public Consumer<byte[], byte[]> apply(Map<String, Object> >>>>>>>>>> config) { >>>>>>>>>> return new KafkaConsumer(config) { >>>>>>>>>> @Override >>>>>>>>>> public ConsumerRecords<K, V> poll(long timeout) { >>>>>>>>>> // work around for BEAM-5375 >>>>>>>>>> while (true) { >>>>>>>>>> try { >>>>>>>>>> return super.poll(timeout); >>>>>>>>>> } catch (Exception e) { >>>>>>>>>> // LOG & sleep for sec >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> [1]: >>>>>>>>>> https://github.com/apache/beam/blob/release-2.4.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L417 >>>>>>>>>> [2]: >>>>>>>>>> https://github.com/apache/beam/blob/release-2.4.0/sdks/java/io/kafka/src/test/java/org/apache/beam/sdk/io/kafka/KafkaIOTest.java#L261 >>>>>>>>>> >>>>>>>>>> On Tue, Sep 18, 2018 at 5:49 AM Eduardo Soldera < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Raghu, we're not sure how long the network was down. >>>>>>>>>>> According to the logs no longer than one minute. A 30 second >>>>>>>>>>> shutdown would >>>>>>>>>>> work for the tests. >>>>>>>>>>> >>>>>>>>>>> Regards >>>>>>>>>>> >>>>>>>>>>> Em sex, 14 de set de 2018 às 21:41, Raghu Angadi < >>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>> >>>>>>>>>>>> Thanks. I could repro myself as well. How long was the network >>>>>>>>>>>> down? >>>>>>>>>>>> >>>>>>>>>>>> Trying to get the fix into 2.7 RC2. >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Sep 14, 2018 at 12:25 PM Eduardo Soldera < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Just to make myself clear, I'm not sure how to use the patch >>>>>>>>>>>>> but if you could send us some guidance would be great. >>>>>>>>>>>>> >>>>>>>>>>>>> Em sex, 14 de set de 2018 às 16:24, Eduardo Soldera < >>>>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Raghu, yes, it is feasible, would you do that for us? I'm >>>>>>>>>>>>>> not sure how we'd use the patch. We're using SBT and Spotify's >>>>>>>>>>>>>> Scio with >>>>>>>>>>>>>> Scala. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks >>>>>>>>>>>>>> >>>>>>>>>>>>>> Em sex, 14 de set de 2018 às 16:07, Raghu Angadi < >>>>>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Is is feasible for you to verify the fix in your dev job? I >>>>>>>>>>>>>>> can make a patch against Beam 2.4 branch if you like. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Raghu. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Sep 14, 2018 at 11:14 AM Eduardo Soldera < >>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Raghu, thank you very much for the pull request. >>>>>>>>>>>>>>>> We'll wait for the 2.7 Beam release. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Regards! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Em qui, 13 de set de 2018 às 18:19, Raghu Angadi < >>>>>>>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Fix: https://github.com/apache/beam/pull/6391 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 3:30 PM Raghu Angadi < >>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Filed BEAM-5375 >>>>>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-5375>. I >>>>>>>>>>>>>>>>>> will fix it later this week. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 12:16 PM Raghu Angadi < >>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 12:11 PM Raghu Angadi < >>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks for the job id, I looked at the worker logs >>>>>>>>>>>>>>>>>>>> (following usual support oncall access protocol that >>>>>>>>>>>>>>>>>>>> provides temporary >>>>>>>>>>>>>>>>>>>> access to things like logs in GCP): >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Root issue looks like consumerPollLoop() mentioned >>>>>>>>>>>>>>>>>>>> earlier needs to handle unchecked exception. In your case >>>>>>>>>>>>>>>>>>>> it is clear that >>>>>>>>>>>>>>>>>>>> poll thread exited with a runtime exception. The reader >>>>>>>>>>>>>>>>>>>> does not check for >>>>>>>>>>>>>>>>>>>> it and continues to wait for poll thread to enqueue >>>>>>>>>>>>>>>>>>>> messages. A fix should >>>>>>>>>>>>>>>>>>>> result in an IOException for read from the source. The >>>>>>>>>>>>>>>>>>>> runners will handle >>>>>>>>>>>>>>>>>>>> that appropriately after that. I will file a jira. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Ignore the link.. was pasted here by mistake. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> From the logs (with a comment below each one): >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> - 2018-09-12 06:13:07.345 PDT Reader-0: reading >>>>>>>>>>>>>>>>>>>> from kafka_topic-0 starting at offset 2 >>>>>>>>>>>>>>>>>>>> - Implies the reader is initialized and poll >>>>>>>>>>>>>>>>>>>> thread is started. >>>>>>>>>>>>>>>>>>>> - 2018-09-12 06:13:07.780 PDT Reader-0: first >>>>>>>>>>>>>>>>>>>> record offset 2 >>>>>>>>>>>>>>>>>>>> - The reader actually got a message received by >>>>>>>>>>>>>>>>>>>> the poll thread from Kafka. >>>>>>>>>>>>>>>>>>>> - 2018-09-12 06:16:48.771 PDT Reader-0: exception >>>>>>>>>>>>>>>>>>>> while fetching latest offset for partition >>>>>>>>>>>>>>>>>>>> kafka_topic-0. will be retried. >>>>>>>>>>>>>>>>>>>> - This must have happened around the time when >>>>>>>>>>>>>>>>>>>> network was disrupted. This is from. Actual log is >>>>>>>>>>>>>>>>>>>> from another periodic >>>>>>>>>>>>>>>>>>>> task that fetches latest offsets for partitions. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The poll thread must have died around the time network >>>>>>>>>>>>>>>>>>>> was disrupted. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> The following log comes from kafka client itself and is >>>>>>>>>>>>>>>>>>>> printed every second when KafkaIO fetches latest offset. >>>>>>>>>>>>>>>>>>>> This log seems to >>>>>>>>>>>>>>>>>>>> be added in recent versions. It is probably an >>>>>>>>>>>>>>>>>>>> unintentional log. I don't >>>>>>>>>>>>>>>>>>>> think there is any better to fetch latest offsets than how >>>>>>>>>>>>>>>>>>>> KafkaIO does >>>>>>>>>>>>>>>>>>>> now. This is logged inside consumer.position() called at >>>>>>>>>>>>>>>>>>>> [1]. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> - 2018-09-12 06:13:11.786 PDT [Consumer >>>>>>>>>>>>>>>>>>>> clientId=consumer-2, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> groupId=Reader-0_offset_consumer_1735388161_genericPipe] >>>>>>>>>>>>>>>>>>>> Resetting offset >>>>>>>>>>>>>>>>>>>> for partition com.arquivei.dataeng.andre-0 to offset 3. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> [1]: >>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L678 >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> This 'Resetting offset' is harmless, but is quite >>>>>>>>>>>>>>>>>>> annoying to see in the worker logs. One way to avoid is to >>>>>>>>>>>>>>>>>>> set kafka >>>>>>>>>>>>>>>>>>> consumer's log level to WARNING. Ideally KafkaIO itself >>>>>>>>>>>>>>>>>>> should do something >>>>>>>>>>>>>>>>>>> to avoid it without user option. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 10:27 AM Eduardo Soldera < >>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi Raghu! The job_id of our dev job is >>>>>>>>>>>>>>>>>>>>> 2018-09-12_06_11_48-5600553605191377866. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Em qua, 12 de set de 2018 às 14:18, Raghu Angadi < >>>>>>>>>>>>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks for debugging. >>>>>>>>>>>>>>>>>>>>>> Can you provide the job_id of your dev job? The >>>>>>>>>>>>>>>>>>>>>> stacktrace shows that there is no thread running >>>>>>>>>>>>>>>>>>>>>> 'consumerPollLoop()' which >>>>>>>>>>>>>>>>>>>>>> can explain stuck reader. You will likely find a logs at >>>>>>>>>>>>>>>>>>>>>> line 594 & 587 >>>>>>>>>>>>>>>>>>>>>> [1]. Dataflow caches its readers and DirectRunner may >>>>>>>>>>>>>>>>>>>>>> not. That can >>>>>>>>>>>>>>>>>>>>>> explain DirectRunner resume reads. The expectation in >>>>>>>>>>>>>>>>>>>>>> KafkaIO is that Kafka >>>>>>>>>>>>>>>>>>>>>> client library takes care of retrying in case of >>>>>>>>>>>>>>>>>>>>>> connection problems (as >>>>>>>>>>>>>>>>>>>>>> documented). It is possible that in some cases poll() >>>>>>>>>>>>>>>>>>>>>> throws and we need to >>>>>>>>>>>>>>>>>>>>>> restart the client in KafkaIO. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> [1]: >>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L594 >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 9:59 AM Eduardo Soldera < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Hi Raghu, thanks for your help. >>>>>>>>>>>>>>>>>>>>>>> Just answering your previous question, the following >>>>>>>>>>>>>>>>>>>>>>> logs were the same as before the error, as if the >>>>>>>>>>>>>>>>>>>>>>> pipeline were still >>>>>>>>>>>>>>>>>>>>>>> getting the messages, for example: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> (...) >>>>>>>>>>>>>>>>>>>>>>> Resetting offset for partition >>>>>>>>>>>>>>>>>>>>>>> com.arquivei.dataeng.andre-0 to offset 10. >>>>>>>>>>>>>>>>>>>>>>> Resetting offset for partition >>>>>>>>>>>>>>>>>>>>>>> com.arquivei.dataeng.andre-0 to offset 15. >>>>>>>>>>>>>>>>>>>>>>> ERROR >>>>>>>>>>>>>>>>>>>>>>> Resetting offset for partition >>>>>>>>>>>>>>>>>>>>>>> com.arquivei.dataeng.andre-0 to offset 22. >>>>>>>>>>>>>>>>>>>>>>> Resetting offset for partition >>>>>>>>>>>>>>>>>>>>>>> com.arquivei.dataeng.andre-0 to offset 30. >>>>>>>>>>>>>>>>>>>>>>> (...) >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> But when checking the Kafka Consumer Group, the >>>>>>>>>>>>>>>>>>>>>>> current offset stays at 15, the commited offset from >>>>>>>>>>>>>>>>>>>>>>> the last processed >>>>>>>>>>>>>>>>>>>>>>> message, before the error. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> We'll file a bug, but we could now reproduce the >>>>>>>>>>>>>>>>>>>>>>> issue in a Dev scenario. >>>>>>>>>>>>>>>>>>>>>>> We started the same pipeline using the direct >>>>>>>>>>>>>>>>>>>>>>> runner, without Google Dataflow. We blocked the Kafka >>>>>>>>>>>>>>>>>>>>>>> Broker network and >>>>>>>>>>>>>>>>>>>>>>> the same error was thrown. Then we unblocked the >>>>>>>>>>>>>>>>>>>>>>> network and the pipeline >>>>>>>>>>>>>>>>>>>>>>> was able to successfully process the subsequent >>>>>>>>>>>>>>>>>>>>>>> messages. >>>>>>>>>>>>>>>>>>>>>>> When we started the same pipeline in the Dataflow >>>>>>>>>>>>>>>>>>>>>>> runner and did the same test, the same problem from our >>>>>>>>>>>>>>>>>>>>>>> production scenario >>>>>>>>>>>>>>>>>>>>>>> happened, Dataflow couldn't process the new messages. >>>>>>>>>>>>>>>>>>>>>>> Unfortunately, we've >>>>>>>>>>>>>>>>>>>>>>> stopped the dataflow job in production, but the >>>>>>>>>>>>>>>>>>>>>>> problematic dev job is >>>>>>>>>>>>>>>>>>>>>>> still running and the log file of the VM is attached. >>>>>>>>>>>>>>>>>>>>>>> Thank you very much. >>>>>>>>>>>>>>>>>>>>>>> Best regards >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Em ter, 11 de set de 2018 às 18:28, Raghu Angadi < >>>>>>>>>>>>>>>>>>>>>>> [email protected]> escreveu: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> Specifically, I am interested if you have any >>>>>>>>>>>>>>>>>>>>>>>> thread running 'consumerPollLoop()' [1]. There should >>>>>>>>>>>>>>>>>>>>>>>> always be one (if a >>>>>>>>>>>>>>>>>>>>>>>> worker is assigned one of the partitions). It is >>>>>>>>>>>>>>>>>>>>>>>> possible that KafkaClient >>>>>>>>>>>>>>>>>>>>>>>> itself is hasn't recovered from the group coordinator >>>>>>>>>>>>>>>>>>>>>>>> error (though >>>>>>>>>>>>>>>>>>>>>>>> unlikely). >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaUnboundedReader.java#L570 >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:31 PM Raghu Angadi < >>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Hi Eduardo, >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> In case of any error, the pipeline should keep on >>>>>>>>>>>>>>>>>>>>>>>>> trying to fetch. I don't know about this particular >>>>>>>>>>>>>>>>>>>>>>>>> error. Do you see any >>>>>>>>>>>>>>>>>>>>>>>>> others afterwards in the log? >>>>>>>>>>>>>>>>>>>>>>>>> Couple of things you could try if the logs are not >>>>>>>>>>>>>>>>>>>>>>>>> useful : >>>>>>>>>>>>>>>>>>>>>>>>> - login to one of the VMs and get stacktrace of >>>>>>>>>>>>>>>>>>>>>>>>> java worker (look for a container called >>>>>>>>>>>>>>>>>>>>>>>>> java-streaming) >>>>>>>>>>>>>>>>>>>>>>>>> - file a support bug or stackoverflow question >>>>>>>>>>>>>>>>>>>>>>>>> with jobid so that Dataflow oncall can take a look. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> Raghu. >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 12:10 PM Eduardo Soldera < >>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>>>>>>>>> We have a Apache Beam pipeline running in Google >>>>>>>>>>>>>>>>>>>>>>>>>> Dataflow using KafkaIO. Suddenly the pipeline stop >>>>>>>>>>>>>>>>>>>>>>>>>> fetching Kafka messages >>>>>>>>>>>>>>>>>>>>>>>>>> at all, as our other workers from other pipelines >>>>>>>>>>>>>>>>>>>>>>>>>> continued to get Kafka >>>>>>>>>>>>>>>>>>>>>>>>>> messages. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> At the moment it stopped we got these messages: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> I [Consumer clientId=consumer-1, >>>>>>>>>>>>>>>>>>>>>>>>>> groupId=genericPipe] Error sending fetch request >>>>>>>>>>>>>>>>>>>>>>>>>> (sessionId=1396189203, epoch=2431598) to node 3: >>>>>>>>>>>>>>>>>>>>>>>>>> org.apache.kafka.common.errors.DisconnectException. >>>>>>>>>>>>>>>>>>>>>>>>>> I [Consumer clientId=consumer-1, >>>>>>>>>>>>>>>>>>>>>>>>>> groupId=genericPipe] Group coordinator >>>>>>>>>>>>>>>>>>>>>>>>>> 10.0.52.70:9093 (id: 2147483646 rack: null) is >>>>>>>>>>>>>>>>>>>>>>>>>> unavailable or invalid, will attempt rediscovery >>>>>>>>>>>>>>>>>>>>>>>>>> I [Consumer clientId=consumer-1, >>>>>>>>>>>>>>>>>>>>>>>>>> groupId=genericPipe] Discovered group coordinator >>>>>>>>>>>>>>>>>>>>>>>>>> 10.0.52.70:9093 (id: 2147483646 rack: null) >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> And then the pipeline stopped reading the >>>>>>>>>>>>>>>>>>>>>>>>>> messages. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> This is the KafkaIO setup we have: >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> KafkaIO.read[String,String]() >>>>>>>>>>>>>>>>>>>>>>>>>> .withBootstrapServers(server) >>>>>>>>>>>>>>>>>>>>>>>>>> .withTopic(topic) >>>>>>>>>>>>>>>>>>>>>>>>>> .withKeyDeserializer(classOf[StringDeserializer]) >>>>>>>>>>>>>>>>>>>>>>>>>> .withValueDeserializer(classOf[StringDeserializer]) >>>>>>>>>>>>>>>>>>>>>>>>>> .updateConsumerProperties(properties) >>>>>>>>>>>>>>>>>>>>>>>>>> .commitOffsetsInFinalize() >>>>>>>>>>>>>>>>>>>>>>>>>> .withoutMetadata() >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Any help will be much appreciated. >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>> Best regards, >>>>>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>>>>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas >>>>>>>>>>>>>>>>>>>>>>>>>> Fiscais] >>>>>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e >>>>>>>>>>>>>>>>>>>>>>>>>> mentoria no Vale do Silício] >>>>>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas >>>>>>>>>>>>>>>>>>>>>>> Fiscais] >>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e >>>>>>>>>>>>>>>>>>>>>>> mentoria no Vale do Silício] >>>>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas >>>>>>>>>>>>>>>>>>>>> Fiscais] >>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e >>>>>>>>>>>>>>>>>>>>> mentoria no Vale do Silício] >>>>>>>>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>>>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria >>>>>>>>>>>>>>>> no Vale do Silício] >>>>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no >>>>>>>>>>>>>> Vale do Silício] >>>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>>>> Data Engineer >>>>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>>>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no >>>>>>>>>>>>> Vale do Silício] >>>>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Eduardo Soldera Garcia >>>>>>>>>>> Data Engineer >>>>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>>>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no >>>>>>>>>>> Vale do Silício] >>>>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Eduardo Soldera Garcia >>>>>>>>> Data Engineer >>>>>>>>> (16) 3509-5555 | www.arquivei.com.br >>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>>>>>>>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>>>>>>>> [image: Google seleciona Arquivei para imersão e mentoria no Vale >>>>>>>>> do Silício] >>>>>>>>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>>>>>>>> <https://www.facebook.com/arquivei> >>>>>>>>> <https://www.linkedin.com/company/arquivei> >>>>>>>>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> JC >>>>>>>> >>>>>>>> >>> >>> -- >>> Eduardo Soldera Garcia >>> Data Engineer >>> (16) 3509-5555 | www.arquivei.com.br >>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>> [image: Arquivei.com.br – Inteligência em Notas Fiscais] >>> <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> >>> [image: Google seleciona Arquivei para imersão e mentoria no Vale do >>> Silício] >>> <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> >>> <https://www.facebook.com/arquivei> >>> <https://www.linkedin.com/company/arquivei> >>> <https://www.youtube.com/watch?v=sSUUKxbXnxk> >>> >> > > -- > Eduardo Soldera Garcia > Data Engineer > (16) 3509-5555 | www.arquivei.com.br > <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> > [image: Arquivei.com.br – Inteligência em Notas Fiscais] > <https://arquivei.com.br/?utm_campaign=assinatura-email&utm_content=assinatura> > [image: Google seleciona Arquivei para imersão e mentoria no Vale do > Silício] > <https://arquivei.com.br/blog/google-seleciona-arquivei/?utm_campaign=assinatura-email-launchpad&utm_content=assinatura-launchpad> > <https://www.facebook.com/arquivei> > <https://www.linkedin.com/company/arquivei> > <https://www.youtube.com/watch?v=sSUUKxbXnxk> >
