Hi Bruno,

sorry for getting back to you so late. I just tried to access your logs to investigate the problem, but transfer.sh tells me that they are no longer there. Could you re-upload them or send them directly to my mail address? Sorry for not taking a look at your problem sooner, and for the inconvenience with the upload.
Cheers,
Till

On Thu, Mar 21, 2019 at 4:30 PM Bruno Aranda <bara...@apache.org> wrote:

> Ok, here it goes:
>
> https://transfer.sh/12qMre/jobmanager-debug.log
>
> In an attempt to make it smaller, I removed the noisy "http wire" entries and masked a couple of things. Not sure this covers everything you would like to see.
>
> Thanks!
>
> Bruno
>
> On Thu, 21 Mar 2019 at 15:24, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Bruno,
>>
>> could you upload the logs to https://transfer.sh/ or https://gist.github.com/ and then post a link? This will be crucial for further debugging. It would be really good if you could set the log level to DEBUG.
>>
>> Concerning the number of registered TMs, the new mode (not the legacy mode) no longer respects the `-n` setting when you start a yarn session. Instead, it will dynamically start as many containers as are needed to run the submitted jobs. That's why you don't see the spare TM; this is the expected behaviour.
>>
>> The community intends to add support for ranges of how many TMs must be active at any given time [1].
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-11078
>>
>> Cheers,
>> Till
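Expanding on my point above about the new mode: the containers a yarn session brings up follow the slots the submitted jobs actually request, which is driven by each job's parallelism and by the configured slots per TM (taskmanager.numberOfTaskSlots), not by the `-n` value. Purely as an illustrative sketch, with made-up class name, pipeline and parallelism (nothing taken from your jobs): a job submitted with parallelism 12 against TMs configured with 6 slots each would need 12 slots, so the session would request roughly two containers for it on demand.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: in the new mode the job's parallelism, not the session's -n,
// determines how many slots (and therefore YARN containers) get requested.
public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(12); // 12 slots needed; with 6 slots per TM, roughly 2 containers

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .returns(Types.INT) // explicit type info because of Java lambda type erasure
           .print();

        env.execute("parallelism-sketch");
    }
}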
>>
>> On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <bara...@apache.org> wrote:
>>
>>> Hi Andrey,
>>>
>>> Thanks for your response. I was trying to get the logs somewhere, but they are biggish (~4 MB). Can you suggest somewhere I could put them?
>>>
>>> In any case, I can see exceptions like this:
>>>
>>> 2019/03/18 10:11:50,763 DEBUG org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
>>> 2019/03/18 10:11:50,807 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched from state RUNNING to FAILING.
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 2, slots allocated: 0
>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
>>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>> ...
>>>
>>> It looks like a TM may crash, and then the JM, and then the JM is not able to find slots for the tasks in a reasonable time frame? Weirdly, we are running 13 TMs with 6 slots each (we used legacy mode in 1.6), and we always try to keep an extra TM's worth of free slots just in case. Looking at the dashboard, I see 12 TMs and 2 free slots, but we tell Flink 13 are available when we start the session in yarn.
>>>
>>> Any ideas? It is way less stable for us these days, even though we have hardly changed the settings since we started using Flink around 1.2 some time back.
>>>
>>> Thanks,
>>>
>>> Bruno
>>>
>>> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <and...@ververica.com> wrote:
>>>
>>>> Hi Bruno,
>>>>
>>>> could you also share the job master logs?
>>>>
>>>> Thanks,
>>>> Andrey
>>>>
>>>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <bara...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This is causing serious instability and data loss in our production environment. Any help figuring out what's going on here would be really appreciated.
>>>>>
>>>>> We recently updated our two EMR clusters from Flink 1.6.1 to Flink 1.7.2 (running on AWS EMR). The road to the upgrade was fairly rocky, but we felt it was working sufficiently well in our pre-production environments that we rolled it out to prod.
>>>>>
>>>>> However, we're now seeing the jobmanager crash spontaneously several times a day. There doesn't seem to be any pattern to when this happens: it doesn't coincide with an increase in the data flowing through the system, nor does it happen at the same time of day.
>>>>>
>>>>> The big problem is that when it recovers, sometimes a lot of the jobs fail to resume with the following exception:
>>>>>
>>>>> org.apache.flink.util.FlinkException: JobManager responsible for 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>>>> //...
>>>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>>>     ... 16 more
>>>>>
>>>>> Starting them manually afterwards doesn't resume from the checkpoint, which for most jobs means they start from the end of the source Kafka topic. This means that whenever this surprise jobmanager restart happens, we have a ticking clock during which we're losing data.
>>>>>
>>>>> We speculate that those jobs die first, and while they wait to be restarted (they have a 30-second delay strategy), the jobmanager restarts and does not recover them? In any case, we have never seen so many job failures and JM restarts with exactly the same EMR config.
>>>>>
>>>>> We've got some functionality in the works that uses the StreamingFileSink over S3 and relies on the related bug fixes in 1.7.2, so rolling back isn't an ideal option.
>>>>>
>>>>> Looking through the mailing list, we found https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible this might be related?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Bruno
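Coming back to the original message above: when you say the jobs have a 30 second delay strategy, I am assuming a fixed-delay restart strategy roughly along the lines of the sketch below. The class name, the number of attempts and the toy pipeline are made up for illustration; only the 30 s delay is taken from your description.

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Illustrative only: a job that retries a few times with a 30 s pause between
// attempts. If the JM restarts while a job is sitting in that pause, that would
// be the window you describe.
public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3,                  // number of restart attempts (made-up value)
                Time.seconds(30))); // 30 s delay between attempts, as you mention

        env.fromElements("a", "b", "c").print();
        env.execute("restart-strategy-sketch");
    }
}

The exact values don't matter here; I only want to make sure we are talking about the same kind of strategy while the JM is recovering. If your jobs configure it differently, e.g. via the cluster-wide default in the Flink configuration, it would help to know that too.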