We see the same thing running Flink 1.4.2 on YARN, hosted on an AWS EMR cluster. The only thing I can find in the logs is a SIGTERM with exit code 15 or -100. Today our simple job reading from Kinesis and writing to Cassandra was killed. The other day, in another job, I identified a MapState.remove call as the cause of a lost task manager, without any exception. I find it frustrating that it is so hard to find the root cause. If I look at historical metrics for CPU, heap, and non-heap memory, I can't see anything that should cause a problem. So any ideas about how to debug this kind of exception are much appreciated.
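One way to dig further is to pull the aggregated YARN logs and grep for the shutdown signal near the time the container disappeared. A minimal sketch follows; the log lines below are made-up stand-ins for real output, and on the cluster you would first run `yarn logs -applicationId <your-app-id> > app.log` to produce the file:

```shell
# Stand-in log lines; on the cluster, produce app.log with:
#   yarn logs -applicationId <your-app-id> > app.log
cat > app.log <<'EOF'
2018-04-09 12:00:01 INFO  RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2018-04-09 12:00:02 INFO  Container container_e01_0001_01_000002 completed with exit code -100
EOF

# Surface the lines that explain why the TaskManager container went away
grep -n -E 'SIGTERM|exit code' app.log
```

Correlating the timestamps of those lines with the YARN ResourceManager and NodeManager logs can show whether YARN itself killed the container (e.g. for exceeding memory limits) or the node was lost.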
Med venlig hilsen / Best regards
Lasse Nedergaard

> On 9 Apr 2018 at 21.48, Chesnay Schepler <ches...@apache.org> wrote:
>
> We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations.
>
> Please have a look at the TaskManager logs; they may contain a hint as to why it shut down.
>
>> On 09.04.2018 16:01, Javier Lopez wrote:
>> Hi,
>>
>> "are you moving the job jar to the ~/flink-1.4.2/lib path ?" -> Yes, to every node in the cluster.
>>
>>> On 9 April 2018 at 15:37, miki haiat <miko5...@gmail.com> wrote:
>>> Javier
>>> "adding the jar file to the /lib path of every task manager"
>>> are you moving the job jar to the ~/flink-1.4.2/lib path ?
>>>
>>>> On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez <javier.lo...@zalando.de> wrote:
>>>> Hi,
>>>>
>>>> We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading. We also added these Java options: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
>>>>
>>>> From time to time we have the same problem with TaskManagers disconnecting, but the logs are not useful. We are using 1.3.2.
>>>>
>>>>> On 9 April 2018 at 10:41, Alexander Smirnov <alexander.smirn...@gmail.com> wrote:
>>>>> I've seen a similar problem, but it was not the heap size, it was Metaspace. It was caused by a job restarting in a loop. It looks like for each restart Flink loads a new instance of the classes, and it very soon runs out of metaspace.
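For reference, the JVM options Javier quotes can be passed to the Flink JVMs via `env.java.opts` in `flink-conf.yaml`; the sizes below are the ones from the message above, not a tuned recommendation:

```yaml
# flink-conf.yaml -- cap metaspace growth on JobManager and TaskManager JVMs
env.java.opts: "-XX:CompressedClassSpaceSize=100M -XX:MaxMetaspaceSize=300M -XX:MetaspaceSize=200M"
```

With a hard `MaxMetaspaceSize`, a classloader leak from repeated job restarts fails fast with an OutOfMemoryError in the logs instead of the container being killed silently.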
>>>>>
>>>>> I've created a JIRA issue for this problem, but got no response from the development team on it: https://issues.apache.org/jira/browse/FLINK-9132
>>>>>
>>>>>> On Mon, Apr 9, 2018 at 11:36 AM 王凯 <wangka...@163.com> wrote:
>>>>>> Thanks a lot, I will try it.
>>>>>>
>>>>>> On 2018-04-09 00:06:02, "TechnoMage" <mla...@technomage.com> wrote:
>>>>>> I have seen this when my task manager ran out of RAM. Increase the heap size.
>>>>>>
>>>>>> flink-conf.yaml:
>>>>>> taskmanager.heap.mb
>>>>>> jobmanager.heap.mb
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>> On Apr 8, 2018, at 2:36 AM, 王凯 <wangka...@163.com> wrote:
>>>>>>>
>>>>>>> <QQ图片20180408163927.png>
>>>>>>> Hi all, recently I found a problem: it runs well at start, but after a long run the exception appears as above. How can I resolve it?
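The heap keys Michael mentions take values in megabytes in `flink-conf.yaml` (Flink 1.x); a minimal sketch, with placeholder sizes to be tuned for the actual workload:

```yaml
# flink-conf.yaml -- JVM heap sizes in MB (illustrative values only)
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 4096
```

Note that on YARN the container memory must also be large enough to hold heap plus off-heap (metaspace, direct buffers), or YARN will kill the container regardless of how the heap is sized.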