Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread 周思华
inated. 2018-04-07 21:59:21,064 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph- FetchUrlsFunction for sitemap -> ParseSiteMapFunction -> OutlinkToStateUrlFunction (1/1) (3e9374d1bf5fdb359e3a624a4d5d659b) switched from RUNNING to FAILED. java.lang.Exception: TaskManager was lost/killed: c51d38

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Ted Yu
/192.168.3.177:63780 >>> 2018-04-07 21:59:21,049 WARN akka.remote.transport.netty.NettyTransport >>>- Remote connection to [null] failed with >>> java.net.ConnectException: Connection refused: >>> kens-mbp.hsd1.ca.comcast.net/192.168.3

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Lasse Nedergaard
> Cc: user , Chesnay Schepler > Subject: Re: java.lang.Exception: TaskManager was lost/killed > > > This graph shows Non-Heap. If the same pattern exists, it makes sense that > it will try to allocate more memory and then exceed the limit. I can see > the trend for all other conta

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Ted Yu
Can you use a third-party site for the graph? I cannot view it. Thanks Original message From: Lasse Nedergaard Date: 4/10/18 12:25 AM (GMT-08:00) To: Ken Krugler Cc: user , Chesnay Schepler Subject: Re: java.lang.Exception: TaskManager was lost/killed This graph shows

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-10 Thread Lasse Nedergaard
ion failed with [akka.tcp:// >> fl...@kens-mbp.hsd1.ca.comcast.net:63780]] Caused by: [Connection >> refused: kens-mbp.hsd1.ca.comcast.net/192.168.3.177:63780] >> 2018-04-07 21:59:21,056 WARN akka.remote.RemoteWatcher >> - Detected unreachable: [akka.tcp://

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Lasse Nedergaard
mcast.net:63780] > 2018-04-07 21:59:21,063 INFO org.apache.flink.runtime.jobmanager.JobManager >- Task manager akka.tcp://flink@kens-mbp. > hsd1.ca.comcast.net:63780/user/taskmanager terminated. > 2018-04-07 21:59:21,064 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph >

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Ken Krugler
G to FAILED. java.lang.Exception: TaskManager was lost/killed: c51d3879b6244828eb9fc78c943007ad @ kens-mbp.hsd1.ca.comcast.net (dataPort=63782) — Ken > On Apr 9, 2018, at 12:48 PM, Chesnay Schepler wrote: > > We will need more information to offer any solution. The exception simply > means that

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Hao Sun
Same story here, 1.3.2 on K8s. It is very hard to find out why a TM is killed. It is not likely caused by a memory leak. If there is a logger I should turn on, please let me know. On Mon, Apr 9, 2018, 13:41 Lasse Nedergaard wrote: > We see the same running 1.4.2 on YARN hosted on an AWS EMR cluster. The only

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Lasse Nedergaard
We see the same running 1.4.2 on YARN hosted on an AWS EMR cluster. The only thing I can find in the logs is SIGTERM with the code 15 or -100. Today our simple job reading from Kinesis and writing to Cassandra was killed. The other day in another job I identified a map state.remove command to

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Chesnay Schepler
We will need more information to offer any solution. The exception simply means that a TaskManager shut down, for which there are a myriad of possible explanations. Please have a look at the TaskManager logs, they may contain a hint as to why it shut down. On 09.04.2018 16:01, Javier Lopez w

Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Javier Lopez
Hi, "are you moving the job jar to the ~/flink-1.4.2/lib path?" -> Yes, to every node in the cluster. On 9 April 2018 at 15:37, miki haiat wrote: > Javier > "adding the jar file to the /lib path of every task manager" > are you moving the job jar to the ~/flink-1.4.2/lib path? > > On

Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread miki haiat
Javier "adding the jar file to the /lib path of every task manager" are you moving the job jar to the ~/flink-1.4.2/lib path? On Mon, Apr 9, 2018 at 12:23 PM, Javier Lopez wrote: > Hi, > > We had the same metaspace problem, it was solved by adding the jar file to > the /lib path of every ta

Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Javier Lopez
Hi, We had the same metaspace problem; it was solved by adding the jar file to the /lib path of every task manager, as explained here: https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/debugging_classloading.html#avoiding-dynamic-classloading. We also added these java option

Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Alexander Smirnov
I've seen a similar problem, but the cause was not heap size but Metaspace. It was caused by a job restarting in a loop. It looks like Flink loads a new instance of the classes on each restart and very soon runs out of metaspace. I've created a JIRA issue for this problem, but got no response from the devel

Re: Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread 王凯
Thanks a lot, I will try it. On 2018-04-09 00:06:02, "TechnoMage" wrote: I have seen this when my task manager ran out of RAM. Increase the heap size. flink-conf.yaml: taskmanager.heap.mb jobmanager.heap.mb Michael On Apr 8, 2018, at 2:36 AM, 王凯 wrote: Hi all, recently I found a problem: it

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-08 Thread TechnoMage
I have seen this when my task manager ran out of RAM. Increase the heap size. flink-conf.yaml: taskmanager.heap.mb jobmanager.heap.mb Michael > On Apr 8, 2018, at 2:36 AM, 王凯 wrote: > > > Hi all, recently I found a problem: it runs well at startup. But after a long > run, the exception displa

java.lang.Exception: TaskManager was lost/killed

2018-04-08 Thread 王凯
Hi all, recently I found a problem: the job runs well at startup, but after a long run the exception shown above appears. How can I resolve it?