Hi,

That is very strange indeed. I had a look at the logs and no error or 
exception is reported. I assume there is also no exception in your full logs? 
Which version of Flink are you using, and which operators were running in the 
task that stopped? If this happens again, would it be possible to take a thread 
dump from that JVM?
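
In case it helps: the usual route is to run the JDK's `jstack` tool against the 
process ID of the TaskManager JVM. As a rough illustration of what such a dump 
contains, here is a minimal sketch that collects the same kind of information 
in-process through the standard `java.lang.management` API. The class name 
`ThreadDumpSketch` and the output format are my own assumptions for 
illustration; this is not something Flink provides.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Flink: prints a jstack-like dump
// of all live threads in the JVM it runs in.
public class ThreadDumpSketch {

    /** Returns the name, state and stack trace of every live thread. */
    public static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Lock details are requested only if this JVM supports reporting them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that are no longer alive
            }
            sb.append('"').append(info.getThreadName())
              .append("\" state=").append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

For a task that appears stuck, the interesting part is usually which threads 
are BLOCKED or WAITING and on which frames.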

Best,
Stefan

> On 26.09.2017 at 17:08, Tony Wei <tony19920...@gmail.com> wrote:
> 
> Hi,
> 
> Something weird happened to my streaming job.
> 
> I found that my streaming job seemed to be blocked for a long time, and I saw 
> the situation shown in the picture below. (chk #1245 and #1246 both finished 
> 7/8 tasks and were then marked as timed out by the JM. Later checkpoints, such 
> as #1247, failed in the same state until I restarted the TM.)
> 
> <snapshot.png>
> 
> I'm not sure what happened, but the consumer stopped fetching records, buffer 
> usage was at 100%, and the following task did not seem to fetch data anymore. 
> It was as if the whole TM had stopped.
> 
> However, after I restarted the TM and forced the job to restart from the 
> latest completed checkpoint, everything worked again. I don't know how to 
> reproduce it.
> 
> The attachment is my TM log. Because it contains many user logs and sensitive 
> information, I kept only the log lines from `org.apache.flink...`.
> 
> My cluster setup is one JM and one TM with 4 available slots.
> 
> The streaming job uses all slots, the checkpoint interval is 5 minutes, and 
> the maximum number of concurrent checkpoints is 3.
> 
> Please let me know if you need more information to find out what happened to 
> my streaming job. Thanks for your help.
> 
> Best Regards,
> Tony Wei
> <flink-root-taskmanager-0-partial.log>
