Hi, which Flink version are you using? This issue occurred quite frequently in 1.2.0 RC0 and should be fixed in later RCs.

On Fri, Jan 27, 2017 at 4:13 PM, Malte Schwarzer <impres...@mieo.de> wrote:

> Hi all,
>
> when running a Flink batch job, from time to time a TaskManager dies
> randomly, which makes the full job failing. All other nodes then throw
> the following exception:
>
> Error obtaining the sorted input: Thread 'SortMerger Reading Thread'
> terminated due to an exception: Connection unexpectedly closed by remote
> task manager 'dyingnode' ...
>
> However, there are no error messages in the log of 'dyingnode'.
>
> But in the PID thread dump of 'dyingnode' I found this:
>
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGBUS (0x7) at pc=0x00003fff701afa4c, pid=1119228, tid=0x00003ff38a3ff1b0
> #
> # JRE version: OpenJDK Runtime Environment (8.0_101-b14) (build 1.8.0_101-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.101-b14 mixed mode linux-ppc64 )
> # Problematic frame:
> # J 433 C2 org.apache.flink.runtime.util.DataOutputSerializer.write(I)V (40 bytes) @ 0x00003fff701afa4c [0x00003fff701afa00+0x4c]
> # ...
>
> What can cause this? And is this Flink related?
>
> Best regards,
> Malte