Hi Elias! Concerning the spilling of alignment data to disk:

- In 1.4.x, you can set an upper limit via "task.checkpoint.alignment.max-size". See [1].
- In 1.5.x, the default is back-pressure based alignment, which does not spill any more.
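For reference, a possible flink-conf.yaml entry (the 128 MB value below is only an illustration, not a recommendation; pick a limit that fits your disks):

    # Maximum number of bytes a checkpoint alignment may buffer/spill.
    # The default of -1 means "no limit"; 134217728 bytes = 128 MB is just an example.
    task.checkpoint.alignment.max-size: 134217728

If an alignment buffers more than this, the in-progress checkpoint is aborted (skipped) instead of spilling further, so the disks should no longer fill up while the job keeps running.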
Best,
Stephan

[1] https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#task-checkpoint-alignment-max-size

On Wed, May 2, 2018 at 1:37 PM, Piotr Nowojski <pi...@data-artisans.com> wrote:

> Hi,
>
> It might be some Kafka issue.
>
> From what you described, your reasoning seems sound. For some reason TM3
> fails and is unable to restart and process any data, thus forcing spilling
> on checkpoint barriers on TM1 and TM2.
>
> I don’t know the reason behind the java.lang.NoClassDefFoundError:
> org/apache/kafka/clients/NetworkClient$1 errors, but they don’t seem to
> be important in this case.
>
> 1. What Kafka version are you using? Have you looked for any known Kafka
> issues with those symptoms?
> 2. Maybe the easiest thing would be to reformat/reinstall/recreate the TM3 AWS
> image? It might be some system issue.
>
> Piotrek
>
> On 28 Apr 2018, at 01:54, Elias Levy <fearsome.lucid...@gmail.com> wrote:
>
> We had a job on a Flink 1.4.2 cluster with three TMs experience an odd
> failure the other day. It seems to have started as some sort of network event.
>
> It began with the 3rd TM warning every 30 seconds about socket timeouts
> while sending metrics to DataDog. This lasted for the whole outage.
>
> Twelve minutes later, all TMs reported at nearly the same time that they
> had marked the Kafka coordinator as dead ("Marking the coordinator XXX (id:
> 2147482640 rack: null) dead for group ZZZ"). The job terminated and the
> system attempted to recover it. Then things got into a weird state.
>
> The following repeated six or seven times over a period of about 40 minutes:
>
> 1. The TMs attempt to restart the job, but only the first and second TMs
> show signs of doing so.
> 2. The disks begin to fill up on TMs 1 and 2.
> 3. TMs 1 & 2 both report java.lang.NoClassDefFoundError:
> org/apache/kafka/clients/NetworkClient$1 errors. These were mentioned
> on this list earlier this month. It is unclear whether they are benign.
> 4. The job dies when the disks finally fill up on 1 and 2.
>
> Looking at the backtrace logged when the disk fills up, I gather that
> Flink is buffering data coming from Kafka into one of my operators as a
> result of a checkpoint barrier. The job has a two-input operator, with one
> input being the primary data and a secondary input for control commands
> (see the sketch below the quoted thread). It would appear that for whatever
> reason the barrier for the control stream is not making it to the operator,
> thus leading to the buffering and full disks. Maybe Flink scheduled the
> source operator of the control stream on the 3rd TM, which seems like it
> was not scheduling tasks?
>
> Finally, the JM records that it received 13 late messages for already expired
> checkpoints (could they be from the 3rd TM?), the job is restored one more
> time, and it works. All TMs report at nearly the same time that they can
> now find the Kafka coordinator.
>
> It seems like the 3rd TM had some connectivity issue, but then all TMs seem
> to have had a problem communicating with the Kafka coordinator at the same
> time and recovered at the same time. The TMs are hosted in AWS across AZs,
> so all of them having connectivity issues at the same time is suspect. The
> Kafka node in question was up, and other clients in our infrastructure
> seemed able to communicate with it without trouble.
> Also, the Flink job itself seemed to be talking to the Kafka cluster while
> restarting, as it was spilling data coming from Kafka to disk. And the JM
> did not report any reduction in available task slots, which would have
> indicated connectivity issues between the JM and the 3rd TM. Yet, the logs
> on the 3rd TM do not show any record of trying to restore the job during
> the intermediate attempts.
>
> What do folks make of it?
>
> And a question for the Flink devs: is there some reason why Flink does not
> stop spilling messages to disk when the disk is about to fill up? It seems
> like there should be a configurable limit to how much data can be spilled
> before back-pressure is applied to slow down or stop the source.
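For readers following the barrier-alignment discussion above: below is a minimal sketch of the kind of two-input topology described in the quoted mail (a primary data stream connected with a control-command stream). The socket sources, the String types, and the use of CoProcessFunction are illustrative assumptions only, not the actual job.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
    import org.apache.flink.util.Collector;

    public class TwoInputSketch {
        public static void main(String[] args) throws Exception {
            final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000); // aligned, exactly-once checkpoints by default

            // Hypothetical stand-ins for the two Kafka topics (primary data and control commands).
            DataStream<String> primary = env.socketTextStream("localhost", 9000);
            DataStream<String> control = env.socketTextStream("localhost", 9001);

            // The connected operator has two inputs. During an aligned checkpoint it must
            // see the barrier on BOTH inputs; records arriving behind the first barrier are
            // buffered (and, in 1.4.x, spilled to disk) until the second barrier shows up.
            primary
                .connect(control)
                .process(new CoProcessFunction<String, String, String>() {
                    @Override
                    public void processElement1(String value, Context ctx, Collector<String> out) {
                        out.collect(value); // primary data path
                    }

                    @Override
                    public void processElement2(String command, Context ctx, Collector<String> out) {
                        // apply the control command, e.g. update a filter (omitted)
                    }
                })
                .print();

            env.execute("two-input sketch");
        }
    }

If the control-stream barrier never reaches this operator, the post-barrier records from the primary input keep being buffered, which matches the spilling and full disks described above.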