Re: org.apache.zookeeper.ClientCnxn, Client session timed out
Ok, thanks for the clarification. On Thu, Dec 28, 2017 at 1:05 AM Ufuk Celebiwrote: > On Thu, Dec 28, 2017 at 12:11 AM, Hao Sun wrote: > > Thanks! Great to know I do not have to worry duplicates inside Flink. > > > > One more question, why this happens? Because TM and JM both check > leadership > > in different interval? > > Yes, it's not deterministic how this happens. There will also be cases > when the JM notices before the TM. > >
Re: org.apache.zookeeper.ClientCnxn, Client session timed out
On Thu, Dec 28, 2017 at 12:11 AM, Hao Sunwrote: > Thanks! Great to know I do not have to worry duplicates inside Flink. > > One more question, why this happens? Because TM and JM both check leadership > in different interval? Yes, it's not deterministic how this happens. There will also be cases when the JM notices before the TM.
Re: org.apache.zookeeper.ClientCnxn, Client session timed out
Thanks! Great to know I do not have to worry duplicates inside Flink. One more question, why this happens? Because TM and JM both check leadership in different interval? > The TM noticed the loss of leadership before the JM did. On Wed, Dec 27, 2017, 13:52 Ufuk Celebiwrote: > On Wed, Dec 27, 2017 at 4:41 PM, Hao Sun wrote: > >> Somehow TM detected JM leadership loss from ZK and self disconnected? >> And couple of seconds later, JM failed to connect to ZK? >> > > Yes, exactly as you describe. The TM noticed the loss of leadership before > the JM did. > > >> After all the cluster recovered nicely by its own, but I am wondering >> does this break the exactly-once semantics? If yes, what should I take care? >> > > Great :-) It does not break exactly-once guarantees *within* the Flink > pipeline as the state of the latest completed checkpoint will be restored > after recovery. This rewinds your job and might result in duplicate or > changed output if you don't use an exactly once or idempotent sink. > > – Ufuk > >
Re: org.apache.zookeeper.ClientCnxn, Client session timed out
On Wed, Dec 27, 2017 at 4:41 PM, Hao Sunwrote: > Somehow TM detected JM leadership loss from ZK and self disconnected? > And couple of seconds later, JM failed to connect to ZK? > Yes, exactly as you describe. The TM noticed the loss of leadership before the JM did. > After all the cluster recovered nicely by its own, but I am wondering does > this break the exactly-once semantics? If yes, what should I take care? > Great :-) It does not break exactly-once guarantees *within* the Flink pipeline as the state of the latest completed checkpoint will be restored after recovery. This rewinds your job and might result in duplicate or changed output if you don't use an exactly once or idempotent sink. – Ufuk