Thanks for the update! On 13.03.20 13:47, Rong Rong wrote:
1. I think we have finally pinpointed the root cause of this issue: when partitions are assigned manually (e.g. with the assign() API instead of the subscribe() API), the client will not try to rediscover the coordinator if it dies [1]. This no longer seems to be an issue after Kafka 1.1.0. After cherry-picking the PR [2] back to the Kafka 0.11.x branch and packaging it with our Flink application, we haven't seen this issue recur so far.
So the resolution of this thread is: we don't do anything, because it is a Kafka bug that has since been fixed?
2. GROUP_OFFSETS is in fact the default startup mode if checkpointing is not enabled - that's why I was a bit surprised that this problem was reported so many times. To follow up on the question of whether resuming from GROUP_OFFSETS is useful: there are definitely use cases where users don't want to use checkpointing (e.g. due to resource constraints, storage cost considerations, etc.) but still want to avoid a certain amount of data loss. Most of our analytics use cases fall into this category.
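(For readers following along: when Flink checkpointing is disabled, the Kafka consumer can fall back to Kafka's own periodic offset committing, so a restart from GROUP_OFFSETS loses at most roughly one commit interval of progress. A minimal sketch of the relevant Kafka consumer properties - the group id and interval value here are illustrative, not from this thread:

```properties
# Kafka consumer properties passed to the Flink Kafka source (illustrative)
group.id=my-analytics-job        # hypothetical consumer group id
enable.auto.commit=true          # let the Kafka client commit offsets itself
auto.commit.interval.ms=5000     # offsets committed every ~5s; on restart,
                                 # at most this window may be re-read
```

This is the trade-off described above: no checkpointing overhead, but replay bounded by the commit interval rather than exactly-once guarantees.)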
Yes, this is what I assumed. I was not suggesting that we remove the feature. So here, too, we just leave it as is, right?
Best, Aljoscha