Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/22138

@koeninger I'm not sure I got your point correctly. This patch is based on some assumptions, so please correct me if I'm missing something. The assumptions are:

1. No two consumers actually work on the same key at the same time. The cache key contains the topic partition as well as the group ID. Even if the query does a self-join and therefore reads the same topic from two different sources, the group IDs should differ.

2. In the normal case offsets are continuous, which is why the cache should help. In the retry case this patch invalidates the cache, same as the current behavior, so it starts from scratch. (Btw, I'm curious which is more expensive: `leveraging pooled object but resetting kafka consumer` vs. `invalidating pooled objects and starting from scratch`. The latter feels safer, but if resetting only needs an extra seek instead of reconnecting to Kafka, the former could be improved to be cheaper. I feel that's out of scope for this PR, though.)

This patch keeps most of the current behavior, except in two spots, I think. I already commented on one spot explaining why I changed the behavior, and I'll comment on the other for the same reason.
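To make assumption 1 and the retry handling concrete, here is a minimal sketch of the pooling idea being discussed. All names (`CacheKey`, `ConsumerPool`, `acquire`, `release`) are hypothetical and do not reflect Spark's actual `KafkaDataConsumer` code; the sketch only models the key shape (group ID + topic partition) and the rule that a continuous offset reuses the pooled entry while any other offset invalidates it:

```scala
// Hypothetical sketch, not Spark's real implementation.
// The pool is keyed by (groupId, topic, partition); a fetch at the
// expected next offset reuses the cached entry, any other requested
// offset (e.g. a retry) invalidates it so reading starts from scratch.
final case class CacheKey(groupId: String, topic: String, partition: Int)

final class ConsumerPool {
  // Maps each key to the next offset we expect that consumer to read.
  private val pool = scala.collection.mutable.Map.empty[CacheKey, Long]

  /** Returns true if the cached entry was reused (offsets were continuous). */
  def acquire(key: CacheKey, requestedOffset: Long): Boolean =
    pool.get(key) match {
      case Some(next) if next == requestedOffset =>
        true // continuous read: reuse the pooled consumer
      case _ =>
        pool.remove(key) // retry or gap: invalidate, start from scratch
        false
    }

  /** Record where the consumer left off, so the next acquire can match it. */
  def release(key: CacheKey, nextOffset: Long): Unit =
    pool(key) = nextOffset
}
```

Because the group ID is part of the key, two sources in a self-join (each with its own group ID) map to distinct pool entries and never contend for the same cached consumer.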