[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-523236896

> To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system

That's also how consumer instances are cached (which is why a missing `close` caused a file descriptor leak - #21997), though that logic is much simpler since we don't allow concurrent use of an instance (the Kafka consumer is not thread-safe, whereas the Kafka producer is thread-safe). The same behavior applies to Commons Pool as well: once you borrow an instance, the pool never forcibly destroys it until you return it.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
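The deferred-close behavior described above can be sketched as reference counting. This is a hypothetical illustration, not the PR's actual code: `RefCountedCache`, `Lease`, and `onClose` are invented names, and a production version would need a real eviction policy and error handling. The point it demonstrates is that eviction only *marks* an entry; the actual close happens when the last in-flight task releases it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of "in-use tracking": a cached instance (e.g. a Kafka
// producer) is never closed while any task still holds a lease on it.
final class RefCountedCache<K, V> {
    static final class Lease<V> {
        final V value;
        final AtomicInteger inUse = new AtomicInteger(0);
        final AtomicBoolean closed = new AtomicBoolean(false);
        volatile boolean evicted = false;
        Lease(V value) { this.value = value; }
    }

    private final ConcurrentHashMap<K, Lease<V>> cache = new ConcurrentHashMap<>();
    private final Consumer<V> onClose;

    RefCountedCache(Consumer<V> onClose) { this.onClose = onClose; }

    // A task acquires a lease; a nonzero in-use count keeps the instance alive.
    Lease<V> acquire(K key, Function<K, V> factory) {
        Lease<V> l = cache.computeIfAbsent(key, k -> new Lease<>(factory.apply(k)));
        l.inUse.incrementAndGet();
        return l;
    }

    // The last releasing task closes an instance that was evicted meanwhile.
    void release(Lease<V> l) {
        if (l.inUse.decrementAndGet() == 0 && l.evicted) closeOnce(l);
    }

    // Eviction only marks the lease; the close is deferred while tasks hold it.
    void evict(K key) {
        Lease<V> l = cache.remove(key);
        if (l != null) {
            l.evicted = true;
            if (l.inUse.get() == 0) closeOnce(l);
        }
    }

    // compareAndSet guards against a racing evict/release both closing.
    private void closeOnce(Lease<V> l) {
        if (l.closed.compareAndSet(false, true)) onClose.accept(l.value);
    }
}
```

This is exactly the kind of flag-and-counter state the later comments worry about: even this small sketch needs an atomic close guard to avoid a double close when eviction races with the last release.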
[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-478187752

I've been wondering how much benefit the added complexity brings. The current code is a no-brainer, so it just has benefits with no specific concern, but this patch raises concurrency concerns - not so much from the code diff itself (maybe because those issues were addressed), but from the review comments about synchronization scope, race conditions, deadlocks, etc. While I mentioned a general connection pool, I basically meant thread-isolated usage. I wouldn't be concerned if the producer took the same approach as the consumer with the current custom cache code (though I think getting rid of the custom cache code would be ideal). But you're right: I agree someone can manage the complexity (given there are reviewers, and at least one committer will review and merge) once this patch is merged, so my concern is not a big deal. And while we don't clearly know the benefits of the current approach over thread-isolated usage (e.g. a performance benefit from sharing a producer), we don't clearly know the opposite either. So let's leave it as a second thought.
[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-477846977

I'm really sorry to say this on a PR that has already gone through pretty long discussions - and that I actually asked to resurrect - so please blame me. First of all, I believe the patch addresses the issue properly within the current caching approach, and we should adopt it if we stick with that approach. (That's why I said the patch is still needed and asked to resurrect this.)

Revisiting the current caching approach in that light: it now requires reasoning about thread-safety, and to handle that we are adding flags and counters, making cached producer instances stateful (beyond whatever state the Kafka producer itself holds), which makes the code harder to understand and maintain.

If I remember correctly, in SPARK-25151 (#22138 - the issue is still open and valid, I think) someone (Gabor, I guess) asked me, "why don't you apply commons-pool to the producer as well?" I answered that the producer is thread-safe, so the current cache approach works, whereas the consumer is not thread-safe and has to be isolated (a good fit for a general connection pool implementation). Now I feel it's worth considering applying it to the producer as well (isolating per thread), which keeps the producer stateless (again, beyond the Kafka producer's own state). It would require more connections to Kafka when there is concurrent usage, but once the necessary number of producers is pooled it shouldn't hurt, which is what we expect for a long-running query like a streaming one. I'd like to hear everyone's voices. Thanks in advance!
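The thread-isolated, borrow/return alternative proposed in the comment above can be sketched with a minimal pool. This is a hypothetical illustration of the semantics only; `SimplePool`, `borrow`, and `giveBack` are invented names, and the real Apache Commons Pool `GenericObjectPool` adds capacity limits, validation, and idle-instance eviction on top. The key property is the one the comments rely on: a borrowed instance belongs exclusively to the borrower, so the pool never destroys it until it is returned, and no in-use tracking is needed.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.function.Supplier;

// Hypothetical sketch of borrow/return pooling semantics (Commons Pool style).
// Each borrower gets exclusive ownership; only idle (returned) instances are
// ever candidates for eviction or destruction.
final class SimplePool<V> {
    private final ConcurrentLinkedDeque<V> idle = new ConcurrentLinkedDeque<>();
    private final Supplier<V> factory;

    SimplePool(Supplier<V> factory) { this.factory = factory; }

    // Reuse an idle instance if one exists; otherwise create a new one.
    // Concurrent borrowers therefore cost extra connections, but only until
    // the pool warms up to the peak concurrency level.
    V borrow() {
        V v = idle.pollFirst();
        return (v != null) ? v : factory.get();
    }

    // Returning an instance makes it reusable (and evictable) again.
    void giveBack(V v) { idle.addFirst(v); }

    int idleCount() { return idle.size(); }
}
```

Two concurrent borrows create two instances, matching the trade-off noted above: more Kafka connections under concurrency, but stable reuse once the pool holds enough producers, which suits a long-running streaming query.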