[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.

2019-08-20 Thread GitBox
HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka 
producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-523236896
 
 
   > To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system
   
   That's also how consumer instances are cached (which is why a missing `close` caused a file descriptor leak - #21997), though the logic there is much simpler since we don't allow co-use of an instance (the Kafka consumer is not thread-safe, whereas the Kafka producer is). The same behavior applies to Commons Pool as well: once you borrow an instance, the pool never forcibly destroys it until you return it.
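   To make that contract concrete, here is a minimal, illustrative sketch (not code from this PR; `FakeClient` and all other names are made up) of the borrow/return behavior with commons-pool2's `GenericObjectPool`:

```scala
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

// Hypothetical pooled resource standing in for a cached Kafka client.
class FakeClient {
  def close(): Unit = println("client closed")
}

object CommonsPoolContractExample {
  def main(args: Array[String]): Unit = {
    val factory = new BasePooledObjectFactory[FakeClient] {
      override def create(): FakeClient = new FakeClient
      override def wrap(c: FakeClient): PooledObject[FakeClient] =
        new DefaultPooledObject[FakeClient](c)
      // Invoked only when the pool destroys an idle or invalidated object.
      override def destroyObject(p: PooledObject[FakeClient]): Unit =
        p.getObject.close()
    }

    val pool = new GenericObjectPool[FakeClient](factory)

    // While an instance is borrowed, the pool will not destroy it out from
    // under the caller; eviction only applies to objects idle in the pool.
    val client = pool.borrowObject()
    try {
      // ... use the client ...
    } finally {
      pool.returnObject(client) // only now does it become eligible for eviction/close
    }

    pool.close()
  }
}
```

   The pool only ever calls `destroyObject` on instances that have been returned (or explicitly invalidated by the caller), which is exactly the "never forced to destroy while borrowed" behavior described above.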





[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.

2019-03-29 Thread GitBox
HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka 
producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-478187752
 
 
   I was wondering how much benefit the added complexity buys us: the current code is a no-brainer, so it gives us its benefits without any particular concern, but now we have to reason about concurrency - I don't feel that so much from the code diff itself (maybe because the issues were already addressed), but from the review comments about the scope of synchronization, race conditions, deadlocks, etc.
   
   When I mentioned a general connection pool, I basically meant thread-isolated usage. I wouldn't be concerned if the producer took the same approach as the consumer instead of the current custom cache code. (Though I think getting rid of the custom cache code would be ideal.)
   
   But you're right, I agree someone can manage the complexity (given there are reviewers and at least one committer will review and merge) once this patch is merged, so my concern is not a big deal. And while we don't clearly know the benefits of the current approach over thread-isolated usage (like a performance gain from co-using a producer), we also don't clearly know the benefits of the opposite. So let's set it aside as a second thought.





[GitHub] [spark] HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.

2019-03-28 Thread GitBox
HeartSaVioR edited a comment on issue #19096: [SPARK-21869][SS] A cached Kafka 
producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-477846977
 
 
   I'm really sorry to bring this up on a PR that has already gone through pretty long discussions and that I actually asked to resurrect, so blame me please.
   
   First of all, I think the patch addresses the issue properly within the current caching approach, and we should adopt it if we stick with that approach. (That's why I said the patch is still needed and asked to resurrect this.)
   
   Revisiting the current caching approach with that in mind: it now requires reasoning about thread-safety, and to handle that we are also adding flags and counters, making cached producer instances stateful (beyond whatever state the Kafka producer itself may hold), which makes the code harder to understand and maintain.
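   To make that concrete, here is a rough sketch (illustrative only, not the actual diff; `CachedProducerEntry` and its methods are hypothetical names) of the kind of flag/counter-based in-use tracking this implies:

```scala
import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical sketch of a reference-counted wrapper around a cached producer.
class CachedProducerEntry(val producer: KafkaProducer[Array[Byte], Array[Byte]]) {

  // The "flags and counters" that make the cached entry stateful.
  private var inUseCount: Int = 0
  private var expired: Boolean = false

  /** A task marks the producer as in use before writing with it. */
  def acquire(): Unit = synchronized { inUseCount += 1 }

  /** A task releases the producer once its write is finished. */
  def release(): Unit = synchronized {
    inUseCount -= 1
    // Close only when no task is using the producer AND the cache has evicted it.
    if (inUseCount == 0 && expired) producer.close()
  }

  /** The cache marks the entry as evicted; close immediately if nobody is using it. */
  def expire(): Unit = synchronized {
    expired = true
    if (inUseCount == 0) producer.close()
  }
}
```

   Every entry now carries mutable state (`inUseCount`, `expired`) whose transitions have to be reasoned about under concurrency, which is exactly the maintenance cost I'm referring to.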
   
   If I remember correctly, in SPARK-25151 (#22138 - the issue is still open and valid, I think) someone (Gabor, I guess) asked me "why don't you apply commons-pool to the producer as well?", and I answered that the producer is thread-safe so the current cache approach can apply, whereas the consumer is not thread-safe so it has to be isolated (a good fit for a general connection pool implementation).
   
   Now I feel it's worth considering applying it to the producer as well (isolating per thread), which keeps the producer stateless from the cache's point of view (again, except for whatever state the Kafka producer itself may hold) - it would require more connections to Kafka when there are concurrent usages, but once the necessary number of producers has been pooled it should not hurt, which is what we expect for a long-running query like a streaming one.
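   For illustration of what I mean (a hedged sketch only, not a concrete proposal; `PooledProducerSketch`, `withProducer`, and keying by the raw parameter map are all hypothetical), pooling producers with commons-pool2 could look roughly like this:

```scala
import java.{util => ju}

import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool}
import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical sketch: producers pooled per Kafka parameter map, borrowed and
// returned per task, so each borrowed instance is used by one caller at a time.
object PooledProducerSketch {
  type Params = ju.Map[String, Object]
  type Producer = KafkaProducer[Array[Byte], Array[Byte]]

  private val factory = new BaseKeyedPooledObjectFactory[Params, Producer] {
    override def create(key: Params): Producer =
      new KafkaProducer[Array[Byte], Array[Byte]](key)
    override def wrap(p: Producer): PooledObject[Producer] =
      new DefaultPooledObject[Producer](p)
    // Called only when the pool destroys an idle (already returned) producer.
    override def destroyObject(key: Params, p: PooledObject[Producer]): Unit =
      p.getObject.close()
  }

  private val pool = new GenericKeyedObjectPool[Params, Producer](factory)

  /** Borrow a producer for the given params, run the task body, then return it. */
  def withProducer[T](params: Params)(body: Producer => T): T = {
    val producer = pool.borrowObject(params)
    try body(producer) finally pool.returnObject(params, producer)
  }
}
```

   Each task borrows a producer, uses it exclusively, and returns it, so the cache layer itself stays stateless apart from the pool's own bookkeeping; the trade-off is potentially more open connections to Kafka under concurrent writes, as noted above.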
   
   I'd like to hear everyone's voices. Thanks in advance!

