> On July 21, 2015, 11:15 p.m., Gwen Shapira wrote:
> > core/src/main/scala/kafka/network/SocketServer.scala, line 400
> > <https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400>
> >
> >     So in case of an unexpected exception, we log an error and keep running?
> >     
> >     Isn't it better to kill the processor, since we don't know what state the
> > system is in? If the acceptor keeps placing messages in the queue for a dead
> > processor, isn't that a separate issue?
> 
> Jiangjie Qin wrote:
>     I'm not quite sure about this part. I am not very experienced with error
> handling in such cases, so please correct me if I missed something.
>     Here is what I thought.
>     
>     The way it currently works is that the acceptor will
>     1. accept a new connection request and create a new socket channel, and
>     2. choose a processor and put the socket channel into that processor's new
> connection queue.
>     
>     The processor will just take the socket channels from the queue and
> register them with its selector.
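> 
>     Roughly, the hand-off looks like this (just a simplified sketch, not the
> actual SocketServer code - the method names, the queue field and the use of a
> plain java.nio Selector here are only illustrative):
> 
>         import java.nio.channels.{SelectionKey, Selector}
>         import java.nio.channels.{ServerSocketChannel, SocketChannel}
>         import java.util.concurrent.ConcurrentLinkedQueue
> 
>         object ConnectionHandoffSketch {
>           // Acceptor side: accept the connection and hand the new channel to
>           // one processor's queue (called after the acceptor's own selector
>           // signals that the server channel is ready to accept).
>           def accept(serverChannel: ServerSocketChannel,
>                      newConnections: ConcurrentLinkedQueue[SocketChannel]): Unit = {
>             val channel = serverChannel.accept()
>             channel.configureBlocking(false)
>             newConnections.add(channel)
>           }
> 
>           // Processor side: drain the queue and register each channel with
>           // this processor's selector (the real code goes through
>           // o.a.k.common.network.Selector rather than java.nio directly).
>           def configureNewConnections(newConnections: ConcurrentLinkedQueue[SocketChannel],
>                                       selector: Selector): Unit = {
>             while (!newConnections.isEmpty) {
>               val channel = newConnections.poll()
>               channel.register(selector, SelectionKey.OP_READ)
>             }
>           }
>         }
> 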
>     If the processor gets an uncaught exception while running, there are
> several possibilities.
>     Case 1: The exception came from one socket channel.
>     Case 2: The exception was associated with a bad request.
>     In case 1, ideally we should just disconnect that socket channel without
> affecting other socket channels.
>     In case 2, I think we should log the error and skip the message -
> assuming the client will retry sending data if no response is received within
> a given period of time.
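> 
>     In code, I imagine the two cases being handled in separate catch clauses,
> roughly like this (just a sketch - the exception type and helper names are
> illustrative, not the actual patch):
> 
>         import java.io.IOException
>         import java.nio.channels.SocketChannel
> 
>         object RequestErrorHandlingSketch {
>           // Illustrative stand-in for "the request could not be parsed".
>           class MalformedRequestException(msg: String) extends RuntimeException(msg)
> 
>           def processChannel(channel: SocketChannel, logError: String => Unit): Unit = {
>             try {
>               // ... read the next request from this channel and handle it ...
>             } catch {
>               // Case 1: error tied to this one connection - close only this
>               // channel so that other connections keep working.
>               case e: IOException =>
>                 logError(s"Closing connection due to unexpected channel error: $e")
>                 channel.close()
>               // Case 2: malformed request - log and skip it; the client is
>               // expected to retry after its request gets no response.
>               case e: MalformedRequestException =>
>                 logError(s"Skipping malformed request: $e")
>             }
>           }
>         }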
>     
>     I am not sure letting the processor exit is a good idea, because it would
> let a badly behaving client take down the entire cluster - it might kill the
> processors one by one. Compared with that, I am leaning towards keeping the
> processor running and serving other normal TCP connections if possible, while
> logging the error so the monitoring system can detect it and we can see
> whether human intervention is needed.
>     
>     Also, I don't know how else to prevent the thread from exiting here other
> than catching all the throwables.
>     According to this blog post
> http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/
>     I guess I can rethrow all the ControlThrowables but intercept the rest?
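> 
>     Concretely, I was thinking of something like the pattern from that post
> (a sketch only; runProcessorLoop below just stands in for the body of the
> processor's run loop):
> 
>         import scala.util.control.ControlThrowable
> 
>         object CatchAllSketch {
>           def protect(runProcessorLoop: () => Unit,
>                       logError: (String, Throwable) => Unit): Unit = {
>             try {
>               runProcessorLoop()
>             } catch {
>               // Re-throw Scala's control-flow throwables (used by break,
>               // non-local return, etc.) so we never swallow them.
>               case e: ControlThrowable => throw e
>               // Log everything else and keep the processor thread alive.
>               case e: Throwable =>
>                 logError("Uncaught exception in processor, ignoring", e)
>             }
>           }
>         }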
> 
> Guozhang Wang wrote:
>     I would also prefer not to close the Processor thread upon exceptions,
> mainly to avoid one bad client killing a shared Kafka cluster. In the past we
> have seen issues like a DDoS of MetadataRequests killing the cluster with all
> other clients getting affected, etc., and the quota work is aimed at
> preventing that. Since Processor threads are shared (8 by default on a
> broker), a Processor should not be closed because of a single socket / bad
> client request.
> 
> Gwen Shapira wrote:
>     I like your thinking around cases #1 and #2. I think this should go as a 
> code comment somewhere, so when people improve / extend SocketServer they 
> will keep this logic in mind. Maybe even specify in specific catch clauses if 
> they are handling possible errors in request level or channel level.
>     
>     My concern is with a possible case #3: each processor has an
> o.a.k.common.network.Selector. I'm concerned about the possibility of
> something going wrong in the state of the selector, which would potentially be
> an issue for all channels. For example, a failure to register could be an
> issue with the channel.register call, but perhaps also an issue with keys.put
> (just an example - I'm not sure something can actually break the keys table).
>     
>     I'd like to be able to identify cases where the Selector state may have
> gone wrong and close the processor in that case. Does that make any sense? Or
> am I being too paranoid?
> 
> Jiangjie Qin wrote:
>     Hi Gwen, I think what you said makes sense. Maybe I see this more from a
> failure boundary point of view.
>     
>     Actually we might need to do more if we let a processor exit. We would
> need to stop the acceptor from giving more channels to that processor, and
> potentially stop the entire broker to prevent further issues. That looks like
> a larger failure scenario from what I can see.
>     
>     If we limit the impact to a single processor, then even if that processor
> has stopped functioning properly, letting it keep running contains the impact
> within that processor without affecting the other processors. The clients
> connected to this processor will eventually disconnect and retry connecting
> to other processors, so the broker won't stop serving when one processor
> stops working.
>     
>     I completely agree we would prefer to stop the service if something
> really bad and unknown occurred. But honestly I'm not sure how to exhaust all
> the possible known throwables. In practice, I haven't seen any error of case
> #3, so pragmatically I assume such errors are rare enough and prefer to just
> catch all the exceptions.
> 
> Gwen Shapira wrote:
>     As mentioned on the mailing list - I'm convinced that #3 is an edge case
> that is easy for humans to recognize but isn't trivial to handle
> programmatically at this time.
>     
>     So let's go with your logic; just please put your explanation of the two
> cases we are handling (request issues and channel issues) in a code comment,
> and point out which catch clause covers each failure scenario.

Thanks a lot, Gwen. I just submitted the new patch.


- Jiangjie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36664/#review92496
-----------------------------------------------------------


On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36664/
> -----------------------------------------------------------
> 
> (Updated July 23, 2015, 12:51 a.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2353
>     https://issues.apache.org/jira/browse/KAFKA-2353
> 
> 
> Repository: kafka
> 
> 
> Description
> -------
> 
> Addressed Gwen's comments
> 
> 
> Addressed Gwen's comments.
> 
> 
> Diffs
> -----
> 
>   core/src/main/scala/kafka/network/SocketServer.scala 
> 91319fa010b140cca632e5fa8050509bd2295fc9 
> 
> Diff: https://reviews.apache.org/r/36664/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Jiangjie Qin
> 
>
