> On July 21, 2015, 11:15 p.m., Gwen Shapira wrote:
> > core/src/main/scala/kafka/network/SocketServer.scala, line 400
> > <https://reviews.apache.org/r/36664/diff/1/?file=1018238#file1018238line400>
> >
> >     So in case of an unexpected exception, we log an error and keep
> >     running?
> >
> >     Isn't it better to kill the processor, since we don't know the state
> >     of the system? If the acceptor keeps placing messages in the queue
> >     for a dead processor, isn't that a separate issue?
>
> Jiangjie Qin wrote:
>     I'm not quite sure about this part. I am not very experienced with
>     error handling in such cases, so please correct me if I missed
>     something. Here is what I thought.
>
>     The way it currently works is that the acceptor will:
>     1. accept a new connection request and create a new socket channel
>     2. choose a processor and put the socket channel into the processor's
>        new-connection queue
>
>     The processor just takes the socket channels from the queue and
>     registers them with the selector. If the processor hits an uncaught
>     exception while running, there are several possibilities:
>
>     Case 1: The exception came from one socket channel.
>     Case 2: The exception was associated with a bad request.
>
>     In case 1, ideally we should just disconnect that socket channel
>     without affecting the other socket channels. In case 2, I think we
>     should log the error and skip the message, assuming the client will
>     retry sending data if no response is received within a given period
>     of time.
>
>     I am not sure letting the processor exit is a good idea, because then
>     a badly behaving client could take down the entire cluster: it might
>     kill the processors one by one. Compared with that, I lean towards
>     keeping the processor running and serving the other, normal TCP
>     connections if possible, while logging the error so a monitoring
>     system can detect it and determine whether human intervention is
>     needed.
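The per-channel isolation described for case 1 above could be sketched roughly as follows. This is only a minimal illustration of the idea under discussion, not the actual SocketServer code; names like `pollChannels` and `serve` are invented for the example.

```scala
import scala.util.control.NonFatal

// Minimal sketch of case 1 handling: an exception raised while serving one
// socket channel closes only that channel, and the processor keeps serving
// the rest. All names here are illustrative, not the real SocketServer API.
object ChannelIsolationSketch {
  // Serve one channel; a "bad" channel throws, simulating a broken socket.
  def serve(channel: String, bad: Set[String]): Unit =
    if (bad.contains(channel))
      throw new java.io.IOException(s"channel $channel failed")

  // Returns (channels served, channels closed due to errors).
  def pollChannels(channels: Seq[String],
                   bad: Set[String]): (Seq[String], Seq[String]) = {
    val served = Seq.newBuilder[String]
    val closed = Seq.newBuilder[String]
    channels.foreach { ch =>
      try { serve(ch, bad); served += ch }
      catch {
        case NonFatal(e) =>
          // Log and disconnect only the offending channel.
          closed += ch
      }
    }
    (served.result(), closed.result())
  }
}
```

With this shape, one broken socket never prevents the remaining channels in the same poll from being served.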
>     Also, I don't know what to do here to prevent the thread from exiting
>     without catching all the throwables. According to this blog post,
>     http://www.tzavellas.com/techblog/2010/09/20/catching-throwable-in-scala/
>     I guess I can rethrow all the ControlThrowables but intercept the
>     rest?
>
> Guozhang Wang wrote:
>     I would also prefer not to close the Processor thread upon exceptions,
>     mainly to avoid one bad client killing a shared Kafka cluster. In the
>     past we have seen issues like a DDoS of MetadataRequests killing the
>     cluster with all the other clients affected, and the quota work is
>     aimed at preventing that. Since Processor threads are shared (8 by
>     default on a broker), one should not be closed because of a single
>     socket / bad client request.
>
> Gwen Shapira wrote:
>     I like your thinking around cases #1 and #2. I think this should go
>     into a code comment somewhere, so that when people improve / extend
>     SocketServer they will keep this logic in mind. Maybe even specify in
>     the specific catch clauses whether they handle possible errors at the
>     request level or the channel level.
>
>     My concern is with a possible case #3: each processor has an
>     o.a.k.common.network.Selector. I'm concerned about the possibility of
>     something going wrong in the state of the selector, which could be an
>     issue for all channels. For example, a failure to register could be an
>     issue with the channel.register call, but perhaps also an issue with
>     keys.put (just an example; I'm not sure something can actually break
>     the keys table).
>
>     I'd like to be able to identify cases where the Selector state may
>     have gone wrong and close the processor in that case. Does that make
>     any sense? Or am I being too paranoid?
>
> Jiangjie Qin wrote:
>     Hi Gwen, I think what you said makes sense. Maybe I see this more from
>     a failure-boundary point of view.
>
>     Actually, we might need to do more if we let a processor exit.
>     We need to stop the acceptor from handing more channels to that
>     processor, and potentially stop the entire broker to prevent further
>     issues. That is a larger failure scenario from what I can see.
>
>     If we limit the impact to a processor, then even if that processor has
>     stopped functioning, letting it keep running confines the impact to
>     that processor without affecting the other processors. The clients
>     connected to it will eventually disconnect and retry, connecting to
>     other processors, so the broker won't stop serving when one processor
>     stops working.
>
>     I completely agree that we would prefer to stop the service if
>     something really bad and unknown occurred. But honestly, I'm not sure
>     how to exhaust all the possible known throwables. In practice, I
>     haven't seen any error of case #3, so pragmatically I assume they are
>     rare and prefer to just catch all the exceptions.
>
> Gwen Shapira wrote:
>     As mentioned on the mailing list, I'm convinced that #3 is an edge
>     case that is easy for humans to recognize but isn't trivial to handle
>     programmatically at this time.
>
>     So let's go with your logic; just please put your explanation of the
>     two cases we are handling (request issues and channel issues) in a
>     code comment, and point out which catch clause covers each failure
>     scenario.
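The handling the thread converges on, including the ControlThrowable question raised earlier, could look roughly like the sketch below: each catch clause carries a comment naming the failure scenario it covers, and control-flow throwables are rethrown as the linked blog post suggests. The class and method names here are invented for illustration and are not the real SocketServer code.

```scala
import scala.util.control.ControlThrowable

// Illustrative sketch of the agreed handling, with each catch clause
// commented by the failure scenario it covers. Names are invented for
// the example and do not match the actual SocketServer implementation.
class ProcessorSketch {
  var requestErrors = 0 // case 2: bad requests, logged and skipped
  var channelErrors = 0 // case 1 / other: logged, channel closed if any
  var running = true    // the processor itself stays alive

  def runOnce(work: () => Unit): Unit =
    try work()
    catch {
      // Case 2: error tied to a bad request. Log and skip it; the client
      // is expected to time out and retry.
      case e: IllegalArgumentException => requestErrors += 1
      // ControlThrowables implement control flow, not errors, and must
      // propagate untouched (the pattern from the linked blog post).
      case c: ControlThrowable => throw c
      // Case 1 and anything else: log so monitoring can alert, close the
      // offending channel if there is one, but keep the processor running
      // so one bad client cannot disable a shared thread.
      case t: Throwable => channelErrors += 1
    }
}
```

Note that `case NonFatal(t)` is the modern shorthand for "everything except ControlThrowable and fatal VM errors"; the explicit rethrow is shown here because it is what the blog post and the discussion describe.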
Thanks a lot, Gwen. I just submitted the new patch.

- Jiangjie


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36664/#review92496
-----------------------------------------------------------


On July 23, 2015, 12:51 a.m., Jiangjie Qin wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36664/
> -----------------------------------------------------------
>
> (Updated July 23, 2015, 12:51 a.m.)
>
>
> Review request for kafka.
>
>
> Bugs: KAFKA-2353
>     https://issues.apache.org/jira/browse/KAFKA-2353
>
>
> Repository: kafka
>
>
> Description
> -------
>
> Addressed Gwen's comments
>
>
> Addressed Gwen's comments.
>
>
> Diffs
> -----
>
>   core/src/main/scala/kafka/network/SocketServer.scala 91319fa010b140cca632e5fa8050509bd2295fc9
>
> Diff: https://reviews.apache.org/r/36664/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Jiangjie Qin
>