Gian Merlino created SAMZA-607:
----------------------------------

             Summary: BrokerProxy gets stuck on down brokers
                 Key: SAMZA-607
                 URL: https://issues.apache.org/jira/browse/SAMZA-607
             Project: Samza
          Issue Type: Bug
    Affects Versions: 0.8.0
            Reporter: Gian Merlino


I took a broker offline for a few hours today and found that a Samza job was 
stuck trying to read from it while it was down, instead of switching to another 
broker in the ISR (this was a replicated topic with some partitions 
under-replicated, but all partitions available). During this time the 
BrokerProxy thread was in a retry loop logging a lot of ClosedChannelExceptions.

The broker had done a clean shutdown, but I think the BrokerProxy simply made 
no calls between the time that broker stopped being leader for its partitions 
and the time it went offline. So it never got a NotLeaderForPartitionException 
and never abdicated those partitions.

Would it make sense for the BrokerProxy to abdicate all of its topic-partitions 
after getting too many network errors, and possibly shut itself down if it 
becomes empty? I think it'd be good to support brokers going offline 
temporarily or even permanently.
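
To make the proposal concrete, here is a rough sketch of the bookkeeping I have in mind. It is not Samza's actual BrokerProxy code: the class name AbdicatingProxySketch, the maxConsecutiveErrors threshold, the method names, and the use of plain strings for topic-partitions are all hypothetical and only illustrate the idea of abdicating everything after too many network errors in a row.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; none of these names exist in Samza.
class AbdicatingProxySketch {
    // Assumed threshold: consecutive network errors tolerated before abdicating.
    private final int maxConsecutiveErrors;
    // Topic-partitions this proxy currently owns (strings for brevity).
    private final Set<String> ownedPartitions = new LinkedHashSet<>();
    private int consecutiveErrors = 0;

    AbdicatingProxySketch(int maxConsecutiveErrors) {
        this.maxConsecutiveErrors = maxConsecutiveErrors;
    }

    void addPartition(String topicPartition) {
        ownedPartitions.add(topicPartition);
    }

    // A successful fetch resets the error streak.
    void onFetchSuccess() {
        consecutiveErrors = 0;
    }

    // Called on a network error such as ClosedChannelException.
    // Returns the partitions to hand back for a fresh leader lookup,
    // or an empty list while the streak is still below the threshold.
    List<String> onNetworkError() {
        consecutiveErrors++;
        if (consecutiveErrors < maxConsecutiveErrors) {
            return new ArrayList<>();
        }
        List<String> abdicated = new ArrayList<>(ownedPartitions);
        ownedPartitions.clear();
        consecutiveErrors = 0;
        return abdicated;
    }

    // An empty proxy is a candidate for shutting itself down.
    boolean isEmpty() {
        return ownedPartitions.isEmpty();
    }
}
```

With a threshold of 3, the first two errors would return nothing, and the third would return every owned partition and leave the proxy empty, at which point it could shut itself down. Whether the count should be consecutive errors or errors within a time window is an open design choice.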



