Sorry, I meant creating a new producer, not consumer. Here's the code.
Producer - http://pastebin.com/Kqq1ymCX
Consumer - http://pastebin.com/i2Z8PTYB
Callback - http://pastebin.com/x253z7bG

As you'll notice, I am creating a new producer for each message. So the bootstrap nodes should be refreshed.

I have a single topic (receive.queue) replicated across 3 nodes. I add all 3 nodes to the bootstrap list. On bringing one of the nodes down, some messages start failing (metadata update timeout error).

As I mentioned earlier, the problem goes away simply by setting the reconnect.backoff.ms property to 1000ms.

On 7 May 2015 23:18, "Ewen Cheslack-Postava" <e...@confluent.io> wrote:

> Rahul, the mailing list filters attachments, you'd have to post the code
> somewhere else for people to be able to see it.
>
> But I don't think anyone suggested that creating a new consumer would fix
> anything. Creating a new producer *and discarding the old one* basically
> just makes it start from scratch using the bootstrap nodes, which is why
> that would allow recovery from that condition.
>
> But that's just a workaround. The real issue is that the producer only
> maintains metadata for the nodes that are replicas for the partitions of
> the topics the producer sends data to. In some cases, this is a small set
> of servers and can get the producer stuck if a node goes offline and it
> doesn't have any other nodes that it can try to communicate with to get
> updated metadata (since the topic partitions should have a new leader).
> Falling back on the original bootstrap servers is one solution to this
> problem. Another would be to maintain metadata for additional servers so
> you always have extra "bootstrap" nodes in your current metadata set, even
> if they aren't replicas for any of the topics you're working with.
>
> -Ewen
>
> On Thu, May 7, 2015 at 12:06 AM, Rahul Jain <rahul...@gmail.com> wrote:
>
> > Creating a new consumer instance *does not* solve this problem.
> >
> > Attaching the producer/consumer code that I used for testing.
> >
> > On Wed, May 6, 2015 at 6:31 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
> >
> >> I'm not sure about the old producer behavior in this same failure
> >> scenario, but creating a new producer instance would resolve the issue
> >> since it would start with the list of bootstrap nodes and, assuming at
> >> least one of them was up, it would be able to fetch up to date metadata.
> >>
> >> On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg <j...@squareup.com> wrote:
> >>
> >> > Can you clarify, is this issue here specific to the "new" producer?
> >> > With the "old" producer, we routinely construct a new producer which
> >> > makes a fresh metadata request (via a VIP connected to all nodes in
> >> > the cluster). Would this approach work with the new producer?
> >> >
> >> > Jason
> >> >
> >> > On Tue, May 5, 2015 at 1:12 PM, Rahul Jain <rahul...@gmail.com> wrote:
> >> >
> >> > > Mayuresh,
> >> > > I was testing this in a development environment and manually brought
> >> > > down a node to simulate this. So the dead node never came back up.
> >> > >
> >> > > My colleague and I were able to consistently see this behaviour
> >> > > several times during the testing.
> >> > > On 5 May 2015 20:32, "Mayuresh Gharat" <gharatmayures...@gmail.com> wrote:
> >> > >
> >> > > > I agree that to find the least Loaded node the producer should fall
> >> > > > back to the bootstrap nodes if its not able to connect to any nodes
> >> > > > in the current metadata. That should resolve this.
> >> > > >
> >> > > > Rahul, I suppose the problem went off because the dead node in your
> >> > > > case might have came back up and allowed for a metadata update. Can
> >> > > > you confirm this?
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Mayuresh
> >> > > >
> >> > > > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain <rahul...@gmail.com> wrote:
> >> > > >
> >> > > > > We observed the exact same error. Not very clear about the root
> >> > > > > cause although it appears to be related to leastLoadedNode
> >> > > > > implementation. Interestingly, the problem went away by increasing
> >> > > > > the value of reconnect.backoff.ms to 1000ms.
> >> > > > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" <e...@confluent.io> wrote:
> >> > > > >
> >> > > > > > Ok, all of that makes sense. The only way to possibly recover
> >> > > > > > from that state is either for K2 to come back up allowing the
> >> > > > > > metadata refresh to eventually succeed or to eventually try some
> >> > > > > > other node in the cluster. Reusing the bootstrap nodes is one
> >> > > > > > possibility. Another would be for the client to get more metadata
> >> > > > > > than is required for the topics it needs in order to ensure it
> >> > > > > > has more nodes to use as options when looking for a node to fetch
> >> > > > > > metadata from. I added your description to KAFKA-1843, although
> >> > > > > > it might also make sense as a separate bug since fixing it could
> >> > > > > > be considered incremental progress towards resolving 1843.
> >> > > > > >
> >> > > > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy <ku...@nmsworks.co.in> wrote:
> >> > > > > >
> >> > > > > > > Hi Ewen,
> >> > > > > > >
> >> > > > > > > Thanks for the response. I agree with you, In some case we
> >> > > > > > > should use bootstrap servers.
> >> > > > > > >
> >> > > > > > > > If you have logs at debug level, are you seeing this message
> >> > > > > > > > in between the connection attempts:
> >> > > > > > > >
> >> > > > > > > > Give up sending metadata request since no node is available
> >> > > > > > >
> >> > > > > > > Yes, this log came for couple of times.
> >> > > > > > >
> >> > > > > > > > Also, if you let it continue running, does it recover after
> >> > > > > > > > the metadata.max.age.ms timeout?
> >> > > > > > >
> >> > > > > > > It does not reconnect. It is continuously trying to connect
> >> > > > > > > with dead node.
> >> > > > > > >
> >> > > > > > > -Manikumar
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks,
> >> > > > > > Ewen
> >> > > >
> >> > > > --
> >> > > > -Regards,
> >> > > > Mayuresh R. Gharat
> >> > > > (862) 250-7125
> >>
> >> --
> >> Thanks,
> >> Ewen
>
> --
> Thanks,
> Ewen
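
For anyone who can't open the pastebin links, the workaround mentioned at the top of this message (three bootstrap nodes plus reconnect.backoff.ms raised to 1000 ms) can be sketched roughly as below. This is not the code from the pastebins. The host names kafka1/kafka2/kafka3 and the class name ProducerConfigSketch are made up for illustration, and the sketch only builds a java.util.Properties object, so it runs without the Kafka client on the classpath.

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Builds producer settings as plain Properties; no Kafka classes are used here.
    static Properties build(String bootstrapServers, long reconnectBackoffMs) {
        Properties props = new Properties();
        // All brokers that hold a replica of the topic (hypothetical host names in main).
        props.setProperty("bootstrap.servers", bootstrapServers);
        // The setting that made the metadata timeouts go away in our tests.
        props.setProperty("reconnect.backoff.ms", Long.toString(reconnectBackoffMs));
        // Serializer class names from the kafka-clients jar; stored here as plain strings.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("kafka1:9092,kafka2:9092,kafka3:9092", 1000L);
        props.list(System.out);
    }
}
```

With the kafka-clients jar available, these properties would be passed to `new KafkaProducer<String, String>(props)`. Whether 1000 ms is the right value for other clusters is something I have not verified; it is only what worked in our test setup.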