Sorry, I meant creating a new producer, not consumer.

Here's the code.

Producer - http://pastebin.com/Kqq1ymCX
Consumer - http://pastebin.com/i2Z8PTYB
Callback - http://pastebin.com/x253z7bG

As you'll notice, I am creating a new producer for each message, so every
send should start fresh from the bootstrap nodes.

I have a single topic (receive.queue) replicated across 3 nodes. I add all
3 nodes to the bootstrap list. On bringing one of the nodes down, some
messages start failing (metadata update timeout error).

As I mentioned earlier, the problem goes away simply by setting the
reconnect.backoff.ms property to 1000ms.
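
For reference, here is a minimal sketch of the producer settings described
above. It is not the pastebin code: the broker addresses (k1/k2/k3 on port
9092), the class name and the choice of String serializers are assumptions
made for illustration. Only bootstrap.servers and reconnect.backoff.ms come
from this thread.

```java
import java.util.Properties;

// Sketch only: builds the producer settings discussed in this thread.
final class ProducerConfigSketch {

    // Hypothetical broker addresses; substitute the three real nodes.
    static final String BOOTSTRAP = "k1:9092,k2:9092,k3:9092";

    static Properties buildProps(String bootstrapServers) {
        Properties props = new Properties();
        // All three replicas are listed so a new producer can bootstrap
        // from any node that is still up.
        props.put("bootstrap.servers", bootstrapServers);
        // The setting that made the metadata timeouts go away in my test.
        props.put("reconnect.backoff.ms", "1000");
        // Assumed serializers; the real code may use different ones.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps(BOOTSTRAP);
        // With kafka-clients on the classpath, these properties would be
        // passed to: new KafkaProducer<String, String>(props)
        System.out.println(props.getProperty("reconnect.backoff.ms"));
    }
}
```

The sketch depends only on java.util.Properties, so it runs without a Kafka
installation; the producer itself is left as a comment.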

On 7 May 2015 23:18, "Ewen Cheslack-Postava" <e...@confluent.io> wrote:

> Rahul, the mailing list filters attachments, you'd have to post the code
> somewhere else for people to be able to see it.
>
> But I don't think anyone suggested that creating a new consumer would fix
> anything. Creating a new producer *and discarding the old one* basically
> just makes it start from scratch using the bootstrap nodes, which is why
> that would allow recovery from that condition.
>
> But that's just a workaround. The real issue is that the producer only
> maintains metadata for the nodes that are replicas for the partitions of
> the topics the producer sends data to. In some cases, this is a small set
> of servers and can get the producer stuck if a node goes offline and it
> doesn't have any other nodes that it can try to communicate with to get
> updated metadata (since the topic partitions should have a new leader).
> Falling back on the original bootstrap servers is one solution to this
> problem. Another would be to maintain metadata for additional servers so
> you always have extra "bootstrap" nodes in your current metadata set, even
> if they aren't replicas for any of the topics you're working with.
>
> -Ewen
>
>
>
> On Thu, May 7, 2015 at 12:06 AM, Rahul Jain <rahul...@gmail.com> wrote:
>
> > Creating a new consumer instance *does not* solve this problem.
> >
> > Attaching the producer/consumer code that I used for testing.
> >
> >
> >
> > On Wed, May 6, 2015 at 6:31 AM, Ewen Cheslack-Postava <e...@confluent.io>
> > wrote:
> >
> >> I'm not sure about the old producer behavior in this same failure
> >> scenario, but creating a new producer instance would resolve the issue
> >> since it would start with the list of bootstrap nodes and, assuming at
> >> least one of them was up, it would be able to fetch up to date metadata.
> >>
> >> On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg <j...@squareup.com>
> >> wrote:
> >>
> >> > Can you clarify, is this issue here specific to the "new" producer?
> >> > With the "old" producer, we routinely construct a new producer which
> >> > makes a fresh metadata request (via a VIP connected to all nodes in the
> >> > cluster). Would this approach work with the new producer?
> >> >
> >> > Jason
> >> >
> >> >
> >> > On Tue, May 5, 2015 at 1:12 PM, Rahul Jain <rahul...@gmail.com>
> >> > wrote:
> >> >
> >> > > Mayuresh,
> >> > > I was testing this in a development environment and manually brought
> >> > > down a node to simulate this. So the dead node never came back up.
> >> > >
> >> > > My colleague and I were able to consistently see this behaviour
> >> > > several times during the testing.
> >> > > On 5 May 2015 20:32, "Mayuresh Gharat" <gharatmayures...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > I agree that, to find the least loaded node, the producer should
> >> > > > fall back to the bootstrap nodes if it's not able to connect to any
> >> > > > nodes in the current metadata. That should resolve this.
> >> > > >
> >> > > > Rahul, I suppose the problem went away because the dead node in
> >> > > > your case might have come back up and allowed a metadata update.
> >> > > > Can you confirm this?
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Mayuresh
> >> > > >
> >> > > > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain <rahul...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > We observed the exact same error. Not very clear about the root
> >> > > > > cause although it appears to be related to leastLoadedNode
> >> > > > > implementation. Interestingly, the problem went away by increasing
> >> > > > > the value of reconnect.backoff.ms to 1000ms.
> >> > > > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" <e...@confluent.io>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Ok, all of that makes sense. The only way to possibly recover
> >> > > > > > from that state is either for K2 to come back up allowing the
> >> > > > > > metadata refresh to eventually succeed or to eventually try some
> >> > > > > > other node in the cluster. Reusing the bootstrap nodes is one
> >> > > > > > possibility. Another would be for the client to get more metadata
> >> > > > > > than is required for the topics it needs in order to ensure it
> >> > > > > > has more nodes to use as options when looking for a node to fetch
> >> > > > > > metadata from. I added your description to KAFKA-1843, although
> >> > > > > > it might also make sense as a separate bug since fixing it could
> >> > > > > > be considered incremental progress towards resolving 1843.
> >> > > > > >
> >> > > > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy
> >> > > > > > <ku...@nmsworks.co.in> wrote:
> >> > > > > >
> >> > > > > > > Hi Ewen,
> >> > > > > > >
> >> > > > > > > Thanks for the response. I agree with you: in some cases we
> >> > > > > > > should use the bootstrap servers.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > If you have logs at debug level, are you seeing this
> >> > > > > > > > message in between the connection attempts:
> >> > > > > > > >
> >> > > > > > > > Give up sending metadata request since no node is available
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > Yes, this log message appeared a couple of times.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > Also, if you let it continue running, does it recover
> >> > > > > > > > after the metadata.max.age.ms timeout?
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > It does not reconnect. It keeps trying to connect to the
> >> > > > > > > dead node.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > -Manikumar
> >> > > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks,
> >> > > > > > Ewen
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > -Regards,
> >> > > > Mayuresh R. Gharat
> >> > > > (862) 250-7125
> >> > > >
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Thanks,
> >> Ewen
> >>
> >
> >
>
>
> --
> Thanks,
> Ewen
>
