Looking at our code more closely, I think there is a bug here.  If
the resolver returns an error for the addresses on the very first
resolution attempt, we seem to get into a state where nothing will
trigger re-resolution.
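
To make the failure mode concrete, here is a toy Python model of it.  This
is not gRPC code: `ToyChannel`, `on_resolver_result`, `run`, and the
`retry_on_initial_error` flag are made-up names, and the rule deciding
whether a retry gets scheduled is an assumption chosen only to reproduce
the symptom (an error on the first attempt leaves nothing scheduled).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyChannel:
    """Hypothetical stand-in for a client channel; not gRPC's real classes."""

    # False models the stuck behavior; True models the intended behavior.
    retry_on_initial_error: bool
    state: str = "IDLE"
    retry_scheduled: bool = False
    has_good_result: bool = False

    def on_resolver_result(self, addresses: Optional[List[str]]) -> None:
        """Handle one resolver result; None models an error for the addresses."""
        self.retry_scheduled = False
        if addresses:
            self.has_good_result = True
            self.state = "READY"  # simplification: assume the connection succeeds
            return
        self.state = "TRANSIENT_FAILURE"
        # Assumption: re-resolution is only requested if an earlier attempt
        # succeeded, unless the (hypothetical) fix flag is set.
        if self.has_good_result or self.retry_on_initial_error:
            self.retry_scheduled = True


def run(channel: ToyChannel, results: List[Optional[List[str]]]) -> int:
    """Feed results while a resolution is pending; return the attempt count."""
    attempts = 0
    pending = True  # the first resolution always runs
    for result in results:
        if not pending:
            break
        attempts += 1
        channel.on_resolver_result(result)
        pending = channel.retry_scheduled
    return attempts
```

With `retry_on_initial_error=False`, feeding an error followed by a good
address list leaves the toy channel in TRANSIENT_FAILURE after a single
attempt; with `True`, the second attempt runs and it reaches READY.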

This bug seems to have been here for a long time, so I'm surprised no
one has run into it until now.  It definitely needs to be fixed, but it'll
take a bit of work to get all the pieces working together correctly.  Can
you please file an issue and tag me on it?

Thanks very much for reporting this!

On Mon, Aug 29, 2022 at 3:25 PM 'Chi Jameson' via grpc.io <
grpc-io@googlegroups.com> wrote:

> Hello!
>
> We've been able to locate where the client channel stops attempting to
> reconnect, but haven't found how or why the c-ares resolver successfully
> passes an empty address list to the pick_first load balancer. What appears
> to be happening is that it hits this 0 addresses check
> <https://github.com/grpc/grpc/blob/v1.36.4/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc#L192>
> and causes a TRANSIENT_FAILURE, but then the client channel never responds
> beyond that. We've seen this same freeze happen in v1.46.4 at the same
> subchannel list check
> <https://github.com/grpc/grpc/blob/v1.46.4/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc#L191>,
> but I don't know whether it's possible to hit that if statement in
> practice, as the reproduction method I've used is simply to provide an
> empty address list to that method. Actually getting c-ares to return an
> empty address list and reproduce the behavior we're seeing has proven
> difficult, as most of the time the resolver behaves as you mentioned.
>
> What normally listens for the UpdateState from the channel_control_helper?
> That might give us a good hint for why the client channel stops after that
> point.
>
> Thanks!
> Chi
> Cisco Meraki
> On Wednesday, August 17, 2022 at 11:35:47 AM UTC-6 Mark D. Roth wrote:
>
>> Can you try running with the following environment variables set, and
>> share the log?  That might help us figure out what's going on here.
>>
>> GRPC_VERBOSITY=DEBUG
>> GRPC_TRACE=client_channel_routing,pick_first,cares_resolver
>>
>> In general, the c-ares resolver should return an error when there's an
>> empty address list, so it should automatically retry the resolution
>> periodically until it succeeds.  The only exception I see in the code is if
>> there are balancer addresses successfully returned
>> <https://github.com/grpc/grpc/blob/9794038ae03842573517411df4ef6ac87a377be0/src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc#L325>,
>> but that shouldn't be the case if you're using pick_first.  Unless maybe
>> you're using a service config in DNS, but the service config lookup is
>> failing also?
>>
>> Anyway, getting some additional logs will probably help us understand
>> what's going wrong here.
>>
>> On Wed, Aug 10, 2022 at 6:41 AM 'Peter Hurley' via grpc.io <
>> grp...@googlegroups.com> wrote:
>>
>>> Thanks for the reply.
>>>
>>> > And would it be possible for you to upgrade your gRPC library and try
>>> to reproduce this?
>>> I didn't see any similar issue (marked fixed or not) in
>>> https://github.com/grpc/grpc/issues; we were hoping the community could
>>> confirm whether this has been observed and fixed already but went
>>> unreported on GitHub.
>>>
>>> > v1.36.4 is over a year old, and a fair handful of bug fixes have gone
>>> in since then.
>>> We're using the still-experimental TLSCredentials, so every version bump
>>> is non-trivial, and we've already found and fixed a number of core bugs
>>> ourselves, so it'll be a while before we're upgrading again in production.
>>>
>>> > Regarding that, are you able to reproduce the conditions in which the
>>> failure occurs, or are they maybe not fully understood? e.g., run a local
>>> DNS server for testing, and modify its records.
>>> Yeah, the exact conditions are not well understood, but it is almost
>>> certainly happening during a restart of the local caching dnsmasq server
>>> due to intermittent connection loss.
>>>
>>>
>>> On Fri, Aug 5, 2022 at 8:35 PM 'AJ Heller' via grpc.io <
>>> grp...@googlegroups.com> wrote:
>>>
>>>> That's mysterious. Do you know what the state of the DNS records is
>>>> when this occurs? And would it be possible for you to upgrade your gRPC
>>>> library and try to reproduce this? v1.36.4 is over a year old, and a fair
>>>> handful of bug fixes have gone in since then.
>>>>
>>>>> We've been unable to reproduce this failure in testing, and would
>>>>> appreciate any pointers:
>>>>>
>>>>
>>>> Regarding that, are you able to reproduce the conditions in which the
>>>> failure occurs, or are they maybe not fully understood? e.g., run a local
>>>> DNS server for testing, and modify its records.
>>>>
>>>>
>>>>>
>>>>>    - what is supposed to re-kick a new DNS resolve if the server list
>>>>>    is empty?
>>>>>    - where to check in the resolver code for an empty server list?
>>>>>    - or any other ideas for how to track down the problem
>>>>>
>>>>>
>>>>> We're using grpc v1.36.4 w/ libcares2 1.14
>>>>>
>>>>> Regards,
>>>>> Peter Hurley
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "grpc.io" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to grpc-io+u...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/grpc-io/306779dd-0a68-4b95-851e-0a5979a4e872n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/grpc-io/306779dd-0a68-4b95-851e-0a5979a4e872n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>
>>
>> --
>> Mark D. Roth <ro...@google.com>
>> Software Engineer
>> Google, Inc.
>>


-- 
Mark D. Roth <r...@google.com>
Software Engineer
Google, Inc.

