Hey Srini,

I've tested a pretty aggressive keepalive config with the following 
parameters:

'grpc.http2.min_time_between_pings_ms': 1000,
'grpc.keepalive_time_ms': 1000,
'grpc.keepalive_permit_without_calls': 1

Is there anything I'm missing? Ideally I would like this solution to handle 
both explicit RSTs and things like firewalls blackholing inactive 
connections (which we've seen happen in the past), so getting keepalive to 
detect a dead connection would be great.
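
For completeness, here's roughly how I'm wiring this up, as a Python sketch 
(the target is made up, and the last two options are my guess at what's 
also needed for the client to actually declare a connection dead):

import grpc

options = [
    # Aggressive keepalive: ping every second, even with no active calls.
    ('grpc.keepalive_time_ms', 1000),
    ('grpc.keepalive_permit_without_calls', 1),
    ('grpc.http2.min_time_between_pings_ms', 1000),
    # Assumption: without a keepalive timeout a missed ping ack is never
    # acted on, and max_pings_without_data=0 lifts the cap on pings sent
    # on an idle connection.
    ('grpc.keepalive_timeout_ms', 5000),
    ('grpc.http2.max_pings_without_data', 0),
]

channel = grpc.secure_channel('ingress.example.com:443',
                              grpc.ssl_channel_credentials(),
                              options=options)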

Thanks,
Alysha

On Friday, August 17, 2018 at 8:17:43 PM UTC-4, Srini Polavarapu wrote:
>
> Hi Alysha,
>
> How did you confirm that the client is going into backoff and that it is 
> indeed receiving an RST when nginx goes away? Have you looked at the logs 
> gRPC generates when this happens? One possibility is that nginx doesn't 
> send an RST and the client doesn't know the connection is broken until a 
> TCP timeout occurs. Using keepalive will help in this case.
>
> You can try using wait_for_ready=false 
> <https://github.com/grpc/grpc/blob/5098508d2d41a116113f7e333c516cd9ef34a943/doc/wait-for-ready.md> 
> so the call fails immediately and you can retry.
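>
> A rough sketch of that fail-fast-and-retry pattern in Python (the stub 
> method name and retry policy here are just placeholders, and this assumes 
> your grpcio version exposes the wait_for_ready kwarg):
>
> import grpc
>
> def call_with_retry(stub, request, attempts=3):
>     for attempt in range(attempts):
>         try:
>             # wait_for_ready=False (the default) makes the RPC fail
>             # immediately with UNAVAILABLE instead of queuing until the
>             # channel reconnects.
>             return stub.SomeMethod(request, wait_for_ready=False)
>         except grpc.RpcError as e:
>             if e.code() != grpc.StatusCode.UNAVAILABLE:
>                 raise
>             if attempt == attempts - 1:
>                 raise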
>
> A recent PR, https://github.com/grpc/grpc/pull/16225, allows you to reset 
> the backoff period. It is experimental and doesn't have a Python or Ruby 
> API, so it can't be of immediate help.
>
> On Friday, August 17, 2018 at 12:58:12 PM UTC-7, alysha....@shopify.com 
> wrote:
>>
>> Hey Carl,
>>
>> This is with L7 nginx balancing; the reason we moved to nginx from L4 
>> balancers was so we could do per-call balancing (instead of the 
>> per-connection balancing we got with L4).
>>
>> >  In an ideal world, nginx would send a GOAWAY frame to both the client 
>> and the server, and allow all the RPCs to complete before tearing down the 
>> connection.
>>
>> I agree a GOAWAY would be better, but it seems like nginx doesn't do that 
>> (at least yet); it just RSTs the connection :(
>>
>> > The client knows how to reschedule an unstarted RPC onto a different 
>> connection, without returning an UNAVAILABLE.
>>
>> Even when we were using L4, it seemed like a GOAWAY from the Go server 
>> would put the Core clients in a backoff state instead of retrying 
>> immediately. The only solution that worked was round-robin over multiple 
>> connections plus a slow-enough rolling restart that connections could 
>> re-establish before the next one died.
>>
>> > When you say multiple connections to a single IP, does that mean 
>> multiple nginx instances listening on different ports?
>>
>> No, it's a pool of ~20 ingress nginx instances with an L4 load balancer, 
>> so traffic looks like client -> L4 LB -> nginx L7 -> backend GRPC pod. The 
>> problem is the L4 LB in front of nginx has a single public IP.
>>
>> > I'm most familiar with Java, which can actually do what you want. The 
>> normal way is to create a custom NameResolver that returns multiple 
>> addresses for a single name, which a RoundRobin load balancer will use
>>
>> Yeah, I considered writing something similar in Core, but I was worried 
>> it wouldn't be adopted upstream given the move to external LBs. It's very 
>> tough (impossible?) to add new resolvers to Ruby or Python without 
>> rebuilding the whole extension, and we're pretty worried about maintaining 
>> a fork of the C++ implementation. It's nice to hear the approach has some 
>> merit; I might experiment with it.
>>
>> Thanks,
>> Alysha
>>
>> On Friday, August 17, 2018 at 3:42:31 PM UTC-4, Carl Mastrangelo wrote:
>>>
>>> Hi Alysha,
>>>
>>> Do you know if nginx is balancing at L4 or L7? In an ideal world, 
>>> nginx would send a GOAWAY frame to both the client and the server, and 
>>> allow all the RPCs to complete before tearing down the connection. The 
>>> client knows how to reschedule an unstarted RPC onto a different 
>>> connection, without returning an UNAVAILABLE.
>>>
>>> When you say multiple connections to a single IP, does that mean 
>>> multiple nginx instances listening on different ports?    
>>>
>>> I'm most familiar with Java, which can actually do what you want. The 
>>> normal way is to create a custom NameResolver that returns multiple 
>>> addresses for a single name, which a RoundRobin load balancer will use. 
>>> It sounds like you aren't using Java, but since the implementations are 
>>> all similar there may be a way to do the same thing.
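>>>
>>> In Core-based clients, a rough equivalent of that pattern (a sketch with 
>>> made-up addresses; it leans on the ipv4: target scheme accepting a 
>>> comma-separated list, plus the round_robin LB policy) looks like:
>>>
>>> import grpc
>>>
>>> # Each address in the ipv4: list becomes its own subchannel, and
>>> # round_robin spreads calls across all of them.
>>> channel = grpc.insecure_channel(
>>>     'ipv4:10.0.0.1:50051,10.0.0.2:50051,10.0.0.3:50051',
>>>     options=[('grpc.lb_policy_name', 'round_robin')],
>>> )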
>>>
>>> On Friday, August 17, 2018 at 8:46:49 AM UTC-7, alysha....@shopify.com 
>>> wrote:
>>>>
>>>> Hi grpc people!
>>>>
>>>> We have a setup where we're running a grpc service (written in Go) on 
>>>> GKE, and we're accepting traffic from outside the cluster through nginx 
>>>> ingresses. Our clients are all using Core GRPC libraries (mostly Ruby) to 
>>>> make calls to the nginx ingress, which load-balances per-call to our 
>>>> backend pods.
>>>>
>>>> The problem we have with this setup is that whenever the nginx 
>>>> ingresses reload they drop all client connections, which results in 
>>>> spikes of Unavailable errors from our grpc clients. There are many 
>>>> nginx ingresses, but they all share a single IP; the incoming TCP 
>>>> connections are routed through a Google Cloud L4 load balancer. 
>>>> Whenever an nginx instance closes a client's TCP connection, the gRPC 
>>>> subchannel treats the backend as unavailable and goes into backoff 
>>>> logic, even though there are many more nginx pods immediately available 
>>>> to serve traffic. My understanding is that with multiple subchannels, 
>>>> even if one nginx ingress is restarted the others can continue to serve 
>>>> requests and we shouldn't see Unavailable errors.
>>>>
>>>> My question is: what is the best way to make GRPC Core establish 
>>>> multiple connections to a single IP, so we can have long-lived connections 
>>>> to multiple nginx ingresses? 
>>>>
>>>> Possibilities we've considered:
>>>>
>>>> - DNS round-robin with multiple public IPs on a single A record - we've 
>>>> tested this and it works, but it requires us to manually administer the 
>>>> DNS records and run multiple L4 LBs
>>>>
>>>> - DNS SRV records - it seems like we could have multiple SRV records 
>>>> with the same hostname, but in my testing this requires us to add a 
>>>> look-aside load-balancer as well and to enable the c-ares DNS resolver, 
>>>> which doesn't seem to be production-ready
>>>>
>>>> - Host a look-aside load-balancer - we could host our own LB service, 
>>>> but it's not clear to me how we would overcome this issue for the LB 
>>>> service itself, since it would be behind the same nginx ingresses. I 
>>>> haven't found great documentation on how to set this up either.
>>>>
>>>> - Connection pooling in the client - wrapping the Ruby GRPC channels in 
>>>> a library that explicitly establishes multiple channels, each with one 
>>>> sub-channel (a rough sketch is below). I've tried to write this, but 
>>>> it's tricky to implement at such a high level, and I couldn't get it to 
>>>> perform as well during failures as the DNS round-robin approach.
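>>>>
>>>> Roughly what I mean, as a Python sketch (the per-channel dummy channel 
>>>> arg is my assumption about how to stop Core from de-duplicating the 
>>>> channels onto one shared subchannel; 'pool_slot' is a made-up key):
>>>>
>>>> import itertools
>>>>
>>>> import grpc
>>>>
>>>> class ChannelPool:
>>>>     """N independent channels to one target, used round-robin. Each
>>>>     should get its own TCP connection through the L4 LB and so land
>>>>     on (hopefully) a different nginx instance."""
>>>>
>>>>     def __init__(self, target, size=4):
>>>>         self._channels = [
>>>>             # A distinct channel arg per channel keeps the channels
>>>>             # from sharing a single underlying connection.
>>>>             grpc.insecure_channel(target, options=[('pool_slot', i)])
>>>>             for i in range(size)
>>>>         ]
>>>>         self._next = itertools.cycle(self._channels)
>>>>
>>>>     def channel(self):
>>>>         return next(self._next)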
>>>>
>>>> Are there options I missed? Is there any supported pattern for this? 
>>>> Has anyone deployed a similar architecture (many clients connecting 
>>>> through nginx on a single public IP)?
>>>>
>>>> Thanks,
>>>> Alysha
>>>>
>>>
