[grpc-io] Re: Establishing multiple grpc subchannels for a single resolved host

2018-08-20 Thread alysha.gardner via grpc.io
Hey Srini,

I've tested a pretty aggressive keepalive config with the following
parameters:

'grpc.http2.min_time_between_pings_ms': 1000,
'grpc.keepalive_time_ms': 1000,
'grpc.keepalive_permit_without_calls': 1

Is there anything I'm missing? Ideally I would like this solution to handle 
both explicit RST and also things like firewalls blackholing inactive 
connections (which we've seen happen in the past), so getting keepalive to 
detect a dead connection would be great.

Thanks,
Alysha
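
For reference, here is a minimal Python sketch of how channel arguments like the ones above could be assembled. The `keepalive_options` helper, its default values, and the host name in the comment are hypothetical. `grpc.keepalive_timeout_ms` and `grpc.http2.max_pings_without_data` are further gRPC Core channel arguments that were not in the tested configuration; they are shown only as candidates to experiment with.

```python
def keepalive_options(time_ms=1000, timeout_ms=500, permit_without_calls=True):
    """Build gRPC channel options as a list of (key, value) tuples."""
    return [
        # Send a keepalive ping after this much time without activity.
        ("grpc.keepalive_time_ms", time_ms),
        # Treat the connection as dead if the ping is not acked in time.
        ("grpc.keepalive_timeout_ms", timeout_ms),
        # Keep pinging even when no RPC is in flight (1 = enabled).
        ("grpc.keepalive_permit_without_calls", int(permit_without_calls)),
        # Minimum interval between pings sent without data.
        ("grpc.http2.min_time_between_pings_ms", time_ms),
        # 0 lifts the cap on pings sent without intervening data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


options = keepalive_options()

# With the grpcio package installed, the list is passed at channel creation:
#   import grpc
#   channel = grpc.insecure_channel("ingress.example.com:443", options=options)
```

Note that a server or proxy may reject very frequent pings under its own keepalive enforcement policy, so values this aggressive need a matching setting on the receiving side.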

On Friday, August 17, 2018 at 8:17:43 PM UTC-4, Srini Polavarapu wrote:
>
> Hi Alysha,
>
> How did you confirm that the client is going into backoff and that it is
> indeed receiving a RST when nginx goes away? Have you looked at the logs gRPC
> generates when this happens? One possibility is that nginx doesn't send a RST
> and the client doesn't know that the connection is broken until the TCP
> timeout occurs. Using keepalive will help in this case.
>
> You can try using wait_for_ready=false so the call fails immediately and you
> can retry.
>
> A recent PR allows you to reset the backoff period:
> https://github.com/grpc/grpc/pull/16225. It is experimental and doesn't
> have a Python or Ruby API, so it can't be of immediate help.
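
The fail-fast-and-retry suggestion above can be sketched in Python as follows. This is only an illustration under stated assumptions: `Unavailable`, `call_with_retry`, and `flaky_rpc` are hypothetical names, and `Unavailable` stands in for the error a real client would see (in grpcio, a `grpc.RpcError` whose `code()` is `grpc.StatusCode.UNAVAILABLE`).

```python
import time


class Unavailable(Exception):
    """Hypothetical stand-in for an RPC that failed with status UNAVAILABLE."""


def call_with_retry(rpc, attempts=3, delay_s=0.05, sleep=time.sleep):
    """Call rpc() and retry on Unavailable, up to `attempts` tries in total."""
    for attempt in range(1, attempts + 1):
        try:
            return rpc()
        except Unavailable:
            if attempt == attempts:
                raise  # out of tries: surface the error to the caller
            sleep(delay_s * attempt)  # short, linearly growing pause


# Demo with a fake RPC that fails twice and then succeeds.
calls = {"n": 0}


def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise Unavailable("connection reset")
    return "ok"


result = call_with_retry(flaky_rpc, sleep=lambda s: None)
```

Retrying this way is only safe for calls that are idempotent or that are known not to have reached the backend.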

[grpc-io] Re: Establishing multiple grpc subchannels for a single resolved host

2018-08-17 Thread alysha.gardner via grpc.io
Hey Carl,

This is with L7 nginx balancing. The reason we moved to nginx from L4
balancers was so we could do per-call balancing (instead of per-connection
with L4).

> In an ideal world, nginx would send a GOAWAY frame to both the client
> and the server, and allow all the RPCs to complete before tearing down the
> connection.

I agree a GOAWAY would be better but it seems like nginx doesn't do that
(at least yet), they just RST the connection :(

> The client knows how to reschedule an unstarted RPC onto a different
> connection, without returning an UNAVAILABLE.

Even when we were using L4 it seemed like a GOAWAY from the Go server would 
put the Core clients in a backoff state instead of retrying immediately. 
The only solution that worked was a round-robin over multiple connections 
and a slow-enough rolling restart so the connections could re-establish 
before the next one died.

> When you say multiple connections to a single IP, does that mean multiple
> nginx instances listening on different ports?

No, it's a pool of ~20 ingress nginx instances with an L4 load balancer, so 
traffic looks like client -> L4 LB -> nginx L7 -> backend GRPC pod. The 
problem is the L4 LB in front of nginx has a single public IP.

> I'm most familiar with Java, which can actually do what you want. The
> normal way is to create a custom NameResolver that returns multiple
> addresses for a single address, which a RoundRobin load balancer will use.

Yeah I considered writing something similar in Core but I was worried it 
wouldn't be adopted upstream because of the move to external LBs? It's very 
tough (impossible?) to add new resolvers to Ruby or Python without 
rebuilding the whole extension, and we're pretty worried about maintaining 
a fork of the C++ implementation. It's nice to hear the approach has some 
merits, I might experiment with it.

Thanks,
Alysha

On Friday, August 17, 2018 at 3:42:31 PM UTC-4, Carl Mastrangelo wrote:
>
> Hi Alysha,
>
> Do you know if nginx is balancing at L4 or L7? In an ideal world,
> nginx would send a GOAWAY frame to both the client and the server, and
> allow all the RPCs to complete before tearing down the connection. The
> client knows how to reschedule an unstarted RPC onto a different
> connection, without returning an UNAVAILABLE.
>
> When you say multiple connections to a single IP, does that mean multiple 
> nginx instances listening on different ports?
>
> I'm most familiar with Java, which can actually do what you want. The
> normal way is to create a custom NameResolver that returns multiple
> addresses for a single address, which a RoundRobin load balancer will use.
> It sounds like you aren't using Java, but since the implementations are all
> similar there may be a way to do so.

[grpc-io] Establishing multiple grpc subchannels for a single resolved host

2018-08-17 Thread alysha.gardner via grpc.io
Hi grpc people!

We have a setup where we're running a grpc service (written in Go) on GKE, 
and we're accepting traffic from outside the cluster through nginx 
ingresses. Our clients are all using Core GRPC libraries (mostly Ruby) to 
make calls to the nginx ingress, which load-balances per-call to our 
backend pods.

The problem we have with this setup is that whenever the nginx ingresses
reload they drop all client connections, which results in spikes of
Unavailable errors from our grpc clients. There are many nginx ingresses
but they all share a single IP: the incoming TCP connections are routed
through a Google Cloud L4 load balancer. Whenever an nginx instance closes a
client's TCP connection, the GRPC subchannel treats the backend as
unavailable, even though there are many more nginx pods that may be available
immediately to serve traffic, and it goes into backoff logic. My
understanding is that with multiple subchannels, even if one nginx ingress
is restarted the others can continue to serve requests and we shouldn't see
Unavailable errors.

My question is: what is the best way to make GRPC Core establish multiple 
connections to a single IP, so we can have long-lived connections to 
multiple nginx ingresses? 

Possibilities we've considered:

- DNS round-robin with multiple public IPs on a single A record - we've 
tested this and it works, but it requires us to manually administer the DNS 
records and run multiple L4 LBs

- DNS SRV records - it seems like we could have multiple SRV records with 
the same hostname, but in my testing this requires us to add a look-aside 
load-balancer as well, and enable ares DNS which doesn't seem to be 
production-ready

- Host a look-aside load-balancer - we could host our own LB service, but 
it's not clear to me how we would overcome this issue for the LB service? 
The LB would be behind the same nginx ingresses. I haven't found great 
documentation on how to set this up either.

- Connection pooling in the client - wrapping the Ruby GRPC channels in a 
library that explicitly establishes multiple channels, each with one 
sub-channel. I've tried to write this but it's tricky to implement at a 
high level. I couldn't get it to perform as well during failures as the DNS 
round-robin approach.

Are there options I missed? Is there any supported pattern for this? Has 
anyone deployed a similar architecture (many clients connecting through 
nginx on a single public IP)?

Thanks,
Alysha
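
The client-side connection pooling option from the list above can be sketched in Python as follows. `ChannelPool` and the `factory` callable are hypothetical, and the demo uses placeholder strings instead of real channels. In real use the factory would create one gRPC channel per index. As far as I understand, channels created with identical arguments may share one underlying connection, so each channel would probably need a distinct channel argument; treat that as an assumption to verify.

```python
import itertools
import threading


class ChannelPool:
    """Round-robin over a fixed set of channels built by factory(index)."""

    def __init__(self, factory, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self._channels = [factory(i) for i in range(size)]
        self._cycle = itertools.cycle(self._channels)
        self._lock = threading.Lock()

    def next_channel(self):
        # Guard the shared iterator so concurrent callers each get one item.
        with self._lock:
            return next(self._cycle)


# Demo with placeholder "channels" instead of real gRPC channels.
pool = ChannelPool(lambda i: f"channel-{i}", size=3)
picked = [pool.next_channel() for _ in range(4)]
```

This sketch only spreads calls across connections. It does not skip a channel that is currently in backoff, which is likely the hard part of making a pool behave well during failures.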

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/grpc-io/0571d7d1-d91c-417a-b1ee-5c7f2296bc38%40googlegroups.com.