mux-h2: Backend stream is not fully closed if frontend keeps stream open
Hello,

I was previously investigating a strange gRPC streaming issue, which turns out to be a fairly straightforward problem with how open streams get half-closed.

In the normal situation, a frontend request comes in with END_STREAM set. The frontend stream becomes half-closed by the remote side, and in turn HAProxy half-closes the backend stream on the local side. When the backend response carries END_STREAM, the stream is cleanly closed via h2s_close(), and the response is delivered to the requester on the frontend.

However, if the frontend request does not carry END_STREAM, then the backend stream also stays open. When the backend sends a response (H2 HEADERS frame) with END_STREAM, the H2 stream is updated to "half-closed (remote)" but is never properly considered closed. The server thinks everything has been processed, and can send a RST_STREAM afterwards, but the actual response is not delivered to the frontend side. Instead, an HTTP 502 is sent to the original requester, and the session disconnect state is reported as SH--.

In my scenario there is only a HEADERS frame, so h2c_bck_handle_headers() could be modified to call h2s_close() on H2_SS_HREM (starting out as H2_SS_OPEN) in addition to H2_SS_HLOC. This is a hacky solution, however, and would not address a DATA frame hitting the same issue. Instead, the response should be properly processed when the backend remote closes its side, rather than waiting for the frontend to close its side.

I will try to loop back around to this issue with a patch, but that will most likely take time on my side, due to both limited personal bandwidth and unfamiliarity with the H2 processing. Anyone willing to provide a quicker patch is appreciated!
---
Trace excerpt:

[00|h2|2|mux_h2.c:3612] receiving H2 HEADERS frame : h2c=0x560dddb99c50(B,FRP) h2s=0x560dddb064a0(1,OPN)
[00|h2|5|mux_h2.c:2876] h2c_bck_handle_headers(): entering : h2c=0x560dddb99c50(B,FRP) h2s=0x560dddb064a0(1,OPN)
[00|h2|5|mux_h2.c:4898] h2c_dec_hdrs(): entering : h2c=0x560dddb99c50(B,FRP)
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh :status: 200
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh content-type: application/grpc
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh grpc-status: 12
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh grpc-message: unknown service grpc.reflection.v1.ServerReflection
[00|h2|5|mux_h2.c:5146] h2c_dec_hdrs(): leaving : h2c=0x560dddb99c50(B,FRH)
[00|h2|1|mux_h2.c:2951] rcvd H2 response : h2c=0x560dddb99c50(B,FRH) : [1] H2 RES: HTTP/2.0 200
[00|h2|5|mux_h2.c:2952] h2c_bck_handle_headers(): leaving : h2c=0x560dddb99c50(B,FRH) h2s=0x560dddb064a0(1,HCR)
---

Best regards,
Valters Jansons
Re: [PATCH] BUG/MINOR: promex: fix backend_agg_check_status
On 12/09/2023 at 11:37, Cedric Paillet wrote:
> When a server is in maintenance, check.status is no longer updated.
> Therefore, we shouldn't consider check.status if the checks are not
> active. This check was properly implemented in the
> haproxy_server_check_status metric, but wasn't carried over to
> backend_agg_check_status, which introduced inconsistencies between the
> two metrics.
> ---
>  addons/promex/service-prometheus.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
> index e02e8c0c7..6885d202e 100644
> --- a/addons/promex/service-prometheus.c
> +++ b/addons/promex/service-prometheus.c
> @@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
>  			goto next_px;
>  		sv = px->srv;
>  		while (sv) {
> -			srv_check_status = sv->check.status;
> -			srv_check_count[srv_check_status] += 1;
> +			if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
> +				srv_check_status = sv->check.status;
> +				srv_check_count[srv_check_status] += 1;
> +			}
>  			sv = sv->next;
>  		}
>  		for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {

Thanks, merged now!

--
Christopher Faulet
[PATCH] BUG/MINOR: promex: fix backend_agg_check_status
When a server is in maintenance, check.status is no longer updated. Therefore, we shouldn't consider check.status if the checks are not active. This check was properly implemented in the haproxy_server_check_status metric, but wasn't carried over to backend_agg_check_status, which introduced inconsistencies between the two metrics.
---
 addons/promex/service-prometheus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..6885d202e 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 			goto next_px;
 		sv = px->srv;
 		while (sv) {
-			srv_check_status = sv->check.status;
-			srv_check_count[srv_check_status] += 1;
+			if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
+				srv_check_status = sv->check.status;
+				srv_check_count[srv_check_status] += 1;
+			}
 			sv = sv->next;
 		}
 		for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
--
2.25.1
RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance
Hi,

Sorry, I now understand my problem better. When a server is in maintenance, the health checks are not performed and sv->check.state == CHK_ST_PAUSED. This is checked in the haproxy_server_check_status metric but not in backend_agg_check_status. I will check my new patch and push it afterwards.

Thank you very much.

-----Original Message-----
From: Christopher Faulet
Sent: Monday, September 11, 2023 20:15
To: Cedric Paillet; haproxy@formilux.org
Subject: Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

On 07/09/2023 at 16:50, Cedric Paillet wrote:
>
>> And I guess we should also check that the health checks are enabled for
>> the server. It is not really an issue because the call to
>> get_check_status_result() will exclude neutral and unknown statuses.
>> But there is no reason to count these servers.
>
> What we observed is that the health check of servers that have been put
> into maintenance is no longer updated, and the status returned is the
> last known one. I need to double-check, but I believe we even saw L7OK
> when putting an "UP" server into maintenance. (What I'm sure of is that
> the majority of servers in maintenance were in L7STS and not in UNK.)
>
> If we add the PROMEX_FL_NO_MAINT_SR (which makes sense), we will
> continue to display an incorrect backend_agg_server_status (and also
> haproxy_server_check_status) for servers in maintenance for those who
> don't set (no-maint=empty), and we probably then need another patch so
> that the status of these servers is HCHK_STATUS_UNKNOWN?
>

Health checks for servers in maintenance are paused, so indeed, the last known status does not change anymore in this state. My purpose here was to also filter servers, to only count those with health checks enabled and running. When the server's metrics are dumped, the check status is already skipped for servers in maintenance. Thus it seems logical not to count them for the aggregated metric.
In fact, this way, all servers in maintenance are skipped without checking the server's admin state. But it is probably cleaner to keep both checks for consistency. Unless I missed something; this part is no longer entirely clear to me...

--
Christopher Faulet