mux-h2: Backend stream is not fully closed if frontend keeps stream open

2023-09-12 Thread Valters Jansons
Hello,

I was previously investigating a strange gRPC streaming issue, which turns
out to be a fairly straightforward problem with how open streams get
half-closed.

In the good case, a frontend request comes in with END_STREAM set. The
frontend stream becomes half-closed by the remote side. In turn, the
backend stream gets half-closed by HAProxy on the local side. When the
backend response carries END_STREAM, it triggers the stream to close
cleanly, calling h2s_close. The response is then delivered to the
requestor on the frontend.

However, if the frontend request does not carry END_STREAM, then the
backend stream also stays open. When the backend sends a response (H2
HEADERS frame) with END_STREAM, the H2 stream is moved to "half-closed
(remote)" but is never properly considered closed. The server thinks
everything has been processed and may send a RST_STREAM afterwards, but
the actual response is never delivered to the frontend side. Instead,
HTTP 502 is sent to the original requestor, and the session disconnect
state is reported as SH--.
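
For reference, the stream states involved correspond to RFC 7540 section 5.1.
The sketch below maps them to the abbreviations seen in the trace further down
(OPN, HCR) and to the H2_SS_* names mentioned later; it is an illustrative
paraphrase with made-up identifiers, not HAProxy's actual enum from mux_h2.c:

    /* Illustrative mapping only -- names and ordering are assumptions,
     * not copied from HAProxy's mux_h2.c. */
    enum h2_stream_state_sketch {
        SS_IDLE,   /* "idle": stream not yet started */
        SS_OPEN,   /* "open" (trace: OPN): neither side has sent END_STREAM */
        SS_HREM,   /* "half-closed (remote)" (trace: HCR): peer sent END_STREAM */
        SS_HLOC,   /* "half-closed (local)": we sent END_STREAM */
        SS_CLOSED, /* "closed": both sides done; h2s_close() lands a stream here */
    };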

In my scenario there is only a HEADERS frame, so `h2c_bck_handle_headers`
could be modified to call `h2s_close` on `H2_SS_HREM` (starting out as
`H2_SS_OPEN`) in addition to `H2_SS_HLOC`. This is a hacky solution,
however, and would not address a DATA frame hitting the same issue.
Instead, the actual response should be properly processed when the backend
remote closes its side (instead of waiting for the frontend to close its
side).
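
For illustration, the hacky variant would look roughly like the following
inside `h2c_bck_handle_headers`. The surrounding code shape and the
`es_rcvd` condition are paraphrased assumptions, not a quote of mux_h2.c,
and this is explicitly not the proper fix:

    /* Sketch only: es_rcvd stands for "the response HEADERS frame carried
     * END_STREAM"; the real code tracks this differently. */
    if (es_rcvd) {
        if (h2s->st == H2_SS_OPEN)
            h2s->st = H2_SS_HREM;      /* current behaviour: only half-close */
        if (h2s->st == H2_SS_HREM || h2s->st == H2_SS_HLOC)
            h2s_close(h2s);            /* hack: also fully close from HREM */
    }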

I will try to loop back around to this issue with a patch, but that will
most likely take time on my side, due both to limited personal bandwidth
and to unfamiliarity with the H2 processing. Anyone willing to provide a
quicker patch is appreciated!

---

Trace excerpt:

[00|h2|2|mux_h2.c:3612] receiving H2 HEADERS frame : h2c=0x560dddb99c50(B,FRP) h2s=0x560dddb064a0(1,OPN)
[00|h2|5|mux_h2.c:2876] h2c_bck_handle_headers(): entering : h2c=0x560dddb99c50(B,FRP) h2s=0x560dddb064a0(1,OPN)
[00|h2|5|mux_h2.c:4898] h2c_dec_hdrs(): entering : h2c=0x560dddb99c50(B,FRP)
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh :status: 200
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh content-type: application/grpc
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh grpc-status: 12
[00|h2|1|mux_h2.c:5040] h2c_dec_hdrs(): h2c=0x560dddb99c50(B,FRP) dsi=1 rcvh grpc-message: unknown service grpc.reflection.v1.ServerReflection
[00|h2|5|mux_h2.c:5146] h2c_dec_hdrs(): leaving : h2c=0x560dddb99c50(B,FRH)
[00|h2|1|mux_h2.c:2951] rcvd H2 response : h2c=0x560dddb99c50(B,FRH) : [1] H2 RES: HTTP/2.0 200
[00|h2|5|mux_h2.c:2952] h2c_bck_handle_headers(): leaving : h2c=0x560dddb99c50(B,FRH) h2s=0x560dddb064a0(1,HCR)
---

Best regards,
Valters Jansons



Re: [PATCH] BUG/MINOR: promex: fix backend_agg_check_status

2023-09-12 Thread Christopher Faulet

On 12/09/2023 at 11:37, Cedric Paillet wrote:

When a server is in maintenance, the check.status is no longer updated.
Therefore, we shouldn't consider check.status if the checks are not active.
This check was properly implemented in the haproxy_server_check_status
metric, but wasn't carried over to backend_agg_check_status, which
introduced inconsistencies between the two metrics.
---
 addons/promex/service-prometheus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..6885d202e 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 					goto next_px;
 				sv = px->srv;
 				while (sv) {
-					srv_check_status = sv->check.status;
-					srv_check_count[srv_check_status] += 1;
+					if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
+						srv_check_status = sv->check.status;
+						srv_check_count[srv_check_status] += 1;
+					}
 					sv = sv->next;
 				}
 				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
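
For clarity, the new condition counts a server only when its checks are
enabled and not paused (i.e. not in maintenance). A minimal standalone
illustration of the bitmask test, using hypothetical flag values rather
than HAProxy's real CHK_ST_* definitions:

    #include <assert.h>

    /* Hypothetical bit values, for illustration only. */
    #define CHK_ST_ENABLED 0x01
    #define CHK_ST_PAUSED  0x02

    /* True only when checks are enabled AND not paused. */
    static int checks_running(unsigned int state)
    {
        return (state & (CHK_ST_ENABLED | CHK_ST_PAUSED)) == CHK_ST_ENABLED;
    }

    int main(void)
    {
        assert( checks_running(CHK_ST_ENABLED));                 /* active checks: counted */
        assert(!checks_running(CHK_ST_ENABLED | CHK_ST_PAUSED)); /* maintenance: skipped */
        assert(!checks_running(0));                              /* checks disabled: skipped */
        return 0;
    }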


Thanks, merged now !

--
Christopher Faulet




[PATCH] BUG/MINOR: promex: fix backend_agg_check_status

2023-09-12 Thread Cedric Paillet
When a server is in maintenance, the check.status is no longer updated.
Therefore, we shouldn't consider check.status if the checks are not active.
This check was properly implemented in the haproxy_server_check_status
metric, but wasn't carried over to backend_agg_check_status, which
introduced inconsistencies between the two metrics.
---
 addons/promex/service-prometheus.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/addons/promex/service-prometheus.c b/addons/promex/service-prometheus.c
index e02e8c0c7..6885d202e 100644
--- a/addons/promex/service-prometheus.c
+++ b/addons/promex/service-prometheus.c
@@ -863,8 +863,10 @@ static int promex_dump_back_metrics(struct appctx *appctx, struct htx *htx)
 					goto next_px;
 				sv = px->srv;
 				while (sv) {
-					srv_check_status = sv->check.status;
-					srv_check_count[srv_check_status] += 1;
+					if ((sv->check.state & (CHK_ST_ENABLED|CHK_ST_PAUSED)) == CHK_ST_ENABLED) {
+						srv_check_status = sv->check.status;
+						srv_check_count[srv_check_status] += 1;
+					}
 					sv = sv->next;
 				}
 				for (; ctx->obj_state < HCHK_STATUS_SIZE; ctx->obj_state++) {
-- 
2.25.1



RE: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

2023-09-12 Thread Cedric Paillet
Hi,

Sorry, I now understand my problem better.

When a server is in maintenance, the health checks are not performed and 
sv->check.state == CHK_ST_PAUSED.
This is checked in the haproxy_server_check_status metric but not in 
backend_agg_check_status.
I will check my new patch and push it afterwards.

Thank you very much.


-----Original Message-----
From: Christopher Faulet
Sent: Monday, September 11, 2023 20:15
To: Cedric Paillet; haproxy@formilux.org
Subject: Re: [PATCH] BUG/MINOR: promex: don't count servers in maintenance

On 07/09/2023 at 16:50, Cedric Paillet wrote:
> 
>> And I guess we should also check that the health checks are enabled for the
>> server. It is not really an issue, because the call to get_check_status_result()
>> will exclude neutral and unknown statuses. But there is no reason to count
>> these servers.
> 
> What we observed is that the health check of servers that have been put into
> maintenance is no longer updated, and the status returned is the last known
> one. I need to double-check, but I believe we even saw L7OK when putting an
> "UP" server into maintenance. (What I'm sure of is that the majority of
> servers in maintenance were in L7STS and not in UNK).
> 
> If we add the PROMEX_FL_NO_MAINT_SR (which makes sense), we will continue to
> display an incorrect backend_agg_server_status (and also
> haproxy_server_check_status) for servers in maintenance for those who don't
> set (no-maint=empty), and we probably then need another patch so that the
> status of these servers is HCHK_STATUS_UNKNOWN?
> 

Health checks for servers in maintenance are paused. So indeed, the last known
status does not change anymore in this state. My purpose here was to also
filter servers so as to count only those with health checks enabled and
running. When the server's metrics are dumped, the check status is already
skipped for servers in maintenance. Thus it seems logical not to count them
for the aggregated metric.

In fact, this way, all servers in maintenance are skipped without checking the
server's admin state. But it is probably cleaner to keep both checks for
consistency. Unless I missed something. This part is not really clear to me
anymore...

--
Christopher Faulet