nic-6443 commented on code in PR #13487:
URL: https://github.com/apache/apisix/pull/13487#discussion_r3378681434


##########
apisix/plugins/prometheus/exporter.lua:
##########
@@ -366,16 +416,36 @@ function _M.http_log(conf, ctx)
                 gen_arr(route_id, service_id, consumer_name, balancer_ip,
                     vars.request_type, vars.request_llm_model, vars.llm_model,
                     unpack(extra_labels("llm_latency", ctx))))
+
+            -- Only streaming requests expose a real TTFT; for non-streaming the
+            -- var holds the total response time, which would pollute the TTFT
+            -- distribution, so record llm_ttft for ai_stream only.
+            if vars.request_type == "ai_stream" then
+                metrics.llm_ttft:observe(tonumber(llm_time_to_first_token),
+                    gen_arr(route_id, service_id, consumer_name, balancer_ip,
+                        vars.request_type, vars.request_llm_model, vars.llm_model,
+                        unpack(extra_labels("llm_ttft", ctx))))
+            end

Review Comment:
   > `apisix_llm_ttft` — LLM time to first token (milliseconds), observed for 
streaming (ai_stream) requests only. The existing apisix_llm_latency mixes 
streaming TTFT and non-streaming total latency in one series; this dedicated 
metric keeps the TTFT distribution semantically consistent so it can be used 
for streaming latency SLOs.
   
   Then we should also adjust the existing `apisix_llm_latency`: for streaming 
requests, set it to the time at which the entire response completes. Otherwise, 
`apisix_llm_ttft` and `apisix_llm_latency` will overlap in function.
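
   A minimal sketch of that split, written as standalone Lua rather than against the exporter itself. The helper `pick_observations`, its `total_ms` argument, and the `ai_chat` request type used in the example are assumptions for illustration; how the total response time would actually be measured is not shown in this PR.

   ```lua
   -- Hypothetical helper: decide which value each histogram should observe.
   -- `ttft_ms` is the time to first token, `total_ms` the time until the whole
   -- response has completed (assumed to be available from some other variable).
   local function pick_observations(request_type, ttft_ms, total_ms)
       local obs = {}

       -- apisix_llm_latency: always the time until the full response completed,
       -- for streaming and non-streaming requests alike.
       obs.llm_latency = tonumber(total_ms)

       -- apisix_llm_ttft: only streaming requests have a meaningful TTFT.
       if request_type == "ai_stream" then
           obs.llm_ttft = tonumber(ttft_ms)
       end

       return obs
   end

   -- Example inputs (made-up numbers, in milliseconds).
   local stream = pick_observations("ai_stream", 120, 950)
   local chat = pick_observations("ai_chat", nil, 400)

   print(stream.llm_ttft, stream.llm_latency)
   print(chat.llm_ttft, chat.llm_latency)
   ```

   With this shape the two histograms never record the same quantity: `llm_latency` is end-to-end for every request, and `llm_ttft` is only populated for `ai_stream`.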


