This is an automated email from the ASF dual-hosted git repository.
traky pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/apisix.git
The following commit(s) were added to refs/heads/master by this push:
new 655ea62fe docs: update `ai-proxy` doc and `ai-proxy-multi` plugin doc
(#12094)
655ea62fe is described below
commit 655ea62febacd61b3055d7edeeb8ad8bf61193c2
Author: Traky Deng <[email protected]>
AuthorDate: Fri Mar 28 16:39:14 2025 +0800
docs: update `ai-proxy` doc and `ai-proxy-multi` plugin doc (#12094)
---
docs/en/latest/plugins/ai-proxy-multi.md | 962 +++++++++++++++++++++++++++----
docs/en/latest/plugins/ai-proxy.md | 450 ++++++++++++---
2 files changed, 1228 insertions(+), 184 deletions(-)
diff --git a/docs/en/latest/plugins/ai-proxy-multi.md
b/docs/en/latest/plugins/ai-proxy-multi.md
index a23eccb55..6418599d9 100644
--- a/docs/en/latest/plugins/ai-proxy-multi.md
+++ b/docs/en/latest/plugins/ai-proxy-multi.md
@@ -5,7 +5,9 @@ keywords:
- API Gateway
- Plugin
- ai-proxy-multi
-description: This document contains information about the Apache APISIX
ai-proxy-multi Plugin.
+ - AI
+ - LLM
+description: The ai-proxy-multi Plugin extends the capabilities of ai-proxy
with load balancing, retries, fallbacks, and health checks, simplifying the
integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
---
<!--
@@ -27,215 +29,977 @@ description: This document contains information about the
Apache APISIX ai-proxy
#
-->
-## Description
+<head>
+ <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" />
+</head>
-The `ai-proxy-multi` plugin simplifies access to LLM providers and models by
defining a standard request format
-that allows key fields in plugin configuration to be embedded into the request.
+## Description
-This plugin adds additional features like `load balancing` and `retries` to
the existing `ai-proxy` plugin.
+The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by
transforming Plugin configurations into the designated request format for
OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities
of [`ai-proxy`](./ai-proxy.md) with load balancing, retries, fallbacks,
and health checks.
-Proxying requests to OpenAI is supported now. Other LLM services will be
supported soon.
+In addition, the Plugin also supports logging LLM request information in the
access log, such as token usage, model, time to the first response, and more.
## Request Format
-### OpenAI
-
-- Chat API
-
| Name | Type | Required | Description
|
| ------------------ | ------ | -------- |
--------------------------------------------------- |
-| `messages` | Array | Yes | An array of message objects
|
-| `messages.role` | String | Yes | Role of the message (`system`,
`user`, `assistant`) |
-| `messages.content` | String | Yes | Content of the message
|
-
-## Plugin Attributes
-
-| **Name** | **Required** | **Type** | **Description**
| **Default** |
-| ---------------------------- | ------------ | -------- |
-------------------------------------------------------------------------------------------------------------
| ----------- |
-| providers | Yes | array | List of AI
providers, each following the provider schema.
| |
-| provider.name | Yes | string | Name of the AI
service provider. Allowed values: `openai`, `deepseek`.
| |
-| provider.model | Yes | string | Name of the AI
model to execute. Example: `gpt-4o`.
| |
-| provider.priority | No | integer | Priority of the
provider for load balancing.
| 0 |
-| provider.weight | No | integer | Load balancing
weight.
| |
-| balancer.algorithm | No | string | Load balancing
algorithm. Allowed values: `chash`, `roundrobin`.
| roundrobin |
-| balancer.hash_on | No | string | Defines what to
hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`,
`vars_combinations`). | vars |
-| balancer.key | No | string | Key for consistent
hashing in dynamic load balancing.
| |
-| provider.auth | Yes | object | Authentication
details, including headers and query parameters.
| |
-| provider.auth.header | No | object | Authentication
details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`.
| |
-| provider.auth.query | No | object | Authentication
details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`.
| |
-| provider.override.endpoint | No | string | Custom host
override for the AI provider.
| |
-| timeout | No | integer | Request timeout in
milliseconds (1-60000).
| 30000 |
-| keepalive | No | boolean | Enables keepalive
connections.
| true |
-| keepalive_timeout | No | integer | Timeout for
keepalive connections (minimum 1000ms).
| 60000 |
-| keepalive_pool | No | integer | Maximum keepalive
connections.
| 30 |
-| ssl_verify | No | boolean | Enables SSL
certificate verification.
| true |
-
-## Example usage
-
-Create a route with the `ai-proxy-multi` plugin like so:
+| `messages` | Array | True | An array of message objects.
|
+| `messages.role` | String | True | Role of the message (`system`,
`user`, `assistant`).|
+| `messages.content` | String | True | Content of the message.
|
+
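+For example, a minimal chat request body in this format looks like:
+
+```json
+{
+  "messages": [
+    { "role": "system", "content": "You are a mathematician" },
+    { "role": "user", "content": "What is 1+1?" }
+  ]
+}
+```
+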
+## Attributes
+
+| Name | Type | Required | Default
| Valid Values | Description |
+|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------|
+| fallback_strategy | string | False |
instance_health_and_rate_limiting | instance_health_and_rate_limiting |
Fallback strategy. When set, the Plugin will check whether the specified
instance's token quota has been exhausted when a request is forwarded. If so, it
forwards the request to the next instance regardless of the instance priority.
When not set, the Plugin will not forward the request to low priority instances
when the token quota of the high priority instance is exhausted. |
+| balancer | object | False |
| | Load balancing configurations. |
+| balancer.algorithm | string | False | roundrobin
| [roundrobin, chash] | Load balancing algorithm. When set
to `roundrobin`, weighted round robin algorithm is used. When set to `chash`,
consistent hashing algorithm is used. |
+| balancer.hash_on | string | False |
| [vars, header, cookie, consumer, vars_combinations] |
Used when `algorithm` is `chash`. Supports hashing on [NGINX
variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer,
or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html).
|
+| balancer.key | string | False |
| | Used when `algorithm` is `chash`. When
`hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is
set to `consumer`, `key` is not required as the consumer name will be used as
the key automatically. |
+| instances | array[object] | True |
| | LLM instance configurations. |
+| instances.name | string | True |
| | Name of the LLM service instance. |
+| instances.provider | string | True |
| [openai, deepseek, openai-compatible] | LLM service
provider. When set to `openai`, the Plugin will proxy the request to
`api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to
`api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the
request to the custom endpoint configured in `override`. |
+| instances.priority | integer | False | 0
| | Priority of the LLM instance in load
balancing. `priority` takes precedence over `weight`. |
+| instances.weight | integer | True | 0
| greater than or equal to 0 | Weight of the LLM instance in
load balancing. |
+| instances.auth | object | True |
| | Authentication configurations. |
+| instances.auth.header | object | False |
| | Authentication headers. At least one of
`header` or `query` must be configured. |
+| instances.auth.query | object | False |
| | Authentication query parameters. At
least one of `header` or `query` must be configured. |
+| instances.options | object | False |
| | Model configurations. In addition to
`model`, you can configure additional parameters and they will be forwarded to
the upstream LLM service in the request body. For instance, if you are working
with OpenAI or DeepSeek, you can configure additional parameters such as
`max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API
documentation for more av [...]
+| instances.options.model | string | False |
| | Name of the LLM model, such as `gpt-4`
or `gpt-3.5`. See your LLM provider's API documentation for more available
models. |
+| logging | object | False |
| | Logging configurations. |
+| logging.summaries | boolean | False | false
| | If true, log the LLM model, request duration,
and request and response token usage. |
+| logging.payloads | boolean | False | false
| | If true, log request and response
payload. |
+| instances.override | object | False |
| | Override setting. |
+| instances.override.endpoint | string | False |
| | LLM provider endpoint to replace the
default endpoint with. If not configured, the Plugin uses the default OpenAI
endpoint `https://api.openai.com/v1/chat/completions`. |
+| checks | object | False |
| | Health check configurations. Note that
at the moment, OpenAI and DeepSeek do not provide an official health check
endpoint. Other LLM services that you can configure under `openai-compatible`
provider may have available health check endpoints. |
+| checks.active | object | True |
| | Active health check configurations. |
+| checks.active.type | string | False | http
| [http, https, tcp] | Type of health check connection. |
+| checks.active.timeout | number | False | 1
| | Health check timeout in seconds. |
+| checks.active.concurrency | integer | False | 10
| | Number of upstream nodes to be checked at
the same time. |
+| checks.active.host | string | False |
| | HTTP host. |
+| checks.active.port | integer | False |
| between 1 and 65535 inclusive | HTTP port. |
+| checks.active.http_path | string | False | /
| | Path for HTTP probing requests. |
+| checks.active.https_verify_certificate | boolean | False | true
| | If true, verify the node's TLS certificate.
|
+| timeout | integer | False | 30000
| greater than or equal to 1 | Request timeout in
milliseconds when requesting the LLM service. |
+| keepalive | boolean | False | true
| | If true, keep the connection alive when
requesting the LLM service. |
+| keepalive_timeout | integer | False | 60000
| greater than or equal to 1000 | Keepalive timeout in
milliseconds when connecting to the LLM service. |
+| keepalive_pool | integer | False | 30
| | Keepalive pool size when connecting
to the LLM service. |
+| ssl_verify | boolean | False | true
| | If true, verify the LLM service's
certificate. |
+
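+The examples below all rely on the default weighted round robin balancer. As a
+sketch only, the following plugin configuration shows how the `balancer`
+attributes could be combined to consistently hash requests on a client header,
+so that requests from the same client are always routed to the same instance.
+The header name `X-Session-Id` and the API key placeholders are illustrative:
+
+```json
+{
+  "ai-proxy-multi": {
+    "balancer": {
+      "algorithm": "chash",
+      "hash_on": "header",
+      "key": "X-Session-Id"
+    },
+    "instances": [
+      {
+        "name": "openai-instance",
+        "provider": "openai",
+        "weight": 1,
+        "auth": {
+          "header": {
+            "Authorization": "Bearer <your-openai-api-key>"
+          }
+        },
+        "options": {
+          "model": "gpt-4"
+        }
+      },
+      {
+        "name": "deepseek-instance",
+        "provider": "deepseek",
+        "weight": 1,
+        "auth": {
+          "header": {
+            "Authorization": "Bearer <your-deepseek-api-key>"
+          }
+        },
+        "options": {
+          "model": "deepseek-chat"
+        }
+      }
+    ]
+  }
+}
+```
+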
+## Examples
+
+The examples below demonstrate how you can configure `ai-proxy-multi` for
different scenarios.
+
+:::note
+
+You can fetch the `admin_key` from `config.yaml` and save to an environment
variable with the following command:
+
+```bash
+admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed
's/"//g')
+```
+
+:::
+
+### Load Balance between Instances
+
+The following example demonstrates how you can configure two models for load
balancing, forwarding 80% of the traffic to one instance and 20% to the other.
+
+For demonstration and easier differentiation, you will be configuring one
OpenAI instance and one DeepSeek instance as the upstream LLM services.
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
- "providers": [
+ "instances": [
{
- "name": "openai",
- "model": "gpt-4",
- "weight": 1,
- "priority": 1,
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 8,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 2,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
]
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
- }
}
}'
```
-In the above configuration, requests will be equally balanced among the
`openai` and `deepseek` providers.
+Send 10 POST requests to the Route with a system prompt and a sample user
question in the request body, to see the number of requests forwarded to OpenAI
and DeepSeek:
-### Retry and fallback:
+```shell
+openai_count=0
+deepseek_count=0
+
+for i in {1..10}; do
+ model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }' | jq -r '.model')
+
+ if [[ "$model" == *"gpt-4"* ]]; then
+ ((openai_count++))
+ elif [[ "$model" == "deepseek-chat" ]]; then
+ ((deepseek_count++))
+ fi
+done
+
+echo "OpenAI responses: $openai_count"
+echo "DeepSeek responses: $deepseek_count"
+```
+
+You should see a response similar to the following:
+
+```text
+OpenAI responses: 8
+DeepSeek responses: 2
+```
+
+### Configure Instance Priority and Rate Limiting
+
+The following example demonstrates how you can configure two models with
different priorities and apply rate limiting on the instance with a higher
priority. In the case where `fallback_strategy` is set to
`instance_health_and_rate_limiting`, the Plugin should continue to forward
requests to the low priority instance once the high priority instance's rate
limiting quota is fully consumed.
-The `priority` attribute can be adjusted to implement the fallback and retry
feature.
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
- "providers": [
+        "fallback_strategy": "instance_health_and_rate_limiting",
+ "instances": [
{
- "name": "openai",
- "model": "gpt-4",
- "weight": 1,
+ "name": "openai-instance",
+ "provider": "openai",
"priority": 1,
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
"priority": 0,
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
]
+ },
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "limit_strategy": "total_tokens"
}
+ }
+ }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
},
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of `10`, the next
request within the 60-second window is expected to be forwarded to the other
instance.
+
+Within the same 60-second window, send another POST request to the route:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newton law" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Certainly! Newton's laws of motion are three fundamental
principles that describe the relationship between the motion of an object and
the forces acting on it. They were formulated by Sir Isaac Newton in the late
17th century and are foundational to classical mechanics.\n\n---\n\n### **1.
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will
remain at rest, and an object in motion will continue moving at a constant
velocity (in a straight lin [...]
+ },
+ ...
+ }
+ ],
+ ...
+}
+```
+
+### Load Balance and Rate Limit by Consumers
+
+The following example demonstrates how you can configure two models for load
balancing and apply rate limiting by consumer.
+
+Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a
60-second window on the `openai-instance` instance:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "username": "johndoe",
+ "plugins": {
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "rejected_code": 429,
+ "limit_strategy": "total_tokens"
+ }
+ }
+ }'
+```
+
+Configure `key-auth` credential for `johndoe`:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT
\
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "cred-john-key-auth",
+ "plugins": {
+ "key-auth": {
+ "key": "john-key"
}
}
}'
```
-In the above configuration `priority` for the deepseek provider is set to `0`.
Which means if `openai` provider is unavailable then `ai-proxy-multi` plugin
will retry sending request to `deepseek` in the second attempt.
+Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a
60-second window on the `deepseek-instance` instance:
-### Send request to an OpenAI compatible LLM
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+    "username": "janedoe",
+ "plugins": {
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "deepseek-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "rejected_code": 429,
+ "limit_strategy": "total_tokens"
+ }
+ }
+ }'
+```
-Create a route with the `ai-proxy-multi` plugin with `provider.name` set to
`openai-compatible` and the endpoint of the model set to
`provider.override.endpoint` like so:
+Configure `key-auth` credential for `janedoe`:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT
\
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "cred-jane-key-auth",
+ "plugins": {
+ "key-auth": {
+ "key": "jane-key"
+ }
+ }
+ }'
+```
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
+ "key-auth": {},
"ai-proxy-multi": {
- "providers": [
+        "fallback_strategy": "instance_health_and_rate_limiting",
+ "instances": [
{
- "name": "openai-compatible",
- "model": "qwen-plus",
- "weight": 1,
- "priority": 1,
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
- "override": {
- "endpoint":
"https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+ "options": {
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
- ],
- "passthrough": false
+ ]
+ }
+ }
+ }'
+```
+
+Send a POST request to the Route without any consumer key:
+
+```shell
+curl -i "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive an `HTTP/1.1 401 Unauthorized` response.
+
+Send a POST request to the Route with `johndoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: john-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of the `openai`
instance for `johndoe`, the next request within the 60-second window from
`johndoe` is expected to be forwarded to the `deepseek` instance.
+
+Within the same 60-second window, send another POST request to the Route with
`johndoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: john-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons laws to me" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Certainly! Newton's laws of motion are three fundamental
principles that describe the relationship between the motion of an object and
the forces acting on it. They were formulated by Sir Isaac Newton in the late
17th century and are foundational to classical mechanics.\n\n---\n\n### **1.
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will
remain at rest, and an object in motion will continue moving at a constant
velocity (in a straight lin [...]
+ },
+ ...
+ }
+ ],
+ ...
+}
+```
+
+Send a POST request to the Route with `janedoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: jane-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "The sum of 1 and 1 is 2. This is a basic arithmetic
operation where you combine two units to get a total of two units."
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 14,
+ "completion_tokens": 31,
+ "total_tokens": 45,
+ "prompt_tokens_details": {
+ "cached_tokens": 0
+ },
+ "prompt_cache_hit_tokens": 0,
+ "prompt_cache_miss_tokens": 14
+ },
+ "system_fingerprint": "fp_3a5770e1b4_prod0225"
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of the `deepseek`
instance for `janedoe`, the next request within the 60-second window from
`janedoe` is expected to be forwarded to the `openai` instance.
+
+Within the same 60-second window, send another POST request to the Route with
`janedoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: jane-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons laws to me" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Sure, here are Newton's three laws of motion:\n\n1)
Newton's First Law, also known as the Law of Inertia, states that an object at
rest will stay at rest, and an object in motion will stay in motion, unless
acted on by an external force. In simple words, this law suggests that an
object will keep doing whatever it is doing until something causes it to do
otherwise. \n\n2) Newton's Second Law states that the force acting on an object
is equal to the mass of that object [...]
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ ...
+}
+```
+
+This shows that `ai-proxy-multi` load balances the traffic while respecting the
per-consumer rate limiting rules configured in `ai-rate-limiting`.
+
+### Restrict Maximum Number of Completion Tokens
+
+The following example demonstrates how you can restrict the number of
`completion_tokens` used when generating the chat completion.
+
+For demonstration and easier differentiation, you will be configuring one
OpenAI instance and one DeepSeek instance as the upstream LLM services.
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "gpt-4",
+ "max_tokens": 50
+ }
+ },
+ {
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "deepseek-chat",
+ "max_tokens": 100
+ }
+ }
+ ]
}
+ }
+ }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons law" }
+ ]
+ }'
+```
+
+If the request is proxied to OpenAI, you should see a response similar to the
following, where the content is truncated per the 50 `max_tokens` threshold:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Newton's Laws of Motion are three physical laws that form
the bedrock for classical mechanics. They describe the relationship between a
body and the forces acting upon it, and the body's motion in response to those
forces. \n\n1. Newton's First Law",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "length"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 20,
+ "completion_tokens": 50,
+ "total_tokens": 70,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
},
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+If the request is proxied to DeepSeek, you should see a response similar to
the following, where the content is truncated per the 100 `max_tokens` threshold:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Newton's Laws of Motion are three fundamental principles
that form the foundation of classical mechanics. They describe the relationship
between a body and the forces acting upon it, and the body's motion in response
to those forces. Here's a brief explanation of each law:\n\n1. **Newton's First
Law (Law of Inertia):**\n - **Statement:** An object will remain at rest or
in uniform motion in a straight line unless acted upon by an external force.\n
- **Explanation: [...]
+ },
+ "logprobs": null,
+ "finish_reason": "length"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 10,
+ "completion_tokens": 100,
+ "total_tokens": 110,
+ "prompt_tokens_details": {
+ "cached_tokens": 0
+ },
+ "prompt_cache_hit_tokens": 0,
+ "prompt_cache_miss_tokens": 10
+ },
+ "system_fingerprint": "fp_3a5770e1b4_prod0225"
+}
+```
+
+### Proxy to Embedding Models
+
+The following example demonstrates how you can configure the `ai-proxy-multi`
Plugin to proxy requests and load balance between embedding models.
+
+Create a Route as such and update with your LLM providers, embedding models,
API keys, and endpoints:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+    "uri": "/embeddings",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "text-embedding-3-small"
+ },
+ "override": {
+ "endpoint": "https://api.openai.com/v1/embeddings"
+ }
+ },
+ {
+ "name": "az-openai-instance",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$AZ_OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "text-embedding-3-small"
+ },
+ "override": {
+ "endpoint":
"https://ai-plugin-developer.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15"
+ }
+ }
+ ]
}
}
}'
```
+
+Send a POST request to the Route with an input string:
+
+```shell
+curl "http://127.0.0.1:9080/embeddings" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": "hello world"
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "object": "list",
+ "data": [
+ {
+ "object": "embedding",
+ "index": 0,
+ "embedding": [
+ -0.0067144386,
+ -0.039197803,
+ 0.034177095,
+ 0.028763203,
+ -0.024785956,
+ -0.04201061,
+ ...
+ ],
+ }
+ ],
+ "model": "text-embedding-3-small",
+ "usage": {
+ "prompt_tokens": 2,
+ "total_tokens": 2
+ }
+}
+```
+
+### Enable Active Health Checks
+
+The following example demonstrates how you can configure the `ai-proxy-multi`
Plugin to proxy requests and load balance between models, and enable active
health check to improve service availability. You can enable health check on
one or multiple instances.
+
+Create a Route as such and update the LLM providers, models, API
keys, and health check related configurations:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "llm-instance-1",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "'"$YOUR_LLM_MODEL"'"
+ }
+ },
+ {
+ "name": "llm-instance-2",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "'"$YOUR_LLM_MODEL"'"
+ },
+ "checks": {
+ "active": {
+ "type": "https",
+ "host": "yourhost.com",
+ "http_path": "/your/probe/path",
+ "healthy": {
+ "interval": 2,
+ "successes": 1
+ },
+ "unhealthy": {
+ "interval": 1,
+ "http_failures": 3
+ }
+ }
+ }
+ }
+ ]
+ }
+ }
+ }'
+```
+
+For verification, the behaviours should be consistent with the verification in
[active health checks](../tutorials/health-check.md).
+
+### Include LLM Information in Access Log
+
+The following example demonstrates how you can log LLM request related
information in the gateway's access log to improve analytics and audit. The
following variables are available:
+
+* `request_type`: Type of request, where the value could be
`traditional_http`, `ai_chat`, or `ai_stream`.
+* `llm_time_to_first_token`: Duration from request sending to the first token
received from the LLM service, in milliseconds.
+* `llm_model`: LLM model.
+* `llm_prompt_tokens`: Number of tokens in the prompt.
+* `llm_completion_tokens`: Number of chat completion tokens in the response.
+
+:::note
+
+The usage in this example will become available in APISIX 3.13.0.
+
+:::
+
+Update the access log format in your configuration file to include additional
LLM related variables:
+
+```yaml title="conf/config.yaml"
+nginx_config:
+ http:
+ access_log_format: "$remote_addr - $remote_user [$time_local] $http_host
\"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\"
\"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time
\"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\"
\"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\"
\"$llm_prompt_tokens\" \"$llm_completion_tokens\""
+```
+
+Reload APISIX for configuration changes to take effect.
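+
+For example, if APISIX is installed and managed with the APISIX CLI, you can
+reload it with:
+
+```shell
+apisix reload
+```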
+
+Next, create a Route with the `ai-proxy-multi` Plugin and send a request. For
instance, if the request is forwarded to OpenAI and you receive the following
response:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null,
+ "annotations": []
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ ...
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+In the gateway's access log, you should see a log entry similar to the
following:
+
+```text
+192.168.215.1 - - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything
HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - - - "http://api.openai.com"
"5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "23" "8"
+```
+
+The access log entry shows the request type is `ai_chat`, time to first token
is `2858` milliseconds, LLM model is `gpt-4`, prompt token usage is `23`, and
completion token usage is `8`.
diff --git a/docs/en/latest/plugins/ai-proxy.md
b/docs/en/latest/plugins/ai-proxy.md
index fe6bb2f6b..239c6df5d 100644
--- a/docs/en/latest/plugins/ai-proxy.md
+++ b/docs/en/latest/plugins/ai-proxy.md
@@ -5,7 +5,9 @@ keywords:
- API Gateway
- Plugin
- ai-proxy
-description: This document contains information about the Apache APISIX
ai-proxy Plugin.
+ - AI
+ - LLM
+description: The ai-proxy Plugin simplifies access to LLM and embedding models
by converting Plugin configurations into the required request format
for OpenAI, DeepSeek, and other OpenAI-compatible APIs.
---
<!--
@@ -27,147 +29,425 @@ description: This document contains information about the
Apache APISIX ai-proxy
#
-->
+<head>
+ <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy" />
+</head>
+
## Description
-The `ai-proxy` plugin simplifies access to LLM providers and models by
defining a standard request format
-that allows key fields in plugin configuration to be embedded into the request.
+The `ai-proxy` Plugin simplifies access to LLM and embedding models by
transforming Plugin configurations into the designated request format. It
supports the integration with OpenAI, DeepSeek, and other OpenAI-compatible
APIs.
-Proxying requests to OpenAI is supported now. Other LLM services will be
supported soon.
+In addition, the Plugin also supports logging LLM request information in the
access log, such as token usage, model, time to the first response, and more.
## Request Format
-### OpenAI
-
-- Chat API
-
| Name | Type | Required | Description
|
| ------------------ | ------ | -------- |
--------------------------------------------------- |
-| `messages` | Array | Yes | An array of message objects
|
-| `messages.role` | String | Yes | Role of the message (`system`,
`user`, `assistant`) |
-| `messages.content` | String | Yes | Content of the message
|
-
-## Plugin Attributes
-
-| **Field** | **Required** | **Type** | **Description**
|
-| ------------------------- | ------------ | -------- |
------------------------------------------------------------------------------------
|
-| auth | Yes | Object | Authentication
configuration |
-| auth.header | No | Object | Authentication
headers. Key must match pattern `^[a-zA-Z0-9._-]+$`. |
-| auth.query | No | Object | Authentication query
parameters. Key must match pattern `^[a-zA-Z0-9._-]+$`. |
-| model.provider | Yes | String | Name of the AI service
provider (`openai`). |
-| model.name | Yes | String | Model name to execute.
|
-| model.options | No | Object | Key/value settings for
the model |
-| override.endpoint | No | String | Override the endpoint
of the AI provider |
-| timeout | No | Integer | Timeout in
milliseconds for requests to LLM. Range: 1 - 60000. Default: 30000 |
-| keepalive | No | Boolean | Enable keepalive for
requests to LLM. Default: true |
-| keepalive_timeout | No | Integer | Keepalive timeout in
milliseconds for requests to LLM. Minimum: 1000. Default: 60000 |
-| keepalive_pool | No | Integer | Keepalive pool size
for requests to LLM. Minimum: 1. Default: 30 |
-| ssl_verify | No | Boolean | SSL verification for
requests to LLM. Default: true |
-
-## Example usage
-
-Create a route with the `ai-proxy` plugin like so:
+| `messages` | Array | True | An array of message objects.
|
+| `messages.role` | String | True | Role of the message (`system`,
`user`, `assistant`).|
+| `messages.content` | String | True | Content of the message.
|
+
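+For example, a request body in this format carrying a short multi-turn
+conversation looks like:
+
+```json
+{
+  "messages": [
+    { "role": "system", "content": "You are a mathematician" },
+    { "role": "user", "content": "What is 1+1?" },
+    { "role": "assistant", "content": "1+1 equals 2." },
+    { "role": "user", "content": "What is 2+2?" }
+  ]
+}
+```
+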
+## Attributes
+
+| Name | Type | Required | Default | Valid values
| Description |
+|--------------------|--------|----------|---------|------------------------------------------|-------------|
+| provider | string | True | | [openai, deepseek,
openai-compatible] | LLM service provider. When set to `openai`, the Plugin
will proxy the request to `https://api.openai.com/v1/chat/completions`. When set
to `deepseek`, the Plugin will proxy the request to
`https://api.deepseek.com/chat/completions`. When set to `openai-compatible`,
the Plugin will proxy the request to the custom endpoint configured in
`override`. |
+| auth | object | True | |
| Authentication configurations. |
+| auth.header | object | False | |
| Authentication headers. At least one of `header` or `query`
must be configured. |
+| auth.query | object | False | |
| Authentication query parameters. At least one of `header` or
`query` must be configured. |
+| options | object | False | |
| Model configurations. In addition to `model`, you can configure
additional parameters and they will be forwarded to the upstream LLM service in
the request body. For instance, if you are working with OpenAI, you can
configure additional parameters such as `temperature`, `top_p`, and `stream`.
See your LLM provider's API documentation for more available options. |
+| options.model | string | False | |
| Name of the LLM model, such as `gpt-4` or `gpt-3.5`. Refer to
the LLM provider's API documentation for available models. |
+| override | object | False | |
| Override setting. |
+| override.endpoint | string | False | |
| Custom LLM provider endpoint, required when `provider` is
`openai-compatible`. |
+| logging | object | False | |
| Logging configurations. |
+| logging.summaries | boolean | False | false |
| If true, logs the LLM model, request duration, and request and response
token usage. |
+| logging.payloads | boolean | False | false |
| If true, logs request and response payload. |
+| timeout | integer | False | 30000 | ≥ 1
| Request timeout in milliseconds when requesting the LLM service.
|
+| keepalive | boolean | False | true |
| If true, keeps the connection alive when requesting the LLM
service. |
+| keepalive_timeout | integer | False | 60000 | ≥ 1000
| Keepalive timeout in milliseconds when connecting to the LLM
service. |
+| keepalive_pool | integer | False | 30 |
| Keepalive pool size for the LLM service connection. |
+| ssl_verify | boolean | False | true |
| If true, verifies the LLM service's certificate. |
+
+## Examples
+
+The examples below demonstrate how you can configure `ai-proxy` for different
scenarios.
+
+:::note
+
+You can fetch the `admin_key` from `config.yaml` and save to an environment
variable with the following command:
+
+```bash
+admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed
's/"//g')
+```
+
+:::
+
+### Proxy to OpenAI
+
+The following example demonstrates how you can configure the API key, model,
and other parameters in the `ai-proxy` Plugin and configure the Plugin on a
Route to proxy user prompts to OpenAI.
+
+Obtain the OpenAI [API key](https://openai.com/blog/openai-api) and save it to
an environment variable:
+
+```shell
+export OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
```shell
-curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
+ "id": "ai-proxy-route",
"uri": "/anything",
+ "methods": ["POST"],
"plugins": {
"ai-proxy": {
+ "provider": "openai",
"auth": {
"header": {
- "Authorization": "Bearer <some-token>"
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
- "model": {
- "provider": "openai",
- "name": "gpt-4",
- "options": {
- "max_tokens": 512,
- "temperature": 1.0
- }
+ "options":{
+ "model": "gpt-4"
}
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "somerandom.com:443": 1
- },
- "scheme": "https",
- "pass_host": "node"
}
}'
```
-Upstream node can be any arbitrary value because it won't be contacted.
-
-Now send a request:
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
```shell
-curl http://127.0.0.1:9080/anything -i -XPOST -H 'Content-Type:
application/json' -d '{
- "messages": [
- { "role": "system", "content": "You are a mathematician" },
- { "role": "user", "a": 1, "content": "What is 1+1?" }
- ]
- }'
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H "Host: api.openai.com" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
```
-You will receive a response like this:
+You should receive a response similar to the following:
```json
{
+ ...,
+ "model": "gpt-4-0613",
"choices": [
{
- "finish_reason": "stop",
"index": 0,
"message": {
- "content": "The sum of \\(1 + 1\\) is \\(2\\).",
- "role": "assistant"
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ ...
+}
+```
+
+### Proxy to DeepSeek
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to DeepSeek.
+
+Obtain the DeepSeek API key and save it to an environment variable:
+
+```shell
+export DEEPSEEK_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy": {
+ "provider": "deepseek",
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "deepseek-chat"
+ }
+ }
+ }
+ }'
+```
+
+Send a POST request to the Route with a sample question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are an AI assistant that helps people find
information."
+ },
+ {
+ "role": "user",
+ "content": "Write me a 50-word introduction for Apache APISIX."
}
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Apache APISIX is a dynamic, real-time, high-performance
API gateway and cloud-native platform. It provides rich traffic management
features like load balancing, dynamic upstream, canary release, circuit
breaking, authentication, observability, and more. Designed for microservices
and serverless architectures, APISIX ensures scalability, security, and
seamless integration with modern DevOps workflows."
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
}
],
- "created": 1723777034,
- "id": "chatcmpl-9whRKFodKl5sGhOgHIjWltdeB8sr7",
- "model": "gpt-4o-2024-05-13",
- "object": "chat.completion",
- "system_fingerprint": "fp_abc28019ad",
- "usage": { "completion_tokens": 15, "prompt_tokens": 23, "total_tokens": 38 }
+ ...
}
```
-### Send request to an OpenAI compatible LLM
+### Proxy to Azure OpenAI
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to other LLM services, such as Azure OpenAI.
+
+Obtain the Azure OpenAI API key and save it to an environment variable:
-Create a route with the `ai-proxy` plugin with `provider` set to
`openai-compatible` and the endpoint of the model set to `override.endpoint`
like so:
+```shell
+export AZ_OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
```shell
-curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
+ "id": "ai-proxy-route",
"uri": "/anything",
+ "methods": ["POST"],
"plugins": {
"ai-proxy": {
+ "provider": "openai-compatible",
"auth": {
"header": {
- "Authorization": "Bearer <some-token>"
+ "api-key": "'"$AZ_OPENAI_API_KEY"'"
}
},
- "model": {
- "provider": "openai-compatible",
- "name": "qwen-plus"
+ "options":{
+ "model": "gpt-4"
},
"override": {
- "endpoint":
"https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+ "endpoint":
"https://api7-auzre-openai.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-15-preview"
}
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "somerandom.com:443": 1
+ }
+ }'
+```
+
+Send a POST request to the Route with a sample question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are an AI assistant that helps people find
information."
},
- "scheme": "https",
- "pass_host": "node"
+ {
+ "role": "user",
+ "content": "Write me a 50-word introduction for Apache APISIX."
+ }
+ ],
+ "max_tokens": 800,
+ "temperature": 0.7,
+ "frequency_penalty": 0,
+ "presence_penalty": 0,
+ "top_p": 0.95,
+ "stop": null
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "choices": [
+ {
+ ...,
+ "message": {
+ "content": "Apache APISIX is a modern, cloud-native API gateway built
to handle high-performance and low-latency use cases. It offers a wide range of
features, including load balancing, rate limiting, authentication, and dynamic
routing, making it an ideal choice for microservices and cloud-native
architectures.",
+ "role": "assistant"
+ }
+ }
+ ],
+ ...
+}
+```
+
+### Proxy to Embedding Models
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to embedding models. This example will use the OpenAI
embedding model endpoint.
+
+Obtain the OpenAI [API key](https://openai.com/blog/openai-api) and save it to
an environment variable:
+
+```shell
+export OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-route",
+ "uri": "/embeddings",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy": {
+ "provider": "openai",
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options":{
+ "model": "text-embedding-3-small",
+ "encoding_format": "float"
+ },
+ "override": {
+ "endpoint": "https://api.openai.com/v1/embeddings"
+ }
+ }
}
}'
```
+
+Send a POST request to the Route with an input string:
+
+```shell
+curl "http://127.0.0.1:9080/embeddings" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": "hello world"
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "object": "list",
+ "data": [
+ {
+ "object": "embedding",
+ "index": 0,
+ "embedding": [
+ -0.0067144386,
+ -0.039197803,
+ 0.034177095,
+ 0.028763203,
+ -0.024785956,
+ -0.04201061,
+ ...
+ ],
+ }
+ ],
+ "model": "text-embedding-3-small",
+ "usage": {
+ "prompt_tokens": 2,
+ "total_tokens": 2
+ }
+}
+```
+
+### Include LLM Information in Access Log
+
+The following example demonstrates how you can log LLM request related
information in the gateway's access log to improve analytics and audit. The
following variables are available:
+
+* `request_type`: Type of request, where the value could be
`traditional_http`, `ai_chat`, or `ai_stream`.
+* `llm_time_to_first_token`: Duration from request sending to the first token
received from the LLM service, in milliseconds.
+* `llm_model`: LLM model.
+* `llm_prompt_tokens`: Number of tokens in the prompt.
+* `llm_completion_tokens`: Number of chat completion tokens in the response.
+
+:::note
+
+The usage will become available in APISIX 3.13.0.
+
+:::
+
+Update the access log format in your configuration file to include additional
LLM related variables:
+
+```yaml title="conf/config.yaml"
+nginx_config:
+ http:
+ access_log_format: "$remote_addr $remote_user [$time_local] $http_host
\"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\"
\"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time
\"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\"
\"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\"
\"$llm_prompt_tokens\" \"$llm_completion_tokens\""
+```
+
+Reload APISIX for configuration changes to take effect.
+
+Now if you create a Route and send a request following the [Proxy to OpenAI
example](#proxy-to-openai), you should receive a response similar to the
following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null,
+ "annotations": []
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ ...
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+In the gateway's access log, you should see a log entry similar to the
following:
+
+```text
+192.168.215.1 - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything
HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - "http://api.openai.com"
"5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "23" "8"
+```
+
+The access log entry shows the request type is `ai_chat`, time to first token
is `2858` milliseconds, LLM model is `gpt-4`, prompt token usage is `23`, and
completion token usage is `8`.