This is an automated email from the ASF dual-hosted git repository.
traky pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/apisix.git
The following commit(s) were added to refs/heads/master by this push:
new 655ea62fe docs: update `ai-proxy` doc and `ai-proxy-multi` plugin doc
(#12094)
655ea62fe is described below
commit 655ea62febacd61b3055d7edeeb8ad8bf61193c2
Author: Traky Deng <[email protected]>
AuthorDate: Fri Mar 28 16:39:14 2025 +0800
docs: update `ai-proxy` doc and `ai-proxy-multi` plugin doc (#12094)
---
docs/en/latest/plugins/ai-proxy-multi.md | 962 +++++++++++++++++++++++++++----
docs/en/latest/plugins/ai-proxy.md | 450 ++++++++++++---
2 files changed, 1228 insertions(+), 184 deletions(-)
diff --git a/docs/en/latest/plugins/ai-proxy-multi.md
b/docs/en/latest/plugins/ai-proxy-multi.md
index a23eccb55..6418599d9 100644
--- a/docs/en/latest/plugins/ai-proxy-multi.md
+++ b/docs/en/latest/plugins/ai-proxy-multi.md
@@ -5,7 +5,9 @@ keywords:
- API Gateway
- Plugin
- ai-proxy-multi
-description: This document contains information about the Apache APISIX
ai-proxy-multi Plugin.
+ - AI
+ - LLM
+description: The ai-proxy-multi Plugin extends the capabilities of ai-proxy
with load balancing, retries, fallbacks, and health checks, simplifying the
integration with OpenAI, DeepSeek, and other OpenAI-compatible APIs.
---
<!--
@@ -27,215 +29,977 @@ description: This document contains information about the
Apache APISIX ai-proxy
#
-->
-## Description
+<head>
+ <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy-multi" />
+</head>
-The `ai-proxy-multi` plugin simplifies access to LLM providers and models by
defining a standard request format
-that allows key fields in plugin configuration to be embedded into the request.
+## Description
-This plugin adds additional features like `load balancing` and `retries` to
the existing `ai-proxy` plugin.
+The `ai-proxy-multi` Plugin simplifies access to LLM and embedding models by
transforming Plugin configurations into the designated request format for
OpenAI, DeepSeek, and other OpenAI-compatible APIs. It extends the capabilities
of [`ai-proxy`](./ai-proxy.md) with load balancing, retries, fallbacks,
and health checks.
-Proxying requests to OpenAI is supported now. Other LLM services will be
supported soon.
+In addition, the Plugin also supports logging LLM request information in the
access log, such as token usage, model, time to the first response, and more.
## Request Format
-### OpenAI
-
-- Chat API
-
| Name | Type | Required | Description
|
| ------------------ | ------ | -------- |
--------------------------------------------------- |
-| `messages` | Array | Yes | An array of message objects
|
-| `messages.role` | String | Yes | Role of the message (`system`,
`user`, `assistant`) |
-| `messages.content` | String | Yes | Content of the message
|
-
-## Plugin Attributes
-
-| **Name** | **Required** | **Type** | **Description**
| **Default** |
-| ---------------------------- | ------------ | -------- |
-------------------------------------------------------------------------------------------------------------
| ----------- |
-| providers | Yes | array | List of AI
providers, each following the provider schema.
| |
-| provider.name | Yes | string | Name of the AI
service provider. Allowed values: `openai`, `deepseek`.
| |
-| provider.model | Yes | string | Name of the AI
model to execute. Example: `gpt-4o`.
| |
-| provider.priority | No | integer | Priority of the
provider for load balancing.
| 0 |
-| provider.weight | No | integer | Load balancing
weight.
| |
-| balancer.algorithm | No | string | Load balancing
algorithm. Allowed values: `chash`, `roundrobin`.
| roundrobin |
-| balancer.hash_on | No | string | Defines what to
hash on for consistent hashing (`vars`, `header`, `cookie`, `consumer`,
`vars_combinations`). | vars |
-| balancer.key | No | string | Key for consistent
hashing in dynamic load balancing.
| |
-| provider.auth | Yes | object | Authentication
details, including headers and query parameters.
| |
-| provider.auth.header | No | object | Authentication
details sent via headers. Header name must match `^[a-zA-Z0-9._-]+$`.
| |
-| provider.auth.query | No | object | Authentication
details sent via query parameters. Keys must match `^[a-zA-Z0-9._-]+$`.
| |
-| provider.override.endpoint | No | string | Custom host
override for the AI provider.
| |
-| timeout | No | integer | Request timeout in
milliseconds (1-60000).
| 30000 |
-| keepalive | No | boolean | Enables keepalive
connections.
| true |
-| keepalive_timeout | No | integer | Timeout for
keepalive connections (minimum 1000ms).
| 60000 |
-| keepalive_pool | No | integer | Maximum keepalive
connections.
| 30 |
-| ssl_verify | No | boolean | Enables SSL
certificate verification.
| true |
-
-## Example usage
-
-Create a route with the `ai-proxy-multi` plugin like so:
+| `messages` | Array | True | An array of message objects.
|
+| `messages.role` | String | True | Role of the message (`system`,
`user`, `assistant`).|
+| `messages.content` | String | True | Content of the message.
|
+
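+For example, a minimal chat request body in this format looks like:
+
+```json
+{
+  "messages": [
+    { "role": "system", "content": "You are a mathematician" },
+    { "role": "user", "content": "What is 1+1?" }
+  ]
+}
+```
+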
+## Attributes
+
+| Name | Type | Required | Default
| Valid Values | Description |
+|------------------------------------|----------------|----------|-----------------------------------|--------------|-------------|
+| fallback_strategy | string | False |
instance_health_and_rate_limiting | instance_health_and_rate_limiting |
Fallback strategy. When set, the Plugin will check whether the specified
instance's token quota has been exhausted when a request is forwarded. If so, it
forwards the request to the next instance regardless of the instance priority.
When not set, the Plugin will not forward the request to low priority instances
when the token quota of the high priority instance is exhausted. |
+| balancer | object | False |
| | Load balancing configurations. |
+| balancer.algorithm | string | False | roundrobin
| [roundrobin, chash] | Load balancing algorithm. When set
to `roundrobin`, weighted round robin algorithm is used. When set to `chash`,
consistent hashing algorithm is used. |
+| balancer.hash_on | string | False |
| [vars, header, cookie, consumer, vars_combinations] |
Used when `algorithm` is `chash`. Supports hashing on [NGINX
variables](https://nginx.org/en/docs/varindex.html), headers, cookie, consumer,
or a combination of [NGINX variables](https://nginx.org/en/docs/varindex.html).
|
+| balancer.key | string | False |
| | Used when `algorithm` is `chash`. When
`hash_on` is set to `header` or `cookie`, `key` is required. When `hash_on` is
set to `consumer`, `key` is not required as the consumer name will be used as
the key automatically. |
+| instances | array[object] | True |
| | LLM instance configurations. |
+| instances.name | string | True |
| | Name of the LLM service instance. |
+| instances.provider | string | True |
| [openai, deepseek, openai-compatible] | LLM service
provider. When set to `openai`, the Plugin will proxy the request to
`api.openai.com`. When set to `deepseek`, the Plugin will proxy the request to
`api.deepseek.com`. When set to `openai-compatible`, the Plugin will proxy the
request to the custom endpoint configured in `override`. |
+| instances.priority | integer | False | 0
| | Priority of the LLM instance in load
balancing. `priority` takes precedence over `weight`. |
+| instances.weight | integer | True | 0
| greater than or equal to 0 | Weight of the LLM instance in
load balancing. |
+| instances.auth | object | True |
| | Authentication configurations. |
+| instances.auth.header | object | False |
| | Authentication headers. At least one of
`header` or `query` must be configured. |
+| instances.auth.query | object | False |
| | Authentication query parameters. At
least one of `header` or `query` must be configured. |
+| instances.options | object | False |
| | Model configurations. In addition to
`model`, you can configure additional parameters and they will be forwarded to
the upstream LLM service in the request body. For instance, if you are working
with OpenAI or DeepSeek, you can configure additional parameters such as
`max_tokens`, `temperature`, `top_p`, and `stream`. See your LLM provider's API
documentation for more av [...]
+| instances.options.model | string | False |
| | Name of the LLM model, such as `gpt-4`
or `gpt-3.5`. See your LLM provider's API documentation for more available
models. |
+| logging | object | False |
| | Logging configurations. |
+| logging.summaries | boolean | False | false
| | If true, log the LLM model, request duration,
and request and response token usage. |
+| logging.payloads | boolean | False | false
| | If true, log request and response
payload. |
+| instances.override | object | False |
| | Override setting. |
+| instances.override.endpoint | string | False |
| | LLM provider endpoint to replace the
default endpoint with. If not configured, the Plugin uses the default OpenAI
endpoint `https://api.openai.com/v1/chat/completions`. |
+| checks | object | False |
| | Health check configurations. Note that
at the moment, OpenAI and DeepSeek do not provide an official health check
endpoint. Other LLM services that you can configure under `openai-compatible`
provider may have available health check endpoints. |
+| checks.active | object | True |
| | Active health check configurations. |
+| checks.active.type | string | False | http
| [http, https, tcp] | Type of health check connection. |
+| checks.active.timeout | number | False | 1
| | Health check timeout in seconds. |
+| checks.active.concurrency | integer | False | 10
| | Number of upstream nodes to be checked at
the same time. |
+| checks.active.host | string | False |
| | HTTP host. |
+| checks.active.port | integer | False |
| between 1 and 65535 inclusive | HTTP port. |
+| checks.active.http_path | string | False | /
| | Path for HTTP probing requests. |
+| checks.active.https_verify_certificate | boolean | False | true
| | If true, verify the node's TLS certificate.
|
+| timeout | integer | False | 30000
| greater than or equal to 1 | Request timeout in
milliseconds when requesting the LLM service. |
+| keepalive | boolean | False | true
| | If true, keep the connection alive when
requesting the LLM service. |
+| keepalive_timeout | integer | False | 60000
| greater than or equal to 1000 | Keepalive timeout in
milliseconds when connecting to the LLM service. |
+| keepalive_pool | integer | False | 30
| | Keepalive pool size when connecting
to the LLM service. |
+| ssl_verify | boolean | False | true
| | If true, verify the LLM service's
certificate. |
+
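+The examples below all rely on the default weighted round robin balancer. As a
+sketch only, the following plugin configuration shows how the `balancer`
+attributes could be combined to consistently hash requests on a client header,
+so that requests from the same client are always routed to the same instance.
+The header name `X-Session-Id` and the API key placeholders are illustrative:
+
+```json
+{
+  "ai-proxy-multi": {
+    "balancer": {
+      "algorithm": "chash",
+      "hash_on": "header",
+      "key": "X-Session-Id"
+    },
+    "instances": [
+      {
+        "name": "openai-instance",
+        "provider": "openai",
+        "weight": 1,
+        "auth": {
+          "header": {
+            "Authorization": "Bearer <your-openai-api-key>"
+          }
+        },
+        "options": {
+          "model": "gpt-4"
+        }
+      },
+      {
+        "name": "deepseek-instance",
+        "provider": "deepseek",
+        "weight": 1,
+        "auth": {
+          "header": {
+            "Authorization": "Bearer <your-deepseek-api-key>"
+          }
+        },
+        "options": {
+          "model": "deepseek-chat"
+        }
+      }
+    ]
+  }
+}
+```
+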
+## Examples
+
+The examples below demonstrate how you can configure `ai-proxy-multi` for
different scenarios.
+
+:::note
+
+You can fetch the `admin_key` from `config.yaml` and save to an environment
variable with the following command:
+
+```bash
+admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed
's/"//g')
+```
+
+:::
+
+### Load Balance between Instances
+
+The following example demonstrates how you can configure two models for load
balancing, forwarding 80% of the traffic to one instance and 20% to the other.
+
+For demonstration and easier differentiation, you will be configuring one
OpenAI instance and one DeepSeek instance as the upstream LLM services.
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
- "providers": [
+ "instances": [
{
- "name": "openai",
- "model": "gpt-4",
- "weight": 1,
- "priority": 1,
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 8,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 2,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
]
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
- }
}
}'
```
-In the above configuration, requests will be equally balanced among the
`openai` and `deepseek` providers.
+Send 10 POST requests to the Route with a system prompt and a sample user
question in the request body, to see the number of requests forwarded to OpenAI
and DeepSeek:
-### Retry and fallback:
+```shell
+openai_count=0
+deepseek_count=0
+
+for i in {1..10}; do
+ model=$(curl -s "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }' | jq -r '.model')
+
+ if [[ "$model" == *"gpt-4"* ]]; then
+ ((openai_count++))
+ elif [[ "$model" == "deepseek-chat" ]]; then
+ ((deepseek_count++))
+ fi
+done
+
+echo "OpenAI responses: $openai_count"
+echo "DeepSeek responses: $deepseek_count"
+```
+
+You should see a response similar to the following:
+
+```text
+OpenAI responses: 8
+DeepSeek responses: 2
+```
+
+### Configure Instance Priority and Rate Limiting
+
+The following example demonstrates how you can configure two models with
different priorities and apply rate limiting on the instance with a higher
priority. In the case where `fallback_strategy` is set to
`instance_health_and_rate_limiting`, the Plugin should continue to forward
requests to the low priority instance once the high priority instance's rate
limiting quota is fully consumed.
-The `priority` attribute can be adjusted to implement the fallback and retry
feature.
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
"ai-proxy-multi": {
- "providers": [
+        "fallback_strategy": "instance_health_and_rate_limiting",
+ "instances": [
{
- "name": "openai",
- "model": "gpt-4",
- "weight": 1,
+ "name": "openai-instance",
+ "provider": "openai",
"priority": 1,
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
"priority": 0,
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
]
+ },
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "limit_strategy": "total_tokens"
}
+ }
+ }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
},
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of `10`, the next
request within the 60-second window is expected to be forwarded to the other
instance.
+
+Within the same 60-second window, send another POST request to the route:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newton law" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Certainly! Newton's laws of motion are three fundamental
principles that describe the relationship between the motion of an object and
the forces acting on it. They were formulated by Sir Isaac Newton in the late
17th century and are foundational to classical mechanics.\n\n---\n\n### **1.
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will
remain at rest, and an object in motion will continue moving at a constant
velocity (in a straight lin [...]
+ },
+ ...
+ }
+ ],
+ ...
+}
+```
+
+### Load Balance and Rate Limit by Consumers
+
+The following example demonstrates how you can configure two models for load
balancing and apply rate limiting by consumer.
+
+Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a
60-second window on the `openai-instance` instance:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "username": "johndoe",
+ "plugins": {
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "rejected_code": 429,
+ "limit_strategy": "total_tokens"
+ }
+ }
+ }'
+```
+
+Configure `key-auth` credential for `johndoe`:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT
\
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "cred-john-key-auth",
+ "plugins": {
+ "key-auth": {
+ "key": "john-key"
}
}
}'
```
-In the above configuration `priority` for the deepseek provider is set to `0`.
Which means if `openai` provider is unavailable then `ai-proxy-multi` plugin
will retry sending request to `deepseek` in the second attempt.
+Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a
60-second window on the `deepseek-instance` instance:
-### Send request to an OpenAI compatible LLM
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+    "username": "janedoe",
+ "plugins": {
+ "ai-rate-limiting": {
+ "instances": [
+ {
+ "name": "deepseek-instance",
+ "limit": 10,
+ "time_window": 60
+ }
+ ],
+ "rejected_code": 429,
+ "limit_strategy": "total_tokens"
+ }
+ }
+ }'
+```
-Create a route with the `ai-proxy-multi` plugin with `provider.name` set to
`openai-compatible` and the endpoint of the model set to
`provider.override.endpoint` like so:
+Configure `key-auth` credential for `janedoe`:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT
\
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "cred-jane-key-auth",
+ "plugins": {
+ "key-auth": {
+ "key": "jane-key"
+ }
+ }
+ }'
+```
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
```shell
curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
"id": "ai-proxy-multi-route",
"uri": "/anything",
"methods": ["POST"],
"plugins": {
+ "key-auth": {},
"ai-proxy-multi": {
- "providers": [
+        "fallback_strategy": "instance_health_and_rate_limiting",
+ "instances": [
{
- "name": "openai-compatible",
- "model": "qwen-plus",
- "weight": 1,
- "priority": 1,
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
- "override": {
- "endpoint":
"https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+ "options": {
+ "model": "gpt-4"
}
},
{
- "name": "deepseek",
- "model": "deepseek-chat",
- "weight": 1,
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 0,
"auth": {
"header": {
"Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
}
},
"options": {
- "max_tokens": 512,
- "temperature": 1.0
+ "model": "deepseek-chat"
}
}
- ],
- "passthrough": false
+ ]
+ }
+ }
+ }'
+```
+
+Send a POST request to the Route without any consumer key:
+
+```shell
+curl -i "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive an `HTTP/1.1 401 Unauthorized` response.
+
+Send a POST request to the Route with `johndoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: john-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of the `openai`
instance for `johndoe`, the next request within the 60-second window from
`johndoe` is expected to be forwarded to the `deepseek` instance.
+
+Within the same 60-second window, send another POST request to the Route with
`johndoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: john-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons laws to me" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Certainly! Newton's laws of motion are three fundamental
principles that describe the relationship between the motion of an object and
the forces acting on it. They were formulated by Sir Isaac Newton in the late
17th century and are foundational to classical mechanics.\n\n---\n\n### **1.
Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will
remain at rest, and an object in motion will continue moving at a constant
velocity (in a straight lin [...]
+ },
+ ...
+ }
+ ],
+ ...
+}
+```
+
+Send a POST request to the Route with `janedoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: jane-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "The sum of 1 and 1 is 2. This is a basic arithmetic
operation where you combine two units to get a total of two units."
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 14,
+ "completion_tokens": 31,
+ "total_tokens": 45,
+ "prompt_tokens_details": {
+ "cached_tokens": 0
+ },
+ "prompt_cache_hit_tokens": 0,
+ "prompt_cache_miss_tokens": 14
+ },
+ "system_fingerprint": "fp_3a5770e1b4_prod0225"
+}
+```
+
+Since the `total_tokens` value exceeds the configured quota of the `deepseek`
instance for `janedoe`, the next request within the 60-second window from
`janedoe` is expected to be forwarded to the `openai` instance.
+
+Within the same 60-second window, send another POST request to the Route with
`janedoe`'s key:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H 'apikey: jane-key' \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons laws to me" }
+ ]
+ }'
+```
+
+You should see a response similar to the following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Sure, here are Newton's three laws of motion:\n\n1)
Newton's First Law, also known as the Law of Inertia, states that an object at
rest will stay at rest, and an object in motion will stay in motion, unless
acted on by an external force. In simple words, this law suggests that an
object will keep doing whatever it is doing until something causes it to do
otherwise. \n\n2) Newton's Second Law states that the force acting on an object
is equal to the mass of that object [...]
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ ...
+}
+```
+
+This shows that `ai-proxy-multi` load balances the traffic while respecting the
per-consumer rate limiting rules configured in `ai-rate-limiting`.
+
+### Restrict Maximum Number of Completion Tokens
+
+The following example demonstrates how you can restrict the number of
`completion_tokens` used when generating the chat completion.
+
+For demonstration and easier differentiation, you will be configuring one
OpenAI instance and one DeepSeek instance as the upstream LLM services.
+
+Create a Route as such and update with your LLM providers, models, API keys,
and endpoints if applicable:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "gpt-4",
+ "max_tokens": 50
+ }
+ },
+ {
+ "name": "deepseek-instance",
+ "provider": "deepseek",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "deepseek-chat",
+ "max_tokens": 100
+ }
+ }
+ ]
}
+ }
+ }'
+```
+
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "Explain Newtons law" }
+ ]
+ }'
+```
+
+If the request is proxied to OpenAI, you should see a response similar to the
following, where the content is truncated per the 50 `max_tokens` threshold:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Newton's Laws of Motion are three physical laws that form
the bedrock for classical mechanics. They describe the relationship between a
body and the forces acting upon it, and the body's motion in response to those
forces. \n\n1. Newton's First Law",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "length"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 20,
+ "completion_tokens": 50,
+ "total_tokens": 70,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
},
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "httpbin.org": 1
+ "completion_tokens_details": {
+ "reasoning_tokens": 0,
+ "audio_tokens": 0,
+ "accepted_prediction_tokens": 0,
+ "rejected_prediction_tokens": 0
+ }
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+If the request is proxied to DeepSeek, you should see a response similar to
the following, where the content is truncated per the 100 `max_tokens` threshold:
+
+```json
+{
+ ...,
+ "model": "deepseek-chat",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Newton's Laws of Motion are three fundamental principles
that form the foundation of classical mechanics. They describe the relationship
between a body and the forces acting upon it, and the body's motion in response
to those forces. Here's a brief explanation of each law:\n\n1. **Newton's First
Law (Law of Inertia):**\n - **Statement:** An object will remain at rest or
in uniform motion in a straight line unless acted upon by an external force.\n
- **Explanation: [...]
+ },
+ "logprobs": null,
+ "finish_reason": "length"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 10,
+ "completion_tokens": 100,
+ "total_tokens": 110,
+ "prompt_tokens_details": {
+ "cached_tokens": 0
+ },
+ "prompt_cache_hit_tokens": 0,
+ "prompt_cache_miss_tokens": 10
+ },
+ "system_fingerprint": "fp_3a5770e1b4_prod0225"
+}
+```
+
+### Proxy to Embedding Models
+
+The following example demonstrates how you can configure the `ai-proxy-multi`
Plugin to proxy requests and load balance between embedding models.
+
+Create a Route as such and update with your LLM providers, embedding models,
API keys, and endpoints:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+    "uri": "/embeddings",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "openai-instance",
+ "provider": "openai",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "text-embedding-3-small"
+ },
+ "override": {
+ "endpoint": "https://api.openai.com/v1/embeddings"
+ }
+ },
+ {
+ "name": "az-openai-instance",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$AZ_OPENAI_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "text-embedding-3-small"
+ },
+ "override": {
+ "endpoint":
"https://ai-plugin-developer.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15"
+ }
+ }
+ ]
}
}
}'
```
+
+Send a POST request to the Route with an input string:
+
+```shell
+curl "http://127.0.0.1:9080/embeddings" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": "hello world"
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "object": "list",
+ "data": [
+ {
+ "object": "embedding",
+ "index": 0,
+ "embedding": [
+ -0.0067144386,
+ -0.039197803,
+ 0.034177095,
+ 0.028763203,
+ -0.024785956,
+ -0.04201061,
+ ...
+ ],
+ }
+ ],
+ "model": "text-embedding-3-small",
+ "usage": {
+ "prompt_tokens": 2,
+ "total_tokens": 2
+ }
+}
+```
+
+### Enable Active Health Checks
+
+The following example demonstrates how you can configure the `ai-proxy-multi`
Plugin to proxy requests and load balance between models, and enable active
health check to improve service availability. You can enable health check on
one or multiple instances.
+
+Create a Route as such and update the LLM providers, models, API
keys, and health check related configurations:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-multi-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy-multi": {
+ "instances": [
+ {
+ "name": "llm-instance-1",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "'"$YOUR_LLM_MODEL"'"
+ }
+ },
+ {
+ "name": "llm-instance-2",
+ "provider": "openai-compatible",
+ "weight": 0,
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$YOUR_LLM_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "'"$YOUR_LLM_MODEL"'"
+ },
+ "checks": {
+ "active": {
+ "type": "https",
+ "host": "yourhost.com",
+ "http_path": "/your/probe/path",
+ "healthy": {
+ "interval": 2,
+ "successes": 1
+ },
+ "unhealthy": {
+ "interval": 1,
+ "http_failures": 3
+ }
+ }
+ }
+ }
+ ]
+ }
+ }
+ }'
+```
+
+For verification, the behaviours should be consistent with the verification in
[active health checks](../tutorials/health-check.md).
+
+### Include LLM Information in Access Log
+
+The following example demonstrates how you can log LLM request related
information in the gateway's access log to improve analytics and audit. The
following variables are available:
+
+* `request_type`: Type of request, where the value could be
`traditional_http`, `ai_chat`, or `ai_stream`.
+* `llm_time_to_first_token`: Duration from request sending to the first token
received from the LLM service, in milliseconds.
+* `llm_model`: LLM model.
+* `llm_prompt_tokens`: Number of tokens in the prompt.
+* `llm_completion_tokens`: Number of chat completion tokens in the response.
+
+:::note
+
+The usage in this example will become available in APISIX 3.13.0.
+
+:::
+
+Update the access log format in your configuration file to include additional
LLM related variables:
+
+```yaml title="conf/config.yaml"
+nginx_config:
+ http:
+ access_log_format: "$remote_addr - $remote_user [$time_local] $http_host
\"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\"
\"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time
\"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\"
\"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\"
\"$llm_prompt_tokens\" \"$llm_completion_tokens\""
+```
+
+Reload APISIX for configuration changes to take effect.
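+
+For example, if APISIX is installed and managed with the APISIX CLI, you can
+reload it with:
+
+```shell
+apisix reload
+```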
+
+Next, create a Route with the `ai-proxy-multi` Plugin and send a request. For
instance, if the request is forwarded to OpenAI and you receive the following
response:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null,
+ "annotations": []
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ ...
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+In the gateway's access log, you should see a log entry similar to the
following:
+
+```text
+192.168.215.1 - - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything
HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - - - "http://api.openai.com"
"5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "23" "8"
+```
+
+The access log entry shows the request type is `ai_chat`, time to first token
is `2858` milliseconds, LLM model is `gpt-4`, prompt token usage is `23`, and
completion token usage is `8`.
diff --git a/docs/en/latest/plugins/ai-proxy.md
b/docs/en/latest/plugins/ai-proxy.md
index fe6bb2f6b..239c6df5d 100644
--- a/docs/en/latest/plugins/ai-proxy.md
+++ b/docs/en/latest/plugins/ai-proxy.md
@@ -5,7 +5,9 @@ keywords:
- API Gateway
- Plugin
- ai-proxy
-description: This document contains information about the Apache APISIX
ai-proxy Plugin.
+ - AI
+ - LLM
+description: The ai-proxy Plugin simplifies access to LLM and embedding models
by converting Plugin configurations into the required request format
for OpenAI, DeepSeek, and other OpenAI-compatible APIs.
---
<!--
@@ -27,147 +29,425 @@ description: This document contains information about the
Apache APISIX ai-proxy
#
-->
+<head>
+ <link rel="canonical" href="https://docs.api7.ai/hub/ai-proxy" />
+</head>
+
## Description
-The `ai-proxy` plugin simplifies access to LLM providers and models by
defining a standard request format
-that allows key fields in plugin configuration to be embedded into the request.
+The `ai-proxy` Plugin simplifies access to LLM and embedding models by
transforming Plugin configurations into the designated request format. It
supports the integration with OpenAI, DeepSeek, and other OpenAI-compatible
APIs.
-Proxying requests to OpenAI is supported now. Other LLM services will be
supported soon.
+In addition, the Plugin also supports logging LLM request information in the
access log, such as token usage, model, time to the first response, and more.
## Request Format
-### OpenAI
-
-- Chat API
-
| Name | Type | Required | Description
|
| ------------------ | ------ | -------- |
--------------------------------------------------- |
-| `messages` | Array | Yes | An array of message objects
|
-| `messages.role` | String | Yes | Role of the message (`system`,
`user`, `assistant`) |
-| `messages.content` | String | Yes | Content of the message
|
-
-## Plugin Attributes
-
-| **Field** | **Required** | **Type** | **Description**
|
-| ------------------------- | ------------ | -------- |
------------------------------------------------------------------------------------
|
-| auth | Yes | Object | Authentication
configuration |
-| auth.header | No | Object | Authentication
headers. Key must match pattern `^[a-zA-Z0-9._-]+$`. |
-| auth.query | No | Object | Authentication query
parameters. Key must match pattern `^[a-zA-Z0-9._-]+$`. |
-| model.provider | Yes | String | Name of the AI service
provider (`openai`). |
-| model.name | Yes | String | Model name to execute.
|
-| model.options | No | Object | Key/value settings for
the model |
-| override.endpoint | No | String | Override the endpoint
of the AI provider |
-| timeout | No | Integer | Timeout in
milliseconds for requests to LLM. Range: 1 - 60000. Default: 30000 |
-| keepalive | No | Boolean | Enable keepalive for
requests to LLM. Default: true |
-| keepalive_timeout | No | Integer | Keepalive timeout in
milliseconds for requests to LLM. Minimum: 1000. Default: 60000 |
-| keepalive_pool | No | Integer | Keepalive pool size
for requests to LLM. Minimum: 1. Default: 30 |
-| ssl_verify | No | Boolean | SSL verification for
requests to LLM. Default: true |
-
-## Example usage
-
-Create a route with the `ai-proxy` plugin like so:
+| `messages` | Array | True | An array of message objects.
|
+| `messages.role` | String | True | Role of the message (`system`,
`user`, `assistant`).|
+| `messages.content` | String | True | Content of the message.
|
+
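+For example, a request body in this format carrying a short multi-turn
+conversation looks like:
+
+```json
+{
+  "messages": [
+    { "role": "system", "content": "You are a mathematician" },
+    { "role": "user", "content": "What is 1+1?" },
+    { "role": "assistant", "content": "1+1 equals 2." },
+    { "role": "user", "content": "What is 2+2?" }
+  ]
+}
+```
+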
+## Attributes
+
+| Name | Type | Required | Default | Valid values
| Description |
+|--------------------|--------|----------|---------|------------------------------------------|-------------|
+| provider | string | True | | [openai, deepseek,
openai-compatible] | LLM service provider. When set to `openai`, the Plugin
will proxy the request to `https://api.openai.com/v1/chat/completions`. When set
to `deepseek`, the Plugin will proxy the request to
`https://api.deepseek.com/chat/completions`. When set to `openai-compatible`,
the Plugin will proxy the request to the custom endpoint configured in
`override`. |
+| auth | object | True | |
| Authentication configurations. |
+| auth.header | object | False | |
| Authentication headers. At least one of `header` or `query`
must be configured. |
+| auth.query | object | False | |
| Authentication query parameters. At least one of `header` or
`query` must be configured. |
+| options | object | False | |
| Model configurations. In addition to `model`, you can configure
additional parameters and they will be forwarded to the upstream LLM service in
the request body. For instance, if you are working with OpenAI, you can
configure additional parameters such as `temperature`, `top_p`, and `stream`.
See your LLM provider's API documentation for more available options. |
+| options.model | string | False | |
| Name of the LLM model, such as `gpt-4` or `gpt-3.5`. Refer to
the LLM provider's API documentation for available models. |
+| override | object | False | |
| Override setting. |
+| override.endpoint | string | False | |
| Custom LLM provider endpoint, required when `provider` is
`openai-compatible`. |
+| logging | object | False | |
| Logging configurations. |
+| logging.summaries | boolean | False | false |
| If true, logs the LLM model, request duration, and request and response
token usage. |
+| logging.payloads | boolean | False | false |
| If true, logs request and response payload. |
+| timeout | integer | False | 30000 | ≥ 1
| Request timeout in milliseconds when requesting the LLM service.
|
+| keepalive | boolean | False | true |
| If true, keeps the connection alive when requesting the LLM
service. |
+| keepalive_timeout | integer | False | 60000 | ≥ 1000
| Keepalive timeout in milliseconds when connecting to the LLM
service. |
+| keepalive_pool | integer | False | 30 |
| Keepalive pool size for the LLM service connection. |
+| ssl_verify | boolean | False | true |
| If true, verifies the LLM service's certificate. |
+
+## Examples
+
+The examples below demonstrate how you can configure `ai-proxy` for different
scenarios.
+
+:::note
+
+You can fetch the `admin_key` from `config.yaml` and save to an environment
variable with the following command:
+
+```bash
+admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed
's/"//g')
+```
+
+:::
+
+### Proxy to OpenAI
+
+The following example demonstrates how you can configure the API key, model,
and other parameters in the `ai-proxy` Plugin and configure the Plugin on a
Route to proxy user prompts to OpenAI.
+
+Obtain the OpenAI [API key](https://openai.com/blog/openai-api) and save it to
an environment variable:
+
+```shell
+export OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
```shell
-curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
+ "id": "ai-proxy-route",
"uri": "/anything",
+ "methods": ["POST"],
"plugins": {
"ai-proxy": {
+ "provider": "openai",
"auth": {
"header": {
- "Authorization": "Bearer <some-token>"
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
}
},
- "model": {
- "provider": "openai",
- "name": "gpt-4",
- "options": {
- "max_tokens": 512,
- "temperature": 1.0
- }
+ "options":{
+ "model": "gpt-4"
}
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "somerandom.com:443": 1
- },
- "scheme": "https",
- "pass_host": "node"
}
}'
```
-Upstream node can be any arbitrary value because it won't be contacted.
-
-Now send a request:
+Send a POST request to the Route with a system prompt and a sample user
question in the request body:
```shell
-curl http://127.0.0.1:9080/anything -i -XPOST -H 'Content-Type:
application/json' -d '{
- "messages": [
- { "role": "system", "content": "You are a mathematician" },
- { "role": "user", "a": 1, "content": "What is 1+1?" }
- ]
- }'
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -H "Host: api.openai.com" \
+ -d '{
+ "messages": [
+ { "role": "system", "content": "You are a mathematician" },
+ { "role": "user", "content": "What is 1+1?" }
+ ]
+ }'
```
-You will receive a response like this:
+You should receive a response similar to the following:
```json
{
+ ...,
+ "model": "gpt-4-0613",
"choices": [
{
- "finish_reason": "stop",
"index": 0,
"message": {
- "content": "The sum of \\(1 + 1\\) is \\(2\\).",
- "role": "assistant"
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ ...
+}
+```
+
+### Proxy to DeepSeek
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to DeepSeek.
+
+Obtain the DeepSeek API key and save it to an environment variable:
+
+```shell
+export DEEPSEEK_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-route",
+ "uri": "/anything",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy": {
+ "provider": "deepseek",
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'"
+ }
+ },
+ "options": {
+ "model": "deepseek-chat"
+ }
+ }
+ }
+ }'
+```
+
+Send a POST request to the Route with a sample question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are an AI assistant that helps people find
information."
+ },
+ {
+ "role": "user",
+ "content": "Write me a 50-word introduction for Apache APISIX."
}
+ ]
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ ...
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "Apache APISIX is a dynamic, real-time, high-performance
API gateway and cloud-native platform. It provides rich traffic management
features like load balancing, dynamic upstream, canary release, circuit
breaking, authentication, observability, and more. Designed for microservices
and serverless architectures, APISIX ensures scalability, security, and
seamless integration with modern DevOps workflows."
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
}
],
- "created": 1723777034,
- "id": "chatcmpl-9whRKFodKl5sGhOgHIjWltdeB8sr7",
- "model": "gpt-4o-2024-05-13",
- "object": "chat.completion",
- "system_fingerprint": "fp_abc28019ad",
- "usage": { "completion_tokens": 15, "prompt_tokens": 23, "total_tokens": 38 }
+ ...
}
```
-### Send request to an OpenAI compatible LLM
+### Proxy to Azure OpenAI
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to other LLM services, such as Azure OpenAI.
+
+Obtain the Azure OpenAI API key and save it to an environment variable:
-Create a route with the `ai-proxy` plugin with `provider` set to
`openai-compatible` and the endpoint of the model set to `override.endpoint`
like so:
+```shell
+export AZ_OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
```shell
-curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
- -H "X-API-KEY: ${ADMIN_API_KEY}" \
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
-d '{
+ "id": "ai-proxy-route",
"uri": "/anything",
+ "methods": ["POST"],
"plugins": {
"ai-proxy": {
+ "provider": "openai-compatible",
"auth": {
"header": {
- "Authorization": "Bearer <some-token>"
+ "api-key": "'"$AZ_OPENAI_API_KEY"'"
}
},
- "model": {
- "provider": "openai-compatible",
- "name": "qwen-plus"
+ "options":{
+ "model": "gpt-4"
},
"override": {
- "endpoint":
"https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+ "endpoint":
"https://api7-auzre-openai.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-15-preview"
}
}
- },
- "upstream": {
- "type": "roundrobin",
- "nodes": {
- "somerandom.com:443": 1
+ }
+ }'
+```
+
+Send a POST request to the Route with a sample question in the request body:
+
+```shell
+curl "http://127.0.0.1:9080/anything" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are an AI assistant that helps people find
information."
},
- "scheme": "https",
- "pass_host": "node"
+ {
+ "role": "user",
+ "content": "Write me a 50-word introduction for Apache APISIX."
+ }
+ ],
+ "max_tokens": 800,
+ "temperature": 0.7,
+ "frequency_penalty": 0,
+ "presence_penalty": 0,
+ "top_p": 0.95,
+ "stop": null
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "choices": [
+ {
+ ...,
+ "message": {
+ "content": "Apache APISIX is a modern, cloud-native API gateway built
to handle high-performance and low-latency use cases. It offers a wide range of
features, including load balancing, rate limiting, authentication, and dynamic
routing, making it an ideal choice for microservices and cloud-native
architectures.",
+ "role": "assistant"
+ }
+ }
+ ],
+ ...
+}
+```
+
+### Proxy to Embedding Models
+
+The following example demonstrates how you can configure the `ai-proxy` Plugin
to proxy requests to embedding models. This example will use the OpenAI
embedding model endpoint.
+
+Obtain the OpenAI [API key](https://openai.com/blog/openai-api) and save it to
an environment variable:
+
+```shell
+export OPENAI_API_KEY=<your-api-key>
+```
+
+Create a Route and configure the `ai-proxy` Plugin as such:
+
+```shell
+curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \
+ -H "X-API-KEY: ${admin_key}" \
+ -d '{
+ "id": "ai-proxy-route",
+ "uri": "/embeddings",
+ "methods": ["POST"],
+ "plugins": {
+ "ai-proxy": {
+ "provider": "openai",
+ "auth": {
+ "header": {
+ "Authorization": "Bearer '"$OPENAI_API_KEY"'"
+ }
+ },
+ "options":{
+ "model": "text-embedding-3-small",
+ "encoding_format": "float"
+ },
+ "override": {
+ "endpoint": "https://api.openai.com/v1/embeddings"
+ }
+ }
}
}'
```
+
+Send a POST request to the Route with an input string:
+
+```shell
+curl "http://127.0.0.1:9080/embeddings" -X POST \
+ -H "Content-Type: application/json" \
+ -d '{
+ "input": "hello world"
+ }'
+```
+
+You should receive a response similar to the following:
+
+```json
+{
+ "object": "list",
+ "data": [
+ {
+ "object": "embedding",
+ "index": 0,
+ "embedding": [
+ -0.0067144386,
+ -0.039197803,
+ 0.034177095,
+ 0.028763203,
+ -0.024785956,
+ -0.04201061,
+ ...
+ ],
+ }
+ ],
+ "model": "text-embedding-3-small",
+ "usage": {
+ "prompt_tokens": 2,
+ "total_tokens": 2
+ }
+}
+```
+
+### Include LLM Information in Access Log
+
+The following example demonstrates how you can log LLM request related
information in the gateway's access log to improve analytics and audit. The
following variables are available:
+
+* `request_type`: Type of request, where the value could be
`traditional_http`, `ai_chat`, or `ai_stream`.
+* `llm_time_to_first_token`: Duration from request sending to the first token
received from the LLM service, in milliseconds.
+* `llm_model`: LLM model.
+* `llm_prompt_tokens`: Number of tokens in the prompt.
+* `llm_completion_tokens`: Number of chat completion tokens in the response.
+
+:::note
+
+The usage will become available in APISIX 3.13.0.
+
+:::
+
+Update the access log format in your configuration file to include additional
LLM related variables:
+
+```yaml title="conf/config.yaml"
+nginx_config:
+ http:
+ access_log_format: "$remote_addr $remote_user [$time_local] $http_host
\"$request_line\" $status $body_bytes_sent $request_time \"$http_referer\"
\"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time
\"$upstream_scheme://$upstream_host$upstream_uri\" \"$apisix_request_id\"
\"$request_type\" \"$llm_time_to_first_token\" \"$llm_model\"
\"$llm_prompt_tokens\" \"$llm_completion_tokens\""
+```
+
+Reload APISIX for configuration changes to take effect.
+
+Now if you create a Route and send a request following the [Proxy to OpenAI
example](#proxy-to-openai), you should receive a response similar to the
following:
+
+```json
+{
+ ...,
+ "model": "gpt-4-0613",
+ "choices": [
+ {
+ "index": 0,
+ "message": {
+ "role": "assistant",
+ "content": "1+1 equals 2.",
+ "refusal": null,
+ "annotations": []
+ },
+ "logprobs": null,
+ "finish_reason": "stop"
+ }
+ ],
+ "usage": {
+ "prompt_tokens": 23,
+ "completion_tokens": 8,
+ "total_tokens": 31,
+ "prompt_tokens_details": {
+ "cached_tokens": 0,
+ "audio_tokens": 0
+ },
+ ...
+ },
+ "service_tier": "default",
+ "system_fingerprint": null
+}
+```
+
+In the gateway's access log, you should see a log entry similar to the
following:
+
+```text
+192.168.215.1 - [21/Mar/2025:04:28:03 +0000] api.openai.com "POST /anything
HTTP/1.1" 200 804 2.858 "-" "curl/8.6.0" - "http://api.openai.com"
"5c5e0b95f8d303cb81e4dc456a4b12d9" "ai_chat" "2858" "gpt-4" "23" "8"
+```
+
+The access log entry shows the request type is `ai_chat`, time to first token
is `2858` milliseconds, LLM model is `gpt-4`, prompt token usage is `23`, and
completion token usage is `8`.