This is an automated email from the ASF dual-hosted git repository.

traky pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/apisix.git
The following commit(s) were added to refs/heads/master by this push: new 901311039 docs: update `ai-rate-limiting` and `ai-rag` docs (#12107) 901311039 is described below commit 9013110391c40a4a9ee948046a262860dbe327b0 Author: Traky Deng <trakyd...@gmail.com> AuthorDate: Tue Apr 8 09:48:31 2025 +0800 docs: update `ai-rate-limiting` and `ai-rag` docs (#12107) --- docs/en/latest/plugins/ai-rag.md | 177 ++++--- docs/en/latest/plugins/ai-rate-limiting.md | 798 ++++++++++++++++++++++++++++- 2 files changed, 882 insertions(+), 93 deletions(-) diff --git a/docs/en/latest/plugins/ai-rag.md b/docs/en/latest/plugins/ai-rag.md index 813e5fff0..844b3078e 100644 --- a/docs/en/latest/plugins/ai-rag.md +++ b/docs/en/latest/plugins/ai-rag.md @@ -5,7 +5,9 @@ keywords: - API Gateway - Plugin - ai-rag -description: This document contains information about the Apache APISIX ai-rag Plugin. + - AI + - LLM +description: The ai-rag Plugin enhances LLM outputs with Retrieval-Augmented Generation (RAG), efficiently retrieving relevant documents to improve accuracy and contextual relevance in responses. --- <!-- @@ -27,34 +29,38 @@ description: This document contains information about the Apache APISIX ai-rag P # --> +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-rag" /> +</head> + ## Description -The `ai-rag` plugin integrates Retrieval-Augmented Generation (RAG) capabilities with AI models. -It allows efficient retrieval of relevant documents or information from external data sources and -augments the LLM responses with that data, improving the accuracy and context of generated outputs. +The `ai-rag` Plugin provides Retrieval-Augmented Generation (RAG) capabilities with LLMs. It facilitates the efficient retrieval of relevant documents or information from external data sources, which are used to enhance the LLM responses, thereby improving the accuracy and contextual relevance of the generated outputs. + +The Plugin supports using [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) and [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) services for generating embeddings and performing vector search. **_As of now only [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) and [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) services are supported for generating embeddings and performing vector search respectively. PRs for introducing support for other service providers are welcomed._** -## Plugin Attributes +## Attributes -| **Field** | **Required** | **Type** | **Description** | +| Name | Required | Type | Description | | ----------------------------------------------- | ------------ | -------- | ----------------------------------------------------------------------------------------------------------------------------------------- | -| embeddings_provider | Yes | object | Configurations of the embedding models provider | -| embeddings_provider.azure_openai | Yes | object | Configurations of [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) as the embedding models provider. 
| -| embeddings_provider.azure_openai.endpoint | Yes | string | Azure OpenAI endpoint | -| embeddings_provider.azure_openai.api_key | Yes | string | Azure OpenAI API key | -| vector_search_provider | Yes | object | Configuration for the vector search provider | -| vector_search_provider.azure_ai_search | Yes | object | Configuration for Azure AI Search | -| vector_search_provider.azure_ai_search.endpoint | Yes | string | Azure AI Search endpoint | -| vector_search_provider.azure_ai_search.api_key | Yes | string | Azure AI Search API key | +| embeddings_provider | True | object | Configurations of the embedding models provider. | +| embeddings_provider.azure_openai | True | object | Configurations of [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) as the embedding models provider. | +| embeddings_provider.azure_openai.endpoint | True | string | Azure OpenAI embedding model endpoint. | +| embeddings_provider.azure_openai.api_key | True | string | Azure OpenAI API key. | +| vector_search_provider | True | object | Configuration for the vector search provider. | +| vector_search_provider.azure_ai_search | True | object | Configuration for Azure AI Search. | +| vector_search_provider.azure_ai_search.endpoint | True | string | Azure AI Search endpoint. | +| vector_search_provider.azure_ai_search.api_key | True | string | Azure AI Search API key. | ## Request Body Format The following fields must be present in the request body. -| **Field** | **Type** | **Description** | +| Field | Type | Description | | -------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------- | -| ai_rag | object | Configuration for AI-RAG (Retrieval Augmented Generation) | +| ai_rag | object | Request body RAG specifications. | | ai_rag.embeddings | object | Request parameters required to generate embeddings. Contents will depend on the API specification of the configured provider. | | ai_rag.vector_search | object | Request parameters required to perform vector search. Contents will depend on the API specification of the configured provider. | @@ -62,12 +68,12 @@ The following fields must be present in the request body. - Azure OpenAI - | **Name** | **Required** | **Type** | **Description** | + | Name | Required | Type | Description | | --------------- | ------------ | -------- | -------------------------------------------------------------------------------------------------------------------------- | - | input | Yes | string | Input text used to compute embeddings, encoded as a string. | - | user | No | string | A unique identifier representing your end-user, which can help in monitoring and detecting abuse. | - | encoding_format | No | string | The format to return the embeddings in. Can be either `float` or `base64`. Defaults to `float`. | - | dimensions | No | integer | The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models. | + | input | True | string | Input text used to compute embeddings, encoded as a string. | + | user | False | string | A unique identifier representing your end-user, which can help in monitoring and detecting abuse. | + | encoding_format | False | string | The format to return the embeddings in. Can be either `float` or `base64`. Defaults to `float`. | + | dimensions | False | integer | The number of dimensions the resulting output embeddings should have. 
Only supported in text-embedding-3 and later models. | For other parameters please refer to the [Azure OpenAI embeddings documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#embeddings). @@ -75,9 +81,9 @@ For other parameters please refer to the [Azure OpenAI embeddings documentation] - Azure AI Search - | **Field** | **Required** | **Type** | **Description** | + | Field | Required | Type | Description | | --------- | ------------ | -------- | ---------------------------- | - | fields | Yes | String | Fields for the vector search | + | fields | True | String | Fields for the vector search. | For other parameters please refer the [Azure AI Search documentation](https://learn.microsoft.com/en-us/rest/api/searchservice/documents/search-post). @@ -95,106 +101,135 @@ Example request body: } ``` -## Example usage +## Example + +To follow along the example, create an [Azure account](https://portal.azure.com) and complete the following steps: -First initialise these shell variables: +* In [Azure AI Foundry](https://oai.azure.com/portal), deploy a generative chat model, such as `gpt-4o`, and an embedding model, such as `text-embedding-3-large`. Obtain the API key and model endpoints. +* Follow [Azure's example](https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/basic-vector-workflow/azure-search-vector-python-sample.ipynb) to prepare for a vector search in [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) using Python. The example will create a search index called `vectest` with the desired schema and upload the [sample data](https://github.com/Azure/azure-search-vector-samples/blob/main/data/text-sample.j [...] +* In [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search), [obtain the Azure vector search API key and the search service endpoint](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector?tabs=api-key#retrieve-resource-information). 
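
Optionally, before configuring APISIX, you can confirm that the `vectest` index is queryable. The snippet below is a minimal sketch that assumes your own Azure AI Search service endpoint and admin key; adjust the `api-version` if yours differs:

```shell
# Hypothetical endpoint and key shown for illustration only; replace with your own values.
curl "https://your-service.search.windows.net/indexes/vectest/docs/search?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: your-azure-ai-search-admin-key" \
  -d '{ "search": "*", "top": 1 }'
```

A successful response should return one of the uploaded sample documents in the `value` array, confirming the index is ready to be used for vector search.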
+ +Save the API keys and endpoints to environment variables: ```shell -ADMIN_API_KEY=edd1c9f034335f136f87ad84b625c8f1 -AZURE_OPENAI_ENDPOINT=https://name.openai.azure.com/openai/deployments/gpt-4o/chat/completions -VECTOR_SEARCH_ENDPOINT=https://name.search.windows.net/indexes/indexname/docs/search?api-version=2024-07-01 -EMBEDDINGS_ENDPOINT=https://name.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15 -EMBEDDINGS_KEY=secret-azure-openai-embeddings-key -SEARCH_KEY=secret-azureai-search-key -AZURE_OPENAI_KEY=secret-azure-openai-key +# replace with your values + +AZ_OPENAI_DOMAIN=https://ai-plugin-developer.openai.azure.com +AZ_OPENAI_API_KEY=9m7VYroxITMDEqKKEnpOknn1rV7QNQT7DrIBApcwMLYJQQJ99ALACYeBjFXJ3w3AAABACOGXGcd +AZ_CHAT_ENDPOINT=${AZ_OPENAI_DOMAIN}/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-15-preview +AZ_EMBEDDING_MODEL=text-embedding-3-large +AZ_EMBEDDINGS_ENDPOINT=${AZ_OPENAI_DOMAIN}/openai/deployments/${AZ_EMBEDDING_MODEL}/embeddings?api-version=2023-05-15 + +AZ_AI_SEARCH_SVC_DOMAIN=https://ai-plugin-developer.search.windows.net +AZ_AI_SEARCH_KEY=IFZBp3fKVdq7loEVe9LdwMvVdZrad9A4lPH90AzSeC06SlR +AZ_AI_SEARCH_INDEX=vectest +AZ_AI_SEARCH_ENDPOINT=${AZ_AI_SEARCH_SVC_DOMAIN}/indexes/${AZ_AI_SEARCH_INDEX}/docs/search?api-version=2024-07-01 +``` + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') ``` -Create a route with the `ai-rag` and `ai-proxy` plugin like so: +::: + +### Integrate with Azure for RAG-Enhaned Responses + +The following example demonstrates how you can use the [`ai-proxy`](./ai-proxy.md) Plugin to proxy requests to Azure OpenAI LLM and use the `ai-rag` Plugin to generate embeddings and perform vector search to enhance LLM responses. + +Create a Route as such: ```shell -curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \ +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ + "id": "ai-rag-route", "uri": "/rag", "plugins": { "ai-rag": { "embeddings_provider": { "azure_openai": { - "endpoint": "'"$EMBEDDINGS_ENDPOINT"'", - "api_key": "'"$EMBEDDINGS_KEY"'" + "endpoint": "'"$AZ_EMBEDDINGS_ENDPOINT"'", + "api_key": "'"$AZ_OPENAI_API_KEY"'" } }, "vector_search_provider": { "azure_ai_search": { - "endpoint": "'"$VECTOR_SEARCH_ENDPOINT"'", - "api_key": "'"$SEARCH_KEY"'" + "endpoint": "'"$AZ_AI_SEARCH_ENDPOINT"'", + "api_key": "'"$AZ_AI_SEARCH_KEY"'" } } }, "ai-proxy": { + "provider": "openai", "auth": { "header": { - "api-key": "'"$AZURE_OPENAI_KEY"'" - }, - "query": { - "api-version": "2023-03-15-preview" - } - }, - "model": { - "provider": "openai", - "name": "gpt-4", - "options": { - "max_tokens": 512, - "temperature": 1.0 + "api-key": "'"$AZ_OPENAI_API_KEY"'" } }, + "model": "gpt-4o", "override": { - "endpoint": "'"$AZURE_OPENAI_ENDPOINT"'" + "endpoint": "'"$AZ_CHAT_ENDPOINT"'" } } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "someupstream.com:443": 1 - }, - "scheme": "https", - "pass_host": "node" } }' ``` -The `ai-proxy` plugin is used here as it simplifies access to LLMs. Alternatively, you may configure the LLM service address in the upstream configuration and update the route URI as well. 
- -Now send a request: +Send a POST request to the Route with the vector fields name, embedding model dimensions, and an input prompt in the request body: ```shell -curl http://127.0.0.1:9080/rag -XPOST -H 'Content-Type: application/json' -d '{"ai_rag":{"vector_search":{"fields":"contentVector"},"embeddings":{"input":"which service is good for devops","dimensions":1024}}}' +curl "http://127.0.0.1:9080/rag" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "ai_rag":{ + "vector_search":{ + "fields":"contentVector" + }, + "embeddings":{ + "input":"Which Azure services are good for DevOps?", + "dimensions":1024 + } + } + }' ``` -You will receive a response like this: +You should receive an `HTTP/1.1 200 OK` response similar to the following: ```json { "choices": [ { + "content_filter_results": { + ... + }, "finish_reason": "length", "index": 0, + "logprobs": null, "message": { - "content": "Here are the details for some of the services you inquired about from your Azure search context:\n\n ... <rest of the response>", + "content": "Here is a list of Azure services categorized along with a brief description of each based on the provided JSON data:\n\n### Developer Tools\n- **Azure DevOps**: A suite of services that help you plan, build, and deploy applications, including Azure Boards, Azure Repos, Azure Pipelines, Azure Test Plans, and Azure Artifacts.\n- **Azure DevTest Labs**: A fully managed service to create, manage, and share development and test environments in Azure, supporting custom temp [...] "role": "assistant" } } ], - "created": 1727079764, - "id": "chatcmpl-AAYdA40YjOaeIHfgFBkaHkUFCWxfc", + "created": 1740625850, + "id": "chatcmpl-B54gQdumpfioMPIybFnirr6rq9ZZS", "model": "gpt-4o-2024-05-13", "object": "chat.completion", - "system_fingerprint": "fp_67802d9a6d", + "prompt_filter_results": [ + { + "prompt_index": 0, + "content_filter_results": { + ... + } + } + ], + "system_fingerprint": "fp_65792305e4", "usage": { - "completion_tokens": 512, - "prompt_tokens": 6560, - "total_tokens": 7072 + ... } } ``` diff --git a/docs/en/latest/plugins/ai-rate-limiting.md b/docs/en/latest/plugins/ai-rate-limiting.md index 8386b5cc0..84bad6161 100644 --- a/docs/en/latest/plugins/ai-rate-limiting.md +++ b/docs/en/latest/plugins/ai-rate-limiting.md @@ -5,7 +5,9 @@ keywords: - API Gateway - Plugin - ai-rate-limiting -description: The ai-rate-limiting plugin enforces token-based rate limiting for LLM service requests, preventing overuse, optimizing API consumption, and ensuring efficient resource allocation. + - AI + - LLM +description: The ai-rate-limiting Plugin enforces token-based rate limiting for LLM service requests, preventing overuse, optimizing API consumption, and ensuring efficient resource allocation. --- <!-- @@ -27,34 +29,52 @@ description: The ai-rate-limiting plugin enforces token-based rate limiting for # --> +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-rate-limiting" /> +</head> + ## Description -The `ai-rate-limiting` plugin enforces token-based rate limiting for requests sent to LLM services. It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used with `ai-proxy` or `ai-proxy-multi` plugin. +The `ai-rate-limiting` Plugin enforces token-based rate limiting for requests sent to LLM services. 
It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used with [`ai-proxy`](./ai-proxy.md) or [`ai-proxy-multi`](./ai-proxy-multi.md) plugin. + +## Attributes + +| Name | Type | Required | Default | Valid values | Description | +|------------------------------|----------------|----------|----------|---------------------------------------------------------|-------------| +| limit | integer | False | | >0 | The maximum number of tokens allowed within a given time interval. At least one of `limit` and `instances.limit` should be configured. | +| time_window | integer | False | | >0 | The time interval corresponding to the rate limiting `limit` in seconds. At least one of `time_window` and `instances.time_window` should be configured. | +| show_limit_quota_header | boolean | False | true | | If true, includes `X-AI-RateLimit-Limit-*`, `X-AI-RateLimit-Remaining-*`, and `X-AI-RateLimit-Reset-*` headers in the response, where `*` is the instance name. | +| limit_strategy | string | False | total_tokens | [total_tokens, prompt_tokens, completion_tokens] | Type of token to apply rate limiting. `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`. | +| instances | array[object] | False | | | LLM instance rate limiting configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.limit | integer | True | | >0 | The maximum number of tokens allowed within a given time interval for an instance. | +| instances.time_window | integer | True | | >0 | The time interval corresponding to the rate limiting `limit` in seconds for an instance. | +| rejected_code | integer | False | 503 | [200, 599] | The HTTP status code returned when a request exceeding the quota is rejected. | +| rejected_msg | string | False | | | The response body returned when a request exceeding the quota is rejected. | + +## Examples -## Plugin Attributes +The examples below demonstrate how you can configure `ai-rate-limiting` for different scenarios. -| Name | Type | Required | Description | -| ------------------------- | ------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `limit` | integer | conditionally | The maximum number of tokens allowed to consume within a given time interval. At least one of `limit` and `instances.limit` should be configured. | -| `time_window` | integer | conditionally | The time interval corresponding to the rate limiting `limit` in seconds. At least one of `time_window` and `instances.time_window` should be configured. | -| `show_limit_quota_header` | boolean | false | If true, include `X-AI-RateLimit-Limit-*` to show the total quota, `X-AI-RateLimit-Remaining-*` to show the remaining quota in the response header, and `X-AI-RateLimit-Reset-*` to show the number of seconds left for the counter to reset, where `*` is the instance name. Default: `true` | -| `limit_strategy` | string | false | Type of token to apply rate limiting. `total_tokens`, `prompt_tokens`, and `completion_tokens` values are returned in each model response, where `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`. 
Default: `total_tokens` | -| `instances` | array[object] | conditionally | LLM instance rate limiting configurations. | -| `instances.name` | string | true | Name of the LLM service instance. | -| `instances.limit` | integer | true | The maximum number of tokens allowed to consume within a given time interval. | -| `instances.time_window` | integer | true | The time interval corresponding to the rate limiting `limit` in seconds. | -| `rejected_code` | integer | false | The HTTP status code returned when a request exceeding the quota is rejected. Default: `503` | -| `rejected_msg` | string | false | The response body returned when a request exceeding the quota is rejected. | +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` -If `limit` is configured, `time_window` also needs to be configured. Else, just specifying `instances` will also suffice. +::: -## Example +### Apply Rate Limiting with `ai-proxy` -Create a route as such and update with your LLM providers, models, API keys, and endpoints: +The following example demonstrates how you can use `ai-proxy` to proxy LLM traffic and use `ai-rate-limiting` to configure token-based rate limiting on the instance. + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints, if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-rate-limiting-route", "uri": "/anything", @@ -64,7 +84,7 @@ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ "provider": "openai", "auth": { "header": { - "Authorization": "Bearer '"$API_KEY"'" + "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { @@ -82,7 +102,101 @@ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ }' ``` -Send a POST request to the route with a system prompt and a sample user question in the request body: +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ... + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1 + 1 equals 2. This is a fundamental arithmetic operation where adding one unit to another results in a total of two units." + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + ... +} +``` + +If the rate limiting quota of 300 prompt tokens has been consumed in a 30-second window, all additional requests will be rejected. + +### Rate Limit One Instance Among Multiple + +The following example demonstrates how you can use `ai-proxy-multi` to configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. Additionally, use `ai-rate-limiting` to configure token-based rate limiting on the instance that receives 80% of the traffic, such that when the configured quota is fully consumed, the additional traffic will be forwarded to the other instance. 
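
The example Routes in this document read LLM provider API keys from environment variables such as `OPENAI_API_KEY` and `DEEPSEEK_API_KEY`. As a setup sketch (the values below are placeholders, not real keys), export your own keys before creating the Routes:

```shell
# Placeholder values for illustration only; replace with your own provider keys.
export OPENAI_API_KEY=sk-your-openai-key
export DEEPSEEK_API_KEY=sk-your-deepseek-key
```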
+ +Create a Route which applies rate limiting quota of 100 total tokens in a 30-second window on the `deepseek-instance-1` instance, and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance-1", + "provider": "deepseek", + "weight": 8, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + }, + { + "name": "deepseek-instance-2", + "provider": "deepseek", + "weight": 2, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance-1", + "limit_strategy": "total_tokens", + "limit": 100, + "time_window": 30 + } + ] + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: ```shell curl "http://127.0.0.1:9080/anything" -X POST \ @@ -116,4 +230,644 @@ You should receive a response similar to the following: } ``` -If rate limiting quota of 300 tokens has been consumed in a 30-second window, the additional requests will all be rejected. +If `deepseek-instance-1` instance rate limiting quota of 100 tokens has been consumed in a 30-second window, the additional requests will all be forwarded to `deepseek-instance-2`, which is not rate limited. + +### Apply the Same Quota to All Instances + +The following example demonstrates how you can apply the same rate limiting quota to all LLM upstream instances in `ai-rate-limiting`. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. 
+ +Create a Route which applies a rate limiting quota of 100 total tokens for all instances within a 60-second window, and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "limit": 100, + "time_window": 60, + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive a response from either LLM instance, similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure! Sir Isaac Newton formulated three laws of motion that describe the motion of objects. These laws are widely used in physics and engineering for studying and understanding how things move. Here they are:\n\n1. Newton's First Law - Law of Inertia: An object at rest tends to stay at rest and an object in motion tends to stay in motion with the same speed and in the same direction unless acted upon by an unbalanced force. This is also known as the principle of inert [...] + "refusal": null + }, + "logprobs": null, + "finish_reason": "length" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 256, + "total_tokens": 279, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the Route: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive a response from the other LLM instance, similar to the following: + +```json +{ + ... + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics. 
Here's an explanation of each law:\n\n---\n\n### **1. Newton's First Law (Law of Inertia)**\n- **Statement**: An object will remain at rest or in uniform motion in a straight line unless acted upon by an ex [...] + }, + "logprobs": null, + "finish_reason": "length" + } + ], + "usage": { + "prompt_tokens": 13, + "completion_tokens": 256, + "total_tokens": 269, + "prompt_tokens_details": { + "cached_tokens": 0 + }, + "prompt_cache_hit_tokens": 0, + "prompt_cache_miss_tokens": 13 + }, + "system_fingerprint": "fp_3a5770e1b4_prod0225" +} +``` + +Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be rejected. + +Within the same 60-second window, send a third POST request to the Route: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive an `HTTP 429 Too Many Requests` response and observe the following headers: + +```text +X-AI-RateLimit-Limit-openai-instance: 100 +X-AI-RateLimit-Remaining-openai-instance: 0 +X-AI-RateLimit-Reset-openai-instance: 0 +X-AI-RateLimit-Limit-deepseek-instance: 100 +X-AI-RateLimit-Remaining-deepseek-instance: 0 +X-AI-RateLimit-Reset-deepseek-instance: 0 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. + +Create a Route as such to set rate limiting and a higher priority on `openai-instance` instance and set the `fallback_strategy` to `instance_health_and_rate_limiting`. Update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-proxy-multi": { + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "priority": 1, + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "priority": 0, + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" 
} + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the Route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight lin [...] + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by Consumer. 
+ +Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `johndoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-john-key-auth", + "plugins": { + "key-auth": { + "key": "john-key" + } + } + }' +``` + +Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `janedoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-jane-key-auth", + "plugins": { + "key-auth": { + "key": "jane-key" + } + } + }' +``` + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "key-auth": {}, + "ai-proxy-multi": { + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + } + } + }' +``` + +Send a POST request to the Route without any Consumer key: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive an `HTTP/1.1 401 Unauthorized` response. + +Send a POST request to the Route with `johndoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: john-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" 
} + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of the `openai` instance for `johndoe`, the next request within the 60-second window from `johndoe` is expected to be forwarded to the `deepseek` instance. + +Within the same 60-second window, send another POST request to the Route with `johndoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: john-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws to me" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight lin [...] + }, + ... + } + ], + ... +} +``` + +Send a POST request to the Route with `janedoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: jane-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "The sum of 1 and 1 is 2. This is a basic arithmetic operation where you combine two units to get a total of two units." + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 14, + "completion_tokens": 31, + "total_tokens": 45, + "prompt_tokens_details": { + "cached_tokens": 0 + }, + "prompt_cache_hit_tokens": 0, + "prompt_cache_miss_tokens": 14 + }, + "system_fingerprint": "fp_3a5770e1b4_prod0225" +} +``` + +Since the `total_tokens` value exceeds the configured quota of the `deepseek` instance for `janedoe`, the next request within the 60-second window from `janedoe` is expected to be forwarded to the `openai` instance. 
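
At any point, you can also confirm how much of each instance's quota a Consumer has left by adding `-i` to a request and reading the per-instance `X-AI-RateLimit-Limit-*`, `X-AI-RateLimit-Remaining-*`, and `X-AI-RateLimit-Reset-*` headers, which are included when `show_limit_quota_header` is left at its default of `true`. A sketch using `janedoe`'s key (note that this request itself also counts toward the quota):

```shell
# Print only the rate limiting headers from the response.
curl -s -i "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: jane-key' \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1+1?" }
    ]
  }' | grep -i "X-AI-RateLimit"
```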
+ +Within the same 60-second window, send another POST request to the Route with `janedoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: jane-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws to me" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure, here are Newton's three laws of motion:\n\n1) Newton's First Law, also known as the Law of Inertia, states that an object at rest will stay at rest, and an object in motion will stay in motion, unless acted on by an external force. In simple words, this law suggests that an object will keep doing whatever it is doing until something causes it to do otherwise. \n\n2) Newton's Second Law states that the force acting on an object is equal to the mass of that object [...] + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + ... +} +``` + +This shows `ai-proxy-multi` load balance the traffic with respect to the rate limiting rules in `ai-rate-limiting` by Consumers.
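
Once you are done experimenting, you can optionally remove the objects created in these examples through the Admin API. The following is a sketch assuming the IDs used above; depending on your APISIX version, you may also need to delete the key-auth Credentials under each Consumer first:

```shell
# Optional cleanup of the Route and Consumers created in the examples above.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-rate-limiting-route" -X DELETE \
  -H "X-API-KEY: ${admin_key}"

curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe" -X DELETE \
  -H "X-API-KEY: ${admin_key}"

curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe" -X DELETE \
  -H "X-API-KEY: ${admin_key}"
```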