This is an automated email from the ASF dual-hosted git repository.

traky pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/apisix.git
The following commit(s) were added to refs/heads/master by this push: new 901311039 docs: update `ai-rate-limiting` and `ai-rag` docs (#12107) 901311039 is described below commit 9013110391c40a4a9ee948046a262860dbe327b0 Author: Traky Deng <trakyd...@gmail.com> AuthorDate: Tue Apr 8 09:48:31 2025 +0800 docs: update `ai-rate-limiting` and `ai-rag` docs (#12107) --- docs/en/latest/plugins/ai-rag.md | 177 ++++--- docs/en/latest/plugins/ai-rate-limiting.md | 798 ++++++++++++++++++++++++++++- 2 files changed, 882 insertions(+), 93 deletions(-) diff --git a/docs/en/latest/plugins/ai-rag.md b/docs/en/latest/plugins/ai-rag.md index 813e5fff0..844b3078e 100644 --- a/docs/en/latest/plugins/ai-rag.md +++ b/docs/en/latest/plugins/ai-rag.md @@ -5,7 +5,9 @@ keywords: - API Gateway - Plugin - ai-rag -description: This document contains information about the Apache APISIX ai-rag Plugin. + - AI + - LLM +description: The ai-rag Plugin enhances LLM outputs with Retrieval-Augmented Generation (RAG), efficiently retrieving relevant documents to improve accuracy and contextual relevance in responses. --- <!-- @@ -27,34 +29,38 @@ description: This document contains information about the Apache APISIX ai-rag P # --> +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-rag" /> +</head> + ## Description -The `ai-rag` plugin integrates Retrieval-Augmented Generation (RAG) capabilities with AI models. -It allows efficient retrieval of relevant documents or information from external data sources and -augments the LLM responses with that data, improving the accuracy and context of generated outputs. +The `ai-rag` Plugin provides Retrieval-Augmented Generation (RAG) capabilities with LLMs. It facilitates the efficient retrieval of relevant documents or information from external data sources, which are used to enhance the LLM responses, thereby improving the accuracy and contextual relevance of the generated outputs. + +The Plugin supports using [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) and [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) services for generating embeddings and performing vector search. **_As of now only [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) and [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) services are supported for generating embeddings and performing vector search respectively. PRs for introducing support for other service providers are welcomed._** -## Plugin Attributes +## Attributes -| **Field** | **Required** | **Type** | **Description** | +| Name | Required | Type | Description | | ----------------------------------------------- | ------------ | -------- | ----------------------------------------------------------------------------------------------------------------------------------------- | -| embeddings_provider | Yes | object | Configurations of the embedding models provider | -| embeddings_provider.azure_openai | Yes | object | Configurations of [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) as the embedding models provider. 
| -| embeddings_provider.azure_openai.endpoint | Yes | string | Azure OpenAI endpoint | -| embeddings_provider.azure_openai.api_key | Yes | string | Azure OpenAI API key | -| vector_search_provider | Yes | object | Configuration for the vector search provider | -| vector_search_provider.azure_ai_search | Yes | object | Configuration for Azure AI Search | -| vector_search_provider.azure_ai_search.endpoint | Yes | string | Azure AI Search endpoint | -| vector_search_provider.azure_ai_search.api_key | Yes | string | Azure AI Search API key | +| embeddings_provider | True | object | Configurations of the embedding models provider. | +| embeddings_provider.azure_openai | True | object | Configurations of [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) as the embedding models provider. | +| embeddings_provider.azure_openai.endpoint | True | string | Azure OpenAI embedding model endpoint. | +| embeddings_provider.azure_openai.api_key | True | string | Azure OpenAI API key. | +| vector_search_provider | True | object | Configuration for the vector search provider. | +| vector_search_provider.azure_ai_search | True | object | Configuration for Azure AI Search. | +| vector_search_provider.azure_ai_search.endpoint | True | string | Azure AI Search endpoint. | +| vector_search_provider.azure_ai_search.api_key | True | string | Azure AI Search API key. | ## Request Body Format The following fields must be present in the request body. -| **Field** | **Type** | **Description** | +| Field | Type | Description | | -------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------- | -| ai_rag | object | Configuration for AI-RAG (Retrieval Augmented Generation) | +| ai_rag | object | Request body RAG specifications. | | ai_rag.embeddings | object | Request parameters required to generate embeddings. Contents will depend on the API specification of the configured provider. | | ai_rag.vector_search | object | Request parameters required to perform vector search. Contents will depend on the API specification of the configured provider. | @@ -62,12 +68,12 @@ The following fields must be present in the request body. - Azure OpenAI - | **Name** | **Required** | **Type** | **Description** | + | Name | Required | Type | Description | | --------------- | ------------ | -------- | -------------------------------------------------------------------------------------------------------------------------- | - | input | Yes | string | Input text used to compute embeddings, encoded as a string. | - | user | No | string | A unique identifier representing your end-user, which can help in monitoring and detecting abuse. | - | encoding_format | No | string | The format to return the embeddings in. Can be either `float` or `base64`. Defaults to `float`. | - | dimensions | No | integer | The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models. | + | input | True | string | Input text used to compute embeddings, encoded as a string. | + | user | False | string | A unique identifier representing your end-user, which can help in monitoring and detecting abuse. | + | encoding_format | False | string | The format to return the embeddings in. Can be either `float` or `base64`. Defaults to `float`. | + | dimensions | False | integer | The number of dimensions the resulting output embeddings should have. 
Only supported in text-embedding-3 and later models. | For other parameters please refer to the [Azure OpenAI embeddings documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#embeddings). @@ -75,9 +81,9 @@ For other parameters please refer to the [Azure OpenAI embeddings documentation] - Azure AI Search - | **Field** | **Required** | **Type** | **Description** | + | Field | Required | Type | Description | | --------- | ------------ | -------- | ---------------------------- | - | fields | Yes | String | Fields for the vector search | + | fields | True | String | Fields for the vector search. | For other parameters please refer the [Azure AI Search documentation](https://learn.microsoft.com/en-us/rest/api/searchservice/documents/search-post). @@ -95,106 +101,135 @@ Example request body: } ``` -## Example usage +## Example + +To follow along the example, create an [Azure account](https://portal.azure.com) and complete the following steps: -First initialise these shell variables: +* In [Azure AI Foundry](https://oai.azure.com/portal), deploy a generative chat model, such as `gpt-4o`, and an embedding model, such as `text-embedding-3-large`. Obtain the API key and model endpoints. +* Follow [Azure's example](https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/basic-vector-workflow/azure-search-vector-python-sample.ipynb) to prepare for a vector search in [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search) using Python. The example will create a search index called `vectest` with the desired schema and upload the [sample data](https://github.com/Azure/azure-search-vector-samples/blob/main/data/text-sample.j [...] +* In [Azure AI Search](https://azure.microsoft.com/en-us/products/ai-services/ai-search), [obtain the Azure vector search API key and the search service endpoint](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector?tabs=api-key#retrieve-resource-information). 
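
Optionally, before configuring APISIX, you can confirm that the `vectest` index is queryable. The snippet below is a minimal sketch that assumes your own Azure AI Search service endpoint and admin key; adjust the `api-version` if yours differs:

```shell
# Hypothetical endpoint and key shown for illustration only; replace with your own values.
curl "https://your-service.search.windows.net/indexes/vectest/docs/search?api-version=2024-07-01" \
  -H "Content-Type: application/json" \
  -H "api-key: your-azure-ai-search-admin-key" \
  -d '{ "search": "*", "top": 1 }'
```

A successful response should return one of the uploaded sample documents in the `value` array, confirming the index is ready to be used for vector search.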
+ +Save the API keys and endpoints to environment variables: ```shell -ADMIN_API_KEY=edd1c9f034335f136f87ad84b625c8f1 -AZURE_OPENAI_ENDPOINT=https://name.openai.azure.com/openai/deployments/gpt-4o/chat/completions -VECTOR_SEARCH_ENDPOINT=https://name.search.windows.net/indexes/indexname/docs/search?api-version=2024-07-01 -EMBEDDINGS_ENDPOINT=https://name.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15 -EMBEDDINGS_KEY=secret-azure-openai-embeddings-key -SEARCH_KEY=secret-azureai-search-key -AZURE_OPENAI_KEY=secret-azure-openai-key +# replace with your values + +AZ_OPENAI_DOMAIN=https://ai-plugin-developer.openai.azure.com +AZ_OPENAI_API_KEY=9m7VYroxITMDEqKKEnpOknn1rV7QNQT7DrIBApcwMLYJQQJ99ALACYeBjFXJ3w3AAABACOGXGcd +AZ_CHAT_ENDPOINT=${AZ_OPENAI_DOMAIN}/openai/deployments/gpt-4o/chat/completions?api-version=2024-02-15-preview +AZ_EMBEDDING_MODEL=text-embedding-3-large +AZ_EMBEDDINGS_ENDPOINT=${AZ_OPENAI_DOMAIN}/openai/deployments/${AZ_EMBEDDING_MODEL}/embeddings?api-version=2023-05-15 + +AZ_AI_SEARCH_SVC_DOMAIN=https://ai-plugin-developer.search.windows.net +AZ_AI_SEARCH_KEY=IFZBp3fKVdq7loEVe9LdwMvVdZrad9A4lPH90AzSeC06SlR +AZ_AI_SEARCH_INDEX=vectest +AZ_AI_SEARCH_ENDPOINT=${AZ_AI_SEARCH_SVC_DOMAIN}/indexes/${AZ_AI_SEARCH_INDEX}/docs/search?api-version=2024-07-01 +``` + +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') ``` -Create a route with the `ai-rag` and `ai-proxy` plugin like so: +::: + +### Integrate with Azure for RAG-Enhaned Responses + +The following example demonstrates how you can use the [`ai-proxy`](./ai-proxy.md) Plugin to proxy requests to Azure OpenAI LLM and use the `ai-rag` Plugin to generate embeddings and perform vector search to enhance LLM responses. + +Create a Route as such: ```shell -curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \ +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ -H "X-API-KEY: ${ADMIN_API_KEY}" \ -d '{ + "id": "ai-rag-route", "uri": "/rag", "plugins": { "ai-rag": { "embeddings_provider": { "azure_openai": { - "endpoint": "'"$EMBEDDINGS_ENDPOINT"'", - "api_key": "'"$EMBEDDINGS_KEY"'" + "endpoint": "'"$AZ_EMBEDDINGS_ENDPOINT"'", + "api_key": "'"$AZ_OPENAI_API_KEY"'" } }, "vector_search_provider": { "azure_ai_search": { - "endpoint": "'"$VECTOR_SEARCH_ENDPOINT"'", - "api_key": "'"$SEARCH_KEY"'" + "endpoint": "'"$AZ_AI_SEARCH_ENDPOINT"'", + "api_key": "'"$AZ_AI_SEARCH_KEY"'" } } }, "ai-proxy": { + "provider": "openai", "auth": { "header": { - "api-key": "'"$AZURE_OPENAI_KEY"'" - }, - "query": { - "api-version": "2023-03-15-preview" - } - }, - "model": { - "provider": "openai", - "name": "gpt-4", - "options": { - "max_tokens": 512, - "temperature": 1.0 + "api-key": "'"$AZ_OPENAI_API_KEY"'" } }, + "model": "gpt-4o", "override": { - "endpoint": "'"$AZURE_OPENAI_ENDPOINT"'" + "endpoint": "'"$AZ_CHAT_ENDPOINT"'" } } - }, - "upstream": { - "type": "roundrobin", - "nodes": { - "someupstream.com:443": 1 - }, - "scheme": "https", - "pass_host": "node" } }' ``` -The `ai-proxy` plugin is used here as it simplifies access to LLMs. Alternatively, you may configure the LLM service address in the upstream configuration and update the route URI as well. 
- -Now send a request: +Send a POST request to the Route with the vector fields name, embedding model dimensions, and an input prompt in the request body: ```shell -curl http://127.0.0.1:9080/rag -XPOST -H 'Content-Type: application/json' -d '{"ai_rag":{"vector_search":{"fields":"contentVector"},"embeddings":{"input":"which service is good for devops","dimensions":1024}}}' +curl "http://127.0.0.1:9080/rag" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "ai_rag":{ + "vector_search":{ + "fields":"contentVector" + }, + "embeddings":{ + "input":"Which Azure services are good for DevOps?", + "dimensions":1024 + } + } + }' ``` -You will receive a response like this: +You should receive an `HTTP/1.1 200 OK` response similar to the following: ```json { "choices": [ { + "content_filter_results": { + ... + }, "finish_reason": "length", "index": 0, + "logprobs": null, "message": { - "content": "Here are the details for some of the services you inquired about from your Azure search context:\n\n ... <rest of the response>", + "content": "Here is a list of Azure services categorized along with a brief description of each based on the provided JSON data:\n\n### Developer Tools\n- **Azure DevOps**: A suite of services that help you plan, build, and deploy applications, including Azure Boards, Azure Repos, Azure Pipelines, Azure Test Plans, and Azure Artifacts.\n- **Azure DevTest Labs**: A fully managed service to create, manage, and share development and test environments in Azure, supporting custom temp [...] "role": "assistant" } } ], - "created": 1727079764, - "id": "chatcmpl-AAYdA40YjOaeIHfgFBkaHkUFCWxfc", + "created": 1740625850, + "id": "chatcmpl-B54gQdumpfioMPIybFnirr6rq9ZZS", "model": "gpt-4o-2024-05-13", "object": "chat.completion", - "system_fingerprint": "fp_67802d9a6d", + "prompt_filter_results": [ + { + "prompt_index": 0, + "content_filter_results": { + ... + } + } + ], + "system_fingerprint": "fp_65792305e4", "usage": { - "completion_tokens": 512, - "prompt_tokens": 6560, - "total_tokens": 7072 + ... } } ``` diff --git a/docs/en/latest/plugins/ai-rate-limiting.md b/docs/en/latest/plugins/ai-rate-limiting.md index 8386b5cc0..84bad6161 100644 --- a/docs/en/latest/plugins/ai-rate-limiting.md +++ b/docs/en/latest/plugins/ai-rate-limiting.md @@ -5,7 +5,9 @@ keywords: - API Gateway - Plugin - ai-rate-limiting -description: The ai-rate-limiting plugin enforces token-based rate limiting for LLM service requests, preventing overuse, optimizing API consumption, and ensuring efficient resource allocation. + - AI + - LLM +description: The ai-rate-limiting Plugin enforces token-based rate limiting for LLM service requests, preventing overuse, optimizing API consumption, and ensuring efficient resource allocation. --- <!-- @@ -27,34 +29,52 @@ description: The ai-rate-limiting plugin enforces token-based rate limiting for # --> +<head> + <link rel="canonical" href="https://docs.api7.ai/hub/ai-rate-limiting" /> +</head> + ## Description -The `ai-rate-limiting` plugin enforces token-based rate limiting for requests sent to LLM services. It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used with `ai-proxy` or `ai-proxy-multi` plugin. +The `ai-rate-limiting` Plugin enforces token-based rate limiting for requests sent to LLM services. 
It helps manage API usage by controlling the number of tokens consumed within a specified time frame, ensuring fair resource allocation and preventing excessive load on the service. It is often used with [`ai-proxy`](./ai-proxy.md) or [`ai-proxy-multi`](./ai-proxy-multi.md) plugin. + +## Attributes + +| Name | Type | Required | Default | Valid values | Description | +|------------------------------|----------------|----------|----------|---------------------------------------------------------|-------------| +| limit | integer | False | | >0 | The maximum number of tokens allowed within a given time interval. At least one of `limit` and `instances.limit` should be configured. | +| time_window | integer | False | | >0 | The time interval corresponding to the rate limiting `limit` in seconds. At least one of `time_window` and `instances.time_window` should be configured. | +| show_limit_quota_header | boolean | False | true | | If true, includes `X-AI-RateLimit-Limit-*`, `X-AI-RateLimit-Remaining-*`, and `X-AI-RateLimit-Reset-*` headers in the response, where `*` is the instance name. | +| limit_strategy | string | False | total_tokens | [total_tokens, prompt_tokens, completion_tokens] | Type of token to apply rate limiting. `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`. | +| instances | array[object] | False | | | LLM instance rate limiting configurations. | +| instances.name | string | True | | | Name of the LLM service instance. | +| instances.limit | integer | True | | >0 | The maximum number of tokens allowed within a given time interval for an instance. | +| instances.time_window | integer | True | | >0 | The time interval corresponding to the rate limiting `limit` in seconds for an instance. | +| rejected_code | integer | False | 503 | [200, 599] | The HTTP status code returned when a request exceeding the quota is rejected. | +| rejected_msg | string | False | | | The response body returned when a request exceeding the quota is rejected. | + +## Examples -## Plugin Attributes +The examples below demonstrate how you can configure `ai-rate-limiting` for different scenarios. -| Name | Type | Required | Description | -| ------------------------- | ------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `limit` | integer | conditionally | The maximum number of tokens allowed to consume within a given time interval. At least one of `limit` and `instances.limit` should be configured. | -| `time_window` | integer | conditionally | The time interval corresponding to the rate limiting `limit` in seconds. At least one of `time_window` and `instances.time_window` should be configured. | -| `show_limit_quota_header` | boolean | false | If true, include `X-AI-RateLimit-Limit-*` to show the total quota, `X-AI-RateLimit-Remaining-*` to show the remaining quota in the response header, and `X-AI-RateLimit-Reset-*` to show the number of seconds left for the counter to reset, where `*` is the instance name. Default: `true` | -| `limit_strategy` | string | false | Type of token to apply rate limiting. `total_tokens`, `prompt_tokens`, and `completion_tokens` values are returned in each model response, where `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`. 
Default: `total_tokens` | -| `instances` | array[object] | conditionally | LLM instance rate limiting configurations. | -| `instances.name` | string | true | Name of the LLM service instance. | -| `instances.limit` | integer | true | The maximum number of tokens allowed to consume within a given time interval. | -| `instances.time_window` | integer | true | The time interval corresponding to the rate limiting `limit` in seconds. | -| `rejected_code` | integer | false | The HTTP status code returned when a request exceeding the quota is rejected. Default: `503` | -| `rejected_msg` | string | false | The response body returned when a request exceeding the quota is rejected. | +:::note + +You can fetch the `admin_key` from `config.yaml` and save to an environment variable with the following command: + +```bash +admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g') +``` -If `limit` is configured, `time_window` also needs to be configured. Else, just specifying `instances` will also suffice. +::: -## Example +### Apply Rate Limiting with `ai-proxy` -Create a route as such and update with your LLM providers, models, API keys, and endpoints: +The following example demonstrates how you can use `ai-proxy` to proxy LLM traffic and use `ai-rate-limiting` to configure token-based rate limiting on the instance. + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints, if applicable: ```shell curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ - -H "X-API-KEY: ${ADMIN_API_KEY}" \ + -H "X-API-KEY: ${admin_key}" \ -d '{ "id": "ai-rate-limiting-route", "uri": "/anything", @@ -64,7 +84,7 @@ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ "provider": "openai", "auth": { "header": { - "Authorization": "Bearer '"$API_KEY"'" + "Authorization": "Bearer '"$OPENAI_API_KEY"'" } }, "options": { @@ -82,7 +102,101 @@ curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ }' ``` -Send a POST request to the route with a system prompt and a sample user question in the request body: +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ... + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1 + 1 equals 2. This is a fundamental arithmetic operation where adding one unit to another results in a total of two units." + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + ... +} +``` + +If the rate limiting quota of 300 prompt tokens has been consumed in a 30-second window, all additional requests will be rejected. + +### Rate Limit One Instance Among Multiple + +The following example demonstrates how you can use `ai-proxy-multi` to configure two models for load balancing, forwarding 80% of the traffic to one instance and 20% to the other. Additionally, use `ai-rate-limiting` to configure token-based rate limiting on the instance that receives 80% of the traffic, such that when the configured quota is fully consumed, the additional traffic will be forwarded to the other instance. 
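
The example Routes in this document read LLM provider API keys from environment variables such as `OPENAI_API_KEY` and `DEEPSEEK_API_KEY`. As a setup sketch (the values below are placeholders, not real keys), export your own keys before creating the Routes:

```shell
# Placeholder values for illustration only; replace with your own provider keys.
export OPENAI_API_KEY=sk-your-openai-key
export DEEPSEEK_API_KEY=sk-your-deepseek-key
```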
+ +Create a Route which applies rate limiting quota of 100 total tokens in a 30-second window on the `deepseek-instance-1` instance, and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance-1", + "provider": "deepseek", + "weight": 8, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + }, + { + "name": "deepseek-instance-2", + "provider": "deepseek", + "weight": 2, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance-1", + "limit_strategy": "total_tokens", + "limit": 100, + "time_window": 30 + } + ] + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: ```shell curl "http://127.0.0.1:9080/anything" -X POST \ @@ -116,4 +230,644 @@ You should receive a response similar to the following: } ``` -If rate limiting quota of 300 tokens has been consumed in a 30-second window, the additional requests will all be rejected. +If `deepseek-instance-1` instance rate limiting quota of 100 tokens has been consumed in a 30-second window, the additional requests will all be forwarded to `deepseek-instance-2`, which is not rate limited. + +### Apply the Same Quota to All Instances + +The following example demonstrates how you can apply the same rate limiting quota to all LLM upstream instances in `ai-rate-limiting`. + +For demonstration and easier differentiation, you will be configuring one OpenAI instance and one DeepSeek instance as the upstream LLM services. 
+ +Create a Route which applies a rate limiting quota of 100 total tokens for all instances within a 60-second window, and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "limit": 100, + "time_window": 60, + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive a response from either LLM instance, similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure! Sir Isaac Newton formulated three laws of motion that describe the motion of objects. These laws are widely used in physics and engineering for studying and understanding how things move. Here they are:\n\n1. Newton's First Law - Law of Inertia: An object at rest tends to stay at rest and an object in motion tends to stay in motion with the same speed and in the same direction unless acted upon by an unbalanced force. This is also known as the principle of inert [...] + "refusal": null + }, + "logprobs": null, + "finish_reason": "length" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 256, + "total_tokens": 279, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the Route: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive a response from the other LLM instance, similar to the following: + +```json +{ + ... + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics. 
Here's an explanation of each law:\n\n---\n\n### **1. Newton's First Law (Law of Inertia)**\n- **Statement**: An object will remain at rest or in uniform motion in a straight line unless acted upon by an ex [...] + }, + "logprobs": null, + "finish_reason": "length" + } + ], + "usage": { + "prompt_tokens": 13, + "completion_tokens": 256, + "total_tokens": 269, + "prompt_tokens_details": { + "cached_tokens": 0 + }, + "prompt_cache_hit_tokens": 0, + "prompt_cache_miss_tokens": 13 + }, + "system_fingerprint": "fp_3a5770e1b4_prod0225" +} +``` + +Since the `total_tokens` value exceeds the configured quota of `100`, the next request within the 60-second window is expected to be rejected. + +Within the same 60-second window, send a third POST request to the Route: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws" } + ] + }' +``` + +You should receive an `HTTP 429 Too Many Requests` response and observe the following headers: + +```text +X-AI-RateLimit-Limit-openai-instance: 100 +X-AI-RateLimit-Remaining-openai-instance: 0 +X-AI-RateLimit-Reset-openai-instance: 0 +X-AI-RateLimit-Limit-deepseek-instance: 100 +X-AI-RateLimit-Remaining-deepseek-instance: 0 +X-AI-RateLimit-Reset-deepseek-instance: 0 +``` + +### Configure Instance Priority and Rate Limiting + +The following example demonstrates how you can configure two models with different priorities and apply rate limiting on the instance with a higher priority. In the case where `fallback_strategy` is set to `instance_health_and_rate_limiting`, the Plugin should continue to forward requests to the low priority instance once the high priority instance's rate limiting quota is fully consumed. + +Create a Route as such to set rate limiting and a higher priority on `openai-instance` instance and set the `fallback_strategy` to `instance_health_and_rate_limiting`. Update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "ai-proxy-multi": { + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "priority": 1, + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "priority": 0, + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + }, + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "limit_strategy": "total_tokens" + } + } + }' +``` + +Send a POST request to the Route with a system prompt and a sample user question in the request body: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" 
} + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of `10`, the next request within the 60-second window is expected to be forwarded to the other instance. + +Within the same 60-second window, send another POST request to the Route: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newton law" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight lin [...] + }, + ... + } + ], + ... +} +``` + +### Load Balance and Rate Limit by Consumers + +The following example demonstrates how you can configure two models for load balancing and apply rate limiting by Consumer. 
+ +Create a Consumer `johndoe` and a rate limiting quota of 10 tokens in a 60-second window on `openai-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "openai-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `johndoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-john-key-auth", + "plugins": { + "key-auth": { + "key": "john-key" + } + } + }' +``` + +Create another Consumer `janedoe` and a rate limiting quota of 10 tokens in a 60-second window on `deepseek-instance` instance: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "username": "johndoe", + "plugins": { + "ai-rate-limiting": { + "instances": [ + { + "name": "deepseek-instance", + "limit": 10, + "time_window": 60 + } + ], + "rejected_code": 429, + "limit_strategy": "total_tokens" + } + } + }' +``` + +Configure `key-auth` credential for `janedoe`: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe/credentials" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "cred-jane-key-auth", + "plugins": { + "key-auth": { + "key": "jane-key" + } + } + }' +``` + +Create a Route as such and update with your LLM providers, models, API keys, and endpoints, if applicable: + +```shell +curl "http://127.0.0.1:9180/apisix/admin/routes" -X PUT \ + -H "X-API-KEY: ${admin_key}" \ + -d '{ + "id": "ai-rate-limiting-route", + "uri": "/anything", + "methods": ["POST"], + "plugins": { + "key-auth": {}, + "ai-proxy-multi": { + "fallback_strategy: "instance_health_and_rate_limiting", + "instances": [ + { + "name": "openai-instance", + "provider": "openai", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$OPENAI_API_KEY"'" + } + }, + "options": { + "model": "gpt-4" + } + }, + { + "name": "deepseek-instance", + "provider": "deepseek", + "weight": 0, + "auth": { + "header": { + "Authorization": "Bearer '"$DEEPSEEK_API_KEY"'" + } + }, + "options": { + "model": "deepseek-chat" + } + } + ] + } + } + }' +``` + +Send a POST request to the Route without any Consumer key: + +```shell +curl -i "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive an `HTTP/1.1 401 Unauthorized` response. + +Send a POST request to the Route with `johndoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: john-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" 
} + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "1+1 equals 2.", + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 23, + "completion_tokens": 8, + "total_tokens": 31, + "prompt_tokens_details": { + "cached_tokens": 0, + "audio_tokens": 0 + }, + "completion_tokens_details": { + "reasoning_tokens": 0, + "audio_tokens": 0, + "accepted_prediction_tokens": 0, + "rejected_prediction_tokens": 0 + } + }, + "service_tier": "default", + "system_fingerprint": null +} +``` + +Since the `total_tokens` value exceeds the configured quota of the `openai` instance for `johndoe`, the next request within the 60-second window from `johndoe` is expected to be forwarded to the `deepseek` instance. + +Within the same 60-second window, send another POST request to the Route with `johndoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: john-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws to me" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Certainly! Newton's laws of motion are three fundamental principles that describe the relationship between the motion of an object and the forces acting on it. They were formulated by Sir Isaac Newton in the late 17th century and are foundational to classical mechanics.\n\n---\n\n### **1. Newton's First Law (Law of Inertia):**\n- **Statement:** An object at rest will remain at rest, and an object in motion will continue moving at a constant velocity (in a straight lin [...] + }, + ... + } + ], + ... +} +``` + +Send a POST request to the Route with `janedoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: jane-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "What is 1+1?" } + ] + }' +``` + +You should receive a response similar to the following: + +```json +{ + ..., + "model": "deepseek-chat", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "The sum of 1 and 1 is 2. This is a basic arithmetic operation where you combine two units to get a total of two units." + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 14, + "completion_tokens": 31, + "total_tokens": 45, + "prompt_tokens_details": { + "cached_tokens": 0 + }, + "prompt_cache_hit_tokens": 0, + "prompt_cache_miss_tokens": 14 + }, + "system_fingerprint": "fp_3a5770e1b4_prod0225" +} +``` + +Since the `total_tokens` value exceeds the configured quota of the `deepseek` instance for `janedoe`, the next request within the 60-second window from `janedoe` is expected to be forwarded to the `openai` instance. 
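
At any point, you can also confirm how much of each instance's quota a Consumer has left by adding `-i` to a request and reading the per-instance `X-AI-RateLimit-Limit-*`, `X-AI-RateLimit-Remaining-*`, and `X-AI-RateLimit-Reset-*` headers, which are included when `show_limit_quota_header` is left at its default of `true`. A sketch using `janedoe`'s key (note that this request itself also counts toward the quota):

```shell
# Print only the rate limiting headers from the response.
curl -s -i "http://127.0.0.1:9080/anything" -X POST \
  -H "Content-Type: application/json" \
  -H 'apikey: jane-key' \
  -d '{
    "messages": [
      { "role": "user", "content": "What is 1+1?" }
    ]
  }' | grep -i "X-AI-RateLimit"
```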
+ +Within the same 60-second window, send another POST request to the Route with `janedoe`'s key: + +```shell +curl "http://127.0.0.1:9080/anything" -X POST \ + -H "Content-Type: application/json" \ + -H 'apikey: jane-key' \ + -d '{ + "messages": [ + { "role": "system", "content": "You are a mathematician" }, + { "role": "user", "content": "Explain Newtons laws to me" } + ] + }' +``` + +You should see a response similar to the following: + +```json +{ + ..., + "model": "gpt-4-0613", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Sure, here are Newton's three laws of motion:\n\n1) Newton's First Law, also known as the Law of Inertia, states that an object at rest will stay at rest, and an object in motion will stay in motion, unless acted on by an external force. In simple words, this law suggests that an object will keep doing whatever it is doing until something causes it to do otherwise. \n\n2) Newton's Second Law states that the force acting on an object is equal to the mass of that object [...] + "refusal": null + }, + "logprobs": null, + "finish_reason": "stop" + } + ], + ... +} +``` + +This shows `ai-proxy-multi` load balance the traffic with respect to the rate limiting rules in `ai-rate-limiting` by Consumers.
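
Once you are done experimenting, you can optionally remove the objects created in these examples through the Admin API. The following is a sketch assuming the IDs used above; depending on your APISIX version, you may also need to delete the key-auth Credentials under each Consumer first:

```shell
# Optional cleanup of the Route and Consumers created in the examples above.
curl "http://127.0.0.1:9180/apisix/admin/routes/ai-rate-limiting-route" -X DELETE \
  -H "X-API-KEY: ${admin_key}"

curl "http://127.0.0.1:9180/apisix/admin/consumers/johndoe" -X DELETE \
  -H "X-API-KEY: ${admin_key}"

curl "http://127.0.0.1:9180/apisix/admin/consumers/janedoe" -X DELETE \
  -H "X-API-KEY: ${admin_key}"
```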