aruggero commented on code in PR #4259: URL: https://github.com/apache/solr/pull/4259#discussion_r3086970683
########## solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc: ########## @@ -0,0 +1,543 @@ += Document Enrichment with LLMs +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +This module brings the power of *Large Language Models* to Solr. + +More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted +data. Given a prompt and a set of input fields, for each document, the LLM is invoked through +https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support +multiple types and may also be multivalued. + +_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing. + +[IMPORTANT] +==== +This module sends your documents off to some hosted service on the internet. +There are cost, privacy, performance, and service availability implications on such a strong dependency that should be +diligently examined before employing this module in a serious way. + +==== + +At the moment, Solr supports a subset of the LLM providers available in LangChain4j. 
+ +*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services. + +If you want to add support for additional services or improve the support for the existing ones, feel free to +contribute: + +* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr] + +== Module + +This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be +enabled before use. + +== Language Model Configuration + +Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`. + +=== Minimum Requirements + +* Enable the `language-models` module to make the Language Models classes available on Solr's classpath. +See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details. + +* An update processor, similar to the one below, must be declared in `solrconfig.xml`: ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +== Chat Model Setup + +=== Models + +* A model is a chat model that generates a text response given a prompt. +* A model is a reference to an external API that runs the Large Language Model responsible for chat completion. + +[IMPORTANT] +==== +The Solr chat model specifies the parameters to access the APIs, the LLM doesn't run internally in Solr + +==== + +A model is described by these parameters: + + +`class`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The model implementation. 
+Accepted values: + +* `dev.langchain4j.model.ollama.OllamaChatModel` +* `dev.langchain4j.model.mistralai.MistralAiChatModel` +* `dev.langchain4j.model.anthropic.AnthropicChatModel` +* `dev.langchain4j.model.openai.OpenAiChatModel` +* `dev.langchain4j.model.googleai.GoogleAiGeminiChatModel` + +`name`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The identifier of your model, this is used by any component that intends to use the model (e.g., `DocumentEnrichmentUpdateProcessorFactory` update processor). + +`params`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: none +|=== ++ +Each model class has potentially different params. +Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: https://docs.langchain4j.dev/category/language-models[Chat Models in LangChain4j]. + +=== Supported Models +Apache Solr uses https://github.com/langchain4j/langchain4j[LangChain4j] to support document enrichment with LLMs. 
+The models currently supported are: + +[tabs#supported-chat-models] +====== +Ollama:: ++ +==== + +[source,json] +---- +{ + "class": "dev.langchain4j.model.ollama.OllamaChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "http://localhost:11434", + "modelName": "<a-local/hosted-chat-model>", + "timeout": 300, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +MistralAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.mistralai.MistralAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.mistral.ai/v1", + "apiKey": "<your-mistralAI-api-key>", + "modelName": "<a-mistralAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +OpenAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "<your-openAI-api-key>", + "modelName": "<a-openAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Anthropic:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.anthropic.AnthropicChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.anthropic.com/v1/", + "apiKey": "<your-anthropic-api-key>", + "modelName": "<a-anthropic-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Gemini:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.googleai.GoogleAiGeminiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://generativelanguage.googleapis.com/v1beta/", + "apiKey": "<your-geminiAi-api-key>", + "modelName": "<a-geminiAi-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +====== + 
+=== Uploading a Model + +To upload the model in a `/path/myModel.json` file, please run: + +[source,bash] +---- +curl -XPUT 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json' +---- + +To delete the `currentModel` model: + +[source,bash] +---- +curl -XDELETE 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store/currentModel' +---- + +To view all models: + +[source,text] +http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store + + +.Example: /path/myOpenAIModel.json +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "openai-1", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "apiKey-openAI", + "modelName": "gpt-5.4-nano", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- + +== How to Trigger Document Enrichment during Indexing +To create new fields from existing document fields at indexing time, configure an +{solr-javadocs}/core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html[UpdateRequestProcessorChain] that +includes at least one DocumentEnrichmentUpdateProcessor update request processor. + +Several parameters must be defined: + +`inputField`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +One (or more) `inputField` needs to be injected in the prompt. This is done by some special tokens, that are the +`fieldName` surrounded by curly brackets (e.g., `{string_field}`, in the example at the +xref:document-enrichment-with-llms.adoc#minimum-requirements[top of the page]). These tokens are _mandatory_ for this +module to work properly. Solr will throw an error if the parameters are not properly defined. +For example, both the prompt or the content of the file `prompt.txt`, must contain the text '{string_field}', which +will be substituted with the content of the `string_field` field for each document. 
An example of a valid prompt with
+multiple input fields is as follows:

Review Comment:
Maybe here I would say: "Multiple `inputField` values can also be defined by using one of the following notations:" and then list the two ways.

##########
solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc:
##########
@@ -0,0 +1,543 @@
+= Document Enrichment with LLMs
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This module brings the power of *Large Language Models* to Solr.
+
+More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted
+data. Given a prompt and a set of input fields, for each document, the LLM is invoked through
+https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support
+multiple types and may also be multivalued.
+
+_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing.
+
+[IMPORTANT]
+====
+This module sends your documents off to some hosted service on the internet.
+There are cost, privacy, performance, and service availability implications on such a strong dependency that should be +diligently examined before employing this module in a serious way. + +==== + +At the moment, Solr supports a subset of the LLM providers available in LangChain4j. + +*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services. + +If you want to add support for additional services or improve the support for the existing ones, feel free to +contribute: + +* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr] + +== Module + +This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be +enabled before use. + +== Language Model Configuration + +Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`. + +=== Minimum Requirements + +* Enable the `language-models` module to make the Language Models classes available on Solr's classpath. +See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details. + +* An update processor, similar to the one below, must be declared in `solrconfig.xml`: ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +== Chat Model Setup + +=== Models + +* A model is a chat model that generates a text response given a prompt. +* A model is a reference to an external API that runs the Large Language Model responsible for chat completion. 
+ +[IMPORTANT] +==== +The Solr chat model specifies the parameters to access the APIs, the LLM doesn't run internally in Solr + +==== + +A model is described by these parameters: + + +`class`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The model implementation. +Accepted values: + +* `dev.langchain4j.model.ollama.OllamaChatModel` +* `dev.langchain4j.model.mistralai.MistralAiChatModel` +* `dev.langchain4j.model.anthropic.AnthropicChatModel` +* `dev.langchain4j.model.openai.OpenAiChatModel` +* `dev.langchain4j.model.googleai.GoogleAiGeminiChatModel` + +`name`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The identifier of your model, this is used by any component that intends to use the model (e.g., `DocumentEnrichmentUpdateProcessorFactory` update processor). + +`params`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: none +|=== ++ +Each model class has potentially different params. +Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: https://docs.langchain4j.dev/category/language-models[Chat Models in LangChain4j]. + +=== Supported Models +Apache Solr uses https://github.com/langchain4j/langchain4j[LangChain4j] to support document enrichment with LLMs. 
+The models currently supported are: + +[tabs#supported-chat-models] +====== +Ollama:: ++ +==== + +[source,json] +---- +{ + "class": "dev.langchain4j.model.ollama.OllamaChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "http://localhost:11434", + "modelName": "<a-local/hosted-chat-model>", + "timeout": 300, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +MistralAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.mistralai.MistralAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.mistral.ai/v1", + "apiKey": "<your-mistralAI-api-key>", + "modelName": "<a-mistralAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +OpenAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "<your-openAI-api-key>", + "modelName": "<a-openAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Anthropic:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.anthropic.AnthropicChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.anthropic.com/v1/", + "apiKey": "<your-anthropic-api-key>", + "modelName": "<a-anthropic-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Gemini:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.googleai.GoogleAiGeminiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://generativelanguage.googleapis.com/v1beta/", + "apiKey": "<your-geminiAi-api-key>", + "modelName": "<a-geminiAi-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +====== + 
+=== Uploading a Model
+
+To upload the model in a `/path/myModel.json` file, please run:
+
+[source,bash]
+----
+curl -XPUT 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'
+----
+
+To delete the `currentModel` model:
+
+[source,bash]
+----
+curl -XDELETE 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store/currentModel'
+----
+
+To view all models:
+
+[source,text]
+http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store
+
+
+.Example: /path/myOpenAIModel.json
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.openai.OpenAiChatModel",
+  "name": "openai-1",
+  "params": {
+    "baseUrl": "https://api.openai.com/v1",
+    "apiKey": "apiKey-openAI",
+    "modelName": "gpt-5.4-nano",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+
+== How to Trigger Document Enrichment during Indexing

Review Comment:
This part has not been moved to the top, near the model configuration in `solrconfig.xml`.

##########
solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc:
##########
@@ -0,0 +1,543 @@
+= Document Enrichment with LLMs
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.
See the License for the +// specific language governing permissions and limitations +// under the License. + +This module brings the power of *Large Language Models* to Solr. + +More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted +data. Given a prompt and a set of input fields, for each document, the LLM is invoked through +https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support +multiple types and may also be multivalued. + +_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing. + +[IMPORTANT] +==== +This module sends your documents off to some hosted service on the internet. +There are cost, privacy, performance, and service availability implications on such a strong dependency that should be +diligently examined before employing this module in a serious way. + +==== + +At the moment, Solr supports a subset of the LLM providers available in LangChain4j. + +*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services. + +If you want to add support for additional services or improve the support for the existing ones, feel free to +contribute: + +* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr] + +== Module + +This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be +enabled before use. + +== Language Model Configuration + +Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`. + +=== Minimum Requirements + +* Enable the `language-models` module to make the Language Models classes available on Solr's classpath. +See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details. 
+ +* An update processor, similar to the one below, must be declared in `solrconfig.xml`: ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +== Chat Model Setup + +=== Models + +* A model is a chat model that generates a text response given a prompt. +* A model is a reference to an external API that runs the Large Language Model responsible for chat completion. + +[IMPORTANT] +==== +The Solr chat model specifies the parameters to access the APIs, the LLM doesn't run internally in Solr + +==== + +A model is described by these parameters: + + +`class`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The model implementation. +Accepted values: + +* `dev.langchain4j.model.ollama.OllamaChatModel` +* `dev.langchain4j.model.mistralai.MistralAiChatModel` +* `dev.langchain4j.model.anthropic.AnthropicChatModel` +* `dev.langchain4j.model.openai.OpenAiChatModel` +* `dev.langchain4j.model.googleai.GoogleAiGeminiChatModel` + +`name`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The identifier of your model, this is used by any component that intends to use the model (e.g., `DocumentEnrichmentUpdateProcessorFactory` update processor). + +`params`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: none +|=== ++ +Each model class has potentially different params. 
+Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: https://docs.langchain4j.dev/category/language-models[Chat Models in LangChain4j]. + +=== Supported Models +Apache Solr uses https://github.com/langchain4j/langchain4j[LangChain4j] to support document enrichment with LLMs. +The models currently supported are: + +[tabs#supported-chat-models] +====== +Ollama:: ++ +==== + +[source,json] +---- +{ + "class": "dev.langchain4j.model.ollama.OllamaChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "http://localhost:11434", + "modelName": "<a-local/hosted-chat-model>", + "timeout": 300, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +MistralAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.mistralai.MistralAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.mistral.ai/v1", + "apiKey": "<your-mistralAI-api-key>", + "modelName": "<a-mistralAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +OpenAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "<your-openAI-api-key>", + "modelName": "<a-openAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Anthropic:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.anthropic.AnthropicChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.anthropic.com/v1/", + "apiKey": "<your-anthropic-api-key>", + "modelName": "<a-anthropic-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Gemini:: ++ +==== 
+[source,json] +---- +{ + "class": "dev.langchain4j.model.googleai.GoogleAiGeminiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://generativelanguage.googleapis.com/v1beta/", + "apiKey": "<your-geminiAi-api-key>", + "modelName": "<a-geminiAi-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +====== + +=== Uploading a Model + +To upload the model in a `/path/myModel.json` file, please run: + +[source,bash] +---- +curl -XPUT 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json' +---- + +To delete the `currentModel` model: + +[source,bash] +---- +curl -XDELETE 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store/currentModel' +---- + +To view all models: + +[source,text] +http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store + + +.Example: /path/myOpenAIModel.json +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "openai-1", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "apiKey-openAI", + "modelName": "gpt-5.4-nano", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- + +== How to Trigger Document Enrichment during Indexing +To create new fields from existing document fields at indexing time, configure an +{solr-javadocs}/core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html[UpdateRequestProcessorChain] that +includes at least one DocumentEnrichmentUpdateProcessor update request processor. + +Several parameters must be defined: + +`inputField`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +One (or more) `inputField` needs to be injected in the prompt. 
This is done by some special tokens, that are the

Review Comment:
At most, you could say that every declared `inputField` must be referenced in the prompt.

##########
solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc:
##########
@@ -0,0 +1,543 @@
+= Document Enrichment with LLMs
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This module brings the power of *Large Language Models* to Solr.
+
+More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted
+data. Given a prompt and a set of input fields, for each document, the LLM is invoked through
+https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support
+multiple types and may also be multivalued.
+
+_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing.
+
+[IMPORTANT]
+====
+This module sends your documents off to some hosted service on the internet.
+There are cost, privacy, performance, and service availability implications on such a strong dependency that should be
+diligently examined before employing this module in a serious way.
+ +==== + +At the moment, Solr supports a subset of the LLM providers available in LangChain4j. + +*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services. + +If you want to add support for additional services or improve the support for the existing ones, feel free to +contribute: + +* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr] + +== Module + +This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be +enabled before use. + +== Language Model Configuration + +Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`. + +=== Minimum Requirements + +* Enable the `language-models` module to make the Language Models classes available on Solr's classpath. +See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details. + +* An update processor, similar to the one below, must be declared in `solrconfig.xml`: ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +== Chat Model Setup + +=== Models + +* A model is a chat model that generates a text response given a prompt. +* A model is a reference to an external API that runs the Large Language Model responsible for chat completion. 
+ +[IMPORTANT] +==== +The Solr chat model specifies the parameters to access the APIs, the LLM doesn't run internally in Solr + +==== + +A model is described by these parameters: + + +`class`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The model implementation. +Accepted values: + +* `dev.langchain4j.model.ollama.OllamaChatModel` +* `dev.langchain4j.model.mistralai.MistralAiChatModel` +* `dev.langchain4j.model.anthropic.AnthropicChatModel` +* `dev.langchain4j.model.openai.OpenAiChatModel` +* `dev.langchain4j.model.googleai.GoogleAiGeminiChatModel` + +`name`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The identifier of your model, this is used by any component that intends to use the model (e.g., `DocumentEnrichmentUpdateProcessorFactory` update processor). + +`params`:: ++ +[%autowidth,frame=none] +|=== +|Optional |Default: none +|=== ++ +Each model class has potentially different params. +Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: https://docs.langchain4j.dev/category/language-models[Chat Models in LangChain4j]. + +=== Supported Models +Apache Solr uses https://github.com/langchain4j/langchain4j[LangChain4j] to support document enrichment with LLMs. 
+The models currently supported are: + +[tabs#supported-chat-models] +====== +Ollama:: ++ +==== + +[source,json] +---- +{ + "class": "dev.langchain4j.model.ollama.OllamaChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "http://localhost:11434", + "modelName": "<a-local/hosted-chat-model>", + "timeout": 300, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +MistralAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.mistralai.MistralAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.mistral.ai/v1", + "apiKey": "<your-mistralAI-api-key>", + "modelName": "<a-mistralAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +OpenAI:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.openai.OpenAiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.openai.com/v1", + "apiKey": "<your-openAI-api-key>", + "modelName": "<a-openAI-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Anthropic:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.anthropic.AnthropicChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://api.anthropic.com/v1/", + "apiKey": "<your-anthropic-api-key>", + "modelName": "<a-anthropic-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== + +Gemini:: ++ +==== +[source,json] +---- +{ + "class": "dev.langchain4j.model.googleai.GoogleAiGeminiChatModel", + "name": "<a-name-for-your-model>", + "params": { + "baseUrl": "https://generativelanguage.googleapis.com/v1beta/", + "apiKey": "<your-geminiAi-api-key>", + "modelName": "<a-geminiAi-chat-model>", + "timeout": 60, + "logRequests": true, + "logResponses": true, + "maxRetries": 5 + } +} +---- +==== +====== + 
+=== Uploading a Model
+
+To upload the model in a `/path/myModel.json` file, please run:
+
+[source,bash]
+----
+curl -XPUT 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'
+----
+
+To delete the `currentModel` model:
+
+[source,bash]
+----
+curl -XDELETE 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store/currentModel'
+----
+
+To view all models:
+
+[source,text]
+http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store
+
+
+.Example: /path/myOpenAIModel.json
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.openai.OpenAiChatModel",
+  "name": "openai-1",
+  "params": {
+    "baseUrl": "https://api.openai.com/v1",
+    "apiKey": "apiKey-openAI",
+    "modelName": "gpt-5.4-nano",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+
+== How to Trigger Document Enrichment during Indexing
+To create new fields from existing document fields at indexing time, configure an
+{solr-javadocs}/core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html[UpdateRequestProcessorChain] that
+includes at least one DocumentEnrichmentUpdateProcessor update request processor.
+
+Several parameters must be defined:
+
+`inputField`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+One (or more) `inputField` needs to be injected in the prompt. This is done by some special tokens, that are the
+`fieldName` surrounded by curly brackets (e.g., `{string_field}`, in the example at the
+xref:document-enrichment-with-llms.adoc#minimum-requirements[top of the page]). These tokens are _mandatory_ for this
+module to work properly. Solr will throw an error if the parameters are not properly defined.
+For example, both the prompt or the content of the file `prompt.txt`, must contain the text '{string_field}', which
+will be substituted with the content of the `string_field` field for each document. An example of a valid prompt with
+multiple input fields is as follows:
+
++
+[source,xml]
+----
+<updateRequestProcessorChain name="documentEnrichment">
+  <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory">
+    <str name="inputField">title</str>
+    <str name="inputField">body</str>
+    <str name="outputField">summary</str>
+    <str name="prompt">Summarize with the following information. Title: {title}. Body: {body}.</str>
+    <str name="model">chat-model</str>
+  </processor>
+  <processor class="solr.RunUpdateProcessorFactory"/>
+ </updateRequestProcessorChain>
+----
+
++
+Multiple `inputField` could also be defined by using the following notation:
+
++
+[source,xml]
+----
+<arr name="inputField">
+  <str>title</str>
+  <str>body</str>
+</arr>
+----
+
+
+`outputField`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The LLM response is mapped to the specified `outputField`, and only one field is supported as output. Note that this
+module only supports a subset of Solr's available field types, which includes:
+
+* *String/Text*: `StrField`, `TextField`, `SortableTextField`
+* *Date*: `DatePointField` (the LLM must return an ISO-8601 date string; it might be useful to tune your prompt accordingly, to avoid indexing errors)
+* *Numeric*: `IntPointField`, `LongPointField`, `FloatPointField`, `DoublePointField`
+* *Boolean*: `BoolField`
+
+
+These fields _can_ be multivalued. Solr uses structured output from LangChain4j to deal with LLMs' responses.
+
+
+`prompt` or `promptFile`::

Review Comment:
here I would say that there are two ways of defining a prompt, one directly in the config and one through a file...
then I would explain how the prompt should be structured and the thing related to the inputFields

##########
solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc:
##########
@@ -0,0 +1,543 @@
+= Document Enrichment with LLMs
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+This module brings the power of *Large Language Models* to Solr.
+
+More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted
+data. Given a prompt and a set of input fields, for each document, the LLM is invoked through
+https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an outputField, which can support
+multiple types and may also be multivalued.
+
+_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing.
+
+[IMPORTANT]
+====
+This module sends your documents off to some hosted service on the internet.
+There are cost, privacy, performance, and service availability implications on such a strong dependency that should be
+diligently examined before employing this module in a serious way.
+
+====
+
+At the moment, Solr supports a subset of the LLM providers available in LangChain4j.
+
+*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services.
+
+If you want to add support for additional services or improve the support for the existing ones, feel free to
+contribute:
+
+* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr]
+
+== Module
+
+This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be
+enabled before use.
+
+== Language Model Configuration
+
+Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`.
+
+=== Minimum Requirements
+
+* Enable the `language-models` module to make the Language Models classes available on Solr's classpath.
+See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details.
+
+* An update processor, similar to the one below, must be declared in `solrconfig.xml`:
++
+[source,xml]
+----
+<updateRequestProcessorChain name="documentEnrichment">
+  <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory">
+    <str name="inputField">string_field</str>
+    <str name="outputField">summary</str>
+    <str name="prompt">Summarize this content: {string_field}</str>
+    <str name="model">model-name</str>
+  </processor>
+  <processor class="solr.RunUpdateProcessorFactory"/>
+ </updateRequestProcessorChain>
+----
+
+== Chat Model Setup
+
+=== Models
+
+* A model is a chat model that generates a text response given a prompt.
+* A model is a reference to an external API that runs the Large Language Model responsible for chat completion.
+
+[IMPORTANT]
+====
+The Solr chat model specifies the parameters to access the APIs, the LLM doesn't run internally in Solr
+
+====
+
+A model is described by these parameters:
+
+
+`class`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The model implementation.
+Accepted values:
+
+* `dev.langchain4j.model.ollama.OllamaChatModel`
+* `dev.langchain4j.model.mistralai.MistralAiChatModel`
+* `dev.langchain4j.model.anthropic.AnthropicChatModel`
+* `dev.langchain4j.model.openai.OpenAiChatModel`
+* `dev.langchain4j.model.googleai.GoogleAiGeminiChatModel`
+
+`name`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+The identifier of your model, this is used by any component that intends to use the model (e.g., `DocumentEnrichmentUpdateProcessorFactory` update processor).
+
+`params`::
++
+[%autowidth,frame=none]
+|===
+|Optional |Default: none
+|===
++
+Each model class has potentially different params.
+Many are shared but for the full set of parameters of the model you are interested in please refer to the official documentation of the LangChain4j version included in Solr: https://docs.langchain4j.dev/category/language-models[Chat Models in LangChain4j].
+
+=== Supported Models
+Apache Solr uses https://github.com/langchain4j/langchain4j[LangChain4j] to support document enrichment with LLMs.
+The models currently supported are:
+
+[tabs#supported-chat-models]
+======
+Ollama::
++
+====
+
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.ollama.OllamaChatModel",
+  "name": "<a-name-for-your-model>",
+  "params": {
+    "baseUrl": "http://localhost:11434",
+    "modelName": "<a-local/hosted-chat-model>",
+    "timeout": 300,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+====
+
+MistralAI::
++
+====
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.mistralai.MistralAiChatModel",
+  "name": "<a-name-for-your-model>",
+  "params": {
+    "baseUrl": "https://api.mistral.ai/v1",
+    "apiKey": "<your-mistralAI-api-key>",
+    "modelName": "<a-mistralAI-chat-model>",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+====
+OpenAI::
++
+====
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.openai.OpenAiChatModel",
+  "name": "<a-name-for-your-model>",
+  "params": {
+    "baseUrl": "https://api.openai.com/v1",
+    "apiKey": "<your-openAI-api-key>",
+    "modelName": "<a-openAI-chat-model>",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+====
+
+Anthropic::
++
+====
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.anthropic.AnthropicChatModel",
+  "name": "<a-name-for-your-model>",
+  "params": {
+    "baseUrl": "https://api.anthropic.com/v1/",
+    "apiKey": "<your-anthropic-api-key>",
+    "modelName": "<a-anthropic-chat-model>",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+====
+
+Gemini::
++
+====
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.googleai.GoogleAiGeminiChatModel",
+  "name": "<a-name-for-your-model>",
+  "params": {
+    "baseUrl": "https://generativelanguage.googleapis.com/v1beta/",
+    "apiKey": "<your-geminiAi-api-key>",
+    "modelName": "<a-geminiAi-chat-model>",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+====
+======
+
+=== Uploading a Model
+
+To upload the model in a `/path/myModel.json` file, please run:
+
+[source,bash]
+----
+curl -XPUT 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store' --data-binary "@/path/myModel.json" -H 'Content-type:application/json'
+----
+
+To delete the `currentModel` model:
+
+[source,bash]
+----
+curl -XDELETE 'http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store/currentModel'
+----
+
+To view all models:
+
+[source,text]
+http://localhost:8983/solr/YOUR_COLLECTION/schema/chat-model-store
+
+
+.Example: /path/myOpenAIModel.json
+[source,json]
+----
+{
+  "class": "dev.langchain4j.model.openai.OpenAiChatModel",
+  "name": "openai-1",
+  "params": {
+    "baseUrl": "https://api.openai.com/v1",
+    "apiKey": "apiKey-openAI",
+    "modelName": "gpt-5.4-nano",
+    "timeout": 60,
+    "logRequests": true,
+    "logResponses": true,
+    "maxRetries": 5
+  }
+}
+----
+
+== How to Trigger Document Enrichment during Indexing
+To create new fields from existing document fields at indexing time, configure an
+{solr-javadocs}/core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html[UpdateRequestProcessorChain] that
+includes at least one DocumentEnrichmentUpdateProcessor update request processor.
+
+Several parameters must be defined:
+
+`inputField`::
++
+[%autowidth,frame=none]
+|===
+s|Required |Default: none
+|===
++
+One (or more) `inputField` needs to be injected in the prompt. This is done by some special tokens, that are the

Review Comment:
I would just say this is the field whose content is used as input/passed to the LLM to enrich the document. And that there could be more than one inputField defined. I would move the other part about the special tokens to the prompt parameter explanation.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
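
As an illustration of the token convention that both review comments refer to, here is a minimal Python sketch of how `{fieldName}` placeholders might be checked and substituted. It is not the module's Java implementation: `render_prompt`, the regular expression, and the error message are hypothetical. The handling of a document that lacks an input field is also an assumption (here it simply raises `KeyError`).

```python
import re

# Hypothetical pattern for the `{fieldName}` tokens described in the quoted docs.
TOKEN = re.compile(r"\{(\w+)\}")


def render_prompt(prompt, input_fields, document):
    """Replace each `{field}` token in `prompt` with that field's value.

    Illustrative only: this mimics the documented convention, not Solr's code.
    """
    tokens = set(TOKEN.findall(prompt))
    # Every declared inputField must appear in the prompt as a token.
    missing = [f for f in input_fields if f not in tokens]
    if missing:
        raise ValueError(f"prompt lacks tokens for input fields: {missing}")

    def substitute(match):
        name = match.group(1)
        # Only declared input fields are substituted; other braces stay as-is.
        if name in input_fields:
            return str(document[name])  # assumption: a missing field raises KeyError
        return match.group(0)

    return TOKEN.sub(substitute, prompt)


doc = {"title": "Solr 10", "body": "Release notes"}
prompt = "Summarize with the following information. Title: {title}. Body: {body}."
print(render_prompt(prompt, ["title", "body"], doc))
```

Splitting the documentation as suggested would map onto this sketch naturally: `inputField` only names which document fields are read, while the `prompt` explanation covers the token syntax and the validation step.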
