alessandrobenedetti commented on code in PR #4259: URL: https://github.com/apache/solr/pull/4259#discussion_r3122738302
########## solr/modules/language-models/src/java/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java: ########## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.languagemodels.documentenrichment.update.processor; + +import dev.langchain4j.model.chat.request.ResponseFormat; +import dev.langchain4j.model.chat.request.ResponseFormatType; +import dev.langchain4j.model.chat.request.json.JsonArraySchema; +import dev.langchain4j.model.chat.request.json.JsonBooleanSchema; +import dev.langchain4j.model.chat.request.json.JsonIntegerSchema; +import dev.langchain4j.model.chat.request.json.JsonNumberSchema; +import dev.langchain4j.model.chat.request.json.JsonObjectSchema; +import dev.langchain4j.model.chat.request.json.JsonSchema; +import dev.langchain4j.model.chat.request.json.JsonSchemaElement; +import dev.langchain4j.model.chat.request.json.JsonStringSchema; +import java.io.IOException; +import java.io.InputStream; +import java.nio.charset.StandardCharsets; +import java.util.Collection; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.regex.Matcher; +import 
java.util.regex.Pattern; +import org.apache.solr.common.SolrException; +import org.apache.solr.common.params.RequiredSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.core.SolrCore; +import org.apache.solr.core.SolrResourceLoader; +import org.apache.solr.languagemodels.documentenrichment.model.SolrChatModel; +import org.apache.solr.languagemodels.documentenrichment.store.rest.ManagedChatModelStore; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.response.SolrQueryResponse; +import org.apache.solr.rest.ManagedResource; +import org.apache.solr.rest.ManagedResourceObserver; +import org.apache.solr.schema.BoolField; +import org.apache.solr.schema.DatePointField; +import org.apache.solr.schema.DenseVectorField; +import org.apache.solr.schema.DoublePointField; +import org.apache.solr.schema.FieldType; +import org.apache.solr.schema.FloatPointField; +import org.apache.solr.schema.IndexSchema; +import org.apache.solr.schema.IntPointField; +import org.apache.solr.schema.LongPointField; +import org.apache.solr.schema.NestPathField; +import org.apache.solr.schema.SchemaField; +import org.apache.solr.schema.StrField; +import org.apache.solr.schema.TextField; +import org.apache.solr.schema.UUIDField; +import org.apache.solr.update.processor.UpdateRequestProcessor; +import org.apache.solr.update.processor.UpdateRequestProcessorFactory; +import org.apache.solr.util.plugin.SolrCoreAware; + +/** + * Generate the content of a field based on other fields specified as input. + * + * <p>One or more {@code inputField} parameters specify the Solr fields to use as input. Each field + * name must appear as a {@code {fieldName}} placeholder in the prompt. Exactly one of {@code + * prompt} or {@code promptFile} must be provided. 
+ * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <str name="inputField">title_field</str> + * <str name="inputField">body_field</str> + * <str name="outputField">enriched_field</str> + * <str name="prompt">Title: {title_field}. Body: {body_field}.</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Multiple {@code inputField} values can also be declared as an array using {@code arr}: + * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <arr name="inputField"> + * <str>title_field</str> + * <str>body_field</str> + * </arr> + * <str name="outputField">enriched_field</str> + * <str name="prompt">Title: {title_field}. Body: {body_field}.</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Alternatively, the prompt can be loaded from a text file using {@code promptFile}: + * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <str name="inputField">title_field</str> + * <str name="outputField">enriched_field</str> + * <str name="promptFile">prompt.txt</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Validation rules: + * + * <ul> + * <li>At least one {@code inputField} must be declared. + * <li>Exactly one of {@code prompt} or {@code promptFile} must be provided. + * <li>Every declared {@code inputField} must have a corresponding {@code {fieldName}} placeholder + * in the prompt. + * <li>Every {@code {placeholder}} in the prompt must correspond to a declared {@code inputField}. 
+ * </ul> + */ Review Comment: The comment is full of encoded symbols and is basically unreadable. It was possibly written with Javadoc rendering in mind only, but in general a comment should also be readable directly in the source code. ########## solr/modules/language-models/src/java/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java: ########## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + [...] +public class DocumentEnrichmentUpdateProcessorFactory extends UpdateRequestProcessorFactory + implements SolrCoreAware, ManagedResourceObserver { + private static final String INPUT_FIELD_PARAM = "inputField"; + private static final String OUTPUT_FIELD_PARAM = "outputField"; + private static final String PROMPT = "prompt"; + private static final String PROMPT_FILE = "promptFile"; + private static final String MODEL_NAME = "model"; + private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\{([^}]+)\\}"); + + private List<String> inputFields; + private String outputField; + private String promptText; + private String promptFile; + private String modelName; + + @Override + public void init(final NamedList<?> args) { + // removeConfigArgs handles both multiple <str name="inputField"> and <arr name="inputField"> + // and must be called before toSolrParams() since it mutates args in place + Collection<String> fieldNames = args.removeConfigArgs(INPUT_FIELD_PARAM); + if (fieldNames.isEmpty()) { + throw new SolrException( + 
SolrException.ErrorCode.SERVER_ERROR, "At least one 'inputField' must be provided"); + } + inputFields = List.copyOf(fieldNames); + + Collection<String> outputFields = args.removeConfigArgs(OUTPUT_FIELD_PARAM); + if (outputFields.isEmpty()) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Exactly one 'outputField' must be provided"); + } + if (outputFields.size() > 1) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "Only one 'outputField' can be provided, but found: " + outputFields); + } + outputField = outputFields.iterator().next(); + + SolrParams params = args.toSolrParams(); + RequiredSolrParams required = params.required(); + modelName = required.get(MODEL_NAME); + + String inlinePrompt = params.get(PROMPT); + String promptFilePath = params.get(PROMPT_FILE); + + if (inlinePrompt == null && promptFilePath == null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Either 'prompt' or 'promptFile' must be provided"); + } + if (inlinePrompt != null && promptFilePath != null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "Only one of 'prompt' or 'promptFile' can be provided, not both"); + } + if (inlinePrompt != null) { + validatePromptPlaceholders(inlinePrompt, inputFields); + this.promptText = inlinePrompt; + } + this.promptFile = promptFilePath; + } + + @Override + public void inform(SolrCore core) { + final SolrResourceLoader solrResourceLoader = core.getResourceLoader(); + ManagedChatModelStore.registerManagedChatModelStore(solrResourceLoader, this); + if (promptFile != null) { + try (InputStream is = solrResourceLoader.openResource(promptFile)) { + promptText = new String(is.readAllBytes(), StandardCharsets.UTF_8).trim(); + } catch (IOException e) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Cannot read prompt file: " + promptFile, e); + } + validatePromptPlaceholders(promptText, inputFields); + } + } + + @Override + public void 
onManagedResourceInitialized(NamedList<?> args, ManagedResource res) + throws SolrException { + if (res instanceof ManagedChatModelStore store) { + store.loadStoredModels(); + } + } + + @Override + public UpdateRequestProcessor getInstance( + SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) { + IndexSchema latestSchema = req.getCore().getLatestSchema(); + + for (String fieldName : inputFields) { + if (!latestSchema.isDynamicField(fieldName) && !latestSchema.hasExplicitField(fieldName)) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "undefined field: \"" + fieldName + "\""); + } + } + + final SchemaField outputFieldSchema = latestSchema.getField(outputField); + + ResponseFormat responseFormat = buildResponseFormat(outputFieldSchema); + boolean multiValued = outputFieldSchema.multiValued(); + + ManagedChatModelStore store = ManagedChatModelStore.getManagedModelStore(req.getCore()); + SolrChatModel chatModel = store.getModel(modelName); + if (chatModel == null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "The model configured in the Update Request Processor '" + + modelName + + "' can't be found in the store: " + + ManagedChatModelStore.REST_END_POINT); + } + + return new DocumentEnrichmentUpdateProcessor( + inputFields, outputField, promptText, chatModel, multiValued, responseFormat, req, next); + } + + /** + * Builds a {@link ResponseFormat} that instructs the model to return a JSON object {@code + * {"value": ...}} whose value type matches the Solr field type. For multivalued fields the value + * is wrapped in a {@link JsonArraySchema} nested inside the root {@link JsonObjectSchema}. + * + * <p>Nesting {@link JsonArraySchema} inside a {@link JsonObjectSchema} property is supported by + * all langchain4j providers that implement structured outputs with {@link JsonObjectSchema} + * (OpenAI, Azure OpenAI, Google AI, Gemini, Mistral, Ollama, Amazon Bedrock, Watsonx). 
+ */ + static ResponseFormat buildResponseFormat(SchemaField schemaField) { + JsonSchemaElement valueElement = toJsonSchemaElement(schemaField.getType()); + JsonSchemaElement valueSchema = + schemaField.multiValued() + ? JsonArraySchema.builder().items(valueElement).build() + : valueElement; + return ResponseFormat.builder() + .type(ResponseFormatType.JSON) + .jsonSchema( + JsonSchema.builder() + .name("output") + .rootElement( + JsonObjectSchema.builder() + .addProperty("value", valueSchema) + .required("value") + .build()) + .build()) + .build(); + } Review Comment: will need to clarify this via a call ########## solr/modules/language-models/src/test-files/modelChatExamples/dummy-chat-model-unsupported.json: ########## @@ -0,0 +1,8 @@ +{ + "class": "org.apache.solr.languagemodels.documentenrichment.model.DummyChatModel", + "name": "dummy-chat-1", + "params": { + "response": "enriched content", + "unsupported": 10 Review Comment: unsupported? ########## solr/modules/language-models/src/java/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java: ########## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + [...] + /** + * Builds a {@link ResponseFormat} that instructs the model to return a JSON object {@code + * {"value": ...}} whose value type matches the Solr field type. For multivalued fields the value + * is wrapped in a {@link JsonArraySchema} nested inside the root {@link JsonObjectSchema}. + * + * <p>Nesting {@link JsonArraySchema} inside a {@link JsonObjectSchema} property is supported by + * all langchain4j providers that implement structured outputs with {@link JsonObjectSchema} + * (OpenAI, Azure OpenAI, Google AI, Gemini, Mistral, Ollama, Amazon Bedrock, Watsonx). 
+ */ + static ResponseFormat buildResponseFormat(SchemaField schemaField) { + JsonSchemaElement valueElement = toJsonSchemaElement(schemaField.getType()); + JsonSchemaElement valueSchema = + schemaField.multiValued() + ? JsonArraySchema.builder().items(valueElement).build() + : valueElement; + return ResponseFormat.builder() + .type(ResponseFormatType.JSON) + .jsonSchema( + JsonSchema.builder() + .name("output") + .rootElement( + JsonObjectSchema.builder() + .addProperty("value", valueSchema) + .required("value") + .build()) + .build()) + .build(); + } + + private static JsonSchemaElement toJsonSchemaElement(FieldType fieldType) { + // DenseVectorField extends FloatPointField, so it must be rejected before the numeric checks + if (fieldType instanceof DenseVectorField + || fieldType instanceof UUIDField + || fieldType instanceof NestPathField) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "field type is not supported by Document Enrichment: " + + fieldType.getClass().getSimpleName()); + } + if (fieldType instanceof StrField + || fieldType instanceof TextField + || fieldType instanceof DatePointField) { + return new JsonStringSchema(); + } else if (fieldType instanceof IntPointField || fieldType instanceof LongPointField) { + return new JsonIntegerSchema(); + } else if (fieldType instanceof FloatPointField || fieldType instanceof DoublePointField) { + return new JsonNumberSchema(); + } else if (fieldType instanceof BoolField) { + return new JsonBooleanSchema(); + } else { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "field type is not supported by Document Enrichment: " + + fieldType.getClass().getSimpleName()); + } + } + + private static void validatePromptPlaceholders(String prompt, List<String> fieldNames) { Review Comment: fieldNames -> inputFields ########## solr/modules/language-models/src/test/org/apache/solr/languagemodels/documentenrichment/model/DummyChatModel.java: ########## @@ -0,0 +1,85 @@ +/* + * 
Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.solr.languagemodels.documentenrichment.model; + +import dev.langchain4j.data.message.AiMessage; +import dev.langchain4j.data.message.UserMessage; +import dev.langchain4j.model.chat.ChatModel; +import dev.langchain4j.model.chat.request.ChatRequest; +import dev.langchain4j.model.chat.response.ChatResponse; + +/** + * A deterministic {@link ChatModel} for testing. It returns a fixed response string regardless of + * the input, allowing tests to assert exact enriched-field values without real API calls. + * + * <p>The builder also exposes {@code unsupported} and {@code ambiguous} setter methods to exercise + * the reflection-based parameter handling in {@link + * org.apache.solr.languagemodels.documentenrichment.model.SolrChatModel#getInstance}. + */ +public class DummyChatModel implements ChatModel { + + /** The text of the last prompt received by any instance. Useful for test assertions. 
*/ + public static String lastReceivedPrompt; + + private final String response; + + public DummyChatModel(String response) { + this.response = response; + } + + @Override + public ChatResponse chat(ChatRequest chatRequest) { + lastReceivedPrompt = ((UserMessage) chatRequest.messages().getFirst()).singleText(); + return ChatResponse.builder().aiMessage(AiMessage.from(response)).build(); + } + + public static DummyChatModelBuilder builder() { + return new DummyChatModelBuilder(); + } + + public static class DummyChatModelBuilder { + private String response = "dummy response"; + private int intValue; + + public DummyChatModelBuilder() {} + + public DummyChatModelBuilder response(String response) { + this.response = response; + return this; + } + + /** Intentionally has no String overload so the reflection code raises a BAD_REQUEST error. */ + public DummyChatModelBuilder unsupported(Integer input) { + return this; + } + + /** Two overloads make this param "ambiguous": the reflection code should default to String. */ + public DummyChatModelBuilder ambiguous(int input) { + this.intValue = input; + return this; + } + + public DummyChatModelBuilder ambiguous(String input) { + this.intValue = Integer.valueOf(input); + return this; + } Review Comment: will have to elaborate this ########## solr/modules/language-models/src/test/org/apache/solr/languagemodels/documentenrichment/store/rest/TestChatModelManager.java: ########## Review Comment: not sure we need a separate model manager, but tests are ok and may need to be relocated ########## solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc: ########## @@ -421,6 +421,10 @@ The {solr-javadocs}/modules/language-models/index.html[`language-models`] module It uses external text to vectors language models to perform the vectorisation for each processed document. 
For more information: xref:query-guide:text-to-vector.adoc[Update Request Processor] +{solr-javadocs}/modules/language-models/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.html[DocumentEnrichmentUpdateProcessorFactory]:: Update processor which, starting from one or more fields in input and a given prompt, adds the output of an LLM as the value of a new field. +It uses external chat language models to perform the enrichment of each processed document. Review Comment: It uses external Large Language Model services to perform the enrichment of each processed document. ########## solr/modules/language-models/src/test-files/modelChatExamples/dummy-chat-model-ambiguous.json: ########## @@ -0,0 +1,8 @@ +{ + "class": "org.apache.solr.languagemodels.documentenrichment.model.DummyChatModel", + "name": "dummy-chat-1", + "params": { + "response": "enriched content", + "ambiguous": 10 Review Comment: ambiguous? ########## solr/modules/language-models/src/test-files/solr/collection1/conf/enumsConfig.xml: ########## Review Comment: What's this file? ########## solr/modules/language-models/src/test-files/solr/collection1/conf/solrconfig-document-enrichment.xml: ########## @@ -0,0 +1,258 @@ +<?xml version="1.0" ?> +<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor + license agreements. See the NOTICE file distributed with this work for additional + information regarding copyright ownership. The ASF licenses this file to + You under the Apache License, Version 2.0 (the "License"); you may not use + this file except in compliance with the License. You may obtain a copy of + the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required + by applicable law or agreed to in writing, software distributed under the + License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS + OF ANY KIND, either express or implied. 
See the License for the specific + language governing permissions and limitations under the License. --> + +<config> + <luceneMatchVersion>${tests.luceneMatchVersion:LATEST}</luceneMatchVersion> + <dataDir>${solr.data.dir:}</dataDir> + <directoryFactory name="DirectoryFactory" + class="${solr.directoryFactory:solr.MockDirectoryFactory}" /> + <schemaFactory class="ClassicIndexSchemaFactory" /> + + <requestDispatcher> + <requestParsers /> + </requestDispatcher> + + <query> + <filterCache class="solr.CaffeineCache" size="4096" + initialSize="2048" autowarmCount="0" /> + </query> + <requestHandler name="/select" class="solr.SearchHandler" /> + + <updateHandler class="solr.DirectUpdateHandler2"> + <autoCommit> + <maxTime>15000</maxTime> + <openSearcher>false</openSearcher> + </autoCommit> + <autoSoftCommit> + <maxTime>1000</maxTime> + </autoSoftCommit> + <updateLog> + <str name="dir">${solr.data.dir:}</str> + </updateLog> + </updateHandler> + + <requestHandler name="/query" class="solr.SearchHandler"> + <lst name="defaults"> + <str name="echoParams">explicit</str> + <str name="wt">json</str> + <str name="indent">true</str> + <str name="df">id</str> + </lst> + </requestHandler> + + <updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">enriched_field</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">dummy-chat-1</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> + + <updateRequestProcessorChain name="documentEnrichmentArrInputField"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <arr name="inputField"> + <str>string_field</str> + <str>body_field</str> + </arr> + <str name="outputField">enriched_field</str> + <str 
name="prompt">Title: {string_field}. Body: {body_field}.</str> Review Comment: I don't understand this prompt, what type of enrichment do we expect? ########## solr/modules/language-models/src/java/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.java: ########## @@ -0,0 +1,338 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.solr.languagemodels.documentenrichment.update.processor; + +import dev.langchain4j.model.chat.request.ResponseFormat; +import dev.langchain4j.model.chat.request.ResponseFormatType; +import dev.langchain4j.model.chat.request.json.JsonArraySchema; +import dev.langchain4j.model.chat.request.json.JsonBooleanSchema; +import dev.langchain4j.model.chat.request.json.JsonIntegerSchema; +import dev.langchain4j.model.chat.request.json.JsonNumberSchema; +import dev.langchain4j.model.chat.request.json.JsonObjectSchema; +import dev.langchain4j.model.chat.request.json.JsonSchema; +import dev.langchain4j.model.chat.request.json.JsonSchemaElement; +import dev.langchain4j.model.chat.request.json.JsonStringSchema; +import java.io.IOException; +import java.io.InputStream; +import java.nio.charset.StandardCharsets; +import java.util.Collection; +import java.util.HashSet; +import java.util.List; +import java.util.Set; +import java.util.regex.Matcher; +import java.util.regex.Pattern; +import org.apache.solr.common.SolrException; +import org.apache.solr.common.params.RequiredSolrParams; +import org.apache.solr.common.params.SolrParams; +import org.apache.solr.common.util.NamedList; +import org.apache.solr.core.SolrCore; +import org.apache.solr.core.SolrResourceLoader; +import org.apache.solr.languagemodels.documentenrichment.model.SolrChatModel; +import org.apache.solr.languagemodels.documentenrichment.store.rest.ManagedChatModelStore; +import org.apache.solr.request.SolrQueryRequest; +import org.apache.solr.response.SolrQueryResponse; +import org.apache.solr.rest.ManagedResource; +import org.apache.solr.rest.ManagedResourceObserver; +import org.apache.solr.schema.BoolField; +import org.apache.solr.schema.DatePointField; +import org.apache.solr.schema.DenseVectorField; +import org.apache.solr.schema.DoublePointField; +import org.apache.solr.schema.FieldType; +import org.apache.solr.schema.FloatPointField; +import org.apache.solr.schema.IndexSchema; +import 
org.apache.solr.schema.IntPointField; +import org.apache.solr.schema.LongPointField; +import org.apache.solr.schema.NestPathField; +import org.apache.solr.schema.SchemaField; +import org.apache.solr.schema.StrField; +import org.apache.solr.schema.TextField; +import org.apache.solr.schema.UUIDField; +import org.apache.solr.update.processor.UpdateRequestProcessor; +import org.apache.solr.update.processor.UpdateRequestProcessorFactory; +import org.apache.solr.util.plugin.SolrCoreAware; + +/** + * Generate the content of a field based on other fields specified as input. + * + * <p>One or more {@code inputField} parameters specify the Solr fields to use as input. Each field + * name must appear as a {@code {fieldName}} placeholder in the prompt. Exactly one of {@code + * prompt} or {@code promptFile} must be provided. + * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <str name="inputField">title_field</str> + * <str name="inputField">body_field</str> + * <str name="outputField">enriched_field</str> + * <str name="prompt">Title: {title_field}. Body: {body_field}.</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Multiple {@code inputField} values can also be declared as an array using {@code arr}: + * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <arr name="inputField"> + * <str>title_field</str> + * <str>body_field</str> + * </arr> + * <str name="outputField">enriched_field</str> + * <str name="prompt">Title: {title_field}. 
Body: {body_field}.</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Alternatively, the prompt can be loaded from a text file using {@code promptFile}: + * + * <pre class="prettyprint" > + * <processor class="solr.llm.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + * <str name="inputField">title_field</str> + * <str name="outputField">enriched_field</str> + * <str name="promptFile">prompt.txt</str> + * <str name="model">ChatModel</str> + * </processor> + * </pre> + * + * <p>Validation rules: + * + * <ul> + * <li>At least one {@code inputField} must be declared. + * <li>Exactly one of {@code prompt} or {@code promptFile} must be provided. + * <li>Every declared {@code inputField} must have a corresponding {@code {fieldName}} placeholder + * in the prompt. + * <li>Every {@code {placeholder}} in the prompt must correspond to a declared {@code inputField}. + * </ul> + */ +public class DocumentEnrichmentUpdateProcessorFactory extends UpdateRequestProcessorFactory + implements SolrCoreAware, ManagedResourceObserver { + private static final String INPUT_FIELD_PARAM = "inputField"; + private static final String OUTPUT_FIELD_PARAM = "outputField"; + private static final String PROMPT = "prompt"; + private static final String PROMPT_FILE = "promptFile"; + private static final String MODEL_NAME = "model"; + private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\{([^}]+)\\}"); + + private List<String> inputFields; + private String outputField; + private String promptText; + private String promptFile; + private String modelName; + + @Override + public void init(final NamedList<?> args) { + // removeConfigArgs handles both multiple <str name="inputField"> and <arr name="inputField"> + // and must be called before toSolrParams() since it mutates args in place + Collection<String> fieldNames = args.removeConfigArgs(INPUT_FIELD_PARAM); + if (fieldNames.isEmpty()) { + throw new SolrException( + 
SolrException.ErrorCode.SERVER_ERROR, "At least one 'inputField' must be provided"); + } + inputFields = List.copyOf(fieldNames); + + Collection<String> outputFields = args.removeConfigArgs(OUTPUT_FIELD_PARAM); + if (outputFields.isEmpty()) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Exactly one 'outputField' must be provided"); + } + if (outputFields.size() > 1) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "Only one 'outputField' can be provided, but found: " + outputFields); + } + outputField = outputFields.iterator().next(); + + SolrParams params = args.toSolrParams(); + RequiredSolrParams required = params.required(); + modelName = required.get(MODEL_NAME); + + String inlinePrompt = params.get(PROMPT); + String promptFilePath = params.get(PROMPT_FILE); + + if (inlinePrompt == null && promptFilePath == null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Either 'prompt' or 'promptFile' must be provided"); + } + if (inlinePrompt != null && promptFilePath != null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "Only one of 'prompt' or 'promptFile' can be provided, not both"); + } + if (inlinePrompt != null) { + validatePromptPlaceholders(inlinePrompt, inputFields); + this.promptText = inlinePrompt; + } + this.promptFile = promptFilePath; + } + + @Override + public void inform(SolrCore core) { + final SolrResourceLoader solrResourceLoader = core.getResourceLoader(); + ManagedChatModelStore.registerManagedChatModelStore(solrResourceLoader, this); + if (promptFile != null) { + try (InputStream is = solrResourceLoader.openResource(promptFile)) { + promptText = new String(is.readAllBytes(), StandardCharsets.UTF_8).trim(); + } catch (IOException e) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "Cannot read prompt file: " + promptFile, e); + } + validatePromptPlaceholders(promptText, inputFields); + } + } + + @Override + public void 
onManagedResourceInitialized(NamedList<?> args, ManagedResource res) + throws SolrException { + if (res instanceof ManagedChatModelStore store) { + store.loadStoredModels(); + } + } + + @Override + public UpdateRequestProcessor getInstance( + SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) { + IndexSchema latestSchema = req.getCore().getLatestSchema(); + + for (String fieldName : inputFields) { + if (!latestSchema.isDynamicField(fieldName) && !latestSchema.hasExplicitField(fieldName)) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, "undefined field: \"" + fieldName + "\""); + } + } + + final SchemaField outputFieldSchema = latestSchema.getField(outputField); + + ResponseFormat responseFormat = buildResponseFormat(outputFieldSchema); + boolean multiValued = outputFieldSchema.multiValued(); + + ManagedChatModelStore store = ManagedChatModelStore.getManagedModelStore(req.getCore()); + SolrChatModel chatModel = store.getModel(modelName); + if (chatModel == null) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "The model configured in the Update Request Processor '" + + modelName + + "' can't be found in the store: " + + ManagedChatModelStore.REST_END_POINT); + } + + return new DocumentEnrichmentUpdateProcessor( + inputFields, outputField, promptText, chatModel, multiValued, responseFormat, req, next); + } + + /** + * Builds a {@link ResponseFormat} that instructs the model to return a JSON object {@code + * {"value": ...}} whose value type matches the Solr field type. For multivalued fields the value + * is wrapped in a {@link JsonArraySchema} nested inside the root {@link JsonObjectSchema}. + * + * <p>Nesting {@link JsonArraySchema} inside a {@link JsonObjectSchema} property is supported by + * all langchain4j providers that implement structured outputs with {@link JsonObjectSchema} + * (OpenAI, Azure OpenAI, Google AI, Gemini, Mistral, Ollama, Amazon Bedrock, Watsonx). 
+ */ + static ResponseFormat buildResponseFormat(SchemaField schemaField) { + JsonSchemaElement valueElement = toJsonSchemaElement(schemaField.getType()); + JsonSchemaElement valueSchema = + schemaField.multiValued() + ? JsonArraySchema.builder().items(valueElement).build() + : valueElement; + return ResponseFormat.builder() + .type(ResponseFormatType.JSON) + .jsonSchema( + JsonSchema.builder() + .name("output") + .rootElement( + JsonObjectSchema.builder() + .addProperty("value", valueSchema) + .required("value") + .build()) + .build()) + .build(); + } + + private static JsonSchemaElement toJsonSchemaElement(FieldType fieldType) { + // DenseVectorField extends FloatPointField, so it must be rejected before the numeric checks + if (fieldType instanceof DenseVectorField + || fieldType instanceof UUIDField + || fieldType instanceof NestPathField) { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "field type is not supported by Document Enrichment: " + + fieldType.getClass().getSimpleName()); + } + if (fieldType instanceof StrField + || fieldType instanceof TextField + || fieldType instanceof DatePointField) { + return new JsonStringSchema(); + } else if (fieldType instanceof IntPointField || fieldType instanceof LongPointField) { + return new JsonIntegerSchema(); + } else if (fieldType instanceof FloatPointField || fieldType instanceof DoublePointField) { + return new JsonNumberSchema(); + } else if (fieldType instanceof BoolField) { + return new JsonBooleanSchema(); + } else { + throw new SolrException( + SolrException.ErrorCode.SERVER_ERROR, + "field type is not supported by Document Enrichment: " + + fieldType.getClass().getSimpleName()); + } Review Comment: I think this part would be more readable with a Java switch construct. ########## solr/modules/language-models/src/test-files/modelEmbeddingExamples/dummy-model.json: ########## Review Comment: Not sure we need this relocation, but in case we do: embeddingModelExamples ########## 
solr/solr-ref-guide/modules/indexing-guide/pages/document-enrichment-with-llms.adoc: ########## @@ -0,0 +1,534 @@ += Document Enrichment with LLMs +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +This module brings the power of *Large Language Models* to Solr. + +More specifically, it enables calling an LLM at indexing time to enrich documents with additional/generated/extracted +data. Given a prompt and a set of input fields, for each document, the LLM is invoked through +https://github.com/langchain4j/langchain4j[LangChain4j], and the result is stored in an output field, which can support +multiple types and may also be multivalued. + +_Without_ this module, the LLM calls to enrich documents must be done _outside_ Solr, before indexing. + +[IMPORTANT] +==== +This module sends your documents off to some hosted service on the internet. +There are cost, privacy, performance, and service availability implications on such a strong dependency that should be +diligently examined before employing this module in a serious way. + +==== + +At the moment, Solr supports a subset of the LLM providers available in LangChain4j. + +*Disclaimer*: Apache Solr is *in no way* affiliated to any of these corporations or services. 
+ +If you want to add support for additional services or improve the support for the existing ones, feel free to +contribute: + +* https://github.com/apache/solr/blob/main/CONTRIBUTING.md[Contributing to Solr] + +== Module + +This is provided via the `language-models` xref:configuration-guide:solr-modules.adoc[Solr Module] that needs to be +enabled before use. + +== Language Model Configuration + +Language Models is a module and therefore its plugins must be configured in `solrconfig.xml`. + +=== Minimum Requirements + +* Enable the `language-models` module to make the Language Models classes available on Solr's classpath. +See xref:configuration-guide:solr-modules.adoc[Solr Module] for more details. + +* An {solr-javadocs}/core/org/apache/solr/update/processor/UpdateRequestProcessorChain.html[UpdateRequestProcessorChain] +that includes at least one `DocumentEnrichmentUpdateProcessor` update processor. + +=== Update Processor Chain Design + +To properly design the Update Processor Chain for Document Enrichment, several parameters must be defined: + +`inputField`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The field whose content is passed to the LLM to enrich the documents. Every `inputField` declared must be referred to in +the prompt. + ++ +Multiple `inputField` are supported and can be defined by using one of the following notations: + +* Add more than one `inputField` string element ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">title</str> + <str name="inputField">body</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize with the following information. Title: {title}. 
Body: {body}.</str> + <str name="model">chat-model</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +* Substitute the `inputField` string element with an array of string elements with the same name ++ +[source,xml] +---- +<arr name="inputField"> + <str>title</str> + <str>body</str> +</arr> +---- + + +`outputField`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ +The LLM response is mapped to the specified `outputField`, and only one field is supported as output. Note that this +module only supports a subset of Solr's available field types, which includes: + +* *String/Text*: `StrField`, `TextField`, `SortableTextField` +* *Date*: `DatePointField` (the LLM must return an ISO-8601 date string; it might be useful to tune your prompt accordingly, to avoid indexing errors) +* *Numeric*: `IntPointField`, `LongPointField`, `FloatPointField`, `DoublePointField` +* *Boolean*: `BoolField` + + +These fields _can_ be multivalued. Solr uses structured output from LangChain4j to deal with LLMs' responses. + + +`prompt` or `promptFile`:: ++ +[%autowidth,frame=none] +|=== +s|Exactly one of these parameters is required |Default: none +|=== ++ +Two different ways to define a prompt are available: one directly in the solrconfig and one through a dedicated file. +Either way, the content of the prompt _must_ contain a special token for each `inputField` declared, that are the +`fieldName` surrounded by curly brackets (e.g., `{string_field}`, in the example below). Solr will throw an error if +the parameters are not properly defined. 
++ +These parameters can be defined in one of the following ways: + +* Update processor definition with the `prompt` parameter ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="prompt">Summarize this content: {string_field}</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +* Update processor definition with the parameter `promptFile` parameter: in this case, the file `prompt.txt` must be +uploaded to Solr inside the config folder of the collection (e.g., similarly to `solrconfig.xml`, `synonyms.txt`, etc.) ++ +[source,xml] +---- +<updateRequestProcessorChain name="documentEnrichment"> + <processor class="solr.languagemodels.documentenrichment.update.processor.DocumentEnrichmentUpdateProcessorFactory"> + <str name="inputField">string_field</str> + <str name="outputField">summary</str> + <str name="promptFile">prompt.txt</str> + <str name="model">model-name</str> + </processor> + <processor class="solr.RunUpdateProcessorFactory"/> + </updateRequestProcessorChain> +---- + +`model`:: ++ +[%autowidth,frame=none] +|=== +s|Required |Default: none +|=== ++ + +The name of the model that will be uploaded via REST. See xref:document-enrichment-with-llms.adoc#chat-model-setup[] for +more information. + + +For more details on how to work with update request processors in Apache Solr, please refer to the dedicated page: +xref:configuration-guide:update-request-processors.adoc[Update Request Processor] + +[IMPORTANT] +==== +This update processor sends your document field content off to some hosted service on the internet. 
+There are serious performance implications that should be diligently examined before employing this component in production. +It will slow down substantially your indexing pipeline so make sure to stress test your solution before going live. + +==== + +[NOTE] +==== +If any `inputField` value is absent or empty for a given document, enrichment is silently skipped for that document: +the `outputField` is not added and the document is indexed as-is. + +If the LLM call fails at runtime (e.g., network error, model timeout), the exception is caught and logged but is +*non-fatal*: the document is still indexed without the `outputField`. +Monitor your indexing logs to detect documents that were not enriched as expected. +==== + +== Chat Model Setup Review Comment: "Chat Model" is LangChain4j naming; please remove it entirely from the doc and from Solr where possible. Furthermore, we don't offer any chat-style interaction, so it can be misleading. Let's just use 'general purpose LLM'. ########## solr/solr-ref-guide/modules/configuration-guide/pages/update-request-processors.adoc: ########## @@ -421,6 +421,10 @@ The {solr-javadocs}/modules/language-models/index.html[`language-models`] module It uses external text to vectors language models to perform the vectorisation for each processed document. For more information: xref:query-guide:text-to-vector.adoc[Update Request Processor] +{solr-javadocs}/modules/language-models/org/apache/solr/languagemodels/documentenrichment/update/processor/DocumentEnrichmentUpdateProcessorFactory.html[DocumentEnrichmentUpdateProcessorFactory]:: Update processor which, starting from one or more fields in input and a given prompt, adds the output of an LLM as the value of a new field. Review Comment: Update processor that takes one or more fields and a given prompt as input and returns the output of an LLM as the value of a new field. -- This is an automated message from the Apache Git Service. 
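For reference, the validation rules listed in the Javadoc of `DocumentEnrichmentUpdateProcessorFactory` (every `inputField` needs a `{fieldName}` placeholder, and every placeholder needs a declared `inputField`) could be checked roughly as follows. This is only a sketch: the body of `validatePromptPlaceholders` is not part of the quoted hunk, so the class `PromptPlaceholders`, its method names, and the use of `IllegalArgumentException` in place of `SolrException` are assumptions made for illustration. Only the regular expression is taken from the diff.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the PR's implementation.
final class PromptPlaceholders {
  // Same expression as PLACEHOLDER_PATTERN in the quoted diff.
  private static final Pattern PLACEHOLDER_PATTERN = Pattern.compile("\\{([^}]+)\\}");

  private PromptPlaceholders() {}

  /** Collects the distinct names found between curly brackets, in order of appearance. */
  static Set<String> extract(String prompt) {
    Set<String> names = new LinkedHashSet<>();
    Matcher matcher = PLACEHOLDER_PATTERN.matcher(prompt);
    while (matcher.find()) {
      names.add(matcher.group(1));
    }
    return names;
  }

  /** Throws if the prompt and the declared input fields do not match one to one. */
  static void validate(String prompt, List<String> inputFields) {
    Set<String> placeholders = extract(prompt);
    for (String field : inputFields) {
      if (!placeholders.contains(field)) {
        // The real processor would presumably raise a SolrException here.
        throw new IllegalArgumentException("prompt is missing placeholder {" + field + "}");
      }
    }
    for (String placeholder : placeholders) {
      if (!inputFields.contains(placeholder)) {
        throw new IllegalArgumentException(
            "placeholder {" + placeholder + "} has no matching inputField");
      }
    }
  }
}
```

With the prompt from the reference guide example, `PromptPlaceholders.validate("Title: {title}. Body: {body}.", List.of("title", "body"))` returns normally, while dropping `{body}` from the prompt makes it throw.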
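The review comment on `toJsonSchemaElement` asks for a switch construct instead of the `instanceof` chain. One possible shape, assuming the module builds with Java 21 pattern matching for `switch`, is sketched below. The field-type classes here are empty stand-ins with a simplified hierarchy (only the `DenseVectorField extends FloatPointField` relationship is taken from the diff), and `JsonKind` and `SchemaMapping` are hypothetical names replacing the LangChain4j schema classes so that the example compiles on its own. Real code would keep throwing `SolrException`.

```java
// Stand-in types for illustration only; NOT the real org.apache.solr.schema classes.
class FieldType {}
class StrField extends FieldType {}
class TextField extends FieldType {}
class DatePointField extends FieldType {}
class IntPointField extends FieldType {}
class LongPointField extends FieldType {}
class FloatPointField extends FieldType {}
class DoublePointField extends FieldType {}
class BoolField extends FieldType {}
class UUIDField extends FieldType {}
class NestPathField extends FieldType {}
// Relationship stated in the diff's own comment.
class DenseVectorField extends FloatPointField {}

// Hypothetical replacement for JsonStringSchema, JsonIntegerSchema, and so on.
enum JsonKind { STRING, INTEGER, NUMBER, BOOLEAN }

final class SchemaMapping {
  private SchemaMapping() {}

  static JsonKind toJsonKind(FieldType fieldType) {
    return switch (fieldType) {
      // A subtype case must precede its supertype: the compiler rejects a
      // DenseVectorField case placed after the FloatPointField case.
      case DenseVectorField f -> throw unsupported(f);
      case UUIDField f -> throw unsupported(f);
      case NestPathField f -> throw unsupported(f);
      case StrField f -> JsonKind.STRING;
      case TextField f -> JsonKind.STRING;
      case DatePointField f -> JsonKind.STRING;
      case IntPointField f -> JsonKind.INTEGER;
      case LongPointField f -> JsonKind.INTEGER;
      case FloatPointField f -> JsonKind.NUMBER;
      case DoublePointField f -> JsonKind.NUMBER;
      case BoolField f -> JsonKind.BOOLEAN;
      default -> throw unsupported(fieldType);
    };
  }

  // Stand-in for the SolrException(SERVER_ERROR, ...) thrown in the diff.
  private static IllegalArgumentException unsupported(FieldType fieldType) {
    return new IllegalArgumentException(
        "field type is not supported by Document Enrichment: "
            + fieldType.getClass().getSimpleName());
  }
}
```

A side effect of this form is that the ordering constraint described in the existing code comment is enforced at compile time rather than by convention.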
