[ https://issues.apache.org/jira/browse/SOLR-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013474#comment-18013474 ]
Emeric Bernet-Rollande commented on SOLR-17843:
-----------------------------------------------

Here is a suggested solution. There is probably a better way to do it, but this one seems to work just fine.

*TextToVectorUpdateProcessor.java:*
{code:java}
...
boolean isAtomicUpdate = false;
...

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
  SolrInputDocument doc = cmd.getSolrInputDocument();
  SolrInputField inputFieldContent = doc.getField(inputField);
  // Reset for each document; getTextToVectorise() sets it when it detects an atomic update
  isAtomicUpdate = false;
  if (!isNullOrEmpty(inputFieldContent)) {
    try {
      String textToVectorise = getTextToVectorise(inputFieldContent);
      float[] vector = textToVector.vectorise(textToVectorise);
      List<Float> vectorAsList = new ArrayList<>(vector.length);
      for (float f : vector) {
        vectorAsList.add(f);
      }
      // If the update request is an atomic update, the vector field must be added as an atomic update too
      if (isAtomicUpdate) {
        Map<String, List<Float>> vectorAsOperation = new HashMap<>();
        vectorAsOperation.put("set", vectorAsList);
        doc.addField(outputField, vectorAsOperation);
      } else {
        doc.addField(outputField, vectorAsList);
      }
    } catch (RuntimeException vectorisationFailure) {
      ...
    }
  }
  super.processAdd(cmd);
}

/**
 * @param inputFieldContent The Solr Input Field
 * @return The String text to vectorise
 */
@Nullable
private String getTextToVectorise(SolrInputField inputFieldContent) {
  String textToVectorise;
  Object inputFieldValue = inputFieldContent.getFirstValue();
  if (inputFieldValue instanceof Map) {
    isAtomicUpdate = true;
    // Atomic update is being used, extract the "set/add" value
    Map<?, ?> map = (Map<?, ?>) inputFieldValue;
    Object setVal = map.get("set");
    Object addVal = map.get("add");
    if (setVal != null) {
      textToVectorise = setVal.toString();
    } else if (addVal != null) {
      textToVectorise = addVal.toString();
    } else {
      textToVectorise = null;
    }
  } else {
    textToVectorise = inputFieldValue.toString();
  }
  return textToVectorise;
}
....{code}

In this scenario, the update request MUST contain the inputField. In the case of an atomic update, it must be associated with a "set" or "add" operation (not sure if "add" is actually relevant).

The _*getTextToVectorise()*_ method retrieves the String value (if any) of the text to embed from {*}inputFieldValue{*}, whether it is a String or a LinkedHashMap. Also, if it detects that the inputField is the target of an atomic update request, the generated vector is also added (or set) using an atomic update.

*Example of use:*
|| ||Not atomic update||Atomic update||
|*Request body*|[ { "id": "helloworld.txt", "content": "Hello world !", "author": "John Doe" } ]|[ { "id": "helloworld.txt", "content": { "set": "Hello world !" } } ]|
|*inputFieldValue*|*_(String)_* "Hello world !"|_*LinkedHashMap*_ {"set" -> "Hello world !"}|
|*Processor output*|[ { "id": "helloworld.txt", "content": "Hello world !", "author": "John Doe", "vector": [-0.045,0.027,0.012,-0.0086...] } ]|[ { "id": "helloworld.txt", "content": { "set": "Hello world !" }, "vector": { "set": [-0.045,0.027,0.012,-0.0086...]
} } ]|

> TextToVectorUpdateProcessor does not work with atomic update
> ------------------------------------------------------------
>
> Key: SOLR-17843
> URL: https://issues.apache.org/jira/browse/SOLR-17843
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: UpdateRequestProcessors, vector-search
> Affects Versions: 9.9
> Environment: I'm working on an *Ubuntu 22* VM, running a local Datafari 6.3-DEV server with {*}Solr 9.9{*}.
> Reporter: Emeric Bernet-Rollande
> Priority: Minor
> Labels: UpdateProcessor, vector, vectorization
> Attachments: solrconfig.xml
>
>
> Hi,
> I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to
> enrich documents with semantic vectors. However, I am facing an issue when I
> try to use this processor with {_}*atomic update*{_}.
>
> h2. Full context: Indexing / embeddings workflow
> Solr is installed as a component of a search engine, Datafari. In this
> scenario, Datafari crawls documents from a source (file share, web...) and
> indexes them in a *FileShare* collection.
> This FileShare collection has an Update Processor that chunks all documents
> into smaller subdocuments (chunks) and sends them to the *VectorMain*
> collection.
> Now, I need to vectorise the content of the chunks, using the
> [TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].
> I used to call this processor in the main processor chain, so all incoming
> chunks were embedded. Most of the time, it worked well, but this solution has
> two major issues:
> * It significantly increases the indexing time
> * When an embedding fails for any reason (timeout, network error, LLM
> exception...), the associated chunk {*}was not indexed{*}.
> That is why I decided to dissociate the indexing from embeddings using
> [Atomic Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].
> Here is the new workflow:
> * Chunks are indexed in the VectorMain collection without embeddings. The
> text content is stored in the "{_}*embedded_content*{_}" field.
> {code:java}
> <field name="embedded_content" type="text_general" indexed="true" stored="true" multiValued="false"/> {code}
> * Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all
> the documents from VectorMain and sends an update request for each of them
> using the "{_}*/update/embed*{_}" handler. Here is what the requests look
> like:
> {code:java}
> [
>   ....
>   {
>     "id": "file://///localhost/dataset/my_document.txt_4",
>     "embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
>   },
>   ...
> ]{code}
> And here is the handler & processor chain:
> {code:java}
> <!-- Request handler -->
> <requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="fmap.language">ignored_</str>
>     <str name="fmap.source">ignored_</str>
>     <str name="fmap.version">ignored_</str>
>     <str name="fmap._version_">ignored_</str>
>     <str name="uprefix">ignored_</str>
>     <str name="update.chain">datafari-embed</str>
>   </lst>
> </requestHandler> {code}
> {code:java}
> <!-- Processor chain -->
> <updateRequestProcessorChain name="datafari-embed">
>   <processor class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
>     <str name="inputField">embedded_content</str>
>     <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>     <str name="model">${texttovector.model:default_model}</str>
>   </processor>
>   <processor class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
>     <str name="enabled">true</str>
>     <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory"/>
>   <processor class="solr.DistributedUpdateProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>{code}
> * The *TextToVectorUpdateProcessor* takes the value of
> "{*}embedded_content{*}" and sends it to the external embeddings model (here,
> I'm using our homemade [Datafari AI Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
> * The external embeddings service vectorises the content of the chunks and
> returns the vector.
> * If the vectorisation is successful, the (homemade)
> VectorTaggerUpdateProcessor adds the name of the output vector field to the
> multivalued "{*}has_vector{*}" String field.
>
> h2. The problem
> At first look, the workflow described above seems to work just fine. However,
> I noticed a significant issue: *the content received by the embeddings
> service is different from the expected one.*
> See the {*}actual AI Agent logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... {code}
> Here are the {*}expected logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean aliquet ...
> {code}
> (x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of
> the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}")
> instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}")
>
> h2. What does the doc say?
> According to the [TextToVectorUpdateProcessor documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
> it is possible to use atomic update for embeddings.
> I tried to follow the instructions:
> * Using the existing /update/embed handler
> * Creating the "vectorised" field
> {code:java}
> <field name="vectorised" type="boolean" uninvertible="false" docValues="true" indexed="true" stored="false"/> {code}
> * Sending an atomic update on an existing (not embedded) document:
> {code:java}
> curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
>   -H "Content-Type: application/json" \
>   -d '[
>     {
>       "id": "file://///localhost/mini/loremipsum.txt_0",
>       "vectorised":{"set":true}
>     }
>   ]' {code}
>
>
> According to the
> [documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
> the update processor {*}should retrieve the value from the document's
> _embedded_content_{*}:
> {quote}What will happen is that internally Solr fetches the stored content of
> the docs to update, all the existing fields are retrieved and a re-indexing
> happens, targeting the 'vectorisation' chain that will add the vector and set
> the boolean 'vectorised' field to 'true'.
> {quote}
> However, here, *(!) it does not (!).*
> The Solr response is OK.
> The request is logged:
> {code:java}
> INFO 2025-08-07T16:13:05Z (searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1 127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) - Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher autowarm time: 0 ms
> INFO 2025-08-07T16:13:05Z (qtp1739267143-220-127.0.0.1-90) - Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain shard1 core_node2 VectorMain_shard1_replica_n1] o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0 (1839813818592526336)], commit=} 0 71 {code}
> However, if I don't provide the "embedded_content" in the request, the Update
> Processor ignores it and doesn't call the external service.
>
>
> h2. Suggestions
> I tried many things to fix these two issues. Maybe I'm missing an important
> point, but if I'm not, here are my suggestions.
> * Handle "atomically updated" fields as inputField in the
> *TextToVectorUpdateProcessor.*
> * Improve the processor to reload a missing inputField from stored fields if
> it is not provided.
> * Alternatively, clarify the documentation to indicate that partial updates must
> still include inputField
>
> If you have any question or remark, feel free to ask. Also, I'm open to any
> idea or advice. Thanks for reading !
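
For reference, both the {{{set=...}}} string seen in the AI Agent logs and the unwrapping done by getTextToVectorise() in the comment above can be reproduced without Solr. This is only a minimal sketch. It assumes the raw field value is either a plain value or a java.util.Map keyed by the atomic operation name. The class and method names (AtomicUpdateText, textToVectorise) are hypothetical and not part of Solr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Solr: unwraps an atomic-update value. */
final class AtomicUpdateText {

  /**
   * Returns the text to embed from a raw field value.
   * A plain value is returned via toString(); an atomic-update map
   * yields its "set" value, else its "add" value, else null.
   */
  static String textToVectorise(Object rawValue) {
    if (rawValue == null) {
      return null;
    }
    if (rawValue instanceof Map) {
      Map<?, ?> operation = (Map<?, ?>) rawValue;
      Object value = operation.get("set");
      if (value == null) {
        value = operation.get("add");
      }
      return value == null ? null : value.toString();
    }
    return rawValue.toString();
  }

  public static void main(String[] args) {
    // Shape assumed for an atomic update value: {"set": "<text>"}
    Map<String, Object> atomic = new LinkedHashMap<>();
    atomic.put("set", "Lorem ipsum");

    // Calling toString() on the map gives the form reported in the logs.
    System.out.println(atomic.toString());                // {set=Lorem ipsum}
    // Unwrapping the operation gives the text that should be embedded.
    System.out.println(textToVectorise(atomic));          // Lorem ipsum
    // A regular (non-atomic) value passes through unchanged.
    System.out.println(textToVectorise("Hello world !")); // Hello world !
  }
}
```

A map that contains neither "set" nor "add" (for example a "remove" operation) yields null here, so a caller would need to skip vectorisation in that case.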