[jira] [Commented] (SOLR-17843) TextToVectorUpdateProcessor does not work with atomic update

Emeric Bernet-Rollande (Jira) Tue, 12 Aug 2025 09:01:31 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-17843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013474#comment-18013474
 ]


Emeric Bernet-Rollande commented on SOLR-17843:
-----------------------------------------------

Here is a suggested solution. There probably is a better way to do it, but this 
one seems to work just fine.

*TextToVectorUpdateProcessor.java:*

 
{code:java}
    ...
    boolean isAtomicUpdate = false;
    ...

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        SolrInputField inputFieldContent = doc.getField(inputField);
        isAtomicUpdate = false;

        if (!isNullOrEmpty(inputFieldContent)) {
            try {
                // getTextToVectorise() also sets isAtomicUpdate when the field holds an atomic update operation
                String textToVectorise = getTextToVectorise(inputFieldContent);

                float[] vector = textToVector.vectorise(textToVectorise);
                List<Float> vectorAsList = new ArrayList<>(vector.length);
                for (float f : vector) {
                    vectorAsList.add(f);
                }

                // If the request is an atomic update, the vector field must also be added as an atomic update operation
                if (isAtomicUpdate) {
                    Map<String, List<Float>> vectorAsOperation = new HashMap<>();
                    vectorAsOperation.put("set", vectorAsList);
                    doc.addField(outputField, vectorAsOperation);
                } else {
                    doc.addField(outputField, vectorAsList);
                }
            } catch (RuntimeException vectorisationFailure) {
                ...
            }
        }
        super.processAdd(cmd);
    } 

    /**
     * @param inputFieldContent The Solr Input Field
     * @return The String text to vectorise
     */
    @Nullable
    private String getTextToVectorise(SolrInputField inputFieldContent) {
        String textToVectorise;
        Object inputFieldValue = inputFieldContent.getFirstValue();
        if (inputFieldValue instanceof Map) {
            isAtomicUpdate = true;
            // Atomic update is being used, extract the "set/add" value
            Map<?, ?> map = (Map<?, ?>) inputFieldValue;
            Object setVal = map.get("set");
            Object addVal = map.get("add");

            if (setVal != null) {
                textToVectorise = setVal.toString();
            } else if (addVal != null) {
                textToVectorise = addVal.toString();
            } else {
                textToVectorise = null;
            }
        } else {
            textToVectorise = inputFieldContent.getFirstValue().toString();
        }
        return textToVectorise;
    }

    ....{code}
In this scenario, the update request MUST contain the inputField. In the case of an 
atomic update, it must be associated with a "set" or "add" operation (not sure if 
"add" is actually relevant).

The _*getTextToVectorise()*_ method retrieves the String value (if any) of the 
text to embed from {*}inputFieldValue{*}, whether it is a String or a 
LinkedHashMap. Also, if it detects that the inputField is the target of an 
atomic update operation, the generated vector is also added (or set) using an 
atomic update.
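
To make the extraction rule above easier to check in isolation, here is a minimal standalone sketch in plain Java, with no Solr dependency. The class name {{AtomicValueExtractor}} and the methods {{extractText}} and {{isAtomicOperation}} are hypothetical names chosen for this illustration; they are not part of Solr or of the patch above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only: mirrors the String / Map handling described above.
public class AtomicValueExtractor {

    /**
     * Returns the text carried by a field value, or null if there is none.
     * A Map value is treated as an atomic update operation ("set" first, then "add").
     */
    public static String extractText(Object fieldValue) {
        if (fieldValue == null) {
            return null;
        }
        if (fieldValue instanceof Map) {
            Map<?, ?> operation = (Map<?, ?>) fieldValue;
            Object value = operation.get("set");
            if (value == null) {
                value = operation.get("add");
            }
            return value == null ? null : value.toString();
        }
        return fieldValue.toString();
    }

    /** True when the field value is an atomic update operation rather than a plain value. */
    public static boolean isAtomicOperation(Object fieldValue) {
        return fieldValue instanceof Map;
    }

    public static void main(String[] args) {
        Map<String, Object> atomic = new LinkedHashMap<>();
        atomic.put("set", "Hello world !");

        System.out.println(extractText("Hello world !"));
        System.out.println(extractText(atomic));
        System.out.println(isAtomicOperation(atomic));
    }
}
```

Keeping the "is this an atomic update?" check as a separate pure function, rather than a mutable instance field, is only one possible design; it avoids state shared between calls, but the field-based version above behaves the same as long as the flag is reset at the start of each processAdd() call.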

 

 

*Example of use:*
|| ||Not atomic update||Atomic update||
|*Request body*|[
    {
        "id": "helloworld.txt",
        "content": "Hello world !",
        "author": "John Doe"
    }
]|[
    {
        "id": "helloworld.txt",
        "content": {
            "set": "Hello world !"
        }
    }
]|
|*inputFieldValue*|*_(String)_* "Hello world !"|_*LinkedHashMap*_ {"set" -> "Hello world !"}|
|*Processor output*|[
    {
        "id": "helloworld.txt",
        "content": "Hello world !",
        "author": "John Doe",
        "vector": [-0.045,0.027,0.012,-0.0086...]
    }
]|[
    {
        "id": "helloworld.txt",
        "content": {
            "set": "Hello world !"
        },
        "vector": {
            "set": [-0.045,0.027,0.012,-0.0086...]
        }
    }
]|
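
The two "Processor output" shapes in the table can be reproduced with a small plain-Java sketch, shown here only as an illustration. Documents are modelled as simple maps instead of SolrInputDocument, the class {{VectorOutputSketch}} and its {{addVector}} method are hypothetical, and the vector values are placeholders, not real embeddings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: documents are plain maps, not SolrInputDocument instances.
public class VectorOutputSketch {

    /**
     * Returns a copy of the document with the vector added to outputField.
     * If inputField holds an atomic operation (a Map), the vector is wrapped in a "set" operation.
     */
    public static Map<String, Object> addVector(Map<String, Object> doc, String inputField,
                                                String outputField, List<Float> vector) {
        Map<String, Object> result = new LinkedHashMap<>(doc);
        if (doc.get(inputField) instanceof Map) {
            result.put(outputField, Collections.singletonMap("set", vector));
        } else {
            result.put(outputField, vector);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder values standing in for a real embedding
        List<Float> vector = Arrays.asList(0.1f, 0.2f, 0.3f);

        Map<String, Object> fullDoc = new LinkedHashMap<>();
        fullDoc.put("id", "helloworld.txt");
        fullDoc.put("content", "Hello world !");

        Map<String, Object> atomicDoc = new LinkedHashMap<>();
        atomicDoc.put("id", "helloworld.txt");
        atomicDoc.put("content", Collections.singletonMap("set", "Hello world !"));

        System.out.println(addVector(fullDoc, "content", "vector", vector));
        System.out.println(addVector(atomicDoc, "content", "vector", vector));
    }
}
```

In the first case the vector is stored as a plain list; in the second it is stored under a "set" key, matching the right-hand column of the table.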

 

 

> TextToVectorUpdateProcessor does not work with atomic update
> ------------------------------------------------------------
>
>                 Key: SOLR-17843
>                 URL: https://issues.apache.org/jira/browse/SOLR-17843
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: UpdateRequestProcessors, vector-search
>    Affects Versions: 9.9
>         Environment: I'm working on an *Ubuntu 22* VM, running a local 
> Datafari 6.3-DEV server with {*}Solr 9.9{*}.
>  
>            Reporter: Emeric Bernet-Rollande
>            Priority: Minor
>              Labels: UpdateProcessor, vector, vectorization
>         Attachments: solrconfig.xml
>
>
> Hi,
> I'm working on *Solr 9.9* and using the *TextToVectorUpdateProcessor* to 
> enrich documents with semantic vectors. However, I am facing an issue when I 
> try to use this processor with {_}*atomic update*{_}.
>  
> h2. Full context: Indexing / embeddings workflow
> Solr is installed as a component of a search engine, Datafari. In this 
> scenario, Datafari crawls documents from a source (file share, web...), and 
> indexes them in a *FileShare* collection.
> This FileShare collection has an Update Processor that splits all documents 
> into smaller subdocuments (chunks), and sends them to the *VectorMain* 
> collection.
> Now, I need to vectorise the content of the chunks, using the 
> [TextToVectorUpdateProcessor|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html].
> I used to call this processor in the main processor chain, so all incoming 
> chunks were embedded. Most of the time, it worked well, but this solution has 
> two major issues:
>  * It significantly increases the indexing time
>  * When an embedding fails for any reason (timeout, network error, LLM 
> exception...), the associated chunk {*}was not indexed{*}.
> That is why I decided to dissociate the indexing from embeddings using 
> [Atomic 
> Update|https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html].
> Here is the new workflow:
>  * Chunks are indexed in the VectorMain collection without embeddings. The 
> text content is stored in the "{_}*embedded_content*{_}" field.
> {code:java}
> <field name="embedded_content" type="text_general" indexed="true" 
> stored="true" multiValued="false"/> {code}
>  * Then, we manually run the {*}Atomic Updates Job{*}, which retrieves all 
> the documents from VectorMain and sends an update request for each of them 
> using the "{_}*/update/embed*{_}" handler. Here is what the requests look 
> like: 
> {code:java}
> [
>     ....
>     {
>         "id": "file://///localhost/dataset/my_document.txt_4",
>         "embedded_content": { "set": "Lorem ipsum dolor sit amet, consectetur 
> adipiscing elit. Aenean aliquet quam sed convallis malesuada." }
>     },
>     ...
> ]{code}
>  And here is the handler & processor chain:
> {code:java}
> <!-- Request handler -->
> <requestHandler class="solr.UpdateRequestHandler" name="/update/embed">
>     <lst name="defaults">
>         <str name="lowernames">true</str>
>         <str name="fmap.language">ignored_</str>
>         <str name="fmap.source">ignored_</str>
>         <str name="fmap.version">ignored_</str>
>         <str name="fmap._version_">ignored_</str>
>         <str name="uprefix">ignored_</str>
>         <str name="update.chain">datafari-embed</str>
>     </lst>
> </requestHandler> {code}
> {code:java}
> <!-- Processor chain -->
> <updateRequestProcessorChain name="datafari-embed">
>     <processor 
> class="solr.llm.textvectorisation.update.processor.TextToVectorUpdateProcessorFactory">
>         <str name="inputField">embedded_content</str>
>         <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>         <str name="model">${texttovector.model:default_model}</str>
>     </processor>
>     <processor 
> class="com.francelabs.datafari.updateprocessor.VectorTaggerUpdateProcessorFactory">
>         <str name="enabled">true</str>
>         <str name="outputField">${texttovector.outputfield:vector_1536}</str>
>     </processor>
>     <processor class="solr.LogUpdateProcessorFactory"/>
>     <processor class="solr.DistributedUpdateProcessorFactory"/>
>     <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>{code}
>  * The *TextToVectorUpdateProcessor* takes the value of 
> "{*}embedded_content{*}", sends it to the external embeddings model (here, 
> I'm using our homemade [Datafari AI 
> Agent|https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/3522854915/AI+Agent+-+API+documentation]).
>  * The external embeddings service vectorizes the content of the chunks, and 
> returns the vector.
>  * If the vectorisation is successful, the (homemade) 
> VectorTaggerUpdateProcessor adds the name of the output vector field in the 
> multivalued "{*}has_vector{*}" String field.
> h2. The problem
> At first look, the workflow described above seems to work just fine. However, 
> I noticed a significant issue: *the content received by the embeddings 
> service is different from the expected one.*
> See the {*}actual AI Agent logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST 
> /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: {set=Lorem ipsum 
> dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... 
> {code}
> Here are the {*}expected logs{*}:
> {code:java}
> 2025-08-07 14:50:44,951 - aiagent - INFO - Request received - POST 
> /embeddings : 60
> 2025-08-07 14:50:44,951 - aiagent - DEBUG - Input query 60: Lorem ipsum 
> dolor sit amet, consectetur adipiscing elit. Aenean aliquet ... 
> {code}
> (x) It appears that the *TextToVectorUpdateProcessor* uses the "raw value" of 
> the embedded content ("{color:#0747a6}_{set=Lorem ipsum dolor...}_{color}") 
> instead of the actual value ("{color:#0747a6}_Lorem ipsum dolor_{color}")
>  
> h2. What does the doc say?
> According to the [TextToVectorUpdateProcessor 
> documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
>  it is possible to use atomic update for embeddings.
> I tried to follow the instructions:
>  * Using the existing /update/embed handler
>  * Creating the "vectorised" field
> {code:java}
> <field name="vectorised" type="boolean" uninvertible="false" docValues="true" 
> indexed="true" stored="false"/> {code}
>  * Sending an atomic update on an existing (not embedded) document:
> {code:java}
> curl -X POST "http://localhost:8983/solr/VectorMain/update/embed?commit=true" \
>   -H "Content-Type: application/json" \
>   -d '[
>     {
>       "id": "file://///localhost/mini/loremipsum.txt_0",
>       "vectorised":{"set":true}
>     }
>   ]' {code}
>  
>  
> According to the 
> [documentation|https://solr.apache.org/guide/solr/latest/query-guide/text-to-vector.html#enriching-documents-with-vectors-at-indexing-time],
>  the update processor {*}should retrieve the value from the document's 
> _embedded_content_{*}:
> {quote}What will happen is that internally Solr fetches the stored content of 
> the docs to update, all the existing fields are retrieved and a re-indexing 
> happens, targeting the 'vectorisation' chain that will add the vector and set 
> the boolean 'vectorised' field to 'true'.
> {quote}
> However, here, *(!) it does not (!).* 
> The Solr response is OK. The request is logged:
> {code:java}
>  INFO 2025-08-07T16:13:05Z 
> (searcherExecutor-102-thread-5-processing-VectorMain_shard1_replica_n1 
> 127.0.0.1-90 core_node2 127.0.0.1:8983_solr VectorMain shard1) - 
> Solr|Solr|org.apache.solr.core.SolrCore|[VectorMain shard1 core_node2 
> VectorMain_shard1_replica_n1] o.a.s.c.SolrCore Registered new searcher 
> autowarm time: 0 ms
>  INFO 2025-08-07T16:13:05Z 
> (qtp1739267143-220-127.0.0.1-90) - 
> Solr|Solr|org.apache.solr.update.processor.LogUpdateProcessorFactory|[VectorMain
>  shard1 core_node2 VectorMain_shard1_replica_n1] 
> o.a.s.u.p.LogUpdateProcessorFactory webapp=/solr path=/update/embed 
> params={commit=true}{add=[file://///localhost/mini/loremipsum.txt_0 
> (1839813818592526336)], commit=} 0 71 {code}
> However, if I don't provide the "embedded_content" field in the request, the 
> Update Processor ignores the document and does not call the external service.
>  
>  
> h2. Suggestions
> I tried many things to fix these two issues. Maybe I'm missing an important 
> point, but if I'm not, here are my suggestions.
>  * Handle "atomically updated" fields as inputField in the 
> *TextToVectorUpdateProcessor.*
>  * Improve the processor to reload missing inputField from stored fields if 
> not provided.
>  * Alternatively, clarify documentation to indicate that partial updates must 
> still include inputField
>  
> If you have any question or remark, feel free to ask. Also, I'm open to any 
> idea or advice. Thanks for reading !



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
