[PR] fix(common-ai): pre-embed nodes so LlamaIndexEmbeddingOperator returns vectors [airflow]

via GitHub Fri, 12 Jun 2026 16:48:07 -0700


AgentNero-ch opened a new pull request, #68488:
URL: https://github.com/apache/airflow/pull/68488


   ## What
   
   `LlamaIndexEmbeddingOperator.execute()` returns chunks with `"vector": None` 
because it relies on `VectorStoreIndex` to populate `node.embedding` as a side 
effect. But `VectorStoreIndex._get_node_with_embedding()` attaches embeddings 
to *copies* of the nodes (via `model_copy()`), never the originals.
   
   ## Fix
   
   Call `embed_model.get_text_embedding_batch()` on the original nodes *before* 
passing them to `VectorStoreIndex`. The index's internal `embed_nodes()` skips 
nodes whose `.embedding` is already set, so there are no duplicate API calls.
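
A minimal sketch of the pre-embed step, for illustration only. `StubNode`, `StubEmbedModel`, and `pre_embed` are hypothetical stand-ins rather than names from the operator or from llama-index; only the `get_text_embedding_batch` method name comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a LlamaIndex node; only the fields used here."""
    text: str
    embedding: Optional[List[float]] = None


class StubEmbedModel:
    """Hypothetical embed model that counts how many texts it was asked to embed."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    def get_text_embedding_batch(self, texts: List[str]) -> List[List[float]]:
        self.texts_embedded += len(texts)
        # Toy vector: the text length. A real model would call an API here.
        return [[float(len(t))] for t in texts]


def pre_embed(nodes: List[StubNode], embed_model: StubEmbedModel) -> None:
    """Set embeddings in place on the original nodes that do not have one yet."""
    pending = [n for n in nodes if n.embedding is None]
    if not pending:
        return
    vectors = embed_model.get_text_embedding_batch([n.text for n in pending])
    for node, vector in zip(pending, vectors):
        node.embedding = vector


nodes = [StubNode("alpha"), StubNode("be")]
model = StubEmbedModel()
pre_embed(nodes, model)

# The returned chunks read vectors from the original nodes, which are now set.
chunks = [{"text": n.text, "vector": n.embedding} for n in nodes]
```

Because the vectors are written to the original node objects, the chunk dictionaries built from those nodes no longer carry `None`.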
   
   ## Why this works
   
   From llama-index-core source (`indices/utils.py`):
   ```python
   def embed_nodes(nodes, embed_model, ...):
       for node in nodes:
           if node.embedding is not None:
               continue  # skip already-embedded nodes
           ...
   ```
   
   Verified across llama-index-core v0.10.68 through v0.14.22 — all versions 
copy nodes internally, so the side-effect assumption has never held.
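
The two behaviours described above (embeddings landing on copies, and already-embedded nodes being skipped) can be illustrated with a toy model. This is a sketch under stated assumptions, not llama-index code: `index_like_embed` is a hypothetical function, and `copy.copy` stands in for `model_copy()`.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StubNode:
    """Hypothetical stand-in for a LlamaIndex node."""
    text: str
    embedding: Optional[List[float]] = None


def index_like_embed(
    nodes: List[StubNode], embed_fn: Callable[[str], List[float]]
) -> Tuple[List[StubNode], int]:
    """Attach embeddings to copies of the nodes, embedding only those without a vector."""
    calls = 0
    results = []
    for node in nodes:
        clone = copy.copy(node)  # assumption: stands in for model_copy()
        if clone.embedding is None:
            clone.embedding = embed_fn(clone.text)
            calls += 1
        results.append(clone)
    return results, calls


def toy_embed(text: str) -> List[float]:
    return [float(len(text))]


# Case 1: nodes without embeddings. The copies get vectors; the originals do not.
originals = [StubNode("a"), StubNode("bb")]
copies, calls_without_pre_embed = index_like_embed(originals, toy_embed)
original_vectors = [n.embedding for n in originals]
copy_vectors = [n.embedding for n in copies]

# Case 2: nodes embedded beforehand. The embed function is never called again.
pre_embedded = [StubNode("a", [1.0]), StubNode("bb", [2.0])]
_, calls_with_pre_embed = index_like_embed(pre_embedded, toy_embed)
```

In case 1 the originals keep `embedding = None`, which matches the `"vector": None` symptom; in case 2 no embedding calls are made, which is why pre-embedding should not double the API cost.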
   
   ## Testing
   
   Updated unit tests to mock `get_text_embedding_batch` instead of relying on 
`VectorStoreIndex` side effects. Added a new test verifying the pre-embed step 
is called with correct node texts.
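
For readers who want the shape of such a test, here is a hedged sketch using `unittest.mock`. It does not reproduce the tests in this PR: `pre_embed` is a hypothetical helper defined inline, and the nodes are plain `SimpleNamespace` objects rather than llama-index nodes.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def pre_embed(nodes, embed_model):
    """Hypothetical helper: embed node texts in one batch and set them in place."""
    vectors = embed_model.get_text_embedding_batch([n.text for n in nodes])
    for node, vector in zip(nodes, vectors):
        node.embedding = vector


def test_pre_embed_called_with_node_texts():
    nodes = [
        SimpleNamespace(text="alpha", embedding=None),
        SimpleNamespace(text="beta", embedding=None),
    ]
    embed_model = MagicMock()
    embed_model.get_text_embedding_batch.return_value = [[0.1, 0.2], [0.3, 0.4]]

    pre_embed(nodes, embed_model)

    # The batch call receives the node texts, in order, exactly once.
    embed_model.get_text_embedding_batch.assert_called_once_with(["alpha", "beta"])
    # The original node objects now carry the mocked vectors.
    assert [n.embedding for n in nodes] == [[0.1, 0.2], [0.3, 0.4]]
    return nodes


embedded_nodes = test_pre_embed_called_with_node_texts()
```

Mocking the batch method directly keeps the test independent of how the index handles nodes internally.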
   
   Closes #68416


