Re: [PR] fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns … [airflow]

via GitHub Fri, 12 Jun 2026 07:39:19 -0700


kaxil commented on code in PR #68434:
URL: https://github.com/apache/airflow/pull/68434#discussion_r3404137719



##########
providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py:
##########
@@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]:
         nodes = splitter.get_nodes_from_documents(llama_docs)
         self.log.info("Split %d documents into %d chunks", len(llama_docs), 
len(nodes))
 
-        # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
-        # side effect of building the index; capture the index so the
-        # variable isn't discarded.
+        # Pre-embed nodes so that ``.embedding`` is set on the original node
+        # objects before they are passed to VectorStoreIndex.  VectorStoreIndex
+        # calls ``_get_node_with_embedding()`` which does ``node.model_copy()``
+        # and attaches the embedding to the *copy*, never the original.  
Reading
+        # ``node.embedding`` after index construction therefore always returns
+        # ``None`` (confirmed across llama-index-core v0.10–v0.14).
+        # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
+        # ``.embedding`` is already set, so pre-embedding causes no duplicate
+        # API calls.
+        texts = [node.get_content() for node in nodes]

Review Comment:
   `get_content()` with no argument is `MetadataMode.NONE`, but llama-index's 
own `embed_nodes()` embeds 
`node.get_content(metadata_mode=MetadataMode.EMBED)`, which includes node 
metadata and respects `excluded_embed_metadata_keys`. Checked on 
llama-index-core 0.14.22: for a node with `metadata={"src": "a"}` the library 
embeds `'src: a\n\nhello world'` while this embeds just `'hello world'`. Since 
the operator attaches user metadata to every Document and the pre-set 
embeddings are reused for the persisted index, metadata stops contributing to 
the vectors entirely. Suggest `metadata_mode=MetadataMode.EMBED` here so the 
pre-embed step keeps the same embedding semantics llama-index applies itself.



##########
providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py:
##########
@@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]:
         nodes = splitter.get_nodes_from_documents(llama_docs)
         self.log.info("Split %d documents into %d chunks", len(llama_docs), 
len(nodes))
 
-        # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
-        # side effect of building the index; capture the index so the
-        # variable isn't discarded.
+        # Pre-embed nodes so that ``.embedding`` is set on the original node
+        # objects before they are passed to VectorStoreIndex.  VectorStoreIndex
+        # calls ``_get_node_with_embedding()`` which does ``node.model_copy()``
+        # and attaches the embedding to the *copy*, never the original.  
Reading
+        # ``node.embedding`` after index construction therefore always returns
+        # ``None`` (confirmed across llama-index-core v0.10–v0.14).
+        # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
+        # ``.embedding`` is already set, so pre-embedding causes no duplicate
+        # API calls.
+        texts = [node.get_content() for node in nodes]
+        vectors = embed_model.get_text_embedding_batch(texts, 
show_progress=False)

Review Comment:
   This breaks `test_byo_embed_model_bypasses_hook`: `_byo_embedding()` builds 
its mock with `spec=["get_text_embedding", "_get_query_embedding"]`, so calling 
`.get_text_embedding_batch` on it raises AttributeError. The duck-type check in 
`_resolve_embed_model` has the same gap, it accepts anything with those two 
methods but execute now needs a third. Adding `get_text_embedding_batch` to 
both the spec and the check would cover it, along with the tests ashb asked for 
on the new pre-embed path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(providers/common-ai): LlamaIndexEmbeddingOperator always returns … [airflow]

Reply via email to