kaxil commented on code in PR #68434:
URL: https://github.com/apache/airflow/pull/68434#discussion_r3404137719
##########
providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py:
##########
@@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]:
nodes = splitter.get_nodes_from_documents(llama_docs)
self.log.info("Split %d documents into %d chunks", len(llama_docs),
len(nodes))
- # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
- # side effect of building the index; capture the index so the
- # variable isn't discarded.
+ # Pre-embed nodes so that ``.embedding`` is set on the original node
+ # objects before they are passed to VectorStoreIndex. VectorStoreIndex
+ # calls ``_get_node_with_embedding()`` which does ``node.model_copy()``
+ # and attaches the embedding to the *copy*, never the original.
Reading
+ # ``node.embedding`` after index construction therefore always returns
+ # ``None`` (confirmed across llama-index-core v0.10–v0.14).
+ # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
+ # ``.embedding`` is already set, so pre-embedding causes no duplicate
+ # API calls.
+ texts = [node.get_content() for node in nodes]
Review Comment:
`get_content()` with no argument is `MetadataMode.NONE`, but llama-index's
own `embed_nodes()` embeds
`node.get_content(metadata_mode=MetadataMode.EMBED)`, which includes node
metadata and respects `excluded_embed_metadata_keys`. Checked on
llama-index-core 0.14.22: for a node with `metadata={"src": "a"}` the library
embeds `'src: a\n\nhello world'` while this embeds just `'hello world'`. Since
the operator attaches user metadata to every Document and the pre-set
embeddings are reused for the persisted index, metadata stops contributing to
the vectors entirely. Suggest `metadata_mode=MetadataMode.EMBED` here so the
pre-embed step keeps the same embedding semantics llama-index applies itself.
##########
providers/common/ai/src/airflow/providers/common/ai/operators/llamaindex_embedding.py:
##########
@@ -125,9 +125,19 @@ def execute(self, context: Context) -> dict[str, Any]:
nodes = splitter.get_nodes_from_documents(llama_docs)
self.log.info("Split %d documents into %d chunks", len(llama_docs),
len(nodes))
- # ``VectorStoreIndex(...)`` populates each node's ``.embedding`` as a
- # side effect of building the index; capture the index so the
- # variable isn't discarded.
+ # Pre-embed nodes so that ``.embedding`` is set on the original node
+ # objects before they are passed to VectorStoreIndex. VectorStoreIndex
+ # calls ``_get_node_with_embedding()`` which does ``node.model_copy()``
+ # and attaches the embedding to the *copy*, never the original.
Reading
+ # ``node.embedding`` after index construction therefore always returns
+ # ``None`` (confirmed across llama-index-core v0.10–v0.14).
+ # ``embed_nodes()`` inside VectorStoreIndex skips nodes whose
+ # ``.embedding`` is already set, so pre-embedding causes no duplicate
+ # API calls.
+ texts = [node.get_content() for node in nodes]
+ vectors = embed_model.get_text_embedding_batch(texts,
show_progress=False)
Review Comment:
This breaks `test_byo_embed_model_bypasses_hook`: `_byo_embedding()` builds
its mock with `spec=["get_text_embedding", "_get_query_embedding"]`, so calling
`.get_text_embedding_batch` on it raises AttributeError. The duck-type check in
`_resolve_embed_model` has the same gap, it accepts anything with those two
methods but execute now needs a third. Adding `get_text_embedding_batch` to
both the spec and the check would cover it, along with the tests ashb asked for
on the new pre-embed path.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]