AgentNero-ch opened a new pull request, #68488:
URL: https://github.com/apache/airflow/pull/68488
## What
`LlamaIndexEmbeddingOperator.execute()` returns chunks with `"vector": None`
because it relies on `VectorStoreIndex` to populate `node.embedding` as a side
effect. But `VectorStoreIndex._get_node_with_embedding()` attaches embeddings
to *copies* of the nodes (via `model_copy()`), never the originals.
## Fix
Call `embed_model.get_text_embedding_batch()` with the texts of the original
nodes and assign each result to that node's `.embedding` *before* passing the
nodes to `VectorStoreIndex`. The index's internal `embed_nodes()` skips nodes
whose `.embedding` is already set, so there are no duplicate API calls.
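The behaviour described above can be sketched without llama-index installed. Everything in this block is a hypothetical stand-in, not the library's code: `Node`, `CountingEmbedModel`, `index_like_embed`, and `pre_embed` are illustrative names, and `index_like_embed` only imitates the two properties this PR relies on (results are attached to copies, and already-embedded nodes are skipped).

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node."""
    text: str
    embedding: Optional[List[float]] = None


class CountingEmbedModel:
    """Hypothetical embed model that counts how many texts it embeds."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    def get_text_embedding_batch(self, texts: List[str]) -> List[List[float]]:
        self.texts_embedded += len(texts)
        # Toy embedding: a single dimension holding the text length.
        return [[float(len(t))] for t in texts]


def index_like_embed(nodes: List[Node], embed_model: CountingEmbedModel) -> List[Node]:
    """Imitate the index: embed only nodes lacking a vector, attach results to copies."""
    missing = [n for n in nodes if n.embedding is None]
    vectors = iter(embed_model.get_text_embedding_batch([n.text for n in missing]))
    copies = []
    for n in nodes:
        vec = n.embedding if n.embedding is not None else next(vectors)
        copies.append(replace(n, embedding=vec))  # the original node is left untouched
    return copies


def pre_embed(nodes: List[Node], embed_model: CountingEmbedModel) -> None:
    """Assumed shape of the fix: embed the originals in place before indexing."""
    vectors = embed_model.get_text_embedding_batch([n.text for n in nodes])
    for node, vec in zip(nodes, vectors):
        node.embedding = vec


# Without the fix: the originals never receive an embedding.
unfixed = [Node("alpha"), Node("be")]
index_like_embed(unfixed, CountingEmbedModel())
vectors_without_fix = [n.embedding for n in unfixed]

# With the fix: originals are embedded first, and the index step embeds nothing more.
fixed = [Node("alpha"), Node("be")]
fixed_model = CountingEmbedModel()
pre_embed(fixed, fixed_model)
index_like_embed(fixed, fixed_model)
vectors_with_fix = [n.embedding for n in fixed]
```

In this toy setup the unfixed originals keep `embedding=None`, while the pre-embedded originals carry vectors and the model embeds each text only once.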
## Why this works
From llama-index-core source (`indices/utils.py`):
```python
def embed_nodes(nodes, embed_model, ...):
    for node in nodes:
        if node.embedding is not None:
            continue  # skip already-embedded nodes
        ...
```
Verified across llama-index-core v0.10.68 through v0.14.22: every version in
that range copies nodes internally, so the side-effect assumption does not
hold for any of them.
## Testing
Updated unit tests to mock `get_text_embedding_batch` instead of relying on
`VectorStoreIndex` side effects. Added a new test verifying the pre-embed step
is called with correct node texts.
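A test of this kind might look roughly like the sketch below. It is not the test added in this PR: `pre_embed_nodes` is a hypothetical helper standing in for the operator's pre-embed step, and the nodes are plain `SimpleNamespace` objects with assumed `text` and `embedding` attributes rather than real llama-index nodes.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def pre_embed_nodes(nodes, embed_model):
    """Hypothetical pre-embed step: embed node texts and store vectors on the originals."""
    texts = [node.text for node in nodes]
    for node, vec in zip(nodes, embed_model.get_text_embedding_batch(texts)):
        node.embedding = vec


def test_pre_embed_called_with_node_texts():
    # Mock the embed model so no real embedding API is contacted.
    embed_model = MagicMock()
    embed_model.get_text_embedding_batch.return_value = [[0.1, 0.2], [0.3, 0.4]]
    nodes = [
        SimpleNamespace(text="first chunk", embedding=None),
        SimpleNamespace(text="second chunk", embedding=None),
    ]

    pre_embed_nodes(nodes, embed_model)

    # The batch call receives the node texts, in order, exactly once.
    embed_model.get_text_embedding_batch.assert_called_once_with(
        ["first chunk", "second chunk"]
    )
    # The vectors land on the original node objects.
    assert [n.embedding for n in nodes] == [[0.1, 0.2], [0.3, 0.4]]


test_pre_embed_called_with_node_texts()
```

Mocking `get_text_embedding_batch` directly keeps the test independent of how `VectorStoreIndex` handles nodes internally.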
Closes #68416