bujjibabukatta opened a new pull request, #68424:
URL: https://github.com/apache/airflow/pull/68424
## Problem
`LlamaIndexEmbeddingOperator` was returning `vector: None` for every chunk
in its output, making the results unusable for downstream vector storage tasks.
**Root cause:** `VectorStoreIndex._get_node_with_embedding()` in
`llama-index-core` calls `node.copy()` internally before attaching embedding
vectors. As a result, embeddings are stored only on the internal copies; the
original node objects in the `nodes` list retain `embedding=None`.
Minimal reproduction:
```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
docs = [Document(text="hello world")]
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)
index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))
print(nodes[0].embedding)  # None ← bug
# {node_id: [...]} ← the vector is here, not on the node
print(index.vector_store.data.embedding_dict)
```
## Fix
Pre-embed the nodes using `embed_model.get_text_embedding_batch()` before
building the index, and assign the results directly to the original node
objects. Since `VectorStoreIndex` skips re-embedding nodes that already carry a
vector, this avoids redundant API calls while ensuring `node.embedding` is
correctly set on the objects we read from later.
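
A minimal sketch of the idea, not the operator's actual code. `FakeNode`, `FakeEmbedModel`, `build_fake_index`, and `pre_embed_nodes` are hypothetical stand-ins so the example runs without `llama-index` installed. The only method name taken from this PR is `get_text_embedding_batch()`; the `get_content()` accessor and the copy-then-embed behavior of the fake index are assumptions modeled on the root cause described above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class FakeNode:
    """Hypothetical stand-in for a llama-index text node."""
    text: str
    embedding: Optional[list] = None

    def get_content(self) -> str:
        return self.text


class FakeEmbedModel:
    """Hypothetical stand-in that counts batch embedding calls."""

    def __init__(self) -> None:
        self.calls = 0

    def get_text_embedding_batch(self, texts):
        self.calls += 1
        # Toy deterministic vector: [character count, word count].
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def build_fake_index(nodes, embed_model):
    """Assumed index behavior: copy each node, then embed only the
    copies that do not already carry a vector."""
    copies = [replace(n) for n in nodes]
    missing = [c for c in copies if c.embedding is None]
    if missing:
        texts = [c.get_content() for c in missing]
        vectors = embed_model.get_text_embedding_batch(texts)
        for copy, vector in zip(missing, vectors):
            copy.embedding = vector
    return copies


def pre_embed_nodes(nodes, embed_model):
    """Embed the original nodes in one batch and store each vector on
    the original object, before any index is built."""
    texts = [n.get_content() for n in nodes]
    vectors = embed_model.get_text_embedding_batch(texts)
    for node, vector in zip(nodes, vectors):
        node.embedding = vector
    return nodes


model = FakeEmbedModel()
nodes = [FakeNode("hello world"), FakeNode("foo")]

# Without pre-embedding: only the index's copies receive vectors.
build_fake_index(nodes, model)
before = [n.embedding for n in nodes]

# With pre-embedding: the originals carry vectors, and the index
# build that follows makes no further embedding call.
pre_embed_nodes(nodes, model)
calls_after_pre_embed = model.calls
build_fake_index(nodes, model)
```

With real `llama-index` objects the text passed to the embedding model may need to match what the index itself would embed (for example, content including metadata), so treat the `get_content()` call here as a placeholder.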
## Changes
- `providers/common/ai/.../operators/llamaindex_embedding.py`: added a
  pre-embedding step before `VectorStoreIndex` construction
- `providers/common/ai/tests/.../test_llamaindex_embedding.py`: updated
  existing tests to mock `get_text_embedding_batch`, added a regression test