bujjibabukatta opened a new pull request, #68424:
URL: https://github.com/apache/airflow/pull/68424

   ## Problem
   
   `LlamaIndexEmbeddingOperator` was returning `vector: None` for every chunk 
in its output, making the results unusable for downstream vector storage tasks.
   
   **Root cause:** `VectorStoreIndex._get_node_with_embedding()` in 
`llama-index-core` calls `node.copy()` internally before attaching embedding 
vectors. As a result, embeddings are stored only on the internal copies, and 
the original node objects in the `nodes` list retain `embedding=None`.
   
   Minimal reproduction:
   ```python
   from llama_index.core import Document, VectorStoreIndex
   from llama_index.core.node_parser import SentenceSplitter
   from llama_index.core.embeddings.mock_embed_model import MockEmbedding
   
   docs = [Document(text="hello world")]
   nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)
   index = VectorStoreIndex(nodes, embed_model=MockEmbedding(embed_dim=8))
   
   print(nodes[0].embedding)  # None  ← bug
   print(index.vector_store.data.embedding_dict)  # {node_id: [...]}  ← vector is here, not on the node
   ```
   
   
   ## Fix
   Pre-embed the nodes using `embed_model.get_text_embedding_batch()` before 
building the index, and assign the results directly to the original node 
objects. Since `VectorStoreIndex` skips re-embedding nodes that already carry a 
vector, this avoids redundant API calls while ensuring `node.embedding` is 
correctly set on the objects we read from later.
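
The idea can be illustrated without `llama-index` installed. The sketch below is not the operator's code: `StubNode`, `StubEmbedModel`, `build_index`, and `pre_embed` are hypothetical stand-ins that only mimic the behaviour described in this PR (the index copies nodes before embedding them and skips nodes that already carry a vector). The only name taken from the PR is `get_text_embedding_batch`.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class StubNode:
    """Hypothetical stand-in for a llama-index text node."""

    text: str
    embedding: Optional[List[float]] = None


class StubEmbedModel:
    """Hypothetical embed model that counts batch calls."""

    def __init__(self, dim: int = 4) -> None:
        self.dim = dim
        self.batch_calls = 0

    def get_text_embedding_batch(self, texts: Sequence[str]) -> List[List[float]]:
        self.batch_calls += 1
        # Deterministic fake vectors: the text length repeated `dim` times.
        return [[float(len(t))] * self.dim for t in texts]


def build_index(nodes: Sequence[StubNode], embed_model: StubEmbedModel) -> List[StubNode]:
    """Assumed index behaviour: copy every node, then embed only the copies
    that do not already carry a vector. The originals are never mutated."""
    copies = [deepcopy(n) for n in nodes]
    missing = [c for c in copies if c.embedding is None]
    if missing:
        vectors = embed_model.get_text_embedding_batch([c.text for c in missing])
        for copy_, vector in zip(missing, vectors):
            copy_.embedding = vector
    return copies


def pre_embed(nodes: Sequence[StubNode], embed_model: StubEmbedModel) -> None:
    """Embed all nodes in one batch and attach the vectors to the originals."""
    vectors = embed_model.get_text_embedding_batch([n.text for n in nodes])
    for node, vector in zip(nodes, vectors):
        node.embedding = vector


# Without pre-embedding: vectors end up only on the index's internal copies.
plain_nodes = [StubNode("hello world")]
build_index(plain_nodes, StubEmbedModel())
print(plain_nodes[0].embedding)

# With pre-embedding: the originals carry the vectors, and the index reuses
# them instead of calling the embed model a second time.
model = StubEmbedModel()
fixed_nodes = [StubNode("hello world")]
pre_embed(fixed_nodes, model)
build_index(fixed_nodes, model)
print(fixed_nodes[0].embedding)
print(model.batch_calls)
```

In this sketch the embed model is called once in total on the pre-embedded path, which is the property the fix relies on to avoid redundant API calls.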
   
   ## Changes
   - `providers/common/ai/.../operators/llamaindex_embedding.py`: added a 
pre-embedding step before `VectorStoreIndex` construction
   - `providers/common/ai/tests/.../test_llamaindex_embedding.py`: updated 
existing tests to mock `get_text_embedding_batch`, added a regression test

