vikramkoka opened a new pull request, #67121:
URL: https://github.com/apache/airflow/pull/67121
- Adds LlamaIndexHook to bridge Airflow connections to LlamaIndex's
Settings singleton. It reuses the pydanticai connection type and supports
separate embedding and LLM connections.
- Adds EmbeddingOperator to chunk documents with LlamaIndex's
SentenceSplitter and produce embedding vectors for the chunks. Input is
list[dict(text, metadata)] (the same shape as DocumentLoaderOperator output);
output is a list of chunks with vectors, ready for downstream vector store
ingest operators (pgvector, Pinecone, Weaviate).
- Adds RetrievalOperator to load a persisted LlamaIndex index and perform
similarity search. Output is scored chunks ready for synthesis via LLMOperator.
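The data flow described in the bullets above can be sketched without LlamaIndex. This is a minimal illustration, not the operators' implementation: `toy_embed`, `chunk_and_embed`, and `retrieve` are hypothetical helpers, the fixed-width splitter stands in for SentenceSplitter, and the output keys `vector` and `score` are assumed names. Only the `text`/`metadata` input shape comes from the description.

```python
import math
from typing import Any


def toy_embed(text: str, dim: int = 4) -> list[float]:
    """Hypothetical stand-in for an embedding model (deterministic, not semantic)."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def chunk_and_embed(
    documents: list[dict[str, Any]],
    chunk_size: int = 40,
    include_vectors: bool = True,
) -> list[dict[str, Any]]:
    """Split each document into fixed-width chunks and optionally attach a vector."""
    chunks = []
    for doc in documents:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            piece = text[start : start + chunk_size]
            # Each chunk carries a copy of its parent document's metadata.
            chunk = {"text": piece, "metadata": dict(doc.get("metadata", {}))}
            if include_vectors:
                chunk["vector"] = toy_embed(piece)  # assumed key name
            chunks.append(chunk)
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks: list[dict[str, Any]], query: str, top_k: int = 2) -> list[dict[str, Any]]:
    """Return the top_k chunks by similarity to the query, each with a score."""
    query_vec = toy_embed(query)
    scored = [
        {
            "text": c["text"],
            "metadata": c["metadata"],
            "score": cosine(c["vector"], query_vec),  # assumed key name
        }
        for c in chunks
    ]
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:top_k]
```

In a DAG, the first step would correspond to EmbeddingOperator and the second to RetrievalOperator, with a persisted index sitting between them instead of an in-memory list.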
Design notes
All LlamaIndex imports are lazy (inside execute() / method bodies), so the
modules import without llama-index installed. The hook currently hardcodes
OpenAI embedding/LLM providers; a follow-up PR will refactor it to use
BaseAIHook for provider-agnostic model resolution once BaseAIHook lands.
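The lazy-import pattern mentioned above might look roughly like the sketch below. `SketchChunkingOperator` is a hypothetical class, not the EmbeddingOperator from this PR, and it deliberately does not subclass Airflow's BaseOperator so that it stays self-contained. It assumes the `llama_index.core.node_parser.SentenceSplitter` import path; the default sizes are illustrative.

```python
from __future__ import annotations

from typing import Any


class SketchChunkingOperator:
    """Hypothetical operator-like class illustrating deferred imports."""

    def __init__(self, chunk_size: int = 512, chunk_overlap: int = 50) -> None:
        # No LlamaIndex import is needed to construct the object, so the
        # module can be imported in environments without llama-index.
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def execute(self, documents: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Deferred import: evaluated only when execute() runs, so a missing
        # llama-index install raises ImportError here, not at import time.
        from llama_index.core.node_parser import SentenceSplitter

        splitter = SentenceSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
        )
        return [
            {"text": piece, "metadata": doc.get("metadata", {})}
            for doc in documents
            for piece in splitter.split_text(doc["text"])
        ]
```

The trade-off is that a missing dependency surfaces at task run time rather than when the module is first imported.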
What's included
┌─────────────────────────────────────────┬──────────────────────────────────────────┐
│ File                                    │ Purpose                                  │
├─────────────────────────────────────────┼──────────────────────────────────────────┤
│ hooks/llamaindex.py                     │ Hook (~110 lines)                        │
│ operators/llamaindex_embedding.py       │ EmbeddingOperator (~110 lines)           │
│ operators/llamaindex_retrieval.py       │ RetrievalOperator (~90 lines)            │
│ tests/.../test_llamaindex.py            │ 12 hook tests                            │
│ tests/.../test_llamaindex_embedding.py  │ 10 operator tests                        │
│ tests/.../test_llamaindex_retrieval.py  │ 8 operator tests                         │
│ docs/hooks/llamaindex.rst               │ Hook docs                                │
│ docs/operators/llamaindex_embedding.rst │ EmbeddingOperator docs                   │
│ docs/operators/llamaindex_retrieval.rst │ RetrievalOperator docs                   │
│ provider.yaml                           │ Integration, hook, operator registration │
│ docs/index.rst                          │ LlamaIndex Hook in Guides toctree        │
│ docs/operators/index.rst                │ Chooser table rows                       │
└─────────────────────────────────────────┴──────────────────────────────────────────┘
Test plan
- uv run --project providers/common/ai pytest
providers/common/ai/tests/unit/common/ai/hooks/test_llamaindex.py -xvs (12
tests)
- uv run --project providers/common/ai pytest
providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_embedding.py
providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_retrieval.py
-xvs (18 tests)
- Hook: init defaults, separate embed_conn_id, connection kwargs
extraction, embedding model, LLM, Settings configuration
- EmbeddingOperator: output shape, chunking, index persistence, vector
inclusion/omission, splitter params
- RetrievalOperator: output shape, chunk keys, top_k forwarding, multiple
results, storage context
---
Was generative AI tooling used to co-author this PR?
- Yes — Claude Code (Opus 4.6)
Generated-by: Claude Code (Opus 4.6) following
https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions
<!-- SPDX-License-Identifier: Apache-2.0
https://www.apache.org/licenses/LICENSE-2.0 -->