[PR] Add LlamaIndex operators to common.ai provider [airflow]

via GitHub Mon, 18 May 2026 08:21:10 -0700


vikramkoka opened a new pull request, #67121:
URL: https://github.com/apache/airflow/pull/67121


     - Adds LlamaIndexHook to bridge Airflow connections to LlamaIndex's
       Settings singleton. It reuses the pydanticai connection type and
       supports separate embedding and LLM connections.
     - Adds EmbeddingOperator to chunk documents with LlamaIndex's
       SentenceSplitter and produce embedding vectors. Input is
       list[dict(text, metadata)] (same shape as DocumentLoaderOperator
       output); output includes chunks with vectors, ready for downstream
       vector store ingest operators (pgvector, Pinecone, Weaviate).
     - Adds RetrievalOperator to load a persisted LlamaIndex index and perform
       similarity search. Output is scored chunks, ready for synthesis via
       LLMOperator.
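
The data flow described in the list above can be sketched in plain Python. This is only an illustration of the record shapes, not the operators' implementation: the fixed-size word splitter stands in for SentenceSplitter, the bag-of-words vector stands in for a real embedding model, and the function names and the "vector" and "score" keys are assumptions chosen for this example.

```python
import math
from collections import Counter


def chunk_documents(documents, chunk_size=4):
    """Toy splitter: cut each document into fixed-size word windows."""
    chunks = []
    for doc in documents:
        words = doc["text"].split()
        for start in range(0, len(words), chunk_size):
            chunks.append(
                {
                    "text": " ".join(words[start : start + chunk_size]),
                    # Each chunk carries a copy of its parent's metadata.
                    "metadata": dict(doc.get("metadata", {})),
                }
            )
    return chunks


def toy_embed(text, vocabulary):
    """Bag-of-words counts; a stand-in for a real embedding model."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, embedded_chunks, vocabulary, top_k=2):
    """Score every chunk against the query and keep the top_k best."""
    query_vector = toy_embed(query, vocabulary)
    scored = [
        {
            "text": chunk["text"],
            "metadata": chunk["metadata"],
            "score": cosine(query_vector, chunk["vector"]),
        }
        for chunk in embedded_chunks
    ]
    return sorted(scored, key=lambda c: c["score"], reverse=True)[:top_k]


# Input shape: list[dict(text, metadata)].
documents = [
    {
        "text": "Airflow schedules tasks as directed acyclic graphs",
        "metadata": {"source": "a.txt"},
    },
    {
        "text": "Vector stores index embeddings for similarity search",
        "metadata": {"source": "b.txt"},
    },
]

chunks = chunk_documents(documents, chunk_size=4)
vocabulary = sorted({word for c in chunks for word in c["text"].lower().split()})
embedded = [{**c, "vector": toy_embed(c["text"], vocabulary)} for c in chunks]
results = retrieve("similarity search", embedded, vocabulary, top_k=2)
```

Here `embedded` plays the role of the embedding step's output (chunks plus vectors) and `results` the role of the retrieval step's output (scored chunks).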
   
     Design notes
   
     All LlamaIndex imports are lazy (inside execute() / method bodies), so
     the modules can be imported without llama-index installed. The hook
     currently hardcodes OpenAI embedding/LLM providers; a follow-up PR will
     refactor it to use BaseAIHook for provider-agnostic model resolution
     once BaseAIHook lands.
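
A minimal sketch of the lazy-import pattern described above, assuming a hypothetical stand-in class (`LazyImportOperator` is not part of this PR). The dependency is resolved only when `execute()` runs, so defining or importing the class never touches the optional package.

```python
import importlib


class LazyImportOperator:
    """Hypothetical operator stand-in that defers its heavy import to execute()."""

    def __init__(self, module_name: str = "llama_index.core"):
        # Only the module name is stored here; nothing is imported at
        # construction time or when this file itself is imported.
        self.module_name = module_name

    def execute(self):
        # Similar in spirit to placing the import statement as the first
        # line of the method body instead of at the top of the file.
        try:
            return importlib.import_module(self.module_name)
        except ImportError as exc:
            raise RuntimeError(
                f"{self.module_name} is required at run time; install it first"
            ) from exc
```

With this layout a missing dependency surfaces as a run-time error on the task that needs it, not as an import failure when the file is loaded.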
   
     What's included
   
     
     ┌─────────────────────────────────────────┬──────────────────────────────────────────┐
     │                  File                   │                 Purpose                  │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ hooks/llamaindex.py                     │ Hook (~110 lines)                        │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ operators/llamaindex_embedding.py       │ EmbeddingOperator (~110 lines)           │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ operators/llamaindex_retrieval.py       │ RetrievalOperator (~90 lines)            │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ tests/.../test_llamaindex.py            │ 12 hook tests                            │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ tests/.../test_llamaindex_embedding.py  │ 10 operator tests                        │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ tests/.../test_llamaindex_retrieval.py  │ 8 operator tests                         │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ docs/hooks/llamaindex.rst               │ Hook docs                                │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ docs/operators/llamaindex_embedding.rst │ EmbeddingOperator docs                   │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ docs/operators/llamaindex_retrieval.rst │ RetrievalOperator docs                   │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ provider.yaml                           │ Integration, hook, operator registration │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ docs/index.rst                          │ LlamaIndex Hook in Guides toctree        │
     ├─────────────────────────────────────────┼──────────────────────────────────────────┤
     │ docs/operators/index.rst                │ Chooser table rows                       │
     └─────────────────────────────────────────┴──────────────────────────────────────────┘
   
     Test plan
   
     - uv run --project providers/common/ai pytest providers/common/ai/tests/unit/common/ai/hooks/test_llamaindex.py -xvs
       (12 tests)
     - uv run --project providers/common/ai pytest providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_embedding.py providers/common/ai/tests/unit/common/ai/operators/test_llamaindex_retrieval.py -xvs
       (18 tests)
     - Hook: init defaults, separate embed_conn_id, connection kwargs 
extraction, embedding model, LLM, Settings configuration
     - EmbeddingOperator: output shape, chunking, index persistence, vector 
inclusion/omission, splitter params
     - RetrievalOperator: output shape, chunk keys, top_k forwarding, multiple 
results, storage context
   
     ---
     Was generative AI tooling used to co-author this PR?
   
     - Yes — Claude Code (Opus 4.6)
   
     Generated-by: Claude Code (Opus 4.6) following
     
https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions
   
    <!-- SPDX-License-Identifier: Apache-2.0
         https://www.apache.org/licenses/LICENSE-2.0 -->
   


