[PR] Add DocumentLoaderOperator to common.ai provider [airflow]

via GitHub Mon, 18 May 2026 07:54:30 -0700


vikramkoka opened a new pull request, #67120:
URL: https://github.com/apache/airflow/pull/67120


    - Adds DocumentLoaderOperator, a framework-agnostic file parser that 
bridges Airflow's connectivity layer (hooks returning bytes/files) and the AI 
embedding layer (operators needing list[dict(text, metadata)]). No LlamaIndex, 
LangChain, or other AI framework dependency.
     - Built-in parsers for .txt, .md, .csv, .json with zero extra deps. PDF 
(via pypdf, BSD) and DOCX (via python-docx, MIT) available as optional extras: 
pip install apache-airflow-providers-common-ai[pdf] / [docx].
     - Supports two input modes: source_path (local file, directory, or glob 
pattern) and source_bytes (raw bytes from XCom). Output is list[dict(text, 
metadata)], the same shape consumed by downstream embedding operators.
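
To make the output shape concrete, here is a minimal standard-library sketch of how raw bytes could be turned into a list of `{"text", "metadata"}` dicts for the built-in formats. It is not the operator's code: the helper name `parse_bytes`, the way CSV rows and JSON items are rendered to text, and the `row`/`index` metadata keys are assumptions made for illustration. Only the one-document-per-row CSV behaviour and the single-object/array JSON handling come from the test plan in this description.

```python
import csv
import io
import json


def parse_bytes(data: bytes, file_type: str, file_name: str = "<bytes>") -> list[dict]:
    """Hypothetical helper: turn raw bytes into [{"text": ..., "metadata": ...}]."""
    text = data.decode("utf-8")
    base = {"file_name": file_name, "file_type": file_type}

    if file_type in ("txt", "md"):
        # Plain text and Markdown become a single document.
        return [{"text": text, "metadata": base}]

    if file_type == "csv":
        # One document per row; the "key: value" rendering is an assumption.
        rows = csv.DictReader(io.StringIO(text))
        return [
            {
                "text": ", ".join(f"{key}: {value}" for key, value in row.items()),
                "metadata": {**base, "row": i},
            }
            for i, row in enumerate(rows)
        ]

    if file_type == "json":
        # A single object is treated as a one-element array.
        loaded = json.loads(text)
        items = loaded if isinstance(loaded, list) else [loaded]
        return [
            {"text": json.dumps(item, sort_keys=True), "metadata": {**base, "index": i}}
            for i, item in enumerate(items)
        ]

    raise ValueError(f"Unsupported file_type: {file_type!r}")
```

Calling `parse_bytes(b"name,role\nada,engineer\n", "csv")` returns one document whose `text` is `name: ada, role: engineer`. Every item carries both keys, so a downstream embedding step can iterate over the list without caring which format the input was in.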
   
     Motivation
   
     File parsing is the highest-volume gap in Airflow's AI story. Every RAG 
pipeline on Airflow currently requires custom parsing code; this operator 
replaces that code with a single line in a Dag.
   
     What's included
   
    
     ┌────────────────────────────────────┬───────────────────────────────────────────┐
     │                File                │                  Purpose                  │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ operators/document_loader.py       │ Operator (~270 lines)                     │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ tests/.../test_document_loader.py  │ 26 unit tests                             │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ docs/operators/document_loader.rst │ Usage docs                                │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ provider.yaml                      │ Operator registration + how-to-guide link │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ pyproject.toml                     │ [pdf] and [docx] optional dependencies    │
     ├────────────────────────────────────┼───────────────────────────────────────────┤
     │ docs/operators/index.rst           │ Chooser table row                         │
     └────────────────────────────────────┴───────────────────────────────────────────┘
   
     Test plan
   
     - uv run --project providers/common/ai pytest 
providers/common/ai/tests/unit/common/ai/operators/test_document_loader.py -xvs 
(26 tests)
     - Built-in parsers: txt, md, csv (one doc per row), json (single object 
and array)
     - PDF/DOCX parsers: mocked via sys.modules injection (packages not 
installed in test env)
     - ImportError guidance when optional packages are missing
     - Init validation: mutual exclusion of source_path/source_bytes, file_type 
required with source_bytes
     - File discovery: glob patterns, extension filtering, empty directories
     - Output shape: every item has text and metadata, file_name/file_path in 
metadata, custom metadata_fields merged
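
The `sys.modules` injection and ImportError items above can be sketched as follows. This is an illustration under stated assumptions, not the PR's test code: `extract_pdf_text`, `run_with_fake_pypdf`, and `run_without_pypdf` are hypothetical names, and the error message wording is invented. The only real APIs used are `pypdf.PdfReader`, its `pages` attribute, `page.extract_text()`, and `unittest.mock.patch.dict`.

```python
import io
import sys
from unittest import mock


def extract_pdf_text(data: bytes) -> list[str]:
    """Hypothetical helper: lazily import pypdf and return one string per page."""
    try:
        import pypdf  # optional dependency, imported only when a PDF is parsed
    except ImportError as exc:
        # Message wording is an assumption; the extra name comes from the PR text.
        raise ImportError(
            "PDF support requires pypdf; "
            "install apache-airflow-providers-common-ai[pdf]"
        ) from exc
    reader = pypdf.PdfReader(io.BytesIO(data))
    return [page.extract_text() or "" for page in reader.pages]


def run_with_fake_pypdf() -> list[str]:
    """Inject a stand-in pypdf module so the real package is not needed."""
    fake_page = mock.Mock()
    fake_page.extract_text.return_value = "hello from page 1"
    fake_pypdf = mock.Mock()
    fake_pypdf.PdfReader.return_value.pages = [fake_page]
    with mock.patch.dict(sys.modules, {"pypdf": fake_pypdf}):
        return extract_pdf_text(b"%PDF-fake")


def run_without_pypdf() -> str:
    """Simulate a missing package: a None entry makes `import pypdf` fail."""
    with mock.patch.dict(sys.modules, {"pypdf": None}):
        try:
            extract_pdf_text(b"%PDF-fake")
        except ImportError as exc:
            return str(exc)
    return ""
```

Importing inside the function rather than at module top level is what makes both cases testable: the import statement consults `sys.modules` at call time, so a patched entry (a fake module or `None`) takes effect whether or not the real package is installed.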
   
    <!-- SPDX-License-Identifier: Apache-2.0
         https://www.apache.org/licenses/LICENSE-2.0 -->
   
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [x] Yes (please specify the tool below)
   Generated-by: [Claude] following [the 
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
   
   ---
   
   * Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
 for more information. Note: commit author/co-author name and email in commits 
become permanently public when merged.
   * For fundamental code changes, an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
 is needed.
   * When adding a dependency, check compliance with the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   * For significant user-facing changes, create a newsfragment: 
`{pr_number}.significant.rst`, in 
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
 You can add this file in a follow-up commit after the PR is created so you 
know the PR number.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
