vikramkoka opened a new pull request, #67120:
URL: https://github.com/apache/airflow/pull/67120
- Adds DocumentLoaderOperator, a framework-agnostic file parser that
bridges Airflow's connectivity layer (hooks returning bytes/files) and the AI
embedding layer (operators needing list[dict(text, metadata)]). No LlamaIndex,
LangChain, or other AI framework dependency.
- Built-in parsers for .txt, .md, .csv, .json with zero extra deps. PDF
(via pypdf, BSD) and DOCX (via python-docx, MIT) available as optional extras:
pip install apache-airflow-providers-common-ai[pdf] / [docx].
- Supports two input modes: source_path (local file, directory, or glob
pattern) and source_bytes (raw bytes from XCom). Output is list[dict(text,
metadata)], the same shape consumed by downstream embedding operators.
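The input modes and output shape described above can be sketched roughly as follows. This is an illustration only, not the operator's code: `load_documents`, the `_PARSERS` table, and the `row` metadata key are hypothetical names, the text formatting of CSV rows is an assumption, and only a single local file is handled (no directory or glob discovery).

```python
from __future__ import annotations

import csv
import io
from pathlib import Path
from typing import Any


def _parse_txt(data: bytes, metadata: dict[str, Any]) -> list[dict[str, Any]]:
    # Plain text and Markdown: the whole file becomes one document.
    return [{"text": data.decode("utf-8"), "metadata": dict(metadata)}]


def _parse_csv(data: bytes, metadata: dict[str, Any]) -> list[dict[str, Any]]:
    # CSV: one document per row. The "row" key is an assumption for illustration.
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    docs = []
    for i, row in enumerate(reader):
        text = ", ".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"text": text, "metadata": {**metadata, "row": i}})
    return docs


# Hypothetical dispatch table keyed by file type.
_PARSERS = {"txt": _parse_txt, "md": _parse_txt, "csv": _parse_csv}


def load_documents(
    source_path: str | None = None,
    source_bytes: bytes | None = None,
    file_type: str | None = None,
    metadata_fields: dict[str, Any] | None = None,
) -> list[dict[str, Any]]:
    # Exactly one of the two input modes must be used.
    if (source_path is None) == (source_bytes is None):
        raise ValueError("Provide exactly one of source_path or source_bytes")
    extra = dict(metadata_fields or {})
    if source_bytes is not None:
        # Raw bytes carry no extension, so the type must be stated explicitly.
        if file_type is None:
            raise ValueError("file_type is required with source_bytes")
        return _PARSERS[file_type](source_bytes, extra)
    path = Path(source_path)
    kind = file_type or path.suffix.lstrip(".").lower()
    meta = {"file_name": path.name, "file_path": str(path), **extra}
    return _PARSERS[kind](path.read_bytes(), meta)
```

Under these assumptions, `load_documents(source_bytes=b"name,role\nAda,engineer\n", file_type="csv")` returns one item, `{"text": "name: Ada, role: engineer", "metadata": {"row": 0}}`.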
Motivation
File parsing is the highest-volume gap in Airflow's AI story.
Every RAG pipeline on Airflow currently requires custom parsing code. This
operator makes it a single line in a Dag.
What's included
┌────────────────────────────────────┬───────────────────────────────────────────┐
│ File                               │ Purpose                                   │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ operators/document_loader.py       │ Operator (~270 lines)                     │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ tests/.../test_document_loader.py  │ 26 unit tests                             │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ docs/operators/document_loader.rst │ Usage docs                                │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ provider.yaml                      │ Operator registration + how-to-guide link │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ pyproject.toml                     │ [pdf] and [docx] optional dependencies    │
├────────────────────────────────────┼───────────────────────────────────────────┤
│ docs/operators/index.rst           │ Chooser table row                         │
└────────────────────────────────────┴───────────────────────────────────────────┘
Test plan
- uv run --project providers/common/ai pytest
providers/common/ai/tests/unit/common/ai/operators/test_document_loader.py -xvs
(26 tests)
- Built-in parsers: txt, md, csv (one doc per row), json (single object
and array)
- PDF/DOCX parsers: mocked via sys.modules injection (packages not
installed in test env)
- ImportError guidance when optional packages are missing
- Init validation: mutual exclusion of source_path/source_bytes, file_type
required with source_bytes
- File discovery: glob patterns, extension filtering, empty directories
- Output shape: every item has text and metadata, file_name/file_path in
metadata, custom metadata_fields merged
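The `sys.modules` injection and ImportError checks in the plan above might look something like the sketch below. It is an assumption-laden illustration, not the PR's test code: `parse_pdf` is a hypothetical stand-in for the operator's PDF parser, and joining all pages into one document is a guess at the behaviour. The only pypdf calls assumed are `PdfReader`, its `pages` attribute, and `extract_text()`.

```python
import io
import sys
from types import ModuleType
from unittest import mock


def parse_pdf(data: bytes) -> list[dict]:
    # Import lazily so the base install works without the optional extra.
    try:
        from pypdf import PdfReader
    except ImportError as exc:
        raise ImportError(
            "PDF parsing requires pypdf; install "
            "apache-airflow-providers-common-ai[pdf]"
        ) from exc
    reader = PdfReader(io.BytesIO(data))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return [{"text": text, "metadata": {}}]


def test_parse_pdf_with_fake_module():
    # Build a stand-in "pypdf" module whose PdfReader yields one fake page.
    fake_page = mock.Mock()
    fake_page.extract_text.return_value = "hello"
    fake_pypdf = ModuleType("pypdf")
    fake_pypdf.PdfReader = mock.Mock(return_value=mock.Mock(pages=[fake_page]))
    # patch.dict restores sys.modules when the block exits.
    with mock.patch.dict(sys.modules, {"pypdf": fake_pypdf}):
        docs = parse_pdf(b"%PDF-fake")
    assert docs == [{"text": "hello", "metadata": {}}]


def test_parse_pdf_missing_package():
    # A None entry in sys.modules makes the import statement raise ImportError.
    with mock.patch.dict(sys.modules, {"pypdf": None}):
        try:
            parse_pdf(b"%PDF-fake")
        except ImportError as exc:
            assert "pypdf" in str(exc)
        else:
            raise AssertionError("expected ImportError")
```

Patching `sys.modules` rather than the parser itself keeps the lazy import path under test, so both the happy path and the install hint are exercised without the real package being present.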
---
##### Was generative AI tooling used to co-author this PR?
- [x] Yes (please specify the tool below)
Generated-by: [Claude] following [the
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)