The GitHub Actions job "Tests (AMD)" on airflow.git/aip99-doc-loader has failed.
Run started by GitHub user vikramkoka (triggered by vikramkoka).

Head commit for run:
5e1c8abea8ed7b75f0adcc5cc56bff90820e570e / Vikram Koka <[email protected]>
Add DocumentLoaderOperator to common.ai provider

  - Adds DocumentLoaderOperator, a framework-agnostic file parser that bridges
    Airflow's connectivity layer (hooks returning bytes/files) and the AI
    embedding layer (operators needing list[dict(text, metadata)]). No
    LlamaIndex, LangChain, or other AI framework dependency.
  - Built-in parsers for .txt, .md, .csv, .json with zero extra deps. PDF (via
    pypdf, BSD) and DOCX (via python-docx, MIT) available as optional extras:
    pip install apache-airflow-providers-common-ai[pdf] / [docx].
  - Supports two input modes: source_path (local file, directory, or glob
    pattern) and source_bytes (raw bytes from XCom). Output is
    list[dict(text, metadata)], the same shape consumed by downstream
    embedding operators.
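To make the list[dict(text, metadata)] output shape concrete, here is a minimal standalone sketch of how raw bytes could be turned into that shape for the built-in text formats. It is not the operator's code: the helper name parse_bytes, the file_name default, and the "row" / "index" metadata keys are assumptions made for illustration only.

```python
import csv
import io
import json


def parse_bytes(data: bytes, file_type: str, file_name: str = "<bytes>") -> list[dict]:
    """Hypothetical helper: turn raw bytes into [{"text": ..., "metadata": ...}]."""
    text = data.decode("utf-8")
    base = {"file_name": file_name, "file_type": file_type}

    if file_type in ("txt", "md"):
        # Plain text and Markdown become a single document.
        return [{"text": text, "metadata": dict(base)}]

    if file_type == "csv":
        # One document per row, as the test plan describes for csv.
        rows = csv.DictReader(io.StringIO(text))
        return [
            {
                "text": ", ".join(f"{key}: {value}" for key, value in row.items()),
                "metadata": {**base, "row": i},  # "row" key is an assumption
            }
            for i, row in enumerate(rows)
        ]

    if file_type == "json":
        # Accept either a single object or an array of objects.
        payload = json.loads(text)
        items = payload if isinstance(payload, list) else [payload]
        return [
            {
                "text": json.dumps(item, sort_keys=True),
                "metadata": {**base, "index": i},  # "index" key is an assumption
            }
            for i, item in enumerate(items)
        ]

    raise ValueError(f"Unsupported file_type: {file_type!r}")


docs = parse_bytes(b"name,age\nAda,36\nAlan,41\n", "csv", "people.csv")
```

Every item carries both a text and a metadata key, so a downstream consumer can iterate over the result without caring which file type produced it.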

  Motivation

  File parsing is the highest-volume gap in Airflow's AI story. Every RAG
  pipeline on Airflow currently requires custom parsing code. This operator
  makes it a single line in a Dag.

  What's included

  
  ┌────────────────────────────────────┬───────────────────────────────────────────┐
  │ File                               │ Purpose                                   │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ operators/document_loader.py       │ Operator (~270 lines)                     │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ tests/.../test_document_loader.py  │ 26 unit tests                             │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ docs/operators/document_loader.rst │ Usage docs                                │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ provider.yaml                      │ Operator registration + how-to-guide link │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ pyproject.toml                     │ [pdf] and [docx] optional dependencies    │
  ├────────────────────────────────────┼───────────────────────────────────────────┤
  │ docs/operators/index.rst           │ Chooser table row                         │
  └────────────────────────────────────┴───────────────────────────────────────────┘

  Test plan

  - uv run --project providers/common/ai pytest
    providers/common/ai/tests/unit/common/ai/operators/test_document_loader.py -xvs
    (26 tests)
  - Built-in parsers: txt, md, csv (one doc per row), json (single object and
    array)
  - PDF/DOCX parsers: mocked via sys.modules injection (packages not installed
    in test env)
  - ImportError guidance when optional packages are missing
  - Init validation: mutual exclusion of source_path/source_bytes, file_type
    required with source_bytes
  - File discovery: glob patterns, extension filtering, empty directories
  - Output shape: every item has text and metadata, file_name/file_path in
    metadata, custom metadata_fields merged
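
The sys.modules injection and ImportError guidance items above can be sketched roughly as follows. This is an illustration of the general technique, not the actual test file: extract_pdf_text, run_with_fake_pypdf, and run_without_pypdf are hypothetical names, and the error message wording is an assumption. Only pypdf's PdfReader, its pages attribute, and extract_text() are taken from the real library.

```python
import sys
import types
from unittest import mock


def extract_pdf_text(path: str) -> list[str]:
    """Hypothetical parser: imports pypdf lazily so it stays an optional extra."""
    try:
        from pypdf import PdfReader
    except ImportError as exc:
        # Assumed wording; the point is to name the extra the user should install.
        raise ImportError(
            "PDF support requires pypdf: "
            "pip install apache-airflow-providers-common-ai[pdf]"
        ) from exc
    return [page.extract_text() for page in PdfReader(path).pages]


def run_with_fake_pypdf() -> list[str]:
    """Inject a stand-in pypdf module so the parser runs without the real package."""
    fake_page = mock.Mock()
    fake_page.extract_text.return_value = "hello from page 1"

    fake_module = types.ModuleType("pypdf")
    fake_module.PdfReader = mock.Mock(return_value=mock.Mock(pages=[fake_page]))

    # patch.dict restores sys.modules to its previous state on exit.
    with mock.patch.dict(sys.modules, {"pypdf": fake_module}):
        return extract_pdf_text("dummy.pdf")


def run_without_pypdf() -> str:
    """Simulate a missing package and return the guidance message."""
    # A None entry in sys.modules makes the import statement raise ImportError.
    with mock.patch.dict(sys.modules, {"pypdf": None}):
        try:
            extract_pdf_text("dummy.pdf")
        except ImportError as exc:
            return str(exc)
    return ""
```

Because the import happens inside the function rather than at module level, the same parser can be exercised with a fake module, with the real package, or with the package absent, without reloading anything.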

Report URL: https://github.com/apache/airflow/actions/runs/26041307998

With regards,
GitHub Actions via GitBox


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
