iwannagotobed opened a new pull request, #67298:
URL: https://github.com/apache/airflow/pull/67298
## Summary
Fixes tenant-aware ingestion for Weaviate operators.
`WeaviateIngestOperator` and `WeaviateDocumentIngestOperator` already accepted
a `tenant` argument, but the value was not consistently applied to the
underlying hook operations.
As a result, users could configure `tenant="..."` on the operator while some
create, query, replace, or delete operations still ran against the base
collection instead of the tenant-scoped collection.
## Why this matters
Weaviate multi-tenancy isolates objects by tenant within a collection. For a
multi-tenant collection, object operations must be performed through:
```python
collection.with_tenant("<tenant>")
```
If the Airflow operator accepts a tenant parameter but does not apply it to
the actual Weaviate collection operation, the provider does not honor the
user's multi-tenancy boundary.
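A minimal sketch of that scoping step, for illustration only. The helper name `resolve_collection` and the `FakeCollection` stand-in are hypothetical and are not part of the provider or of the Weaviate client; only the `with_tenant()` call mirrors the real collection API.

```python
from __future__ import annotations


class FakeCollection:
    """Hypothetical stand-in for a collection handle; models only with_tenant()."""

    def __init__(self, name: str, tenant: str | None = None) -> None:
        self.name = name
        self.tenant = tenant

    def with_tenant(self, tenant: str) -> FakeCollection:
        # Return a new handle bound to the tenant, leaving the base handle untouched.
        return FakeCollection(self.name, tenant)


def resolve_collection(collection, tenant: str | None = None):
    """Return the tenant-scoped handle when a tenant is given, else the base handle."""
    if tenant is None:
        return collection
    return collection.with_tenant(tenant)


base = FakeCollection("AirflowTenantRepro")
scoped = resolve_collection(base, "tenant-a")
print(scoped.tenant)                    # tenant-a
print(resolve_collection(base) is base)  # True
```

Routing every object operation through one helper like this is one way to keep a single code path from silently falling back to the base collection.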
This can lead to confusing and risky behavior:
- A Dag author sets `tenant` on the operator and expects data to be written
  into that tenant.
- The ingest task appears successful, but tenant-scoped verification cannot
  find the object.
- Document replacement logic may query or delete objects outside the intended
  tenant scope.
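The symptom in the list above can be illustrated without a Weaviate server. The sketch below uses a hypothetical in-memory `InMemoryCollection` and an `ingest` function with a `honor_tenant` switch, both invented for this illustration. It is a simplification: it does not model how a real Weaviate server treats writes that carry no tenant, only the mismatch between where data is written and where it is read.

```python
from __future__ import annotations

from collections import defaultdict


class InMemoryCollection:
    """Hypothetical store that keeps objects in one bucket per tenant."""

    def __init__(self, store=None, tenant: str | None = None) -> None:
        self._store = store if store is not None else defaultdict(list)
        self._tenant = tenant

    def with_tenant(self, tenant: str) -> InMemoryCollection:
        # Share the backing store, but bind reads and writes to one tenant.
        return InMemoryCollection(self._store, tenant)

    def insert(self, obj: dict) -> None:
        self._store[self._tenant].append(obj)

    def fetch_objects(self) -> list[dict]:
        return list(self._store[self._tenant])


def ingest(collection, data, tenant=None, honor_tenant=True) -> None:
    """Write data; honor_tenant=False mimics a code path that drops the tenant."""
    if tenant is not None and honor_tenant:
        target = collection.with_tenant(tenant)
    else:
        target = collection
    for obj in data:
        target.insert(obj)


base = InMemoryCollection()

# The tenant is configured but ignored: the tenant-scoped read finds nothing.
ingest(base, [{"title": "doc"}], tenant="tenant-a", honor_tenant=False)
print(base.with_tenant("tenant-a").fetch_objects())  # []

# The tenant is applied: the tenant-scoped read finds the object.
ingest(base, [{"title": "doc"}], tenant="tenant-a", honor_tenant=True)
print(base.with_tenant("tenant-a").fetch_objects())  # [{'title': 'doc'}]
```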
## Reproduction
I reproduced the issue with a small Airflow UI Dag.
The collection is multi-tenant, the ingest operator receives
`tenant="tenant-a"`, and the verification task reads from the tenant-scoped
collection.
```python
# Excerpt: imports and the definitions of `hook` and `SAMPLE_DATA` are omitted.
COLLECTION_NAME = "AirflowTenantRepro"
TENANT_NAME = "tenant-a"


@task
def create_collection():
    hook.create_collection(
        COLLECTION_NAME,
        vectorizer_config=None,
        multi_tenancy_config=Configure.multi_tenancy(
            enabled=True,
            auto_tenant_creation=True,
        ),
    )


ingest_with_tenant = WeaviateIngestOperator(
    task_id="ingest_with_tenant",
    conn_id="weaviate_default",
    collection_name=COLLECTION_NAME,
    input_data=SAMPLE_DATA,
    tenant=TENANT_NAME,
)


@task
def verify_tenant_data():
    # Read through the tenant-scoped handle, not the base collection.
    collection = hook.get_collection(COLLECTION_NAME).with_tenant(TENANT_NAME)
    response = collection.query.fetch_objects(limit=10)
    if not response.objects:
        raise RuntimeError("Expected object was not found in tenant")
```
Before this fix, the ingest task completed successfully, but the tenant-scoped
verification task failed because the expected object was not found in
`tenant-a`.
<img width="793" height="391" alt="Screenshot 2026-05-22 1:01:33 AM" src="https://github.com/user-attachments/assets/acf49a28-1eef-4516-96b3-9e82a230927a" />
## Changes
This change makes the configured tenant flow through the provider consistently:
- Passes `tenant` from `WeaviateIngestOperator` to `WeaviateHook.batch_data()`.
- Passes `tenant` from `WeaviateDocumentIngestOperator` to
  `WeaviateHook.create_or_replace_document_objects()`.
- Applies `collection.with_tenant(tenant)` inside `WeaviateHook.batch_data()`
  before batch insertion.
- Applies tenant scoping to document ingestion paths, including
  existing-document lookup, replace/delete, final batch insert, rollback
  delete, and verbose aggregate query.
- Adds optional `tenant` support to `delete_object()` and `_delete_objects()`
  so cleanup and rollback operations stay within the same tenant scope.
- Adds unit test coverage for operator-to-hook tenant handoff and hook-level
  tenant collection usage.
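The operator-to-hook handoff named in the last bullet can be checked with a mocked hook. The sketch below is not the test added in this PR: `SketchIngestOperator` is a hypothetical stand-in for the operator, and the keyword names passed to `batch_data()` are assumptions made for the illustration.

```python
from unittest.mock import MagicMock


class SketchIngestOperator:
    """Hypothetical stand-in for an ingest operator; not the provider class."""

    def __init__(self, hook, collection_name, input_data, tenant=None):
        self.hook = hook
        self.collection_name = collection_name
        self.input_data = input_data
        self.tenant = tenant

    def execute(self):
        # The point under test: the configured tenant must reach the hook call.
        return self.hook.batch_data(
            collection_name=self.collection_name,
            data=self.input_data,
            tenant=self.tenant,
        )


def test_tenant_is_forwarded_to_hook():
    hook = MagicMock()
    operator = SketchIngestOperator(
        hook,
        collection_name="AirflowTenantRepro",
        input_data=[{"title": "doc"}],
        tenant="tenant-a",
    )
    operator.execute()
    # Fails if the operator drops or renames the tenant argument.
    hook.batch_data.assert_called_once_with(
        collection_name="AirflowTenantRepro",
        data=[{"title": "doc"}],
        tenant="tenant-a",
    )


test_tenant_is_forwarded_to_hook()
```

Asserting on the exact keyword arguments, rather than only on the call count, is what catches a handoff that silently omits `tenant`.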
## Result
After the fix, the same Airflow UI reproduction Dag succeeds end to end:
- `create_collection`: success
- `ingest_with_tenant`: success
- `verify_tenant_data`: success
- `cleanup_collection`: success
<img width="774" height="268" alt="Screenshot 2026-05-22 2:05:54 AM" src="https://github.com/user-attachments/assets/1c1b15d8-51f8-4ac0-9dc5-9ad1e3400b94" />
## Tests
I ran the relevant Weaviate provider tests with Breeze:
```bash
breeze run pytest \
  providers/weaviate/tests/unit/weaviate/operators/test_weaviate.py \
  providers/weaviate/tests/unit/weaviate/hooks/test_weaviate.py -xvs
```
Result:
```text
54 passed
```
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes - Codex (GPT-5)