SamWheating opened a new issue, #63371:
URL: https://github.com/apache/airflow/issues/63371

   ### Description
   
   The S3Hook registers an output asset for every file which it uploads:
   
https://github.com/apache/airflow/blob/e491aac92a1d50f322a08d92d83616e7c79b3f2e/providers/amazon/src/airflow/providers/amazon/aws/hooks/s3.py#L1376-L1379
   
   This is not always the desired behaviour when using the S3Hook (see the 
motivating example below).
   
   I'd propose adding a switch to the S3Hook to disable this sort of 
lineage:
   ```python
   hook = S3Hook(enable_hook_level_lineage=False)
   ```
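
As a rough illustration of how such a switch might behave, here is a toy model, not the provider's actual code. `RecordingCollector` and `SketchHook` are hypothetical stand-ins invented for this sketch, and `enable_hook_level_lineage` is only the parameter name proposed above; nothing here touches S3.

```python
class RecordingCollector:
    """Hypothetical stand-in for a hook lineage collector."""

    def __init__(self):
        self.outputs = []

    def add_output_asset(self, *, scheme, bucket, key):
        self.outputs.append(f"{scheme}://{bucket}/{key}")


class SketchHook:
    """Toy hook that only models the proposed switch; it performs no S3 calls."""

    def __init__(self, collector, enable_hook_level_lineage=True):
        self.collector = collector
        self.enable_hook_level_lineage = enable_hook_level_lineage

    def load_string(self, string_data, key, bucket_name):
        # A real hook would upload here; the sketch only handles lineage.
        if self.enable_hook_level_lineage:
            self.collector.add_output_asset(scheme="s3", bucket=bucket_name, key=key)


collector = RecordingCollector()
quiet_hook = SketchHook(collector, enable_hook_level_lineage=False)
for idx in range(3):
    quiet_hook.load_string("data", f"some_prefix/file_{idx}.txt", "some_bucket")
```

With the flag off, the loop records no output assets; leaving the default on would keep today's per-file behaviour, so existing users would be unaffected.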
   
   I am happy to submit a fix here, but I wanted to run it by y'all first to 
make sure that I'm not missing some previous context or undoing an intentional 
design decision.
   
   ### Use case/motivation
   
   We have seen issues where users upload chunked data to S3 within a 
PythonOperator like so:
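   ```python
   from airflow.providers.amazon.aws.hooks.s3 import S3Hook

   hook = S3Hook()
   for idx, data in enumerate(list_of_values):
       # Each call registers its own output asset via hook-level lineage.
       hook.load_string(data, f"some_prefix/file_{idx}.txt", "some_bucket")
   ```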
   
   This creates a _ton_ of output assets, one per uploaded file. I know that 
this is capped at 100 output objects (since 
https://github.com/apache/airflow/pull/45798), but it would be nice if we 
could disable hook-level lineage altogether and instead manage our own output 
asset definition at the custom operator / PythonOperator level.
   
   In this case, we likely want to only enable a single output asset at the 
`some_prefix/` level, not one per file. 
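
To make the "one asset per prefix" idea concrete, here is a minimal sketch in plain Python of collapsing per-file keys into a single prefix-level URI. The helper `prefix_asset_uri` is a hypothetical name introduced for illustration, and the `s3://bucket/prefix/` URI shape is an assumption about how one might choose to name such an asset.

```python
import posixpath


def prefix_asset_uri(bucket: str, key: str) -> str:
    """Return the s3:// URI of the "directory" that contains ``key``."""
    prefix = posixpath.dirname(key)
    return f"s3://{bucket}/{prefix}/" if prefix else f"s3://{bucket}/"


keys = [f"some_prefix/file_{idx}.txt" for idx in range(250)]
# Deduplicate while preserving order: 250 uploads map to one prefix-level URI.
output_uris = list(dict.fromkeys(prefix_asset_uri("some_bucket", k) for k in keys))
```

With hook-level lineage disabled, a URI like this could presumably be declared once on the task (for example through its `outlets`) rather than once per uploaded file.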
   
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

