uranusjr commented on code in PR #38481: URL: https://github.com/apache/airflow/pull/38481#discussion_r1543835985
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########

@@ -224,6 +224,29 @@ If one dataset is updated multiple times before all consumed datasets have been
     }
 
+Attaching extra information to an emitting Dataset Event
+--------------------------------------------------------
+
+.. versionadded:: 2.10.0
+
+A task with a dataset outlet can optionally attach extra information before it emits a dataset
+event. This is different from `Extra information on Dataset`_. Extra information on a dataset
+statically describes the entity pointed to by the dataset URI; extra information on the
+*dataset event* should instead be used to annotate the triggering data change, such as how many
+rows in the database were changed by the update, or the date range covered by it.
+
+The easiest way to attach extra information to the dataset event is by accessing
+``dataset_events`` in a task's execution context:
+
+.. code-block:: python
+
+    example_s3_dataset = Dataset("s3://dataset/example.csv")
+
+
+    @task(outlets=[example_s3_dataset])
+    def write_to_s3(*, dataset_events):
+        df = ...  # Get a Pandas DataFrame to write.
+        # Write df to the dataset...
+        dataset_events[example_s3_dataset].extras = {"row_count": len(df)}
+
+This can also be done in classic operators, either by subclassing the operator and overriding
+``execute``, or by supplying a pre- or post-execution function.

Review Comment:
   The `post_execute` hook is a core Airflow feature and should be accepted by all operators. Please report if an operator doesn't handle this correctly; it should be considered a bug.
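To illustrate the post-execution-function approach the doc mentions, here is a minimal, self-contained sketch that does not require Airflow. `FakeS3Operator`, `DatasetEvent`, `attach_row_count`, and the shape of the `context` dict are all simplified stand-ins invented for this example, not Airflow's real classes or context layout; the point is only to show where a `post_execute`-style callable would attach extras to the event record:

```python
# Hypothetical stand-ins for Airflow internals; names and signatures are
# illustrative only, chosen to mirror the pattern described in the docs above.
from dataclasses import dataclass, field


@dataclass
class DatasetEvent:
    extras: dict = field(default_factory=dict)


class FakeS3Operator:
    """Stand-in for a classic operator that writes rows somewhere."""

    def __init__(self, post_execute=None):
        self._post_execute = post_execute

    def run(self, context):
        result = self.execute(context)
        # After execute() finishes, hand the result to the post-execution
        # function, which can annotate the pending dataset event.
        if self._post_execute is not None:
            self._post_execute(context=context, result=result)
        return result

    def execute(self, context):
        rows = [(1, "a"), (2, "b")]  # pretend these rows were written to S3
        return rows


def attach_row_count(*, context, result):
    # Annotate the emitted event with how many rows the update changed.
    event = context["dataset_events"]["s3://dataset/example.csv"]
    event.extras = {"row_count": len(result)}


context = {"dataset_events": {"s3://dataset/example.csv": DatasetEvent()}}
FakeS3Operator(post_execute=attach_row_count).run(context)
print(context["dataset_events"]["s3://dataset/example.csv"].extras)
# {'row_count': 2}
```

The same shape applies to subclassing: override `execute`, write the data, then set the extras on the event accessor before returning.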