shrukul commented on code in PR #38481: URL: https://github.com/apache/airflow/pull/38481#discussion_r1542886049
##########
docs/apache-airflow/authoring-and-scheduling/datasets.rst:
##########

@@ -224,6 +224,29 @@ If one dataset is updated multiple times before all consumed datasets have been
    }

Attaching extra information to an emitting Dataset Event
--------------------------------------------------------

.. versionadded:: 2.10.0

A task with a dataset outlet can optionally attach extra information before it emits a dataset event. This is different
from `Extra information on Dataset`_. Extra information on a dataset statically describes the entity pointed to by the
dataset URI; extra information on the *dataset event* should instead be used to annotate the triggering data change,
such as how many rows in the database were changed by the update, or the date range it covers.

The easiest way to attach extra information to a dataset event is to access ``dataset_events`` in a task's execution
context:

.. code-block:: python

    example_s3_dataset = Dataset("s3://dataset/example.csv")


    @task(outlets=[example_s3_dataset])
    def write_to_s3(*, dataset_events):
        df = ...  # Get a Pandas DataFrame to write.
        # Write df to dataset...
        dataset_events[example_s3_dataset].extra = {"row_count": len(df)}

This can also be done in classic operators, either by subclassing the operator and overriding ``execute``, or by
supplying a pre- or post-execution function.

Review Comment:

Sorry, I hope I don't sound dumb, but will this work for operators like `AWSGlueOperator` or `DatabricksSubmitRunOperator`, which do not accept user-provided lambdas?

```py
def _write2_post_execute(context, _):
    context["dataset_events"]["test_outlet_dataset_extra_2"].extra = {"x": 1}


BashOperator(
    task_id="write2",
    bash_command=":",
    outlets=Dataset("test_outlet_dataset_extra_2"),
    post_execute=_write2_post_execute,
)
```

As I understand the above code, we are relying on the `context` variable inside the `_write2_post_execute` callable to publish the *extras*.
Not all operators accept user-provided lambdas (case in point: `AWSGlueOperator`, `DatabricksSubmitRunOperator`). I intend to adopt this functionality once it is released, hence the question.
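For operators that do not expose a `pre_execute`/`post_execute` hook, the documentation's other suggestion (subclassing the operator and overriding `execute`) would still apply. A minimal, self-contained sketch of that pattern follows; `FakeGlueOperator` and the simulated context are hypothetical stand-ins so the example runs without an Airflow installation, and in real code you would subclass the actual operator class instead:

```python
from types import SimpleNamespace


class FakeGlueOperator:
    """Hypothetical stand-in for an operator that accepts no user callables."""

    def execute(self, context):
        # ... submit the job, wait for completion, etc. ...
        return {"rows_written": 42}


class GlueOperatorWithEventExtra(FakeGlueOperator):
    """Subclass that attaches extra info to the outlet's dataset event."""

    def execute(self, context):
        result = super().execute(context)
        # Same mechanism as the documented examples: write into the
        # dataset_events accessor exposed through the execution context.
        context["dataset_events"]["s3://dataset/example.csv"].extra = {
            "rows_written": result["rows_written"]
        }
        return result


# Simulated execution context: in Airflow, context["dataset_events"] maps a
# dataset (or its URI) to a mutable event accessor; SimpleNamespace mimics it.
events = {"s3://dataset/example.csv": SimpleNamespace(extra=None)}
context = {"dataset_events": events}
GlueOperatorWithEventExtra().execute(context)
print(events["s3://dataset/example.csv"].extra)  # {'rows_written': 42}
```

The trade-off versus `post_execute` is that the subclass must be defined once per operator type, but it works even when the operator's constructor offers no hook parameters.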