Hello Airflow dev community!

TL;DR;
I'd like AF Datasets to support the following use case: I use a BigQuery table 
MyRawRecords, which is partitioned by date. DAG write_my_raw_records inserts 
records into said table on a daily schedule and is sometimes rerun to either 
correct previously inserted records or to insert late arriving data. There is a 
downstream DAG aggregate_my_raw_records_for_analysis owned by a separate team 
that runs daily and generates aggregate values against MyRawRecords. DAG B 
should be rerun anytime the MyRawRecords table is written to (new/updated data 
for a single partition). The existing Dataset mechanism doesn't support 
providing the target partition to the consuming DAG.

At Bombora, the above is a common use case.  We produce many internally facing 
data products and would like to leverage a data/event-driven approach to 
triggering downstream DAGs without the explicit coupling of 
TriggerDagOperator/ExternalTaskOperator or the implicit coupling of schedule 
alignment with sensor timeouts.

Currently, to leverage Datasets, this use case requires the consuming DAG to 
figure out which target partition of the table has been updated.  Of course, 
the aggregates for the entire table could be recalculated, but this is a huge 
waste of resources.

It seems like this may be one of many logical next steps with respect to 
Dataset related features.  There has been interest in supporting this use case 
voiced in the comments on the original Dataset AIP, 
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling.
  Specifically, this and follow up comments: 
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741.

I have not looked at the Dataset related code and am not sure what the 
complexity of something like this would be.  I assume it wouldn't be trivial, 
given that the Datasets are currently static objects.

Would love to hear some feedback.  Thanks!

Jeff Payne

Reply via email to