Hi Vikram,

Thanks for the additional context. That sounds great. What's the best way for us to connect?
Jeff Payne
________________________________
From: Vikram Koka <vik...@astronomer.io.INVALID>
Sent: Friday, July 14, 2023 8:00 AM
To: jul...@astronomer.io.invalid <jul...@astronomer.io.invalid>
Cc: dev@airflow.apache.org <dev@airflow.apache.org>
Subject: Re: Support for Datasets with execution-time values for use cases requiring more fine-grained dataset specs

Hi Jeff,

Thank you for bringing this up. This is definitely on my radar and part of a larger AIP which I have been in the process of writing up. We thought of this use case and deliberately deferred it in the earlier AIP. Doing this is definitely quite complex, and I think it needs a couple of intermediate steps.

I would be more than happy to collaborate on this work and walk you through my current thinking on this.

Best regards,
Vikram

On Tue, Jul 11, 2023 at 7:18 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
> Hello Jeff,
> I agree it is a very common use case for incremental processing.
> For example, you might schedule a DAG every hour that will use the data
> interval to filter that partition in the input and overwrite a partition
> in the output.
> This also generalizes to more complex cases when you implement roll-up
> windows and, for example, read the previous 24 days every day to create a
> new daily average.
> To do this, we would need the Dataset updates to keep the context of what
> data interval was updated and carry that to the downstream dataset update
> as well.
> I agree with Jarek: we should have a spec for this in the next iteration
> of this feature.
> Best,
> Julien
>
> On Sun, Jul 9, 2023 at 9:19 AM Jeff Payne <jpa...@bombora.com> wrote:
>
> > Thanks for responding. That all sounds reasonable. I don't know that I'm
> > the right person to lead that, but I'm willing to give it a shot. Let me
> > know if you have someone else in mind. In the meantime, I'll discuss the
> > effort with a couple of work colleagues.
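Julien's suggestion above is that a Dataset update should keep the data interval that was written and carry it to downstream consumers. Below is a minimal pure-Python sketch of that idea; it is not Airflow's API (no such event payload existed at the time of this thread), and `DatasetUpdate` and `partitions_touched` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical: a dataset update event that carries the data interval
# that was (re)written, instead of only the dataset URI.
@dataclass
class DatasetUpdate:
    uri: str
    interval_start: date  # inclusive
    interval_end: date    # exclusive, Airflow data-interval convention

def partitions_touched(update: DatasetUpdate) -> list[str]:
    """Expand the updated interval into the daily partitions a downstream
    consumer would need to recompute (rather than rescanning the table)."""
    days = (update.interval_end - update.interval_start).days
    return [(update.interval_start + timedelta(days=d)).isoformat()
            for d in range(days)]

# A daily DAG run rewriting one date partition:
update = DatasetUpdate("bigquery://proj/ds/MyRawRecords",
                       date(2023, 7, 1), date(2023, 7, 2))
print(partitions_touched(update))  # ['2023-07-01']
```

A roll-up consumer, as in Julien's 24-day average example, could widen the interval on its side before calling the same expansion.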
> > ________________________________
> > From: Jarek Potiuk <ja...@potiuk.com>
> > Sent: Saturday, July 8, 2023 11:39 PM
> > To: dev@airflow.apache.org <dev@airflow.apache.org>
> > Subject: Re: Support for Datasets with execution-time values for use
> > cases requiring more fine-grained dataset specs
> >
> > Yeah, agree with Elad. I think writing a follow-up AIP on a few cases for
> > Datasets (those that were discussed and specifically left out for later)
> > is probably the best thing we could do, and getting someone to lead it
> > and write such an AIP would be great.
> >
> > I personally think that datasets are not as useful or popular as they
> > could be (though of course we have no good data to base that on; it's
> > more of a gut feeling judging from the number of issues people raise
> > about them, which is, in general, very few).
> >
> > These are the areas that I personally think could help drive more usage:
> >
> > * partial datasets (as in the original thread)
> > * triggering datasets based on external dataset changes (banking on the
> > evident popularity of Deferrable Operators and Triggers, it feels like we
> > could make a nice, efficient, customizable solution for that one)
> > * time + dataset combined dependencies (the two issues mentioned by
> > Elad)
> >
> > It would be great to make the Dataset story more complete with these
> > three.
> >
> > J.
> >
> > On Sun, Jul 9, 2023 at 8:12 AM Elad Kalif <elad...@apache.org> wrote:
> >
> > > Hi Jeff,
> > >
> > > This is a valid use case, but we need to think about how to approach
> > > this.
> > > It may be a trivial change (or it may not, I don't know), but what
> > > worries me is that we have several other requests (for example,
> > > https://github.com/apache/airflow/issues/30974 and
> > > https://github.com/apache/airflow/issues/31013), each of which is
> > > unique, and we need to find a way to make sure it's easy and simple to
> > > set up Datasets without causing side effects or parameter
> > > conflicts/madness.
> > >
> > > I think we need to first discuss the mechanism that allows Datasets to
> > > support the new use cases in a holistic manner, rather than patching
> > > each individual use case.
> > >
> > > I don't know if this is AIP-worthy. If you'd like to lead this, I'm
> > > happy to review.
> > >
> > > On Thu, Jul 6, 2023 at 5:36 AM Jeff Payne <jpa...@bombora.com> wrote:
> > >
> > > > Hello Airflow dev community!
> > > >
> > > > TL;DR:
> > > > I'd like Airflow Datasets to support the following use case: I use a
> > > > BigQuery table, MyRawRecords, which is partitioned by date. A DAG,
> > > > write_my_raw_records, inserts records into that table on a daily
> > > > schedule and is sometimes rerun either to correct previously inserted
> > > > records or to insert late-arriving data. There is a downstream DAG,
> > > > aggregate_my_raw_records_for_analysis, owned by a separate team, that
> > > > runs daily and generates aggregate values against MyRawRecords. The
> > > > downstream DAG should be rerun any time the MyRawRecords table is
> > > > written to (new/updated data for a single partition). The existing
> > > > Dataset mechanism doesn't support providing the target partition to
> > > > the consuming DAG.
> > > >
> > > > At Bombora, the above is a common use case.
> > > > We produce many internally
> > > > facing data products and would like to leverage a data/event-driven
> > > > approach to triggering downstream DAGs without the explicit coupling
> > > > of TriggerDagRunOperator/ExternalTaskSensor or the implicit coupling
> > > > of schedule alignment with sensor timeouts.
> > > >
> > > > Currently, to leverage Datasets, this use case requires the consuming
> > > > DAG to figure out which target partition of the table has been
> > > > updated. Of course, the aggregates for the entire table could be
> > > > recalculated, but this is a huge waste of resources.
> > > >
> > > > It seems like this may be one of many logical next steps with respect
> > > > to Dataset-related features. There has been interest in supporting
> > > > this use case voiced in the comments on the original Dataset AIP,
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
> > > > . Specifically, this and follow-up comments:
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741
> > > > .
> > > >
> > > > I have not looked at the Dataset-related code and am not sure what
> > > > the complexity of something like this would be. I assume it wouldn't
> > > > be trivial, given that Datasets are currently static objects.
> > > >
> > > > Would love to hear some feedback. Thanks!
> > > >
> > > > Jeff Payne
> > > >
> > >
> >
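The gap Jeff describes can be illustrated with a minimal simulation of how dataset-driven scheduling behaves today: a consumer DAG is queued whenever its upstream dataset URI receives an update event, and the event carries nothing but the URI. This is a pure-Python sketch, not Airflow's scheduler code; `MiniScheduler` and its methods are hypothetical names, and the URIs reuse the examples from Jeff's message.

```python
# Sketch (not real Airflow internals): dataset events identify only a URI,
# so a consumer cannot learn which date partition of the table changed.
class MiniScheduler:
    def __init__(self):
        self.consumers = {}  # dataset URI -> DAG ids scheduled on it
        self.queued = []     # (dag_id, triggering_uri) runs to create

    def register(self, dag_id, uri):
        """Consumer declares `schedule=[Dataset(uri)]`, in today's terms."""
        self.consumers.setdefault(uri, []).append(dag_id)

    def dataset_updated(self, uri):
        # All the event conveys is "this URI changed". No partition and no
        # data interval reaches the consumer, which is the limitation
        # discussed in this thread.
        for dag_id in self.consumers.get(uri, []):
            self.queued.append((dag_id, uri))

s = MiniScheduler()
s.register("aggregate_my_raw_records_for_analysis",
           "bigquery://proj/ds/MyRawRecords")
s.dataset_updated("bigquery://proj/ds/MyRawRecords")
print(s.queued)
# [('aggregate_my_raw_records_for_analysis', 'bigquery://proj/ds/MyRawRecords')]
```

With only this signal available, the consumer's options are to rescan the whole table or to reverse-engineer the changed partition, which is exactly the waste Jeff points out.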