Hi Vikram,

Thanks for the additional context. That sounds great. What's the best way for us to connect?
Jeff Payne
________________________________
From: Vikram Koka <vik...@astronomer.io.INVALID>
Sent: Friday, July 14, 2023 8:00 AM
To: jul...@astronomer.io.invalid <jul...@astronomer.io.invalid>
Cc: dev@airflow.apache.org <dev@airflow.apache.org>
Subject: Re: Support for Datasets with execution-time values for use cases requiring more fine-grained dataset specs

Hi Jeff,

Thank you for bringing this up. This is definitely on my radar and part of a larger AIP which I have been in the process of writing up. We thought of this use case and deliberately deferred it in the earlier AIP. Doing this is definitely quite complex, and I think it needs a couple of intermediate steps.

I would be more than happy to collaborate on this work and walk you through my current thinking on this.

Best regards,
Vikram

On Tue, Jul 11, 2023 at 7:18 AM Julien Le Dem <jul...@astronomer.io.invalid> wrote:
> Hello Jeff,
> I agree it is a very common use case for incremental processing.
> For example, you might schedule a DAG every hour that will use the data
> interval to filter that partition in the input and overwrite a partition
> in the output.
> This also generalizes to more complex cases when you implement roll-up
> windows and, for example, read the previous 24 days every day to create a
> new daily average.
> To do this, we would need the Dataset updates to keep the context of what
> data interval was updated and carry that to the downstream dataset update
> as well.
> I agree with Jarek: we should have a spec for this in the next iteration
> of this feature.
> Best,
> Julien
>
> On Sun, Jul 9, 2023 at 9:19 AM Jeff Payne <jpa...@bombora.com> wrote:
>
> > Thanks for responding. That all sounds reasonable. I don't know that I'm
> > the right person to lead that, but I'm willing to give it a shot. Let me
> > know if you have someone else in mind. In the meantime, I'll discuss the
> > effort with a couple of work colleagues.
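Julien's suggestion above is that a Dataset update should keep the data interval that was written and carry it to downstream consumers. Below is a minimal pure-Python sketch of that idea; it is not Airflow's API (no such event payload existed at the time of this thread), and `DatasetUpdate` and `partitions_touched` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical: a dataset update event that carries the data interval
# that was (re)written, instead of only the dataset URI.
@dataclass
class DatasetUpdate:
    uri: str
    interval_start: date  # inclusive
    interval_end: date    # exclusive, Airflow data-interval convention

def partitions_touched(update: DatasetUpdate) -> list[str]:
    """Expand the updated interval into the daily partitions a downstream
    consumer would need to recompute (rather than rescanning the table)."""
    days = (update.interval_end - update.interval_start).days
    return [(update.interval_start + timedelta(days=d)).isoformat()
            for d in range(days)]

# A daily DAG run rewriting one date partition:
update = DatasetUpdate("bigquery://proj/ds/MyRawRecords",
                       date(2023, 7, 1), date(2023, 7, 2))
print(partitions_touched(update))  # ['2023-07-01']
```

A roll-up consumer, as in Julien's 24-day average example, could widen the interval on its side before calling the same expansion.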
> > ________________________________
> > From: Jarek Potiuk <ja...@potiuk.com>
> > Sent: Saturday, July 8, 2023 11:39 PM
> > To: dev@airflow.apache.org <dev@airflow.apache.org>
> > Subject: Re: Support for Datasets with execution-time values for use
> > cases requiring more fine-grained dataset specs
> >
> > Yeah, agree with Elad. I think writing a follow-up AIP on a few cases for
> > Datasets (those that were discussed and specifically left out for later)
> > is probably the best thing we could do, and getting someone to lead it
> > and write such an AIP would be great.
> >
> > I personally think that datasets are not as useful or popular as they
> > could be (though of course we have no good data to base that on; it's
> > more of a gut feeling judging from the number of issues people raise
> > about them, which is, in general, very few).
> >
> > These are the areas that I personally think could help drive more usage:
> >
> > * partial datasets (as in the original thread)
> > * triggering datasets based on external dataset changes (banking on the
> > evident popularity of Deferrable Operators and Triggers, it feels like we
> > could make a nice, efficient, customizable solution for that one)
> > * time + dataset combined dependencies (the two issues mentioned by
> > Elad)
> >
> > It would be great to make the Dataset story more complete with these
> > three.
> >
> > J.
> >
> > On Sun, Jul 9, 2023 at 8:12 AM Elad Kalif <elad...@apache.org> wrote:
> >
> > > Hi Jeff,
> > >
> > > This is a valid use case, but we need to think about how to approach
> > > this.
> > > It may be a trivial change (or it may not, I don't know), but what
> > > worries me is that we have several other requests (for example,
> > > https://github.com/apache/airflow/issues/30974 and
> > > https://github.com/apache/airflow/issues/31013), each of which is
> > > unique, and we need to find a way to make sure it's easy and simple to
> > > set up Datasets without causing side effects or parameter
> > > conflicts/madness.
> > >
> > > I think we need to first discuss the mechanism that allows Datasets to
> > > support the new use cases in a holistic manner, rather than patching
> > > each individual use case.
> > >
> > > I don't know if this is AIP-worthy. If you'd like to lead this, I'm
> > > happy to review.
> > >
> > > On Thu, Jul 6, 2023 at 5:36 AM Jeff Payne <jpa...@bombora.com> wrote:
> > >
> > > > Hello Airflow dev community!
> > > >
> > > > TL;DR:
> > > > I'd like Airflow Datasets to support the following use case: I use a
> > > > BigQuery table, MyRawRecords, which is partitioned by date. A DAG,
> > > > write_my_raw_records, inserts records into that table on a daily
> > > > schedule and is sometimes rerun either to correct previously inserted
> > > > records or to insert late-arriving data. There is a downstream DAG,
> > > > aggregate_my_raw_records_for_analysis, owned by a separate team, that
> > > > runs daily and generates aggregate values against MyRawRecords. The
> > > > downstream DAG should be rerun any time the MyRawRecords table is
> > > > written to (new/updated data for a single partition). The existing
> > > > Dataset mechanism doesn't support providing the target partition to
> > > > the consuming DAG.
> > > >
> > > > At Bombora, the above is a common use case.
> > > > We produce many internally
> > > > facing data products and would like to leverage a data/event-driven
> > > > approach to triggering downstream DAGs without the explicit coupling
> > > > of TriggerDagRunOperator/ExternalTaskSensor or the implicit coupling
> > > > of schedule alignment with sensor timeouts.
> > > >
> > > > Currently, to leverage Datasets, this use case requires the consuming
> > > > DAG to figure out which target partition of the table has been
> > > > updated. Of course, the aggregates for the entire table could be
> > > > recalculated, but this is a huge waste of resources.
> > > >
> > > > It seems like this may be one of many logical next steps with respect
> > > > to Dataset-related features. There has been interest in supporting
> > > > this use case voiced in the comments on the original Dataset AIP,
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling
> > > > . Specifically, this and follow-up comments:
> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling?focusedCommentId=217385741#comment-217385741
> > > > .
> > > >
> > > > I have not looked at the Dataset-related code and am not sure what
> > > > the complexity of something like this would be. I assume it wouldn't
> > > > be trivial, given that Datasets are currently static objects.
> > > >
> > > > Would love to hear some feedback. Thanks!
> > > >
> > > > Jeff Payne
> > > >
> > >
> >
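The gap Jeff describes can be illustrated with a minimal simulation of how dataset-driven scheduling behaves today: a consumer DAG is queued whenever its upstream dataset URI receives an update event, and the event carries nothing but the URI. This is a pure-Python sketch, not Airflow's scheduler code; `MiniScheduler` and its methods are hypothetical names, and the URIs reuse the examples from Jeff's message.

```python
# Sketch (not real Airflow internals): dataset events identify only a URI,
# so a consumer cannot learn which date partition of the table changed.
class MiniScheduler:
    def __init__(self):
        self.consumers = {}  # dataset URI -> DAG ids scheduled on it
        self.queued = []     # (dag_id, triggering_uri) runs to create

    def register(self, dag_id, uri):
        """Consumer declares `schedule=[Dataset(uri)]`, in today's terms."""
        self.consumers.setdefault(uri, []).append(dag_id)

    def dataset_updated(self, uri):
        # All the event conveys is "this URI changed". No partition and no
        # data interval reaches the consumer, which is the limitation
        # discussed in this thread.
        for dag_id in self.consumers.get(uri, []):
            self.queued.append((dag_id, uri))

s = MiniScheduler()
s.register("aggregate_my_raw_records_for_analysis",
           "bigquery://proj/ds/MyRawRecords")
s.dataset_updated("bigquery://proj/ds/MyRawRecords")
print(s.queued)
# [('aggregate_my_raw_records_for_analysis', 'bigquery://proj/ds/MyRawRecords')]
```

With only this signal available, the consumer's options are to rescan the whole table or to reverse-engineer the changed partition, which is exactly the waste Jeff points out.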