Re: Stateful Tasks (was Poke Reschedule)

Kaxil Naik Wed, 15 Jan 2020 13:42:38 -0800

Hi Daniel,

Couldn't find "dstandish" but added "dpstandish", please confirm it is you
:)


Regards,
Kaxil

On Wed, Jan 15, 2020 at 9:35 PM Daniel Standish <dpstand...@gmail.com>
wrote:

> Can someone grant me AIP creation powers? (username dstandish)
>
> I think AIP is necessary make things more concrete and focus the discussion
> and I can work on one.
>
>
>
>
>
> On Tue, Jan 14, 2020 at 10:08 AM Chris Palmer <ch...@crpalmer.com> wrote:
>
> > I think some of the discussions about incremental and/or idempotency are
> > confusing the topic and are a distraction from the real question. As I
> said
> > in my previous reply on this thread, many tasks utilize state that is
> kept
> > somewhere in order to achieve idempotency in an efficient way. Whether
> that
> > be data in a target database table, or the files that exist in a
> > target filesystem. So the question we should be concerned with is not
> "are
> > there patterns that fit the Airflow model that require state to be kept
> > somewhere?" but is simply "Should that state be kept in Airflow?"
> >
> > I think my answer is no for a couple of reasons:
> >
> >    1. There is a possibility that the state kept in Airflow diverges from
> >    the reality of the target system it is supposed to represent. As Fokko
> > said
> >    "This data is already available" and storing it in Airflow is just
> "for
> >    optimization".
> >    2. I think it adds unnecessary complexity to Airflow, and risks
> feature
> >    creep. We would be better served keeping a tight focus on what we do
> > well.
> >
> >
> > However, I have a lot of sympathy for the arguments that Daniel and
> Damien
> > are making because the same arguments could be said for the idea of
> keeping
> > state across reschedules. The typical use case for that feature is to
> store
> > something like a BigQuery job id, which you poll against periodically.
> > Rather than store that id in Airflow, you could also add labels to the
> > BigQuery job when created and then at poll time, list the recent BigQuery
> > jobs and find the one with the correct labels.
> >
> > In the case where the system where your long running job is executing
> > doesn't have any ability to label/tag a job or search for jobs, then you
> > could store the relevant id in a file in S3 (or similar) in a location
> that
> > is templated based on dag, task and execution_date.
> >
> > In that sense, Airflow keeping state across reschedules, or retries and
> > even the existing XCom is just an optimization for data that generally
> > exists or could be persisted elsewhere. At its core I think Airflow
> should
> > care about scheduling, dependencies and what tasks ran and when, but
> > shouldn't care about the outcomes of those tasks beyond success or
> failure.
> > The existing XCom and any expansion of that adds a messaging component to
> > Airflow that I think is not its core functionality. I'm not advocating
> > removing XCom, but I say no to expanding upon it.
> >
> > However, IF we are going to expand on XCom and add more messaging between
> > tasks, task instances and/or different runs of task instances then:
> >
> >    1. Why limit it a few possible scenarios?
> >    2. Is a database really the best way of implementing that messaging
> >    functionality?
> >
> > Chris
> >
> > On Mon, Jan 13, 2020 at 7:03 PM Daniel Imberman <
> daniel.imber...@gmail.com
> > >
> > wrote:
> >
> > > @jarek reading through the history of this and will try to come back
> with
> > > thoughts.
> > >
> > > via Newton Mail
> > > [
> > >
> >
> https://cloudmagic.com/k/d/mailapp?ct=dx&cv=10.0.32&pv=10.14.5&source=email_footer_2
> > > ]
> > > On Mon, Jan 13, 2020 at 3:47 PM, Jarek Potiuk <
> jarek.pot...@polidea.com>
> > > wrote:
> > > I think the main subject has been hijacked (sorry Jacob) - true
> Pandora's
> > > box. I will just change the subject to something more appropriate and I
> > > proposed we leave the original thread to discuss rescheduling.
> > >
> > > I wonder what others think about it (commiters/PMCs)? I think this
> > > subject has been discussed already and maybe we could hear other
> > opinions ?
> > >
> > > J.
> > >
> > >
> > > On Mon, Jan 13, 2020 at 10:47 PM Shaw, Damian P. <
> > > damian.sha...@credit-suisse.com> wrote:
> > >
> > >  > Here are a couple more examples of where as an Airflow user I am
> > >  > introducing state in to my tasks (so they technically breaks
> > > idempotency):
> > >  > * Authentication mechanisms that return errors based on wrong
> details,
> > >  > state is kept so retries don't cause multiple login attempts which
> > would
> > >  > cause the account to be locked, in thie case manual intervention is
> > >  > required to re-enable the tasks
> > >  > * Sensors that send email/sms notifications on certain events, state
> > is
> > >  > kept so they're not redunently sent on retry/reschedule, state is
> also
> > > kept
> > >  > so they can "escalate" in urgency each time they are sent out even
> > > across
> > >  > retries/reschedules.
> > >  >
> > >  > I understand wanting to keep Airflow tasks as stateless as possible
> > but
> > >  > from my perspective real world systems are not stateless (or without
> > > side
> > >  > effect) and modeling that state is sometimes the only viable
> solution.
> > > The
> > >  > motivation of this thread is that some users are saying they need
> > > state,
> > > I
> > >  > can tell you at least some users are already hacking state in to
> their
> > >  > existing Airflow setup and would greately appreciate if implemented
> it
> > >  > doesn't have arbitrary limits on it's use :). My 2 cents as a user
> > > anyway.
> > >  >
> > >  > Damian
> > >  >
> > >  > -----Original Message-----
> > >  > From: Daniel Standish <dpstand...@gmail.com>
> > >  > Sent: Monday, January 13, 2020 14:03
> > >  > To: dev@airflow.apache.org
> > >  > Subject: Re: [Discussion] In Prep for AIP: Stateful XComs and Poke
> > >  > Rescheduling in Operators
> > >  >
> > >  > @Jarek & Fokko
> > >  > Firstly, thank you for the thorough and thoughtful replies -- I
> > > appreciate
> > >  > it.
> > >  >
> > >  > I recognize that the execution_date-based idempotence pattern is
> core
> > to
> > >  > airflow. I understand that it motivated a lot of the design
> decisions,
> > > and
> > >  > that it is a pattern that is endorsed and explicitly encouraged in
> the
> > >  > documentation, and by the community in general.
> > >  >
> > >  > But surely it's not the position of the airflow project that data
> > >  > pipelines deviating from this pattern are officially forbidden and
> > >  > unsupported?
> > >  >
> > >  > I think there is a place in airflow for tasks that are not driven by
> > >  > "execution_date", as well as tasks for which "airflow backfill" does
> > not
> > >  > make sense. Airflow provides great flexibility, and a lot of value
> and
> > >  > applications beyond that one pattern.
> > >  >
> > >  > I believe there are cases where it makes sense to persist some
> > > information
> > >  > for a task or a dag. Incremental processes is one. The reschedule
> poke
> > >  > operator is another. Breaking a heavy data transfer process into
> > >  > sub-batches and storing progress is another. And I'd be in favor of
> > >  > providing a better built-in mechanism to do it within airflow. I
> would
> > > not
> > >  > constrain it to async job ids; just arbitrary data however the user
> > > wants
> > >  > to use it.
> > >  >
> > >  > I am not convinced that adding the capacity to store information
> > >  > associated with dag+task undermines or meaningfully alters the
> airflow
> > >  > project.
> > >  >
> > >  > I would bet that pipelines not following the canonical pattern are
> > > pretty
> > >  > common in the wild. And when you don't for whatever reason lean on
> > >  > execution_date, it's nice to have a different mechanism for storing
> > > state.
> > >  >
> > >  > That's my take.
> > >  >
> > >  > Detailed responses below.
> > >  >
> > >  > @Jarek
> > >  > RE: the "cluster maintainer" role, and the "idempotence guarantee",
> > and
> > >  > "by adding state you lose the property of idempotency"
> > >  >
> > >  > I don't think airflow provides any such guarantee as it is.
> > >  >
> > >  > - By themselves, even built-in airflow operators do not in general
> > >  > provide an idempotence guarantee. It really depends how they are
> used.
> > >  > - For example, a _SqlOperator has a `sql` field. Whether the task is
> > >  > idempotent or not depends entirely on the `sql` that is passed to
> the
> > >  > operator.
> > >  > - With non-built-in operators, all bets are off.
> > >  > - Variables and XCom can be used to persist and transmit state.
> > >  > - People use catchup=False to facilitate non-idempotent tasks.
> > >  > - As a scheduler, it can execute arbitrary processes.
> > >  >
> > >  > These facts mean that as a cluster maintainer, if there is an
> > > idempotence
> > >  > guarantee, it's because the team enforces it -- not because airflow
> > >  > guarantees it. Airflow is simply unable to enforce this by itself.
> > >  >
> > >  > @Jarek & Fokko
> > >  > RE: you are using the wrong tool for this job, airflow is not for
> > >  > streaming, you should look at apache beam, read "beyond the horizon"
> > >  >
> > >  > Incremental processes are not necessarily streaming processes. The
> > >  > incremental scenarios I outlined are still batch processes --
> hourly,
> > > every
> > >  > 4 hours, daily, that kind of thing. I am just determining the
> > intervals
> > >  > in a different way from an execution_date-driven process.
> > >  > Should I use another tool? Fundamentally, what we're talking about
> is
> > > the
> > >  > ability to store a small bit of data associated with a task+dag.
> When
> > I
> > >  > consider this gap in the context of everything airflow provides,
> it's
> > >  > insignificant by comparison, and there are ways I can solve it:
> > >  >
> > >  > - xcom (notwithstanding edge case problems e.g. getting cleared)
> > >  > - variables (notwithstanding messiness)
> > >  > - roll your own model (notwithstanding effort and complexity on my
> > >  > system)
> > >  > - incidentally, because of airflow's plugin system, if you do roll
> > >  > your own, you can even expose your in-house model through an admin
> > >  > view in
> > >  > airflow!
> > >  >
> > >  > Nonetheless, I don't think we should *make* users solve it. And I
> > don't
> > >  > think we need beam or spark e.g. to schedule sql scripts, merely
> > because
> > >  > they are incremental instead of idempotent.
> > >  >
> > >  > @Jarek
> > >  > RE: adding support for state adds more work for committers Maybe any
> > >  > feature that is a new thing can create more work. But there is also
> > work
> > >  > it may eliminate. By providing explicit support for this design
> > pattern,
> > >  > users would not have to force it into XCom or Variable. We could
> have
> > a
> > >  > docs page "howto/incremental_process". This could reduce questions
> on
> > > "how
> > >  > can i do an incremental process?". By *not* providing support, the
> > > burden
> > >  > is shifted to users, who will find a way, but at some cost.
> > >  > As you have said, we could tell users this is not supported, but why
> > > turn
> > >  > someone away when the platform can actually handle that pattern just
> > > fine?
> > >  >
> > >  > @Fokko
> > >  > RE:
> > >  >
> > >  > > > This value needs to be stored somewhere.
> > >  > > This data is already available, but would you want to duplicate
> this
> > >  > > in Airflow for optimization, my answer would be: no.
> > >  >
> > >  >
> > >  > Let me try to clarify. The context here was a dim / fact / reporting
> > >  > process that is downstream of several tables with varying delivery
> > > cadences.
> > >  > When incremental load begins, you need to record the state of the
> > source
> > >  > tables. Storing min of max modified dates from source tables allows
> us
> > > to
> > >  > know what we need to process in next run. This value truly does need
> > to
> > > be
> > >  > stored somewhere; there is no way to figure this out after the fact
> by
> > >  > querying the content of the target.
> > >  >
> > >  > @Fokko
> > >  > RE:
> > >  >
> > >  > > Also, if you have a modern data platform, fetching the watermark
> can
> > >  > > be constant. I've also mentioned this on the Slack
> > >  >
> > >  >
> > >  > Yes in many data processes it is possible to avoid storing a
> > watermark.
> > >  > Other times, it may not be possible. Still other times, while maybe
> it
> > > is
> > >  > possible to avoid it, storing watermark is the practical and elegant
> > >  > solution.
> > >  > Querying the target is not without its own pitfalls and
> complexities.
> > >  > Maybe the target is a database where you must pay $ for each query,
> > > while
> > >  > the source is on-prem and no cost. Or maybe you can't be sure of the
> > >  > timing of the downstream process. Maybe querying downstream requires
> > >  > adding another hook to your operator and opening a new connection
> and
> > >  > spinning up a warehouse. If the target is files, there are
> > complexities
> > >  > here too. The point is, there are costs and tradeoffs, and I think
> > >  > watermarking based on source can for sure be defensible. Possibility
> > is
> > >  > not always determinative.
> > >  >
> > >  >
> > >  >
> > >  >
> > >
> > >
> >
> ===============================================================================
> > >  >
> > >  > Please access the attached hyperlink for an important electronic
> > >  > communications disclaimer:
> > >  > http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> > >  >
> > >
> > >
> >
> ===============================================================================
> > >  >
> > >  >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> >
>

Re: Stateful Tasks (was Poke Reschedule)

Reply via email to