Stateful Tasks (was Poke Reschedule)

Jarek Potiuk Mon, 13 Jan 2020 15:48:12 -0800

I think the main subject has been hijacked (sorry Jacob) - true Pandora's
box. I will just change the subject to something more appropriate and I
proposed we leave the original thread to discuss rescheduling.


I wonder what others think about it (commiters/PMCs)? I think this
subject has been discussed already and maybe we could hear other opinions ?

J.


On Mon, Jan 13, 2020 at 10:47 PM Shaw, Damian P. <
[email protected]> wrote:

> Here are a couple more examples of where as an Airflow user I am
> introducing state in to my tasks (so they technically breaks idempotency):
> * Authentication mechanisms that return errors based on wrong details,
> state is kept so retries don't cause multiple login attempts which would
> cause the account to be locked, in thie case manual intervention is
> required to re-enable the tasks
> * Sensors that send email/sms notifications on certain events, state is
> kept so they're not redunently sent on retry/reschedule, state is also kept
> so they can "escalate" in urgency each time they are sent out even across
> retries/reschedules.
>
> I understand wanting to keep Airflow tasks as stateless as possible but
> from my perspective real world systems are not stateless (or without side
> effect) and modeling that state is sometimes the only viable solution. The
> motivation of this thread is that some users are saying they need state, I
> can tell you at least some users are already hacking state in to their
> existing Airflow setup and would greately appreciate if implemented it
> doesn't have arbitrary limits on it's use :). My 2 cents as a user anyway.
>
> Damian
>
> -----Original Message-----
> From: Daniel Standish <[email protected]>
> Sent: Monday, January 13, 2020 14:03
> To: [email protected]
> Subject: Re: [Discussion] In Prep for AIP: Stateful XComs and Poke
> Rescheduling in Operators
>
> @Jarek & Fokko
> Firstly, thank you for the thorough and thoughtful replies -- I appreciate
> it.
>
> I recognize that the execution_date-based idempotence pattern is core to
> airflow.  I understand that it motivated a lot of the design decisions, and
> that it is a pattern that is endorsed and explicitly encouraged in the
> documentation, and by the community in general.
>
> But surely it's not the position of the airflow project that data
> pipelines deviating from this pattern are officially forbidden and
> unsupported?
>
> I think there is a place in airflow for tasks that are not driven by
> "execution_date", as well as tasks for which "airflow backfill" does not
> make sense.  Airflow provides great flexibility, and a lot of value and
> applications beyond that one pattern.
>
> I believe there are cases where it makes sense to persist some information
> for a task or a dag.  Incremental processes is one.  The reschedule poke
> operator is another.  Breaking a heavy data transfer process into
> sub-batches and storing progress is another.  And I'd be in favor of
> providing a better built-in mechanism to do it within airflow.  I would not
> constrain it to async job ids; just arbitrary data however the user wants
> to use it.
>
> I am not convinced that adding the capacity to store information
> associated with dag+task undermines or meaningfully alters the airflow
> project.
>
> I would bet that pipelines not following the canonical pattern are pretty
> common in the wild.  And when you don't for whatever reason lean on
> execution_date, it's nice to have a different mechanism for storing state.
>
> That's my take.
>
> Detailed responses below.
>
> @Jarek
> RE: the "cluster maintainer" role, and the "idempotence guarantee", and
> "by adding state you lose the property of idempotency"
>
> I don't think airflow provides any such guarantee as it is.
>
>    - By themselves, even built-in airflow operators do not in general
>    provide an idempotence guarantee.  It really depends how they are used.
>       - For example, a _SqlOperator has a `sql` field.  Whether the task is
>       idempotent or not depends entirely on the `sql` that is passed to the
>       operator.
>    - With non-built-in operators, all bets are off.
>    - Variables and XCom can be used to persist and transmit state.
>    - People use catchup=False to facilitate non-idempotent tasks.
>    - As a scheduler, it can execute arbitrary processes.
>
> These facts mean that as a cluster maintainer, if there is an idempotence
> guarantee, it's because the team enforces it -- not because airflow
> guarantees it.  Airflow is simply unable to enforce this by itself.
>
> @Jarek & Fokko
> RE: you are using the wrong tool for this job, airflow is not for
> streaming, you should look at apache beam, read "beyond the horizon"
>
> Incremental processes are not necessarily streaming processes.  The
> incremental scenarios I outlined are still batch processes -- hourly, every
> 4 hours, daily, that kind of thing.  I am just determining the intervals
> in a different way from an execution_date-driven process.
> Should I use another tool?  Fundamentally, what we're talking about is the
> ability to store a small bit of data associated with a task+dag.  When I
> consider this gap in the context of everything airflow provides, it's
> insignificant by comparison, and there are ways I can solve it:
>
>    - xcom (notwithstanding edge case problems e.g. getting cleared)
>    - variables (notwithstanding messiness)
>    - roll your own model (notwithstanding effort and complexity on my
>    system)
>       - incidentally, because of airflow's plugin system, if you do roll
>       your own, you can even expose your in-house model through an admin
> view in
>       airflow!
>
> Nonetheless, I don't think we should *make* users solve it.  And I don't
> think we need beam or spark e.g. to schedule sql scripts, merely because
> they are incremental instead of idempotent.
>
> @Jarek
> RE: adding support for state adds more work for committers Maybe any
> feature that is a new thing can create more work.  But there is also work
> it may eliminate.  By providing explicit support for this design pattern,
> users would not have to force it into XCom or Variable.  We could have a
> docs page "howto/incremental_process".  This could reduce questions on "how
> can i do an incremental process?".  By *not* providing support, the burden
> is shifted to users, who will find a way, but at some cost.
> As you have said, we could tell users this is not supported, but why turn
> someone away when the platform can actually handle that pattern just fine?
>
> @Fokko
> RE:
>
> > >  This value needs to be stored somewhere.
> > This data is already available, but would you want to duplicate this
> > in Airflow for optimization, my answer would be: no.
>
>
> Let me try to clarify.  The context here was a dim / fact / reporting
> process that is downstream of several tables with varying delivery cadences.
> When incremental load begins, you need to record the state of the source
> tables.  Storing min of max modified dates from source tables allows us to
> know what we need to process in next run.  This value truly does need to be
> stored somewhere; there is no way to figure this out after the fact by
> querying the content of the target.
>
> @Fokko
> RE:
>
> > Also, if you have a modern data platform, fetching the watermark can
> > be constant. I've also mentioned this on the Slack
>
>
> Yes in many data processes it is possible to avoid storing a watermark.
> Other times, it may not be possible.  Still other times, while maybe it is
> possible to avoid it, storing watermark is the practical and elegant
> solution.
> Querying the target is not without its own pitfalls and complexities.
> Maybe the target is a database where you must pay $ for each query, while
> the source is on-prem and no cost.  Or maybe you can't be sure of the
> timing of the downstream process.  Maybe querying downstream requires
> adding another hook to your operator and opening a new connection and
> spinning up a warehouse.  If the target is files, there are complexities
> here too.  The point is, there are costs and tradeoffs, and I think
> watermarking based on source can for sure be defensible.  Possibility is
> not always determinative.
>
>
>
> ===============================================================================
>
> Please access the attached hyperlink for an important electronic
> communications disclaimer:
> http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
> ===============================================================================
>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Stateful Tasks (was Poke Reschedule)

Reply via email to