All for it. I think misunderstandings and assumptions about what "idempotency"
really means in the context of Airflow tasks have bitten us more than once.
I'd love to help with working out the right definition (and it's not
straightforward). I will have to give it quite a bit of thinking to get some
of the corner cases and "guidelines" on them hashed out.

On Tue, Jul 7, 2020 at 12:55 PM Tomasz Urbaszek <turbas...@apache.org>
wrote:

> Hello everyone,
>
> The wealth of integrations with external services, a.k.a. operators, is
> one of the biggest advantages of Airflow. As the documentation states:
> "An operator represents a single, ideally idempotent, task. "
>
> Idempotence is, I think, the key to creating a usable operator. It
> assures that we can run backfills and use fewer resources. The problem
> is that there's no official Airflow definition of idempotence. Or at
> least I'm not aware of any.
>
> What do I mean by "Airflow definition"? By this, I mean a guide or
> recipe for making an operator idempotent including the limits of
> real-world idempotency.
>
> The reason for bringing up this topic is these two PRs:
> - https://github.com/apache/airflow/pull/9593 which improves creating
> a Dataproc cluster (create; if it exists, check its state; if the state
> is wrong, delete it, wait, and then create a new one)
> - https://github.com/apache/airflow/pull/9590 which improves BigQuery
> insert job idempotency (submit; if the job_id exists, check its state;
> if running/ok, reattach; if failed, generate a new job_id and submit)
>
> Both PRs implement suggestions from our users and solve real,
> production-grade problems. Both do this in a non-perfect way because
> each of those operators tries to tackle a variety of idempotence
> problems. This requires some custom logic that has to work with
> non-deterministic situations (e.g. Dataproc and the unknown time it
> takes to delete a cluster). And that makes me wonder what the exact
> definition of "single, ideally idempotent, task" is.
>
> Operators should answer users' needs - there's no question about that.
> But it is the community that will have to maintain the operators. And
> maintaining complex logic which is hard (or nearly impossible) to test
> in an e2e way is not a pleasant task.
>
> What I would like to ask you is:
> - what does it mean to you for an operator to be idempotent?
> - what does "single task" mean? Does it mean a single event or
> an operation (a set of events)?
>
> By doing this I would like to work on a set of how-to rules for
> designing the logic of the `execute` method. I would like to encourage
> you to share your experiences with designing and working with complex
> operators :)
>
> Hope you are good,
> Tomek
>
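
To make the discussion a bit more concrete, here is a minimal sketch of the
"submit, reattach if running/ok, resubmit under a new job_id if failed" flow
described for the BigQuery PR above. It is only an illustration under stated
assumptions, not the code from that PR: `FakeJobService`, `JobExists`,
`submit_or_reattach` and the state names are all hypothetical, and the
in-memory service merely stands in for a real external API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


class JobExists(Exception):
    """Raised by the hypothetical service when a job_id is already taken."""


@dataclass
class FakeJobService:
    """Hypothetical in-memory stand-in for an external job API."""

    jobs: Dict[str, str] = field(default_factory=dict)  # job_id -> state

    def submit(self, job_id: str) -> None:
        if job_id in self.jobs:
            raise JobExists(job_id)
        self.jobs[job_id] = "RUNNING"

    def get_state(self, job_id: str) -> Optional[str]:
        return self.jobs.get(job_id)


def submit_or_reattach(service: FakeJobService, job_id: str) -> str:
    """Return the id of the job that the task should wait on."""
    try:
        # First attempt: the job does not exist yet, so just submit it.
        service.submit(job_id)
        return job_id
    except JobExists:
        state = service.get_state(job_id)

    if state in ("RUNNING", "DONE"):
        # A retry of the same task: reattach instead of starting a duplicate.
        return job_id

    # The previous attempt failed: resubmit under a fresh (assumed) job_id.
    new_job_id = f"{job_id}-{uuid.uuid4().hex[:8]}"
    service.submit(new_job_id)
    return new_job_id
```

Calling `submit_or_reattach` twice with the same job_id leaves a single job in
the service, which is the property a retried task relies on. The failed branch
is where the "limits of real-world idempotency" show up, since the task then
has to invent a new identifier.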


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
