baolsen commented on a change in pull request #6999: [AIRFLOW-XXXX] Clarify wait_for_downstream and execution_date
URL: https://github.com/apache/airflow/pull/6999#discussion_r362747213
##########
File path: docs/concepts.rst
##########
@@ -113,13 +116,138 @@ DAGs can be used as context managers to automatically assign new operators to th
     op.dag is dag # True
-.. _concepts-operators:
+.. _concepts:dagruns:
+
+DAG Runs
+========
+
+A DAG run is a physical instance of a DAG, containing task instances that run for a specific ``execution_date``.
+
+A DAG run is usually created by the Airflow scheduler, but can also be created by an external trigger.
+Multiple DAG runs may be running at once for a particular DAG, each of them having a different ``execution_date``.
+For example, we might currently have two DAG runs that are in progress for 2016-01-01 and 2016-01-02 respectively.
+
+.. _concepts:execution_date:
+
+execution_date
+--------------
+
+The ``execution_date`` is the *logical* date and time which the DAG Run, and its task instances, are running for.
+
+This allows task instances to process data for the desired *logical* date & time.
+While a task_instance or DAG run might have a *physical* start date of now,
+their *logical* date might be 3 months ago because we are busy reloading something.
+
+In the prior example the ``execution_date`` was 2016-01-01 for the first DAG Run and 2016-01-02 for the second.
+
+A DAG run and all task instances created within it are instanced with the same ``execution_date``, so
+that logically you can think of a DAG run as simulating the DAG running all of its tasks at some
+previous date & time specified by the ``execution_date``.
+
+.. _concepts:tasks:
+
+Tasks
+=====
+
+A Task defines a unit of work within a DAG; it is represented as a node in the DAG graph, and it is written in Python.
+
+Each task is an implementation of an Operator, for example a ``PythonOperator`` to execute some Python code,
+or a ``BashOperator`` to run a Bash command.
+
+The task implements an operator by defining specific values for that operator,
+such as a Python callable in the case of ``PythonOperator`` or a Bash command in the case of ``BashOperator``.
+
+Relations between Tasks
+-----------------------
+
+Consider the following DAG with two tasks.
+Each task is a node in our DAG, and there is a dependency from task_1 to task_2:
+
+.. code:: python
+
+    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
+        task_1 = DummyOperator(task_id='task_1')
+        task_2 = DummyOperator(task_id='task_2')
+        task_1 >> task_2 # Define dependencies
+
+We can say that task_1 is *upstream* of task_2, and conversely task_2 is *downstream* of task_1.
+When a DAG Run is created, task_1 will start running and task_2 waits for task_1 to complete successfully before it may start.
+
+Task Instances
+==============
+
+A task instance represents a specific run of a task and is characterized as the
+combination of a DAG, a task, and a point in time (``execution_date``). Task instances
+also have an indicative state, which could be "running", "success", "failed", "skipped", "up
+for retry", etc.
+
+Tasks are defined in DAGs, and both are written in Python code to define what you want to do.
+Task Instances belong to DAG Runs, have an associated ``execution_date``, and are physicalised, runnable entities.
+
+Relations between Task Instances
+--------------------------------
+
+Again consider the following tasks, defined for some DAG:
+
+.. code:: python
+
+    with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
+        task_1 = DummyOperator(task_id='task_1')
+        task_2 = DummyOperator(task_id='task_2')
+        task_1 >> task_2 # Define dependencies
+
+When we enable this DAG, the scheduler creates several DAG Runs - one with ``execution_date`` of 2016-01-01,
+one with ``execution_date`` of 2016-01-02, and so on up to the current date.
+
+Each DAG Run will contain a task_1 Task Instance and a task_2 Task instance.
+Both Task Instances will have ``execution_date`` equal to the DAG Run's ``execution_date``, and each
+task_2 will be *downstream* of (depends on) its task_1.
+
+We can also say that task_1 for 2016-01-01 is the *previous* task instance of the task_1 for 2016-01-02.
+Or that the DAG Run for 2016-01-01 is the *previous* DAG Run to the DAG Run of 2016-01-02.
+Here, *previous* refers to the logical past/prior ``execution_date``, that runs independently of other runs,
+and *upstream* refers to a dependency within the same run and having the same ``execution_date``.
+

Review comment:
Not sure how you feel about this note, but I think differentiating between *previous* and *upstream* is important, especially for a new user. The concepts of upstream / downstream tasks didn't sink in for me at first, and explicitly distinguishing them from *previous* would have helped me understand immediately.
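
To make the *previous* / *upstream* distinction concrete, here is a minimal sketch in plain Python. It does not use the Airflow API: the `UPSTREAM` mapping and both helper functions are hypothetical names invented for illustration, and the one-day interval is an assumption taken from the 2016-01-01 / 2016-01-02 example in the diff.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the DAG in the diff (task_1 >> task_2):
# each task maps to the tasks it depends on within the same DAG Run.
UPSTREAM = {"task_1": [], "task_2": ["task_1"]}


def upstream_instances(task_id, execution_date):
    """Upstream: other tasks in the same DAG Run, so same execution_date."""
    return [(up, execution_date) for up in UPSTREAM[task_id]]


def previous_instance(task_id, execution_date, interval=timedelta(days=1)):
    """Previous: the same task in the prior DAG Run, so earlier execution_date.

    The daily interval is an assumption made for this sketch.
    """
    return (task_id, execution_date - interval)


run_date = date(2016, 1, 2)
print(upstream_instances("task_2", run_date))  # task_1 in the same run
print(previous_instance("task_1", run_date))   # task_1 in the 2016-01-01 run
```

Upstream changes the task and keeps the ``execution_date``; previous keeps the task and changes the ``execution_date``.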