uranusjr commented on a change in pull request #17552:
URL: https://github.com/apache/airflow/pull/17552#discussion_r708791466



##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -43,21 +43,26 @@ Please follow our guide on :ref:`custom Operators <custom_operator>`.
 Creating a task
 ---------------
 
-You should treat tasks in Airflow equivalent to transactions in a database. This implies that you should never produce
-incomplete results from your tasks. An example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a task.
-
-Airflow can retry a task if it fails. Thus, the tasks should produce the same outcome on every re-run.
-Some of the ways you can avoid producing a different result -
-
-* Do not use INSERT during a task re-run, an INSERT statement might lead to duplicate rows in your database.
-  Replace it with UPSERT.
-* Read and write in a specific partition. Never read the latest available data in a task.
-  Someone may update the input data between re-runs, which results in different outputs.
-  A better way is to read the input data from a specific partition. You can use ``execution_date`` as a partition.
-  You should follow this partitioning method while writing data in S3/HDFS, as well.
-* The Python datetime ``now()`` function gives the current datetime object.
-  This function should never be used inside a task, especially to do the critical computation, as it leads to different outcomes on each run.
-  It's fine to use it, for example, to generate a temporary log.
+You should treat tasks in Airflow equivalent to transactions in a database. This
+implies that you should never produce incomplete results from your tasks. An
+example is not to produce incomplete data in ``HDFS`` or ``S3`` at the end of a
+task.
+
+Airflow can retry a task if it fails. Thus, the tasks should produce the same
+outcome on every re-run. Some of the ways you can avoid producing a different
+result -
+
+* Do not use INSERT during a task re-run, an INSERT statement might lead to
+  duplicate rows in your database. Replace it with UPSERT.
+* Read and write in a specific partition. Never read the latest available data
+  in a task. Someone may update the input data between re-runs, which results in
+  different outputs. A better way is to read the input data from a specific
+  partition. You can use ``data_interval_start`` as a partition. You should
+  follow this partitioning method while writing data in S3/HDFS, as well.
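
As a minimal sketch of the partitioning advice in the new text (not part of this patch): a TaskFlow task that keys both its input and output on ``data_interval_start``, so a re-run reads the same input and overwrites the same output partition. The DAG id, bucket, and paths below are hypothetical, and it assumes Airflow 2.2+, where ``data_interval_start`` is available in the task context.

```python
import pendulum

from airflow.decorators import dag, task


@dag(
    schedule_interval="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
)
def partitioned_copy():
    @task
    def copy_partition(data_interval_start=None):
        # Key both the input and the output on the data interval, so a re-run
        # reads the same input and overwrites the same output partition
        # instead of appending duplicate rows or picking up newer data.
        partition = data_interval_start.strftime("%Y-%m-%d")
        source = f"s3://example-bucket/raw/{partition}/events.json"        # hypothetical path
        target = f"s3://example-bucket/processed/{partition}/events.json"  # hypothetical path
        # ... read from `source`, transform, and overwrite `target` here ...

    copy_partition()


partitioned_copy()
```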

Review comment:
       Side note, there are a lot of *blah blah blah, as well* usages in the 
documentation, so it seems like whoever wrote it previously has a particular 
style. (I’ve removed the comma here.)



