[GitHub] [airflow] potiuk commented on pull request #29815: Add retry to the scheduler loop to protect against DB hiccups

via GitHub Tue, 07 Mar 2023 15:02:43 -0800


potiuk commented on PR #29815:
URL: https://github.com/apache/airflow/pull/29815#issuecomment-1458999232


   > When each `create_session` block exits, it calls `session.commit()` and 
writes to the database. So say if a database failure happens during lines 
226–232, the entire block would be restarted and you end up writing the 
database twice without calling `heartbeat_callback` correctly. I don’t know if 
it would be problematic, but it is awkward. Ideally I would imagine the two 
`create_session` blocks should be retried separately instead.
   
   @uranusjr Yep. This is very awkward.  And this awkwardness is precisely what 
I am going to change during AIP-44 implementation. There is an (very 
unfinished) draft commit:  
https://github.com/apache/airflow/commit/f11f5afbddfe39a9f0e31bc1fc1ba3cc1dfa5394
 that depends on merging https://github.com/apache/airflow/pull/29776 which 
will solve this issue (by splitting the "before", "during", and "after" task 
into three steps (and three DB sessions when internal DB api won't be used - 
when the internal API will be used there will be no "during" session at all).
   
   > @potiuk DB connection failures happen. Airflow is running in distributed 
environment. As much as we want to have a reliable infrastructure it's not 
possible all the time. Heartbeat is the most frequent operation in the airflow 
DB and it's success has a significant impact on the task success or failure. 
What problems do you anticipate with making it more resilient to disruptions?
   
   @bjankie1  Yes. I understand that and sympathise with such statement. But 
the solution to that is not to retry a failed transaction without looking at 
the consequences (pointed out by @uranusjr nicely) . The proposal of yours is a 
"band-aid" which might create more problems. The right approach for making 
system resilent to DB problems is to understand every single DB transaction 
that happens in the system and deliberately design behaviour of what happens if 
the transaction fails and act appropriately to recover. Retrying failed 
transaction without understanding of the consequences is a recipe for disaster. 
And yes in this case the whole transaction is ... awward - as also nicely 
pointed out by @uranusjr. and the right approach is to fix the behaviour to 
make more sense (and so you can reason about it) and only after that to 
implement recovery scenario (which might actually be not needed - because the 
way I think it will work when I complete the refactor is that those 
transactions 
 will be split into tthree (or two in case of internal DB API) so that there 
will be no need to recover becuse the problem you observe will not exist.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@airflow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [airflow] potiuk commented on pull request #29815: Add retry to the scheduler loop to protect against DB hiccups

Reply via email to