potiuk commented on issue #32973:
URL: https://github.com/apache/airflow/issues/32973#issuecomment-1658757292

   Hey @ashb @yuqian90 ? @uranusjr @ephraimbuddy @jedcunningham  @RNHTTR  - 
tagging those who I know were somewhat involved with Celery and maybe you have 
more of an experience/insight.
   
   I think this problem needs somewhat deeper understanding of the signal 
/multiprocessing part of the celery and reasoning why the test might hang 
indefinitely without reacting to timeouts. 
   
   It looks awfully like some nasty race condition. I am quite sure it started 
to fail not because we moved celery executor to providers (that was merely 
moving imports) but because the tests are running in the same session as other 
provider tests and some left-overs/side effects kick in. It might be this is a 
false negative, only occuring in some specific test environment (we do reset 
signals as part of the test which does not normally happen) - but it also might 
mean that this is a  "real" issue that might occur in production.
   
   I want to run it for at least few days and see if we will have visibly less 
number of "this job failled" without any logs (which I suspected might be this 
test hanging) and in the meantime as 2.7.0 preparation maybe some guesswork can 
be done by those more familiar with the subject - thanks to MWAA team's 
environment we seem to have a relatively easy way to test any hypothesis on why 
this might happen. I will also take a look at the quarantined tests running it 
- maybe we will be able to see it also hanging there occasionally. 
   
   If anyone has any ideas - maybe we can brainstorm here a bit on what could 
be the problem?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to