Greetings!

I apologise in advance for the long-winded question, or if this is more 
of a RabbitMQ question than an Airflow one.

I am setting up a proof-of-concept multi-node Airflow 2.2.3 cluster 
with the Celery executor. The aspect I'm currently working on is high 
availability of RabbitMQ.

I have set up a 3-node RabbitMQ cluster fronted by HAProxy. HAProxy and 
RabbitMQ run on the same machines as Airflow: each machine has a 
RabbitMQ server and an HAProxy instance configured to direct traffic to 
all 3 nodes. The broker_url setting in airflow.cfg points to the HAProxy 
port on localhost.
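
For reference, the broker_url has the shape sketched below. The user, 
host, port and vhost are the ones that appear in the worker log further 
down; the password is a placeholder, and build_broker_url is only a 
hypothetical helper for illustration, not something Airflow provides.

```python
from urllib.parse import quote


def build_broker_url(user, password, host, port, vhost):
    """Compose an AMQP URL for the [celery] broker_url setting in airflow.cfg."""
    # Percent-encode credentials and vhost so characters such as '/' or '@'
    # cannot break the URL structure.
    return (
        f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(vhost, safe='')}"
    )


# 127.0.0.1:5673 is the local HAProxy listener, not RabbitMQ itself.
# "PLACEHOLDER" stands in for the real password.
print(build_broker_url("airflow_ci", "PLACEHOLDER", "127.0.0.1", 5673, "airflow_ci"))
```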

The RabbitMQ cluster itself looks perfectly healthy when I check it with 
'rabbitmqctl cluster_status'. Failover of the TCP connection also seems 
to work: if I shut down RabbitMQ on one machine, I can see in the 
system logs that after a short pause the Airflow worker successfully 
reconnects over AMQP.

But there is a problem with queues. Currently all the queues in my 
RabbitMQ setup are 'classic' queues. This means that if the RabbitMQ 
node that hosts a queue goes down, the queue becomes unavailable and 
the Airflow worker fails. This is what happened when I shut down 
RabbitMQ on one machine:

--------------------------------------------------------------------
[2022-03-11 12:48:40,913: ERROR/MainProcess] consumer: Cannot connect to amqp://airflow_ci:**@127.0.0.1:5673/airflow_ci: Server unexpectedly closed connection.
Mar 11 12:48:40 Trying again in 4.00 seconds... (2/100)
Mar 11 12:48:44 [2022-03-11 12:48:44,940: INFO/MainProcess] Connected to amqp://airflow_ci:**@127.0.0.1:5673/airflow_ci
[2022-03-11 12:48:44,961: INFO/MainProcess] mingle: searching for neighbors
[2022-03-11 12:48:45,997: INFO/MainProcess] mingle: all alone
[2022-03-11 12:48:46,005: CRITICAL/MainProcess] Unrecoverable error: NotFound(404, "NOT_FOUND - home node 'rabbit@ci-91-col' of durable queue 'default' in vhost 'airflow_ci' is down or inaccessible", (50, 10), 'Queue.declare')
--------------------------------------------------------------------

What I am looking for is a way to have the queues created by Airflow be 
'quorum' or 'mirrored' queues instead of 'classic' ones. I guess I 
could manually create a quorum queue in the RabbitMQ management UI and 
set it as default_queue in airflow.cfg, but is this the proper way? Or 
is there perhaps an entirely different way of surviving a RabbitMQ node 
going down that I'm not thinking of?
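
To make the manual approach concrete, here is a minimal sketch of the 
scripted equivalent of creating the queue in the management UI, using 
the RabbitMQ management HTTP API. It assumes the management plugin is 
enabled on its default port 15672 and that the user is allowed to use 
it; the password is a placeholder and build_quorum_queue_request is a 
hypothetical helper name. The sketch only builds the request and does 
not send it.

```python
import base64
import json
from urllib.parse import quote
from urllib.request import Request


def build_quorum_queue_request(base_url, vhost, queue, user, password):
    """Build (but do not send) a management-API request declaring a quorum queue."""
    # The management API addresses queues as /api/queues/<vhost>/<name>.
    url = f"{base_url}/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
    body = {
        "durable": True,  # quorum queues must be durable
        "auto_delete": False,
        "arguments": {"x-queue-type": "quorum"},
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Basic {token}", "Content-Type": "application/json"},
        method="PUT",
    )


# Assumed values: management API on localhost:15672, placeholder password.
req = build_quorum_queue_request(
    "http://127.0.0.1:15672", "airflow_ci", "default", "airflow_ci", "PLACEHOLDER"
)
print(req.get_method(), req.full_url)
# Actually declaring the queue would be: urllib.request.urlopen(req)
```

As far as I understand, the type of an existing queue cannot be 
changed, so the current classic 'default' queue would have to be 
deleted first. I also do not know whether the Celery worker would then 
accept a queue that was declared this way.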

Thank you in advance for any pointers.
-- 
Toomas Aas
