I am sharing a problem that we face very frequently. When we go to a particular DAG and click the logs for a particular task instance, we end up with the error shown below. The logs only become available after the task has finished, i.e. once they have been fetched from an external source such as S3.
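As far as I understand, when the task log is not available to the webserver, the webserver tries to pull it over HTTP from the worker's log-serving port (9001 in our setup, judging from the error). Below is a minimal sketch of that request, just to illustrate the path that is failing; the helper function is hypothetical, not Airflow's code, and the URL layout is copied from the error message that follows.

```python
# Hypothetical sketch of the webserver-side log fetch that fails in our case.
# The host/port and URL layout are taken from the error message below;
# this is not Airflow's actual implementation.
import requests

def fetch_task_log(worker_host, dag_id, task_id, execution_date,
                   try_number=1, port=9001):
    url = (f"http://{worker_host}:{port}/log/"
           f"{dag_id}/{task_id}/{execution_date}/{try_number}.log")
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.exceptions.ConnectionError as err:
        # This corresponds to the "Failed to establish a new connection:
        # [Errno 111] Connection refused" part of the error we see.
        return f"*** Failed to fetch log file from worker. {err}"
```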
*** Log file isn't local.
*** Fetching here: http://<ip>:9001/log/<dag-name>/etl_spark/2019-01-07T23:10:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='<ip>', port=9001): Max retries exceeded with url: /log/<dag>/etl_spark/2019-01-07T23:10:00/1.log (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f8c8a6ee250>: Failed to establish a new connection: [Errno 111] Connection refused',))

Going deeper into the issue, we found the reason. There is an Airflow service, serve_logs, which is a Flask service running on each worker. Its job is to serve task logs stored locally on a worker to the webserver when requested. Each time the webserver UI hits the serve_logs process, it creates a thread within Flask. We saw a lot of connections open, and we explicitly killed the Airflow serve_logs service, which we saw had been running since 11 July 2018. Since there are so many open connections, it is unable to create a new one, and we end up with the error above. We resolved the issue manually by killing all the connections, but we keep running into the same problem again.

What is the permanent solution to this? Should we keep restarting the worker or the Airflow serve_logs service? (A rough sketch of how I understand serve_logs works is below my signature, in case it helps.)

Thanks,
Pramiti
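Here is that sketch. It is only my rough mental model of the serve_logs side, not Airflow's actual code: a small Flask app on each worker that serves files out of the local log folder, with one thread per incoming request. The log directory and port are assumptions based on our setup (9001 matches the port in the error; I believe this is what worker_log_server_port in airflow.cfg controls).

```python
# Rough sketch of what (I believe) the per-worker serve_logs service does.
# This is NOT Airflow's actual implementation; the log folder and port
# below are assumptions based on our setup.
import os
from flask import Flask, abort, send_from_directory

LOG_DIR = os.path.expanduser("~/airflow/logs")  # assumed base log folder
app = Flask(__name__)

@app.route("/log/<path:filename>")
def serve_log(filename):
    # The webserver asks for e.g. /log/<dag-name>/etl_spark/<execution_date>/1.log;
    # the file is looked up in the worker's local log folder and returned.
    if not os.path.isfile(os.path.join(LOG_DIR, filename)):
        abort(404)
    return send_from_directory(LOG_DIR, filename)

if __name__ == "__main__":
    # threaded=True: each request from the webserver is handled in its own
    # thread, which is where connections can pile up if they are never closed.
    # 9001 matches the port in the error above.
    app.run(host="0.0.0.0", port=9001, threaded=True)
```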