I am sharing a problem that we face very frequently. When we go to a
particular DAG and click the logs for a particular task instance, we end up
with the following error. The logs only become available after the task has
finished, i.e. when they are fetched from an external source like S3.


*** Log file isn't local.
*** Fetching here: http://<ip>:9001/log/<dag-name>/etl_spark/2019-01-07T23:10:00/1.log
*** Failed to fetch log file from worker. HTTPConnectionPool(host='<ip>',
port=9001): Max retries exceeded with url:
/log/<dag>/etl_spark/2019-01-07T23:10:00/1.log (Caused by
NewConnectionError('<urllib3.connection.HTTPConnection object at
0x7f8c8a6ee250>: Failed to establish a new connection: [Errno 111]
Connection refused',))


Digging deeper into the issue, we found the reason.
There is a service, airflow serve_logs, which is a Flask service running on
each worker. Its job is to serve task logs stored locally on a worker to the
webserver when requested. Each time the webserver UI hits the serve_logs
process, a thread is created within Flask. We saw a lot of connections lying
around, so we explicitly killed the airflow serve_logs service, which had
been running since 11 July, 2018.
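For context, this is roughly the round trip the webserver makes when we
click a task log in the UI: a plain HTTP GET to the serve_logs Flask service
on the worker. Below is only a minimal sketch to illustrate it (I am using
the requests library here as an assumption; the host, dag id, task id and
execution date are placeholders copied from the error above, not real
values):

    import requests

    # Placeholder URL built from the error message above; substitute the
    # real worker IP, dag id, task id and execution date when reproducing.
    WORKER_LOG_URL = (
        "http://<ip>:9001/log/<dag-name>/etl_spark/"
        "2019-01-07T23:10:00/1.log"
    )

    try:
        # The webserver essentially does this: fetch the task log over HTTP
        # from the serve_logs service on the worker and render the body.
        resp = requests.get(WORKER_LOG_URL, timeout=5)
        resp.raise_for_status()
        print(resp.text)
    except requests.exceptions.ConnectionError as exc:
        # When the worker can no longer accept connections, this surfaces
        # as the "[Errno 111] Connection refused" shown in the UI.
        print("Failed to fetch log from worker:", exc)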

Since there are a lot of open connections, the worker is unable to accept a
new one and we end up with the above error.
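For anyone hitting the same thing, a rough sketch along these lines can show
how many connections are piling up on the serve_logs port before killing
them. I am assuming psutil here (it is not part of Airflow), and port 9001
is simply what our worker_log_server_port is set to:

    import psutil

    # Count TCP connections on the serve_logs port, grouped by state, to
    # see how many are accumulating on the worker. Running as root may be
    # needed to see sockets owned by other users' processes.
    SERVE_LOGS_PORT = 9001

    counts = {}
    for conn in psutil.net_connections(kind="tcp"):
        if conn.laddr and conn.laddr.port == SERVE_LOGS_PORT:
            counts[conn.status] = counts.get(conn.status, 0) + 1

    for status, n in sorted(counts.items()):
        print(status, n)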
We resolved it manually by killing all the connections, but we keep seeing
the same issue again. What is the permanent solution to this? Should we
keep restarting the worker or the airflow serve_logs service?

Thanks,
Pramiti
