[ https://issues.apache.org/jira/browse/AIRFLOW-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715356#comment-16715356 ]
ASF GitHub Bot commented on AIRFLOW-3370:
-----------------------------------------

rhwang10 opened a new pull request #4303: [AIRFLOW-3370] Elasticsearch log task handler additional features
URL: https://github.com/apache/incubator-airflow/pull/4303

Make sure you have checked _all_ steps below.

### Jira

- [x] My PR addresses the following [AIRFLOW-3370](https://issues.apache.org/jira/browse/AIRFLOW-3370)

### Description

Add additional options and documentation for using the ElasticsearchTaskHandler.

The easiest and most reliable way to implement a logging solution is to write to the standard output streams (stdout and stderr). In the Kubernetes ecosystem, where pods are killed and restarted constantly, writing logs to the filesystem instead makes persistent storage a requirement for preserving log history. For the webserver and scheduler components of Airflow, logging to the standard output streams is built in. However, when tasks are executed in Celery, workers fork off child processes to execute tasks concurrently. Before a child process ends, it calls the Airflow task logger, and a task log file is written to the filesystem.

This causes several problems when running Airflow on top of Kubernetes. Because Airflow produces a constant stream of log output, running an Airflow environment using Celery in a Kubernetes cluster requires large amounts of memory resources. When memory resources are exceeded, either 1) worker pods are evicted by Kubernetes, or 2) worker output stalls and tasks pile up without completing. The current best-case workaround is to run a sidecar container that reads the log files from the worker node's filesystem and tails them to its own standard output. However, this becomes a scaling problem when running multiple deployments and is unsustainable for large Airflow deployments.

The options added in this PR circumvent the need for persistent volumes in worker nodes when running Airflow on Kubernetes: workers no longer need to be StatefulSets and can instead be Deployments. The `elasticsearch_write_stdout` flag in `airflow.cfg` makes the child process write its log output to the parent process's standard output stream. The `elasticsearch_json_format` flag in `airflow.cfg` enables optional JSON formatting of task instance logs based on the `logging` module's LogRecord attributes; it must be used in conjunction with the `elasticsearch_record_labels` configuration.

A typical use case for these options is setting up a log monitoring stack, such as EFK (Elasticsearch, Fluentd, Kibana), without requiring persistent volumes. A Fluentd daemon listening on every node awaits log output on the standard output stream. With the `write_stdout` flag set, it can capture the task log records produced by child processes via the parent process's standard output. With the `json_format` flag set, Fluentd can be configured to filter and select log record fields, without needing to parse the standard Airflow log format, before sending the records off to a destination such as Elasticsearch, where they can be stored and handled independently of Airflow on Kubernetes.

If either of these options is NULL or unset, `es_task_handler.py` functions exactly as before, with read-only functionality. The options in this PR simply give users more functionality and flexibility.
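To make the configuration concrete, here is a minimal sketch of how these options might be set in the `[elasticsearch]` section of `airflow.cfg`. Only the option names come from this PR; the pre-existing keys and all values shown are placeholders, and the comma-separated value format for `elasticsearch_record_labels` is an assumption for illustration.

```ini
[elasticsearch]
# Pre-existing handler settings (placeholder values)
elasticsearch_host = localhost:9200
elasticsearch_log_id_template = {dag_id}-{task_id}-{execution_date}-{try_number}
elasticsearch_end_of_log_mark = end_of_log

# New in this PR: write task logs to the parent process's stdout
# instead of to a log file on the worker's filesystem.
elasticsearch_write_stdout = True

# New in this PR: emit each task log record as a JSON document built
# from `logging` LogRecord attributes. Must be used together with
# elasticsearch_record_labels (assumed here to be a comma-separated
# list of LogRecord attribute names).
elasticsearch_json_format = True
elasticsearch_record_labels = asctime, filename, lineno, levelname, message
```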
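To make the `json_format` behavior concrete, the sketch below shows the general technique of rendering selected LogRecord attributes as a JSON document and writing it to stdout. This is a simplified, hypothetical illustration of the idea, not the handler code from this PR.

```python
import json
import logging
import sys


class JSONStdoutFormatter(logging.Formatter):
    """Render selected LogRecord attributes as one JSON object per line."""

    def __init__(self, record_labels):
        super(JSONStdoutFormatter, self).__init__()
        self.record_labels = record_labels

    def format(self, record):
        # Populate computed attributes (`message`, `asctime`) so they can
        # be looked up by name like the static LogRecord attributes.
        record.message = record.getMessage()
        record.asctime = self.formatTime(record)
        return json.dumps({label: getattr(record, label, None)
                           for label in self.record_labels})


# Hypothetical wiring: a stream handler on stdout, so a node-level
# collector such as Fluentd can pick the records up without parsing
# the standard Airflow log format.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONStdoutFormatter(
    ["asctime", "filename", "lineno", "levelname", "message"]))

logger = logging.getLogger("airflow.task")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Task started")  # -> {"asctime": "...", "levelname": "INFO", ...}
```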
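On the collection side, a hypothetical Fluentd pipeline for the EFK setup described above might tail the worker container's captured stdout, parse the JSON records, and forward them to Elasticsearch. Paths, tags, and hostnames here are illustrative placeholders, not values from this PR.

```
<source>
  @type tail
  # Placeholder path to the worker container's captured stdout
  path /var/log/containers/airflow-worker-*.log
  pos_file /var/log/fluentd-airflow-worker.pos
  tag airflow.task
  <parse>
    @type json
  </parse>
</source>

<match airflow.task>
  # Requires the fluent-plugin-elasticsearch output plugin
  @type elasticsearch
  host elasticsearch.logging.svc.cluster.local
  port 9200
  logstash_format true
</match>
```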
### Tests

- [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

#### handler close

- test_close_stdout_logs
- test_close_closed_stdout_logs
- test_close_no_mark_end_stdout_logs
- test_close_with_no_handler_stdout_logs
- test_close_with_no_stream_stdout_logs

#### handler read

- test_read_stdout_logs
- test_read_nonexistent_log_stdout_logs
- test_read_raises_stdout_logs_json_true
- test_read_timeout_stdout_logs
- test_read_with_empty_metadata_stdout_logs
- test_read_with_none_metadata_stdout_logs

#### handler render_log_id and set_context

- test_render_log_id_json_true
- test_set_context_stdout_logs

### Commits

- [x] My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  2. Subject is limited to 50 characters (not including Jira issue reference)
  3. Subject does not end with a period
  4. Subject uses the imperative mood ("add", not "adding")
  5. Body wraps at 72 characters
  6. Body explains "what" and "why", not "how"

### Documentation

- [x] In case of new functionality, my PR adds documentation that describes how to use it.

### Code Quality

- [x] Passes `flake8`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enhance current ES handler with stdout capability and more output options
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3370
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3370
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: celery, logging, worker
>    Affects Versions: 1.10.0
>            Reporter: Robert Hwang
>            Priority: Major
>
> Currently, the ES handler in 1.10 can only search from ES; it cannot write logs itself. Two possible enhancements are to give the handler "write to standard out" functionality, and to make log messages more remote-storage friendly with an optional JSON-formatted message, keeping the regular pretty-print log format as the default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)