[ 
https://issues.apache.org/jira/browse/AIRFLOW-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715356#comment-16715356
 ] 

ASF GitHub Bot commented on AIRFLOW-3370:
-----------------------------------------

rhwang10 opened a new pull request #4303: [AIRFLOW-3370] Elasticsearch log task 
handler additional features
URL: https://github.com/apache/incubator-airflow/pull/4303
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [x] My PR addresses [AIRFLOW-3370](https://issues.apache.org/jira/browse/AIRFLOW-3370)
   
   ### Description
   
   Add additional options and documentation for using the 
`ElasticsearchTaskHandler`.
   
   The easiest and most foolproof way to implement a logging solution is to 
write to the standard output streams (stdout and stderr). In the Kubernetes 
ecosystem, where pods are killed and restarted constantly, logs written to the 
local filesystem instead require persistent storage if log history is to be 
preserved.
   
   For the webserver and scheduler components of Airflow, logging to the 
standard output streams is built in. However, when tasks are executed with 
Celery, workers fork off child processes to execute tasks concurrently. Before 
each child process ends, it calls the Airflow task logger, and a task log file 
is written to the filesystem.
   
   This causes several potential problems for the Airflow-on-Kubernetes 
architecture. Given that Airflow produces a constant stream of log output, 
running an Airflow environment using Celery in a Kubernetes cluster requires 
large amounts of memory. When memory resources are exceeded, either 1) worker 
pods are evicted by Kubernetes, or 2) worker output stalls and tasks pile up 
without completing. The current best-case workaround is to run a sidecar 
container that tails the task log files on the worker's filesystem and writes 
them to its own `stdout` and `stderr` streams. However, this becomes a scaling 
issue when running multiple deployments, and it is unsustainable for massive 
Airflow deployments.
   
   The options added in this PR aim to remove the need for persistent volumes 
on worker nodes when running Airflow on Kubernetes. Workers will no longer 
need to be StatefulSets and can instead be Deployments. 
   
   The `elasticsearch_write_stdout` flag in `airflow.cfg` allows the child 
process to write its log output to the parent process's standard output stream.
   The `elasticsearch_json_format` flag in `airflow.cfg` enables optional JSON 
formatting of task instance log records, built from `logging` module 
`LogRecord` attributes. It must be used in conjunction with the 
`elasticsearch_record_labels` configuration, as sketched below. 
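
   For illustration, a minimal `airflow.cfg` sketch with both flags enabled 
might look like the following. The `[elasticsearch]` section name and the 
attribute list are assumptions for the sake of the example; only the option 
names come from this PR.

   ```ini
   [elasticsearch]
   # Write task logs to the parent process stdout instead of only a log file
   # (option name taken from this PR's description).
   elasticsearch_write_stdout = True
   # Emit each task log record as JSON built from `logging` LogRecord attributes.
   elasticsearch_json_format = True
   # LogRecord attributes to include in each JSON record (hypothetical values).
   elasticsearch_record_labels = asctime, filename, lineno, levelname, message
   ```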
   
   A potential use case for these options is setting up a log monitoring 
stack, such as EFK (Elasticsearch, Fluentd, Kibana), without requiring 
persistent volumes. A Fluentd daemon running on every node tails the container 
standard output streams. With the `elasticsearch_write_stdout` flag set, it 
can capture, from the parent process's standard output, the logs of tasks 
executed in child processes. With the `elasticsearch_json_format` flag set, 
Fluentd can filter and route log records without having to parse the standard 
Airflow log format before sending them to a destination such as Elasticsearch, 
where they can be stored and handled independently of Airflow on Kubernetes.
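
   As a rough sketch of the idea (not the handler's actual implementation), a 
JSON formatter over `logging` `LogRecord` attributes might look like this; the 
`JsonFormatter` name and the field list are hypothetical and mirror the 
`elasticsearch_record_labels` example above:

   ```python
   import json
   import logging
   import sys


   class JsonFormatter(logging.Formatter):
       """Render a selected set of LogRecord attributes as one JSON object."""

       def __init__(self, fields):
           super(JsonFormatter, self).__init__()
           self.fields = fields

       def format(self, record):
           # `message` and `asctime` are normally set inside Formatter.format(),
           # so populate them explicitly before serializing.
           record.message = record.getMessage()
           record.asctime = self.formatTime(record)
           return json.dumps({f: getattr(record, f, None) for f in self.fields})


   handler = logging.StreamHandler(sys.stdout)
   handler.setFormatter(
       JsonFormatter(["asctime", "filename", "lineno", "levelname", "message"])
   )
   logger = logging.getLogger("airflow.task.demo")
   logger.addHandler(handler)
   logger.setLevel(logging.INFO)
   logger.info("Task finished")
   # -> {"asctime": "...", "filename": "...", "lineno": ..., "levelname": "INFO",
   #     "message": "Task finished"}
   ```

   Each stdout line is then a self-describing JSON event, so Fluentd can match 
and route it with a plain JSON parser instead of a custom regex for the 
Airflow log format.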
   
   If either of these options is NULL or not set, the handler in 
`es_task_handler.py` will function exactly as before, providing read-only 
functionality. The options in this PR simply provide additional functionality 
and choices for users.
   
   
   ### Tests
   
   - [x] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   #### handler close
   - test_close_stdout_logs
   - test_close_closed_stdout_logs
   - test_close_no_mark_end_stdout_logs
   - test_close_with_no_handler_stdout_logs
   - test_close_with_no_stream_stdout_logs
   
   #### handler read
   - test_read_stdout_logs
   - test_read_nonexistent_log_stdout_logs
   - test_read_raises_stdout_logs_json_true
   - test_read_timeout_stdout_logs
   - test_read_with_empty_metadata_stdout_logs
   - test_read_with_none_metadata_stdout_logs
   
   #### handler render_log_id and set_context
   - test_render_log_id_json_true
   - test_set_context_stdout_logs
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
     1. Subject is separated from body by a blank line
     1. Subject is limited to 50 characters (not including Jira issue reference)
     1. Subject does not end with a period
     1. Subject uses the imperative mood ("add", not "adding")
     1. Body wraps at 72 characters
     1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
   
   ### Code Quality
   
   - [x] Passes `flake8`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enhance current ES handler with stdout capability and more output options
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-3370
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3370
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: celery, logging, worker
>    Affects Versions: 1.10.0
>            Reporter: Robert Hwang
>            Priority: Major
>
> Currently, the ES handler in 1.10 can only search (read) from ES; it cannot 
> write logs out. Two possible enhancements are to give the handler a "write 
> to standard out" capability, and to make log messages more remote-storage 
> friendly, with an optional JSON-formatted message and the regular 
> pretty-print log format as the default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
