Hi Jamie Grier,
Thank you for your reply; let me add some explanation of this design.
First of all, as stated in the "Goal" section, this is mainly aimed at the
"Standalone" cluster mode. Although we have also implemented it for Flink on
YARN, that does not mean we cannot turn the feature off via an option. It
should be noted that the separation is essentially based on the log
configuration file; it is quite flexible and could even allow users to define
the log pattern in the configuration file (this is an extension feature, not
covered in the design document). In fact, a single file is just a special
case of multiple files, so we can provide an option that keeps the current
single-file behavior as the default, which should match the scenario you
expect in containers. A rough sketch of the idea follows below.
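To make this a bit more concrete, here is a rough, purely illustrative sketch
in Java (log4j 1.x API) of the idea behind splitting the log file per job:
attaching a dedicated file appender for the loggers used by a job's user
code. The names here (registerJobAppender, userLoggerPrefix, the file naming
scheme) are assumptions for illustration only, not the actual FLINK-11202
implementation, which is driven by the log configuration file:

import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

import java.io.IOException;

// Rough sketch (log4j 1.x) of the idea behind "split log file per job":
// attach a dedicated file appender to the logger prefix used by a job's
// user code, so its events land in a per-job file instead of the shared
// TaskManager log. Purely illustrative; not the FLINK-11202 implementation.
public class PerJobFileAppenderSketch {

    public static void registerJobAppender(String jobId, String userLoggerPrefix, String logDir)
            throws IOException {
        PatternLayout layout = new PatternLayout("%d{ISO8601} %-5p %c - %m%n");
        FileAppender appender =
                new FileAppender(layout, logDir + "/taskmanager-job-" + jobId + ".log");

        Logger userLogger = Logger.getLogger(userLoggerPrefix);
        userLogger.addAppender(appender);
        // Keep these events out of the parent (framework) appenders so the
        // framework log and the per-job business log stay separated.
        userLogger.setAdditivity(false);
    }
}

In this picture, keeping everything in one file is simply the case where no
per-job appender is registered, which is why a single option can preserve the
current default behavior.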
According to Flink's official 2016 user survey [1], the number of users
running standalone mode is quite close to the number running YARN mode
(unfortunately there is no comparable data for 2017). Although we mainly use
Flink on YARN now, we have also used standalone mode extensively (processing
close to 20 trillion messages per day). In that scenario, the user logs
produced by tasks of different jobs are mixed together, and it is very
difficult to locate issues. Moreover, because we configure a rolling policy
for the log files, we have to log in to the server to view them. Therefore,
we would like the user logs produced by tasks of the same job to be
distinguishable within the same TaskManager.
In addition, I have tried the MDC approach, but it cannot achieve this goal.
Flink's logging is built on log4j 1.x and logback, and we need to be
compatible with both frameworks at the same time; we also cannot make
large-scale changes to the existing code, and the mechanism has to be
transparent to users.
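For reference, the MDC experiment looked roughly like the sketch below
("jobId" and this wrapper are illustrative assumptions only). It only helps
if the user's log4j 1.x / logback pattern is changed to include %X{jobId} and
if the MDC put/remove calls wrap every path that can run user code, which is
why we did not find it transparent enough:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Rough sketch of the MDC experiment: tag every log line with the current
// job id via slf4j's MDC. The log4j 1.x / logback pattern must include
// %X{jobId} for the tag to appear, and put/remove must wrap all user code.
// "jobId" and this wrapper class are illustrative assumptions only.
public class MdcTaggingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(MdcTaggingSketch.class);

    public static void runUserCode(String jobId, Runnable userCode) {
        MDC.put("jobId", jobId);
        try {
            LOG.info("Running user code for job {}", jobId);
            userCode.run();
        } finally {
            MDC.remove("jobId");
        }
    }
}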
Some other points:
1) Many of our users have experience with Storm and Spark, and they are
more accustomed to that style in standalone mode;
2) Splitting the user log by job will also help us implement a job-based
"business log aggregation" feature.
Best,
Vino
[1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1
Jamie Grier <jgr...@lyft.com.invalid> wrote on Fri, Mar 1, 2019 at 7:32 AM:
I think, if I understood this correctly, this design is going in the
wrong direction. The problem with Flink logging, when you are running
multiple jobs in the same TMs, is not just about separating out the
business-level logging into separate files. The Flink framework itself
logs many things where there is clearly a single job in context, but it
all ends up in the same log file with no clear separation amongst the
log lines.
Also, I don't think aiming for multiple log files is a very good idea
either. It's common, especially on container-based deployments, that the
expectation is that a process (like Flink) logs everything to stdout and
the surrounding tooling takes care of routing that log data somewhere. I
think we should stick with that model and expect that there will be a
single log stream coming out of each Flink process.
Instead, I think it would be better to enhance Flink's logging capability
such that the appropriate context can be added to each log line with the
exact format controlled by the end user. It might make sense to take a
look at MDC, for example, as a way to approach this.
On Thu, Feb 28, 2019 at 4:24 AM vino yang <yanghua1...@gmail.com> wrote:
Dear devs,
Currently, for log output, Flink does not explicitly distinguish between
framework logs and user logs. In the TaskManager, logs from the framework are
intermixed with the user's business logs. In some deployment modes, such as
standalone or YARN session, task instances of different jobs are deployed in
the same TaskManager. This makes the log event flow more confusing unless
users explicitly use tags to distinguish them, and it makes locating problems
more difficult and inefficient. For the YARN job cluster deployment mode,
this problem is not very serious, but we still need to manually distinguish
between the framework log and the business log. Overall, we found that
Flink's existing log model has the following problems:
- Framework logs and business logs are mixed in the same log file. There is
no way to make a clear distinction, which is not conducive to problem
location and analysis;
- It is not conducive to the independent collection of business logs;
Therefore, we propose a mechanism to separate the framework and business
logs. It can split the existing log files of the TaskManager.
Currently, it is associated with two JIRA issues:
- FLINK-11202 [1]: Split log file per job
- FLINK-11782 [2]: Enhance TaskManager log visualization by listing all log
files in the Flink web UI
We have implemented and validated it in standalone and Flink on YARN (job
cluster) mode.
sketch 1:
[image: flink-web-ui-taskmanager-log-files.png]
sketch 2:
[image: flink-web-ui-taskmanager-log-files-2.png]
Design documentation:
https://docs.google.com/document/d/1TTYAtFoTWaGCveKDZH394FYdRyNyQFnVoW5AYFvnr5I/edit?usp=sharing
Best,
Vino
[1]: https://issues.apache.org/jira/browse/FLINK-11202
[2]: https://issues.apache.org/jira/browse/FLINK-11782