Re: [DISCUSS] Flink framework and user log separation

vino yang Thu, 28 Feb 2019 18:44:28 -0800

Hi Jamie Grier,

Thank you for your reply. Let me add some explanation of this design.

First of all, as stated under "Goal", the design is mainly aimed at the
"Standalone" cluster mode. Although we have also implemented it for Flink
on YARN, the feature can still be turned off through an option. Note that
the separation is driven by the log configuration file, so it is very
extensible and even lets users define the log pattern in that file (this is
an extension feature, not mentioned in the design documentation). In fact,
a single file is a special case of multiple files, so we can provide an
option that keeps the current single-file behavior as the default, which
should be what you expect in a container.

According to Flink's official 2016 user survey [1], the number of users
running standalone mode was quite close to the number running on YARN
(unfortunately, there is no comparable data for 2017). Although we now
mainly run Flink on YARN, we have used standalone mode extensively (with a
daily processing volume close to 20 trillion messages). In that setup, the
user logs produced by tasks of different jobs are mixed together, which
makes it very difficult to locate a problem. Moreover, because we configure
a rolling policy for the log files, we have to log in to the server to view
them. We therefore want the user logs produced by tasks of the same job to
be distinguishable within a single task manager.

In addition, I have tried MDC, but it could not achieve this goal. Flink
uses log4j 1.x and logback underneath. We need to stay compatible with both
frameworks at the same time, avoid large-scale changes to the existing
code, and keep the feature transparent to the user.
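
For reference, the MDC-style approach under discussion boils down to
tagging each log line with the job that is in context on the current
thread. Below is a minimal sketch of that idea using only the JDK, with a
ThreadLocal standing in for the logging framework's MDC. The class
JobLogContext and its methods are hypothetical names for illustration and
are not part of Flink.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an MDC: holds the job ID in context on the current thread. */
final class JobLogContext {
    private static final ThreadLocal<String> JOB_ID = new ThreadLocal<>();

    private JobLogContext() {}

    /** Runs the action with jobId attached to the current thread, then restores the previous value. */
    static <T> T withJob(String jobId, Supplier<T> action) {
        String previous = JOB_ID.get();
        JOB_ID.set(jobId);
        try {
            return action.get();
        } finally {
            if (previous == null) {
                JOB_ID.remove();
            } else {
                JOB_ID.set(previous);
            }
        }
    }

    /** Formats a log line, prefixing the job ID when one is in context. */
    static String format(String message) {
        String jobId = JOB_ID.get();
        return (jobId == null ? "[framework] " : "[job=" + jobId + "] ") + message;
    }

    public static void main(String[] args) {
        // No job in context: the line is tagged as a framework log.
        System.out.println(format("TaskManager started"));
        // Job in context: the line carries the job ID.
        System.out.println(withJob("job-a", () -> format("processing record")));
    }
}
```

With SLF4J, the same idea would presumably use MDC.put("jobId", ...) plus a
%X{jobId} conversion in the layout pattern. Note that this only tags lines
within a single stream; it does not by itself split output into per-job
files.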

Some other points:

1) Many of our users have experience with Storm and Spark, and they are
more accustomed to that logging style in standalone mode;
2) We split the user log by job, which will help implement a job-based
"business log aggregation" feature.

Best,
Vino

[1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1

On Fri, Mar 1, 2019 at 7:32 AM Jamie Grier <[email protected]> wrote:

> I think maybe if I understood this correctly this design is going in the
> wrong direction.  The problem with Flink logging, when you are running
> multiple jobs in the same TMs, is not just about separating out the
> business level logging into separate files.  The Flink framework itself
> logs many things where there is clearly a single job in context but that
> all ends up in the same log file and with no clear separation amongst the
> log lines.
>
> Also, I don't think shooting to have multiple log files is a very good idea
> either.  It's common, especially on container-based deployments, that the
> expectation is that a process (like Flink) logs everything to stdout and
> the surrounding tooling takes care of routing that log data somewhere.  I
> think we should stick with that model and expect that there will be a
> single log stream coming out of each Flink process.
>
> Instead, I think it would be better to enhance Flink's logging capability
> such that the appropriate context can be added to each log line with the
> exact format controlled by the end user.  It might make sense to take a
> look at MDC, for example, as a way to approach this.
>
>
> On Thu, Feb 28, 2019 at 4:24 AM vino yang <[email protected]> wrote:
>
> > Dear devs,
> >
> > Currently, for log output, Flink does not explicitly distinguish between
> > framework logs and user logs. In Task Manager, logs from the framework
> > are intermixed with the user's business logs. In some deployment models,
> > such as Standalone or YARN session, there are different task instances of
> > different jobs deployed in the same Task Manager. It makes the log event
> > flow more confusing unless the users explicitly use tags to distinguish
> > them and it makes locating problems more difficult and inefficient. For
> > YARN job cluster deployment model, this problem will not be very serious,
> > but we still need to artificially distinguish between the framework and
> > the business log. Overall, we found that Flink's existing log model has
> > the following problems:
> >
> >
> >    - Framework log and business log are mixed in the same log file. There
> >      is no way to make a clear distinction, which is not conducive to
> >      problem location and analysis;
> >    - Not conducive to the independent collection of business logs;
> >
> >
> > Therefore, we propose a mechanism to separate the framework and business
> > log. It can split existing log files for Task Manager.
> >
> > Currently, it is associated with two JIRA issues:
> >
> >    - FLINK-11202[1]: Split log file per job
> >    - FLINK-11782[2]: Enhance TaskManager log visualization by listing all
> >      log files for Flink web UI
> >
> >
> > We have implemented and validated it in standalone and Flink on YARN (job
> > cluster) mode.
> >
> > sketch 1:
> >
> > [image: flink-web-ui-taskmanager-log-files.png]
> >
> > sketch 2:
> > [image: flink-web-ui-taskmanager-log-files-2.png]
> >
> > Design documentation:
> > https://docs.google.com/document/d/1TTYAtFoTWaGCveKDZH394FYdRyNyQFnVoW5AYFvnr5I/edit?usp=sharing
> >
> > Best,
> > Vino
> >
> > [1]: https://issues.apache.org/jira/browse/FLINK-11202
> > [2]: https://issues.apache.org/jira/browse/FLINK-11782
> >
>
