Re: [DISCUSS] Flink framework and user log separation

Stephan Ewen Thu, 04 Jul 2019 03:38:59 -0700

Is that something that can just be done by the right logging framework and
configuration?


Like having a log framework with two targets, one filtered on
"org.apache.flink" and the other one filtered on "my.company.project" or so?

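A minimal sketch of that idea, under stated assumptions: it uses java.util.logging only because it ships with the JDK, not because Flink uses it, and the class names `SplitByLoggerName` and `PrefixTarget` as well as the two example logger names are hypothetical. With log4j or logback the same routing would normally live in the logging configuration file rather than in code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class SplitByLoggerName {

    /** In-memory log target that accepts only loggers under one name prefix. */
    static final class PrefixTarget extends Handler {
        final List<String> lines = new ArrayList<>();

        PrefixTarget(String prefix) {
            // Route purely on the logger name, i.e. the package of the logging class.
            setFilter(r -> r.getLoggerName() != null && r.getLoggerName().startsWith(prefix));
        }

        @Override
        public void publish(LogRecord record) {
            // isLoggable applies the handler level and the prefix filter set above.
            if (isLoggable(record)) {
                lines.add(record.getLoggerName() + " - " + record.getMessage());
            }
        }

        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Logs one framework line and one user line; returns [frameworkLines, userLines]. */
    static List<List<String>> demo() {
        PrefixTarget framework = new PrefixTarget("org.apache.flink");
        PrefixTarget user = new PrefixTarget("my.company.project");
        Logger root = Logger.getLogger("");
        root.addHandler(framework);
        root.addHandler(user);
        try {
            // Illustrative logger names only; any name under the prefix would match.
            Logger.getLogger("org.apache.flink.runtime.taskmanager.Task").info("task switched to RUNNING");
            Logger.getLogger("my.company.project.MyMapper").info("parsed record");
        } finally {
            root.removeHandler(framework);
            root.removeHandler(user);
        }
        return List.of(framework.lines, user.lines);
    }

    public static void main(String[] args) {
        List<List<String>> result = demo();
        System.out.println("framework target: " + result.get(0));
        System.out.println("user target:      " + result.get(1));
    }
}
```

Note that this separates framework lines from user lines, but it does not by itself separate the user lines of two different jobs running in the same Task Manager.
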
On Fri, Mar 1, 2019 at 3:44 AM vino yang <[email protected]> wrote:

> Hi Jamie Grier,
>
> Thank you for your reply, let me add some explanations to this design.
>
> First of all, as stated in "Goal", the design mainly targets the
> "Standalone" cluster mode. Although we have also implemented it for Flink
> on YARN, the feature can still be turned off through an option. Note that
> the separation is driven by the log configuration file, so it is very
> flexible and could even let users define the log pattern in that file
> (this would be an extension feature, not mentioned in the design
> documentation). In fact, a single file is just a special case of multiple
> files, so we can provide an option that keeps the current single-file
> behavior as the default, which should be the scenario you expect in a
> container.
>
> According to Flink's official 2016 user survey [1], the number of users
> running standalone mode is quite close to the number running on YARN
> (unfortunately there is no data for 2017). Although we now mainly use Flink
> on YARN, we have used standalone mode heavily (processing close to 20
> trillion messages per day). In this scenario, the user logs generated by
> tasks of different jobs are mixed together, and it is very difficult to
> locate an issue. Moreover, because we configure a rolling policy for the
> log files, we have to log in to the server to view them. Therefore, we want
> the user logs generated by tasks of the same job to be distinguishable
> within a single task manager.
>
> In addition, I have tried MDC, but it cannot achieve this goal. The
> logging frameworks underlying Flink are log4j 1.x and logback. We need to
> be compatible with both frameworks at the same time, we do not want
> large-scale changes to the existing code, and the feature should be
> transparent to the user.
>
> Some other points:
>
> 1) Many of our users have experience using Storm and Spark, and they are
> more accustomed to that style in standalone mode;
> 2) We split the user log by Job, which will help to implement the "business
> log aggregation" feature based on the Job.
>
> Best,
> Vino
>
> [1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1
>
> On Fri, Mar 1, 2019 at 7:32 AM Jamie Grier <[email protected]> wrote:
>
> > I think maybe if I understood this correctly this design is going in the
> > wrong direction.  The problem with Flink logging, when you are running
> > multiple jobs in the same TMs, is not just about separating out the
> > business level logging into separate files.  The Flink framework itself
> > logs many things where there is clearly a single job in context but that
> > all ends up in the same log file and with no clear separation amongst the
> > log lines.
> >
> > Also, I don't think shooting to have multiple log files is a very good
> > idea either.  It's common, especially on container-based deployments,
> > that the expectation is that a process (like Flink) logs everything to
> > stdout and the surrounding tooling takes care of routing that log data
> > somewhere.  I think we should stick with that model and expect that
> > there will be a single log stream coming out of each Flink process.
> >
> > Instead, I think it would be better to enhance Flink's logging capability
> > such that the appropriate context can be added to each log line with the
> > exact format controlled by the end user.  It might make sense to take a
> > look at MDC, for example, as a way to approach this.
> >
> >
> > On Thu, Feb 28, 2019 at 4:24 AM vino yang <[email protected]> wrote:
> >
> > > Dear devs,
> > >
> > > Currently, for log output, Flink does not explicitly distinguish
> > > between framework logs and user logs. In the Task Manager, logs from
> > > the framework are intermixed with the user's business logs. In some
> > > deployment models, such as Standalone or YARN session, task instances
> > > of different jobs are deployed in the same Task Manager. This makes
> > > the log event flow more confusing unless users explicitly use tags to
> > > distinguish them, and it makes locating problems more difficult and
> > > inefficient. For the YARN job cluster deployment model this problem is
> > > less serious, but we still need to manually distinguish between the
> > > framework and the business logs. Overall, we found that Flink's
> > > existing log model has the following problems:
> > >
> > >    - Framework logs and business logs are mixed in the same log file.
> > >      There is no way to make a clear distinction, which is not
> > >      conducive to locating and analyzing problems;
> > >    - Not conducive to the independent collection of business logs;
> > >
> > > Therefore, we propose a mechanism to separate the framework and
> > > business logs. It can split the existing log files of the Task Manager.
> > >
> > > Currently, it is associated with two JIRA issues:
> > >
> > >    - FLINK-11202 [1]: Split log file per job
> > >    - FLINK-11782 [2]: Enhance TaskManager log visualization by listing
> > >      all log files for Flink web UI
> > >
> > > We have implemented and validated it in standalone and Flink on YARN
> > > (job cluster) mode.
> > >
> > > sketch 1:
> > >
> > > [image: flink-web-ui-taskmanager-log-files.png]
> > >
> > > sketch 2:
> > > [image: flink-web-ui-taskmanager-log-files-2.png]
> > >
> > > Design documentation:
> > >
> > > https://docs.google.com/document/d/1TTYAtFoTWaGCveKDZH394FYdRyNyQFnVoW5AYFvnr5I/edit?usp=sharing
> > >
> > > Best,
> > > Vino
> > >
> > > [1]: https://issues.apache.org/jira/browse/FLINK-11202
> > > [2]: https://issues.apache.org/jira/browse/FLINK-11782
> > >
> >
>
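
For reference, the MDC-style alternative raised in the quoted thread (attach the job as context to every log line instead of splitting files) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `ThreadLocal` map is a hand-rolled stand-in for a real MDC such as the one offered by SLF4J, and the names `JobContextLogging`, `JobAwareFormatter`, `logAs`, and the `jobId` key are all hypothetical. It is not how Flink attaches context today.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

class JobContextLogging {

    /** Hand-rolled stand-in for an MDC: a per-thread key/value context. */
    static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    /** Formatter that prefixes each line with the job ID of the calling thread. */
    static final class JobAwareFormatter extends Formatter {
        @Override
        public String format(LogRecord record) {
            // "-" marks lines logged outside of any job context.
            String jobId = CONTEXT.get().getOrDefault("jobId", "-");
            return "[job=" + jobId + "] " + record.getLoggerName()
                    + " - " + formatMessage(record);
        }
    }

    /** Formats one message as if it were logged by a task thread of the given job. */
    static String logAs(String jobId, String loggerName, String message) {
        CONTEXT.get().put("jobId", jobId);
        try {
            LogRecord record = new LogRecord(Level.INFO, message);
            record.setLoggerName(loggerName);
            return new JobAwareFormatter().format(record);
        } finally {
            // Always clear the context so it cannot leak to the next job on this thread.
            CONTEXT.get().remove("jobId");
        }
    }

    public static void main(String[] args) {
        System.out.println(logAs("job-42", "my.company.project.MyMapper", "parsed record"));
    }
}
```

The sketch keeps a single log stream per process, as argued for in the thread, and leaves filtering by job to downstream tooling. It only works when the formatter runs on the same thread that set the context.
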
