Is that something that can just be done by the right logging framework and configuration?
Like having a log framework with two targets, one filtered on "org.apache.flink" and the other one filtered on "my.company.project" or so?

On Fri, Mar 1, 2019 at 3:44 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Jamie Grier,
>
> Thank you for your reply, let me add some explanations to this design.
>
> First of all, as stated in "Goal", it is mainly for the "Standalone"
> cluster model. Although we have implemented it for Flink on YARN, this does
> not mean that we can't turn off this feature by means of options. It should
> be noted that the separation is basically based on the "log configuration
> file". It is very scalable and even allows users to define the log pattern
> in the configuration file (of course this is an extension feature, not
> mentioned in the design documentation). In fact, "multiple files are a
> special case of a single file": we can provide an option that keeps the
> current single-file output as the default behavior, which should be the
> scenario you expect in the container.
>
> According to Flink's official 2016 user survey [1], the number of users
> running standalone mode is quite close to the number running on YARN
> (unfortunately there is no data for 2017). Although we mainly use Flink on
> YARN now, we have used standalone in depth (close to a daily processing
> volume of 20 trillion messages). In this scenario, the user logs generated
> by the tasks of different jobs are mixed together, and it is very difficult
> to locate an issue. Moreover, because we configure a log file rolling
> policy, we have to log in to the server to view the logs. Therefore, we
> expect that for the same task manager, the user logs generated by the tasks
> from the same job can be distinguished.
>
> In addition, I have tried MDC technology, but it cannot achieve the goal.
> Flink's underlying logging is log4j 1.x and logback. We need to be
> compatible with both frameworks at the same time, we don't allow
> large-scale changes to the existing code, and the change should be
> transparent to the user.
>
> Some other points:
>
> 1) Many of our users have experience using Storm and Spark, and they are
> more accustomed to that style in standalone mode;
> 2) We split the user log by Job, which will help to implement the "business
> log aggregation" feature based on the Job.
>
> Best,
> Vino
>
> [1]: https://www.ververica.com/blog/flink-user-survey-2016-part-1
>
> On Fri, Mar 1, 2019 at 7:32 AM, Jamie Grier <jgr...@lyft.com.invalid> wrote:
>
> > I think maybe, if I understood this correctly, this design is going in
> > the wrong direction. The problem with Flink logging, when you are running
> > multiple jobs in the same TMs, is not just about separating out the
> > business level logging into separate files. The Flink framework itself
> > logs many things where there is clearly a single job in context, but that
> > all ends up in the same log file and with no clear separation amongst the
> > log lines.
> >
> > Also, I don't think shooting to have multiple log files is a very good
> > idea either. It's common, especially on container-based deployments, that
> > the expectation is that a process (like Flink) logs everything to stdout
> > and the surrounding tooling takes care of routing that log data
> > somewhere. I think we should stick with that model and expect that there
> > will be a single log stream coming out of each Flink process.
> >
> > Instead, I think it would be better to enhance Flink's logging capability
> > such that the appropriate context can be added to each log line with the
> > exact format controlled by the end user. It might make sense to take a
> > look at MDC, for example, as a way to approach this.
> >
> > On Thu, Feb 28, 2019 at 4:24 AM vino yang <yanghua1...@gmail.com> wrote:
> >
> > > Dear devs,
> > >
> > > Currently, for log output, Flink does not explicitly distinguish
> > > between framework logs and user logs. In Task Manager, logs from the
> > > framework are intermixed with the user's business logs.
> > > In some deployment models, such as Standalone or YARN session, there
> > > are different task instances of different jobs deployed in the same
> > > Task Manager. It makes the log event flow more confusing unless the
> > > users explicitly use tags to distinguish them, and it makes locating
> > > problems more difficult and inefficient. For the YARN job cluster
> > > deployment model, this problem will not be very serious, but we still
> > > need to manually distinguish between the framework and the business
> > > log. Overall, we found that Flink's existing log model has the
> > > following problems:
> > >
> > > - Framework log and business log are mixed in the same log file. There
> > >   is no way to make a clear distinction, which is not conducive to
> > >   problem location and analysis;
> > > - Not conducive to the independent collection of business logs;
> > >
> > > Therefore, we propose a mechanism to separate the framework and
> > > business log. It can split existing log files for Task Manager.
> > >
> > > Currently, it is associated with two JIRA issues:
> > >
> > > - FLINK-11202[1]: Split log file per job
> > > - FLINK-11782[2]: Enhance TaskManager log visualization by listing all
> > >   log files for Flink web UI
> > >
> > > We have implemented and validated it in standalone and Flink on YARN
> > > (job cluster) mode.
> > >
> > > sketch 1:
> > >
> > > [image: flink-web-ui-taskmanager-log-files.png]
> > >
> > > sketch 2:
> > >
> > > [image: flink-web-ui-taskmanager-log-files-2.png]
> > >
> > > Design documentation:
> > >
> > > https://docs.google.com/document/d/1TTYAtFoTWaGCveKDZH394FYdRyNyQFnVoW5AYFvnr5I/edit?usp=sharing
> > >
> > > Best,
> > > Vino
> > >
> > > [1]: https://issues.apache.org/jira/browse/FLINK-11202
> > > [2]: https://issues.apache.org/jira/browse/FLINK-11782
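
P.S. To make the "two targets" question at the top of this reply concrete, here is a rough sketch of what such a setup might look like as a log4j 1.x properties file. It is only an illustration under stated assumptions: "my.company.project" is the placeholder package name from the question, the appender names "framework" and "business" and the ".business" file suffix are made up, and ${log.file} is assumed to be a system property supplied by the Flink startup scripts.

```properties
# Root logger: everything not matched by a more specific logger below
# (including org.apache.flink) goes to the "framework" appender.
log4j.rootLogger=INFO, framework

# Loggers under the placeholder package my.company.project go to the
# "business" appender. additivity=false stops these events from also
# being passed up to the root logger's appender.
log4j.logger.my.company.project=INFO, business
log4j.additivity.my.company.project=false

# Framework log file (path assumed to come from -Dlog.file=...).
log4j.appender.framework=org.apache.log4j.FileAppender
log4j.appender.framework.file=${log.file}
log4j.appender.framework.layout=org.apache.log4j.PatternLayout
log4j.appender.framework.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c - %m%n

# Business log file (the ".business" suffix is an arbitrary choice).
log4j.appender.business=org.apache.log4j.FileAppender
log4j.appender.business.file=${log.file}.business
log4j.appender.business.layout=org.apache.log4j.PatternLayout
log4j.appender.business.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c - %m%n
```

Note that this only separates by logger name. Two jobs that run in the same TaskManager and log under the same package would still be interleaved in the business file, so it covers the framework/business split but not the per-job split discussed in the quoted thread.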