Hi,

How does this approach differ from the related Jira issue mentioned in the proposal [1]? I commented there because I'm skeptical about adding Hadoop-specific code to the generic cluster components.

Best,
Aljoscha

[1] https://issues.apache.org/jira/browse/FLINK-14317

On 13.02.20 03:47, SHI Xiaogang wrote:
Hi Rong Rong,

Thanks for the proposal. We have also run into pain points with the
history server. To address them, we propose a trace system for historical
information, which is very similar to the metric system.

A trace is semi-structured information about events in Flink. Useful traces
include:
* job traces: which contain the job graph of submitted jobs.
* schedule traces: which are typically composed of task slot information.
They are generated when a job finishes, fails, or is canceled. As a job may
restart multiple times, a job typically has multiple schedule traces.
* checkpoint traces: which are generated when a checkpoint completes or
fails.
* task manager traces: which are generated when a task manager terminates.
Users can access the link to aggregated logs in task manager traces.

Users can use TraceReport to collect traces in Flink and export them to
external storage (e.g., ElasticSearch). By retrieving traces when
exceptions happen, we can improve the user experience in alerting.
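
As a rough illustration of the idea above, a trace and a reporter could be
modeled as follows. This is only a sketch: Trace, TraceReporter,
InMemoryTraceReporter and all field names are hypothetical and not part of
any existing Flink API, and the in-memory list stands in for an external
store such as ElasticSearch.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol


@dataclass
class Trace:
    """Semi-structured record of one event (hypothetical field names)."""
    kind: str            # e.g. "job", "schedule", "checkpoint", "taskmanager"
    job_id: str
    timestamp_ms: int
    payload: Dict[str, Any] = field(default_factory=dict)


class TraceReporter(Protocol):
    """Hypothetical reporter interface, analogous to a metric reporter."""
    def report(self, trace: Trace) -> None: ...


class InMemoryTraceReporter:
    """Stand-in for a reporter that exports traces to external storage."""

    def __init__(self) -> None:
        self.traces: List[Trace] = []

    def report(self, trace: Trace) -> None:
        self.traces.append(trace)

    def find(self, job_id: str, kind: str) -> List[Trace]:
        # Retrieve the traces of one kind for one job, e.g. when an
        # exception occurs and an alert needs more context.
        return [t for t in self.traces if t.job_id == job_id and t.kind == kind]


reporter = InMemoryTraceReporter()
reporter.report(Trace("checkpoint", "job-1", 1000, {"checkpoint_id": 7, "status": "FAILED"}))
# A job that restarts produces one schedule trace per attempt.
reporter.report(Trace("schedule", "job-1", 2000, {"attempt": 1, "slots": ["tm-1/slot-0"]}))
reporter.report(Trace("schedule", "job-1", 3000, {"attempt": 2, "slots": ["tm-2/slot-1"]}))
```

Because the payload is a free-form map, each trace kind can carry different
fields without changing the reporter interface.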

Regards,
Xiaogang

On Thu, Feb 13, 2020 at 9:41 AM, Rong Rong <walter...@gmail.com> wrote:

Hi All,

Recently we have been experimenting with using Flink’s history server as a
centralized debugging service for completed streaming jobs.

Specifically, we dynamically generate links to access log files on the YARN
host; in the meantime, we use the Flink history server to show job graphs,
exceptions and other info of the completed jobs[2].

This causes some pain for our users: it is inconvenient to go to the YARN
host to access logs and then to the Flink history server for the other
information.

Thus we would like to propose an improvement to the current Flink history
server:

    - To support dynamic links to residual log files from the host machine
      within the retention period [3];
    - To support dynamic links to aggregated log files provided by the
      cluster, if supported, such as Hadoop HistoryServer [1] or Kubernetes
      cluster-level logging [4].
       - A similar integration with Hadoop HistoryServer was already
         proposed before [5], with a slightly different approach.
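
To make the two kinds of links concrete, here is a minimal sketch of how
they might be built for YARN. It is not the proposed implementation: the
function names are hypothetical, the URL layouts are assumptions based on
the usual NodeManager web UI path (/node/containerlogs/...) and the
JobHistory server log path (.../jobhistory/logs/...), and the hosts, ports
and IDs in the example are made up.

```python
from typing import Optional


def residual_log_url(nm_http_address: str, container_id: str, user: str) -> str:
    """Link to logs still present on the NodeManager host (assumed layout)."""
    return f"http://{nm_http_address}/node/containerlogs/{container_id}/{user}"


def aggregated_log_url(log_server_url: str, nm_address: str,
                       container_id: str, user: str) -> str:
    """Link to aggregated logs behind a log server URL (assumed layout)."""
    base = log_server_url.rstrip("/")
    return f"{base}/{nm_address}/{container_id}/{container_id}/{user}"


def pick_log_url(seconds_since_finish: int, retain_seconds: int,
                 residual: str, aggregated: Optional[str] = None) -> Optional[str]:
    """Prefer the residual link while the host is assumed to still hold the logs."""
    if seconds_since_finish < retain_seconds:
        return residual
    # After the retention period only an aggregated link, if any, can work.
    return aggregated


residual = residual_log_url("nm-host:8042", "container_1_0001_01_000002", "flink")
aggregated = aggregated_log_url("http://jhs-host:19888/jobhistory/logs/",
                                "nm-host:45454", "container_1_0001_01_000002", "flink")
```

The links are called dynamic because they are computed when the page is
requested: the same container can resolve to a different link, or to none,
depending on how long ago it finished.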


Any feedback and suggestions are highly appreciated!

--

Rong

[1] https://hadoop.apache.org/docs/r2.9.2/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/monitoring/historyserver.html
[3] https://hadoop.apache.org/docs/r2.9.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml#yarn.nodemanager.log.retain-seconds
[4] https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures
[5] https://issues.apache.org/jira/browse/FLINK-14317

