Hi Kevin,
I haven't worked with Bugsnag, so I can't give any input on that part.
For Flink, exceptions are handled by the job's scheduler. Flink collects
these exceptions in a bounded queue called the exception history [1]. It
records task failures as well as global failures that cause the job to fail
or restart. The size of the queue can be set through
web.exception-history-size [2].
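
As a rough sketch of how the exception history could be polled from outside
the cluster (for example, to forward entries to an external tracker),
something like the following might work. The class and method names are made
up for illustration, and the base URL is an assumption about where your
JobManager's REST endpoint is reachable. Only the /jobs/<jobid>/exceptions
path comes from the REST API docs [1]. It needs Java 11+ for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Flink: reads the exception history of one job.
final class ExceptionHistoryClient {
    private ExceptionHistoryClient() {}

    // Builds the URI of the exceptions endpoint documented in [1].
    static URI exceptionsUri(String restBaseUrl, String jobId) {
        // Drop a single trailing slash so the path is not doubled.
        String base = restBaseUrl.endsWith("/")
                ? restBaseUrl.substring(0, restBaseUrl.length() - 1)
                : restBaseUrl;
        return URI.create(base + "/jobs/" + jobId + "/exceptions");
    }

    // Returns the raw JSON body; parsing and forwarding are left to the caller.
    static String fetchExceptions(String restBaseUrl, String jobId)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
                HttpRequest.newBuilder(exceptionsUri(restBaseUrl, jobId)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

Keep in mind that the queue is bounded, so a poller like this can miss
entries if failures arrive faster than it polls.
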
Additionally, there's the FatalErrorHandler interface [3] which is used for
fatal (i.e. unrecoverable) errors of the ecosystem. You might want to have
a look at the implementations of this interface.
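
To give an idea of the shape of that interface, here is a minimal sketch of
a handler that reports the error before delegating to another handler. The
FatalErrorHandler declared below is a local stand-in that mirrors the
single-method interface linked in [3], so the snippet compiles without
Flink on the classpath. ReportingFatalErrorHandler and the reporter
callback are hypothetical names, and how (or whether) such a handler can be
wired into a Flink process is left open here.

```java
import java.util.function.Consumer;

// Local stand-in for org.apache.flink.runtime.rpc.FatalErrorHandler [3],
// declared here only so the example is self-contained.
interface FatalErrorHandler {
    void onFatalError(Throwable exception);
}

// Hypothetical wrapper: report the fatal error, then hand it to the delegate.
final class ReportingFatalErrorHandler implements FatalErrorHandler {
    private final Consumer<Throwable> reporter;   // e.g. a call into an error tracker
    private final FatalErrorHandler delegate;     // the handler that would run otherwise

    ReportingFatalErrorHandler(Consumer<Throwable> reporter, FatalErrorHandler delegate) {
        this.reporter = reporter;
        this.delegate = delegate;
    }

    @Override
    public void onFatalError(Throwable exception) {
        try {
            reporter.accept(exception);
        } catch (RuntimeException reportingFailure) {
            // A failing reporter must not hide the original fatal error.
            exception.addSuppressed(reportingFailure);
        } finally {
            // Always delegate, even if reporting failed.
            delegate.onFatalError(exception);
        }
    }
}
```

The delegate call sits in the finally block so that the original handling
still happens when the reporting call itself throws.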

I hope that helps a bit.
Matthias

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/rest_api/#jobs-jobid-exceptions
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/#web-exception-history-size
[3]
https://github.com/apache/flink/blob/239067c8d6393a273acaa2a3c3f57ad1c5486e3a/flink-rpc/flink-rpc-core/src/main/java/org/apache/flink/runtime/rpc/FatalErrorHandler.java#L22

On Mon, Jun 21, 2021 at 3:12 PM Kevin Lam <kevin....@shopify.com> wrote:

> Hi all,
>
> I'm interested in instrumenting an Apache Flink application so that we can
> monitor exceptions. I was wondering what the best practices are here? Is
> there a good way to observe all the exceptions inside of a Flink
> application, including Flink internals?
>
> We are currently thinking of using Bugsnag, which has some steps to
> integrate with java applications:
> https://docs.bugsnag.com/platforms/java/other/, which works fine for
> uncaught exceptions in the job manager / pipeline driver context, but
> doesn't catch anything outside of that.
>
> We're also interested in reporting on exceptions that occur in the job
> execution context, eg. in task managers.
>
> Any tips/suggestions? I'd love to learn more about exception tracking and
> handling in Flink :)
>
> (reposting because it looks like my other thread got deleted?)
>
