Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Piotr Nowojski Tue, 21 Nov 2023 09:15:36 -0800

Hi,

I will start the voting thread shortly. Jing, I'm making an assumption
here/hoping that my answers have satisfied you. If not, please let me know.


Best,
Piotrek

pon., 20 lis 2023 o 09:59 Piotr Nowojski <pnowoj...@apache.org> napisał(a):

> Hi Jing!
>
> >  the upcoming OpenTelemetry based TraceReporter will use the same Span
> > implementation and will not support trace_id and span_id. Does it make
> > sense to at least add the span_id into the current Span design? The
> default
> > implementation could follow your suggestion:
> jobId#attemptId#checkpointId.
>
> Those IDs (jobId, checkpointId) will be accessible to the humans via
> attributes,
> so there is no need to encode them at the moment in the span/trace ids. At
> the
> same time, at the moment, I don't know for sure how the concept of span
> parent ids should be exposed to the user of this API. Whether it should be
> plain
> text, or some pojo generating the trace id/span id. Also I'm not sure how
> would
> this have to work for other reporting systems other than OTEL. Due to those
> reasons I thought that keeping the API as simple as possible would be the
> best
> option.
>
> > 1. The sample code shows that the scope of Span will be the CanonicalName
> > of a class. If there are other cases that could be used as the scope
> too, a
> > javadoc about Span scope would be helpful. If the CanonicalName of a
> class
> > is the best practice, removing the scope from the builder constructor and
> > adding setScope(Class) might ease the API usage. The Span.getScope() can
> > still return String.
>
> I like the idea with `#setScope(Class)`. I will adjust the FLIP :)
>
> > 2. The sample code in the FLIP is not consistent. The first example used
> > Span.builder(..) and the second example used new Span() with setters.
>
> I will fix that, I've forgotten to upgrade the second `new Span()` usage
> to the
> builder.
>
> > 3. I guess the constructor of SpanBuilder class is a typo.
>
> Yes! Thanks for noting.
>
> Best,
> Piotrek
>
>
> czw., 16 lis 2023 o 15:12 Roman Khachatryan <ro...@apache.org> napisał(a):
>
>> Thanks for the proposal,
>>
>> Starting with the minimal functionality and expanding if necessary as the
>> FLIP describes makes a lot of sense to me.
>>
>> Regards,
>> Roman
>>
>> On Wed, Nov 15, 2023, 9:31 PM Jing Ge <j...@ververica.com.invalid> wrote:
>>
>> > Hi Piotr,
>> >
>> > Sorry for the late reply and thanks for the proposal, it looks awesome!
>> >
>> > In the discussion, you pointed out that it is difficult to build true
>> > distributed traces. afaiu from FLIP-384 and FLIP-385, the
>> > upcoming OpenTelemetry based TraceReporter will use the same Span
>> > implementation and will not support trace_id and span_id. Does it make
>> > sense to at least add the span_id into the current Span design? The
>> default
>> > implementation could follow your suggestion:
>> jobId#attemptId#checkpointId.
>> >
>> > Some other NIT questions:
>> > 1. The sample code shows that the scope of Span will be the
>> CanonicalName
>> > of a class. If there are other cases that could be used as the scope
>> too, a
>> > javadoc about Span scope would be helpful. If the CanonicalName of a
>> class
>> > is the best practice, removing the scope from the builder constructor
>> and
>> > adding setScope(Class) might ease the API usage. The Span.getScope() can
>> > still return String.
>> > 2. The sample code in the FLIP is not consistent. The first example used
>> > Span.builder(..) and the second example used new Span() with setters.
>> > 3. I guess the constructor of SpanBuilder class is a typo.
>> >
>> > Really a nice idea to introduce the trace report! Thanks again!
>> >
>> > Best regards,
>> > Jing
>> >
>> > On Tue, Nov 14, 2023 at 3:16 PM Piotr Nowojski <pnowoj...@apache.org>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > Thanks for the answers!
>> > >
>> > > Unless there are some objections or suggestions, I will open a voting
>> > > thread later this
>> > > week.
>> > >
>> > > > My original thought was to show how much time a sampled record is
>> > > processed
>> > > > within each operator in stream processing. By saying 'sampled', I
>> mean
>> > we
>> > > > won't generate a trace for every record due to the high cost
>> involved.
>> > > > Instead, we could only trace ONE record from source when the user
>> > > requests
>> > > > it (via REST API or Web UI) or when triggered periodically at a very
>> > low
>> > > > frequency.
>> > >
>> > > That would be useful, but another issue is that we can not measure
>> time
>> > > reliably at the
>> > > granularity of a single record. Time to process a single record by the
>> > > whole operator
>> > > chain is usually faster compared to the syscalls to measure time.
>> > >
>> > > So I think we are stuck with sample based profilers, like Flame Graphs
>> > > generated by
>> > > the Flink WebUI.
>> > >
>> > > Best, Piotrek
>> > >
>> > > czw., 9 lis 2023 o 05:32 Rui Fan <1996fan...@gmail.com> napisał(a):
>> > >
>> > > > Hi Piotr:
>> > > >
>> > > > Thanks for your reply!
>> > > >
>> > > > > About structured logging (basically events?) I vaguely remember
>> some
>> > > > > discussions about that. It might be a much larger topic, so I
>> would
>> > > > prefer
>> > > > > to leave it out of the scope of this FLIP.
>> > > >
>> > > > Sounds make sense to me!
>> > > >
>> > > > > I think those could be indeed useful. If you would like to
>> contribute
>> > > to
>> > > > them
>> > > > > in the future, I would be happy to review the FLIP for it :)
>> > > >
>> > > > Thank you, after this FLIP, I or my colleagues can pick it up!
>> > > >
>> > > > Best,
>> > > > Rui
>> > > >
>> > > > On Thu, Nov 9, 2023 at 11:39 AM Zakelly Lan <zakelly....@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Hi Piotr,
>> > > > >
>> > > > > Thanks for your detailed explanation! I could see the challenge of
>> > > > > implementing traces with multiple spans and agree to put it in the
>> > > future
>> > > > > work. I personally prefer the idea of generating multi span traces
>> > for
>> > > > > checkpoints on the JM only.
>> > > > >
>> > > > > > I'm not sure if I understand the proposal - I don't know how
>> traces
>> > > > could
>> > > > > > be used for this purpose?
>> > > > > > Traces are perfect for one of events (like checkpointing,
>> recovery,
>> > > > etc),
>> > > > > > not for continuous monitoring
>> > > > > > (like processing records). That's what metrics are. Creating
>> trace
>> > > and
>> > > > > > span(s) per each record would
>> > > > > > be prohibitively expensive.
>> > > > >
>> > > > > My original thought was to show how much time a sampled record is
>> > > > processed
>> > > > > within each operator in stream processing. By saying 'sampled', I
>> > mean
>> > > we
>> > > > > won't generate a trace for every record due to the high cost
>> > involved.
>> > > > > Instead, we could only trace ONE record from source when the user
>> > > > requests
>> > > > > it (via REST API or Web UI) or when triggered periodically at a
>> very
>> > > low
>> > > > > frequency. However after re-thinking my idea, I realized it's
>> hard to
>> > > > > define the whole lifecycle of a record since it is transformed
>> into
>> > > > > different forms among operators. We could discuss this in future
>> > after
>> > > > the
>> > > > > basic trace is implemented in Flink.
>> > > > >
>> > > > > > Unless you mean in batch/bounded jobs? Then yes, we could
>> create a
>> > > > > bounded
>> > > > > > job trace, with spans
>> > > > > > for every stage/task/subtask.
>> > > > >
>> > > > > Oh yes, batch jobs could definitely leverage the trace.
>> > > > >
>> > > > > Best,
>> > > > > Zakelly
>> > > > >
>> > > > >
>> > > > > On Wed, Nov 8, 2023 at 9:18 PM Jinzhong Li <
>> lijinzhong2...@gmail.com
>> > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Piotr,
>> > > > > >
>> > > > > > Thanks for driving this proposal!   I strongly agree that the
>> > > existing
>> > > > > > metric APIs are not suitable for monitoring restore/checkpoint
>> > > > behavior!
>> > > > > >
>> > > > > > I think the TM-level recovery/checkpointing traces are
>> necessary in
>> > > the
>> > > > > > future. In our production environment, we sometimes encounter
>> that
>> > > job
>> > > > > > recovery time is very long (30min+), due to several subTask
>> heavy
>> > > disk
>> > > > > > traffic. The TM-level recovery trace is helpful for
>> troubleshooting
>> > > > such
>> > > > > > issues.
>> > > > > >
>> > > > > > Best
>> > > > > > Jinzhong
>> > > > > >
>> > > > > > On Wed, Nov 8, 2023 at 5:09 PM Piotr Nowojski <
>> > pnowoj...@apache.org>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Zakelly,
>> > > > > > >
>> > > > > > > Thanks for the comments. Quick answer for both of your
>> questions
>> > > > would
>> > > > > be
>> > > > > > > that it probably should be
>> > > > > > > left as a future work. For more detailed answers please take a
>> > look
>> > > > > below
>> > > > > > > :)
>> > > > > > >
>> > > > > > > > Does it mean the inclusion and subdivision relationships of
>> > spans
>> > > > > > defined
>> > > > > > > > by "parent_id" are not supported? I think it is a very
>> > necessary
>> > > > > > feature
>> > > > > > > > for the trace.
>> > > > > > >
>> > > > > > > Yes exactly, that is the current limitation. This could be
>> solved
>> > > > > somehow
>> > > > > > > one way or another in the future.
>> > > > > > >
>> > > > > > > Support for reporting multi span traces all at once - for
>> example
>> > > > > > > `CheckpointStatsTracker` running JM,
>> > > > > > > could upon checkpoint completion create in one place the whole
>> > > > > structure
>> > > > > > of
>> > > > > > > parent spans, to have for
>> > > > > > > example one span per each subtask. This would be a relatively
>> > easy
>> > > > > follow
>> > > > > > > up.
>> > > > > > >
>> > > > > > > However, if we would like to create true distributed traces,
>> with
>> > > > spans
>> > > > > > > reported from many different
>> > > > > > > components, potentially both on JM and TM, the problem is a
>> bit
>> > > > deeper.
>> > > > > > The
>> > > > > > > issue in that case is how
>> > > > > > > to actually fill out `parrent_id` and `trace_id`? Passing some
>> > > > context
>> > > > > > > entity as a java object would be
>> > > > > > > unfeasible. That would require too many changes in too many
>> > > places. I
>> > > > > > think
>> > > > > > > the only realistic way
>> > > > > > > to do it, would be to have a deterministic generator of
>> > `parten_id`
>> > > > and
>> > > > > > > `trace_id` values.
>> > > > > > >
>> > > > > > > For example we could create the parent trace/span of the
>> > checkpoint
>> > > > on
>> > > > > > JM,
>> > > > > > > and set those ids to
>> > > > > > > something like: `jobId#attemptId#checkpointId`. Each subtask
>> then
>> > > > could
>> > > > > > > re-generate those ids
>> > > > > > > and subtasks' checkpoint span would have an id of
>> > > > > > > `jobId#attemptId#checkpointId#subTaskId`.
>> > > > > > > Note that this is just an example, as most likely distributed
>> > spans
>> > > > for
>> > > > > > > checkpointing do not make
>> > > > > > > sense, as we can generate them much easier on the JM anyway.
>> > > > > > >
>> > > > > > > > In addition to checkpoint and recovery, I believe the trace
>> > would
>> > > > > also
>> > > > > > be
>> > > > > > > > valuable for performance tuning. If Flink can trace and
>> > visualize
>> > > > the
>> > > > > > > time
>> > > > > > > > cost of each operator and stage for a sampled record, users
>> > would
>> > > > be
>> > > > > > able
>> > > > > > > > to easily determine the end-to-end latency and identify
>> > > performance
>> > > > > > > issues
>> > > > > > > > for optimization. Looking forward to seeing these in the
>> > future.
>> > > > > > >
>> > > > > > > I'm not sure if I understand the proposal - I don't know how
>> > traces
>> > > > > could
>> > > > > > > be used for this purpose?
>> > > > > > > Traces are perfect for one of events (like checkpointing,
>> > recovery,
>> > > > > etc),
>> > > > > > > not for continuous monitoring
>> > > > > > > (like processing records). That's what metrics are. Creating
>> > trace
>> > > > and
>> > > > > > > span(s) per each record would
>> > > > > > > be prohibitively expensive.
>> > > > > > >
>> > > > > > > Unless you mean in batch/bounded jobs? Then yes, we could
>> create
>> > a
>> > > > > > bounded
>> > > > > > > job trace, with spans
>> > > > > > > for every stage/task/subtask.
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Piotrek
>> > > > > > >
>> > > > > > >
>> > > > > > > śr., 8 lis 2023 o 05:30 Zakelly Lan <zakelly....@gmail.com>
>> > > > > napisał(a):
>> > > > > > >
>> > > > > > > > Hi Piotr,
>> > > > > > > >
>> > > > > > > > Happy to see the trace! Thanks for this proposal.
>> > > > > > > >
>> > > > > > > > One minor question: It is mentioned in the interface of
>> Span:
>> > > > > > > >
>> > > > > > > > Currently we don't support traces with multiple spans. Each
>> > span
>> > > is
>> > > > > > > > > self-contained and represents things like a checkpoint or
>> > > > recovery.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Does it mean the inclusion and subdivision relationships of
>> > spans
>> > > > > > defined
>> > > > > > > > by "parent_id" are not supported? I think it is a very
>> > necessary
>> > > > > > feature
>> > > > > > > > for the trace.
>> > > > > > > >
>> > > > > > > > In addition to checkpoint and recovery, I believe the trace
>> > would
>> > > > > also
>> > > > > > be
>> > > > > > > > valuable for performance tuning. If Flink can trace and
>> > visualize
>> > > > the
>> > > > > > > time
>> > > > > > > > cost of each operator and stage for a sampled record, users
>> > would
>> > > > be
>> > > > > > able
>> > > > > > > > to easily determine the end-to-end latency and identify
>> > > performance
>> > > > > > > issues
>> > > > > > > > for optimization. Looking forward to seeing these in the
>> > future.
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > Zakelly
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski <
>> > > > pnowoj...@apache.org>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Rui,
>> > > > > > > > >
>> > > > > > > > > Thanks for the comments!
>> > > > > > > > >
>> > > > > > > > > > 1. I see the trace just supports Span? Does it support
>> > trace
>> > > > > > events?
>> > > > > > > > > > I'm not sure whether tracing events is reasonable for
>> > > > > > TraceReporter.
>> > > > > > > > > > If it supports, flink can report checkpoint and
>> checkpoint
>> > > path
>> > > > > > > > > proactively.
>> > > > > > > > > > Currently, checkpoint lists or the latest checkpoint can
>> > only
>> > > > be
>> > > > > > > > fetched
>> > > > > > > > > > by external components or platforms. And report is more
>> > > timely
>> > > > > and
>> > > > > > > > > > efficient than fetch.
>> > > > > > > > >
>> > > > > > > > > No, currently the `TraceReporter` that I'm introducing
>> > supports
>> > > > > only
>> > > > > > > > single
>> > > > > > > > > span traces.
>> > > > > > > > > So currently neither events on their own, nor events
>> inside
>> > > spans
>> > > > > are
>> > > > > > > not
>> > > > > > > > > supported.
>> > > > > > > > > This is done just for the sake of simplicity, and test out
>> > the
>> > > > > basic
>> > > > > > > > > functionality. But I think,
>> > > > > > > > > those currently missing features should be added at some
>> > point
>> > > in
>> > > > > > > > > the future.
>> > > > > > > > >
>> > > > > > > > > About structured logging (basically events?) I vaguely
>> > remember
>> > > > > some
>> > > > > > > > > discussions about
>> > > > > > > > > that. It might be a much larger topic, so I would prefer
>> to
>> > > leave
>> > > > > it
>> > > > > > > out
>> > > > > > > > of
>> > > > > > > > > the scope of this
>> > > > > > > > > FLIP.
>> > > > > > > > >
>> > > > > > > > > > 2. This FLIP just monitors the checkpoint and task
>> > recovery,
>> > > > > right?
>> > > > > > > > >
>> > > > > > > > > Yes, it only adds single span traces for checkpointing and
>> > > > > > > > > recovery/initialisation - one
>> > > > > > > > > span per whole job per either recovery/initialization
>> process
>> > > or
>> > > > > per
>> > > > > > > each
>> > > > > > > > > checkpoint.
>> > > > > > > > >
>> > > > > > > > > > Could we add more operations in this FLIP? In our
>> > production,
>> > > > we
>> > > > > > > > > > added a lot of trace reporters for job starts and
>> scheduler
>> > > > > > > operation.
>> > > > > > > > > > They are useful if some jobs start slowly, because they
>> > will
>> > > > > affect
>> > > > > > > > > > the job availability. For example:
>> > > > > > > > > > - From JobManager process is started to JobGraph is
>> created
>> > > > > > > > > > - From JobGraph is created to JobMaster is created
>> > > > > > > > > > - From JobMaster is created to job is running
>> > > > > > > > > > - From start request tm from yarn or kubernetes to all
>> tms
>> > > are
>> > > > > > ready
>> > > > > > > > > > - etc
>> > > > > > > > >
>> > > > > > > > > I think those could be indeed useful. If you would like to
>> > > > > contribute
>> > > > > > > > them
>> > > > > > > > > in the future,
>> > > > > > > > > I would be happy to review the FLIP for it :)
>> > > > > > > > >
>> > > > > > > > > > Of course, this FLIP doesn't include them is fine for
>> me.
>> > The
>> > > > > first
>> > > > > > > > > version
>> > > > > > > > > > only initializes the interface and common operations,
>> and
>> > we
>> > > > can
>> > > > > > add
>> > > > > > > > > > more operations in the future
>> > > > > > > > >
>> > > > > > > > > Yes, that's exactly my thinking :)
>> > > > > > > > >
>> > > > > > > > > Best,
>> > > > > > > > > Piotrek
>> > > > > > > > >
>> > > > > > > > > wt., 7 lis 2023 o 10:05 Rui Fan <1996fan...@gmail.com>
>> > > > napisał(a):
>> > > > > > > > >
>> > > > > > > > > > Hi Piotr,
>> > > > > > > > > >
>> > > > > > > > > > Thanks for driving this proposal! The trace reporter is
>> > > useful
>> > > > to
>> > > > > > > > > > check a lot of duration monitors inside of Flink.
>> > > > > > > > > >
>> > > > > > > > > > I have some questions about this proposal:
>> > > > > > > > > >
>> > > > > > > > > > 1. I see the trace just supports Span? Does it support
>> > trace
>> > > > > > events?
>> > > > > > > > > > I'm not sure whether tracing events is reasonable for
>> > > > > > TraceReporter.
>> > > > > > > > > > If it supports, flink can report checkpoint and
>> checkpoint
>> > > path
>> > > > > > > > > > proactively.
>> > > > > > > > > > Currently, checkpoint lists or the latest checkpoint can
>> > only
>> > > > be
>> > > > > > > > fetched
>> > > > > > > > > > by external components or platforms. And report is more
>> > > timely
>> > > > > and
>> > > > > > > > > > efficient than fetch.
>> > > > > > > > > >
>> > > > > > > > > > 2. This FLIP just monitors the checkpoint and task
>> > recovery,
>> > > > > right?
>> > > > > > > > > > Could we add more operations in this FLIP? In our
>> > production,
>> > > > we
>> > > > > > > > > > added a lot of trace reporters for job starts and
>> scheduler
>> > > > > > > operation.
>> > > > > > > > > > They are useful if some jobs start slowly, because they
>> > will
>> > > > > affect
>> > > > > > > > > > the job availability. For example:
>> > > > > > > > > > - From JobManager process is started to JobGraph is
>> created
>> > > > > > > > > > - From JobGraph is created to JobMaster is created
>> > > > > > > > > > - From JobMaster is created to job is running
>> > > > > > > > > > - From start request tm from yarn or kubernetes to all
>> tms
>> > > are
>> > > > > > ready
>> > > > > > > > > > - etc
>> > > > > > > > > >
>> > > > > > > > > > Of course, this FLIP doesn't include them is fine for
>> me.
>> > The
>> > > > > first
>> > > > > > > > > version
>> > > > > > > > > > only initializes the interface and common operations,
>> and
>> > we
>> > > > can
>> > > > > > add
>> > > > > > > > > > more operations in the future.
>> > > > > > > > > >
>> > > > > > > > > > Best,
>> > > > > > > > > > Rui
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <
>> > > > > > pnowoj...@apache.org>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi all!
>> > > > > > > > > > >
>> > > > > > > > > > > I would like to start a discussion on FLIP-384:
>> Introduce
>> > > > > > > > TraceReporter
>> > > > > > > > > > and
>> > > > > > > > > > > use it to create checkpointing and recovery traces
>> [1].
>> > > > > > > > > > >
>> > > > > > > > > > > This proposal intends to improve observability of
>> Flink's
>> > > > > > > > Checkpointing
>> > > > > > > > > > and
>> > > > > > > > > > > Recovery/Initialization operations, by adding support
>> for
>> > > > > > reporting
>> > > > > > > > > > traces
>> > > > > > > > > > > from Flink. In the future, reporting traces can be of
>> > > course
>> > > > > used
>> > > > > > > for
>> > > > > > > > > > other
>> > > > > > > > > > > use cases and also by users.
>> > > > > > > > > > >
>> > > > > > > > > > > There are also two other follow up FLIPS, FLIP-385 [2]
>> > and
>> > > > > > FLIP-386
>> > > > > > > > > [3],
>> > > > > > > > > > > which expand the basic functionality introduced in
>> > FLIP-384
>> > > > > [1].
>> > > > > > > > > > >
>> > > > > > > > > > > Please let me know what you think!
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > Piotr Nowojski
>> > > > > > > > > > >
>> > > > > > > > > > > [1]
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
>> > > > > > > > > > > [2]
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
>> > > > > > > > > > > [3]
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Reply via email to