Hello Bryan / Greg / NiFi devs,

Distributed tracing (DT) is similar to provenance in that it shows the path
a particular flowfile travels, but its core selling point is that it
supports tracing across multiple systems/services, regardless of what is
receiving the data. Provenance is a fantastic feature, but there are
instances where one might want to draw the bigger picture and identify
bottlenecks as data flows from one system to another, where the other
system may or may not be using NiFi.

DT utilizes three ids: traceId, parentId, and spanId. While a tree can be
built using two ids (spanId and parentId), the third id (traceId) makes it
easier to pull all of the relevant information out of a datastore.

DT is focused more on performance and identifying bottlenecks in one or
more systems. Imagine NiFi receiving data from various sources (e.g. HTTP,
Kafka, SQS) and egressing to other destinations (HTTP, Kafka, NiFi). DT
provides a spec that we'd be able to follow to correlate the data as it
traverses from system to system. Each system that participates in the DT
ecosystem would simply emit spans (a trace is made up of one or more
spans), and a collection system would aggregate all of these spans, draw a
bigger picture of the path the data took, and help identify key
bottlenecks.
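
To make the role of the three ids concrete, here is a minimal sketch of what
a collector might do with them. SpanRecord and TraceAssembler are
hypothetical names used only for illustration; they are not part of NiFi or
OTEL, and real spans carry timestamps, attributes, and status as well.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified span: only the three ids plus a display name. */
final class SpanRecord {
    final String traceId;
    final String spanId;
    final String parentId; // null for the root span of a trace
    final String name;

    SpanRecord(String traceId, String spanId, String parentId, String name) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentId = parentId;
        this.name = name;
    }
}

/** Hypothetical collector-side helper that rebuilds one trace from stored spans. */
final class TraceAssembler {
    /** traceId lets us pull every span of a trace with one equality filter. */
    static List<SpanRecord> spansOfTrace(List<SpanRecord> store, String traceId) {
        List<SpanRecord> result = new ArrayList<>();
        for (SpanRecord s : store) {
            if (s.traceId.equals(traceId)) {
                result.add(s);
            }
        }
        return result;
    }

    /** spanId/parentId give the tree: group spans under their parent (null key = root). */
    static Map<String, List<SpanRecord>> childrenByParent(List<SpanRecord> spans) {
        Map<String, List<SpanRecord>> children = new HashMap<>();
        for (SpanRecord s : spans) {
            children.computeIfAbsent(s.parentId, k -> new ArrayList<>()).add(s);
        }
        return children;
    }

    /** Renders the subtree below parentId as indented lines, one span per line. */
    static String render(Map<String, List<SpanRecord>> children, String parentId, int depth) {
        StringBuilder out = new StringBuilder();
        for (SpanRecord s : children.getOrDefault(parentId, new ArrayList<SpanRecord>())) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(s.name).append('\n');
            out.append(render(children, s.spanId, depth + 1));
        }
        return out.toString();
    }
}
```

Without traceId, the collector would have to walk parent links span by span
to find everything that belongs together; with it, the lookup is a single
filter and the tree is built afterwards.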

OpenTelemetry (OTEL) provides clients (across many languages, including
Java) that developers can use to instrument their libraries' APIs and
participate in a DT ecosystem, as it adheres to the tracing spec. Egressing
trace data is possible without using OTEL, but then we may find ourselves
reinventing the wheel, although the result could be optimized for NiFi.

Creating a reporting task could certainly be a path. I mainly have a few
concerns with that:

1. If provenance is disabled, will provenance events still be emitted and
collected by a new reporting task?
2. There'll be an impact on performance; how much is unknown. OTEL is
gaining traction across the industry and there are ways to mitigate the
performance impact, mainly sampling and the fact that *tracing is best
effort*. Spans would be emitted from NiFi via UDP to a collector on the
same network.
3. Would there be any issues with appending a flowfile attribute that is
carried throughout the flow and maintains the traceId, parentSpanId, and
trace flags? See below for more details.

There's a W3C spec (Trace Context) which defines a formatted string that
would be propagated to services (HTTP, Kafka, etc.). So if NiFi were to
put information onto Kafka, any consumers of that data would be able to
continue the trace and help draw the bigger picture.

W3C Spec: https://www.w3.org/TR/trace-context/#traceparent-header
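
As an illustration of that formatted string, the sketch below parses and
re-emits a version-00 traceparent value (version, trace-id, parent-id, and
trace-flags, separated by dashes). TraceParent is a hypothetical helper
class, not an OTEL or NiFi API, and it assumes only version 00 needs to be
handled; a real implementation would follow the spec's rules for future
versions and would normally come from the OTEL propagator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for a version-00 traceparent: 00-traceId-parentId-flags. */
final class TraceParent {
    // Lowercase hex only: 32-char trace-id, 16-char parent-id, 2-char trace-flags.
    private static final Pattern FORMAT = Pattern.compile(
            "^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final String ZERO_TRACE_ID = "00000000000000000000000000000000";
    private static final String ZERO_PARENT_ID = "0000000000000000";

    final String traceId;
    final String parentId;
    final int flags;

    TraceParent(String traceId, String parentId, int flags) {
        this.traceId = traceId;
        this.parentId = parentId;
        this.flags = flags;
    }

    /** Returns the parsed value, or null if the header is malformed or uses all-zero ids. */
    static TraceParent parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String parentId = m.group(2);
        if (traceId.equals(ZERO_TRACE_ID) || parentId.equals(ZERO_PARENT_ID)) {
            return null; // the spec treats all-zero ids as invalid
        }
        return new TraceParent(traceId, parentId, Integer.parseInt(m.group(3), 16));
    }

    /** The lowest bit of trace-flags is the "sampled" flag. */
    boolean isSampled() {
        return (flags & 0x01) != 0;
    }

    /** Value to hand downstream: same trace and flags, our own span as the new parent. */
    TraceParent withParent(String ourSpanId) {
        return new TraceParent(traceId, ourSpanId, flags);
    }

    String toHeader() {
        return String.format("00-%s-%s-%02x", traceId, parentId, flags);
    }
}
```

In NiFi terms, the string returned by toHeader() is what could be stored in
a single flowfile attribute (the attribute name would be our choice) and
written out as an HTTP header or Kafka record header on egress.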

For #2, since DT is focused on performance, sampling can help reduce
chatter over the wire; ideally, sampling 0.01% of traces would draw the
same picture as 1% or 10%+. This is certainly different from provenance, as
DT favors performance over completeness of the data and should not be
thought of as auditing.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#sampler
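
To show how ratio-based sampling can stay cheap and consistent, here is a
rough sketch in the spirit of the spec's TraceIdRatioBased sampler. It is
not the OTEL SDK implementation: RatioSampler is a hypothetical class, and
the choice to compare the low 8 bytes of the trace id against a threshold
is an assumption made for illustration.

```java
/** Hypothetical sampler: keeps roughly `ratio` of traces, decided from the trace id alone. */
final class RatioSampler {
    private final long upperBound;

    RatioSampler(double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be within [0.0, 1.0]");
        }
        // The cast saturates at Long.MAX_VALUE when ratio is 1.0.
        this.upperBound = (long) (ratio * Long.MAX_VALUE);
    }

    /** Expects a 32-character lowercase hex trace id, as used in traceparent. */
    boolean shouldSample(String traceIdHex) {
        if (upperBound == Long.MAX_VALUE) {
            return true; // ratio 1.0: sample everything
        }
        // Interpret the low 8 bytes as a number and clear the sign bit.
        long low = Long.parseUnsignedLong(traceIdHex.substring(16), 16) & Long.MAX_VALUE;
        return low < upperBound;
    }
}
```

Because the decision depends only on the trace id, it needs no shared state
and gives the same answer every time for the same trace. In practice the
sampled bit in the trace flags would also tell downstream systems what was
decided upstream.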

On Thu, Jul 28, 2022 at 5:01 PM Bryan Bende <bbe...@gmail.com> wrote:

> Hi Greg,
>
> I don't really know anything about OpenTelemetry, but from the
> perspective of integrating something into the framework, some things
> to consider...
>
> Is there some way to piggy-back on provenance and use a ReportingTask
> to process provenance events and report something to OpenTelemetry?
>
> If something new does need to be added, it should probably be an
> extension point where there is an interface in the framework-api and
> different implementations can be plugged in.
> Ideally the framework itself wouldn't have any knowledge of
> OpenTelemetry specifically, it would only be reporting some
> information, which could then be used in some way by the OpenTelemetry
> implementation.
>
> How does NiFi actually communicate with OpenTelemetry? Are you
> expecting to send data to OpenTelemetry in this new method you are
> suggesting?
> That would likely have a significant impact on the performance of the flow.
>
> Thanks,
>
> Bryan
>
> On Thu, Jul 28, 2022 at 3:17 PM glma...@uwe.nsa.gov <glma...@uwe.nsa.gov>
> wrote:
> >
> > Nifi Devs,
> >
> > My team and I are looking for guidance on how we can extend Apache
> > NiFi's capabilities. Specifically, we're looking to include distributed
> > tracing. We'll approach this effort as if we're the tracing experts and
> > simply seeking implementation guidance. Our developers have good
> > exposure to working with NiFi and creating custom processors. We plan to
> > fork the project to begin this effort but want to make sure we approach
> > this with the best possible direction for community adoption.
> >
> > Our initial thoughts on this approach would be to piggyback on how
> > Provenance was implemented. We essentially want to include a subroutine
> > or method that gets implicitly invoked upon a processor's 'onTrigger'
> > method. From there we would analyze the FlowFile's attributes to check
> > for the existence of 'traceId' and/or propagate one if found.
> >
> > We can expound upon all of these tracing/observability details if that
> > helps by any means. We're able to provide a more detailed scope of this
> > task as well, but for now we just want to get feedback on our overall
> > goal and proposed approach.
> >
> > Thanks,
> > Greg Marshall
>
