Re: [C++] Adopting a library for (distributed) tracing

Bob Tinsman Sat, 01 May 2021 07:52:36 -0700

I agree that OpenTelemetry is the future; I have been following the 
observability space off and on and I knew about OpenTracing; I just realized 
that OpenTelemetry is its successor. [1]
I have found tracing to be a very powerful approach; at one point, I did a POC 
of a trace recorder inside a Java webapp, which shed light on some nasty 
bottlenecks. If integrated properly, it can be left on all the time, so it's 
valuable for doing root-cause analysis in production. At least in Java, there 
are already a lot of packages with OpenTelemetry hooks built in. [2]
I'm not sure what the overhead is when disabled--I think it is probably minimal 
or else it wouldn't be used so widely. But if we're not ready to jump right in, 
we could introduce our own @WithSpan annotation which by default is a no-op. To 
build an instrumented Arrow lib, you'd hook it up with a shim. Or you could 
just maintain a branch with instrumentation for people to try it out.


[1] https://lightstep.com/blog/brief-history-of-opentelemetry/
[2] 
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md

On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote: 
> Dear David,
> 
> OpenTelemetry tracing is definitely the future, I guess the question is how 
> far down the stack we want to put it.   I think it would be useful for flight 
> and other higher level modules, and for DataFusion for example it would be 
> really useful.  
> As for being alpha, I don’t think it will stay that way very long, there is a 
> ton of industry momentum behind OpenTelemetry.
> 
> -Evan
> 
> > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > 
> > Hello,
> > 
> > For Arrow Datasets, I've been working to instrument the scanner to find
> > bottlenecks. For example, here's a demo comparing the current async
> > scanner, which doesn't truly read asynchronously, to one that does; it
> > should be fairly evident where the bottleneck is:
> > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > 
> > I'd like to upstream this, but I'd like to run some questions by
> > everyone first:
> > - Does this look useful to developers working on other sub-projects?
> > - This uses OpenTelemetry[1], which is still in alpha, so are we
> >  comfortable with adopting it? Is the overhead acceptable?
> > - Is there anyone using Arrow to build services, that would find more
> >  general integration useful?
> > 
> > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > for operations like reading a single record batch. The data is saved as
> > JSON, then rendered by some JavaScript. The branch is at [2].
> > 
> > As a quick summary, OpenTelemetry implements distributed tracing, in
> > which a request is tracked as a directed acyclic graph of spans. A span
> > is just metadata (name, ID, start/end time, parent span, ...) about an
> > operation (function call, network request, ...). Typically, it's used in
> > services. Spans can reference each other across machines, so you can
> > track a request across multiple services (e.g. finding which service
> > failed/is unusually slow in a chain of services that call each other).
> > 
> > As opposed to a (sampling) profiler, this gives you application-level
> > metadata, like filenames or S3 download rates, that you can use in
> > analysis (as in the demo). It's also something you'd always keep turned
> > on (at least when running a service). If integrated with Flight,
> > OpenTelemetry would also give us a performance picture across multiple
> > machines - speculatively, something like making a request to a Flight
> > service and being able to trace all the requests it makes to S3.
> > 
> > It does have some overhead; you wouldn't annotate every function in a
> > codebase. This is rather anecdotal, but for the demo above, there was
> > essentially zero impact on runtime. Of course, that demo records very
> > little data overall, so it's not very representative.
> > 
> > Alternatives:
> > - Add a simple Span class of our own, and defer Flight until later.
> > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> >  enabled at build time. This would be messier but should alleviate any
> >  performance questions.
> > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> >  multi-machine use case, but would otherwise work. I haven't looked
> >  into these much, but could evaluate them, especially if they seem more
> >  fit for purpose for use in other Arrow subprojects.
> > 
> > If people aren't super enthused, I'll most likely go with adding a
> > custom Span class for Datasets, and defer the question of whether we
> > should integrate Flight/Datasets with OpenTelemetry until another use
> > case arises. But recently we have seen interest in this - so I see this
> > as perhaps a chance to take care of two problems at once.
> > 
> > Thanks,
> > David
> > 
> > [1]: https://opentelemetry.io/
> > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > [3]: https://perfetto.dev/
> > [4]: https://llvm.org/docs/XRay.html
> 
>

Re: [C++] Adopting a library for (distributed) tracing

Reply via email to