Thanks everyone for all the comments. Responding to a few things:

> It seems to me it would be fairly implementation dependent -- so each
> language implementation would choose if it made sense for them and then
> implement the appropriate connection to that language's open telemetry
> ecosystem.

Agreed - I think the important thing is to agree on using OpenTelemetry
itself so that the various Flight implementations, for instance, can all
contribute compatible trace data. And there will be details like naming of
keys for extra metadata we might want to attach, or trying to make (some)
span names consistent.

> My main question is: does integrating OpenTracing complicate our build
> procedure? Is it header-only as long as you use the no-op tracer? Or
> do you have to build it and link with it nonetheless?

I need to look into this more and will follow up. I believe we can use it
header-only. It's fairly simple to depend on (and has no required
dependencies), but it is a synchronous build step (you must build it to have
its headers available) - perhaps that could be resolved upstream or I am
configuring CMake wrongly. Right now, I've linked in OpenTelemetry to provide
a few utilities (e.g. logging data to stdout as JSON), but that could be
split out into a libarrow_tracing.so if we keep them.

> Also, are there ABI issues that may complicate integration into
> applications that were compiled against another version of OpenTracing?

Upstream already seems to be considering ABI compatibility. However, until
they reach 1.0, of course they need not keep any promises, and that is a
worry depending on their timeline. As pointed out already, they are moving
quickly, but they are behind the other languages' OpenTelemetry
implementations.

> I'm not sure what the overhead is when disabled--I think it is probably
> minimal or else it wouldn't be used so widely. But if we're not ready to jump
> right in, we could introduce our own @WithSpan annotation which by default is
> a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim.

I am focusing on C++ here but of course the other languages come into play.
A similar idea for C++ may be useful if we need to have OpenTelemetry be
optional to avoid ABI worries.
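
For what it's worth, a C++ version of that shim could be quite small. Below is
a minimal sketch, not a proposal for the actual API: ARROW_WITH_SPAN,
ARROW_TRACING_ENABLED, ScopedSpan, FinishedSpans and ReadBatch are all
hypothetical names, and the enabled branch only appends to an in-memory vector
as a stand-in for a real OpenTelemetry tracer. When the flag is not defined,
the span type is empty, so call sites need no OpenTelemetry headers and should
optimize away.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

namespace tracing_sketch {

// One finished span: a name plus how long the enclosing scope took.
struct SpanRecord {
  std::string name;
  std::chrono::nanoseconds duration;
};

// Stand-in for a real exporter: finished spans are appended here.
// Not thread-safe; a real tracer would need synchronization.
inline std::vector<SpanRecord>& FinishedSpans() {
  static std::vector<SpanRecord> spans;
  return spans;
}

#ifdef ARROW_TRACING_ENABLED  // hypothetical build flag

// RAII span: starts timing on construction, records itself on destruction.
class ScopedSpan {
 public:
  explicit ScopedSpan(std::string name)
      : name_(std::move(name)), start_(std::chrono::steady_clock::now()) {}
  ~ScopedSpan() {
    auto elapsed = std::chrono::steady_clock::now() - start_;
    FinishedSpans().push_back(
        {std::move(name_),
         std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)});
  }
  ScopedSpan(const ScopedSpan&) = delete;
  ScopedSpan& operator=(const ScopedSpan&) = delete;

 private:
  std::string name_;
  std::chrono::steady_clock::time_point start_;
};

#else

// Disabled build: an empty type with nothing to record.
class ScopedSpan {
 public:
  explicit ScopedSpan(const char*) {}
};

#endif

}  // namespace tracing_sketch

// Hypothetical annotation macro; __LINE__ keeps the variable name unique
// when several spans are opened in one function.
#define ARROW_SPAN_CONCAT_INNER(a, b) a##b
#define ARROW_SPAN_CONCAT(a, b) ARROW_SPAN_CONCAT_INNER(a, b)
#define ARROW_WITH_SPAN(name) \
  ::tracing_sketch::ScopedSpan ARROW_SPAN_CONCAT(span_, __LINE__)(name)

// Example call site; the body is a placeholder for real work.
inline int ReadBatch(int num_rows) {
  assert(num_rows >= 0);
  ARROW_WITH_SPAN("ReadBatch");
  return num_rows * 2;
}
```

Building with -DARROW_TRACING_ENABLED switches every ARROW_WITH_SPAN call site
to the recording version without touching the call sites themselves; a real
integration would replace the vector with calls into the tracer.
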
A branch may also work, but I'd like to avoid that if possible.

Best,
David

On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> I agree that OpenTelemetry is the future; I have been following the
> observability space off and on and I knew about OpenTracing; I just realized
> that OpenTelemetry is its successor. [1]
> I have found tracing to be a very powerful approach; at one point, I did a
> POC of a trace recorder inside a Java webapp, which shed light on some nasty
> bottlenecks. If integrated properly, it can be left on all the time, so it's
> valuable for doing root-cause analysis in production. At least in Java, there
> are already a lot of packages with OpenTelemetry hooks built in. [2]
> I'm not sure what the overhead is when disabled--I think it is probably
> minimal or else it wouldn't be used so widely. But if we're not ready to jump
> right in, we could introduce our own @WithSpan annotation which by default is
> a no-op. To build an instrumented Arrow lib, you'd hook it up with a shim. Or
> you could just maintain a branch with instrumentation for people to try it
> out.
>
> [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> [2]
> https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
>
> On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > Dear David,
> >
> > OpenTelemetry tracing is definitely the future, I guess the question is how
> > far down the stack we want to put it. I think it would be useful for
> > flight and other higher level modules, and for DataFusion for example it
> > would be really useful.
> > As for being alpha, I don’t think it will stay that way very long, there is
> > a ton of industry momentum behind OpenTelemetry.
> >
> > -Evan
> >
> > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > >
> > > Hello,
> > >
> > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > bottlenecks. For example, here's a demo comparing the current async
> > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > should be fairly evident where the bottleneck is:
> > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > >
> > > I'd like to upstream this, but I'd like to run some questions by
> > > everyone first:
> > > - Does this look useful to developers working on other sub-projects?
> > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > >   comfortable with adopting it? Is the overhead acceptable?
> > > - Is there anyone using Arrow to build services, that would find more
> > >   general integration useful?
> > >
> > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > for operations like reading a single record batch. The data is saved as
> > > JSON, then rendered by some JavaScript. The branch is at [2].
> > >
> > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > which a request is tracked as a directed acyclic graph of spans. A span
> > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > operation (function call, network request, ...). Typically, it's used in
> > > services. Spans can reference each other across machines, so you can
> > > track a request across multiple services (e.g. finding which service
> > > failed/is unusually slow in a chain of services that call each other).
> > >
> > > As opposed to a (sampling) profiler, this gives you application-level
> > > metadata, like filenames or S3 download rates, that you can use in
> > > analysis (as in the demo).
> > > It's also something you'd always keep turned on (at least when running
> > > a service). If integrated with Flight, OpenTelemetry would also give us
> > > a performance picture across multiple machines - speculatively,
> > > something like making a request to a Flight service and being able to
> > > trace all the requests it makes to S3.
> > >
> > > It does have some overhead; you wouldn't annotate every function in a
> > > codebase. This is rather anecdotal, but for the demo above, there was
> > > essentially zero impact on runtime. Of course, that demo records very
> > > little data overall, so it's not very representative.
> > >
> > > Alternatives:
> > > - Add a simple Span class of our own, and defer Flight until later.
> > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > >   enabled at build time. This would be messier but should alleviate any
> > >   performance questions.
> > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > >   caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > >   multi-machine use case, but would otherwise work. I haven't looked
> > >   into these much, but could evaluate them, especially if they seem more
> > >   fit for purpose for use in other Arrow subprojects.
> > >
> > > If people aren't super enthused, I'll most likely go with adding a
> > > custom Span class for Datasets, and defer the question of whether we
> > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > case arises. But recently we have seen interest in this - so I see this
> > > as perhaps a chance to take care of two problems at once.
> > >
> > > Thanks,
> > > David
> > >
> > > [1]: https://opentelemetry.io/
> > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > [3]: https://perfetto.dev/
> > > [4]: https://llvm.org/docs/XRay.html
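
To make the span model from the original message concrete (name, ID,
start/end time, parent span), here is a minimal C++ sketch of recording nested
spans in a single thread. It is an illustration under simplifying assumptions,
not how OpenTelemetry is implemented: Span, Recorder and Depth are
hypothetical names, each span has at most one parent (so the graph is a tree,
a special case of the DAG), and nothing crosses process or machine boundaries.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

namespace span_sketch {

using Clock = std::chrono::steady_clock;

// The metadata listed in the message: name, ID, start/end time, parent span.
struct Span {
  std::uint64_t id = 0;
  std::uint64_t parent_id = 0;  // 0 means "no parent" (a root span)
  std::string name;
  Clock::time_point start;
  Clock::time_point end;
};

// Hypothetical single-threaded recorder that keeps every span in memory.
class Recorder {
 public:
  // Starts a span whose parent is the innermost span that is still open.
  std::uint64_t Start(std::string name) {
    Span span;
    span.id = next_id_++;
    span.parent_id = open_.empty() ? 0 : open_.back();
    span.name = std::move(name);
    span.start = Clock::now();
    spans_.push_back(std::move(span));
    open_.push_back(spans_.back().id);
    return spans_.back().id;
  }

  // Ends the innermost open span; spans must be closed in LIFO order.
  void End(std::uint64_t id) {
    assert(!open_.empty() && open_.back() == id);
    open_.pop_back();
    spans_[id - 1].end = Clock::now();  // ids are 1-based indices into spans_
  }

  const std::vector<Span>& spans() const { return spans_; }

 private:
  std::uint64_t next_id_ = 1;
  std::vector<std::uint64_t> open_;  // stack of currently open span ids
  std::vector<Span> spans_;
};

// Number of ancestors of a span, found by following parent links to a root.
inline int Depth(const Recorder& recorder, std::uint64_t id) {
  int depth = 0;
  for (std::uint64_t p = recorder.spans()[id - 1].parent_id; p != 0;
       p = recorder.spans()[p - 1].parent_id) {
    ++depth;
  }
  return depth;
}

}  // namespace span_sketch
```

Starting a "Scan" span and then two "ReadBatch" spans inside it yields three
records where both reads point at the scan as their parent; together with the
timestamps, that is enough to draw a nested timeline.
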