I just updated the PR with support for exporting to Jaeger[1], which has a built-in trace viewer.

1. Download and run the all-in-one Jaeger binary locally[2] (or their Docker image)
2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
3. Run your application with `env ARROW_TRACING_BACKEND=jaeger`
4. Visit http://localhost:16686 and search for "unknown_service".

This gives you a variety of ways to drill into the captured data. Let me know what you think if you get a chance.

Now, while this is convenient, I'm not so sure about bundling it with Arrow; as a library, we should be leaving all this config up to the end-user application. But since this is all in C++, it can be hard/annoying to configure in PyArrow and this is helpful for development and debugging. At the very least, it's behind an optional build flag so it won't ship by default.

Also I see that Kibana (with proprietary xpack) and Grafana have trace viewers now; OpenTelemetry doesn't include exporters for trace data to those backends (only metrics/logs) but that could be another option.

Best,
David

[1]: https://www.jaegertracing.io/
[2]: https://www.jaegertracing.io/docs/1.22/getting-started/#all-in-one

On 2021/06/08 19:30:06, David Li <[email protected]> wrote:
> I'll have to do some more digging into that and get back to you. So
> far I've been using a quick-and-dirty tool that I whipped up using
> Vega-Lite but that's probably not something we want to maintain. I
> tried the Chrome trace viewer ("Catapult") but it's not quite built
> for this kind of trace; I hear Jaeger's trace viewer can be used
> standalone but needs some setup.
>
> Though that does raise a good point: we should eventually have
> documentation on this knob and how to use it.
>
> -David
>
> On 2021/06/08 19:21:16, Weston Pace <[email protected]> wrote:
> > FWIW, I tried this out yesterday since I was profiling the execution
> > of the async API reader. It worked great so +1 from me on that basis.
> > I did struggle finding a good simple visualization tool. Do you have
> > any good recommendations on that front?
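
For anyone scripting step 3 of the Jaeger instructions at the top of this message, a minimal Python sketch follows. Only the `ARROW_TRACING_BACKEND=jaeger` variable comes from the message; the helper name `run_with_tracing` is hypothetical, and the child command here is a stand-in that just echoes the variable rather than a real Arrow application.

```python
import os
import subprocess
import sys


def run_with_tracing(argv, backend="jaeger"):
    """Run a command with ARROW_TRACING_BACKEND set in its environment.

    Hypothetical helper: it only prepares the environment and launches
    the child process; it does not talk to Arrow or Jaeger itself.
    """
    env = dict(os.environ, ARROW_TRACING_BACKEND=backend)
    return subprocess.run(
        argv, env=env, check=True, capture_output=True, text=True
    )


# Stand-in for "your application": a child Python process that prints
# the variable it received. Replace with the real command line.
child = [
    sys.executable,
    "-c",
    "import os; print(os.environ['ARROW_TRACING_BACKEND'])",
]
completed = run_with_tracing(child)
print(completed.stdout.strip())
```

Launching a child process keeps the setting scoped to that one run instead of changing the environment of the calling shell or interpreter.
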
> >
> > On Mon, Jun 7, 2021 at 10:50 AM David Li <[email protected]> wrote:
> > >
> > > Just to give an update on where this stands:
> > >
> > > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > > use it. This contains a few fixes I submitted for the platforms our
> > > various CI jobs use, as well as an explicit build flag to support
> > > header-only use - I think this should alleviate any concerns over it
> > > adding to our build too much. I'm hopeful this means it can make it
> > > into 5.0.0, at least with minimal functionality.
> > >
> > > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > > have a chance to look through the PR and see if there's any places
> > > where adding tracing may be useful.
> > >
> > > I also touched base with upstream about Python/C++ interop[2] - it
> > > turns out upstream has thought about this before but doesn't have the
> > > resources to pursue it at the moment, as the idea is to write an
> > > API-compatible binding of the C++ library for Python (and presumably
> > > R, Ruby, etc.) which is more work.
> > >
> > > Best,
> > > David
> > >
> > > [1]: https://github.com/apache/arrow/pull/10260
> > > [2]: https://github.com/open-telemetry/community/discussions/734
> > >
> > > On 2021/05/06 18:23:05, David Li <[email protected]> wrote:
> > > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > > >
> > > > For dependencies: now we use OpenTelemetry as header-only by
> > > > default. I also slimmed down the build, avoiding making the build wait
> > > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > > can be toggled via environment variable.
> > > >
> > > > For Python: the PR includes basic integration with Flight/Python.
> > > > The C++ side will start a span, then propagate it to Python. Spans in
> > > > Python will not propagate back to C++, and Python/C++ need to both set
> > > > up their respective exporters. I plan to poke the upstream community
> > > > about if there's a good solution to this kind of issue.
> > > >
> > > > For ABI compatibility: this will be an issue until upstream reaches
> > > > 1.0. Even currently, there's an unreleased change on their main branch
> > > > which will break the current PR when it's released. Hopefully, they
> > > > will reach 1.0 in the Arrow 5.0 release cycle, else, we probably want
> > > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > > linking an application which itself links OpenTelemetry to Arrow
> > > > works.
> > > >
> > > > As for the overhead: I measured the impact on a dataset scan recording
> > > > ~900 spans per iteration and there was no discernible effect on
> > > > runtime compared to an uninstrumented scan (though again, this is not
> > > > that many spans).
> > > >
> > > > Best,
> > > > David
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > > [2]: https://github.com/apache/arrow/pull/10260
> > > >
> > > > On 2021/05/01 19:53:45, "David Li" <[email protected]> wrote:
> > > > > Thanks everyone for all the comments. Responding to a few things:
> > > > >
> > > > > > It seems to me it would be fairly implementation dependent -- so
> > > > > > each language implementation would choose if it made sense for
> > > > > > them and then implement the appropriate connection to that
> > > > > > language's open telemetry ecosystem.
> > > > >
> > > > > Agreed - I think the important thing is to agree on using
> > > > > OpenTelemetry itself so that the various Flight implementations, for
> > > > > instance, can all contribute compatible trace data.
> > > > > And there will be details like naming of keys for extra metadata
> > > > > we might want to attach, or trying to make (some) span names
> > > > > consistent.
> > > > >
> > > > > > My main question is: does integrating OpenTracing complicate our
> > > > > > build procedure? Is it header-only as long as you use the no-op
> > > > > > tracer? Or do you have to build it and link with it nonetheless?
> > > > >
> > > > > I need to look into this more and will follow up. I believe we can
> > > > > use it header-only. It's fairly simple to depend on (and has no
> > > > > required dependencies), but it is a synchronous build step (you must
> > > > > build it to have its headers available) - perhaps that could be
> > > > > resolved upstream or I am configuring CMake wrongly. Right now, I've
> > > > > linked in OpenTelemetry to provide a few utilities (e.g. logging data
> > > > > to stdout as JSON), but that could be split out into a
> > > > > libarrow_tracing.so if we keep them.
> > > > >
> > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > applications that were compiled against another version of
> > > > > > OpenTracing?
> > > > >
> > > > > Upstream already seems to be considering ABI compatibility. However,
> > > > > until they reach 1.0, of course they need not keep any promises, and
> > > > > that is a worry depending on their timeline. As pointed out already,
> > > > > they are moving quickly, but they are behind the other languages'
> > > > > OpenTelemetry implementations.
> > > > >
> > > > > > I'm not sure what the overhead is when disabled--I think it is
> > > > > > probably minimal or else it wouldn't be used so widely. But if
> > > > > > we're not ready to jump right in, we could introduce our own
> > > > > > @WithSpan annotation which by default is a no-op. To build an
> > > > > > instrumented Arrow lib, you'd hook it up with a shim.
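
As an aside on the no-op `@WithSpan` idea quoted just above, here is a minimal Python sketch of a span decorator that does nothing until a shim is installed. It is an illustration only: `with_span`, `install_shim`, and `recording_span` are hypothetical names and are not part of Arrow or OpenTelemetry.

```python
import functools
from contextlib import contextmanager, nullcontext

# Hypothetical hook table. The default span factory returns a context
# manager that does nothing, so instrumented functions behave as if
# they were never annotated.
_hooks = {"span": lambda name: nullcontext()}


def install_shim(span_factory):
    """Swap in a real span factory (the "shim" from the quoted idea)."""
    _hooks["span"] = span_factory


def with_span(name):
    """Decorator that wraps each call in whatever span factory is installed."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Looked up at call time, so a shim installed later still applies.
            with _hooks["span"](name):
                return func(*args, **kwargs)
        return wrapper
    return decorate


@with_span("read_batch")
def read_batch(n):
    return list(range(n))


# A toy shim that records span start/end events in a list.
events = []


@contextmanager
def recording_span(name):
    events.append(("start", name))
    try:
        yield
    finally:
        events.append(("end", name))


result_noop = read_batch(3)      # default no-op: nothing is recorded
install_shim(recording_span)
result_traced = read_batch(3)    # shim installed: two events recorded
print(events)
```

The point of the design is that the annotated code never imports a tracing library directly; only the shim does, which is one way to keep the tracing dependency optional.
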
> > > > >
> > > > > I am focusing on C++ here but of course the other languages come into
> > > > > play. A similar idea for C++ may be useful if we need to have
> > > > > OpenTelemetry be optional to avoid ABI worries. A branch may also
> > > > > work, but I'd like to avoid that if possible.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > > I agree that OpenTelemetry is the future; I have been following the
> > > > > > observability space off and on and I knew about OpenTracing; I just
> > > > > > realized that OpenTelemetry is its successor. [1]
> > > > > > I have found tracing to be a very powerful approach; at one point,
> > > > > > I did a POC of a trace recorder inside a Java webapp, which shed
> > > > > > light on some nasty bottlenecks. If integrated properly, it can be
> > > > > > left on all the time, so it's valuable for doing root-cause
> > > > > > analysis in production. At least in Java, there are already a lot
> > > > > > of packages with OpenTelemetry hooks built in. [2]
> > > > > > I'm not sure what the overhead is when disabled--I think it is
> > > > > > probably minimal or else it wouldn't be used so widely. But if
> > > > > > we're not ready to jump right in, we could introduce our own
> > > > > > @WithSpan annotation which by default is a no-op. To build an
> > > > > > instrumented Arrow lib, you'd hook it up with a shim. Or you could
> > > > > > just maintain a branch with instrumentation for people to try it
> > > > > > out.
> > > > > >
> > > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > > [2] https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > > >
> > > > > > On 2021/04/30 22:18:46, Evan Chan <[email protected]
> > > > > > <mailto:evan%40urbanlogiq.com>> wrote:
> > > > > > > Dear David,
> > > > > > >
> > > > > > > OpenTelemetry tracing is definitely the future, I guess the
> > > > > > > question is how far down the stack we want to put it. I think
> > > > > > > it would be useful for flight and other higher level modules, and
> > > > > > > for DataFusion for example it would be really useful.
> > > > > > > As for being alpha, I don’t think it will stay that way very
> > > > > > > long, there is a ton of industry momentum behind OpenTelemetry.
> > > > > > >
> > > > > > > -Evan
> > > > > > >
> > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <[email protected]
> > > > > > > > <mailto:lidavidm%40apache.org>> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > For Arrow Datasets, I've been working to instrument the scanner to
> > > > > > > > find bottlenecks. For example, here's a demo comparing the current
> > > > > > > > async scanner, which doesn't truly read asynchronously, to one that
> > > > > > > > does; it should be fairly evident where the bottleneck is:
> > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > > >
> > > > > > > > I'd like to upstream this, but I'd like to run some questions by
> > > > > > > > everyone first:
> > > > > > > > - Does this look useful to developers working on other sub-projects?
> > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > > > > > >   comfortable with adopting it? Is the overhead acceptable?
> > > > > > > > - Is there anyone using Arrow to build services, that would find more
> > > > > > > >   general integration useful?
> > > > > > > >
> > > > > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > > > > > for operations like reading a single record batch. The data is saved as
> > > > > > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > > > > > >
> > > > > > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > > > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > > > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > > > > > operation (function call, network request, ...). Typically, it's used in
> > > > > > > > services. Spans can reference each other across machines, so you can
> > > > > > > > track a request across multiple services (e.g. finding which service
> > > > > > > > failed/is unusually slow in a chain of services that call each other).
> > > > > > > >
> > > > > > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > > > > > metadata, like filenames or S3 download rates, that you can use in
> > > > > > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > > > > > on (at least when running a service). If integrated with Flight,
> > > > > > > > OpenTelemetry would also give us a performance picture across multiple
> > > > > > > > machines - speculatively, something like making a request to a Flight
> > > > > > > > service and being able to trace all the requests it makes to S3.
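
To make the span description above concrete, here is a minimal Python sketch using only the standard library. It is not OpenTelemetry's API: `Span`, `Recorder`, and the field names are hypothetical, and the stack-based parent tracking only models a single-threaded tree of spans, which is a simple special case of the graph described above. It records the listed metadata (name, ID, start/end time, parent span) plus application-level attributes, then dumps the result as JSON.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """Metadata about one operation; field names are illustrative."""
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ns: int
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)


class Recorder:
    """Collects finished spans and tracks the currently active one."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(
            name=name,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent,
            start_ns=time.monotonic_ns(),
            attributes=attributes,
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_ns = time.monotonic_ns()
            self._stack.pop()
            self.finished.append(current)

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self.finished], indent=2)


recorder = Recorder()
# Placeholder file name and batch indices stand in for real scan metadata.
with recorder.span("scan", path="example.parquet") as scan:
    for i in range(2):
        with recorder.span("read_batch", batch_index=i):
            pass  # the real work (reading a record batch) would go here

print(recorder.to_json())
```

Each child span carries its parent's ID, which is what lets a viewer reassemble the hierarchy and attribute time to individual operations.
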
> > > > > > > >
> > > > > > > > It does have some overhead; you wouldn't annotate every function in a
> > > > > > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > > > > > essentially zero impact on runtime. Of course, that demo records very
> > > > > > > > little data overall, so it's not very representative.
> > > > > > > >
> > > > > > > > Alternatives:
> > > > > > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > > > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > > > > > >   enabled at build time. This would be messier but should alleviate any
> > > > > > > >   performance questions.
> > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > > > > > >   caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > > > > > >   multi-machine use case, but would otherwise work. I haven't looked
> > > > > > > >   into these much, but could evaluate them, especially if they seem more
> > > > > > > >   fit for purpose for use in other Arrow subprojects.
> > > > > > > >
> > > > > > > > If people aren't super enthused, I'll most likely go with adding a
> > > > > > > > custom Span class for Datasets, and defer the question of whether we
> > > > > > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > > > > > case arises. But recently we have seen interest in this - so I see this
> > > > > > > > as perhaps a chance to take care of two problems at once.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > David
> > > > > > > >
> > > > > > > > [1]: https://opentelemetry.io/
> > > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > > [3]: https://perfetto.dev/
> > > > > > > > [4]: https://llvm.org/docs/XRay.html
