Following up here: I'm hoping we can enable this in 7.0.0 and am still working on getting all the builds passing (currently the RPM packages fail to build with it enabled). OpenTelemetry released v1.0.0 recently, so the earlier concern about depending on a pre-1.0 library should no longer be a problem.

Some changes in approach:

* For now, I've removed integration with Flight and any other components,
  focusing on just getting the builds working. I'll file follow-up issues for
  the Flight integration.
* Unlike before, I'll change this to be built only when enabled, instead of
  always. Flight will implicitly enable OpenTelemetry once integrated. (Thanks
  to @Kou for questioning this.)
* I'm now looking at using this for evaluating performance issues/bottlenecks
  in the C++ query engine, instead of/in addition to the original use case in
  Flight.

I'm curious if others have used OpenTelemetry or similar libraries for this
purpose before. I know tools like Perfetto [1] are similar in concept if not
approach, and @Weston was experimenting with it for this purpose as well
earlier in the thread.

[1]: https://perfetto.dev/

-David

On Mon, Jul 12, 2021, at 09:47, David Li wrote:
> A quick update on this, I don't think this will happen for 5.0; the upstream
> library still hasn't reached 1.0, and I don't want to cram this in at the end
> of a cycle, especially as each of their release candidates has needed an
> upstream fix in order to keep all our CI platforms working. Furthermore there
> appear to be some issues with their exporters (or in our usage of them) that
> I'd like to resolve. Finally, I'd like to have a more complete example here,
> especially one using an existing tool like Jaeger instead of an ad-hoc
> visualization.
>
> -David
>
> On Wed, Jun 9, 2021, at 13:01, David Li wrote:
> > I just updated the PR with support for exporting to Jaeger[1], which
> > has a built in trace viewer.
> >
> > 1. Download and run the all-in-one Jaeger binary locally[2] (or their
> >    Docker image)
> > 2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
> > 3. Run your application with `env ARROW_TRACING_BACKEND=jaeger`
> > 4. Visit http://localhost:16686 and search for "unknown_service".
> >
> > This gives you a variety of ways to drill into the captured data.
> > Let me know what you think if you get a chance.
> >
> > Now, while this is convenient, I'm not so sure about bundling it with
> > Arrow; as a library, we should be leaving all this config up to the
> > end-user application. But since this is all in C++, it can be
> > hard/annoying to configure in PyArrow and this is helpful for
> > development and debugging. At the very least, it's behind an optional
> > build flag so it won't ship by default.
> >
> > Also I see that Kibana (with proprietary xpack) and Grafana have trace
> > viewers now; OpenTelemetry doesn't include exporters for trace data to
> > those backends (only metrics/logs) but that could be another option.
> >
> > Best,
> > David
> >
> > [1]: https://www.jaegertracing.io/
> > [2]: https://www.jaegertracing.io/docs/1.22/getting-started/#all-in-one
> >
> > On 2021/06/08 19:30:06, David Li <lidav...@apache.org> wrote:
> > > I'll have to do some more digging into that and get back to you. So
> > > far I've been using a quick-and-dirty tool that I whipped up using
> > > Vega-Lite but that's probably not something we want to maintain. I
> > > tried the Chrome trace viewer ("Catapult") but it's not quite built
> > > for this kind of trace; I hear Jaeger's trace viewer can be used
> > > standalone but needs some setup.
> > >
> > > Though that does raise a good point: we should eventually have
> > > documentation on this knob and how to use it.
> > >
> > > -David
> > >
> > > On 2021/06/08 19:21:16, Weston Pace <weston.p...@gmail.com> wrote:
> > > > FWIW, I tried this out yesterday since I was profiling the execution
> > > > of the async API reader. It worked great so +1 from me on that basis.
> > > > I did struggle finding a good simple visualization tool. Do you have
> > > > any good recommendations on that front?
> > > >
> > > > On Mon, Jun 7, 2021 at 10:50 AM David Li <lidav...@apache.org> wrote:
> > > > >
> > > > > Just to give an update on where this stands:
> > > > >
> > > > > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > > > > use it. This contains a few fixes I submitted for the platforms our
> > > > > various CI jobs use, as well as an explicit build flag to support
> > > > > header-only use - I think this should alleviate any concerns over it
> > > > > adding to our build too much. I'm hopeful this means it can make it
> > > > > into 5.0.0, at least with minimal functionality.
> > > > >
> > > > > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > > > > have a chance to look through the PR and see if there are any places
> > > > > where adding tracing may be useful.
> > > > >
> > > > > I also touched base with upstream about Python/C++ interop[2] - it
> > > > > turns out upstream has thought about this before but doesn't have the
> > > > > resources to pursue it at the moment, as the idea is to write an
> > > > > API-compatible binding of the C++ library for Python (and presumably
> > > > > R, Ruby, etc.) which is more work.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/10260
> > > > > [2]: https://github.com/open-telemetry/community/discussions/734
> > > > >
> > > > > On 2021/05/06 18:23:05, David Li <lidav...@apache.org> wrote:
> > > > > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > > > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > > > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > > > > >
> > > > > > For dependencies: now we use OpenTelemetry as header-only by
> > > > > > default. I also slimmed down the build, avoiding making the build
> > > > > > wait on OpenTelemetry.
> > > > > > By setting a CMake flag, you can link Arrow against
> > > > > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter
> > > > > > that can be toggled via environment variable.
> > > > > >
> > > > > > For Python: the PR includes basic integration with Flight/Python.
> > > > > > The C++ side will start a span, then propagate it to Python. Spans
> > > > > > in Python will not propagate back to C++, and Python/C++ need to
> > > > > > both set up their respective exporters. I plan to poke the upstream
> > > > > > community about if there's a good solution to this kind of issue.
> > > > > >
> > > > > > For ABI compatibility: this will be an issue until upstream reaches
> > > > > > 1.0. Even currently, there's an unreleased change on their main
> > > > > > branch which will break the current PR when it's released.
> > > > > > Hopefully, they will reach 1.0 in the Arrow 5.0 release cycle, else,
> > > > > > we probably want to avoid shipping this until there is a 1.0. I have
> > > > > > confirmed that linking an application which itself links
> > > > > > OpenTelemetry to Arrow works.
> > > > > >
> > > > > > As for the overhead: I measured the impact on a dataset scan
> > > > > > recording ~900 spans per iteration and there was no discernible
> > > > > > effect on runtime compared to an uninstrumented scan (though again,
> > > > > > this is not that many spans).
> > > > > >
> > > > > > Best,
> > > > > > David
> > > > > >
> > > > > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > > > > [2]: https://github.com/apache/arrow/pull/10260
> > > > > >
> > > > > > On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> > > > > > > Thanks everyone for all the comments.
> > > > > > > Responding to a few things:
> > > > > > >
> > > > > > > > It seems to me it would be fairly implementation dependent --
> > > > > > > > so each language implementation would choose if it made sense
> > > > > > > > for them and then implement the appropriate connection to that
> > > > > > > > language's open telemetry ecosystem.
> > > > > > >
> > > > > > > Agreed - I think the important thing is to agree on using
> > > > > > > OpenTelemetry itself so that the various Flight implementations,
> > > > > > > for instance, can all contribute compatible trace data. And there
> > > > > > > will be details like naming of keys for extra metadata we might
> > > > > > > want to attach, or trying to make (some) span names consistent.
> > > > > > >
> > > > > > > > My main question is: does integrating OpenTracing complicate
> > > > > > > > our build procedure? Is it header-only as long as you use the
> > > > > > > > no-op tracer? Or do you have to build it and link with it
> > > > > > > > nonetheless?
> > > > > > >
> > > > > > > I need to look into this more and will follow up. I believe we
> > > > > > > can use it header-only. It's fairly simple to depend on (and has
> > > > > > > no required dependencies), but it is a synchronous build step
> > > > > > > (you must build it to have its headers available) - perhaps that
> > > > > > > could be resolved upstream or I am configuring CMake wrongly.
> > > > > > > Right now, I've linked in OpenTelemetry to provide a few
> > > > > > > utilities (e.g. logging data to stdout as JSON), but that could
> > > > > > > be split out into a libarrow_tracing.so if we keep them.
> > > > > > >
> > > > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > > > applications that were compiled against another version of
> > > > > > > > OpenTracing?
> > > > > > >
> > > > > > > Upstream already seems to be considering ABI compatibility.
> > > > > > > However, until they reach 1.0, of course they need not keep any
> > > > > > > promises, and that is a worry depending on their timeline. As
> > > > > > > pointed out already, they are moving quickly, but they are behind
> > > > > > > the other languages' OpenTelemetry implementations.
> > > > > > >
> > > > > > > > I'm not sure what the overhead is when disabled--I think it is
> > > > > > > > probably minimal or else it wouldn't be used so widely. But if
> > > > > > > > we're not ready to jump right in, we could introduce our own
> > > > > > > > @WithSpan annotation which by default is a no-op. To build an
> > > > > > > > instrumented Arrow lib, you'd hook it up with a shim.
> > > > > > >
> > > > > > > I am focusing on C++ here but of course the other languages come
> > > > > > > into play. A similar idea for C++ may be useful if we need to
> > > > > > > have OpenTelemetry be optional to avoid ABI worries. A branch may
> > > > > > > also work, but I'd like to avoid that if possible.
> > > > > > >
> > > > > > > Best,
> > > > > > > David
> > > > > > >
> > > > > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > > > > I agree that OpenTelemetry is the future; I have been following
> > > > > > > > the observability space off and on and I knew about
> > > > > > > > OpenTracing; I just realized that OpenTelemetry is its
> > > > > > > > successor. [1]
> > > > > > > > I have found tracing to be a very powerful approach; at one
> > > > > > > > point, I did a POC of a trace recorder inside a Java webapp,
> > > > > > > > which shed light on some nasty bottlenecks. If integrated
> > > > > > > > properly, it can be left on all the time, so it's valuable for
> > > > > > > > doing root-cause analysis in production. At least in Java,
> > > > > > > > there are already a lot of packages with OpenTelemetry hooks
> > > > > > > > built in.
> > > > > > > > [2]
> > > > > > > > I'm not sure what the overhead is when disabled--I think it is
> > > > > > > > probably minimal or else it wouldn't be used so widely. But if
> > > > > > > > we're not ready to jump right in, we could introduce our own
> > > > > > > > @WithSpan annotation which by default is a no-op. To build an
> > > > > > > > instrumented Arrow lib, you'd hook it up with a shim. Or you
> > > > > > > > could just maintain a branch with instrumentation for people to
> > > > > > > > try it out.
> > > > > > > >
> > > > > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > > > > [2] https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > > > > >
> > > > > > > > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > > > > > > > Dear David,
> > > > > > > > >
> > > > > > > > > OpenTelemetry tracing is definitely the future, I guess the
> > > > > > > > > question is how far down the stack we want to put it. I
> > > > > > > > > think it would be useful for Flight and other higher-level
> > > > > > > > > modules, and for DataFusion for example it would be really
> > > > > > > > > useful.
> > > > > > > > > As for being alpha, I don’t think it will stay that way very
> > > > > > > > > long, there is a ton of industry momentum behind
> > > > > > > > > OpenTelemetry.
> > > > > > > > >
> > > > > > > > > -Evan
> > > > > > > > >
> > > > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > For Arrow Datasets, I've been working to instrument the
> > > > > > > > > > scanner to find bottlenecks.
> > > > > > > > > > For example, here's a demo comparing the current async
> > > > > > > > > > scanner, which doesn't truly read asynchronously, to one
> > > > > > > > > > that does; it should be fairly evident where the
> > > > > > > > > > bottleneck is:
> > > > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > > > > >
> > > > > > > > > > I'd like to upstream this, but I'd like to run some
> > > > > > > > > > questions by everyone first:
> > > > > > > > > > - Does this look useful to developers working on other
> > > > > > > > > >   sub-projects?
> > > > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so
> > > > > > > > > >   are we comfortable with adopting it? Is the overhead
> > > > > > > > > >   acceptable?
> > > > > > > > > > - Is there anyone using Arrow to build services, that would
> > > > > > > > > >   find more general integration useful?
> > > > > > > > > >
> > > > > > > > > > How it works: OpenTelemetry[1] is used to annotate and
> > > > > > > > > > record a "span" for operations like reading a single
> > > > > > > > > > record batch. The data is saved as JSON, then rendered by
> > > > > > > > > > some JavaScript. The branch is at [2].
> > > > > > > > > >
> > > > > > > > > > As a quick summary, OpenTelemetry implements distributed
> > > > > > > > > > tracing, in which a request is tracked as a directed
> > > > > > > > > > acyclic graph of spans. A span is just metadata (name, ID,
> > > > > > > > > > start/end time, parent span, ...) about an operation
> > > > > > > > > > (function call, network request, ...). Typically, it's
> > > > > > > > > > used in services. Spans can reference each other across
> > > > > > > > > > machines, so you can track a request across multiple
> > > > > > > > > > services (e.g.
> > > > > > > > > > finding which service failed/is unusually slow in a chain
> > > > > > > > > > of services that call each other).
> > > > > > > > > >
> > > > > > > > > > As opposed to a (sampling) profiler, this gives you
> > > > > > > > > > application-level metadata, like filenames or S3 download
> > > > > > > > > > rates, that you can use in analysis (as in the demo). It's
> > > > > > > > > > also something you'd always keep turned on (at least when
> > > > > > > > > > running a service). If integrated with Flight,
> > > > > > > > > > OpenTelemetry would also give us a performance picture
> > > > > > > > > > across multiple machines - speculatively, something like
> > > > > > > > > > making a request to a Flight service and being able to
> > > > > > > > > > trace all the requests it makes to S3.
> > > > > > > > > >
> > > > > > > > > > It does have some overhead; you wouldn't annotate every
> > > > > > > > > > function in a codebase. This is rather anecdotal, but for
> > > > > > > > > > the demo above, there was essentially zero impact on
> > > > > > > > > > runtime. Of course, that demo records very little data
> > > > > > > > > > overall, so it's not very representative.
> > > > > > > > > >
> > > > > > > > > > Alternatives:
> > > > > > > > > > - Add a simple Span class of our own, and defer Flight
> > > > > > > > > >   until later.
> > > > > > > > > > - Integrate OpenTelemetry in such a way that it gets
> > > > > > > > > >   compiled out if not enabled at build time. This would be
> > > > > > > > > >   messier but should alleviate any performance questions.
> > > > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have
> > > > > > > > > >   their own caveats (e.g.
> > > > > > > > > >   XRay is LLVM-specific) and aren't intended for the
> > > > > > > > > >   multi-machine use case, but would otherwise work. I
> > > > > > > > > >   haven't looked into these much, but could evaluate them,
> > > > > > > > > >   especially if they seem more fit for purpose for use in
> > > > > > > > > >   other Arrow subprojects.
> > > > > > > > > >
> > > > > > > > > > If people aren't super enthused, I'll most likely go with
> > > > > > > > > > adding a custom Span class for Datasets, and defer the
> > > > > > > > > > question of whether we should integrate Flight/Datasets
> > > > > > > > > > with OpenTelemetry until another use case arises. But
> > > > > > > > > > recently we have seen interest in this - so I see this as
> > > > > > > > > > perhaps a chance to take care of two problems at once.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > [1]: https://opentelemetry.io/
> > > > > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > > > > [3]: https://perfetto.dev/
> > > > > > > > > > [4]: https://llvm.org/docs/XRay.html
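
For readers new to the span model described in the original message, here is
a minimal sketch in Python. It is not the OpenTelemetry API and not code from
the PR or the branch; `Span`, `Tracer`, and the `ARROW_SKETCH_TRACING`
variable are hypothetical names invented for illustration. It only shows the
idea of a span as metadata (name, ID, start/end time, parent span) recorded
around an operation and dumped as JSON, with a switch that turns recording
into a no-op. It assumes single-threaded use; a real tracer would track the
current span per thread or task.

```python
import contextlib
import json
import os
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Span:
    """Metadata about one operation (hypothetical; not OpenTelemetry's Span)."""
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ns: int
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)


class Tracer:
    """Records nested spans; does nothing when disabled."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self.finished: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextlib.contextmanager
    def span(self, name: str, **attributes) -> Iterator[Optional[Span]]:
        if not self.enabled:
            # No-op path: nothing is allocated or recorded.
            yield None
            return
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, uuid.uuid4().hex[:16], parent_id,
                       time.monotonic_ns(), attributes=attributes)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_ns = time.monotonic_ns()
            self._stack.pop()
            self.finished.append(current)

    def to_json(self) -> str:
        """Serialize all finished spans as a JSON array."""
        return json.dumps([asdict(s) for s in self.finished])


# Hypothetical toggle: set ARROW_SKETCH_TRACING=0 to disable recording.
tracer = Tracer(enabled=os.environ.get("ARROW_SKETCH_TRACING", "1") != "0")

with tracer.span("scan", path="example.parquet"):
    for i in range(2):
        with tracer.span("read_batch", batch_index=i):
            pass  # the instrumented operation would run here

trace_json = tracer.to_json()
```

Each `read_batch` span carries the `scan` span's ID as its `parent_id`, which
is what lets a viewer reconstruct the nesting from the flat JSON list.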