I agree that having arrow flight play well with open telemetry based observability systems would be great (and the future).
It seems to me it would be fairly implementation dependent -- so each language implementation would choose if it made sense for them and then implement the appropriate connection to that language's open telemetry ecosystem. On Fri, Apr 30, 2021 at 7:33 PM Evan Chan <[email protected]> wrote: > Dear David, > > OpenTelemetry tracing is definitely the future, I guess the question is > how far down the stack we want to put it. I think it would be useful for > flight and other higher level modules, and for DataFusion for example it > would be really useful. > As for being alpha, I don’t think it will stay that way very long, there > is a ton of industry momentum behind OpenTelemetry. > > -Evan > > > On Apr 29, 2021, at 1:21 PM, David Li <[email protected]> wrote: > > > > Hello, > > > > For Arrow Datasets, I've been working to instrument the scanner to find > > bottlenecks. For example, here's a demo comparing the current async > > scanner, which doesn't truly read asynchronously, to one that does; it > > should be fairly evident where the bottleneck is: > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html > > > > I'd like to upstream this, but I'd like to run some questions by > > everyone first: > > - Does this look useful to developers working on other sub-projects? > > - This uses OpenTelemetry[1], which is still in alpha, so are we > > comfortable with adopting it? Is the overhead acceptable? > > - Is there anyone using Arrow to build services, that would find more > > general integration useful? > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span" > > for operations like reading a single record batch. The data is saved as > > JSON, then rendered by some JavaScript. The branch is at [2]. > > > > As a quick summary, OpenTelemetry implements distributed tracing, in > > which a request is tracked as a directed acyclic graph of spans. A span > > is just metadata (name, ID, start/end time, parent span, ...) about an > > operation (function call, network request, ...). Typically, it's used in > > services. Spans can reference each other across machines, so you can > > track a request across multiple services (e.g. finding which service > > failed/is unusually slow in a chain of services that call each other). > > > > As opposed to a (sampling) profiler, this gives you application-level > > metadata, like filenames or S3 download rates, that you can use in > > analysis (as in the demo). It's also something you'd always keep turned > > on (at least when running a service). If integrated with Flight, > > OpenTelemetry would also give us a performance picture across multiple > > machines - speculatively, something like making a request to a Flight > > service and being able to trace all the requests it makes to S3. > > > > It does have some overhead; you wouldn't annotate every function in a > > codebase. This is rather anecdotal, but for the demo above, there was > > essentially zero impact on runtime. Of course, that demo records very > > little data overall, so it's not very representative. > > > > Alternatives: > > - Add a simple Span class of our own, and defer Flight until later. > > - Integrate OpenTelemetry in such a way that it gets compiled out if not > > enabled at build time. This would be messier but should alleviate any > > performance questions. > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own > > caveats (e.g. XRay is LLVM-specific) and aren't intended for the > > multi-machine use case, but would otherwise work. I haven't looked > > into these much, but could evaluate them, especially if they seem more > > fit for purpose for use in other Arrow subprojects. > > > > If people aren't super enthused, I'll most likely go with adding a > > custom Span class for Datasets, and defer the question of whether we > > should integrate Flight/Datasets with OpenTelemetry until another use > > case arises. But recently we have seen interest in this - so I see this > > as perhaps a chance to take care of two problems at once. > > > > Thanks, > > David > > > > [1]: https://opentelemetry.io/ > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry > > [3]: https://perfetto.dev/ > > [4]: https://llvm.org/docs/XRay.html > >
