I'll have to do some more digging into that and get back to you. So
far I've been using a quick-and-dirty tool that I whipped up using
Vega-Lite but that's probably not something we want to maintain. I
tried the Chrome trace viewer ("Catapult") but it's not quite built
for this kind of trace; I hear Jaeger's trace viewer can be used
standalone but needs some setup.

Though that does raise a good point: we should eventually have
documentation on this knob and how to use it.

-David

On 2021/06/08 19:21:16, Weston Pace <weston.p...@gmail.com> wrote: 
> FWIW, I tried this out yesterday since I was profiling the execution
> of the async API reader.  It worked great so +1 from me on that basis.
> I did struggle finding a good simple visualization tool.  Do you have
> any good recommendations on that front?
> 
> On Mon, Jun 7, 2021 at 10:50 AM David Li <lidav...@apache.org> wrote:
> >
> > Just to give an update on where this stands:
> >
> > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > use it. This contains a few fixes I submitted for the platforms our
> > various CI jobs use, as well as an explicit build flag to support
> > header-only use - I think this should alleviate any concerns over it
> > adding to our build too much. I'm hopeful this means it can make it
> > into 5.0.0, at least with minimal functionality.
> >
> > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > have a chance to look through the PR and see if there are any places
> > where adding tracing may be useful.
> >
> > I also touched base with upstream about Python/C++ interop[2] - it
> > turns out upstream has thought about this before but doesn't have the
> > resources to pursue it at the moment, as the idea is to write an
> > API-compatible binding of the C++ library for Python (and presumably
> > R, Ruby, etc.) which is more work.
> >
> > Best,
> > David
> >
> > [1]: https://github.com/apache/arrow/pull/10260
> > [2]: https://github.com/open-telemetry/community/discussions/734
> >
> > On 2021/05/06 18:23:05, David Li <lidav...@apache.org> wrote:
> > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > >
> > > For dependencies: now we use OpenTelemetry as header-only by
> > > default. I also slimmed down the build, avoiding making the build wait
> > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > can be toggled via environment variable.
> > >
> > > For Python: the PR includes basic integration with Flight/Python. The
> > > C++ side will start a span, then propagate it to Python. Spans in
> > > Python will not propagate back to C++, and Python/C++ need to both set
> > > up their respective exporters. I plan to poke the upstream community
> > > about whether there's a good solution to this kind of issue.
> > >
> > > For ABI compatibility: this will be an issue until upstream reaches
> > > 1.0. Even currently, there's an unreleased change on their main branch
> > > which will break the current PR when it's released. Hopefully, they
> > > will reach 1.0 in the Arrow 5.0 release cycle; otherwise, we probably want
> > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > linking an application which itself links OpenTelemetry to Arrow
> > > works.
> > >
> > > As for the overhead: I measured the impact on a dataset scan recording
> > > ~900 spans per iteration and there was no discernible effect on
> > > runtime compared to an uninstrumented scan (though again, this is not
> > > that many spans).
> > >
> > > Best,
> > > David
> > >
> > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > [2]: https://github.com/apache/arrow/pull/10260
> > >
> > > On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> > > > Thanks everyone for all the comments. Responding to a few things:
> > > >
> > > > > > It seems to me it would be fairly implementation dependent -- so each
> > > > > > language implementation would choose if it made sense for them and then
> > > > > > implement the appropriate connection to that language's open telemetry
> > > > > > ecosystem.
> > > >
> > > > Agreed - I think the important thing is to agree on using OpenTelemetry 
> > > > itself so that the various Flight implementations, for instance, can 
> > > > all contribute compatible trace data. And there will be details like 
> > > > naming of keys for extra metadata we might want to attach, or trying to 
> > > > make (some) span names consistent.
> > > >
> > > > > My main question is: does integrating OpenTracing complicate our build
> > > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > > do you have to build it and link with it nonetheless?
> > > >
> > > > I need to look into this more and will follow up. I believe we can use 
> > > > it header-only. It's fairly simple to depend on (and has no required 
> > > > dependencies), but it is a synchronous build step (you must build it to 
> > > > have its headers available) - perhaps that could be resolved upstream 
> > > > or I am configuring CMake wrongly. Right now, I've linked in 
> > > > OpenTelemetry to provide a few utilities (e.g. logging data to stdout 
> > > > as JSON), but that could be split out into a libarrow_tracing.so if we 
> > > > keep them.
> > > >
> > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > applications that were compiled against another version of OpenTracing?
> > > >
> > > > Upstream already seems to be considering ABI compatibility. However, 
> > > > until they reach 1.0, of course they need not keep any promises, and 
> > > > that is a worry depending on their timeline. As pointed out already, 
> > > > they are moving quickly, but they are behind the other languages' 
> > > > OpenTelemetry implementations.
> > > >
> > > > > I'm not sure what the overhead is when disabled--I think it is 
> > > > > probably minimal or else it wouldn't be used so widely. But if we're 
> > > > > not ready to jump right in, we could introduce our own @WithSpan 
> > > > > annotation which by default is a no-op. To build an instrumented 
> > > > > Arrow lib, you'd hook it up with a shim.
> > > >
> > > > I am focusing on C++ here but of course the other languages come into 
> > > > play. A similar idea for C++ may be useful if we need to have 
> > > > OpenTelemetry be optional to avoid ABI worries. A branch may also work, 
> > > > but I'd like to avoid that if possible.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > I agree that OpenTelemetry is the future; I have been following the 
> > > > > observability space off and on and I knew about OpenTracing; I just 
> > > > > realized that OpenTelemetry is its successor. [1]
> > > > > I have found tracing to be a very powerful approach; at one point, I 
> > > > > did a POC of a trace recorder inside a Java webapp, which shed light 
> > > > > on some nasty bottlenecks. If integrated properly, it can be left on 
> > > > > all the time, so it's valuable for doing root-cause analysis in 
> > > > > production. At least in Java, there are already a lot of packages 
> > > > > with OpenTelemetry hooks built in. [2]
> > > > > I'm not sure what the overhead is when disabled--I think it is 
> > > > > probably minimal or else it wouldn't be used so widely. But if we're 
> > > > > not ready to jump right in, we could introduce our own @WithSpan 
> > > > > annotation which by default is a no-op. To build an instrumented 
> > > > > Arrow lib, you'd hook it up with a shim. Or you could just maintain a 
> > > > > branch with instrumentation for people to try it out.
> > > > >
> > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > [2] 
> > > > > https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > >
> > > > > > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > > > > Dear David,
> > > > > >
> > > > > > OpenTelemetry tracing is definitely the future, I guess the 
> > > > > > question is how far down the stack we want to put it.   I think it 
> > > > > > would be useful for flight and other higher level modules, and for 
> > > > > > DataFusion for example it would be really useful.
> > > > > > As for being alpha, I don’t think it will stay that way very long, 
> > > > > > there is a ton of industry momentum behind OpenTelemetry.
> > > > > >
> > > > > > -Evan
> > > > > >
> > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > > > > > > bottlenecks. For example, here's a demo comparing the current async
> > > > > > > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > > > > > > should be fairly evident where the bottleneck is:
> > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > >
> > > > > > > > I'd like to upstream this, but I'd like to run some questions by
> > > > > > > > everyone first:
> > > > > > > > - Does this look useful to developers working on other sub-projects?
> > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > > > > > >  comfortable with adopting it? Is the overhead acceptable?
> > > > > > > > - Is there anyone using Arrow to build services, that would find more
> > > > > > > >  general integration useful?
> > > > > > >
> > > > > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > > > > > for operations like reading a single record batch. The data is saved as
> > > > > > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > > > > >
> > > > > > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > > > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > > > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > > > > > operation (function call, network request, ...). Typically, it's used in
> > > > > > > > services. Spans can reference each other across machines, so you can
> > > > > > > > track a request across multiple services (e.g. finding which service
> > > > > > > > failed/is unusually slow in a chain of services that call each other).
> > > > > > >
> > > > > > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > > > > > metadata, like filenames or S3 download rates, that you can use in
> > > > > > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > > > > > on (at least when running a service). If integrated with Flight,
> > > > > > > > OpenTelemetry would also give us a performance picture across multiple
> > > > > > > > machines - speculatively, something like making a request to a Flight
> > > > > > > > service and being able to trace all the requests it makes to S3.
> > > > > > >
> > > > > > > > It does have some overhead; you wouldn't annotate every function in a
> > > > > > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > > > > > essentially zero impact on runtime. Of course, that demo records very
> > > > > > > > little data overall, so it's not very representative.
> > > > > > >
> > > > > > > > Alternatives:
> > > > > > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > > > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > > > > > >  enabled at build time. This would be messier but should alleviate any
> > > > > > > >  performance questions.
> > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > > > > > >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > > > > > >  multi-machine use case, but would otherwise work. I haven't looked
> > > > > > > >  into these much, but could evaluate them, especially if they seem more
> > > > > > > >  fit for purpose for use in other Arrow subprojects.
> > > > > > >
> > > > > > > > If people aren't super enthused, I'll most likely go with adding a
> > > > > > > > custom Span class for Datasets, and defer the question of whether we
> > > > > > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > > > > > case arises. But recently we have seen interest in this - so I see this
> > > > > > > > as perhaps a chance to take care of two problems at once.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > David
> > > > > > >
> > > > > > > [1]: https://opentelemetry.io/
> > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > [3]: https://perfetto.dev/
> > > > > > > [4]: https://llvm.org/docs/XRay.html
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> 
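
For readers new to the span model summarized in the original proposal at
the bottom of this thread (a span is metadata such as name, ID, start/end
time, and parent span, recorded per operation and saved as JSON), here is
a minimal sketch of the idea. It is NOT the OpenTelemetry API and not the
code in the PR: ToyTracer, Span, the field names, and the "enabled" flag
(standing in for the no-op default discussed above) are all hypothetical,
and the sketch assumes single-threaded use.

```python
import json
import time
import uuid
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Metadata about one operation (hypothetical field names)."""
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ns: int
    end_ns: Optional[int] = None
    attributes: dict = field(default_factory=dict)


class ToyTracer:
    """Hypothetical stand-in for a tracer; not the OpenTelemetry API."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.finished: List[Span] = []   # spans in the order they ended
        self._stack: List[Span] = []     # currently open spans (one thread only)

    @contextmanager
    def span(self, name: str, **attributes):
        if not self.enabled:
            # No-op mode: record nothing and hand back no span.
            yield None
            return
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, uuid.uuid4().hex[:16], parent_id,
                       time.monotonic_ns(), attributes=attributes)
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_ns = time.monotonic_ns()
            self._stack.pop()
            self.finished.append(current)

    def to_json(self) -> str:
        """Serialize finished spans as a JSON array."""
        return json.dumps([asdict(s) for s in self.finished])


# Usage: one parent span for a scan, one child span per record batch.
tracer = ToyTracer()
with tracer.span("scan", path="example.parquet") as root:
    for i in range(2):
        with tracer.span("read_batch", batch_index=i):
            pass  # the real work would happen here
```

Each child span carries its parent's ID, so a viewer can rebuild the tree
from the JSON alone. Real tracers propagate the current span through
thread-local or task-local context rather than a plain stack.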
