I just updated the PR with support for exporting to Jaeger[1], which
has a built-in trace viewer.

1. Download and run the all-in-one Jaeger binary locally[2] (or their
   Docker image)
2. Build Arrow with `-DARROW_WITH_OPENTELEMETRY=ON -DARROW_THRIFT=ON`
3. Run your application with `env ARROW_TRACING_BACKEND=jaeger`
4. Visit http://localhost:16686 and search for "unknown_service".
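
For step 3, if the application is launched from a Python script, a
minimal sketch like the one below sets the variable for the child
process only. `run_with_tracing` is a hypothetical helper name, not part
of Arrow or the PR; only `ARROW_TRACING_BACKEND=jaeger` comes from the
steps above, and the child command here is a stand-in that just echoes
the variable.

```python
import os
import subprocess
import sys


def run_with_tracing(argv, backend="jaeger", **kwargs):
    """Run argv with ARROW_TRACING_BACKEND set for that process only.

    Hypothetical helper for illustration; extra keyword arguments are
    passed through to subprocess.run.
    """
    # Copy the current environment and add the tracing variable,
    # leaving the parent process's environment untouched.
    env = dict(os.environ, ARROW_TRACING_BACKEND=backend)
    return subprocess.run(argv, env=env, check=True, **kwargs)


if __name__ == "__main__":
    # Stand-in for a real Arrow application: a child interpreter that
    # prints the variable it was started with.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['ARROW_TRACING_BACKEND'])",
    ]
    result = run_with_tracing(child, capture_output=True, text=True)
    print(result.stdout.strip())
```

In practice the `child` command would be replaced with the real
application's command line.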

This gives you a variety of ways to drill into the captured data. Let
me know what you think if you get a chance.

Now, while this is convenient, I'm not so sure about bundling it with
Arrow; as a library, we should be leaving all this config up to the
end-user application. But since this is all in C++, it can be
hard/annoying to configure in PyArrow and this is helpful for
development and debugging. At the very least, it's behind an optional
build flag so it won't ship by default.

Also, I see that Kibana (with the proprietary X-Pack) and Grafana have
trace viewers now; OpenTelemetry doesn't include exporters for trace data
to those backends (only metrics/logs), but that could be another option.

Best,
David

[1]: https://www.jaegertracing.io/
[2]: https://www.jaegertracing.io/docs/1.22/getting-started/#all-in-one

On 2021/06/08 19:30:06, David Li <lidav...@apache.org> wrote: 
> I'll have to do some more digging into that and get back to you. So
> far I've been using a quick-and-dirty tool that I whipped up using
> Vega-Lite but that's probably not something we want to maintain. I
> tried the Chrome trace viewer ("Catapult") but it's not quite built
> for this kind of trace; I hear Jaeger's trace viewer can be used
> standalone but needs some setup.
> 
> Though that does raise a good point: we should eventually have
> documentation on this knob and how to use it.
> 
> -David
> 
> On 2021/06/08 19:21:16, Weston Pace <weston.p...@gmail.com> wrote: 
> > FWIW, I tried this out yesterday since I was profiling the execution
> > of the async API reader.  It worked great so +1 from me on that basis.
> > I did struggle finding a good simple visualization tool.  Do you have
> > any good recommendations on that front?
> > 
> > On Mon, Jun 7, 2021 at 10:50 AM David Li <lidav...@apache.org> wrote:
> > >
> > > Just to give an update on where this stands:
> > >
> > > Upstream recently released v1.0.0-RC1 and I've updated the PR[1] to
> > > use it. This contains a few fixes I submitted for the platforms our
> > > various CI jobs use, as well as an explicit build flag to support
> > > header-only use - I think this should alleviate any concerns over it
> > > adding to our build too much. I'm hopeful this means it can make it
> > > into 5.0.0, at least with minimal functionality.
> > >
> > > > For anyone interested in using OpenTelemetry with Arrow, I hope you'll
> > > > have a chance to look through the PR and see if there are any places
> > > > where adding tracing may be useful.
> > >
> > > I also touched base with upstream about Python/C++ interop[2] - it
> > > turns out upstream has thought about this before but doesn't have the
> > > resources to pursue it at the moment, as the idea is to write an
> > > API-compatible binding of the C++ library for Python (and presumably
> > > R, Ruby, etc.) which is more work.
> > >
> > > Best,
> > > David
> > >
> > > [1]: https://github.com/apache/arrow/pull/10260
> > > [2]: https://github.com/open-telemetry/community/discussions/734
> > >
> > > On 2021/05/06 18:23:05, David Li <lidav...@apache.org> wrote:
> > > > I've created ARROW-12671 [1] to track this work and filed a draft PR
> > > > [2]; I'd appreciate any feedback, particularly from anyone already
> > > > trying to use OpenTelemetry/Tracing/Census with Arrow.
> > > >
> > > > For dependencies: now we use OpenTelemetry as header-only by
> > > > default. I also slimmed down the build, avoiding making the build wait
> > > > on OpenTelemetry. By setting a CMake flag, you can link Arrow against
> > > > OpenTelemetry, which will bundle a simple JSON-to-stderr exporter that
> > > > can be toggled via environment variable.
> > > >
> > > > For Python: the PR includes basic integration with Flight/Python. The
> > > > C++ side will start a span, then propagate it to Python. Spans in
> > > > Python will not propagate back to C++, and Python/C++ need to both set
> > > > > up their respective exporters. I plan to ask the upstream community
> > > > > whether there's a good solution to this kind of issue.
> > > >
> > > > For ABI compatibility: this will be an issue until upstream reaches
> > > > 1.0. Even currently, there's an unreleased change on their main branch
> > > > which will break the current PR when it's released. Hopefully, they
> > > > will reach 1.0 in the Arrow 5.0 release cycle, else, we probably want
> > > > to avoid shipping this until there is a 1.0. I have confirmed that
> > > > linking an application which itself links OpenTelemetry to Arrow
> > > > works.
> > > >
> > > > As for the overhead: I measured the impact on a dataset scan recording
> > > > ~900 spans per iteration and there was no discernible effect on
> > > > runtime compared to an uninstrumented scan (though again, this is not
> > > > that many spans).
> > > >
> > > > Best,
> > > > David
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/ARROW-12671
> > > > [2]: https://github.com/apache/arrow/pull/10260
> > > >
> > > > On 2021/05/01 19:53:45, "David Li" <lidav...@apache.org> wrote:
> > > > > Thanks everyone for all the comments. Responding to a few things:
> > > > >
> > > > > > > It seems to me it would be fairly implementation dependent -- so each
> > > > > > > language implementation would choose if it made sense for them and then
> > > > > > > implement the appropriate connection to that language's open telemetry
> > > > > > > ecosystem.
> > > > >
> > > > > Agreed - I think the important thing is to agree on using 
> > > > > OpenTelemetry itself so that the various Flight implementations, for 
> > > > > instance, can all contribute compatible trace data. And there will be 
> > > > > details like naming of keys for extra metadata we might want to 
> > > > > attach, or trying to make (some) span names consistent.
> > > > >
> > > > > > > My main question is: does integrating OpenTracing complicate our build
> > > > > > > procedure?  Is it header-only as long as you use the no-op tracer?  Or
> > > > > > > do you have to build it and link with it nonetheless?
> > > > >
> > > > > I need to look into this more and will follow up. I believe we can 
> > > > > use it header-only. It's fairly simple to depend on (and has no 
> > > > > required dependencies), but it is a synchronous build step (you must 
> > > > > build it to have its headers available) - perhaps that could be 
> > > > > resolved upstream or I am configuring CMake wrongly. Right now, I've 
> > > > > linked in OpenTelemetry to provide a few utilities (e.g. logging data 
> > > > > to stdout as JSON), but that could be split out into a 
> > > > > libarrow_tracing.so if we keep them.
> > > > >
> > > > > > > Also, are there ABI issues that may complicate integration into
> > > > > > > applications that were compiled against another version of OpenTracing?
> > > > >
> > > > > Upstream already seems to be considering ABI compatibility. However, 
> > > > > until they reach 1.0, of course they need not keep any promises, and 
> > > > > that is a worry depending on their timeline. As pointed out already, 
> > > > > they are moving quickly, but they are behind the other languages' 
> > > > > OpenTelemetry implementations.
> > > > >
> > > > > > I'm not sure what the overhead is when disabled--I think it is 
> > > > > > probably minimal or else it wouldn't be used so widely. But if 
> > > > > > we're not ready to jump right in, we could introduce our own 
> > > > > > @WithSpan annotation which by default is a no-op. To build an 
> > > > > > instrumented Arrow lib, you'd hook it up with a shim.
> > > > >
> > > > > I am focusing on C++ here but of course the other languages come into 
> > > > > play. A similar idea for C++ may be useful if we need to have 
> > > > > OpenTelemetry be optional to avoid ABI worries. A branch may also 
> > > > > work, but I'd like to avoid that if possible.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On Sat, May 1, 2021, at 10:52, Bob Tinsman wrote:
> > > > > > I agree that OpenTelemetry is the future; I have been following the 
> > > > > > observability space off and on and I knew about OpenTracing; I just 
> > > > > > realized that OpenTelemetry is its successor. [1]
> > > > > > I have found tracing to be a very powerful approach; at one point, 
> > > > > > I did a POC of a trace recorder inside a Java webapp, which shed 
> > > > > > light on some nasty bottlenecks. If integrated properly, it can be 
> > > > > > left on all the time, so it's valuable for doing root-cause 
> > > > > > analysis in production. At least in Java, there are already a lot 
> > > > > > of packages with OpenTelemetry hooks built in. [2]
> > > > > > I'm not sure what the overhead is when disabled--I think it is 
> > > > > > probably minimal or else it wouldn't be used so widely. But if 
> > > > > > we're not ready to jump right in, we could introduce our own 
> > > > > > @WithSpan annotation which by default is a no-op. To build an 
> > > > > > instrumented Arrow lib, you'd hook it up with a shim. Or you could 
> > > > > > just maintain a branch with instrumentation for people to try it 
> > > > > > out.
> > > > > >
> > > > > > [1] https://lightstep.com/blog/brief-history-of-opentelemetry/
> > > > > > [2] 
> > > > > > https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/supported-libraries.md
> > > > > >
> > > > > > > On 2021/04/30 22:18:46, Evan Chan <e...@urbanlogiq.com> wrote:
> > > > > > > Dear David,
> > > > > > >
> > > > > > > OpenTelemetry tracing is definitely the future, I guess the 
> > > > > > > question is how far down the stack we want to put it.   I think 
> > > > > > > it would be useful for flight and other higher level modules, and 
> > > > > > > for DataFusion for example it would be really useful.
> > > > > > > As for being alpha, I don’t think it will stay that way very 
> > > > > > > long, there is a ton of industry momentum behind OpenTelemetry.
> > > > > > >
> > > > > > > -Evan
> > > > > > >
> > > > > > > > > On Apr 29, 2021, at 1:21 PM, David Li <lidav...@apache.org> wrote:
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > > For Arrow Datasets, I've been working to instrument the scanner to find
> > > > > > > > > bottlenecks. For example, here's a demo comparing the current async
> > > > > > > > > scanner, which doesn't truly read asynchronously, to one that does; it
> > > > > > > > > should be fairly evident where the bottleneck is:
> > > > > > > > > https://gistcdn.rawgit.org/lidavidm/b326f151fdecb2a5281b1a8be38ec1a6/a1e1a7516c5ce8f87a87ce196c6a726d1cdacf6f/index.html
> > > > > > > >
> > > > > > > > I'd like to upstream this, but I'd like to run some questions by
> > > > > > > > everyone first:
> > > > > > > > > - Does this look useful to developers working on other sub-projects?
> > > > > > > > > - This uses OpenTelemetry[1], which is still in alpha, so are we
> > > > > > > > >  comfortable with adopting it? Is the overhead acceptable?
> > > > > > > > > - Is there anyone using Arrow to build services, that would find more
> > > > > > > > >  general integration useful?
> > > > > > > >
> > > > > > > > > How it works: OpenTelemetry[1] is used to annotate and record a "span"
> > > > > > > > > for operations like reading a single record batch. The data is saved as
> > > > > > > > > JSON, then rendered by some JavaScript. The branch is at [2].
> > > > > > > >
> > > > > > > > > As a quick summary, OpenTelemetry implements distributed tracing, in
> > > > > > > > > which a request is tracked as a directed acyclic graph of spans. A span
> > > > > > > > > is just metadata (name, ID, start/end time, parent span, ...) about an
> > > > > > > > > operation (function call, network request, ...). Typically, it's used in
> > > > > > > > > services. Spans can reference each other across machines, so you can
> > > > > > > > > track a request across multiple services (e.g. finding which service
> > > > > > > > > failed/is unusually slow in a chain of services that call each other).
> > > > > > > >
> > > > > > > > > As opposed to a (sampling) profiler, this gives you application-level
> > > > > > > > > metadata, like filenames or S3 download rates, that you can use in
> > > > > > > > > analysis (as in the demo). It's also something you'd always keep turned
> > > > > > > > > on (at least when running a service). If integrated with Flight,
> > > > > > > > > OpenTelemetry would also give us a performance picture across multiple
> > > > > > > > > machines - speculatively, something like making a request to a Flight
> > > > > > > > > service and being able to trace all the requests it makes to S3.
> > > > > > > >
> > > > > > > > > It does have some overhead; you wouldn't annotate every function in a
> > > > > > > > > codebase. This is rather anecdotal, but for the demo above, there was
> > > > > > > > > essentially zero impact on runtime. Of course, that demo records very
> > > > > > > > > little data overall, so it's not very representative.
> > > > > > > >
> > > > > > > > Alternatives:
> > > > > > > > > - Add a simple Span class of our own, and defer Flight until later.
> > > > > > > > > - Integrate OpenTelemetry in such a way that it gets compiled out if not
> > > > > > > > >  enabled at build time. This would be messier but should alleviate any
> > > > > > > > >  performance questions.
> > > > > > > > > - Use something like Perfetto[3] or LLVM XRay[4]. They have their own
> > > > > > > > >  caveats (e.g. XRay is LLVM-specific) and aren't intended for the
> > > > > > > > >  multi-machine use case, but would otherwise work. I haven't looked
> > > > > > > > >  into these much, but could evaluate them, especially if they seem more
> > > > > > > > >  fit for purpose for use in other Arrow subprojects.
> > > > > > > >
> > > > > > > > > If people aren't super enthused, I'll most likely go with adding a
> > > > > > > > > custom Span class for Datasets, and defer the question of whether we
> > > > > > > > > should integrate Flight/Datasets with OpenTelemetry until another use
> > > > > > > > > case arises. But recently we have seen interest in this - so I see this
> > > > > > > > > as perhaps a chance to take care of two problems at once.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > David
> > > > > > > >
> > > > > > > > [1]: https://opentelemetry.io/
> > > > > > > > [2]: https://github.com/lidavidm/arrow/tree/arrow-opentelemetry
> > > > > > > > [3]: https://perfetto.dev/
> > > > > > > > [4]: https://llvm.org/docs/XRay.html
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > 
> 
