On Sat, Sep 10, 2016, at 20:04, Roberto Attias wrote:
> Hello, I have some comments/concerns regarding the HTrace API, and was
> wondering whether extensions/changes would be considered. I'm listing the
> most important here; if there is interest we can discuss more in detail.
Welcome, Roberto!

> 1) From the HTrace Developer Guide:
>
> > TraceScope objects manage the lifespan of Span objects. When a TraceScope
> > is created, it often comes with an associated Span object. When this
> > scope is closed, the Span will be closed as well. "Closing" the scope
> > means that the span is sent to a SpanReceiver for processing.
>
> One of the implications of this model is the fact that nested spans (for
> example, instrumenting nested function calls) will be delivered to the
> receiver in reverse order (as the innermost function completes before the
> outermost). This may introduce more complexity in the logic of the span
> receiver.

Hmm. While I would never say never, in the existing span receivers we
haven't found that delivering the spans in this order results in any extra
complexity. What you want is a span sink that aggregates all the spans
together and supports querying spans by various things like ID, time, etc.
This is typically a distributed database like HBase, Kudu, etc. There isn't
any performance or simplicity advantage to delivering spans in time order
to these databases (as far as I know, at least).

Of course, in a distributed system, just because node A sends out a span
before some other node B doesn't mean that node A's spans will arrive
before B's in the distributed database. And since multiple threads and
nodes will be sending spans to the database, the input to the database
will not be in strictly ascending time order anyway.

> Also, the fact that information about a span is not delivered until the
> span is closed relies on the program not terminating abruptly. In Java
> this is not so much of a problem, but in C, what happens if a series of
> nested function calls is instrumented with spans, and the innermost
> function crashes? As far as I can tell, none of the spans is delivered.
> This makes the use of the tracing API unreliable for bug analysis.
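To make the ordering concrete, here is a minimal self-contained sketch.
ToyScope and ToyReceiver are made-up names for illustration, not the real
HTrace classes; the point is just that a scope-based model hands the
innermost span to the receiver first:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins for illustration only; not the real HTrace API.
class ToyReceiver {
    final List<String> delivered = new ArrayList<>();
    void receive(String spanDescription) { delivered.add(spanDescription); }
}

class ToyScope implements AutoCloseable {
    private final String description;
    private final ToyReceiver receiver;
    ToyScope(String description, ToyReceiver receiver) {
        this.description = description;
        this.receiver = receiver;
    }
    // The span only reaches the receiver when the scope closes.
    @Override public void close() { receiver.receive(description); }
}

public class NestedOrder {
    public static void main(String[] args) {
        ToyReceiver receiver = new ToyReceiver();
        try (ToyScope outer = new ToyScope("outer", receiver)) {
            try (ToyScope inner = new ToyScope("inner", receiver)) {
                // innermost work happens here
            }
        }
        // The innermost scope closes first, so "inner" is delivered
        // before "outer".
        System.out.println(receiver.delivered); // [inner, outer]
    }
}
```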
I definitely agree that it is frustrating when a program crashes with spans
which are buffered. This can happen in both Java and C, although our
out-of-the-box handling of shutdown hooks is better in Java. This problem
is difficult to avoid completely for a few different reasons:

1. As you commented, we don't output spans until they're complete, i.e.,
   closed.
2. Without buffering, we end up doing an RPC per span, which is too costly
   in real-world systems.

I would also add that one thing that is sometimes frustrating is how very
long-running spans don't show up in the GUI for a while.

> Would you consider a change where each API call produces at least one
> event sent to the SpanReceiver?

It would be interesting to think about giving users (or maybe
SpanReceivers?) the option of receiving the same span twice: once when it
was first opened, and once when it was completed. Or maybe having spans
which were uncompleted for a certain amount of time sent out, to better
avoid losing them in a crash. We'd have to think carefully about this to
avoid overwhelming users with configuration knobs. And we'd also have to
document that SpanReceivers would have to be able to handle receiving the
same span twice. Hopefully the consistency implications don't get too
tricky.

> 2) HTrace has a concept of spans having one or more parents. This
> allows, for example, capturing the fact that a process makes an RPC call
> to another. However, there is no information about when within the span
> the caller calls the callee. A caller span may have two child spans,
> representing the fact that it made two RPC calls, but the order in which
> those were made is lost in the model (using the timestamps associated
> with the beginning of the callee spans is not feasible, as there may be
> different RPC latencies, or simply the clocks may not be aligned). Also,
> the only relation captured by the API is between blocks.

In your example, is the caller span making the two RPCs in parallel?
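As a sketch of what "receiving the same span twice" could look like on the
receiver side: the classes below (SimpleSpan, DedupReceiver) are entirely
hypothetical names, not the HTrace API. The receiver keys spans by ID and
lets a completed delivery replace the earlier "open" delivery, so a crash
after the first delivery still leaves a partial record:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a receiver that tolerates seeing the same span
// twice -- once when it is opened (end == 0) and once when it completes.
class SimpleSpan {
    final long spanId;
    final long begin;
    final long end; // 0 means "still open"
    SimpleSpan(long spanId, long begin, long end) {
        this.spanId = spanId;
        this.begin = begin;
        this.end = end;
    }
}

public class DedupReceiver {
    private final Map<Long, SimpleSpan> byId = new HashMap<>();

    // A completed span replaces its earlier "open" version; an already
    // completed span is never overwritten by a stale duplicate.
    void receive(SimpleSpan span) {
        SimpleSpan existing = byId.get(span.spanId);
        if (existing == null || existing.end == 0) {
            byId.put(span.spanId, span);
        }
    }

    boolean isComplete(long spanId) {
        SimpleSpan s = byId.get(spanId);
        return s != null && s.end != 0;
    }

    public static void main(String[] args) {
        DedupReceiver r = new DedupReceiver();
        r.receive(new SimpleSpan(7, 100, 0));   // first delivery: opened
        System.out.println(r.isComplete(7));    // false
        r.receive(new SimpleSpan(7, 100, 250)); // second delivery: closed
        System.out.println(r.isComplete(7));    // true
    }
}
```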
If so, it might be appropriate to say that the spans don't have a
well-defined ordering. Certainly we don't have any guarantees about which
one will be processed first. Which one was initiated first doesn't seem
very interesting, unless I'm missing something.

> I propose a more general API with a concept of spans and points
> (timestamped sets of annotations), and cause-effect relationships among
> points. An RPC call can be represented as a point in the caller span
> marked as cause, and a (begin) point in the callee span marked as
> effect. This is very flexible and allows capturing all sorts of
> relationships, not just parent/child. For example, a DMA operation may
> be initiated in a block and captured as a point, and the completion
> captured as a point in a distinct block in the same entity (an
> abstraction for a unit of concurrency).

We've talked about tracking "points" in addition to "spans" before. This
mainly came up in the context of tracing "point" events like application
launches, MapReduce jobs being initiated, etc. The biggest objection is
that spans and points carry almost the same data (the main difference is
that points don't have an "end"), so creating a whole separate code
pathway and storage pathway might be overkill. We have to think about this
more.

It's interesting to think about adding some kind of "comes-after"
dependency to HTrace spans, besides the parent/child dependency. That has
kind of a vector clock flavor. I do wonder how often this is really a
problem in practice, though...

> 3) There doesn't seem to be any provision in the HTrace API for
> considering clock domains. In a distributed system, there may be
> processes running on the same host, processes running in the same
> cluster, and processes running in different clusters. Different domains
> may have different degrees of clock misalignment.
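For what it's worth, here is one hypothetical way to encode a "point" as a
zero-duration span with an optional "comes-after" link. PointEvent and its
fields are invented for this sketch and are not part of HTrace:

```java
import java.util.Optional;

// Entirely hypothetical encoding, not part of HTrace: a "point" event
// stored as a zero-duration span, optionally carrying a "comes-after"
// reference to the point that caused it.
class PointEvent {
    final long id;
    final long timestampMs;
    final Optional<Long> comesAfterId; // cause-effect link, if any

    PointEvent(long id, long timestampMs, Optional<Long> comesAfterId) {
        this.id = id;
        this.timestampMs = timestampMs;
        this.comesAfterId = comesAfterId;
    }

    // A point is just a span whose begin and end coincide.
    long begin() { return timestampMs; }
    long end() { return timestampMs; }
}

public class CauseEffect {
    public static void main(String[] args) {
        // DMA initiated in one block, captured as a point...
        PointEvent dmaStart = new PointEvent(1, 1000, Optional.empty());
        // ...completion captured in a different block, linked to its cause.
        PointEvent dmaDone = new PointEvent(2, 1400, Optional.of(dmaStart.id));
        System.out.println(dmaDone.comesAfterId.get()); // 1
    }
}
```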
> Providing indications of this information in the API allows the backend
> or UI trace building to make more accurate inferences about how
> concurrent entities line up.

Clock skew is a very difficult problem. Even determining how much clock
skew exists is a difficult problem, since all your messages from one node
to another will have some latency. There are estimation heuristics out
there, but it's complex. Even systems like AugmentedTime don't attempt to
precisely quantify clock skew, but just to keep it below some threshold
required for correctness.

In general, admins run NTP on their servers. YARN even requires this (or
so I'm told... there is a JIRA out there I could find). From a practical
point of view, I'm not sure what admins would do with clock skew data (but
perhaps there's something I haven't thought of here). One thing that might
be interesting is some kind of way of warning admins if the clocks are
seriously misaligned (indicating that NTP was down, or there was a clock
adjustment mishap, or something like that). Traditionally, that's the job
of the cluster management system, but it would be interesting if we could
surface that information in some way.

> 4) Does the API provide a mechanism for creating "delegated traces"?
> What I mean by this is that in some circumstances a thread may need to
> create traces on behalf of some other element which may not have that
> capability. For example, a mobile device may have some custom tracing
> mechanism, and attach the information to a request for the server. The
> server would then need to create the HTrace trace from the existing data
> passed in the request (including timestamps).

Sure. In this case, the server can just create a span from the JSON the
client sent using the MilliSpanDeserializer. If you don't want to use JSON
for some reason, you can construct an arbitrary span object using
MilliSpan#Builder.

> Let me know if there is interest in discussing changes at this level.
> Thanks,
> Roberto

Sure.
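To illustrate the "delegated trace" idea: the real entry points for this
are MilliSpanDeserializer (from JSON) and MilliSpan#Builder, as mentioned
above; the classes below are simplified stand-ins, not the HTrace API. The
key point is that the server builds the span from timestamps the client
measured, rather than timing the work itself:

```java
// Simplified stand-ins for illustration only; in HTrace you would use
// MilliSpanDeserializer or MilliSpan#Builder instead.
class DelegatedSpan {
    final String description;
    final long beginMs;
    final long endMs;
    DelegatedSpan(String description, long beginMs, long endMs) {
        this.description = description;
        this.beginMs = beginMs;
        this.endMs = endMs;
    }
    long durationMs() { return endMs - beginMs; }
}

public class DelegatedTrace {
    // The server constructs the span from client-supplied timestamps
    // carried in the request, on behalf of the mobile device.
    static DelegatedSpan fromClientRequest(String desc, long begin, long end) {
        return new DelegatedSpan(desc, begin, end);
    }

    public static void main(String[] args) {
        // Pretend these values arrived in the request payload.
        DelegatedSpan span = fromClientRequest("mobile render", 5000, 5120);
        System.out.println(span.durationMs()); // 120
    }
}
```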
I have to warn you that we have a strong bias towards compatible changes,
though. It is difficult to get all the downstream projects to change how
they use the API, even when there is a strong reason to change. Almost as
hard as getting Hadoop to do a new release :)

I'm curious whether you have a project you are thinking about
instrumenting with HTrace. We would love to hear more about how people are
using HTrace or plan to use it, so we can build what people want.

cheers,
Colin
