On Sat, Sep 10, 2016, at 20:04, Roberto Attias wrote:
> Hello, I have some comments/concerns regarding the HTrace API, and was
> wondering whether extensions/changes would be considered. I'm listing the
> most important here; if there is interest we can discuss more in detail.
Welcome, Roberto!

> 1) From the HTrace Developer Guide:
>
> > TraceScope objects manage the lifespan of Span objects. When a TraceScope
> > is created, it often comes with an associated Span object. When this
> > scope is closed, the Span will be closed as well. "Closing" the scope
> > means that the span is sent to a SpanReceiver for processing.
>
> One of the implications of this model is the fact that nested spans (for
> example, instrumenting nested function calls) will be delivered to the
> receiver in reverse order (as the innermost function completes before the
> outermost). This may introduce more complexity in the logic of the span
> receiver.

Hmm. While I would never say never, in the existing span receivers we
haven't found that delivering the spans in this order results in any extra
complexity. What you want is a span sink that aggregates all the spans
together and supports querying spans by various things like ID, time, etc.
This is typically a distributed database like HBase, Kudu, etc. There isn't
any performance or simplicity advantage to delivering spans in time order
to these databases (as far as I know, at least).

Of course, in a distributed system, just because node A sends out a span
before some other node B doesn't mean that node A's spans will arrive
before B's in the distributed database. And since multiple threads and
nodes will be sending spans to the database, the input to the database
will not be in strictly ascending time order anyway.

> Also, the fact that information about a span is not delivered until the
> span is closed relies on the program not terminating abruptly. In Java
> this is not so much of a problem, but in C, what happens if a series of
> nested function calls is instrumented with spans, and the innermost
> function crashes? As far as I can tell, none of the spans is delivered.
> This makes the use of the tracing API unreliable for bug analysis.
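To make the ordering concrete, here is a minimal self-contained sketch.
ToyScope and ToyReceiver are made-up names for illustration, not the real
HTrace classes; the point is just that a scope-based model hands the
innermost span to the receiver first:

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-ins for illustration only; not the real HTrace API.
class ToyReceiver {
    final List<String> delivered = new ArrayList<>();
    void receive(String spanDescription) { delivered.add(spanDescription); }
}

class ToyScope implements AutoCloseable {
    private final String description;
    private final ToyReceiver receiver;
    ToyScope(String description, ToyReceiver receiver) {
        this.description = description;
        this.receiver = receiver;
    }
    // The span only reaches the receiver when the scope closes.
    @Override public void close() { receiver.receive(description); }
}

public class NestedOrder {
    public static void main(String[] args) {
        ToyReceiver receiver = new ToyReceiver();
        try (ToyScope outer = new ToyScope("outer", receiver)) {
            try (ToyScope inner = new ToyScope("inner", receiver)) {
                // innermost work happens here
            }
        }
        // The innermost scope closes first, so "inner" is delivered
        // before "outer".
        System.out.println(receiver.delivered); // [inner, outer]
    }
}
```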
I definitely agree that it is frustrating when a program crashes with spans
which are buffered. This can happen in both Java and C, although our
out-of-the-box handling of shutdown hooks is better in Java. This problem
is difficult to avoid completely for a few different reasons:

1. As you commented, we don't output spans until they're complete, i.e.,
   closed.
2. Without buffering, we end up doing an RPC per span, which is too costly
   in real-world systems.

I would also add that one thing that is sometimes frustrating is how very
long-running spans don't show up in the GUI for a while.

> Would you consider a change where each API call produces at least one
> event sent to the SpanReceiver?

It would be interesting to think about giving users (or maybe
SpanReceivers?) the option of receiving the same span twice: once when it
was first opened, and once when it was completed. Or maybe having spans
which were uncompleted for a certain amount of time sent out, to better
avoid losing them in a crash. We'd have to think carefully about this to
avoid overwhelming users with configuration knobs. And we'd also have to
document that SpanReceivers would have to be able to handle receiving the
same span twice. Hopefully the consistency implications don't get too
tricky.

> 2) HTrace has a concept of spans having one or more parents. This
> allows, for example, capturing the fact that a process makes an RPC call
> to another. However, there is no information about when within the span
> the caller calls the callee. A caller span may have two child spans,
> representing the fact that it made two RPC calls, but the order in which
> those were made is lost in the model (using the timestamps associated
> with the beginning of the callee spans is not feasible, as there may be
> different RPC latencies, or simply the clocks may not be aligned). Also,
> the only relation captured by the API is between blocks.

In your example, is the caller span making the two RPCs in parallel?
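As a sketch of what "receiving the same span twice" could look like on the
receiver side: the classes below (SimpleSpan, DedupReceiver) are entirely
hypothetical names, not the HTrace API. The receiver keys spans by ID and
lets a completed delivery replace the earlier "open" delivery, so a crash
after the first delivery still leaves a partial record:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a receiver that tolerates seeing the same span
// twice -- once when it is opened (end == 0) and once when it completes.
class SimpleSpan {
    final long spanId;
    final long begin;
    final long end; // 0 means "still open"
    SimpleSpan(long spanId, long begin, long end) {
        this.spanId = spanId;
        this.begin = begin;
        this.end = end;
    }
}

public class DedupReceiver {
    private final Map<Long, SimpleSpan> byId = new HashMap<>();

    // A completed span replaces its earlier "open" version; an already
    // completed span is never overwritten by a stale duplicate.
    void receive(SimpleSpan span) {
        SimpleSpan existing = byId.get(span.spanId);
        if (existing == null || existing.end == 0) {
            byId.put(span.spanId, span);
        }
    }

    boolean isComplete(long spanId) {
        SimpleSpan s = byId.get(spanId);
        return s != null && s.end != 0;
    }

    public static void main(String[] args) {
        DedupReceiver r = new DedupReceiver();
        r.receive(new SimpleSpan(7, 100, 0));   // first delivery: opened
        System.out.println(r.isComplete(7));    // false
        r.receive(new SimpleSpan(7, 100, 250)); // second delivery: closed
        System.out.println(r.isComplete(7));    // true
    }
}
```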
If so, it might be appropriate to say that the spans don't have a
well-defined ordering. Certainly we don't have any guarantees about which
one will be processed first. Which one was initiated first doesn't seem
very interesting, unless I'm missing something.

> I propose a more general API with a concept of spans and points
> (timestamped sets of annotations), and cause-effect relationships among
> points. An RPC call can be represented as a point in the caller span
> marked as cause, and a (begin) point in the callee span marked as
> effect. This is very flexible and allows capturing all sorts of
> relationships, not just parent/child. For example, a DMA operation may
> be initiated in a block and captured as a point, and the completion
> captured as a point in a distinct block in the same entity (an
> abstraction for a unit of concurrency).

We've talked about tracking "points" in addition to "spans" before. This
mainly came up in the context of tracing "point" events like application
launches, MapReduce jobs being initiated, etc. The biggest objection is
that spans and points carry almost the same data (the main difference is
that points don't have an "end"), so creating a whole separate code
pathway and storage pathway might be overkill. We have to think about this
more.

It's interesting to think about adding some kind of "comes-after"
dependency to HTrace spans, besides the parent/child dependency. That has
kind of a vector clock flavor. I do wonder how often this is really a
problem in practice, though...

> 3) There doesn't seem to be any provision in the HTrace API for
> considering clock domains. In a distributed system, there may be
> processes running on the same host, processes running in the same
> cluster, and processes running in different clusters. Different domains
> may have different degrees of clock misalignment.
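For what it's worth, here is one hypothetical way to encode a "point" as a
zero-duration span with an optional "comes-after" link. PointEvent and its
fields are invented for this sketch and are not part of HTrace:

```java
import java.util.Optional;

// Entirely hypothetical encoding, not part of HTrace: a "point" event
// stored as a zero-duration span, optionally carrying a "comes-after"
// reference to the point that caused it.
class PointEvent {
    final long id;
    final long timestampMs;
    final Optional<Long> comesAfterId; // cause-effect link, if any

    PointEvent(long id, long timestampMs, Optional<Long> comesAfterId) {
        this.id = id;
        this.timestampMs = timestampMs;
        this.comesAfterId = comesAfterId;
    }

    // A point is just a span whose begin and end coincide.
    long begin() { return timestampMs; }
    long end() { return timestampMs; }
}

public class CauseEffect {
    public static void main(String[] args) {
        // DMA initiated in one block, captured as a point...
        PointEvent dmaStart = new PointEvent(1, 1000, Optional.empty());
        // ...completion captured in a different block, linked to its cause.
        PointEvent dmaDone = new PointEvent(2, 1400, Optional.of(dmaStart.id));
        System.out.println(dmaDone.comesAfterId.get()); // 1
    }
}
```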
> Providing indications of this information in the API allows the backend
> or UI trace building to make more accurate inferences about how
> concurrent entities line up.

Clock skew is a very difficult problem. Even determining how much clock
skew exists is a difficult problem, since all your messages from one node
to another will have some latency. There are estimation heuristics out
there, but it's complex. Even systems like AugmentedTime don't attempt to
precisely quantify clock skew, but just to keep it below some threshold
required for correctness.

In general, admins run NTP on their servers. YARN even requires this (or
so I'm told... there is a JIRA out there I could find). From a practical
point of view, I'm not sure what admins would do with clock skew data (but
perhaps there's something I haven't thought of here). One thing that might
be interesting is some kind of way of warning admins if the clocks are
seriously misaligned (indicating that NTP was down, or there was a clock
adjustment mishap, or something like that). Traditionally, that's the job
of the cluster management system, but it would be interesting if we could
surface that information in some way.

> 4) Does the API provide a mechanism for creating "delegated traces"?
> What I mean by this is that in some circumstances a thread may need to
> create traces on behalf of some other element which may not have that
> capability. For example, a mobile device may have some custom tracing
> mechanism, and attach the information to a request for the server. The
> server would then need to create the HTrace trace from the existing data
> passed in the request (including timestamps).

Sure. In this case, the server can just create a span from the JSON the
client sent using the MilliSpanDeserializer. If you don't want to use JSON
for some reason, you can construct an arbitrary span object using
MilliSpan#Builder.

> Let me know if there is interest in discussing changes at this level.
> Thanks,
> Roberto

Sure.
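To illustrate the "delegated trace" idea: the real entry points for this
are MilliSpanDeserializer (from JSON) and MilliSpan#Builder, as mentioned
above; the classes below are simplified stand-ins, not the HTrace API. The
key point is that the server builds the span from timestamps the client
measured, rather than timing the work itself:

```java
// Simplified stand-ins for illustration only; in HTrace you would use
// MilliSpanDeserializer or MilliSpan#Builder instead.
class DelegatedSpan {
    final String description;
    final long beginMs;
    final long endMs;
    DelegatedSpan(String description, long beginMs, long endMs) {
        this.description = description;
        this.beginMs = beginMs;
        this.endMs = endMs;
    }
    long durationMs() { return endMs - beginMs; }
}

public class DelegatedTrace {
    // The server constructs the span from client-supplied timestamps
    // carried in the request, on behalf of the mobile device.
    static DelegatedSpan fromClientRequest(String desc, long begin, long end) {
        return new DelegatedSpan(desc, begin, end);
    }

    public static void main(String[] args) {
        // Pretend these values arrived in the request payload.
        DelegatedSpan span = fromClientRequest("mobile render", 5000, 5120);
        System.out.println(span.durationMs()); // 120
    }
}
```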
I have to warn you that we have a strong bias towards compatible changes,
though. It is difficult to get all the downstream projects to change how
they use the API, even when there is a strong reason to change. Almost as
hard as getting Hadoop to do a new release :)

I'm curious whether you have a project you are thinking about
instrumenting with HTrace. We would love to hear more about how people are
using HTrace or plan to use it, so we can build what people want.

cheers,
Colin
