Hi Colin, see inline.

From: Colin McCabe <[email protected]>
To: [email protected]; Roberto Attias <[email protected]>
Cc: John D. Ament <[email protected]>; Jake Farrell <[email protected]>; Ted Dunning <[email protected]>
Sent: Sunday, September 11, 2016 10:03 PM
Subject: Re: HTrace API comments

On Sat, Sep 10, 2016, at 20:04, Roberto Attias wrote:
> Hello, I have some comments/concerns regarding the HTrace API, and was
> wondering whether extensions/changes would be considered. I'm listing the
> most important here; if there is interest we can discuss more in detail.

Welcome, Roberto!

> 
> 1) From the HTrace Developer Guide: 
> 
> 
> 
> TraceScope objects manage the lifespan of Span objects. When a TraceScope
> is created, it often comes with an associated Span object. When this
> scope is closed, the Span will be closed as well. “Closing” the scope
> means that the span is sent to a SpanReceiver for processing.
> 
> 
> One of the implications of this model is the fact that nested spans (for
> example, instrumenting nested function calls) will be delivered to the
> receiver in reverse order (as the innermost function completes before the
> outermost). This may introduce more complexity in the logic of the span
> receiver.

Hmm.  While I would never say never, in the existing span receivers, we
haven't found that delivering the spans in this order results in any
extra complexity.  What you want is a span sink that aggregates all the
spans together, and supports querying spans by various things like ID,
time, etc.  This is typically a distributed database like HBase, Kudu,
etc.  There isn't any performance or simplicity advantage to delivering
spans in time order to these databases (as far as I know, at least).

The advantage is not on the storage front, but rather on the consumer side.
For example, consider a hypothetical messaging application. A sender client
may send a message to a server, the server storing the message until a receiver
client logs in to consume pending messages. Say one span captures the function
that sends the message, another captures the time between when the message
is received by the server and consumed from it, and a third captures the function
that receives the message on the receiver client side. This may be a long-lasting
(days) interaction, but a consumer will not be able to access any of the
intermediate information until the whole transaction is completed. Similarly, you
mention a UI visualization issue later.

Of course, in a distributed system, just because node A sends out a span before
some other node B doesn't mean that node A's spans will arrive
before B's in the distributed database.  And also, multiple threads and
nodes will be sending spans to the database, so the input to the
database will not be in strictly ascending time order anyway.
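To make the ordering question concrete, here is a toy sketch (not the real HTrace classes; all names are illustrative) of why a close-delivers-span model hands nested spans to the receiver innermost-first:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of TraceScope semantics: a scope delivers its span to the
// "receiver" only when closed, so nested spans arrive innermost-first.
public class SpanOrderDemo {
    static final List<String> received = new ArrayList<>();  // stand-in for a SpanReceiver
    static final Deque<String> open = new ArrayDeque<>();    // currently-open scopes

    static void openScope(String description) { open.push(description); }

    static void closeScope() { received.add(open.pop()); }   // span is sent on close

    public static List<String> run() {
        received.clear();
        openScope("outer");   // outermost instrumented function
        openScope("inner");   // a nested call
        closeScope();         // inner completes first...
        closeScope();         // ...so "inner" reaches the receiver before "outer"
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run());  // [inner, outer]
    }
}
```

A database-backed sink doesn't care about this order, as you say; the sketch just shows why a streaming consumer would see children before parents.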

> 
> Also, the fact that information about a span is not delivered until the
> span is closed, relies on the program not terminating abruptly. In Java
> this is not so much of a problem, but in C what happens if a series of
> nested function calls is instrumented with spans, and the innermost
> function crashes? As far as I can tell, none of the spans are delivered.
> This makes the use of the tracing API unreliable for bug analysis.

I definitely agree that it is frustrating when a program crashes with
spans which are buffered.  This can happen in both Java and C, although
our out-of-the-box handling of shutdown hooks is better in Java.  This
problem is difficult to avoid completely for a few different reasons:

1. As you commented, we don't output spans until they're complete...
i.e., closed.
2. Without buffering, we end up doing an RPC per span, which is too
costly in real-world systems.

I agree that performance of a tracing API is paramount. However, I've worked
on real-time systems where a message per API action was generated.
There are ways to reduce the impact of that, for example by using a local proxy
which does the buffering on behalf of the application. Communication with such
a proxy can be much more lightweight (Unix sockets or shared memory) than generic
UDP/TCP-based RPCs. Ultimately, though, IMHO it should be left to the programmer
to set up the tracing infrastructure according to his/her particular use case (in some
cases the complexity of an extra running proxy may not be required).
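The batching trade-off behind point 2 above can be sketched in a few lines (a toy receiver, not an HTrace class; a local proxy would play the same role on behalf of the application):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of span batching: spans are buffered and shipped as one
// "RPC" per batch instead of one RPC per span.
public class BatchingReceiver {
    static final int BATCH_SIZE = 3;   // illustrative threshold
    final List<String> buffer = new ArrayList<>();
    int rpcCount = 0;                  // round trips actually made

    void receiveSpan(String span) {
        buffer.add(span);
        if (buffer.size() >= BATCH_SIZE) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        rpcCount++;       // one RPC ships the whole batch
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchingReceiver r = new BatchingReceiver();
        for (int i = 0; i < 7; i++) r.receiveSpan("span-" + i);
        r.flush();        // flush the remainder at shutdown
        System.out.println(r.rpcCount + " RPCs for 7 spans");  // 3 RPCs
    }
}
```

The flip side, of course, is that anything still sitting in the buffer is lost if the process crashes before a flush.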

I would also add, one thing that is frustrating sometimes is how very
long-running spans don't show up for a while in the GUI.

> 
> Would you consider a change where each API call produces at least one
> event sent to the SpanReceiver? 

It would be interesting to think about giving users (or maybe
spanreceivers?) the option of receiving the same span twice: once when
it was first opened, and once when it was completed.  Or maybe having
spans which were uncompleted for a certain amount of time sent out, to
better avoid losing them in a crash.

We'd have to think carefully about this to avoid overwhelming users with
configuration knobs.  And we'd also have to document that SpanReceivers
would have to be able to handle receiving the same span twice. 
Hopefully the consistency implications don't get too tricky.

That seems to me to be forcing the existing model. A span by definition should have a
start and an end time. IMHO the creation of the span should be
one event, and its closure a different one.
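A minimal sketch of that two-event model (all names here are hypothetical, not existing HTrace API):

```java
import java.util.ArrayList;
import java.util.List;

// Two-event span lifecycle: creation and closure are delivered as separate
// events, so a crash after "open" still leaves a record at the receiver.
public class TwoEventSpan {
    static final List<String> receiver = new ArrayList<>();

    static long openSpan(String description) {
        long id = receiver.size();                      // toy span id
        receiver.add("OPEN " + id + " " + description); // delivered immediately
        return id;
    }

    static void closeSpan(long id) {
        receiver.add("CLOSE " + id);                    // second, separate event
    }

    public static void main(String[] args) {
        long id = openSpan("sendMessage");
        // even if the program crashed here, the OPEN event was already sent
        closeSpan(id);
        System.out.println(receiver);
    }
}
```

A receiver in this model pairs OPEN and CLOSE events by span id, which is also what makes duplicate delivery tolerable.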

> 
> 2) HTrace has a concept of spans having one or more parents.  This
> allows, for example, capturing the fact that a process makes an RPC call
> to another.  However, there is no information about when within the span
> the caller calls the callee. A caller span may have two child spans,
> representing the fact that it made two RPC calls, but the order in which
> those were made is lost in the model (using the timestamps associated with
> the beginning of the callee spans is not feasible, as there may be different
> RPC latencies, or simply the clocks may not be aligned). Also, the only
> relation captured by the API is between blocks.

In your example, is the caller span making the two RPCs in parallel?  If
so, it might be appropriate to say that the spans don't have a
well-defined ordering.  Certainly we don't have any guarantees about
which one will be processed first.  Which one was initiated first
doesn't seem very interesting-- unless I'm missing something.

Actually, in my example the two RPC calls were made consecutively by the same
thread, i.e. they were sequential. I would expect concurrent calls to originate
from separate spans, one per thread. However, even in this case
there is a difference between the potential order of the calls and the actual
order. A well-written program should behave properly whatever the order is. But
finding out that the program misbehaves when the calls happen in a certain order
may be invaluable.
> 
> I propose a more general API with a concept of spans and points
> (timestamped sets of annotations), and cause-effect relationships among
> points. An RPC call can be represented as a point in the caller span
> marked as cause, and a (begin) point in the callee span marked as
> effect. This is very flexible and allows capturing all sorts of
> relationships, not just parent-child. For example, a DMA operation may be
> initiated in a block and captured as a point, and the completion captured
> as a point in a distinct block in the same entity (an abstraction for a
> unit of concurrency).

We've talked about tracking "points" in addition to "spans" before. 
This mainly came up in the context of tracing "point" events like
application launches, MapReduce jobs being initiated, etc. etc.  The
biggest objection is that spans and points have almost as much data (the
main difference is points don't have an "end"), so creating a whole
separate code pathway and storage pathway might be overkill.  We have to
think about this more.

It's interesting to think about adding some kind of "comes-after"
dependency to htrace spans, besides the parent/child dependency.  That
has kind of a vector clock flavor.  I do wonder how often this is really
a problem in practice, though...
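For reference, the "points with cause-effect edges" idea can be sketched like this (all class and field names are hypothetical, not existing HTrace types):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "points" proposal: timestamped annotations within a span,
// plus explicit cause->effect edges between points.
public class PointsDemo {
    static class Point {
        final long spanId, timestamp;
        final String annotation;
        Point(long s, long t, String a) { spanId = s; timestamp = t; annotation = a; }
    }
    static class Edge {
        final Point cause, effect;   // cause-effect relationship between points
        Edge(Point c, Point e) { cause = c; effect = e; }
    }

    static List<Edge> rpcExample() {
        List<Edge> edges = new ArrayList<>();
        Point sent = new Point(1, 1000, "rpc.send");      // point in the caller span
        Point received = new Point(2, 1003, "rpc.begin"); // begin point in the callee span
        edges.add(new Edge(sent, received));              // caller's send caused callee's begin
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(rpcExample().size() + " causal edge(s)");
    }
}
```

Note that each edge stores two (spanId, timestamp) pairs plus an annotation, which is roughly the "almost as much data as a span" concern raised above.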

> 3) there doesn't seem to be any provision in the HTrace API for
> considering clock domains. In a distributed system, there may be
> processes running on the same host, processes running in the same
> cluster, process running in different clusters. Different domain may have
> different degrees of clock mis-alignment. Providing indications of this
> information in the API allows the backend or UI trace building to make
> more accurate inferences on how concurrent entities line up.

Clock skew is a very difficult problem.  Even determining how much clock
skew exists is a difficult problem, since all your messages from one node
to another will have some latency.  There are estimation heuristics out
there, but it's complex.  Even systems like AugmentedTime don't attempt
to precisely quantify clock skew, but just to keep it below some
threshold required for correctness.

I agree. What I was thinking of is not a mechanism to estimate clock skew, but
rather a mechanism where a user can configure a maximum expected clock skew.
This information can be integrated with the causality dependencies imposed by "edges"
(span dependencies and possibly new causality dependencies) to constrain
topological sorting of the model graph.
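The ordering rule I have in mind is simple to sketch (purely illustrative, not an HTrace API; the bound is a user-supplied assumption):

```java
// Combining a configured maximum clock skew with causal edges: timestamps
// from different clock domains are only considered ordered when they differ
// by more than the bound; causal edges would impose order regardless.
public class SkewDemo {
    static final long MAX_SKEW_MS = 50;  // user-configured bound (assumption)

    // Returns -1, 0 (ambiguous), or 1 for events with no causal edge between them.
    static int compareTimestamps(long a, long b) {
        if (Math.abs(a - b) <= MAX_SKEW_MS) return 0;  // within skew: unordered
        return a < b ? -1 : 1;
    }

    public static void main(String[] args) {
        System.out.println(compareTimestamps(1000, 1030)); // 0: within the 50 ms bound
        System.out.println(compareTimestamps(1000, 1200)); // -1: clearly earlier
    }
}
```

Pairs that compare as 0 would then be ordered (if at all) only by the causal edges in the graph.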

In general, admins run NTP on their servers.  YARN even requires this
(or so I'm told... there is a JIRA out there I could find).  From a
practical point of view, I'm not sure what admins would do with clock
skew data (but perhaps there's something I haven't thought of here).

One thing that might be interesting is some kind of way of warning
admins if the clocks are seriously misaligned (indicating that NTP was
down, or there was a clock adjustment mishap, or something like that). 
Traditionally, that's the job of the cluster management system, but it
would be interesting if we could surface that information in some way.

> 4) does the API provide a mechanism for creating "delegated traces"? what
> I mean by this is that in some circumstances  some thread may need to
> create traces on behalf of some other element which may not have such
> capability. For example, a mobile device may have some custom tracing
> mechanism, and attach the information to a request for the server. The
> server would then need to create the HTrace trace from the existing data
> passed in the request (including timestamps)

Sure.  In this case, the server can just create a span from JSON the
client sent using the MilliSpanDeserializer.  If you don't want to use
JSON for some reason, you can construct an arbitrary span object using
MilliSpan#Builder.
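For illustration, here is a standalone toy builder showing the "delegated" shape, with the span's begin/end taken from the client's payload rather than the server clock. It only mimics the general shape of a builder API; the real MilliSpan.Builder's method names may differ:

```java
// Toy illustration of delegated span creation: a server reconstructs a span
// from timestamps the client shipped with its request. Standalone sketch,
// not the actual HTrace MilliSpan API.
public class DelegatedSpan {
    static class Span {
        final String description;
        final long begin, end;
        Span(String d, long b, long e) { description = d; begin = b; end = e; }
        long durationMs() { return end - begin; }
    }

    static class Builder {
        private String description;
        private long begin, end;
        Builder description(String d) { this.description = d; return this; }
        Builder begin(long ms) { this.begin = ms; return this; }
        Builder end(long ms) { this.end = ms; return this; }
        Span build() { return new Span(description, begin, end); }
    }

    public static void main(String[] args) {
        // Timestamps taken from the client's request payload, not the server clock.
        Span s = new Builder().description("mobile.render")
                              .begin(1473638400000L)
                              .end(1473638400250L)
                              .build();
        System.out.println(s.durationMs() + " ms");  // 250 ms
    }
}
```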

> Let me know if there is interest in discussing changes at this level.
> Thanks,
>                     Roberto

Sure.  I have to warn you that we have a strong bias towards compatible
changes, though.  It is difficult to get all the downstream projects to
change how they use the API, even when there is a strong reason to
change.  Almost as hard as getting Hadoop to do a new release :)

I understand that. To be honest, I have a clean-room implementation of an API
based on my previous experiences with tracing, but I'm
trying to see whether this could be captured by extensions to the
existing HTrace API.

I'm curious if you have a project you are thinking about instrumenting
with HTrace.  We would love to hear more about how people are using
HTrace or plan to use it, so we can build what people want.

I don't have a specific project right now. I've been working on tracing at
Cisco and Facebook in the last few years, and I'm between gigs right now, so
I'm interested in crystallizing my experience into an open source framework.

cheers,
Colin

