[ https://issues.apache.org/jira/browse/HTRACE-92?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295742#comment-14295742 ]

Colin Patrick McCabe commented on HTRACE-92:
--------------------------------------------

So my first question when seeing this was "why haven't we run into this before?"

I think the reason is that in HDFS, we hold all of our {{Spans}} inside 
{{TraceScope}} objects, and we are careful to call {{TraceScope#close}}.  
(These objects implement {{Closeable}}, which makes it easier to catch places 
where we don't via FindBugs.)  The scope-closing function will then call 
{{TraceScope#detach}}, which sets the thread-local span object back to 
{{TraceScope#savedSpan}}.  This will be the old parent span, or {{null}} in the 
case where the span we just closed was a top-level span.  So we never have to 
deal with this issue of a closed parent span hanging around in a thread-local 
variable.
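
Concretely, the HDFS pattern is just try-with-resources (or try/finally) 
around the scope.  A rough sketch against the 3.1 API (the span description 
here is made up):

{code}
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

// TraceScope implements Closeable, so try-with-resources guarantees the
// scope is closed on every code path.  Closing the scope restores
// Tracer#currentSpan to the saved parent span (or null).
try (TraceScope scope = Trace.startSpan("myOperation", Sampler.ALWAYS)) {
  // ... traced work ...
}
{code}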

An example might help.  The way it's supposed to work is:

{code}
1. The thread local variable Tracer#currentSpan starts off as null.
2. Application calls Trace#startSpan.  Tracer#currentSpan = SPAN1.  SCOPE1 
object is created to hold SPAN1.
3. Application calls Trace#startSpan.  Tracer#currentSpan = SPAN2.  SCOPE2 
object is created to hold SPAN2.
4. Application closes SCOPE2 by calling TraceScope#close.  Tracer#currentSpan = 
SPAN1 (this was saved inside SCOPE2 as TraceScope#savedSpan)
5. Application closes SCOPE1 by calling TraceScope#close.  Tracer#currentSpan = 
null (this was saved inside SCOPE1 as TraceScope#savedSpan)
6. Tracer#currentSpan is null again.
{code}

You can see that {{Tracer#currentSpan}} is never set to a closed span.  The act 
of closing the scope both closes the span and removes it from 
{{Tracer#currentSpan}}.
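
In code, the six steps above are just two nested scopes (again a sketch; the 
descriptions are invented):

{code}
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

TraceScope scope1 = Trace.startSpan("outer", Sampler.ALWAYS);  // currentSpan = SPAN1
TraceScope scope2 = Trace.startSpan("inner");   // currentSpan = SPAN2, SPAN1 saved
scope2.close();  // SPAN2 closed, currentSpan = SPAN1
scope1.close();  // SPAN1 closed, currentSpan = null
{code}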

Things get more complicated when you're passing spans between threads:
{code}
1. The thread local variable Tracer#currentSpan starts off as null for THREAD1 
and THREAD2.
2. THREAD1 calls Trace#startSpan.  THREAD1#Tracer#currentSpan = SPAN1.  SCOPE1 
object is created to hold SPAN1.
3. THREAD1 calls TraceScope#detach.  THREAD1#Tracer#currentSpan = null (this 
was saved inside SCOPE1 as TraceScope#savedSpan)
4. THREAD2 calls Tracer#continueSpan.  THREAD2#Tracer#currentSpan = SPAN1.  
SCOPE1b is created to hold SPAN1.
5. THREAD2 closes SCOPE1b by calling TraceScope#close.  
THREAD2#Tracer#currentSpan = null (this was saved inside SCOPE1b as 
TraceScope#savedSpan)
6. THREAD2#Tracer#currentSpan is null again (and THREAD1#Tracer#currentSpan 
became null in step #3).
{code}
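
The hand-off in code looks roughly like this (a sketch; the thread wiring is 
just for illustration):

{code}
import org.apache.htrace.Sampler;
import org.apache.htrace.Span;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

// THREAD1: start a span, then detach it.  detach() restores THREAD1's
// Tracer#currentSpan to the saved value (null here) without closing
// the span.
TraceScope scope1 = Trace.startSpan("crossThreadOp", Sampler.ALWAYS);
final Span span1 = scope1.detach();

// THREAD2: re-attach the detached span.  Closing SCOPE1b closes SPAN1
// and restores THREAD2's Tracer#currentSpan to null.
new Thread(new Runnable() {
  @Override
  public void run() {
    TraceScope scope1b = Trace.continueSpan(span1);
    try {
      // ... work attributed to SPAN1 ...
    } finally {
      scope1b.close();
    }
  }
}).start();
{code}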

Again, you can see that {{Tracer#currentSpan}} is never set to a closed span.  
I think {{currentSpan}} can probably only be set to a closed span by a buggy 
application (although I haven't thought it through that deeply).  Maybe we 
should put an assert somewhere in our code to detect when the current trace 
span is closed, and blow up if so.  It would be good to catch these problems 
early.
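
Something like this, wherever we read the thread-local (purely hypothetical 
placement and wording):

{code}
// Hypothetical sanity check at the point where the thread-local is read:
Span current = Trace.currentSpan();
assert current == null || current.isRunning() :
    "Tracer#currentSpan points at a span that was already closed: " + current;
{code}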

> Thread local storing the currentSpan is never cleared
> -----------------------------------------------------
>
>                 Key: HTRACE-92
>                 URL: https://issues.apache.org/jira/browse/HTRACE-92
>             Project: HTrace
>          Issue Type: Bug
>    Affects Versions: 3.1.0
>            Reporter: Samarth Jain
>         Attachments: Screen Shot 2015-01-27 at 11.20.36 PM.png
>
>
> In Apache Phoenix, we use HTrace to provide request-level trace information. 
> The trace information (traceid, parentid, spanid, description), among other 
> columns, is stored in a Phoenix table. 
> Spans that are MilliSpans get persisted to the Phoenix table this way:
> MilliSpans -> PhoenixTraceMetricsSource -> PhoenixMetricsSink -> Phoenix Table
> While inserting the traces into the Phoenix table, we make sure that the 
> upsert happening in the sink goes through a connection that doesn't have 
> tracing on. This way, any spans that are created by executing these upsert 
> statements on the client side are NullSpans. 
> On the server side too, when these batched-up upsert statements are executed 
> as batchMutate operations, they are not expected to have tracing on, i.e. the 
> current spans should be null. However, we noticed that these current spans 
> are not null, which ends up resulting in an infinite loop. An example of such 
> an infinite loop is the following:
> batchmutate -> fshlog.append -> check tracing on i.e. current span is not 
> null -> yes -> create milli span -> do operation -> stop span -> publish to 
> metrics source -> phoenix metrics sink -> upsert statement -> batchmutate -> 
> fshlog.append...
> My local cluster in fact dies because of this infinite loop!
> On examining the thread locals of the threads in the RPC thread pool, I saw 
> that there were threads whose current spans had been closed at least an hour 
> earlier. See the screenshot attached. 
> The screenshot was taken at 11:20 PM, and the thread had a current span whose 
> stop time was 10:17 PM. 
> These brought up a couple of design issues/limitations in the HTrace API:
> 1) There is no good way to set (reset) the value of the thread-local current 
> span to null. This is a huge issue, especially if we are reusing threads 
> from a thread pool. 
> 2) Should we allow creating spans if the parent span is not running anymore, 
> i.e. Span.isRunning() is false? In the example I have shown in the 
> screenshot, the current span stored in the thread local is already closed. 
> Essentially, isTracing() would become:
> {code}
> boolean isTracing() {
>   return currentSpan.get() != null && currentSpan.get().isRunning();
> }
> {code}


