Samarth Jain created HTRACE-92:
----------------------------------

             Summary: Thread local storing the currentSpan is never cleared
                 Key: HTRACE-92
                 URL: https://issues.apache.org/jira/browse/HTRACE-92
             Project: HTrace
          Issue Type: Bug
    Affects Versions: 3.1.0
            Reporter: Samarth Jain


In Apache Phoenix, we use HTrace to provide request level trace information. 
The trace information (traceid, parentid, spanid, description) among other 
columns is stored in a Phoenix table. 

Spans that are MilliSpan get persisted to the phoenix table this way:
MilliSpans-> PhoenixTraceMetricsSource-> PhoenixMetricsSink->Phoenix Table

While inserting the traces to the phoenix table, we make sure that the upsert 
happening in sink is through a connection that doesn't have tracing on. So this 
way any spans that are created by executing these upsert statements on the 
client side are NullSpans. 

On server side too, when these batched up upsert statements are executed as 
batchMutate operations, they are not expected to have tracing on i.e. the 
current spans are null. However, we noticed that these current spans are not 
null which ends up resulting in an infinite loop. An example of such infinite 
loop is the following:

batchmutate -> fshlog.append -> check tracing on i.e. current span is not null 
-> yes -> create milli span -> do operation -> stop span -> publish to metrics 
source -> phoenix metrics sink -> upsert statement -> batchmutate -> 
fshlog.append......

My local cluster infact dies because of this infinite loop!!

On examining the thread local of the threads in the RPC thread pool, I saw that 
there were threads that had current spans that were closed at least an hour 
before. See the screenshot attached. 

The screenshot was taken at 11:20 PM and the thread had a current span whose 
stop time was 10:17 PM. 

These brought up a couple of design issues/limitations in the HTrace API:
1) There is no good way to set (reset) the value of the thread local current 
span to null. This is a huge issue especially if we we are reusing threads from 
a thread pool. 
2) Should we allow creating spans if the parent span is not running anymore 
i.e. Span.isRunning() is false. In the example I have shown in the screen shot, 
the current span stored in the thread local is already closed. Essentially 
making {code}
boolean isTracing() {
return currentSpan.get() != null && currentSpan.get().isRunning()
{code} 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to