[ 
https://issues.apache.org/jira/browse/ACCUMULO-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243190#comment-15243190
 ] 

ASF GitHub Bot commented on ACCUMULO-4191:
------------------------------------------

Github user ShawnWalker commented on the pull request:

    https://github.com/apache/accumulo/pull/94#issuecomment-210540256
  
    > ... the "lower" half of the span would be un-rooted (the part doing 
binning and sending).
    Is "unrooted" different from "unreported"?  My view of the tracing message 
suggests so, but I'm uncertain as to exactly the difference.
    
    > Have you been able to notice your changes positively affecting 
ShellServerIT#trace's success rate? I have seen it fail now and again, but 
never reliably.
    
    For me, `ShellServerIT.trace()` _always_ fails.  I've been banging my head 
against this for the past 3 days, trying to get tests to succeed while working 
on ACCUMULO-4187.  I'll take your word that it doesn't fail for everyone all 
the time.  With this change, `ShellServerIT` has succeeded for me the 5 or 6 
times I've run it.
    
    > Any idea as to why we only sometimes see ShellServerIT fail? Is there a 
reason why this doesn't always fail that you noticed?
    
    My current theory is that it's a race condition depending on how long it 
takes to spin up a new threadpool:
    
    At line `TabletServerBatchWriter` line 670, `binningThreadPool` uses a 
`SynchronousQueue`, and at line 671, the `binningThreadPool` has a 
`CallerRunsPolicy`.  Together these mean "perform binning in the other thread, 
but only if it's not currently busy doing binning; otherwise just do the 
binning in the current thread."
    
    If the binning is done in the current thread, there is no need to propagate 
tracing information; as you point out, this information is carried by 
thread-local variables.  If the binning ends up being done in the thread pool, 
the tracing information needs to be propagated.
    
    But there's a race condition here: The `SynchronousQueue` will only allow a 
job to be pushed into it if there's already a thread sitting waiting on a job.  
So the race ends up being which happens first: the consumer thread created to 
service `binningThreadPool` is ready to accept work, or the producer thread is 
ready to commit some of its mutations.  In the former case, the work gets done 
in the thread pool; in the latter case, the work gets done in the current 
thread.
    
    > Any thoughts on how we could make a test which would specifically check 
for regressions here?
    
    Race conditions are notably difficult to diagnose.  What's more, this 
particular race condition isn't an error (assuming the tracing is properly 
handled regardless of who wins the race).  I can think of a few ways one might 
better expose this particular bug under lab conditions, but any such test would 
by necessity be timing sensitive, and I do so hate knowingly writing 
timing-sensitive tests.
    



> Tracing on client can sometimes lose "sendMutations" events.
> ------------------------------------------------------------
>
>                 Key: ACCUMULO-4191
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4191
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>            Priority: Minor
>
> In {{TabletServerBatchWriter.MutationWriter.queueMutations(...)}} (line 736), 
> HTrace {{Span}}'s are not properly propagated across thread boundaries.  
> Consequently, tracing doesn't propagate into 
> {{TabletServerBatchWriter.SendTask.send(...)}}, and so the {{sendMutation}} 
> event can fail to be logged.
> This can cause {{ShellServerIT.trace(...)}} to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to