[ https://issues.apache.org/jira/browse/ACCUMULO-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243190#comment-15243190 ]
ASF GitHub Bot commented on ACCUMULO-4191: ------------------------------------------ Github user ShawnWalker commented on the pull request: https://github.com/apache/accumulo/pull/94#issuecomment-210540256 > ... the "lower" half of the span would be un-rooted (the part doing binning and sending). Is "unrooted" different from "unreported"? My view of the tracing message suggests so, but I'm uncertain as to exactly the difference. > Have you been able to notice your changes positively affecting ShellServerIT#trace's success rate? I have seen it fail now and again, but never reliably. For me, `ShellServerIT.trace()` _always_ fails. I've been banging my head against this for the past 3 days, trying to get tests to succeed while working on ACCUMULO-4187. I'll take your word that it doesn't fail for everyone all the time. With this change, `ShellServerIT` has succeeded for me the 5 or 6 times I've run it. > Any idea as to why we only sometimes see ShellServerIT fail? Is there a reason why this doesn't always fail that you noticed? My current theory is that it's a race condition depending on how long it takes to spin up a new threadpool: At line `TabletServerBatchWriter` line 670, `binningThreadPool` uses a `SynchronousQueue`, and at line 671, the `binningThreadPool` has a `CallerRunsPolicy`. Together these mean "perform binning in the other thread, but only if it's not currently busy doing binning; otherwise just do the binning in the current thread." If the binning is done in the current thread, there is no need to propagate tracing information; as you point out, this information is carried by thread-local variables. If the binning ends up being done in the thread pool, the tracing information needs to be propagated. But there's a race condition here: The `SynchronousQueue` will only allow a job to be pushed into it if there's already a thread sitting waiting on a job. So the race ends up being which happens first: the consumer thread created to service `binningThreadPool` is ready to accept work, or the producer thread is ready to commit some of its mutations. In the former case, the work gets done in the thread pool; in the latter case, the work gets done in the current thread. > Any thoughts on how we could make a test which would specifically check for regressions here? Race conditions are notably difficult to diagnose. What's more, this particular race condition isn't an error (assuming the tracing is properly handled regardless of who wins the race). I can think of a few ways one might better expose this particular bug under lab conditions, but any such test would by necessity be timing sensitive, and I do so hate knowingly writing timing-sensitive tests. > Tracing on client can sometimes lose "sendMutations" events. > ------------------------------------------------------------ > > Key: ACCUMULO-4191 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4191 > Project: Accumulo > Issue Type: Bug > Reporter: Shawn Walker > Assignee: Shawn Walker > Priority: Minor > > In {{TabletServerBatchWriter.MutationWriter.queueMutations(...)}} (line 736), > HTrace {{Span}}'s are not properly propagated across thread boundaries. > Consequently, tracing doesn't propagate into > {{TabletServerBatchWriter.SendTask.send(...)}}, and so the {{sendMutation}} > event can fail to be logged. > This can cause {{ShellServerIT.trace(...)}} to fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)