[ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005853#comment-14005853 ]
Ben Chan commented on CASSANDRA-5483:
-------------------------------------

{quote}
Is that because the traces are asynchronous? Because I think session 2 only starts after session 1 finishes.
{quote}

Here's my high-level understanding:

* Repair command #1, #2, etc. are serial.
* Each repair session ("Syncing range ...") is technically concurrent, since each is submitted to a ThreadPoolExecutor.
** However, differencing is serialized, so if there is no streaming going on, you won't see very much overlap between the sessions, except at the beginning and end (which is exactly what we see with these simple tests).
** Conversely, this means you will see much more interleaving when heavy streaming is going on. So at the very least, it might be good to eventually disambiguate the streaming portion.

{quote}
The easiest thing would be to make them non-redundant. Can we make the tracing "extra detail" on top of the normal ones instead of competing with them?
{quote}

I think it may be a conceptual block on my part; I tend to think of traces as a kind of profiling mechanism.

* Most of the sendNotification calls in StorageService#createRepairTask report errors from the results of RepairFuture objects. The timing on those is not really useful for profiling; they're not what I'd usually think of as a trace.
* Some are request-validation reporting before the repair proper even starts.
* The rest are informational sendNotification messages which are redundant when tracing is active (this is the easy case).

In pseudocode:

{noformat}
if (some error #1 in repair request)
    sendNotification("NO #1!");
if (some error #2 in repair request)
    sendNotification("NO #2!");

for (r : ranges)
{
    f = something.submitRepairSession(new RepairSession(r));
    futures.add(f);
    try
    {
        // this serializes the differencing part
        f.waitForDifferencing();
    }
    catch (SomeException e)
    {
        // handle, sendNotification
    }
}

try
{
    for (f : futures)
    {
        r = f.get();
        sendNotification("done: %s", r);
    }
}
catch (ExecutionException ee)
{
    // handle, sendNotification
}
catch (Exception e)
{
    // handle, sendNotification
}
{noformat}

The main point is that I can't be sure every single interesting exception is caught and traced in the thread where it's thrown, then rethrown. Most likely this is not the case, and some exceptions are only reported at the StorageService#createRepairTask level. I believe most (?) cases are already caught and traced, though.

So after going through all that, I'm thinking that the easiest thing is to accept the possibility of redundancy and delayed reporting, and just trace every sendNotification in StorageService#createRepairTask (unless it's demonstrably redundant, or already being traced through some other mechanism).

> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt,
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
> 5483-v07-08-Fix-brace-style.patch,
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
> 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch,
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch,
> 5483-v08-14-Poll-system_traces.events.patch,
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch,
> 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
> 5483-v10-17-minor-bugfixes-and-changes.patch,
> 5483-v10-rebased-and-squashed-471f5cc.patch,
> 5483-v11-01-squashed.patch,
> 5483-v11-squashed-nits.patch,
> 5483-v12-02-cassandra-yaml-ttl-doc.patch,
> 5483-v13-608fb03-May-14-trace-formatting-changes.patch,
> ccm-repair-test,
> cqlsh-left-justify-text-columns.patch,
> prerepair-vs-postbuggedrepair.diff,
> test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
> I think it would be nice to log repair stats and results the way query tracing
> stores traces in the system keyspace. With it, you don't have to look up each
> log file to see what the status was and how the repair you invoked performed.
> Instead, you can query the repair log with a session ID to see the state and
> stats of all nodes involved in that repair session.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
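A side note on the "just trace every sendNotification" approach discussed above: the redundancy is easier to live with if both reporting paths go through a single helper, so a message can never be notified over JMX but silently missing from the trace (or vice versa). The sketch below is purely illustrative; `RepairReporter`, its `Sink` interface, and the method names are hypothetical stand-ins, not the actual StorageService or Tracing APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: funnel every repair progress message through one
// helper that emits both the progress notification and, when tracing is
// enabled, the corresponding trace entry.
public class RepairReporter {
    public interface Sink { void accept(String message); }

    private final Sink notifications; // stands in for sendNotification(...)
    private final Sink traces;        // stands in for a Tracing.trace(...) call
    private final boolean tracingEnabled;

    public RepairReporter(Sink notifications, Sink traces, boolean tracingEnabled) {
        this.notifications = notifications;
        this.traces = traces;
        this.tracingEnabled = tracingEnabled;
    }

    // Call sites use this instead of calling sendNotification directly, so the
    // two paths cannot drift apart as messages are added or reworded.
    public void report(String format, Object... args) {
        String message = String.format(format, args);
        notifications.accept(message);
        if (tracingEnabled)
            traces.accept(message);
    }

    public static void main(String[] args) {
        List<String> notified = new ArrayList<>();
        List<String> traced = new ArrayList<>();
        RepairReporter reporter = new RepairReporter(notified::add, traced::add, true);

        reporter.report("Syncing range %s", "(0,100]");
        reporter.report("done: %s", "session 1");

        // With tracing on, both sinks received identical messages.
        System.out.println(notified.equals(traced));
    }
}
```

This doesn't address the delayed-reporting caveat (exceptions that only surface at the createRepairTask level are still reported late), but it does keep the notification stream and the trace stream consistent with each other.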