[ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998924#comment-13998924 ]
Ben Chan edited comment on CASSANDRA-5483 at 5/22/14 12:01 PM:
---------------------------------------------------------------

May 14 formatting changes in [^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based off of commit 608fb03).

{quote}
I think the session log messages are still confusing, especially since we use the same term for "repairing a subrange" and "streaming data."
{quote}

Currently the "session" terminology is baked into the source code, in {{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed to use different terminology, hopefully the source code can eventually be changed to match (fewer special cases to remember). Perhaps the best approach is to always qualify the term, e.g. "stream session" and "repair session"?

{quote}
I don't actually see the session uuid being used in the logs except at start/finish.
{quote}

Sorry, that was another inadvertent mixing of nodetool messages and trace output. {{\[2014-05-13 23:49:52,283] Repair session cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished}} is not a trace, but a separate (pre-patch) sendNotification in {{StorageService.java}}. This message (and some of the error messages, I think) is redundant when combined with trace output; it should have been either one or the other, not both. In the trace proper, the session UUID only shows up at the start.

But note: not all nodetool messages are rendered redundant by trace output. Since we can't just suppress every non-trace sendNotification, how can we unambiguously tell nodetool trace output apart from normal sendNotification messages? I'm currently leaning towards marking all sendNotification trace output with a {{TRACE:}} tag.

The repair session UUIDs used to be prepended to everything, but were removed in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch].
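For illustration, a minimal sketch of what the {{TRACE:}} tagging could look like. This is a hypothetical helper, not the actual patch: the real sendNotification wiring in {{StorageService.java}} is not shown, and the exact message format (tag plus optional session-UUID prefix) is an assumption.

```java
import java.util.UUID;

// Hypothetical sketch: mark every trace-derived notification message with a
// "TRACE:" tag, optionally keeping the repair session UUID so traces from
// concurrent repair sessions stay distinguishable.
public class TraceTag {
    static final String TAG = "TRACE:";

    // Format a trace message before it is handed to sendNotification.
    // The "[uuid]" prefix is an assumed format, not the patch's actual one.
    static String tagTrace(UUID session, String message) {
        return TAG + " [" + session + "] " + message;
    }

    // nodetool side: decide whether an incoming notification is trace output.
    static boolean isTrace(String message) {
        return message.startsWith(TAG);
    }

    public static void main(String[] args) {
        UUID session = UUID.fromString("cd6aad80-db1a-11e3-b0e7-f94811c7b860");
        String traced = tagTrace(session, "Syncing range (-3074457345618258603,3074457345618258602]");
        System.out.println(traced);
        System.out.println(isTrace(traced));                       // true
        System.out.println(isTrace("Repair command #1 finished")); // false
    }
}
```

The point of the single string tag is that nodetool can classify notifications with one {{startsWith}} check, without any change to the JMX notification plumbing.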
Without them, things are less verbose, but it's sometimes hard to unambiguously follow traces for concurrent repair sessions. To make the point clearer, I've marked each sub-task graphically in the nodetool trace output below (I've cross-checked this with the logs, which do retain the UUIDs). If you cover up the left side, it's harder to figure out which trace goes with which sub-task. Real-world repair traces will probably be even more confusing. Note: indentation here does not denote nesting; the column roughly indicates task identity, though I reuse columns when it's not ambiguous.

{noformat}
1 [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 ranges for s1.users (seq=true, full=true)
x [2014-05-15 11:31:37,922] Syncing range (-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:38,108] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
x [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:39,953] Received merkle tree for users from /127.0.0.2
x [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:41,279] Received merkle tree for users from /127.0.0.3
x [2014-05-15 11:31:42,632] Received merkle tree for users from /127.0.0.1
x [2014-05-15 11:31:42,671] Syncing range (-9223372036854775808,-3074457345618258603]
x [2014-05-15 11:31:42,766] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
x [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
x [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:43,084] Completed sync of range (-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:43,422] Received merkle tree for users from /127.0.0.2
x [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:44,637] Received merkle tree for users from /127.0.0.3
x [2014-05-15 11:31:45,474] Received merkle tree for users from /127.0.0.1
x [2014-05-15 11:31:45,494] Syncing range (3074457345618258602,-9223372036854775808]
x [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for users
x [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
x [2014-05-15 11:31:45,564] Completed sync of range (-9223372036854775808,-3074457345618258603]
x [2014-05-15 11:31:45,581] Requesting merkle trees for users from [/127.0.0.2, /127.0.0.3, /127.0.0.1]
x [2014-05-15 11:31:46,214] Received merkle tree for users from /127.0.0.2
x [2014-05-15 11:31:46,233] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:46,459] Received merkle tree for users from /127.0.0.3
x [2014-05-15 11:31:46,472] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for s1.users
x [2014-05-15 11:31:46,545] Received merkle tree for users from /127.0.0.1
x [2014-05-15 11:31:46,574] Endpoint /127.0.0.3 has 1 range(s) out of sync with /127.0.0.1 for users
x [2014-05-15 11:31:46,575] Endpoint /127.0.0.2 is consistent with /127.0.0.3 for users
x [2014-05-15 11:31:46,581] Endpoint /127.0.0.2 has 1 range(s) out of sync with /127.0.0.1 for users
x [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.2
x [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges with /127.0.0.3
x [2014-05-15 11:31:46,910] Streaming session with /127.0.0.2 prepared.
x [2014-05-15 11:31:46,949] Streaming session with /127.0.0.3 prepared.
x [2014-05-15 11:31:46,991] 87/87 bytes (100%) received from idx:0/127.0.0.3
x [2014-05-15 11:31:47,004] 87/87 bytes (100%) received from idx:0/127.0.0.2
x [2014-05-15 11:31:47,053] Streaming session with /127.0.0.3 completed successfully.
x [2014-05-15 11:31:47,112] Streaming session with /127.0.0.2 completed successfully.
x [2014-05-15 11:31:47,131] Completed sync of range (3074457345618258602,-9223372036854775808]
1 [2014-05-15 11:31:47,133] Repair command #1 finished
2 [2014-05-15 11:31:47,180] Starting repair command #2, repairing 2 ranges for system_traces.sessions, system_traces.events (seq=true, full=true)
x [2014-05-15 11:31:47,189] Syncing range (-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:47,260] Requesting merkle trees for sessions from [/127.0.0.3, /127.0.0.1]
x [2014-05-15 11:31:47,344] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
x [2014-05-15 11:31:47,400] Received merkle tree for sessions from /127.0.0.3
x [2014-05-15 11:31:47,577] Received merkle tree for sessions from /127.0.0.1
x [2014-05-15 11:31:47,617] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for sessions
x [2014-05-15 11:31:47,743] Requesting merkle trees for events from [/127.0.0.3, /127.0.0.1]
x [2014-05-15 11:31:48,010] Received merkle tree for events from /127.0.0.3
x [2014-05-15 11:31:48,027] /127.0.0.3: Sending completed merkle tree to /127.0.0.1 for system_traces.events
x [2014-05-15 11:31:48,245] Received merkle tree for events from /127.0.0.1
x [2014-05-15 11:31:48,253] Syncing range (3074457345618258602,-9223372036854775808]
x [2014-05-15 11:31:48,294] Requesting merkle trees for sessions from [/127.0.0.2, /127.0.0.1]
x [2014-05-15 11:31:48,313] Endpoint /127.0.0.3 is consistent with /127.0.0.1 for events
x [2014-05-15 11:31:48,322] Completed sync of range (-3074457345618258603,3074457345618258602]
x [2014-05-15 11:31:48,372] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.sessions
x [2014-05-15 11:31:48,424] Received merkle tree for sessions from /127.0.0.2
x [2014-05-15 11:31:48,582] Received merkle tree for sessions from /127.0.0.1
x [2014-05-15 11:31:48,613] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for sessions
x [2014-05-15 11:31:48,641] Requesting merkle trees for events from [/127.0.0.2, /127.0.0.1]
x [2014-05-15 11:31:48,765] Received merkle tree for events from /127.0.0.2
x [2014-05-15 11:31:48,792] /127.0.0.2: Sending completed merkle tree to /127.0.0.1 for system_traces.events
x [2014-05-15 11:31:48,942] Received merkle tree for events from /127.0.0.1
x [2014-05-15 11:31:48,973] Endpoint /127.0.0.2 is consistent with /127.0.0.1 for events
x [2014-05-15 11:31:48,985] Completed sync of range (3074457345618258602,-9223372036854775808]
2 [2014-05-15 11:31:48,990] Repair command #2 finished
{noformat}

Edit: fixed patch link. Also note: I forgot that I rebased lyubent's fa4089e onto 33939ca, making its hash 608fb03. Obviously, you won't find that hash in any public Cassandra repository. This shouldn't affect patch application, though. Sorry for any confusion this may have caused.


> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt,
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
> 5483-v07-08-Fix-brace-style.patch,
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
> 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch,
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch,
> 5483-v08-14-Poll-system_traces.events.patch,
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch,
> 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
> 5483-v10-17-minor-bugfixes-and-changes.patch,
> 5483-v10-rebased-and-squashed-471f5cc.patch,
> 5483-v11-01-squashed.patch,
> 5483-v11-squashed-nits.patch,
> 5483-v12-02-cassandra-yaml-ttl-doc.patch,
> 5483-v13-608fb03-May-14-trace-formatting-changes.patch,
> ccm-repair-test,
> cqlsh-left-justify-text-columns.patch,
> prerepair-vs-postbuggedrepair.diff,
> test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results like query tracing stores traces to system keyspace. With it, you don't have to look up each log file to see what the status was and how it performed the repair you invoked. Instead, you can query the repair log with session ID to see the state and stats of all nodes involved in that repair session.

--
This message was sent by Atlassian JIRA
(v6.2#6252)