[ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998924#comment-13998924 ]

Ben Chan edited comment on CASSANDRA-5483 at 5/22/14 12:01 PM:
---------------------------------------------------------------

May 14 formatting changes in 
[^5483-v13-608fb03-May-14-trace-formatting-changes.patch] (based off of commit 
608fb03).

{quote}
I think the session log messages are still confusing, especially since we use 
the same term for "repairing a subrange" and "streaming data."
{quote}

Currently the "session" terminology is baked into the source code, in 
{{StreamSession.java}} and {{RepairSession.java}}. If the messages are changed 
to reflect different terminology, hopefully the source code can eventually be 
changed to match (fewer special cases to remember). Perhaps the best thing is 
to always qualify them, e.g. "stream session" and "repair session"?

{quote}
I don't actually see the session uuid being used in the logs except at 
start/finish.
{quote}

Sorry, that was another inadvertent mixing of nodetool messages and trace 
output. {{\[2014-05-13 23:49:52,283] Repair session 
cd6aad80-db1a-11e3-b0e7-f94811c7b860 for range 
(3074457345618258602,-9223372036854775808] finished}} is not a trace, but a 
separate (pre-patch) sendNotification in {{StorageService.java}}. This message 
(and some of the error messages, I think) is redundant when combined with trace 
output. It should have been either one or the other, not both. In the trace 
proper, the session UUID only shows up at the start.

But note: not all nodetool messages are rendered redundant by trace output. 
Since we can't simply suppress all non-trace sendNotification messages, how can 
we unambiguously tell nodetool trace output apart from normal sendNotification 
messages? I'm currently leaning towards just marking all sendNotification trace 
output with a {{TRACE:}} tag.
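A minimal sketch of what that tagging could look like (the class and method names here are illustrative only, not from the patch): the repair side wraps trace messages before calling sendNotification, and nodetool checks for the prefix when rendering.

```java
// Hypothetical sketch: mark trace-originated notifications so nodetool
// can tell them apart from ordinary sendNotification messages.
public final class TraceTag
{
    private static final String TRACE_PREFIX = "TRACE: ";

    // Server side: wrap a trace message before handing it to sendNotification.
    public static String tagTrace(String message)
    {
        return TRACE_PREFIX + message;
    }

    // Nodetool side: decide how to render an incoming notification.
    public static boolean isTrace(String notificationMessage)
    {
        return notificationMessage.startsWith(TRACE_PREFIX);
    }
}
```

This keeps the wire format unchanged (still a plain string notification), at the cost of a convention that both sides must agree on.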

The repair session UUIDs used to be prepended to everything, but were removed 
in [^5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch]. Without 
them, things are less verbose, but it's sometimes hard to unambiguously follow 
traces for concurrent repair sessions. To make the point clearer, I've marked 
each sub-task graphically in the nodetool trace output below (I've 
cross-checked this with the logs, which do retain the UUIDs). If you cover up 
the left side, it's harder to figure out which trace goes with which sub-task. 
Real-world repair traces will probably be even more confusing.

Note: indentation here does not denote nesting; the column roughly indicates 
task identity, though I reuse columns when it's not ambiguous.

{noformat}
1         [2014-05-15 11:31:37,839] Starting repair command #1, repairing 3 
ranges for s1.users (seq=true, full=true)
  x       [2014-05-15 11:31:37,922] Syncing range 
(-3074457345618258603,3074457345618258602]
  x       [2014-05-15 11:31:38,108] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x       [2014-05-15 11:31:38,833] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x       [2014-05-15 11:31:39,953] Received merkle tree for users from 
/127.0.0.2
  x       [2014-05-15 11:31:40,939] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
  x       [2014-05-15 11:31:41,279] Received merkle tree for users from 
/127.0.0.3
  x       [2014-05-15 11:31:42,632] Received merkle tree for users from 
/127.0.0.1
    x     [2014-05-15 11:31:42,671] Syncing range 
(-9223372036854775808,-3074457345618258603]
    x     [2014-05-15 11:31:42,766] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
  x       [2014-05-15 11:31:42,905] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
  x       [2014-05-15 11:31:43,044] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
  x       [2014-05-15 11:31:43,047] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
  x       [2014-05-15 11:31:43,084] Completed sync of range 
(-3074457345618258603,3074457345618258602]
    x     [2014-05-15 11:31:43,251] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
    x     [2014-05-15 11:31:43,422] Received merkle tree for users from 
/127.0.0.2
    x     [2014-05-15 11:31:44,495] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
    x     [2014-05-15 11:31:44,637] Received merkle tree for users from 
/127.0.0.3
    x     [2014-05-15 11:31:45,474] Received merkle tree for users from 
/127.0.0.1
      x   [2014-05-15 11:31:45,494] Syncing range 
(3074457345618258602,-9223372036854775808]
    x     [2014-05-15 11:31:45,499] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for users
    x     [2014-05-15 11:31:45,520] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for users
    x     [2014-05-15 11:31:45,544] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
    x     [2014-05-15 11:31:45,564] Completed sync of range 
(-9223372036854775808,-3074457345618258603]
      x   [2014-05-15 11:31:45,581] Requesting merkle trees for users from 
[/127.0.0.2, /127.0.0.3, /127.0.0.1]
      x   [2014-05-15 11:31:46,214] Received merkle tree for users from 
/127.0.0.2
      x   [2014-05-15 11:31:46,233] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for s1.users
      x   [2014-05-15 11:31:46,459] Received merkle tree for users from 
/127.0.0.3
      x   [2014-05-15 11:31:46,472] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for s1.users
      x   [2014-05-15 11:31:46,545] Received merkle tree for users from 
/127.0.0.1
      x   [2014-05-15 11:31:46,574] Endpoint /127.0.0.3 has 1 range(s) out of 
sync with /127.0.0.1 for users
      x   [2014-05-15 11:31:46,575] Endpoint /127.0.0.2 is consistent with 
/127.0.0.3 for users
      x   [2014-05-15 11:31:46,581] Endpoint /127.0.0.2 has 1 range(s) out of 
sync with /127.0.0.1 for users
      x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges 
with /127.0.0.2
      x   [2014-05-15 11:31:46,601] Performing streaming repair of 1 ranges 
with /127.0.0.3
      x   [2014-05-15 11:31:46,910] Streaming session with /127.0.0.2 prepared.
      x   [2014-05-15 11:31:46,949] Streaming session with /127.0.0.3 prepared.
      x   [2014-05-15 11:31:46,991] 87/87 bytes (100%) received from 
idx:0/127.0.0.3
      x   [2014-05-15 11:31:47,004] 87/87 bytes (100%) received from 
idx:0/127.0.0.2
      x   [2014-05-15 11:31:47,053] Streaming session with /127.0.0.3 completed 
successfully.
      x   [2014-05-15 11:31:47,112] Streaming session with /127.0.0.2 completed 
successfully.
      x   [2014-05-15 11:31:47,131] Completed sync of range 
(3074457345618258602,-9223372036854775808]
1         [2014-05-15 11:31:47,133] Repair command #1 finished
2         [2014-05-15 11:31:47,180] Starting repair command #2, repairing 2 
ranges for system_traces.sessions, system_traces.events (seq=true, full=true)
  x       [2014-05-15 11:31:47,189] Syncing range 
(-3074457345618258603,3074457345618258602]
  x       [2014-05-15 11:31:47,260] Requesting merkle trees for sessions from 
[/127.0.0.3, /127.0.0.1]
  x       [2014-05-15 11:31:47,344] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for system_traces.sessions
  x       [2014-05-15 11:31:47,400] Received merkle tree for sessions from 
/127.0.0.3
  x       [2014-05-15 11:31:47,577] Received merkle tree for sessions from 
/127.0.0.1
  x       [2014-05-15 11:31:47,617] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for sessions
  x       [2014-05-15 11:31:47,743] Requesting merkle trees for events from 
[/127.0.0.3, /127.0.0.1]
  x       [2014-05-15 11:31:48,010] Received merkle tree for events from 
/127.0.0.3
  x       [2014-05-15 11:31:48,027] /127.0.0.3: Sending completed merkle tree 
to /127.0.0.1 for system_traces.events
  x       [2014-05-15 11:31:48,245] Received merkle tree for events from 
/127.0.0.1
    x     [2014-05-15 11:31:48,253] Syncing range 
(3074457345618258602,-9223372036854775808]
    x     [2014-05-15 11:31:48,294] Requesting merkle trees for sessions from 
[/127.0.0.2, /127.0.0.1]
  x       [2014-05-15 11:31:48,313] Endpoint /127.0.0.3 is consistent with 
/127.0.0.1 for events
  x       [2014-05-15 11:31:48,322] Completed sync of range 
(-3074457345618258603,3074457345618258602]
    x     [2014-05-15 11:31:48,372] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for system_traces.sessions
    x     [2014-05-15 11:31:48,424] Received merkle tree for sessions from 
/127.0.0.2
    x     [2014-05-15 11:31:48,582] Received merkle tree for sessions from 
/127.0.0.1
    x     [2014-05-15 11:31:48,613] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for sessions
    x     [2014-05-15 11:31:48,641] Requesting merkle trees for events from 
[/127.0.0.2, /127.0.0.1]
    x     [2014-05-15 11:31:48,765] Received merkle tree for events from 
/127.0.0.2
    x     [2014-05-15 11:31:48,792] /127.0.0.2: Sending completed merkle tree 
to /127.0.0.1 for system_traces.events
    x     [2014-05-15 11:31:48,942] Received merkle tree for events from 
/127.0.0.1
    x     [2014-05-15 11:31:48,973] Endpoint /127.0.0.2 is consistent with 
/127.0.0.1 for events
    x     [2014-05-15 11:31:48,985] Completed sync of range 
(3074457345618258602,-9223372036854775808]
2         [2014-05-15 11:31:48,990] Repair command #2 finished
{noformat}
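One possible middle ground between full UUID prefixes and no prefixes at all: prepend only a shortened form of the session UUID to each trace line. This is just an illustrative sketch (the class name and the 8-character truncation are my assumptions, not anything in the patches):

```java
import java.util.UUID;

// Illustrative sketch: prefix each trace message with a shortened
// session id so concurrent repair sessions can be told apart without
// the verbosity of the full UUID on every line.
public final class SessionPrefix
{
    // Use only the first UUID group (8 hex chars) as a compact marker.
    public static String prefix(UUID sessionId, String message)
    {
        String shortId = sessionId.toString().substring(0, 8);
        return "[" + shortId + "] " + message;
    }
}
```

For example, the trace lines above would read {{\[cd6aad80] Syncing range ...}} rather than carrying the full {{cd6aad80-db1a-11e3-b0e7-f94811c7b860}}, while still letting a reader correlate concurrent sub-tasks.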

Edit: fixed patch link. Also note: I forgot that I rebased lyubent's fa4089e 
onto 33939ca, making its hash 608fb03. Obviously, you won't find that hash in 
any public Cassandra repository. This shouldn't affect patch application, 
though. Sorry for any confusion this may have caused.




> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt, 
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
>  5483-v07-08-Fix-brace-style.patch, 
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
>  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
> 5483-v08-14-Poll-system_traces.events.patch, 
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
> 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch, 
> 5483-v10-17-minor-bugfixes-and-changes.patch, 
> 5483-v10-rebased-and-squashed-471f5cc.patch, 5483-v11-01-squashed.patch, 
> 5483-v11-squashed-nits.patch, 5483-v12-02-cassandra-yaml-ttl-doc.patch, 
> 5483-v13-608fb03-May-14-trace-formatting-changes.patch, ccm-repair-test, 
> cqlsh-left-justify-text-columns.patch, prerepair-vs-postbuggedrepair.diff, 
> test-5483-system_traces-events.txt, 
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
>  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
>  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results like query tracing 
> stores traces to system keyspace. With it, you don't have to lookup each log 
> file to see what was the status and how it performed the repair you invoked. 
> Instead, you can query the repair log with session ID to see the state and 
> stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
