[ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922721#comment-13922721 ]
Ben Chan commented on CASSANDRA-5483:
-------------------------------------

It was more involved than I thought, partly because of heisenbugs and the trace state mysteriously not propagating (see {{v06-06}}).

Note: changing the JMX interface can cause mysterious errors unless you run {{ant clean && ant}}. I ran into the same kinds of stack traces as you did. It's not consistent: sometimes I can make a JMX change and run plain {{ant}} with no problem.

To make patches simpler, I'm posting full repro code. I also tried to simplify the naming. Unfortunately, all the previous patches are in jumbled order due to a naming convention that doesn't sort. Fortunately, JIRA seems to have an easter egg where you can choose the attachment name by changing the URL.

{noformat}
# Uncomment to exactly reproduce state.
#git checkout -b 5483-e30d6dc e30d6dc

# Download all needed patches with consistent names, apply patches, build.
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
  $W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
  $W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
  $W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
  $W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
  $W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
  $W/12633156/ccm-repair-test
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v0[236]-*.patch &&
ant clean && ant

# put on a separate line because you should at least minimally inspect
# arbitrary code before running.
chmod +x ./ccm-repair-test && ./ccm-repair-test
{noformat}

{{ccm-repair-test}} has some options for convenience:

{noformat}
-k keep (don't delete) the created cluster after successful exit.
-r repair only
-R don't repair
-t do traced repair only
-T don't do traced repair
   (if neither, then do both traced and untraced repair)
{noformat}

The output of a test run:

{noformat}
Current cluster is now: test-5483-QiR
[2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:17,465] Repair command #1 finished
[2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:18,816] Repair command #2 finished
0 rows exported in 0.015 seconds.
test-5483-QiR-system_traces-events.txt
ok
[2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:25,415] Repair command #3 finished
[2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:27,120] Repair command #4 finished
48 rows exported in 0.104 seconds.
test-5483-QiR-system_traces-events-tr.txt
found source: 127.0.0.1
found thread: Thread-15
found thread: AntiEntropySessions:1
found thread: RepairJobTask:1
found source: 127.0.0.2
found thread: AntiEntropyStage:1
found source: 127.0.0.3
found thread: AntiEntropySessions:2
found thread: Thread-16
found thread: AntiEntropySessions:3
found thread: AntiEntropySessions:4
unique sources traced: 3
unique threads traced: 8
All thread categories accounted for
ok
{noformat}

---

Patch comments:

- {{v06-04}} I did something similar to {{v03-03}}, (almost) no refactoring. The implementation is a little messy architecturally.
- {{v06-05}} This is the suggestion you had to add a "command" column. I don't know how to make it the last column. At least on my box, it's column 5 of 7 despite me putting it last in the cql. Note that {{ccm-repair-test}}'s checking code will break if the column order changes.
- {{v06-06}} You need to submit {{Runnable}}s, etc.
  using {{DebuggableThreadPoolExecutor}} if you want them to inherit tracestate. Tracestate propagation is very easy to break under concurrency, so this is probably the first thing to check if it ever happens again.

> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> ccm-repair-test, test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results like query tracing
> stores traces to system keyspace. With it, you don't have to lookup each log
> file to see what was the status and how it performed the repair you invoked.
> Instead, you can query the repair log with session ID to see the state and
> stats of all nodes involved in that repair session.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
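
For readers unfamiliar with why the {{v06-06}} note matters: state held in a {{ThreadLocal}} is not visible to a pool's worker threads unless something carries it across at submission time. The following is a minimal, generic sketch of that idea only. It is not Cassandra's code; {{TraceStatePropagationSketch}}, {{TRACE_ID}}, and {{propagating}} are hypothetical names invented for illustration, and the real trace state is richer than a string.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TraceStatePropagationSketch {
    // Hypothetical stand-in for per-thread trace state (assumption: a plain string id).
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Capture the submitting thread's state now, install it in the worker
    // thread for the duration of the task, then restore the worker's old value.
    static Runnable propagating(Runnable task) {
        final String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                TRACE_ID.set(previous);
            }
        };
    }

    // Returns what a worker thread sees: [0] for a plain submit, [1] for a wrapped one.
    static String[] demo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set("repair-session-1");
            String[] seen = new String[2];
            // Plain submit: the worker thread has no trace state of its own.
            pool.submit(() -> { seen[0] = TRACE_ID.get(); }).get();
            // Wrapped submit: the caller's state is carried into the worker.
            pool.submit(propagating(() -> { seen[1] = TRACE_ID.get(); })).get();
            return seen;
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] seen = demo();
        System.out.println("plain submit:   " + seen[0]);
        System.out.println("wrapped submit: " + seen[1]);
    }
}
```

In this sketch the plain submit observes {{null}} while the wrapped submit observes the caller's id. Any code path that hands work to an executor without such a wrapper silently loses the state, which matches the kind of breakage described in the {{v06-06}} comment.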