[ 
https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922721#comment-13922721
 ] 

Ben Chan commented on CASSANDRA-5483:
-------------------------------------

It was more involved than I thought, partly because of heisenbugs and the trace 
state mysteriously not propagating (see {{v06-06}}).

Note: changing a JMX interface can cause mysterious errors unless you rebuild 
with {{ant clean && ant}}. I ran into the same kinds of stack traces as you 
did. The failure is not consistent: sometimes I can make a JMX change and run 
plain {{ant}} with no problem.

To make the patches simpler to apply, I'm posting full repro code. I also tried 
to simplify the naming. Unfortunately, all the previous patches sort in a 
jumbled order because of their naming convention. Fortunately, JIRA seems to 
have an easter egg where you can choose the attachment name by changing the URL.
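
The renaming trick can be sketched as a small helper. This is only an illustration: the base URL and the attachment ID come from the script in this comment, while the function names {{attachment_url}} and {{fetch_attachment}} are hypothetical, and the idea that JIRA looks up the attachment by its numeric ID and ignores the trailing name is an assumption based on the observed behavior.

```shell
#!/bin/sh
# Base URL as used in the repro script in this comment.
W=https://issues.apache.org/jira/secure/attachment

# attachment_url ID NAME (hypothetical helper)
# Prints the download URL for attachment ID with NAME as the file name.
# Assumption: JIRA resolves the attachment by ID, so NAME can be chosen freely.
attachment_url() {
  printf '%s/%s/%s\n' "$W" "$1" "$2"
}

# fetch_attachment ID NAME (hypothetical helper)
# Downloads the attachment to ./NAME unless a file of that name already exists.
fetch_attachment() {
  [ -e "$2" ] || curl -sSf -o "$2" "$(attachment_url "$1" "$2")"
}
```

For example, {{attachment_url 12633156 ccm-repair-test}} prints {{https://issues.apache.org/jira/secure/attachment/12633156/ccm-repair-test}}.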

{noformat}
# Uncomment to exactly reproduce state.
#git checkout -b 5483-e30d6dc e30d6dc

# Download all needed patches with consistent names, apply patches, build.
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12630490/5483-v02-01-Trace-filtering-and-tracestate-propagation.patch \
  $W/12630491/5483-v02-02-Put-a-few-traces-parallel-to-the-repair-logging.patch \
  $W/12631967/5483-v03-03-Make-repair-tracing-controllable-via-nodetool.patch \
  $W/12633153/5483-v06-04-Allow-tracing-ttl-to-be-configured.patch \
  $W/12633154/5483-v06-05-Add-a-command-column-to-system_traces.events.patch \
  $W/12633155/5483-v06-06-Fix-interruption-in-tracestate-propagation.patch \
  $W/12633156/ccm-repair-test
do [ -e "$(basename "$url")" ] || curl -sO "$url"; done &&
git apply 5483-v0[236]-*.patch &&
ant clean && ant

# put on a separate line because you should at least minimally inspect
# arbitrary code before running.
chmod +x ./ccm-repair-test && ./ccm-repair-test
{noformat}

{{ccm-repair-test}} has some options for convenience:
{noformat}
-k keep (don't delete) the created cluster after successful exit.
-r repair only
-R don't repair
-t do traced repair only
-T don't do traced repair (if neither, then do both traced and untraced repair)
{noformat}

The output of a test run:

{noformat}
Current cluster is now: test-5483-QiR
[2014-03-06 10:46:13,617] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:13,646] Starting repair command #1, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:16,999] Repair session 72648190-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:17,465] Repair session 73ee2ed0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:17,465] Repair command #1 finished
[2014-03-06 10:46:17,485] Starting repair command #2, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:18,782] Repair session 74aaef20-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:18,816] Repair session 74ff0290-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:18,816] Repair command #2 finished
0 rows exported in 0.015 seconds.
test-5483-QiR-system_traces-events.txt
ok
[2014-03-06 10:46:24,128] Nothing to repair for keyspace 'system'
[2014-03-06 10:46:24,166] Starting repair command #3, repairing 2 ranges for keyspace s1 (seq=true, full=true)
[2014-03-06 10:46:25,366] Repair session 78a6d4e0-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:25,415] Repair session 79263e10-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:25,415] Repair command #3 finished
[2014-03-06 10:46:25,485] Starting repair command #4, repairing 2 ranges for keyspace system_traces (seq=true, full=true)
[2014-03-06 10:46:27,077] Repair session 796f7c10-a546-11e3-a5f4-f94811c7b860 for range (-3074457345618258603,3074457345618258602] finished
[2014-03-06 10:46:27,120] Repair session 79f240a0-a546-11e3-a5f4-f94811c7b860 for range (3074457345618258602,-9223372036854775808] finished
[2014-03-06 10:46:27,120] Repair command #4 finished
48 rows exported in 0.104 seconds.
test-5483-QiR-system_traces-events-tr.txt
found source: 127.0.0.1
found thread: Thread-15
found thread: AntiEntropySessions:1
found thread: RepairJobTask:1
found source: 127.0.0.2
found thread: AntiEntropyStage:1
found source: 127.0.0.3
found thread: AntiEntropySessions:2
found thread: Thread-16
found thread: AntiEntropySessions:3
found thread: AntiEntropySessions:4
unique sources traced: 3
unique threads traced: 8
All thread categories accounted for
ok
{noformat}
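
As a rough cross-check of the summary lines above, the distinct sources and threads can be recounted from a saved copy of the output. This is a sketch, not the logic {{ccm-repair-test}} itself uses (that script is not shown here); the function name {{summarize_found}} is hypothetical, and it assumes the {{found source:}} and {{found thread:}} lines look exactly as in the run above.

```shell
#!/bin/sh
# summarize_found FILE (hypothetical helper)
# Counts the distinct values on "found source:" and "found thread:" lines
# in a saved copy of the test output and prints a two-line summary.
summarize_found() {
  awk '
    /^found source: / { sources[$3] = 1 }   # third field is the address
    /^found thread: / { threads[$3] = 1 }   # third field is the thread name
    END {
      ns = 0; for (s in sources) ns++
      nt = 0; for (t in threads) nt++
      printf "unique sources traced: %d\n", ns
      printf "unique threads traced: %d\n", nt
    }
  ' "$1"
}
```

Run against the output above, this counts 3 sources and 8 threads, matching the figures the test reported. Note that it deduplicates thread names across all sources, which may differ from how the real script counts.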

---

Patch comments:

- {{v06-04}} I did something similar to {{v03-03}}, (almost) no refactoring. 
The implementation is a little messy architecturally.
- {{v06-05}} This is the suggestion you had to add a "command" column. I don't 
know how to make it the last column. At least on my box, it's column 5 of 7 
even though I put it last in the CQL. Note that {{ccm-repair-test}}'s checking 
code will break if the column order changes.
- {{v06-06}} You need to submit {{Runnable}} instances, etc. using 
{{DebuggableThreadPoolExecutor}} if you want them to inherit tracestate. 
Tracestate propagation is very easy to break under concurrency, so this is 
probably the first thing to check if propagation ever stops working again.


> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
> ccm-repair-test, test-5483-system_traces-events.txt, 
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
>  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
>  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results like query tracing 
> stores traces to system keyspace. With it, you don't have to lookup each log 
> file to see what was the status and how it performed the repair you invoked. 
> Instead, you can query the repair log with session ID to see the state and 
> stats of all nodes involved in that repair session.


