[jira] [Commented] (CASSANDRA-5483) Repair tracing

Ben Chan (JIRA) Mon, 17 Mar 2014 09:20:13 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937976#comment-13937976 ]


Ben Chan commented on CASSANDRA-5483:
-------------------------------------

{noformat}
# tested with branch 5483 @ bce0c2c555a3; should also work following successful
#git apply 5483-full-trunk.txt
W=https://issues.apache.org/jira/secure/attachment
for url in \
  $W/12635094/5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch \
  $W/12635095/5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch \
  $W/12635096/5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch \
  $W/12635097/5483-v08-14-Poll-system_traces.events.patch \
  $W/12635098/5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch
do [ -e $(basename $url) ] || curl -sO $url; done &&
git apply 5483-v08-*.patch &&
ant clean && ant

./ccm-repair-test -kR &&
ccm node1 stop &&
ccm node1 clear &&
ccm node1 start &&
./ccm-repair-test -rt
{noformat}

* {{v08-11}} There was an error in one of the log formats in Differencer, which 
made my grep for "out of sync" in the logs fruitless.
* {{v08-12}} I ended up using the handleStreamEvent of StreamingRepairTask 
instead of implementing and registering my own StreamEventHandler. The new 
trace messages may need adjusting, especially for ProgressEvent, which is 
essentially just a toString currently.
* {{v08-13}} This works by adding a guarded sendNotification to 
TraceState#trace.
* {{v08-14}} This works by starting a thread to poll {{system_traces.events}}, 
and by adding notify functionality to TraceState. There is some jitter in the 
ordering between local and remote traces. An easy fix would be to have the 
query thread handle all sendNotification of traces, but that means accepting 
some latency in sendNotification of local traces in exchange for better 
ordering. It might even be necessary to delay all trace sendNotification by a 
few seconds to make it more likely that remote traces have arrived.
* {{v08-15}} Adds even more TraceState functionality, all to reduce the amount 
of polling without hurting latency too much. Only a few local traces are 
expected to be followed by a remote trace, so the polling thread only wakes up 
for those, then polls with an exponential backoff after each notification.
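
As a rough illustration of the backoff idea in {{v08-15}}, the sketch below computes the sequence of poll delays that would follow a single notification. It is not the code from the patch: the class name {{BackoffSketch}}, the method {{delaysAfterNotification}}, and the 100 ms / 1000 ms bounds are all hypothetical, and the doubling factor is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of capped exponential backoff; not taken from the v08-15 patch. */
class BackoffSketch {
    /**
     * Returns the delays (in ms) to wait before each of the next {@code polls}
     * polls after a notification: start at minMs, double each time, cap at maxMs.
     */
    static List<Long> delaysAfterNotification(long minMs, long maxMs, int polls) {
        List<Long> delays = new ArrayList<>();
        long delay = minMs;
        for (int i = 0; i < polls; i++) {
            delays.add(delay);
            // Assumed growth factor of 2; the cap keeps the poller from going fully idle.
            delay = Math.min(delay * 2, maxMs);
        }
        return delays;
    }

    public static void main(String[] args) {
        // A real poller would sleep for each delay, then query system_traces.events.
        System.out.println(delaysAfterNotification(100, 1000, 6));
    }
}
```

A new notification would presumably reset the delay to the minimum, so polling is frequent right after a local trace that is likely to be followed by remote traces and tapers off otherwise.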

---

Heuristics are messy, and I expect plenty of opinions on {{v08-14}} and 
{{v08-15}}. I'm not especially proud of that code, but I can't think of 
anything better at the moment, given the (self-imposed?) constraints.

I may have reinvented the wheel with synchronization primitives. I checked 
{{java.util.concurrent.*}} and {{SimpleCondition}}, but not much beyond that. I 
could have missed something; I don't fully understand some of the classes. What 
I wanted was to be woken up (with a timeout) if anything has changed since the 
last time I checked. Theoretically, it should work for multiple consumers (as 
long as no one waits for longer than {{Integer.MAX_VALUE}} updates), though 
multi-consumer support isn't really necessary here and could be dropped if 
that would simplify the code.
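
For readers trying to picture the "wake me up, with a timeout, if anything changed since I last checked" behaviour described above, here is a minimal sketch built on an update counter and a plain monitor. It is an assumption about one way to do it, not the code in the patches: {{UpdateMonitor}}, {{signalUpdate}}, {{current}}, and {{awaitChange}} are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a change-notification monitor; not the code from the patches. */
class UpdateMonitor {
    private int updates = 0; // guarded by this; allowed to wrap around

    /** Called by the producer whenever something has changed. */
    public synchronized void signalUpdate() {
        updates++;
        notifyAll(); // wake every waiting consumer
    }

    /** Returns the current counter value so a consumer can remember what it last saw. */
    public synchronized int current() {
        return updates;
    }

    /**
     * Blocks until the counter differs from lastSeen or timeoutMs elapses,
     * then returns the current counter value. Each consumer tracks its own
     * lastSeen, so several consumers can share one monitor.
     */
    public synchronized int awaitChange(int lastSeen, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (updates == lastSeen) {
            long remainingNs = deadline - System.nanoTime();
            if (remainingNs <= 0)
                break; // timed out with no change
            try {
                TimeUnit.NANOSECONDS.timedWait(this, remainingNs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        return updates;
    }
}
```

In this sketch a consumer compares counter values for inequality, so it would only miss a change if the counter wrapped all the way back to its remembered value between two checks, a limit similar in spirit to the one mentioned above.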

The code seems to work reasonably well for small-scale tests. I can convince 
myself that it won't blow up for long repairs, but haven't done a full test yet.


> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt, 
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch, 
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch, 
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch, 
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
>  5483-v07-08-Fix-brace-style.patch, 
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
>  5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch, 
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch, 
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch, 
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch, 
> 5483-v08-14-Poll-system_traces.events.patch, 
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch, 
> ccm-repair-test, cqlsh-left-justify-text-columns.patch, 
> test-5483-system_traces-events.txt, 
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch, 
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
>  tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt, 
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt, 
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch, 
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
>  v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
>
> I think it would be nice to log repair stats and results the way query 
> tracing stores traces in the system keyspace. With that, you don't have to 
> look up each log file to see the status of the repair you invoked and how it 
> performed. Instead, you can query the repair log by session ID to see the 
> state and stats of all nodes involved in that repair session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
