[ https://issues.apache.org/jira/browse/CASSANDRA-5483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005853#comment-14005853 ]
Ben Chan commented on CASSANDRA-5483:
-------------------------------------

{quote}
Is that because the traces are asynchronous? Because I think session 2 only starts after session 1 finishes.
{quote}

Here's my high-level understanding:

* Repair command #1, #2, etc. are serial.
* Each repair session ("Syncing range ...") is technically concurrent, since each is submitted to a ThreadPoolExecutor.
** However, differencing is serialized, so if there is no streaming going on, you won't see very much overlap between the sessions, except at the beginning and end (which is exactly what we see with these simple tests).
** Conversely, this means you will see much more interleaving when heavy streaming is going on. So at the very least, it might be good to eventually disambiguate the streaming portion.

{quote}
The easiest thing would be to make them non-redundant. Can we make the tracing "extra detail" on top of the normal ones instead of competing with them?
{quote}

I think it may be a conceptual block on my part; I tend to think of traces as a kind of profiling mechanism.

* Most of the sendNotification calls in StorageService#createRepairTask report errors from the results of RepairFuture objects. The timing on those is not really useful for profiling; they're not what I'd usually think of as a trace.
* Some are request-validation reporting before the repair proper even starts.
* The rest are informational sendNotification messages which are redundant when tracing is active (this is the easy case).

In pseudocode:

{noformat}
if (some error #1 in repair request)
    sendNotification("NO #1!");
if (some error #2 in repair request)
    sendNotification("NO #2!");

for (r : ranges)
{
    f = something.submitRepairSession(new RepairSession(r));
    futures.add(f);
    try
    {
        // this serializes the differencing part
        f.waitForDifferencing();
    }
    catch (SomeException e)
    {
        // handle, sendNotification
    }
}

try
{
    for (f : futures)
    {
        r = f.get();
        sendNotification("done: %s", r);
    }
}
catch (ExecutionException ee)
{
    // handle, sendNotification
}
catch (Exception e)
{
    // handle, sendNotification
}
{noformat}

The main point is that I can't be sure every single interesting exception is caught and traced in the thread where it's thrown, then rethrown. Most likely this is not the case, and some exceptions are only reported at the StorageService#createRepairTask level. I believe most (?) cases are already caught and traced, though.

So after going through all that, I'm thinking that the easiest thing is to accept the possibility of redundancy and delayed reporting, and just trace every sendNotification in StorageService#createRepairTask (unless it's demonstrably redundant, or already being traced through some other mechanism).

> Repair tracing
> --------------
>
>                 Key: CASSANDRA-5483
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5483
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Tools
>            Reporter: Yuki Morishita
>            Assignee: Ben Chan
>            Priority: Minor
>              Labels: repair
>         Attachments: 5483-full-trunk.txt,
> 5483-v06-04-Allow-tracing-ttl-to-be-configured.patch,
> 5483-v06-05-Add-a-command-column-to-system_traces.events.patch,
> 5483-v06-06-Fix-interruption-in-tracestate-propagation.patch,
> 5483-v07-07-Better-constructor-parameters-for-DebuggableThreadPoolExecutor.patch,
> 5483-v07-08-Fix-brace-style.patch,
> 5483-v07-09-Add-trace-option-to-a-more-complete-set-of-repair-functions.patch,
> 5483-v07-10-Correct-name-of-boolean-repairedAt-to-fullRepair.patch,
> 5483-v08-11-Shorten-trace-messages.-Use-Tracing-begin.patch,
> 5483-v08-12-Trace-streaming-in-Differencer-StreamingRepairTask.patch,
> 5483-v08-13-sendNotification-of-local-traces-back-to-nodetool.patch,
> 5483-v08-14-Poll-system_traces.events.patch,
> 5483-v08-15-Limit-trace-notifications.-Add-exponential-backoff.patch,
> 5483-v09-16-Fix-hang-caused-by-incorrect-exit-code.patch,
> 5483-v10-17-minor-bugfixes-and-changes.patch,
> 5483-v10-rebased-and-squashed-471f5cc.patch,
> 5483-v11-01-squashed.patch,
> 5483-v11-squashed-nits.patch,
> 5483-v12-02-cassandra-yaml-ttl-doc.patch,
> 5483-v13-608fb03-May-14-trace-formatting-changes.patch,
> ccm-repair-test,
> cqlsh-left-justify-text-columns.patch,
> prerepair-vs-postbuggedrepair.diff,
> test-5483-system_traces-events.txt,
> trunk@4620823-5483-v02-0001-Trace-filtering-and-tracestate-propagation.patch,
> trunk@4620823-5483-v02-0002-Put-a-few-traces-parallel-to-the-repair-logging.patch,
> tr...@8ebeee1-5483-v01-001-trace-filtering-and-tracestate-propagation.txt,
> tr...@8ebeee1-5483-v01-002-simple-repair-tracing.txt,
> v02p02-5483-v03-0003-Make-repair-tracing-controllable-via-nodetool.patch,
> v02p02-5483-v04-0003-This-time-use-an-EnumSet-to-pass-boolean-repair-options.patch,
> v02p02-5483-v05-0003-Use-long-instead-of-EnumSet-to-work-with-JMX.patch
>
> I think it would be nice to log repair stats and results the way query tracing
> stores traces in the system keyspace. With it, you don't have to look up each
> log file to see what the status was and how the repair you invoked performed.
> Instead, you can query the repair log with a session ID to see the state and
> stats of all nodes involved in that repair session.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
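A side note on the "just trace every sendNotification" approach discussed above: the redundancy is easier to live with if both reporting paths go through a single helper, so a message can never be notified over JMX but silently missing from the trace (or vice versa). The sketch below is purely illustrative; `RepairReporter`, its `Sink` interface, and the method names are hypothetical stand-ins, not the actual StorageService or Tracing APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: funnel every repair progress message through one
// helper that emits both the progress notification and, when tracing is
// enabled, the corresponding trace entry.
public class RepairReporter {
    public interface Sink { void accept(String message); }

    private final Sink notifications; // stands in for sendNotification(...)
    private final Sink traces;        // stands in for a Tracing.trace(...) call
    private final boolean tracingEnabled;

    public RepairReporter(Sink notifications, Sink traces, boolean tracingEnabled) {
        this.notifications = notifications;
        this.traces = traces;
        this.tracingEnabled = tracingEnabled;
    }

    // Call sites use this instead of calling sendNotification directly, so the
    // two paths cannot drift apart as messages are added or reworded.
    public void report(String format, Object... args) {
        String message = String.format(format, args);
        notifications.accept(message);
        if (tracingEnabled)
            traces.accept(message);
    }

    public static void main(String[] args) {
        List<String> notified = new ArrayList<>();
        List<String> traced = new ArrayList<>();
        RepairReporter reporter = new RepairReporter(notified::add, traced::add, true);

        reporter.report("Syncing range %s", "(0,100]");
        reporter.report("done: %s", "session 1");

        // With tracing on, both sinks received identical messages.
        System.out.println(notified.equals(traced));
    }
}
```

This doesn't address the delayed-reporting caveat (exceptions that only surface at the createRepairTask level are still reported late), but it does keep the notification stream and the trace stream consistent with each other.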