Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> Hi -
>
> On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> > [...]
> > > All this complexity is to be justified by keeping the raw prev/next
> > > pointers from being sent to a naive tracer?  It seems to me way out of
> > > proportion.
> >
> > Damn, and I just blew away all my marker code for something like this ;-)
>
> Sorry! :-)
>
> > [...]
> > We have in sched.c the following marker:
> >
> >   trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);
>
> Fine so far!
>
> > Then Mathieu can add in some code somewhere (or a module, or something)
> >
> >   ret = marker_probe_register("kernel_sched_scheduler",
> >                               "prev %p next %p",
> >                               pretty_print_sched_switch, NULL);
> >
> > static void pretty_print_sched_switch(const struct marker *mdata,
> >                                       void *private_data,
> >                                       const char *format, ...)
> > {
> >         [...]
> >         trace_mark(kernel_pretty_print_sched_switch,
> >                    "prev_pid %d next_pid %d prev_state %ld",
> >                    prev->pid, next->pid, prev->state);
> > }
>
> That marker_probe_register call would need to be done only when the
> embedded (k_p_p_s_s) marker is actually being used.  Otherwise we'd
> lose all the savings of a dormant sched.c marker by always calling
> into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
> was active.
>
> In any case, if the naive tracer agrees to become educated about some
> of these markers in the form of intermediary functions like that, they
> need not insist on a second hop through marker territory anyway:
>
> static void pretty_print_sched_switch(const struct marker *mdata,
>                                       void *private_data,
>                                       const char *format, ...)
> {
>         [...]
>         lttng_backend_trace(kernel_pretty_print_sched_switch,
>                             "prev_pid %d next_pid %d prev_state %ld",
>                             prev->pid, next->pid, prev->state);
> }

Oh! perfect then :) Since I already planned my ltt-marker-control kernel
module to connect specialized callbacks instead of the dumb one, it
shouldn't be so hard to do.

I would just have to find another way to declare the trace events (it's
currently embedded in the markers), but it's not a showstopper.

I'll try this.

Thanks to you both for the good proposals,

Mathieu

> - FChE

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> [...]
> > All this complexity is to be justified by keeping the raw prev/next
> > pointers from being sent to a naive tracer?  It seems to me way out of
> > proportion.
>
> Damn, and I just blew away all my marker code for something like this ;-)

Sorry! :-)

> [...]
> We have in sched.c the following marker:
>
>   trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);

Fine so far!

> Then Mathieu can add in some code somewhere (or a module, or something)
>
>   ret = marker_probe_register("kernel_sched_scheduler",
>                               "prev %p next %p",
>                               pretty_print_sched_switch, NULL);
>
> static void pretty_print_sched_switch(const struct marker *mdata,
>                                       void *private_data,
>                                       const char *format, ...)
> {
>         [...]
>         trace_mark(kernel_pretty_print_sched_switch,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
> }

That marker_probe_register call would need to be done only when the
embedded (k_p_p_s_s) marker is actually being used.  Otherwise we'd
lose all the savings of a dormant sched.c marker by always calling
into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
was active.

In any case, if the naive tracer agrees to become educated about some
of these markers in the form of intermediary functions like that, they
need not insist on a second hop through marker territory anyway:

static void pretty_print_sched_switch(const struct marker *mdata,
                                      void *private_data,
                                      const char *format, ...)
{
        [...]
        lttng_backend_trace(kernel_pretty_print_sched_switch,
                            "prev_pid %d next_pid %d prev_state %ld",
                            prev->pid, next->pid, prev->state);
}

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Fri, 18 Jan 2008, Frank Ch. Eigler wrote:
>
> All this complexity is to be justified by keeping the raw prev/next
> pointers from being sent to a naive tracer?  It seems to me way out of
> proportion.

Damn, and I just blew away all my marker code for something like this ;-)

Actually, you just gave me a great idea that I think can help all of us.
OK, Mathieu may not be in total agreement, but I think this is the
ultimate compromise.

We have in sched.c the following marker:

  trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);

Then Mathieu can add in some code somewhere (or a module, or something)

  ret = marker_probe_register("kernel_sched_scheduler",
                              "prev %p next %p",
                              pretty_print_sched_switch, NULL);

static void pretty_print_sched_switch(const struct marker *mdata,
                                      void *private_data,
                                      const char *format, ...)
{
        va_list ap;
        struct task_struct *prev;
        struct task_struct *next;

        va_start(ap, format);
        prev = va_arg(ap, typeof(prev));
        next = va_arg(ap, typeof(next));
        va_end(ap);

        trace_mark(kernel_pretty_print_sched_switch,
                   "prev_pid %d next_pid %d prev_state %ld",
                   prev->pid, next->pid, prev->state);
}

Then LTTng on startup could arm the normal kernel_sched_switch code and
have the user see the nice one.  All without adding any more goo or
overhead to the non tracing case, and keeping a few critical markers with
enough information to be useful to other tracers!

Thoughts?

-- Steve
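[Editorial note] The callback pattern in the message above can be exercised outside the kernel. The following is a hedged, userspace-only sketch: a probe with the varargs marker-callback signature decodes the raw prev/next pointers with va_arg and re-emits a pretty-printed event. `struct task_struct`, `fire_sched_switch_marker()`, and `last_event` are simplified stand-ins invented for illustration, not the kernel's marker API.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct task_struct {            /* stand-in for the kernel's struct */
    int pid;
    long state;
};

static char last_event[128];    /* captures the "pretty" event for inspection */

/* Probe with the marker-callback shape from the mail: it receives the
 * format string plus the raw arguments of the "prev %p next %p" marker. */
static void pretty_print_sched_switch(void *private_data,
                                      const char *format, ...)
{
    va_list ap;
    struct task_struct *prev, *next;

    (void)private_data;
    va_start(ap, format);
    prev = va_arg(ap, struct task_struct *);
    next = va_arg(ap, struct task_struct *);
    va_end(ap);                 /* note: va_end(ap), not a bare "va_end;" */

    /* Stands in for the nested trace_mark() of the pretty event. */
    snprintf(last_event, sizeof(last_event),
             "prev_pid %d next_pid %d prev_state %ld",
             prev->pid, next->pid, prev->state);
}

/* Stand-in for an armed trace_mark() site invoking its registered probe. */
static void fire_sched_switch_marker(struct task_struct *prev,
                                     struct task_struct *next)
{
    pretty_print_sched_switch(NULL, "prev %p next %p", prev, next);
}
```

The point of the sketch is only the hand-off: the cheap marker passes pointers, and the dereferencing happens in the probe, which runs only when the marker is armed.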
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 06:19:29PM -0500, Mathieu Desnoyers wrote:
> [...]
> Almost.. I would add :
>
> static int trace_switch_to_enabled;
>
> > static inline trace_switch_to(struct task_struct *prev,
> >                               struct task_struct *next)
> > {
>         if (likely(!trace_switch_to_enabled))
>                 return;
> >         trace_mark(kernel_schedule,
> >                    "prev_pid %d next_pid %d prev_state %ld",
> >                    prev->pid, next->pid, prev->state);
> >
> >         trace_context_switch(prev, next);
> > }
>
> And some code to activate the trace_switch_to_enabled variable (ideally
> keeping a refcount). [...]

All this complexity is to be justified by keeping the raw prev/next
pointers from being sent to a naive tracer?  It seems to me way out of
proportion.

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 05:49:19PM -0500, Steven Rostedt wrote:
> [...]
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
> [...]
> With the above, we could have this (if this is what I think you are
> recommending). [...]
>
> static inline trace_switch_to(struct task_struct *prev,
>                               struct task_struct *next)
> {
>         trace_mark(kernel_schedule,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
>
>         trace_context_switch(prev, next);
> }

I'm afraid I don't see the point in this.  You could use one marker for
all that data (and force the more naive tracer callbacks to ignore some
of it).  You could even use two markers (and force the more naive
tracer to attach only to its favorite subset).

But to use a second, different, less efficient, not more configurable
tracing hook mechanism in the same logical spot makes no sense to me.

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
> >
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
>
>  - schedule context switch
>  - try_to_wake_up
>  - hard_irqs_off (which is already there for lockdep)
>  - hard irqs on (also for lockdep)
>  - lock_contention (already in for the lock contention code)
>  - lock acquire (also in there for contention code)
>
> With the above, we could have this (if this is what I think you are
> recommending).  For example in the context_switch case:
>
>   trace_switch_to(prev, next);
>   switch_to(prev, next, prev);
>
> and in sched.h I could have:

Almost.. I would add :

static int trace_switch_to_enabled;

> static inline trace_switch_to(struct task_struct *prev,
>                               struct task_struct *next)
> {
        if (likely(!trace_switch_to_enabled))
                return;
>         trace_mark(kernel_schedule,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
>
>         trace_context_switch(prev, next);
> }

And some code to activate the trace_switch_to_enabled variable (ideally
keeping a refcount).

By doing this, we would have the minimum impact on the scheduler when
disabled.

But remember that this trace_switch_to_enabled could be enabled for both
markers and your tracer, so you might need to put a branch at the
beginning of trace_context_switch() too.

Mathieu

> and have the trace_context_switch code be something that is turned on with
> the latency tracing utility (config option).  That way production code can
> keep it off.
>
> -- Steve

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
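[Editorial note] The enable-flag-plus-refcount idea above can be modeled in plain C. This is an illustrative, userspace sketch only: real kernel code would need `likely()`, atomic refcounting, and locking, and `trace_switch_to_ref()`/`trace_switch_to_unref()` are invented names standing in for whatever registration code bumps the count.

```c
static int trace_switch_to_enabled;   /* refcount of attached tracers */
static int events_traced;             /* observable side effect for the demo */

/* Registration code would bump/drop this (atomically, in the kernel). */
static void trace_switch_to_ref(void)   { trace_switch_to_enabled++; }
static void trace_switch_to_unref(void) { trace_switch_to_enabled--; }

/* The hook a call site would inline: a single branch when dormant. */
static void trace_switch_to(int prev_pid, int next_pid)
{
    if (!trace_switch_to_enabled)     /* kernel code would wrap this in likely() */
        return;
    (void)prev_pid;
    (void)next_pid;
    events_traced++;                  /* stands in for trace_mark() + tracer call */
}
```

The design point is that the dormant cost is one predictable branch on a flag, not a call into the marker machinery; the refcount lets markers and the latency tracer share the same flag.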
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
>
> But I have not seen a lot of situations where that kind of glue-code was
> needed, so I think it makes sense to keep markers simple to use and
> efficient for the common case.
>
> Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> tracers.

I'm almost done with the latency tracer work, and there are only a total
of 6 hooks that I needed.

 - schedule context switch
 - try_to_wake_up
 - hard_irqs_off (which is already there for lockdep)
 - hard irqs on (also for lockdep)
 - lock_contention (already in for the lock contention code)
 - lock acquire (also in there for contention code)

With the above, we could have this (if this is what I think you are
recommending).  For example in the context_switch case:

  trace_switch_to(prev, next);
  switch_to(prev, next, prev);

and in sched.h I could have:

static inline trace_switch_to(struct task_struct *prev,
                              struct task_struct *next)
{
        trace_mark(kernel_schedule,
                   "prev_pid %d next_pid %d prev_state %ld",
                   prev->pid, next->pid, prev->state);

        trace_context_switch(prev, next);
}

and have the trace_context_switch code be something that is turned on with
the latency tracing utility (config option).  That way production code can
keep it off.

-- Steve
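[Editorial note] The wrapper above fans one hook site out to two consumers: the generic marker and a dedicated in-kernel tracer. A hedged, userspace sketch of that dual dispatch follows; all names are illustrative stand-ins (in the kernel, the tracer call would be compiled out by a config option).

```c
static int marker_events;   /* events seen by the generic marker path */
static int tracer_events;   /* events seen by the dedicated tracer path */

static void trace_mark_sched(int prev_pid, int next_pid, long prev_state)
{
    (void)prev_pid; (void)next_pid; (void)prev_state;
    marker_events++;        /* stands in for trace_mark(kernel_schedule, ...) */
}

static void trace_context_switch(int prev_pid, int next_pid)
{
    (void)prev_pid; (void)next_pid;
    tracer_events++;        /* stands in for the latency tracer hook */
}

/* The inline wrapper from the mail: one call site, two consumers. */
static void trace_switch_to(int prev_pid, int next_pid, long prev_state)
{
    trace_mark_sched(prev_pid, next_pid, prev_state);
    trace_context_switch(prev_pid, next_pid);
}
```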
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
>
> On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:
>
> > Hi -
> >
> > On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > > [...]
> > > +       trace_mark(kernel_sched_schedule,
> > > +                  "prev_pid %d next_pid %d prev_state %ld",
> > > +                  prev->pid, next->pid, prev->state);
> > > [...]
> > > But...
> > >
> > > Tracers that want to do a bit more work, like recording timings and seeing
> > > if we hit some max somewhere, can't do much with that pretty print data.
> >
> > If you find yourself wanting to perform computations like finding
> > maxima, or responding right there as opposed to later during userspace
> > trace data extraction, then you're trending toward a tool like
> > systemtap.
>
> Yes, very much so.  I'm working on getting the latency_tracer from the -rt
> patch into something suitable for mainline.  We need to calculate the max
> latencies on the fly.  If we hit a max, then we save it off, otherwise, we
> blow away the trace and start again.
>
> > > [...]
> > > So, at a minimum, I'd like to at least have meta data attached:
> > >
> > >   trace_mark(kernel_sched_schedule,
> > >              "prev_pid %d next_pid %d prev_state %ld\0"
> > >              "prev %p next %p",
> > >              prev->pid, next->pid, prev->state,
> > >              prev, next);
> > >
> > > This would allow for both the nice pretty print of your trace, as well as
> > > allowing other tracers to get to better meta data.
> >
> > Yes, more self-contained marker events are necessary for meaningful
> > in-situ processing.  That needs to be balanced by the increased cost
> > for computing and passing the extra parameters, multiplied by the event
> > occurrence rate.
>
> The cost is only done when the marker is armed, since the marker is an
> unlikely and will be placed at the end of the function.
>
> > In this case, the prev/next pointers are sufficient to compute the
> > other values.  For particularly performance-critical markers, it may
> > not be unreasonable to expect the callback functions to dereference
> > such pointers for pretty-printing or other processing.
>
> This was exactly my point to Mathieu, but I think he has LTTng very much
> coupled with the markers.  I haven't played with LTTng (yet), but from what
> I've read (Mathieu, correct me if I'm wrong), it seems that all the
> markers become visible to userspace, and the user can simply turn them on
> or off.  LTTng doesn't need any knowledge of the marker since the marker
> contains how to print the information.
>
> So by placing a "prev %p next %p" as the only information, we lose out on
> this automated way LTTng works.  Because the two pointers are just
> meaningless numbers to the user.

Exactly.

We have, at the marker site :

- a marker identifier
- format string containing field names and types
- arguments

I would like to keep that as much in a straight line as possible with
what ends up in the trace.  However, I see that it limits what can be
done by in-kernel tracers.

And by the way, I also suffer from the same kind of limitation in LTTng.
Here is an example : I would like to replace blktrace (actually, I
already have a quite complete implementation).  However, there is some
code run in the kernel to "prepare" the information for the trace which
is blktrace specific.  Since this code is not required to run when
tracing is disabled, it can be seen as "glue-code" between the kernel
tracing point and the extraction of data to trace.

What looked like the least intrusive solution was to create inline
functions that consist of branches over code considered unlikely (could
be a function call) where the glue-code is executed to prepare the data.
It's a bit like what the markers are doing, except that there is no
marker name associated and no format string : the subsystem being traced
must enable its tracing features by itself (could be a /proc file).
It makes sense, since this type of code has to be subsystem-specific
anyway.

But I have not seen a lot of situations where that kind of glue-code was
needed, so I think it makes sense to keep markers simple to use and
efficient for the common case.

Then, in this glue-code, we can put trace_mark() and calls to in-kernel
tracers.

Since the markers are eventually meant to become an API visible from
user-space, I think it makes sense to keep it clean.  If an in-kernel
tracer needs extra information, I think it would make sense for it to
get it from a mechanism that does not make the exported information
visible to user-space.

What do you think ?

> > > The '\0' would keep your tracer from recording the extra data, and we
> > > could add some way to ignore the parameters in the printf to let other
> > > traces get straight to the meta data.
> >
> > This \0 hack is perhaps too clever.  Much of the cost of the extra
> > parameters is already paid by the time that a simpleminded tracing
> > callback function starts going through the
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:

> Hi -
>
> On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > [...]
> > +       trace_mark(kernel_sched_schedule,
> > +                  "prev_pid %d next_pid %d prev_state %ld",
> > +                  prev->pid, next->pid, prev->state);
> > [...]
> > But...
> >
> > Tracers that want to do a bit more work, like recording timings and seeing
> > if we hit some max somewhere, can't do much with that pretty print data.
>
> If you find yourself wanting to perform computations like finding
> maxima, or responding right there as opposed to later during userspace
> trace data extraction, then you're trending toward a tool like
> systemtap.

Yes, very much so.  I'm working on getting the latency_tracer from the -rt
patch into something suitable for mainline.  We need to calculate the max
latencies on the fly.  If we hit a max, then we save it off, otherwise, we
blow away the trace and start again.

> > [...]
> > So, at a minimum, I'd like to at least have meta data attached:
> >
> >   trace_mark(kernel_sched_schedule,
> >              "prev_pid %d next_pid %d prev_state %ld\0"
> >              "prev %p next %p",
> >              prev->pid, next->pid, prev->state,
> >              prev, next);
> >
> > This would allow for both the nice pretty print of your trace, as well as
> > allowing other tracers to get to better meta data.
>
> Yes, more self-contained marker events are necessary for meaningful
> in-situ processing.  That needs to be balanced by the increased cost
> for computing and passing the extra parameters, multiplied by the event
> occurrence rate.

The cost is only done when the marker is armed, since the marker is an
unlikely and will be placed at the end of the function.

> In this case, the prev/next pointers are sufficient to compute the
> other values.  For particularly performance-critical markers, it may
> not be unreasonable to expect the callback functions to dereference
> such pointers for pretty-printing or other processing.

This was exactly my point to Mathieu, but I think he has LTTng very much
coupled with the markers.  I haven't played with LTTng (yet), but from what
I've read (Mathieu, correct me if I'm wrong), it seems that all the
markers become visible to userspace, and the user can simply turn them on
or off.  LTTng doesn't need any knowledge of the marker since the marker
contains how to print the information.

So by placing a "prev %p next %p" as the only information, we lose out on
this automated way LTTng works.  Because the two pointers are just
meaningless numbers to the user.

> > The '\0' would keep your tracer from recording the extra data, and we
> > could add some way to ignore the parameters in the printf to let other
> > traces get straight to the meta data.
>
> This \0 hack is perhaps too clever.  Much of the cost of the extra
> parameters is already paid by the time that a simpleminded tracing
> callback function starts going through the string.  Also, I believe
> the systemtap marker interface would break if the format strings were
> not singly terminated ordinary strings.

Well, actually when I first wrote this letter, I used "--" as a delimiter
to allow a tool to hide the pretty stuff.  But then I thought about the
"clever hack" with the '\0'.  The "--" may be better since it won't break
systemtap.

-- Steve
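[Editorial note] The two delimiter schemes debated above — an embedded `'\0'` versus a `"--"` separator — can be compared with a small userspace sketch. The helper names (`marker_meta_part`, `marker_meta_after_nul`) are hypothetical, and the exact separator spelling is an assumption; this only illustrates how a consumer would split a format string under each scheme.

```c
#include <string.h>
#include <stddef.h>

/* "--" variant: the format stays one ordinary C string, so tools like
 * systemtap still see it whole; a meta-aware consumer splits explicitly. */
static const char *marker_meta_part(const char *fmt)
{
    const char *sep = strstr(fmt, " -- ");
    return sep ? sep + 4 : NULL;    /* NULL: no meta part present */
}

/* '\0' variant: the meta part hides after an embedded terminator, so a
 * naive consumer treating fmt as a C string never sees it at all. */
static const char *marker_meta_after_nul(const char *fmt)
{
    return fmt + strlen(fmt) + 1;   /* caller must know a second part follows */
}
```

This makes the trade-off concrete: the `"--"` form keeps the string well-formed for every consumer, while the `'\0'` form hides the meta data but produces a format string that string-handling code cannot safely traverse.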
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi - On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote: > [...] > + trace_mark(kernel_sched_schedule, > + "prev_pid %d next_pid %d prev_state %ld", > + prev->pid, next->pid, prev->state); > [...] > But... > > Tracers that want to do a bit more work, like recording timings and seeing > if we hit some max somewhere, can't do much with that pretty print data. If you find yourself wanting to perform computations like finding maxima, or responding right there as opposed to later during userspace trace data extraction, then you're trending toward a tool like systemtap. > [...] > So, at a minimum, I'd like to at least have meta data attached: > trace_mark(kernel_sched_schedule, > "prev_pid %d next_pid %d prev_state %ld\0" > "prev %p next %p", > prev->pid, next->pid, prev->state, > prev, next); > > This would allow for both the nice pretty print of your trace, as well as > allowing other tracers to get to better meta data. Yes, more self-contained marker events are necessary for meaningful in-situ processing. That needs to be balanced by the increased cost for computing and passing the extra parameters, multiplied the event occurrence rate. In this case, the prev/next pointers are sufficient to compute the other values. For particularly performance-critical markers, it may not be unreasonable to expect the callback functions to dereference such pointers for pretty-printing or other processing. > The '\0' would keep your tracer from recording the extra data, and we > could add some way to ignore the parameters in the printf to let other > traces get straight to the meta data. This \0 hack is perhaps too clever. Much of the cost of the extra parameters is already paid by the time that a simpleminded tracing callback function starts going through the string. Also, I believe the systemtap marker interface would break if the format strings were not singly terminated ordinary strings. 
- FChE -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
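The '\0' layout being debated here is easy to check in plain userspace C (illustrative only; the format strings mirror the snippet quoted above, and the helper names are hypothetical): adjacent string literals concatenate, a strlen()-style consumer stops at the embedded NUL, and a consumer that knows the trick can step past it to the meta data.

```c
#include <assert.h>
#include <string.h>

/* Two formats packed into one literal, split by an embedded NUL,
 * as in the proposed trace_mark() call quoted above. */
static const char fmt[] =
	"prev_pid %d next_pid %d prev_state %ld\0"
	"prev %p next %p";

/* A naive consumer walking the string sees only the pretty-print half. */
static const char *pretty_half(void)
{
	return fmt;
}

/* A consumer aware of the trick skips past the NUL to the meta data. */
static const char *meta_half(void)
{
	return fmt + strlen(fmt) + 1;
}
```

This also illustrates FChE's objection: an interface that treats format strings as ordinary NUL-terminated strings never sees the second half at all.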
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
> > > > > > > > One thing I want to clear up. The major difference between this > > > > latency_tracer and LTTng is what we consider fast paths. The latency > > > > tracer is recording things like enabling and disabling interrupts, > > > > preempt > > > > count changes, or simply profiling all function calls. Those are what I > > > > consider fast paths. The slow path WRT the latency_tracer are things > > > > like > > > > context switches. This is why I don't have a problem with copying the > > > > comm at context switch time. Because that _is_ a slow path for the > > > > latency > > > > tracer. > > > > > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc. > > > In flight recorder mode, we have nothing to write to disk and therefore > > > we can handle very frequent events. We then do the analysis off-line > > > using the last MB written in the buffers. The advantage is that the > > > kernel dumbly writes data to a buffer: we therefore move the complexity > > > to user-space. > > > > But you would still need to do something in case you want this information > > dumped to console on a kernel crash. Of course you can rely on kexec, but > > if the kexec fails (which is possible) then you lose all the information. > > Having the ability to dump the output to console on a crash is one of the > > benefits of latency_tracer that I want to keep. > > > > There has been some integration done between LTTng and the "crash" tool > to extract the buffers from a crashed kernel. I am not an expert in > crash buffer extraction though, but I guess all the available mechanisms > depend on kexec and could show the limits you are referring to. > > If you really want to pretty-print the information to the console, I > would propose that you leave that part of the problem to a different > output module. The core of the latency tracer could keep the minimum > information. 
Then, when a dump is triggered, it either sends the > information to LTTng or to your console pretty-printer. > > However, I would not call the pretty-printer a "tracer" module per se. > We would have to accept that it is a bit more tied to the kernel > internals than the latency tracer. My goal is to separate the core > "profiling" module from the optional "pretty-printing" module as much as > possible so the latency tracer core could be reused by other output > modules. Mathieu, I've been thinking about the context switch marker, and the printf format field you have: prepare_task_switch(rq, prev, next); + trace_mark(kernel_sched_schedule, + "prev_pid %d next_pid %d prev_state %ld", + prev->pid, next->pid, prev->state); mm = next->mm; Now I see that this is great for your tracer, since all your hook would need to do is: static notrace void simple_trace(const struct marker *mdata, void *private_data, const char *format, ...) { va_list ap; va_start(ap, format); simple_trace_record(ap, format); va_end(ap); } And you could hook this up to all your traces. Wonderful! But... Tracers that want to do a bit more work, like recording timings and seeing if we hit some max somewhere, can't do much with that pretty print data. For example, I like to record the priority of a task that is being swapped out as well as the one being swapped in. But with this, all I can get is the priority of the prev task (since it is still current). You told me that I should put hooks into where the priority gets modified, so that I can trace it there in a non-hot path. Well, I have some issues with this. 1) to me it's a management nightmare. Yes, I could hook into lttng, but that too is too much. That takes me back to the days when COTS became the new buzzword, and we ended up having to incorporate COTS products into things better left done in house. The end result was more glue code than what would happen if it was simply done in house, and a product that was totally inefficient. 
2) this requires we put a marker at all the places that might change the data we want the snapshot of at the marker (here being prio at context switch time). I could imagine some new code coming into the kernel that modifies priority but whose author has no idea about having to update the trace marker, so the trace output ends up showing stale data. And this would cause the poor bastard that needs to maintain this code to debug the tracer on top of the code they are maintaining. Which would simply end with that poor bastard losing all confidence in the tracer and giving up on it. I know that if we did something like: trace_mark(kernel_sched_schedule, "prev %p next %p", prev, next); It would be useless for the simple recording, because what the user would see is just two meaningless pointer numbers. So, at a minimum, I'd like to at least have meta data attached: trace_mark(kernel_sched_schedule, "prev_pid %d next_pid %d prev_state %ld\0" "prev %p next %p", prev->pid, next->pid, prev->state, prev, next); This would allow for both the nice pretty print of your trace, as well as allowing other tracers to get to better meta data.
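The kind of probe Steven wants — one that pulls the fields it cares about (e.g. priorities) straight out of the raw prev/next pointers, rather than relying on pre-formatted pid/state arguments — can be sketched in userspace. `task_stub` and `probe_sched_switch` are hypothetical stand-ins, not kernel API:

```c
#include <stddef.h>

/* Hypothetical stand-in for the few task_struct fields the probe reads. */
struct task_stub {
	int pid;
	int prio;
	long state;
};

static int max_prio_seen;	/* running maximum kept by the tracer */

/* Callback handed the raw prev/next pointers: it dereferences them
 * itself, so it can see next's priority even though only prev is
 * "current" at switch time. */
static void probe_sched_switch(const struct task_stub *prev,
			       const struct task_stub *next)
{
	if (prev->prio > max_prio_seen)
		max_prio_seen = prev->prio;
	if (next->prio > max_prio_seen)
		max_prio_seen = next->prio;
}
```

This is the trade-off FChE names later in the thread: passing just two pointers is cheap at the marker site, at the cost of making the callback kernel-structure-aware.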
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > Or could we map a per-thread page that would contradict this > "definition" ? Over my dead body. It's been done before. Many times. It's horrible, and means that you need to flush the TLB on context switches between threads and cannot share the same hw page tables across CPU's (since now different CPU's run different threads). It generally makes threads pointless. You might as well implement them as processes with shared mmap. Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Thu, 17 Jan 2008, Paul Mackerras wrote: > > > > It's very hard to do a per-thread counter in the VDSO, since threads > > in the same process see the same memory, by definition. You'd have to > > have an array of counters and have some way for each thread to know > > which entry to read. Also you'd have to find space for tens or > > hundreds of thousands of counters, since there can be that many > > threads in a process sometimes. > > I was thinking about this. What would also work is just the ability to > read the schedule counter for the current cpu. Now this would require that > the task had a way to know which CPU it was currently on. > > -- Steve > The problem with the per cpu schedule counter would be to deal with stopped tasks that would wake up at the exact wrong moment. With a 32-bit counter, it could happen. At 1000HZ, given the scheduler is only called once per tick (an approximation; it can also be called explicitly), it would happen after 49.7 days. But then, if the kernel calls schedule() too often, this can be sooner than that. Having a per-thread variable would make sure we don't have this problem. By the way, a task cannot "really know" which CPU it is on: it could be migrated between the cpu ID read and the moment it uses it as an array index. Actually, knowing the schedule count would help to implement algorithms that get the cpu id and know it has been valid for a period of time without pinning to a particular CPU. Mathieu (sorry for repost, messed up with my mail client) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
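The 49.7-day figure follows directly from the counter width: a 32-bit counter bumped once per tick at 1000HZ wraps after 2^32 / 1000 seconds. A quick sanity check (hypothetical helper, not kernel code):

```c
#include <stdint.h>

/* Days until a 32-bit counter wraps, given increments per second. */
static double wrap_days(double incr_per_sec)
{
	double seconds = 4294967296.0 / incr_per_sec;	/* 2^32 increments */
	return seconds / 86400.0;			/* seconds per day */
}
```

As the mail notes, schedule() being called more often than once per tick only shrinks this window, which is why a per-cpu counter can alias for a long-stopped task while a per-thread counter cannot.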
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
> > Crazy ideas : > > Could we do something along the lines of the thread local storage ? > > Or could we map a per-thread page that would contradict this > "definition" ? When working on lguest64, I implemented a "per CPU" shadow page, so that the process of a guest running on one real CPU could never see this page for a process of the same guest running on another real CPU. This is a nightmare to manage. The thing is that threads share the same pgd. With shadow pages, it's doable, but hard to manage. It required making a "copy" of the pgd for each real CPU. Here we are not talking about guests, but bare-metal. So this complicates things even more. The problem is that each thread shares the same pgd. If two threads run at the same time on two different CPUs, then the two threads would see the same counter. But we need to make this per cpu. Now if the thread was given a clue to what CPU it was running on, then we might be able to accomplish something. But then we have the same issue. How do you tell a thread what CPU it's on, without running into the same issues as where to store this data? -- Steve > > Or can we move down the beginning of the user-space thread stack of 4 > bytes (it's already put at a random address anyway) and use these 32 > bits to put our variable ? We don't care if userspace also modifies it; > the kernel would blindly increment it, so there would be no security > concerns involved.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Frank Ch. Eigler wrote: > Hi - > > On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote: > > [...] > > + trace_mark(kernel_sched_schedule, > > + "prev_pid %d next_pid %d prev_state %ld", > > + prev->pid, next->pid, prev->state); > > [...] > > But... > > Tracers that want to do a bit more work, like recording timings and seeing > > if we hit some max somewhere, can't do much with that pretty print data. > If you find yourself wanting to perform computations like finding maxima, > or responding right there as opposed to later during userspace trace data > extraction, then you're trending toward a tool like systemtap. Yes, very much so. I'm working on getting the latency_tracer from the -rt patch into something suitable for mainline. We need to calculate the max latencies on the fly. If we hit a max, then we save it off, otherwise, we blow away the trace and start again. > > [...] > > So, at a minimum, I'd like to at least have meta data attached: > > trace_mark(kernel_sched_schedule, > > "prev_pid %d next_pid %d prev_state %ld\0" > > "prev %p next %p", > > prev->pid, next->pid, prev->state, > > prev, next); > > > > This would allow for both the nice pretty print of your trace, as well as > > allowing other tracers to get to better meta data. > Yes, more self-contained marker events are necessary for meaningful > in-situ processing. That needs to be balanced by the increased cost > for computing and passing the extra parameters, multiplied by the event > occurrence rate. The cost is only paid when the marker is armed, since the marker is in an unlikely branch and will be placed at the end of the function. > In this case, the prev/next pointers are sufficient to > compute the other values. For particularly performance-critical > markers, it may not be unreasonable to expect the callback functions > to dereference such pointers for pretty-printing or other processing. This was exactly my point to Mathieu, but I think he has LTTng very much coupled with the markers.
I haven't played with LTTng (yet), but from what I've read (Mathieu, correct me if I'm wrong), it seems that all the markers become visible to userspace, and the user can simply turn them on or off. LTTng doesn't need any knowledge of the marker since the marker contains how to print the information. So* by placing a "prev %p next %p" as the only information, we lose out on this automated way LTTng works, because the two pointers are just meaningless numbers to the user. > > The '\0' would keep your tracer from recording the extra data, and we > > could add some way to ignore the parameters in the printf to let other > > traces get straight to the meta data. > This \0 hack is perhaps too clever. Much of the cost of the extra > parameters is already paid by the time that a simpleminded tracing > callback function starts going through the string. Also, I believe > the systemtap marker interface would break if the format strings were > not singly terminated ordinary strings. Well, actually when I first wrote this letter, I used "--" as a delimiter to allow a tool to hide the pretty stuff. But then I thought about the "clever hack" with the '\0'. The "--" may be better since it won't break systemtap. -- Steve * dvhart - bah!
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > > Sorry for self-reply, but I thought, in the past, of a way to make this > > possible. > > > > It would imply the creation of a new vsyscall : vgetschedperiod > > > > It would read a counter that would increment each time the thread is > > scheduled out (or in). It would be a per thread counter > > It's very hard to do a per-thread counter in the VDSO, since threads > in the same process see the same memory, by definition. You'd have to > have an array of counters and have some way for each thread to know > which entry to read. Also you'd have to find space for tens or > hundreds of thousands of counters, since there can be that many > threads in a process sometimes. > > Paul. > Crazy ideas: Could we do something along the lines of the thread local storage? Or could we map a per-thread page that would contradict this "definition"? Or can we move down the beginning of the user-space thread stack by 4 bytes (it's already put at a random address anyway) and use these 32 bits to put our variable? We don't care if userspace also modifies it; the kernel would blindly increment it, so there would be no security concerns involved. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
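The retry loop proposed elsewhere in this thread is the read side of a seqlock-style protocol. A userspace sketch, with `vgetschedperiod()` stubbed out as a plain function since the vsyscall is only being proposed here:

```c
#include <stdint.h>

/* Stub for the proposed vsyscall: a per-thread count of schedules.
 * In the real design this would be a VDSO read, not a global. */
static uint32_t sched_period;

static uint32_t vgetschedperiod(void)
{
	return sched_period;
}

/* The value read is only trusted if the period did not change, i.e.
 * the thread was not scheduled out while performing the read. */
static uint64_t stable_read(uint64_t (*read_fn)(void))
{
	uint32_t period;
	uint64_t val;

	do {
		period = vgetschedperiod();
		val = read_fn();	/* e.g. a per-cpu cycle counter */
	} while (period != vgetschedperiod());

	return val;
}

/* Illustrative clock source standing in for the real read. */
static uint64_t dummy_clock(void)
{
	return 42;
}
```

This is also the shape of the user-space RCU read side Mathieu alludes to: the reader retries until it observes an unchanged period around its critical section.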
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Paul Mackerras wrote: > > It's very hard to do a per-thread counter in the VDSO, since threads > in the same process see the same memory, by definition. You'd have to > have an array of counters and have some way for each thread to know > which entry to read. Also you'd have to find space for tens or > hundreds of thousands of counters, since there can be that many > threads in a process sometimes. I was thinking about this. What would also work is just the ability to read the schedule counter for the current cpu. Now this would require that the task had a way to know which CPU it was currently on. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers writes: > Sorry for self-reply, but I thought, in the past, of a way to make this > possible. > > It would imply the creation of a new vsyscall : vgetschedperiod > > It would read a counter that would increment each time the thread is > scheduled out (or in). It would be a per thread counter It's very hard to do a per-thread counter in the VDSO, since threads in the same process see the same memory, by definition. You'd have to have an array of counters and have some way for each thread to know which entry to read. Also you'd have to find space for tens or hundreds of thousands of counters, since there can be that many threads in a process sometimes. Paul.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > It would imply the creation of a new vsyscall : vgetschedperiod > > It would read a counter that would increment each time the thread is > scheduled out (or in). It would be a per thread counter (not a per cpu > counter) so we can deal appropriately with a stopped thread that would > happen to come back running a long time afterward (if we do per-cpu > counters, we could get the same 32 bits counter value falsely if it is > shared with other thread activity). > > Then, the clocksource read code would look like : > > int period; > > do { > period = vgetschedperiod(); > > perform the clocksource read.. > > } while (period != vgetschedperiod()); > > Therefore, we would be sure that we have not been scheduled out while > reading the value. I think this new vsyscall could be useful for others. > Actually, it would make implementation of RCU in user-space possible (as > long as the read-side can retry the read operation). This is something that I would agree is useful. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > * john stultz ([EMAIL PROTECTED]) wrote: > > > > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > > > Thanks John for doing this! > > > > > > (comments imbedded) > > > > > > On Wed, 16 Jan 2008, john stultz wrote: > > > > + int num = !cs->base_num; > > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > > + offset &= cs->mask; > > > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > > > + cs->base[num].cycle_base_last = now; > > > > > > I would think that we would need some sort of barrier here. Otherwise, > > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > > > is needed. > > > > Hopefully addressed in the current version. > > > > > > > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > > > === > > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 > > > > 12:21:46.0 -0800 > > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > > > 14:15:31.0 -0800 > > > > @@ -71,10 +71,12 @@ > > > > */ > > > > static inline s64 __get_nsec_offset(void) > > > > { > > > > - cycle_t cycle_delta; > > > > + cycle_t now, cycle_delta; > > > > s64 ns_offset; > > > > > > > > - cycle_delta = clocksource_get_cycles(clock, > > > > clocksource_read(clock)); > > > > + now = clocksource_read(clock); > > > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > > > + cycle_delta += clock->cycle_accumulated; > > > > > > Is the above just to decouple the two methods? > > > > Yep. clocksource_get_cycles() ended up not being as useful as an helper > > function (I was hoping the arch vsyscall implementations could use it, > > but they've done too much optimization - although that may reflect a > > need up the chain to the clocksource structure). > > > > The problem with vsyscall is that we will have a hard time disabling > preemption :( Therefore, insuring that the read of the data is done in a > timely manner is hard to do. 
> Sorry for self-reply, but I thought, in the past, of a way to make this possible. It would imply the creation of a new vsyscall : vgetschedperiod It would read a counter that would increment each time the thread is scheduled out (or in). It would be a per thread counter (not a per cpu counter) so we can deal appropriately with a stopped thread that would happen to come back running a long time afterward (if we do per-cpu counters, we could get the same 32-bit counter value falsely if it is shared with other thread activity). Then, the clocksource read code would look like : int period; do { period = vgetschedperiod(); perform the clocksource read.. } while (period != vgetschedperiod()); Therefore, we would be sure that we have not been scheduled out while reading the value. I think this new vsyscall could be useful for others. Actually, it would make implementation of RCU in user-space possible (as long as the read-side can retry the read operation). Mathieu > > > thanks > > -john > > > > > > -- > Mathieu Desnoyers > Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > Yep. clocksource_get_cycles() ended up not being as useful as a helper > > function (I was hoping the arch vsyscall implementations could use it, > > but they've done too much optimization - although that may reflect a > > need up the chain to the clocksource structure). > > > > The problem with vsyscall is that we will have a hard time disabling > preemption :( Therefore, ensuring that the read of the data is done in a > timely manner is hard to do. You'll have more than a hard time disabling preemption for vsyscall. We'll need to come up with a better solution then. vsyscall cannot modify any kernel memory, nor can it disable preemption. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > > Thanks John for doing this! > > > > (comments imbedded) > > > > On Wed, 16 Jan 2008, john stultz wrote: > > > + int num = !cs->base_num; > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > + offset &= cs->mask; > > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > > + cs->base[num].cycle_base_last = now; > > > > I would think that we would need some sort of barrier here. Otherwise, > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > > is needed. > > Hopefully addressed in the current version. > > > > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > > === > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > > > 12:21:46.0 -0800 > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > > 14:15:31.0 -0800 > > > @@ -71,10 +71,12 @@ > > > */ > > > static inline s64 __get_nsec_offset(void) > > > { > > > - cycle_t cycle_delta; > > > + cycle_t now, cycle_delta; > > > s64 ns_offset; > > > > > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > > > + now = clocksource_read(clock); > > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > > + cycle_delta += clock->cycle_accumulated; > > > > Is the above just to decouple the two methods? > > Yep. clocksource_get_cycles() ended up not being as useful as an helper > function (I was hoping the arch vsyscall implementations could use it, > but they've done too much optimization - although that may reflect a > need up the chain to the clocksource structure). > The problem with vsyscall is that we will have a hard time disabling preemption :( Therefore, insuring that the read of the data is done in a timely manner is hard to do. Mathieu > thanks > -john > > -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > > I would disable preemption in clocksource_get_basecycles. We would not > > want to be scheduled out while we hold a pointer to the old array > > element. > > > > > + int num = cs->base_num; > > > > Since you deal with base_num in a shared manner (not per cpu), you will > > need a smp_read_barrier_depend() here after the cs->base_num read. > > > > You should think about reading the cs->base_num first, and _after_ that > > read the real clocksource. Here, the clocksource value is passed as > > parameter. It means that the read clocksource may have been read in the > > previous RCU window. > > Here's an updated version of the patch w/ the suggested memory barrier > changes and favored (1-x) inversion change. ;) Let me know if you see > any other holes, or have any other suggestions or ideas. > > Still un-tested (my test box will free up soon, I promise!), but builds. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 18:12:53.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. 
> + */ > + struct { > + cycle_t cycle_base_last, cycle_base; > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,29 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. > */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; > + int num; > + cycle_t now, offset; > + > + preempt_disable(); > + num = cs->base_num; > + smp_read_barrier_depends(); > + now = clocksource_read(cs); > + offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > + preempt_enable(); > + > return offset; > } > > @@ -197,14 +215,26 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + * ***still requires the xtime_lock is held for writing!*** > */ > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t > now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > + /* First update the monotonic base portion. > + * The dual array update method allows for lock-free reading. > + */ > + int num = 1 - cs->base_num; (nitpick) right here, you could probably express 1-num with cs->base_num, since we are the only ones supposed to touch it. 
> + cycle_t offset = (now - cs->base[1-num].cycle_base_last); > + offset &= cs->mask; here too. > + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; > + cs->base[num].cycle_base_last = now; > + wmb(); As I just emailed : smp_smb() *should* be enough. I don't see which architecture could reorder writes wrt local interrupts ? (please tell me if I am grossly mistaken) Mathieu > + cs->base_num = num; > + > + /* Now update the cycle_accumulated portion */ > + offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > - cs->cycle_raw += offset; > } > > /** > Index: monotonic-cleanup/kernel/time/timekeeping.c > === > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > 12:21:46.0 -0800 > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > 17:51:50.0 -0800 > @@ -71,10 +71,12 @@ >
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > Thanks John for doing this! > > (comments imbedded) > > On Wed, 16 Jan 2008, john stultz wrote: > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > + offset &= cs->mask; > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > + cs->base[num].cycle_base_last = now; > > I would think that we would need some sort of barrier here. Otherwise, > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > is needed. Hopefully addressed in the current version. > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > === > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 > > 12:21:46.0 -0800 > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > 14:15:31.0 -0800 > > @@ -71,10 +71,12 @@ > > */ > > static inline s64 __get_nsec_offset(void) > > { > > - cycle_t cycle_delta; > > + cycle_t now, cycle_delta; > > s64 ns_offset; > > > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > > + now = clocksource_read(clock); > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > + cycle_delta += clock->cycle_accumulated; > > Is the above just to decouple the two methods? Yep. clocksource_get_cycles() ended up not being as useful as an helper function (I was hoping the arch vsyscall implementations could use it, but they've done too much optimization - although that may reflect a need up the chain to the clocksource structure). thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > > * john stultz ([EMAIL PROTECTED]) wrote: > > > > > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > > > If you really want an seqlock free algorithm (I _do_ want this for > > > > > tracing!) :) maybe going in the RCU direction could help (I refer to > > > > > my > > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > > > could be turned into the clocksource updater). > > > > > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > > > swing at this now. The lock-free method still doesn't apply to the > > > > update_wall_time function, but does work fine for the monotonic cycle > > > > uses. I'll send a patch for review as soon as I get things building. > > > > > > So here's my first attempt at adding Mathieu's lock-free method to > > > Steven's get_monotonic_cycles() interface. > > > > > > Completely un-tested, but it builds, so I figured I'd send it out for > > > review. > > > > > > I'm not super sure the update or the read doesn't need something > > > additional to force a memory access, but as I didn't see anything > > > special in Mathieu's implementation, I'm going to guess this is ok. > > > > > > Mathieu, Let me know if this isn't what you're suggesting. > > > > > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > > > > > Index: monotonic-cleanup/include/linux/clocksource.h > > > === > > > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > > > 12:22:04.0 -0800 > > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > > > 14:41:31.0 -0800 > > > @@ -87,9 +87,17 @@ > > >* more than one cache line. 
> > >*/ > > > struct { > > > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > > > - } cacheline_aligned_in_smp; > > > > Shouldn't the cycle_last and cycle_accumulated by in the array too ? > > No, we're leaving cycle_last and cycle_accumulated alone. They're > relating to the update_wall_time conversion of cycles to xtime. > > > > + cycle_t cycle_last, cycle_accumulated; > > > > > > + /* base structure provides lock-free read > > > + * access to a virtualized 64bit counter > > > + * Uses RCU-like update. > > > + */ > > > + struct { > > > > We had cycle_raw before, why do we need the following two ? > > > > > + cycle_t cycle_base_last, cycle_base; > > > > I'm not quite sure why you need both cycle_base_last and cycle_base... > > So on my first shot at this, I tried to layer the concepts. Using the > lock-free method to create a abstracted 64bit counter, as provided by > get_monotonic_cycles(). Then I tried to use that abstraction directly in > the update_wall_time() code, reading the abstracted 64bit counter and > using it to update time. > > However, then we start keeping cycle_last in 64bit cycles, rather then > an actual counter read. This then caused changes to be needed in the > arch vsyscall implementations, and that started to get ugly, as we had > to also re-implement the abstracted 64bit counter w/ the lock free > method as well. > > So I just backed off and tried to make it simple: We have two sets of > data that counts cycles from the clocksource. One for timekeeping and > one for get_monotoinc_cycles(). It is a little redundant, but I don't > think you can escape that (the layering method above also has > redundancy, but its just hidden until you implement the vsyscall gtod > methods). > > > > I think I'll need a bit of an explanation of what you are trying to > > achieve here to see what to expect from the clock source. 
Are you trying > > to deal with non-synchronized TSCs across CPUs in a way that will > > generate a monotonic (sometimes stalling) clock ? > > No no no.. I'm not touching the non-synced TSC issue. I'm just trying to > take clocksource counters, which may be of different bit-widths (ACPI PM > is 24bits, for instance), and create lock-free method to translate that > into a virtual 64bit wide counter (using an accumulation bucket, > basically). > > > What I am trying to say is : I know you are trying to make a virtual > > clock source where time cannot go backward, but what are your > > assumptions about the "real" clock source ? > > The assumptions of the real clocksource is the same we keep in the > timekeeping core. It counts forward, at a constant rate and only wraps > after the mask value has been reached. > > > Is the intent to deal with an HPET suddenly reset to 0 or something > > like this ? > > Well, dealing with clocksources wrapping short of 64bits. > Ah ok, then the problem is clearer :) The main difference between the approach I use and yours is that, let's say your clocksource
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > I would disable preemption in clocksource_get_basecycles. We would not > want to be scheduled out while we hold a pointer to the old array > element. > > > + int num = cs->base_num; > > Since you deal with base_num in a shared manner (not per cpu), you will > need a smp_read_barrier_depend() here after the cs->base_num read. > > You should think about reading the cs->base_num first, and _after_ that > read the real clocksource. Here, the clocksource value is passed as > parameter. It means that the read clocksource may have been read in the > previous RCU window. Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Index: monotonic-cleanup/include/linux/clocksource.h === --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800 +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 18:12:53.0 -0800 @@ -87,9 +87,17 @@ * more than one cache line. */ struct { - cycle_t cycle_last, cycle_accumulated, cycle_raw; - } cacheline_aligned_in_smp; + cycle_t cycle_last, cycle_accumulated; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. +*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,29 @@ } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. 
* NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. */ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num; + cycle_t now, offset; + + preempt_disable(); + num = cs->base_num; + smp_read_barrier_depends(); + now = clocksource_read(cs); + offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; + preempt_enable(); + return offset; } @@ -197,14 +215,26 @@ * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. 
+*/ + int num = 1 - cs->base_num; + cycle_t offset = (now - cs->base[1-num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + wmb(); + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; - cs->cycle_raw += offset; } /** Index: monotonic-cleanup/kernel/time/timekeeping.c === --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 12:21:46.0 -0800 +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.0 -0800 @@ -71,10 +71,12 @@ */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle_delta = (now - clock->cycle_last) & clock->mask; + cycle_delta += clock->cycle_accumulated; ns_offset = cyc2ns(clock, cycle_delta); return ns_offset; @@ -105,35 +107,7 @@ cycle_t notrace get_monotonic_cycles(void) {
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Linus Torvalds ([EMAIL PROTECTED]) wrote: > > > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > > + int num = !cs->base_num; > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > > !0 is not necessarily 1. > > Incorrect. > Hrm, *digging in my mailbox*, ah, here it is : http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html Richard Purdie reviewed my code back in 2006 and made this modification. Maybe will he have something to add. > !0 _is_ necessarily 1. It's how all C logical operators work. If you find > a compiler that turns !x into anything but 0/1, you found a compiler for > another language than C. > > It's true that any non-zero value counts as "true", but the that does not > mean that a logical operator can return any non-zero value for true. As a > return value of the logical operations in C, true is *always* 1. > > So !, ||, &&, when used as values, will *always* return either 0 or 1 (but > when used as part of a conditional, the compiler will often optimize out > unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if > the value itself was never used, only branched upon). > > So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine. > > That's not to say it's necessarily the *best* way. > > If you *know* that you started with 0/1 in the first place, the best way > to flip it tends to be to do (1-x) (or possibly (x^1)). > > And if you can't guarantee that, !x is probably better than x ? 0 : 1, > but you might also decide to use ((x+1)&1) for example. > > And obviously, the compiler may sometimes surprise you, and if *it* also > knows it's always 0/1 (for something like the source being a single-bit > bitfield for example), it may end up doing something else than you coded > that is equivalent. And the particular choice of operation the compiler > chooses may well depend on the code _around_ that sequence. 
> > (One reason to potentially prefer (1-x) over (x^1) is that it's often > easier to combine a subtraction with other operations, while an xor seldom > combines with anything around it) > Ok, I'll adopt (1-x) then. Thanks! Mathieu > Linus -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > !0 is not necessarily 1. Incorrect. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as "true", but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence. (One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Steven Rostedt wrote: > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in > > How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask > me, if all you are doing is changing it from 1 to zero and back to 1. > FYI:

[EMAIL PROTECTED]:~/c$ cat flipme.c
int flip1 (int x) { return !x; }
int flip2 (int x) { return x ? 0 : 1; }
int flip3(int x) { return x ^ 1; }
[EMAIL PROTECTED]:~/c$ gcc -O2 -c flipme.c
[EMAIL PROTECTED]:~/c$ objdump -d flipme.o

flipme.o:     file format elf32-i386

Disassembly of section .text:

00000000 <flip1>:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   83 7d 08 00             cmpl   $0x0,0x8(%ebp)
   9:   5d                      pop    %ebp
   a:   0f 94 c0                sete   %al
   d:   c3                      ret
   e:   66 90                   xchg   %ax,%ax

00000010 <flip2>:
  10:   55                      push   %ebp
  11:   31 c0                   xor    %eax,%eax
  13:   89 e5                   mov    %esp,%ebp
  15:   83 7d 08 00             cmpl   $0x0,0x8(%ebp)
  19:   5d                      pop    %ebp
  1a:   0f 94 c0                sete   %al
  1d:   c3                      ret
  1e:   66 90                   xchg   %ax,%ax

00000020 <flip3>:
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   8b 45 08                mov    0x8(%ebp),%eax
  26:   5d                      pop    %ebp
  27:   83 f0 01                xor    $0x1,%eax
  2a:   c3                      ret

So, if you know for sure that x is only 1 or 0, then using x ^ 1 to invert it, seems the most efficient. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > * john stultz ([EMAIL PROTECTED]) wrote: > > > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > > If you really want an seqlock free algorithm (I _do_ want this for > > > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > > could be turned into the clocksource updater). > > > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > > swing at this now. The lock-free method still doesn't apply to the > > > update_wall_time function, but does work fine for the monotonic cycle > > > uses. I'll send a patch for review as soon as I get things building. > > > > So here's my first attempt at adding Mathieu's lock-free method to > > Steven's get_monotonic_cycles() interface. > > > > Completely un-tested, but it builds, so I figured I'd send it out for > > review. > > > > I'm not super sure the update or the read doesn't need something > > additional to force a memory access, but as I didn't see anything > > special in Mathieu's implementation, I'm going to guess this is ok. > > > > Mathieu, Let me know if this isn't what you're suggesting. > > > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > > > Index: monotonic-cleanup/include/linux/clocksource.h > > === > > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 > > 12:22:04.0 -0800 > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > > 14:41:31.0 -0800 > > @@ -87,9 +87,17 @@ > > * more than one cache line. > > */ > > struct { > > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > > - } cacheline_aligned_in_smp; > > Shouldn't the cycle_last and cycle_accumulated by in the array too ? No, we're leaving cycle_last and cycle_accumulated alone. They're relating to the update_wall_time conversion of cycles to xtime. 
> > + cycle_t cycle_last, cycle_accumulated; > > > > + /* base structure provides lock-free read > > +* access to a virtualized 64bit counter > > +* Uses RCU-like update. > > +*/ > > + struct { > > We had cycle_raw before, why do we need the following two ? > > > + cycle_t cycle_base_last, cycle_base; > > I'm not quite sure why you need both cycle_base_last and cycle_base... So on my first shot at this, I tried to layer the concepts. Using the lock-free method to create a abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather then an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock free method as well. So I just backed off and tried to make it simple: We have two sets of data that counts cycles from the clocksource. One for timekeeping and one for get_monotoinc_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but its just hidden until you implement the vsyscall gtod methods). > I think I'll need a bit of an explanation of what you are trying to > achieve here to see what to expect from the clock source. Are you trying > to deal with non-synchronized TSCs across CPUs in a way that will > generate a monotonic (sometimes stalling) clock ? No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically). 
> What I am trying to say is : I know you are trying to make a virtual > clock source where time cannot go backward, but what are your > assumptions about the "real" clock source ? The assumptions of the real clocksource are the same we keep in the timekeeping core. It counts forward, at a constant rate and only wraps after the mask value has been reached. > Is the intent to deal with an HPET suddenly reset to 0 or something > like this ? Well, dealing with clocksources wrapping short of 64bits. > Basically, I wonder why you have to calculate the current cycle count > from the previous update_wall_time event. Is it because you need to be > consistent when a clocksource change occurs ? Actually, we try to do it from the last clocksource_accumulate() call (which is called from update_wall_time). > > + } base[2]; > + int base_num; > + }
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > > + /* First update the monotonic base portion. > > +* The dual array update method allows for lock-free reading. > > +*/ > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in > my code. The two previous lines seem buggy. (I did the same mistake in > my first implementation) ;) > > > + offset &= cs->mask; > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > Here too. How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask me, if all you are doing is changing it from 1 to zero and back to 1. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
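Steven's suggestion can be seen in a stand-alone model of the dual-array update: flip the slot index with XOR rather than `!`, compute the new base from the current slot, store it in the unused slot, then switch. This is a user-space sketch with made-up names (`struct cs_model`, `accumulate`), not the patch itself, and it omits the memory barriers discussed elsewhere in the thread:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycle_t;

struct base_slot {
    cycle_t cycle_base_last, cycle_base;
};

struct cs_model {
    cycle_t mask;
    struct base_slot base[2];
    int base_num;
};

/* Writer: compute the new accumulated base from the current slot,
 * fill the unused slot, then flip base_num. XOR ("base_num ^ 1")
 * makes the 0 <-> 1 flip explicit and cheap. Barriers are omitted
 * here; the kernel version would need an smp_wmb() before the flip. */
static void accumulate(struct cs_model *cs, cycle_t now)
{
    int old = cs->base_num;
    int new_idx = old ^ 1;              /* 0 -> 1, 1 -> 0 */
    cycle_t offset = (now - cs->base[old].cycle_base_last) & cs->mask;

    cs->base[new_idx].cycle_base = cs->base[old].cycle_base + offset;
    cs->base[new_idx].cycle_base_last = now;
    cs->base_num = new_idx;
}

/* Reader: load the index once, then work only within that slot. */
static cycle_t read_base(const struct cs_model *cs, cycle_t now)
{
    int num = cs->base_num;
    return cs->base[num].cycle_base +
           ((now - cs->base[num].cycle_base_last) & cs->mask);
}
```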
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > If you really want an seqlock free algorithm (I _do_ want this for > > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > could be turned into the clocksource updater). > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > swing at this now. The lock-free method still doesn't apply to the > > update_wall_time function, but does work fine for the monotonic cycle > > uses. I'll send a patch for review as soon as I get things building. > > So here's my first attempt at adding Mathieu's lock-free method to > Steven's get_monotonic_cycles() interface. > > Completely un-tested, but it builds, so I figured I'd send it out for > review. > > I'm not super sure the update or the read doesn't need something > additional to force a memory access, but as I didn't see anything > special in Mathieu's implementation, I'm going to guess this is ok. > > Mathieu, Let me know if this isn't what you're suggesting. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 14:41:31.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; Shouldn't the cycle_last and cycle_accumulated be in the array too ? > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. > + */ > + struct { We had cycle_raw before, why do we need the following two ? 
> + cycle_t cycle_base_last, cycle_base; I'm not quite sure why you need both cycle_base_last and cycle_base... I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock ? What I am trying to say is : I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the "real" clock source ? Is the intent to deal with an HPET suddenly reset to 0 or something like this ? Basically, I wonder why you have to calculate the current cycle count from the previous update_wall_time event. Is it because you need to be consistent when a clocksource change occurs ? > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,21 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. > */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. > + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. 
You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as parameter. It means that the read clocksource may have been read in the previous RCU window. > + cycle_t offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > return offset; > } > > @@ -197,14 +207,25 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + *
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Thanks John for doing this! (comments imbedded) On Wed, 16 Jan 2008, john stultz wrote: > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > Completely un-tested, but it builds, so I figured I'd send it out for > review. heh, ok, I'll take it and run it. > > I'm not super sure the update or the read doesn't need something > additional to force a memory access, but as I didn't see anything > special in Mathieu's implementation, I'm going to guess this is ok. > > Mathieu, Let me know if this isn't what you're suggesting. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 14:41:31.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. > + */ > + struct { > + cycle_t cycle_base_last, cycle_base; > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,21 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. 
> */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; > + int num = cs->base_num; > + cycle_t offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > return offset; > } > > @@ -197,14 +207,25 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + * ***still requires the xtime_lock is held for writing!*** > */ > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t > now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > + /* First update the monotonic base portion. > + * The dual array update method allows for lock-free reading. > + */ > + int num = !cs->base_num; > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > + offset &= cs->mask; > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > + cs->base[num].cycle_base_last = now; I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed. 
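The pairing Steven and Mathieu are asking for — an smp_wmb() before the base_num flip on the write side, and smp_read_barrier_depends() (or better) after the base_num load on the read side — can be modeled in user space with C11 release/acquire orderings. This generic two-slot sketch illustrates the pattern only; the struct and names are not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct slot { uint64_t a, b; };

static struct slot slots[2];
static _Atomic int cur;

/* Write side: fill the slot readers are NOT using, then publish the
 * new index. The release store plays the role of the smp_wmb() plus
 * the base_num assignment in the patch. */
static void publish(uint64_t a, uint64_t b)
{
    int next = atomic_load_explicit(&cur, memory_order_relaxed) ^ 1;

    slots[next].a = a;
    slots[next].b = b;
    atomic_store_explicit(&cur, next, memory_order_release);
}

/* Read side: one acquire load of the index (standing in for the
 * read-side barrier Mathieu mentions), then read only that slot. */
static struct slot snapshot(void)
{
    int i = atomic_load_explicit(&cur, memory_order_acquire);
    return slots[i];
}
```

With the release store, a reader that observes the new index is guaranteed to observe the fully written slot; without it, the index flip could become visible before the slot contents. This assumes a single writer, which matches the patch's note that clocksource_accumulate() still runs under the xtime write lock.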
> + cs->base_num = num; > + > + /* Now update the cycle_accumulated portion */ > + offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > - cs->cycle_raw += offset; > } > > /** > Index: monotonic-cleanup/kernel/time/timekeeping.c > === > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > 12:21:46.0 -0800 > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > 14:15:31.0 -0800 > @@ -71,10 +71,12 @@ > */ > static inline s64 __get_nsec_offset(void) > { > - cycle_t cycle_delta; > + cycle_t now, cycle_delta; > s64 ns_offset; > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > + now = clocksource_read(clock); > + cycle_delta = (now - clock->cycle_last) & clock->mask; > + cycle_delta += clock->cycle_accumulated; Is the above just to decouple the two methods? > ns_offset = cyc2ns(clock, cycle_delta); > > return ns_offset; > @@ -105,35 +107,7 @@ > > cycle_t notrace get_monotonic_cycles(void) > { > - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; > - > - do { > - /* > - * cycle_raw and cycle_last can change on > - * another CPU and we need the delta
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > If you really want an seqlock free algorithm (I _do_ want this for > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > could be turned into the clocksource updater). > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > swing at this now. The lock-free method still doesn't apply to the > update_wall_time function, but does work fine for the monotonic cycle > uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Index: monotonic-cleanup/include/linux/clocksource.h === --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800 +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.0 -0800 @@ -87,9 +87,17 @@ * more than one cache line. */ struct { - cycle_t cycle_last, cycle_accumulated, cycle_raw; - } cacheline_aligned_in_smp; + cycle_t cycle_last, cycle_accumulated; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. 
+*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,21 @@ } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. * NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. */ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num = cs->base_num; + cycle_t offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; return offset; } @@ -197,14 +207,25 @@ * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. 
+*/ + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; - cs->cycle_raw += offset; } /** Index: monotonic-cleanup/kernel/time/timekeeping.c === --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 12:21:46.0 -0800 +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.0 -0800 @@ -71,10 +71,12 @@ */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle_delta = (now - clock->cycle_last) & clock->mask; + cycle_delta += clock->cycle_accumulated; ns_offset = cyc2ns(clock, cycle_delta); return ns_offset; @@ -105,35
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > If you really want an seqlock free algorithm (I _do_ want this for > tracing!) :) maybe going in the RCU direction could help (I refer to my > RCU-based 32-to-64 bits lockless timestamp counter extension, which > could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. thanks -john
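The "RCU-based 32-to-64 bits lockless timestamp counter extension" Mathieu refers to works, roughly, by splicing a fresh 32-bit hardware read onto periodically refreshed high bits, carrying into the high half when a wrap is detected. A minimal single-threaded sketch of that splice, with illustrative names and without the per-CPU/RCU machinery of the real code:

```c
#include <assert.h>
#include <stdint.h>

/* last_64 is the last known full 64-bit value (refreshed periodically,
 * e.g. by the per-CPU IPI update mentioned in this thread); now_32 is
 * a fresh 32-bit hardware read. If the low half went backward, the
 * 32-bit counter must have wrapped since the last refresh, so carry
 * into the high half. */
static uint64_t extend_32_to_64(uint64_t last_64, uint32_t now_32)
{
    uint64_t res = (last_64 & ~(uint64_t)0xFFFFFFFF) | now_32;

    if (now_32 < (uint32_t)last_64)
        res += (uint64_t)1 << 32;   /* wrap detected: bump the high half */
    return res;
}
```

This only stays correct if the periodic refresh runs more often than the 32-bit wrap period — the same constraint stated for the IPI update elsewhere in the thread.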
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > In-other-words, latency_tracer is LTTng-lite ;-) > > > > If LTTng is already ported to your specific kernel, the learning-curve > is not big at all. Here is what the latency_tracer over LTTng guide > could look like : > > Well, once you have LTTng in your kernel and have compiled and installed > the ltt-control and lttv packages (configure, make, make install), all > that would be needed is : > > (there may be some bits in the QUICKSTART GUIDE on > http://ltt.polymtl.ca, like adding the debugfs mount to fstab and make sure > the LTTng modules are loaded) > > #arm all the markers > ltt-armall > #start lttng tracing > lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt > > -> start latency tracer > -> stop latency tracer > -> trigger latency tracer dump > > While the tracing is active, trigger the condition... > > (rince, repeat, can handle multiple latency tracer dumps) > > #stop lttng tracing > lttctl -n test -R > #disarm all markers > ltt-disarmall > > You can easily test the trace with : > lttv -m textDump -t /tmp/trace1 > > Your users would issue something like : > > lttv -m latencytracerDump -t /tmp/trace1 > > that's it. LatencytracerDump would be a new specialized plugin, inspired > from the generic textDump.c plugin and from the state.c module (for > hooking on specific events rather that on _all_ events). It would > generate a text output from the trace collected at each latency tracer > dump. > Mathieu, That's quite a bit. Currently we have tools that already hook into latency_tracer. e.g. Thomas Gleixner's cyclictest. It turns on the latency tracer runs a quick test, turns it off. If the latency is ok, the trace is discarded and it runs the test again. When the latency is not acceptable, it stops the test. Then all one needs to do is look at the given trace. Hooking something simple as this to LTTng is not going to fly. Don't get me wrong, I'm a huge supporter of LTTng, and even recommend it. 
But there are things that using LTTng for is like using a sledgehammer for a tack. I like to use the tools that are best for the job (which also means easiest). I don't buy the one tracer fits all mentality. And by pushing that too much, I'm thinking we'll never get a tracer into the kernel. My goal of these patches is to get the infrastructure for tracing into the kernel. The latency_tracer is like Rusty Russell's lguest is for pvops. LTTng is the Xen equivalent. Where latency_tracer is nothing more than the quick and dirty tracer. For administrators that want to analyze their systems, LTTng is much more appropriate. latency_tracer is more of the first aid in finding latency troubles, and if that doesn't work, then we can do the LTTng surgery. Let's go back and focus on the infrastructure again. Namely, the mcount and notrace parts of the kernel. As well as the tracer timer, which, by the way, we are looking more into what you have done ;-) -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers wrote: > If LTTng is already ported to your specific kernel, the learning-curve > is not big at all. Here is what the latency_tracer over LTTng guide > could look like : > > Well, once you have LTTng in your kernel and have compiled and installed > the ltt-control and lttv packages (configure, make, make install), all > that would be needed is : > > (there may be some bits in the QUICKSTART GUIDE on > http://ltt.polymtl.ca, like adding the debugfs mount to fstab and make sure > the LTTng modules are loaded) > > #arm all the markers > ltt-armall > #start lttng tracing > lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt > > -> start latency tracer > -> stop latency tracer > -> trigger latency tracer dump > > While the tracing is active, trigger the condition... > > (rince, repeat, can handle multiple latency tracer dumps) > > #stop lttng tracing > lttctl -n test -R > #disarm all markers > ltt-disarmall > > You can easily test the trace with : > lttv -m textDump -t /tmp/trace1 > > Your users would issue something like : > > lttv -m latencytracerDump -t /tmp/trace1 No offense, but this is still quite a bit harder than: echo 1 >/proc/something ... wait a bit ... cat /proc/something (substitute /sys or /debugfs where appropriate) Having to compile something (besides the kernel) for the target is sometimes a major hassle. I avoided it completely with KFT. -- Tim = Tim Bird Architecture Group Chair, CE Linux Forum Senior Staff Engineer, Sony Corporation of America =
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > ... > > > > > > > > > > - Disable preemption at the read-side : > > > > it makes sure the pointer I get will point to a data structure that > > > > will never change while I am in the preempt disabled code. (see *) > > > > - I use per-cpu data to allow the read-side to be as fast as possible > > > > (only need to disable preemption, does not race against other CPUs and > > > > won't generate cache line bouncing). It also allows dealing with > > > > unsynchronized TSCs if needed. > > > > - Periodical write side : it's called from an IPI running on each CPU. > > > > > > > > (*) We expect the read-side (preempt off region) to last shorter than > > > > the interval between IPI updates so we can guarantee the data structure > > > > it uses won't be modified underneath it. Since the IPI update is > > > > launched each seconds or so (depends on the frequency of the counter we > > > > are trying to extend), it's more than ok. > > > > > > One thing I want to clear up. The major difference between this > > > latency_tracer and LTTng is what we consider fast paths. The latency > > > tracer is recording things like enabling and disabling interrupts, preempt > > > count changes, or simply profiling all function calls. Those are what I > > > consider fast paths. The slow path WRT the latency_tracer are things like > > > context switches. This is why I don't have a problem with copying the > > > comm at context switch time. Because that _is_ a slow path for the latency > > > tracer. > > > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > > in flight recorder mode, we have nothing to write to disk and therefore > > we can handle very frequent events. We then do the analysis off-line > > using the last MB written in the buffers. The advantage is that the > > kernel dumbly writes data to a buffer : we therefore move the complexity > > to user-space. 
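Flight-recorder mode, as Mathieu describes it, is just an overwriting ring buffer: the kernel keeps writing, old events are silently overwritten, and only the most recent window (the "last MB") is extracted for off-line analysis. A toy model with hypothetical names, using a tiny buffer in place of LTTng's megabyte-sized ones:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FR_SIZE 4   /* tiny for illustration; LTTng would use megabytes */

struct flight_rec {
    unsigned char buf[FR_SIZE];
    uint64_t head;   /* total bytes ever written */
};

/* Writers never block and never check for readers: old bytes are
 * simply overwritten (the "dumb write to a buffer" described above). */
static void fr_write(struct flight_rec *fr, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++)
        fr->buf[(fr->head + i) % FR_SIZE] = p[i];
    fr->head += len;
}

/* Off-line extraction: recover the last min(head, FR_SIZE) bytes,
 * oldest first. Returns the number of bytes written to out. */
static size_t fr_dump(const struct flight_rec *fr, unsigned char *out)
{
    size_t avail = fr->head < FR_SIZE ? (size_t)fr->head : FR_SIZE;
    uint64_t start = fr->head - avail;

    for (size_t i = 0; i < avail; i++)
        out[i] = fr->buf[(start + i) % FR_SIZE];
    return avail;
}
```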
> > But you would still need to do something in case you want this information > dumped to console on a kernel crash. Of course you can rely on kexec, but > if the kexec fails (which is possible) then you lose all the information. > Having the ability to dump the output to console on a crash is one of the > benefits of latency_tracer that I want to keep. > There has been some integration done between LTTng and the "crash" tool to extract the buffers from a crashed kernel. I am not an expert in crash buffer extraction though, but I guess all the available mechanisms depend on kexec and could show the limits you are referring to. If you really want to pretty-print the information to the console, I would propose that you leave that part of the problem to a different output module. The core of the latency tracer could keep the minimum information. Then, when a dump is triggered, it either sends the information to LTTng or to your console pretty-printer. However, I would not call the pretty-printer a "tracer" module per-se. We would have to accept that it is a bit more tied to the kernel internals than the latency tracer. My goal is to separate the core "profiling" module from the optional "pretty-printing" module as much as possible so the latency tracer core could be reused by other output modules. > > > > > I agree that some kind of tracing, like the one you are doing, might be > > done more efficiently if you do a first clever analysis phase directly > > in the kernel without writing the raw high event rate data in memory > > buffers. However, I believe that it would be more powerful if we combine > > the two approaches rather than trying to do everything in or out of the > > kernel. LTTng could provide the comm names, priorities, etc, and your > > tracer could provide the top X list of processes that had a bad > > behavior. 
It would mean that the complete overall information would be > > made available after a post-processing phase done in an analysis tool > > like LTTV, but I don't see any problem with it. > > Of course you don't see any problem with it, because you know LTTV and > LTTng very well ;-) > > latency_tracer has been very detrimental in solving -rt patch latencies by > telling the customer to run with latency trace on, and then having them > simply set a few sysctl variables and run their app. By combining this > with LTTng, I wouldn't know how to start with telling a customer how to > analyze the problem. > > Simply put, latency_tracer has a much smaller learning curve than LTTng. > Not to mention, a smaller footprint. The tracer here is very focused on > what to do, and is not meant to be a general profiling tool as LTTng is. > > In-other-words, latency_tracer is LTTng-lite ;-) > If LTTng is already ported to your specific kernel, the learning-curve is not big at all. Here is what the latency_tracer over LTTng guide could look like : Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make,
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > Yeah, but if we replace the loop with a seq lock, then it would work. > > albeit, more cacheline bouncing (caused by writes). (maybe not, see below) > > > > Yes, but then you would trigger a deadlock if you instrument code called > from NMI, SMI, MCE contexts :( > > grep -ri NMI drivers/* arch/* |grep -vi PNMI > > is quite interesting : actually, it show that a few spots need to handle > those "so special interrupts" : watchdogs, oprofile, virtualization and > much more in architecture specific code. > > I just would not like to add a tracer in the kernel that is _so_ > intrusive that module writers and architecture maintainers would have to > audit their code and think about the tracer for each implementation that > would deal with these kind of interrupts. I don't want driver writers to worry about that either. > > > > > do you actually use the RCU internals? or do you just reimplement an RCU > > > > algorithm? > > > > > > > > > > Nope, I don't use RCU internals in this code. Preempt disable seemed > > > like the best way to handle this utterly short code path and I wanted > > > the write side to be fast enough to be called periodically. What I do is: > > > > grmble. Then how do you trace preempt_disable? As my tracer does that > > (see the last patch in the series). > > > > I think using a kind of preempt_disable_notrace() would make sense here. Actually after hitting the send button, I thought the same. > I mean.. even if you use the seqlock, eventually, you will want to trace > the seqlock behavior. Then you have to find the lightest way to do some > sort of synchronization that will have a predictable execution. seqlock > has the following disadvantage : if the seqlock read has to wait for the > write seqlock to end, we add up some time to the execution of the > code we are trying to profile, which will mix up the results. 
On the > other hand, if the read-side executes in a constant amount of cycles > which does not depend on the write-side activity, then we get a clearer > picture of what the time should be accounted for. We can even create a > module that will figure out how many nanoseconds are spent for reading > the clock so we can substract this time from our analysis if required. > > That's why having to choose between read seqlock and preemption disable > for the read-side, I would strongly prefer the preemption disable. > (constant execution time and it's deadlock-free) You may convince me yet ;-) > > > > > > > - Disable preemption at the read-side : > > > it makes sure the pointer I get will point to a data structure that > > > will never change while I am in the preempt disabled code. (see *) > > > - I use per-cpu data to allow the read-side to be as fast as possible > > > (only need to disable preemption, does not race against other CPUs and > > > won't generate cache line bouncing). It also allows dealing with > > > unsynchronized TSCs if needed. > > > - Periodical write side : it's called from an IPI running on each CPU. > > > > > > (*) We expect the read-side (preempt off region) to last shorter than > > > the interval between IPI updates so we can guarantee the data structure > > > it uses won't be modified underneath it. Since the IPI update is > > > launched each seconds or so (depends on the frequency of the counter we > > > are trying to extend), it's more than ok. > > > > One thing I want to clear up. The major difference between this > > latency_tracer and LTTng is what we consider fast paths. The latency > > tracer is recording things like enabling and disabling interrupts, preempt > > count changes, or simply profiling all function calls. Those are what I > > consider fast paths. The slow path WRT the latency_tracer are things like > > context switches. This is why I don't have a problem with copying the > > comm at context switch time. 
Because that _is_ a slow path for the latency > > tracer. > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > in flight recorder mode, we have nothing to write to disk and therefore > we can handle very frequent events. We then do the analysis off-line > using the last MB written in the buffers. The advantage is that the > kernel dumbly writes data to a buffer : we therefore move the complexity > to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information. Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. > > I agree that some kind of tracing, like the one you are doing, might be > done more efficiently if you do a first clever analysis phase directly > in the kernel without writing the raw high event rate data in memory > buffers. However, I believe
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Steven Rostedt wrote: > grmble. Then how do you trace preempt_disable? As my tracer does that > (see the last patch in the series). One way is to make a tracer_preempt_disable() and tracer_preempt_enable(), both of which would be 'notrace'. You could probably optimize them as well. The standard preempt_disable and preempt_enable don't look very efficient (e.g. what's up with converting an increment operation into an addition? - gak!) Any lock you do is going to have a pretty bad effect. In order to be able to trace as much as possible, for KFT, I implemented my own synchronization mechanism using cmpxchg, to avoid using any of the existing kernel locks (which change more often than you'd think, and have weird side effects). = Tim Bird Architecture Group Chair, CE Linux Forum Senior Staff Engineer, Sony Corporation of America =
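Tim's cmpxchg approach can be sketched with C11 atomics: each event reserves a buffer slot with a compare-exchange loop, so the fast path takes no lock and cannot deadlock against itself when re-entered from an interrupt or NMI. The names and fixed-size buffer below are illustrative, not KFT's actual code:

```c
#include <assert.h>
#include <stdatomic.h>

#define N_ENTRIES 128

static unsigned long trace_buf[N_ENTRIES];
static _Atomic unsigned int next_slot;

/* Claim one slot with compare-exchange; returns the claimed index,
 * or -1 when the buffer is full (the event is dropped). The weak CAS
 * may fail spuriously, which the loop absorbs; on failure, 'old' is
 * reloaded with the current value of next_slot. */
static int trace_event(unsigned long value)
{
    unsigned int old = atomic_load(&next_slot);

    do {
        if (old >= N_ENTRIES)
            return -1;           /* full: drop the event */
    } while (!atomic_compare_exchange_weak(&next_slot, &old, old + 1));

    trace_buf[old] = value;
    return (int)old;
}
```

Each concurrent caller is guaranteed a distinct index, which is the property a lock would otherwise provide.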
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > > One thing I want to clear up. The major difference between this > > latency_tracer and LTTng is what we consider fast paths. The latency > > tracer is recording things like enabling and disabling interrupts, preempt > > count changes, or simply profiling all function calls. Those are what I > > consider fast paths. The slow path WRT the latency_tracer are things like > > context switches. This is why I don't have a problem with copying the > > comm at context switch time. Because that _is_ a slow path for the latency > > tracer. > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > in flight recorder mode, we have nothing to write to disk and therefore > we can handle very frequent events. We then do the analysis off-line > using the last MB written in the buffers. The advantage is that the > kernel dumbly writes data to a buffer : we therefore move the complexity > to user-space. > > I agree that some kind of tracing, like the one you are doing, might be > done more efficiently if you do a first clever analysis phase directly > in the kernel without writing the raw high event rate data in memory > buffers. However, I believe that it would be more powerful if we combine > the two approaches rather than trying to do everything in or out of the > kernel. LTTng could provide the comm names, priorities, etc, and your > tracer could provide the top X list of processes that had a bad > behavior. It would mean that the complete overall information would be > made available after a post-processing phase done in an analysis tool > like LTTV, but I don't see any problem with it. > Just to expand a bit on the design-side of my proposal : Your module would create "profiles" based on the hooks called. 
If we take the interrupt on/off for example, it would be called by
lockdep and could keep a table of the top X instructions that disable
interrupts for a long time. (it's just an example, you could want to
save the pid instead...) Then, whenever a "profile dump" is triggered,
you would simply have to send the current state of your profile to lttng
with something like :

struct irq_latency_table {
	void *ip;
	cycles_t time;
};

/*
 * Make sure only one profile at a time is written to the trace for the
 * whole system.
 */
static DEFINE_MUTEX(latency_profile_mutex);
static struct irq_latency_table latency_table[NR_ENTRIES];

void irq_latency_dump_profile(void)
{
	int i;
	char namebuf[KSYM_NAME_LEN];

	mutex_lock(&latency_profile_mutex);
	trace_mark(irq_latency_dump_begin, MARK_NOARGS);
	for (i = 0; i < NR_ENTRIES; i++) {
		sprint_symbol(namebuf, (unsigned long)latency_table[i].ip);
		trace_mark(irq_latency_entry, "ip %p symbol %s time %llu",
			   latency_table[i].ip, namebuf,
			   (unsigned long long)latency_table[i].time);
	}
	trace_mark(irq_latency_dump_end, MARK_NOARGS);
	mutex_unlock(&latency_profile_mutex);
}

You can then create a LTTV module that will format your nice output each
time a profile dump is encountered. By doing this, your specialized
profile generator would only have to hook into the irq on/off events to
gather the information it needs, nothing more. I think that would trim
the code size and the complexity of your module by an interesting
factor. Note that I could optimize the way I currently deal with symbols
by not having to dump them in the trace, but since it's only for
low-rate events, this optimization has a low priority on my todo list.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D.
Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > > No, there's probably issues there too, but no need to worry about it,
> > > since I already showed that allowing for clocksource_accumulate to happen
> > > inside the get_monotonic_cycles loop is already flawed.
> >
> > Yep, I just re-read through your previous email, and totally agree that
> > the algorithm is flawed in the way you pointed out.
>
> Yeah, but if we replace the loop with a seq lock, then it would work.
> albeit, more cacheline bouncing (caused by writes). (maybe not, see below)

Yes, but then you would trigger a deadlock if you instrument code called
from NMI, SMI, MCE contexts :(

grep -ri NMI drivers/* arch/* | grep -vi PNMI is quite interesting :
actually, it shows that a few spots need to handle those "so special
interrupts" : watchdogs, oprofile, virtualization and much more in
architecture specific code. I just would not like to add a tracer in the
kernel that is _so_ intrusive that module writers and architecture
maintainers would have to audit their code and think about the tracer
for each implementation that would deal with these kinds of interrupts.

> > > do you actually use the RCU internals? or do you just reimplement an RCU
> > > algorithm?
> >
> > Nope, I don't use RCU internals in this code. Preempt disable seemed
> > like the best way to handle this utterly short code path and I wanted
> > the write side to be fast enough to be called periodically. What I do is:
>
> grmble. Then how do you trace preempt_disable? As my tracer does that
> (see the last patch in the series).

I think using a kind of preempt_disable_notrace() would make sense here.
I mean.. even if you use the seqlock, eventually, you will want to trace
the seqlock behavior. Then you have to find the lightest way to do some
sort of synchronization that will have a predictable execution.
seqlock has the following disadvantage : if the seqlock read has to wait
for the write seqlock to end, we add up some time to the execution of
the code we are trying to profile, which will mix up the results. On the
other hand, if the read-side executes in a constant number of cycles
which does not depend on the write-side activity, then we get a clearer
picture of what the time should be accounted for. We can even create a
module that will figure out how many nanoseconds are spent reading the
clock, so we can subtract this time from our analysis if required.
That's why, having to choose between a read seqlock and preemption
disable for the read-side, I would strongly prefer the preemption
disable. (constant execution time and it's deadlock-free)

> > - Disable preemption at the read-side :
> >   it makes sure the pointer I get will point to a data structure that
> >   will never change while I am in the preempt disabled code. (see *)
> > - I use per-cpu data to allow the read-side to be as fast as possible
> >   (only need to disable preemption, does not race against other CPUs and
> >   won't generate cache line bouncing). It also allows dealing with
> >   unsynchronized TSCs if needed.
> > - Periodical write side : it's called from an IPI running on each CPU.
> >
> > (*) We expect the read-side (preempt off region) to last shorter than
> > the interval between IPI updates so we can guarantee the data structure
> > it uses won't be modified underneath it. Since the IPI update is
> > launched every second or so (depends on the frequency of the counter we
> > are trying to extend), it's more than ok.
>
> One thing I want to clear up. The major difference between this
> latency_tracer and LTTng is what we consider fast paths. The latency
> tracer is recording things like enabling and disabling interrupts, preempt
> count changes, or simply profiling all function calls. Those are what I
> consider fast paths.
The slow path WRT the latency_tracer are things like > context switches. This is why I don't have a problem with copying the > comm at context switch time. Because that _is_ a slow path for the latency > tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > No, there's probably issues there too, but no need to worry about it, > > since I already showed that allowing for clocksource_accumulate to happen > > inside the get_monotonic_cycles loop is already flawed. > > > > Yep, I just re-read through your previous email, and totally agree that > the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) > > do you actually use the RCU internals? or do you just reimplement an RCU > > algorithm? > > > > Nope, I don't use RCU internals in this code. Preempt disable seemed > like the best way to handle this utterly short code path and I wanted > the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). > > - Disable preemption at the read-side : > it makes sure the pointer I get will point to a data structure that > will never change while I am in the preempt disabled code. (see *) > - I use per-cpu data to allow the read-side to be as fast as possible > (only need to disable preemption, does not race against other CPUs and > won't generate cache line bouncing). It also allows dealing with > unsynchronized TSCs if needed. > - Periodical write side : it's called from an IPI running on each CPU. > > (*) We expect the read-side (preempt off region) to last shorter than > the interval between IPI updates so we can guarantee the data structure > it uses won't be modified underneath it. Since the IPI update is > launched each seconds or so (depends on the frequency of the counter we > are trying to extend), it's more than ok. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. 
The latency tracer is recording things like enabling and disabling
interrupts, preempt count changes, or simply profiling all function
calls. Those are what I consider fast paths. The slow path WRT the
latency_tracer are things like context switches. This is why I don't
have a problem with copying the comm at context switch time. Because
that _is_ a slow path for the latency tracer.

Placing a read_seqlock in get_monotonic_cycles would not be that bad,
since the only slow down would be the rmb. read_seqlocks don't modify
global data. Only the write_seqlock does. So the cache line bouncing
would only happen on updates in clocksource_accumulate. But then after
the caches are all balanced again, the reads will continue fine.

Question: Is a cache-miss a greater cost than a read to a clocksource
(besides the TSC)?

Also note how I arrange these variables in the clock struct:

	struct {
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

I could do the following:

	struct {
		seqlock_t cycle_lock;
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

Which would help to keep all these in the same cache line. These are all
updated at the same time, and hopefully this will keep the cache line
bouncing limited to a single cacheline.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > Hrm, I will reply to the rest of this email in a separate mail, but > > there is another concern, simpler than memory ordering, that just hit > > me : > > > > If we have CPU A calling clocksource_accumulate while CPU B is calling > > get_monotonic_cycles, but events happens in the following order (because > > of preemption or interrupts). Here, to make things worse, we would be on > > x86 where cycle_t is not an atomic write (64 bits) : > > > > > > CPU A CPU B > > > > clocksource read > > update cycle_mono (1st 32 bits) > >read cycle_mono > >read cycle_last > >clocksource read > >read cycle_mono > >read cycle_last > > update cycle_mono (2nd 32 bits) > > update cycle_last > > update cycle_acc > > > > Therefore, we have : > > - an inconsistant cycle_monotonic value > > - inconsistant cycle_monotonic and cycle_last values. > > > > Or is there something I have missed ? > > No, there's probably issues there too, but no need to worry about it, > since I already showed that allowing for clocksource_accumulate to happen > inside the get_monotonic_cycles loop is already flawed. > Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. > > > > If you really want an seqlock free algorithm (I _do_ want this for > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > could be turned into the clocksource updater). > > I know you pointed me the code, but lets assume that I'm still ignorant > ;-) > > do you actually use the RCU internals? or do you just reimplement an RCU > algorithm? > Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. 
What I do is:

- Disable preemption at the read-side :
  it makes sure the pointer I get will point to a data structure that
  will never change while I am in the preempt disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible
  (only need to disable preemption, does not race against other CPUs and
  won't generate cache line bouncing). It also allows dealing with
  unsynchronized TSCs if needed.
- Periodical write side : it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt off region) to last shorter than
the interval between IPI updates so we can guarantee the data structure
it uses won't be modified underneath it. Since the IPI update is
launched every second or so (depends on the frequency of the counter we
are trying to extend), it's more than ok.

Mathieu

> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:

> Hrm, I will reply to the rest of this email in a separate mail, but
> there is another concern, simpler than memory ordering, that just hit
> me :
>
> If we have CPU A calling clocksource_accumulate while CPU B is calling
> get_monotonic_cycles, but events happen in the following order (because
> of preemption or interrupts). Here, to make things worse, we would be on
> x86 where cycle_t is not an atomic write (64 bits) :
>
> CPU A                              CPU B
>
> clocksource read
> update cycle_mono (1st 32 bits)
>                                    read cycle_mono
>                                    read cycle_last
>                                    clocksource read
>                                    read cycle_mono
>                                    read cycle_last
> update cycle_mono (2nd 32 bits)
> update cycle_last
> update cycle_acc
>
> Therefore, we have :
> - an inconsistent cycle_monotonic value
> - inconsistent cycle_monotonic and cycle_last values.
>
> Or is there something I have missed ?

No, there's probably issues there too, but no need to worry about it,
since I already showed that allowing for clocksource_accumulate to happen
inside the get_monotonic_cycles loop is already flawed.

> If you really want a seqlock-free algorithm (I _do_ want this for
> tracing!) :) maybe going in the RCU direction could help (I refer to my
> RCU-based 32-to-64 bits lockless timestamp counter extension, which
> could be turned into the clocksource updater).

I know you pointed me the code, but lets assume that I'm still ignorant
;-)

do you actually use the RCU internals? or do you just reimplement an RCU
algorithm?

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> [ CC'd Daniel Walker, since he had problems with this code ]
>
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > I agree with you that I don't see how the compiler could reorder this.
> > So we forget about compiler barriers. Also, the clock source used is a
> > synchronized clock source (get_cycles_sync on x86_64), so it should make
> > sure the TSC is read at the right moment.
> >
> > However, what happens if the clock source is, say, the jiffies ?
> >
> > In this case, we have :
> >
> > static cycle_t jiffies_read(void)
> > {
> > 	return (cycle_t) jiffies;
> > }
> >
> > Which is nothing more than a memory read of
> >
> > extern unsigned long volatile __jiffy_data jiffies;
>
> Yep, and that's not my concern.

Hrm, I will reply to the rest of this email in a separate mail, but
there is another concern, simpler than memory ordering, that just hit
me :

If we have CPU A calling clocksource_accumulate while CPU B is calling
get_monotonic_cycles, but events happen in the following order (because
of preemption or interrupts). Here, to make things worse, we would be on
x86 where cycle_t is not an atomic write (64 bits) :

CPU A                              CPU B

clocksource read
update cycle_mono (1st 32 bits)
                                   read cycle_mono
                                   read cycle_last
                                   clocksource read
                                   read cycle_mono
                                   read cycle_last
update cycle_mono (2nd 32 bits)
update cycle_last
update cycle_acc

Therefore, we have :
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed ?

If you really want a seqlock-free algorithm (I _do_ want this for
tracing!) :) maybe going in the RCU direction could help (I refer to my
RCU-based 32-to-64 bits lockless timestamp counter extension, which
could be turned into the clocksource updater).

Mathieu

> > I think it is wrong to assume that reads from clock->cycle_raw and from
> > jiffies will be ordered correctly in SMP.
I am tempted to think that > > ordering memory writes to clock->cycle_raw vs jiffies is also needed in this > > case (where clock->cycle_raw is updated, or where jiffies is updated). > > > > We can fall in the same kind of issue if we read the HPET, which is > > memory I/O based. It does not seems correct to assume that MMIO vs > > normal memory reads are ordered. (pointing back to this article : > > http://lwn.net/Articles/198988/) > > That and the dread memory barrier thread that my head is still spinning > on. > > Ok, lets take a close look at the code in question. I may be wrong, and if > so, great, we can fix it. > > We have this in get_monotonic_cycles: > > { > cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; > do { > cycle_monotonic = clock->cycle_monotonic; > cycle_last = clock->cycle_last; > cycle_now = clocksource_read(clock); > cycle_delta = (cycle_now - cycle_last) & clock->mask; > } while (cycle_monotonic != clock->cycle_monotonic || >cycle_last != clock->cycle_last); > return cycle_monotonic + cycle_delta; > } > > and this in clocksource.h > > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) > { > cycle_t offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > cs->cycle_monotonic += offset; > } > > now is usually just a clocksource_read() passed in. > > The goal is to have clock_monotonic always return something that is > greater than what was read the last time. > > Let's make a few assumptions now (for others to shoot them down). One > thing is that we don't need to worry too much about MMIO, because we are > doing a read. This means we need the data right now to contiune. So the > read being a function call should keep gcc from moving stuff around, and > since we are doing an IO read, the order of events should be pretty much > synchronized. in > > 1. load cycle_last and cycle_monotonic (we don't care which order)* > 2. read clock source > 3. 
calculate delta and while() compare (order doesn't matter) > > * we might care (see below) > > If the above is incorrect, then we need to fix get_monotonic_cycles. > > in clocksource_accumulate, we have: > > offset = ((now = cs->read()) - cycle_last) & cs->mask; > cycle_last = now; > cycle_accumulate += offset; > cycle_monotonic += offset; > > The order of events here are. Using the same reasoning as above, the read > must be first and completed because for gcc it's a function, and for IO, > it needs to return data. > > 1. cs->read > 2. update cycle_last, cycle_accumulate, cycle_monotonic. > > Can we assume, if the above for get_monotonic_cycles is correct, that > since we read and compare cycle_last and cycle_monotonic, that neither of > them have
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
[ CC'd Daniel Walker, since he had problems with this code ]

On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> I agree with you that I don't see how the compiler could reorder this.
> So we forget about compiler barriers. Also, the clock source used is a
> synchronized clock source (get_cycles_sync on x86_64), so it should make
> sure the TSC is read at the right moment.
>
> However, what happens if the clock source is, say, the jiffies ?
>
> In this case, we have :
>
> static cycle_t jiffies_read(void)
> {
> 	return (cycle_t) jiffies;
> }
>
> Which is nothing more than a memory read of
>
> extern unsigned long volatile __jiffy_data jiffies;

Yep, and that's not my concern.

> I think it is wrong to assume that reads from clock->cycle_raw and from
> jiffies will be ordered correctly in SMP. I am tempted to think that
> ordering memory writes to clock->cycle_raw vs jiffies is also needed in this
> case (where clock->cycle_raw is updated, or where jiffies is updated).
>
> We can fall in the same kind of issue if we read the HPET, which is
> memory I/O based. It does not seem correct to assume that MMIO vs
> normal memory reads are ordered. (pointing back to this article :
> http://lwn.net/Articles/198988/)

That and the dread memory barrier thread that my head is still spinning
on.

Ok, lets take a close look at the code in question. I may be wrong, and if
so, great, we can fix it.
We have this in get_monotonic_cycles:

{
	cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

	do {
		cycle_monotonic = clock->cycle_monotonic;
		cycle_last = clock->cycle_last;
		cycle_now = clocksource_read(clock);
		cycle_delta = (cycle_now - cycle_last) & clock->mask;
	} while (cycle_monotonic != clock->cycle_monotonic ||
		 cycle_last != clock->cycle_last);

	return cycle_monotonic + cycle_delta;
}

and this in clocksource.h

static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
{
	cycle_t offset = (now - cs->cycle_last) & cs->mask;

	cs->cycle_last = now;
	cs->cycle_accumulated += offset;
	cs->cycle_monotonic += offset;
}

now is usually just a clocksource_read() passed in.

The goal is to have clock_monotonic always return something that is
greater than what was read the last time.

Let's make a few assumptions now (for others to shoot them down). One
thing is that we don't need to worry too much about MMIO, because we are
doing a read. This means we need the data right now to continue. So the
read being a function call should keep gcc from moving stuff around, and
since we are doing an IO read, the order of events should be pretty much
synchronized. in

1. load cycle_last and cycle_monotonic (we don't care which order)*
2. read clock source
3. calculate delta and while() compare (order doesn't matter)

* we might care (see below)

If the above is incorrect, then we need to fix get_monotonic_cycles.

in clocksource_accumulate, we have:

	offset = ((now = cs->read()) - cycle_last) & cs->mask;
	cycle_last = now;
	cycle_accumulate += offset;
	cycle_monotonic += offset;

The order of events here is as follows. Using the same reasoning as
above, the read must be first and completed because for gcc it's a
function, and for IO, it needs to return data.

1. cs->read
2. update cycle_last, cycle_accumulate, cycle_monotonic.
Can we assume, if the above for get_monotonic_cycles is correct, that
since we read and compare cycle_last and cycle_monotonic, that neither of
them have changed over the read? So we have a snapshot of the
clocksource_accumulate.

So the worst thing that I can think of, is that cycle_monotonic is
updated *before* cycle_last:

	cycle_monotonic += offset;
	cycle_last = now;

cycle_last = 5
cycle_monotonic = 0

       CPU 0                           CPU 1
    ----------                      ----------
 cs->read() = 10
 offset = 10 - 5 = 5
 cycle_monotonic = 5
                                 cycle_monotonic = 5
                                 cycle_last = 5
                                 cs->read() = 11
                                 delta = 11 - 5 = 6
                                 cycle_monotonic and cycle_last still same
                                 return 5 + 6 = 11
 cycle_last = 10
                                 cycle_monotonic = 5
                                 cycle_last = 10
                                 cs->read() = 12
                                 delta = 12 - 10 = 2
                                 cycle_monotonic and cycle_last still same
                                 return 5 + 2 = 7  *ERROR*

So, we *do* need memory barriers. Looks like cycle_last and
cycle_monotonic need to be synchronized.

OK, will this do?

cycle_t notrace get_monotonic_cycles(void)
{
	cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

	do {
		cycle_monotonic = clock->cycle_monotonic;
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
[ CC'd Daniel Walker, since he had problems with this code ] On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: I agree with you that I don't see how the compiler could reorder this. So we forget about compiler barriers. Also, the clock source used is a synchronized clock source (get_cycles_sync on x86_64), so it should make sure the TSC is read at the right moment. However, what happens if the clock source is, say, the jiffies ? Is this case, we have : static cycle_t jiffies_read(void) { return (cycle_t) jiffies; } Which is nothing more than a memory read of extern unsigned long volatile __jiffy_data jiffies; Yep, and that's not my concern. I think it is wrong to assume that reads from clock-cycle_raw and from jiffies will be ordered correctly in SMP. I am tempted to think that ordering memory writes to clock-cycle_raw vs jiffies is also needed in this case (where clock-cycle_raw is updated, or where jiffies is updated). We can fall in the same kind of issue if we read the HPET, which is memory I/O based. It does not seems correct to assume that MMIO vs normal memory reads are ordered. (pointing back to this article : http://lwn.net/Articles/198988/) That and the dread memory barrier thread that my head is still spinning on. Ok, lets take a close look at the code in question. I may be wrong, and if so, great, we can fix it. 
We have this in get_monotonic_cycles: { cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; do { cycle_monotonic = clock-cycle_monotonic; cycle_last = clock-cycle_last; cycle_now = clocksource_read(clock); cycle_delta = (cycle_now - cycle_last) clock-mask; } while (cycle_monotonic != clock-cycle_monotonic || cycle_last != clock-cycle_last); return cycle_monotonic + cycle_delta; } and this in clocksource.h static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { cycle_t offset = (now - cs-cycle_last) cs-mask; cs-cycle_last = now; cs-cycle_accumulated += offset; cs-cycle_monotonic += offset; } now is usually just a clocksource_read() passed in. The goal is to have clock_monotonic always return something that is greater than what was read the last time. Let's make a few assumptions now (for others to shoot them down). One thing is that we don't need to worry too much about MMIO, because we are doing a read. This means we need the data right now to contiune. So the read being a function call should keep gcc from moving stuff around, and since we are doing an IO read, the order of events should be pretty much synchronized. in 1. load cycle_last and cycle_monotonic (we don't care which order)* 2. read clock source 3. calculate delta and while() compare (order doesn't matter) * we might care (see below) If the above is incorrect, then we need to fix get_monotonic_cycles. in clocksource_accumulate, we have: offset = ((now = cs-read()) - cycle_last) cs-mask; cycle_last = now; cycle_accumulate += offset; cycle_monotonic += offset; The order of events here are. Using the same reasoning as above, the read must be first and completed because for gcc it's a function, and for IO, it needs to return data. 1. cs-read 2. update cycle_last, cycle_accumulate, cycle_monotonic. Can we assume, if the above for get_monotonic_cycles is correct, that since we read and compare cycle_last and cycle_monotonic, that neither of them have changed over the read? 
So we have a snapshot of the clocksource_accumulate. So the worst thing that I can think of, is that cycle_monotonic is update *before* cycle_last: cycle_monotonic += offest; get_monotonic_cycles run on other CPU cycle_last = now; cycle_last = 5 cycle_monotonic = 0 CPU 0 CPU 1 -- - cs-read() = 10 offset = 10 - 5 = 5 cycle_monotonic = 5 cycle_monotonic = 5 cycle_last = 5 cs-read() = 11 delta = 11 - 5 = 6 cycle_monotonic and cycle_last still same return 5 + 6 = 11 cycle_last = 10 cycle_monotonic = 5 cycle_last = 10 cs-read() = 12 delta = 12 - 10 = 2 cycle_monotonic and cycle_last still same return 5 + 2 = 7 ERROR * So, we *do* need memory barriers. Looks like cycle_last and cycle_monotonic need to be synchronized. OK, will this do? cycle_t notrace get_monotonic_cycles(void) { cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; do { cycle_monotonic = clock-cycle_monotonic; smp_rmb();
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: [ CC'd Daniel Walker, since he had problems with this code ] On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: I agree with you that I don't see how the compiler could reorder this. So we forget about compiler barriers. Also, the clock source used is a synchronized clock source (get_cycles_sync on x86_64), so it should make sure the TSC is read at the right moment. However, what happens if the clock source is, say, the jiffies ? Is this case, we have : static cycle_t jiffies_read(void) { return (cycle_t) jiffies; } Which is nothing more than a memory read of extern unsigned long volatile __jiffy_data jiffies; Yep, and that's not my concern. Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me : If we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happens in the following order (because of preemption or interrupts). Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits) : CPU A CPU B clocksource read update cycle_mono (1st 32 bits) read cycle_mono read cycle_last clocksource read read cycle_mono read cycle_last update cycle_mono (2nd 32 bits) update cycle_last update cycle_acc Therefore, we have : - an inconsistant cycle_monotonic value - inconsistant cycle_monotonic and cycle_last values. Or is there something I have missed ? If you really want an seqlock free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). Mathieu I think it is wrong to assume that reads from clock-cycle_raw and from jiffies will be ordered correctly in SMP. 
I am tempted to think that ordering memory writes to clock->cycle_raw vs jiffies is also needed in this case (where clock->cycle_raw is updated, or where jiffies is updated). We can fall into the same kind of issue if we read the HPET, which is memory-I/O based. It does not seem correct to assume that MMIO vs normal memory reads are ordered. (Pointing back to this article: http://lwn.net/Articles/198988/) That and the dread memory barrier thread that my head is still spinning on. OK, let's take a close look at the code in question. I may be wrong, and if so, great, we can fix it. We have this in get_monotonic_cycles:

	cycle_t get_monotonic_cycles(void)
	{
		cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

		do {
			cycle_monotonic = clock->cycle_monotonic;
			cycle_last = clock->cycle_last;
			cycle_now = clocksource_read(clock);
			cycle_delta = (cycle_now - cycle_last) & clock->mask;
		} while (cycle_monotonic != clock->cycle_monotonic ||
			 cycle_last != clock->cycle_last);

		return cycle_monotonic + cycle_delta;
	}

and this in clocksource.h:

	static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
	{
		cycle_t offset = (now - cs->cycle_last) & cs->mask;

		cs->cycle_last = now;
		cs->cycle_accumulated += offset;
		cs->cycle_monotonic += offset;
	}

now is usually just a clocksource_read() passed in. The goal is to have clock_monotonic always return something that is greater than what was read the last time. Let's make a few assumptions now (for others to shoot them down). One thing is that we don't need to worry too much about MMIO, because we are doing a read. This means we need the data right now to continue. So the read being a function call should keep gcc from moving stuff around, and since we are doing an IO read, the order of events should be pretty much synchronized:

1. load cycle_last and cycle_monotonic (we don't care which order)*
2. read clock source
3.
calculate delta and while() compare (order doesn't matter)

* we might care (see below)

If the above is incorrect, then we need to fix get_monotonic_cycles. In clocksource_accumulate, we have:

	offset = ((now = cs->read()) - cycle_last) & cs->mask;
	cycle_last = now;
	cycle_accumulated += offset;
	cycle_monotonic += offset;

The order of events here, using the same reasoning as above (the read must be first and completed, because for gcc it's a function, and for IO, it needs to return data), is:

1. cs->read()
2. update cycle_last, cycle_accumulated, cycle_monotonic.

Can we assume, if the above for get_monotonic_cycles is correct, that since we read and compare cycle_last and cycle_monotonic, that neither of them has changed over the read? So we have a snapshot of the clocksource_accumulate. So the worst thing that I can think of, is that cycle_monotonic is updated
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me: what if we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happen in the following order (because of preemption or interrupts)? Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits):

	CPU A                              CPU B
	clocksource read
	update cycle_mono (1st 32 bits)
	                                   read cycle_mono
	                                   read cycle_last
	                                   clocksource read
	                                   read cycle_mono
	                                   read cycle_last
	update cycle_mono (2nd 32 bits)
	update cycle_last
	update cycle_acc

Therefore, we have:
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed? No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). I know you pointed me at the code, but let's assume that I'm still ignorant ;-) Do you actually use the RCU internals, or do you just reimplement an RCU algorithm? -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me: what if we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happen in the following order (because of preemption or interrupts)? Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits):

	CPU A                              CPU B
	clocksource read
	update cycle_mono (1st 32 bits)
	                                   read cycle_mono
	                                   read cycle_last
	                                   clocksource read
	                                   read cycle_mono
	                                   read cycle_last
	update cycle_mono (2nd 32 bits)
	update cycle_last
	update cycle_acc

Therefore, we have:
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed? No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). I know you pointed me at the code, but let's assume that I'm still ignorant ;-) Do you actually use the RCU internals, or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is:

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code.
(see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. Mathieu -- Steve -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
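A userspace sketch of the scheme Mathieu describes, reduced to a single CPU's data: the periodic (IPI-driven) writer fills the inactive slot of a double-buffered 32-to-64-bit counter extension and then flips the slot index, so the reader (who in the kernel would run under preempt_disable()) never sees a slot being modified, provided it finishes before the next update. All names and the two-slot layout are illustrative, not Mathieu's actual code:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Per-CPU 32->64 bit counter extension, double-buffered so the
 * periodic writer never touches the slot a reader is using. */
struct tsc_ext {
	struct {
		uint64_t msb;		/* accumulated upper 32 bits */
		uint32_t last_lsb;	/* hw counter at last update */
	} slot[2];
	_Atomic unsigned cur;		/* index of the slot readers use */
};

/* Periodic write side: fill the inactive slot, then publish it. */
static void tsc_ext_update(struct tsc_ext *t, uint32_t hw_now)
{
	unsigned cur = atomic_load(&t->cur);
	unsigned next = cur ^ 1;
	uint64_t msb = t->slot[cur].msb;

	if (hw_now < t->slot[cur].last_lsb)	/* 32-bit wrap since last update */
		msb += 1ULL << 32;
	t->slot[next].msb = msb;
	t->slot[next].last_lsb = hw_now;
	atomic_store(&t->cur, next);		/* release store in real code */
}

/* Read side: in the kernel this would be bracketed by
 * preempt_disable()/preempt_enable(). */
static uint64_t tsc_ext_read(struct tsc_ext *t, uint32_t hw_now)
{
	unsigned cur = atomic_load(&t->cur);
	uint64_t msb = t->slot[cur].msb;
	uint32_t last = t->slot[cur].last_lsb;

	if (hw_now < last)			/* wrap since last update */
		msb += 1ULL << 32;
	return msb | hw_now;
}
```

The read side is a fixed handful of loads regardless of writer activity, which is the constant-execution-time property argued for later in the thread.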
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). - Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *) - I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed. - Periodical write side: it's called from an IPI running on each CPU. (*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls.
Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. Placing a read_seqlock in get_monotonic_cycles would not be that bad, since the only slowdown would be the rmb. read_seqlocks don't modify global data. Only the write_seqlock does. So the cache line bouncing would only happen on updates in clocksource_accumulate. But then after the caches are all balanced again, the reads will continue fine. Question: Is a cache miss a greater cost than a read of a clocksource (besides the TSC)? Also note how I arrange these variables in the clock struct:

	struct {
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

I could do the following:

	struct {
		seqlock_t cycle_lock;
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

Which would help to keep all these in the same cache line. These are all updated at the same time, and hopefully this will keep the cache line bouncing limited to a single cacheline. -- Steve
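The seqlock variant Steven is proposing can be sketched in userspace with a C11 seqcount: the writer bumps the sequence to an odd value before updating, and back to even afterwards; the reader retries whenever it saw an odd sequence or the sequence changed across its loads. Names and layout are illustrative (the kernel would use seqlock_t and read_seqbegin/read_seqretry):

```c
#include <stdint.h>
#include <stdatomic.h>

typedef uint64_t cycle_t;

/* Seqcount kept next to the data it guards, so one cache line covers
 * both, as suggested in the struct arrangement above. */
struct clock_data {
	_Atomic unsigned seq;
	cycle_t cycle_last;
	cycle_t cycle_monotonic;
};

static void clock_update(struct clock_data *c, cycle_t now)
{
	atomic_fetch_add(&c->seq, 1);	/* odd: update in progress */
	atomic_thread_fence(memory_order_release);
	c->cycle_monotonic += now - c->cycle_last;
	c->cycle_last = now;
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add(&c->seq, 1);	/* even: update done */
}

static cycle_t clock_read(struct clock_data *c, cycle_t now)
{
	unsigned seq;
	cycle_t mono, last;

	do {
		seq = atomic_load(&c->seq);
		atomic_thread_fence(memory_order_acquire);
		mono = c->cycle_monotonic;
		last = c->cycle_last;
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) || seq != atomic_load(&c->seq));

	return mono + (now - last);
}
```

Note that, unlike the preempt-disable scheme, a reader here spins while an update is in flight, which is exactly Mathieu's later objection about NMI/SMI contexts and non-constant read time.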
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) Yes, but then you would trigger a deadlock if you instrument code called from NMI, SMI, MCE contexts :(

	grep -ri NMI drivers/* arch/* | grep -vi PNMI

is quite interesting: it actually shows that a few spots need to handle these special interrupts: watchdogs, oprofile, virtualization and much more in architecture-specific code. I just would not like to add a tracer in the kernel that is _so_ intrusive that module writers and architecture maintainers would have to audit their code and think about the tracer for each implementation that would deal with these kinds of interrupts. do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). I think using a kind of preempt_disable_notrace() would make sense here. I mean.. even if you use the seqlock, eventually, you will want to trace the seqlock behavior. Then you have to find the lightest way to do some sort of synchronization that will have a predictable execution.
seqlock has the following disadvantage: if the seqlock read has to wait for the write seqlock to end, we add some time to the execution of the code we are trying to profile, which will mix up the results. On the other hand, if the read-side executes in a constant number of cycles which does not depend on the write-side activity, then we get a clearer picture of what the time should be accounted for. We can even create a module that will figure out how many nanoseconds are spent reading the clock so we can subtract this time from our analysis if required. That's why, having to choose between read seqlock and preemption disable for the read-side, I would strongly prefer the preemption disable. (constant execution time and it's deadlock-free)

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls.
This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: * Steven Rostedt ([EMAIL PROTECTED]) wrote: One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done in an analysis tool like LTTV, but I don't see any problem with it. Just to expand a bit on the design-side of my proposal : Your module would create profiles based on the hooks called. If we take the interrupt on/off for example, it would be called by lockdep and could keep a table of the top X instructions that disables interrupts for a long time. (it's just an example, you could want to save the pid instead...) 
Then, whenever a profile dump is triggered, you would simply have to send the current state of your profile to lttng with something like:

	struct irq_latency_table {
		void *ip;
		cycles_t time;
	};

	/*
	 * Make sure only one profile at a time is written to the trace for the
	 * whole system.
	 */
	static DECLARE_MUTEX(latency_profile_mutex);

	static struct irq_latency_table latency_table[NR_ENTRIES];

	void irq_latency_dump_profile(void)
	{
		int i;
		char namebuf[KSYM_NAME_LEN];

		mutex_lock(&latency_profile_mutex);
		trace_mark(irq_latency_dump_begin, MARK_NOARGS);
		for (i = 0; i < NR_ENTRIES; i++) {
			sprint_symbol(namebuf, (unsigned long)latency_table[i].ip);
			trace_mark(irq_latency_entry, "ip %p symbol %s time %llu",
				latency_table[i].ip, namebuf,
				(unsigned long long)latency_table[i].time);
		}
		trace_mark(irq_latency_dump_end, MARK_NOARGS);
		mutex_unlock(&latency_profile_mutex);
	}

You can then create an LTTV module that will format your nice output each time a profile dump is encountered. By doing this, your specialized profile generator would only have to hook into the irq on/off events to gather the information it needs, nothing more. I think that would trim the code size and the complexity of your module by an interesting factor. Note that I could optimize the way I currently deal with symbols by not having to dump them in the trace, but since it's only for low-rate events, this optimization has a low priority on my todo list. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Steven Rostedt wrote: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). One way is to make a tracer_preempt_disable() and tracer_preempt_enable(), both of which would be 'notrace'. You could probably optimize them as well. The standard preempt_disable and preempt_enable don't look very efficient (e.g. what's up with converting an increment operation into an addition? - gak!) Any lock you do is going to have a pretty bad effect. In order to be able to trace as much as possible, for KFT, I implemented my own synchronization mechanism using cmpxchg, to avoid using any of the existing kernel locks (which change more often than you'd think, and have weird side effects).

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
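A cmpxchg-based scheme of the kind Tim mentions can be sketched in userspace C11 (illustrative names only, not KFT's actual code): writers claim trace-buffer slots with a compare-and-swap retry loop, so no kernel lock is held and the path stays usable from any context where the atomic itself is safe:

```c
#include <stdint.h>
#include <stdatomic.h>

#define BUF_SIZE 1024

/* Trace buffer whose write cursor is advanced with compare-and-swap
 * instead of a lock, so concurrent writers (or an interrupting
 * context) each end up with a distinct slot. */
struct trace_buf {
	_Atomic uint32_t next;
	uint32_t events[BUF_SIZE];
};

/* Claim the next free slot; returns its index, or -1 if full. */
static int trace_reserve(struct trace_buf *b)
{
	uint32_t old, new;

	do {
		old = atomic_load(&b->next);
		if (old >= BUF_SIZE)
			return -1;		/* buffer full */
		new = old + 1;
		/* cmpxchg: only advances if nobody raced us; on failure
		 * 'old' is refreshed and we retry. */
	} while (!atomic_compare_exchange_weak(&b->next, &old, new));

	return (int)old;
}
```

The retry loop is bounded only by contention, but no writer can block another indefinitely the way a held lock can, which is the deadlock-avoidance property Tim is after.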
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: * Steven Rostedt ([EMAIL PROTECTED]) wrote: Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) Yes, but then you would trigger a deadlock if you instrument code called from NMI, SMI, MCE contexts :(

	grep -ri NMI drivers/* arch/* | grep -vi PNMI

is quite interesting: it actually shows that a few spots need to handle these special interrupts: watchdogs, oprofile, virtualization and much more in architecture-specific code. I just would not like to add a tracer in the kernel that is _so_ intrusive that module writers and architecture maintainers would have to audit their code and think about the tracer for each implementation that would deal with these kinds of interrupts. I don't want driver writers to worry about that either. do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). I think using a kind of preempt_disable_notrace() would make sense here. Actually, after hitting the send button, I thought the same. I mean.. even if you use the seqlock, eventually, you will want to trace the seqlock behavior. Then you have to find the lightest way to do some sort of synchronization that will have a predictable execution. seqlock has the following disadvantage: if the seqlock read has to wait for the write seqlock to end, we add some time to the execution of the code we are trying to profile, which will mix up the results.
On the other hand, if the read-side executes in a constant number of cycles which does not depend on the write-side activity, then we get a clearer picture of what the time should be accounted for. We can even create a module that will figure out how many nanoseconds are spent reading the clock so we can subtract this time from our analysis if required. That's why, having to choose between read seqlock and preemption disable for the read-side, I would strongly prefer the preemption disable. (constant execution time and it's deadlock-free) You may convince me yet ;-)

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc..
in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information. Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: ...

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer: we therefore move the complexity to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information.
Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. There has been some integration done between LTTng and the crash tool to extract the buffers from a crashed kernel. I am not an expert in crash buffer extraction though, but I guess all the available mechanisms depend on kexec and could show the limits you are referring to. If you really want to pretty-print the information to the console, I would propose that you leave that part of the problem to a different output module. The core of the latency tracer could keep the minimum information. Then, when a dump is triggered, it either sends the information to LTTng or to your console pretty-printer. However, I would not call the pretty-printer a tracer module per-se. We would have to accept that it is a bit more tied to the kernel internals than the latency tracer. My goal is to separate the core profiling module from the optional pretty-printing module as much as possible so the latency tracer core could be reused by other output modules. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done in an analysis tool like LTTV, but I don't see any problem with it. 
Of course you don't see any problem with it, because you know LTTV and LTTng very well ;-) latency_tracer has been very instrumental in solving -rt patch latencies: we tell the customer to run with latency trace on, and then have them simply set a few sysctl variables and run their app. By combining this with LTTng, I wouldn't know how to start with telling a customer how to analyze the problem. Simply put, latency_tracer has a much smaller learning curve than LTTng. Not to mention, a smaller footprint. The tracer here is very focused on what to do, and is not meant to be a general profiling tool as LTTng is. In other words, latency_tracer is LTTng-lite ;-) If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like: Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is: (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers wrote: If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like: Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is: (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure the LTTng modules are loaded)

	# arm all the markers
	ltt-armall

	# start lttng tracing
	lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt

	- start latency tracer
	- stop latency tracer
	- trigger latency tracer dump

	While the tracing is active, trigger the condition...
	(rinse, repeat, can handle multiple latency tracer dumps)

	# stop lttng tracing
	lttctl -n test -R

	# disarm all markers
	ltt-disarmall

You can easily test the trace with:

	lttv -m textDump -t /tmp/trace1

Your users would issue something like:

	lttv -m latencytracerDump -t /tmp/trace1

No offense, but this is still quite a bit harder than:

	echo 1 > /proc/something
	... wait a bit ...
	cat /proc/something

(substitute /sys or /debugfs where appropriate) Having to compile something (besides the kernel) for the target is sometimes a major hassle. I avoided it completely with KFT. -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: In other words, latency_tracer is LTTng-lite ;-) If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like. Once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure the LTTng modules are loaded):

#arm all the markers
ltt-armall
#start lttng tracing
lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt

- start latency tracer
- stop latency tracer
- trigger latency tracer dump

While the tracing is active, trigger the condition... (rinse, repeat; this can handle multiple latency tracer dumps)

#stop lttng tracing
lttctl -n test -R
#disarm all markers
ltt-disarmall

You can easily test the trace with:
lttv -m textDump -t /tmp/trace1
Your users would issue something like:
lttv -m latencytracerDump -t /tmp/trace1

That's it. latencytracerDump would be a new specialized plugin, inspired by the generic textDump.c plugin and by the state.c module (for hooking on specific events rather than on _all_ events). It would generate a text output from the trace collected at each latency tracer dump.

Mathieu, That's quite a bit. Currently we have tools that already hook into latency_tracer, e.g. Thomas Gleixner's cyclictest. It turns on the latency tracer, runs a quick test, turns it off. If the latency is OK, the trace is discarded and it runs the test again. When the latency is not acceptable, it stops the test. Then all one needs to do is look at the given trace. Hooking something as simple as this into LTTng is not going to fly. Don't get me wrong, I'm a huge supporter of LTTng, and even recommend it. But there are things for which using LTTng is like using a sledgehammer to drive a tack.
I like to use the tools that are best for the job (which also means easiest). I don't buy the one-tracer-fits-all mentality, and by pushing that too much, I'm thinking we'll never get a tracer into the kernel. The goal of these patches is to get the infrastructure for tracing into the kernel. The latency_tracer is to tracing what Rusty Russell's lguest is to pvops; LTTng is the Xen equivalent. latency_tracer is nothing more than the quick and dirty tracer. For administrators that want to analyze their systems, LTTng is much more appropriate. latency_tracer is more of the first aid in finding latency troubles, and if that doesn't work, then we can do the LTTng surgery. Let's go back and focus on the infrastructure again, namely the mcount and notrace parts of the kernel, as well as the tracer timer, which, by the way, we are looking more into what you have done ;-) -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. thanks -john
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Thanks John for doing this! (comments embedded) On Wed, 16 Jan 2008, john stultz wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: Completely un-tested, but it builds, so I figured I'd send it out for review. heh, ok, I'll take it and run it. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,21 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num = cs->base_num;
+	cycle_t offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
 	return offset;
 }

@@ -197,14 +207,25 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = !cs->base_num;
+	cycle_t offset = (now - cs->base[!num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;

I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed.

+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

Is the above just to decouple the two methods?

 	ns_offset = cyc2ns(clock, cycle_delta);

 	return ns_offset;
@@ -105,35 +107,7 @@

 cycle_t notrace get_monotonic_cycles(void)
 {
-	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
-
-	do {
-		/*
-		 * cycle_raw and cycle_last can change on
-		 * another CPU and we need the delta calculation
-		 * of cycle_now and cycle_last happen atomic, as well
-		 * as the adding to cycle_raw. We don't need to grab
-		 * any
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: * john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;

Shouldn't cycle_last and cycle_accumulated be in the array too?

No, we're leaving cycle_last and cycle_accumulated alone. They're related to the update_wall_time conversion of cycles to xtime.

+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {

We had cycle_raw before, why do we need the following two?

+			cycle_t cycle_base_last, cycle_base;

I'm not quite sure why you need both cycle_base_last and cycle_base...

So on my first shot at this, I tried to layer the concepts: using the lock-free method to create an abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather than an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock-free method as well. So I just backed off and tried to make it simple: we have two sets of data that count cycles from the clocksource, one for timekeeping and one for get_monotonic_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but it's just hidden until you implement the vsyscall gtod methods).

I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock?

No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create a lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically).

What I am trying to say is: I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the real clock source?

The assumptions about the real clocksource are the same we keep in the timekeeping core.
It counts forward, at a constant rate, and only wraps after the mask value has been reached.

Is the intent to deal with an HPET suddenly reset to 0 or something like this?

Well, dealing with clocksources wrapping short of 64bits.

Basically, I wonder why you have to calculate the current cycle count from the previous update_wall_time event. Is it because you need to be consistent when a clocksource change occurs?

Actually, we try to do it from the last clocksource_accumulate() call (which is called from update_wall_time).

+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,21 @@
 }

 /**
- *
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Steven Rostedt wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in How about simply cpu_synth->index ^ 1? Seems the best choice if you ask me, if all you are doing is changing it from 1 to zero and back to 1. FYI:

[EMAIL PROTECTED]:~/c$ cat flipme.c
int flip1 (int x) { return !x; }
int flip2 (int x) { return x ? 0 : 1; }
int flip3 (int x) { return x ^ 1; }
[EMAIL PROTECTED]:~/c$ gcc -O2 -c flipme.c
[EMAIL PROTECTED]:~/c$ objdump -d flipme.o

flipme.o:     file format elf32-i386

Disassembly of section .text:

00000000 <flip1>:
   0:	55          	push   %ebp
   1:	31 c0       	xor    %eax,%eax
   3:	89 e5       	mov    %esp,%ebp
   5:	83 7d 08 00 	cmpl   $0x0,0x8(%ebp)
   9:	5d          	pop    %ebp
   a:	0f 94 c0    	sete   %al
   d:	c3          	ret
   e:	66 90       	xchg   %ax,%ax

00000010 <flip2>:
  10:	55          	push   %ebp
  11:	31 c0       	xor    %eax,%eax
  13:	89 e5       	mov    %esp,%ebp
  15:	83 7d 08 00 	cmpl   $0x0,0x8(%ebp)
  19:	5d          	pop    %ebp
  1a:	0f 94 c0    	sete   %al
  1d:	c3          	ret
  1e:	66 90       	xchg   %ax,%ax

00000020 <flip3>:
  20:	55          	push   %ebp
  21:	89 e5       	mov    %esp,%ebp
  23:	8b 45 08    	mov    0x8(%ebp),%eax
  26:	5d          	pop    %ebp
  27:	83 f0 01    	xor    $0x1,%eax
  2a:	c3          	ret

So, if you know for sure that x is only 1 or 0, then using x ^ 1 to invert it seems the most efficient. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); !0 is not necessarily 1. Incorrect. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as true, but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing !cs->base_num to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence. (One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Linus Torvalds ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); !0 is not necessarily 1. Incorrect. Hrm, *digging in my mailbox*, ah, here it is: http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html Richard Purdie reviewed my code back in 2006 and made this modification. Maybe he will have something to add. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as true, but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing !cs->base_num to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence.
(One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Ok, I'll adopt (1-x) then. Thanks! Mathieu Linus -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as a parameter. It means that the clocksource may have been read in the previous RCU window.

Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 18:12:53.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,29 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num;
+	cycle_t now, offset;
+
+	preempt_disable();
+	num = cs->base_num;
+	smp_read_barrier_depends();
+	now = clocksource_read(cs);
+	offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
+	preempt_enable();
+
 	return offset;
 }

@@ -197,14 +215,26 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = 1 - cs->base_num;
+	cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;
+	wmb();
+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 17:51:50.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

 	ns_offset = cyc2ns(clock, cycle_delta);

 	return ns_offset;
@@ -105,35 +107,7 @@

 cycle_t notrace get_monotonic_cycles(void)
 {
-	cycle_t cycle_now, cycle_delta,
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: * john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;

Shouldn't cycle_last and cycle_accumulated be in the array too?

No, we're leaving cycle_last and cycle_accumulated alone. They're related to the update_wall_time conversion of cycles to xtime.

+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {

We had cycle_raw before, why do we need the following two?

+			cycle_t cycle_base_last, cycle_base;

I'm not quite sure why you need both cycle_base_last and cycle_base...

So on my first shot at this, I tried to layer the concepts: using the lock-free method to create an abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather than an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock-free method as well. So I just backed off and tried to make it simple: we have two sets of data that count cycles from the clocksource, one for timekeeping and one for get_monotonic_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but it's just hidden until you implement the vsyscall gtod methods).

I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock?

No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create a lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically).

What I am trying to say is: I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the real clock source?

The assumptions about the real clocksource are the same we keep in the timekeeping core.
It counts forward, at a constant rate, and only wraps after the mask value has been reached.

Is the intent to deal with an HPET suddenly reset to 0 or something like this?

Well, dealing with clocksources wrapping short of 64bits.

Ah ok, then the problem is clearer :) The main difference between the approach I use and yours is that, let's say your clocksource starts a while after the system started running, then it will start the accumulator at 0. You therefore have to keep track of the total time accumulated and the current lower-order bits of the last accumulation upon clocksource_accumulate(). I
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: Thanks John for doing this! (comments embedded) On Wed, 16 Jan 2008, john stultz wrote:

+	int num = !cs->base_num;
+	cycle_t offset = (now - cs->base[!num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;

I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed.

Hopefully addressed in the current version.

Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

Is the above just to decouple the two methods?

Yep. clocksource_get_cycles() ended up not being as useful as a helper function (I was hoping the arch vsyscall implementations could use it, but they've done too much optimization - although that may reflect a need up the chain to the clocksource structure). thanks -john
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as a parameter. It means that the clocksource may have been read in the previous RCU window.

Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 18:12:53.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,29 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num;
+	cycle_t now, offset;
+
+	preempt_disable();
+	num = cs->base_num;
+	smp_read_barrier_depends();
+	now = clocksource_read(cs);
+	offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
+	preempt_enable();
+
 	return offset;
 }

@@ -197,14 +215,26 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = 1 - cs->base_num;

(nitpick) right here, you could probably express 1-num with cs->base_num, since we are the only ones supposed to touch it.

+	cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+	offset &= cs->mask;

here too.

+	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;
+	wmb();

As I just emailed: smp_wmb() *should* be enough. I don't see which architecture could reorder writes wrt local interrupts?
(please tell me if I am grossly mistaken) Mathieu

+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 17:51:50.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta =
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote:
> On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > Thanks John for doing this! (comments embedded)
> >
> > On Wed, 16 Jan 2008, john stultz wrote:
> > > +	int num = !cs->base_num;
> > > +	cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > +	offset &= cs->mask;
> > > +	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > +	cs->base[num].cycle_base_last = now;
> >
> > I would think that we would need some sort of barrier here. Otherwise,
> > base_num could be updated before all the cycle_base. I'd expect a
> > smp_wmb is needed.
>
> Hopefully addressed in the current version.
>
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===================================================================
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
> @@ -71,10 +71,12 @@
>   */
>  static inline s64 __get_nsec_offset(void)
>  {
> -	cycle_t cycle_delta;
> +	cycle_t now, cycle_delta;
>  	s64 ns_offset;
>
> -	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> +	now = clocksource_read(clock);
> +	cycle_delta = (now - clock->cycle_last) & clock->mask;
> +	cycle_delta += clock->cycle_accumulated;
>
> > Is the above just to decouple the two methods?
>
> Yep. clocksource_get_cycles() ended up not being as useful as a helper
> function (I was hoping the arch vsyscall implementations could use it,
> but they've done too much optimization - although that may reflect a
> need up the chain to the clocksource structure).

The problem with vsyscall is that we will have a hard time disabling
preemption :( Therefore, ensuring that the read of the data is done in a
timely manner is hard to do.

Mathieu

> thanks
> -john

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > Yep. clocksource_get_cycles() ended up not being as useful as a helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.

You'll have more than a hard time disabling preemption for vsyscall.
We'll need to come up with a better solution then. vsyscall can not
modify any kernel memory, nor can it disable preemption.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * john stultz ([EMAIL PROTECTED]) wrote:
> > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > > Thanks John for doing this! (comments embedded)
> > >
> > > On Wed, 16 Jan 2008, john stultz wrote:
> > > > +	int num = !cs->base_num;
> > > > +	cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > > +	offset &= cs->mask;
> > > > +	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > > +	cs->base[num].cycle_base_last = now;
> > >
> > > I would think that we would need some sort of barrier here. Otherwise,
> > > base_num could be updated before all the cycle_base. I'd expect a
> > > smp_wmb is needed.
> >
> > Hopefully addressed in the current version.
> >
> > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > ===================================================================
> > --- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
> > +++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
> > @@ -71,10 +71,12 @@
> >   */
> >  static inline s64 __get_nsec_offset(void)
> >  {
> > -	cycle_t cycle_delta;
> > +	cycle_t now, cycle_delta;
> >  	s64 ns_offset;
> >
> > -	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > +	now = clocksource_read(clock);
> > +	cycle_delta = (now - clock->cycle_last) & clock->mask;
> > +	cycle_delta += clock->cycle_accumulated;
> >
> > > Is the above just to decouple the two methods?
> >
> > Yep. clocksource_get_cycles() ended up not being as useful as a helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.

Sorry for self-reply, but I thought, in the past, of a way to make this
possible.

It would imply the creation of a new vsyscall : vgetschedperiod

It would read a counter that would increment each time the thread is
scheduled out (or in). It would be a per thread counter (not a per cpu
counter) so we can deal appropriately with a stopped thread that would
happen to come back running a long time afterward (if we do per-cpu
counters, we could get the same 32 bits counter value falsely if it is
shared with other thread activity).

Then, the clocksource read code would look like :

int period;

do {
	period = vgetschedperiod();

	perform the clocksource read..

} while (period != vgetschedperiod());

Therefore, we would be sure that we have not been scheduled out while
reading the value. I think this new vsyscall could be useful for others.
Actually, it would make implementation of RCU in user-space possible (as
long as the read-side can retry the read operation).

Mathieu

> thanks
> -john

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter (not a per cpu
> counter) so we can deal appropriately with a stopped thread that would
> happen to come back running a long time afterward (if we do per-cpu
> counters, we could get the same 32 bits counter value falsely if it is
> shared with other thread activity).
>
> Then, the clocksource read code would look like :
>
> int period;
>
> do {
>	period = vgetschedperiod();
>
>	perform the clocksource read..
>
> } while (period != vgetschedperiod());
>
> Therefore, we would be sure that we have not been scheduled out while
> reading the value. I think this new vsyscall could be useful for others.
> Actually, it would make implementation of RCU in user-space possible (as
> long as the read-side can retry the read operation).

This is something that I would agree is useful.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers writes:
> Sorry for self-reply, but I thought, in the past, of a way to make this
> possible.
>
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter

It's very hard to do a per-thread counter in the VDSO, since threads in
the same process see the same memory, by definition. You'd have to have
an array of counters and have some way for each thread to know which
entry to read. Also you'd have to find space for tens or hundreds of
thousands of counters, since there can be that many threads in a process
sometimes.

Paul.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Paul Mackerras wrote:
> It's very hard to do a per-thread counter in the VDSO, since threads in
> the same process see the same memory, by definition. You'd have to have
> an array of counters and have some way for each thread to know which
> entry to read. Also you'd have to find space for tens or hundreds of
> thousands of counters, since there can be that many threads in a process
> sometimes.

I was thinking about this. What would also work is just the ability to
read the schedule counter for the current cpu. Now this would require
that the task had a way to know which CPU it was currently on.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Paul Mackerras ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers writes:
> > Sorry for self-reply, but I thought, in the past, of a way to make
> > this possible.
> >
> > It would imply the creation of a new vsyscall : vgetschedperiod
> >
> > It would read a counter that would increment each time the thread is
> > scheduled out (or in). It would be a per thread counter
>
> It's very hard to do a per-thread counter in the VDSO, since threads in
> the same process see the same memory, by definition. You'd have to have
> an array of counters and have some way for each thread to know which
> entry to read. Also you'd have to find space for tens or hundreds of
> thousands of counters, since there can be that many threads in a process
> sometimes.
>
> Paul.

Crazy ideas :

Could we do something along the lines of the thread local storage ?

Or could we map a per-thread page that would contradict this definition ?

Or can we move down the beginning of the user-space thread stack by 4
bytes (it's already put at a random address anyway) and use these 32 bits
to put our variable ? We don't care if userspace also modifies it; the
kernel would blindly increment it, so there would be no security concerns
involved.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Ok, but what actually ensures that the clock->cycle_* reads won't be
> > reordered across the clocksource_read() ?
>
> Hmm, interesting. I didn't notice that clocksource_read() is a static
> inline. I was thinking that since it was passing a pointer to a function,
> gcc could not assume that it could move that code across it. But now
> looking to see that clocksource_read is simply a static inline that does:
>
>	cs->read();
>
> But still, can gcc assume that it can push loads of unknown origin
> variables across function calls? So something like:
>
> static int *glob;
>
> void foo(void) {
>	int x;
>
>	x = *glob;
>
>	bar();
>
>	if (x != *glob)
>		/* ... */
> }
>
> I can't see how any compiler could honestly move the loading of the first
> x after the calling of bar(). With glob pointing to some unknown
> variable, that may be perfectly fine for bar to modify.
>
> > > > > +		cycle_raw = clock->cycle_raw;
> > > > > +		cycle_last = clock->cycle_last;
> > > > > +
> > > > > +		/* read clocksource: */
> > > > > +		cycle_now = clocksource_read(clock);
>
> So the question here is, can cycle_raw and cycle_last be loaded from
> the unknown source that clock points to after the call to
> clocksource_read()?
>
> I'm thinking not.

I agree with you that I don't see how the compiler could reorder this. So
we forget about compiler barriers.

Also, the clock source used is a synchronized clock source
(get_cycles_sync on x86_64), so it should make sure the TSC is read at
the right moment.

However, what happens if the clock source is, say, the jiffies ? In this
case, we have :

static cycle_t jiffies_read(void)
{
	return (cycle_t) jiffies;
}

Which is nothing more than a memory read of

extern unsigned long volatile __jiffy_data jiffies;

I think it is wrong to assume that reads from clock->cycle_raw and from
jiffies will be ordered correctly in SMP. I am tempted to think that
ordering memory writes to clock->cycle_raw vs jiffies is also needed in
this case (where clock->cycle_raw is updated, or where jiffies is
updated).

We can fall in the same kind of issue if we read the HPET, which is
memory I/O based. It does not seem correct to assume that MMIO vs normal
memory reads are ordered. (pointing back to this article :
http://lwn.net/Articles/198988/)

Mathieu

> > > > > +
> > > > > +		/* calculate the delta since the last update_wall_time: */
> > > > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > > +
> > > > > +	} while (cycle_raw != clock->cycle_raw ||
> > > > > +		cycle_last != clock->cycle_last);
> > > > > +
> > > > > +	return cycle_raw + cycle_delta;
> > > > > +}
>
> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Ok, but what actually ensures that the clock->cycle_* reads won't be
> > reordered across the clocksource_read() ?

Hmm, interesting. I didn't notice that clocksource_read() is a static
inline. I was thinking that since it was passing a pointer to a function,
gcc could not assume that it could move that code across it. But now
looking to see that clocksource_read is simply a static inline that does:

	cs->read();

But still, can gcc assume that it can push loads of unknown origin
variables across function calls? So something like:

static int *glob;

void foo(void) {
	int x;

	x = *glob;

	bar();

	if (x != *glob)
		/* ... */
}

I can't see how any compiler could honestly move the loading of the first
x after the calling of bar(). With glob pointing to some unknown
variable, that may be perfectly fine for bar to modify.

> > > > +		cycle_raw = clock->cycle_raw;
> > > > +		cycle_last = clock->cycle_last;
> > > > +
> > > > +		/* read clocksource: */
> > > > +		cycle_now = clocksource_read(clock);

So the question here is, can cycle_raw and cycle_last be loaded from
the unknown source that clock points to after the call to
clocksource_read()?

I'm thinking not.

> > > > +
> > > > +		/* calculate the delta since the last update_wall_time: */
> > > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > +
> > > > +	} while (cycle_raw != clock->cycle_raw ||
> > > > +		cycle_last != clock->cycle_last);
> > > > +
> > > > +	return cycle_raw + cycle_delta;
> > > > +}

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> > >
> > > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> > > ---
> > >  include/linux/clocksource.h |    3 ++
> > >  kernel/time/timekeeping.c   |   48
> > >  2 files changed, 51 insertions(+)
> > >
> > > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> > > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> > > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > >  	timespec_add_ns(ts, nsecs);
> > >  }
> > >
> > > +cycle_t notrace get_monotonic_cycles(void)
> > > +{
> > > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > > +
> > > +	do {
> > > +		/*
> > > +		 * cycle_raw and cycle_last can change on
> > > +		 * another CPU and we need the delta calculation
> > > +		 * of cycle_now and cycle_last happen atomic, as well
> > > +		 * as the adding to cycle_raw. We don't need to grab
> > > +		 * any locks, we just keep trying until get all the
> > > +		 * calculations together in one state.
> > > +		 *
> > > +		 * In fact, we __cant__ grab any locks. This
> > > +		 * function is called from the latency_tracer which can
> > > +		 * be called anywhere. To grab any locks (including
> > > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > > +		 */
> >
> > I wonder what makes the compiler read the clock->cycle_raw and
> > clock->cycle_last variables twice ? I guess some memory barriers could
> > be welcome here ?
>
> We need both cycle_raw and cycle_last to be the same from the time we read
> the clock source to the time we calculate cycle_delta. If either one
> changes then delta is bogus.

Ok, but what actually ensures that the clock->cycle_* reads won't be
reordered across the clocksource_read() ?

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);
> > > +
> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}
>
> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Steven Rostedt wrote:
> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

Ah, I changed this in the -rt patch queue, and never moved the patch back
here.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> > ---
> >  include/linux/clocksource.h |    3 ++
> >  kernel/time/timekeeping.c   |   48
> >  2 files changed, 51 insertions(+)
> >
> > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > ===================================================================
> > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> >  	timespec_add_ns(ts, nsecs);
> >  }
> >
> > +cycle_t notrace get_monotonic_cycles(void)
> > +{
> > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > +
> > +	do {
> > +		/*
> > +		 * cycle_raw and cycle_last can change on
> > +		 * another CPU and we need the delta calculation
> > +		 * of cycle_now and cycle_last happen atomic, as well
> > +		 * as the adding to cycle_raw. We don't need to grab
> > +		 * any locks, we just keep trying until get all the
> > +		 * calculations together in one state.
> > +		 *
> > +		 * In fact, we __cant__ grab any locks. This
> > +		 * function is called from the latency_tracer which can
> > +		 * be called anywhere. To grab any locks (including
> > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > +		 */
>
> I wonder what makes the compiler read the clock->cycle_raw and
> clock->cycle_last variables twice ? I guess some memory barriers could
> be welcome here ?

We need both cycle_raw and cycle_last to be the same from the time we read
the clock source to the time we calculate cycle_delta. If either one
changes then delta is bogus.

Also, it just occurred to me that this is an old patch. I thought I
renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

> > +		cycle_raw = clock->cycle_raw;
> > +		cycle_last = clock->cycle_last;
> > +
> > +		/* read clocksource: */
> > +		cycle_now = clocksource_read(clock);
> > +
> > +		/* calculate the delta since the last update_wall_time: */
> > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > +
> > +	} while (cycle_raw != clock->cycle_raw ||
> > +		cycle_last != clock->cycle_last);
> > +
> > +	return cycle_raw + cycle_delta;
> > +}

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> The latency tracer needs a way to get an accurate time
> without grabbing any locks. Locks themselves might call
> the latency tracer and cause at best a slow down.
>
> This patch adds get_monotonic_cycles that returns cycles
> from a reliable clock source in a monotonic fashion.
>
> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> ---
>  include/linux/clocksource.h |    3 ++
>  kernel/time/timekeeping.c   |   48
>  2 files changed, 51 insertions(+)
>
> Index: linux-compile-i386.git/kernel/time/timekeeping.c
> ===================================================================
> --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
>  	timespec_add_ns(ts, nsecs);
>  }
>
> +cycle_t notrace get_monotonic_cycles(void)
> +{
> +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> +
> +	do {
> +		/*
> +		 * cycle_raw and cycle_last can change on
> +		 * another CPU and we need the delta calculation
> +		 * of cycle_now and cycle_last happen atomic, as well
> +		 * as the adding to cycle_raw. We don't need to grab
> +		 * any locks, we just keep trying until get all the
> +		 * calculations together in one state.
> +		 *
> +		 * In fact, we __cant__ grab any locks. This
> +		 * function is called from the latency_tracer which can
> +		 * be called anywhere. To grab any locks (including
> +		 * seq_locks) we risk putting ourselves into a deadlock.
> +		 */

I wonder what makes the compiler read the clock->cycle_raw and
clock->cycle_last variables twice ? I guess some memory barriers could be
welcome here ?

> +		cycle_raw = clock->cycle_raw;
> +		cycle_last = clock->cycle_last;
> +
> +		/* read clocksource: */
> +		cycle_now = clocksource_read(clock);
> +
> +		/* calculate the delta since the last update_wall_time: */
> +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> +
> +	} while (cycle_raw != clock->cycle_raw ||
> +		cycle_last != clock->cycle_last);
> +
> +	return cycle_raw + cycle_delta;
> +}
> +
> +unsigned long notrace cycles_to_usecs(cycle_t cycles)
> +{
> +	u64 ret = cyc2ns(clock, cycles);
> +
> +	ret += NSEC_PER_USEC/2; /* For rounding in do_div() */
> +	do_div(ret, NSEC_PER_USEC);
> +
> +	return ret;
> +}
> +
> +cycle_t notrace usecs_to_cycles(unsigned long usecs)
> +{
> +	return ns2cyc(clock, (u64)usecs * 1000);
> +}
> +
> /**
>  * getnstimeofday - Returns the time of day in a timespec
>  * @ts:	pointer to the timespec to be set
> Index: linux-compile-i386.git/include/linux/clocksource.h
> ===================================================================
> --- linux-compile-i386.git.orig/include/linux/clocksource.h	2008-01-09 14:27:51.0 -0500
> +++ linux-compile-i386.git/include/linux/clocksource.h	2008-01-09 14:29:44.0 -0500
> @@ -273,6 +273,9 @@ extern int clocksource_register(struct c
>  extern struct clocksource* clocksource_get_next(void);
>  extern void clocksource_change_rating(struct clocksource *cs, int rating);
>  extern void clocksource_resume(void);
> +extern cycle_t get_monotonic_cycles(void);
> +extern unsigned long cycles_to_usecs(cycle_t cycles);
> +extern cycle_t usecs_to_cycles(unsigned long usecs);
>
>  /* used to initialize clock */
>  extern struct clocksource clocksource_jiffies;

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: The latency tracer needs a way to get an accurate time without grabbing any locks. Locks themselves might call the latency tracer and cause at best a slow down. This patch adds get_monotonic_cycles that returns cycles from a reliable clock source in a monotonic fashion. Signed-off-by: Steven Rostedt [EMAIL PROTECTED] --- include/linux/clocksource.h |3 ++ kernel/time/timekeeping.c | 48 2 files changed, 51 insertions(+) Index: linux-compile-i386.git/kernel/time/timekeeping.c === --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.0 -0500 +++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.0 -0500 @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_ timespec_add_ns(ts, nsecs); } +cycle_t notrace get_monotonic_cycles(void) +{ + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; + + do { + /* + * cycle_raw and cycle_last can change on + * another CPU and we need the delta calculation + * of cycle_now and cycle_last happen atomic, as well + * as the adding to cycle_raw. We don't need to grab + * any locks, we just keep trying until get all the + * calculations together in one state. + * + * In fact, we __cant__ grab any locks. This + * function is called from the latency_tracer which can + * be called anywhere. To grab any locks (including + * seq_locks) we risk putting ourselves into a deadlock. + */ I wonder what makes the compiler read the clock-cycle_raw and clock-cycle_last variables twice ? I guess some memory barriers could be welcome here ? 
+ cycle_raw = clock-cycle_raw; + cycle_last = clock-cycle_last; + + /* read clocksource: */ + cycle_now = clocksource_read(clock); + + /* calculate the delta since the last update_wall_time: */ + cycle_delta = (cycle_now - cycle_last) clock-mask; + + } while (cycle_raw != clock-cycle_raw || + cycle_last != clock-cycle_last); + + return cycle_raw + cycle_delta; +} + +unsigned long notrace cycles_to_usecs(cycle_t cycles) +{ + u64 ret = cyc2ns(clock, cycles); + + ret += NSEC_PER_USEC/2; /* For rounding in do_div() */ + do_div(ret, NSEC_PER_USEC); + + return ret; +} + +cycle_t notrace usecs_to_cycles(unsigned long usecs) +{ + return ns2cyc(clock, (u64)usecs * 1000); +} + /** * getnstimeofday - Returns the time of day in a timespec * @ts: pointer to the timespec to be set Index: linux-compile-i386.git/include/linux/clocksource.h === --- linux-compile-i386.git.orig/include/linux/clocksource.h 2008-01-09 14:27:51.0 -0500 +++ linux-compile-i386.git/include/linux/clocksource.h2008-01-09 14:29:44.0 -0500 @@ -273,6 +273,9 @@ extern int clocksource_register(struct c extern struct clocksource* clocksource_get_next(void); extern void clocksource_change_rating(struct clocksource *cs, int rating); extern void clocksource_resume(void); +extern cycle_t get_monotonic_cycles(void); +extern unsigned long cycles_to_usecs(cycle_t cycles); +extern cycle_t usecs_to_cycles(unsigned long usecs); /* used to initialize clock */ extern struct clocksource clocksource_jiffies; -- -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: Signed-off-by: Steven Rostedt [EMAIL PROTECTED] --- include/linux/clocksource.h |3 ++ kernel/time/timekeeping.c | 48 2 files changed, 51 insertions(+) Index: linux-compile-i386.git/kernel/time/timekeeping.c === --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.0 -0500 +++ linux-compile-i386.git/kernel/time/timekeeping.c2008-01-09 14:34:40.0 -0500 @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_ timespec_add_ns(ts, nsecs); } +cycle_t notrace get_monotonic_cycles(void) +{ + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; + + do { + /* +* cycle_raw and cycle_last can change on +* another CPU and we need the delta calculation +* of cycle_now and cycle_last happen atomic, as well +* as the adding to cycle_raw. We don't need to grab +* any locks, we just keep trying until get all the +* calculations together in one state. +* +* In fact, we __cant__ grab any locks. This +* function is called from the latency_tracer which can +* be called anywhere. To grab any locks (including +* seq_locks) we risk putting ourselves into a deadlock. +*/ I wonder what makes the compiler read the clock-cycle_raw and clock-cycle_last variables twice ? I guess some memory barriers could be welcome here ? We need both cycle_raw and cycle_last to be the same from the time we read the clock source to the time we calculate cycle_delta. If either one changes then delta is bogus. Also, it just occurred to me that this is an old patch. I thought I renamed cycle_raw to cycle_monotonic. 
But I must have lost that patch :-/ + cycle_raw = clock-cycle_raw; + cycle_last = clock-cycle_last; + + /* read clocksource: */ + cycle_now = clocksource_read(clock); + + /* calculate the delta since the last update_wall_time: */ + cycle_delta = (cycle_now - cycle_last) clock-mask; + + } while (cycle_raw != clock-cycle_raw || +cycle_last != clock-cycle_last); + + return cycle_raw + cycle_delta; +} -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Steven Rostedt wrote:

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

Ah, I changed this in the -rt patch queue, and never moved the patch back
here.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> > > Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
> > > ---
> > >  include/linux/clocksource.h |    3 ++
> > >  kernel/time/timekeeping.c   |   48 ++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 51 insertions(+)
> > >
> > > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.000000000 -0500
> > > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.000000000 -0500
> > > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > >  	timespec_add_ns(ts, nsecs);
> > >  }
> > >
> > > +cycle_t notrace get_monotonic_cycles(void)
> > > +{
> > > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > > +
> > > +	do {
> > > +		/*
> > > +		 * cycle_raw and cycle_last can change on
> > > +		 * another CPU and we need the delta calculation
> > > +		 * of cycle_now and cycle_last happen atomic, as well
> > > +		 * as the adding to cycle_raw. We don't need to grab
> > > +		 * any locks, we just keep trying until get all the
> > > +		 * calculations together in one state.
> > > +		 *
> > > +		 * In fact, we __cant__ grab any locks. This
> > > +		 * function is called from the latency_tracer which can
> > > +		 * be called anywhere. To grab any locks (including
> > > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > > +		 */
> >
> > I wonder what makes the compiler read the clock->cycle_raw and
> > clock->cycle_last variables twice ? I guess some memory barriers could be
> > welcome here ?
>
> We need both cycle_raw and cycle_last to be the same from the time we read
> the clock source to the time we calculate cycle_delta. If either one
> changes then delta is bogus.

Ok, but what actually insures that the clock->cycle_* reads won't be
reordered across the clocksource_read() ?

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);
> > > +
> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}
>
> -- Steve

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:

> Ok, but what actually insures that the clock->cycle_* reads won't be
> reordered across the clocksource_read() ?

*looks at code*

Hmm, interesting. I didn't notice that clocksource_read() is a static
inline. I was thinking that since it was passing a pointer to a function,
gcc could not assume that it could move that code across it. But now
looking to see that clocksource_read is simply a static inline that does:

	cs->read();

But still, can gcc assume that it can push loads of unknown origin
variables across function calls? So something like:

	static int *glob;

	void foo(void)
	{
		int x;

		x = *glob;

		bar();

		if (x != *glob)
			/* ... */
	}

I can't see how any compiler could honestly move the loading of the first
x after the calling of bar(). With glob pointing to some unknown variable,
that may be perfectly fine for bar to modify.

> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);

So the question here is, can cycle_raw and cycle_last be loaded from the
unknown source that clock points to after the call to clocksource_read()?
I'm thinking not.

> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}

-- Steve