Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Frank Ch. Eigler ([EMAIL PROTECTED]) wrote:
> Hi -
>
> On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> > [...]
> > > All this complexity is to be justified by keeping the raw prev/next
> > > pointers from being sent to a naive tracer?  It seems to me way out of
> > > proportion.
> >
> > Damn, and I just blew away all my marker code for something like this ;-)
>
> Sorry! :-)
>
> > [...]
> > We have in sched.c the following marker:
> >
> >   trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);
>
> Fine so far!
>
> > Then Mathieu can add in some code somewhere (or a module, or something)
> >
> >   ret = marker_probe_register("kernel_sched_scheduler",
> >                               "prev %p next %p",
> >                               pretty_print_sched_switch, NULL);
> >
> > static void pretty_print_sched_switch(const struct marker *mdata,
> >                                       void *private_data,
> >                                       const char *format, ...)
> > {
> >         [...]
> >         trace_mark(kernel_pretty_print_sched_switch,
> >                    "prev_pid %d next_pid %d prev_state %ld",
> >                    prev->pid, next->pid, prev->state);
> > }
>
> That marker_probe_register call would need to be done only when the
> embedded (k_p_p_s_s) marker is actually being used.  Otherwise we'd
> lose all the savings of a dormant sched.c marker by always calling
> into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
> was active.
>
> In any case, if the naive tracer agrees to become educated about some
> of these markers in the form of intermediary functions like that, they
> need not insist on a second hop through marker territory anyway:
>
> static void pretty_print_sched_switch(const struct marker *mdata,
>                                       void *private_data,
>                                       const char *format, ...)
> {
>         [...]
>         lttng_backend_trace(kernel_pretty_print_sched_switch,
>                             "prev_pid %d next_pid %d prev_state %ld",
>                             prev->pid, next->pid, prev->state);
> }

Oh! perfect then :) Since I already planned my ltt-marker-control kernel
module to connect specialized callbacks instead of the dumb one, it
shouldn't be so hard to do.

I would just have to find another way to declare the trace events (it's
currently embedded in the markers), but it's not a showstopper.

I'll try this.

Thanks to you both for the good proposals,

Mathieu

> - FChE

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> [...]
> > All this complexity is to be justified by keeping the raw prev/next
> > pointers from being sent to a naive tracer?  It seems to me way out of
> > proportion.
>
> Damn, and I just blew away all my marker code for something like this ;-)

Sorry! :-)

> [...]
> We have in sched.c the following marker:
>
>   trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);

Fine so far!

> Then Mathieu can add in some code somewhere (or a module, or something)
>
>   ret = marker_probe_register("kernel_sched_scheduler",
>                               "prev %p next %p",
>                               pretty_print_sched_switch, NULL);
>
> static void pretty_print_sched_switch(const struct marker *mdata,
>                                       void *private_data,
>                                       const char *format, ...)
> {
>         [...]
>         trace_mark(kernel_pretty_print_sched_switch,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
> }

That marker_probe_register call would need to be done only when the
embedded (k_p_p_s_s) marker is actually being used.  Otherwise we'd
lose all the savings of a dormant sched.c marker by always calling
into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
was active.

In any case, if the naive tracer agrees to become educated about some
of these markers in the form of intermediary functions like that, they
need not insist on a second hop through marker territory anyway:

static void pretty_print_sched_switch(const struct marker *mdata,
                                      void *private_data,
                                      const char *format, ...)
{
        [...]
        lttng_backend_trace(kernel_pretty_print_sched_switch,
                            "prev_pid %d next_pid %d prev_state %ld",
                            prev->pid, next->pid, prev->state);
}

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Fri, 18 Jan 2008, Frank Ch. Eigler wrote:
>
> All this complexity is to be justified by keeping the raw prev/next
> pointers from being sent to a naive tracer?  It seems to me way out of
> proportion.

Damn, and I just blew away all my marker code for something like this ;-)

Actually, you just gave me a great idea that I think can help all of us.
OK, Mathieu may not be in total agreement, but I think this is the
ultimate compromise.

We have in sched.c the following marker:

  trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);

Then Mathieu can add in some code somewhere (or a module, or something)

  ret = marker_probe_register("kernel_sched_scheduler",
                              "prev %p next %p",
                              pretty_print_sched_switch, NULL);

static void pretty_print_sched_switch(const struct marker *mdata,
                                      void *private_data,
                                      const char *format, ...)
{
        va_list ap;
        struct task_struct *prev;
        struct task_struct *next;

        va_start(ap, format);
        prev = va_arg(ap, typeof(prev));
        next = va_arg(ap, typeof(next));
        va_end(ap);

        trace_mark(kernel_pretty_print_sched_switch,
                   "prev_pid %d next_pid %d prev_state %ld",
                   prev->pid, next->pid, prev->state);
}

Then LTTng on startup could arm the normal kernel_sched_switch code and
have the user see the nice one.  All without adding any more goo or
overhead to the non tracing case, and keeping a few critical markers with
enough information to be useful to other tracers!

Thoughts?

-- Steve
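[Editorial note] The callback pattern in the message above can be exercised outside the kernel. The following is a hedged, userspace-only sketch: a probe with the varargs marker-callback signature decodes the raw prev/next pointers with va_arg and re-emits a pretty-printed event. `struct task_struct`, `fire_sched_switch_marker()`, and `last_event` are simplified stand-ins invented for illustration, not the kernel's marker API.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct task_struct {            /* stand-in for the kernel's struct */
    int pid;
    long state;
};

static char last_event[128];    /* captures the "pretty" event for inspection */

/* Probe with the marker-callback shape from the mail: it receives the
 * format string plus the raw arguments of the "prev %p next %p" marker. */
static void pretty_print_sched_switch(void *private_data,
                                      const char *format, ...)
{
    va_list ap;
    struct task_struct *prev, *next;

    (void)private_data;
    va_start(ap, format);
    prev = va_arg(ap, struct task_struct *);
    next = va_arg(ap, struct task_struct *);
    va_end(ap);                 /* note: va_end(ap), not a bare "va_end;" */

    /* Stands in for the nested trace_mark() of the pretty event. */
    snprintf(last_event, sizeof(last_event),
             "prev_pid %d next_pid %d prev_state %ld",
             prev->pid, next->pid, prev->state);
}

/* Stand-in for an armed trace_mark() site invoking its registered probe. */
static void fire_sched_switch_marker(struct task_struct *prev,
                                     struct task_struct *next)
{
    pretty_print_sched_switch(NULL, "prev %p next %p", prev, next);
}
```

The point of the sketch is only the hand-off: the cheap marker passes pointers, and the dereferencing happens in the probe, which runs only when the marker is armed.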
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 06:19:29PM -0500, Mathieu Desnoyers wrote:
> [...]
> Almost.. I would add :
>
> static int trace_switch_to_enabled;
>
> > static inline trace_switch_to(struct task_struct *prev,
> >                               struct task_struct *next)
> > {
>         if (likely(!trace_switch_to_enabled))
>                 return;
> >         trace_mark(kernel_schedule,
> >                    "prev_pid %d next_pid %d prev_state %ld",
> >                    prev->pid, next->pid, prev->state);
> >
> >         trace_context_switch(prev, next);
> > }
>
> And some code to activate the trace_switch_to_enabled variable (ideally
> keeping a refcount). [...]

All this complexity is to be justified by keeping the raw prev/next
pointers from being sent to a naive tracer?  It seems to me way out of
proportion.

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi -

On Fri, Jan 18, 2008 at 05:49:19PM -0500, Steven Rostedt wrote:
> [...]
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
> [...]
> With the above, we could have this (if this is what I think you are
> recommending). [...]
>
> static inline trace_switch_to(struct task_struct *prev,
>                               struct task_struct *next)
> {
>         trace_mark(kernel_schedule,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
>
>         trace_context_switch(prev, next);
> }

I'm afraid I don't see the point in this.  You could use one marker for
all that data (and force the more naive tracer callbacks to ignore some
of it).  You could even use two markers (and force the more naive
tracer to attach only to its favorite subset).

But to use a second, different, less efficient, not more configurable
tracing hook mechanism in the same logical spot makes no sense to me.

- FChE
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
> >
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
>
>  - schedule context switch
>  - try_to_wake_up
>  - hard_irqs_off (which is already there for lockdep)
>  - hard irqs on (also for lockdep)
>  - lock_contention (already in for the lock contention code)
>  - lock acquire (also in there for contention code)
>
> With the above, we could have this (if this is what I think you are
> recommending).  For example in the context_switch case:
>
>   trace_switch_to(prev, next);
>   switch_to(prev, next, prev);
>
> and in sched.h I could have:

Almost.. I would add :

static int trace_switch_to_enabled;

> static inline trace_switch_to(struct task_struct *prev,
>                               struct task_struct *next)
> {
        if (likely(!trace_switch_to_enabled))
                return;
>         trace_mark(kernel_schedule,
>                    "prev_pid %d next_pid %d prev_state %ld",
>                    prev->pid, next->pid, prev->state);
>
>         trace_context_switch(prev, next);
> }

And some code to activate the trace_switch_to_enabled variable (ideally
keeping a refcount).

By doing this, we would have the minimum impact on the scheduler when
disabled.

But remember that this trace_switch_to_enabled could be enabled for both
markers and your tracer, so you might need to put a branch at the
beginning of trace_context_switch() too.

Mathieu

> and have the trace_context_switch code be something that is turned on with
> the latency tracing utility (config option).  That way production code can
> keep it off.
>
> -- Steve

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
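[Editorial note] The enable-flag-plus-refcount idea above can be modeled in plain C. This is an illustrative, userspace sketch only: real kernel code would need `likely()`, atomic refcounting, and locking, and `trace_switch_to_ref()`/`trace_switch_to_unref()` are invented names standing in for whatever registration code bumps the count.

```c
static int trace_switch_to_enabled;   /* refcount of attached tracers */
static int events_traced;             /* observable side effect for the demo */

/* Registration code would bump/drop this (atomically, in the kernel). */
static void trace_switch_to_ref(void)   { trace_switch_to_enabled++; }
static void trace_switch_to_unref(void) { trace_switch_to_enabled--; }

/* The hook a call site would inline: a single branch when dormant. */
static void trace_switch_to(int prev_pid, int next_pid)
{
    if (!trace_switch_to_enabled)     /* kernel code would wrap this in likely() */
        return;
    (void)prev_pid;
    (void)next_pid;
    events_traced++;                  /* stands in for trace_mark() + tracer call */
}
```

The design point is that the dormant cost is one predictable branch on a flag, not a call into the marker machinery; the refcount lets markers and the latency tracer share the same flag.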
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
>
> But I have not seen a lot of situations where that kind of glue-code was
> needed, so I think it makes sense to keep markers simple to use and
> efficient for the common case.
>
> Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> tracers.

I'm almost done with the latency tracer work, and there are only a total
of 6 hooks that I needed.

 - schedule context switch
 - try_to_wake_up
 - hard_irqs_off (which is already there for lockdep)
 - hard irqs on (also for lockdep)
 - lock_contention (already in for the lock contention code)
 - lock acquire (also in there for contention code)

With the above, we could have this (if this is what I think you are
recommending).  For example in the context_switch case:

  trace_switch_to(prev, next);
  switch_to(prev, next, prev);

and in sched.h I could have:

static inline trace_switch_to(struct task_struct *prev,
                              struct task_struct *next)
{
        trace_mark(kernel_schedule,
                   "prev_pid %d next_pid %d prev_state %ld",
                   prev->pid, next->pid, prev->state);

        trace_context_switch(prev, next);
}

and have the trace_context_switch code be something that is turned on with
the latency tracing utility (config option).  That way production code can
keep it off.

-- Steve
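[Editorial note] The wrapper above fans one hook site out to two consumers: the generic marker and a dedicated in-kernel tracer. A hedged, userspace sketch of that dual dispatch follows; all names are illustrative stand-ins (in the kernel, the tracer call would be compiled out by a config option).

```c
static int marker_events;   /* events seen by the generic marker path */
static int tracer_events;   /* events seen by the dedicated tracer path */

static void trace_mark_sched(int prev_pid, int next_pid, long prev_state)
{
    (void)prev_pid; (void)next_pid; (void)prev_state;
    marker_events++;        /* stands in for trace_mark(kernel_schedule, ...) */
}

static void trace_context_switch(int prev_pid, int next_pid)
{
    (void)prev_pid; (void)next_pid;
    tracer_events++;        /* stands in for the latency tracer hook */
}

/* The inline wrapper from the mail: one call site, two consumers. */
static void trace_switch_to(int prev_pid, int next_pid, long prev_state)
{
    trace_mark_sched(prev_pid, next_pid, prev_state);
    trace_context_switch(prev_pid, next_pid);
}
```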
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
>
> On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:
>
> > Hi -
> >
> > On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > > [...]
> > > +       trace_mark(kernel_sched_schedule,
> > > +                  "prev_pid %d next_pid %d prev_state %ld",
> > > +                  prev->pid, next->pid, prev->state);
> > > [...]
> > > But...
> > >
> > > Tracers that want to do a bit more work, like recording timings and seeing
> > > if we hit some max somewhere, can't do much with that pretty print data.
> >
> > If you find yourself wanting to perform computations like finding
> > maxima, or responding right there as opposed to later during userspace
> > trace data extraction, then you're trending toward a tool like
> > systemtap.
>
> Yes, very much so.  I'm working on getting the latency_tracer from the -rt
> patch into something suitable for mainline.  We need to calculate the max
> latencies on the fly.  If we hit a max, then we save it off, otherwise, we
> blow away the trace and start again.
>
> > > [...]
> > > So, at a minimum, I'd like to at least have meta data attached:
> > >
> > >   trace_mark(kernel_sched_schedule,
> > >              "prev_pid %d next_pid %d prev_state %ld\0"
> > >              "prev %p next %p",
> > >              prev->pid, next->pid, prev->state,
> > >              prev, next);
> > >
> > > This would allow for both the nice pretty print of your trace, as well as
> > > allowing other tracers to get to better meta data.
> >
> > Yes, more self-contained marker events are necessary for meaningful
> > in-situ processing.  That needs to be balanced by the increased cost
> > for computing and passing the extra parameters, multiplied by the event
> > occurrence rate.
>
> The cost is only done when the marker is armed, since the marker is an
> unlikely and will be placed at the end of the function.
>
> > In this case, the prev/next pointers are sufficient to compute the
> > other values.  For particularly performance-critical markers, it may
> > not be unreasonable to expect the callback functions to dereference
> > such pointers for pretty-printing or other processing.
>
> This was exactly my point to Mathieu, but I think he has LTTng very much
> coupled with the markers.  I haven't played with LTTng (yet), but from what
> I've read (Mathieu, correct me if I'm wrong), it seems that all the
> markers become visible to userspace, and the user can simply turn them on
> or off.  LTTng doesn't need any knowledge of the marker since the marker
> contains how to print the information.
>
> So by placing a "prev %p next %p" as the only information, we lose out on
> this automated way LTTng works.  Because the two pointers are just
> meaningless numbers to the user.

Exactly.

We have, at the marker site :

- a marker identifier
- format string containing field names and types
- arguments

I would like to keep that as much in a straight line as possible with
what ends up in the trace.  However, I see that it limits what can be
done by in-kernel tracers.

And by the way, I also suffer from the same kind of limitation in LTTng.
Here is an example : I would like to replace blktrace (actually, I
already have a quite complete implementation).  However, there is some
code run in the kernel to "prepare" the information for the trace which
is blktrace specific.  Since this code is not required to run when
tracing is disabled, it can be seen as "glue-code" between the kernel
tracing point and the extraction of data to trace.

What looked like the least intrusive solution was to create inline
functions that consist of branches over code considered unlikely (could
be a function call) where the glue-code is executed to prepare the data.
It's a bit like what the markers are doing, except that there is no
marker name associated and no format string : the subsystem being traced
must enable its tracing features by itself (could be a /proc file).
It makes sense, since this type of code has to be subsystem-specific
anyway.

But I have not seen a lot of situations where that kind of glue-code was
needed, so I think it makes sense to keep markers simple to use and
efficient for the common case.

Then, in this glue-code, we can put trace_mark() and calls to in-kernel
tracers.

Since the markers are eventually meant to become an API visible from
user-space, I think it makes sense to keep it clean.  If an in-kernel
tracer needs extra information, I think it would make sense for it to
get it from a mechanism that does not make the exported information
visible to user-space.

What do you think ?

> > > The '\0' would keep your tracer from recording the extra data, and we
> > > could add some way to ignore the parameters in the printf to let other
> > > traces get straight to the meta data.
> >
> > This \0 hack is perhaps too clever.  Much of the cost of the extra
> > parameters is already paid by the time that a simpleminded tracing
> > callback function starts going through the
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:

> Hi -
>
> On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > [...]
> > +       trace_mark(kernel_sched_schedule,
> > +                  "prev_pid %d next_pid %d prev_state %ld",
> > +                  prev->pid, next->pid, prev->state);
> > [...]
> > But...
> >
> > Tracers that want to do a bit more work, like recording timings and seeing
> > if we hit some max somewhere, can't do much with that pretty print data.
>
> If you find yourself wanting to perform computations like finding
> maxima, or responding right there as opposed to later during userspace
> trace data extraction, then you're trending toward a tool like
> systemtap.

Yes, very much so.  I'm working on getting the latency_tracer from the -rt
patch into something suitable for mainline.  We need to calculate the max
latencies on the fly.  If we hit a max, then we save it off, otherwise, we
blow away the trace and start again.

> > [...]
> > So, at a minimum, I'd like to at least have meta data attached:
> >
> >   trace_mark(kernel_sched_schedule,
> >              "prev_pid %d next_pid %d prev_state %ld\0"
> >              "prev %p next %p",
> >              prev->pid, next->pid, prev->state,
> >              prev, next);
> >
> > This would allow for both the nice pretty print of your trace, as well as
> > allowing other tracers to get to better meta data.
>
> Yes, more self-contained marker events are necessary for meaningful
> in-situ processing.  That needs to be balanced by the increased cost
> for computing and passing the extra parameters, multiplied by the event
> occurrence rate.

The cost is only done when the marker is armed, since the marker is an
unlikely and will be placed at the end of the function.

> In this case, the prev/next pointers are sufficient to compute the
> other values.  For particularly performance-critical markers, it may
> not be unreasonable to expect the callback functions to dereference
> such pointers for pretty-printing or other processing.

This was exactly my point to Mathieu, but I think he has LTTng very much
coupled with the markers.  I haven't played with LTTng (yet), but from what
I've read (Mathieu, correct me if I'm wrong), it seems that all the
markers become visible to userspace, and the user can simply turn them on
or off.  LTTng doesn't need any knowledge of the marker since the marker
contains how to print the information.

So by placing a "prev %p next %p" as the only information, we lose out on
this automated way LTTng works.  Because the two pointers are just
meaningless numbers to the user.

> > The '\0' would keep your tracer from recording the extra data, and we
> > could add some way to ignore the parameters in the printf to let other
> > traces get straight to the meta data.
>
> This \0 hack is perhaps too clever.  Much of the cost of the extra
> parameters is already paid by the time that a simpleminded tracing
> callback function starts going through the string.  Also, I believe
> the systemtap marker interface would break if the format strings were
> not singly terminated ordinary strings.

Well, actually when I first wrote this letter, I used "--" as a delimiter
to allow a tool to hide the pretty stuff.  But then I thought about the
"clever hack" with the '\0'.  The "--" may be better since it won't break
systemtap.

-- Steve
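[Editorial note] The two delimiter schemes debated above — an embedded `'\0'` versus a `"--"` separator — can be compared with a small userspace sketch. The helper names (`marker_meta_part`, `marker_meta_after_nul`) are hypothetical, and the exact separator spelling is an assumption; this only illustrates how a consumer would split a format string under each scheme.

```c
#include <string.h>
#include <stddef.h>

/* "--" variant: the format stays one ordinary C string, so tools like
 * systemtap still see it whole; a meta-aware consumer splits explicitly. */
static const char *marker_meta_part(const char *fmt)
{
    const char *sep = strstr(fmt, " -- ");
    return sep ? sep + 4 : NULL;    /* NULL: no meta part present */
}

/* '\0' variant: the meta part hides after an embedded terminator, so a
 * naive consumer treating fmt as a C string never sees it at all. */
static const char *marker_meta_after_nul(const char *fmt)
{
    return fmt + strlen(fmt) + 1;   /* caller must know a second part follows */
}
```

This makes the trade-off concrete: the `"--"` form keeps the string well-formed for every consumer, while the `'\0'` form hides the meta data but produces a format string that string-handling code cannot safely traverse.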
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Hi - On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote: > [...] > + trace_mark(kernel_sched_schedule, > + "prev_pid %d next_pid %d prev_state %ld", > + prev->pid, next->pid, prev->state); > [...] > But... > > Tracers that want to do a bit more work, like recording timings and seeing > if we hit some max somewhere, can't do much with that pretty print data. If you find yourself wanting to perform computations like finding maxima, or responding right there as opposed to later during userspace trace data extraction, then you're trending toward a tool like systemtap. > [...] > So, at a minimum, I'd like to at least have meta data attached: > trace_mark(kernel_sched_schedule, > "prev_pid %d next_pid %d prev_state %ld\0" > "prev %p next %p", > prev->pid, next->pid, prev->state, > prev, next); > > This would allow for both the nice pretty print of your trace, as well as > allowing other tracers to get to better meta data. Yes, more self-contained marker events are necessary for meaningful in-situ processing. That needs to be balanced by the increased cost for computing and passing the extra parameters, multiplied the event occurrence rate. In this case, the prev/next pointers are sufficient to compute the other values. For particularly performance-critical markers, it may not be unreasonable to expect the callback functions to dereference such pointers for pretty-printing or other processing. > The '\0' would keep your tracer from recording the extra data, and we > could add some way to ignore the parameters in the printf to let other > traces get straight to the meta data. This \0 hack is perhaps too clever. Much of the cost of the extra parameters is already paid by the time that a simpleminded tracing callback function starts going through the string. Also, I believe the systemtap marker interface would break if the format strings were not singly terminated ordinary strings. 
- FChE -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
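The '\0' layout being debated here is easy to check in plain userspace C (illustrative only; the format strings mirror the snippet quoted above, and the helper names are hypothetical): adjacent string literals concatenate, a strlen()-style consumer stops at the embedded NUL, and a consumer that knows the trick can step past it to the meta data.

```c
#include <assert.h>
#include <string.h>

/* Two formats packed into one literal, split by an embedded NUL,
 * as in the proposed trace_mark() call quoted above. */
static const char fmt[] =
	"prev_pid %d next_pid %d prev_state %ld\0"
	"prev %p next %p";

/* A naive consumer walking the string sees only the pretty-print half. */
static const char *pretty_half(void)
{
	return fmt;
}

/* A consumer aware of the trick skips past the NUL to the meta data. */
static const char *meta_half(void)
{
	return fmt + strlen(fmt) + 1;
}
```

This also illustrates FChE's objection: an interface that treats format strings as ordinary NUL-terminated strings never sees the second half at all.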
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
> > > > > > > > One thing I want to clear up. The major difference between this > > > > latency_tracer and LTTng is what we consider fast paths. The latency > > > > tracer is recording things like enabling and disabling interrupts, > > > > preempt > > > > count changes, or simply profiling all function calls. Those are what I > > > > consider fast paths. The slow path WRT the latency_tracer are things > > > > like > > > > context switches. This is why I don't have a problem with copying the > > > > comm at context switch time. Because that _is_ a slow path for the > > > > latency > > > > tracer. > > > > > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc. > > > In flight recorder mode, we have nothing to write to disk and therefore > > > we can handle very frequent events. We then do the analysis off-line > > > using the last MB written in the buffers. The advantage is that the > > > kernel dumbly writes data to a buffer: we therefore move the complexity > > > to user-space. > > > > But you would still need to do something in case you want this information > > dumped to console on a kernel crash. Of course you can rely on kexec, but > > if the kexec fails (which is possible) then you lose all the information. > > Having the ability to dump the output to console on a crash is one of the > > benefits of latency_tracer that I want to keep. > > > > There has been some integration done between LTTng and the "crash" tool > to extract the buffers from a crashed kernel. I am not an expert in > crash buffer extraction though, but I guess all the available mechanisms > depend on kexec and could show the limits you are referring to. > > If you really want to pretty-print the information to the console, I > would propose that you leave that part of the problem to a different > output module. The core of the latency tracer could keep the minimum > information. 
Then, when a dump is triggered, it either sends the > information to LTTng or to your console pretty-printer. > > However, I would not call the pretty-printer a "tracer" module per se. > We would have to accept that it is a bit more tied to the kernel > internals than the latency tracer. My goal is to separate the core > "profiling" module from the optional "pretty-printing" module as much as > possible so the latency tracer core could be reused by other output > modules. Mathieu, I've been thinking about the context switch marker, and the printf format field you have: prepare_task_switch(rq, prev, next); + trace_mark(kernel_sched_schedule, + "prev_pid %d next_pid %d prev_state %ld", + prev->pid, next->pid, prev->state); mm = next->mm; Now I see that this is great for your tracer, since all your hook would need to do is: static notrace void simple_trace(const struct marker *mdata, void *private_data, const char *format, ...) { va_list ap; va_start(ap, format); simple_trace_record(ap, format); va_end(ap); } And you could hook this up to all your traces. Wonderful! But... Tracers that want to do a bit more work, like recording timings and seeing if we hit some max somewhere, can't do much with that pretty print data. For example, I like to record the priority of a task that is being swapped out as well as the one being swapped in. But with this, all I can get is the priority of the prev task (since it is still current). You told me that I should put hooks into where the priority gets modified, so that I can trace it there in a non-hot path. Well, I have some issues with this. 1) to me it's a management nightmare. Yes, I could hook into lttng, but that too is too much. That takes me back to the days when COTS became the new buzzword, and we ended up having to incorporate COTS products into things better left done in house. The end result was more glue code than what would happen if it was simply done in house, and a product that was totally inefficient. 
2) this requires we put a marker at all the places that might change the data we want the snapshot of at the marker (here being prio at context switch time). I could imagine some new code coming into the kernel that modifies priority but whose author has no idea about having to update the trace marker, so the trace output ends up showing stale data. And this would cause the poor bastard that needs to maintain this code to debug the tracer on top of the code they are maintaining. Which would simply end with that poor bastard losing all confidence in the tracer and giving up on it. I know that if we did something like: trace_mark(kernel_sched_schedule, "prev %p next %p", prev, next); It would be useless for the simple recording, because what the user would see is just two meaningless pointer numbers. So, at a minimum, I'd like to at least have meta data attached: trace_mark(kernel_sched_schedule, "prev_pid %d next_pid %d prev_state %ld\0" "prev %p next %p", prev->pid, next->pid, prev->state, prev, next); This would allow for both the nice pretty print of your trace, as well as allowing other tracers to get to better meta data.
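The kind of probe Steven wants — one that pulls the fields it cares about (e.g. priorities) straight out of the raw prev/next pointers, rather than relying on pre-formatted pid/state arguments — can be sketched in userspace. `task_stub` and `probe_sched_switch` are hypothetical stand-ins, not kernel API:

```c
#include <stddef.h>

/* Hypothetical stand-in for the few task_struct fields the probe reads. */
struct task_stub {
	int pid;
	int prio;
	long state;
};

static int max_prio_seen;	/* running maximum kept by the tracer */

/* Callback handed the raw prev/next pointers: it dereferences them
 * itself, so it can see next's priority even though only prev is
 * "current" at switch time. */
static void probe_sched_switch(const struct task_stub *prev,
			       const struct task_stub *next)
{
	if (prev->prio > max_prio_seen)
		max_prio_seen = prev->prio;
	if (next->prio > max_prio_seen)
		max_prio_seen = next->prio;
}
```

This is the trade-off FChE names later in the thread: passing just two pointers is cheap at the marker site, at the cost of making the callback kernel-structure-aware.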
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > Or could we map a per-thread page that would contradict this > "definition" ? Over my dead body. It's been done before. Many times. It's horrible, and means that you need to flush the TLB on context switches between threads and cannot share the same hw page tables across CPU's (since now different CPU's run different threads). It generally makes threads pointless. You might as well implement them as processes with shared mmap. Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Thu, 17 Jan 2008, Paul Mackerras wrote: > > > > It's very hard to do a per-thread counter in the VDSO, since threads > > in the same process see the same memory, by definition. You'd have to > > have an array of counters and have some way for each thread to know > > which entry to read. Also you'd have to find space for tens or > > hundreds of thousands of counters, since there can be that many > > threads in a process sometimes. > > I was thinking about this. What would also work is just the ability to > read the schedule counter for the current cpu. Now this would require that > the task had a way to know which CPU it was currently on. > > -- Steve > The problem with the per cpu schedule counter would be to deal with stopped tasks that would wake up at the exact wrong moment. With a 32-bit counter, it could happen. At 1000HZ, given the scheduler is only called once per tick (an approximation; it can also be called explicitly), it would happen after 49.7 days. But then, if the kernel calls schedule() too often, this can be sooner than that. Having a per-thread variable would make sure we don't have this problem. By the way, a task cannot "really know" which CPU it is on: it could be migrated between the cpu ID read and the moment it uses it as an array index. Actually, knowing the schedule count would help to implement algorithms that get the cpu id and know it has been valid for a period of time without pinning to a particular CPU. Mathieu (sorry for repost, messed up with my mail client) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
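The 49.7-day figure follows directly from the counter width: a 32-bit counter bumped once per tick at 1000HZ wraps after 2^32 / 1000 seconds. A quick sanity check (hypothetical helper, not kernel code):

```c
#include <stdint.h>

/* Days until a 32-bit counter wraps, given increments per second. */
static double wrap_days(double incr_per_sec)
{
	double seconds = 4294967296.0 / incr_per_sec;	/* 2^32 increments */
	return seconds / 86400.0;			/* seconds per day */
}
```

As the mail notes, schedule() being called more often than once per tick only shrinks this window, which is why a per-cpu counter can alias for a long-stopped task while a per-thread counter cannot.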
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
> > Crazy ideas : > > Could we do something along the lines of the thread local storage ? > > Or could we map a per-thread page that would contradict this > "definition" ? When working on lguest64, I implemented a "per CPU" shadow page, so that the process of a guest running on one real CPU could never see this page for a process of the same guest running on another real CPU. This is a nightmare to manage. The thing is that threads share the same pgd. With shadow pages, it's doable, but hard to manage. It required making a "copy" of the pgd for each real CPU. Here we are not talking about guests, but bare-metal. So this complicates things even more. The problem is that each thread shares the same pgd. If two threads run at the same time on two different CPUs, then the two threads would see the same counter. But we need to make this per cpu. Now if the thread was given a clue to what CPU it was running on, then we might be able to accomplish something. But then we have the same issue. How do you tell a thread what CPU it's on, without running into the same issues as where to store this data? -- Steve > > Or can we move down the beginning of the user-space thread stack of 4 > bytes (it's already put at a random address anyway) and use these 32 > bits to put our variable ? We don't care if userspace also modifies it; > the kernel would blindly increment it, so there would be no security > concerns involved.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Frank Ch. Eigler wrote: > Hi - > > On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote: > > [...] > > + trace_mark(kernel_sched_schedule, > > + "prev_pid %d next_pid %d prev_state %ld", > > + prev->pid, next->pid, prev->state); > > [...] > > But... > > Tracers that want to do a bit more work, like recording timings and seeing > > if we hit some max somewhere, can't do much with that pretty print data. > If you find yourself wanting to perform computations like finding maxima, > or responding right there as opposed to later during userspace trace data > extraction, then you're trending toward a tool like systemtap. Yes, very much so. I'm working on getting the latency_tracer from the -rt patch into something suitable for mainline. We need to calculate the max latencies on the fly. If we hit a max, then we save it off, otherwise, we blow away the trace and start again. > > [...] > > So, at a minimum, I'd like to at least have meta data attached: > > trace_mark(kernel_sched_schedule, > > "prev_pid %d next_pid %d prev_state %ld\0" > > "prev %p next %p", > > prev->pid, next->pid, prev->state, > > prev, next); > > > > This would allow for both the nice pretty print of your trace, as well as > > allowing other tracers to get to better meta data. > Yes, more self-contained marker events are necessary for meaningful > in-situ processing. That needs to be balanced by the increased cost > for computing and passing the extra parameters, multiplied by the event > occurrence rate. The cost is only paid when the marker is armed, since the marker is in an unlikely branch and will be placed at the end of the function. > In this case, the prev/next pointers are sufficient to > compute the other values. For particularly performance-critical > markers, it may not be unreasonable to expect the callback functions > to dereference such pointers for pretty-printing or other processing. This was exactly my point to Mathieu, but I think he has LTTng very much coupled with the markers.
I haven't played with LTTng (yet), but from what I've read (Mathieu, correct me if I'm wrong), it seems that all the markers become visible to userspace, and the user can simply turn them on or off. LTTng doesn't need any knowledge of the marker since the marker contains how to print the information. So* by placing a "prev %p next %p" as the only information, we lose out on this automated way LTTng works, because the two pointers are just meaningless numbers to the user. > > The '\0' would keep your tracer from recording the extra data, and we > > could add some way to ignore the parameters in the printf to let other > > traces get straight to the meta data. > This \0 hack is perhaps too clever. Much of the cost of the extra > parameters is already paid by the time that a simpleminded tracing > callback function starts going through the string. Also, I believe > the systemtap marker interface would break if the format strings were > not singly terminated ordinary strings. Well, actually when I first wrote this letter, I used "--" as a delimiter to allow a tool to hide the pretty stuff. But then I thought about the "clever hack" with the '\0'. The "--" may be better since it won't break systemtap. -- Steve * dvhart - bah!
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > > Sorry for self-reply, but I thought, in the past, of a way to make this > > possible. > > > > It would imply the creation of a new vsyscall : vgetschedperiod > > > > It would read a counter that would increment each time the thread is > > scheduled out (or in). It would be a per thread counter > > It's very hard to do a per-thread counter in the VDSO, since threads > in the same process see the same memory, by definition. You'd have to > have an array of counters and have some way for each thread to know > which entry to read. Also you'd have to find space for tens or > hundreds of thousands of counters, since there can be that many > threads in a process sometimes. > > Paul. > Crazy ideas: Could we do something along the lines of the thread local storage? Or could we map a per-thread page that would contradict this "definition"? Or can we move down the beginning of the user-space thread stack by 4 bytes (it's already put at a random address anyway) and use these 32 bits to put our variable? We don't care if userspace also modifies it; the kernel would blindly increment it, so there would be no security concerns involved. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
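The retry loop proposed elsewhere in this thread is the read side of a seqlock-style protocol. A userspace sketch, with `vgetschedperiod()` stubbed out as a plain function since the vsyscall is only being proposed here:

```c
#include <stdint.h>

/* Stub for the proposed vsyscall: a per-thread count of schedules.
 * In the real design this would be a VDSO read, not a global. */
static uint32_t sched_period;

static uint32_t vgetschedperiod(void)
{
	return sched_period;
}

/* The value read is only trusted if the period did not change, i.e.
 * the thread was not scheduled out while performing the read. */
static uint64_t stable_read(uint64_t (*read_fn)(void))
{
	uint32_t period;
	uint64_t val;

	do {
		period = vgetschedperiod();
		val = read_fn();	/* e.g. a per-cpu cycle counter */
	} while (period != vgetschedperiod());

	return val;
}

/* Illustrative clock source standing in for the real read. */
static uint64_t dummy_clock(void)
{
	return 42;
}
```

This is also the shape of the user-space RCU read side Mathieu alludes to: the reader retries until it observes an unchanged period around its critical section.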
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Paul Mackerras wrote: > > It's very hard to do a per-thread counter in the VDSO, since threads > in the same process see the same memory, by definition. You'd have to > have an array of counters and have some way for each thread to know > which entry to read. Also you'd have to find space for tens or > hundreds of thousands of counters, since there can be that many > threads in a process sometimes. I was thinking about this. What would also work is just the ability to read the schedule counter for the current cpu. Now this would require that the task had a way to know which CPU it was currently on. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers writes: > Sorry for self-reply, but I thought, in the past, of a way to make this > possible. > > It would imply the creation of a new vsyscall : vgetschedperiod > > It would read a counter that would increment each time the thread is > scheduled out (or in). It would be a per thread counter It's very hard to do a per-thread counter in the VDSO, since threads in the same process see the same memory, by definition. You'd have to have an array of counters and have some way for each thread to know which entry to read. Also you'd have to find space for tens or hundreds of thousands of counters, since there can be that many threads in a process sometimes. Paul.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > It would imply the creation of a new vsyscall : vgetschedperiod > > It would read a counter that would increment each time the thread is > scheduled out (or in). It would be a per thread counter (not a per cpu > counter) so we can deal appropriately with a stopped thread that would > happen to come back running a long time afterward (if we do per-cpu > counters, we could get the same 32 bits counter value falsely if it is > shared with other thread activity). > > Then, the clocksource read code would look like : > > int period; > > do { > period = vgetschedperiod(); > > perform the clocksource read.. > > } while (period != vgetschedperiod()); > > Therefore, we would be sure that we have not been scheduled out while > reading the value. I think this new vsyscall could be useful for others. > Actually, it would make implementation of RCU in user-space possible (as > long as the read-side can retry the read operation). This is something that I would agree is useful. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > * john stultz ([EMAIL PROTECTED]) wrote: > > > > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > > > Thanks John for doing this! > > > > > > (comments imbedded) > > > > > > On Wed, 16 Jan 2008, john stultz wrote: > > > > + int num = !cs->base_num; > > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > > + offset &= cs->mask; > > > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > > > + cs->base[num].cycle_base_last = now; > > > > > > I would think that we would need some sort of barrier here. Otherwise, > > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > > > is needed. > > > > Hopefully addressed in the current version. > > > > > > > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > > > === > > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 > > > > 12:21:46.0 -0800 > > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > > > 14:15:31.0 -0800 > > > > @@ -71,10 +71,12 @@ > > > > */ > > > > static inline s64 __get_nsec_offset(void) > > > > { > > > > - cycle_t cycle_delta; > > > > + cycle_t now, cycle_delta; > > > > s64 ns_offset; > > > > > > > > - cycle_delta = clocksource_get_cycles(clock, > > > > clocksource_read(clock)); > > > > + now = clocksource_read(clock); > > > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > > > + cycle_delta += clock->cycle_accumulated; > > > > > > Is the above just to decouple the two methods? > > > > Yep. clocksource_get_cycles() ended up not being as useful as an helper > > function (I was hoping the arch vsyscall implementations could use it, > > but they've done too much optimization - although that may reflect a > > need up the chain to the clocksource structure). > > > > The problem with vsyscall is that we will have a hard time disabling > preemption :( Therefore, insuring that the read of the data is done in a > timely manner is hard to do. 
> Sorry for self-reply, but I thought, in the past, of a way to make this possible. It would imply the creation of a new vsyscall : vgetschedperiod It would read a counter that would increment each time the thread is scheduled out (or in). It would be a per thread counter (not a per cpu counter) so we can deal appropriately with a stopped thread that would happen to come back running a long time afterward (if we do per-cpu counters, we could get the same 32-bit counter value falsely if it is shared with other thread activity). Then, the clocksource read code would look like : int period; do { period = vgetschedperiod(); perform the clocksource read.. } while (period != vgetschedperiod()); Therefore, we would be sure that we have not been scheduled out while reading the value. I think this new vsyscall could be useful for others. Actually, it would make implementation of RCU in user-space possible (as long as the read-side can retry the read operation). Mathieu > > > thanks > > -john > > > > > > -- > Mathieu Desnoyers > Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > Yep. clocksource_get_cycles() ended up not being as useful as a helper > > function (I was hoping the arch vsyscall implementations could use it, > > but they've done too much optimization - although that may reflect a > > need up the chain to the clocksource structure). > > > > The problem with vsyscall is that we will have a hard time disabling > preemption :( Therefore, ensuring that the read of the data is done in a > timely manner is hard to do. You'll have more than a hard time disabling preemption for vsyscall. We'll need to come up with a better solution then. vsyscall cannot modify any kernel memory, nor can it disable preemption. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > > Thanks John for doing this! > > > > (comments imbedded) > > > > On Wed, 16 Jan 2008, john stultz wrote: > > > + int num = !cs->base_num; > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > + offset &= cs->mask; > > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > > + cs->base[num].cycle_base_last = now; > > > > I would think that we would need some sort of barrier here. Otherwise, > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > > is needed. > > Hopefully addressed in the current version. > > > > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > > === > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > > > 12:21:46.0 -0800 > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > > 14:15:31.0 -0800 > > > @@ -71,10 +71,12 @@ > > > */ > > > static inline s64 __get_nsec_offset(void) > > > { > > > - cycle_t cycle_delta; > > > + cycle_t now, cycle_delta; > > > s64 ns_offset; > > > > > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > > > + now = clocksource_read(clock); > > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > > + cycle_delta += clock->cycle_accumulated; > > > > Is the above just to decouple the two methods? > > Yep. clocksource_get_cycles() ended up not being as useful as an helper > function (I was hoping the arch vsyscall implementations could use it, > but they've done too much optimization - although that may reflect a > need up the chain to the clocksource structure). > The problem with vsyscall is that we will have a hard time disabling preemption :( Therefore, insuring that the read of the data is done in a timely manner is hard to do. Mathieu > thanks > -john > > -- Mathieu Desnoyers Computer Engineering Ph.D. 
Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > > I would disable preemption in clocksource_get_basecycles. We would not > > want to be scheduled out while we hold a pointer to the old array > > element. > > > > > + int num = cs->base_num; > > > > Since you deal with base_num in a shared manner (not per cpu), you will > > need a smp_read_barrier_depend() here after the cs->base_num read. > > > > You should think about reading the cs->base_num first, and _after_ that > > read the real clocksource. Here, the clocksource value is passed as > > parameter. It means that the read clocksource may have been read in the > > previous RCU window. > > Here's an updated version of the patch w/ the suggested memory barrier > changes and favored (1-x) inversion change. ;) Let me know if you see > any other holes, or have any other suggestions or ideas. > > Still un-tested (my test box will free up soon, I promise!), but builds. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 18:12:53.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. 
> + */ > + struct { > + cycle_t cycle_base_last, cycle_base; > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,29 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. > */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; > + int num; > + cycle_t now, offset; > + > + preempt_disable(); > + num = cs->base_num; > + smp_read_barrier_depends(); > + now = clocksource_read(cs); > + offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > + preempt_enable(); > + > return offset; > } > > @@ -197,14 +215,26 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + * ***still requires the xtime_lock is held for writing!*** > */ > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t > now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > + /* First update the monotonic base portion. > + * The dual array update method allows for lock-free reading. > + */ > + int num = 1 - cs->base_num; (nitpick) right here, you could probably express 1-num with cs->base_num, since we are the only ones supposed to touch it. 
> + cycle_t offset = (now - cs->base[1-num].cycle_base_last); > + offset &= cs->mask; here too. > + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; > + cs->base[num].cycle_base_last = now; > + wmb(); As I just emailed : smp_smb() *should* be enough. I don't see which architecture could reorder writes wrt local interrupts ? (please tell me if I am grossly mistaken) Mathieu > + cs->base_num = num; > + > + /* Now update the cycle_accumulated portion */ > + offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > - cs->cycle_raw += offset; > } > > /** > Index: monotonic-cleanup/kernel/time/timekeeping.c > === > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > 12:21:46.0 -0800 > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > 17:51:50.0 -0800 > @@ -71,10 +71,12 @@ >
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: > Thanks John for doing this! > > (comments imbedded) > > On Wed, 16 Jan 2008, john stultz wrote: > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > + offset &= cs->mask; > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > > + cs->base[num].cycle_base_last = now; > > I would think that we would need some sort of barrier here. Otherwise, > base_num could be updated before all the cycle_base. I'd expect a smp_wmb > is needed. Hopefully addressed in the current version. > > Index: monotonic-cleanup/kernel/time/timekeeping.c > > === > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 > > 12:21:46.0 -0800 > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > > 14:15:31.0 -0800 > > @@ -71,10 +71,12 @@ > > */ > > static inline s64 __get_nsec_offset(void) > > { > > - cycle_t cycle_delta; > > + cycle_t now, cycle_delta; > > s64 ns_offset; > > > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > > + now = clocksource_read(clock); > > + cycle_delta = (now - clock->cycle_last) & clock->mask; > > + cycle_delta += clock->cycle_accumulated; > > Is the above just to decouple the two methods? Yep. clocksource_get_cycles() ended up not being as useful as an helper function (I was hoping the arch vsyscall implementations could use it, but they've done too much optimization - although that may reflect a need up the chain to the clocksource structure). thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > > * john stultz ([EMAIL PROTECTED]) wrote: > > > > > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > > > If you really want an seqlock free algorithm (I _do_ want this for > > > > > tracing!) :) maybe going in the RCU direction could help (I refer to > > > > > my > > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > > > could be turned into the clocksource updater). > > > > > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > > > swing at this now. The lock-free method still doesn't apply to the > > > > update_wall_time function, but does work fine for the monotonic cycle > > > > uses. I'll send a patch for review as soon as I get things building. > > > > > > So here's my first attempt at adding Mathieu's lock-free method to > > > Steven's get_monotonic_cycles() interface. > > > > > > Completely un-tested, but it builds, so I figured I'd send it out for > > > review. > > > > > > I'm not super sure the update or the read doesn't need something > > > additional to force a memory access, but as I didn't see anything > > > special in Mathieu's implementation, I'm going to guess this is ok. > > > > > > Mathieu, Let me know if this isn't what you're suggesting. > > > > > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > > > > > Index: monotonic-cleanup/include/linux/clocksource.h > > > === > > > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > > > 12:22:04.0 -0800 > > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > > > 14:41:31.0 -0800 > > > @@ -87,9 +87,17 @@ > > >* more than one cache line. 
> > >*/ > > > struct { > > > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > > > - } cacheline_aligned_in_smp; > > > > Shouldn't the cycle_last and cycle_accumulated by in the array too ? > > No, we're leaving cycle_last and cycle_accumulated alone. They're > relating to the update_wall_time conversion of cycles to xtime. > > > > + cycle_t cycle_last, cycle_accumulated; > > > > > > + /* base structure provides lock-free read > > > + * access to a virtualized 64bit counter > > > + * Uses RCU-like update. > > > + */ > > > + struct { > > > > We had cycle_raw before, why do we need the following two ? > > > > > + cycle_t cycle_base_last, cycle_base; > > > > I'm not quite sure why you need both cycle_base_last and cycle_base... > > So on my first shot at this, I tried to layer the concepts. Using the > lock-free method to create a abstracted 64bit counter, as provided by > get_monotonic_cycles(). Then I tried to use that abstraction directly in > the update_wall_time() code, reading the abstracted 64bit counter and > using it to update time. > > However, then we start keeping cycle_last in 64bit cycles, rather then > an actual counter read. This then caused changes to be needed in the > arch vsyscall implementations, and that started to get ugly, as we had > to also re-implement the abstracted 64bit counter w/ the lock free > method as well. > > So I just backed off and tried to make it simple: We have two sets of > data that counts cycles from the clocksource. One for timekeeping and > one for get_monotoinc_cycles(). It is a little redundant, but I don't > think you can escape that (the layering method above also has > redundancy, but its just hidden until you implement the vsyscall gtod > methods). > > > > I think I'll need a bit of an explanation of what you are trying to > > achieve here to see what to expect from the clock source. 
Are you trying > > to deal with non-synchronized TSCs across CPUs in a way that will > > generate a monotonic (sometimes stalling) clock ? > > No no no.. I'm not touching the non-synced TSC issue. I'm just trying to > take clocksource counters, which may be of different bit-widths (ACPI PM > is 24bits, for instance), and create lock-free method to translate that > into a virtual 64bit wide counter (using an accumulation bucket, > basically). > > > What I am trying to say is : I know you are trying to make a virtual > > clock source where time cannot go backward, but what are your > > assumptions about the "real" clock source ? > > The assumptions of the real clocksource is the same we keep in the > timekeeping core. It counts forward, at a constant rate and only wraps > after the mask value has been reached. > > > Is the intent to deal with an HPET suddenly reset to 0 or something > > like this ? > > Well, dealing with clocksources wrapping short of 64bits. > Ah ok, then the problem is clearer :) The main difference between the approach I use and yours is that, let's say your clocksource
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > I would disable preemption in clocksource_get_basecycles. We would not > want to be scheduled out while we hold a pointer to the old array > element. > > > + int num = cs->base_num; > > Since you deal with base_num in a shared manner (not per cpu), you will > need a smp_read_barrier_depend() here after the cs->base_num read. > > You should think about reading the cs->base_num first, and _after_ that > read the real clocksource. Here, the clocksource value is passed as > parameter. It means that the read clocksource may have been read in the > previous RCU window. Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Index: monotonic-cleanup/include/linux/clocksource.h === --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800 +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 18:12:53.0 -0800 @@ -87,9 +87,17 @@ * more than one cache line. */ struct { - cycle_t cycle_last, cycle_accumulated, cycle_raw; - } cacheline_aligned_in_smp; + cycle_t cycle_last, cycle_accumulated; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. +*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,29 @@ } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. 
* NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. */ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num; + cycle_t now, offset; + + preempt_disable(); + num = cs->base_num; + smp_read_barrier_depends(); + now = clocksource_read(cs); + offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; + preempt_enable(); + return offset; } @@ -197,14 +215,26 @@ * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. 
+*/ + int num = 1 - cs->base_num; + cycle_t offset = (now - cs->base[1-num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + wmb(); + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; - cs->cycle_raw += offset; } /** Index: monotonic-cleanup/kernel/time/timekeeping.c === --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 12:21:46.0 -0800 +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.0 -0800 @@ -71,10 +71,12 @@ */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle_delta = (now - clock->cycle_last) & clock->mask; + cycle_delta += clock->cycle_accumulated; ns_offset = cyc2ns(clock, cycle_delta); return ns_offset; @@ -105,35 +107,7 @@ cycle_t notrace get_monotonic_cycles(void) {
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Linus Torvalds ([EMAIL PROTECTED]) wrote: > > > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > > + int num = !cs->base_num; > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > > > !0 is not necessarily 1. > > Incorrect. > Hrm, *digging in my mailbox*, ah, here it is : http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html Richard Purdie reviewed my code back in 2006 and made this modification. Maybe will he have something to add. > !0 _is_ necessarily 1. It's how all C logical operators work. If you find > a compiler that turns !x into anything but 0/1, you found a compiler for > another language than C. > > It's true that any non-zero value counts as "true", but the that does not > mean that a logical operator can return any non-zero value for true. As a > return value of the logical operations in C, true is *always* 1. > > So !, ||, &&, when used as values, will *always* return either 0 or 1 (but > when used as part of a conditional, the compiler will often optimize out > unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if > the value itself was never used, only branched upon). > > So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine. > > That's not to say it's necessarily the *best* way. > > If you *know* that you started with 0/1 in the first place, the best way > to flip it tends to be to do (1-x) (or possibly (x^1)). > > And if you can't guarantee that, !x is probably better than x ? 0 : 1, > but you might also decide to use ((x+1)&1) for example. > > And obviously, the compiler may sometimes surprise you, and if *it* also > knows it's always 0/1 (for something like the source being a single-bit > bitfield for example), it may end up doing something else than you coded > that is equivalent. And the particular choice of operation the compiler > chooses may well depend on the code _around_ that sequence. 
> > (One reason to potentially prefer (1-x) over (x^1) is that it's often > easier to combine a subtraction with other operations, while an xor seldom > combines with anything around it) > Ok, I'll adopt (1-x) then. Thanks! Mathieu > Linus -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > !0 is not necessarily 1. Incorrect. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as "true", but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence. (One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Steven Rostedt wrote: > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in > > How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask > me, if all you are doing is changing it from 1 to zero and back to 1. > FYI:

[EMAIL PROTECTED]:~/c$ cat flipme.c
int flip1 (int x) { return !x; }
int flip2 (int x) { return x ? 0 : 1; }
int flip3(int x) { return x ^ 1; }
[EMAIL PROTECTED]:~/c$ gcc -O2 -c flipme.c
[EMAIL PROTECTED]:~/c$ objdump -d flipme.o

flipme.o:     file format elf32-i386

Disassembly of section .text:

00000000 <flip1>:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   83 7d 08 00             cmpl   $0x0,0x8(%ebp)
   9:   5d                      pop    %ebp
   a:   0f 94 c0                sete   %al
   d:   c3                      ret
   e:   66 90                   xchg   %ax,%ax

00000010 <flip2>:
  10:   55                      push   %ebp
  11:   31 c0                   xor    %eax,%eax
  13:   89 e5                   mov    %esp,%ebp
  15:   83 7d 08 00             cmpl   $0x0,0x8(%ebp)
  19:   5d                      pop    %ebp
  1a:   0f 94 c0                sete   %al
  1d:   c3                      ret
  1e:   66 90                   xchg   %ax,%ax

00000020 <flip3>:
  20:   55                      push   %ebp
  21:   89 e5                   mov    %esp,%ebp
  23:   8b 45 08                mov    0x8(%ebp),%eax
  26:   5d                      pop    %ebp
  27:   83 f0 01                xor    $0x1,%eax
  2a:   c3                      ret

So, if you know for sure that x is only 1 or 0, then using x ^ 1 to invert it, seems the most efficient. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: > * john stultz ([EMAIL PROTECTED]) wrote: > > > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > > If you really want an seqlock free algorithm (I _do_ want this for > > > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > > could be turned into the clocksource updater). > > > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > > swing at this now. The lock-free method still doesn't apply to the > > > update_wall_time function, but does work fine for the monotonic cycle > > > uses. I'll send a patch for review as soon as I get things building. > > > > So here's my first attempt at adding Mathieu's lock-free method to > > Steven's get_monotonic_cycles() interface. > > > > Completely un-tested, but it builds, so I figured I'd send it out for > > review. > > > > I'm not super sure the update or the read doesn't need something > > additional to force a memory access, but as I didn't see anything > > special in Mathieu's implementation, I'm going to guess this is ok. > > > > Mathieu, Let me know if this isn't what you're suggesting. > > > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > > > Index: monotonic-cleanup/include/linux/clocksource.h > > === > > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 > > 12:22:04.0 -0800 > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > > 14:41:31.0 -0800 > > @@ -87,9 +87,17 @@ > > * more than one cache line. > > */ > > struct { > > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > > - } cacheline_aligned_in_smp; > > Shouldn't the cycle_last and cycle_accumulated by in the array too ? No, we're leaving cycle_last and cycle_accumulated alone. They're relating to the update_wall_time conversion of cycles to xtime. 
> > + cycle_t cycle_last, cycle_accumulated; > > > > + /* base structure provides lock-free read > > +* access to a virtualized 64bit counter > > +* Uses RCU-like update. > > +*/ > > + struct { > > We had cycle_raw before, why do we need the following two ? > > > + cycle_t cycle_base_last, cycle_base; > > I'm not quite sure why you need both cycle_base_last and cycle_base... So on my first shot at this, I tried to layer the concepts. Using the lock-free method to create a abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather then an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock free method as well. So I just backed off and tried to make it simple: We have two sets of data that counts cycles from the clocksource. One for timekeeping and one for get_monotoinc_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but its just hidden until you implement the vsyscall gtod methods). > I think I'll need a bit of an explanation of what you are trying to > achieve here to see what to expect from the clock source. Are you trying > to deal with non-synchronized TSCs across CPUs in a way that will > generate a monotonic (sometimes stalling) clock ? No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically). 
> What I am trying to say is : I know you are trying to make a virtual > clock source where time cannot go backward, but what are your > assumptions about the "real" clock source ? The assumptions of the real clocksource are the same we keep in the timekeeping core. It counts forward, at a constant rate and only wraps after the mask value has been reached. > Is the intent to deal with an HPET suddenly reset to 0 or something > like this ? Well, dealing with clocksources wrapping short of 64bits. > Basically, I wonder why you have to calculate the current cycle count > from the previous update_wall_time event. Is it because you need to be > consistent when a clocksource change occurs ? Actually, we try to do it from the last clocksource_accumulate() call (which is called from update_wall_time). > > + } base[2]; > + int base_num; > + }
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > > + /* First update the monotonic base portion. > > +* The dual array update method allows for lock-free reading. > > +*/ > > + int num = !cs->base_num; > > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in > my code. The two previous lines seem buggy. (I did the same mistake in > my first implementation) ;) > > > + offset &= cs->mask; > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > Here too. How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask me, if all you are doing is changing it from 1 to zero and back to 1. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
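Steven's suggestion can be seen in a stand-alone model of the dual-array update: flip the slot index with XOR rather than `!`, compute the new base from the current slot, store it in the unused slot, then switch. This is a user-space sketch with made-up names (`struct cs_model`, `accumulate`), not the patch itself, and it omits the memory barriers discussed elsewhere in the thread:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t cycle_t;

struct base_slot {
    cycle_t cycle_base_last, cycle_base;
};

struct cs_model {
    cycle_t mask;
    struct base_slot base[2];
    int base_num;
};

/* Writer: compute the new accumulated base from the current slot,
 * fill the unused slot, then flip base_num. XOR ("base_num ^ 1")
 * makes the 0 <-> 1 flip explicit and cheap. Barriers are omitted
 * here; the kernel version would need an smp_wmb() before the flip. */
static void accumulate(struct cs_model *cs, cycle_t now)
{
    int old = cs->base_num;
    int new_idx = old ^ 1;              /* 0 -> 1, 1 -> 0 */
    cycle_t offset = (now - cs->base[old].cycle_base_last) & cs->mask;

    cs->base[new_idx].cycle_base = cs->base[old].cycle_base + offset;
    cs->base[new_idx].cycle_base_last = now;
    cs->base_num = new_idx;
}

/* Reader: load the index once, then work only within that slot. */
static cycle_t read_base(const struct cs_model *cs, cycle_t now)
{
    int num = cs->base_num;
    return cs->base[num].cycle_base +
           ((now - cs->base[num].cycle_base_last) & cs->mask);
}
```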
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > > If you really want an seqlock free algorithm (I _do_ want this for > > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > could be turned into the clocksource updater). > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > swing at this now. The lock-free method still doesn't apply to the > > update_wall_time function, but does work fine for the monotonic cycle > > uses. I'll send a patch for review as soon as I get things building. > > So here's my first attempt at adding Mathieu's lock-free method to > Steven's get_monotonic_cycles() interface. > > Completely un-tested, but it builds, so I figured I'd send it out for > review. > > I'm not super sure the update or the read doesn't need something > additional to force a memory access, but as I didn't see anything > special in Mathieu's implementation, I'm going to guess this is ok. > > Mathieu, Let me know if this isn't what you're suggesting. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 14:41:31.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; Shouldn't the cycle_last and cycle_accumulated be in the array too ? > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. > + */ > + struct { We had cycle_raw before, why do we need the following two ? 
> + cycle_t cycle_base_last, cycle_base; I'm not quite sure why you need both cycle_base_last and cycle_base... I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock ? What I am trying to say is : I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the "real" clock source ? Is the intent to deal with an HPET suddenly reset to 0 or something like this ? Basically, I wonder why you have to calculate the current cycle count from the previous update_wall_time event. Is it because you need to be consistent when a clocksource change occurs ? > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,21 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. > */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. > + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. 
You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as parameter. It means that the read clocksource may have been read in the previous RCU window. > + cycle_t offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > return offset; > } > > @@ -197,14 +207,25 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + *
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Thanks John for doing this! (comments imbedded) On Wed, 16 Jan 2008, john stultz wrote: > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > Completely un-tested, but it builds, so I figured I'd send it out for > review. heh, ok, I'll take it and run it. > > I'm not super sure the update or the read doesn't need something > additional to force a memory access, but as I didn't see anything > special in Mathieu's implementation, I'm going to guess this is ok. > > Mathieu, Let me know if this isn't what you're suggesting. > > Signed-off-by: John Stultz <[EMAIL PROTECTED]> > > Index: monotonic-cleanup/include/linux/clocksource.h > === > --- monotonic-cleanup.orig/include/linux/clocksource.h2008-01-16 > 12:22:04.0 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 > 14:41:31.0 -0800 > @@ -87,9 +87,17 @@ >* more than one cache line. >*/ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } cacheline_aligned_in_smp; > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. > + */ > + struct { > + cycle_t cycle_base_last, cycle_base; > + } base[2]; > + int base_num; > + } cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,21 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle > value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. 
> */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; > + int num = cs->base_num; > + cycle_t offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > return offset; > } > > @@ -197,14 +207,25 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + * ***still requires the xtime_lock is held for writing!*** > */ > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t > now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > + /* First update the monotonic base portion. > + * The dual array update method allows for lock-free reading. > + */ > + int num = !cs->base_num; > + cycle_t offset = (now - cs->base[!num].cycle_base_last); > + offset &= cs->mask; > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; > + cs->base[num].cycle_base_last = now; I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed. 
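The pairing Steven and Mathieu are asking for — an smp_wmb() before the base_num flip on the write side, and smp_read_barrier_depends() (or better) after the base_num load on the read side — can be modeled in user space with C11 release/acquire orderings. This generic two-slot sketch illustrates the pattern only; the struct and names are not the kernel code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct slot { uint64_t a, b; };

static struct slot slots[2];
static _Atomic int cur;

/* Write side: fill the slot readers are NOT using, then publish the
 * new index. The release store plays the role of the smp_wmb() plus
 * the base_num assignment in the patch. */
static void publish(uint64_t a, uint64_t b)
{
    int next = atomic_load_explicit(&cur, memory_order_relaxed) ^ 1;

    slots[next].a = a;
    slots[next].b = b;
    atomic_store_explicit(&cur, next, memory_order_release);
}

/* Read side: one acquire load of the index (standing in for the
 * read-side barrier Mathieu mentions), then read only that slot. */
static struct slot snapshot(void)
{
    int i = atomic_load_explicit(&cur, memory_order_acquire);
    return slots[i];
}
```

With the release store, a reader that observes the new index is guaranteed to observe the fully written slot; without it, the index flip could become visible before the slot contents. This assumes a single writer, which matches the patch's note that clocksource_accumulate() still runs under the xtime write lock.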
> + cs->base_num = num; > + > + /* Now update the cycle_accumulated portion */ > + offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > - cs->cycle_raw += offset; > } > > /** > Index: monotonic-cleanup/kernel/time/timekeeping.c > === > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 > 12:21:46.0 -0800 > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 > 14:15:31.0 -0800 > @@ -71,10 +71,12 @@ > */ > static inline s64 __get_nsec_offset(void) > { > - cycle_t cycle_delta; > + cycle_t now, cycle_delta; > s64 ns_offset; > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > + now = clocksource_read(clock); > + cycle_delta = (now - clock->cycle_last) & clock->mask; > + cycle_delta += clock->cycle_accumulated; Is the above just to decouple the two methods? > ns_offset = cyc2ns(clock, cycle_delta); > > return ns_offset; > @@ -105,35 +107,7 @@ > > cycle_t notrace get_monotonic_cycles(void) > { > - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; > - > - do { > - /* > - * cycle_raw and cycle_last can change on > - * another CPU and we need the delta
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > > If you really want an seqlock free algorithm (I _do_ want this for > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > could be turned into the clocksource updater). > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > swing at this now. The lock-free method still doesn't apply to the > update_wall_time function, but does work fine for the monotonic cycle > uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz <[EMAIL PROTECTED]> Index: monotonic-cleanup/include/linux/clocksource.h === --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.0 -0800 +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.0 -0800 @@ -87,9 +87,17 @@ * more than one cache line. */ struct { - cycle_t cycle_last, cycle_accumulated, cycle_raw; - } cacheline_aligned_in_smp; + cycle_t cycle_last, cycle_accumulated; + /* base structure provides lock-free read +* access to a virtualized 64bit counter +* Uses RCU-like update. 
+*/ + struct { + cycle_t cycle_base_last, cycle_base; + } base[2]; + int base_num; + } cacheline_aligned_in_smp; u64 xtime_nsec; s64 error; @@ -175,19 +183,21 @@ } /** - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value * @cs:pointer to clocksource being read * @now: current cycle value * * Uses the clocksource to return the current cycle_t value. * NOTE!!!: This is different from clocksource_read, because it - * returns the accumulated cycle value! Must hold xtime lock! + * returns a 64bit wide accumulated value. */ static inline cycle_t -clocksource_get_cycles(struct clocksource *cs, cycle_t now) +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; - offset += cs->cycle_accumulated; + int num = cs->base_num; + cycle_t offset = (now - cs->base[num].cycle_base_last); + offset &= cs->mask; + offset += cs->base[num].cycle_base; return offset; } @@ -197,14 +207,25 @@ * @now: current cycle value * * Used to avoids clocksource hardware overflow by periodically - * accumulating the current cycle delta. Must hold xtime write lock! + * accumulating the current cycle delta. Uses RCU-like update, but + * ***still requires the xtime_lock is held for writing!*** */ static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { - cycle_t offset = (now - cs->cycle_last) & cs->mask; + /* First update the monotonic base portion. +* The dual array update method allows for lock-free reading. 
+*/ + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); + offset &= cs->mask; + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; + cs->base[num].cycle_base_last = now; + cs->base_num = num; + + /* Now update the cycle_accumulated portion */ + offset = (now - cs->cycle_last) & cs->mask; cs->cycle_last = now; cs->cycle_accumulated += offset; - cs->cycle_raw += offset; } /** Index: monotonic-cleanup/kernel/time/timekeeping.c === --- monotonic-cleanup.orig/kernel/time/timekeeping.c2008-01-16 12:21:46.0 -0800 +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.0 -0800 @@ -71,10 +71,12 @@ */ static inline s64 __get_nsec_offset(void) { - cycle_t cycle_delta; + cycle_t now, cycle_delta; s64 ns_offset; - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); + now = clocksource_read(clock); + cycle_delta = (now - clock->cycle_last) & clock->mask; + cycle_delta += clock->cycle_accumulated; ns_offset = cyc2ns(clock, cycle_delta); return ns_offset; @@ -105,35
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[EMAIL PROTECTED]> wrote: > If you really want an seqlock free algorithm (I _do_ want this for > tracing!) :) maybe going in the RCU direction could help (I refer to my > RCU-based 32-to-64 bits lockless timestamp counter extension, which > could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. thanks -john
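The "RCU-based 32-to-64 bits lockless timestamp counter extension" Mathieu refers to works, roughly, by splicing a fresh 32-bit hardware read onto periodically refreshed high bits, carrying into the high half when a wrap is detected. A minimal single-threaded sketch of that splice, with illustrative names and without the per-CPU/RCU machinery of the real code:

```c
#include <assert.h>
#include <stdint.h>

/* last_64 is the last known full 64-bit value (refreshed periodically,
 * e.g. by the per-CPU IPI update mentioned in this thread); now_32 is
 * a fresh 32-bit hardware read. If the low half went backward, the
 * 32-bit counter must have wrapped since the last refresh, so carry
 * into the high half. */
static uint64_t extend_32_to_64(uint64_t last_64, uint32_t now_32)
{
    uint64_t res = (last_64 & ~(uint64_t)0xFFFFFFFF) | now_32;

    if (now_32 < (uint32_t)last_64)
        res += (uint64_t)1 << 32;   /* wrap detected: bump the high half */
    return res;
}
```

This only stays correct if the periodic refresh runs more often than the 32-bit wrap period — the same constraint stated for the IPI update elsewhere in the thread.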
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > > > In-other-words, latency_tracer is LTTng-lite ;-) > > > > If LTTng is already ported to your specific kernel, the learning-curve > is not big at all. Here is what the latency_tracer over LTTng guide > could look like : > > Well, once you have LTTng in your kernel and have compiled and installed > the ltt-control and lttv packages (configure, make, make install), all > that would be needed is : > > (there may be some bits in the QUICKSTART GUIDE on > http://ltt.polymtl.ca, like adding the debugfs mount to fstab and make sure > the LTTng modules are loaded) > > #arm all the markers > ltt-armall > #start lttng tracing > lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt > > -> start latency tracer > -> stop latency tracer > -> trigger latency tracer dump > > While the tracing is active, trigger the condition... > > (rince, repeat, can handle multiple latency tracer dumps) > > #stop lttng tracing > lttctl -n test -R > #disarm all markers > ltt-disarmall > > You can easily test the trace with : > lttv -m textDump -t /tmp/trace1 > > Your users would issue something like : > > lttv -m latencytracerDump -t /tmp/trace1 > > that's it. LatencytracerDump would be a new specialized plugin, inspired > from the generic textDump.c plugin and from the state.c module (for > hooking on specific events rather that on _all_ events). It would > generate a text output from the trace collected at each latency tracer > dump. > Mathieu, That's quite a bit. Currently we have tools that already hook into latency_tracer. e.g. Thomas Gleixner's cyclictest. It turns on the latency tracer runs a quick test, turns it off. If the latency is ok, the trace is discarded and it runs the test again. When the latency is not acceptable, it stops the test. Then all one needs to do is look at the given trace. Hooking something simple as this to LTTng is not going to fly. Don't get me wrong, I'm a huge supporter of LTTng, and even recommend it. 
But there are things that using LTTng for is like using a sledgehammer for a tack. I like to use the tools that are best for the job (which also means easiest). I don't buy the one tracer fits all mentality. And by pushing that too much, I'm thinking we'll never get a tracer into the kernel. My goal of these patches is to get the infrastructure for tracing into the kernel. The latency_tracer is like Rusty Russell's lguest is for pvops. LTTng is the Xen equivalent. Where latency_tracer is nothing more than the quick and dirty tracer. For administrators that want to analyze their systems, LTTng is much more appropriate. latency_tracer is more of the first aid in finding latency troubles, and if that doesn't work, then we can do the LTTng surgery. Let's go back and focus on the infrastructure again. Namely, the mcount and notrace parts of the kernel. As well as the tracer timer, which, by the way, we are looking more into what you have done ;-) -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers wrote: > If LTTng is already ported to your specific kernel, the learning-curve > is not big at all. Here is what the latency_tracer over LTTng guide > could look like : > > Well, once you have LTTng in your kernel and have compiled and installed > the ltt-control and lttv packages (configure, make, make install), all > that would be needed is : > > (there may be some bits in the QUICKSTART GUIDE on > http://ltt.polymtl.ca, like adding the debugfs mount to fstab and make sure > the LTTng modules are loaded) > > #arm all the markers > ltt-armall > #start lttng tracing > lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt > > -> start latency tracer > -> stop latency tracer > -> trigger latency tracer dump > > While the tracing is active, trigger the condition... > > (rince, repeat, can handle multiple latency tracer dumps) > > #stop lttng tracing > lttctl -n test -R > #disarm all markers > ltt-disarmall > > You can easily test the trace with : > lttv -m textDump -t /tmp/trace1 > > Your users would issue something like : > > lttv -m latencytracerDump -t /tmp/trace1 No offense, but this is still quite a bit harder than: echo 1 >/proc/something ... wait a bit ... cat /proc/something (substitute /sys or /debugfs where appropriate) Having to compile something (besides the kernel) for the target is sometimes a major hassle. I avoided it completely with KFT. -- Tim = Tim Bird Architecture Group Chair, CE Linux Forum Senior Staff Engineer, Sony Corporation of America =
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > ... > > > > > > > > > > - Disable preemption at the read-side : > > > > it makes sure the pointer I get will point to a data structure that > > > > will never change while I am in the preempt disabled code. (see *) > > > > - I use per-cpu data to allow the read-side to be as fast as possible > > > > (only need to disable preemption, does not race against other CPUs and > > > > won't generate cache line bouncing). It also allows dealing with > > > > unsynchronized TSCs if needed. > > > > - Periodical write side : it's called from an IPI running on each CPU. > > > > > > > > (*) We expect the read-side (preempt off region) to last shorter than > > > > the interval between IPI updates so we can guarantee the data structure > > > > it uses won't be modified underneath it. Since the IPI update is > > > > launched each seconds or so (depends on the frequency of the counter we > > > > are trying to extend), it's more than ok. > > > > > > One thing I want to clear up. The major difference between this > > > latency_tracer and LTTng is what we consider fast paths. The latency > > > tracer is recording things like enabling and disabling interrupts, preempt > > > count changes, or simply profiling all function calls. Those are what I > > > consider fast paths. The slow path WRT the latency_tracer are things like > > > context switches. This is why I don't have a problem with copying the > > > comm at context switch time. Because that _is_ a slow path for the latency > > > tracer. > > > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > > in flight recorder mode, we have nothing to write to disk and therefore > > we can handle very frequent events. We then do the analysis off-line > > using the last MB written in the buffers. The advantage is that the > > kernel dumbly writes data to a buffer : we therefore move the complexity > > to user-space. 
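Flight-recorder mode, as Mathieu describes it, is just an overwriting ring buffer: the kernel keeps writing, old events are silently overwritten, and only the most recent window (the "last MB") is extracted for off-line analysis. A toy model with hypothetical names, using a tiny buffer in place of LTTng's megabyte-sized ones:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FR_SIZE 4   /* tiny for illustration; LTTng would use megabytes */

struct flight_rec {
    unsigned char buf[FR_SIZE];
    uint64_t head;   /* total bytes ever written */
};

/* Writers never block and never check for readers: old bytes are
 * simply overwritten (the "dumb write to a buffer" described above). */
static void fr_write(struct flight_rec *fr, const void *data, size_t len)
{
    const unsigned char *p = data;

    for (size_t i = 0; i < len; i++)
        fr->buf[(fr->head + i) % FR_SIZE] = p[i];
    fr->head += len;
}

/* Off-line extraction: recover the last min(head, FR_SIZE) bytes,
 * oldest first. Returns the number of bytes written to out. */
static size_t fr_dump(const struct flight_rec *fr, unsigned char *out)
{
    size_t avail = fr->head < FR_SIZE ? (size_t)fr->head : FR_SIZE;
    uint64_t start = fr->head - avail;

    for (size_t i = 0; i < avail; i++)
        out[i] = fr->buf[(start + i) % FR_SIZE];
    return avail;
}
```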
> > But you would still need to do something in case you want this information > dumped to console on a kernel crash. Of course you can rely on kexec, but > if the kexec fails (which is possible) then you lose all the information. > Having the ability to dump the output to console on a crash is one of the > benefits of latency_tracer that I want to keep. > There has been some integration done between LTTng and the "crash" tool to extract the buffers from a crashed kernel. I am not an expert in crash buffer extraction though, but I guess all the available mechanisms depend on kexec and could show the limits you are referring to. If you really want to pretty-print the information to the console, I would propose that you leave that part of the problem to a different output module. The core of the latency tracer could keep the minimum information. Then, when a dump is triggered, it either sends the information to LTTng or to your console pretty-printer. However, I would not call the pretty-printer a "tracer" module per-se. We would have to accept that it is a bit more tied to the kernel internals than the latency tracer. My goal is to separate the core "profiling" module from the optional "pretty-printing" module as much as possible so the latency tracer core could be reused by other output modules. > > > > > I agree that some kind of tracing, like the one you are doing, might be > > done more efficiently if you do a first clever analysis phase directly > > in the kernel without writing the raw high event rate data in memory > > buffers. However, I believe that it would be more powerful if we combine > > the two approaches rather than trying to do everything in or out of the > > kernel. LTTng could provide the comm names, priorities, etc, and your > > tracer could provide the top X list of processes that had a bad > > behavior. 
It would mean that the complete overall information would be > > made available after a post-processing phase done in an analysis tool > > like LTTV, but I don't see any problem with it. > > Of course you don't see any problem with it, because you know LTTV and > LTTng very well ;-) > > latency_tracer has been very detrimental in solving -rt patch latencies by > telling the customer to run with latency trace on, and then having them > simply set a few sysctl variables and run their app. By combining this > with LTTng, I wouldn't know how to start with telling a customer how to > analyze the problem. > > Simply put, latency_tracer has a much smaller learning curve than LTTng. > Not to mention, a smaller footprint. The tracer here is very focused on > what to do, and is not meant to be a general profiling tool as LTTng is. > > In-other-words, latency_tracer is LTTng-lite ;-) > If LTTng is already ported to your specific kernel, the learning-curve is not big at all. Here is what the latency_tracer over LTTng guide could look like : Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make,
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > Yeah, but if we replace the loop with a seq lock, then it would work. > > albeit, more cacheline bouncing (caused by writes). (maybe not, see below) > > > > Yes, but then you would trigger a deadlock if you instrument code called > from NMI, SMI, MCE contexts :( > > grep -ri NMI drivers/* arch/* |grep -vi PNMI > > is quite interesting : actually, it show that a few spots need to handle > those "so special interrupts" : watchdogs, oprofile, virtualization and > much more in architecture specific code. > > I just would not like to add a tracer in the kernel that is _so_ > intrusive that module writers and architecture maintainers would have to > audit their code and think about the tracer for each implementation that > would deal with these kind of interrupts. I don't want driver writers to worry about that either. > > > > > do you actually use the RCU internals? or do you just reimplement an RCU > > > > algorithm? > > > > > > > > > > Nope, I don't use RCU internals in this code. Preempt disable seemed > > > like the best way to handle this utterly short code path and I wanted > > > the write side to be fast enough to be called periodically. What I do is: > > > > grmble. Then how do you trace preempt_disable? As my tracer does that > > (see the last patch in the series). > > > > I think using a kind of preempt_disable_notrace() would make sense here. Actually after hitting the send button, I thought the same. > I mean.. even if you use the seqlock, eventually, you will want to trace > the seqlock behavior. Then you have to find the lightest way to do some > sort of synchronization that will have a predictable execution. seqlock > has the following disadvantage : if the seqlock read has to wait for the > write seqlock to end, we add up some time to the execution of the > code we are trying to profile, which will mix up the results. 
On the > other hand, if the read-side executes in a constant amount of cycles > which does not depend on the write-side activity, then we get a clearer > picture of what the time should be accounted for. We can even create a > module that will figure out how many nanoseconds are spent for reading > the clock so we can substract this time from our analysis if required. > > That's why having to choose between read seqlock and preemption disable > for the read-side, I would strongly prefer the preemption disable. > (constant execution time and it's deadlock-free) You may convince me yet ;-) > > > > > > > - Disable preemption at the read-side : > > > it makes sure the pointer I get will point to a data structure that > > > will never change while I am in the preempt disabled code. (see *) > > > - I use per-cpu data to allow the read-side to be as fast as possible > > > (only need to disable preemption, does not race against other CPUs and > > > won't generate cache line bouncing). It also allows dealing with > > > unsynchronized TSCs if needed. > > > - Periodical write side : it's called from an IPI running on each CPU. > > > > > > (*) We expect the read-side (preempt off region) to last shorter than > > > the interval between IPI updates so we can guarantee the data structure > > > it uses won't be modified underneath it. Since the IPI update is > > > launched each seconds or so (depends on the frequency of the counter we > > > are trying to extend), it's more than ok. > > > > One thing I want to clear up. The major difference between this > > latency_tracer and LTTng is what we consider fast paths. The latency > > tracer is recording things like enabling and disabling interrupts, preempt > > count changes, or simply profiling all function calls. Those are what I > > consider fast paths. The slow path WRT the latency_tracer are things like > > context switches. This is why I don't have a problem with copying the > > comm at context switch time. 
Because that _is_ a slow path for the latency > > tracer. > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > in flight recorder mode, we have nothing to write to disk and therefore > we can handle very frequent events. We then do the analysis off-line > using the last MB written in the buffers. The advantage is that the > kernel dumbly writes data to a buffer : we therefore move the complexity > to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information. Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. > > I agree that some kind of tracing, like the one you are doing, might be > done more efficiently if you do a first clever analysis phase directly > in the kernel without writing the raw high event rate data in memory > buffers. However, I believe
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Steven Rostedt wrote: > grmble. Then how do you trace preempt_disable? As my tracer does that > (see the last patch in the series). One way is to make a tracer_preempt_disable() and tracer_preempt_enable(), both of which would be 'notrace'. You could probably optimize them as well. The standard preempt_disable and preempt_enable don't look very efficient (e.g. what's up with converting an increment operation into an addition? - gak!) Any lock you do is going to have a pretty bad effect. In order to be able to trace as much as possible, for KFT, I implemented my own synchronization mechanism using cmpxchg, to avoid using any of the existing kernel locks (which change more often than you'd think, and have weird side effects). = Tim Bird Architecture Group Chair, CE Linux Forum Senior Staff Engineer, Sony Corporation of America =
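Tim's cmpxchg approach can be sketched with C11 atomics: each event reserves a buffer slot with a compare-exchange loop, so the fast path takes no lock and cannot deadlock against itself when re-entered from an interrupt or NMI. The names and fixed-size buffer below are illustrative, not KFT's actual code:

```c
#include <assert.h>
#include <stdatomic.h>

#define N_ENTRIES 128

static unsigned long trace_buf[N_ENTRIES];
static _Atomic unsigned int next_slot;

/* Claim one slot with compare-exchange; returns the claimed index,
 * or -1 when the buffer is full (the event is dropped). The weak CAS
 * may fail spuriously, which the loop absorbs; on failure, 'old' is
 * reloaded with the current value of next_slot. */
static int trace_event(unsigned long value)
{
    unsigned int old = atomic_load(&next_slot);

    do {
        if (old >= N_ENTRIES)
            return -1;           /* full: drop the event */
    } while (!atomic_compare_exchange_weak(&next_slot, &old, old + 1));

    trace_buf[old] = value;
    return (int)old;
}
```

Each concurrent caller is guaranteed a distinct index, which is the property a lock would otherwise provide.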
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > > One thing I want to clear up. The major difference between this > > latency_tracer and LTTng is what we consider fast paths. The latency > > tracer is recording things like enabling and disabling interrupts, preempt > > count changes, or simply profiling all function calls. Those are what I > > consider fast paths. The slow path WRT the latency_tracer are things like > > context switches. This is why I don't have a problem with copying the > > comm at context switch time. Because that _is_ a slow path for the latency > > tracer. > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. > in flight recorder mode, we have nothing to write to disk and therefore > we can handle very frequent events. We then do the analysis off-line > using the last MB written in the buffers. The advantage is that the > kernel dumbly writes data to a buffer : we therefore move the complexity > to user-space. > > I agree that some kind of tracing, like the one you are doing, might be > done more efficiently if you do a first clever analysis phase directly > in the kernel without writing the raw high event rate data in memory > buffers. However, I believe that it would be more powerful if we combine > the two approaches rather than trying to do everything in or out of the > kernel. LTTng could provide the comm names, priorities, etc, and your > tracer could provide the top X list of processes that had a bad > behavior. It would mean that the complete overall information would be > made available after a post-processing phase done in an analysis tool > like LTTV, but I don't see any problem with it. > Just to expand a bit on the design-side of my proposal : Your module would create "profiles" based on the hooks called. 
If we take the interrupt on/off for example, it would be called by
lockdep and could keep a table of the top X instructions that disable
interrupts for a long time. (it's just an example, you could want to
save the pid instead...) Then, whenever a "profile dump" is triggered,
you would simply have to send the current state of your profile to lttng
with something like :

struct irq_latency_table {
	void *ip;
	cycles_t time;
};

/*
 * Make sure only one profile at a time is written to the trace for the
 * whole system.
 */
static DEFINE_MUTEX(latency_profile_mutex);
static struct irq_latency_table latency_table[NR_ENTRIES];

void irq_latency_dump_profile(void)
{
	int i;
	char namebuf[KSYM_NAME_LEN];

	mutex_lock(&latency_profile_mutex);
	trace_mark(irq_latency_dump_begin, MARK_NOARGS);
	for (i = 0; i < NR_ENTRIES; i++) {
		sprint_symbol(namebuf, (unsigned long)latency_table[i].ip);
		trace_mark(irq_latency_entry, "ip %p symbol %s time %llu",
			   latency_table[i].ip, namebuf,
			   (unsigned long long)latency_table[i].time);
	}
	trace_mark(irq_latency_dump_end, MARK_NOARGS);
	mutex_unlock(&latency_profile_mutex);
}

You can then create a LTTV module that will format your nice output each
time a profile dump is encountered. By doing this, your specialized
profile generator would only have to hook into the irq on/off events to
gather the information it needs, nothing more. I think that would trim
the code size and the complexity of your module by an interesting
factor. Note that I could optimize the way I currently deal with symbols
by not having to dump them in the trace, but since it's only for
low-rate events, this optimization has a low priority on my todo list.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D.
Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > > No, there's probably issues there too, but no need to worry about it,
> > > since I already showed that allowing for clocksource_accumulate to happen
> > > inside the get_monotonic_cycles loop is already flawed.
> >
> > Yep, I just re-read through your previous email, and totally agree that
> > the algorithm is flawed in the way you pointed out.
>
> Yeah, but if we replace the loop with a seq lock, then it would work.
> albeit, more cacheline bouncing (caused by writes). (maybe not, see below)

Yes, but then you would trigger a deadlock if you instrument code called
from NMI, SMI, MCE contexts :(

grep -ri NMI drivers/* arch/* | grep -vi PNMI is quite interesting :
actually, it shows that a few spots need to handle those "so special
interrupts" : watchdogs, oprofile, virtualization and much more in
architecture specific code. I just would not like to add a tracer in the
kernel that is _so_ intrusive that module writers and architecture
maintainers would have to audit their code and think about the tracer
for each implementation that would deal with these kinds of interrupts.

> > > do you actually use the RCU internals? or do you just reimplement an RCU
> > > algorithm?
> >
> > Nope, I don't use RCU internals in this code. Preempt disable seemed
> > like the best way to handle this utterly short code path and I wanted
> > the write side to be fast enough to be called periodically. What I do is:
>
> grmble. Then how do you trace preempt_disable? As my tracer does that
> (see the last patch in the series).

I think using a kind of preempt_disable_notrace() would make sense here.
I mean.. even if you use the seqlock, eventually, you will want to trace
the seqlock behavior. Then you have to find the lightest way to do some
sort of synchronization that will have a predictable execution.
seqlock has the following disadvantage : if the seqlock read has to wait
for the write seqlock to end, we add up some time to the execution of
the code we are trying to profile, which will mix up the results. On the
other hand, if the read-side executes in a constant number of cycles
which does not depend on the write-side activity, then we get a clearer
picture of what the time should be accounted for. We can even create a
module that will figure out how many nanoseconds are spent reading the
clock, so we can subtract this time from our analysis if required.
That's why, having to choose between a read seqlock and preemption
disable for the read-side, I would strongly prefer the preemption
disable. (constant execution time and it's deadlock-free)

> > - Disable preemption at the read-side :
> >   it makes sure the pointer I get will point to a data structure that
> >   will never change while I am in the preempt disabled code. (see *)
> > - I use per-cpu data to allow the read-side to be as fast as possible
> >   (only need to disable preemption, does not race against other CPUs and
> >   won't generate cache line bouncing). It also allows dealing with
> >   unsynchronized TSCs if needed.
> > - Periodical write side : it's called from an IPI running on each CPU.
> >
> > (*) We expect the read-side (preempt off region) to last shorter than
> > the interval between IPI updates so we can guarantee the data structure
> > it uses won't be modified underneath it. Since the IPI update is
> > launched every second or so (depends on the frequency of the counter we
> > are trying to extend), it's more than ok.
>
> One thing I want to clear up. The major difference between this
> latency_tracer and LTTng is what we consider fast paths. The latency
> tracer is recording things like enabling and disabling interrupts, preempt
> count changes, or simply profiling all function calls. Those are what I
> consider fast paths.
The slow path WRT the latency_tracer are things like > context switches. This is why I don't have a problem with copying the > comm at context switch time. Because that _is_ a slow path for the latency > tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > No, there's probably issues there too, but no need to worry about it, > > since I already showed that allowing for clocksource_accumulate to happen > > inside the get_monotonic_cycles loop is already flawed. > > > > Yep, I just re-read through your previous email, and totally agree that > the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) > > do you actually use the RCU internals? or do you just reimplement an RCU > > algorithm? > > > > Nope, I don't use RCU internals in this code. Preempt disable seemed > like the best way to handle this utterly short code path and I wanted > the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). > > - Disable preemption at the read-side : > it makes sure the pointer I get will point to a data structure that > will never change while I am in the preempt disabled code. (see *) > - I use per-cpu data to allow the read-side to be as fast as possible > (only need to disable preemption, does not race against other CPUs and > won't generate cache line bouncing). It also allows dealing with > unsynchronized TSCs if needed. > - Periodical write side : it's called from an IPI running on each CPU. > > (*) We expect the read-side (preempt off region) to last shorter than > the interval between IPI updates so we can guarantee the data structure > it uses won't be modified underneath it. Since the IPI update is > launched each seconds or so (depends on the frequency of the counter we > are trying to extend), it's more than ok. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. 
The latency tracer is recording things like enabling and disabling
interrupts, preempt count changes, or simply profiling all function
calls. Those are what I consider fast paths. The slow path WRT the
latency_tracer are things like context switches. This is why I don't
have a problem with copying the comm at context switch time. Because
that _is_ a slow path for the latency tracer.

Placing a read_seqlock in get_monotonic_cycles would not be that bad,
since the only slow down would be the rmb. read_seqlocks don't modify
global data. Only the write_seqlock does. So the cache line bouncing
would only happen on updates in clocksource_accumulate. But then after
the caches are all balanced again, the reads will continue fine.

Question: Is a cache-miss a greater cost than a read to a clocksource
(besides the TSC)?

Also note how I arrange these variables in the clock struct:

	struct {
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

I could do the following:

	struct {
		seqlock_t cycle_lock;
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

Which would help to keep all these in the same cache line. These are all
updated at the same time, and hopefully this will keep the cache line
bouncing limited to a single cacheline.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: > > Hrm, I will reply to the rest of this email in a separate mail, but > > there is another concern, simpler than memory ordering, that just hit > > me : > > > > If we have CPU A calling clocksource_accumulate while CPU B is calling > > get_monotonic_cycles, but events happens in the following order (because > > of preemption or interrupts). Here, to make things worse, we would be on > > x86 where cycle_t is not an atomic write (64 bits) : > > > > > > CPU A CPU B > > > > clocksource read > > update cycle_mono (1st 32 bits) > >read cycle_mono > >read cycle_last > >clocksource read > >read cycle_mono > >read cycle_last > > update cycle_mono (2nd 32 bits) > > update cycle_last > > update cycle_acc > > > > Therefore, we have : > > - an inconsistant cycle_monotonic value > > - inconsistant cycle_monotonic and cycle_last values. > > > > Or is there something I have missed ? > > No, there's probably issues there too, but no need to worry about it, > since I already showed that allowing for clocksource_accumulate to happen > inside the get_monotonic_cycles loop is already flawed. > Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. > > > > If you really want an seqlock free algorithm (I _do_ want this for > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > could be turned into the clocksource updater). > > I know you pointed me the code, but lets assume that I'm still ignorant > ;-) > > do you actually use the RCU internals? or do you just reimplement an RCU > algorithm? > Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. 
What I do is:

- Disable preemption at the read-side :
  it makes sure the pointer I get will point to a data structure that
  will never change while I am in the preempt disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible
  (only need to disable preemption, does not race against other CPUs and
  won't generate cache line bouncing). It also allows dealing with
  unsynchronized TSCs if needed.
- Periodical write side : it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt off region) to last shorter than
the interval between IPI updates so we can guarantee the data structure
it uses won't be modified underneath it. Since the IPI update is
launched every second or so (depends on the frequency of the counter we
are trying to extend), it's more than ok.

Mathieu

> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:

> Hrm, I will reply to the rest of this email in a separate mail, but
> there is another concern, simpler than memory ordering, that just hit
> me :
>
> If we have CPU A calling clocksource_accumulate while CPU B is calling
> get_monotonic_cycles, but events happen in the following order (because
> of preemption or interrupts). Here, to make things worse, we would be on
> x86 where cycle_t is not an atomic write (64 bits) :
>
> CPU A                              CPU B
>
> clocksource read
> update cycle_mono (1st 32 bits)
>                                    read cycle_mono
>                                    read cycle_last
>                                    clocksource read
>                                    read cycle_mono
>                                    read cycle_last
> update cycle_mono (2nd 32 bits)
> update cycle_last
> update cycle_acc
>
> Therefore, we have :
> - an inconsistent cycle_monotonic value
> - inconsistent cycle_monotonic and cycle_last values.
>
> Or is there something I have missed ?

No, there's probably issues there too, but no need to worry about it,
since I already showed that allowing for clocksource_accumulate to happen
inside the get_monotonic_cycles loop is already flawed.

> If you really want a seqlock-free algorithm (I _do_ want this for
> tracing!) :) maybe going in the RCU direction could help (I refer to my
> RCU-based 32-to-64 bits lockless timestamp counter extension, which
> could be turned into the clocksource updater).

I know you pointed me the code, but lets assume that I'm still ignorant
;-)

do you actually use the RCU internals? or do you just reimplement an RCU
algorithm?

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> [ CC'd Daniel Walker, since he had problems with this code ]
>
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > I agree with you that I don't see how the compiler could reorder this.
> > So we forget about compiler barriers. Also, the clock source used is a
> > synchronized clock source (get_cycles_sync on x86_64), so it should make
> > sure the TSC is read at the right moment.
> >
> > However, what happens if the clock source is, say, the jiffies ?
> >
> > In this case, we have :
> >
> > static cycle_t jiffies_read(void)
> > {
> > 	return (cycle_t) jiffies;
> > }
> >
> > Which is nothing more than a memory read of
> >
> > extern unsigned long volatile __jiffy_data jiffies;
>
> Yep, and that's not my concern.

Hrm, I will reply to the rest of this email in a separate mail, but
there is another concern, simpler than memory ordering, that just hit
me :

If we have CPU A calling clocksource_accumulate while CPU B is calling
get_monotonic_cycles, but events happen in the following order (because
of preemption or interrupts). Here, to make things worse, we would be on
x86 where cycle_t is not an atomic write (64 bits) :

CPU A                              CPU B

clocksource read
update cycle_mono (1st 32 bits)
                                   read cycle_mono
                                   read cycle_last
                                   clocksource read
                                   read cycle_mono
                                   read cycle_last
update cycle_mono (2nd 32 bits)
update cycle_last
update cycle_acc

Therefore, we have :
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed ?

If you really want a seqlock-free algorithm (I _do_ want this for
tracing!) :) maybe going in the RCU direction could help (I refer to my
RCU-based 32-to-64 bits lockless timestamp counter extension, which
could be turned into the clocksource updater).

Mathieu

> > I think it is wrong to assume that reads from clock->cycle_raw and from
> > jiffies will be ordered correctly in SMP.
I am tempted to think that > > ordering memory writes to clock->cycle_raw vs jiffies is also needed in this > > case (where clock->cycle_raw is updated, or where jiffies is updated). > > > > We can fall in the same kind of issue if we read the HPET, which is > > memory I/O based. It does not seems correct to assume that MMIO vs > > normal memory reads are ordered. (pointing back to this article : > > http://lwn.net/Articles/198988/) > > That and the dread memory barrier thread that my head is still spinning > on. > > Ok, lets take a close look at the code in question. I may be wrong, and if > so, great, we can fix it. > > We have this in get_monotonic_cycles: > > { > cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; > do { > cycle_monotonic = clock->cycle_monotonic; > cycle_last = clock->cycle_last; > cycle_now = clocksource_read(clock); > cycle_delta = (cycle_now - cycle_last) & clock->mask; > } while (cycle_monotonic != clock->cycle_monotonic || >cycle_last != clock->cycle_last); > return cycle_monotonic + cycle_delta; > } > > and this in clocksource.h > > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) > { > cycle_t offset = (now - cs->cycle_last) & cs->mask; > cs->cycle_last = now; > cs->cycle_accumulated += offset; > cs->cycle_monotonic += offset; > } > > now is usually just a clocksource_read() passed in. > > The goal is to have clock_monotonic always return something that is > greater than what was read the last time. > > Let's make a few assumptions now (for others to shoot them down). One > thing is that we don't need to worry too much about MMIO, because we are > doing a read. This means we need the data right now to contiune. So the > read being a function call should keep gcc from moving stuff around, and > since we are doing an IO read, the order of events should be pretty much > synchronized. in > > 1. load cycle_last and cycle_monotonic (we don't care which order)* > 2. read clock source > 3. 
calculate delta and while() compare (order doesn't matter) > > * we might care (see below) > > If the above is incorrect, then we need to fix get_monotonic_cycles. > > in clocksource_accumulate, we have: > > offset = ((now = cs->read()) - cycle_last) & cs->mask; > cycle_last = now; > cycle_accumulate += offset; > cycle_monotonic += offset; > > The order of events here are. Using the same reasoning as above, the read > must be first and completed because for gcc it's a function, and for IO, > it needs to return data. > > 1. cs->read > 2. update cycle_last, cycle_accumulate, cycle_monotonic. > > Can we assume, if the above for get_monotonic_cycles is correct, that > since we read and compare cycle_last and cycle_monotonic, that neither of > them have
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
[ CC'd Daniel Walker, since he had problems with this code ]

On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> I agree with you that I don't see how the compiler could reorder this.
> So we forget about compiler barriers. Also, the clock source used is a
> synchronized clock source (get_cycles_sync on x86_64), so it should make
> sure the TSC is read at the right moment.
>
> However, what happens if the clock source is, say, the jiffies ?
>
> In this case, we have :
>
> static cycle_t jiffies_read(void)
> {
> 	return (cycle_t) jiffies;
> }
>
> Which is nothing more than a memory read of
>
> extern unsigned long volatile __jiffy_data jiffies;

Yep, and that's not my concern.

> I think it is wrong to assume that reads from clock->cycle_raw and from
> jiffies will be ordered correctly in SMP. I am tempted to think that
> ordering memory writes to clock->cycle_raw vs jiffies is also needed in this
> case (where clock->cycle_raw is updated, or where jiffies is updated).
>
> We can fall in the same kind of issue if we read the HPET, which is
> memory I/O based. It does not seem correct to assume that MMIO vs
> normal memory reads are ordered. (pointing back to this article :
> http://lwn.net/Articles/198988/)

That and the dread memory barrier thread that my head is still spinning
on.

Ok, lets take a close look at the code in question. I may be wrong, and if
so, great, we can fix it.
We have this in get_monotonic_cycles:

{
	cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

	do {
		cycle_monotonic = clock->cycle_monotonic;
		cycle_last = clock->cycle_last;
		cycle_now = clocksource_read(clock);
		cycle_delta = (cycle_now - cycle_last) & clock->mask;
	} while (cycle_monotonic != clock->cycle_monotonic ||
		 cycle_last != clock->cycle_last);

	return cycle_monotonic + cycle_delta;
}

and this in clocksource.h

static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
{
	cycle_t offset = (now - cs->cycle_last) & cs->mask;

	cs->cycle_last = now;
	cs->cycle_accumulated += offset;
	cs->cycle_monotonic += offset;
}

now is usually just a clocksource_read() passed in.

The goal is to have clock_monotonic always return something that is
greater than what was read the last time.

Let's make a few assumptions now (for others to shoot them down). One
thing is that we don't need to worry too much about MMIO, because we are
doing a read. This means we need the data right now to continue. So the
read being a function call should keep gcc from moving stuff around, and
since we are doing an IO read, the order of events should be pretty much
synchronized. in

1. load cycle_last and cycle_monotonic (we don't care which order)*
2. read clock source
3. calculate delta and while() compare (order doesn't matter)

* we might care (see below)

If the above is incorrect, then we need to fix get_monotonic_cycles.

in clocksource_accumulate, we have:

	offset = ((now = cs->read()) - cycle_last) & cs->mask;
	cycle_last = now;
	cycle_accumulate += offset;
	cycle_monotonic += offset;

The order of events here is as follows. Using the same reasoning as
above, the read must be first and completed because for gcc it's a
function, and for IO, it needs to return data.

1. cs->read
2. update cycle_last, cycle_accumulate, cycle_monotonic.
Can we assume, if the above for get_monotonic_cycles is correct, that
since we read and compare cycle_last and cycle_monotonic, that neither of
them have changed over the read? So we have a snapshot of the
clocksource_accumulate.

So the worst thing that I can think of, is that cycle_monotonic is
updated *before* cycle_last:

	cycle_monotonic += offset;
	cycle_last = now;

cycle_last = 5
cycle_monotonic = 0

       CPU 0                           CPU 1
    ----------                      ----------
 cs->read() = 10
 offset = 10 - 5 = 5
 cycle_monotonic = 5
                                 cycle_monotonic = 5
                                 cycle_last = 5
                                 cs->read() = 11
                                 delta = 11 - 5 = 6
                                 cycle_monotonic and cycle_last still same
                                 return 5 + 6 = 11
 cycle_last = 10
                                 cycle_monotonic = 5
                                 cycle_last = 10
                                 cs->read() = 12
                                 delta = 12 - 10 = 2
                                 cycle_monotonic and cycle_last still same
                                 return 5 + 2 = 7  *ERROR*

So, we *do* need memory barriers. Looks like cycle_last and
cycle_monotonic need to be synchronized.

OK, will this do?

cycle_t notrace get_monotonic_cycles(void)
{
	cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

	do {
		cycle_monotonic = clock->cycle_monotonic;
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
[ CC'd Daniel Walker, since he had problems with this code ] On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: I agree with you that I don't see how the compiler could reorder this. So we forget about compiler barriers. Also, the clock source used is a synchronized clock source (get_cycles_sync on x86_64), so it should make sure the TSC is read at the right moment. However, what happens if the clock source is, say, the jiffies ? Is this case, we have : static cycle_t jiffies_read(void) { return (cycle_t) jiffies; } Which is nothing more than a memory read of extern unsigned long volatile __jiffy_data jiffies; Yep, and that's not my concern. I think it is wrong to assume that reads from clock-cycle_raw and from jiffies will be ordered correctly in SMP. I am tempted to think that ordering memory writes to clock-cycle_raw vs jiffies is also needed in this case (where clock-cycle_raw is updated, or where jiffies is updated). We can fall in the same kind of issue if we read the HPET, which is memory I/O based. It does not seems correct to assume that MMIO vs normal memory reads are ordered. (pointing back to this article : http://lwn.net/Articles/198988/) That and the dread memory barrier thread that my head is still spinning on. Ok, lets take a close look at the code in question. I may be wrong, and if so, great, we can fix it. 
We have this in get_monotonic_cycles: { cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; do { cycle_monotonic = clock-cycle_monotonic; cycle_last = clock-cycle_last; cycle_now = clocksource_read(clock); cycle_delta = (cycle_now - cycle_last) clock-mask; } while (cycle_monotonic != clock-cycle_monotonic || cycle_last != clock-cycle_last); return cycle_monotonic + cycle_delta; } and this in clocksource.h static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) { cycle_t offset = (now - cs-cycle_last) cs-mask; cs-cycle_last = now; cs-cycle_accumulated += offset; cs-cycle_monotonic += offset; } now is usually just a clocksource_read() passed in. The goal is to have clock_monotonic always return something that is greater than what was read the last time. Let's make a few assumptions now (for others to shoot them down). One thing is that we don't need to worry too much about MMIO, because we are doing a read. This means we need the data right now to contiune. So the read being a function call should keep gcc from moving stuff around, and since we are doing an IO read, the order of events should be pretty much synchronized. in 1. load cycle_last and cycle_monotonic (we don't care which order)* 2. read clock source 3. calculate delta and while() compare (order doesn't matter) * we might care (see below) If the above is incorrect, then we need to fix get_monotonic_cycles. in clocksource_accumulate, we have: offset = ((now = cs-read()) - cycle_last) cs-mask; cycle_last = now; cycle_accumulate += offset; cycle_monotonic += offset; The order of events here are. Using the same reasoning as above, the read must be first and completed because for gcc it's a function, and for IO, it needs to return data. 1. cs-read 2. update cycle_last, cycle_accumulate, cycle_monotonic. Can we assume, if the above for get_monotonic_cycles is correct, that since we read and compare cycle_last and cycle_monotonic, that neither of them have changed over the read? 
So we have a snapshot of the clocksource_accumulate. So the worst thing that I can think of, is that cycle_monotonic is update *before* cycle_last: cycle_monotonic += offest; get_monotonic_cycles run on other CPU cycle_last = now; cycle_last = 5 cycle_monotonic = 0 CPU 0 CPU 1 -- - cs-read() = 10 offset = 10 - 5 = 5 cycle_monotonic = 5 cycle_monotonic = 5 cycle_last = 5 cs-read() = 11 delta = 11 - 5 = 6 cycle_monotonic and cycle_last still same return 5 + 6 = 11 cycle_last = 10 cycle_monotonic = 5 cycle_last = 10 cs-read() = 12 delta = 12 - 10 = 2 cycle_monotonic and cycle_last still same return 5 + 2 = 7 ERROR * So, we *do* need memory barriers. Looks like cycle_last and cycle_monotonic need to be synchronized. OK, will this do? cycle_t notrace get_monotonic_cycles(void) { cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last; do { cycle_monotonic = clock-cycle_monotonic; smp_rmb();
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: [ CC'd Daniel Walker, since he had problems with this code ] On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: I agree with you that I don't see how the compiler could reorder this. So we forget about compiler barriers. Also, the clock source used is a synchronized clock source (get_cycles_sync on x86_64), so it should make sure the TSC is read at the right moment. However, what happens if the clock source is, say, the jiffies ? Is this case, we have : static cycle_t jiffies_read(void) { return (cycle_t) jiffies; } Which is nothing more than a memory read of extern unsigned long volatile __jiffy_data jiffies; Yep, and that's not my concern. Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me : If we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happens in the following order (because of preemption or interrupts). Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits) : CPU A CPU B clocksource read update cycle_mono (1st 32 bits) read cycle_mono read cycle_last clocksource read read cycle_mono read cycle_last update cycle_mono (2nd 32 bits) update cycle_last update cycle_acc Therefore, we have : - an inconsistant cycle_monotonic value - inconsistant cycle_monotonic and cycle_last values. Or is there something I have missed ? If you really want an seqlock free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). Mathieu I think it is wrong to assume that reads from clock-cycle_raw and from jiffies will be ordered correctly in SMP. 
I am tempted to think that ordering memory writes to clock->cycle_raw vs jiffies is also needed in this case (where clock->cycle_raw is updated, or where jiffies is updated). We can fall into the same kind of issue if we read the HPET, which is memory-I/O based. It does not seem correct to assume that MMIO vs normal memory reads are ordered. (Pointing back to this article: http://lwn.net/Articles/198988/) That and the dread memory barrier thread that my head is still spinning on. OK, let's take a close look at the code in question. I may be wrong, and if so, great, we can fix it. We have this in get_monotonic_cycles:

	cycle_t get_monotonic_cycles(void)
	{
		cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

		do {
			cycle_monotonic = clock->cycle_monotonic;
			cycle_last = clock->cycle_last;
			cycle_now = clocksource_read(clock);
			cycle_delta = (cycle_now - cycle_last) & clock->mask;
		} while (cycle_monotonic != clock->cycle_monotonic ||
			 cycle_last != clock->cycle_last);

		return cycle_monotonic + cycle_delta;
	}

and this in clocksource.h:

	static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
	{
		cycle_t offset = (now - cs->cycle_last) & cs->mask;

		cs->cycle_last = now;
		cs->cycle_accumulated += offset;
		cs->cycle_monotonic += offset;
	}

now is usually just a clocksource_read() passed in. The goal is to have clock_monotonic always return something that is greater than what was read the last time. Let's make a few assumptions now (for others to shoot them down). One thing is that we don't need to worry too much about MMIO, because we are doing a read. This means we need the data right now to continue. So the read being a function call should keep gcc from moving stuff around, and since we are doing an IO read, the order of events should be pretty much synchronized:

1. load cycle_last and cycle_monotonic (we don't care which order)*
2. read clock source
3.
calculate delta and while() compare (order doesn't matter)

* we might care (see below)

If the above is incorrect, then we need to fix get_monotonic_cycles. In clocksource_accumulate, we have:

	offset = ((now = cs->read()) - cycle_last) & cs->mask;
	cycle_last = now;
	cycle_accumulated += offset;
	cycle_monotonic += offset;

The order of events here, using the same reasoning as above (the read must be first and completed, because for gcc it's a function, and for IO, it needs to return data), is:

1. cs->read()
2. update cycle_last, cycle_accumulated, cycle_monotonic.

Can we assume, if the above for get_monotonic_cycles is correct, that since we read and compare cycle_last and cycle_monotonic, that neither of them has changed over the read? So we have a snapshot of the clocksource_accumulate. So the worst thing that I can think of, is that cycle_monotonic is updated
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me: what if we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happen in the following order (because of preemption or interrupts)? Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits):

	CPU A                              CPU B
	clocksource read
	update cycle_mono (1st 32 bits)
	                                   read cycle_mono
	                                   read cycle_last
	                                   clocksource read
	                                   read cycle_mono
	                                   read cycle_last
	update cycle_mono (2nd 32 bits)
	update cycle_last
	update cycle_acc

Therefore, we have:
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed? No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). I know you pointed me at the code, but let's assume that I'm still ignorant ;-) Do you actually use the RCU internals, or do you just reimplement an RCU algorithm? -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: Hrm, I will reply to the rest of this email in a separate mail, but there is another concern, simpler than memory ordering, that just hit me: what if we have CPU A calling clocksource_accumulate while CPU B is calling get_monotonic_cycles, but events happen in the following order (because of preemption or interrupts)? Here, to make things worse, we would be on x86 where cycle_t is not an atomic write (64 bits):

	CPU A                              CPU B
	clocksource read
	update cycle_mono (1st 32 bits)
	                                   read cycle_mono
	                                   read cycle_last
	                                   clocksource read
	                                   read cycle_mono
	                                   read cycle_last
	update cycle_mono (2nd 32 bits)
	update cycle_last
	update cycle_acc

Therefore, we have:
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed? No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bits lockless timestamp counter extension, which could be turned into the clocksource updater). I know you pointed me at the code, but let's assume that I'm still ignorant ;-) Do you actually use the RCU internals, or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is:

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code.
(see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. Mathieu -- Steve -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
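A userspace sketch of the scheme Mathieu describes, reduced to a single CPU's data: the periodic (IPI-driven) writer fills the inactive slot of a double-buffered 32-to-64-bit counter extension and then flips the slot index, so the reader (who in the kernel would run under preempt_disable()) never sees a slot being modified, provided it finishes before the next update. All names and the two-slot layout are illustrative, not Mathieu's actual code:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Per-CPU 32->64 bit counter extension, double-buffered so the
 * periodic writer never touches the slot a reader is using. */
struct tsc_ext {
	struct {
		uint64_t msb;		/* accumulated upper 32 bits */
		uint32_t last_lsb;	/* hw counter at last update */
	} slot[2];
	_Atomic unsigned cur;		/* index of the slot readers use */
};

/* Periodic write side: fill the inactive slot, then publish it. */
static void tsc_ext_update(struct tsc_ext *t, uint32_t hw_now)
{
	unsigned cur = atomic_load(&t->cur);
	unsigned next = cur ^ 1;
	uint64_t msb = t->slot[cur].msb;

	if (hw_now < t->slot[cur].last_lsb)	/* 32-bit wrap since last update */
		msb += 1ULL << 32;
	t->slot[next].msb = msb;
	t->slot[next].last_lsb = hw_now;
	atomic_store(&t->cur, next);		/* release store in real code */
}

/* Read side: in the kernel this would be bracketed by
 * preempt_disable()/preempt_enable(). */
static uint64_t tsc_ext_read(struct tsc_ext *t, uint32_t hw_now)
{
	unsigned cur = atomic_load(&t->cur);
	uint64_t msb = t->slot[cur].msb;
	uint32_t last = t->slot[cur].last_lsb;

	if (hw_now < last)			/* wrap since last update */
		msb += 1ULL << 32;
	return msb | hw_now;
}
```

The read side is a fixed handful of loads regardless of writer activity, which is the constant-execution-time property argued for later in the thread.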
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). - Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *) - I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed. - Periodical write side: it's called from an IPI running on each CPU. (*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls.
Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. Placing a read_seqlock in get_monotonic_cycles would not be that bad, since the only slowdown would be the rmb. read_seqlocks don't modify global data. Only the write_seqlock does. So the cache line bouncing would only happen on updates in clocksource_accumulate. But then after the caches are all balanced again, the reads will continue fine. Question: Is a cache miss a greater cost than a read of a clocksource (besides the TSC)? Also note how I arrange these variables in the clock struct:

	struct {
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

I could do the following:

	struct {
		seqlock_t cycle_lock;
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;

Which would help to keep all these in the same cache line. These are all updated at the same time, and hopefully this will keep the cache line bouncing limited to a single cacheline. -- Steve
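The seqlock variant Steven is proposing can be sketched in userspace with a C11 seqcount: the writer bumps the sequence to an odd value before updating, and back to even afterwards; the reader retries whenever it saw an odd sequence or the sequence changed across its loads. Names and layout are illustrative (the kernel would use seqlock_t and read_seqbegin/read_seqretry):

```c
#include <stdint.h>
#include <stdatomic.h>

typedef uint64_t cycle_t;

/* Seqcount kept next to the data it guards, so one cache line covers
 * both, as suggested in the struct arrangement above. */
struct clock_data {
	_Atomic unsigned seq;
	cycle_t cycle_last;
	cycle_t cycle_monotonic;
};

static void clock_update(struct clock_data *c, cycle_t now)
{
	atomic_fetch_add(&c->seq, 1);	/* odd: update in progress */
	atomic_thread_fence(memory_order_release);
	c->cycle_monotonic += now - c->cycle_last;
	c->cycle_last = now;
	atomic_thread_fence(memory_order_release);
	atomic_fetch_add(&c->seq, 1);	/* even: update done */
}

static cycle_t clock_read(struct clock_data *c, cycle_t now)
{
	unsigned seq;
	cycle_t mono, last;

	do {
		seq = atomic_load(&c->seq);
		atomic_thread_fence(memory_order_acquire);
		mono = c->cycle_monotonic;
		last = c->cycle_last;
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) || seq != atomic_load(&c->seq));

	return mono + (now - last);
}
```

Note that, unlike the preempt-disable scheme, a reader here spins while an update is in flight, which is exactly Mathieu's later objection about NMI/SMI contexts and non-constant read time.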
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: No, there's probably issues there too, but no need to worry about it, since I already showed that allowing for clocksource_accumulate to happen inside the get_monotonic_cycles loop is already flawed. Yep, I just re-read through your previous email, and totally agree that the algorithm is flawed in the way you pointed out. Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) Yes, but then you would trigger a deadlock if you instrument code called from NMI, SMI, MCE contexts :(

	grep -ri NMI drivers/* arch/* | grep -vi PNMI

is quite interesting: it actually shows that a few spots need to handle these special interrupts: watchdogs, oprofile, virtualization and much more in architecture-specific code. I just would not like to add a tracer in the kernel that is _so_ intrusive that module writers and architecture maintainers would have to audit their code and think about the tracer for each implementation that would deal with these kinds of interrupts. do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). I think using a kind of preempt_disable_notrace() would make sense here. I mean.. even if you use the seqlock, eventually, you will want to trace the seqlock behavior. Then you have to find the lightest way to do some sort of synchronization that will have a predictable execution.
seqlock has the following disadvantage: if the seqlock read has to wait for the write seqlock to end, we add some time to the execution of the code we are trying to profile, which will mix up the results. On the other hand, if the read-side executes in a constant number of cycles which does not depend on the write-side activity, then we get a clearer picture of what the time should be accounted for. We can even create a module that will figure out how many nanoseconds are spent reading the clock so we can subtract this time from our analysis if required. That's why, having to choose between read seqlock and preemption disable for the read-side, I would strongly prefer the preemption disable. (constant execution time and it's deadlock-free)

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls.
This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: * Steven Rostedt ([EMAIL PROTECTED]) wrote: One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done in an analysis tool like LTTV, but I don't see any problem with it. Just to expand a bit on the design-side of my proposal : Your module would create profiles based on the hooks called. If we take the interrupt on/off for example, it would be called by lockdep and could keep a table of the top X instructions that disables interrupts for a long time. (it's just an example, you could want to save the pid instead...) 
Then, whenever a profile dump is triggered, you would simply have to send the current state of your profile to lttng with something like:

	struct irq_latency_table {
		void *ip;
		cycles_t time;
	};

	/*
	 * Make sure only one profile at a time is written to the trace for the
	 * whole system.
	 */
	static DECLARE_MUTEX(latency_profile_mutex);

	static struct irq_latency_table latency_table[NR_ENTRIES];

	void irq_latency_dump_profile(void)
	{
		int i;
		char namebuf[KSYM_NAME_LEN];

		mutex_lock(&latency_profile_mutex);
		trace_mark(irq_latency_dump_begin, MARK_NOARGS);
		for (i = 0; i < NR_ENTRIES; i++) {
			sprint_symbol(namebuf, (unsigned long)latency_table[i].ip);
			trace_mark(irq_latency_entry, "ip %p symbol %s time %llu",
				latency_table[i].ip, namebuf,
				(unsigned long long)latency_table[i].time);
		}
		trace_mark(irq_latency_dump_end, MARK_NOARGS);
		mutex_unlock(&latency_profile_mutex);
	}

You can then create an LTTV module that will format your nice output each time a profile dump is encountered. By doing this, your specialized profile generator would only have to hook into the irq on/off events to gather the information it needs, nothing more. I think that would trim the code size and the complexity of your module by an interesting factor. Note that I could optimize the way I currently deal with symbols by not having to dump them in the trace, but since it's only for low-rate events, this optimization has a low priority on my todo list. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Steven Rostedt wrote: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). One way is to make a tracer_preempt_disable() and tracer_preempt_enable(), both of which would be 'notrace'. You could probably optimize them as well. The standard preempt_disable and preempt_enable don't look very efficient (e.g. what's up with converting an increment operation into an addition? - gak!) Any lock you do is going to have a pretty bad effect. In order to be able to trace as much as possible, for KFT, I implemented my own synchronization mechanism using cmpxchg, to avoid using any of the existing kernel locks (which change more often than you'd think, and have weird side effects).

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
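A cmpxchg-based scheme of the kind Tim mentions can be sketched in userspace C11 (illustrative names only, not KFT's actual code): writers claim trace-buffer slots with a compare-and-swap retry loop, so no kernel lock is held and the path stays usable from any context where the atomic itself is safe:

```c
#include <stdint.h>
#include <stdatomic.h>

#define BUF_SIZE 1024

/* Trace buffer whose write cursor is advanced with compare-and-swap
 * instead of a lock, so concurrent writers (or an interrupting
 * context) each end up with a distinct slot. */
struct trace_buf {
	_Atomic uint32_t next;
	uint32_t events[BUF_SIZE];
};

/* Claim the next free slot; returns its index, or -1 if full. */
static int trace_reserve(struct trace_buf *b)
{
	uint32_t old, new;

	do {
		old = atomic_load(&b->next);
		if (old >= BUF_SIZE)
			return -1;		/* buffer full */
		new = old + 1;
		/* cmpxchg: only advances if nobody raced us; on failure
		 * 'old' is refreshed and we retry. */
	} while (!atomic_compare_exchange_weak(&b->next, &old, new));

	return (int)old;
}
```

The retry loop is bounded only by contention, but no writer can block another indefinitely the way a held lock can, which is the deadlock-avoidance property Tim is after.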
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: * Steven Rostedt ([EMAIL PROTECTED]) wrote: Yeah, but if we replace the loop with a seq lock, then it would work. albeit, more cacheline bouncing (caused by writes). (maybe not, see below) Yes, but then you would trigger a deadlock if you instrument code called from NMI, SMI, MCE contexts :(

	grep -ri NMI drivers/* arch/* | grep -vi PNMI

is quite interesting: it actually shows that a few spots need to handle these special interrupts: watchdogs, oprofile, virtualization and much more in architecture-specific code. I just would not like to add a tracer in the kernel that is _so_ intrusive that module writers and architecture maintainers would have to audit their code and think about the tracer for each implementation that would deal with these kinds of interrupts. I don't want driver writers to worry about that either. do you actually use the RCU internals? or do you just reimplement an RCU algorithm? Nope, I don't use RCU internals in this code. Preempt disable seemed like the best way to handle this utterly short code path and I wanted the write side to be fast enough to be called periodically. What I do is: grmble. Then how do you trace preempt_disable? As my tracer does that (see the last patch in the series). I think using a kind of preempt_disable_notrace() would make sense here. Actually, after hitting the send button, I thought the same. I mean.. even if you use the seqlock, eventually, you will want to trace the seqlock behavior. Then you have to find the lightest way to do some sort of synchronization that will have a predictable execution. seqlock has the following disadvantage: if the seqlock read has to wait for the write seqlock to end, we add some time to the execution of the code we are trying to profile, which will mix up the results.
On the other hand, if the read-side executes in a constant number of cycles which does not depend on the write-side activity, then we get a clearer picture of what the time should be accounted for. We can even create a module that will figure out how many nanoseconds are spent reading the clock so we can subtract this time from our analysis if required. That's why, having to choose between read seqlock and preemption disable for the read-side, I would strongly prefer the preemption disable. (constant execution time and it's deadlock-free) You may convince me yet ;-)

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc..
in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer : we therefore move the complexity to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information. Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: ...

- Disable preemption at the read-side: it makes sure the pointer I get will point to a data structure that will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible (only need to disable preemption, does not race against other CPUs and won't generate cache line bouncing). It also allows dealing with unsynchronized TSCs if needed.
- Periodical write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to last shorter than the interval between IPI updates so we can guarantee the data structure it uses won't be modified underneath it. Since the IPI update is launched every second or so (depends on the frequency of the counter we are trying to extend), it's more than OK. One thing I want to clear up. The major difference between this latency_tracer and LTTng is what we consider fast paths. The latency tracer is recording things like enabling and disabling interrupts, preempt count changes, or simply profiling all function calls. Those are what I consider fast paths. The slow path WRT the latency_tracer are things like context switches. This is why I don't have a problem with copying the comm at context switch time. Because that _is_ a slow path for the latency tracer. LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc.. in flight recorder mode, we have nothing to write to disk and therefore we can handle very frequent events. We then do the analysis off-line using the last MB written in the buffers. The advantage is that the kernel dumbly writes data to a buffer: we therefore move the complexity to user-space. But you would still need to do something in case you want this information dumped to console on a kernel crash. Of course you can rely on kexec, but if the kexec fails (which is possible) then you lose all the information.
Having the ability to dump the output to console on a crash is one of the benefits of latency_tracer that I want to keep. There has been some integration done between LTTng and the crash tool to extract the buffers from a crashed kernel. I am not an expert in crash buffer extraction though, but I guess all the available mechanisms depend on kexec and could show the limits you are referring to. If you really want to pretty-print the information to the console, I would propose that you leave that part of the problem to a different output module. The core of the latency tracer could keep the minimum information. Then, when a dump is triggered, it either sends the information to LTTng or to your console pretty-printer. However, I would not call the pretty-printer a tracer module per-se. We would have to accept that it is a bit more tied to the kernel internals than the latency tracer. My goal is to separate the core profiling module from the optional pretty-printing module as much as possible so the latency tracer core could be reused by other output modules. I agree that some kind of tracing, like the one you are doing, might be done more efficiently if you do a first clever analysis phase directly in the kernel without writing the raw high event rate data in memory buffers. However, I believe that it would be more powerful if we combine the two approaches rather than trying to do everything in or out of the kernel. LTTng could provide the comm names, priorities, etc, and your tracer could provide the top X list of processes that had a bad behavior. It would mean that the complete overall information would be made available after a post-processing phase done in an analysis tool like LTTV, but I don't see any problem with it. 
Of course you don't see any problem with it, because you know LTTV and LTTng very well ;-) latency_tracer has been very instrumental in solving -rt patch latencies: we tell the customer to run with latency trace on, and then have them simply set a few sysctl variables and run their app. By combining this with LTTng, I wouldn't know how to start with telling a customer how to analyze the problem. Simply put, latency_tracer has a much smaller learning curve than LTTng. Not to mention, a smaller footprint. The tracer here is very focused on what to do, and is not meant to be a general profiling tool as LTTng is. In other words, latency_tracer is LTTng-lite ;-) If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like: Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is: (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers wrote: If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like: Well, once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is: (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure the LTTng modules are loaded)

	# arm all the markers
	ltt-armall

	# start lttng tracing
	lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt

	- start latency tracer
	- stop latency tracer
	- trigger latency tracer dump

	While the tracing is active, trigger the condition...
	(rinse, repeat, can handle multiple latency tracer dumps)

	# stop lttng tracing
	lttctl -n test -R

	# disarm all markers
	ltt-disarmall

You can easily test the trace with:

	lttv -m textDump -t /tmp/trace1

Your users would issue something like:

	lttv -m latencytracerDump -t /tmp/trace1

No offense, but this is still quite a bit harder than:

	echo 1 > /proc/something
	... wait a bit ...
	cat /proc/something

(substitute /sys or /debugfs where appropriate) Having to compile something (besides the kernel) for the target is sometimes a major hassle. I avoided it completely with KFT. -- Tim

=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: In other words, latency_tracer is LTTng-lite ;-) If LTTng is already ported to your specific kernel, the learning curve is not big at all. Here is what the latency_tracer-over-LTTng guide could look like. Once you have LTTng in your kernel and have compiled and installed the ltt-control and lttv packages (configure, make, make install), all that would be needed is (there may be some bits in the QUICKSTART GUIDE on http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure the LTTng modules are loaded):

#arm all the markers
ltt-armall
#start lttng tracing
lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt

- start latency tracer
- stop latency tracer
- trigger latency tracer dump

While the tracing is active, trigger the condition... (rinse, repeat; this can handle multiple latency tracer dumps)

#stop lttng tracing
lttctl -n test -R
#disarm all markers
ltt-disarmall

You can easily test the trace with:
lttv -m textDump -t /tmp/trace1
Your users would issue something like:
lttv -m latencytracerDump -t /tmp/trace1

That's it. latencytracerDump would be a new specialized plugin, inspired by the generic textDump.c plugin and by the state.c module (for hooking on specific events rather than on _all_ events). It would generate a text output from the trace collected at each latency tracer dump.

Mathieu, That's quite a bit. Currently we have tools that already hook into latency_tracer, e.g. Thomas Gleixner's cyclictest. It turns on the latency tracer, runs a quick test, turns it off. If the latency is OK, the trace is discarded and it runs the test again. When the latency is not acceptable, it stops the test. Then all one needs to do is look at the given trace. Hooking something as simple as this into LTTng is not going to fly. Don't get me wrong, I'm a huge supporter of LTTng, and even recommend it. But there are things for which using LTTng is like using a sledgehammer to drive a tack.
I like to use the tools that are best for the job (which also means easiest). I don't buy the one-tracer-fits-all mentality, and by pushing that too much, I'm thinking we'll never get a tracer into the kernel. The goal of these patches is to get the infrastructure for tracing into the kernel. The latency_tracer is to tracing what Rusty Russell's lguest is to pvops; LTTng is the Xen equivalent. latency_tracer is nothing more than the quick and dirty tracer. For administrators that want to analyze their systems, LTTng is much more appropriate. latency_tracer is more of the first aid in finding latency troubles, and if that doesn't work, then we can do the LTTng surgery. Let's go back and focus on the infrastructure again, namely the mcount and notrace parts of the kernel, as well as the tracer timer, which, by the way, we are looking more into what you have done ;-) -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. thanks -john
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Thanks John for doing this! (comments embedded) On Wed, 16 Jan 2008, john stultz wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: Completely un-tested, but it builds, so I figured I'd send it out for review. heh, ok, I'll take it and run it. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,21 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num = cs->base_num;
+	cycle_t offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
 	return offset;
 }

@@ -197,14 +207,25 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = !cs->base_num;
+	cycle_t offset = (now - cs->base[!num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;

I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed.

+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

Is the above just to decouple the two methods?

 	ns_offset = cyc2ns(clock, cycle_delta);

 	return ns_offset;
@@ -105,35 +107,7 @@

 cycle_t notrace get_monotonic_cycles(void)
 {
-	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
-
-	do {
-		/*
-		 * cycle_raw and cycle_last can change on
-		 * another CPU and we need the delta calculation
-		 * of cycle_now and cycle_last happen atomic, as well
-		 * as the adding to cycle_raw. We don't need to grab
-		 * any
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: * john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;

Shouldn't cycle_last and cycle_accumulated be in the array too?

No, we're leaving cycle_last and cycle_accumulated alone. They're related to the update_wall_time conversion of cycles to xtime.

+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {

We had cycle_raw before, why do we need the following two?

+			cycle_t cycle_base_last, cycle_base;

I'm not quite sure why you need both cycle_base_last and cycle_base...

So on my first shot at this, I tried to layer the concepts: using the lock-free method to create an abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather than an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock-free method as well. So I just backed off and tried to make it simple: we have two sets of data that count cycles from the clocksource, one for timekeeping and one for get_monotonic_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but it's just hidden until you implement the vsyscall gtod methods).

I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock?

No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create a lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically).

What I am trying to say is: I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the real clock source?

The assumptions about the real clocksource are the same we keep in the timekeeping core.
It counts forward, at a constant rate, and only wraps after the mask value has been reached.

Is the intent to deal with an HPET suddenly reset to 0 or something like this?

Well, dealing with clocksources wrapping short of 64bits.

Basically, I wonder why you have to calculate the current cycle count from the previous update_wall_time event. Is it because you need to be consistent when a clocksource change occurs?

Actually, we try to do it from the last clocksource_accumulate() call (which is called from update_wall_time).

+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,21 @@
 }

 /**
- *
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Steven Rostedt wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in How about simply cpu_synth->index ^ 1? Seems the best choice if you ask me, if all you are doing is changing it from 1 to zero and back to 1. FYI:

[EMAIL PROTECTED]:~/c$ cat flipme.c
int flip1 (int x) { return !x; }
int flip2 (int x) { return x ? 0 : 1; }
int flip3 (int x) { return x ^ 1; }
[EMAIL PROTECTED]:~/c$ gcc -O2 -c flipme.c
[EMAIL PROTECTED]:~/c$ objdump -d flipme.o

flipme.o:     file format elf32-i386

Disassembly of section .text:

00000000 <flip1>:
   0:	55          	push   %ebp
   1:	31 c0       	xor    %eax,%eax
   3:	89 e5       	mov    %esp,%ebp
   5:	83 7d 08 00 	cmpl   $0x0,0x8(%ebp)
   9:	5d          	pop    %ebp
   a:	0f 94 c0    	sete   %al
   d:	c3          	ret
   e:	66 90       	xchg   %ax,%ax

00000010 <flip2>:
  10:	55          	push   %ebp
  11:	31 c0       	xor    %eax,%eax
  13:	89 e5       	mov    %esp,%ebp
  15:	83 7d 08 00 	cmpl   $0x0,0x8(%ebp)
  19:	5d          	pop    %ebp
  1a:	0f 94 c0    	sete   %al
  1d:	c3          	ret
  1e:	66 90       	xchg   %ax,%ax

00000020 <flip3>:
  20:	55          	push   %ebp
  21:	89 e5       	mov    %esp,%ebp
  23:	8b 45 08    	mov    0x8(%ebp),%eax
  26:	5d          	pop    %ebp
  27:	83 f0 01    	xor    $0x1,%eax
  2a:	c3          	ret

So, if you know for sure that x is only 1 or 0, then using x ^ 1 to invert it seems the most efficient. -- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); !0 is not necessarily 1. Incorrect. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as true, but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing !cs->base_num to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence. (One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Linus
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Linus Torvalds ([EMAIL PROTECTED]) wrote: On Wed, 16 Jan 2008, Mathieu Desnoyers wrote: + int num = !cs->base_num; + cycle_t offset = (now - cs->base[!num].cycle_base_last); !0 is not necessarily 1. Incorrect. Hrm, *digging in my mailbox*, ah, here it is: http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html Richard Purdie reviewed my code back in 2006 and made this modification. Maybe he will have something to add. !0 _is_ necessarily 1. It's how all C logical operators work. If you find a compiler that turns !x into anything but 0/1, you found a compiler for another language than C. It's true that any non-zero value counts as true, but that does not mean that a logical operator can return any non-zero value for true. As a return value of the logical operations in C, true is *always* 1. So !, ||, &&, when used as values, will *always* return either 0 or 1 (but when used as part of a conditional, the compiler will often optimize out unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if the value itself was never used, only branched upon). So doing !cs->base_num to turn 0->1 and 1->0 is perfectly fine. That's not to say it's necessarily the *best* way. If you *know* that you started with 0/1 in the first place, the best way to flip it tends to be to do (1-x) (or possibly (x^1)). And if you can't guarantee that, !x is probably better than x ? 0 : 1, but you might also decide to use ((x+1)&1) for example. And obviously, the compiler may sometimes surprise you, and if *it* also knows it's always 0/1 (for something like the source being a single-bit bitfield for example), it may end up doing something else than you coded that is equivalent. And the particular choice of operation the compiler chooses may well depend on the code _around_ that sequence.
(One reason to potentially prefer (1-x) over (x^1) is that it's often easier to combine a subtraction with other operations, while an xor seldom combines with anything around it) Ok, I'll adopt (1-x) then. Thanks! Mathieu Linus -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as a parameter. It means that the clocksource may have been read in the previous RCU window.

Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 18:12:53.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,29 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num;
+	cycle_t now, offset;
+
+	preempt_disable();
+	num = cs->base_num;
+	smp_read_barrier_depends();
+	now = clocksource_read(cs);
+	offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
+	preempt_enable();
+
 	return offset;
 }

@@ -197,14 +215,26 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = 1 - cs->base_num;
+	cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;
+	wmb();
+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 17:51:50.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

 	ns_offset = cyc2ns(clock, cycle_delta);

 	return ns_offset;
@@ -105,35 +107,7 @@

 cycle_t notrace get_monotonic_cycles(void)
 {
-	cycle_t cycle_now, cycle_delta,
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: * john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: On Jan 16, 2008 6:56 AM, Mathieu Desnoyers [EMAIL PROTECTED] wrote: If you really want a seqlock-free algorithm (I _do_ want this for tracing!) :) maybe going in the RCU direction could help (I refer to my RCU-based 32-to-64 bit lockless timestamp counter extension, which could be turned into the clocksource updater). Yea. After our earlier discussion and talking w/ Steven, I'm taking a swing at this now. The lock-free method still doesn't apply to the update_wall_time function, but does work fine for the monotonic cycle uses. I'll send a patch for review as soon as I get things building. So here's my first attempt at adding Mathieu's lock-free method to Steven's get_monotonic_cycles() interface. Completely un-tested, but it builds, so I figured I'd send it out for review. I'm not super sure the update or the read doesn't need something additional to force a memory access, but as I didn't see anything special in Mathieu's implementation, I'm going to guess this is ok. Mathieu, Let me know if this isn't what you're suggesting. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 14:41:31.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;

Shouldn't cycle_last and cycle_accumulated be in the array too?

No, we're leaving cycle_last and cycle_accumulated alone. They're related to the update_wall_time conversion of cycles to xtime.

+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {

We had cycle_raw before, why do we need the following two?

+			cycle_t cycle_base_last, cycle_base;

I'm not quite sure why you need both cycle_base_last and cycle_base...

So on my first shot at this, I tried to layer the concepts: using the lock-free method to create an abstracted 64bit counter, as provided by get_monotonic_cycles(). Then I tried to use that abstraction directly in the update_wall_time() code, reading the abstracted 64bit counter and using it to update time. However, then we start keeping cycle_last in 64bit cycles, rather than an actual counter read. This then caused changes to be needed in the arch vsyscall implementations, and that started to get ugly, as we had to also re-implement the abstracted 64bit counter w/ the lock-free method as well. So I just backed off and tried to make it simple: we have two sets of data that count cycles from the clocksource, one for timekeeping and one for get_monotonic_cycles(). It is a little redundant, but I don't think you can escape that (the layering method above also has redundancy, but it's just hidden until you implement the vsyscall gtod methods).

I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock?

No no no.. I'm not touching the non-synced TSC issue. I'm just trying to take clocksource counters, which may be of different bit-widths (ACPI PM is 24bits, for instance), and create a lock-free method to translate that into a virtual 64bit wide counter (using an accumulation bucket, basically).

What I am trying to say is: I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the real clock source?

The assumptions about the real clocksource are the same we keep in the timekeeping core.
It counts forward, at a constant rate, and only wraps after the mask value has been reached.

Is the intent to deal with an HPET suddenly reset to 0 or something like this?

Well, dealing with clocksources wrapping short of 64bits.

Ah ok, then the problem is clearer :) The main difference between the approach I use and yours is that, let's say your clocksource starts a while after the system started running, then it will start the accumulator at 0. You therefore have to keep track of the total time accumulated and the current lower-order bits of the last accumulation upon clocksource_accumulate(). I
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote: Thanks John for doing this! (comments embedded) On Wed, 16 Jan 2008, john stultz wrote:

+	int num = !cs->base_num;
+	cycle_t offset = (now - cs->base[!num].cycle_base_last);
+	offset &= cs->mask;
+	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;

I would think that we would need some sort of barrier here. Otherwise, base_num could be updated before all the cycle_base. I'd expect a smp_wmb is needed.

Hopefully addressed in the current version.

Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+	now = clocksource_read(clock);
+	cycle_delta = (now - clock->cycle_last) & clock->mask;
+	cycle_delta += clock->cycle_accumulated;

Is the above just to decouple the two methods?

Yep. clocksource_get_cycles() ended up not being as useful as a helper function (I was hoping the arch vsyscall implementations could use it, but they've done too much optimization - although that may reflect a need up the chain to the clocksource structure). thanks -john
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote: On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote: I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depends() here after the cs->base_num read. You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as a parameter. It means that the clocksource may have been read in the previous RCU window.

Here's an updated version of the patch w/ the suggested memory barrier changes and favored (1-x) inversion change. ;) Let me know if you see any other holes, or have any other suggestions or ideas. Still un-tested (my test box will free up soon, I promise!), but builds. Signed-off-by: John Stultz [EMAIL PROTECTED]

Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h	2008-01-16 12:22:04.0 -0800
+++ monotonic-cleanup/include/linux/clocksource.h	2008-01-16 18:12:53.0 -0800
@@ -87,9 +87,17 @@
 	 * more than one cache line.
 	 */
 	struct {
-		cycle_t cycle_last, cycle_accumulated, cycle_raw;
-	} ____cacheline_aligned_in_smp;
+		cycle_t cycle_last, cycle_accumulated;
+		/* base structure provides lock-free read
+		 * access to a virtualized 64bit counter
+		 * Uses RCU-like update.
+		 */
+		struct {
+			cycle_t cycle_base_last, cycle_base;
+		} base[2];
+		int base_num;
+	} ____cacheline_aligned_in_smp;
 	u64 xtime_nsec;
 	s64 error;
@@ -175,19 +183,29 @@
 }

 /**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
  * @cs: pointer to clocksource being read
  * @now: current cycle value
  *
  * Uses the clocksource to return the current cycle_t value.
  * NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
  */
 static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
-	offset += cs->cycle_accumulated;
+	int num;
+	cycle_t now, offset;
+
+	preempt_disable();
+	num = cs->base_num;
+	smp_read_barrier_depends();
+	now = clocksource_read(cs);
+	offset = (now - cs->base[num].cycle_base_last);
+	offset &= cs->mask;
+	offset += cs->base[num].cycle_base;
+	preempt_enable();
+
 	return offset;
 }

@@ -197,14 +215,26 @@
  * @now: current cycle value
  *
  * Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
  */
 static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
 {
-	cycle_t offset = (now - cs->cycle_last) & cs->mask;
+	/* First update the monotonic base portion.
+	 * The dual array update method allows for lock-free reading.
+	 */
+	int num = 1 - cs->base_num;

(nitpick) right here, you could probably express 1-num with cs->base_num, since we are the only ones supposed to touch it.

+	cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+	offset &= cs->mask;

here too.

+	cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+	cs->base[num].cycle_base_last = now;
+	wmb();

As I just emailed: smp_wmb() *should* be enough. I don't see which architecture could reorder writes wrt local interrupts?
(please tell me if I am grossly mistaken) Mathieu

+	cs->base_num = num;
+
+	/* Now update the cycle_accumulated portion */
+	offset = (now - cs->cycle_last) & cs->mask;
 	cs->cycle_last = now;
 	cs->cycle_accumulated += offset;
-	cs->cycle_raw += offset;
 }

 /**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 17:51:50.0 -0800
@@ -71,10 +71,12 @@
  */
 static inline s64 __get_nsec_offset(void)
 {
-	cycle_t cycle_delta;
+	cycle_t now, cycle_delta;
 	s64 ns_offset;

-	cycle_delta =
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* john stultz ([EMAIL PROTECTED]) wrote:
> On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > Thanks John for doing this! (comments embedded)
> >
> > On Wed, 16 Jan 2008, john stultz wrote:
> > > +	int num = !cs->base_num;
> > > +	cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > +	offset &= cs->mask;
> > > +	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > +	cs->base[num].cycle_base_last = now;
> >
> > I would think that we would need some sort of barrier here. Otherwise,
> > base_num could be updated before all the cycle_base. I'd expect a
> > smp_wmb is needed.
>
> Hopefully addressed in the current version.
>
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===================================================================
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
> @@ -71,10 +71,12 @@
>   */
>  static inline s64 __get_nsec_offset(void)
>  {
> -	cycle_t cycle_delta;
> +	cycle_t now, cycle_delta;
>  	s64 ns_offset;
>
> -	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> +	now = clocksource_read(clock);
> +	cycle_delta = (now - clock->cycle_last) & clock->mask;
> +	cycle_delta += clock->cycle_accumulated;
>
> > Is the above just to decouple the two methods?
>
> Yep. clocksource_get_cycles() ended up not being as useful as a helper
> function (I was hoping the arch vsyscall implementations could use it,
> but they've done too much optimization - although that may reflect a
> need up the chain to the clocksource structure).

The problem with vsyscall is that we will have a hard time disabling
preemption :( Therefore, ensuring that the read of the data is done in a
timely manner is hard to do.

Mathieu

> thanks
> -john

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > Yep. clocksource_get_cycles() ended up not being as useful as a helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.

You'll have more than a hard time disabling preemption for vsyscall.
We'll need to come up with a better solution then. vsyscall can not
modify any kernel memory, nor can it disable preemption.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * john stultz ([EMAIL PROTECTED]) wrote:
> > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > > Thanks John for doing this! (comments embedded)
> > >
> > > On Wed, 16 Jan 2008, john stultz wrote:
> > > > +	int num = !cs->base_num;
> > > > +	cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > > +	offset &= cs->mask;
> > > > +	cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > > +	cs->base[num].cycle_base_last = now;
> > >
> > > I would think that we would need some sort of barrier here. Otherwise,
> > > base_num could be updated before all the cycle_base. I'd expect a
> > > smp_wmb is needed.
> >
> > Hopefully addressed in the current version.
> >
> > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > ===================================================================
> > --- monotonic-cleanup.orig/kernel/time/timekeeping.c	2008-01-16 12:21:46.0 -0800
> > +++ monotonic-cleanup/kernel/time/timekeeping.c	2008-01-16 14:15:31.0 -0800
> > @@ -71,10 +71,12 @@
> >   */
> >  static inline s64 __get_nsec_offset(void)
> >  {
> > -	cycle_t cycle_delta;
> > +	cycle_t now, cycle_delta;
> >  	s64 ns_offset;
> >
> > -	cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > +	now = clocksource_read(clock);
> > +	cycle_delta = (now - clock->cycle_last) & clock->mask;
> > +	cycle_delta += clock->cycle_accumulated;
> >
> > > Is the above just to decouple the two methods?
> >
> > Yep. clocksource_get_cycles() ended up not being as useful as a helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, ensuring that the read of the data is done in a
> timely manner is hard to do.

Sorry for self-reply, but I thought, in the past, of a way to make this
possible.

It would imply the creation of a new vsyscall : vgetschedperiod

It would read a counter that would increment each time the thread is
scheduled out (or in). It would be a per thread counter (not a per cpu
counter) so we can deal appropriately with a stopped thread that would
happen to come back running a long time afterward (if we do per-cpu
counters, we could get the same 32 bits counter value falsely if it is
shared with other thread activity).

Then, the clocksource read code would look like :

int period;

do {
	period = vgetschedperiod();

	perform the clocksource read..

} while (period != vgetschedperiod());

Therefore, we would be sure that we have not been scheduled out while
reading the value. I think this new vsyscall could be useful for others.
Actually, it would make implementation of RCU in user-space possible (as
long as the read-side can retry the read operation).

Mathieu

> thanks
> -john

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter (not a per cpu
> counter) so we can deal appropriately with a stopped thread that would
> happen to come back running a long time afterward (if we do per-cpu
> counters, we could get the same 32 bits counter value falsely if it is
> shared with other thread activity).
>
> Then, the clocksource read code would look like :
>
> int period;
>
> do {
>	period = vgetschedperiod();
>
>	perform the clocksource read..
>
> } while (period != vgetschedperiod());
>
> Therefore, we would be sure that we have not been scheduled out while
> reading the value. I think this new vsyscall could be useful for others.
> Actually, it would make implementation of RCU in user-space possible (as
> long as the read-side can retry the read operation).

This is something that I would agree is useful.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
Mathieu Desnoyers writes:
> Sorry for self-reply, but I thought, in the past, of a way to make this
> possible.
>
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter

It's very hard to do a per-thread counter in the VDSO, since threads in
the same process see the same memory, by definition. You'd have to have
an array of counters and have some way for each thread to know which
entry to read. Also you'd have to find space for tens or hundreds of
thousands of counters, since there can be that many threads in a process
sometimes.

Paul.
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Thu, 17 Jan 2008, Paul Mackerras wrote:
> It's very hard to do a per-thread counter in the VDSO, since threads in
> the same process see the same memory, by definition. You'd have to have
> an array of counters and have some way for each thread to know which
> entry to read. Also you'd have to find space for tens or hundreds of
> thousands of counters, since there can be that many threads in a process
> sometimes.

I was thinking about this. What would also work is just the ability to
read the schedule counter for the current cpu. Now this would require
that the task had a way to know which CPU it was currently on.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Paul Mackerras ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers writes:
> > Sorry for self-reply, but I thought, in the past, of a way to make
> > this possible.
> >
> > It would imply the creation of a new vsyscall : vgetschedperiod
> >
> > It would read a counter that would increment each time the thread is
> > scheduled out (or in). It would be a per thread counter
>
> It's very hard to do a per-thread counter in the VDSO, since threads in
> the same process see the same memory, by definition. You'd have to have
> an array of counters and have some way for each thread to know which
> entry to read. Also you'd have to find space for tens or hundreds of
> thousands of counters, since there can be that many threads in a process
> sometimes.
>
> Paul.

Crazy ideas :

Could we do something along the lines of the thread local storage ?

Or could we map a per-thread page that would contradict this definition ?

Or can we move down the beginning of the user-space thread stack by 4
bytes (it's already put at a random address anyway) and use these 32 bits
to put our variable ? We don't care if userspace also modifies it; the
kernel would blindly increment it, so there would be no security concerns
involved.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Ok, but what actually ensures that the clock->cycle_* reads won't be
> > reordered across the clocksource_read() ?
>
> Hmm, interesting. I didn't notice that clocksource_read() is a static
> inline. I was thinking that since it was passing a pointer to a function,
> gcc could not assume that it could move that code across it. But now
> looking to see that clocksource_read is simply a static inline that does:
>
>	cs->read();
>
> But still, can gcc assume that it can push loads of unknown origin
> variables across function calls? So something like:
>
> static int *glob;
>
> void foo(void) {
>	int x;
>
>	x = *glob;
>
>	bar();
>
>	if (x != *glob)
>		/* ... */
> }
>
> I can't see how any compiler could honestly move the loading of the first
> x after the calling of bar(). With glob pointing to some unknown
> variable, that may be perfectly fine for bar to modify.
>
> > > > > +		cycle_raw = clock->cycle_raw;
> > > > > +		cycle_last = clock->cycle_last;
> > > > > +
> > > > > +		/* read clocksource: */
> > > > > +		cycle_now = clocksource_read(clock);
>
> So the question here is, can cycle_raw and cycle_last be loaded from
> the unknown source that clock points to after the call to
> clocksource_read()?
>
> I'm thinking not.

I agree with you that I don't see how the compiler could reorder this. So
we forget about compiler barriers.

Also, the clock source used is a synchronized clock source
(get_cycles_sync on x86_64), so it should make sure the TSC is read at
the right moment.

However, what happens if the clock source is, say, the jiffies ? In this
case, we have :

static cycle_t jiffies_read(void)
{
	return (cycle_t) jiffies;
}

Which is nothing more than a memory read of

extern unsigned long volatile __jiffy_data jiffies;

I think it is wrong to assume that reads from clock->cycle_raw and from
jiffies will be ordered correctly in SMP. I am tempted to think that
ordering memory writes to clock->cycle_raw vs jiffies is also needed in
this case (where clock->cycle_raw is updated, or where jiffies is
updated).

We can fall in the same kind of issue if we read the HPET, which is
memory I/O based. It does not seem correct to assume that MMIO vs normal
memory reads are ordered. (pointing back to this article :
http://lwn.net/Articles/198988/)

Mathieu

> > > > > +
> > > > > +		/* calculate the delta since the last update_wall_time: */
> > > > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > > +
> > > > > +	} while (cycle_raw != clock->cycle_raw ||
> > > > > +		cycle_last != clock->cycle_last);
> > > > > +
> > > > > +	return cycle_raw + cycle_delta;
> > > > > +}
>
> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Ok, but what actually ensures that the clock->cycle_* reads won't be
> > reordered across the clocksource_read() ?

Hmm, interesting. I didn't notice that clocksource_read() is a static
inline. I was thinking that since it was passing a pointer to a function,
gcc could not assume that it could move that code across it. But now
looking to see that clocksource_read is simply a static inline that does:

	cs->read();

But still, can gcc assume that it can push loads of unknown origin
variables across function calls? So something like:

static int *glob;

void foo(void) {
	int x;

	x = *glob;

	bar();

	if (x != *glob)
		/* ... */
}

I can't see how any compiler could honestly move the loading of the first
x after the calling of bar(). With glob pointing to some unknown
variable, that may be perfectly fine for bar to modify.

> > > > +		cycle_raw = clock->cycle_raw;
> > > > +		cycle_last = clock->cycle_last;
> > > > +
> > > > +		/* read clocksource: */
> > > > +		cycle_now = clocksource_read(clock);

So the question here is, can cycle_raw and cycle_last be loaded from
the unknown source that clock points to after the call to
clocksource_read()?

I'm thinking not.

> > > > +
> > > > +		/* calculate the delta since the last update_wall_time: */
> > > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > +
> > > > +	} while (cycle_raw != clock->cycle_raw ||
> > > > +		cycle_last != clock->cycle_last);
> > > > +
> > > > +	return cycle_raw + cycle_delta;
> > > > +}

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> > >
> > > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> > > ---
> > >  include/linux/clocksource.h |    3 ++
> > >  kernel/time/timekeeping.c   |   48
> > >  2 files changed, 51 insertions(+)
> > >
> > > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> > > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> > > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > >  	timespec_add_ns(ts, nsecs);
> > >  }
> > >
> > > +cycle_t notrace get_monotonic_cycles(void)
> > > +{
> > > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > > +
> > > +	do {
> > > +		/*
> > > +		 * cycle_raw and cycle_last can change on
> > > +		 * another CPU and we need the delta calculation
> > > +		 * of cycle_now and cycle_last happen atomic, as well
> > > +		 * as the adding to cycle_raw. We don't need to grab
> > > +		 * any locks, we just keep trying until get all the
> > > +		 * calculations together in one state.
> > > +		 *
> > > +		 * In fact, we __cant__ grab any locks. This
> > > +		 * function is called from the latency_tracer which can
> > > +		 * be called anywhere. To grab any locks (including
> > > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > > +		 */
> >
> > I wonder what makes the compiler read the clock->cycle_raw and
> > clock->cycle_last variables twice ? I guess some memory barriers could
> > be welcome here ?
>
> We need both cycle_raw and cycle_last to be the same from the time we read
> the clock source to the time we calculate cycle_delta. If either one
> changes then delta is bogus.

Ok, but what actually ensures that the clock->cycle_* reads won't be
reordered across the clocksource_read() ?

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);
> > > +
> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}
>
> -- Steve

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Steven Rostedt wrote:
> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

Ah, I changed this in the -rt patch queue, and never moved the patch back
here.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> > ---
> >  include/linux/clocksource.h |    3 ++
> >  kernel/time/timekeeping.c   |   48
> >  2 files changed, 51 insertions(+)
> >
> > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > ===================================================================
> > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> >  	timespec_add_ns(ts, nsecs);
> >  }
> >
> > +cycle_t notrace get_monotonic_cycles(void)
> > +{
> > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > +
> > +	do {
> > +		/*
> > +		 * cycle_raw and cycle_last can change on
> > +		 * another CPU and we need the delta calculation
> > +		 * of cycle_now and cycle_last happen atomic, as well
> > +		 * as the adding to cycle_raw. We don't need to grab
> > +		 * any locks, we just keep trying until get all the
> > +		 * calculations together in one state.
> > +		 *
> > +		 * In fact, we __cant__ grab any locks. This
> > +		 * function is called from the latency_tracer which can
> > +		 * be called anywhere. To grab any locks (including
> > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > +		 */
>
> I wonder what makes the compiler read the clock->cycle_raw and
> clock->cycle_last variables twice ? I guess some memory barriers could
> be welcome here ?

We need both cycle_raw and cycle_last to be the same from the time we read
the clock source to the time we calculate cycle_delta. If either one
changes then delta is bogus.

Also, it just occurred to me that this is an old patch. I thought I
renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

> > +		cycle_raw = clock->cycle_raw;
> > +		cycle_last = clock->cycle_last;
> > +
> > +		/* read clocksource: */
> > +		cycle_now = clocksource_read(clock);
> > +
> > +		/* calculate the delta since the last update_wall_time: */
> > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > +
> > +	} while (cycle_raw != clock->cycle_raw ||
> > +		cycle_last != clock->cycle_last);
> > +
> > +	return cycle_raw + cycle_delta;
> > +}

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> The latency tracer needs a way to get an accurate time
> without grabbing any locks. Locks themselves might call
> the latency tracer and cause at best a slow down.
>
> This patch adds get_monotonic_cycles that returns cycles
> from a reliable clock source in a monotonic fashion.
>
> Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> ---
>  include/linux/clocksource.h |    3 ++
>  kernel/time/timekeeping.c   |   48
>  2 files changed, 51 insertions(+)
>
> Index: linux-compile-i386.git/kernel/time/timekeeping.c
> ===================================================================
> --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.0 -0500
> +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.0 -0500
> @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
>  	timespec_add_ns(ts, nsecs);
>  }
>
> +cycle_t notrace get_monotonic_cycles(void)
> +{
> +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> +
> +	do {
> +		/*
> +		 * cycle_raw and cycle_last can change on
> +		 * another CPU and we need the delta calculation
> +		 * of cycle_now and cycle_last happen atomic, as well
> +		 * as the adding to cycle_raw. We don't need to grab
> +		 * any locks, we just keep trying until get all the
> +		 * calculations together in one state.
> +		 *
> +		 * In fact, we __cant__ grab any locks. This
> +		 * function is called from the latency_tracer which can
> +		 * be called anywhere. To grab any locks (including
> +		 * seq_locks) we risk putting ourselves into a deadlock.
> +		 */

I wonder what makes the compiler read the clock->cycle_raw and
clock->cycle_last variables twice ? I guess some memory barriers could be
welcome here ?

> +		cycle_raw = clock->cycle_raw;
> +		cycle_last = clock->cycle_last;
> +
> +		/* read clocksource: */
> +		cycle_now = clocksource_read(clock);
> +
> +		/* calculate the delta since the last update_wall_time: */
> +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> +
> +	} while (cycle_raw != clock->cycle_raw ||
> +		cycle_last != clock->cycle_last);
> +
> +	return cycle_raw + cycle_delta;
> +}
> +
> +unsigned long notrace cycles_to_usecs(cycle_t cycles)
> +{
> +	u64 ret = cyc2ns(clock, cycles);
> +
> +	ret += NSEC_PER_USEC/2; /* For rounding in do_div() */
> +	do_div(ret, NSEC_PER_USEC);
> +
> +	return ret;
> +}
> +
> +cycle_t notrace usecs_to_cycles(unsigned long usecs)
> +{
> +	return ns2cyc(clock, (u64)usecs * 1000);
> +}
> +
> /**
>  * getnstimeofday - Returns the time of day in a timespec
>  * @ts:	pointer to the timespec to be set
> Index: linux-compile-i386.git/include/linux/clocksource.h
> ===================================================================
> --- linux-compile-i386.git.orig/include/linux/clocksource.h	2008-01-09 14:27:51.0 -0500
> +++ linux-compile-i386.git/include/linux/clocksource.h	2008-01-09 14:29:44.0 -0500
> @@ -273,6 +273,9 @@ extern int clocksource_register(struct c
>  extern struct clocksource* clocksource_get_next(void);
>  extern void clocksource_change_rating(struct clocksource *cs, int rating);
>  extern void clocksource_resume(void);
> +extern cycle_t get_monotonic_cycles(void);
> +extern unsigned long cycles_to_usecs(cycle_t cycles);
> +extern cycle_t usecs_to_cycles(unsigned long usecs);
>
>  /* used to initialize clock */
>  extern struct clocksource clocksource_jiffies;

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote: The latency tracer needs a way to get an accurate time without grabbing any locks. Locks themselves might call the latency tracer and cause at best a slow down. This patch adds get_monotonic_cycles that returns cycles from a reliable clock source in a monotonic fashion. Signed-off-by: Steven Rostedt [EMAIL PROTECTED] --- include/linux/clocksource.h |3 ++ kernel/time/timekeeping.c | 48 2 files changed, 51 insertions(+) Index: linux-compile-i386.git/kernel/time/timekeeping.c === --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.0 -0500 +++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.0 -0500 @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_ timespec_add_ns(ts, nsecs); } +cycle_t notrace get_monotonic_cycles(void) +{ + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; + + do { + /* + * cycle_raw and cycle_last can change on + * another CPU and we need the delta calculation + * of cycle_now and cycle_last happen atomic, as well + * as the adding to cycle_raw. We don't need to grab + * any locks, we just keep trying until get all the + * calculations together in one state. + * + * In fact, we __cant__ grab any locks. This + * function is called from the latency_tracer which can + * be called anywhere. To grab any locks (including + * seq_locks) we risk putting ourselves into a deadlock. + */ I wonder what makes the compiler read the clock-cycle_raw and clock-cycle_last variables twice ? I guess some memory barriers could be welcome here ? 
+ cycle_raw = clock-cycle_raw; + cycle_last = clock-cycle_last; + + /* read clocksource: */ + cycle_now = clocksource_read(clock); + + /* calculate the delta since the last update_wall_time: */ + cycle_delta = (cycle_now - cycle_last) clock-mask; + + } while (cycle_raw != clock-cycle_raw || + cycle_last != clock-cycle_last); + + return cycle_raw + cycle_delta; +} + +unsigned long notrace cycles_to_usecs(cycle_t cycles) +{ + u64 ret = cyc2ns(clock, cycles); + + ret += NSEC_PER_USEC/2; /* For rounding in do_div() */ + do_div(ret, NSEC_PER_USEC); + + return ret; +} + +cycle_t notrace usecs_to_cycles(unsigned long usecs) +{ + return ns2cyc(clock, (u64)usecs * 1000); +} + /** * getnstimeofday - Returns the time of day in a timespec * @ts: pointer to the timespec to be set Index: linux-compile-i386.git/include/linux/clocksource.h === --- linux-compile-i386.git.orig/include/linux/clocksource.h 2008-01-09 14:27:51.0 -0500 +++ linux-compile-i386.git/include/linux/clocksource.h2008-01-09 14:29:44.0 -0500 @@ -273,6 +273,9 @@ extern int clocksource_register(struct c extern struct clocksource* clocksource_get_next(void); extern void clocksource_change_rating(struct clocksource *cs, int rating); extern void clocksource_resume(void); +extern cycle_t get_monotonic_cycles(void); +extern unsigned long cycles_to_usecs(cycle_t cycles); +extern cycle_t usecs_to_cycles(unsigned long usecs); /* used to initialize clock */ extern struct clocksource clocksource_jiffies; -- -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote: Signed-off-by: Steven Rostedt [EMAIL PROTECTED] --- include/linux/clocksource.h |3 ++ kernel/time/timekeeping.c | 48 2 files changed, 51 insertions(+) Index: linux-compile-i386.git/kernel/time/timekeeping.c === --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.0 -0500 +++ linux-compile-i386.git/kernel/time/timekeeping.c2008-01-09 14:34:40.0 -0500 @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_ timespec_add_ns(ts, nsecs); } +cycle_t notrace get_monotonic_cycles(void) +{ + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; + + do { + /* +* cycle_raw and cycle_last can change on +* another CPU and we need the delta calculation +* of cycle_now and cycle_last happen atomic, as well +* as the adding to cycle_raw. We don't need to grab +* any locks, we just keep trying until get all the +* calculations together in one state. +* +* In fact, we __cant__ grab any locks. This +* function is called from the latency_tracer which can +* be called anywhere. To grab any locks (including +* seq_locks) we risk putting ourselves into a deadlock. +*/ I wonder what makes the compiler read the clock-cycle_raw and clock-cycle_last variables twice ? I guess some memory barriers could be welcome here ? We need both cycle_raw and cycle_last to be the same from the time we read the clock source to the time we calculate cycle_delta. If either one changes then delta is bogus. Also, it just occurred to me that this is an old patch. I thought I renamed cycle_raw to cycle_monotonic. 
But I must have lost that patch :-/ + cycle_raw = clock-cycle_raw; + cycle_last = clock-cycle_last; + + /* read clocksource: */ + cycle_now = clocksource_read(clock); + + /* calculate the delta since the last update_wall_time: */ + cycle_delta = (cycle_now - cycle_last) clock-mask; + + } while (cycle_raw != clock-cycle_raw || +cycle_last != clock-cycle_last); + + return cycle_raw + cycle_delta; +} -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Steven Rostedt wrote:

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/

Ah, I changed this in the -rt patch queue, and never moved the patch back
here.

-- Steve
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
* Steven Rostedt ([EMAIL PROTECTED]) wrote:
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> > > Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
> > > ---
> > >  include/linux/clocksource.h |    3 ++
> > >  kernel/time/timekeeping.c   |   48 ++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 51 insertions(+)
> > >
> > > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c	2008-01-09 14:27:26.000000000 -0500
> > > +++ linux-compile-i386.git/kernel/time/timekeeping.c	2008-01-09 14:34:40.000000000 -0500
> > > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > >  	timespec_add_ns(ts, nsecs);
> > >  }
> > >
> > > +cycle_t notrace get_monotonic_cycles(void)
> > > +{
> > > +	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > > +
> > > +	do {
> > > +		/*
> > > +		 * cycle_raw and cycle_last can change on
> > > +		 * another CPU and we need the delta calculation
> > > +		 * of cycle_now and cycle_last happen atomic, as well
> > > +		 * as the adding to cycle_raw. We don't need to grab
> > > +		 * any locks, we just keep trying until get all the
> > > +		 * calculations together in one state.
> > > +		 *
> > > +		 * In fact, we __cant__ grab any locks. This
> > > +		 * function is called from the latency_tracer which can
> > > +		 * be called anywhere. To grab any locks (including
> > > +		 * seq_locks) we risk putting ourselves into a deadlock.
> > > +		 */
> >
> > I wonder what makes the compiler read the clock->cycle_raw and
> > clock->cycle_last variables twice ? I guess some memory barriers could be
> > welcome here ?
>
> We need both cycle_raw and cycle_last to be the same from the time we read
> the clock source to the time we calculate cycle_delta. If either one
> changes then delta is bogus.

Ok, but what actually insures that the clock->cycle_* reads won't be
reordered across the clocksource_read() ?

> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);
> > > +
> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}
>
> -- Steve

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:

> Ok, but what actually insures that the clock->cycle_* reads won't be
> reordered across the clocksource_read() ?

*looks at code*

Hmm, interesting. I didn't notice that clocksource_read() is a static
inline. I was thinking that since it was passing a pointer to a function,
gcc could not assume that it could move that code across it. But now
looking to see that clocksource_read is simply a static inline that does:

	cs->read();

But still, can gcc assume that it can push loads of unknown origin
variables across function calls? So something like:

	static int *glob;

	void foo(void)
	{
		int x;

		x = *glob;

		bar();

		if (x != *glob)
			/* ... */
	}

I can't see how any compiler could honestly move the loading of the first
x after the calling of bar(). With glob pointing to some unknown variable,
that may be perfectly fine for bar to modify.

> > > +		cycle_raw = clock->cycle_raw;
> > > +		cycle_last = clock->cycle_last;
> > > +
> > > +		/* read clocksource: */
> > > +		cycle_now = clocksource_read(clock);

So the question here is, can cycle_raw and cycle_last be loaded from the
unknown source that clock points to after the call to clocksource_read()?
I'm thinking not.

> > > +		/* calculate the delta since the last update_wall_time: */
> > > +		cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > +	} while (cycle_raw != clock->cycle_raw ||
> > > +		cycle_last != clock->cycle_last);
> > > +
> > > +	return cycle_raw + cycle_delta;
> > > +}

-- Steve