Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 06:15:43PM -0700, Daniel Colascione wrote:
[snip]
> > > > > > > The bigger improvement with the threshold is the number of trace 
> > > > > > > records are
> > > > > > > almost halved by using a threshold. The number of records went 
> > > > > > > from 4.6K to
> > > > > > > 2.6K.
> > > > > >
> > > > > > Steven, would it be feasible to add a generic tracepoint throttling?
> > > > >
> > > > > I might misunderstand this but is the issue here actually throttling
> > > > > of the sheer number of trace records or tracing large enough changes
> > > > > to RSS that user might care about? Small changes happen all the time
> > > > > but we are likely not interested in those. Surely we could postprocess
> > > > > the traces to extract changes large enough to be interesting but why
> > > > > capture uninteresting information in the first place? IOW the
> > > > > throttling here should be based not on the time between traces but on
> > > > > the amount of change of the traced signal. Maybe a generic facility
> > > > > like that would be a good idea?
> > > >
> > > > You mean like add a trigger (or filter) that only traces if a field has
> > > > changed since the last time the trace was hit? Hmm, I think we could
> > > > possibly do that. Perhaps even now with histogram triggers?
> > >
> > > I was thinking along the same lines. The histogram subsystem seems
> > > like a very good fit here. Histogram triggers already let users talk
> > > about specific fields of trace events, aggregate them in configurable
> > > ways, and (importantly, IMHO) create synthetic new trace events that
> > > the kernel emits under configurable conditions.
> >
> > Hmm, I think this tracing feature would be a good idea. But so as not to
> > gate this patch on it, can we agree on keeping a temporary threshold for
> > now? Once such a facility is implemented in the tracing subsystem, we can
> > remove the temporary filter.
> >
> > As Tim said, we don't want our traces flooded, and this is a very useful
> > tracepoint, as proven by our internal usage at Android. The threshold filter
> > is just a few lines of code.
> 
> I'm not sure the threshold filtering code you've added does the right
> thing: we don't keep state, so if a counter constantly flips between
> one "side" of the TRACE_MM_COUNTER_THRESHOLD and the other, we'll emit
> ftrace events at high frequency. More generally, this filtering
> couples the rate of counter logging to the *value* of the counter ---
> that is, we log ftrace events at different times depending on how much
> memory we happen to have used --- and that's not ideal from a
> predictability POV.
> 
> All things being equal, I'd prefer that we get things upstream as fast
> as possible. But in this case, I'd rather wait for a general-purpose
> filtering facility (whether that facility is based on histogram, eBPF,
> or something else) rather than hardcode one particular fixed filtering
> strategy (which might be suboptimal) for one particular kind of event.
> Is there some special urgency here?
> 
> How about we instead add non-filtered tracepoints for the mm counters?
> These tracepoints will still be free when turned off.
> 
> Having added the basic tracepoints, we can discuss separately how to
> do the rate limiting. Maybe instead of providing direct support for
> the algorithm that I described above, we can just use a BPF program as
> a yes/no predicate for whether to log to ftrace. That'd get us to the
> same place as this patch, but more flexibly, right?

Daniel and I chatted offline and agreed on removing the threshold, which is
also what Michal wants.

So I'll resubmit this patch with the threshold removed, and we'll look into
doing the filtering through generic mechanisms such as BPF.

thanks all!

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 5:59 PM Joel Fernandes  wrote:
> On Thu, Sep 05, 2019 at 10:50:27AM -0700, Daniel Colascione wrote:
> > On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
> > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > Suren Baghdasaryan  wrote:
> > >
> > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > > > >
> > > > > [Add Steven]
> > > > >
> > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > [...]
> > > > > > > > but also for reducing
> > > > > > > > tracing noise. Flooding the traces makes it less useful for 
> > > > > > > > long traces and
> > > > > > > > post-processing of traces. IOW, the overhead reduction is a 
> > > > > > > > bonus.
> > > > > > >
> > > > > > > This is not really anything special for this tracepoint though.
> > > > > > > Basically any tracepoint in a hot path is in the same situation 
> > > > > > > and I do
> > > > > > > not see a point why each of them should really invent its own way 
> > > > > > > to
> > > > > > > throttle. Maybe there is some way to do that in the tracing 
> > > > > > > subsystem
> > > > > > > directly.
> > > > > >
> > > > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > > > fact that
> > > > > > you still have to call into trace events. Why call into it at all, 
> > > > > > if you can
> > > > > > filter in advance and have a sane filtering default?
> > > > > >
> > > > > > The bigger improvement with the threshold is the number of trace 
> > > > > > records are
> > > > > > almost halved by using a threshold. The number of records went from 
> > > > > > 4.6K to
> > > > > > 2.6K.
> > > > >
> > > > > Steven, would it be feasible to add a generic tracepoint throttling?
> > > >
> > > > I might misunderstand this but is the issue here actually throttling
> > > > of the sheer number of trace records or tracing large enough changes
> > > > to RSS that user might care about? Small changes happen all the time
> > > > but we are likely not interested in those. Surely we could postprocess
> > > > the traces to extract changes large enough to be interesting but why
> > > > capture uninteresting information in the first place? IOW the
> > > > throttling here should be based not on the time between traces but on
> > > > the amount of change of the traced signal. Maybe a generic facility
> > > > like that would be a good idea?
> > >
> > > You mean like add a trigger (or filter) that only traces if a field has
> > > changed since the last time the trace was hit? Hmm, I think we could
> > > possibly do that. Perhaps even now with histogram triggers?
> >
> > I was thinking along the same lines. The histogram subsystem seems
> > like a very good fit here. Histogram triggers already let users talk
> > about specific fields of trace events, aggregate them in configurable
> > ways, and (importantly, IMHO) create synthetic new trace events that
> > the kernel emits under configurable conditions.
>
> Hmm, I think this tracing feature would be a good idea. But so as not to
> gate this patch on it, can we agree on keeping a temporary threshold for
> now? Once such a facility is implemented in the tracing subsystem, we can
> remove the temporary filter.
>
> As Tim said, we don't want our traces flooded, and this is a very useful
> tracepoint, as proven by our internal usage at Android. The threshold filter
> is just a few lines of code.

I'm not sure the threshold filtering code you've added does the right
thing: we don't keep state, so if a counter constantly flips between
one "side" of the TRACE_MM_COUNTER_THRESHOLD and the other, we'll emit
ftrace events at high frequency. More generally, this filtering
couples the rate of counter logging to the *value* of the counter ---
that is, we log ftrace events at different times depending on how much
memory we happen to have used --- and that's not ideal from a
predictability POV.

All things being equal, I'd prefer that we get things upstream as fast
as possible. But in this case, I'd rather wait for a general-purpose
filtering facility (whether that facility is based on histogram, eBPF,
or something else) rather than hardcode one particular fixed filtering
strategy (which might be suboptimal) for one particular kind of event.
Is there some special urgency here?

How about we instead add non-filtered tracepoints for the mm counters?
These tracepoints will still be free when turned off.

Having added the basic tracepoints, we can discuss separately how to
do the rate limiting. Maybe instead of providing direct support for
the algorithm that I described above, we can just use a BPF program as
a yes/no predicate for whether to log to ftrace. That'd get us to the
same place as this patch, but more flexibly, right?


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 10:50:27AM -0700, Daniel Colascione wrote:
> On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
> > On Thu, 5 Sep 2019 09:03:01 -0700
> > Suren Baghdasaryan  wrote:
> >
> > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > > >
> > > > [Add Steven]
> > > >
> > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  
> > > > > wrote:
> > > > > >
> > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > [...]
> > > > > > > but also for reducing
> > > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > > traces and
> > > > > > > post-processing of traces. IOW, the overhead reduction is a bonus.
> > > > > >
> > > > > > This is not really anything special for this tracepoint though.
> > > > > > Basically any tracepoint in a hot path is in the same situation and 
> > > > > > I do
> > > > > > not see a point why each of them should really invent its own way to
> > > > > > throttle. Maybe there is some way to do that in the tracing 
> > > > > > subsystem
> > > > > > directly.
> > > > >
> > > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > > fact that
> > > > > you still have to call into trace events. Why call into it at all, if 
> > > > > you can
> > > > > filter in advance and have a sane filtering default?
> > > > >
> > > > > The bigger improvement with the threshold is the number of trace 
> > > > > records are
> > > > > almost halved by using a threshold. The number of records went from 
> > > > > 4.6K to
> > > > > 2.6K.
> > > >
> > > > Steven, would it be feasible to add a generic tracepoint throttling?
> > >
> > > I might misunderstand this but is the issue here actually throttling
> > > of the sheer number of trace records or tracing large enough changes
> > > to RSS that user might care about? Small changes happen all the time
> > > but we are likely not interested in those. Surely we could postprocess
> > > the traces to extract changes large enough to be interesting but why
> > > capture uninteresting information in the first place? IOW the
> > > throttling here should be based not on the time between traces but on
> > > the amount of change of the traced signal. Maybe a generic facility
> > > like that would be a good idea?
> >
> > You mean like add a trigger (or filter) that only traces if a field has
> > changed since the last time the trace was hit? Hmm, I think we could
> > possibly do that. Perhaps even now with histogram triggers?
> 
> I was thinking along the same lines. The histogram subsystem seems
> like a very good fit here. Histogram triggers already let users talk
> about specific fields of trace events, aggregate them in configurable
> ways, and (importantly, IMHO) create synthetic new trace events that
> the kernel emits under configurable conditions.

Hmm, I think this tracing feature would be a good idea. But so as not to
gate this patch on it, can we agree on keeping a temporary threshold for
now? Once such a facility is implemented in the tracing subsystem, we can
remove the temporary filter.

As Tim said, we don't want our traces flooded, and this is a very useful
tracepoint, as proven by our internal usage at Android. The threshold filter
is just a few lines of code.

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 3:12 PM Daniel Colascione  wrote:
> Basically, what I have in mind is this:

Actually --- I wonder whether there's already enough power in the
trigger mechanism to do this without any code changes to ftrace
histograms themselves. I'm trying to think of the minimum additional
kernel facility that we'd need to implement the scheme I described
above, and it might be that we don't need to do anything at all except
add the actual level tracepoints.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 2:14 PM Tom Zanussi  wrote:
>
> On Thu, 2019-09-05 at 13:24 -0700, Daniel Colascione wrote:
> > On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi 
> > wrote:
> > > On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > > > >
> > > > > >
> > > > > > [ Added Tom ]
> > > > > >
> > > > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > > > Suren Baghdasaryan  wrote:
> > > > > >
> > > > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko wrote:
> > > > > > > >
> > > > > > > > [Add Steven]
> > > > > > > >
> > > > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > > but also for reducing
> > > > > > > > > > > tracing noise. Flooding the traces makes it less
> > > > > > > > > > > useful
> > > > > > > > > > > for long traces and
> > > > > > > > > > > post-processing of traces. IOW, the overhead
> > > > > > > > > > > reduction
> > > > > > > > > > > is a bonus.
> > > > > > > > > >
> > > > > > > > > > This is not really anything special for this
> > > > > > > > > > tracepoint
> > > > > > > > > > though.
> > > > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > > > situation and I do
> > > > > > > > > > not see a point why each of them should really invent
> > > > > > > > > > its
> > > > > > > > > > own way to
> > > > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > > > tracing subsystem
> > > > > > > > > > directly.
> > > > > > > > >
> > > > > > > > > I am not sure if there is a way to do this easily. Add
> > > > > > > > > to
> > > > > > > > > that, the fact that
> > > > > > > > > you still have to call into trace events. Why call into
> > > > > > > > > it
> > > > > > > > > at all, if you can
> > > > > > > > > filter in advance and have a sane filtering default?
> > > > > > > > >
> > > > > > > > > The bigger improvement with the threshold is the number
> > > > > > > > > of
> > > > > > > > > trace records are
> > > > > > > > > almost halved by using a threshold. The number of
> > > > > > > > > records
> > > > > > > > > went from 4.6K to
> > > > > > > > > 2.6K.
> > > > > > > >
> > > > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > > > throttling?
> > > > > > >
> > > > > > > I might misunderstand this but is the issue here actually
> > > > > > > throttling
> > > > > > > of the sheer number of trace records or tracing large
> > > > > > > enough
> > > > > > > changes
> > > > > > > to RSS that user might care about? Small changes happen all
> > > > > > > the
> > > > > > > time
> > > > > > > but we are likely not interested in those. Surely we could
> > > > > > > postprocess
> > > > > > > the traces to extract changes large enough to be
> > > > > > > interesting
> > > > > > > but why
> > > > > > > capture uninteresting information in the first place? IOW
> > > > > > > the
> > > > > > > throttling here should be based not on the time between
> > > > > > > traces
> > > > > > > but on
> > > > > > > the amount of change of the traced signal. Maybe a generic
> > > > > > > facility
> > > > > > > like that would be a good idea?
> > > > > >
> > > > > > You mean like add a trigger (or filter) that only traces if a
> > > > > > field has
> > > > > > changed since the last time the trace was hit? Hmm, I think
> > > > > > we
> > > > > > could
> > > > > > possibly do that. Perhaps even now with histogram triggers?
> > > > >
> > > > >
> > > > > Hey Steve,
> > > > >
> > > > > Something like an analog-to-digital conversion function where you
> > > > > lose the granularity of the signal depending on how much trace data:
> > > > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> > > >
> > > > s/how much trace data/what the resolution is/
> > > >
> > > > > so like, if you had a counter incrementing with values after the
> > > > > increments as: 1,3,4,8,12,14,30 and say 5 is the threshold at which
> > > > > to emit a trace, then you would get 1,8,12,30.
> > > > >
> > > > > So I guess what is needed is a way to reduce the quantity of trace
> > > > > data this way. For this use case, the user mostly cares about spikes
> > > > > in the counter changing that accurate values of the different points.
> > > >
> > > > s/that accurate/than accurate/
> > > >
> > > > I think Tim, Suren, Dan and Michal are all saying the same thing
> > > > as
> > > > well.
> > > >
> > >
> > > There's not a way to do this using existing triggers (histogram
> > > 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Tom Zanussi
On Thu, 2019-09-05 at 13:24 -0700, Daniel Colascione wrote:
> On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi 
> wrote:
> > On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > > > 
> > > > > 
> > > > > [ Added Tom ]
> > > > > 
> > > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > > Suren Baghdasaryan  wrote:
> > > > > 
> > > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko wrote:
> > > > > > > 
> > > > > > > [Add Steven]
> > > > > > > 
> > > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > > > 
> > > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > > 
> > > > > > > [...]
> > > > > > > > > > but also for reducing
> > > > > > > > > > tracing noise. Flooding the traces makes it less
> > > > > > > > > > useful
> > > > > > > > > > for long traces and
> > > > > > > > > > post-processing of traces. IOW, the overhead
> > > > > > > > > > reduction
> > > > > > > > > > is a bonus.
> > > > > > > > > 
> > > > > > > > > This is not really anything special for this
> > > > > > > > > tracepoint
> > > > > > > > > though.
> > > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > > situation and I do
> > > > > > > > > not see a point why each of them should really invent
> > > > > > > > > its
> > > > > > > > > own way to
> > > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > > tracing subsystem
> > > > > > > > > directly.
> > > > > > > > 
> > > > > > > > I am not sure if there is a way to do this easily. Add
> > > > > > > > to
> > > > > > > > that, the fact that
> > > > > > > > you still have to call into trace events. Why call into
> > > > > > > > it
> > > > > > > > at all, if you can
> > > > > > > > filter in advance and have a sane filtering default?
> > > > > > > > 
> > > > > > > > The bigger improvement with the threshold is the number
> > > > > > > > of
> > > > > > > > trace records are
> > > > > > > > almost halved by using a threshold. The number of
> > > > > > > > records
> > > > > > > > went from 4.6K to
> > > > > > > > 2.6K.
> > > > > > > 
> > > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > > throttling?
> > > > > > 
> > > > > > I might misunderstand this but is the issue here actually
> > > > > > throttling
> > > > > > of the sheer number of trace records or tracing large
> > > > > > enough
> > > > > > changes
> > > > > > to RSS that user might care about? Small changes happen all
> > > > > > the
> > > > > > time
> > > > > > but we are likely not interested in those. Surely we could
> > > > > > postprocess
> > > > > > the traces to extract changes large enough to be
> > > > > > interesting
> > > > > > but why
> > > > > > capture uninteresting information in the first place? IOW
> > > > > > the
> > > > > > throttling here should be based not on the time between
> > > > > > traces
> > > > > > but on
> > > > > > the amount of change of the traced signal. Maybe a generic
> > > > > > facility
> > > > > > like that would be a good idea?
> > > > > 
> > > > > You mean like add a trigger (or filter) that only traces if a
> > > > > field has
> > > > > changed since the last time the trace was hit? Hmm, I think
> > > > > we
> > > > > could
> > > > > possibly do that. Perhaps even now with histogram triggers?
> > > > 
> > > > 
> > > > Hey Steve,
> > > > 
> > > > Something like an analog-to-digital conversion function where you
> > > > lose the granularity of the signal depending on how much trace data:
> > > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> > > 
> > > s/how much trace data/what the resolution is/
> > > 
> > > > so like, if you had a counter incrementing with values after the
> > > > increments as: 1,3,4,8,12,14,30 and say 5 is the threshold at which
> > > > to emit a trace, then you would get 1,8,12,30.
> > > > 
> > > > So I guess what is needed is a way to reduce the quantity of trace
> > > > data this way. For this use case, the user mostly cares about spikes
> > > > in the counter changing that accurate values of the different points.
> > > 
> > > s/that accurate/than accurate/
> > > 
> > > I think Tim, Suren, Dan and Michal are all saying the same thing
> > > as
> > > well.
> > > 
> > 
> > There's not a way to do this using existing triggers (histogram
> > triggers have an onchange() that fires on any change, but that
> > doesn't
> > help here), and I wouldn't expect there to be - these sound like
> > very
> > specific cases that would never have support in the simple trigger
> > 'language'.
> 
> I don't see the filtering under discussion as 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Tom Zanussi
Hi,

On Thu, 2019-09-05 at 13:24 -0700, Daniel Colascione wrote:
> On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi 
> wrote:
> > On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > > > 
> > > > > 
> > > > > [ Added Tom ]
> > > > > 
> > > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > > Suren Baghdasaryan  wrote:
> > > > > 
> > > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko wrote:
> > > > > > > 
> > > > > > > [Add Steven]
> > > > > > > 
> > > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > > > 
> > > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > > 
> > > > > > > [...]
> > > > > > > > > > but also for reducing
> > > > > > > > > > tracing noise. Flooding the traces makes it less
> > > > > > > > > > useful
> > > > > > > > > > for long traces and
> > > > > > > > > > post-processing of traces. IOW, the overhead
> > > > > > > > > > reduction
> > > > > > > > > > is a bonus.
> > > > > > > > > 
> > > > > > > > > This is not really anything special for this
> > > > > > > > > tracepoint
> > > > > > > > > though.
> > > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > > situation and I do
> > > > > > > > > not see a point why each of them should really invent
> > > > > > > > > its
> > > > > > > > > own way to
> > > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > > tracing subsystem
> > > > > > > > > directly.
> > > > > > > > 
> > > > > > > > I am not sure if there is a way to do this easily. Add
> > > > > > > > to
> > > > > > > > that, the fact that
> > > > > > > > you still have to call into trace events. Why call into
> > > > > > > > it
> > > > > > > > at all, if you can
> > > > > > > > filter in advance and have a sane filtering default?
> > > > > > > > 
> > > > > > > > The bigger improvement with the threshold is the number
> > > > > > > > of
> > > > > > > > trace records are
> > > > > > > > almost halved by using a threshold. The number of
> > > > > > > > records
> > > > > > > > went from 4.6K to
> > > > > > > > 2.6K.
> > > > > > > 
> > > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > > throttling?
> > > > > > 
> > > > > > I might misunderstand this but is the issue here actually
> > > > > > throttling
> > > > > > of the sheer number of trace records or tracing large
> > > > > > enough
> > > > > > changes
> > > > > > to RSS that user might care about? Small changes happen all
> > > > > > the
> > > > > > time
> > > > > > but we are likely not interested in those. Surely we could
> > > > > > postprocess
> > > > > > the traces to extract changes large enough to be
> > > > > > interesting
> > > > > > but why
> > > > > > capture uninteresting information in the first place? IOW
> > > > > > the
> > > > > > throttling here should be based not on the time between
> > > > > > traces
> > > > > > but on
> > > > > > the amount of change of the traced signal. Maybe a generic
> > > > > > facility
> > > > > > like that would be a good idea?
> > > > > 
> > > > > You mean like add a trigger (or filter) that only traces if a
> > > > > field has
> > > > > changed since the last time the trace was hit? Hmm, I think
> > > > > we
> > > > > could
> > > > > possibly do that. Perhaps even now with histogram triggers?
> > > > 
> > > > 
> > > > Hey Steve,
> > > > 
> > > > Something like an analog-to-digital conversion function where you
> > > > lose the granularity of the signal depending on how much trace data:
> > > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> > > 
> > > s/how much trace data/what the resolution is/
> > > 
> > > > so like, if you had a counter incrementing with values after the
> > > > increments as: 1,3,4,8,12,14,30 and say 5 is the threshold at which
> > > > to emit a trace, then you would get 1,8,12,30.
> > > > 
> > > > So I guess what is needed is a way to reduce the quantity of trace
> > > > data this way. For this use case, the user mostly cares about spikes
> > > > in the counter changing that accurate values of the different points.
> > > 
> > > s/that accurate/than accurate/
> > > 
> > > I think Tim, Suren, Dan and Michal are all saying the same thing
> > > as
> > > well.
> > > 
> > 
> > There's not a way to do this using existing triggers (histogram
> > triggers have an onchange() that fires on any change, but that
> > doesn't
> > help here), and I wouldn't expect there to be - these sound like
> > very
> > specific cases that would never have support in the simple trigger
> > 'language'.
> 
> I don't see the filtering under discussion 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 12:56 PM Tom Zanussi  wrote:
> On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > >
> > > >
> > > > [ Added Tom ]
> > > >
> > > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > > Suren Baghdasaryan  wrote:
> > > >
> > > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko 
> > > > > wrote:
> > > > > >
> > > > > > [Add Steven]
> > > > > >
> > > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > >
> > > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > >
> > > > > > [...]
> > > > > > > > > but also for reducing
> > > > > > > > > tracing noise. Flooding the traces makes it less useful
> > > > > > > > > for long traces and
> > > > > > > > > post-processing of traces. IOW, the overhead reduction
> > > > > > > > > is a bonus.
> > > > > > > >
> > > > > > > > This is not really anything special for this tracepoint
> > > > > > > > though.
> > > > > > > > Basically any tracepoint in a hot path is in the same
> > > > > > > > situation and I do
> > > > > > > > not see a point why each of them should really invent its
> > > > > > > > own way to
> > > > > > > > throttle. Maybe there is some way to do that in the
> > > > > > > > tracing subsystem
> > > > > > > > directly.
> > > > > > >
> > > > > > > I am not sure if there is a way to do this easily. Add to
> > > > > > > that, the fact that
> > > > > > > you still have to call into trace events. Why call into it
> > > > > > > at all, if you can
> > > > > > > filter in advance and have a sane filtering default?
> > > > > > >
> > > > > > > The bigger improvement with the threshold is the number of
> > > > > > > trace records are
> > > > > > > almost halved by using a threshold. The number of records
> > > > > > > went from 4.6K to
> > > > > > > 2.6K.
> > > > > >
> > > > > > Steven, would it be feasible to add a generic tracepoint
> > > > > > throttling?
> > > > >
> > > > > I might misunderstand this but is the issue here actually
> > > > > throttling
> > > > > of the sheer number of trace records or tracing large enough
> > > > > changes
> > > > > to RSS that user might care about? Small changes happen all the
> > > > > time
> > > > > but we are likely not interested in those. Surely we could
> > > > > postprocess
> > > > > the traces to extract changes large enough to be interesting
> > > > > but why
> > > > > capture uninteresting information in the first place? IOW the
> > > > > throttling here should be based not on the time between traces
> > > > > but on
> > > > > the amount of change of the traced signal. Maybe a generic
> > > > > facility
> > > > > like that would be a good idea?
> > > >
> > > > You mean like add a trigger (or filter) that only traces if a
> > > > field has
> > > > changed since the last time the trace was hit? Hmm, I think we
> > > > could
> > > > possibly do that. Perhaps even now with histogram triggers?
> > >
> > >
> > > Hey Steve,
> > >
> > > Something like an analog-to-digital conversion function where you
> > > lose the granularity of the signal depending on how much trace data:
> > > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> >
> > s/how much trace data/what the resolution is/
> >
> > > so like, if you had a counter incrementing with values after the
> > > increments as: 1,3,4,8,12,14,30 and say 5 is the threshold at which to
> > > emit a trace, then you would get 1,8,12,30.
> > >
> > > So I guess what is needed is a way to reduce the quantity of trace
> > > data this way. For this use case, the user mostly cares about spikes in
> > > the counter changing that accurate values of the different points.
> >
> > s/that accurate/than accurate/
> >
> > I think Tim, Suren, Dan and Michal are all saying the same thing as
> > well.
> >
>
> There's not a way to do this using existing triggers (histogram
> triggers have an onchange() that fires on any change, but that doesn't
> help here), and I wouldn't expect there to be - these sound like very
> specific cases that would never have support in the simple trigger
> 'language'.

I don't see the filtering under discussion as some "very specific"
esoteric need. You need this general kind of mechanism any time you
want to monitor at low frequency a thing that changes at high
frequency. The general pattern isn't specific to RSS or even memory in
general. One might imagine, say, wanting to trace large changes in TCP
window sizes. Any time something in the kernel has a "level" and that
level changes at high frequency and we want to learn about big swings
in that level, the mechanism we're talking about becomes useful. I
don't think it should be out of bounds for the histogram mechanism,
which is *almost* there right now. We already 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Tom Zanussi
Hi,

On Thu, 2019-09-05 at 13:51 -0400, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > > 
> > > 
> > > [ Added Tom ]
> > > 
> > > On Thu, 5 Sep 2019 09:03:01 -0700
> > > Suren Baghdasaryan  wrote:
> > > 
> > > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko 
> > > > wrote:
> > > > > 
> > > > > [Add Steven]
> > > > > 
> > > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:  
> > > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > > > 
> > > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:  
> > > > > 
> > > > > [...]  
> > > > > > > > but also for reducing tracing noise. Flooding the traces makes
> > > > > > > > it less useful for long traces and post-processing of traces.
> > > > > > > > IOW, the overhead reduction is a bonus.
> > > > > > >
> > > > > > > This is not really anything special for this tracepoint though.
> > > > > > > Basically any tracepoint in a hot path is in the same situation
> > > > > > > and I do not see a point why each of them should really invent
> > > > > > > its own way to throttle. Maybe there is some way to do that in
> > > > > > > the tracing subsystem directly.
> > > > > >
> > > > > > I am not sure if there is a way to do this easily. Add to that,
> > > > > > the fact that you still have to call into trace events. Why call
> > > > > > into it at all, if you can filter in advance and have a sane
> > > > > > filtering default?
> > > > > >
> > > > > > The bigger improvement with the threshold is the number of trace
> > > > > > records are almost halved by using a threshold. The number of
> > > > > > records went from 4.6K to 2.6K.
> > > > >
> > > > > Steven, would it be feasible to add a generic tracepoint throttling?
> > > >
> > > > I might misunderstand this but is the issue here actually throttling
> > > > of the sheer number of trace records or tracing large enough changes
> > > > to RSS that user might care about? Small changes happen all the time
> > > > but we are likely not interested in those. Surely we could postprocess
> > > > the traces to extract changes large enough to be interesting but why
> > > > capture uninteresting information in the first place? IOW the
> > > > throttling here should be based not on the time between traces but on
> > > > the amount of change of the traced signal. Maybe a generic facility
> > > > like that would be a good idea?
> > >
> > > You mean like add a trigger (or filter) that only traces if a field has
> > > changed since the last time the trace was hit? Hmm, I think we could
> > > possibly do that. Perhaps even now with histogram triggers?
> > 
> > 
> > Hey Steve,
> > 
> > Something like an analog to digital conversion function where you lose
> > the granularity of the signal depending on how much trace data:
> > https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif
> 
> s/how much trace data/what the resolution is/
> 
> > so like, if you had a counter incrementing with values after the
> > increments as: 1,3,4,8,12,14,30 and say 5 is the threshold at which to
> > emit a trace, then you would get 1,8,12,30.
> >
> > So I guess what is needed is a way to reduce the quantity of trace data
> > this way. For this usecase, the user mostly cares about spikes in the
> > counter changing that accurate values of the different points.
> 
> s/that accurate/than accurate/
> 
> I think Tim, Suren, Dan and Michal are all saying the same thing as
> well.
> 

There's not a way to do this using existing triggers (histogram
triggers have an onchange() that fires on any change, but that doesn't
help here), and I wouldn't expect there to be - these sound like very
specific cases that would never have support in the simple trigger
'language'.
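[For reference, the kind of existing trigger Tom is referring to looks roughly like the sketch below. This is hedged: it assumes the rss_stat event from this patch exists with a 'size' field, and follows the tracefs syntax in Documentation/trace/histogram.rst; exact syntax varies by kernel version.]

```shell
# Sketch only -- assumes the rss_stat event from this patch and a
# 'size' field. onchange() fires on *any* change of the saved
# variable; there is no way to express "changed by at least X",
# which is exactly the gap discussed in this thread.
cd /sys/kernel/tracing
echo 'hist:keys=common_pid:sz=size:onchange($sz).save(size)' > \
        events/kmem/rss_stat/trigger
cat events/kmem/rss_stat/hist
```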

On the other hand, I have been working on something that should give
you the ability to do something like this, by writing a module that
hooks into arbitrary trace events, accessing their fields, building up
any needed state across events, and then generating synthetic events as
needed:

https://git.kernel.org/pub/scm/linux/kernel/git/zanussi/linux-trace.git/log/?h=ftrace/in-kernel-events-v0

The documentation is in the top commit, and I'll include it below, but
the basic idea is that you can set up a 'live handler' for any event,
and when that event is hit, you look at the data and save it however
you need, or modify some state machine, etc, and the event is then
discarded since it's in soft enable mode and doesn't end up in the
trace stream unless you enable it.  At any point you can decide to
generate a synthetic event of your own design which will end up in the
trace 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 01:47:05PM -0400, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> > 
> > 
> > [ Added Tom ]
> > 
> > On Thu, 5 Sep 2019 09:03:01 -0700
> > Suren Baghdasaryan  wrote:
> > 
> > > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > > >
> > > > [Add Steven]
> > > >
> > > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:  
> > > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko wrote:
> > > > > >
> > > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:  
> > > > [...]  
> > > > > > > but also for reducing
> > > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > > traces and
> > > > > > > post-processing of traces. IOW, the overhead reduction is a 
> > > > > > > bonus.  
> > > > > >
> > > > > > This is not really anything special for this tracepoint though.
> > > > > > Basically any tracepoint in a hot path is in the same situation and 
> > > > > > I do
> > > > > > not see a point why each of them should really invent its own way to
> > > > > > throttle. Maybe there is some way to do that in the tracing 
> > > > > > subsystem
> > > > > > directly.  
> > > > >
> > > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > > fact that
> > > > > you still have to call into trace events. Why call into it at all, if 
> > > > > you can
> > > > > filter in advance and have a sane filtering default?
> > > > >
> > > > > The bigger improvement with the threshold is the number of trace 
> > > > > records are
> > > > > almost halved by using a threshold. The number of records went from 
> > > > > 4.6K to
> > > > > 2.6K.  
> > > >
> > > > Steven, would it be feasible to add a generic tracepoint throttling?  
> > > 
> > > I might misunderstand this but is the issue here actually throttling
> > > of the sheer number of trace records or tracing large enough changes
> > > to RSS that user might care about? Small changes happen all the time
> > > but we are likely not interested in those. Surely we could postprocess
> > > the traces to extract changes large enough to be interesting but why
> > > capture uninteresting information in the first place? IOW the
> > > throttling here should be based not on the time between traces but on
> > > the amount of change of the traced signal. Maybe a generic facility
> > > like that would be a good idea?
> > 
> > You mean like add a trigger (or filter) that only traces if a field has
> > changed since the last time the trace was hit? Hmm, I think we could
> > possibly do that. Perhaps even now with histogram triggers?
> 
> 
> Hey Steve,
> 
> Something like an analog to digital conversion function where you lose the
> granularity of the signal depending on how much trace data:
> https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif

s/how much trace data/what the resolution is/

> so like, if you had a counter incrementing with values after the increments
> as: 1,3,4,8,12,14,30 and say 5 is the threshold at which to emit a trace,
> then you would get 1,8,12,30.
> 
> So I guess what is needed is a way to reduce the quantity of trace data this
> way. For this usecase, the user mostly cares about spikes in the counter
> changing that accurate values of the different points.

s/that accurate/than accurate/

I think Tim, Suren, Dan and Michal are all saying the same thing as well.

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Daniel Colascione
On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
> On Thu, 5 Sep 2019 09:03:01 -0700
> Suren Baghdasaryan  wrote:
>
> > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > >
> > > [Add Steven]
> > >
> > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > [...]
> > > > > > but also for reducing
> > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > traces and
> > > > > > post-processing of traces. IOW, the overhead reduction is a bonus.
> > > > >
> > > > > This is not really anything special for this tracepoint though.
> > > > > Basically any tracepoint in a hot path is in the same situation and I 
> > > > > do
> > > > > not see a point why each of them should really invent its own way to
> > > > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > > > directly.
> > > >
> > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > fact that
> > > > you still have to call into trace events. Why call into it at all, if 
> > > > you can
> > > > filter in advance and have a sane filtering default?
> > > >
> > > > The bigger improvement with the threshold is the number of trace 
> > > > records are
> > > > almost halved by using a threshold. The number of records went from 
> > > > 4.6K to
> > > > 2.6K.
> > >
> > > Steven, would it be feasible to add a generic tracepoint throttling?
> >
> > I might misunderstand this but is the issue here actually throttling
> > of the sheer number of trace records or tracing large enough changes
> > to RSS that user might care about? Small changes happen all the time
> > but we are likely not interested in those. Surely we could postprocess
> > the traces to extract changes large enough to be interesting but why
> > capture uninteresting information in the first place? IOW the
> > throttling here should be based not on the time between traces but on
> > the amount of change of the traced signal. Maybe a generic facility
> > like that would be a good idea?
>
> You mean like add a trigger (or filter) that only traces if a field has
> changed since the last time the trace was hit? Hmm, I think we could
> possibly do that. Perhaps even now with histogram triggers?

I was thinking along the same lines. The histogram subsystem seems
like a very good fit here. Histogram triggers already let users talk
about specific fields of trace events, aggregate them in configurable
ways, and (importantly, IMHO) create synthetic new trace events that
the kernel emits under configurable conditions.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 01:35:07PM -0400, Steven Rostedt wrote:
> 
> 
> [ Added Tom ]
> 
> On Thu, 5 Sep 2019 09:03:01 -0700
> Suren Baghdasaryan  wrote:
> 
> > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > >
> > > [Add Steven]
> > >
> > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:  
> > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote: 
> > > >  
> > > > >
> > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:  
> > > [...]  
> > > > > > but also for reducing
> > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > traces and
> > > > > > post-processing of traces. IOW, the overhead reduction is a bonus.  
> > > > >
> > > > > This is not really anything special for this tracepoint though.
> > > > > Basically any tracepoint in a hot path is in the same situation and I 
> > > > > do
> > > > > not see a point why each of them should really invent its own way to
> > > > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > > > directly.  
> > > >
> > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > fact that
> > > > you still have to call into trace events. Why call into it at all, if 
> > > > you can
> > > > filter in advance and have a sane filtering default?
> > > >
> > > > The bigger improvement with the threshold is the number of trace 
> > > > records are
> > > > almost halved by using a threshold. The number of records went from 
> > > > 4.6K to
> > > > 2.6K.  
> > >
> > > Steven, would it be feasible to add a generic tracepoint throttling?  
> > 
> > I might misunderstand this but is the issue here actually throttling
> > of the sheer number of trace records or tracing large enough changes
> > to RSS that user might care about? Small changes happen all the time
> > but we are likely not interested in those. Surely we could postprocess
> > the traces to extract changes large enough to be interesting but why
> > capture uninteresting information in the first place? IOW the
> > throttling here should be based not on the time between traces but on
> > the amount of change of the traced signal. Maybe a generic facility
> > like that would be a good idea?
> 
> You mean like add a trigger (or filter) that only traces if a field has
> changed since the last time the trace was hit? Hmm, I think we could
> possibly do that. Perhaps even now with histogram triggers?


Hey Steve,

Something like an analog to digital conversion function where you lose the
granularity of the signal depending on how much trace data:
https://www.globalspec.com/ImageRepository/LearnMore/20142/9ee38d1a85d37fa23f86a14d3a9776ff67b0ec0f3b.gif

so like, if you had a counter incrementing with values after the increments
as: 1,3,4,8,12,14,30 and say 5 is the threshold at which to emit a trace,
then you would get 1,8,12,30.

So I guess what is needed is a way to reduce the quantity of trace data this
way. For this usecase, the user mostly cares about spikes in the counter
changing that accurate values of the different points.
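
[Editorial note: one reading of the example that matches all four emitted
values (1, 8, 12, 30) is "emit whenever the counter moves into a different
bucket of size 5 since the last emitted value". A minimal userspace C sketch
of that filter, purely illustrative and not the patch's actual code:]

```c
#include <assert.h>

/*
 * Illustrative sketch only -- not the rss_stat patch code.
 * Emit a sample only when the counter has moved into a different
 * bucket of size THRESHOLD since the last emitted value, which
 * reproduces the 1,3,4,8,12,14,30 -> 1,8,12,30 example above.
 */
#define THRESHOLD 5

static long last_emitted = -1;	/* -1: nothing emitted yet */

static int should_emit(long value)
{
	/* First sample is always emitted; afterwards only bucket changes. */
	if (last_emitted < 0 || value / THRESHOLD != last_emitted / THRESHOLD) {
		last_emitted = value;
		return 1;
	}
	return 0;
}
```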

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Tim Murray
On Thu, Sep 5, 2019 at 9:03 AM Suren Baghdasaryan  wrote:
> I might misunderstand this but is the issue here actually throttling
> of the sheer number of trace records or tracing large enough changes
> to RSS that user might care about? Small changes happen all the time
> but we are likely not interested in those. Surely we could postprocess
> the traces to extract changes large enough to be interesting but why
> capture uninteresting information in the first place? IOW the
> throttling here should be based not on the time between traces but on
> the amount of change of the traced signal. Maybe a generic facility
> like that would be a good idea?

You want two properties from the tracepoint:

- Small fluctuations in the value don't flood the trace buffer. If you
get a new trace event from a process every time kswapd reclaims a
single page from that process, you're going to need an enormous trace
buffer that will have significant side effects on overall system
performance.
- Any spike in memory consumption gets a trace event, regardless of
the duration of that spike. This tracepoint has been incredibly useful
in both understanding the causes of kswapd wakeups and
lowmemorykiller/lmkd kills and evaluating the impact of memory
management changes because it guarantees that any spike appears in the
trace output.

As a result, the RSS tracepoint in particular needs to be throttled
based on the delta of the value, not time. The very first prototype of
the patch emitted a trace event per RSS counter change, and IIRC the
RSS trace events consumed significantly more room in the buffer than
sched_switch (and Android has a lot of sched_switch events). It's not
practical to trace changes in RSS without throttling. If there's a
generic throttling approach that would work here, I'm all for it; like
Dan mentioned, there are many more counters that we would like to
trace in a similar way.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Suren Baghdasaryan
On Thu, Sep 5, 2019 at 10:35 AM Steven Rostedt  wrote:
>
>
>
> [ Added Tom ]
>
> On Thu, 5 Sep 2019 09:03:01 -0700
> Suren Baghdasaryan  wrote:
>
> > On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> > >
> > > [Add Steven]
> > >
> > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > [...]
> > > > > > but also for reducing
> > > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > > traces and
> > > > > > post-processing of traces. IOW, the overhead reduction is a bonus.
> > > > >
> > > > > This is not really anything special for this tracepoint though.
> > > > > Basically any tracepoint in a hot path is in the same situation and I 
> > > > > do
> > > > > not see a point why each of them should really invent its own way to
> > > > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > > > directly.
> > > >
> > > > I am not sure if there is a way to do this easily. Add to that, the 
> > > > fact that
> > > > you still have to call into trace events. Why call into it at all, if 
> > > > you can
> > > > filter in advance and have a sane filtering default?
> > > >
> > > > The bigger improvement with the threshold is the number of trace 
> > > > records are
> > > > almost halved by using a threshold. The number of records went from 
> > > > 4.6K to
> > > > 2.6K.
> > >
> > > Steven, would it be feasible to add a generic tracepoint throttling?
> >
> > I might misunderstand this but is the issue here actually throttling
> > of the sheer number of trace records or tracing large enough changes
> > to RSS that user might care about? Small changes happen all the time
> > but we are likely not interested in those. Surely we could postprocess
> > the traces to extract changes large enough to be interesting but why
> > capture uninteresting information in the first place? IOW the
> > throttling here should be based not on the time between traces but on
> > the amount of change of the traced signal. Maybe a generic facility
> > like that would be a good idea?
>
> You mean like add a trigger (or filter) that only traces if a field has
> changed since the last time the trace was hit?

Almost... I mean emit a trace if a field has changed by more than X
amount since the last time the trace was hit.
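
[Editorial note: Suren's variant -- fire on an absolute change of at least X
since the last emitted value, rather than on a bucket crossing -- could look
like the minimal sketch below. Illustrative userspace code only, with made-up
names and threshold, not a proposed kernel implementation.]

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Illustrative sketch: emit only when the value has moved by at
 * least DELTA (in either direction) since the last emitted value.
 */
#define DELTA 5

static long last_emitted;
static int emitted_any;

static int should_emit_delta(long value)
{
	if (!emitted_any || labs(value - last_emitted) >= DELTA) {
		emitted_any = 1;
		last_emitted = value;	/* remember the last emitted level */
		return 1;
	}
	return 0;
}
```

Note this differs from a bucket-crossing filter: on the series
1,3,4,8,12,14,30 it emits 1, 8, 14, 30.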

> Hmm, I think we could
> possibly do that. Perhaps even now with histogram triggers?
>
> -- Steve
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kernel-team+unsubscr...@android.com.
>


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Steven Rostedt



[ Added Tom ]

On Thu, 5 Sep 2019 09:03:01 -0700
Suren Baghdasaryan  wrote:

> On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
> >
> > [Add Steven]
> >
> > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:  
> > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:  
> > > >
> > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:  
> > [...]  
> > > > > but also for reducing
> > > > > tracing noise. Flooding the traces makes it less useful for long 
> > > > > traces and
> > > > > post-processing of traces. IOW, the overhead reduction is a bonus.  
> > > >
> > > > This is not really anything special for this tracepoint though.
> > > > Basically any tracepoint in a hot path is in the same situation and I do
> > > > not see a point why each of them should really invent its own way to
> > > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > > directly.  
> > >
> > > I am not sure if there is a way to do this easily. Add to that, the fact 
> > > that
> > > you still have to call into trace events. Why call into it at all, if you 
> > > can
> > > filter in advance and have a sane filtering default?
> > >
> > > The bigger improvement with the threshold is the number of trace records 
> > > are
> > > almost halved by using a threshold. The number of records went from 4.6K 
> > > to
> > > 2.6K.  
> >
> > Steven, would it be feasible to add a generic tracepoint throttling?  
> 
> I might misunderstand this but is the issue here actually throttling
> of the sheer number of trace records or tracing large enough changes
> to RSS that user might care about? Small changes happen all the time
> but we are likely not interested in those. Surely we could postprocess
> the traces to extract changes large enough to be interesting but why
> capture uninteresting information in the first place? IOW the
> throttling here should be based not on the time between traces but on
> the amount of change of the traced signal. Maybe a generic facility
> like that would be a good idea?

You mean like add a trigger (or filter) that only traces if a field has
changed since the last time the trace was hit? Hmm, I think we could
possibly do that. Perhaps even now with histogram triggers?

-- Steve



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Suren Baghdasaryan
On Thu, Sep 5, 2019 at 7:43 AM Michal Hocko  wrote:
>
> [Add Steven]
>
> On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > >
> > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> [...]
> > > > but also for reducing
> > > > tracing noise. Flooding the traces makes it less useful for long traces 
> > > > and
> > > > post-processing of traces. IOW, the overhead reduction is a bonus.
> > >
> > > This is not really anything special for this tracepoint though.
> > > Basically any tracepoint in a hot path is in the same situation and I do
> > > not see a point why each of them should really invent its own way to
> > > throttle. Maybe there is some way to do that in the tracing subsystem
> > > directly.
> >
> > I am not sure if there is a way to do this easily. Add to that, the fact 
> > that
> > you still have to call into trace events. Why call into it at all, if you 
> > can
> > filter in advance and have a sane filtering default?
> >
> > The bigger improvement with the threshold is the number of trace records are
> > almost halved by using a threshold. The number of records went from 4.6K to
> > 2.6K.
>
> Steven, would it be feasible to add a generic tracepoint throttling?

I might misunderstand this but is the issue here actually throttling
of the sheer number of trace records or tracing large enough changes
to RSS that user might care about? Small changes happen all the time
but we are likely not interested in those. Surely we could postprocess
the traces to extract changes large enough to be interesting but why
capture uninteresting information in the first place? IOW the
throttling here should be based not on the time between traces but on
the amount of change of the traced signal. Maybe a generic facility
like that would be a good idea?

> --
> Michal Hocko
> SUSE Labs
>


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Michal Hocko
[Add Steven]

On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> >
> > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
[...]
> > > but also for reducing
> > > tracing noise. Flooding the traces makes it less useful for long traces 
> > > and
> > > post-processing of traces. IOW, the overhead reduction is a bonus.
> >
> > This is not really anything special for this tracepoint though.
> > Basically any tracepoint in a hot path is in the same situation and I do
> > not see a point why each of them should really invent its own way to
> > throttle. Maybe there is some way to do that in the tracing subsystem
> > directly.
> 
> I am not sure if there is a way to do this easily. Add to that, the fact that
> you still have to call into trace events. Why call into it at all, if you can
> filter in advance and have a sane filtering default?
> 
> The bigger improvement with the threshold is the number of trace records are
> almost halved by using a threshold. The number of records went from 4.6K to
> 2.6K.

Steven, would it be feasible to add a generic tracepoint throttling?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 04:20:10PM +0200, Michal Hocko wrote:
> On Thu 05-09-19 10:14:52, Joel Fernandes wrote:
> > On Thu, Sep 05, 2019 at 12:54:24PM +0200, Michal Hocko wrote:
> > > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > > > > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > > > > > Useful to track how RSS is changing per TGID to detect spikes 
> > > > > > > > in RSS and
> > > > > > > > memory hogs. Several Android teams have been using this patch 
> > > > > > > > in various
> > > > > > > > kernel trees for half a year now. Many reported to me it is 
> > > > > > > > really
> > > > > > > > useful so I'm posting it upstream.
> > > > > > > >
> > > > > > > > Initial patch developed by Tim Murray. Changes I made from 
> > > > > > > > original patch:
> > > > > > > > o Prevent any additional space consumed by mm_struct.
> > > > > > > > o Keep overhead low by checking if tracing is enabled.
> > > > > > > > o Add some noise reduction and lower overhead by emitting only 
> > > > > > > > on
> > > > > > > >   threshold changes.
> > > > > > >
> > > > > > > Does this have any pre-requisite? I do not see 
> > > > > > > trace_rss_stat_enabled in
> > > > > > > the Linus tree (nor in linux-next).
> > > > > >
> > > > > > No, this is generated automatically by the tracepoint 
> > > > > > infrastructure when a
> > > > > > tracepoint is added.
> > > > >
> > > > > OK, I was not aware of that.
> > > > >
> > > > > > > Besides that why do we need batching in the first place. Does 
> > > > > > > this have a
> > > > > > > measurable overhead? How does it differ from any other 
> > > > > > > tracepoints that we
> > > > > > > have in other hotpaths (e.g.  page allocator doesn't do any 
> > > > > > > checks).
> > > > > >
> > > > > > We do need batching not only for overhead reduction,
> > > > >
> > > > > What is the overhead?
> > > > 
> > > > The overhead is occasionally higher without the threshold (that is if we
> > > > trace every counter change). I would classify performance benefit to be
> > > > almost the same and within the noise.
> > > 
> > > OK, so the additional code is not really justified.
> > 
> > It is really justified. Did you read the whole of the last email?
> 
> Of course I have. The information that numbers are in noise with some
> outliers (without any details about the underlying reason) is simply
> showing that you are optimizing something probably not worth it.
> 
> I would recommend adding a simple tracepoint. That should be pretty non
> controversial. And if you want to add an optimization on top then
> provide data to justify it.

Did you read the point about trace sizes? We don't want traces flooded and
you are not really making any good points about why we should not reduce
flooding of traces. I don't want to simplify it and lose the benefit. It is
already simple enough and non-controversial.

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Michal Hocko
On Thu 05-09-19 10:14:52, Joel Fernandes wrote:
> On Thu, Sep 05, 2019 at 12:54:24PM +0200, Michal Hocko wrote:
> > On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > > > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > > > > Useful to track how RSS is changing per TGID to detect spikes in 
> > > > > > > RSS and
> > > > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > > > various
> > > > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > > > useful so I'm posting it upstream.
> > > > > > >
> > > > > > > Initial patch developed by Tim Murray. Changes I made from 
> > > > > > > original patch:
> > > > > > > o Prevent any additional space consumed by mm_struct.
> > > > > > > o Keep overhead low by checking if tracing is enabled.
> > > > > > > o Add some noise reduction and lower overhead by emitting only on
> > > > > > >   threshold changes.
> > > > > >
> > > > > > Does this have any pre-requisite? I do not see 
> > > > > > trace_rss_stat_enabled in
> > > > > > the Linus tree (nor in linux-next).
> > > > >
> > > > > No, this is generated automatically by the tracepoint infrastructure 
> > > > > when a
> > > > > tracepoint is added.
> > > >
> > > > OK, I was not aware of that.
> > > >
> > > > > > Besides that why do we need batching in the first place. Does this 
> > > > > > have a
> > > > > > measurable overhead? How does it differ from any other tracepoints 
> > > > > > that we
> > > > > > have in other hotpaths (e.g.  page allocator doesn't do any checks).
> > > > >
> > > > > We do need batching not only for overhead reduction,
> > > >
> > > > What is the overhead?
> > > 
> > > The overhead is occasionally higher without the threshold (that is if we
> > > trace every counter change). I would classify performance benefit to be
> > > almost the same and within the noise.
> > 
> > OK, so the additional code is not really justified.
> 
> It is really justified. Did you read the whole of the last email?

Of course I have. The information that numbers are in noise with some
outliers (without any details about the underlying reason) is simply
showing that you are optimizing something probably not worth it.

I would recommend adding a simple tracepoint. That should be pretty non
controversial. And if you want to add an optimization on top then
provide data to justify it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Joel Fernandes
On Thu, Sep 05, 2019 at 12:54:24PM +0200, Michal Hocko wrote:
> On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> > On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> > >
> > > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > > > Useful to track how RSS is changing per TGID to detect spikes in 
> > > > > > RSS and
> > > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > > various
> > > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > > useful so I'm posting it upstream.
> > > > > >
> > > > > > Initial patch developed by Tim Murray. Changes I made from original 
> > > > > > patch:
> > > > > > o Prevent any additional space consumed by mm_struct.
> > > > > > o Keep overhead low by checking if tracing is enabled.
> > > > > > o Add some noise reduction and lower overhead by emitting only on
> > > > > >   threshold changes.
> > > > >
> > > > > Does this have any pre-requisite? I do not see trace_rss_stat_enabled 
> > > > > in
> > > > > the Linus tree (nor in linux-next).
> > > >
> > > > No, this is generated automatically by the tracepoint infrastructure 
> > > > when a
> > > > tracepoint is added.
> > >
> > > OK, I was not aware of that.
> > >
> > > > > Besides that why do we need batching in the first place. Does this 
> > > > > have a
> > > > > measurable overhead? How does it differ from any other tracepoints 
> > > > > that we
> > > > > have in other hotpaths (e.g.  page allocator doesn't do any checks).
> > > >
> > > > We do need batching not only for overhead reduction,
> > >
> > > What is the overhead?
> > 
> > The overhead is occasionally higher without the threshold (that is if we
> > trace every counter change). I would classify performance benefit to be
> > almost the same and within the noise.
> 
> OK, so the additional code is not really justified.

It is really justified. Did you read the whole of the last email?

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-05 Thread Michal Hocko
On Wed 04-09-19 12:28:08, Joel Fernandes wrote:
> On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
> >
> > On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > > Useful to track how RSS is changing per TGID to detect spikes in RSS 
> > > > > and
> > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > various
> > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > useful so I'm posting it upstream.
> > > > >
> > > > > Initial patch developed by Tim Murray. Changes I made from original 
> > > > > patch:
> > > > > o Prevent any additional space consumed by mm_struct.
> > > > > o Keep overhead low by checking if tracing is enabled.
> > > > > o Add some noise reduction and lower overhead by emitting only on
> > > > >   threshold changes.
> > > >
> > > > Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
> > > > the Linus tree (nor in linux-next).
> > >
> > > No, this is generated automatically by the tracepoint infrastructure when 
> > > a
> > > tracepoint is added.
> >
> > OK, I was not aware of that.
> >
> > > > Besides that why do we need batching in the first place. Does this have 
> > > > a
> > > > measurable overhead? How does it differ from any other tracepoints that 
> > > > we
> > > > have in other hotpaths (e.g.  page allocator doesn't do any checks).
> > >
> > > We do need batching not only for overhead reduction,
> >
> > What is the overhead?
> 
> The overhead is occasionally higher without the threshold (that is if we
> trace every counter change). I would classify performance benefit to be
> almost the same and within the noise.

OK, so the additional code is not really justified.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread sspatil
On Wed, Sep 04, 2019 at 10:15:10AM -0700, 'Daniel Colascione' via kernel-team 
wrote:
> On Wed, Sep 4, 2019 at 7:59 AM Joel Fernandes  wrote:
> >
> > On Tue, Sep 03, 2019 at 10:42:53PM -0700, Daniel Colascione wrote:
> > > On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  
> > > wrote:
> > > >
> > > > On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > > > > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > > > > >  wrote:
> > > > > > >
> > > > > > > Useful to track how RSS is changing per TGID to detect spikes in 
> > > > > > > RSS and
> > > > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > > > various
> > > > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > > > useful so I'm posting it upstream.
> > > > >
> > > > > It's also worth being able to turn off the per-task memory counter
> > > > > caching, otherwise you'll have two levels of batching before the
> > > > > counter gets updated, IIUC.
> > > >
> > > > I prefer to keep split RSS accounting turned on if it is available.
> > >
> > > Why? AFAIK, nobody's produced numbers showing that split accounting
> > > has a real benefit.
> >
> > I am not too sure. Have you checked the original patches that added this
> > stuff though? It seems to me the main win would be on big systems that have
> > to pay for atomic updates.
> 
> I looked into this issue the last time I mentioned split mm
> accounting. See [1]. It's my sense that the original change was
> inadequately justified; Michal Hocko seems to agree. I've tried
> disabling split rss accounting locally on a variety of systems ---
> Android, laptop, desktop --- and failed to notice any difference. It's
> possible that some difference appears at a scale beyond that to which
> I have access, but if the benefit of split rss accounting is limited
> to these cases, split rss accounting shouldn't be on by default, since
> it comes at a cost in consistency.
> 
> [1] https://lore.kernel.org/linux-mm/20180227100234.gf15...@dhcp22.suse.cz/
> 
> > > > I think
> > > > discussing split RSS accounting is a bit out of scope of this patch as 
> > > > well.
> > >
> > > It's in-scope, because with split RSS accounting, allocated memory can
> > > stay accumulated in task structs for an indefinite time without being
> > > flushed to the mm. As a result, if you take the stream of virtual
> > > memory management system calls that a program makes on one hand, and VM
> > > counter values on the other, the two don't add up. For various kinds
> > > of robustness (trace self-checking, say) it's important that various
> > > sources of data add up.
> > >
> > > If we're adding a configuration knob that controls how often VM
> > > counters get reflected in system trace points, we should also have a
> > > knob to control delayed VM counter operations. The whole point is for
> > > users to be able to specify how precisely they want VM counter changes
> > > reported to analysis tools.
> >
> > We're not adding more configuration knobs.
> 
> This position doesn't seem to be the thread consensus yet.
> 
> > > > Any improvements on that front can be a follow-up.
> > > >
> > > > Curious, has split RSS accounting shown you any issue with this patch?
> > >
> > > Split accounting has been a source of confusion for a while now: it
> > > causes that numbers-don't-add-up problem even when sampling from
> > > procfs instead of reading memory tracepoint data.
> >
> > I think you can just disable split RSS accounting if it does not work well
> > for your configuration.
> 
> There's no build-time configuration for split RSS accounting. It's not
> reasonable to expect people to carry patches just to get their memory
> usage numbers to add up.

sure, maybe send a patch to add one in that case, or to delete the split
RSS accounting code like you said below.

> 
> > Also AFAIU, every TASK_RSS_EVENTS_THRESH the page fault code does sync the
> > counters. So it does not indefinitely lurk.
> 
> If a thread incurs TASK_RSS_EVENTS_THRESH - 1 page faults and then
> sleeps for a week, all memory counters observable from userspace will
> be wrong for a week. Multiply this potential error by the number of
> threads on a typical system and you have to conclude that split RSS
> accounting produces a lot of potential uncertainty. What are we
> getting in exchange for this uncertainty?
> 
> > The tracepoint's main intended
> > use is to detect spikes which provides ample opportunity to sync the cache.
> 
> The intended use is measuring memory levels of various processes over
> time, not just detecting "spikes". In order to make sense of the
> resulting data series, we need to be able to place error bars on it.
> The presence of split RSS accounting makes those error bars much
> larger than they have to be.
> 
> > You could reduce TASK_RSS_EVENTS_THRESH in your kernel, or even just 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Daniel Colascione
On Wed, Sep 4, 2019 at 8:38 AM Michal Hocko  wrote:
> > but also for reducing
> > tracing noise. Flooding the traces makes it less useful for long traces and
> > post-processing of traces. IOW, the overhead reduction is a bonus.
>
> This is not really anything special for this tracepoint though.
> Basically any tracepoint in a hot path is in the same situation and I do
> not see a point why each of them should really invent its own way to
> throttle. Maybe there is some way to do that in the tracing subsystem
> directly.

I agree. I'd rather not special-case RSS in this way, especially not
with a hardcoded aggregation and thresholding configuration. It should
be possible to handle high-frequency trace data point aggregation in a
general way. Why shouldn't we be able to get a time series for, say,
dirty pages? Or various slab counts? IMHO, any counter on the system
ought to be observable in a uniform way.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Daniel Colascione
On Wed, Sep 4, 2019 at 7:59 AM Joel Fernandes  wrote:
>
> On Tue, Sep 03, 2019 at 10:42:53PM -0700, Daniel Colascione wrote:
> > On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  
> > wrote:
> > >
> > > On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > > > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  
> > > > wrote:
> > > > >
> > > > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > > > >  wrote:
> > > > > >
> > > > > > Useful to track how RSS is changing per TGID to detect spikes in 
> > > > > > RSS and
> > > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > > various
> > > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > > useful so I'm posting it upstream.
> > > >
> > > > It's also worth being able to turn off the per-task memory counter
> > > > caching, otherwise you'll have two levels of batching before the
> > > > counter gets updated, IIUC.
> > >
> > > I prefer to keep split RSS accounting turned on if it is available.
> >
> > Why? AFAIK, nobody's produced numbers showing that split accounting
> > has a real benefit.
>
> I am not too sure. Have you checked the original patches that added this
> stuff though? It seems to me the main win would be on big systems that have
> to pay for atomic updates.

I looked into this issue the last time I mentioned split mm
accounting. See [1]. It's my sense that the original change was
inadequately justified; Michal Hocko seems to agree. I've tried
disabling split rss accounting locally on a variety of systems ---
Android, laptop, desktop --- and failed to notice any difference. It's
possible that some difference appears at a scale beyond that to which
I have access, but if the benefit of split rss accounting is limited
to these cases, split rss accounting shouldn't be on by default, since
it comes at a cost in consistency.

[1] https://lore.kernel.org/linux-mm/20180227100234.gf15...@dhcp22.suse.cz/

> > > I think
> > > discussing split RSS accounting is a bit out of scope of this patch as 
> > > well.
> >
> > It's in-scope, because with split RSS accounting, allocated memory can
> > stay accumulated in task structs for an indefinite time without being
> > flushed to the mm. As a result, if you take the stream of virtual
> > memory management system calls that a program makes on one hand, and VM
> > counter values on the other, the two don't add up. For various kinds
> > of robustness (trace self-checking, say) it's important that various
> > sources of data add up.
> >
> > If we're adding a configuration knob that controls how often VM
> > counters get reflected in system trace points, we should also have a
> > knob to control delayed VM counter operations. The whole point is for
> > users to be able to specify how precisely they want VM counter changes
> > reported to analysis tools.
>
> We're not adding more configuration knobs.

This position doesn't seem to be the thread consensus yet.

> > > Any improvements on that front can be a follow-up.
> > >
> > > Curious, has split RSS accounting shown you any issue with this patch?
> >
> > Split accounting has been a source of confusion for a while now: it
> > causes that numbers-don't-add-up problem even when sampling from
> > procfs instead of reading memory tracepoint data.
>
> I think you can just disable split RSS accounting if it does not work well
> for your configuration.

There's no build-time configuration for split RSS accounting. It's not
reasonable to expect people to carry patches just to get their memory
usage numbers to add up.

> Also AFAIU, every TASK_RSS_EVENTS_THRESH the page fault code does sync the
> counters. So it does not indefinitely lurk.

If a thread incurs TASK_RSS_EVENTS_THRESH - 1 page faults and then
sleeps for a week, all memory counters observable from userspace will
be wrong for a week. Multiply this potential error by the number of
threads on a typical system and you have to conclude that split RSS
accounting produces a lot of potential uncertainty. What are we
getting in exchange for this uncertainty?

> The tracepoint's main intended
> use is to detect spikes which provides ample opportunity to sync the cache.

The intended use is measuring memory levels of various processes over
time, not just detecting "spikes". In order to make sense of the
resulting data series, we need to be able to place error bars on it.
The presence of split RSS accounting makes those error bars much
larger than they have to be.

> You could reduce TASK_RSS_EVENTS_THRESH in your kernel, or even just disable
> split RSS accounting if that suits you better. That would solve all the
> issues you raised, not just any potential ones that you raised here for this
> tracepoint.

I think we should just delete the split RSS accounting code unless
someone can demonstrate that it's a measurable win on a typical
system. The first priority of any system should be correctness.
Consistency is a kind of correctness. 
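Daniel's worst-case argument can be sketched numerically. The figures below are illustrative assumptions, not measurements: TASK_RSS_EVENTS_THRESH was 64 in mainline kernels of that era, 4 KiB pages are assumed, and the thread count is hypothetical.

```python
# Rough worst-case staleness bound for split RSS accounting (sketch).
# Assumptions: TASK_RSS_EVENTS_THRESH = 64 (mainline default at the time),
# 4 KiB pages, and a hypothetical 2000 sleeping threads.
TASK_RSS_EVENTS_THRESH = 64
PAGE_SIZE = 4096
threads = 2000

# Each thread can hold up to THRESH - 1 page-fault events that were never
# flushed to the mm-wide counters.
stale_pages_per_thread = TASK_RSS_EVENTS_THRESH - 1
worst_case_bytes = threads * stale_pages_per_thread * PAGE_SIZE

print(worst_case_bytes // (1024 * 1024), "MiB of RSS unaccounted in the worst case")
```

With these assumed numbers the bound comes out to a few hundred MiB, which is the scale of "error bar" the thread is arguing about.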

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Joel Fernandes
On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko  wrote:
>
> On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > > memory hogs. Several Android teams have been using this patch in various
> > > > kernel trees for half a year now. Many reported to me it is really
> > > > useful so I'm posting it upstream.
> > > >
> > > > Initial patch developed by Tim Murray. Changes I made from original 
> > > > patch:
> > > > o Prevent any additional space consumed by mm_struct.
> > > > o Keep overhead low by checking if tracing is enabled.
> > > > o Add some noise reduction and lower overhead by emitting only on
> > > >   threshold changes.
> > >
> > > Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
> > > the Linus tree (nor in linux-next).
> >
> > No, this is generated automatically by the tracepoint infrastructure when a
> > tracepoint is added.
>
> OK, I was not aware of that.
>
> > > Besides that why do we need batching in the first place. Does this have a
> > > measurable overhead? How does it differ from any other tracepoints that we
> > > have in other hotpaths (e.g.  page allocator doesn't do any checks).
> >
> > We do need batching not only for overhead reduction,
>
> What is the overhead?

The overhead is occasionally higher without the threshold (that is if we
trace every counter change). I would classify performance benefit to be
almost the same and within the noise.

For memset of 1GB data:

With threshold:
Total time for 1GB data: 684172499 nanoseconds.
Total time for 1GB data: 692379986 nanoseconds.
Total time for 1GB data: 760023463 nanoseconds.
Total time for 1GB data: 669291457 nanoseconds.
Total time for 1GB data: 729722783 nanoseconds.

Without threshold
Total time for 1GB data: 722505810 nanoseconds.
Total time for 1GB data: 648724292 nanoseconds.
Total time for 1GB data: 643102853 nanoseconds.
Total time for 1GB data: 641815282 nanoseconds.
Total time for 1GB data: 828561187 nanoseconds.  <-- outlier but it did happen.
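A quick post-processing sketch of the timings above (not part of the patch): the per-set spread is larger than the gap between the two means, which supports the within-noise reading.

```python
import statistics

# memset-of-1GB timings quoted above, in nanoseconds.
with_threshold = [684172499, 692379986, 760023463, 669291457, 729722783]
without_threshold = [722505810, 648724292, 643102853, 641815282, 828561187]

mean_with = statistics.mean(with_threshold)
mean_without = statistics.mean(without_threshold)

print(f"with threshold:    mean {mean_with / 1e6:.1f} ms, "
      f"stdev {statistics.pstdev(with_threshold) / 1e6:.1f} ms")
print(f"without threshold: mean {mean_without / 1e6:.1f} ms, "
      f"stdev {statistics.pstdev(without_threshold) / 1e6:.1f} ms")
```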

> > but also for reducing
> > tracing noise. Flooding the traces makes it less useful for long traces and
> > post-processing of traces. IOW, the overhead reduction is a bonus.
>
> This is not really anything special for this tracepoint though.
> Basically any tracepoint in a hot path is in the same situation and I do
> not see a point why each of them should really invent its own way to
> throttle. Maybe there is some way to do that in the tracing subsystem
> directly.

I am not sure if there is a way to do this easily. Add to that, the fact that
you still have to call into trace events. Why call into it at all, if you can
filter in advance and have a sane filtering default?

The bigger improvement with the threshold is the number of trace records are
almost halved by using a threshold. The number of records went from 4.6K to
2.6K.

I don't see any drawbacks with using a threshold. There is no overhead either
way. For system without split RSS accounting, the reduction in number of
trace records would be even higher, significantly reducing the consumption of
the ftrace buffer and the noise that people have to deal with.

Hope you agree now?

thanks,

 - Joel
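The record-count reduction Joel reports can be illustrated with a small simulation. This is a sketch with a synthetic workload (the delta distribution is an assumption); the 4.6K-to-2.6K figures in the thread came from a real trace, not from this model.

```python
import random

# Sketch: how the power-of-2 threshold mask cuts trace records.
# TRACE_MM_COUNTER_THRESHOLD = 128 as in the patch; the workload below
# (mostly single-page deltas) is an assumed synthetic mix.
THRESHOLD = 128
thresh_mask = ~(THRESHOLD - 1)

random.seed(0)
counter = 0
records_without = 0
records_with = 0
for _ in range(100_000):
    value = random.choice([1, 1, 1, 1, -1, 16, -16])  # page-count delta
    counter += value
    records_without += 1                 # unfiltered: every change traced
    # Threshold roll-over check, mirroring mm_trace_rss_stat():
    if (counter & thresh_mask) != ((counter - value) & thresh_mask):
        records_with += 1

print(f"{records_without} records unfiltered, {records_with} with threshold")
```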



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Michal Hocko
On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > memory hogs. Several Android teams have been using this patch in various
> > > kernel trees for half a year now. Many reported to me it is really
> > > useful so I'm posting it upstream.
> > > 
> > > Initial patch developed by Tim Murray. Changes I made from original patch:
> > > o Prevent any additional space consumed by mm_struct.
> > > o Keep overhead low by checking if tracing is enabled.
> > > o Add some noise reduction and lower overhead by emitting only on
> > >   threshold changes.
> > 
> > Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
> > the Linus tree (nor in linux-next).
> 
> No, this is generated automatically by the tracepoint infrastructure when a
> tracepoint is added.

OK, I was not aware of that.

> > Besides that why do we need batching in the first place. Does this have a
> > measurable overhead? How does it differ from any other tracepoints that we
> > have in other hotpaths (e.g.  page allocator doesn't do any checks).
> 
> We do need batching not only for overhead reduction,

What is the overhead?

> but also for reducing
> tracing noise. Flooding the traces makes it less useful for long traces and
> post-processing of traces. IOW, the overhead reduction is a bonus.

This is not really anything special for this tracepoint though.
Basically any tracepoint in a hot path is in the same situation and I do
not see a point why each of them should really invent its own way to
throttle. Maybe there is some way to do that in the tracing subsystem
directly.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Joel Fernandes
On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > memory hogs. Several Android teams have been using this patch in various
> > kernel trees for half a year now. Many reported to me it is really
> > useful so I'm posting it upstream.
> > 
> > Initial patch developed by Tim Murray. Changes I made from original patch:
> > o Prevent any additional space consumed by mm_struct.
> > o Keep overhead low by checking if tracing is enabled.
> > o Add some noise reduction and lower overhead by emitting only on
> >   threshold changes.
> 
> Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
> the Linus tree (nor in linux-next).

No, this is generated automatically by the tracepoint infrastructure when a
tracepoint is added.

> Besides that why do we need batching in the first place. Does this have a
> measurable overhead? How does it differ from any other tracepoints that we
> have in other hotpaths (e.g.  page allocator doesn't do any checks).

We do need batching not only for overhead reduction, but also for reducing
tracing noise. Flooding the traces makes it less useful for long traces and
post-processing of traces. IOW, the overhead reduction is a bonus.

I have not looked at the page allocator paths, we don't currently use that
for the purposes of this rss_stat tracepoint.

> Other than that this looks reasonable to me.

Thanks!

 - Joel


> 
> > Co-developed-by: Tim Murray 
> > Signed-off-by: Tim Murray 
> > Signed-off-by: Joel Fernandes (Google) 
> > 
> > ---
> > 
> > v1->v2: Added more commit message.
> > 
> > Cc: carmenjack...@google.com
> > Cc: mayankgu...@google.com
> > Cc: dan...@google.com
> > Cc: rost...@goodmis.org
> > Cc: minc...@kernel.org
> > Cc: a...@linux-foundation.org
> > Cc: kernel-t...@android.com
> > 
> >  include/linux/mm.h  | 14 +++---
> >  include/trace/events/kmem.h | 21 +
> >  mm/memory.c | 20 
> >  3 files changed, 52 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 0334ca97c584..823aaf759bdb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1671,19 +1671,27 @@ static inline unsigned long get_mm_counter(struct 
> > mm_struct *mm, int member)
> > return (unsigned long)val;
> >  }
> >  
> > +void mm_trace_rss_stat(int member, long count, long value);
> > +
> >  static inline void add_mm_counter(struct mm_struct *mm, int member, long 
> > value)
> >  {
> > > -   atomic_long_add(value, &mm->rss_stat.count[member]);
> > > +   long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);
> > +
> > +   mm_trace_rss_stat(member, count, value);
> >  }
> >  
> >  static inline void inc_mm_counter(struct mm_struct *mm, int member)
> >  {
> > > -   atomic_long_inc(&mm->rss_stat.count[member]);
> > > +   long count = atomic_long_inc_return(&mm->rss_stat.count[member]);
> > +
> > +   mm_trace_rss_stat(member, count, 1);
> >  }
> >  
> >  static inline void dec_mm_counter(struct mm_struct *mm, int member)
> >  {
> > > -   atomic_long_dec(&mm->rss_stat.count[member]);
> > > +   long count = atomic_long_dec_return(&mm->rss_stat.count[member]);
> > +
> > +   mm_trace_rss_stat(member, count, -1);
> >  }
> >  
> >  /* Optimized variant when page is already known not to be PageAnon */
> > diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> > index eb57e3037deb..8b88e04fafbf 100644
> > --- a/include/trace/events/kmem.h
> > +++ b/include/trace/events/kmem.h
> > @@ -315,6 +315,27 @@ TRACE_EVENT(mm_page_alloc_extfrag,
> > __entry->change_ownership)
> >  );
> >  
> > +TRACE_EVENT(rss_stat,
> > +
> > +   TP_PROTO(int member,
> > +   long count),
> > +
> > +   TP_ARGS(member, count),
> > +
> > +   TP_STRUCT__entry(
> > +   __field(int, member)
> > +   __field(long, size)
> > +   ),
> > +
> > +   TP_fast_assign(
> > +   __entry->member = member;
> > +   __entry->size = (count << PAGE_SHIFT);
> > +   ),
> > +
> > +   TP_printk("member=%d size=%ldB",
> > +   __entry->member,
> > +   __entry->size)
> > +   );
> >  #endif /* _TRACE_KMEM_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e2bb51b6242e..9d81322c24a3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -72,6 +72,8 @@
> >  #include 
> >  #include 
> >  
> > > +#include <trace/events/kmem.h>
> > +
> >  #include 
> >  #include 
> >  #include 
> > @@ -140,6 +142,24 @@ static int __init init_zero_pfn(void)
> >  }
> >  core_initcall(init_zero_pfn);
> >  
> > +/*
> > + * This threshold is the boundary in the value space, that the counter has 
> > to
> > + * advance before we trace it. Should be a power of 2. It is to reduce 
> > unwanted
> > + * trace overhead. The counter is in units of number of pages.
> > + */
> > +#define 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Joel Fernandes
On Tue, Sep 03, 2019 at 10:42:53PM -0700, Daniel Colascione wrote:
> On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  wrote:
> >
> > On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > > >  wrote:
> > > > >
> > > > > Useful to track how RSS is changing per TGID to detect spikes in RSS 
> > > > > and
> > > > > memory hogs. Several Android teams have been using this patch in 
> > > > > various
> > > > > kernel trees for half a year now. Many reported to me it is really
> > > > > useful so I'm posting it upstream.
> > >
> > > It's also worth being able to turn off the per-task memory counter
> > > caching, otherwise you'll have two levels of batching before the
> > > counter gets updated, IIUC.
> >
> > I prefer to keep split RSS accounting turned on if it is available.
> 
> Why? AFAIK, nobody's produced numbers showing that split accounting
> has a real benefit.

I am not too sure. Have you checked the original patches that added this
stuff though? It seems to me the main win would be on big systems that have
to pay for atomic updates.

> > I think
> > discussing split RSS accounting is a bit out of scope of this patch as well.
> 
> It's in-scope, because with split RSS accounting, allocated memory can
> stay accumulated in task structs for an indefinite time without being
> flushed to the mm. As a result, if you take the stream of virtual
> memory management system calls that a program makes on one hand, and VM
> counter values on the other, the two don't add up. For various kinds
> of robustness (trace self-checking, say) it's important that various
> sources of data add up.
> 
> If we're adding a configuration knob that controls how often VM
> counters get reflected in system trace points, we should also have a
> knob to control delayed VM counter operations. The whole point is for
> users to be able to specify how precisely they want VM counter changes
> reported to analysis tools.

We're not adding more configuration knobs.

> > Any improvements on that front can be a follow-up.
> >
> > Curious, has split RSS accounting shown you any issue with this patch?
> 
> Split accounting has been a source of confusion for a while now: it
> causes that numbers-don't-add-up problem even when sampling from
> procfs instead of reading memory tracepoint data.

I think you can just disable split RSS accounting if it does not work well
for your configuration. The problems you describe are common to all existing
ways of getting RSS accounting, not just this tracepoint, hence I mentioned
it is a bit out of scope.

Also AFAIU, every TASK_RSS_EVENTS_THRESH the page fault code does sync the
counters. So it does not indefinitely lurk. The tracepoint's main intended
use is to detect spikes which provides ample opportunity to sync the cache.

You could reduce TASK_RSS_EVENTS_THRESH in your kernel, or even just disable
split RSS accounting if that suits you better. That would solve all the
issues you raised, not just any potential ones that you raised here for this
tracepoint.

thanks,

 - Joel




Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-04 Thread Michal Hocko
On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> Useful to track how RSS is changing per TGID to detect spikes in RSS and
> memory hogs. Several Android teams have been using this patch in various
> kernel trees for half a year now. Many reported to me it is really
> useful so I'm posting it upstream.
> 
> Initial patch developed by Tim Murray. Changes I made from original patch:
> o Prevent any additional space consumed by mm_struct.
> o Keep overhead low by checking if tracing is enabled.
> o Add some noise reduction and lower overhead by emitting only on
>   threshold changes.

Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
the Linus tree (nor in linux-next). Besides that why do we need batching
in the first place. Does this have a measurable overhead? How does it
differ from any other tracepoints that we have in other hotpaths (e.g.
page allocator doesn't do any checks).

Other than that this looks reasonable to me.

> Co-developed-by: Tim Murray 
> Signed-off-by: Tim Murray 
> Signed-off-by: Joel Fernandes (Google) 
> 
> ---
> 
> v1->v2: Added more commit message.
> 
> Cc: carmenjack...@google.com
> Cc: mayankgu...@google.com
> Cc: dan...@google.com
> Cc: rost...@goodmis.org
> Cc: minc...@kernel.org
> Cc: a...@linux-foundation.org
> Cc: kernel-t...@android.com
> 
>  include/linux/mm.h  | 14 +++---
>  include/trace/events/kmem.h | 21 +
>  mm/memory.c | 20 
>  3 files changed, 52 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0334ca97c584..823aaf759bdb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1671,19 +1671,27 @@ static inline unsigned long get_mm_counter(struct 
> mm_struct *mm, int member)
>   return (unsigned long)val;
>  }
>  
> +void mm_trace_rss_stat(int member, long count, long value);
> +
>  static inline void add_mm_counter(struct mm_struct *mm, int member, long 
> value)
>  {
> - atomic_long_add(value, &mm->rss_stat.count[member]);
> + long count = atomic_long_add_return(value, &mm->rss_stat.count[member]);
> +
> + mm_trace_rss_stat(member, count, value);
>  }
>  
>  static inline void inc_mm_counter(struct mm_struct *mm, int member)
>  {
> - atomic_long_inc(&mm->rss_stat.count[member]);
> + long count = atomic_long_inc_return(&mm->rss_stat.count[member]);
> +
> + mm_trace_rss_stat(member, count, 1);
>  }
>  
>  static inline void dec_mm_counter(struct mm_struct *mm, int member)
>  {
> - atomic_long_dec(&mm->rss_stat.count[member]);
> + long count = atomic_long_dec_return(&mm->rss_stat.count[member]);
> +
> + mm_trace_rss_stat(member, count, -1);
>  }
>  
>  /* Optimized variant when page is already known not to be PageAnon */
> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
> index eb57e3037deb..8b88e04fafbf 100644
> --- a/include/trace/events/kmem.h
> +++ b/include/trace/events/kmem.h
> @@ -315,6 +315,27 @@ TRACE_EVENT(mm_page_alloc_extfrag,
>   __entry->change_ownership)
>  );
>  
> +TRACE_EVENT(rss_stat,
> +
> + TP_PROTO(int member,
> + long count),
> +
> + TP_ARGS(member, count),
> +
> + TP_STRUCT__entry(
> + __field(int, member)
> + __field(long, size)
> + ),
> +
> + TP_fast_assign(
> + __entry->member = member;
> + __entry->size = (count << PAGE_SHIFT);
> + ),
> +
> + TP_printk("member=%d size=%ldB",
> + __entry->member,
> + __entry->size)
> + );
>  #endif /* _TRACE_KMEM_H */
>  
>  /* This part must be outside protection */
> diff --git a/mm/memory.c b/mm/memory.c
> index e2bb51b6242e..9d81322c24a3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -72,6 +72,8 @@
>  #include 
>  #include 
>  
> +#include <trace/events/kmem.h>
> +
>  #include 
>  #include 
>  #include 
> @@ -140,6 +142,24 @@ static int __init init_zero_pfn(void)
>  }
>  core_initcall(init_zero_pfn);
>  
> +/*
> + * This threshold is the boundary in the value space, that the counter has to
> + * advance before we trace it. Should be a power of 2. It is to reduce 
> unwanted
> + * trace overhead. The counter is in units of number of pages.
> + */
> +#define TRACE_MM_COUNTER_THRESHOLD 128
> +
> +void mm_trace_rss_stat(int member, long count, long value)
> +{
> + long thresh_mask = ~(TRACE_MM_COUNTER_THRESHOLD - 1);
> +
> + if (!trace_rss_stat_enabled())
> + return;
> +
> + /* Threshold roll-over, trace it */
> + if ((count & thresh_mask) != ((count - value) & thresh_mask))
> + trace_rss_stat(member, count);
> +}
>  
>  #if defined(SPLIT_RSS_COUNTING)
>  
> -- 
> 2.23.0.187.g17f5b7556c-goog

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Daniel Colascione
On Tue, Sep 3, 2019 at 10:15 PM Joel Fernandes  wrote:
>
> On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> > On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  wrote:
> > >
> > > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> > >  wrote:
> > > >
> > > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > > memory hogs. Several Android teams have been using this patch in various
> > > > kernel trees for half a year now. Many reported to me it is really
> > > > useful so I'm posting it upstream.
> >
> > It's also worth being able to turn off the per-task memory counter
> > caching, otherwise you'll have two levels of batching before the
> > counter gets updated, IIUC.
>
> I prefer to keep split RSS accounting turned on if it is available.

Why? AFAIK, nobody's produced numbers showing that split accounting
has a real benefit.

> I think
> discussing split RSS accounting is a bit out of scope of this patch as well.

It's in-scope, because with split RSS accounting, allocated memory can
stay accumulated in task structs for an indefinite time without being
flushed to the mm. As a result, if you take the stream of virtual
memory management system calls that a program makes on one hand, and VM
counter values on the other, the two don't add up. For various kinds
of robustness (trace self-checking, say) it's important that various
sources of data add up.

If we're adding a configuration knob that controls how often VM
counters get reflected in system trace points, we should also have a
knob to control delayed VM counter operations. The whole point is for
users to be able to specify how precisely they want VM counter changes
reported to analysis tools.

> Any improvements on that front can be a follow-up.
>
> Curious, has split RSS accounting shown you any issue with this patch?

Split accounting has been a source of confusion for a while now: it
causes that numbers-don't-add-up problem even when sampling from
procfs instead of reading memory tracepoint data.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Suren Baghdasaryan
On Tue, Sep 3, 2019 at 10:02 PM Joel Fernandes  wrote:
>
> On Tue, Sep 03, 2019 at 09:44:51PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> >  wrote:
> > >
> > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > memory hogs. Several Android teams have been using this patch in various
> > > kernel trees for half a year now. Many reported to me it is really
> > > useful so I'm posting it upstream.
> > >
> > > Initial patch developed by Tim Murray. Changes I made from original patch:
> > > o Prevent any additional space consumed by mm_struct.
> > > o Keep overhead low by checking if tracing is enabled.
> > > o Add some noise reduction and lower overhead by emitting only on
> > >   threshold changes.
> > >
> > > Co-developed-by: Tim Murray 
> > > Signed-off-by: Tim Murray 
> > > Signed-off-by: Joel Fernandes (Google) 
> > >
> > > [snip -- patch quoted in full earlier in the thread]
> > > +/*
> > > + * This threshold is the boundary in the value space, that the counter 
> > > has to
> > > + * advance before we trace it. Should be a power of 2. It is to reduce 
> > > unwanted
> > > + * trace overhead. The counter is in units of number of pages.
> > > + */
> > > +#define TRACE_MM_COUNTER_THRESHOLD 128
> >
> > IIUC the counter has to change by 128 pages (512kB assuming 4kB pages)
> > before the change gets traced. Would it make sense to make this step
> > size configurable? For a system with limited memory size change of
> > 512kB might be considerable while on systems with plenty of memory
> > that might be negligible. Not even mentioning possible difference in
> > page sizes. Maybe something like

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Joel Fernandes
On Tue, Sep 03, 2019 at 09:51:20PM -0700, Daniel Colascione wrote:
> On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  wrote:
> >
> > On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
> >  wrote:
> > >
> > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > memory hogs. Several Android teams have been using this patch in various
> > > kernel trees for half a year now. Many reported to me it is really
> > > useful so I'm posting it upstream.
> 
> It's also worth being able to turn off the per-task memory counter
> caching, otherwise you'll have two levels of batching before the
> counter gets updated, IIUC.

I prefer to keep split RSS accounting turned on if it is available. I think
discussing split RSS accounting is a bit out of scope of this patch as well.
Any improvements on that front can be a follow-up.

Curious, has split RSS accounting shown you any issue with this patch?

thanks,

 - Joel



Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Joel Fernandes
On Tue, Sep 03, 2019 at 09:44:51PM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
>  wrote:
> >
> > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > memory hogs. Several Android teams have been using this patch in various
> > kernel trees for half a year now. Many reported to me it is really
> > useful so I'm posting it upstream.
> >
> > Initial patch developed by Tim Murray. Changes I made from original patch:
> > o Prevent any additional space consumed by mm_struct.
> > o Keep overhead low by checking if tracing is enabled.
> > o Add some noise reduction and lower overhead by emitting only on
> >   threshold changes.
> >
> > Co-developed-by: Tim Murray 
> > Signed-off-by: Tim Murray 
> > Signed-off-by: Joel Fernandes (Google) 
> >
> > [snip -- patch quoted in full earlier in the thread]
> > +/*
> > + * This threshold is the boundary in the value space, that the counter has 
> > to
> > + * advance before we trace it. Should be a power of 2. It is to reduce 
> > unwanted
> > + * trace overhead. The counter is in units of number of pages.
> > + */
> > +#define TRACE_MM_COUNTER_THRESHOLD 128
> 
> IIUC the counter has to change by 128 pages (512kB assuming 4kB pages)
> before the change gets traced. Would it make sense to make this step
> size configurable? For a system with limited memory size change of
> 512kB might be considerable while on systems with plenty of memory
> that might be negligible. Not even mentioning possible difference in
> page sizes. Maybe something like
> /sys/kernel/debug/tracing/rss_step_order with
> TRACE_MM_COUNTER_THRESHOLD=(1<<rss_step_order)?
> > +void mm_trace_rss_stat(int member, long count, long value)
> > +{
> > +   long thresh_mask = ~(TRACE_MM_COUNTER_THRESHOLD - 1);
> > +
> > +   if (!trace_rss_stat_enabled())
> > +   return;
> > +
> > +   /* Threshold 

Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Daniel Colascione
On Tue, Sep 3, 2019 at 9:45 PM Suren Baghdasaryan  wrote:
>
> On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
>  wrote:
> >
> > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > memory hogs. Several Android teams have been using this patch in various
> > kernel trees for half a year now. Many reported to me it is really
> > useful so I'm posting it upstream.

It's also worth being able to turn off the per-task memory counter
caching, otherwise you'll have two levels of batching before the
counter gets updated, IIUC.


Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

2019-09-03 Thread Suren Baghdasaryan
On Tue, Sep 3, 2019 at 1:09 PM Joel Fernandes (Google)
 wrote:
>
> Useful to track how RSS is changing per TGID to detect spikes in RSS and
> memory hogs. Several Android teams have been using this patch in various
> kernel trees for half a year now. Many reported to me it is really
> useful so I'm posting it upstream.
>
> Initial patch developed by Tim Murray. Changes I made from original patch:
> o Prevent any additional space consumed by mm_struct.
> o Keep overhead low by checking if tracing is enabled.
> o Add some noise reduction and lower overhead by emitting only on
>   threshold changes.
>
> Co-developed-by: Tim Murray 
> Signed-off-by: Tim Murray 
> Signed-off-by: Joel Fernandes (Google) 
>
> [snip -- patch quoted in full earlier in the thread]
>
> +/*
> + * This threshold is the boundary in the value space, that the counter has to
> + * advance before we trace it. Should be a power of 2. It is to reduce 
> unwanted
> + * trace overhead. The counter is in units of number of pages.
> + */
> +#define TRACE_MM_COUNTER_THRESHOLD 128

IIUC the counter has to change by 128 pages (512kB assuming 4kB pages)
before the change gets traced. Would it make sense to make this step
size configurable? For a system with limited memory size change of
512kB might be considerable while on systems with plenty of memory
that might be negligible. Not even mentioning possible difference in
page sizes. Maybe something like
/sys/kernel/debug/tracing/rss_step_order with
TRACE_MM_COUNTER_THRESHOLD=(1<<rss_step_order)?
> +
> +void mm_trace_rss_stat(int member, long count, long value)
> +{
> +   long thresh_mask = ~(TRACE_MM_COUNTER_THRESHOLD - 1);
> +
> +   if (!trace_rss_stat_enabled())
> +   return;
> +
> +   /* Threshold roll-over, trace it */
> +   if ((count & thresh_mask) != ((count - value) & thresh_mask))
> +   trace_rss_stat(member, count);
> +}
>
>  #if defined(SPLIT_RSS_COUNTING)
>
> --
> 2.23.0.187.g17f5b7556c-goog
>