Re: [PATCH v2] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 4:48 PM, Daniel Stone wrote:
> On Tue, 20 Apr 2021 at 14:46,  > wrote:
>
> On 4/20/21 3:34 PM, Daniel Stone wrote:
> > On Fri, 16 Apr 2021 at 13:34, Peter Enderborg    >> wrote:
> >     This adds a total used dma-buf memory. Details
> >     can be found in debugfs, however it is not for everyone
> >     and not always available. dma-buf are indirect allocated by
> >     userspace. So with this value we can monitor and detect
> >     userspace applications that have problems.
> >
> >
> > FWIW, this won't work super well for Android where gralloc is 
> implemented as a system service, so all graphics usage will instantly be 
> accounted to it.
>
> This resource allocation is a big part of why we need it. Why should it 
> not work?
>
>
> Sorry, I'd somehow completely misread that as being locally rather than 
> globally accounted. Given that, it's more correct, just also not super useful.
>
> Some drivers export allocation tracepoints which you could use if you have a 
> decent userspace tracing infrastructure. Short of that, many drivers export 
> this kind of thing through debugfs already. I think a better long-term 
> direction is probably getting accounting from dma-heaps rather than extending 
> core dmabuf itself.
>
> Cheers,
> Daniel 

Debugfs and traces are useful once you have pinned down your problem. Debugfs
does not exist on commercial devices, so we need some hints on what is going on,
and tracepoints need active debugging that is enabled before the problem occurs.
A metric on dma-buf can be sent with a bug report.
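
As an illustration of how such a metric could be picked up for a bug report,
here is a minimal userspace sketch in C. It assumes the patch is applied and
that /proc/meminfo then carries a "DmaBufTotal:" line in the usual
"Key: value kB" layout. The helper names meminfo_lookup_kb() and
read_dmabuf_total_kb() are made up for this example and are not part of the
patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "Key:   <value> kB" line in a /proc/meminfo-style text buffer.
 * Returns the value in kB, or -1 if the key is absent or malformed.
 */
long meminfo_lookup_kb(const char *text, const char *key)
{
	size_t keylen;
	const char *line = text;

	assert(text != NULL && key != NULL);
	keylen = strlen(key);

	while (*line != '\0') {
		long value;

		/* A match must start the line and be followed by ':' */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
			if (sscanf(line + keylen + 1, "%ld", &value) == 1)
				return value;
			return -1;
		}
		/* Advance to the start of the next line */
		line = strchr(line, '\n');
		if (line == NULL)
			break;
		line++;
	}
	return -1;
}

/*
 * Read /proc/meminfo and return DmaBufTotal in kB, or -1 when the file
 * cannot be read or the (hypothetical) field is not present.
 */
long read_dmabuf_total_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/meminfo", "r");

	if (f == NULL)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return meminfo_lookup_kb(buf, "DmaBufTotal");
}
```

On a kernel without the patch, read_dmabuf_total_kb() would simply return -1,
so a bug report collector could treat the field as optional.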


Re: [PATCH v2] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 3:34 PM, Daniel Stone wrote:
> Hi Peter,
>
> On Fri, 16 Apr 2021 at 13:34, Peter Enderborg  > wrote:
>
> This adds a total used dma-buf memory. Details
> can be found in debugfs, however it is not for everyone
> and not always available. dma-buf are indirect allocated by
> userspace. So with this value we can monitor and detect
> userspace applications that have problems.
>
>
> FWIW, this won't work super well for Android where gralloc is implemented as 
> a system service, so all graphics usage will instantly be accounted to it.
>
> Cheers,
> Daniel 

This resource allocation is a big part of why we need it. Why should it not 
work? 


Re: [PATCH v5] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 1:52 PM, Mike Rapoport wrote:
> On Tue, Apr 20, 2021 at 10:45:21AM +, peter.enderb...@sony.com wrote:
>> On 4/20/21 11:41 AM, Mike Rapoport wrote:
>>> Hello Peter,
>>>
>>> On Tue, Apr 20, 2021 at 09:26:00AM +, peter.enderb...@sony.com wrote:
>>>> On 4/20/21 10:58 AM, Daniel Vetter wrote:
>>>>> On Sat, Apr 17, 2021 at 06:38:35PM +0200, Peter Enderborg wrote:
>>>>>> This adds a total used dma-buf memory. Details
>>>>>> can be found in debugfs, however it is not for everyone
>>>>>> and not always available. dma-buf are indirect allocated by
>>>>>> userspace. So with this value we can monitor and detect
>>>>>> userspace applications that have problems.
>>>>>>
>>>>>> Signed-off-by: Peter Enderborg
>>>>> So there have been tons of discussions around how to track dma-buf and
>>>>> why, and I really need to understand the use-case here first I think. proc
>>>>> uapi is as much forever as anything else, and depending what you're doing
>>>>> this doesn't make any sense at all:
>>>>>
>>>>> - on most linux systems dma-buf are only instantiated for shared buffer.
>>>>>   So there this gives you a fairly meaningless number and not anything
>>>>>   reflecting gpu memory usage at all.
>>>>>
>>>>> - on Android all buffers are allocated through dma-buf afaik. But there
>>>>>   we've recently had some discussions about how exactly we should track
>>>>>   all this, and the conclusion was that most of this should be solved by
>>>>>   cgroups long term. So if this is for Android, then I don't think adding
>>>>>   random quick stop-gaps to upstream is a good idea (because it's a pretty
>>>>>   long list of patches that have come up on this).
>>>>>
>>>>> So what is this for?
>>>> For the overview. dma-buf today only has debugfs for info. Debugfs
>>>> is not allowed by Google for use in Android. So this aggregates the
>>>> information so we can see what is going on in the system.
>>>  
>>> Can you send an example debugfs output to see what data are we talking
>>> about?
>> Sure. This is on an idle system. I'm not sure why you need it. The problem is
>> partly that debugfs is not accessible on a commercial device.
> I wanted to see what kind of information is there, but I didn't think it's
> that long :)
Sorry, but it was making a point.
>  
>> Dma-buf Objects:
>> size        flags       mode        count       exp_name        buf name    ino
>> 00032768    0002    00080007    0002    ion-system-1006-allocator-servi    dmabuf17728    07400825    dmabuf17728
>>     Attached Devices:
>> Total 0 devices attached
>>
>> 11083776    0002    00080007    0003    ion-system-1006-allocator-servi    dmabuf17727    07400824    dmabuf17727
>>     Attached Devices:
>>     ae0.qcom,mdss_mdp:qcom,smmu_sde_unsec_cb
>> Total 1 devices attached
>>
>> 00032768    0002    00080007    0002    ion-system-1006-allocator-servi    dmabuf17726    07400823    dmabuf17726
>>     Attached Devices:
>> Total 0 devices attached
>>
>> 11083776    0002    00080007    0002    ion-system-1006-allocator-servi    dmabuf17725    07400822    dmabuf17725
>>     Attached Devices:
>>     ae0.qcom,mdss_mdp:qcom,smmu_sde_unsec_cb
>> Total 1 devices attached
> ...
>
>> Total 654 objects, 744144896 bytes
>  
> Isn't the size from the first column also available in fdinfo?
>
> Is there anything that prevents monitoring those?
>
Yes, selinux.


Re: [PATCH v5] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 1:14 PM, Daniel Vetter wrote:
> On Tue, Apr 20, 2021 at 09:26:00AM +, peter.enderb...@sony.com wrote:
>> On 4/20/21 10:58 AM, Daniel Vetter wrote:
>>> On Sat, Apr 17, 2021 at 06:38:35PM +0200, Peter Enderborg wrote:
>>>> This adds a total used dma-buf memory. Details
>>>> can be found in debugfs, however it is not for everyone
>>>> and not always available. dma-buf are indirect allocated by
>>>> userspace. So with this value we can monitor and detect
>>>> userspace applications that have problems.
>>>>
>>>> Signed-off-by: Peter Enderborg
>>> So there have been tons of discussions around how to track dma-buf and
>>> why, and I really need to understand the use-case here first I think. proc
>>> uapi is as much forever as anything else, and depending what you're doing
>>> this doesn't make any sense at all:
>>>
>>> - on most linux systems dma-buf are only instantiated for shared buffer.
>>>   So there this gives you a fairly meaningless number and not anything
>>>   reflecting gpu memory usage at all.
>>>
>>> - on Android all buffers are allocated through dma-buf afaik. But there
>>>   we've recently had some discussions about how exactly we should track
>>>   all this, and the conclusion was that most of this should be solved by
>>>   cgroups long term. So if this is for Android, then I don't think adding
>>>   random quick stop-gaps to upstream is a good idea (because it's a pretty
>>>   long list of patches that have come up on this).
>>>
>>> So what is this for?
>> For the overview. dma-buf today only has debugfs for info. Debugfs
>> is not allowed by Google for use in Android. So this aggregates the information
>> so we can see what is going on in the system.
>>
>> And the standard LKML response to that is "SHOW ME THE CODE".
> Yes. Except this extends to how exactly this is supposed to be used in
> userspace and acted upon.
>
>> Once the top memcg has aggregated information on dma-buf, it may be
>> a better source for meminfo. But that also implies that dma-buf requires
>> memcg.
>>
>> And I don't see any problem with replacing this with something better when
>> it is ready.
> The thing is, this is uapi. Once it's merged we cannot, ever, replace it.
> It must be kept around forever, or a very close approximation thereof. So
> merging this with the justification that we can fix it later on or replace
> isn't going to happen.

It is intended to stay relevant as long as there is a dma-buf. This is a proper
metric. If a newer implementation does not give the same result, it is
not doing it right and is not better. If a memcg counter or a global_zone
counter does the same thing, it can replace the suggested method.

But I don't think they will. A dma-buf does not have to be mapped to a process,
and VRAM is not covered by the current global_zone counters. All of them
would be very nice to have in some form, but that won't change what the
correct value of "Total" is.
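
For readers who want the accounting in one place: the quoted patch adds the
buffer size to a global atomic in dma_buf_export(), subtracts it again in
dma_buf_release(), and reports the sum shifted down to pages. The following is
only a userspace model of that bookkeeping using C11 atomics, not kernel code.
The model_* names and the 4 KiB page size are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Total exported bytes; stands in for dma_buf_global_allocated */
static atomic_long model_global_allocated;

/* Mirrors the hook in dma_buf_export(): account a newly exported buffer */
void model_export(size_t size)
{
	atomic_fetch_add(&model_global_allocated, (long)size);
}

/* Mirrors the hook in dma_buf_release(): drop the buffer from the total */
void model_release(size_t size)
{
	long old = atomic_fetch_sub(&model_global_allocated, (long)size);

	/* Releasing more than was exported would be an accounting bug */
	assert(old >= (long)size);
}

/* Counterpart of dma_buf_allocated_pages(): bytes rounded down to pages */
long model_allocated_pages(void)
{
	return atomic_load(&model_global_allocated) >> MODEL_PAGE_SHIFT;
}
```

The model makes the point of the reply concrete: the total depends only on
export and release, not on whether any process currently maps the buffer.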


> -Daniel
>
>>> -Daniel
>>>
>>>> ---
>>>>  drivers/dma-buf/dma-buf.c | 12 
>>>>  fs/proc/meminfo.c |  5 -
>>>>  include/linux/dma-buf.h   |  1 +
>>>>  3 files changed, 17 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>> index f264b70c383e..4dc37cd4293b 100644
>>>> --- a/drivers/dma-buf/dma-buf.c
>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>>>  };
>>>>
>>>>  static struct dma_buf_list db_list;
>>>> +static atomic_long_t dma_buf_global_allocated;
>>>>
>>>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>>>  {
>>>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>>>>  	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>>>>  		dma_resv_fini(dmabuf->resv);
>>>>
>>>> +	atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>>>>  	module_put(dmabuf->owner);
>>>>  	kfree(dmabuf->name);
>>>>  	kfree(dmabuf);
>>>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>  	mutex_lock(&db_list.lock);
>>>>  	list_add(&dmabuf->list_node, &db_list.head);
>>>>  	mutex_unlock(&db_list.lock);
>>>> +	atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>>>
>>>>  	return dmabuf;
>>>>
>>>> @@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>>>
>>>> +/**
>>>> + * dma_buf_allocated_pages - Return the used nr of pages
>>>> + * allocated for dma-buf
>>>> + */
>>>> +long dma_buf_allocated_pages(void)
>>>> +{
>>>> +	return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_DEBUG_FS
>>>>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>>>  {
>>>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>>>> index 6fa761c9cc78..ccc7c40c8db7 100644
>>>> --- a/fs/proc/meminfo.c
>>>> +++ b/fs/proc/meminfo.c
>>>> @@ -16,6 +16,7 @@
>>>>  #ifdef CONFIG_CMA
>>>>  #include <linux/cma.h>

Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 1:04 PM, Michal Hocko wrote:
> On Tue 20-04-21 09:25:51, peter.enderb...@sony.com wrote:
>> On 4/20/21 11:12 AM, Michal Hocko wrote:
>>> On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
>>>>>> But that isn't really system memory at all, it's just allocated device
>>>>>> memory.
>>>>> OK, that was not really clear to me. So this is not really accounted to
>>>>> MemTotal? If that is really the case then reporting it into the oom
>>>>> report is completely pointless and I am not even sure /proc/meminfo is
>>>>> the right interface either. It would just add more confusion I am
>>>>> afraid.
>>>>>
>>>> Why is it confusing? Documentation is quite clear:
>>> Because a single counter without a wider context cannot be put into any
>>> reasonable context. There is no notion of the total amount of device
>>> memory usable for dma-buf. As Christian explained some of it can be RAM
>>> based. So a single number is rather pointless on its own in many cases.
>>>
>>> Or let me just ask. What can you tell from dma-buf: $FOO kB in its
>>> current form?
>> It is better to be blind?
> No it is better to have a sensible counter that can be reasoned about.
> So far you are only claiming that having something is better than
> nothing and I would agree with you if that was a debugging one off
> interface. But /proc/meminfo and other proc files have to be maintained
> with future portability in mind. This is not a dumping ground for _some_
> counters that might be interesting at the _current_ moment. E.g. what
> happens if somebody wants to have a per device resp. memory based
> dma-buf data? Are you going to change the semantic or add another
> 2 counters?

This is the DmaBufTotal. It is the upper limit. If it is not there, it is
something else.

And when we have better resolution in measuring it, it would make sense
to add a DmaBufVram, DmaBufMemGC or whatever we can pick up.

This is what we can measure today.


Re: [PATCH v5] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 10:58 AM, Daniel Vetter wrote:
> On Sat, Apr 17, 2021 at 06:38:35PM +0200, Peter Enderborg wrote:
>> This adds a total used dma-buf memory. Details
>> can be found in debugfs, however it is not for everyone
>> and not always available. dma-buf are indirect allocated by
>> userspace. So with this value we can monitor and detect
>> userspace applications that have problems.
>>
>> Signed-off-by: Peter Enderborg 
> So there have been tons of discussions around how to track dma-buf and
> why, and I really need to understand the use-case here first I think. proc
> uapi is as much forever as anything else, and depending what you're doing
> this doesn't make any sense at all:
>
> - on most linux systems dma-buf are only instantiated for shared buffer.
>   So there this gives you a fairly meaningless number and not anything
>   reflecting gpu memory usage at all.
>
> - on Android all buffers are allocated through dma-buf afaik. But there
>   we've recently had some discussions about how exactly we should track
>   all this, and the conclusion was that most of this should be solved by
>   cgroups long term. So if this is for Android, then I don't think adding
>   random quick stop-gaps to upstream is a good idea (because it's a pretty
>   long list of patches that have come up on this).
>
> So what is this for?

For the overview. dma-buf today only has debugfs for info. Debugfs
is not allowed by Google for use in Android. So this aggregates the information
so we can see what is going on in the system.

And the standard LKML response to that is "SHOW ME THE CODE".

Once the top memcg has aggregated information on dma-buf, it may be
a better source for meminfo. But that also implies that dma-buf requires memcg.

And I don't see any problem with replacing this with something better when it
is ready.

> -Daniel
>
>> ---
>>  drivers/dma-buf/dma-buf.c | 12 
>>  fs/proc/meminfo.c |  5 -
>>  include/linux/dma-buf.h   |  1 +
>>  3 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index f264b70c383e..4dc37cd4293b 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>  };
>>  
>>  static struct dma_buf_list db_list;
>> +static atomic_long_t dma_buf_global_allocated;
>>  
>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>  {
>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>>  if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>>  dma_resv_fini(dmabuf->resv);
>>  
>> +atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>>  module_put(dmabuf->owner);
>>  kfree(dmabuf->name);
>>  kfree(dmabuf);
>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>  mutex_lock(&db_list.lock);
>>  list_add(&dmabuf->list_node, &db_list.head);
>>  mutex_unlock(&db_list.lock);
>> +atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>  
>>  return dmabuf;
>>  
>> @@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
>>  }
>>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>  
>> +/**
>> + * dma_buf_allocated_pages - Return the used nr of pages
>> + * allocated for dma-buf
>> + */
>> +long dma_buf_allocated_pages(void)
>> +{
>> +return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>> +}
>> +
>>  #ifdef CONFIG_DEBUG_FS
>>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>  {
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index 6fa761c9cc78..ccc7c40c8db7 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -16,6 +16,7 @@
>>  #ifdef CONFIG_CMA
>>  #include <linux/cma.h>
>>  #endif
>> +#include <linux/dma-buf.h>
>>  #include <asm/page.h>
>>  #include "internal.h"
>>  
>> @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>  show_val_kb(m, "CmaFree:",
>>  global_zone_page_state(NR_FREE_CMA_PAGES));
>>  #endif
>> -
>> +#ifdef CONFIG_DMA_SHARED_BUFFER
>> +show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
>> +#endif
>>  hugetlb_report_meminfo(m);
>>  
>>  arch_report_meminfo(m);
>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>> index efdc56b9d95f..5b05816bd2cd 100644
>> --- a/include/linux/dma-buf.h
>> +++ b/include/linux/dma-buf.h
>> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>>   unsigned long);
>>  int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>> +long dma_buf_allocated_pages(void);
>>  #endif /* __DMA_BUF_H__ */
>> -- 
>> 2.17.1

Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg
On 4/20/21 11:12 AM, Michal Hocko wrote:
> On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
>>>> But that isn't really system memory at all, it's just allocated device
>>>> memory.
>>> OK, that was not really clear to me. So this is not really accounted to
>>> MemTotal? If that is really the case then reporting it into the oom
>>> report is completely pointless and I am not even sure /proc/meminfo is
>>> the right interface either. It would just add more confusion I am
>>> afraid.
>>>  
>> Why is it confusing? Documentation is quite clear:
> Because a single counter without a wider context cannot be put into any
> reasonable context. There is no notion of the total amount of device
> memory usable for dma-buf. As Christian explained some of it can be RAM
> based. So a single number is rather pointless on its own in many cases.
>
> Or let me just ask. What can you tell from dma-buf: $FOO kB in its
> current form?

Is it better to be blind? The value can still be used as a relative metric.
You collect the data and see how it changes, and when you see
an unexpected change you start to dig in. It for sure won't tell you which line
in your application has a bug, but it might be an indicator that
a new game triggers a leak. And it is very well specified: it is exactly the
size of mapped dma-buf. To know what the dma-buf is used for, you need to know
other parts of the system.
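
A minimal sketch of that "relative metric" idea, written as plain C so it can
run in any userspace collector. It assumes the caller has already gathered
periodic samples of the counter in kB. The function name
dmabuf_growth_suspicious() and the threshold policy are illustrative choices,
not something proposed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Flag an unexpected rise in a series of counter samples (in kB).
 * Returns 1 when the newest sample exceeds the lowest earlier sample
 * by more than threshold_kb, otherwise 0.
 */
int dmabuf_growth_suspicious(const long *samples_kb, size_t n,
			     long threshold_kb)
{
	long lowest;
	size_t i;

	assert(samples_kb != NULL || n == 0);
	if (n < 2)
		return 0;	/* nothing to compare against yet */

	/* Baseline: the lowest value seen before the newest sample */
	lowest = samples_kb[0];
	for (i = 1; i + 1 < n; i++) {
		if (samples_kb[i] < lowest)
			lowest = samples_kb[i];
	}
	return samples_kb[n - 1] - lowest > threshold_kb;
}
```

A positive result says nothing about which application is at fault. It is only
the trigger to start digging with the more detailed tools.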


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Peter.Enderborg

>> But that isn't really system memory at all, it's just allocated device
>> memory.
> OK, that was not really clear to me. So this is not really accounted to
> MemTotal? If that is really the case then reporting it into the oom
> report is completely pointless and I am not even sure /proc/meminfo is
> the right interface either. It would just add more confusion I am
> afraid.
>  

Why is it confusing? Documentation is quite clear:

"Provides information about distribution and utilization of memory. This
varies by architecture and compile options."

A topology with VRAM fits that description very well. The point is to have an
overview.


>>> See where I am heading?
>> Yeah, totally. Thanks for pointing this out.
>>
>> Suggestions how to handle that?
> As I've pointed out in previous reply we do have an API to account per
> node memory but now that you have brought up that this is not something
> we account as a regular memory then this doesn't really fit into that
> model. But maybe I am just confused.



Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Peter.Enderborg
On 4/19/21 5:00 PM, Michal Hocko wrote:
> On Mon 19-04-21 12:41:58, peter.enderb...@sony.com wrote:
>> On 4/19/21 2:16 PM, Michal Hocko wrote:
>>> On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
>>>> This adds a total used dma-buf memory. Details
>>>> can be found in debugfs, however it is not for everyone
>>>> and not always available. dma-buf are indirect allocated by
>>>> userspace. So with this value we can monitor and detect
>>>> userspace applications that have problems.
>>> The changelog would benefit from more background on why this is needed,
>>> and who is the primary consumer of that value.
>>>
>>> I cannot really comment on the dma-buf internals but I have two remarks.
>>> Documentation/filesystems/proc.rst needs an update with the counter
>>> explanation and secondly is this information useful for OOM situations
>>> analysis? If yes then show_mem should dump the value as well.
>>>
>>> From the implementation point of view, is there any reason why this
>>> hasn't used the existing global_node_page_state infrastructure?
>> I will fix the doc in the next version. I'm not sure what you expect the
>> commit message to include.
> As I've said. Usual justification covers answers to following questions
>   - Why do we need it?
>   - Why is the existing data insufficient?
>   - Who is supposed to use the data and for what?
>
> I can see an answer for the first two questions (because this can be a
> lot of memory and the existing infrastructure is not production suitable
> - debugfs). But the changelog doesn't really explain who is going to use
> the new data. Is this a monitoring to raise an early alarm when the
> value grows? Is this for debugging misbehaving drivers? How is it
> valuable for those?
>
>> The function of the meminfo is: (From Documentation/filesystems/proc.rst)
>>
>> "Provides information about distribution and utilization of memory."
> True. Yet we do not export any random counters, do we?
>
>> I'm not the designer of dma-buf; I think of global_node_page_state as a kernel
>> internal.
> It provides a node specific and optimized counters. Is this a good fit
> with your new counter? Or the NUMA locality is of no importance?

Sounds good to me. If Christian Koenig thinks it is good, I will use that.
Among the drivers, only virtio uses global_node_page_state, if
that matters.


>
>> dma-buf is a device driver that provides a function so I might be
>> on the outside. However I also see that it might be relevant for a OOM.
>> It is memory that can be freed by killing userspace processes.
>>
>> The show_mem thing. Should it be a separate patch?
> This is up to you but if you want to expose the counter then send it in
> one series.
>


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Peter.Enderborg
On 4/19/21 2:16 PM, Michal Hocko wrote:
> On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
>> This adds a total used dma-buf memory. Details
>> can be found in debugfs, however it is not for everyone
>> and not always available. dma-buf are indirect allocated by
>> userspace. So with this value we can monitor and detect
>> userspace applications that have problems.
> The changelog would benefit from more background on why this is needed,
> and who is the primary consumer of that value.
>
> I cannot really comment on the dma-buf internals but I have two remarks.
> Documentation/filesystems/proc.rst needs an update with the counter
> explanation and secondly is this information useful for OOM situations
> analysis? If yes then show_mem should dump the value as well.
>
> From the implementation point of view, is there any reason why this
> hasn't used the existing global_node_page_state infrastructure?

I will fix the doc in the next version. I'm not sure what you expect the commit
message to include.

The function of the meminfo is: (From Documentation/filesystems/proc.rst)

"Provides information about distribution and utilization of memory."

I'm not the designer of dma-buf; I think of global_node_page_state as a kernel
internal. dma-buf is a device driver that provides a function, so I might be
on the outside. However, I also see that it might be relevant for an OOM.
It is memory that can be freed by killing userspace processes.

The show_mem thing. Should it be a separate patch?






Re: [External] [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter.Enderborg
On 4/17/21 3:07 PM, Muchun Song wrote:
> On Sat, Apr 17, 2021 at 6:41 PM Peter Enderborg
>  wrote:
>> This adds a total used dma-buf memory. Details
>> can be found in debugfs, however it is not for everyone
>> and not always available. dma-buf are indirect allocated by
>> userspace. So with this value we can monitor and detect
>> userspace applications that have problems.
>>
>> Signed-off-by: Peter Enderborg 
>> ---
>>  drivers/dma-buf/dma-buf.c | 13 +
>>  fs/proc/meminfo.c |  5 -
>>  include/linux/dma-buf.h   |  1 +
>>  3 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index f264b70c383e..197e5c45dd26 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>  };
>>
>>  static struct dma_buf_list db_list;
>> +static atomic_long_t dma_buf_global_allocated;
>>
>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>  {
>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>> if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>> dma_resv_fini(dmabuf->resv);
>>
>> +   atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>> module_put(dmabuf->owner);
>> kfree(dmabuf->name);
>> kfree(dmabuf);
>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>> mutex_lock(&db_list.lock);
>> list_add(&dmabuf->list_node, &db_list.head);
>> mutex_unlock(&db_list.lock);
>> +   atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>
>> return dmabuf;
>>
>> @@ -1346,6 +1349,16 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
>>  }
>>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>
>> +/**
>> + * dma_buf_allocated_pages - Return the used nr of pages
>> + * allocated for dma-buf
>> + */
>> +long dma_buf_allocated_pages(void)
>> +{
>> +   return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>> +}
>> +EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
> dma_buf_allocated_pages is only called from fs/proc/meminfo.c.
> I am confused why it should be exported. If it won't be called
> from the driver module, we should not export it.

Ah. I thought you did not want the GPL restriction. I don't have a real
opinion about it. It is written to follow the rest of the module.
It is not needed for using dma-buf in a kernel module, but I
don't see any reason for hiding it either.


> Thanks.
>
>> +
>>  #ifdef CONFIG_DEBUG_FS
>>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>  {
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index 6fa761c9cc78..ccc7c40c8db7 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -16,6 +16,7 @@
>>  #ifdef CONFIG_CMA
>>  #include <linux/cma.h>
>>  #endif
>> +#include <linux/dma-buf.h>
>>  #include <asm/page.h>
>>  #include "internal.h"
>>
>> @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>> show_val_kb(m, "CmaFree:",
>> global_zone_page_state(NR_FREE_CMA_PAGES));
>>  #endif
>> -
>> +#ifdef CONFIG_DMA_SHARED_BUFFER
>> +   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
>> +#endif
>> hugetlb_report_meminfo(m);
>>
>> arch_report_meminfo(m);
>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>> index efdc56b9d95f..5b05816bd2cd 100644
>> --- a/include/linux/dma-buf.h
>> +++ b/include/linux/dma-buf.h
>> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>>  unsigned long);
>>  int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>> +long dma_buf_allocated_pages(void);
>>  #endif /* __DMA_BUF_H__ */
>> --
>> 2.17.1
>>


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter.Enderborg
On 4/17/21 1:54 PM, Christian König wrote:
> On 17.04.21 at 13:20, peter.enderb...@sony.com wrote:
>> On 4/17/21 12:59 PM, Christian König wrote:
>>> On 17.04.21 at 12:40, Peter Enderborg wrote:
>>>> This adds a total used dma-buf memory. Details
>>>> can be found in debugfs, however it is not for everyone
>>>> and not always available. dma-buf are indirect allocated by
>>>> userspace. So with this value we can monitor and detect
>>>> userspace applications that have problems.
>>>>
>>>> Signed-off-by: Peter Enderborg
>>> Reviewed-by: Christian König 
>>>
>>> How do you want to upstream this?
>> I don't understand that question. The patch applies on Torvalds 5.12-rc7,
>> but I guess 5.13 is what we work on right now.
>
> Yeah, but how do you want to get it into Linus tree?
>
> I can push it together with other DMA-buf patches through drm-misc-next and 
> then Dave will send it to Linus for inclusion in 5.13.
>
> But could be that you are pushing multiple changes towards Linus through some 
> other branch. In this case I'm fine if you pick that way instead if you want 
> to keep your patches together for some reason.
>
This is dma-buf functionality, so it makes very much sense that you, as
maintainer of dma-buf, pick it up the way you usually send patches. I don't
have any other path for this patch.

Thx!

Peter.


> Christian.
>
>>
>>>> ---
>>>>  drivers/dma-buf/dma-buf.c | 13 +
>>>>  fs/proc/meminfo.c |  5 -
>>>>  include/linux/dma-buf.h   |  1 +
>>>>  3 files changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>> index f264b70c383e..197e5c45dd26 100644
>>>> --- a/drivers/dma-buf/dma-buf.c
>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>>>  };
>>>>
>>>>  static struct dma_buf_list db_list;
>>>> +static atomic_long_t dma_buf_global_allocated;
>>>>
>>>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>>>  {
>>>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>>>>  	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>>>>  		dma_resv_fini(dmabuf->resv);
>>>>
>>>> +	atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>>>>  	module_put(dmabuf->owner);
>>>>  	kfree(dmabuf->name);
>>>>  	kfree(dmabuf);
>>>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
>>>>  	mutex_lock(&db_list.lock);
>>>>  	list_add(&dmabuf->list_node, &db_list.head);
>>>>  	mutex_unlock(&db_list.lock);
>>>> +	atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>>>
>>>>  	return dmabuf;
>>>>
>>>> @@ -1346,6 +1349,16 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>>>
>>>> +/**
>>>> + * dma_buf_allocated_pages - Return the used nr of pages
>>>> + * allocated for dma-buf
>>>> + */
>>>> +long dma_buf_allocated_pages(void)
>>>> +{
>>>> +	return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
>>>> +
>>>>  #ifdef CONFIG_DEBUG_FS
>>>>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>>>  {
>>>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>>>> index 6fa761c9cc78..ccc7c40c8db7 100644
>>>> --- a/fs/proc/meminfo.c
>>>> +++ b/fs/proc/meminfo.c
>>>> @@ -16,6 +16,7 @@
>>>>  #ifdef CONFIG_CMA
>>>>  #include <linux/cma.h>
>>>>  #endif
>>>> +#include <linux/dma-buf.h>
>>>>  #include <asm/page.h>
>>>>  #include "internal.h"
>>>>
>>>> @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>>>>  	show_val_kb(m, "CmaFree:    ",
>>>>  		    global_zone_page_state(NR_FREE_CMA_PAGES));
>>>>  #endif
>>>> -
>>>> +#ifdef CONFIG_DMA_SHARED_BUFFER
>>>> +	show_val_kb(m, "DmaBufTotal:    ", dma_buf_allocated_pages());
>>>> +#endif
>>>>  	hugetlb_report_meminfo(m);
>>>>
>>>>  	arch_report_meminfo(m);
>>>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>>>> index efdc56b9d95f..5b05816bd2cd 100644
>>>> --- a/include/linux/dma-buf.h
>>>> +++ b/include/linux/dma-buf.h
>>>> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
>>>>  		 unsigned long);
>>>>  int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>>>  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>>> +long dma_buf_allocated_pages(void);
>>>>  #endif /* __DMA_BUF_H__ */
>


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter.Enderborg
On 4/17/21 12:59 PM, Christian König wrote:
> Am 17.04.21 um 12:40 schrieb Peter Enderborg:
>> This adds a total used dma-buf memory. Details
>> can be found in debugfs, however it is not for everyone
>> and not always available. dma-buf are indirect allocated by
>> userspace. So with this value we can monitor and detect
>> userspace applications that have problems.
>>
>> Signed-off-by: Peter Enderborg 
>
> Reviewed-by: Christian König 
>
> How do you want to upstream this?

I don't understand that question. The patch applies to Torvalds' 5.12-rc7,
but I guess 5.13 is what we are working on right now.

>
>> ---
>>   drivers/dma-buf/dma-buf.c | 13 +
>>   fs/proc/meminfo.c |  5 -
>>   include/linux/dma-buf.h   |  1 +
>>   3 files changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index f264b70c383e..197e5c45dd26 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>   };
>>     static struct dma_buf_list db_list;
>> +static atomic_long_t dma_buf_global_allocated;
>>     static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int 
>> buflen)
>>   {
>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>>   if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>>   dma_resv_fini(dmabuf->resv);
>>   +    atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>>   module_put(dmabuf->owner);
>>   kfree(dmabuf->name);
>>   kfree(dmabuf);
>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
>> dma_buf_export_info *exp_info)
>>   mutex_lock(&db_list.lock);
>>   list_add(&dmabuf->list_node, &db_list.head);
>>   mutex_unlock(&db_list.lock);
>> +    atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>     return dmabuf;
>>   @@ -1346,6 +1349,16 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
>> dma_buf_map *map)
>>   }
>>   EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>   +/**
>> + * dma_buf_allocated_pages - Return the used nr of pages
>> + * allocated for dma-buf
>> + */
>> +long dma_buf_allocated_pages(void)
>> +{
>> +    return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>> +}
>> +EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
>> +
>>   #ifdef CONFIG_DEBUG_FS
>>   static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>   {
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index 6fa761c9cc78..ccc7c40c8db7 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -16,6 +16,7 @@
>>   #ifdef CONFIG_CMA
>>   #include <linux/cma.h>
>>   #endif
>> +#include <linux/dma-buf.h>
>>   #include <asm/page.h>
>>   #include "internal.h"
>>   @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void 
>> *v)
>>   show_val_kb(m, "CmaFree:    ",
>>   global_zone_page_state(NR_FREE_CMA_PAGES));
>>   #endif
>> -
>> +#ifdef CONFIG_DMA_SHARED_BUFFER
>> +    show_val_kb(m, "DmaBufTotal:    ", dma_buf_allocated_pages());
>> +#endif
>>   hugetlb_report_meminfo(m);
>>     arch_report_meminfo(m);
>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>> index efdc56b9d95f..5b05816bd2cd 100644
>> --- a/include/linux/dma-buf.h
>> +++ b/include/linux/dma-buf.h
>> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct 
>> *,
>>    unsigned long);
>>   int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>   void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>> +long dma_buf_allocated_pages(void);
>>   #endif /* __DMA_BUF_H__ */
>


Re: [External] [PATCH v3] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-17 Thread Peter.Enderborg
On 4/17/21 5:05 AM, Muchun Song wrote:
> On Sat, Apr 17, 2021 at 12:08 AM Peter Enderborg wrote:
>> This adds a total used dma-buf memory. Details
>> can be found in debugfs, however it is not for everyone
>> and not always available. dma-buf are indirect allocated by
>> userspace. So with this value we can monitor and detect
>> userspace applications that have problems.
> I want to know more details about the problems.
> Can you share what problems you have encountered?
>
> Thanks.

What do you expect to be relevant for the kernel? Applications that leak
are not that important. These types of buffers are important for Android
applications, and Android has moved away from ION buffers, which had
metrics. dma-buf usage easily gets to 5-10 percent of the total amount of RAM.

This provides that information to end users or application developers
using commercial devices. The end user gets to know why their device
is running out of memory.
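
As an illustration of how a bug-report tool might consume such a counter, here is a minimal userspace sketch in C that picks one field out of meminfo-formatted text. It assumes a kernel carrying this patch, so that a `DmaBufTotal:` line is present; the helper name `meminfo_field_kb` is hypothetical and not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value in kB of the meminfo-style line that starts with
 * "key" (for example "DmaBufTotal:"), or -1 if no such line exists
 * or the value cannot be parsed.
 * Hypothetical helper, written for illustration only.
 */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t key_len = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, key_len) == 0) {
            long kb;

            /* %ld skips the padding spaces that follow the key. */
            if (sscanf(line + key_len, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;    /* step past the newline to the next line */
    }
    return -1;
}
```

A caller would read /proc/meminfo into a buffer and call `meminfo_field_kb(buf, "DmaBufTotal:")`; a return value of -1 then means the running kernel does not export the field.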


>> Signed-off-by: Peter Enderborg 
>> ---
>>  drivers/dma-buf/dma-buf.c | 12 
>>  fs/proc/meminfo.c |  5 -
>>  include/linux/dma-buf.h   |  1 +
>>  3 files changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index f264b70c383e..d40fff2ae1fa 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -37,6 +37,7 @@ struct dma_buf_list {
>>  };
>>
>>  static struct dma_buf_list db_list;
>> +static atomic_long_t dma_buf_global_allocated;
>>
>>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>>  {
>> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
>> if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
>> dma_resv_fini(dmabuf->resv);
>>
>> +   atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
>> module_put(dmabuf->owner);
>> kfree(dmabuf->name);
>> kfree(dmabuf);
>> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
>> dma_buf_export_info *exp_info)
>> mutex_lock(&db_list.lock);
>> list_add(&dmabuf->list_node, &db_list.head);
>> mutex_unlock(&db_list.lock);
>> +   atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>>
>> return dmabuf;
>>
>> @@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
>> dma_buf_map *map)
>>  }
>>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>>
>> +/**
>> + * dma_buf_get_size - Return the used nr pages by dma-buf
>> + */
>> +long dma_buf_allocated_pages(void)
>> +{
>> +   return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
>> +}
>> +EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);
> Why need "EXPORT_SYMBOL_GPL"?
This is what all the other exported functions in this module use. I don't see any
reason for this one to be different.
>
>> +
>>  #ifdef CONFIG_DEBUG_FS
>>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>  {
>> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
>> index 6fa761c9cc78..ccc7c40c8db7 100644
>> --- a/fs/proc/meminfo.c
>> +++ b/fs/proc/meminfo.c
>> @@ -16,6 +16,7 @@
>>  #ifdef CONFIG_CMA
>>  #include <linux/cma.h>
>>  #endif
>> +#include <linux/dma-buf.h>
>>  #include <asm/page.h>
>>  #include "internal.h"
>>
>> @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>> show_val_kb(m, "CmaFree:",
>> global_zone_page_state(NR_FREE_CMA_PAGES));
>>  #endif
>> -
>> +#ifdef CONFIG_DMA_SHARED_BUFFER
>> +   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
>> +#endif
>> hugetlb_report_meminfo(m);
>>
>> arch_report_meminfo(m);
>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>> index efdc56b9d95f..5b05816bd2cd 100644
>> --- a/include/linux/dma-buf.h
>> +++ b/include/linux/dma-buf.h
>> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct 
>> *,
>>  unsigned long);
>>  int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>>  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>> +long dma_buf_allocated_pages(void);
>>  #endif /* __DMA_BUF_H__ */
>> --
>> 2.17.1
>>
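
The accounting in the quoted patch can be modelled in userspace roughly as follows. This is a sketch with C11 atomics rather than the kernel's atomic_long_t API; every `sketch_` name and the 4 KiB page size are assumptions made for illustration. The kB conversion mirrors the shift that meminfo applies to page counts before printing.

```c
#include <stdatomic.h>

#define SKETCH_PAGE_SHIFT 12    /* assumption: 4 KiB pages */

/* Global byte counter, zero-initialised like any static object. */
static atomic_long sketch_allocated_bytes;

/* Models the atomic_long_add() placed in dma_buf_export(). */
void sketch_export(long size)
{
    atomic_fetch_add(&sketch_allocated_bytes, size);
}

/* Models the atomic_long_sub() placed in dma_buf_release(). */
void sketch_release(long size)
{
    atomic_fetch_sub(&sketch_allocated_bytes, size);
}

/* Bytes to whole pages; any sub-page remainder is truncated. */
long sketch_allocated_pages(void)
{
    return atomic_load(&sketch_allocated_bytes) >> SKETCH_PAGE_SHIFT;
}

/* Pages to kB, the unit that meminfo prints. */
long sketch_pages_to_kb(long pages)
{
    return pages << (SKETCH_PAGE_SHIFT - 10);
}
```

With these assumptions, exporting an 8192-byte and a 4096-byte buffer gives three pages, which would be reported as 12 kB. Keeping the counter in bytes and shifting only on read means export and release stay a single atomic operation each.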


Re: [PATCH] debugfs: Make debugfs_allow RO after init

2021-04-06 Thread Peter.Enderborg
On 4/5/21 11:39 PM, Kees Cook wrote:
> Since debugfs_allow is only set at boot time during __init, make it
> read-only after being set.
>
> Cc: Peter Enderborg 
> Fixes: a24c6f7bc923 ("debugfs: Add access restriction option")
> Signed-off-by: Kees Cook 
> ---
>  fs/debugfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
> index 22e86ae4dd5a..1d252164d97b 100644
> --- a/fs/debugfs/inode.c
> +++ b/fs/debugfs/inode.c
> @@ -35,7 +35,7 @@
>  static struct vfsmount *debugfs_mount;
>  static int debugfs_mount_count;
>  static bool debugfs_registered;
> -static unsigned int debugfs_allow = DEFAULT_DEBUGFS_ALLOW_BITS;
> +static unsigned int debugfs_allow __ro_after_init = 
> DEFAULT_DEBUGFS_ALLOW_BITS;
>  
>  /*
>   * Don't allow access attributes to be changed whilst the kernel is locked 
> down

Thanks. Looks good to me.

You can add:

Reviewed-by: Peter Enderborg 



[PATCH v2 2/5] selinux: Move policydb to pointer structure

2018-01-26 Thread peter.enderborg
From: Peter Enderborg 

To be able to use RCU locks we need to address the policydb
through a pointer. This patch adds a pointer structure to
replace the static policydb.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 274 ++---
 1 file changed, 149 insertions(+), 125 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 47d8030..21400bd 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -90,7 +90,6 @@ int selinux_policycap_nnp_nosuid_transition;
 static DEFINE_RWLOCK(policy_rwlock);
 
 static struct sidtab sidtab;
-static struct policydb policydb;
 int ss_initialized;
 
 /*
@@ -120,6 +119,7 @@ struct selinux_mapping {
 struct shared_current_mapping {
struct selinux_mapping *current_mapping;
u16 current_mapping_size;
+   struct policydb policydb;
 };
 
 static struct shared_current_mapping *crm;
@@ -277,7 +277,7 @@ static void map_decision(u16 tclass, struct av_decision 
*avd,
 
 int security_mls_enabled(void)
 {
-   return policydb.mls_enabled;
+   return crm->policydb.mls_enabled;
 }
 
 /*
@@ -335,8 +335,8 @@ static int constraint_expr_eval(struct context *scontext,
case CEXPR_ROLE:
val1 = scontext->role;
val2 = tcontext->role;
-   r1 = policydb.role_val_to_struct[val1 - 1];
-   r2 = policydb.role_val_to_struct[val2 - 1];
+   r1 = crm->policydb.role_val_to_struct[val1 - 1];
+   r2 = crm->policydb.role_val_to_struct[val2 - 1];
switch (e->op) {
case CEXPR_DOM:
s[++sp] = 
ebitmap_get_bit(&r1->dominates,
@@ -501,8 +501,8 @@ static void security_dump_masked_av(struct context 
*scontext,
if (!permissions)
return;
 
-   tclass_name = sym_name(&policydb, SYM_CLASSES, tclass - 1);
-   tclass_dat = policydb.class_val_to_struct[tclass - 1];
+   tclass_name = sym_name(&crm->policydb, SYM_CLASSES, tclass - 1);
+   tclass_dat = crm->policydb.class_val_to_struct[tclass - 1];
common_dat = tclass_dat->comdatum;
 
/* init permission_names */
@@ -571,14 +571,14 @@ static void type_attribute_bounds_av(struct context 
*scontext,
struct type_datum *target;
u32 masked = 0;
 
-   source = flex_array_get_ptr(policydb.type_val_to_struct_array,
+   source = flex_array_get_ptr(crm->policydb.type_val_to_struct_array,
scontext->type - 1);
BUG_ON(!source);
 
if (!source->bounds)
return;
 
-   target = flex_array_get_ptr(policydb.type_val_to_struct_array,
+   target = flex_array_get_ptr(crm->policydb.type_val_to_struct_array,
tcontext->type - 1);
BUG_ON(!target);
 
@@ -664,13 +664,13 @@ static void context_struct_compute_av(struct context 
*scontext,
xperms->len = 0;
}
 
-   if (unlikely(!tclass || tclass > policydb.p_classes.nprim)) {
+   if (unlikely(!tclass || tclass > crm->policydb.p_classes.nprim)) {
if (printk_ratelimit())
printk(KERN_WARNING "SELinux:  Invalid class %hu\n", 
tclass);
return;
}
 
-   tclass_datum = policydb.class_val_to_struct[tclass - 1];
+   tclass_datum = crm->policydb.class_val_to_struct[tclass - 1];
 
/*
 * If a specific type enforcement rule was defined for
@@ -678,15 +678,18 @@ static void context_struct_compute_av(struct context 
*scontext,
 */
avkey.target_class = tclass;
avkey.specified = AVTAB_AV | AVTAB_XPERMS;
-   sattr = flex_array_get(policydb.type_attr_map_array, scontext->type - 
1);
+   sattr = flex_array_get(crm->policydb.type_attr_map_array,
+  scontext->type - 1);
BUG_ON(!sattr);
-   tattr = flex_array_get(policydb.type_attr_map_array, tcontext->type - 
1);
+   tattr = flex_array_get(crm->policydb.type_attr_map_array,
+  tcontext->type - 1);
BUG_ON(!tattr);
ebitmap_for_each_positive_bit(sattr, snode, i) {
ebitmap_for_each_positive_bit(tattr, tnode, j) {
avkey.source_type = i + 1;
avkey.target_type = j + 1;
-   for (node = avtab_search_node(&policydb.te_avtab, 
&avkey);
+   for (node = avtab_search_node(&crm->policydb.te_avtab,
+ &avkey);
 node;
 node = avtab_search_node_next(node, 
avkey.specified)) {
if (node->key.specified == AVTAB_ALLOWED)
@@ -700,7 +703,7 @@ static void context_struct_compute_av(struct 

[PATCH v2 1/5] selinux:Remove direct references to policydb.

2018-01-26 Thread peter.enderborg
From: Peter Enderborg 

To be able to use RCU locks we need to address the policydb
through a pointer. This preparation removes the export of the
policydb and sends pointers to it through function arguments.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/mls.c  | 69 
 security/selinux/ss/mls.h  | 37 +
 security/selinux/ss/services.c | 90 +++---
 security/selinux/ss/services.h |  3 --
 4 files changed, 114 insertions(+), 85 deletions(-)

diff --git a/security/selinux/ss/mls.c b/security/selinux/ss/mls.c
index ad982ce..b1f35d3 100644
--- a/security/selinux/ss/mls.c
+++ b/security/selinux/ss/mls.c
@@ -33,20 +33,20 @@
  * Return the length in bytes for the MLS fields of the
  * security context string representation of `context'.
  */
-int mls_compute_context_len(struct context *context)
+int mls_compute_context_len(struct policydb *p, struct context *context)
 {
int i, l, len, head, prev;
char *nm;
struct ebitmap *e;
struct ebitmap_node *node;
 
-   if (!policydb.mls_enabled)
+   if (!p->mls_enabled)
return 0;
 
len = 1; /* for the beginning ":" */
for (l = 0; l < 2; l++) {
int index_sens = context->range.level[l].sens;
-   len += strlen(sym_name(&policydb, SYM_LEVELS, index_sens - 1));
+   len += strlen(sym_name(p, SYM_LEVELS, index_sens - 1));
 
/* categories */
head = -2;
@@ -56,17 +56,17 @@ int mls_compute_context_len(struct context *context)
if (i - prev > 1) {
/* one or more negative bits are skipped */
if (head != prev) {
-   nm = sym_name(&policydb, SYM_CATS, 
prev);
+   nm = sym_name(p, SYM_CATS, prev);
len += strlen(nm) + 1;
}
-   nm = sym_name(&policydb, SYM_CATS, i);
+   nm = sym_name(p, SYM_CATS, i);
len += strlen(nm) + 1;
head = i;
}
prev = i;
}
if (prev != head) {
-   nm = sym_name(&policydb, SYM_CATS, prev);
+   nm = sym_name(p, SYM_CATS, prev);
len += strlen(nm) + 1;
}
if (l == 0) {
@@ -86,7 +86,7 @@ int mls_compute_context_len(struct context *context)
  * the MLS fields of `context' into the string `*scontext'.
  * Update `*scontext' to point to the end of the MLS fields.
  */
-void mls_sid_to_context(struct context *context,
+void mls_sid_to_context(struct policydb *p, struct context *context,
char **scontext)
 {
char *scontextp, *nm;
@@ -94,7 +94,7 @@ void mls_sid_to_context(struct context *context,
struct ebitmap *e;
struct ebitmap_node *node;
 
-   if (!policydb.mls_enabled)
+   if (!p->mls_enabled)
return;
 
scontextp = *scontext;
@@ -103,7 +103,7 @@ void mls_sid_to_context(struct context *context,
scontextp++;
 
for (l = 0; l < 2; l++) {
-   strcpy(scontextp, sym_name(&policydb, SYM_LEVELS,
+   strcpy(scontextp, sym_name(p, SYM_LEVELS,
   context->range.level[l].sens - 1));
scontextp += strlen(scontextp);
 
@@ -119,7 +119,7 @@ void mls_sid_to_context(struct context *context,
*scontextp++ = '.';
else
*scontextp++ = ',';
-   nm = sym_name(&policydb, SYM_CATS, 
prev);
+   nm = sym_name(p, SYM_CATS, prev);
strcpy(scontextp, nm);
scontextp += strlen(nm);
}
@@ -127,7 +127,7 @@ void mls_sid_to_context(struct context *context,
*scontextp++ = ':';
else
*scontextp++ = ',';
-   nm = sym_name(&policydb, SYM_CATS, i);
+   nm = sym_name(p, SYM_CATS, i);
strcpy(scontextp, nm);
scontextp += strlen(nm);
head = i;
@@ -140,7 +140,7 @@ void mls_sid_to_context(struct context *context,
*scontextp++ = '.';
else
*scontextp++ = ',';
-   nm = sym_name(&policydb, SYM_CATS, prev);
+   nm = sym_name(p, SYM_CATS, prev);

[PATCH v2 5/5] selinux: Switch locking to RCU.

2018-01-26 Thread peter.enderborg
From: Peter Enderborg 

This patch switches to using RCU locks instead of rwlocks. This has
the big advantage that it does not disable preemption.

Signed-off-by: Peter Enderborg 
Reported-by: Björn Davidsson 
---
 security/selinux/ss/services.c | 162 +
 1 file changed, 82 insertions(+), 80 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 81c5717..f142ef8 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -87,7 +87,7 @@ int selinux_policycap_alwaysnetwork;
 int selinux_policycap_cgroupseclabel;
 int selinux_policycap_nnp_nosuid_transition;
 
-static DEFINE_RWLOCK(policy_rwlock);
+static DEFINE_SPINLOCK(policy_w_lock);
 
 int ss_initialized;
 
@@ -115,14 +115,14 @@ struct selinux_mapping {
u32 perms[sizeof(u32) * 8];
 };
 
-struct shared_current_mapping {
+struct shared_rcu_mapping {
struct selinux_mapping *current_mapping;
u16 current_mapping_size;
struct policydb policydb;
struct sidtab sidtab;
 };
 
-static struct shared_current_mapping *crm;
+static struct shared_rcu_mapping *crm;
 
 static int selinux_set_mapping(struct policydb *pol,
   struct security_class_mapping *map,
@@ -791,7 +791,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
if (!ss_initialized)
return 0;
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
 
if (!user)
tclass = unmap_class(orig_tclass);
@@ -845,7 +845,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
}
 
 out:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
return rc;
 }
 
@@ -879,7 +879,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
int index;
int rc;
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
 
rc = -EINVAL;
old_context = sidtab_search(&crm->sidtab, old_sid);
@@ -941,7 +941,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
kfree(old_name);
}
 out:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
 
return rc;
 }
@@ -1029,7 +1029,7 @@ void security_compute_xperms_decision(u32 ssid,
memset(xpermd->auditallow->p, 0, sizeof(xpermd->auditallow->p));
memset(xpermd->dontaudit->p, 0, sizeof(xpermd->dontaudit->p));
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
if (!ss_initialized)
goto allow;
 
@@ -1083,7 +1083,7 @@ void security_compute_xperms_decision(u32 ssid,
}
}
 out:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
return;
 allow:
memset(xpermd->allowed->p, 0xff, sizeof(xpermd->allowed->p));
@@ -1110,7 +1110,7 @@ void security_compute_av(u32 ssid,
u16 tclass;
struct context *scontext = NULL, *tcontext = NULL;
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
avd_init(avd);
xperms->len = 0;
if (!ss_initialized)
@@ -1143,7 +1143,7 @@ void security_compute_av(u32 ssid,
context_struct_compute_av(scontext, tcontext, tclass, avd, xperms);
map_decision(orig_tclass, avd, crm->policydb.allow_unknown);
 out:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
return;
 allow:
avd->allowed = 0xffffffff;
@@ -1157,7 +1157,7 @@ void security_compute_av_user(u32 ssid,
 {
struct context *scontext = NULL, *tcontext = NULL;
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
avd_init(avd);
if (!ss_initialized)
goto allow;
@@ -1188,7 +1188,7 @@ void security_compute_av_user(u32 ssid,
 
context_struct_compute_av(scontext, tcontext, tclass, avd, NULL);
  out:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
return;
 allow:
avd->allowed = 0xffffffff;
@@ -1293,7 +1293,7 @@ static int security_sid_to_context_core(u32 sid, char 
**scontext,
rc = -EINVAL;
goto out;
}
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
if (force)
context = sidtab_search_force(&crm->sidtab, sid);
else
@@ -1306,7 +1306,7 @@ static int security_sid_to_context_core(u32 sid, char 
**scontext,
}
rc = context_struct_to_string(context, scontext, scontext_len);
 out_unlock:
-   read_unlock(&policy_rwlock);
+   rcu_read_unlock();
 out:
return rc;
 
@@ -1458,7 +1458,7 @@ static int security_context_to_sid_core(const char 
*scontext, u32 scontext_len,
goto out;
}
 
-   read_lock(&policy_rwlock);
+   rcu_read_lock();
rc = string_to_context_struct(&crm->policydb, &crm->sidtab, scontext2,
  scontext_len, &context, def_sid);
if (rc == -EINVAL && force) {
@@ -1470,7 +1470,7 @@ static int security_context_to_sid_core(const char 
*scontext, u32 scontext_len,
rc = 

[PATCH v2 0/5] selinux:Significant reduce of preempt_disable holds

2018-01-26 Thread peter.enderborg
Holding preempt_disable is very bad for low-latency tasks such as
audio, so we need to break the rule-set dependent part out of the
preempt-disabled section. By using RCU instead of an rwlock we get
efficient locking and less preemption interference.

SELinux uses a lot of read_locks. This series replaces the rwlock
with RCU, which does not hold preempt_disable.

On an Intel Xeon W3520 2.67 GHz running FC27 with 4.15.0-rc9git (+measurement),
the worst case I see is preempt_disable held for 1.2 ms in security_compute_av().
With the patch, the longest security_compute_av() takes 960 us, without
preemption disabled. There is a lot of noise in the measurement,
but a degradation is unlikely.

The preempt_disable times are also very dependent on the SELinux
rule-set.

In security_get_user_sids() we have two nested for-loops, and the
inner part calls sidtab_context_to_sid(), which calls
sidtab_search_context(), which has a for loop over a while loop where
the number of iterations depends on the rules.
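
To illustrate why the time spent depends on the rule-set, here is a
minimal userspace sketch of a chained hash table that is keyed by SID
but searched by context, so every chain has to be walked. It only
mirrors the for-over-while shape described above. All names
(toy_node, toy_table, toy_insert, toy_search_context) and the bucket
count are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUCKETS 4	/* hypothetical size, not the kernel's */

struct toy_node {
	unsigned int sid;
	const char *context;
	struct toy_node *next;
};

struct toy_table {
	struct toy_node *buckets[TOY_BUCKETS];
};

/* Insert at the head of the bucket selected by the SID. */
static void toy_insert(struct toy_table *t, struct toy_node *n)
{
	unsigned int b = n->sid % TOY_BUCKETS;

	n->next = t->buckets[b];
	t->buckets[b] = n;
}

/*
 * Search by context. The table is keyed by SID, so the lookup cannot
 * jump to one bucket: it walks every chain until it finds a match.
 * Returns the SID, or 0 if the context is not present. *steps is set
 * to the number of nodes visited, which grows with the table content.
 */
static unsigned int toy_search_context(const struct toy_table *t,
				       const char *context, size_t *steps)
{
	size_t i;

	*steps = 0;
	for (i = 0; i < TOY_BUCKETS; i++) {
		const struct toy_node *cur = t->buckets[i];

		while (cur) {
			(*steps)++;
			if (strcmp(cur->context, context) == 0)
				return cur->sid;
			cur = cur->next;
		}
	}
	return 0;
}
```

The step count for a miss equals the number of entries in the table,
which is the sense in which the hold time follows the loaded rules.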

On the test system the average lookup time is 60us and does
not change with the RCU usage.

To use RCU the structure of policydb has to be accessed through a pointer.
We need 4 patches to get there.

  [PATCH v2 1/5] selinux:Remove direct references to policydb.
  We remove direct references and pass it through function arguments.

  [PATCH v2 2/5] selinux: Move policydb to pointer structure
  Move the policydb to a dynamically allocated structure.

  [PATCH v2 3/5] selinux: Move sidtab to pointer structure
  Same as for policydb but for sidtab. They are closely related
  and should be switched at the same time.
  
  [PATCH v2 4/5] selinux: Use pointer to switch policydb and sidtab
  Now we can switch rules by switching pointers.

  [PATCH v2 5/5] selinux: Switch locking to RCU.
  We are now ready to use RCU.
  
History: V1 rwsem
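
The idea of switching rules by switching one pointer can be sketched
in userspace with C11 atomics. This is only an analogy, not the
kernel RCU API: it shows a reader taking one snapshot pointer and a
writer publishing a fully built replacement, and it does not
implement a grace period. All names (toy_policy, toy_current,
toy_publish, toy_read_versions_match) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the policydb + sidtab pair. */
struct toy_policy {
	int policydb_version;
	int sidtab_version;
};

/* The single shared pointer; starts out NULL (nothing loaded). */
static _Atomic(struct toy_policy *) toy_current;

/*
 * Reader: load the pointer once and use that snapshot for the whole
 * operation, so both members always come from the same policy load.
 */
static int toy_read_versions_match(void)
{
	struct toy_policy *p =
		atomic_load_explicit(&toy_current, memory_order_acquire);

	if (!p)
		return 1;	/* nothing loaded yet */
	return p->policydb_version == p->sidtab_version;
}

/*
 * Writer: publish a fully initialised snapshot and return the old one.
 * The caller must not free the old snapshot until no reader can still
 * hold it; that waiting step is what a grace period provides and it
 * is deliberately left out of this sketch.
 */
static struct toy_policy *toy_publish(struct toy_policy *next)
{
	return atomic_exchange_explicit(&toy_current, next,
					memory_order_acq_rel);
}
```

Because policydb and sidtab sit behind the same pointer, a reader can
never see a new policydb together with an old sidtab.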


[PATCH v2 3/5] selinux: Move sidtab to pointer structure

2018-01-26 Thread peter.enderborg
From: Peter Enderborg 

To be able to use RCU locks we need to access the sidtab through
a pointer. This moves the sidtab to a dynamically allocated structure.

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 140 ++---
 1 file changed, 74 insertions(+), 66 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 21400bd..2a8486c 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -89,7 +89,6 @@ int selinux_policycap_nnp_nosuid_transition;
 
 static DEFINE_RWLOCK(policy_rwlock);
 
-static struct sidtab sidtab;
 int ss_initialized;
 
 /*
@@ -120,6 +119,7 @@ struct shared_current_mapping {
struct selinux_mapping *current_mapping;
u16 current_mapping_size;
struct policydb policydb;
+   struct sidtab sidtab;
 };
 
 static struct shared_current_mapping *crm;
@@ -804,7 +804,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
}
tclass_datum = crm->policydb.class_val_to_struct[tclass - 1];
 
-   ocontext = sidtab_search(&sidtab, oldsid);
+   ocontext = sidtab_search(&crm->sidtab, oldsid);
if (!ocontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
__func__, oldsid);
@@ -812,7 +812,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
goto out;
}
 
-   ncontext = sidtab_search(&sidtab, newsid);
+   ncontext = sidtab_search(&crm->sidtab, newsid);
if (!ncontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
__func__, newsid);
@@ -820,7 +820,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
goto out;
}
 
-   tcontext = sidtab_search(&sidtab, tasksid);
+   tcontext = sidtab_search(&crm->sidtab, tasksid);
if (!tcontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
__func__, tasksid);
@@ -882,7 +882,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
read_lock(&policy_rwlock);
 
rc = -EINVAL;
-   old_context = sidtab_search(&sidtab, old_sid);
+   old_context = sidtab_search(&crm->sidtab, old_sid);
if (!old_context) {
printk(KERN_ERR "SELinux: %s: unrecognized SID %u\n",
   __func__, old_sid);
@@ -890,7 +890,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
}
 
rc = -EINVAL;
-   new_context = sidtab_search(&sidtab, new_sid);
+   new_context = sidtab_search(&crm->sidtab, new_sid);
if (!new_context) {
printk(KERN_ERR "SELinux: %s: unrecognized SID %u\n",
   __func__, new_sid);
@@ -1033,14 +1033,14 @@ void security_compute_xperms_decision(u32 ssid,
if (!ss_initialized)
goto allow;
 
-   scontext = sidtab_search(&sidtab, ssid);
+   scontext = sidtab_search(&crm->sidtab, ssid);
if (!scontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, ssid);
goto out;
}
 
-   tcontext = sidtab_search(&sidtab, tsid);
+   tcontext = sidtab_search(&crm->sidtab, tsid);
if (!tcontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, tsid);
@@ -1116,7 +1116,7 @@ void security_compute_av(u32 ssid,
if (!ss_initialized)
goto allow;
 
-   scontext = sidtab_search(&sidtab, ssid);
+   scontext = sidtab_search(&crm->sidtab, ssid);
if (!scontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, ssid);
@@ -1127,7 +1127,7 @@ void security_compute_av(u32 ssid,
if (ebitmap_get_bit(&crm->policydb.permissive_map, scontext->type))
avd->flags |= AVD_FLAGS_PERMISSIVE;
 
-   tcontext = sidtab_search(&sidtab, tsid);
+   tcontext = sidtab_search(&crm->sidtab, tsid);
if (!tcontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, tsid);
@@ -1162,7 +1162,7 @@ void security_compute_av_user(u32 ssid,
if (!ss_initialized)
goto allow;
 
-   scontext = sidtab_search(&sidtab, ssid);
+   scontext = sidtab_search(&crm->sidtab, ssid);
if (!scontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, ssid);
@@ -1173,7 +1173,7 @@ void security_compute_av_user(u32 ssid,
if (ebitmap_get_bit(&crm->policydb.permissive_map, scontext->type))
avd->flags |= AVD_FLAGS_PERMISSIVE;
 
-   tcontext = sidtab_search(&sidtab, tsid);
+   tcontext = sidtab_search(&crm->sidtab, tsid);
if (!tcontext) {
printk(KERN_ERR "SELinux: %s:  unrecognized SID %d\n",
   __func__, tsid);
@@ -1295,9 +1295,9 @@ static int 

[PATCH v2 4/5] selinux: Use pointer to switch policydb and sidtab

2018-01-26 Thread peter.enderborg
From: Peter Enderborg 

This is preparation for switching to RCU locks. To be able to use
RCU we need an atomically switched pointer. This adds dynamic
memory copying so that everything is reached through a single pointer.
It copies all the data structures into new ones. This is an overhead
when writing rules, but the benefit is RCU.
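
As a rough illustration of the copy-then-switch idea, the following
userspace sketch updates one rule by copying the live structure,
editing the copy, and then switching the shared pointer. It is
single-threaded on purpose and is not the code in this patch; a
concurrent version would need an ordered pointer store and a grace
period before freeing the old copy. The names toy_mapping, toy_crm
and toy_update_rule are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the structure reached through the pointer. */
struct toy_mapping {
	int nrules;
	int rules[8];	/* fixed-size storage keeps the sketch short */
};

static struct toy_mapping *toy_crm;	/* the single shared pointer */

/*
 * Change rule idx without touching the live structure:
 * copy it, edit the copy, then switch the pointer.
 * Returns the old structure, which the caller frees once no reader
 * can still be using it, or NULL if nothing was changed.
 */
static struct toy_mapping *toy_update_rule(int idx, int value)
{
	struct toy_mapping *old = toy_crm;
	struct toy_mapping *next;

	if (!old || idx < 0 || idx >= old->nrules)
		return NULL;

	next = malloc(sizeof(*next));
	if (!next)
		return NULL;

	memcpy(next, old, sizeof(*next));	/* the write-side overhead */
	next->rules[idx] = value;
	toy_crm = next;				/* the switch */
	return old;
}
```

Readers that picked up the old pointer keep seeing a consistent,
unmodified structure, which is what makes the extra copy worthwhile.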

Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 139 +++--
 1 file changed, 78 insertions(+), 61 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 2a8486c..81c5717 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -2064,76 +2064,67 @@ static int security_preserve_bools(struct policydb *p);
  */
 int security_load_policy(void *data, size_t len)
 {
-   struct policydb *oldpolicydb, *newpolicydb;
+   struct policydb *oldpolicydb;
struct sidtab oldsidtab, newsidtab;
struct selinux_mapping *oldmap = NULL, *map = NULL;
struct convert_context_args args;
-   struct shared_current_mapping *new_mapping;
struct shared_current_mapping *next_rcu;
-
+   struct shared_current_mapping *old_rcu;
u32 seqno;
u16 map_size;
int rc = 0;
struct policy_file file = { data, len }, *fp = &file;
 
-   oldpolicydb = kzalloc(2 * sizeof(*oldpolicydb), GFP_KERNEL);
-   if (!oldpolicydb) {
-   rc = -ENOMEM;
-   goto out;
-   }
-   new_mapping = kzalloc(sizeof(struct shared_current_mapping),
- GFP_KERNEL);
-   if (!new_mapping) {
-   rc = -ENOMEM;
-   goto out;
-   }
-   newpolicydb = oldpolicydb + 1;
-   next_rcu = kmalloc(sizeof(struct shared_current_mapping), GFP_KERNEL);
-   if (!next_rcu) {
-   rc = -ENOMEM;
-   goto out;
-   }
-
if (!ss_initialized) {
-   crm = kzalloc(sizeof(struct shared_current_mapping),
- GFP_KERNEL);
-   if (!crm) {
+   struct shared_current_mapping *first_mapping;
+
+   first_mapping = kzalloc(sizeof(struct shared_current_mapping),
+   GFP_KERNEL);
+   if (!first_mapping) {
rc = -ENOMEM;
goto out;
}
avtab_cache_init();
ebitmap_cache_init();
hashtab_cache_init();
-   rc = policydb_read(&crm->policydb, fp);
+   rc = policydb_read(&first_mapping->policydb, fp);
if (rc) {
avtab_cache_destroy();
ebitmap_cache_destroy();
hashtab_cache_destroy();
+   kfree(first_mapping);
goto out;
}
 
-   crm->policydb.len = len;
-   rc = selinux_set_mapping(&crm->policydb, secclass_map,
-                            &crm->current_mapping,
-                            &crm->current_mapping_size);
+   first_mapping->policydb.len = len;
+   rc = selinux_set_mapping(&first_mapping->policydb, secclass_map,
+                            &first_mapping->current_mapping,
+                            &first_mapping->current_mapping_size);
if (rc) {
-   policydb_destroy(&crm->policydb);
+   policydb_destroy(&first_mapping->policydb);
avtab_cache_destroy();
ebitmap_cache_destroy();
hashtab_cache_destroy();
+   kfree(first_mapping);
goto out;
}
 
-   rc = policydb_load_isids(&crm->policydb, &crm->sidtab);
+   rc = policydb_load_isids(&first_mapping->policydb,
+                            &first_mapping->sidtab);
if (rc) {
-   policydb_destroy(&crm->policydb);
+   policydb_destroy(&first_mapping->policydb);
avtab_cache_destroy();
ebitmap_cache_destroy();
hashtab_cache_destroy();
+   kfree(first_mapping);
goto out;
}
 
-   security_load_policycaps(&crm->policydb);
+   security_load_policycaps(&first_mapping->policydb);
+   crm = first_mapping;
+
+   smp_mb(); /* make sure that crm exist before we */
+ /* switch ss_initialized */
ss_initialized = 1;
seqno = ++latest_granting;
selinux_complete_init();
@@ -2148,30 +2139,44 @@ int security_load_policy(void *data, size_t len)
 #if 0
sidtab_hash_eval(&crm->sidtab, "sids");
 #endif
+   oldpolicydb = kzalloc(sizeof(*oldpolicydb), GFP_KERNEL);
+   if (!oldpolicydb) {
+   rc = -ENOMEM;
+   goto out;
+   }

[PATCH] selinux:Significant reduce of preempt_disable holds

2018-01-17 Thread peter.enderborg
From: Peter Enderborg 

Holding preempt_disable is very bad for low-latency tasks such as
audio, so we need to break the rule-set dependent part out of the
preempt-disabled section. By using an rwsem instead of an rwlock we
get efficient locking and less preemption interference.

SELinux uses a lot of read_locks. This patch replaces the rwlock
with rwsem/percpu_down_read(), which does not hold preempt_disable.

On an Intel Xeon W3520 2.67 GHz running FC27 with 4.15.0-rc8git (+measurement),
the worst case I see is preempt_disable held for 1.2 ms in security_compute_av().
With the patch, the longest security_compute_av() takes 960 us, without
preemption disabled. There is a lot of noise in the measurement,
but a degradation is unlikely.

The preempt_disable times are also very dependent on the SELinux
rule-set.

In security_get_user_sids() we have two nested for-loops, and the
inner part calls sidtab_context_to_sid(), which calls
sidtab_search_context(), which has a for loop over a while loop where
the number of iterations depends on the rules.

On the test system the average lookup time is 60us and does
not change with the rwsem usage.

Reported-by: Björn Davidsson 
Signed-off-by: Peter Enderborg 
---
 security/selinux/ss/services.c | 134 -
 1 file changed, 67 insertions(+), 67 deletions(-)

diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
index 33cfe5d..a3daaf2 100644
--- a/security/selinux/ss/services.c
+++ b/security/selinux/ss/services.c
@@ -87,7 +87,7 @@ int selinux_policycap_alwaysnetwork;
 int selinux_policycap_cgroupseclabel;
 int selinux_policycap_nnp_nosuid_transition;
 
-static DEFINE_RWLOCK(policy_rwlock);
+DEFINE_STATIC_PERCPU_RWSEM(policy_rwsem);
 
 static struct sidtab sidtab;
 struct policydb policydb;
@@ -779,7 +779,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
if (!ss_initialized)
return 0;
 
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
 
if (!user)
tclass = unmap_class(orig_tclass);
@@ -833,7 +833,7 @@ static int security_compute_validatetrans(u32 oldsid, u32 
newsid, u32 tasksid,
}
 
 out:
-   read_unlock(&policy_rwlock);
+   percpu_up_read(&policy_rwsem);
return rc;
 }
 
@@ -867,7 +867,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
int index;
int rc;
 
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
 
rc = -EINVAL;
old_context = sidtab_search(&sidtab, old_sid);
@@ -929,7 +929,7 @@ int security_bounded_transition(u32 old_sid, u32 new_sid)
kfree(old_name);
}
 out:
-   read_unlock(&policy_rwlock);
+   percpu_up_read(&policy_rwsem);
 
return rc;
 }
@@ -1017,7 +1017,7 @@ void security_compute_xperms_decision(u32 ssid,
memset(xpermd->auditallow->p, 0, sizeof(xpermd->auditallow->p));
memset(xpermd->dontaudit->p, 0, sizeof(xpermd->dontaudit->p));
 
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
if (!ss_initialized)
goto allow;
 
@@ -1070,7 +1070,7 @@ void security_compute_xperms_decision(u32 ssid,
}
}
 out:
-   read_unlock(&policy_rwlock);
+   percpu_up_read(&policy_rwsem);
return;
 allow:
memset(xpermd->allowed->p, 0xff, sizeof(xpermd->allowed->p));
@@ -1097,7 +1097,7 @@ void security_compute_av(u32 ssid,
u16 tclass;
struct context *scontext = NULL, *tcontext = NULL;
 
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
avd_init(avd);
xperms->len = 0;
if (!ss_initialized)
@@ -1130,7 +1130,7 @@ void security_compute_av(u32 ssid,
context_struct_compute_av(scontext, tcontext, tclass, avd, xperms);
map_decision(orig_tclass, avd, policydb.allow_unknown);
 out:
-   read_unlock(&policy_rwlock);
+   percpu_up_read(&policy_rwsem);
return;
 allow:
avd->allowed = 0xffffffff;
@@ -1144,7 +1144,7 @@ void security_compute_av_user(u32 ssid,
 {
struct context *scontext = NULL, *tcontext = NULL;
 
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
avd_init(avd);
if (!ss_initialized)
goto allow;
@@ -1175,7 +1175,7 @@ void security_compute_av_user(u32 ssid,
 
context_struct_compute_av(scontext, tcontext, tclass, avd, NULL);
  out:
-   read_unlock(&policy_rwlock);
+   percpu_up_read(&policy_rwsem);
return;
 allow:
avd->allowed = 0xffffffff;
@@ -1277,7 +1277,7 @@ static int security_sid_to_context_core(u32 sid, char 
**scontext,
rc = -EINVAL;
goto out;
}
-   read_lock(&policy_rwlock);
+   percpu_down_read(&policy_rwsem);
if (force)
context = sidtab_search_force(&sidtab, sid);
else
@@ -1290,7 +1290,7 @@ static int security_sid_to_context_core(u32 sid, char 
**scontext,
}
rc = context_struct_to_string(context, scontext, scontext_len);
 out_unlock:
-   read_unlock(&policy_rwlock);
+   

[PATCH] Add slowpath enter/exit trace events

2017-11-23 Thread peter.enderborg
From: Peter Enderborg 

The warning about slow allocations has been removed; this is
another way to fetch that information, but you need
to enable the trace. The exit event also reports
the number of retries, how long the allocation
was stalled, and the failure reason if it failed.
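
For readers decoding the exit= and msdelay= fields by hand, the
following userspace sketch spells out the values that C assigns to
the enum slowpath_exit enumerators added to mm/page_alloc.c by this
patch, and shows a tick-to-millisecond conversion in the spirit of
jiffies_to_msecs(). The helpers toy_exit_name() and
toy_ticks_to_msecs() and the hz parameter are hypothetical and are
not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Enumerators as declared in the patch; values follow from C enum rules. */
enum slowpath_exit {
	SLOWPATH_NOZONE = -16,
	SLOWPATH_COMPACT_DEFERRED,		/* -15 */
	SLOWPATH_CAN_NOT_DIRECT_RECLAIM,	/* -14 */
	SLOWPATH_RECURSION,			/* -13 */
	SLOWPATH_NO_RETRY,			/* -12 */
	SLOWPATH_COSTLY_ORDER,			/* -11 */
	SLOWPATH_OOM_VICTIM,			/* -10 */
	SLOWPATH_NO_DIRECT_RECLAIM,		/* -9 */
	SLOWPATH_ORDER				/* -8 */
};

/* Hypothetical helper: map an exit= value from the trace to its name. */
static const char *toy_exit_name(int exit_code)
{
	switch (exit_code) {
	case SLOWPATH_NOZONE:			return "SLOWPATH_NOZONE";
	case SLOWPATH_COMPACT_DEFERRED:		return "SLOWPATH_COMPACT_DEFERRED";
	case SLOWPATH_CAN_NOT_DIRECT_RECLAIM:	return "SLOWPATH_CAN_NOT_DIRECT_RECLAIM";
	case SLOWPATH_RECURSION:		return "SLOWPATH_RECURSION";
	case SLOWPATH_NO_RETRY:			return "SLOWPATH_NO_RETRY";
	case SLOWPATH_COSTLY_ORDER:		return "SLOWPATH_COSTLY_ORDER";
	case SLOWPATH_OOM_VICTIM:		return "SLOWPATH_OOM_VICTIM";
	case SLOWPATH_NO_DIRECT_RECLAIM:	return "SLOWPATH_NO_DIRECT_RECLAIM";
	case SLOWPATH_ORDER:			return "SLOWPATH_ORDER";
	default:				return "unknown";
	}
}

/*
 * Hypothetical analogue of the msdelay computation: elapsed ticks
 * converted to milliseconds for a tick rate of hz ticks per second.
 */
static uint64_t toy_ticks_to_msecs(uint64_t now, uint64_t start,
				   unsigned int hz)
{
	return (now - start) * 1000u / hz;
}
```

Only the named negative values are covered here; any other exit=
value is reported as "unknown" by this sketch.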

Signed-off-by: Peter Enderborg 
---
 include/trace/events/kmem.h | 68 +
 mm/page_alloc.c | 62 +++--
 2 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index eb57e30..bb882ca 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -315,6 +315,74 @@ TRACE_EVENT(mm_page_alloc_extfrag,
__entry->change_ownership)
 );
 
+TRACE_EVENT(mm_page_alloc_slowpath_enter,
+
+   TP_PROTO(int alloc_order,
+   nodemask_t *nodemask,
+   gfp_t gfp_flags),
+
+   TP_ARGS(alloc_order, nodemask, gfp_flags),
+
+   TP_STRUCT__entry(
+   __field(int, alloc_order)
+   __field(nodemask_t *, nodemask)
+   __field(gfp_t, gfp_flags)
+),
+
+TP_fast_assign(
+   __entry->alloc_order= alloc_order;
+   __entry->nodemask   = nodemask;
+   __entry->gfp_flags  = gfp_flags;
+),
+
+TP_printk("alloc_order=%d nodemask=%*pbl gfp_flags=%s",
+   __entry->alloc_order,
+   nodemask_pr_args(__entry->nodemask),
+   show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_page_alloc_slowpath_exit,
+
+   TP_PROTO(struct page *page,
+   int alloc_order,
+   nodemask_t *nodemask,
+   u64 alloc_start,
+   gfp_t gfp_flags,
+   int retrys,
+   int exit),
+
+   TP_ARGS(page, alloc_order, nodemask, alloc_start, gfp_flags,
+   retrys, exit),
+
+   TP_STRUCT__entry(__field(struct page *, page)
+   __field(int, alloc_order)
+   __field(nodemask_t *, nodemask)
+   __field(u64, msdelay)
+   __field(gfp_t, gfp_flags)
+   __field(int, retrys)
+   __field(int, exit)
+   ),
+
+   TP_fast_assign(
+   __entry->page= page;
+   __entry->alloc_order = alloc_order;
+   __entry->nodemask= nodemask;
+   __entry->msdelay = jiffies_to_msecs(jiffies-alloc_start);
+   __entry->gfp_flags   = gfp_flags;
+   __entry->retrys  = retrys;
+   __entry->exit= exit;
+   ),
+
+   TP_printk("page=%p alloc_order=%d nodemask=%*pbl msdelay=%llu gfp_flags=%s retrys=%d exit=%d",
+   __entry->page,
+   __entry->alloc_order,
+   nodemask_pr_args(__entry->nodemask),
+   __entry->msdelay,
+   show_gfp_flags(__entry->gfp_flags),
+   __entry->retrys,
+   __entry->exit)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5b01..bae9cb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -104,6 +104,17 @@ DEFINE_PER_CPU(struct work_struct, pcpu_drain);
 volatile unsigned long latent_entropy __latent_entropy;
 EXPORT_SYMBOL(latent_entropy);
 #endif
+enum slowpath_exit {
+   SLOWPATH_NOZONE = -16,
+   SLOWPATH_COMPACT_DEFERRED,
+   SLOWPATH_CAN_NOT_DIRECT_RECLAIM,
+   SLOWPATH_RECURSION,
+   SLOWPATH_NO_RETRY,
+   SLOWPATH_COSTLY_ORDER,
+   SLOWPATH_OOM_VICTIM,
+   SLOWPATH_NO_DIRECT_RECLAIM,
+   SLOWPATH_ORDER
+};
 
 /*
  * Array of node states.
@@ -3908,8 +3919,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int 
order,
enum compact_result compact_result;
int compaction_retries;
int no_progress_loops;
+   unsigned long alloc_start = jiffies;
unsigned int cpuset_mems_cookie;
int reserve_flags;
+   enum slowpath_exit slowpath_exit;
+   int retry_count = 0;
+
+   trace_mm_page_alloc_slowpath_enter(order,
+   ac->nodemask,
+   gfp_mask);
 
/*
 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3919,7 +3937,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 */
if (order >= MAX_ORDER) {
WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
-   return NULL;
+   slowpath_exit = SLOWPATH_ORDER;
+   goto fail;
}
 
/*
@@ -3951,8 +3970,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 */
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
-   if 

[PATCH] Add slowpath enter/exit trace events

2017-11-23 Thread peter.enderborg
From: Peter Enderborg 

The warning about slow allocations has been removed; this is
another way to fetch that information, but you need to
enable the trace events. The exit event also reports the
number of retries, how long the allocation was stalled, and
the failure reason if it failed.
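
As an illustration of how the exit event could be consumed, here is a
minimal userspace sketch that parses one line in the TP_printk format
used by the patch and maps a negative exit code back to its
enum slowpath_exit name. The struct and function names are
hypothetical and not part of the patch. The mapping assumes the enum
starts at -16 and counts up as shown in the diff; what non-negative
exit values mean is not visible in the quoted part of the patch, so
they are reported as "unknown".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of interest from one mm_page_alloc_slowpath_exit line. */
struct slowpath_exit_event {
	int alloc_order;
	unsigned long long msdelay;	/* stall time in milliseconds */
	int retrys;
	int exit_code;			/* negative: enum slowpath_exit */
};

/* Return a pointer just past "key" in line, or NULL if it is absent. */
const char *value_after(const char *line, const char *key)
{
	const char *p = strstr(line, key);

	return p ? p + strlen(key) : NULL;
}

/* Returns 0 on success, -1 if an expected field is missing. */
int parse_slowpath_exit(const char *line, struct slowpath_exit_event *ev)
{
	const char *p;

	assert(line && ev);

	p = value_after(line, "alloc_order=");
	if (!p || sscanf(p, "%d", &ev->alloc_order) != 1)
		return -1;
	p = value_after(line, "msdelay=");
	if (!p || sscanf(p, "%llu", &ev->msdelay) != 1)
		return -1;
	p = value_after(line, "retrys=");
	if (!p || sscanf(p, "%d", &ev->retrys) != 1)
		return -1;
	/* Leading space avoids matching another key that ends in "exit". */
	p = value_after(line, " exit=");
	if (!p || sscanf(p, "%d", &ev->exit_code) != 1)
		return -1;
	return 0;
}

/* Map an exit code to its name; order mirrors the enum in the patch. */
const char *slowpath_exit_name(int code)
{
	static const char *const names[] = {
		"SLOWPATH_NOZONE",			/* -16 */
		"SLOWPATH_COMPACT_DEFERRED",		/* -15 */
		"SLOWPATH_CAN_NOT_DIRECT_RECLAIM",	/* -14 */
		"SLOWPATH_RECURSION",			/* -13 */
		"SLOWPATH_NO_RETRY",			/* -12 */
		"SLOWPATH_COSTLY_ORDER",		/* -11 */
		"SLOWPATH_OOM_VICTIM",			/* -10 */
		"SLOWPATH_NO_DIRECT_RECLAIM",		/*  -9 */
		"SLOWPATH_ORDER",			/*  -8 */
	};

	if (code < -16 || code > -8)
		return "unknown";
	return names[code + 16];
}
```

A tool reading the trace buffer could feed each exit line through
parse_slowpath_exit() and flag allocations whose msdelay exceeds some
threshold, which is roughly the information the removed warning used
to give.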

Signed-off-by: Peter Enderborg 
---
 include/trace/events/kmem.h | 68 +
 mm/page_alloc.c | 62 +++--
 2 files changed, 116 insertions(+), 14 deletions(-)

diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index eb57e30..bb882ca 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -315,6 +315,74 @@ TRACE_EVENT(mm_page_alloc_extfrag,
__entry->change_ownership)
 );
 
+TRACE_EVENT(mm_page_alloc_slowpath_enter,
+
+   TP_PROTO(int alloc_order,
+   nodemask_t *nodemask,
+   gfp_t gfp_flags),
+
+   TP_ARGS(alloc_order, nodemask, gfp_flags),
+
+   TP_STRUCT__entry(
+   __field(int, alloc_order)
+   __field(nodemask_t *, nodemask)
+   __field(gfp_t, gfp_flags)
+),
+
+TP_fast_assign(
+   __entry->alloc_order= alloc_order;
+   __entry->nodemask   = nodemask;
+   __entry->gfp_flags  = gfp_flags;
+),
+
+TP_printk("alloc_order=%d nodemask=%*pbl gfp_flags=%s",
+   __entry->alloc_order,
+   nodemask_pr_args(__entry->nodemask),
+   show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_page_alloc_slowpath_exit,
+
+   TP_PROTO(struct page *page,
+   int alloc_order,
+   nodemask_t *nodemask,
+   u64 alloc_start,
+   gfp_t gfp_flags,
+   int retrys,
+   int exit),
+
+   TP_ARGS(page, alloc_order, nodemask, alloc_start, gfp_flags,
+   retrys, exit),
+
+   TP_STRUCT__entry(__field(struct page *, page)
+   __field(int, alloc_order)
+   __field(nodemask_t *, nodemask)
+   __field(u64, msdelay)
+   __field(gfp_t, gfp_flags)
+   __field(int, retrys)
+   __field(int, exit)
+   ),
+
+   TP_fast_assign(
+   __entry->page= page;
+   __entry->alloc_order = alloc_order;
+   __entry->nodemask= nodemask;
+   __entry->msdelay = jiffies_to_msecs(jiffies-alloc_start);
+   __entry->gfp_flags   = gfp_flags;
+   __entry->retrys  = retrys;
+   __entry->exit= exit;
+   ),
+
+   TP_printk("page=%p alloc_order=%d nodemask=%*pbl msdelay=%llu gfp_flags=%s retrys=%d exit=%d",
+   __entry->page,
+   __entry->alloc_order,
+   nodemask_pr_args(__entry->nodemask),
+   __entry->msdelay,
+   show_gfp_flags(__entry->gfp_flags),
+   __entry->retrys,
+   __entry->exit)
+);
+
 #endif /* _TRACE_KMEM_H */
 
 /* This part must be outside protection */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48b5b01..bae9cb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -104,6 +104,17 @@ DEFINE_PER_CPU(struct work_struct, pcpu_drain);
 volatile unsigned long latent_entropy __latent_entropy;
 EXPORT_SYMBOL(latent_entropy);
 #endif
+enum slowpath_exit {
+   SLOWPATH_NOZONE = -16,
+   SLOWPATH_COMPACT_DEFERRED,
+   SLOWPATH_CAN_NOT_DIRECT_RECLAIM,
+   SLOWPATH_RECURSION,
+   SLOWPATH_NO_RETRY,
+   SLOWPATH_COSTLY_ORDER,
+   SLOWPATH_OOM_VICTIM,
+   SLOWPATH_NO_DIRECT_RECLAIM,
+   SLOWPATH_ORDER
+};
 
 /*
  * Array of node states.
@@ -3908,8 +3919,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum compact_result compact_result;
int compaction_retries;
int no_progress_loops;
+   unsigned long alloc_start = jiffies;
unsigned int cpuset_mems_cookie;
int reserve_flags;
+   enum slowpath_exit slowpath_exit;
+   int retry_count = 0;
+
+   trace_mm_page_alloc_slowpath_enter(order,
+   ac->nodemask,
+   gfp_mask);
 
/*
 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3919,7 +3937,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 */
if (order >= MAX_ORDER) {
WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
-   return NULL;
+   slowpath_exit = SLOWPATH_ORDER;
+   goto fail;
}
 
/*
@@ -3951,8 +3970,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 */
ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
ac->high_zoneidx, ac->nodemask);
-   if (!ac->preferred_zoneref->zone)
+   if (!ac->preferred_zoneref->zone) {
+  

[PATCH 3/3 staging-next] mm: Remove RCU and tasklocks from lmk

2017-02-14 Thread peter.enderborg
From: Peter Enderborg 

Fundamental changes:
1 Does NOT take any RCU lock in the shrinker functions.
2 It returns the same result for scan and count, so the
  shrinker knows when it is pointless to call scan.
3 It does not lock any process other than the one that is
  going to be killed.

Background.
The low memory killer scans for processes that can be killed to free
memory. This can be CPU-consuming when there is a high demand for
memory, which can be seen by analysing the work done by the kswapd0
task. The stats function added in the earlier patch adds a counter
for wasted work.

How it works.
This patch creates a structure within the lowmemorykiller that caches
the user-space processes that it might kill. The cache is a sorted
rbtree, so we can very easily find the candidate to be killed and
know its properties, such as memory usage. The tree is sorted by
oom_score_adj, so we can look up the task with the highest
oom_score_adj. To achieve this it uses oom_score_notify events.
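
The idea of keeping kill candidates pre-sorted can be sketched in
userspace as follows. This is not the code from the patch: a small
sorted array stands in for the kernel rbtree, and all names
(cached_task, task_cache, cache_insert, cache_remove,
cache_pick_victim) are hypothetical. The ordering used here, highest
oom_score_adj first and larger task size as tie-breaker, is an
assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_MAX 64

struct cached_task {
	int pid;
	short oom_score_adj;
	long tasksize;		/* pages, a stand-in for memory usage */
};

struct task_cache {
	struct cached_task tasks[CACHE_MAX];
	size_t count;
};

/* Order: higher oom_score_adj first, then larger tasksize first. */
int cached_task_before(const struct cached_task *a,
		       const struct cached_task *b)
{
	if (a->oom_score_adj != b->oom_score_adj)
		return a->oom_score_adj > b->oom_score_adj;
	return a->tasksize > b->tasksize;
}

/* Insert while keeping the array sorted; returns -1 when full. */
int cache_insert(struct task_cache *c, struct cached_task t)
{
	size_t i;

	assert(c);
	if (c->count == CACHE_MAX)
		return -1;
	/* Shift entries that sort after t one slot to the right. */
	for (i = c->count; i > 0 && cached_task_before(&t, &c->tasks[i - 1]); i--)
		c->tasks[i] = c->tasks[i - 1];
	c->tasks[i] = t;
	c->count++;
	return 0;
}

/* Remove by pid, e.g. when the task exits; -1 if it was not cached. */
int cache_remove(struct task_cache *c, int pid)
{
	size_t i;

	assert(c);
	for (i = 0; i < c->count; i++) {
		if (c->tasks[i].pid != pid)
			continue;
		memmove(&c->tasks[i], &c->tasks[i + 1],
			(c->count - i - 1) * sizeof(c->tasks[0]));
		c->count--;
		return 0;
	}
	return -1;
}

/* The best candidate is the first entry, if it meets the threshold. */
const struct cached_task *cache_pick_victim(const struct task_cache *c,
					    short min_score_adj)
{
	assert(c);
	if (c->count == 0 || c->tasks[0].oom_score_adj < min_score_adj)
		return NULL;
	return &c->tasks[0];
}
```

With the candidates kept in order, picking a victim no longer walks
the whole task list, which is what removes the need for RCU and task
locks in the shrinker path. A real rbtree would also make insert and
remove logarithmic instead of linear.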

This patch also has another effect: we are now free to use other
lowmemorykiller configurations. Without the patch there is a
tradeoff between freed memory and task and RCU locks. This is no
longer a concern when tuning lmk. This patch is not intended to
change any calculations, other than that we use the cache to
calculate the count values, which makes kswapd0 shrink other
areas.

Signed-off-by: Peter Enderborg 
---
 drivers/staging/android/Kconfig |   1 +
 drivers/staging/android/Makefile|   1 +
 drivers/staging/android/lowmemorykiller.c   | 294 +++-
 drivers/staging/android/lowmemorykiller.h   |  15 ++
 drivers/staging/android/lowmemorykiller_stats.c |  24 ++
 drivers/staging/android/lowmemorykiller_stats.h |  14 +-
 drivers/staging/android/lowmemorykiller_tasks.c | 220 ++
 drivers/staging/android/lowmemorykiller_tasks.h |  35 +++
 8 files changed, 498 insertions(+), 106 deletions(-)
 create mode 100644 drivers/staging/android/lowmemorykiller.h
 create mode 100644 drivers/staging/android/lowmemorykiller_tasks.c
 create mode 100644 drivers/staging/android/lowmemorykiller_tasks.h

diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
index 96e86c7..899186c 100644
--- a/drivers/staging/android/Kconfig
+++ b/drivers/staging/android/Kconfig
@@ -16,6 +16,7 @@ config ASHMEM
 
 config ANDROID_LOW_MEMORY_KILLER
bool "Android Low Memory Killer"
+   select OOM_SCORE_NOTIFIER
---help---
  Registers processes to be killed when low memory conditions, this is useful
  as there is no particular swap space on android.
diff --git a/drivers/staging/android/Makefile b/drivers/staging/android/Makefile
index d710eb2..b7a8036 100644
--- a/drivers/staging/android/Makefile
+++ b/drivers/staging/android/Makefile
@@ -4,4 +4,5 @@ obj-y   += ion/
 
 obj-$(CONFIG_ASHMEM)   += ashmem.o
 obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller.o
+obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller_tasks.o
 obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER_STATS)  += lowmemorykiller_stats.o
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 15c1b38..1e275b1 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -41,10 +41,14 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include "lowmemorykiller.h"
 #include "lowmemorykiller_stats.h"
+#include "lowmemorykiller_tasks.h"
 
-static u32 lowmem_debug_level = 1;
+u32 lowmem_debug_level = 1;
 static short lowmem_adj[6] = {
0,
1,
@@ -62,135 +66,212 @@ static int lowmem_minfree[6] = {
 
 static int lowmem_minfree_size = 4;
 
-static unsigned long lowmem_deathpending_timeout;
-
-#define lowmem_print(level, x...)  \
-   do {\
-   if (lowmem_debug_level >= (level))  \
-   pr_info(x); \
-   } while (0)
-
-static unsigned long lowmem_count(struct shrinker *s,
- struct shrink_control *sc)
-{
-   lmk_inc_stats(LMK_COUNT);
-   return global_node_page_state(NR_ACTIVE_ANON) +
-   global_node_page_state(NR_ACTIVE_FILE) +
-   global_node_page_state(NR_INACTIVE_ANON) +
-   global_node_page_state(NR_INACTIVE_FILE);
-}
+struct calculated_params {
+   long selected_tasksize;
+   long minfree;
+   int other_file;
+   int other_free;
+   int dynamic_max_queue_len;
+   short selected_oom_score_adj;
+   short min_score_adj;
+};
 
-static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
+static int kill_needed(int level, struct shrink_control *sc,
+  struct calculated_params *cp)
 {
-   struct task_struct *tsk;
-   struct 

[PATCH 2/3 staging-next] oom: Add notification for oom_score_adj

2017-02-14 Thread peter.enderborg
From: Peter Enderborg 

This adds subscription for changes in oom_score_adj; this
value is important to Android systems. Users of oom_score_adj
currently have to read the task list. That list can be long,
needs RCU locks, and has an impact on the system. Let the
user track oom_score_adj changes through notifications and
keep them in their own context, so they can do their actions
with minimal system impact.
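
The subscription model can be sketched in userspace as a small
callback chain. This is only an illustration: the patch itself uses
the kernel's atomic notifier infrastructure, while the chain below is
a plain array, and every name except the osn_msg_type values (which
come from the header in this patch) is hypothetical. Stopping at the
first failing subscriber is an assumption, chosen because the caller
in __set_oom_adj() rolls back when the notify call returns an error.

```c
#include <assert.h>
#include <stddef.h>

/* Message types as declared in the patch's header. */
enum osn_msg_type { OSN_NEW, OSN_FREE, OSN_UPDATE };

struct osn_event {
	int pid;
	int old_score;
	int new_score;
};

typedef int (*osn_callback)(enum osn_msg_type type,
			    const struct osn_event *ev, void *ctx);

#define OSN_MAX_SUBSCRIBERS 8

struct osn_chain {
	struct {
		osn_callback fn;
		void *ctx;
	} subs[OSN_MAX_SUBSCRIBERS];
	size_t count;
};

/* Add a subscriber; returns -1 when the chain is full. */
int osn_register(struct osn_chain *chain, osn_callback fn, void *ctx)
{
	assert(chain && fn);
	if (chain->count == OSN_MAX_SUBSCRIBERS)
		return -1;
	chain->subs[chain->count].fn = fn;
	chain->subs[chain->count].ctx = ctx;
	chain->count++;
	return 0;
}

/* Deliver to every subscriber; stop at the first non-zero result. */
int osn_notify(const struct osn_chain *chain, enum osn_msg_type type,
	       const struct osn_event *ev)
{
	size_t i;
	int err;

	assert(chain && ev);
	for (i = 0; i < chain->count; i++) {
		err = chain->subs[i].fn(type, ev, chain->subs[i].ctx);
		if (err)
			return err;
	}
	return 0;
}

/* Example subscriber: mirrors one task's score in its own context. */
struct score_mirror {
	int pid;
	int score;
	int updates;
};

int score_mirror_cb(enum osn_msg_type type, const struct osn_event *ev,
		    void *ctx)
{
	struct score_mirror *m = ctx;

	if (type == OSN_UPDATE && ev->pid == m->pid) {
		m->score = ev->new_score;
		m->updates++;
	}
	return 0;
}
```

The subscriber keeps its own copy of the data it cares about, so later
decisions can be made from that copy without walking the task list.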

Signed-off-by: Peter Enderborg 
---
 fs/proc/base.c | 13 +++
 include/linux/oom_score_notifier.h | 47 
 kernel/Makefile|  1 +
 kernel/fork.c  |  6 +++
 kernel/oom_score_notifier.c| 75 ++
 mm/Kconfig |  9 +
 6 files changed, 151 insertions(+)
 create mode 100644 include/linux/oom_score_notifier.h
 create mode 100644 kernel/oom_score_notifier.c

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 87c9a9a..60c2d9b 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -87,6 +87,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_HARDWALL
 #include 
 #endif
@@ -1057,6 +1058,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
static DEFINE_MUTEX(oom_adj_mutex);
struct mm_struct *mm = NULL;
struct task_struct *task;
+   int old_oom_score_adj;
int err = 0;
 
task = get_proc_task(file_inode(file));
@@ -1102,9 +1104,20 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
}
}
 
+   old_oom_score_adj = task->signal->oom_score_adj;
task->signal->oom_score_adj = oom_adj;
if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
task->signal->oom_score_adj_min = (short)oom_adj;
+
+#ifdef CONFIG_OOM_SCORE_NOTIFIER
+   err = oom_score_notify_update(task, old_oom_score_adj);
+   if (err) {
+   /* rollback and error handle. */
+   task->signal->oom_score_adj = old_oom_score_adj;
+   goto err_unlock;
+   }
+#endif
+
trace_oom_score_adj_update(task);
 
if (mm) {
diff --git a/include/linux/oom_score_notifier.h b/include/linux/oom_score_notifier.h
new file mode 100644
index 000..c5cea47
--- /dev/null
+++ b/include/linux/oom_score_notifier.h
@@ -0,0 +1,47 @@
+/*
+ *  oom_score_notifier interface
+ *  Copyright (C) 2017 Sony Mobile Communications Inc.
+ *
+ *  Author: Peter Enderborg 
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2 as
+ *  published by the Free Software Foundation.
+ */
+
+#ifndef _LINUX_OOM_SCORE_NOTIFIER_H
+#define _LINUX_OOM_SCORE_NOTIFIER_H
+
+#ifdef CONFIG_OOM_SCORE_NOTIFIER
+
+#include 
+#include 
+#include 
+
+enum osn_msg_type {
+   OSN_NEW,
+   OSN_FREE,
+   OSN_UPDATE
+};
+
+extern struct atomic_notifier_head oom_score_notifier;
+extern int oom_score_notifier_register(struct notifier_block *n);
+extern int oom_score_notifier_unregister(struct notifier_block *n);
+extern int oom_score_notify_free(struct task_struct *tsk);
+extern int oom_score_notify_new(struct task_struct *tsk);
+extern int oom_score_notify_update(struct task_struct *tsk, int old_score);
+
+struct oom_score_notifier_struct {
+   struct task_struct *tsk;
+   int old_score;
+};
+
+#else
+
+#define oom_score_notify_free(t)  do {} while (0)
+#define oom_score_notify_new(t) false
+#define oom_score_notify_update(t, s) do {} while (0)
+
+#endif /* CONFIG_OOM_SCORE_NOTIFIER */
+
+#endif /* _LINUX_OOM_SCORE_NOTIFIER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 12c679f..747c66c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -91,6 +91,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
 obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
+obj-$(CONFIG_OOM_SCORE_NOTIFIER) += oom_score_notifier.o
 obj-$(CONFIG_LATENCYTOP) += latencytop.o
 obj-$(CONFIG_ELFCORE) += elfcore.o
 obj-$(CONFIG_FUNCTION_TRACER) += trace/
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8a..f8a1a89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -73,6 +73,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -391,6 +392,7 @@ void __put_task_struct(struct task_struct *tsk)
exit_creds(tsk);
delayacct_tsk_free(tsk);
put_signal_struct(tsk->signal);
+   oom_score_notify_free(tsk);
 
if (!profile_handoff_task(tsk))
free_task(tsk);
@@ -1790,6 +1792,10 @@ static __latent_entropy struct task_struct *copy_process(
 
init_task_pid(p, PIDTYPE_PID, pid);
if (thread_group_leader(p)) {
+   retval = oom_score_notify_new(p);
+   if (retval)
+   goto bad_fork_cancel_cgroup;
+

[PATCH 1/3 staging-next] android: Collect statistics from lowmemorykiller

2017-02-14 Thread peter.enderborg
From: Peter Enderborg 

This collects statistics for shrinker calls and for how much
work is wasted within the lowmemorykiller.
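
The counting scheme can be sketched in userspace like this. The
counter keys are the ones passed to lmk_inc_stats() in the diff below;
everything else (LMK_STAT_MAX, the counters array, stats_inc,
stats_read, stats_waste_percent) is hypothetical, and the body of
lowmemorykiller_stats.c in the patch may look quite different. C11
atomics are used here as a stand-in for whatever the kernel side uses.

```c
#include <assert.h>
#include <stdatomic.h>

/* Keys as used by the lmk_inc_stats() calls in the patch. */
enum lmk_stat {
	LMK_COUNT,	/* shrinker count callback invoked */
	LMK_SCAN,	/* shrinker scan callback invoked */
	LMK_TIMEOUT,	/* scan aborted, a kill is still pending */
	LMK_KILL,	/* scan selected and killed a task */
	LMK_WASTE,	/* scan ran to the end without killing */
	LMK_STAT_MAX	/* hypothetical sentinel, not from the patch */
};

/* Zero-initialised because of static storage duration. */
atomic_ulong counters[LMK_STAT_MAX];

/* Bump one counter; relaxed ordering is enough for statistics. */
void stats_inc(enum lmk_stat key)
{
	assert((int)key >= 0 && (int)key < LMK_STAT_MAX);
	atomic_fetch_add_explicit(&counters[key], 1, memory_order_relaxed);
}

unsigned long stats_read(enum lmk_stat key)
{
	assert((int)key >= 0 && (int)key < LMK_STAT_MAX);
	return atomic_load_explicit(&counters[key], memory_order_relaxed);
}

/* Share of scans that did all the work but killed nothing, in percent. */
unsigned long stats_waste_percent(void)
{
	unsigned long scans = stats_read(LMK_SCAN);

	return scans ? stats_read(LMK_WASTE) * 100 / scans : 0;
}
```

A high waste percentage would indicate that the shrinker keeps calling
scan even though there is nothing to kill, which is the problem the
later patches in this series try to address.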

Signed-off-by: Peter Enderborg 
---
 drivers/staging/android/Kconfig | 11 
 drivers/staging/android/Makefile|  1 +
 drivers/staging/android/lowmemorykiller.c   |  9 ++-
 drivers/staging/android/lowmemorykiller_stats.c | 85 +
 drivers/staging/android/lowmemorykiller_stats.h | 29 +
 5 files changed, 134 insertions(+), 1 deletion(-)
 create mode 100644 drivers/staging/android/lowmemorykiller_stats.c
 create mode 100644 drivers/staging/android/lowmemorykiller_stats.h

diff --git a/drivers/staging/android/Kconfig b/drivers/staging/android/Kconfig
index 6c00d6f..96e86c7 100644
--- a/drivers/staging/android/Kconfig
+++ b/drivers/staging/android/Kconfig
@@ -24,6 +24,17 @@ config ANDROID_LOW_MEMORY_KILLER
  scripts (/init.rc), and it defines priority values with minimum free memory size
  for each priority.
 
+config ANDROID_LOW_MEMORY_KILLER_STATS
+   bool "Android Low Memory Killer: collect statistics"
+   depends on ANDROID_LOW_MEMORY_KILLER
+   default n
+   help
+ Create a file in /proc/lmkstats that includes
+ collected statistics about kills, scans and counts
+ and interaction with the shrinker. Its content
+ will be different depending on the lmk implementation used.
+
+
 source "drivers/staging/android/ion/Kconfig"
 
 endif # if ANDROID
diff --git a/drivers/staging/android/Makefile b/drivers/staging/android/Makefile
index 7ed1be7..d710eb2 100644
--- a/drivers/staging/android/Makefile
+++ b/drivers/staging/android/Makefile
@@ -4,3 +4,4 @@ obj-y   += ion/
 
 obj-$(CONFIG_ASHMEM)   += ashmem.o
 obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER)+= lowmemorykiller.o
+obj-$(CONFIG_ANDROID_LOW_MEMORY_KILLER_STATS)  += lowmemorykiller_stats.o
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index ec3b665..15c1b38 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include "lowmemorykiller_stats.h"
 
 static u32 lowmem_debug_level = 1;
 static short lowmem_adj[6] = {
@@ -72,6 +73,7 @@ static unsigned long lowmem_deathpending_timeout;
 static unsigned long lowmem_count(struct shrinker *s,
  struct shrink_control *sc)
 {
+   lmk_inc_stats(LMK_COUNT);
return global_node_page_state(NR_ACTIVE_ANON) +
global_node_page_state(NR_ACTIVE_FILE) +
global_node_page_state(NR_INACTIVE_ANON) +
@@ -95,6 +97,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
global_node_page_state(NR_SHMEM) -
total_swapcache_pages();
 
+   lmk_inc_stats(LMK_SCAN);
if (lowmem_adj_size < array_size)
array_size = lowmem_adj_size;
if (lowmem_minfree_size < array_size)
@@ -134,6 +137,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
if (task_lmk_waiting(p) &&
time_before_eq(jiffies, lowmem_deathpending_timeout)) {
task_unlock(p);
+   lmk_inc_stats(LMK_TIMEOUT);
rcu_read_unlock();
return 0;
}
@@ -179,7 +183,9 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
 other_free * (long)(PAGE_SIZE / 1024));
lowmem_deathpending_timeout = jiffies + HZ;
rem += selected_tasksize;
-   }
+   lmk_inc_stats(LMK_KILL);
+   } else
+   lmk_inc_stats(LMK_WASTE);
 
lowmem_print(4, "lowmem_scan %lu, %x, return %lu\n",
 sc->nr_to_scan, sc->gfp_mask, rem);
@@ -196,6 +202,7 @@ static struct shrinker lowmem_shrinker = {
 static int __init lowmem_init(void)
 {
register_shrinker(&lowmem_shrinker);
+   init_procfs_lmk();
return 0;
 }
 device_initcall(lowmem_init);
diff --git a/drivers/staging/android/lowmemorykiller_stats.c b/drivers/staging/android/lowmemorykiller_stats.c
new file mode 100644
index 000..673691c
--- /dev/null
+++ b/drivers/staging/android/lowmemorykiller_stats.c
@@ -0,0 +1,85 @@
+/*
+ *  lowmemorykiller_stats
+ *
+ *  Copyright (C) 2017 Sony Mobile Communications Inc.
+ *
+ *  Author: Peter Enderborg 
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2 as
+ *  published by the Free Software Foundation.
+ */
+/* This code does bookkeeping of statistical information
+ * from lowmemorykiller and provides a node in proc, "/proc/lmkstats".
+ */
+
+#include 
+#include 
+#include