Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Peter Rosin
On 2017-04-20 23:53, Peter Rosin wrote:
> On 2017-04-18 23:53, Peter Rosin wrote:
>> On 2017-04-18 13:44, Greg Kroah-Hartman wrote:
>>> On Tue, Apr 18, 2017 at 12:59:50PM +0200, Peter Rosin wrote:
 On 2017-04-18 10:51, Greg Kroah-Hartman wrote:
> On Thu, Apr 13, 2017 at 06:43:07PM +0200, Peter Rosin wrote:
> 
> *snip*
> 
>> +	if (mux->idle_state != MUX_IDLE_AS_IS &&
>> +	    mux->idle_state != mux->cached_state)
>> +		ret = mux_control_set(mux, mux->idle_state);
>> +
>> +	up_read(&mux->lock);
>
> You require a lock to be held for a "global" function?  Without
> documentation?  Or even a sparse marking?  That's asking for trouble...

 Documentation I can handle, but where should I look to understand how I
 should add sparse markings?
>>>
>>> Run sparse on the code and see what it says :)
>>
>> Will do.
> 
> I just did, and even went through the trouble of getting the bleeding
> edge sparse from the git repo when sparse 0.5.0 came up empty, but it's
> all silence for me. So, how do I add sparse markings?

I looked some more into this, and the markings I find that seem related
are __acquire() and __release(). But neither mutex_lock() nor up_read()
has markings like that, so adding them when using those kinds of locks
in an imbalanced way seems like a sure way of *getting* sparse messages
about context imbalance...
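
For reference, the declaration-side markings seem to be __acquires() and
__releases(). A rough sketch of how they might look on the mux functions,
for illustration only and not actual patch code:

	/* returns with mux->lock held for read on success */
	int mux_control_select(struct mux_control *mux, int state)
		__acquires(&mux->lock);

	/* drops the lock taken by mux_control_select() */
	void mux_control_deselect(struct mux_control *mux)
		__releases(&mux->lock);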

So, either that, or you are talking about __must_check markings?

I feel like I'm missing something, please advise further.

Cheers,
peda



Re: [PATCH v5 31/32] x86: Add sysfs support for Secure Memory Encryption

2017-04-21 Thread Dave Hansen
On 04/18/2017 02:22 PM, Tom Lendacky wrote:
> Add sysfs support for SME so that user-space utilities (kdump, etc.) can
> determine if SME is active.
> 
> A new directory will be created:
>   /sys/kernel/mm/sme/
> 
> And two entries within the new directory:
>   /sys/kernel/mm/sme/active
>   /sys/kernel/mm/sme/encryption_mask

Why do they care, and what will they be doing with this information?


Re: [PATCH v5 09/32] x86/mm: Provide general kernel support for memory encryption

2017-04-21 Thread Dave Hansen
On 04/18/2017 02:17 PM, Tom Lendacky wrote:
> @@ -55,7 +57,7 @@ static inline void copy_user_page(void *to, void *from, 
> unsigned long vaddr,
>   __phys_addr_symbol(__phys_reloc_hide((unsigned long)(x)))
>  
>  #ifndef __va
> -#define __va(x)  ((void *)((unsigned 
> long)(x)+PAGE_OFFSET))
> +#define __va(x)  ((void *)(__sme_clr(x) + PAGE_OFFSET))
>  #endif

It seems wrong to be modifying __va().  It currently takes a physical
address, and this modifies it to take a physical address plus the SME bits.

How does that end up ever happening?  If we are pulling physical
addresses out of the page tables, we use p??_phys().  I'd expect *those*
to be masking off the SME bits.

Is it these cases?

pgd_t *base = __va(read_cr3());

For those, it seems like we really want to create two modes of reading
cr3.  One that truly reads CR3 and another that reads the pgd's physical
address out of CR3.  Then you only do the SME masking on the one
fetching a physical address, and the SME bits never leak into __va().
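
Something like this, as a rough sketch (hypothetical helper names, not
taken from the patch):

	/* raw CR3 value; may carry the SME mask bits */
	static inline unsigned long __read_cr3_raw(void)
	{
		unsigned long cr3;

		asm volatile("mov %%cr3, %0" : "=r" (cr3));
		return cr3;
	}

	/* pgd physical address only, with the SME bits masked off */
	static inline unsigned long read_cr3_pa(void)
	{
		return __read_cr3_raw() & PAGE_MASK & ~sme_me_mask;
	}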


Re: [PATCH v5 07/32] x86/mm: Add support to enable SME in early boot processing

2017-04-21 Thread Tom Lendacky

On 4/21/2017 9:55 AM, Borislav Petkov wrote:
> On Tue, Apr 18, 2017 at 04:17:35PM -0500, Tom Lendacky wrote:
>> Add support to the early boot code to use Secure Memory Encryption (SME).
>> Since the kernel has been loaded into memory in a decrypted state, support
>> is added to encrypt the kernel in place and update the early pagetables
>> with the memory encryption mask so that new pagetable entries will use
>> memory encryption.
>
> s/support is added to //

Done.

Thanks,
Tom


Re: [PATCH v5 32/32] x86/mm: Add support to make use of Secure Memory Encryption

2017-04-21 Thread Tom Lendacky

On 4/18/2017 4:22 PM, Tom Lendacky wrote:

Add support to check if SME has been enabled and if memory encryption
should be activated (checking the command line option against the
configured default state).  If memory encryption is to be activated,
then the encryption mask is set and the kernel is encrypted "in place."

Signed-off-by: Tom Lendacky 
---
 arch/x86/kernel/head_64.S |1 +
 arch/x86/mm/mem_encrypt.c |   83 +++--
 2 files changed, 80 insertions(+), 4 deletions(-)



...



-unsigned long __init sme_enable(void)
+unsigned long __init sme_enable(struct boot_params *bp)
 {
+   const char *cmdline_ptr, *cmdline_arg, *cmdline_on, *cmdline_off;
+   unsigned int eax, ebx, ecx, edx;
+   unsigned long me_mask;
+   bool active_by_default;
+   char buffer[16];


So it turns out that when KASLR is enabled (CONFIG_RANDOMIZE_BASE=y)
the stack-protector support causes issues with this function because
it is called so early. I can get past it by adding:

CFLAGS_mem_encrypt.o := $(nostackp)

in the arch/x86/mm/Makefile, but that obviously eliminates stack-protector
coverage for the whole file.  Would it be better to split out the sme_enable()
and other boot routines into a separate file or just apply the
$(nostackp) to the whole file?
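
For illustration, the two options would look roughly like this in
arch/x86/mm/Makefile (the split-out file name is hypothetical):

	# option 1: disable the stack protector for the whole file
	CFLAGS_mem_encrypt.o := $(nostackp)

	# option 2: move sme_enable() and the other early boot routines
	# into their own file and limit $(nostackp) to it
	obj-$(CONFIG_AMD_MEM_ENCRYPT) += mem_encrypt.o mem_encrypt_boot.o
	CFLAGS_mem_encrypt_boot.o := $(nostackp)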

Thanks,
Tom


+   u64 msr;
+
+   /* Check for the SME support leaf */
+   eax = 0x80000000;
+   ecx = 0;
+   native_cpuid(&eax, &ebx, &ecx, &edx);
+   if (eax < 0x8000001f)
+   goto out;
+
+   /*
+* Check for the SME feature:
+*   CPUID Fn8000_001F[EAX] - Bit 0
+* Secure Memory Encryption support
+*   CPUID Fn8000_001F[EBX] - Bits 5:0
+* Pagetable bit position used to indicate encryption
+*/
+   eax = 0x8000001f;
+   ecx = 0;
+   native_cpuid(&eax, &ebx, &ecx, &edx);
+   if (!(eax & 1))
+   goto out;
+   me_mask = 1UL << (ebx & 0x3f);
+
+   /* Check if SME is enabled */
+   msr = __rdmsr(MSR_K8_SYSCFG);
+   if (!(msr & MSR_K8_SYSCFG_MEM_ENCRYPT))
+   goto out;
+
+   /*
+* Fixups have not been applied to phys_base yet, so we must obtain
+* the address to the SME command line option data in the following
+* way.
+*/
+   asm ("lea sme_cmdline_arg(%%rip), %0"
+: "=r" (cmdline_arg)
+: "p" (sme_cmdline_arg));
+   asm ("lea sme_cmdline_on(%%rip), %0"
+: "=r" (cmdline_on)
+: "p" (sme_cmdline_on));
+   asm ("lea sme_cmdline_off(%%rip), %0"
+: "=r" (cmdline_off)
+: "p" (sme_cmdline_off));
+
+   if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT))
+   active_by_default = true;
+   else
+   active_by_default = false;
+
+   cmdline_ptr = (const char *)((u64)bp->hdr.cmd_line_ptr |
+((u64)bp->ext_cmd_line_ptr << 32));
+
+   cmdline_find_option(cmdline_ptr, cmdline_arg, buffer, sizeof(buffer));
+
+   if (strncmp(buffer, cmdline_on, sizeof(buffer)) == 0)
+   sme_me_mask = me_mask;
+   else if (strncmp(buffer, cmdline_off, sizeof(buffer)) == 0)
+   sme_me_mask = 0;
+   else
+   sme_me_mask = active_by_default ? me_mask : 0;
+
+out:
return sme_me_mask;
 }

@@ -543,9 +618,9 @@ unsigned long sme_get_me_mask(void)

 #else  /* !CONFIG_AMD_MEM_ENCRYPT */

-void __init sme_encrypt_kernel(void)   { }
-unsigned long __init sme_enable(void)  { return 0; }
+void __init sme_encrypt_kernel(void)   { }
+unsigned long __init sme_enable(struct boot_params *bp){ return 0; }

-unsigned long sme_get_me_mask(void){ return 0; }
+unsigned long sme_get_me_mask(void){ return 0; }

 #endif /* CONFIG_AMD_MEM_ENCRYPT */




Re: [PATCH v5 07/32] x86/mm: Add support to enable SME in early boot processing

2017-04-21 Thread Borislav Petkov
On Tue, Apr 18, 2017 at 04:17:35PM -0500, Tom Lendacky wrote:
> Add support to the early boot code to use Secure Memory Encryption (SME).
> Since the kernel has been loaded into memory in a decrypted state, support
> is added to encrypt the kernel in place and update the early pagetables

s/support is added to //

> with the memory encryption mask so that new pagetable entries will use
> memory encryption.
> 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Peter Rosin
On 2017-04-21 16:41, Philipp Zabel wrote:
> On Fri, 2017-04-21 at 16:32 +0200, Peter Rosin wrote:
>> On 2017-04-21 16:23, Philipp Zabel wrote:
>>> On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
>>> [...]
 +int mux_chip_register(struct mux_chip *mux_chip)
 +{
 +  int i;
 +  int ret;
 +
 +  for (i = 0; i < mux_chip->controllers; ++i) {
 +  struct mux_control *mux = &mux_chip->mux[i];
 +
 +  if (mux->idle_state == mux->cached_state)
 +  continue;
>>>
>>> I think this should be changed to
>>>  
>>> -   if (mux->idle_state == mux->cached_state)
>>> +   if (mux->idle_state == mux->cached_state ||
>>> +   mux->idle_state == MUX_IDLE_AS_IS)
>>> continue;
>>>
>>> or the following mux_control_set will be called with state ==
>>> MUX_IDLE_AS_IS. Alternatively, mux_control_set should return when passed
>>> this value.
>>
>> That cannot happen because ->cached_state is initialized to -1
>> in mux_chip_alloc, so should always be == MUX_IDLE_AS_IS when
>> registering. And drivers are not supposed to touch ->cached_state.
>> I.e., ->cached_state is "owned" by the core.
> 
> So this was caused by me filling cached_state from register reads in the
> mmio driver. Makes me wonder why I am not allowed to do this, though, if
> I am able to read back the initial state?

You gain fairly little by reading back the original state. If the mux
should idle-as-is, you can avoid a maximum of one mux update if the first
consumer happens to start by requesting the previously active state.
Similarly, if the mux should idle in a specific state, you can avoid a
maximum of one mux update.

In both cases it costs one unconditional read of the mux state.

Sure, in some cases reads are cheaper than writes, but I didn't think
support for seeding the cache was worth it. Is it worth it?

Cheers,
peda



Re: [PATCH v2 1/3] rtmutex: update rt-mutex-design

2017-04-21 Thread Mathieu Poirier
On 21 April 2017 at 08:12, Alex Shi  wrote:
> The rt-mutex-design document hasn't gotten a meaningful update since its
> first version, even after the owner's pending bit was removed in commit
> 8161239a8bcc
> ("rtmutex: Simplify PI algorithm and make highest prio task get lock")
> and the priority list 'plist' was changed to an rbtree. So the document
> lags far behind the real code.
>
> Update it to match the latest code and make it meaningful.
>
> Signed-off-by: Alex Shi 
> Cc: Steven Rostedt 
> Cc: Sebastian Siewior 
> To: linux-doc@vger.kernel.org
> To: linux-ker...@vger.kernel.org
> To: Jonathan Corbet 
> To: Ingo Molnar 
> To: Peter Zijlstra 
> Cc: Thomas Gleixner 
> ---
>  Documentation/locking/rt-mutex-design.txt | 390 
> +++---
>  1 file changed, 88 insertions(+), 302 deletions(-)
>
> diff --git a/Documentation/locking/rt-mutex-design.txt 
> b/Documentation/locking/rt-mutex-design.txt
> index 8666070..11beb55 100644
> --- a/Documentation/locking/rt-mutex-design.txt
> +++ b/Documentation/locking/rt-mutex-design.txt
> @@ -97,9 +97,9 @@ waiter   - A waiter is a struct that is stored on the stack 
> of a blocked
> a process being blocked on the mutex, it is fine to allocate
> the waiter on the process's stack (local variable).  This
> structure holds a pointer to the task, as well as the mutex that
> -   the task is blocked on.  It also has the plist node structures to
> -   place the task in the waiter_list of a mutex as well as the
> -   pi_list of a mutex owner task (described below).
> +  the task is blocked on.  It also has a rbtree node structures to

Here I assume we are talking about struct rt_mutex_waiter[1].  If so I
suggest to replace rbtree with rb_node.

> +  place the task in waiters rbtree of a mutex as well as the
> +  pi_waiters rbtree of a mutex owner task (described below).

Also following the comment for @pi_tree_entry, s/"a mutex owner
task"/"a mutex owner waiters tree" .

[1]. http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h#L25


>
> waiter is sometimes used in reference to the task that is waiting
> on a mutex. This is the same as waiter->task.
> @@ -179,53 +179,35 @@ again.
>   |
> F->L5-+
>
> -
> -Plist
> --
> -
> -Before I go further and talk about how the PI chain is stored through lists
> -on both mutexes and processes, I'll explain the plist.  This is similar to
> -the struct list_head functionality that is already in the kernel.
> -The implementation of plist is out of scope for this document, but it is
> -very important to understand what it does.
> -
> -There are a few differences between plist and list, the most important one
> -being that plist is a priority sorted linked list.  This means that the
> -priorities of the plist are sorted, such that it takes O(1) to retrieve the
> -highest priority item in the list.  Obviously this is useful to store 
> processes
> -based on their priorities.
> -
> -Another difference, which is important for implementation, is that, unlike
> -list, the head of the list is a different element than the nodes of a list.
> -So the head of the list is declared as struct plist_head and nodes that will
> -be added to the list are declared as struct plist_node.
> -
> +If the G process has highest priority in the chain, then all the tasks up

If process G has the highest priority in the chain, ...

> +the chain (A and B in this example), must have their priorities increased
> +to that of G.
>
>  Mutex Waiter List
>  -
>
>  Every mutex keeps track of all the waiters that are blocked on itself. The 
> mutex
> -has a plist to store these waiters by priority.  This list is protected by
> +has a rbtree to store these waiters by priority.  This tree is protected by
>  a spin lock that is located in the struct of the mutex. This lock is called
> -wait_lock.  Since the modification of the waiter list is never done in
> +wait_lock.  Since the modification of the waiter tree is never done in
>  interrupt context, the wait_lock can be taken without disabling interrupts.
>
>
> -Task PI List
> +Task PI Tree
>  
>
> -To keep track of the PI chains, each process has its own PI list.  This is
> -a list of all top waiters of the mutexes that are owned by the process.
> -Note that this list only holds the top waiters and not all waiters that are
> +To keep track of the PI chains, each process has its own PI rbtree.  This is
> +a tree of all top waiters of the mutexes that are owned by the process.
> +Note that this tree only holds the top waiters and not all waiters that are
>  blocked on mutexes owned by the process.
>
> -The top of the task's PI list is always the highest priority task that
> +The top of 

Re: [PATCH v2 1/3] rtmutex: update rt-mutex-design

2017-04-21 Thread Peter Zijlstra
On Fri, Apr 21, 2017 at 10:12:53PM +0800, Alex Shi wrote:
> diff --git a/Documentation/locking/rt-mutex-design.txt 
> b/Documentation/locking/rt-mutex-design.txt
> index 8666070..11beb55 100644
> --- a/Documentation/locking/rt-mutex-design.txt
> +++ b/Documentation/locking/rt-mutex-design.txt
> @@ -97,9 +97,9 @@ waiter   - A waiter is a struct that is stored on the stack 
> of a blocked
> a process being blocked on the mutex, it is fine to allocate
> the waiter on the process's stack (local variable).  This
> structure holds a pointer to the task, as well as the mutex that
> -   the task is blocked on.  It also has the plist node structures to
> -   place the task in the waiter_list of a mutex as well as the
> -   pi_list of a mutex owner task (described below).
> +the task is blocked on.  It also has a rbtree node structures to
> +place the task in waiters rbtree of a mutex as well as the
> +pi_waiters rbtree of a mutex owner task (described below).

whitespace fail

>  
> waiter is sometimes used in reference to the task that is waiting
> on a mutex. This is the same as waiter->task.
> @@ -179,53 +179,35 @@ again.
>   |
> F->L5-+
>  
> +If the G process has highest priority in the chain, then all the tasks up
> +the chain (A and B in this example), must have their priorities increased
> +to that of G.

No, only the top task that's actually runnable needs to be modified. The
rest we don't care about because they're blocked.

> +Since the pi_waiters of a task holds an order by priority of all the top 
> waiters
> +of all the mutexes that the task owns, rt_mutex_getprio simply needs to 
> compare
> +the top pi waiter to its own normal priority, and return the higher priority
> +back.

rt_mutex_getprio() doesn't exist.

> +The main operation of this function is summarized by Thomas Gleixner in
> +rtmutex.c. See the 'Chain walk basics and protection scope' comment for 
> further
> +details.

Since all the useful bits are there anyway, why keep this document
around at all?



Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Peter Rosin
On 2017-04-21 16:18, Philipp Zabel wrote:
> Hi Peter,
> 
> On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
> [...]
>> +int mux_control_select(struct mux_control *mux, int state)
> 
> state could be unsigned int for the consumer facing API.
> 
>> +{
>> +int ret;
> 
> And mux_control_select should check that (0 <= state < mux->states).

Yes, that makes sense. I worried that we might end up with
signed/unsigned comparisons since the internal state still needs
to be signed, but that didn't happen when I tried...

I'll include this change in v14.
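
As a sketch, the added check could look something like this (fragment
only, illustrative):

	int mux_control_select(struct mux_control *mux, int state)
	{
		/* reject out-of-range states up front */
		if (state < 0 || state >= mux->states)
			return -EINVAL;

		/* ... rest of the select logic unchanged ... */
	}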

Cheers,
peda


Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Philipp Zabel
On Fri, 2017-04-21 at 16:32 +0200, Peter Rosin wrote:
> On 2017-04-21 16:23, Philipp Zabel wrote:
> > On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
> > [...]
> >> +int mux_chip_register(struct mux_chip *mux_chip)
> >> +{
> >> +  int i;
> >> +  int ret;
> >> +
> >> +  for (i = 0; i < mux_chip->controllers; ++i) {
> >> +  struct mux_control *mux = &mux_chip->mux[i];
> >> +
> >> +  if (mux->idle_state == mux->cached_state)
> >> +  continue;
> > 
> > I think this should be changed to
> >  
> > -   if (mux->idle_state == mux->cached_state)
> > +   if (mux->idle_state == mux->cached_state ||
> > +   mux->idle_state == MUX_IDLE_AS_IS)
> > continue;
> > 
> > or the following mux_control_set will be called with state ==
> > MUX_IDLE_AS_IS. Alternatively, mux_control_set should return when passed
> > this value.
> 
> That cannot happen because ->cached_state is initialized to -1
> in mux_chip_alloc, so should always be == MUX_IDLE_AS_IS when
> registering. And drivers are not supposed to touch ->cached_state.
> I.e., ->cached_state is "owned" by the core.

So this was caused by me filling cached_state from register reads in the
mmio driver. Makes me wonder why I am not allowed to do this, though, if
I am able to read back the initial state?

regards
Philipp



Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Philipp Zabel
On Fri, 2017-04-21 at 16:55 +0200, Peter Rosin wrote:
> On 2017-04-21 16:41, Philipp Zabel wrote:
> > On Fri, 2017-04-21 at 16:32 +0200, Peter Rosin wrote:
> >> On 2017-04-21 16:23, Philipp Zabel wrote:
> >>> On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
> >>> [...]
>  +int mux_chip_register(struct mux_chip *mux_chip)
>  +{
>  +int i;
>  +int ret;
>  +
>  +for (i = 0; i < mux_chip->controllers; ++i) {
>  +struct mux_control *mux = &mux_chip->mux[i];
>  +
>  +if (mux->idle_state == mux->cached_state)
>  +continue;
> >>>
> >>> I think this should be changed to
> >>>  
> >>> -   if (mux->idle_state == mux->cached_state)
> >>> +   if (mux->idle_state == mux->cached_state ||
> >>> +   mux->idle_state == MUX_IDLE_AS_IS)
> >>> continue;
> >>>
> >>> or the following mux_control_set will be called with state ==
> >>> MUX_IDLE_AS_IS. Alternatively, mux_control_set should return when passed
> >>> this value.
> >>
> >> That cannot happen because ->cached_state is initialized to -1
> >> in mux_chip_alloc, so should always be == MUX_IDLE_AS_IS when
> >> registering. And drivers are not supposed to touch ->cached_state.
> >> I.e., ->cached_state is "owned" by the core.
> > 
> > So this was caused by me filling cached_state from register reads in the
> > mmio driver. Makes me wonder why I am not allowed to do this, though, if
> > I am able to read back the initial state?
> 
> You gain fairly little by reading back the original state. If the mux
> should idle-as-is, you can avoid a maximum of one mux update if the first
> consumer happens to start by requesting the previously active state.
> Similarly, if the mux should idle in a specific state, you can avoid a
> maximum of one mux update.
> 
> In both cases it costs one unconditional read of the mux state.
> 
> Sure, in some cases reads are cheaper than writes, but I didn't think
> support for seeding the cache was worth it. Is it worth it?

Probably not, I'll just drop the cached_state initialization. It should
be documented in mux.h that this field is framework internal and not
to be touched by drivers. At least I was surprised.
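
A sketch of what such a kernel-doc note in mux.h could say (wording
illustrative only):

	/*
	 * @cached_state: the current state of the mux, or -1 (MUX_IDLE_AS_IS)
	 *	if unknown. Owned by the mux core; drivers must not touch it.
	 */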

regards
Philipp



Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Philipp Zabel
On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
[...]
> +int mux_chip_register(struct mux_chip *mux_chip)
> +{
> + int i;
> + int ret;
> +
> + for (i = 0; i < mux_chip->controllers; ++i) {
> + struct mux_control *mux = &mux_chip->mux[i];
> +
> + if (mux->idle_state == mux->cached_state)
> + continue;

I think this should be changed to
 
-   if (mux->idle_state == mux->cached_state)
+   if (mux->idle_state == mux->cached_state ||
+   mux->idle_state == MUX_IDLE_AS_IS)
continue;

or the following mux_control_set will be called with state ==
MUX_IDLE_AS_IS. Alternatively, mux_control_set should return when passed
this value.

> + ret = mux_control_set(mux, mux->idle_state);
> + if (ret < 0) {
> + dev_err(&mux_chip->dev, "unable to set idle state\n");
> + return ret;
> + }
> + }
> +
> + ret = device_add(&mux_chip->dev);
> + if (ret < 0)
> + dev_err(&mux_chip->dev,
> + "device_add failed in mux_chip_register: %d\n", ret);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(mux_chip_register);

regards
Philipp



Re: [PATCH v2 00/11] Documentation: Add ABI to the admin guide

2017-04-21 Thread Mauro Carvalho Chehab
Em Fri, 21 Apr 2017 08:37:45 +0200
Markus Heiser  escreveu:

> Am 21.04.2017 um 01:21 schrieb Mauro Carvalho Chehab 
> :
> 
> > - I'm not a python programmer ;-) I just took Markus "generic" kernel-cmd
> >  code, hardcoding there a call to the script.
> > 
> >  With (a lot of) time, I would likely be able to find a solution to add
> >  the entire ABI logic there, but, in this case, we would lose the
> >  capability of calling the script without Sphinx.  
> 
> Hi Mauro,
> 
> I have no problem with calling the perl script, but your last sentence
> is not correct: We can implement a python module, which is used by sphinx
> and we can add a command line as well.

Markus,

Yeah, I guess technically it would be possible to make a Sphinx plugin
that would also have a command line interface. I don't like this
kind of option, because the code can become messy.

A better design would be to create a library and make two interfaces
for it, one for the ReST plugin and another one to be called via a
command line, but, in this case, we'll need to maintain 3 interdependent
components (library, command line, Sphinx plugin) instead of two (almost)
independent ones (a script and a Sphinx plugin).

So, IMO, the design I took is a good one, and it has a big advantage:
writing in perl is way easier for me, and I can benefit from my
knowledge to write a script that performs well. On my desktop, it can
parse the entire ABI, search for a string there and output its result
in 100ms:

$ time ./scripts/get_abi.pl search usbip_status

/sys/devices/platform/usbip-vudc.%d/usbip_status


Kernel version: 4.6
Date:   April 2016
Contact:Krzysztof Opasiak 
Defined on file:Documentation/ABI/testing/sysfs-platform-usbip-vudc

Description:

Current status of the device.
Allowed values:
1 - Device is available and can be exported
2 - Device is currently exported
3 - Fatal error occurred during communication
  with peer


real0m0.112s
user0m0.106s
sys 0m0.005s

---

With regards to the decision of using perl instead of python,
see below. One might think it is a rant. It isn't. It is just a
matter of optimizing my development time.

I have lots of bad experiences with patchwork and even with
cgit (running on an old LTS debian machine) due to the lack of 
a consistent backward compatible API on python.

Most of my bad experiences with python scripts are related to how badly
it handles file input with an unknown charset and unsigned chars > 0x7f.
If Python can't recognize a char as a valid ascii character
(or the charset was not explicitly defined - or the script doesn't
 know the original encoding charset), the script crashes
(ok, one could add a "try" block, but it is very painful to do that
all over the code). Also, the way the charset is specified on a python
script changed several times during 2.x development (causing incompatible
changes), and again on 3.x. It is even worse if a python script is called
by some other script, as, in such case, Python (not sure if all versions)
ignores script headers like:
# -*- coding: utf-8 -*-

which would otherwise tell what the default encoding is.

If you look at patchwork git tree, you'll see that it took a really long
time since when the first patch trying to address those issues until the
last one was merged.

That's the first patch there[1]:

commit ea39a9952e3fa647ebcb4bf16981ce941ec5236a
Author: Mauro Carvalho Chehab 
Date:   Tue Nov 18 23:00:32 2008 -0200

Fix non-ascii character encodings on xmlrpc interface

That seems to be the last one[2]:
commit 880fc52d2d4ccdcbf4a7b76f1b4ba6b9e7482dff
Author: Siddhesh Poyarekar 
Date:   Mon Jul 14 10:21:32 2014 +0800

parsemail: Fallback to common charsets when charset is None or 
x-unknown

AFAICT, none of those charset fixes are due to a code regression; they
instead fix parsing in different places of the code. So, it took
at least 6 years to get it right there.

So, my decision to write it in perl is basically due to the fact
that I can write a reliable script in a few hours, and won't
need to be concerned that some weird char inside a file or some
new scripting interpreter version would cause my script to crash.

[1] git://github.com/getpatchwork/patchwork
[2] I guess I hit other patchwork charset bugs after 2014.

Thanks,
Mauro


Re: [PATCH v13 03/10] mux: minimal mux subsystem and gpio-based mux controller

2017-04-21 Thread Philipp Zabel
Hi Peter,

On Thu, 2017-04-13 at 18:43 +0200, Peter Rosin wrote:
[...]
> +int mux_control_select(struct mux_control *mux, int state)

state could be unsigned int for the consumer facing API.

> +{
> + int ret;

And mux_control_select should check that (0 <= state < mux->states).

regards
Philipp




[PATCH v2 2/3] rtmutex: update rt-mutex

2017-04-21 Thread Alex Shi
The rtmutex code removed the pending owner bit from rt_mutex::owner in
commit 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task
get lock"),
but the document was not updated accordingly. Update it to a meaningful
state.

BTW, as 'Steven Rostedt' mentioned:
There is still technically a "Pending Owner", it's just not called
that anymore. The pending owner happens to be the top_waiter of a lock
that has no owner and has been woken up to grab the lock.

Signed-off-by: Alex Shi 
Cc: Steven Rostedt 
Cc: Sebastian Siewior 
To: linux-doc@vger.kernel.org
To: linux-ker...@vger.kernel.org
To: Jonathan Corbet 
To: Ingo Molnar 
To: Peter Zijlstra 
Cc: Thomas Gleixner 
---
 Documentation/locking/rt-mutex.txt | 58 +-
 1 file changed, 26 insertions(+), 32 deletions(-)

diff --git a/Documentation/locking/rt-mutex.txt 
b/Documentation/locking/rt-mutex.txt
index 243393d..35793e0 100644
--- a/Documentation/locking/rt-mutex.txt
+++ b/Documentation/locking/rt-mutex.txt
@@ -28,14 +28,13 @@ magic bullet for poorly designed applications, but it allows
 well-designed applications to use userspace locks in critical parts of
 an high priority thread, without losing determinism.
 
-The enqueueing of the waiters into the rtmutex waiter list is done in
+The enqueueing of the waiters into the rtmutex waiter tree is done in
 priority order. For same priorities FIFO order is chosen. For each
 rtmutex, only the top priority waiter is enqueued into the owner's
-priority waiters list. This list too queues in priority order. Whenever
+priority waiters tree. This tree too queues in priority order. Whenever
 the top priority waiter of a task changes (for example it timed out or
-got a signal), the priority of the owner task is readjusted. [The
-priority enqueueing is handled by "plists", see include/linux/plist.h
-for more details.]
+got a signal), the priority of the owner task is readjusted. The
+priority enqueueing is handled by "pi_waiters".
 
 RT-mutexes are optimized for fastpath operations and have no internal
 locking overhead when locking an uncontended mutex or unlocking a mutex
@@ -46,34 +45,29 @@ is used]
 The state of the rt-mutex is tracked via the owner field of the rt-mutex
 structure:
 
-rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1
-are used to keep track of the "owner is pending" and "rtmutex has
-waiters" state.
+lock->owner holds the task_struct pointer of the owner. Bit 0 is used to
+keep track of the "lock has waiters" state.
 
- owner bit1bit0
- NULL  0   0   mutex is free (fast acquire possible)
- NULL  0   1   invalid state
- NULL  1   0   Transitional state*
- NULL  1   1   invalid state
- taskpointer   0   0   mutex is held (fast release possible)
- taskpointer   0   1   task is pending owner
- taskpointer   1   0   mutex is held and has waiters
- taskpointer   1   1   task is pending owner and mutex has waiters
+ ownerbit0
+ NULL 0   lock is free (fast acquire possible)
+ NULL 1   lock is free and has waiters and the top waiter
+   is going to take the lock*
+ taskpointer  0   lock is held (fast release possible)
+ taskpointer  1   lock is held and has waiters**
 
-Pending-ownership handling is a performance optimization:
-pending-ownership is assigned to the first (highest priority) waiter of
-the mutex, when the mutex is released. The thread is woken up and once
-it starts executing it can acquire the mutex. Until the mutex is taken
-by it (bit 0 is cleared) a competing higher priority thread can "steal"
-the mutex which puts the woken up thread back on the waiters list.
+The fast atomic compare exchange based acquire and release is only
+possible when bit 0 of lock->owner is 0.
 
-The pending-ownership optimization is especially important for the
-uninterrupted workflow of high-prio tasks which repeatedly
-takes/releases locks that have lower-prio waiters. Without this
-optimization the higher-prio thread would ping-pong to the lower-prio
-task [because at unlock time we always assign a new owner].
+(*) It also can be a transitional state when grabbing the lock
+with ->wait_lock is held. To prevent any fast path cmpxchg to the lock,
+we need to set the bit0 before looking at the lock, and the owner may be
+NULL in this small time, hence this can be a transitional state.
 
-(*) The "mutex has waiters" bit gets set to take the lock. If the lock
-doesn't already have an owner, this bit is quickly cleared if there are
-no waiters.  So this is a transitional state to synchronize with looking
-at the owner field of the mutex and the mutex owner releasing the lock.
+(**) There is a small time when bit 0 is set but there are no
+waiters. This 

[PATCH v2 1/3] rtmutex: update rt-mutex-design

2017-04-21 Thread Alex Shi
The rt-mutex-design document hasn't gotten a meaningful update since its
first version, even after the owner's pending bit was removed in commit 8161239a8bcc
("rtmutex: Simplify PI algorithm and make highest prio task get lock")
and the priority list 'plist' was changed to an rbtree. So the document
lags far behind the real code.

Update it to match the latest code and make it meaningful.

Signed-off-by: Alex Shi 
Cc: Steven Rostedt 
Cc: Sebastian Siewior 
To: linux-doc@vger.kernel.org
To: linux-ker...@vger.kernel.org
To: Jonathan Corbet 
To: Ingo Molnar 
To: Peter Zijlstra 
Cc: Thomas Gleixner 
---
 Documentation/locking/rt-mutex-design.txt | 390 +++---
 1 file changed, 88 insertions(+), 302 deletions(-)

diff --git a/Documentation/locking/rt-mutex-design.txt 
b/Documentation/locking/rt-mutex-design.txt
index 8666070..11beb55 100644
--- a/Documentation/locking/rt-mutex-design.txt
+++ b/Documentation/locking/rt-mutex-design.txt
@@ -97,9 +97,9 @@ waiter   - A waiter is a struct that is stored on the stack 
of a blocked
a process being blocked on the mutex, it is fine to allocate
the waiter on the process's stack (local variable).  This
structure holds a pointer to the task, as well as the mutex that
-   the task is blocked on.  It also has the plist node structures to
-   place the task in the waiter_list of a mutex as well as the
-   pi_list of a mutex owner task (described below).
+  the task is blocked on.  It also has a rbtree node structures to
+  place the task in waiters rbtree of a mutex as well as the
+  pi_waiters rbtree of a mutex owner task (described below).
 
waiter is sometimes used in reference to the task that is waiting
on a mutex. This is the same as waiter->task.
@@ -179,53 +179,35 @@ again.
  |
F->L5-+
 
-
-Plist
--
-
-Before I go further and talk about how the PI chain is stored through lists
-on both mutexes and processes, I'll explain the plist.  This is similar to
-the struct list_head functionality that is already in the kernel.
-The implementation of plist is out of scope for this document, but it is
-very important to understand what it does.
-
-There are a few differences between plist and list, the most important one
-being that plist is a priority sorted linked list.  This means that the
-priorities of the plist are sorted, such that it takes O(1) to retrieve the
-highest priority item in the list.  Obviously this is useful to store processes
-based on their priorities.
-
-Another difference, which is important for implementation, is that, unlike
-list, the head of the list is a different element than the nodes of a list.
-So the head of the list is declared as struct plist_head and nodes that will
-be added to the list are declared as struct plist_node.
-
+If the G process has highest priority in the chain, then all the tasks up
+the chain (A and B in this example), must have their priorities increased
+to that of G.
 
 Mutex Waiter List
 -
 
 Every mutex keeps track of all the waiters that are blocked on itself. The 
mutex
-has a plist to store these waiters by priority.  This list is protected by
+has a rbtree to store these waiters by priority.  This tree is protected by
 a spin lock that is located in the struct of the mutex. This lock is called
-wait_lock.  Since the modification of the waiter list is never done in
+wait_lock.  Since the modification of the waiter tree is never done in
 interrupt context, the wait_lock can be taken without disabling interrupts.
 
 
-Task PI List
+Task PI Tree
 
 
-To keep track of the PI chains, each process has its own PI list.  This is
-a list of all top waiters of the mutexes that are owned by the process.
-Note that this list only holds the top waiters and not all waiters that are
+To keep track of the PI chains, each process has its own PI rbtree.  This is
+a tree of all top waiters of the mutexes that are owned by the process.
+Note that this tree only holds the top waiters and not all waiters that are
 blocked on mutexes owned by the process.
 
-The top of the task's PI list is always the highest priority task that
+The top of the task's PI tree is always the highest priority task that
 is waiting on a mutex that is owned by the task.  So if the task has
 inherited a priority, it will always be the priority of the task that is
-at the top of this list.
+at the top of this tree.
 
-This list is stored in the task structure of a process as a plist called
-pi_list.  This list is protected by a spin lock also in the task structure,
+This tree is stored in the task structure of a process as a rbtree called
+pi_waiters.  It is protected by a spin lock also in the task structure,
 called pi_lock.  This lock may also be 

[RFC PATCH 01/14] cgroup: reorganize cgroup.procs / task write path

2017-04-21 Thread Waiman Long
From: Tejun Heo 

Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2.  This patch
reoragnizes the write path so that there are common helper functions
that different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

Signed-off-by: Tejun Heo 
---
 kernel/cgroup/cgroup-internal.h |   8 +-
 kernel/cgroup/cgroup-v1.c   |  58 --
 kernel/cgroup/cgroup.c  | 163 +---
 3 files changed, 142 insertions(+), 87 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 9203bfb..6ef662a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -179,10 +179,10 @@ int cgroup_migrate(struct task_struct *leader, bool 
threadgroup,
 
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
   bool threadgroup);
-ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
-size_t nbytes, loff_t off, bool threadgroup);
-ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t 
nbytes,
-  loff_t off);
+struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
+   __acquires(&cgroup_threadgroup_rwsem);
+void cgroup_procs_write_finish(void)
+   __releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
 
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1dc22f6..e4f3202 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -514,10 +514,58 @@ static int cgroup_pidlist_show(struct seq_file *s, void 
*v)
return 0;
 }
 
-static ssize_t cgroup_tasks_write(struct kernfs_open_file *of,
- char *buf, size_t nbytes, loff_t off)
+static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
+char *buf, size_t nbytes, loff_t off,
+bool threadgroup)
 {
-   return __cgroup_procs_write(of, buf, nbytes, off, false);
+   struct cgroup *cgrp;
+   struct task_struct *task;
+   const struct cred *cred, *tcred;
+   ssize_t ret;
+
+   cgrp = cgroup_kn_lock_live(of->kn, false);
+   if (!cgrp)
+   return -ENODEV;
+
+   task = cgroup_procs_write_start(buf, threadgroup);
+   ret = PTR_ERR_OR_ZERO(task);
+   if (ret)
+   goto out_unlock;
+
+   /*
+* Even if we're attaching all tasks in the thread group, we only
+* need to check permissions on one of them.
+*/
+   cred = current_cred();
+   tcred = get_task_cred(task);
+   if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
+   !uid_eq(cred->euid, tcred->uid) &&
+   !uid_eq(cred->euid, tcred->suid))
+   ret = -EACCES;
+   put_cred(tcred);
+   if (ret)
+   goto out_finish;
+
+   ret = cgroup_attach_task(cgrp, task, threadgroup);
+
+out_finish:
+   cgroup_procs_write_finish();
+out_unlock:
+   cgroup_kn_unlock(of->kn);
+
+   return ret ?: nbytes;
+}
+
+static ssize_t cgroup1_procs_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   return __cgroup1_procs_write(of, buf, nbytes, off, true);
+}
+
+static ssize_t cgroup1_tasks_write(struct kernfs_open_file *of,
+  char *buf, size_t nbytes, loff_t off)
+{
+   return __cgroup1_procs_write(of, buf, nbytes, off, false);
 }
 
 static ssize_t cgroup_release_agent_write(struct kernfs_open_file *of,
@@ -596,7 +644,7 @@ struct cftype cgroup1_base_files[] = {
.seq_stop = cgroup_pidlist_stop,
.seq_show = cgroup_pidlist_show,
.private = CGROUP_FILE_PROCS,
-   .write = cgroup_procs_write,
+   .write = cgroup1_procs_write,
},
{
.name = "cgroup.clone_children",
@@ -615,7 +663,7 @@ struct cftype cgroup1_base_files[] = {
.seq_stop = cgroup_pidlist_stop,
.seq_show = cgroup_pidlist_show,
.private = CGROUP_FILE_TASKS,
-   .write = cgroup_tasks_write,
+   .write = cgroup1_tasks_write,
},
{
.name = "notify_on_release",
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 687f5e0..b4b8c6b 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1914,6 +1914,23 @@ int task_cgroup_path(struct task_struct *task, char 
*buf, size_t buflen)
 }
 EXPORT_SYMBOL_GPL(task_cgroup_path);
 
+static struct cgroup *cgroup_migrate_common_ancestor(struct task_struct *task,
+  

[RFC PATCH 03/14] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling

2017-04-21 Thread Waiman Long
From: Tejun Heo 

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually belong
and to which domain-level resource consumptions not tied to any specific
task are charged.  In the subtree, threads won't be subject to process
granularity or no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch introduces cgroup->proc_cgrp along with threaded css_set
handling.

* cgroup->proc_cgrp is NULL if !threaded.  If threaded, points to the
  proc_cgrp (root of the threaded subtree).

* css_set->proc_cset points to self if !threaded.  If threaded, points
  to the css_set which belongs to the cgrp->proc_cgrp.  The proc_cgrp
  serves as the resource domain and needs the matching csses readily
  available.  The proc_cset holds those csses and makes them easily
  accessible.

* All threaded csets are linked on their proc_csets to enable
  iteration of all threaded tasks.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup-defs.h | 22 
 kernel/cgroup/cgroup.c  | 87 +
 2 files changed, 103 insertions(+), 6 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6a3f850..9283ee9 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -158,6 +158,15 @@ struct css_set {
/* reference count */
atomic_t refcount;
 
+   /*
+* If not threaded, the following points to self.  If threaded, to
+* a cset which belongs to the top cgroup of the threaded subtree.
+* The proc_cset provides access to the process cgroup and its
+* csses to which domain level resource consumptions should be
+* charged.
+*/
+   struct css_set __rcu *proc_cset;
+
/* the default cgroup associated with this css_set */
struct cgroup *dfl_cgrp;
 
@@ -183,6 +192,10 @@ struct css_set {
 */
struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
 
+   /* all csets whose ->proc_cset points to this cset */
+   struct list_head threaded_csets;
+   struct list_head threaded_csets_node;
+
/*
 * List running through all cgroup groups in the same hash
 * slot. Protected by css_set_lock
@@ -289,6 +302,15 @@ struct cgroup {
struct list_head e_csets[CGROUP_SUBSYS_COUNT];
 
/*
+* If !threaded, NULL.  If threaded, it points to the top cgroup of
+* the threaded subtree, on which it points to self.  Threaded
+* subtree is exempt from process granularity and no-internal-task
+* constraint.  Domain level resource consumptions which aren't
+* tied to a specific task should be charged to the proc_cgrp.
+*/
+   struct cgroup *proc_cgrp;
+
+   /*
 * list of pidlists, up to two for each namespace (one for procs, one
 * for tasks); created on demand.
 */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9bbfadc..016bbc6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -555,9 +555,11 @@ struct cgroup_subsys_state *of_css(struct kernfs_open_file 
*of)
  */
 struct css_set init_css_set = {
.refcount   = ATOMIC_INIT(1),
+   .proc_cset  = RCU_INITIALIZER(&init_css_set),
.tasks  = LIST_HEAD_INIT(init_css_set.tasks),
.mg_tasks   = LIST_HEAD_INIT(init_css_set.mg_tasks),
.task_iters = LIST_HEAD_INIT(init_css_set.task_iters),
+   .threaded_csets = LIST_HEAD_INIT(init_css_set.threaded_csets),
.cgrp_links = LIST_HEAD_INIT(init_css_set.cgrp_links),
.mg_preload_node= LIST_HEAD_INIT(init_css_set.mg_preload_node),
.mg_node= LIST_HEAD_INIT(init_css_set.mg_node),
@@ -576,6 +578,17 @@ static bool css_set_populated(struct css_set *cset)
return !list_empty(&cset->tasks) || !list_empty(&cset->mg_tasks);
 }
 
+static struct css_set *proc_css_set(struct css_set *cset)
+{
+   return rcu_dereference_protected(cset->proc_cset,
+lockdep_is_held(&css_set_lock));
+}
+
+static bool css_set_threaded(struct css_set *cset)
+{
+   return proc_css_set(cset) != cset;
+}
+
 /**
  * cgroup_update_populated - updated populated count of a cgroup
  * @cgrp: the target cgroup
@@ -727,6 +740,8 @@ void put_css_set_locked(struct css_set *cset)
if (!atomic_dec_and_test(&cset->refcount))
return;
 
+   WARN_ON_ONCE(!list_empty(&cset->threaded_csets));
+
/* This css_set is dead. unlink it and release cgroup and css refs */
for_each_subsys(ss, ssid) {
list_del(>e_cset_node[ssid]);
@@ -743,6 +758,11 @@ 

[RFC PATCH 02/14] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS

2017-04-21 Thread Waiman Long
From: Tejun Heo 

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h   |  6 +-
 kernel/cgroup/cgroup-v1.c|  6 +++---
 kernel/cgroup/cgroup.c   | 24 ++--
 kernel/cgroup/cpuset.c   |  6 +++---
 kernel/cgroup/freezer.c  |  6 +++---
 mm/memcontrol.c  |  2 +-
 net/core/netclassid_cgroup.c |  2 +-
 7 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index af9c86e..37b20ef 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -36,9 +36,13 @@
 #define CGROUP_WEIGHT_DFL  100
 #define CGROUP_WEIGHT_MAX  1
 
+/* walk only threadgroup leaders */
+#define CSS_TASK_ITER_PROCS(1U << 0)
+
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
struct cgroup_subsys    *ss;
+   unsigned int            flags;
 
struct list_head*cset_pos;
struct list_head*cset_head;
@@ -129,7 +133,7 @@ struct task_struct *cgroup_taskset_first(struct 
cgroup_taskset *tset,
 struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
struct cgroup_subsys_state **dst_cssp);
 
-void css_task_iter_start(struct cgroup_subsys_state *css,
+void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
 struct css_task_iter *it);
 struct task_struct *css_task_iter_next(struct css_task_iter *it);
 void css_task_iter_end(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e4f3202..b837e1a 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -121,7 +121,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup 
*from)
 * ->can_attach() fails.
 */
do {
-   css_task_iter_start(&from->self, &it);
+   css_task_iter_start(&from->self, 0, &it);
task = css_task_iter_next(&it);
if (task)
get_task_struct(task);
@@ -377,7 +377,7 @@ static int pidlist_array_load(struct cgroup *cgrp, enum 
cgroup_filetype type,
if (!array)
return -ENOMEM;
/* now, populate the array */
-   css_task_iter_start(&cgrp->self, &it);
+   css_task_iter_start(&cgrp->self, 0, &it);
while ((tsk = css_task_iter_next(&it))) {
if (unlikely(n == length))
break;
@@ -753,7 +753,7 @@ int cgroupstats_build(struct cgroupstats *stats, struct 
dentry *dentry)
}
rcu_read_unlock();
 
-   css_task_iter_start(&cgrp->self, &it);
+   css_task_iter_start(&cgrp->self, 0, &it);
while ((tsk = css_task_iter_next(&it))) {
switch (tsk->state) {
case TASK_RUNNING:
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index b4b8c6b..9bbfadc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3590,6 +3590,7 @@ static void css_task_iter_advance(struct css_task_iter 
*it)
lockdep_assert_held(&css_set_lock);
WARN_ON_ONCE(!l);
 
+repeat:
/*
 * Advance iterator to find next entry.  cset->tasks is consumed
 * first and then ->mg_tasks.  After ->mg_tasks, we move onto the
@@ -3604,11 +3605,18 @@ static void css_task_iter_advance(struct css_task_iter 
*it)
css_task_iter_advance_css_set(it);
else
it->task_pos = l;
+
+   /* if PROCS, skip over tasks which aren't group leaders */
+   if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
+   !thread_group_leader(list_entry(it->task_pos, struct task_struct,
+   cg_list)))
+   goto repeat;
 }
 
 /**
  * css_task_iter_start - initiate task iteration
  * @css: the css to walk tasks of
+ * @flags: CSS_TASK_ITER_* flags
  * @it: the task iterator to use
  *
  * Initiate iteration through the tasks of @css.  

[RFC PATCH 05/14] cgroup: implement cgroup v2 thread support

2017-04-21 Thread Waiman Long
From: Tejun Heo 

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

Once thread mode is enabled on a cgroup, the threads of the processes
which are in its subtree can be placed inside the subtree without
being restricted by process granularity or no-internal-process
constraint.  Note that the threads aren't allowed to escape to a
different threaded subtree.  To be used inside a threaded subtree, a
controller should explicitly support threaded mode and be able to
handle internal competition in the way which is appropriate for the
resource.

The root of a threaded subtree, where thread mode is enabled in the
first place, is called the thread root and serves as the resource
domain for the whole subtree.  This is the last cgroup where
non-threaded controllers are operational and where all the
domain-level resource consumptions in the subtree are accounted.  This
allows threaded controllers to operate at thread granularity when
requested while staying inside the scope of system-level resource
distribution.

Internally, in a threaded subtree, each css_set has its ->proc_cset
pointing to a matching css_set which belongs to the thread root.  This
ensures that thread root level cgroup_subsys_state for all threaded
controllers are readily accessible for domain-level operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.  Note that the documentation update is not complete as
the rest of the documentation needs to be updated accordingly.
Rolling those updates into this patch can be confusing so that will be
separate patches.

Signed-off-by: Tejun Heo 
---
 Documentation/cgroup-v2.txt |  75 +-
 include/linux/cgroup-defs.h |  16 +++
 kernel/cgroup/cgroup.c  | 240 +++-
 kernel/cgroup/pids.c|   1 +
 kernel/events/core.c|   1 +
 5 files changed, 326 insertions(+), 7 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 49d7c99..2375e22 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -16,7 +16,9 @@ CONTENTS
   1-2. What is cgroup?
 2. Basic Operations
   2-1. Mounting
-  2-2. Organizing Processes
+  2-2. Organizing Processes and Threads
+2-2-1. Processes
+2-2-2. Threads
   2-3. [Un]populated Notification
   2-4. Controlling Controllers
 2-4-1. Enabling and Disabling
@@ -150,7 +152,9 @@ and experimenting easier, the kernel parameter 
cgroup_no_v1= allows
 disabling controllers in v1 and make them always available in v2.
 
 
-2-2. Organizing Processes
+2-2. Organizing Processes and Threads
+
+2-2-1. Processes
 
 Initially, only the root cgroup exists to which all processes belong.
 A child cgroup can be created by creating a sub-directory.
@@ -201,6 +205,73 @@ is removed subsequently, " (deleted)" is appended to the 
path.
   0::/test-cgroup/test-cgroup-nested (deleted)
 
 
+2-2-2. Threads
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes.  By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread.  The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Enabling thread mode on a subtree makes it threaded.  The root of a
+threaded subtree is called thread root and serves as the resource
+domain for the entire subtree.  In a threaded subtree, threads of a
+process can be put in different cgroups and are not subject to the no
+internal process constraint - threaded controllers can be enabled on
+non-leaf cgroups whether they have threads in them or not.
+
+To enable the thread mode, the following conditions must be met.
+
+- The thread root doesn't have any child cgroups.
+
+- The thread root doesn't have any controllers enabled.
+
+Thread mode can be enabled by writing "enable" to "cgroup.threads"
+file.
+
+  # echo enable > cgroup.threads
+
+Inside a threaded subtree, "cgroup.threads" can be read and contains
+the list of the thread IDs of all threads in the cgroup.  Except that
+the operations are per-thread instead of per-process, "cgroup.threads"
+has the same format and behaves the same way as "cgroup.procs".
+
+The thread root serves as the resource domain for the whole subtree,

[RFC PATCH 04/14] cgroup: implement CSS_TASK_ITER_THREADED

2017-04-21 Thread Waiman Long
From: Tejun Heo 

cgroup v2 is in the process of growing thread granularity support.
Once thread mode is enabled, the root cgroup of the subtree serves as
the proc_cgrp to which the processes of the subtree conceptually
belong and domain-level resource consumptions not tied to any specific
task are charged.  In the subtree, threads won't be subject to process
granularity or the no-internal-task constraint and can be distributed
arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
which, when used on a proc_cgrp, makes the iteration include the tasks
on all the associated threaded css_sets.  "cgroup.procs" read path is
updated to use it so that reading the file on a proc_cgrp lists all
processes.  This will also be used by controller implementations which
need to walk processes or tasks at the resource domain level.

Task iteration is implemented nested in css_set iteration.  If
CSS_TASK_ITER_THREADED is specified, after walking tasks of each
!threaded css_set, all the associated threaded css_sets are visited
before moving onto the next !threaded css_set.
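
As an illustration, a domain-level walk over all tasks of a thread
root could then look like the sketch below. This is illustrative only;
it assumes the flags-taking css_task_iter_start() that comes with the
CSS_TASK_ITER_PROCS change, and the printout is just an example.

  static void print_domain_tasks(struct cgroup_subsys_state *css)
  {
          struct css_task_iter it;
          struct task_struct *task;

          css_task_iter_start(css, CSS_TASK_ITER_THREADED, &it);
          while ((task = css_task_iter_next(&it)))
                  pr_info("domain task: %d\n", task_pid_nr(task));
          css_task_iter_end(&it);
  }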

Signed-off-by: Tejun Heo 
---
 include/linux/cgroup.h |  6 
 kernel/cgroup/cgroup.c | 81 +-
 2 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 37b20ef..d62d75c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,8 @@
 
 /* walk only threadgroup leaders */
#define CSS_TASK_ITER_PROCS	(1U << 0)
+/* walk threaded css_sets as part of their proc_csets */
+#define CSS_TASK_ITER_THREADED (1U << 1)
 
 /* a css_task_iter should be treated as an opaque object */
 struct css_task_iter {
@@ -47,11 +49,15 @@ struct css_task_iter {
	struct list_head	*cset_pos;
	struct list_head	*cset_head;

+	struct list_head	*tcset_pos;
+	struct list_head	*tcset_head;
+
	struct list_head	*task_pos;
	struct list_head	*tasks_head;
	struct list_head	*mg_tasks_head;

	struct css_set		*cur_cset;
+	struct css_set		*cur_pcset;
	struct task_struct	*cur_task;
	struct list_head	iters_node;	/* css_set->task_iters */
};
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 016bbc6..b2b1886 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3592,27 +3592,36 @@ bool css_has_online_children(struct cgroup_subsys_state 
*css)
return ret;
 }
 
-/**
- * css_task_iter_advance_css_set - advance a task itererator to the next 
css_set
- * @it: the iterator to advance
- *
- * Advance @it to the next css_set to walk.
- */
-static void css_task_iter_advance_css_set(struct css_task_iter *it)
+static struct css_set *css_task_iter_next_css_set(struct css_task_iter *it)
 {
-   struct list_head *l = it->cset_pos;
+   bool threaded = it->flags & CSS_TASK_ITER_THREADED;
+   struct list_head *l;
struct cgrp_cset_link *link;
struct css_set *cset;
 
	lockdep_assert_held(&css_set_lock);
 
-   /* Advance to the next non-empty css_set */
+   /* find the next threaded cset */
+   if (it->tcset_pos) {
+   l = it->tcset_pos->next;
+
+   if (l != it->tcset_head) {
+   it->tcset_pos = l;
+   return container_of(l, struct css_set,
+   threaded_csets_node);
+   }
+
+   it->tcset_pos = NULL;
+   }
+
+   /* find the next cset */
+   l = it->cset_pos;
+
do {
l = l->next;
if (l == it->cset_head) {
it->cset_pos = NULL;
-   it->task_pos = NULL;
-   return;
+   return NULL;
}
 
if (it->ss) {
@@ -3622,10 +3631,50 @@ static void css_task_iter_advance_css_set(struct 
css_task_iter *it)
link = list_entry(l, struct cgrp_cset_link, cset_link);
cset = link->cset;
}
-   } while (!css_set_populated(cset));
+
+   /*
+* For threaded iterations, threaded csets are walked
+* together with their proc_csets.  Skip here.
+*/
+   } while (threaded && css_set_threaded(cset));
 
it->cset_pos = l;
 
+   /* initialize threaded cset walking */
+   if (threaded) {
+   if (it->cur_pcset)
+   put_css_set_locked(it->cur_pcset);
+   it->cur_pcset = cset;
+   get_css_set(cset);
+
+   it->tcset_head = &cset->threaded_csets;
+   it->tcset_pos = &cset->threaded_csets;
+   }
+
+   return cset;

[RFC PATCH 07/14] cgroup: Move debug cgroup to its own file

2017-04-21 Thread Waiman Long
The debug cgroup currently resides within cgroup-v1.c and is enabled
only for v1 cgroups. To enable the debug cgroup also for v2, it
makes sense to put the code into its own file as it will no longer
be v1 specific. The only change in this patch is the expansion of
cgroup_task_count() within the debug_taskcount_read() function.
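
Sketched out, the expanded form reads as follows (this mirrors what
debug.c carries after this patch; the local variable names are
illustrative and may differ):

  static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
                                  struct cftype *cft)
  {
          struct cgroup *cgrp = css->cgroup;
          struct cgrp_cset_link *link;
          u64 count = 0;

          spin_lock_irq(&css_set_lock);
          list_for_each_entry(link, &cgrp->cset_links, cset_link)
                  count += atomic_read(&link->cset->refcount);
          spin_unlock_irq(&css_set_lock);
          return count;
  }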

Signed-off-by: Waiman Long 
---
 kernel/cgroup/Makefile|   1 +
 kernel/cgroup/cgroup-v1.c | 147 -
 kernel/cgroup/debug.c | 165 ++
 3 files changed, 166 insertions(+), 147 deletions(-)
 create mode 100644 kernel/cgroup/debug.c

diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 387348a..ce693cc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
 obj-$(CONFIG_CGROUP_PIDS) += pids.o
 obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
+obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index e80bc8e..6757a50 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1297,150 +1297,3 @@ static int __init cgroup_no_v1(char *str)
return 1;
 }
 __setup("cgroup_no_v1=", cgroup_no_v1);
-
-
-#ifdef CONFIG_CGROUP_DEBUG
-static struct cgroup_subsys_state *
-debug_css_alloc(struct cgroup_subsys_state *parent_css)
-{
-   struct cgroup_subsys_state *css = kzalloc(sizeof(*css), GFP_KERNEL);
-
-   if (!css)
-   return ERR_PTR(-ENOMEM);
-
-   return css;
-}
-
-static void debug_css_free(struct cgroup_subsys_state *css)
-{
-   kfree(css);
-}
-
-static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
-   struct cftype *cft)
-{
-   return cgroup_task_count(css->cgroup);
-}
-
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-   struct cftype *cft)
-{
-   return (u64)(unsigned long)current->cgroups;
-}
-
-static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
-struct cftype *cft)
-{
-   u64 count;
-
-   rcu_read_lock();
-   count = atomic_read(&task_css_set(current)->refcount);
-   rcu_read_unlock();
-   return count;
-}
-
-static int current_css_set_cg_links_read(struct seq_file *seq, void *v)
-{
-   struct cgrp_cset_link *link;
-   struct css_set *cset;
-   char *name_buf;
-
-   name_buf = kmalloc(NAME_MAX + 1, GFP_KERNEL);
-   if (!name_buf)
-   return -ENOMEM;
-
-   spin_lock_irq(&css_set_lock);
-   rcu_read_lock();
-   cset = rcu_dereference(current->cgroups);
-   list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
-   struct cgroup *c = link->cgrp;
-
-   cgroup_name(c, name_buf, NAME_MAX + 1);
-   seq_printf(seq, "Root %d group %s\n",
-  c->root->hierarchy_id, name_buf);
-   }
-   rcu_read_unlock();
-   spin_unlock_irq(&css_set_lock);
-   kfree(name_buf);
-   return 0;
-}
-
-#define MAX_TASKS_SHOWN_PER_CSS 25
-static int cgroup_css_links_read(struct seq_file *seq, void *v)
-{
-   struct cgroup_subsys_state *css = seq_css(seq);
-   struct cgrp_cset_link *link;
-
-   spin_lock_irq(_set_lock);
-   list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
-   struct css_set *cset = link->cset;
-   struct task_struct *task;
-   int count = 0;
-
-   seq_printf(seq, "css_set %pK\n", cset);
-
-   list_for_each_entry(task, &cset->tasks, cg_list) {
-   if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-   goto overflow;
-   seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-   }
-
-   list_for_each_entry(task, &cset->mg_tasks, cg_list) {
-   if (count++ > MAX_TASKS_SHOWN_PER_CSS)
-   goto overflow;
-   seq_printf(seq, "  task %d\n", task_pid_vnr(task));
-   }
-   continue;
-   overflow:
-   seq_puts(seq, "  ...\n");
-   }
-   spin_unlock_irq(&css_set_lock);
-   return 0;
-}
-
-static u64 releasable_read(struct cgroup_subsys_state *css, struct cftype *cft)
-{
-   return (!cgroup_is_populated(css->cgroup) &&
-   !css_has_online_children(&css->cgroup->self));
-}
-
-static struct cftype debug_files[] =  {
-   {
-   .name = "taskcount",
-   .read_u64 = debug_taskcount_read,
-   },
-
-   {
-   .name = "current_css_set",
-   .read_u64 = current_css_set_read,
-   },
-
-   {
-   .name = "current_css_set_refcount",
-   .read_u64 = current_css_set_refcount_read,
-   },
-
-   {
-   .name = "current_css_set_cg_links",
-   .seq_show 

[RFC PATCH 06/14] cgroup: Fix reference counting bug in cgroup_procs_write()

2017-04-21 Thread Waiman Long
The cgroup_procs_write_start() took a reference to the task structure
which was not properly released by cgroup_procs_write() and the other
write paths. So a put_task_struct() call is added to
cgroup_procs_write_finish() to match the get_task_struct() in
cgroup_procs_write_start(), fixing this reference counting error.
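
The intended pairing at the call sites then becomes (a condensed
sketch of the callers changed below; error handling elided):

  task = cgroup_procs_write_start(buf, threadgroup); /* takes a task ref */
  if (!IS_ERR(task)) {
          ret = cgroup_attach_task(cgrp, task, threadgroup);
          cgroup_procs_write_finish(task);           /* drops that ref */
  }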

Signed-off-by: Waiman Long 
---
 kernel/cgroup/cgroup-internal.h |  2 +-
 kernel/cgroup/cgroup-v1.c   |  2 +-
 kernel/cgroup/cgroup.c  | 10 ++
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 6ef662a..bea3928 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -181,7 +181,7 @@ int cgroup_attach_task(struct cgroup *dst_cgrp, struct 
task_struct *leader,
   bool threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup)
__acquires(&cgroup_threadgroup_rwsem);
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
__releases(&cgroup_threadgroup_rwsem);
 
 void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index b837e1a..e80bc8e 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -549,7 +549,7 @@ static ssize_t __cgroup1_procs_write(struct 
kernfs_open_file *of,
ret = cgroup_attach_task(cgrp, task, threadgroup);
 
 out_finish:
-   cgroup_procs_write_finish();
+   cgroup_procs_write_finish(task);
 out_unlock:
cgroup_kn_unlock(of->kn);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6748207..d48eedd 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -2487,12 +2487,15 @@ struct task_struct *cgroup_procs_write_start(char *buf, 
bool threadgroup)
return tsk;
 }
 
-void cgroup_procs_write_finish(void)
+void cgroup_procs_write_finish(struct task_struct *task)
__releases(&cgroup_threadgroup_rwsem)
 {
struct cgroup_subsys *ss;
int ssid;
 
+   /* release reference from cgroup_procs_write_start() */
+   put_task_struct(task);
+
	percpu_up_write(&cgroup_threadgroup_rwsem);
for_each_subsys(ss, ssid)
if (ss->post_attach)
@@ -3295,7 +3298,6 @@ static int cgroup_addrm_files(struct cgroup_subsys_state 
*css,
 
 static int cgroup_apply_cftypes(struct cftype *cfts, bool is_add)
 {
-   LIST_HEAD(pending);
struct cgroup_subsys *ss = cfts[0].ss;
	struct cgroup *root = &ss->root->cgrp;
struct cgroup_subsys_state *css;
@@ -4060,7 +4062,7 @@ static ssize_t cgroup_procs_write(struct kernfs_open_file 
*of,
ret = cgroup_attach_task(cgrp, task, true);
 
 out_finish:
-   cgroup_procs_write_finish();
+   cgroup_procs_write_finish(task);
 out_unlock:
cgroup_kn_unlock(of->kn);
 
@@ -4130,7 +4132,7 @@ static ssize_t cgroup_threads_write(struct 
kernfs_open_file *of,
ret = cgroup_attach_task(cgrp, task, false);
 
 out_finish:
-   cgroup_procs_write_finish();
+   cgroup_procs_write_finish(task);
 out_unlock:
cgroup_kn_unlock(of->kn);
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 09/14] cgroup: Make debug cgroup support v2 and thread mode

2017-04-21 Thread Waiman Long
Besides supporting cgroup v2 and thread mode, the following changes
are also made:
 1) current_* cgroup files now reside only at the root as we don't
need duplicated files of the same function all over the cgroup
hierarchy.
 2) The cgroup_css_links_read() function is modified to report
the number of tasks that are skipped because of overflow.
 3) The relationship between proc_cset and threaded_csets is displayed.
 4) The number of extra unaccounted references is displayed.
 5) The status of being a thread root or threaded cgroup is displayed.
 6) The current_css_set_read() function now prints out the addresses of
the css'es associated with the current css_set (sample output is
sketched below).
 7) A new cgroup_subsys_states file is added to display the css objects
associated with a cgroup.
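
For example, reading current_css_set could then produce output along
these lines (hypothetical addresses and counts; the format follows the
seq_printf() calls in current_css_set_read() below):

  css_set ffff88007c2d8780 4
   4: memory	- ffff88007c2d3000[1]
   7: pids	- ffff88007c0e1200[5]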

Signed-off-by: Waiman Long 
---
 kernel/cgroup/debug.c | 151 --
 1 file changed, 134 insertions(+), 17 deletions(-)

diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index c8f7590..4d74458 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -38,10 +38,37 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state 
*css,
return count;
 }
 
-static u64 current_css_set_read(struct cgroup_subsys_state *css,
-   struct cftype *cft)
+static int current_css_set_read(struct seq_file *seq, void *v)
 {
-   return (u64)(unsigned long)current->cgroups;
+   struct css_set *cset;
+   struct cgroup_subsys *ss;
+   struct cgroup_subsys_state *css;
+   int i, refcnt;
+
+   mutex_lock(&cgroup_mutex);
+   spin_lock_irq(&css_set_lock);
+   rcu_read_lock();
+   cset = rcu_dereference(current->cgroups);
+   refcnt = atomic_read(&cset->refcount);
+   seq_printf(seq, "css_set %pK %d", cset, refcnt);
+   if (refcnt > cset->task_count)
+   seq_printf(seq, " +%d", refcnt - cset->task_count);
+   seq_puts(seq, "\n");
+
+   /*
+* Print the css'es stored in the current css_set.
+*/
+   for_each_subsys(ss, i) {
+   css = cset->subsys[ss->id];
+   if (!css)
+   continue;
+   seq_printf(seq, "%2d: %-4s\t- %lx[%d]\n", ss->id, ss->name,
+ (unsigned long)css, css->id);
+   }
+   rcu_read_unlock();
+   spin_unlock_irq(&css_set_lock);
+   mutex_unlock(&cgroup_mutex);
+   return 0;
 }
 
 static u64 current_css_set_refcount_read(struct cgroup_subsys_state *css,
@@ -86,31 +113,111 @@ static int cgroup_css_links_read(struct seq_file *seq, 
void *v)
 {
struct cgroup_subsys_state *css = seq_css(seq);
struct cgrp_cset_link *link;
+   int dead_cnt = 0, extra_refs = 0, threaded_csets = 0;
 
	spin_lock_irq(&css_set_lock);
+   if (css->cgroup->proc_cgrp)
+   seq_puts(seq, (css->cgroup->proc_cgrp == css->cgroup)
+ ? "[thread root]\n" : "[threaded]\n");
+
	list_for_each_entry(link, &css->cgroup->cset_links, cset_link) {
struct css_set *cset = link->cset;
struct task_struct *task;
int count = 0;
+   int refcnt = atomic_read(&cset->refcount);
 
-   seq_printf(seq, "css_set %pK\n", cset);
+   /*
+* Print out the proc_cset and threaded_cset relationship
+* and highlight difference between refcount and task_count.
+*/
+   seq_printf(seq, "css_set %pK", cset);
+   if (cset->proc_cset != cset) {
+   threaded_csets++;
+   seq_printf(seq, "=>%pK", cset->proc_cset);
+   }
+   if (!list_empty(&cset->threaded_csets)) {
+   struct css_set *tcset;
+   int idx = 0;
+
+   list_for_each_entry(tcset, &cset->threaded_csets,
+   threaded_csets_node) {
+   seq_puts(seq, idx ? "," : "<=");
+   seq_printf(seq, "%pK", tcset);
+   idx++;
+   }
+   } else {
+   seq_printf(seq, " %d", refcnt);
+   if (refcnt - cset->task_count > 0) {
+   int extra = refcnt - cset->task_count;
+
+   seq_printf(seq, " +%d", extra);
+   /*
+* Take out the one additional reference in
+* init_css_set.
+*/
+   if (cset == &init_css_set)
+   extra--;
+   extra_refs += extra;
+   }
+   }
+   seq_puts(seq, "\n");
 
	list_for_each_entry(task, &cset->tasks, cg_list) {
-   if (count++ > 

[RFC PATCH 08/14] cgroup: Keep accurate count of tasks in each css_set

2017-04-21 Thread Waiman Long
The reference count in the css_set data structure was used as a
proxy for the number of tasks attached to that css_set. However, that
count is not an accurate measure, especially with thread mode
support. So a new variable task_count is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.
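
The migration path then follows this pattern (a condensed sketch of
the cgroup_migrate_execute() hunk below; all counter updates happen
under css_set_lock):

  spin_lock_irq(&css_set_lock);
  get_css_set(to_cset);
  to_cset->task_count++;
  css_set_move_task(task, from_cset, to_cset, true);
  put_css_set_locked(from_cset);
  from_cset->task_count--;
  spin_unlock_irq(&css_set_lock);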

Signed-off-by: Waiman Long 
---
 include/linux/cgroup-defs.h | 3 +++
 kernel/cgroup/cgroup-v1.c   | 6 +-
 kernel/cgroup/cgroup.c  | 5 +
 kernel/cgroup/debug.c   | 6 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb4752a..7be1a90 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -158,6 +158,9 @@ struct css_set {
/* reference count */
atomic_t refcount;
 
+   /* internal task count, protected by css_set_lock */
+   int task_count;
+
/*
 * If not threaded, the following points to self.  If threaded, to
 * a cset which belongs to the top cgroup of the threaded subtree.
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 6757a50..6d69796 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -334,10 +334,6 @@ static struct cgroup_pidlist 
*cgroup_pidlist_find_create(struct cgroup *cgrp,
 /**
  * cgroup_task_count - count the number of tasks in a cgroup.
  * @cgrp: the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static int cgroup_task_count(const struct cgroup *cgrp)
 {
@@ -346,7 +342,7 @@ static int cgroup_task_count(const struct cgroup *cgrp)
 
	spin_lock_irq(&css_set_lock);
	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-   count += atomic_read(&link->cset->refcount);
+   count += link->cset->task_count;
	spin_unlock_irq(&css_set_lock);
return count;
 }
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d48eedd..3186b1f 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1671,6 +1671,7 @@ static void cgroup_enable_task_cg_lists(void)
css_set_update_populated(cset, true);
	list_add_tail(&p->cg_list, &cset->tasks);
	get_css_set(cset);
+   cset->task_count++;
	}
	spin_unlock(&p->sighand->siglock);
} while_each_thread(g, p);
@@ -2154,8 +2155,10 @@ static int cgroup_migrate_execute(struct cgroup_mgctx 
*mgctx)
struct css_set *to_cset = cset->mg_dst_cset;
 
get_css_set(to_cset);
+   to_cset->task_count++;
css_set_move_task(task, from_cset, to_cset, true);
put_css_set_locked(from_cset);
+   from_cset->task_count--;
}
}
	spin_unlock_irq(&css_set_lock);
@@ -5150,6 +5153,7 @@ void cgroup_post_fork(struct task_struct *child)
cset = task_css_set(current);
	if (list_empty(&child->cg_list)) {
get_css_set(cset);
+   cset->task_count++;
css_set_move_task(child, NULL, cset, false);
}
	spin_unlock_irq(&css_set_lock);
@@ -5199,6 +5203,7 @@ void cgroup_exit(struct task_struct *tsk)
	if (!list_empty(&tsk->cg_list)) {
	spin_lock_irq(&css_set_lock);
	css_set_move_task(tsk, cset, NULL, false);
+   cset->task_count--;
	spin_unlock_irq(&css_set_lock);
} else {
get_css_set(cset);
diff --git a/kernel/cgroup/debug.c b/kernel/cgroup/debug.c
index 9146461..c8f7590 100644
--- a/kernel/cgroup/debug.c
+++ b/kernel/cgroup/debug.c
@@ -23,10 +23,6 @@ static void debug_css_free(struct cgroup_subsys_state *css)
 /*
  * debug_taskcount_read - return the number of tasks in a cgroup.
  * @cgrp: the cgroup in question
- *
- * Return the number of tasks in the cgroup.  The returned number can be
- * higher than the actual number of tasks due to css_set references from
- * namespace roots and temporary usages.
  */
 static u64 debug_taskcount_read(struct cgroup_subsys_state *css,
struct cftype *cft)
@@ -37,7 +33,7 @@ static u64 debug_taskcount_read(struct cgroup_subsys_state 
*css,
 
	spin_lock_irq(&css_set_lock);
	list_for_each_entry(link, &cgrp->cset_links, cset_link)
-   count += atomic_read(&link->cset->refcount);
+   count += link->cset->task_count;
	spin_unlock_irq(&css_set_lock);
return count;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of 

[RFC PATCH 11/14] sched: Misc preps for cgroup unified hierarchy interface

2017-04-21 Thread Waiman Long
From: Tejun Heo 

Make the following changes in preparation for the cpu controller
interface implementation for the unified hierarchy.  This patch
doesn't cause any functional differences.

* s/cpu_stats_show()/cpu_cfs_stats_show()/

* s/cpu_files/cpu_legacy_files/

* Separate out cpuacct_stats_read() from cpuacct_stats_show().  While
  at it, make the @val array u64 for consistency (see the call sketch
  below).
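
The pointer-to-array convention is the only subtle part of the new
helper; a call then looks like this (a sketch, with @ca standing in
for any struct cpuacct pointer such as css_ca(css)):

  u64 val[CPUACCT_STAT_NSTATS];

  cpuacct_stats_read(ca, &val);  /* &val has type u64 (*)[NSTATS] */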

Signed-off-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Li Zefan 
Cc: Johannes Weiner 
---
 kernel/sched/core.c|  8 
 kernel/sched/cpuacct.c | 29 ++---
 2 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 27b4dd5..5e3a217 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7209,7 +7209,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 
period, u64 quota)
return ret;
 }
 
-static int cpu_stats_show(struct seq_file *sf, void *v)
+static int cpu_cfs_stats_show(struct seq_file *sf, void *v)
 {
struct task_group *tg = css_tg(seq_css(sf));
	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
@@ -7249,7 +7249,7 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
-static struct cftype cpu_files[] = {
+static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
.name = "shares",
@@ -7270,7 +7270,7 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
},
{
.name = "stat",
-   .seq_show = cpu_stats_show,
+   .seq_show = cpu_cfs_stats_show,
},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -7296,7 +7296,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.fork   = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
-   .legacy_cftypes = cpu_files,
+   .legacy_cftypes = cpu_legacy_files,
.early_init = true,
 };
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f95ab29..6151c23 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -276,26 +276,33 @@ static int cpuacct_all_seq_show(struct seq_file *m, void 
*V)
return 0;
 }
 
-static int cpuacct_stats_show(struct seq_file *sf, void *v)
+static void cpuacct_stats_read(struct cpuacct *ca,
+  u64 (*val)[CPUACCT_STAT_NSTATS])
 {
-   struct cpuacct *ca = css_ca(seq_css(sf));
-   s64 val[CPUACCT_STAT_NSTATS];
int cpu;
-   int stat;
 
-   memset(val, 0, sizeof(val));
+   memset(val, 0, sizeof(*val));
+
for_each_possible_cpu(cpu) {
u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-   val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-   val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+   (*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
+   (*val)[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+   (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
+   (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
+   (*val)[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
}
+}
+
+static int cpuacct_stats_show(struct seq_file *sf, void *v)
+{
+   u64 val[CPUACCT_STAT_NSTATS];
+   int stat;
+
+   cpuacct_stats_read(css_ca(seq_css(sf)), &val);
 
for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-   seq_printf(sf, "%s %lld\n",
+   seq_printf(sf, "%s %llu\n",
   cpuacct_stat_desc[stat],
   (long long)nsec_to_clock_t(val[stat]));
}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 14/14] cgroup: Enable separate control knobs for thread root internal processes

2017-04-21 Thread Waiman Long
Internal processes are allowed in a thread root of the cgroup v2
default hierarchy. For those resource domain controllers that don't
want to deal with resource competition between internal processes and
child cgroups, there is now the option of specifying the sep_res_domain
flag in their cgroup_subsys data structure. This flag will tell the
cgroup core to create a special directory "cgroup.self" under the
thread root to hold their resource control knobs for all the processes
within the threaded subtree.

User applications can then tune the control knobs in the "cgroup.self"
directory as if all the threaded subtree processes are under it for
resource tracking and control purposes.

This directory name is reserved and so it cannot be created or deleted
directly. Moreover, sub-directories cannot be created under it.

This sep_res_domain flag is turned on in the memcg to showcase
its effect.
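
A controller opts in with a single flag in its cgroup_subsys
definition, along these lines (a sketch with a made-up controller
name; memcg does exactly this in the last hunk):

  struct cgroup_subsys foo_cgrp_subsys = {
          /* the usual .css_alloc/.css_free/... callbacks */
          .sep_res_domain = true, /* knobs also appear under cgroup.self */
  };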

Signed-off-by: Waiman Long 
---
 Documentation/cgroup-v2.txt |  20 
 include/linux/cgroup-defs.h |  15 ++
 kernel/cgroup/cgroup.c  | 122 +++-
 kernel/cgroup/debug.c   |   6 +++
 mm/memcontrol.c |   1 +
 5 files changed, 150 insertions(+), 14 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4d1c24d..e4c25ec 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -393,6 +393,26 @@ cgroup must create children and transfer all its processes 
to the
 children before enabling controllers in its "cgroup.subtree_control"
 file.
 
+2-4-4. Resource Domain Controllers
+
+As internal processes are allowed in a threaded subtree, a non-threaded
+controller at a thread root cgroup has to properly manage resource
+competition between internal processes and other child non-threaded
+cgroups. However, a controller can specify that it wants to have
+a separate resource domain to manage the resources of the processes in
+the threaded subtree instead of each process individually. In this
+case, a "cgroup.self" directory will be created at the thread root
+to hold the resource control knobs for the processes in the threaded
+subtree as if those internal processes are all under the cgroup.self
+child cgroup for resource tracking and control purposes.
+
+The "cgroup.self" directory is a special directory which cannot
+be created or deleted directly. No sub-directory can be created
+under it and special files like "cgroup.procs" are not present so
+tasks cannot be moved directly into it.  It is created when a cgroup
+becomes the thread root and have controllers that request separate
+resource domains. It will be removed when that cgroup is not a thread
+root anymore.
 
 2-5. Delegation
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 7be1a90..e383f10 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -65,6 +65,7 @@ enum {
 enum {
CGRP_ROOT_NOPREFIX  = (1 << 1), /* mounted subsystems have no named 
prefix */
CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
+   CGRP_RESOURCE_DOMAIN= (1 << 3), /* thread root resource domain */
 };
 
 /* cftype->flags */
@@ -293,6 +294,9 @@ struct cgroup {
 
struct cgroup_root *root;
 
+   /* Pointer to separate resource domain for thread root */
+   struct cgroup *resource_domain;
+
/*
 * List of cgrp_cset_links pointing at css_sets with tasks in this
 * cgroup.  Protected by css_set_lock.
@@ -516,6 +520,17 @@ struct cgroup_subsys {
bool threaded:1;
 
/*
+* If %true, the controller will need a separate resource domain in
+* a thread root to avoid internal processes associated with the
+* threaded subtree to compete with other child cgroups. This is done
+* by having a separate set of knobs in the cgroup.self directory.
+* These knobs will control how much resources are allocated to the
+* processes in the threaded subtree. Only !thread controllers should
+* have this flag turned on.
+*/
+   bool sep_res_domain:1;
+
+   /*
 * If %false, this subsystem is properly hierarchical -
 * configuration, resource accounting and restriction on a parent
 * cgroup cover those of its children.  If %true, hierarchy support
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 50577c5..3ff3ff5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -61,6 +61,11 @@
 
 #define CGROUP_FILE_NAME_MAX   (MAX_CGROUP_TYPE_NAMELEN +  \
 MAX_CFTYPE_NAME + 2)
+/*
+ * Reserved cgroup directory name for resource domain controllers. Users
+ * are not allowed to create child cgroup of that name.
+ */
#define CGROUP_SELF	"cgroup.self"
 
 /*
  * cgroup_mutex is the master lock.  Any modification to cgroup or its
@@ -165,6 +170,12 @@ struct cgroup_subsys 

[RFC PATCH 10/14] cgroup: Implement new thread mode semantics

2017-04-21 Thread Waiman Long
The current thread mode semantics aren't sufficient to fully support
threaded controllers like cpu. The main problem is that when thread
mode is enabled at root (mainly for performance reasons), all the
non-threaded controllers cannot be supported at all.

To alleviate this problem, the roles of thread root and threaded
cgroups are now further separated. Now thread mode can only be enabled
on a non-root leaf cgroup whose parent will then become the thread
root. All the descendants of a threaded cgroup will still need to be
threaded. All the non-threaded resources will be accounted for in the
thread root. Unlike the previous thread mode, however, a thread root
can have non-threaded children where system resources like memory
can be further split down the hierarchy.

Now we could have something like

R -- A -- B
 \
  T1 -- T2

where R is the thread root, A and B are non-threaded cgroups, T1 and
T2 are threaded cgroups. The cgroups R, T1, T2 form a threaded subtree
where all the non-threaded resources are accounted for in R.  The no
internal process constraint does not apply in the threaded subtree.
Non-threaded controllers need to properly handle the competition
between internal processes and child cgroups at the thread root.

This model will be flexible enough to support the needs of the
threaded controllers.
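
Under these semantics a thread root can be recognized by its
->proc_cgrp pointing back to itself, which is the same test the debug
controller uses later in this series. As a minimal sketch (the helper
name is made up):

  static bool cgroup_is_thread_root(struct cgroup *cgrp)
  {
          return cgrp->proc_cgrp == cgrp;
  }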

Signed-off-by: Waiman Long 
---
 Documentation/cgroup-v2.txt |  51 +++
 kernel/cgroup/cgroup-internal.h |  10 +++
 kernel/cgroup/cgroup.c  | 184 +++-
 3 files changed, 208 insertions(+), 37 deletions(-)

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 2375e22..4d1c24d 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -222,21 +222,32 @@ process can be put in different cgroups and are not 
subject to the no
 internal process constraint - threaded controllers can be enabled on
 non-leaf cgroups whether they have threads in them or not.
 
-To enable the thread mode, the following conditions must be met.
+To enable the thread mode on a cgroup, the following conditions must
+be met.
 
-- The thread root doesn't have any child cgroups.
+- The cgroup doesn't have any child cgroups.
 
-- The thread root doesn't have any controllers enabled.
+- The cgroup doesn't have any non-threaded controllers enabled.
+
+- The cgroup doesn't have any processes attached to it.
 
 Thread mode can be enabled by writing "enable" to "cgroup.threads"
 file.
 
   # echo enable > cgroup.threads
 
-Inside a threaded subtree, "cgroup.threads" can be read and contains
-the list of the thread IDs of all threads in the cgroup.  Except that
-the operations are per-thread instead of per-process, "cgroup.threads"
-has the same format and behaves the same way as "cgroup.procs".
+The parent of the threaded cgroup will become the thread root, if
+it hasn't been a thread root yet. In other words, thread mode cannot
+be enabled on the root cgroup as it doesn't have a parent cgroup. A
+thread root can have child cgroups and controllers enabled before
+becoming one.
+
+A threaded subtree includes the thread root and all the threaded child
+cgroups as well as their descendants which are all threaded cgroups.
+"cgroup.threads" can be read and contains the list of the thread
+IDs of all threads in the cgroup.  Except that the operations are
+per-thread instead of per-process, "cgroup.threads" has the same
+format and behaves the same way as "cgroup.procs".
 
 The thread root serves as the resource domain for the whole subtree,
 and, while the threads can be scattered across the subtree, all the
@@ -246,25 +257,30 @@ not readable in the subtree proper.  However, 
"cgroup.procs" can be
 written to from anywhere in the subtree to migrate all threads of the
 matching process to the cgroup.
 
-Only threaded controllers can be enabled in a threaded subtree.  When
-a threaded controller is enabled inside a threaded subtree, it only
-accounts for and controls resource consumptions associated with the
-threads in the cgroup and its descendants.  All consumptions which
-aren't tied to a specific thread belong to the thread root.
+Only threaded controllers can be enabled in a non-root threaded cgroup.
+When a threaded controller is enabled inside a threaded subtree,
+it only accounts for and controls resource consumptions associated
+with the threads in the cgroup and its descendants.  All consumptions
+which aren't tied to a specific thread belong to the thread root.
 
 Because a threaded subtree is exempt from no internal process
 constraint, a threaded controller must be able to handle competition
 between threads in a non-leaf cgroup and its child cgroups.  Each
 threaded controller defines how such competitions are handled.
 
+A new child cgroup created under a thread root will not be threaded.
+Thread mode has to be explicitly enabled on each of the thread root's
+children.  Descendants of a threaded 

[RFC PATCH 13/14] sched: Make cpu/cpuacct threaded controllers

2017-04-21 Thread Waiman Long
Make cpu and cpuacct cgroup controllers usable within a threaded cgroup.

Signed-off-by: Waiman Long 
---
 kernel/sched/core.c| 1 +
 kernel/sched/cpuacct.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 78dfcaa..9d8beda 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7432,6 +7432,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.legacy_cftypes = cpu_legacy_files,
.dfl_cftypes= cpu_files,
.early_init = true,
+   .threaded   = true,
 #ifdef CONFIG_CGROUP_CPUACCT
/*
 * cpuacct is enabled together with cpu on the unified hierarchy
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index fc1cf13..853d18a 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -414,4 +414,5 @@ struct cgroup_subsys cpuacct_cgrp_subsys = {
.css_free   = cpuacct_css_free,
.legacy_cftypes = files,
.early_init = true,
+   .threaded   = true,
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] docs: Fix a spelling error in vfio-mediated-device.txt

2017-04-21 Thread Stan Drozd
docs: Fix a spelling error in vfio-mediated-device.txt

This commit fixes a repeated "the" in vfio-mediated-device.txt and reflows the
paragraph.

Signed-off-by: Stan Drozd 
---
 Documentation/vfio-mediated-device.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/vfio-mediated-device.txt 
b/Documentation/vfio-mediated-device.txt
index 6f994abd93d0..e5e57b40f8af 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -217,9 +217,9 @@ Directories and files under the sysfs for Each Physical 
Device
 
*  [<type-id>]

-  The [<type-id>] name is created by adding the the device driver string as a
-  prefix to the string provided by the vendor driver. This format of this name
-  is as follows:
+  The [<type-id>] name is created by adding the device driver string as a prefix
+  to the string provided by the vendor driver. This format of this name is as
+  follows:
 
sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name);
 
-- 
2.12.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] docs: Fix a spelling error in ioctl-number.txt

2017-04-21 Thread Stan Drozd
docs: Fix a spelling error in ioctl-number.txt

This commit fixes a misspelled header name in the ioctl numbers list

Signed-off-by: Stan Drozd 
---
 Documentation/ioctl/ioctl-number.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 08244bea5048..a77ead911956 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -212,7 +212,7 @@ Code  Seq#(hex) Include FileComments
 'c'00-1F   linux/chio.hconflict!
 'c'80-9F   arch/s390/include/asm/chsc.hconflict!
 'c'A0-AF   arch/x86/include/asm/msr.h  conflict!
-'d'00-FF   linux/char/drm/drm/hconflict!
+'d'00-FF   linux/char/drm/drm.hconflict!
 'd'02-40   pcmcia/ds.h conflict!
 'd'F0-FF   linux/digi1.h
 'e'all linux/digi1.h   conflict!
-- 
2.12.2

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] kvm: better MWAIT emulation for guests

2017-04-21 Thread Paolo Bonzini


On 21/04/2017 12:05, Alexander Graf wrote:
> 
> 
> On 21.04.17 12:02, Paolo Bonzini wrote:
>>
>>
>> On 12/04/2017 18:29, Michael S. Tsirkin wrote:
>>> I don't really agree we do not need the PV flag. mwait on kvm is
>>> different from mwait on bare metal in that you are heavily penalized by
>>> scheduler for polling unless you configure the host just so.
>>> HLT lets you give up the host CPU if you know you won't need
>>> it for a long time.
>>>
>>> So while many people can get by with monitor cpuid (those that isolate
>>> host CPUs) and it's a valuable option to have, I think a PV flag is also
>>> a valuable option and can be set for more configurations.
>>>
>>> Guest has an idle driver calling mwait on short waits and halt on longer
>>> ones.  I'm in fact testing an idle driver using such a PV flag and will
>>> post when ready (after vacation ~3 weeks from now probably).
>>
>> For now I think I'm removing the PV flag, making this just an
>> optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
>> monitor and mwait instructions as nop").
>>
>> We can add it for 4.13 together with the idle driver.
> 
> I think that's a perfectly reasonable approach, yes. We can always add
> the PV flag with the driver.
> 
> Thanks a lot!

Queuing the patch for 4.12.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] kvm: better MWAIT emulation for guests

2017-04-21 Thread Alexander Graf



On 21.04.17 12:02, Paolo Bonzini wrote:



On 21.04.17 12:02, Paolo Bonzini wrote:
>
> On 12/04/2017 18:29, Michael S. Tsirkin wrote:
>> I don't really agree we do not need the PV flag. mwait on kvm is
>> different from mwait on bare metal in that you are heavily penalized by
>> scheduler for polling unless you configure the host just so.
>> HLT lets you give up the host CPU if you know you won't need
>> it for a long time.
>>
>> So while many people can get by with monitor cpuid (those that isolate
>> host CPUs) and it's a valuable option to have, I think a PV flag is also
>> a valuable option and can be set for more configurations.
>>
>> Guest has an idle driver calling mwait on short waits and halt on longer
>> ones.  I'm in fact testing an idle driver using such a PV flag and will
>> post when ready (after vacation ~3 weeks from now probably).
>
> For now I think I'm removing the PV flag, making this just an
> optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
> monitor and mwait instructions as nop").
>
> We can add it for 4.13 together with the idle driver.

I think that's a perfectly reasonable approach, yes. We can always add
the PV flag with the driver.

Thanks a lot!


Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] kvm: better MWAIT emulation for guests

2017-04-21 Thread Paolo Bonzini


On 12/04/2017 18:29, Michael S. Tsirkin wrote:
> I don't really agree we do not need the PV flag. mwait on kvm is
> different from mwait on bare metal in that you are heavily penalized by
> scheduler for polling unless you configure the host just so.
> HLT lets you give up the host CPU if you know you won't need
> it for a long time.
> 
> So while many people can get by with monitor cpuid (those that isolate
> host CPUs) and it's a valuable option to have, I think a PV flag is also
> a valuable option and can be set for more configurations.
> 
> Guest has an idle driver calling mwait on short waits and halt on longer
> ones.  I'm in fact testing an idle driver using such a PV flag and will
> post when ready (after vacation ~3 weeks from now probably).

For now I think I'm removing the PV flag, making this just an
optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate
monitor and mwait instructions as nop").

We can add it for 4.13 together with the idle driver.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Documentation: DocBook: kgdb: update CONFIG_STRICT_KERNEL_RWX info

2017-04-21 Thread Daniel Thompson

On 21/04/17 03:26, Li Qiang wrote:


> @Daniel
>
> 2017-04-20 23:28 GMT+08:00 Daniel Thompson:
>
>> On 19/04/17 02:58, Li Qiang wrote:
>>> CONFIG_STRICT_KERNEL_RWX is no longer selectable on most
>>> architectures.
>>> Update this info to the documentation.
>>
>> "git grep STRICT_KERNEL_RWX" comes up with nothing.
>
> It was introduced in commit 0f5bf6d0afe4be6e1391908ff2d6dc9730e91550.

Oops. I did the grep on the wrong machine :-( and therefore on an older
kernel than I thought...

>> Is it selectable on any architecture? If not we should remove it
>> entirely!
>
> The 'STRICT_KERNEL_RWX' is renamed from 'CONFIG_DEBUG_RODATA'.
> The original option is selectable.
>
> I'm not sure if this is selectable on any architecture.

So... having found the right kernel, it looks to me like only arm,
arm64, parisc, s390 and x86 define ARCH_HAS_STRICT_KERNEL_RWX. Of these
five, only arm defines ARCH_OPTIONAL_KERNEL_RWX and makes it user
selectable.

> @Jonathan
>
>> On Tue, 18 Apr 2017 18:58:45 -0700
>> Li Qiang wrote:
>>
>>> CONFIG_STRICT_KERNEL_RWX is no longer selectable on most architectures.
>>> Update this info to the documentation.
>>>
>>> Signed-off-by: Li Qiang
>>> ---
>>>  Documentation/DocBook/kgdb.tmpl | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/DocBook/kgdb.tmpl b/Documentation/DocBook/kgdb.tmpl
>>> index 856ac20..ef0b67b 100644
>>> --- a/Documentation/DocBook/kgdb.tmpl
>>> +++ b/Documentation/DocBook/kgdb.tmpl
>>> @@ -121,7 +121,9 @@
>>>  If kgdb supports it for the architecture you are using, you can
>>>  use hardware breakpoints if you desire to run with the
>>>  CONFIG_STRICT_KERNEL_RWX option turned on, else you need to turn off
>>> -this option.
>>> +this option. In most architectures, this option is not selectable.
>>> +For this situation, it can be turned off by adding a runtime parameter
>>> +'rodata=off'.
>>
>> So this is an improvement, I guess, though the paragraph remains kind of
>> confusing.  Is there any chance we could actually just say which
>> architectures can use hardware breakpoints, and which should boot with
>> rodata=off?
>
> I think this is unnecessary, as it is not common to change the
> default CONFIG_STRICT_KERNEL_RWX or add rodata=off.
> We give this hint here because CONFIG_STRICT_KERNEL_RWX was renamed
> from CONFIG_DEBUG_RODATA, and the latter is selectable; this can help
> people who think CONFIG_STRICT_KERNEL_RWX is also selectable.


Having looked at the earlier part of the paragraph I think the info 
about rodata actually needs to be introduced slightly earlier (and 
rodata should be presented as the primary way to do it because 4 of the 
5 architectures don't make STRICT_KERNEL_RWX optional).


Something like:

  If the architecture that you are using supports making the text
  section read-only (CONFIG_STRICT_KERNEL_RWX), you should consider
  turning it off by adding 'rodata=off' to the kernel commandline or,
  if your architecture makes CONFIG_STRICT_KERNEL_RWX optional, by
  disabling this config option. Alternatively, if your architecture
  supports hardware breakpoints, these can be used to provide limited
  breakpoint support if you desire to run with a read-only text section.


Daniel.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 00/11] Documentation: Add ABI to the admin guide

2017-04-21 Thread Markus Heiser

On 21.04.2017 at 01:21, Mauro Carvalho Chehab wrote:

> - I'm not a python programmer ;-) I just took Markus "generic" kernel-cmd
>  code, hardcoding there a call to the script.
> 
>  With (a lot of) time, I would likely be able to find a solution to add
>  the entire ABI logic there, but, in this case, we would lose the
>  capability of calling the script without Sphinx.

Hi Mauro,

I have no problem with calling the perl script, but your last sentence
is not correct: we can implement a python module which is used by
Sphinx, and we can add a command line interface as well.

--Markus--

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] docs: typo in Documentation/memory-barriers.txt

2017-04-21 Thread Stanisław Drozd
On Thu, Apr 20, 2017 at 08:54:46AM -0700, Paul E. McKenney wrote:
> Please let me know if my reworking was in any way problematic

No, it looks solid, thanks :)

Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html