Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/25 2:45, Mike Kravetz wrote:
[...]
>> THP does not guarantee to use the Huge Page, but may use the normal page.
> 
> Note.  You do not want to use THP because "THP does not guarantee".

[...]
>> One of the answers I have reached is to use HugeTLBfs by overcommitting
>> without creating a pool(this is the surplus hugepage).
> 
> Using hugetlbfs overcommit also does not provide a guarantee.  Without
> doing much research, I would say the failure rate for obtaining a huge
> page via THP and hugetlbfs overcommit is about the same.  The most
> difficult issue in both cases will be obtaining a "huge page" number of
> pages from the buddy allocator.

Yes. On architectures such as x86 that support only one hugetlb page size,
a THP and a hugetlb page consist of the same number of base pages, so the
failure rate of obtaining a compound page is the same, as you said.

> I really do not think hugetlbfs overcommit will provide any benefit over
> THP for your use case.

I think that what you say is absolutely right.

>  Also, new user space code is required to "fall back"
> to normal pages in the case of hugetlbfs page allocation failure.  This
> is not needed in the THP case.

I understand the superiority of THP, but there are cases where khugepaged
occupies a CPU due to page fragmentation. If a persistent pool is set up
once instead of using overcommit, I think hugetlb can be superior, for
example with memory allocation performance exceeding THP. I will try to
find a good way to use hugetlb pages.

I sincerely thank you for your help.

-- 
Thanks,
Tsukada

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 22:24, Michal Hocko wrote
[...]
> I do not see anything like that. adjust_pool_surplus is simply an
> accounting thing. At least the last time I've checked. Maybe your
> patchset handles that?

As you said, my patch did not consider that handling when manipulating the
pool. And even if that handling were done well, it would not be a valid
reason to charge surplus hugepages to memcg.

[...]
>> Absolutely you are saying the right thing, but, for example, can mlock(2)ed
>> pages be swapped out by reclaim?(What is the difference between mlock(2)ed
>> pages and hugetlb page?)
> 
> No mlocked pages cannot be reclaimed and that is why we restrict them to
> a relatively small amount.

I understood the concept of memcg.

[...]
> Fatal? Not sure. It simply tries to add an alien memory to the memcg
> concept so I would pressume an unexpected behavior (e.g. not being able
> to reclaim memcg or, over reclaim, trashing etc.).

As you said, it must be an alien. Thanks to the discussion up to this
point, I understand that my solution is inappropriate. I will look for
another way.

Thank you for your kind explanation.

-- 
Thanks,
Tsukada




Re: [PATCH bpf-next v2 0/3] bpf: add boot parameters for sysctl knobs

2018-05-24 Thread Alexei Starovoitov
On Thu, May 24, 2018 at 09:41:08AM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 23 May 2018 15:02:45 -0700
> Alexei Starovoitov  wrote:
> 
> > On Wed, May 23, 2018 at 02:18:19PM +0200, Eugene Syromiatnikov wrote:
> > > Some BPF sysctl knobs affect the loading of BPF programs, and during
> > > system boot/init stages these sysctls are not yet configured.
> > > A concrete example is systemd, that has implemented loading of BPF
> > > programs.
> > > 
> > > Thus, to allow controlling these setting at early boot, this patch set
> > > adds the ability to change the default setting of these sysctl knobs
> > > as well as option to override them via a boot-time kernel parameter
> > > (in order to avoid rebuilding kernel each time a need of changing these
> > > defaults arises).
> > > 
> > > The sysctl knobs in question are kernel.unprivileged_bpf_disable,
> > > net.core.bpf_jit_harden, and net.core.bpf_jit_kallsyms.  
> > 
> > - systemd is root. today it only uses cgroup-bpf progs which require root,
> >   so disabling unpriv during boot time makes no difference to systemd.
> >   what is the actual reason to present time?
> > 
> > - say in the future systemd wants to use so_reuseport+bpf for faster
> >   networking. With unpriv disable during boot, it will force systemd
> >   to do such networking from root, which will lower its security barrier.
> >   How that make sense?
> > 
> > - bpf_jit_kallsyms sysctl has immediate effect on loaded programs.
> >   Flipping it during the boot or right after or any time after
> >   is the same thing. Why add such boot flag then?
> > 
> > - jit_harden can be turned on by systemd. so turning it during the boot
> >   will make systemd progs to be constant blinded.
> >   Constant blinding protects kernel from unprivileged JIT spraying.
> >   Are you worried that systemd will attack the kernel with JIT spraying?
> 
> 
> I think you are missing that we want the ability to change these
> defaults in order to avoid depending on /etc/sysctl.conf settings, and
> that these sysctl.conf settings happen too late.

What does it mean 'happens too late' ?
Too late for what?
sysctl.conf has plenty of system critical knobs like
kernel.perf_event_paranoid, kernel.core_pattern, etc
The behavior of the host is drastically different after sysctl config
is applied.

> For example with jit_harden, there will be a difference between the
> loaded BPF program that got loaded at boot-time with systemd (no
> constant blinding) and when someone reloads that systemd service after
> /etc/sysctl.conf have been evaluated and setting bpf_jit_harden (now
> slower due to constant blinding).   This is inconsistent behavior.

net.core.bpf_jit_harden can be flipped back and forth at run-time,
so bpf progs before and after will be either blinded or not.
I don't see any inconsistency.
In general I think bootparams should be used only for things
like kpti=on/off that cannot be set by sysctl.



[PATCH 2/3] PCI: Allow specifying devices using a base bus and path of devfns

2018-05-24 Thread Logan Gunthorpe
When specifying PCI devices on the kernel command line using a
BDF, the bus numbers can change when adding or replacing a device,
changing motherboard firmware, or applying kernel parameters like
pci=assign-buses. When this happens, it is usually undesirable to
apply whatever command line tweak to the wrong device.

Therefore, it is useful to be able to specify devices with a base
bus number and the path of devfns needed to get to it. (Similar to
the "device scope" structure in the Intel VT-d spec, Section 8.3.1.)

Thus, we add an option to specify devices in the following format:

path:[<domain>:]<bus>:<slot>.<func>[/<slot>.<func> ...]

The path can be any segment within the PCI hierarchy of any length and
can be determined through the use of 'lspci -t'. When specified this way,
a renumbered bus is unlikely to still form a valid device specification,
so the tweak won't be applied to the wrong device.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Stephen Bates 
---
 Documentation/admin-guide/kernel-parameters.txt |  12 ++-
 drivers/pci/pci.c   | 106 +++-
 2 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 894aa516ceab..519ab95bb418 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2986,9 +2986,10 @@
 
Some options herein operate on a specific device
or a set of devices (<pci_dev>). These are
-   specified in one of two formats:
+   specified in one of three formats:

[<domain>:]<bus>:<slot>.<func>
+   path:[<domain>:]<bus>:<slot>.<func>[/<slot>.<func> ...]
pci:<vendor>:<device>[:<subvendor>:<subdevice>]
 
Note: the first format specifies a PCI
@@ -2996,9 +2997,12 @@
if new hardware is inserted, if motherboard
firmware changes, or due to changes caused
by other kernel parameters. The second format
-   selects devices using IDs from the
-   configuration space which may match multiple
-   devices in the system.
+   specifies a path from a device through
+   a path of multiple slot/function addresses
+   (this is more robust against renumbering
+   issues). The third format selects devices using
+   IDs from the configuration space which may match
+   multiple devices in the system.
 
earlydump   [X86] dump PCI config space before the kernel
changes anything
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 85fec5e2640b..53ea0d7b02ce 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -184,22 +184,116 @@ EXPORT_SYMBOL_GPL(pci_ioremap_wc_bar);
 #endif
 
 /**
+ * pci_dev_str_match_path - test if a path string matches a device
+ * @dev:    the PCI device to test
+ * @p:  string to match the device against
+ * @endptr: pointer to the string after the match
+ *
+ * Test if a string (typically from a kernel parameter) formatted as a
+ * path of slot/function addresses matches a PCI device. The string must
+ * be of the form:
+ *
+ *   [<domain>:]<bus>:<slot>.<func>[/<slot>.<func> ...]
+ *
+ * A path for a device can be obtained using 'lspci -t'. Using a path
+ * is more robust against renumbering of devices than using only
+ * a single bus, slot and function address.
+ *
+ * Returns 1 if the string matches the device, 0 if it does not and
+ * a negative error code if it fails to parse the string.
+ */
+static int pci_dev_str_match_path(struct pci_dev *dev, const char *p,
+ const char **endptr)
+{
+   int ret;
+   int seg, bus, slot, func, count;
+   u8 *devfn_path;
+   int num_devfn = 0;
+   struct pci_dev *tmp;
+
+	ret = sscanf(p, "%x:%x:%x.%x%n", &seg, &bus, &slot,
+		     &func, &count);
+	if (ret != 4) {
+		seg = 0;
+		ret = sscanf(p, "%x:%x.%x%n", &bus, &slot,
+			     &func, &count);
+		if (ret != 3)
+			return -EINVAL;
+	}
+
+   p += count;
+
+   devfn_path = kmalloc(PAGE_SIZE, GFP_KERNEL);
+   devfn_path[num_devfn++] = PCI_DEVFN(slot, func);
+
+   while (*p && *p != ',' && *p != ';') {
+		ret = sscanf(p, "/%x.%x%n", &slot, &func, &count);
+   if (ret != 2) {
+   ret = -EINVAL;
+   goto free_and_exit;
+   }
+
+   p += count;
+   devfn_path[num_devfn++] = PCI_DEVFN(slot, func);
+   if (num_devfn >= PAGE_SIZE) {
+

[PATCH 0/3] Add parameter for disabling ACS redirection for P2P

2018-05-24 Thread Logan Gunthorpe
Hi,

As discussed in our PCI P2PDMA series, we'd like to add a kernel
parameter for selectively disabling ACS redirection for select
bridges. Seeing this turned out to be a small series in itself, we've
decided to send this separately from the P2P work.

This series generalizes the code already done for the resource_alignment
option that already exists. The first patch creates a helper function
to match PCI devices against strings based on the code that already
existed in pci_specified_resource_alignment().

The second patch expands the new helper to optionally take a path of
PCI devfns. This is to address Alex's renumbering concern when using
simple bus-devfns. The implementation is essentially how he described it and
similar to the Intel VT-d spec (Section 8.3.1).

The final patch adds the disable_acs_redir kernel parameter which takes
a list of PCI devices and will disable the ACS P2P Request Redirect,
ACS P2P Completion Redirect and ACS P2P Egress Control bits for the
selected devices. This allows P2P traffic between the selected bridges
and, since it is done at boot before the IOMMU groups are created, the
groups will be created correctly based on the new bit settings.

Thanks,

Logan


Logan Gunthorpe (3):
  PCI: Make specifying PCI devices in kernel parameters reusable
  PCI: Allow specifying devices using a base bus and path of devfns
  PCI: Introduce the disable_acs_redir parameter

 Documentation/admin-guide/kernel-parameters.txt |  39 ++-
 drivers/pci/pci.c   | 358 
 2 files changed, 336 insertions(+), 61 deletions(-)

--
2.11.0


[PATCH 3/3] PCI: Introduce the disable_acs_redir parameter

2018-05-24 Thread Logan Gunthorpe
In order to support P2P traffic on a segment of the PCI hierarchy,
we must be able to disable the ACS redirect bits for select
PCI bridges. The bridges must be selected before the devices are
discovered by the kernel and the IOMMU groups created. Therefore,
a kernel command line parameter is created to specify devices
which must have their ACS bits disabled.

The new parameter takes a list of devices separated by a semicolon.
Each device specified will have its ACS redirect bits disabled.
This is similar to the existing 'resource_alignment' parameter and just
like it we also create a sysfs bus attribute which can be used to
read the parameter. Writing the parameter is not supported
as it would require forcibly hot plugging the affected device as
well as all devices whose IOMMU groups might change.

The ACS P2P Request Redirect, P2P Completion Redirect and P2P
Egress Control bits are disabled, which is sufficient to always allow
passing P2P traffic uninterrupted. The bits are cleared after the kernel
(optionally) enables the ACS bits itself. This is done regardless of
whether the kernel sets the bits, since some BIOS firmware is known
to set the bits on boot.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Stephen Bates 
---
 Documentation/admin-guide/kernel-parameters.txt |   9 +++
 drivers/pci/pci.c   | 103 +++-
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 519ab95bb418..215285c4772d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3176,6 +3176,15 @@
Adding the window is slightly risky (it may
conflict with unreported devices), so this
taints the kernel.
+   disable_acs_redir=<pci_dev>[; ...]
+   Specify one or more PCI devices (in the format
+   specified above) separated by semicolons.
+   Each device specified will have the PCI ACS
+   redirect capabilities forced off which will
+   allow P2P traffic between devices through
+   bridges without forcing it upstream. Note:
+   this removes isolation between devices and
+   will make the IOMMU groups less granular.
 
pcie_aspm=  [PCIE] Forcibly enable or disable PCIe Active State 
Power
Management.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 53ea0d7b02ce..3465895a55ab 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2998,6 +2998,92 @@ void pci_request_acs(void)
pci_acs_enable = 1;
 }
 
+#define DISABLE_ACS_REDIR_PARAM_SIZE COMMAND_LINE_SIZE
+static char disable_acs_redir_param[DISABLE_ACS_REDIR_PARAM_SIZE] = {0};
+static DEFINE_SPINLOCK(disable_acs_redir_lock);
+
+static ssize_t pci_set_disable_acs_redir_param(const char *buf, size_t count)
+{
+   if (count > DISABLE_ACS_REDIR_PARAM_SIZE - 1)
+   count = DISABLE_ACS_REDIR_PARAM_SIZE - 1;
+	spin_lock(&disable_acs_redir_lock);
+	strncpy(disable_acs_redir_param, buf, count);
+	disable_acs_redir_param[count] = '\0';
+	spin_unlock(&disable_acs_redir_lock);
+   return count;
+}
+
+static ssize_t pci_disable_acs_redir_show(struct bus_type *bus, char *buf)
+{
+   size_t count;
+
+	spin_lock(&disable_acs_redir_lock);
+	count = snprintf(buf, PAGE_SIZE, "%s\n", disable_acs_redir_param);
+	spin_unlock(&disable_acs_redir_lock);
+   return count;
+}
+
+static BUS_ATTR(disable_acs_redir, 0444, pci_disable_acs_redir_show, NULL);
+
+static int __init pci_disable_acs_redir_sysfs_init(void)
+{
+	return bus_create_file(&pci_bus_type, &bus_attr_disable_acs_redir);
+}
+late_initcall(pci_disable_acs_redir_sysfs_init);
+
+/**
+ * pci_disable_acs_redir - disable ACS redirect capabilities
+ * @dev: the PCI device
+ *
+ * For only devices specified in the disable_acs_redir parameter.
+ */
+static void pci_disable_acs_redir(struct pci_dev *dev)
+{
+   int ret = 0;
+   const char *p;
+   int pos;
+   u16 ctrl;
+
+	spin_lock(&disable_acs_redir_lock);
+
+   p = disable_acs_redir_param;
+   while (*p) {
+		ret = pci_dev_str_match(dev, p, &p);
+   if (ret < 0) {
+			pr_info_once("PCI: Can't parse disable_acs_redir parameter: %s\n",
+				     disable_acs_redir_param);
+
+   break;
+   } else if (ret == 1) {
+   /* Found a match */
+   break;
+   }
+
+   if (*p != ';' && *p != ',') {
+   /* End of param or invalid format */
+ 

[PATCH 1/3] PCI: Make specifying PCI devices in kernel parameters reusable

2018-05-24 Thread Logan Gunthorpe
Separate out the code to match a PCI device with a string (typically
originating from a kernel parameter) from the
pci_specified_resource_alignment() function into its own helper
function.

While we are at it, this change fixes the kernel style of the function
(fixing a number of long lines and extra parentheses).

Additionally, make the analogous change to the kernel parameter
documentation: separate the description of how to specify a PCI device
into its own section at the head of the pci= parameter.

Signed-off-by: Logan Gunthorpe 
Reviewed-by: Stephen Bates 
---
 Documentation/admin-guide/kernel-parameters.txt |  26 +++-
 drivers/pci/pci.c   | 153 +++-
 2 files changed, 120 insertions(+), 59 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 11fc28ecdb6d..894aa516ceab 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,7 +2982,24 @@
See header of drivers/block/paride/pcd.c.
See also Documentation/blockdev/paride.txt.
 
-   pci=option[,option...]  [PCI] various PCI subsystem options:
+   pci=option[,option...]  [PCI] various PCI subsystem options.
+
+   Some options herein operate on a specific device
+   or a set of devices (<pci_dev>). These are
+   specified in one of two formats:
+
+   [<domain>:]<bus>:<slot>.<func>
+   pci:<vendor>:<device>[:<subvendor>:<subdevice>]
+
+   Note: the first format specifies a PCI
+   bus/slot/function address which may change
+   if new hardware is inserted, if motherboard
+   firmware changes, or due to changes caused
+   by other kernel parameters. The second format
+   selects devices using IDs from the
+   configuration space which may match multiple
+   devices in the system.
+
earlydump   [X86] dump PCI config space before the kernel
changes anything
off [X86] don't probe for the PCI bus
@@ -3111,11 +3128,10 @@
window. The default value is 64 megabytes.
resource_alignment=
Format:
-   [<order of align>@][<domain>:]<bus>:<slot>.<func>[; ...]
-   [<order of align>@]pci:<vendor>:<device>\
-   [:<subvendor>:<subdevice>][; ...]
+   [<order of align>@]<pci_dev>[; ...]
Specifies alignment and device to reassign
-   aligned memory resources.
+   aligned memory resources. How to
+   specify the device is described above.
If  is not specified,
PAGE_SIZE is used as alignment.
PCI-PCI bridge can be specified, if resource
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index dbfe7c4f3776..85fec5e2640b 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -183,6 +183,88 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int 
bar)
 EXPORT_SYMBOL_GPL(pci_ioremap_wc_bar);
 #endif
 
+/**
+ * pci_dev_str_match - test if a string matches a device
+ * @dev:    the PCI device to test
+ * @p:  string to match the device against
+ * @endptr: pointer to the string after the match
+ *
+ * Test if a string (typically from a kernel parameter) matches a
+ * specified PCI device. The string may be of one of two formats:
+ *
+ *   [<domain>:]<bus>:<slot>.<func>
+ *   pci:<vendor>:<device>[:<subvendor>:<subdevice>]
+ *
+ * The first format specifies a PCI bus/slot/function address which
+ * may change if new hardware is inserted, if motherboard firmware changes,
+ * or due to changes caused by other kernel parameters.
+ *
+ * The second format matches devices using IDs in the configuration
+ * space which may match multiple devices in the system. A value of 0
+ * for any field will match all devices.
+ *
+ * Returns 1 if the string matches the device, 0 if it does not and
+ * a negative error code if the string cannot be parsed.
+ */
+static int pci_dev_str_match(struct pci_dev *dev, const char *p,
+const char **endptr)
+{
+   int ret;
+   int seg, bus, slot, func, count;
+   unsigned short vendor, device, subsystem_vendor, subsystem_device;
+
+   if (strncmp(p, "pci:", 4) == 0) {
+   /* PCI vendor/device (subvendor/subdevice) ids are specified */
+   p += 4;
+		ret = sscanf(p, "%hx:%hx:%hx:%hx%n", &vendor, &device,
+			     &subsystem_vendor, &subsystem_device, &count);
+   if (ret != 4) {
+   ret = 

Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Waiman Long
On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
>> The sched.load_balance flag is needed to enable CPU isolation similar to
>> what can be done with the "isolcpus" kernel boot parameter. Its value
>> can only be changed in a scheduling domain with no child cpusets. On
>> a non-scheduling domain cpuset, the value of sched.load_balance is
>> inherited from its parent.
>>
>> This flag is set by the parent and is not delegatable.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>  Documentation/cgroup-v2.txt | 24 
>>  kernel/cgroup/cpuset.c  | 53 
>> +
>>  2 files changed, 73 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index 54d9e22..071b634d 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
>>  CPUs of the parent cgroup. Once it is set, this flag cannot be
>>  cleared if there are any child cgroups with cpuset enabled.
>>  
>> +A parent cgroup cannot distribute all its CPUs to child
>> +scheduling domain cgroups unless its load balancing flag is
>> +turned off.
>> +
>> +  cpuset.sched.load_balance
>> +A read-write single value file which exists on non-root
>> +cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +either "0" (off) or a non-zero value (on).  This flag is set
>> +by the parent and is not delegatable.
>> +
>> +When it is on, tasks within this cpuset will be load-balanced
>> +by the kernel scheduler.  Tasks will be moved from CPUs with
>> +high load to other CPUs within the same cpuset with less load
>> +periodically.
>> +
>> +When it is off, there will be no load balancing among CPUs on
>> +this cgroup.  Tasks will stay in the CPUs they are running on
>> +and will not be moved to other CPUs.
>> +
>> +The initial value of this flag is "1".  This flag is then
>> +inherited by child cgroups with cpuset enabled.  Its state
>> +can only be changed on a scheduling domain cgroup with no
>> +cpuset-enabled children.
> I'm confused... why exactly do we have both domain and load_balance ?

The domain flag is for partitioning the CPUs only; it doesn't change the
load balancing state. So the load_balance flag is still needed to turn
load balancing on and off.

Cheers,
Longman



Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-24 Thread Waiman Long
On 05/24/2018 11:41 AM, Peter Zijlstra wrote:
> On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
>> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
>> flag indicates that the CPUs in the current cpuset should be treated
>> as a separate scheduling domain.
> The traditional name for this is a partition.

Do you want to call it cpuset.sched.partition? That name sounds strange
to me.

>>  This new flag is owned by the parent
>> and will cause the CPUs in the cpuset to be removed from the effective
>> CPUs of its parent.
> This is a significant departure from existing behaviour, but one I can
> appreciate. I don't immediately see something terribly wrong with it.
>
>> This is implemented internally by adding a new isolated_cpus mask that
>> holds the CPUs belonging to child scheduling domain cpusets so that:
>>
>>  isolated_cpus | effective_cpus = cpus_allowed
>>  isolated_cpus & effective_cpus = 0
>>
>> This new flag can only be turned on in a cpuset if its parent is either
>> root or a scheduling domain itself with non-empty cpu list. The state
>> of this flag cannot be changed if the cpuset has children.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>  Documentation/cgroup-v2.txt |  22 
>>  kernel/cgroup/cpuset.c  | 237 
>> +++-
>>  2 files changed, 256 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
>> index cf7bac6..54d9e22 100644
>> --- a/Documentation/cgroup-v2.txt
>> +++ b/Documentation/cgroup-v2.txt
>> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>>  it is a subset of "cpuset.mems".  Its value will be affected
>>  by memory nodes hotplug events.
>>  
>> +  cpuset.sched.domain
>> +A read-write single value file which exists on non-root
>> +cpuset-enabled cgroups.  It is a binary value flag that accepts
>> +either "0" (off) or a non-zero value (on).
> I would be conservative and only allow 0/1.

I stated it that way because echoing another integer value, like 2, into
the flag file doesn't return an error. I will modify it to say just 0 and 1.

>>  This flag is set
>> +by the parent and is not delegatable.
>> +
>> +If set, it indicates that the CPUs in the current cgroup will
>> +be the root of a scheduling domain.  The root cgroup is always
>> +a scheduling domain.  There are constraints on where this flag
>> +can be set.  It can only be set in a cgroup if all the following
>> +conditions are true.
>> +
>> +1) The parent cgroup is also a scheduling domain with a non-empty
>> +   cpu list.
> Ah, so initially I was confused by the requirement for root to have it
> always set, but you'll allow child domains to steal _all_ CPUs, such
> that root ends up with an empty effective set?
>
> What about the (kernel) threads that cannot be moved out of the root
> group?

Actually, the current code won't allow you to take all the CPUs away from
a scheduling domain cpuset that has load balancing on, so there must be
at least one CPU left. You can take them all away only if load balancing
is off.

>> +2) The list of CPUs are exclusive, i.e. they are not shared by
>> +   any of its siblings.
> Right.
>
>> +3) There is no child cgroups with cpuset enabled.
>> +
>> +Setting this flag will take the CPUs away from the effective
>> +CPUs of the parent cgroup. Once it is set, this flag cannot be
>> +cleared if there are any child cgroups with cpuset enabled.
> This I'm not clear on. Why?
>
That is for pragmatic reasons, as it is easier to code this way. We could
remove this restriction, but that would make the code more complex.

Cheers,
Longman




Re: [PATCHv5 2/8] arm64: dts: stratix10: add stratix10 service driver binding to base dtsi

2018-05-24 Thread Moritz Fischer
Hi Richard,

On Thu, May 24, 2018 at 11:33:14AM -0500, richard.g...@linux.intel.com wrote:
> From: Richard Gong 
> 
> Add Intel Stratix10 service layer to the device tree
> 
> Signed-off-by: Richard Gong 
> Signed-off-by: Alan Tull 
Acked-by: Moritz Fischer 
> ---
> v2: Change to put service layer driver node under the firmware node
> Change compatible to "intel, stratix10-svc"
> v3: No change
> v4: s/service driver/stratix10 service driver/ in subject line
> v5: No change
> ---
>  arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi 
> b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> index d8c94d5..c257287 100644
> --- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> +++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
> @@ -24,6 +24,19 @@
>   #address-cells = <2>;
>   #size-cells = <2>;
>  
> + reserved-memory {
> + #address-cells = <2>;
> + #size-cells = <2>;
> + ranges;
> +
> + service_reserved: svcbuffer@0 {
> + compatible = "shared-dma-pool";
> + reg = <0x0 0x0 0x0 0x100>;
> + alignment = <0x1000>;
> + no-map;
> + };
> + };
> +
>   cpus {
>   #address-cells = <1>;
>   #size-cells = <0>;
> @@ -487,5 +500,13 @@
>  
>   status = "disabled";
>   };
> +
> + firmware {
> + svc {
> + compatible = "intel,stratix10-svc";
> + method = "smc";
> +			memory-region = <&service_reserved>;
> + };
> + };
>   };
>  };
> -- 
> 2.7.4
> 


Re: [PATCHv5 1/8] dt-bindings, firmware: add Intel Stratix10 service layer binding

2018-05-24 Thread Moritz Fischer
On Thu, May 24, 2018 at 11:33:13AM -0500, richard.g...@linux.intel.com wrote:
> From: Richard Gong 
> 
> Add a device tree binding for the Intel Stratix10 service layer driver
> 
> Signed-off-by: Richard Gong 
> Signed-off-by: Alan Tull 
> Reviewed-by: Rob Herring 
Acked-by: Moritz Fischer 
> ---
> v2: Change to put service layer driver node under the firmware node
> Change compatible to "intel, stratix10-svc"
> v3: No change
> v4: Add Rob's Reviewed-by
> v5: No change
> ---
>  .../bindings/firmware/intel,stratix10-svc.txt  | 57 
> ++
>  1 file changed, 57 insertions(+)
>  create mode 100644 
> Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> 
> diff --git 
> a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt 
> b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> new file mode 100644
> index 000..1fa6606
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
> @@ -0,0 +1,57 @@
> +Intel Service Layer Driver for Stratix10 SoC
> +
> +Intel Stratix10 SoC is composed of a 64 bit quad-core ARM Cortex A53 hard
> +processor system (HPS) and Secure Device Manager (SDM). When the FPGA is
> +configured from HPS, there needs to be a way for HPS to notify SDM the
> +location and size of the configuration data. Then SDM will get the
> +configuration data from that location and perform the FPGA configuration.
> +
> +To meet the whole system security needs and support virtual machine 
> requesting
> +communication with SDM, only the secure world of software (EL3, Exception
> +Layer 3) can interface with SDM. All software entities running on other
> +exception layers must channel through the EL3 software whenever they need
> +service from SDM.
> +
> +Intel Stratix10 service layer driver, running at privileged exception level
> +(EL1, Exception Layer 1), interfaces with the service providers and provides
> +the services for FPGA configuration, QSPI, Crypto and warm reset. Service 
> layer
> +driver also manages secure monitor call (SMC) to communicate with secure 
> monitor
> +code running in EL3.
> +
> +Required properties:
> +---
> +The svc node has the following mandatory properties, must be located under
> +the firmware node.
> +
> +- compatible: "intel,stratix10-svc"
> +- method: smc or hvc
> +smc - Secure Monitor Call
> +hvc - Hypervisor Call
> +- memory-region:
> + phandle to the reserved memory node. See
> + Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
> + for details
> +
> +Example:
> +---
> +
> + reserved-memory {
> +#address-cells = <2>;
> +#size-cells = <2>;
> +ranges;
> +
> +service_reserved: svcbuffer@0 {
> +compatible = "shared-dma-pool";
> +reg = <0x0 0x0 0x0 0x100>;
> +alignment = <0x1000>;
> +no-map;
> +};
> +};
> +
> + firmware {
> + svc {
> + compatible = "intel,stratix10-svc";
> + method = "smc";
> +		memory-region = <&service_reserved>;
> + };
> + };
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fpga" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv5 4/8] dt-bindings: fpga: add Stratix10 SoC FPGA manager binding

2018-05-24 Thread richard . gong
From: Alan Tull 

Add a Device Tree binding for the Intel Stratix10 SoC FPGA manager.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
Reviewed-by: Rob Herring 
---
v2: this patch is added in patch set version 2
v3: change to put fpga_mgr node under firmware/svc node
v4: s/fpga-mgr@0/fpga-mgr/ to remove unit_address
add Richard's signed-off-by
v5: add Reviewed-by Rob Herring
---
 .../bindings/fpga/intel-stratix10-soc-fpga-mgr.txt  | 17 +
 1 file changed, 17 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt

diff --git a/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
new file mode 100644
index 000..6e03f79
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/intel-stratix10-soc-fpga-mgr.txt
@@ -0,0 +1,17 @@
+Intel Stratix10 SoC FPGA Manager
+
+Required properties:
+The fpga_mgr node has the following mandatory property, must be located under
+firmware/svc node.
+
+- compatible : should contain "intel,stratix10-soc-fpga-mgr"
+
+Example:
+
+   firmware {
+   svc {
+   fpga_mgr: fpga-mgr {
+   compatible = "intel,stratix10-soc-fpga-mgr";
+   };
+   };
+   };
-- 
2.7.4



[PATCHv5 2/8] arm64: dts: stratix10: add stratix10 service driver binding to base dtsi

2018-05-24 Thread richard . gong
From: Richard Gong 

Add Intel Stratix10 service layer to the device tree

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v2: Change to put service layer driver node under the firmware node
Change compatible to "intel, stratix10-svc"
v3: No change
v4: s/service driver/stratix10 service driver/ in subject line
v5: No change
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index d8c94d5..c257287 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -24,6 +24,19 @@
#address-cells = <2>;
#size-cells = <2>;
 
+   reserved-memory {
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+
+   service_reserved: svcbuffer@0 {
+   compatible = "shared-dma-pool";
+   reg = <0x0 0x0 0x0 0x100>;
+   alignment = <0x1000>;
+   no-map;
+   };
+   };
+
cpus {
#address-cells = <1>;
#size-cells = <0>;
@@ -487,5 +500,13 @@
 
status = "disabled";
};
+
+   firmware {
+   svc {
+   compatible = "intel,stratix10-svc";
+   method = "smc";
+   memory-region = <&service_reserved>;
+   };
+   };
};
 };
-- 
2.7.4



[PATCHv5 1/8] dt-bindings, firmware: add Intel Stratix10 service layer binding

2018-05-24 Thread richard . gong
From: Richard Gong 

Add a device tree binding for the Intel Stratix10 service layer driver

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
Reviewed-by: Rob Herring 
---
v2: Change to put service layer driver node under the firmware node
Change compatible to "intel, stratix10-svc"
v3: No change
v4: Add Rob's Reviewed-by
v5: No change
---
 .../bindings/firmware/intel,stratix10-svc.txt  | 57 ++
 1 file changed, 57 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt

diff --git a/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
new file mode 100644
index 000..1fa6606
--- /dev/null
+++ b/Documentation/devicetree/bindings/firmware/intel,stratix10-svc.txt
@@ -0,0 +1,57 @@
+Intel Service Layer Driver for Stratix10 SoC
+
+Intel Stratix10 SoC is composed of a 64 bit quad-core ARM Cortex A53 hard
+processor system (HPS) and Secure Device Manager (SDM). When the FPGA is
+configured from HPS, there needs to be a way for HPS to notify SDM the
+location and size of the configuration data. Then SDM will get the
+configuration data from that location and perform the FPGA configuration.
+
+To meet the whole system security needs and support virtual machine requesting
+communication with SDM, only the secure world of software (EL3, Exception
+Layer 3) can interface with SDM. All software entities running on other
+exception layers must channel through the EL3 software whenever it needs
+service from SDM.
+
+Intel Stratix10 service layer driver, running at privileged exception level
+(EL1, Exception Layer 1), interfaces with the service providers and provides
+the services for FPGA configuration, QSPI, Crypto and warm reset. Service layer
+driver also manages secure monitor call (SMC) to communicate with secure monitor
+code running in EL3.
+
+Required properties:
+-------------------
+The svc node has the following mandatory properties, must be located under
+the firmware node.
+
+- compatible: "intel,stratix10-svc"
+- method: smc or hvc
+smc - Secure Monitor Call
+hvc - Hypervisor Call
+- memory-region:
+   phandle to the reserved memory node. See
+   Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+   for details
+
+Example:
+--------
+
+   reserved-memory {
+#address-cells = <2>;
+#size-cells = <2>;
+ranges;
+
+service_reserved: svcbuffer@0 {
+compatible = "shared-dma-pool";
+reg = <0x0 0x0 0x0 0x100>;
+alignment = <0x1000>;
+no-map;
+};
+};
+
+   firmware {
+   svc {
+   compatible = "intel,stratix10-svc";
+   method = "smc";
+   memory-region = <&service_reserved>;
+   };
+   };
-- 
2.7.4



[PATCHv5 7/8] defconfig: enable fpga and service layer

2018-05-24 Thread richard . gong
From: Richard Gong 

Enable fpga framework, Stratix 10 SoC FPGA manager and Stratix10
Service Layer

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v2: this patch is added in patch set version 2
v3: no change
v4: s/CONFIG_INTEL_SERVICE/CONFIG_STRATIX10_SERVICE/
add CONFIG_OF_FPGA_REGION=y
s/Intel/Stratix10/ in subject line
v5: no change
---
 arch/arm64/configs/defconfig | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index ecf6137..5f7a9b7 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -180,6 +180,7 @@ CONFIG_BLK_DEV_LOOP=y
 CONFIG_BLK_DEV_NBD=m
 CONFIG_VIRTIO_BLK=y
 CONFIG_BLK_DEV_NVME=m
+CONFIG_STRATIX10_SERVICE=y
 CONFIG_SRAM=y
 CONFIG_EEPROM_AT25=m
 # CONFIG_SCSI_PROC_FS is not set
@@ -595,6 +596,11 @@ CONFIG_PHY_TEGRA_XUSB=y
 CONFIG_QCOM_L2_PMU=y
 CONFIG_QCOM_L3_PMU=y
 CONFIG_MESON_EFUSE=m
+CONFIG_FPGA=y
+CONFIG_FPGA_MGR_STRATIX10_SOC=y
+CONFIG_FPGA_REGION=y
+CONFIG_FPGA_BRIDGE=y
+CONFIG_OF_FPGA_REGION=y
 CONFIG_QCOM_QFPROM=y
 CONFIG_UNIPHIER_EFUSE=y
 CONFIG_TEE=y
-- 
2.7.4



[PATCHv5 3/8] driver, misc: add Intel Stratix10 service layer driver

2018-05-24 Thread richard . gong
From: Richard Gong 

Some features of the Intel Stratix10 SoC require a level of privilege
higher than the kernel is granted. Such secure features include
FPGA programming. In terms of the ARMv8 architecture, the kernel runs
at Exception Level 1 (EL1), access to the features requires
Exception Level 3 (EL3).

The Intel Stratix10 SoC service layer provides an in kernel API for
drivers to request access to the secure features. The requests are queued
and processed one by one. ARM’s SMCCC is used to pass the execution
of the requests on to a secure monitor (EL3).

The header file stratix10-svc-client.h defines the interface between
service providers (FPGA manager is one of them) and service layer.

The header file stratix10-smc.h defines the secure monitor call (SMC)
message protocols used for service layer driver in normal world
(EL1) to communicate with secure monitor SW in secure monitor exception
level 3 (EL3).

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v2: Remove intel-service subdirectory and intel-service.h, move
intel-smc.h and intel-service.c to driver/misc subdirectory
Correct SPDX markers
Change service layer driver be 'default n'
Remove global variables
Add timeout for do..while() loop
Add kernel-doc for the functions and structs, correct multiline comments
Replace kfifo_in/kfifo_out with kfifo_in_spinlocked/kfifo_out_spinlocked
rename struct intel_svc_data (at client header) to intel_svc_client_msg
rename struct intel_svc_private_mem to intel_svc_data
Other corrections/changes from Intel internal code reviews
v3: Change all exported functions with "intel_svc_" as the prefix
Increase timeout values for claiming back submitted buffer(s)
Rename struct intel_command_reconfig_payload to
struct intel_svc_command_reconfig_payload
Add pr_err() to provide the error return value
Other corrections/changes
v4: s/intel/stratix10/ on some variables, structs, functions, and file names
intel-service.c -> stratix10-svc.c
intel-smc.h -> stratix10-smc.h
intel-service-client.h -> stratix10-svc-client.h
Remove non-kernel-doc formatting
v5: add a new API stratix10_svc_done() which is called by service client
when client request is completed or error occurs during request
process. Which allows service layer to free its resources.
remove dummy client from service layer client header and service
layer source file.
kernel-doc fixes
---
 drivers/misc/Kconfig |  12 +
 drivers/misc/Makefile|   1 +
 drivers/misc/stratix10-smc.h | 205 
 drivers/misc/stratix10-svc.c | 984 +++
 include/linux/stratix10-svc-client.h | 199 +++
 5 files changed, 1401 insertions(+)
 create mode 100644 drivers/misc/stratix10-smc.h
 create mode 100644 drivers/misc/stratix10-svc.c
 create mode 100644 include/linux/stratix10-svc-client.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 5d71300..5d5b648 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -138,6 +138,18 @@ config INTEL_MID_PTI
  an Intel Atom (non-netbook) mobile device containing a MIPI
  P1149.7 standard implementation.
 
+config STRATIX10_SERVICE
+   tristate "Stratix10 Service Layer"
+   depends on HAVE_ARM_SMCCC
+   default n
+   help
+Stratix10 service layer runs at privileged exception level, interfaces with
+the service providers (FPGA manager is one of them) and manages secure
+monitor call to communicate with secure monitor software at secure monitor
+exception level.
+
+Say Y here if you want Stratix10 service layer support.
+
 config SGI_IOC4
tristate "SGI IOC4 Base IO support"
depends on PCI
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 20be70c..99fed8b 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_AD525X_DPOT)   += ad525x_dpot.o
 obj-$(CONFIG_AD525X_DPOT_I2C)  += ad525x_dpot-i2c.o
 obj-$(CONFIG_AD525X_DPOT_SPI)  += ad525x_dpot-spi.o
 obj-$(CONFIG_INTEL_MID_PTI)+= pti.o
+obj-$(CONFIG_STRATIX10_SERVICE) += stratix10-svc.o
 obj-$(CONFIG_ATMEL_SSC)+= atmel-ssc.o
 obj-$(CONFIG_ATMEL_TCLIB)  += atmel_tclib.o
 obj-$(CONFIG_DUMMY_IRQ)+= dummy-irq.o
diff --git a/drivers/misc/stratix10-smc.h b/drivers/misc/stratix10-smc.h
new file mode 100644
index 000..94615f4
--- /dev/null
+++ b/drivers/misc/stratix10-smc.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2017-2018, Intel Corporation
+ */
+
+#ifndef __STRATIX10_SMC_H
+#define __STRATIX10_SMC_H
+
+#include 
+#include 
+
+/**
+ * This file defines the Secure Monitor Call (SMC) message protocol used for
+ * service layer driver in normal world (EL1) to communicate with secure
+ * monitor software in Secure Monitor 

[PATCHv5 6/8] fpga: add intel stratix10 soc fpga manager driver

2018-05-24 Thread richard . gong
From: Alan Tull 

Add driver for reconfiguring Intel Stratix10 SoC FPGA devices.
This driver communicates through the Intel Service Driver which
does communication with privileged hardware (that does the
FPGA programming) through a secure mailbox.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
---
v2: this patch is added in patch set version 2
v3: change to align to the update of service client APIs, and the
update of fpga_mgr device node
v4: changes to align with stratix10-svc-client API updates
add Richard's signed-off-by
v5: update to align changes at service layer to minimize service
layer thread usages
---
 drivers/fpga/Kconfig |   6 +
 drivers/fpga/Makefile|   1 +
 drivers/fpga/stratix10-soc.c | 545 +++
 3 files changed, 552 insertions(+)
 create mode 100644 drivers/fpga/stratix10-soc.c

diff --git a/drivers/fpga/Kconfig b/drivers/fpga/Kconfig
index f47ef84..1624a73 100644
--- a/drivers/fpga/Kconfig
+++ b/drivers/fpga/Kconfig
@@ -57,6 +57,12 @@ config FPGA_MGR_ZYNQ_FPGA
help
  FPGA manager driver support for Xilinx Zynq FPGAs.
 
+config FPGA_MGR_STRATIX10_SOC
+   tristate "Intel Stratix10 SoC FPGA Manager"
+   depends on (ARCH_STRATIX10 && STRATIX10_SERVICE)
+   help
+ FPGA manager driver support for the Intel Stratix10 SoC.
+
 config FPGA_MGR_XILINX_SPI
tristate "Xilinx Configuration over Slave Serial (SPI)"
depends on SPI
diff --git a/drivers/fpga/Makefile b/drivers/fpga/Makefile
index 3cb276a..6eef670 100644
--- a/drivers/fpga/Makefile
+++ b/drivers/fpga/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_FPGA_MGR_ALTERA_PS_SPI)  += altera-ps-spi.o
 obj-$(CONFIG_FPGA_MGR_ICE40_SPI)   += ice40-spi.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA) += socfpga.o
 obj-$(CONFIG_FPGA_MGR_SOCFPGA_A10) += socfpga-a10.o
+obj-$(CONFIG_FPGA_MGR_STRATIX10_SOC)   += stratix10-soc.o
 obj-$(CONFIG_FPGA_MGR_TS73XX)  += ts73xx-fpga.o
 obj-$(CONFIG_FPGA_MGR_XILINX_SPI)  += xilinx-spi.o
 obj-$(CONFIG_FPGA_MGR_ZYNQ_FPGA)   += zynq-fpga.o
diff --git a/drivers/fpga/stratix10-soc.c b/drivers/fpga/stratix10-soc.c
new file mode 100644
index 000..d645ef7
--- /dev/null
+++ b/drivers/fpga/stratix10-soc.c
@@ -0,0 +1,545 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * FPGA Manager Driver for Intel Stratix10 SoC
+ *
+ *  Copyright (C) 2018 Intel Corporation
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+/*
+ * FPGA programming requires a higher level of privilege (EL3), per the SoC
+ * design.
+ */
+#define NUM_SVC_BUFS   4
+#define SVC_BUF_SIZE   SZ_512K
+
+/* Indicates buffer is in use if set */
+#define SVC_BUF_LOCK   0
+
+/**
+ * struct s10_svc_buf
+ * @buf: virtual address of buf provided by service layer
+ * @lock: locked if buffer is in use
+ */
+struct s10_svc_buf {
+   char *buf;
+   unsigned long lock;
+};
+
+struct s10_priv {
+   struct stratix10_svc_chan *chan;
+   struct stratix10_svc_client client;
+   struct completion status_return_completion;
+   struct s10_svc_buf svc_bufs[NUM_SVC_BUFS];
+   unsigned long status;
+};
+
+static int s10_svc_send_msg(struct s10_priv *priv,
+   enum stratix10_svc_command_code command,
+   void *payload, u32 payload_length)
+{
+   struct stratix10_svc_chan *chan = priv->chan;
+   struct stratix10_svc_client_msg msg;
+   int ret;
+
+   pr_debug("%s cmd=%d payload=%p length=%d\n",
+__func__, command, payload, payload_length);
+
+   msg.command = command;
+   msg.payload = payload;
+   msg.payload_length = payload_length;
+
+   ret = stratix10_svc_send(chan, &msg);
+   pr_debug("stratix10_svc_send returned status %d\n", ret);
+
+   return ret;
+}
+
+/**
+ * s10_free_buffers
+ * Free buffers allocated from the service layer's pool that are not in use.
+ * @mgr: fpga manager struct
+ * Free all buffers that are not in use.
+ * Return true when all buffers are freed.
+ */
+static bool s10_free_buffers(struct fpga_manager *mgr)
+{
+   struct s10_priv *priv = mgr->priv;
+   uint num_free = 0;
+   uint i;
+
+   for (i = 0; i < NUM_SVC_BUFS; i++) {
+   if (!priv->svc_bufs[i].buf) {
+   num_free++;
+   continue;
+   }
+
+   if (!test_and_set_bit_lock(SVC_BUF_LOCK,
+  &priv->svc_bufs[i].lock)) {
+   stratix10_svc_free_memory(priv->chan,
+ priv->svc_bufs[i].buf);
+   priv->svc_bufs[i].buf = NULL;
+   num_free++;
+   }
+   }
+
+   return num_free == NUM_SVC_BUFS;
+}
+
+/**
+ * s10_free_buffer_count
+ * Count how many buffers are not in use.
+ * @mgr: fpga manager struct
+ * Return # of buffers that are not 

[PATCHv5 8/8] Documentation: driver-api: add stratix10 service layer

2018-05-24 Thread richard . gong
From: Richard Gong 

Add new file stratix10-svc.rst
Add stratix10-svc.rst to driver-api/index.rst

Signed-off-by: Richard Gong 
Signed-off-by: Alan Tull 
---
v5: this patch is added in patch set version 5
---
 Documentation/driver-api/index.rst |  1 +
 Documentation/driver-api/stratix10-svc.rst | 32 ++
 2 files changed, 33 insertions(+)
 create mode 100644 Documentation/driver-api/stratix10-svc.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 6d8352c..4b31109 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -49,6 +49,7 @@ available subsections can be seen below.
dmaengine/index
slimbus
soundwire/index
+   stratix10-svc
 
 .. only::  subproject and html
 
diff --git a/Documentation/driver-api/stratix10-svc.rst b/Documentation/driver-api/stratix10-svc.rst
new file mode 100644
index 000..ed361d8
--- /dev/null
+++ b/Documentation/driver-api/stratix10-svc.rst
@@ -0,0 +1,32 @@
+
+Intel Stratix10 SoC Service Layer
+=================================
+
+Some features of the Intel Stratix10 SoC require a level of privilege
+higher than the kernel is granted. Such secure features include
+FPGA programming. In terms of the ARMv8 architecture, the kernel runs
+at Exception Level 1 (EL1), access to the features requires
+Exception Level 3 (EL3).
+
+The Intel Stratix10 SoC service layer provides an in kernel API for
+drivers to request access to the secure features. The requests are queued
+and processed one by one. ARM’s SMCCC is used to pass the execution
+of the requests on to a secure monitor (EL3).
+
+.. kernel-doc:: include/linux/stratix10-svc-client.h
+   :functions: stratix10_svc_command_code
+
+.. kernel-doc:: include/linux/stratix10-svc-client.h
+   :functions: stratix10_svc_client_msg
+
+.. kernel-doc:: include/linux/stratix10-svc-client.h
+   :functions: stratix10_svc_command_reconfig_payload
+
+.. kernel-doc:: include/linux/stratix10-svc-client.h
+   :functions: stratix10_svc_cb_data
+
+.. kernel-doc:: include/linux/stratix10-svc-client.h
+   :functions: stratix10_svc_client
+
+.. kernel-doc:: drivers/misc/stratix10-svc.c
+   :export:
-- 
2.7.4



[PATCHv5 5/8] arm64: dts: stratix10: add fpga manager and region

2018-05-24 Thread richard . gong
From: Alan Tull 

Add the Stratix10 FPGA manager and a FPGA region to the
device tree.

Signed-off-by: Alan Tull 
Signed-off-by: Richard Gong 
---
v2: this patch is added in patch set version 2
v3: change to put fpga_mgr node under firmware/svc node
v4: s/fpga-mgr@0/fpga-mgr/ to remove unit_address
add Richard's signed-off-by
v5: no change
---
 arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
index c257287..8f8f409 100644
--- a/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
+++ b/arch/arm64/boot/dts/altera/socfpga_stratix10.dtsi
@@ -106,6 +106,14 @@
interrupt-parent = <>;
ranges = <0 0 0 0x>;
 
+   base_fpga_region {
+   #address-cells = <0x1>;
+   #size-cells = <0x1>;
+
+   compatible = "fpga-region";
+   fpga-mgr = <&fpga_mgr>;
+   };
+
clkmgr: clock-controller@ffd1 {
compatible = "intel,stratix10-clkmgr";
reg = <0xffd1 0x1000>;
@@ -506,6 +514,10 @@
compatible = "intel,stratix10-svc";
method = "smc";
memory-region = <_reserved>;
+
+   fpga_mgr: fpga-mgr {
+   compatible = "intel,stratix10-soc-fpga-mgr";
+   };
};
};
};
-- 
2.7.4



Re: [PATCH v3 6/9] trace_uprobe: Support SDT markers having reference count (semaphore)

2018-05-24 Thread Oleg Nesterov
Hi Ravi,

sorry for delay!

I am trying to recall what this code should do ;) At first glance, I do
not see any serious problem in this version... except it doesn't apply
to Linus's tree. just one question for now.

On 04/17, Ravi Bangoria wrote:
>
> @@ -941,6 +1091,9 @@ typedef bool (*filter_func_t)(struct uprobe_consumer *self,
>   if (ret)
>   goto err_buffer;
>  
> + if (tu->ref_ctr_offset)
> + sdt_increment_ref_ctr(tu);
> +

iiuc, this is probe_event_enable()...

Looks racy, but afaics the race with uprobe_mmap() will be closed by the next
change. However, it seems that probe_event_disable() can race with trace_uprobe_mmap()
too and the next 7/9 patch won't help,

> + if (tu->ref_ctr_offset)
> + sdt_decrement_ref_ctr(tu);
> +
>   uprobe_unregister(tu->inode, tu->offset, &tu->consumer);
>   tu->tp.flags &= file ? ~TP_FLAG_TRACE : ~TP_FLAG_PROFILE;

so what if trace_uprobe_mmap() comes right after uprobe_unregister() ?
Note that trace_probe_is_enabled() is T until we update tp.flags.

Oleg.



Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Peter Zijlstra
On Thu, May 17, 2018 at 04:55:42PM -0400, Waiman Long wrote:
> The sched.load_balance flag is needed to enable CPU isolation similar to
> what can be done with the "isolcpus" kernel boot parameter. Its value
> can only be changed in a scheduling domain with no child cpusets. On
> a non-scheduling domain cpuset, the value of sched.load_balance is
> inherited from its parent.
> 
> This flag is set by the parent and is not delegatable.
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/cgroup-v2.txt | 24 
>  kernel/cgroup/cpuset.c  | 53 +
>  2 files changed, 73 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index 54d9e22..071b634d 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1536,6 +1536,30 @@ Cpuset Interface Files
>   CPUs of the parent cgroup. Once it is set, this flag cannot be
>   cleared if there are any child cgroups with cpuset enabled.
>  
> + A parent cgroup cannot distribute all its CPUs to child
> + scheduling domain cgroups unless its load balancing flag is
> + turned off.
> +
> +  cpuset.sched.load_balance
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups.  It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on).  This flag is set
> + by the parent and is not delegatable.
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler.  Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup.  Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + The initial value of this flag is "1".  This flag is then
> + inherited by child cgroups with cpuset enabled.  Its state
> + can only be changed on a scheduling domain cgroup with no
> + cpuset-enabled children.
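From user space, the knobs described above would be exercised roughly as
follows (hypothetical mount point, cgroup name and CPU list; the sched.*
files only exist with this patch set applied, and all of this requires root):

```shell
# Assumes cgroup v2 mounted at /sys/fs/cgroup.
cd /sys/fs/cgroup
echo "+cpuset" > cgroup.subtree_control     # enable the cpuset controller
mkdir rt-part                               # hypothetical child cgroup
echo 2-3 > rt-part/cpuset.cpus              # exclusive CPUs for the child
echo 1 > rt-part/cpuset.sched.domain        # make it a scheduling domain
echo 0 > rt-part/cpuset.sched.load_balance  # then turn balancing off in it
```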

I'm confused... why exactly do we have both domain and load_balance ?


Re: [PATCH v8 2/6] cpuset: Add new v2 cpuset.sched.domain flag

2018-05-24 Thread Peter Zijlstra
On Thu, May 17, 2018 at 04:55:41PM -0400, Waiman Long wrote:
> A new cpuset.sched.domain boolean flag is added to cpuset v2. This new
> flag indicates that the CPUs in the current cpuset should be treated
> as a separate scheduling domain.

The traditional name for this is a partition.

>  This new flag is owned by the parent
> and will cause the CPUs in the cpuset to be removed from the effective
> CPUs of its parent.

This is a significant departure from existing behaviour, but one I can
appreciate. I don't immediately see something terribly wrong with it.

> This is implemented internally by adding a new isolated_cpus mask that
> holds the CPUs belonging to child scheduling domain cpusets so that:
> 
>   isolated_cpus | effective_cpus = cpus_allowed
>   isolated_cpus & effective_cpus = 0
> 
> This new flag can only be turned on in a cpuset if its parent is either
> root or a scheduling domain itself with non-empty cpu list. The state
> of this flag cannot be changed if the cpuset has children.
> 
> Signed-off-by: Waiman Long 
> ---
>  Documentation/cgroup-v2.txt |  22 
>  kernel/cgroup/cpuset.c  | 237 +++-
>  2 files changed, 256 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index cf7bac6..54d9e22 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -1514,6 +1514,28 @@ Cpuset Interface Files
>   it is a subset of "cpuset.mems".  Its value will be affected
>   by memory nodes hotplug events.
>  
> +  cpuset.sched.domain
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups.  It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on).

I would be conservative and only allow 0/1.

>  This flag is set
> + by the parent and is not delegatable.
> +
> + If set, it indicates that the CPUs in the current cgroup will
> + be the root of a scheduling domain.  The root cgroup is always
> + a scheduling domain.  There are constraints on where this flag
> + can be set.  It can only be set in a cgroup if all the following
> + conditions are true.
> +
> + 1) The parent cgroup is also a scheduling domain with a non-empty
> +cpu list.

Ah, so initially I was confused by the requirement for root to have it
always set, but you'll allow child domains to steal _all_ CPUs, such
that root ends up with an empty effective set?

What about the (kernel) threads that cannot be moved out of the root
group?

> + 2) The list of CPUs are exclusive, i.e. they are not shared by
> +any of its siblings.

Right.

> + 3) There is no child cgroups with cpuset enabled.
> +
> + Setting this flag will take the CPUs away from the effective
> + CPUs of the parent cgroup. Once it is set, this flag cannot be
> + cleared if there are any child cgroups with cpuset enabled.

This I'm not clear on. Why?




Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Waiman Long
On 05/24/2018 11:16 AM, Juri Lelli wrote:
> On 24/05/18 11:09, Waiman Long wrote:
>> On 05/24/2018 10:36 AM, Juri Lelli wrote:
>>> On 17/05/18 16:55, Waiman Long wrote:
>>>
>>> [...]
>>>
 +  A parent cgroup cannot distribute all its CPUs to child
 +  scheduling domain cgroups unless its load balancing flag is
 +  turned off.
 +
 +  cpuset.sched.load_balance
 +  A read-write single value file which exists on non-root
 +  cpuset-enabled cgroups.  It is a binary value flag that accepts
 +  either "0" (off) or a non-zero value (on).  This flag is set
 +  by the parent and is not delegatable.
 +
 +  When it is on, tasks within this cpuset will be load-balanced
 +  by the kernel scheduler.  Tasks will be moved from CPUs with
 +  high load to other CPUs within the same cpuset with less load
 +  periodically.
 +
 +  When it is off, there will be no load balancing among CPUs on
 +  this cgroup.  Tasks will stay in the CPUs they are running on
 +  and will not be moved to other CPUs.
 +
 +  The initial value of this flag is "1".  This flag is then
 +  inherited by child cgroups with cpuset enabled.  Its state
 +  can only be changed on a scheduling domain cgroup with no
 +  cpuset-enabled children.
>>> [...]
>>>
 +  /*
 +   * On default hierarchy, a load balance flag change is only allowed
 +   * in a scheduling domain with no child cpuset.
 +   */
 +  if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
 + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
 +  err = -EINVAL;
 +  goto out;
 +  }
>>> The rule is actually
>>>
>>>  - no child cpuset
>>>  - and it must be a scheduling domain
>>>
>>> Right?
>> Yes, because it doesn't make sense to have a cpu in one cpuset that has
>> loading balance off while, at the same time, in another cpuset with load
>> balancing turned on. This restriction is there to make sure that the
>> above condition will not happen. I may be wrong if there is a realistic
>> use case where the above condition is desired.
> Yep, makes sense to me.
>
> Maybe add the second condition to the comment and documentation.

Sure. Will do.

-Longman



Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Juri Lelli
On 24/05/18 11:09, Waiman Long wrote:
> On 05/24/2018 10:36 AM, Juri Lelli wrote:
> > On 17/05/18 16:55, Waiman Long wrote:
> >
> > [...]
> >
> >> +  A parent cgroup cannot distribute all its CPUs to child
> >> +  scheduling domain cgroups unless its load balancing flag is
> >> +  turned off.
> >> +
> >> +  cpuset.sched.load_balance
> >> +  A read-write single value file which exists on non-root
> >> +  cpuset-enabled cgroups.  It is a binary value flag that accepts
> >> +  either "0" (off) or a non-zero value (on).  This flag is set
> >> +  by the parent and is not delegatable.
> >> +
> >> +  When it is on, tasks within this cpuset will be load-balanced
> >> +  by the kernel scheduler.  Tasks will be moved from CPUs with
> >> +  high load to other CPUs within the same cpuset with less load
> >> +  periodically.
> >> +
> >> +  When it is off, there will be no load balancing among CPUs on
> >> +  this cgroup.  Tasks will stay in the CPUs they are running on
> >> +  and will not be moved to other CPUs.
> >> +
> >> +  The initial value of this flag is "1".  This flag is then
> >> +  inherited by child cgroups with cpuset enabled.  Its state
> >> +  can only be changed on a scheduling domain cgroup with no
> >> +  cpuset-enabled children.
> > [...]
> >
> >> +  /*
> >> +   * On default hierarchy, a load balance flag change is only allowed
> >> +   * in a scheduling domain with no child cpuset.
> >> +   */
> >> +  if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> >> + (!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> >> +  err = -EINVAL;
> >> +  goto out;
> >> +  }
> > The rule is actually
> >
> >  - no child cpuset
> >  - and it must be a scheduling domain
> >
> > Right?
> 
> Yes, because it doesn't make sense to have a CPU in one cpuset that has
> load balancing off while, at the same time, being in another cpuset with
> load balancing turned on. This restriction is there to make sure that the
> above condition will not happen. I may be wrong if there is a realistic
> use case where the above condition is desired.

Yep, makes sense to me.

Maybe add the second condition to the comment and documentation.


Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2

2018-05-24 Thread Juri Lelli
On 17/05/18 16:55, Waiman Long wrote:

[...]

> + A parent cgroup cannot distribute all its CPUs to child
> + scheduling domain cgroups unless its load balancing flag is
> + turned off.
> +
> +  cpuset.sched.load_balance
> + A read-write single value file which exists on non-root
> + cpuset-enabled cgroups.  It is a binary value flag that accepts
> + either "0" (off) or a non-zero value (on).  This flag is set
> + by the parent and is not delegatable.
> +
> + When it is on, tasks within this cpuset will be load-balanced
> + by the kernel scheduler.  Tasks will be moved from CPUs with
> + high load to other CPUs within the same cpuset with less load
> + periodically.
> +
> + When it is off, there will be no load balancing among CPUs on
> + this cgroup.  Tasks will stay in the CPUs they are running on
> + and will not be moved to other CPUs.
> +
> + The initial value of this flag is "1".  This flag is then
> + inherited by child cgroups with cpuset enabled.  Its state
> + can only be changed on a scheduling domain cgroup with no
> + cpuset-enabled children.

[...]

> + /*
> +  * On default hierarchy, a load balance flag change is only allowed
> +  * in a scheduling domain with no child cpuset.
> +  */
> + if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && balance_flag_changed &&
> +(!is_sched_domain(cs) || css_has_online_children(&cs->css))) {
> + err = -EINVAL;
> + goto out;
> + }

The rule is actually

 - no child cpuset
 - and it must be a scheduling domain

Right?


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 21:58:49, TSUKADA Koutaro wrote:
> On 2018/05/24 17:20, Michal Hocko wrote:
> > On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> >> On 2018/05/23 3:54, Michal Hocko wrote:
> > [...]
> >>> I am also quite confused why you keep distinguishing surplus hugetlb
> >>> pages from regular preallocated ones. Being a surplus page is an
> >>> implementation detail that we use for an internal accounting rather than
> >>> something to exhibit to the userspace even more than we do currently.
> >>
> >> I apologize for having confused.
> >>
> >> The hugetlb pages obtained from the pool do not waste the buddy pool.
> > 
> > Because they have already allocated from the buddy allocator so the end
> > result is very same.
> > 
> >> On
> >> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
> >> difference in property, I thought it could be distinguished.
> > 
> > But this is simply not correct. Surplus pages are fluid. If you increase
> > the hugetlb size they will become regular persistent hugetlb pages.
> 
> I really cannot understand what's wrong with this. That page is obviously
> released before being added to the persistent pool, and at that time it is
> uncharged from the memcg to which the task belongs (this assumes my patch-set).
> After that, the same page obtained from the pool is not a surplus hugepage,
> so it will not be charged to memcg again.

I do not see anything like that. adjust_pool_surplus is simply an
accounting thing. At least the last time I've checked. Maybe your
patchset handles that?
 
> >> Although my memcg knowledge is extremely limited, memcg is accounting for
> >> various kinds of pages obtained from the buddy pool by the task belonging
> >> to it. I would like to argue that surplus hugepages are special in terms
> >> of how they are obtained from the buddy pool, and that charging them to
> >> memcg should be specially permitted.
> > 
> > Not really. Memcg accounts primarily for reclaimable memory. We do
> > account for some non-reclaimable slabs but the life time should be at
> > least bound to a process life time. Otherwise the memcg oom killer
> > behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> > simply persistent. Well, to be completely honest tmpfs pages have a
> > similar problem but lacking the swap space for them is kinda
> > configuration bug.
> 
> Absolutely you are saying the right thing, but, for example, can mlock(2)ed
> pages be swapped out by reclaim? (What is the difference between mlock(2)ed
> pages and hugetlb pages?)

No, mlocked pages cannot be reclaimed and that is why we restrict them to
a relatively small amount.
 
>> It seems very strange to charge a hugetlb page to memcg, but essentially
>> it only charges the usage of the compound page obtained from the buddy pool,
>> and even if that page is used as a hugetlb page after that, memcg is not
>> interested in that.
> > 
> > Ohh, it is very much interested. The primary goal of memcg is to enforce
> > the limit. How are you going to do that in an absence of the reclaimable
> > memory? And quite a lot of it because hugetlb pages usually consume a
> > lot of memory.
> 
> Simply kill any of the tasks belonging to that memcg. Maybe no one wants
> reclaim at the time of accounting for surplus hugepages.

But that will not release the hugetlb memory, does it?
 
> [...]
> >> I could not understand the intention of this question, sorry. When resize
> >> the pool, I think that the number of surplus hugepages in use does not
> >> change. Could you explain what you were concerned about?
> > 
> > It does change when you change the hugetlb pool size, migrate pages
> > between per-numa pools (have a look at adjust_pool_surplus).
> 
> As far as I looked, what kind of fatal problem is caused by charging surplus
> hugepages to memcg by just manipulating a counter of statistical information?

Fatal? Not sure. It simply tries to add an alien memory to the memcg
concept so I would presume unexpected behavior (e.g. not being able
to reclaim the memcg, over-reclaim, thrashing, etc.).
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] libata: remove ata_sff_data_xfer_noirq()

2018-05-24 Thread Sebastian Andrzej Siewior
On 2018-05-07 17:52:16 [+0200], To Tejun Heo wrote:
> On 2018-05-07 08:49:08 [-0700], Tejun Heo wrote:
> > Hello, Sebastian.
Hi Tejun,

> > On Fri, May 04, 2018 at 05:06:20PM +0200, Sebastian Andrzej Siewior wrote:
> > > ata_sff_data_xfer_noirq() is invoked via the ->sff_data_xfer hook. The
> > > latter is invoked by ata_pio_sector(), atapi_send_cdb() and
> > > __atapi_pio_bytes() which in turn is invoked by ata_sff_hsm_move().
> > > The latter function requires that the "ap->lock" lock is held which
> > > needs to be taken with disabled interrupts.
> > > 
> > > There is no need have to have ata_sff_data_xfer_noirq() which invokes
> > > ata_sff_data_xfer32() with disabled interrupts because at this point the
> > > interrupts are already disabled.
> > > Remove the function and its references to it and replace all callers
> > > with ata_sff_data_xfer32().
> > 
> > Can you please add irq disabled assert to ata_sff_data_xfer*()?
> 
> Why irq-disabled assert? Can we use lockdep_assert_held() instead?
That irq-disabled assert won't work on RT as expected; that is why I
intend to remove the local_irq_save() (which is not needed). If we could
avoid the irq-disabled assert or use a lock instead, then it wouldn't
trigger another error on RT.

> > Thanks.

Sebastian


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread TSUKADA Koutaro
On 2018/05/24 17:20, Michal Hocko wrote:
> On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
>> On 2018/05/23 3:54, Michal Hocko wrote:
> [...]
>>> I am also quite confused why you keep distinguishing surplus hugetlb
>>> pages from regular preallocated ones. Being a surplus page is an
>>> implementation detail that we use for an internal accounting rather than
>>> something to exhibit to the userspace even more than we do currently.
>>
>> I apologize for having confused.
>>
>> The hugetlb pages obtained from the pool do not waste the buddy pool.
> 
> Because they have already allocated from the buddy allocator so the end
> result is very same.
> 
>> On
>> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
>> difference in property, I thought it could be distinguished.
> 
> But this is simply not correct. Surplus pages are fluid. If you increase
> the hugetlb size they will become regular persistent hugetlb pages.

I really cannot understand what's wrong with this. That page is obviously
released before being added to the persistent pool, and at that time it is
uncharged from the memcg to which the task belongs (this assumes my patch-set).
After that, the same page obtained from the pool is not a surplus hugepage,
so it will not be charged to memcg again.

>> Although my memcg knowledge is extremely limited, memcg is accounting for
>> various kinds of pages obtained from the buddy pool by the task belonging
>> to it. I would like to argue that surplus hugepages are special in terms
>> of how they are obtained from the buddy pool, and that charging them to
>> memcg should be specially permitted.
> 
> Not really. Memcg accounts primarily for reclaimable memory. We do
> account for some non-reclaimable slabs but the life time should be at
> least bound to a process life time. Otherwise the memcg oom killer
> behavior is not guaranteed to unclutter the situation. Hugetlb pages are
> simply persistent. Well, to be completely honest tmpfs pages have a
> similar problem but lacking the swap space for them is kinda
> configuration bug.

Absolutely you are saying the right thing, but, for example, can mlock(2)ed
pages be swapped out by reclaim? (What is the difference between mlock(2)ed
pages and hugetlb pages?)

>> It seems very strange to charge a hugetlb page to memcg, but essentially
>> it only charges the usage of the compound page obtained from the buddy pool,
>> and even if that page is used as a hugetlb page after that, memcg is not
>> interested in that.
> 
> Ohh, it is very much interested. The primary goal of memcg is to enforce
> the limit. How are you going to do that in an absence of the reclaimable
> memory? And quite a lot of it because hugetlb pages usually consume a
> lot of memory.

Simply kill any of the tasks belonging to that memcg. Maybe no one wants
reclaim at the time of accounting for surplus hugepages.

[...]
>> I could not understand the intention of this question, sorry. When resize
>> the pool, I think that the number of surplus hugepages in use does not
>> change. Could you explain what you were concerned about?
> 
> It does change when you change the hugetlb pool size, migrate pages
> between per-numa pools (have a look at adjust_pool_surplus).

As far as I looked, what kind of fatal problem is caused by charging surplus
hugepages to memcg by just manipulating a counter of statistical information?

-- 
Thanks,
Tsukada



Re: [PATCH 07/24] arm64: ilp32: add documentation on the ILP32 ABI for ARM64

2018-05-24 Thread Dr. Philipp Tomsich
Yury & Pavel,

> On 24 May 2018, at 14:15, Yury Norov  wrote:
> 
> Hi Pavel,
> 
> On Wed, May 23, 2018 at 04:06:20PM +0200, Pavel Machek wrote:
>> On Wed 2018-05-16 11:18:52, Yury Norov wrote:
>>> Based on Andrew Pinski's patch-series.
>>> 
>>> Signed-off-by: Yury Norov 
>> 
>> So Andrew's signoff should be here?
> 
> Yes it should, but it got lost since v4. I'll restore it.
> 
>>> ---
>>> Documentation/arm64/ilp32.txt | 45 +++
>>> 1 file changed, 45 insertions(+)
>>> create mode 100644 Documentation/arm64/ilp32.txt
>>> 
>>> diff --git a/Documentation/arm64/ilp32.txt b/Documentation/arm64/ilp32.txt
>>> new file mode 100644
>>> index ..d0fd5109c4b2
>>> --- /dev/null
>>> +++ b/Documentation/arm64/ilp32.txt
>>> @@ -0,0 +1,45 @@
>>> +ILP32 AARCH64 SYSCALL ABI
>>> +=
>>> +
>>> +This document describes the ILP32 syscall ABI and where it differs
>>> +from the generic compat linux syscall interface.
>> 
>> I was hoping to learn what ILP32 is / what is it good for, but no,
>> this does not tell me... it would be good to do a short explanation
>> here, and maybe reference it from cover letter of the series...
>>  Pavel
> 
> ILP32 is an ABI acronym that means "Integers, Longs and Pointers are 32-bit".
> And LP64 means "Longs and Pointers are 64-bit".

Just a nitpick: ILP32 is in fact just the memory model, but calling from ILP32
code into the Linux kernel requires modifications to the syscall ABI due to
data-structure layouts changing (every time a pointer or a 'long' is present in
a structure). As such structures are passed between userspace and the
kernel (and also due to the fact that time_t is an 'unsigned long' in the C
language standard), modifications to the syscall ABI in Linux are needed to
support ILP32 processes calling into the kernel.

Things get a bit more involved, as the final consensus was to pass 64-bit
quantities in the lower halves of two 64-bit registers instead of in a single
register: this makes the way (on AArch64) that an ILP32 process calls into the
kernel more dissimilar from an LP64 process calling the same syscall.

What this rambling boils down to is: "ILP32" is the memory model, whereas
this series deals with the "Linux/AArch64 syscall ABI for ILP32 processes".

Thanks,
Phil.

> 
> There's AN490 - "ILP32 for AArch64 Whitepaper" from ARM which covers
> the topic:
> http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0490a/ar01s01.html
> 
> And some talks:
> http://connect.linaro.org/resource/bkk16/bkk16-305b/
> 
> Briefly, ILP32 is a 32-bit ABI that works with the AArch64 instruction set. It looks
> better in some performance tests, and is useful for compatibility with 32-bit
> legacy code.
> 
> If you're more familiar with x86 terminology, in the ARM world LP64 corresponds
> to x86_64, AARCH32_EL0 corresponds to x86_32, and ILP32 corresponds to x32
> ABI.
> 
> I'll add link to AN490 in next submission.
> 
> Yury
> 
>> -- 
>> (english) http://www.livejournal.com/~pavelmachek
>> (cesky, pictures) 
>> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



Re: [PATCH 07/24] arm64: ilp32: add documentation on the ILP32 ABI for ARM64

2018-05-24 Thread Yury Norov
Hi Pavel,

On Wed, May 23, 2018 at 04:06:20PM +0200, Pavel Machek wrote:
> On Wed 2018-05-16 11:18:52, Yury Norov wrote:
> > Based on Andrew Pinski's patch-series.
> > 
> > Signed-off-by: Yury Norov 
> 
> So Andrew's signoff should be here?

Yes it should, but it got lost since v4. I'll restore it.
 
> > ---
> >  Documentation/arm64/ilp32.txt | 45 +++
> >  1 file changed, 45 insertions(+)
> >  create mode 100644 Documentation/arm64/ilp32.txt
> > 
> > diff --git a/Documentation/arm64/ilp32.txt b/Documentation/arm64/ilp32.txt
> > new file mode 100644
> > index ..d0fd5109c4b2
> > --- /dev/null
> > +++ b/Documentation/arm64/ilp32.txt
> > @@ -0,0 +1,45 @@
> > +ILP32 AARCH64 SYSCALL ABI
> > +=
> > +
> > +This document describes the ILP32 syscall ABI and where it differs
> > +from the generic compat linux syscall interface.
> 
> I was hoping to learn what ILP32 is / what is it good for, but no,
> this does not tell me... it would be good to do a short explanation
> here, and maybe reference it from cover letter of the series...
>   Pavel

ILP32 is an ABI acronym that means "Integers, Longs and Pointers are 32-bit".
And LP64 means "Longs and Pointers are 64-bit".

There's AN490 - "ILP32 for AArch64 Whitepaper" from ARM which covers
the topic:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0490a/ar01s01.html

And some talks:
http://connect.linaro.org/resource/bkk16/bkk16-305b/

Briefly, ILP32 is a 32-bit ABI that works with the AArch64 instruction set. It looks
better in some performance tests, and is useful for compatibility with 32-bit
legacy code.

If you're more familiar with x86 terminology, in the ARM world LP64 corresponds
to x86_64, AARCH32_EL0 corresponds to x86_32, and ILP32 corresponds to x32
ABI.

I'll add link to AN490 in next submission.

Yury

> -- 
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) 
> http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html





Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-24 Thread Juri Lelli
On 24/05/18 10:04, Patrick Bellasi wrote:

[...]

> From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
> From: Patrick Bellasi 
> Date: Wed, 23 May 2018 16:33:06 +0100
> Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
>  required
> 
> The generate_sched_domains() already addresses the "special case for 99%
> of systems" which require a single full sched domain at the root,
> spanning all the CPUs. However, the current support is based on an
> expensive sequence of operations which destroy and recreate the exact
> same scheduling domain configuration.
> 
> If we notice that:
> 
>  1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
> isolcpus= kernel boot option, and will never be load balanced
> regardless of the value of "cpuset.sched_load_balance" in any
> cpuset.
> 
>  2) the root cpuset has load_balance enabled by default at boot and
> it's the only parameter which userspace can change at run-time.
> 
> we know that, by default, every system comes up with a complete and
> properly configured set of scheduling domains covering all the CPUs.
> 
> Thus, on every system, unless the user explicitly disables load balance
> for the top_cpuset, the scheduling domains already configured at boot
> time by the scheduler/topology code and updated in consequence of
> hotplug events, are already properly configured for cpuset too.
> 
> This configuration is the default one for 99% of the systems,
> and it's also the one used by most of the Android devices which never
> disable load balance from the top_cpuset.
> 
> Thus, while load balance is enabled for the top_cpuset,
> destroying/rebuilding the scheduling domains at every cpuset.cpus
> reconfiguration is a useless operation which will always produce the
> same result.
> 
> Let's anticipate the "special" optimization within:
> 
>rebuild_sched_domains_locked()
> 
> thus completely skipping the expensive:
> 
>generate_sched_domains()
>partition_sched_domains()
> 
> for all the cases where we know that the scheduling domains already defined
> will not be affected by whatever value cpuset.cpus is set to.

[...]

> + /* Special case for the 99% of systems with one, full, sched domain */
> + if (!top_cpuset.isolation_count &&
> + is_sched_load_balance(&top_cpuset))
> + goto out;
> +

Mmm, looks like we still need to destroy and recreate if there is a
new_topology (see arch_update_cpu_topology() in partition_sched_domains()).

Maybe we could move the check you are proposing into update_cpumasks_hier()?


Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-24 Thread Juri Lelli
On 17/05/18 16:55, Waiman Long wrote:

[...]

> @@ -849,7 +860,12 @@ static void rebuild_sched_domains_locked(void)
>* passing doms with offlined cpu to partition_sched_domains().
>* Anyways, hotplug work item will rebuild sched domains.
>*/
> - if (!cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + if (!top_cpuset.isolation_count &&
> + !cpumask_equal(top_cpuset.effective_cpus, cpu_active_mask))
> + goto out;
> +
> + if (top_cpuset.isolation_count &&
> +!cpumask_subset(top_cpuset.effective_cpus, cpu_active_mask))
>   goto out;

Do we cover the case in which hotplug removed one of the isolated cpus
from cpu_active_mask?


Re: [PATCH v8 4/6] cpuset: Make generate_sched_domains() recognize isolated_cpus

2018-05-24 Thread Patrick Bellasi
On 23-May 16:18, Waiman Long wrote:
> On 05/23/2018 01:34 PM, Patrick Bellasi wrote:
> > Hi Waiman,
> >
> > On 17-May 16:55, Waiman Long wrote:
> >
> > [...]
> >
> >> @@ -672,13 +672,14 @@ static int generate_sched_domains(cpumask_var_t 
> >> **domains,
> >>int ndoms = 0;  /* number of sched domains in result */
> >>int nslot;  /* next empty doms[] struct cpumask slot */
> >>struct cgroup_subsys_state *pos_css;
> >> +  bool root_load_balance = is_sched_load_balance(&top_cpuset);
> >>  
> >>doms = NULL;
> >>dattr = NULL;
> >>csa = NULL;
> >>  
> >>/* Special case for the 99% of systems with one, full, sched domain */
> >> -  if (is_sched_load_balance(_cpuset)) {
> >> +  if (root_load_balance && !top_cpuset.isolation_count) {
> > Perhaps I'm missing something but, it seems to me that, when the two
> > conditions above are true, then we are going to destroy and rebuild
> > the exact same scheduling domains.
> >
> > IOW, on 99% of systems where:
> >
> >    is_sched_load_balance(&top_cpuset)
> >top_cpuset.isolation_count = 0
> >
> > since boot time and forever, then every time we update a value for
> > cpuset.cpus we keep rebuilding the same SDs.
> >
> > It's not strictly related to this patch, the same already happens in
> > mainline based just on the first condition, but since you are extending
> > that optimization, perhaps you can tell me where I'm possibly wrong or
> > which cases I'm not considering.
> >
> > I'm interested mainly because on Android systems those conditions
> > are always true and we see SDs rebuilds every time we write
> > something in cpuset.cpus, which ultimately accounts for almost all the
> > 6-7[ms] time required for the write to return, depending on the CPU
> > frequency.
> >
> > Cheers Patrick
> >
> Yes, that is true. I will look into how to further optimize this. Thanks
> for the suggestion.

FWIW, the following is my take on top of your series.

With the following patch applied I see a reduction of the average
execution time for a rebuild_sched_domains_locked() from 1.4[ms] to
40[us] while running 60 /tg1/cpuset.cpus switches in a loop on a
JunoR2 Arm board using the performance cpufreq governor.

---8<---
From 84bb8137ce79f74849d97e30871cf67d06d8d682 Mon Sep 17 00:00:00 2001
From: Patrick Bellasi 
Date: Wed, 23 May 2018 16:33:06 +0100
Subject: [PATCH 1/1] cgroup/cpuset: disable sched domain rebuild when not
 required

The generate_sched_domains() already addresses the "special case for 99%
of systems" which require a single full sched domain at the root,
spanning all the CPUs. However, the current support is based on an
expensive sequence of operations which destroy and recreate the exact
same scheduling domain configuration.

If we notice that:

 1) CPUs in "cpuset.isolcpus" are excluded from load balancing by the
isolcpus= kernel boot option, and will never be load balanced
regardless of the value of "cpuset.sched_load_balance" in any
cpuset.

 2) the root cpuset has load_balance enabled by default at boot and
it's the only parameter which userspace can change at run-time.

we know that, by default, every system comes up with a complete and
properly configured set of scheduling domains covering all the CPUs.

Thus, on every system, unless the user explicitly disables load balance
for the top_cpuset, the scheduling domains already configured at boot
time by the scheduler/topology code and updated in consequence of
hotplug events, are already properly configured for cpuset too.

This configuration is the default one for 99% of the systems,
and it's also the one used by most of the Android devices which never
disable load balance from the top_cpuset.

Thus, while load balance is enabled for the top_cpuset,
destroying/rebuilding the scheduling domains at every cpuset.cpus
reconfiguration is a useless operation which will always produce the
same result.

Let's anticipate the "special" optimization within:

   rebuild_sched_domains_locked()

thus completely skipping the expensive:

   generate_sched_domains()
   partition_sched_domains()

for all the cases where we know that the scheduling domains already defined
will not be affected by whatever value cpuset.cpus is set to.

The proposed solution is the minimal variation to optimize the case for
systems with load balance enabled at the root level and without isolated
CPUs. As soon as one of these conditions is no longer valid, we fall back
to the original behavior.

Signed-off-by: Patrick Bellasi 
Cc: Li Zefan 
Cc: Tejun Heo ,
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Frederic Weisbecker 
Cc: Johannes Weiner 
Cc: Mike Galbraith 
Cc: Paul Turner 
Cc: Waiman Long 
Cc: Juri Lelli 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: 

Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 13:26:12, TSUKADA Koutaro wrote:
[...]
> I do not know if it is really a strong use case, but I will explain my
> motive in detail. English is not my native language, so please pardon
> my poor English.
> 
> I am one of the developers of software that manages the resources used
> by user jobs on an HPC cluster running Linux. The resource is mainly memory.
> The HPC cluster may be shared and used by multiple people. Therefore, the
> memory used by each user must be strictly controlled; otherwise a
> user's job may run away, and not only will it hamper the other users, it
> will crash the entire system in OOM.
> 
> Some users of HPC are very nervous about performance. Jobs are executed
> while synchronizing with MPI communication using multiple compute nodes.
> Since CPU wait time will occur when synchronizing, they want to minimize
> the variation in execution time at each node to reduce waiting times as
> much as possible. We call this variation noise.
> 
> THP does not guarantee to use the Huge Page, but may use the normal page.
> This mechanism is one cause of variation (noise).
> 
> The users who know this mechanism will be hesitant to use THP. However,
> the users also know the benefits of the Huge Page's TLB hit rate
> performance, and the Huge Page seems to be attractive. It seems natural
> that these users are interested in HugeTLBfs; I do not know at all
> whether it is the right approach or not.

Sure, asking for guarantee makes hugetlb pages attractive. But nothing
is really for free, especially any resource _guarantee_, and you have to
pay an additional configuration price usually.
 
> At the very least, our HPC system is pursuing high versatility and we
> have to consider whether we can provide it if users want to use HugeTLBfs.
> 
> In order to use HugeTLBfs we need to create a persistent pool, but in
> our use case of sharing nodes, it would be impossible to create, delete or
> resize the pool.

Why? I can see this would be quite a PITA but not really impossible.

> One of the answers I have reached is to use HugeTLBfs by overcommitting
> without creating a pool (this is the surplus hugepage).
> 
> A surplus hugepage is a hugetlb page, but I think that consuming the buddy
> pool is at least a decisive difference from hugetlb pages of the persistent
> pool. If nr_overcommit_hugepages is assumed to be infinite, allocating pages
> for surplus hugepages from the buddy pool is completely unlimited even if
> limited by memcg.

Not really, you can specify how much you can overcommit hugetlb pages.

> In extreme cases, overcommitment will allow users to exhaust
> the entire memory of the system. Of course, this can be prevented by the
> hugetlb cgroup, but even if we set the limit for memcg and hugetlb cgroup
> respectively, as I asked in the first mail (set limit to 10GB), the
> control will not work.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 0/7] mm: pages for hugetlb's overcommit may be able to charge to memcg

2018-05-24 Thread Michal Hocko
On Thu 24-05-18 13:39:59, TSUKADA Koutaro wrote:
> On 2018/05/23 3:54, Michal Hocko wrote:
[...]
> > I am also quite confused why you keep distinguishing surplus hugetlb
> > pages from regular preallocated ones. Being a surplus page is an
> > implementation detail that we use for an internal accounting rather than
> > something to exhibit to the userspace even more than we do currently.
> 
> I apologize for having confused.
> 
> The hugetlb pages obtained from the pool do not waste the buddy pool.

Because they have already allocated from the buddy allocator so the end
result is very same.

> On
> the other hand, surplus hugetlb pages waste the buddy pool. Due to this
> difference in property, I thought it could be distinguished.

But this is simply not correct. Surplus pages are fluid. If you increase
the hugetlb size they will become regular persistent hugetlb pages.
 
> Although my memcg knowledge is extremely limited, memcg is accounting for
> various kinds of pages obtained from the buddy pool by the task belonging
> to it. I would like to argue that surplus hugepages are special in terms
> of how they are obtained from the buddy pool, and that charging them to
> memcg should be specially permitted.

Not really. Memcg accounts primarily for reclaimable memory. We do
account for some non-reclaimable slabs but the life time should be at
least bound to a process life time. Otherwise the memcg oom killer
behavior is not guaranteed to unclutter the situation. Hugetlb pages are
simply persistent. Well, to be completely honest tmpfs pages have a
similar problem but lacking the swap space for them is kinda
configuration bug.

> It seems very strange to charge a hugetlb page to memcg, but essentially
> it only charges the usage of the compound page obtained from the buddy pool,
> and even if that page is used as a hugetlb page after that, memcg is not
> interested in that.

Ohh, it is very much interested. The primary goal of memcg is to enforce
the limit. How are you going to do that in an absence of the reclaimable
memory? And quite a lot of it because hugetlb pages usually consume a
lot of memory.

> I will completely apologize if my way of thinking is wrong. It would be
> greatly appreciated if you could mention why we cannot charge surplus
> hugepages to memcg.
> 
> > Just look at what [sw]hould happen when you need to adjust accounting - e.g.
> > due to the pool resize. Are you going to uncharge those surplus pages
> > from memcg to reflect their persistence?
> > 
> 
> I could not understand the intention of this question, sorry. When
> resizing the pool, I think that the number of surplus hugepages in use
> does not change. Could you explain what you were concerned about?

It does change when you change the hugetlb pool size or migrate pages
between per-NUMA pools (have a look at adjust_pool_surplus).
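The accounting being discussed is visible from userspace. As a minimal, read-only sketch (assuming a kernel with 2MB hugetlb support and sysfs mounted; values are system-dependent):

```shell
# Read-only sketch: show the per-size hugetlb counters that change when
# the pool is resized. surplus_huge_pages drops as overcommitted pages
# are absorbed into a grown persistent pool (see adjust_pool_surplus).
base=/sys/kernel/mm/hugepages/hugepages-2048kB
for f in nr_hugepages nr_overcommit_hugepages surplus_huge_pages; do
  if [ -r "$base/$f" ]; then
    echo "$f = $(cat "$base/$f")"
  else
    echo "$f = (unavailable on this system)"
  fi
done
```

Growing nr_hugepages while surplus pages are in use converts them into persistent pool pages, which is exactly the accounting transition in question.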
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH bpf-next v2 0/3] bpf: add boot parameters for sysctl knobs

2018-05-24 Thread Jesper Dangaard Brouer
On Wed, 23 May 2018 15:02:45 -0700
Alexei Starovoitov  wrote:

> On Wed, May 23, 2018 at 02:18:19PM +0200, Eugene Syromiatnikov wrote:
> > Some BPF sysctl knobs affect the loading of BPF programs, and during
> > system boot/init stages these sysctls are not yet configured.
> > A concrete example is systemd, that has implemented loading of BPF
> > programs.
> > 
> > Thus, to allow controlling these setting at early boot, this patch set
> > adds the ability to change the default setting of these sysctl knobs
> > as well as option to override them via a boot-time kernel parameter
> > (in order to avoid rebuilding kernel each time a need of changing these
> > defaults arises).
> > 
> > The sysctl knobs in question are kernel.unprivileged_bpf_disabled,
> > net.core.bpf_jit_harden, and net.core.bpf_jit_kallsyms.  
> 
> - systemd is root. today it only uses cgroup-bpf progs which require root,
>   so disabling unpriv during boot time makes no difference to systemd.
>   what is the actual reason to do this at the present time?
> 
> - say in the future systemd wants to use so_reuseport+bpf for faster
>   networking. With unpriv disabled during boot, it will force systemd
>   to do such networking from root, which will lower its security barrier.
>   How does that make sense?
> 
> - the bpf_jit_kallsyms sysctl has immediate effect on loaded programs.
>   Flipping it during boot, right after, or any time later is the same
>   thing. Why add such a boot flag then?
> 
> - jit_harden can be turned on by systemd, so turning it on during boot
>   will make systemd progs constant blinded.
>   Constant blinding protects the kernel from unprivileged JIT spraying.
>   Are you worried that systemd will attack the kernel with JIT spraying?


I think you are missing that we want the ability to change these
defaults in order to avoid depending on /etc/sysctl.conf settings, and
that these sysctl.conf settings happen too late.

For example, with jit_harden there will be a difference between a BPF
program that systemd loaded at boot time (no constant blinding) and the
same program reloaded after /etc/sysctl.conf has been evaluated and
bpf_jit_harden has been set (now slower due to constant blinding).  This
is inconsistent behavior.
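The knobs in question are all ordinary sysctls today. A minimal, read-only sketch of inspecting them (assuming procfs is mounted; values differ per system and kernel config):

```shell
# Read-only sketch: print the current values of the three BPF sysctls
# the patch set wants to make settable at boot time.
for k in kernel/unprivileged_bpf_disabled \
         net/core/bpf_jit_harden \
         net/core/bpf_jit_kallsyms; do
  p=/proc/sys/$k
  if [ -r "$p" ]; then
    echo "$k = $(cat "$p")"
  else
    echo "$k = (not available)"
  fi
done
```

Anything loaded before /etc/sysctl.conf is applied sees the compiled-in defaults, which is the window the proposed boot parameters are meant to close.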

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2 1/5] PCI/AER: Define and allocate aer_stats structure for AER capable devices

2018-05-24 Thread Greg Kroah-Hartman
On Wed, May 23, 2018 at 10:58:04AM -0700, Rajat Jain wrote:
> ---
> v2: Fix the license header as per Greg's suggestions
> (Since there is disagreement about using "//" vs "/* */" for the license,
>  I decided to keep the one preferred by Linus, which is also used by
>  others in this directory)

The rules are pretty simple for how to do this, and they are documented
in Documentation/process/license-rules.rst, please just follow that like
the rest of the kernel has done.
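For reference, the convention Greg points at puts the SPDX tag on the first possible line of the file, with C source files using the C99 "//" comment style. A minimal sketch (hypothetical file name):

```shell
# Sketch: per Documentation/process/license-rules.rst, the SPDX tag is
# the very first line of a .c file and uses the '//' comment style.
cat > /tmp/aer_stats_example.c <<'EOF'
// SPDX-License-Identifier: GPL-2.0
EOF
head -n1 /tmp/aer_stats_example.c
# prints: // SPDX-License-Identifier: GPL-2.0
```

Header files (.h) use `/* SPDX-License-Identifier: GPL-2.0 */` instead, since they may be included in contexts where "//" comments are not allowed.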

thanks,

greg k-h