date:20160708

* Dave Hansen  wrote:

> > Applications that frequently get called will get hammed into the ground 
> > with 
> > serialisation on mmap_sem not to mention the cost of the syscall entry/exit.
>
> I think we can do both of them without mmap_sem, as long as we resign 
> ourselves 
> to this just being fundamentally racy (which it is already, I think).  But, 
> is 
> it worth performance-tuning things that we don't expect performance-sensitive 
> apps to be using in the first place?  They'll just use the RDPKRU/WRPKRU 
> instructions directly.
>
> Ingo, do you still feel strongly that these syscalls (pkey_set/get()) should 
> be 
> included?  Of the 5, they're definitely the two with the weakest 
> justification.

Firstly, I'd like to thank Mel for the review, having this kind of high level 
interface discussion was exactly what I was hoping for before we merged any ABI 
patches.

So my hope was that we'd also grow some debugging features: such as a periodic 
watchdog timer clearing all non-allocated pkeys of a task and re-setting them 
to 
their (kernel-)known values and thus forcing user-space to coordinate key 
allocation/freeing.

While allocation/freeing of keys is very likely a slow path in any reasonable 
workload, _setting_ the values of pkeys could easily be a fast path. The whole 
point of pkeys is to allow both thread local and dynamic mapping and unmapping 
of 
memory ranges, without having to touch any page table attributes in the hot 
path.

Now another, more fundamental question is that pkeys are per-CPU (per thread) 
on 
the hardware side - so why do we even care about the mmap_sem in the syscalls 
in 
the first place? If we want any serialization wouldn't a pair of 
get_cpu()/put_cpu() preemption control calls be enough? Those would also be 
much 
cheaper.

The idea behind my suggestion to manage all pkey details via a kernel interface 
and 'shadow' the pkeys state in the kernel was that if we don't even offer a 
complete system call interface then user-space is _forced_ into using the raw 
instructions and into implementing a (potentially crappy) uncoordinated API or 
not 
implementing any API at all but randomly using fixed pkey indices.

My hope was to avoid that, especially since currently _all_ memory mapping 
details 
on x86 are controlled via system calls. If we don't offer that kind of facility 
then we force user-space into using the raw instructions and will likely end up 
with a poor user-space interface.

So the question is, what is user-space going to do? Do any glibc patches exist? 
How are the user-space library side APIs going to look like?

Thanks,

Ingo

Re: Introspecting userns relationships to other namespaces?

2016-07-08 Thread W. Trevor King

On Thu, Jul 07, 2016 at 11:54:54PM -0700, Andrew Vagin wrote:
> On Thu, Jul 07, 2016 at 10:26:50PM -0700, W. Trevor King wrote:
> > On Thu, Jul 07, 2016 at 08:26:47PM -0700, James Bottomley wrote:
> > > On Thu, 2016-07-07 at 20:00 -0700, Andrew Vagin wrote:
> > > > On Thu, Jul 07, 2016 at 07:16:18PM -0700, Andrew Vagin wrote:
> > > > > I think we can show all required information in fdinfo. We open
> > > > > a namespaces file (/proc/pid/ns/N) and then read
> > > > > /proc/pid/fdinfo/X for it.
> > > > 
> > > > Here is a proof-of-concept patch.
> > > > …
> > > > In [2]: fd = os.open("/proc/self/ns/pid", os.O_RDONLY)
> > > > 
> > > > In [3]: print open("/proc/self/fdinfo/%d" % fd).read()
> > > > pos:0
> > > > flags:  010
> > > > mnt_id: 2
> > > > userns: 4026531837
> > > > 
> > > > In [4]: print "/proc/self/ns/user -> %s" %
> > > > os.readlink("/proc/self/ns/user")
> > > > /proc/self/ns/user -> user:[4026531837]
> > > 
> > > can't you just do
> > > 
> > > readlink /proc/self/ns/user | sed 's/.*\[\(.*\)\]/\1/'
> > …
> > If you only put one level in fdinfo, you're stuck if one of the
> > namespaces involved has neither bind mounts nor a PID to give you
> > handle on it [1].  And if you want to put that whole ancestor tree in
> > fdinfo, you have to come up with some way to handle the two-parent
> > branching.
> 
> I think it's a bad idea to draw a tree in fdinfo. Why do we want to know
> this hierarchy? Probably we will want to access these namespaces (setns),
> in this case we need to have a way to open them.
> 
> Maybe we need to extend functionality of the nsfs filesystem
> (somethink like /proc/PID for namespaces)?

A similar idea came up during the PID-translation brainstorming [1],
but I'm not sure if anything ever came of that.  Once you're dealing
with a separate pseudo-filesystem, it seems easier to decouple it from
proc and just make a mountable namespace-hierarchy filesystem (like we
have mountable cgroup hierarchy filesystems).  That also gets you an
opt-in playground while the details of the nsfs filesystem view are
worked out.  Are you imagining something like:

  $ tree .
  .
  ├── mnt{inum}
  │   └── user -> ../user{inum}
  ├── pid{inum}
  │   ├── pid{inum}
  │   │   └── user -> ../../user{inum}/user{inum}
  │   └── user -> ../user{inum}
  └── user{inum}
  └── user{inum}

Cheers,
Trevor

[1]: http://thread.gmane.org/gmane.linux.kernel.containers/28105/focus=28164
 Subject: RE: [RFC]Pid conversion between pid namespace
 Date: Fri, 25 Jul 2014 10:01:45 +
 Message-ID: 
<5871495633F38949900D2BF2DC04883E56C7A2@G08CNEXMBPEKD02.g08.fujitsu.local>

-- 
This email may be signed or encrypted with GnuPG (http://www.gnupg.org).
For more information, see http://en.wikipedia.org/wiki/Pretty_Good_Privacy


signature.asc
Description: OpenPGP digital signature

Re: [PATCH 1/5] security, overlayfs: provide copy up security hook for unioned files

2016-07-08 Thread Miklos Szeredi

On Thu, Jul 7, 2016 at 11:44 PM, Casey Schaufler  wrote:
> On 7/7/2016 1:33 PM, Vivek Goyal wrote:
>> On Tue, Jul 05, 2016 at 12:36:17PM -0700, Casey Schaufler wrote:
>>> On 7/5/2016 8:50 AM, Vivek Goyal wrote:
 Provide a security hook to label new file correctly when a file is copied
 up from lower layer to upper layer of a overlay/union mount.

 This hook can prepare and switch to a new set of creds which are suitable
 for new file creation during copy up. Caller should revert to old creds
 after file creation.

 In SELinux, newly copied up file gets same label as lower file for
 non-context mounts. But it gets label specified in mount option context=
 for context mounts.

 Signed-off-by: Vivek Goyal 
 ---
  fs/overlayfs/copy_up.c|  8 
  include/linux/lsm_hooks.h | 13 +
  include/linux/security.h  |  6 ++
  security/security.c   |  8 
  security/selinux/hooks.c  | 27 +++
  5 files changed, 62 insertions(+)

 diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
 index 80aa6f1..90dc362 100644
 --- a/fs/overlayfs/copy_up.c
 +++ b/fs/overlayfs/copy_up.c
 @@ -246,6 +246,7 @@ static int ovl_copy_up_locked(struct dentry *workdir, 
 struct dentry *upperdir,
 struct dentry *upper = NULL;
 umode_t mode = stat->mode;
 int err;
 +   const struct cred *old_creds = NULL;

 newdentry = ovl_lookup_temp(workdir, dentry);
 err = PTR_ERR(newdentry);
 @@ -258,10 +259,17 @@ static int ovl_copy_up_locked(struct dentry 
 *workdir, struct dentry *upperdir,
 if (IS_ERR(upper))
 goto out1;

 +   err = security_inode_copy_up(dentry, &old_creds);
 +   if (err < 0)
 +   goto out2;
 +
 /* Can't properly set mode on creation because of the umask */
 stat->mode &= S_IFMT;
 err = ovl_create_real(wdir, newdentry, stat, link, NULL, true);
 stat->mode = mode;
 +   if (old_creds)
 +   revert_creds(old_creds);
 +
 if (err)
 goto out2;
>>> I don't much care for the way part of the credential manipulation
>>> is done in the caller and part is done the the security module.
>>> If the caller is going to restore the old state, the caller should
>>> save the old state.

Conversely if the SM is setting the state it should restore it.
This needs yet another hook, but that's fine, I think.

>> One advantage of current patches is that we switch to new creds only if
>> it is needed. For example, if there are no LSMs loaded,
>
> Point.
>
>>  then there is
>> no need to modify creds and make a switch to new creds.
>
> I'm not a fan of cred flipping. There are too many ways for it to go
> wrong. Consider interrupts. I assume you've ruled that out as a possibility
> in the caller, but I still think the practice is dangerous.
>
> I greatly prefer "create and set attributes" to "change cred, create and
> reset cred". I know that has it's own set of problems, including races
> and faking privilege.

Yeah, we've talked about this. The races can be eliminated by always
doing the create in a the temporary "workdir" area and atomically
renaming to the final destination after everything has been set up.
OTOH that has a performance impact that the cred flipping eliminates.

>> But if I start allocating new creds and save old state in caller, then
>> caller always has to do it (irrespective of the fact whether any LSM
>> modified the creds or not).
>
> It starts getting messy when I have two modules that want to
> change change the credential. Each module will have to check to
> see if a module called before it has allocated a new cred.

Doesn't seem to me too difficult: check if *credp == NULL and allocate
if so.  Can even invent a heper for this if needed.

Thanks,
Miklos

Re: [RFC PATCH 0/1] Portable Device Tree Connector -- conceptual

2016-07-08 Thread Pantelis Antoniou

Hi David,

> On Jul 7, 2016, at 10:15 , David Gibson  wrote:
> 
> On Sat, Jul 02, 2016 at 04:55:49PM -0700, frowand.l...@gmail.com wrote:
>> From: Frank Rowand 
>> 
>> Hi All,
>> 
>> This is version 2 of this email.
>> 
>> Changes from version 1:
>> 
>>  - some rewording of the text
>>  - removed new (theoretical) dtc directive "/connector/"
>>  - added compatibility between mother board and daughter board
>>  - added info on applying a single .dtbo to different connectors
>>  - attached an RFC patch showing the required kernel changes
>>  - changes to mother board .dts connector node:
>> - removed target_path property
>> - added connector-socket property
>>  - changes to daughter board .dts connector node:
>> - added connector-plug property
>> 
>> 
>> I've been trying to wrap my head around what Pantelis and Rob have written
>> on the subject of a device tree representation of a connector for a
>> daughter board to connect to (eg a cape or a shield) and the representation
>> of the daughter board.  (Or any other physically pluggable object.)
>> 
>> After trying to make sense of what had been written (or presented via slides
>> at a conference - thanks Pantelis!), I decided to go back to first principals
>> of what we are trying to accomplish.  I came up with some really simple bogus
>> examples to try to explain what my thought process is.
>> 
>> This is an extremely simple example to illustrate the concepts.  It is not
>> meant to represent the complexity of a real board.
>> 
>> To start with, assume that the device that will eventually be on a daughter
>> board is first soldered onto the mother board.  The mother board contains
>> two devices connected via bus spi_1.  One device is described in the .dts
>> file, the other is described in an included .dtsi file.
>> Then the device tree files will look like:
>> 
>> $ cat board.dts
>> /dts-v1/;
>> 
>> / {
>>#address-cells = < 1 >;
>>#size-cells = < 1 >;
>> 
>>tree_1: soc@0 {
>>reg = <0x0 0x0>;
>> 
>>spi_1: spi1 {
>>};
>>};
>> 
>> };
>> 
>> &spi_1 {
>>ethernet-switch@0 {
>>compatible = "micrel,ks8995m";
>>};
>> };
>> 
>> #include "spi_codec.dtsi"
>> 
>> 
>> $ cat spi_codec.dtsi
>> &spi_1 {
>>  codec@1 {
>>  compatible = "ti,tlv320aic26";
>>  };
>> };
>> 
>> 
>> #- codec chip on cape
>> 
>> Then suppose I move the codec chip to a cape.  Then I will have the same
>> exact .dts and .dtsi and everything still works.
>> 
>> 
>> @- codec chip on cape, overlay
>> 
>> If I want to use overlays, I only have to add the version and "/plugin/",
>> then use the '-@' flag for dtc (both for the previous board.dts and
>> this spi_codec_overlay.dts):
>> 
>> $ cat spi_codec_overlay.dts
>> /dts-v1/;
>> 
>> /plugin/;
>> 
>> &spi_1 {
>>  codec@1 {
>>  compatible = "ti,tlv320aic26";
>>  };
>> };
>> 
>> 
>> Pantelis pointed out that the syntax has changed to be:
>>   /dts-v1/ /plugin/;
>> 
>> 
>> #- codec chip on cape, overlay, connector
>> 
>> Now we move into the realm of connectors.  My mental model of what the
>> hardware and driver look like has not changed.  The only thing that has
>> changed is that I want to be able to specify that the connector that
>> the cape is plugged into has some pins that are the spi bus /soc/spi1.
>> 
>> The following _almost_ but not quite gets me what I want.  Note that
>> the only thing the connector node does is provide some kind of
>> pointer or reference to what node(s) are physically routed through
>> the connector.  The connector node does not need to describe the pins;
>> it only has to point to the node that describes the pins.
>> 
>> This example will turn out to be not sufficient.  It is a stepping
>> stone in building my mental model.
>> 
>> $ cat board_with_connector.dts
>> /dts-v1/;
>> 
>> / {
>>  #address-cells = < 1 >;
>>  #size-cells = < 1 >;
>> 
>>  tree_1: soc@0 {
>>  reg = <0x0 0x0>;
>> 
>>  spi_1: spi1 {
>>  };
>>  };
>> 
>>  connector_1: connector_1 {
>>  spi1 {
>>  target_phandle = <&spi_1>;
>>  };
>>  };
>> 
>> };
>> 
>> &spi_1 {
>>  ethernet-switch@0 {
>>  compatible = "micrel,ks8995m";
>>  };
>> };
>> 
>> 
>> $ cat spi_codec_overlay_with_connector.dts
>> /dts-v1/;
>> 
>> /plugin/;
>> 
>> &connector_1 {
>>  spi1 {
>>  codec@1 {
>>  compatible = "ti,tlv320aic26";
>>  };
>>  };
>> };
>> 
>> 
>> The result is that the overlay fixup for spi1 on the cape will
>> relocate the spi1 node to /connector_1 in the host tree, so
>> this does not solve the connector linkage yet:
>> 
>> -- chunk from the decompiled board_with_connector.dtb:
>> 
>>  __symbols__ {
>>  connector_1 = "/connector_1";
>>  };
>> 
>> -- chunk from the decompiled spi_codec_overlay_with_connector.

Re: [PATCH] qla2xxx: Fix NULL pointer deref in QLA interrupt

2016-07-08 Thread Thorsten Leemhuis

Bruno Prémont wrote on 30.06.2016 17:00:
> In qla24xx_process_response_queue() rsp->msix->cpuid may trigger NULL
> pointer dereference when rsp->msix is NULL:
> […]
> The affected code was introduced by commit 
> cdb898c52d1dfad4b4800b83a58b3fe5d352edde
> (qla2xxx: Add irq affinity notification).
> 
> Only dereference rsp->msix when it has been set so the machine can boot
> fine. Possibly rsp->msix is unset because:
> [3.479679] qla2xxx [:00:00.0]-0005: : QLogic Fibre Channel HBA 
> Driver: 8.07.00.33-k.
> [3.481839] qla2xxx [:13:00.0]-001d: : Found an ISP2432 irq 17 iobase 
> 0xc9038000.
> [3.484081] qla2xxx [:13:00.0]-0035:0: MSI-X; Unsupported ISP2432 
> (0x2, 0x3).
> [3.485804] qla2xxx [:13:00.0]-0037:0: Falling back-to MSI mode -258.
> [3.890145] scsi host0: qla2xxx
> [3.891956] qla2xxx [:13:00.0]-00fb:0: QLogic QLE2460 - PCI-Express 
> Single Channel 4Gb Fibre Channel HBA.
> [3.894207] qla2xxx [:13:00.0]-00fc:0: ISP2432: PCIe (2.5GT/s x4) @ 
> :13:00.0 hdma+ host#=0 fw=7.03.00 (9496).
> [5.714774] qla2xxx [:13:00.0]-500a:0: LOOP UP detected (4 Gbps).

Bruno: Does that mean you actually tested that patch and it fixed the
problem for you? It looks like it, but there is some confusion about it;
that's one of the reasons why this patch didn't get any further yet
afaics, so a quick clarification might help to finally get this fixed
properly in mainline and stable.

Himanshu: While at it: Can you confirm this patch should get merged to
mainline? Seems Quinn is on PTO and his out-of-office reply mentioned
you as one point of contact.

Cheers, your regression tracker for Linux 4.7
 Thorsten

> CC: 
> Signed-off-by: Bruno Prémont 
> ---
> diff --git a/drivers/scsi/qla2xxx/qla_isr.c
> b/drivers/scsi/qla2xxx/qla_isr.c index 5649c20..a92a62d 100644
> --- a/drivers/scsi/qla2xxx/qla_isr.c
> +++ b/drivers/scsi/qla2xxx/qla_isr.c
> @@ -2548,7 +2548,7 @@ void qla24xx_process_response_queue(struct
> scsi_qla_host *vha, if (!vha->flags.online)
>   return;
>  
> - if (rsp->msix->cpuid != smp_processor_id()) {
> + if (rsp->msix && rsp->msix->cpuid != smp_processor_id()) {
>   /* if kernel does not notify qla of IRQ's CPU change,
>* then set it here.
>*/
> 
> http://news.gmane.org/find-root.php?message_id=20160630170032.6dbaf496%40pluto.restena.lu
>  
> http://mid.gmane.org/20160630170032.6dbaf496%40pluto.restena.lu
>

Re: [PATCH 0/2] sched/cputime: Deltas for "replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code"

* Rik van Riel  wrote:

> On Thu, 2016-07-07 at 16:27 +0200, Frederic Weisbecker wrote:
> > Hi Rick,
> > 
> > While reviewing your 2nd patch, I thought about these cleanups.
> > Perhaps
> > the first one could be merged into your patch. I let you decide.
> 
> I'm not convinced we want to merge cleanups and functional
> changes into the same patch, given how convoluted the code
> is/was.
> 
> Both of your patches look good though.
> 
> What tree should they go in through?

-tip I suspect. So my plan was the following, this series of yours:

  [PATCH v3 0/4] sched,time: fix irq time accounting with nohz_idle

... looked almost ready, it looked like as if I could merge v4 once you sent it.

Plus Frederic submitted these two cleanups - looks like I could merge these on 
top 
of your series and have them close to each other in the Git space.

And I do agree that we should keep these cleanups separate and not merge them 
into 
patches that change functionality.

If your series is expected to be risky then we could make things easier to 
handle 
later on if we switched around things and first made low-risk cleanups and then 
any changes/fixes on top - do you think that's necessary in this case?

Thanks,

Ingo

Re: [PATCH v2 1/6] dt-bindings: clock: add DT binding for the Xtal clock on Armada 3700

2016-07-08 Thread Thomas Petazzoni

Hello,

On Fri,  8 Jul 2016 00:37:46 +0200, Gregory CLEMENT wrote:

> +gpio1: gpio@13800 {
> + compatible = "marvell,mvebu-gpio-3700", "syscon", "simple-mfd";

I find this compatible string not very consistent with what we do for
other drivers, it should have been:

marvell,armada-3700-gpio

or something like that.


> + xtalclk: xtal-clk {
> + compatible = "marvell,armada-3700-xtal-clock";

See here for example.

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux, Kernel and Android engineering
http://free-electrons.com

RE: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support

2016-07-08 Thread KEITA KOBAYASHI

Hi,

I tested these patches on Renesas SoC r8a7790(CA15*4 + CA7*4)
and your preview branch[1] on Renesas SoC r8a7795(CA57*4 + CA53*4).

> Test 0:
>   for i in `seq 1 10`; \
>  do sysbench --test=cpu --max-time=3 --num-threads=1 run;
> \
>  done \
>   | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
>  END {print "Average events: " sum/runs}'
> 
> Target: ARM TC2 (2xA15+3xA7)
> 
>   (Higher is better)
> tip:  Average events: 146.9
> patch:Average events: 217.9
> 
Target: Renesas SoC r8a7790(CA15*4 + CA7*4)
  w/  capacity patches: Average events: 200.2
  w/o capacity patches: Average events: 144.4

Target: Renesas SoC r8a7795(CA57*4 + CA53*4)
  w/  capacity patches : 3587.7
  w/o capacity patches : 2327.8

> Test 1:
>   perf stat --null --repeat 10 -- \
>   perf bench sched messaging -g 50 -l 5000
> 
> Target: Intel IVB-EP (2*10*2)
> 
> tip:4.861970420 seconds time elapsed ( +-  1.39% )
> patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> 
> Target: ARM TC2 A7-only (3xA7) (-l 1000)
> 
> tip:61.485682596 seconds time elapsed ( +-  0.07% )
> patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> 
Target: Renesas SoC r8a7790(CA15*4) (-l 1000)
  w/  capacity patches: 38.955532040 seconds time elapsed ( +-  0.12% )
  w/o capacity patches: 39.424945580 seconds time elapsed ( +-  0.10% )

Target: Renesas SoC r8a7795(CA57*4) (-l 1000)
  w/  capacity patches : 29.804292200 seconds time elapsed ( +-  0.37% )
  w/o capacity patches : 29.838826790 seconds time elapsed ( +-  0.40% )

Tested-by: Keita Kobayashi 

[1] git://linux-arm.org/linux-power.git capacity_awareness_v2_arm64_v1

Re: [PATCH v2] kexec: Fix kdump failure with notsc


* Eric W. Biederman  wrote:

> Sigh.  Can we please just do the work to rip out the apic shutdown code from 
> the 
> kexec on panic code path?
> 
> I forgetting details but the only reason we have do any apic shutdown is bugs 
> in 
> older kernels that could not initialize a system properly if we did not shut 
> down the apics.
> 
> I certainly don't see an issue with goofy cases like notsc not working on a 
> crash capture kernel if we are not initializing the hardware properly.
> 
> The strategy really needs to be to only do the absolutely essential hardware 
> shutdown in the crashing kernel, every adintional line of code we execute in 
> the 
> crashing kernel increases our chances of hitting a bug.

Fully agreed.

> Under that policy things like requring we don't pass boot options that 
> inhibit 
> the dump catpure kernel from initializing the hardware from a random state 
> are 
> reasonable requirements.  AKA I don't see any justification in this as to why 
> we 
> would even want to support notsc on the dump capture kernel.  Especially when 
> things clearly work when that option is not specified.

So at least on the surface it appears 'surprising' that the 'notsc' option 
(which, 
supposedly, disables TSC handling) interferes with being able to fully boot. 
Even 
if 'notsc' is specified we are still using the local APIC, right?

So it might be a good idea to find the root cause of this bootup fragility even 
if 
'notsc' is specified. And I fully agree that it should be fixed in the bootup 
path 
of the dump kernel, not the crash kernel reboot path.

Thanks,

Ingo

Re: [RFC PATCH 0/1] Portable Device Tree Connector -- conceptual

2016-07-08 Thread David Gibson

On Fri, Jul 08, 2016 at 10:26:15AM +0300, Pantelis Antoniou wrote:
> Hi David,
> 
> > On Jul 7, 2016, at 10:15 , David Gibson  wrote:
> > 
> > On Sat, Jul 02, 2016 at 04:55:49PM -0700, frowand.l...@gmail.com wrote:
> >> From: Frank Rowand 
> >> 
> >> Hi All,
> >> 
> >> This is version 2 of this email.
> >> 
> >> Changes from version 1:
> >> 
> >>  - some rewording of the text
> >>  - removed new (theoretical) dtc directive "/connector/"
> >>  - added compatibility between mother board and daughter board
> >>  - added info on applying a single .dtbo to different connectors
> >>  - attached an RFC patch showing the required kernel changes
> >>  - changes to mother board .dts connector node:
> >> - removed target_path property
> >> - added connector-socket property
> >>  - changes to daughter board .dts connector node:
> >> - added connector-plug property
> >> 
> >> 
> >> I've been trying to wrap my head around what Pantelis and Rob have written
> >> on the subject of a device tree representation of a connector for a
> >> daughter board to connect to (eg a cape or a shield) and the representation
> >> of the daughter board.  (Or any other physically pluggable object.)
> >> 
> >> After trying to make sense of what had been written (or presented via 
> >> slides
> >> at a conference - thanks Pantelis!), I decided to go back to first 
> >> principals
> >> of what we are trying to accomplish.  I came up with some really simple 
> >> bogus
> >> examples to try to explain what my thought process is.
> >> 
> >> This is an extremely simple example to illustrate the concepts.  It is not
> >> meant to represent the complexity of a real board.
> >> 
> >> To start with, assume that the device that will eventually be on a daughter
> >> board is first soldered onto the mother board.  The mother board contains
> >> two devices connected via bus spi_1.  One device is described in the .dts
> >> file, the other is described in an included .dtsi file.
> >> Then the device tree files will look like:
> >> 
> >> $ cat board.dts
> >> /dts-v1/;
> >> 
> >> / {
> >>#address-cells = < 1 >;
> >>#size-cells = < 1 >;
> >> 
> >>tree_1: soc@0 {
> >>reg = <0x0 0x0>;
> >> 
> >>spi_1: spi1 {
> >>};
> >>};
> >> 
> >> };
> >> 
> >> &spi_1 {
> >>ethernet-switch@0 {
> >>compatible = "micrel,ks8995m";
> >>};
> >> };
> >> 
> >> #include "spi_codec.dtsi"
> >> 
> >> 
> >> $ cat spi_codec.dtsi
> >> &spi_1 {
> >>codec@1 {
> >>compatible = "ti,tlv320aic26";
> >>};
> >> };
> >> 
> >> 
> >> #- codec chip on cape
> >> 
> >> Then suppose I move the codec chip to a cape.  Then I will have the same
> >> exact .dts and .dtsi and everything still works.
> >> 
> >> 
> >> @- codec chip on cape, overlay
> >> 
> >> If I want to use overlays, I only have to add the version and "/plugin/",
> >> then use the '-@' flag for dtc (both for the previous board.dts and
> >> this spi_codec_overlay.dts):
> >> 
> >> $ cat spi_codec_overlay.dts
> >> /dts-v1/;
> >> 
> >> /plugin/;
> >> 
> >> &spi_1 {
> >>codec@1 {
> >>compatible = "ti,tlv320aic26";
> >>};
> >> };
> >> 
> >> 
> >> Pantelis pointed out that the syntax has changed to be:
> >>   /dts-v1/ /plugin/;
> >> 
> >> 
> >> #- codec chip on cape, overlay, connector
> >> 
> >> Now we move into the realm of connectors.  My mental model of what the
> >> hardware and driver look like has not changed.  The only thing that has
> >> changed is that I want to be able to specify that the connector that
> >> the cape is plugged into has some pins that are the spi bus /soc/spi1.
> >> 
> >> The following _almost_ but not quite gets me what I want.  Note that
> >> the only thing the connector node does is provide some kind of
> >> pointer or reference to what node(s) are physically routed through
> >> the connector.  The connector node does not need to describe the pins;
> >> it only has to point to the node that describes the pins.
> >> 
> >> This example will turn out to be not sufficient.  It is a stepping
> >> stone in building my mental model.
> >> 
> >> $ cat board_with_connector.dts
> >> /dts-v1/;
> >> 
> >> / {
> >>#address-cells = < 1 >;
> >>#size-cells = < 1 >;
> >> 
> >>tree_1: soc@0 {
> >>reg = <0x0 0x0>;
> >> 
> >>spi_1: spi1 {
> >>};
> >>};
> >> 
> >>connector_1: connector_1 {
> >>spi1 {
> >>target_phandle = <&spi_1>;
> >>};
> >>};
> >> 
> >> };
> >> 
> >> &spi_1 {
> >>ethernet-switch@0 {
> >>compatible = "micrel,ks8995m";
> >>};
> >> };
> >> 
> >> 
> >> $ cat spi_codec_overlay_with_connector.dts
> >> /dts-v1/;
> >> 
> >> /plugin/;
> >> 
> >> &connector_1 {
> >>spi1 {
> >>codec@1 {
> >>compatible = "ti,tlv320aic26";
> >>};
> >>};
> >> };
> >> 
> >> 
> >> The result is that the over

[PATCH v15 net-next 0/1] introduce Hyper-V VM Sockets(hv_sock)

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by
introducing a new socket address family AF_HYPERV.

You can also get the patch by:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160708_v15

Note: the VMBus driver side's supporting patches have been in the mainline
tree.

I know the kernel has already had a VM Sockets driver (AF_VSOCK) based
on VMware VMCI (net/vmw_vsock/, drivers/misc/vmw_vmci), and KVM is
proposing AF_VSOCK of virtio version:
http://marc.info/?l=linux-netdev&m=145952064004765&w=2

However, though Hyper-V Sockets may seem conceptually similar to
AF_VOSCK, there are differences in the transportation layer, and IMO these
make the direct code reusing impractical:

1. In AF_VSOCK, the endpoint type is: , but in
AF_HYPERV, the endpoint type is: . Here GUID
is 128-bit.

2. AF_VSOCK supports SOCK_DGRAM, while AF_HYPERV doesn't.

3. AF_VSOCK supports some special sock opts, like SO_VM_SOCKETS_BUFFER_SIZE,
SO_VM_SOCKETS_BUFFER_MIN/MAX_SIZE and SO_VM_SOCKETS_CONNECT_TIMEOUT.
These are meaningless to AF_HYPERV.

4. Some AF_VSOCK's VMCI transportation ops are meanless to AF_HYPERV/VMBus,
like .notify_recv_init
.notify_recv_pre_block
.notify_recv_pre_dequeue
.notify_recv_post_dequeue
.notify_send_init
.notify_send_pre_block
.notify_send_pre_enqueue
.notify_send_post_enqueue
etc.

So I think we'd better introduce a new address family: AF_HYPERV.

Please review the patch.

Looking forward to your comments, especially comments from David. :-)

Changes since v1:
- updated "[PATCH 6/7] hvsock: introduce Hyper-V VM Sockets feature"
- added __init and __exit for the module init/exit functions
- net/hv_sock/Kconfig: "default m" -> "default m if HYPERV"
- MODULE_LICENSE: "Dual MIT/GPL" -> "Dual BSD/GPL"

Changes since v2:
- fixed various coding issue pointed out by David Miller
- fixed indentation issues
- removed pr_debug in net/hv_sock/af_hvsock.c
- used reverse-Chrismas-tree style for local variables.
- EXPORT_SYMBOL -> EXPORT_SYMBOL_GPL

Changes since v3:
- fixed a few coding issue pointed by Vitaly Kuznetsov and Dan Carpenter
- fixed the ret value in vmbus_recvpacket_hvsock on error
- fixed the style of multi-line comment: vmbus_get_hvsock_rw_status()

Changes since v4 (https://lkml.org/lkml/2015/7/28/404):
- addressed all the comments about V4.
- treat the hvsock offers/channels as special VMBus devices
- add a mechanism to pass hvsock events to the hvsock driver
- fixed some corner cases with proper locking when a connection is closed
- rebased to the latest Greg's tree

Changes since v5 (https://lkml.org/lkml/2015/12/24/103):
- addressed the coding style issues (Vitaly Kuznetsov & David Miller, thanks!)
- used a better coding for the per-channel rescind callback (Thank Vitaly!)
- avoided the introduction of new VMBUS driver APIs vmbus_sendpacket_hvsock()
and vmbus_recvpacket_hvsock() and used vmbus_sendpacket()/vmbus_recvpacket()
in the higher level (i.e., the vmsock driver). Thank Vitaly!

Changes since v6 (http://lkml.iu.edu/hypermail/linux/kernel/1601.3/01813.html)
- only a few minor changes of coding style and comments

Changes since v7
- a few minor changes of coding style: thanks, Joe Perches!
- added some lines of comments about GUID/UUID before the struct sockaddr_hv.

Changes since v8
- removed the unnecessary __packed for some definitions: thanks, David!
- hvsock_open_connection: use offer.u.pipe.user_def[0] to know the connection
and reorganized the function
direction
- reorganized the code according to suggestions from Cathy Avery: split big
functions into small ones, set .setsockopt and getsockopt to
sock_no_setsockopt/sock_no_getsockopt
- inline'd some small list helper functions

Changes since v9
- minimized struct hvsock_sock by making the send/recv buffers pointers.
the buffers are allocated by kmalloc() in __hvsock_create() now.
- minimized the sizes of the send/recv buffers and the vmbus ringbuffers.

Changes since v10

1) add module params: send_ring_page, recv_ring_page. They can be used to
enlarge the ringbuffer size to get better performance, e.g.,
# modprobe hv_sock recv_ring_page=16 send_ring_page=16
By default, recv_ring_page is 3 and send_ring_page is 2.

2) add module param max_socket_number (the default is 1024).
A user can enlarge the number to create more than 1024 hv_sock sockets.
By default, 1024 sockets take about 1024 * (3+2+1+1) * 4KB = 28M bytes.
(H

[PATCH v15 net-next 1/1] hv_sock: introduce Hyper-V Sockets

Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transportation layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications between the host and the guest can talk
to each other directly by the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support in the guest side by introducing
a new socket address family AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
---

You can also get the patch here (2764221d):
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160708_v15

For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31

In v12, the changes are mainly the following:

1) remove the module params as David suggested.

2) use 5 exact pages for VMBus send/recv rings, respectively.
The host side's design of the feature requires 5 exact pages for recv/send
rings respectively -- this is suboptimal considering memory consumption,
however unluckily we have to live with it, before the host comes up with
a new design in the future. :-(

3) remove the per-connection static send/recv buffers
Instead, we allocate and free the buffers dynamically only when we recv/send
data. This means: when a connection is idle, no memory is consumed as
recv/send buffers at all.

In v13:
I return ENOMEM on buffer alllocation failure

   Actually "man read/write" says "Other errors may occur, depending on the
object connected to fd". "man send/recv" indeed lists ENOMEM.
   Considering AF_HYPERV is a new socket type, ENOMEM seems OK here.
   In the long run, I think we should add a new API in the VMBus driver,
allowing data copy from VMBus ringbuffer into user mode buffer directly.
This way, we can even eliminate this temporary buffer.

In v14:
fix some coding style issues pointed out by David.

In v15:
Just some stylistic changes addressing comments from Joe Perches and
Olaf Hering -- thank you!
- add a GPL blurb.
- define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE
- change sk_to_hvsock/hvsock_to_sk() from macros to inline functions
- remove a not-very-useful pr_err()
- fix some typos in comment and coding style issues.

Looking forward to your comments!

 MAINTAINERS |2 +
 include/linux/hyperv.h  |   13 +
 include/linux/socket.h  |4 +-
 include/net/af_hvsock.h |   78 +++
 include/uapi/linux/hyperv.h |   24 +
 net/Kconfig |1 +
 net/Makefile|1 +
 net/hv_sock/Kconfig |   10 +
 net/hv_sock/Makefile|3 +
 net/hv_sock/af_hvsock.c | 1523 +++
 10 files changed, 1658 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 50f69ba..6eaa26f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 50f493e..1cda6ea5 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1508,5 +1508,18 @@ static inline void commit_rd_index(struct vmbus_channel 
*channel)
vmbus_set_event(channel);
 }
 
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
 
+#define HVSOCK_PKT_LEN(payload_len)(HVSOCK_HEADER_LEN + \
+   ALIGN((payload_len), 8) + \
+   PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,9 @@ struct ucred {
 #define AF_VSOCK   40  /* vSockets */
 #define AF_KCM 41  /* Kernel Connection Multiplexor*/
 #define AF_QIPCRTR 42  /* Qualcomm IPC Router  */
+#define AF_HYPERV  43  /* Hyper-V Sockets  */
 
-#define AF_MAX 43  /* For now.. */
+#define AF_MAX 44  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -251,6 +252,7 @@ struct ucred {
 #define PF_VS

[PATCH v2] pmem: add pmem support codes on ARM64

2016-07-08 Thread Kwangwoo Lee

v2)
rewrite functions under the mapping information MEMREMAP_WB.
rewrite the comments for ARM64 in pmem.h
add __clean_dcache_area() to clean data in PoC.

v1)
The PMEM driver on top of NVDIMM(Non-Volatile DIMM) has already been
supported on X86_64 and there exist several ARM64 platforms which support
DIMM type memories.

This patch set enables the PMEM driver on ARM64 (AArch64) architecture
on top of NVDIMM. While developing this patch set, QEMU 2.6.50 with NVDIMM
emulation for ARM64 has also been developed and tested on it.

$ dmesg
[0.00] Booting Linux on physical CPU 0x0
[0.00] Linux version 4.6.0-rc2kw-dirty (kwangwoo@VBox15) (gcc version 
5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu1) ) #10 SMP Tue Jul 5 11:30:33 KST 2016
[0.00] Boot CPU: AArch64 Processor [411fd070]
[0.00] efi: Getting EFI parameters from FDT:
[0.00] EFI v2.60 by EDK II
[0.00] efi:  SMBIOS 3.0=0x5871  ACPI 2.0=0x589b
[0.00] ACPI: Early table checksum verification disabled
[0.00] ACPI: RSDP 0x589B 24 (v02 BOCHS )
[0.00] ACPI: XSDT 0x589A 5C (v01 BOCHS  BXPCFACP 
0001  0113)
[0.00] ACPI: FACP 0x5862 00010C (v05 BOCHS  BXPCFACP 
0001 BXPC 0001)
[0.00] ACPI: DSDT 0x5863 00108F (v02 BOCHS  BXPCDSDT 
0001 BXPC 0001)
[0.00] ACPI: APIC 0x5861 A8 (v03 BOCHS  BXPCAPIC 
0001 BXPC 0001)
[0.00] ACPI: GTDT 0x5860 60 (v02 BOCHS  BXPCGTDT 
0001 BXPC 0001)
[0.00] ACPI: MCFG 0x585F 3C (v01 BOCHS  BXPCMCFG 
0001 BXPC 0001)
[0.00] ACPI: SPCR 0x585E 50 (v02 BOCHS  BXPCSPCR 
0001 BXPC 0001)
[0.00] ACPI: NFIT 0x585D E0 (v01 BOCHS  BXPCNFIT 
0001 BXPC 0001)
[0.00] ACPI: SSDT 0x585C 000131 (v01 BOCHS  NVDIMM 0001 
BXPC 0001)
...
[5.386743] pmem0: detected capacity change from 0 to 1073741824
...
[  531.952466] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your 
own risk
[  531.961073] EXT4-fs (pmem0): mounted filesystem with ordered data mode. 
Opts: dax

$ mount
rootfs on / type rootfs (rw,size=206300k,nr_inodes=51575)
...
/dev/pmem0 on /mnt/mem type ext4 (rw,relatime,dax,data=ordered)

$ df -h
FilesystemSize  Used Available Use% Mounted on
...
/dev/pmem0  975.9M  1.3M907.4M   0% /mnt/mem

Signed-off-by: Kwangwoo Lee 
---
 arch/arm64/Kconfig  |   2 +
 arch/arm64/include/asm/cacheflush.h |   3 +
 arch/arm64/include/asm/pmem.h   | 151 
 arch/arm64/mm/cache.S   |  18 +
 4 files changed, 174 insertions(+)
 create mode 100644 arch/arm64/include/asm/pmem.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 4f43622..ee1d679 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -15,6 +15,8 @@ config ARM64
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
select ARCH_WANT_FRAME_POINTERS
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_PMEM_API
+   select ARCH_HAS_MMIO_FLUSH
select ARM_AMBA
select ARM_ARCH_TIMER
select ARM_GIC
diff --git a/arch/arm64/include/asm/cacheflush.h 
b/arch/arm64/include/asm/cacheflush.h
index c64268d..fba18e4 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -68,6 +68,7 @@
 extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, 
unsigned long end);
 extern void flush_icache_range(unsigned long start, unsigned long end);
 extern void __flush_dcache_area(void *addr, size_t len);
+extern void __clean_dcache_area(void *addr, size_t len);
 extern void __clean_dcache_area_pou(void *addr, size_t len);
 extern long __flush_cache_user_range(unsigned long start, unsigned long end);
 
@@ -133,6 +134,8 @@ static inline void __flush_icache_all(void)
  */
 #define flush_icache_page(vma,page)do { } while (0)
 
+#define mmio_flush_range(addr, size)   __flush_dcache_area(addr, size)
+
 /*
  * Not required on AArch64 (PIPT or VIPT non-aliasing D-cache).
  */
diff --git a/arch/arm64/include/asm/pmem.h b/arch/arm64/include/asm/pmem.h
new file mode 100644
index 000..7504b2b
--- /dev/null
+++ b/arch/arm64/include/asm/pmem.h
@@ -0,0 +1,151 @@
+/*
+ * Based on arch/x86/include/asm/pmem.h
+ *
+ * Copyright(c) 2016 SK hynix Inc. Kwangwoo Lee 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __ASM_PMEM_H__
+#define __ASM_PMEM_H__
+
+#i

Re: [RFC PATCH 3/3] perf: util: only open events on CPUs an evsel permits

2016-07-08 Thread Jiri Olsa

On Thu, Jul 07, 2016 at 05:04:34PM +0100, Mark Rutland wrote:
> In systems with heterogeneous CPU PMUs, it's possible for each evsel to
> cover a distinct set of CPUs, and hence the cpu_map associated with each
> evsel may have a distinct idx<->id mapping. Any of these may be distinct from
> the evlist's cpu map.
> 
> Events can be tied to the same fd so long as they use the same per-cpu
> ringbuffer (i.e. so long as they are on the same CPU). To acquire the
> correct FDs, we must compare the Linux logical IDs rather than the evsel
> or evlist indices.
> 
> This path adds logic to perf_evlist__mmap_per_evsel to handle this,
> translating IDs as required. As PMUs may cover a subset of CPUs from the
> evlist, we skip the CPUs a PMU cannot handle.
> 
> Signed-off-by: Mark Rutland 
> Cc: Adrian Hunter 
> Cc: Alexander Shishkin 
> Cc: Arnaldo Carvalho de Melo 
> Cc: He Kuang 
> Cc: Ingo Molnar 
> Cc: Jiri Olsa 
> Cc: Peter Zijlstra 
> Cc: Wang Nan 
> Cc: linux-kernel@vger.kernel.org
> ---
>  tools/perf/util/evlist.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index e82ba90..0b5b1be 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -984,17 +984,24 @@ static int __perf_evlist__mmap(struct perf_evlist 
> *evlist, int idx,
>  }
>  
>  static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
> -struct mmap_params *mp, int cpu,
> +struct mmap_params *mp, int cpu_idx,
>  int thread, int *output)
>  {
>   struct perf_evsel *evsel;
> + int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
>  
>   evlist__for_each(evlist, evsel) {
>   int fd;
> + int cpu;
>  
>   if (evsel->system_wide && thread)
>   continue;
>  
> + if (!cpu_map__has(evsel->cpus, evlist_cpu))
> + continue;
> +
> + cpu = cpu_map__idx(evsel->cpus, evlist_cpu);

you basicaly call cpu_map__idx twice in here,
I think it might be better call it just once
and check the cpu for -1

jirka

Re: [PATCH v2 2/3] nvme: implement DMA_ATTR_NO_WARN

2016-07-08 Thread Masayoshi Mizuma



On Thu, 7 Jul 2016 09:45:08 -0300 Mauricio Faria De Oliveira wrote:

Use the DMA_ATTR_NO_WARN attribute on dma_map_sg() calls of nvme driver.

Signed-off-by: Mauricio Faria de Oliveira 
Reviewed-by: Gabriel Krisman Bertazi 
---
Changelog:
  v2:
   - address warnings from checkpatch.pl (line wrapping and typos)

  drivers/nvme/host/pci.c | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d1a8259..a7ccad8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -65,6 +66,8 @@ MODULE_PARM_DESC(use_cmb_sqes, "use controller's memory buffer for 
I/O SQes");

  static struct workqueue_struct *nvme_workq;

+static DEFINE_DMA_ATTRS(nvme_dma_attrs);
+
  struct nvme_dev;
  struct nvme_queue;

@@ -498,7 +501,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct 
request *req,
goto out;

ret = BLK_MQ_RQ_QUEUE_BUSY;
-   if (!dma_map_sg(dev->dev, iod->sg, iod->nents, dma_dir))
+   if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
+   &nvme_dma_attrs))


This change is OK because the return value of nvme_map_data() is
BLK_MQ_RQ_QUEUE_BUSY, so the IO will be requeued.


goto out;

if (!nvme_setup_prps(dev, req, size))
@@ -516,7 +520,8 @@ static int nvme_map_data(struct nvme_dev *dev, struct 
request *req,
if (rq_data_dir(req))
nvme_dif_remap(req, nvme_dif_prep);

-   if (!dma_map_sg(dev->dev, &iod->meta_sg, 1, dma_dir))
+   if (!dma_map_sg_attrs(dev->dev, &iod->meta_sg, 1, dma_dir,
+   &nvme_dma_attrs))


Here, I think the error messages should not be suppressed because
the return value of nvme_map_data() is BLK_MQ_RQ_QUEUE_ERROR, so
the IO returns as -EIO.

- Masayoshi Mizuma


goto out_unmap;
}

@@ -2118,6 +2123,9 @@ static int __init nvme_init(void)
result = pci_register_driver(&nvme_driver);
if (result)
destroy_workqueue(nvme_workq);
+
+   dma_set_attr(DMA_ATTR_NO_WARN, &nvme_dma_attrs);
+
return result;
  }

Re: [CRIU] Introspecting userns relationships to other namespaces?

2016-07-08 Thread Eric W. Biederman

Andrew Vagin  writes:

> On Wed, Jul 06, 2016 at 10:46:33AM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Wed, Jul 06, 2016 at 10:41:48AM +0200, Michael Kerrisk (man-pages) 
>> > wrote:
>> >> [Rats! Doing now what I should have down to start with. Looping some
>> >> lists and CRIU and other possibly relevant people into this
>> >> conversation]
>> >> 
>> >> Hi Eric,
>> >> 
>> >> On 5 July 2016 at 23:47, Eric W. Biederman  wrote:
>> >> > "Michael Kerrisk (man-pages)"  writes:
>> >> >
>> >> >> Hi Eric,
>> >> >>
>> >> >> I have a question. Is there any way currently to discover which
>> >> >> user namespace a particular nonuser namespace is governed by?
>> >> >> Maybe I am missing something, but there does not seem to be a
>> >> >> way to do this. Also, can one discover which userns is the
>> >> >> parent of a given userns? Again, I can't see a way to do this.
>> >> >>
>> >> >> The point here is introspecting so that a process might determine
>> >> >> what its capabilities are when operating on some resource governed
>> >> >> by a (nonuser) namespace.
>> >> >
>> >> > To the best of my knowledge that there is not an interface to get that
>> >> > information.  It would be good to have such an interface for no other
>> >> > reason than the CRIU folks are going to need it at some point.  I am a
>> >> > bit surprised they have not complained yet.
>> >
>> > I don't think they need it.  They do in fact have what they need.  Assume
>> > you have tasks T1, T2, T1_1 and T2_1;  T1 and T2 are in init_user_ns;  T1
>> > spawned T1_1 in a new userns;  T2 spawned T2_1 which setns()d to T1_1's ns.
>> > There's some {handwave} uid mapping, does not matter.
>> >
>> > At restart, it doesn't matter which task originally created the new userns.
>> > criu knows T1_1 and T2_1 are in the same userns;  it creates the userns, 
>> > sets
>> > up the mapping, and T1_1 and T2_1 setns() to it.
>> 
>> Given that the simple cases are so easy it probably doesn't matter in
>> that sense.
>> 
>> However we now have the case where user namespaces own pid namespaces,
>> and uts namespaces, and network namespaces, and ipc namespaces, and
>> filesystems.  Throw in some mount propagation and use of setns and
>> things could get confusing.   It is something that will need to be
>> figured out if CRIU is going to properly checkpoint containers
>> containing containers containing containers containing containers.
>
> It isn't a joke:). We have a few requests to support CR of containers with
> Docker containers inside. And we are going to start this task in a near
> future, so we would like to have interface to get dependencies between
> namespaces too.
>
> BTW: CRIU already supports nested mount namespaces, because systemd
> creates them for services.

The tricky part about this and what messes up James proposed plan is
that the interface needs to be something that returns a namespace file
descriptor.  So we can't print something out in a simple text file.
Well I suppose we could print an device number and inode number pair.
But then someone would still have to scour processes looking for a user
namespace so that is likely less than ideal.

Starting with 4.8 we are also going to need to be able to retrieve the
user namespace owner of filesystems.  That will be an interesting mix.

Eric

Re: perf bpf examples

2016-07-08 Thread Brendan Gregg

On Thu, Jul 7, 2016 at 9:18 PM, Wangnan (F)  wrote:
>
>
> On 2016/7/8 1:58, Brendan Gregg wrote:
>>
>> On Thu, Jul 7, 2016 at 10:54 AM, Brendan Gregg
>>  wrote:
>>>
>>> On Wed, Jul 6, 2016 at 6:49 PM, Wangnan (F)  wrote:
[...]
>> ... Also, has anyone looked into perf sampling (-F 99) with bpf yet?
>> Thanks,
>
>
> Theoretically, BPF program is an additional filter to
> decide whetier an event should be filtered out or pass to perf. -F 99
> is another filter, which drops samples to ensure the frequence.
> Filters works together. The full graph should be:
>
>  BPF --> traditional filter --> proc (system wide of proc specific) -->
> period
>
> See the example at the end of this mail. The BPF program returns 0 for half
> of
> the events, and the result should be symmetrical. We can get similar result
> without
> -F:
>
> # ~/perf record -a --clang-opt '-DCATCH_ODD' -e ./sampling.c dd if=/dev/zero
> of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 11.9908 s, 358 MB/s
> [ perf record: Woken up 28 times to write data ]
> [ perf record: Captured and wrote 303.915 MB perf.data (4194449 samples) ]
> #
> root@wn-Lenovo-Product:~# ~/perf record -a --clang-opt '-DCATCH_EVEN' -e
> ./sampling.c dd if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 12.1154 s, 355 MB/s
> [ perf record: Woken up 54 times to write data ]
> [ perf record: Captured and wrote 303.933 MB perf.data (4194347 samples) ]
>
>
> With -F99 added:
>
> # ~/perf record -F99 -a --clang-opt '-DCATCH_ODD' -e ./sampling.c dd
> if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 9.60126 s, 447 MB/s
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.402 MB perf.data (35 samples) ]
> # ~/perf record -F99 -a --clang-opt '-DCATCH_EVEN' -e ./sampling.c dd
> if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 9.76719 s, 440 MB/s
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.399 MB perf.data (37 samples) ]

That looks like it's doing two different things: -F99, and a
sampling.c script (SEC("func=sys_read")).

I mean just an -F99 that executes a BPF program on each sample. My
most common use for perf is:

perf record -F 99 -a -g -- sleep 30
perf report (or perf script, for making flame graphs)

But this uses perf.data as an intermediate file. With the recent
BPF_MAP_TYPE_STACK_TRACE, we could frequency count stack traces in
kernel context, and just dump a report. Much more efficient. And
improving a very common perf one-liner.

Brendan

Re: [PATCH] make WRITE_ONCE return void

2016-07-08 Thread Peter Zijlstra

On Fri, Jul 08, 2016 at 01:20:08AM +0300, Alexey Dobriyan wrote:
> Currently WRITE_ONCE is used as if it returns void. Let's codify this
> before somebody tries to be smarter than necessary.
> 
> Signed-off-by: Alexey Dobriyan 
> ---
> 
>  include/linux/compiler.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -301,7 +301,7 @@ static __always_inline void __write_once_size(volatile 
> void *p, void *res, int s
>   union { typeof(x) __val; char __c[1]; } __u =   \
>   { .__val = (__force typeof(x)) (val) }; \
>   __write_once_size(&(x), __u.__c, sizeof(x));\
> - __u.__val;  \
> + (void)0;\
>  })

Why then still use the statement expression? Would it not make more
sense to change it into the regular do { } while (0) form if you want to
remove the return semantics?

Re: [dm-devel] [RFC] block: fix blk_queue_split() resource exhaustion

2016-07-08 Thread Lars Ellenberg

On Fri, Jul 08, 2016 at 08:07:52AM +1000, NeilBrown wrote:
> Before I introduced the recursion limiting, requests were handled as an
> in-order tree walk.  The code I wrote tried to preserve that but didn't
> for several reasons.  I think we need to restore the original in-order
> walk because it makes the most sense.
> So after processing a particular bio, we should then process all the
> 'child' bios - bios send to underlying devices.  Then the 'sibling'
> bios, that were split off, and then any remaining parents and ancestors.
> 
> You patch created the right structures for doing this, my proposal took
> it a step closer, but now after more careful analysis I don't think it
> is quite right.
> With my previous proposal (and you latest patch - thanks!) requests for
> "this" level are stacked, but they should be queued.
> If a make_request_fn only ever submits one request for this level and
> zero or more lower levels, then the difference between a queue and a
> stack is irrelevant.  If it submited more that one, a stack would cause
> them to be handled in the reverse order.

We have a device stack.
q_this_level->make_request_fn() cannot possibly submit anything
on "this_level", or it would create a device loop, I think.

So we start with the initial, "top most" call to generic_make_request().
That is one single bio. All queues are empty.

This bio is then passed on to its destination queue make_request_fn().

Which may chose to split it (via blk_queue_split, or like dm does, or
else). If it, like blk_queue_split() does, splits it into
"piece-I-can-handle-now" and "remainder", both still targeted at the
top most (current) queue, I think the "remainder" should just be pushed
back, which will make it look as if upper layers did
generic_make_request("piece-I-can-handle-now");
generic_make_request("remainder");
Which I do, by using bio_list_add_head(remainder, bio); (*head*).

I don't see any other way for a make_request_fn(bio(l=x)) to generate
"sibling" bios to the same level (l=x) as its own argument.

This same q(l=x)->make_request_fn(bio(l=x)) may now call
generic_make_request() for zero or more "child" bios (l=x+1),
which are queued in order: bio_list_add(recursion, bio); (*tail*).
Then, once l=x returns, the queue generated by it is spliced
in front of the "remainder" (*head*).
All bios are processed in the order they have been queued,
by peeling off of the head.

After all "child" bios of level l>=x+1 have been processed,
the next bio to be processed will be the "pushed back" remainder.

All "Natural order".

> To make the patch "perfect", and maybe even more elegant we could treat
> ->remainder and ->recursion more alike.
> i.e.:
>   - generic make request has a private "stack" of requests.
>   - before calling ->make_request_fn(), both ->remainder and ->recursion
> are initialised
>   - after ->make_request_fn(), ->remainder are spliced in to top of
> 'stack', then ->recursion is spliced onto that.
>   - If stack is not empty, the top request is popped and we loop to top.
> 
> This reliably follows in-order execution, and handles siblings correctly
> (in submitted order) if/when a request splits off multiple siblings.

The only splitting that creates siblings on the current level
is blk_queue_split(), which splits the current bio into
"front piece" and "remainder", already processed in this order.

Anything else creating "siblings" is not creating siblings for the
current layer, but for the next deeper layer, which are queue on
"recursion" and also processed in the order they have been generated.

> I think that as long a requests are submitted in the order they are
> created at each level there is no reason to expect performance
> regressions.
> All we are doing is changing the ordering between requests generated at
> different levels, and I think we are restoring a more natural order.

I believe both patches combined are doing exactly this already.
I could rename .remainder to .todo or .incoming, though.

.incoming = [ bio(l=0) ]
.recursion = []

split

.incoming = [ bio(l=0,now_1), bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming

.incoming = [ bio(l=0,remainder_1) ]
.recursion = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ... ]

merge_head

.incoming = [ bio(l=1,a), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = []

process head of .incoming, potentially split first

.incoming = [ bio(l=1,a,now), bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
...
.incoming = [ bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = [ bio(l=2,aa), bio(l=2,ab), ... ]

merge_head

.incoming = [ bio(l=2,aa), bio(l=2,ab), ...,
bio(l=1,a,remainder), bio(l=1,b), bio(l=1,c), ...,
bio(l=0,remainder_1) ]
.recursion = []

...

process away ... until back at l=0

.incoming = [ bio(l=0,remainder_1) ]
.recursion = []

potentially split fu

linux-next: Tree for Jul 8

2016-07-08 Thread Stephen Rothwell

fixes/for-backlight-fixes (68feaca0b13e backlight: pwm: 
Handle EPROBE_DEFER while requesting the PWM)
Merging ftrace-fixes/for-next-urgent (6224beb12e19 tracing: Have branch tracer 
use recursive field of task struct)
Merging mfd-fixes/for-mfd-fixes (5baaf3b9efe1 usb: dwc3: st: Use explicit 
reset_control_get_exclusive() API)
Merging drm-intel-fixes/for-linux-next-fixes (cab103274352 drm/i915: Fix 
missing unlock on error in i915_ppgtt_info())
Merging asm-generic/master (b0da6d44157a asm-generic: Drop renameat syscall 
from default list)
Merging arc/for-next (9bd54517ee86 arc: unwind: warn only once if DW2_UNWIND is 
disabled)
Merging arm/for-next (e69089ce7984 Merge branches 'component', 'fixes' and 
'misc' into for-next)
Merging arm-perf/for-next/perf (1a695a905c18 Linux 4.7-rc1)
Merging arm-soc/for-next (cdd641aa1c2d Merge branch 'next/cleanup' into 
for-next)
Merging amlogic/for-next (4d9b3db03bd4 Merge remote-tracking branch 
'clk/clk-s905' into tmp/aml-rebuild)
Merging at91/at91-next (0f59c948faed Merge tag 'at91-ab-4.8-defconfig' of 
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux into at91-next)
Merging bcm2835/for-next (aa5c0a1e15c2 Merge branch anholt/bcm2835-dt-64-next 
into for-next)
Merging berlin/berlin/for-next (8ea93fef7612 Merge branch 'berlin64/dt' into 
berlin/for-next)
Merging cortex-m/for-next (f719a0d6a854 ARM: efm32: switch to vendor,device 
compatible strings)
Merging imx-mxs/for-next (63a404a3f177 Merge branch 'imx/defconfig' into 
for-next)
Merging keystone/next (eef6bb9fc17a Merge branch 'for_4.8/keystone' into next)
Merging mvebu/for-next (26b5a342ead6 Merge branch 'mvebu/defconfig64' into 
mvebu/for-next)
Merging omap/for-next (312cd3b1ce54 Merge branch 'omap-for-v4.8/soc' into 
for-next)
Merging omap-pending/for-next (c20c8f750d9f ARM: OMAP2+: hwmod: fix _idle() 
hwmod state sanity check sequence)
Merging qcom/for-next (2083e2852282 firmware: qcom: scm: Change initcall to 
subsys)
Merging renesas/next (a48de99a74c6 Merge branches 
'heads/arm64-defconfig-for-v4.8', 'heads/arm64-dt-for-v4.8', 
'heads/defconfig-for-v4.8', 'heads/dt-for-v4.8' and 'heads/soc-for-v4.8' into 
next)
Merging rockchip/for-next (a37b05423406 Merge branch 'v4.8-clk/next' into 
for-next)
Merging rpi/for-rpi-next (bc0195aad0da Linux 4.2-rc2)
Merging samsung/for-next (92e963f50fc7 Linux 4.5-rc1)
Merging samsung-krzk/for-next (d11d19d93a01 Merge branch 'next/defconfig64' 
into for-next)
CONFLICT (content): Merge conflict in arch/arm/boot/dts/exynos5420.dtsi
Merging tegra/for-next (34db2df13ab7 Merge branch for-4.8/arm64 into for-next)
Merging arm64/for-next/core (40f87d3114b8 arm64: mm: fold init_pgd() into 
__create_pgd_mapping())
Merging blackfin/for-linus (391e74a51ea2 eth: bf609 eth clock: add pclk clock 
for stmmac driver probe)
CONFLICT (content): Merge conflict in arch/blackfin/mach-common/pm.c
Merging c6x/for-linux-next (ca3060d39ae7 c6x: Use generic clkdev.h header)
Merging cris/for-next (f9f3f864b5e8 cris: Fix section mismatches in 
architecture startup code)
Merging h8300/h8300-next (58c57526711f h8300: Add missing include file to 
asm/io.h)
Merging hexagon/linux-next (02cc2ccfe771 Revert "Hexagon: fix signal.c compile 
error")
Merging ia64/next (70f4f9352317 ia64: efi: use timespec64 for persistent clock)
Merging m68k/for-next (86a8280a7fe0 m68k: Assorted spelling fixes)
Merging m68knommu/for-next (33688abb2802 Linux 4.7-rc4)
Merging metag/for-next (592ddeeff8cb metag: Fix typos)
Merging microblaze/next (52e9e6e05617 microblaze: pci: export isa_io_base to 
fix link errors)
Merging mips/mips-for-linux-next (d117d8edaf68 Merge branch '4.7-fixes' into 
mips-for-linux-next)
Merging nios2/for-next (9fa78f63a892 nios2: Add order-only DTC dependency to 
%.dtb target)
Merging parisc-hd/for-next (5975b2c0c10a Merge branch 'parisc-4.7-2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux)
Merging powerpc/next (656ad58ef19e powerpc/boot: Add OPAL console to epapr 
wrappers)
CONFLICT (content): Merge conflict in arch/powerpc/Kconfig
Merging powerpc-mpe/next (bc0195aad0da Linux 4.2-rc2)
Merging fsl/next (1eef33bec12d powerpc/86xx: Fix PCI interrupt map definition)
Merging mpc5xxx/next (39e69f55f857 powerpc: Introduce the use of the managed 
version of kzalloc)
Merging s390/features (d08de8e2d867 s390/mm: add support for 2GB hugepages)
Merging sparc-next/master (9f935675d41a Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input)
Merging tile/master (893d66192c46 tile: support gcc 7 optimization to use 
__multi3)
Merging uml/linux-next (a78ff1112263 um: add extended processor state 
save/restore support)
Merging unicore32/unicore32 (c83d8b2fc986 unicore32: mm: Add missing parameter 
to arch_vma_access_permitted)
Merging xtensa/for_next

RE: [PATCH v15 net-next 1/1] hv_sock: introduce Hyper-V Sockets

> From: Dexuan Cui
> Sent: Friday, July 8, 2016 15:47
> 
> You can also get the patch here (2764221d):
> https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160708_v15
> 
> In v14:
> fix some coding style issues pointed out by David.
> 
> In v15:
> Just some stylistic changes addressing comments from Joe Perches and
> Olaf Hering -- thank you!
> - add a GPL blurb.
> - define a new macro PAGE_SIZE_4K and use it to replace PAGE_SIZE
> - change sk_to_hvsock/hvsock_to_sk() from macros to inline functions
> - remove a not-very-useful pr_err()
> - fix some typos in comment and coding style issues.

FYI: the diff between v14 and v15 is attached: the diff is generated by 
git-diff-ing the 2 branches decui/hv_sock/net-next/20160629_v14 and 
decui/hv_sock/net-next/20160708_v15 in the above github repo.
 
Thanks,
-- Dexuan


delta_v14_vs.v15.patch
Description: delta_v14_vs.v15.patch

[PATCH 1/2] tty: amba-pl011: add support for clock frequency setting via dt

2016-07-08 Thread Jorge Ramirez-Ortiz

Allow to specify the clock frequency for any given port via the
assigned-clock-rates device tree property.

Signed-off-by: Jorge Ramirez-Ortiz 
Tested-by: Jorge Ramirez-Ortiz 
---
 drivers/tty/serial/amba-pl011.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/tty/serial/amba-pl011.c b/drivers/tty/serial/amba-pl011.c
index 1b7331e..51867ab 100644
--- a/drivers/tty/serial/amba-pl011.c
+++ b/drivers/tty/serial/amba-pl011.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2472,6 +2473,10 @@ static int pl011_probe(struct amba_device *dev, const 
struct amba_id *id)
if (IS_ERR(uap->clk))
return PTR_ERR(uap->clk);
 
+   ret = of_clk_set_defaults(dev->dev.of_node, false);
+   if (ret < 0)
+   return ret;
+
uap->reg_offset = vendor->reg_offset;
uap->vendor = vendor;
uap->fifosize = vendor->get_fifosize(dev);
-- 
2.7.4

[PATCH 2/2] arm64: dts: set UART1 clock frequency to 150MHz

2016-07-08 Thread Jorge Ramirez-Ortiz

Enable support for higher baud rates (up to 3Mbps) in UART1 - required
for bluetooth transfers.

Signed-off-by: Jorge Ramirez-Ortiz 
Tested-by: Jorge Ramirez-Ortiz 
---
 arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts 
b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
index e92a30c..27be812 100644
--- a/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
+++ b/arch/arm64/boot/dts/hisilicon/hi6220-hikey.dts
@@ -55,6 +55,8 @@
};
 
uart1: uart@f7111000 {
+   assigned-clocks = <&sys_ctrl HI6220_UART1_SRC>;
+   assigned-clock-rates = <15000>;
status = "ok";
};
 
-- 
2.7.4

Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-08 Thread Jiri Kosina

On Thu, 7 Jul 2016, Craig Gallek wrote:

> This sort of seems like it's just side-stepping the problem.  Given
> that the size of this hash table is fixed, the lookup time of this
> operation is still going to approach linear as the number of qdiscs
> increases.  

That's true; however the primary goal here is not to actually ultimately 
improve speed of qdisc lookup per se, but rather to make it possible to 
unhide the qdiscs which are currently omitted as the linked list takes too 
long to walk. The static hashtable is going help here.

Thanks,

-- 
Jiri Kosina
SUSE Labs

Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support

2016-07-08 Thread Morten Rasmussen

On Fri, Jul 08, 2016 at 07:35:56AM +, KEITA KOBAYASHI wrote:
> Hi,
> 
> I tested these patches on Renesas SoC r8a7790(CA15*4 + CA7*4)
> and your preview branch[1] on Renesas SoC r8a7795(CA57*4 + CA53*4).
> 
> > Test 0:
> > for i in `seq 1 10`; \
> >do sysbench --test=cpu --max-time=3 --num-threads=1 run;
> > \
> >done \
> > | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> >END {print "Average events: " sum/runs}'
> > 
> > Target: ARM TC2 (2xA15+3xA7)
> > 
> > (Higher is better)
> > tip:Average events: 146.9
> > patch:  Average events: 217.9
> > 
> Target: Renesas SoC r8a7790(CA15*4 + CA7*4)
>   w/  capacity patches: Average events: 200.2
>   w/o capacity patches: Average events: 144.4
> 
> Target: Renesas SoC r8a7795(CA57*4 + CA53*4)
>   w/  capacity patches : 3587.7
>   w/o capacity patches : 2327.8
> 
> > Test 1:
> > perf stat --null --repeat 10 -- \
> > perf bench sched messaging -g 50 -l 5000
> > 
> > Target: Intel IVB-EP (2*10*2)
> > 
> > tip:4.861970420 seconds time elapsed ( +-  1.39% )
> > patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> > 
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> > 
> > tip:61.485682596 seconds time elapsed ( +-  0.07% )
> > patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> > 
> Target: Renesas SoC r8a7790(CA15*4) (-l 1000)
>   w/  capacity patches: 38.955532040 seconds time elapsed ( +-  0.12% )
>   w/o capacity patches: 39.424945580 seconds time elapsed ( +-  0.10% )
> 
> Target: Renesas SoC r8a7795(CA57*4) (-l 1000)
>   w/  capacity patches : 29.804292200 seconds time elapsed ( +-  0.37% )
>   w/o capacity patches : 29.838826790 seconds time elapsed ( +-  0.40% )
> 
> Tested-by: Keita Kobayashi 

Thank you for testing and sharing your test results. They seem to show a
significant improvement in throughput which is in line with the
measurement we have for the ARM dev boards and the MediaTek SoC.

Thanks,
Morten

Re: [RFC] [PATCH v2 1/3] scatterlist: Add support to clone scatterlist

On Thu, Jul 07, 2016 at 07:43:25PM +0200, Robert Jarzmik wrote:

> I'll try, but I don't trust much my chances of success, given that this 
> tester :
>  - should compile and link in $(TOP)/lib/scatterlist.c, as this is where
>sg_split() is defined
>  - this implies all its includes
>  - this implies at least these ones :
>   bug.h
>   mm.h
>   scatterlist.h
>   string.h
>   types.h
>  - this implies having page_to_phys and co. defined somewhere without
>draining the whole include/linux and include/asm* trees

> For the tester, I had created an apart include/linux tree where all the 
> includes
> were _manually_ filled in with minimal content.

> I don't know if an existing selftest had already this kind of problem,
> ie. having to compile and link a kernel .c file, and that makes me feel this
> might be difficult to keep a nice standalone tester.

Right, that's messy :(  Could it be refactored as a boot/module load
time test so it could be built in the kernel environment?  Less
convenient to use (though KVM/UML help) but easier to build.


signature.asc
Description: PGP signature

Re: [PATCH v2] kexec: Fix kdump failure with notsc

2016-07-08 Thread Nikolay Borisov



On 07/07/2016 01:17 PM, Wei Jiangang wrote:
> If we specify the 'notsc' boot parameter for the dump-capture kernel,
> and then trigger a crash(panic) by using "ALT-SysRq-c" or "echo c >
> /proc/sysrq-trigger",
> the dump-capture kernel will hang in calibrate_delay_converge():
> 
> /* wait for "start of" clock tick */
> ticks = jiffies;
> while (ticks == jiffies)
> ; /* nothing */
> 
> serial log of the hang is as follows:
> 
> tsc: Fast TSC calibration using PIT
> tsc: Detected 2099.947 MHz processor
> Calibrating delay loop...
> 
> The reason is that the dump-capture kernel hangs in while loops and
> waits for jiffies to be updated, but no timer interrupts is passed
> to BSP by APIC.
> 
> In fact, the local APIC was disabled in reboot and crash path by
> lapic_shutdown(). We need to put APIC in legacy mode in kexec jump path
> (put the system into PIT during the crash kernel),
> so that the dump-capture kernel can get timer interrupts.
> 
> BTW,
> I found the buggy commit 522e66464467 ("x86/apic: Disable I/O APIC
> before shutdown of the local APIC") via bisection.
> 
> Originally, I want to revert it.
> But Ingo Molnar comments that "By reverting the change can paper over
> the bug, but re-introduce the bug that can result in certain CPUs hanging
> if IO-APIC sends an APIC message if the lapic is disabled prematurely"
> And I think it's pertinent.
> 
> Signed-off-by: Wei Jiangang 
> ---
>  arch/x86/include/asm/apic.h| 5 +
>  arch/x86/kernel/apic/apic.c| 9 +
>  arch/x86/kernel/machine_kexec_32.c | 5 ++---
>  arch/x86/kernel/machine_kexec_64.c | 6 +++---
>  4 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
> index bc27611fa58f..5d7e635e580a 100644
> --- a/arch/x86/include/asm/apic.h
> +++ b/arch/x86/include/asm/apic.h
> @@ -128,6 +128,7 @@ extern void clear_local_APIC(void);
>  extern void disconnect_bsp_APIC(int virt_wire_setup);
>  extern void disable_local_APIC(void);
>  extern void lapic_shutdown(void);
> +extern int lapic_disabled(void);



>  extern void sync_Arb_IDs(void);
>  extern void init_bsp_APIC(void);
>  extern void setup_local_APIC(void);
> @@ -165,6 +166,10 @@ extern int setup_APIC_eilvt(u8 lvt_off, u8 vector, u8 
> msg_type, u8 mask);
>  
>  #else /* !CONFIG_X86_LOCAL_APIC */
>  static inline void lapic_shutdown(void) { }
> +static inline int lapic_disabled(void)
> +{
> + return 0;
> +}
>  #define local_apic_timer_c2_ok   1
>  static inline void init_apic_mappings(void) { }
>  static inline void disable_local_APIC(void) { }
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 60078a67d7e3..d1df250994bb 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -133,6 +133,9 @@ static inline void imcr_apic_to_pic(void)
>  }
>  #endif
>  
> +/* Local APIC is disabled by the kernel for crash or reboot path */
> +static int disabled_local_apic;

Your are using an int for a boolean value, please be more explicit by
declaring the variable as boolean and respectively changing all the
functions returning this value.

> +
>  /*
>   * Knob to control our willingness to enable the local APIC.
>   *
> @@ -1097,10 +1100,16 @@ void lapic_shutdown(void)
>  #endif
>   disable_local_APIC();
>  
> + disabled_local_apic = 1;
>  
>   local_irq_restore(flags);
>  }
>  
> +int lapic_disabled(void)
> +{
> + return disabled_local_apic;
> +}
> +
>  /**
>   * sync_Arb_IDs - synchronize APIC bus arbitration IDs
>   */
> diff --git a/arch/x86/kernel/machine_kexec_32.c 
> b/arch/x86/kernel/machine_kexec_32.c
> index 469b23d6acc2..c934a7868e6b 100644
> --- a/arch/x86/kernel/machine_kexec_32.c
> +++ b/arch/x86/kernel/machine_kexec_32.c
> @@ -202,14 +202,13 @@ void machine_kexec(struct kimage *image)
>   local_irq_disable();
>   hw_breakpoint_disable();
>  
> - if (image->preserve_context) {
> + if (image->preserve_context || lapic_disabled()) {
>  #ifdef CONFIG_X86_IO_APIC
>   /*
>* We need to put APICs in legacy mode so that we can
>* get timer interrupts in second kernel. kexec/kdump
>* paths already have calls to disable_IO_APIC() in
> -  * one form or other. kexec jump path also need
> -  * one.
> +  * one form or other. kexec jump path also need one.
>*/
>   disable_IO_APIC();
>  #endif
> diff --git a/arch/x86/kernel/machine_kexec_64.c 
> b/arch/x86/kernel/machine_kexec_64.c
> index 5a294e48b185..d3598cdd6437 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -23,6 +23,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -269,14 +270,13 @@ void machine_kexec(struct kimage *image)
>   local_irq_disable();
>   hw_breakpoint_disable();
>  
> - if (i

Re: [PATCH 1/1] irqdomain: Export __irq_domain_alloc_irqs() and irq_domain_free_irqs()

2016-07-08 Thread Alexander Popov

On 06.07.2016 14:17, Thomas Gleixner wrote:
> On Fri, 1 Jul 2016, Alexander Popov wrote:
> 
>> Export __irq_domain_alloc_irqs() and irq_domain_free_irqs() for being
>> able to work with irq_domain hierarchy in modules.
> 
> We usually export only when we have a proper use case which is supposed to go
> into the kernel tree proper. What's yours?

Hello, Thomas,

I work at Positive Technologies ( https://www.ptsecurity.com/ ). We develop
a bare-metal hypervisor, which targets x86_64 and supports Linux as a guest OS.

Intel VT-x allows hypervisor to inject interrupts into virtual machines.
We want to handle these interrupts in guest Linux.

So I wrote a simple kernel module creating an irq_domain, which has
x86_vector_domain as a parent in the hierarchy. In this module I just call:
- irq_domain_alloc_irqs() to allocate irqs and allow calling request_irq()
   for them;
- irqd_cfg(irq_get_irq_data()) to get the APIC vectors of the allocated irqs;
- irq_domain_free_irqs() to free the resources at the end.

It allows to handle interrupts injected by the hypervisor in guest Linux easily,
without emulating MSI-capable PCI device at the hypervisor side.

Everything works fine if __irq_domain_alloc_irqs() and irq_domain_free_irqs()
are exported. Is it a proper use-case?

Do you think my module could be useful for the mainline in some form?
It took me some time to understand irq_domain hierarchy design, so I can
prepare some patch or share my code to help others.

Best regards,
Alexander

Re: [PATCH v2 2/3] Documentation, ABI: Add a document entry for cache id


* Fenghua Yu  wrote:

> From: Fenghua Yu 
> 
> Add an ABI document entry for /sys/devices/system/cpu/cpu*/cache/index*/id.
> 
> Signed-off-by: Fenghua Yu 
> ---
>  Documentation/ABI/testing/sysfs-devices-system-cpu | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu 
> b/Documentation/ABI/testing/sysfs-devices-system-cpu
> index 1650133..cc62034 100644
> --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> @@ -272,6 +272,19 @@ Description: Parameters for the CPU cache attributes
>the modified cache line is written to main
>memory only when it is replaced
>  
> +
> +What:/sys/devices/system/cpu/cpu*/cache/index*/id
> +Date:July 2016
> +Contact: Linux kernel mailing list 
> +Description: Cache id
> +
> + The id identifies a cache in the platform. In same index, the id
> + is unique across the platform.

What does 'In same index' mean?

Thanks,

Ingo

Re: [PATCH 0/9] mm: Hardened usercopy

* Kees Cook  wrote:

> - I couldn't detect a measurable performance change with these features
>   enabled. Kernel build times were unchanged, hackbench was unchanged,
>   etc. I think we could flip this to "on by default" at some point.

Could you please try to find some syscall workload that does many small user 
copies and thus excercises this code path aggressively?

If that measurement works out fine then I'd prefer to enable these security 
checks 
by default.

Thaks,

Ingo

Re: [RFC] [PATCH v3 1/4] spi: omap2-mcspi: Add comments for RX only DMA buffer workaround

On Thu, Jul 07, 2016 at 12:17:48PM -0500, Franklin S Cooper Jr wrote:
> OMAP35x and OMAP37x mentions in the McSPI End-of-Transfer Sequences section
> that if the McSPI is configured as a Master and only DMA RX is being
> performed then the DMA transfer size needs to be reduced by 1 or 2.

Please do not submit new versions of already applied patches, please
submit incremental updates to the existing code.  Modifying existing
commits creates problems for other users building on top of those
commits so it's best practice to only change pubished git commits if
absolutely essential.

signature.asc
Description: PGP signature

kernel/time/ntp.c: possible unit inconsistency

2016-07-08 Thread Matwey V. Kornilov

Hello,

I think I found minor inconsistency in measurement units between ntpd
and linux kernel. Though, I am not sure completely.
I've failed to reach ntp mail list because lists.ntp.org is down for
me for several days.

My principal concern is about `maxerror' quantity.
kernel/time/ntp.c has time_maxerror variable which is can be get/put
from/to userspace using ntp_adjtime call.
The single point where the variable is altered by the linux kernel is
second_overflow() function in kernel/time/ntp.c:456.
Here the following happens:

/* Bump the maxerror field */
time_maxerror += MAXFREQ / NSEC_PER_USEC;
if (time_maxerror > NTP_PHASE_LIMIT) {
time_maxerror = NTP_PHASE_LIMIT;
time_status |= STA_UNSYNC;
}

The line assumes that time_maxerror is always measured in the units of usec.

At the same time, if we get ntp-4.2.8p8 sources and look at

ntpdc/ntpdc_ops.c:2955 function kerninfo() or
ntpd/ntp_control.c:2359 function ctl_putsys()

then we found that ntpd expects that ntp_adjtime returns maxerror
either in usec (if STA_NANO is not used) or in nsec (if STA_NANO is
used).
So, there can be the case when kernel measures maxerror in usec and
ntpd does in nsec.

-- 
With best regards,
Matwey V. Kornilov
http://blog.matwey.name
xmpp://0x2...@jabber.ru

Applied "spi: omap2-mcspi: Select SPI_SPLIT" to the spi tree

The patch

   spi: omap2-mcspi: Select SPI_SPLIT

has been applied to the spi tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 2b32e987c48c65a1a40b3b4294435f761e063b6b Mon Sep 17 00:00:00 2001
From: Franklin S Cooper Jr 
Date: Thu, 7 Jul 2016 12:17:49 -0500
Subject: [PATCH] spi: omap2-mcspi: Select SPI_SPLIT

The function sg_split will be used by spi-omap2-mcspi to handle a SoC
workaround in the SPI driver. Therefore, select SG_SPLIT so this function
is available to the driver.

Signed-off-by: Franklin S Cooper Jr 
Signed-off-by: Mark Brown 
---
 drivers/spi/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/spi/Kconfig b/drivers/spi/Kconfig
index 4b931ec8d90b..d6fb8d4b7786 100644
--- a/drivers/spi/Kconfig
+++ b/drivers/spi/Kconfig
@@ -411,6 +411,7 @@ config SPI_OMAP24XX
tristate "McSPI driver for OMAP"
depends on HAS_DMA
depends on ARCH_OMAP2PLUS || COMPILE_TEST
+   select SG_SPLIT
help
  SPI master controller for OMAP24XX and later Multichannel SPI
  (McSPI) modules.
-- 
2.8.1

Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-08 Thread Eric Dumazet

On Thu, 2016-07-07 at 22:36 +0200, Jiri Kosina wrote:
> From: Jiri Kosina 
> 
> Convert the per-device linked list into a hashtable. The primary 
> motivation for this change is that currently, we're not tracking all the 
> qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup 
> performed over the linked list by qdisc_match_from_root() is rather 
> expensive.
> 
> The ultimate goal is to get rid of hidden qdiscs completely, which will 
> bring much more determinism in user experience.
> 
> As we're adding hashtable.h include into generic netdevice.h, we have to make
> sure HASH_SIZE macro is now non-conflicting with local definitions.
> 
> Signed-off-by: Jiri Kosina 
> ---


> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index fdc9de2..0f70ecc 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -62,11 +62,11 @@ module_param(log_ecn_error, bool, 0644);
>  MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
>  
>  #define HASH_SIZE_SHIFT  5
> -#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
> +#define __HASH_SIZE (1 << HASH_SIZE_SHIFT)

__ prefix is mostly used for functions having some kind of
shells/helpers.

I would rather use IP6_GRE_HASH_SIZE or something which has lower
chances of being used elsewhere.

Or maybe you could use new HASH_SIZE(name), providing proper 'name'

@@ -732,6 +730,8 @@ static void attach_default_qdiscs(struct net_device *dev)
>   qdisc->ops->attach(qdisc);
>   }
>   }
> + if (dev->qdisc)
> + qdisc_hash_add(dev->qdisc);
>  }
>  

I do not understand this addition, could you comment on it ?

Applied "spi: omap2-mcspi: Use the SPI framework to handle DMA mapping" to the spi tree

The patch

   spi: omap2-mcspi: Use the SPI framework to handle DMA mapping

has been applied to the spi tree at

   git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git 

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.  

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

>From 0ba1870f886501beca0e2c19ec367a85ae201ea8 Mon Sep 17 00:00:00 2001
From: Franklin S Cooper Jr 
Date: Thu, 7 Jul 2016 12:17:50 -0500
Subject: [PATCH] spi: omap2-mcspi: Use the SPI framework to handle DMA mapping

Currently, the driver handles mapping buffers to be used by the DMA.
However, there are times that the current mapping implementation will
fail for certain buffers. Fortunately, the SPI framework can detect
and map buffers so its usable by the DMA.

Update the driver to utilize the SPI framework for buffer
mapping instead. Also incorporate hooks that the framework uses to
determine if the DMA can or can not be used.

This will result in the original omap2_mcspi_transfer_one function being
deleted and omap2_mcspi_work_one being renamed to
omap2_mcspi_transfer_one. Previously transfer_one was only responsible
for mapping and work_one handled the transfer. But now only transferring
needs to be handled by the driver.

Signed-off-by: Franklin S Cooper Jr 
Signed-off-by: Mark Brown 
---
 drivers/spi/spi-omap2-mcspi.c | 132 ++
 1 file changed, 56 insertions(+), 76 deletions(-)

diff --git a/drivers/spi/spi-omap2-mcspi.c b/drivers/spi/spi-omap2-mcspi.c
index c47f95879833..d5157bce 100644
--- a/drivers/spi/spi-omap2-mcspi.c
+++ b/drivers/spi/spi-omap2-mcspi.c
@@ -419,16 +419,13 @@ static void omap2_mcspi_tx_dma(struct spi_device *spi,
 
if (mcspi_dma->dma_tx) {
struct dma_async_tx_descriptor *tx;
-   struct scatterlist sg;
 
dmaengine_slave_config(mcspi_dma->dma_tx, &cfg);
 
-   sg_init_table(&sg, 1);
-   sg_dma_address(&sg) = xfer->tx_dma;
-   sg_dma_len(&sg) = xfer->len;
-
-   tx = dmaengine_prep_slave_sg(mcspi_dma->dma_tx, &sg, 1,
-   DMA_MEM_TO_DEV, DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
+   tx = dmaengine_prep_slave_sg(mcspi_dma->dma_tx, xfer->tx_sg.sgl,
+xfer->tx_sg.nents,
+DMA_MEM_TO_DEV,
+DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
if (tx) {
tx->callback = omap2_mcspi_tx_callback;
tx->callback_param = spi;
@@ -449,7 +446,10 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct 
spi_transfer *xfer,
 {
struct omap2_mcspi  *mcspi;
struct omap2_mcspi_dma  *mcspi_dma;
-   unsigned intcount, dma_count;
+   unsigned intcount, transfer_reduction = 0;
+   struct scatterlist  *sg_out[2];
+   int nb_sizes = 0, out_mapped_nents[2], ret, x;
+   size_t  sizes[2];
u32 l;
int elements = 0;
int word_len, element_count;
@@ -457,7 +457,6 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct 
spi_transfer *xfer,
mcspi = spi_master_get_devdata(spi->master);
mcspi_dma = &mcspi->dma_channels[spi->chip_select];
count = xfer->len;
-   dma_count = xfer->len;
 
/*
 *  In the "End-of-Transfer Procedure" section for DMA RX in OMAP35x TRM
@@ -465,7 +464,7 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct 
spi_transfer *xfer,
 *  normal mode.
 */
if (mcspi->fifo_depth == 0)
-   dma_count -= es;
+   transfer_reduction = es;
 
word_len = cs->word_len;
l = mcspi_cached_chconf0(spi);
@@ -479,7 +478,6 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct 
spi_transfer *xfer,
 
if (mcspi_dma->dma_rx) {
struct dma_async_tx_descriptor *tx;
-   struct scatterlist sg;
 
dmaengine_slave_config(mcspi_dma->dma_rx, &cfg);
 
@@ -488,15 +486,38 @@ omap2_mcspi_rx_dma(struct spi_device *spi, struct 
spi_transfer *xfer,
 *  configured in turbo mode.
 */
if ((l & OMAP2_MCSPI_CHCONF

Re: [PATCH v2] regulator: pwm: Fix regulator ramp delay for continuous mode

On Thu, Jul 07, 2016 at 06:43:33PM +, Aleksandr Frid wrote:

I'm not entirely sure what's wrong with your mail client here but your
mails are essentially illegible.  There appears to be some combination
of top posting, reflowing quoted content to remove line breaks and extra
levels of quoting.

> >>
> In the case that you don't need multiple steps then don't specify a 
> regulator-ramp-delay, just a pwm-regulator-settle-us.
> >>
> Agreed
> 
> -Original Message-
> From: diand...@google.com [mailto:diand...@google.com] On Behalf Of Doug 
> Anderson
> Sent: Thursday, July 07, 2016 11:32 AM
> To: Aleksandr Frid
> Cc: Laxman Dewangan; Mark Brown; Boris Brezillon; Lee Jones; Brian Norris; 
> open list:ARM/Rockchip SoC...; Heiko Stuebner; Thierry Reding; Liam Girdwood; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v2] regulator: pwm: Fix regulator ramp delay for 
> continuous mode
> 
> Hi,
> 
> On Thu, Jul 7, 2016 at 11:23 AM, Aleksandr Frid  wrote:
> > Hi,
> >
> >>>
> > In that case we should probably add a new PWM regulator property and not 
> > abuse the existing one.  Maybe you use "pwm-regulator-settle-us"
> > or something?
> >>>
> > Looks reasonable to me.
> >
> >>>
> > actually the right thing is probably to implement 
> > 'regulator-ramp-delay' as doing several small steps in that case
> >>>
> > Ramp delay uV/us  is not a "real" metric for some PWM regulators with 
> > exponential transition -- as opposite to fixed slew-rate linear transition 
> > on other regulators. So splitting transition into multiple steps to 
> > implement artificial  (in this case) metric seems questionable.
> 
> In the case that you don't need multiple steps then don't specify a 
> regulator-ramp-delay, just a pwm-regulator-settle-us.
> 
> ...the suggestion for multiple steps is because (so I'm told) it helps avoid 
> overshoot or undershoot problems.  In general the whole point of ramping a 
> regulator slowly is to avoid overshoot or undershoot problems.  The 
> "regulator-ramp-delay" property in Linux is a little odd because it sort of 
> "describes" the ramp delay and sort of "sets"
> the ramp delay.  Many PMICs allow you to set how fast the regulator will ramp 
> and this property is used to specify how the register in the PMIC should be 
> set.  However, it is also used as the actual delay in Linux.
> 
> 
> -Doug


signature.asc
Description: PGP signature

Re: Applied "spi: omap2-mcspi: Select SPI_SPLIT" to the spi tree

2016-07-08 Thread Sekhar Nori

Mark,

On Friday 08 July 2016 02:19 PM, Mark Brown wrote:
> From 2b32e987c48c65a1a40b3b4294435f761e063b6b Mon Sep 17 00:00:00 2001
> From: Franklin S Cooper Jr 
> Date: Thu, 7 Jul 2016 12:17:49 -0500
> Subject: [PATCH] spi: omap2-mcspi: Select SPI_SPLIT

Looks like you fixed up the description locally but forgot to fix the
subject line. It still shows SPI_SPLIT.

> 
> The function sg_split will be used by spi-omap2-mcspi to handle a SoC
> workaround in the SPI driver. Therefore, select SG_SPLIT so this function
> is available to the driver.
> 
> Signed-off-by: Franklin S Cooper Jr 
> Signed-off-by: Mark Brown 

Thanks,
Sekhar

Re: [PATCH] x86: add workaround monitor bug


* Jacob Pan  wrote:

> From: Peter Zijlstra 
> 
> Monitored cached line may not wake up from mwait on certain
> Goldmont based CPUs. This patch will avoid calling
> current_set_polling_and_test() and thereby not set the TIF_ flag.
> The result is that we'll always send IPIs for wakeups.
> 
> Signed-off-by: Peter Zijlstra 
> Signed-off-by: Jacob Pan 
> ---
>  arch/x86/include/asm/cpufeatures.h | 1 +
>  arch/x86/include/asm/mwait.h   | 2 +-
>  arch/x86/kernel/cpu/intel.c| 5 +
>  arch/x86/kernel/process.c  | 2 +-
>  4 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpufeatures.h 
> b/arch/x86/include/asm/cpufeatures.h
> index 78dbd28..197a3f4 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -304,6 +304,7 @@
>  #define X86_BUG_SYSRET_SS_ATTRS  X86_BUG(8) /* SYSRET doesn't fix up SS 
> attrs */
>  #define X86_BUG_NULL_SEG X86_BUG(9) /* Nulling a selector preserves the 
> base */
>  #define X86_BUG_SWAPGS_FENCE X86_BUG(10) /* SWAPGS without input dep on GS */
> +#define X86_BUG_MONITOR  X86_BUG(11) /* IPI required to wake up 
> remote cpu */
>  
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
> index 0deeb2d..f37f2d8 100644
> --- a/arch/x86/include/asm/mwait.h
> +++ b/arch/x86/include/asm/mwait.h
> @@ -97,7 +97,7 @@ static inline void __sti_mwait(unsigned long eax, unsigned 
> long ecx)
>   */
>  static inline void mwait_idle_with_hints(unsigned long eax, unsigned long 
> ecx)
>  {
> - if (!current_set_polling_and_test()) {
> + if (static_cpu_has_bug(X86_BUG_MONITOR) || 
> !current_set_polling_and_test()) {

Hm, this might be suboptimal: if MONITOR/MWAIT is implemented by setting the 
exclusive flag for the monitored memory address and then snooping for cache 
invalidation requests for that cache line, then not modifying the ->flags value 
with TIF_POLLING_NRFLAG makes MWAIT not wake up - only the IPI would wake it up.

I think a better approach would be to still optimistically modify the ->flags 
value _AND_ to also send an IPI, to make sure the wakeup is not lost. This 
means 
that the woken CPU will wake up much faster (no IPI latency).

(The system will still bear the ovehread of sending and receiving the IPI, but 
that cost is unavoidable if there's no other workaround for this erratum.)

Thanks,

Ingo

Re: [RFC PATCH v2] net: sched: convert qdisc linked list to hashtable

2016-07-08 Thread Jiri Kosina

On Fri, 8 Jul 2016, Eric Dumazet wrote:

> > diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> > index fdc9de2..0f70ecc 100644
> > --- a/net/ipv6/ip6_gre.c
> > +++ b/net/ipv6/ip6_gre.c
> > @@ -62,11 +62,11 @@ module_param(log_ecn_error, bool, 0644);
> >  MODULE_PARM_DESC(log_ecn_error, "Log packets received with corrupted ECN");
> >  
> >  #define HASH_SIZE_SHIFT  5
> > -#define HASH_SIZE (1 << HASH_SIZE_SHIFT)
> > +#define __HASH_SIZE (1 << HASH_SIZE_SHIFT)
> 
> __ prefix is mostly used for functions having some kind of
> shells/helpers.
> 
> I would rather use IP6_GRE_HASH_SIZE or something which has lower
> chances of being used elsewhere.

Alright, makes sense, will do this in v3.

> @@ -732,6 +730,8 @@ static void attach_default_qdiscs(struct net_device *dev)
> > qdisc->ops->attach(qdisc);
> > }
> > }
> > +   if (dev->qdisc)
> > +   qdisc_hash_add(dev->qdisc);
> >  }
> >  
> 
> I do not understand this addition, could you comment on it ?

With linked lists, assigning to struct net_device's Qdisc pointer is 
enough to "initialize" the linked list and have it contain one (root) 
item. With hashtable, this is not the case, it needs to be explicitly 
added.

Hmm, dev_init_scheduler() (and perhaps also dev_shutdown()) would possibly 
need similar treatment in order to have accurate data there 100% of the 
time even during initialization.

-- 
Jiri Kosina
SUSE Labs

[PATCH 0/7] Rockchip dw-mipi-dsi driver


Hi all

This is a bunch of dw-mipi-dsi driver for RK3399 and RK3288, they have
been tested on rk3399 and rk3288 evb board.

This series is based on Mark Yao's branch:
https://github.com/markyzq/kernel-drm-rockchip/tree/drm-rockchip-next-2016-05-23



Chris Zhong (7):
  dt-bindings: add rk3399 support for dw-mipi-rockchip
  DRM: mipi: support rk3399 mipi dsi
  dt-bindings: add power domain node for dw-mipi-rockchip
  drm/rockchip: dw-mipi: add dw-mipi power domain support
  drm/rockchip: dw-mipi: support HPD poll
  drm/rockchip: dw-mipi: fix phy clk lane stop state timeout
  drm/rockchip: dw-mipi: fix insufficient bandwidth of some panel

 .../display/rockchip/dw_mipi_dsi_rockchip.txt  |   6 +
 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 135 +
 2 files changed, 120 insertions(+), 21 deletions(-)

-- 
2.6.3

[PATCH 3/7] dt-bindings: add power domain node for dw-mipi-rockchip

Signed-off-by: Chris Zhong 
---

 .../devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt| 1 +
 1 file changed, 1 insertion(+)

diff --git 
a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt 
b/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
index 4d59df3..e433ba5 100644
--- 
a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
+++ 
b/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
@@ -17,6 +17,7 @@ Required properties:
 Optional properties:
 - clocks, clock-names: phandle to the dw-mipi phy clock, name should be
   "phy_cfg".
+- power-domains: a phandle to mipi dsi power domain node.
 
 [1] Documentation/devicetree/bindings/clock/clock-bindings.txt
 [2] Documentation/devicetree/bindings/media/video-interfaces.txt
-- 
2.6.3

[PATCH 5/7] drm/rockchip: dw-mipi: support HPD poll

At the first time of bind, there is no any panel attach in mipi. Add a
DRM_CONNECTOR_POLL_HPD porperty to detect the panel status, when panel
probe, the dw_mipi_dsi_host_attach would be called, then mipi-dsi will
trigger a event to notify the drm framework.

Signed-off-by: Chris Zhong 
---

 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 41 --
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
index 15ba796..72d7f48 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
@@ -285,6 +285,7 @@ struct dw_mipi_dsi {
struct drm_encoder encoder;
struct drm_connector connector;
struct mipi_dsi_host dsi_host;
+   struct device_node *panel_node;
struct drm_panel *panel;
struct device *dev;
struct regmap *grf_regmap;
@@ -462,7 +463,6 @@ static int dw_mipi_dsi_phy_init(struct dw_mipi_dsi *dsi)
dsi_write(dsi, DSI_PHY_RSTZ, PHY_ENFORCEPLL | PHY_ENABLECLK |
 PHY_UNRSTZ | PHY_UNSHUTDOWNZ);
 
-
ret = readx_poll_timeout(readl, dsi->base + DSI_PHY_STATUS,
 val, val & LOCK, 1000, PHY_STATUS_TIMEOUT_US);
if (ret < 0) {
@@ -550,11 +550,11 @@ static int dw_mipi_dsi_host_attach(struct mipi_dsi_host 
*host,
dsi->lanes = device->lanes;
dsi->channel = device->channel;
dsi->format = device->format;
-   dsi->panel = of_drm_find_panel(device->dev.of_node);
-   if (dsi->panel)
-   return drm_panel_attach(dsi->panel, &dsi->connector);
+   dsi->panel_node = device->dev.of_node;
+   if (dsi->connector.dev)
+   drm_helper_hpd_irq_event(dsi->connector.dev);
 
-   return -EINVAL;
+   return 0;
 }
 
 static int dw_mipi_dsi_host_detach(struct mipi_dsi_host *host,
@@ -562,7 +562,10 @@ static int dw_mipi_dsi_host_detach(struct mipi_dsi_host 
*host,
 {
struct dw_mipi_dsi *dsi = host_to_dsi(host);
 
-   drm_panel_detach(dsi->panel);
+   dsi->panel_node = NULL;
+
+   if (dsi->connector.dev)
+   drm_helper_hpd_irq_event(dsi->connector.dev);
 
return 0;
 }
@@ -1022,13 +1025,33 @@ static struct drm_connector_helper_funcs 
dw_mipi_dsi_connector_helper_funcs = {
 static enum drm_connector_status
 dw_mipi_dsi_detect(struct drm_connector *connector, bool force)
 {
-   return connector_status_connected;
+   struct dw_mipi_dsi *dsi = con_to_dsi(connector);
+
+
+   if (!dsi->panel) {
+   dsi->panel = of_drm_find_panel(dsi->panel_node);
+   if (dsi->panel)
+   drm_panel_attach(dsi->panel, &dsi->connector);
+   } else if (!dsi->panel_node) {
+   struct drm_encoder *encoder;
+
+   encoder = platform_get_drvdata(to_platform_device(dsi->dev));
+   dw_mipi_dsi_encoder_disable(encoder);
+   drm_panel_detach(dsi->panel);
+   dsi->panel = NULL;
+   }
+
+   if (dsi->panel)
+   return connector_status_connected;
+
+   return connector_status_disconnected;
 }
 
 static void dw_mipi_dsi_drm_connector_destroy(struct drm_connector *connector)
 {
drm_connector_unregister(connector);
drm_connector_cleanup(connector);
+   connector->dev = NULL;
 }
 
 static struct drm_connector_funcs dw_mipi_dsi_atomic_connector_funcs = {
@@ -1069,6 +1092,8 @@ static int dw_mipi_dsi_register(struct drm_device *drm,
return ret;
}
 
+   connector->polled = DRM_CONNECTOR_POLL_HPD;
+
drm_connector_helper_add(connector,
&dw_mipi_dsi_connector_helper_funcs);
 
@@ -1225,6 +1250,8 @@ static void dw_mipi_dsi_unbind(struct device *dev, struct 
device *master,
 {
struct dw_mipi_dsi *dsi = dev_get_drvdata(dev);
 
+   dw_mipi_dsi_encoder_disable(&dsi->encoder);
+
mipi_dsi_host_unregister(&dsi->dsi_host);
pm_runtime_disable(dev);
clk_disable_unprepare(dsi->pllref_clk);
-- 
2.6.3

[PATCH 1/7] dt-bindings: add rk3399 support for dw-mipi-rockchip

The dw-mipi-dsi of rk3399 is almost the same as rk3288, the rk3399 has
additional phy config clock.

Signed-off-by: Chris Zhong 
---

 .../devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt| 5 +
 1 file changed, 5 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt 
b/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
index 1753f0c..4d59df3 100644
--- 
a/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
+++ 
b/Documentation/devicetree/bindings/display/rockchip/dw_mipi_dsi_rockchip.txt
@@ -5,6 +5,7 @@ Required properties:
 - #address-cells: Should be <1>.
 - #size-cells: Should be <0>.
 - compatible: "rockchip,rk3288-mipi-dsi", "snps,dw-mipi-dsi".
+ "rockchip,rk3399-mipi-dsi", "snps,dw-mipi-dsi".
 - reg: Represent the physical address range of the controller.
 - interrupts: Represent the controller's interrupt to the CPU(s).
 - clocks, clock-names: Phandles to the controller's pll reference
@@ -13,6 +14,10 @@ Required properties:
 - ports: contain a port node with endpoint definitions as defined in [2].
   For vopb,set the reg = <0> and set the reg = <1> for vopl.
 
+Optional properties:
+- clocks, clock-names: phandle to the dw-mipi phy clock, name should be
+  "phy_cfg".
+
 [1] Documentation/devicetree/bindings/clock/clock-bindings.txt
 [2] Documentation/devicetree/bindings/media/video-interfaces.txt
 
-- 
2.6.3

[PATCH 2/7] DRM: mipi: support rk3399 mipi dsi

The vopb/vopl switch register of rk3399 mipi is different from rk3288,
the default setting for mipi dsi mode is different too, so add a
of_device_id structure to distinguish them, and make sure set the
correct mode before mipi phy init.

Signed-off-by: Chris Zhong 
Signed-off-by: Mark Yao 
---

 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 70 --
 1 file changed, 59 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
index dedc65b..100da01 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
@@ -28,9 +28,17 @@
 
 #define DRIVER_NAME"dw-mipi-dsi"
 
-#define GRF_SOC_CON60x025c
-#define DSI0_SEL_VOP_LIT(1 << 6)
-#define DSI1_SEL_VOP_LIT(1 << 9)
+#define RK3288_GRF_SOC_CON60x025c
+#define RK3288_DSI0_SEL_VOP_LITBIT(6)
+#define RK3288_DSI1_SEL_VOP_LITBIT(9)
+
+#define RK3399_GRF_SOC_CON19   0x6250
+#define RK3399_DSI0_SEL_VOP_LITBIT(0)
+#define RK3399_DSI1_SEL_VOP_LITBIT(4)
+
+/* disable turnrequest, turndisable, forcetxstopmode, forcerxmode */
+#define RK3399_GRF_SOC_CON22   0x6258
+#define RK3399_GRF_DSI_MODE0x
 
 #define DSI_VERSION0x00
 #define DSI_PWR_UP 0x04
@@ -147,7 +155,6 @@
 #define LPRX_TO_CNT(p) ((p) & 0x)
 
 #define DSI_BTA_TO_CNT 0x8c
-
 #define DSI_LPCLK_CTRL 0x94
 #define AUTO_CLKLANE_CTRL  BIT(1)
 #define PHY_TXREQUESTCLKHS BIT(0)
@@ -263,6 +270,11 @@ enum {
 };
 
 struct dw_mipi_dsi_plat_data {
+   u32 dsi0_en_bit;
+   u32 dsi1_en_bit;
+   u32 grf_switch_reg;
+   u32 grf_dsi0_mode;
+   u32 grf_dsi0_mode_reg;
unsigned int max_data_lanes;
enum drm_mode_status (*mode_valid)(struct drm_connector *connector,
   struct drm_display_mode *mode);
@@ -279,6 +291,7 @@ struct dw_mipi_dsi {
 
struct clk *pllref_clk;
struct clk *pclk;
+   struct clk *phy_cfg_clk;
 
unsigned int lane_mbps; /* per lane */
u32 channel;
@@ -400,6 +413,14 @@ static int dw_mipi_dsi_phy_init(struct dw_mipi_dsi *dsi)
 
dsi_write(dsi, DSI_PWR_UP, POWERUP);
 
+   if (!IS_ERR(dsi->phy_cfg_clk)) {
+   ret = clk_prepare_enable(dsi->phy_cfg_clk);
+   if (ret) {
+   dev_err(dsi->dev, "Failed to enable phy_cfg_clk\n");
+   return ret;
+   }
+   }
+
dw_mipi_dsi_phy_write(dsi, 0x10, BYPASS_VCO_RANGE |
 VCO_RANGE_CON_SEL(vco) |
 VCO_IN_CAP_CON_LOW |
@@ -444,17 +465,19 @@ static int dw_mipi_dsi_phy_init(struct dw_mipi_dsi *dsi)
 val, val & LOCK, 1000, PHY_STATUS_TIMEOUT_US);
if (ret < 0) {
dev_err(dsi->dev, "failed to wait for phy lock state\n");
-   return ret;
+   goto phy_init_end;
}
 
ret = readx_poll_timeout(readl, dsi->base + DSI_PHY_STATUS,
 val, val & STOP_STATE_CLK_LANE, 1000,
 PHY_STATUS_TIMEOUT_US);
-   if (ret < 0) {
+   if (ret < 0)
dev_err(dsi->dev,
"failed to wait for phy clk lane stop state\n");
-   return ret;
-   }
+
+phy_init_end:
+   if (!IS_ERR(dsi->phy_cfg_clk))
+   clk_disable_unprepare(dsi->phy_cfg_clk);
 
return ret;
 }
@@ -878,6 +901,7 @@ static void dw_mipi_dsi_encoder_disable(struct drm_encoder 
*encoder)
 static void dw_mipi_dsi_encoder_commit(struct drm_encoder *encoder)
 {
struct dw_mipi_dsi *dsi = encoder_to_dsi(encoder);
+   const struct dw_mipi_dsi_plat_data *pdata = dsi->pdata;
int mux = drm_of_encoder_active_endpoint_id(dsi->dev->of_node, encoder);
u32 val;
 
@@ -886,6 +910,10 @@ static void dw_mipi_dsi_encoder_commit(struct drm_encoder 
*encoder)
return;
}
 
+   if (pdata->grf_dsi0_mode_reg)
+   regmap_write(dsi->grf_regmap, pdata->grf_dsi0_mode_reg,
+pdata->grf_dsi0_mode);
+
dw_mipi_dsi_phy_init(dsi);
dw_mipi_dsi_wait_for_two_frames(dsi);
 
@@ -895,11 +923,11 @@ static void dw_mipi_dsi_encoder_commit(struct drm_encoder 
*encoder)
clk_disable_unprepare(dsi->pclk);
 
if (mux)
-   val = DSI0_SEL_VOP_LIT | (DSI0_SEL_VOP_LIT << 16);
+   val = pdata->dsi0_en_bit | (pdata->dsi0_en_bit << 16);
else
-   val = DSI0_SEL_VOP_LIT << 16;
+   val = pdata->dsi0_en_bit << 16;
 
-   regmap_write(dsi->grf_regmap, GRF_SOC_CON6, val);
+   regmap_write(dsi->grf_regmap, pdata->grf_switch_re

[PATCH 7/7] drm/rockchip: dw-mipi: fix insufficient bandwidth of some panel

Set the lanes bps to 1 / 0.9 times of pclk, the margin is not enough
for some panel, it will cause the screen display is not normal, so
increases the badnwidth to 1 / 0.8.

Signed-off-by: Chris Zhong 

---

 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
index 8401185..fca7dde 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
@@ -503,8 +503,8 @@ static int dw_mipi_dsi_get_lane_bps(struct dw_mipi_dsi *dsi)
 
mpclk = DIV_ROUND_UP(dsi->mode->clock, MSEC_PER_SEC);
if (mpclk) {
-   /* take 1 / 0.9, since mbps must big than bandwidth of RGB */
-   tmp = mpclk * (bpp / dsi->lanes) * 10 / 9;
+   /* take 1 / 0.8, since mbps must big than bandwidth of RGB */
+   tmp = mpclk * (bpp / dsi->lanes) * 10 / 8;
if (tmp < max_mbps)
target_mbps = tmp;
else
-- 
2.6.3

[PATCH 4/7] drm/rockchip: dw-mipi: add dw-mipi power domain support

Reference the power domain incase dw-mipi power down when
in use.

Signed-off-by: Chris Zhong 
---

 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
index 100da01..15ba796 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -293,6 +294,7 @@ struct dw_mipi_dsi {
struct clk *pclk;
struct clk *phy_cfg_clk;
 
+   int dpms_mode;
unsigned int lane_mbps; /* per lane */
u32 channel;
u32 lanes;
@@ -844,6 +846,11 @@ static void dw_mipi_dsi_encoder_mode_set(struct 
drm_encoder *encoder,
struct dw_mipi_dsi *dsi = encoder_to_dsi(encoder);
int ret;
 
+   if (dsi->dpms_mode == DRM_MODE_DPMS_ON)
+   return;
+
+   pm_runtime_get_sync(dsi->dev);
+
dsi->mode = adjusted_mode;
 
ret = dw_mipi_dsi_get_lane_bps(dsi);
@@ -876,6 +883,9 @@ static void dw_mipi_dsi_encoder_disable(struct drm_encoder 
*encoder)
 {
struct dw_mipi_dsi *dsi = encoder_to_dsi(encoder);
 
+   if (dsi->dpms_mode != DRM_MODE_DPMS_ON)
+   return;
+
drm_panel_disable(dsi->panel);
 
if (clk_prepare_enable(dsi->pclk)) {
@@ -896,6 +906,8 @@ static void dw_mipi_dsi_encoder_disable(struct drm_encoder 
*encoder)
dw_mipi_dsi_set_mode(dsi, DW_MIPI_DSI_CMD_MODE);
dw_mipi_dsi_disable(dsi);
clk_disable_unprepare(dsi->pclk);
+   pm_runtime_put(dsi->dev);
+   dsi->dpms_mode = DRM_MODE_DPMS_OFF;
 }
 
 static void dw_mipi_dsi_encoder_commit(struct drm_encoder *encoder)
@@ -929,6 +941,7 @@ static void dw_mipi_dsi_encoder_commit(struct drm_encoder 
*encoder)
 
regmap_write(dsi->grf_regmap, pdata->grf_switch_reg, val);
dev_dbg(dsi->dev, "vop %s output to dsi0\n", (mux) ? "LIT" : "BIG");
+   dsi->dpms_mode = DRM_MODE_DPMS_ON;
 }
 
 static int
@@ -1150,6 +1163,7 @@ static int dw_mipi_dsi_bind(struct device *dev, struct 
device *master,
 
dsi->dev = dev;
dsi->pdata = pdata;
+   dsi->dpms_mode = DRM_MODE_DPMS_OFF;
 
ret = rockchip_mipi_parse_dt(dsi);
if (ret)
@@ -1195,6 +1209,8 @@ static int dw_mipi_dsi_bind(struct device *dev, struct 
device *master,
 
dev_set_drvdata(dev, dsi);
 
+   pm_runtime_enable(dev);
+
dsi->dsi_host.ops = &dw_mipi_dsi_host_ops;
dsi->dsi_host.dev = dev;
return mipi_dsi_host_register(&dsi->dsi_host);
@@ -1210,6 +1226,7 @@ static void dw_mipi_dsi_unbind(struct device *dev, struct 
device *master,
struct dw_mipi_dsi *dsi = dev_get_drvdata(dev);
 
mipi_dsi_host_unregister(&dsi->dsi_host);
+   pm_runtime_disable(dev);
clk_disable_unprepare(dsi->pllref_clk);
 }
 
-- 
2.6.3

[PATCH 6/7] drm/rockchip: dw-mipi: fix phy clk lane stop state timeout

Before phy init, the detection of phy state should be controlled
manually. After that, we can switch the detection to hardward,
it is automatic. Hence move PHY_TXREQUESTCLKHS setting to the end
of phy init.

Signed-off-by: Chris Zhong 
---

 drivers/gpu/drm/rockchip/dw-mipi-dsi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c 
b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
index 72d7f48..8401185 100644
--- a/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
+++ b/drivers/gpu/drm/rockchip/dw-mipi-dsi.c
@@ -477,6 +477,8 @@ static int dw_mipi_dsi_phy_init(struct dw_mipi_dsi *dsi)
dev_err(dsi->dev,
"failed to wait for phy clk lane stop state\n");
 
+   dsi_write(dsi, DSI_LPCLK_CTRL, PHY_TXREQUESTCLKHS);
+
 phy_init_end:
if (!IS_ERR(dsi->phy_cfg_clk))
clk_disable_unprepare(dsi->phy_cfg_clk);
@@ -714,7 +716,6 @@ static void dw_mipi_dsi_init(struct dw_mipi_dsi *dsi)
  | PHY_RSTZ | PHY_SHUTDOWNZ);
dsi_write(dsi, DSI_CLKMGR_CFG, TO_CLK_DIVIDSION(10) |
  TX_ESC_CLK_DIVIDSION(7));
-   dsi_write(dsi, DSI_LPCLK_CTRL, PHY_TXREQUESTCLKHS);
 }
 
 static void dw_mipi_dsi_dpi_config(struct dw_mipi_dsi *dsi,
-- 
2.6.3

Re: get_nohz_timer_target?

2016-07-08 Thread Thomas Gleixner

On Mon, 18 Apr 2016, Richard Cochran wrote:
> Looking at kernel/sched/core.c:get_nohz_timer_target(), I don't
> understand the change made in:
> 
> commit 9642d18eee2cd169b60c6ac0f20bda745b5a3d1e
> Author: Vatika Harlalka 
> Date:   Tue Sep 1 16:50:59 2015 +0200
> nohz: Affine unpinned timers to housekeepers
> 
> After that change, the code now reads like this:
> 
>   int i, cpu = smp_processor_id();
>   struct sched_domain *sd;
> 
>   if (!idle_cpu(cpu) && is_housekeeping_cpu(cpu))
>   return cpu;
> 
>   rcu_read_lock();
>   for_each_domain(cpu, sd) {
>   for_each_cpu(i, sched_domain_span(sd)) {
>   if (!idle_cpu(i) && is_housekeeping_cpu(cpu)) {
> --- ^^^
> Was this supposed to be 'i' instead?

Yes. Care to send a patch?

Thanks,

tglx

[PATCH 4/6] x86/mce: Fix mce_rdmsrl() warning message

From: Borislav Petkov 

The MSR address we're dumping in there should be in hex, otherwise we
get funsies like:

[0.016000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:428 
mce_rdmsrl+0xd9/0xe0
[0.016000] mce: Unable to read msr -1073733631!
   ^^^

Signed-off-by: Borislav Petkov 
---
 arch/x86/kernel/cpu/mcheck/mce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 16aebe737cae..2f7bb1f075c2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -425,7 +425,7 @@ static u64 mce_rdmsrl(u32 msr)
}
 
if (rdmsrl_safe(msr, &v)) {
-   WARN_ONCE(1, "mce: Unable to read msr %d!\n", msr);
+   WARN_ONCE(1, "mce: Unable to read msr 0x%x!\n", msr);
/*
 * Return zero in case the access faulted. This should
 * not happen normally but can happen if the CPU does
-- 
2.7.3

[PATCH 0/6] x86/RAS queue

From: Borislav Petkov 

Hi,

here's some more RAS stuff for 4.8.

Please queue,
thanks.

Aravind Gopalakrishnan (1):
  x86/mce/AMD: Increase size of bank_map type

Borislav Petkov (1):
  x86/mce: Fix mce_rdmsrl() warning message

Yazen Ghannam (4):
  x86/RAS/AMD: Reduce number of IPIs when prepping error injection
  x86/mce: Add support for new MCA_SYND register
  EDAC, mce_amd: Print syndrome register value on SMCA systems
  x86/RAS: Add syndrome support to mce_amd_inj

 arch/x86/include/asm/mce.h   |  5 ++-
 arch/x86/include/uapi/asm/mce.h  |  1 +
 arch/x86/kernel/cpu/mcheck/mce.c |  6 +++-
 arch/x86/kernel/cpu/mcheck/mce_amd.c |  5 ++-
 arch/x86/ras/mce_amd_inj.c   | 69 
 drivers/edac/mce_amd.c   | 14 ++--
 include/trace/events/mce.h   |  6 ++--
 7 files changed, 68 insertions(+), 38 deletions(-)

-- 
2.7.3

[PATCH 3/6] x86/mce: Add support for new MCA_SYND register

From: Yazen Ghannam 

Syndrome information is no longer contained in MCA_STATUS for SMCA
systems but in a new register.

Add a synd field to struct mce to hold MCA_SYND register value. Add it
to the end of struct mce to maintain compatibility with old versions of
mcelog. Also, add it to the respective tracepoint.

Signed-off-by: Yazen Ghannam 
Cc: Aravind Gopalakrishnan 
Cc: Ashok Raj 
Cc: linux-edac 
Cc: Steven Rostedt 
Cc: Tony Luck 
Cc: x86-ml 
Link: 
http://lkml.kernel.org/r/1467633035-32080-1-git-send-email-yazen.ghan...@amd.com
Signed-off-by: Borislav Petkov 
---
 arch/x86/include/asm/mce.h   | 5 -
 arch/x86/include/uapi/asm/mce.h  | 1 +
 arch/x86/kernel/cpu/mcheck/mce.c | 4 
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 3 +++
 include/trace/events/mce.h   | 6 --
 5 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 8bf766ef0e18..21bc5a3a4c89 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -40,9 +40,10 @@
 #define MCI_STATUS_AR   (1ULL<<55)  /* Action required */
 
 /* AMD-specific bits */
+#define MCI_STATUS_TCC (1ULL<<55)  /* Task context corrupt */
+#define MCI_STATUS_SYNDV   (1ULL<<53)  /* synd reg. valid */
 #define MCI_STATUS_DEFERRED(1ULL<<44)  /* uncorrected error, deferred 
exception */
 #define MCI_STATUS_POISON  (1ULL<<43)  /* access poisonous data */
-#define MCI_STATUS_TCC (1ULL<<55)  /* Task context corrupt */
 
 /*
  * McaX field if set indicates a given bank supports MCA extensions:
@@ -110,6 +111,7 @@
 #define MSR_AMD64_SMCA_MC0_MISC0   0xc0002003
 #define MSR_AMD64_SMCA_MC0_CONFIG  0xc0002004
 #define MSR_AMD64_SMCA_MC0_IPID0xc0002005
+#define MSR_AMD64_SMCA_MC0_SYND0xc0002006
 #define MSR_AMD64_SMCA_MC0_DESTAT  0xc0002008
 #define MSR_AMD64_SMCA_MC0_DEADDR  0xc0002009
 #define MSR_AMD64_SMCA_MC0_MISC1   0xc000200a
@@ -119,6 +121,7 @@
 #define MSR_AMD64_SMCA_MCx_MISC(x) (MSR_AMD64_SMCA_MC0_MISC0 + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_CONFIG(x)   (MSR_AMD64_SMCA_MC0_CONFIG + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_IPID(x) (MSR_AMD64_SMCA_MC0_IPID + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_SYND(x) (MSR_AMD64_SMCA_MC0_SYND + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_DESTAT(x)   (MSR_AMD64_SMCA_MC0_DESTAT + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_DEADDR(x)   (MSR_AMD64_SMCA_MC0_DEADDR + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_MISCy(x, y) ((MSR_AMD64_SMCA_MC0_MISC1 + y) + 
(0x10*(x)))
diff --git a/arch/x86/include/uapi/asm/mce.h b/arch/x86/include/uapi/asm/mce.h
index 2184943341bf..8c75fbc94c3f 100644
--- a/arch/x86/include/uapi/asm/mce.h
+++ b/arch/x86/include/uapi/asm/mce.h
@@ -26,6 +26,7 @@ struct mce {
__u32 socketid; /* CPU socket ID */
__u32 apicid;   /* CPU initial apic ID */
__u64 mcgcap;   /* MCGCAP MSR: machine check capabilities of CPU */
+   __u64 synd; /* MCA_SYND MSR: only valid on SMCA systems */
 };
 
 #define MCE_GET_RECORD_LEN   _IOR('M', 1, int)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 92e5e37d97bf..16aebe737cae 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -568,6 +568,7 @@ static void mce_read_aux(struct mce *m, int i)
 {
if (m->status & MCI_STATUS_MISCV)
m->misc = mce_rdmsrl(msr_ops.misc(i));
+
if (m->status & MCI_STATUS_ADDRV) {
m->addr = mce_rdmsrl(msr_ops.addr(i));
 
@@ -580,6 +581,9 @@ static void mce_read_aux(struct mce *m, int i)
m->addr <<= shift;
}
}
+
+   if (mce_flags.smca && (m->status & MCI_STATUS_SYNDV))
+   m->synd = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND(i));
 }
 
 static bool memory_error(struct mce *m)
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c 
b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 7b7f3be783d4..8b8c33a6e6a0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -479,6 +479,9 @@ __log_error(unsigned int bank, bool deferred_err, bool 
threshold_err, u64 misc)
if (m.status & MCI_STATUS_ADDRV)
rdmsrl(msr_addr, m.addr);
 
+   if (mce_flags.smca && (m.status & MCI_STATUS_SYNDV))
+   rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m.synd);
+
mce_log(&m);
 
wrmsrl(msr_status, 0);
diff --git a/include/trace/events/mce.h b/include/trace/events/mce.h
index 4cbbcef6baa8..8be5268caf28 100644
--- a/include/trace/events/mce.h
+++ b/include/trace/events/mce.h
@@ -20,6 +20,7 @@ TRACE_EVENT(mce_record,
__field(u64,status  )
__field(u64,addr)
__field(u64,misc)
+   __field(u64,synd)
__field(u64,ip  )

[PATCH 5/6] EDAC, mce_amd: Print syndrome register value on SMCA systems

From: Yazen Ghannam 

Print SyndV bit status and print the raw value of the MCA_SYND register.
Further decoding of the syndrome from struct mce.synd can be done in
other places where appropriate, e.g. DRAM ECC.

Boris: make the error stanza more compact by putting the error address
and syndrome on the same line:

  [Hardware Error]: Corrected error, no action required.
  [Hardware Error]: CPU:2 (17:0:0) MC4_STATUS[-|CE|-|PCC|AddrV|-|-|SyndV|CECC]: 
0x9620411e0117
  [Hardware Error]: Error Addr: 0x7f4c52e3, Syndrome: 0x
  [Hardware Error]: Invalid IP block specified.
  [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: RD

Signed-off-by: Yazen Ghannam 
Cc: Aravind Gopalakrishnan 
Cc: Tony Luck 
Cc: linux-edac 
Link: 
http://lkml.kernel.org/r/1467633035-32080-2-git-send-email-yazen.ghan...@amd.com
Signed-off-by: Borislav Petkov 
---
 drivers/edac/mce_amd.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 9b6800a79c7f..057ece577800 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -927,7 +927,7 @@ static void decode_smca_errors(struct mce *m)
size_t len;
 
if (rdmsr_safe(addr, &low, &high)) {
-   pr_emerg("Invalid IP block specified, error information is 
unreliable.\n");
+   pr_emerg(HW_ERR "Invalid IP block specified.\n");
return;
}
 
@@ -1078,6 +1078,8 @@ int amd_decode_mce(struct notifier_block *nb, unsigned 
long val, void *data)
u32 low, high;
u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
 
+   pr_cont("|%s", ((m->status & MCI_STATUS_SYNDV) ? "SyndV" : 
"-"));
+
if (!rdmsr_safe(addr, &low, &high) &&
(low & MCI_CONFIG_MCAX))
pr_cont("|%s", ((m->status & MCI_STATUS_TCC) ? "TCC" : 
"-"));
@@ -1091,12 +1093,18 @@ int amd_decode_mce(struct notifier_block *nb, unsigned 
long val, void *data)
pr_cont("]: 0x%016llx\n", m->status);
 
if (m->status & MCI_STATUS_ADDRV)
-   pr_emerg(HW_ERR "MC%d Error Address: 0x%016llx\n", m->bank, 
m->addr);
+   pr_emerg(HW_ERR "Error Addr: 0x%016llx", m->addr);
 
if (boot_cpu_has(X86_FEATURE_SMCA)) {
+   if (m->status & MCI_STATUS_SYNDV)
+   pr_cont(", Syndrome: 0x%016llx", m->synd);
+
+   pr_cont("\n");
+
decode_smca_errors(m);
goto err_code;
-   }
+   } else
+   pr_cont("\n");
 
if (!fam_ops)
goto err_code;
-- 
2.7.3

[PATCH 6/6] x86/RAS: Add syndrome support to mce_amd_inj

From: Yazen Ghannam 

Add a debugfs file which holds the error syndrome (written into
MCA_SYND) of an injected error. Only write it on SMCA systems. Update
README file, while at it.

Signed-off-by: Yazen Ghannam 
Cc: Aravind Gopalakrishnan 
Cc: Tony Luck 
Cc: linux-edac 
Cc: x86-ml 
Link: 
http://lkml.kernel.org/r/1467633035-32080-3-git-send-email-yazen.ghan...@amd.com
Signed-off-by: Borislav Petkov 
---
 arch/x86/ras/mce_amd_inj.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/ras/mce_amd_inj.c b/arch/x86/ras/mce_amd_inj.c
index 1104515d5ad2..ff8eb1a9ce6d 100644
--- a/arch/x86/ras/mce_amd_inj.c
+++ b/arch/x86/ras/mce_amd_inj.c
@@ -68,6 +68,7 @@ static int inj_##reg##_set(void *data, u64 val)   
\
 MCE_INJECT_SET(status);
 MCE_INJECT_SET(misc);
 MCE_INJECT_SET(addr);
+MCE_INJECT_SET(synd);
 
 #define MCE_INJECT_GET(reg)\
 static int inj_##reg##_get(void *data, u64 *val)   \
@@ -81,10 +82,12 @@ static int inj_##reg##_get(void *data, u64 *val)
\
 MCE_INJECT_GET(status);
 MCE_INJECT_GET(misc);
 MCE_INJECT_GET(addr);
+MCE_INJECT_GET(synd);
 
 DEFINE_SIMPLE_ATTRIBUTE(status_fops, inj_status_get, inj_status_set, "%llx\n");
 DEFINE_SIMPLE_ATTRIBUTE(misc_fops, inj_misc_get, inj_misc_set, "%llx\n");
 DEFINE_SIMPLE_ATTRIBUTE(addr_fops, inj_addr_get, inj_addr_set, "%llx\n");
+DEFINE_SIMPLE_ATTRIBUTE(synd_fops, inj_synd_get, inj_synd_set, "%llx\n");
 
 /*
  * Caller needs to be make sure this cpu doesn't disappear
@@ -258,6 +261,7 @@ static void prepare_msrs(void *info)
}
 
wrmsrl(MSR_AMD64_SMCA_MCx_MISC(b), i_mce.misc);
+   wrmsrl(MSR_AMD64_SMCA_MCx_SYND(b), i_mce.synd);
} else {
wrmsrl(MSR_IA32_MCx_STATUS(b), i_mce.status);
wrmsrl(MSR_IA32_MCx_ADDR(b), i_mce.addr);
@@ -275,6 +279,9 @@ static void do_inject(void)
if (i_mce.misc)
i_mce.status |= MCI_STATUS_MISCV;
 
+   if (i_mce.synd)
+   i_mce.status |= MCI_STATUS_SYNDV;
+
if (inj_type == SW_INJ) {
mce_inject_log(&i_mce);
return;
@@ -371,6 +378,9 @@ static const char readme_msg[] =
 "\t used for error thresholding purposes and its validity is indicated by\n"
 "\t MCi_STATUS[MiscV].\n"
 "\n"
+"synd:\t Set MCi_SYND: provide syndrome info about the error. Only valid on\n"
+"\t Scalable MCA systems, and its validity is indicated by 
MCi_STATUS[SyndV].\n"
+"\n"
 "addr:\t Error address value to be written to MCi_ADDR. Log address 
information\n"
 "\t associated with the error.\n"
 "\n"
@@ -420,6 +430,7 @@ static struct dfs_node {
{ .name = "status", .fops = &status_fops, .perm = S_IRUSR | S_IWUSR 
},
{ .name = "misc",   .fops = &misc_fops,   .perm = S_IRUSR | S_IWUSR 
},
{ .name = "addr",   .fops = &addr_fops,   .perm = S_IRUSR | S_IWUSR 
},
+   { .name = "synd",   .fops = &synd_fops,   .perm = S_IRUSR | S_IWUSR 
},
{ .name = "bank",   .fops = &bank_fops,   .perm = S_IRUSR | S_IWUSR 
},
{ .name = "flags",  .fops = &flags_fops,  .perm = S_IRUSR | S_IWUSR 
},
{ .name = "cpu",.fops = &extcpu_fops, .perm = S_IRUSR | S_IWUSR 
},
-- 
2.7.3

[PATCH 1/6] x86/mce/AMD: Increase size of bank_map type

From: Aravind Gopalakrishnan 

Change bank_map type from char to int since we now have more than eight
banks in a system.

Signed-off-by: Aravind Gopalakrishnan 
Cc: Aravind Gopalakrishnan 
Cc: Tony Luck 
Cc: linux-edac 
Link: 
http://lkml.kernel.org/r/1466462163-29008-1-git-send-email-yazen.ghan...@amd.com
Signed-off-by: Yazen Ghannam 
Signed-off-by: Borislav Petkov 
---
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c 
b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 10b0661651e0..7b7f3be783d4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -93,7 +93,7 @@ const char * const amd_df_mcablock_names[] = {
 EXPORT_SYMBOL_GPL(amd_df_mcablock_names);
 
 static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
-static DEFINE_PER_CPU(unsigned char, bank_map);/* see which banks are 
on */
+static DEFINE_PER_CPU(unsigned int, bank_map); /* see which banks are on */
 
 static void amd_threshold_interrupt(void);
 static void amd_deferred_error_interrupt(void);
-- 
2.7.3

[PATCH 2/6] x86/RAS/AMD: Reduce number of IPIs when prepping error injection

From: Yazen Ghannam 

We currently use wrmsr_on_cpu() 4 times when prepping for an error
injection. This will generate 4 IPIs for each MSR write. We can reduce
the number of IPIs to 1 by grouping the MSR writes and executing them
serially on the appropriate CPU.

Signed-off-by: Yazen Ghannam 
Suggested-by: Borislav Petkov 
Cc: Aravind Gopalakrishnan 
Cc: linux-edac 
Cc: Tony Luck 
Link: 
http://lkml.kernel.org/r/1466462347-31657-1-git-send-email-yazen.ghan...@amd.com
Signed-off-by: Borislav Petkov 
---
 arch/x86/ras/mce_amd_inj.c | 58 ++
 1 file changed, 28 insertions(+), 30 deletions(-)

diff --git a/arch/x86/ras/mce_amd_inj.c b/arch/x86/ras/mce_amd_inj.c
index e69f4701a076..1104515d5ad2 100644
--- a/arch/x86/ras/mce_amd_inj.c
+++ b/arch/x86/ras/mce_amd_inj.c
@@ -241,6 +241,31 @@ static void toggle_nb_mca_mst_cpu(u16 nid)
   __func__, PCI_FUNC(F3->devfn), NBCFG);
 }
 
+static void prepare_msrs(void *info)
+{
+   struct mce i_mce = *(struct mce *)info;
+   u8 b = i_mce.bank;
+
+   wrmsrl(MSR_IA32_MCG_STATUS, i_mce.mcgstatus);
+
+   if (boot_cpu_has(X86_FEATURE_SMCA)) {
+   if (i_mce.inject_flags == DFR_INT_INJ) {
+   wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(b), i_mce.status);
+   wrmsrl(MSR_AMD64_SMCA_MCx_DEADDR(b), i_mce.addr);
+   } else {
+   wrmsrl(MSR_AMD64_SMCA_MCx_STATUS(b), i_mce.status);
+   wrmsrl(MSR_AMD64_SMCA_MCx_ADDR(b), i_mce.addr);
+   }
+
+   wrmsrl(MSR_AMD64_SMCA_MCx_MISC(b), i_mce.misc);
+   } else {
+   wrmsrl(MSR_IA32_MCx_STATUS(b), i_mce.status);
+   wrmsrl(MSR_IA32_MCx_ADDR(b), i_mce.addr);
+   wrmsrl(MSR_IA32_MCx_MISC(b), i_mce.misc);
+   }
+
+}
+
 static void do_inject(void)
 {
u64 mcg_status = 0;
@@ -287,36 +312,9 @@ static void do_inject(void)
 
toggle_hw_mce_inject(cpu, true);
 
-   wrmsr_on_cpu(cpu, MSR_IA32_MCG_STATUS,
-(u32)mcg_status, (u32)(mcg_status >> 32));
-
-   if (boot_cpu_has(X86_FEATURE_SMCA)) {
-   if (inj_type == DFR_INT_INJ) {
-   wrmsr_on_cpu(cpu, MSR_AMD64_SMCA_MCx_DESTAT(b),
-(u32)i_mce.status, (u32)(i_mce.status >> 
32));
-
-   wrmsr_on_cpu(cpu, MSR_AMD64_SMCA_MCx_DEADDR(b),
-(u32)i_mce.addr, (u32)(i_mce.addr >> 32));
-   } else {
-   wrmsr_on_cpu(cpu, MSR_AMD64_SMCA_MCx_STATUS(b),
-(u32)i_mce.status, (u32)(i_mce.status >> 
32));
-
-   wrmsr_on_cpu(cpu, MSR_AMD64_SMCA_MCx_ADDR(b),
-(u32)i_mce.addr, (u32)(i_mce.addr >> 32));
-   }
-
-   wrmsr_on_cpu(cpu, MSR_AMD64_SMCA_MCx_MISC(b),
-(u32)i_mce.misc, (u32)(i_mce.misc >> 32));
-   } else {
-   wrmsr_on_cpu(cpu, MSR_IA32_MCx_STATUS(b),
-(u32)i_mce.status, (u32)(i_mce.status >> 32));
-
-   wrmsr_on_cpu(cpu, MSR_IA32_MCx_ADDR(b),
-(u32)i_mce.addr, (u32)(i_mce.addr >> 32));
-
-   wrmsr_on_cpu(cpu, MSR_IA32_MCx_MISC(b),
-(u32)i_mce.misc, (u32)(i_mce.misc >> 32));
-   }
+   i_mce.mcgstatus = mcg_status;
+   i_mce.inject_flags = inj_type;
+   smp_call_function_single(cpu, prepare_msrs, &i_mce, 0);
 
toggle_hw_mce_inject(cpu, false);
 
-- 
2.7.3

Re: [PATCH v2 03/22] usb: ulpi: Support device discovery via device properties

2016-07-08 Thread Peter Chen

On Thu, Jul 07, 2016 at 03:20:54PM -0700, Stephen Boyd wrote:
> @@ -39,6 +42,10 @@ static int ulpi_match(struct device *dev, struct 
> device_driver *driver)
>   struct ulpi *ulpi = to_ulpi_dev(dev);
>   const struct ulpi_device_id *id;
>  
> + /* Some ULPI devices don't have a product id so rely on OF match */
> + if (ulpi->id.product == 0)
> + return of_driver_match_device(dev, driver);
> +

How about using vendor id? It can't be 0, but pid may be 0.
See: http://www.linux-usb.org/usb.ids

> +static int ulpi_of_register(struct ulpi *ulpi)
> +{
> + struct device_node *np = NULL, *child;
> +
> + /* Find a ulpi bus underneath the parent or the parent of the parent */
> + if (ulpi->dev.parent->of_node)
> + np = of_find_node_by_name(ulpi->dev.parent->of_node, "ulpi");
> + else if (ulpi->dev.parent->parent && ulpi->dev.parent->parent->of_node)
> + np = of_find_node_by_name(ulpi->dev.parent->parent->of_node,
> +   "ulpi");
> + if (!np)
> + return 0;
> +
> + child = of_get_next_available_child(np, NULL);
> + if (!child)
> + return -EINVAL;

You may need to call of_node_put on parent (np), not on child node
below.

> +
> + ulpi->dev.of_node = child;
> +
> + return 0;
> +}
> +
> +static int ulpi_read_id(struct ulpi *ulpi)
>  {
>   int ret;
>  
> @@ -174,14 +218,39 @@ static int ulpi_register(struct device *dev, struct 
> ulpi *ulpi)
>   ulpi->id.product = ulpi_read(ulpi, ULPI_PRODUCT_ID_LOW);
>   ulpi->id.product |= ulpi_read(ulpi, ULPI_PRODUCT_ID_HIGH) << 8;
>  
> + return 0;
> +}
> +

What does this API for? Why it still needs to be called after
vid/pid gets from firmware?

> +static int ulpi_register(struct device *dev, struct ulpi *ulpi)
> +{
> + int ret;
> +
>   ulpi->dev.parent = dev;
>   ulpi->dev.bus = &ulpi_bus;
>   ulpi->dev.type = &ulpi_dev_type;
>   dev_set_name(&ulpi->dev, "%s.ulpi", dev_name(dev));
>  
> + if (IS_ENABLED(CONFIG_OF)) {
> + ret = ulpi_of_register(ulpi);
> + if (ret)
> + return ret;
> + }
> +
>   ACPI_COMPANION_SET(&ulpi->dev, ACPI_COMPANION(dev));
>  
> - request_module("ulpi:v%04xp%04x", ulpi->id.vendor, ulpi->id.product);
> + ret = device_property_read_u16(&ulpi->dev, "ulpi-vendor",
> +&ulpi->id.vendor);
> + ret |= device_property_read_u16(&ulpi->dev, "ulpi-product",
> + &ulpi->id.product);
> + if (ret) {
> + ret = ulpi_read_id(ulpi);
> + if (ret)
> + return ret;
> + }
> +
[...]

>  void ulpi_unregister_interface(struct ulpi *ulpi)
>  {
> + of_node_put(ulpi->dev.of_node);

[...]

-- 

Best Regards,
Peter Chen

Re: [PATCH] capabilities: add capability cgroup controller

2016-07-08 Thread Petr Mladek

On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
> On 07/07/16 09:16, Petr Mladek wrote:
> > On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
> >> The attached patch would make any uses of capabilities generate audit
> >> messages. It works for simple tests as you can see from the commit
> >> message, but unfortunately the call to audit_cgroup_list() deadlocks the
> >> system when booting a full blown OS. There's no deadlock when the call
> >> is removed.
> >>
> >> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
> >> already held earlier before entering audit_cgroup_list(). Holding the
> >> locks is however required by task_cgroup_from_root(). Is there any way
> >> to avoid this? For example, only print some kind of cgroup ID numbers
> >> (are there unique and stable IDs, available without locks?) for those
> >> cgroups where the task is registered in the audit message?
> > 
> > I am not sure if anyone know what really happens here. I suggest to
> > enable lockdep. It might detect possible deadlock even before it
> > really happens, see Documentation/locking/lockdep-design.txt
> > 
> > It can be enabled by
> > 
> >CONFIG_PROVE_LOCKING=y
> > 
> > It depends on
> > 
> > CONFIG_DEBUG_KERNEL=y
> > 
> > and maybe some more options, see lib/Kconfig.debug
> 
> Thanks a lot! I caught this stack dump:
> 
> starting version 230
> [3.416647] [ cut here ]
> [3.417310] WARNING: CPU: 0 PID: 95 at
> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
> lockdep_trace_alloc+0xb4/0xc0
> [3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [3.417923] Modules linked in:
> [3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
> [3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Debian-1.8.2-1 04/01/2014
> [3.418726]  0086 7970f3b0 8816fb00
> 813c9c45
> [3.418993]  8816fb50  8816fb40
> 81091e9b
> [3.419176]  0b3705e2c798 0046 0410
> 
> [3.419374] Call Trace:
> [3.419511]  [] dump_stack+0x67/0x92
> [3.419644]  [] __warn+0xcb/0xf0
> [3.419745]  [] warn_slowpath_fmt+0x5f/0x80
> [3.419868]  [] lockdep_trace_alloc+0xb4/0xc0
> [3.419988]  [] kmem_cache_alloc_node+0x42/0x600
> [3.420156]  [] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [3.420170]  [] __alloc_skb+0x5b/0x1d0
> [3.420170]  [] audit_log_start+0x29b/0x480
> [3.420170]  [] ? __lock_task_sighand+0x95/0x270
> [3.420170]  [] audit_log_cap_use+0x39/0xf0
> [3.420170]  [] ns_capable+0x45/0x70
> [3.420170]  [] capable+0x17/0x20
> [3.420170]  [] oom_score_adj_write+0x150/0x2f0
> [3.420170]  [] __vfs_write+0x37/0x160
> [3.420170]  [] ? update_fast_ctr+0x17/0x30
> [3.420170]  [] ? percpu_down_read+0x49/0x90
> [3.420170]  [] ? __sb_start_write+0xb7/0xf0
> [3.420170]  [] ? __sb_start_write+0xb7/0xf0
> [3.420170]  [] vfs_write+0xb8/0x1b0
> [3.420170]  [] ? __fget_light+0x66/0x90
> [3.420170]  [] SyS_write+0x58/0xc0
> [3.420170]  [] do_syscall_64+0x5c/0x300
> [3.420170]  [] entry_SYSCALL64_slow_path+0x25/0x25
> [3.420170] ---[ end trace fb586899fb556a5e ]---
> [3.447922] random: systemd-udevd urandom read with 3 bits of entropy
> available
> [4.014078] clocksource: Switched to clocksource tsc
> Begin: Loading essential drivers ... done.
> 
> This is with qemu and the boot continues normally. With real computer,
> there's no such output and system just seems to freeze.
> 
> Could it be possible that the deadlock happens because there's some IO
> towards /sys/fs/cgroup, which causes a capability check and that in turn
> causes locking problems when we try to print cgroup list?

The above warning is printed by the code from
kernel/locking/lockdep.c:2871

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
[...]
/* We're only interested __GFP_FS allocations for now */
if (!(gfp_mask & __GFP_FS))
return;

/*
 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
 */
if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
return;


The backtrace shows that your new audit_log_cap_use() is called
from vfs_write(). You might try to use audit_log_start() with
GFP_NOFS instead of GFP_KERNEL.

Note that this is rather intuitive advice. I still need to learn a lot
about memory management and kernel in general to be more sure about
a correct solution.

Best Regards,
Petr

Re: [RESEND PATCH v2 02/13] drivers: clk: st: Simplify clock binding of STiH4xx platforms

2016-07-08 Thread Gabriel Fernandez


Hi Mike,

On 07/08/2016 03:43 AM, Michael Turquette wrote:

Quoting Rob Herring (2016-06-19 08:04:58)

On Thu, Jun 16, 2016 at 11:20:22AM +0200, Gabriel Fernandez wrote:

This patch reworks the clock binding to avoid too much detail in DT.
Now we have only compatible string per type of clock
(remark from Rob https://lkml.org/lkml/2016/5/25/492)


I have no idea what the clock trees and clock controller in these chips
look like, so it's hard to say if the changes here are good. It still
looks like things are somewhat fine grained clocks in DT. I'll leave
it up to the platform maintainers to decide...

Is this series breaking ABI? If yes, why not do what Maxime did for the
Allwinner/sunxi clocks and just fully convert over to a
one-node-per-clock-controller binding? This one-node-per-clock stuff is
pretty unfortunate, and if we're deprecating platforms (patch #1) then
now might be a good time to re-evaluate the whole thing.


The goal of my patchset was to be aligned with DRM / KMS development and 
to offer
the possibility to make a correct video playback on STiH407/STiH410 
platform.

Our milestone is the 4.8 for that.

Currently people need these patches to work.
I'm not sure it's a good time to re-evaluate the whole thing.

Is it possible to re-evaluate later ?

Best regards,
Gabriel


Regards,
Mike

  

Signed-off-by: Gabriel Fernandez 
---
  .../devicetree/bindings/clock/st/st,clkgen-mux.txt |  2 +-
  .../devicetree/bindings/clock/st/st,clkgen-pll.txt | 11 ++--
  .../devicetree/bindings/clock/st/st,clkgen.txt |  2 +-
  .../devicetree/bindings/clock/st/st,quadfs.txt |  6 +--
  drivers/clk/st/clkgen-fsyn.c   | 41 ++
  drivers/clk/st/clkgen-mux.c| 28 --
  drivers/clk/st/clkgen-pll.c| 62 ++
  7 files changed, 65 insertions(+), 87 deletions(-)

diff --git a/Documentation/devicetree/bindings/clock/st/st,clkgen-mux.txt 
b/Documentation/devicetree/bindings/clock/st/st,clkgen-mux.txt
index 4d277d6..9a46cb1d7 100644
--- a/Documentation/devicetree/bindings/clock/st/st,clkgen-mux.txt
+++ b/Documentation/devicetree/bindings/clock/st/st,clkgen-mux.txt
@@ -10,7 +10,7 @@ This binding uses the common clock binding[1].
  Required properties:
  
  - compatible : shall be:

- "st,stih407-clkgen-a9-mux", "st,clkgen-mux"
+ "st,stih407-clkgen-a9-mux"
  
  - #clock-cells : from common clock binding; shall be set to 0.
  
diff --git a/Documentation/devicetree/bindings/clock/st/st,clkgen-pll.txt b/Documentation/devicetree/bindings/clock/st/st,clkgen-pll.txt

index c9fd674..be0b043 100644
--- a/Documentation/devicetree/bindings/clock/st/st,clkgen-pll.txt
+++ b/Documentation/devicetree/bindings/clock/st/st,clkgen-pll.txt
@@ -9,11 +9,10 @@ Base address is located to the parent node. See clock 
binding[2]
  Required properties:
  
  - compatible : shall be:

- "st,stih407-plls-c32-a0",   "st,clkgen-plls-c32"
- "st,stih407-plls-c32-a9",   "st,clkgen-plls-c32"
- "sst,plls-c32-cx_0","st,clkgen-plls-c32"
- "sst,plls-c32-cx_1","st,clkgen-plls-c32"
- "st,stih418-plls-c28-a9",   "st,clkgen-plls-c32"
+ "st,clkgen-pll0"
+ "st,clkgen-pll0"

Repeated. Supposed to be 0 and 1? This seems a bit generic, too.


+ "st,stih407-clkgen-plla9"
+ "st,stih418-clkgen-plla9"

Re: [PATCH net-next V4 0/6] switch to use tx skb array in tun

2016-07-08 Thread Jason Wang




On 2016年07月08日 14:19, Michael S. Tsirkin wrote:

On Wed, Jul 06, 2016 at 01:45:58PM -0400, Craig Gallek wrote:

>On Thu, Jun 30, 2016 at 2:45 AM, Jason Wang  wrote:

> >Hi all:
> >
> >This series tries to switch to use skb array in tun. This is used to
> >eliminate the spinlock contention between producer and consumer. The
> >conversion was straightforward: just introdce a tx skb array and use
> >it instead of sk_receive_queue.

>
>I'm seeing the splat below after this series.  I'm still wrapping my
>head around this code, but it appears to be happening because the
>tun_struct passed into tun_queue_resize is uninitialized.
>Specifically, iteration over the disabled list_head fails because prev
>= next = NULL.  This seems to happen when a startup script on my test
>machine changes the queue length.  I'll try to figure out what's
>happening, but if it's obvious to someone else from the stack, please
>let me know.

Don't see anything obvious. I'm traveling, will look at it when I'm back
unless it's fixed by then. Jason, any idea?



Looks like Craig has posted a fix to this:

http://patchwork.ozlabs.org/patch/645645/

Re: [PATCH v2 4/4] ACPI / button: Add document for ACPI control method lid device restrictions

2016-07-08 Thread Benjamin Tissoires

Hi,

On Thu, Jul 7, 2016 at 9:11 AM, Lv Zheng  wrote:
> There are many AML tables reporting wrong initial lid state, and some of
> them never reports lid state. As a proxy layer acting between, ACPI button
> driver is not able to handle all such cases, but need to re-define the
> usage model of the ACPI lid. That is:
> 1. It's initial state is not reliable;
> 2. There may not be open event;
> 3. Userspace should only take action against the close event which is
>reliable, always sent after a real lid close.
> This patch adds documentation of the usage model.
>
> Link: https://lkml.org/2016/3/7/460
> Link: https://github.com/systemd/systemd/issues/2087
> Signed-off-by: Lv Zheng 
> Cc: Bastien Nocera: 
> Cc: Benjamin Tissoires 
> Cc: linux-in...@vger.kernel.org
> ---
>  Documentation/acpi/acpi-lid.txt |   62 
> +++
>  1 file changed, 62 insertions(+)
>  create mode 100644 Documentation/acpi/acpi-lid.txt
>
> diff --git a/Documentation/acpi/acpi-lid.txt b/Documentation/acpi/acpi-lid.txt
> new file mode 100644
> index 000..7e4f7ed
> --- /dev/null
> +++ b/Documentation/acpi/acpi-lid.txt
> @@ -0,0 +1,62 @@
> +Usage Model of the ACPI Control Method Lid Device
> +
> +Copyright (C) 2016, Intel Corporation
> +Author: Lv Zheng 
> +
> +
> +Abstract:
> +
> +Platforms containing lids convey lid state (open/close) to OSPMs using a
> +control method lid device. To implement this, the AML tables issue
> +Notify(lid_device, 0x80) to notify the OSPMs whenever the lid state has
> +changed. The _LID control method for the lid device must be implemented to
> +report the "current" state of the lid as either "opened" or "closed".
> +
> +This document describes the restrictions and the expections of the Linux
> +ACPI lid device driver.
> +
> +
> +1. Restrictions of the returning value of the _LID control method
> +
> +The _LID control method is described to return the "current" lid state.
> +However the word of "current" has ambiguity, many AML tables return the lid
> +state upon the last lid notification instead of returning the lid state
> +upon the last _LID evaluation. There won't be difference when the _LID
> +control method is evaluated during the runtime, the problem is its initial
> +returning value. When the AML tables implement this control method with
> +cached value, the initial returning value is likely not reliable. There are
> +simply so many examples always retuning "closed" as initial lid state.
> +
> +2. Restrictions of the lid state change notifications
> +
> +There are many AML tables never notifying when the lid device state is
> +changed to "opened". But it is ensured that the AML tables always notify
> +"closed" when the lid state is changed to "closed". This is normally used
> +to trigger some system power saving operations on Windows. Since it is
> +fully tested, this notification is reliable for all AML tables.
> +
> +3. Expections for the userspace users of the ACPI lid device driver
> +
> +The userspace programs should stop relying on
> +/proc/acpi/button/lid/LID0/state to obtain the lid state. This file is only
> +used for the validation purpose.

I'd say: this file actually calls the _LID method described above. And
given the previous explanation, it is not reliable enough on some
platforms. So it is strongly advised for user-space program to not
solely rely on this file to determine the actual lid state.

> +
> +New userspace programs should rely on the lid "closed" notification to
> +trigger some power saving operations and may stop taking actions according
> +to the lid "opened" notification. A new input switch event - SW_ACPI_LID is
> +prepared for the new userspace to implement this ACPI control method lid
> +device specific logics.

That's not entirely what we discussed before (to prevent regressions):
- if the device doesn't have reliable LID switch state, then there
would be the new input event, and so userspace should only rely on
opened notifications.
- if the device has reliable switch information, the new input event
should not be exported and userspace knows that the current input
switch event is reliable.

Also, using a new "switch" event is a terrible idea. Switches have a
state (open/close) and you are using this to forward a single open
event. So using a switch just allows you to say to userspace you are
using the "new" LID meaning, but you'll still have to manually reset
the switch and you will have to document how this event is not a
switch.

Please use a simple KEY_LID_OPEN event you will send through
[input_key_event(KEY_LID_OPEN, 1), input_sync(),
input_key_event(KEY_LID_OPEN, 0), input_sync()], which userspace knows
how to handle.

> +
> +During the period the userspace hasn't been switched to use the new
> +SW_ACPI_LID event, Linux users can use the following boot parameter to
> +handle possible issues:
> +  button.lid_init_state=method:
> +   This is the default behavior of the Linux ACPI lid driver, Linux kernel
> +   reports the ini

[PATCH -v4 0/2] printk.devkmsg: Ratelimit it by default

From: Borislav Petkov 

Hi all,

sorry for spamming so quickly again and not waiting for a week before
resubmitting but I believe the stuff is ready for 4.8.

So here's v4 with all the minor review comments addressed.


Changelog:
--

v3:

here's v3 integrating Ingo's comments. The thing is called
printk.devkmsg= or printk_devkmsg now, depending on cmdline option or
sysctl.


v2:

here's v2 with the requested sysctl option kernel.printk_kmsg and
locking of the setting when printk.kmsg= is supplied on the command
line.

Patch 1 is unchanged.

Patch 2 has grown the sysctl addition.

v1:

Rostedt is busy so I took Linus' old patch and Steven's last v2 and
split and extended them with the comments people had on the last thread:

https://lkml.kernel.org/r/20160425145606.59832...@gandalf.local.home

I hope, at least.

So it is ratelimiting by default, with "on" and "off" cmdline options. I
called the option somewhat a bit shorter too: "printk.kmsg"

The current use cases of this and of which I'm aware are:

* debug the kernel and thus shut up all interfering input from
userspace, i.e. boot with "printk.kmsg=off"

* debug userspace (and by that I mean systemd) by booting with
"printk.kmsg=on" so that the ratelimiting is disabled and the kernel log
gets all the spew.

Thoughts?

Please queue,
thanks.

Borislav Petkov (2):
  ratelimit: Extend to print suppressed messages on release
  printk: Add kernel parameter to control writes to /dev/kmsg

 Documentation/kernel-parameters.txt |  6 +++
 Documentation/sysctl/kernel.txt | 14 ++
 include/linux/printk.h  |  7 +++
 include/linux/ratelimit.h   | 38 +---
 kernel/printk/printk.c  | 86 +
 kernel/sysctl.c |  9 
 lib/ratelimit.c | 10 +++--
 7 files changed, 153 insertions(+), 17 deletions(-)

-- 
2.7.3

[PATCH -v4 1/2] ratelimit: Extend to print suppressed messages on release

From: Borislav Petkov 

Extend the ratelimiting facility to print the amount of suppressed lines
when it is being released.

Separated from a previous patch by Linus.

Also, make the ON_RELEASE image not use "callbacks" as it is misleading.

Signed-off-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Franck Bui 
Cc: Greg Kroah-Hartman 
Cc: Ingo Molnar 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Uwe Kleine-König 
---
 include/linux/ratelimit.h | 38 +-
 lib/ratelimit.c   | 10 ++
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/include/linux/ratelimit.h b/include/linux/ratelimit.h
index 18102529254e..57c9e0622a38 100644
--- a/include/linux/ratelimit.h
+++ b/include/linux/ratelimit.h
@@ -2,11 +2,15 @@
 #define _LINUX_RATELIMIT_H
 
 #include 
+#include 
 #include 
 
 #define DEFAULT_RATELIMIT_INTERVAL (5 * HZ)
 #define DEFAULT_RATELIMIT_BURST10
 
+/* issue num suppressed message on exit */
+#define RATELIMIT_MSG_ON_RELEASE   BIT(0)
+
 struct ratelimit_state {
raw_spinlock_t  lock;   /* protect the state */
 
@@ -15,6 +19,7 @@ struct ratelimit_state {
int printed;
int missed;
unsigned long   begin;
+   unsigned long   flags;
 };
 
 #define RATELIMIT_STATE_INIT(name, interval_init, burst_init) {
\
@@ -34,12 +39,35 @@ struct ratelimit_state {
 static inline void ratelimit_state_init(struct ratelimit_state *rs,
int interval, int burst)
 {
+   memset(rs, 0, sizeof(*rs));
+
raw_spin_lock_init(&rs->lock);
-   rs->interval = interval;
-   rs->burst = burst;
-   rs->printed = 0;
-   rs->missed = 0;
-   rs->begin = 0;
+   rs->interval= interval;
+   rs->burst   = burst;
+}
+
+static inline void ratelimit_default_init(struct ratelimit_state *rs)
+{
+   return ratelimit_state_init(rs, DEFAULT_RATELIMIT_INTERVAL,
+   DEFAULT_RATELIMIT_BURST);
+}
+
+static inline void ratelimit_state_exit(struct ratelimit_state *rs)
+{
+   if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE))
+   return;
+
+   if (rs->missed) {
+   pr_warn("%s: %d output lines suppressed due to ratelimiting\n",
+   current->comm, rs->missed);
+   rs->missed = 0;
+   }
+}
+
+static inline void
+ratelimit_set_flags(struct ratelimit_state *rs, unsigned long flags)
+{
+   rs->flags = flags;
 }
 
 extern struct ratelimit_state printk_ratelimit_state;
diff --git a/lib/ratelimit.c b/lib/ratelimit.c
index 2c5de86460c5..08f8043cac61 100644
--- a/lib/ratelimit.c
+++ b/lib/ratelimit.c
@@ -46,12 +46,14 @@ int ___ratelimit(struct ratelimit_state *rs, const char 
*func)
rs->begin = jiffies;
 
if (time_is_before_jiffies(rs->begin + rs->interval)) {
-   if (rs->missed)
-   printk(KERN_WARNING "%s: %d callbacks suppressed\n",
-   func, rs->missed);
+   if (rs->missed) {
+   if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE)) {
+   pr_warn("%s: %d callbacks suppressed\n", func, 
rs->missed);
+   rs->missed = 0;
+   }
+   }
rs->begin   = jiffies;
rs->printed = 0;
-   rs->missed  = 0;
}
if (rs->burst && rs->burst > rs->printed) {
rs->printed++;
-- 
2.7.3

[PATCH -v4 2/2] printk: Add kernel parameter to control writes to /dev/kmsg

From: Borislav Petkov 

Add a "printk.devkmsg" kernel command line parameter which controls how
userspace writes into /dev/kmsg. It has three options:

* ratelimit - ratelimit logging from userspace.
* on  - unlimited logging from userspace
* off - logging from userspace gets ignored

The default setting is to ratelimit the messages written to it.

It additionally does not limit logging to /dev/kmsg while the system is
booting if we haven't disabled it on the command line.

This patch is based on previous patches from Linus and Steven.

In addition, we can control the logging from a lower priority
sysctl interface - kernel.printk_devkmsg={0,1,2} - with numeric values
corresponding to the options above.

That interface will succeed only if printk.devkmsg *hasn't* been supplied
on the command line. If it has, then printk.devkmsg is a one-time setting
which remains for the duration of the system lifetime.

Signed-off-by: Borislav Petkov 
Cc: Andrew Morton 
Cc: Franck Bui 
Cc: Greg Kroah-Hartman 
Cc: Ingo Molnar 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Uwe Kleine-König 
---
 Documentation/kernel-parameters.txt |  6 +++
 Documentation/sysctl/kernel.txt | 14 ++
 include/linux/printk.h  |  7 +++
 kernel/printk/printk.c  | 86 +
 kernel/sysctl.c |  9 
 5 files changed, 114 insertions(+), 8 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 82b42c958d1c..0b1fea56dd49 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3150,6 +3150,12 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
Format:   (1/Y/y=enable, 0/N/n=disable)
default: disabled
 
+   printk.devkmsg={on,off}
+   Control writing to /dev/kmsg.
+   on - unlimited logging to /dev/kmsg from userspace
+   off - logging to /dev/kmsg disabled
+   Default: ratelimited logging.
+
printk.time=Show timing data prefixed to each printk message line
Format:   (1/Y/y=enable, 0/N/n=disable)
 
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a3683ce2a2f3..dec84d90061c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -752,6 +752,20 @@ send before ratelimiting kicks in.
 
 ==
 
+printk_devkmsg:
+
+Control the logging to /dev/kmsg from userspace:
+
+0: default, ratelimited
+1: unlimited logging to /dev/kmsg from userspace
+2: logging to /dev/kmsg disabled
+
+The kernel command line parameter printk.devkmsg= overrides this and is
+a one-time setting until next reboot: once set, it cannot be changed by
+this sysctl interface anymore.
+
+==
+
 randomize_va_space:
 
 This option can be used to select the type of process address
diff --git a/include/linux/printk.h b/include/linux/printk.h
index f4da695fd615..e6bb50751504 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -171,6 +171,13 @@ extern bool printk_timed_ratelimit(unsigned long 
*caller_jiffies,
 extern int printk_delay_msec;
 extern int dmesg_restrict;
 extern int kptr_restrict;
+extern unsigned int devkmsg_log;
+
+struct ctl_table;
+
+extern int
+devkmsg_sysctl_set_loglvl(struct ctl_table *table, int write, void __user *buf,
+ size_t *lenp, loff_t *ppos);
 
 extern void wake_up_klogd(void);
 
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 60cdf6386763..1e71369780f0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -86,6 +86,55 @@ static struct lockdep_map console_lock_dep_map = {
 };
 #endif
 
+enum devkmsg_log_bits {
+   __DEVKMSG_LOG_BIT_ON = 0,
+   __DEVKMSG_LOG_BIT_OFF,
+   __DEVKMSG_LOG_BIT_LOCK,
+};
+
+enum devkmsg_log_masks {
+   DEVKMSG_LOG_MASK_ON = BIT(__DEVKMSG_LOG_BIT_ON),
+   DEVKMSG_LOG_MASK_OFF= BIT(__DEVKMSG_LOG_BIT_OFF),
+   DEVKMSG_LOG_MASK_LOCK   = BIT(__DEVKMSG_LOG_BIT_LOCK),
+};
+
+/* Keep both the 'on' and 'off' bits clear, i.e. ratelimit by default: */
+#define DEVKMSG_LOG_MASK_DEFAULT   0
+
+unsigned int __read_mostly devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
+static int __init control_devkmsg(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (!strncmp(str, "on", 2))
+   devkmsg_log = DEVKMSG_LOG_MASK_ON;
+   else if (!strncmp(str, "off", 3))
+   devkmsg_log = DEVKMSG_LOG_MASK_OFF;
+   else if (!strncmp(str, "ratelimit", 9))
+   devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
+   else
+   return -EINVAL;
+
+   /* Sysctl cannot change it anymore. */
+   devkmsg_log |= DEVKMSG_LOG_MASK_LO

Re: [PATCH 1/9] mm: Hardened usercopy

2016-07-08 Thread Arnd Bergmann

On Thursday, July 7, 2016 1:37:43 PM CEST Kees Cook wrote:
> >
> >> + /* Allow kernel bss region (if not marked as Reserved). */
> >> + if (ptr >= (const void *)__bss_start &&
> >> + end <= (const void *)__bss_stop)
> >> + return NULL;
> >
> > accesses to .data/.rodata/.bss are probably not performance critical,
> > so we could go further here and check the kallsyms table to ensure
> > that we are not spanning multiple symbols here.
> 
> Oh, interesting! Yeah, would you be willing to put together that patch
> and test it?

Not at the moment, sorry.

I've given it a closer look and unfortunately realized that kallsyms
today only covers .text and .init.text, so it's currently useless because
those sections are already disallowed.

We could extend kallsyms to also cover all other sections, but doing
that right will likely cause a number of problems (most likely
kallsyms size mismatch) that will have to be debugged first.\

I think it's doable but time-consuming. The check function should
actually be trivial:

static bool usercopy_spans_multiple_symbols(void *ptr, size_t len)
{
unsigned long size, offset; 

if (kallsyms_lookup_size_offset((unsigned long)ptr, &size, &offset))
return 0; /* no symbol found or kallsyms disabled */

if (size - offset <= len)
return 0; /* range is within one symbol */

return 1;
}

This part would also be trivial:

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 1f22a186c18c..e0f37212e2a9 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -50,6 +50,11 @@ static struct addr_range text_ranges[] = {
{ "_sinittext", "_einittext" },
{ "_stext_l1",  "_etext_l1"  }, /* Blackfin on-chip L1 inst SRAM */
{ "_stext_l2",  "_etext_l2"  }, /* Blackfin on-chip L2 SRAM */
+#ifdef CONFIG_HARDENED_USERCOPY
+   { "_sdata", "_edata" },
+   { "__bss_start", "__bss_stop" },
+   { "__start_rodata", "__end_rodata" },
+#endif
 };
 #define text_range_text (&text_ranges[0])
 #define text_range_inittext (&text_ranges[1])

but I fear that if you actually try that, things start falling apart
in a big way, so I didn't try ;-)

> I wonder if there are any cases where there are
> legitimate usercopys across multiple symbols.

The only possible use case I can think of is for reading out the entire
kernel memory from /dev/kmem, but your other checks in here already
define that as illegitimate. On that subject, we probably want to
make CONFIG_DEVKMEM mutually exclusive with CONFIG_HARDENED_USERCOPY.

Arnd

Re: [PATCH 1/6] x86/mce/AMD: Increase size of bank_map type


* Borislav Petkov  wrote:

> From: Aravind Gopalakrishnan 
> 
> Change bank_map type from char to int since we now have more than eight
> banks in a system.
> 
> Signed-off-by: Aravind Gopalakrishnan 
> Cc: Aravind Gopalakrishnan 
> Cc: Tony Luck 
> Cc: linux-edac 
> Link: 
> http://lkml.kernel.org/r/1466462163-29008-1-git-send-email-yazen.ghan...@amd.com
> Signed-off-by: Yazen Ghannam 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/kernel/cpu/mcheck/mce_amd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c 
> b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> index 10b0661651e0..7b7f3be783d4 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> @@ -93,7 +93,7 @@ const char * const amd_df_mcablock_names[] = {
>  EXPORT_SYMBOL_GPL(amd_df_mcablock_names);
>  
>  static DEFINE_PER_CPU(struct threshold_bank **, threshold_banks);
> -static DEFINE_PER_CPU(unsigned char, bank_map);  /* see which banks are 
> on */
> +static DEFINE_PER_CPU(unsigned int, bank_map);   /* see which banks are 
> on */

Btw., is there any check somewhere which printed a helpful warning when we 
exceeded the 8 banks limit - and which would prints a helpful warning if we 
ever 
exceed the 32 banks limit?

Thanks,

Ingo

Re: [PATCH v2 04/22] usb: chipidea: Only read/write OTGSC from one place

2016-07-08 Thread Peter Chen

On Thu, Jul 07, 2016 at 03:20:55PM -0700, Stephen Boyd wrote:
> With the id and vbus detection done via extcon we need to make
> sure we poll the status of OTGSC properly by considering what the
> extcon is saying, and not just what the register is saying. Let's
> move this hw_wait_reg() function to the only place it's used and
> simplify it for polling the OTGSC register. Then we can make
> certain we only use the hw_read_otgsc() API to read OTGSC, which
> will make sure we properly handle extcon events.
> 
> Cc: Peter Chen 
> Cc: Greg Kroah-Hartman 
> Cc: "Ivan T. Ivanov" 
> Fixes: 3ecb3e09b042 ("usb: chipidea: Use extcon framework for VBUS and ID 
> detect")
> Signed-off-by: Stephen Boyd 
> ---
>  drivers/usb/chipidea/ci.h   |  3 ---
>  drivers/usb/chipidea/core.c | 32 
>  drivers/usb/chipidea/otg.c  | 34 ++
>  3 files changed, 30 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/usb/chipidea/ci.h b/drivers/usb/chipidea/ci.h
> index cd414559040f..05bc4d631cb9 100644
> --- a/drivers/usb/chipidea/ci.h
> +++ b/drivers/usb/chipidea/ci.h
> @@ -428,9 +428,6 @@ int hw_port_test_set(struct ci_hdrc *ci, u8 mode);
>  
>  u8 hw_port_test_get(struct ci_hdrc *ci);
>  
> -int hw_wait_reg(struct ci_hdrc *ci, enum ci_hw_regs reg, u32 mask,
> - u32 value, unsigned int timeout_ms);
> -
>  void ci_platform_configure(struct ci_hdrc *ci);
>  
>  int dbg_create_files(struct ci_hdrc *ci);
> diff --git a/drivers/usb/chipidea/core.c b/drivers/usb/chipidea/core.c
> index 69426e644d17..01390e02ee53 100644
> --- a/drivers/usb/chipidea/core.c
> +++ b/drivers/usb/chipidea/core.c
> @@ -516,38 +516,6 @@ int hw_device_reset(struct ci_hdrc *ci)
>   return 0;
>  }
>  
> -/**
> - * hw_wait_reg: wait the register value
> - *
> - * Sometimes, it needs to wait register value before going on.
> - * Eg, when switch to device mode, the vbus value should be lower
> - * than OTGSC_BSV before connects to host.
> - *
> - * @ci: the controller
> - * @reg: register index
> - * @mask: mast bit
> - * @value: the bit value to wait
> - * @timeout_ms: timeout in millisecond
> - *
> - * This function returns an error code if timeout
> - */
> -int hw_wait_reg(struct ci_hdrc *ci, enum ci_hw_regs reg, u32 mask,
> - u32 value, unsigned int timeout_ms)
> -{
> - unsigned long elapse = jiffies + msecs_to_jiffies(timeout_ms);
> -
> - while (hw_read(ci, reg, mask) != value) {
> - if (time_after(jiffies, elapse)) {
> - dev_err(ci->dev, "timeout waiting for %08x in %d\n",
> - mask, reg);
> - return -ETIMEDOUT;
> - }
> - msleep(20);
> - }
> -
> - return 0;
> -}
> -
>  static irqreturn_t ci_irq(int irq, void *data)
>  {
>   struct ci_hdrc *ci = data;
> diff --git a/drivers/usb/chipidea/otg.c b/drivers/usb/chipidea/otg.c
> index 03b6743461d1..a6fc60934297 100644
> --- a/drivers/usb/chipidea/otg.c
> +++ b/drivers/usb/chipidea/otg.c
> @@ -104,7 +104,31 @@ void ci_handle_vbus_change(struct ci_hdrc *ci)
>   usb_gadget_vbus_disconnect(&ci->gadget);
>  }
>  
> -#define CI_VBUS_STABLE_TIMEOUT_MS 5000
> +/**
> + * When we switch to device mode, the vbus value should be lower
> + * than OTGSC_BSV before connecting to host.
> + *
> + * @ci: the controller
> + *
> + * This function returns an error code if timeout
> + */
> +static int hw_wait_otgsc_bsv(struct ci_hdrc *ci)

I think the function name should reflect "we wait for vbus lower than bsv"
Care to change one?

-- 

Best Regards,
Peter Chen

Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register

* Borislav Petkov  wrote:

> From: Yazen Ghannam 
> 
> Syndrome information is no longer contained in MCA_STATUS for SMCA
> systems but in a new register.
> 
> Add a synd field to struct mce to hold MCA_SYND register value. Add it
> to the end of struct mce to maintain compatibility with old versions of
> mcelog. Also, add it to the respective tracepoint.

>  /* AMD-specific bits */
> +#define MCI_STATUS_TCC   (1ULL<<55)  /* Task context corrupt */
> +#define MCI_STATUS_SYNDV (1ULL<<53)  /* synd reg. valid */

> --- a/arch/x86/include/uapi/asm/mce.h
> +++ b/arch/x86/include/uapi/asm/mce.h
> @@ -26,6 +26,7 @@ struct mce {
>   __u32 socketid; /* CPU socket ID */
>   __u32 apicid;   /* CPU initial apic ID */
>   __u64 mcgcap;   /* MCGCAP MSR: machine check capabilities of CPU */
> + __u64 synd; /* MCA_SYND MSR: only valid on SMCA systems */
>  };

So why does neither the changelog nor the code comment actually _explain_ this 
and 
give aa bit of a background about what 'syndrome information' is and why we 
want 
to have kernel support for it?

This is why I hate kernel tooling that is not part of the kernel tree - the 
mcelog 
patch (hopefully ...) would tell us more about all this - but it's separate and 
this patch does not tell us anything ...

Thanks,

Ingo

Re: [PATCH v2 3/4] ACPI / button: Add SW_ACPI_LID for new usage model

2016-07-08 Thread Benjamin Tissoires

On Thu, Jul 7, 2016 at 9:10 AM, Lv Zheng  wrote:
> There are many AML tables reporting wrong initial lid state, and some of
> them never reports lid state. As a proxy layer acting between, ACPI button
> driver is not able to handle all such cases, but need to re-define the
> usage model of the ACPI lid. That is:
> 1. It's initial state is not reliable;
> 2. There may not be open event;
> 3. Userspace should only take action against the close event which is
>reliable, always sent after a real lid close.
> This patch adds a new input key event so that new userspace programs can
> use it to handle this usage model correctly. And in the meanwhile, no old
> programs will be broken by the userspace changes.
>
> Link: https://lkml.org/2016/3/7/460
> Link: https://github.com/systemd/systemd/issues/2087
> Signed-off-by: Lv Zheng 
> Cc: Bastien Nocera: 
> Cc: Benjamin Tissoires 
> Cc: linux-in...@vger.kernel.org
> ---
>  drivers/acpi/button.c  |   20 ++--
>  include/linux/mod_devicetable.h|2 +-
>  include/uapi/linux/input-event-codes.h |3 ++-
>  3 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/acpi/button.c b/drivers/acpi/button.c
> index 148f4e5..4ef94d2 100644
> --- a/drivers/acpi/button.c
> +++ b/drivers/acpi/button.c
> @@ -130,7 +130,8 @@ static int acpi_lid_evaluate_state(struct acpi_device 
> *device)
> return lid_state ? 1 : 0;
>  }
>
> -static int acpi_lid_notify_state(struct acpi_device *device, int state)
> +static int acpi_lid_notify_state(struct acpi_device *device,
> +int state, bool notify_acpi)
>  {
> struct acpi_button *button = acpi_driver_data(device);
> int ret;
> @@ -138,6 +139,11 @@ static int acpi_lid_notify_state(struct acpi_device 
> *device, int state)
> /* input layer checks if event is redundant */
> input_report_switch(button->input, SW_LID, !state);
> input_sync(button->input);
> +   if (notify_acpi) {
> +   input_report_switch(button->input,
> +   SW_ACPI_LID, !state);
> +   input_sync(button->input);

If you use a switch, you'll never send subsequent open state if you
doesn't close it yourself.
See my comments in 5/5 and please use a KEY event instead.

> +   }
>
> if (state)
> pm_wakeup_event(&device->dev, 0);
> @@ -279,7 +285,8 @@ int acpi_lid_open(void)
>  }
>  EXPORT_SYMBOL(acpi_lid_open);
>
> -static int acpi_lid_update_state(struct acpi_device *device)
> +static int acpi_lid_update_state(struct acpi_device *device,
> +bool notify_acpi)
>  {
> int state;
>
> @@ -287,17 +294,17 @@ static int acpi_lid_update_state(struct acpi_device 
> *device)
> if (state < 0)
> return state;
>
> -   return acpi_lid_notify_state(device, state);
> +   return acpi_lid_notify_state(device, state, notify_acpi);
>  }
>
>  static void acpi_lid_initialize_state(struct acpi_device *device)
>  {
> switch (lid_init_state) {
> case ACPI_BUTTON_LID_INIT_OPEN:
> -   (void)acpi_lid_notify_state(device, 1);
> +   (void)acpi_lid_notify_state(device, 1, false);
> break;
> case ACPI_BUTTON_LID_INIT_METHOD:
> -   (void)acpi_lid_update_state(device);
> +   (void)acpi_lid_update_state(device, false);
> break;
> case ACPI_BUTTON_LID_INIT_IGNORE:
> default:
> @@ -317,7 +324,7 @@ static void acpi_button_notify(struct acpi_device 
> *device, u32 event)
> case ACPI_BUTTON_NOTIFY_STATUS:
> input = button->input;
> if (button->type == ACPI_BUTTON_TYPE_LID) {
> -   acpi_lid_update_state(device);
> +   acpi_lid_update_state(device, true);
> } else {
> int keycode;
>
> @@ -436,6 +443,7 @@ static int acpi_button_add(struct acpi_device *device)
>
> case ACPI_BUTTON_TYPE_LID:
> input_set_capability(input, EV_SW, SW_LID);
> +   input_set_capability(input, EV_SW, SW_ACPI_LID);

Can't we export this new event only if the _LID function is not
reliable? This could check for the module parameter lid_init_state and
only enable it for ACPI_BUTTON_LID_INIT_OPEN.

I really hope we will be able to find a reliable way to determine
whether or not the platform support reliable LID state. If not, there
might be a need to have a db of reliable switch platforms. This can be
set in the kernel or with a hwdb entry in userspace.

Cheers,
Benjamin

> break;
> }
>
> diff --git a/include/linux/mod_devicetable.h b/include/linux/mod_devicetable.h
> index 6e4c645..1014968 100644
> --- a/include/linux/mod_devicetable.h
> +++ b/include/linux/mod_devicetable.h
> @@ -291,7 +291,7 @@ struct pcmcia_device_id {
>  #define INPUT_DEVICE_ID_LED_MAX

[PATCH] bcma: define ChipCommon B MII registers

2016-07-08 Thread Rafał Miłecki

We don't have access to datasheets to document all the bits but we can
name these registers at least.

Signed-off-by: Rafał Miłecki 
---
 drivers/bcma/driver_chipcommon_b.c  | 10 ++
 include/linux/bcma/bcma_driver_chipcommon.h |  3 +++
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/bcma/driver_chipcommon_b.c 
b/drivers/bcma/driver_chipcommon_b.c
index c20b5f4..52c3d36 100644
--- a/drivers/bcma/driver_chipcommon_b.c
+++ b/drivers/bcma/driver_chipcommon_b.c
@@ -33,11 +33,13 @@ static bool bcma_wait_reg(struct bcma_bus *bus, void 
__iomem *addr, u32 mask,
 void bcma_chipco_b_mii_write(struct bcma_drv_cc_b *ccb, u32 offset, u32 value)
 {
struct bcma_bus *bus = ccb->core->bus;
+   void __iomem *mii = ccb->mii;
 
-   writel(offset, ccb->mii + 0x00);
-   bcma_wait_reg(bus, ccb->mii + 0x00, 0x0100, 0x, 100);
-   writel(value, ccb->mii + 0x04);
-   bcma_wait_reg(bus, ccb->mii + 0x00, 0x0100, 0x, 100);
+   writel(offset, mii + BCMA_CCB_MII_MNG_CTL);
+   bcma_wait_reg(bus, mii + BCMA_CCB_MII_MNG_CTL, 0x0100, 0x, 100);
+   writel(value, mii + BCMA_CCB_MII_MNG_CMD_DATA);
+   bcma_wait_reg(bus, mii + BCMA_CCB_MII_MNG_CMD_DATA, 0x0100, 0x,
+ 100);
 }
 EXPORT_SYMBOL_GPL(bcma_chipco_b_mii_write);
 
diff --git a/include/linux/bcma/bcma_driver_chipcommon.h 
b/include/linux/bcma/bcma_driver_chipcommon.h
index a5ac2ca..b20e3d5 100644
--- a/include/linux/bcma/bcma_driver_chipcommon.h
+++ b/include/linux/bcma/bcma_driver_chipcommon.h
@@ -504,6 +504,9 @@
 #define BCMA_CC_PMU1_PLL0_PC2_NDIV_INT_MASK0x1ff0
 #define BCMA_CC_PMU1_PLL0_PC2_NDIV_INT_SHIFT   20
 
+#define BCMA_CCB_MII_MNG_CTL   0x
+#define BCMA_CCB_MII_MNG_CMD_DATA  0x0004
+
 /* BCM4331 ChipControl numbers. */
 #define BCMA_CHIPCTL_4331_BT_COEXIST   BIT(0)  /* 0 disable */
 #define BCMA_CHIPCTL_4331_SECI BIT(1)  /* 0 SECI is disabled 
(JATG functional) */
-- 
1.8.4.5

Re: [PATCH 0/5] hwmon: New hwmon registration API

2016-07-08 Thread Punit Agrawal

Hi Guenter,

Guenter Roeck  writes:

> Up to now, each hwmon driver has to implement its own sysfs attributes.
> This requires a lot of template code, and distracts from the driver's
> core function to read and write chip registers.
>
> To be able to reduce driver complexity, move sensor attribute handling
> and thermal zone registration into the hwmon core. By using the new API,
> driver size is typically reduced by 20-50% depending on driver complexity
> and the number of sysfs attributes supported.
>
> The first patch of the series introduces the API as well as support
> for temperature sensors. Subsequent patches introduce support for
> voltage, current, power, energy, humidity, and fan speed sensors.
>
> The series was tested by converting several drivers (lm75, lm90, tmp102,
> tmp421, ltc4245) to the new API. Testing was done with with real chips
> as well as with the hwmon driver module test code available at
> https://github.com/groeck/module-tests.

I like this series - it takes all of the attributes' handling out of the
individual driver code and moving it to hwmon core.

Having attempted a port of scpi-hwmon.c, I think that driver will not
gain a big savings in line count. Though it'll help separate access to
sensors from sysfs related code - which I think is worth the change.

FWIW,

Acked-by: Punit Agrawal 

Thanks,
Punit

> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/6] x86/mce/AMD: Increase size of bank_map type

On Fri, Jul 08, 2016 at 11:21:35AM +0200, Ingo Molnar wrote:
> Btw., is there any check somewhere which printed a helpful warning when we 
> exceeded the 8 banks limit - and which would prints a helpful warning if we 
> ever 
> exceed the 32 banks limit?

__mcheck_cpu_cap_init().

And it'll be hard to exceed this limit as there are hw limitations in play.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Re: [PATCH v2 06/22] usb: chipidea: Add platform flag for wrapper phy management

2016-07-08 Thread Peter Chen

On Thu, Jul 07, 2016 at 03:20:57PM -0700, Stephen Boyd wrote:
> The ULPI phy on qcom platforms needs to be initialized and
> powered on after a USB reset and before we toggle the run/stop
> bit. Otherwise, the phy locks up and doesn't work properly.
> Therefore, add a flag to skip any phy power management in the
> core layer, leaving it up to the glue driver to manage.
> 
> Cc: Peter Chen 
> Cc: Greg Kroah-Hartman 
> Signed-off-by: Stephen Boyd 
> ---
>  drivers/usb/chipidea/core.c  | 6 ++
>  include/linux/usb/chipidea.h | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/usb/chipidea/core.c b/drivers/usb/chipidea/core.c
> index 01390e02ee53..532085a096d9 100644
> --- a/drivers/usb/chipidea/core.c
> +++ b/drivers/usb/chipidea/core.c
> @@ -361,6 +361,9 @@ static int _ci_usb_phy_init(struct ci_hdrc *ci)
>   */
>  static void ci_usb_phy_exit(struct ci_hdrc *ci)
>  {
> + if (ci->platdata->flags & CI_HDRC_OVERRIDE_PHY_CONTROL)
> + return;
> +
>   if (ci->phy) {
>   phy_power_off(ci->phy);
>   phy_exit(ci->phy);
> @@ -379,6 +382,9 @@ static int ci_usb_phy_init(struct ci_hdrc *ci)
>  {
>   int ret;
>  
> + if (ci->platdata->flags & CI_HDRC_OVERRIDE_PHY_CONTROL)
> + return 0;
> +

How you handle the code for PHY getting at probe?

-- 

Best Regards,
Peter Chen

RE: [PATCH v14 net-next 1/1] hv_sock: introduce Hyper-V Sockets

> From: Olaf Hering [mailto:o...@aepfle.de]
> Sent: Friday, July 8, 2016 0:02
> On Thu, Jun 30, Dexuan Cui wrote:
> 
> > +/* The MTU is 16KB per the host side's design. */
> > +struct hvsock_recv_buf {
> > +   unsigned int data_len;
> > +   unsigned int data_offset;
> > +
> > +   struct vmpipe_proto_header hdr;
> > +   u8 buf[PAGE_SIZE * 4];
> 
> Please use some macro related to the protocol rather than a Linux
> compiletime macro.
OK. I'll fix this.
 
> > +/* We send at most 4KB payload per VMBus packet. */
> > +struct hvsock_send_buf {
> > +   struct vmpipe_proto_header hdr;
> > +   u8 buf[PAGE_SIZE];
> 
> Same here.
OK. I'll fix this.

> > + * Copyright(c) 2016, Microsoft Corporation. All rights reserved.
> 
> Here the BSD license follows. I think its required/desired to also
> include a GPL blurb like it is done in many other files:
> ...
>  * Alternatively, this software may be distributed under the terms of
>  * the GNU General Public License ("GPL") version 2 as published by the
>  * Free Software Foundation.
> 
> 
> Otherwise the MODULE_LICENSE string might be incorrect.
I'll add the GPL blurb.
 
> > +   /* Hyper-V Sockets requires at least VMBus 4.0 */
> > +   if ((vmbus_proto_version >> 16) < 4) {
> > +   pr_err("failed to load: VMBus 4 or later is required\n");
> 
> I guess this mens WS 2016+, and loading in earlier host versions will
> trigger this path? I think a silent ENODEV is enough.
Yes. 
OK, I'll remove the pr_err().

> 
> > +   return -ENODEV;
> 
> Olaf

I'll post v15 shortly, which will address all the comments from Joe and Olaf.

Thanks,
-- Dexuan

[PATCH 04/34] mm, mmzone: clarify the usage of zone padding

Zone padding separates write-intensive fields used by page allocation,
compaction and vmstats but the comments are a little misleading and
need clarification.

Signed-off-by: Mel Gorman 
---
 include/linux/mmzone.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d4f5cac0a8c3..edafdaf62e90 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -477,20 +477,21 @@ struct zone {
unsigned long   wait_table_hash_nr_entries;
unsigned long   wait_table_bits;
 
+   /* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
+
/* free areas of different sizes */
struct free_areafree_area[MAX_ORDER];
 
/* zone flags, see below */
unsigned long   flags;
 
-   /* Write-intensive fields used from the page allocator */
+   /* Primarily protects free_area */
spinlock_t  lock;
 
+   /* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
 
-   /* Write-intensive fields used by page reclaim */
-
/*
 * When free pages are below this point, additional steps are taken
 * when reading the number of free pages to avoid per-cpu counter
-- 
2.6.4

[PATCH 01/34] mm, vmstat: add infrastructure for per-node vmstats

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
---
 drivers/base/node.c|  76 +++--
 include/linux/mm.h |   5 +
 include/linux/mmzone.h |  13 +++
 include/linux/vmstat.h |  92 +--
 mm/page_alloc.c|  10 +-
 mm/vmstat.c| 295 +
 mm/workingset.c|   9 +-
 7 files changed, 424 insertions(+), 76 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ed0ef0f69489..92d8e090c5b3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -74,16 +74,16 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(i.totalram),
   nid, K(i.freeram),
   nid, K(i.totalram - i.freeram),
-  nid, K(node_page_state(nid, NR_ACTIVE_ANON) +
-   node_page_state(nid, NR_ACTIVE_FILE)),
-  nid, K(node_page_state(nid, NR_INACTIVE_ANON) +
-   node_page_state(nid, NR_INACTIVE_FILE)),
-  nid, K(node_page_state(nid, NR_ACTIVE_ANON)),
-  nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
-  nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
-  nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
-  nid, K(node_page_state(nid, NR_UNEVICTABLE)),
-  nid, K(node_page_state(nid, NR_MLOCK)));
+  nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON) +
+   sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+  nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON) +
+   sum_zone_node_page_state(nid, 
NR_INACTIVE_FILE)),
+  nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_ANON)),
+  nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_ANON)),
+  nid, K(sum_zone_node_page_state(nid, NR_ACTIVE_FILE)),
+  nid, K(sum_zone_node_page_state(nid, NR_INACTIVE_FILE)),
+  nid, K(sum_zone_node_page_state(nid, NR_UNEVICTABLE)),
+  nid, K(sum_zone_node_page_state(nid, NR_MLOCK)));
 
 #ifdef CONFIG_HIGHMEM
n += sprintf(buf + n,
@@ -117,31 +117,31 @@ static ssize_t node_read_meminfo(struct device *dev,
   "Node %d ShmemPmdMapped: %8lu kB\n"
 #endif
,
-  nid, K(node_page_state(nid, NR_FILE_DIRTY)),
-  nid, K(node_page_state(nid, NR_WRITEBACK)),
-  nid, K(node_page_state(nid, NR_FILE_PAGES)),
-  nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-  nid, K(node_page_state(nid, NR_ANON_PAGES)),
+  nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
+  nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
+  nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
+  nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
+  nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
   nid, K(i.sharedram),
-  nid, node_page_state(nid, NR_KERNEL_STACK) *
+  nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
-  nid, K(node_page_state(nid, NR_PAGETABLE)),
-  nid, K(node_page_state(nid, NR_UNSTABLE_NFS)),
-  nid, K(node_page_state(nid, NR_BOUNCE)),
-  nid, K(node_page_state(nid, NR_WRITEBACK_TEMP)),
-  nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
-   node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
-  nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
+  nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
+  nid, K(sum_zone_node_page_state(nid, NR_UNSTABLE_NFS)),
+  nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
+  nid, K(sum_zone_node_

[PATCH 05/34] mm, vmscan: begin reclaiming pages on a per-node basis

This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
request of some description.  On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare.  Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged.  If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Signed-off-by: Mel Gorman 
Acked-by: Hillf Danton 
---
 mm/vmscan.c | 79 ++---
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
/* Scan (total_size >> priority) pages at once */
int priority;
 
+   /* The highest zone to isolate pages for reclaim from */
+   enum zone_type reclaim_idx;
+
unsigned int may_writepage:1;
 
/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
unsigned long nr_taken = 0;
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long scan, nr_pages;
+   LIST_HEAD(pages_skipped);
 
for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
 
VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+   if (page_zonenum(page) > sc->reclaim_idx) {
+   list_move(&page->lru, &pages_skipped);
+   continue;
+   }
+
switch (__isolate_lru_page(page, mode)) {
case 0:
nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long 
nr_to_scan,
}
}
 
+   /*
+* Splice any skipped pages to the start of the LRU list. Note that
+* this disrupts the LRU order when reclaiming for lower zones but
+* we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+* scanning would soon rescan the same pages to skip and put the
+* system at risk of premature OOM.
+*/
+   if (!list_empty(&pages_skipped))
+   list_splice(&pages_skipped, src);
*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone 
*zone,
}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-   bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+   enum zone_type classzone_idx)
 {
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+   struct zone *zone = &pgdat->node_zones[classzone_idx];
 
do {
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct 
scan_control *sc,
shrink_zone_memcg(zone, memcg, sc, &lru_pages);
zone_lru_pages += lru_pages;
 
-   if (memcg && is_classzone)
+   if (!global_reclaim(sc))
shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->nr_scanned - scanned,
lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct 
scan_control *sc,
 * Shrink the slab caches in the same proportion that
 * the eligible LRU pages were scanned.
 */
-   if (global_reclaim(sc) && is_classzone)
+   if (global_reclaim(sc))
shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
sc->nr_scanned - nr_scanned,
zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void shrink_zones(struct zonelist *zonelist, 
struct scan_control *sc)
unsigned long nr_soft_reclaimed;
unsigned long nr_sof

[PATCH 00/34] Move LRU page reclaim from zones to nodes v9

Minor changes this time

Changelog since v8
o Cosmetic cleanups to comments
o Calculate node vmstat threshold based on the largest zone in the node
o Align retry checks with decisions made by the OOM killer
o Avoid tricks with -1 and kswapd_classzone_idx
o More consistent handling of buffer_heads_over_limit

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in 
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
-

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

   4.7.0-rc4  4.7.0-rc4
  mmotm-20160623 nodelru-v9
Min  total-odr0-1   490.00 (  0.00%)   457.00 (  6.73%)
Min  total-odr0-2   347.00 (  0.00%)   329.00 (  5.19%)
Min  total-odr0-4   288.00 (  0.00%)   273.00 (  5.21%)
Min  total-odr0-8   251.00 (  0.00%)   239.00 (  4.78%)
Min  total-odr0-16  234.00 (  0.00%)   222.00 (  5.13%)
Min  total-odr0-32  223.00 (  0.00%)   211.00 (  5.38%)
Min  total-odr0-64  217.00 (  0.00%)   208.00 (  4.15%)
Min  total-odr0-128 214.00 (  0.00%)   204.00 (  4.67%)
Min  total-odr0-256 250.00 (  0.00%)   230.00 (  8.00%)
Min  total-odr0-512 271.00 (  0.00%)   269.00 (  0.74%)

[PATCH 13/34] mm, vmscan: make shrink_node decisions more node-centric

Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too
much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
---
 include/linux/memcontrol.h | 19 ---
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  2 +-
 mm/memcontrol.c|  4 ++--
 mm/page_alloc.c|  2 +-
 mm/vmscan.c| 59 ++
 mm/workingset.c|  6 ++---
 7 files changed, 52 insertions(+), 44 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 68f1121c8fe7..c13227d018f2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct 
zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
struct mem_cgroup_per_zone *mz;
struct lruvec *lruvec;
 
if (mem_cgroup_disabled()) {
-   lruvec = zone_lruvec(zone);
+   lruvec = node_lruvec(pgdat);
goto out;
}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, 
struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
-   return zone_lruvec(zone);
+   return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4062fa74526f..895c365e3259 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-   return &zone->zone_pgdat->lruvec;
+   return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct 
mem_cgroup *memcg,
  unsigned long nr_pages,
  gfp_t gfp_mask,
  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup 
*root_memcg,
}
continue;
}
-   total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-zone, &nr_scanned);
+   total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+   zone, &nr_scanned);
*total_scanned += nr_scanned;
if (!soft_limit_excess(root_memcg))
break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d25dc24f65f2..8215c51d5b23 100644
--- a/mm/page_all

[PATCH 09/34] mm, vmscan: simplify the logic deciding whether kswapd sleeps

kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c|   5 ++-
 mm/page_alloc.c|   2 +-
 mm/vmscan.c| 101 -
 4 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index edafdaf62e90..4062fa74526f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -668,8 +668,9 @@ typedef struct pglist_data {
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
   mem_hotplug_begin/end() */
-   int kswapd_max_order;
-   enum zone_type classzone_idx;
+   int kswapd_order;
+   enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 
arch_refresh_nodedata(nid, pgdat);
} else {
-   /* Reset the nr_zones and classzone_idx to 0 before reuse */
+   /* Reset the nr_zones, order and classzone_idx before reuse */
pgdat->nr_zones = 0;
-   pgdat->classzone_idx = 0;
+   pgdat->kswapd_order = 0;
+   pgdat->kswapd_classzone_idx = 0;
}
 
/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b84b85ae54ff..d25dc24f65f2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
unsigned long end_pfn = 0;
 
/* pg_data_t should be reset to zero when it's allocated */
-   WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+   WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..905c60473126 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
/* kswapd must be awake if processes are being throttled */
if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-   pgdat->classzone_idx = min(pgdat->classzone_idx,
+   pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
(enum zone_type)ZONE_NORMAL);
wake_up_interruptible(&pgdat->kswapd_wait);
}
@@ -3042,11 +3042,11 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, order, classzone_idx))
-   return true;
+   if (!zone_balanced(zone, order, classzone_idx))
+   return false;
}
 
-   return false;
+   return true;
 }
 
 /*
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-   int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int 
reclaim_order,
+   unsigned int classzone_idx)
 {
long remaining = 0;
DEFINE_WAIT(wait);
@@ -3250,8 +3250,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
/* Try to sleep for a short interval */
-   if (prepare_kswapd_sleep(pgdat, order, remaining,
-   balanced_classzone_idx)) {
+   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, 
classzone_idx)) {
/*
 * Compaction records what page blocks it recently failed to
 * isolate pages from and skips them in the future scanning.
@@ -3264,9 +3263,20 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 * We have freed the memory, now we should compact it to make
 * allocation of the requested order possible.
 *

[PATCH 06/34] mm, vmscan: have kswapd only scan based on the highest requested zone

kswapd checks all eligible zones to see if they need balancing even if it
was woken for a lower zone.  This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly so avoid
age-inversion problems.  Ideally this is completely unnecessary when
reclaiming on a per-node basis.  In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved in
higher zones but this should be the exceptional case.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 mm/vmscan.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 766b36bec829..c6e61dae382b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3209,11 +3209,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
 
sc.nr_reclaimed = 0;
 
-   /*
-* Scan in the highmem->dma direction for the highest
-* zone which needs scanning
-*/
-   for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+   /* Scan from the highest requested zone to dma */
+   for (i = classzone_idx; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;
 
if (!populated_zone(zone))
-- 
2.6.4

Re: [PATCH 3/6] x86/mce: Add support for new MCA_SYND register

On Fri, Jul 08, 2016 at 11:26:59AM +0200, Ingo Molnar wrote:
> So why does neither the changelog nor the code comment actually _explain_ 
> this and 
> give aa bit of a background about what 'syndrome information' is and why we 
> want 
> to have kernel support for it?
> 
> This is why I hate kernel tooling that is not part of the kernel tree - the 
> mcelog 
> patch (hopefully ...) would tell us more about all this - but it's separate 
> and 
> this patch does not tell us anything ...

Ah, this is one of those omissions where we forgot to explain, sorry.
How about this:

"The syndrome value is used to uniquely identify which bits of a
reported ECC error are corrupted."

Do you want it as a comment in the code or in the commit message or both?

Thanks.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

[PATCH 02/34] mm, vmscan: move lru_lock to the node

Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming from
different zones.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Reviewed-by: Minchan Kim 
---
 Documentation/cgroup-v1/memcg_test.txt |  4 +--
 Documentation/cgroup-v1/memory.txt |  4 +--
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h | 10 +--
 mm/compaction.c| 10 +++
 mm/filemap.c   |  4 +--
 mm/huge_memory.c   |  6 ++---
 mm/memcontrol.c|  6 ++---
 mm/mlock.c | 10 +++
 mm/page_alloc.c|  4 +--
 mm/page_idle.c |  4 +--
 mm/rmap.c  |  2 +-
 mm/swap.c  | 30 ++---
 mm/vmscan.c| 48 +-
 14 files changed, 75 insertions(+), 69 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt 
b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume 
CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 Each memcg has its own private LRU. Now, its handling is under global
-   VM's control (means that it's handled under global zone->lru_lock).
+   VM's control (means that it's handled under global zone_lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
-   list management functions under zone->lru_lock().
+   list management functions under zone_lru_lock().
 
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index b14abf217239..946e69103cdd 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be 
delivered.
Other lock order is following:
PG_locked.
mm->page_table_lock
-   zone->lru_lock
+   zone_lru_lock
  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d3285b..ca2ed9a6c8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,7 +118,7 @@ struct page {
 */
union {
struct list_head lru;   /* Pageout list, eg. active_list
-* protected by zone->lru_lock !
+* protected by zone_lru_lock !
 * Can be used as a generic list
 * by the page owner.
 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 078ecb81e209..cfa870107abe 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,7 +93,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -496,7 +496,6 @@ struct zone {
/* Write-intensive fields used by page reclaim */
 
/* Fields commonly accessed by the page reclaim scanner */
-   spinlock_t  lru_lock;
struct lruvec   lruvec;
 
/*
@@ -690,6 +689,9 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
 #endif
+   /* Write-intensive fields used by page reclaim */
+   ZONE_PADDING(_pad1_)
+   spinlock_t  lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
@@ -721,6 +723,10 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
+static inline spinlock_t *zone_lru_lock(struct zone *zone)
+{
+   return &zone->zone_pgdat->lru_lock;
+}
 
 stati

[PATCH 03/34] mm, vmscan: move LRU lists to node

This moves the LRU lists from the zone to the node and related data
such as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is necessary
to account for the number of LRU pages on both zone and node logic.
Most reclaim logic is based on the node counters but the retry logic uses
the zone counters which do not distinguish inactive and active sizes.
It would be possible to leave the LRU counters on a per-zone basis but
it's a heavier calculation across multiple cache lines that is much more
frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed later
but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
---
 arch/tile/mm/pgtable.c|   8 +-
 drivers/base/node.c   |  19 +--
 drivers/staging/android/lowmemorykiller.c |   8 +-
 include/linux/backing-dev.h   |   2 +-
 include/linux/memcontrol.h|  18 +--
 include/linux/mm_inline.h |  21 ++-
 include/linux/mmzone.h|  68 +
 include/linux/swap.h  |   1 +
 include/linux/vm_event_item.h |  10 +-
 include/linux/vmstat.h|  17 +++
 include/trace/events/vmscan.h |  12 +-
 kernel/power/snapshot.c   |  10 +-
 mm/backing-dev.c  |  15 +-
 mm/compaction.c   |  18 +--
 mm/huge_memory.c  |   2 +-
 mm/internal.h |   2 +-
 mm/khugepaged.c   |   4 +-
 mm/memcontrol.c   |  17 +--
 mm/memory-failure.c   |   4 +-
 mm/memory_hotplug.c   |   2 +-
 mm/mempolicy.c|   2 +-
 mm/migrate.c  |  21 +--
 mm/mlock.c|   2 +-
 mm/page-writeback.c   |   8 +-
 mm/page_alloc.c   |  68 +
 mm/swap.c |  50 +++
 mm/vmscan.c   | 226 +-
 mm/vmstat.c   |  47 ---
 mm/workingset.c   |   4 +-
 29 files changed, 386 insertions(+), 300 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index c4d5bf841a7f..9e389213580d 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -45,10 +45,10 @@ void show_mem(unsigned int filter)
struct zone *zone;
 
pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu 
free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu 
swap:%lu\n",
-  (global_page_state(NR_ACTIVE_ANON) +
-   global_page_state(NR_ACTIVE_FILE)),
-  (global_page_state(NR_INACTIVE_ANON) +
-   global_page_state(NR_INACTIVE_FILE)),
+  (global_node_page_state(NR_ACTIVE_ANON) +
+   global_node_page_state(NR_ACTIVE_FILE)),
+  (global_node_page_state(NR_INACTIVE_ANON) +
+   global_node_page_state(NR_INACTIVE_FILE)),
   global_page_state(NR_FILE_DIRTY),
   global_page_state(NR_WRITEBACK),
   global_page_state(NR_UNSTABLE_NFS),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 92d8e090c5b3..b7f01a4a642d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 {
int n;
int nid = dev->id;
+   struct pglist_data *pgdat = NODE_DATA(nid);
struct sysinfo i;
 
si_meminfo_node(&i, nid);
@@ -74,15 +75,15 @@ static ssize_t node_read_mem

[PATCH 12/34] mm: vmscan: do not reclaim from kswapd if there is any eligible zone

kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on lower
zones.  Now that we are reclaiming on a per-node basis, any eligible zone
can be used and pages will still be aged fairly.  This patch avoids
reclaiming excessively unless buffer_heads are over the limit and it's
necessary to reclaim from a higher zone than requested by the waker of
kswapd to relieve low memory pressure.

[hillf...@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: 
http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Signed-off-by: Hillf Danton 
Acked-by: Vlastimil Babka 
---
 mm/vmscan.c | 59 +++
 1 file changed, 27 insertions(+), 32 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b39b903bd14..b7a276f4b1b0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3144,31 +3144,39 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
 
sc.nr_reclaimed = 0;
 
-   /* Scan from the highest requested zone to dma */
-   for (i = classzone_idx; i >= 0; i--) {
-   zone = pgdat->node_zones + i;
-   if (!populated_zone(zone))
-   continue;
-
-   /*
-* If the number of buffer_heads in the machine
-* exceeds the maximum allowed level and this node
-* has a highmem zone, force kswapd to reclaim from
-* it to relieve lowmem pressure.
-*/
-   if (buffer_heads_over_limit && is_highmem_idx(i)) {
-   classzone_idx = i;
-   break;
-   }
+   /*
+* If the number of buffer_heads in the machine exceeds the
+* maximum allowed level then reclaim from all zones. This is
+* not specific to highmem as highmem may not exist but it is
+* it is expected that buffer_heads are stripped in writeback.
+*/
+   if (buffer_heads_over_limit) {
+   for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
+   zone = pgdat->node_zones + i;
+   if (!populated_zone(zone))
+   continue;
 
-   if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
}
}
 
-   if (i < 0)
-   goto out;
+   /*
+* Only reclaim if there are no eligible zones. Check from
+* high to low zone as allocations prefer higher zones.
+* Scanning from low to high zone would allow congestion to be
+* cleared during a very small window when a small low
+* zone was balanced even under extreme pressure when the
+* overall node may be congested.
+*/
+   for (i = classzone_idx; i >= 0; i--) {
+   zone = pgdat->node_zones + i;
+   if (!populated_zone(zone))
+   continue;
+
+   if (zone_balanced(zone, sc.order, classzone_idx))
+   goto out;
+   }
 
/*
 * Do some background aging of the anon list, to give
@@ -3214,19 +3222,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
break;
 
/*
-* Stop reclaiming if any eligible zone is balanced and clear
-* node writeback or congested.
-*/
-   for (i = 0; i <= classzone_idx; i++) {
-   zone = pgdat->node_zones + i;
-   if (!populated_zone(zone))
-   continue;
-
-   if (zone_balanced(zone, sc.order, classzone_idx))
-   goto out;
-   }
-
-   /*
 * Raise priority if scanning rate is too low or there was no
 * progress in reclaiming pages
 */
-- 
2.6.4

[PATCH 11/34] mm, vmscan: remove duplicate logic clearing node congestion and dirty state

Reclaim may stall if there is too much dirty or congested data on a node.
This was previously based on zone flags and the logic for clearing the
flags is in two places.  As congestion/dirty tracking is now tracked on a
per-node basis, we can remove some duplicate logic.

Signed-off-by: Mel Gorman 
Acked-by: Hillf Danton 
---
 mm/vmscan.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 01fe4708e404..8b39b903bd14 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3008,7 +3008,17 @@ static bool zone_balanced(struct zone *zone, int order, 
int classzone_idx)
 {
unsigned long mark = high_wmark_pages(zone);
 
-   return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+   if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+   return false;
+
+   /*
+* If any eligible zone is balanced then the node is not considered
+* to be congested or dirty
+*/
+   clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+   clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+   return true;
 }
 
 /*
@@ -3154,13 +3164,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
-   } else {
-   /*
-* If any eligible zone is balanced then the
-* node is not considered congested or dirty.
-*/
-   clear_bit(PGDAT_CONGESTED, 
&zone->zone_pgdat->flags);
-   clear_bit(PGDAT_DIRTY, 
&zone->zone_pgdat->flags);
}
}
 
@@ -3219,11 +3222,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, sc.order, classzone_idx)) {
-   clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-   clear_bit(PGDAT_DIRTY, &pgdat->flags);
+   if (zone_balanced(zone, sc.order, classzone_idx))
goto out;
-   }
}
 
/*
-- 
2.6.4

[PATCH 23/34] mm: convert zone_reclaim to node_reclaim

As reclaim is now per-node based, convert zone_reclaim to be node_reclaim.
It is possible that a node will be reclaimed multiple times if it has
multiple zones but this is unavoidable without caching all nodes traversed
so far.  The documentation and interface to userspace is the same from a
configuration perspective and will will be similar in behaviour unless the
node-local allocation requests were also limited to lower zones.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 include/linux/mmzone.h   | 18 +--
 include/linux/swap.h |  9 +++---
 include/linux/topology.h |  2 +-
 kernel/sysctl.c  |  4 +--
 mm/internal.h|  8 ++---
 mm/khugepaged.c  |  4 +--
 mm/page_alloc.c  | 24 ++-
 mm/vmscan.c  | 77 
 8 files changed, 77 insertions(+), 69 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e3d6d42722a0..e19c081c794e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -372,14 +372,6 @@ struct zone {
unsigned long   *pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
-#ifdef CONFIG_NUMA
-   /*
-* zone reclaim becomes active if more unmapped pages exist.
-*/
-   unsigned long   min_unmapped_pages;
-   unsigned long   min_slab_pages;
-#endif /* CONFIG_NUMA */
-
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long   zone_start_pfn;
 
@@ -525,7 +517,6 @@ struct zone {
 } cacheline_internodealigned_in_smp;
 
 enum zone_flags {
-   ZONE_RECLAIM_LOCKED,/* prevents concurrent reclaim */
ZONE_FAIR_DEPLETED, /* fair zone policy batch depleted */
 };
 
@@ -540,6 +531,7 @@ enum pgdat_flags {
PGDAT_WRITEBACK,/* reclaim scanning has recently found
 * many pages under writeback
 */
+   PGDAT_RECLAIM_LOCKED,   /* prevents concurrent reclaim */
 };
 
 static inline unsigned long zone_end_pfn(const struct zone *zone)
@@ -688,6 +680,14 @@ typedef struct pglist_data {
 */
unsigned long   totalreserve_pages;
 
+#ifdef CONFIG_NUMA
+   /*
+* zone reclaim becomes active if more unmapped pages exist.
+*/
+   unsigned long   min_unmapped_pages;
+   unsigned long   min_slab_pages;
+#endif /* CONFIG_NUMA */
+
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
spinlock_t  lru_lock;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2a23ddc96edd..b17cc4830fa6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -326,13 +326,14 @@ extern int remove_mapping(struct address_space *mapping, 
struct page *page);
 extern unsigned long vm_total_pages;
 
 #ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
+extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
 #else
-#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+#define node_reclaim_mode 0
+static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
+   unsigned int order)
 {
return 0;
 }
diff --git a/include/linux/topology.h b/include/linux/topology.h
index afce69296ac0..cb0775e1ee4b 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -54,7 +54,7 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
  * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index de331c3858e5..6e47ebe5384e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1498,8 +1498,8 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
{
.procname   = "zone_reclaim_mode",
-   .data   = &zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
+   .data   = &node_reclaim_mode,
+   .maxlen = sizeof(node_reclaim_mode),
.mode   = 0644,
.proc_handler   = proc_dointvec,
.extra1 = &zero,
diff --git a/mm/internal.h b/mm/internal.h
index 2f80d0343c56..1e21b2d3838d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -433,10 +433,10 @@ static inline void 
mminit_validate_memmodel_limits(unsigned long *start_pfn,
 }
 #endif /* CONF

[PATCH 14/34] mm, memcg: move memcg limit enforcement from zones to nodes

Memcg needs adjustment after moving LRUs to the node. Limits are tracked
per memcg but the soft-limit excess is tracked per zone. As global page
reclaim is based on the node, it is easy to imagine a situation where
a zone soft limit is exceeded even though the memcg limit is fine.

This patch moves the soft limit tree the node.  Technically, all the variable
names should also change but people are already familiar by the meaning of
"mz" even if "mn" would be a more appropriate name now.

Signed-off-by: Mel Gorman 
Acked-by: Michal Hocko 
---
 include/linux/memcontrol.h |  38 -
 include/linux/swap.h   |   2 +-
 mm/memcontrol.c| 190 -
 mm/vmscan.c|  19 +++--
 mm/workingset.c|   6 +-
 5 files changed, 111 insertions(+), 144 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c13227d018f2..80bf8458148a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -61,7 +61,7 @@ enum mem_cgroup_stat_index {
 };
 
 struct mem_cgroup_reclaim_cookie {
-   struct zone *zone;
+   pg_data_t *pgdat;
int priority;
unsigned int generation;
 };
@@ -119,7 +119,7 @@ struct mem_cgroup_reclaim_iter {
 /*
  * per-zone information in memory controller.
  */
-struct mem_cgroup_per_zone {
+struct mem_cgroup_per_node {
struct lruvec   lruvec;
unsigned long   lru_size[NR_LRU_LISTS];
 
@@ -133,10 +133,6 @@ struct mem_cgroup_per_zone {
/* use container_of*/
 };
 
-struct mem_cgroup_per_node {
-   struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
 struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
unsigned long threshold;
@@ -315,19 +311,15 @@ void mem_cgroup_uncharge_list(struct list_head 
*page_list);
 
 void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
 
-static inline struct mem_cgroup_per_zone *
-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
+static struct mem_cgroup_per_node *
+mem_cgroup_nodeinfo(struct mem_cgroup *memcg, int nid)
 {
-   int nid = zone_to_nid(zone);
-   int zid = zone_idx(zone);
-
-   return &memcg->nodeinfo[nid]->zoneinfo[zid];
+   return memcg->nodeinfo[nid];
 }
 
 /**
  * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
  * @node: node of the wanted lruvec
- * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
  * Returns the lru list vector holding pages for a given @node or a given
@@ -335,9 +327,9 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct 
zone *zone)
  * is disabled.
  */
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-   struct zone *zone, struct mem_cgroup *memcg)
+   struct mem_cgroup *memcg)
 {
-   struct mem_cgroup_per_zone *mz;
+   struct mem_cgroup_per_node *mz;
struct lruvec *lruvec;
 
if (mem_cgroup_disabled()) {
@@ -345,7 +337,7 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
pglist_data *pgdat,
goto out;
}
 
-   mz = mem_cgroup_zone_zoneinfo(memcg, zone);
+   mz = mem_cgroup_nodeinfo(memcg, pgdat->node_id);
lruvec = &mz->lruvec;
 out:
/*
@@ -353,8 +345,8 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
pglist_data *pgdat,
 * we have to be prepared to initialize lruvec->pgdat here;
 * and if offlined then reonlined, we need to reinitialize it.
 */
-   if (unlikely(lruvec->pgdat != zone->zone_pgdat))
-   lruvec->pgdat = zone->zone_pgdat;
+   if (unlikely(lruvec->pgdat != pgdat))
+   lruvec->pgdat = pgdat;
return lruvec;
 }
 
@@ -447,9 +439,9 @@ unsigned long mem_cgroup_node_nr_lru_pages(struct 
mem_cgroup *memcg,
 static inline
 unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
 {
-   struct mem_cgroup_per_zone *mz;
+   struct mem_cgroup_per_node *mz;
 
-   mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+   mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
return mz->lru_size[lru];
 }
 
@@ -520,7 +512,7 @@ static inline void mem_cgroup_dec_page_stat(struct page 
*page,
mem_cgroup_update_page_stat(page, idx, -1);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
 
@@ -612,7 +604,7 @@ static inline void mem_cgroup_migrate(struct page *old, 
struct page *new)
 }
 
 static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
-   struct zone *zone, struct mem_cgroup *memcg)
+

[PATCH 17/34] mm: move page mapped accounting to the node

Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
---
 arch/tile/mm/pgtable.c |  2 +-
 drivers/base/node.c|  4 ++--
 fs/proc/meminfo.c  |  4 ++--
 include/linux/mmzone.h |  6 +++---
 mm/page_alloc.c|  6 +++---
 mm/rmap.c  | 14 +++---
 mm/vmscan.c|  2 +-
 mm/vmstat.c|  4 ++--
 8 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 9e389213580d..c606b0ef2f7e 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -55,7 +55,7 @@ void show_mem(unsigned int filter)
   global_page_state(NR_FREE_PAGES),
   (global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
-  global_page_state(NR_FILE_MAPPED),
+  global_node_page_state(NR_FILE_MAPPED),
   global_page_state(NR_PAGETABLE),
   global_page_state(NR_BOUNCE),
   global_page_state(NR_FILE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index b7f01a4a642d..acca09536ad9 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -121,8 +121,8 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(sum_zone_node_page_state(nid, NR_FILE_DIRTY)),
   nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
   nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
-  nid, K(sum_zone_node_page_state(nid, NR_FILE_MAPPED)),
-  nid, K(sum_zone_node_page_state(nid, NR_ANON_PAGES)),
+  nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
+  nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
   nid, K(i.sharedram),
   nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index cf301a9ef512..b8d52aa2f19a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,8 +140,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
-   K(global_page_state(NR_ANON_PAGES)),
-   K(global_page_state(NR_FILE_MAPPED)),
+   K(global_node_page_state(NR_ANON_PAGES)),
+   K(global_node_page_state(NR_FILE_MAPPED)),
K(i.sharedram),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
global_page_state(NR_SLAB_UNRECLAIMABLE)),
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fae2fe3c6942..95d34d1e1fb5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,9 +115,6 @@ enum zone_stat_item {
NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
NR_ZONE_LRU_FILE,
NR_MLOCK,   /* mlock()ed pages found and moved off LRU */
-   NR_ANON_PAGES,  /* Mapped anonymous pages */
-   NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
-  only modified from process context */
NR_FILE_PAGES,
NR_FILE_DIRTY,
NR_WRITEBACK,
@@ -164,6 +161,9 @@ enum node_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
+   NR_ANON_PAGES,  /* Mapped anonymous pages */
+   NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
+  only modified from process context */
NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e113a6ff9a0..78338b51819b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4355,7 +4355,7 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_UNSTABLE_NFS),
global_page_state(NR_SLAB_RECLAIMABLE),
global_page_state(NR_SLAB_UNRECLAIMABLE),
-   global_page_state(NR_FILE_MAPPED),
+   global_node_page_state(NR_FILE_MAPPED),
global_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
@@ -4377,6 +4377,7 @@ void show_free_areas(unsigned int filter)
" unevictable:%lukB"
" isolated(anon):%lukB"
" isolated(file):%lukB"
+   " mapped:%lukB"
" all_unreclaimable? %s"
"\n",
pgdat->node_id,
@@ -4387,6 +4388,7 @@ void show_free_areas(unsigned int filter)
K(node_page_state(pgdat, NR_UNEVICTABLE)),
K(node_page_state(pgdat, NR_ISOLATED_ANO

[PATCH 21/34] mm, vmscan: only wakeup kswapd once per node for the requested classzone

kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account.  Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes have
already been visited.  For zone-ordering, each node should be checked only
once.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 mm/page_alloc.c |  8 ++--
 mm/vmscan.c | 13 +++--
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fd34b305c8ee..bb261885c121 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3410,10 +3410,14 @@ static void wake_all_kswapds(unsigned int order, const 
struct alloc_context *ac)
 {
struct zoneref *z;
struct zone *zone;
+   pg_data_t *last_pgdat = NULL;
 
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
-   ac->high_zoneidx, ac->nodemask)
-   wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+   ac->high_zoneidx, ac->nodemask) {
+   if (last_pgdat != zone->zone_pgdat)
+   wakeup_kswapd(zone, order, ac_classzone_idx(ac));
+   last_pgdat = zone->zone_pgdat;
+   }
 }
 
 static inline unsigned int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ad670881d8d..cc820bbe9c01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3419,6 +3419,7 @@ static int kswapd(void *p)
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 {
pg_data_t *pgdat;
+   int z;
 
if (!populated_zone(zone))
return;
@@ -3430,8 +3431,16 @@ void wakeup_kswapd(struct zone *zone, int order, enum 
zone_type classzone_idx)
pgdat->kswapd_order = max(pgdat->kswapd_order, order);
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
-   if (zone_balanced(zone, order, 0))
-   return;
+
+   /* Only wake kswapd if all zones are unbalanced */
+   for (z = 0; z <= classzone_idx; z++) {
+   zone = pgdat->node_zones + z;
+   if (!populated_zone(zone))
+   continue;
+
+   if (zone_balanced(zone, order, classzone_idx))
+   return;
+   }
 
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
wake_up_interruptible(&pgdat->kswapd_wait);
-- 
2.6.4

[PATCH 24/34] mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node

shrink_node receives all information it needs about classzone_idx
from sc->reclaim_idx so remove the aliases.

Signed-off-by: Mel Gorman 
Acked-by: Hillf Danton 
---
 mm/vmscan.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e12b0fd2044c..bba71b6c9a4c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2426,8 +2426,7 @@ static inline bool should_continue_reclaim(struct 
pglist_data *pgdat,
return true;
 }
 
-static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
-   enum zone_type classzone_idx)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 {
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
@@ -2656,7 +2655,7 @@ static void shrink_zones(struct zonelist *zonelist, 
struct scan_control *sc)
if (zone->zone_pgdat == last_pgdat)
continue;
last_pgdat = zone->zone_pgdat;
-   shrink_node(zone->zone_pgdat, sc, classzone_idx);
+   shrink_node(zone->zone_pgdat, sc);
}
 
/*
@@ -3080,7 +3079,6 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
  * This is used to determine if the scanning priority needs to be raised.
  */
 static bool kswapd_shrink_node(pg_data_t *pgdat,
-  int classzone_idx,
   struct scan_control *sc)
 {
struct zone *zone;
@@ -3088,7 +3086,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 
/* Reclaim a number of pages proportional to the number of zones */
sc->nr_to_reclaim = 0;
-   for (z = 0; z <= classzone_idx; z++) {
+   for (z = 0; z <= sc->reclaim_idx; z++) {
zone = pgdat->node_zones + z;
if (!populated_zone(zone))
continue;
@@ -3100,7 +3098,7 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
 * Historically care was taken to put equal pressure on all zones but
 * now pressure is applied based on node LRU order.
 */
-   shrink_node(pgdat, sc, classzone_idx);
+   shrink_node(pgdat, sc);
 
/*
 * Fragmentation may mean that the system cannot be rebalanced for
@@ -3162,7 +3160,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
if (!populated_zone(zone))
continue;
 
-   classzone_idx = i;
+   sc.reclaim_idx = i;
break;
}
}
@@ -3175,12 +3173,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
 * zone was balanced even under extreme pressure when the
 * overall node may be congested.
 */
-   for (i = classzone_idx; i >= 0; i--) {
+   for (i = sc.reclaim_idx; i >= 0; i--) {
zone = pgdat->node_zones + i;
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, sc.order, classzone_idx))
+   if (zone_balanced(zone, sc.order, sc.reclaim_idx))
goto out;
}
 
@@ -3211,7 +3209,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
 * enough pages are already being scanned that that high
 * watermark would be met at 100% efficiency.
 */
-   if (kswapd_shrink_node(pgdat, classzone_idx, &sc))
+   if (kswapd_shrink_node(pgdat, &sc))
raise_priority = false;
 
/*
@@ -3674,7 +3672,7 @@ static int __node_reclaim(struct pglist_data *pgdat, 
gfp_t gfp_mask, unsigned in
 * priorities until we have enough memory freed.
 */
do {
-   shrink_node(pgdat, &sc, classzone_idx);
+   shrink_node(pgdat, &sc);
} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
}
 
-- 
2.6.4

[PATCH 28/34] mm, vmscan: add classzone information to tracepoints

This is convenient when tracking down why the skip count is high because
it'll show what classzone kswapd woke up at and what zones are being
isolated.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 include/trace/events/vmscan.h | 51 ++-
 mm/vmscan.c   | 14 +++-
 2 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 897f1aa1ee5f..c88fd0934e7e 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -55,21 +55,23 @@ TRACE_EVENT(mm_vmscan_kswapd_sleep,
 
 TRACE_EVENT(mm_vmscan_kswapd_wake,
 
-   TP_PROTO(int nid, int order),
+   TP_PROTO(int nid, int zid, int order),
 
-   TP_ARGS(nid, order),
+   TP_ARGS(nid, zid, order),
 
TP_STRUCT__entry(
__field(int,nid )
+   __field(int,zid )
__field(int,order   )
),
 
TP_fast_assign(
__entry->nid= nid;
+   __entry->zid= zid;
__entry->order  = order;
),
 
-   TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+   TP_printk("nid=%d zid=%d order=%d", __entry->nid, __entry->zid, 
__entry->order)
 );
 
 TRACE_EVENT(mm_vmscan_wakeup_kswapd,
@@ -98,47 +100,50 @@ TRACE_EVENT(mm_vmscan_wakeup_kswapd,
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_begin_template,
 
-   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int 
classzone_idx),
 
-   TP_ARGS(order, may_writepage, gfp_flags),
+   TP_ARGS(order, may_writepage, gfp_flags, classzone_idx),
 
TP_STRUCT__entry(
__field(int,order   )
__field(int,may_writepage   )
__field(gfp_t,  gfp_flags   )
+   __field(int,classzone_idx   )
),
 
TP_fast_assign(
__entry->order  = order;
__entry->may_writepage  = may_writepage;
__entry->gfp_flags  = gfp_flags;
+   __entry->classzone_idx  = classzone_idx;
),
 
-   TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+   TP_printk("order=%d may_writepage=%d gfp_flags=%s classzone_idx=%d",
__entry->order,
__entry->may_writepage,
-   show_gfp_flags(__entry->gfp_flags))
+   show_gfp_flags(__entry->gfp_flags),
+   __entry->classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, 
mm_vmscan_direct_reclaim_begin,
 
-   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int 
classzone_idx),
 
-   TP_ARGS(order, may_writepage, gfp_flags)
+   TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, 
mm_vmscan_memcg_reclaim_begin,
 
-   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int 
classzone_idx),
 
-   TP_ARGS(order, may_writepage, gfp_flags)
+   TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DEFINE_EVENT(mm_vmscan_direct_reclaim_begin_template, 
mm_vmscan_memcg_softlimit_reclaim_begin,
 
-   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+   TP_PROTO(int order, int may_writepage, gfp_t gfp_flags, int 
classzone_idx),
 
-   TP_ARGS(order, may_writepage, gfp_flags)
+   TP_ARGS(order, may_writepage, gfp_flags, classzone_idx)
 );
 
 DECLARE_EVENT_CLASS(mm_vmscan_direct_reclaim_end_template,
@@ -266,16 +271,18 @@ TRACE_EVENT(mm_shrink_slab_end,
 
 DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
 
-   TP_PROTO(int order,
+   TP_PROTO(int classzone_idx,
+   int order,
unsigned long nr_requested,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
int file),
 
-   TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+   TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_taken, 
isolate_mode, file),
 
TP_STRUCT__entry(
+   __field(int, classzone_idx)
__field(int, order)
__field(unsigned long, nr_requested)
__field(unsigned long, nr_scanned)
@@ -285,6 +292,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
),
 
TP_fast_assign(
+   __entry->classzone_idx = classzone_idx;
__entry->order = order;
__entry->nr_requested = nr_requested;
__entry->nr_scanned = nr_scanned;
@@ -293,8 +301,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->fil

[PATCH 26/34] mm, vmscan: avoid passing in remaining unnecessarily to prepare_kswapd_sleep

As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep
always passes in 0 for remaining and the second call can trivially
check the parameter in advance.

Suggested-by: Minchan Kim 
Signed-off-by: Mel Gorman 
---
 mm/vmscan.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d5c78e2312b..8a67aa53aa7b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3020,15 +3020,10 @@ static bool zone_balanced(struct zone *zone, int order, 
int classzone_idx)
  *
  * Returns true if kswapd is ready to sleep
  */
-static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
-   int classzone_idx)
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, int 
classzone_idx)
 {
int i;
 
-   /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
-   if (remaining)
-   return false;
-
/*
 * The throttled processes are normally woken up in balance_pgdat() as
 * soon as pfmemalloc_watermark_ok() is true. But there is a potential
@@ -3243,7 +3238,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
alloc_order, int reclaim_o
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
/* Try to sleep for a short interval */
-   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, 
classzone_idx)) {
+   if (prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
/*
 * Compaction records what page blocks it recently failed to
 * isolate pages from and skips them in the future scanning.
@@ -3278,7 +3273,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
alloc_order, int reclaim_o
 * After a short sleep, check if it was a premature sleep. If not, then
 * go fully to sleep until explicitly woken up.
 */
-   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, 
classzone_idx)) {
+   if (!remaining &&
+   prepare_kswapd_sleep(pgdat, reclaim_order, classzone_idx)) {
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
/*
-- 
2.6.4

[PATCH 16/34] mm, page_alloc: consider dirtyable memory in terms of nodes

Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Signed-off-by: Mel Gorman 
Signed-off-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
---
 include/linux/mmzone.h| 12 +++
 include/linux/writeback.h |  2 +-
 mm/page-writeback.c   | 91 +++
 mm/page_alloc.c   | 26 ++
 4 files changed, 79 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 62f477d6cfe8..fae2fe3c6942 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -363,12 +363,6 @@ struct zone {
struct pglist_data  *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;
 
-   /*
-* This is a per-zone reserve of pages that are not available
-* to userspace allocations.
-*/
-   unsigned long   totalreserve_pages;
-
 #ifndef CONFIG_SPARSEMEM
/*
 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
@@ -687,6 +681,12 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
 #endif
+   /*
+* This is a per-node reserve of pages that are not available
+* to userspace allocations.
+*/
+   unsigned long   totalreserve_pages;
+
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
spinlock_t  lru_lock;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 717e6149e753..fc1e16c25a29 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -320,7 +320,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
-bool zone_dirty_ok(struct zone *zone);
+bool node_dirty_ok(struct pglist_data *pgdat);
 int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
 #ifdef CONFIG_CGROUP_WRITEBACK
 void wb_domain_exit(struct wb_domain *dom);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0ada2b2954b0..f7c0fb993fb9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -267,26 +267,35 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
  */
 
 /**
- * zone_dirtyable_memory - number of dirtyable pages in a zone
- * @zone: the zone
+ * node_dirtyable_memory - number of dirtyable pages in a node
+ * @pgdat: the node
  *
- * Returns the zone's number of pages potentially available for dirty
- * page cache.  This is the base value for the per-zone dirty limits.
+ * Returns the node's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-node dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)
 {
-   unsigned long nr_pages;
+   unsigned long nr_pages = 0;
+   int z;
+
+   for (z = 0; z < MAX_NR_ZONES; z++) {
+   struct zone *zone = pgdat->node_zones + z;
+
+   if (!populated_zone(zone))
+   continue;
+
+   nr_pages += zone_page_state(zone, NR_FREE_PAGES);
+   }
 
-   nr_pages = zone_page_state(zone, NR_FREE_PAGES);
/*
 * Pages reserved for the kernel should not be considered
 * dirtyable, to prevent a situation where reclaim has to
 * clean pages in order to balance the zones.
 */
-   nr_pages -= min(nr_pages, zone->totalreserve_pages);
+   nr_pages -= min(nr_pages, pgdat->totalreserve_pages);
 
-   nr_pages += node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE);
-   nr_pages += node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE);
+   nr_pages += node_page_state(pgdat, NR_INACTIVE_FILE);
+   nr_pages += node_page_state(pgdat, NR_ACTIVE_FILE);
 
return nr_pages;
 }
@@ -299,13 +308,24 @@ static unsigned long highmem_dirtyable_memory(unsigned 
long total)
int i;
 
for_each_node_state(node, N_HIGH_MEMORY) {
-   for (i = 0; i < MAX_NR_ZONES; i++) {
-   struct zone *z = &NODE_DATA(node)->node_zones[i];
+   for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
+   struct zone *z;
+   unsigned long dirtyable;
+
+   if (!is_highmem_idx(i))
+   continue;
+
+   z = &NODE_DATA(node)->node_zones[i];
+   dirtyable = zone_page_state(z, NR_FREE_PAGES) +
+   zone_page_state(z, NR_ZONE_LRU_FILE);
 
-   if (is_highmem(z))
-   x += zone_dirtyable_memory(z);
+   /* watch for underflows */
+   dirtyable -= min(dirtyable, high_wmark_pages(z));

[PATCH 15/34] mm, workingset: make working set detection node-aware

Working set and refault detection is still zone-based, fix it.

Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
---
 include/linux/mmzone.h |  6 +++---
 include/linux/vmstat.h |  1 -
 mm/vmstat.c| 20 +++-
 mm/workingset.c| 43 ---
 4 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 895c365e3259..62f477d6cfe8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -145,9 +145,6 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
 #endif
-   WORKINGSET_REFAULT,
-   WORKINGSET_ACTIVATE,
-   WORKINGSET_NODERECLAIM,
NR_ANON_THPS,
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
@@ -164,6 +161,9 @@ enum node_stat_item {
NR_ISOLATED_ANON,   /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE,   /* Temporary isolated pages from file lru */
NR_PAGES_SCANNED,   /* pages scanned since last reclaim */
+   WORKINGSET_REFAULT,
+   WORKINGSET_ACTIVATE,
+   WORKINGSET_NODERECLAIM,
NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index fee321c98550..6b7975cd98aa 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -227,7 +227,6 @@ void mod_node_page_state(struct pglist_data *, enum 
node_stat_item, long);
 void inc_node_page_state(struct page *, enum node_stat_item);
 void dec_node_page_state(struct page *, enum node_stat_item);
 
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
 extern void inc_node_state(struct pglist_data *, enum node_stat_item);
 extern void __inc_zone_state(struct zone *, enum zone_stat_item);
 extern void __inc_node_state(struct pglist_data *, enum node_stat_item);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index de0c17076270..d17d66e85def 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -446,11 +446,6 @@ void mod_zone_page_state(struct zone *zone, enum 
zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-   mod_zone_state(zone, item, 1, 1);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
mod_zone_state(page_zone(page), item, 1, 1);
@@ -539,15 +534,6 @@ void mod_zone_page_state(struct zone *zone, enum 
zone_stat_item item,
 }
 EXPORT_SYMBOL(mod_zone_page_state);
 
-void inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
-   unsigned long flags;
-
-   local_irq_save(flags);
-   __inc_zone_state(zone, item);
-   local_irq_restore(flags);
-}
-
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
unsigned long flags;
@@ -967,9 +953,6 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
 #endif
-   "workingset_refault",
-   "workingset_activate",
-   "workingset_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_shmem_hugepages",
"nr_shmem_pmdmapped",
@@ -984,6 +967,9 @@ const char * const vmstat_text[] = {
"nr_isolated_anon",
"nr_isolated_file",
"nr_pages_scanned",
+   "workingset_refault",
+   "workingset_activate",
+   "workingset_nodereclaim",
 
/* enum writeback_stat_item counters */
"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 9a1016f5d500..56334e7d6924 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,7 +16,7 @@
 /*
  * Double CLOCK lists
  *
- * Per zone, two clock lists are maintained for file pages: the
+ * Per node, two clock lists are maintained for file pages: the
  * inactive and the active list.  Freshly faulted pages start out at
  * the head of the inactive list and page reclaim scans pages from the
  * tail.  Pages that are accessed multiple times on the inactive list
@@ -141,11 +141,11 @@
  *
  * Implementation
  *
- * For each zone's file LRU lists, a counter for inactive evictions
- * and activations is maintained (zone->inactive_age).
+ * For each node's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (node->inactive_age).
  *
  * On eviction, a snapshot of this counter (along with some bits to
- * identify the zone) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache radix tree
  * slot of the evicted page.  This is called a shadow entry.
  *
  * On cache misses for which there are shadow entries, an eligible
@@ -153,7 +153,7 @@
  */
 
 #define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
-ZONES_SHIFT + NODES_SHIFT +\
+NODES_SHIFT +  \
 MEM_CGROUP_ID_SHIFT)
 #define EVICTION_MASK  (~0UL >> EVICTION_SHIFT)
 
@@ -167,33 +167,

[PATCH 20/34] mm: move vmscan writes and file write accounting to the node

As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node.  For consistency,
also account page writes and page dirtying on a per-node basis.

After this patch, there are a few remaining zone counters that may appear
strange but are fine.  NUMA stats are still per-zone as this is a
user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
---
 include/linux/mmzone.h   | 8 
 include/trace/events/writeback.h | 4 ++--
 mm/page-writeback.c  | 6 +++---
 mm/vmscan.c  | 4 ++--
 mm/vmstat.c  | 8 
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index acd4665c3025..e3d6d42722a0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -122,10 +122,6 @@ enum zone_stat_item {
NR_KERNEL_STACK,
/* Second 128 byte cacheline */
NR_BOUNCE,
-   NR_VMSCAN_WRITE,
-   NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */
-   NR_DIRTIED, /* page dirtyings since bootup */
-   NR_WRITTEN, /* page writings since bootup */
 #if IS_ENABLED(CONFIG_ZSMALLOC)
NR_ZSPAGES, /* allocated in zsmalloc */
 #endif
@@ -165,6 +161,10 @@ enum node_stat_item {
NR_SHMEM_PMDMAPPED,
NR_ANON_THPS,
NR_UNSTABLE_NFS,/* NFS unstable pages */
+   NR_VMSCAN_WRITE,
+   NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */
+   NR_DIRTIED, /* page dirtyings since bootup */
+   NR_WRITTEN, /* page writings since bootup */
NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index ad20f2d2b1f9..2ccd9ccbf9ef 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -415,8 +415,8 @@ TRACE_EVENT(global_dirty_state,
__entry->nr_dirty   = global_node_page_state(NR_FILE_DIRTY);
__entry->nr_writeback   = global_node_page_state(NR_WRITEBACK);
__entry->nr_unstable= 
global_node_page_state(NR_UNSTABLE_NFS);
-   __entry->nr_dirtied = global_page_state(NR_DIRTIED);
-   __entry->nr_written = global_page_state(NR_WRITTEN);
+   __entry->nr_dirtied = global_node_page_state(NR_DIRTIED);
+   __entry->nr_written = global_node_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
__entry->dirty_thresh   = dirty_thresh;
__entry->dirty_limit= global_wb_domain.dirty_limit;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f97591d9fa00..3c02aa603f5a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2461,7 +2461,7 @@ void account_page_dirtied(struct page *page, struct 
address_space *mapping)
mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
__inc_node_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-   __inc_zone_page_state(page, NR_DIRTIED);
+   __inc_node_page_state(page, NR_DIRTIED);
__inc_wb_stat(wb, WB_RECLAIMABLE);
__inc_wb_stat(wb, WB_DIRTIED);
task_io_account_write(PAGE_SIZE);
@@ -2550,7 +2550,7 @@ void account_page_redirty(struct page *page)
 
wb = unlocked_inode_to_wb_begin(inode, &locked);
current->nr_dirtied--;
-   dec_zone_page_state(page, NR_DIRTIED);
+   dec_node_page_state(page, NR_DIRTIED);
dec_wb_stat(wb, WB_DIRTIED);
unlocked_inode_to_wb_end(inode, locked);
}
@@ -2787,7 +2787,7 @@ int test_clear_page_writeback(struct page *page)
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
dec_node_page_state(page, NR_WRITEBACK);
dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-   inc_zone_page_state(page, NR_WRITTEN);
+   inc_node_page_state(page, NR_WRITTEN);
}
unlock_page_memcg(page);
return ret;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index aef2a6245657..5ad670881d8d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -612,7 +612,7 @@ static pageout_t pageout(struct page *page, struct 
address_space *mapping,
ClearPageReclaim(page);
}
trace_mm_vmscan_writepage(page);
-   inc_zone_page_state(page, NR_VMSCAN_WRITE);
+   inc_node_page_stat

[PATCH 18/34] mm: rename NR_ANON_PAGES to NR_ANON_MAPPED

NR_FILE_PAGES  is the number offile pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number offile pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 drivers/base/node.c| 2 +-
 fs/proc/meminfo.c  | 2 +-
 include/linux/mmzone.h | 2 +-
 mm/migrate.c   | 2 +-
 mm/rmap.c  | 8 
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index acca09536ad9..ac69a7215bcc 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -122,7 +122,7 @@ static ssize_t node_read_meminfo(struct device *dev,
   nid, K(sum_zone_node_page_state(nid, NR_WRITEBACK)),
   nid, K(sum_zone_node_page_state(nid, NR_FILE_PAGES)),
   nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
-  nid, K(node_page_state(pgdat, NR_ANON_PAGES)),
+  nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
   nid, K(i.sharedram),
   nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b8d52aa2f19a..40f108783d59 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -140,7 +140,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
-   K(global_node_page_state(NR_ANON_PAGES)),
+   K(global_node_page_state(NR_ANON_MAPPED)),
K(global_node_page_state(NR_FILE_MAPPED)),
K(i.sharedram),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 95d34d1e1fb5..2d4a8804eafa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -161,7 +161,7 @@ enum node_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
-   NR_ANON_PAGES,  /* Mapped anonymous pages */
+   NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
   only modified from process context */
NR_VM_NODE_STAT_ITEMS
diff --git a/mm/migrate.c b/mm/migrate.c
index 3033dae33a0a..fba770c54d84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -501,7 +501,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
 * new page and drop references to the old page.
 *
 * Note that anonymous pages are accounted for
-* via NR_FILE_PAGES and NR_ANON_PAGES if they
+* via NR_FILE_PAGES and NR_ANON_MAPPED if they
 * are mapped to swap space.
 */
if (newzone != oldzone) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 17876517f5fa..a66f80bc8703 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1217,7 +1217,7 @@ void do_page_add_anon_rmap(struct page *page,
 */
if (compound)
__inc_zone_page_state(page, NR_ANON_THPS);
-   __mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+   __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
}
if (unlikely(PageKsm(page)))
return;
@@ -1261,7 +1261,7 @@ void page_add_new_anon_rmap(struct page *page,
/* increment count (starts at -1) */
atomic_set(&page->_mapcount, 0);
}
-   __mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, nr);
+   __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
__page_set_anon_rmap(page, vma, address, 1);
 }
 
@@ -1378,7 +1378,7 @@ static void page_remove_anon_compound_rmap(struct page 
*page)
clear_page_mlock(page);
 
if (nr) {
-   __mod_node_page_state(page_pgdat(page), NR_ANON_PAGES, -nr);
+   __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
deferred_split_huge_page(page);
}
 }
@@ -1407,7 +1407,7 @@ void page_remove_rmap(struct page *page, bool compound)
 * these counters are not modified in interrupt context, and
 * pte lock(a spinlock) is held, which implies preemption disabled.
 */
-   __dec_node_page_state(page, NR_ANON_PAGES);
+   __dec_node_page_state(page, NR_ANON_MAPPED);
 
if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
-- 
2.6.4

[PATCH 30/34] mm: page_alloc: cache the last node whose dirty limit is reached