[PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-07 Thread Alexey Budankov


Currently in record mode the tool implements trace writing serially.
The algorithm loops over mapped per-cpu data buffers and stores
ready data chunks into a trace file using the write() system call.

Under some circumstances the kernel may lack free space in a buffer
because one half of the buffer has not yet been written to disk while
the tool is busy writing another buffer's data.

Thus the serial trace writing implementation may cause the kernel
to lose profiling data, and that is what is observed when profiling
highly parallel CPU bound workloads on machines with a large number
of cores.

An experiment profiling matrix multiplication code executing 128
threads on Intel Xeon Phi (KNM) with 272 cores, as below,
demonstrates a data loss metric value of 98%:

/usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
--call-graph dwarf,1024 --user-regs=IP,SP,BP \
--switch-events -e 
cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
matrix.gcc

The data loss metric is the ratio lost_time/elapsed_time, where
lost_time is the sum of time intervals containing PERF_RECORD_LOST
records and elapsed_time is the elapsed application run time
under profiling.

Applying asynchronous trace streaming through the POSIX AIO API
(http://man7.org/linux/man-pages/man7/aio.7.html)
lowers the data loss metric value, providing a 2x improvement and
lowering the 98% loss to almost 0%.

---
 Alexey Budankov (3):
perf util: map data buffer for preserving collected data
perf record: enable asynchronous trace writing
perf record: extend trace writing to multi AIO
 
 tools/perf/builtin-record.c | 166 ++--
 tools/perf/perf.h   |   1 +
 tools/perf/util/evlist.c|   7 +-
 tools/perf/util/evlist.h|   3 +-
 tools/perf/util/mmap.c  | 114 ++
 tools/perf/util/mmap.h  |  11 ++-
 6 files changed, 277 insertions(+), 25 deletions(-)

---
 Changes in v8:
 - ran the whole thing through checkpatch.pl and corrected the issues found,
   except lines longer than 80 characters
 - corrected comments alignment and formatting
 - moved multi AIO implementation into 3rd patch in the series
 - implemented explicit cblocks array allocation
 - split AIO completion check into separate record__aio_complete()
 - set nr_cblocks default to 1 and max allowed value to 4
 Changes in v7:
 - implemented handling record.aio setting from perfconfig file
 Changes in v6:
 - adjusted setting of priorities for cblocks;
 - handled errno == EAGAIN case from aio_write() return;
 Changes in v5:
 - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero 
of=/dev/null count=10
 - data loss metrics decreased from 25% to 2x in trialed configuration;
 - reshaped layout of data structures;
 - implemented --aio option;
 - avoided nanosleep() prior to calling aio_suspend();
 - switched to per-cpu aio multi buffer record__aio_sync();
 - record_mmap_read_sync() now does global sync just before 
   switching trace file or collection stop;
 Changes in v4:
 - converted mmap()/munmap() to malloc()/free() for mmap->data buffer management
 - converted void *bf to struct perf_mmap *md in signatures
 - written comment in perf_mmap__push() just before perf_mmap__get();
 - written comment in record__mmap_read_sync() on possible restarting 
   of aio_write() operation and releasing perf_mmap object after all;
 - added perf_mmap__put() for the cases of failed aio_write();
 Changes in v3:
 - written comments about the nanosleep(0.5ms) call prior to aio_suspend()
   to cope with intrusiveness of its implementation in glibc;
 - written comments about the rationale behind copying profiling data 
   into the mmap->data buffer;
 Changes in v2:
 - converted zalloc() to calloc() for allocation of mmap_aio array,
 - cleared typo and adjusted fallback branch code;


[PATCH] Input: elantech - enable middle button of touchpad on ThinkPad P72

2018-09-07 Thread Aaron Ma
Add two new touchpad IDs to support the middle button.

Cc: sta...@vger.kernel.org
Signed-off-by: Aaron Ma 
---
 drivers/input/mouse/elantech.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/input/mouse/elantech.c b/drivers/input/mouse/elantech.c
index 44f57cf6675b..2d95e8d93cc7 100644
--- a/drivers/input/mouse/elantech.c
+++ b/drivers/input/mouse/elantech.c
@@ -1178,6 +1178,8 @@ static const struct dmi_system_id 
elantech_dmi_has_middle_button[] = {
 static const char * const middle_button_pnp_ids[] = {
"LEN2131", /* ThinkPad P52 w/ NFC */
"LEN2132", /* ThinkPad P52 */
+   "LEN2133", /* ThinkPad P72 w/ NFC */
+   "LEN2134", /* ThinkPad P72 */
NULL
 };
 
-- 
2.17.1



[PATCH v8 1/3]: perf util: map data buffer for preserving collected data

2018-09-07 Thread Alexey Budankov


The map->data buffer is used to preserve map->base profiling data 
for writing to disk. AIO map->cblock is used to queue corresponding 
map->data buffer for asynchronous writing.

Signed-off-by: Alexey Budankov 
---
 Changes in v7:
  - implemented handling record.aio setting from perfconfig file
 Changes in v6:
  - adjusted setting of priorities for cblocks;
 Changes in v5:
  - reshaped layout of data structures;
  - implemented --aio option;
 Changes in v4:
  - converted mmap()/munmap() to malloc()/free() for mmap->data buffer 
management 
 Changes in v2:
  - converted zalloc() to calloc() for allocation of mmap_aio array,
  - cleared typo and adjusted fallback branch code;
---
 tools/perf/util/mmap.c | 25 +
 tools/perf/util/mmap.h |  3 +++
 2 files changed, 28 insertions(+)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index fc832676a798..e53038d76445 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -155,6 +155,8 @@ void __weak auxtrace_mmap_params__set_idx(struct 
auxtrace_mmap_params *mp __mayb
 
 void perf_mmap__munmap(struct perf_mmap *map)
 {
+   if (map->data)
+   zfree(&map->data);
if (map->base != NULL) {
munmap(map->base, perf_mmap__mmap_len(map));
map->base = NULL;
@@ -166,6 +168,7 @@ void perf_mmap__munmap(struct perf_mmap *map)
 
 int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 {
+   int delta_max;
/*
 * The last one will be done at perf_mmap__consume(), so that we
 * make sure we don't prevent tools from consuming every last event in
@@ -190,6 +193,28 @@ int perf_mmap__mmap(struct perf_mmap *map, struct 
mmap_params *mp, int fd)
map->base = NULL;
return -1;
}
+   delta_max = sysconf(_SC_AIO_PRIO_DELTA_MAX);
+   map->data = malloc(perf_mmap__mmap_len(map));
+   if (!map->data) {
+   pr_debug2("failed to allocate data buffer, error %d\n",
+   errno);
+   return -1;
+   }
+   /*
+    * Use a cblock.aio_fildes value different from -1
+    * to denote a started aio write operation on the
+    * cblock, so it requires an explicit record__aio_sync()
+    * call before the cblock may be reused again.
+    */
+   map->cblock.aio_fildes = -1;
+   /*
+    * Allocate the cblock with max priority delta to
+    * have faster aio_write() calls, because queued
+    * requests are kept in separate per-prio queues
+    * and adding a new request iterates through a shorter
+    * per-prio list.
+    */
+   map->cblock.aio_reqprio = delta_max;
map->fd = fd;
 
if (auxtrace_mmap__mmap(&map->auxtrace_mmap,
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index d82294db1295..1974e621e36b 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "auxtrace.h"
 #include "event.h"
 
@@ -25,6 +26,8 @@ struct perf_mmap {
bool overwrite;
struct auxtrace_mmap auxtrace_mmap;
char event_copy[PERF_SAMPLE_MAX_SIZE] __aligned(8);
+   void *data;
+   struct aiocb cblock;
 };
 
 /*



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-07 Thread Juri Lelli
On 06/09/18 16:25, Dietmar Eggemann wrote:
> Hi Juri,
> 
> On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > Hi,
> > > 
> > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > From: John Dias 
> 
> [...]
> 
> > > 
> > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > rt-tests but wasn't able to do so.
> > > 
> > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched 
> > > id=low,policy=cfs
> > > 
> > > Starting PI Stress Test
> > > Number of thread groups: 1
> > > Duration of test run: 1 seconds
> > > Number of inversions per group: 1
> > >   Admin thread SCHED_FIFO priority 4
> > > 1 groups of 3 threads will be created
> > >High thread SCHED_FIFO priority 3
> > > Med thread SCHED_FIFO priority 2
> > > Low thread SCHED_OTHER nice 0
> > > 
> > > # ./pip_stress
> > > 
> > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > dequeue_task_fair()->dequeue_entity(), which subtracts 
> > > cfs_rq->min_vruntime
> > > from se->vruntime, is called on it before it gets the rt prio.
> > > 
> > > Maybe it requires a very specific use of the pthread library to provoke 
> > > this
> > > issue by making sure that the cfs tasks really blocks/sleeps?
> > 
> > Maybe one could play with rt-app to recreate such specific use case?
> > 
> > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> 
> I played a little bit with rt-app on hikey960 to re-create Steve's test
> program.

Oh, nice! Thanks for sharing what you have got.

> Since there is no semaphore support (sem_wait(), sem_post()) I used
> condition variables (wait: pthread_cond_wait(), signal:
> pthread_cond_signal()). It's not really the same since this is stateless, but
> sleeps before the signals help to maintain the state in this easy example.
> 
> This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> 
> 
> "global": {
> "calibration" : 130,
>   "pi_enabled" : true
> },
> "tasks": {
> "rt_task": {
>   "loop" : 100,
>   "policy" : "SCHED_FIFO",
>   "cpus" : [0],
> 
>   "lock" : "b_mutex",
>   "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
>   "unlock" : "b_mutex",
>   "sleep" : 3000,
>   "lock1" : "a_mutex",
>   "signal" : "a_cond",
>   "unlock1" : "a_mutex",
>   "lock2" : "pi-mutex",
>   "unlock2" : "pi-mutex"
> },
>   "cfs_task": {
>   "loop" : 100,
>   "policy" : "SCHED_OTHER",
>   "cpus" : [4],
> 
>   "lock" : "pi-mutex",
>   "sleep" : 3000,
>   "lock1" : "b_mutex",
>   "signal" : "b_cond",
>   "unlock" : "b_mutex",
>   "lock2" : "a_mutex",
>   "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
>   "unlock1" : "a_mutex",
>   "unlock2" : "pi-mutex"
>   }
> }
> }
> 
> Adding semaphores is possible but rt-app has no easy way to initialize
> individual objects, e.g. sem_init(..., value). The only way I see is via the
> global section, like "pi_enabled". But then, this is true for all objects of
> this kind (in this case mutexes)?

Right, global section should work fine. Why do you think this is a
problem/limitation?

> So the following couple of lines extension to rt-app works because both
> semaphores can be initialized to 0:
> 
>  {
> "global": {
> "calibration" : 130,
>   "pi_enabled" : true
> },
> "tasks": {
> "rt_task": {
>   "loop" : 100,
>   "policy" : "SCHED_FIFO",
>   "cpus" : [0],
> 
>   "sem_wait" : "b_sem",
>   "sleep" : 1000,
>   "sem_post" : "a_sem",
> 
>   "lock" : "pi-mutex",
>   "unlock" : "pi-mutex"
> },
>   "cfs_task": {
>   "loop" : 100,
>   "policy" : "SCHED_OTHER",
>   "cpus" : [4],
> 
>   "lock" : "pi-mutex",
>   "sleep" : 1000,
>   "sem_post" : "b_sem",
>   "sem_wait" : "a_sem",
>   "unlock" : "pi-mutex"
>   }
> }
> }
> 
> Any thoughts on that? I can see something like this as infrastructure to
> create a regression test case based on rt-app and standard ftrace.

Agree. I guess we should add your first example to the repo (you'd be
very welcome to create a PR) already and then work to support the second?


Re: [PATCH AUTOSEL 4.18 043/131] ASoC: soc-pcm: Use delay set in component pointer function

2018-09-07 Thread Agrawal, Akshu



On 9/7/2018 5:53 AM, Sasha Levin wrote:
> On Mon, Sep 03, 2018 at 12:16:26PM +0100, Mark Brown wrote:
>> On Sun, Sep 02, 2018 at 01:03:55PM +, Sasha Levin wrote:
>>> From: Akshu Agrawal 
>>>
>>> [ Upstream commit 9fb4c2bf130b922c77c16a8368732699799c40de ]
>>>
>>> Take into account the base delay set in pointer callback.
>>>
>>> There are cases where a pointer function populates
>>> runtime->delay, such as:
>>> ./sound/pci/hda/hda_controller.c
>>> ./sound/soc/intel/atom/sst-mfld-platform-pcm.c
>>
>> I'm worried that if anyone notices this at all they will have already
>> compensated for the delays in userspace and therefore this will cause
>> them to see problems as they get double compensation for delays.
> 
> But what happens when they update to a newer Stable? They're going to
> hit that issue anyways.
> 

Drivers which had exposed this delay in the pointer function but have
compensated for the issue in userspace are likely to see the problem of
double delay when the update happens.
I don't know the best way to communicate that the issue is fixed in the
kernel and userspace compensation isn't required.

But more likely I think the delay was just getting left out and there
wouldn't have been a compensation in userspace.

Thanks,
Akshu


[PATCH v8 2/3]: perf record: enable asynchronous trace writing

2018-09-07 Thread Alexey Budankov


The trace file offset is calculated and updated linearly prior to
enqueuing an aio write at record__pushfn().

record__aio_sync() blocks until completion of the started AIO operation
and then proceeds.

record__mmap_read_sync() implements a barrier for all incomplete
aio write requests.

Signed-off-by: Alexey Budankov 
---
 Changes in v8:
 -  split AIO completion check into separate record__aio_complete()
 Changes in v6:
 - handled errno == EAGAIN case from aio_write();
 Changes in v5:
 - data loss metrics decreased from 25% to 2x in trialed configuration;
 - avoided nanosleep() prior to calling aio_suspend();
 - switched to per cpu multi record__aio_sync() aio
 - record_mmap_read_sync() now does global barrier just before 
   switching trace file or collection stop;
 - resolved livelock on perf record -e intel_pt// -- dd if=/dev/zero 
of=/dev/null count=10
 Changes in v4:
 - converted void *bf to struct perf_mmap *md in signatures
 - written comment in perf_mmap__push() just before perf_mmap__get();
 - written comment in record__mmap_read_sync() on possible restarting 
   of aio_write() operation and releasing perf_mmap object after all;
 - added perf_mmap__put() for the cases of failed aio_write();
 Changes in v3:
 - written comments about the nanosleep(0.5ms) call prior to aio_suspend()
   to cope with intrusiveness of its implementation in glibc;
 - written comments about the rationale behind copying profiling data 
   into the mmap->data buffer;
---
 tools/perf/builtin-record.c | 128 +++-
 tools/perf/util/mmap.c  |  54 ++-
 tools/perf/util/mmap.h  |   2 +-
 3 files changed, 169 insertions(+), 15 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 22ebeb92ac51..d4857572cf33 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,93 @@ static int record__write(struct record *rec, void *bf, 
size_t size)
return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+   void *buf, size_t size, off_t off)
+{
+   int rc;
+
+   cblock->aio_fildes = trace_fd;
+   cblock->aio_buf= buf;
+   cblock->aio_nbytes = size;
+   cblock->aio_offset = off;
+   cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+   do {
+   rc = aio_write(cblock);
+   if (rc == 0) {
+   break;
+   } else if (errno != EAGAIN) {
+   cblock->aio_fildes = -1;
+   pr_err("failed to queue perf data, error: %m\n");
+   break;
+   }
+   } while (1);
+
+   return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+   void *rem_buf;
+   off_t rem_off;
+   size_t rem_size;
+   int rc, aio_errno;
+   ssize_t aio_ret, written;
+
+   aio_errno = aio_error(cblock);
+   if (aio_errno == EINPROGRESS)
+   return 0;
+
+   written = aio_ret = aio_return(cblock);
+   if (aio_ret < 0) {
+   if (!(aio_errno == EINTR))
+   pr_err("failed to write perf data, error: %m\n");
+   written = 0;
+   }
+
+   rem_size = cblock->aio_nbytes - written;
+
+   if (rem_size == 0) {
+   cblock->aio_fildes = -1;
+   /*
+* md->refcount is incremented in perf_mmap__push() for
+* every enqueued aio write request so decrement it because
+* the request is now complete.
+*/
+   perf_mmap__put(md);
+   rc = 1;
+   } else {
+   /*
+    * The aio write request may require a restart with the
+    * remainder if the kernel didn't write the whole
+    * chunk at once.
+    */
+   rem_off = cblock->aio_offset + written;
+   rem_buf = (void *)(cblock->aio_buf + written);
+   record__aio_write(cblock, cblock->aio_fildes,
+   rem_buf, rem_size, rem_off);
+   rc = 0;
+   }
+
+   return rc;
+}
+
+static void record__aio_sync(struct perf_mmap *md)
+{
+   struct aiocb *cblock = &md->cblock;
+   struct timespec timeout = { 0, 1000 * 1000  * 1 }; // 1ms
+
+   do {
+   if (cblock->aio_fildes == -1 || record__aio_complete(md, 
cblock))
+   return;
+
+   while (aio_suspend((const struct aiocb**)&cblock, 1, &timeout)) 
{
+   if (!(errno == EAGAIN || errno == EINTR))
+   pr_err("failed to sync perf data, error: %m\n");
+   }
+   } while (1);
+}
+
 static int process_synthesized_event(struct perf_tool *tool,
 union perf_event *event,
 struct perf_sample *sample __maybe_unused,
@@ -130,12 +217,27 @@ static int process_synthes

Re: [PATCH resend 0/2] irqchip: convert to SPDX for Renesas drivers

2018-09-07 Thread Marc Zyngier
On Fri, 07 Sep 2018 02:50:13 +0100,
Kuninori Morimoto  wrote:
> 
> 
> Hi Thomas, Marc, Jason
> 
> 2weeks passed. I resend this patch again
> 
> Kuninori Morimoto (2):
>   pinctrl: sh-pfc: convert to SPDX identifiers
>   pinctrl: rza1: convert to SPDX identifiers
> 
>  drivers/pinctrl/pinctrl-rza1.c   |  5 +
>  drivers/pinctrl/sh-pfc/Kconfig   |  1 +
>  drivers/pinctrl/sh-pfc/core.c|  5 +
>  drivers/pinctrl/sh-pfc/core.h|  7 ++-
>  drivers/pinctrl/sh-pfc/gpio.c|  5 +
>  drivers/pinctrl/sh-pfc/pfc-emev2.c   |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a73a4.c | 15 +--
>  drivers/pinctrl/sh-pfc/pfc-r8a7740.c | 15 +--
>  drivers/pinctrl/sh-pfc/pfc-r8a7778.c | 10 +-
>  drivers/pinctrl/sh-pfc/pfc-r8a7779.c | 14 +-
>  drivers/pinctrl/sh-pfc/pfc-r8a7790.c | 15 +--
>  drivers/pinctrl/sh-pfc/pfc-r8a7791.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a7792.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a7794.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a7795-es1.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a7795.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a7796.c |  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a77970.c|  5 +
>  drivers/pinctrl/sh-pfc/pfc-r8a77995.c|  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7203.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7264.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7269.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh73a0.c  | 15 +--
>  drivers/pinctrl/sh-pfc/pfc-sh7720.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7723.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7724.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7734.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7757.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7785.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-sh7786.c  |  5 +
>  drivers/pinctrl/sh-pfc/pfc-shx3.c|  5 +
>  drivers/pinctrl/sh-pfc/pinctrl.c |  5 +
>  drivers/pinctrl/sh-pfc/sh_pfc.h  |  7 ++-
>  33 files changed, 35 insertions(+), 184 deletions(-)

[+ Linus]

If I trust the diffstat, should this be sent to the pinctrl maintainer
instead?

M.

-- 
Jazz is not dead, it just smell funny.


Re: [PATCH] vme: remove unneeded kfree

2018-09-07 Thread Greg Kroah-Hartman
On Thu, Sep 06, 2018 at 10:04:49PM -0700, Linus Torvalds wrote:
> On Thu, Sep 6, 2018 at 1:51 AM Ding Xiang
>  wrote:
> >
> > put_device will call vme_dev_release to free vdev, kfree is
> > unnecessary here.
> 
> That does seem to be the case.  I think "unnecessary" is overly kind,
> it does seem to be a double free.
> 
> Looks like the issue was introduced back in 2013 by commit
> def1820d25fa ("vme: add missing put_device() after device_register()
> fails").
> 
> It seems you should *either* kfree() the vdev, _or_ do put_device(),
> but doing both seems wrong.

You should only ever call put_device() after you have created the
structure, the documentation should say that somewhere...

> I presume the device_register() has never failed, and this being
> vme-only I'm guessing there isn't a vibrant testing community.
> 
> Greg?

It's the correct fix, I'll queue it up soon, thanks.

greg k-h


[PATCH 1/2] mtd: rawnand: denali: remove ->dev_ready() hook

2018-09-07 Thread Masahiro Yamada
The Denali NAND IP has no way to read out the current signal level
of the R/B# pin.  Instead, denali_dev_ready() checks if the R/B#
transition has already happened. (The INTR__INT_ACT interrupt is
asserted at the rising edge of the R/B# pin.)  It is not a correct
way to implement the ->dev_ready() hook.

In fact, it has a drawback; in the nand_scan_ident phase, the chip
detection iterates over maxchips until it fails to find a homogeneous
chip.  For the last loop, nand_reset() fails if no chip is there.

If the ->dev_ready hook exists, nand_command(_lp) calls nand_wait_ready()
after NAND_CMD_RESET.  However, we know denali_dev_ready() never
returns 1 unless there exists a chip that toggles R/B# on that chip
select.  Then, nand_wait_ready() just ends up wasting 400 msec and,
in the end, shows the "timeout while waiting for chip to become ready"
warning.

Let's remove the mis-implemented dev_ready hook, and fall back to
sending NAND_CMD_STATUS and nand_wait_status_ready(), which
bails out more quickly.

Signed-off-by: Masahiro Yamada 
---

 drivers/mtd/nand/raw/denali.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/mtd/nand/raw/denali.c b/drivers/mtd/nand/raw/denali.c
index f88a5dc..f069184 100644
--- a/drivers/mtd/nand/raw/denali.c
+++ b/drivers/mtd/nand/raw/denali.c
@@ -203,18 +203,6 @@ static uint32_t denali_wait_for_irq(struct 
denali_nand_info *denali,
return denali->irq_status;
 }
 
-static uint32_t denali_check_irq(struct denali_nand_info *denali)
-{
-   unsigned long flags;
-   uint32_t irq_status;
-
-   spin_lock_irqsave(&denali->irq_lock, flags);
-   irq_status = denali->irq_status;
-   spin_unlock_irqrestore(&denali->irq_lock, flags);
-
-   return irq_status;
-}
-
 static void denali_read_buf(struct mtd_info *mtd, uint8_t *buf, int len)
 {
struct denali_nand_info *denali = mtd_to_denali(mtd);
@@ -294,7 +282,7 @@ static void denali_cmd_ctrl(struct mtd_info *mtd, int dat, 
unsigned int ctrl)
return;
 
/*
-* Some commands are followed by chip->dev_ready or chip->waitfunc.
+* Some commands are followed by chip->waitfunc.
 * irq_status must be cleared here to catch the R/B# interrupt later.
 */
if (ctrl & NAND_CTRL_CHANGE)
@@ -303,13 +291,6 @@ static void denali_cmd_ctrl(struct mtd_info *mtd, int dat, 
unsigned int ctrl)
denali->host_write(denali, DENALI_BANK(denali) | type, dat);
 }
 
-static int denali_dev_ready(struct mtd_info *mtd)
-{
-   struct denali_nand_info *denali = mtd_to_denali(mtd);
-
-   return !!(denali_check_irq(denali) & INTR__INT_ACT);
-}
-
 static int denali_check_erased_page(struct mtd_info *mtd,
struct nand_chip *chip, uint8_t *buf,
unsigned long uncor_ecc_flags,
@@ -1349,7 +1330,6 @@ int denali_init(struct denali_nand_info *denali)
chip->write_byte = denali_write_byte;
chip->read_word = denali_read_word;
chip->cmd_ctrl = denali_cmd_ctrl;
-   chip->dev_ready = denali_dev_ready;
chip->waitfunc = denali_waitfunc;
 
if (features & FEATURES__INDEX_ADDR) {
-- 
2.7.4



[PATCH 0/2] mtd: rawnand: denali: clean-up unnecessary hook and device reset

2018-09-07 Thread Masahiro Yamada


As I replied to Boris [1],
I took a closer look for further cleanups.
I tested this series on my board.

Remove the mis-implemented ->dev_ready hook.
Remove the unnecessary device resetting because
nand_scan_ident() resets the devices anyway.

[1] http://patchwork.ozlabs.org/patch/960160/



Masahiro Yamada (2):
  mtd: rawnand: denali: remove ->dev_ready() hook
  mtd: rawnand: denali: remove denali_reset_banks()

 drivers/mtd/nand/raw/denali.c | 51 +--
 1 file changed, 1 insertion(+), 50 deletions(-)

-- 
2.7.4



[PATCH 2/2] mtd: rawnand: denali: remove denali_reset_banks()

2018-09-07 Thread Masahiro Yamada
In nand_scan_ident(), the controller driver resets every NAND chip.
This is done by sending NAND_CMD_RESET.  The Denali IP provides
another way to do the equivalent thing; if a bit is set in the
DEVICE_RESET register, the controller sends the RESET command to
the corresponding device.  denali_reset_banks() uses it to reset
all devices beforehand.

This redundant reset sequence was needed to know the actual number
of chips before calling nand_scan_ident(); if DEVICE_RESET fails,
there is no chip in that chip select.  Then, denali_reset_banks()
sets denali->max_banks to the number of detected chips.

As commit f486287d2372 ("mtd: nand: denali: fix bank reset function
to detect the number of chips") explained, nand_scan_ident() issued
Set Features (0xEF) command to all CS lines, some of which may not be
connected with a chip. Then, the driver would wait for R/B# response,
which never happens.

This problem was solved by commit 107b7d6a7ad4 ("mtd: rawnand: avoid
setting again the timings to mode 0 after a reset").  In the current
code, nand_setup_data_interface() is called from nand_scan_tail(),
which is invoked after the chip detection.

Now, we can really remove the redundant denali_reset_banks() by simply
passing the maximum number of chip selects supported by this IP
(typically 4 or 8) to nand_scan().  Let's leave the whole chip detection
process to nand_scan_ident().

Signed-off-by: Masahiro Yamada 
---

 drivers/mtd/nand/raw/denali.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/drivers/mtd/nand/raw/denali.c b/drivers/mtd/nand/raw/denali.c
index f069184..d1ae968 100644
--- a/drivers/mtd/nand/raw/denali.c
+++ b/drivers/mtd/nand/raw/denali.c
@@ -1040,29 +1040,6 @@ static int denali_setup_data_interface(struct mtd_info 
*mtd, int chipnr,
return 0;
 }
 
-static void denali_reset_banks(struct denali_nand_info *denali)
-{
-   u32 irq_status;
-   int i;
-
-   for (i = 0; i < denali->max_banks; i++) {
-   denali->active_bank = i;
-
-   denali_reset_irq(denali);
-
-   iowrite32(DEVICE_RESET__BANK(i),
- denali->reg + DEVICE_RESET);
-
-   irq_status = denali_wait_for_irq(denali,
-   INTR__RST_COMP | INTR__INT_ACT | INTR__TIME_OUT);
-   if (!(irq_status & INTR__INT_ACT))
-   break;
-   }
-
-   dev_dbg(denali->dev, "%d chips connected\n", i);
-   denali->max_banks = i;
-}
-
 static void denali_hw_init(struct denali_nand_info *denali)
 {
/*
@@ -1311,12 +1288,6 @@ int denali_init(struct denali_nand_info *denali)
}
 
denali_enable_irq(denali);
-   denali_reset_banks(denali);
-   if (!denali->max_banks) {
-   /* Error out earlier if no chip is found for some reasons. */
-   ret = -ENODEV;
-   goto disable_irq;
-   }
 
denali->active_bank = DENALI_INVALID_BANK;
 
-- 
2.7.4



Re: [PATCH 1/2] platform/chrome: Move mfd/cros_ec_lpc* includes to drivers/platform.

2018-09-07 Thread Benson Leung
Hi Enric,

On Wed, Jul 18, 2018 at 06:09:55PM +0200, Enric Balletbo i Serra wrote:
> The cros-ec-lpc driver lives in drivers/platform because is platform
> specific, however there are two includes (cros_ec_lpc_mec.h and
> cros_ec_lpc_reg.h) that lives in include/linux/mfd. These two includes
> are only used for the platform driver and are not really related to the
> MFD subsystem, so move the includes from include/linux/mfd to
> drivers/platform/chrome.
> 
> Signed-off-by: Enric Balletbo i Serra 

Thanks. Applied to my working branch for v4.20.

-- 
Benson Leung
Staff Software Engineer
Chrome OS Kernel
Google Inc.
ble...@google.com
Chromium OS Project
ble...@chromium.org




Re: [PATCH v7 1/2] leds: core: Introduce LED pattern trigger

2018-09-07 Thread Pavel Machek
Hi!

> +What:/sys/class/leds//hw_pattern
> +Date:September 2018
> +KernelVersion:   4.20
> +Description:
> + Specify a hardware pattern for the SC27XX LED. For the SC27XX
> + LED controller, it only supports 4 hardware patterns to 
> configure
> + the low time, rise time, high time and fall time for the 
> breathing
> + mode, and each stage duration unit is 125ms. So the format of
> + the hardware pattern values should be:
> + "brightness_1 duration_1 brightness_2 duration_2 brightness_3
> + duration_3 brightness_4 duration_4".
> 
> In this case low time and high time can be easily described with
> use of the proposed [brightness delta_t] tuples. It is not equally
> obvious in case of rise time and fall time.
> 
> I can imagine hw pattern that would require defining blink rate
> over period of time, or blink rate during rise/fall time - in the
> latter case we would have odd number of pattern components. Probably
> it wouldn't be a big deal, we'd need one "padding" value, but still
> there's room for improvement IMHO.

Well, you can describe blinking while rising, it is just going to be
awkward, as you'll need to give precise times/brightnesses for each
blink, and the pattern will become long.

I'm sure some hardware can do that (the led in N900 can compute prime
numbers, it can blink while changing brightness, too).

OTOH people tend to use pretty simple patterns on their LEDs, so we
should be fine.

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH] printk/tracing: Do not trace printk_nmi_enter()

2018-09-07 Thread Peter Zijlstra
On Wed, Sep 05, 2018 at 09:33:34PM -0400, Steven Rostedt wrote:
>   do_idle {
> 
> [interrupts enabled]
> 
>  [interrupts disabled]
>   TRACE_IRQS_OFF [lockdep says irqs off]
>   [...]
>   TRACE_IRQS_IRET
>   test if pt_regs say return to interrupts enabled [yes]
>   TRACE_IRQS_ON [lockdep says irqs are on]
> 
>   
>   nmi_enter() {
>   printk_nmi_enter() [traced by ftrace]
>   [ hit ftrace breakpoint ]
>   
>   TRACE_IRQS_OFF [lockdep says irqs off]
>   [...]
>   TRACE_IRQS_IRET [return from breakpoint]
>  test if pt_regs say interrupts enabled [no]
>  [iret back to interrupt]
>  [iret back to code]
> 
> tick_nohz_idle_enter() {
> 
>   lockdep_assert_irqs_enabled() [lockdep say no!]

Isn't the problem that we muck with the IRQ state from NMI context? We
shouldn't be doing that.

The thing is, since we trace the IRQ state from within IRQ-disable,
since that's the only IRQ-safe option, it is very much not NMI-safe.

Your patch might avoid the symptom, but I don't think it cures the
fundamental problem.


Re: [PATCH v9 3/6] kernel/reboot.c: export pm_power_off_prepare

2018-09-07 Thread Oleksij Rempel
Hi Mark,

On Thu, Sep 06, 2018 at 11:15:17AM +0100, Mark Brown wrote:
> On Mon, Aug 27, 2018 at 09:48:16AM +0800, Shawn Guo wrote:
> 
> > Can you ACK on those two regulator patches, so that I can queue this
> > series up on IMX tree?
> 
> I was expecting to get a pull request with the precursor patches in it -
> the regulator driver seems to get a moderate amount of development so
> there's a reasonable risk of conflicts.

Is there anything I can or should do?

-- 
Pengutronix e.K.   | |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |




Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4

2018-09-07 Thread Daniel Drake
On Thu, Sep 6, 2018 at 5:43 AM, Johannes Weiner  wrote:
> Peter, do the changes from v3 look sane to you?
>
> If there aren't any further objections, I was hoping we could get this
> lined up for 4.20.

That would be excellent. I just retested the latest version at
http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the
results are great.

Test setup:
Endless OS
GeminiLake N4200 low end laptop
2GB RAM
swap (and zram swap) disabled

Baseline test: open a handful of large-ish apps and several website
tabs in Google Chrome.
Results: after a couple of minutes, system is excessively thrashing,
mouse cursor can barely be moved, UI is not responding to mouse
clicks, so it's impractical to recover from this situation as an
ordinary user

Add my simple killer:
https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
Results: when the thrashing causes the UI to become sluggish, the
killer steps in and kills something (usually a chrome tab), and the
system remains usable. I repeatedly opened more apps and more websites
over a 15 minute period but I wasn't able to get the system to a point
of UI unresponsiveness.

Thanks,
Daniel


[PATCH v8 3/3]: perf record: extend trace writing to multi AIO

2018-09-07 Thread Alexey Budankov


Multi AIO trace writing allows caching more kernel data in userspace
memory, postponing trace writing for the sake of an overall profiling
data throughput increase. It can be seen as an extension of the kernel
data buffer into userspace memory.

With an aio-cblocks option value greater than 1 (the current default),
the tool can cache more and more data in user space while delegating
the spill to AIO.

That avoids the suspend at record__aio_sync() between calls of
record__mmap_read_evlist() and increases profiling data throughput at
the cost of userspace memory.

Signed-off-by: Alexey Budankov 
---
 tools/perf/builtin-record.c | 55 +++---
 tools/perf/perf.h   |  1 +
 tools/perf/util/evlist.c|  7 ++--
 tools/perf/util/evlist.h|  3 +-
 tools/perf/util/mmap.c  | 83 +++--
 tools/perf/util/mmap.h  | 10 +++---
 6 files changed, 114 insertions(+), 45 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index d4857572cf33..6361098a5898 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -192,16 +192,35 @@ static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
return rc;
 }
 
-static void record__aio_sync(struct perf_mmap *md)
+static int record__aio_sync(struct perf_mmap *md, bool sync_all)
 {
-   struct aiocb *cblock = &md->cblock;
+   struct aiocb **aiocb = md->aiocb;
+   struct aiocb *cblocks = md->cblocks;
struct timespec timeout = { 0, 1000 * 1000  * 1 }; // 1ms
+   int i, do_suspend;
 
do {
-   if (cblock->aio_fildes == -1 || record__aio_complete(md, cblock))
-   return;
+   do_suspend = 0;
+   for (i = 0; i < md->nr_cblocks; ++i) {
+   if (cblocks[i].aio_fildes == -1 || record__aio_complete(md, &cblocks[i])) {
+   if (sync_all)
+   aiocb[i] = NULL;
+   else
+   return i;
+   } else {
+   /*
+* Started aio write is not complete yet
+* so it has to be waited before the
+* next allocation.
+*/
+   aiocb[i] = &cblocks[i];
+   do_suspend = 1;
+   }
+   }
+   if (!do_suspend)
+   return -1;
 
-   while (aio_suspend((const struct aiocb**)&cblock, 1, &timeout)) {
+   while (aio_suspend((const struct aiocb **)aiocb, md->nr_cblocks, &timeout)) {
if (!(errno == EAGAIN || errno == EINTR))
pr_err("failed to sync perf data, error: %m\n");
}
@@ -428,7 +447,8 @@ static int record__mmap_evlist(struct record *rec,
 
if (perf_evlist__mmap_ex(evlist, opts->mmap_pages,
 opts->auxtrace_mmap_pages,
-opts->auxtrace_snapshot_mode) < 0) {
+opts->auxtrace_snapshot_mode,
+opts->nr_cblocks) < 0) {
if (errno == EPERM) {
pr_err("Permission error mapping pages.\n"
   "Consider increasing "
@@ -621,7 +641,7 @@ static void record__mmap_read_sync(struct record *rec)
for (i = 0; i < evlist->nr_mmaps; i++) {
struct perf_mmap *map = &maps[i];
if (map->base)
-   record__aio_sync(map);
+   record__aio_sync(map, true);
}
 }
 
@@ -629,7 +649,7 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
bool overwrite)
 {
u64 bytes_written = rec->bytes_written;
-   int i;
+   int i, idx;
int rc = 0;
struct perf_mmap *maps;
 
@@ -648,11 +668,12 @@ static int record__mmap_read_evlist(struct record *rec, struct perf_evlist *evli
 
if (maps[i].base) {
/*
-* Call record__aio_sync() to wait till map->data buffer
-* becomes available after previous aio write request.
+* Call record__aio_sync() to get some free map->data
+* buffer or wait if all of previously started aio
+* writes are still incomplete.
 */
-   record__aio_sync(&maps[i]);
-   if (perf_mmap__push(&maps[i], rec, record__pushfn) != 0) {
+   idx = record__aio_sync(&maps[i], false);
+   if (perf_mmap__push(&maps[i], rec, idx, record__pushfn) != 0) {
   

Re: [PATCH] printk/tracing: Do not trace printk_nmi_enter()

2018-09-07 Thread Peter Zijlstra
On Thu, Sep 06, 2018 at 11:31:51AM +0900, Sergey Senozhatsky wrote:
> An alternative option, thus, could be re-instating back the rule that
> lockdep_off/on should be the first and the last thing we do in
> nmi_enter/nmi_exit. E.g.
> 
> nmi_enter()
>   lockdep_off();
>   printk_nmi_enter();
> 
> nmi_exit()
>   printk_nmi_exit();
>   lockdep_on();

Yes that. Also, those should probably be inline functions.

---
Subject: locking/lockdep: Fix NMI handling

Someone put code in the NMI handler before lockdep_off(). Since lockdep
is not NMI safe, this wrecks stuff.

Fixes: 42a0bb3f7138 ("printk/nmi: generic solution for safe printk in NMI")
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/hardirq.h  |  4 ++--
 include/linux/lockdep.h  | 11 +--
 kernel/locking/lockdep.c | 12 
 3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 0fbbcdf0c178..8d70270d9486 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -62,8 +62,8 @@ extern void irq_exit(void);
 
 #define nmi_enter()\
do {\
-   printk_nmi_enter(); \
lockdep_off();  \
+   printk_nmi_enter(); \
ftrace_nmi_enter(); \
BUG_ON(in_nmi());   \
preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \
@@ -78,8 +78,8 @@ extern void irq_exit(void);
BUG_ON(!in_nmi());  \
preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET); \
ftrace_nmi_exit();  \
-   lockdep_on();   \
printk_nmi_exit();  \
+   lockdep_on();   \
} while (0)
 
 #endif /* LINUX_HARDIRQ_H */
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index b0d0b51c4d85..70bb9e8fc8f9 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -272,8 +272,15 @@ extern void lockdep_reset_lock(struct lockdep_map *lock);
 extern void lockdep_free_key_range(void *start, unsigned long size);
 extern asmlinkage void lockdep_sys_exit(void);
 
-extern void lockdep_off(void);
-extern void lockdep_on(void);
+static inline void lockdep_off(void)
+{
+   current->lockdep_recursion++;
+}
+
+static inline void lockdep_on(void)
+{
+   current->lockdep_recursion--;
+}
 
 /*
  * These methods are used by specific locking variants (spinlocks,
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e406c5fdb41e..da51ed1c0c21 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -317,18 +317,6 @@ static inline u64 iterate_chain_key(u64 key, u32 idx)
return k0 | (u64)k1 << 32;
 }
 
-void lockdep_off(void)
-{
-   current->lockdep_recursion++;
-}
-EXPORT_SYMBOL(lockdep_off);
-
-void lockdep_on(void)
-{
-   current->lockdep_recursion--;
-}
-EXPORT_SYMBOL(lockdep_on);
-
 /*
  * Debugging switches:
  */


Re: [PATCH v5 1/2] dt-bindings: leds: Add bindings for lm3697 driver

2018-09-07 Thread Pavel Machek
Hi!

> >> +All HVLED strings controlled by control bank A
> > 
> > ":"?
> 
> Not sure what you are asking for here.

The text looked like it was missing a ":" at the end of the line.

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




[PATCH 2/2] rxrpc: use kmemdup_nul in rxrpc_krb5_decode_principal

2018-09-07 Thread Rasmus Villemoes
Signed-off-by: Rasmus Villemoes 
---
This depends on patch 1/2.

 net/rxrpc/key.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/rxrpc/key.c b/net/rxrpc/key.c
index e7f6b8823eb6..42f0ca9265c2 100644
--- a/net/rxrpc/key.c
+++ b/net/rxrpc/key.c
@@ -252,11 +252,9 @@ static int rxrpc_krb5_decode_principal(struct krb5_principal *princ,
paddedlen = (tmp + 3) & ~3;
if (paddedlen > toklen)
return -EINVAL;
-   princ->name_parts[loop] = kmalloc(tmp + 1, GFP_KERNEL);
+   princ->name_parts[loop] = kmemdup_nul(xdr, tmp, GFP_KERNEL);
if (!princ->name_parts[loop])
return -ENOMEM;
-   memcpy(princ->name_parts[loop], xdr, tmp);
-   princ->name_parts[loop][tmp] = 0;
toklen -= paddedlen;
xdr += paddedlen >> 2;
}
@@ -270,11 +268,9 @@ static int rxrpc_krb5_decode_principal(struct krb5_principal *princ,
paddedlen = (tmp + 3) & ~3;
if (paddedlen > toklen)
return -EINVAL;
-   princ->realm = kmalloc(tmp + 1, GFP_KERNEL);
+   princ->realm = kmemdup_nul(xdr, tmp, GFP_KERNEL);
if (!princ->realm)
return -ENOMEM;
-   memcpy(princ->realm, xdr, tmp);
-   princ->realm[tmp] = 0;
toklen -= paddedlen;
xdr += paddedlen >> 2;
 
-- 
2.16.4



Re: [PATCH 0/9] psi: pressure stall information for CPU, memory, and IO v4

2018-09-07 Thread Peter Zijlstra
On Wed, Sep 05, 2018 at 05:43:03PM -0400, Johannes Weiner wrote:
> On Tue, Aug 28, 2018 at 01:22:49PM -0400, Johannes Weiner wrote:
> > This version 4 of the PSI series incorporates feedback from Peter and
> > fixes two races in the lockless aggregator that Suren found in his
> > testing and which caused the sample calculation to sometimes underflow
> > and record bogusly large samples; details at the bottom of this email.
> 
> Peter, do the changes from v3 look sane to you?

I'll go have a look.


[PATCH 1/2] string: make kmemdup_nul take and return void*, not char*

2018-09-07 Thread Rasmus Villemoes
This allows kmemdup_nul to be used in cases where the source pointer is
not a char* or const char*, but the result should nevertheless have a
nul char after the memcpy'ed data.

Signed-off-by: Rasmus Villemoes 
---
 include/linux/string.h | 2 +-
 mm/util.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/string.h b/include/linux/string.h
index 4a5a0eb7df51..b44a2254bc6b 100644
--- a/include/linux/string.h
+++ b/include/linux/string.h
@@ -169,7 +169,7 @@ extern char *kstrdup(const char *s, gfp_t gfp) __malloc;
 extern const char *kstrdup_const(const char *s, gfp_t gfp);
 extern char *kstrndup(const char *s, size_t len, gfp_t gfp);
 extern void *kmemdup(const void *src, size_t len, gfp_t gfp);
-extern char *kmemdup_nul(const char *s, size_t len, gfp_t gfp);
+extern void *kmemdup_nul(const void *s, size_t len, gfp_t gfp);
 
 extern char **argv_split(gfp_t gfp, const char *str, int *argcp);
 extern void argv_free(char **argv);
diff --git a/mm/util.c b/mm/util.c
index 9e3ebd2ef65f..15ef23f1176e 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -128,7 +128,7 @@ EXPORT_SYMBOL(kmemdup);
  * @len: The size of the data
  * @gfp: the GFP mask used in the kmalloc() call when allocating memory
  */
-char *kmemdup_nul(const char *s, size_t len, gfp_t gfp)
+void *kmemdup_nul(const void *s, size_t len, gfp_t gfp)
 {
char *buf;
 
-- 
2.16.4



Re: [PATCH v2 7/9] net: stmmac: dwmac-sun8i: fix OF child-node lookup

2018-09-07 Thread Johan Hovold
On Thu, Sep 06, 2018 at 10:03:37PM +0200, Corentin Labbe wrote:
> On Mon, Aug 27, 2018 at 10:21:51AM +0200, Johan Hovold wrote:
> > Use the new of_get_compatible_child() helper to lookup the mdio-internal
> > child node instead of using of_find_compatible_node(), which searches
> > the entire tree from a given start node and thus can return an unrelated
> > (i.e. non-child) node.
> > 
> > This also addresses a potential use-after-free (e.g. after probe
> > deferral) as the tree-wide helper drops a reference to its first
> > argument (i.e. the mdio-mux node). Fortunately, this was inadvertently
> > balanced by a failure to drop the mdio-mux reference after lookup.
> > 
> > While at it, also fix the related mdio-internal- and phy-node reference
> > leaks.
> > 
> > Fixes: 634db83b8265 ("net: stmmac: dwmac-sun8i: Handle integrated/external MDIOs")
> > Cc: Corentin Labbe 
> > Cc: Andrew Lunn 
> > Cc: Giuseppe Cavallaro 
> > Cc: Alexandre Torgue 
> > Cc: Jose Abreu 
> > Cc: David S. Miller 
> > Signed-off-by: Johan Hovold 

> Tested-by: Corentin Labbe 

Thanks for testing.

Johan


Re: [PATCH v3 4/5] x86/mm: optimize static_protection() by using overlap()

2018-09-07 Thread Thomas Gleixner
On Fri, 7 Sep 2018, Yang, Bin wrote:
> On Tue, 2018-09-04 at 14:22 +0200, Thomas Gleixner wrote:
> 
> I just wrote a test.c to compare the results of overlap() and the
> original within().

You are right. Your version of doing the overlap exclusive works. I misread
the conditions. I still prefer doing inclusive checks because they are way
more obvious.

Thanks,

tglx


Re: [PATCH] ttyprintk: make the printk log level configurable

2018-09-07 Thread Peter Korsgaard
On Tue, Aug 21, 2018 at 7:28 PM Peter Korsgaard  wrote:
>
> For some use cases it is handy to use a different printk log level than the
> default (info) for the messages written to ttyprintk, so add a Kconfig
> option similar to what we have for default console loglevel.

Ping? Feedback, comments?

> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index ce277ee0a28a..14a7f023f20b 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -66,6 +66,14 @@ config TTY_PRINTK
>
>   If unsure, say N.
>
> +config TTY_PRINTK_LEVEL
> +   depends on TTY_PRINTK
> +   int "ttyprintk log level (1-7)"
> +   range 1 7
> +   default "6"
> +   help
> + Printk log level to use for ttyprintk messages.
> +
>  config PRINTER
> tristate "Parallel printer support"
> depends on PARPORT
> diff --git a/drivers/char/ttyprintk.c b/drivers/char/ttyprintk.c
> index 67549ce88cc9..22fbd483b5dc 100644
> --- a/drivers/char/ttyprintk.c
> +++ b/drivers/char/ttyprintk.c
> @@ -37,6 +37,8 @@ static struct ttyprintk_port tpk_port;
>   */
>  #define TPK_STR_SIZE 508 /* should be bigger then max expected line length */
>  #define TPK_MAX_ROOM 4096 /* we could assume 4K for instance */
> +#define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL) " [U]"
> +
>  static int tpk_curr;
>
>  static char tpk_buffer[TPK_STR_SIZE + 4];
> @@ -45,7 +47,7 @@ static void tpk_flush(void)
>  {
> if (tpk_curr > 0) {
> tpk_buffer[tpk_curr] = '\0';
> -   pr_info("[U] %s\n", tpk_buffer);
> +   printk(TPK_PREFIX " %s\n", tpk_buffer);
> tpk_curr = 0;
> }
>  }
> --
> 2.11.0
>

-- 
Bye, Peter Korsgaard


[PATCH] sched/fair: fix load_balance redo for null imbalance

2018-09-07 Thread Vincent Guittot
It can happen that load_balance() finds a busiest group and then a
busiest rq, but the calculated imbalance is in fact null.

In such a situation, detach_tasks() returns immediately and leaves the
LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to have
pinned tasks and is removed from the load balance mask. Then we redo a
load balance without the busiest CPU. This creates a wrong load balance
situation and generates wrong task migrations.

If the calculated imbalance is null, it's useless to try to find a
busiest rq, as no task will be migrated and we can return immediately.

This situation can happen on heterogeneous systems, or on SMP systems
when RT tasks are decreasing the capacity of some CPUs.
Signed-off-by: Vincent Guittot 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..224bfae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8464,7 +8464,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
}
 
group = find_busiest_group(&env);
-   if (!group) {
+   if (!group || !env.imbalance) {
schedstat_inc(sd->lb_nobusyg[idle]);
goto out_balanced;
}
-- 
2.7.4



Re: [PATCH 2/2] mfd: cros_ec: Fix and improve kerneldoc comments.

2018-09-07 Thread Benson Leung
Hi Enric,

On Wed, Jul 18, 2018 at 06:09:56PM +0200, Enric Balletbo i Serra wrote:
> The cros-ec includes inside the MFD subsystem, especially the file
> cros_ec_commands.h, have been modified several times and have grown a
> lot; unfortunately, we didn't care too much about the documentation.
> This patch tries to improve the documentation and also fixes all the
> issues reported by the kerneldoc script.
> 
> Signed-off-by: Enric Balletbo i Serra 

Applied, thanks.

-- 
Benson Leung
Staff Software Engineer
Chrome OS Kernel
Google Inc.
ble...@google.com
Chromium OS Project
ble...@chromium.org




Re: [PATCH] leds: pwm: silently error out on EPROBE_DEFER

2018-09-07 Thread Jerome Brunet
On Thu, 2018-09-06 at 17:35 +0200, Pavel Machek wrote:
> On Thu 2018-09-06 15:59:04, Jerome Brunet wrote:
> > When probing, if we fail to get the pwm due to probe deferal, we shouldn't
> > print an error message. Just be silent in this case.
> > 
> > Signed-off-by: Jerome Brunet 
> > ---
> >  drivers/leds/leds-pwm.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/leds/leds-pwm.c b/drivers/leds/leds-pwm.c
> > index df80c89ebe7f..5d3faae51d59 100644
> > --- a/drivers/leds/leds-pwm.c
> > +++ b/drivers/leds/leds-pwm.c
> > @@ -100,8 +100,9 @@ static int led_pwm_add(struct device *dev, struct led_pwm_priv *priv,
> > led_data->pwm = devm_pwm_get(dev, led->name);
> > if (IS_ERR(led_data->pwm)) {
> > ret = PTR_ERR(led_data->pwm);
> > -   dev_err(dev, "unable to request PWM for %s: %d\n",
> > -   led->name, ret);
> > +   if (ret != -EPROBE_DEFER)
> > +   dev_err(dev, "unable to request PWM for %s: %d\n",
> > +   led->name, ret);
> > return ret;
> > }
> 
> Hmm, sometimes probing is deffered forever, and in such case debug
> message is useful.
> 
> Do you see excessive number of these?

About 10, but displaying an error which is not an error is already
excessive IMO.

There is nothing out of the ordinary here. Feel free to grep the git log
for EPROBE_DEFER. Many drivers now take this into account from the
beginning, and we've been fixing the others in most subsystems.

>   
>   Pavel




Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-07 Thread Vincent Guittot
On Fri, 7 Sep 2018 at 09:16, Juri Lelli  wrote:
>
> On 06/09/18 16:25, Dietmar Eggemann wrote:
> > Hi Juri,
> >
> > On 08/23/2018 11:54 PM, Juri Lelli wrote:
> > > On 23/08/18 18:52, Dietmar Eggemann wrote:
> > > > Hi,
> > > >
> > > > On 08/21/2018 01:54 AM, Miguel de Dios wrote:
> > > > > On 08/17/2018 11:27 AM, Steve Muckle wrote:
> > > > > > From: John Dias 
> >
> > [...]
> >
> > > >
> > > > I tried to catch this issue on my Arm64 Juno board using pi_test (and a
> > > > slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
> > > > rt-tests but wasn't able to do so.
> > > >
> > > > # pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs
> > > >
> > > > Starting PI Stress Test
> > > > Number of thread groups: 1
> > > > Duration of test run: 1 seconds
> > > > Number of inversions per group: 1
> > > >   Admin thread SCHED_FIFO priority 4
> > > > 1 groups of 3 threads will be created
> > > >High thread SCHED_FIFO priority 3
> > > > Med thread SCHED_FIFO priority 2
> > > > Low thread SCHED_OTHER nice 0
> > > >
> > > > # ./pip_stress
> > > >
> > > > In both cases, the cfs task entering  rt_mutex_setprio() is queued, so
> > > > dequeue_task_fair()->dequeue_entity(), which subtracts 
> > > > cfs_rq->min_vruntime
> > > > from se->vruntime, is called on it before it gets the rt prio.
> > > >
> > > > Maybe it requires a very specific use of the pthread library to provoke 
> > > > this
> > > > issue by making sure that the cfs tasks really blocks/sleeps?
> > >
> > > Maybe one could play with rt-app to recreate such specific use case?
> > >
> > > https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459
> >
> > I played a little bit with rt-app on hikey960 to re-create Steve's test
> > program.
>
> Oh, nice! Thanks for sharing what you have got.
>
> > Since there is no semaphore support (sem_wait(), sem_post()) I used
> > condition variables (wait: pthread_cond_wait() , signal:
> > pthread_cond_signal()). It's not really the same since this is stateless but
> > sleeps before the signals help to maintain the state in this easy example.
> >
> > This provokes the vruntime issue e.g. for cpus 0,4 and it doesn't for 0,1:
> >
> >
> > "global": {
> > "calibration" : 130,
> >   "pi_enabled" : true
> > },
> > "tasks": {
> > "rt_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_FIFO",
> >   "cpus" : [0],
> >
> >   "lock" : "b_mutex",
> >   "wait" : { "ref" : "b_cond", "mutex" : "b_mutex" },
> >   "unlock" : "b_mutex",
> >   "sleep" : 3000,
> >   "lock1" : "a_mutex",
> >   "signal" : "a_cond",
> >   "unlock1" : "a_mutex",
> >   "lock2" : "pi-mutex",
> >   "unlock2" : "pi-mutex"
> > },
> >   "cfs_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_OTHER",
> >   "cpus" : [4],
> >
> >   "lock" : "pi-mutex",
> >   "sleep" : 3000,
> >   "lock1" : "b_mutex",
> >   "signal" : "b_cond",
> >   "unlock" : "b_mutex",
> >   "lock2" : "a_mutex",
> >   "wait" : { "ref" : "a_cond", "mutex" : "a_mutex" },
> >   "unlock1" : "a_mutex",
> >   "unlock2" : "pi-mutex"
> >   }
> > }
> > }
> >
> > Adding semaphores is possible but rt-app has no easy way to initialize
> > individual objects, e.g. sem_init(..., value). The only way I see is via the
> > global section, like "pi_enabled". But then, this is true for all objects of
> > this kind (in this case mutexes)?
>
> Right, global section should work fine. Why do you think this is a
> problem/limitation?

Keep in mind that rt-app still has a "resources" section. It is
optional and almost never used, as resources can be created on the fly,
but it's still there and can be used to initialize resources such as
semaphores if needed.

>
> > So the following couple of lines extension to rt-app works because both
> > semaphores can be initialized to 0:
> >
> >  {
> > "global": {
> > "calibration" : 130,
> >   "pi_enabled" : true
> > },
> > "tasks": {
> > "rt_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_FIFO",
> >   "cpus" : [0],
> >
> >   "sem_wait" : "b_sem",
> >   "sleep" : 1000,
> >   "sem_post" : "a_sem",
> >
> >   "lock" : "pi-mutex",
> >   "unlock" : "pi-mutex"
> > },
> >   "cfs_task": {
> >   "loop" : 100,
> >   "policy" : "SCHED_OTHER",
> >   "cpus" : [4],
> >
> >   "lock" : "pi-mutex",
> >   "sleep" : 1000,
> >   "sem_post" : "b_sem",
> >   "sem_wait" : "a_sem",
> >   "unlock" : "pi-mutex"
> >   }
> > }
> > }
> >
> > Any thoughts on that? I can see something like this as infrastructure to
> > create a regression test case based on rt-app and standard ftrace.
>
> Agree. I guess we s

Re: [PATCH v2 2/3] x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space

2018-09-07 Thread Borislav Petkov
On Mon, Sep 03, 2018 at 03:59:43PM -0700, Andy Lutomirski wrote:
> In the non-trampoline SYSCALL64 path, we use a percpu variable to
> temporarily store the user RSP value.  Instead of a separate
> variable, use the otherwise unused sp2 slot in the TSS.  This will
> improve cache locality, as the sp1 slot is already used in the same
> code to find the kernel stack.  It will also simplify a future
> change to make the non-trampoline path work in PTI mode.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/entry_64.S| 16 +---
>  arch/x86/include/asm/processor.h |  6 ++
>  arch/x86/kernel/asm-offsets.c|  3 ++-
>  arch/x86/kernel/process_64.c |  2 --
>  arch/x86/xen/xen-asm_64.S|  8 +---
>  5 files changed, 22 insertions(+), 13 deletions(-)

Reviewed-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH V3 17/26] csky: Misc headers

2018-09-07 Thread Arnd Bergmann
On Fri, Sep 7, 2018 at 7:17 AM Guo Ren  wrote:
>
> On Thu, Sep 06, 2018 at 04:16:30PM +0200, Arnd Bergmann wrote:
> > On Wed, Sep 5, 2018 at 2:08 PM Guo Ren  wrote:
> >
> > > diff --git a/arch/csky/boot/dts/qemu.dts b/arch/csky/boot/dts/qemu.dts
> > > new file mode 100644
> > > index 000..d36e4cd
> > > --- /dev/null
> > > +++ b/arch/csky/boot/dts/qemu.dts
> > > @@ -0,0 +1,77 @@
> > > +/dts-v1/;
> > > +/ {
> > > +   compatible = "csky,qemu";
> > > +   #address-cells = <1>;
> > > +   #size-cells = <1>;
> > > +   interrupt-parent = <&intc>;
> >
> > Ideally, qemu would supply a dtb file that matches the current 
> > configuration,
> > as we do for instance on the ARM 'virt' machine. This allows you
> > much more flexibility in running all kinds of options, as well as extending
> > qemu later with new features.
> So, should I remove qemu.dts in the next version of the patch?

It's up to you really. If you won't have a version of qemu that can do
this by itself, it may make sense to keep it around for a while. Since
your current qemu port is based on qemu-2.x but not upstream, you could
include a qemu-2.x.dts file here and have the future 3.x port provide
its own.

  Arnd


[PATCH v14 12/16] arm64: kexec_file: add crash dump support

2018-09-07 Thread AKASHI Takahiro
Enabling crash dump (kdump) includes
* preparing the contents of the ELF header of a core dump file,
  /proc/vmcore, using crash_prepare_elf64_headers(), and
* adding two device tree properties, "linux,usable-memory-range" and
  "linux,elfcorehdr", which respectively represent the memory range
  to be used by the crash dump kernel and the header's location.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
Reviewed-by: James Morse 
---
 arch/arm64/include/asm/kexec.h |   4 +
 arch/arm64/kernel/machine_kexec_file.c | 113 -
 2 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 5e673481b3a3..1b2c27026ae0 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -99,6 +99,10 @@ static inline void crash_post_resume(void) {}
 struct kimage_arch {
void *dtb;
unsigned long dtb_mem;
+   /* Core ELF header buffer */
+   void *elf_headers;
+   unsigned long elf_headers_mem;
+   unsigned long elf_headers_sz;
 };
 
 /**
diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 05fb2d4e6fef..ecaecb122cad 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -16,10 +16,14 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 
 /* relevant device tree properties */
+#define FDT_PSTR_KEXEC_ELFHDR  "linux,elfcorehdr"
+#define FDT_PSTR_MEM_RANGE "linux,usable-memory-range"
 #define FDT_PSTR_INITRD_STA"linux,initrd-start"
 #define FDT_PSTR_INITRD_END"linux,initrd-end"
 #define FDT_PSTR_BOOTARGS  "bootargs"
@@ -34,6 +38,10 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
vfree(image->arch.dtb);
image->arch.dtb = NULL;
 
+   vfree(image->arch.elf_headers);
+   image->arch.elf_headers = NULL;
+   image->arch.elf_headers_sz = 0;
+
return kexec_image_post_load_cleanup_default(image);
 }
 
@@ -43,12 +51,29 @@ static int setup_dtb(struct kimage *image,
void **dtb_buf, unsigned long *dtb_buf_len)
 {
void *buf = NULL;
-   size_t buf_size;
+   size_t buf_size, range_size;
int nodeoffset;
int ret;
 
+   /* check ranges against root's #address-cells and #size-cells */
+   if (image->type == KEXEC_TYPE_CRASH &&
+   (!of_fdt_cells_size_fitted(image->arch.elf_headers_mem,
+   image->arch.elf_headers_sz) ||
+!of_fdt_cells_size_fitted(crashk_res.start,
+   crashk_res.end - crashk_res.start + 1))) {
+   pr_err("Crash memory region doesn't fit into DT's root cell sizes.\n");
+   ret = -EINVAL;
+   goto out_err;
+   }
+
/* duplicate dt blob */
buf_size = fdt_totalsize(initial_boot_params);
+   range_size = of_fdt_reg_cells_size();
+
+   if (image->type == KEXEC_TYPE_CRASH) {
+   buf_size += fdt_prop_len(FDT_PSTR_KEXEC_ELFHDR, range_size);
+   buf_size += fdt_prop_len(FDT_PSTR_MEM_RANGE, range_size);
+   }
 
if (initrd_load_addr) {
/* can be redundant, but trimmed at the end */
@@ -78,6 +103,22 @@ static int setup_dtb(struct kimage *image,
goto out_err;
}
 
+   if (image->type == KEXEC_TYPE_CRASH) {
+   /* add linux,elfcorehdr */
+   ret = fdt_setprop_reg(buf, nodeoffset, FDT_PSTR_KEXEC_ELFHDR,
+   image->arch.elf_headers_mem,
+   image->arch.elf_headers_sz);
+   if (ret)
+   goto out_err;
+
+   /* add linux,usable-memory-range */
+   ret = fdt_setprop_reg(buf, nodeoffset, FDT_PSTR_MEM_RANGE,
+   crashk_res.start,
+   crashk_res.end - crashk_res.start + 1);
+   if (ret)
+   goto out_err;
+   }
+
/* add bootargs */
if (cmdline) {
ret = fdt_setprop_string(buf, nodeoffset, FDT_PSTR_BOOTARGS,
@@ -135,6 +176,43 @@ static int setup_dtb(struct kimage *image,
return ret;
 }
 
+static int prepare_elf_headers(void **addr, unsigned long *sz)
+{
+   struct crash_mem *cmem;
+   unsigned int nr_ranges;
+   int ret;
+   u64 i;
+   phys_addr_t start, end;
+
+   nr_ranges = 1; /* for exclusion of crashkernel region */
+   for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
+   MEMBLOCK_NONE, &start, &end, NULL)
+   nr_ranges++;
+
+   cmem = kmalloc(sizeof(struct crash_mem) +
+   sizeof(struct crash_mem_range) * nr_ranges, GFP_KERNEL);
+   if (!cmem)
+   return -ENOMEM;
+
+   cmem->max_nr_ranges = nr_ranges;
+   cmem->nr_ranges = 0;
+   for

[PATCH v14 11/16] arm64: kexec_file: allow for loading Image-format kernel

2018-09-07 Thread AKASHI Takahiro
This patch provides kexec_file_ops for the "Image"-format kernel. In this
implementation, a binary is always loaded at a fixed offset identified
in the text_offset field of its header.

Regarding signature verification for trusted boot, this patch doesn't
contain CONFIG_KEXEC_VERIFY_SIG support, which is to be added later
in this series, but file-attribute-based verification is still a viable
option by enabling the IMA security subsystem.

You can sign (label) a to-be-kexec'ed kernel image on the target file
system with:
$ evmctl ima_sign --key /path/to/private_key.pem Image

On live system, you must have IMA enforced with, at least, the following
security policy:
"appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig"

See more details about IMA here:
https://sourceforge.net/p/linux-ima/wiki/Home/

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
Reviewed-by: James Morse 
---
 arch/arm64/include/asm/kexec.h |  28 +++
 arch/arm64/kernel/Makefile |   2 +-
 arch/arm64/kernel/kexec_image.c| 108 +
 arch/arm64/kernel/machine_kexec_file.c |   1 +
 4 files changed, 138 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kernel/kexec_image.c

diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h
index 157b2897d911..5e673481b3a3 100644
--- a/arch/arm64/include/asm/kexec.h
+++ b/arch/arm64/include/asm/kexec.h
@@ -101,6 +101,34 @@ struct kimage_arch {
unsigned long dtb_mem;
 };
 
+/**
+ * struct arm64_image_header - arm64 kernel image header
+ * See Documentation/arm64/booting.txt for details
+ *
+ * @mz_magic: DOS header magic number ('MZ', optional)
+ * @code1: Instruction (branch to stext)
+ * @text_offset: Image load offset
+ * @image_size: Effective image size
+ * @flags: Bit-field flags
+ * @reserved: Reserved
+ * @magic: Magic number
+ * @pe_header: Offset to PE COFF header (optional)
+ **/
+
+struct arm64_image_header {
+   __le16 mz_magic; /* also code0 */
+   __le16 pad;
+   __le32 code1;
+   __le64 text_offset;
+   __le64 image_size;
+   __le64 flags;
+   __le64 reserved[3];
+   __le32 magic;
+   __le32 pe_header;
+};
+
+extern const struct kexec_file_ops kexec_image_ops;
+
 struct kimage;
 
 extern int arch_kimage_file_post_load_cleanup(struct kimage *image);
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 8f1326b2d327..8cd514855eec 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -51,7 +51,7 @@ arm64-obj-$(CONFIG_RANDOMIZE_BASE)+= kaslr.o
 arm64-obj-$(CONFIG_HIBERNATION)+= hibernate.o hibernate-asm.o
arm64-obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
   cpu-reset.o
-arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o
+arm64-obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o
 arm64-obj-$(CONFIG_ARM64_RELOC_TEST)   += arm64-reloc-test.o
 arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
 arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c
new file mode 100644
index ..d64f5e9f9d22
--- /dev/null
+++ b/arch/arm64/kernel/kexec_image.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Kexec image loader
+ *
+ * Copyright (C) 2018 Linaro Limited
+ * Author: AKASHI Takahiro 
+ */
+
+#define pr_fmt(fmt)"kexec_file(Image): " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int image_probe(const char *kernel_buf, unsigned long kernel_len)
+{
+   const struct arm64_image_header *h;
+
+   h = (const struct arm64_image_header *)(kernel_buf);
+
+   if (!h || (kernel_len < sizeof(*h)) ||
+   memcmp(&h->magic, ARM64_MAGIC, sizeof(h->magic)))
+   return -EINVAL;
+
+   return 0;
+}
+
+static void *image_load(struct kimage *image,
+   char *kernel, unsigned long kernel_len,
+   char *initrd, unsigned long initrd_len,
+   char *cmdline, unsigned long cmdline_len)
+{
+   struct arm64_image_header *h;
+   u64 flags, value;
+   struct kexec_buf kbuf;
+   unsigned long text_offset;
+   struct kexec_segment *kernel_segment;
+   int ret;
+
+   /* Don't support old kernel */
+   h = (struct arm64_image_header *)kernel;
+   if (!h->text_offset)
+   return ERR_PTR(-EINVAL);
+
+   /* Check cpu features */
+   flags = le64_to_cpu(h->flags);
+   value = head_flag_field(flags, HEAD_FLAG_BE);
+   if (((value == HEAD_FLAG_BE) && !IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)) ||
+   ((value != HEAD_FLAG_BE) && IS_ENABLED(CONFIG_CPU_BIG_ENDIAN)))
+   if (!system_supports_mixed_endian())
+   return ERR_PTR(-EINVAL);
+
+   

[PATCH v14 08/16] arm64: cpufeature: add MMFR0 helper functions

2018-09-07 Thread AKASHI Takahiro
Those helper functions for MMFR0 register will be used later by kexec_file
loader.

Signed-off-by: AKASHI Takahiro 
Cc: Catalin Marinas 
Cc: Will Deacon 
Reviewed-by: James Morse 
---
 arch/arm64/include/asm/cpufeature.h | 48 +
 1 file changed, 48 insertions(+)

diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 1717ba1db35d..cd90b5252d6d 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -486,11 +486,59 @@ static inline bool system_supports_32bit_el0(void)
return cpus_have_const_cap(ARM64_HAS_32BIT_EL0);
 }
 
+static inline bool system_supports_4kb_granule(void)
+{
+   u64 mmfr0;
+   u32 val;
+
+   mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+   val = cpuid_feature_extract_unsigned_field(mmfr0,
+   ID_AA64MMFR0_TGRAN4_SHIFT);
+
+   return val == ID_AA64MMFR0_TGRAN4_SUPPORTED;
+}
+
+static inline bool system_supports_64kb_granule(void)
+{
+   u64 mmfr0;
+   u32 val;
+
+   mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+   val = cpuid_feature_extract_unsigned_field(mmfr0,
+   ID_AA64MMFR0_TGRAN64_SHIFT);
+
+   return val == ID_AA64MMFR0_TGRAN64_SUPPORTED;
+}
+
+static inline bool system_supports_16kb_granule(void)
+{
+   u64 mmfr0;
+   u32 val;
+
+   mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+   val = cpuid_feature_extract_unsigned_field(mmfr0,
+   ID_AA64MMFR0_TGRAN16_SHIFT);
+
+   return val == ID_AA64MMFR0_TGRAN16_SUPPORTED;
+}
+
 static inline bool system_supports_mixed_endian_el0(void)
 {
	return id_aa64mmfr0_mixed_endian_el0(read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1));
 }
 
+static inline bool system_supports_mixed_endian(void)
+{
+   u64 mmfr0;
+   u32 val;
+
+   mmfr0 = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1);
+   val = cpuid_feature_extract_unsigned_field(mmfr0,
+   ID_AA64MMFR0_BIGENDEL_SHIFT);
+
+   return val == 0x1;
+}
+
 static inline bool system_supports_fpsimd(void)
 {
return !cpus_have_const_cap(ARM64_HAS_NO_FPSIMD);
-- 
2.18.0



[PATCH] android: binder: use kstrdup instead of open-coding it

2018-09-07 Thread Rasmus Villemoes
Signed-off-by: Rasmus Villemoes 
---
 drivers/android/binder.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index d58763b6b009..2abcf4501d9a 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -5667,12 +5667,11 @@ static int __init binder_init(void)
 * Copy the module_parameter string, because we don't want to
 * tokenize it in-place.
 */
-   device_names = kzalloc(strlen(binder_devices_param) + 1, GFP_KERNEL);
+   device_names = kstrdup(binder_devices_param, GFP_KERNEL);
if (!device_names) {
ret = -ENOMEM;
goto err_alloc_device_names_failed;
}
-   strcpy(device_names, binder_devices_param);
 
device_tmp = device_names;
while ((device_name = strsep(&device_tmp, ","))) {
-- 
2.16.4



KASAN: use-after-free Write in ucma_put_ctx

2018-09-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:b36fdc6853a3 Merge tag 'gpio-v4.19-2' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16a1842140
kernel config:  https://syzkaller.appspot.com/x/.config?x=6c9564cd177daf0c
dashboard link: https://syzkaller.appspot.com/bug?extid=cfe3c1e8ef634ba8964b
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=154f205640

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cfe3c1e8ef634ba89...@syzkaller.appspotmail.com

IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
8021q: adding VLAN 0 to HW filter on device team0
hrtimer: interrupt took 27351 ns
==
BUG: KASAN: use-after-free in atomic_dec_and_test include/asm-generic/atomic-instrumented.h:259 [inline]
BUG: KASAN: use-after-free in ucma_put_ctx+0x1d/0x60 drivers/infiniband/core/ucma.c:158

Write of size 4 at addr 8801d9193858 by task syz-executor0/5348

CPU: 1 PID: 5348 Comm: syz-executor0 Not tainted 4.19.0-rc2+ #224
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
 atomic_dec_and_test include/asm-generic/atomic-instrumented.h:259 [inline]
 ucma_put_ctx+0x1d/0x60 drivers/infiniband/core/ucma.c:158
 ucma_resolve_ip+0x24d/0x2a0 drivers/infiniband/core/ucma.c:713
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x117/0x9d0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457099
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00

RSP: 002b:7f95c9f42c78 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 7f95c9f436d4 RCX: 00457099
RDX: 0048 RSI: 2240 RDI: 0005
RBP: 00930140 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 004d8100 R14: 004c1c28 R15: 0001

Allocated by task 5348:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x730 mm/slab.c:3620
 kmalloc include/linux/slab.h:513 [inline]
 kzalloc include/linux/slab.h:707 [inline]
 ucma_alloc_ctx+0xd5/0x670 drivers/infiniband/core/ucma.c:205
 ucma_create_id+0x276/0x9d0 drivers/infiniband/core/ucma.c:496
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x117/0x9d0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 5344:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x210 mm/slab.c:3813
 ucma_free_ctx+0x9e2/0xe20 drivers/infiniband/core/ucma.c:595
 ucma_close+0x10d/0x300 drivers/infiniband/core/ucma.c:1764
 __fput+0x38a/0xa40 fs/file_table.c:278
 fput+0x15/0x20 fs/file_table.c:309
 task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:193 [inline]
 exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
 do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at 8801d9193800
 which belongs to the cache kmalloc-256 of size 256
The buggy address is located 88 bytes inside of
 256-byte region [8801d9193800, 8801d9193900)
The buggy address belongs to the page:
page:

Re: [PATCH] ttyprintk: make the printk log level configurable

2018-09-07 Thread Joe Perches
On Fri, 2018-09-07 at 09:50 +0200, Peter Korsgaard wrote:
> On Tue, Aug 21, 2018 at 7:28 PM Peter Korsgaard  wrote:
> > 
> > For some use cases it is handy to use a different printk log level than the
> > default (info) for the messages written to ttyprintk, so add a Kconfig
> > option similar to what we have for default console loglevel.
> 
> Ping? Feedback, comments?

I think moving "[U]" into TPK_LEVEL is an
unnecessary and a tad obfuscating change.

This also adds a leading space for unknown reasons
after the KERN_SOH.

> > diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> > index ce277ee0a28a..14a7f023f20b 100644
> > --- a/drivers/char/Kconfig
> > +++ b/drivers/char/Kconfig
> > @@ -66,6 +66,14 @@ config TTY_PRINTK
> > 
> >   If unsure, say N.
> > 
> > +config TTY_PRINTK_LEVEL
> > +   depends on TTY_PRINTK
> > +   int "ttyprintk log level (1-7)"
> > +   range 1 7
> > +   default "6"
> > +   help
> > + Printk log level to use for ttyprintk messages.
> > +
> >  config PRINTER
> > tristate "Parallel printer support"
> > depends on PARPORT
> > diff --git a/drivers/char/ttyprintk.c b/drivers/char/ttyprintk.c
> > index 67549ce88cc9..22fbd483b5dc 100644
> > --- a/drivers/char/ttyprintk.c
> > +++ b/drivers/char/ttyprintk.c
> > @@ -37,6 +37,8 @@ static struct ttyprintk_port tpk_port;
> >   */
> >  #define TPK_STR_SIZE 508 /* should be bigger then max expected line length */
> >  #define TPK_MAX_ROOM 4096 /* we could assume 4K for instance */
> > +#define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL) " [U]"

I think this should be

#define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL)

> > +
> >  static int tpk_curr;
> > 
> >  static char tpk_buffer[TPK_STR_SIZE + 4];
> > @@ -45,7 +47,7 @@ static void tpk_flush(void)
> >  {
> > if (tpk_curr > 0) {
> > tpk_buffer[tpk_curr] = '\0';
> > -   pr_info("[U] %s\n", tpk_buffer);
> > +   printk(TPK_PREFIX " %s\n", tpk_buffer);

and this

printk(TPK_PREFIX "[U] %s\n", tpk_buffer);

> > tpk_curr = 0;
> > }
> >  }
> > --
> > 2.11.0
> > 
> 
> 


Re: [PATCH v3 4/5] x86/mm: optimize static_protection() by using overlap()

2018-09-07 Thread Yang, Bin
On Fri, 2018-09-07 at 09:49 +0200, Thomas Gleixner wrote:
> On Fri, 7 Sep 2018, Yang, Bin wrote:
> > On Tue, 2018-09-04 at 14:22 +0200, Thomas Gleixner wrote:
> > 
> > I just write a test.c to compare the result between overlap() and
> > original within().
> 
> You are right. Your version of doing the overlap exclusive works. I misread
> the conditions. I still prefer doing inclusive checks because they are way
> more obvious.

I am sorry for my poor english. What is "inclusive checks"?


> 
> Thanks,
> 
>   tglx


Re: [PATCH] android: binder: use kstrdup instead of open-coding it

2018-09-07 Thread Greg Kroah-Hartman
On Fri, Sep 07, 2018 at 10:01:46AM +0200, Rasmus Villemoes wrote:
> Signed-off-by: Rasmus Villemoes 
> ---
>  drivers/android/binder.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Hi,

This is the friendly patch-bot of Greg Kroah-Hartman.  You have sent him
a patch that has triggered this response.  He used to manually respond
to these common problems, but in order to save his sanity (he kept
writing the same thing over and over, yet to different people), I was
created.  Hopefully you will not take offence and will fix the problem
in your patch and resubmit it so that it can be accepted into the Linux
kernel tree.

You are receiving this message because of the following common error(s)
as indicated below:

- You did not specify a description of why the patch is needed, or
  possibly, any description at all, in the email body.  Please read the
  section entitled "The canonical patch format" in the kernel file,
  Documentation/SubmittingPatches for what is needed in order to
  properly describe the change.

If you wish to discuss this problem further, or you have questions about
how to resolve this issue, please feel free to respond to this email and
Greg will reply once he has dug out from the pending patches received
from other developers.

thanks,

greg k-h's patch email bot


Re: [PATCH V3 17/26] csky: Misc headers

2018-09-07 Thread Guo Ren
On Fri, Sep 07, 2018 at 10:01:03AM +0200, Arnd Bergmann wrote:
> On Fri, Sep 7, 2018 at 7:17 AM Guo Ren  wrote:
> >
> > On Thu, Sep 06, 2018 at 04:16:30PM +0200, Arnd Bergmann wrote:
> > > On Wed, Sep 5, 2018 at 2:08 PM Guo Ren  wrote:
> > >
> > > > diff --git a/arch/csky/boot/dts/qemu.dts b/arch/csky/boot/dts/qemu.dts
> > > > new file mode 100644
> > > > index 000..d36e4cd
> > > > --- /dev/null
> > > > +++ b/arch/csky/boot/dts/qemu.dts
> > > > @@ -0,0 +1,77 @@
> > > > +/dts-v1/;
> > > > +/ {
> > > > +   compatible = "csky,qemu";
> > > > +   #address-cells = <1>;
> > > > +   #size-cells = <1>;
> > > > +   interrupt-parent = <&intc>;
> > >
> > > Ideally, qemu would supply a dtb file that matches the current configuration,
> > > as we do for instance on the ARM 'virt' machine. This allows you much more
> > > flexibility in running all kinds of options, as well as extending qemu
> > > later with new features.
> > So, I should remove qemu.dts in next version patch?
> 
> It's up to you really. If you won't have a version of qemu that can do this
> by itself, it may make sense to keep it around for a while. E.g. if your
> current qemu port is based on qemu-2.x but not upstream, you could include
> a qemu-2.x.dts file here, and have the future 3.x port provide its own.
Ok, thx for the tips.

 Guo Ren


Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

2018-09-07 Thread Jirka Hladky
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.


We will try that, thanks!

>  I would suggest contacting Srikar directly.


I will do that right away. Whom should I put on Cc? Just you and
linux-kernel@vger.kernel.org ? Should I put Ingo and Peter on Cc as
well?

$scripts/get_maintainer.pl -f kernel/sched
Ingo Molnar  (maintainer:SCHEDULER)
Peter Zijlstra  (maintainer:SCHEDULER)
linux-kernel@vger.kernel.org (open list:SCHEDULER)

Jirka

On Thu, Sep 6, 2018 at 2:58 PM, Mel Gorman  wrote:
> On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
>> Hi Mel,
>>
>> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>>
>>   * Compared to 4.18, there is still performance regression -
>> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
>> systems, regression is around 10-15%
>>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 20%
>>
>
> Ok.
>
>> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
>> lot there is another issue as well. Could you please recommend some
>> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>>
>
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.
>
>> Regarding the current results, how do we proceed? Could you please
>> contact Srikar and ask for the advice or should we contact him
>> directly?
>>
>
> I would suggest contacting Srikar directly. While I'm working on a
> series that touches off some similar areas, there is no guarantee it'll
> be a success as I'm not primarily upstream focused at the moment.
>
> Restarting the thread would also end up with a much more sensible cc
> list.
>
> --
> Mel Gorman
> SUSE Labs


Re: [PATCH] ttyprintk: make the printk log level configurable

2018-09-07 Thread Peter Korsgaard
> "Joe" == Joe Perches  writes:

 > On Fri, 2018-09-07 at 09:50 +0200, Peter Korsgaard wrote:
 >> On Tue, Aug 21, 2018 at 7:28 PM Peter Korsgaard  wrote:
 >> > 
 >> > For some use cases it is handy to use a different printk log level than the
 >> > default (info) for the messages written to ttyprintk, so add a Kconfig
 >> > option similar to what we have for default console loglevel.
 >> 
 >> Ping? Feedback, comments?

 > I think moving "[U]" into TPK_LEVEL is an
 > unnecessary and a tad obfuscating change.

It is arguably part of the prefix, but OK - I have no problem leaving it
in the printk line.

 > This also adds a leading space for unknown reasons
 > after the KERN_SOH.

True. I'll fix that and send a v2 - Thanks.

 >> > diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
 >> > index ce277ee0a28a..14a7f023f20b 100644
 >> > --- a/drivers/char/Kconfig
 >> > +++ b/drivers/char/Kconfig
 >> > @@ -66,6 +66,14 @@ config TTY_PRINTK
 >> > 
 >> >   If unsure, say N.
 >> > 
 >> > +config TTY_PRINTK_LEVEL
 >> > +   depends on TTY_PRINTK
 >> > +   int "ttyprintk log level (1-7)"
 >> > +   range 1 7
 >> > +   default "6"
 >> > +   help
 >> > + Printk log level to use for ttyprintk messages.
 >> > +
 >> >  config PRINTER
 >> > tristate "Parallel printer support"
 >> > depends on PARPORT
 >> > diff --git a/drivers/char/ttyprintk.c b/drivers/char/ttyprintk.c
 >> > index 67549ce88cc9..22fbd483b5dc 100644
 >> > --- a/drivers/char/ttyprintk.c
 >> > +++ b/drivers/char/ttyprintk.c
 >> > @@ -37,6 +37,8 @@ static struct ttyprintk_port tpk_port;
 >> >   */
 >> >  #define TPK_STR_SIZE 508 /* should be bigger then max expected line length */
 >> >  #define TPK_MAX_ROOM 4096 /* we could assume 4K for instance */
 >> > +#define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL) " [U]"

 > I think this should be

 > #define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL)

 >> > +
 >> >  static int tpk_curr;
 >> > 
 >> >  static char tpk_buffer[TPK_STR_SIZE + 4];
 >> > @@ -45,7 +47,7 @@ static void tpk_flush(void)
 >> >  {
 >> > if (tpk_curr > 0) {
 >> > tpk_buffer[tpk_curr] = '\0';
 >> > -   pr_info("[U] %s\n", tpk_buffer);
 >> > +   printk(TPK_PREFIX " %s\n", tpk_buffer);

 > and this

 >  printk(TPK_PREFIX "[U] %s\n", tpk_buffer);

 >> > tpk_curr = 0;
 >> > }
 >> >  }
 >> > --
 >> > 2.11.0
 >> > 
 >> 
 >> 

-- 
Bye, Peter Korsgaard


Re: [PATCH] x86, mm: Reserver some memory for bootmem allocator for NO_BOOTMEM

2018-09-07 Thread Feng Tang
Hi Thomas,

On Fri, Aug 31, 2018 at 09:36:59PM +0800, Feng Tang wrote:
> On Fri, Aug 31, 2018 at 01:33:05PM +0200, Thomas Gleixner wrote:
> > On Fri, 31 Aug 2018, Feng Tang wrote:
> > > On Thu, Aug 30, 2018 at 03:25:42PM +0200, Thomas Gleixner wrote:
> > > This panic happens as the earlycon's fixmap address has no
> > > pmd/pte ready, and __set_fixmap will try to allocate memory to
> > > setup the page table, and trigger panic due to no memory.
> > > 
> > > x86 kernel actually prepares the page table for fixmap in head_64.S:
> > > 
> > >   NEXT_PAGE(level2_fixmap_pgt)
> > >   .fill   506,8,0
> > >   .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
> > >   /* 8MB reserved for vsyscalls + a 2MB hole = 4 + 1 entries */
> > >   .fill   5,8,0
> > > 
> > > and it expects the fixmap address is in [-12M, -10M] range, but
> > > current code in fixmap.h will break the expectation when
> > > X86_VSYSCALL_EMULATION=n
> > > 
> > >   #ifdef CONFIG_X86_VSYSCALL_EMULATION
> > >   VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
> > >   #endif
> > > 
> > > So removing the "#ifdef" will make the fixmap address space stable in
> > > [-12M, -10M] and fix the issue.
> > 
> > Why on earth are you not fixing the damned PTE setup which is the obvious
> > and correct thing to do?
> 
> Any suggestion? I can only come up with a patch like this:

Could you review this patch? At this point in the boot there is no usable
memory block or other memory allocator that I know of, so I follow the
existing static fixmap page table code and add one more page table.

Thanks,
Feng

> 
> ---
> diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
> index 15ebc2fc166e..8cdb27ccc3a3 100644
> --- a/arch/x86/kernel/head_64.S
> +++ b/arch/x86/kernel/head_64.S
> @@ -446,11 +446,15 @@ NEXT_PAGE(level2_kernel_pgt)
>  
>  NEXT_PAGE(level2_fixmap_pgt)
>   .fill   506,8,0
> - .quad   level1_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
> + .quad   level1_fixmap_pgt0 - __START_KERNEL_map + _PAGE_TABLE_NOENC
> + .quad   level1_fixmap_pgt1 - __START_KERNEL_map + _PAGE_TABLE_NOENC
>   /* 8MB reserved for vsyscalls + a 2MB hole = 4 + 1 entries */
> - .fill   5,8,0
> + .fill   4,8,0
>  
> -NEXT_PAGE(level1_fixmap_pgt)
> +NEXT_PAGE(level1_fixmap_pgt0)
> + .fill   512,8,0
> +
> +NEXT_PAGE(level1_fixmap_pgt1)
>   .fill   512,8,0
>  
>  #undef PMDS
> 
> 
> Thanks,
> Feng
> 


Re: [PATCH V3 06/26] csky: Cache and TLB routines

2018-09-07 Thread Arnd Bergmann
On Fri, Sep 7, 2018 at 5:04 AM Guo Ren  wrote:
>
> On Thu, Sep 06, 2018 at 04:31:16PM +0200, Arnd Bergmann wrote:
> > On Wed, Sep 5, 2018 at 2:08 PM Guo Ren  wrote:
> >
> > Can you describe how C-Sky hardware implements MMIO?
> Our MMIO addresses are uncachable and strongly ordered, so no barriers
> are needed to access these I/O addresses.
>
>  #define ioremap_wc ioremap_nocache
>  #define ioremap_wt ioremap_nocache
>
> Current ioremap_wc and ioremap_wt implementation are too simple and
> we'll improve it in future.
>
> > In particular:
> >
> > - Is a read from uncached memory always serialized with DMA, and with
> >   other CPUs doing MMIO access to a different address?
> The CPU uses ld.w to get data from uncached strong-order memory.
> Other CPUs use the same mmio vaddr to access the uncachable strong-order
> memory paddr.

Ok, but what about the DMA? The most common requirement for
serialization here is with a DMA transfer, where you first write
into a buffer in memory, then write to an MMIO register to trigger
a DMA-load, and then the device reads the data from memory.
Without a barrier before the MMIO, the data may still be in a
store queue of the CPU, and the DMA gets stale data.

Similarly, an MMIO read may be used to see if a DMA has completed
and the device register tells you that the DMA has left the device,
but without a barrier, the CPU may have prefetched the DMA
data while waiting for the MMIO-read to complete. The __io_ar()
barrier() in asm-generic/io.h prevents the compiler from reordering
the two reads, but if an weakly ordered read (in coherent DMA buffer)
can bypass a strongly ordered read (MMIO), then it's still still
broken.

> > - How does endianess work? Are there any buses that flip bytes around
> >   when running big-endian, or do you always do that in software?
> Currently we only support little-endian and the SoCs will follow it.

Ok, that makes it easier. If you think that you won't even need big-endian
support in the long run, you could also remove your asm/byteorder.h
header. If you're not sure, it doesn't hurt to keep it of course.

Arnd


Re: [PATCH v4 3/4] drivers: edac: Add EDAC driver support for QCOM SoCs

2018-09-07 Thread Borislav Petkov
On Tue, Sep 04, 2018 at 04:22:24PM -0700, Venkata Narendra Kumar Gutta wrote:
> From: Channagoud Kadabi 
> 
> Add error reporting driver for Single Bit Errors (SBEs) and Double Bit
> Errors (DBEs). As of now, this driver supports error reporting for
> Last Level Cache Controller (LLCC) of Tag RAM and Data RAM. Interrupts
> are triggered when the errors happen in the cache, the driver handles
> those interrupts and dumps the syndrome registers.
> 
> Signed-off-by: Channagoud Kadabi 
> Signed-off-by: Venkata Narendra Kumar Gutta 
> Co-developed-by: Venkata Narendra Kumar Gutta 
> ---
>  MAINTAINERS|   8 +
>  drivers/edac/Kconfig   |  14 ++
>  drivers/edac/Makefile  |   1 +
>  drivers/edac/qcom_edac.c   | 420 +
>  include/linux/soc/qcom/llcc-qcom.h |  24 +++
>  5 files changed, 467 insertions(+)
>  create mode 100644 drivers/edac/qcom_edac.c

EDAC bits look ok now, feel free to carry it through the qualcomm tree:

Acked-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH v2] ttyprintk: make the printk log level configurable

2018-09-07 Thread Peter Korsgaard
For some use cases it is handy to use a different printk log level than the
default (info) for the messages written to ttyprintk, so add a Kconfig
option similar to what we have for default console loglevel.

Signed-off-by: Peter Korsgaard 
---
Changes since v1:
- Leave [U] prefix in printk invocation and drop space before it as
  suggested by Joe Perches.

 drivers/char/Kconfig | 8 
 drivers/char/ttyprintk.c | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index ce277ee0a28a..14a7f023f20b 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -66,6 +66,14 @@ config TTY_PRINTK
 
  If unsure, say N.
 
+config TTY_PRINTK_LEVEL
+   depends on TTY_PRINTK
+   int "ttyprintk log level (1-7)"
+   range 1 7
+   default "6"
+   help
+ Printk log level to use for ttyprintk messages.
+
 config PRINTER
tristate "Parallel printer support"
depends on PARPORT
diff --git a/drivers/char/ttyprintk.c b/drivers/char/ttyprintk.c
index 67549ce88cc9..88808dbba486 100644
--- a/drivers/char/ttyprintk.c
+++ b/drivers/char/ttyprintk.c
@@ -37,6 +37,8 @@ static struct ttyprintk_port tpk_port;
  */
 #define TPK_STR_SIZE 508 /* should be bigger then max expected line length */
 #define TPK_MAX_ROOM 4096 /* we could assume 4K for instance */
+#define TPK_PREFIX KERN_SOH __stringify(CONFIG_TTY_PRINTK_LEVEL)
+
 static int tpk_curr;
 
 static char tpk_buffer[TPK_STR_SIZE + 4];
@@ -45,7 +47,7 @@ static void tpk_flush(void)
 {
if (tpk_curr > 0) {
tpk_buffer[tpk_curr] = '\0';
-   pr_info("[U] %s\n", tpk_buffer);
+   printk(TPK_PREFIX "[U] %s\n", tpk_buffer);
tpk_curr = 0;
}
 }
-- 
2.11.0



[PATCH 1/4 v7] x86/ioremap: add a function ioremap_encrypted() to remap kdump old memory

2018-09-07 Thread Lianbo Jiang
When SME is enabled on AMD machine, the memory is encrypted in the first
kernel. In this case, SME also needs to be enabled in kdump kernel, and
we have to remap the old memory with the memory encryption mask.

Signed-off-by: Lianbo Jiang 
---
 arch/x86/include/asm/io.h |  3 +++
 arch/x86/mm/ioremap.c | 25 +
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 6de64840dd22..f8795f9581c7 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -192,6 +192,9 @@ extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
 #define ioremap_cache ioremap_cache
 extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
 #define ioremap_prot ioremap_prot
+extern void __iomem *ioremap_encrypted(resource_size_t phys_addr,
+   unsigned long size);
+#define ioremap_encrypted ioremap_encrypted
 
 /**
  * ioremap -   map bus memory into CPU space
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index c63a545ec199..e01e6c695add 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "physaddr.h"
 
@@ -131,7 +132,8 @@ static void __ioremap_check_mem(resource_size_t addr, unsigned long size,
  * caller shouldn't need to know that small detail.
  */
 static void __iomem *__ioremap_caller(resource_size_t phys_addr,
-   unsigned long size, enum page_cache_mode pcm, void *caller)
+   unsigned long size, enum page_cache_mode pcm,
+   void *caller, bool encrypted)
 {
unsigned long offset, vaddr;
resource_size_t last_addr;
@@ -199,7 +201,7 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
 * resulting mapping.
 */
prot = PAGE_KERNEL_IO;
-   if (sev_active() && mem_flags.desc_other)
+   if ((sev_active() && mem_flags.desc_other) || encrypted)
prot = pgprot_encrypted(prot);
 
switch (pcm) {
@@ -291,7 +293,7 @@ void __iomem *ioremap_nocache(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC_MINUS;
 
return __ioremap_caller(phys_addr, size, pcm,
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL(ioremap_nocache);
 
@@ -324,7 +326,7 @@ void __iomem *ioremap_uc(resource_size_t phys_addr, unsigned long size)
enum page_cache_mode pcm = _PAGE_CACHE_MODE_UC;
 
return __ioremap_caller(phys_addr, size, pcm,
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL_GPL(ioremap_uc);
 
@@ -341,7 +343,7 @@ EXPORT_SYMBOL_GPL(ioremap_uc);
 void __iomem *ioremap_wc(resource_size_t phys_addr, unsigned long size)
 {
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WC,
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL(ioremap_wc);
 
@@ -358,14 +360,21 @@ EXPORT_SYMBOL(ioremap_wc);
 void __iomem *ioremap_wt(resource_size_t phys_addr, unsigned long size)
 {
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WT,
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL(ioremap_wt);
 
+void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size)
+{
+   return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
+   __builtin_return_address(0), true);
+}
+EXPORT_SYMBOL(ioremap_encrypted);
+
 void __iomem *ioremap_cache(resource_size_t phys_addr, unsigned long size)
 {
return __ioremap_caller(phys_addr, size, _PAGE_CACHE_MODE_WB,
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL(ioremap_cache);
 
@@ -374,7 +383,7 @@ void __iomem *ioremap_prot(resource_size_t phys_addr, unsigned long size,
 {
return __ioremap_caller(phys_addr, size,
pgprot2cachemode(__pgprot(prot_val)),
-   __builtin_return_address(0));
+   __builtin_return_address(0), false);
 }
 EXPORT_SYMBOL(ioremap_prot);
 
-- 
2.17.1



Re: [PATCH v3 4/5] x86/mm: optimize static_protection() by using overlap()

2018-09-07 Thread Thomas Gleixner
On Fri, 7 Sep 2018, Yang, Bin wrote:
> On Fri, 2018-09-07 at 09:49 +0200, Thomas Gleixner wrote:
> > On Fri, 7 Sep 2018, Yang, Bin wrote:
> > > On Tue, 2018-09-04 at 14:22 +0200, Thomas Gleixner wrote:
> > > 
> > > I just write a test.c to compare the result between overlap() and
> > > original within().
> > 
> > You are right. Your version of doing the overlap exclusive works. I misread
> > the conditions. I still prefer doing inclusive checks because they are way
> > more obvious.
> 
> I am sorry for my poor English. What is "inclusive checks"?

Exclusive:   val >= start && val < end

Inclusive:   val >= start && val <= end

So the difference is that you feed exclusive with:

   end = start + size

and inclusive with

   end = start + size - 1

Thanks,

tglx




[PATCH] ARM: dts: at91: sama5d2_ptc_ek: fix nand pinctrl

2018-09-07 Thread Ludovic Desroches
The drive strength has to be set to medium, otherwise some data
corruption may happen.

Signed-off-by: Ludovic Desroches 
---

Hi,

This fix depends on drive-strength support for the Atmel PIO4 pin
controller. That support was added in v4.19, but I neglected to send
this fix at the same time.

Ludovic

 arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts 
b/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
index b10dccd0958f..3b1baa8605a7 100644
--- a/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
+++ b/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
@@ -11,6 +11,7 @@
 #include "sama5d2-pinfunc.h"
 #include 
 #include 
+#include 
 
 / {
model = "Atmel SAMA5D2 PTC EK";
@@ -299,6 +300,7 @@
 ,
 ;
bias-pull-up;
+   atmel,drive-strength = 
;
};
 
ale_cle_rdy_cs {
-- 
2.12.2



Re: [PATCH v6 07/14] sched/topology: Introduce sched_energy_present static key

2018-09-07 Thread Quentin Perret
On Thursday 06 Sep 2018 at 16:49:47 (-0700), Dietmar Eggemann wrote:
> I would prefer a sched_feature. I guess it has to be disabled by default so
> that other systems don't have to check rcu_dereference(rd->pd) in the wakeup
> path.

Right, this is what I had in mind too. I guess downstream kernels can
always carry a patch that changes the default if they want it enabled
without messing around in userspace.

> But since at the beginning EAS will be the only user of the EM there is no
> need to change the static key sched_energy_present right now.

Indeed, I could add a patch introducing this sched_feat in the series
that migrates IPA to using the EM framework (to be posted later). It is
just not required until we have a new user.

However, that IPA-related patchset would then change the default
behaviour for users who used to get EAS enabled automatically, but
wouldn't after updating their kernel (meaning they'd now have to flip
switches by hand whereas it used to "just work"). Not sure if that
qualifies as "breaking users" (cf. Linus' rule #1 of kernel
development) ...

Thanks,
Quentin


[PATCH 2/2] irq/matrix: Spread managed interrupts on allocation

2018-09-07 Thread Dou Liyang
From: Dou Liyang 

Linux spreads out the non-managed interrupts across the possible
target CPUs to avoid vector space exhaustion.

The same situation can happen with managed interrupts.

Spread managed interrupts out on allocation as well.

Fixes: a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")
Signed-off-by: Dou Liyang 
---
 arch/x86/kernel/apic/vector.c |  8 +++-
 include/linux/irq.h   |  3 ++-
 kernel/irq/matrix.c   | 32 
 3 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 9f148e3d45b4..b7fc290b4b98 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -313,14 +313,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
struct apic_chip_data *apicd = apic_chip_data(irqd);
int vector, cpu;
 
-   cpumask_and(vector_searchmask, vector_searchmask, affmsk);
-   cpu = cpumask_first(vector_searchmask);
-   if (cpu >= nr_cpu_ids)
-   return -EINVAL;
+   cpumask_and(vector_searchmask, dest, affmsk);
+
/* set_affinity might call here for nothing */
if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
return 0;
-   vector = irq_matrix_alloc_managed(vector_matrix, cpu);
+   vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
+					&cpu);
trace_vector_alloc_managed(irqd->irq, vector, vector);
if (vector < 0)
return vector;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 201de12a9957..c9bffda04a45 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m);
 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace);
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu);
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+   unsigned int *mapped_cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 5eb0c8b857f0..b449a749b354 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -259,21 +259,29 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
  * @m: Matrix pointer
  * @cpu:   On which CPU the interrupt should be allocated
  */
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu)
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+   unsigned int *mapped_cpu)
 {
-   struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
unsigned int bit, end = m->alloc_end;
+   unsigned int best_cpu = UINT_MAX;
+   struct cpumap *cm;
 
-   /* Get managed bit which are not allocated */
-   bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
-   bit = find_first_bit(m->scratch_map, end);
-   if (bit >= end)
-   return -ENOSPC;
-   set_bit(bit, cm->alloc_map);
-   cm->allocated++;
-   m->total_allocated++;
-   trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
-   return bit;
+   if (matrix_find_best_cpu(m, msk, &best_cpu)) {
+   cm = per_cpu_ptr(m->maps, best_cpu);
+   end = m->alloc_end;
+   /* Get managed bit which are not allocated */
+   bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map,
+		      end);
+   bit = find_first_bit(m->scratch_map, end);
+   if (bit >= end)
+   return -ENOSPC;
+   set_bit(bit, cm->alloc_map);
+   cm->allocated++;
+   m->total_allocated++;
+   *mapped_cpu = best_cpu;
+   trace_irq_matrix_alloc_managed(bit, best_cpu, m, cm);
+   return bit;
+   }
+   return -ENOSPC;
 }
 
 /**
-- 
2.14.3




[PATCH 1/2] irq/matrix: Split out the CPU finding code into a helper

2018-09-07 Thread Dou Liyang
From: Dou Liyang 

Linux finds the CPU which has the lowest vector allocation count to spread
out the non-managed interrupts across the possible target CPUs.

This common CPU-finding code will also be used for the managed case,
so split it out into a helper in preparation.

Signed-off-by: Dou Liyang 
---
 kernel/irq/matrix.c | 35 ++-
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 5092494bf261..5eb0c8b857f0 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -124,6 +124,26 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm,
return area;
 }
 
+/* Find the best CPU which has the lowest vector allocation count */
+static int matrix_find_best_cpu(struct irq_matrix *m,
+   const struct cpumask *msk, unsigned int *best_cpu)
+{
+   unsigned int cpu, maxavl = 0;
+   struct cpumap *cm;
+
+   for_each_cpu(cpu, msk) {
+   cm = per_cpu_ptr(m->maps, cpu);
+
+   if (!cm->online || cm->available <= maxavl)
+   continue;
+
+   *best_cpu = cpu;
+   maxavl = cm->available;
+   }
+
+   return maxavl;
+}
+
 /**
  * irq_matrix_assign_system - Assign system wide entry in the matrix
  * @m: Matrix pointer
@@ -322,22 +342,11 @@ void irq_matrix_remove_reserved(struct irq_matrix *m)
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
 bool reserved, unsigned int *mapped_cpu)
 {
-   unsigned int cpu, best_cpu, maxavl = 0;
+   unsigned int best_cpu = UINT_MAX;
struct cpumap *cm;
unsigned int bit;
 
-   best_cpu = UINT_MAX;
-   for_each_cpu(cpu, msk) {
-   cm = per_cpu_ptr(m->maps, cpu);
-
-   if (!cm->online || cm->available <= maxavl)
-   continue;
-
-   best_cpu = cpu;
-   maxavl = cm->available;
-   }
-
-   if (maxavl) {
+   if (matrix_find_best_cpu(m, msk, &best_cpu)) {
cm = per_cpu_ptr(m->maps, best_cpu);
bit = matrix_alloc_area(m, cm, 1, false);
if (bit < m->alloc_end) {
-- 
2.14.3




Re: [PATCH v3 4/5] x86/mm: optimize static_protection() by using overlap()

2018-09-07 Thread Yang, Bin
On Fri, 2018-09-07 at 10:21 +0200, Thomas Gleixner wrote:
> On Fri, 7 Sep 2018, Yang, Bin wrote:
> > On Fri, 2018-09-07 at 09:49 +0200, Thomas Gleixner wrote:
> > > On Fri, 7 Sep 2018, Yang, Bin wrote:
> > > > On Tue, 2018-09-04 at 14:22 +0200, Thomas Gleixner wrote:
> > > > 
> > > > I just wrote a test.c to compare the results of overlap() and the
> > > > original within().
> > > 
> > > You are right. Your version of doing the overlap exclusive works. I 
> > > misread
> > > the conditions. I still prefer doing inclusive checks because they are way
> > > more obvious.
> > 
> > I am sorry for my poor english. What is "inclusive checks"?
> 
> Exclusive:   val >= start && val < end
> 
> Inclusive:   val >= start && val <= end
> 
> So the difference is that you feed exclusive with:
> 
>end = start + size
> 
> and inclusive with
> 
>   end = start + size - 1
> 

Thanks. I will change it to inclusive check.

> Thanks,
> 
>   tglx
> 
> 


Re: [PATCH] printk/tracing: Do not trace printk_nmi_enter()

2018-09-07 Thread Petr Mladek
On Fri 2018-09-07 09:45:31, Peter Zijlstra wrote:
> On Thu, Sep 06, 2018 at 11:31:51AM +0900, Sergey Senozhatsky wrote:
> > An alternative option, thus, could be re-instating back the rule that
> > lockdep_off/on should be the first and the last thing we do in
> > nmi_enter/nmi_exit. E.g.
> > 
> > nmi_enter()
> > lockdep_off();
> > printk_nmi_enter();
> > 
> > nmi_exit()
> > printk_nmi_exit();
> > lockdep_on();
> 
> Yes that. Also, those should probably be inline functions.
> 
> ---
> Subject: locking/lockdep: Fix NMI handling
> 
> Someone put code in the NMI handler before lockdep_off(). Since lockdep
> is not NMI safe, this wrecks stuff.

My view is that nmi_enter() has to switch several features into
NMI-safe mode. The code must not trigger the other features when
they are still in the unsafe mode.

It is a chicken-and-egg problem. And it is hard to completely prevent
regressions caused by future changes.

I thought that printk_nmi_enter() should never need any lockdep-related
code. On the other hand, people might want to printk debug messages
when lockdep_off() is called. This is why I put it in the current order.

That said, I am not against this change. Especially the inlining
is a good move. Note that lockdep_off()/lockdep_on() must not
be traced as well.

Best Regards,
Petr


Re: [PATCH] vme: remove unneeded kfree

2018-09-07 Thread Martyn Welch
On Thu, 2018-09-06 at 22:04 -0700, Linus Torvalds wrote:
> On Thu, Sep 6, 2018 at 1:51 AM Ding Xiang
>  wrote:
> > 
> > put_device will call vme_dev_release to free vdev, kfree is
> > unnecessary here.
> 
> That does seem to be the case. I think "unnecessary" is overly kind;
> it does seem to be a double free.
> 
> Looks like the issue was introduced back in 2013 by commit
> def1820d25fa ("vme: add missing put_device() after device_register()
> fails").
> 
> It seems you should *either* kfree() the vdev, _or_ do put_device(),
> but doing both seems wrong.
> 
> I presume the device_register() has never failed, and this being
> vme-only I'm guessing there isn't a vibrant testing community.
> 

I think that is also overly kind :-)

I currently lack access to suitable hardware to test fully myself and I
need to find some time to (re)implement some automated testing, after I
lost access to the bits I had when I left a previous employer. That and
see if I can get access to some hardware again...

Manohar, do you still have access/interest in the VME stuff? You've
been very quiet for a long time now.

Martyn


Re: [PATCH] arm64: add NUMA emulation support

2018-09-07 Thread Michal Hocko
On Thu 06-09-18 15:53:34, Shuah Khan wrote:
[...]
> A few critical allocations could be satisfied and root cgroup prevails. It is 
> not the
> intent to have exclusivity at the expense of the kernel.

Well, it is not "a few critical allocations". It can be a lot of
memory, basically any GFP_KERNEL allocation. So how exactly do you
expect this to work when you cannot estimate how much memory the
kernel will eat?

> 
> This feature will allow a way to configure cpusets on non-NUMA for workloads 
> that can
> benefit from the reservation and isolation that is available within the 
> constraints of
> exclusive cpuset policies.

AFAIR this was the first approach Google took for memory isolation,
and they moved over to memory cgroups. I would recommend talking to
those guys before you introduce what is potentially a lot of code that
will not really work for the workload you intend it for.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] printk/tracing: Do not trace printk_nmi_enter()

2018-09-07 Thread Sergey Senozhatsky
On (09/07/18 10:28), Petr Mladek wrote:
> On Fri 2018-09-07 09:45:31, Peter Zijlstra wrote:
> > On Thu, Sep 06, 2018 at 11:31:51AM +0900, Sergey Senozhatsky wrote:
> > > An alternative option, thus, could be re-instating back the rule that
> > > lockdep_off/on should be the first and the last thing we do in
> > > nmi_enter/nmi_exit. E.g.
> > > 
> > > nmi_enter()
> > >   lockdep_off();
> > >   printk_nmi_enter();
> > > 
> > > nmi_exit()
> > >   printk_nmi_exit();
> > >   lockdep_on();
> > 
> > Yes that. Also, those should probably be inline functions.
> > 
> > ---
> > Subject: locking/lockdep: Fix NMI handling
> > 
> > Someone put code in the NMI handler before lockdep_off(). Since lockdep
> > is not NMI safe, this wrecks stuff.
> 
> My view is that nmi_enter() has to switch several features into
> NMI-safe mode. The code must not trigger the other features when
> they are still in the unsafe mode.
> 
> It is a chicken-and-egg problem. And it is hard to completely prevent
> regressions caused by future changes.
> 
> I thought that printk_nmi_enter() should never need any lockdep-related
> code. On the other hand, people might want to printk debug messages
> when lockdep_off() is called. This is why I put it in the current order.
> 
> That said, I am not against this change. Especially the inlining
> is a good move. Note that lockdep_off()/lockdep_on() must not
> be traced as well.

Shouldn't printk_nmi_enter()/printk_nmi_exit() still be notrace?
Like you and Steven said, it's still before ftrace_nmi_enter()
and should be notrace regardless.

-ss


possible deadlock in start_this_handle

2018-09-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:ca16eb342ebe Merge tag 'for-linus-20180906' of git://git.k..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=129e80ae40
kernel config:  https://syzkaller.appspot.com/x/.config?x=6c9564cd177daf0c
dashboard link: https://syzkaller.appspot.com/bug?extid=fe49aec75e221f9b093e
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+fe49aec75e221f9b0...@syzkaller.appspotmail.com

ISOFS: Unable to identify CD-ROM format.

==
WARNING: possible circular locking dependency detected
4.19.0-rc2+ #2 Not tainted
--
kswapd0/1430 is trying to acquire lock:
85a9412e (jbd2_handle){}, at: start_this_handle+0x589/0x1260  
fs/jbd2/transaction.c:383


but task is already holding lock:
af99a839 (fs_reclaim){+.+.}, at: __page_frag_cache_refill  
mm/page_alloc.c:4476 [inline]
af99a839 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30  
mm/page_alloc.c:4505


which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (fs_reclaim){+.+.}:
   __fs_reclaim_acquire mm/page_alloc.c:3728 [inline]
   fs_reclaim_acquire.part.98+0x24/0x30 mm/page_alloc.c:3739
   fs_reclaim_acquire+0x14/0x20 mm/page_alloc.c:3740
   slab_pre_alloc_hook mm/slab.h:418 [inline]
   slab_alloc mm/slab.c:3378 [inline]
   kmem_cache_alloc_trace+0x2d/0x730 mm/slab.c:3618
   kmalloc include/linux/slab.h:513 [inline]
   kzalloc include/linux/slab.h:707 [inline]
   smk_fetch.part.24+0x5a/0xf0 security/smack/smack_lsm.c:273
   smk_fetch security/smack/smack_lsm.c:3548 [inline]
   smack_d_instantiate+0x946/0xea0 security/smack/smack_lsm.c:3502
   security_d_instantiate+0x5c/0xf0 security/security.c:1287
   d_instantiate+0x5e/0xa0 fs/dcache.c:1870
   shmem_mknod+0x189/0x1f0 mm/shmem.c:2812
   vfs_mknod+0x447/0x800 fs/namei.c:3719
   handle_create+0x1ff/0x7c0 drivers/base/devtmpfs.c:211
   handle drivers/base/devtmpfs.c:374 [inline]
   devtmpfsd+0x27f/0x4c0 drivers/base/devtmpfs.c:400
   kthread+0x35a/0x420 kernel/kthread.c:246
   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

-> #1 (&isp->smk_lock){+.+.}:
   __mutex_lock_common kernel/locking/mutex.c:925 [inline]
   __mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
   mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
   smack_d_instantiate+0x130/0xea0 security/smack/smack_lsm.c:3369
   security_d_instantiate+0x5c/0xf0 security/security.c:1287
   d_instantiate_new+0x7e/0x160 fs/dcache.c:1889
   ext4_add_nondir+0x81/0x90 fs/ext4/namei.c:2415
   ext4_symlink+0x761/0x1170 fs/ext4/namei.c:3162
   vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
   do_symlinkat+0x242/0x2d0 fs/namei.c:4154
   __do_sys_symlink fs/namei.c:4173 [inline]
   __se_sys_symlink fs/namei.c:4171 [inline]
   __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (jbd2_handle){}:
   lock_acquire+0x1e4/0x4f0 kernel/locking/lockdep.c:3901
   start_this_handle+0x5c0/0x1260 fs/jbd2/transaction.c:385
   jbd2__journal_start+0x3c9/0x9f0 fs/jbd2/transaction.c:439
   __ext4_journal_start_sb+0x18d/0x590 fs/ext4/ext4_jbd2.c:81
   __ext4_journal_start fs/ext4/ext4_jbd2.h:311 [inline]
   ext4_dirty_inode+0x62/0xc0 fs/ext4/inode.c:6021
   __mark_inode_dirty+0x760/0x1300 fs/fs-writeback.c:2129
   mark_inode_dirty_sync include/linux/fs.h:2072 [inline]
   iput+0x131/0xa00 fs/inode.c:1570
   dentry_unlink_inode+0x461/0x5e0 fs/dcache.c:374
   __dentry_kill+0x44c/0x7a0 fs/dcache.c:566
   shrink_dentry_list+0x322/0x7c0 fs/dcache.c:1079
   prune_dcache_sb+0x12f/0x1c0 fs/dcache.c:1171
   super_cache_scan+0x270/0x480 fs/super.c:102
   do_shrink_slab+0x4ba/0xbb0 mm/vmscan.c:536
   shrink_slab+0x389/0x8c0 mm/vmscan.c:686
   shrink_node+0x429/0x16a0 mm/vmscan.c:2735
   kswapd_shrink_node mm/vmscan.c:3457 [inline]
   balance_pgdat+0x7ca/0x1010 mm/vmscan.c:3567
   kswapd+0x82f/0x11e0 mm/vmscan.c:3789
   kthread+0x35a/0x420 kernel/kthread.c:246
   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

other info that might help us debug this:

Chain exists of:
  jbd2_handle --> &isp->smk_lock --> fs_reclaim

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(fs_reclaim);
   lock(&isp->smk_lock);
   lock(fs_reclaim);
  lock(jbd2_handle);

 *** DEADLOCK ***

3 locks held by kswapd0/1430:
 #0: af99a

general protection fault in ovl_free_fs

2018-09-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:b36fdc6853a3 Merge tag 'gpio-v4.19-2' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13579cd140
kernel config:  https://syzkaller.appspot.com/x/.config?x=6c9564cd177daf0c
dashboard link: https://syzkaller.appspot.com/bug?extid=c75f181dc8429d2eb887
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c75f181dc8429d2eb...@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
CPU: 0 PID: 12806 Comm: syz-executor3 Not tainted 4.19.0-rc2+ #224
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:ovl_free_fs+0x4d9/0x650 fs/overlayfs/super.c:226
Code: 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 67 01 00 00 48  
b8 00 00 00 00 00 fc ff df 4c 8b 23 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f  
85 56 01 00 00 49 8b 3c 24 e8 d4 0a 01 00 e9 37 fc

RSP: 0018:8800aec7f7b8 EFLAGS: 00010246
RAX: dc00 RBX: 880111d7ef00 RCX: c90003487000
RDX:  RSI: 827c844a RDI: 0001
RBP: 8800aec7f810 R08: 8801b9cae080 R09: ed003b6046de
R10: 0003 R11: 0001 R12: 
R13: 880111d7ef20 R14: fff4 R15: 880111d7ef00
FS:  7f715a1ab700() GS:8801db00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 02242e80 CR3: 00018698d000 CR4: 001426f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending  
cookies.  Check SNMP counters.

Call Trace:
 ovl_fill_super+0x4f1/0x3ffc fs/overlayfs/super.c:1508
 mount_nodev+0x6b/0x110 fs/super.c:1204
 ovl_mount+0x2c/0x40 fs/overlayfs/super.c:1516
 mount_fs+0xae/0x328 fs/super.c:1261
 vfs_kern_mount.part.35+0xdc/0x4f0 fs/namespace.c:961
 vfs_kern_mount fs/namespace.c:951 [inline]
 do_new_mount fs/namespace.c:2457 [inline]
 do_mount+0x581/0x30e0 fs/namespace.c:2787
 ksys_mount+0x12d/0x140 fs/namespace.c:3003
 __do_sys_mount fs/namespace.c:3017 [inline]
 __se_sys_mount fs/namespace.c:3014 [inline]
 __x64_sys_mount+0xbe/0x150 fs/namespace.c:3014
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457099
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00

RSP: 002b:7f715a1aac78 EFLAGS: 0246 ORIG_RAX: 00a5
RAX: ffda RBX: 7f715a1ab6d4 RCX: 00457099
RDX: 20c0 RSI: 2000 RDI: 
RBP: 009300a0 R08: 2100 R09: 
R10:  R11: 0246 R12: 0003
R13: 004d3300 R14: 004c8241 R15: 002a
Modules linked in:
Dumping ftrace buffer:
   (ftrace buffer empty)
---[ end trace a35d50b1706abe7c ]---
RIP: 0010:ovl_free_fs+0x4d9/0x650 fs/overlayfs/super.c:226
Code: 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 67 01 00 00 48  
b8 00 00 00 00 00 fc ff df 4c 8b 23 4c 89 e2 48 c1 ea 03 <80> 3c 02 00 0f  
85 56 01 00 00 49 8b 3c 24 e8 d4 0a 01 00 e9 37 fc

RSP: 0018:8800aec7f7b8 EFLAGS: 00010246
RAX: dc00 RBX: 880111d7ef00 RCX: c90003487000
RDX:  RSI: 827c844a RDI: 0001
RBP: 8800aec7f810 R08: 8801b9cae080 R09: ed003b6046de
R10: 0003 R11: 0001 R12: 
R13: 880111d7ef20 R14: fff4 R15: 880111d7ef00
FS:  7f715a1ab700() GS:8801db00() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 02242e80 CR3: 00018698d000 CR4: 001426f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.


KASAN: use-after-free Read in cma_bind_port

2018-09-07 Thread syzbot

Hello,

syzbot found the following crash on:

HEAD commit:b36fdc6853a3 Merge tag 'gpio-v4.19-2' of git://git.kernel...
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1660c26640
kernel config:  https://syzkaller.appspot.com/x/.config?x=6c9564cd177daf0c
dashboard link: https://syzkaller.appspot.com/bug?extid=da2591e115d57a9cbb8b
compiler:   gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+da2591e115d57a9cb...@syzkaller.appspotmail.com

==
BUG: KASAN: use-after-free in cma_bind_port+0x35d/0x3f0  
drivers/infiniband/core/cma.c:3059

Read of size 2 at addr 8801b89fb320 by task syz-executor3/7584

CPU: 1 PID: 7584 Comm: syz-executor3 Not tainted 4.19.0-rc2+ #224
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:431
 cma_bind_port+0x35d/0x3f0 drivers/infiniband/core/cma.c:3059
 cma_alloc_port+0x115/0x180 drivers/infiniband/core/cma.c:3095
 cma_alloc_any_port drivers/infiniband/core/cma.c:3160 [inline]
 cma_get_port drivers/infiniband/core/cma.c:3314 [inline]
 rdma_bind_addr+0x1765/0x23d0 drivers/infiniband/core/cma.c:3434
 cma_bind_addr drivers/infiniband/core/cma.c:2963 [inline]
 rdma_resolve_addr+0x4fa/0x27a0 drivers/infiniband/core/cma.c:2974
 ucma_resolve_ip+0x242/0x2a0 drivers/infiniband/core/ucma.c:711
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x117/0x9d0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457099
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00

RSP: 002b:7f47da18ac78 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 7f47da18b6d4 RCX: 00457099
RDX: 0048 RSI: 2240 RDI: 0005
RBP: 009300a0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 004d8100 R14: 004c1c28 R15: 

Allocated by task 7601:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
 kmem_cache_alloc_trace+0x152/0x730 mm/slab.c:3620
 kmalloc include/linux/slab.h:513 [inline]
 kzalloc include/linux/slab.h:707 [inline]
 __rdma_create_id+0xdf/0x7a0 drivers/infiniband/core/cma.c:782
 ucma_create_id+0x399/0x9d0 drivers/infiniband/core/ucma.c:502
 ucma_write+0x336/0x420 drivers/infiniband/core/ucma.c:1680
 __vfs_write+0x117/0x9d0 fs/read_write.c:485
 vfs_write+0x1fc/0x560 fs/read_write.c:549
 ksys_write+0x101/0x260 fs/read_write.c:598
 __do_sys_write fs/read_write.c:610 [inline]
 __se_sys_write fs/read_write.c:607 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:607
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7583:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:448
 set_track mm/kasan/kasan.c:460 [inline]
 __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
 kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
 __cache_free mm/slab.c:3498 [inline]
 kfree+0xd9/0x210 mm/slab.c:3813
 rdma_destroy_id+0x848/0xce0 drivers/infiniband/core/cma.c:1737
 ucma_close+0x100/0x300 drivers/infiniband/core/ucma.c:1759
 __fput+0x38a/0xa40 fs/file_table.c:278
 fput+0x15/0x20 fs/file_table.c:309
 task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:193 [inline]
 exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
 prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
 do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at 8801b89fb300
 which belongs to the cache kmalloc-2048 of size 2048
The buggy address is located 32 bytes inside of
 2048-byte region [8801b89fb300, 8801b89fbb00)
The buggy address belongs to the page:
page:ea0006e27e80 count:1 mapcount:0 mapping:8801dac00c40 index:0x0  
compound_mapcount: 0

flags: 0x2fffc

Re: [PATCH 1/5] mfd: lochnagar: Add support for the Cirrus Logic Lochnagar

2018-09-07 Thread Charles Keepax
On Fri, Sep 07, 2018 at 08:06:52AM +0800, kbuild test robot wrote:
> Hi Charles,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on ljones-mfd/for-mfd-next]
> [also build test ERROR on v4.19-rc2 next-20180906]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Charles-Keepax/mfd-lochnagar-Add-support-for-the-Cirrus-Logic-Lochnagar/20180907-010308
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd.git 
> for-mfd-next
> config: alpha-allmodconfig (attached as .config)
> compiler: alpha-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=alpha 
> 
> All errors (new ones prefixed by >>):
> 
>drivers/mfd/lochnagar-i2c.o: In function `lochnagar_i2c_probe':
> >> (.text+0x4ec): undefined reference to `__devm_regmap_init_i2c'
>(.text+0x500): undefined reference to `__devm_regmap_init_i2c'

Sorry, it seems I am missing a "select REGMAP_I2C" in the Kconfig;
I will fix that up for v2 after any other comments.

>drivers/mfd/lochnagar-i2c.o: In function `lochnagar_i2c_init':
> >> (.init.text+0x10): undefined reference to `i2c_register_driver'
>(.init.text+0x24): undefined reference to `i2c_register_driver'

Not sure what is causing this one; there is a "depends on I2C" in
there. I will have a look and see if I can figure that out.

Thanks,
Charles


Re: Regression in next with filesystem context concept

2018-09-07 Thread David Howells
Tony Lindgren  wrote:

> Looks like next-20180906 now has a regression where mounting
> root won't work with commit fd0002870b45 ("vfs: Implement a
> filesystem superblock creation/configuration context").

Am I right in thinking you're not using any of the LSMs?

David


[PATCH] Input: reserve 2 events code because of HID

2018-09-07 Thread Benjamin Tissoires
From: Benjamin Tissoires 

Prior to commit 190d7f02ce8e ("HID: input: do not increment usages when
a duplicate is found") from the v4.18 kernel, HID used to shift the
event codes if a duplicate usage was found. This resulted in a situation
where a device would export a ton of ABS_MISC+n event codes, or a ton
of REL_MISC+n event codes.

This is now fixed, however userspace needs to detect those situations.
Fortunately, ABS_MISC+1 was never assigned a code, and so libinput
can distinguish fake multitouch devices from genuine ones by checking
whether ABS_MISC+1 is set.

Now that we have REL_WHEEL_HI_RES, libinput won't be able to differentiate
true high-resolution mice from some other device on a pre-v4.18 kernel.

Set in stone that ABS_MISC+1 and REL_MISC+1 are reserved and should not
be used, so userspace can properly work around those old kernels.

Signed-off-by: Benjamin Tissoires 
---

Hi,

while reviewing my local tree, I realized that we might want to be able
to differentiate older kernels from new ones that export REL_WHEEL_HI_RES.

I know Dmitry was against adding several REL_MISC codes, so I hope just
moving REL_WHEEL_HI_RES up by one and reserving the faulty event codes
will be acceptable this time.

This patch applies on top of the branch for-4.20/logitech-highres from
Jiri's tree. It should go through Jiri's tree as well.

Cheers,
Benjamin

 include/uapi/linux/input-event-codes.h | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/input-event-codes.h 
b/include/uapi/linux/input-event-codes.h
index 29fb891ea337..30149939249a 100644
--- a/include/uapi/linux/input-event-codes.h
+++ b/include/uapi/linux/input-event-codes.h
@@ -708,7 +708,12 @@
 #define REL_DIAL   0x07
 #define REL_WHEEL  0x08
 #define REL_MISC   0x09
-#define REL_WHEEL_HI_RES   0x0a
+/*
+ * 0x0a is reserved and should not be used.
+ * It was used by HID as REL_MISC+1 and userspace needs to detect if
+ * the next REL_* event is correct or is just REL_MISC + n.
+ */
+#define REL_WHEEL_HI_RES   0x0b
 #define REL_MAX0x0f
 #define REL_CNT(REL_MAX+1)
 
@@ -745,6 +750,12 @@
 
 #define ABS_MISC   0x28
 
+/*
+ * 0x29 is reserved and should not be used.
+ * It was used by HID as ABS_MISC+1 and userspace needs to detect if
+ * the next ABS_* event is correct or is just ABS_MISC + n.
+ */
+
 #define ABS_MT_SLOT0x2f/* MT slot being modified */
 #define ABS_MT_TOUCH_MAJOR 0x30/* Major axis of touching ellipse */
 #define ABS_MT_TOUCH_MINOR 0x31/* Minor axis (omit if circular) */
-- 
2.14.3



Re: [PATCH V8 0/9] mmc: add support for sdhci 4.0

2018-09-07 Thread Ulf Hansson
On 30 August 2018 at 10:21, Chunyan Zhang  wrote:
> From SD host controller version 4.0 onwards, an SDHCI implementation is
> either version 3 compatible or operates in version 4 mode. This patch-set
> covers those changes which are common to SDHCI 4.0, regardless of whether
> the controller is used with SD or eMMC storage devices.
>
> This patchset also added a new sdhci driver for Spreadtrum's controller
> which supports v4.0 mode.
>
> This patchset has been tested on Spreadtrum's mobile phone, emmc can be
> initialized, mounted, read and written, with these changes for common
> sdhci framework and sdhci-sprd driver.
>
> Changes from V7:
> - Added Adrian's acked-by on patch 1-6;
> - Addressed comments.
>
> Previous patch series:
> v7: https://lkml.org/lkml/2018/8/29/130
> v6: http://lkml.org/lkml/2018/8/24/205
> v5: https://lkml.org/lkml/2018/8/16/122
> v4: https://lkml.org/lkml/2018/7/23/269
> v3: https://lkml.org/lkml/2018/7/8/239
> v2: https://lkml.org/lkml/2018/6/14/936
> v1: https://lkml.org/lkml/2018/6/8/108
>
> Chunyan Zhang (9):
>   mmc: sdhci: Add version V4 definition
>   mmc: sdhci: Add sd host v4 mode
>   mmc: sdhci: Change SDMA address register for v4 mode
>   mmc: sdhci: Add ADMA2 64-bit addressing support for V4 mode
>   mmc: sdhci: Add 32-bit block count support for v4 mode
>   mmc: sdhci: Add Auto CMD Auto Select support
>   mmc: sdhci: SDMA may use Auto-CMD23 in v4 mode
>   mmc: sdhci-sprd: Add Spreadtrum's initial host controller
>   dt-bindings: sdhci-sprd: Add bindings for the sdhci-sprd controller
>
>  .../devicetree/bindings/mmc/sdhci-sprd.txt |  41 ++
>  drivers/mmc/host/Kconfig   |  13 +
>  drivers/mmc/host/Makefile  |   1 +
>  drivers/mmc/host/sdhci-sprd.c  | 498 +
>  drivers/mmc/host/sdhci.c   | 223 +++--
>  drivers/mmc/host/sdhci.h   |  28 +-
>  6 files changed, 754 insertions(+), 50 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/mmc/sdhci-sprd.txt
>  create mode 100644 drivers/mmc/host/sdhci-sprd.c
>
> --
> 2.7.4
>

Applied for next, thanks!

Kind regards
Uffe


Re: [RESEND PATCH v2 1/3] mtd: spi-nor: add support to non-uniform SFDP SPI NOR flash memories

2018-09-07 Thread Tudor Ambarus
Thanks Marek,

On 09/03/2018 08:37 PM, Marek Vasut wrote:
> On 08/27/2018 12:26 PM, Tudor Ambarus wrote:
> [...]
> 
>> +/* JEDEC JESD216B Standard imposes erase sizes to be power of 2. */
>> +static inline u64
>> +spi_nor_div_by_erase_size(const struct spi_nor_erase_type *erase,
>> +  u64 dividend, u32 *remainder)
>> +{
>> +*remainder = (u32)dividend & erase->size_mask;
> 
> Is the cast really needed ? btw I think there might be a macro doing
> just this, div_by_ or something in include/ .

The cast is not needed: the AND sets to zero all but the low-order 32 bits
of the dividend, and then we have the implicit cast.

Are you referring to do_div()? I expect the bitwise operations to be faster.
Bitwise operations are preferred in include/linux/mtd/mtd.h too:

static inline uint32_t mtd_div_by_eb(uint64_t sz, struct mtd_info *mtd)
{
if (mtd->erasesize_shift)
return sz >> mtd->erasesize_shift;
do_div(sz, mtd->erasesize);
return sz;
}

> 
>> +return dividend >> erase->size_shift;
>> +}
>> +
>> +static const struct spi_nor_erase_type *
>> +spi_nor_find_best_erase_type(const struct spi_nor_erase_map *map,
>> + const struct spi_nor_erase_region *region,
>> + u64 addr, u32 len)
>> +{
>> +const struct spi_nor_erase_type *erase;
>> +u32 rem;
>> +int i;
>> +u8 erase_mask = region->offset & SNOR_ERASE_TYPE_MASK;
>> +
>> +/*
>> + * Erase types are ordered by size, with the biggest erase type at
>> + * index 0.
>> + */
>> +for (i = SNOR_ERASE_TYPE_MAX - 1; i >= 0; i--) {
>> +/* Does the erase region support the tested erase type? */
>> +if (!(erase_mask & BIT(i)))
>> +continue;
>> +
>> +erase = &map->erase_type[i];
>> +
>> +/* Don't erase more than what the user has asked for. */
>> +if (erase->size > len)
>> +continue;
>> +
>> +/* Alignment is not mandatory for overlaid regions */
>> +if (region->offset & SNOR_OVERLAID_REGION)
>> +return erase;
>> +
>> +spi_nor_div_by_erase_size(erase, addr, &rem);
>> +if (rem)
>> +continue;
>> +else
>> +return erase;
>> +}
>> +
>> +return NULL;
>> +}
>> +
>> +static struct spi_nor_erase_region *
>> +spi_nor_region_next(struct spi_nor_erase_region *region)
>> +{
>> +if (spi_nor_region_is_last(region))
>> +return NULL;
>> +return ++region;
> 
> region++ ...

It's an array of regions, consecutive in address space, which is walked
incrementally. If the received region is not the last, I want to return the
next region, so ++region is correct.

> 
> [...]
> 
>> +static int spi_nor_cmp_erase_type(const void *a, const void *b)
>> +{
>> +const struct spi_nor_erase_type *erase1 = a;
>> +const struct spi_nor_erase_type *erase2 = b;
>> +
>> +return erase1->size - erase2->size;
> 
> What does this function do again ?

It's a compare function: it compares the map's Erase Types by size. I pass a
pointer to this function in the sort() call, which sorts all the map's Erase
Types in ascending order by size when parsing the BFPT. The sort is done at
init to speed up finding the best erase command at run-time.

A better name for this function is spi_nor_map_cmp_erase_type(), we compare the
map's Erase Types by size.

> 
>> +}
>> +
>> +static void spi_nor_regions_sort_erase_types(struct spi_nor_erase_map *map)
>> +{
>> +struct spi_nor_erase_region *region = map->regions;
>> +struct spi_nor_erase_type *erase_type = map->erase_type;
>> +int i;
>> +u8 region_erase_mask, ordered_erase_mask;
>> +
>> +/*
>> + * Sort each region's Erase Types in ascending order with the smallest
>> + * Erase Type size starting at BIT(0).
>> + */
>> +while (region) {
>> +region_erase_mask = region->offset & SNOR_ERASE_TYPE_MASK;
>> +
>> +/*
>> + * The region's erase mask indicates which erase types are
>> + * supported from the erase types defined in the map.
>> + */
>> +ordered_erase_mask = 0;
>> +for (i = 0; i < SNOR_ERASE_TYPE_MAX; i++)
>> +if (erase_type[i].size &&
>> +region_erase_mask & BIT(erase_type[i].idx))
>> +ordered_erase_mask |= BIT(i);
>> +
>> +/* Overwrite erase mask. */
>> +region->offset = (region->offset & ~SNOR_ERASE_TYPE_MASK) |
>> + ordered_erase_mask;
>> +
>> +region = spi_nor_region_next(region);
>> +}
>> +}
>> +
>> +static inline void
> 
> Drop the inline

Ok.

> 
>> +spi_nor_init_uniform_erase_map(struct spi_nor_erase_map *map,
>> +   u8 erase_mask, u64 flash_size)
>> +{
>> +map->uniform_region.offset = SNOR_ERASE_FLA

[PATCH V2] perf tools: Fix maps__find_symbol_by_name()

2018-09-07 Thread Adrian Hunter
Commit 1c5aae7710bb ("perf machine: Create maps for x86 PTI entry
trampolines") revealed a problem with maps__find_symbol_by_name() that
resulted in probes not being found e.g.

$ sudo perf probe xsk_mmap
xsk_mmap is out of .text, skip it.
Probe point 'xsk_mmap' not found.
   Error: Failed to add events.

maps__find_symbol_by_name() can optionally return the map of the found
symbol. It can get the map wrong because, in fact, the symbol is found
on the map's dso, not allowing for the possibility that the dso has more
than one map. Fix by always checking the map contains the symbol.

Reported-by: Björn Töpel 
Tested-by: Björn Töpel 
Cc: sta...@vger.kernel.org
Signed-off-by: Adrian Hunter 
---


Changes in V2:

Expanded commit message
Corrected email address


 tools/perf/util/map.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index 3f07a587c8e6..354e54550d2b 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -574,6 +574,13 @@ struct symbol *map_groups__find_symbol(struct map_groups *mg,
return NULL;
 }
 
+static bool map__contains_symbol(struct map *map, struct symbol *sym)
+{
+   u64 ip = map->unmap_ip(map, sym->start);
+
+   return ip >= map->start && ip < map->end;
+}
+
 struct symbol *maps__find_symbol_by_name(struct maps *maps, const char *name,
 struct map **mapp)
 {
@@ -589,6 +596,10 @@ struct symbol *maps__find_symbol_by_name(struct maps *maps, const char *name,
 
if (sym == NULL)
continue;
+   if (!map__contains_symbol(pos, sym)) {
+   sym = NULL;
+   continue;
+   }
if (mapp != NULL)
*mapp = pos;
goto out;
-- 
2.17.1



Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

2018-09-07 Thread Rafael J. Wysocki
On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> Hi Rafael,
> 
> On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > I'm not a particular fan of notifiers to be honest and you don't need
> > to add an extra chain just in order to be able to register a callback
> > from a single user.
> 
> Right. I agree there are alternatives to using notifiers. I used them
> because they're existing infrastructure, and because they let me do what
> I want without too much troubles, which are two important points.
> 
> > That can be achieved with a single callback
> > pointer too, but also you could just call a function exported by the
> > scheduler directly from where in the cpufreq code it needs to be
> > called.
> 
> Are you thinking about something comparable to what is done in
> cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> That would probably have the same drawback as my current implementation,
> that is that the scheduler is notified of _all_ governor changes, not
> only changes to/from sugov although this is the only thing we care about
> for EAS.

Well, why don't you implement it as something like "if the governor changes
from sugov to something else (or the other way around), call this function
from the scheduler"?



Re: [RFC 3/3] stk1160: Use non-coherent buffers for USB transfers

2018-09-07 Thread Tomasz Figa
On Fri, Aug 31, 2018 at 2:59 AM Christoph Hellwig  wrote:
>
> > + dma_sync_single_for_cpu(&urb->dev->dev, urb->transfer_dma,
> > + urb->transfer_buffer_length, DMA_FROM_DEVICE);
>
> You can't ue dma_sync_single_for_cpu on non-coherent dma buffers,
> which is one of the major issues with them.

It's not an issue with the DMA API, just an API mismatch. By design,
memory allocated for a device (e.g. by the DMA API) doesn't have to be
physically contiguous, while the dma_*_single() API expects a _single_,
physically contiguous region of memory.

We need a way to allocate non-coherent memory using DMA API to handle
(on USB example, but applies to virtually any class of devices doing
DMA):
 - DMA address range limitations (e.g. dma_mask) - while a USB HCD
driver is normally aware of those, USB device driver should have no
idea,
 - memory mapping capability === whether contiguous memory or a set of
random pages can be allocated - this is a platform integration detail,
which even a USB HCD driver may not be aware of, if a SoC IOMMU is
just stuffed between the bus and HCD,
 - platform coherency specifics - there are practical scenarios when
on a coherent-by-default system it's more efficient to allocate
non-coherent memory and manage caches explicitly to avoid the costs of
cache snooping.

If DMA_ATTR_NON_CONSISTENT is not the right way to do it, there should
be definitely a new API introduced, coupled closely to DMA API
implementation on given platform, since it's the only place which can
solve all the constraints above.

Best regards,
Tomasz


Re: [PATCH] ARM: dts: at91: sama5d2_ptc_ek: fix nand pinctrl

2018-09-07 Thread Nicolas Ferre

On 07/09/2018 at 10:18, Ludovic Desroches wrote:

The drive strength has to be set to medium, otherwise some data
corruption may happen.

Signed-off-by: Ludovic Desroches 


Acked-by: Nicolas Ferre 


---

Hi,

This fix depends on support for drive-strength in the atmel pio4
pin controller. That support was added in v4.19, but I omitted to send
this fix at the same time.


Could be good to be queued for "4.19-fixes".

Best regards,
  Nicolas


Ludovic

  arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts b/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
index b10dccd0958f..3b1baa8605a7 100644
--- a/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
+++ b/arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts
@@ -11,6 +11,7 @@
  #include "sama5d2-pinfunc.h"
  #include 
  #include 
+#include 
  
  / {

model = "Atmel SAMA5D2 PTC EK";
@@ -299,6 +300,7 @@
 ,
 ;
bias-pull-up;
+   atmel,drive-strength = 
;
};
  
  	ale_cle_rdy_cs {





--
Nicolas Ferre


Re: [PATCH v3] mm: slowly shrink slabs with a relatively small number of objects

2018-09-07 Thread Michal Hocko
[Please make sure to CC Vladimir when modifying memcg kmem reclaim]

On Wed 05-09-18 16:07:59, Roman Gushchin wrote:
> Commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
> changed the way how the target slab pressure is calculated and
> made it priority-based:
> 
> delta = freeable >> priority;
> delta *= 4;
> do_div(delta, shrinker->seeks);
> 
> The problem is that on a default priority (which is 12) no pressure
> is applied at all, if the number of potentially reclaimable objects
> is less than 4096 (1<<12).
> 
> This causes the last objects on slab caches of no longer used cgroups
> to (almost) never get reclaimed. It's obviously a waste of memory.
> 
> It can be especially painful, if these stale objects are holding
> a reference to a dying cgroup. Slab LRU lists are reparented on memcg
> offlining, but corresponding objects are still holding a reference
> to the dying cgroup. If we don't scan these objects, the dying cgroup
> can't go away. Most likely, the parent cgroup hasn't any directly
> charged objects, only remaining objects from dying children cgroups.
> So it can easily hold a reference to hundreds of dying cgroups.
> 
> If there are no big spikes in memory pressure, and new memory cgroups
> are created and destroyed periodically, this causes the number of
> dying cgroups to grow steadily, causing a slow-ish and hard-to-detect
> memory "leak". It's not a real leak, as the memory can eventually be
> reclaimed, but this may never happen in real life. I've seen
> hosts with a steadily climbing number of dying cgroups, which didn't
> show any signs of decline for months, even though the host was loaded
> with a production workload.
> 
> It is an obvious waste of memory, and to prevent it, let's apply
> a minimal pressure even on small shrinker lists. E.g. if there are
> freeable objects, let's scan at least min(freeable, scan_batch)
> objects.
> 
> This fix significantly improves a chance of a dying cgroup to be
> reclaimed, and together with some previous patches stops the steady
> growth of the dying cgroups number on some of our hosts.
> 
> Signed-off-by: Roman Gushchin 
> Acked-by: Rik van Riel 
> Cc: Josef Bacik 
> Cc: Johannes Weiner 
> Cc: Shakeel Butt 
> Cc: Andrew Morton 
> ---
>  mm/vmscan.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index fa2c150ab7b9..858d7558909e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -476,6 +476,17 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
>   delta = freeable >> priority;
>   delta *= 4;
>   do_div(delta, shrinker->seeks);
> +
> + /*
> +  * Make sure we apply some minimal pressure on default priority
> +  * even on small cgroups. Stale objects are not only consuming memory
> +  * by themselves, but can also hold a reference to a dying cgroup,
> +  * preventing it from being reclaimed. A dying cgroup with all
> +  * corresponding structures like per-cpu stats and kmem caches
> +  * can be really big, so it may lead to a significant waste of memory.
> +  */
> + delta = max_t(unsigned long long, delta, min(freeable, batch_size));
> +
>   total_scan += delta;
>   if (total_scan < 0) {
>   pr_err("shrink_slab: %pF negative objects to delete nr=%ld\n",
> -- 
> 2.17.1

-- 
Michal Hocko
SUSE Labs


Re: linux-next: build warning after merge of the vfs tree

2018-09-07 Thread David Howells
Stephen Rothwell  wrote:

> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) produced this warning:
> 
> security/selinux/hooks.c:2835:12: warning: 'selinux_sb_remount' defined but not used [-Wunused-function]
>  static int selinux_sb_remount(struct super_block *sb, void *data, size_t data_size)
> ^~
> 
> Introduced by commit
> 
>   dfcf373eab92 ("vfs: Remove unused code after filesystem context changes")

Sorry, I forgot to remove the function.  It's no longer used, with the checks
being done in selinux_validate_for_sb_reconfigure() now, called indirectly
from the ->fs_context_validate() hook.

David


Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

2018-09-07 Thread Rafael J. Wysocki
On Friday, September 7, 2018 10:52:01 AM CEST Rafael J. Wysocki wrote:
> On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > Hi Rafael,
> > 
> > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > I'm not a particular fan of notifiers to be honest and you don't need
> > > to add an extra chain just in order to be able to register a callback
> > > from a single user.
> > 
> > Right. I agree there are alternatives to using notifiers. I used them
> > because they're existing infrastructure, and because they let me do what
> > I want without too much troubles, which are two important points.
> > 
> > > That can be achieved with a single callback
> > > pointer too, but also you could just call a function exported by the
> > > scheduler directly from where in the cpufreq code it needs to be
> > > called.
> > 
> > Are you thinking about something comparable to what is done in
> > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > That would probably have the same drawback as my current implementation,
> > that is that the scheduler is notified of _all_ governor changes, not
> > only changes to/from sugov although this is the only thing we care about
> > for EAS.
> 
> Well, why don't you implement it as something like "if the governor changes
> from sugov to something else (or the other way around), call this function
> from the scheduler"?

That said, governors are stopped and started in a few cases other than just
changing the governor, so maybe you want the EAS side to be notified whenever
sugov is stopped and started after all?



Re: [PATCH v6 13/14] sched/topology: Make Energy Aware Scheduling depend on schedutil

2018-09-07 Thread Quentin Perret
On Friday 07 Sep 2018 at 10:56:12 (+0200), Rafael J. Wysocki wrote:
> On Friday, September 7, 2018 10:52:01 AM CEST Rafael J. Wysocki wrote:
> > On Thursday, September 6, 2018 4:38:44 PM CEST Quentin Perret wrote:
> > > Hi Rafael,
> > > 
> > > On Thursday 06 Sep 2018 at 11:18:55 (+0200), Rafael J. Wysocki wrote:
> > > > I'm not a particular fan of notifiers to be honest and you don't need
> > > > to add an extra chain just in order to be able to register a callback
> > > > from a single user.
> > > 
> > > Right. I agree there are alternatives to using notifiers. I used them
> > > because they're existing infrastructure, and because they let me do what
> > > I want without too much troubles, which are two important points.
> > > 
> > > > That can be achieved with a single callback
> > > > pointer too, but also you could just call a function exported by the
> > > > scheduler directly from where in the cpufreq code it needs to be
> > > > called.
> > > 
> > > Are you thinking about something comparable to what is done in
> > > cpufreq_add_update_util_hook() (kernel/sched/cpufreq.c) for example ?
> > > That would probably have the same drawback as my current implementation,
> > > that is that the scheduler is notified of _all_ governor changes, not
> > > only changes to/from sugov although this is the only thing we care about
> > > for EAS.
> > 
> > Well, why don't you implement it as something like "if the governor changes
> > from sugov to something else (or the other way around), call this function
> > from the scheduler"?

Yes, that works too ...

> That said, governors are stopped and started in a few cases other than just
> changing the governor, so maybe you want the EAS side to be notified whenever
> sugov is stopped and started after all?

Right, so sugov_start/sugov_stop could be an option in this case ... And
that would leave the CPUFreq core untouched. I'll try to write something :-)

Thanks,
Quentin


Re: [PATCH 00/17] thermal: enable/check sensor after its setup is finished

2018-09-07 Thread Amit Kucheria
Hi Bartlomiej,

Do you have an updated version of this patchset on 4.19-rc1 somewhere
that I can look at? I might be seeing this issue on the QCom TSENS
driver and would like to verify.

Regards,
Amit

On Tue, Apr 10, 2018 at 6:11 PM, Bartlomiej Zolnierkiewicz
 wrote:
> Hi,
>
> [devm]_thermal_zone_of_sensor_register() is used to register a
> thermal sensor by thermal drivers using DeviceTree. Besides
> registering the sensor, this function also immediately enables it
> (using the ->set_mode method) and then checks it with an update call
> to the thermal core (which ends up using the ->get_temp method).
> For many DT thermal drivers this causes a problem, because
> [devm]_thermal_zone_of_sensor_register() needs to be called in
> order to obtain data about thermal trips which are then used to
> finish hardware sensor setup (only after which ->get_temp can
> be used). The issue has been observed when using Samsung Exynos
> thermal driver and fixed internally in the driver in commit
> d8efad71e5b6 ("thermal: exynos: Reading temperature makes sense
> only when TMU is turned on"). However after this commit there
> are now following warnings from the thermal core visible:
>
> [3.453602] thermal thermal_zone0: failed to read out thermal zone (-22)
> [3.483468] thermal thermal_zone1: failed to read out thermal zone (-22)
> [3.505965] thermal thermal_zone2: failed to read out thermal zone (-22)
> [3.528455] thermal thermal_zone3: failed to read out thermal zone (-22)
> [3.550939] thermal thermal_zone4: failed to read out thermal zone (-22)
>
> This patchset attempts to directly address the thermal core
> problem with [devm]_thermal_zone_of_sensor_register() and
> affected DT thermal drivers. In order to achieve this sensor
> registration, enable and check operations are separated and
> corresponding drivers are modified to use the new helpers to
> enable and check sensor explicitly.
>
> Tested on Exynos5422 based Odroid-XU3 Lite board (aforementioned
> warnings from the thermal core are now gone).
>
> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
>
>
> Bartlomiej Zolnierkiewicz (17):
>   thermal: add thermal_zone_device_toggle() helper
>   thermal: separate sensor registration and enable
>   thermal: add thermal_zone_device_check() helper
>   thermal: do sensor checking explicitly in drivers
>   thermal: bcm2835: enable/check sensor after its setup is finished
>   thermal: brcmstb: enable/check sensor after its setup is finished
>   thermal: hisi_thermal: enable/check sensor after its setup is finished
>   thermal: qcom: tsens: enable/check sensor after its setup is finished
>   thermal: qoriq: enable/check sensor after its setup is finished
>   thermal: rcar_gen3_thermal: enable/check sensor after its setup is
> finished
>   thermal: rockchip_thermal: enable/check sensor after its setup is
> finished
>   thermal: exynos: enable/check sensor after its setup is finished
>   thermal: tegra: enable/check sensor after its setup is finished
>   thermal: ti-soc-thermal: enable/check sensor after its setup is
> finished
>   thermal: uniphier: enable/check sensor after its setup is
> finished
>   thermal: zx2967: enable/check sensor after its setup is finished
>   thermal: warn on attempts to read temperature on disabled sensors
>
>  drivers/acpi/thermal.c |  5 ++--
>  drivers/net/ethernet/mellanox/mlxsw/core_thermal.c |  1 -
>  drivers/platform/x86/acerhdf.c |  6 +++-
>  drivers/regulator/max8973-regulator.c  |  3 +-
>  drivers/thermal/broadcom/bcm2835_thermal.c |  3 ++
>  drivers/thermal/broadcom/brcmstb_thermal.c |  3 ++
>  drivers/thermal/broadcom/ns-thermal.c  |  3 ++
>  drivers/thermal/da9062-thermal.c   |  7 ++---
>  drivers/thermal/db8500_thermal.c   |  5 +++-
>  drivers/thermal/hisi_thermal.c | 22 --
>  drivers/thermal/imx_thermal.c  |  3 +-
>  drivers/thermal/int340x_thermal/int3400_thermal.c  |  1 +
>  drivers/thermal/intel_bxt_pmic_thermal.c   |  3 +-
>  drivers/thermal/intel_soc_dts_iosf.c   |  3 +-
>  drivers/thermal/max77620_thermal.c |  6 ++--
>  drivers/thermal/mtk_thermal.c  |  3 ++
>  drivers/thermal/of-thermal.c   |  6 ++--
>  drivers/thermal/qcom-spmi-temp-alarm.c |  5 +++-
>  drivers/thermal/qcom/tsens.c   |  6 
>  drivers/thermal/qoriq_thermal.c|  3 ++
>  drivers/thermal/rcar_gen3_thermal.c|  7 +++--
>  drivers/thermal/rcar_thermal.c |  8 +++--
>  drivers/thermal/rockchip_thermal.c | 34 
> ++
>  drivers/thermal/samsung/exynos_tmu.c   |  7 -
>  drivers/thermal/st/st_thermal_memmap.c |  3 +-
>  drivers/thermal/tango_thermal.c   

Re: [RESEND PATCH v2 2/3] mtd: spi-nor: parse SFDP Sector Map Parameter Table

2018-09-07 Thread Tudor Ambarus



On 09/03/2018 08:40 PM, Marek Vasut wrote:
> On 08/27/2018 12:26 PM, Tudor Ambarus wrote:
> [...]
>> +static const u32 *spi_nor_get_map_in_use(struct spi_nor *nor, const u32 
>> *smpt)
>> +{
>> +const u32 *ret = NULL;
>> +u32 i, addr;
>> +int err;
>> +u8 addr_width, read_opcode, read_dummy;
>> +u8 read_data_mask, data_byte, map_id;
>> +
>> +addr_width = nor->addr_width;
>> +read_dummy = nor->read_dummy;
>> +read_opcode = nor->read_opcode;
>> +
>> +map_id = 0;
>> +i = 0;
>> +/* Determine if there are any optional Detection Command Descriptors */
>> +while (!(smpt[i] & SMPT_DESC_TYPE_MAP)) {
>> +read_data_mask = SMPT_CMD_READ_DATA(smpt[i]);
>> +nor->addr_width = spi_nor_smpt_addr_width(nor, smpt[i]);
>> +nor->read_dummy = spi_nor_smpt_read_dummy(nor, smpt[i]);
>> +nor->read_opcode = SMPT_CMD_OPCODE(smpt[i]);
>> +addr = smpt[i + 1];
>> +
>> +err = spi_nor_read_raw(nor, addr, 1, &data_byte);
>> +if (err)
>> +goto out;
>> +
>> +/*
>> + * Build an index value that is used to select the Sector Map
>> + * Configuration that is currently in use.
>> + */
>> +map_id = map_id << 1 | (!(data_byte & read_data_mask) ? 0 : 1);
> 
> You can drop the ternary operator part completely ^

I'll use !! instead.

> 
>> +i = i + 2;
>> +}
>> +
>> +/* Find the matching configuration map */
>> +while (SMPT_MAP_ID(smpt[i]) != map_id) {
>> +if (smpt[i] & SMPT_DESC_END)
>> +goto out;
>> +/* increment the table index to the next map */
>> +i += SMPT_MAP_REGION_COUNT(smpt[i]) + 1;
>> +}
>> +
>> +ret = smpt + i;
>> +/* fall through */
>> +out:
>> +nor->addr_width = addr_width;
>> +nor->read_dummy = read_dummy;
>> +nor->read_opcode = read_opcode;
>> +return ret;
>> +}
>> +
>> +static void
>> +spi_nor_region_check_overlay(struct spi_nor_erase_region *region,
>> + const struct spi_nor_erase_type *erase,
>> + const u8 erase_type)
>> +{
>> +int i;
>> +
>> +for (i = 0; i < SNOR_ERASE_TYPE_MAX; i++) {
>> +if (!(erase_type & BIT(i)))
>> +continue;
>> +if (region->size & erase[i].size_mask) {
>> +spi_nor_region_mark_overlay(region);
>> +return;
>> +}
>> +}
>> +}
>> +
>> +static int spi_nor_init_non_uniform_erase_map(struct spi_nor *nor,
>> +  const u32 *smpt)
>> +{
>> +struct spi_nor_erase_map *map = &nor->erase_map;
>> +const struct spi_nor_erase_type *erase = map->erase_type;
>> +struct spi_nor_erase_region *region;
>> +u64 offset;
>> +u32 region_count;
>> +int i, j;
>> +u8 erase_type;
>> +
>> +region_count = SMPT_MAP_REGION_COUNT(*smpt);
>> +region = devm_kcalloc(nor->dev, region_count, sizeof(*region),
>> +  GFP_KERNEL);
> 
> Is this memory always correctly free'd ?

Yes. It will be free'd when the driver detaches from the device.

Thanks,
ta


Re: [PATCH] printk/tracing: Do not trace printk_nmi_enter()

2018-09-07 Thread Peter Zijlstra
On Fri, Sep 07, 2018 at 10:28:34AM +0200, Petr Mladek wrote:
> On Fri 2018-09-07 09:45:31, Peter Zijlstra wrote:
> > On Thu, Sep 06, 2018 at 11:31:51AM +0900, Sergey Senozhatsky wrote:
> > > An alternative option, thus, could be re-instating back the rule that
> > > lockdep_off/on should be the first and the last thing we do in
> > > nmi_enter/nmi_exit. E.g.
> > > 
> > > nmi_enter()
> > >   lockdep_off();
> > >   printk_nmi_enter();
> > > 
> > > nmi_exit()
> > >   printk_nmi_exit();
> > >   lockdep_on();
> > 
> > Yes that. Also, those should probably be inline functions.
> > 
> > ---
> > Subject: locking/lockdep: Fix NMI handling
> > 
> > Someone put code in the NMI handler before lockdep_off(). Since lockdep
> > is not NMI safe, this wrecks stuff.
> 
> My view is that nmi_enter() has to switch several features into
> NMI-safe mode. The code must not trigger the other features when
> they are still in the unsafe mode.
> 
> It is a chicken&egg problem. And it is hard to completely prevent
> regressions caused by future changes.

Sure, not bothered too much about the regression, that happens.

> I though that printk_nmi_enter() should never need any lockdep-related
> code. On the other hand, people might want to printk debug messages
> when lockdep_off() is called. This is why I put it in the current order.

Nah, that'd be broken. Or rather, if you want to debug NMI stuff, you
had better know wth you're doing. Printk, as per mainline, is utterly
useless for that -- I still carry those force_early_printk patches
locally.

Because even if the core printk code were to be NMI safe (possible I
think, all we need is a lockless ring-buffer), then none of the console
drivers are :/

(I really hate this current printk-nmi mess)

> That said, I am not against this change. Especially the inlining
> is a good move. Note that lockdep_off()/lockdep_on() must not
> be traced as well.

Hard to trace an inline function; we could make it __always_inline to
feel better.



Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads

2018-09-07 Thread Alexey Budankov



On 07.09.2018 10:07, Alexey Budankov wrote:
> 
> Currently in record mode the tool implements trace writing serially. 
> The algorithm loops over mapped per-cpu data buffers and stores 
> ready data chunks into a trace file using write() system call.
> 
> Under some circumstances the kernel may lack free space in a buffer
> because the buffer's other half has not yet been written to disk, as the
> tool is busy writing some other buffer's data at the moment.
> 
> Thus the serial trace writing implementation may cause the kernel
> to lose profiling data, and that is what is observed when profiling
> highly parallel CPU bound workloads on machines with a big number
> of cores.
> 
> Experiment with profiling matrix multiplication code executing 128 
> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
> demonstrates data loss metrics value of 98%:
> 
> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
> --call-graph dwarf,1024 --user-regs=IP,SP,BP \
> --switch-events -e 
> cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
> matrix.gcc
> 
> The data loss metric is the ratio lost_time/elapsed_time, where
> lost_time is the sum of time intervals containing PERF_RECORD_LOST
> records and elapsed_time is the elapsed application run time
> under profiling.
> 
> Applying asynchronous trace streaming through the POSIX AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html)
> lowers the data loss metric value, providing a 2x improvement -
> lowering the 98% loss to almost 0%.
> 
> ---
>  Alexey Budankov (3):
> perf util: map data buffer for preserving collected data
> perf record: enable asynchronous trace writing
> perf record: extend trace writing to multi AIO
>  
>  tools/perf/builtin-record.c | 166 
> ++--
>  tools/perf/perf.h   |   1 +
>  tools/perf/util/evlist.c|   7 +-
>  tools/perf/util/evlist.h|   3 +-
>  tools/perf/util/mmap.c  | 114 ++
>  tools/perf/util/mmap.h  |  11 ++-
>  6 files changed, 277 insertions(+), 25 deletions(-)

The whole thing for 

git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux perf/core 

repository follows:

 tools/perf/builtin-record.c | 165 ++--
 tools/perf/perf.h   |   1 +
 tools/perf/util/evlist.c|   7 +-
 tools/perf/util/evlist.h|   3 +-
 tools/perf/util/mmap.c  | 114 ++
 tools/perf/util/mmap.h  |  11 ++-
 6 files changed, 276 insertions(+), 25 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9853552bcf16..7bb7947072e5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,112 @@ static int record__write(struct record *rec, void *bf, 
size_t size)
return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+   void *buf, size_t size, off_t off)
+{
+   int rc;
+
+   cblock->aio_fildes = trace_fd;
+   cblock->aio_buf= buf;
+   cblock->aio_nbytes = size;
+   cblock->aio_offset = off;
+   cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+   do {
+   rc = aio_write(cblock);
+   if (rc == 0) {
+   break;
+   } else if (errno != EAGAIN) {
+   cblock->aio_fildes = -1;
+   pr_err("failed to queue perf data, error: %m\n");
+   break;
+   }
+   } while (1);
+
+   return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+   void *rem_buf;
+   off_t rem_off;
+   size_t rem_size;
+   int rc, aio_errno;
+   ssize_t aio_ret, written;
+
+   aio_errno = aio_error(cblock);
+   if (aio_errno == EINPROGRESS)
+   return 0;
+
+   written = aio_ret = aio_return(cblock);
+   if (aio_ret < 0) {
+   if (aio_errno != EINTR)
+   pr_err("failed to write perf data, error: %m\n");
+   written = 0;
+   }
+
+   rem_size = cblock->aio_nbytes - written;
+
+   if (rem_size == 0) {
+   cblock->aio_fildes = -1;
+   /*
+* md->refcount is incremented in perf_mmap__push() for
+* every enqueued aio write request so decrement it because
+* the request is now complete.
+*/
+   perf_mmap__put(md);
+   rc = 1;
+   } else {
+   /*
+* aio write request may require restart with the
+* remainder if the kernel didn't write the whole
+* chunk at once.
+*/
+   rem_off = cblock->aio_offset + written;
+   rem_buf = (void *)(cblock->aio_buf + written);
+   record__aio_write(cblock, cblock->aio_fildes,
+   re

[SCHEDULER] Performance drop in 4.19 compared to 4.18 kernel

2018-09-07 Thread Jirka Hladky
Hi Srikar,

I work at Red Hat in the Kernel Performance Team. I would like to ask
you for help.

We have detected a significant performance drop (20% and more) with
4.19rc1 relative to vanilla 4.18. We see the regression on various
2 NUMA and 4 NUMA boxes with pretty much all the benchmarks we use -
NAS, Stream, SPECjbb2005, SPECjvm2008.

Mel Gorman has suggested checking commit
2d4056fafa196e1ab4e7161bae4df76f9602d56d - with it reverted we
got some performance back, but not all of it:

 * Compared to 4.18, there is still a performance regression -
especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
systems, the regression is around 10-15%.
 * Compared to 4.19rc1 there is a clear gain across all benchmarks, up to 20%.

We are investigating the issue further; Mel has suggested checking
305c1fac3225dfa7eeb89bfe91b7335a6edd5172 next.

Do you have any further recommendations, which commits have possibly
caused the performance degradation?

I want to discuss with you how we can collaborate on performance
testing for the upstream kernel. Does your testing also show a
performance drop in 4.19? If so, do you have any plans for a fix? If
not, can we send you some more information about our tests so that
you can try to reproduce it?

We would also be more than happy to test new patches for
performance - please let us know if you are interested.  We have a
pool of 1 NUMA up to 8 NUMA boxes for that, both AMD and Intel,
covering different CPU generations from Sandy Bridge to Skylake.

I'm looking forward to hearing from you.
Jirka


Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

2018-09-07 Thread Borislav Petkov
On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
> The SYSCALL64 trampoline has a couple of nice properties:
> 
>  - The usual sequence of SWAPGS followed by two GS-relative accesses to
>set up RSP is somewhat slow because the GS-relative accesses need
>to wait for SWAPGS to finish.  The trampoline approach allows
>RIP-relative accesses to set up RSP, which avoids the stall.

...

> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index 31341ae7309f..7e79154846c8 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -434,11 +434,42 @@ static void __init pti_clone_p4d(unsigned long addr)
>  }
>  
>  /*
> - * Clone the CPU_ENTRY_AREA into the user space visible page table.
> + * Clone the CPU_ENTRY_AREA and associated data into the user space visible
> + * page table.
>   */
>  static void __init pti_clone_user_shared(void)
>  {
> + unsigned cpu;

Make that

unsigned int cpu;

Otherwise, patches removing complex code are always good!

Reviewed-by: Borislav Petkov 

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[GIT PULL] ACPI fixes for v4.19-rc3

2018-09-07 Thread Rafael J. Wysocki
two ACPI fixes for 4.19-rc3.

These fix a regression from the 4.18 cycle in the ACPI driver
for Intel SoCs (LPSS) and prevent dmi_check_system() from being
called on non-x86 systems in the ACPI core.

Specifics:

 - Fix a power management regression in the ACPI driver for Intel
   SoCs (LPSS) introduced by a system-wide suspend/resume fix during
   the 4.18 cycle (Zhang Rui).

 - Prevent dmi_check_system() from being called on non-x86 systems in
   the ACPI core (Jean Delvare).

Thanks!


---

Jean Delvare (1):
  ACPI / bus: Only call dmi_check_system() on X86

Zhang Rui (1):
  ACPI / LPSS: Force LPSS quirks on boot


Re: possible deadlock in start_this_handle

2018-09-07 Thread Jan Kara
On Fri 07-09-18 01:38:03, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:ca16eb342ebe Merge tag 'for-linus-20180906' of git://git.k..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=129e80ae40
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6c9564cd177daf0c
> dashboard link: https://syzkaller.appspot.com/bug?extid=fe49aec75e221f9b093e
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+fe49aec75e221f9b0...@syzkaller.appspotmail.com
> 

So this looks like a false positive due to Michal's memalloc_nofs_save()
work and lockdep not being clever enough. And it actually nicely shows how
Michal's work prevents real deadlock possibilities.

> ==
> WARNING: possible circular locking dependency detected
> 4.19.0-rc2+ #2 Not tainted
> --
> kswapd0/1430 is trying to acquire lock:
> 85a9412e (jbd2_handle){}, at: start_this_handle+0x589/0x1260
> fs/jbd2/transaction.c:383
> 
> but task is already holding lock:
> af99a839 (fs_reclaim){+.+.}, at: __page_frag_cache_refill
> mm/page_alloc.c:4476 [inline]
> af99a839 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
> mm/page_alloc.c:4505

So lockdep is complaining about kswapd starting a transaction during fs
reclaim. It should be OK to do so, so let's see why lockdep thinks starting
a transaction in fs reclaim is not good.

> the existing dependency chain (in reverse order) is:
> 
> -> #2 (fs_reclaim){+.+.}:
>__fs_reclaim_acquire mm/page_alloc.c:3728 [inline]
>fs_reclaim_acquire.part.98+0x24/0x30 mm/page_alloc.c:3739
>fs_reclaim_acquire+0x14/0x20 mm/page_alloc.c:3740
>slab_pre_alloc_hook mm/slab.h:418 [inline]
>slab_alloc mm/slab.c:3378 [inline]
>kmem_cache_alloc_trace+0x2d/0x730 mm/slab.c:3618

Here smack is doing a GFP_KERNEL allocation. In this case it gets called
from shmfs, which is OK with doing fs reclaim...

>kmalloc include/linux/slab.h:513 [inline]
>kzalloc include/linux/slab.h:707 [inline]
>smk_fetch.part.24+0x5a/0xf0 security/smack/smack_lsm.c:273
>smk_fetch security/smack/smack_lsm.c:3548 [inline]
>smack_d_instantiate+0x946/0xea0 security/smack/smack_lsm.c:3502
>security_d_instantiate+0x5c/0xf0 security/security.c:1287
>d_instantiate+0x5e/0xa0 fs/dcache.c:1870
>shmem_mknod+0x189/0x1f0 mm/shmem.c:2812
>vfs_mknod+0x447/0x800 fs/namei.c:3719
>handle_create+0x1ff/0x7c0 drivers/base/devtmpfs.c:211
>handle drivers/base/devtmpfs.c:374 [inline]
>devtmpfsd+0x27f/0x4c0 drivers/base/devtmpfs.c:400
>kthread+0x35a/0x420 kernel/kthread.c:246
>ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
> 
> -> #1 (&isp->smk_lock){+.+.}:
>__mutex_lock_common kernel/locking/mutex.c:925 [inline]
>__mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
>mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
>smack_d_instantiate+0x130/0xea0 security/smack/smack_lsm.c:3369
>security_d_instantiate+0x5c/0xf0 security/security.c:1287

... while here the same function gets called from ext4, which will have
PF_MEMALLOC_NOFS set, so fs reclaim won't happen. Lockdep transfers the
locking dependency through isp->smk_lock, as it is not clever enough to
know that this lock in the shmfs case is different from this lock in the
ext4 case - they are in the same locking class.

So what Smack needs to do to prevent these false positives is to set the
locking class of its smk_lock to be different for different filesystem
types. We do the same for other inode-related locks in fs/inode.c:
inode_init_always() - see the lockdep_set_class() calls there - so Smack
needs to do something similar.

>d_instantiate_new+0x7e/0x160 fs/dcache.c:1889
>ext4_add_nondir+0x81/0x90 fs/ext4/namei.c:2415
>ext4_symlink+0x761/0x1170 fs/ext4/namei.c:3162
>vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
>do_symlinkat+0x242/0x2d0 fs/namei.c:4154
>__do_sys_symlink fs/namei.c:4173 [inline]
>__se_sys_symlink fs/namei.c:4171 [inline]
>__x64_sys_symlink+0x59/0x80 fs/namei.c:4171
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] mfd: ti-lmu: constify mfd_cell tables

2018-09-07 Thread Pavel Machek
On Wed 2018-08-29 11:31:08, Pavel Machek wrote:
> From: Sebastian Reichel 
> 
> mfd: ti-lmu: constify mfd_cell tables
> 
> Add const attribute to all mfd_cell structures.
> 
> Signed-off-by: Sebastian Reichel 
> Signed-off-by: Pavel Machek 

Lee, I guess this is for you to apply. Any news there?

There are more patches ready,

https://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap.git/log/?h=droid4-pending-v4.19

and it would be good to get them in. (Alternatively, you can just
cherry-pick them from droid4-pending-v4.19).

Thanks,
Pavel


> diff --git a/drivers/mfd/ti-lmu.c b/drivers/mfd/ti-lmu.c
> index cfb411c..990437e 100644
> --- a/drivers/mfd/ti-lmu.c
> +++ b/drivers/mfd/ti-lmu.c
> @@ -25,7 +25,7 @@
>  #include 
>  
>  struct ti_lmu_data {
> - struct mfd_cell *cells;
> + const struct mfd_cell *cells;
>   int num_cells;
>   unsigned int max_register;
>  };
> @@ -63,7 +63,7 @@ static void ti_lmu_disable_hw(struct ti_lmu *lmu)
>   gpio_set_value(lmu->en_gpio, 0);
>  }
>  
> -static struct mfd_cell lm3532_devices[] = {
> +static const struct mfd_cell lm3532_devices[] = {
>   {
>   .name  = "ti-lmu-backlight",
>   .id= LM3532,
> @@ -78,7 +78,7 @@ static struct mfd_cell lm3532_devices[] = {
>   .of_compatible = "ti,lm363x-regulator", \
>  }\
>  
> -static struct mfd_cell lm3631_devices[] = {
> +static const struct mfd_cell lm3631_devices[] = {
>   LM363X_REGULATOR(LM3631_BOOST),
>   LM363X_REGULATOR(LM3631_LDO_CONT),
>   LM363X_REGULATOR(LM3631_LDO_OREF),
> @@ -91,7 +91,7 @@ static struct mfd_cell lm3631_devices[] = {
>   },
>  };
>  
> -static struct mfd_cell lm3632_devices[] = {
> +static const struct mfd_cell lm3632_devices[] = {
>   LM363X_REGULATOR(LM3632_BOOST),
>   LM363X_REGULATOR(LM3632_LDO_POS),
>   LM363X_REGULATOR(LM3632_LDO_NEG),
> @@ -102,7 +102,7 @@ static struct mfd_cell lm3632_devices[] = {
>   },
>  };
>  
> -static struct mfd_cell lm3633_devices[] = {
> +static const struct mfd_cell lm3633_devices[] = {
>   {
>   .name  = "ti-lmu-backlight",
>   .id= LM3633,
> @@ -120,7 +120,7 @@ static struct mfd_cell lm3633_devices[] = {
>   },
>  };
>  
> -static struct mfd_cell lm3695_devices[] = {
> +static const struct mfd_cell lm3695_devices[] = {
>   {
>   .name  = "ti-lmu-backlight",
>   .id= LM3695,
> @@ -128,7 +128,7 @@ static struct mfd_cell lm3695_devices[] = {
>   },
>  };
>  
> -static struct mfd_cell lm3697_devices[] = {
> +static const struct mfd_cell lm3697_devices[] = {
>   {
>   .name  = "ti-lmu-backlight",
>   .id= LM3697,
> 



-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


signature.asc
Description: Digital signature


[GIT PULL] sound fixes for 4.19-rc3

2018-09-07 Thread Takashi Iwai
Linus,

please pull sound fixes for v4.19-rc3 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git 
tags/sound-4.19-rc3

The topmost commit is f7c50fa636f72490baceb1664ba64973137266f2



sound fixes for 4.19-rc3

Just a few small fixes:
- a fix for the recursive work cancellation in a specific HD-audio
  operation mode
- a fix for potentially uninitialized memory access via rawmidi
- the register bit access fixes for ASoC HD-audio



Keyon Jie (1):
  ALSA: hda: Fix several mismatch for register mask and value

Takashi Iwai (2):
  ALSA: hda - Fix cancel_work_sync() stall from jackpoll work
  ALSA: rawmidi: Initialize allocated buffers

---
 sound/core/rawmidi.c|  4 ++--
 sound/hda/ext/hdac_ext_stream.c | 22 +++---
 sound/pci/hda/hda_codec.c   |  3 ++-
 3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/sound/core/rawmidi.c b/sound/core/rawmidi.c
index 69517e18ef07..08d5662039e3 100644
--- a/sound/core/rawmidi.c
+++ b/sound/core/rawmidi.c
@@ -129,7 +129,7 @@ static int snd_rawmidi_runtime_create(struct 
snd_rawmidi_substream *substream)
runtime->avail = 0;
else
runtime->avail = runtime->buffer_size;
-   runtime->buffer = kvmalloc(runtime->buffer_size, GFP_KERNEL);
+   runtime->buffer = kvzalloc(runtime->buffer_size, GFP_KERNEL);
if (!runtime->buffer) {
kfree(runtime);
return -ENOMEM;
@@ -655,7 +655,7 @@ static int resize_runtime_buffer(struct snd_rawmidi_runtime 
*runtime,
if (params->avail_min < 1 || params->avail_min > params->buffer_size)
return -EINVAL;
if (params->buffer_size != runtime->buffer_size) {
-   newbuf = kvmalloc(params->buffer_size, GFP_KERNEL);
+   newbuf = kvzalloc(params->buffer_size, GFP_KERNEL);
if (!newbuf)
return -ENOMEM;
spin_lock_irq(&runtime->lock);
diff --git a/sound/hda/ext/hdac_ext_stream.c b/sound/hda/ext/hdac_ext_stream.c
index 1bd27576db98..a835558ddbc9 100644
--- a/sound/hda/ext/hdac_ext_stream.c
+++ b/sound/hda/ext/hdac_ext_stream.c
@@ -146,7 +146,8 @@ EXPORT_SYMBOL_GPL(snd_hdac_ext_stream_decouple);
  */
 void snd_hdac_ext_link_stream_start(struct hdac_ext_stream *stream)
 {
-   snd_hdac_updatel(stream->pplc_addr, AZX_REG_PPLCCTL, 0, 
AZX_PPLCCTL_RUN);
+   snd_hdac_updatel(stream->pplc_addr, AZX_REG_PPLCCTL,
+AZX_PPLCCTL_RUN, AZX_PPLCCTL_RUN);
 }
 EXPORT_SYMBOL_GPL(snd_hdac_ext_link_stream_start);
 
@@ -171,7 +172,8 @@ void snd_hdac_ext_link_stream_reset(struct hdac_ext_stream 
*stream)
 
snd_hdac_ext_link_stream_clear(stream);
 
-   snd_hdac_updatel(stream->pplc_addr, AZX_REG_PPLCCTL, 0, 
AZX_PPLCCTL_STRST);
+   snd_hdac_updatel(stream->pplc_addr, AZX_REG_PPLCCTL,
+AZX_PPLCCTL_STRST, AZX_PPLCCTL_STRST);
udelay(3);
timeout = 50;
do {
@@ -242,7 +244,7 @@ EXPORT_SYMBOL_GPL(snd_hdac_ext_link_set_stream_id);
 void snd_hdac_ext_link_clear_stream_id(struct hdac_ext_link *link,
 int stream)
 {
-   snd_hdac_updatew(link->ml_addr, AZX_REG_ML_LOSIDV, 0, (1 << stream));
+   snd_hdac_updatew(link->ml_addr, AZX_REG_ML_LOSIDV, (1 << stream), 0);
 }
 EXPORT_SYMBOL_GPL(snd_hdac_ext_link_clear_stream_id);
 
@@ -415,7 +417,6 @@ void snd_hdac_ext_stream_spbcap_enable(struct hdac_bus *bus,
 bool enable, int index)
 {
u32 mask = 0;
-   u32 register_mask = 0;
 
if (!bus->spbcap) {
dev_err(bus->dev, "Address of SPB capability is NULL\n");
@@ -424,12 +425,8 @@ void snd_hdac_ext_stream_spbcap_enable(struct hdac_bus 
*bus,
 
mask |= (1 << index);
 
-   register_mask = readl(bus->spbcap + AZX_REG_SPB_SPBFCCTL);
-
-   mask |= register_mask;
-
if (enable)
-   snd_hdac_updatel(bus->spbcap, AZX_REG_SPB_SPBFCCTL, 0, mask);
+   snd_hdac_updatel(bus->spbcap, AZX_REG_SPB_SPBFCCTL, mask, mask);
else
snd_hdac_updatel(bus->spbcap, AZX_REG_SPB_SPBFCCTL, mask, 0);
 }
@@ -503,7 +500,6 @@ void snd_hdac_ext_stream_drsm_enable(struct hdac_bus *bus,
bool enable, int index)
 {
u32 mask = 0;
-   u32 register_mask = 0;
 
if (!bus->drsmcap) {
dev_err(bus->dev, "Address of DRSM capability is NULL\n");
@@ -512,12 +508,8 @@ void snd_hdac_ext_stream_drsm_enable(struct hdac_bus *bus,
 
mask |= (1 << index);
 
-   register_mask = readl(bus->drsmcap + AZX_REG_SPB_SPBFCCTL);
-
-   mask |= register_mask;
-
if (enable)
-   snd_hdac_updatel(bus->drsmcap, AZX_REG_DRSM_CTL, 0, mask);
+   snd_hdac_updatel(bus->drsmcap, AZX_REG_DRSM_CTL, mask, mask);

Re: [GIT PULL] ACPI fixes for v4.19-rc3

2018-09-07 Thread Rafael J. Wysocki
On Fri, Sep 7, 2018 at 11:37 AM Rafael J. Wysocki  wrote:
>
> two ACPI fixes for 4.19-rc3.
>
> These fix a regression from the 4.18 cycle in the ACPI driver
> for Intel SoCs (LPSS) and prevent dmi_check_system() from being
> called on non-x86 systems in the ACPI core.
>
> Specifics:
>
>  - Fix a power management regression in the ACPI driver for Intel
>SoCs (LPSS) introduced by a system-wide suspend/resume fix during
>the 4.18 cycle (Zhang Rui).
>
>  - Prevent dmi_check_system() from being called on non-x86 systems in
>the ACPI core (Jean Delvare).
>
> Thanks!
>
>
> ---
>
> Jean Delvare (1):
>   ACPI / bus: Only call dmi_check_system() on X86
>
> Zhang Rui (1):
>   ACPI / LPSS: Force LPSS quirks on boot

Sorry for the incomplete message, I will send the correct one momentarily.

Rafael


[GIT PULL] ACPI fixes for v4.19-rc3

2018-09-07 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.19-rc3

with top-most commit a6b7eb3b4176d9732d74d214b349932faa5524b4

 Merge branch 'acpi-bus'

on top of commit 57361846b52bc686112da6ca5368d11210796804

 Linux 4.19-rc2

to receive two ACPI fixes for 4.19-rc3.

These fix a regression from the 4.18 cycle in the ACPI driver
for Intel SoCs (LPSS) and prevent dmi_check_system() from being
called on non-x86 systems in the ACPI core.

Specifics:

 - Fix a power management regression in the ACPI driver for Intel
   SoCs (LPSS) introduced by a system-wide suspend/resume fix during
   the 4.18 cycle (Zhang Rui).

 - Prevent dmi_check_system() from being called on non-x86 systems in
   the ACPI core (Jean Delvare).

Thanks!


---

Jean Delvare (1):
  ACPI / bus: Only call dmi_check_system() on X86

Zhang Rui (1):
  ACPI / LPSS: Force LPSS quirks on boot

---

 drivers/acpi/acpi_lpss.c |  2 +-
 drivers/acpi/bus.c   | 13 +++--
 2 files changed, 8 insertions(+), 7 deletions(-)


Re: [PATCH v3 3/4] PCI: mediatek: Add system pm support for MT2712 and MT7622

2018-09-07 Thread Honghui Zhang
On Wed, 2018-07-18 at 10:49 +0100, Lorenzo Pieralisi wrote:
> On Wed, Jul 18, 2018 at 02:02:41PM +0800, Honghui Zhang wrote:
> 
> 
> 
> > > > +static int __maybe_unused mtk_pcie_resume_noirq(struct device *dev)
> > > > +{
> > > > +   struct mtk_pcie *pcie = dev_get_drvdata(dev);
> > > > +   const struct mtk_pcie_soc *soc = pcie->soc;
> > > > +   struct mtk_pcie_port *port, *tmp;
> > > > +
> > > > +   if (!soc->pm_support)
> > > > +   return 0;
> > > > +
> > > > +   if (list_empty(&pcie->ports))
> > > > +   return 0;
> > > > +
> > > > +   if (dev->pm_domain) {
> > > > +   pm_runtime_enable(dev);
> > > > +   pm_runtime_get_sync(dev);
> > > > +   }
> > > 
> > > Are these runtime PM calls needed/abused here ?
> > > 
> > > Mind explaining the logic ?
> > > 
> > > There is certainly an asymmetry with the suspend callback which made me
> > > suspicious, I am pretty certain Rafael/Kevin/Ulf can help me clarify so
> > > that we can make progress with this patch.
> > > 
> > > Lorenzo
> > > 
> > Hi Lorenzo, thanks for your comments.
> > Sorry, I don't follow.
> > I believe that in the suspend callbacks pm_runtime_put_sync and
> > pm_runtime_disable should be called to gate the CMOS for this module,
> > while pm_runtime_enable and pm_runtime_get_sync should be called
> > in the resume callback.

> That's why I CC'ed Rafael, Kevin and Ulf: to answer this question
> thoroughly, I am not sure it is needed or that it's the right way
> of doing it in system suspend callbacks.
> 

hi, Rafael, Kevin and Ulf,

after reading the power-related documents in Documentation/power/, I'm
still confused about whether the runtime PM callbacks should be called
in the system suspend callbacks.

I believe that system suspend does not care about the device's CMOS
status, and that the device's CMOS status is controlled by runtime PM.
That's why I gated the CMOS through runtime PM in the system suspend
callbacks.

But I checked all the existing system suspend callbacks and found that
none of them invoke runtime PM. Does that mean that when the system
suspends, the suspend framework does not care about power consumption?
Or is each device's CMOS gated somewhere else?

Or should I just remove the runtime PM calls from the system suspend
flow?

Could someone kindly give me some comments?

Thanks in advance.

> > That's exactly what this patch is doing.
> > But the pm_runtime_put_sync and pm_runtime_disable functions were
> > wrapped in mtk_pcie_subsys_powerdown.
> 
> Ah, sorry, I missed that.
> 
> > I did not call mtk_pcie_subsys_powerup since it does not just wrap
> > pm_runtime-related functions but also does the platform_resource_get,
> > devm_ioremap, and free_ck clock get, which I do not need in the
> > resume callback.
> > 
> > Do you think it would be clearer if I abstracted the
> > platform_resource_get and devm_ioremap functions out of
> > mtk_pcie_subsys_powerup into a new function like
> > mtk_pcie_subsys_resource_get, and then called
> > mtk_pcie_subsys_powerup in the resume function?
> 
> I think so but let's wait first for feedback on whether those
> runtime PM calls are needed in the first place.
> 
> Lorenzo




Re: Fwd: Smack: wrong-looking capable() check in smk_ptrace_rule_check()

2018-09-07 Thread Lukasz Pawelczyk
Hi,

On Thu, 2018-09-06 at 11:53 -0700, Casey Schaufler wrote:
> Lukasz, does this analysis seem correct to you? This is code you
> wrote in 2014.

It seems correct.
Moreover, I sent a patch that fixes this bug a long time ago with the
namespace series.

https://lists.linux-foundation.org/pipermail/containers/2015-October/036318.html

Not sure this is the latest version. The latest I ever wrote can be
found here:
https://github.com/Havner/smack-namespace/commit/52d6e4be2db51e9aca53e0e112a7ff9625000994

Without namespaces, parts of this patch are probably irrelevant, but it
does fix this bug and one or two similar ones elsewhere.

Best regards,
Lukasz



> 
>  Forwarded Message 
> Subject:  Smack: wrong-looking capable() check in
> smk_ptrace_rule_check()
> Date: Thu, 6 Sep 2018 20:22:35 +0200
> From: Jann Horn 
> To:   Casey Schaufler 
> CC:   linux-security-module ,
> kernel list 
> 
> Hi!
> 
> I noticed the following check in smk_ptrace_rule_check():
> 
> if (tracer_known->smk_known == tracee_known-
> >smk_known)
> rc = 0;
> else if (smack_ptrace_rule == SMACK_PTRACE_DRACONIAN)
> rc = -EACCES;
> else if (capable(CAP_SYS_PTRACE))
> rc = 0;
> else
> rc = -EACCES;
> 
> Note that smk_ptrace_rule_check() can be called from not just
> smack_ptrace_access_check() and smack_ptrace_traceme(), but also
> smack_bprm_set_creds(). AFAICS this means that if a task executes
> with
> a smack privilege transition and smack_ptrace_rule is
> SMACK_PTRACE_EXACT, whether the execution is permitted depends on
> whether _the debugged task_ has CAP_SYS_PTRACE (and not on whether
> the
> debugger has that capability).
> This seems like it's probably unintentional?
> 
> 


-- 
Lukasz Pawelczyk
Samsung R&D Institute Poland
Samsung Electronics





Re: ovl: hash non-dir by lower inode for fsnotify

2018-09-07 Thread Greg KH
On Thu, Aug 02, 2018 at 08:29:02AM -0700, Mark Salyzyn wrote:
> On 08/01/2018 11:05 PM, Greg KH wrote:
> > On Wed, Aug 01, 2018 at 02:29:01PM -0700, Mark Salyzyn wrote:
> > > 764baba80168ad3adafb521d2ab483ccbc49e344 ovl: hash non-dir by lower inode
> > > for fsnotify is not part of 4.14 stable and yet it was marked for 4.13
> > > stable merge when committed.
> > > 
> > > Please evaluate.
> > Why not try applying it yourself to 4.14.y and note that it does not
> > apply at all and then provide a working backport so that we can skip at
> > least one email cycle here?  :)
> > 
> > thanks,
> > 
> > greg k-h
> 
> Because I am embarrassed by the backport (!) perhaps? :-)
> 
> +linux-kernel list and authors/approvers for clearance.
> 
> I took some liberty with sb = dentry->d_sb and then sprinkled it in; upstream
> passes sb to the function and the conflict resolution assumed so.
> --> snip <-
> 
> From 764baba80168ad3adafb521d2ab483ccbc49e344 Mon Sep 17 00:00:00 2001
> From: Amir Goldstein 
> Date: Sun, 4 Feb 2018 15:35:09 +0200
> Subject: ovl: hash non-dir by lower inode for fsnotify
> 
> (cherry pick from commit 764baba80168ad3adafb521d2ab483ccbc49e344)
> 
> Commit 31747eda41ef ("ovl: hash directory inodes for fsnotify")
> fixed an issue of inotify watch on directory that stops getting
> events after dropping dentry caches.
> 
> A similar issue exists for non-dir non-upper files, for example:
> 
> $ mkdir -p lower upper work merged
> $ touch lower/foo
> $ mount -t overlay -o
> lowerdir=lower,workdir=work,upperdir=upper none merged
> $ inotifywait merged/foo &
> $ echo 2 > /proc/sys/vm/drop_caches
> $ cat merged/foo
> 
> inotifywait doesn't get the OPEN event, because ovl_lookup() called
> from 'cat' allocates a new overlay inode and does not reuse the
> watched inode.
> 
> Fix this by hashing non-dir overlay inodes by lower real inode in
> the following cases that were not hashed before this change:
>  - A non-upper overlay mount
>  - A lower non-hardlink when index=off
> 
> A helper ovl_hash_bylower() was added to put all the logic and
> documentation about which real inode an overlay inode is hashed by
> into one place.
> 
> The issue dates back to initial version of overlayfs, but this
> patch depends on ovl_inode code that was introduced in kernel v4.13.
> 
> Signed-off-by: Amir Goldstein 
> Signed-off-by: Miklos Szeredi 
> Cc:  #v4.13
> Signed-off-by: Mark Salyzyn  #v4.14
> ---
>  fs/overlayfs/inode.c | 62 +++-
>  1 file changed, 44 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 28a320464609a..7cfef4152e9a4 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include "overlayfs.h"
> +#include "ovl_entry.h"
> 
>  int ovl_setattr(struct dentry *dentry, struct iattr *attr)
>  {
> @@ -608,39 +609,63 @@ static bool ovl_verify_inode(struct inode *inode,
> struct dentry *lowerdentry,

As this patch is deemed "good", can you please resend it in a
non-corrupted way so that I can apply it to the 4.14.y tree?

thanks,

greg k-h


Re: [RESEND PATCH v4 1/6] arm64/mm: Introduce the init_pg_dir.

2018-09-07 Thread James Morse
Hi Jun,

On 22/08/18 10:54, Jun Yao wrote:
> To make the swapper_pg_dir read-only, we will move it to the rodata
> section and force the kernel to set up the initial page table in
> the init_pg_dir. After generating all levels of the page table, we copy
> only the top level into the swapper_pg_dir during paging_init().

Could you add v3's
| Add init_pg_dir to vmlinux.lds.S and boiler-plate
| clearing/cleaning/invalidating it in head.S.

too. This makes it obvious that 'init_pg_dir isn't used yet' is deliberate.

Reviewed-by: James Morse 


Some boring nits:

> diff --git a/arch/arm64/include/asm/assembler.h 
> b/arch/arm64/include/asm/assembler.h
> index 0bcc98dbba56..eb363a915c0e 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -456,6 +456,35 @@ USER(\label, ic  ivau, \tmp2)// 
> invalidate I line PoU

> +/*
> + * clear_pages - clear contiguous pages
> + *
> + *   start, end: page aligend virtual addresses

(Nit: aligned)


> + */
> + .macro clear_pages, start:req, end:req

> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 605d1b60469c..61d7cee3eaa6 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -68,6 +68,12 @@ jiffies = jiffies_64;
>  #define TRAMP_TEXT
>  #endif
>  
> +#define INIT_PG_TABLES   \

   ^ These are tabs ...

> + . = ALIGN(PAGE_SIZE);   \

   ^ ... but these are spaces.

> + init_pg_dir = .;\
> + . += SWAPPER_DIR_SIZE;  \
> + init_pg_end = .;

Please pick one and stick with it. The macro above,
CONFIG_UNMAP_KERNEL_AT_EL0, uses tabs, please do the same.



Thanks,

James


Re: [RESEND PATCH v4 0/6] arm64/mm: Move swapper_pg_dir to rodata

2018-09-07 Thread James Morse
Hi Jun,

(I'm a bit confused about which version of this series I should be looking at.
I have a v4, and two v4-resends, all of which are different. Please only mark
something as 'resend' if it is exactly the same!)


On 22/08/18 10:54, Jun Yao wrote:
> The set_init_mm_pgd() is reimplemented using assembly in order to
> avoid being instrumented by kasan.

There are some tidier ways of fixing this. The kasan init code is also C code
that is run before kasan is initialized. Kbuild is told not to let KASAN touch
it with 'KASAN_SANITIZE_filename.o := n'.

But, in this case you're only calling into C code from pre-kasan head.S so you
can use the same helper to set init_mm.pgd. I don't think this is worth the
effort, we can just do the store in assembly. (more in patch 3).


Thanks,

James


Re: [RESEND PATCH v4 2/6] arm64/mm: Pass ttbr1 as a parameter to __enable_mmu().

2018-09-07 Thread James Morse
Hi Jun,

On 22/08/18 10:54, Jun Yao wrote:
> The kernel sets up the initial page table in the init_pg_dir.

(Nit: 'will set up', it doesn't until patch 3.)

> However, it will create the final page table in the swapper_pg_dir
> during the initialization process. We need to let __enable_mmu()
> know which page table to use.

> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 2c83a8c47e3f..c3e4b1886cde 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -756,6 +757,7 @@ ENDPROC(__secondary_switched)
>   * Enable the MMU.
>   *
>   *  x0  = SCTLR_EL1 value for turning on the MMU.
> + *  x1  = TTBR1_EL1 value for turning on the MMU.
>   *
>   * Returns to the caller via x30/lr. This requires the caller to be covered
>   * by the .idmap.text section.
> @@ -764,15 +766,15 @@ ENDPROC(__secondary_switched)
>   * If it isn't, park the CPU
>   */
>  ENTRY(__enable_mmu)
> - mrs x1, ID_AA64MMFR0_EL1
> - ubfx x2, x1, #ID_AA64MMFR0_TGRAN_SHIFT, 4
> - cmp x2, #ID_AA64MMFR0_TGRAN_SUPPORTED
> + mrs x5, ID_AA64MMFR0_EL1
> + ubfx x6, x5, #ID_AA64MMFR0_TGRAN_SHIFT, 4
> + cmp x6, #ID_AA64MMFR0_TGRAN_SUPPORTED
>   b.ne __no_granule_support
> - update_early_cpu_boot_status 0, x1, x2
> - adrp x1, idmap_pg_dir
> - adrp x2, swapper_pg_dir
> - phys_to_ttbr x3, x1
> - phys_to_ttbr x4, x2
> + update_early_cpu_boot_status 0, x5, x6
> + adrp x5, idmap_pg_dir
> + mov x6, x1
> + phys_to_ttbr x3, x5
> + phys_to_ttbr x4, x6
>   msr ttbr0_el1, x3   // load TTBR0
>   msr ttbr1_el1, x4   // load TTBR1
>   isb
> @@ -791,7 +793,7 @@ ENDPROC(__enable_mmu)
>  
>  __no_granule_support:
>   /* Indicate that this CPU can't boot and is stuck in the kernel */
> - update_early_cpu_boot_status CPU_STUCK_IN_KERNEL, x1, x2
> + update_early_cpu_boot_status CPU_STUCK_IN_KERNEL, x5, x6

(You don't need to change these as they are both temporary registers.)

Reviewed-by: James Morse 


Thanks,

James


Re: [RESEND PATCH v4 4/6] arm64/mm: Create the final page table directly in swapper_pg_dir.

2018-09-07 Thread James Morse
Hi Jun,

On 22/08/18 10:54, Jun Yao wrote:
> As the initial page table is created in the init_pg_dir, we can set
> up the final page table directly in the swapper_pg_dir. And it only
> contains the top level page table, so we can reduce it to a page.

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH V12 00/14] Krait clocks + Krait CPUfreq

2018-09-07 Thread Sricharan R
Hi Craig,


>> [v12]
>>   * Added my signed-off that was missing in some patches.
>>   * Added Bjorn's acked that i missed earlier.
>>
> 
>   Can you give this a try on your 8974 device and check if the
>   pvs version reporting, scaling for higher frequencies are fine ?
>   Sorry, i could not get hold of a 8974 device. So in-case if you still
>   have the issues with higher frequencies, can you give a quick debug
>   and report. That would be of great help.
> 
   Ping on this ..

Regards,
 Sricharan

> Regards,
>  Sricharan
> 
> 
>> [v11]
>>   * Dropped patch 13 and 14 from v10 and
>> merged the qcom-cpufreq-krait driver to the existing qcom-cpufreq-kryo.c
>>   * Rebased on top of clk-next
>>   * Fixed a bug while populating the pvs version for krait.
>>
>> [v10]
>>   * Addressed Stephen's comments to add clocks bindings properties
>> to the newly introduced nodes.
>>   * Added a change to include opp-supported-hw to qcom-cpufreq.c
>>   * Rebased on top of clk-next
>>   * Although there were minor changes to bindings and the driver
>> retained the acked-by tags from Rob and Viresh respectively.
>>
>> [v9]
>>   * Fixed a rebase issue in Makefile and added Tag from Robh.
>>
>> [v8]
>>   * Fixed a bug in patch #14 pointed out by Viresh and also added tags.
>> No change in any other patch.
>>
>> [v7]
>>   * Fixed comments from Viresh for cleaning up the error handling
>> in qcom-cpufreq.c. Also changed the init function to lateinit
>> call. This is required because nvmem which gets initialised with
>> module_init needs to go first.
>>   * Fixed Rob's comments for bindings documentation
>>   * Fixed kbuild build issue in clk-lpc32xx.c
>>   * Rebased on top of clk-next
>>
>> [v6]
>>   * Addressed comments from Viresh for patch #14 in v5 [5]
>>   * Introduced a new binding operating-points-v2-krait-cpu
>> as per discussion with Rob
>>   * Added Review tags
>>
>> [v5]
>>   * Addressed comments from Rob for bindings
>>   * Addressed comments from Viresh to use dev_pm_opp_set_prop_name, 
>> accordingly
>> dropped patch #12 and corrected patch #11 from previous patch set in [4]
>>   * Converted to use #spdx tags for newly introduced files
>>
>> Mostly a resend of the v3 posted by Stephen quite some time back [1]
>> except for few changes.
>>   Based on reading some feedback from list,
>>   * Dropped the patch "clk: Add safe switch hook" from v3 [2].
>> Now this is taken care by patch#10 in this series only for Krait.
>>   * Dropped the path "clk: Avoid sending high rates to downstream
>>clocks during set_rate" from v3 [3].
>>   * Rebased on top of clk-next.
>>   * Dropped the DT update from the series. Will send separately
>>   * Now with cpufreq-dt+opp supporting voltage scaling, registering the
>> krait cpu supplies in DT should be sufficient. But one issue is,
>> the qcom-cpufreq drivers reads the efuse and based on that registers
>> the opp data and then registers the cpufreq-dt device. So when
>> cpufreq-dt driver probes and registers the regulator to the OPP 
>> framework,
>> it expects that the opp data for the device should not be registered 
>> before
>> the regulator. Will send a RFC patch removing that check, to find out the
>> right way of doing it.
>>
>> These patches provide cpufreq scaling on devices with Krait CPUs.
>> In Krait CPU designs there's one PLL and two muxes per CPU, allowing
>> us to switch CPU frequencies independently.
>>
>>   secondary
>>   +-++
>>   | QSB |---+|\
>>   +-+   || |-+
>> |+---|/  |
>> ||   +   |
>>   +-+   ||   |
>>   | PLL |+---+   |   primary
>>   +-+|  || +
>>  |  |+-|\   +--+
>>   +---+  |  |  | \  |  |
>>   | HFPLL |--+-|  |-| CPU0 |
>>   +---+  |  || |  | |  |
>>  |  || +-+ | /  +--+
>>  |  |+-| / 2 |-|/
>>  |  |  +-+ +
>>  |  | secondary
>>  |  |+
>>  |  +|\
>>  |   | |-+
>>  +---|/  |   primary
>>  +   | +
>>  +-|\   +--+
>>   +---+| \  |  |
>>   | HFPLL ||  |-| CPU1 |
>>   +---+  | |  | |  |
>>  | +-+ | /  +--+
>>  +-| / 2 |-|/
>>+-+ +
>>
>> To support this

Re: [RESEND PATCH v4 3/6] arm64/mm: Create the initial page table in the init_pg_dir.

2018-09-07 Thread James Morse
Hi Jun,

On 22/08/18 10:54, Jun Yao wrote:
> Create the initial page table in the init_pg_dir. And before
> calling kasan_early_init(), we update the init_mm.pgd by
> introducing set_init_mm_pgd(). This will ensure that pgd_offset_k()
> works correctly. When the final page table is created, we redirect
> the init_mm.pgd to the swapper_pg_dir.

> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index c3e4b1886cde..ede2e964592b 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -402,7 +402,6 @@ __create_page_tables:
>   adrp x1, init_pg_end
>   sub x1, x1, x0
>   bl  __inval_dcache_area
> -
>   ret x28
>  ENDPROC(__create_page_tables)
>   .ltorg

Nit: spurious whitespace change.


> @@ -439,6 +438,9 @@ __primary_switched:
>   bl  __pi_memset
>   dsb ishst   // Make zero page visible to PTW
>  
> + adrp x0, init_pg_dir
> + bl  set_init_mm_pgd

Having a C helper to just do a store is a bit strange, calling C code before
kasan is ready is clearly causing some pain.

Couldn't we just do store in assembly here?:
--%<--
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ede2e964592b..7464fb31452d 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -439,7 +439,8 @@ __primary_switched:
dsb ishst   // Make zero page visible to PTW

adrp x0, init_pg_dir
-   bl  set_init_mm_pgd
+   adr_l   x1, init_mm
+   str x0, [x1, #MM_PGD]

 #ifdef CONFIG_KASAN
bl  kasan_early_init
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 323aeb5f2fe6..43f52cfdfad4 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -82,6 +82,7 @@ int main(void)
   DEFINE(S_FRAME_SIZE,  sizeof(struct pt_regs));
   BLANK();
   DEFINE(MM_CONTEXT_ID, offsetof(struct mm_struct, context.id.counter));
+  DEFINE(MM_PGD, offsetof(struct mm_struct, pgd));
   BLANK();
   DEFINE(VMA_VM_MM, offsetof(struct vm_area_struct, vm_mm));
   DEFINE(VMA_VM_FLAGS,  offsetof(struct vm_area_struct, vm_flags));
--%<--


> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 65f86271f02b..f7e544f6f3eb 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -623,6 +623,19 @@ static void __init map_kernel(pgd_t *pgdp)
>   kasan_copy_shadow(pgdp);
>  }
>  
> +/*
> + * set_init_mm_pgd() just updates init_mm.pgd. The purpose of using
> + * assembly is to prevent KASAN instrumentation, as KASAN has not
> + * been initialized when this function is called.

You're hiding the store from KASAN as its shadow region hasn't been initialized 
yet?

I think newer versions of the compiler let KASAN check stack accesses too, and
the compiler may generate those all by itself. Hiding things like this gets us
into an arms-race with the compiler.


> +void __init set_init_mm_pgd(pgd_t *pgd)
> +{
> + pgd_t **addr = &(init_mm.pgd);
> +
> + asm volatile("str %x0, [%1]\n"
> + : : "r" (pgd), "r" (addr) : "memory");
> +}
> +
>  /*
>   * paging_init() sets up the page tables, initialises the zone memory
>   * maps and sets up the zero page.


Thanks,

James


Re: [RESEND PATCH v4 5/6] arm64/mm: Populate the swapper_pg_dir by fixmap.

2018-09-07 Thread James Morse
Hi Jun,

On 22/08/18 10:54, Jun Yao wrote:
> Since we will move the swapper_pg_dir to rodata section, we need a
> way to update it. The fixmap can handle it. When the swapper_pg_dir
> needs to be updated, we map it dynamically. The map will be
> canceled after the update is complete. In this way, we can defend
> against KSMA(Kernel Space Mirror Attack).


> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index 46ef21ebfe47..d5c3df99af7b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -428,8 +435,32 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
> unsigned long pfn,
>PUD_TYPE_TABLE)
>  #endif
>  
> +extern spinlock_t swapper_pgdir_lock;

Hmmm, it would be good if we could avoid exposing this lock.
Wherever this ends up needs to include spinlock.h, and we don't have to do that
in arch headers today.


>  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
> +#ifdef __PAGETABLE_PMD_FOLDED
> + if (in_swapper_pgdir(pmdp)) {
> + pmd_t *fixmap_pmdp;
> +
> + spin_lock(&swapper_pgdir_lock);
> + fixmap_pmdp = (pmd_t *)pgd_set_fixmap(__pa(pmdp));
> + WRITE_ONCE(*fixmap_pmdp, pmd);
> + dsb(ishst);
> + pgd_clear_fixmap();
> + spin_unlock(&swapper_pgdir_lock);
> + return;
> + }
> +#endif

You have this pattern multiple times, it ought to be a macro. (Any reason why
the last copy for pgd is different?)

Putting all this directly into the inlined helper is noisy and risks bloating
the locations it appears. Could we do the in_swappper_pgdir() test, and if it
passes call some out-of-line set_swapper_pgd() that lives in mm/mmu.c? Once we
know we're using the fixmap I don't think there is a benefit to inline-ing the 
code.

Doing this would avoid moving the extern defines and p?d_set_fixmap() helpers
around in this header and let us avoid extern-ing the lock or including
spinlock.h in here.


>   WRITE_ONCE(*pmdp, pmd);
>   dsb(ishst);
>  }
> @@ -480,6 +511,19 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
>  
>  static inline void set_pud(pud_t *pudp, pud_t pud)
>  {
> +#ifdef __PAGETABLE_PUD_FOLDED
> + if (in_swapper_pgdir(pudp)) {
> + pud_t *fixmap_pudp;
> +
> + spin_lock(&swapper_pgdir_lock);
> + fixmap_pudp = (pud_t *)pgd_set_fixmap(__pa(pudp));

This is a bit subtle: are you using the pgd fixmap entry because the path from
map_mem() uses the other three?

Using the pgd fix slot for a pud looks a bit strange to me, but its arguably a
side-effect of the folding.

I see this called 68 times during boot on a 64K/42bit-VA, 65 of which appear to
be during paging_init(). What do you think to keeping paging_init()s use of the
pgd fixmap for swapper_pg_dir, deliberately to skip the in_swapper_pgdir() test
during paging_init()?


> + WRITE_ONCE(*fixmap_pudp, pud);

> + dsb(ishst);
> + pgd_clear_fixmap();

Hmm,

p?d_clear_fixmap() is done by calling __set_fixmap(FIX_P?G, 0, __pgprot(0)).

__set_fixmap() calls flush_tlb_kernel_range() if the flags are 0.

flush_tlb_kernel_range() has a dsb(ishst) before it does the maintenance, (even
via flush_tlb_all()).

I think we can replace the dsb() before each p?d_clear_fixmap() call
with a comment that the flush_tlb_*() will do it for us. Something like:

|/*
| * We need dsb(ishst) here to ensure the page-table-walker sees our new entry
| * before set_p?d() returns. The fixmap's flush_tlb_kernel_range() via
| * clear_fixmap() does this for us.
| */


> + spin_unlock(&swapper_pgdir_lock);
> + return;
> + }
> +#endif
>   WRITE_ONCE(*pudp, pud);
>   dsb(ishst);
>  }


Thanks,

James


Re: [PATCH V2] perf tools: Fix maps__find_symbol_by_name()

2018-09-07 Thread Jiri Olsa
On Fri, Sep 07, 2018 at 11:51:16AM +0300, Adrian Hunter wrote:
> Commit 1c5aae7710bb ("perf machine: Create maps for x86 PTI entry
> trampolines") revealed a problem with maps__find_symbol_by_name() that
> resulted in probes not being found e.g.
> 
>   $ sudo perf probe xsk_mmap
>   xsk_mmap is out of .text, skip it.
>   Probe point 'xsk_mmap' not found.
>  Error: Failed to add events.
> 
> maps__find_symbol_by_name() can optionally return the map of the found
> symbol. It can get the map wrong because, in fact, the symbol is found
> on the map's dso, not allowing for the possibility that the dso has more
> than one map. Fix by always checking the map contains the symbol.
> 
> Reported-by: Björn Töpel 
> Tested-by: Björn Töpel 

Acked-by: Jiri Olsa 

thanks,
jirka


Re: [PATCH V3] spi: spi-geni-qcom: Add SPI driver support for GENI based QUP

2018-09-07 Thread dkota

On 2018-09-02 10:36, Doug Anderson wrote:

Hi,

On Fri, Aug 24, 2018 at 3:42 AM, Dilip Kota  
wrote:

From: Girish Mahadevan 

This driver supports GENI based SPI Controller in the Qualcomm SOCs. 
The
Qualcomm Generic Interface (GENI) is a programmable module supporting 
a
wide range of serial interfaces including SPI. This driver supports 
SPI

operations using FIFO mode of transfer.

Signed-off-by: Girish Mahadevan 
Signed-off-by: Dilip Kota 
---
Addressing all the reviewer comments given in patchset 1.
Summarizing all the comments below:

MAKEFILE: Arrange SPI-GENI driver in alphabetical order
Kconfig: Mark SPI_GENI driver dependent on QCOM_GENI_SE
Enable SPI core auto runtime pm, and remove runtime pm calls.
Remove spi_geni_unprepare_message(), 
spi_geni_unprepare_transfer_hardware()

Remove likely/unlikely keywords.
Remove get_spi_master() and use dev_get_drvdata()
Move request_irq to probe()
Mark bus number assignment to -1 as SPI core framework will 
assign dynamically

Use devm_spi_register_master()
Include platform_device.h instead of of_platform.h
Removing macros which are used only once:
#define SPI_NUM_CHIPSELECT 4
#define SPI_XFER_TIMEOUT_MS250
Place Register field definitions next to respective Register 
definitions.

Replace int and u32 declarations with unsigned int.
Remove Hex numbers in debug prints.
Declare mode as u16 in spi_setup_word_len()
Remove the labels: setup_fifo_params_exit: 
exit_prepare_transfer_hardware:

Declaring struct spi_master as spi everywhere in the file.
Calling spi_finalize_current_transfer() for end of transfer.
Hard code the SPI controller max frequency instead of reading 
from DTSI node.

Spinlock not required, removed it.
Removed unrequired error prints.
Fix KASAN error in geni_spi_isr().
Remove spi-geni-qcom.h
Remove inter words delay and CS to Clock toggle delay logic in 
the driver, as of now no clients are using it.

Will submit this logic in the next patchset.
Use major, minor and step macros to read from hardware version 
register.


 .../devicetree/bindings/soc/qcom/qcom,geni-se.txt  |   2 -
 drivers/spi/Kconfig|  12 +
 drivers/spi/Makefile   |   1 +
 drivers/spi/spi-geni-qcom.c| 678 
+

 4 files changed, 691 insertions(+), 2 deletions(-)


See below for comments.  In general I've tried to post patches to
address my own comments.  See the series ending at
.
From there you can download patch files by using the "DOWNLOAD" link
at the bottom.  Yell if you have problems.  Hopefully that's useful.
I expect that you can squash many of these into your patch to give you
a leg up on v3.

NOTE: I won't promise that I made no mistakes on my fixup patches nor
that I caught everything or did everything right.  I'll plan to take a
fresh look at the whole patch when I see your v3.



--- a/Documentation/devicetree/bindings/soc/qcom/qcom,geni-se.txt
+++ b/Documentation/devicetree/bindings/soc/qcom/qcom,geni-se.txt
@@ -60,7 +60,6 @@ Required properties:
 - interrupts:  Must contain SPI controller interrupts.
 - clock-names: Must contain "se".
 - clocks:  Serial engine core clock needed by the device.
-- spi-max-frequency:   Specifies maximum SPI clock frequency, units - 
Hz.


As per Rob's feedback, please split the device tree change into a
separate patch and justify it.  Perhaps the commit message could be
something like:

---

Ok, will submit as separate patch.


dt-bindings: spi: Remove spi-max-frequency from qcom,geni-se controller

No other SPI controllers have a "spi-max-frequency" at the controller
level.  The normal "spi-max-frequency" property is something that is
used when defining the nodes for SPI slaves.  While it is possible
that someone might want to define a controller-level max frequency it
should be done in other ways (perhaps by keying off a compatible
string?)

---

I think in the past Mark Brown has also requested that the bindings
actually live under "Documentation/devicetree/bindings/spi/", so
perhaps you should also add a patch to your series that moves this
documentation there and changes the "soc/qcom/qcom,geni-se.txt" to
reference that.



+static irqreturn_t geni_spi_isr(int irq, void *data);
+
+struct spi_geni_master {
+   struct geni_se se;
+   unsigned int irq;


In v1 Stephen requested that many things in this struct become
"unsigned int", but he didn't mean the "irq".  Please change this back
to an int.  As you have things right now the code "if (spi_geni->irq <
0)" you have below is a no-op.  :(

...oh, and as Stephen pointed out to me offline you don't even need to
store ir

[PATCH 4/4] sched/numa: Do not move imbalanced load purely on the basis of an idle CPU

2018-09-07 Thread Mel Gorman
Commit 305c1fac3225 ("sched/numa: Evaluate move once per node")
restructured how task_numa_compare evaluates load but there is an anomaly.
task_numa_find_cpu() checks if the load balance between two nodes is too
imbalanced with the intent of only swapping tasks if it would improve
the balance overall. However, if an idle CPU is encountered, the task is
still moved if it's the best improvement, and an idle CPU is always going
to appear to be the best improvement.

If a machine is lightly loaded such that all tasks can fit on one node then
the idle CPUs are found and the tasks migrate to one socket.  From a NUMA
perspective, this seems intuitively great because memory accesses are all
local but there are two counter-intuitive effects.

First, the load balancer may move tasks so the machine is more evenly
utilised and conflict with automatic NUMA balancing which may respond by
scanning more frequently and increasing overhead.  Second, sockets usually
have their own memory channels so using one socket means that fewer
channels are available yielding less memory bandwidth overall. For
memory-bound tasks, it can be beneficial to migrate to another socket and
migrate the data to increase bandwidth even though the accesses are remote
in the short term.

The second observation is not universally true for all workloads but some
of the computational kernels of NAS benefit when parallelised with OpenMP.

NAS C class 2-socket
 4.19.0-rc1 4.19.0-rc1
oneselect-v1r18   nomove-v1r19
Amean bt   62.26 (   0.00%)   53.03 (  14.83%)
Amean cg   27.85 (   0.00%)   27.82 (   0.09%)
Amean ep8.94 (   0.00%)8.58 (   4.09%)
Amean ft   11.89 (   0.00%)   12.00 (  -0.93%)
Amean is0.87 (   0.00%)0.86 (   0.92%)
Amean lu   41.77 (   0.00%)   38.95 (   6.76%)
Amean mg5.30 (   0.00%)5.26 (   0.64%)
Amean sp  105.39 (   0.00%)   63.80 (  39.46%)
Amean ua   47.42 (   0.00%)   43.99 (   7.24%)

Active balancing for NUMA still happens but it is greatly reduced. When
running with D class (so it runs longer), the relevant unpatched stats are

   3773.21  Elapsed time in seconds
489.24  Mops/sec/thread
38,918  cpu-migrations
 3,817,238  page-faults
11,197  sched:sched_move_numa
 0  sched:sched_stick_numa
23  sched:sched_swap_numa

With the patch applied

   2037.92  Elapsed time in seconds
905.83  Mops/sec/thread
   147  cpu-migrations
   552,529  page-faults
26  sched:sched_move_numa
 0  sched:sched_stick_numa
16  sched:sched_swap_numa

Note the large drop in CPU migrations, the calls to sched_move_numa and
page faults.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59d3e00a480..d4c289c11012 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1560,7 +1560,7 @@ static bool task_numa_compare(struct task_numa_env *env,
goto unlock;
 
if (!cur) {
-   if (maymove || imp > env->best_imp)
+   if (maymove)
goto assign;
else
goto unlock;
-- 
2.16.4



[PATCH 3/4] sched/numa: Stop comparing tasks for NUMA placement after selecting an idle core

2018-09-07 Thread Mel Gorman
task_numa_migrate is responsible for finding a core on a preferred NUMA
node for a task. As part of this, task_numa_find_cpu iterates through
the CPUs of a node and evaluates CPUs, both idle and with running tasks,
as placement candidates. Generally though, any idle CPU is equivalent in
terms of improving imbalances and a search after finding one is pointless.
This patch stops examining CPUs on a node if an idle CPU is considered
suitable.

While there are some workloads that show minor gains and losses, they are
mostly within the noise with the exception of specjbb whether running
as one large VM or one VM per socket. The following was reported on a
two-socket Haswell machine with 24 cores per socket.

specjbb, one JVM per socket (2 total)
  4.19.0-rc1 4.19.0-rc1
 vanilla   oneselect-v1
Hmean tput-1 42258.43 (   0.00%)43692.10 (   3.39%)
Hmean tput-2 87811.26 (   0.00%)93719.52 (   6.73%)
Hmean tput-3138100.56 (   0.00%)   143484.08 (   3.90%)
Hmean tput-4181061.51 (   0.00%)   191292.99 (   5.65%)
Hmean tput-5225577.34 (   0.00%)   233439.58 (   3.49%)
Hmean tput-6264763.44 (   0.00%)   270634.50 (   2.22%)
Hmean tput-7301458.48 (   0.00%)   314133.32 (   4.20%)
Hmean tput-8348364.50 (   0.00%)   358445.76 (   2.89%)
Hmean tput-9382129.65 (   0.00%)   403288.75 (   5.54%)
Hmean tput-10   403566.70 (   0.00%)   444592.51 (  10.17%)
Hmean tput-11   456967.43 (   0.00%)   483300.45 (   5.76%)
Hmean tput-12   502295.98 (   0.00%)   526281.53 (   4.78%)
Hmean tput-13   441284.41 (   0.00%)   535507.75 (  21.35%)
Hmean tput-14   461478.57 (   0.00%)   542068.97 (  17.46%)
Hmean tput-15   489725.29 (   0.00%)   545033.17 (  11.29%)
Hmean tput-16   503726.56 (   0.00%)   549738.23 (   9.13%)
Hmean tput-17   528650.57 (   0.00%)   550849.00 (   4.20%)
Hmean tput-18   518065.41 (   0.00%)   550018.29 (   6.17%)
Hmean tput-19   527412.99 (   0.00%)   550652.26 (   4.41%)
Hmean tput-20   528166.25 (   0.00%)   545783.85 (   3.34%)
Hmean tput-21   524669.70 (   0.00%)   544848.37 (   3.85%)
Hmean tput-22   519010.38 (   0.00%)   539603.70 (   3.97%)
Hmean tput-23   514947.43 (   0.00%)   534714.32 (   3.84%)
Hmean tput-24   517953.29 (   0.00%)   531783.24 (   2.67%)

Coefficient of variation is roughly 0-3% depending on the warehouse count
so these results are generally outside of the noise. Note that the biggest
improvements are when a socket would be roughly half loaded. It's not
especially obvious why this would be true given that without the patch the
socket is scanned anyway but it may be cache miss related. On a 2-socket
broadwell machine, the same observation was made in that the biggest
benefit was when a socket was half-loaded. If a single JVM is used for
the entire machine, the biggest benefit was also when the machine was
half-utilised.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b2f1684e96e..d59d3e00a480 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,7 +1535,7 @@ static bool load_too_imbalanced(long src_load, long 
dst_load,
  * into account that it might be best if task running on the dst_cpu should
  * be exchanged with the source task
  */
-static void task_numa_compare(struct task_numa_env *env,
+static bool task_numa_compare(struct task_numa_env *env,
  long taskimp, long groupimp, bool maymove)
 {
struct rq *dst_rq = cpu_rq(env->dst_cpu);
@@ -1545,6 +1545,7 @@ static void task_numa_compare(struct task_numa_env *env,
long imp = env->p->numa_group ? groupimp : taskimp;
long moveimp = imp;
int dist = env->dist;
+   bool dst_idle = false;
 
rcu_read_lock();
cur = task_rcu_dereference(&dst_rq->curr);
@@ -1638,11 +1639,13 @@ static void task_numa_compare(struct task_numa_env *env,
env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
   env->dst_cpu);
local_irq_enable();
+   dst_idle = true;
}
 
task_numa_assign(env, cur, imp);
 unlock:
rcu_read_unlock();
+   return dst_idle;
 }
 
 static void task_numa_find_cpu(struct task_numa_env *env,
@@ -1668,7 +1671,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
continue;
 
env->dst_cpu = cpu;
-   task_numa_compare(env, taskimp, groupimp, maymove);
+   if (task_numa_compare(env, taskimp, groupimp, maymove))
+   break;
}
 }
 
-- 
2.16.4



[PATCH 2/4] sched/numa: Remove unused calculations in update_numa_stats

2018-09-07 Thread Mel Gorman
Commit 2d4056fafa19 ("sched/numa: Remove numa_has_capacity()") removed
the has_free_capacity field but did not remove the calculations
related to it in update_numa_stats(). This patch removes the unused
code.

Signed-off-by: Mel Gorman 
---
 kernel/sched/fair.c | 22 +-
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2472aeaff92e..5b2f1684e96e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1461,8 +1461,7 @@ struct numa_stats {
  */
 static void update_numa_stats(struct numa_stats *ns, int nid)
 {
-   int smt, cpu, cpus = 0;
-   unsigned long capacity;
+   int cpu;
 
memset(ns, 0, sizeof(*ns));
for_each_cpu(cpu, cpumask_of_node(nid)) {
@@ -1470,26 +1469,7 @@ static void update_numa_stats(struct numa_stats *ns, int 
nid)
 
ns->load += weighted_cpuload(rq);
ns->compute_capacity += capacity_of(cpu);
-
-   cpus++;
}
-
-   /*
-* If we raced with hotplug and there are no CPUs left in our mask
-* the @ns structure is NULL'ed and task_numa_compare() will
-* not find this node attractive.
-*
-* We'll detect a huge imbalance and bail there.
-*/
-   if (!cpus)
-   return;
-
-   /* smt := ceil(cpus / capacity), assumes: 1 < smt_power < 2 */
-   smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
-   capacity = cpus / smt; /* cores */
-
-   capacity = min_t(unsigned, capacity,
-   DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
 }
 
 struct task_numa_env {
-- 
2.16.4


