Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.

2018-07-19 Thread Michael Ellerman
Michael Neuling  writes:
> On Fri, 2018-07-20 at 12:32 +1000, Michael Ellerman wrote:
>> Michael Neuling  writes:
>> > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
>> > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
>> > > > 
>> > > > >  DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
>> > > > > diff --git a/arch/powerpc/kernel/idle_book3s.S
>> > > > > b/arch/powerpc/kernel/idle_book3s.S
>> > > > > index d85d551..5069d42 100644
>> > > > > --- a/arch/powerpc/kernel/idle_book3s.S
>> > > > > +++ b/arch/powerpc/kernel/idle_book3s.S
>> > > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
>> > > > >  mfspr   r4, SPRN_MMCR2
>> > > > >  std r3, STOP_MMCR1(r13)
>> > > > >  std r4, STOP_MMCR2(r13)
>> > > > > +
>> > > > > +mfspr   r3, SPRN_SPRG3
>> > > > > +std r3, STOP_SPRG3(r13)
>> > > > 
>> > > > We don't need to save it.  Just restore it from paca->sprg_vdso which
>> > > > should
>> > > > never change.
>> > > 
>> > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
>> > > 
>> > > > 
>> > > > How can we do better at catching these missing SPRGs?
>> > > 
>> > > We can go through the list of SPRs from the POWER9 User Manual and
>> > > document explicitly why we don't have to save/restore certain SPRs
>> > > during the execution of the stop instruction. Does this sound ok ?
>> > > 
>> > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
>> > > accessible from
>> > > https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)
>> > 
>> > I was thinking of a boot time test case built into linux. linux has some
>> > boot
>> > time test cases which you can enable via CONFIG options.
>> > 
>> > Firstly you could see if an SPR exists using the same trick xmon does in
>> > dump_one_spr(). Then once you have a list of usable SPRs, you could write
>> > all
>> > the known ones (I assume you'd have to leave out some, like the PSSCR), 
>> > then
>> > set
>> 
>> Write what value?
>> 
>> Ideally you want to write a random bit pattern to reduce the chance
>> that only some bits are being restored.
>
> The xmon dump_one_spr() trick tries to work around that by writing one random
> value and then a different one to see if it really is a nop.
>
>> But you can't do that because writing a value to an SPRs has an effect.
>
> Sure that's a concern but xmon seems to get away with it.

I don't think it writes, but maybe I'm reading the code wrong.

Writing a random value to the MSR could be fun :)

>> Some of them might even need to be zero, in which case you can't really
>> distinguish that from a non-restored zero.
>
> It doesn't need to be perfect. It just needs to catch more than we have now.

Sure.

>> > the appropriate stop level, make sure you got into that stop level, and 
>> > then
>> > see
>> > if that register was changed. Then you'd have an automated list of 
>> > registers
>> > you
>> > need to make sure you save/restore at each stop level.
>> > 
>> > Could something like that work?
>> 
>> Maybe.
>> 
>> Ignoring the problem of whether you can write a meaningful value to some
>> of the SPRs, I'm not entirely convinced it's going to work. But maybe
>> I'm wrong.
>
> Yeah, I'm not convinced it'll work either but it would be a nice piece of test
> infrastructure to have if it does work.

Yeah I guess I'd rather we worked on 1) and 2) below first :)

> We'd still need to marry up the SPR numbers we get from the test to what's
> actually being restored in Linux.
>
>> But there's a much simpler solution, we should 1) have a selftest for
>> getcpu() and 2) we should be running the glibc (I think?) test suite
>> that found this in the first place. It's frankly embarrassing that we
>> didn't find this.
>
> Yeah, we should do that also, but how do we catch the next SPR we are missing?
> I'd like some systematic way of doing that rather than whack-a-mole.

Whack-a-mole 

We could also improve things by documenting how each SPR is handled, eg.
is it saved/restored across idle, syscall, KVM etc. And possibly that
could even become code that defines how SPRs are handled, rather than it
all being done ad-hoc.
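
A rough sketch of what such a table could look like (everything below is
invented purely for illustration and is not existing kernel code; only
SPRN_SPRG3 is a real define):

/*
 * Hypothetical sketch only: a declarative table describing how each SPR
 * is handled, which could double as the documentation.
 */
enum spr_context {
	SPR_CTX_IDLE	= 0x1,	/* must survive stop */
	SPR_CTX_SYSCALL	= 0x2,	/* must survive syscall entry/exit */
	SPR_CTX_KVM	= 0x4,	/* must survive guest entry/exit */
};

struct spr_policy {
	unsigned int	sprn;		/* e.g. SPRN_SPRG3 */
	const char	*name;
	unsigned int	contexts;	/* mask of enum spr_context */
	const char	*why;		/* rationale for the policy */
};

static const struct spr_policy spr_policies[] = {
	{ SPRN_SPRG3, "SPRG3", SPR_CTX_IDLE,
	  "holds the vDSO cpu/node hint; restore from paca->sprg_vdso" },
	/* ... one entry per SPR from the POWER9 User Manual ... */
};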

cheers


[PATCH v7 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

2018-07-19 Thread Shilpasri G Bhat
OPAL firmware provides the facility for some groups of sensors to be
enabled/disabled at runtime to give the user the option of using the
system resources for collecting these sensors or not.

For example, on POWER9 systems, the On Chip Controller (OCC) gathers
various system and chip level sensors and maintains their values in
main memory.

This patch provides support for enabling/disabling the sensor groups
like power, temperature, current and voltage.

Signed-off-by: Shilpasri G Bhat 
[stew...@linux.vnet.ibm.com: Commit message]
---
Changes from v6:
- Updated the commit message as per Stewart's suggestion
- Use the compatible DT property instead of the path to find the node

 Documentation/hwmon/ibmpowernv |  43 ++-
 drivers/hwmon/ibmpowernv.c | 249 +++--
 2 files changed, 258 insertions(+), 34 deletions(-)

diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv
index 8826ba2..5646825 100644
--- a/Documentation/hwmon/ibmpowernv
+++ b/Documentation/hwmon/ibmpowernv
@@ -33,9 +33,48 @@ fanX_input   Measured RPM value.
 fanX_min   Threshold RPM for alert generation.
 fanX_fault 0: No fail condition
1: Failing fan
+
 tempX_inputMeasured ambient temperature.
 tempX_max  Threshold ambient temperature for alert generation.
-inX_input  Measured power supply voltage
+tempX_highest  Historical maximum temperature
+tempX_lowest   Historical minimum temperature
+tempX_enable   Enable/disable all temperature sensors belonging to the
+   sub-group. In POWER9, this attribute corresponds to
+   each OCC. Using this attribute each OCC can be asked to
+   disable/enable all of its temperature sensors.
+   1: Enable
+   0: Disable
+
+inX_input  Measured power supply voltage (millivolt)
 inX_fault  0: No fail condition.
1: Failing power supply.
-power1_input   System power consumption (microWatt)
+inX_highestHistorical maximum voltage
+inX_lowest Historical minimum voltage
+inX_enable Enable/disable all voltage sensors belonging to the
+   sub-group. In POWER9, this attribute corresponds to
+   each OCC. Using this attribute each OCC can be asked to
+   disable/enable all of its voltage sensors.
+   1: Enable
+   0: Disable
+
+powerX_input   Power consumption (microWatt)
+powerX_input_highest   Historical maximum power
+powerX_input_lowestHistorical minimum power
+powerX_enable  Enable/disable all power sensors belonging to the
+   sub-group. In POWER9, this attribute corresponds to
+   each OCC. Using this attribute each OCC can be asked to
+   disable/enable all of its power sensors.
+   1: Enable
+   0: Disable
+
+currX_inputMeasured current (milliampere)
+currX_highest  Historical maximum current
+currX_lowest   Historical minimum current
+currX_enable   Enable/disable all current sensors belonging to the
+   sub-group. In POWER9, this attribute corresponds to
+   each OCC. Using this attribute each OCC can be asked to
+   disable/enable all of its current sensors.
+   1: Enable
+   0: Disable
+
+energyX_input  Cumulative energy (microJoule)
diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c
index f829dad..99afbf7 100644
--- a/drivers/hwmon/ibmpowernv.c
+++ b/drivers/hwmon/ibmpowernv.c
@@ -90,11 +90,20 @@ struct sensor_data {
char label[MAX_LABEL_LEN];
char name[MAX_ATTR_LEN];
struct device_attribute dev_attr;
+   struct sensor_group_data *sgrp_data;
+};
+
+struct sensor_group_data {
+   struct mutex mutex;
+   u32 gid;
+   bool enable;
 };
 
 struct platform_data {
const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1];
+   struct sensor_group_data *sgrp_data;
u32 sensors_count; /* Total count of sensors from each group */
+   u32 nr_sensor_groups; /* Total number of sensor groups */
 };
 
 static ssize_t show_sensor(struct device *dev, struct device_attribute 
*devattr,
@@ -105,6 +114,9 @@ static ssize_t show_sensor(struct device *dev, struct 
device_attribute *devattr,
ssize_t ret;
u64 x;
 
+   if (sdata->sgrp_data && !sdata->sgrp_data->enable)
+   return -ENODATA;
+
ret =  opal_get_sensor_data_u64(sdata->id, &x);
 
if (ret)
@@ -120,6 +132,46 @@ static ssize_t show_sensor(struct device *dev, struct 
device_attribute *devattr,
return sprintf(buf, "%ll
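
From userspace, toggling one of the new *_enable attributes documented above
amounts to writing 0 or 1 to the corresponding sysfs file. A minimal sketch
(the hwmon index and attribute path are illustrative and depend on the
system):

/* Minimal sketch: disable the first power sensor group via sysfs.
 * The hwmon device index and attribute name vary between systems. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/hwmon/hwmon0/power1_enable", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("0\n", f);	/* 0: Disable, 1: Enable */
	return fclose(f) ? 1 : 0;
}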

[PATCH v7 1/2] powernv:opal-sensor-groups: Add support to enable sensor groups

2018-07-19 Thread Shilpasri G Bhat
Adds support to enable/disable a sensor group at runtime. This
can be used to select the sensor groups that need to be copied to
main memory by the OCC. Sensor groups like power, temperature, current,
voltage, frequency and utilization can be enabled/disabled at runtime.

Signed-off-by: Shilpasri G Bhat 
---
 arch/powerpc/include/asm/opal-api.h|  1 +
 arch/powerpc/include/asm/opal.h|  2 ++
 .../powerpc/platforms/powernv/opal-sensor-groups.c | 28 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
 4 files changed, 32 insertions(+)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 3bab299..56a94a1 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -206,6 +206,7 @@
 #define OPAL_NPU_SPA_CLEAR_CACHE   160
 #define OPAL_NPU_TL_SET161
 #define OPAL_SENSOR_READ_U64   162
+#define OPAL_SENSOR_GROUP_ENABLE   163
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
 #define OPAL_LAST  165
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index e1b2910..fc0550e 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -292,6 +292,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
address,
 int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
 int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
 int opal_sensor_group_clear(u32 group_hndl, int token);
+int opal_sensor_group_enable(u32 group_hndl, int token, bool enable);
 
 s64 opal_signal_system_reset(s32 cpu);
 s64 opal_quiesce(u64 shutdown_type, s32 cpu);
@@ -326,6 +327,7 @@ extern int opal_async_wait_response_interruptible(uint64_t 
token,
struct opal_msg *msg);
 extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data);
 extern int opal_get_sensor_data_u64(u32 sensor_hndl, u64 *sensor_data);
+extern int sensor_group_enable(u32 grp_hndl, bool enable);
 
 struct rtc_time;
 extern time64_t opal_get_boot_time(void);
diff --git a/arch/powerpc/platforms/powernv/opal-sensor-groups.c 
b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
index 541c9ea..f7d04b6 100644
--- a/arch/powerpc/platforms/powernv/opal-sensor-groups.c
+++ b/arch/powerpc/platforms/powernv/opal-sensor-groups.c
@@ -32,6 +32,34 @@ struct sg_attr {
struct sg_attr *sgattrs;
 } *sgs;
 
+int sensor_group_enable(u32 handle, bool enable)
+{
+   struct opal_msg msg;
+   int token, ret;
+
+   token = opal_async_get_token_interruptible();
+   if (token < 0)
+   return token;
+
+   ret = opal_sensor_group_enable(handle, token, enable);
+   if (ret == OPAL_ASYNC_COMPLETION) {
+   ret = opal_async_wait_response(token, &msg);
+   if (ret) {
+   pr_devel("Failed to wait for the async response\n");
+   ret = -EIO;
+   goto out;
+   }
+   ret = opal_error_code(opal_get_async_rc(msg));
+   } else {
+   ret = opal_error_code(ret);
+   }
+
+out:
+   opal_async_release_token(token);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(sensor_group_enable);
+
 static ssize_t sg_store(struct kobject *kobj, struct kobj_attribute *attr,
const char *buf, size_t count)
 {
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
b/arch/powerpc/platforms/powernv/opal-wrappers.S
index a8d9b40..8268a1e 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -327,3 +327,4 @@ OPAL_CALL(opal_npu_tl_set,  
OPAL_NPU_TL_SET);
 OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,
OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,
OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
 OPAL_CALL(opal_sensor_read_u64,OPAL_SENSOR_READ_U64);
+OPAL_CALL(opal_sensor_group_enable,OPAL_SENSOR_GROUP_ENABLE);
-- 
1.8.3.1



[PATCH v7 0/2] hwmon/powernv: Add attributes to enable/disable sensors

2018-07-19 Thread Shilpasri G Bhat
This patch series adds a new attribute to enable or disable a sensor at
runtime.

Changes from v6:
- Updated the commit message as per Stewart's suggestion
- Use the compatible DT property instead of the path to find the node

v6 : https://lkml.org/lkml/2018/7/18/806
v5 : https://lkml.org/lkml/2018/7/15/15
v4 : https://lkml.org/lkml/2018/7/6/379
v3 : https://lkml.org/lkml/2018/7/5/476
v2 : https://lkml.org/lkml/2018/7/4/263
v1 : https://lkml.org/lkml/2018/3/22/214

Shilpasri G Bhat (2):
  powernv:opal-sensor-groups: Add support to enable sensor groups
  hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

 Documentation/hwmon/ibmpowernv |  43 +++-
 arch/powerpc/include/asm/opal-api.h|   1 +
 arch/powerpc/include/asm/opal.h|   2 +
 .../powerpc/platforms/powernv/opal-sensor-groups.c |  28 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 drivers/hwmon/ibmpowernv.c | 249 ++---
 6 files changed, 290 insertions(+), 34 deletions(-)

-- 
1.8.3.1



Re: [powerpc/powervm]Oops: Kernel access of bad area, sig: 11 [#1] while running stress-ng

2018-07-19 Thread vrbagal1

On 2018-07-19 19:03, Brian Foster wrote:

cc linux-xfs

On Thu, Jul 19, 2018 at 11:47:59AM +0530, vrbagal1 wrote:

On 2018-07-10 19:12, Michael Ellerman wrote:
> vrbagal1  writes:
>
> > On 2018-07-10 13:37, Nicholas Piggin wrote:
> > > On Tue, 10 Jul 2018 11:58:40 +0530
> > > vrbagal1  wrote:
> > >
> > > > Hi,
> > > >
> > > > Observing kernel oops on Power9(ZZ) box, running on PowerVM, while
> > > > running stress-ng.
> > > >
> > > >
> > > > Kernel: 4.18.0-rc4
> > > > Machine: Power9 ZZ (PowerVM)
> > > > Test: Stress-ng
> > > >
> > > > Attached is .config file
> > > >
> > > > Traces:
> > > >
> > > >   [12251.245209] Oops: Kernel access of bad area, sig: 11 [#1]
> > >
> > > Can you post the lines above this? Otherwise we don't know what
> > > address
> > > it tried to access (without decoding the instructions and
> > > reconstructing
> > > it from registers at least, which the XFS devs wouldn't be
> > > inclined to
> > > do).
> > >
> >
> > ah my bad.
> >
> >   [12251.245179] Unable to handle kernel paging request for data at
> > address 0x60006000
> >   [12251.245199] Faulting instruction address: 0xc0319e2c
>
> Which matches the regs & disassembly:
>
> r4 = 60006000
> r9 = 0
> ldx r9,r4,r9   <- pop
>
> So object was 0x60006000.
>
> That looks like two nops, ie. we got some code?
>
> And there's only one caller of prefetch_freepointer() in
> slab_alloc_node():
>
>prefetch_freepointer(s, next_object);
>
>
> So slab corruption is looking likely.
>
> Do you have slub_debug=FZP on the kernel command line?

I tried to reproduce this, but didn't succeed. Instead I got an xfs
assertion.

Kernel: 4.18.0-rc5
Tree Head: 30b06abfb92bfd5f9b63ea6a2ffb0bd905d1a6da

Traces:

 [12290.993612] XFS: Assertion failed: flags & XFS_BMAPI_COWFORK, 
file:

fs/xfs/libxfs/xfs_bmap.c, line: 4370


This usually means we have a writeback of a page with delayed 
allocation
buffers over an extent map that reflects a hole in the file. This 
should
never happen for normal (non-reflink) data fork writes, hence the 
assert

failure and -EIO return.

We used to actually (accidentally) allocate a new block in this case,
but more recent kernels generate the error. More interestingly, I think
the pending iomap + writeback rework in XFS may simply drop this error
on the floor because we refer to extent state to process the page 
rather
than the other way around (and buffer_delay() isn't a state in the 
iomap

bits). This of course depends on which state is actually valid, the
buffer or extent map, which we don't know for sure.

Can you reliably reproduce this problem? If so, can you describe the
reproducer? Also, what is the filesystem geometry ('xfs_info ')?
And since this is a power kernel, I'm guessing you have 64k pages as
well..?

Brian



I tried, but couldn't hit the issue. However, this issue was hit while running
the stress-ng test case, both last time (when I saw some xfs traces) and this
time. More specifically, I was running: /bin/avocado run 
avocado-misc-tests/generic/stress-ng.py -m 
/root/avocado-fvt-wrapper/tests/avocado-misc-tests/generic/stress-ng.py.data/stress-ng-cpu.yaml


xfs_info o/p:

# xfs_info /dev/sdi2
meta-data=/dev/sdi2  isize=512agcount=4, agsize=65536 
blks

 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=0 spinodes=0
data =   bsize=4096   blocks=262144, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=1
log  =internal   bsize=4096   blocks=2560, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0


Yes, 64K pagesize.

Regards,
Venkat.





 [12290.993668] [ cut here ]
 [12290.993672] kernel BUG at fs/xfs/xfs_message.c:102!
 [12290.993676] Oops: Exception in kernel mode, sig: 5 [#1]
 [12290.993678] LE SMP NR_CPUS=2048 NUMA pSeries
 [12290.993697] Dumping ftrace buffer:
 [12290.993730](ftrace buffer empty)
 [12290.993735] Modules linked in: camellia_generic(E) 
cast6_generic(E)
cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) 
kvm(E)

snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E)
soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E)
vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E)
vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) 
tun(E)

tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E)
vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) 
md4(E)
unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) 
fuse(E)
vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) 
xxhash(E)
raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) 
ip6t_rpfilter(E)

ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E)
 [12290.993784]  nf_reject_ipv6(E) xt_conntrack(E) ip_set

Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-19 Thread Benjamin Herrenschmidt
On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski  wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure?  I don't.]

Added Nick and Aneesh. We do have HW remote flushes on powerpc.

> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel  wrote:
> > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> > > next?
> > 
> > Yes, AFAICS.
> > 
> > > 
> > > On to the lazy TLB mm_struct refcounting stuff :)
> > > 
> > > > 
> > > > Which refcount?  mm_users shouldn’t be hot, so I assume you’re talking 
> > > > about
> > > > mm_count. My suggestion is to get rid of mm_count instead of trying to
> > > > optimize it.
> > > 
> > > 
> > > Do you have any suggestions on how? :)
> > > 
> > > The TLB shootdown sent at __exit_mm time does not get rid of the
> > > kernelthread->active_mm
> > > pointer pointing at the mm that is exiting.
> > > 
> > 
> > Ah, but that's conceptually very easy to fix.  Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> > #define is set.  After some grepping, there are very few users.  The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted.  Instead, I
> > think the refcounting can get replaced with something like:
> > 
> > /*
> >  * Release any arch-internal references to mm.  Only called when
> > mm_users is zero
> >  * and all tasks using mm have either been switch_mm()'d away or have had
> >  * enter_lazy_tlb() called.
> >  */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> > 
> > which the kernel calls in __mmput() after tearing down all the page
> > tables.  The body can be something like:
> > 
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
> >   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> > }
> > 
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> > 
> > Does all that make sense?  Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
> 
> Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think.  It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap().  But maybe Xen actually has the
> right idea.  In other words, rather doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense.  So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
> 
> Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
> 
> I added some more arch maintainers.  The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead.  When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real users
> threads and idle/kernel threads.
> 
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
> 
> But arm64 has real HW remote flushes.  Does arm64 actually benefit
> from the active_mm optimization?  What happens on arm64 when a process
> exits?  How about s390?  I suspect that s390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.
> 
> (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
> interact with your previous patch set?)


Re: [PATCH v3] PCI: Data corruption happening due to race condition

2018-07-19 Thread Benjamin Herrenschmidt
On Thu, 2018-07-19 at 20:55 +0200, Lukas Wunner wrote:
> On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt wrote:
> > Indeed. However, I'm not a fan of the solution. Shouldn't we instead have
> > some locking for the content of pci_dev? I've always been wary of us
> > having other similar races in there.
> 
> The solution presented is perfectly fine as it uses atomic bitops which
> obviate the need for locking.  Why do you want to add unnecessary locking
> on top?

Atomic bitops tend to be *more* expensive than a lock.

My concern is that the PCIe code historically had no locking and I
worry we may have other fields in there with similar issues. But maybe
I'm wrong.

> Certain other parts of struct pci_dev use their own locking, e.g.
> pci_bus_sem to protect bus_list.  Most elements can and should
> be accessed lockless for performance.
> 
> 
> > > The powerpc PCI code contains a lot of cruft coming from the depth of
> > > history, including rather nasty assumptions. We want to progressively
> > > clean it up, starting with EEH, but it will take time.
> 
> Then I suggest using the #include "../../../drivers/pci/pci.h" for now
> until the powerpc arch code has been consolidated.

There's also the need both in powerpc and sparc to access the guts of
pci_dev because those archs will "fabricate" a pci_dev from the
device-tree rather than probing it under some circumstances.

Cheers,
Ben.



[RFC 4/4] virtio: Add platform specific DMA API translation for virtio devices

2018-07-19 Thread Anshuman Khandual
This adds a hook which a platform can define in order to allow it to
override virtio device's DMA OPS irrespective of whether it has the
flag VIRTIO_F_IOMMU_PLATFORM set or not. We want to use this to do
bounce-buffering of data on the new secure pSeries platform, currently
under development, where a KVM host cannot access all of the memory
space of a secure KVM guest.  The host can only access the pages which
the guest has explicitly requested to be shared with the host, thus
the virtio implementation in the guest has to copy data to and from
shared pages.

With this hook, the platform code in the secure guest can force the
use of swiotlb for virtio buffers, with a back-end for swiotlb which
will use a pool of pre-allocated shared pages.  Thus all data being
sent or received by virtio devices will be copied through pages which
the host has access to.

Signed-off-by: Anshuman Khandual 
---
 arch/powerpc/include/asm/dma-mapping.h | 6 ++
 arch/powerpc/platforms/pseries/iommu.c | 6 ++
 drivers/virtio/virtio.c| 7 +++
 3 files changed, 19 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 8fa3945..bc5a9d3 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_DMA_MAPPING_H */
+
+#define platform_override_dma_ops platform_override_dma_ops
+
+struct virtio_device;
+
+extern void platform_override_dma_ops(struct virtio_device *vdev);
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..5773bc7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str)
 __setup("multitce=", disable_multitce);
 
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+void platform_override_dma_ops(struct virtio_device *vdev)
+{
+   /* Override vdev->parent.dma_ops if required */
+}
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 6b13987..432c332 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status);
 
 const struct dma_map_ops virtio_direct_dma_ops;
 
+#ifndef platform_override_dma_ops
+static inline void platform_override_dma_ops(struct virtio_device *vdev)
+{
+}
+#endif
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
int ret = dev->config->finalize_features(dev);
@@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev)
if (virtio_has_iommu_quirk(dev))
set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
 
+   platform_override_dma_ops(dev);
if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
return 0;
 
-- 
2.9.3
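
As an illustration of what the pseries hook above could eventually do for a
secure guest, something along these lines seems to be the intent (a sketch
only: is_secure_guest() is a made-up predicate, and whether the generic
swiotlb_dma_ops, or a platform-specific variant backed by the pre-allocated
shared pages, is the right ops structure is exactly what remains to be worked
out in this series):

/* Sketch only, not part of this series. */
#include <linux/dma-mapping.h>
#include <linux/swiotlb.h>
#include <linux/virtio.h>

void platform_override_dma_ops(struct virtio_device *vdev)
{
	if (is_secure_guest())	/* hypothetical platform check */
		set_dma_ops(vdev->dev.parent, &swiotlb_dma_ops);
}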



[RFC 3/4] virtio: Force virtio core to use DMA API callbacks for all virtio devices

2018-07-19 Thread Anshuman Khandual
Virtio core should use DMA API callbacks for all virtio devices, which may
generate either a GPA or an IOVA depending on the VIRTIO_F_IOMMU_PLATFORM
flag and the resulting QEMU expectations. This implies that every virtio
device needs to have a DMA OPS structure. This removes the previous GPA
fallback code paths.

Signed-off-by: Anshuman Khandual 
---
 drivers/virtio/virtio_ring.c | 65 ++--
 1 file changed, 2 insertions(+), 63 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 814b395..c265964 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,26 +141,6 @@ struct vring_virtqueue {
  * unconditionally on data path.
  */
 
-static bool vring_use_dma_api(struct virtio_device *vdev)
-{
-   if (!virtio_has_iommu_quirk(vdev))
-   return true;
-
-   /* Otherwise, we are left to guess. */
-   /*
-* In theory, it's possible to have a buggy QEMU-supposed
-* emulated Q35 IOMMU and Xen enabled at the same time.  On
-* such a configuration, virtio has never worked and will
-* not work without an even larger kludge.  Instead, enable
-* the DMA API if we're a Xen guest, which at least allows
-* all of the sensible Xen configurations to work correctly.
-*/
-   if (xen_domain())
-   return true;
-
-   return false;
-}
-
 /*
  * The DMA ops on various arches are rather gnarly right now, and
  * making all of the arch DMA ops work on the vring device itself
@@ -176,9 +156,6 @@ static dma_addr_t vring_map_one_sg(const struct 
vring_virtqueue *vq,
   struct scatterlist *sg,
   enum dma_data_direction direction)
 {
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return (dma_addr_t)sg_phys(sg);
-
/*
 * We can't use dma_map_sg, because we don't use scatterlists in
 * the way it expects (we don't guarantee that the scatterlist
@@ -193,9 +170,6 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
   void *cpu_addr, size_t size,
   enum dma_data_direction direction)
 {
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return (dma_addr_t)virt_to_phys(cpu_addr);
-
return dma_map_single(vring_dma_dev(vq),
  cpu_addr, size, direction);
 }
@@ -205,9 +179,6 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
 {
u16 flags;
 
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return;
-
flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
if (flags & VRING_DESC_F_INDIRECT) {
@@ -228,9 +199,6 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
 static int vring_mapping_error(const struct vring_virtqueue *vq,
   dma_addr_t addr)
 {
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
@@ -1016,43 +984,14 @@ EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
 static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
  dma_addr_t *dma_handle, gfp_t flag)
 {
-   if (vring_use_dma_api(vdev)) {
-   return dma_alloc_coherent(vdev->dev.parent, size,
+   return dma_alloc_coherent(vdev->dev.parent, size,
  dma_handle, flag);
-   } else {
-   void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag);
-   if (queue) {
-   phys_addr_t phys_addr = virt_to_phys(queue);
-   *dma_handle = (dma_addr_t)phys_addr;
-
-   /*
-* Sanity check: make sure we dind't truncate
-* the address.  The only arches I can find that
-* have 64-bit phys_addr_t but 32-bit dma_addr_t
-* are certain non-highmem MIPS and x86
-* configurations, but these configurations
-* should never allocate physical pages above 32
-* bits, so this is fine.  Just in case, throw a
-* warning and abort if we end up with an
-* unrepresentable address.
-*/
-   if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
-   free_pages_exact(queue, PAGE_ALIGN(size));
-   return NULL;
-   }
-   }
-   return queue;
-   }
 }
 
 static void vring_free_queue(struct virtio_device *vdev, size_t size,
 void *queue, dma_addr_t dma_handle)
 {
-   if (vring_use_dma_api(vdev)) {
-   dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
-   } else {
-  

[RFC 2/4] virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively

2018-07-19 Thread Anshuman Khandual
Now that virtio core always needs all virtio devices to have DMA OPS, we
need to make sure that the structure it points to is the right one. In the
absence of the VIRTIO_F_IOMMU_PLATFORM flag, QEMU expects GPAs from the guest
kernel. In that case, the virtio device must use the default
virtio_direct_dma_ops DMA OPS structure, which transforms scatter-gather
buffer addresses into GPAs. This DMA OPS override must happen as early as
possible during the virtio device initialization sequence, before virtio core
starts using the given device's DMA OPS callbacks for I/O transactions. This
change detects the device's IOMMU flag and does the override in case the flag
is cleared.

Signed-off-by: Anshuman Khandual 
---
 drivers/virtio/virtio.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7907ad3..6b13987 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -166,6 +166,8 @@ void virtio_add_status(struct virtio_device *dev, unsigned 
int status)
 }
 EXPORT_SYMBOL_GPL(virtio_add_status);
 
+const struct dma_map_ops virtio_direct_dma_ops;
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
int ret = dev->config->finalize_features(dev);
@@ -174,6 +176,9 @@ int virtio_finalize_features(struct virtio_device *dev)
if (ret)
return ret;
 
+   if (virtio_has_iommu_quirk(dev))
+   set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+
if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
return 0;
 
-- 
2.9.3



[RFC 1/4] virtio: Define virtio_direct_dma_ops structure

2018-07-19 Thread Anshuman Khandual
The current implementation of the DMA API inside virtio core calls the
device's DMA OPS callback functions when the VIRTIO_F_IOMMU_PLATFORM flag is
set. In the absence of the flag, virtio core falls back to a basic
transformation of the incoming SG addresses as GPAs. Going forward, virtio
should only call DMA API based transformations, generating either a GPA or an
IOVA depending on QEMU expectations, again based on the
VIRTIO_F_IOMMU_PLATFORM flag. This requires removing the existing fallback
code path for the GPA transformation and replacing it with a direct-map DMA
OPS structure. This patch adds that direct mapping DMA OPS structure, to be
used in later patches which will make virtio core call the DMA API all the
time for all virtio devices.

Signed-off-by: Anshuman Khandual 
---
 drivers/virtio/virtio.c| 60 ++
 drivers/virtio/virtio_pci_common.h |  3 ++
 2 files changed, 63 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 59e36ef..7907ad3 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Unique numbering for virtio devices. */
@@ -442,3 +443,62 @@ core_initcall(virtio_init);
 module_exit(virtio_exit);
 
 MODULE_LICENSE("GPL");
+
+/*
+ * Virtio direct mapping DMA API operations structure
+ *
+ * This defines DMA API structure for all virtio devices which would not
+ * either bring in their own DMA OPS from architecture or they would not
+ * like to use architecture specific IOMMU based DMA OPS because QEMU
+ * expects GPA instead of an IOVA in absence of VIRTIO_F_IOMMU_PLATFORM.
+ */
+dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
+   unsigned long offset, size_t size,
+   enum dma_data_direction dir,
+   unsigned long attrs)
+{
+   return page_to_phys(page) + offset;
+}
+
+void virtio_direct_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+   size_t size, enum dma_data_direction dir,
+   unsigned long attrs)
+{
+}
+
+int virtio_direct_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
+{
+   return 0;
+}
+
+void *virtio_direct_alloc(struct device *dev, size_t size, dma_addr_t 
*dma_handle,
+   gfp_t gfp, unsigned long attrs)
+{
+   void *queue = alloc_pages_exact(PAGE_ALIGN(size), gfp);
+
+   if (queue) {
+   phys_addr_t phys_addr = virt_to_phys(queue);
+   *dma_handle = (dma_addr_t)phys_addr;
+
+   if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
+   free_pages_exact(queue, PAGE_ALIGN(size));
+   return NULL;
+   }
+   }
+   return queue;
+}
+
+void virtio_direct_free(struct device *dev, size_t size, void *vaddr,
+   dma_addr_t dma_addr, unsigned long attrs)
+{
+   free_pages_exact(vaddr, PAGE_ALIGN(size));
+}
+
+const struct dma_map_ops virtio_direct_dma_ops = {
+   .alloc  = virtio_direct_alloc,
+   .free   = virtio_direct_free,
+   .map_page   = virtio_direct_map_page,
+   .unmap_page = virtio_direct_unmap_page,
+   .mapping_error  = virtio_direct_mapping_error,
+};
+EXPORT_SYMBOL(virtio_direct_dma_ops);
diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index 135ee3c..ec44d2f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -31,6 +31,9 @@
 #include 
 #include 
 
+extern struct dma_map_ops virtio_direct_dma_ops;
+
+
 struct virtio_pci_vq_info {
/* the actual virtqueue */
struct virtqueue *vq;
-- 
2.9.3



[RFC 0/4] Virtio uses DMA API for all devices

2018-07-19 Thread Anshuman Khandual
This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virito devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.

First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks to do the required (as discussed previously
in the above mentioned thread how host is allowed to access only parts of
the guest GPA range) bounce buffering into the shared memory for all I/O
scatter gather buffers to be consumed on the host side.

Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.

Anshuman Khandual (4):
  virtio: Define virtio_direct_dma_ops structure
  virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
  virtio: Force virtio core to use DMA API callbacks for all virtio devices
  virtio: Add platform specific DMA API translation for virtio devices

 arch/powerpc/include/asm/dma-mapping.h |  6 +++
 arch/powerpc/platforms/pseries/iommu.c |  6 +++
 drivers/virtio/virtio.c| 72 ++
 drivers/virtio/virtio_pci_common.h |  3 ++
 drivers/virtio/virtio_ring.c   | 65 +-
 5 files changed, 89 insertions(+), 63 deletions(-)

-- 
2.9.3



Re: [kernel,v7,1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize

2018-07-19 Thread Paul Mackerras
On Thu, Jul 19, 2018 at 04:06:10PM +1000, Michael Ellerman wrote:
> On Tue, 2018-07-17 at 07:19:12 UTC, Alexey Kardashevskiy wrote:
> > The size is always equal to 1 page so let's use this. Later on this will
> > be used for other checks which use page shifts to check the granularity
> > of access.
> > 
> > This should cause no behavioral change.
> > 
> > Reviewed-by: David Gibson 
> > Acked-by: Alex Williamson 
> > Signed-off-by: Alexey Kardashevskiy 
> 
> Applied to powerpc fixes, thanks.
> 
> https://git.kernel.org/powerpc/c/1463edca6734d42ab4406fa2896e20

Ah.  I have put these two patches in my kvm-ppc-next branch and I was
about to send a pull request to Paolo.

Paul.


Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.

2018-07-19 Thread Michael Neuling
On Fri, 2018-07-20 at 12:32 +1000, Michael Ellerman wrote:
> Michael Neuling  writes:
> > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
> > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> > > > 
> > > > >   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > > > > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > > > > b/arch/powerpc/kernel/idle_book3s.S
> > > > > index d85d551..5069d42 100644
> > > > > --- a/arch/powerpc/kernel/idle_book3s.S
> > > > > +++ b/arch/powerpc/kernel/idle_book3s.S
> > > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> > > > >   mfspr   r4, SPRN_MMCR2
> > > > >   std r3, STOP_MMCR1(r13)
> > > > >   std r4, STOP_MMCR2(r13)
> > > > > +
> > > > > + mfspr   r3, SPRN_SPRG3
> > > > > + std r3, STOP_SPRG3(r13)
> > > > 
> > > > We don't need to save it.  Just restore it from paca->sprg_vdso which
> > > > should
> > > > never change.
> > > 
> > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
> > > 
> > > > 
> > > > How can we do better at catching these missing SPRGs?
> > > 
> > > We can go through the list of SPRs from the POWER9 User Manual and
> > > document explicitly why we don't have to save/restore certain SPRs
> > > during the execution of the stop instruction. Does this sound ok ?
> > > 
> > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
> > > accessible from
> > > https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)
> > 
> > I was thinking of a boot time test case built into linux. linux has some
> > boot
> > time test cases which you can enable via CONFIG options.
> > 
> > Firstly you could see if an SPR exists using the same trick xmon does in
> > dump_one_spr(). Then once you have a list of usable SPRs, you could write
> > all
> > the known ones (I assume you'd have to leave out some, like the PSSCR), then
> > set
> 
> Write what value?
> 
> Ideally you want to write a random bit pattern to reduce the chance
> that only some bits are being restored.

The xmon dump_one_spr() trick tries to work around that by writing one random
value and then a different one to see if it really is a nop.

> But you can't do that because writing a value to an SPRs has an effect.

Sure that's a concern but xmon seems to get away with it.

> Some of them might even need to be zero, in which case you can't really
> distinguish that from a non-restored zero.

It doesn't need to be perfect. It just needs to catch more than we have now.

> > the appropriate stop level, make sure you got into that stop level, and then
> > see
> > if that register was changed. Then you'd have an automated list of registers
> > you
> > need to make sure you save/restore at each stop level.
> > 
> > Could something like that work?
> 
> Maybe.
> 
> Ignoring the problem of whether you can write a meaningful value to some
> of the SPRs, I'm not entirely convinced it's going to work. But maybe
> I'm wrong.

Yeah, I'm not convinced it'll work either but it would be a nice piece of test
infrastructure to have if it does work.

We'd still need to marry up the SPR numbers we get from the test to what's
actually being restored in Linux.

> But there's a much simpler solution, we should 1) have a selftest for
> getcpu() and 2) we should be running the glibc (I think?) test suite
> that found this in the first place. It's frankly embarrassing that we
> didn't find this.

Yeah, we should do that also, but how do we catch the next SPR we are missing?
I'd like some systematic way of doing that rather than whack-a-mole.

Mikey



Re: linux-next: manual merge of the powerpc tree with the powerpc-fixes tree

2018-07-19 Thread Michael Ellerman
Stephen Rothwell  writes:

> Hi all,
>
> Today's linux-next merge of the powerpc tree got a conflict in:
>
>   drivers/vfio/vfio_iommu_spapr_tce.c
>
> between commit:
>
>   1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize")
>
> from the powerpc-fixes tree and commit:
>
>   00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian")
>
> from the powerpc tree.

Thanks.

That has turned into a real mess, with conflicting code in next, fixes
and topic/ppc-kvm.

I'll fix it all up before the merge window.

cheers


Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.

2018-07-19 Thread Michael Ellerman
Michael Neuling  writes:
> On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
>> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
>> > 
>> > >  DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
>> > > diff --git a/arch/powerpc/kernel/idle_book3s.S
>> > > b/arch/powerpc/kernel/idle_book3s.S
>> > > index d85d551..5069d42 100644
>> > > --- a/arch/powerpc/kernel/idle_book3s.S
>> > > +++ b/arch/powerpc/kernel/idle_book3s.S
>> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
>> > >  mfspr   r4, SPRN_MMCR2
>> > >  std r3, STOP_MMCR1(r13)
>> > >  std r4, STOP_MMCR2(r13)
>> > > +
>> > > +mfspr   r3, SPRN_SPRG3
>> > > +std r3, STOP_SPRG3(r13)
>> > 
>> > We don't need to save it.  Just restore it from paca->sprg_vdso which 
>> > should
>> > never change.
>> 
>> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
>> 
>> > 
>> > How can we do better at catching these missing SPRGs?
>> 
>> We can go through the list of SPRs from the POWER9 User Manual and
>> document explicitly why we don't have to save/restore certain SPRs
>> during the execution of the stop instruction. Does this sound ok ?
>> 
>> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
>> accessible from
>> https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)
>
> I was thinking of a boot time test case built into linux. linux has some boot
> time test cases which you can enable via CONFIG options.
>
> Firstly you could see if an SPR exists using the same trick xmon does in
> dump_one_spr(). Then once you have a list of usable SPRs, you could write all
> the known ones (I assume you'd have to leave out some, like the PSSCR), then 
> set

Write what value?

Ideally you want to write a random bit pattern to reduce the chance
that only some bits are being restored.

But you can't do that because writing a value to an SPRs has an effect.

Some of them might even need to be zero, in which case you can't really
distinguish that from a non-restored zero.

> the appropriate stop level, make sure you got into that stop level, and then 
> see
> if that register was changed. Then you'd have an automated list of registers 
> you
> need to make sure you save/restore at each stop level.
>
> Could something like that work?

Maybe.

Ignoring the problem of whether you can write a meaningful value to some
of the SPRs, I'm not entirely convinced it's going to work. But maybe
I'm wrong.

But there's a much simpler solution, we should 1) have a selftest for
getcpu() and 2) we should be running the glibc (I think?) test suite
that found this in the first place. It's frankly embarrassing that we
didn't find this.
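
For reference, a minimal sketch of what such a getcpu() selftest could look
like: pin to each CPU, let it idle long enough to enter a stop state, then
check that the vDSO result (which reads SPRG3 on powerpc) matches both the
pinned CPU and the real syscall. The sleep length and other details below are
guesses:

/* Minimal sketch of a getcpu() selftest; the sleep needed to get the
 * CPU into a deep stop state is a guess. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_CONF);

	for (long cpu = 0; cpu < ncpus; cpu++) {
		cpu_set_t mask;
		unsigned int sys_cpu;
		int vdso_cpu;

		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		if (sched_setaffinity(0, sizeof(mask), &mask))
			continue;	/* CPU offline, skip it */

		usleep(500000);		/* let the CPU idle into stop */

		vdso_cpu = sched_getcpu();	/* vDSO path, uses SPRG3 */
		if (syscall(SYS_getcpu, &sys_cpu, NULL, NULL))
			return 1;

		if (vdso_cpu != cpu || sys_cpu != cpu) {
			fprintf(stderr, "cpu %ld: vdso=%d syscall=%u\n",
				cpu, vdso_cpu, sys_cpu);
			return 1;
		}
	}
	puts("ok");
	return 0;
}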

cheers


Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.

2018-07-19 Thread Michael Neuling
On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
> Hello Mikey,
> 
> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> > 
> > >   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > > b/arch/powerpc/kernel/idle_book3s.S
> > > index d85d551..5069d42 100644
> > > --- a/arch/powerpc/kernel/idle_book3s.S
> > > +++ b/arch/powerpc/kernel/idle_book3s.S
> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> > >   mfspr   r4, SPRN_MMCR2
> > >   std r3, STOP_MMCR1(r13)
> > >   std r4, STOP_MMCR2(r13)
> > > +
> > > + mfspr   r3, SPRN_SPRG3
> > > + std r3, STOP_SPRG3(r13)
> > 
> > We don't need to save it.  Just restore it from paca->sprg_vdso which should
> > never change.
> 
> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
> 
> > 
> > How can we do better at catching these missing SPRGs?
> 
> We can go through the list of SPRs from the POWER9 User Manual and
> document explicitly why we don't have to save/restore certain SPRs
> during the execution of the stop instruction. Does this sound ok ?
> 
> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
> accessible from
> https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)

I was thinking of a boot time test case built into linux. linux has some boot
time test cases which you can enable via CONFIG options.

Firstly you could see if an SPR exists using the same trick xmon does in
dump_one_spr(). Then once you have a list of usable SPRs, you could write all
the known ones (I assume you'd have to leave out some, like the PSSCR), then set
the appropriate stop level, make sure you got into that stop level, and then see
if that register was changed. Then you'd have an automated list of registers you
need to make sure you save/restore at each stop level.

Could something like that work?

Mikey


linux-next: manual merge of the powerpc tree with the powerpc-fixes tree

2018-07-19 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the powerpc tree got a conflict in:

  drivers/vfio/vfio_iommu_spapr_tce.c

between commit:

  1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize")

from the powerpc-fixes tree and commit:

  00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian")

from the powerpc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/vfio/vfio_iommu_spapr_tce.c
index 7cd63b0c1a46,11a4c194d6e3..
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@@ -487,11 -449,11 +449,11 @@@ static void tce_iommu_unuse_page_v2(str
if (!pua)
return;
  
-   ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
-   &hpa, &mem);
+   ret = tce_iommu_prereg_ua_to_hpa(container, be64_to_cpu(*pua),
 -  IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
++  tbl->it_page_shift, &hpa, &mem);
if (ret)
-   pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
-   __func__, *pua, entry, ret);
+   pr_debug("%s: tce %llx at #%lx was not cached, ret=%d\n",
+   __func__, be64_to_cpu(*pua), entry, ret);
if (mem)
mm_iommu_mapped_dec(mem);
  
@@@ -599,19 -561,12 +561,12 @@@ static long tce_iommu_build_v2(struct t
unsigned long hpa;
enum dma_data_direction dirtmp;
  
-   if (!tbl->it_userspace) {
-   ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
-   if (ret)
-   return ret;
-   }
- 
for (i = 0; i < pages; ++i) {
struct mm_iommu_table_group_mem_t *mem = NULL;
-   unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
-   entry + i);
+   __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
  
ret = tce_iommu_prereg_ua_to_hpa(container,
 -  tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
 +  tce, tbl->it_page_shift, &hpa, &mem);
if (ret)
break;
  




Re: [PATCH v3] PCI: Data corruption happening due to race condition

2018-07-19 Thread Hari Vyas
Hi Bjorn, Ben

On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt
 wrote:
> On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote:
>> [+cc Paul, Michael, linuxppc-dev]
>>
>
>/...
>
>> > Debugging revealed a race condition between pcie core driver
>> > enabling is_added bit(pci_bus_add_device()) and nvme driver
>> > reset work-queue enabling is_busmaster bit (by pci_set_master()).
>> > As both fields are not handled in atomic manner and that clears
>> > is_added bit.
>> >
>> > Fix moves device addition is_added bit to separate private flag
>> > variable and use different atomic functions to set and retrieve
>> > device addition state. As is_added shares different memory
>> > location so race condition is avoided.
>>
>> Really nice bit of debugging!
>
> Indeed. However, I'm not a fan of the solution. Shouldn't we instead have
> some locking for the content of pci_dev? I've always been wary of us
> having other similar races in there.
>
> As for the powerpc bits, I'm probably the one who wrote them, however,
> I'm on vacation this week and right now, no bandwidth to context switch
> all that back in :-) So give me a few days and/or ping me next week.
>
> The powerpc PCI code contains a lot of cruft coming from the depth of
> history, including rather nasty assumptions. We want to progressively
> clean it up, starting with EEH, but it will take time.
>
> Cheers,
> Ben.
>
Some drivers also use the pci_dev structure flags directly, which may cause
similar kinds of races and should be avoided. This is probably not causing
issues right now, but some race scenario may be affected, and it should be
handled with get()/set() APIs that work atomically.
I would suggest using a bit position for each of the remaining bitfields and
atomic operations on them. That way access is controlled and direct updates
of the fields from outside are avoided, e.g.:
enum pci_dev_flags {
	IS_BUSMASTER = 1,
	NO_MSI = 2,
};

void assign_pci_dev_flag(struct pci_dev *dev, int flag, bool val)
{
	assign_bit(flag, &dev->flags, val);
}
Proper cleanup is required in quite a few places and will certainly take some
time, i.e. a fair amount of effort, but it will be safe for the future. If
Bjorn agrees, we can work on that.
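
The private-flag approach amounts to something like the following helpers (a
sketch only; the priv_flags field and the macro name are guesses based on the
diffstat and the pci_dev_is_added() call sites quoted below):

/* Sketch of the private helpers in drivers/pci/pci.h; names are guesses. */
#define PCI_DEV_ADDED 0

static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
{
	assign_bit(PCI_DEV_ADDED, &dev->priv_flags, added);
}

static inline bool pci_dev_is_added(const struct pci_dev *dev)
{
	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
}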
>> > Signed-off-by: Hari Vyas 
>> > ---
>> >  arch/powerpc/kernel/pci-common.c  |  4 +++-
>> >  arch/powerpc/platforms/powernv/pci-ioda.c |  3 ++-
>> >  arch/powerpc/platforms/pseries/setup.c|  3 ++-
>> >  drivers/pci/bus.c |  6 +++---
>> >  drivers/pci/hotplug/acpiphp_glue.c|  2 +-
>> >  drivers/pci/pci.h | 11 +++
>> >  drivers/pci/probe.c   |  4 ++--
>> >  drivers/pci/remove.c  |  5 +++--
>> >  include/linux/pci.h   |  1 -
>> >  9 files changed, 27 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/arch/powerpc/kernel/pci-common.c 
>> > b/arch/powerpc/kernel/pci-common.c
>> > index fe9733f..471aac3 100644
>> > --- a/arch/powerpc/kernel/pci-common.c
>> > +++ b/arch/powerpc/kernel/pci-common.c
>> > @@ -42,6 +42,8 @@
>> >  #include 
>> >  #include 
>> >
>> > +#include "../../../drivers/pci/pci.h"
>>
>> I see why you need it, but this include path is really ugly.  Outside
>> of bootloaders and tools, there are very few instances of includes
>> like this that reference a different top-level directory, and I'm not
>> very keen about adding more.
>>
>> Obviously powerpc is the only arch that needs dev->is_added.  It seems
>> to be because "We can only call pcibios_setup_device() after bus setup
>> is complete, since some of the platform specific DMA setup code
>> depends on it."
>>
>> I don't know powerpc, but it does raise the question in my mind of
>> whether powerpc could be changed to do the DMA setup more like other
>> arches do to remove this ordering dependency and the need to use
>> dev->is_added.
>>
>> That sounds like a lot of work, but it would have the benefit of
>> unifying some code that is probably needlessly arch-specific.
>>
Yes, I also agree that including pci.h with ../ references is really bad.
The first patch used a spin lock to protect the is_added and is_busmaster
bits, but the final patch moved is_added to private flags.

>> >  /* hose_spinlock protects accesses to the the phb_bitmap. */
>> >  static DEFINE_SPINLOCK(hose_spinlock);
>> >  LIST_HEAD(hose_list);
>> > @@ -1014,7 +1016,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
>> > /* Cardbus can call us to add new devices to a bus, so ignore
>> >  * those who are already fully discovered
>> >  */
>> > -   if (dev->is_added)
>> > +   if (pci_dev_is_added(dev))
>> > continue;
>> >
>> > pcibios_setup_device(dev);
>> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>> > b/arch/powerpc/platforms/powernv/pci-ioda.c
>> > index 5bd0eb6..70b2e1e 100644
>> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> > @@ -46,6 +46,

Re: Improvements for the PS3

2018-07-19 Thread Fredrik Noring
Hi Geert, Geoff,

> > > so I added a sleep with
> > >
> > > + msleep(1);
> 
> I can't see where you added the sleep, but 10s seems excessive.
> If the real reason is the need to wait for an interrupt for 
> ps3fb_sync_image(),
> then waiting for 40 ms should be sufficient? Or am I missing something?

It's at the end of ps3fb_probe, as shown in the original post:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/175771.html

I thought 100 ms or so would work, but evidently it didn't. In fact, 1 s
or even 5 s didn't seem to work either. In any case, I would like to
develop a solution that does not need to sleep at all, so that will be my
first approach for a proper implementation.

> > > I suppose the problem is that it relies on interrupts for ps3fb_sync_image
> > > to regularly copy the image, hence without them the screen isn't updated 
> > > to
> > > show kernel panics, etc. Perhaps one way to fix that is to implement the
> > > struct fb_tile_ops API, so that the console is synchronously updated? 
> > > Would
> > > that be acceptable?
> >
> > I'm not sure if that would work or not.   Maybe Geert is more familiar with 
> > it.
> 
> That sounds like a complex solution, slowing down the console a lot.

Why would that be slow? I have implemented a similar technique for the
PlayStation 2 frame buffer, and (without any measurements at hand now) it
appears to be about as fast as is possible, and reasonably easy too. :)

It works like this:

The PS2 frame buffer can operate in two distinct modes: virtual mode or
console mode. Virtual mode is very similar to the PS3 in that the whole
visible kernel memory frame buffer is copied to the Graphics Synthesizer
(GS) via DMA regularly at vertical blank intervals.

Console mode is different. No kernel memory is allocated for the frame
buffer at all (which is a significant improvement in itself given that the
PS2 is limited to 32 MiB of main memory) and mmap is disabled. Some struct
fb_ops such as fb_fillrect and fb_copyarea are directly accelerated by GS
hardware. The GS has two CRT merge circuits made for picture-in-picture that
are used to accelerate YWRAP in hardware, which is particularly effective
for console text scrolling.

Additionally, the tiled API is implemented, and it turned out to be a rather
good fit. The GS has 4 MiB of local frame buffer video memory (not directly
accessible by the kernel). The maximum practical resolution 1920x1080p at 16
bits per pixel is 4147200 bytes, which leaves 47104 bytes for a tiled font
which at 8x8 pixels and a minimum 4 bits indexed texture palette is at most
1464 characters. The indexed palette makes it easy to switch colors. struct
fb_tile_ops callbacks such as fb_tileblit are accelerated by GS texture
sprites, which are fast (GS-local copies) and are issued by the kernel via
simple DMA (GIF) GS commands.

Console text is always synchronous and therefore works properly in interrupt
contexts and with kernel panics, etc. which is essential for debuggability.
A buffered version could be faster, possibly, but I think that might as well
be implemented by a user space console driver using a /dev/gs interface that
can do zero-copy GS commands. The PS2 frame buffer implementation is nearly
complete:

https://github.com/frno7/linux/blob/ps2-v4.17-n3/drivers/video/fbdev/ps2fb.c
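
For anyone unfamiliar with the tile API, the hookup is roughly the following
(a sketch only; the ps2fb_* callback names are placeholders rather than the
actual driver functions):

/* Wiring up the CONFIG_FB_TILEBLITTING hooks in an fbdev driver. */
static struct fb_tile_ops ps2fb_tile_ops = {
	.fb_settile	= ps2fb_settile,	/* upload the console font as textures */
	.fb_tilecopy	= ps2fb_tilecopy,	/* GS local-to-local copy */
	.fb_tilefill	= ps2fb_tilefill,	/* GS sprite fill */
	.fb_tileblit	= ps2fb_tileblit,	/* draw glyphs as textured sprites */
	.fb_tilecursor	= ps2fb_tilecursor,
	.fb_get_tilemax	= ps2fb_get_tilemax,
};

	/* ...in the probe function: */
	info->flags |= FBINFO_MISC_TILEBLITTING;
	info->tileops = &ps2fb_tile_ops;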

Some adjustments and feature reductions seem to be needed for a PS3 version
of anything similar. The simplest implementation is probably to just mirror
characters as they are printed synchronously. I don't know the overhead of
the hypervisor calls for copying graphics though, but the typical areas are
quite small. Perhaps one could avoid allocating the kernel frame buffer as
well when it's not needed. I have to investigate these things to be sure.

> What about letting ps3fb register a panic notifier to sync the screen, like
> hyperv_fb does?

That would not work with kernel freezes unfortunately. Debugging those with
nondeterministically invisible kernel prints would be painful, I believe.

Fredrik


Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-19 Thread Andrew Morton
On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He  wrote:

> Hi Andrew,
> 
> On 07/18/18 at 03:33pm, Andrew Morton wrote:
> > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He  wrote:
> > 
> > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to
> > > load the kernel/initrd/purgatory is supposed to be allocated from the top of
> > > memory downwards. This is what we have been doing all along in the old kexec
> > > loading interface, and kexec loading is still the default in some
> > > distributions. However, the current kexec_file loading interface doesn't
> > > behave like this. The arch_kexec_walk_mem() function it calls ignores
> > > kexec_buf.top_down and calls walk_system_ram_res() directly to go through all
> > > System RAM resources from bottom to top, trying to find a memory region which
> > > can contain the specified kexec buffer, and then calls
> > > locate_mem_hole_callback() to allocate memory within that region from top to
> > > bottom. This brings confusion, especially now that KASLR is widely supported:
> > > users have to work out why the kexec/kdump kernel loading position differs
> > > between these two interfaces in order to rule out unnecessary noise. Hence
> > > these two interfaces need to be unified in behaviour.
> > 
> > As far as I can tell, the above is the whole reason for the patchset,
> > yes?  To avoid confusing users.
> 
> 
> In fact, it's not just about avoiding user confusion. Kexec loading and
> kexec_file loading do the same thing in essence; kexec_file exists because we
> need to do kernel image verification on UEFI systems, which required porting
> the kexec loading code into the kernel.
> 
> Kexec has been a formal feature in our distro, and customers who own those
> kinds of very large machines can use it to speed up the reboot process. On
> UEFI machines, kexec_file loading will search for a place to put the kernel
> under 4G, from the top down. As we know, the first 4G is the DMA32 zone; DMA,
> PCI MMCONFIG, the BIOS and so on all try to consume it, so it may not be
> possible to find a usable space there for the kernel/initrd. Searching the
> whole memory space from the top down, we don't have this worry.
> 
> At the first post, I posted the version below using AKASHI's
> walk_system_ram_res_rev(). Later you suggested using list_head to link a
> resource's children and siblings, to see what the code change would look
> like.
> http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com
> 
> Then I posted v2
> http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
> Rob Herring mentioned that other components which have this tree structure
> have planned to do the same thing, replacing the singly linked list with
> list_head to link resource children and siblings. Quoting Rob's words below;
> I think this could be another reason.
> 
> ~ From Rob
> The DT struct device_node also has the same tree structure with
> parent, child, sibling pointers and converting to list_head had been
> on the todo list for a while. ACPI also has some tree walking
> functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> common tree struct and helpers defined either on top of list_head or a
> new struct if that saves some size.
> ~

Please let's get all this into the changelogs?

> > 
> > Is that sufficient?  Can we instead simplify their lives by providing
> > better documentation or informative printks or better Kconfig text,
> > etc?
> > 
> > And who *are* the people who are performing this configuration?  Random
> > system administrators?  Linux distro engineers?  If the latter then
> > they presumably aren't easily confused!
> 
> Kexec was invented for kernel developers to speed up kernel rebooting. Now
> high-end server admins, kernel developers and QE are also keen to use it to
> reboot large boxes for faster feature testing and bug debugging. Kernel
> developers may know the kernel loading position well; admins or QE might not
> be aware of it.
> 
> > 
> > In other words, I'm trying to understand how much benefit this patchset
> > will provide to our users as a whole.
> 
> Understood. The list_head replacement patch truly involves too many code
> changes, so it's risky. I am willing to try any idea from reviewers, without
> insisting that it has to be accepted in the end. If we don't try, we don't
> know what it looks like or what impact it may have. I am fine with taking
> AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even
> though it could be a little less efficient.

The larger patch produces a better result.  We can handle it ;)


Re: [PATCH] powerpc/msi: Remove VLA usage

2018-07-19 Thread Kees Cook
On Thu, Jul 19, 2018 at 5:17 AM, Michael Ellerman  wrote:
> Kees Cook  writes:
>
>> On Fri, Jun 29, 2018 at 11:52 AM, Kees Cook  wrote:
>>> In the quest to remove all stack VLA usage from the kernel[1], this
>>> switches from an unchanging variable to a constant expression to eliminate
>>> the VLA generation.
>>>
>>> [1] 
>>> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>>>
>>> Signed-off-by: Kees Cook 
>>
>> Friendly ping! Michael, can you take this?
>
> Yep, sorry. I actually applied it weeks ago but hadn't pushed.

No worries! Thanks! :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier

2018-07-19 Thread Andy Lutomirski
On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski  wrote:
> [I added PeterZ and Vitaly -- can you see any way in which this would
> break something obscure?  I don't.]
>
> On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel  wrote:
>> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
>> next?
>
> Yes, AFAICS.
>
>>
>> On to the lazy TLB mm_struct refcounting stuff :)
>>
>>>
>>> Which refcount?  mm_users shouldn’t be hot, so I assume you’re talking about
>>> mm_count. My suggestion is to get rid of mm_count instead of trying to
>>> optimize it.
>>
>>
>> Do you have any suggestions on how? :)
>>
>> The TLB shootdown sent at __exit_mm time does not get rid of the
>> kernelthread->active_mm
>> pointer pointing at the mm that is exiting.
>>
>
> Ah, but that's conceptually very easy to fix.  Add a #define like
> ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> #define is set.  After some grepping, there are very few users.  The
> only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> are involved in the rather complicated dance of refcounting active_mm.
> If that field goes away, it doesn't need to be refcounted.  Instead, I
> think the refcounting can get replaced with something like:
>
> /*
>  * Release any arch-internal references to mm.  Only called when
> mm_users is zero
>  * and all tasks using mm have either been switch_mm()'d away or have had
>  * enter_lazy_tlb() called.
>  */
> extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
>
> which the kernel calls in __mmput() after tearing down all the page
> tables.  The body can be something like:
>
> if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...))) {
>   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> }
>
> (You'll also have to fix up the highly questionable users in
> arch/x86/platform/efi/efi_64.c, but that's easy.)
>
> Does all that make sense?  Basically, as I understand it, the
> expensive atomic ops you're seeing are all pointless because they're
> enabling an optimization that hasn't actually worked for a long time,
> if ever.

Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
arch_exit_mmap(), I think.  It's a heavier weight version of more or
less the same thing that arch_shoot_down_dead_mm() would be, except
that it happens before exit_mmap().  But maybe Xen actually has the
right idea.  In other words, rather than doing the big pagetable free in
exit_mmap() while there may still be other CPUs pointing at the page
tables, the other order might make more sense.  So maybe, if
ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
for getting rid of all secret arch references to the mm.

Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.

I added some more arch maintainers.  The idea here is that, on x86 at
least, task->active_mm and all its refcounting is pure overhead.  When
a process exits, __mmput() gets called, but the core kernel has a
longstanding "optimization" in which other tasks (kernel threads and
idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
complicated, and hurts performance on large systems, since it requires
extra atomic operations whenever a CPU switches between real user
threads and idle/kernel threads.

It's also almost completely worthless on x86 at least, since __mmput()
frees pagetables, and that operation *already* forces a remote TLB
flush, so we might as well zap all the active_mm references at the
same time.
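
To make the cost concrete, this is roughly what the core scheduler does today
when switching to a kernel thread (simplified from context_switch() in
kernel/sched/core.c):

	if (!next->mm) {			/* kernel thread or idle task */
		next->active_mm = prev->active_mm;
		/* The atomic op at issue; the matching mmdrop() runs in
		 * finish_task_switch() once the borrowed mm is let go. */
		mmgrab(prev->active_mm);
		enter_lazy_tlb(prev->active_mm, next);
	}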

But arm64 has real HW remote flushes.  Does arm64 actually benefit
from the active_mm optimization?  What happens on arm64 when a process
exits?  How about s390?  I suspect that s390 has rather larger systems
than arm64, where the cost of the reference counting can be much
higher.

(Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
interact with your previous patch set?)


Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public

2018-07-19 Thread Baoquan He
On 07/18/18 at 07:37pm, Andy Shevchenko wrote:
> On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko
>  wrote:
> > On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He  wrote:
> >> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> >> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> >> so that it's shared.
> 
> >> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
> >> + * contained by 'res', -ECANCELED if no any conflicting entry found.
> 
> You also can refer to constants by prefixing them with %, e.g. %-ENOTSUPP.
> But this is up to you completely.

Thanks, will fix when I repost.
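
For example, the Return section could then read:

/**
 * Return: 0 on success, %-ENOTSUPP if the child resource is not completely
 * contained by @res, %-ECANCELED if no conflicting entry was found.
 */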



Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required

2018-07-19 Thread Baoquan He
Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He  wrote:
> 
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory used to
> > load the kernel/initrd/purgatory is supposed to be allocated from the top of
> > memory downwards. This is what we have been doing all along in the old kexec
> > loading interface, and kexec loading is still the default in some
> > distributions. However, the current kexec_file loading interface doesn't
> > behave like this. The arch_kexec_walk_mem() function it calls ignores
> > kexec_buf.top_down and calls walk_system_ram_res() directly to go through all
> > System RAM resources from bottom to top, trying to find a memory region which
> > can contain the specified kexec buffer, and then calls
> > locate_mem_hole_callback() to allocate memory within that region from top to
> > bottom. This brings confusion, especially now that KASLR is widely supported:
> > users have to work out why the kexec/kdump kernel loading position differs
> > between these two interfaces in order to rule out unnecessary noise. Hence
> > these two interfaces need to be unified in behaviour.
> 
> As far as I can tell, the above is the whole reason for the patchset,
> yes?  To avoid confusing users.


In fact, it's not just about avoiding user confusion. Kexec loading and
kexec_file loading do the same thing in essence; kexec_file exists because we
need to do kernel image verification on UEFI systems, which required porting
the kexec loading code into the kernel.

Kexec has been a formal feature in our distro, and customers who own those
kinds of very large machines can use it to speed up the reboot process. On
UEFI machines, kexec_file loading will search for a place to put the kernel
under 4G, from the top down. As we know, the first 4G is the DMA32 zone; DMA,
PCI MMCONFIG, the BIOS and so on all try to consume it, so it may not be
possible to find a usable space there for the kernel/initrd. Searching the
whole memory space from the top down, we don't have this worry.
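
For reference, the behavioural change boils down to something like the sketch
below (ignoring the crash-kernel case; walk_system_ram_res_rev() is AKASHI's
proposed helper, not an existing kernel function):

int arch_kexec_walk_mem(struct kexec_buf *kbuf,
			int (*func)(struct resource *, void *))
{
	/* Honour the requested direction when walking System RAM. */
	if (kbuf->top_down)
		return walk_system_ram_res_rev(0, ULONG_MAX, kbuf, func);

	return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
}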

At the first post, I posted the version below using AKASHI's
walk_system_ram_res_rev(). Later you suggested using list_head to link a
resource's children and siblings, to see what the code change would look
like.
http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com

Then I posted v2
http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com
Rob Herring mentioned that other components which have this tree structure
have planned to do the same thing, replacing the singly linked list with
list_head to link resource children and siblings. Quoting Rob's words below;
I think this could be another reason.

~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
new struct if that saves some size.
~

> 
> Is that sufficient?  Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
> 
> And who *are* the people who are performing this configuration?  Random
> system administrators?  Linux distro engineers?  If the latter then
> they presumably aren't easily confused!

Kexec was invented for kernel developers to speed up kernel rebooting. Now
high-end server admins, kernel developers and QE are also keen to use it to
reboot large boxes for faster feature testing and bug debugging. Kernel
developers may know the kernel loading position well; admins or QE might not
be aware of it.

> 
> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacement patch truly involves too many code
changes, so it's risky. I am willing to try any idea from reviewers, without
insisting that it has to be accepted in the end. If we don't try, we don't
know what it looks like or what impact it may have. I am fine with taking
AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even
though it could be a little less efficient.

Thanks
Baoquan


Re: Improvements for the PS3

2018-07-19 Thread Geoff Levand
Hi Geert, Fredrik,

On 07/19/2018 12:45 AM, Geert Uytterhoeven wrote:
>> On 07/14/2018 09:49 AM, Fredrik Noring wrote:
>>>
>>> et voilà, the screen came alive and the kernel panic was revealed! It seems
>>> the kernel panics so fast that the PS3 frame buffer is unprepared. This is,
>>> of course, very unfortunate because trying to debug the boot process without
>>> a screen or any other means of obtaining console text is quite difficult.
>>
> What about letting ps3fb register a panic notifier to sync the screen, like
> hyperv_fb does?

Seems like the thing to do.  Fredrik, do you want to try it?  Otherwise, I'll
work on it when I have some time.
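
A minimal sketch of what such a notifier could look like (ps3fb_sync_now() is
a placeholder for whatever routine copies the shadow buffer to the GPU,
assuming it is safe to call from the panic path):

static int ps3fb_panic_notify(struct notifier_block *nb, unsigned long event,
			      void *data)
{
	ps3fb_sync_now();	/* placeholder: flush the shadow buffer */
	return NOTIFY_DONE;
}

static struct notifier_block ps3fb_panic_nb = {
	.notifier_call = ps3fb_panic_notify,
};

	/* ...in ps3fb_probe(): */
	atomic_notifier_chain_register(&panic_notifier_list, &ps3fb_panic_nb);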

-Geoff


Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask

2018-07-19 Thread Alan Stern
On Thu, 19 Jul 2018, Geoff Levand wrote:

> Hi Alan,
> 
> On 07/19/2018 07:33 AM, Alan Stern wrote:
> > On Wed, 18 Jul 2018, Geoff Levand wrote:
> > 
> >> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
> >> index 8c733492d8fe..454d8c624a3f 100644
> >> --- a/drivers/usb/host/ehci-ps3.c
> >> +++ b/drivers/usb/host/ehci-ps3.c
> >> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
> >> *dev)
> >>int result;
> >>struct usb_hcd *hcd;
> >>unsigned int virq;
> >> -  static u64 dummy_mask = DMA_BIT_MASK(32);
> >> +  static u64 dummy_mask;
> >>  
> >>if (usb_disabled()) {
> >>result = -ENODEV;
> >> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
> >> *dev)
> >>goto fail_irq;
> >>}
> >>  
> >> -  dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> >> +  dummy_mask = DMA_BIT_MASK(32);
> >> +  dev->core.dma_mask = &dummy_mask;
> >> +  dma_set_coherent_mask(&dev->core, dummy_mask);
> > 
> > What is the reason for changing a static initialization to a dynamic
> > one?  As far as I can see, the patch touches four lines of code but the
> > only real difference is addition of a single line (and removal of a 
> > comment).
> 
> I thought it would be better if all the setting was done in
> one place, that's the only reason.

All right; in that case (for the EHCI and OHCI pieces):

Acked-by: Alan Stern 

Alan Stern



Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask

2018-07-19 Thread Geoff Levand
Hi Alan,

On 07/19/2018 07:33 AM, Alan Stern wrote:
> On Wed, 18 Jul 2018, Geoff Levand wrote:
> 
>> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
>> index 8c733492d8fe..454d8c624a3f 100644
>> --- a/drivers/usb/host/ehci-ps3.c
>> +++ b/drivers/usb/host/ehci-ps3.c
>> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
>> *dev)
>>  int result;
>>  struct usb_hcd *hcd;
>>  unsigned int virq;
>> -static u64 dummy_mask = DMA_BIT_MASK(32);
>> +static u64 dummy_mask;
>>  
>>  if (usb_disabled()) {
>>  result = -ENODEV;
>> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
>> *dev)
>>  goto fail_irq;
>>  }
>>  
>> -dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
>> +dummy_mask = DMA_BIT_MASK(32);
>> +dev->core.dma_mask = &dummy_mask;
>> +dma_set_coherent_mask(&dev->core, dummy_mask);
> 
> What is the reason for changing a static initialization to a dynamic
> one?  As far as I can see, the patch touches four lines of code but the
> only real difference is addition of a single line (and removal of a 
> comment).

I thought it would be better if all the setting was done in
one place, that's the only reason.

-Geoff


Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask

2018-07-19 Thread Alan Stern
On Wed, 18 Jul 2018, Geoff Levand wrote:

> Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices.
> 
> Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine.
> 
> Reported-by: Fredrik Noring 
> Signed-off-by: Geoff Levand 
> ---
> Hi Michael,
> 
> This just silences some warnings.  Can you take it through the powerpc
> tree?
> 
> -Geoff
> 
> 
>  drivers/usb/host/ehci-ps3.c | 6 --
>  drivers/usb/host/ohci-ps3.c | 6 --
>  sound/ppc/snd_ps3.c | 5 +
>  3 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
> index 8c733492d8fe..454d8c624a3f 100644
> --- a/drivers/usb/host/ehci-ps3.c
> +++ b/drivers/usb/host/ehci-ps3.c
> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>   int result;
>   struct usb_hcd *hcd;
>   unsigned int virq;
> - static u64 dummy_mask = DMA_BIT_MASK(32);
> + static u64 dummy_mask;
>  
>   if (usb_disabled()) {
>   result = -ENODEV;
> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device 
> *dev)
>   goto fail_irq;
>   }
>  
> - dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> + dummy_mask = DMA_BIT_MASK(32);
> + dev->core.dma_mask = &dummy_mask;
> + dma_set_coherent_mask(&dev->core, dummy_mask);

What is the reason for changing a static initialization to a dynamic
one?  As far as I can see, the patch touches four lines of code but the
only real difference is addition of a single line (and removal of a 
comment).

> diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
> index 20a23d795adf..395f9d3bc849 100644
> --- a/drivers/usb/host/ohci-ps3.c
> +++ b/drivers/usb/host/ohci-ps3.c
> @@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
>   int result;
>   struct usb_hcd *hcd;
>   unsigned int virq;
> - static u64 dummy_mask = DMA_BIT_MASK(32);
> + static u64 dummy_mask;
>  
>   if (usb_disabled()) {
>   result = -ENODEV;
> @@ -115,7 +115,9 @@ static int ps3_ohci_probe(struct ps3_system_bus_device 
> *dev)
>   goto fail_irq;
>   }
>  
> - dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> + dummy_mask = DMA_BIT_MASK(32);
> + dev->core.dma_mask = &dummy_mask;
> + dma_set_coherent_mask(&dev->core, dummy_mask);

Same here.

Alan Stern



[PATCH] powerpc/mm: Don't report PUDs as memory leaks when using kmemleak

2018-07-19 Thread Michael Ellerman
Paul Menzel reported that kmemleak was producing reports such as:

  unreferenced object 0xc000f8b8 (size 16384):
comm "init", pid 1, jiffies 4294937416 (age 312.240s)
hex dump (first 32 bytes):
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
backtrace:
  [] __pud_alloc+0x80/0x190
  [<87f2e8a3>] move_page_tables+0xbac/0xdc0
  [<091e51c2>] shift_arg_pages+0xc0/0x210
  [] setup_arg_pages+0x22c/0x2a0
  [<60871529>] load_elf_binary+0x41c/0x1648
  [] search_binary_handler.part.11+0xbc/0x280
  [<34e0cdd7>] __do_execve_file.isra.13+0x73c/0x940
  [<5f953a6e>] sys_execve+0x58/0x70
  [<9700a858>] system_call+0x5c/0x70

Indicating that a PUD was being leaked.

However what's really happening is that kmemleak is not able to
recognise the references from the PGD to the PUD, because they are not
fully qualified pointers.

We can confirm that in xmon, eg:

Find the task struct for pid 1 "init":
  0:mon> P
   task_struct ->thread.ksp    PID   PPID S  P CMD
  c001fe7c c001fe803960  1  0 S 13 systemd

Dump virtual address 0 to find the PGD:
  0:mon> dv 0 c001fe7c
  pgd  @ 0xc000f8b01000

Dump the memory of the PGD:
  0:mon> d c000f8b01000
  c000f8b01000 f8b9   ||
  c000f8b01010    ||
  c000f8b01020    ||
  c000f8b01030  f8b8  ||


There we can see the reference to our supposedly leaked PUD. But
because it's missing the leading 0xc, kmemleak won't recognise it.

We can confirm it's still in use by translating an address that is
mapped via it:
  0:mon> dv 7fff9400 c001fe7c
  pgd  @ 0xc000f8b01000
  pgdp @ 0xc000f8b01038 = 0xf8b8 <--
  pudp @ 0xc000f8b81ff8 = 0x037c4000
  pmdp @ 0xc37c5ca0 = 0xfbd89000
  ptep @ 0xc000fbd89000 = 0xc081d5ce0386
  Maps physical address = 0x0001d5ce
  Flags = Accessed Dirty Read Write

The fix is fairly simple. We need to tell kmemleak to ignore PUD
allocations and never report them as leaks. We can also tell it not to
scan the PGD, because it will never find pointers in there. However it
will still notice if we allocate a PGD and then leak it.

Reported-by: Paul Menzel 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 01ee40f11f3a..76234a14b97d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 
 struct vmemmap_backing {
@@ -82,6 +83,13 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
pgd = kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE),
   pgtable_gfp_flags(mm, GFP_KERNEL));
+   /*
+* Don't scan the PGD for pointers, it contains references to PUDs but
+* those references are not full pointers and so can't be recognised by
+* kmemleak.
+*/
+   kmemleak_no_scan(pgd);
+
/*
 * With hugetlb, we don't clear the second half of the page table.
 * If we share the same slab cache with the pmd or pud level table,
@@ -110,8 +118,19 @@ static inline void pgd_populate(struct mm_struct *mm, 
pgd_t *pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
-   pgtable_gfp_flags(mm, GFP_KERNEL));
+   pud_t *pud;
+
+   pud = kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
+  pgtable_gfp_flags(mm, GFP_KERNEL));
+   /*
+* Tell kmemleak to ignore the PUD, that means don't scan it for
+* pointers and don't consider it a leak. PUDs are typically only
+* referred to by their PGD, but kmemleak is not able to recognise those
+* as pointers, leading to false leak reports.
+*/
+   kmemleak_ignore(pud);
+
+   return pud;
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
-- 
2.14.1



Re: [powerpc/powervm]Oops: Kernel access of bad area, sig: 11 [#1] while running stress-ng

2018-07-19 Thread Brian Foster
cc linux-xfs

On Thu, Jul 19, 2018 at 11:47:59AM +0530, vrbagal1 wrote:
> On 2018-07-10 19:12, Michael Ellerman wrote:
> > vrbagal1  writes:
> > 
> > > On 2018-07-10 13:37, Nicholas Piggin wrote:
> > > > On Tue, 10 Jul 2018 11:58:40 +0530
> > > > vrbagal1  wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > Observing kernel oops on Power9(ZZ) box, running on PowerVM, while
> > > > > running stress-ng.
> > > > > 
> > > > > 
> > > > > Kernel: 4.18.0-rc4
> > > > > Machine: Power9 ZZ (PowerVM)
> > > > > Test: Stress-ng
> > > > > 
> > > > > Attached is .config file
> > > > > 
> > > > > Traces:
> > > > > 
> > > > >   [12251.245209] Oops: Kernel access of bad area, sig: 11 [#1]
> > > > 
> > > > Can you post the lines above this? Otherwise we don't know what
> > > > address
> > > > it tried to access (without decoding the instructions and
> > > > reconstructing
> > > > it from registers at least, which the XFS devs wouldn't be
> > > > inclined to
> > > > do).
> > > > 
> > > 
> > > ah my bad.
> > > 
> > >   [12251.245179] Unable to handle kernel paging request for data at
> > > address 0x60006000
> > >   [12251.245199] Faulting instruction address: 0xc0319e2c
> > 
> > Which matches the regs & disassembly:
> > 
> > r4 = 60006000
> > r9 = 0
> > ldx r9,r4,r9<- pop
> > 
> > So object was 0x60006000.
> > 
> > That looks like two nops, ie. we got some code?
> > 
> > And there's only one caller of prefetch_freepointer() in
> > slab_alloc_node():
> > 
> > prefetch_freepointer(s, next_object);
> > 
> > 
> > So slab corruption is looking likely.
> > 
> > Do you have slub_debug=FZP on the kernel command line?
> 
> I tried to reproduce this, but didnt succeed. But instead I got xfs
> assertion.
> 
> Kernel: 4.18.0-rc5
> Tree Head: 30b06abfb92bfd5f9b63ea6a2ffb0bd905d1a6da
> 
> Traces:
> 
>  [12290.993612] XFS: Assertion failed: flags & XFS_BMAPI_COWFORK, file:
> fs/xfs/libxfs/xfs_bmap.c, line: 4370

This usually means we have a writeback of a page with delayed allocation
buffers over an extent map that reflects a hole in the file. This should
never happen for normal (non-reflink) data fork writes, hence the assert
failure and -EIO return.

We used to actually (accidentally) allocate a new block in this case,
but more recent kernels generate the error. More interestingly, I think
the pending iomap + writeback rework in XFS may simply drop this error
on the floor because we refer to extent state to process the page rather
than the other way around (and buffer_delay() isn't a state in the iomap
bits). This of course depends on which state is actually valid, the
buffer or extent map, which we don't know for sure.

Can you reliably reproduce this problem? If so, can you describe the
reproducer? Also, what is the filesystem geometry ('xfs_info ')?
And since this is a power kernel, I'm guessing you have 64k pages as
well..?

Brian

>  [12290.993668] [ cut here ]
>  [12290.993672] kernel BUG at fs/xfs/xfs_message.c:102!
>  [12290.993676] Oops: Exception in kernel mode, sig: 5 [#1]
>  [12290.993678] LE SMP NR_CPUS=2048 NUMA pSeries
>  [12290.993697] Dumping ftrace buffer:
>  [12290.993730](ftrace buffer empty)
>  [12290.993735] Modules linked in: camellia_generic(E) cast6_generic(E)
> cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) kvm(E)
> snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E)
> soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E)
> vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E)
> vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) tun(E)
> tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E)
> vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) md4(E)
> unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) fuse(E)
> vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E)
> raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) ip6t_rpfilter(E)
> ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E)
>  [12290.993784]  nf_reject_ipv6(E) xt_conntrack(E) ip_set(E) nfnetlink(E)
> ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ip6table_nat(E)
> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) ip6table_mangle(E)
> ip6table_security(E) ip6table_raw(E) iptable_nat(E) nf_conntrack_ipv4(E)
> nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) iptable_mangle(E)
> iptable_security(E) iptable_raw(E) ebtable_filter(E) ebtables(E)
> ip6table_filter(E) ip6_tables(E) iptable_filter(E) xts(E) sg(E)
> vmx_crypto(E) pseries_rng(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E)
> ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) lpfc(E) mlx5_core(E)
> nvmet_fc(E) nvmet(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E)
> scsi_transport_fc(E) mlxfw(E) devlink(E) dm_mirror(E) dm_region_hash(E)
> dm_log(E) dm_mod(E) [last unloaded: torture]
>  [12290.993834] CPU: 2 PID: 433 Comm: kswapd1 Tainted: GE
> 4.18.0-rc5-autotest-autotest #1
>  [1

Re: [PATCH v6 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

2018-07-19 Thread Guenter Roeck

On 07/18/2018 11:59 PM, Stewart Smith wrote:

Shilpasri G Bhat  writes:

The On-Chip Controller (OCC) is an embedded micro-processor in the POWER9
chip which measures various system- and chip-level sensors. These sensors
comprise environmental sensors (like power, temperature, current and
voltage) and performance sensors (like utilization and frequency). All these
sensors are copied to main memory at a regular interval of 100ms. The OCC
provides a way to select a group of sensors to be copied to main memory, to
increase the update frequency of the selected sensor groups. When a sensor
group is disabled, the OCC will not copy it to main memory and those sensors
read 0.


OCC is an implementation detail rather than a core part of this firmware
API.

Why not something like this:

OPAL firmware provides the facility for some groups of sensors to be
enabled/disabled at runtime to give the user the option of using the
system resources for collecting these sensors or not.

For example, on POWER9 systems, the On Chip Controller (OCC) gathers
various system and chip level sensors and maintains their values in main
memory.



+static int init_sensor_group_data(struct platform_device *pdev,
+ struct platform_data *pdata)
+{
+   struct sensor_group_data *sgrp_data;
+   struct device_node *groups, *sgrp;
+   enum sensors type;
+   int count = 0, ret = 0;
+
+   groups = of_find_node_by_path("/ibm,opal/sensor-groups");
+   if (!groups)
+   return ret;


Why not look for the compatible property?




For both, I don't really care either way. Can you folks come to an agreement
and let me know once you have decided?
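
For reference, matching on a compatible string instead of the fixed path would
look roughly like this (the "ibm,opal-sensor-group" string is an assumption
for illustration):

	groups = of_find_compatible_node(NULL, NULL, "ibm,opal-sensor-group");
	if (!groups)
		return ret;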

Thanks,
Guenter


Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.

2018-07-19 Thread Michael Ellerman
Michal Suchánek  writes:
> On Tue, 3 Jul 2018 08:08:14 +1000
> "Nicholas Piggin"  wrote: >> On Mon, 02 Jul 2018 11:17:06 
> +0530
>> Mahesh J Salgaonkar  wrote:
>> > From: Mahesh Salgaonkar 
>> > diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> > index efdd16a79075..221271c96a57 100644
>> > --- a/arch/powerpc/kernel/mce.c
>> > +++ b/arch/powerpc/kernel/mce.c
>> > @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs)
>> >  {
>> >long handled = 0;
>> >  
>> > -  __this_cpu_inc(irq_stat.mce_exceptions);
>> > +  /*
>> > +   * For pSeries we count mce when we go into virtual mode
>> > machine
>> > +   * check handler. Hence skip it. Also, We can't access per
>> > cpu
>> > +   * variables in real mode for LPAR.
>> > +   */
>> > +  if (early_cpu_has_feature(CPU_FTR_HVMODE))
>> > +  __this_cpu_inc(irq_stat.mce_exceptions);
>> >  
>> > -  if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>> > +  /*
>> > +   * See if platform is capable of handling machine check.
>> > +   * Otherwise fallthrough and allow CPU to handle this
>> > machine check.
>> > +   */
>> > +  if (ppc_md.machine_check_early)
>> > +  handled = ppc_md.machine_check_early(regs);
>> > +  else if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
>> >handled =
>> > cur_cpu_spec->machine_check_early(regs);  
>> 
>> Would be good to add a powernv ppc_md handler which does the
>> cur_cpu_spec->machine_check_early() call now that other platforms are
>> calling this code. Because those aren't valid as a fallback call, but
>> specific to powernv.
>> 
>
> Something like this (untested)?
>
> Subject: [PATCH] powerpc/powernv: define platform MCE handler.

LGTM.

cheers
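
For readers following along, a hypothetical sketch of what such a powernv hook
could look like (this is not the untested patch referred to above):

static long pnv_machine_check_early(struct pt_regs *regs)
{
	long handled = 0;

	if (cur_cpu_spec && cur_cpu_spec->machine_check_early)
		handled = cur_cpu_spec->machine_check_early(regs);

	return handled;
}

	/* ...and in define_machine(powernv): */
	.machine_check_early	= pnv_machine_check_early,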


Re: [PATCH] powerpc/msi: Remove VLA usage

2018-07-19 Thread Michael Ellerman
Kees Cook  writes:

> On Fri, Jun 29, 2018 at 11:52 AM, Kees Cook  wrote:
>> In the quest to remove all stack VLA usage from the kernel[1], this
>> switches from an unchanging variable to a constant expression to eliminate
>> the VLA generation.
>>
>> [1] 
>> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
>>
>> Signed-off-by: Kees Cook 
>
> Friendly ping! Michael, can you take this?

Yep, sorry. I actually applied it weeks ago but hadn't pushed.

cheers


Re: [PATCH v4 4/6] powerpc/fsl: Enable cpu vulnerabilities reporting for NXP PPC BOOK3E

2018-07-19 Thread Michael Ellerman
LEROY Christophe  writes:
> Diana Madalina Craciun  a écrit :
>> On 7/17/2018 7:47 PM, LEROY Christophe wrote:
>>> Diana Craciun  a écrit :
 The NXP PPC Book3E platforms are not vulnerable to meltdown and
 Spectre v4, so make them PPC_BOOK3S_64 specific.

 Signed-off-by: Diana Craciun 
 ---
 History:

 v2-->v3
 - used the existing functions for spectre v1/v2

  arch/powerpc/Kconfig   | 7 ++-
  arch/powerpc/kernel/security.c | 2 ++
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
 index 9f2b75f..116c953 100644
 --- a/arch/powerpc/Kconfig
 +++ b/arch/powerpc/Kconfig
 @@ -165,7 +165,7 @@ config PPC
select GENERIC_CLOCKEVENTS_BROADCASTif SMP
select GENERIC_CMOS_UPDATE
select GENERIC_CPU_AUTOPROBE
 -  select GENERIC_CPU_VULNERABILITIES  if PPC_BOOK3S_64
 +  select GENERIC_CPU_VULNERABILITIES  if PPC_NOSPEC
>>> I don't understand.  You say this patch is to make something specific
>>> to book3s64 specific, and you are creating a new config param that
>>> make things less specific
>>>
>>> Christophe
>>
>> In order to enable the vulnerabilities reporting on NXP socs I need to
>> enable them for PPC_FSL_BOOK3E. So they will be enabled for both
>> PPC_FSL_BOOK3E and PPC_BOOK3S_64. This is the reason for adding the
>> Kconfig. However this will enable: spectre v1/v2 and meltdown. NXP socs
>> are not vulnerable to meltdown, so I made the meltdown reporting
>> PPC_BOOK3S_64 specific. I guess I can have the PPC_NOSPEC definition in
>> a separate patch to be more clear.
>
> Yes you can. Or keep it as a single patch and add the details you gave  
> me in the patch description.

Yeah I think the patch is fine, but the change log is a bit short on detail.

If you just send me a new change log I can fold it in.

cheers


Re: [PATCH] powerpc/prom_init: remove linux,stdout-package property

2018-07-19 Thread Michael Ellerman
Murilo Opsfelder Araujo  writes:
> On Wed, Jul 18, 2018 at 07:37:37PM +1000, Michael Ellerman wrote:
>> Murilo Opsfelder Araujo  writes:
>> > This property was added in 2004 by
>> >
>> > 
>> > https://github.com/mpe/linux-fullhistory/commit/689fe5072fe9a0dec914bfa4fa60aed1e54563e6
>> >
>> > and the only use of it, which was already inside `#if 0`, was removed a 
>> > month
>> > later by
>> >
>> > 
>> > https://github.com/mpe/linux-fullhistory/commit/1fbe5a6d90f6cd4ea610737ef488719d1a875de7
>> >
>> > Fixes: https://github.com/linuxppc/linux/issues/125
>> 
>> That is going to confuse some scripts that are expecting that to be a
>> "Fixes: " tag :)
>> 
>> The proper tag to use there would be "Link:".
>> 
>> But, I'd prefer we didn't add github issue links to the history, as I'm
>> not sure they won't bit-rot eventually. Not because I'm a anti-Microsoft
>> conspiracy person but just because it's a repo I set up and manage and
>> there's no long term plan for it necessarily.
>> 
>> > ---
>> >  arch/powerpc/kernel/prom_init.c | 2 --
>> >  1 file changed, 2 deletions(-)
>> 
>> Including the link here would be ideal, as it means it doesn't end up in
>> the commit history, but it does end up in the mail archive. So if we
>> ever really need to find it, it should be there.
>> 
>> cheers
>
> Hi, Michael.
>
> Thanks for reviewing it.  I've sent v2 with your suggestions:
>
> https://lkml.kernel.org/r/20180718161544.12134-1-muri...@linux.ibm.com

Thanks.

I had actually already applied the first version, but I didn't say that
in my email did I :}

So I've rebased and applied your v2, thanks.

cheers


Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask

2018-07-19 Thread Michael Ellerman
Greg KH  writes:

> On Wed, Jul 18, 2018 at 03:08:33PM -0700, Geoff Levand wrote:
>> Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices.
>> 
>> Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine.
>> 
>> Reported-by: Fredrik Noring 
>> Signed-off-by: Geoff Levand 
>> ---
>> Hi Michael,
>> 
>> This just silences some warnings.  Can you take it through the powerpc
>> tree?
>
> Oops, nevermind, it should go through the ppc tree.  Here's my ack:
>
> Acked-by: Greg Kroah-Hartman 

OK thanks, I'll take it.

cheers


Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

2018-07-19 Thread Michal Hocko
On Wed 18-07-18 21:52:17, Mahesh Jagannath Salgaonkar wrote:
> On 07/17/2018 05:22 PM, Michal Hocko wrote:
> > On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote:
> >> On 07/16/2018 01:56 PM, Michal Hocko wrote:
> >>> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote:
>  One of the primary issues with Firmware Assisted Dump (fadump) on Power
>  is that it needs a large amount of memory to be reserved. This reserved
>  memory is used for saving the contents of the old crashed kernel's memory
>  before the fadump capture kernel uses the old kernel's memory area to boot.
>  However, this reserved memory area stays unused until a system crash and
>  isn't available for the production kernel to use.
> >>>
> >>> How much memory are we talking about. Regular kernel dump process needs
> >>> some reserved memory as well. Why that is not a big problem?
> >>
> >> We reserve around 5% of total system RAM. On large systems with
> >> TeraBytes of memory, this reservation can be quite significant.
> >>
> >> The regular kernel dump uses the kexec method to boot into the capture
> >> kernel and can control the parameters that are passed to it. This gives
> >> the capability to strip down the parameters, which helps lower the memory
> >> requirement for the capture kernel to boot, and allows regular kdump to
> >> reserve less memory to start with.
> >>
> >> Whereas fadump depends on power firmware (pHyp) to load the capture
> >> kernel after a full reset, and it boots like a regular kernel. It needs
> >> the same amount of memory to boot as the production kernel. On large
> >> systems the production kernel needs a significant amount of memory to
> >> boot, hence fadump needs to reserve enough memory for the capture kernel
> >> to boot successfully and execute the dump capturing operations. By
> >> default fadump reserves 5% of total system RAM, and in most cases this
> >> has worked flawlessly on a variety of system configurations. Optionally,
> >> 'crashkernel=X' can also be used to specify a more fine-tuned memory
> >> size for the reservation.
> > 
> > So why do we even care about fadump when regular kexec provides
> > (presumably) same functionality with a smaller memory footprint? Or is
> > there any reason why kexec doesn't work well on ppc?
> 
> Kexec-based kdump is loaded by the crashing kernel. When the OS crashes,
> the system is in an inconsistent state, especially the devices. In some
> cases, a rogue DMA or an ill-behaving device driver can cause the kdump
> capture to fail.
> 
> On the Power platform, fadump solves these issues with help from power
> firmware: it fully resets the system and loads a fresh copy of the same
> kernel to capture the dump, with PCI and I/O devices reinitialized, making
> it more reliable.

Thanks for the clarification.

> Fadump does a full system reset, booting the system through the regular
> boot path, i.e. the dump capture kernel is booted in the same fashion and
> doesn't have a specialized kernel command line. This implies we need to
> give it more memory for the system boot. Since the new kernel boots from
> the same memory location as the crashed kernel, we reserve 5% of memory,
> to which power firmware moves the crashed kernel's memory content. This
> reserved memory is completely removed from the available memory. For
> large memory systems like 64TB systems, this amounts to ~3TB, which is a
> significant chunk of memory the production kernel is deprived of. Hence
> this patch adds an improvement to the existing fadump feature to make the
> reserved memory available to the system, using ZONE_MOVABLE.

Is the 5% a reasonable estimate or more a ballpark number? I find it a
bit strange to require 3TB of memory to boot a kernel just to dump the
crashed kernel image. Shouldn't you rather look into this estimate than
spreading ZONE_MOVABLE abuse? Larger systems need more memory to dump
even with the regular kexec kdump but I have never seen any to use more
than 1G or something like that.
-- 
Michal Hocko
SUSE Labs


Re: Improvements for the PS3

2018-07-19 Thread Geert Uytterhoeven
Hi Geoff, Fredrik,

On Thu, Jul 19, 2018 at 12:40 AM Geoff Levand  wrote:
> On 07/14/2018 09:49 AM, Fredrik Noring wrote:
> > so I added a sleep with
> >
> > + msleep(1);

I can't see where you added the sleep, but 10s seems excessive.
If the real reason is the need to wait for an interrupt for ps3fb_sync_image(),
then waiting for 40 ms should be sufficient? Or am I missing something?

> > +
> >   return 0;
> >
> > et voilà, the screen came alive and the kernel panic was revealed! It seems
> > the kernel panics so fast that the PS3 frame buffer is unprepared. This is,
> > of course, very unfortunate because trying to debug the boot process without
> > a screen or any other means of obtaining console text is quite difficult.
>
> We could add a fixed delay there, but I'd like to avoid waiting that
> long on every boot.  Why don't you add a kernel module_param named
> something like ps3fb_delay that takes a value in milliseconds and a
> default of zero.  See:
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/ps3fb.c?h=v4.17#n260
>
> > I suppose the problem is that it relies on interrupts for ps3fb_sync_image
> > to regularly copy the image, hence without them the screen isn't updated to
> > show kernel panics, etc. Perhaps one way to fix that is to implement the
> > struct fb_tile_ops API, so that the console is synchronously updated? Would
> > that be acceptable?
>
> I'm not sure if that would work or not.   Maybe Geert is more familiar with 
> it.

That sounds like a complex solution, slowing down the console a lot.

What about letting ps3fb register a panic notifier to sync the screen, like
hyperv_fb does?

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds