Re: [RESEND][PATCH] powerpc/powernv: Save/Restore SPRG3 on entry/exit from stop.
Michael Neuling writes:
> On Fri, 2018-07-20 at 12:32 +1000, Michael Ellerman wrote:
>> Michael Neuling writes:
>> > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
>> > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
>> > > >
>> > > > > 	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
>> > > > > diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
>> > > > > index d85d551..5069d42 100644
>> > > > > --- a/arch/powerpc/kernel/idle_book3s.S
>> > > > > +++ b/arch/powerpc/kernel/idle_book3s.S
>> > > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
>> > > > > 	mfspr	r4, SPRN_MMCR2
>> > > > > 	std	r3, STOP_MMCR1(r13)
>> > > > > 	std	r4, STOP_MMCR2(r13)
>> > > > > +
>> > > > > +	mfspr	r3, SPRN_SPRG3
>> > > > > +	std	r3, STOP_SPRG3(r13)
>> > > >
>> > > > We don't need to save it. Just restore it from paca->sprg_vdso which
>> > > > should never change.
>> > >
>> > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
>> > >
>> > > >
>> > > > How can we do better at catching these missing SPRGs?
>> > >
>> > > We can go through the list of SPRs from the POWER9 User Manual and
>> > > document explicitly why we don't have to save/restore certain SPRs
>> > > during the execution of the stop instruction. Does this sound ok?
>> > >
>> > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual, accessible from
>> > > https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)
>> >
>> > I was thinking of a boot time test case built into Linux. Linux has some
>> > boot time test cases which you can enable via CONFIG options.
>> >
>> > Firstly you could see if an SPR exists using the same trick xmon does in
>> > dump_one_spr(). Then once you have a list of usable SPRs, you could write
>> > all the known ones (I assume you'd have to leave out some, like the
>> > PSSCR), then set
>>
>> Write what value?
>>
>> Ideally you want to write a random bit pattern to reduce the chance
>> that only some bits are being restored.
>
> The xmon dump_one_spr() trick tries to work around that by writing one random
> value and then a different one to see if it really is a nop.
>
>> But you can't do that because writing a value to an SPR has an effect.
>
> Sure that's a concern but xmon seems to get away with it.

I don't think it writes, but maybe I'm reading the code wrong.

Writing a random value to the MSR could be fun :)

>> Some of them might even need to be zero, in which case you can't really
>> distinguish that from a non-restored zero.
>
> It doesn't need to be perfect. It just needs to catch more than we have now.

Sure.

>> > the appropriate stop level, make sure you got into that stop level, and
>> > then see if that register was changed. Then you'd have an automated list
>> > of registers you need to make sure you save/restore at each stop level.
>> >
>> > Could something like that work?
>>
>> Maybe.
>>
>> Ignoring the problem of whether you can write a meaningful value to some
>> of the SPRs, I'm not entirely convinced it's going to work. But maybe
>> I'm wrong.
>
> Yeah, I'm not convinced it'll work either but it would be a nice piece of
> test infrastructure to have if it does work.

Yeah, I guess I'd rather we worked on 1) and 2) below first :)

> We'd still need to marry up the SPR numbers we get from the test to what's
> actually being restored in Linux.
>
>> But there's a much simpler solution: we should 1) have a selftest for
>> getcpu() and 2) we should be running the glibc (I think?) test suite
>> that found this in the first place. It's frankly embarrassing that we
>> didn't find this.
>
> Yeah, we should do that also, but how do we catch the next SPR we are
> missing. I'd like some systematic way of doing that rather than
> wack-a-mole.

Whack-a-mole.

We could also improve things by documenting how each SPR is handled, eg.
is it saved/restored across idle, syscall, KVM etc. And possibly that
could even become code that defines how SPRs are handled, rather than it
all being done ad-hoc.

cheers
[PATCH v7 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
OPAL firmware provides the facility for some groups of sensors to be enabled/disabled at runtime to give the user the option of using the system resources for collecting these sensors or not. For example, on POWER9 systems, the On Chip Controller (OCC) gathers various system and chip level sensors and maintains their values in main memory. This patch provides support for enabling/disabling the sensor groups like power, temperature, current and voltage. Signed-off-by: Shilpasri G Bhat [stew...@linux.vnet.ibm.com: Commit message] --- Changes from v6: - Updated the commit message as per Stewart's suggestion - Use the compatible DT property instead of the path to find the node Documentation/hwmon/ibmpowernv | 43 ++- drivers/hwmon/ibmpowernv.c | 249 +++-- 2 files changed, 258 insertions(+), 34 deletions(-) diff --git a/Documentation/hwmon/ibmpowernv b/Documentation/hwmon/ibmpowernv index 8826ba2..5646825 100644 --- a/Documentation/hwmon/ibmpowernv +++ b/Documentation/hwmon/ibmpowernv @@ -33,9 +33,48 @@ fanX_input Measured RPM value. fanX_min Threshold RPM for alert generation. fanX_fault 0: No fail condition 1: Failing fan + tempX_inputMeasured ambient temperature. tempX_max Threshold ambient temperature for alert generation. -inX_input Measured power supply voltage +tempX_highest Historical maximum temperature +tempX_lowest Historical minimum temperature +tempX_enable Enable/disable all temperature sensors belonging to the + sub-group. In POWER9, this attribute corresponds to + each OCC. Using this attribute each OCC can be asked to + disable/enable all of its temperature sensors. + 1: Enable + 0: Disable + +inX_input Measured power supply voltage (millivolt) inX_fault 0: No fail condition. 1: Failing power supply. -power1_input System power consumption (microWatt) +inX_highestHistorical maximum voltage +inX_lowest Historical minimum voltage +inX_enable Enable/disable all voltage sensors belonging to the + sub-group. In POWER9, this attribute corresponds to + each OCC. 
Using this attribute each OCC can be asked to + disable/enable all of its voltage sensors. + 1: Enable + 0: Disable + +powerX_input Power consumption (microWatt) +powerX_input_highest Historical maximum power +powerX_input_lowestHistorical minimum power +powerX_enable Enable/disable all power sensors belonging to the + sub-group. In POWER9, this attribute corresponds to + each OCC. Using this attribute each OCC can be asked to + disable/enable all of its power sensors. + 1: Enable + 0: Disable + +currX_inputMeasured current (milliampere) +currX_highest Historical maximum current +currX_lowest Historical minimum current +currX_enable Enable/disable all current sensors belonging to the + sub-group. In POWER9, this attribute corresponds to + each OCC. Using this attribute each OCC can be asked to + disable/enable all of its current sensors. + 1: Enable + 0: Disable + +energyX_input Cumulative energy (microJoule) diff --git a/drivers/hwmon/ibmpowernv.c b/drivers/hwmon/ibmpowernv.c index f829dad..99afbf7 100644 --- a/drivers/hwmon/ibmpowernv.c +++ b/drivers/hwmon/ibmpowernv.c @@ -90,11 +90,20 @@ struct sensor_data { char label[MAX_LABEL_LEN]; char name[MAX_ATTR_LEN]; struct device_attribute dev_attr; + struct sensor_group_data *sgrp_data; +}; + +struct sensor_group_data { + struct mutex mutex; + u32 gid; + bool enable; }; struct platform_data { const struct attribute_group *attr_groups[MAX_SENSOR_TYPE + 1]; + struct sensor_group_data *sgrp_data; u32 sensors_count; /* Total count of sensors from each group */ + u32 nr_sensor_groups; /* Total number of sensor groups */ }; static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr, @@ -105,6 +114,9 @@ static ssize_t show_sensor(struct device *dev, struct device_attribute *devattr, ssize_t ret; u64 x; + if (sdata->sgrp_data && !sdata->sgrp_data->enable) + return -ENODATA; + ret = opal_get_sensor_data_u64(sdata->id, &x); if (ret) @@ -120,6 +132,46 @@ static ssize_t show_sensor(struct device *dev, struct 
device_attribute *devattr, return sprintf(buf, "%ll
[PATCH v7 1/2] powernv:opal-sensor-groups: Add support to enable sensor groups
Adds support to enable/disable a sensor group at runtime. This can be used to select the sensor groups that needs to be copied to main memory by OCC. Sensor groups like power, temperature, current, voltage, frequency, utilization can be enabled/disabled at runtime. Signed-off-by: Shilpasri G Bhat --- arch/powerpc/include/asm/opal-api.h| 1 + arch/powerpc/include/asm/opal.h| 2 ++ .../powerpc/platforms/powernv/opal-sensor-groups.c | 28 ++ arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + 4 files changed, 32 insertions(+) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 3bab299..56a94a1 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -206,6 +206,7 @@ #define OPAL_NPU_SPA_CLEAR_CACHE 160 #define OPAL_NPU_TL_SET161 #define OPAL_SENSOR_READ_U64 162 +#define OPAL_SENSOR_GROUP_ENABLE 163 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165 #define OPAL_LAST 165 diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index e1b2910..fc0550e 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -292,6 +292,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t address, int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr); int opal_set_power_shift_ratio(u32 handle, int token, u32 psr); int opal_sensor_group_clear(u32 group_hndl, int token); +int opal_sensor_group_enable(u32 group_hndl, int token, bool enable); s64 opal_signal_system_reset(s32 cpu); s64 opal_quiesce(u64 shutdown_type, s32 cpu); @@ -326,6 +327,7 @@ extern int opal_async_wait_response_interruptible(uint64_t token, struct opal_msg *msg); extern int opal_get_sensor_data(u32 sensor_hndl, u32 *sensor_data); extern int opal_get_sensor_data_u64(u32 sensor_hndl, u64 *sensor_data); +extern int sensor_group_enable(u32 grp_hndl, bool enable); struct rtc_time; extern time64_t opal_get_boot_time(void); diff --git 
a/arch/powerpc/platforms/powernv/opal-sensor-groups.c b/arch/powerpc/platforms/powernv/opal-sensor-groups.c index 541c9ea..f7d04b6 100644 --- a/arch/powerpc/platforms/powernv/opal-sensor-groups.c +++ b/arch/powerpc/platforms/powernv/opal-sensor-groups.c @@ -32,6 +32,34 @@ struct sg_attr { struct sg_attr *sgattrs; } *sgs; +int sensor_group_enable(u32 handle, bool enable) +{ + struct opal_msg msg; + int token, ret; + + token = opal_async_get_token_interruptible(); + if (token < 0) + return token; + + ret = opal_sensor_group_enable(handle, token, enable); + if (ret == OPAL_ASYNC_COMPLETION) { + ret = opal_async_wait_response(token, &msg); + if (ret) { + pr_devel("Failed to wait for the async response\n"); + ret = -EIO; + goto out; + } + ret = opal_error_code(opal_get_async_rc(msg)); + } else { + ret = opal_error_code(ret); + } + +out: + opal_async_release_token(token); + return ret; +} +EXPORT_SYMBOL_GPL(sensor_group_enable); + static ssize_t sg_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) { diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index a8d9b40..8268a1e 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -327,3 +327,4 @@ OPAL_CALL(opal_npu_tl_set, OPAL_NPU_TL_SET); OPAL_CALL(opal_pci_get_pbcq_tunnel_bar, OPAL_PCI_GET_PBCQ_TUNNEL_BAR); OPAL_CALL(opal_pci_set_pbcq_tunnel_bar, OPAL_PCI_SET_PBCQ_TUNNEL_BAR); OPAL_CALL(opal_sensor_read_u64,OPAL_SENSOR_READ_U64); +OPAL_CALL(opal_sensor_group_enable,OPAL_SENSOR_GROUP_ENABLE); -- 1.8.3.1
[PATCH v7 0/2] hwmon/powernv: Add attributes to enable/disable sensors
This patch series adds a new attribute to enable or disable a sensor at
runtime.

Changes from v6:
- Updated the commit message as per Stewart's suggestion
- Use the compatible DT property instead of the path to find the node

v6 : https://lkml.org/lkml/2018/7/18/806
v5 : https://lkml.org/lkml/2018/7/15/15
v4 : https://lkml.org/lkml/2018/7/6/379
v3 : https://lkml.org/lkml/2018/7/5/476
v2 : https://lkml.org/lkml/2018/7/4/263
v1 : https://lkml.org/lkml/2018/3/22/214

Shilpasri G Bhat (2):
  powernv:opal-sensor-groups: Add support to enable sensor groups
  hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

 Documentation/hwmon/ibmpowernv                     |  43 +++-
 arch/powerpc/include/asm/opal-api.h                |   1 +
 arch/powerpc/include/asm/opal.h                    |   2 +
 .../powerpc/platforms/powernv/opal-sensor-groups.c |  28 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
 drivers/hwmon/ibmpowernv.c                         | 249 ++---

 6 files changed, 290 insertions(+), 34 deletions(-)

-- 
1.8.3.1
Re: [powerpc/powervm] Oops: Kernel access of bad area, sig: 11 [#1] while running stress-ng
On 2018-07-19 19:03, Brian Foster wrote: cc linux-xfs On Thu, Jul 19, 2018 at 11:47:59AM +0530, vrbagal1 wrote: On 2018-07-10 19:12, Michael Ellerman wrote: > vrbagal1 writes: > > > On 2018-07-10 13:37, Nicholas Piggin wrote: > > > On Tue, 10 Jul 2018 11:58:40 +0530 > > > vrbagal1 wrote: > > > > > > > Hi, > > > > > > > > Observing kernel oops on Power9(ZZ) box, running on PowerVM, while > > > > running stress-ng. > > > > > > > > > > > > Kernel: 4.18.0-rc4 > > > > Machine: Power9 ZZ (PowerVM) > > > > Test: Stress-ng > > > > > > > > Attached is .config file > > > > > > > > Traces: > > > > > > > > [12251.245209] Oops: Kernel access of bad area, sig: 11 [#1] > > > > > > Can you post the lines above this? Otherwise we don't know what > > > address > > > it tried to access (without decoding the instructions and > > > reconstructing > > > it from registers at least, which the XFS devs wouldn't be > > > inclined to > > > do). > > > > > > > ah my bad. > > > > [12251.245179] Unable to handle kernel paging request for data at > > address 0x60006000 > > [12251.245199] Faulting instruction address: 0xc0319e2c > > Which matches the regs & disassembly: > > r4 = 60006000 > r9 = 0 > ldx r9,r4,r9 <- pop > > So object was 0x60006000. > > That looks like two nops, ie. we got some code? > > And there's only one caller of prefetch_freepointer() in > slab_alloc_node(): > >prefetch_freepointer(s, next_object); > > > So slab corruption is looking likely. > > Do you have slub_debug=FZP on the kernel command line? I tried to reproduce this, but didnt succeed. But instead I got xfs assertion. Kernel: 4.18.0-rc5 Tree Head: 30b06abfb92bfd5f9b63ea6a2ffb0bd905d1a6da Traces: [12290.993612] XFS: Assertion failed: flags & XFS_BMAPI_COWFORK, file: fs/xfs/libxfs/xfs_bmap.c, line: 4370 This usually means we have a writeback of a page with delayed allocation buffers over an extent map that reflects a hole in the file. 
This should never happen for normal (non-reflink) data fork writes, hence
the assert failure and -EIO return. We used to actually (accidentally)
allocate a new block in this case, but more recent kernels generate the
error.

More interestingly, I think the pending iomap + writeback rework in XFS
may simply drop this error on the floor, because we refer to extent state
to process the page rather than the other way around (and buffer_delay()
isn't a state in the iomap bits). This of course depends on which state
is actually valid, the buffer or extent map, which we don't know for
sure.

Can you reliably reproduce this problem? If so, can you describe the
reproducer? Also, what is the filesystem geometry ('xfs_info ')? And
since this is a power kernel, I'm guessing you have 64k pages as well..?

Brian

I tried, but couldn't hit the issue. But this issue was hit running the
stress-ng test case, last time (saw some xfs traces) and this time.
More specifically, I was running:

/bin/avocado run avocado-misc-tests/generic/stress-ng.py -m /root/avocado-fvt-wrapper/tests/avocado-misc-tests/generic/stress-ng.py.data/stress-ng-cpu.yaml

xfs_info o/p:

# xfs_info /dev/sdi2
meta-data=/dev/sdi2              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Yes, 64K pagesize.

Regards,
Venkat.

[12290.993668] [ cut here ]
[12290.993672] kernel BUG at fs/xfs/xfs_message.c:102!
[12290.993676] Oops: Exception in kernel mode, sig: 5 [#1] [12290.993678] LE SMP NR_CPUS=2048 NUMA pSeries [12290.993697] Dumping ftrace buffer: [12290.993730](ftrace buffer empty) [12290.993735] Modules linked in: camellia_generic(E) cast6_generic(E) cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) kvm(E) snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E) soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E) vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E) vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) tun(E) tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E) vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) md4(E) unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) fuse(E) vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) ip6t_rpfilter(E) ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E) [12290.993784] nf_reject_ipv6(E) xt_conntrack(E) ip_set
Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote: > On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski wrote: > > [I added PeterZ and Vitaly -- can you see any way in which this would > > break something obscure? I don't.] Added Nick and Aneesh. We do have HW remote flushes on powerpc. > > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel wrote: > > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals > > > next? > > > > Yes, AFAICS. > > > > > > > > On to the lazy TLB mm_struct refcounting stuff :) > > > > > > > > > > > Which refcount? mm_users shouldn’t be hot, so I assume you’re talking > > > > about > > > > mm_count. My suggestion is to get rid of mm_count instead of trying to > > > > optimize it. > > > > > > > > > Do you have any suggestions on how? :) > > > > > > The TLB shootdown sent at __exit_mm time does not get rid of the > > > kernelthread->active_mm > > > pointer pointing at the mm that is exiting. > > > > > > > Ah, but that's conceptually very easy to fix. Add a #define like > > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that > > #define is set. After some grepping, there are very few users. The > > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > > are involved in the rather complicated dance of refcounting active_mm. > > If that field goes away, it doesn't need to be refcounted. Instead, I > > think the refcounting can get replaced with something like: > > > > /* > > * Release any arch-internal references to mm. Only called when > > mm_users is zero > > * and all tasks using mm have either been switch_mm()'d away or have had > > * enter_lazy_tlb() called. > > */ > > extern void arch_shoot_down_dead_mm(struct mm_struct *mm); > > > > which the kernel calls in __mmput() after tearing down all the page > > tables. The body can be something like: > > > > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { > > /* send an IPI. 
Maybe just call tlb_flush_remove_tables() */ > > } > > > > (You'll also have to fix up the highly questionable users in > > arch/x86/platform/efi/efi_64.c, but that's easy.) > > > > Does all that make sense? Basically, as I understand it, the > > expensive atomic ops you're seeing are all pointless because they're > > enabling an optimization that hasn't actually worked for a long time, > > if ever. > > Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from > arch_exit_mmap(), I think. It's a heavier weight version of more or > less the same thing that arch_shoot_down_dead_mm() would be, except > that it happens before exit_mmap(). But maybe Xen actually has the > right idea. In other words, rather doing the big pagetable free in > exit_mmap() while there may still be other CPUs pointing at the page > tables, the other order might make more sense. So maybe, if > ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible > for getting rid of all secret arch references to the mm. > > Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name. > > I added some more arch maintainers. The idea here is that, on x86 at > least, task->active_mm and all its refcounting is pure overhead. When > a process exits, __mmput() gets called, but the core kernel has a > longstanding "optimization" in which other tasks (kernel threads and > idle tasks) may have ->active_mm pointing at this mm. This is nasty, > complicated, and hurts performance on large systems, since it requires > extra atomic operations whenever a CPU switches between real users > threads and idle/kernel threads. > > It's also almost completely worthless on x86 at least, since __mmput() > frees pagetables, and that operation *already* forces a remote TLB > flush, so we might as well zap all the active_mm references at the > same time. > > But arm64 has real HW remote flushes. Does arm64 actually benefit > from the active_mm optimization? What happens on arm64 when a process > exits? 
How about s390? I suspect that s390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.
>
> (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
> interact with your previous patch set?)
Re: [PATCH v3] PCI: Data corruption happening due to race condition
On Thu, 2018-07-19 at 20:55 +0200, Lukas Wunner wrote:
> On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt
> > Indeed. However I'm not a fan of the solution. Shouldn't we instead have
> > some locking for the content of pci_dev? I've always been wary of us
> > having other similar races in there.
>
> The solution presented is perfectly fine as it uses atomic bitops which
> obviate the need for locking. Why do you want to add unnecessary locking
> on top?

Atomic bitops tend to be *more* expensive than a lock.

My concern is that the PCIe code historically had no locking and I worry
we may have other fields in there with similar issues. But maybe I'm
wrong.

> Certain other parts of struct pci_dev use their own locking, e.g.
> pci_bus_sem to protect bus_list. Most elements can and should
> be accessed lockless for performance.
>
> > > The powerpc PCI code contains a lot of cruft coming from the depth of
> > > history, including rather nasty assumptions. We want to progressively
> > > clean it up, starting with EEH, but it will take time.
>
> Then I suggest using the #include "../../../drivers/pci/pci.h" for now
> until the powerpc arch code has been consolidated.

There's also the need both in powerpc and sparc to access the guts of
pci_dev because those archs will "fabricate" a pci_dev from the
device-tree rather than probing it under some circumstances.

Cheers,
Ben.
[RFC 4/4] virtio: Add platform specific DMA API translation for virtio devices
This adds a hook which a platform can define in order to allow it to override virtio device's DMA OPS irrespective of whether it has the flag VIRTIO_F_IOMMU_PLATFORM set or not. We want to use this to do bounce-buffering of data on the new secure pSeries platform, currently under development, where a KVM host cannot access all of the memory space of a secure KVM guest. The host can only access the pages which the guest has explicitly requested to be shared with the host, thus the virtio implementation in the guest has to copy data to and from shared pages. With this hook, the platform code in the secure guest can force the use of swiotlb for virtio buffers, with a back-end for swiotlb which will use a pool of pre-allocated shared pages. Thus all data being sent or received by virtio devices will be copied through pages which the host has access to. Signed-off-by: Anshuman Khandual --- arch/powerpc/include/asm/dma-mapping.h | 6 ++ arch/powerpc/platforms/pseries/iommu.c | 6 ++ drivers/virtio/virtio.c| 7 +++ 3 files changed, 19 insertions(+) diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h index 8fa3945..bc5a9d3 100644 --- a/arch/powerpc/include/asm/dma-mapping.h +++ b/arch/powerpc/include/asm/dma-mapping.h @@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev); #endif /* __KERNEL__ */ #endif /* _ASM_DMA_MAPPING_H */ + +#define platform_override_dma_ops platform_override_dma_ops + +struct virtio_device; + +extern void platform_override_dma_ops(struct virtio_device *vdev); diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 06f0296..5773bc7 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str) __setup("multitce=", disable_multitce); machine_subsys_initcall_sync(pseries, 
tce_iommu_bus_notifier_init); + +void platform_override_dma_ops(struct virtio_device *vdev) +{ + /* Override vdev->parent.dma_ops if required */ +} diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 6b13987..432c332 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status); const struct dma_map_ops virtio_direct_dma_ops; +#ifndef platform_override_dma_ops +static inline void platform_override_dma_ops(struct virtio_device *vdev) +{ +} +#endif + int virtio_finalize_features(struct virtio_device *dev) { int ret = dev->config->finalize_features(dev); @@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev) if (virtio_has_iommu_quirk(dev)) set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops); + platform_override_dma_ops(dev); if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1)) return 0; -- 2.9.3
[RFC 3/4] virtio: Force virtio core to use DMA API callbacks for all virtio devices
Virtio core should use DMA API callbacks for all virtio devices which may generate either GPA or IOVA depending on the VIRTIO_F_IOMMU_PLATFORM flag and resulting QEMU expectations. This implies that every virtio device needs to have a DMA OPS structure. This removes previous GPA fallback code paths. Signed-off-by: Anshuman Khandual --- drivers/virtio/virtio_ring.c | 65 ++-- 1 file changed, 2 insertions(+), 63 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 814b395..c265964 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -141,26 +141,6 @@ struct vring_virtqueue { * unconditionally on data path. */ -static bool vring_use_dma_api(struct virtio_device *vdev) -{ - if (!virtio_has_iommu_quirk(vdev)) - return true; - - /* Otherwise, we are left to guess. */ - /* -* In theory, it's possible to have a buggy QEMU-supposed -* emulated Q35 IOMMU and Xen enabled at the same time. On -* such a configuration, virtio has never worked and will -* not work without an even larger kludge. Instead, enable -* the DMA API if we're a Xen guest, which at least allows -* all of the sensible Xen configurations to work correctly. 
-*/ - if (xen_domain()) - return true; - - return false; -} - /* * The DMA ops on various arches are rather gnarly right now, and * making all of the arch DMA ops work on the vring device itself @@ -176,9 +156,6 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *sg, enum dma_data_direction direction) { - if (!vring_use_dma_api(vq->vq.vdev)) - return (dma_addr_t)sg_phys(sg); - /* * We can't use dma_map_sg, because we don't use scatterlists in * the way it expects (we don't guarantee that the scatterlist @@ -193,9 +170,6 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, void *cpu_addr, size_t size, enum dma_data_direction direction) { - if (!vring_use_dma_api(vq->vq.vdev)) - return (dma_addr_t)virt_to_phys(cpu_addr); - return dma_map_single(vring_dma_dev(vq), cpu_addr, size, direction); } @@ -205,9 +179,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq, { u16 flags; - if (!vring_use_dma_api(vq->vq.vdev)) - return; - flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); if (flags & VRING_DESC_F_INDIRECT) { @@ -228,9 +199,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq, static int vring_mapping_error(const struct vring_virtqueue *vq, dma_addr_t addr) { - if (!vring_use_dma_api(vq->vq.vdev)) - return 0; - return dma_mapping_error(vring_dma_dev(vq), addr); } @@ -1016,43 +984,14 @@ EXPORT_SYMBOL_GPL(__vring_new_virtqueue); static void *vring_alloc_queue(struct virtio_device *vdev, size_t size, dma_addr_t *dma_handle, gfp_t flag) { - if (vring_use_dma_api(vdev)) { - return dma_alloc_coherent(vdev->dev.parent, size, + return dma_alloc_coherent(vdev->dev.parent, size, dma_handle, flag); - } else { - void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag); - if (queue) { - phys_addr_t phys_addr = virt_to_phys(queue); - *dma_handle = (dma_addr_t)phys_addr; - - /* -* Sanity check: make sure we dind't truncate -* the address. 
The only arches I can find that -* have 64-bit phys_addr_t but 32-bit dma_addr_t -* are certain non-highmem MIPS and x86 -* configurations, but these configurations -* should never allocate physical pages above 32 -* bits, so this is fine. Just in case, throw a -* warning and abort if we end up with an -* unrepresentable address. -*/ - if (WARN_ON_ONCE(*dma_handle != phys_addr)) { - free_pages_exact(queue, PAGE_ALIGN(size)); - return NULL; - } - } - return queue; - } } static void vring_free_queue(struct virtio_device *vdev, size_t size, void *queue, dma_addr_t dma_handle) { - if (vring_use_dma_api(vdev)) { - dma_free_coherent(vdev->dev.parent, size, queue, dma_handle); - } else { -
[RFC 2/4] virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
Now that the virtio core always needs all virtio devices to have DMA OPS,
we need to make sure that the structure it points to is the right one. In
the absence of the VIRTIO_F_IOMMU_PLATFORM flag, QEMU expects GPA from
the guest kernel. In such a case, the virtio device must use the default
virtio_direct_dma_ops DMA OPS structure, which transforms scatter-gather
buffer addresses as GPA. This DMA OPS override must happen as early as
possible during the virtio device initialization sequence, before the
virtio core starts using the given device's DMA OPS callbacks for I/O
transactions. This change detects the device's IOMMU flag and does the
override in case the flag is cleared.

Signed-off-by: Anshuman Khandual
---
 drivers/virtio/virtio.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7907ad3..6b13987 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -166,6 +166,8 @@ void virtio_add_status(struct virtio_device *dev, unsigned int status)
 }
 EXPORT_SYMBOL_GPL(virtio_add_status);
 
+const struct dma_map_ops virtio_direct_dma_ops;
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
 	int ret = dev->config->finalize_features(dev);
@@ -174,6 +176,9 @@ int virtio_finalize_features(struct virtio_device *dev)
 	if (ret)
 		return ret;
 
+	if (virtio_has_iommu_quirk(dev))
+		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+
 	if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
 		return 0;
-- 
2.9.3
[RFC 1/4] virtio: Define virtio_direct_dma_ops structure
Current implementation of DMA API inside virtio core calls device's DMA OPS callback functions when the flag VIRTIO_F_IOMMU_PLATFORM flag is set. But in absence of the flag, virtio core falls back calling basic transformation of the incoming SG addresses as GPA. Going forward virtio should only call DMA API based transformations generating either GPA or IOVA depending on QEMU expectations again based on VIRTIO_F_IOMMU_PLATFORM flag. It requires removing existing fallback code path for GPA transformation and replacing that with a direct map DMA OPS structure. This adds that direct mapping DMA OPS structure to be used in later patches which will make virtio core call DMA API all the time for all virtio devices. Signed-off-by: Anshuman Khandual --- drivers/virtio/virtio.c| 60 ++ drivers/virtio/virtio_pci_common.h | 3 ++ 2 files changed, 63 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index 59e36ef..7907ad3 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -3,6 +3,7 @@ #include #include #include +#include #include /* Unique numbering for virtio devices. */ @@ -442,3 +443,62 @@ core_initcall(virtio_init); module_exit(virtio_exit); MODULE_LICENSE("GPL"); + +/* + * Virtio direct mapping DMA API operations structure + * + * This defines DMA API structure for all virtio devices which would not + * either bring in their own DMA OPS from architecture or they would not + * like to use architecture specific IOMMU based DMA OPS because QEMU + * expects GPA instead of an IOVA in absence of VIRTIO_F_IOMMU_PLATFORM. 
+ */ +dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page, + unsigned long offset, size_t size, + enum dma_data_direction dir, + unsigned long attrs) +{ + return page_to_phys(page) + offset; +} + +void virtio_direct_unmap_page(struct device *hwdev, dma_addr_t dev_addr, + size_t size, enum dma_data_direction dir, + unsigned long attrs) +{ +} + +int virtio_direct_mapping_error(struct device *hwdev, dma_addr_t dma_addr) +{ + return 0; +} + +void *virtio_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle, + gfp_t gfp, unsigned long attrs) +{ + void *queue = alloc_pages_exact(PAGE_ALIGN(size), gfp); + + if (queue) { + phys_addr_t phys_addr = virt_to_phys(queue); + *dma_handle = (dma_addr_t)phys_addr; + + if (WARN_ON_ONCE(*dma_handle != phys_addr)) { + free_pages_exact(queue, PAGE_ALIGN(size)); + return NULL; + } + } + return queue; +} + +void virtio_direct_free(struct device *dev, size_t size, void *vaddr, + dma_addr_t dma_addr, unsigned long attrs) +{ + free_pages_exact(vaddr, PAGE_ALIGN(size)); +} + +const struct dma_map_ops virtio_direct_dma_ops = { + .alloc = virtio_direct_alloc, + .free = virtio_direct_free, + .map_page = virtio_direct_map_page, + .unmap_page = virtio_direct_unmap_page, + .mapping_error = virtio_direct_mapping_error, +}; +EXPORT_SYMBOL(virtio_direct_dma_ops); diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index 135ee3c..ec44d2f 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -31,6 +31,9 @@ #include #include +extern struct dma_map_ops virtio_direct_dma_ops; + + struct virtio_pci_vq_info { /* the actual virtqueue */ struct virtqueue *vq; -- 2.9.3
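[Editor's sketch] The core idea of virtio_direct_dma_ops above, that the "DMA address" handed out is simply the guest physical address, can be modeled in a small userspace sketch. The types and names below are illustrative stand-ins, not the kernel's dma_map_ops API:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t demo_dma_addr_t;

/* Toy analogue of a DMA ops table: in the direct (no-IOMMU) case, map is
 * the identity transform on the physical address and unmap is a no-op,
 * mirroring virtio_direct_map_page()/virtio_direct_unmap_page(). */
struct demo_dma_ops {
    demo_dma_addr_t (*map)(uint64_t phys, size_t offset);
    void (*unmap)(demo_dma_addr_t addr);
};

static demo_dma_addr_t demo_direct_map(uint64_t phys, size_t offset)
{
    return phys + offset;   /* GPA passthrough */
}

static void demo_direct_unmap(demo_dma_addr_t addr)
{
    (void)addr;             /* nothing to tear down in the direct case */
}

static const struct demo_dma_ops demo_direct_dma_ops = {
    .map   = demo_direct_map,
    .unmap = demo_direct_unmap,
};
```

An IOMMU-backed ops table would plug different callbacks into the same slots, which is exactly the substitution the series performs via set_dma_ops().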
[RFC 0/4] Virtio uses DMA API for all devices
This patch series is the follow-up to the discussions we had before about the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation for virito devices (https://patchwork.kernel.org/patch/10417371/). There were suggestions about doing away with two different paths of transactions with the host/QEMU, the first being direct GPA and the other being DMA API based translations. The first patch attempts to create a direct GPA mapping based DMA operations structure called 'virtio_direct_dma_ops', with the exact same implementation as the direct GPA path which virtio core currently has, just wrapped in a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of the arch default in the absence of the VIRTIO_F_IOMMU_PLATFORM flag to preserve the existing semantics. The second patch does exactly that inside the function virtio_finalize_features(). The third patch removes the default direct GPA path from virtio core, forcing it to use DMA API callbacks for all devices. With that change, every device must have a DMA operations structure associated with it. The fourth patch adds an additional hook which gives the platform an opportunity to do yet another override if required. This platform hook can be used on POWER Ultravisor based protected guests to load up SWIOTLB DMA callbacks to do the required bounce buffering into shared memory for all I/O scatter-gather buffers to be consumed on the host side (as discussed previously in the above mentioned thread, the host is allowed to access only parts of the guest GPA range). Please go through these patches and review whether this approach broadly makes sense. I would appreciate suggestions, inputs, and comments regarding the patches or the approach in general. Thank you. 
Anshuman Khandual (4): virtio: Define virtio_direct_dma_ops structure virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively virtio: Force virtio core to use DMA API callbacks for all virtio devices virtio: Add platform specific DMA API translation for virito devices arch/powerpc/include/asm/dma-mapping.h | 6 +++ arch/powerpc/platforms/pseries/iommu.c | 6 +++ drivers/virtio/virtio.c| 72 ++ drivers/virtio/virtio_pci_common.h | 3 ++ drivers/virtio/virtio_ring.c | 65 +- 5 files changed, 89 insertions(+), 63 deletions(-) -- 2.9.3
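[Editor's sketch] The bounce buffering the fourth patch targets (SWIOTLB-style, for protected guests whose memory the host cannot read directly) can be modeled with a toy pool: private data is copied into a host-accessible region and the device is handed the pool location instead of the private address. Everything below is an illustrative model, not the SWIOTLB API:

```c
#include <string.h>
#include <stddef.h>

/* Toy model of bounce buffering: shared_pool stands in for memory the
 * hypervisor/host is allowed to access; pool_used is a trivial bump
 * allocator cursor. All names are illustrative. */
#define DEMO_POOL_SIZE 4096
static unsigned char shared_pool[DEMO_POOL_SIZE];
static size_t pool_used;

/* Copy a private buffer into the shared pool; returns the pool offset
 * (the "DMA address" given to the device), or (size_t)-1 on exhaustion. */
static size_t bounce_map(const void *private_buf, size_t len)
{
    size_t off;

    if (pool_used + len > DEMO_POOL_SIZE)
        return (size_t)-1;
    off = pool_used;
    memcpy(shared_pool + off, private_buf, len);
    pool_used += len;
    return off;
}

/* On unmap, copy device-visible results back into the private buffer. */
static void bounce_unmap(void *private_buf, size_t off, size_t len)
{
    memcpy(private_buf, shared_pool + off, len);
}
```

The real SWIOTLB additionally handles direction flags, alignment, and slot recycling; the point here is only the copy-through-shared-memory shape.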
Re: [kernel,v7,1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
On Thu, Jul 19, 2018 at 04:06:10PM +1000, Michael Ellerman wrote: > On Tue, 2018-07-17 at 07:19:12 UTC, Alexey Kardashevskiy wrote: > > The size is always equal to 1 page so let's use this. Later on this will > > be used for other checks which use page shifts to check the granularity > > of access. > > > > This should cause no behavioral change. > > > > Reviewed-by: David Gibson > > Acked-by: Alex Williamson > > Signed-off-by: Alexey Kardashevskiy > > Applied to powerpc fixes, thanks. > > https://git.kernel.org/powerpc/c/1463edca6734d42ab4406fa2896e20 Ah. I have put these two patches in my kvm-ppc-next branch and I was about to send a pull request to Paolo. Paul.
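[Editor's sketch] The reason the patch carries a page *shift* rather than a page size is that later checks need to test granularity of access, and with the shift in hand both the size and an alignment mask fall out directly. Illustrative helpers (not the VFIO code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Given a page shift, the page size is 1 << shift ... */
static inline uint64_t page_size_from_shift(unsigned int shift)
{
    return 1ULL << shift;
}

/* ... and an address is aligned to that granularity iff its low
 * `shift` bits are zero. Doing this from a size alone would require
 * recovering the shift (or assuming the size is a power of two). */
static inline bool is_aligned_to_shift(uint64_t addr, unsigned int shift)
{
    return (addr & (page_size_from_shift(shift) - 1)) == 0;
}
```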
Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
On Fri, 2018-07-20 at 12:32 +1000, Michael Ellerman wrote: > Michael Neuling writes: > > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote: > > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote: > > > > > > > > > DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER); > > > > > diff --git a/arch/powerpc/kernel/idle_book3s.S > > > > > b/arch/powerpc/kernel/idle_book3s.S > > > > > index d85d551..5069d42 100644 > > > > > --- a/arch/powerpc/kernel/idle_book3s.S > > > > > +++ b/arch/powerpc/kernel/idle_book3s.S > > > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs: > > > > > mfspr r4, SPRN_MMCR2 > > > > > std r3, STOP_MMCR1(r13) > > > > > std r4, STOP_MMCR2(r13) > > > > > + > > > > > + mfspr r3, SPRN_SPRG3 > > > > > + std r3, STOP_SPRG3(r13) > > > > > > > > We don't need to save it. Just restore it from paca->sprg_vdso which > > > > should > > > > never change. > > > > > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso. > > > > > > > > > > > How can we do better at catching these missing SPRGs? > > > > > > We can go through the list of SPRs from the POWER9 User Manual and > > > document explicitly why we don't have to save/restore certain SPRs > > > during the execution of the stop instruction. Does this sound ok ? > > > > > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual > > > accessible from > > > https://openpowerfoundation.org/?resource_lib=power9-processor-users-manua > > > l) > > > > I was thinking of a boot time test case built into linux. linux has some > > boot > > time test cases which you can enable via CONFIG options. > > > > Firstly you could see if an SPR exists using the same trick xmon does in > > dump_one_spr(). Then once you have a list of usable SPRs, you could write > > all > > the known ones (I assume you'd have to leave out some, like the PSSCR), then > > set > > Write what value? > > Ideally you want to write a random bit pattern to reduce the chance > that only some bits are being restored. 
The xmon dump_one_spr() trick tries to work around that by writing one random value and then a different one to see if it really is a nop. > But you can't do that because writing a value to an SPR has an effect. Sure that's a concern but xmon seems to get away with it. > Some of them might even need to be zero, in which case you can't really > distinguish that from a non-restored zero. It doesn't need to be perfect. It just needs to catch more than we have now. > > the appropriate stop level, make sure you got into that stop level, and then > see > if that register was changed. Then you'd have an automated list of registers > you > need to make sure you save/restore at each stop level. > > Could something like that work? > > Maybe. > > Ignoring the problem of whether you can write a meaningful value to some > of the SPRs, I'm not entirely convinced it's going to work. But maybe > I'm wrong. Yeah, I'm not convinced it'll work either but it would be a nice piece of test infrastructure to have if it does work. We'd still need to marry up the SPR numbers we get from the test to what's actually being restored in Linux. > But there's a much simpler solution, we should 1) have a selftest for > getcpu() and 2) we should be running the glibc (I think?) test suite > that found this in the first place. It's frankly embarrassing that we > didn't find this. Yeah, we should do that also, but how do we catch the next SPR we are missing? I'd like some systematic way of doing that rather than whack-a-mole. Mikey
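[Editor's sketch] The boot-time check being debated, write a distinct pattern to each candidate register, enter the idle state, then report which registers did not come back intact, can be mocked in plain C. Real SPR access is privileged mfspr/mtspr; here an array stands in for the register file, and mock_stop() stands in for the stop instruction clobbering some registers:

```c
#include <stdint.h>

#define DEMO_NUM_REGS 8
static uint64_t mock_sprs[DEMO_NUM_REGS];

/* Pretend "stop" clobbers the registers whose bit is set in lost_mask. */
static void mock_stop(uint64_t lost_mask)
{
    for (int i = 0; i < DEMO_NUM_REGS; i++)
        if (lost_mask & (1ULL << i))
            mock_sprs[i] = 0;
}

/* Write a distinct pattern per register, run the mock stop, and return a
 * bitmask of registers that were not restored. A real test would use a
 * random pattern per boot, skipping side-effecting SPRs like the PSSCR. */
static uint64_t find_unrestored(uint64_t lost_mask, uint64_t pattern)
{
    uint64_t bad = 0;

    for (int i = 0; i < DEMO_NUM_REGS; i++)
        mock_sprs[i] = pattern ^ (uint64_t)i;   /* distinct per register */
    mock_stop(lost_mask);
    for (int i = 0; i < DEMO_NUM_REGS; i++)
        if (mock_sprs[i] != (pattern ^ (uint64_t)i))
            bad |= 1ULL << i;
    return bad;
}
```

The thread's caveats still apply: registers whose writes have side effects, or whose correct value is zero, defeat this scheme, so it can only narrow the search, not prove completeness.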
Re: linux-next: manual merge of the powerpc tree with the powerpc-fixes tree
Stephen Rothwell writes: > Hi all, > > Today's linux-next merge of the powerpc tree got a conflict in: > > drivers/vfio/vfio_iommu_spapr_tce.c > > between commit: > > 1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize") > > from the powerpc-fixes tree and commit: > > 00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian") > > from the powerpc tree. Thanks. That has turned into a real mess, with conflicting code in next, fixes and topic/ppc-kvm. I'll fix it all up before the merge window. cheers
Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
Michael Neuling writes: > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote: >> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote: >> > >> > > DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER); >> > > diff --git a/arch/powerpc/kernel/idle_book3s.S >> > > b/arch/powerpc/kernel/idle_book3s.S >> > > index d85d551..5069d42 100644 >> > > --- a/arch/powerpc/kernel/idle_book3s.S >> > > +++ b/arch/powerpc/kernel/idle_book3s.S >> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs: >> > > mfspr r4, SPRN_MMCR2 >> > > std r3, STOP_MMCR1(r13) >> > > std r4, STOP_MMCR2(r13) >> > > + >> > > +mfspr r3, SPRN_SPRG3 >> > > +std r3, STOP_SPRG3(r13) >> > >> > We don't need to save it. Just restore it from paca->sprg_vdso which >> > should >> > never change. >> >> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso. >> >> > >> > How can we do better at catching these missing SPRGs? >> >> We can go through the list of SPRs from the POWER9 User Manual and >> document explicitly why we don't have to save/restore certain SPRs >> during the execution of the stop instruction. Does this sound ok ? >> >> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual >> accessible from >> https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual) > > I was thinking of a boot time test case built into linux. linux has some boot > time test cases which you can enable via CONFIG options. > > Firstly you could see if an SPR exists using the same trick xmon does in > dump_one_spr(). Then once you have a list of usable SPRs, you could write all > the known ones (I assume you'd have to leave out some, like the PSSCR), then > set Write what value? Ideally you want to write a random bit pattern to reduce the chance that only some bits are being restored. But you can't do that because writing a value to an SPR has an effect. Some of them might even need to be zero, in which case you can't really distinguish that from a non-restored zero. 
> the appropriate stop level, make sure you got into that stop level, and then > see > if that register was changed. Then you'd have an automated list of registers > you > need to make sure you save/restore at each stop level. > > Could something like that work? Maybe. Ignoring the problem of whether you can write a meaningful value to some of the SPRs, I'm not entirely convinced it's going to work. But maybe I'm wrong. But there's a much simpler solution, we should 1) have a selftest for getcpu() and 2) we should be running the glibc (I think?) test suite that found this in the first place. It's frankly embarrassing that we didn't find this. cheers
Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote: > Hello Mikey, > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote: > > > > > DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER); > > > diff --git a/arch/powerpc/kernel/idle_book3s.S > > > b/arch/powerpc/kernel/idle_book3s.S > > > index d85d551..5069d42 100644 > > > --- a/arch/powerpc/kernel/idle_book3s.S > > > +++ b/arch/powerpc/kernel/idle_book3s.S > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs: > > > mfspr r4, SPRN_MMCR2 > > > std r3, STOP_MMCR1(r13) > > > std r4, STOP_MMCR2(r13) > > > + > > > + mfspr r3, SPRN_SPRG3 > > > + std r3, STOP_SPRG3(r13) > > > > We don't need to save it. Just restore it from paca->sprg_vdso which should > > never change. > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso. > > > > > How can we do better at catching these missing SPRGs? > > We can go through the list of SPRs from the POWER9 User Manual and > document explicitly why we don't have to save/restore certain SPRs > during the execution of the stop instruction. Does this sound ok ? > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual > accessible from > https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual) I was thinking of a boot time test case built into linux. linux has some boot time test cases which you can enable via CONFIG options. Firstly you could see if an SPR exists using the same trick xmon does in dump_one_spr(). Then once you have a list of usable SPRs, you could write all the known ones (I assume you'd have to leave out some, like the PSSCR), then set the appropriate stop level, make sure you got into that stop level, and then see if that register was changed. Then you'd have an automated list of registers you need to make sure you save/restore at each stop level. Could something like that work? Mikey
linux-next: manual merge of the powerpc tree with the powerpc-fixes tree
Hi all, Today's linux-next merge of the powerpc tree got a conflict in: drivers/vfio/vfio_iommu_spapr_tce.c between commit: 1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize") from the powerpc-fixes tree and commit: 00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian") from the powerpc tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc drivers/vfio/vfio_iommu_spapr_tce.c index 7cd63b0c1a46,11a4c194d6e3.. --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@@ -487,11 -449,11 +449,11 @@@ static void tce_iommu_unuse_page_v2(str if (!pua) return; - ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift, - &hpa, &mem); + ret = tce_iommu_prereg_ua_to_hpa(container, be64_to_cpu(*pua), - IOMMU_PAGE_SIZE(tbl), &hpa, &mem); ++ tbl->it_page_shift, &hpa, &mem); if (ret) - pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n", - __func__, *pua, entry, ret); + pr_debug("%s: tce %llx at #%lx was not cached, ret=%d\n", + __func__, be64_to_cpu(*pua), entry, ret); if (mem) mm_iommu_mapped_dec(mem); @@@ -599,19 -561,12 +561,12 @@@ static long tce_iommu_build_v2(struct t unsigned long hpa; enum dma_data_direction dirtmp; - if (!tbl->it_userspace) { - ret = tce_iommu_userspace_view_alloc(tbl, container->mm); - if (ret) - return ret; - } - for (i = 0; i < pages; ++i) { struct mm_iommu_table_group_mem_t *mem = NULL; - unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, - entry + i); + __be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i); ret = tce_iommu_prereg_ua_to_hpa(container, - tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem); + tce, 
tbl->it_page_shift, &hpa, &mem); if (ret) break;
Re: [PATCH v3] PCI: Data corruption happening due to race condition
Hi Bjorn, Ben On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt wrote: > On Wed, 2018-07-18 at 18:29 -0500, Bjorn Helgaas wrote: >> [+cc Paul, Michael, linuxppc-dev] >> > >/... > >> > Debugging revealed a race condition between pcie core driver >> > enabling is_added bit(pci_bus_add_device()) and nvme driver >> > reset work-queue enabling is_busmaster bit (by pci_set_master()). >> > As both fields are not handled in atomic manner and that clears >> > is_added bit. >> > >> > Fix moves device addition is_added bit to separate private flag >> > variable and use different atomic functions to set and retrieve >> > device addition state. As is_added shares different memory >> > location so race condition is avoided. >> >> Really nice bit of debugging! > > Indeed. However I'm not fan of the solution. Shouldn't we instead have > some locking for the content of pci_dev ? I've always been wary of us > having other similar races in there. > > As for the powerpc bits, I'm probably the one who wrote them, however, > I'm on vacation this week and right now, no bandwidth to context switch > all that back in :-) So give me a few days and/or ping me next week. > > The powerpc PCI code contains a lot of cruft coming from the depth of > history, including rather nasty assumptions. We want to progressively > clean it up, starting with EEH, but it will take time. > > Cheers, > Ben. > Some drivers also use pci_dev structure flags directly, which may cause similar kinds of issues under race conditions and should be avoided. They are probably not causing issues currently, but some race scenario may hit them, so they need to be handled with get()/set() APIs in an atomic manner. I would suggest using bit positions for all the remaining bitfields and using atomic operations. That way, access can be controlled and direct updating of the fields from outside avoided. Ex: enum pci_dev_flags { IS_BUSMASTER=1, . NO_MSI=2. 
} void assign_pci_dev_flag(struct pci_dev *dev, int flag, bool val) { assign_bit(flag, &dev->flags, val); } Proper cleanup is required in so many places that it will certainly take some time, i.e. a good effort, but it will be future-safe. If Bjorn agrees, we can work on that one. >> > Signed-off-by: Hari Vyas >> > --- >> > arch/powerpc/kernel/pci-common.c | 4 +++- >> > arch/powerpc/platforms/powernv/pci-ioda.c | 3 ++- >> > arch/powerpc/platforms/pseries/setup.c| 3 ++- >> > drivers/pci/bus.c | 6 +++--- >> > drivers/pci/hotplug/acpiphp_glue.c| 2 +- >> > drivers/pci/pci.h | 11 +++ >> > drivers/pci/probe.c | 4 ++-- >> > drivers/pci/remove.c | 5 +++-- >> > include/linux/pci.h | 1 - >> > 9 files changed, 27 insertions(+), 12 deletions(-) >> > >> > diff --git a/arch/powerpc/kernel/pci-common.c >> > b/arch/powerpc/kernel/pci-common.c >> > index fe9733f..471aac3 100644 >> > --- a/arch/powerpc/kernel/pci-common.c >> > +++ b/arch/powerpc/kernel/pci-common.c >> > @@ -42,6 +42,8 @@ >> > #include >> > #include >> > >> > +#include "../../../drivers/pci/pci.h" >> >> I see why you need it, but this include path is really ugly. Outside >> of bootloaders and tools, there are very few instances of includes >> like this that reference a different top-level directory, and I'm not >> very keen about adding more. >> >> Obviously powerpc is the only arch that needs dev->is_added. It seems >> to be because "We can only call pcibios_setup_device() after bus setup >> is complete, since some of the platform specific DMA setup code >> depends on it." >> >> I don't know powerpc, but it does raise the question in my mind of >> whether powerpc could be changed to do the DMA setup more like other >> arches do to remove this ordering dependency and the need to use >> dev->is_added. >> >> That sounds like a lot of work, but it would have the benefit of >> unifying some code that is probably needlessly arch-specific. >> Yes. I also agree, including pci.h with ../ references is really bad. 
First patch was using spin lock for protecting is_added and is_busmaster bits but in final patch moved is_added to private flags. >> > /* hose_spinlock protects accesses to the the phb_bitmap. */ >> > static DEFINE_SPINLOCK(hose_spinlock); >> > LIST_HEAD(hose_list); >> > @@ -1014,7 +1016,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus) >> > /* Cardbus can call us to add new devices to a bus, so ignore >> > * those who are already fully discovered >> > */ >> > - if (dev->is_added) >> > + if (pci_dev_is_added(dev)) >> > continue; >> > >> > pcibios_setup_device(dev); >> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c >> > b/arch/powerpc/platforms/powernv/pci-ioda.c >> > index 5bd0eb6..70b2e1e 100644 >> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c >> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >> > @@ -46,6 +46,
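[Editor's sketch] Hari's suggestion, numbered bit positions touched only through atomic helpers, can be sketched in userspace C11. Kernel code would use set_bit()/clear_bit()/assign_bit() on an unsigned long; the flag names and positions below are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Keep device state as numbered bits in one atomic word, so two writers
 * updating different flags cannot clobber each other the way adjacent
 * C bitfields (is_added, is_busmaster) can. */
enum demo_dev_flag { DEMO_IS_ADDED = 0, DEMO_IS_BUSMASTER = 1 };

static _Atomic unsigned long demo_flags;

static void demo_set_flag(enum demo_dev_flag f, bool val)
{
    if (val)
        atomic_fetch_or(&demo_flags, 1UL << f);
    else
        atomic_fetch_and(&demo_flags, ~(1UL << f));
}

static bool demo_test_flag(enum demo_dev_flag f)
{
    return (atomic_load(&demo_flags) >> f) & 1;
}
```

This is exactly why the merged fix moved is_added into a separate private flags word: a read-modify-write of one bitfield member rewrites its neighbours too unless every access is atomic on the whole word.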
Re: Improvements for the PS3
Hi Geert, Geoff, > > > so I added a sleep with > > > > > > + msleep(1); > > I can't see where you added the sleep, but 10s seems excessive. > If the real reason is the need to wait for an interrupt for > ps3fb_sync_image(), > then waiting for 40 ms should be sufficient? Or am I missing something? It's at the end of ps3fb_probe, as shown in the original post: https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/175771.html I thought 100 ms or so would work, but evidently it didn't. In fact, 1 s or even 5 s didn't seem to work either. In any case, I would like to develop a solution that does not need to sleep at all, so that will be my first approach for a proper implementation. > > > I suppose the problem is that it relies on interrupts for ps3fb_sync_image > > > to regularly copy the image, hence without them the screen isn't updated > > > to > > > show kernel panics, etc. Perhaps one way to fix that is to implement the > > > struct fb_tile_ops API, so that the console is synchronously updated? > > > Would > > > that be acceptable? > > > > I'm not sure if that would work or not. Maybe Geert is more familiar with > > it. > > That sounds like a complex solution, slowing down the console a lot. Why would that be slow? I have implemented a similar technique for the PlayStation 2 frame buffer, and (without any measurements at hand now) it appears to be about as fast as is possible, and reasonably easy too. :) It works like this: The PS2 frame buffer can operate in two distinct modes: virtual mode or console mode. Virtual mode is very similar to the PS3 in that the whole visible kernel memory frame buffer is copied to the Graphics Synthesizer (GS) via DMA regularly at vertical blank intervals. Console mode is different. No kernel memory is allocated for the frame buffer at all (which is a significant improvement in itself given that the PS2 is limited to 32 MiB of main memory) and mmap is disabled. 
Some struct fb_ops such as fb_fillrect and fb_copyarea are directly accelerated by GS hardware. The GS has two CRT merge circuits made for picture-in-picture that are used to accelerate YWRAP in hardware, which is particularly effective for console text scrolling. Additionally, the tiled API is implemented, and it turned out to be a rather good fit. The GS has 4 MiB of local frame buffer video memory (not directly accessible by the kernel). The maximum practical resolution 1920x1080p at 16 bits per pixel is 4147200 bytes, which leaves 47104 bytes for a tiled font which at 8x8 pixels and a minimum 4 bits indexed texture palette is at most 1464 characters. The indexed palette makes it easy to switch colors. struct fb_tile_ops such as fb_tileblit are accelerated by GS texture sprites which are fast (GS local copy) for the kernel via simple DMA (GIF) GS commands. Console text is always synchronous and therefore works properly in interrupt contexts and with kernel panics, etc. which is essential for debuggability. A buffered version could be faster, possibly, but I think that might as well be implemented by a user space console driver using a /dev/gs interface that can do zero-copy GS commands. The PS2 frame buffer implementation is nearly complete: https://github.com/frno7/linux/blob/ps2-v4.17-n3/drivers/video/fbdev/ps2fb.c Some adjustments and feature reductions seem to be needed for a PS3 version of anything similar. The simplest implementation is probably to just mirror characters as they are printed synchronously. I don't know the overhead of the hypervisor calls for copying graphics though, but the typical areas are quite small. Perhaps one could avoid allocating the kernel frame buffer as well when it's not needed. I have to investigate these things to be sure. > What about letting ps3fb register a panic notifier to sync the screen, like > hyperv_fb does? That would not work with kernel freezes unfortunately. 
Debugging those with nondeterministically invisible kernel prints would be painful, I believe. Fredrik
Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He wrote: > Hi Andrew, > > On 07/18/18 at 03:33pm, Andrew Morton wrote: > > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He wrote: > > > > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which > > > is used to load kernel/initrd/purgatory is supposed to be allocated from > > > top to down. This is what we have been doing all along in the old kexec > > > loading interface and the kexec loading is still default setting in some > > > distributions. However, the current kexec_file loading interface doesn't > > > do like this. The function arch_kexec_walk_mem() it calls ignores checking > > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through > > > all resources of System RAM from bottom to up, to try to find memory > > > region > > > which can contain the specific kexec buffer, then call > > > locate_mem_hole_callback() > > > to allocate memory in that found memory region from top to down. This > > > brings > > > confusion especially when KASLR is widely supported , users have to make > > > clear > > > why kexec/kdump kernel loading position is different between these two > > > interfaces in order to exclude unnecessary noises. Hence these two > > > interfaces > > > need be unified on behaviour. > > > > As far as I can tell, the above is the whole reason for the patchset, > > yes? To avoid confusing users. > > > In fact, it's not just trying to avoid confusing users. Kexec loading > and kexec_file loading are just do the same thing in essence. Just we > need do kernel image verification on uefi system, have to port kexec > loading code to kernel. > > Kexec has been a formal feature in our distro, and customers owning > those kind of very large machine can make use of this feature to speed > up the reboot process. On uefi machine, the kexec_file loading will > search place to put kernel under 4G from top to down. 
As we know, the > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume > it. It may have possibility to not be able to find a usable space for > kernel/initrd. From the top down of the whole memory space, we don't > have this worry. > > And at the first post, I just posted below with AKASHI's > walk_system_ram_res_rev() version. Later you suggested to use > list_head to link child sibling of resource, see what the code change > looks like. > http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com > > Then I posted v2 > http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com > Rob Herring mentioned that other components which has this tree struct > have planned to do the same thing, replacing the singly linked list with > list_head to link resource child sibling. Just quote Rob's words as > below. I think this could be another reason. > > ~ From Rob > The DT struct device_node also has the same tree structure with > parent, child, sibling pointers and converting to list_head had been > on the todo list for a while. ACPI also has some tree walking > functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a > common tree struct and helpers defined either on top of list_head or a > ~ > new struct if that saves some size. Please let's get all this into the changelogs? > > > > Is that sufficient? Can we instead simplify their lives by providing > > better documentation or informative printks or better Kconfig text, > > etc? > > > > And who *are* the people who are performing this configuration? Random > > system administrators? Linux distro engineers? If the latter then > > they presumably aren't easily confused! > > Kexec was invented for kernel developer to speed up their kernel > rebooting. Now high end server admin, kernel developer and QE are also > keen to use it to reboot large box for faster feature testing, bug > debugging. 
Kernel dev could know this well, about kernel loading > position, admin or QE might not be aware of it very well. > > > > In other words, I'm trying to understand how much benefit this patchset > > will provide to our users as a whole. > > Understood. The list_head replacing patch truly involves too many code > changes, it's risky. I am willing to try any idea from reviewers, won't > insist they have to be accepted finally. If we don't have a try, we don't > know what it looks like, and what impact it may have. I am fine to take > AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even > though it could be a little bit less efficient. The larger patch produces a better result. We can handle it ;)
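[Editor's sketch] The top_down semantics at issue, pick the highest-addressed hole that fits rather than the lowest, can be shown with a toy hole-finder. Illustrative only; the kernel walks struct resource trees via walk_system_ram_res(), not arrays:

```c
#include <stdint.h>
#include <stddef.h>

struct demo_region { uint64_t start, end; };   /* [start, end) */

/* Return a base address for a buffer of `size` bytes inside one of the
 * RAM regions: the highest fitting base when top_down is set, otherwise
 * the lowest. UINT64_MAX is the "no fit" sentinel. */
static uint64_t demo_find_hole(const struct demo_region *r, size_t n,
                               uint64_t size, int top_down)
{
    uint64_t best = UINT64_MAX;

    for (size_t i = 0; i < n; i++) {
        if (r[i].end - r[i].start < size)
            continue;
        uint64_t base = top_down ? r[i].end - size : r[i].start;
        if (best == UINT64_MAX ||
            (top_down ? base > best : base < best))
            best = base;
    }
    return best;
}
```

The thread's complaint maps onto this directly: the old kexec interface behaves like top_down=1 over the whole address space, while kexec_file walked regions bottom-up and only allocated top-down within the first region that fit, which also keeps it pinned under 4G on UEFI boxes.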
Re: [PATCH] powerpc/msi: Remove VLA usage
On Thu, Jul 19, 2018 at 5:17 AM, Michael Ellerman wrote: > Kees Cook writes: > >> On Fri, Jun 29, 2018 at 11:52 AM, Kees Cook wrote: >>> In the quest to remove all stack VLA usage from the kernel[1], this >>> switches from an unchanging variable to a constant expression to eliminate >>> the VLA generation. >>> >>> [1] >>> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com >>> >>> Signed-off-by: Kees Cook >> >> Friendly ping! Michael, can you take this? > > Yep, sorry. I actually applied it weeks ago but hadn't pushed. No worries! Thanks! :) -Kees -- Kees Cook Pixel Security
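[Editor's sketch] The digest does not include the patch body, but the general shape of such a VLA fix is: replace a stack array sized by a runtime value with one sized by a compile-time maximum, and bound the runtime count against it. Names and the bound below are illustrative, not the powerpc/msi code:

```c
#include <string.h>
#include <stddef.h>

#define DEMO_MAX_ENTRIES 32   /* compile-time bound replacing the VLA size */

static int demo_sum_first(const int *src, size_t n)
{
    int buf[DEMO_MAX_ENTRIES];   /* fixed-size frame: no VLA generated */
    int sum = 0;

    if (n > DEMO_MAX_ENTRIES)    /* clamp (or reject) the runtime count */
        n = DEMO_MAX_ENTRIES;
    memcpy(buf, src, n * sizeof(*buf));
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}
```

Besides enabling -Wvla kernel-wide, this makes the stack footprint statically known, which is the point of the VLA-removal effort referenced in the commit.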
Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski wrote: > [I added PeterZ and Vitaly -- can you see any way in which this would > break something obscure? I don't.] > > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel wrote: >> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals >> next? > > Yes, AFAICS. > >> >> On to the lazy TLB mm_struct refcounting stuff :) >> >>> >>> Which refcount? mm_users shouldn’t be hot, so I assume you’re talking about >>> mm_count. My suggestion is to get rid of mm_count instead of trying to >>> optimize it. >> >> >> Do you have any suggestions on how? :) >> >> The TLB shootdown sent at __exit_mm time does not get rid of the >> kernelthread->active_mm >> pointer pointing at the mm that is exiting. >> > > Ah, but that's conceptually very easy to fix. Add a #define like > ARCH_NO_TASK_ACTIVE_MM. Then just get rid of active_mm if that > #define is set. After some grepping, there are very few users. The > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that > are involved in the rather complicated dance of refcounting active_mm. > If that field goes away, it doesn't need to be refcounted. Instead, I > think the refcounting can get replaced with something like: > > /* > * Release any arch-internal references to mm. Only called when > mm_users is zero > * and all tasks using mm have either been switch_mm()'d away or have had > * enter_lazy_tlb() called. > */ > extern void arch_shoot_down_dead_mm(struct mm_struct *mm); > > which the kernel calls in __mmput() after tearing down all the page > tables. The body can be something like: > > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) { > /* send an IPI. Maybe just call tlb_flush_remove_tables() */ > } > > (You'll also have to fix up the highly questionable users in > arch/x86/platform/efi/efi_64.c, but that's easy.) > > Does all that make sense? 
Basically, as I understand it, the > expensive atomic ops you're seeing are all pointless because they're > enabling an optimization that hasn't actually worked for a long time, > if ever. Hmm. Xen PV has a big hack in xen_exit_mmap(), which is called from arch_exit_mmap(), I think. It's a heavier weight version of more or less the same thing that arch_shoot_down_dead_mm() would be, except that it happens before exit_mmap(). But maybe Xen actually has the right idea. In other words, rather than doing the big pagetable free in exit_mmap() while there may still be other CPUs pointing at the page tables, the other order might make more sense. So maybe, if ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible for getting rid of all secret arch references to the mm. Hmm. ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name. I added some more arch maintainers. The idea here is that, on x86 at least, task->active_mm and all its refcounting is pure overhead. When a process exits, __mmput() gets called, but the core kernel has a longstanding "optimization" in which other tasks (kernel threads and idle tasks) may have ->active_mm pointing at this mm. This is nasty, complicated, and hurts performance on large systems, since it requires extra atomic operations whenever a CPU switches between real user threads and idle/kernel threads. It's also almost completely worthless on x86 at least, since __mmput() frees pagetables, and that operation *already* forces a remote TLB flush, so we might as well zap all the active_mm references at the same time. But arm64 has real HW remote flushes. Does arm64 actually benefit from the active_mm optimization? What happens on arm64 when a process exits? How about s390? I suspect that s390 has rather larger systems than arm64, where the cost of the reference counting can be much higher. (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that interact with your previous patch set?)
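[Editor's sketch] The mm_count pattern the thread wants to eliminate, every lazy active_mm holder takes a reference and the last dropper frees, looks roughly like this in miniature. Names mirror mmgrab()/mmdrop() but the model is illustrative, not the kernel code:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_mm {
    _Atomic int count;   /* stands in for mm_count */
    bool freed;          /* stands in for __mmdrop() having run */
};

static void demo_mmgrab(struct demo_mm *mm)
{
    atomic_fetch_add(&mm->count, 1);
}

static void demo_mmdrop(struct demo_mm *mm)
{
    /* atomic_fetch_sub returns the previous value; the dropper that
     * takes the count from 1 to 0 frees the mm. */
    if (atomic_fetch_sub(&mm->count, 1) == 1)
        mm->freed = true;
}
```

These atomics on a shared cache line at every idle/kthread context switch are the cost Andy and Rik are discussing; the proposal is to drop them entirely on architectures where the pagetable-freeing TLB shootdown already forces every CPU off the dying mm.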
Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
On 07/18/18 at 07:37pm, Andy Shevchenko wrote: > On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko > wrote: > > On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He wrote: > >> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c > >> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c > >> so that it's shared. > > >> + * Returns 0 on success, -ENOTSUPP if child resource is not completely > >> + * contained by 'res', -ECANCELED if no any conflicting entry found. > > You also can refer to constants by prefixing them with %, e.g. %-ENOTSUPP. > But this is up to you completely. Thanks, will fix when repost.
Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
Hi Andrew, On 07/18/18 at 03:33pm, Andrew Morton wrote: > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He wrote: > > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which > > is used to load kernel/initrd/purgatory is supposed to be allocated from > > top to down. This is what we have been doing all along in the old kexec > > loading interface and the kexec loading is still default setting in some > > distributions. However, the current kexec_file loading interface doesn't > > do like this. The function arch_kexec_walk_mem() it calls ignores checking > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through > > all resources of System RAM from bottom to up, to try to find memory region > > which can contain the specific kexec buffer, then call > > locate_mem_hole_callback() > > to allocate memory in that found memory region from top to down. This brings > > confusion especially when KASLR is widely supported , users have to make > > clear > > why kexec/kdump kernel loading position is different between these two > > interfaces in order to exclude unnecessary noises. Hence these two > > interfaces > > need be unified on behaviour. > > As far as I can tell, the above is the whole reason for the patchset, > yes? To avoid confusing users. In fact, it's not just about avoiding user confusion. Kexec loading and kexec_file loading do the same thing in essence; kexec_file exists because we need kernel image verification on UEFI systems, which meant porting the kexec loading code into the kernel. Kexec has been a formal feature in our distro, and customers who own those kinds of very large machines can use it to speed up the reboot process. On a UEFI machine, kexec_file loading searches for a place to put the kernel under 4G, from top to bottom. As we know, the first 4G is the DMA32 zone; DMA, PCI MMCONFIG, BIOS, etc. all try to consume it, so it may not be possible to find usable space there for the kernel/initrd.
Searching from the top of the whole memory space downward, we don't have this worry. In the first post, I posted the version below with AKASHI's walk_system_ram_res_rev(). Later you suggested using list_head to link the child and sibling resources, to see what the code change looks like. http://lkml.kernel.org/r/20180322033722.9279-1-...@redhat.com Then I posted v2 http://lkml.kernel.org/r/20180408024724.16812-1-...@redhat.com Rob Herring mentioned that other components which have this tree structure have planned to do the same thing, replacing the singly linked list with list_head to link resource children and siblings. I'll just quote Rob's words below; I think this could be another reason. ~ From Rob: The DT struct device_node also has the same tree structure with parent, child, sibling pointers, and converting to list_head had been on the todo list for a while. ACPI also has some tree walking functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a common tree struct and helpers defined either on top of list_head or a new struct if that saves some size. ~ > > Is that sufficient? Can we instead simplify their lives by providing > better documentation or informative printks or better Kconfig text, > etc? > > And who *are* the people who are performing this configuration? Random > system administrators? Linux distro engineers? If the latter then > they presumably aren't easily confused! Kexec was invented for kernel developers to speed up kernel rebooting. Now high-end server admins, kernel developers and QE are also keen to use it to reboot large boxes for faster feature testing and bug debugging. Kernel developers may know the kernel loading position well; admins or QE might not be aware of it. > > In other words, I'm trying to understand how much benefit this patchset > will provide to our users as a whole. Understood. The list_head replacement patch truly involves too many code changes; it's risky.
I am willing to try any idea from reviewers, without insisting that it be accepted in the end. If we don't try, we don't know what it looks like or what impact it may have. I am fine with taking AKASHI's simple version of walk_system_ram_res_rev() to lower the risk, even though it could be a little less efficient. Thanks Baoquan
Re: Improvements for the PS3
Hi Geert, Fredrik, On 07/19/2018 12:45 AM, Geert Uytterhoeven wrote: >> On 07/14/2018 09:49 AM, Fredrik Noring wrote: >>> >>> et voilà, the screen came alive and the kernel panic was revealed! It seems >>> the kernel panics so fast that the PS3 frame buffer is unprepared. This is, >>> of course, very unfortunate because trying to debug the boot process without >>> a screen or any other means of obtaining console text is quite difficult. >> > What about letting ps3fb register a panic notifier to sync the screen, like > hyperv_fb does? Seems like the thing to do. Fredrik, do you want to try it? Otherwise, I'll work on it when I have some time. -Geoff
Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
On Thu, 19 Jul 2018, Geoff Levand wrote: > Hi Alan, > > On 07/19/2018 07:33 AM, Alan Stern wrote: > > On Wed, 18 Jul 2018, Geoff Levand wrote: > > > >> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c > >> index 8c733492d8fe..454d8c624a3f 100644 > >> --- a/drivers/usb/host/ehci-ps3.c > >> +++ b/drivers/usb/host/ehci-ps3.c > >> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device > >> *dev) > >>int result; > >>struct usb_hcd *hcd; > >>unsigned int virq; > >> - static u64 dummy_mask = DMA_BIT_MASK(32); > >> + static u64 dummy_mask; > >> > >>if (usb_disabled()) { > >>result = -ENODEV; > >> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device > >> *dev) > >>goto fail_irq; > >>} > >> > >> - dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */ > >> + dummy_mask = DMA_BIT_MASK(32); > >> + dev->core.dma_mask = &dummy_mask; > >> + dma_set_coherent_mask(&dev->core, dummy_mask); > > > > What is the reason for changing a static initialization to a dynamic > > one? As far as I can see, the patch touches four lines of code but the > > only real difference is addition of a single line (and removal of a > > comment). > > I thought it would be better if all the setting was done in > one place, that's the only reason. All right; in that case (for the EHCI and OHCI pieces): Acked-by: Alan Stern Alan Stern
Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
Hi Alan, On 07/19/2018 07:33 AM, Alan Stern wrote: > On Wed, 18 Jul 2018, Geoff Levand wrote: > >> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c >> index 8c733492d8fe..454d8c624a3f 100644 >> --- a/drivers/usb/host/ehci-ps3.c >> +++ b/drivers/usb/host/ehci-ps3.c >> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device >> *dev) >> int result; >> struct usb_hcd *hcd; >> unsigned int virq; >> -static u64 dummy_mask = DMA_BIT_MASK(32); >> +static u64 dummy_mask; >> >> if (usb_disabled()) { >> result = -ENODEV; >> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device >> *dev) >> goto fail_irq; >> } >> >> -dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */ >> +dummy_mask = DMA_BIT_MASK(32); >> +dev->core.dma_mask = &dummy_mask; >> +dma_set_coherent_mask(&dev->core, dummy_mask); > > What is the reason for changing a static initialization to a dynamic > one? As far as I can see, the patch touches four lines of code but the > only real difference is addition of a single line (and removal of a > comment). I thought it would be better if all the setting was done in one place, that's the only reason. -Geoff
Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
On Wed, 18 Jul 2018, Geoff Levand wrote: > Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices. > > Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine. > > Reported-by: Fredrik Noring > Signed-off-by: Geoff Levand > --- > Hi Michael, > > This just silences some warnings. Can you take it through the powerpc > tree? > > -Geoff > > > drivers/usb/host/ehci-ps3.c | 6 -- > drivers/usb/host/ohci-ps3.c | 6 -- > sound/ppc/snd_ps3.c | 5 + > 3 files changed, 13 insertions(+), 4 deletions(-) > > diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c > index 8c733492d8fe..454d8c624a3f 100644 > --- a/drivers/usb/host/ehci-ps3.c > +++ b/drivers/usb/host/ehci-ps3.c > @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev) > int result; > struct usb_hcd *hcd; > unsigned int virq; > - static u64 dummy_mask = DMA_BIT_MASK(32); > + static u64 dummy_mask; > > if (usb_disabled()) { > result = -ENODEV; > @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device > *dev) > goto fail_irq; > } > > - dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */ > + dummy_mask = DMA_BIT_MASK(32); > + dev->core.dma_mask = &dummy_mask; > + dma_set_coherent_mask(&dev->core, dummy_mask); What is the reason for changing a static initialization to a dynamic one? As far as I can see, the patch touches four lines of code but the only real difference is addition of a single line (and removal of a comment). 
> diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c > index 20a23d795adf..395f9d3bc849 100644 > --- a/drivers/usb/host/ohci-ps3.c > +++ b/drivers/usb/host/ohci-ps3.c > @@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev) > int result; > struct usb_hcd *hcd; > unsigned int virq; > - static u64 dummy_mask = DMA_BIT_MASK(32); > + static u64 dummy_mask; > > if (usb_disabled()) { > result = -ENODEV; > @@ -115,7 +115,9 @@ static int ps3_ohci_probe(struct ps3_system_bus_device > *dev) > goto fail_irq; > } > > - dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */ > + dummy_mask = DMA_BIT_MASK(32); > + dev->core.dma_mask = &dummy_mask; > + dma_set_coherent_mask(&dev->core, dummy_mask); Same here. Alan Stern
[PATCH] powerpc/mm: Don't report PUDs as memory leaks when using kmemleak
Paul Menzel reported that kmemleak was producing reports such as: unreferenced object 0xc000f8b8 (size 16384): comm "init", pid 1, jiffies 4294937416 (age 312.240s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 backtrace: [] __pud_alloc+0x80/0x190 [<87f2e8a3>] move_page_tables+0xbac/0xdc0 [<091e51c2>] shift_arg_pages+0xc0/0x210 [ ] setup_arg_pages+0x22c/0x2a0 [<60871529>] load_elf_binary+0x41c/0x1648 [ ] search_binary_handler.part.11+0xbc/0x280 [<34e0cdd7>] __do_execve_file.isra.13+0x73c/0x940 [<5f953a6e>] sys_execve+0x58/0x70 [<9700a858>] system_call+0x5c/0x70 Indicating that a PUD was being leaked. However what's really happening is that kmemleak is not able to recognise the references from the PGD to the PUD, because they are not fully qualified pointers. We can confirm that in xmon, eg: Find the task struct for pid 1 "init": 0:mon> P task_struct ->thread.kspPID PPID S P CMD c001fe7c c001fe803960 1 0 S 13 systemd Dump virtual address 0 to find the PGD: 0:mon> dv 0 c001fe7c pgd @ 0xc000f8b01000 Dump the memory of the PGD: 0:mon> d c000f8b01000 c000f8b01000 f8b9 || c000f8b01010 || c000f8b01020 || c000f8b01030 f8b8 || There we can see the reference to our supposedly leaked PUD. But because it's missing the leading 0xc, kmemleak won't recognise it. We can confirm it's still in use by translating an address that is mapped via it: 0:mon> dv 7fff9400 c001fe7c pgd @ 0xc000f8b01000 pgdp @ 0xc000f8b01038 = 0xf8b8 <-- pudp @ 0xc000f8b81ff8 = 0x037c4000 pmdp @ 0xc37c5ca0 = 0xfbd89000 ptep @ 0xc000fbd89000 = 0xc081d5ce0386 Maps physical address = 0x0001d5ce Flags = Accessed Dirty Read Write The fix is fairly simple. We need to tell kmemleak to ignore PUD allocations and never report them as leaks. We can also tell it not to scan the PGD, because it will never find pointers in there. However it will still notice if we allocate a PGD and then leak it. 
Reported-by: Paul Menzel Signed-off-by: Michael Ellerman --- arch/powerpc/include/asm/book3s/64/pgalloc.h | 23 +-- 1 file changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index 01ee40f11f3a..76234a14b97d 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -9,6 +9,7 @@ #include #include +#include #include struct vmemmap_backing { @@ -82,6 +83,13 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm) pgd = kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), pgtable_gfp_flags(mm, GFP_KERNEL)); + /* +* Don't scan the PGD for pointers, it contains references to PUDs but +* those references are not full pointers and so can't be recognised by +* kmemleak. +*/ + kmemleak_no_scan(pgd); + /* * With hugetlb, we don't clear the second half of the page table. * If we share the same slab cache with the pmd or pud level table, @@ -110,8 +118,19 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX), - pgtable_gfp_flags(mm, GFP_KERNEL)); + pud_t *pud; + + pud = kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX), + pgtable_gfp_flags(mm, GFP_KERNEL)); + /* +* Tell kmemleak to ignore the PUD, that means don't scan it for +* pointers and don't consider it a leak. PUDs are typically only +* referred to by their PGD, but kmemleak is not able to recognise those +* as pointers, leading to false leak reports. +*/ + kmemleak_ignore(pud); + + return pud; } static inline void pud_free(struct mm_struct *mm, pud_t *pud) -- 2.14.1
Re: [powerpc/powervm]Oops: Kernel access of bad area, sig: 11 [#1] while running stress-ng
cc linux-xfs On Thu, Jul 19, 2018 at 11:47:59AM +0530, vrbagal1 wrote: > On 2018-07-10 19:12, Michael Ellerman wrote: > > vrbagal1 writes: > > > > > On 2018-07-10 13:37, Nicholas Piggin wrote: > > > > On Tue, 10 Jul 2018 11:58:40 +0530 > > > > vrbagal1 wrote: > > > > > > > > > Hi, > > > > > > > > > > Observing kernel oops on Power9(ZZ) box, running on PowerVM, while > > > > > running stress-ng. > > > > > > > > > > > > > > > Kernel: 4.18.0-rc4 > > > > > Machine: Power9 ZZ (PowerVM) > > > > > Test: Stress-ng > > > > > > > > > > Attached is .config file > > > > > > > > > > Traces: > > > > > > > > > > [12251.245209] Oops: Kernel access of bad area, sig: 11 [#1] > > > > > > > > Can you post the lines above this? Otherwise we don't know what > > > > address > > > > it tried to access (without decoding the instructions and > > > > reconstructing > > > > it from registers at least, which the XFS devs wouldn't be > > > > inclined to > > > > do). > > > > > > > > > > ah my bad. > > > > > > [12251.245179] Unable to handle kernel paging request for data at > > > address 0x60006000 > > > [12251.245199] Faulting instruction address: 0xc0319e2c > > > > Which matches the regs & disassembly: > > > > r4 = 60006000 > > r9 = 0 > > ldx r9,r4,r9<- pop > > > > So object was 0x60006000. > > > > That looks like two nops, ie. we got some code? > > > > And there's only one caller of prefetch_freepointer() in > > slab_alloc_node(): > > > > prefetch_freepointer(s, next_object); > > > > > > So slab corruption is looking likely. > > > > Do you have slub_debug=FZP on the kernel command line? > > I tried to reproduce this, but didnt succeed. But instead I got xfs > assertion. 
> > Kernel: 4.18.0-rc5 > Tree Head: 30b06abfb92bfd5f9b63ea6a2ffb0bd905d1a6da > > Traces: > > [12290.993612] XFS: Assertion failed: flags & XFS_BMAPI_COWFORK, file: > fs/xfs/libxfs/xfs_bmap.c, line: 4370 This usually means we have a writeback of a page with delayed allocation buffers over an extent map that reflects a hole in the file. This should never happen for normal (non-reflink) data fork writes, hence the assert failure and -EIO return. We used to actually (accidentally) allocate a new block in this case, but more recent kernels generate the error. More interestingly, I think the pending iomap + writeback rework in XFS may simply drop this error on the floor because we refer to extent state to process the page rather than the other way around (and buffer_delay() isn't a state in the iomap bits). This of course depends on which state is actually valid, the buffer or extent map, which we don't know for sure. Can you reliably reproduce this problem? If so, can you describe the reproducer? Also, what is the filesystem geometry ('xfs_info ')? And since this is a power kernel, I'm guessing you have 64k pages as well..? Brian > [12290.993668] [ cut here ] > [12290.993672] kernel BUG at fs/xfs/xfs_message.c:102! 
> [12290.993676] Oops: Exception in kernel mode, sig: 5 [#1] > [12290.993678] LE SMP NR_CPUS=2048 NUMA pSeries > [12290.993697] Dumping ftrace buffer: > [12290.993730](ftrace buffer empty) > [12290.993735] Modules linked in: camellia_generic(E) cast6_generic(E) > cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) kvm(E) > snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E) > soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E) > vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E) > vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) tun(E) > tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E) > vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) md4(E) > unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) fuse(E) > vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) > raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) ip6t_rpfilter(E) > ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E) > [12290.993784] nf_reject_ipv6(E) xt_conntrack(E) ip_set(E) nfnetlink(E) > ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ip6table_nat(E) > nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) ip6table_mangle(E) > ip6table_security(E) ip6table_raw(E) iptable_nat(E) nf_conntrack_ipv4(E) > nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) iptable_mangle(E) > iptable_security(E) iptable_raw(E) ebtable_filter(E) ebtables(E) > ip6table_filter(E) ip6_tables(E) iptable_filter(E) xts(E) sg(E) > vmx_crypto(E) pseries_rng(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) > ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) lpfc(E) mlx5_core(E) > nvmet_fc(E) nvmet(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E) > scsi_transport_fc(E) mlxfw(E) devlink(E) dm_mirror(E) dm_region_hash(E) > dm_log(E) dm_mod(E) [last unloaded: torture] > [12290.993834] CPU: 2 PID: 433 Comm: kswapd1 Tainted: GE > 4.18.0-rc5-autotest-autotest #1 > [1
Re: [PATCH v6 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups
On 07/18/2018 11:59 PM, Stewart Smith wrote: Shilpasri G Bhat writes: On-Chip-Controller(OCC) is an embedded micro-processor in POWER9 chip which measures various system and chip level sensors. These sensors comprises of environmental sensors (like power, temperature, current and voltage) and performance sensors (like utilization, frequency). All these sensors are copied to main memory at a regular interval of 100ms. OCC provides a way to select a group of sensors that is copied to the main memory to increase the update frequency of selected sensor groups. When a sensor-group is disabled, OCC will not copy it to main memory and those sensors read 0 values. OCC is an implementation detail rather than a core part of this firmware API. Why not something like this: OPAL firmware provides the facility for some groups of sensors to be enabled/disabled at runtime to give the user the option of using the system resources for collecting these sensors or not. For example, on POWER9 systems, the On Chip Controller (OCC) gathers various system and chip level sensors and maintains their values in main memory. +static int init_sensor_group_data(struct platform_device *pdev, + struct platform_data *pdata) +{ + struct sensor_group_data *sgrp_data; + struct device_node *groups, *sgrp; + enum sensors type; + int count = 0, ret = 0; + + groups = of_find_node_by_path("/ibm,opal/sensor-groups"); + if (!groups) + return ret; Why not look for the compatible property? For both, I don't really care either way. Can you folks get to an agreement and let me know after you decided ? Thanks, Guenter
Re: [PATCH v5 5/7] powerpc/pseries: flush SLB contents on SLB MCE errors.
Michal Suchánek writes: > On Tue, 3 Jul 2018 08:08:14 +1000 > "Nicholas Piggin" wrote: >> On Mon, 02 Jul 2018 11:17:06 > +0530 >> Mahesh J Salgaonkar wrote: >> > From: Mahesh Salgaonkar >> > diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c >> > index efdd16a79075..221271c96a57 100644 >> > --- a/arch/powerpc/kernel/mce.c >> > +++ b/arch/powerpc/kernel/mce.c >> > @@ -488,9 +488,21 @@ long machine_check_early(struct pt_regs *regs) >> > { >> >long handled = 0; >> > >> > - __this_cpu_inc(irq_stat.mce_exceptions); >> > + /* >> > + * For pSeries we count mce when we go into virtual mode >> > machine >> > + * check handler. Hence skip it. Also, We can't access per >> > cpu >> > + * variables in real mode for LPAR. >> > + */ >> > + if (early_cpu_has_feature(CPU_FTR_HVMODE)) >> > + __this_cpu_inc(irq_stat.mce_exceptions); >> > >> > - if (cur_cpu_spec && cur_cpu_spec->machine_check_early) >> > + /* >> > + * See if platform is capable of handling machine check. >> > + * Otherwise fallthrough and allow CPU to handle this >> > machine check. >> > + */ >> > + if (ppc_md.machine_check_early) >> > + handled = ppc_md.machine_check_early(regs); >> > + else if (cur_cpu_spec && cur_cpu_spec->machine_check_early) >> >handled = >> > cur_cpu_spec->machine_check_early(regs); >> >> Would be good to add a powernv ppc_md handler which does the >> cur_cpu_spec->machine_check_early() call now that other platforms are >> calling this code. Because those aren't valid as a fallback call, but >> specific to powernv. >> > > Something like this (untested)? > > Subject: [PATCH] powerpc/powernv: define platform MCE handler. LGTM. cheers
Re: [PATCH] powerpc/msi: Remove VLA usage
Kees Cook writes: > On Fri, Jun 29, 2018 at 11:52 AM, Kees Cook wrote: >> In the quest to remove all stack VLA usage from the kernel[1], this >> switches from an unchanging variable to a constant expression to eliminate >> the VLA generation. >> >> [1] >> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com >> >> Signed-off-by: Kees Cook > > Friendly ping! Michael, can you take this? Yep, sorry. I actually applied it weeks ago but hadn't pushed. cheers
Re: [PATCH v4 4/6] powerpc/fsl: Enable cpu vulnerabilities reporting for NXP PPC BOOK3E
LEROY Christophe writes: > Diana Madalina Craciun a écrit : >> On 7/17/2018 7:47 PM, LEROY Christophe wrote: >>> Diana Craciun a écrit : The NXP PPC Book3E platforms are not vulnerable to meltdown and Spectre v4, so make them PPC_BOOK3S_64 specific. Signed-off-by: Diana Craciun --- History: v2-->v3 - used the existing functions for spectre v1/v2 arch/powerpc/Kconfig | 7 ++- arch/powerpc/kernel/security.c | 2 ++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 9f2b75f..116c953 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -165,7 +165,7 @@ config PPC select GENERIC_CLOCKEVENTS_BROADCASTif SMP select GENERIC_CMOS_UPDATE select GENERIC_CPU_AUTOPROBE - select GENERIC_CPU_VULNERABILITIES if PPC_BOOK3S_64 + select GENERIC_CPU_VULNERABILITIES if PPC_NOSPEC >>> I don't understand. You say this patch is to make something specific >>> to book3s64 specific, and you are creating a new config param that >>> make things less specific >>> >>> Christophe >> >> In order to enable the vulnerabilities reporting on NXP socs I need to >> enable them for PPC_FSL_BOOK3E. So they will be enabled for both >> PPC_FSL_BOOK3E and PPC_BOOK3S_64. This is the reason for adding the >> Kconfig. However this will enable: spectre v1/v2 and meltdown. NXP socs >> are not vulnerable to meltdown, so I made the meltdown reporting >> PPC_BOOK3S_64 specific. I guess I can have the PPC_NOSPEC definition in >> a separate patch to be more clear. > > Yes you can. Or keep it as a single patch and add the details you gave > me in the patch description. Yeah I think the patch is fine, but the change log is a bit short on detail. If you just send me a new change log I can fold it in. cheers
Re: [PATCH] powerpc/prom_init: remove linux,stdout-package property
Murilo Opsfelder Araujo writes: > On Wed, Jul 18, 2018 at 07:37:37PM +1000, Michael Ellerman wrote: >> Murilo Opsfelder Araujo writes: >> > This property was added in 2004 by >> > >> > >> > https://github.com/mpe/linux-fullhistory/commit/689fe5072fe9a0dec914bfa4fa60aed1e54563e6 >> > >> > and the only use of it, which was already inside `#if 0`, was removed a >> > month >> > later by >> > >> > >> > https://github.com/mpe/linux-fullhistory/commit/1fbe5a6d90f6cd4ea610737ef488719d1a875de7 >> > >> > Fixes: https://github.com/linuxppc/linux/issues/125 >> >> That is going to confuse some scripts that are expecting that to be a >> "Fixes: " tag :) >> >> The proper tag to use there would be "Link:". >> >> But, I'd prefer we didn't add github issue links to the history, as I'm >> not sure they won't bit-rot eventually. Not because I'm a anti-Microsoft >> conspiracy person but just because it's a repo I set up and manage and >> there's no long term plan for it necessarily. >> >> > --- >> > arch/powerpc/kernel/prom_init.c | 2 -- >> > 1 file changed, 2 deletions(-) >> >> Including the link here would be ideal, as it means it doesn't end up in >> the commit history, but it does end up in the mail archive. So if we >> ever really need to find it, it should be there. >> >> cheers > > Hi, Michael. > > Thanks for reviewing it. I've sent v2 with your suggestions: > > https://lkml.kernel.org/r/20180718161544.12134-1-muri...@linux.ibm.com Thanks. I had actually already applied the first version, but I didn't say that in my email did I :} So I've rebased and applied your v2, thanks. cheers
Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
Greg KH writes: > On Wed, Jul 18, 2018 at 03:08:33PM -0700, Geoff Levand wrote: >> Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices. >> >> Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine. >> >> Reported-by: Fredrik Noring >> Signed-off-by: Geoff Levand >> --- >> Hi Michael, >> >> This just silences some warnings. Can you take it through the powerpc >> tree? > > Oops, nevermind, it should go through the ppc tree. Here's my ack: > > Acked-by: Greg Kroah-Hartman OK thanks, I'll take it. cheers
Re: [RFC PATCH v6 0/4] powerpc/fadump: Improvements and fixes for firmware-assisted dump.
On Wed 18-07-18 21:52:17, Mahesh Jagannath Salgaonkar wrote: > On 07/17/2018 05:22 PM, Michal Hocko wrote: > > On Tue 17-07-18 16:58:10, Mahesh Jagannath Salgaonkar wrote: > >> On 07/16/2018 01:56 PM, Michal Hocko wrote: > >>> On Mon 16-07-18 11:32:56, Mahesh J Salgaonkar wrote: > One of the primary issues with Firmware Assisted Dump (fadump) on Power > is that it needs a large amount of memory to be reserved. This reserved > memory is used for saving the contents of old crashed kernel's memory > before > fadump capture kernel uses old kernel's memory area to boot. However, > This > reserved memory area stays unused until system crash and isn't available > for production kernel to use. > >>> > >>> How much memory are we talking about. Regular kernel dump process needs > >>> some reserved memory as well. Why that is not a big problem? > >> > >> We reserve around 5% of total system RAM. On large systems with > >> TeraBytes of memory, this reservation can be quite significant. > >> > >> The regular kernel dump uses the kexec method to boot into capture > >> kernel and it can control the parameters that are being passed to > >> capture kernel. This allows a capability to strip down the parameters > >> that can help lowering down the memory requirement for capture kernel to > >> boot. This allows regular kdump to reserve less memory to start with. > >> > >> Where as fadump depends on power firmware (pHyp) to load the capture > >> kernel after full reset and boots like a regular kernel. It needs same > >> amount of memory to boot as the production kernel. On large systems > >> production kernel needs significant amount of memory to boot. Hence > >> fadump needs to reserve enough memory for capture kernel to boot > >> successfully and execute dump capturing operations. By default fadump > >> reserves 5% of total system RAM and in most cases this has worked > >> flawlessly on variety of system configurations. 
Optionally, > >> 'crashkernel=X' can also be used to specify more fine-tuned memory size > >> for reservation. > > > > So why do we even care about fadump when regular kexec provides > > (presumably) same functionality with a smaller memory footprint? Or is > > there any reason why kexec doesn't work well on ppc? > > Kexec based kdump is loaded by crashing kernel. When OS crashes, the > system is in an inconsistent state, especially the devices. In some > cases, a rogue DMA or ill-behaving device drivers can cause the kdump > capture to fail. > > On power platform, fadump solves these issues by taking help from power > firmware, to fully-reset the system, load the fresh copy of same kernel > to capture the dump with PCI and I/O devices reinitialized, making it > more reliable. Thanks for the clarification. > Fadump does full system reset, booting system through the regular boot > options i.e the dump capture kernel is booted in the same fashion and > doesn't have specialized kernel command line option. This implies, we > need to give more memory for the system boot. Since the new kernel boots > from the same memory location as crashed kernel, we reserve 5% of memory > where power firmware moves the crashed kernel's memory content. This > reserved memory is completely removed from the available memory. For > large memory systems like 64TB systems, this account to ~ 3TB, which is > a significant chunk of memory production kernel is deprived of. Hence, > this patch adds an improvement to exiting fadump feature to make the > reserved memory available to system for use, using zone movable. Is the 5% a reasonable estimate or more a ballpark number? I find it a bit strange to require 3TB of memory to boot a kernel just to dump the crashed kernel image. Shouldn't you rather look into this estimate than spreading ZONE_MOVABLE abuse? Larger systems need more memory to dump even with the regular kexec kdump but I have never seen any to use more than 1G or something like that. 
-- Michal Hocko SUSE Labs
Re: Improvements for the PS3
Hi Geoff, Frederik, On Thu, Jul 19, 2018 at 12:40 AM Geoff Levand wrote: > On 07/14/2018 09:49 AM, Fredrik Noring wrote: > > so I added a sleep with > > > > + msleep(1); I can't see where you added the sleep, but 10s seems excessive. If the real reason is the need to wait for an interrupt for ps3fb_sync_image(), then waiting for 40 ms should be sufficient? Or am I missing something? > > + > > return 0; > > > > et voilà, the screen came alive and the kernel panic was revealed! It seems > > the kernel panics so fast that the PS3 frame buffer is unprepared. This is, > > of course, very unfortunate because trying to debug the boot process without > > a screen or any other means of obtaining console text is quite difficult. > > We could add a fixed delay there, but I'd like to avoid waiting that > long on every boot. Why don't you add a kernel module_param named > something like ps3fb_delay that takes a value in milliseconds and a > default of zero. See: > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/ps3fb.c?h=v4.17#n260 > > > I suppose the problem is that it relies on interrupts for ps3fb_sync_image > > to regularly copy the image, hence without them the screen isn't updated to > > show kernel panics, etc. Perhaps one way to fix that is to implement the > > struct fb_tile_ops API, so that the console is synchronously updated? Would > > that be acceptable? > > I'm not sure if that would work or not. Maybe Geert is more familiar with > it. That sounds like a complex solution, slowing down the console a lot. What about letting ps3fb register a panic notifier to sync the screen, like hyperv_fb does? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds