[PATCH 3/3] USB: musb: dsps: propagate device-tree node
To be able to use DSPS-based controllers with device-tree descriptions
of the USB topology, we need to associate the glue device's device-tree
node with the child controller device.

Note that this can also be used to eventually let USB core manage
generic phys.

Also note that the other glue drivers will require similar changes to be
able to describe their buses in DT.

Signed-off-by: Johan Hovold
---
 drivers/usb/musb/musb_dsps.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/usb/musb/musb_dsps.c b/drivers/usb/musb/musb_dsps.c
index 6a60bc0490c5..23dba59045a7 100644
--- a/drivers/usb/musb/musb_dsps.c
+++ b/drivers/usb/musb/musb_dsps.c
@@ -786,6 +786,7 @@ static int dsps_create_musb_pdev(struct dsps_glue *glue,
 	musb->dev.parent = dev;
 	musb->dev.dma_mask = &musb_dmamask;
 	musb->dev.coherent_dma_mask = musb_dmamask;
+	device_set_of_node_from_dev(&musb->dev, &parent->dev);
 
 	glue->musb = musb;
 
-- 
2.17.0
[PATCH 2/3] USB: musb: host: prevent core phy initialisation
Set the new HCD flag which prevents USB core from trying to manage our
phys.

This is needed to be able to associate the controller platform device
with the glue device device-tree node on the BBB which uses legacy USB
phys. Otherwise, the generic phy lookup in usb_phy_roothub_init() and
thus HCD registration fails repeatedly with -EPROBE_DEFER (see commit
178a0bce05cb ("usb: core: hcd: integrate the PHY wrapper into the HCD
core")).

Note that a related phy-lookup issue was recently worked around in the
phy core by commit b7563e2796f8 ("phy: work around 'phys' references to
usb-nop-xceiv devices"). Something similar may now be needed for other
USB phys, and in particular if we eventually want to let USB core manage
musb generic phys.

Cc: Arnd Bergmann
Cc: Martin Blumenstingl
Signed-off-by: Johan Hovold
---
 drivers/usb/musb/musb_host.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/usb/musb/musb_host.c b/drivers/usb/musb/musb_host.c
index 3a8451a15f7f..4fa372c845e1 100644
--- a/drivers/usb/musb/musb_host.c
+++ b/drivers/usb/musb/musb_host.c
@@ -2754,6 +2754,7 @@ int musb_host_setup(struct musb *musb, int power_budget)
 	hcd->self.otg_port = 1;
 	musb->xceiv->otg->host = &hcd->self;
 	hcd->power_budget = 2 * (power_budget ? : 250);
+	hcd->skip_phy_initialization = 1;
 
 	ret = usb_add_hcd(hcd, 0, 0);
 	if (ret < 0)
-- 
2.17.0
Re: [PATCH v6 01/11] dt-bindings: firmware: Add bindings for ZynqMP firmware
On Tue, Apr 10, 2018 at 12:38:37PM -0700, Jolly Shah wrote:
> From: Rajan Vaja
>
> Add documentation to describe Xilinx ZynqMP firmware driver
> bindings. Firmware driver provides an interface to firmware
> APIs. Interface APIs can be used by any driver to communicate
> to PMUFW (Platform Management Unit).
>
> Signed-off-by: Rajan Vaja
> Signed-off-by: Jolly Shah
> ---
>  .../firmware/xilinx/xlnx,zynqmp-firmware.txt | 29 ++
>  1 file changed, 29 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/firmware/xilinx/xlnx,zynqmp-firmware.txt

Please add acks/reviewed-by's when posting new versions.

Rob
[PATCH] drm/vmwgfx: Fix scatterlist unmapping
dma_unmap_sg() should be called with the same number of entries
originally passed to dma_map_sg(), not the number it returned, which may
be fewer. Admittedly this driver probably never runs on non-coherent
architectures where getting that wrong could lead to data loss, but it's
always good to be correct, and it's trivially easy to fix by just
restoring the SG table state before the call instead of afterwards.

Signed-off-by: Robin Murphy
---

Found by inspection while poking around TTM users.

 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
index 2fd091f9..971223d39469 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
@@ -369,9 +369,9 @@ static void vmw_ttm_unmap_from_dma(struct vmw_ttm_tt *vmw_tt)
 {
 	struct device *dev = vmw_tt->dev_priv->dev->dev;
 
+	vmw_tt->sgt.nents = vmw_tt->sgt.orig_nents;
 	dma_unmap_sg(dev, vmw_tt->sgt.sgl, vmw_tt->sgt.nents,
 		DMA_BIDIRECTIONAL);
-	vmw_tt->sgt.nents = vmw_tt->sgt.orig_nents;
 }
 
 /**
-- 
2.16.1.dirty
[PATCH] cifs: smb2ops: Fix NULL check in smb2_query_symlink
The current code null checks variable err_buf, which is always null
when it is checked, hence utf16_path is free'd and the function
returns -ENOENT every time it is called, making it impossible for the
execution path to reach the following code:

err_buf = err_iov.iov_base;

Fix this by null checking err_iov.iov_base instead of err_buf. Also,
notice that err_buf no longer needs to be initialized to NULL.

Addresses-Coverity-ID: 1467876 ("Logically dead code")
Fixes: 2d636199e400 ("cifs: Change SMB2_open to return an iov for the error parameter")
Signed-off-by: Gustavo A. R. Silva
---
 fs/cifs/smb2ops.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index b4ae932..38ebf3f 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -1452,7 +1452,7 @@ smb2_query_symlink(const unsigned int xid, struct cifs_tcon *tcon,
 	struct cifs_open_parms oparms;
 	struct cifs_fid fid;
 	struct kvec err_iov = {NULL, 0};
-	struct smb2_err_rsp *err_buf = NULL;
+	struct smb2_err_rsp *err_buf;
 	struct smb2_symlink_err_rsp *symlink;
 	unsigned int sub_len;
 	unsigned int sub_offset;
@@ -1476,7 +1476,7 @@ smb2_query_symlink(const unsigned int xid, struct cifs_tcon *tcon,
 
 	rc = SMB2_open(xid, &oparms, utf16_path, &oplock, NULL, &err_iov);
 
-	if (!rc || !err_buf) {
+	if (!rc || !err_iov.iov_base) {
 		kfree(utf16_path);
 		return -ENOENT;
 	}
-- 
2.7.4
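
For illustration only, the pattern this patch fixes can be reproduced in
a small user-space sketch. Everything named sketch_* below is a
hypothetical stand-in and not the cifs API; the only point is that the
check after the call has to look at the buffer the callee filled in,
not at a local pointer that has not been assigned yet.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kvec: a base pointer plus a length. */
struct sketch_iov {
	void *base;
	size_t len;
};

/*
 * Hypothetical stand-in for an open call that always fails and hands
 * back an error payload (possibly NULL) through 'err'.
 */
static int sketch_open(struct sketch_iov *err, void *payload, size_t len)
{
	err->base = payload;
	err->len = len;
	return -EIO;
}

/*
 * Returns 0 and stores the error payload in *out when the open failed
 * and a payload is available, -ENOENT otherwise.
 */
static int sketch_query(void *payload, size_t len, void **out)
{
	struct sketch_iov err_iov = { NULL, 0 };
	int rc = sketch_open(&err_iov, payload, len);

	/* Test what the callee filled in, not an unassigned local. */
	if (!rc || !err_iov.base)
		return -ENOENT;

	*out = err_iov.base;
	return 0;
}
```

With a non-NULL payload sketch_query() reaches the assignment; with the
old style of check (testing a local that is still NULL) it never would.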
Re: Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
On Fri, Apr 13, 2018 at 03:22:37PM +0200, Jesper Dangaard Brouer wrote: > Hi Peter, > > Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke build > for several samples/bpf programs. I'm unsure what the best way forward > is to unbreak these... > > The issue is that these samples are build with LLVM/clang (which > doesn't like 'asm goto' constructs). And they end up including > arch/x86/include/asm/cpufeature.h via a long include path, see build > examples below (through different path to include/linux/thread_info.h). > > Maybe Alexei or Daniel have an idea how to work around this? > As tools/testing/selftests/bpf/ does not seem to fail!? Right. All of bpf tracing and samples/bpf/ broke. Here is the proposed fix that we're asking Peter to apply and send to Linus asap. https://lkml.org/lkml/2018/4/10/825 > Build error#1: > -- > clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include > -I./arch/x86/include -I./arch/x86/include/generated -I./include > -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi > -I./include/uapi -I./include/generated/uapi -include > ./include/linux/kconfig.h -Isamples/bpf \ > -I./tools/testing/selftests/bpf/ \ > -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \ > -D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \ > -Wno-gnu-variable-sized-type-not-at-end \ > -Wno-address-of-packed-member -Wno-tautological-compare \ > -Wno-unknown-warning-option \ > -O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf > -filetype=obj -o samples/bpf/sockex2_kern.o > In file included from samples/bpf/sockex2_kern.c:3: > In file included from ./include/uapi/linux/in.h:24: > In file included from ./include/linux/socket.h:8: > In file included from ./include/linux/uio.h:13: > In file included from ./include/linux/thread_info.h:38: > In file included from ./arch/x86/include/asm/thread_info.h:53: > ./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are > not supported yet > 
asm_volatile_goto("1: jmp 6f\n" > ^ > ./include/linux/compiler-gcc.h:290:42: note: expanded from macro > 'asm_volatile_goto' > #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
Re: [PATCH v7 11/26] of: base: Add of_get_cpu_state_node() to get idle states for a CPU node
On Thu, Apr 12, 2018 at 6:14 AM, Ulf Hansson wrote:
> The CPU's idle state nodes are currently parsed at the common cpuidle DT
> library, but also when initializing back-end data for the arch specific CPU
> operations, as in the PSCI driver case.
>
> To avoid open-coding, let's introduce of_get_cpu_state_node(), which takes
> the device node for the CPU and the index to the requested idle state node,
> as in-parameters. In case a corresponding idle state node is found, it
> returns the node with the refcount incremented for it, else it returns
> NULL.
>
> Moreover, for ARM, there are two generic methods, to describe the CPU's
> idle states, either via the flattened description through the
> "cpu-idle-states" binding [1] or via the hierarchical layout, using the
> "power-domains" and the "domain-idle-states" bindings [2]. Hence, let's
> take both options into account.
>
> [1]
> Documentation/devicetree/bindings/arm/idle-states.txt
> [2]
> Documentation/devicetree/bindings/arm/psci.txt
>
> Cc: Rob Herring
> Cc: devicet...@vger.kernel.org
> Cc: Lina Iyer
> Suggested-by: Sudeep Holla
> Co-developed-by: Lina Iyer
> Signed-off-by: Ulf Hansson
> ---
>  drivers/of/base.c  | 35 +++
>  include/linux/of.h |  8 
>  2 files changed, 43 insertions(+)

Some reason you didn't add my Reviewed-by from v6?

Rob
Re: [PATCH] mmap.2: MAP_FIXED is okay if the address range has been reserved
On Fri, Apr 13, 2018 at 8:49 AM, Michal Hocko wrote:
> On Fri 13-04-18 08:43:27, Michael Kerrisk wrote:
> [...]
>> So, you mean remove this entire paragraph:
>>
>>     For cases in which the specified memory region has not been
>>     reserved using an existing mapping, newer kernels (Linux
>>     4.17 and later) provide an option MAP_FIXED_NOREPLACE that
>>     should be used instead; older kernels require the caller to
>>     use addr as a hint (without MAP_FIXED) and take appropriate
>>     action if the kernel places the new mapping at a different
>>     address.
>>
>> It seems like some version of the first half of the paragraph is worth
>> keeping, though, so as to point the reader in the direction of a remedy.
>> How about replacing that text with the following:
>>
>>     Since Linux 4.17, the MAP_FIXED_NOREPLACE flag can be used
>>     in a multithreaded program to avoid the hazard described
>>     above.
>
> Yes, that sounds reasonable to me.

But that kind of sounds as if you can't avoid it before Linux 4.17,
when actually, you just have to call mmap() with the address as hint,
and if mmap() returns a different address, munmap() it and go on your
normal error path.
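
A minimal sketch of the hint-based approach described above, assuming an
anonymous private mapping is wanted. The helper name map_exactly_at() is
made up for this example, and reporting EEXIST is a choice made here to
mirror what MAP_FIXED_NOREPLACE returns on a collision; it is not
something mmap() does on its own for a plain hint.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Try to create an anonymous private mapping exactly at 'hint' without
 * MAP_FIXED, so that an existing mapping is never replaced.
 * Returns the mapping on success, or NULL with errno set.
 */
static void *map_exactly_at(void *hint, size_t len)
{
	void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;	/* errno was set by mmap() */

	if (p != hint) {
		/* The kernel placed it elsewhere: undo and report failure. */
		munmap(p, len);
		errno = EEXIST;
		return NULL;
	}

	return p;
}
```

A caller treats a NULL return like any other allocation failure. Unlike
MAP_FIXED, a mapping that already occupies the requested range is left
untouched.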
Re: [PATCH] iommu: amd: hide unused iommu_table_lock
On 2018-04-04 12:56:59 [+0200], Arnd Bergmann wrote:
> The newly introduced lock is only used when CONFIG_IRQ_REMAP is enabled:
>
> drivers/iommu/amd_iommu.c:86:24: error: 'iommu_table_lock' defined but not used [-Werror=unused-variable]
>  static DEFINE_SPINLOCK(iommu_table_lock);
>
> This moves the definition next to the user, within the #ifdef protected
> section of the file.
>
> Fixes: ea6166f4b83e ("iommu/amd: Split irq_lookup_table out of the amd_iommu_devtable_lock")
> Signed-off-by: Arnd Bergmann

Acked-by: Sebastian Andrzej Siewior

Thank you Arnd.

Sebastian
Re: [PATCH] tools build: Use -Xpreprocessor instead of -Wp and leave pathnames intact
On Fri, Apr 13, 2018 at 02:53:10PM +0100, Will Deacon wrote: > Build.include invokes the pre-processor via GCC in order to generate a > dependency list for the input file. Since these options are passed using > '-Wp,-M...,$(depfile)' it is important that $(depfile) does not contain > any commas, so these are substituted with underscores. This substitution > will break the build if the directory name of the output directory happens > to include a comma, e.g. when using "aiaiai" for bisection testing: > > | cc1: fatal error: x86/tools/objtool/fixdep.o: No such file or directory > | compilation terminated. > | cat: > /tmp/aiaiai-test-patchset.qroS/before/obj.defconfig_x86/tools/objtool/.fixdep.o.d: > No such file or directory > | make[5]: *** [tools/objtool/fixdep.o] Error 1 > > We can address this by using -Xpreprocessor instead of -Wp, which allows > us to pass down an unmodified pathname. > > Cc: Jiri Olsa > Cc: Dave Martin > Cc: Arnaldo Carvalho de Melo > Cc: Ingo Molnar > Signed-off-by: Will Deacon > --- > > As an aside, the way we currently pass the depfile to -MD appears to be > in direct contradiction with the preprocessor documentation, although it > does work with the cc1 implementation. Hmmm, I try cc1 --help, and it gives ... -I -M -MD -MF -MG -MM ... so it looks like even cc1 shouldn't really be parsing a depfile name argument after -MD. The only way to get -MD parsed in the undocumented way seems to be with gcc -Wp,-MD,... or direct invocation of cc1. The cpp frontend, and the gcc frontend itself seem to follow the documentation and don't parse as the depfile name here: [...] > diff --git a/tools/build/Build.include b/tools/build/Build.include We should probably address this everywhere when we've figured out what to do. 
> index 418871d02ebf..e1914f8e2328 100644 > --- a/tools/build/Build.include > +++ b/tools/build/Build.include > @@ -22,9 +22,7 @@ dot-target = $(dir $@).$(notdir $@) > basetarget = $(basename $(notdir $@)) > > ### > -# The temporary file to save gcc -MD generated dependencies must not > -# contain a comma > -depfile = $(subst $(comma),_,$(dot-target).d) > +depfile = $(dot-target).d > > ### > # Check if both arguments has same arguments. Result is empty string if > equal. > @@ -89,12 +87,12 @@ if_changed = $(if $(strip $(any-prereq) $(arg-check)), >\ > # - per target C flags > # - per object C flags > # - BUILD_STR macro to allow '-D"$(variable)"' constructs > -c_flags_1 = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CFLAGS) -D"BUILD_STR(s)=\#s" > $(CFLAGS_$(basetarget).o) $(CFLAGS_$(obj)) > +c_flags_1 = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor -MT > -Xpreprocessor $@ $(CFLAGS) -D"BUILD_STR(s)=\#s" $(CFLAGS_$(basetarget).o) > $(CFLAGS_$(obj)) > c_flags_2 = $(filter-out $(CFLAGS_REMOVE_$(basetarget).o), $(c_flags_1)) > c_flags = $(filter-out $(CFLAGS_REMOVE_$(obj)), $(c_flags_2)) > -cxx_flags = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CXXFLAGS) -D"BUILD_STR(s)=\#s" > $(CXXFLAGS_$(basetarget).o) $(CXXFLAGS_$(obj)) > +cxx_flags = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor -MT > -Xpreprocessor $@ $(CXXFLAGS) -D"BUILD_STR(s)=\#s" > $(CXXFLAGS_$(basetarget).o) $(CXXFLAGS_$(obj)) > > ### > ## HOSTCC C flags > > -host_c_flags = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CHOSTFLAGS) > -D"BUILD_STR(s)=\#s" $(CHOSTFLAGS_$(basetarget).o) $(CHOSTFLAGS_$(obj)) > +host_c_flags = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor > -MT -Xpreprocessor $@ $(CHOSTFLAGS) -D"BUILD_STR(s)=\#s" > $(CHOSTFLAGS_$(basetarget).o) $(CHOSTFLAGS_$(obj)) Any idea why we use -Wp here other than as a bug compatibility hack? The gcc/clang support the depfile options directly. It's possible that gcc didn't support them, or didn't support -MF, sometime in the distant past. 
This use in the kernel makefiles predates git. I'm wondering whether we should actually switch to using -M -MF, or -MD -MF (strictly without -Wp or -Xpreprocessor) rather than relying on a combination of undocumented interactions between -Wp and cc1, and cc1 violating its own documentation. Cheers ---Dave
Re: [PATCH 0/2] drm: Make it compilable without CONFIG_HDMI and CONFIG_I2C
On Fri, Apr 13, 2018 at 4:46 PM, Thomas Huth wrote:
> On 13.04.2018 16:32, Daniel Vetter wrote:
>> On Fri, Apr 13, 2018 at 11:40 AM, Thomas Huth wrote:
>>> By enabling the DRM code for virtio-gpu on S390, you currently also get
>>> all the code that is enabled by CONFIG_HDMI and CONFIG_I2C automatically.
>>> This is quite ugly, since on S390, there is no HDMI and no I2C. Thus it
>>> would be great if the DRM code could also be compiled without CONFIG_HDMI
>>> and CONFIG_I2C. These two patches now refactor the DRM code a little bit
>>> so that we can compile it also without CONFIG_HDMI and CONFIG_I2C.
>>>
>>> Thomas Huth (2):
>>>   drivers/gpu/drm: Move CONFIG_HDMI-dependent code to a separate file
>>>   drivers/gpu/drm: Make the DRM code compilable without CONFIG_I2C
>>
>> What's the benefit? Why does I2C/HDMI hurt you?
>
> Why should I be forced to compile-in subsystems that do not make any
> sense on this architecture? It's just completely weird to see CONFIG_I2C
> enabled on s390x.

"Looks weird" is not really a good engineering criterion, especially in
graphics :-)

For context: In DRM almost nothing is optional, and it greatly
simplifies life and coding. We don't have epic amounts of #ifdef battles
to make trivial code changes compile, except in all the places where
external stuff is optional (like backlight).

So making something optional will have a pretty clear cost on the drm
subsystem, and it doesn't make sense to pay that cost to "look less
weird". To get this merged we need some clear benefits, which will
balance out the inevitable cost of having to maintain this forever (and
most likely getting yelled at by Linus for making some rando compile
config no longer work).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
On Fri, Apr 13, 2018 at 3:15 PM, Pablo Neira Ayuso wrote:
> On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
>> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso wrote:
>> > Hi Arnd,
>> >
>> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and
>> >> CONFIG_NF_REJECT_IPV6=m
>> >
>> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
>> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
>> > and CONFIG_NF_REJECT_IPV6=m.
>> >
>> > I mean, just like we do with NFT_FIB_INET.
>>
>> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
>> again, so that code gets built as a loadable module if
>> CONFIG_NF_REJECT_IPV6=m.
>>
>> > BTW, I think this problem has been is not related to the recent patch,
>> > but something older that kbuild robot has triggered more easily for
>> > some reason?
>>
>> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
>> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to
>> restricted to a loadable module with IPV6=m, but can now be
>> built-in, which causes that link error.
>
> Still one more spin on this, I would like to see if we have a way to
> fix this by simplifying things a bit.
>
> Would this one I'm attaching would work?

One disadvantage is that it makes the vmlinux bigger since
NF_REJECT_IPV{4,6} can no longer be a module at all now.

I suspect you also still get a link error with IPV6=m, this time because
the nf_reject_ipv6.o file fails to link against the ipv6 code, e.g.
ipv6_skip_exthdr() and icmpv6_send() appear to be unreachable here.
I haven't tried that though, so I might be missing something.

Arnd
Re: [RFC PATCH 16/35] ovl: readd lsattr/chattr support
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi wrote: > Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS. > > Needs vfs_ioctl() exported to modules. > > Signed-off-by: Miklos Szeredi > --- > fs/internal.h | 1 - > fs/ioctl.c | 1 + > fs/overlayfs/file.c | 59 > + > include/linux/fs.h | 2 ++ > 4 files changed, 62 insertions(+), 1 deletion(-) > > diff --git a/fs/internal.h b/fs/internal.h > index 3319bf39e339..d5108d9c6a2f 100644 > --- a/fs/internal.h > +++ b/fs/internal.h > @@ -176,7 +176,6 @@ extern const struct dentry_operations > ns_dentry_operations; > */ > extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd, > unsigned long arg); > -extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long > arg); > > /* > * iomap support: > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 5ace7efb0d04..696f4c46a868 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, > unsigned long arg) > out: > return error; > } > +EXPORT_SYMBOL(vfs_ioctl); > > static int ioctl_fibmap(struct file *filp, int __user *p) > { > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > index 05e3e2f80b89..cc004ff1b05b 100644 > --- a/fs/overlayfs/file.c > +++ b/fs/overlayfs/file.c > @@ -8,6 +8,7 @@ > > #include > #include > +#include > #include > #include > #include "overlayfs.h" > @@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t > offset, loff_t len) > return ret; > } > > +static long ovl_real_ioctl(struct file *file, unsigned int cmd, > + unsigned long arg) > +{ > + struct fd real; > + const struct cred *old_cred; > + long ret; > + > + ret = ovl_real_file(file, &real); > + if (ret) > + return ret; > + > + old_cred = ovl_override_creds(file_inode(file)->i_sb); > + ret = vfs_ioctl(real.file, cmd, arg); > + revert_creds(old_cred); > + > + fdput(real); > + > + return ret; > +} > + > +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg) > +{ > + long ret; > + 
struct inode *inode = file_inode(file);
> +
> +	switch (cmd) {
> +	case FS_IOC_GETFLAGS:
> +		ret = ovl_real_ioctl(file, cmd, arg);
> +		break;
> +
> +	case FS_IOC_SETFLAGS:
> +		if (!inode_owner_or_capable(inode))
> +			return -EACCES;
> +
> +		ret = mnt_want_write_file(file);
> +		if (ret)
> +			return ret;
> +
> +		ret = ovl_copy_up(file_dentry(file));
> +		if (!ret) {
> +			ret = ovl_real_ioctl(file, cmd, arg);
> +
> +			inode_lock(inode);
> +			ovl_copyflags(ovl_inode_real(inode), inode);
> +			inode_unlock(inode);
> +		}
> +
> +		mnt_drop_write_file(file);
> +		break;
> +
> +	default:
> +		ret = -ENOTTY;

I am wondering out loud.
This is a change of behavior: fs-specific ioctls can no longer be
executed on an overlay file. Arguably a good change, but applications
may have come to depend on the old behavior.

Would it be better to make this change opt-in via a more generic
config/mount option, for example "consistent_fd" instead of
"copy_up_shared", so that we can choose whether or not to pass unknown
ioctls through to the real file?

I know we removed the want_write_file() protection from VFS, but still,
pass through of ioctls was the legacy behavior.

Thoughts? I don't mind waiting to see if someone shouts.

Thanks,
Amir.
Re: [PATCH 2/2] iio: afe: unit-converter: add support for adi,lt6106
On 04/12/2018 05:31 PM, Peter Rosin wrote: > On 2018-04-12 17:35, Andrew F. Davis wrote: >> On 04/12/2018 09:29 AM, Peter Rosin wrote: >>> On 2018-04-11 18:13, Andrew F. Davis wrote: On 04/11/2018 10:51 AM, Lars-Peter Clausen wrote: > On 04/11/2018 05:43 PM, Andrew F. Davis wrote: >> On 04/11/2018 09:15 AM, Peter Rosin wrote: >>> This is a current sense amplifier from Analog Devices. >>> >>> Signed-off-by: Peter Rosin >>> --- >>> drivers/iio/afe/Kconfig | 3 +- >>> drivers/iio/afe/iio-unit-converter.c | 54 >>> >>> 2 files changed, 56 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/iio/afe/Kconfig b/drivers/iio/afe/Kconfig >>> index 642ce4eb12a6..0e10fe8f459a 100644 >>> --- a/drivers/iio/afe/Kconfig >>> +++ b/drivers/iio/afe/Kconfig >>> @@ -10,7 +10,8 @@ config IIO_UNIT_CONVERTER >>> depends on OF || COMPILE_TEST >>> help >>> Say yes here to build support for the IIO unit converter >>> - that handles voltage dividers and current sense shunts. >>> + that handles voltage dividers, current sense shunts and >>> + the LT6106 Current Sense Amplifier from Analog Devices. >> >> Could work better to split these out into separate drivers. Maybe a >> iio-shunt-resistor.c that does just voltage->current with the >> appropriate scaling. Then make a a separate lt6106.c. > > I don't think we need a separate driver here. There are tons of circuits > that all work the same way and all require the same properties. If we'd > add > a driver for each of them we'd get buried in boilerplate code. > Fair enough, then it should at least be renamed to something generic like current-sense-amplifier, as you said lots of circuits do this, not just lt6106s. We will have then have support for: current-sense-amplifier current-sense-shunt voltage-divider >>> >>> For the compatible "current-sense-amplifier", I would advocate the >>> properties... 
>>> >>> sense-resistor-micro-ohms >>> sense-gain >>> >>> (or something close to that) >>> >>> ...and not input-resistor-ohms and output-resistor-ohms which are way >>> more particular to the LT6106. >>> >>> But as I said in the cover letter, I didn't go with sense-gain since I >>> thought I would end up with requests for non-integer gains. There is >>> yet to be a comment on the non-integer gain problem, and before there >>> is a path forward for that case, I'm reluctant. >>> >> >> Why not similar to what you had before with the resistor: >> >> sense-gain-multiplier >> sense-gain-divider >> >> if either are missing assume they are 1. > > Hmm, how about sense-gain for the normal integer case, and then divide > by sense-attenuation if needed? I.e. exactly the same functionality as > you describe, just different names. > I like these names, but I think gain/attenuation sound very analog and I would be tempted to assume they are floating point numbers or the units are logarithmic (dB). To prevent any more needless bike-shedding on my part I'd like to say either yours, mine, or Lars-Peter's suggestion all work for me. compatibles in this driver called "unit-converter" which is still a misnomer IMHO. >>> >>> I don't remember you having presented your preference, and I think >>> that goes against the established bike-shedding protocol? >>> >> >> True, how about "current-sense-from-voltage" ? > > Doesn't cover "voltage-divider" (and we don't need separate drivers > doing the exact same calculations, that's a maintenance nightmare). > The driver name doesn't have to cover every use, just more than the other name. > Cheers, > Peter >
Re: [PATCH 0/2] drm: Make it compilable without CONFIG_HDMI and CONFIG_I2C
On 13.04.2018 16:32, Daniel Vetter wrote:
> On Fri, Apr 13, 2018 at 11:40 AM, Thomas Huth wrote:
>> By enabling the DRM code for virtio-gpu on S390, you currently also get
>> all the code that is enabled by CONFIG_HDMI and CONFIG_I2C automatically.
>> This is quite ugly, since on S390, there is no HDMI and no I2C. Thus it
>> would be great if the DRM code could also be compiled without CONFIG_HDMI
>> and CONFIG_I2C. These two patches now refactor the DRM code a little bit
>> so that we can compile it also without CONFIG_HDMI and CONFIG_I2C.
>>
>> Thomas Huth (2):
>>   drivers/gpu/drm: Move CONFIG_HDMI-dependent code to a separate file
>>   drivers/gpu/drm: Make the DRM code compilable without CONFIG_I2C
>
> What's the benefit? Why does I2C/HDMI hurt you?

Why should I be forced to compile-in subsystems that do not make any
sense on this architecture? It's just completely weird to see CONFIG_I2C
enabled on s390x.

 Thomas
Re: [PATCH 2/6] tracing: Add trace event error log
On Fri, 13 Apr 2018 09:24:34 -0500
Tom Zanussi wrote:

> Yeah, I agree - I'd rather get it right than get it in now. I thought
> this made sense, and was based on input from Masami, which I may have
> misinterpreted, but I'll wait for some more ideas about the best way to
> do this.

Too bad we are not closer to November, as this would actually be a good
Plumbers topic. Maybe it's not that important and we should wait until
then. I'd like to get some brainstorming ideas out before we decide on
anything, and this is something I believe is better done face to face
than over email.

-- Steve
Re: [PATCH] ath10k: search all IEs for variant before falling back
Kalle Valo writes: > Thomas Hebb writes: > >> commit f2593cb1b291 ("ath10k: Search SMBIOS for OEM board file >> extension") added a feature to ath10k that allows Board Data File >> (BDF) conflicts between multiple devices that use the same device IDs >> but have different calibration requirements to be resolved by allowing >> a "variant" string to be stored in SMBIOS [and later device tree, added >> by commit d06f26c5c8a4 ("ath10k: search DT for qcom,ath10k-calibration- >> variant")] that gets appended to the ID stored in board-2.bin. >> >> This original patch had a regression, however. Namely that devices with >> a variant present in SMBIOS that didn't need custom BDFs could no longer >> find the default BDF, which has no variant appended. The patch was >> reverted and re-applied with a fix for this issue in commit 1657b8f84ed9 >> ("search SMBIOS for OEM board file extension"). >> >> But the fix to fall back to a default BDF introduced another issue: the >> driver currently parses IEs in board-2.bin one by one, and for each one >> it first checks to see if it matches the ID with the variant appended. >> If it doesn't, it checks to see if it matches the "fallback" ID with no >> variant. If a matching BDF is found at any point during this search, the >> search is terminated and that BDF is used. The issue is that it's very >> possible (and is currently the case for board-2.bin files present in the >> ath10k-firmware repository) for the default BDF to occur in an earlier >> IE than the variant-specific BDF. In this case, the current code will >> happily choose the default BDF even though a better-matching BDF is >> present later in the file. >> >> This patch fixes the issue by first searching the entire file for the ID >> with variant, and searching for the fallback ID only if that search >> fails. 
It also includes some code cleanup in the area, as >> ath10k_core_fetch_board_data_api_n() no longer does its own string >> mangling to remove the variant from an ID, instead leaving that job to a >> new flag passed to ath10k_core_create_board_name(). >> >> I've tested this patch on a QCA4019 and verified that the driver behaves >> correctly for 1) both fallback and variant BDFs present, 2) only fallback >> BDF present, and 3) no matching BDFs present. >> >> Fixes: 1657b8f84ed9 ("ath10k: search SMBIOS for OEM board file extension") >> Signed-off-by: Thomas Hebb > > BTW, you forgot to CC linux-wireless so I don't see this in patchwork. > > https://wireless.wiki.kernel.org/en/users/drivers/ath10k/submittingpatches I submitted v2 so that I see it in patchwork: https://patchwork.kernel.org/patch/10340241/ -- Kalle Valo
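
The search order described in the quoted commit message can be sketched
in plain C as below. This is not the ath10k code: the helper names and
the board-name strings are invented for the example, and the IEs in
board-2.bin are reduced to an array of names. It only contrasts the old
per-entry matching with the two-pass search the patch describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the first entry equal to 'name', or -1. */
static int sketch_find(const char *const *names, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(names[i], name) == 0)
			return (int)i;
	return -1;
}

/*
 * Old behavior, for contrast: each entry is compared against both IDs,
 * so an earlier default entry wins over a later variant entry.
 */
static int sketch_pick_one_pass(const char *const *names, size_t n,
				const char *with_variant, const char *fallback)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(names[i], with_variant) == 0 ||
		    strcmp(names[i], fallback) == 0)
			return (int)i;
	return -1;
}

/*
 * Two-pass search: scan everything for the variant ID first and only
 * then fall back to the ID without a variant.
 */
static int sketch_pick_two_pass(const char *const *names, size_t n,
				const char *with_variant, const char *fallback)
{
	int idx = sketch_find(names, n, with_variant);

	if (idx < 0)
		idx = sketch_find(names, n, fallback);
	return idx;
}
```

When the default entry is stored before the variant-specific one, the
one-pass version returns the default while the two-pass version returns
the variant entry, which is the difference the patch is about.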
Re: [PATCH 3/3] dcache: account external names as indirectly reclaimable memory
On Fri, Apr 13, 2018 at 04:28:21PM +0200, Michal Hocko wrote:
> On Fri 13-04-18 16:20:00, Vlastimil Babka wrote:
> > We would need kmalloc-reclaimable-X variants. It could be worth it,
> > especially if we find more similar usages. I suspect they would be more
> > useful than the existing dma-kmalloc-X :)
>
> I am still not sure why __GFP_RECLAIMABLE cannot be made work as
> expected and account slab pages as SLAB_RECLAIMABLE

Can you outline how this would work without separate caches?
Re: Some minor fixes for perf user tools
On Fri, Apr 13, 2018 at 03:13:09PM +0200, Jiri Olsa wrote:
> On Fri, Apr 06, 2018 at 01:38:08PM -0700, Andi Kleen wrote:
> > This patchkit fixes some random minor issues in the perf user tools
>
> Acked-by: Jiri Olsa

Thanks, applied.

- Arnaldo
Re: [PATCH 0/2] drm: Make it compilable without CONFIG_HDMI and CONFIG_I2C
On Fri, Apr 13, 2018 at 11:40 AM, Thomas Huth wrote: > By enabling the DRM code for virtio-gpu on S390, you currently also get > all the code that is enabled by CONFIG_HDMI and CONFIG_I2C automatically. > This is quite ugly, since on S390, there is no HDMI and no I2C. Thus it > would be great if the DRM code could also be compiled without CONFIG_HDMI > and CONFIG_I2C. These two patches now refactor the DRM code a little bit > so that we can compile it also without CONFIG_HDMI and CONFIG_I2C. > > Thomas Huth (2): > drivers/gpu/drm: Move CONFIG_HDMI-dependent code to a separate file > drivers/gpu/drm: Make the DRM code compilable without CONFIG_I2C What's the benefit? Why does I2C/HDMI hurt you? Note that you still can't compile out DP code, and the DRM legacy code, and that's much bigger ... -Daniel > > drivers/gpu/drm/Kconfig | 6 +- > drivers/gpu/drm/Makefile| 17 ++-- > drivers/gpu/drm/drm_crtc_internal.h | 2 + > drivers/gpu/drm/drm_edid.c | 173 ++ > drivers/gpu/drm/drm_hdmi.c | 182 > > 5 files changed, 206 insertions(+), 174 deletions(-) > create mode 100644 drivers/gpu/drm/drm_hdmi.c > > -- > 1.8.3.1 > > ___ > dri-devel mailing list > dri-de...@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Re: [PATCH 3/3] dcache: account external names as indirectly reclaimable memory
On Fri 13-04-18 16:20:00, Vlastimil Babka wrote: > On 04/13/2018 03:59 PM, Michal Hocko wrote: > > On Fri 13-04-18 22:35:19, Minchan Kim wrote: > >> On Mon, Mar 05, 2018 at 01:37:43PM +, Roman Gushchin wrote: > > [...] > >>> @@ -1614,9 +1623,11 @@ struct dentry *__d_alloc(struct super_block *sb, > >>> const struct qstr *name) > >>> name = &slash_name; > >>> dname = dentry->d_iname; > >>> } else if (name->len > DNAME_INLINE_LEN-1) { > >>> - size_t size = offsetof(struct external_name, name[1]); > >>> - struct external_name *p = kmalloc(size + name->len, > >>> - GFP_KERNEL_ACCOUNT); > >>> + struct external_name *p; > >>> + > >>> + reclaimable = offsetof(struct external_name, name[1]) + > >>> + name->len; > >>> + p = kmalloc(reclaimable, GFP_KERNEL_ACCOUNT); > >> > >> Can't we use kmem_cache_alloc with own cache created with > >> SLAB_RECLAIM_ACCOUNT > >> if they are reclaimable? > > > > No, because names have different sizes and so we would basically have to > > duplicate many caches. > > We would need kmalloc-reclaimable-X variants. It could be worth it, > especially if we find more similar usages. I suspect they would be more > useful than the existing dma-kmalloc-X :) I am still not sure why __GFP_RECLAIMABLE cannot be made work as expected and account slab pages as SLAB_RECLAIMABLE -- Michal Hocko SUSE Labs
Re: [PATCH 2/6] tracing: Add trace event error log
Hi Steve, On Fri, 2018-04-13 at 09:45 -0400, Steven Rostedt wrote: > On Thu, 12 Apr 2018 18:52:13 -0500 > Tom Zanussi wrote: > > > Hi Steve, > > > > On Thu, 2018-04-12 at 18:20 -0400, Steven Rostedt wrote: > > > On Thu, 12 Apr 2018 10:13:17 -0500 > > > Tom Zanussi wrote: > > > > > > > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h > > > > index 6fb46a0..f2dc7e6 100644 > > > > --- a/kernel/trace/trace.h > > > > +++ b/kernel/trace/trace.h > > > > @@ -1765,6 +1765,9 @@ extern ssize_t trace_parse_run_command(struct > > > > file *file, > > > > const char __user *buffer, size_t count, loff_t *ppos, > > > > int (*createfn)(int, char**)); > > > > > > > > +extern void event_log_err(const char *loc, const char *cmd, const char > > > > *fmt, > > > > + ...); > > > > + > > > > /* > > > > * Normal trace_printk() and friends allocates special buffers > > > > * to do the manipulation, as well as saves the print formats > > > > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c > > > > index 05c7172..fd02e22 100644 > > > > --- a/kernel/trace/trace_events.c > > > > +++ b/kernel/trace/trace_events.c > > > > @@ -1668,6 +1668,164 @@ static void ignore_task_cpu(void *data) > > > > return ret; > > > > } > > > > > > > > +#define EVENT_LOG_ERRS_MAX (PAGE_SIZE / sizeof(struct > > > > event_log_err)) > > > > > > > +#define EVENT_ERR_LOG_MASK (EVENT_LOG_ERRS_MAX - 1) > > > > > > BTW, the above only works if EVENT_LOG_ERRS_MAX is a power of two, > > > which it's not guaranteed to be. > > > > > > > My assumption was that we'd only ever need a page or two for the > > error_log and so would always would be a power of two, since the size of > > the struct event_log_err is 512. > > Assumptions are not what we want to rely on. There should be something > like: > > BUILD_BUG_ON(EVENT_LOG_ERRS_MAX & EVENT_ERR_LOG_MASK); > > Which would guarantee that your assumption is correct otherwise the > kernel wont build. > OK. 
> > > > > Anyway, I should probably have put comments about all this in the code, > > and I will, but the way it works kind of assumes a very small number of > > errors - it's replacing a simple 'last error' facility for the hist > > triggers and making it a common facility for other things that have > > similar needs like Masami's kprobe_events errors. For those purposes, I > > assumed it would suffice to simply be able to show that last 8 or some > > similar small number of errors and constantly recycle the slots. > > The errors are still in the files that have the errors right? Perhaps > just have a file that lists the files that contain errors. That way if > something goes wrong, you can examine that file and then look at the > file that contains the error? > No, that's part of the motivation for this change - currently there is just one last 'last error', the output tacked onto whichever event's hist file you read (normally this would be the one you just got the error for, but doesn't have to be) - there isn't a last error per event. Masami of course found this unintuitive, which it is, I agree, and wanted a single file (error_log) to look into for the last error. In addition, it should have a logging interface that any trace event command could use, such as kprobe_events. > And I'm not sure it being in the events directory is the best place > either, especially, if you plan to have it handle kprobe_events because > that's not in the events directory. > Yeah, I put it there because it's associated with trace events - putting it in tracing/ would imply that it's meant for ftrace in general (which maybe it should be but this isn't). Actually I'm not sure kprobe_events shouldn't be in tracing/events too.. > > > > Basically it just splits the page into 16 strings, 2 per error, one for > > the actual error text, the other for the command the user entered. The > > struct event_log_err just overlays a struct on top of 2 strings just to > > make it easier to manage. 
> > > > Anyway, because it is such a small number, and we start with a zeroed > > page, whenever we print the error log, we print all 16 strings even if > > we only have one error (2 strings). The rest are NULL and print > > nothing. We start with the tail, which could also be thought of as the > > 'oldest' or the 'first' error in the buffer and just cycle through them > > all. Hope that clears up some of the other questions you had about how > > a non-full log gets printed, etc... > > OK, I was thinking a NULL entry would return NULL, but we are > returning a pointer to NULL. That's where I missed it. > > > > > > > + > > > > +struct event_log_err { > > > > + charerr[MAX_FILTER_STR_VAL]; > > > > + charcmd[MAX_FILTER_STR_VAL]; > > > > +}; > > > > > > I like the event_log_err idea, but the above can be shrunk to: > > > > > > struct err_info { > > > u
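The power-of-two requirement discussed in this thread can be sketched in plain userspace C. This is only an illustration under stated assumptions: the toy_* names, the 4096-byte page and the 256-byte strings are hypothetical stand-ins (chosen so that one entry is 512 bytes and a page holds 8 entries, as described above), and _Static_assert plays the role that BUILD_BUG_ON would play in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* assumed page size */
#define TOY_STR_MAX   256u	/* assumed string size, two strings per entry */

/* Hypothetical stand-in for the error-log entry: 512 bytes in total. */
struct toy_log_err {
	char err[TOY_STR_MAX];
	char cmd[TOY_STR_MAX];
};

#define TOY_ERRS_MAX  (TOY_PAGE_SIZE / sizeof(struct toy_log_err))
#define TOY_ERRS_MASK (TOY_ERRS_MAX - 1)

/* Masking only wraps correctly when the slot count is a power of two,
 * so refuse to compile otherwise. */
_Static_assert(TOY_ERRS_MAX > 0 && (TOY_ERRS_MAX & TOY_ERRS_MASK) == 0,
	       "TOY_ERRS_MAX must be a power of two");

static struct toy_log_err toy_log[TOY_ERRS_MAX];
static unsigned int toy_next;	/* monotonically increasing write counter */

/* Store one error in the next slot, recycling the oldest entry once the
 * ring is full. Returns the slot index that was written. */
static size_t toy_log_record(const char *err, const char *cmd)
{
	size_t slot = toy_next++ & TOY_ERRS_MASK;

	assert(err && cmd);
	strncpy(toy_log[slot].err, err, TOY_STR_MAX - 1);
	toy_log[slot].err[TOY_STR_MAX - 1] = '\0';
	strncpy(toy_log[slot].cmd, cmd, TOY_STR_MAX - 1);
	toy_log[slot].cmd[TOY_STR_MAX - 1] = '\0';
	return slot;
}
```

If the entry size were changed to something that does not divide the page into a power-of-two number of slots, the build would stop at the static assertion instead of silently indexing the wrong slots.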
Re: [PATCH] KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support
Paolo Bonzini writes: > On 12/04/2018 17:25, Vitaly Kuznetsov wrote: >> @@ -5335,6 +5353,9 @@ static void __always_inline >> vmx_disable_intercept_for_msr(unsigned long *msr_bit >> if (!cpu_has_vmx_msr_bitmap()) >> return; >> >> +if (static_branch_unlikely(&enable_emsr_bitmap)) >> +evmcs_touch_msr_bitmap(); >> + >> /* >> * See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals >> * have the write-low and read-high bitmap offsets the wrong way round. >> @@ -5370,6 +5391,9 @@ static void __always_inline >> vmx_enable_intercept_for_msr(unsigned long *msr_bitm >> if (!cpu_has_vmx_msr_bitmap()) >> return; >> >> +if (static_branch_unlikely(&enable_emsr_bitmap)) >> +evmcs_touch_msr_bitmap(); > > I'm not sure about the "unlikely". Can you just check current_evmcs > instead (dropping the static key completely)? current_evmcs is just a cast: (struct hv_enlightened_vmcs *)this_cpu_read(current_vmcs) so it is always not NULL here :-) We need to check enable_evmcs static key first. Getting rid of the newly added enable_emsr_bitmap is, of course, possible. (Actually, we only call vmx_{dis,en}able_intercept_for_msr in the very beginning of vCPUs life so this is not a hotpath and likeliness doesn't really matter). Will do v2 without the static key, thanks! > > The function, also, is small enough that inlining should be beneficial. > > Paolo -- Vitaly
Re: [PATCH] kbuild: rpm-pkg: use kernel-install as a fallback for new-kernel-pkg
2018-04-12 3:15 GMT+09:00 Javier Martinez Canillas : > The new-kernel-pkg script is only present when grubby is installed, but it > may not always be the case. So if the script isn't present, attempt to use > the kernel-install script as a fallback instead. > > Signed-off-by: Javier Martinez Canillas > > --- > Applied to linux-kbuild. Thanks! > scripts/package/mkspec | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/scripts/package/mkspec b/scripts/package/mkspec > index 61427c6f2209..e05646dc24dc 100755 > --- a/scripts/package/mkspec > +++ b/scripts/package/mkspec > @@ -118,6 +118,8 @@ $S$Mln -sf /usr/src/kernels/$KERNELRELEASE source > %preun > if [ -x /sbin/new-kernel-pkg ]; then > new-kernel-pkg --remove $KERNELRELEASE --rminitrd > --initrdfile=/boot/initramfs-$KERNELRELEASE.img > + elif [ -x /usr/bin/kernel-install ]; then > + kernel-install remove $KERNELRELEASE > fi > > %postun > -- > 2.14.3 -- Best Regards Masahiro Yamada
Re: [RFC PATCH 04/35] ovl: copy up times
On Thu, Apr 12, 2018 at 05:07:55PM +0200, Miklos Szeredi wrote: > Copy up mtime and ctime to overlay inode after times in real object are > modified. Be careful not to dirty cachelines when not necessary. > > This is in preparation for moving overlay functionality out of the VFS. > > This patch shouldn't have any observable effect. So there are a bunch of operations which will change inode ctime. I had missed this in my metadata only copy up patch series and that would have broken atime updates in some cases. Vivek > > Signed-off-by: Miklos Szeredi > --- > fs/overlayfs/dir.c | 5 + > fs/overlayfs/inode.c | 1 + > fs/overlayfs/overlayfs.h | 7 +++ > fs/overlayfs/util.c | 19 +++ > 4 files changed, 32 insertions(+) > > diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c > index 839709c7803a..cd0fa2363723 100644 > --- a/fs/overlayfs/dir.c > +++ b/fs/overlayfs/dir.c > @@ -507,6 +507,7 @@ static int ovl_create_or_link(struct dentry *dentry, > struct inode *inode, > else > err = ovl_create_over_whiteout(dentry, inode, attr, > hardlink); > + ovl_copytimes_with_parent(dentry); > } > out_revert_creds: > revert_creds(old_cred); > @@ -768,6 +769,7 @@ static int ovl_do_remove(struct dentry *dentry, bool > is_dir) > drop_nlink(dentry->d_inode); > } > ovl_nlink_end(dentry, locked); > + ovl_copytimes_with_parent(dentry); > out_drop_write: > ovl_drop_write(dentry); > out: > @@ -1079,6 +1081,9 @@ static int ovl_rename(struct inode *olddir, struct > dentry *old, > ovl_dentry_version_inc(new->d_parent, ovl_type_origin(old) || > (d_inode(new) && ovl_type_origin(new))); > > + ovl_copytimes_with_parent(old); > + ovl_copytimes_with_parent(new); > + > out_dput: > dput(newdentry); > out_dput_old: > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c > index 6e3815fb006b..33635106c5f7 100644 > --- a/fs/overlayfs/inode.c > +++ b/fs/overlayfs/inode.c > @@ -303,6 +303,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode > *inode, const char *name, > err = vfs_removexattr(realdentry, name); >
} > revert_creds(old_cred); > + ovl_copytimes(d_inode(dentry)); > > out_drop_write: > ovl_drop_write(dentry); > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > index e0b7de799f6b..eef720ef0f07 100644 > --- a/fs/overlayfs/overlayfs.h > +++ b/fs/overlayfs/overlayfs.h > @@ -258,6 +258,13 @@ bool ovl_need_index(struct dentry *dentry); > int ovl_nlink_start(struct dentry *dentry, bool *locked); > void ovl_nlink_end(struct dentry *dentry, bool locked); > int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir); > +void ovl_copytimes(struct inode *inode); > + > +static inline void ovl_copytimes_with_parent(struct dentry *dentry) > +{ > + ovl_copytimes(d_inode(dentry)); > + ovl_copytimes(d_inode(dentry->d_parent)); > +} > > static inline bool ovl_is_impuredir(struct dentry *dentry) > { > diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c > index 6f1078028c66..11e62e70733a 100644 > --- a/fs/overlayfs/util.c > +++ b/fs/overlayfs/util.c > @@ -675,3 +675,22 @@ int ovl_lock_rename_workdir(struct dentry *workdir, > struct dentry *upperdir) > pr_err("overlayfs: failed to lock workdir+upperdir\n"); > return -EIO; > } > + > +void ovl_copytimes(struct inode *inode) > +{ > + struct inode *upperinode; > + > + if (!inode) > + return; > + > + upperinode = ovl_inode_upper(inode); > + > + if (!upperinode) > + return; > + > + if ((!timespec_equal(&inode->i_mtime, &upperinode->i_mtime) || > + !timespec_equal(&inode->i_ctime, &upperinode->i_ctime))) { > + inode->i_mtime = upperinode->i_mtime; > + inode->i_ctime = upperinode->i_ctime; > + } > +} > -- > 2.14.3
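The "compare first, write only on change" pattern that the quoted ovl_copytimes() uses to avoid dirtying cachelines can be shown in a small userspace sketch. The toy_inode type and the toy_* helpers below are hypothetical and only mirror the shape of the quoted code; they are not overlayfs API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the two timestamps the patch copies up. */
struct toy_inode {
	struct timespec i_mtime;
	struct timespec i_ctime;
};

static bool toy_ts_equal(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Copy the times from upper to inode, but store only when something
 * actually differs, so an unchanged destination is never written.
 * Returns true if a store happened. */
static bool toy_copytimes(struct toy_inode *inode, const struct toy_inode *upper)
{
	if (inode == NULL || upper == NULL)
		return false;

	if (toy_ts_equal(&inode->i_mtime, &upper->i_mtime) &&
	    toy_ts_equal(&inode->i_ctime, &upper->i_ctime))
		return false;	/* read-only fast path */

	inode->i_mtime = upper->i_mtime;
	inode->i_ctime = upper->i_ctime;
	return true;
}
```

Calling toy_copytimes() twice in a row with the same source performs a store at most once; the second call takes the read-only path.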
Re: [PATCH 3/3] dcache: account external names as indirectly reclaimable memory
On 04/13/2018 03:59 PM, Michal Hocko wrote: > On Fri 13-04-18 22:35:19, Minchan Kim wrote: >> On Mon, Mar 05, 2018 at 01:37:43PM +, Roman Gushchin wrote: > [...] >>> @@ -1614,9 +1623,11 @@ struct dentry *__d_alloc(struct super_block *sb, >>> const struct qstr *name) >>> name = &slash_name; >>> dname = dentry->d_iname; >>> } else if (name->len > DNAME_INLINE_LEN-1) { >>> - size_t size = offsetof(struct external_name, name[1]); >>> - struct external_name *p = kmalloc(size + name->len, >>> - GFP_KERNEL_ACCOUNT); >>> + struct external_name *p; >>> + >>> + reclaimable = offsetof(struct external_name, name[1]) + >>> + name->len; >>> + p = kmalloc(reclaimable, GFP_KERNEL_ACCOUNT); >> >> Can't we use kmem_cache_alloc with own cache created with >> SLAB_RECLAIM_ACCOUNT >> if they are reclaimable? > > No, because names have different sizes and so we would basically have to > duplicate many caches. We would need kmalloc-reclaimable-X variants. It could be worth it, especially if we find more similar usages. I suspect they would be more useful than the existing dma-kmalloc-X :) Maybe create both (dma and reclaimable) on demand?
[PATCH] sparc: fix compat siginfo ABI regression
Starting with commit v4.14-rc1~60^2^2~1, a SIGFPE signal sent via kill results in wrong values in the si_pid and si_uid fields of compat siginfo_t. This happens due to FPE_FIXME being defined to 0 for sparc, and at the same time siginfo_layout() introduced by the same commit returns SIL_FAULT for SIGFPE if si_code == SI_USER and FPE_FIXME is defined to 0. Fix this regression by removing the FPE_FIXME macro and changing all its users to assign FPE_FLTUNK to si_code instead of FPE_FIXME. Note that FPE_FLTUNK is a new macro introduced by commit 266da65e9156d93e1126e185259a4aae68188d0e. Tested with commit v4.16-11958-g16e205cf42da. This bug was found by the strace test suite. Link: https://github.com/strace/strace/issues/21 Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic") Thanks-to: Anatoly Pugachev Signed-off-by: Dmitry V. Levin --- arch/sparc/include/uapi/asm/siginfo.h | 7 --- arch/sparc/kernel/traps_32.c | 2 +- arch/sparc/kernel/traps_64.c | 2 +- 3 files changed, 2 insertions(+), 9 deletions(-) diff --git a/arch/sparc/include/uapi/asm/siginfo.h b/arch/sparc/include/uapi/asm/siginfo.h index 896ce44..e704955 100644 --- a/arch/sparc/include/uapi/asm/siginfo.h +++ b/arch/sparc/include/uapi/asm/siginfo.h @@ -18,13 +18,6 @@ #define SI_NOINFO 32767 /* no information in siginfo_t */ /* - * SIGFPE si_codes - */ -#ifdef __KERNEL__ -#define FPE_FIXME 0 /* Broken dup of SI_USER */ -#endif /* __KERNEL__ */ - -/* * SIGEMT si_codes */ #define EMT_TAGOVF 1 /* tag overflow */ diff --git a/arch/sparc/kernel/traps_32.c b/arch/sparc/kernel/traps_32.c index b1ed763..33cd35b 100644 --- a/arch/sparc/kernel/traps_32.c +++ b/arch/sparc/kernel/traps_32.c @@ -307,7 +307,7 @@ void do_fpe_trap(struct pt_regs *regs, unsigned long pc, unsigned long npc, info.si_errno = 0; info.si_addr = (void __user *)pc; info.si_trapno = 0; - info.si_code = FPE_FIXME; + info.si_code = FPE_FLTUNK; if ((fsr & 0x1c000) == (1 << 14)) { if (fsr & 0x10) info.si_code = FPE_FLTINV; diff --git
a/arch/sparc/kernel/traps_64.c b/arch/sparc/kernel/traps_64.c index 462a21a..e81072a 100644 --- a/arch/sparc/kernel/traps_64.c +++ b/arch/sparc/kernel/traps_64.c @@ -2372,7 +2372,7 @@ static void do_fpe_common(struct pt_regs *regs) info.si_errno = 0; info.si_addr = (void __user *)regs->tpc; info.si_trapno = 0; - info.si_code = FPE_FIXME; + info.si_code = FPE_FLTUNK; if ((fsr & 0x1c000) == (1 << 14)) { if (fsr & 0x10) info.si_code = FPE_FLTINV; -- ldv
Re: [PATCH 1/2] tracing/events: block: track and print if unplug was explicit or schedule
On Fri, 13 Apr 2018 15:07:17 +0200 Steffen Maier wrote: > Just like blktrace distinguishes explicit and schedule by means of > BLK_TA_UNPLUG_IO and BLK_TA_UNPLUG_TIMER, actually make use of the > existing argument "explicit" to distinguish the two cases in the one > common tracepoint block_unplug. > > Complements v2.6.39 commit 49cac01e1fa7 ("block: make unplug timer trace > event correspond to the schedule() unplug") and commit d9c978331790 > ("block: remove block_unplug_timer() trace point"). > > Signed-off-by: Steffen Maier > --- > include/trace/events/block.h | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/include/trace/events/block.h b/include/trace/events/block.h > index 81b43f5bdf23..a13613d27cee 100644 > --- a/include/trace/events/block.h > +++ b/include/trace/events/block.h > @@ -470,6 +470,11 @@ TRACE_EVENT(block_plug, > TP_printk("[%s]", __entry->comm) > ); > > +#define show_block_unplug_explicit(val) \ > + __print_symbolic(val, \ > + {false, "schedule"}, \ > + {true, "explicit"}) That's new. I haven't seen "true"/"false" values used for print_symbolic before. But could you please use 1 and 0 instead, because perf and trace-cmd won't be able to parse that. I could update libtraceevent to handle it, but really, the first parameter is supposed to be numeric. -- Steve > + > DECLARE_EVENT_CLASS(block_unplug, > > TP_PROTO(struct request_queue *q, unsigned int depth, bool explicit), > @@ -478,15 +483,18 @@ DECLARE_EVENT_CLASS(block_unplug, > > TP_STRUCT__entry( > __field( int, nr_rq ) > + __field( bool, explicit) > __array( char, comm, TASK_COMM_LEN ) > ), > > TP_fast_assign( > __entry->nr_rq = depth; > + __entry->explicit = explicit; > memcpy(__entry->comm, current->comm, TASK_COMM_LEN); > ), > > - TP_printk("[%s] %d", __entry->comm, __entry->nr_rq) > + TP_printk("[%s] %d %s", __entry->comm, __entry->nr_rq, > + show_block_unplug_explicit(__entry->explicit)) > ); > > /**
Re: [PATCH] printk: Ratelimit messages printed by console drivers
On Fri, 13 Apr 2018 14:47:04 +0200 Petr Mladek wrote: > The interval is set to one hour. It is rather arbitrary selected time. > It is supposed to be a compromise between never print these messages, > do not lockup the machine, do not fill the entire buffer too quickly, > and get information if something changes over time. I think an hour is incredibly long. We only allow 100 lines per hour for printks happening inside another printk? I think 5 minutes (at most) would probably be plenty. One minute may be good enough. -- Steve
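To make the interval/burst trade-off discussed here concrete, the following is a minimal userspace sketch of a "at most burst messages per interval" limiter. It is not the kernel's ratelimit implementation; the toy_* names are hypothetical, time is passed in as plain seconds, and the 100-line burst is the figure quoted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter state: allow at most 'burst' messages in each
 * window of 'interval' seconds. */
struct toy_ratelimit {
	long interval;	/* window length in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages let through in this window */
	int missed;	/* messages suppressed so far */
};

/* Returns true if a message arriving at time 'now' may be printed. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rs, long now)
{
	assert(rs != NULL);

	if (now - rs->begin >= rs->interval) {
		/* The window expired: start a new one. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/* Count how many of 'attempts' messages arriving at time 'now' pass. */
static int toy_ratelimit_count(struct toy_ratelimit *rs, long now, int attempts)
{
	int passed = 0;

	for (int i = 0; i < attempts; i++)
		if (toy_ratelimit_ok(rs, now))
			passed++;
	return passed;
}
```

With interval = 3600 and burst = 100, a storm of 150 messages lets 100 through and nothing more is shown for the rest of the hour; with interval = 60 the same storm is cut off identically, but output can resume a minute later.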
Re: [PATCH v2 2/3] microblaze: remove redundant early_printk support
On Tue, Apr 10, 2018 at 8:44 AM, Michal Simek wrote: > Hi Rob, > > On 28.3.2018 04:06, Rob Herring wrote: >> With earlycon support now enabled, the arch specific early_printk support >> can be removed. > > earlycon is not the full replacement of early_printk support as is > designed right now. > Definitely current early_printk is pretty old and contains code > duplication but it starts much earlier than earlycon. Yes, essentially it's after MMU enabling rather than before. But it is still before any h/w specific setup (dependent on the DT) which is where one would typically fail to boot. Generally, I've found before DT unflattening to be early enough. What can go wrong at this early stage? Memory is flaky or you've passed in bad memory ranges or image locations. An earlier console may or may not help there and those problems are easier to debug in the bootloader. So it is a question of what you want to maintain. >> Signed-off-by: Rob Herring >> Cc: Michal Simek >> --- >> v2: >> - Fix booting. The setup_memory call needed to be before the >> parse_early_param call. > > What's the reason for calling setup_memory before parse_early_param? > Is there any dependency? Yes, either fixmap or ioremap (in your case) has to be functional when earlycon is set up, which happens via parse_early_param. Rob
[GIT PULL] arm64: Late updates for 4.17
Hi Linus, As I mentioned in the previous pull request, we had some nasty conflicts with the KVM tree that resulted in us dropping some spectre-related work shortly before the merge window opened. Now that the KVM tree has been merged, we've put together an updated version of the patches based on your merge commit (details in the tag). I appreciate this isn't ideal, so if you'd rather just see this stuff at -rc1 please let me know and we can do that instead. There are also a couple of patches here adding some unused assembler macros which will be needed by some 4.18 crypto code and we'd like to head that dependency off early. Thanks, Will --->8 The following changes since commit d8312a3f61024352f1c7cb967571fd53631b0d6c: Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm (2018-04-09 11:42:31 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-upstream for you to fetch changes up to 24534b3511828c66215fdf1533d77a7bf2e1fdb2: arm64: assembler: add macros to conditionally yield the NEON under PREEMPT (2018-04-11 18:50:34 +0100) Additional arm64 updates for 4.17 A few late updates to address some issues arising from conflicts with other trees: - Removal of Qualcomm-specific Spectre-v2 mitigation in favour of the generic SMCCC-based firmware call - Fix EL2 hardening capability checking, which was bodged to reduce conflicts with the KVM tree - Add some currently unused assembler macros for managing SIMD registers which will be used by some crypto code in the next merge window Ard Biesheuvel (2): arm64: assembler: add utility macros to push/pop stack frames arm64: assembler: add macros to conditionally yield the NEON under PREEMPT Marc Zyngier (3): arm64: capabilities: Rework EL2 vector hardening entry arm64: Get rid of __smccc_workaround_1_hvc_* arm64: Move the content of bpi.S to hyp-entry.S Shanker Donthineni (1): arm64: KVM: Use SMCCC_ARCH_WORKAROUND_1 for Falkor BP hardening 
arch/arm64/include/asm/assembler.h | 136 + arch/arm64/include/asm/cpucaps.h | 13 ++-- arch/arm64/include/asm/kvm_asm.h | 2 - arch/arm64/kernel/Makefile | 2 - arch/arm64/kernel/asm-offsets.c| 3 + arch/arm64/kernel/bpi.S| 102 arch/arm64/kernel/cpu_errata.c | 97 ++ arch/arm64/kvm/hyp/entry.S | 12 arch/arm64/kvm/hyp/hyp-entry.S | 64 - arch/arm64/kvm/hyp/switch.c| 10 --- 10 files changed, 242 insertions(+), 199 deletions(-) delete mode 100644 arch/arm64/kernel/bpi.S
Re: [PATCH ipmi/kcs_bmc v1] ipmi: kcs_bmc: optimize the data buffers allocation
On 2018-04-13 21:50, Corey Minyard wrote: On 04/07/2018 02:54 AM, Wang, Haiyue wrote: Hi Corey, Since IPMI 2.0 just defined minimum, no maximum: KCS/SMIC Input : Required: 40 bytes IPMI Message, minimum KCS/SMIC Output : Required: 38 bytes IPMI Message, minimum Yes, though there are practical maximums that are much smaller than 1000 bytes. We can enlarge the block size for avoiding waste, and make our driver support most worst message size case. And I think this patch make checking simple (from 3 to 1), and the code clean, this is the biggest reason I want to change. The TLB is just memory management study from book, no data to support access improvement. :) I would argue that the way it is now expresses the intent of the code better than one allocation split into three parts. Expressing your intent is more important than the number of checks and a minuscule performance improvement. For me it makes the code easier to understand. If you had a tool that checked for out-of-bounds memory access, then a single allocation might not find an overrun between the parts. Smaller allocations tend to result in less memory fragmentation. When I wrote the commit, I felt that the message was not so professional, and the reason sounded weak. The driver development is a complex work, needs considering more things, not just one. Thanks for your patience. My preference is to leave it as it is. However, it's not that important, and if you really want this patch, I can include it. So leave it as it is, abandon this patch. :-) BTW, another patch about KCS BMC chip support: https://lkml.org/lkml/2018/3/22/284 Look forward your reviewing, I've tried my best to make it better. Thanks, -corey BR, Haiyue On 2018-04-07 10:37, Wang, Haiyue wrote: On 2018-04-07 05:47, Corey Minyard wrote: On 03/15/2018 07:20 AM, Haiyue Wang wrote: Allocate a continuous memory block for the three KCS data buffers with related index assignment. I'm finally getting to this. Is there a reason you want to do this? 
In general, it's better to not try to outsmart your base system. Depending on the memory allocator, in this case, you might actually use more memory. You probably won't use any less. I got this idea from another code review, but that patch allocates 30 more the same size memory block, reducing the devm_kmalloc call will be better. For KCS only have 3, may be the key point is memory waste. In the original case, you allocate three 1000 byte buffers, resulting in 3 1024 byte slab allocated. In the changed case, you will allocate a 3000 byte buffer, resulting in a single 4096 byte slab allocation, wasting 1024 more bytes of memory. As the kcs has memory copy between in/out/kbuffer, put them in the same page will be better ? Such as the same TLB ? (Well, I just got this from book, no real experience of memory accessing performance. And also, I was told that using space to save the time. :-)). Just my stupid thinking. I'm OK to drop this patch if it doesn't help with performance, or something else. BR. 
Haiyue -corey Signed-off-by: Haiyue Wang --- drivers/char/ipmi/kcs_bmc.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc.c b/drivers/char/ipmi/kcs_bmc.c index fbfc05e..dc19c0d 100644 --- a/drivers/char/ipmi/kcs_bmc.c +++ b/drivers/char/ipmi/kcs_bmc.c @@ -435,6 +435,7 @@ static const struct file_operations kcs_bmc_fops = { struct kcs_bmc *kcs_bmc_alloc(struct device *dev, int sizeof_priv, u32 channel) { struct kcs_bmc *kcs_bmc; + void *buf; kcs_bmc = devm_kzalloc(dev, sizeof(*kcs_bmc) + sizeof_priv, GFP_KERNEL); if (!kcs_bmc) @@ -448,11 +449,12 @@ struct kcs_bmc *kcs_bmc_alloc(struct device *dev, int sizeof_priv, u32 channel) mutex_init(&kcs_bmc->mutex); init_waitqueue_head(&kcs_bmc->queue); - kcs_bmc->data_in = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - kcs_bmc->data_out = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - kcs_bmc->kbuffer = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - if (!kcs_bmc->data_in || !kcs_bmc->data_out || !kcs_bmc->kbuffer) + buf = devm_kmalloc_array(dev, 3, KCS_MSG_BUFSIZ, GFP_KERNEL); + if (!buf) return NULL; + kcs_bmc->data_in = buf; + kcs_bmc->data_out = buf + KCS_MSG_BUFSIZ; + kcs_bmc->kbuffer = buf + KCS_MSG_BUFSIZ * 2; kcs_bmc->miscdev.minor = MISC_DYNAMIC_MINOR; kcs_bmc->miscdev.name = dev_name(dev);
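The memory-waste arithmetic from this thread (three 1000-byte buffers versus one 3000-byte buffer) can be reproduced with a few lines of C. The sketch assumes that allocations of this size are rounded up to the next power-of-two size class; real allocators also have a few non-power-of-two classes for small sizes, so toy_size_class() is only an approximation and not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next power-of-two size class, starting at
 * 8 bytes. This is an assumed model of how slab buckets are chosen. */
static size_t toy_size_class(size_t bytes)
{
	size_t cls = 8;

	assert(bytes > 0);
	while (cls < bytes)
		cls <<= 1;
	return cls;
}

/* Total memory consumed by 'count' separate allocations of 'bytes'. */
static size_t toy_split_cost(size_t bytes, size_t count)
{
	return toy_size_class(bytes) * count;
}

/* Total memory consumed by one allocation of 'bytes' * 'count'. */
static size_t toy_merged_cost(size_t bytes, size_t count)
{
	return toy_size_class(bytes * count);
}
```

For 1000-byte buffers and a count of 3, the split variant costs 3 x 1024 = 3072 bytes and the merged variant costs 4096 bytes, which is the extra 1024 bytes mentioned in the review.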
Re: [PATCH] ARM: omap2: Fix build when using split object directories
Tony, On 04/12/2018 04:08 AM, Masahiro Yamada wrote: > 2018-04-12 17:21 GMT+09:00 Anders Roxell : >> On 2018-04-11 16:15, Dave Gerlach wrote: >>> The sleep33xx and sleep43xx files should not depend on a header file >>> generated in drivers/memory. Remove this dependency and instead allow >>> both drivers/memory and arch/arm/mach-omap2 to generate all macros >>> needed in headers local to their own paths. >>> >>> This fixes an issue where the build will fail when using O= to set a >>> split object directory and arch/arm/mach-omap2 is built before >>> drivers/memory with the following error: >>> >>> .../drivers/memory/emif-asm-offsets.c:1:0: fatal error: can't open >>> drivers/memory/emif-asm-offsets.s for writing: No such file or directory >>> compilation terminated. >>> >>> Fixes: 41d9d44d7258 ("ARM: OMAP2+: pm33xx-core: Add platform code needed >>> for PM") >>> Acked-by: Tony Lindgren >>> Reviewed-by: Masahiro Yamada >>> Signed-off-by: Dave Gerlach >> >> Tested-by: Anders Roxell >> >> Maybe we can remove drivers/memory/Makefile.asm-offsets and move those >> changes into drivers/memory/Makefile ? > > Agree! > This is the version of this patch that we want to use, will this go through you? Regards, Dave > > >
Re: [PATCH v2] ARM: omap2: Fix build when using split object directories
On 04/12/2018 10:24 PM, Masahiro Yamada wrote: > 2018-04-13 11:58 GMT+09:00 Dave Gerlach : >> The sleep33xx and sleep43xx files should not depend on a header file >> generated in drivers/memory. Remove this dependency and instead allow >> both drivers/memory and arch/arm/mach-omap2 to generate all macros >> needed in headers local to their own paths. >> >> This fixes an issue where the build fail will when using O= to set a >> split object directory and arch/arm/mach-omap2 is built before >> drivers/memory with the following error: >> >> .../drivers/memory/emif-asm-offsets.c:1:0: fatal error: can't open >> drivers/memory/emif-asm-offsets.s for writing: No such file or directory >> compilation terminated. >> >> Fixes: 41d9d44d7258 ("ARM: OMAP2+: pm33xx-core: Add platform code needed for >> PM") >> Acked-by: Tony Lindgren >> Reviewed-by: Masahiro Yamada >> Tested-by: Anders Roxell >> Signed-off-by: Dave Gerlach >> --- >> v1 -> v2: >> * Removed drivers/memory/Makefile.asm-offsets and consolidated into >>drivers/memory/Makefile. > > > > I did not mean like this. > > I thought this clean-up would be done in a separate patch. > > I think your previous patch is OK as-is. > Ok sorry for the confusion let's forget this version then. 
Regards, Dave > > > > >> arch/arm/mach-omap2/Makefile | 6 +-- >> arch/arm/mach-omap2/pm-asm-offsets.c | 3 ++ >> arch/arm/mach-omap2/sleep33xx.S | 1 - >> arch/arm/mach-omap2/sleep43xx.S | 1 - >> drivers/memory/Makefile | 8 +++- >> drivers/memory/Makefile.asm-offsets | 5 --- >> drivers/memory/emif-asm-offsets.c| 72 +- >> include/linux/ti-emif-sram.h | 75 >> >> 8 files changed, 86 insertions(+), 85 deletions(-) >> delete mode 100644 drivers/memory/Makefile.asm-offsets >> >> diff --git a/arch/arm/mach-omap2/Makefile b/arch/arm/mach-omap2/Makefile >> index 4603c30fef73..0d9ce58bc464 100644 >> --- a/arch/arm/mach-omap2/Makefile >> +++ b/arch/arm/mach-omap2/Makefile >> @@ -243,8 +243,4 @@ arch/arm/mach-omap2/pm-asm-offsets.s: >> arch/arm/mach-omap2/pm-asm-offsets.c >> include/generated/ti-pm-asm-offsets.h: arch/arm/mach-omap2/pm-asm-offsets.s >> FORCE >> $(call filechk,offsets,__TI_PM_ASM_OFFSETS_H__) >> >> -# For rule to generate ti-emif-asm-offsets.h dependency >> -include drivers/memory/Makefile.asm-offsets >> - >> -arch/arm/mach-omap2/sleep33xx.o: include/generated/ti-pm-asm-offsets.h >> include/generated/ti-emif-asm-offsets.h >> -arch/arm/mach-omap2/sleep43xx.o: include/generated/ti-pm-asm-offsets.h >> include/generated/ti-emif-asm-offsets.h >> +$(obj)/sleep33xx.o $(obj)/sleep43xx.o: include/generated/ti-pm-asm-offsets.h >> diff --git a/arch/arm/mach-omap2/pm-asm-offsets.c >> b/arch/arm/mach-omap2/pm-asm-offsets.c >> index 6d4392da7c11..b9846b19e5e2 100644 >> --- a/arch/arm/mach-omap2/pm-asm-offsets.c >> +++ b/arch/arm/mach-omap2/pm-asm-offsets.c >> @@ -7,9 +7,12 @@ >> >> #include >> #include >> +#include >> >> int main(void) >> { >> + ti_emif_asm_offsets(); >> + >> DEFINE(AMX3_PM_WFI_FLAGS_OFFSET, >>offsetof(struct am33xx_pm_sram_data, wfi_flags)); >> DEFINE(AMX3_PM_L2_AUX_CTRL_VAL_OFFSET, >> diff --git a/arch/arm/mach-omap2/sleep33xx.S >> b/arch/arm/mach-omap2/sleep33xx.S >> index 218d79930b04..322b3bb868b4 100644 >> --- a/arch/arm/mach-omap2/sleep33xx.S >> +++ 
b/arch/arm/mach-omap2/sleep33xx.S >> @@ -6,7 +6,6 @@ >> * Dave Gerlach, Vaibhav Bedia >> */ >> >> -#include >> #include >> #include >> #include >> diff --git a/arch/arm/mach-omap2/sleep43xx.S >> b/arch/arm/mach-omap2/sleep43xx.S >> index b24be624e8b9..8903814a6677 100644 >> --- a/arch/arm/mach-omap2/sleep43xx.S >> +++ b/arch/arm/mach-omap2/sleep43xx.S >> @@ -6,7 +6,6 @@ >> * Dave Gerlach, Vaibhav Bedia >> */ >> >> -#include >> #include >> #include >> #include >> diff --git a/drivers/memory/Makefile b/drivers/memory/Makefile >> index 66f55240830e..b3b95380346f 100644 >> --- a/drivers/memory/Makefile >> +++ b/drivers/memory/Makefile >> @@ -28,6 +28,10 @@ ti-emif-sram-objs:= ti-emif-pm.o >> ti-emif-sram-pm.o >> >> AFLAGS_ti-emif-sram-pm.o :=-Wa,-march=armv7-a >> >> -include drivers/memory/Makefile.asm-offsets >> +drivers/memory/emif-asm-offsets.s: drivers/memory/emif-asm-offsets.c >> + $(call if_changed_dep,cc_s_c) >> >> -drivers/memory/ti-emif-sram-pm.o: include/generated/ti-emif-asm-offsets.h >> +include/generated/ti-emif-asm-offsets.h: drivers/memory/emif-asm-offsets.s >> FORCE >> + $(call filechk,offsets,__TI_EMIF_ASM_OFFSETS_H__) >> + >> +$(obj)/ti-emif-sram-pm.o: include/generated/ti-emif-asm-offsets.h >> diff --git a/drivers/memory/Makefile.asm-offsets >> b/drivers/memory/Makefile.asm-offsets >> deleted file mode 100644 >> index 843ff60ccb5a.. >> --- a/drivers/memory/Makefile.asm-offsets >> +++ /dev/null >> @@ -1,5 +0,0 @@ >> -drivers/memor
Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited
On Thu, Apr 12, 2018 at 09:32:49PM -0700, Jakub Kicinski wrote:
> On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> > +	if (bw_avail >= bw_cap)
> > +		pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> > +	else
> > +		pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> > +			 bw_avail, PCIE_SPEED2STR(speed), width,
> > +			 limiting_dev ? pci_name(limiting_dev) : "",
> > +			 bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
>
> I was just looking at using this new function to print PCIe BW for a
> NIC, but I'm slightly worried that there is nothing in the message that
> says PCIe... For a NIC some people may interpret the bandwidth as NIC
> bandwidth:
>
> [ 39.839989] nfp :04:00.0: Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe
> [ 39.848943] nfp :04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
> [ 39.857146] nfp :04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4
>
> It's not a 63Gbps NIC... I'm sorry if this was discussed before and I
> didn't find it. Would it make sense to add the "PCIe: " prefix to the
> message like bnx2x used to do? Like:
>
> nfp :04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)

I agree, that does look potentially confusing. How about this:

  nfp :04:00.0: 63.008 Gb/s available PCIe bandwidth (8 GT/s x8 link)

I did have to look twice at this before I remembered that we're
printing Gb/s (not GB/s). Most of the references I found on the web
use GB/s when talking about total PCIe bandwidth.

But either way I think it's definitely worth mentioning PCIe
explicitly.
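For readers wondering where 63.008 Gb/s comes from, here is a minimal arithmetic sketch. It assumes the figure is derived from the 128b/130b line encoding used at 8 GT/s, with the per-lane rate truncated to whole Mb/s before multiplying by the lane count. The function names are hypothetical and are not the helpers from the patch.

```c
#include <assert.h>

/*
 * Usable per-lane rate in Mb/s for an 8 GT/s link, assuming 128b/130b
 * encoding and integer truncation: 8000 * 128 / 130 = 7876.
 */
static unsigned int demo_lane_mbps_8gt(void)
{
	return 8000u * 128u / 130u;
}

/* Total link bandwidth in Mb/s for a given lane count (hypothetical helper). */
static unsigned int demo_link_mbps_8gt(unsigned int width)
{
	assert(width > 0);
	return demo_lane_mbps_8gt() * width;
}
```

Under these assumptions an x8 link gives 7876 * 8 = 63008 Mb/s, i.e. the 63.008 Gb/s in the log above, which is roughly 7.9 GB/s in the units most web references use.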
[PATCH 01/17] perf stat: Enable 1ms interval for printing event counters values
From: Alexey Budankov Currently print count interval for performance counters values is limited by 10ms so reading the values at frequencies higher than 100Hz is restricted by the tool. This change makes perf stat -I possible on frequencies up to 1KHz and, to some extent, makes perf stat -I to be on-par with perf record sampling profiling. When running perf stat -I for monitoring e.g. PCIe uncore counters and at the same time profiling some I/O workload by perf record e.g. for cpu-cycles and context switches, it is then possible to observe consolidated CPU/OS/IO(Uncore) performance picture for that workload. Tool overhead warning printed when specifying -v option can be missed due to screen scrolling in case you have output to the console so message is moved into help available by running perf stat -h. Signed-off-by: Alexey Budankov Acked-by: Jiri Olsa Cc: Alexander Shishkin Cc: Andi Kleen Cc: Namhyung Kim Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/b842ad6a-d606-32e4-afe5-974071b51...@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/Documentation/perf-stat.txt | 2 +- tools/perf/builtin-stat.c | 14 ++ 2 files changed, 3 insertions(+), 13 deletions(-) diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt index f15b306be183..e6c3b4e555c2 100644 --- a/tools/perf/Documentation/perf-stat.txt +++ b/tools/perf/Documentation/perf-stat.txt @@ -153,7 +153,7 @@ perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- m -I msecs:: --interval-print msecs:: -Print count deltas every N milliseconds (minimum: 10ms) +Print count deltas every N milliseconds (minimum: 1ms) The overhead percentage could be high in some cases, for instance with small, sub 100ms intervals. Use with caution. 
example: 'perf stat -I 1000 -e cycles -a sleep 5' diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c index f5c454855908..147a27e8c937 100644 --- a/tools/perf/builtin-stat.c +++ b/tools/perf/builtin-stat.c @@ -1943,7 +1943,8 @@ static const struct option stat_options[] = { OPT_STRING(0, "post", &post_cmd, "command", "command to run after to the measured command"), OPT_UINTEGER('I', "interval-print", &stat_config.interval, - "print counts at regular interval in ms (>= 10)"), + "print counts at regular interval in ms " + "(overhead is possible for values <= 100ms)"), OPT_INTEGER(0, "interval-count", &stat_config.times, "print counts for fixed number of times"), OPT_UINTEGER(0, "timeout", &stat_config.timeout, @@ -2923,17 +2924,6 @@ int cmd_stat(int argc, const char **argv) } } - if (interval && interval < 100) { - if (interval < 10) { - pr_err("print interval must be >= 10ms\n"); - parse_options_usage(stat_usage, stat_options, "I", 1); - goto out; - } else - pr_warning("print interval < 100ms. " - "The overhead percentage could be high in some cases. " - "Please proceed with caution.\n"); - } - if (stat_config.times && interval) interval_count = true; else if (stat_config.times && !interval) { -- 2.14.3
[PATCH 02/17] tools headers: Restore READ_ONCE() C++ compatibility
From: Mark Rutland Our userspace defines READ_ONCE() in a way that clang doesn't like, as we have an anonymous union in which neither field is initialized. WRITE_ONCE() is fine since it initializes the __val field. For READ_ONCE() we can keep clang and GCC happy with a dummy initialization of the __c field, so let's do that. At the same time, let's split READ_ONCE() and WRITE_ONCE() over several lines for legibility, as we do in the in-kernel . Reported-by: Li Zhijian Reported-by: Sandipan Das Tested-by: Sandipan Das Signed-off-by: Mark Rutland Fixes: 6aa7de059173a986 ("locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()") Link: http://lkml.kernel.org/r/20180404163445.16492-1-mark.rutl...@arm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/include/linux/compiler.h | 20 +++- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h index 04e32f965ad7..1827c2f973f9 100644 --- a/tools/include/linux/compiler.h +++ b/tools/include/linux/compiler.h @@ -151,11 +151,21 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s * required ordering. */ -#define READ_ONCE(x) \ - ({ union { typeof(x) __val; char __c[1]; } __u; __read_once_size(&(x), __u.__c, sizeof(x)); __u.__val; }) - -#define WRITE_ONCE(x, val) \ - ({ union { typeof(x) __val; char __c[1]; } __u = { .__val = (val) }; __write_once_size(&(x), __u.__c, sizeof(x)); __u.__val; }) +#define READ_ONCE(x) \ +({ \ + union { typeof(x) __val; char __c[1]; } __u = \ + { .__c = { 0 } }; \ + __read_once_size(&(x), __u.__c, sizeof(x)); \ + __u.__val; \ +}) + +#define WRITE_ONCE(x, val) \ +({ \ + union { typeof(x) __val; char __c[1]; } __u = \ + { .__val = (val) }; \ + __write_once_size(&(x), __u.__c, sizeof(x));\ + __u.__val; \ +}) #ifndef __fallthrough -- 2.14.3
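The dummy initialization described above can be tried outside the tools tree with a small sketch. It relies on the GNU statement-expression and typeof extensions (GCC and clang), and every name in it (demo_read_once_size, DEMO_READ_ONCE, demo_read_counter) is a hypothetical stand-in rather than the real tools/include code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for __read_once_size(): byte-wise copy through a volatile pointer. */
static void demo_read_once_size(const volatile void *p, void *res, int size)
{
	const volatile char *src = p;
	char *dst = res;
	int i;

	for (i = 0; i < size; i++)
		dst[i] = src[i];
}

/*
 * Same shape as the patched READ_ONCE(): the union is explicitly
 * initialized through its char member, so no field is left
 * uninitialized before the copy fills it in.
 */
#define DEMO_READ_ONCE(x)						\
({									\
	union { __typeof__(x) val; char bytes[1]; } once_u =		\
		{ .bytes = { 0 } };					\
	demo_read_once_size(&(x), once_u.bytes, sizeof(x));		\
	once_u.val;							\
})

/* Small wrapper so the macro can be exercised from plain C code. */
static int demo_read_counter(int *counter)
{
	assert(counter != NULL);
	return DEMO_READ_ONCE(*counter);
}
```

The initializer only names the char member; the value that is finally returned still comes from the copy, so the dummy zero is never observed by the caller.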
[PATCH] vfio-ccw: process ssch with interrupts disabled
When we call ssch, an interrupt might already be pending once we return from the START SUBCHANNEL instruction. Therefore we need to make sure interrupts are disabled until after we're done with our processing. Note that the subchannel lock is the same as the ccwdevice lock that is mentioned in the documentation for ccw_device_start() and friends. Signed-off-by: Cornelia Huck --- drivers/s390/cio/vfio_ccw_fsm.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/drivers/s390/cio/vfio_ccw_fsm.c b/drivers/s390/cio/vfio_ccw_fsm.c index ff6963ad6e39..3c800642134e 100644 --- a/drivers/s390/cio/vfio_ccw_fsm.c +++ b/drivers/s390/cio/vfio_ccw_fsm.c @@ -20,12 +20,12 @@ static int fsm_io_helper(struct vfio_ccw_private *private) int ccode; __u8 lpm; unsigned long flags; + int ret; sch = private->sch; spin_lock_irqsave(sch->lock, flags); private->state = VFIO_CCW_STATE_BUSY; - spin_unlock_irqrestore(sch->lock, flags); orb = cp_get_orb(&private->cp, (u32)(addr_t)sch, sch->lpm); @@ -38,10 +38,12 @@ static int fsm_io_helper(struct vfio_ccw_private *private) * Initialize device status information */ sch->schib.scsw.cmd.actl |= SCSW_ACTL_START_PEND; - return 0; + ret = 0; + break; case 1: /* Status pending */ case 2: /* Busy */ - return -EBUSY; + ret = -EBUSY; + break; case 3: /* Device/path not operational */ { lpm = orb->cmd.lpm; @@ -51,13 +53,16 @@ static int fsm_io_helper(struct vfio_ccw_private *private) sch->lpm = 0; if (cio_update_schib(sch)) - return -ENODEV; - - return sch->lpm ? -EACCES : -ENODEV; + ret = -ENODEV; + else + ret = sch->lpm ? -EACCES : -ENODEV; + break; } default: - return ccode; + ret = ccode; } + spin_unlock_irqrestore(sch->lock, flags); + return ret; } static void fsm_notoper(struct vfio_ccw_private *private, -- 2.14.3
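The control-flow change in this patch (every branch sets ret and breaks, and the lock is dropped at a single exit) can be illustrated with a userspace sketch. The lock here is a plain counter standing in for spin_lock_irqsave()/spin_unlock_irqrestore(), the errno values are local constants, and all names are hypothetical.

```c
#include <assert.h>

/* Stand-in for a spinlock taken with interrupts disabled. */
struct demo_lock {
	int depth;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	l->depth++;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->depth > 0);
	l->depth--;
}

enum { DEMO_EBUSY = 16, DEMO_ENODEV = 19 };

/*
 * Map a condition code to a return value while holding the lock for the
 * whole decision, releasing it exactly once on the way out.
 */
static int demo_start_io(struct demo_lock *lock, int ccode)
{
	int ret;

	demo_lock_acquire(lock);
	switch (ccode) {
	case 0:		/* accepted */
		ret = 0;
		break;
	case 1:		/* status pending */
	case 2:		/* busy */
		ret = -DEMO_EBUSY;
		break;
	case 3:		/* not operational */
		ret = -DEMO_ENODEV;
		break;
	default:
		ret = ccode;
	}
	demo_lock_release(lock);
	return ret;
}
```

With early returns inside the switch, each of them would need its own unlock; funnelling through one exit keeps the acquire and release visibly paired.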
[PATCH 03/17] perf tests: Run dwarf unwind test on arm32
From: Kim Phillips Enable the unwind test on arm32: $ perf test unwind 58: DWARF unwind : Ok Signed-off-by: Kim Phillips Cc: Alexander Shishkin Cc: Brian Robbins Cc: Jiri Olsa Cc: Namhyung Kim Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/20180410191624.a3a468670dd4548c66d3d...@arm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/arm/include/arch-tests.h | 12 tools/perf/arch/arm/tests/Build | 2 ++ tools/perf/arch/arm/tests/arch-tests.c | 16 3 files changed, 30 insertions(+) create mode 100644 tools/perf/arch/arm/include/arch-tests.h create mode 100644 tools/perf/arch/arm/tests/arch-tests.c diff --git a/tools/perf/arch/arm/include/arch-tests.h b/tools/perf/arch/arm/include/arch-tests.h new file mode 100644 index ..90ec4c8cb880 --- /dev/null +++ b/tools/perf/arch/arm/include/arch-tests.h @@ -0,0 +1,12 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef ARCH_TESTS_H +#define ARCH_TESTS_H + +#ifdef HAVE_DWARF_UNWIND_SUPPORT +struct thread; +struct perf_sample; +#endif + +extern struct test arch_tests[]; + +#endif diff --git a/tools/perf/arch/arm/tests/Build b/tools/perf/arch/arm/tests/Build index b30eff9bcc83..883c57ff0c08 100644 --- a/tools/perf/arch/arm/tests/Build +++ b/tools/perf/arch/arm/tests/Build @@ -1,2 +1,4 @@ libperf-y += regs_load.o libperf-y += dwarf-unwind.o + +libperf-y += arch-tests.o diff --git a/tools/perf/arch/arm/tests/arch-tests.c b/tools/perf/arch/arm/tests/arch-tests.c new file mode 100644 index ..5b1543c98022 --- /dev/null +++ b/tools/perf/arch/arm/tests/arch-tests.c @@ -0,0 +1,16 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include "tests/tests.h" +#include "arch-tests.h" + +struct test arch_tests[] = { +#ifdef HAVE_DWARF_UNWIND_SUPPORT + { + .desc = "DWARF unwind", + .func = test__dwarf_unwind, + }, +#endif + { + .func = NULL, + }, +}; -- 2.14.3
[PATCH 06/17] perf jvmti: Give hints about package names needed to build
From: Arnaldo Carvalho de Melo

Give examples of package names to install to have this built on Fedora and Debian, to help the user a bit. The new part is everything from 'e.g.' onwards:

  No openjdk development package found, please install JDK package, e.g. openjdk-8-jdk, java-1.8.0-openjdk-devel

Cc: Andi Kleen
Cc: David Ahern
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Stephane Eranian
Cc: William Cohen
Link: https://lkml.kernel.org/n/tip-edbi4r2pvzn7no6ebxbtc...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/Makefile.config | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index c7abd83a8e19..6b307e97dc57 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -847,7 +847,7 @@ ifndef NO_JVMTI
   ifeq ($(feature-jvmti), 1)
     $(call detected_var,JDIR)
   else
-    $(warning No openjdk development package found, please install JDK package)
+    $(warning No openjdk development package found, please install JDK package, e.g. openjdk-8-jdk, java-1.8.0-openjdk-devel)
     NO_JVMTI := 1
   endif
 endif
-- 
2.14.3
[GIT PULL V2] Thermal management updates for v4.17-rc1
Hi, Linus,

Please pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux.git next

to receive the latest Thermal Management updates for v4.17-rc1 with top-most commit b907b408ca64482989cd95dacef804ce509a3673:

  Merge branches 'thermal-core' and 'thermal-soc' into next (2018-04-13 14:11:53 +0800)

on top of commit 0c8efd610b58cb23cefdfa12015799079aef94ae:

  Linux 4.16-rc5 (2018-03-11 17:25:09 -0700)

Differences in V2:
- Dropped all patches from thermal-soc tree, including the exynos patch that introduces the compiler warnings.

Specifics:
- Fix race condition in imx_thermal_probe(). (Mikhail Lappo)
- Add cooling device's statistics in sysfs. (Viresh Kumar)

thanks,
rui

Mikhail Lappo (1):
      thermal: imx: Fix race condition in imx_thermal_probe()

Viresh Kumar (1):
      thermal: Add cooling device's statistics in sysfs

Zhang Rui (1):
      Merge branches 'thermal-core' and 'thermal-soc' into next

 Documentation/thermal/sysfs-api.txt |  31 +
 drivers/thermal/Kconfig             |   7 ++
 drivers/thermal/imx_thermal.c       |   6 +-
 drivers/thermal/thermal_core.c      |   3 +-
 drivers/thermal/thermal_core.h      |  10 ++
 drivers/thermal/thermal_helpers.c   |   5 +-
 drivers/thermal/thermal_sysfs.c     | 225
 include/linux/thermal.h             |   1 +
 8 files changed, 283 insertions(+), 5 deletions(-)
[PATCH 08/17] Revert "x86/asm: Allow again using asm.h when building for the 'bpf' clang target"
From: Arnaldo Carvalho de Melo This reverts commit ca26cffa4e4aaeb09bb9e308f95c7835cb149248. Newer clang versions accept that asm(_ASM_SP) construct, and now that the bpf-script-test-kbuild.c script, used in one of the 'perf test LLVM' subtests doesn't include ptrace.h, which ended up including arch/x86/include/asm/asm.h, we can revert this patch. Suggested-by: Yonghong Song Link: https://lkml.kernel.org/r/613f0a0d-c433-8f4d-dcc1-c9889deae...@fb.com Acked-by: Yonghong Song Cc: Adrian Hunter Cc: Alexander Potapenko Cc: Alexei Starovoitov Cc: Andrey Ryabinin Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Daniel Borkmann Cc: David Ahern Cc: Dmitriy Vyukov Cc: Jiri Olsa Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Matthias Kaehlcke Cc: Miguel Bernal Marin Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Wang Nan Link: https://lkml.kernel.org/n/tip-nqozcv8loq40tkqpfw997...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- arch/x86/include/asm/asm.h | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h index 386a6900e206..219faaec51df 100644 --- a/arch/x86/include/asm/asm.h +++ b/arch/x86/include/asm/asm.h @@ -136,7 +136,6 @@ #endif #ifndef __ASSEMBLY__ -#ifndef __BPF__ /* * This output constraint should be used for any inline asm which has a "call" * instruction. Otherwise the asm may be inserted before the frame pointer @@ -146,6 +145,5 @@ register unsigned long current_stack_pointer asm(_ASM_SP); #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer) #endif -#endif #endif /* _ASM_X86_ASM_H */ -- 2.14.3
Re: [RFC PATCH 24/35] Revert "ovl: fix relatime for directories"
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi wrote: > This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260. > > Overlayfs no longer relies on the vfs correct atime handling. > > Signed-off-by: Miklos Szeredi > --- > fs/inode.c | 21 - > fs/overlayfs/super.c | 3 --- > include/linux/dcache.h | 3 --- > 3 files changed, 4 insertions(+), 23 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index ef362364d396..163715de8cb2 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -1570,24 +1570,11 @@ EXPORT_SYMBOL(bmap); > static void update_ovl_inode_times(struct dentry *dentry, struct inode > *inode, >bool rcu) > { > - struct dentry *upperdentry; > + if (!rcu) { > + struct inode *realinode = d_real_inode(dentry); > > - /* > -* Nothing to do if in rcu or if non-overlayfs > -*/ > - if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL))) > - return; > - > - upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER); > - > - /* > -* If file is on lower then we can't update atime, so no worries about > -* stale mtime/ctime. 
> -*/ > - if (upperdentry) { > - struct inode *realinode = d_inode(upperdentry); > - > - if ((!timespec_equal(&inode->i_mtime, &realinode->i_mtime) || > + if (unlikely(inode != realinode) && > + (!timespec_equal(&inode->i_mtime, &realinode->i_mtime) || > !timespec_equal(&inode->i_ctime, &realinode->i_ctime))) { > inode->i_mtime = realinode->i_mtime; > inode->i_ctime = realinode->i_ctime; > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > index c3d8c7ea180f..006dc70d7425 100644 > --- a/fs/overlayfs/super.c > +++ b/fs/overlayfs/super.c > @@ -107,9 +107,6 @@ static struct dentry *ovl_d_real(struct dentry *dentry, > if (inode && d_inode(dentry) == inode) > return dentry; > > - if (flags & D_REAL_UPPER) > - return ovl_dentry_upper(dentry); > - > if (!d_is_reg(dentry)) { > if (!inode || inode == d_inode(dentry)) > return dentry; > diff --git a/include/linux/dcache.h b/include/linux/dcache.h > index 82a99d366aec..4c7ab11c627a 100644 > --- a/include/linux/dcache.h > +++ b/include/linux/dcache.h > @@ -565,9 +565,6 @@ static inline struct dentry *d_backing_dentry(struct > dentry *upper) > return upper; > } > > -/* d_real() flags */ > -#define D_REAL_UPPER 0x2 /* return upper dentry or NULL if non-upper */ > - Premature removal of constant. Still in use by may_write_real() at this point. Thanks, Amir.
[PATCH 14/17] perf record: Change warning for missing sysfs entry to debug
From: Thomas Richter Using perf on 4.16.0 kernel on s390 shows this warning: failed: can't open node sysfs data each time I run command perf record ... for example: [root@s35lp76 perf]# ./perf record -e rB -- sleep 1 [ perf record: Woken up 1 times to write data ] failed: can't open node sysfs data [ perf record: Captured and wrote 0.001 MB perf.data (4 samples) ] [root@s35lp76 perf]# It turns out commit e2091cedd51bf ("perf tools: Add MEM_TOPOLOGY feature to perf data file") tries to open directory named /sys/devices/system/node/ which does not exist on s390. This is the call stack: __cmd_record +---> perf_session__write_header +---> perf_header__adds_write +---> do_write_feat +---> write_mem_topology +---> build_mem_topology prints warning The issue starts in do_write_feat() which unconditionally loops over all features and now includes HEADER_MEM_TOPOLOGY and calls write_mem_topology(). Function record__init_features() at the beginning of __cmd_record() sets all features and then turns off some of them. Fix this by changing the warning to a level 2 debug output statement. So it is only shown when debug level 2 or higher is set. Signed-off-by: Thomas Richter Cc: Heiko Carstens Cc: Hendrik Brueckner Cc: Martin Schwidefsky Link: http://lkml.kernel.org/r/20180412133246.92801-1-tmri...@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/header.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c index 121df1683c36..a8bff2178fbc 100644 --- a/tools/perf/util/header.c +++ b/tools/perf/util/header.c @@ -1320,7 +1320,8 @@ static int build_mem_topology(struct memory_node *nodes, u64 size, u64 *cntp) dir = opendir(path); if (!dir) { - pr_warning("failed: can't open node sysfs data\n"); + pr_debug2("%s: could't read %s, does this arch have topology information?\n", + __func__, path); return -1; } -- 2.14.3
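The effect of moving a message from a warning to a level 2 debug statement can be sketched as follows. This is only an illustration of level-gated output under the assumption that a message is printed when the configured verbosity is at least the message level; demo_verbose and demo_debug() are hypothetical and are not perf's implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical verbosity setting, e.g. raised once per -v on a command line. */
static int demo_verbose;

/*
 * Print to stderr only when the configured verbosity reaches the message
 * level. Returns the number of characters written, or 0 when suppressed.
 */
static int demo_debug(int level, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(fmt != NULL);
	if (demo_verbose < level)
		return 0;

	va_start(ap, fmt);
	n = vfprintf(stderr, fmt, ap);
	va_end(ap);
	return n;
}
```

With demo_verbose left at 0, a call such as demo_debug(2, "couldn't read %s\n", path) stays silent, which is the behaviour the patch wants on systems where the node directory simply does not exist.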
[PATCH 12/17] perf sched: Fix documentation for timehist
From: Takuya Yamamoto

Fix an incorrect option and its usage text to match what "perf sched timehist -h" shows, i.e. the default is really --call-graph, which is equivalent to -g.

Signed-off-by: Takuya Yamamoto
Cc: Peter Zijlstra
Link: https://lkml.kernel.org/n/tip-8fzo0dlsi1mku5aqx8bre...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/Documentation/perf-sched.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-sched.txt b/tools/perf/Documentation/perf-sched.txt
index bb33601a823b..63f938b887dd 100644
--- a/tools/perf/Documentation/perf-sched.txt
+++ b/tools/perf/Documentation/perf-sched.txt
@@ -104,8 +104,8 @@ OPTIONS for 'perf sched timehist'
 	kallsyms pathname
 
 -g::
----no-call-graph::
-	Do not display call chains if present.
+--call-graph::
+	Display call chains if present (default on).
 
 --max-stack::
 	Maximum number of functions to display in backtrace, default 5.
-- 
2.14.3
[PATCH 13/17] perf tests: Disable breakpoint accounting test for powerpc
From: Sandipan Das We disable this test as instruction breakpoints (HW_BREAKPOINT_X) are not available for powerpc. Before applying patch: 21: Breakpoint accounting : --- start --- test child forked, pid 3635 failed opening event 0 failed opening event 0 watchpoints count 1, breakpoints count 0, has_ioctl 1, share 0 test child finished with -2 end Breakpoint accounting: Skip After applying patch: 21: Breakpoint accounting : Disabled Signed-off-by: Sandipan Das Cc: Jiri Olsa Cc: Naveen N. Rao Cc: Ravi Bangoria Link: http://lkml.kernel.org/r/20180412162140.2992-1-sandi...@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/tests/builtin-test.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c index 625f5a6772af..cac8f8889bc3 100644 --- a/tools/perf/tests/builtin-test.c +++ b/tools/perf/tests/builtin-test.c @@ -118,6 +118,7 @@ static struct test generic_tests[] = { { .desc = "Breakpoint accounting", .func = test__bp_accounting, + .is_supported = test__bp_signal_is_supported, }, { .desc = "Number of exit events of a simple workload", -- 2.14.3
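How an optional is_supported hook turns a failing test into a "Disabled" one can be sketched in plain C. The struct below only mirrors the three fields visible in the hunk; the runner, the result enum and the demo_* helpers are hypothetical and do not reproduce perf's test harness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_test {
	const char *desc;
	int (*func)(void);
	bool (*is_supported)(void);	/* optional; NULL means always supported */
};

enum demo_result { DEMO_OK, DEMO_FAIL, DEMO_DISABLED };

/* Run one test, skipping it up front when the platform hook says no. */
static enum demo_result demo_run(const struct demo_test *t)
{
	assert(t != NULL && t->func != NULL);

	if (t->is_supported && !t->is_supported())
		return DEMO_DISABLED;

	return t->func() == 0 ? DEMO_OK : DEMO_FAIL;
}

/* Trivial callbacks used to exercise the runner. */
static int demo_pass(void)
{
	return 0;
}

static bool demo_never_supported(void)
{
	return false;
}
```

The test body is never entered when the hook returns false, so an architecture without the required hardware feature reports the test as disabled instead of attempting it and reporting a skip after a failure.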
[PATCH 15/17] perf report: Fix switching to another perf.data file
From: Arnaldo Carvalho de Melo

In the TUI the 's' hotkey can be used to switch to another perf.data file in the current directory, but that got broken by commit b01141f4f59c ("perf annotate: Initialize the priv are in symbol__new()"), which made the following appear once another file was chosen:

  ┌─Fatal Error─────────────────────────────────────┐
  │Annotation needs to be init before symbol__init()│
  │                                                 │
  │                                                 │
  │Press any key...                                 │
  └─────────────────────────────────────────────────┘

Fix it by just silently bailing out if symbol__annotation_init() was already called, just like is done with symbol__init(), i.e. they are done just once at session start, not when switching to a new perf.data file.

Cc: Adrian Hunter
Cc: Andi Kleen
Cc: David Ahern
Cc: Jin Yao
Cc: Jiri Olsa
Cc: Martin Liška
Cc: Namhyung Kim
Cc: Ravi Bangoria
Cc: Thomas Richter
Cc: Wang Nan
Fixes: b01141f4f59c ("perf annotate: Initialize the priv are in symbol__new()")
Link: https://lkml.kernel.org/n/tip-ogppdtpzfax7y1h6gjdv5...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/util/symbol.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 62b2dd2253eb..1466814ebada 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -2091,16 +2091,14 @@ static bool symbol__read_kptr_restrict(void)
 
 int symbol__annotation_init(void)
 {
+	if (symbol_conf.init_annotation)
+		return 0;
+
 	if (symbol_conf.initialized) {
 		pr_err("Annotation needs to be init before symbol__init()\n");
 		return -1;
 	}
 
-	if (symbol_conf.init_annotation) {
-		pr_warning("Annotation being initialized multiple times\n");
-		return 0;
-	}
-
 	symbol_conf.priv_size += sizeof(struct annotation);
 	symbol_conf.init_annotation = true;
 	return 0;
-- 
2.14.3
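The ordering of the two checks is the whole fix, and it can be exercised with a small standalone sketch. The struct and function below are hypothetical stand-ins for symbol_conf and symbol__annotation_init(); only the order of the checks is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the relevant bits of the global configuration. */
struct demo_conf {
	bool initialized;	/* main init already ran */
	bool init_annotation;	/* annotation init already ran */
	size_t priv_size;
};

/*
 * Idempotent init: a repeated call is a silent no-op, even if the main
 * init has run in the meantime. Only a first call made after the main
 * init is an error.
 */
static int demo_annotation_init(struct demo_conf *conf, size_t annotation_size)
{
	assert(conf != NULL);

	if (conf->init_annotation)
		return 0;

	if (conf->initialized)
		return -1;

	conf->priv_size += annotation_size;
	conf->init_annotation = true;
	return 0;
}
```

With the checks in the original order, the second call (made after the main init, as happens when switching files) would hit the "initialized" test first and fail, even though the annotation setup had already been done correctly.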
[PATCH 16/17] perf annotate: Allow setting the offset level in .perfconfig
From: Arnaldo Carvalho de Melo The default is 1 (jump_target): # perf annotate --ignore-vmlinux --stdio2 _raw_spin_lock_irqsave Samples: 3K of event 'cycles:ppp', 3000 Hz, Event count (approx.): 2766398574 _raw_spin_lock_irqsave() /proc/kcore 0.26nop 4.61push %rbx 19.33pushfq 7.97pop%rax 0.32nop 0.06mov%rax,%rbx 14.63cli 0.06nop xor%eax,%eax mov$0x1,%edx 49.94lock cmpxchg %edx,(%rdi) 0.16test %eax,%eax ↓ jne2b 2.66mov%rbx,%rax pop%rbx ← retq 2b: mov%eax,%esi → callq *b30eaed0 mov%rbx,%rax pop%rbx ← retq # But one can ask for showing offsets for call instructions by setting this: # perf annotate --ignore-vmlinux --stdio2 _raw_spin_lock_irqsave Samples: 3K of event 'cycles:ppp', 3000 Hz, Event count (approx.): 2766398574 _raw_spin_lock_irqsave() /proc/kcore 0.26nop 4.61push %rbx 19.33pushfq 7.97pop%rax 0.32nop 0.06mov%rax,%rbx 14.63cli 0.06nop xor%eax,%eax mov$0x1,%edx 49.94lock cmpxchg %edx,(%rdi) 0.16test %eax,%eax ↓ jne2b 2.66mov%rbx,%rax pop%rbx ← retq 2b: mov%eax,%esi 2d: → callq *b30eaed0 mov%rbx,%rax pop%rbx ← retq # Or using a big value to ask for all offsets to be shown: # cat ~/.perfconfig [annotate] offset_level = 100 hide_src_code = true # perf annotate --ignore-vmlinux --stdio2 _raw_spin_lock_irqsave Samples: 3K of event 'cycles:ppp', 3000 Hz, Event count (approx.): 2766398574 _raw_spin_lock_irqsave() /proc/kcore 0.26 0: nop 4.61 5: push %rbx 19.33 6: pushfq 7.97 7: pop%rax 0.32 8: nop 0.06 d: mov%rax,%rbx 14.63 10: cli 0.06 11: nop 17: xor%eax,%eax 19: mov$0x1,%edx 49.94 1e: lock cmpxchg %edx,(%rdi) 0.16 22: test %eax,%eax 24: ↓ jne2b 2.66 26: mov%rbx,%rax 29: pop%rbx 2a: ← retq 2b: mov%eax,%esi 2d: → callq *b30eaed0 32: mov%rbx,%rax 35: pop%rbx 36: ← retq # This also affects the TUI, i.e. 
the default 'perf annotate' and 'perf top/report' -> A hotkey -> annotate interfaces, when slang-devel is present in the build, i.e.: # perf version --build-options | grep slang libslang: [ on ] # HAVE_SLANG_SUPPORT # Cc: Adrian Hunter Cc: Andi Kleen Cc: David Ahern Cc: Jin Yao Cc: Jiri Olsa Cc: Martin Liška Cc: Namhyung Kim Cc: Ravi Bangoria Cc: Thomas Richter Cc: Wang Nan Link: https://lkml.kernel.org/n/tip-venm6x5zrt40eu8hxdsmq...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/Documentation/perf-config.txt | 5 + tools/perf/util/annotate.c | 15 --- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/tools/perf/Documentation/perf-config.txt b/tools/perf/Documentation/perf-config.txt index 5b4fff3adc4b..32f4a898e3f2 100644 --- a/tools/perf/Documentation/perf-config.txt +++ b/tools/perf/Documentation/perf-config.txt @@ -334,6 +334,11 @@ annotate.*:: 99.93 │ mov%eax,%eax + annotate.offset_level:: + Default is '1', meaning just jump targets will have offsets show right beside + the instruction. When set to '2' 'call' instructions will also have its offsets + shown, 3 or higher will show offsets for all instructions. 
+ hist.*:: hist.percentage:: This option control the way to calculate overhead of filtered entries - diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c index 5edc565d86c4..536ee148bff8 100644 --- a/tools/perf/util/annotate.c +++ b/tools/perf/util/annotate.c @@ -2649,10 +2649,11 @@ int __annotation__scnprintf_samples_period(struct annotation *notes, */ static struct annotation_config { const char *name; - bool *value; + void *value; } annotation__configs[] = { ANNOTATION__CFG(hide_src_code), ANNOTATION__CFG(jump_arrows), + ANNOTATION__CFG(offset_level), ANNOTATION__CFG(show_linenr), ANNOTATION__CFG(show_nr_jumps), ANNOTATION__CFG(show_nr_samples), @@ -2684,8 +2685,16 @@ static int annotation__config(const char *var, const char *value, if (cfg == NULL) pr_debug("%s variable unknown, ignoring...", var); - else - *cfg->valu
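The offset_level semantics documented in this patch (1: jump targets only, 2: also 'call' instructions, 3 or higher: every instruction) can be written down as a small predicate. This is a sketch of the documented behaviour, not the code from annotate.c; the function name and constants are hypothetical, and treating level 0 as "show nothing" is an assumption the documentation does not state.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	DEMO_OFFSET_LEVEL_JUMP_TARGET = 1,
	DEMO_OFFSET_LEVEL_CALL = 2,
};

/* Decide whether an instruction's offset is printed at a given level. */
static bool demo_show_offset(int offset_level, bool is_jump_target, bool is_call)
{
	assert(offset_level >= 0);

	if (offset_level > DEMO_OFFSET_LEVEL_CALL)
		return true;		/* 3 or higher: all instructions */
	if (offset_level == DEMO_OFFSET_LEVEL_CALL && is_call)
		return true;		/* 2: calls as well */
	return offset_level >= DEMO_OFFSET_LEVEL_JUMP_TARGET && is_jump_target;
}
```

This matches the three listings in the commit message: only "2b:" is labelled by default, "2d:" on the callq line appears at level 2, and every line carries an offset with offset_level = 100.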
[PATCH 17/17] perf annotate: Handle variables in 'sub', 'or' and many other instructions
From: Arnaldo Carvalho de Melo Just like is done for 'mov' and others that can have as source or targets variables resolved by objdump, to make them more compact: - orb$0x4,0x224d71(%rip)# 226ca4 <_rtld_global+0xca4> + orb$0x4,_rtld_global+0xca4 Cc: Adrian Hunter Cc: Andi Kleen Cc: David Ahern Cc: Jin Yao Cc: Jiri Olsa Cc: Martin Liška Cc: Namhyung Kim Cc: Ravi Bangoria Cc: Thomas Richter Cc: Wang Nan Link: https://lkml.kernel.org/n/tip-efex7746id4w4wa03nqxv...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/x86/annotate/instructions.c | 67 - 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/tools/perf/arch/x86/annotate/instructions.c b/tools/perf/arch/x86/annotate/instructions.c index 5bd1ba8c0282..44f5aba78210 100644 --- a/tools/perf/arch/x86/annotate/instructions.c +++ b/tools/perf/arch/x86/annotate/instructions.c @@ -1,21 +1,43 @@ // SPDX-License-Identifier: GPL-2.0 static struct ins x86__instructions[] = { + { .name = "adc",.ops = &mov_ops, }, + { .name = "adcb", .ops = &mov_ops, }, + { .name = "adcl", .ops = &mov_ops, }, { .name = "add",.ops = &mov_ops, }, { .name = "addl", .ops = &mov_ops, }, { .name = "addq", .ops = &mov_ops, }, + { .name = "addsd", .ops = &mov_ops, }, { .name = "addw", .ops = &mov_ops, }, { .name = "and",.ops = &mov_ops, }, + { .name = "andb", .ops = &mov_ops, }, + { .name = "andl", .ops = &mov_ops, }, + { .name = "andpd", .ops = &mov_ops, }, + { .name = "andps", .ops = &mov_ops, }, + { .name = "andq", .ops = &mov_ops, }, + { .name = "andw", .ops = &mov_ops, }, + { .name = "bsr",.ops = &mov_ops, }, + { .name = "bt", .ops = &mov_ops, }, + { .name = "btr",.ops = &mov_ops, }, { .name = "bts",.ops = &mov_ops, }, + { .name = "btsq", .ops = &mov_ops, }, { .name = "call", .ops = &call_ops, }, { .name = "callq", .ops = &call_ops, }, + { .name = "cmovbe", .ops = &mov_ops, }, + { .name = "cmove", .ops = &mov_ops, }, + { .name = "cmovae", .ops = &mov_ops, }, { .name = "cmp",.ops = &mov_ops, }, { .name = "cmpb", 
.ops = &mov_ops, }, { .name = "cmpl", .ops = &mov_ops, }, { .name = "cmpq", .ops = &mov_ops, }, { .name = "cmpw", .ops = &mov_ops, }, { .name = "cmpxch", .ops = &mov_ops, }, + { .name = "cmpxchg",.ops = &mov_ops, }, + { .name = "cs", .ops = &mov_ops, }, { .name = "dec",.ops = &dec_ops, }, { .name = "decl", .ops = &dec_ops, }, + { .name = "divsd", .ops = &mov_ops, }, + { .name = "divss", .ops = &mov_ops, }, + { .name = "gs", .ops = &mov_ops, }, { .name = "imul", .ops = &mov_ops, }, { .name = "inc",.ops = &dec_ops, }, { .name = "incl", .ops = &dec_ops, }, @@ -57,25 +79,68 @@ static struct ins x86__instructions[] = { { .name = "lea",.ops = &mov_ops, }, { .name = "lock", .ops = &lock_ops, }, { .name = "mov",.ops = &mov_ops, }, + { .name = "movapd", .ops = &mov_ops, }, + { .name = "movaps", .ops = &mov_ops, }, { .name = "movb", .ops = &mov_ops, }, { .name = "movdqa", .ops = &mov_ops, }, + { .name = "movdqu", .ops = &mov_ops, }, { .name = "movl", .ops = &mov_ops, }, { .name = "movq", .ops = &mov_ops, }, + { .name = "movsd", .ops = &mov_ops, }, { .name = "movslq", .ops = &mov_ops, }, + { .name = "movss", .ops = &mov_ops, }, + { .name = "movupd", .ops = &mov_ops, }, + { .name = "movups", .ops = &mov_ops, }, + { .name = "movw", .ops = &mov_ops, }, { .name = "movzbl", .ops = &mov_ops, }, { .name = "movzwl", .ops = &mov_ops, }, + { .name = "mulsd", .ops = &mov_ops, }, + { .name = "mulss", .ops = &mov_ops, }, { .name = "nop",.ops = &nop_ops, }, { .name = "nopl", .ops = &nop_ops, }, { .name = "nopw", .ops = &nop_ops, }, { .name = "or", .ops = &mov_ops, }, + { .name = "orb",.ops = &mov_ops, }, { .name = "orl",.ops = &mov_ops, }, + { .name = "orps", .ops = &mov_ops, }, + { .name = "orq",.ops = &mov_ops, }, + { .name = "pand", .ops = &mov_ops, }, + { .name = "paddq", .ops = &mov_ops, }, + { .name = "pcmpeqb",.ops = &mov_ops, }, + { .name = "por",.ops = &mov_ops, }, + { .n
[PATCH 11/17] perf version: Print status for syscall_table
From: Jin Yao

Don't print the "libaudit" line if HAVE_SYSCALL_TABLE_SUPPORT is available, and add a line for HAVE_SYSCALL_TABLE_SUPPORT.

For example,

$ ./perf -vv
perf version 4.13.rc5.gc2f8af9
                 dwarf: [ on ]  # HAVE_DWARF_SUPPORT
    dwarf_getlocations: [ on ]  # HAVE_DWARF_GETLOCATIONS_SUPPORT
                 glibc: [ on ]  # HAVE_GLIBC_SUPPORT
                  gtk2: [ on ]  # HAVE_GTK2_SUPPORT
         syscall_table: [ on ]  # HAVE_SYSCALL_TABLE_SUPPORT
                libbfd: [ on ]  # HAVE_LIBBFD_SUPPORT
                libelf: [ on ]  # HAVE_LIBELF_SUPPORT
               libnuma: [ on ]  # HAVE_LIBNUMA_SUPPORT
numa_num_possible_cpus: [ on ]  # HAVE_LIBNUMA_SUPPORT
               libperl: [ on ]  # HAVE_LIBPERL_SUPPORT
             libpython: [ on ]  # HAVE_LIBPYTHON_SUPPORT
              libslang: [ on ]  # HAVE_SLANG_SUPPORT
             libcrypto: [ on ]  # HAVE_LIBCRYPTO_SUPPORT
             libunwind: [ on ]  # HAVE_LIBUNWIND_SUPPORT
    libdw-dwarf-unwind: [ on ]  # HAVE_DWARF_SUPPORT
                  zlib: [ on ]  # HAVE_ZLIB_SUPPORT
                  lzma: [ on ]  # HAVE_LZMA_SUPPORT
             get_cpuid: [ on ]  # HAVE_AUXTRACE_SUPPORT
                   bpf: [ on ]  # HAVE_LIBBPF_SUPPORT

The line "syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT" is newly added.

Signed-off-by: Jin Yao
Suggested-by: Arnaldo Carvalho de Melo
Tested-by: Arnaldo Carvalho de Melo
Cc: Alexander Shishkin
Cc: Andi Kleen
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1523269609-28824-4-git-send-email-yao@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/builtin-version.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/builtin-version.c b/tools/perf/builtin-version.c
index 2abe3910d6b6..50df168be326 100644
--- a/tools/perf/builtin-version.c
+++ b/tools/perf/builtin-version.c
@@ -60,7 +60,10 @@ static void library_status(void)
 	STATUS(HAVE_DWARF_GETLOCATIONS_SUPPORT, dwarf_getlocations);
 	STATUS(HAVE_GLIBC_SUPPORT, glibc);
 	STATUS(HAVE_GTK2_SUPPORT, gtk2);
+#ifndef HAVE_SYSCALL_TABLE_SUPPORT
 	STATUS(HAVE_LIBAUDIT_SUPPORT, libaudit);
+#endif
+	STATUS(HAVE_SYSCALL_TABLE_SUPPORT, syscall_table);
 	STATUS(HAVE_LIBBFD_SUPPORT, libbfd);
 	STATUS(HAVE_LIBELF_SUPPORT, libelf);
 	STATUS(HAVE_LIBNUMA_SUPPORT, libnuma);
-- 
2.14.3
[PATCH 10/17] perf tools: Rename HAVE_SYSCALL_TABLE to HAVE_SYSCALL_TABLE_SUPPORT
From: Jin Yao To be consistent with other HAVE_XXX_SUPPORT uses in Makefile.config, this patch renames HAVE_SYSCALL_TABLE to HAVE_SYSCALL_TABLE_SUPPORT and updates the C code accordingly. Signed-off-by: Jin Yao Suggested-by: Arnaldo Carvalho de Melo Cc: Alexander Shishkin Cc: Andi Kleen Cc: Jiri Olsa Cc: Kan Liang Cc: Peter Zijlstra Link: http://lkml.kernel.org/r/1523269609-28824-3-git-send-email-yao@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/Makefile.config | 2 +- tools/perf/builtin-help.c | 2 +- tools/perf/perf.c | 4 ++-- tools/perf/util/generate-cmdlist.sh | 2 +- tools/perf/util/syscalltbl.c| 6 +++--- 5 files changed, 8 insertions(+), 8 deletions(-) diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index 6b307e97dc57..ae7dc46e8f8a 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -68,7 +68,7 @@ ifeq ($(NO_PERF_REGS),0) endif ifneq ($(NO_SYSCALL_TABLE),1) - CFLAGS += -DHAVE_SYSCALL_TABLE + CFLAGS += -DHAVE_SYSCALL_TABLE_SUPPORT endif # So far there's only x86 and arm libdw unwind support merged in perf. 
diff --git a/tools/perf/builtin-help.c b/tools/perf/builtin-help.c index 4aca13f23b9d..1c41b4eaf73c 100644 --- a/tools/perf/builtin-help.c +++ b/tools/perf/builtin-help.c @@ -439,7 +439,7 @@ int cmd_help(int argc, const char **argv) #ifdef HAVE_LIBELF_SUPPORT "probe", #endif -#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE) +#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE_SUPPORT) "trace", #endif NULL }; diff --git a/tools/perf/perf.c b/tools/perf/perf.c index 1659029d03fc..20a08cb32332 100644 --- a/tools/perf/perf.c +++ b/tools/perf/perf.c @@ -73,7 +73,7 @@ static struct cmd_struct commands[] = { { "lock", cmd_lock, 0 }, { "kvm",cmd_kvm,0 }, { "test", cmd_test, 0 }, -#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE) +#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE_SUPPORT) { "trace", cmd_trace, 0 }, #endif { "inject", cmd_inject, 0 }, @@ -491,7 +491,7 @@ int main(int argc, const char **argv) argv[0] = cmd; } if (strstarts(cmd, "trace")) { -#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE) +#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE_SUPPORT) setup_path(); argv[0] = "trace"; return cmd_trace(argc, argv); diff --git a/tools/perf/util/generate-cmdlist.sh b/tools/perf/util/generate-cmdlist.sh index ff17920a5ebc..c3cef36d4176 100755 --- a/tools/perf/util/generate-cmdlist.sh +++ b/tools/perf/util/generate-cmdlist.sh @@ -38,7 +38,7 @@ do done echo "#endif /* HAVE_LIBELF_SUPPORT */" -echo "#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE)" +echo "#if defined(HAVE_LIBAUDIT_SUPPORT) || defined(HAVE_SYSCALL_TABLE_SUPPORT)" sed -n -e 's/^perf-\([^]*\)[ ].* audit*/\1/p' command-list.txt | sort | while read cmd diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c index 895122d638dd..0ee7f568d60c 100644 --- a/tools/perf/util/syscalltbl.c +++ b/tools/perf/util/syscalltbl.c @@ -17,7 +17,7 @@ #include #include -#ifdef HAVE_SYSCALL_TABLE +#ifdef 
HAVE_SYSCALL_TABLE_SUPPORT #include #include "string2.h" #include "util.h" @@ -139,7 +139,7 @@ int syscalltbl__strglobmatch_first(struct syscalltbl *tbl, const char *syscall_g return syscalltbl__strglobmatch_next(tbl, syscall_glob, idx); } -#else /* HAVE_SYSCALL_TABLE */ +#else /* HAVE_SYSCALL_TABLE_SUPPORT */ #include @@ -176,4 +176,4 @@ int syscalltbl__strglobmatch_first(struct syscalltbl *tbl, const char *syscall_g { return syscalltbl__strglobmatch_next(tbl, syscall_glob, idx); } -#endif /* HAVE_SYSCALL_TABLE */ +#endif /* HAVE_SYSCALL_TABLE_SUPPORT */ -- 2.14.3
[PATCH 07/17] perf tests bpf: Remove unused ptrace.h include from LLVM test
From: Arnaldo Carvalho de Melo The bpf-script-test-kbuild.c script, used in one of the LLVM subtests, includes ptrace.h unnecessarily, and that ends up making it include a header that uses asm(_ASM_SP), a feature that is not supported by clang <= 4.0, breaking that 'perf test' entry. This ended up leading to the ca26cffa4e4a ("x86/asm: Allow again using asm.h when building for the 'bpf' clang target"), adding an ifndef __BPF__ to the arch/x86/include/asm/asm.h file. Newer clang versions accept that asm(_ASM_SP) construct, so just remove the ptrace.h include, which paves the way for reverting ca26cffa4e4a ("x86/asm: Allow again using asm.h when building for the 'bpf' clang target"). Suggested-by: Yonghong Song Acked-by: Yonghong Song Link: https://lkml.kernel.org/r/613f0a0d-c433-8f4d-dcc1-c9889deae...@fb.com Cc: Adrian Hunter Cc: Alexander Potapenko Cc: Alexei Starovoitov Cc: Andrey Ryabinin Cc: Andy Lutomirski Cc: Arnd Bergmann Cc: Daniel Borkmann Cc: David Ahern Cc: Dmitriy Vyukov Cc: Jiri Olsa Cc: Josh Poimboeuf Cc: Linus Torvalds Cc: Matthias Kaehlcke Cc: Miguel Bernal Marin Cc: Namhyung Kim Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Wang Nan Link: https://lkml.kernel.org/n/tip-clbcnzbakdp18ibme4wt4...@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/tests/bpf-script-test-kbuild.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/perf/tests/bpf-script-test-kbuild.c b/tools/perf/tests/bpf-script-test-kbuild.c index 3626924740d8..ff3ec8337f0a 100644 --- a/tools/perf/tests/bpf-script-test-kbuild.c +++ b/tools/perf/tests/bpf-script-test-kbuild.c @@ -9,7 +9,6 @@ #define SEC(NAME) __attribute__((section(NAME), used)) #include -#include SEC("func=vfs_llseek") int bpf_func__vfs_llseek(void *ctx) -- 2.14.3
[PATCH 05/17] perf annotate browser: Allow showing offsets in more than just jump targets
From: Arnaldo Carvalho de Melo

Jesper wanted to see offsets at callq sites when doing some performance investigation related to retpolines, so save him some time by providing an 'O' hotkey to allow showing offsets from function start at call instructions or in all instructions. Just keep pressing 'O' until the offsets you need appear.

Example:

Starts with:

Samples: 64 of event 'cycles:ppp', 10 Hz, Event count (approx.): 318963
ixgbe_read_reg /proc/kcore
Percent│↑ je 2a
       │ ┌──cmp $0x,%r13d
       │ ├──je d0
       │ │ mov $0x53e3,%edi
       │ │→ callq __const_udelay
       │ │ sub $0x1,%r15d
       │ │↑ jne 83
       │ │ mov 0x8(%rbp),%rax
       │ │ testb $0x20,0x1799(%rax)
       │ │↑ je 2a
       │ │ mov 0x200(%rax),%rdi
       │ │ mov %r13d,%edx
       │ │ mov $0xc02595d8,%rsi
       │ │→ callq netdev_warn
       │ │↑ jmpq 2a
       │d0:└─→mov 0x8(%rbp),%rsi
       │ mov %rbp,%rdi
       │ mov %eax,0x4(%rsp)
       │→ callq ixgbe_remove_adapter.isra.77
       │ mov 0x4(%rsp),%eax
Press 'h' for help on key bindings

Press 'O':

Samples: 64 of event 'cycles:ppp', 10 Hz, Event count (approx.): 318963
ixgbe_read_reg /proc/kcore
Percent│↑ je 2a
       │ ┌──cmp $0x,%r13d
       │ ├──je d0
       │ │ mov $0x53e3,%edi
       │99:│→ callq __const_udelay
       │ │ sub $0x1,%r15d
       │ │↑ jne 83
       │ │ mov 0x8(%rbp),%rax
       │ │ testb $0x20,0x1799(%rax)
       │ │↑ je 2a
       │ │ mov 0x200(%rax),%rdi
       │ │ mov %r13d,%edx
       │ │ mov $0xc02595d8,%rsi
       │c6:│→ callq netdev_warn
       │ │↑ jmpq 2a
       │d0:└─→mov 0x8(%rbp),%rsi
       │ mov %rbp,%rdi
       │ mov %eax,0x4(%rsp)
       │db: → callq ixgbe_remove_adapter.isra.77
       │ mov 0x4(%rsp),%eax
Press 'h' for help on key bindings

Press 'O' again:

Samples: 64 of event 'cycles:ppp', 10 Hz, Event count (approx.): 318963
ixgbe_read_reg /proc/kcore
Percent│8c: ↑ je 2a
       │8e:┌──cmp $0x,%r13d
       │92:├──je d0
       │94:│ mov $0x53e3,%edi
       │99:│→ callq __const_udelay
       │9e:│ sub $0x1,%r15d
       │a2:│↑ jne 83
       │a4:│ mov 0x8(%rbp),%rax
       │a8:│ testb $0x20,0x1799(%rax)
       │af:│↑ je 2a
       │b5:│ mov 0x200(%rax),%rdi
       │bc:│ mov %r13d,%edx
       │bf:│ mov $0xc02595d8,%rsi
       │c6:│→ callq netdev_warn
       │cb:│↑ jmpq 2a
       │d0:└─→mov 0x8(%rbp),%rsi
       │d4: mov %rbp,%rdi
       │d7: mov %eax,0x4(%rsp)
       │db: → callq ixgbe_remove_adapter.isra.77
       │e0: mov 0x4(%rsp),%eax
Press 'h' for help on key bindings

Press 'O' again and it will show just jump target offsets.

Suggested-by: Jesper Dangaard Brouer
Cc: Adrian Hunter
Cc: Alexei Starovoitov
Cc: Andi Kleen
Cc: Daniel Borkmann
Cc: David Ahern
Cc: Jin Yao
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Martin Liška
Cc: Namhyung Kim
Cc: Ravi Bangoria
Cc: Thomas Richter
Cc: Wang Nan
Link: https://lkml.kernel.org/n/tip-upp6pfdetwlsx18ec2uf1...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/ui/browsers/annotate.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index 12c099a87f8b..3781d74088a7 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -692,6 +692,7 @@ static int annotate_browser__run(struct annotate_browser *browser,
 		"J Toggle showing number of jump sources on targets\n"
 		"n Search next string\n"
 		"o Toggle disassembler output/simplified view\n"
+		"O Bump offset level (jump targets -> +call -> all -> cycle thru)\n"
 		"s Toggle source code view\n"
 		"t Circulate percent, total period, samples view\n"
 		"/ Search string\n"
@@ -719,6 +720,10 @@ static int annotate_browser__run(struct annotate_browser *browser,
 			notes->options->use_offset = !notes->options->use_offset;
 			annotation__update_column_widths(notes);
 			continue;
+		case 'O':
+			if (++notes->options->offset_level > ANNOTATION__MAX_OFFSET_LEVEL)
+				notes->options->offset_level = ANNOTATION__MIN_OFFSET_LEVEL;
+			continue;
 		case 'j':
 			notes->options->jump_arrows = !notes->
[PATCH 04/17] perf annotate: Allow showing offsets in more than just jump targets
From: Arnaldo Carvalho de Melo

Jesper wanted to see offsets at callq sites when doing some performance investigation related to retpolines, so save him some time by providing a 'struct annotation_options' field to control where offsets should appear: just on jump targets? That + call instructions? All?

This puts in place the logic to show the offsets, now we need to wire this up in the TUI browser (next patch) and on the 'perf annotate --stdio2' interface, where we need a more general mechanism to set up the 'annotation_options' struct from the command line.

Suggested-by: Jesper Dangaard Brouer
Cc: Adrian Hunter
Cc: Alexei Starovoitov
Cc: Andi Kleen
Cc: Daniel Borkmann
Cc: David Ahern
Cc: Jin Yao
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Martin Liška
Cc: Namhyung Kim
Cc: Ravi Bangoria
Cc: Thomas Richter
Cc: Wang Nan
Link: https://lkml.kernel.org/n/tip-m3jc9c3swobye9tj08gnh...@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/util/annotate.c | 11 +--
 tools/perf/util/annotate.h | 9 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index fbad8dfbb186..5edc565d86c4 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -46,6 +46,7 @@
 struct annotation_options annotation__default_options = {
 	.use_offset = true,
 	.jump_arrows= true,
+	.offset_level = ANNOTATION__OFFSET_JUMP_TARGETS,
 };
 
 const char *disassembler_style;
@@ -2512,7 +2513,8 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati
 	if (!notes->options->use_offset) {
 		printed = scnprintf(bf, sizeof(bf), "%" PRIx64 ": ", addr);
 	} else {
-		if (al->jump_sources) {
+		if (al->jump_sources &&
+		    notes->options->offset_level >= ANNOTATION__OFFSET_JUMP_TARGETS) {
 			if (notes->options->show_nr_jumps) {
 				int prev;
 				printed = scnprintf(bf, sizeof(bf), "%*d ",
@@ -2523,9 +2525,14 @@ static void __annotation_line__write(struct annotation_line *al, struct annotati
 				obj__printf(obj, bf);
 				obj__set_color(obj, prev);
 			}
-
+print_addr:
 			printed = scnprintf(bf, sizeof(bf), "%*" PRIx64 ": ",
 					    notes->widths.target, addr);
+		} else if (ins__is_call(&disasm_line(al)->ins) &&
+			   notes->options->offset_level >= ANNOTATION__OFFSET_CALL) {
+			goto print_addr;
+		} else if (notes->options->offset_level == ANNOTATION__MAX_OFFSET_LEVEL) {
+			goto print_addr;
 		} else {
 			printed = scnprintf(bf, sizeof(bf), "%-*s ",
 					    notes->widths.addr, " ");
diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
index db8d09bea07e..f28a9e43421d 100644
--- a/tools/perf/util/annotate.h
+++ b/tools/perf/util/annotate.h
@@ -70,8 +70,17 @@ struct annotation_options {
 	     show_nr_jumps,
 	     show_nr_samples,
 	     show_total_period;
+	u8 offset_level;
 };
 
+enum {
+	ANNOTATION__OFFSET_JUMP_TARGETS = 1,
+	ANNOTATION__OFFSET_CALL,
+	ANNOTATION__MAX_OFFSET_LEVEL,
+};
+
+#define ANNOTATION__MIN_OFFSET_LEVEL ANNOTATION__OFFSET_JUMP_TARGETS
+
 extern struct annotation_options annotation__default_options;
 
 struct annotation;
-- 
2.14.3
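
For readers following the two annotate patches, the offset-level logic can be restated outside of perf as a small standalone sketch. The enum values are copied from the patch; the helper names offset_is_shown() and bump_offset_level() are hypothetical and do not exist in the perf sources, where the same decisions are made inline in __annotation_line__write() and in the 'O' key handler.

```c
#include <stdbool.h>

/* Same values as the enum added to tools/perf/util/annotate.h. */
enum {
	ANNOTATION__OFFSET_JUMP_TARGETS = 1,
	ANNOTATION__OFFSET_CALL,
	ANNOTATION__MAX_OFFSET_LEVEL,
};

#define ANNOTATION__MIN_OFFSET_LEVEL ANNOTATION__OFFSET_JUMP_TARGETS

/*
 * Hypothetical helper: does a line show its offset at this level?
 * Follows the order of the if/else chain in the patch: jump targets
 * first, then call instructions, then everything at the maximum level.
 */
bool offset_is_shown(unsigned char level, bool is_jump_target, bool is_call)
{
	if (is_jump_target && level >= ANNOTATION__OFFSET_JUMP_TARGETS)
		return true;
	if (is_call && level >= ANNOTATION__OFFSET_CALL)
		return true;
	return level == ANNOTATION__MAX_OFFSET_LEVEL;
}

/* Hypothetical helper: effect of one 'O' key press on the level. */
unsigned char bump_offset_level(unsigned char level)
{
	if (++level > ANNOTATION__MAX_OFFSET_LEVEL)
		level = ANNOTATION__MIN_OFFSET_LEVEL;
	return level;
}
```

Starting from the default level (jump targets only), one press adds call sites, a second press shows every offset, and a third wraps back to the default.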
Re: [PATCH 3/3] dcache: account external names as indirectly reclaimable memory
On Fri 13-04-18 22:35:19, Minchan Kim wrote: > On Mon, Mar 05, 2018 at 01:37:43PM +, Roman Gushchin wrote: [...] > > @@ -1614,9 +1623,11 @@ struct dentry *__d_alloc(struct super_block *sb, > > const struct qstr *name) > > name = &slash_name; > > dname = dentry->d_iname; > > } else if (name->len > DNAME_INLINE_LEN-1) { > > - size_t size = offsetof(struct external_name, name[1]); > > - struct external_name *p = kmalloc(size + name->len, > > - GFP_KERNEL_ACCOUNT); > > + struct external_name *p; > > + > > + reclaimable = offsetof(struct external_name, name[1]) + > > + name->len; > > + p = kmalloc(reclaimable, GFP_KERNEL_ACCOUNT); > > Can't we use kmem_cache_alloc with own cache created with SLAB_RECLAIM_ACCOUNT > if they are reclaimable? No, because names have different sizes and so we would basically have to duplicate many caches. -- Michal Hocko SUSE Labs
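
As a rough userspace illustration of the sizing in the quoted hunk (not kernel code): the allocation size depends on the name length, so no single fixed-size object cache fits every external name. The struct below is a simplified stand-in, with a plain counter as a placeholder for the real header, and external_name_size() is a hypothetical helper name.

```c
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's struct external_name: only the
 * layout idea (fixed header followed by the name bytes) is kept.
 */
struct external_name {
	unsigned int count;	/* placeholder for the real header fields */
	unsigned char name[];	/* name bytes follow the header */
};

/* Bytes requested for a name of 'len' characters plus its terminator. */
size_t external_name_size(size_t len)
{
	/* Same value as offsetof(struct external_name, name[1]) + len. */
	return offsetof(struct external_name, name) + 1 + len;
}
```

Every extra character in the name adds one byte to the request, which is why a variable-size allocator is used here instead of one dedicated cache.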
Re: [PATCH 00/30] kconfig: move compiler capability tests to Kconfig
2018-04-13 21:21 GMT+09:00 Masahiro Yamada :
> 2018-04-13 14:52 GMT+09:00 Kees Cook :
>> On Thu, Apr 12, 2018 at 10:06 PM, Masahiro Yamada wrote:
>>> [Major Changes in V3]
>>
>> Awesome work! I don't see this pushed to your git tree? I'd like to
>> test it, but I'd rather "git fetch" instead of "git am" :)
>>
>> -Kees
>>
>
> I pushed this series to the following branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild.git
> kconfig-shell-v3
>

If this approach is successful, we will move more and more compiler option tests to the Kconfig stage in the future. People (including me) might be worried about how slow Kconfig will become.

First, I compared the before/after on my PC.

Without this series,

masahiro@grover:~/workspace/linux-kbuild$ time make -s defconfig

real 0m0.175s
user 0m0.128s
sys 0m0.008s

With this series,

masahiro@grover:~/workspace/linux-kbuild$ time make -s defconfig

real 0m0.729s
user 0m0.400s
sys 0m0.056s

This is a noticeable difference.

Then, I looked into a per-commit analysis. Here is the real time of 'time make -s defconfig' at each commit:

[30/30] kbuild: test dead code/data elimination...  0m0.719s
[29/30] arm64: move GCC version check for...  0m0.711s
[28/30] gcc-plugins: allow to enable GCC_PLUGINS...  0m0.722s
[27/30] gcc-plugins: test plugin support in...  0m0.719s  [+0.31]
[26/30] gcc-plugins: move GCC version check...  0m0.410s
[25/30] kcov: test compiler capability in...  0m0.392s
[24/30] gcov: remove CONFIG_GCOV_FORMAT_AUTODETECT  0m0.400s
[23/30] kconfig: add CC_IS_CLANG and CLANG_VERSION  0m0.396s
[22/30] kconfig: add CC_IS_GCC and GCC_VERSION  0m0.392s
[21/30] stack-protector: test compiler capability...  0m0.381s  [+0.04]
[20/30] kconfig: add basic helper macros to...  0m0.343s
[19/30] kconfig: show compiler version text...  0m0.345s
[18/30] kconfig: test: test text expansion...  0m0.342s
[17/30] Documentation: kconfig: document...  0m0.344s
[16/30] kconfig: add 'info' and 'warning'...  0m0.347s
[15/30] kconfig: expand lefthand side of...  0m0.340s
[14/30] kconfig: support append assignment...  0m0.342s
[13/30] kconfig: support simply expanded...  0m0.341s
[12/30] kconfig: support variable and...  0m0.344s
[11/30] kconfig: begin PARAM state only...  0m0.342s
[10/30] kconfig: replace $(UNAME_RELEASE)...  0m0.347s
[09/30] kconfig: add 'shell' built-in function  0m0.344s
[08/30] kconfig: add built-in function support  0m0.350s
[07/30] kconfig: remove sym_expand_string_value()  0m0.344s
[06/30] kconfig: remove string expansion...  0m0.349s
[05/30] kconfig: remove string expansion...  0m0.342s
[04/30] kconfig: reference environment...  0m0.342s
[03/30] kbuild: remove CONFIG_CROSS_COMPILE...  0m0.347s
[02/30] kbuild: remove kbuild cache  0m0.347s  [+0.17]
[01/30] gcc-plugins: fix build condition...  0m0.171s
[00/30] Merge tag 'drm-fixes-for-v4.17-rc1'...  0m0.176s

There are three big jump points.

The first one is [02/30] (+0.17). We are removing the build cache, so this is what we expect.

The second one is [21/30] (+0.04). For x86, Kconfig runs scripts/gcc-x86_{32,64}-has-stack-protector.sh

The biggest one is [27/30] (+0.31). scripts/gcc-plugins.sh is probably a very costly script. If we bump the minimum gcc version to GCC 4.8, the script will be much cleaner in the future.

I was also interested in the cost of a single $(cc-option ...) invocation. It is pretty easy to measure this. For example, copy $(cc-option -fstack-protector) into 1000 lines as follows.

config FOO
	bool
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	default $(cc-option -fstack-protector)
	... [ repeat 1000 lines ]

On my core i7 PC, it took 7.2 msec to run $(cc-option -fstack-protector) 1000 times.

We can make it much faster. Currently we use

  $(CC) -Werror $(1) -c -x c /dev/null

to test the compiler flag.

Ulf Magnusson suggested using -S instead of -c (https://patchwork.kernel.org/patch/10309297/). With -S, the compiler stops after the compilation stage. It took only 4.0 msec to run $(cc-option -fstack-protector) 1000 times.

If I use -E (only the pre-process stage), it becomes even faster. It took only 2.6 msec.

As for $(cc-option ...), this will probably not be a problem. For some features we need special shell scripts, some of which can be more costly.

-- 
Best Regards
Masahiro Yamada
[GIT PULL] dmi fixes for v4.17
Hi Linus,

Please pull dmi subsystem updates/fixes for Linux v4.17 from:

git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging.git dmi-for-linus

 drivers/firmware/dmi_scan.c     | 16
 include/linux/mod_devicetable.h | 1 +
 2 files changed, 13 insertions(+), 4 deletions(-)

---

Alex Hung (1):
      firmware: dmi_scan: Add DMI_OEM_STRING support to dmi_matches

Jean Delvare (2):
      firmware: dmi_scan: Fix UUID length safety check
      firmware: dmi_scan: Use lowercase letters for UUID

Thanks,
-- 
Jean Delvare
SUSE L3 Support
[PATCH v3 1/2] dmaengine: stm32-mdma: align TLEN and buffer length on burst
Both buffer Transfer Length (TLEN if any) and transfer size have to be aligned on burst size (burst beats*bus width). Signed-off-by: Pierre-Yves MORDRET --- Version history: v1: * Initial v2: v3: * Get rid of while loop in favor of computed values --- --- drivers/dma/stm32-mdma.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/dma/stm32-mdma.c b/drivers/dma/stm32-mdma.c index daa1602..4c7634c 100644 --- a/drivers/dma/stm32-mdma.c +++ b/drivers/dma/stm32-mdma.c @@ -410,13 +410,10 @@ static enum dma_slave_buswidth stm32_mdma_get_max_width(dma_addr_t addr, static u32 stm32_mdma_get_best_burst(u32 buf_len, u32 tlen, u32 max_burst, enum dma_slave_buswidth width) { - u32 best_burst = max_burst; - u32 burst_len = best_burst * width; + u32 best_burst; - while ((burst_len > 0) && (tlen % burst_len)) { - best_burst = best_burst >> 1; - burst_len = best_burst * width; - } + best_burst = min((u32)1 << __ffs(tlen | buf_len), +max_burst * width) / width; return (best_burst > 0) ? best_burst : 1; } -- 2.7.4
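
The computed replacement for the loop can be tried in userspace with the sketch below. It is only an approximation of the patched function: uint32_t stands in for u32, the hypothetical helper lowest_set_bit() stands in for (u32)1 << __ffs(x), and a plain comparison replaces the kernel's min() macro. The lowest set bit of (tlen | buf_len) is the largest power of two that divides both values, which is why a single OR is enough.

```c
#include <stdint.h>

/*
 * Largest power of two dividing x. Stands in for (u32)1 << __ffs(x);
 * returns 0 for x == 0, a case where the kernel's __ffs() is undefined.
 */
uint32_t lowest_set_bit(uint32_t x)
{
	return x & (~x + 1u);
}

/* Burst, in beats, whose byte length divides both tlen and buf_len. */
uint32_t best_burst(uint32_t buf_len, uint32_t tlen,
		    uint32_t max_burst, uint32_t width)
{
	uint32_t align = lowest_set_bit(tlen | buf_len);	/* common alignment in bytes */
	uint32_t limit = max_burst * width;			/* largest burst in bytes */
	uint32_t burst = (align < limit ? align : limit) / width;

	return burst > 0 ? burst : 1;
}
```

For example, with buf_len = 96, tlen = 128, max_burst = 16 and a 4-byte bus width, the common alignment is 32 bytes, so the result is 8 beats. With buf_len = 2 the alignment is smaller than one beat and the function falls back to 1.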
Re: [PATCH] KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support
On 12/04/2018 17:25, Vitaly Kuznetsov wrote: > @@ -5335,6 +5353,9 @@ static void __always_inline > vmx_disable_intercept_for_msr(unsigned long *msr_bit > if (!cpu_has_vmx_msr_bitmap()) > return; > > + if (static_branch_unlikely(&enable_emsr_bitmap)) > + evmcs_touch_msr_bitmap(); > + > /* >* See Intel PRM Vol. 3, 20.6.9 (MSR-Bitmap Address). Early manuals >* have the write-low and read-high bitmap offsets the wrong way round. > @@ -5370,6 +5391,9 @@ static void __always_inline > vmx_enable_intercept_for_msr(unsigned long *msr_bitm > if (!cpu_has_vmx_msr_bitmap()) > return; > > + if (static_branch_unlikely(&enable_emsr_bitmap)) > + evmcs_touch_msr_bitmap(); I'm not sure about the "unlikely". Can you just check current_evmcs instead (dropping the static key completely)? The function, also, is small enough that inlining should be beneficial. Paolo
[PATCH v3 2/2] dmaengine: stm32-mdma: Fix incomplete Hw descriptors allocator
Only 1 Hw Descriptor is allocated. Loop over required Hw descriptor for proper allocation. Signed-off-by: Pierre-Yves MORDRET --- Version history: v1: * Initial v2: * Fix kbuild warning format: /0x%08x/%pad/ v3: * use of "offsetof" instead of explicit calculation --- --- drivers/dma/stm32-mdma.c | 89 ++-- 1 file changed, 55 insertions(+), 34 deletions(-) diff --git a/drivers/dma/stm32-mdma.c b/drivers/dma/stm32-mdma.c index 4c7634c..1ac775f 100644 --- a/drivers/dma/stm32-mdma.c +++ b/drivers/dma/stm32-mdma.c @@ -252,13 +252,17 @@ struct stm32_mdma_hwdesc { u32 cmdr; } __aligned(64); +struct stm32_mdma_desc_node { + struct stm32_mdma_hwdesc *hwdesc; + dma_addr_t hwdesc_phys; +}; + struct stm32_mdma_desc { struct virt_dma_desc vdesc; u32 ccr; - struct stm32_mdma_hwdesc *hwdesc; - dma_addr_t hwdesc_phys; bool cyclic; u32 count; + struct stm32_mdma_desc_node node[]; }; struct stm32_mdma_chan { @@ -344,30 +348,42 @@ static struct stm32_mdma_desc *stm32_mdma_alloc_desc( struct stm32_mdma_chan *chan, u32 count) { struct stm32_mdma_desc *desc; + int i; - desc = kzalloc(sizeof(*desc), GFP_NOWAIT); + desc = kzalloc(offsetof(typeof(*desc), node[count]), GFP_NOWAIT); if (!desc) return NULL; - desc->hwdesc = dma_pool_alloc(chan->desc_pool, GFP_NOWAIT, - &desc->hwdesc_phys); - if (!desc->hwdesc) { - dev_err(chan2dev(chan), "Failed to allocate descriptor\n"); - kfree(desc); - return NULL; + for (i = 0; i < count; i++) { + desc->node[i].hwdesc = + dma_pool_alloc(chan->desc_pool, GFP_NOWAIT, + &desc->node[i].hwdesc_phys); + if (!desc->node[i].hwdesc) + goto err; } desc->count = count; return desc; + +err: + dev_err(chan2dev(chan), "Failed to allocate descriptor\n"); + while (--i >= 0) + dma_pool_free(chan->desc_pool, desc->node[i].hwdesc, + desc->node[i].hwdesc_phys); + kfree(desc); + return NULL; } static void stm32_mdma_desc_free(struct virt_dma_desc *vdesc) { struct stm32_mdma_desc *desc = to_stm32_mdma_desc(vdesc); struct stm32_mdma_chan *chan = 
to_stm32_mdma_chan(vdesc->tx.chan); + int i; - dma_pool_free(chan->desc_pool, desc->hwdesc, desc->hwdesc_phys); + for (i = 0; i < desc->count; i++) + dma_pool_free(chan->desc_pool, desc->node[i].hwdesc, + desc->node[i].hwdesc_phys); kfree(desc); } @@ -666,18 +682,18 @@ static int stm32_mdma_set_xfer_param(struct stm32_mdma_chan *chan, } static void stm32_mdma_dump_hwdesc(struct stm32_mdma_chan *chan, - struct stm32_mdma_hwdesc *hwdesc) + struct stm32_mdma_desc_node *node) { - dev_dbg(chan2dev(chan), "hwdesc: 0x%p\n", hwdesc); - dev_dbg(chan2dev(chan), "CTCR:0x%08x\n", hwdesc->ctcr); - dev_dbg(chan2dev(chan), "CBNDTR: 0x%08x\n", hwdesc->cbndtr); - dev_dbg(chan2dev(chan), "CSAR:0x%08x\n", hwdesc->csar); - dev_dbg(chan2dev(chan), "CDAR:0x%08x\n", hwdesc->cdar); - dev_dbg(chan2dev(chan), "CBRUR: 0x%08x\n", hwdesc->cbrur); - dev_dbg(chan2dev(chan), "CLAR:0x%08x\n", hwdesc->clar); - dev_dbg(chan2dev(chan), "CTBR:0x%08x\n", hwdesc->ctbr); - dev_dbg(chan2dev(chan), "CMAR:0x%08x\n", hwdesc->cmar); - dev_dbg(chan2dev(chan), "CMDR:0x%08x\n\n", hwdesc->cmdr); + dev_dbg(chan2dev(chan), "hwdesc: %pad\n", &node->hwdesc_phys); + dev_dbg(chan2dev(chan), "CTCR:0x%08x\n", node->hwdesc->ctcr); + dev_dbg(chan2dev(chan), "CBNDTR: 0x%08x\n", node->hwdesc->cbndtr); + dev_dbg(chan2dev(chan), "CSAR:0x%08x\n", node->hwdesc->csar); + dev_dbg(chan2dev(chan), "CDAR:0x%08x\n", node->hwdesc->cdar); + dev_dbg(chan2dev(chan), "CBRUR: 0x%08x\n", node->hwdesc->cbrur); + dev_dbg(chan2dev(chan), "CLAR:0x%08x\n", node->hwdesc->clar); + dev_dbg(chan2dev(chan), "CTBR:0x%08x\n", node->hwdesc->ctbr); + dev_dbg(chan2dev(chan), "CMAR:0x%08x\n", node->hwdesc->cmar); + dev_dbg(chan2dev(chan), "CMDR:0x%08x\n\n", node->hwdesc->cmdr); } static void stm32_mdma_setup_hwdesc(struct stm32_mdma_chan *chan, @@ -691,7 +707,7 @@ static void stm32_mdma_setup_hwdesc(struct stm32_mdma_chan *chan, struct stm32_mdma_hwdesc *hwdesc; u32 next = count + 1; - hwdesc = &desc->hwdesc[count]; + hwdesc = desc->node[count].hwdesc; 
hwdesc->ctcr = ctcr; hwdesc->cbndtr &= ~(STM32_MDMA_CBNDTR_BRC_MK | STM32_MDMA_C
Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8
James, Thanks for this mail. On 2018/4/13 0:14, James Morse wrote: > Hi gengdongjiu, > > On 12/04/18 06:00, gengdongjiu wrote: >> 2018-02-16 1:55 GMT+08:00 James Morse : >>> On 05/02/18 11:24, gengdongjiu wrote: > Is the emulated SError routed following the routing rules for > HCR_EL2.{AMO, > TGE}? Yes, it is. >>> >>> ... and yet ... >>> >>> > What does your firmware do when it wants to emulate SError but its masked? > (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had > PSTATE.A set. > e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the > emulated SError should go to EL1. This effectively masks SError.) Currently we does not consider much about the mask status(SPSR). >>> >>> .. this is a problem. >>> >>> If you ignore SPSR_EL3 you may deliver an SError to EL1 when the exception >>> interrupted EL2. Even if you setup the EL1 register correctly, EL1 can't >>> eret to >>> EL2. This should never happen, SError is effectively masked if you are >>> running >>> at an EL higher than the one its routed to. >>> >>> More obviously: if the exception came from the EL that SError should be >>> routed >>> to, but PSTATE.A was set, you can't deliver SError. Masking SError is the >>> only > >> James, I summarized the masking and routing rules for SError to >> confirm with you for the firmware first solution, > > You also said "Currently we does not consider much about the mask > status(SPSR)." Yes, we currently do not consider much it. After clarification with you, we want to modify the EL3 firmware to follow this rule. > > >> 1. 
If the HCR_EL2.{AMO,TGE} is set, > > If one or the other of these bits is set: (AMO==1 || TGE==1) > >> which means the SError should route to EL2, >> When system happens SError and trap to EL3, If EL3 find >> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both set, >> and find this SError come from EL2, it will not deliver an SError: >> store the RAS error in the BERT and 'reboot'; but if >> it find that this SError come from EL1 or EL0, it also need to deliver >> an SError, right? > > Yes. > > >> 2. If the HCR_EL2.{AMO,TGE} is not set, > > If neither of these bits is set: (AMO==0 && TGE == 0) > >> which means the SError should route to EL1, >> When system happens SError and trap to EL3, If EL3 find >> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both not set, > > (I'm reading this as all three of these bits are clear) sorry, it is a typo issue. it should be HCR_EL2.AMO and HCR_EL2.TGE are both clear, but SPSR_EL3.A is set. > >> and find this SError come from EL1, it will not deliver an SError: >> store the RAS error in the BERT and 'reboot'; > > No, (AMO==0 && TGE == 0) means SError is routed to EL1, this exception > interrupted EL1 and the A bit was clear, so EL1 can take an SError. Agree. > > The two cases here are: > AMO==0,TGE==0 means SError should be routed to EL1. If SPSR_EL3 says the > exception interrupted EL1 and the A bit was set, you need to do the BERT > trick. > > If SPSR_EL3 says the exception interrupted EL2, you need to do the BERT trick "BERT trick" is storing the RAS error in the BERT and 'reboot, right? > regardless of the A bit, as SError is implicitly masked by running at a higher > exception level than it was routed to. > > >>From your v11 reply: >> 2. 
The exception came from the EL that SError should not be routed >> to(according to hcr_EL2.{AMO, TGE}),even though the PSTATE.A was set,EL3 >> firmware still deliver SError > > (this is re-iterating the two-cases above:) > 'not be routed to' is one of two things: Route-to-EL2+interruted-EL1, or > Route-to-EL1+interrupted-EL2. > > Route-to-EL2+interrupted-EL1 is fine, regardless of SPSR_EL3.A the emulated > SError can be delivered to EL2, as EL2 can't mask SError when executing at a > lower EL. Agree. > > Route-to-EL1+interrupted-EL2 is the problem. SError is implicitly masked by > running at a higher EL. Regardless of SPSR_EL3.A, the emulated SError can not > be > delivered. "can not be delivered" means storing the RAS error in the BERT and 'reboot, right? In the Table D1-15 in "D1.14.2 Asynchronous exception masking", for the case, it is "C" "C"means SError is not taken regardless of the value of the Process state interrupt mask. for this case, whether it will be unsafe if BIOS directly reboot? > KVM does this on the way out of a guest, if an SError occurs during this time > the CPU will wait until execution returns to EL1 before delivering the SError. > Your firmware has to do the same. > > Table D1-15 in "D1.14.2 Asynchronous exception masking" has a table with all > the > combinations. The ARM-ARM is what we need to match with this behaviour. > > >> but if it find that this SError come from EL0, it also need to deliver an >> SError, right? > > I thought interrupted-EL0 could always be delivered: but re-reading the > ARM-ARM's "D1.14.2 Asynchronous exception masking", if asynchronous excepti
[PATCH v3 0/2] Append some fixes and improvements
Fix an issue with FIFO Size and burst size. Fix an incomplete allocator for Hardware descriptors: memory badly allocated. --- Version history: v1: * Initial v2: * Fix kbuild warning format: /0x%08x/%pad/ v3: * Get rid of while loop in favor of computed values * use of "offsetof" instead of explicit calculation --- Pierre-Yves MORDRET (2): dmaengine: stm32-mdma: align TLEN and buffer length on burst dmaengine: stm32-mdma: Fix incomplete Hw descriptors allocator drivers/dma/stm32-mdma.c | 98 1 file changed, 58 insertions(+), 40 deletions(-) -- 2.7.4
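
The allocator fix in patch 2/2 follows a common pattern: one allocation sized for a header plus a flexible array of nodes, then one allocation per node, with an unwind path that frees only what was obtained. The sketch below shows that pattern in userspace under stated assumptions: calloc() and free() stand in for kzalloc(), dma_pool_alloc() and dma_pool_free(), the structs are reduced to the fields needed here, and the names desc_alloc() and desc_free() are hypothetical. The size is written as offsetof(...) plus count times the element size, which gives the same value as the offsetof(typeof(*desc), node[count]) form used in the patch.

```c
#include <stddef.h>
#include <stdlib.h>

struct hwdesc {
	unsigned int ctcr;		/* one register image; the real struct has more */
};

struct desc_node {
	struct hwdesc *hwdesc;
};

struct desc {
	unsigned int count;
	struct desc_node node[];	/* one entry per hardware descriptor */
};

struct desc *desc_alloc(unsigned int count)
{
	struct desc *d;
	unsigned int i;

	/* Header plus 'count' nodes in a single zeroed allocation. */
	d = calloc(1, offsetof(struct desc, node) + count * sizeof(d->node[0]));
	if (!d)
		return NULL;

	for (i = 0; i < count; i++) {
		d->node[i].hwdesc = calloc(1, sizeof(struct hwdesc));
		if (!d->node[i].hwdesc)
			goto err;
	}

	d->count = count;
	return d;

err:
	/* Release only the descriptors obtained before the failure. */
	while (i-- > 0)
		free(d->node[i].hwdesc);
	free(d);
	return NULL;
}

void desc_free(struct desc *d)
{
	unsigned int i;

	for (i = 0; i < d->count; i++)
		free(d->node[i].hwdesc);
	free(d);
}
```

Setting count only after every node has been allocated keeps the free path simple: it can always walk exactly count entries.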
[PATCH] tools build: Use -Xpreprocessor instead of -Wp and leave pathnames intact
Build.include invokes the pre-processor via GCC in order to generate a dependency list for the input file. Since these options are passed using '-Wp,-M...,$(depfile)' it is important that $(depfile) does not contain any commas, so these are substituted with underscores. This substitution will break the build if the directory name of the output directory happens to include a comma, e.g. when using "aiaiai" for bisection testing: | cc1: fatal error: x86/tools/objtool/fixdep.o: No such file or directory | compilation terminated. | cat: /tmp/aiaiai-test-patchset.qroS/before/obj.defconfig_x86/tools/objtool/.fixdep.o.d: No such file or directory | make[5]: *** [tools/objtool/fixdep.o] Error 1 We can address this by using -Xpreprocessor instead of -Wp, which allows us to pass down an unmodified pathname. Cc: Jiri Olsa Cc: Dave Martin Cc: Arnaldo Carvalho de Melo Cc: Ingo Molnar Signed-off-by: Will Deacon --- As an aside, the way we currently pass the depfile to -MD appears to be in direct contradiction with the preprocessor documentation, although it does work with the cc1 implementation. tools/build/Build.include | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/tools/build/Build.include b/tools/build/Build.include index 418871d02ebf..e1914f8e2328 100644 --- a/tools/build/Build.include +++ b/tools/build/Build.include @@ -22,9 +22,7 @@ dot-target = $(dir $@).$(notdir $@) basetarget = $(basename $(notdir $@)) ### -# The temporary file to save gcc -MD generated dependencies must not -# contain a comma -depfile = $(subst $(comma),_,$(dot-target).d) +depfile = $(dot-target).d ### # Check if both arguments has same arguments. Result is empty string if equal. 
@@ -89,12 +87,12 @@ if_changed = $(if $(strip $(any-prereq) $(arg-check)), \ # - per target C flags # - per object C flags # - BUILD_STR macro to allow '-D"$(variable)"' constructs -c_flags_1 = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CFLAGS) -D"BUILD_STR(s)=\#s" $(CFLAGS_$(basetarget).o) $(CFLAGS_$(obj)) +c_flags_1 = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor -MT -Xpreprocessor $@ $(CFLAGS) -D"BUILD_STR(s)=\#s" $(CFLAGS_$(basetarget).o) $(CFLAGS_$(obj)) c_flags_2 = $(filter-out $(CFLAGS_REMOVE_$(basetarget).o), $(c_flags_1)) c_flags = $(filter-out $(CFLAGS_REMOVE_$(obj)), $(c_flags_2)) -cxx_flags = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CXXFLAGS) -D"BUILD_STR(s)=\#s" $(CXXFLAGS_$(basetarget).o) $(CXXFLAGS_$(obj)) +cxx_flags = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor -MT -Xpreprocessor $@ $(CXXFLAGS) -D"BUILD_STR(s)=\#s" $(CXXFLAGS_$(basetarget).o) $(CXXFLAGS_$(obj)) ### ## HOSTCC C flags -host_c_flags = -Wp,-MD,$(depfile) -Wp,-MT,$@ $(CHOSTFLAGS) -D"BUILD_STR(s)=\#s" $(CHOSTFLAGS_$(basetarget).o) $(CHOSTFLAGS_$(obj)) +host_c_flags = -Xpreprocessor -MD -Xpreprocessor $(depfile) -Xpreprocessor -MT -Xpreprocessor $@ $(CHOSTFLAGS) -D"BUILD_STR(s)=\#s" $(CHOSTFLAGS_$(basetarget).o) $(CHOSTFLAGS_$(obj)) -- 2.1.4
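The comma problem described in this patch can be illustrated with a small sketch. GCC documents that the text after `-Wp,` is split into separate preprocessor options at every comma, so a comma inside the depfile path produces one argument too many. The helper name `count_wp_args` and the example paths are hypothetical; this only models the splitting rule and is not code from tools/build.

```c
#include <stddef.h>
#include <string.h>

/*
 * Count how many separate preprocessor arguments a "-Wp,<a>,<b>,..." option
 * expands to, assuming the text after "-Wp," is split at every comma.
 * Returns 0 if the option does not start with "-Wp,".
 */
static size_t count_wp_args(const char *opt)
{
	static const char prefix[] = "-Wp,";
	size_t n = 1;
	const char *p;

	if (strncmp(opt, prefix, sizeof(prefix) - 1) != 0)
		return 0;

	/* Every comma after the prefix starts a new argument. */
	for (p = opt + sizeof(prefix) - 1; *p; p++)
		if (*p == ',')
			n++;
	return n;
}
```

With `-Xpreprocessor` each argument is passed through on its own, so no splitting takes place and the pathname can stay unmodified.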
Re: [PATCH ipmi/kcs_bmc v1] ipmi: kcs_bmc: optimize the data buffers allocation
On 04/07/2018 02:54 AM, Wang, Haiyue wrote: Hi Corey, Since IPMI 2.0 just defined minimum, no maximum: KCS/SMIC Input : Required: 40 bytes IPMI Message, minimum KCS/SMIC Output : Required: 38 bytes IPMI Message, minimum Yes, though there are practical maximums that are much smaller than 1000 bytes. We can enlarge the block size for avoiding waste, and make our driver support most worst message size case. And I think this patch make checking simple (from 3 to 1), and the code clean, this is the biggest reason I want to change. The TLB is just memory management study from book, no data to support access improvement. :) I would argue that the way it is now expresses the intent of the code better than one allocation split into three parts. Expressing your intent is more important than the number of checks and a minuscule performance improvement. For me it makes the code easier to understand. If you had a tool that checked for out-of-bounds memory access, then a single allocation might not find an overrun between the parts. Smaller allocations tend to result in less memory fragmentation. My preference is to leave it as it is. However, it's not that important, and if you really want this patch, I can include it. Thanks, -corey BR, Haiyue On 2018-04-07 10:37, Wang, Haiyue wrote: On 2018-04-07 05:47, Corey Minyard wrote: On 03/15/2018 07:20 AM, Haiyue Wang wrote: Allocate a continuous memory block for the three KCS data buffers with related index assignment. I'm finally getting to this. Is there a reason you want to do this? In general, it's better to not try to outsmart your base system. Depending on the memory allocator, in this case, you might actually use more memory. You probably won't use any less. I got this idea from another code review, but that patch allocates 30 more the same size memory block, reducing the devm_kmalloc call will be better. For KCS only have 3, may be the key point is memory waste. 
In the original case, you allocate three 1000-byte buffers, resulting in three 1024-byte slab allocations. In the changed case, you will allocate a 3000-byte buffer, resulting in a single 4096-byte slab allocation, wasting 1024 more bytes of memory. Since KCS copies memory between in/out/kbuffer, would putting them in the same page be better? For example, sharing the same TLB entry? (Well, I just got this from a book; I have no real experience with memory access performance. I was also told that one can trade space for time. :-)) Just my stupid thinking. I'm OK to drop this patch if it doesn't help with performance or anything else. BR. Haiyue -corey Signed-off-by: Haiyue Wang --- drivers/char/ipmi/kcs_bmc.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/char/ipmi/kcs_bmc.c b/drivers/char/ipmi/kcs_bmc.c index fbfc05e..dc19c0d 100644 --- a/drivers/char/ipmi/kcs_bmc.c +++ b/drivers/char/ipmi/kcs_bmc.c @@ -435,6 +435,7 @@ static const struct file_operations kcs_bmc_fops = { struct kcs_bmc *kcs_bmc_alloc(struct device *dev, int sizeof_priv, u32 channel) { struct kcs_bmc *kcs_bmc; + void *buf; kcs_bmc = devm_kzalloc(dev, sizeof(*kcs_bmc) + sizeof_priv, GFP_KERNEL); if (!kcs_bmc) @@ -448,11 +449,12 @@ struct kcs_bmc *kcs_bmc_alloc(struct device *dev, int sizeof_priv, u32 channel) mutex_init(&kcs_bmc->mutex); init_waitqueue_head(&kcs_bmc->queue); - kcs_bmc->data_in = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - kcs_bmc->data_out = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - kcs_bmc->kbuffer = devm_kmalloc(dev, KCS_MSG_BUFSIZ, GFP_KERNEL); - if (!kcs_bmc->data_in || !kcs_bmc->data_out || !kcs_bmc->kbuffer) + buf = devm_kmalloc_array(dev, 3, KCS_MSG_BUFSIZ, GFP_KERNEL); + if (!buf) return NULL; + kcs_bmc->data_in = buf; + kcs_bmc->data_out = buf + KCS_MSG_BUFSIZ; + kcs_bmc->kbuffer = buf + KCS_MSG_BUFSIZ * 2; kcs_bmc->miscdev.minor = MISC_DYNAMIC_MINOR; kcs_bmc->miscdev.name = dev_name(dev);
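The slab arithmetic in Corey's reply can be checked with a rough model. The sketch below assumes a simplified allocator whose size classes are plain powers of two; real kmalloc caches also have a minimum size and a few extra classes, and the devm bookkeeping is ignored. The function names are hypothetical.

```c
#include <stddef.h>

/*
 * Simplified model of slab size classes: round a request up to the next
 * power of two. This is an assumption for illustration, not the kernel's
 * actual size-class table.
 */
static size_t slab_size(size_t request)
{
	size_t size = 1;

	while (size < request)
		size <<= 1;
	return size;
}

/* Total bytes backing `count` separate allocations of `each` bytes. */
static size_t split_cost(size_t count, size_t each)
{
	return count * slab_size(each);
}

/* Total bytes backing one allocation of `count * each` bytes. */
static size_t merged_cost(size_t count, size_t each)
{
	return slab_size(count * each);
}
```

Under this model three 1000-byte buffers cost 3 * 1024 = 3072 bytes, while one 3000-byte buffer costs 4096 bytes, which matches the 1024 bytes of extra waste mentioned in the reply.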
[PATCH] Move handling of the MIDR Variant and Revision bits into the mapfile.csv file
The arm64 CPU identification code was filtering out the Variant and Revision bits when it initially read the MIDR value. It is better to do the filtering of Variant and Revision bits in the regular expressions in mapfile.csv. If some performance events do not function for particular versions of silicon, special case maps can be added to mapfile.csv before the general case to handle them. Signed-off-by: William Cohen --- tools/perf/arch/arm64/util/header.c | 7 --- tools/perf/pmu-events/arch/arm64/mapfile.csv | 12 +++- 2 files changed, 7 insertions(+), 12 deletions(-) diff --git a/tools/perf/arch/arm64/util/header.c b/tools/perf/arch/arm64/util/header.c index 534cd2507d83..05d1439c2cff 100644 --- a/tools/perf/arch/arm64/util/header.c +++ b/tools/perf/arch/arm64/util/header.c @@ -5,9 +5,6 @@ #define MIDR "/regs/identification/midr_el1" #define MIDR_SIZE 19 -#define MIDR_REVISION_MASK 0xf -#define MIDR_VARIANT_SHIFT 20 -#define MIDR_VARIANT_MASK (0xf << MIDR_VARIANT_SHIFT) char *get_cpuid_str(struct perf_pmu *pmu) { @@ -44,11 +41,7 @@ char *get_cpuid_str(struct perf_pmu *pmu) } fclose(file); - /* Ignore/clear Variant[23:20] and -* Revision[3:0] of MIDR -*/ midr = strtoul(buf, NULL, 16); - midr &= (~(MIDR_VARIANT_MASK | MIDR_REVISION_MASK)); scnprintf(buf, MIDR_SIZE, "0x%016lx", midr); /* got midr break loop */ break; diff --git a/tools/perf/pmu-events/arch/arm64/mapfile.csv b/tools/perf/pmu-events/arch/arm64/mapfile.csv index f03e26ecb658..23372a335f97 100644 --- a/tools/perf/pmu-events/arch/arm64/mapfile.csv +++ b/tools/perf/pmu-events/arch/arm64/mapfile.csv @@ -3,7 +3,9 @@ # # where # MIDRProcessor version -# Variant[23:20] and Revision [3:0] should be zero. +# Variant[23:20] and Revision [3:0] bits should be matched +# with regular expression hex digits ([[:xdigit:]]) +# unless particular variants or revisions need special handling. # Version could be used to track version of of JSON file # but currently unused. 
# JSON/file/pathname is the path to JSON file, relative @@ -12,7 +14,7 @@ # # #Family-model,Version,Filename,EventType -0x410fd03[[:xdigit:]],v1,arm/cortex-a53,core -0x420f5160,v1,cavium/thunderx2,core -0x430f0af0,v1,cavium/thunderx2,core -0x480fd010,v1,hisilicon/hip08,core +0x41[[:xdigit:]]fd03[[:xdigit:]],v1,arm/cortex-a53,core +0x42[[:xdigit:]]f516[[:xdigit:]],v1,cavium/thunderx2,core +0x43[[:xdigit:]]f0af[[:xdigit:]],v1,cavium/thunderx2,core +0x48[[:xdigit:]]fd01[[:xdigit:]],v1,hisilicon/hip08,core -- 2.14.3
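To see how the new mapfile patterns treat the Variant and Revision digits, here is a minimal sketch using POSIX extended regular expressions. The helper `midr_matches` and the test values are hypothetical and written in the same short form as the patterns in the patch; this is not the matching code perf itself uses.

```c
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Return true if `cpuid` matches the whole of `pattern`, where `pattern`
 * is a POSIX extended regular expression such as the first column of a
 * mapfile.csv entry.
 */
static bool midr_matches(const char *pattern, const char *cpuid)
{
	char anchored[256];
	regex_t re;
	bool ok;
	int n;

	/* Anchor the pattern so that a partial match does not count. */
	n = snprintf(anchored, sizeof(anchored), "^%s$", pattern);
	if (n < 0 || (size_t)n >= sizeof(anchored))
		return false;

	if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
		return false;

	ok = regexec(&re, cpuid, 0, NULL, 0) == 0;
	regfree(&re);
	return ok;
}
```

For the Cortex-A53 entry, any hex digit is accepted in the Variant and Revision positions, while a different part number is rejected. A more specific entry placed earlier in the file could single out one variant or revision, as the commit message suggests.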
Re: [PATCH RFC 2/8] mm: introduce PG_offline
On 13.04.2018 15:40, Michal Hocko wrote: > On Fri 13-04-18 15:16:26, David Hildenbrand wrote: >> online_pages()/offline_pages() theoretically allows us to work on >> sub-section sizes. This is especially relevant in the context of >> virtualization. It e.g. allows us to add/remove memory to Linux in a VM in >> 4MB chunks. > > Well, theoretically possible but this would require a lot of auditing > because the hotplug and per section assumption is quite a spread one. Indeed. But besides changing section sizes / size of memory blocks this seems to be the only way to do it. (btw, I think Windows allows to add 1MB chunks - e.g. 1MB DIMMs) But as these pages "belong to nobody" nobody (besides kdump) should dare to access the content, although the section is online. > >> While the whole section is marked as online/offline, we have to know >> the state of each page. E.g. to not read memory that is not online >> during kexec() or to properly mark a section as offline as soon as all >> contained pages are offline. > > But you cannot use a page flag for that, I am afraid. Page flags are > extremely scarce resource. I haven't looked at the rest of the series > but _if_ we have a bit spare which I am not really sure about then you > should prove there are no other ways around this. Open for suggestions. We could remember per segment/memory block which parts are online/offline and use that to decide if a section can go offline. However: kdump will also have to (easily) know which pages are offline, so it can skip reading them. (see the other patch) > >> Signed-off-by: David Hildenbrand -- Thanks, David / dhildenb
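The alternative David mentions, remembering per memory block which parts are online, could look roughly like the sketch below. Everything here is an assumption made for illustration: the 32-chunk block (for example a 128MB block split into 4MB chunks), the structure and the helper names are hypothetical and do not come from the series.

```c
#include <stdbool.h>
#include <stdint.h>

#define CHUNKS_PER_BLOCK 32u	/* hypothetical: 128MB block / 4MB chunks */

/* Per-block state: bit i set means chunk i is online. */
struct block_state {
	uint32_t online;
};

static void chunk_set_online(struct block_state *b, unsigned int i)
{
	b->online |= UINT32_C(1) << i;
}

static void chunk_set_offline(struct block_state *b, unsigned int i)
{
	b->online &= ~(UINT32_C(1) << i);
}

static bool chunk_is_online(const struct block_state *b, unsigned int i)
{
	return (b->online >> i) & 1u;
}

/* The whole block may go offline only once no chunk is online. */
static bool block_can_go_offline(const struct block_state *b)
{
	return b->online == 0;
}
```

Such a side structure answers the "can the section go offline" question without a page flag, but as noted in the reply, kdump would still need an easy way to find the offline pages.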
Re: [PATCH 2/6] tracing: Add trace event error log
On Thu, 12 Apr 2018 18:52:13 -0500 Tom Zanussi wrote: > Hi Steve, > > On Thu, 2018-04-12 at 18:20 -0400, Steven Rostedt wrote: > > On Thu, 12 Apr 2018 10:13:17 -0500 > > Tom Zanussi wrote: > > > > > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h > > > index 6fb46a0..f2dc7e6 100644 > > > --- a/kernel/trace/trace.h > > > +++ b/kernel/trace/trace.h > > > @@ -1765,6 +1765,9 @@ extern ssize_t trace_parse_run_command(struct file > > > *file, > > > const char __user *buffer, size_t count, loff_t *ppos, > > > int (*createfn)(int, char**)); > > > > > > +extern void event_log_err(const char *loc, const char *cmd, const char > > > *fmt, > > > + ...); > > > + > > > /* > > > * Normal trace_printk() and friends allocates special buffers > > > * to do the manipulation, as well as saves the print formats > > > diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c > > > index 05c7172..fd02e22 100644 > > > --- a/kernel/trace/trace_events.c > > > +++ b/kernel/trace/trace_events.c > > > @@ -1668,6 +1668,164 @@ static void ignore_task_cpu(void *data) > > > return ret; > > > } > > > > > > +#define EVENT_LOG_ERRS_MAX (PAGE_SIZE / sizeof(struct > > > event_log_err)) > > > > > +#define EVENT_ERR_LOG_MASK (EVENT_LOG_ERRS_MAX - 1) > > > > BTW, the above only works if EVENT_LOG_ERRS_MAX is a power of two, > > which it's not guaranteed to be. > > > > My assumption was that we'd only ever need a page or two for the > error_log and so would always would be a power of two, since the size of > the struct event_log_err is 512. Assumptions are not what we want to rely on. There should be something like: BUILD_BUG_ON(EVENT_LOG_ERRS_MAX & EVENT_ERR_LOG_MASK); Which would guarantee that your assumption is correct, otherwise the kernel won't build. 
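The power-of-two requirement behind the suggested BUILD_BUG_ON() can be shown in a userspace sketch. It uses C11 `_Static_assert` as a stand-in for BUILD_BUG_ON(), and the names `LOG_SLOTS`, `LOG_MASK`, `next_slot` and `wrap_with_mask` are hypothetical; the value 8 corresponds to a 4096-byte page divided by a 512-byte entry.

```c
#define LOG_SLOTS	8u			/* hypothetical slot count */
#define LOG_MASK	(LOG_SLOTS - 1)

/* Compile-time guard: for a power of two, N & (N - 1) is zero. */
_Static_assert((LOG_SLOTS & LOG_MASK) == 0,
	       "LOG_SLOTS must be a power of two");

/* Advance a ring index by one slot, wrapping with the mask. */
static unsigned int next_slot(unsigned int idx)
{
	return (idx + 1) & LOG_MASK;
}

/*
 * Wrap an index with a mask derived from an arbitrary slot count. This
 * equals `idx % slots` only when `slots` is a power of two.
 */
static unsigned int wrap_with_mask(unsigned int idx, unsigned int slots)
{
	return idx & (slots - 1);
}
```

With 6 slots, index 2 masks to 0 instead of 2, which is the kind of silent corruption the build-time check is meant to rule out.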
> > Anyway, I should probably have put comments about all this in the code, > and I will, but the way it works kind of assumes a very small number of > errors - it's replacing a simple 'last error' facility for the hist > triggers and making it a common facility for other things that have > similar needs like Masami's kprobe_events errors. For those purposes, I > assumed it would suffice to simply be able to show that last 8 or some > similar small number of errors and constantly recycle the slots. The errors are still in the files that have the errors right? Perhaps just have a file that lists the files that contain errors. That way if something goes wrong, you can examine that file and then look at the file that contains the error? And I'm not sure it being in the events directory is the best place either, especially, if you plan to have it handle kprobe_events because that's not in the events directory. > > Basically it just splits the page into 16 strings, 2 per error, one for > the actual error text, the other for the command the user entered. The > struct event_log_err just overlays a struct on top of 2 strings just to > make it easier to manage. > > Anyway, because it is such a small number, and we start with a zeroed > page, whenever we print the error log, we print all 16 strings even if > we only have one error (2 strings). The rest are NULL and print > nothing. We start with the tail, which could also be thought of as the > 'oldest' or the 'first' error in the buffer and just cycle through them > all. Hope that clears up some of the other questions you had about how > a non-full log gets printed, etc... OK, I was thinking a NULL entry would return NULL, but we are returning a pointer to NULL. That's where I missed it. 
> > > > + > > > +struct event_log_err { > > > + charerr[MAX_FILTER_STR_VAL]; > > > + charcmd[MAX_FILTER_STR_VAL]; > > > +}; > > > > I like the event_log_err idea, but the above can be shrunk to: > > > > struct err_info { > > u8 type; /* I can only imagine 254 types */ > > u8 pos; /* MAX_FILTER_STR_VAR = 256 */ > > }; > > > > struct event_log_err { > > struct err_info info; > > charcmd[MAX_FILTER_STR_VAL]; > > }; > > > > There's no reason to put in a bunch of text that's going to be static > > anyway. Have a lookup table like we do for filters. > > > > + log_err("Variable name not unique, need to use > > fully qualified name (%s) for variable: ", fqvar(system, event_name, > > var_name, true)); > > > > Hmm, most of the log_errs use printf strings that get expanded, so need > a destination buffer, the event_log_err->err string, but I think I see > what you're getting at - that we can get rid of the format strings > altogether and make them static strings if we use the method of simply > printing the static string and putting a caret where the error is as > below. > > > > > Instead of making the fqvar, find the location of the variable, and add: > > > > blah blah $var blah blah > > ^ > > Variable
[PATCH 4/6] Documentation for Pmalloc
Detailed documentation about the protectable memory allocator. Signed-off-by: Igor Stoppa --- Documentation/core-api/index.rst | 1 + Documentation/core-api/pmalloc.rst | 107 + 2 files changed, 108 insertions(+) create mode 100644 Documentation/core-api/pmalloc.rst diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index c670a8031786..8f5de42d6571 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -25,6 +25,7 @@ Core utilities genalloc errseq printk-formats + pmalloc Interfaces for kernel debugging === diff --git a/Documentation/core-api/pmalloc.rst b/Documentation/core-api/pmalloc.rst new file mode 100644 index ..c14907485137 --- /dev/null +++ b/Documentation/core-api/pmalloc.rst @@ -0,0 +1,107 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. _pmalloc: + +Protectable memory allocator + + +Purpose +--- + +The pmalloc library is meant to provide read-only status to data that, +for some reason, could neither be declared as constant, nor could it take +advantage of the qualifier __ro_after_init, but is write-once and +read-only in spirit. At least as long as it doesn't get torn down. +It protects data from both accidental and malicious overwrites. + +Example: A policy that is loaded from userspace. + + +Concept +--- + +The MMU available in the system can be used to write protect memory pages. +Unfortunately this feature cannot be used as-is to protect sensitive +data, because this potentially read-only data is typically interleaved +with other data, which must stay writeable. + +pmalloc introduces the concept of protectable memory pools. +A pool contains a list of areas of virtually contiguous pages of +memory. An area is the minimum amount of memory that pmalloc allows to +protect, because the user might have allocated a memory range that +crosses the boundary between pages. 
+ +When an allocation is performed, if there is not enough memory already +available in the pool, a new area of suitable size is grabbed. +The size chosen is the largest between the roundup (to PAGE_SIZE) of +the request from pmalloc and friends and the refill parameter specified +when creating the pool. + +When a pool is created, it is possible to specify two parameters: +- refill size: the minimum size of the memory area to allocate when needed +- align_order: the default alignment to use when reserving memory + +To facilitate the conversion of existing code to pmalloc pools, several +helper functions are provided, mirroring their k/vmalloc counterparts. +However one is missing. There is no pfree() because the memory protected +by a pool will be released exclusively when the pool is destroyed. + + + +Caveats +--- + +- When a pool is protected, whatever memory would be still available in + the current vmap_area (from which allocations are performed) is + relinquished. + +- As already explained, freeing of memory is not supported. Pages will be + returned to the system upon destruction of the memory pool that they + belong to. + +- The address range available for vmalloc (and thus for pmalloc too) is + limited, on 32-bit systems. However it shouldn't be an issue, since not + much data is expected to be dynamically allocated and turned into + read-only. + +- Regarding SMP systems, the allocations are expected to happen mostly + during an initial transient, after which there should be no more need + to perform cross-processor synchronizations of page tables. + Loading of kernel modules is an exception to this, but it's not expected + to happen with such high frequency to become a problem. + + +Use +--- + +The typical sequence, when using pmalloc, is: + +#. create a pool + + :c:func:`pmalloc_create_pool` + +#. issue one or more allocation requests to the pool + + :c:func:`pmalloc` + + or + + :c:func:`pzalloc` + +#. 
initialize the memory obtained, with the desired values + +#. write-protect the memory so far allocated + + :c:func:`pmalloc_protect_pool` + +#. iterate over the last 3 points as needed + +#. [optional] destroy the pool + + :c:func:`pmalloc_destroy_pool` + +API +--- + +.. kernel-doc:: include/linux/pmalloc.h +.. kernel-doc:: mm/pmalloc.c -- 2.14.1
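The write-once-then-protect sequence documented above has a loose userspace analogue, sketched here with mmap() and mprotect(). This is only an analogy under stated assumptions: it is not the pmalloc API, it handles a single page rather than a pool of areas, and the `ro_page_*` names are hypothetical.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* A one-page "pool": writable until protected, then read-only. */
struct ro_page {
	void *addr;
	size_t size;
};

/* Map one anonymous, writable page. Returns 0 on success, -1 on failure. */
static int ro_page_create(struct ro_page *p)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0)
		return -1;
	p->size = (size_t)page;
	p->addr = mmap(NULL, p->size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p->addr == MAP_FAILED) {
		p->addr = NULL;
		return -1;
	}
	return 0;
}

/* Drop write permission; later stores to the page will fault. */
static int ro_page_protect(struct ro_page *p)
{
	return mprotect(p->addr, p->size, PROT_READ);
}

/* Unmap the page; as with a pool, this is the only way to release it. */
static int ro_page_destroy(struct ro_page *p)
{
	return munmap(p->addr, p->size);
}
```

A caller would create the page, store the initial values, call ro_page_protect(), and from then on only read. The whole-page granularity is also why the documentation describes grouping allocations into dedicated areas.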
[PATCH 5/6] Pmalloc selftest
Add basic self-test functionality for pmalloc. The testing is introduced as early as possible, right after the main dependency, genalloc, has passed successfully, so that it can help diagnosing failures in pmalloc users. Signed-off-by: Igor Stoppa --- include/linux/test_pmalloc.h | 24 init/main.c | 2 + mm/Kconfig | 10 mm/Makefile | 1 + mm/test_pmalloc.c| 137 +++ 5 files changed, 174 insertions(+) create mode 100644 include/linux/test_pmalloc.h create mode 100644 mm/test_pmalloc.c diff --git a/include/linux/test_pmalloc.h b/include/linux/test_pmalloc.h new file mode 100644 index ..c7e2e451c17c --- /dev/null +++ b/include/linux/test_pmalloc.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * test_pmalloc.h + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa + */ + + +#ifndef __LINUX_TEST_PMALLOC_H +#define __LINUX_TEST_PMALLOC_H + + +#ifdef CONFIG_TEST_PROTECTABLE_MEMORY + +void test_pmalloc(void); + +#else + +static inline void test_pmalloc(void){}; + +#endif + +#endif diff --git a/init/main.c b/init/main.c index b795aa341a3a..27f8479c4578 100644 --- a/init/main.c +++ b/init/main.c @@ -91,6 +91,7 @@ #include #include #include +#include #include #include @@ -679,6 +680,7 @@ asmlinkage __visible void __init start_kernel(void) */ mem_encrypt_init(); + test_pmalloc(); #ifdef CONFIG_BLK_DEV_INITRD if (initrd_start && !initrd_below_start_ok && page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) { diff --git a/mm/Kconfig b/mm/Kconfig index d7ef40eaa4e8..f98b4c0aebce 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -758,3 +758,13 @@ config PROTECTABLE_MEMORY depends on MMU depends on ARCH_HAS_SET_MEMORY default y + +config TEST_PROTECTABLE_MEMORY + bool "Run self test for pmalloc memory allocator" +depends on MMU + depends on ARCH_HAS_SET_MEMORY + select PROTECTABLE_MEMORY + default n + help + Tries to verify that pmalloc works correctly and that the memory + is effectively protected. 
diff --git a/mm/Makefile b/mm/Makefile index 6a6668f99799..802cba37013b 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -66,6 +66,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o obj-$(CONFIG_PROTECTABLE_MEMORY) += pmalloc.o +obj-$(CONFIG_TEST_PROTECTABLE_MEMORY) += test_pmalloc.o obj-$(CONFIG_KSM) += ksm.o obj-$(CONFIG_PAGE_POISONING) += page_poison.o obj-$(CONFIG_SLAB) += slab.o diff --git a/mm/test_pmalloc.c b/mm/test_pmalloc.c new file mode 100644 index ..b0e091bf6329 --- /dev/null +++ b/mm/test_pmalloc.c @@ -0,0 +1,137 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * test_pmalloc.c + * + * (C) Copyright 2018 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa + */ + +#include +#include +#include +#include + +#define SIZE_1 (PAGE_SIZE * 3) +#define SIZE_2 1000 + + +/* wrapper for is_pmalloc_object() with messages */ +static inline bool validate_alloc(bool expected, void *addr, + unsigned long size) +{ + bool test; + + test = is_pmalloc_object(addr, size) > 0; + pr_notice("must be %s: %s", + expected ? "ok" : "no", test ? 
"ok" : "no"); + return test == expected; +} + + +#define is_alloc_ok(variable, size)\ + validate_alloc(true, variable, size) + + +#define is_alloc_no(variable, size)\ + validate_alloc(false, variable, size) + +/* tests the basic life-cycle of a pool */ +static bool create_and_destroy_pool(void) +{ + static struct pmalloc_pool *pool; + + pr_notice("Testing pool creation and destruction capability"); + + pool = pmalloc_create_pool(); + if (WARN(!pool, "Cannot allocate memory for pmalloc selftest.")) + return false; + pmalloc_destroy_pool(pool); + return true; +} + + +/* verifies that it's possible to allocate from the pool */ +static bool test_alloc(void) +{ + static struct pmalloc_pool *pool; + static void *p; + + pr_notice("Testing allocation capability"); + pool = pmalloc_create_pool(); + if (WARN(!pool, "Unable to allocate memory for pmalloc selftest.")) + return false; + p = pmalloc(pool, SIZE_1 - 1); + pmalloc_protect_pool(pool); + pmalloc_destroy_pool(pool); + if (WARN(!p, "Failed to allocate memory from the pool")) + return false; + return true; +} + + +/* tests the identification of pmalloc ranges */ +static bool test_is_pmalloc_object(void) +{ + struct pmalloc_pool *pool; + void *pmalloc_p; + void *vmalloc_p; + bool retval = false; + + pr_notice("Test correctness of is_pmalloc_object()"); + + vmalloc_p = vmalloc(SIZE_1); +
Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts
On Fri, Apr 13, 2018 at 03:01:11PM +0800, Jason Wang wrote: > > > On 2018年04月12日 08:24, Jonathan Helman wrote: > > > > > > On 04/10/2018 08:12 PM, Jason Wang wrote: > > > > > > > > > On 2018年04月10日 05:11, Jonathan Helman wrote: > > > > > > > > > > > > On 03/22/2018 07:38 PM, Jason Wang wrote: > > > > > > > > > > > > > > > On 2018年03月22日 11:10, Michael S. Tsirkin wrote: > > > > > > On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote: > > > > > > > On 2018年03月20日 12:26, Jonathan Helman wrote: > > > > > > > > > On Mar 19, 2018, at 7:31 PM, Jason > > > > > > > > > Wang wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 2018年03月20日 06:14, Jonathan Helman wrote: > > > > > > > > > > Export the number of successful and failed hugetlb page > > > > > > > > > > allocations via the virtio balloon driver. These 2 counts > > > > > > > > > > come directly from the vm_events HTLB_BUDDY_PGALLOC and > > > > > > > > > > HTLB_BUDDY_PGALLOC_FAIL. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Helman > > > > > > > > > Reviewed-by: Jason Wang > > > > > > > > Thanks. 
> > > > > > > > > > > > > > > > > > --- > > > > > > > > > > drivers/virtio/virtio_balloon.c | 6 ++ > > > > > > > > > > include/uapi/linux/virtio_balloon.h | 4 +++- > > > > > > > > > > 2 files changed, 9 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git > > > > > > > > > > a/drivers/virtio/virtio_balloon.c > > > > > > > > > > b/drivers/virtio/virtio_balloon.c > > > > > > > > > > index dfe5684..6b237e3 100644 > > > > > > > > > > --- a/drivers/virtio/virtio_balloon.c > > > > > > > > > > +++ b/drivers/virtio/virtio_balloon.c > > > > > > > > > > @@ -272,6 +272,12 @@ static unsigned int > > > > > > > > > > update_balloon_stats(struct > > > > > > > > > > virtio_balloon *vb) > > > > > > > > > > pages_to_bytes(events[PSWPOUT])); > > > > > > > > > > update_stat(vb, idx++, > > > > > > > > > > VIRTIO_BALLOON_S_MAJFLT, > > > > > > > > > > events[PGMAJFAULT]); > > > > > > > > > > update_stat(vb, idx++, > > > > > > > > > > VIRTIO_BALLOON_S_MINFLT, > > > > > > > > > > events[PGFAULT]); > > > > > > > > > > +#ifdef CONFIG_HUGETLB_PAGE > > > > > > > > > > + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC, > > > > > > > > > > + events[HTLB_BUDDY_PGALLOC]); > > > > > > > > > > + update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL, > > > > > > > > > > + events[HTLB_BUDDY_PGALLOC_FAIL]); > > > > > > > > > > +#endif > > > > > > > > > > #endif > > > > > > > > > > update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE, > > > > > > > > > > pages_to_bytes(i.freeram)); > > > > > > > > > > diff --git > > > > > > > > > > a/include/uapi/linux/virtio_balloon.h > > > > > > > > > > b/include/uapi/linux/virtio_balloon.h > > > > > > > > > > index 4e8b830..40297a3 100644 > > > > > > > > > > --- a/include/uapi/linux/virtio_balloon.h > > > > > > > > > > +++ b/include/uapi/linux/virtio_balloon.h > > > > > > > > > > @@ -53,7 +53,9 @@ struct virtio_balloon_config { > > > > > > > > > > #define VIRTIO_BALLOON_S_MEMTOT 5 > > > > > > > > > > /* Total amount of memory */ > > > > > > > 
> > > #define VIRTIO_BALLOON_S_AVAIL 6 > > > > > > > > > > /* Available memory as in /proc */ > > > > > > > > > > #define VIRTIO_BALLOON_S_CACHES 7 /* Disk caches */ > > > > > > > > > > -#define VIRTIO_BALLOON_S_NR 8 > > > > > > > > > > +#define VIRTIO_BALLOON_S_HTLB_PGALLOC > > > > > > > > > > 8 /* Hugetlb page allocations */ > > > > > > > > > > +#define VIRTIO_BALLOON_S_HTLB_PGFAIL > > > > > > > > > > 9 /* Hugetlb page allocation failures > > > > > > > > > > */ > > > > > > > > > > +#define VIRTIO_BALLOON_S_NR 10 > > > > > > > > > > /* > > > > > > > > > > * Memory statistics structure. > > > > > > > > > Not for this patch, but it looks to me that > > > > > > > > > exporting such nr through uapi is fragile. > > > > > > > > Sorry, can you explain what you mean here? > > > > > > > > > > > > > > > > Jon > > > > > > > Spec said "Within an output buffer submitted to the > > > > > > > statsq, the device MUST > > > > > > > ignore entries with tag values that it does not > > > > > > > recognize". So exporting > > > > > > > VIRTIO_BALLOON_S_NR seems useless and device > > > > > > > implementation can not depend > > > > > > > on such number in uapi. > > > > > > > > > > > > > > Thanks > > > > > > Suggestions? I don't like to break build for people ... > > > > > > > > > > > > > > > > Didn't have a good idea. But maybe we should keep > > > > > VIRTIO_BALLOON_S_NR unchanged, and add a comment here. > > > > > > > > > > Thanks > > > > > > > > I think Jason's comment is for a future patch. Didn't see this > > > > patch get applied, so wondering if it could be. > > > > > > > > Thanks, > > > > Jon > > > > > > Hi Jon: > > > > > > Have you tested new driver with old qemu? > > > > Yes, this testing scenario looks g
[PATCH] virtio_balloon: add array of stat names
Jason Wang points out that it's very hard for users to build an array of stat names. The naive thing is to use VIRTIO_BALLOON_S_NR but that breaks if we add more stats. Let's add an array of reasonably readable names. Fixes: 6c64fe7f2 ("virtio_balloon: export hugetlb page allocation counts") Cc: Jason Wang Cc: Jonathan Helman Signed-off-by: Michael S. Tsirkin --- include/uapi/linux/virtio_balloon.h | 15 +++ 1 file changed, 15 insertions(+) diff --git a/include/uapi/linux/virtio_balloon.h b/include/uapi/linux/virtio_balloon.h index 9e02137..1477c17 100644 --- a/include/uapi/linux/virtio_balloon.h +++ b/include/uapi/linux/virtio_balloon.h @@ -64,6 +64,21 @@ struct virtio_balloon_config { #define VIRTIO_BALLOON_S_HTLB_PGFAIL 9 /* Hugetlb page allocation failures */ #define VIRTIO_BALLOON_S_NR 10 +#define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \ + VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \ + VIRTIO_BALLOON_S_NAMES_prefix "swap-out", \ + VIRTIO_BALLOON_S_NAMES_prefix "major-faults", \ + VIRTIO_BALLOON_S_NAMES_prefix "minor-faults", \ + VIRTIO_BALLOON_S_NAMES_prefix "free-memory", \ + VIRTIO_BALLOON_S_NAMES_prefix "total-memory", \ + VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \ + VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \ + VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \ +} + +#define VIRTIO_BALLOON_S_NAMES VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("") + /* * Memory statistics structure. * Driver fills an array of these structures and passes to device. -- MST
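A consumer of the new macro might use it roughly as follows. The macro body is copied from the patch so that the sketch builds without the updated header; the "stat-" prefix, `stat_names`, `STAT_NAME_COUNT` and `stat_name` are hypothetical names chosen for this example.

```c
/* Copied from the patch above so the sketch builds without the new header. */
#define VIRTIO_BALLOON_S_NAMES_WITH_PREFIX(VIRTIO_BALLOON_S_NAMES_prefix) { \
	VIRTIO_BALLOON_S_NAMES_prefix "swap-in", \
	VIRTIO_BALLOON_S_NAMES_prefix "swap-out", \
	VIRTIO_BALLOON_S_NAMES_prefix "major-faults", \
	VIRTIO_BALLOON_S_NAMES_prefix "minor-faults", \
	VIRTIO_BALLOON_S_NAMES_prefix "free-memory", \
	VIRTIO_BALLOON_S_NAMES_prefix "total-memory", \
	VIRTIO_BALLOON_S_NAMES_prefix "available-memory", \
	VIRTIO_BALLOON_S_NAMES_prefix "disk-caches", \
	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-allocations", \
	VIRTIO_BALLOON_S_NAMES_prefix "hugetlb-failures" \
}

/* "stat-" is a hypothetical prefix; string literals are concatenated. */
static const char *const stat_names[] =
	VIRTIO_BALLOON_S_NAMES_WITH_PREFIX("stat-");

/* Derive the count from the array instead of VIRTIO_BALLOON_S_NR. */
#define STAT_NAME_COUNT (sizeof(stat_names) / sizeof(stat_names[0]))

/* Map a stat tag to its name, or "unknown" for tags this build lacks. */
static const char *stat_name(unsigned int tag)
{
	return tag < STAT_NAME_COUNT ? stat_names[tag] : "unknown";
}
```

Sizing the table from the initializer means a rebuilt consumer picks up new stats automatically, while an older build simply reports unrecognized tags as unknown.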
[PATCH 6/6] lkdtm: crash on overwriting protected pmalloc var
Verify that pmalloc read-only protection is in place: trying to overwrite a protected variable will crash the kernel. Signed-off-by: Igor Stoppa --- drivers/misc/lkdtm/core.c | 3 +++ drivers/misc/lkdtm/lkdtm.h | 1 + drivers/misc/lkdtm/perms.c | 25 + 3 files changed, 29 insertions(+) diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c index 2154d1bfd18b..c9fd42bda6ee 100644 --- a/drivers/misc/lkdtm/core.c +++ b/drivers/misc/lkdtm/core.c @@ -155,6 +155,9 @@ static const struct crashtype crashtypes[] = { CRASHTYPE(ACCESS_USERSPACE), CRASHTYPE(WRITE_RO), CRASHTYPE(WRITE_RO_AFTER_INIT), +#ifdef CONFIG_PROTECTABLE_MEMORY + CRASHTYPE(WRITE_RO_PMALLOC), +#endif CRASHTYPE(WRITE_KERN), CRASHTYPE(REFCOUNT_INC_OVERFLOW), CRASHTYPE(REFCOUNT_ADD_OVERFLOW), diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h index 9e513dcfd809..dcda3ae76ceb 100644 --- a/drivers/misc/lkdtm/lkdtm.h +++ b/drivers/misc/lkdtm/lkdtm.h @@ -38,6 +38,7 @@ void lkdtm_READ_BUDDY_AFTER_FREE(void); void __init lkdtm_perms_init(void); void lkdtm_WRITE_RO(void); void lkdtm_WRITE_RO_AFTER_INIT(void); +void lkdtm_WRITE_RO_PMALLOC(void); void lkdtm_WRITE_KERN(void); void lkdtm_EXEC_DATA(void); void lkdtm_EXEC_STACK(void); diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c index 53b85c9d16b8..4660ff0bfa44 100644 --- a/drivers/misc/lkdtm/perms.c +++ b/drivers/misc/lkdtm/perms.c @@ -9,6 +9,7 @@ #include #include #include +#include #include /* Whether or not to fill the target memory area with do_nothing(). 
*/ @@ -104,6 +105,30 @@ void lkdtm_WRITE_RO_AFTER_INIT(void) *ptr ^= 0xabcd1234; } +#ifdef CONFIG_PROTECTABLE_MEMORY +void lkdtm_WRITE_RO_PMALLOC(void) +{ + struct pmalloc_pool *pool; + int *i; + + pool = pmalloc_create_pool(); + if (WARN(!pool, "Failed preparing pool for pmalloc test.")) + return; + + i = (int *)pmalloc(pool, sizeof(int)); + if (WARN(!i, "Failed allocating memory for pmalloc test.")) { + pmalloc_destroy_pool(pool); + return; + } + + *i = INT_MAX; + pmalloc_protect_pool(pool); + + pr_info("attempting bad pmalloc write at %p\n", i); + *i = 0; +} +#endif + void lkdtm_WRITE_KERN(void) { size_t size; -- 2.14.1
[PATCH 3/6] Protectable Memory
The MMU available in many systems running Linux can often provide R/O protection to the memory pages it handles. However, the MMU-based protection works efficiently only when said pages contain exclusively data that will not need further modifications. Statically allocated variables can be segregated into a dedicated section (that's how __ro_after_init works), but this does not sit very well with dynamically allocated ones. Dynamic allocation does not currently provide any means for grouping variables in memory pages that would contain exclusively data suitable for conversion to read-only access mode. The allocator provided here (pmalloc - protectable memory allocator) introduces the concept of pools of protectable memory. A module can instantiate a pool, and then refer any allocation request to the pool handler it has received. A pool is organized as a list of areas of virtually contiguous memory. Whenever the protection functionality is invoked on a pool, all the areas it contains that are not yet read-only are write-protected. The process of growing and protecting the pool can be iterated at will. Each iteration will prevent further allocation from the memory area currently active, turn it into read-only mode and then proceed to secure whatever other area might still be unprotected. Write-protecting some part of a pool before completing all the allocations can be wasteful; however, it will guarantee the minimum window of vulnerability, since the data can be allocated, initialized and protected in a single sweep. There are pros and cons, depending on the allocation patterns, the size of the areas being allocated, and the time intervals between initialization and protection. Destroying a pool is the only way to claim back the associated memory. It is up to its user to avoid any further references to the memory that was allocated, once the destruction is invoked. 
An example where it is desirable to destroy a pool and claim back its memory is when unloading a kernel module. A module can have as many pools as needed. Since pmalloc memory is obtained from vmalloc, an attacker that has gained access to the physical mapping, still has to identify where the target of the attack (in virtually contiguous mapping) is located. Compared to plain vmalloc, pmalloc does not generate as much TLB trashing, since it can host multiple allocations in the same page, where present. Signed-off-by: Igor Stoppa --- include/linux/pmalloc.h | 166 ++ include/linux/vmalloc.h | 3 + mm/Kconfig | 6 ++ mm/Makefile | 1 + mm/pmalloc.c| 265 mm/usercopy.c | 33 ++ mm/vmalloc.c| 2 +- 7 files changed, 475 insertions(+), 1 deletion(-) create mode 100644 include/linux/pmalloc.h create mode 100644 mm/pmalloc.c diff --git a/include/linux/pmalloc.h b/include/linux/pmalloc.h new file mode 100644 index ..1c24067eb167 --- /dev/null +++ b/include/linux/pmalloc.h @@ -0,0 +1,166 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * pmalloc.h: Header for Protectable Memory Allocator + * + * (C) Copyright 2017-18 Huawei Technologies Co. Ltd. + * Author: Igor Stoppa + */ + +#ifndef _LINUX_PMALLOC_H +#define _LINUX_PMALLOC_H + + +#include +#include + +/* + * Library for dynamic allocation of pools of protectable memory. + * A pool is a single linked list of vmap_area structures. + * Whenever a pool is protected, all the areas it contain at that point + * are write protected. + * More areas can be added and protected, in the same way. + * Memory in a pool cannot be individually unprotected, but the pool can + * be destroyed. + * Upon destruction of a certain pool, all the related memory is released, + * including its metadata. + * + * Pmalloc memory is intended to complement __read_only_after_init. 
+ * It can be used, for example, where there is a write-once variable, for + * which it is not possible to know the initialization value before init + * is completed (which is what __read_only_after_init requires). + * + * It can be useful also where the amount of data to protect is not known + * at compile time and the memory can only be allocated dynamically. + * + * Finally, it can be useful also when it is desirable to control + * dynamically (for example through the command line) if something ought + * to be protected or not, without having to rebuild the kernel (like in + * the build used for a linux distro). + */ + + +#define PMALLOC_REFILL_DEFAULT (0) +#define PMALLOC_ALIGN_DEFAULT ARCH_KMALLOC_MINALIGN + +struct pmalloc_pool *pmalloc_create_custom_pool(size_t refill, + unsigned short align_order); + +/** + * pmalloc_create_pool() - create a protectable memory pool + * + * Shorthand for pmalloc_create_custom_pool() with default argument: + * * refill is set to PMALLOC_REFILL_DEFAULT + * * align_order is set to PMALLOC_ALIGN_DEFAULT +
Applied "ASoC: tfa9879: switch to SPDX license tag" to the asoc tree
The patch ASoC: tfa9879: switch to SPDX license tag has been applied to the asoc tree at https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark From 55c19bd95f4dac2ee221272349900dda75a67ebb Mon Sep 17 00:00:00 2001 From: Peter Rosin Date: Fri, 13 Apr 2018 13:47:51 +0200 Subject: [PATCH] ASoC: tfa9879: switch to SPDX license tag It's less overhead, clearer and generally neater. Signed-off-by: Peter Rosin Signed-off-by: Mark Brown --- sound/soc/codecs/tfa9879.c | 18 ++ sound/soc/codecs/tfa9879.h | 7 +-- 2 files changed, 7 insertions(+), 18 deletions(-) diff --git a/sound/soc/codecs/tfa9879.c b/sound/soc/codecs/tfa9879.c index 4ed020262a27..abc114a3ae2b 100644 --- a/sound/soc/codecs/tfa9879.c +++ b/sound/soc/codecs/tfa9879.c @@ -1,15 +1,9 @@ -/* - * tfa9879.c -- driver for NXP Semiconductors TFA9879 - * - * Copyright (C) 2014 Axentia Technologies AB - * Author: Peter Rosin - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License as published by the - * Free Software Foundation; either version 2 of the License, or (at your - * option) any later version. 
- * - */ +// SPDX-License-Identifier: GPL-2.0+ +// +// tfa9879.c -- driver for NXP Semiconductors TFA9879 +// +// Copyright (C) 2014 Axentia Technologies AB +// Author: Peter Rosin #include #include diff --git a/sound/soc/codecs/tfa9879.h b/sound/soc/codecs/tfa9879.h index 3408c90c4628..66c88d0396fe 100644 --- a/sound/soc/codecs/tfa9879.h +++ b/sound/soc/codecs/tfa9879.h @@ -1,14 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ /* * tfa9879.h -- driver for NXP Semiconductors TFA9879 * * Copyright (C) 2014 Axentia Technologies AB * Author: Peter Rosin - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License as published by the - * Free Software Foundation; either version 2 of the License, or (at your - * option) any later version. - * */ #ifndef _TFA9879_H -- 2.17.0
[RFC PATCH v22 0/6] mm: security: ro protection for dynamic data
This patch-set introduces the possibility of protecting memory that has been allocated dynamically. The memory is managed in pools: when a memory pool is protected, all the memory that is currently part of it will become R/O. A R/O pool can be expanded (adding more protectable memory). It can also be destroyed, to recover its memory, but it cannot be turned back into R/W mode. This is intentional. This feature is meant for data that doesn't need further modifications after initialization. However the data might need to be released, for example as part of module unloading. The pool, therefore, can be destroyed. An example is provided, in the form of self-testing. Since it was advised to give an example of protecting real kernel data [1], a well known vulnerability has been used to demo an effective use of pmalloc. [1] http://www.openwall.com/lists/kernel-hardening/2018/03/29/7 However, it turned out to be almost a how-to for attacking the kernel, so it was sent first to secur...@kernel.org, to obtain clearance for its publication. 
Changes since v21: [http://www.openwall.com/lists/kernel-hardening/2018/03/27/23]
* fixed type mismatch error in use of max(), detected by gcc 7.3
* converted internal types into size_t
* fixed leak of vmalloc memory in the self-test code

Igor Stoppa (6):
  struct page: add field for vm_struct
  vmalloc: rename llist field in vmap_area
  Protectable Memory
  Documentation for Pmalloc
  Pmalloc selftest
  lkdtm: crash on overwriting protected pmalloc var

 Documentation/core-api/index.rst   |   1 +
 Documentation/core-api/pmalloc.rst | 107 +++
 drivers/misc/lkdtm/core.c          |   3 +
 drivers/misc/lkdtm/lkdtm.h         |   1 +
 drivers/misc/lkdtm/perms.c         |  25
 include/linux/mm_types.h           |   1 +
 include/linux/pmalloc.h            | 166 +++
 include/linux/test_pmalloc.h       |  24
 include/linux/vmalloc.h            |   5 +-
 init/main.c                        |   2 +
 mm/Kconfig                         |  16 +++
 mm/Makefile                        |   2 +
 mm/pmalloc.c                       | 265 +
 mm/test_pmalloc.c                  | 137 +++
 mm/usercopy.c                      |  33 +
 mm/vmalloc.c                       |  10 +-
 16 files changed, 793 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/core-api/pmalloc.rst
 create mode 100644 include/linux/pmalloc.h
 create mode 100644 include/linux/test_pmalloc.h
 create mode 100644 mm/pmalloc.c
 create mode 100644 mm/test_pmalloc.c

-- 2.14.1
[PATCH 2/6] vmalloc: rename llist field in vmap_area
The vmap_area structure has a field of type struct llist_node, named purge_list and is used when performing lazy purge of the area. Such field is left unused during the actual utilization of the structure. This patch renames the field to a more generic "area_list", to allow for utilization outside of the purging phase. Since the purging happens after the vmap_area is dismissed, its use is mutually exclusive with any use performed while the area is allocated. Signed-off-by: Igor Stoppa --- include/linux/vmalloc.h | 2 +- mm/vmalloc.c| 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 1e5d8c392f15..2d07dfef3cfd 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -47,7 +47,7 @@ struct vmap_area { unsigned long flags; struct rb_node rb_node; /* address sorted rbtree */ struct list_head list; /* address sorted list */ - struct llist_node purge_list;/* "lazy purge" list */ + struct llist_node area_list;/* generic list of areas */ struct vm_struct *vm; struct rcu_head rcu_head; }; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 61a1ca22b0f6..1bb2233bb262 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -682,7 +682,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) lockdep_assert_held(&vmap_purge_lock); valist = llist_del_all(&vmap_purge_list); - llist_for_each_entry(va, valist, purge_list) { + llist_for_each_entry(va, valist, area_list) { if (va->va_start < start) start = va->va_start; if (va->va_end > end) @@ -696,7 +696,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) flush_tlb_kernel_range(start, end); spin_lock(&vmap_area_lock); - llist_for_each_entry_safe(va, n_va, valist, purge_list) { + llist_for_each_entry_safe(va, n_va, valist, area_list) { int nr = (va->va_end - va->va_start) >> PAGE_SHIFT; __free_vmap_area(va); @@ -743,7 +743,7 @@ static void free_vmap_area_noflush(struct vmap_area *va) &vmap_lazy_nr); /* After 
this point, we may free va at any time */ - llist_add(&va->purge_list, &vmap_purge_list); + llist_add(&va->area_list, &vmap_purge_list); if (unlikely(nr_lazy > lazy_max_pages())) try_purge_vmap_area_lazy(); -- 2.14.1
[PATCH 1/6] struct page: add field for vm_struct
When a page is used for virtual memory, it is often necessary to obtain a handler to the corresponding vm_struct, which refers to the virtually continuous area generated when invoking vmalloc. The struct page has a "mapping" field, which can be re-used, to store a pointer to the parent area. This will avoid more expensive searches, later on. Signed-off-by: Igor Stoppa Reviewed-by: Jay Freyensee Reviewed-by: Matthew Wilcox --- include/linux/mm_types.h | 1 + mm/vmalloc.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 21612347d311..c74e2aa9a48b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -86,6 +86,7 @@ struct page { void *s_mem;/* slab first object */ atomic_t compound_mapcount; /* first tail page */ /* page_deferred_list().next -- second tail page */ + struct vm_struct *area; }; /* Second double word */ diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ebff729cc956..61a1ca22b0f6 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1536,6 +1536,7 @@ static void __vunmap(const void *addr, int deallocate_pages) struct page *page = area->pages[i]; BUG_ON(!page); + page->area = NULL; __free_pages(page, 0); } @@ -1705,6 +1706,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, area->nr_pages = i; goto fail; } + page->area = area; area->pages[i] = page; if (gfpflags_allow_blocking(gfp_mask|highmem_mask)) cond_resched(); -- 2.14.1
Re: [PATCH RFC 5/8] mm: only mark section offline when all pages are offline
On 13.04.2018 15:32, David Hildenbrand wrote: > If any page is still online, the section should stay online. > > Signed-off-by: David Hildenbrand > --- This is a duplicate, please ignore. (get_maintainers.sh and my mail server had a little clinch, so I had to send half of the series out manually -_- ) -- Thanks, David / dhildenb
Re: [PATCH v3 0/2] ASoC: max9860/tfa9879: switch to SPDX license tag
On Fri, Apr 13, 2018 at 01:47:49PM +0200, Peter Rosin wrote: > Peter Rosin (2): > ASoC: max9860: switch to SPDX license tag This one didn't turn up yet - it's only just been sent though so it might be stuck in a mail queue somewhere, I've applied patch 2 and I expect I'll apply this one as soon as it appears.
Re: [PATCH RFC 2/8] mm: introduce PG_offline
On Fri 13-04-18 15:16:26, David Hildenbrand wrote:
> online_pages()/offline_pages() theoretically allows us to work on
> sub-section sizes. This is especially relevant in the context of
> virtualization. It e.g. allows us to add/remove memory to Linux in a VM in
> 4MB chunks.

Well, theoretically possible, but this would require a lot of auditing because the assumption that hotplug works per section is quite a widespread one.

> While the whole section is marked as online/offline, we have to know
> the state of each page. E.g. to not read memory that is not online
> during kexec() or to properly mark a section as offline as soon as all
> contained pages are offline.

But you cannot use a page flag for that, I am afraid. Page flags are an extremely scarce resource. I haven't looked at the rest of the series, but _if_ we have a spare bit, which I am not really sure about, then you should prove there are no other ways around this.

> Signed-off-by: David Hildenbrand

-- Michal Hocko SUSE Labs
Re: [PATCH 3/3] dcache: account external names as indirectly reclaimable memory
On Mon, Mar 05, 2018 at 01:37:43PM +, Roman Gushchin wrote: > I was reported about suspicious growth of unreclaimable slabs > on some machines. I've found that it happens on machines > with low memory pressure, and these unreclaimable slabs > are external names attached to dentries. > > External names are allocated using generic kmalloc() function, > so they are accounted as unreclaimable. But they are held > by dentries, which are reclaimable, and they will be reclaimed > under the memory pressure. > > In particular, this breaks MemAvailable calculation, as it > doesn't take unreclaimable slabs into account. > This leads to a silly situation, when a machine is almost idle, > has no memory pressure and therefore has a big dentry cache. > And the resulting MemAvailable is too low to start a new workload. > > To address the issue, the NR_INDIRECTLY_RECLAIMABLE_BYTES counter > is used to track the amount of memory, consumed by external names. > The counter is increased in the dentry allocation path, if an external > name structure is allocated; and it's decreased in the dentry freeing > path. 
> > To reproduce the problem I've used the following Python script: > import os > > for iter in range (0, 1000): > try: > name = ("/some_long_name_%d" % iter) + "_" * 220 > os.stat(name) > except Exception: > pass > > Without this patch: > $ cat /proc/meminfo | grep MemAvailable > MemAvailable:7811688 kB > $ python indirect.py > $ cat /proc/meminfo | grep MemAvailable > MemAvailable:2753052 kB > > With the patch: > $ cat /proc/meminfo | grep MemAvailable > MemAvailable:7809516 kB > $ python indirect.py > $ cat /proc/meminfo | grep MemAvailable > MemAvailable:7749144 kB > > Signed-off-by: Roman Gushchin > Cc: Andrew Morton > Cc: Alexander Viro > Cc: Michal Hocko > Cc: Johannes Weiner > Cc: linux-fsde...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: linux...@kvack.org > Cc: kernel-t...@fb.com > --- > fs/dcache.c | 29 - > 1 file changed, 24 insertions(+), 5 deletions(-) > > diff --git a/fs/dcache.c b/fs/dcache.c > index 5c7df1df81ff..a0312d73f575 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -273,8 +273,16 @@ static void __d_free(struct rcu_head *head) > static void __d_free_external(struct rcu_head *head) > { > struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu); > - kfree(external_name(dentry)); > - kmem_cache_free(dentry_cache, dentry); > + struct external_name *name = external_name(dentry); > + unsigned long bytes; > + > + bytes = dentry->d_name.len + offsetof(struct external_name, name[1]); > + mod_node_page_state(page_pgdat(virt_to_page(name)), > + NR_INDIRECTLY_RECLAIMABLE_BYTES, > + -kmalloc_size(kmalloc_index(bytes))); > + > + kfree(name); > + kmem_cache_free(dentry_cache, dentry); > } > > static inline int dname_external(const struct dentry *dentry) > @@ -1598,6 +1606,7 @@ struct dentry *__d_alloc(struct super_block *sb, const > struct qstr *name) > struct dentry *dentry; > char *dname; > int err; > + size_t reclaimable = 0; > > dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); > if (!dentry) > @@ -1614,9 +1623,11 @@ 
struct dentry *__d_alloc(struct super_block *sb, const > struct qstr *name) > name = &slash_name; > dname = dentry->d_iname; > } else if (name->len > DNAME_INLINE_LEN-1) { > - size_t size = offsetof(struct external_name, name[1]); > - struct external_name *p = kmalloc(size + name->len, > - GFP_KERNEL_ACCOUNT); > + struct external_name *p; > + > + reclaimable = offsetof(struct external_name, name[1]) + > + name->len; > + p = kmalloc(reclaimable, GFP_KERNEL_ACCOUNT); Can't we use kmem_cache_alloc with own cache created with SLAB_RECLAIM_ACCOUNT if they are reclaimable? With that, it would help fragmentation problem with __GFP_RECLAIMABLE for page allocation as well as counting problem, IMHO. > if (!p) { > kmem_cache_free(dentry_cache, dentry); > return NULL; > @@ -1665,6 +1676,14 @@ struct dentry *__d_alloc(struct super_block *sb, const > struct qstr *name) > } > } > > + if (unlikely(reclaimable)) { > + pg_data_t *pgdat; > + > + pgdat = page_pgdat(virt_to_page(external_name(dentry))); > + mod_node_page_state(pgdat, NR_INDIRECTLY_RECLAIMABLE_BYTES, > + kmalloc_size(kmalloc_index(reclaimable))); > + } > + > this_cpu_inc(nr_dentry); > > return dentry; > -- > 2.14.3 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majord...@kvack.org. For more
Re: [RFC PATCH 11/35] ovl: readd read_iter
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi wrote: > Implement stacked reading. > I couldn't decipher the meaning of "readd" in the subject of this and other file ops patches?? > Signed-off-by: Miklos Szeredi > --- > fs/overlayfs/file.c | 56 > + > 1 file changed, 56 insertions(+) > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > index 409b542ff30c..a19429c5965d 100644 > --- a/fs/overlayfs/file.c > +++ b/fs/overlayfs/file.c > @@ -9,6 +9,7 @@ > #include > #include > #include > +#include > #include "overlayfs.h" > > static struct file *ovl_open_realfile(const struct file *file) > @@ -129,8 +130,63 @@ static loff_t ovl_llseek(struct file *file, loff_t > offset, int whence) > i_size_read(realinode)); > } > > +static void ovl_file_accessed(struct file *file) > +{ > + struct inode *inode = file_inode(file); > + > + if ((file->f_flags & O_NOATIME) || !ovl_inode_upper(inode)) > + return; > + > + ovl_copytimes(inode); > + touch_atime(&file->f_path); > +} > + > +static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb) > +{ > + int ifl = iocb->ki_flags; > + rwf_t flags = 0; > + > + if (ifl & IOCB_NOWAIT) > + flags |= RWF_NOWAIT; > + if (ifl & IOCB_HIPRI) > + flags |= RWF_HIPRI; > + if (ifl & IOCB_DSYNC) > + flags |= RWF_DSYNC; > + if (ifl & IOCB_SYNC) > + flags |= RWF_SYNC; > + > + return flags; > +} > + > +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter) > +{ > + struct file *file = iocb->ki_filp; > + struct fd real; > + const struct cred *old_cred; > + ssize_t ret; > + > + if (!iov_iter_count(iter)) > + return 0; > + > + ret = ovl_real_file(file, &real); > + if (ret) > + return ret; > + > + old_cred = ovl_override_creds(file_inode(file)->i_sb); > + ret = vfs_iter_read(real.file, iter, &iocb->ki_pos, > + ovl_iocb_to_rwf(iocb)); > + revert_creds(old_cred); > + > + ovl_file_accessed(file); > + > + fdput(real); I find it confusing that the name of ovl_real_file() does not suggest it may take a reference, so this fdput() looks unbalanced. 
All other ovl_XXX_{real,upper,lower} helpers do not take a reference. Perhaps something along the lines of ovl_file_real_fdget(). Thanks, Amir.
Re: [PATCH v9 00/24] Speculative page faults
On 14/03/2018 14:11, Michal Hocko wrote:
> On Tue 13-03-18 18:59:30, Laurent Dufour wrote:
>> Changes since v8:
>> - Don't check PMD when locking the pte when THP is disabled
>>   Thanks to Daniel Jordan for reporting this.
>> - Rebase on 4.16
>
> Is this really worth reposting the whole pile? I mean this is at v9,
> each doing little changes. It is quite tiresome to barely get to a
> bookmarked version just to find out that there are 2 new versions out.
>
> I am sorry to be grumpy and I can understand some frustration it doesn't
> move forward that easily but this is a _big_ change. We should start
> with a real high level review rather than doing small changes here and
> there and reach v20 quickly.

I know this would mean v10, but there has been a bunch of reviews from David Rientjes and Jerome Glisse, and I had to make many changes to address them. So I think it is time to push a v10. If you have already started a review of this v9 series, please send me your remarks so that I can include them in this v10 asap.

Thanks, Laurent.
[PATCH RFC 7/8] mm: allow to control onlining/offlining of memory by a driver
Some devices (esp. paravirtualized) might want to control - when to online/offline a memory block - how to online memory (MOVABLE/NORMAL) - in which granularity to online/offline memory So let's add a new flag "driver_managed" and disallow to change the state by user space. Device onlining/offlining will still work, however the memory will not be actually onlined/offlined. That has to be handled by the device driver that owns the memory. Signed-off-by: David Hildenbrand --- drivers/base/memory.c | 22 ++ drivers/xen/balloon.c | 2 +- include/linux/memory.h | 1 + include/linux/memory_hotplug.h | 4 +++- mm/memory_hotplug.c| 34 -- 5 files changed, 51 insertions(+), 12 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index bffe8616bd55..3b8616551561 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -231,27 +231,28 @@ static bool pages_correctly_probed(unsigned long start_pfn) * Must already be protected by mem_hotplug_begin(). */ static int -memory_block_action(unsigned long phys_index, unsigned long action, int online_type) +memory_block_action(struct memory_block *mem, unsigned long action) { - unsigned long start_pfn; + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - int ret; + int ret = 0; - start_pfn = section_nr_to_pfn(phys_index); + if (mem->driver_managed) + return 0; switch (action) { case MEM_ONLINE: if (!pages_correctly_probed(start_pfn)) return -EBUSY; - ret = online_pages(start_pfn, nr_pages, online_type); + ret = online_pages(start_pfn, nr_pages, mem->online_type); break; case MEM_OFFLINE: ret = offline_pages(start_pfn, nr_pages); break; default: WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: " -"%ld\n", __func__, phys_index, action, action); +"%ld\n", __func__, mem->start_section_nr, action, action); ret = -EINVAL; } @@ -269,8 +270,7 @@ static int memory_block_change_state(struct memory_block *mem, if (to_state == MEM_OFFLINE) 
mem->state = MEM_GOING_OFFLINE; - ret = memory_block_action(mem->start_section_nr, to_state, - mem->online_type); + ret = memory_block_action(mem, to_state); mem->state = ret ? from_state_req : to_state; @@ -350,6 +350,11 @@ store_mem_state(struct device *dev, */ mem_hotplug_begin(); + if (mem->driver_managed) { + ret = -EINVAL; + goto out; + } + switch (online_type) { case MMOP_ONLINE_KERNEL: case MMOP_ONLINE_MOVABLE: @@ -364,6 +369,7 @@ store_mem_state(struct device *dev, ret = -EINVAL; /* should never happen */ } +out: mem_hotplug_done(); err: unlock_device_hotplug(); diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c index 065f0b607373..89981d573c06 100644 --- a/drivers/xen/balloon.c +++ b/drivers/xen/balloon.c @@ -401,7 +401,7 @@ static enum bp_state reserve_additional_memory(void) * callers drop the mutex before trying again. */ mutex_unlock(&balloon_mutex); - rc = add_memory_resource(nid, resource, memhp_auto_online); + rc = add_memory_resource(nid, resource, memhp_auto_online, false); mutex_lock(&balloon_mutex); if (rc) { diff --git a/include/linux/memory.h b/include/linux/memory.h index 9f8cd856ca1e..018c5e5ecde1 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -29,6 +29,7 @@ struct memory_block { unsigned long state;/* serialized by the dev->lock */ int section_count; /* serialized by mem_sysfs_mutex */ int online_type;/* for passing data to online routine */ + bool driver_managed;/* driver handles online/offline */ int phys_device;/* to which fru does this belong? 
*/ void *hw; /* optional pointer to fw/hw data */ int (*phys_callback)(struct memory_block *); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index e0e49b5b1ee1..46c6ceb1110d 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -320,7 +320,9 @@ static inline void remove_memory(int nid, u64 start, u64 size) {} extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, void *arg, int (*func)(struct memory_block *, void *)); extern int add_memory(int nid, u64 start, u64 size); -extern int add_memory_resource(int nid, struct resource *resource, bool online); +
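The intended split of responsibilities in patch 7/8 can be sketched with a toy model: user-space requests on a driver-managed block are rejected with -EINVAL, while device-level onlining/offlining still succeeds but leaves the pages untouched for the owning driver to handle. The struct layout and the names block_action() and user_store_state() below are simplified, hypothetical stand-ins for memory_block_action() and store_mem_state(), not the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum mem_state { MEM_OFFLINE, MEM_ONLINE };

/* Simplified stand-in for struct memory_block. */
struct memory_block {
	enum mem_state state;	/* state of the memory block device */
	bool driver_managed;	/* driver handles online/offline itself */
	bool pages_online;	/* models what online_pages()/offline_pages() do */
};

/* Device-level state change: always succeeds for driver-managed blocks,
 * but does not touch the pages in that case. */
static int block_action(struct memory_block *mem, enum mem_state to)
{
	if (!mem->driver_managed)
		mem->pages_online = (to == MEM_ONLINE);
	mem->state = to;
	return 0;
}

/* Request coming from user space (the sysfs "state" file in the patch). */
static int user_store_state(struct memory_block *mem, enum mem_state to)
{
	if (mem->driver_managed)
		return -EINVAL;	/* user space must not change the state */
	return block_action(mem, to);
}
```

In this model a regular block goes online with its pages when user space asks for it, whereas a driver-managed block only changes state through the device path and its pages stay as the driver left them.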
[PATCH RFC 5/8] mm: only mark section offline when all pages are offline
If any page is still online, the section should stay online. Signed-off-by: David Hildenbrand --- mm/page_alloc.c | 2 +- mm/sparse.c | 25 - 2 files changed, 25 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2e5dcfdb0908..ae9023da2ca2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8013,7 +8013,6 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) break; if (pfn == end_pfn) return; - offline_mem_sections(pfn, end_pfn); zone = page_zone(pfn_to_page(pfn)); spin_lock_irqsave(&zone->lock, flags); pfn = start_pfn; @@ -8051,6 +8050,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); + offline_mem_sections(start_pfn, end_pfn); } #endif diff --git a/mm/sparse.c b/mm/sparse.c index 58cab483e81b..44978cb18fed 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -623,7 +623,27 @@ void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn) } #ifdef CONFIG_MEMORY_HOTREMOVE -/* Mark all memory sections within the pfn range as online */ +static bool all_pages_in_section_offline(unsigned long section_nr) +{ + unsigned long pfn = section_nr_to_pfn(section_nr); + struct page *page; + int i; + + for (i = 0; i < PAGES_PER_SECTION; i++, pfn++) { + if (!pfn_valid(pfn)) + continue; + + page = pfn_to_page(pfn); + if (!PageOffline(page)) + return false; + } + return true; +} + +/* + * Mark all memory sections within the pfn range as offline (if all pages + * of a memory section are already offline) + */ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) { unsigned long pfn; @@ -639,6 +659,9 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn) if (WARN_ON(!valid_section_nr(section_nr))) continue; + if (!all_pages_in_section_offline(section_nr)) + continue; + ms = __nr_to_section(section_nr); ms->section_mem_map &= ~SECTION_IS_ONLINE; } -- 2.14.3
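The rule introduced by this patch (a section only goes offline once every valid page in it is offline) can be illustrated with a small stand-alone model. It is a sketch under stated assumptions: PAGES_PER_SECTION is shrunk to a toy value, per-page state is kept in plain arrays instead of page flags, and offline_range() is a hypothetical helper that combines offlining pages with the section check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGES_PER_SECTION 8	/* toy value, far smaller than in the kernel */

struct section {
	bool online;
	bool page_valid[PAGES_PER_SECTION];	/* models pfn_valid() */
	bool page_offline[PAGES_PER_SECTION];	/* models PageOffline() */
};

/* Mirrors all_pages_in_section_offline(): holes are skipped, any valid
 * page that is still online keeps the section online. */
static bool all_pages_offline(const struct section *s)
{
	for (size_t i = 0; i < PAGES_PER_SECTION; i++) {
		if (!s->page_valid[i])
			continue;
		if (!s->page_offline[i])
			return false;
	}
	return true;
}

/* Offline a sub-range first, then re-evaluate the section as a whole. */
static void offline_range(struct section *s, size_t first, size_t nr)
{
	for (size_t i = first; i < first + nr && i < PAGES_PER_SECTION; i++)
		s->page_offline[i] = true;

	if (all_pages_offline(s))
		s->online = false;
}
```

This also shows why the patch moves the offline_mem_sections() call to the end of __offline_isolated_pages(): the check only gives the right answer after the pages in the range have been marked offline.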
[PATCH RFC 8/8] mm: export more functions used to online/offline memory
Kernel modules that want to control how/when memory is onlined/offlined need these functions. Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 4 1 file changed, 4 insertions(+) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ac14ea772792..3c374d308cf4 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -979,6 +979,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ memory_notify(MEM_CANCEL_ONLINE, &arg); return ret; } +EXPORT_SYMBOL(online_pages); #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ static void reset_node_present_pages(pg_data_t *pgdat) @@ -1296,6 +1297,7 @@ bool is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages) /* All pageblocks in the memory block are likely to be hot-removable */ return true; } +EXPORT_SYMBOL(is_mem_section_removable); /* * Confirm all pages in a range [start, end) belong to the same zone. @@ -1752,6 +1754,7 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages) { return __offline_pages(start_pfn, start_pfn + nr_pages); } +EXPORT_SYMBOL(offline_pages); #endif /* CONFIG_MEMORY_HOTREMOVE */ /** @@ -1802,6 +1805,7 @@ int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn, return 0; } +EXPORT_SYMBOL(walk_memory_range); #ifdef CONFIG_MEMORY_HOTREMOVE static int check_memblock_offlined_cb(struct memory_block *mem, void *arg) -- 2.14.3
[PATCH RFC 6/8] mm: offline_pages() is also limited by MAX_ORDER
Page blocks might contain references to the next page block. So a page block cannot be offlined independently. E.g. on x86: page block size is 2MB, MAX_ORDER -1 (10) allows 4MB allocations. -> Right now, __offline_isolated_pages() will mark pages in the following page block as reserved. Let's document offline_pages() while at it. Signed-off-by: David Hildenbrand --- mm/memory_hotplug.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 3a8d56476233..1d6054edc241 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1598,11 +1598,14 @@ static int __ref __offline_pages(unsigned long start_pfn, struct zone *zone; struct memory_notify arg; - /* at least, alignment against pageblock is necessary */ if (!IS_ALIGNED(start_pfn, pageblock_nr_pages)) return -EINVAL; + if (!IS_ALIGNED(start_pfn, (1 << (MAX_ORDER - 1)))) + return -EINVAL; if (!IS_ALIGNED(end_pfn, pageblock_nr_pages)) return -EINVAL; + if (!IS_ALIGNED(end_pfn, (1 << (MAX_ORDER - 1)))) + return -EINVAL; /* This makes hotplug much easier...and readable. we assume this for now. .*/ if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end)) @@ -1699,7 +1702,22 @@ static int __ref __offline_pages(unsigned long start_pfn, return ret; } -/* Must be protected by mem_hotplug_begin() or a device_lock */ +/** + * offline_pages - offline pages in a given range (that are currently online) + * @start_pfn: start pfn of the memory range + * @nr_pages: the number of pages + * + * This function tries to offline the given pages. The alignment/size that + * can be used is max(pageblock_nr_pages, 1 << (MAX_ORDER - 1)). + * + * Returns 0 if successful, -EBUSY if the pages cannot be offlined and + * -EINVAL if start_pfn/nr_pages is not properly aligned or not in a zone. + * -EINTR is returned if interrupted by a signal. + * + * Bad things will happen if pages are already offline. 
+ * + * Must be protected by mem_hotplug_begin() or a device_lock + */ int offline_pages(unsigned long start_pfn, unsigned long nr_pages) { return __offline_pages(start_pfn, start_pfn + nr_pages); -- 2.14.3
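The alignment rule documented above can be worked through with the numbers from the commit message. The sketch below assumes 4 KiB pages, a 2MB page block (512 pages) and MAX_ORDER = 11, so that 1 << (MAX_ORDER - 1) is 1024 pages, i.e. 4MB; the function names are hypothetical. Because both values are powers of two, checking against the larger one is equivalent to the two separate IS_ALIGNED() checks in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/* Assumed values matching the x86 example in the commit message. */
#define PAGEBLOCK_NR_PAGES	512UL	/* 2MB page block with 4 KiB pages */
#define MAX_ORDER		11	/* MAX_ORDER - 1 == 10 -> 4MB */

/* max(pageblock_nr_pages, 1 << (MAX_ORDER - 1)), in pages */
static unsigned long offline_granularity(void)
{
	unsigned long max_order_pages = 1UL << (MAX_ORDER - 1);

	return max_order_pages > PAGEBLOCK_NR_PAGES ?
	       max_order_pages : PAGEBLOCK_NR_PAGES;
}

/* Both ends of [start_pfn, start_pfn + nr_pages) must be aligned. */
static bool range_can_be_offlined(unsigned long start_pfn,
				  unsigned long nr_pages)
{
	unsigned long g = offline_granularity();

	return IS_ALIGNED(start_pfn, g) && IS_ALIGNED(start_pfn + nr_pages, g);
}
```

With these assumptions a range that is only page-block aligned, such as 512 pages starting at pfn 512, is rejected, while a 1024-page range starting at pfn 1024 passes the check.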
[PATCH RFC 4/8] kdump: expose PG_offline
This allows user space to skip pages that are offline when dumping. This is
especially relevant when dealing with pages that have been unplugged in the
context of virtualization, and their backing storage has already been freed.

Signed-off-by: David Hildenbrand
---
 kernel/crash_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index a93590cdd9e1..d6f21b19aeb3 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -463,6 +463,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 #ifdef CONFIG_HUGETLB_PAGE
 	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
 #endif
+#ifdef CONFIG_MEMORY_HOTPLUG
+	VMCOREINFO_NUMBER(PG_offline);
+#endif
 
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
-- 
2.14.3
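A rough sketch of how a user-space dump filter might consume the exported value. It assumes the usual "NUMBER(name)=value" text form of vmcoreinfo entries; the helper names, the sample vmcoreinfo string and the bit number 25 are made up for illustration and are not taken from any real tool or kernel build.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "NUMBER(<name>)=<value>" in a vmcoreinfo text blob.
 * Returns the value, or -1 if the key is absent (for example a kernel
 * without this patch or without CONFIG_MEMORY_HOTPLUG).
 */
static long vmcoreinfo_number(const char *info, const char *name)
{
	char key[128];
	int len;
	const char *line = info;

	assert(info && name);

	len = snprintf(key, sizeof(key), "NUMBER(%s)=", name);
	if (len < 0 || (size_t)len >= sizeof(key))
		return -1;

	while (line && *line) {
		if (strncmp(line, key, (size_t)len) == 0)
			return strtol(line + len, NULL, 10);
		line = strchr(line, '\n');	/* advance to the next entry */
		if (line)
			line++;
	}
	return -1;
}

/* A dump filter would skip the page when this returns true. */
static bool page_is_offline(unsigned long flags, long pg_offline_bit)
{
	/* Flag unknown or out of range: cannot tell, so do not skip. */
	if (pg_offline_bit < 0 ||
	    pg_offline_bit >= (long)(sizeof(flags) * CHAR_BIT))
		return false;
	return (flags >> pg_offline_bit) & 1UL;
}
```

Treating a missing entry as "not offline" keeps the filter conservative on kernels that do not export the flag.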
Re: [PATCH] x86/mm: vmemmap and vmalloc base addresses are unsigned longs
On Thu, Apr 12, 2018 at 02:39:10PM +0200, Jiri Kosina wrote:
> From: Jiri Kosina
>
> Commits 9b46a051e4 ("x86/mm: Initialize vmemmap_base at boot-time") and
> a7412546d8 ("x86/mm: Adjust vmalloc base and size at boot-time") lost the
> type information for __VMALLOC_BASE_L4, __VMALLOC_BASE_L5,
> __VMEMMAP_BASE_L4 and __VMEMMAP_BASE_L5 constants.
>
> Let's declare them explicitly unsigned long again.

It is just cosmetics, right? I mean these literals are 'unsigned long' anyway.

-- 
Kirill A. Shutemov
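The point about the literal type can be checked with C11 _Generic. In C, an unsuffixed hexadecimal constant takes the first of int, unsigned int, long, unsigned long, long long, unsigned long long that can represent it, so a 64-bit address with the top bit set ends up as unsigned long on LP64 targets. The constant below is only an illustrative address of that shape, and the helper names are invented for this sketch.

```c
#include <assert.h>

enum literal_kind {
	KIND_INT, KIND_UINT, KIND_LONG, KIND_ULONG,
	KIND_LLONG, KIND_ULLONG, KIND_OTHER
};

/* Map the static type of an expression to an enum value (C11). */
#define KIND_OF(expr) _Generic((expr), \
	int: KIND_INT, unsigned int: KIND_UINT, \
	long: KIND_LONG, unsigned long: KIND_ULONG, \
	long long: KIND_LLONG, unsigned long long: KIND_ULLONG, \
	default: KIND_OTHER)

/* Illustrative high address without a suffix. */
static enum literal_kind kind_of_unsuffixed_high(void)
{
	return KIND_OF(0xffffc90000000000);
}

/* The same value with an explicit UL suffix. */
static enum literal_kind kind_of_suffixed_high(void)
{
	return KIND_OF(0xffffc90000000000UL);
}

/* A small constant stays a plain int. */
static enum literal_kind kind_of_unsuffixed_low(void)
{
	return KIND_OF(0x1000);
}

/* Holds on LP64 targets such as x86-64 Linux; other ABIs may differ. */
static void check_lp64_literals(void)
{
	if (sizeof(long) == 8) {
		assert(kind_of_unsuffixed_high() == KIND_ULONG);
		assert(kind_of_suffixed_high() == KIND_ULONG);
	}
}
```

On such targets both spellings have the same type, which is consistent with the "just cosmetics" reading for plain C expressions.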
Build error for samples/bpf/ due to commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS")
Hi Peter,

Your commit d0266046ad54 ("x86: Remove FAST_FEATURE_TESTS") broke the build
for several samples/bpf programs. I'm unsure what the best way forward is
to unbreak these...

The issue is that these samples are built with LLVM/clang (which doesn't
like 'asm goto' constructs). And they end up including
arch/x86/include/asm/cpufeature.h via a long include path, see build
examples below (through different paths to include/linux/thread_info.h).

Maybe Alexei or Daniel have an idea how to work around this?
As tools/testing/selftests/bpf/ does not seem to fail!?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Build error#1:
--
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
	-I./tools/testing/selftests/bpf/ \
	-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
	-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
	-Wno-gnu-variable-sized-type-not-at-end \
	-Wno-address-of-packed-member -Wno-tautological-compare \
	-Wno-unknown-warning-option \
	-O2 -emit-llvm -c samples/bpf/sockex2_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:3:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
	asm_volatile_goto("1: jmp 6f\n"
	^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                         ^

Build error#2:
--
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
	-I./tools/testing/selftests/bpf/ \
	-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
	-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
	-Wno-gnu-variable-sized-type-not-at-end \
	-Wno-address-of-packed-member -Wno-tautological-compare \
	-Wno-unknown-warning-option \
	-O2 -emit-llvm -c samples/bpf/tracex1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/tracex1_kern.o
In file included from samples/bpf/tracex1_kern.c:7:
In file included from ./include/linux/skbuff.h:19:
In file included from ./include/linux/time.h:6:
In file included from ./include/linux/seqlock.h:36:
In file included from ./include/linux/spinlock.h:51:
In file included from ./include/linux/preempt.h:81:
In file included from ./arch/x86/include/asm/preempt.h:7:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
	asm_volatile_goto("1: jmp 6f\n"
	^
./include/linux/compiler-gcc.h:290:42: note: expanded from macro 'asm_volatile_goto'
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
                                         ^

Build error#3:
--
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -Isamples/bpf \
	-I./tools/testing/selftests/bpf/ \
	-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
	-D__TARGET_ARCH_x86 -Wno-compare-distinct-pointer-types \
	-Wno-gnu-variable-sized-type-not-at-end \
	-Wno-address-of-packed-member -Wno-tautological-compare \
	-Wno-unknown-warning-option \
	-O2 -emit-llvm -c samples/bpf/xdp1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/xdp1_kern.o
In file included from samples/bpf/xdp1_kern.c:9:
In file included from ./include/linux/in.h:23:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:13:
In file included from ./include/linux/thread_info.h:38:
In file included from ./arch/x86/include/asm/thread_info.h:53:
./arch/x86/include/asm/cpufeature.h:150:2: error: 'asm goto' constructs are not supported yet
	asm_volatile_goto("1: jmp 6f\n
Re: [PATCH v2 12/17] kvm: arm/arm64: Expose supported physical address limit for VM
On 27 March 2018 at 14:15, Suzuki K Poulose wrote:
> Expose the maximum physical address size supported by the host
> for a VM. This could be later used by the userspace to choose the
> appropriate size for a given VM. The limit is determined as the
> minimum of actual CPU limit, the kernel limit (i.e, either 48 or 52)
> and the stage2 page table support limit (which is 40bits at the moment).
> For backward compatibility, we support a minimum of 40bits. The limit
> will be lifted as we add support for the stage2 to support the host
> kernel PA limit.
>
> This value may be different from what is exposed to the VM via
> CPU ID registers. The limit only applies to the stage2 page table.
>
> Cc: Christoffer Dall
> Cc: Marc Zyngier
> Cc: Peter Maydel
> Signed-off-by: Suzuki K Poulose
> ---
>  Documentation/virtual/kvm/api.txt | 14 ++
>  arch/arm/include/asm/kvm_mmu.h    |  5 +
>  arch/arm64/include/asm/kvm_mmu.h  |  5 +
>  include/uapi/linux/kvm.h          |  6 ++
>  virt/kvm/arm/arm.c                |  6 ++
>  5 files changed, 36 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt
> b/Documentation/virtual/kvm/api.txt
> index 792fa87..55908a8 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -3500,6 +3500,20 @@ Returns: 0 on success; -1 on error
>  This ioctl can be used to unregister the guest memory region registered
>  with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.
>
> +4.113 KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> +Capability: basic
> +Architectures: arm, arm64
> +Type: system ioctl
> +Parameters: none
> +Returns: log2(Maximum physical address space size) supported by the
> +hyperviosr.

typo: "hypervisor".

> +
> +This ioctl can be used to identify the maximum physical address space size
> +supported by the hypervisor.

Is that the physical address space on the host, or the physical
address space size we present to the guest?

> The returned value indicates the maximum size
> +of the address that can be resolved by the stage2 translation table on
> +arm/arm64. On arm64, the value is decided based on the host kernel
> +configuration and the system wide safe value of ID_AA64MMFR0_EL1:PARange.
> +This may not match the value exposed to the VM in CPU ID registers.

Isn't it likely to confuse the guest if we lie to it about the PA range it
sees? When would the two values differ?

Do we also need a 'set' operation, so userspace can create a VM
that has a 40 bit userspace on a CPU that supports more than that,
or does it just work?

What's the x86 API for KVM to tell userspace about physical address
range restrictions?

thanks
-- PMM
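One reading of the limit described in the quoted commit message, sketched as plain arithmetic. This is not the code from the patch: the function and macro names are hypothetical, and the "minimum of 40bits" is modelled here as a floor applied after taking the minimum of the three limits.

```c
#include <assert.h>

/* Backward-compatibility floor quoted in the commit message. */
#define MODEL_MIN_PHYS_SHIFT 40u

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Smallest of the CPU, kernel and stage2 limits (all in address bits),
 * but never less than the 40-bit floor.
 */
static unsigned int model_max_vm_phys_shift(unsigned int cpu_shift,
					    unsigned int kernel_shift,
					    unsigned int stage2_shift)
{
	unsigned int shift;

	assert(cpu_shift <= 64 && kernel_shift <= 64 && stage2_shift <= 64);

	shift = min_u(cpu_shift, min_u(kernel_shift, stage2_shift));
	return shift < MODEL_MIN_PHYS_SHIFT ? MODEL_MIN_PHYS_SHIFT : shift;
}
```

With stage2 capped at 40 bits, as the message says is currently the case, the result is 40 regardless of the other two inputs; the other limits only start to matter once that cap is lifted.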
Re: [RFC tip/locking/lockdep v6 19/20] rcu: Equip sleepable RCU with lockdep dependency graph checks
On Thu, Apr 12, 2018 at 11:12:17AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 12, 2018 at 10:12:33AM +0800, Boqun Feng wrote:
> > A trivial fix/hack would be adding local_irq_disable() and
> > local_irq_enable() around srcu_lock_sync() like:
> >
> > 	static inline void srcu_lock_sync(struct lockdep_map *map)
> > 	{
> > 		local_irq_disable();
> > 		lock_map_acquire(map);
> > 		lock_map_release(map);
> > 		local_irq_enable();
> > 	}
> >
> > However, it might be better, if lockdep could provide some annotation
> > API for such an empty critical section to say the grab-and-drop is
> > atomic. Something like:
> >
> > 	/*
> > 	 * Annotate a wait point for all previous critical section to
> > 	 * go out.
> > 	 *
> > 	 * This won't make @map a irq unsafe lock, no matter it's called
> > 	 * w/ or w/o irq disabled.
> > 	 */
> > 	lock_wait_unlock(struct lockdep_map *map, ..)
> >
> > And in this primitive, we do something similar like
> > lock_acquire()+lock_release(). This primitive could be used elsewhere,
> > as I believe we have several empty grab-and-drop critical section for
> > lockdep annotations, e.g. in start_flush_work().
> >
> > Thoughts?
> >
> > This certainly requires a bit more work, in the meanwhile, I will add
> > another self testcase which has a srcu_read_lock() called in irq.
>
> Yeah, I've never really bothered to clean those things up, but I don't
> see any reason to stop you from doing it ;-)
>
> As to the initial pattern with disabling IRQs, I think I've seen code
> like that before, and in general performance isn't a top priority

Yeah, I saw we used that pattern in del_timer_sync()

> (within reason) when you're running lockdep kernels, so I've usually let
> it be.

Turns out it's not very hard to write a working version of
lock_wait_unlock() ;-) Just call __lock_acquire() and __lock_release()
back-to-back with the @hardirqoff for __lock_acquire() to be 1:

	/*
	 * lock_sync() - synchronize with all previous critical sections to finish.
	 *
	 * Simply an acquire+release annotation with hardirqoff set to true: because
	 * no lock is actually held, this annotation alone is safe to be
	 * interrupted, as if irqs are off.
	 */
	void lock_sync(struct lockdep_map *lock, unsigned subclass, int read,
		       int check, struct lockdep_map *nest_lock, unsigned long ip)
	{
		unsigned long flags;

		if (unlikely(current->lockdep_recursion))
			return;

		raw_local_irq_save(flags);
		check_flags(flags);

		current->lockdep_recursion = 1;
		__lock_acquire(lock, subclass, 0, read, check, 1, nest_lock, ip, 0, 0);

		if (__lock_release(lock, 0, ip))
			check_chain_key(current);

		current->lockdep_recursion = 0;
		raw_local_irq_restore(flags);
	}
	EXPORT_SYMBOL_GPL(lock_sync);

I renamed it to lock_sync() because, most of the time, we use this
annotation for a "sync point" with other critical sections.

We can avoid some overhead if we refactor __lock_acquire() and
__lock_release() with some helper functions, but I think this version is
good enough for now, at least better than disabling IRQs around
lock_map_acquire() + lock_map_release() ;-)

Thoughts?

Regards,
Boqun
[GIT PULL] s390 patches for the 4.17 merge window #2
Hi Linus,

please pull from the 'for-linus' branch of

	git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git for-linus

to receive the following updates:

Three notable larger changes next to the usual bug fixing:

* Update the email addresses in MAINTAINERS for the s390 folks to use
  the simpler linux.ibm.com domain instead of the old linux.vnet.ibm.com

* An update for the zcrypt device driver that removes some old and
  obsolete interfaces and adds support for up to 256 crypto adapters

* A rework of the IPL aka boot code

Harald Freudenberger (6):
      s390/crypto: Adjust s390 aes and paes cipher priorities
      s390/zcrypt: remove unused functions and declarations
      s390/zcrypt: Make ap init functions static.
      s390/zcrypt: Remove deprecated ioctls.
      s390/zcrypt: Remove deprecated zcrypt proc interface.
      s390/zcrypt: Support up to 256 crypto adapters.

Heiko Carstens (2):
      s390/compat: fix setup_frame32
      MAINTAINERS: update s390 maintainers email addresses

Julian Wiedmann (3):
      s390/ccwgroup: require at least one ccw device
      s390/qdio: clear intparm during shutdown
      s390/qdio: lock device while installing IRQ handler

Martin Schwidefsky (1):
      s390: correct nospec auto detection init order

Vasily Gorbik (11):
      s390/ipl: ensure loadparm valid flag is set
      s390/ipl: unite diag308 and scsi boot ipl blocks
      s390/ipl: get rid of ipl_ssid and ipl_devno
      s390/ipl: move ipl_flags to ipl.c
      s390/ipl: rely on diag308 store to get ipl info
      s390/ipl: correct ipl parmblock valid checks
      s390/ipl: avoid adding scpdata to cmdline during ftp/dvd boot
      s390: assume diag308 set always works
      s390/ipl: remove non-existing functions declaration
      s390/ipl: correct kdump reipl block checksum calculation
      s390/ipl: remove reipl_method and dump_method

 MAINTAINERS                           |  34 +--
 arch/s390/boot/compressed/misc.c      |  23 --
 arch/s390/crypto/aes_s390.c           |   8 +-
 arch/s390/crypto/paes_s390.c          |   8 +-
 arch/s390/include/asm/ap.h            |   6 +-
 arch/s390/include/asm/cio.h           |  10 -
 arch/s390/include/asm/ipl.h           |  25 +-
 arch/s390/include/asm/nospec-branch.h |   1 +
 arch/s390/include/asm/reset.h         |  20 --
 arch/s390/include/uapi/asm/zcrypt.h   | 163 ++--
 arch/s390/kernel/compat_signal.c      |   2 +-
 arch/s390/kernel/early.c              |  14 +-
 arch/s390/kernel/ipl.c                | 376 +--
 arch/s390/kernel/machine_kexec.c      |   2 +-
 arch/s390/kernel/nospec-branch.c      |   8 +-
 arch/s390/kernel/reipl.S              |  87 ---
 arch/s390/kernel/relocate_kernel.S    |  54 +---
 arch/s390/kernel/setup.c              |   3 +
 drivers/s390/cio/ccwgroup.c           |   5 +-
 drivers/s390/cio/cio.c                | 257 ---
 drivers/s390/cio/ioasm.c              |  24 --
 drivers/s390/cio/ioasm.h              |   1 -
 drivers/s390/cio/qdio_main.c          |   4 +-
 drivers/s390/cio/qdio_setup.c         |   2 +
 drivers/s390/crypto/ap_bus.c          |  32 +--
 drivers/s390/crypto/ap_bus.h          |   5 +-
 drivers/s390/crypto/ap_debug.h        |   3 -
 drivers/s390/crypto/pkey_api.c        |  41 +--
 drivers/s390/crypto/zcrypt_api.c      | 471 ++
 drivers/s390/crypto/zcrypt_api.h      |  26 +-
 30 files changed, 357 insertions(+), 1358 deletions(-)
 delete mode 100644 arch/s390/include/asm/reset.h
[PATCH RFC 2/8] mm: introduce PG_offline
online_pages()/offline_pages() theoretically allows us to work on sub-section sizes. This is especially relevant in the context of virtualization. It e.g. allows us to add/remove memory to Linux in a VM in 4MB chunks. While the whole section is marked as online/offline, we have to know the state of each page. E.g. to not read memory that is not online during kexec() or to properly mark a section as offline as soon as all contained pages are offline. Signed-off-by: David Hildenbrand --- include/linux/page-flags.h | 10 ++ include/trace/events/mmflags.h | 9 - 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index e34a27727b9a..8ebc4bad7824 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -49,6 +49,9 @@ * PG_hwpoison indicates that a page got corrupted in hardware and contains * data with incorrect ECC bits that triggered a machine check. Accessing is * not safe since it may cause another machine check. Don't touch! + * + * PG_offline indicates that a page is offline and the backing storage + * might already have been removed (virtualization). Don't touch! */ /* @@ -100,6 +103,9 @@ enum pageflags { #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT) PG_young, PG_idle, +#endif +#ifdef CONFIG_MEMORY_HOTPLUG + PG_offline, /* Page is offline. 
Don't touch */ #endif __NR_PAGEFLAGS, @@ -381,6 +387,10 @@ TESTCLEARFLAG(Young, young, PF_ANY) PAGEFLAG(Idle, idle, PF_ANY) #endif +#ifdef CONFIG_MEMORY_HOTPLUG +PAGEFLAG(Offline, offline, PF_ANY) +#endif + /* * On an anonymous page mapped into a user virtual memory area, * page->mapping points to its anon_vma, not to a struct address_space; diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index a81cffb76d89..14c31209e34a 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -79,6 +79,12 @@ #define IF_HAVE_PG_IDLE(flag,string) #endif +#ifdef CONFIG_MEMORY_HOTPLUG +#define IF_HAVE_PG_OFFLINE(flag,string) ,{1UL << flag, string} +#else +#define IF_HAVE_PG_OFFLINE(flag,string) +#endif + #define __def_pageflag_names \ {1UL << PG_locked, "locked"}, \ {1UL << PG_waiters, "waiters" }, \ @@ -104,7 +110,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,"mlocked" ) \ IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \ IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \ IF_HAVE_PG_IDLE(PG_young, "young" ) \ -IF_HAVE_PG_IDLE(PG_idle, "idle" ) +IF_HAVE_PG_IDLE(PG_idle, "idle" ) \ +IF_HAVE_PG_OFFLINE(PG_offline, "offline" ) #define show_page_flags(flags) \ (flags) ? __print_flags(flags, "|", \ -- 2.14.3
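The IF_HAVE_PG_OFFLINE() helper in the mmflags.h hunk relies on a small preprocessor trick: when the option is enabled the macro expands to a comma-prefixed table entry, otherwise it expands to nothing and its arguments are never evaluated. A stand-alone sketch of that pattern follows; every MODEL_* and model_* name is invented for illustration and only mirrors the shape of the kernel macros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toggle standing in for CONFIG_MEMORY_HOTPLUG (hypothetical name). */
#define MODEL_MEMORY_HOTPLUG 1

enum model_pageflags {
	MODEL_PG_locked,
	MODEL_PG_waiters,
#if MODEL_MEMORY_HOTPLUG
	MODEL_PG_offline,
#endif
	MODEL_NR_PAGEFLAGS
};

struct model_flag_name {
	unsigned long mask;
	const char *name;
};

/* Expands to an extra, comma-prefixed entry only when the flag exists. */
#if MODEL_MEMORY_HOTPLUG
#define IF_HAVE_MODEL_OFFLINE(flag, string) ,{1UL << flag, string}
#else
#define IF_HAVE_MODEL_OFFLINE(flag, string)
#endif

#define MODEL_FLAG_NAMES \
	{1UL << MODEL_PG_locked, "locked"}, \
	{1UL << MODEL_PG_waiters, "waiters"} \
IF_HAVE_MODEL_OFFLINE(MODEL_PG_offline, "offline")

static const struct model_flag_name model_flag_names[] = { MODEL_FLAG_NAMES };

/* Number of entries the macro produced for this configuration. */
static size_t model_flag_count(void)
{
	return sizeof(model_flag_names) / sizeof(model_flag_names[0]);
}

/* Return the name for a single-bit mask, or NULL if it has no entry. */
static const char *model_flag_name(unsigned long mask)
{
	assert(mask != 0);

	for (size_t i = 0; i < model_flag_count(); i++)
		if (model_flag_names[i].mask == mask)
			return model_flag_names[i].name;
	return NULL;
}
```

Because the disabled variant drops its arguments unexpanded, the table still compiles when the enum value does not exist, which is why the last regular entry carries no trailing comma.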
[PATCH RFC 1/8] mm/memory_hotplug: Revert "mm/memory_hotplug: optimize memory hotplug"
Conflicts with the possibility to online sub-section chunks. Revert it for now. Signed-off-by: David Hildenbrand --- drivers/base/node.c| 2 -- include/linux/memory.h | 1 - mm/memory_hotplug.c| 27 +++ mm/page_alloc.c| 28 ++-- mm/sparse.c| 8 +--- 5 files changed, 38 insertions(+), 28 deletions(-) diff --git a/drivers/base/node.c b/drivers/base/node.c index 7a3a580821e0..92b00a7e6a02 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -407,8 +407,6 @@ int register_mem_sect_under_node(struct memory_block *mem_blk, int nid, if (!mem_blk) return -EFAULT; - - mem_blk->nid = nid; if (!node_online(nid)) return 0; diff --git a/include/linux/memory.h b/include/linux/memory.h index 31ca3e28b0eb..9f8cd856ca1e 100644 --- a/include/linux/memory.h +++ b/include/linux/memory.h @@ -33,7 +33,6 @@ struct memory_block { void *hw; /* optional pointer to fw/hw data */ int (*phys_callback)(struct memory_block *); struct device dev; - int nid;/* NID for this memory block */ }; int arch_get_memory_phys_device(unsigned long start_pfn); diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index f74826cdceea..d4474781c799 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -250,6 +250,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, struct vmem_altmap *altmap, bool want_memblock) { int ret; + int i; if (pfn_valid(phys_start_pfn)) return -EEXIST; @@ -258,6 +259,23 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, if (ret < 0) return ret; + /* +* Make all the pages reserved so that nobody will stumble over half +* initialized state. +* FIXME: We also have to associate it with a node because page_to_nid +* relies on having page with the proper node. 
+*/ + for (i = 0; i < PAGES_PER_SECTION; i++) { + unsigned long pfn = phys_start_pfn + i; + struct page *page; + if (!pfn_valid(pfn)) + continue; + + page = pfn_to_page(pfn); + set_page_node(page, nid); + SetPageReserved(page); + } + if (!want_memblock) return 0; @@ -891,15 +909,8 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ int nid; int ret; struct memory_notify arg; - struct memory_block *mem; - - /* -* We can't use pfn_to_nid() because nid might be stored in struct page -* which is not yet initialized. Instead, we find nid from memory block. -*/ - mem = find_memory_block(__pfn_to_section(pfn)); - nid = mem->nid; + nid = pfn_to_nid(pfn); /* associate pfn range with the zone */ zone = move_pfn_range(online_type, nid, pfn, nr_pages); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 905db9d7962f..647c8c6dd4d1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1172,9 +1172,10 @@ static void free_one_page(struct zone *zone, } static void __meminit __init_single_page(struct page *page, unsigned long pfn, - unsigned long zone, int nid) + unsigned long zone, int nid, bool zero) { - mm_zero_struct_page(page); + if (zero) + mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); @@ -1188,6 +1189,12 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, #endif } +static void __meminit __init_single_pfn(unsigned long pfn, unsigned long zone, + int nid, bool zero) +{ + return __init_single_page(pfn_to_page(pfn), pfn, zone, nid, zero); +} + #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT static void __meminit init_reserved_page(unsigned long pfn) { @@ -1206,7 +1213,7 @@ static void __meminit init_reserved_page(unsigned long pfn) if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone)) break; } - __init_single_page(pfn_to_page(pfn), pfn, zid, nid); + __init_single_pfn(pfn, zid, nid, true); } #else static inline void init_reserved_page(unsigned long 
pfn) @@ -1523,7 +1530,7 @@ static unsigned long __init deferred_init_pages(int nid, int zid, } else { page++; } - __init_single_page(page, pfn, zid, nid); + __init_single_page(page, pfn, zid, nid, true); nr_pages++; } return (nr_pages); @@ -5460,7 +5467,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, un
[PATCH RFC 3/8] mm: use PG_offline in online/offlining code
Let's mark all offline pages with PG_offline. We'll continue to mark them reserved. Signed-off-by: David Hildenbrand --- drivers/hv/hv_balloon.c | 2 +- mm/memory_hotplug.c | 10 ++ mm/page_alloc.c | 5 - 3 files changed, 11 insertions(+), 6 deletions(-) diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c index b3e9f13f8bc3..04d98d9b6191 100644 --- a/drivers/hv/hv_balloon.c +++ b/drivers/hv/hv_balloon.c @@ -893,7 +893,7 @@ static unsigned long handle_pg_range(unsigned long pg_start, * backed previously) online too. */ if (start_pfn > has->start_pfn && - !PageReserved(pfn_to_page(start_pfn - 1))) + !PageOffline(pfn_to_page(start_pfn - 1))) hv_bring_pgs_online(has, start_pfn, pgs_ol); } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d4474781c799..3a8d56476233 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -260,8 +260,8 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, return ret; /* -* Make all the pages reserved so that nobody will stumble over half -* initialized state. +* Make all the pages offline and reserved so that nobody will stumble +* over half initialized state. * FIXME: We also have to associate it with a node because page_to_nid * relies on having page with the proper node. 
*/ @@ -274,6 +274,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn, page = pfn_to_page(pfn); set_page_node(page, nid); SetPageReserved(page); + SetPageOffline(page); } if (!want_memblock) @@ -669,6 +670,7 @@ EXPORT_SYMBOL_GPL(__online_page_increment_counters); void __online_page_free(struct page *page) { + ClearPageOffline(page); __free_reserved_page(page); } EXPORT_SYMBOL_GPL(__online_page_free); @@ -687,7 +689,7 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages, unsigned long onlined_pages = *(unsigned long *)arg; struct page *page; - if (PageReserved(pfn_to_page(start_pfn))) + if (PageOffline(pfn_to_page(start_pfn))) for (i = 0; i < nr_pages; i++) { page = pfn_to_page(start_pfn + i); (*online_page_callback)(page); @@ -1437,7 +1439,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) } /* - * remove from free_area[] and mark all as Reserved. + * remove from free_area[] and mark all as Reserved and Offline. */ static int offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 647c8c6dd4d1..2e5dcfdb0908 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8030,6 +8030,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) if (unlikely(!PageBuddy(page) && PageHWPoison(page))) { pfn++; SetPageReserved(page); + SetPageOffline(page); continue; } @@ -8043,8 +8044,10 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn) list_del(&page->lru); rmv_page_order(page); zone->free_area[order].nr_free--; - for (i = 0; i < (1 << order); i++) + for (i = 0; i < (1 << order); i++) { SetPageReserved((page+i)); + SetPageOffline(page + i); + } pfn += (1 << order); } spin_unlock_irqrestore(&zone->lock, flags); -- 2.14.3
Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
On Mon, Apr 09, 2018 at 04:43:40PM +0200, Arnd Bergmann wrote:
> On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso wrote:
> > Hi Arnd,
> >
> > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> >> We get a new link error with CONFIG_NFT_REJECT_INET=y and
> >> CONFIG_NF_REJECT_IPV6=m
> >
> > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> > and CONFIG_NF_REJECT_IPV6=m.
> >
> > I mean, just like we do with NFT_FIB_INET.
>
> That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
> again, so that code gets built as a loadable module if
> CONFIG_NF_REJECT_IPV6=m.
>
> > BTW, I think this problem is not related to the recent patch,
> > but something older that kbuild robot has triggered more easily for
> > some reason?
>
> 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
> symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to be
> restricted to a loadable module with IPV6=m, but can now be
> built-in, which causes that link error.

Still one more spin on this: I would like to see if we have a way to fix
this by simplifying things a bit.

Would the one I'm attaching work?

Thanks for your patience.
>From af07bc7ff5d34ce54e7913233912c058e6699e3c Mon Sep 17 00:00:00 2001 From: Pablo Neira Ayuso Date: Fri, 13 Apr 2018 10:48:40 +0200 Subject: [PATCH] netfilter: CONFIG_NF_REJECT_IPV{4,6} becomes bool toggle Arnd reports that we get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules are linked together: net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval': nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6' nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6' The problem is that with NF_TABLES_INET set, we implicitly try to use the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to a loadable module, it's impossible to reach that. This patch fixes this problem by building-in nf_reject_ipv{4,6}.c, IPv6 symbol dependencies for the IPv6 reject infrastructure are located in exthdrs_core.c, ip6_checksum.c and ip6_icmp.c which are also built-in, so let's do the same to simplify this. Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type") Reported-by: Arnd Bergmann Signed-off-by: Pablo Neira Ayuso --- net/ipv4/netfilter/Kconfig | 3 +-- net/ipv6/netfilter/Kconfig | 3 +-- net/netfilter/Kconfig | 2 ++ 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig index 280048e1e395..3e4e0ae2a9a1 100644 --- a/net/ipv4/netfilter/Kconfig +++ b/net/ipv4/netfilter/Kconfig @@ -104,8 +104,7 @@ config NF_LOG_IPV4 select NF_LOG_COMMON config NF_REJECT_IPV4 - tristate "IPv4 packet rejection" - default m if NETFILTER_ADVANCED=n + bool "IPv4 packet rejection" config NF_NAT_IPV4 tristate "IPv4 NAT" diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig index ccbfa83e4bb0..1e5d040a60b8 100644 --- a/net/ipv6/netfilter/Kconfig +++ b/net/ipv6/netfilter/Kconfig @@ -87,8 +87,7 @@ config NF_DUP_IPV6 packet to be rerouted to another destination. 
config NF_REJECT_IPV6 - tristate "IPv6 packet rejection" - default m if NETFILTER_ADVANCED=n + bool "IPv6 packet rejection" config NF_LOG_IPV6 tristate "IPv6 packet logging" diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index 4189f574f5ec..d7b3272fe821 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -609,6 +609,8 @@ config NFT_REJECT config NFT_REJECT_INET depends on NF_TABLES_INET + select NF_REJECT_IPV4 + select NF_REJECT_IPV6 default NFT_REJECT tristate -- 2.11.0
Re: Some minor fixes for perf user tools
On Fri, Apr 06, 2018 at 01:38:08PM -0700, Andi Kleen wrote: > This patchkit fixes some random minor issues in the perf user tools Acked-by: Jiri Olsa thanks, jirka
[PATCH 0/2] tracing/events: block: bring more on a par with blktrace
I had the need to understand I/O request processing in detail.
But I also had the need to enrich block traces with other trace events
including my own dynamic kprobe events. So I preferred block trace events
over blktrace to get everything nicely sorted into one ftrace output.
However, I missed device filtering for (un)plug events and also
the difference between the two flavors of unplug.

The first two patches bring block trace events closer to their
counterpart in blktrace tooling.

The last patch is just an RFC. I still kept it in this patch set because
it is inspired by PATCH 2/2.

Steffen Maier (3):
  tracing/events: block: track and print if unplug was explicit or schedule
  tracing/events: block: dev_t via driver core for plug and unplug events
  tracing/events: block: also try to get dev_t via driver core for some events

 include/trace/events/block.h | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

-- 
2.13.5