Re: [PATCH v7 16/24] i2c: allow adapter drivers to override the adapter locking
> Yes, they look like reasonable complaints. Thanks for fixing them. I just sent out my latest comments and when you fix those and send V8, I'll apply that right away. I think we are safe to fix the rest incrementally if needed. Note that I didn't review the IIO and media patches, I trust the reviewers on those. Thanks for your work on this! I need a break now, this is mind-boggling...
Re: [PATCH] [RFC] x86: work around MPX Erratum
On Tue, May 3, 2016 at 2:31 PM, Andy Lutomirski wrote: > > Having actually read the erratum: how can this affect Linux at all > under any scenario where user code hasn't already completely > compromised the kernel? If it matches purely on linear address, you will potentially have interesting situations with people running in virtualized environments and crashing programs in other virtual containers or the host. I do agree that we likely don't care all that much (especially with the lack of users in the first place). But considering that the trivial workaround is to make sure SMEP is enabled - which we want to heavily encourage anyway - I also don't think there's any real downside to just enforcing that workaround. Linus
Re: [PATCH v7 16/24] i2c: allow adapter drivers to override the adapter locking
On Wed, Apr 20, 2016 at 05:17:56PM +0200, Peter Rosin wrote: > Add i2c_lock_bus() and i2c_unlock_bus(), which call the new lock_bus and > unlock_bus ops in the adapter. These funcs/ops take an additional flags > argument that indicates for what purpose the adapter is locked. > > There are two flags, I2C_LOCK_ADAPTER and I2C_LOCK_SEGMENT, but they are > both implemented the same. For now. Locking the adapter means that the > whole bus is locked, locking the segment means that only the current bus > segment is locked (i.e. i2c traffic on the parent side of the mux is still > allowed even if the child side of the mux is locked). > > Also support a trylock_bus op (but no function to call it, as it is not > expected to be needed outside of the i2c core). > > Implement i2c_lock_adapter/i2c_unlock_adapter in terms of the new locking > scheme (i.e. lock with the I2C_LOCK_ADAPTER flag). > > Annotate some of the locking with explicit I2C_LOCK_SEGMENT flags. Can you explain a little why it is SEGMENT and not ADAPTER here? That probably makes it easier to get into this patch. And to double check my understanding: I was surprised to not see any i2c_lock_adapter() or I2C_LOCK_ADAPTER in action. This is because muxes call I2C_LOCK_SEGMENT on their parent which in case of the parent being the root adapter is essentially the same as I2C_LOCK_ADAPTER. Correct? 
> > Signed-off-by: Peter Rosin > --- > drivers/i2c/i2c-core.c | 46 -- > include/linux/i2c.h | 28 ++-- > 2 files changed, 54 insertions(+), 20 deletions(-) > > diff --git a/drivers/i2c/i2c-core.c b/drivers/i2c/i2c-core.c > index 0f2f8484e8ec..21f46d011c33 100644 > --- a/drivers/i2c/i2c-core.c > +++ b/drivers/i2c/i2c-core.c > @@ -960,10 +960,12 @@ static int i2c_check_addr_busy(struct i2c_adapter > *adapter, int addr) > } > > /** > - * i2c_lock_adapter - Get exclusive access to an I2C bus segment > + * i2c_adapter_lock_bus - Get exclusive access to an I2C bus segment > * @adapter: Target I2C bus segment > + * @flags: I2C_LOCK_ADAPTER locks the root i2c adapter, I2C_LOCK_SEGMENT > + * locks only this branch in the adapter tree > */ I think this kerneldoc should be moved to i2c_lock_adapter and/or i2c_lock_bus() which are now in i2c.h. This is what users will use, not this static, adapter-specific implementation. I think it is enough to have a comment here explaining what is special in handling adapters. Thanks, Wolfram
Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
On Tue, May 03, 2016 at 11:19:46PM +0200, Christoph Hellwig wrote: > Hi Bjorn, > > I've implemented your suggestion and I'm getting ready to send out > a new version. One thing that came to mind is: do you prefer this > code in irq.c or would you rather have it in msi.c? While it > also has a legacy irq fallback most of it tied pretty closely to > the msi.c code, so I wonder if we should group them together. Good question. There isn't much in irq.c, and the interesting bits are the MSI-related things, so maybe msi.c would make more sense. Bjorn
Re: VDSO unmap and remap support for additional architectures
On 04/29/2016 09:55 AM, Dmitry Safonov wrote: > On 04/29/2016 04:22 PM, Christopher Covington wrote: >> On 04/28/2016 02:53 PM, Andy Lutomirski wrote: >>> Also, at some point, possibly quite soon, x86 will want a way for >>> user code to ask the kernel to map a specific vdso variant at a specific >>> address. Could we perhaps add a new pair of syscalls: >>> >>> struct vdso_info { >>> unsigned long space_needed_before; >>> unsigned long space_needed_after; >>> unsigned long alignment; >>> }; >>> >>> long vdso_get_info(unsigned int vdso_type, struct vdso_info *info); >>> >>> long vdso_remap(unsigned int vdso_type, unsigned long addr, unsigned >>> int flags); >>> >>> #define VDSO_X86_I386 0 >>> #define VDSO_X86_64 1 >>> #define VDSO_X86_X32 2 >>> // etc. >>> >>> vdso_remap will map the vdso of the chosen type such that >>> AT_SYSINFO_EHDR lines up with addr. It will use up to >>> space_needed_before bytes before that address and space_needed_after >>> after that address. It will also unmap the old vdso (or maybe only do >>> that if some flag is set). >>> >>> On x86, mremap is *not* sufficient for everything that's needed, >>> because some programs will need to change the vdso type. >> I don't think I understand. Why can't people just exec() the ELF type that >> corresponds to the VDSO they want? > > I may say about my needs in it: to not lose all the existing > information in application. > Imagine you're restoring a container with 64-bit and 32-bit > applications (in compatible mode). So you need somehow > switch vdso type in restorer for a 32-bit application. > Yes, you may exec() and then - all already restored application > properties will get lost. You will need to transfer information > about mappings, make protocol between restorer binary > and main criu application, finally you'll end up with some > really much more difficult architecture than it is now. > And it'll be slower. 
Perhaps a more modest exec based strategy would be for x86_64 criu to handle all of the x86_64 restores as usual but exec i386 and/or x32 criu service daemons and use them for restoring applications needing those ABIs. Regards, Christopher Covington -- Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [RFC PATCH] livepatch: allow removal of a disabled patch
On Mon, May 02, 2016 at 01:57:22PM +0200, Miroslav Benes wrote: > 1. Do we really need a completion? If I am not missing something > kobject_del() always waits for sysfs callers to leave thanks to kernfs > active protection. What do you mean by "kernfs active protection"? I see that kernfs_remove() gets the kernfs_mutex lock, but I can't find anywhere that a write to a sysfs file uses that lock. I'm probably missing something... -- Josh
Re: [PATCH] fix infoleak in wireless
On Tue, May 03, 2016 at 05:11:07PM -0400, Kangjie Lu wrote: > Oops, I did not notice the patch is not attached. > > From 34a82a734388d07eb10f91770f86938e38f7575a Mon Sep 17 00:00:00 2001 > From: Kangjie Lu > Date: Tue, 3 May 2016 14:15:18 -0400 > Subject: [PATCH] fix infoleak in wireless > MIME-Version: 1.0 > Content-Type: text/plain; charset=UTF-8 > Content-Transfer-Encoding: 8bit > > The 6-byte array “mac_addr” is not initialized in the dump_station > implementations of “drivers/staging/wilc1000/wilc_wfi_cfgoperations.c” > and “drivers/staging/rtl8723au/os_dep/ioctl_cfg80211.c”, so all 6 > bytes may be leaked. > > Signed-off-by: Kangjie Lu > --- > net/wireless/nl80211.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c > index 056a730..2e92d14 100644 > --- a/net/wireless/nl80211.c > +++ b/net/wireless/nl80211.c > @@ -3905,6 +3905,7 @@ static int nl80211_dump_station(struct sk_buff *skb, > > while (1) { > memset(&sinfo, 0, sizeof(sinfo)); > + eth_zero_addr(mac_addr); > err = rdev_dump_station(rdev, wdev->netdev, sta_idx, > mac_addr, &sinfo); > if (err == -ENOENT) Patch is corrupted :( Why not fix up the staging drivers, they are the real problem here, which is what I think the networking maintainers were telling you to do. thanks, greg k-h
Re: [PATCH] [RFC] x86: work around MPX Erratum
On Tue, May 3, 2016 at 2:28 PM, Dave Hansen wrote: >> >> So we won't init MPX on those... > > Yes, and as long as such a processor doesn't exist today and never > exists in the future or the folks that buy such a processor truly don't > care about MPX, that's fine to do. I'm just a bit nervous about the > whole "never exists in the future" part. I don't think we need to care. If a CPU doesn't support SMEP, there really is no reason for us to ever support MPX either. It's not like there is a clamoring for MPX support in the first place (have you ever heard of anybody actually asking for or using it?), and quite frankly, it's a _lot_ more complicated from a CPU core side than SMEP (which is trivial). So if Intel ever releases a CPU with MPX but without SMEP, nobody will ever care about the MPX part being useless. Linus
Re: [PATCH] [RFC] x86: work around MPX Erratum
On May 3, 2016 2:05 PM, "Dave Hansen" wrote: > > On 05/02/2016 11:43 PM, Ingo Molnar wrote: > >> > +static int is_mpx_affected_microarch(struct cpuinfo_x86 *c) > >> > +{ > >> > + /* Only family 6 is affected */ > >> > + if (c->x86 != 0x6) > >> > + return 0; > >> > + > >> > + /* We know these Atom models are unaffected, for sure */ > >> > + switch (c->x86_model) { > >> > + case 0x5F: /* "Future Intel Atom ... Goldmont */ > >> > + case 0x5C: /* "Future Intel Atom ... Goldmont */ > >> > + return 0; > >> > + } > >> > + /* > >> > + * We will get here on future unknown processors and all > >> > + * Core/Xeons. They might be unaffected Atoms or > >> > + * affected Core/Xeons. Be conservative and assume > >> > + * processor is affected. > >> > + * > >> > + * Once the complete list of Core/Xeon models is known > >> > + * it can be added here, and the Atom list removed. > >> > + */ > >> > + return 1; > > So instead of trying to sort out the erratum, could we not just generally > > make MPX > > dependent on SMEP and be done with it? MPX is a sophisticated security > > feature, > > and it makes little sense to not do SMEP if you have it available. > > > > Anyone who is absolutely desperate to disable SMEP while enabling MPX is > > free to > > step in and make his case. > > My concern was not necessarily with folks booting with 'nosmep', but > with processors that have MPX present and SMEP fused off (or made > unavailable by a hypervisor) and which are unaffected by this issue. > > People would have to be very careful to never create a processor which > did not have SMEP but did have MPX, since MPX would effectively be > unusable on such a processor. > > Having actually read the erratum: how can this affect Linux at all under any scenario where user code hasn't already completely compromised the kernel? I.e. why do we care about this erratum?
Re: [PATCH] [RFC] x86: work around MPX Erratum
On 05/03/2016 02:12 PM, Borislav Petkov wrote: > On Tue, May 03, 2016 at 02:04:40PM -0700, Dave Hansen wrote: >> My concern was not necessarily with folks booting with 'nosmep', but > > Btw, does anything speak for even keeping that 'nosmep' thing? Generally, I'm not sure we need the no$foo options at all. There's always "clearcpuid=" which does the same thing. It just requires you to go look up the X86_FEATURE_* bit first. >> with processors that have MPX present and SMEP fused off (or made >> unavailable by a hypervisor) and which are unaffected by this issue. > > So we won't init MPX on those... Yes, and as long as such a processor doesn't exist today and never exists in the future or the folks that buy such a processor truly don't care about MPX, that's fine to do. I'm just a bit nervous about the whole "never exists in the future" part. >> People would have to be very careful to never create a processor which >> did not have SMEP but did have MPX, since MPX would effectively be >> unusable on such a processor. > > We can disable that combination in qemu too, right? What do you mean by disable? Have qemu error out if MPX and SMEP aren't disabled in concert with each other?
Re: [PATCH v7 1/6] Shared library support
On Tue, 3 May 2016 11:00:56 +0900 Masahiro Yamada wrote: Hi, > # Compile .c file, create position independent .o file > # host-cxxshobjs -> .o > quiet_cmd_host-cxxshobjs = HOSTCXX -fPIC $@ > cmd_host-cxxshobjs = $(HOSTCXX) $(hostcxx_flags) -fPIC -c -o $@ $< >$(host-cxxshobjs): $(obj)/%.o: $(src)/%.c FORCE >$(call if_changed_dep,host-cxxshobjs) > > > > > We generally use HOSTCC to compile *.c files, > and HOSTCXX to compile *.cc files. > > > But, here, you mention to use HOSTCXX to compile .c files > such as cyc_complexity_plugin.c, sancov_plugin.c, etc. > > This is not straight-forward. It is worthwhile to comment the reason. I wrote a comment about it here: https://github.com/ephox-gcc-plugins/gcc-plugins_linux-next/commit/74f6343a7f13c071e00c417332051e25f15009ea -- Emese
Re: [PATCH 6/8] pci: provide sensible irq vector alloc/free routines
Hi Bjorn, I've implemented your suggestion and I'm getting ready to send out a new version. One thing that came to mind is: do you prefer this code in irq.c or would you rather have it in msi.c? While it also has a legacy irq fallback most of it tied pretty closely to the msi.c code, so I wonder if we should group them together.
Re: [patch] usb: dwc3: gadget: fix mask and shift order in DWC3_DCFG_NUMP()
On Tue, May 03, 2016 at 10:53:05AM +0300, Felipe Balbi wrote: > > Hi, > > Dan Carpenter writes: > > In the original DWC3_DCFG_NUMP() was always zero. It looks like the > > intent was to shift first and then do the mask. > > > > Fixes: 2a58f9c12bb3 ('usb: dwc3: gadget: disable automatic calculation of > > ACK TP NUMP') > > Signed-off-by: Dan Carpenter > > Thanks for this fix, I completely missed it :-) Greg, if you want to > take this as a patch, that's fine by me. Otherwise I can queue it for > -rc1. In any case, S-o-B below: > > Signed-off-by: Felipe Balbi I'll take it, thanks. greg k-h
Re: [PATCH] usb: hub: fix panic caused by NULL bos pointer during reset device
On Wed, Apr 27, 2016 at 09:35:57AM -0400, Tony Battersby wrote: > On 04/26/2016 10:53 PM, Du, Changbin wrote: > >> On Tue, Mar 08, 2016 at 05:15:17PM +0800, changbin...@intel.com wrote: > >>> From: "Du, Changbin" > >>> > >>> This is a reworked patch based on reverted commit d8f00cd685f5 ("usb: > >>> hub: do not clear BOS field during reset device"). > >>> > >>> The previous one caused double mem-free if run to re_enumerate label. > >>> New patch title changed to distinguish from old one. And I have tested > >>> it with memory debugging options. > >>> > >>> In function usb_reset_and_verify_device, the old BOS descriptor may > >>> still be used before allocating a new one. (usb_disable_lpm function > >>> uses it under the situation that it fails at usb_disable_link_state.) > >>> So we cannot set the udev->bos to NULL before that, just keep what it > >>> was. It will be overwritten when allocating a new one. > >>> > >>> How to reproduce: > >>> 1. connect one usb3 hub to xhci port. > >>> 2. connect several lpm-capable super-speed usb disk to the hub. > >>> 3. copy big files to the usb disks. > >>> 4. disconnect the hub and repeat step 1-4. > >>> > >>> Crash log: > >>> BUG: unable to handle kernel NULL pointer dereference at > >>> 0010 > >>> IP: [] usb_enable_link_state+0x2d/0x2f0 > >>> Call Trace: > >>> [] ? usb_set_lpm_timeout+0x12b/0x140 > >>> [] usb_enable_lpm+0x81/0xa0 > >>> [] usb_disable_lpm+0xa8/0xc0 > >>> [] usb_unlocked_disable_lpm+0x2c/0x50 > >>> [] usb_reset_and_verify_device+0xc3/0x710 > >>> [] ? usb_sg_wait+0x13d/0x190 > >>> [] usb_reset_device+0x133/0x280 > >>> [] usb_stor_port_reset+0x61/0x70 > >>> [] usb_stor_invoke_transport+0x88/0x520 > >>> > >>> Signed-off-by: Du, Changbin > >>> --- > >>> drivers/usb/core/hub.c | 14 +- > >>> 1 file changed, 9 insertions(+), 5 deletions(-) > >> Is this patch still needed? I thought we had some other fix in this > >> area... > >> > >> confused, > >> > >> greg k-h > >> > > Hi, Greg k-h, > > Sorry for it confused you. 
This patch is still needed. This is the same fix as > previous commit d8f00cd685f5 ("usb: hub: do not clear BOS field > during reset device"). But d8f00cd685f5 was buggy and reverted. This > new patch should be the final fix. > > Best Regards, > Du, Changbin > > I think Greg is referring to commit 464ad8c43a9e ("usb: core : hub: Fix > BOS 'NULL pointer' kernel panic"), which has already been applied > upstream. It looks to me like that patch might have fixed the same > problem in a different way, in which case Changbin's patch is not > needed. But I haven't been involved in developing or testing that > patch, so I can't say for sure. At the very least, 464ad8c43a9e > conflicts with Changbin's patch. > > Changbin, can you take a look at 464ad8c43a9e and see if that fixes the > same problem that your patch did? Given the lack of response here, I've dropped this from my queue. If it's still needed, someone must resend it. thanks, greg k-h
Re: [PATCH v7 1/6] Shared library support
On Mon, 2 May 2016 14:03:00 +0900 Masahiro Yamada wrote:

Hi,

> > diff --git a/scripts/Makefile.clean b/scripts/Makefile.clean
> > index 55c96cb..e4e88ab 100644
> > --- a/scripts/Makefile.clean
> > +++ b/scripts/Makefile.clean
> > @@ -38,7 +38,8 @@ subdir-ymn := $(addprefix $(obj)/,$(subdir-ymn))
> > __clean-files := $(extra-y) $(extra-m) $(extra-) \
> >    $(always) $(targets) $(clean-files) \
> >    $(host-progs) \
> > -  $(hostprogs-y) $(hostprogs-m) $(hostprogs-)
> > +  $(hostprogs-y) $(hostprogs-m) $(hostprogs-) \
> > +  $(hostlibs-y) $(hostlibs-m) $(hostlibs-)
>
> $(hostcxxlibs-y) $(hostcxxlibs-m) is missing here.
>
> BTW, are you planning to support hostlibs without the .so extension in the future?
>
> You are changing the top Makefile to clean with a "*.so" pattern rule.
> Do you need to change both the top Makefile and Makefile.clean
> for belt-and-braces cleaning?

It doesn't delete the *.so files without the hunk from the top Makefile. The mrproper and distclean targets remove them with this patch:
https://github.com/ephox-gcc-plugins/gcc-plugins_linux-next/commit/f23b7eb247ccc9f1007e53313af343980dd52591

But I don't think this is the best solution, much like doing it from the top Makefile.

--
Emese
Re: [PATCH] usb: hub: fix panic caused by NULL bos pointer during reset device
On Wed, Apr 27, 2016 at 09:35:57AM -0400, Tony Battersby wrote:
> On 04/26/2016 10:53 PM, Du, Changbin wrote:
> >> On Tue, Mar 08, 2016 at 05:15:17PM +0800, changbin...@intel.com wrote:
> >>> From: "Du, Changbin"
> >>>
> >>> This is a reworked patch based on reverted commit d8f00cd685f5 ("usb:
> >>> hub: do not clear BOS field during reset device").
> >>>
> >>> The previous one caused a double mem-free if it ran to the re_enumerate
> >>> label. The new patch title is changed to distinguish it from the old one.
> >>> And I have tested it with memory debugging options.
> >>>
> >>> In the function usb_reset_and_verify_device, the old BOS descriptor may
> >>> still be used before allocating a new one. (The usb_disable_lpm function
> >>> uses it when it fails at usb_disable_link_state.) So we cannot set
> >>> udev->bos to NULL before that; just keep what it was. It will be
> >>> overwritten when allocating a new one.
> >>>
> >>> How to reproduce:
> >>> 1. connect one usb3 hub to an xhci port.
> >>> 2. connect several lpm-capable super-speed usb disks to the hub.
> >>> 3. copy big files to the usb disks.
> >>> 4. disconnect the hub and repeat steps 1-4.
> >>>
> >>> Crash log:
> >>> BUG: unable to handle kernel NULL pointer dereference at 0010
> >>> IP: [] usb_enable_link_state+0x2d/0x2f0
> >>> Call Trace:
> >>> [] ? usb_set_lpm_timeout+0x12b/0x140
> >>> [] usb_enable_lpm+0x81/0xa0
> >>> [] usb_disable_lpm+0xa8/0xc0
> >>> [] usb_unlocked_disable_lpm+0x2c/0x50
> >>> [] usb_reset_and_verify_device+0xc3/0x710
> >>> [] ? usb_sg_wait+0x13d/0x190
> >>> [] usb_reset_device+0x133/0x280
> >>> [] usb_stor_port_reset+0x61/0x70
> >>> [] usb_stor_invoke_transport+0x88/0x520
> >>>
> >>> Signed-off-by: Du, Changbin
> >>> ---
> >>> drivers/usb/core/hub.c | 14 +-
> >>> 1 file changed, 9 insertions(+), 5 deletions(-)
> >> Is this patch still needed? I thought we had some other fix in this
> >> area...
> >>
> >> confused,
> >>
> >> greg k-h
> >>
> > Hi, Greg k-h,
> > Sorry for the confusion. This patch is still needed. It is the same fix
> > as the previous commit d8f00cd685f5 ("usb: hub: do not clear BOS field
> > during reset device"), but d8f00cd685f5 was buggy and was reverted. This
> > new patch should be the final fix.
> >
> > Best Regards,
> > Du, Changbin
> >
>
> I think Greg is referring to commit 464ad8c43a9e ("usb: core : hub: Fix
> BOS 'NULL pointer' kernel panic"), which has already been applied
> upstream. It looks to me like that patch might have fixed the same
> problem in a different way, in which case Changbin's patch is not
> needed. But I haven't been involved in developing or testing that
> patch, so I can't say for sure. At the very least, 464ad8c43a9e
> conflicts with Changbin's patch.
>
> Changbin, can you take a look at 464ad8c43a9e and see if that fixes the
> same problem that your patch did?

Given the lack of response here, I've dropped this from my queue. If it's still needed, someone must resend it.

thanks,

greg k-h
Re: [PATCH] perf/sdt: Directly record cached SDT events
On 05/03/2016 06:05 AM, Masami Hiramatsu wrote: On Tue, 03 May 2016 05:06:24 +0530 Hemant Kumar wrote: Hi Masami, On 04/30/2016 06:06 PM, Masami Hiramatsu wrote: Hi Hemant, On Fri, 29 Apr 2016 19:10:41 +0530 Hemant Kumar wrote: This patch adds support for directly recording SDT events which are present in the probe cache. This patch is based on the current SDT enablement patchset (v5) by Masami: https://lkml.org/lkml/2016/4/27/828 and it implements two points in the TODO list mentioned in the cover note: "- (perf record) Support SDT event recording directly" "- (perf record) Try to unregister SDT events after record." Without this patch, we could probe into SDT events using "perf probe" and "perf record". With this patch, we can probe the SDT events directly using "perf record". Thanks! However, before looking over each part of this patch, I think this is not enough for supporting SDT for perf record. Hmm. If there are several SDTs which have the same event name but different addresses (e.g. libc:memory_memalign_retry), how are those handled? Currently, to support this, we'll need to enable those events under different names, or just pick one of them. It could confuse users in either case. Right. But now, it's the same case with a binary having multiple symbols with the same name, isn't it? Yes, but for the symbols or lines etc., the user cannot directly specify it via perf record. And as you showed below, perf-probe expresses that there are 2 events on the probe point. So the user is forced to be aware of it. Right. 
# nm ./multi | grep foo 00400530 t foo 00400560 t foo # perf probe -x ./multi foo Added new events: probe_multi:foo (on foo in /home/hemant/work/linux/tools/perf/multi) probe_multi:foo_1 (on foo in /home/hemant/work/linux/tools/perf/multi) You can now use it in all perf tools, such as: perf record -e probe_multi:foo_1 -aR sleep 1 My point being, the user can still know, if it's shown that there are two or more probes being placed and the o/p of perf report/script shows that the probes are placed at two or more different addresses. Not only the different addresses, but also they will see the different event names. That may be no good for making a script on it. My point is, if the user only uses "perf record -e sdt_something:sdtevent", they will think that there is one event recorded. It can easily mislead them. Ok. Makes sense. With a warning message then, we can make the user aware in this case. To solve this issue, we need to introduce multiple SDTs on a single ftrace event. Please read my comment on the v3 patch (https://lkml.org/lkml/2015/8/15/52) Ok. But, I think, for initial direct recording support, we can go with this IMHO. So, at least this should be noticed to users carefully. (e.g. warn if there are more than two SDTs defined) Ok. I have made the changes and also added a warning message if the user tries to record on an SDT event which has multiple occurrences with the same event and group name. I have sent a v2 for this patch. -- Thanks, Hemant Kumar
Re: [REGRESSION] asix: Lots of asix_rx_fixup() errors and slow transmissions
On 03/05/16 15:42, David B. Robins wrote: I don't think the first one is giving you problems (except as triggered by the second) but I had concerns about the second myself (and emailed the author off-list, but received no reply), and we did not take that commit for our own product.

Sorry, I might have missed your original E-mail.

Specifically, the second change, 3f30... (original patch: https://www.mail-archive.com/netdev@vger.kernel.org/msg80720.html) (1) appears to do the exact opposite of what it claims, i.e., instead of "resync if this looks like a header", it does "resync if this does NOT look like a (packet) header", where "looks like a header" means "bits 0-10 (size) are equal to the bitwise-NOT of bits 16-26", and (2) can happen by coincidence for 1/2048 32-bit values starting a continuation URB (easy to hit dealing with large volumes of video data as we were). It appears to expect the header for every URB whereas the rest of the code at least expects it only once per network packet (look at following code that only reads it for remaining == 0).

David, I think that your interpretation is incorrect. Please see below. Here is the code snippet from the patch with my annotations between # #; I will try to explain my intentions. Feel free to point out any flaws:

if (rx->remaining && (rx->remaining + sizeof(u32) <= skb->len)) {
	# Only runs when rx->remaining != 0 and the end of the Ethernet frame + next 32-bit header word is within the URB buffer. #
	# Therefore, this code does not run when the end of an Ethernet frame has been reached in the previous URB #
	# or when the end of the Ethernet frame + next 32-bit header word will be in a later URB buffer #

	# offset is an index to the expected next 32-bit header word after the end of the Ethernet frame #
	offset = ((rx->remaining + 1) & 0xfffe) + sizeof(u32);
	# rx->header contains the expected 32-bit header value corrected for endianness and alignment #
	rx->header = get_unaligned_le32(skb->data + offset);
	offset = 0;
	# check the data integrity of the size value from the header word #
	size = (u16)(rx->header & 0x7ff);
	# if the size value fails the integrity check then we are not looking at a valid header word, so #
	# synchronisation has been lost #
	if (size != ((~rx->header >> 16) & 0x7ff)) {
		netdev_err(dev->net, "asix_rx_fixup() Data Header synchronisation was lost, remaining %d\n",
			   rx->remaining);
		if (rx->ax_skb) {
			kfree_skb(rx->ax_skb);
			rx->ax_skb = NULL;
			/* Discard the incomplete netdev Ethernet frame
			 * and assume the Data header is at the start of
			 * the current URB socket buffer.
			 */
		}
		rx->remaining = 0;
	}
}

So that change made no sense to me, but I don't have significant kernel dev experience. Effectively it will drop/truncate every (2047/2048) split (longer than an URB) packet, and report an error for the second URB and then again for treating said second URB as a first URB for a packet. I would expect your problems will go away just removing the second change. You could also change the != to == in "if (size != ...)" but then you'd still have 1/2048 (depending on data patterns) false positives.

The code only runs when the Ethernet frame spans across URBs and is checking that the next 32-bit header word is present and valid. Upon loss of synchronisation, the strategy is to assume that the 32-bit header is at the start of the URB buffer. 
Obviously, that might not be true every time but it is the most likely location, especially when Ethernet frames are not spanning URBs at that point in time. Looking at the error messages: [ 239.037310] asix 1-1.1:1.0 eth0: asix_rx_fixup() Bad Header Length 0x54ebb5ec, offset 4 The offset 4 means that the 32-bit header word was invalid at the start of the URB buffer. This could be a consequence of data synchronisation being lost; however, we would expect the timestamps between the error messages of "synchronisation was lost" and "Bad Header Length" to be very close as they would be consecutive URBs. The evidence is showing 10ms gaps, which does not suggest consecutive URBs. In other words, an Ethernet frame should not be spanned over a time gap of 10ms as that would be very inefficient. If that were true then there would be a USB communications problem with the USB to Ethernet adaptor, which I hope is not true. [ 239.027993] asix 1-1.1:1.0 eth0: asix_rx_fixup() Data Header synchronisation was lost, remaining 988 This error message consistently shows the remaining value to be 988, at least for the 3 examples provided by John. This does not suggest a random failure unless there are other examples of an error message with a non-988 remaining value. 988 is well within an Ethernet frame
[PATCH v2] perf/sdt: Directly record SDT events
This patch adds support for directly recording SDT events which are present in the probe cache. This patch is based on the current SDT enablement patchset (v5) by Masami:
https://lkml.org/lkml/2016/4/27/828
and it implements two points in the TODO list mentioned in the cover note:
"- (perf record) Support SDT event recording directly"
"- (perf record) Try to unregister SDT events after record."

Without this patch, we could probe into SDT events using "perf probe" and "perf record". With this patch, we can probe the SDT events directly using "perf record". For example:

# perf list sdt		// List the SDT events
...
  sdt_mysql:update__row__done	[SDT event]
  sdt_mysql:update__row__start	[SDT event]
  sdt_mysql:update__start	[SDT event]
  sdt_python:function__entry	[SDT event]
  sdt_python:function__return	[SDT event]
  sdt_test:marker1		[SDT event]
  sdt_test:marker2		[SDT event]
...

# perf record -e %sdt_test:marker1 -e %sdt_test:marker2 -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.087 MB perf.data (22 samples) ]

# perf script
test_sdt 29230 [002] 405550.548017: sdt_test:marker1: (400534)
test_sdt 29230 [002] 405550.548064: sdt_test:marker2: (40053f)
test_sdt 29231 [002] 405550.962806: sdt_test:marker1: (400534)
test_sdt 29231 [002] 405550.962841: sdt_test:marker2: (40053f)
test_sdt 29232 [001] 405551.379327: sdt_test:marker1: (400534)
...
Recording on SDT events with the same provider and marker names is also supported:

# readelf -n ./test_sdt | grep test
 Provider: test
 Provider: test
 Provider: test
# readelf -n ./test_sdt | grep -A2 test
 Provider: test
 Name: marker1
 Location: 0x0040053e, Base: 0x00400607, Semaphore: 0x
--
 Provider: test
 Name: marker1
 Location: 0x00400545, Base: 0x00400607, Semaphore: 0x
--
 Provider: test
 Name: marker1
 Location: 0x00400550, Base: 0x00400607, Semaphore: 0x

# perf record -e %sdt_test:marker1 -a
Warning : Recording on 3 occurences of sdt_test:marker1
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.208 MB perf.data (12 samples) ]

# perf script
...
test_sdt 1445 [001] 578689.792146: sdt_test:marker1_1: (400545)
test_sdt 1445 [001] 578689.792168: sdt_test:marker1_2: (400550)
test_sdt 1445 [001] 578689.792170: sdt_test:marker1: (40053e)
test_sdt 1454 [000] 578690.436117: sdt_test:marker1_1: (400545)
...

After invoking "perf record", behind the scenes, it checks whether the event specified is an SDT event using the '%' flag. After that, it does a lookup of the probe cache to find the SDT event. If it's not present, it throws an error. Otherwise, it goes on and writes the event into the uprobe_events file, sets up the probe event, trace events, etc., and starts recording. It also maintains a list of the event names that were written to the uprobe_events file. After finishing the record session, it removes the events from the uprobe_events file using the maintained name list.

Signed-off-by: Hemant Kumar
---
Changes since v1:
 - Added support for recording on multiple SDT events with same names.
 - Added support to show a warning if multiple SDT events with same names are present.
 - Used del_perf_probe_events() to delete the SDT events after recording is done.
 - Moved function remove_sdt_event_list() to util/probe-event.c
 - Added a function parse_sdt_event() to parse an SDT event from a string.
 - Used find_cached_events_all() to find SDT events.
 - Used group:event as a strfilter to delete the SDT events from the
   uprobe_events file.
(Thanks Masami for the above suggestions!)

 tools/perf/builtin-record.c    | 21 +
 tools/perf/perf.h              |  2 +
 tools/perf/util/parse-events.c | 53 +-
 tools/perf/util/parse-events.h |  1 +
 tools/perf/util/probe-event.c  | 40 -
 tools/perf/util/probe-event.h  |  4 ++
 tools/perf/util/probe-file.c   | 99 ++
 tools/perf/util/probe-file.h   |  8
 8 files changed, 226 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 515510e..104eafe 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -34,6 +34,7 @@
 #include "util/llvm-utils.h"
 #include "util/bpf-loader.h"
 #include "asm/bug.h"
+#include "util/probe-file.h"
 #include
 #include
@@ -56,6 +57,7 @@ struct record {
	bool no_buildid_cache_set;
	bool buildid_all;
	unsigned long long samples;
Re: [PATCH] [RFC] x86: work around MPX Erratum
On Tue, May 03, 2016 at 02:04:40PM -0700, Dave Hansen wrote: > My concern was not necessarily with folks booting with 'nosmep', but Btw, does anything speak for even keeping that 'nosmep' thing? > with processors that have MPX present and SMEP fused off (or made > unavailable by a hypervisor) and which are unaffected by this issue. So we won't init MPX on those... > People would have to be very careful to never create a processor which > did not have SMEP but did have MPX, since MPX would effectively be > unusable on such a processor. We can disable that combination in qemu too, right? -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
[PATCH 7/7] mm: Batch unmapping of pages that are in swap cache
We created a new function __remove_swap_mapping_batch that allows all pages under the same swap partition to be removed from the swap cache's mapping in a single acquisition of the mapping's tree lock. This reduces the contention on the lock when multiple threads are reclaiming memory by swapping to the same swap partition. The handle_pgout_batch function is updated so all the pages under the same swap partition are unmapped together when the have been paged out. Signed-off-by: Tim Chen--- mm/vmscan.c | 426 1 file changed, 286 insertions(+), 140 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 9fc04e1..5e4b8ce 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -690,6 +690,103 @@ cannot_free: return 0; } +/* use this only for swap mapped pages */ +static void __remove_swap_mapping_batch(struct page *pages[], + bool reclaimed, short ret[], int nr) +{ + unsigned long flags; + struct page *page; + swp_entry_t swap[SWAP_BATCH]; + struct address_space *mapping; + + int i, batch_size; + + if (nr <= 0) + return; + + while (nr) { + mapping = page_mapping(pages[0]); + BUG_ON(!mapping); + + batch_size = min(nr, SWAP_BATCH); + + spin_lock_irqsave(>tree_lock, flags); + for (i = 0; i < batch_size; ++i) { + page = pages[i]; + + BUG_ON(!PageLocked(page)); + BUG_ON(!PageSwapCache(page)); + BUG_ON(mapping != page_mapping(page)); + + /* stop batching if mapping changes */ + if (mapping != page_mapping(page)) { + batch_size = i; + break; + } + /* + * The non racy check for a busy page. + * + * Must be careful with the order of the tests. When someone has + * a ref to the page, it may be possible that they dirty it then + * drop the reference. 
So if PageDirty is tested before page_count + * here, then the following race may occur: + * + * get_user_pages(); + * [user mapping goes away] + * write_to(page); + * !PageDirty(page) [good] + * SetPageDirty(page); + * put_page(page); + * !page_count(page) [good, discard it] + * + * [oops, our write_to data is lost] + * + * Reversing the order of the tests ensures such a situation cannot + * escape unnoticed. The smp_rmb is needed to ensure the page->flags + * load is not satisfied before that of page->_count. + * + * Note that if SetPageDirty is always performed via set_page_dirty, + * and thus under tree_lock, then this ordering is not required. + */ + if (!page_ref_freeze(page, 2)) + goto cannot_free; + /* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */ + if (unlikely(PageDirty(page))) { + page_ref_unfreeze(page, 2); + goto cannot_free; + } + + swap[i].val = page_private(page); + __delete_from_swap_cache(page); + + ret[i] = 1; + continue; + +cannot_free: + ret[i] = 0; + } + spin_unlock_irqrestore(&mapping->tree_lock, flags); + + /* need to keep irq off for mem_cgroup accounting, don't restore flags yet */ + local_irq_disable(); + for (i = 0; i < batch_size; ++i) { + if (ret[i]) { + page = pages[i]; + mem_cgroup_swapout(page, swap[i]); + } + } + local_irq_enable(); + + for (i = 0; i < batch_size; ++i) { + if (ret[i]) + swapcache_free(swap[i]); + } + /* advance to next batch */ + pages += batch_size; + ret += batch_size; + nr -= batch_size; + } +} /* * Attempt to detach a locked page from its ->mapping. If it is dirty or if * someone else has a
Re: [PATCH] Staging: dgnc: fix coding style in dgnc_tty.c
On Tue, May 03, 2016 at 08:37:27PM +0200, Patryk Mezydlo wrote: > Fix checkpatch.pl warning about 'line over 80 characters'. > I just split line with function. > > Signed-off-by: Patryk Mezydlo> --- > drivers/staging/dgnc/dgnc_tty.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/staging/dgnc/dgnc_tty.c b/drivers/staging/dgnc/dgnc_tty.c > index 9aa0ba0..b818b0a 100644 > --- a/drivers/staging/dgnc/dgnc_tty.c > +++ b/drivers/staging/dgnc/dgnc_tty.c > @@ -186,7 +186,8 @@ int dgnc_tty_register(struct dgnc_board *brd) > if (IS_ERR(brd->serial_driver)) > return PTR_ERR(brd->serial_driver); > > - snprintf(brd->serial_name, MAXTTYNAMELEN, "tty_dgnc_%d_", > brd->boardnum); > + snprintf(brd->serial_name, MAXTTYNAMELEN, > + "tty_dgnc_%d_", brd->boardnum); Why move the string to a new line? It didn't need to be there, did it?
Re: [PATCH] [RFC] x86: work around MPX Erratum
On 05/02/2016 11:43 PM, Ingo Molnar wrote: >> > +static int is_mpx_affected_microarch(struct cpuinfo_x86 *c) >> > +{ >> > + /* Only family 6 is affected */ >> > + if (c->x86 != 0x6) >> > + return 0; >> > + >> > + /* We know these Atom models are unaffected, for sure */ >> > + switch (c->x86_model) { >> > + case 0x5F: /* "Future Intel Atom ... Goldmont */ >> > + case 0x5C: /* "Future Intel Atom ... Goldmont */ >> > + return 0; >> > + } >> > + /* >> > + * We will get here on future unknown processors and all >> > + * Core/Xeons. They might be unaffected Atoms or >> > + * affected Core/Xeons. Be conservative and assume >> > + * processor is affected. >> > + * >> > + * Once the complete list of Core/Xeon models is known >> > + * it can be added here, and the Atom list removed. >> > + */ >> > + return 1; > So instead of trying to sort out the erratum, could we not just generally > make MPX > dependent on SMEP and be done with it? MPX is a sophisticated security > feature, > and it makes little sense to not do SMEP if you have it available. > > Anyone who is absolutely desperate to disable SMEP while enabling MPX is free > to > step in and make his case. My concern was not necessarily with folks booting with 'nosmep', but with processors that have MPX present and SMEP fused off (or made unavailable by a hypervisor) and which are unaffected by this issue. People would have to be very careful to never create a processor which did not have SMEP but did have MPX, since MPX would effectively be unusable on such a processor.
[PATCH 6/7] mm: Cleanup - Reorganize code to group handling of page
In this patch, we reorganize the paging operations so the paging operations of pages to the same swap device can be grouped together. This prepares for the next patch that removes multiple pages from the same swap cache together once they have been paged out. The patch creates a new function handle_pgout_batch that takes the code of handle_pgout and puts a loop around it to handle multiple pages. Signed-off-by: Tim Chen --- mm/vmscan.c | 338 +++- 1 file changed, 196 insertions(+), 142 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index fab61f1..9fc04e1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -884,154 +884,218 @@ enum pg_result { PG_UNKNOWN, }; -static enum pg_result handle_pgout(struct list_head *page_list, +static void handle_pgout_batch(struct list_head *page_list, struct zone *zone, struct scan_control *sc, enum ttu_flags ttu_flags, enum page_references references, bool may_enter_fs, bool lazyfree, - int *swap_ret, - struct page *page) + struct page *pages[], + int swap_ret[], + int ret[], + int nr) { struct address_space *mapping; + struct page *page; + int i; - mapping = page_mapping(page); + for (i = 0; i < nr; ++i) { + page = pages[i]; + mapping = page_mapping(page); - /* - * The page is mapped into the page tables of one or more - * processes. Try to unmap it here. - */ - if (page_mapped(page) && mapping) { - switch (*swap_ret = try_to_unmap(page, lazyfree ? 
- (ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) : - (ttu_flags | TTU_BATCH_FLUSH))) { - case SWAP_FAIL: - return PG_ACTIVATE_LOCKED; - case SWAP_AGAIN: - return PG_KEEP_LOCKED; - case SWAP_MLOCK: - return PG_MLOCKED; - case SWAP_LZFREE: - goto lazyfree; - case SWAP_SUCCESS: - ; /* try to free the page below */ + /* check outcome of cache addition */ + if (!ret[i]) { + ret[i] = PG_ACTIVATE_LOCKED; + continue; } - } - - if (PageDirty(page)) { /* - * Only kswapd can writeback filesystem pages to - * avoid risk of stack overflow but only writeback - * if many dirty pages have been encountered. + * The page is mapped into the page tables of one or more + * processes. Try to unmap it here. */ - if (page_is_file_cache(page) && - (!current_is_kswapd() || - !test_bit(ZONE_DIRTY, >flags))) { + if (page_mapped(page) && mapping) { + switch (swap_ret[i] = try_to_unmap(page, lazyfree ? + (ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) : + (ttu_flags | TTU_BATCH_FLUSH))) { + case SWAP_FAIL: + ret[i] = PG_ACTIVATE_LOCKED; + continue; + case SWAP_AGAIN: + ret[i] = PG_KEEP_LOCKED; + continue; + case SWAP_MLOCK: + ret[i] = PG_MLOCKED; + continue; + case SWAP_LZFREE: + goto lazyfree; + case SWAP_SUCCESS: + ; /* try to free the page below */ + } + } + + if (PageDirty(page)) { /* - * Immediately reclaim when written back. - * Similar in principal to deactivate_page() - * except we already have the page isolated - * and know it's dirty + * Only kswapd can writeback filesystem pages to + * avoid risk of stack overflow but only writeback + * if many dirty pages have been encountered. */ - inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE); - SetPageReclaim(page); - - return PG_KEEP_LOCKED; - } + if (page_is_file_cache(page) && + (!current_is_kswapd() || + !test_bit(ZONE_DIRTY, >flags))) { + /* + * Immediately reclaim when written back. +
[PATCH 3/7] mm: Add new functions to allocate swap slots in batches
Currently, the swap slots have to be allocated one page at a time, causing contention to the swap_info lock protecting the swap partition on every page being swapped. This patch adds new functions get_swap_pages and scan_swap_map_slots to request multiple swap slots at once. This will reduce the lock contention on the swap_info lock as we only need to acquire the lock once to get multiple slots. Also scan_swap_map_slots can operate more efficiently as swap slots often occurs in clusters close to each other on a swap device and it is quicker to allocate them together. Multiple swap slots can also be freed in one shot with new function swapcache_free_entries, that further reduce contention on the swap_info lock. Signed-off-by: Tim Chen--- include/linux/swap.h | 27 +-- mm/swap_state.c | 23 +++--- mm/swapfile.c| 215 +-- mm/vmscan.c | 2 +- 4 files changed, 228 insertions(+), 39 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 2b83359..da6d994 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -23,6 +23,7 @@ struct bio; #define SWAP_FLAG_DISCARD 0x1 /* enable discard for swap */ #define SWAP_FLAG_DISCARD_ONCE 0x2 /* discard swap area at swapon-time */ #define SWAP_FLAG_DISCARD_PAGES 0x4 /* discard page-clusters after use */ +#define SWAP_BATCH 64 #define SWAP_FLAGS_VALID (SWAP_FLAG_PRIO_MASK | SWAP_FLAG_PREFER | \ SWAP_FLAG_DISCARD | SWAP_FLAG_DISCARD_ONCE | \ @@ -370,7 +371,8 @@ extern struct address_space swapper_spaces[]; #define swap_address_space(entry) (_spaces[swp_type(entry)]) extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); -extern int add_to_swap(struct page *, struct list_head *list); +extern int add_to_swap(struct page *, struct list_head *list, + swp_entry_t *entry); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); extern void __delete_from_swap_cache(struct page *); @@ -403,6 +405,7 @@ static 
inline long get_nr_swap_pages(void) extern void si_swapinfo(struct sysinfo *); extern swp_entry_t get_swap_page(void); +extern int get_swap_pages(int n, swp_entry_t swp_entries[]); extern swp_entry_t get_swap_page_of_type(int); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); @@ -410,6 +413,7 @@ extern int swap_duplicate(swp_entry_t); extern int swapcache_prepare(swp_entry_t); extern void swap_free(swp_entry_t); extern void swapcache_free(swp_entry_t); +extern void swapcache_free_entries(swp_entry_t *entries, int n); extern int free_swap_and_cache(swp_entry_t); extern int swap_type_of(dev_t, sector_t, struct block_device **); extern unsigned int count_swap_pages(int, int); @@ -429,7 +433,6 @@ struct backing_dev_info; #define total_swap_pages 0L #define total_swapcache_pages()0UL #define vm_swap_full() 0 - #define si_swapinfo(val) \ do { (val)->freeswap = (val)->totalswap = 0; } while (0) /* only sparc can not include linux/pagemap.h in this file @@ -451,6 +454,21 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) return 0; } +static inline int add_to_swap(struct page *page, struct list_head *list, + swp_entry_t *entry) +{ + return 0; +} + +static inline int get_swap_pages(int n, swp_entry_t swp_entries[]) +{ + return 0; +} + +static inline void swapcache_free_entries(swp_entry_t *entries, int n) +{ +} + static inline void swap_shmem_alloc(swp_entry_t swp) { } @@ -484,11 +502,6 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp) return NULL; } -static inline int add_to_swap(struct page *page, struct list_head *list) -{ - return 0; -} - static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 366ce35..bad02c1 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -154,30 +154,35 @@ void __delete_from_swap_cache(struct page *page) /** * add_to_swap - allocate swap space for a page * 
@page: page we want to move to swap + * @entry: swap entry that we have pre-allocated * * Allocate swap space for the page and add the page to the * swap cache. Caller needs to hold the page lock. */ -int add_to_swap(struct page *page, struct list_head *list) +int add_to_swap(struct page *page, struct list_head *list, swp_entry_t *entry) { - swp_entry_t entry; int err; + swp_entry_t ent; VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(!PageUptodate(page), page); - entry = get_swap_page(); -
[PATCH 5/7] mm: Batch addition of pages to swap cache
When a page is to be swapped, it needs to be added to the swap cache and then removed after the paging has been completed. A swap partition's mapping tree lock is acquired for each anonymous page's addition to the swap cache. This patch creates new functions add_to_swap_batch and __add_to_swap_cache_batch that allow multiple pages destined for the same swap partition to be added to that swap partition's swap cache in one acquisition of the mapping tree lock. These functions extend the original add_to_swap and __add_to_swap_cache. This reduces the contention on the swap partition's mapping tree lock when we are actively reclaiming memory and swapping pages. Signed-off-by: Tim Chen --- include/linux/swap.h | 2 + mm/swap_state.c | 248 +-- mm/vmscan.c | 19 ++-- 3 files changed, 196 insertions(+), 73 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index da6d994..cd06f2a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -373,6 +373,8 @@ extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); extern int add_to_swap(struct page *, struct list_head *list, swp_entry_t *entry); +extern void add_to_swap_batch(struct page *pages[], struct list_head *list, + swp_entry_t entries[], int ret_codes[], int nr); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); extern void __delete_from_swap_cache(struct page *); diff --git a/mm/swap_state.c b/mm/swap_state.c index bad02c1..ce02024 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -72,49 +72,94 @@ void show_swap_cache_info(void) printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); } -/* - * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, - * but sets SwapCache flag and private instead of mapping and index. 
- */ -int __add_to_swap_cache(struct page *page, swp_entry_t entry) +void __add_to_swap_cache_batch(struct page *pages[], swp_entry_t entries[], + int ret[], int nr) { - int error; + int error, i; struct address_space *address_space; + struct address_space *prev; + struct page *page; + swp_entry_t entry; - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(PageSwapCache(page), page); - VM_BUG_ON_PAGE(!PageSwapBacked(page), page); + prev = NULL; + address_space = NULL; + for (i = 0; i < nr; ++i) { + /* error at pre-processing stage, swap entry already released */ + if (ret[i] == -ENOENT) + continue; - get_page(page); - SetPageSwapCache(page); - set_page_private(page, entry.val); + page = pages[i]; + entry = entries[i]; - address_space = swap_address_space(entry); - spin_lock_irq(_space->tree_lock); - error = radix_tree_insert(_space->page_tree, - entry.val, page); - if (likely(!error)) { - address_space->nrpages++; - __inc_zone_page_state(page, NR_FILE_PAGES); - INC_CACHE_INFO(add_total); - } - spin_unlock_irq(_space->tree_lock); + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageSwapCache(page), page); + VM_BUG_ON_PAGE(!PageSwapBacked(page), page); - if (unlikely(error)) { - /* - * Only the context which have set SWAP_HAS_CACHE flag - * would call add_to_swap_cache(). - * So add_to_swap_cache() doesn't returns -EEXIST. 
- */ - VM_BUG_ON(error == -EEXIST); - set_page_private(page, 0UL); - ClearPageSwapCache(page); - put_page(page); + get_page(page); + SetPageSwapCache(page); + set_page_private(page, entry.val); + + address_space = swap_address_space(entry); + if (prev != address_space) { + if (prev) + spin_unlock_irq(>tree_lock); + spin_lock_irq(_space->tree_lock); + } + error = radix_tree_insert(_space->page_tree, + entry.val, page); + if (likely(!error)) { + address_space->nrpages++; + __inc_zone_page_state(page, NR_FILE_PAGES); + INC_CACHE_INFO(add_total); + } + + if (unlikely(error)) { + spin_unlock_irq(_space->tree_lock); + address_space = NULL; + /* + * Only the context which have set SWAP_HAS_CACHE
[PATCH 5/7] mm: Batch addtion of pages to swap cache
When a page is to be swapped, it needed to be added to the swap cache and then removed after the paging has been completed. A swap partition's mapping tree lock is acquired for each anonymous page's addition to the swap cache. This patch created new functions add_to_swap_batch and __add_to_swap_cache_batch that allows multiple pages destinied for the same swap partition to be added to that swap partition's swap cache in one acquisition of the mapping tree lock. These functions extend the original add_to_swap and __add_to_swap_cache. This reduces the contention of the swap partition's mapping tree lock when we are actively reclaiming memory and swapping pages Signed-off-by: Tim Chen --- include/linux/swap.h | 2 + mm/swap_state.c | 248 +-- mm/vmscan.c | 19 ++-- 3 files changed, 196 insertions(+), 73 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index da6d994..cd06f2a 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -373,6 +373,8 @@ extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); extern int add_to_swap(struct page *, struct list_head *list, swp_entry_t *entry); +extern void add_to_swap_batch(struct page *pages[], struct list_head *list, + swp_entry_t entries[], int ret_codes[], int nr); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern int __add_to_swap_cache(struct page *page, swp_entry_t entry); extern void __delete_from_swap_cache(struct page *); diff --git a/mm/swap_state.c b/mm/swap_state.c index bad02c1..ce02024 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -72,49 +72,94 @@ void show_swap_cache_info(void) printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); } -/* - * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, - * but sets SwapCache flag and private instead of mapping and index. 
- */ -int __add_to_swap_cache(struct page *page, swp_entry_t entry) +void __add_to_swap_cache_batch(struct page *pages[], swp_entry_t entries[], + int ret[], int nr) { - int error; + int error, i; struct address_space *address_space; + struct address_space *prev; + struct page *page; + swp_entry_t entry; - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(PageSwapCache(page), page); - VM_BUG_ON_PAGE(!PageSwapBacked(page), page); + prev = NULL; + address_space = NULL; + for (i = 0; i < nr; ++i) { + /* error at pre-processing stage, swap entry already released */ + if (ret[i] == -ENOENT) + continue; - get_page(page); - SetPageSwapCache(page); - set_page_private(page, entry.val); + page = pages[i]; + entry = entries[i]; - address_space = swap_address_space(entry); - spin_lock_irq(_space->tree_lock); - error = radix_tree_insert(_space->page_tree, - entry.val, page); - if (likely(!error)) { - address_space->nrpages++; - __inc_zone_page_state(page, NR_FILE_PAGES); - INC_CACHE_INFO(add_total); - } - spin_unlock_irq(_space->tree_lock); + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageSwapCache(page), page); + VM_BUG_ON_PAGE(!PageSwapBacked(page), page); - if (unlikely(error)) { - /* - * Only the context which have set SWAP_HAS_CACHE flag - * would call add_to_swap_cache(). - * So add_to_swap_cache() doesn't returns -EEXIST. 
- */ - VM_BUG_ON(error == -EEXIST); - set_page_private(page, 0UL); - ClearPageSwapCache(page); - put_page(page); + get_page(page); + SetPageSwapCache(page); + set_page_private(page, entry.val); + + address_space = swap_address_space(entry); + if (prev != address_space) { + if (prev) + spin_unlock_irq(&prev->tree_lock); + spin_lock_irq(&address_space->tree_lock); + } + error = radix_tree_insert(&address_space->page_tree, + entry.val, page); + if (likely(!error)) { + address_space->nrpages++; + __inc_zone_page_state(page, NR_FILE_PAGES); + INC_CACHE_INFO(add_total); + } + + if (unlikely(error)) { + spin_unlock_irq(&address_space->tree_lock); + address_space = NULL; + /* + * Only the context which have set SWAP_HAS_CACHE flag + *
[PATCH 4/7] mm: Shrink page list batch allocates swap slots for page swapping
In shrink_page_list, we take advantage of the bulk allocation of swap entries with the new get_swap_pages function. This reduces contention on a swap device's swap_info lock. When memory is low and the system is actively trying to reclaim memory, both the direct reclaim path and kswapd contend on this lock when they access the same swap partition. Signed-off-by: Tim Chen --- mm/vmscan.c | 63 - 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e36d8a7..310e2b2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1096,38 +1096,59 @@ static unsigned long shrink_anon_page_list(struct list_head *page_list, { unsigned long nr_reclaimed = 0; enum pg_result pg_dispose; + swp_entry_t swp_entries[SWAP_BATCH]; + struct page *page; + int m, i, k; while (n > 0) { - struct page *page; int swap_ret = SWAP_SUCCESS; - --n; - if (list_empty(swap_pages)) - return nr_reclaimed; + m = get_swap_pages(n, swp_entries); + if (!m) + goto no_swap_slots; + n -= m; + for (i = 0; i < m; ++i) { + if (list_empty(swap_pages)) { + /* free any leftover swap slots */ + for (k = i; k < m; ++k) + swapcache_free(swp_entries[k]); + return nr_reclaimed; + } + page = lru_to_page(swap_pages); - page = lru_to_page(swap_pages); + list_del(&page->lru); - list_del(&page->lru); + /* + * Anonymous process memory has backing store? + * Try to allocate it some swap space here. + */ + + if (!add_to_swap(page, page_list, NULL)) { + pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, + &nr_reclaimed, pgactivate, + ret_pages, free_pages); + continue; + } - /* - * Anonymous process memory has backing store? - * Try to allocate it some swap space here. 
+ */ + if (clean) + pg_dispose = handle_pgout(page_list, zone, sc, + ttu_flags, PAGEREF_RECLAIM_CLEAN, + true, true, &swap_ret, page); + else + pg_dispose = handle_pgout(page_list, zone, sc, + ttu_flags, PAGEREF_RECLAIM, + true, true, &swap_ret, page); - if (!add_to_swap(page, page_list, NULL)) { - pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, &nr_reclaimed, + pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed, pgactivate, ret_pages, free_pages); - continue; } + } + return nr_reclaimed; - if (clean) - pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags, - PAGEREF_RECLAIM_CLEAN, true, true, &swap_ret, page); - else - pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags, - PAGEREF_RECLAIM, true, true, &swap_ret, page); - - pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed, +no_swap_slots: + while (!list_empty(swap_pages)) { + page = lru_to_page(swap_pages); + pg_finish(page, PG_ACTIVATE_LOCKED, 0, &nr_reclaimed, + pgactivate, ret_pages, free_pages); } return nr_reclaimed; -- 2.5.5
[PATCH 2/7] mm: Group the processing of anonymous pages to be swapped in shrink_page_list
This is a clean up patch to reorganize the processing of anonymous pages in shrink_page_list. We delay the processing of swapping anonymous pages in shrink_page_list and put them together on a separate list. This prepares for batching of pages to be swapped. The processing of the list of anonymous pages to be swapped is consolidated in the function shrink_anon_page_list. Functionally, there is no change in the logic of how pages are processed, just the order of processing of the anonymous pages and file mapped pages in shrink_page_list. Signed-off-by: Tim Chen --- mm/vmscan.c | 82 + 1 file changed, 77 insertions(+), 5 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5542005..132ba02 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1083,6 +1083,58 @@ static void pg_finish(struct page *page, } } +static unsigned long shrink_anon_page_list(struct list_head *page_list, + struct zone *zone, + struct scan_control *sc, + struct list_head *swap_pages, + struct list_head *ret_pages, + struct list_head *free_pages, + enum ttu_flags ttu_flags, + int *pgactivate, + int n, + bool clean) +{ + unsigned long nr_reclaimed = 0; + enum pg_result pg_dispose; + + while (n > 0) { + struct page *page; + int swap_ret = SWAP_SUCCESS; + + --n; + if (list_empty(swap_pages)) + return nr_reclaimed; + + page = lru_to_page(swap_pages); + + list_del(&page->lru); + + /* + * Anonymous process memory has backing store? + * Try to allocate it some swap space here. 
+ */ + + if (!add_to_swap(page, page_list)) { + pg_finish(page, PG_ACTIVATE_LOCKED, swap_ret, &nr_reclaimed, + pgactivate, ret_pages, free_pages); + continue; + } + + if (clean) + pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags, + PAGEREF_RECLAIM_CLEAN, true, true, &swap_ret, page); + else + pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags, + PAGEREF_RECLAIM, true, true, &swap_ret, page); + + pg_finish(page, pg_dispose, swap_ret, &nr_reclaimed, + pgactivate, ret_pages, free_pages); + } + return nr_reclaimed; +} + + + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -1099,6 +1151,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); + LIST_HEAD(swap_pages); + LIST_HEAD(swap_pages_clean); int pgactivate = 0; unsigned long nr_unqueued_dirty = 0; unsigned long nr_dirty = 0; @@ -1106,6 +1160,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, unsigned long nr_reclaimed = 0; unsigned long nr_writeback = 0; unsigned long nr_immediate = 0; + unsigned long nr_swap = 0; + unsigned long nr_swap_clean = 0; cond_resched(); @@ -1271,12 +1327,17 @@ static unsigned long shrink_page_list(struct list_head *page_list, pg_dispose = PG_KEEP_LOCKED; goto finish; } - if (!add_to_swap(page, page_list)) { - pg_dispose = PG_ACTIVATE_LOCKED; - goto finish; + if (references == PAGEREF_RECLAIM_CLEAN) { + list_add(&page->lru, &swap_pages_clean); + ++nr_swap_clean; + } else { + list_add(&page->lru, &swap_pages); + ++nr_swap; } - lazyfree = true; - may_enter_fs = 1; + + pg_dispose = PG_NEXT; + goto finish; + } pg_dispose = handle_pgout(page_list, zone, sc, ttu_flags, @@ -1288,6 +1349,17 @@ finish: } + nr_reclaimed += shrink_anon_page_list(page_list, zone, sc, + &swap_pages_clean, &ret_pages, + &free_pages, ttu_flags, + &pgactivate, nr_swap_clean, + true); + nr_reclaimed += shrink_anon_page_list(page_list, zone, sc, + &swap_pages, &ret_pages, + &free_pages, ttu_flags, + &pgactivate, nr_swap, + false); +
[PATCH 1/7] mm: Cleanup - Reorganize the shrink_page_list code into smaller functions
This patch prepares the code for being able to batch the anonymous pages to be swapped out. It reorganizes the shrink_page_list function with 2 new functions: handle_pgout and pg_finish. The paging operation in shrink_page_list is consolidated into the handle_pgout function. After we have scanned a page in shrink_page_list and completed any paging, the final disposition and clean up of the page is consolidated into pg_finish. The designated disposition of the page from page scanning in shrink_page_list is marked with one of the designations in pg_result. This is a clean up patch and there is no change in functionality or logic of the code. Signed-off-by: Tim Chen --- mm/vmscan.c | 429 ++-- 1 file changed, 246 insertions(+), 183 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b934223e..5542005 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -873,6 +873,216 @@ static void page_check_dirty_writeback(struct page *page, mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } +enum pg_result { + PG_SPECULATIVE_REF, + PG_FREE, + PG_MLOCKED, + PG_ACTIVATE_LOCKED, + PG_KEEP_LOCKED, + PG_KEEP, + PG_NEXT, + PG_UNKNOWN, +}; + +static enum pg_result handle_pgout(struct list_head *page_list, + struct zone *zone, + struct scan_control *sc, + enum ttu_flags ttu_flags, + enum page_references references, + bool may_enter_fs, + bool lazyfree, + int *swap_ret, + struct page *page) +{ + struct address_space *mapping; + + mapping = page_mapping(page); + + /* + * The page is mapped into the page tables of one or more + * processes. Try to unmap it here. + */ + if (page_mapped(page) && mapping) { + switch (*swap_ret = try_to_unmap(page, lazyfree ? 
+ (ttu_flags | TTU_BATCH_FLUSH | TTU_LZFREE) : + (ttu_flags | TTU_BATCH_FLUSH))) { + case SWAP_FAIL: + return PG_ACTIVATE_LOCKED; + case SWAP_AGAIN: + return PG_KEEP_LOCKED; + case SWAP_MLOCK: + return PG_MLOCKED; + case SWAP_LZFREE: + goto lazyfree; + case SWAP_SUCCESS: + ; /* try to free the page below */ + } + } + + if (PageDirty(page)) { + /* + * Only kswapd can writeback filesystem pages to + * avoid risk of stack overflow but only writeback + * if many dirty pages have been encountered. + */ + if (page_is_file_cache(page) && + (!current_is_kswapd() || + !test_bit(ZONE_DIRTY, &zone->flags))) { + /* + * Immediately reclaim when written back. + * Similar in principal to deactivate_page() + * except we already have the page isolated + * and know it's dirty + */ + inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE); + SetPageReclaim(page); + + return PG_KEEP_LOCKED; + } + + if (references == PAGEREF_RECLAIM_CLEAN) + return PG_KEEP_LOCKED; + if (!may_enter_fs) + return PG_KEEP_LOCKED; + if (!sc->may_writepage) + return PG_KEEP_LOCKED; + + /* + * Page is dirty. Flush the TLB if a writable entry + * potentially exists to avoid CPU writes after IO + * starts and then write it out here. + */ + try_to_unmap_flush_dirty(); + switch (pageout(page, mapping, sc)) { + case PAGE_KEEP: + return PG_KEEP_LOCKED; + case PAGE_ACTIVATE: + return PG_ACTIVATE_LOCKED; + case PAGE_SUCCESS: + if (PageWriteback(page)) + return PG_KEEP; + if (PageDirty(page)) + return PG_KEEP; + + /* + * A synchronous write - probably a ramdisk. Go + * ahead and try to reclaim the page. + */ + if (!trylock_page(page)) + return PG_KEEP; + if (PageDirty(page) || PageWriteback(page)) + return PG_KEEP_LOCKED; + mapping = page_mapping(page); + case PAGE_CLEAN: + ; /* try to free the page below */ + } + } + +
[PATCH 0/7] mm: Improve swap path scalability with batched operations
The page swap out path is not scalable due to the numerous locks acquired and released along the way, which are all executed on a page by page basis, e.g.: 1. The acquisition of the mapping tree lock in swap cache when adding a page to swap cache, and then again when deleting a page from swap cache after it has been swapped out. 2. The acquisition of the lock on the swap device to allocate a swap slot for a page to be swapped out. With the advent of high speed block devices that are several orders of magnitude faster than the old spinning disks, these bottlenecks become fairly significant, especially on server class machines with many threads running. To reduce these locking costs, this patch series attempts to batch the pages for the following operations needed for swap: 1. Allocate swap slots in large batches, so locks on the swap device don't need to be acquired as often. 2. Add anonymous pages to the swap cache for the same swap device in batches, so the mapping tree lock can be acquired less often. 3. Delete pages from the swap cache also in batches. We experimented with the effect of these patches. We set up N threads to access memory in excess of memory capacity, causing swap. In experiments using a single pmem based fast block device on a 2 socket machine, we saw that for 1 thread, there is a ~25% increase in swap throughput and for 16 threads, the swap throughput increases by ~85%, when compared with the vanilla kernel. Batching helps even for 1 thread because of contention with kswapd when doing direct memory reclaim. Feedback and reviews on this patch series are much appreciated. Thanks. 
Tim Tim Chen (7): mm: Cleanup - Reorganize the shrink_page_list code into smaller functions mm: Group the processing of anonymous pages to be swapped in shrink_page_list mm: Add new functions to allocate swap slots in batches mm: Shrink page list batch allocates swap slots for page swapping mm: Batch addition of pages to swap cache mm: Cleanup - Reorganize code to group handling of page mm: Batch unmapping of pages that are in swap cache include/linux/swap.h | 29 ++- mm/swap_state.c | 253 +- mm/swapfile.c | 215 +-- mm/vmscan.c | 725 ++- 4 files changed, 945 insertions(+), 277 deletions(-) -- 2.5.5
Re: [PATCH] fix infoleak in wireless
A: No. Q: Should I include quotations after my reply? http://daringfireball.net/2007/07/on_top On Tue, May 03, 2016 at 04:47:16PM -0400, Kangjie Lu wrote: > Hi Greg, > > Could you please take a look at this issue. > mac_addr is not initialized is some implementations of dump_station(), which > can be exploited by attackers for leaking information. You are going to have to give me more context here... Like the patch itself? thanks, greg k-h
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Hi Jamie, On 05/03/2016 01:35 PM, David Howells wrote: > (cc'ing Tadeusz as he did the pkcs1 padding function) > > Jamie Heilman wrote: > Problem loading in-kernel X.509 certificate (-2) >>> >>> ENOENT? Hmmm... The only place that is generated is in the crypto layer. >>> That suggests missing crypto of some sort. >>> >>> The attached patch enables some debugging in some relevant files if you can >>> try applying it to your kernel. >> >> Alrighty, presumably relevant bits: >> >> X.509: Cert Issuer: Build time autogenerated kernel key >> X.509: Cert Subject: Build time autogenerated kernel key >> X.509: Cert Key Algo: rsa >> X.509: Cert Valid period: 1461826791-4615426791 >> X.509: Cert Signature: rsa + sha512 >> X.509: ==>x509_check_signature() >> X.509: ==>x509_get_sig_params() >> X.509: <==x509_get_sig_params() = 0 >> PKEY: ==>public_key_verify_signature() >> X.509: Cert Verification: -2 > > Hmmm... Okay, the only ways out of public_key_verify_signature() without > printing a leaving message are for snprintf() to overrun (which would return > error -22) or for crypto_alloc_akcipher() to have failed; everything else must > go through the kleave() at the pr_devel() at the bottom of the function. > > Can you stick: > > pr_devel("ALGO: %s\n", alg_name); > > immediately before this line: > > tfm = crypto_alloc_akcipher(alg_name, 0, 0); > > and try it again? > Could you please check if this one fixes the problem for you: https://patchwork.kernel.org/patch/8766361/ Thanks, -- TS
Re: [PATCH 00/42] v7: separate operations from flags in the bio/request structs
mchri...@redhat.com writes: > The following patches begin to cleanup the request->cmd_flags and > bio->bi_rw mess. We currently use cmd_flags to specify the operation, > attributes and state of the request. For bi_rw we use it for similar > info and also the priority but then also have another bi_flags field > for state. At some point, we abused them so much we just made cmd_flags > 64 bits, so we could add more. > > The following patches seperate the operation (read, write discard, > flush, etc) from cmd_flags/bi_rw. > > This patchset was made against linux-next from today April 15 > (git tag next-20160415). > > I put a git tree here: > https://github.com/mikechristie/linux-kernel.git > The patches are in the op branch. Hi, Mike, That git tree doesn't seem to exist. I did manage to apply your patch set on top of next-20160415, though. So... what testing did you do? ;-) I ran into the following problems: - git clone fails - yum segfaults - many blktrace/blkparse issues, including incorrect cpu recorded in traces, null task names, and blkparse outputting nothing for a trace file several gigabytes in size. After that, I decided to back out your patches and test the base linux-next kernel. That kernel has none of those issues. So, either I'm missing some dependencies, or I think we've got some issues to iron out before this thing goes in. Before I dig any further, am I missing something? Cheers, Jeff
[PATCH] fix infoleak in rtnetlink
The stack object “map” has a total size of 32 bytes. Its last 4 bytes are padding generated by the compiler. These padding bytes are not initialized and are sent out via “nla_put”. Signed-off-by: Kangjie Lu --- net/core/rtnetlink.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index a75f7e9..65763c2 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -1180,14 +1180,16 @@ static noinline_for_stack int rtnl_fill_vfinfo(struct sk_buff *skb, static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev) { - struct rtnl_link_ifmap map = { - .mem_start = dev->mem_start, - .mem_end = dev->mem_end, - .base_addr = dev->base_addr, - .irq = dev->irq, - .dma = dev->dma, - .port = dev->if_port, - }; + struct rtnl_link_ifmap map; + + memset(&map, 0, sizeof(map)); + map.mem_start = dev->mem_start; + map.mem_end = dev->mem_end; + map.base_addr = dev->base_addr; + map.irq = dev->irq; + map.dma = dev->dma; + map.port = dev->if_port; + if (nla_put(skb, IFLA_MAP, sizeof(map), &map)) return -EMSGSIZE; -- 1.9.1
[PATCH] infoleak fix2 in timer
The stack object “r1” has a total size of 32 bytes. Its fields “event” and “val” each carry 4 bytes of padding. These 8 padding bytes are sent to user space without being initialized. Signed-off-by: Kangjie Lu --- sound/core/timer.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/core/timer.c b/sound/core/timer.c index 964f5eb..e98fa5f 100644 --- a/sound/core/timer.c +++ b/sound/core/timer.c @@ -1225,6 +1225,7 @@ static void snd_timer_user_ccallback(struct snd_timer_instance *timeri, tu->tstamp = *tstamp; if ((tu->filter & (1 << event)) == 0 || !tu->tread) return; + memset(&r1, 0, sizeof(r1)); r1.event = event; r1.tstamp = *tstamp; r1.val = resolution; -- 1.9.1
[PATCH] infoleak fix3 in timer
The stack object “r1” has a total size of 32 bytes. Its fields “event” and “val” each carry 4 bytes of padding. These 8 padding bytes are sent to user space without being initialized. Signed-off-by: Kangjie Lu --- sound/core/timer.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/core/timer.c b/sound/core/timer.c index e98fa5f..c69a271 100644 --- a/sound/core/timer.c +++ b/sound/core/timer.c @@ -1268,6 +1268,7 @@ static void snd_timer_user_tinterrupt(struct snd_timer_instance *timeri, } if ((tu->filter & (1 << SNDRV_TIMER_EVENT_RESOLUTION)) && tu->last_resolution != resolution) { + memset(&r1, 0, sizeof(r1)); r1.event = SNDRV_TIMER_EVENT_RESOLUTION; r1.tstamp = tstamp; r1.val = resolution; -- 1.9.1
[PATCH] infoleak fix1 in timer
The stack object “tread” has a total size of 32 bytes. Its fields “event” and “val” each carry 4 bytes of padding. These 8 padding bytes are sent to user space without being initialized. Signed-off-by: Kangjie Lu --- sound/core/timer.c | 1 + 1 file changed, 1 insertion(+) diff --git a/sound/core/timer.c b/sound/core/timer.c index 6469bed..964f5eb 100644 --- a/sound/core/timer.c +++ b/sound/core/timer.c @@ -1739,6 +1739,7 @@ static int snd_timer_user_params(struct file *file, if (tu->timeri->flags & SNDRV_TIMER_IFLG_EARLY_EVENT) { if (tu->tread) { struct snd_timer_tread tread; + memset(&tread, 0, sizeof(tread)); tread.event = SNDRV_TIMER_EVENT_EARLY; tread.tstamp.tv_sec = 0; tread.tstamp.tv_nsec = 0; -- 1.9.1
Re: [PATCH] fix infoleak in wireless
On Tue, 2016-05-03 at 16:40 -0400, Kangjie Lu wrote: > The 6-byte array “mac_addr” is not initialized in the dump_station > implementations of > “drivers/staging/wilc1000/wilc_wfi_cfgoperations.c” > and “drivers/staging/rtl8723au/os_dep/ioctl_cfg80211.c”, so all 6 > bytes may be leaked. Like I said to you before, this makes those implementations completely broken. I'm not going to apply this patch. If you want, feel free to send patches to Greg to remove those dump_station implementations that are completely broken and that can never do anything useful. johannes
[PATCH] infoleak fix1 in signal
The stack object “info” has a total size of 128 bytes; however, only 28 bytes are initialized. The remaining uninitialized bytes are sent to userland via send_signal. Signed-off-by: Kangjie Lu --- kernel/signal.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/signal.c b/kernel/signal.c index aa9bf00..f40f0b3 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1594,6 +1594,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig) sig = SIGCHLD; } + memset(&info, 0, sizeof(info)); info.si_signo = sig; info.si_errno = 0; /* -- 1.9.1
[PATCH] infoleak fix2 in signal
The stack object “info” has a total size of 128 bytes; however, only 32 bytes are initialized. The remaining uninitialized bytes are sent to userland via send_signal. Signed-off-by: Kangjie Lu --- kernel/signal.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/signal.c b/kernel/signal.c index f40f0b3..26fdb74 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1689,6 +1689,7 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, parent = tsk->real_parent; } + memset(&info, 0, sizeof(info)); info.si_signo = SIGCHLD; info.si_errno = 0; /* -- 1.9.1
[GIT] Networking
Some straggler bug fixes: 1) Batman-adv DAT must consider VLAN IDs when choosing candidate nodes, from Antonio Quartulli. 2) Fix botched reference counting of vlan objects and neigh nodes in batman-adv, from Sven Eckelmann. 3) netem can crash when it sees GSO packets, the fix is to segment them upon ->enqueue. Fix from Neil Horman with help from Eric Dumazet. 4) Fix VXLAN dependencies in mlx5 driver Kconfig, from Matthew Finlay. 5) Handle VXLAN ops outside of rcu lock, via a workqueue, in mlx5, since it can sleep. Fix also from Matthew Finlay. 6) Check mdiobus_scan() return values properly in pxa168_eth and macb drivers. From Sergei Shtylyov. 7) If the netdevice doesn't support checksumming, disable segmentation. From Alexander Duyck. 8) Fix races between RDS tcp accept and sending, from Sowmini Varadhan. 9) In macb driver, probe MDIO bus before we register the netdev, otherwise we can try to open the device before it is really ready for that. Fix from Florian Fainelli. 10) Netlink attribute size for ILA "tunnels" not calculated properly, fix from Nicolas Dichtel. Please pull, thanks a lot! The following changes since commit 33656a1f2ee5346c742d63ddd0e0970c95a56b70: Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs (2016-05-02 09:59:57 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net for you to fetch changes up to 79e8dc8b80bff0bc5bbb90ca5e73044bf207c8ac: ipv6/ila: fix nlsize calculation for lwtunnel (2016-05-03 16:21:33 -0400) Alexander Duyck (2): net: Disable segmentation if checksumming is not supported vxlan: Add checksum check to the features check function Anna-Maria Gleixner (1): net: mvneta: Remove superfluous SMP function call Antonio Quartulli (2): batman-adv: fix DAT candidate selection (must use vid) batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event David S. 
Miller (4): Merge tag 'batman-adv-fix-for-davem' of git://git.open-mesh.org/linux-merge Merge branch 'mlx5-fixes' Merge branch 'tunnel-csum-and-sg-offloads' Merge branch 'rds-fixes' Florian Fainelli (1): net: macb: Probe MDIO bus before registering netdev Gal Pressman (1): net/mlx5: Unmap only the relevant IO memory mapping Matthew Finlay (3): net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue net/mlx5e: Implement a mlx5e workqueue net/mlx5e: Use workqueue for vxlan ops Neil Horman (1): netem: Segment GSO packets on enqueue Nicolas Dichtel (1): ipv6/ila: fix nlsize calculation for lwtunnel Sergei Shtylyov (2): pxa168_eth: fix mdiobus_scan() error check macb: fix mdiobus_scan() error check Sowmini Varadhan (2): RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock. Sven Eckelmann (2): batman-adv: Fix reference counting of vlan object for tt_local_entry batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node drivers/net/ethernet/cadence/macb.c | 34 +- drivers/net/ethernet/marvell/mvneta.c | 6 ++ drivers/net/ethernet/marvell/pxa168_eth.c | 2 ++ drivers/net/ethernet/mellanox/mlx5/core/Kconfig | 1 + drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 34 +- drivers/net/ethernet/mellanox/mlx5/core/uar.c | 6 -- drivers/net/ethernet/mellanox/mlx5/core/vxlan.c | 50 ++ drivers/net/ethernet/mellanox/mlx5/core/vxlan.h | 11 +-- include/linux/if_ether.h | 5 + include/net/vxlan.h | 4 +++- net/batman-adv/bat_v.c| 12 net/batman-adv/distributed-arp-table.c| 17 ++--- net/batman-adv/hard-interface.c | 3 +++ net/batman-adv/originator.c | 16 +--- net/batman-adv/translation-table.c| 42 -- net/batman-adv/types.h| 7 +++ net/core/dev.c| 2 +- net/ipv6/ila/ila_lwt.c| 3 +-- net/rds/tcp.c | 3 ++- net/rds/tcp.h | 4 net/rds/tcp_connect.c | 8 net/rds/tcp_listen.c | 54 -- net/sched/sch_netem.c | 61 +++-- 24
[PATCH] fix infoleak in wireless
The 6-byte array “mac_addr” is not initialized in the dump_station implementations of “drivers/staging/wilc1000/wilc_wfi_cfgoperations.c” and “drivers/staging/rtl8723au/os_dep/ioctl_cfg80211.c”, so all 6 bytes may be leaked. Signed-off-by: Kangjie Lu --- net/wireless/nl80211.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c index 056a730..2e92d14 100644 --- a/net/wireless/nl80211.c +++ b/net/wireless/nl80211.c @@ -3905,6 +3905,7 @@ static int nl80211_dump_station(struct sk_buff *skb, while (1) { memset(&sinfo, 0, sizeof(sinfo)); + eth_zero_addr(mac_addr); err = rdev_dump_station(rdev, wdev->netdev, sta_idx, mac_addr, &sinfo); if (err == -ENOENT) -- 1.9.1
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
(cc'ing Tadeusz as he did the pkcs1 padding function) Jamie Heilman wrote: > > > Problem loading in-kernel X.509 certificate (-2) > > > > ENOENT? Hmmm... The only place that is generated is in the crypto layer. > > That suggests missing crypto of some sort. > > > > The attached patch enables some debugging in some relevant files if you can > > try applying it to your kernel. > > Alrighty, presumably relevant bits: > > X.509: Cert Issuer: Build time autogenerated kernel key > X.509: Cert Subject: Build time autogenerated kernel key > X.509: Cert Key Algo: rsa > X.509: Cert Valid period: 1461826791-4615426791 > X.509: Cert Signature: rsa + sha512 > X.509: ==>x509_check_signature() > X.509: ==>x509_get_sig_params() > X.509: <==x509_get_sig_params() = 0 > PKEY: ==>public_key_verify_signature() > X.509: Cert Verification: -2 Hmmm... Okay, the only ways out of public_key_verify_signature() without printing a leaving message are for snprintf() to overrun (which would return error -22) or for crypto_alloc_akcipher() to have failed; everything else must go through the kleave() (a pr_devel()) at the bottom of the function. Can you stick: pr_devel("ALGO: %s\n", alg_name); immediately before this line: tfm = crypto_alloc_akcipher(alg_name, 0, 0); and try it again? Thanks, David
[PATCH] fix infoleak in mm
The stack object “si” has a total size of 128 bytes; however, only 20 bytes are initialized. The remaining uninitialized bytes are sent to userland via send_signal. Signed-off-by: Kangjie Lu --- arch/arm64/mm/fault.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 95df28b..f790eda 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -117,6 +117,7 @@ static void __do_user_fault(struct task_struct *tsk, unsigned long addr, { struct siginfo si; + memset(&si, 0, sizeof(si)); if (unhandled_signal(tsk, sig) && show_unhandled_signals_ratelimited()) { pr_info("%s[%d]: unhandled %s (%d) at 0x%08lx, esr 0x%03x\n", tsk->comm, task_pid_nr(tsk), fault_name(esr), sig, -- 1.9.1
[PATCH] fix infoleak in llc
The stack object “info” has a total size of 12 bytes. Its last byte is padding which is not initialized and leaked via “put_cmsg”. Signed-off-by: Kangjie Lu --- net/llc/af_llc.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c index b3c52e3..8ae3ed9 100644 --- a/net/llc/af_llc.c +++ b/net/llc/af_llc.c @@ -626,6 +626,7 @@ static void llc_cmsg_rcv(struct msghdr *msg, struct sk_buff *skb) if (llc->cmsg_flags & LLC_CMSG_PKTINFO) { struct llc_pktinfo info; + memset(&info, 0, sizeof(info)); info.lpi_ifindex = llc_sk(skb->sk)->dev->ifindex; llc_pdu_decode_dsap(skb, &info.lpi_sap); llc_pdu_decode_da(skb, info.lpi_mac); -- 1.9.1
[PATCH] fix infoleak in fcntl
The stack object “si” has a total size of 128 bytes; however, only 16 bytes are initialized. The remaining uninitialized bytes are sent to userland via send_signal. Signed-off-by: Kangjie Lu --- fs/fcntl.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/fcntl.c b/fs/fcntl.c index 350a2c8..d06f943 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -468,6 +468,7 @@ static void send_sigio_to_task(struct task_struct *p, delivered even if we can't queue. Failure to queue in this case _should_ be reported; we fall back to SIGIO in that case. --sct */ + memset(&si, 0, sizeof(si)); si.si_signo = signum; si.si_errno = 0; si.si_code = reason; -- 1.9.1
[PATCH 0/3] [media] s5p-mfc: Fixes for issues when module is removed
Hello, This patch series fixes some issues that I noticed when trying to remove the s5p-mfc driver when built as a module. Some of these issues will go away once Marek's patches replacing the custom memory region reservation code with a generic one that supports named memory region reservation [0] land. But the fixes are trivial, so we can fix the current code until his rework lands. [0]: https://patchwork.linuxtv.org/patch/32287/ Best regards, Javier Javier Martinez Canillas (3): [media] s5p-mfc: Set device name for reserved memory region devs [media] s5p-mfc: Add release callback for memory region devs [media] s5p-mfc: Fix race between s5p_mfc_probe() and s5p_mfc_open() drivers/media/platform/s5p-mfc/s5p_mfc.c | 50 1 file changed, 32 insertions(+), 18 deletions(-) -- 2.5.5
[PATCH] fix infoleak in devio
The stack object “ci” has a total size of 8 bytes. Its last 3 bytes are padding bytes which are not initialized and leaked to userland via “copy_to_user”. Signed-off-by: Kangjie Lu --- drivers/usb/core/devio.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c index 52c4461..9b7f1f7 100644 --- a/drivers/usb/core/devio.c +++ b/drivers/usb/core/devio.c @@ -1316,10 +1316,11 @@ static int proc_getdriver(struct usb_dev_state *ps, void __user *arg) static int proc_connectinfo(struct usb_dev_state *ps, void __user *arg) { - struct usbdevfs_connectinfo ci = { - .devnum = ps->dev->devnum, - .slow = ps->dev->speed == USB_SPEED_LOW - }; + struct usbdevfs_connectinfo ci; + + memset(&ci, 0, sizeof(ci)); + ci.devnum = ps->dev->devnum; + ci.slow = ps->dev->speed == USB_SPEED_LOW; if (copy_to_user(arg, &ci, sizeof(ci))) return -EFAULT; -- 1.9.1
[PATCH 3/3] [media] s5p-mfc: Fix race between s5p_mfc_probe() and s5p_mfc_open()
The s5p_mfc_probe() function registers the video devices before all the resources needed by s5p_mfc_open() are correctly initialized. So if s5p_mfc_open() is called before s5p_mfc_probe() finishes (since the video dev is already registered), a NULL pointer dereference will happen due to s5p_mfc_open() accessing uninitialized vars such as the struct s5p_mfc_dev .watchdog_timer and .mfc_ops fields. An example is the following BUG caused by add_timer() getting a NULL pointer: [ 45.765374] kernel BUG at kernel/time/timer.c:790! [ 45.765381] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM ... [ 45.766149] [] (mod_timer) from [] (s5p_mfc_open+0x274/0x4d4 [s5p_mfc]) [ 45.766416] [] (s5p_mfc_open [s5p_mfc]) from [] (v4l2_open+0x9c/0x100 [videodev]) [ 45.766547] [] (v4l2_open [videodev]) from [] (chrdev_open+0x9c/0x178) [ 45.766575] [] (chrdev_open) from [] (do_dentry_open+0x1e0/0x300) [ 45.766595] [] (do_dentry_open) from [] (path_openat+0x800/0x10d4) [ 45.766610] [] (path_openat) from [] (do_filp_open+0x5c/0xc0) [ 45.766624] [] (do_filp_open) from [] (do_sys_open+0x10c/0x1bc) [ 45.766642] [] (do_sys_open) from [] (ret_fast_syscall+0x0/0x3c) [ 45.766655] Code: eae3 e3a1 e28dd008 e8bd81f0 (e7f001f2) Fix it by registering the video devs as the last step in s5p_mfc_probe(). 
Signed-off-by: Javier Martinez Canillas --- drivers/media/platform/s5p-mfc/s5p_mfc.c | 39 +--- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c index beb4fd5bd326..501d822cad6b 100644 --- a/drivers/media/platform/s5p-mfc/s5p_mfc.c +++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c @@ -1212,14 +1212,6 @@ static int s5p_mfc_probe(struct platform_device *pdev) vfd->vfl_dir = VFL_DIR_M2M; snprintf(vfd->name, sizeof(vfd->name), "%s", S5P_MFC_DEC_NAME); dev->vfd_dec = vfd; - ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); - if (ret) { - v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); - video_device_release(vfd); - goto err_dec_reg; - } - v4l2_info(&dev->v4l2_dev, - "decoder registered as /dev/video%d\n", vfd->num); video_set_drvdata(vfd, dev); /* encoder */ @@ -1237,14 +1229,6 @@ static int s5p_mfc_probe(struct platform_device *pdev) vfd->vfl_dir = VFL_DIR_M2M; snprintf(vfd->name, sizeof(vfd->name), "%s", S5P_MFC_ENC_NAME); dev->vfd_enc = vfd; - ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0); - if (ret) { - v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); - video_device_release(vfd); - goto err_enc_reg; - } - v4l2_info(&dev->v4l2_dev, - "encoder registered as /dev/video%d\n", vfd->num); video_set_drvdata(vfd, dev); platform_set_drvdata(pdev, dev); @@ -1261,15 +1245,34 @@ static int s5p_mfc_probe(struct platform_device *pdev) s5p_mfc_init_hw_cmds(dev); s5p_mfc_init_regs(dev); + /* Register decoder and encoder */ + ret = video_register_device(dev->vfd_dec, VFL_TYPE_GRABBER, 0); + if (ret) { + v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); + video_device_release(dev->vfd_dec); + goto err_dec_reg; + } + v4l2_info(&dev->v4l2_dev, + "decoder registered as /dev/video%d\n", dev->vfd_dec->num); + + ret = video_register_device(dev->vfd_enc, VFL_TYPE_GRABBER, 0); + if (ret) { + v4l2_err(&dev->v4l2_dev, "Failed to register video device\n"); + video_device_release(dev->vfd_enc); + goto err_enc_reg; + } + v4l2_info(&dev->v4l2_dev, + "encoder registered as /dev/video%d\n", dev->vfd_enc->num); + pr_debug("%s--\n", __func__); return 0; /* Deinit MFC if probe had failed */ err_enc_reg: - video_device_release(dev->vfd_enc); -err_enc_alloc: video_unregister_device(dev->vfd_dec); err_dec_reg: + video_device_release(dev->vfd_enc); +err_enc_alloc: video_device_release(dev->vfd_dec); err_dec_alloc: v4l2_device_unregister(&dev->v4l2_dev); -- 2.5.5
[PATCH 3/3] [media] s5p-mfc: Fix race between s5p_mfc_probe() and s5p_mfc_open()
The s5p_mfc_probe() function registers the video devices before all the resources needed by s5p_mfc_open() are correctly initialized. So if s5p_mfc_open() is called before s5p_mfc_probe() finishes (the video device is already registered), a NULL pointer dereference happens because s5p_mfc_open() accesses uninitialized struct s5p_mfc_dev fields such as .watchdog_timer and .mfc_ops.

An example is the following BUG caused by add_timer() getting a NULL pointer:

[   45.765374] kernel BUG at kernel/time/timer.c:790!
[   45.765381] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
...
[   45.766149] [] (mod_timer) from [] (s5p_mfc_open+0x274/0x4d4 [s5p_mfc])
[   45.766416] [] (s5p_mfc_open [s5p_mfc]) from [] (v4l2_open+0x9c/0x100 [videodev])
[   45.766547] [] (v4l2_open [videodev]) from [] (chrdev_open+0x9c/0x178)
[   45.766575] [] (chrdev_open) from [] (do_dentry_open+0x1e0/0x300)
[   45.766595] [] (do_dentry_open) from [] (path_openat+0x800/0x10d4)
[   45.766610] [] (path_openat) from [] (do_filp_open+0x5c/0xc0)
[   45.766624] [] (do_filp_open) from [] (do_sys_open+0x10c/0x1bc)
[   45.766642] [] (do_sys_open) from [] (ret_fast_syscall+0x0/0x3c)
[   45.766655] Code: eae3 e3a1 e28dd008 e8bd81f0 (e7f001f2)

Fix it by registering the video devices as the last step in s5p_mfc_probe().
Signed-off-by: Javier Martinez Canillas
---
 drivers/media/platform/s5p-mfc/s5p_mfc.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c
index beb4fd5bd326..501d822cad6b 100644
--- a/drivers/media/platform/s5p-mfc/s5p_mfc.c
+++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c
@@ -1212,14 +1212,6 @@ static int s5p_mfc_probe(struct platform_device *pdev)
 	vfd->vfl_dir	= VFL_DIR_M2M;
 	snprintf(vfd->name, sizeof(vfd->name), "%s", S5P_MFC_DEC_NAME);
 	dev->vfd_dec	= vfd;
-	ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0);
-	if (ret) {
-		v4l2_err(&dev->v4l2_dev, "Failed to register video device\n");
-		video_device_release(vfd);
-		goto err_dec_reg;
-	}
-	v4l2_info(&dev->v4l2_dev,
-		  "decoder registered as /dev/video%d\n", vfd->num);
 	video_set_drvdata(vfd, dev);

 	/* encoder */
@@ -1237,14 +1229,6 @@ static int s5p_mfc_probe(struct platform_device *pdev)
 	vfd->vfl_dir	= VFL_DIR_M2M;
 	snprintf(vfd->name, sizeof(vfd->name), "%s", S5P_MFC_ENC_NAME);
 	dev->vfd_enc	= vfd;
-	ret = video_register_device(vfd, VFL_TYPE_GRABBER, 0);
-	if (ret) {
-		v4l2_err(&dev->v4l2_dev, "Failed to register video device\n");
-		video_device_release(vfd);
-		goto err_enc_reg;
-	}
-	v4l2_info(&dev->v4l2_dev,
-		  "encoder registered as /dev/video%d\n", vfd->num);
 	video_set_drvdata(vfd, dev);

 	platform_set_drvdata(pdev, dev);
@@ -1261,15 +1245,34 @@ static int s5p_mfc_probe(struct platform_device *pdev)
 	s5p_mfc_init_hw_cmds(dev);
 	s5p_mfc_init_regs(dev);

+	/* Register decoder and encoder */
+	ret = video_register_device(dev->vfd_dec, VFL_TYPE_GRABBER, 0);
+	if (ret) {
+		v4l2_err(&dev->v4l2_dev, "Failed to register video device\n");
+		video_device_release(dev->vfd_dec);
+		goto err_dec_reg;
+	}
+	v4l2_info(&dev->v4l2_dev,
+		  "decoder registered as /dev/video%d\n", dev->vfd_dec->num);
+
+	ret = video_register_device(dev->vfd_enc, VFL_TYPE_GRABBER, 0);
+	if (ret) {
+		v4l2_err(&dev->v4l2_dev, "Failed to register video device\n");
+		video_device_release(dev->vfd_enc);
+		goto err_enc_reg;
+	}
+	v4l2_info(&dev->v4l2_dev,
+		  "encoder registered as /dev/video%d\n", dev->vfd_enc->num);
+
 	pr_debug("%s--\n", __func__);
 	return 0;

 /* Deinit MFC if probe had failed */
 err_enc_reg:
-	video_device_release(dev->vfd_enc);
-err_enc_alloc:
 	video_unregister_device(dev->vfd_dec);
 err_dec_reg:
+	video_device_release(dev->vfd_enc);
+err_enc_alloc:
 	video_device_release(dev->vfd_dec);
 err_dec_alloc:
 	v4l2_device_unregister(&dev->v4l2_dev);
--
2.5.5
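The fix above follows the "publish last" rule: a device becomes reachable by userspace the moment it is registered, so registration must be the final step of probe. A minimal userspace sketch of the race (all `toy_*` names are invented for illustration, not kernel APIs):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the probe/open race: open() on a device that was
 * registered before its ops pointer was set dereferences NULL. */
struct toy_dev {
	int registered;            /* device visible to open() once set */
	int (*ops_decode)(void);   /* initialized late in probe */
};

static int toy_decode_impl(void)
{
	return 42;
}

/* open() may run at any moment after registration */
static int toy_open(struct toy_dev *d)
{
	if (!d->registered)
		return -1;             /* device node not visible yet */
	return d->ops_decode();    /* NULL deref if registered too early */
}

/* Fixed probe ordering: initialize everything, then register last. */
static void toy_probe(struct toy_dev *d)
{
	d->ops_decode = toy_decode_impl; /* all resources ready first */
	d->registered = 1;               /* publish as the final step */
}
```

Had `toy_probe()` set `registered` before `ops_decode`, a concurrent `toy_open()` could land between the two stores and call a NULL function pointer, which is exactly the mod_timer()/mfc_ops failure mode in the trace above.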
[PATCH 2/3] [media] s5p-mfc: Add release callback for memory region devs
When s5p_mfc_remove() calls put_device() for the reserved memory region devices, the driver core warns that the devices don't have a release callback:

WARNING: CPU: 0 PID: 591 at drivers/base/core.c:251 device_release+0x8c/0x90
Device 's5p-mfc-l' does not have a release() function, it is broken and must be fixed.

Also, the DMA memory declared with dma_declare_coherent_memory() isn't released, so add a dev .release callback that calls dma_release_declared_memory().

Cc:
Fixes: 6e83e6e25eb4 ("[media] s5p-mfc: Fix kernel warning on memory init")
Signed-off-by: Javier Martinez Canillas
---
 drivers/media/platform/s5p-mfc/s5p_mfc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c
index 8fc1fd4ee2ed..beb4fd5bd326 100644
--- a/drivers/media/platform/s5p-mfc/s5p_mfc.c
+++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c
@@ -1050,6 +1050,11 @@ static int match_child(struct device *dev, void *data)
 	return !strcmp(dev_name(dev), (char *)data);
 }

+static void s5p_mfc_memdev_release(struct device *dev)
+{
+	dma_release_declared_memory(dev);
+}
+
 static void *mfc_get_drv_data(struct platform_device *pdev);

 static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
@@ -1064,6 +1069,7 @@ static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
 	}

 	dev_set_name(dev->mem_dev_l, "%s", "s5p-mfc-l");
+	dev->mem_dev_l->release = s5p_mfc_memdev_release;
 	device_initialize(dev->mem_dev_l);
 	of_property_read_u32_array(dev->plat_dev->dev.of_node,
 			"samsung,mfc-l", mem_info, 2);
@@ -1083,6 +1089,7 @@ static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
 	}

 	dev_set_name(dev->mem_dev_r, "%s", "s5p-mfc-r");
+	dev->mem_dev_r->release = s5p_mfc_memdev_release;
 	device_initialize(dev->mem_dev_r);
 	of_property_read_u32_array(dev->plat_dev->dev.of_node,
 			"samsung,mfc-r", mem_info, 2);
--
2.5.5
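The warning exists because the driver core frees a struct device only through its release() callback when the last reference is dropped; without one there is no safe point to reclaim the device's resources. A toy userspace sketch of that refcount-plus-release contract (the `toy_*` names are invented, not kernel APIs):

```c
#include <assert.h>
#include <stddef.h>

/* Toy analogue of get_device()/put_device(): the final put invokes the
 * object's release() callback so owned resources are freed exactly once. */
struct toy_obj {
	int refcount;
	void (*release)(struct toy_obj *);
};

static int toy_released;   /* records that release ran */

static void toy_release(struct toy_obj *o)
{
	/* here a real driver would free DMA memory, names, etc. */
	toy_released = 1;
}

static void toy_get(struct toy_obj *o)
{
	o->refcount++;
}

static void toy_put(struct toy_obj *o)
{
	/* with no release callback set, the real driver core emits the
	 * "does not have a release() function" warning instead */
	if (--o->refcount == 0 && o->release)
		o->release(o);
}
```

The patch wires exactly this last step in: s5p_mfc_memdev_release() becomes the release() callback, so put_device() in s5p_mfc_remove() ends up calling dma_release_declared_memory() instead of tripping the warning.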
[PATCH 1/3] [media] s5p-mfc: Set device name for reserved memory region devs
The devices don't have a name set, so dev_name() returns NULL, which makes it harder to identify the devices that are causing issues, for example:

WARNING: CPU: 2 PID: 616 at drivers/base/core.c:251 device_release+0x8c/0x90
Device '(null)' does not have a release() function, it is broken and must be fixed.

And after setting the device name:

WARNING: CPU: 0 PID: 591 at drivers/base/core.c:251 device_release+0x8c/0x90
Device 's5p-mfc-l' does not have a release() function, it is broken and must be fixed.

Cc:
Fixes: 6e83e6e25eb4 ("[media] s5p-mfc: Fix kernel warning on memory init")
Signed-off-by: Javier Martinez Canillas
---
 drivers/media/platform/s5p-mfc/s5p_mfc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/media/platform/s5p-mfc/s5p_mfc.c b/drivers/media/platform/s5p-mfc/s5p_mfc.c
index b16466fe35ee..8fc1fd4ee2ed 100644
--- a/drivers/media/platform/s5p-mfc/s5p_mfc.c
+++ b/drivers/media/platform/s5p-mfc/s5p_mfc.c
@@ -1062,6 +1062,8 @@ static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
 		mfc_err("Not enough memory\n");
 		return -ENOMEM;
 	}
+
+	dev_set_name(dev->mem_dev_l, "%s", "s5p-mfc-l");
 	device_initialize(dev->mem_dev_l);
 	of_property_read_u32_array(dev->plat_dev->dev.of_node,
 			"samsung,mfc-l", mem_info, 2);
@@ -1079,6 +1081,8 @@ static int s5p_mfc_alloc_memdevs(struct s5p_mfc_dev *dev)
 		mfc_err("Not enough memory\n");
 		return -ENOMEM;
 	}
+
+	dev_set_name(dev->mem_dev_r, "%s", "s5p-mfc-r");
 	device_initialize(dev->mem_dev_r);
 	of_property_read_u32_array(dev->plat_dev->dev.of_node,
 			"samsung,mfc-r", mem_info, 2);
--
2.5.5
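The '(null)' in the first warning comes from printing an unnamed device: dev_name() returns NULL until a name is set, and "%s" with a NULL pointer renders as "(null)" in the kernel's printf. A minimal userspace sketch of that accessor pattern (the `toy_*` names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy analogue of dev_set_name()/dev_name(): diagnostics that print
 * the device name get NULL (rendered "(null)") until a name is set. */
struct toy_dev {
	char name[32];
};

static void toy_set_name(struct toy_dev *d, const char *fmt, const char *n)
{
	snprintf(d->name, sizeof(d->name), fmt, n);
}

static const char *toy_name(const struct toy_dev *d)
{
	return d->name[0] ? d->name : NULL;   /* NULL while unnamed */
}
```

Setting the name first is what turns the undiagnosable Device '(null)' warning into the actionable Device 's5p-mfc-l' one, which patch 2/3 then fixes for real.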