Re: [PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices
ping? On 11/19/2012 08:56 PM, Josh Hunt wrote: We found with newer kernels we started seeing the cdrom device showing up in /proc/partitions, but it was not there before. Looking into this I found that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change in behavior. It's not clear to me from the commit's changelog if this change was intentional or not. This comment still remains: /* Don't show non-partitionable removeable devices or empty devices */ so I've decided to send a patch to restore the behavior of not printing unpartitionable removable devices. Thanks Josh --- After commit d27769ec... block: add GENHD_FL_NO_PART_SCAN, /proc/partitions started printing removable devices with only one partition. This is different than prior to the commit. This restores /proc/partitions to behave as it did before. Signed-off-by: Josh Hunt --- block/genhd.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 6cace66..6bfeb2a 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -851,7 +851,7 @@ static int show_partition(struct seq_file *seqf, void *v) char buf[BDEVNAME_SIZE]; /* Don't show non-partitionable removeable devices or empty devices */ - if (!get_capacity(sgp) || (!disk_max_parts(sgp) && + if (!get_capacity(sgp) || (!(disk_max_parts(sgp) > 1) && (sgp->flags & GENHD_FL_REMOVABLE))) return 0; if (sgp->flags & GENHD_FL_SUPPRESS_PARTITION_INFO) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
[fixing lkml address] On 09/11/2012 05:52 PM, Josh Hunt wrote: With recent kernels we noticed that edac was reporting double the memory size on systems running with AMD family 0Fh processors. I'm not very familiar with the code, but this resolves it from what I can see on my systems. At least in amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there appeared to be redundant shifts to the left by 1 when WIDTH_128 is present. thanks Josh --- Since commit 41d8bfaba70311c2fa0666554ef160ea8ffc9daf the memory size reported by edac has been double. Adding to the shift if WIDTH_128 is set looks to be redundant, removing. Signed-off-by: Josh Hunt --- drivers/edac/amd64_edac.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c index 5a297a2..c796710 100644 --- a/drivers/edac/amd64_edac.c +++ b/drivers/edac/amd64_edac.c @@ -1123,7 +1123,7 @@ static int k8_dbam_to_chip_select(struct amd64_pvt *pvt, u8 dct, if (pvt->ext_model >= K8_REV_F) { WARN_ON(cs_mode > 11); - return ddr2_cs_size(cs_mode, dclr & WIDTH_128); + return ddr2_cs_size(cs_mode, 0); } else if (pvt->ext_model >= K8_REV_D) { unsigned diff; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 03:51 AM, Borislav Petkov wrote: > On Tue, Sep 11, 2012 at 06:02:01PM -0500, Josh Hunt wrote: >> On 09/11/2012 05:52 PM, Josh Hunt wrote: >>> With recent kernels we noticed that edac was reporting double the memory >>> size on >>> systems running with AMD family 0Fh processors. I'm not very familiar with >>> the >>> code, but this resolves it from what I can see on my systems. At least in >>> amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there appeared >>> to be redundant shifts to the left by 1 when WIDTH_128 is present. > > Any chance you could enable CONFIG_EDAC_DEBUG, boot and send me your > whole dmesg? Privately is fine too. > > Thanks. > Sure. Attached. This is booting with latest git. Josh [0.00] Initializing cgroup subsys cpuset [0.00] Initializing cgroup subsys cpu [0.00] Linux version 3.6.0-rc5+ (johunt@kernelsuite-780) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #33 SMP Wed Sep 12 05:10:20 PDT 2012 [0.00] Command line: root=UUID=2ceb4350-c206-4bcd-b8b3-e6dd2ff5b021 ro oops=panic panic=30 INIT_VERBOSE=yes init=/sbin/init processor.max_cstate=1 reboot_type=1 nmi_watchdog=2,120 [0.00] e820: BIOS-provided physical RAM map: [0.00] BIOS-e820: [mem 0x-0x0009fbff] usable [0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved [0.00] BIOS-e820: [mem 0x000e8000-0x000f] reserved [0.00] BIOS-e820: [mem 0x0010-0xedfe] usable [0.00] BIOS-e820: [mem 0xedff-0xedffdfff] ACPI data [0.00] BIOS-e820: [mem 0xedffe000-0xedff] ACPI NVS [0.00] BIOS-e820: [mem 0xfec0-0xfec02fff] reserved [0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved [0.00] BIOS-e820: [mem 0xff70-0x] reserved [0.00] BIOS-e820: [mem 0x0001-0x000111ff] usable [0.00] NX (Execute Disable) protection: active [0.00] DMI present. [0.00] DMI: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 09/05/2007 [0.00] e820: update [mem 0x-0x] usable ==> reserved [0.00] e820: remove [mem 0x000a-0x000f] usable [0.00] No AGP bridge found [0.00] e820: last_pfn = 0x112000 max_arch_pfn = 0x4 [0.00] MTRR default type: uncachable [0.00] MTRR fixed ranges enabled: [0.00] 0-9 write-back [0.00] A-E uncachable [0.00] F-F write-protect [0.00] MTRR variable ranges enabled: [0.00] 0 base 00 mask FF write-back [0.00] 1 base 01 mask FFF000 write-back [0.00] 2 base 011000 mask FFFE00 write-back [0.00] 3 base 00EE00 mask FFFE00 uncachable [0.00] 4 base 00F000 mask FFF000 uncachable [0.00] 5 disabled [0.00] 6 disabled [0.00] 7 disabled [0.00] TOM2: 00011200 aka 4384M [0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106 [0.00] e820: update [mem 0xee00-0x] usable ==> reserved [0.00] e820: last_pfn = 0xedff0 max_arch_pfn = 0x4 [0.00] found SMP MP-table at [mem 0x000ff780-0x000ff78f] mapped at [880ff780] [0.00] initial memory mapped: [mem 0x-0x1fff] [0.00] Base memory trampoline at [88097000] 97000 size 24576 [0.00] init_memory_mapping: [mem 0x-0xedfe] [0.00] [mem 0x-0xeddf] page 2M [0.00] [mem 0xede0-0xedfe] page 4k [0.00] kernel direct mapping tables up to 0xedfe @ [mem 0x1f88b000-0x1fff] [0.00] init_memory_mapping: [mem 0x1-0x111ff] [0.00] [mem 0x1-0x111ff] page 2M [0.00] kernel direct mapping tables up to 0x111ff @ [mem 0xedfea000-0xedfe] [0.00] RAMDISK: [mem 0x37511000-0x37fe] [0.00] ACPI: RSDP 000f9e40 00014 (v00 ACPIAM) [0.00] ACPI: RSDT edff 00030 (v01 A M I OEMRSDT 09000705 MSFT 0097) [0.00] ACPI: FACP edff0200 00084 (v02 A M I OEMFACP 09000705 MSFT 0097) [0.00] ACPI: DSDT edff0410 030C5 (v01 1SSL2 1SSL2005 0005 INTL 02002026) [0.00] ACPI: FACS edffe000 00040 [0.00] ACPI: APIC edff0390 00076 (v01 A M I OEMAPIC 09000705 MSFT 0097) [0.00] ACPI: OEMB edffe040 00056 (v01 A M I AMI_OEM 09000705 MSFT 0097) [0.00] ACPI: Local APIC address 0xfee0 [0.00] Scanning NUMA topology in Northbridge 24 [0.00] No NUMA configuration found [0.00] Faking a node at [mem 0x-0x000111fff
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 07:38 AM, Josh Hunt wrote: > On 09/12/2012 03:51 AM, Borislav Petkov wrote: >> On Tue, Sep 11, 2012 at 06:02:01PM -0500, Josh Hunt wrote: >>> On 09/11/2012 05:52 PM, Josh Hunt wrote: >>>> With recent kernels we noticed that edac was reporting double the memory >>>> size on >>>> systems running with AMD family 0Fh processors. I'm not very familiar with >>>> the >>>> code, but this resolves it from what I can see on my systems. At least in >>>> amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there >>>> appeared >>>> to be redundant shifts to the left by 1 when WIDTH_128 is present. >> >> Any chance you could enable CONFIG_EDAC_DEBUG, boot and send me your >> whole dmesg? Privately is fine too. >> >> Thanks. >> > > Sure. Attached. This is booting with latest git. > > Josh I wanted to add that we started seeing this back in 3.0. I didn't go back any farther, but know it was not occurring in 2.6.38. The issue in 3.0 appeared to be that we shift left k8_dbam_to_chip_select() and there was also another shift in amd64_csrow_nr_pages(). It looks like that second shift has been replaced by a more recent patch, which actually checks to see if the csrow is enabled and then counts nr_pages again adding that to the first calculation, still resulting in the size being double. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 10:30 AM, Borislav Petkov wrote: Yes, you're basically right. Here's what I see from here: In 2009 I added commit 603adaf6b3e37450235f0ddb5986b961b3146a79 Author: Borislav Petkov Date: Mon Dec 21 14:52:53 2009 +0100 amd64_edac: fix K8 chip select reporting Fix the case when amd64_debug_display_dimm_sizes() reports only half the amount of DRAM on it because it doesn't account for when the single DCT operates in 128-bit mode and merges chip selects from different DIMMs. which was supposed to fix a bug-report of DRAM chip selects being halved in reporting. But, commit 41d8bfaba70311c2fa0666554ef160ea8ffc9daf Author: Borislav Petkov Date: Tue Jan 18 19:16:08 2011 +0100 amd64_edac: Improve DRAM address mapping Drop static tables which map the bits in F2x80 to a chip select size in favor of functions doing the mapping with some bit fiddling. Also, add F15 support. two years later reworked the whole DBAM to chip select sizes mapping for all families. But it left in the clumsy workaround above for K8 only, thus the double shifting. So, long story short, reverting 603adaf6b3e37450235f0ddb5986b961b3146a79 should probably fix the issue since it is not needed anymore. Let me run it here to make sure I'm not missing anything else. Thanks. Well from what I see 603ad... would only fix the case of printing the values correctly on boot, by removing the factor=1 shift. However, that is merely cosmetic as it does not affect the actual calculation of nr_pages. I guess maybe I wasn't completely clear before, but I see the doubling of the amount of memory both on boot via dmesg, but also in /sys/devices/system/edac/mc/mc0/size_mb and all of the csrow* subdirs therein. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 10:49 AM, Borislav Petkov wrote: Yes, that's because the whole init_csrows thing has been b0rked since forever. In your case, amd64_csrow_nr_pages() should pay attention to the dct (second argument) which on K8 is always 0 (we have only one DCT aka Dram ConTroller on K8) and the function should return 0 if dct is 1. But the whole loop in init_csrows is fishy too so I'll need to rework that properly. Oh well... Looks like we're seeing an issue on another machine. Still 0Fh family, but the model is reported as 2, with cs_mode 7. This causes the machine to report that it has 32GB memory instead of 4GB. For this issue it seems like the shift logic is incorrect. Looking at the old ddr2_dbam_revD table it doesn't really lend itself to shifting very easily. Perhaps something like this (albeit very ugly - it matches the old table): /* 3 to 5 */ if ( cs_mode > 2 && cs_mode < 6 ) return 32 << (cs_mode - 1); /* 6 to 8 */ else if ( cs_mode > 5 && cs_mode < 9 ) return 32 << (cs_mode - 3); /* 9 and 10 */ else if ( cs_mode > 8 ) return 32 << (cs_mode - 4); /* 0 to 3 */ else return 32 << cs_mode; Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 11:48 AM, Borislav Petkov wrote: Can I have /proc/cpuinfo and full dmesg with EDAC_DEBUG enabled from that machine please? Actually my apologies. I was looking at the 3.0 code. This issue is fixed in the latest kernel. Sorry for the noise on that. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/12/2012 12:23 PM, Borislav Petkov wrote: Ok, I have something preliminary which seems to work fine on my K8 here. If you'd like, you can give it a run: git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git error-queue I've changed also debug messages, etc, so pls take a look at those and let me know whether they're understandable, make sense, etc, etc. Boris Thanks to your help I was able to test your branch, but it still does not resolve the problem. Removal of the "factor=1" workaround fixes the memory size reporting on boot, but the sysfs values are still incorrect. [ 25.836264] EDAC MC: DCT0 chip selects: [ 25.836266] EDAC amd64: MC: 0: 1024MB 1: 1024MB [ 25.836398] EDAC amd64: MC: 2: 1024MB 3: 1024MB [ 25.836530] EDAC amd64: MC: 4: 0MB 5: 0MB [ 25.836662] EDAC amd64: MC: 6: 0MB 7: 0MB root@192.168.1.1:/sys/devices/system/edac/mc/mc0# grep . * ce_count:0 ce_noinfo_count:0 dbam:0x0022 dhar:0xee001201 dram_hole:ee00 1200 1200 max_location:csrow 7 channel 1 mc_name:K8 grep: reset_counters: Permission denied sdram_scrub_rate:761 seconds_since_reset:276 size_mb:8192 topmem:0xee00 topmem2:0x00011200 ue_count:0 ue_noinfo_count:0 root@192.168.1.1:/sys/devices/system/edac/mc/mc0/csrow1# grep . * ce_count:0 ch0_ce_count:0 ch0_dimm_label:mc#0csrow#0channel#1 ch1_ce_count:0 ch1_dimm_label:mc#0csrow#4channel#1 dev_type:Unknown edac_mode:S4ECD4ED mem_type:Unbuffered-DDR2 size_mb:2048 ue_count:0 To be sure I'm using the correct branch the last git log entry is: 9d67117feece8852570cc8ee25b68c41f8def323 I could be incorrect, but I still think there's a problem with either a) ddr2_cs_size() for this cpu, or b) the extra shift left when WIDTH_128 is true. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh
On 09/14/2012 07:55 AM, Josh Hunt wrote: Thanks to your help I was able to test your branch, but it still does not resolve the problem. Removal of the "factor=1" workaround fixes the memory size reporting on boot, but the sysfs values are still incorrect. Please disregard what I said earlier about the shift still being wrong. Looking at the dmesg more I see now that the # of pages are correctly reported (262144), however sysfs is still wrong. [ 25.837588] EDAC DEBUG: init_csrows: MC node: 0, csrow: 0 [ 25.837589] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 0, channel: 0, DBAM idx: 2 [ 25.837591] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144 [ 25.837592] EDAC amd64: CS0: Unbuffered DDR2 RAM [ 25.837724] EDAC DEBUG: init_csrows: Total csrow0 pages: 262144 [ 25.837725] DBG: init_csrows: channel_count:2 [ 25.837856] DBG: init_csrows: channel_count:2 [ 25.837988] EDAC DEBUG: init_csrows: MC node: 0, csrow: 1 [ 25.837989] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 1, channel: 0, DBAM idx: 2 [ 25.837991] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144 [ 25.837992] EDAC amd64: CS1: Unbuffered DDR2 RAM [ 25.838157] EDAC DEBUG: init_csrows: Total csrow1 pages: 262144 [ 25.838158] DBG: init_csrows: channel_count:2 [ 25.838289] DBG: init_csrows: channel_count:2 [ 25.838421] EDAC DEBUG: init_csrows: MC node: 0, csrow: 2 [ 25.838422] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 2, channel: 0, DBAM idx: 2 [ 25.838424] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144 [ 25.838425] EDAC amd64: CS2: Unbuffered DDR2 RAM [ 25.838556] EDAC DEBUG: init_csrows: Total csrow2 pages: 262144 [ 25.838558] DBG: init_csrows: channel_count:2 [ 25.838689] DBG: init_csrows: channel_count:2 [ 25.838820] EDAC DEBUG: init_csrows: MC node: 0, csrow: 3 [ 25.838822] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 3, channel: 0, DBAM idx: 2 [ 25.838823] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144 [ 25.838824] EDAC amd64: CS3: Unbuffered DDR2 RAM [ 25.838957] EDAC DEBUG: init_csrows: Total csrow3 pages: 262144 I looked into this and see that sysfs is doing the double counting b/c it loops over the # of channels: [ 131.423949] DBG: csrow_size_show: i:0 nr_pages:262144 nr_channels:2 [ 131.424112] DBG: csrow_size_show: i:1 nr_pages:524288 nr_channels:2 I verified this in init_csrows: [ 25.838958] DBG: init_csrows: channel_count:2 Since I don't know the details of the hardware here it's hard for me to suggest a fix, but it would seem that k8_early_channel_count() needs to be modified to only return 1 in this case? Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices
On 12/03/2012 06:06 PM, Andrew Morton wrote: On Mon, 19 Nov 2012 18:56:49 -0800 Josh Hunt wrote: We found with newer kernels we started seeing the cdrom device showing up in /proc/partitions, but it was not there before. Looking into this I found that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change in behavior. It's not clear to me from the commit's changelog if this change was intentional or not. This comment still remains: /* Don't show non-partitionable removeable devices or empty devices */ so I've decided to send a patch to restore the behavior of not printing unpartitionable removable devices. d27769ec was merged in August 2011, so I after all this time, your fix could be viewed as "changing existing behaviour". So perhaps it would be best to leave things alone. Is there any particular problem with the post-Aug, 2011 behaviour? We caught this by a script that parses /proc/partitions and made some assumptions about the contents therein. It had worked fine up until when this behavior changed. We were able to modify our script to get what we needed. The patch was meant to do two things: 1) understand if this was an unintended change and 2) if so, propose a solution to resolve it. Since the comment was left in the source I believe either a) my patch should be applied or b) a new patch with the comment removed should be put in since it's no longer correct. I did not think this type of change to kernel abi was generally acceptable. While the commit is over a year old, it changes behavior which had been in tact for a while (years?) from what I can tell. We were running 3.0 with stable updates until we upgraded to 3.2 and hit this. Neither of these are what I would consider "old" kernels. Thanks for looking at this. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing up in /proc/partitions, but it was not there before. Looking into this I found that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change in behavior. It's not clear to me from the commit's changelog if this change was intentional or not. This comment still remains: /* Don't show non-partitionable removeable devices or empty devices */ so I've decided to send a patch to restore the behavior of not printing unpartitionable removable devices. Thanks Josh --- After commit d27769ec... block: add GENHD_FL_NO_PART_SCAN, /proc/partitions started printing removable devices with only one partition. This is different than prior to the commit. This restores /proc/partitions to behave as it did before. Signed-off-by: Josh Hunt --- block/genhd.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 6cace66..6bfeb2a 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -851,7 +851,7 @@ static int show_partition(struct seq_file *seqf, void *v) char buf[BDEVNAME_SIZE]; /* Don't show non-partitionable removeable devices or empty devices */ - if (!get_capacity(sgp) || (!disk_max_parts(sgp) && + if (!get_capacity(sgp) || (!(disk_max_parts(sgp) > 1) && (sgp->flags & GENHD_FL_REMOVABLE))) return 0; if (sgp->flags & GENHD_FL_SUPPRESS_PARTITION_INFO) -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Linux 4.1.28
On Fri, Jul 15, 2016 at 6:54 AM, Sasha Levin wrote: > On 07/15/2016 07:38 AM, Thomas Voegtle wrote: >> On Wed, 13 Jul 2016, Sasha Levin wrote: >> >>> I'm announcing the release of the 4.1.28 kernel. >> >> I have a serious memleak with 4.1.28 (like 20mb/s) >> I stripped down my kernel config and started a bisect, which came to: >> >> # first bad commit: [c5ad33184354260be6d05de57e46a5498692f6d6] mm/swap.c: >> flush lru pvecs on compound page arrival >> => >> commit c5ad33184354260be6d05de57e46a5498692f6d6 >> Author: Lukasz Odzioba >> Date: Fri Jun 24 14:50:01 2016 -0700 >> >> mm/swap.c: flush lru pvecs on compound page arrival >> >> >> Reverting this on top 4.1.28 helps. Config attached. > > Yup, this was reported and a fix is already queued for 4.1.29. 4.1.28 is unusable for us. Do you have an ETA when 4.1.29 will be available? Also, is the iptables problem reported by Thomas Voegtle fixed as well? I do not see any iptables-related changes in the 4.1.y-queue: https://git.kernel.org/cgit/linux/kernel/git/sashal/linux-stable.git/log/?h=linux-4.1.y-queue Thanks! -- Josh
vector space exhaustion on 4.14 LTS kernels
Hi Thomas We have a class of machines that appear to be exhausting the vector space on cpus 0 and 1 which causes some breakage later on when trying to set the affinity. The boxes are running the 4.14 LTS kernel. I instrumented 4.14 and here's what I see: [ 28.328849] __assign_irq_vector: irq:512 cpu:0 mask:ff, onlinemask:ff, vector:0 [ 28.329847] __assign_irq_vector: irq:512 cpu:2 vector:222 cfgvect:0 off:14 old_domain:00, domain:00, vector_search:00,0004 update [ 28.329847] default_cpu_mask_to_apicid: irq:512 mask:00,0004 ... [ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff, onlinemask:ff, vector:222 [ 31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff, vector_cpumask:00,0001 vector:222 ... [ 31.729154] __assign_irq_vector: irq:512 cpu:2 vector:00,0004 domain:00,0004 success [ 31.729154] default_cpu_mask_to_apicid: irq:512 hwirq:512 mask:00,0004 [ 31.729154] apic_set_affinity: irq:512 mask:ff, err:0 ... [ 32.818152] mlx5_irq_set_affinity_hint: 0: irq:512 mask:00,0001 ... [ 39.531242] __assign_irq_vector: irq:512 cpu:0 mask:00,0001 onlinemask:ff, vector:222 [ 39.531244] __assign_irq_vector: irq:512 cpu:0 mask:00,0001 vector_cpumask:00,0001 vector:222 [ 39.531245] __assign_irq_vector: irq:512 cpu:0 vector:00,0001 domain:00,0004 ... [ 39.531384] __assign_irq_vector: irq:512 cpu:0 vector:37 current_vector:37 next_cpu2 [ 39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,0001 vector:00, continue [ 39.531386] apic_set_affinity: irq:512 mask:00,0001 err:-28 The affinity values: root@172.25.48.208:/proc/irq/512# grep . * affinity_hint:00,0001 effective_affinity:00,0004 effective_affinity_list:2 grep: mlx5_comp0@pci::65:00.1: Is a directory node:0 smp_affinity:ff, smp_affinity_list:0-39 spurious:count 3 spurious:unhandled 0 spurious:last_unhandled 0 ms I noticed your change, a0c9259dc4e1 "irq/matrix: Spread interrupts on allocation", and this sounds like what we're hitting. Booting 4.19 does not have this problem. I haven't booted 4.15 yet, but can do it to confirm the above commit is what resolves this. Since 4.14 doesn't have the matrix allocator it's not a trivial backport. I was wondering a) if you agree with my assessment and b) if there's any plans on resolving this on the 4.14 allocator? If not I can attempt to backport the idea to 4.14 to spread the interrupts around on allocation. Thanks Josh
Re: [PATCH 4.14 106/143] sch_netem: restore skb->dev after dequeuing from the rbtree
On Fri, Nov 2, 2018 at 12:00 PM Greg Kroah-Hartman wrote: > > 4.14-stable review patch. If anyone has any objections, please let me know. > > -- > > Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got > backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the > v4.14.x-tree. > > However, the backport does not include the changes in sch_netem.c > > We need these, as otherwise the skb->dev pointer is not set when > dequeueing from the netem rbtree, resulting in a panic: > > [ 15.427748] BUG: unable to handle kernel NULL pointer dereference at > 00d0 > [ 15.428863] IP: netif_skb_features+0x24/0x230 > [ 15.429402] PGD 0 P4D 0 > [ 15.429733] Oops: [#1] SMP PTI > [ 15.430169] Modules linked in: > [ 15.430614] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.77.mptcp #77 > [ 15.431497] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 0.5.1 01/01/2011 > [ 15.432568] task: 88042db19680 task.stack: c907 > [ 15.433356] RIP: 0010:netif_skb_features+0x24/0x230 > [ 15.433977] RSP: 0018:88043fd83e70 EFLAGS: 00010286 > [ 15.434665] RAX: 880429ad80c0 RBX: 88042bd0e400 RCX: > 880429ad8000 > [ 15.435585] RDX: RSI: RDI: > 88042bd0e400 > [ 15.436551] RBP: 88042bd0e400 R08: 88042a4b6c9c R09: > 0001 > [ 15.437485] R10: 0004 R11: R12: > 88042c70 > [ 15.438393] R13: 88042c70 R14: 88042a4b6c00 R15: > 88042c6bb000 > [ 15.439315] FS: () GS:88043fd8() > knlGS: > [ 15.440314] CS: 0010 DS: ES: CR0: 80050033 > [ 15.441084] CR2: 00d0 CR3: 00042c374000 CR4: > 06e0 > [ 15.442016] Call Trace: > [ 15.442333] > [ 15.442596] validate_xmit_skb+0x17/0x270 > [ 15.443134] validate_xmit_skb_list+0x38/0x60 > [ 15.443698] sch_direct_xmit+0x102/0x190 > [ 15.444198] __qdisc_run+0xe3/0x240 > [ 15.444671] net_tx_action+0x121/0x140 > [ 15.445177] __do_softirq+0xe2/0x224 > [ 15.445654] irq_exit+0xbf/0xd0 > [ 15.446072] smp_apic_timer_interrupt+0x5d/0x90 > [ 15.446654] apic_timer_interrupt+0x7d/0x90 > [ 15.447185] > [ 15.447460] RIP: 0010:native_safe_halt+0x2/0x10 > [ 15.447992] RSP: 0018:c9073f10 EFLAGS: 0282 ORIG_RAX: > ff10 > [ 15.449008] RAX: 816667d0 RBX: 820946b0 RCX: > > [ 15.449895] RDX: RSI: RDI: > > [ 15.450768] RBP: 82026940 R08: 0004e858e5e1 R09: > 88042a4b6d58 > [ 15.451643] R10: R11: 00d0d56879bb R12: > > [ 15.452478] R13: R14: R15: > > [ 15.453340] ? __sched_text_end+0x2/0x2 > [ 15.453835] default_idle+0xf/0x20 > [ 15.454259] do_idle+0x170/0x200 > [ 15.454653] cpu_startup_entry+0x14/0x20 > [ 15.455142] secondary_startup_64+0xa5/0xb0 > [ 15.455715] Code: 1f 84 00 00 00 00 00 55 53 48 89 fd 48 83 ec 08 8b 87 bc > 00 00 00 48 8b 8f c0 00 00 00 0f b6 97 81 00 00 00 48 8b 77 10 48 01 c8 <48> > 8b 9 > [ 15.458138] RIP: netif_skb_features+0x24/0x230 RSP: 88043fd83e70 > [ 15.458933] CR2: 00d0 > [ 15.459352] ---[ end trace 083925903ae60570 ]--- > > Fixes: 6b921536f170 ("net: sk_buff rbnode reorg") > Cc: Stephen Hemminger > Cc: Eric Dumazet > Cc: Soheil Hassas Yeganeh > Cc: Wei Wang > Cc: Willem de Bruijn > Signed-off-by: Christoph Paasch > Signed-off-by: Sasha Levin > --- > net/sched/sch_netem.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c > index 2a2ab6bfe5d8..3d325b840802 100644 > --- a/net/sched/sch_netem.c > +++ b/net/sched/sch_netem.c > @@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch) > skb->next = NULL; > skb->prev = NULL; > skb->tstamp = netem_skb_cb(skb)->tstamp_save; > + /* skb->dev shares skb->rbnode area, > +* we need to restore its value. > +*/ > + skb->dev = qdisc_dev(sch); > > #ifdef CONFIG_NET_CLS_ACT > /* > -- > 2.17.1 I was seeing this crash on 4.14.78 and this patch fixes it. Thanks! -- Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Hi Jike On 8/20/20 12:43 AM, Jike Song wrote: Hi Josh, We met possibly the same problem when testing nvidia/mellanox's GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you can also have a try? We also did something similar where we've switched over to using the fq scheduler everywhere for now. We believe the bug is in the nolock code which only pfifo_fast uses atm, but we've been unable to come up with a satisfactory solution. Besides, our testing is pretty complex, do you have a quick test to reproduce it? Unfortunately we don't have a simple test case either. Our current reproducer is complex as well, although it would seem like we should be able to come up with something where you have maybe 2 threads trying to send on the same tx queue running pfifo_fast every few hundred milliseconds and not much else/no other tx traffic on that queue. IIRC we believe the scenario is when one thread is in the process of dequeuing a packet while another is enqueuing, the enqueue-er (word? :)) sees the dequeue is in progress and so does not xmit the packet assuming the dequeue operation will take care of it. However b/c the dequeue is in the process of completing it doesn't and the newly enqueued packet stays in the qdisc until another packet is enqueued pushing both out. Given that we have a workaround with using fq or any other qdisc not named pfifo_fast this has gotten bumped down in priority for us. I would like to work on a reproducer at some point, but won't likely be for a few weeks :( Josh
[PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt --- kernel/sched/psi.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9bd7ed..98ff7baf1ba8 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1270,6 +1270,9 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf, if (!nbytes) return -EINVAL; + if (!capable(CAP_SYS_RESOURCE)) + return -EPERM; + buf_size = min(nbytes, sizeof(buf)); if (copy_from_user(buf, user_buf, buf_size)) return -EFAULT; @@ -1353,9 +1356,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, &psi_io_proc_ops); - proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops); + proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops); } return 0; } -- 2.17.1
Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
On 4/1/21 10:47 AM, Eric W. Biederman wrote: Kees Cook writes: On Wed, Mar 31, 2021 at 11:36:28PM -0500, Eric W. Biederman wrote: Josh Hunt writes: Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. The test for CAP_SYS_RESOURCE really needs to be in open rather than in write. Otherwise a suid root executable could have stdout redirected into these files. Right. Or check against f_cred. (See uses of kallsyms_show_value()) https://urldefense.com/v3/__https://www.kernel.org/doc/html/latest/security/credentials.html*open-file-credentials__;Iw!!GjvTz_vk!B_aeVyHMG20VNUGx001EFKpeYlahLQHye7oS8sokXuZOhVDTtF_deDl71a_KYA$ We really want to limit checking against f_cred to those cases where we break userspace by checking in open. AKA the cases where we made the mistake of putting the permission check in the wrong place and now can't fix it. Since this change is change the permissions that open uses already I don't see any reason we can't perform a proper check in open. Eric Thank you for the feedback. I will spin a v2 doing the check in open. Josh
[PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt Acked-by: Johannes Weiner --- kernel/sched/psi.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9bd7ed..d1212f17a898 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v) return psi_show(m, &psi_system, PSI_CPU); } +static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *)) +{ + if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) + return -EPERM; + + return single_open(file, psi_show, NULL); +} + static int psi_io_open(struct inode *inode, struct file *file) { - return single_open(file, psi_io_show, NULL); + return psi_open(file, psi_io_show); } static int psi_memory_open(struct inode *inode, struct file *file) { - return single_open(file, psi_memory_show, NULL); + return psi_open(file, psi_memory_show); } static int psi_cpu_open(struct inode *inode, struct file *file) { - return single_open(file, psi_cpu_show, NULL); + return psi_open(file, psi_cpu_show); } struct psi_trigger *psi_trigger_create(struct psi_group *group, @@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, &psi_io_proc_ops); - proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops); + proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops); } return 0; } -- 2.17.1
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 4/2/21 12:25 PM, Jiri Kosina wrote: On Thu, 3 Sep 2020, John Fastabend wrote: At this point I fear we could consider reverting the NOLOCK stuff. I personally would hate doing so, but it looks like NOLOCK benefits are outweighed by its issues. I agree, NOLOCK brings more pains than gains. There are many race conditions hidden in generic qdisc layer, another one is enqueue vs. reset which is being discussed in another thread. Sure. Seems they crept in over time. I had some plans to write a lockless HTB implementation. But with fq+EDT with BPF it seems that it is no longer needed, we have a more generic/better solution. So I dropped it. Also most folks should really be using fq, fq_codel, etc. by default anyways. Using pfifo_fast alone is not ideal IMO. Half a year later, we still have the NOLOCK implementation present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself. And we've just been bitten by this very same race which appears to be still unfixed, with single packet being stuck in pfifo_fast qdisc basically indefinitely due to this very race that this whole thread began with back in 2019. Unless there are (a) any nice ideas how to solve this in an elegant way without (re-)introducing extra spinlock (Cong's fix) or (b) any objections to revert as per the argumentation above I'll be happy to send a revert of the whole NOLOCK implementation next week. Jiri If you have a reproducer can you try https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your suggestion of reverting nolock makes sense to me. We've moved to using fq as our default now b/c of this bug. Josh
Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
On Mon, Aug 12, 2013 at 10:09 AM, Tejun Heo wrote: > Hello, Jonathan. > > On Mon, Aug 12, 2013 at 12:04:11AM -0700, Jonathan Nieder wrote: >> My laptop fails to boot[1] with the message 'Volume group "data" not >> found'. Bisects to v3.8-rc4~17 (the above commit). Reverting that >> commit on top of current "master" (d92581fcad18, 2013-08-10) produces >> a working kernel. dmesg output from that working kernel attached. >> More details, including .config, at [2]. >> >> Any ideas for tracking this down? > > Which initrd / boot script are you using? It looks like lvm assemble > scripts are running before sdX are detected leading to volume assembly > failure. Before the patch, any module loading would end up > synchronizing async probes but after the patch modprobe invocations > which don't schedule them won't be. Does your boot script happen to > run multiple modprobes in parallel and proceed to configure lvm > without waiting for modprobes of libata drivers to finish? > > Thanks. > > -- > tejun > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ I'm also hitting a regression b/c of this patch. Booting 3.10.20 on a number of older machines with onboard sata controllers are unable to find their root device quickly enough. I bisected the issue down to 774a1221e862b343388347bac9b318767336b20b. Reverting it allows my systems to boot fine. I'm seeing this with both sata_svw and ahci. I'm using ubuntu precise userspace which is using initramfs-tools 0.99ubuntu13.1. From what I can tell the modprobes in this initrd are done in serial, but the port probing is async. This allows init to continue on to try and mount root, but it's not there yet. I'm attaching some serial log output with initcall_debug enabled. Both ahci and sata_svw call ata_host_activate(), which call ata_host_register() and async_schedule(async_port_probe, ap). Let me know what other information you may need. Thanks -- Josh seriallog Description: Binary data
Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
On Tue, Nov 26, 2013 at 3:53 PM, Linus Torvalds wrote: > On Tue, Nov 26, 2013 at 1:29 PM, Josh Hunt wrote: >> >> Both ahci and sata_svw call ata_host_activate(), which call >> ata_host_register() and async_schedule(async_port_probe, ap). > > Well, with the modern logic ("only wait for async probing if the > module itself did async probing") the ahci and svw modules didn't > really change any behavior. > > But other modules did. I wonder, for example, if people insmod the dm > module, and expect all devices to exist afterwards. Which the old > logic of "we always wait for all async code regardless of whether we > started it ourselves" would do, but the new logic does not. > > Something similar might hit the (non-modular) md auto-detect ioctl. > > So maybe we should just special-case those two issues, and say "let's > just wait for async requests here" > > Something like the appended (whitespace-damaged) diff. Does that make > a difference to you guys? And if it does, can you check *which* of the > two async_synchronize_full() calls it is that matters for your cases? > > Linus > > --- duh, apply by hand -- > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index 0704c523a76b..7e7a2f743b11 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -351,6 +351,7 @@ static int __init dm_init(void) > goto bad; > } > > + async_synchronize_full(); > return 0; > >bad: > diff --git a/drivers/md/md.c b/drivers/md/md.c > index b6b7a2866c9e..1d173dc662fc 100644 > --- a/drivers/md/md.c > +++ b/drivers/md/md.c > @@ -8602,6 +8602,7 @@ static void autostart_arrays(int part) > i_scanned = 0; > i_passed = 0; > > + async_synchronize_full(); > printk(KERN_INFO "md: Autodetecting RAID arrays.\n"); > > while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) > { I should have clarified that I'm not using dm/md in my setup. I know the modules are getting loaded in the log I attached, but root is not a md/dm device. -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rds: Error on offset mismatch if not loopback
On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt wrote: > On Sat, Sep 22, 2012 at 2:25 PM, David Miller wrote: >> >> From: John Jolly >> Date: Fri, 21 Sep 2012 15:32:40 -0600 >> >> > Attempting an rds connection from the IP address of an IPoIB interface >> > to itself causes a kernel panic due to a BUG_ON() being triggered. >> > Making the test less strict allows rds-ping to work without crashing >> > the machine. >> > >> > A local unprivileged user could use this flaw to crash the system. >> > >> > Signed-off-by: John Jolly >> >> Besides the questions being asked of you by Venkat Venkatsubra, this >> patch has another issue. >> >> It has been completely corrupted by your email client, it has >> turned all TAB characters into spaces, making the patch useless. >> >> Please learn how to send a patch unmolested in the body of your >> email. Test it by emailing the patch to yourself, and verifying >> that you can in fact apply the patch you receive in that email. >> Then, and only then, should you consider making a new submission >> of this patch. >> >> Use Documentation/email-clients.txt for guidance. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > > I think this issue was lost in the shuffle. It appears that redhat, ubuntu, > and oracle are maintaining local patches to resolve this: > > https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d636852be130fa15fa8be10d4704e8 > https://bugzilla.redhat.com/show_bug.cgi?id=822754 > http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td4985388.html > > Given that Oracle has applied it I'll make the assumption that Venkat's > question was answered at some point. > > David - I can resubmit the patch with the proper signed-off-by and > formatting if you are willing to apply it unless John wants to try again. I > think it's time this got upstream. > > -- > Josh Ugh.. hopefully resending with all the html crap removed... -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
On Tue, Nov 26, 2013 at 4:29 PM, Tejun Heo wrote: > Hello, > > On Tue, Nov 26, 2013 at 04:12:41PM -0600, Josh Hunt wrote: >> I should have clarified that I'm not using dm/md in my setup. I know >> the modules are getting loaded in the log I attached, but root is not >> a md/dm device. > > Can you please still try it? The init script is broken and we're now > just trying to restore just enough of the old behavior so that the > issue is not exposed. The boot script in use seems to load md/dm > modules after storage drivers and use their termination as the signal > for "storage ready", so it could be a good enough bandaid even if > you're not using dm/md. > > Thanks. > > -- > tejun Tejun You're right. Thanks for pointing this out. I did not realize there was a bug in the init script. The version of initramfs-tools I was using had the following bug: https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911 Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs. I did try using the workaround as suggested by Linus. In my setup the dm_init() code was hit, however it still appeared to be too late at times. I also tried moving the call to async_synchronize_full() above the for loop and it still had the same issue (patch attached.) Out of around 10 reboot tests it failed to find root 1 or 2 times. The ubuntu scripts don't ever actually call do_mount() if it can't find the device. It seems to rely on some udev functionality to tell it when the device is present, and if that fails it just bails out. This change has introduced a regression. However, I only noticed it b/c my init script had a bug which caused it not to wait around for the device to appear. -- Josh Index: linux-3.10/drivers/md/dm.c === --- linux-3.10.orig/drivers/md/dm.c +++ linux-3.10/drivers/md/dm.c @@ -16,12 +16,13 @@ #include #include #include #include #include #include +#include #include #define DM_MSG_PREFIX "core" #ifdef CONFIG_PRINTK @@ -275,12 +276,15 @@ static void (*_exits[])(void) = { static int __init dm_init(void) { const int count = ARRAY_SIZE(_inits); int r, i; + printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full();\n", __func__); + async_synchronize_full(); + for (i = 0; i < count; i++) { r = _inits[i](); if (r) goto bad; } Index: linux-3.10/drivers/md/md.c === --- linux-3.10.orig/drivers/md/md.c +++ linux-3.10/drivers/md/md.c @@ -48,12 +48,13 @@ #include #include #include #include #include #include +#include #include "md.h" #include "bitmap.h" #ifndef MODULE static void autostart_arrays(int part); #endif @@ -8573,12 +8574,14 @@ static void autostart_arrays(int part) dev_t dev; int i_scanned, i_passed; i_scanned = 0; i_passed = 0; + printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full()\n", __func__); + async_synchronize_full(); printk(KERN_INFO "md: Autodetecting RAID arrays.\n"); while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) { i_scanned++; node_detected_dev = list_entry(all_detected_devices.next, struct detected_devices_node, list);
Re: [PATCH] printk: Fix discarding of records
On Sun, Feb 16, 2014 at 8:38 PM, Banerjee, Debabrata wrote: > On 2/16/14, 7:41 PM, "Linus Torvalds" > wrote: > >>On Sun, Feb 16, 2014 at 4:19 PM, Banerjee, Debabrata >> wrote: >>> >>> No that can't be right, the prev value after every loop is the >>>msg->flags >>> from the *last* line in the list, which has no relation to the *first*, >>>so >>> reusing it for the top of the next loop is nonsense. >> >>Please, Debabrata, humor me, and just try the patch. >> >>And try reading the source code. Because your statement is BS. > ... >>No, I haven't tested my patch, and maybe it's broken for some subtle >>reason I'm missing too. > > Yes my explanation was wrong, your patch works for me. I assumed printing > the prefix was desired, but if not, great. The resulting patch from this discussion: commit e4178d809fdaee32a56833fff1f5056c99e90a1a Author: Linus Torvalds Date: Mon Feb 17 12:24:45 2014 -0800 printk: fix syslog() overflowing user buffer has not made it into stable yet. Can we add it to the stable queue? I'm not sure if it needs more bake time, etc. -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] panic: add TAINT_SOFTLOCKUP
This taint flag will be set if the system has ever entered a softlockup state. Similar to TAINT_WARN it is useful to know whether or not the system has been in a softlockup state when debugging. Signed-off-by: Josh Hunt --- Documentation/oops-tracing.txt |2 ++ Documentation/sysctl/kernel.txt |1 + include/linux/kernel.h |1 + kernel/panic.c |1 + kernel/watchdog.c |1 + 5 files changed, 6 insertions(+) diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt index e315599..beefb9f 100644 --- a/Documentation/oops-tracing.txt +++ b/Documentation/oops-tracing.txt @@ -268,6 +268,8 @@ characters, each representing a particular tainted value. 14: 'E' if an unsigned module has been loaded in a kernel supporting module signature. + 15: 'L' if a soft lockup has previously occurred on the system. + The primary reason for the 'Tainted: ' string is to tell kernel debuggers if this is a clean kernel or if anything unusual has occurred. Tainting is permanent: even if an offending module is diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 9886c3d..8dfdf2f 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -788,6 +788,7 @@ can be ORed together: 4096 - An out-of-tree module has been loaded. 8192 - An unsigned module has been loaded in a kernel supporting module signature. +16384 - A soft lockup has previously occurred on the system. == diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 4c52907..eb7b074 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -470,6 +470,7 @@ extern enum system_states { #define TAINT_FIRMWARE_WORKAROUND 11 #define TAINT_OOT_MODULE 12 #define TAINT_UNSIGNED_MODULE 13 +#define TAINT_SOFTLOCKUP 14 extern const char hex_asc[]; #define hex_asc_lo(x) hex_asc[((x) & 0x0f)] diff --git a/kernel/panic.c b/kernel/panic.c index d02fa9f..d68c5d8 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -212,6 +212,7 @@ static const struct tnt tnts[] = { { TAINT_FIRMWARE_WORKAROUND,'I', ' ' }, { TAINT_OOT_MODULE, 'O', ' ' }, { TAINT_UNSIGNED_MODULE,'E', ' ' }, + { TAINT_SOFTLOCKUP, 'L', ' ' }, }; /** diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 516203e..09ac67c 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) if (softlockup_panic) panic("softlockup: hung tasks"); + add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK); __this_cpu_write(soft_watchdog_warn, true); } else __this_cpu_write(soft_watchdog_warn, false); -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] panic: add TAINT_SOFTLOCKUP
On 06/23/2014 05:11 PM, Andrew Morton wrote: On Tue, 3 Jun 2014 22:12:35 -0400 Josh Hunt wrote: This taint flag will be set if the system has ever entered a softlockup state. Similar to TAINT_WARN it is useful to know whether or not the system has been in a softlockup state when debugging. ... @@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) if (softlockup_panic) panic("softlockup: hung tasks"); + add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK); __this_cpu_write(soft_watchdog_warn, true); } else __this_cpu_write(soft_watchdog_warn, false); Would make more sense to have applied the taint *before* calling panic()? Andrew Yep, that's a good call. Thanks. Do you want me to send a v2 or did you take care of it? In addition to adding the softlockup taint flag, do you think it'd be reasonable to add another flag for page allocation failures? I think it'd be nice to be able to account for these conditions somehow without having to parse dmesg, etc. As with the softlockup flag, it's helpful to know if your system had encountered a page allocation failure at some point before the crash or whatever you're debugging. Thanks Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] panic: add TAINT_SOFTLOCKUP
On 06/23/2014 05:51 PM, Andrew Morton wrote: On Mon, 23 Jun 2014 17:45:00 -0500 Josh Hunt wrote: On 06/23/2014 05:11 PM, Andrew Morton wrote: On Tue, 3 Jun 2014 22:12:35 -0400 Josh Hunt wrote: This taint flag will be set if the system has ever entered a softlockup state. Similar to TAINT_WARN it is useful to know whether or not the system has been in a softlockup state when debugging. ... @@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) if (softlockup_panic) panic("softlockup: hung tasks"); + add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK); __this_cpu_write(soft_watchdog_warn, true); } else __this_cpu_write(soft_watchdog_warn, false); Would make more sense to have applied the taint *before* calling panic()? Andrew Yep, that's a good call. Thanks. Do you want me to send a v2 or did you take care of it? I fixed it up. In addition to adding the softlockup taint flag, do you think it'd be reasonable to add another flag for page allocation failures? I think it'd be nice to be able to account for these conditions somehow without having to parse dmesg, etc. As with the softlockup flag, it's helpful to know if your system had encountered a page allocation failure at some point before the crash or whatever you're debugging. I don't know, really. Allocation failures are often an expected thing as drivers try to work out how much memory they can allocate. Those things can be screened out by testing __GFP_NOWARN. GFP_ATOMIC failures should probably be ignored, except for when they shouldn't. But even then, allocation failures are somewhat common. And recency is a concern: an allocation failure 10 minutes ago is unlikely to be relevant. But that's just me waving hands around. I'd be interested to hear from people whose kernels crash more often than mine, and from those whose job is to support them (ie distro people?). Anyone you'd suggest adding to this thread to get other feedback about tracking page allocation failures? I could also spin up a patch and cc them. Thanks Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] panic: add TAINT_SOFTLOCKUP
On 06/24/2014 07:45 PM, David Rientjes wrote: On Tue, 24 Jun 2014, Josh Hunt wrote: Anyone you'd suggest adding to this thread to get other feedback about tracking page allocation failures? I could also spin up a patch and cc them. Page allocation failures happen all the time, mostly because of large-order allocations (more than PAGE_ALLOC_COSTLY_ORDER) or allocations done with GFP_ATOMIC where it's impossible to reclaim or compact memory to allocate. Because of this, they are fairly easy to trigger from userspace without having to do much. Why would this qualify for a taint? I have never debugged a kernel crash that I traced back to an earlier page allocation failure and said "oh, if I had only known about that page allocation failure earlier!". If one of them is going to cause an issue, it probably is at the point of the crash and you shouldn't have to "investigate" much. I guess I was thinking more of the case where all you have is the trace/dump and for whatever reason the last bits which may contain the page allocation failure info didn't get flushed to disk. In that case it'd be nice to know what lead up to the crash. However, I do agree with your point and Andrew's about the frequency and ease of triggering them which would make taint the wrong place to account for them. Thanks Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rds: Error on offset mismatch if not loopback
On Wed, Nov 13, 2013 at 9:16 AM, Venkat Venkatsubra wrote: > > > -Original Message- > From: Josh Hunt [mailto:joshhun...@gmail.com] > Sent: Tuesday, November 12, 2013 10:25 PM > To: David Miller > Cc: jjo...@suse.com; LKML; Venkat Venkatsubra; net...@vger.kernel.org > Subject: Re: [PATCH] rds: Error on offset mismatch if not loopback > > On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt wrote: >> On Sat, Sep 22, 2012 at 2:25 PM, David Miller wrote: >>> >>> From: John Jolly >>> Date: Fri, 21 Sep 2012 15:32:40 -0600 >>> >>> > Attempting an rds connection from the IP address of an IPoIB >>> > interface to itself causes a kernel panic due to a BUG_ON() being >>> > triggered. >>> > Making the test less strict allows rds-ping to work without >>> > crashing the machine. >>> > >>> > A local unprivileged user could use this flaw to crash the system. >>> > >>> > Signed-off-by: John Jolly >>> >>> Besides the questions being asked of you by Venkat Venkatsubra, this >>> patch has another issue. >>> >>> It has been completely corrupted by your email client, it has turned >>> all TAB characters into spaces, making the patch useless. >>> >>> Please learn how to send a patch unmolested in the body of your >>> email. Test it by emailing the patch to yourself, and verifying that >>> you can in fact apply the patch you receive in that email. >>> Then, and only then, should you consider making a new submission of >>> this patch. >>> >>> Use Documentation/email-clients.txt for guidance. >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-kernel" in the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> Please read the FAQ at http://www.tux.org/lkml/ >> >> >> I think this issue was lost in the shuffle. It appears that redhat, >> ubuntu, and oracle are maintaining local patches to resolve this: >> >> https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d63685 >> 2be130fa15fa8be10d4704e8 >> https://bugzilla.redhat.com/show_bug.cgi?id=822754 >> http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td49853 >> 88.html >> >> Given that Oracle has applied it I'll make the assumption that >> Venkat's question was answered at some point. >> >> David - I can resubmit the patch with the proper signed-off-by and >> formatting if you are willing to apply it unless John wants to try >> again. I think it's time this got upstream. >> >> -- >> Josh > > Ugh.. hopefully resending with all the html crap removed... > > -- > Josh > > Hi Josh, > > No, I still didn't get an answer for how "off" could be non-zero in case of > rds-ping to hit BUG_ON(off % RDS_FRAG_SIZE). > Because, rds-ping uses zero byte messages to ping. > If you have a test case that reproduces the kernel panic I can try it out and > see how that can happen. > The Oracle's internal code I checked doesn't have that patch applied. > > Venkat No I don't have a test case. I came across this CVE while doing an audit and noticed it was patched in Ubuntu's kernel and other distros, but was not in the upstream kernel yet. Quick googling of lkml showed that there were at least two attempts to get this patch upstream, but both had issues due to not following the proper submission process: https://lkml.org/lkml/2012/10/22/433 https://lkml.org/lkml/2012/9/21/505 >From my searching it appears the initial bug was found by someone at redhat: https://bugzilla.redhat.com/show_bug.cgi?id=822754 I've added Li Honggang the reporter of this issue from Redhat to the mail. Hopefully he can share his testcase. and possibly requires certain hardware as Jay writes in the first link above: "...some Infiniband HCAs(QLogic, possibly others) the machine will panic..." I was referring to this oracle commit: https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d636852be130fa15fa8be10d4704e8 I have no experience with this code. There were a few comments around the reset and xmit fns about making sure the caller did certain things if not they were racy, but I have no idea if that's coming into play here. -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] rds: fix local ping DoS
The rds_ib_xmit function in net/rds/ib_send.c in the Reliable Datagram Sockets (RDS) protocol implementation allows local users to cause a denial of service (BUG_ON and kernel panic) by establishing an RDS connection with the source IP address equal to the IPoIB interface's own IP address, as demonstrated by rds-ping. A local unprivileged user could use this flaw to crash the system. CVE-2012-2372 Reported-by: Honggang Li Signed-off-by: Josh Hunt --- net/rds/ib_send.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c index e590949..7920c85 100644 --- a/net/rds/ib_send.c +++ b/net/rds/ib_send.c @@ -544,7 +544,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm, int flow_controlled = 0; int nr_sig = 0; - BUG_ON(off % RDS_FRAG_SIZE); + BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE); BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header)); /* Do not send cong updates to IB loopback */ -- 1.7.0.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rds: Error on offset mismatch if not loopback
On Wed, Nov 13, 2013 at 6:55 PM, Honggang LI wrote: > On 11/14/2013 01:40 AM, Josh Hunt wrote: >> On Wed, Nov 13, 2013 at 9:16 AM, Venkat Venkatsubra >> wrote: >>> >>> -Original Message- >>> From: Josh Hunt [mailto:joshhun...@gmail.com] >>> Sent: Tuesday, November 12, 2013 10:25 PM >>> To: David Miller >>> Cc: jjo...@suse.com; LKML; Venkat Venkatsubra; net...@vger.kernel.org >>> Subject: Re: [PATCH] rds: Error on offset mismatch if not loopback >>> >>> On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt wrote: >>>> On Sat, Sep 22, 2012 at 2:25 PM, David Miller wrote: >>>>> From: John Jolly >>>>> Date: Fri, 21 Sep 2012 15:32:40 -0600 >>>>> >>>>>> Attempting an rds connection from the IP address of an IPoIB >>>>>> interface to itself causes a kernel panic due to a BUG_ON() being >>>>>> triggered. >>>>>> Making the test less strict allows rds-ping to work without >>>>>> crashing the machine. >>>>>> >>>>>> A local unprivileged user could use this flaw to crash the system. >>>>>> >>>>>> Signed-off-by: John Jolly >>>>> Besides the questions being asked of you by Venkat Venkatsubra, this >>>>> patch has another issue. >>>>> >>>>> It has been completely corrupted by your email client, it has turned >>>>> all TAB characters into spaces, making the patch useless. >>>>> >>>>> Please learn how to send a patch unmolested in the body of your >>>>> email. Test it by emailing the patch to yourself, and verifying that >>>>> you can in fact apply the patch you receive in that email. >>>>> Then, and only then, should you consider making a new submission of >>>>> this patch. >>>>> >>>>> Use Documentation/email-clients.txt for guidance. >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe >>>>> linux-kernel" in the body of a message to majord...@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> Please read the FAQ at http://www.tux.org/lkml/ >>>> >>>> I think this issue was lost in the shuffle. It appears that redhat, >>>> ubuntu, and oracle are maintaining local patches to resolve this: >>>> >>>> https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d63685 >>>> 2be130fa15fa8be10d4704e8 >>>> https://bugzilla.redhat.com/show_bug.cgi?id=822754 >>>> http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td49853 >>>> 88.html >>>> >>>> Given that Oracle has applied it I'll make the assumption that >>>> Venkat's question was answered at some point. >>>> >>>> David - I can resubmit the patch with the proper signed-off-by and >>>> formatting if you are willing to apply it unless John wants to try >>>> again. I think it's time this got upstream. >>>> >>>> -- >>>> Josh >>> Ugh.. hopefully resending with all the html crap removed... >>> >>> -- >>> Josh >>> >>> Hi Josh, >>> >>> No, I still didn't get an answer for how "off" could be non-zero in case of >>> rds-ping to hit BUG_ON(off % RDS_FRAG_SIZE). >>> Because, rds-ping uses zero byte messages to ping. >>> If you have a test case that reproduces the kernel panic I can try it out >>> and see how that can happen. >>> The Oracle's internal code I checked doesn't have that patch applied. >>> >>> Venkat >> No I don't have a test case. I came across this CVE while doing an >> audit and noticed it was patched in Ubuntu's kernel and other distros, >> but was not in the upstream kernel yet. Quick googling of lkml showed >> that there were at least two attempts to get this patch upstream, but >> both had issues due to not following the proper submission process: >> >> https://lkml.org/lkml/2012/10/22/433 >> https://lkml.org/lkml/2012/9/21/505 >> >> From my searching it appears the initial bug was found by someone at redhat: >> https://bugzilla.redhat.com/show_bug.cgi?id=822754 >> >> I've added Li Honggang the reporter of this issue from Redhat to the >> mail. Hopefully he can share his testcase. > The test case is very simple: > Steps to Reproduce: > 1. yum install -y rds-tools > > 2. [root@rdma3 ~]# ifconfig ib0 | grep 'inet addr' > inet addr:172.31.0.3 Bcast:172.31.0.255 Mask:255.255.255.0 > > 3. [root@rdma3 ~]# /usr/bin/rds-ping 172.31.0.3 <<<< kernel panic (You > may need to wait for a few seconds before the kernel panic.) >> >> and possibly requires certain hardware as Jay writes in the first link above: >> "...some Infiniband HCAs(QLogic, possibly others) the machine will panic..." > This bug can be reproduced with Mellanox HCAs (mlx4_ib.ko and mthca.ko), > QLogic HCA (ib_qib.ko). I did not test the QLogic HCA running "ib_ipath.ko". > > As I know the upstream code of RDS is broken. There are *many* RDS bugs. > > Best regards. > Honggang Thanks Honggang. I have resubmitted the patch for approval. -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rds: fix local ping DoS
On 11/14/2013 01:03 AM, David Miller wrote: From: Josh Hunt Date: Wed, 13 Nov 2013 17:15:43 -0800 The rds_ib_xmit function in net/rds/ib_send.c in the Reliable Datagram Sockets (RDS) protocol implementation allows local users to cause a denial of service (BUG_ON and kernel panic) by establishing an RDS connection with the source IP address equal to the IPoIB interface's own IP address, as demonstrated by rds-ping. A local unprivileged user could use this flaw to crash the system. CVE-2012-2372 Reported-by: Honggang Li Signed-off-by: Josh Hunt I'm sorry I can't apply this. This commit message needs to be much less terse and explain things more. First of all, why is the "off % RDS_FRAG_SIZE" important? And, even more importantly, why is is OK to avoid this assertion just because we're going over loopback? Furthermore, why doesn't net/rds/iw_send.c:rds_iw_xmit() have the same exact problem? It makes the same exact assertion check. I know this RDS code is a steaming pile of poo, but that doesn't mean we just randomly adjust assertions to make crashes go away without sufficient understanding of exactly what's going on. Thanks. Sure understandable questions. Unfortunately I don't have the hardware to properly debug and analyze. I was just trying to get this through on the assumption that the previous attempts just failed due to incorrect submission procedures and lack of a reproducible testcase. If nothing else this whole thing brought out the testcase :) Testcase from Honggang's earlier mail: The test case is very simple: Steps to Reproduce: 1. yum install -y rds-tools 2. [root@rdma3 ~]# ifconfig ib0 | grep 'inet addr' inet addr:172.31.0.3 Bcast:172.31.0.255 Mask:255.255.255.0 3. [root@rdma3 ~]# /usr/bin/rds-ping 172.31.0.3 <<<< kernel panic (You may need to wait for a few seconds before the kernel panic.) This bug can be reproduced with Mellanox HCAs (mlx4_ib.ko and mthca.ko), QLogic HCA (ib_qib.ko). I did not test the QLogic HCA running "ib_ipath.ko". Perhaps Venkat or someone else with the hardware mentioned can provide a better explanation and better solution to the crash. Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used
On Tue, Dec 3, 2013 at 9:19 AM, Tejun Heo wrote: > Hello, > > On Tue, Dec 03, 2013 at 08:28:43AM -0600, Josh Hunt wrote: >> You're right. Thanks for pointing this out. I did not realize there >> was a bug in the init script. The version of initramfs-tools I was >> using had the following bug: >> https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911 >> >> Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs. >> >> I did try using the workaround as suggested by Linus. In my setup the >> dm_init() code was hit, however it still appeared to be too late at >> times. I also tried moving the call to async_synchronize_full() above >> the for loop and it still had the same issue (patch attached.) Out of >> around 10 reboot tests it failed to find root 1 or 2 times. >> >> The ubuntu scripts don't ever actually call do_mount() if it can't >> find the device. It seems to rely on some udev functionality to tell >> it when the device is present, and if that fails it just bails out. >> >> This change has introduced a regression. However, I only noticed it >> b/c my init script had a bug which caused it not to wait around for >> the device to appear. > > Hmmm so, read the bug report, digged and asked around a bit. > Here's the root problem - ubuntu's initramfs uses a tool to wait for > the root device which uses libudev to listen for the device event; > unfortunately, its rx buffer is not set large enough and the receiver > isn't fast enough, which means that netlink broadcast messages from > the kernel can overrun the buffer. When that happens, it sets an > error on the socket, so the next recv fails with -ENOBUFS. If that > happens, the wait for root aborts immediately and initramfs proceeds > to mount non-existent root device. > > The only thing which changes by these patches is the timing of events. > The problem likely wasn't as exposed before because things were slow > enough so that either the messages could be consumed fast enough or > there's enough delay between libata module load and the root device > wait hiding the bug in the wait logic. > > So, yeah, it's a full blown timing bug. I'm not sure what we can do > to work around from kernel side except for randomly slowing things > down or forcefully enlarging rx buffer size. There really is no > interlocking to take advantage of. :( So there used to be a call to async_synchronize_full() in ata_host_register(), but it was removed by f29d3b23238e1955a8094e038c72546e99308e61 as part of some fastboot changes. Adding it back (in the attached patch) seems to resolve the issue when using the broken initrd. I'm guessing adding it back isn't an option, but I wanted to point it out. -- Josh Index: b/drivers/ata/libata-core.c === --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -6181,12 +6181,14 @@ int ata_host_register(struct ata_host *h /* perform each probe asynchronously */ for (i = 0; i < host->n_ports; i++) { struct ata_port *ap = host->ports[i]; async_schedule(async_port_probe, ap); } + async_synchronize_full(); + return 0; err_tadd: while (--i >= 0) { ata_tport_delete(host->ports[i]); }
Re: [PATCH 3.19 016/175] ksoftirqd: Enable IRQs and call cond_resched() before poking RCU
On Wed, Mar 4, 2015 at 12:13 AM, Greg Kroah-Hartman wrote: > 3.19-stable review patch. If anyone has any objections, please let me know. > > -- > > From: Calvin Owens > > commit 28423ad283d5348793b0c45cc9b1af058e776fd6 upstream. > > While debugging an issue with excessive softirq usage, I encountered the > following note in commit 3e339b5dae24a706 ("softirq: Use hotplug thread > infrastructure"): > > [ paulmck: Call rcu_note_context_switch() with interrupts enabled. ] > > ...but despite this note, the patch still calls RCU with IRQs disabled. > > This seemingly innocuous change caused a significant regression in softirq > CPU usage on the sending side of a large TCP transfer (~1 GB/s): when > introducing 0.01% packet loss, the softirq usage would jump to around 25%, > spiking as high as 50%. Before the change, the usage would never exceed 5%. > > Moving the call to rcu_note_context_switch() after the cond_sched() call, > as it was originally before the hotplug patch, completely eliminated this > problem. > > Signed-off-by: Calvin Owens > Signed-off-by: Paul E. McKenney > Signed-off-by: Greg Kroah-Hartman > > --- > kernel/softirq.c |6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > --- a/kernel/softirq.c > +++ b/kernel/softirq.c > @@ -656,9 +656,13 @@ static void run_ksoftirqd(unsigned int c > * in the task stack here. > */ > __do_softirq(); > - rcu_note_context_switch(); > local_irq_enable(); > cond_resched(); > + > + preempt_disable(); > + rcu_note_context_switch(); > + preempt_enable(); > + > return; > } > local_irq_enable(); > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ Sorry for the delay in noticing this, but should this be applied to 3.14-stable as well? Thanks Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old
On 03/14/2016 11:29 AM, Don Zickus wrote: Hi Josh, I believe Uli thought the below patch might fix it. Cheers, Don Don It looks like I was incorrect when I said 4.5 was getting the soft lockup. I originally found this problem on 4.1.19 and saw both the problem my patch solves and the soft lockups there. I thought when I checked 4.5 that I saw both issues there as well, but going back and checking now that is not the case. I only see the issue my patch resolves on 4.5. With that info my changelog is incorrect now as it states I saw a soft lockup on the head. I will submit a v2 of my patch with the updated changelog. I'll also cc stable this time as I'd like to see this fix end up there as well. As for the soft lockups showing up on 4.1, I tried Uli's patch and it did not help. After that I did a git bisect to figure out when the soft lockup was fixed and it appears to be resolved after one of the commits in this series: commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 Author: Ulrich Obergfell Date: Fri Sep 4 15:45:15 2015 -0700 watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads() I didn't identify the exact commit. It would be nice to resolve the soft lockup for the stable folks since 4.1 and 4.4 are longterm stable releases and would see this problem. I did not have time to debug it any more outside of this bisection today. If you have something you'd like me to try which may work for the stable kernels I'm happy to test it. For the record I'm able to reproduce the soft lockup on 4.1 doing: while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done & sleep 30 && kill %1 && sleep 5 Josh
Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old
On 03/16/2016 04:21 AM, Ulrich Obergfell wrote: Josh, I haven't tried to reproduce the soft lockups with kernel 4.1, but I believe I found an explanation in terms of how your test case breaks the watchdog mechanism in kernel 4.1: The soft lockup detector is implemented via two components (per-cpu). - The watchdog_timer_fn() function. This function runs periodically (sample_period) in the context of a high-resolution timer on CPU N. It wakes up the 'watchdog/N' thread and checks watchdog_touch_ts[N] to determine whether the thread has run within the soft lockup detection interval (watchdog_thresh * 2). - The watchdog() function. This function is executed in the context of the 'watchdog/N' thread. It updates the watchdog_touch_ts[N] time stamp to signal that it got a chance to run. The thread has a high realtime priority. If it does not get a chance to run, this is indicative that 'something else' is hogging CPU N. The while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done loop triggers the following flow of execution in the watchdog code at a frequency that is much higher than sample_period. proc_nmi_watchdog proc_watchdog_common proc_watchdog_update watchdog_enable_all_cpus update_watchdog_all_cpus for_each_online_cpu update_watchdog smp_call_function_single(restart_watchdog_hrtimer) Calling the restart_watchdog_hrtimer() function at such a high rate has the following effects. - The high-resolution timer gets canceled and restarted before it ever gets a chance to expire. Hence, the watchdog_timer_fn() function does not get a chance to wake up the 'watchdog/N' thread. So this prevents the thread from running within the soft lockup detection interval. - Since watchdog_timer_fn() does not wake up the 'watchdog/N' thread, the watchdog_touch_ts[N] time stamp is not being updated. The sleep 30 && kill %1 command terminates the above while loop after 30 seconds, so the high-resolution timer no longer gets canceled. When the timer expires, watchdog_timer_fn() detects that 'watchdog/N' has not run within the soft lockup detection interval. Your patch 'masks' the above issue because proc_watchdog_update() is not called if the parameter value is not changed. The mechanism that picks up changes of watchdog parameters 'on the fly' was rewritten in kernel 4.3 (see https://lkml.org/lkml/2015/8/1/64 and http://marc.info/?l=linux-kernel&m=143894132129981&w=2). $ git log --pretty=oneline v4.2..v4.3 -- kernel/watchdog.c | head -5 ec6a90661a0d6ce1461d05c7a58a0a151154e14a watchdog: rename watchdog_suspend() and watchdog_resume() 999bbe49ea0118b70ddf3f5d679f51dc7a97ae55 watchdog: use suspend/resume interface in fixup_ht_bug() d4bdd0b21c7652a8271f873cc755486b255c1bbd watchdog: use park/unpark functions in update_watchdog_all_cpus() 8c073d27d7ad293bf734cc8475689413afadab81 watchdog: introduce watchdog_suspend() and watchdog_resume() 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads() The new version of the update_watchdog_all_cpus() function now parks and unparks all 'watchdog/N' threads. If any watchdog parameter in /proc was changed, the new value gets picked up by the threads during unpark when watchdog_enable() is executed. The following is particularly relevant to your test case. - watchdog_disable() cancels the high-resolution timer during park - watchdog_enable() starts the high-resolution timer during unpark and updates watchdog_touch_ts[N] by calling __touch_watchdog() I believe this explains why you can only reproduce the soft lockup with kernel v4.1 but not with v4.5. Your patch "don't run proc_watchdog_update if new value is same as old" looks OK to me. I have no objections. Many Thanks, Uli Uli Thanks for the detailed explanation! As you mention my patch will mask this problem for 4.1 which is why I wanted to get it into stable. Do you think there is any way to mitigate this issue for the stable kernels (4.1 to 4.4) if the user changes the values doing something like: foo=1; while :; do echo $foo > /proc/sys/kernel/nmi_watchdog; foo=$(( ! $foo )); sleep .1; done & sleep 30 && kill %1 ? I realize this isn't a real-world use-case (I hope :)), but shows there is still a way to cause the box to soft lockup with this code path. I ask b/c these kernels are both longterm stable releases and so this issue will likely live for a while in the wild. Josh
Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error
On Tue, Nov 10, 2015 at 12:24 PM, Josh Hunt wrote: > > On Mon, Nov 9, 2015 at 4:02 PM, Peter Zijlstra wrote: >> >> On Mon, Nov 09, 2015 at 01:59:02PM -0600, gratian.cri...@ni.com wrote: >> >> > The Intel Xeon E5 processor family suffers from errata[1] BT81: >> >> > +#ifdef CONFIG_X86_TSC >> > + /* >> > + * Xeon E5 BT81 errata: TSC is not affected by warm reset. >> > + * The TSC registers for CPUs other than CPU0 are not cleared by a >> > warm >> > + * reset resulting in a constant offset error. >> > + */ >> > + if ((c->x86 == 6) && (c->x86_model == 0x3f)) >> > + set_cpu_bug(c, X86_BUG_TSC_OFFSET); >> > +#endif >> >> That's hardly a family, that's just one, Haswell server. > > > Are you actually observing this problem on this processor? > > The document you've referenced and the x86_model # above do not match up. The > errata should be for Intel processors with an x86_model value of 0x2d by my > calculations: > > Model: 1101b > Extended Model: 0010b > > The calc from cpu_detect() is: > if (c->x86 >= 0x6) > c->x86_model += ((tfms >> 16) & 0xf) << 4; > > 0x3f is a different CPU. > -- > Josh Resending, as gmail inserted html and lkml dropped the previous reply... -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error
On Tue, Nov 10, 2015 at 1:47 PM, Gratian Crisan wrote: > > The observed behavior does seem to match BT81 errata i.e. the TSC does > not get reset on warm reboots and it is otherwise stable. > If you have a simple testcase to reproduce the problem I'd be interested in seeing it. -- Josh -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Aug 19, 2019 at 4:16 PM Josh Hunt wrote: > > On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt wrote: > > > > On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt wrote: > > > > > > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner > > > wrote: > > > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner > > > > > wrote: > > > > > > > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > > > > Was there any progress made on debugging this issue? We are still > > > > > > > seeing it on 4.19.44: > > > > > > > > > > > > I haven't seen anyone looking at this. > > > > > > > > > > > > Can you please try the patch Ingo posted: > > > > > > > > > > > > https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ > > > > > > > > > > > > and if it fixes the issue decrease the value from 128 to the point > > > > > > where it > > > > > > comes back, i.e. 128 -> 64 -> 32 ... > > > > > > > > > > > > Thanks, > > > > > > > > > > > > tglx > > > > > > > > > > I just checked the machines where this problem occurs and they're both > > > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines. > > > > > Please let me know if I misread the patch or if what I'm seeing is a > > > > > different issue than the one Cong originally reported. > > > > > > > > Find the NHM hack below. > > > > > > > > Thanks, > > > > > > > > tglx > > > > > > > > 8< > > > > > > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > > > > index 648260b5f367..93c1a4f0e73e 100644 > > > > --- a/arch/x86/events/intel/core.c > > > > +++ b/arch/x86/events/intel/core.c > > > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event > > > > *event, u64 left) > > > > return left; > > > > } > > > > > > > > +static u64 nhm_limit_period(struct perf_event *event, u64 left) > > > > +{ > > > > + return max(left, 128ULL); > > > > +} > > > > + > > > > PMU_FORMAT_ATTR(event, "config:0-7"); > > > > PMU_FORMAT_ATTR(umask, "config:8-15" ); > > > > PMU_FORMAT_ATTR(edge, "config:18" ); > > > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) > > > > x86_pmu.pebs_constraints = > > > > intel_nehalem_pebs_event_constraints; > > > > x86_pmu.enable_all = intel_pmu_nhm_enable_all; > > > > x86_pmu.extra_regs = intel_nehalem_extra_regs; > > > > + x86_pmu.limit_period = nhm_limit_period; > > > > > > > > mem_attr = nhm_mem_events_attrs; > > > > > > > Thanks Thomas. Will try this and let you know. > > > > > > -- > > > Josh > > > > Thomas > > > > I found on my setup that setting the value to 32 was the lowest value > > I could use to keep the problem from happening. Let me know if you > > want me to send a patch with the updated value, etc. > > > > I saw in the original thread from Ingo and Vince that this was seen on > > Haswell, but I checked our Haswell boxes and so far we have not > > reproduced the problem there. > > > > -- > > Josh > > I went ahead and sent this patch with the value set to 32: > https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-joh...@akamai.com/T/#u > > I wasn't sure how/who to give credit to for the change, so please > resubmit if what I did is incorrect or if you wanted to debug further. > If you decide to resubmit the patch please add my tested-by and > Bhupesh's reported-by. I'm able to reproduce the problem within about > 2 hours if there's anything else you wanted to look into before going > with this approach. > > Thanks! > -- > Josh Thomas Any thoughts on the above or the patch that I sent? Thanks! -- Josh
4.19 dwarf unwinding broken
The following commit is breaking dwarf unwinding on 4.19 kernels: commit e5adfc3e7e774ba86f7bb725c6eef5f32df8630e Author: Konstantin Khlebnikov Date: Tue Aug 7 12:09:01 2018 +0300 perf map: Synthesize maps only for thread group leader It looks like this was fixed later by: commit e8ba2906f6b9054102ad035ac9cafad9d4168589 Author: John Keeping Date: Thu Aug 15 11:01:45 2019 +0100 perf unwind: Fix libunwind when tid != pid but was not selected as a backport to 4.19. Are there plans to backport the fix? If not, could we have the breaking commit reverted in 4.19 LTS? Thanks Josh
Re: 4.19 dwarf unwinding broken
On 10/3/19 3:03 AM, Jiri Olsa wrote: On Thu, Oct 03, 2019 at 12:54:09AM -0700, Josh Hunt wrote: The following commit is breaking dwarf unwinding on 4.19 kernels: how? When doing something like: perf record -p $(cat /var/run/app.pid) -g --call-graph dwarf -F 999 -- sleep 3 with 4.19.75 perf I see things like: app_Thr00 26247 1810131.375329: 168288 cycles:ppp: app_Thr01 26767 1810131.377449: 344415 cycles:ppp: uvm:WorkerThread 26746 1810131.383052:504 cycles:ppp: 9f77cce0 _raw_spin_lock+0x10 (/boot/vmlinux-4.19.46) 9f181527 __perf_event_task_sched_in+0xf7 (/boot/vmlinux-4.19.46) 9f09a7b8 finish_task_switch+0x158 (/boot/vmlinux-4.19.46) 9f778276 __schedule+0x2f6 (/boot/vmlinux-4.19.46) 9f7787f2 schedule+0x32 (/boot/vmlinux-4.19.46) 9f77bb0a schedule_hrtimeout_range_clock+0x8a (/boot/vmlinux-4.19.46) 9f22ea12 poll_schedule_timeout.constprop.6+0x42 (/boot/vmlinux-4.19.46) 9f22eeeb do_sys_poll+0x4ab (/boot/vmlinux-4.19.46) 9f22fb7b __se_sys_poll+0x5b (/boot/vmlinux-4.19.46) 9f0023de do_syscall_64+0x4e (/boot/vmlinux-4.19.46) 9f800088 entry_SYSCALL_64+0x68 (/boot/vmlinux-4.19.46) --- and with 4.19.75 perf with e5adfc3e7e77 reverted those empty call stacks go away and also other call stacks show more thread details: uvm:WorkerThread 26746 1810207.336391: 1 cycles:ppp: 9f181505 __perf_event_task_sched_in+0xd5 (/boot/vmlinux-4.19.46) 9f09a7b8 finish_task_switch+0x158 (/boot/vmlinux-4.19.46) 9f778276 __schedule+0x2f6 (/boot/vmlinux-4.19.46) 9f7787f2 schedule+0x32 (/boot/vmlinux-4.19.46) 9f77bb0a schedule_hrtimeout_range_clock+0x8a (/boot/vmlinux-4.19.46) 9f22ea12 poll_schedule_timeout.constprop.6+0x42 (/boot/vmlinux-4.19.46) 9f22eeeb do_sys_poll+0x4ab (/boot/vmlinux-4.19.46) 9f22fb7b __se_sys_poll+0x5b (/boot/vmlinux-4.19.46) 9f0023de do_syscall_64+0x4e (/boot/vmlinux-4.19.46) 9f800088 entry_SYSCALL_64+0x68 (/boot/vmlinux-4.19.46) 7f7ef3f5c90d [unknown] (/lib/x86_64-linux-gnu/libc-2.23.so) 3eb5c99 poll+0xc9 (inlined) 3eb5c99 colib::ipc::EventFd::wait+0xc9 (/usr/local/bin/app) 3296779 uvm::WorkerThread::run+0x129 (/usr/local/bin/app) [unknown] ([unknown]) They also look the same as earlier kernel versions we have running. In addition reading e8ba2906f6b's changelog sounded very similar to what I was seeing. This application launches a # of threads and is definitely already running before the invocation of perf. Thanks for looking at this. Josh jirka commit e5adfc3e7e774ba86f7bb725c6eef5f32df8630e Author: Konstantin Khlebnikov Date: Tue Aug 7 12:09:01 2018 +0300 perf map: Synthesize maps only for thread group leader It looks like this was fixed later by: commit e8ba2906f6b9054102ad035ac9cafad9d4168589 Author: John Keeping Date: Thu Aug 15 11:01:45 2019 +0100 perf unwind: Fix libunwind when tid != pid but was not selected as a backport to 4.19. Are there plans to backport the fix? If not, could we have the breaking commit reverted in 4.19 LTS? Thanks Josh
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt wrote: > > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner wrote: > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner > > > wrote: > > > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > > Was there any progress made on debugging this issue? We are still > > > > > seeing it on 4.19.44: > > > > > > > > I haven't seen anyone looking at this. > > > > > > > > Can you please try the patch Ingo posted: > > > > > > > > https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ > > > > > > > > and if it fixes the issue decrease the value from 128 to the point > > > > where it > > > > comes back, i.e. 128 -> 64 -> 32 ... > > > > > > > > Thanks, > > > > > > > > tglx > > > > > > I just checked the machines where this problem occurs and they're both > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines. > > > Please let me know if I misread the patch or if what I'm seeing is a > > > different issue than the one Cong originally reported. > > > > Find the NHM hack below. > > > > Thanks, > > > > tglx > > > > 8< > > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > > index 648260b5f367..93c1a4f0e73e 100644 > > --- a/arch/x86/events/intel/core.c > > +++ b/arch/x86/events/intel/core.c > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event > > *event, u64 left) > > return left; > > } > > > > +static u64 nhm_limit_period(struct perf_event *event, u64 left) > > +{ > > + return max(left, 128ULL); > > +} > > + > > PMU_FORMAT_ATTR(event, "config:0-7"); > > PMU_FORMAT_ATTR(umask, "config:8-15" ); > > PMU_FORMAT_ATTR(edge, "config:18" ); > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) > > x86_pmu.pebs_constraints = > > intel_nehalem_pebs_event_constraints; > > x86_pmu.enable_all = intel_pmu_nhm_enable_all; > > x86_pmu.extra_regs = intel_nehalem_extra_regs; > > + x86_pmu.limit_period = nhm_limit_period; > > > > mem_attr = nhm_mem_events_attrs; > > > Thanks Thomas. Will try this and let you know. > > -- > Josh Thomas I found on my setup that setting the value to 32 was the lowest value I could use to keep the problem from happening. Let me know if you want me to send a patch with the updated value, etc. I saw in the original thread from Ingo and Vince that this was seen on Haswell, but I checked our Haswell boxes and so far we have not reproduced the problem there. -- Josh
[PATCH] perf/x86/intel: restrict period on Nehalem
We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in some cases when using perf: perfevents: irq loop stuck! WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530 ... RIP: 0010:intel_pmu_handle_irq+0x37b/0x530 ... Call Trace: ? perf_event_nmi_handler+0x2e/0x50 ? intel_pmu_save_and_restart+0x50/0x50 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x6e/0x120 default_do_nmi+0x3e/0x100 do_nmi+0x102/0x160 end_repeat_nmi+0x16/0x50 ... ? native_write_msr+0x6/0x20 ? native_write_msr+0x6/0x20 intel_pmu_enable_event+0x1ce/0x1f0 x86_pmu_start+0x78/0xa0 x86_pmu_enable+0x252/0x310 __perf_event_task_sched_in+0x181/0x190 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 finish_task_switch+0x158/0x260 __schedule+0x2f6/0x840 ? hrtimer_start_range_ns+0x153/0x210 schedule+0x32/0x80 schedule_hrtimeout_range_clock+0x8a/0x100 ? hrtimer_init+0x120/0x120 ep_poll+0x2f7/0x3a0 ? wake_up_q+0x60/0x60 do_epoll_wait+0xa9/0xc0 __x64_sys_epoll_wait+0x1a/0x20 do_syscall_64+0x4e/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fdeb1e96c03 ... ---[ end trace 7a8f0b2beff82ee0 ]--- CPU#0: ctrl: CPU#0: status: 0004 CPU#0: overflow: CPU#0: fixed: 0bb0 CPU#0: pebs: CPU#0: debugctl: CPU#0: active: 0006 CPU#0: gen-PMC0 ctrl: CPU#0: gen-PMC0 count: CPU#0: gen-PMC0 left: CPU#0: gen-PMC1 ctrl: CPU#0: gen-PMC1 count: CPU#0: gen-PMC1 left: CPU#0: gen-PMC2 ctrl: CPU#0: gen-PMC2 count: CPU#0: gen-PMC2 left: CPU#0: gen-PMC3 ctrl: CPU#0: gen-PMC3 count: CPU#0: gen-PMC3 left: CPU#0: fixed-PMC0 count: CPU#0: fixed-PMC1 count: d22ebd19 CPU#0: fixed-PMC2 count: fff1 core: clearing PMU state on CPU#0 I found that a period limit of 32 was the lowest I could set it to without the problem reoccurring. The idea for the patch and approach to find the target value were suggested by Ingo and Thomas. Signed-off-by: Josh Hunt Reported-by: Bhupesh Purandare Suggested-by: Thomas Gleixner Suggested-by: Ingo Molnar Link: https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ Link: https://lore.kernel.org/lkml/alpine.deb.2.21.1908122133310.7...@nanos.tec.linutronix.de/ --- arch/x86/events/intel/core.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 648260b5f367..e4c2cb65ea50 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left) return left; } +static u64 nhm_limit_period(struct perf_event *event, u64 left) +{ + return max(left, 32ULL); +} + PMU_FORMAT_ATTR(event, "config:0-7"); PMU_FORMAT_ATTR(umask, "config:8-15" ); PMU_FORMAT_ATTR(edge, "config:18" ); @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints; x86_pmu.enable_all = intel_pmu_nhm_enable_all; x86_pmu.extra_regs = intel_nehalem_extra_regs; + x86_pmu.limit_period = nhm_limit_period; mem_attr = nhm_mem_events_attrs; -- 2.7.4
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt wrote: > > On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt wrote: > > > > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner wrote: > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner > > > > wrote: > > > > > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > > > Was there any progress made on debugging this issue? We are still > > > > > > seeing it on 4.19.44: > > > > > > > > > > I haven't seen anyone looking at this. > > > > > > > > > > Can you please try the patch Ingo posted: > > > > > > > > > > https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ > > > > > > > > > > and if it fixes the issue decrease the value from 128 to the point > > > > > where it > > > > > comes back, i.e. 128 -> 64 -> 32 ... > > > > > > > > > > Thanks, > > > > > > > > > > tglx > > > > > > > > I just checked the machines where this problem occurs and they're both > > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines. > > > > Please let me know if I misread the patch or if what I'm seeing is a > > > > different issue than the one Cong originally reported. > > > > > > Find the NHM hack below. > > > > > > Thanks, > > > > > > tglx > > > > > > 8< > > > > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > > > index 648260b5f367..93c1a4f0e73e 100644 > > > --- a/arch/x86/events/intel/core.c > > > +++ b/arch/x86/events/intel/core.c > > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event > > > *event, u64 left) > > > return left; > > > } > > > > > > +static u64 nhm_limit_period(struct perf_event *event, u64 left) > > > +{ > > > + return max(left, 128ULL); > > > +} > > > + > > > PMU_FORMAT_ATTR(event, "config:0-7"); > > > PMU_FORMAT_ATTR(umask, "config:8-15" ); > > > PMU_FORMAT_ATTR(edge, "config:18" ); > > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) > > > x86_pmu.pebs_constraints = > > > intel_nehalem_pebs_event_constraints; > > > x86_pmu.enable_all = intel_pmu_nhm_enable_all; > > > x86_pmu.extra_regs = intel_nehalem_extra_regs; > > > + x86_pmu.limit_period = nhm_limit_period; > > > > > > mem_attr = nhm_mem_events_attrs; > > > > > Thanks Thomas. Will try this and let you know. > > > > -- > > Josh > > Thomas > > I found on my setup that setting the value to 32 was the lowest value > I could use to keep the problem from happening. Let me know if you > want me to send a patch with the updated value, etc. > > I saw in the original thread from Ingo and Vince that this was seen on > Haswell, but I checked our Haswell boxes and so far we have not > reproduced the problem there. > > -- > Josh I went ahead and sent this patch with the value set to 32: https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-joh...@akamai.com/T/#u I wasn't sure how/who to give credit to for the change, so please resubmit if what I did is incorrect or if you wanted to debug further. If you decide to resubmit the patch please add my tested-by and Bhupesh's reported-by. I'm able to reproduce the problem within about 2 hours if there's anything else you wanted to look into before going with this approach. Thanks! -- Josh
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Feb 26, 2018 at 12:40 PM Andi Kleen wrote: > > > Given the HSD143 errata and its possible relevance, have you tried > > changing the magic number to 32, does it then still fix things? > > > > No real objection to the patch as such, it just needs a coherent comment > > and a tested-by tag I think. > > 128 min period will affect a lot of valid use cases with slower ticking > events. I often use smaller periods there. > > It would be better to debug this properly. > > Or at a minimum only do the limitation for the events that tick really > fast (like cycles, uops retired etc.) > > -Andi Was there any progress made on debugging this issue? We are still seeing it on 4.19.44: [ 2660.685392] [ cut here ] [ 2660.685392] perfevents: irq loop stuck! [ 2660.685392] WARNING: CPU: 1 PID: 4436 at arch/x86/events/intel/core.c:2278 intel_pmu_handle_irq+0x37b/0x530 [ 2660.685393] Modules linked in: sch_fq ip6table_raw ip6table_filter ip6_tables iptable_raw xt_TARPIT ts_bm xt_u32 xt_recent xt_string xt_set ip_set_hash_ip ip_set_hash_ipportip ip_set_hash_net dev_cstack tcp_bbr tcp_qdk netconsole aep ip6_udp_tunnel udp_tunnel dm_mod i2c_dev tcp_fast w83627ehf hwmon_vid jc42 i2c_core softdog autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 linear md_mod ext4 crc32c_generic crc16 mbcache jbd2 xt_tcpudp ipv6 iptable_filter ip_tables ip_set nfnetlink x_tables zfs(O) zunicode(O) zavl(O) icp(O) sr_mod zcommon(O) znvpair(O) spl(O) coretemp hwmon kvm_intel ipmi_devintf kvm ata_piix irqbypass crc32c_intel ipmi_msghandler i7core_edac libata lpc_ich mfd_core e1000e pcc_cpufreq [ 2660.685405] CPU: 1 PID: 4436 Comm: xx_01 Tainted: G O 4.19.44 #1 [ 2660.685405] Hardware name: Ciara Technologies [ 2660.685405] RIP: 0010:intel_pmu_handle_irq+0x37b/0x530 [ 2660.685406] Code: 00 00 bf 40 03 00 00 48 8b 40 10 e8 bf 82 9f 00 e9 f3 fc ff ff 80 3d f3 54 61 01 00 75 1a 48 c7 c7 02 00 e1 93 e8 45 8d 06 00 <0f> 0b e8 2e a9 ff ff c6 05 d7 54 61 01 01 65 4c 8b 35 5f 4f 00 6d [ 2660.685406] RSP: 0018:fe034c40 EFLAGS: 00010086 [ 2660.685407] RAX: 001c RBX: 0064 RCX: 0002 [ 2660.685407] RDX: 0003 RSI: 93e1001e RDI: 926bafa555a8 [ 2660.685407] RBP: fe034e30 R08: fff8919fe17a R09: [ 2660.685407] R10: fe034c40 R11: R12: 926bafa4f3a0 [ 2660.685408] R13: 926b93739000 R14: 0040 R15: 926bafa4f5a0 [ 2660.685408] FS: 7f0b9ccd7940() GS:926bafa4() knlGS: [ 2660.685408] CS: 0010 DS: ES: CR0: 80050033 [ 2660.685409] CR2: 7f45c3497750 CR3: 00042388e000 CR4: 07e0 [ 2660.685409] Call Trace: [ 2660.685409] [ 2660.685409] ? perf_event_nmi_handler+0x2e/0x50 [ 2660.685409] ? intel_pmu_save_and_restart+0x50/0x50 [ 2660.685410] perf_event_nmi_handler+0x2e/0x50 [ 2660.685410] nmi_handle+0x6e/0x120 [ 2660.685410] default_do_nmi+0x3e/0x100 [ 2660.685410] do_nmi+0x102/0x160 [ 2660.685410] end_repeat_nmi+0x16/0x50 [ 2660.685411] RIP: 0010:native_write_msr+0x6/0x20 [ 2660.685411] Code: c3 48 c1 e2 20 48 89 d3 8b 16 48 09 c3 48 89 de e8 bf 53 3b 00 48 89 d8 5b c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 89 f0 0f 30 <66> 66 66 66 90 c3 48 c1 e2 20 89 f6 48 09 d6 31 d2 e9 24 53 3b 00 [ 2660.685411] RSP: 0018:b04661fb7c60 EFLAGS: 0046 [ 2660.685412] RAX: 0bb0 RBX: 926b93739000 RCX: 038d [ 2660.685412] RDX: RSI: 0bb0 RDI: 038d [ 2660.685412] RBP: 000b R08: fff8919fe17a R09: 941602d0 [ 2660.685413] R10: b04661fb7bd0 R11: 0362 R12: 0008 [ 2660.685413] R13: 0001 R14: 926bafa4f5c4 R15: 0001 [ 2660.685413] ? native_write_msr+0x6/0x20 [ 2660.685413] ? native_write_msr+0x6/0x20 [ 2660.685414] [ 2660.685414] intel_pmu_enable_event+0x1ce/0x1f0 [ 2660.685414] x86_pmu_start+0x78/0xa0 [ 2660.685414] x86_pmu_enable+0x252/0x310 [ 2660.685414] __perf_event_task_sched_in+0x181/0x190 [ 2660.685415] ? __switch_to_asm+0x34/0x70 [ 2660.685415] ? __switch_to_asm+0x40/0x70 [ 2660.685415] ? __switch_to_asm+0x34/0x70 [ 2660.685415] ? __switch_to_asm+0x40/0x70 [ 2660.685416] finish_task_switch+0x158/0x260 [ 2660.685416] __schedule+0x2f6/0x840 [ 2660.685416] ? hrtimer_start_range_ns+0x153/0x210 [ 2660.685416] schedule+0x32/0x80 [ 2660.685417] schedule_hrtimeout_range_clock+0x8a/0x100 [ 2660.685417] ? hrtimer_init+0x120/0x120 [ 2660.685417] ep_poll+0x2f7/0x3a0 [ 2660.685417] ? wake_up_q+0x60/0x60 [ 2660.685417] do_epoll_wait+0xa9/0xc0 [ 2660.685418] __x64_sys_epoll_wait+0x1a/0x20 [ 2660.685418] do_syscall_64+0x4e/0x110 [ 2660.685418] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 2660.685418] RIP: 0033:0x7f4d35107c03 [ 2660.685419] Code: 49 89 ca b8 e8 00 00 00 0f 05 48 3d 01 f0 ff ff 73 34 c3 48 8
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner wrote: > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > Was there any progress made on debugging this issue? We are still > > seeing it on 4.19.44: > > I haven't seen anyone looking at this. > > Can you please try the patch Ingo posted: > > https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ > > and if it fixes the issue decrease the value from 128 to the point where it > comes back, i.e. 128 -> 64 -> 32 ... > > Thanks, > > tglx I just checked the machines where this problem occurs and they're both Nehalem boxes. I think Ingo's patch would only help Haswell machines. Please let me know if I misread the patch or if what I'm seeing is a different issue than the one Cong originally reported. Thanks -- Josh
Re: Long standing kernel warning: perfevents: irq loop stuck!
On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner wrote: > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner wrote: > > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote: > > > > Was there any progress made on debugging this issue? We are still > > > > seeing it on 4.19.44: > > > > > > I haven't seen anyone looking at this. > > > > > > Can you please try the patch Ingo posted: > > > > > > https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/ > > > > > > and if it fixes the issue decrease the value from 128 to the point where > > > it > > > comes back, i.e. 128 -> 64 -> 32 ... > > > > > > Thanks, > > > > > > tglx > > > > I just checked the machines where this problem occurs and they're both > > Nehalem boxes. I think Ingo's patch would only help Haswell machines. > > Please let me know if I misread the patch or if what I'm seeing is a > > different issue than the one Cong originally reported. > > Find the NHM hack below. > > Thanks, > > tglx > > 8< > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c > index 648260b5f367..93c1a4f0e73e 100644 > --- a/arch/x86/events/intel/core.c > +++ b/arch/x86/events/intel/core.c > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, > u64 left) > return left; > } > > +static u64 nhm_limit_period(struct perf_event *event, u64 left) > +{ > + return max(left, 128ULL); > +} > + > PMU_FORMAT_ATTR(event, "config:0-7"); > PMU_FORMAT_ATTR(umask, "config:8-15" ); > PMU_FORMAT_ATTR(edge, "config:18" ); > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) > x86_pmu.pebs_constraints = > intel_nehalem_pebs_event_constraints; > x86_pmu.enable_all = intel_pmu_nhm_enable_all; > x86_pmu.extra_regs = intel_nehalem_extra_regs; > + x86_pmu.limit_period = nhm_limit_period; > > mem_attr = nhm_mem_events_attrs; > Thanks Thomas. Will try this and let you know. -- Josh
Re: Steam is broken on new kernels
On Fri, Jun 21, 2019 at 4:07 PM Pierre-Loup A. Griffais wrote: > > > > On 6/21/19 3:38 PM, Eric Dumazet wrote: > > Please look at my recent patch. > > Sorry I am travelling > > > > On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds > > mailto:torva...@linux-foundation.org>> > > wrote: > > > > On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman > > mailto:gre...@linuxfoundation.org>> wrote: > > > > > > What specific commit caused the breakage? > > > > Both on reddit and on github there seems to be confusion about whether > > it's a problem or not. Some people have it working with the exact same > > kernel that breaks for others. > > > > And then some people seem to say it works intermittently for them, > > which seems to indicate a timing issue. > > > > Looking at the SACK patches (assuming it's one of them), I'd suspect > > the "tcp: tcp_fragment() should apply sane memory limits". > > > > Eric, that one does > > > > if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) { > > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG); > > return -ENOMEM; > > } > > > > but I think it's *normal* for "sk_wmem_queued >> 1" to be around the > > same size as sk_sndbuf. So if there is some fragmentation, and we add > > more skb's to it, that would seem to trigger fairly easily. > > Particularly since this is all in "truesize" units, which can be a lot > > bigger than the packets themselves. > > > > I don't know the code, so I may be out to lunch and barking up > > completely the wrong tree, but that particular check does seem like it > > might trigger much more easily than I think the code _intended_ it to > > trigger? > > > > Pierre-Loup - do you guys have a test-case inside of valve? Or is this > > purely "we see some people with problems"? > > Definitely the latter, although the volume of complaints clearly points > to a real problem from our experience. Reproducing locally, bisecting > and testing possible fixes is just now starting on our end. > > I agree not all users seem affected; most affected people report success > by using -tcp to launch Steam, which makes it use direct TCP instead of > WebSockets, our current default connection method for Linux. > > Thanks, > - Pierre-Loup > > > > > Linus > > > I asked on the github thread if users seeing the problem could check the new wqueue too big counter: https://github.com/ValveSoftware/steam-for-linux/issues/6326 So far one person is seeing the counter increase when they see the problem, and another that doesn't see the problem has the counter at 0. Obviously not a great sample size, but hopefully more will report. If nothing else, someone is seeing the counter increase while trying to connect to steam. -- Josh
[PATCH v2] perf tools: allow map files to specify DSO
Add the ability to specify a DSO in the /tmp/perf-.map file. The DSO should be the first line in the file and readable by the running user. If a valid DSO is found all other contents of the file will be ignored. This allows things like callchain unwinding with DWARF to work. Suggested-by: Wang Nan Signed-off-by: Josh Hunt --- We have an application which uses huge pages for its text section, but still needs the ability to do callchain unwinding with DWARF. We use the perf-.map file setup to do symbol resolution and that works great, but callchain unwinding fails. A few months ago I mentioned this to Wang Nan and he suggested a way around this problem could be to specify the path of the DSO in the map file. The attached patch is my initial hack at this. Running with this patch I can now get full callchain unwinding with DWARF. FWIW LBR + map file works with callchains, but unfortunately there are some cases where we still need DWARF. I submitted an RFC as an earlier draft: https://lkml.org/lkml/2018/4/24/1070 v2: 1. Rebase RFC patch to git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core 2. Update jit-interface.txt as per Arnaldo's request tools/perf/Documentation/jit-interface.txt | 8 +--- tools/perf/util/map.c | 32 ++ 2 files changed, 37 insertions(+), 3 deletions(-) diff --git a/tools/perf/Documentation/jit-interface.txt b/tools/perf/Documentation/jit-interface.txt index a8656f564915..8a25b979802f 100644 --- a/tools/perf/Documentation/jit-interface.txt +++ b/tools/perf/Documentation/jit-interface.txt @@ -3,13 +3,15 @@ by a JIT. The JIT has to write a /tmp/perf-%d.map (%d = pid of process) file -This is a text file. +This is a text file and can have one of the following formats: -Each line has the following format, fields separated with spaces: +1) Each line has the following format, fields separated with spaces: START SIZE symbolname START and SIZE are hex numbers without 0x. symbolname is the rest of the line, so it could contain special characters. -The ownership of the file has to match the process. +2) A single line with the full pathname of the DSO to use. + +For both formats, the ownership of the file has to match the process. diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c index c8fe836e4c3c..da3050b18e34 100644 --- a/tools/perf/util/map.c +++ b/tools/perf/util/map.c @@ -139,6 +139,31 @@ void map__init(struct map *map, u64 start, u64 end, u64 pgoff, struct dso *dso) refcount_set(&map->refcnt, 1); } +static bool replace_anon(char *mapfilename) +{ + FILE *file = NULL; + bool ret = false; + + file = fopen(mapfilename, "r"); + if (file != NULL) { + char *line = NULL; + size_t line_len, linesz=0; + + line_len = getline(&line, &linesz, file); + if (line_len > 0) { + line[line_len-1] = '\0'; /* null terminate */ + if (!access(line, R_OK)) { + strlcpy(mapfilename, line, line_len); + ret = true; + } + } + free(line); + fclose(file); + } + + return ret; +} + struct map *map__new(struct machine *machine, u64 start, u64 len, u64 pgoff, u32 d_maj, u32 d_min, u64 ino, u64 ino_gen, u32 prot, u32 flags, char *filename, @@ -170,6 +195,13 @@ struct map *map__new(struct machine *machine, u64 start, u64 len, snprintf(newfilename, sizeof(newfilename), "/tmp/perf-%d.map", nsi->pid); filename = newfilename; + /* +* Check to see if map file references DSO to use, if so, use it. +*/ + if (anon && replace_anon(newfilename)) { + anon = 0; + filename = newfilename; + } } if (android) { -- 1.9.1
Re: [PATCH v2] perf tools: allow map files to specify DSO
On 05/07/2018 11:40 AM, Andi Kleen wrote: On Mon, May 07, 2018 at 02:24:16PM -0400, Josh Hunt wrote: Add the ability to specify a DSO in the /tmp/perf-.map file. The DSO should be the first line in the file and readable by the running user. If a valid DSO is found all other contents of the file will be ignored. This allows things like callchain unwinding with DWARF to work. FWIW it's ok, but also obsolete with Kirill's large-pages-in-tmpfs work in newer kernels. With that you can just copy the executable into a 2MB tmpfs and disable the manual huge page copying and everything should work as usually. So essentially it's only a hack for old kernels and old binaries. But doesn't hurt I guess. Ah, very interesting. I wasn't aware of this. Can you point me to some more details on this process? Thanks Josh
[PATCH] perf probe: Ignore vmlinux build-id when offline vmlinux and dry-run used
Commit e50243bb ("perf probe: Ignore vmlinux Build-id when offline vmlinux given") added the ability to ignore vmlinux when an offline vmlinux is passed and no l, d, or a flag is also passed. This extends that to ignore build-id when -n/--dry-run is used. This way we can gather probe information for offline kernels using the method Brendan Gregg mentions in this post: http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html . Ex: $ ./perf probe -k ./vmlinux-3.14.59 -nv 'do_sys_open=do_sys_open:7 filename:string dfd flags op' will generate: do_sys_open+91 filename_string=+0(%si):string dfd=%di:s32 flags=%bx:s32 op=-60(%bp):u64 and then copy/paste this to the machine running the 3.14.59 kernel. Signed-off-by: Josh Hunt --- tools/perf/builtin-probe.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c index f87996b..e0eea723 100644 --- a/tools/perf/builtin-probe.c +++ b/tools/perf/builtin-probe.c @@ -616,7 +616,7 @@ static int perf_del_probe_events(struct strfilter *filter) * nor change running kernel. So if user gives offline vmlinux, * ignore its buildid. */ - if (!strchr("lda", params.command) && symbol_conf.vmlinux_name) + if ((!strchr("lda", params.command) || probe_event_dry_run) && symbol_conf.vmlinux_name) symbol_conf.ignore_vmlinux_buildid = true; switch (params.command) { -- 1.9.1
Re: [PATCH] perf probe: Ignore vmlinux build-id when offline vmlinux and dry-run used
On 01/25/2017 08:37 PM, Masami Hiramatsu wrote: On Tue, 24 Jan 2017 16:41:28 -0500 Josh Hunt wrote: Commit e50243bb ("perf probe: Ignore vmlinux Build-id when offline vmlinux given") added the ability to ignore vmlinux when an offline vmlinux is passed and no l, d, or a flag is also passed. This extends that to ignore build-id when -n/--dry-run is used. This way we can gather probe information for offline kernels using the method Brendan Gregg mentions in this post: http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html . Sorry, NAK. Could you use -D (--definition) for that usage? It directly gives you the definition on stdout, and also ignore the buildid. Ahhh! I guess I totally missed the point of that flag. You're right. It works exactly the way I want :) Thanks. Please ignore my patch. Josh
Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old
On 03/14/2016 09:34 AM, Don Zickus wrote: On Sat, Mar 12, 2016 at 06:50:26PM -0500, Joshua Hunt wrote: While working on a script to restore all sysctl params before a series of tests I found that writing any value into the /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh} causes them to call proc_watchdog_update(). Not only that, but when I wrote to these proc files in a loop I could easily trigger a soft lockup. [ 955.756196] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. [ 955.765994] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. [ 955.774619] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. [ 955.783182] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. [ 959.788319] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 30s! [swapper/4:0] [ 959.788325] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 30s! [swapper/5:0] There doesn't appear to be a reason for doing this work other every time a write occurs, so only do the work when the values change. Hi Josh, Thanks for the patch. I have no objections to it, but Uli and myself were interested in the reason for the softlockups. Uli is going to provide a test patch to see if his theory is correct. That way we fix the underlying issue and then apply your patch on top. Make sense? Yep. Sounds good. I meant to mention I didn't diagnose the soft-lockup. If you provide a patch I'm happy to test. I can also attempt to debug that part more if needed. Josh Cheers, Don Signed-off-by: Josh Hunt --- kernel/watchdog.c |9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index b3ace6e..9acb29f 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -923,6 +923,9 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write, * both lockup detectors are disabled if proc_watchdog_update() * returns an error. */ + if (old == new) + goto out; + err = proc_watchdog_update(); } out: @@ -967,7 +970,7 @@ int proc_soft_watchdog(struct ctl_table *table, int write, int proc_watchdog_thresh(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos) { - int err, old; + int err, old, new; get_online_cpus(); mutex_lock(&watchdog_proc_mutex); @@ -987,6 +990,10 @@ int proc_watchdog_thresh(struct ctl_table *table, int write, /* * Update the sample period. Restore on failure. */ + new = ACCESS_ONCE(watchdog_thresh); + if (old == new) + goto out; + set_sample_period(); err = proc_watchdog_update(); if (err) { -- 1.7.9.5
Re: [PATCH] udp6: fix UDP/IPv6 encap resubmit path
On 03/04/2016 04:47 PM, Bill Sommerfeld wrote: IPv4 interprets a negative return value from a protocol handler as a request to redispatch to a new protocol. In contrast, IPv6 interprets a negative value as an error, and interprets a positive value as a request for redispatch. UDP for IPv6 was unaware of this difference. Change __udp6_lib_rcv() to return a positive value for redispatch. Note that the socket's encap_rcv hook still needs to return a negative value to request dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to identify the byte containing the next protocol. Signed-off-by: Bill Sommerfeld --- net/ipv6/udp.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 0711f8f..fd25e44 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -922,11 +922,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable, ret = udpv6_queue_rcv_skb(sk, skb); sock_put(sk); - /* a return value > 0 means to resubmit the input, but -* it wants the return to be -protocol, or 0 -*/ + /* a return value > 0 means to resubmit the input */ if (ret > 0) - return -ret; + return ret; return 0; } This looks good to me. Thanks Bill! Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 7/2/20 2:45 AM, Paolo Abeni wrote: Hi all, On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote: Hi Cong, On 01/07/2020 21:58, Cong Wang wrote: On Wed, Jul 1, 2020 at 9:05 AM Cong Wang wrote: On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt wrote: Do either of you know if there's been any development on a fix for this issue? If not we can propose something. If you have a reproducer, I can look into this. Does the attached patch fix this bug completely? It's easier to comment if you inline the patch, but after taking a quick look it seems too simplistic. i) Are you sure you haven't got the return values on qdisc_run reversed? qdisc_run() returns true if it was able to acquire the seq lock. We need to take special action in the opposite case, so Cong's patch LGTM from a functional PoV. ii) There's a "bypass" path that skips the enqueue/dequeue operation if the queue is empty; that needs a similar treatment: after releasing seqlock it needs to ensure that another packet hasn't been enqueued since it last checked. That has been reverted with commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323 --- diff --git a/net/core/dev.c b/net/core/dev.c index 90b59fc50dc9..c7e48356132a 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, if (q->flags & TCQ_F_NOLOCK) { rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK; - qdisc_run(q); + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS) + __netif_schedule(q); I fear the __netif_schedule() call may cause performance regression to the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to collect some data. Initial results with Cong's patch look promising, so far no stalls. We will let it run over the long weekend and report back on Tuesday. Paolo - I have concerns about possible performance regression with the change as well. If you can gather some data that would be great. If things look good with our low throughput test over the weekend we can also try assessing performance next week. Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 6/23/20 6:42 AM, Michael Zhivich wrote: From: Jonas Bonn To: Paolo Abeni , "net...@vger.kernel.org" , LKML , "David S . Miller" , John Fastabend Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc Date: Fri, 11 Oct 2019 02:39:48 +0200 Message-ID: <465a540e-5296-32e7-f6a6-79942dfe2...@netrounds.com> (raw) In-Reply-To: <95c5a697932e19ebd6577b5dac4d7052fe8c4255.ca...@redhat.com> Hi Paolo, On 09/10/2019 21:14, Paolo Abeni wrote: Something alike the following code - completely untested - can possibly address the issue, but it's a bit rough and I would prefer not adding additonal complexity to the lockless qdiscs, can you please have a spin a it? We've tested a couple of variants of this patch today, but unfortunately it doesn't fix the problem of packets getting stuck in the queue. A couple of comments: i) On 5.4, there is the BYPASS path that also needs the same treatment as it's essentially replicating the behavour of qdisc_run, just without the queue/dequeue steps ii) We are working a lot with the 4.19 kernel so I backported to the patch to this version and tested there. Here the solution would seem to be more robust as the BYPASS path does not exist. Unfortunately, in both cases we continue to see the issue of the "last packet" getting stuck in the queue. /Jonas Hello Jonas, Paolo, We have observed the same problem with pfifo_fast qdisc when sending periodic small packets on a TCP flow with multiple simultaneous connections on a 4.19.75 kernel. We've been able to catch it in action using perf probes (see trace below). For qdisc = 0x900d7c247c00, skb = 0x900b72c334f0, it takes 200270us to traverse the networking stack on a system that's not otherwise busy. qdisc only resumes processing when another enqueued packet comes in, so the packet could have been stuck indefinitely. proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0x900d7c247c00 skb=0x900bfc294af0 band=2 atomic_qlen=0 proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue: (9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2 proc-19927 19927 [014] 580644.045480: probe:tcp_transmit_skb2: (9b6dc4e5) skb=0x900b72c334f0 sk=0x900d62958040 source=0x4b4e dest=0x9abe proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0x900d7c247c00 skb=0x0 band=3 atomic_qlen=0 proc-19927 19927 [014] 580644.045481: probe:ip_finish_output2: (9b6bc650) net=0x9c107c80 sk=0x900d62958040 skb=0x900b72c334f0 __func__=0x0 proc-19902 19902 [032] 580644.045481:probe:sch_direct_xmit: (9b69e570) skb=0x900bfc294af0 q=0x900d7c247c00 dev=0x900d6a14 txq=0x900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155 proc-19927 19927 [014] 580644.045481:net:net_dev_queue: dev=eth0 skbaddr=0x900b72c334f0 len=115 proc-19902 19902 [032] 580644.045482: probe:pfifo_fast_dequeue: (9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=1 proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue: (9b69d9f0) skb=0x900b72c334f0 qdisc=0x900d7c247c00 to_free=18446622925407304000 proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0x900d7c247c00 skb=0x0 band=3 atomic_qlen=0 proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: (9b69da9f) skb=0x900b72c334f0 qdisc=0x900d7c247c00 to_free=0x91d0f67ab940 atomic_qlen=1 proc-19902 19902 [032] 580644.045484: probe:__qdisc_run_2: (9b69ea5a) q=0x900d7c247c00 packets=1 proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue: (9b69d9f0) skb=0x900d98fdf6f0 qdisc=0x900d7c247c00 to_free=18446622925407304000 proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: (9b69da9f) skb=0x900d98fdf6f0 qdisc=0x900d7c247c00 to_free=0x91d0f67ab940 atomic_qlen=2 proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue: (9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=0 proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0x900d7c247c00 skb=0x900b72c334f0 band=2 atomic_qlen=1 proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue: (9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2 proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0x900d7c247c00 skb=0x900d98fdf6f0 band=2 atomic_qlen=0 proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue: (9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2 proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: (9b69d99d) qdisc=0xfff
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 7/1/20 12:58 PM, Cong Wang wrote: On Wed, Jul 1, 2020 at 9:05 AM Cong Wang wrote: On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt wrote: Do either of you know if there's been any development on a fix for this issue? If not we can propose something. If you have a reproducer, I can look into this. Does the attached patch fix this bug completely? Thanks. Hey Cong Unfortunately we don't have a reproducer that would be easy to share, however your patch makes sense to me at a high-level. We will test and let you know if this fixes our problem. Thanks! Josh
Re: [PATCH 4.9 086/104] arm64: kasan: avoid bad virt_to_pfn()
On Tue, Oct 10, 2017 at 10:31 AM, Julia Lawall wrote: > > > On Tue, 10 Oct 2017, Levin, Alexander (Sasha Levin) wrote: > >> (Cc'ed Julia) >> >> On Mon, Oct 09, 2017 at 09:33:01AM -0700, Laura Abbott wrote: >> >On 10/06/2017 08:10 PM, Levin, Alexander (Sasha Levin) wrote: >> >> We are experimenting with using neural network to aid with patch >> >> selection for stable kernel trees. There are quite a few commits that >> >> were not marked for stable, but are stable material, and we're trying >> >> to get them into their appropriate kernel trees. >> >> >> > >> >Apart from the practical which has been covered, I'd be interested >> >in hearing about the details of how this works if you can share >> >them. >> >> This work is based on Julia's work >> (https://soarsmu.github.io/papers/icse12-patch.pdf) to identify >> commits that fix bugs. >> >> Essentially, my approach to this is to extract as much information as >> possbile form the commit, including things such as: >> >> - How many times a certain word appeared in the message >> - Who is the author >> - Code metrics >> - etc >> >> In my case, I end up with about 30,000 of these "inputs", and train a >> neural network based on whether a given commit was included in a >> stable tree or not. >> >> This approach has a few drawbacks compared to the one Julia >> described in her paper: >> >> - Not every bug fixing commit ends up in stable (some end up in -rc >> fixing a bug from the current merge window). >> - Same as above, but for commits we miss and fail to add to stable. >> - Sometimes commits get added to stable even though they don't follow >> the rules at all (security fixes are a simple example). >> >> But it does seem to be effective at finding bug fixing commits that >> should be in stable. >> >> At this stage we are still trying to figure out what a "bug fixing" >> commit really is. For example, an observation we recently made was >> that the code metrics actually don't have much weight in determining >> whether a commit should be in stable or not. >> >> As we just started, I'm still experimenting with a few approaches, and >> I belive Julia is waiting for a new student to take over this, so we >> don't have any big insights to share just yet :) > > That's a good summary of the current status. Thanks! > > julia I just started noticing the AUTOSEL tags yesterday and I think that's a great idea to tag patches, but was there any thought to also putting something in the commit message this way they're easily identifiable in the git logs? I think it would be useful if there was some metadata in the commit message which identified that it was selected through some automated system. That way if I find a regression and it identifies one of these commits I can know that maybe it was chosen incorrectly, and also would allow me to alert the owner of the selection script to better help refine its selection process. Otherwise I'd have to track back through the mailing lists to see how it landed in the stable release. Just a thought. Also, thank you for trying to improve the stable kernels! -- Josh
Re: [PATCH 4.9 086/104] arm64: kasan: avoid bad virt_to_pfn()
On Thu, Nov 16, 2017 at 3:13 PM, wrote: > > It's possible, but I didn't want to add a bunch of clutter to the > commit message. Right now it's somewhat easy to track it back to > automatic selection because: > > 1. I'm signed off on all of them, so I could chime in in the case > concerns/issues arise with a patch. > 2. They all have a corresponding review request email with the > AUTOSEL marker. > I get the want to not clutter the commit logs. My comment was more directed to a few weeks or months after the patch has made it into a stable release. At that point I'm not sure the person who found the problem with the change would know to CC you on any correspondence, or even if the author of the change would know to contact you. Although I guess maybe they'd eventually track things down and report to stable, or Greg, or something else and it would eventually get back to you. -- Josh
[tip: sched/core] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files
The following commit has been merged into the sched/core branch of tip: Commit-ID: 6db12ee0456d0e369c7b59788d46e15a56ad0294 Gitweb: https://git.kernel.org/tip/6db12ee0456d0e369c7b59788d46e15a56ad0294 Author:Josh Hunt AuthorDate:Thu, 01 Apr 2021 22:58:33 -04:00 Committer: Peter Zijlstra CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00 psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt Signed-off-by: Peter Zijlstra (Intel) Acked-by: Johannes Weiner Link: https://lkml.kernel.org/r/20210402025833.27599-1-joh...@akamai.com --- kernel/sched/psi.c | 20 ++-- 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index b1b00e9..d1212f1 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v) return psi_show(m, &psi_system, PSI_CPU); } +static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *)) +{ + if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) + return -EPERM; + + return single_open(file, psi_show, NULL); +} + static int psi_io_open(struct inode *inode, struct file *file) { - return single_open(file, psi_io_show, NULL); + return psi_open(file, psi_io_show); } static int psi_memory_open(struct inode *inode, struct file *file) { - return single_open(file, psi_memory_show, NULL); + return psi_open(file, psi_memory_show); } static int psi_cpu_open(struct inode *inode, struct file *file) { - return single_open(file, psi_cpu_show, NULL); + return psi_open(file, psi_cpu_show); } struct psi_trigger *psi_trigger_create(struct psi_group *group, @@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void) { if (psi_enable) { proc_mkdir("pressure", NULL); - proc_create("pressure/io", 0, NULL, &psi_io_proc_ops); - proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops); - proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops); + proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops); + proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops); + proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops); } return 0; }
[tip: perf/urgent] perf/x86/intel: Restrict period on Nehalem
The following commit has been merged into the perf/urgent branch of tip: Commit-ID: 44d3bbb6f5e501b873218142fe08cdf62a4ac1f3 Gitweb: https://git.kernel.org/tip/44d3bbb6f5e501b873218142fe08cdf62a4ac1f3 Author:Josh Hunt AuthorDate:Mon, 19 Aug 2019 19:13:31 -04:00 Committer: Peter Zijlstra CommitterDate: Fri, 30 Aug 2019 14:27:47 +02:00 perf/x86/intel: Restrict period on Nehalem We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in some cases when using perf: perfevents: irq loop stuck! WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 intel_pmu_handle_irq+0x37b/0x530 ... RIP: 0010:intel_pmu_handle_irq+0x37b/0x530 ... Call Trace: ? perf_event_nmi_handler+0x2e/0x50 ? intel_pmu_save_and_restart+0x50/0x50 perf_event_nmi_handler+0x2e/0x50 nmi_handle+0x6e/0x120 default_do_nmi+0x3e/0x100 do_nmi+0x102/0x160 end_repeat_nmi+0x16/0x50 ... ? native_write_msr+0x6/0x20 ? native_write_msr+0x6/0x20 intel_pmu_enable_event+0x1ce/0x1f0 x86_pmu_start+0x78/0xa0 x86_pmu_enable+0x252/0x310 __perf_event_task_sched_in+0x181/0x190 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 ? __switch_to_asm+0x41/0x70 ? __switch_to_asm+0x35/0x70 finish_task_switch+0x158/0x260 __schedule+0x2f6/0x840 ? hrtimer_start_range_ns+0x153/0x210 schedule+0x32/0x80 schedule_hrtimeout_range_clock+0x8a/0x100 ? hrtimer_init+0x120/0x120 ep_poll+0x2f7/0x3a0 ? wake_up_q+0x60/0x60 do_epoll_wait+0xa9/0xc0 __x64_sys_epoll_wait+0x1a/0x20 do_syscall_64+0x4e/0x110 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fdeb1e96c03 ... Signed-off-by: Peter Zijlstra (Intel) Cc: a...@kernel.org Cc: Josh Hunt Cc: bpura...@akamai.com Cc: mi...@redhat.com Cc: jo...@redhat.com Cc: t...@linutronix.de Cc: namhy...@kernel.org Cc: alexander.shish...@linux.intel.com Link: https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-joh...@akamai.com --- arch/x86/events/intel/core.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c index 648260b..e4c2cb6 100644 --- a/arch/x86/events/intel/core.c +++ b/arch/x86/events/intel/core.c @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, u64 left) return left; } +static u64 nhm_limit_period(struct perf_event *event, u64 left) +{ + return max(left, 32ULL); +} + PMU_FORMAT_ATTR(event, "config:0-7"); PMU_FORMAT_ATTR(umask, "config:8-15" ); PMU_FORMAT_ATTR(edge, "config:18" ); @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void) x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints; x86_pmu.enable_all = intel_pmu_nhm_enable_all; x86_pmu.extra_regs = intel_nehalem_extra_regs; + x86_pmu.limit_period = nhm_limit_period; mem_attr = nhm_mem_events_attrs;