Re: [PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices

2012-11-27 Thread Josh Hunt

ping?

On 11/19/2012 08:56 PM, Josh Hunt wrote:

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before. Looking into this I found
that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change
in behavior. It's not clear to me from the commit's changelog if this change was
intentional or not. This comment still remains:
/* Don't show non-partitionable removeable devices or empty devices */
so I've decided to send a patch to restore the behavior of not printing
unpartitionable removable devices.

Thanks
Josh
---

After commit d27769ec... block: add GENHD_FL_NO_PART_SCAN, /proc/partitions
started printing removable devices with only one partition. This is different
than prior to the commit. This restores /proc/partitions to behave as it did 
before.

Signed-off-by: Josh Hunt 
---
  block/genhd.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 6cace66..6bfeb2a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -851,7 +851,7 @@ static int show_partition(struct seq_file *seqf, void *v)
char buf[BDEVNAME_SIZE];

/* Don't show non-partitionable removeable devices or empty devices */
-   if (!get_capacity(sgp) || (!disk_max_parts(sgp) &&
+   if (!get_capacity(sgp) || (!(disk_max_parts(sgp) > 1) &&
   (sgp->flags & GENHD_FL_REMOVABLE)))
return 0;
if (sgp->flags & GENHD_FL_SUPPRESS_PARTITION_INFO)


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-11 Thread Josh Hunt

[fixing lkml address]

On 09/11/2012 05:52 PM, Josh Hunt wrote:

With recent kernels we noticed that edac was reporting double the memory size on
systems running with AMD family 0Fh processors. I'm not very familiar with the
code, but this resolves it from what I can see on my systems. At least in
amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there appeared
to be redundant shifts to the left by 1 when WIDTH_128 is present.

thanks
Josh

---

Since commit 41d8bfaba70311c2fa0666554ef160ea8ffc9daf the memory size reported
by edac has been double. Adding to the shift if WIDTH_128 is set looks to be
redundant, removing.

Signed-off-by: Josh Hunt 
---
  drivers/edac/amd64_edac.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 5a297a2..c796710 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1123,7 +1123,7 @@ static int k8_dbam_to_chip_select(struct amd64_pvt *pvt, 
u8 dct,

if (pvt->ext_model >= K8_REV_F) {
WARN_ON(cs_mode > 11);
-   return ddr2_cs_size(cs_mode, dclr & WIDTH_128);
+   return ddr2_cs_size(cs_mode, 0);
}
else if (pvt->ext_model >= K8_REV_D) {
unsigned diff;



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-12 Thread Josh Hunt
On 09/12/2012 03:51 AM, Borislav Petkov wrote:
> On Tue, Sep 11, 2012 at 06:02:01PM -0500, Josh Hunt wrote:
>> On 09/11/2012 05:52 PM, Josh Hunt wrote:
>>> With recent kernels we noticed that edac was reporting double the memory 
>>> size on
>>> systems running with AMD family 0Fh processors. I'm not very familiar with 
>>> the
>>> code, but this resolves it from what I can see on my systems. At least in
>>> amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there appeared
>>> to be redundant shifts to the left by 1 when WIDTH_128 is present.
> 
> Any chance you could enable CONFIG_EDAC_DEBUG, boot and send me your
> whole dmesg? Privately is fine too.
> 
> Thanks.
> 

Sure. Attached. This is booting with latest git.

Josh
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 3.6.0-rc5+ (johunt@kernelsuite-780) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #33 SMP Wed Sep 12 05:10:20 PDT 2012
[0.00] Command line: root=UUID=2ceb4350-c206-4bcd-b8b3-e6dd2ff5b021 ro oops=panic panic=30 INIT_VERBOSE=yes init=/sbin/init processor.max_cstate=1 reboot_type=1 nmi_watchdog=2,120 
[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0009fbff] usable
[0.00] BIOS-e820: [mem 0x0009fc00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e8000-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xedfe] usable
[0.00] BIOS-e820: [mem 0xedff-0xedffdfff] ACPI data
[0.00] BIOS-e820: [mem 0xedffe000-0xedff] ACPI NVS
[0.00] BIOS-e820: [mem 0xfec0-0xfec02fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff70-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x000111ff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] DMI present.
[0.00] DMI: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  09/05/2007
[0.00] e820: update [mem 0x-0x] usable ==> reserved
[0.00] e820: remove [mem 0x000a-0x000f] usable
[0.00] No AGP bridge found
[0.00] e820: last_pfn = 0x112000 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-E uncachable
[0.00]   F-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask FF write-back
[0.00]   1 base 01 mask FFF000 write-back
[0.00]   2 base 011000 mask FFFE00 write-back
[0.00]   3 base 00EE00 mask FFFE00 uncachable
[0.00]   4 base 00F000 mask FFF000 uncachable
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] TOM2: 00011200 aka 4384M
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] e820: update [mem 0xee00-0x] usable ==> reserved
[0.00] e820: last_pfn = 0xedff0 max_arch_pfn = 0x4
[0.00] found SMP MP-table at [mem 0x000ff780-0x000ff78f] mapped at [880ff780]
[0.00] initial memory mapped: [mem 0x-0x1fff]
[0.00] Base memory trampoline at [88097000] 97000 size 24576
[0.00] init_memory_mapping: [mem 0x-0xedfe]
[0.00]  [mem 0x-0xeddf] page 2M
[0.00]  [mem 0xede0-0xedfe] page 4k
[0.00] kernel direct mapping tables up to 0xedfe @ [mem 0x1f88b000-0x1fff]
[0.00] init_memory_mapping: [mem 0x1-0x111ff]
[0.00]  [mem 0x1-0x111ff] page 2M
[0.00] kernel direct mapping tables up to 0x111ff @ [mem 0xedfea000-0xedfe]
[0.00] RAMDISK: [mem 0x37511000-0x37fe]
[0.00] ACPI: RSDP 000f9e40 00014 (v00 ACPIAM)
[0.00] ACPI: RSDT edff 00030 (v01 A M I  OEMRSDT  09000705 MSFT 0097)
[0.00] ACPI: FACP edff0200 00084 (v02 A M I  OEMFACP  09000705 MSFT 0097)
[0.00] ACPI: DSDT edff0410 030C5 (v01  1SSL2 1SSL2005 0005 INTL 02002026)
[0.00] ACPI: FACS edffe000 00040
[0.00] ACPI: APIC edff0390 00076 (v01 A M I  OEMAPIC  09000705 MSFT 0097)
[0.00] ACPI: OEMB edffe040 00056 (v01 A M I  AMI_OEM  09000705 MSFT 0097)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] Scanning NUMA topology in Northbridge 24
[0.00] No NUMA configuration found
[0.00] Faking a node at [mem 0x-0x000111fff

Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-12 Thread Josh Hunt
On 09/12/2012 07:38 AM, Josh Hunt wrote:
> On 09/12/2012 03:51 AM, Borislav Petkov wrote:
>> On Tue, Sep 11, 2012 at 06:02:01PM -0500, Josh Hunt wrote:
>>> On 09/11/2012 05:52 PM, Josh Hunt wrote:
>>>> With recent kernels we noticed that edac was reporting double the memory 
>>>> size on
>>>> systems running with AMD family 0Fh processors. I'm not very familiar with 
>>>> the
>>>> code, but this resolves it from what I can see on my systems. At least in
>>>> amd64_debug_display_dimm_sizes() and k8_dbam_to_chip_select() there 
>>>> appeared
>>>> to be redundant shifts to the left by 1 when WIDTH_128 is present.
>>
>> Any chance you could enable CONFIG_EDAC_DEBUG, boot and send me your
>> whole dmesg? Privately is fine too.
>>
>> Thanks.
>>
> 
> Sure. Attached. This is booting with latest git.
> 
> Josh

I wanted to add that we started seeing this back in 3.0. I didn't go
back any farther, but know it was not occurring in 2.6.38. The issue in
3.0 appeared to be that we shift left k8_dbam_to_chip_select() and there
was also another shift in amd64_csrow_nr_pages(). It looks like that
second shift has been replaced by a more recent patch, which actually
checks to see if the csrow is enabled and then counts nr_pages again
adding that to the first calculation, still resulting in the size being
double.

Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-12 Thread Josh Hunt

On 09/12/2012 10:30 AM, Borislav Petkov wrote:

Yes, you're basically right. Here's what I see from here:

In 2009 I added

commit 603adaf6b3e37450235f0ddb5986b961b3146a79
Author: Borislav Petkov 
Date:   Mon Dec 21 14:52:53 2009 +0100

 amd64_edac: fix K8 chip select reporting

 Fix the case when amd64_debug_display_dimm_sizes() reports only half the
 amount of DRAM on it because it doesn't account for when the single DCT
 operates in 128-bit mode and merges chip selects from different DIMMs.

which was supposed to fix a bug-report of DRAM chip selects being halved
in reporting.

But,

commit 41d8bfaba70311c2fa0666554ef160ea8ffc9daf
Author: Borislav Petkov 
Date:   Tue Jan 18 19:16:08 2011 +0100

 amd64_edac: Improve DRAM address mapping

 Drop static tables which map the bits in F2x80 to a chip select size in
 favor of functions doing the mapping with some bit fiddling. Also, add
 F15 support.


two years later reworked the whole DBAM to chip select sizes mapping for
all families. But it left in the clumsy workaround above for K8 only,
thus the double shifting.

So, long story short, reverting 603adaf6b3e37450235f0ddb5986b961b3146a79
should probably fix the issue since it is not needed anymore.

Let me run it here to make sure I'm not missing anything else.

Thanks.



Well from what I see 603ad... would only fix the case of printing the 
values correctly on boot, by removing the factor=1 shift. However, that 
is merely cosmetic as it does not affect the actual calculation of 
nr_pages. I guess maybe I wasn't completely clear before, but I see the 
doubling of the amount of memory both on boot via dmesg, but also in 
/sys/devices/system/edac/mc/mc0/size_mb and all of the csrow* subdirs 
therein.


Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-12 Thread Josh Hunt

On 09/12/2012 10:49 AM, Borislav Petkov wrote:


Yes, that's because the whole init_csrows thing has been b0rked since
forever. In your case, amd64_csrow_nr_pages() should pay attention to
the dct (second argument) which on K8 is always 0 (we have only one DCT
aka Dram ConTroller on K8) and the function should return 0 if dct is 1.
But the whole loop in init_csrows is fishy too so I'll need to rework
that properly. Oh well...

Looks like we're seeing an issue on another machine. Still 0Fh family, 
but the model is reported as 2, with cs_mode 7. This causes the machine 
to report that it has 32GB memory instead of 4GB. For this issue it 
seems like the shift logic is incorrect. Looking at the old 
ddr2_dbam_revD table it doesn't really lend itself to shifting very 
easily. Perhaps something like this (albeit very ugly - it matches the 
old table):


/* 3 to 5 */
if ( cs_mode > 2 && cs_mode < 6 )
return 32 << (cs_mode - 1);
/* 6 to 8 */
else if ( cs_mode > 5 && cs_mode < 9 )
return 32 << (cs_mode - 3);
/* 9 and 10 */
else if ( cs_mode > 8 )
return 32 << (cs_mode - 4);
/* 0 to 3 */
else
return 32 << cs_mode;

Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-12 Thread Josh Hunt

On 09/12/2012 11:48 AM, Borislav Petkov wrote:

Can I have /proc/cpuinfo and full dmesg with EDAC_DEBUG enabled from
that machine please?

Actually my apologies. I was looking at the 3.0 code. This issue is 
fixed in the latest kernel.


Sorry for the noise on that.

Josh

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-14 Thread Josh Hunt

On 09/12/2012 12:23 PM, Borislav Petkov wrote:

Ok, I have something preliminary which seems to work fine on my K8 here.
If you'd like, you can give it a run:

git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git error-queue

I've changed also debug messages, etc, so pls take a look at those and
let me know whether they're understandable, make sense, etc, etc.


Boris

Thanks to your help I was able to test your branch, but it still does 
not resolve the problem. Removal of the "factor=1" workaround fixes the 
memory size reporting on boot, but the sysfs values are still incorrect.


[   25.836264] EDAC MC: DCT0 chip selects:
[   25.836266] EDAC amd64: MC: 0:  1024MB 1:  1024MB
[   25.836398] EDAC amd64: MC: 2:  1024MB 3:  1024MB
[   25.836530] EDAC amd64: MC: 4: 0MB 5: 0MB
[   25.836662] EDAC amd64: MC: 6: 0MB 7: 0MB

root@192.168.1.1:/sys/devices/system/edac/mc/mc0# grep . *
ce_count:0
ce_noinfo_count:0
dbam:0x0022
dhar:0xee001201
dram_hole:ee00 1200 1200
max_location:csrow 7 channel 1
mc_name:K8
grep: reset_counters: Permission denied
sdram_scrub_rate:761
seconds_since_reset:276
size_mb:8192
topmem:0xee00
topmem2:0x00011200
ue_count:0
ue_noinfo_count:0

root@192.168.1.1:/sys/devices/system/edac/mc/mc0/csrow1# grep . *
ce_count:0
ch0_ce_count:0
ch0_dimm_label:mc#0csrow#0channel#1
ch1_ce_count:0
ch1_dimm_label:mc#0csrow#4channel#1
dev_type:Unknown
edac_mode:S4ECD4ED
mem_type:Unbuffered-DDR2
size_mb:2048
ue_count:0

To be sure I'm using the correct branch the last git log entry is:
9d67117feece8852570cc8ee25b68c41f8def323

I could be incorrect, but I still think there's a problem with either a) 
ddr2_cs_size() for this cpu, or b) the extra shift left when WIDTH_128 
is true.


Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] amd64_edac: Memory size reported double on processor family 0Fh

2012-09-14 Thread Josh Hunt

On 09/14/2012 07:55 AM, Josh Hunt wrote:


Thanks to your help I was able to test your branch, but it still does
not resolve the problem. Removal of the "factor=1" workaround fixes the
memory size reporting on boot, but the sysfs values are still incorrect.



Please disregard what I said earlier about the shift still being wrong. 
Looking at the dmesg more I see now that the # of pages are correctly 
reported (262144), however sysfs is still wrong.


[   25.837588] EDAC DEBUG: init_csrows: MC node: 0, csrow: 0
[   25.837589] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 0, channel: 0, 
DBAM idx: 2

[   25.837591] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
[   25.837592] EDAC amd64: CS0: Unbuffered DDR2 RAM
[   25.837724] EDAC DEBUG: init_csrows: Total csrow0 pages: 262144
[   25.837725] DBG: init_csrows: channel_count:2
[   25.837856] DBG: init_csrows: channel_count:2
[   25.837988] EDAC DEBUG: init_csrows: MC node: 0, csrow: 1
[   25.837989] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 1, channel: 0, 
DBAM idx: 2

[   25.837991] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
[   25.837992] EDAC amd64: CS1: Unbuffered DDR2 RAM
[   25.838157] EDAC DEBUG: init_csrows: Total csrow1 pages: 262144
[   25.838158] DBG: init_csrows: channel_count:2
[   25.838289] DBG: init_csrows: channel_count:2
[   25.838421] EDAC DEBUG: init_csrows: MC node: 0, csrow: 2
[   25.838422] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 2, channel: 0, 
DBAM idx: 2

[   25.838424] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
[   25.838425] EDAC amd64: CS2: Unbuffered DDR2 RAM
[   25.838556] EDAC DEBUG: init_csrows: Total csrow2 pages: 262144
[   25.838558] DBG: init_csrows: channel_count:2
[   25.838689] DBG: init_csrows: channel_count:2
[   25.838820] EDAC DEBUG: init_csrows: MC node: 0, csrow: 3
[   25.838822] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 3, channel: 0, 
DBAM idx: 2

[   25.838823] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
[   25.838824] EDAC amd64: CS3: Unbuffered DDR2 RAM
[   25.838957] EDAC DEBUG: init_csrows: Total csrow3 pages: 262144

I looked into this and see that sysfs is doing the double counting b/c 
it loops over the # of channels:


[  131.423949] DBG: csrow_size_show: i:0 nr_pages:262144 nr_channels:2
[  131.424112] DBG: csrow_size_show: i:1 nr_pages:524288 nr_channels:2

I verified this in init_csrows:
[   25.838958] DBG: init_csrows: channel_count:2

Since I don't know the details of the hardware here it's hard for me to 
suggest a fix, but it would seem that k8_early_channel_count() needs to 
be modified to only return 1 in this case?


Josh



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices

2012-12-03 Thread Josh Hunt

On 12/03/2012 06:06 PM, Andrew Morton wrote:

On Mon, 19 Nov 2012 18:56:49 -0800
Josh Hunt  wrote:


We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before. Looking into this I found
that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change
in behavior. It's not clear to me from the commit's changelog if this change was
intentional or not. This comment still remains:
/* Don't show non-partitionable removeable devices or empty devices */
so I've decided to send a patch to restore the behavior of not printing
unpartitionable removable devices.


d27769ec was merged in August 2011, so I after all this time, your fix
could be viewed as "changing existing behaviour".

So perhaps it would be best to leave things alone.  Is there any
particular problem with the post-Aug, 2011 behaviour?



We caught this by a script that parses /proc/partitions and made some 
assumptions about the contents therein. It had worked fine up until when 
this behavior changed. We were able to modify our script to get what we 
needed.


The patch was meant to do two things: 1) understand if this was an 
unintended change and 2) if so, propose a solution to resolve it. Since 
the comment was left in the source I believe either a) my patch should 
be applied or b) a new patch with the comment removed should be put in 
since it's no longer correct. I did not think this type of change to 
kernel abi was generally acceptable.


While the commit is over a year old, it changes behavior which had been 
in tact for a while (years?) from what I can tell. We were running 3.0 
with stable updates until we upgraded to 3.2 and hit this. Neither of 
these are what I would consider "old" kernels.


Thanks for looking at this.

Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] block: Restore /proc/partitions to not display non-partitionable removable devices

2012-11-19 Thread Josh Hunt
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before. Looking into this I found
that commit d27769ec... block: add GENHD_FL_NO_PART_SCAN introduces this change
in behavior. It's not clear to me from the commit's changelog if this change was
intentional or not. This comment still remains:
/* Don't show non-partitionable removeable devices or empty devices */
so I've decided to send a patch to restore the behavior of not printing
unpartitionable removable devices.

Thanks
Josh
---

After commit d27769ec... block: add GENHD_FL_NO_PART_SCAN, /proc/partitions
started printing removable devices with only one partition. This is different
than prior to the commit. This restores /proc/partitions to behave as it did 
before.

Signed-off-by: Josh Hunt 
---
 block/genhd.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 6cace66..6bfeb2a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -851,7 +851,7 @@ static int show_partition(struct seq_file *seqf, void *v)
char buf[BDEVNAME_SIZE];
 
/* Don't show non-partitionable removeable devices or empty devices */
-   if (!get_capacity(sgp) || (!disk_max_parts(sgp) &&
+   if (!get_capacity(sgp) || (!(disk_max_parts(sgp) > 1) &&
   (sgp->flags & GENHD_FL_REMOVABLE)))
return 0;
if (sgp->flags & GENHD_FL_SUPPRESS_PARTITION_INFO)
-- 
1.7.0.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Linux 4.1.28

2016-07-29 Thread Josh Hunt
On Fri, Jul 15, 2016 at 6:54 AM, Sasha Levin  wrote:
> On 07/15/2016 07:38 AM, Thomas Voegtle wrote:
>> On Wed, 13 Jul 2016, Sasha Levin wrote:
>>
>>> I'm announcing the release of the 4.1.28 kernel.
>>
>> I have a serious memleak with 4.1.28 (like 20mb/s)
>> I stripped down my kernel config and started a bisect, which came to:
>>
>> # first bad commit: [c5ad33184354260be6d05de57e46a5498692f6d6] mm/swap.c:
>> flush lru pvecs on compound page arrival
>> =>
>> commit c5ad33184354260be6d05de57e46a5498692f6d6
>> Author: Lukasz Odzioba 
>> Date:   Fri Jun 24 14:50:01 2016 -0700
>>
>> mm/swap.c: flush lru pvecs on compound page arrival
>>
>>
>> Reverting this on top 4.1.28 helps. Config attached.
>
> Yup, this was reported and a fix is already queued for 4.1.29.

4.1.28 is unusable for us. Do you have an ETA when 4.1.29 will be
available? Also, is the iptables problem reported by Thomas Voegtle
fixed as well?

I do not see any iptables-related changes in the 4.1.y-queue:
https://git.kernel.org/cgit/linux/kernel/git/sashal/linux-stable.git/log/?h=linux-4.1.y-queue

Thanks!
-- 
Josh


vector space exhaustion on 4.14 LTS kernels

2018-11-19 Thread Josh Hunt

Hi Thomas

We have a class of machines that appear to be exhausting the vector 
space on cpus 0 and 1 which causes some breakage later on when trying to 
set the affinity. The boxes are running the 4.14 LTS kernel.


I instrumented 4.14 and here's what I see:

[   28.328849] __assign_irq_vector: irq:512 cpu:0 mask:ff, 
onlinemask:ff, vector:0
[   28.329847] __assign_irq_vector: irq:512 cpu:2 vector:222 cfgvect:0 
off:14 old_domain:00, domain:00, 
vector_search:00,0004 update

[   28.329847] default_cpu_mask_to_apicid: irq:512 mask:00,0004
...
[   31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff, 
onlinemask:ff, vector:222
[   31.729154] __assign_irq_vector: irq:512 cpu:0 mask:ff, 
vector_cpumask:00,0001 vector:222

...
[   31.729154] __assign_irq_vector: irq:512 cpu:2 vector:00,0004 
domain:00,0004 success
[   31.729154] default_cpu_mask_to_apicid: irq:512 hwirq:512 
mask:00,0004

[   31.729154] apic_set_affinity: irq:512 mask:ff, err:0
...
[   32.818152] mlx5_irq_set_affinity_hint: 0: irq:512 mask:00,0001
...
[   39.531242] __assign_irq_vector: irq:512 cpu:0 mask:00,0001 
onlinemask:ff, vector:222
[   39.531244] __assign_irq_vector: irq:512 cpu:0 mask:00,0001 
vector_cpumask:00,0001 vector:222
[   39.531245] __assign_irq_vector: irq:512 cpu:0 vector:00,0001 
domain:00,0004

...
[   39.531384] __assign_irq_vector: irq:512 cpu:0 vector:37 
current_vector:37 next_cpu2
[   39.531385] __assign_irq_vector: irq:512 cpu:128 searched:00,0001 
vector:00, continue

[   39.531386] apic_set_affinity: irq:512 mask:00,0001 err:-28

The affinity values:

root@172.25.48.208:/proc/irq/512# grep . *
affinity_hint:00,0001
effective_affinity:00,0004
effective_affinity_list:2
grep: mlx5_comp0@pci::65:00.1: Is a directory
node:0
smp_affinity:ff,
smp_affinity_list:0-39
spurious:count 3
spurious:unhandled 0
spurious:last_unhandled 0 ms

I noticed your change, a0c9259dc4e1 "irq/matrix: Spread interrupts on 
allocation", and this sounds like what we're hitting. Booting 4.19 does 
not have this problem. I haven't booted 4.15 yet, but can do it to 
confirm the above commit is what resolves this.


Since 4.14 doesn't have the matrix allocator it's not a trivial 
backport. I was wondering a) if you agree with my assessment and b) if 
there's any plans on resolving this on the 4.14 allocator? If not I can 
attempt to backport the idea to 4.14 to spread the interrupts around on 
allocation.


Thanks
Josh









Re: [PATCH 4.14 106/143] sch_netem: restore skb->dev after dequeuing from the rbtree

2018-11-02 Thread Josh Hunt
On Fri, Nov 2, 2018 at 12:00 PM Greg Kroah-Hartman
 wrote:
>
> 4.14-stable review patch.  If anyone has any objections, please let me know.
>
> --
>
> Upstream commit bffa72cf7f9d ("net: sk_buff rbnode reorg") got
> backported as commit 6b921536f170 ("net: sk_buff rbnode reorg") into the
> v4.14.x-tree.
>
> However, the backport does not include the changes in sch_netem.c
>
> We need these, as otherwise the skb->dev pointer is not set when
> dequeueing from the netem rbtree, resulting in a panic:
>
> [   15.427748] BUG: unable to handle kernel NULL pointer dereference at 
> 00d0
> [   15.428863] IP: netif_skb_features+0x24/0x230
> [   15.429402] PGD 0 P4D 0
> [   15.429733] Oops:  [#1] SMP PTI
> [   15.430169] Modules linked in:
> [   15.430614] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.77.mptcp #77
> [   15.431497] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 0.5.1 01/01/2011
> [   15.432568] task: 88042db19680 task.stack: c907
> [   15.433356] RIP: 0010:netif_skb_features+0x24/0x230
> [   15.433977] RSP: 0018:88043fd83e70 EFLAGS: 00010286
> [   15.434665] RAX: 880429ad80c0 RBX: 88042bd0e400 RCX: 
> 880429ad8000
> [   15.435585] RDX:  RSI:  RDI: 
> 88042bd0e400
> [   15.436551] RBP: 88042bd0e400 R08: 88042a4b6c9c R09: 
> 0001
> [   15.437485] R10: 0004 R11:  R12: 
> 88042c70
> [   15.438393] R13: 88042c70 R14: 88042a4b6c00 R15: 
> 88042c6bb000
> [   15.439315] FS:  () GS:88043fd8() 
> knlGS:
> [   15.440314] CS:  0010 DS:  ES:  CR0: 80050033
> [   15.441084] CR2: 00d0 CR3: 00042c374000 CR4: 
> 06e0
> [   15.442016] Call Trace:
> [   15.442333]  
> [   15.442596]  validate_xmit_skb+0x17/0x270
> [   15.443134]  validate_xmit_skb_list+0x38/0x60
> [   15.443698]  sch_direct_xmit+0x102/0x190
> [   15.444198]  __qdisc_run+0xe3/0x240
> [   15.444671]  net_tx_action+0x121/0x140
> [   15.445177]  __do_softirq+0xe2/0x224
> [   15.445654]  irq_exit+0xbf/0xd0
> [   15.446072]  smp_apic_timer_interrupt+0x5d/0x90
> [   15.446654]  apic_timer_interrupt+0x7d/0x90
> [   15.447185]  
> [   15.447460] RIP: 0010:native_safe_halt+0x2/0x10
> [   15.447992] RSP: 0018:c9073f10 EFLAGS: 0282 ORIG_RAX: 
> ff10
> [   15.449008] RAX: 816667d0 RBX: 820946b0 RCX: 
> 
> [   15.449895] RDX:  RSI:  RDI: 
> 
> [   15.450768] RBP: 82026940 R08: 0004e858e5e1 R09: 
> 88042a4b6d58
> [   15.451643] R10:  R11: 00d0d56879bb R12: 
> 
> [   15.452478] R13:  R14:  R15: 
> 
> [   15.453340]  ? __sched_text_end+0x2/0x2
> [   15.453835]  default_idle+0xf/0x20
> [   15.454259]  do_idle+0x170/0x200
> [   15.454653]  cpu_startup_entry+0x14/0x20
> [   15.455142]  secondary_startup_64+0xa5/0xb0
> [   15.455715] Code: 1f 84 00 00 00 00 00 55 53 48 89 fd 48 83 ec 08 8b 87 bc 
> 00 00 00 48 8b 8f c0 00 00 00 0f b6 97 81 00 00 00 48 8b 77 10 48 01 c8 <48> 
> 8b 9
> [   15.458138] RIP: netif_skb_features+0x24/0x230 RSP: 88043fd83e70
> [   15.458933] CR2: 00d0
> [   15.459352] ---[ end trace 083925903ae60570 ]---
>
> Fixes: 6b921536f170 ("net: sk_buff rbnode reorg")
> Cc: Stephen Hemminger 
> Cc: Eric Dumazet 
> Cc: Soheil Hassas Yeganeh 
> Cc: Wei Wang 
> Cc: Willem de Bruijn 
> Signed-off-by: Christoph Paasch 
> Signed-off-by: Sasha Levin 
> ---
>  net/sched/sch_netem.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 2a2ab6bfe5d8..3d325b840802 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -624,6 +624,10 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
> skb->next = NULL;
> skb->prev = NULL;
> skb->tstamp = netem_skb_cb(skb)->tstamp_save;
> +   /* skb->dev shares skb->rbnode area,
> +* we need to restore its value.
> +*/
> +   skb->dev = qdisc_dev(sch);
>
>  #ifdef CONFIG_NET_CLS_ACT
> /*
> --
> 2.17.1

I was seeing this crash on 4.14.78 and this patch fixes it.

Thanks!
-- 
Josh


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-08-20 Thread Josh Hunt

Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:

Hi Josh,


We met possibly the same problem when testing nvidia/mellanox's
GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
can also have a try?


We also did something similar where we've switched over to using the fq 
scheduler everywhere for now. We believe the bug is in the nolock code 
which only pfifo_fast uses atm, but we've been unable to come up with a 
satisfactory solution.




Besides, our testing is pretty complex, do you have a quick test to
reproduce it?



Unfortunately we don't have a simple test case either. Our current 
reproducer is complex as well, although it would seem like we should be 
able to come up with something where you have maybe 2 threads trying to 
send on the same tx queue running pfifo_fast every few hundred 
milliseconds and not much else/no other tx traffic on that queue. IIRC 
we believe the scenario is when one thread is in the process of 
dequeuing a packet while another is enqueuing, the enqueue-er (word? :)) 
sees the dequeue is in progress and so does not xmit the packet assuming 
the dequeue operation will take care of it. However b/c the dequeue is 
in the process of completing it doesn't and the newly enqueued packet 
stays in the qdisc until another packet is enqueued pushing both out.


Given that we have a workaround with using fq or any other qdisc not 
named pfifo_fast this has gotten bumped down in priority for us. I would 
like to work on a reproducer at some point, but won't likely be for a 
few weeks :(


Josh


[PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-03-31 Thread Josh Hunt
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
---
 kernel/sched/psi.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9bd7ed..98ff7baf1ba8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1270,6 +1270,9 @@ static ssize_t psi_write(struct file *file, const char 
__user *user_buf,
if (!nbytes)
return -EINVAL;
 
+   if (!capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
buf_size = min(nbytes, sizeof(buf));
if (copy_from_user(buf, user_buf, buf_size))
return -EFAULT;
@@ -1353,9 +1356,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, 
&psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
-- 
2.17.1



Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Josh Hunt

On 4/1/21 10:47 AM, Eric W. Biederman wrote:

Kees Cook  writes:


On Wed, Mar 31, 2021 at 11:36:28PM -0500, Eric W. Biederman wrote:

Josh Hunt  writes:


Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.


The test for CAP_SYS_RESOURCE really needs to be in open rather
than in write.

Otherwise a suid root executable could have stdout redirected
into these files.


Right. Or check against f_cred. (See uses of kallsyms_show_value())
https://urldefense.com/v3/__https://www.kernel.org/doc/html/latest/security/credentials.html*open-file-credentials__;Iw!!GjvTz_vk!B_aeVyHMG20VNUGx001EFKpeYlahLQHye7oS8sokXuZOhVDTtF_deDl71a_KYA$


We really want to limit checking against f_cred to those cases where we
break userspace by checking in open.  AKA the cases where we made the
mistake of putting the permission check in the wrong place and now can't
fix it.

Since this change is change the permissions that open uses already I
don't see any reason we can't perform a proper check in open.

Eric



Thank you for the feedback. I will spin a v2 doing the check in open.

Josh


[PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Josh Hunt
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
Acked-by: Johannes Weiner 
---
 kernel/sched/psi.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9bd7ed..d1212f17a898 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v)
return psi_show(m, &psi_system, PSI_CPU);
 }
 
+static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void 
*))
+{
+   if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
+   return single_open(file, psi_show, NULL);
+}
+
 static int psi_io_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_io_show, NULL);
+   return psi_open(file, psi_io_show);
 }
 
 static int psi_memory_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_memory_show, NULL);
+   return psi_open(file, psi_memory_show);
 }
 
 static int psi_cpu_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_cpu_show, NULL);
+   return psi_open(file, psi_cpu_show);
 }
 
 struct psi_trigger *psi_trigger_create(struct psi_group *group,
@@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, 
&psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }
-- 
2.17.1



Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2021-04-02 Thread Josh Hunt

On 4/2/21 12:25 PM, Jiri Kosina wrote:

On Thu, 3 Sep 2020, John Fastabend wrote:


At this point I fear we could consider reverting the NOLOCK stuff.
I personally would hate doing so, but it looks like NOLOCK benefits are
outweighed by its issues.


I agree, NOLOCK brings more pains than gains. There are many race
conditions hidden in generic qdisc layer, another one is enqueue vs.
reset which is being discussed in another thread.


Sure. Seems they crept in over time. I had some plans to write a
lockless HTB implementation. But with fq+EDT with BPF it seems that
it is no longer needed, we have a more generic/better solution.  So
I dropped it. Also most folks should really be using fq, fq_codel,
etc. by default anyways. Using pfifo_fast alone is not ideal IMO.


Half a year later, we still have the NOLOCK implementation
present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.

And we've just been bitten by this very same race which appears to be
still unfixed, with single packet being stuck in pfifo_fast qdisc
basically indefinitely due to this very race that this whole thread began
with back in 2019.

Unless there are

(a) any nice ideas how to solve this in an elegant way without
(re-)introducing extra spinlock (Cong's fix) or

(b) any objections to revert as per the argumentation above

I'll be happy to send a revert of the whole NOLOCK implementation next
week.



Jiri

If you have a reproducer can you try 
https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your 
suggestion of reverting nolock makes sense to me. We've moved to using 
fq as our default now b/c of this bug.


Josh


Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used

2013-11-26 Thread Josh Hunt
On Mon, Aug 12, 2013 at 10:09 AM, Tejun Heo  wrote:
> Hello, Jonathan.
>
> On Mon, Aug 12, 2013 at 12:04:11AM -0700, Jonathan Nieder wrote:
>> My laptop fails to boot[1] with the message 'Volume group "data" not
>> found'.  Bisects to v3.8-rc4~17 (the above commit).  Reverting that
>> commit on top of current "master" (d92581fcad18, 2013-08-10) produces
>> a working kernel.  dmesg output from that working kernel attached.
>> More details, including .config, at [2].
>>
>> Any ideas for tracking this down?
>
> Which initrd / boot script are you using?  It looks like lvm assemble
> scripts are running before sdX are detected leading to volume assembly
> failure.  Before the patch, any module loading would end up
> synchronizing async probes but after the patch modprobe invocations
> which don't schedule them won't be.  Does your boot script happen to
> run multiple modprobes in parallel and proceed to configure lvm
> without waiting for modprobes of libata drivers to finish?
>
> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

I'm also hitting a regression b/c of this patch. Booting 3.10.20 on a
number of older machines with onboard sata controllers are unable to
find their root device quickly enough. I bisected the issue down to
774a1221e862b343388347bac9b318767336b20b. Reverting it allows my
systems to boot fine. I'm seeing this with both sata_svw and ahci. I'm
using ubuntu precise userspace which is using initramfs-tools
0.99ubuntu13.1. From what I can tell the modprobes in this initrd are
done in serial, but the port probing is async. This allows init to
continue on to try and mount root, but it's not there yet.

I'm attaching some serial log output with initcall_debug enabled.

Both ahci and sata_svw call ata_host_activate(), which call
ata_host_register() and async_schedule(async_port_probe, ap).

Let me know what other information you may need.
Thanks

-- 
Josh


seriallog
Description: Binary data


Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used

2013-11-26 Thread Josh Hunt
On Tue, Nov 26, 2013 at 3:53 PM, Linus Torvalds
 wrote:
> On Tue, Nov 26, 2013 at 1:29 PM, Josh Hunt  wrote:
>>
>> Both ahci and sata_svw call ata_host_activate(), which call
>> ata_host_register() and async_schedule(async_port_probe, ap).
>
> Well, with the modern logic ("only wait for async probing if the
> module itself did async probing") the ahci and svw modules didn't
> really change any behavior.
>
> But other modules did. I wonder, for example, if people insmod the dm
> module, and expect all devices to exist afterwards. Which the old
> logic of "we always wait for all async code regardless of whether we
> started it ourselves" would do, but the new logic does not.
>
> Something similar might hit the (non-modular) md auto-detect ioctl.
>
> So maybe we should just special-case those two issues, and say "let's
> just wait for async requests here"
>
> Something like the appended (whitespace-damaged) diff. Does that make
> a difference to you guys? And if it does, can you check *which* of the
> two async_synchronize_full() calls it is that matters for your cases?
>
>  Linus
>
> --- duh, apply by hand --
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 0704c523a76b..7e7a2f743b11 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -351,6 +351,7 @@ static int __init dm_init(void)
> goto bad;
> }
>
> +   async_synchronize_full();
> return 0;
>
>bad:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a2866c9e..1d173dc662fc 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -8602,6 +8602,7 @@ static void autostart_arrays(int part)
> i_scanned = 0;
> i_passed = 0;
>
> +   async_synchronize_full();
> printk(KERN_INFO "md: Autodetecting RAID arrays.\n");
>
> while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) 
> {

I should have clarified that I'm not using dm/md in my setup. I know
the modules are getting loaded in the log I attached, but root is not
a md/dm device.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rds: Error on offset mismatch if not loopback

2013-11-12 Thread Josh Hunt
On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt  wrote:
> On Sat, Sep 22, 2012 at 2:25 PM, David Miller  wrote:
>>
>> From: John Jolly 
>> Date: Fri, 21 Sep 2012 15:32:40 -0600
>>
>> > Attempting an rds connection from the IP address of an IPoIB interface
>> > to itself causes a kernel panic due to a BUG_ON() being triggered.
>> > Making the test less strict allows rds-ping to work without crashing
>> > the machine.
>> >
>> > A local unprivileged user could use this flaw to crash the system.
>> >
>> > Signed-off-by: John Jolly 
>>
>> Besides the questions being asked of you by Venkat Venkatsubra, this
>> patch has another issue.
>>
>> It has been completely corrupted by your email client, it has
>> turned all TAB characters into spaces, making the patch useless.
>>
>> Please learn how to send a patch unmolested in the body of your
>> email.  Test it by emailing the patch to yourself, and verifying
>> that you can in fact apply the patch you receive in that email.
>> Then, and only then, should you consider making a new submission
>> of this patch.
>>
>> Use Documentation/email-clients.txt for guidance.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
>
> I think this issue was lost in the shuffle. It appears that redhat, ubuntu,
> and oracle are maintaining local patches to resolve this:
>
> https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d636852be130fa15fa8be10d4704e8
> https://bugzilla.redhat.com/show_bug.cgi?id=822754
> http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td4985388.html
>
> Given that Oracle has applied it I'll make the assumption that Venkat's
> question was answered at some point.
>
> David - I can resubmit the patch with the proper signed-off-by and
> formatting if you are willing to apply it unless John wants to try again. I
> think it's time this got upstream.
>
> --
> Josh

Ugh.. hopefully resending with all the html crap removed...

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used

2013-12-03 Thread Josh Hunt
On Tue, Nov 26, 2013 at 4:29 PM, Tejun Heo  wrote:
> Hello,
>
> On Tue, Nov 26, 2013 at 04:12:41PM -0600, Josh Hunt wrote:
>> I should have clarified that I'm not using dm/md in my setup. I know
>> the modules are getting loaded in the log I attached, but root is not
>> a md/dm device.
>
> Can you please still try it?  The init script is broken and we're now
> just trying to restore just enough of the old behavior so that the
> issue is not exposed.  The boot script in use seems to load md/dm
> modules after storage drivers and use their termination as the signal
> for "storage ready", so it could be a good enough bandaid even if
> you're not using dm/md.
>
> Thanks.
>
> --
> tejun

Tejun

You're right. Thanks for pointing this out. I did not realize there
was a bug in the init script. The version of initramfs-tools I was
using had the following bug:
https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911

Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs.

I did try using the workaround as suggested by Linus. In my setup the
dm_init() code was hit, however it still appeared to be too late at
times. I also tried moving the call to async_synchronize_full() above
the for loop and it still had the same issue (patch attached.) Out of
around 10 reboot tests it failed to find root 1 or 2 times.

The ubuntu scripts don't ever actually call do_mount() if it can't
find the device. It seems to rely on some udev functionality to tell
it when the device is present, and if that fails it just bails out.

This change has introduced a regression. However, I only noticed it
b/c my init script had a bug which caused it not to wait around for
the device to appear.

-- 
Josh
Index: linux-3.10/drivers/md/dm.c
===
--- linux-3.10.orig/drivers/md/dm.c
+++ linux-3.10/drivers/md/dm.c
@@ -16,12 +16,13 @@
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
 #define DM_MSG_PREFIX "core"
 
 #ifdef CONFIG_PRINTK
@@ -275,12 +276,15 @@ static void (*_exits[])(void) = {
 static int __init dm_init(void)
 {
 	const int count = ARRAY_SIZE(_inits);
 
 	int r, i;
 
+	printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full();\n", __func__);
+	async_synchronize_full();
+
 	for (i = 0; i < count; i++) {
 		r = _inits[i]();
 		if (r)
 			goto bad;
 	}
 
Index: linux-3.10/drivers/md/md.c
===
--- linux-3.10.orig/drivers/md/md.c
+++ linux-3.10/drivers/md/md.c
@@ -48,12 +48,13 @@
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include "md.h"
 #include "bitmap.h"
 
 #ifndef MODULE
 static void autostart_arrays(int part);
 #endif
@@ -8573,12 +8574,14 @@ static void autostart_arrays(int part)
 	dev_t dev;
 	int i_scanned, i_passed;
 
 	i_scanned = 0;
 	i_passed = 0;
 
+	printk(KERN_CRIT "DBG: %s: Calling async_synchronize_full()\n", __func__);
+	async_synchronize_full();
 	printk(KERN_INFO "md: Autodetecting RAID arrays.\n");
 
 	while (!list_empty(&all_detected_devices) && i_scanned < INT_MAX) {
 		i_scanned++;
 		node_detected_dev = list_entry(all_detected_devices.next,
 	struct detected_devices_node, list);


Re: [PATCH] printk: Fix discarding of records

2014-03-21 Thread Josh Hunt
On Sun, Feb 16, 2014 at 8:38 PM, Banerjee, Debabrata
 wrote:
> On 2/16/14, 7:41 PM, "Linus Torvalds" 
> wrote:
>
>>On Sun, Feb 16, 2014 at 4:19 PM, Banerjee, Debabrata
>> wrote:
>>>
>>> No that can't be right, the prev value after every loop is the
>>>msg->flags
>>> from the *last* line in the list, which has no relation to the *first*,
>>>so
>>> reusing it for the top of the next loop is nonsense.
>>
>>Please, Debabrata, humor me, and just try the patch.
>>
>>And try reading the source code. Because your statement is BS.
> ...
>>No, I haven't tested my patch, and maybe it's broken for some subtle
>>reason I'm missing too.
>
> Yes my explanation was wrong, your patch works for me. I assumed printing
> the prefix was desired, but if not, great.

The resulting patch from this discussion:

commit e4178d809fdaee32a56833fff1f5056c99e90a1a
Author: Linus Torvalds 
Date:   Mon Feb 17 12:24:45 2014 -0800

printk: fix syslog() overflowing user buffer

has not made it into stable yet. Can we add it to the stable queue?
I'm not sure if it needs more bake time, etc.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] panic: add TAINT_SOFTLOCKUP

2014-06-03 Thread Josh Hunt
This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the system
has been in a softlockup state when debugging.

Signed-off-by: Josh Hunt 
---
 Documentation/oops-tracing.txt  |2 ++
 Documentation/sysctl/kernel.txt |1 +
 include/linux/kernel.h  |1 +
 kernel/panic.c  |1 +
 kernel/watchdog.c   |1 +
 5 files changed, 6 insertions(+)

diff --git a/Documentation/oops-tracing.txt b/Documentation/oops-tracing.txt
index e315599..beefb9f 100644
--- a/Documentation/oops-tracing.txt
+++ b/Documentation/oops-tracing.txt
@@ -268,6 +268,8 @@ characters, each representing a particular tainted value.
  14: 'E' if an unsigned module has been loaded in a kernel supporting
  module signature.
 
+ 15: 'L' if a soft lockup has previously occurred on the system.
+
 The primary reason for the 'Tainted: ' string is to tell kernel
 debuggers if this is a clean kernel or if anything unusual has
 occurred.  Tainting is permanent: even if an offending module is
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 9886c3d..8dfdf2f 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -788,6 +788,7 @@ can be ORed together:
 4096 - An out-of-tree module has been loaded.
 8192 - An unsigned module has been loaded in a kernel supporting module
signature.
+16384 - A soft lockup has previously occurred on the system.
 
 ==
 
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4c52907..eb7b074 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -470,6 +470,7 @@ extern enum system_states {
 #define TAINT_FIRMWARE_WORKAROUND  11
 #define TAINT_OOT_MODULE   12
 #define TAINT_UNSIGNED_MODULE  13
+#define TAINT_SOFTLOCKUP   14
 
 extern const char hex_asc[];
 #define hex_asc_lo(x)  hex_asc[((x) & 0x0f)]
diff --git a/kernel/panic.c b/kernel/panic.c
index d02fa9f..d68c5d8 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -212,6 +212,7 @@ static const struct tnt tnts[] = {
{ TAINT_FIRMWARE_WORKAROUND,'I', ' ' },
{ TAINT_OOT_MODULE, 'O', ' ' },
{ TAINT_UNSIGNED_MODULE,'E', ' ' },
+   { TAINT_SOFTLOCKUP, 'L', ' ' },
 };
 
 /**
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 516203e..09ac67c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
 
if (softlockup_panic)
panic("softlockup: hung tasks");
+   add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
__this_cpu_write(soft_watchdog_warn, true);
} else
__this_cpu_write(soft_watchdog_warn, false);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] panic: add TAINT_SOFTLOCKUP

2014-06-23 Thread Josh Hunt

On 06/23/2014 05:11 PM, Andrew Morton wrote:

On Tue,  3 Jun 2014 22:12:35 -0400 Josh Hunt  wrote:


This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the system
has been in a softlockup state when debugging.

...

@@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)

if (softlockup_panic)
panic("softlockup: hung tasks");
+   add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
__this_cpu_write(soft_watchdog_warn, true);
} else
__this_cpu_write(soft_watchdog_warn, false);


Would make more sense to have applied the taint *before* calling
panic()?


Andrew

Yep, that's a good call. Thanks. Do you want me to send a v2 or did you 
take care of it?


In addition to adding the softlockup taint flag, do you think it'd be 
reasonable to add another flag for page allocation failures? I think 
it'd be nice to be able to account for these conditions somehow without 
having to parse dmesg, etc. As with the softlockup flag, it's helpful to 
know if your system had encountered a page allocation failure at some 
point before the crash or whatever you're debugging.


Thanks
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] panic: add TAINT_SOFTLOCKUP

2014-06-24 Thread Josh Hunt

On 06/23/2014 05:51 PM, Andrew Morton wrote:

On Mon, 23 Jun 2014 17:45:00 -0500 Josh Hunt  wrote:


On 06/23/2014 05:11 PM, Andrew Morton wrote:

On Tue,  3 Jun 2014 22:12:35 -0400 Josh Hunt  wrote:


This taint flag will be set if the system has ever entered a softlockup
state. Similar to TAINT_WARN it is useful to know whether or not the system
has been in a softlockup state when debugging.

...

@@ -329,6 +329,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)

if (softlockup_panic)
panic("softlockup: hung tasks");
+   add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
__this_cpu_write(soft_watchdog_warn, true);
} else
__this_cpu_write(soft_watchdog_warn, false);


Would make more sense to have applied the taint *before* calling
panic()?


Andrew

Yep, that's a good call. Thanks. Do you want me to send a v2 or did you
take care of it?


I fixed it up.


In addition to adding the softlockup taint flag, do you think it'd be
reasonable to add another flag for page allocation failures? I think
it'd be nice to be able to account for these conditions somehow without
having to parse dmesg, etc. As with the softlockup flag, it's helpful to
know if your system had encountered a page allocation failure at some
point before the crash or whatever you're debugging.


I don't know, really.  Allocation failures are often an expected thing
as drivers try to work out how much memory they can allocate.  Those
things can be screened out by testing __GFP_NOWARN.  GFP_ATOMIC
failures should probably be ignored, except for when they shouldn't.
But even then, allocation failures are somewhat common.  And recency is
a concern: an allocation failure 10 minutes ago is unlikely to be
relevant.

But that's just me waving hands around.  I'd be interested to hear from
people whose kernels crash more often than mine, and from those whose
job is to support them (ie distro people?).



Anyone you'd suggest adding to this thread to get other feedback about 
tracking page allocation failures? I could also spin up a patch and cc them.


Thanks
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] panic: add TAINT_SOFTLOCKUP

2014-06-24 Thread Josh Hunt

On 06/24/2014 07:45 PM, David Rientjes wrote:

On Tue, 24 Jun 2014, Josh Hunt wrote:


Anyone you'd suggest adding to this thread to get other feedback about
tracking page allocation failures? I could also spin up a patch and cc them.



Page allocation failures happen all the time, mostly because of
large-order allocations (more than PAGE_ALLOC_COSTLY_ORDER) or allocations
done with GFP_ATOMIC where it's impossible to reclaim or compact memory to
allocate.  Because of this, they are fairly easy to trigger from userspace
without having to do much.

Why would this qualify for a taint?  I have never debugged a kernel crash
that I traced back to an earlier page allocation failure and said "oh, if
I had only known about that page allocation failure earlier!".  If one of
them is going to cause an issue, it probably is at the point of the crash
and you shouldn't have to "investigate" much.



I guess I was thinking more of the case where all you have is the 
trace/dump and for whatever reason the last bits which may contain the 
page allocation failure info didn't get flushed to disk. In that case 
it'd be nice to know what lead up to the crash. However, I do agree with 
your point and Andrew's about the frequency and ease of triggering them 
which would make taint the wrong place to account for them.


Thanks
Josh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rds: Error on offset mismatch if not loopback

2013-11-13 Thread Josh Hunt
On Wed, Nov 13, 2013 at 9:16 AM, Venkat Venkatsubra
 wrote:
>
>
> -Original Message-
> From: Josh Hunt [mailto:joshhun...@gmail.com]
> Sent: Tuesday, November 12, 2013 10:25 PM
> To: David Miller
> Cc: jjo...@suse.com; LKML; Venkat Venkatsubra; net...@vger.kernel.org
> Subject: Re: [PATCH] rds: Error on offset mismatch if not loopback
>
> On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt  wrote:
>> On Sat, Sep 22, 2012 at 2:25 PM, David Miller  wrote:
>>>
>>> From: John Jolly 
>>> Date: Fri, 21 Sep 2012 15:32:40 -0600
>>>
>>> > Attempting an rds connection from the IP address of an IPoIB
>>> > interface to itself causes a kernel panic due to a BUG_ON() being 
>>> > triggered.
>>> > Making the test less strict allows rds-ping to work without
>>> > crashing the machine.
>>> >
>>> > A local unprivileged user could use this flaw to crash the system.
>>> >
>>> > Signed-off-by: John Jolly 
>>>
>>> Besides the questions being asked of you by Venkat Venkatsubra, this
>>> patch has another issue.
>>>
>>> It has been completely corrupted by your email client, it has turned
>>> all TAB characters into spaces, making the patch useless.
>>>
>>> Please learn how to send a patch unmolested in the body of your
>>> email.  Test it by emailing the patch to yourself, and verifying that
>>> you can in fact apply the patch you receive in that email.
>>> Then, and only then, should you consider making a new submission of
>>> this patch.
>>>
>>> Use Documentation/email-clients.txt for guidance.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>>
>> I think this issue was lost in the shuffle. It appears that redhat,
>> ubuntu, and oracle are maintaining local patches to resolve this:
>>
>> https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d63685
>> 2be130fa15fa8be10d4704e8
>> https://bugzilla.redhat.com/show_bug.cgi?id=822754
>> http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td49853
>> 88.html
>>
>> Given that Oracle has applied it I'll make the assumption that
>> Venkat's question was answered at some point.
>>
>> David - I can resubmit the patch with the proper signed-off-by and
>> formatting if you are willing to apply it unless John wants to try
>> again. I think it's time this got upstream.
>>
>> --
>> Josh
>
> Ugh.. hopefully resending with all the html crap removed...
>
> --
> Josh
>
> Hi Josh,
>
> No, I still didn't get an answer for how "off" could be non-zero in case of 
> rds-ping to hit BUG_ON(off % RDS_FRAG_SIZE).
> Because, rds-ping uses zero byte messages to ping.
> If you have a test case that reproduces the kernel panic I can try it out and 
> see how that can happen.
> The Oracle's internal code I checked doesn't have that patch applied.
>
> Venkat

No I don't have a test case. I came across this CVE while doing an
audit and noticed it was patched in Ubuntu's kernel and other distros,
but was not in the upstream kernel yet. Quick googling of lkml showed
that there were at least two attempts to get this patch upstream, but
both had issues due to not following the proper submission process:

https://lkml.org/lkml/2012/10/22/433
https://lkml.org/lkml/2012/9/21/505

>From my searching it appears the initial bug was found by someone at redhat:
https://bugzilla.redhat.com/show_bug.cgi?id=822754

I've added Li Honggang the reporter of this issue from Redhat to the
mail. Hopefully he can share his testcase.

and possibly requires certain hardware as Jay writes in the first link above:
"...some Infiniband HCAs(QLogic, possibly others) the machine will panic..."

I was referring to this oracle commit:
https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d636852be130fa15fa8be10d4704e8

I have no experience with this code. There were a few comments around
the reset and xmit fns about making sure the caller did certain things
if not they were racy, but I have no idea if that's coming into play
here.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] rds: fix local ping DoS

2013-11-13 Thread Josh Hunt
The rds_ib_xmit function in net/rds/ib_send.c in the Reliable Datagram Sockets
(RDS) protocol implementation allows local users to cause a denial of service
(BUG_ON and kernel panic) by establishing an RDS connection with the source
IP address equal to the IPoIB interface's own IP address, as demonstrated by
rds-ping.

A local unprivileged user could use this flaw to crash the system.

CVE-2012-2372

Reported-by: Honggang Li 
Signed-off-by: Josh Hunt 
---
 net/rds/ib_send.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index e590949..7920c85 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -544,7 +544,7 @@ int rds_ib_xmit(struct rds_connection *conn, struct 
rds_message *rm,
int flow_controlled = 0;
int nr_sig = 0;
 
-   BUG_ON(off % RDS_FRAG_SIZE);
+   BUG_ON(!conn->c_loopback && off % RDS_FRAG_SIZE);
BUG_ON(hdr_off != 0 && hdr_off != sizeof(struct rds_header));
 
/* Do not send cong updates to IB loopback */
-- 
1.7.0.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rds: Error on offset mismatch if not loopback

2013-11-13 Thread Josh Hunt
On Wed, Nov 13, 2013 at 6:55 PM, Honggang LI  wrote:
> On 11/14/2013 01:40 AM, Josh Hunt wrote:
>> On Wed, Nov 13, 2013 at 9:16 AM, Venkat Venkatsubra
>>  wrote:
>>>
>>> -Original Message-
>>> From: Josh Hunt [mailto:joshhun...@gmail.com]
>>> Sent: Tuesday, November 12, 2013 10:25 PM
>>> To: David Miller
>>> Cc: jjo...@suse.com; LKML; Venkat Venkatsubra; net...@vger.kernel.org
>>> Subject: Re: [PATCH] rds: Error on offset mismatch if not loopback
>>>
>>> On Tue, Nov 12, 2013 at 10:22 PM, Josh Hunt  wrote:
>>>> On Sat, Sep 22, 2012 at 2:25 PM, David Miller  wrote:
>>>>> From: John Jolly 
>>>>> Date: Fri, 21 Sep 2012 15:32:40 -0600
>>>>>
>>>>>> Attempting an rds connection from the IP address of an IPoIB
>>>>>> interface to itself causes a kernel panic due to a BUG_ON() being 
>>>>>> triggered.
>>>>>> Making the test less strict allows rds-ping to work without
>>>>>> crashing the machine.
>>>>>>
>>>>>> A local unprivileged user could use this flaw to crash the system.
>>>>>>
>>>>>> Signed-off-by: John Jolly 
>>>>> Besides the questions being asked of you by Venkat Venkatsubra, this
>>>>> patch has another issue.
>>>>>
>>>>> It has been completely corrupted by your email client, it has turned
>>>>> all TAB characters into spaces, making the patch useless.
>>>>>
>>>>> Please learn how to send a patch unmolested in the body of your
>>>>> email.  Test it by emailing the patch to yourself, and verifying that
>>>>> you can in fact apply the patch you receive in that email.
>>>>> Then, and only then, should you consider making a new submission of
>>>>> this patch.
>>>>>
>>>>> Use Documentation/email-clients.txt for guidance.
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>> linux-kernel" in the body of a message to majord...@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> Please read the FAQ at  http://www.tux.org/lkml/
>>>>
>>>> I think this issue was lost in the shuffle. It appears that redhat,
>>>> ubuntu, and oracle are maintaining local patches to resolve this:
>>>>
>>>> https://oss.oracle.com/git/?p=redpatch.git;a=commit;h=c7b6a0a1d8d63685
>>>> 2be130fa15fa8be10d4704e8
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=822754
>>>> http://ubuntu.5.x6.nabble.com/CVE-2012-2372-RDS-local-ping-DOS-td49853
>>>> 88.html
>>>>
>>>> Given that Oracle has applied it I'll make the assumption that
>>>> Venkat's question was answered at some point.
>>>>
>>>> David - I can resubmit the patch with the proper signed-off-by and
>>>> formatting if you are willing to apply it unless John wants to try
>>>> again. I think it's time this got upstream.
>>>>
>>>> --
>>>> Josh
>>> Ugh.. hopefully resending with all the html crap removed...
>>>
>>> --
>>> Josh
>>>
>>> Hi Josh,
>>>
>>> No, I still didn't get an answer for how "off" could be non-zero in case of 
>>> rds-ping to hit BUG_ON(off % RDS_FRAG_SIZE).
>>> Because, rds-ping uses zero byte messages to ping.
>>> If you have a test case that reproduces the kernel panic I can try it out 
>>> and see how that can happen.
>>> The Oracle's internal code I checked doesn't have that patch applied.
>>>
>>> Venkat
>> No I don't have a test case. I came across this CVE while doing an
>> audit and noticed it was patched in Ubuntu's kernel and other distros,
>> but was not in the upstream kernel yet. Quick googling of lkml showed
>> that there were at least two attempts to get this patch upstream, but
>> both had issues due to not following the proper submission process:
>>
>> https://lkml.org/lkml/2012/10/22/433
>> https://lkml.org/lkml/2012/9/21/505
>>
>> From my searching it appears the initial bug was found by someone at redhat:
>> https://bugzilla.redhat.com/show_bug.cgi?id=822754
>>
>> I've added Li Honggang the reporter of this issue from Redhat to the
>> mail. Hopefully he can share his testcase.
> The test case is very simple:
> Steps to Reproduce:
> 1. yum install -y rds-tools
>
> 2. [root@rdma3 ~]# ifconfig ib0 | grep 'inet addr'
>   inet addr:172.31.0.3  Bcast:172.31.0.255  Mask:255.255.255.0
>
> 3. [root@rdma3 ~]# /usr/bin/rds-ping 172.31.0.3  <<<< kernel panic (You
> may need to wait for a few seconds before the kernel panic.)
>>
>> and possibly requires certain hardware as Jay writes in the first link above:
>> "...some Infiniband HCAs(QLogic, possibly others) the machine will panic..."
> This bug can be reproduced with Mellanox HCAs (mlx4_ib.ko and mthca.ko),
> QLogic HCA (ib_qib.ko). I did not test the QLogic HCA running "ib_ipath.ko".
>
> As I know the upstream code of RDS is broken. There are *many* RDS bugs.
>
> Best regards.
> Honggang

Thanks Honggang. I have resubmitted the patch for approval.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] rds: fix local ping DoS

2013-11-14 Thread Josh Hunt

On 11/14/2013 01:03 AM, David Miller wrote:

From: Josh Hunt 
Date: Wed, 13 Nov 2013 17:15:43 -0800


The rds_ib_xmit function in net/rds/ib_send.c in the Reliable Datagram Sockets
(RDS) protocol implementation allows local users to cause a denial of service
(BUG_ON and kernel panic) by establishing an RDS connection with the source
IP address equal to the IPoIB interface's own IP address, as demonstrated by
rds-ping.

A local unprivileged user could use this flaw to crash the system.

CVE-2012-2372

Reported-by: Honggang Li 
Signed-off-by: Josh Hunt 


I'm sorry I can't apply this.  This commit message needs to be much
less terse and explain things more.

First of all, why is the "off % RDS_FRAG_SIZE" important?

And, even more importantly, why is is OK to avoid this assertion just
because we're going over loopback?

Furthermore, why doesn't net/rds/iw_send.c:rds_iw_xmit() have the same
exact problem?  It makes the same exact assertion check.

I know this RDS code is a steaming pile of poo, but that doesn't mean
we just randomly adjust assertions to make crashes go away without
sufficient understanding of exactly what's going on.

Thanks.



Sure understandable questions. Unfortunately I don't have the hardware 
to properly debug and analyze. I was just trying to get this through on 
the assumption that the previous attempts just failed due to incorrect 
submission procedures and lack of a reproducible testcase. If nothing 
else this whole thing brought out the testcase :)


Testcase from Honggang's earlier mail:

The test case is very simple:
Steps to Reproduce:
1. yum install -y rds-tools

2. [root@rdma3 ~]# ifconfig ib0 | grep 'inet addr'
  inet addr:172.31.0.3  Bcast:172.31.0.255  Mask:255.255.255.0

3. [root@rdma3 ~]# /usr/bin/rds-ping 172.31.0.3  <<<< kernel panic (You
may need to wait for a few seconds before the kernel panic.)

This bug can be reproduced with Mellanox HCAs (mlx4_ib.ko and mthca.ko),
QLogic HCA (ib_qib.ko). I did not test the QLogic HCA running "ib_ipath.ko".


Perhaps Venkat or someone else with the hardware mentioned can provide a 
better explanation and better solution to the crash.


Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.8-rc3 -> 3.8-rc4 regression] Re: [PATCH] module, async: async_synchronize_full() on module init iff async is used

2013-12-04 Thread Josh Hunt
On Tue, Dec 3, 2013 at 9:19 AM, Tejun Heo  wrote:
> Hello,
>
> On Tue, Dec 03, 2013 at 08:28:43AM -0600, Josh Hunt wrote:
>> You're right. Thanks for pointing this out. I did not realize there
>> was a bug in the init script. The version of initramfs-tools I was
>> using had the following bug:
>> https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/1215911
>>
>> Updating to 0.99ubuntu13.4 of initramfs-tools resolved my boot hangs.
>>
>> I did try using the workaround as suggested by Linus. In my setup the
>> dm_init() code was hit, however it still appeared to be too late at
>> times. I also tried moving the call to async_synchronize_full() above
>> the for loop and it still had the same issue (patch attached.) Out of
>> around 10 reboot tests it failed to find root 1 or 2 times.
>>
>> The ubuntu scripts don't ever actually call do_mount() if it can't
>> find the device. It seems to rely on some udev functionality to tell
>> it when the device is present, and if that fails it just bails out.
>>
>> This change has introduced a regression. However, I only noticed it
>> b/c my init script had a bug which caused it not to wait around for
>> the device to appear.
>
> Hmmm so, read the bug report, digged and asked around a bit.
> Here's the root problem - ubuntu's initramfs uses a tool to wait for
> the root device which uses libudev to listen for the device event;
> unfortunately, its rx buffer is not set large enough and the receiver
> isn't fast enough, which means that netlink broadcast messages from
> the kernel can overrun the buffer.  When that happens, it sets an
> error on the socket, so the next recv fails with -ENOBUFS.  If that
> happens, the wait for root aborts immediately and initramfs proceeds
> to mount non-existent root device.
>
> The only thing which changes by these patches is the timing of events.
> The problem likely wasn't as exposed before because things were slow
> enough so that either the messages could be consumed fast enough or
> there's enough delay between libata module load and the root device
> wait hiding the bug in the wait logic.
>
> So, yeah, it's a full blown timing bug.  I'm not sure what we can do
> to work around from kernel side except for randomly slowing things
> down or forcefully enlarging rx buffer size.  There really is no
> interlocking to take advantage of. :(

So there used to be a call to async_synchronize_full() in
ata_host_register(), but it was removed by
f29d3b23238e1955a8094e038c72546e99308e61 as part of some fastboot
changes. Adding it back (in the attached patch) seems to resolve the
issue when using the broken initrd. I'm guessing adding it back isn't
an option, but I wanted to point it out.

-- 
Josh
Index: b/drivers/ata/libata-core.c
===
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -6181,12 +6181,14 @@ int ata_host_register(struct ata_host *h
 	/* perform each probe asynchronously */
 	for (i = 0; i < host->n_ports; i++) {
 		struct ata_port *ap = host->ports[i];
 		async_schedule(async_port_probe, ap);
 	}
 
+	async_synchronize_full();
+
 	return 0;
 
  err_tadd:
 	while (--i >= 0) {
 		ata_tport_delete(host->ports[i]);
 	}


Re: [PATCH 3.19 016/175] ksoftirqd: Enable IRQs and call cond_resched() before poking RCU

2015-05-01 Thread Josh Hunt
On Wed, Mar 4, 2015 at 12:13 AM, Greg Kroah-Hartman
 wrote:
> 3.19-stable review patch.  If anyone has any objections, please let me know.
>
> --
>
> From: Calvin Owens 
>
> commit 28423ad283d5348793b0c45cc9b1af058e776fd6 upstream.
>
> While debugging an issue with excessive softirq usage, I encountered the
> following note in commit 3e339b5dae24a706 ("softirq: Use hotplug thread
> infrastructure"):
>
> [ paulmck: Call rcu_note_context_switch() with interrupts enabled. ]
>
> ...but despite this note, the patch still calls RCU with IRQs disabled.
>
> This seemingly innocuous change caused a significant regression in softirq
> CPU usage on the sending side of a large TCP transfer (~1 GB/s): when
> introducing 0.01% packet loss, the softirq usage would jump to around 25%,
> spiking as high as 50%. Before the change, the usage would never exceed 5%.
>
> Moving the call to rcu_note_context_switch() after the cond_sched() call,
> as it was originally before the hotplug patch, completely eliminated this
> problem.
>
> Signed-off-by: Calvin Owens 
> Signed-off-by: Paul E. McKenney 
> Signed-off-by: Greg Kroah-Hartman 
>
> ---
>  kernel/softirq.c |6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -656,9 +656,13 @@ static void run_ksoftirqd(unsigned int c
>  * in the task stack here.
>  */
> __do_softirq();
> -   rcu_note_context_switch();
> local_irq_enable();
> cond_resched();
> +
> +   preempt_disable();
> +   rcu_note_context_switch();
> +   preempt_enable();
> +
> return;
> }
> local_irq_enable();
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Sorry for the delay in noticing this, but should this be applied to
3.14-stable as well?

Thanks
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

2016-03-14 Thread Josh Hunt

On 03/14/2016 11:29 AM, Don Zickus wrote:


Hi Josh,

I believe Uli thought the below patch might fix it.

Cheers,
Don


Don

It looks like I was incorrect when I said 4.5 was getting the soft 
lockup. I originally found this problem on 4.1.19 and saw both the 
problem my patch solves and the soft lockups there. I thought when I 
checked 4.5 that I saw both issues there as well, but going back and 
checking now that is not the case. I only see the issue my patch 
resolves on 4.5.


With that info my changelog is incorrect now as it states I saw a soft 
lockup on the head. I will submit a v2 of my patch with the updated 
changelog. I'll also cc stable this time as I'd like to see this fix end 
up there as well.


As for the soft lockups showing up on 4.1, I tried Uli's patch and it 
did not help. After that I did a git bisect to figure out when the soft 
lockup was fixed and it appears to be resolved after one of the commits 
in this series:


commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
Author: Ulrich Obergfell 
Date:   Fri Sep 4 15:45:15 2015 -0700

watchdog: introduce watchdog_park_threads() and 
watchdog_unpark_threads()


I didn't identify the exact commit.

It would be nice to resolve the soft lockup for the stable folks since 
4.1 and 4.4 are longterm stable releases and would see this problem.


I did not have time to debug it any more outside of this bisection 
today. If you have something you'd like me to try which may work for the 
stable kernels I'm happy to test it.


For the record I'm able to reproduce the soft lockup on 4.1 doing:

while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done & 
sleep 30 && kill %1 && sleep 5


Josh



Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

2016-03-18 Thread Josh Hunt

On 03/16/2016 04:21 AM, Ulrich Obergfell wrote:


Josh,

I haven't tried to reproduce the soft lockups with kernel 4.1, but I
believe I found an explanation in terms of how your test case breaks
the watchdog mechanism in kernel 4.1:

The soft lockup detector is implemented via two components (per-cpu).

- The watchdog_timer_fn() function.
   This function runs periodically (sample_period) in the context of a
   high-resolution timer on CPU N. It wakes up the 'watchdog/N' thread
   and checks watchdog_touch_ts[N] to determine whether the thread has
   run within the soft lockup detection interval (watchdog_thresh * 2).

- The watchdog() function.
   This function is executed in the context of the 'watchdog/N' thread.
   It updates the watchdog_touch_ts[N] time stamp to signal that it got
   a chance to run. The thread has a high realtime priority. If it does
   not get a chance to run, this is indicative that 'something else' is
   hogging CPU N.


The while :; do echo 1 > /proc/sys/kernel/nmi_watchdog; sleep .1; done
loop triggers the following flow of execution in the watchdog code at
a frequency that is much higher than sample_period.

   proc_nmi_watchdog
 proc_watchdog_common
   proc_watchdog_update
 watchdog_enable_all_cpus
   update_watchdog_all_cpus
 for_each_online_cpu
 update_watchdog
   smp_call_function_single(restart_watchdog_hrtimer)

Calling the restart_watchdog_hrtimer() function at such a high rate has
the following effects.

- The high-resolution timer gets canceled and restarted before it ever
   gets a chance to expire. Hence, the watchdog_timer_fn() function does
   not get a chance to wake up the 'watchdog/N' thread. So this prevents
   the thread from running within the soft lockup detection interval.

- Since watchdog_timer_fn() does not wake up the 'watchdog/N' thread,
   the watchdog_touch_ts[N] time stamp is not being updated.

The  sleep 30 && kill %1  command terminates the above while loop after
30 seconds, so the high-resolution timer no longer gets canceled. When
the timer expires, watchdog_timer_fn() detects that 'watchdog/N' has not
run within the soft lockup detection interval.

Your patch 'masks' the above issue because proc_watchdog_update() is not
called if the parameter value is not changed.


The mechanism that picks up changes of watchdog parameters 'on the fly'
was rewritten in kernel 4.3 (see https://lkml.org/lkml/2015/8/1/64 and
http://marc.info/?l=linux-kernel&m=143894132129981&w=2).

$ git log --pretty=oneline v4.2..v4.3 -- kernel/watchdog.c | head -5
ec6a90661a0d6ce1461d05c7a58a0a151154e14a watchdog: rename watchdog_suspend() 
and watchdog_resume()
999bbe49ea0118b70ddf3f5d679f51dc7a97ae55 watchdog: use suspend/resume interface 
in fixup_ht_bug()
d4bdd0b21c7652a8271f873cc755486b255c1bbd watchdog: use park/unpark functions in 
update_watchdog_all_cpus()
8c073d27d7ad293bf734cc8475689413afadab81 watchdog: introduce watchdog_suspend() 
and watchdog_resume()
81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 watchdog: introduce 
watchdog_park_threads() and watchdog_unpark_threads()

The new version of the update_watchdog_all_cpus() function now parks and
unparks all 'watchdog/N' threads. If any watchdog parameter in /proc was
changed, the new value gets picked up by the threads during unpark when
watchdog_enable() is executed.

The following is particularly relevant to your test case.

- watchdog_disable() cancels the high-resolution timer during park

- watchdog_enable() starts the high-resolution timer during unpark
   and updates watchdog_touch_ts[N] by calling __touch_watchdog()

I believe this explains why you can only reproduce the soft lockup with
kernel v4.1 but not with v4.5.


Your patch "don't run proc_watchdog_update if new value is same as old"
looks OK to me. I have no objections.


Many Thanks,

Uli


Uli

Thanks for the detailed explanation!

As you mention my patch will mask this problem for 4.1 which is why I 
wanted to get it into stable. Do you think there is any way to mitigate 
this issue for the stable kernels (4.1 to 4.4) if the user changes the 
values doing something like:


foo=1; while :; do echo $foo > /proc/sys/kernel/nmi_watchdog; foo=$(( ! 
$foo )); sleep .1; done & sleep 30 && kill %1


?

I realize this isn't a real-world use-case (I hope :)), but shows there 
is still a way to cause the box to soft lockup with this code path.


I ask b/c these kernels are both longterm stable releases and so this 
issue will likely live for a while in the wild.


Josh


Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-10 Thread Josh Hunt
On Tue, Nov 10, 2015 at 12:24 PM, Josh Hunt  wrote:
>
> On Mon, Nov 9, 2015 at 4:02 PM, Peter Zijlstra  wrote:
>>
>> On Mon, Nov 09, 2015 at 01:59:02PM -0600, gratian.cri...@ni.com wrote:
>>
>> > The Intel Xeon E5 processor family suffers from errata[1] BT81:
>>
>> > +#ifdef CONFIG_X86_TSC
>> > + /*
>> > +  * Xeon E5 BT81 errata: TSC is not affected by warm reset.
>> > +  * The TSC registers for CPUs other than CPU0 are not cleared by a 
>> > warm
>> > +  * reset resulting in a constant offset error.
>> > +  */
>> > + if ((c->x86 == 6) && (c->x86_model == 0x3f))
>> > + set_cpu_bug(c, X86_BUG_TSC_OFFSET);
>> > +#endif
>>
>> That's hardly a family, that's just one, Haswell server.
>
>
> Are you actually observing this problem on this processor?
>
> The document you've referenced and the x86_model # above do not match up. The 
> errata should be for Intel processors with an x86_model value of 0x2d by my 
> calculations:
>
> Model: 1101b
> Extended Model: 0010b
>
> The calc from cpu_detect() is:
>  if (c->x86 >= 0x6)
> c->x86_model += ((tfms >> 16) & 0xf) << 4;
>
> 0x3f is a different CPU.
> --
> Josh

Resending, as gmail inserted html and lkml dropped the previous reply...


-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH] tsc: synchronize TSCs on buggy Intel Xeon E5 CPUs with offset error

2015-11-10 Thread Josh Hunt
On Tue, Nov 10, 2015 at 1:47 PM, Gratian Crisan  wrote:
>
> The observed behavior does seem to match BT81 errata i.e. the TSC does
> not get reset on warm reboots and it is otherwise stable.
>
If you have a simple testcase to reproduce the problem I'd be
interested in seeing it.

-- 
Josh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-22 Thread Josh Hunt
On Mon, Aug 19, 2019 at 4:16 PM Josh Hunt  wrote:
>
> On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt  wrote:
> >
> > On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt  wrote:
> > >
> > > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner  
> > > wrote:
> > > >
> > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > > > Was there any progress made on debugging this issue? We are still
> > > > > > > seeing it on 4.19.44:
> > > > > >
> > > > > > I haven't seen anyone looking at this.
> > > > > >
> > > > > > Can you please try the patch Ingo posted:
> > > > > >
> > > > > >   https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
> > > > > >
> > > > > > and if it fixes the issue decrease the value from 128 to the point 
> > > > > > where it
> > > > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > tglx
> > > > >
> > > > > I just checked the machines where this problem occurs and they're both
> > > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > > > Please let me know if I misread the patch or if what I'm seeing is a
> > > > > different issue than the one Cong originally reported.
> > > >
> > > > Find the NHM hack below.
> > > >
> > > > Thanks,
> > > >
> > > > tglx
> > > >
> > > > 8<
> > > >
> > > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > > index 648260b5f367..93c1a4f0e73e 100644
> > > > --- a/arch/x86/events/intel/core.c
> > > > +++ b/arch/x86/events/intel/core.c
> > > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event 
> > > > *event, u64 left)
> > > > return left;
> > > >  }
> > > >
> > > > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > > > +{
> > > > +   return max(left, 128ULL);
> > > > +}
> > > > +
> > > >  PMU_FORMAT_ATTR(event, "config:0-7");
> > > >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> > > >  PMU_FORMAT_ATTR(edge,  "config:18" );
> > > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> > > > x86_pmu.pebs_constraints = 
> > > > intel_nehalem_pebs_event_constraints;
> > > > x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> > > > x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > > > +   x86_pmu.limit_period = nhm_limit_period;
> > > >
> > > > mem_attr = nhm_mem_events_attrs;
> > > >
> > > Thanks Thomas. Will try this and let you know.
> > >
> > > --
> > > Josh
> >
> > Thomas
> >
> > I found on my setup that setting the value to 32 was the lowest value
> > I could use to keep the problem from happening. Let me know if you
> > want me to send a patch with the updated value, etc.
> >
> > I saw in the original thread from Ingo and Vince that this was seen on
> > Haswell, but I checked our Haswell boxes and so far we have not
> > reproduced the problem there.
> >
> > --
> > Josh
>
> I went ahead and sent this patch with the value set to 32:
> https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-joh...@akamai.com/T/#u
>
> I wasn't sure how/who to give credit to for the change, so please
> resubmit if what I did is incorrect or if you wanted to debug further.
> If you decide to resubmit the patch please add my tested-by and
> Bhupesh's reported-by. I'm able to reproduce the problem within about
> 2 hours if there's anything else you wanted to look into before going
> with this approach.
>
> Thanks!
> --
> Josh

Thomas

Any thoughts on the above or the patch that I sent?

Thanks!
-- 
Josh


4.19 dwarf unwinding broken

2019-10-03 Thread Josh Hunt

The following commit is breaking dwarf unwinding on 4.19 kernels:

commit e5adfc3e7e774ba86f7bb725c6eef5f32df8630e
Author: Konstantin Khlebnikov 
Date:   Tue Aug 7 12:09:01 2018 +0300

perf map: Synthesize maps only for thread group leader

It looks like this was fixed later by:

commit e8ba2906f6b9054102ad035ac9cafad9d4168589
Author: John Keeping 
Date:   Thu Aug 15 11:01:45 2019 +0100

perf unwind: Fix libunwind when tid != pid

but was not selected as a backport to 4.19. Are there plans to backport 
the fix? If not, could we have the breaking commit reverted in 4.19 LTS?


Thanks
Josh


Re: 4.19 dwarf unwinding broken

2019-10-03 Thread Josh Hunt

On 10/3/19 3:03 AM, Jiri Olsa wrote:

On Thu, Oct 03, 2019 at 12:54:09AM -0700, Josh Hunt wrote:

The following commit is breaking dwarf unwinding on 4.19 kernels:


how?


When doing something like:
perf record -p $(cat /var/run/app.pid) -g --call-graph dwarf -F 999 -- 
sleep 3


with 4.19.75 perf I see things like:

app_Thr00 26247 1810131.375329: 168288 cycles:ppp:

app_Thr01 26767 1810131.377449: 344415 cycles:ppp:

uvm:WorkerThread 26746 1810131.383052:504 cycles:ppp:
9f77cce0 _raw_spin_lock+0x10 (/boot/vmlinux-4.19.46)
9f181527 __perf_event_task_sched_in+0xf7 
(/boot/vmlinux-4.19.46)

9f09a7b8 finish_task_switch+0x158 (/boot/vmlinux-4.19.46)
9f778276 __schedule+0x2f6 (/boot/vmlinux-4.19.46)
9f7787f2 schedule+0x32 (/boot/vmlinux-4.19.46)
9f77bb0a schedule_hrtimeout_range_clock+0x8a 
(/boot/vmlinux-4.19.46)
9f22ea12 poll_schedule_timeout.constprop.6+0x42 
(/boot/vmlinux-4.19.46)

9f22eeeb do_sys_poll+0x4ab (/boot/vmlinux-4.19.46)
9f22fb7b __se_sys_poll+0x5b (/boot/vmlinux-4.19.46)
9f0023de do_syscall_64+0x4e (/boot/vmlinux-4.19.46)
9f800088 entry_SYSCALL_64+0x68 (/boot/vmlinux-4.19.46)
---

and with 4.19.75 perf with e5adfc3e7e77 reverted those empty call stacks 
go away and also other call stacks show more thread details:


uvm:WorkerThread 26746 1810207.336391:  1 cycles:ppp:
9f181505 __perf_event_task_sched_in+0xd5 
(/boot/vmlinux-4.19.46)

9f09a7b8 finish_task_switch+0x158 (/boot/vmlinux-4.19.46)
9f778276 __schedule+0x2f6 (/boot/vmlinux-4.19.46)
9f7787f2 schedule+0x32 (/boot/vmlinux-4.19.46)
9f77bb0a schedule_hrtimeout_range_clock+0x8a 
(/boot/vmlinux-4.19.46)
9f22ea12 poll_schedule_timeout.constprop.6+0x42 
(/boot/vmlinux-4.19.46)

9f22eeeb do_sys_poll+0x4ab (/boot/vmlinux-4.19.46)
9f22fb7b __se_sys_poll+0x5b (/boot/vmlinux-4.19.46)
9f0023de do_syscall_64+0x4e (/boot/vmlinux-4.19.46)
9f800088 entry_SYSCALL_64+0x68 (/boot/vmlinux-4.19.46)
7f7ef3f5c90d [unknown] (/lib/x86_64-linux-gnu/libc-2.23.so)
 3eb5c99 poll+0xc9 (inlined)
 3eb5c99 colib::ipc::EventFd::wait+0xc9 
(/usr/local/bin/app)

 3296779 uvm::WorkerThread::run+0x129 (/usr/local/bin/app)
 [unknown] ([unknown])

They also look the same as earlier kernel versions we have running.

In addition reading e8ba2906f6b's changelog sounded very similar to what 
I was seeing. This application launches a # of threads and is definitely 
already running before the invocation of perf.


Thanks for looking at this.

Josh


jirka



commit e5adfc3e7e774ba86f7bb725c6eef5f32df8630e
Author: Konstantin Khlebnikov 
Date:   Tue Aug 7 12:09:01 2018 +0300

 perf map: Synthesize maps only for thread group leader

It looks like this was fixed later by:

commit e8ba2906f6b9054102ad035ac9cafad9d4168589
Author: John Keeping 
Date:   Thu Aug 15 11:01:45 2019 +0100

 perf unwind: Fix libunwind when tid != pid

but was not selected as a backport to 4.19. Are there plans to backport the
fix? If not, could we have the breaking commit reverted in 4.19 LTS?

Thanks
Josh


Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-19 Thread Josh Hunt
On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt  wrote:
>
> On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner  wrote:
> >
> > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner  
> > > wrote:
> > > >
> > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > Was there any progress made on debugging this issue? We are still
> > > > > seeing it on 4.19.44:
> > > >
> > > > I haven't seen anyone looking at this.
> > > >
> > > > Can you please try the patch Ingo posted:
> > > >
> > > >   https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
> > > >
> > > > and if it fixes the issue decrease the value from 128 to the point 
> > > > where it
> > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > >
> > > > Thanks,
> > > >
> > > > tglx
> > >
> > > I just checked the machines where this problem occurs and they're both
> > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > Please let me know if I misread the patch or if what I'm seeing is a
> > > different issue than the one Cong originally reported.
> >
> > Find the NHM hack below.
> >
> > Thanks,
> >
> > tglx
> >
> > 8<
> >
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index 648260b5f367..93c1a4f0e73e 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event 
> > *event, u64 left)
> > return left;
> >  }
> >
> > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > +{
> > +   return max(left, 128ULL);
> > +}
> > +
> >  PMU_FORMAT_ATTR(event, "config:0-7");
> >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> >  PMU_FORMAT_ATTR(edge,  "config:18" );
> > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> > x86_pmu.pebs_constraints = 
> > intel_nehalem_pebs_event_constraints;
> > x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> > x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > +   x86_pmu.limit_period = nhm_limit_period;
> >
> > mem_attr = nhm_mem_events_attrs;
> >
> Thanks Thomas. Will try this and let you know.
>
> --
> Josh

Thomas

I found on my setup that setting the value to 32 was the lowest value
I could use to keep the problem from happening. Let me know if you
want me to send a patch with the updated value, etc.

I saw in the original thread from Ingo and Vince that this was seen on
Haswell, but I checked our Haswell boxes and so far we have not
reproduced the problem there.

-- 
Josh


[PATCH] perf/x86/intel: restrict period on Nehalem

2019-08-19 Thread Josh Hunt
We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in
some cases when using perf:

perfevents: irq loop stuck!
WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 
intel_pmu_handle_irq+0x37b/0x530
...
RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
...
Call Trace:

? perf_event_nmi_handler+0x2e/0x50
? intel_pmu_save_and_restart+0x50/0x50
perf_event_nmi_handler+0x2e/0x50
nmi_handle+0x6e/0x120
default_do_nmi+0x3e/0x100
do_nmi+0x102/0x160
end_repeat_nmi+0x16/0x50
...
? native_write_msr+0x6/0x20
? native_write_msr+0x6/0x20

intel_pmu_enable_event+0x1ce/0x1f0
x86_pmu_start+0x78/0xa0
x86_pmu_enable+0x252/0x310
__perf_event_task_sched_in+0x181/0x190
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x35/0x70
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x35/0x70
finish_task_switch+0x158/0x260
__schedule+0x2f6/0x840
? hrtimer_start_range_ns+0x153/0x210
schedule+0x32/0x80
schedule_hrtimeout_range_clock+0x8a/0x100
? hrtimer_init+0x120/0x120
ep_poll+0x2f7/0x3a0
? wake_up_q+0x60/0x60
do_epoll_wait+0xa9/0xc0
__x64_sys_epoll_wait+0x1a/0x20
do_syscall_64+0x4e/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fdeb1e96c03
...
---[ end trace 7a8f0b2beff82ee0 ]---

CPU#0: ctrl:   
CPU#0: status: 0004
CPU#0: overflow:   
CPU#0: fixed:  0bb0
CPU#0: pebs:   
CPU#0: debugctl:   
CPU#0: active: 0006
CPU#0:   gen-PMC0 ctrl:  
CPU#0:   gen-PMC0 count: 
CPU#0:   gen-PMC0 left:  
CPU#0:   gen-PMC1 ctrl:  
CPU#0:   gen-PMC1 count: 
CPU#0:   gen-PMC1 left:  
CPU#0:   gen-PMC2 ctrl:  
CPU#0:   gen-PMC2 count: 
CPU#0:   gen-PMC2 left:  
CPU#0:   gen-PMC3 ctrl:  
CPU#0:   gen-PMC3 count: 
CPU#0:   gen-PMC3 left:  
CPU#0: fixed-PMC0 count: 
CPU#0: fixed-PMC1 count: d22ebd19
CPU#0: fixed-PMC2 count: fff1
core: clearing PMU state on CPU#0

I found that a period limit of 32 was the lowest I could set it to without
the problem reoccurring. The idea for the patch and approach to find the
target value were suggested by Ingo and Thomas.

Signed-off-by: Josh Hunt 
Reported-by: Bhupesh Purandare 
Suggested-by: Thomas Gleixner 
Suggested-by: Ingo Molnar 
Link: https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
Link: 
https://lore.kernel.org/lkml/alpine.deb.2.21.1908122133310.7...@nanos.tec.linutronix.de/
---
 arch/x86/events/intel/core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b5f367..e4c2cb65ea50 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, 
u64 left)
return left;
 }
 
+static u64 nhm_limit_period(struct perf_event *event, u64 left)
+{
+   return max(left, 32ULL);
+}
+
 PMU_FORMAT_ATTR(event, "config:0-7");
 PMU_FORMAT_ATTR(umask, "config:8-15"   );
 PMU_FORMAT_ATTR(edge,  "config:18" );
@@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
x86_pmu.enable_all = intel_pmu_nhm_enable_all;
x86_pmu.extra_regs = intel_nehalem_extra_regs;
+   x86_pmu.limit_period = nhm_limit_period;
 
mem_attr = nhm_mem_events_attrs;
 
-- 
2.7.4



Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-19 Thread Josh Hunt
On Mon, Aug 19, 2019 at 2:17 PM Josh Hunt  wrote:
>
> On Mon, Aug 12, 2019 at 12:42 PM Josh Hunt  wrote:
> >
> > On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner  wrote:
> > >
> > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner  
> > > > wrote:
> > > > >
> > > > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > > > Was there any progress made on debugging this issue? We are still
> > > > > > seeing it on 4.19.44:
> > > > >
> > > > > I haven't seen anyone looking at this.
> > > > >
> > > > > Can you please try the patch Ingo posted:
> > > > >
> > > > >   https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
> > > > >
> > > > > and if it fixes the issue decrease the value from 128 to the point 
> > > > > where it
> > > > > comes back, i.e. 128 -> 64 -> 32 ...
> > > > >
> > > > > Thanks,
> > > > >
> > > > > tglx
> > > >
> > > > I just checked the machines where this problem occurs and they're both
> > > > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > > > Please let me know if I misread the patch or if what I'm seeing is a
> > > > different issue than the one Cong originally reported.
> > >
> > > Find the NHM hack below.
> > >
> > > Thanks,
> > >
> > > tglx
> > >
> > > 8<
> > >
> > > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > > index 648260b5f367..93c1a4f0e73e 100644
> > > --- a/arch/x86/events/intel/core.c
> > > +++ b/arch/x86/events/intel/core.c
> > > @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event 
> > > *event, u64 left)
> > > return left;
> > >  }
> > >
> > > +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> > > +{
> > > +   return max(left, 128ULL);
> > > +}
> > > +
> > >  PMU_FORMAT_ATTR(event, "config:0-7");
> > >  PMU_FORMAT_ATTR(umask, "config:8-15"   );
> > >  PMU_FORMAT_ATTR(edge,  "config:18" );
> > > @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> > > x86_pmu.pebs_constraints = 
> > > intel_nehalem_pebs_event_constraints;
> > > x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> > > x86_pmu.extra_regs = intel_nehalem_extra_regs;
> > > +   x86_pmu.limit_period = nhm_limit_period;
> > >
> > > mem_attr = nhm_mem_events_attrs;
> > >
> > Thanks Thomas. Will try this and let you know.
> >
> > --
> > Josh
>
> Thomas
>
> I found on my setup that setting the value to 32 was the lowest value
> I could use to keep the problem from happening. Let me know if you
> want me to send a patch with the updated value, etc.
>
> I saw in the original thread from Ingo and Vince that this was seen on
> Haswell, but I checked our Haswell boxes and so far we have not
> reproduced the problem there.
>
> --
> Josh

I went ahead and sent this patch with the value set to 32:
https://lore.kernel.org/lkml/1566256411-18820-1-git-send-email-joh...@akamai.com/T/#u

I wasn't sure how/who to give credit to for the change, so please
resubmit if what I did is incorrect or if you wanted to debug further.
If you decide to resubmit the patch please add my tested-by and
Bhupesh's reported-by. I'm able to reproduce the problem within about
2 hours if there's anything else you wanted to look into before going
with this approach.

Thanks!
-- 
Josh


Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-12 Thread Josh Hunt
On Mon, Feb 26, 2018 at 12:40 PM Andi Kleen  wrote:
>
> > Given the HSD143 errata and its possible relevance, have you tried
> > changing the magic number to 32, does it then still fix things?
> >
> > No real objection to the patch as such, it just needs a coherent comment
> > and a tested-by tag I think.
>
> 128 min period will affect a lot of valid use cases with slower ticking
> events.  I often use smaller periods there.
>
> It would be better to debug this properly.
>
> Or at a minimum only do the limitation for the events that tick really
> fast (like cycles, uops retired etc.)
>
> -Andi

Was there any progress made on debugging this issue? We are still
seeing it on 4.19.44:

[ 2660.685392] [ cut here ]
[ 2660.685392] perfevents: irq loop stuck!
[ 2660.685392] WARNING: CPU: 1 PID: 4436 at
arch/x86/events/intel/core.c:2278 intel_pmu_handle_irq+0x37b/0x530
[ 2660.685393] Modules linked in: sch_fq ip6table_raw ip6table_filter
ip6_tables iptable_raw xt_TARPIT ts_bm xt_u32 xt_recent xt_string
xt_set ip_set_hash_ip ip_set_hash_ipportip ip_set_hash_net dev_cstack
tcp_bbr tcp_qdk netconsole aep ip6_udp_tunnel udp_tunnel dm_mod
i2c_dev tcp_fast w83627ehf hwmon_vid jc42 i2c_core softdog autofs4
raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
async_tx xor raid6_pq libcrc32c raid1 raid0 linear md_mod ext4
crc32c_generic crc16 mbcache jbd2 xt_tcpudp ipv6 iptable_filter
ip_tables ip_set nfnetlink x_tables zfs(O) zunicode(O) zavl(O) icp(O)
sr_mod zcommon(O) znvpair(O) spl(O) coretemp hwmon kvm_intel
ipmi_devintf kvm ata_piix irqbypass crc32c_intel ipmi_msghandler
i7core_edac libata lpc_ich mfd_core e1000e pcc_cpufreq
[ 2660.685405] CPU: 1 PID: 4436 Comm: xx_01 Tainted: G   O
 4.19.44 #1
[ 2660.685405] Hardware name: Ciara Technologies
[ 2660.685405] RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
[ 2660.685406] Code: 00 00 bf 40 03 00 00 48 8b 40 10 e8 bf 82 9f 00
e9 f3 fc ff ff 80 3d f3 54 61 01 00 75 1a 48 c7 c7 02 00 e1 93 e8 45
8d 06 00 <0f> 0b e8 2e a9 ff ff c6 05 d7 54 61 01 01 65 4c 8b 35 5f 4f
00 6d
[ 2660.685406] RSP: 0018:fe034c40 EFLAGS: 00010086
[ 2660.685407] RAX: 001c RBX: 0064 RCX: 0002
[ 2660.685407] RDX: 0003 RSI: 93e1001e RDI: 926bafa555a8
[ 2660.685407] RBP: fe034e30 R08: fff8919fe17a R09: 
[ 2660.685407] R10: fe034c40 R11:  R12: 926bafa4f3a0
[ 2660.685408] R13: 926b93739000 R14: 0040 R15: 926bafa4f5a0
[ 2660.685408] FS:  7f0b9ccd7940() GS:926bafa4()
knlGS:
[ 2660.685408] CS:  0010 DS:  ES:  CR0: 80050033
[ 2660.685409] CR2: 7f45c3497750 CR3: 00042388e000 CR4: 07e0
[ 2660.685409] Call Trace:
[ 2660.685409]  
[ 2660.685409]  ? perf_event_nmi_handler+0x2e/0x50
[ 2660.685409]  ? intel_pmu_save_and_restart+0x50/0x50
[ 2660.685410]  perf_event_nmi_handler+0x2e/0x50
[ 2660.685410]  nmi_handle+0x6e/0x120
[ 2660.685410]  default_do_nmi+0x3e/0x100
[ 2660.685410]  do_nmi+0x102/0x160
[ 2660.685410]  end_repeat_nmi+0x16/0x50
[ 2660.685411] RIP: 0010:native_write_msr+0x6/0x20
[ 2660.685411] Code: c3 48 c1 e2 20 48 89 d3 8b 16 48 09 c3 48 89 de
e8 bf 53 3b 00 48 89 d8 5b c3 66 2e 0f 1f 84 00 00 00 00 00 89 f9 89
f0 0f 30 <66> 66 66 66 90 c3 48 c1 e2 20 89 f6 48 09 d6 31 d2 e9 24 53
3b 00
[ 2660.685411] RSP: 0018:b04661fb7c60 EFLAGS: 0046
[ 2660.685412] RAX: 0bb0 RBX: 926b93739000 RCX: 038d
[ 2660.685412] RDX:  RSI: 0bb0 RDI: 038d
[ 2660.685412] RBP: 000b R08: fff8919fe17a R09: 941602d0
[ 2660.685413] R10: b04661fb7bd0 R11: 0362 R12: 0008
[ 2660.685413] R13: 0001 R14: 926bafa4f5c4 R15: 0001
[ 2660.685413]  ? native_write_msr+0x6/0x20
[ 2660.685413]  ? native_write_msr+0x6/0x20
[ 2660.685414]  
[ 2660.685414]  intel_pmu_enable_event+0x1ce/0x1f0
[ 2660.685414]  x86_pmu_start+0x78/0xa0
[ 2660.685414]  x86_pmu_enable+0x252/0x310
[ 2660.685414]  __perf_event_task_sched_in+0x181/0x190
[ 2660.685415]  ? __switch_to_asm+0x34/0x70
[ 2660.685415]  ? __switch_to_asm+0x40/0x70
[ 2660.685415]  ? __switch_to_asm+0x34/0x70
[ 2660.685415]  ? __switch_to_asm+0x40/0x70
[ 2660.685416]  finish_task_switch+0x158/0x260
[ 2660.685416]  __schedule+0x2f6/0x840
[ 2660.685416]  ? hrtimer_start_range_ns+0x153/0x210
[ 2660.685416]  schedule+0x32/0x80
[ 2660.685417]  schedule_hrtimeout_range_clock+0x8a/0x100
[ 2660.685417]  ? hrtimer_init+0x120/0x120
[ 2660.685417]  ep_poll+0x2f7/0x3a0
[ 2660.685417]  ? wake_up_q+0x60/0x60
[ 2660.685417]  do_epoll_wait+0xa9/0xc0
[ 2660.685418]  __x64_sys_epoll_wait+0x1a/0x20
[ 2660.685418]  do_syscall_64+0x4e/0x110
[ 2660.685418]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2660.685418] RIP: 0033:0x7f4d35107c03
[ 2660.685419] Code: 49 89 ca b8 e8 00 00 00 0f 05 48 3d 01 f0 ff ff
73 34 c3 48 8

Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-12 Thread Josh Hunt
On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner  wrote:
>
> On Mon, 12 Aug 2019, Josh Hunt wrote:
> > Was there any progress made on debugging this issue? We are still
> > seeing it on 4.19.44:
>
> I haven't seen anyone looking at this.
>
> Can you please try the patch Ingo posted:
>
>   https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
>
> and if it fixes the issue decrease the value from 128 to the point where it
> comes back, i.e. 128 -> 64 -> 32 ...
>
> Thanks,
>
> tglx

I just checked the machines where this problem occurs and they're both
Nehalem boxes. I think Ingo's patch would only help Haswell machines.
Please let me know if I misread the patch or if what I'm seeing is a
different issue than the one Cong originally reported.

Thanks
-- 
Josh


Re: Long standing kernel warning: perfevents: irq loop stuck!

2019-08-12 Thread Josh Hunt
On Mon, Aug 12, 2019 at 12:34 PM Thomas Gleixner  wrote:
>
> On Mon, 12 Aug 2019, Josh Hunt wrote:
> > On Mon, Aug 12, 2019 at 10:55 AM Thomas Gleixner  wrote:
> > >
> > > On Mon, 12 Aug 2019, Josh Hunt wrote:
> > > > Was there any progress made on debugging this issue? We are still
> > > > seeing it on 4.19.44:
> > >
> > > I haven't seen anyone looking at this.
> > >
> > > Can you please try the patch Ingo posted:
> > >
> > >   https://lore.kernel.org/lkml/20150501070226.gb18...@gmail.com/
> > >
> > > and if it fixes the issue decrease the value from 128 to the point where 
> > > it
> > > comes back, i.e. 128 -> 64 -> 32 ...
> > >
> > > Thanks,
> > >
> > > tglx
> >
> > I just checked the machines where this problem occurs and they're both
> > Nehalem boxes. I think Ingo's patch would only help Haswell machines.
> > Please let me know if I misread the patch or if what I'm seeing is a
> > different issue than the one Cong originally reported.
>
> Find the NHM hack below.
>
> Thanks,
>
> tglx
>
> 8<
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 648260b5f367..93c1a4f0e73e 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, 
> u64 left)
> return left;
>  }
>
> +static u64 nhm_limit_period(struct perf_event *event, u64 left)
> +{
> +   return max(left, 128ULL);
> +}
> +
>  PMU_FORMAT_ATTR(event, "config:0-7");
>  PMU_FORMAT_ATTR(umask, "config:8-15"   );
>  PMU_FORMAT_ATTR(edge,  "config:18" );
> @@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
> x86_pmu.pebs_constraints = 
> intel_nehalem_pebs_event_constraints;
> x86_pmu.enable_all = intel_pmu_nhm_enable_all;
> x86_pmu.extra_regs = intel_nehalem_extra_regs;
> +   x86_pmu.limit_period = nhm_limit_period;
>
> mem_attr = nhm_mem_events_attrs;
>
Thanks Thomas. Will try this and let you know.

-- 
Josh


Re: Steam is broken on new kernels

2019-06-21 Thread Josh Hunt
On Fri, Jun 21, 2019 at 4:07 PM Pierre-Loup A. Griffais
 wrote:
>
>
>
> On 6/21/19 3:38 PM, Eric Dumazet wrote:
> > Please look at my recent patch.
> >   Sorry I am travelling
> >
> > On Fri, Jun 21, 2019, 6:19 PM Linus Torvalds
> > mailto:torva...@linux-foundation.org>>
> > wrote:
> >
> > On Fri, Jun 21, 2019 at 2:41 PM Greg Kroah-Hartman
> > mailto:gre...@linuxfoundation.org>> wrote:
> >  >
> >  > What specific commit caused the breakage?
> >
> > Both on reddit and on github there seems to be confusion about whether
> > it's a problem or not. Some people have it working with the exact same
> > kernel that breaks for others.
> >
> > And then some people seem to say it works intermittently for them,
> > which seems to indicate a timing issue.
> >
> > Looking at the SACK patches (assuming it's one of them), I'd suspect
> > the "tcp: tcp_fragment() should apply sane memory limits".
> >
> > Eric, that one does
> >
> > if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
> > return -ENOMEM;
> > }
> >
> > but I think it's *normal* for "sk_wmem_queued >> 1" to be around the
> > same size as sk_sndbuf. So if there is some fragmentation, and we add
> > more skb's to it, that would seem to trigger fairly easily.
> > Particularly since this is all in "truesize" units, which can be a lot
> > bigger than the packets themselves.
> >
> > I don't know the code, so I may be out to lunch and barking up
> > completely the wrong tree, but that particular check does seem like it
> > might trigger much more easily than I think the code _intended_ it to
> > trigger?
> >
> > Pierre-Loup - do you guys have a test-case inside of valve? Or is this
> > purely "we see some people with problems"?
>
> Definitely the latter, although the volume of complaints clearly points
> to a real problem from our experience. Reproducing locally, bisecting
> and testing possible fixes is just now starting on our end.
>
> I agree not all users seem affected; most affected people report success
> by using -tcp to launch Steam, which makes it use direct TCP instead of
> WebSockets, our current default connection method for Linux.
>
> Thanks,
>   - Pierre-Loup
>
> >
> > Linus
> >
>
I asked on the github thread if users seeing the problem could check
the new wqueue too big counter:
https://github.com/ValveSoftware/steam-for-linux/issues/6326

So far one person is seeing the counter increase when they see the
problem, and another that doesn't see the problem has the counter at
0. Obviously not a great sample size, but hopefully more will report.
If nothing else, someone is seeing the counter increase while trying
to connect to steam.
-- 
Josh


[PATCH v2] perf tools: allow map files to specify DSO

2018-05-07 Thread Josh Hunt
Add the ability to specify a DSO in the /tmp/perf-.map file.
The DSO should be the first line in the file and readable by the
running user. If a valid DSO is found all other contents of the
file will be ignored. This allows things like callchain unwinding
with DWARF to work.

Suggested-by: Wang Nan 
Signed-off-by: Josh Hunt 
---
We have an application which uses huge pages for its text section, but
still needs the ability to do callchain unwinding with DWARF. We use
the perf-.map file setup to do symbol resolution and that works
great, but callchain unwinding fails.

A few months ago I mentioned this to Wang Nan and he suggested a way
around this problem could be to specify the path of the DSO in the map
file. The attached patch is my initial hack at this. Running with this
patch I can now get full callchain unwinding with DWARF.

FWIW LBR + map file works with callchains, but unfortunately there are
some cases where we still need DWARF.

I submitted an RFC as an earlier draft:
https://lkml.org/lkml/2018/4/24/1070

v2:
1. Rebase RFC patch to
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
2. Update jit-interface.txt as per Arnaldo's request

 tools/perf/Documentation/jit-interface.txt |  8 +---
 tools/perf/util/map.c  | 32 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/jit-interface.txt 
b/tools/perf/Documentation/jit-interface.txt
index a8656f564915..8a25b979802f 100644
--- a/tools/perf/Documentation/jit-interface.txt
+++ b/tools/perf/Documentation/jit-interface.txt
@@ -3,13 +3,15 @@ by a JIT.
 
 The JIT has to write a /tmp/perf-%d.map  (%d = pid of process) file
 
-This is a text file.
+This is a text file and can have one of the following formats:
 
-Each line has the following format, fields separated with spaces:
+1) Each line has the following format, fields separated with spaces:
 
 START SIZE symbolname
 
 START and SIZE are hex numbers without 0x.
 symbolname is the rest of the line, so it could contain special characters.
 
-The ownership of the file has to match the process.
+2) A single line with the full pathname of the DSO to use.
+
+For both formats, the ownership of the file has to match the process.
diff --git a/tools/perf/util/map.c b/tools/perf/util/map.c
index c8fe836e4c3c..da3050b18e34 100644
--- a/tools/perf/util/map.c
+++ b/tools/perf/util/map.c
@@ -139,6 +139,31 @@ void map__init(struct map *map, u64 start, u64 end, u64 
pgoff, struct dso *dso)
refcount_set(&map->refcnt, 1);
 }
 
+static bool replace_anon(char *mapfilename)
+{
+   FILE *file = NULL;
+   bool ret = false;
+
+   file = fopen(mapfilename, "r");
+   if (file != NULL) {
+   char *line = NULL;
+   size_t line_len, linesz=0;
+
+   line_len = getline(&line, &linesz, file);
+   if (line_len > 0) {
+   line[line_len-1] = '\0'; /* null terminate */
+   if (!access(line, R_OK)) {
+   strlcpy(mapfilename, line, line_len);
+   ret = true;
+   }
+   }
+   free(line);
+   fclose(file);
+   }
+
+   return ret;
+}
+
 struct map *map__new(struct machine *machine, u64 start, u64 len,
 u64 pgoff, u32 d_maj, u32 d_min, u64 ino,
 u64 ino_gen, u32 prot, u32 flags, char *filename,
@@ -170,6 +195,13 @@ struct map *map__new(struct machine *machine, u64 start, 
u64 len,
snprintf(newfilename, sizeof(newfilename),
 "/tmp/perf-%d.map", nsi->pid);
filename = newfilename;
+   /*
+* Check to see if map file references DSO to use, if 
so, use it.
+*/
+   if (anon && replace_anon(newfilename)) {
+   anon = 0;
+   filename = newfilename;
+   }
}
 
if (android) {
-- 
1.9.1



Re: [PATCH v2] perf tools: allow map files to specify DSO

2018-05-07 Thread Josh Hunt

On 05/07/2018 11:40 AM, Andi Kleen wrote:

On Mon, May 07, 2018 at 02:24:16PM -0400, Josh Hunt wrote:

Add the ability to specify a DSO in the /tmp/perf-.map file.
The DSO should be the first line in the file and readable by the
running user. If a valid DSO is found all other contents of the
file will be ignored. This allows things like callchain unwinding
with DWARF to work.


FWIW it's ok, but also obsolete with Kirill's large-pages-in-tmpfs
work in newer kernels. With that you can just copy the executable into
a 2MB tmpfs and disable the manual huge page copying and everything
should work as usually.

So essentially it's only a hack for old kernels and old binaries.

But doesn't hurt I guess.


Ah, very interesting. I wasn't aware of this. Can you point me to some 
more details on this process?


Thanks
Josh


[PATCH] perf probe: Ignore vmlinux build-id when offline vmlinux and dry-run used

2017-01-24 Thread Josh Hunt
Commit e50243bb ("perf probe: Ignore vmlinux Build-id when offline vmlinux 
given")
added the ability to ignore vmlinux when an offline vmlinux is passed and no
l, d, or a flag is also passed. This extends that to ignore build-id when
-n/--dry-run is used. This way we can gather probe information for offline 
kernels
using the method Brendan Gregg mentions in this post:
http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html .

Ex:
 $ ./perf probe -k ./vmlinux-3.14.59 -nv 'do_sys_open=do_sys_open:7 
filename:string dfd flags op'

will generate:
 do_sys_open+91 filename_string=+0(%si):string dfd=%di:s32 flags=%bx:s32 
op=-60(%bp):u64

and then copy/paste this to the machine running the 3.14.59 kernel.

Signed-off-by: Josh Hunt 
---
 tools/perf/builtin-probe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index f87996b..e0eea723 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -616,7 +616,7 @@ static int perf_del_probe_events(struct strfilter *filter)
 * nor change running kernel. So if user gives offline vmlinux,
 * ignore its buildid.
 */
-   if (!strchr("lda", params.command) && symbol_conf.vmlinux_name)
+   if ((!strchr("lda", params.command) || probe_event_dry_run) && 
symbol_conf.vmlinux_name)
symbol_conf.ignore_vmlinux_buildid = true;
 
switch (params.command) {
-- 
1.9.1



Re: [PATCH] perf probe: Ignore vmlinux build-id when offline vmlinux and dry-run used

2017-01-25 Thread Josh Hunt

On 01/25/2017 08:37 PM, Masami Hiramatsu wrote:

On Tue, 24 Jan 2017 16:41:28 -0500
Josh Hunt  wrote:


Commit e50243bb ("perf probe: Ignore vmlinux Build-id when offline vmlinux 
given")
added the ability to ignore vmlinux when an offline vmlinux is passed and no
l, d, or a flag is also passed. This extends that to ignore build-id when
-n/--dry-run is used. This way we can gather probe information for offline 
kernels
using the method Brendan Gregg mentions in this post:
http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html .


Sorry, NAK.
Could you use -D (--definition) for that usage? It directly gives you
the definition on stdout, and also ignore the buildid.


Ahhh! I guess I totally missed the point of that flag. You're right. It 
works exactly the way I want :)


Thanks. Please ignore my patch.

Josh


Re: [PATCH] watchdog: don't run proc_watchdog_update if new value is same as old

2016-03-14 Thread Josh Hunt

On 03/14/2016 09:34 AM, Don Zickus wrote:

On Sat, Mar 12, 2016 at 06:50:26PM -0500, Joshua Hunt wrote:

While working on a script to restore all sysctl params before a series of
tests I found that writing any value into the
/proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
causes them to call proc_watchdog_update(). Not only that, but when I
wrote to these proc files in a loop I could easily trigger a soft lockup.

[  955.756196] NMI watchdog: enabled on all CPUs, permanently consumes one 
hw-PMU counter.
[  955.765994] NMI watchdog: enabled on all CPUs, permanently consumes one 
hw-PMU counter.
[  955.774619] NMI watchdog: enabled on all CPUs, permanently consumes one 
hw-PMU counter.
[  955.783182] NMI watchdog: enabled on all CPUs, permanently consumes one 
hw-PMU counter.
[  959.788319] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 30s! 
[swapper/4:0]
[  959.788325] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 30s! 
[swapper/5:0]

There doesn't appear to be a reason for doing this work other every time a
write occurs, so only do the work when the values change.


Hi Josh,

Thanks for the patch.  I have no objections to it, but Uli and myself were
interested in the reason for the softlockups.  Uli is going to provide a
test patch to see if his theory is correct.  That way we fix the underlying
issue and then apply your patch on top. Make sense?


Yep. Sounds good. I meant to mention I didn't diagnose the soft-lockup. 
If you provide a patch I'm happy to test. I can also attempt to debug 
that part more if needed.


Josh



Cheers,
Don



Signed-off-by: Josh Hunt 
---
  kernel/watchdog.c |9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b3ace6e..9acb29f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -923,6 +923,9 @@ static int proc_watchdog_common(int which, struct ctl_table 
*table, int write,
 * both lockup detectors are disabled if proc_watchdog_update()
 * returns an error.
 */
+   if (old == new)
+   goto out;
+
err = proc_watchdog_update();
}
  out:
@@ -967,7 +970,7 @@ int proc_soft_watchdog(struct ctl_table *table, int write,
  int proc_watchdog_thresh(struct ctl_table *table, int write,
 void __user *buffer, size_t *lenp, loff_t *ppos)
  {
-   int err, old;
+   int err, old, new;

get_online_cpus();
mutex_lock(&watchdog_proc_mutex);
@@ -987,6 +990,10 @@ int proc_watchdog_thresh(struct ctl_table *table, int 
write,
/*
 * Update the sample period. Restore on failure.
 */
+   new = ACCESS_ONCE(watchdog_thresh);
+   if (old == new)
+   goto out;
+
set_sample_period();
err = proc_watchdog_update();
if (err) {
--
1.7.9.5



Re: [PATCH] udp6: fix UDP/IPv6 encap resubmit path

2016-03-07 Thread Josh Hunt

On 03/04/2016 04:47 PM, Bill Sommerfeld wrote:

IPv4 interprets a negative return value from a protocol handler as a
request to redispatch to a new protocol.  In contrast, IPv6 interprets a
negative value as an error, and interprets a positive value as a request
for redispatch.

UDP for IPv6 was unaware of this difference.  Change __udp6_lib_rcv() to
return a positive value for redispatch.  Note that the socket's
encap_rcv hook still needs to return a negative value to request
dispatch, and in the case of IPv6 packets, adjust IP6CB(skb)->nhoff to
identify the byte containing the next protocol.

Signed-off-by: Bill Sommerfeld 
---
  net/ipv6/udp.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0711f8f..fd25e44 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -922,11 +922,9 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table 
*udptable,
ret = udpv6_queue_rcv_skb(sk, skb);
sock_put(sk);

-   /* a return value > 0 means to resubmit the input, but
-* it wants the return to be -protocol, or 0
-*/
+   /* a return value > 0 means to resubmit the input */
if (ret > 0)
-   return -ret;
+   return ret;

return 0;
}



This looks good to me. Thanks Bill!

Josh


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-02 Thread Josh Hunt

On 7/2/20 2:45 AM, Paolo Abeni wrote:

Hi all,

On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:

Hi Cong,

On 01/07/2020 21:58, Cong Wang wrote:

On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  wrote:

On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:

Do either of you know if there's been any development on a fix for this
issue? If not we can propose something.


If you have a reproducer, I can look into this.


Does the attached patch fix this bug completely?


It's easier to comment if you inline the patch, but after taking a quick
look it seems too simplistic.

i)  Are you sure you haven't got the return values on qdisc_run reversed?


qdisc_run() returns true if it was able to acquire the seq lock. We
need to take special action in the opposite case, so Cong's patch LGTM
from a functional PoV.


ii) There's a "bypass" path that skips the enqueue/dequeue operation if
the queue is empty; that needs a similar treatment:  after releasing
seqlock it needs to ensure that another packet hasn't been enqueued
since it last checked.


That has been reverted with
commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323

---

diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..c7e48356132a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
struct Qdisc *q,

if (q->flags & TCQ_F_NOLOCK) {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-   qdisc_run(q);
+   if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+   __netif_schedule(q);


I fear the __netif_schedule() call may cause performance regression to
the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
collect some data.


Initial results with Cong's patch look promising, so far no stalls. We 
will let it run over the long weekend and report back on Tuesday.


Paolo - I have concerns about possible performance regression with the 
change as well. If you can gather some data that would be great. If 
things look good with our low throughput test over the weekend we can 
also try assessing performance next week.


Josh


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-06-30 Thread Josh Hunt

On 6/23/20 6:42 AM, Michael Zhivich wrote:

From: Jonas Bonn 
To: Paolo Abeni ,
"net...@vger.kernel.org" ,
LKML ,
"David S . Miller" ,
John Fastabend 
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Date: Fri, 11 Oct 2019 02:39:48 +0200
Message-ID: <465a540e-5296-32e7-f6a6-79942dfe2...@netrounds.com> (raw)
In-Reply-To: <95c5a697932e19ebd6577b5dac4d7052fe8c4255.ca...@redhat.com>

Hi Paolo,

On 09/10/2019 21:14, Paolo Abeni wrote:

Something alike the following code - completely untested - can possibly
address the issue, but it's a bit rough and I would prefer not adding
additonal complexity to the lockless qdiscs, can you please have a spin
a it?


We've tested a couple of variants of this patch today, but unfortunately
it doesn't fix the problem of packets getting stuck in the queue.

A couple of comments:

i) On 5.4, there is the BYPASS path that also needs the same treatment
as it's essentially replicating the behavour of qdisc_run, just without
the queue/dequeue steps

ii)  We are working a lot with the 4.19 kernel so I backported to the
patch to this version and tested there.  Here the solution would seem to
be more robust as the BYPASS path does not exist.

Unfortunately, in both cases we continue to see the issue of the "last
packet" getting stuck in the queue.

/Jonas


Hello Jonas, Paolo,

We have observed the same problem with pfifo_fast qdisc when sending periodic 
small
packets on a TCP flow with multiple simultaneous connections on a 4.19.75
kernel.  We've been able to catch it in action using perf probes (see trace
below).  For qdisc = 0x900d7c247c00, skb = 0x900b72c334f0,
it takes 200270us to traverse the networking stack on a system that's not 
otherwise busy.
qdisc only resumes processing when another enqueued packet comes in,
so the packet could have been stuck indefinitely.

proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0x900d7c247c00 skb=0x900bfc294af0 band=2 
atomic_qlen=0
proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue: 
(9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2
proc-19927 19927 [014] 580644.045480:  probe:tcp_transmit_skb2: 
(9b6dc4e5) skb=0x900b72c334f0 sk=0x900d62958040 source=0x4b4e 
dest=0x9abe
proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0x900d7c247c00 skb=0x0 band=3 atomic_qlen=0
proc-19927 19927 [014] 580644.045481:  probe:ip_finish_output2: 
(9b6bc650) net=0x9c107c80 sk=0x900d62958040 
skb=0x900b72c334f0 __func__=0x0
proc-19902 19902 [032] 580644.045481:probe:sch_direct_xmit: 
(9b69e570) skb=0x900bfc294af0 q=0x900d7c247c00 
dev=0x900d6a14 txq=0x900d6a181180 root_lock=0x0 validate=1 ret=-1 
again=155
proc-19927 19927 [014] 580644.045481:net:net_dev_queue: 
dev=eth0 skbaddr=0x900b72c334f0 len=115
proc-19902 19902 [032] 580644.045482: probe:pfifo_fast_dequeue: 
(9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=1
proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue: 
(9b69d9f0) skb=0x900b72c334f0 qdisc=0x900d7c247c00 
to_free=18446622925407304000
proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0x900d7c247c00 skb=0x0 band=3 atomic_qlen=0
proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: 
(9b69da9f) skb=0x900b72c334f0 qdisc=0x900d7c247c00 
to_free=0x91d0f67ab940 atomic_qlen=1
proc-19902 19902 [032] 580644.045484:  probe:__qdisc_run_2: 
(9b69ea5a) q=0x900d7c247c00 packets=1
proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue: 
(9b69d9f0) skb=0x900d98fdf6f0 qdisc=0x900d7c247c00 
to_free=18446622925407304000
proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: 
(9b69da9f) skb=0x900d98fdf6f0 qdisc=0x900d7c247c00 
to_free=0x91d0f67ab940 atomic_qlen=2
proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue: 
(9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=0
proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0x900d7c247c00 skb=0x900b72c334f0 band=2 
atomic_qlen=1
proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue: 
(9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2
proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0x900d7c247c00 skb=0x900d98fdf6f0 band=2 
atomic_qlen=0
proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue: 
(9b69d8c0) qdisc=0x900d7c247c00 skb=0x9b69d8c0 band=2
proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: 
(9b69d99d) qdisc=0xfff

Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-01 Thread Josh Hunt



On 7/1/20 12:58 PM, Cong Wang wrote:

On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  wrote:


On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:

Do either of you know if there's been any development on a fix for this
issue? If not we can propose something.


If you have a reproducer, I can look into this.


Does the attached patch fix this bug completely?

Thanks.



Hey Cong

Unfortunately we don't have a reproducer that would be easy to share, 
however your patch makes sense to me at a high-level. We will test and 
let you know if this fixes our problem.


Thanks!
Josh


Re: [PATCH 4.9 086/104] arm64: kasan: avoid bad virt_to_pfn()

2017-11-15 Thread Josh Hunt
On Tue, Oct 10, 2017 at 10:31 AM, Julia Lawall  wrote:
>
>
> On Tue, 10 Oct 2017, Levin, Alexander (Sasha Levin) wrote:
>
>> (Cc'ed Julia)
>>
>> On Mon, Oct 09, 2017 at 09:33:01AM -0700, Laura Abbott wrote:
>> >On 10/06/2017 08:10 PM, Levin, Alexander (Sasha Levin) wrote:
>> >> We are experimenting with using neural network to aid with patch
>> >> selection for stable kernel trees. There are quite a few commits that
>> >> were not marked for stable, but are stable material, and we're trying
>> >> to get them into their appropriate kernel trees.
>> >>
>> >
>> >Apart from the practical which has been covered, I'd be interested
>> >in hearing about the details of how this works if you can share
>> >them.
>>
>> This work is based on Julia's work
>> (https://soarsmu.github.io/papers/icse12-patch.pdf) to identify
>> commits that fix bugs.
>>
>> Essentially, my approach to this is to extract as much information as
>> possbile form the commit, including things such as:
>>
>>  - How many times a certain word appeared in the message
>>  - Who is the author
>>  - Code metrics
>>  - etc
>>
>> In my case, I end up with about 30,000 of these "inputs", and train a
>> neural network based on whether a given commit was included in a
>> stable tree or not.
>>
>> This approach has a few drawbacks compared to the one Julia
>> described in her paper:
>>
>>  - Not every bug fixing commit ends up in stable (some end up in -rc
>> fixing a bug from the current merge window).
>>  - Same as above, but for commits we miss and fail to add to stable.
>>  - Sometimes commits get added to stable even though they don't follow
>> the rules at all (security fixes are a simple example).
>>
>> But it does seem to be effective at finding bug fixing commits that
>> should be in stable.
>>
>> At this stage we are still trying to figure out what a "bug fixing"
>> commit really is. For example, an observation we recently made was
>> that the code metrics actually don't have much weight in determining
>> whether a commit should be in stable or not.
>>
>> As we just started, I'm still experimenting with a few approaches, and
>> I belive Julia is waiting for a new student to take over this, so we
>> don't have any big insights to share just yet :)
>
> That's a good summary of the current status.  Thanks!
>
> julia

I just started noticing the AUTOSEL tags yesterday and I think that's
a great idea to tag patches, but was there any thought to also putting
something in the commit message this way they're easily identifiable
in the git logs? I think it would be useful if there was some metadata
in the commit message which identified that it was selected through
some automated system. That way if I find a regression and it
identifies one of these commits I can know that maybe it was chosen
incorrectly, and also would allow me to alert the owner of the
selection script to better help refine its selection process.
Otherwise I'd have to track back through the mailing lists to see how
it landed in the stable release.

Just a thought. Also, thank you for trying to improve the stable kernels!
-- 
Josh


Re: [PATCH 4.9 086/104] arm64: kasan: avoid bad virt_to_pfn()

2017-11-16 Thread Josh Hunt
On Thu, Nov 16, 2017 at 3:13 PM,   wrote:
>
> It's possible, but I didn't want to add a bunch of clutter to the
> commit message. Right now it's somewhat easy to track it back to
> automatic selection because:
>
>  1. I'm signed off on all of them, so I could chime in in the case
> concerns/issues arise with a patch.
>  2. They all have a corresponding review request email with the
> AUTOSEL marker.
>

I get the want to not clutter the commit logs. My comment was more
directed to a few weeks or months after the patch has made it into a
stable release. At that point I'm not sure the person who found the
problem with the change would know to CC you on any correspondence, or
even if the author of the change would know to contact you. Although I
guess maybe they'd eventually track things down and report to stable,
or Greg, or something else and it would eventually get back to you.

-- 
Josh


[tip: sched/core] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-09 Thread tip-bot2 for Josh Hunt
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 6db12ee0456d0e369c7b59788d46e15a56ad0294
Gitweb:
https://git.kernel.org/tip/6db12ee0456d0e369c7b59788d46e15a56ad0294
Author:Josh Hunt 
AuthorDate:Thu, 01 Apr 2021 22:58:33 -04:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00

psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.

Signed-off-by: Josh Hunt 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/20210402025833.27599-1-joh...@akamai.com
---
 kernel/sched/psi.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index b1b00e9..d1212f1 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1061,19 +1061,27 @@ static int psi_cpu_show(struct seq_file *m, void *v)
return psi_show(m, &psi_system, PSI_CPU);
 }
 
+static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void 
*))
+{
+   if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE))
+   return -EPERM;
+
+   return single_open(file, psi_show, NULL);
+}
+
 static int psi_io_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_io_show, NULL);
+   return psi_open(file, psi_io_show);
 }
 
 static int psi_memory_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_memory_show, NULL);
+   return psi_open(file, psi_memory_show);
 }
 
 static int psi_cpu_open(struct inode *inode, struct file *file)
 {
-   return single_open(file, psi_cpu_show, NULL);
+   return psi_open(file, psi_cpu_show);
 }
 
 struct psi_trigger *psi_trigger_create(struct psi_group *group,
@@ -1353,9 +1361,9 @@ static int __init psi_proc_init(void)
 {
if (psi_enable) {
proc_mkdir("pressure", NULL);
-   proc_create("pressure/io", 0, NULL, &psi_io_proc_ops);
-   proc_create("pressure/memory", 0, NULL, &psi_memory_proc_ops);
-   proc_create("pressure/cpu", 0, NULL, &psi_cpu_proc_ops);
+   proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
+   proc_create("pressure/memory", 0666, NULL, 
&psi_memory_proc_ops);
+   proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
}
return 0;
 }


[tip: perf/urgent] perf/x86/intel: Restrict period on Nehalem

2019-08-31 Thread tip-bot2 for Josh Hunt
The following commit has been merged into the perf/urgent branch of tip:

Commit-ID: 44d3bbb6f5e501b873218142fe08cdf62a4ac1f3
Gitweb:
https://git.kernel.org/tip/44d3bbb6f5e501b873218142fe08cdf62a4ac1f3
Author:Josh Hunt 
AuthorDate:Mon, 19 Aug 2019 19:13:31 -04:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 30 Aug 2019 14:27:47 +02:00

perf/x86/intel: Restrict period on Nehalem

We see our Nehalem machines reporting 'perfevents: irq loop stuck!' in
some cases when using perf:

perfevents: irq loop stuck!
WARNING: CPU: 0 PID: 3485 at arch/x86/events/intel/core.c:2282 
intel_pmu_handle_irq+0x37b/0x530
...
RIP: 0010:intel_pmu_handle_irq+0x37b/0x530
...
Call Trace:

? perf_event_nmi_handler+0x2e/0x50
? intel_pmu_save_and_restart+0x50/0x50
perf_event_nmi_handler+0x2e/0x50
nmi_handle+0x6e/0x120
default_do_nmi+0x3e/0x100
do_nmi+0x102/0x160
end_repeat_nmi+0x16/0x50
...
? native_write_msr+0x6/0x20
? native_write_msr+0x6/0x20

intel_pmu_enable_event+0x1ce/0x1f0
x86_pmu_start+0x78/0xa0
x86_pmu_enable+0x252/0x310
__perf_event_task_sched_in+0x181/0x190
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x35/0x70
? __switch_to_asm+0x41/0x70
? __switch_to_asm+0x35/0x70
finish_task_switch+0x158/0x260
__schedule+0x2f6/0x840
? hrtimer_start_range_ns+0x153/0x210
schedule+0x32/0x80
schedule_hrtimeout_range_clock+0x8a/0x100
? hrtimer_init+0x120/0x120
ep_poll+0x2f7/0x3a0
? wake_up_q+0x60/0x60
do_epoll_wait+0xa9/0xc0
__x64_sys_epoll_wait+0x1a/0x20
do_syscall_64+0x4e/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fdeb1e96c03
...
Signed-off-by: Peter Zijlstra (Intel) 
Cc: a...@kernel.org
Cc: Josh Hunt 
Cc: bpura...@akamai.com
Cc: mi...@redhat.com
Cc: jo...@redhat.com
Cc: t...@linutronix.de
Cc: namhy...@kernel.org
Cc: alexander.shish...@linux.intel.com
Link: 
https://lkml.kernel.org/r/1566256411-18820-1-git-send-email-joh...@akamai.com
---
 arch/x86/events/intel/core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 648260b..e4c2cb6 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3572,6 +3572,11 @@ static u64 bdw_limit_period(struct perf_event *event, 
u64 left)
return left;
 }
 
+static u64 nhm_limit_period(struct perf_event *event, u64 left)
+{
+   return max(left, 32ULL);
+}
+
 PMU_FORMAT_ATTR(event, "config:0-7");
 PMU_FORMAT_ATTR(umask, "config:8-15"   );
 PMU_FORMAT_ATTR(edge,  "config:18" );
@@ -4606,6 +4611,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_constraints = intel_nehalem_pebs_event_constraints;
x86_pmu.enable_all = intel_pmu_nhm_enable_all;
x86_pmu.extra_regs = intel_nehalem_extra_regs;
+   x86_pmu.limit_period = nhm_limit_period;
 
mem_attr = nhm_mem_events_attrs;