Re: 4.14.44: BUG_ON(!list_empty(&sem->wait_list));
On 5 June 2018 at 05:47, wrote:
>> -----Original Message-----
>> From: Daniel J Blueman [mailto:dan...@quora.org]
>> Sent: Thursday, May 31, 2018 9:21 PM
>> To: Linux Kernel; linux-a...@vger.kernel.org
>> Cc: Limonciello, Mario; Dominguez, Jared
>> Subject: 4.14.44: BUG_ON(!list_empty(&sem->wait_list));
>>
>> Plugging in a USB-C power source on my Dell XPS 9550 trips an ACPI
>> BUG_ON [1], reproducible with mainline 4.14.44, suggesting other
>> threads are waiting for semaphore acquisition due to
>> "BUG_ON(!list_empty(&sem->wait_list))".
>>
>> This is the current 1.7.0 BIOS with Ubuntu 18.04 userspace, plugging
>> in an LG 27UD88 (also with the current firmware) monitor USB-C
>> connection which apparently advertises 60W charging (x1,
>> PowerDelivery, DisplayPort alternative mode, data). The same issues
>> reproduce on a Dell Precision 5510 with Ubuntu 16.04, the shipped
>> kernel and 4.14.44.
>>
>> I can enable ACPI debugging if useful? Perhaps ACPI_DB_MUTEX or other
>> levels would be appropriate?
>
> I think most useful would be if this can still reproduce with 4.17.

Fair suggestion! I can achieve 100% reproducibility of the same
backtrace on a clean Ubuntu 18.04 install with 4.17 mainline [1]:
1. disable grub 'quiet' parameter, disconnect charger and power off
   laptop to S5
2. power on laptop from S5
3. suspend via closing lid
4. resume by opening lid
5. connect LG 27UD88 via USB-C
6. wait 20s
7. disconnect LG 27UD88
8. run 'systemctl poweroff'
9. observe the same backtrace from acpi_os_delete_semaphore

I don't observe the issue when using an Apple 87W USB-C Power Adapter,
so it may reproduce on other monitors advertising USB-C DisplayPort
alternate mode.

Thanks,
  Daniel

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17/linux-image-unsigned-4.17.0-041700-generic_4.17.0-041700.201806041953_amd64.deb

>> kernel BUG at /home/kernel/COD/linux/drivers/acpi/osl.c:1201
>> invalid opcode: [#1] SMP PTI
>> Modules linked in: [...]
>> CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.14.44-041444-generic
>> #201805251612
>> Hardware name: Dell Inc. XPS 15 9550/, BIOS 1.7.0 02/23/2018
>> task: 9bc2ab6b9740 task.stack: b0ca80034000
>> RIP: 0010:acpi_os_delete_semaphore+0x6d/0x70
>> RSP: 0018:b0ca80037be8 EFLAGS: 00010283
>> RAX: b0ca83f8fc40 RBX: 9bc238b5dbe0 RCX:
>> RDX: 9bc238b5dbe8 RSI: RDI: 9bc238b5dbe0
>> RBP: 9bc2ad1c0990 R08: 9bc2bdc25f20 R09: 9bc29ee56300
>> R10: e03bd2796440 R11: 9bc2ad183fa0 R12: 9bc22f1321e0
>> R13: 0001 R14: 0001 R15: 9bc22f132eb0
>> FS: 7fc03886f940() GS:9bc2bdc0()
>> knlGS:
>> CS: 0010 DS: ES: CR0: 80050033
>> CR2: 7ffc645e70f8 CR3: 00049e120001 CR4: 003606f0
>> Call Trace:
>> acpi_ex_system_reset_event+0x3f/0x65
>> acpi_ex_opcode_1A_0T_0R+0x70/0xfa
>> acpi_ds_exec_end_op+0x15d/0x71b
>> acpi_ps_parse_loop+0x929/0x9d6
>> ? acpi_ds_result_push+0x82/0x1d2
>> acpi_ps_parse_aml+0x1a2/0x4af
>> acpi_ps_execute_method+0x1ef/0x2ab
>> acpi_ns_evaluate+0x2e4/0x41d
>> acpi_evaluate_object+0x1cb/0x38e
>> acpi_enter_sleep_state_prep+0xae/0x13a
>> acpi_sleep_prepare.part.2+0x2e/0x40
>> acpi_power_off_prepare+0xf/0x20
>> [38871.192536] kernel_power_off+0x42/0x70
>> SYSC_reboot+0x12f/0x210
>> ? handle_mm_fault+0xea/0x1e0
>> [38871.192586] ? do_writev+0x5e/0xf0
>> ? do_writev+0x5e/0xf0
>> do_syscall_64+0x6e/0x120
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x7fc03839b373
>> RSP: 002b:7ffc645e70f8 EFLAGS: 0202 ORIG_RAX: 00a9
>> RAX: ffda RBX: 4321fedc RCX: 7fc03839b373
>> RDX: 4321fedc RSI: 28121969 RDI: fee1dead
>> RBP: 7ffc645e7160 R08: R09:
>> R10: 00000002 R11: 0202 R12: 7ffc645e7168
>> R13: R14: 001b0004 R15: 7ffc645e7458
>> Code: b8 00 04 00 00 48 c7 c1 c3 91 28 ab 48 c7 c2 20 91 28 ab be 0f
>> 04 00 00 bf 00 00 00 01 03 41 85 04 00 58 eb b0 b8 01 10 00 00 c3
>> 0b 90 0f 1f 44 00 00 80 3d 74 c0 97 01 00 41 54 55 53 0f 84
>> RIP: acpi_os_delete_semaphore+0x6d/0x70 RSP: b0ca80037be8
-- 
Daniel J Blueman
4.14.44: BUG_ON(!list_empty(&sem->wait_list));
Plugging in a USB-C power source on my Dell XPS 9550 trips an ACPI
BUG_ON [1], reproducible with mainline 4.14.44, suggesting other
threads are waiting for semaphore acquisition due to
"BUG_ON(!list_empty(&sem->wait_list))".

This is the current 1.7.0 BIOS with Ubuntu 18.04 userspace, plugging
in an LG 27UD88 (also with the current firmware) monitor USB-C
connection which apparently advertises 60W charging (x1,
PowerDelivery, DisplayPort alternative mode, data). The same issues
reproduce on a Dell Precision 5510 with Ubuntu 16.04, the shipped
kernel and 4.14.44.

I can enable ACPI debugging if useful? Perhaps ACPI_DB_MUTEX or other
levels would be appropriate?

Thanks,
  Daniel

-- [1]

kernel BUG at /home/kernel/COD/linux/drivers/acpi/osl.c:1201
invalid opcode: [#1] SMP PTI
Modules linked in: [...]
CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.14.44-041444-generic
#201805251612
Hardware name: Dell Inc. XPS 15 9550/, BIOS 1.7.0 02/23/2018
task: 9bc2ab6b9740 task.stack: b0ca80034000
RIP: 0010:acpi_os_delete_semaphore+0x6d/0x70
RSP: 0018:b0ca80037be8 EFLAGS: 00010283
RAX: b0ca83f8fc40 RBX: 9bc238b5dbe0 RCX:
RDX: 9bc238b5dbe8 RSI: RDI: 9bc238b5dbe0
RBP: 9bc2ad1c0990 R08: 9bc2bdc25f20 R09: 9bc29ee56300
R10: e03bd2796440 R11: 9bc2ad183fa0 R12: 9bc22f1321e0
R13: 0001 R14: 0001 R15: 9bc22f132eb0
FS: 7fc03886f940() GS:9bc2bdc0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7ffc645e70f8 CR3: 00049e120001 CR4: 003606f0
Call Trace:
acpi_ex_system_reset_event+0x3f/0x65
acpi_ex_opcode_1A_0T_0R+0x70/0xfa
acpi_ds_exec_end_op+0x15d/0x71b
acpi_ps_parse_loop+0x929/0x9d6
? acpi_ds_result_push+0x82/0x1d2
acpi_ps_parse_aml+0x1a2/0x4af
acpi_ps_execute_method+0x1ef/0x2ab
acpi_ns_evaluate+0x2e4/0x41d
acpi_evaluate_object+0x1cb/0x38e
acpi_enter_sleep_state_prep+0xae/0x13a
acpi_sleep_prepare.part.2+0x2e/0x40
acpi_power_off_prepare+0xf/0x20
[38871.192536] kernel_power_off+0x42/0x70
SYSC_reboot+0x12f/0x210
? handle_mm_fault+0xea/0x1e0
[38871.192586] ? do_writev+0x5e/0xf0
? do_writev+0x5e/0xf0
do_syscall_64+0x6e/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fc03839b373
RSP: 002b:7ffc645e70f8 EFLAGS: 0202 ORIG_RAX: 00a9
RAX: ffda RBX: 4321fedc RCX: 7fc03839b373
RDX: 4321fedc RSI: 28121969 RDI: fee1dead
RBP: 7ffc645e7160 R08: R09:
R10: 0002 R11: 0202 R12: 7ffc645e7168
R13: R14: 001b0004 R15: 7ffc645e7458
Code: b8 00 04 00 00 48 c7 c1 c3 91 28 ab 48 c7 c2 20 91 28 ab be 0f
04 00 00 bf 00 00 00 01 03 41 85 04 00 58 eb b0 b8 01 10 00 00 c3
0b 90 0f 1f 44 00 00 80 3d 74 c0 97 01 00 41 54 55 53 0f 84
RIP: acpi_os_delete_semaphore+0x6d/0x70 RSP: b0ca80037be8
-- 
Daniel J Blueman
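For readers following the report, the invariant behind the BUG_ON in acpi_os_delete_semaphore can be modelled in userspace: a semaphore may only be deleted once nothing is queued on its wait list. The sketch below is not the kernel code; every model_* name is hypothetical, and the list is a plain circular list written in the style of the kernel's struct list_head.

```c
#include <assert.h>
#include <stdbool.h>

/* Circular doubly linked list node, modelled on the kernel's struct list_head. */
struct model_list {
	struct model_list *next, *prev;
};

/* Hypothetical stand-in for a counting semaphore with a list of sleepers. */
struct model_sem {
	unsigned int count;
	struct model_list wait_list;
};

static void model_list_init(struct model_list *head)
{
	head->next = head;
	head->prev = head;
}

/* An empty circular list is one whose head points back at itself. */
static bool model_list_empty(const struct model_list *head)
{
	return head->next == head;
}

/* Queue a waiter at the tail, as a task would do before sleeping. */
static void model_list_add_tail(struct model_list *node, struct model_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink a waiter, as happens when it is woken or gives up. */
static void model_list_del(struct model_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	model_list_init(node);
}

static void model_sem_init(struct model_sem *sem, unsigned int count)
{
	sem->count = count;
	model_list_init(&sem->wait_list);
}

/*
 * Deleting is only safe when no waiter is queued; the kernel asserts
 * this condition with BUG_ON instead of returning a value.
 */
static bool model_sem_safe_to_delete(const struct model_sem *sem)
{
	return model_list_empty(&sem->wait_list);
}
```

In this model the reported crash corresponds to deleting a semaphore while model_sem_safe_to_delete() would still return false, which is consistent with the suggestion above that another thread was still waiting on it.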
4.14.34: kernel stack regs has bad 'bp' value
0638ad7d20: 8810513d5540 (0x8810513d5540)
880638ad7d28: ...
880638ad7d48: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7d50: 1100c715afb4 (0x1100c715afb4)
880638ad7d58: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7d60: 0002 (0x2)
880638ad7d68: 8810513d5550 (0x8810513d5550)
880638ad7d70: 8805e358ce30 (0x8805e358ce30)
880638ad7d78: 0004 (0x4)
880638ad7d80: 8810 (0x8810)
880638ad7d88: ...
880638ad7d90: 0400 (0x400)
880638ad7d98: 880638ad7ce0 (0x880638ad7ce0)
880638ad7da0: 0001 (0x1)
880638ad7da8: 8805e358ce30 (0x8805e358ce30)
880638ad7db0: ...
880638ad7dc0: 880638ad7e08 (0x880638ad7e08)
880638ad7dc8: 816ac15d (rw_verify_area+0xbd/0x2b0)
880638ad7dd0: 0020 (0x20)
880638ad7dd8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7de0: ...
880638ad7de8: 0400 (0x400)
880638ad7df0: 8810513d5540 (0x8810513d5540)
880638ad7df8: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7e00: 8810513d5584 (0x8810513d5584)
880638ad7e08: 880638ad7e48 (0x880638ad7e48)
880638ad7e10: 816b01ff (vfs_read+0xef/0x2f0)
880638ad7e18: 880638ad7e90 (0x880638ad7e90)
880638ad7e20: 8810513d5540 (0x8810513d5540)
880638ad7e28: 1100c715afce (0x1100c715afce)
880638ad7e30: 8810513d5540 (0x8810513d5540)
880638ad7e38: 880638ad7ed0 (0x880638ad7ed0)
880638ad7e40: 8810513d55a8 (0x8810513d55a8)
880638ad7e48: 880638ad7ef8 (0x880638ad7ef8)
880638ad7e50: 816b1672 (SyS_read+0xd2/0x1b0)
880638ad7e58: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7e60: 0400 (0x400)
880638ad7e68: dc00 (0xdc00)
880638ad7e70: 41b58ab3 (0x41b58ab3)
880638ad7e78: 8332f7eb (inat_primary_table+0x18292b/0x1d0d97)
880638ad7e80: 816b15a0 (kernel_write+0x130/0x130)
880638ad7e88: ...
880638ad7e98: 881004cd9f8c (0x881004cd9f8c)
880638ad7ea0: 0002 (0x2)
880638ad7ea8: dc00 (0xdc00)
880638ad7eb0: 880638ad7f58 (0x880638ad7f58)
880638ad7eb8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7ec0: 880638ad7f58 (0x880638ad7f58)
880638ad7ec8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7ed0: 880638ad7f58 (0x880638ad7f58)
880638ad7ed8: 816b15a0 (kernel_write+0x130/0x130)
880638ad7ee0: 881004cd9f40 (0x881004cd9f40)
880638ad7ee8: ...
880638ad7ef8: 880638ad7f48 (0x880638ad7f48)
880638ad7f00: 810073d9 (do_syscall_64+0x199/0x4c0)
880638ad7f08: ...
880638ad7f10: 880638ad7f58 (0x880638ad7f58)
880638ad7f18: ...
880638ad7f20: 880638ad7f48 (0x880638ad7f48)
880638ad7f28: 8100702e (prepare_exit_to_usermode+0x11e/0x150)
880638ad7f30: ...
880638ad7f50: 82a00081 (entry_SYSCALL_64_after_hwframe+0x3d/0xa2)
880638ad7f58: 0005 (0x5)
880638ad7f60: 7ffddaab61d0 (0x7ffddaab61d0)
880638ad7f68: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7f70: 0048 (0x48)
880638ad7f78: 7ffddaab6a10 (0x7ffddaab6a10)
880638ad7f80: 0008 (0x8)
880638ad7f88: 0246 (0x246)
880638ad7f90: 7ffddaab60a0 (0x7ffddaab60a0)
880638ad7f98: 4000 (0x4000)
880638ad7fa0: 7ffddaab6050 (0x7ffddaab6050)
880638ad7fa8: ffda (0xffda)
880638ad7fb0: 7f504bfe56f0 (0x7f504bfe56f0)
880638ad7fb8: 0400 (0x400)
880638ad7fc0: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7fc8: 0005 (0x5)
880638ad7fd0: ...
880638ad7fd8: 7f504bfe56f0 (0x7f504bfe56f0)
880638ad7fe0: 0033 (0x33)
880638ad7fe8: 0246 (0x246)
880638ad7ff0: 7ffddaab6048 (0x7ffddaab6048)
880638ad7ff8: 002b (0x2b)
-- 
Daniel J Blueman
drm/vc4: false-positive negative cursor position warning
Hi Eric et al,

In a number of windowing environments (eg GNOME 3) on Raspberry Pi 3B
on 4.16.0 arm64, the mouse cursor top-left gets down to x,y -4,-4,
tripping WARN_ON_ONCE(plane->state->crtc_x < 0 || plane->state->crtc_y
< 0) [1], which therefore seems false-positive.

Git history doesn't turn up any reason, eg it could cause undefined
hardware behaviour, which it doesn't appear to, so would it be better
to drop the warning, or adjust it to trip on x or y < -4 or so? If so,
I'll prepare a patch to adjust it.

[Side note: simply opening the GNOME 3 Activities menu with
libgl1-mesa-dri 17.3.7 is a reliable way to reproduce "[drm] Resetting
GPU"]

Thanks,
  Dan

-- [1]

WARNING: CPU: 3 PID: 966 at drivers/gpu/drm/vc4/vc4_plane.c:771
vc4_plane_async_set_fb+0x98/0xa0
CPU: 3 PID: 966 Comm: Xorg Tainted: G S 4.16.0+ #13
Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
pstate: 0005 (nzcv daif -PAN -UAO)
pc : vc4_plane_async_set_fb+0x98/0xa0
lr : vc4_plane_async_set_fb+0x4c/0xa0
sp : 086ab9b0
x29: 086ab9b0 x28:
x27: 0009 x26: fffc
x25: a81b36ca8b00 x24: a81b30667c00
x23: 0040 x22: a81b30790400
x21: a81b36ca8b00 x20: a81b30a53018
x19: a81b30667c00 x18:
x17: b4fcec50 x16: 3447cc0e8588
x15: 3447d14fbf88 x14: 344851bb337f
x13: 3447d1bb338d x12: 3447d153b000
x11: 3447d14fc7f0 x10: 3447ccb3dac8
x9 : ffd0 x8 : 0005
x7 : 3932373639343932 x6 : 056e
x5 : x4 :
x3 : x2 : ed59bd53d8905e00
x1 : fffc x0 : a81b30667c00
Call trace:
vc4_plane_async_set_fb+0x98/0xa0
vc4_update_plane+0x124/0x1a0
__setplane_internal+0x1f4/0x260
drm_mode_cursor_universal+0xf4/0x220
drm_mode_cursor_common+0x19c/0x218
drm_mode_cursor2_ioctl+0x34/0x48
drm_ioctl_kernel+0x70/0xd8
drm_ioctl+0x30c/0x438
do_vfs_ioctl+0xc4/0x880
SyS_ioctl+0x8c/0xa8
el0_svc_naked+0x30/0x34
-- 
Daniel J Blueman
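The two options discussed in this message can be compared with a small userspace sketch. It only models the predicate: the strict form mirrors the condition quoted in the WARN_ON_ONCE, while the relaxed form and its tolerance of 4 are assumptions taken from the "x or y < -4 or so" suggestion, not from any merged patch. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tolerance, taken from the "-4 or so" suggestion in the mail. */
#define MODEL_CURSOR_NEG_TOLERANCE 4

/* Mirrors the quoted condition: any negative coordinate trips the warning. */
static bool model_cursor_warns_strict(int crtc_x, int crtc_y)
{
	return crtc_x < 0 || crtc_y < 0;
}

/* Relaxed variant: only coordinates beyond the tolerance trip the warning. */
static bool model_cursor_warns_relaxed(int crtc_x, int crtc_y)
{
	return crtc_x < -MODEL_CURSOR_NEG_TOLERANCE ||
	       crtc_y < -MODEL_CURSOR_NEG_TOLERANCE;
}
```

With the reported position of -4,-4 the strict predicate is true and the relaxed one is false; whether a fixed tolerance is appropriate for the hardware is exactly the open question raised above.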
[PATCH] drm/vc4: Fix memory leak during BO teardown
During BO teardown, an indirect list 'uniform_addr_offsets' wasn't being
freed leading to leaking many 128B allocations. Fix the memory leak by
releasing it at teardown time.

To: linux-kernel@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Cc: Eric Anholt <e...@anholt.net>
Cc: Dave Airlie <airl...@redhat.com>
Cc: sta...@vger.kernel.org
Signed-off-by: Daniel J Blueman <dan...@quora.org>
---
 drivers/gpu/drm/vc4/vc4_bo.c               | 2 ++
 drivers/gpu/drm/vc4/vc4_validate_shaders.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/vc4/vc4_bo.c b/drivers/gpu/drm/vc4/vc4_bo.c
index 2decc8e2c79f..add9cc97a3b6 100644
--- a/drivers/gpu/drm/vc4/vc4_bo.c
+++ b/drivers/gpu/drm/vc4/vc4_bo.c
@@ -195,6 +195,7 @@ static void vc4_bo_destroy(struct vc4_bo *bo)
 	vc4_bo_set_label(obj, -1);
 
 	if (bo->validated_shader) {
+		kfree(bo->validated_shader->uniform_addr_offsets);
 		kfree(bo->validated_shader->texture_samples);
 		kfree(bo->validated_shader);
 		bo->validated_shader = NULL;
@@ -591,6 +592,7 @@ void vc4_free_object(struct drm_gem_object *gem_bo)
 	}
 
 	if (bo->validated_shader) {
+		kfree(bo->validated_shader->uniform_addr_offsets);
 		kfree(bo->validated_shader->texture_samples);
 		kfree(bo->validated_shader);
 		bo->validated_shader = NULL;
diff --git a/drivers/gpu/drm/vc4/vc4_validate_shaders.c b/drivers/gpu/drm/vc4/vc4_validate_shaders.c
index d3f15bf60900..7cf82b071de2 100644
--- a/drivers/gpu/drm/vc4/vc4_validate_shaders.c
+++ b/drivers/gpu/drm/vc4/vc4_validate_shaders.c
@@ -942,6 +942,7 @@ vc4_validate_shader(struct drm_gem_cma_object *shader_obj)
 fail:
 	kfree(validation_state.branch_targets);
 	if (validated_shader) {
+		kfree(validated_shader->uniform_addr_offsets);
 		kfree(validated_shader->texture_samples);
 		kfree(validated_shader);
 	}
-- 
2.11.0
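The class of leak fixed here (a struct owning several heap allocations, one of which is forgotten on some teardown path) can be demonstrated with a userspace model. This is a sketch and not the patch's approach: the patch adds a kfree() at each of three sites, whereas the model routes all teardown through one hypothetical helper. All model_* names, the allocation counter and the buffer sizes are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Counts outstanding allocations so a leak shows up as a non-zero value. */
static int model_live_allocs;

static void *model_alloc(size_t size)
{
	void *p = calloc(1, size);

	if (p)
		model_live_allocs++;
	return p;
}

/* Like kfree(), freeing a NULL pointer is a no-op. */
static void model_free(void *p)
{
	if (p) {
		model_live_allocs--;
		free(p);
	}
}

/* Hypothetical stand-in for a validated shader owning two indirect buffers. */
struct model_shader {
	uint32_t *uniform_addr_offsets;
	void *texture_samples;
};

/* Single teardown helper: every owned buffer is released in one place. */
static void model_shader_destroy(struct model_shader *shader)
{
	if (!shader)
		return;
	model_free(shader->uniform_addr_offsets);
	model_free(shader->texture_samples);
	model_free(shader);
}

static struct model_shader *model_shader_create(void)
{
	struct model_shader *shader = model_alloc(sizeof(*shader));

	if (!shader)
		return NULL;
	/* Arbitrary sizes; 32 * 4 bytes gives a 128-byte buffer. */
	shader->uniform_addr_offsets = model_alloc(32 * sizeof(uint32_t));
	shader->texture_samples = model_alloc(64);
	if (!shader->uniform_addr_offsets || !shader->texture_samples) {
		/* The failure path reuses the same helper as normal teardown. */
		model_shader_destroy(shader);
		return NULL;
	}
	return shader;
}
```

Dropping the uniform_addr_offsets line from model_shader_destroy() leaves model_live_allocs at 1 after teardown, which is the userspace analogue of the leak described in the commit message.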
Re: stack frame unwindind KASAN errors
On 7 March 2017 at 00:40, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> On Mon, Mar 06, 2017 at 02:52:01PM +0800, Daniel J Blueman wrote:
>> Thanks Josh!
>>
>> With this patch, the KASAN warning still occurs, but at
>> unwind_get_return_address+0x1d3/0x130 instead; the rest of the trace
>> is identical.
>>
>> (gdb) list *(unwind_get_return_address+0x1d3)
>> 0x8112bca3 is in unwind_get_return_address
>> (./include/linux/compiler.h:243).
>> 238	})
>> 239
>> 240	static __always_inline
>> 241	void __read_once_size(const volatile void *p, void *res, int size)
>> 242	{
>> 243		__READ_ONCE_SIZE;
>
> Looking deeper, I have an idea about what's going on:
>
>   https://quora.org/dmesg.txt
>
> Each of the warnings seems to show an interrupt happening during an EFI
> call.  I'm guessing EFI modified the frame pointer, at least
> temporarily, which confused the unwinder :-(
>
> Would it be possible for you to test again with 4.10?  It has some
> additional unwinder output which should hopefully confirm my suspicions.

Very good; I don't see the KASAN warnings with 4.10 in the same environment.

Thanks,
  Daniel
-- 
Daniel J Blueman
Re: stack frame unwindind KASAN errors
On 27 February 2017 at 23:47, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> On Mon, Feb 27, 2017 at 12:49:59PM +0800, Daniel J Blueman wrote:
>> On 4.9.13 with KASAN enabled [1], we see a number of stack unwinding
>> errors reported [2,3].
>>
>> This seems to occur at half of boots.
>>
>> Let me know for further debug info/patch testing and thanks,
>>   Daniel
>>
>> [1] https://quora.org/config
>> [2] https://quora.org/dmesg.txt
>
> Hi Daniel,
>
> Can you try the following patch?  It's a backport of the following
> upstream commit:
>
>   09ae68dd0a8d ("x86/unwind: Disable KASAN checks for non-current tasks")
>
> If it fixes it then I'll submit it for 4.9 stable.
>
> ---
>
> From: Josh Poimboeuf <jpoim...@redhat.com>
> Subject: [PATCH] x86/unwind: Disable KASAN checks for non-current tasks
>
> There are a handful of callers to save_stack_trace_tsk() and
> show_stack() which try to unwind the stack of a task other than current.
> In such cases, it's remotely possible that the task is running on one
> CPU while the unwinder is reading its stack from another CPU, causing
> the unwinder to see stack corruption.
>
> These cases seem to be mostly harmless.  The unwinder has checks which
> prevent it from following bad pointers beyond the bounds of the stack.
> So it's not really a bug as long as the caller understands that
> unwinding another task will not always succeed.
>
> In such cases, it's possible that the unwinder may read a KASAN-poisoned
> region of the stack.  Account for that by using READ_ONCE_NOCHECK() when
> reading the stack of another task.
>
> Use READ_ONCE() when reading the stack of the current task, since KASAN
> warnings can still be useful for finding bugs in that case.
>
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Borislav Petkov <b...@alien8.de>
> Cc: Brian Gerst <brge...@gmail.com>
> Cc: Dave Jones <da...@codemonkey.org.uk>
> Cc: Denys Vlasenko <dvlas...@redhat.com>
> Cc: H. Peter Anvin <h...@zytor.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Miroslav Benes <mbe...@suse.cz>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Link:
> http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoim...@redhat.com
> Signed-off-by: Ingo Molnar <mi...@kernel.org>
> ---
>  arch/x86/include/asm/stacktrace.h |  5 ++++-
>  arch/x86/kernel/unwind_frame.c    | 20 ++++++++++++++++++--
>  2 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/stacktrace.h
> b/arch/x86/include/asm/stacktrace.h
> index 37f2e0b..4141ead 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -55,13 +55,16 @@ extern int kstack_depth_to_print;
>  static inline unsigned long *
>  get_frame_pointer(struct task_struct *task, struct pt_regs *regs)
>  {
> +	struct inactive_task_frame *frame;
> +
>  	if (regs)
>  		return (unsigned long *)regs->bp;
>
>  	if (task == current)
>  		return __builtin_frame_address(0);
>
> -	return (unsigned long *)((struct inactive_task_frame
> *)task->thread.sp)->bp;
> +	frame = (struct inactive_task_frame *)task->thread.sp;
> +	return (unsigned long *)READ_ONCE_NOCHECK(frame->bp);
>  }
>  #else
>  static inline unsigned long *
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> index a2456d4..caff129 100644
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -6,6 +6,21 @@
>
>  #define FRAME_HEADER_SIZE (sizeof(long) * 2)
>
> +/*
> + * This disables KASAN checking when reading a value from another task's
> stack,
> + * since the other task could be running on another CPU and could have
> poisoned
> + * the stack in the meantime.
> + */
> +#define READ_ONCE_TASK_STACK(task, x)			\
> +({							\
> +	unsigned long val;				\
> +	if (task == current)				\
> +		val = READ_ONCE(x);			\
> +	else						\
> +		val = READ_ONCE_NOCHECK(x);		\
> +	val;						\
> +})
> +
>  unsigned long unwind_get_return_address(struct unwind_state *state)
>
Re: stack frame unwinding KASAN errors
On 27 February 2017 at 23:47, Josh Poimboeuf wrote: > On Mon, Feb 27, 2017 at 12:49:59PM +0800, Daniel J Blueman wrote: >> On 4.9.13 with KASAN enabled [1], we see a number of stack unwinding >> errors reported [2,3]. >> >> This seems to occur at half of boots. >> >> Let me know for further debug info/patch testing and thanks, >> Daniel >> >> [1] https://quora.org/config >> [2] https://quora.org/dmesg.txt > > Hi Daniel, > > Can you try the following patch? It's a backport of the following > upstream commit: > > 09ae68dd0a8d ("x86/unwind: Disable KASAN checks for non-current tasks") > > If it fixes it then I'll submit it for 4.9 stable. > > --- > > From: Josh Poimboeuf > Subject: [PATCH] x86/unwind: Disable KASAN checks for non-current tasks > > There are a handful of callers to save_stack_trace_tsk() and > show_stack() which try to unwind the stack of a task other than current. > In such cases, it's remotely possible that the task is running on one > CPU while the unwinder is reading its stack from another CPU, causing > the unwinder to see stack corruption. > > These cases seem to be mostly harmless. The unwinder has checks which > prevent it from following bad pointers beyond the bounds of the stack. > So it's not really a bug as long as the caller understands that > unwinding another task will not always succeed. > > In such cases, it's possible that the unwinder may read a KASAN-poisoned > region of the stack. Account for that by using READ_ONCE_NOCHECK() when > reading the stack of another task. > > Use READ_ONCE() when reading the stack of the current task, since KASAN > warnings can still be useful for finding bugs in that case. > > Reported-by: Dmitry Vyukov > Signed-off-by: Josh Poimboeuf > Cc: Andy Lutomirski > Cc: Andy Lutomirski > Cc: Borislav Petkov > Cc: Brian Gerst > Cc: Dave Jones > Cc: Denys Vlasenko > Cc: H. 
Peter Anvin > Cc: Linus Torvalds > Cc: Miroslav Benes > Cc: Peter Zijlstra > Cc: Thomas Gleixner > Link: > http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoim...@redhat.com > Signed-off-by: Ingo Molnar > --- > arch/x86/include/asm/stacktrace.h | 5 - > arch/x86/kernel/unwind_frame.c| 20 ++-- > 2 files changed, 22 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/asm/stacktrace.h > b/arch/x86/include/asm/stacktrace.h > index 37f2e0b..4141ead 100644 > --- a/arch/x86/include/asm/stacktrace.h > +++ b/arch/x86/include/asm/stacktrace.h > @@ -55,13 +55,16 @@ extern int kstack_depth_to_print; > static inline unsigned long * > get_frame_pointer(struct task_struct *task, struct pt_regs *regs) > { > + struct inactive_task_frame *frame; > + > if (regs) > return (unsigned long *)regs->bp; > > if (task == current) > return __builtin_frame_address(0); > > - return (unsigned long *)((struct inactive_task_frame > *)task->thread.sp)->bp; > + frame = (struct inactive_task_frame *)task->thread.sp; > + return (unsigned long *)READ_ONCE_NOCHECK(frame->bp); > } > #else > static inline unsigned long * > diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c > index a2456d4..caff129 100644 > --- a/arch/x86/kernel/unwind_frame.c > +++ b/arch/x86/kernel/unwind_frame.c > @@ -6,6 +6,21 @@ > > #define FRAME_HEADER_SIZE (sizeof(long) * 2) > > +/* > + * This disables KASAN checking when reading a value from another task's > stack, > + * since the other task could be running on another CPU and could have > poisoned > + * the stack in the meantime. 
> + */ > +#define READ_ONCE_TASK_STACK(task, x) \ > +({ \ > + unsigned long val; \ > + if (task == current)\ > + val = READ_ONCE(x); \ > + else\ > + val = READ_ONCE_NOCHECK(x); \ > + val;\ > +}) > + > unsigned long unwind_get_return_address(struct unwind_state *state) > { > unsigned long addr; > @@ -14,7 +29,8 @@ unsigned long unwind_get_return_address(struct unwind_state > *state) > if (unwind_done(state)) > return 0; > > - addr = ftrace_graph_ret_addr(state->task, >graph_idx, *addr_p, > + addr = READ_ONCE_TASK_STACK(state->task, *addr_p); > + addr = ftrace_graph_ret_addr(state->task, >
stack frame unwinding KASAN errors
On 4.9.13 with KASAN enabled [1], we see a number of stack unwinding errors reported [2,3]. This seems to occur at half of boots. Let me know for further debug info/patch testing and thanks, Daniel [1] https://quora.org/config [2] https://quora.org/dmesg.txt -- [3] BUG: KASAN: stack-out-of-bounds in unwind_get_return_address+0x11d/0x130 at addr 881034eafa08 Read of size 8 by task systemd/1 page:ea0040d3abc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0x2f8000() page dumped because: kasan: bad access detected CPU: 20 PID: 1 Comm: systemd Not tainted 4.9.13-debug+ #3 Hardware name: Supermicro Super Server/X10DRL-i, BIOS 2.0a 08/25/2016 881c2f607a60 b0cdb541 881c2f607af8 881034eafa08 881c2f607ae8 b064dd17 881034ea4f70 0024 0286 881034ea4fe2 Call Trace: [] dump_stack+0x85/0xc4 [] kasan_report_error+0x4d7/0x500 [] __asan_report_load8_noabort+0x61/0x70 [] ? unwind_get_return_address+0x11d/0x130 [] unwind_get_return_address+0x11d/0x130 [] ? unwind_next_frame+0x97/0xf0 [] __save_stack_trace+0x92/0x100 [] ? file_free_rcu+0x46/0x60 [] save_stack_trace+0x1b/0x20 [] save_stack+0x46/0xd0 [] ? save_stack_trace+0x1b/0x20 [] ? save_stack+0x46/0xd0 [] ? kasan_slab_free+0x71/0xb0 [] ? kmem_cache_free+0xc4/0x350 [] ? file_free_rcu+0x46/0x60 [] ? rcu_process_callbacks+0x9d2/0x1220 [] ? __do_softirq+0x286/0x87d [] ? irq_exit+0x160/0x190 [] ? smp_apic_timer_interrupt+0x80/0xa0 [] ? apic_timer_interrupt+0x8c/0xa0 [] ? debug_check_no_locks_freed+0x290/0x290 [] ? debug_object_deactivate+0xf8/0x320 [] ? _raw_spin_unlock_irqrestore+0x5f/0x80 [] ? trace_hardirqs_on_caller+0x19e/0x580 [] ? _raw_spin_unlock_irqrestore+0x4a/0x80 [] ? mark_held_locks+0xc8/0x120 [] ? kmem_cache_free+0xaf/0x350 [] ? file_free_rcu+0x46/0x60 [] kasan_slab_free+0x71/0xb0 [] kmem_cache_free+0xc4/0x350 [] file_free_rcu+0x46/0x60 [] rcu_process_callbacks+0x9d2/0x1220 [] ? rcu_process_callbacks+0x97d/0x1220 [] ? 
get_max_files+0x20/0x20 [] __do_softirq+0x286/0x87d [] irq_exit+0x160/0x190 [] smp_apic_timer_interrupt+0x80/0xa0 [] apic_timer_interrupt+0x8c/0xa0 [] ? save_stack+0x46/0xd0 [] ? debug_check_no_locks_freed+0x290/0x290 [] ? mark_held_locks+0xc8/0x120 [] ? efi_call+0x58/0x90 [] ? virt_efi_get_variable+0x9c/0x150 [] ? efivar_entry_size+0xa4/0x110 [] ? efivarfs_callback+0x30f/0x4e7 [] ? efivarfs_evict_inode+0x10/0x10 [] mark_held_locks+0xc8/0x120 [] ? _raw_spin_unlock_irqrestore+0x5f/0x80 [] ? _raw_spin_unlock_irqrestore+0x4a/0x80 [] ? efivar_init+0x512/0x750 [] ? efivarfs_evict_inode+0x10/0x10 [] ? efivar_entry_iter+0x140/0x140 [] ? debug_lockdep_rcu_enabled+0x77/0x90 [] ? d_instantiate+0x6f/0x80 [] ? _raw_spin_unlock+0x31/0x50 [] ? _raw_spin_unlock+0x31/0x50 [] ? d_instantiate+0x6f/0x80 [] ? efivarfs_mount+0x20/0x20 [] ? efivarfs_fill_super+0x1ea/0x290 [] ? efivarfs_mount+0x20/0x20 [] ? mount_single+0xcc/0x130 [] ? efivarfs_mount+0x18/0x20 [] ? mount_fs+0x81/0x2c0 [] ? alloc_vfsmnt+0x451/0x720 [] ? vfs_kern_mount+0x6b/0x370 [] ? do_mount+0x355/0x2af0 [] ? debug_lockdep_rcu_enabled+0x77/0x90 [] ? copy_mount_string+0x20/0x20 [] ? __might_fault+0xf6/0x1b0 [] ? __check_object_size+0x1b4/0x3fe [] ? memdup_user+0x6b/0xa0 [] ? SyS_mount+0x95/0xe0 [] ? entry_SYSCALL_64_fastpath+0x23/0xc6 Memory state around the buggy address: 881034eaf900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 881034eaf980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 >881034eafa00: f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 00 f4 f4 f3 f3 ^ 881034eafa80: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 881034eafb00: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 Disabling lock debugging due to kernel taint -- Daniel J Blueman
Re: [4.9.10] ip_route_me_harder() reading off-slab
On 17 February 2017 at 13:36, Eric Dumazet <eric.duma...@gmail.com> wrote: > On Fri, 2017-02-17 at 12:36 +0800, Daniel J Blueman wrote: >> When booting a VM in libvirt/KVM attached to a local bridge and KASAN >> enabled on 4.9.10, we see a stream of KASAN warnings about off-slab >> access [1]. >> >> Let me know if you'd like more debug. > > Could you try the following patch ? > > Thanks ! > > diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c > index > b3cc1335adbc1a20dcd225d0501b0a286d27e3c8..18839e59da849f0988924bcbc9873965a3681eb0 > 100644 > --- a/net/ipv4/netfilter.c > +++ b/net/ipv4/netfilter.c > @@ -23,7 +23,8 @@ int ip_route_me_harder(struct net *net, struct sk_buff > *skb, unsigned int addr_t > struct rtable *rt; > struct flowi4 fl4 = {}; > __be32 saddr = iph->saddr; > - __u8 flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0; > + struct sock *sk = skb->sk; > + __u8 flags = sk && sk_fullsock(sk) ? inet_sk_flowi_flags(sk) : 0; > struct net_device *dev = skb_dst(skb)->dev; > unsigned int hh_len; > > @@ -40,7 +41,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff > *skb, unsigned int addr_t > fl4.daddr = iph->daddr; > fl4.saddr = saddr; > fl4.flowi4_tos = RT_TOS(iph->tos); > - fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0; > + fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0; > if (!fl4.flowi4_oif) > fl4.flowi4_oif = l3mdev_master_ifindex(dev); > fl4.flowi4_mark = skb->mark; > @@ -61,7 +62,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff > *skb, unsigned int addr_t > xfrm_decode_session(skb, flowi4_to_flowi(), AF_INET) == 0) { > struct dst_entry *dst = skb_dst(skb); > skb_dst_set(skb, NULL); > - dst = xfrm_lookup(net, dst, flowi4_to_flowi(), skb->sk, > 0); > + dst = xfrm_lookup(net, dst, flowi4_to_flowi(), sk, 0); > if (IS_ERR(dst)) > return PTR_ERR(dst); > skb_dst_set(skb, dst); Fine work! This nicely resolves the issue. I'll test Florian's proposed fix also. 
Tested-by: Daniel J Blueman <dan...@quora.org> Thanks, Dan -- Daniel J Blueman
[4.9.10] ip_route_me_harder() reading off-slab
[ 473.580640] [] ? nf_hook_slow+0xf6/0x1b0 [ 473.580651] [] ? nf_iterate+0x2d0/0x2d0 [ 473.580660] [] ip_finish_output+0x5a8/0x9b0 [ 473.580670] [] ip_output+0x1d6/0x520 [ 473.580679] [] ? ip_output+0x21d/0x520 [ 473.580692] [] ? ip_mc_output+0xc10/0xc10 [ 473.580704] [] ? ip_fragment.constprop.54+0x220/0x220 [ 473.580714] [] ip_local_out+0x7d/0x130 [ 473.580724] [] ip_queue_xmit+0x7f7/0x1bc0 [ 473.580733] [] ? ip_queue_xmit+0x3e/0x1bc0 [ 473.580749] [] ? __skb_clone+0x97/0x7d0 [ 473.580760] [] tcp_transmit_skb+0x172c/0x3430 [ 473.580771] [] ? kasan_unpoison_shadow+0x36/0x50 [ 473.580782] [] ? __tcp_select_window+0x6b0/0x6b0 [ 473.580795] [] ? fib_table_lookup+0xde2/0x1580 [ 473.580808] [] ? sk_stream_alloc_skb+0x2da/0x770 [ 473.580816] [] ? tcp_mtup_init+0x1af/0x330 [ 473.580827] [] tcp_connect+0x1ffd/0x2e30 [ 473.580836] [] ? trace_hardirqs_on+0xd/0x10 [ 473.580850] [] ? tcp_push_one+0xf0/0xf0 [ 473.580862] [] ? secure_tcp_sequence_number+0x101/0x190 [ 473.580873] [] ? secure_dccpv6_sequence_number+0x440/0x440 [ 473.580885] [] ? ip_rt_update_pmtu+0xd10/0xd10 [ 473.580896] [] ? xfrm_lookup_route+0x21/0x160 [ 473.580910] [] tcp_v4_connect+0xe08/0x1cd0 [ 473.580923] [] __inet_stream_connect+0x64b/0xd70 [ 473.580934] [] ? inet_bind+0x880/0x880 [ 473.580946] [] ? lock_sock_nested+0x90/0x110 [ 473.580955] [] ? trace_hardirqs_on+0xd/0x10 [ 473.580965] [] ? __local_bh_enable_ip+0x70/0xc0 [ 473.580980] [] inet_stream_connect+0x55/0xa0 [ 473.580991] [] SYSC_connect+0x22c/0x2d0 [ 473.581000] [] ? SYSC_bind+0x240/0x240 [ 473.581011] [] ? set_close_on_exec+0xc2/0x170 [ 473.581021] [] ? _raw_spin_unlock+0x27/0x40 [ 473.581035] [] ? set_close_on_exec+0xc2/0x170 [ 473.581046] [] ? SyS_fcntl+0x666/0xde0 [ 473.581056] [] ? f_getown+0xb0/0xb0 [ 473.581067] [] ? 
trace_hardirqs_on_thunk+0x1a/0x1c [ 473.581078] [] SyS_connect+0xe/0x10 [ 473.581091] [] entry_SYSCALL_64_fastpath+0x23/0xc6 [ 473.581102] Object at 8801e1eb26f8, in cache request_sock_TCP size: 352 [ 473.581105] Allocated: [ 473.581109] PID = 0 [ 473.581112] (stack is not available) [ 473.581115] Freed: [ 473.581119] PID = 0 [ 473.581122] (stack is not available) [ 473.581125] Memory state around the buggy address: [ 473.581134] 8801e1eb2780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 473.581140] 8801e1eb2800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 473.581147] >8801e1eb2880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 473.581151] ^ [ 473.581157] 8801e1eb2900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 473.581164] 8801e1eb2980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc -- Daniel J Blueman
Re: Dell XPS13: MCE (Hardware Error) reported
On 5 January 2017 at 13:00, Daniel J Blueman <dan...@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan 4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee40110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as a L2 error, some memory errors can also manifest
>> as a cache error in certain cases. In this case it looks like
>> some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
>>
>>
>> Do you have another machine which doesn't report these errors? If so, try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model specific error handy.. will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both cpu and memory?
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [0.00] PM: Registered nosave memory: [mem 0xfee01000-0xfeff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE0:FEE and SPI BIOS at FFE0:, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this reoccur after boot; there were no other
kernel messages around 300s uptime, and it hasn't occurred in the last
hours since:

$ dmesg | grep Machine
[0.039072] mce: [Hardware Error]: Machine check events logged
[ 300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.

Dan
--
Daniel J Blueman
Re: Question regarding power button of Dell XPS13
On Monday, December 26, 2016 at 6:30:05 AM UTC+8, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 4:36 AM, Paul Menzel <pmen...@molgen.mpg.de> wrote:
> >
> > I heard that you both have a Dell XPS13. I got the “revision” 9360, and
> > installed Debian Stretch/testing on it with Linux 4.8.15 and Linux 4.9-rc8.
> >
> > When pressing the power button the GNOME dialog, asking what to do (restart,
> > power off, …) doesn’t appear.
>
> Hmm. I don't recall ever seeing such a dialog. But I don't run Debian.
>
> For me it works like all power buttons on my laptops have worked
> lately - it suspends the machine.
>
> Of course, so does just closing the lid.
>
> The only "bug" I've seen in this area is the design bug of the XPS13
> where there is no visible indication of the suspend state (ie the
> traditional slowly pulsing LED showing that it's all nice and
> suspended). But that seems to be intentional, if stupid. I think it's
> the only real beef I have with the XPS13.

I find the 9360 to be a solid laptop (my XPS 15 9550 would fail to
resume from suspend 15% of the time), but did any of you guys run into
bit-depth colour issues [1] on the Skylake/9350 with USB-C to HDMI
adapters?

Dan

[1] https://bugs.freedesktop.org/show_bug.cgi?id=99137
--
Daniel J Blueman
Re: Dell XPS13: MCE (Hardware Error) reported
On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan 4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee40110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although it's reported as a L2 error, some memory errors can also manifest
> themselves as a cache error in certain cases. In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both cpu and memory?
> or just the motherboard swap?

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the physical address is in the non-coherent low MMIO window:

MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:

[0.00] PM: Registered nosave memory: [mem 0xfee01000-0xfeff]

For core-generated cycles, it is between the local APIC space at FEE0:FEE and SPI BIOS at FFE0:, so will be subtractively decoded to the PCH, maybe being aborted due to a device not being enabled (hello TPM3 or new image processor).

As it is logged as soon as the MCE driver initialises, it was probably logged during BIOS init, so there's not much we can do about it anyway.

Dan
--
Daniel J Blueman
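The MCi_STATUS and MCG_CAP decode quoted in this message can be reproduced with a few bit tests. What follows is a minimal sketch in C, assuming the architectural register layout from the Intel SDM: in IA32_MCi_STATUS, VAL is bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, MSCOD is bits 31:16 and MCACOD is bits 15:0; in IA32_MCG_CAP, the bank count is bits 7:0, CMCI_P is bit 10 and TES_P is bit 11. The archive shows STATUS as ee40110a, apparently with zero digits lost, so any full 64-bit value fed to this code (for example 0xee0000000040110a) is an assumption reconstructed from the decoded fields, not a value taken from the log. The struct and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of an IA32_MCi_STATUS value (illustrative names). */
struct mci_status {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mscod;		/* model-specific error code, bits 31:16 */
	uint16_t mcacod;	/* architectural MCA error code, bits 15:0 */
};

/* Decoded view of an IA32_MCG_CAP value (illustrative names). */
struct mcg_cap {
	unsigned int banks;	/* number of reporting banks, bits 7:0 */
	bool cmci_p;		/* CMCI supported, bit 10 */
	bool tes_p;		/* threshold-based error status, bit 11 */
};

/* Return the state of bit n of v. */
static bool bit(uint64_t v, unsigned int n)
{
	return (v >> n) & 1;
}

static struct mci_status decode_mci_status(uint64_t status)
{
	struct mci_status s = {
		.val    = bit(status, 63),
		.over   = bit(status, 62),
		.uc     = bit(status, 61),
		.en     = bit(status, 60),
		.miscv  = bit(status, 59),
		.addrv  = bit(status, 58),
		.pcc    = bit(status, 57),
		.mscod  = (uint16_t)(status >> 16),
		.mcacod = (uint16_t)status,
	};

	return s;
}

static struct mcg_cap decode_mcg_cap(uint64_t cap)
{
	struct mcg_cap c = {
		.banks  = (unsigned int)(cap & 0xff),
		.cmci_p = bit(cap, 10),
		.tes_p  = bit(cap, 11),
	};

	return c;
}
```

With the assumed status value, the decode agrees with the fields listed in the quoted reply (VAL, OVER, UC and PCC set, EN clear, MCACOD 110a, MSCOD 0040), and decoding c08 gives CMCI_P and TES_P set.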
FOSSASIA'17 Kernel Track: Call for Speakers
Dear Linux Kernel developers,

The FOSSASIA 2017 Kernel Track would like to welcome all interested speakers to submit abstracts for presentations. You'll have the opportunity to share your knowledge and discuss with like-minded individuals, representing a broad range of industries and technologies.

The topics include, but are not limited to:
- new kernel developments, ideas and limitations
- development process and community
- bringup experience on new platforms or SoCs
- debugging, profiling, tuning tips and experience
- security and vulnerabilities
- new and exciting architectures, features and platforms

There are over 3000 attendees each year and a broad range of other tracks including the Hardware and Maker track, the Artificial Intelligence track, the Startup and Business Development track and the DevOps track.

The deadline for submission has been extended until Jan 20th; for more details see:
http://blog.fossasia.org/fossasia-summit-2017-singapore-call-for-speakers/

We are looking forward to seeing you at the summit!
Daniel
--
Daniel J Blueman
[PATCH] x86/urgent: Fix NumaConnect2 MMCFG PCI access
The MMCFG PCI accessors weren't being set up for NumaConnect2 correctly due to over-early assignment; this would create the potential for the wrong PCI domain to be accessed. Fix this by using the correct arch-specific PCI init function.

Signed-off-by: Daniel J Blueman
Acked-by: Steffen Persvold
---
 arch/x86/kernel/apic/apic_numachip.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 38dd5ef..2bd2292 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -193,20 +193,17 @@ static int __init numachip_system_init(void)
 	case 1:
 		init_extra_mapping_uc(NUMACHIP_LCSR_BASE, NUMACHIP_LCSR_SIZE);
 		numachip_apic_icr_write = numachip1_apic_icr_write;
-		x86_init.pci.arch_init = pci_numachip_init;
 		break;
 	case 2:
 		init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
 		numachip_apic_icr_write = numachip2_apic_icr_write;
-
-		/* Use MCFG config cycles rather than locked CF8 cycles */
-		raw_pci_ops = &pci_mmcfg;
 		break;
 	default:
 		return 0;
 	}
 
 	x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+	x86_init.pci.arch_init = pci_numachip_init;
 
 	return 0;
 }
--
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
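One way an early assignment of this kind can go wrong is sketched here in plain C. This is a toy model, not kernel code: `cfg_ops`, `generic_pci_init`, `platform_pci_init` and the ordering of the init phases are all hypothetical and chosen only to illustrate the pattern, since the patch description does not spell out the actual sequence in the kernel. The model shows an accessor pointer that is set directly during early setup being replaced by a later init phase, whereas registering an init callback lets the platform install its accessors when that phase runs.

```c
#include <stddef.h>

/* Toy model: every name here is hypothetical, none is a kernel symbol. */
struct cfg_ops {
	const char *name;
};

static const struct cfg_ops generic_ops  = { "generic" };
static const struct cfg_ops platform_ops = { "platform" };

static const struct cfg_ops *active_ops;	/* accessors used for config cycles */
static int (*arch_init_hook)(void);		/* optional platform init callback */

/* Platform callback: installs the platform accessors when invoked. */
static int platform_pci_init(void)
{
	active_ops = &platform_ops;
	return 0;
}

/* Later init phase: installs defaults, then lets the platform adjust them. */
static void generic_pci_init(void)
{
	active_ops = &generic_ops;
	if (arch_init_hook)
		arch_init_hook();
}

/* Early assignment: the pointer is set before the later phase replaces it. */
static const struct cfg_ops *boot_with_early_assignment(void)
{
	arch_init_hook = NULL;
	active_ops = &platform_ops;
	generic_pci_init();
	return active_ops;
}

/* Deferred assignment: only the callback is registered during early setup. */
static const struct cfg_ops *boot_with_init_hook(void)
{
	arch_init_hook = platform_pci_init;
	active_ops = NULL;
	generic_pci_init();
	return active_ops;
}
```

In this model the early assignment ends up with the generic accessors, while the callback variant ends up with the platform ones. That mirrors the direction of the change in the diff (dropping the direct `raw_pci_ops` assignment and setting `x86_init.pci.arch_init` for both NumaChip versions), without claiming to reproduce the kernel's real init order.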
[tip:x86/urgent] x86/numachip: Fix NumaConnect2 MMCFG PCI access
Commit-ID:  dd7a5ab495019d424c2b0747892eb2e38a052ba5
Gitweb:     http://git.kernel.org/tip/dd7a5ab495019d424c2b0747892eb2e38a052ba5
Author:     Daniel J Blueman
AuthorDate: Thu, 31 Dec 2015 02:06:47 +0800
Committer:  Thomas Gleixner
CommitDate: Wed, 30 Dec 2015 19:19:03 +0100

x86/numachip: Fix NumaConnect2 MMCFG PCI access

The MMCFG PCI accessors weren't being set up for NumaConnect2 correctly due to over-early assignment; this would create the potential for the wrong PCI domain to be accessed. Fix this by using the correct arch-specific PCI init function.

Signed-off-by: Daniel J Blueman
Acked-by: Steffen Persvold
Cc: Daniel Lezcano
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1451498807-15920-1-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner
---
 arch/x86/kernel/apic/apic_numachip.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 38dd5ef..2bd2292 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -193,20 +193,17 @@ static int __init numachip_system_init(void)
 	case 1:
 		init_extra_mapping_uc(NUMACHIP_LCSR_BASE, NUMACHIP_LCSR_SIZE);
 		numachip_apic_icr_write = numachip1_apic_icr_write;
-		x86_init.pci.arch_init = pci_numachip_init;
 		break;
 	case 2:
 		init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
 		numachip_apic_icr_write = numachip2_apic_icr_write;
-
-		/* Use MCFG config cycles rather than locked CF8 cycles */
-		raw_pci_ops = &pci_mmcfg;
 		break;
 	default:
 		return 0;
 	}
 
 	x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+	x86_init.pci.arch_init = pci_numachip_init;
 
 	return 0;
 }
[PATCH 1/2] PCI: Add mechanism to find topologically near cores
Some devices (e.g. ixgbe) make assumptions about device-to-core locality when specifying interrupt locality hints and allocate starting from core 0. Moreover, interrupts may not be routable to distant NUMA nodes due to the 8-bit APIC ID space limitations.

Provide a mechanism drivers can use to find cores with reasonable locality to a device; use the existing precedent of RECLAIM_DISTANCE (30), wrapping the offset.

Signed-off-by: Daniel J Blueman
---
 drivers/pci/pci.c   | 15 +++
 include/linux/pci.h | 1 +
 2 files changed, 16 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 314db8c..d5535d1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4833,6 +4833,22 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_fixup_cardbus);
 
+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset)
+{
+	/* Start search from node device is on for optimal locality */
+	int localnode = pcibus_to_node(pdev->bus);
+	int cpu = cpumask_first(cpumask_of_node(localnode));
+
+	while (offset--) {
+		do {
+			cpu = (cpu + 1) % nr_cpu_ids;
+		} while (!cpu_online(cpu) || node_distance(cpu_to_node(cpu),
+				localnode) > RECLAIM_DISTANCE);
+	}
+
+	return cpu;
+}
+
 static int __init pci_setup(char *str)
 {
 	while (str) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6ae25aa..f7491bd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -842,6 +842,7 @@ void pci_stop_root_bus(struct pci_bus *bus);
 void pci_remove_root_bus(struct pci_bus *bus);
 void pci_setup_cardbus(struct pci_bus *bus);
 void pci_sort_breadthfirst(void);
+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset);
 #define dev_is_pci(d) ((d)->bus == &pci_bus_type)
 #define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
 #define dev_num_vf(d) ((dev_is_pci(d) ? pci_num_vf(to_pci_dev(d)) : 0))
--
2.5.0
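The search loop in `cpu_near_dev()` can be exercised outside the kernel with a small userspace model. The sketch below is an illustration under stated assumptions, not the kernel implementation: the 8-CPU, 4-node topology, the distance table, the offline CPU and the names `cpu_near_node` and `first_cpu_of_node` are all made up for the example, and table lookups stand in for `cpu_online()`, `cpu_to_node()` and `node_distance()`. Only the loop structure and the RECLAIM_DISTANCE threshold of 30 come from the patch.

```c
#include <stdbool.h>

/* Hypothetical toy topology: 8 CPUs, 2 per node, 4 nodes. */
#define NR_CPUS			8
#define NR_NODES		4
#define RECLAIM_DISTANCE	30	/* threshold named in the patch description */

static const int cpu_node[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/* CPU 5 is marked offline to show that offline CPUs are skipped. */
static const bool cpu_is_online[NR_CPUS] = {
	true, true, true, true, true, false, true, true
};

/* Node distances: 10 = same node, 20 = neighbour, 40 = distant. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Stand-in for cpumask_first(cpumask_of_node(node)). */
static int first_cpu_of_node(int node)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_node[cpu] == node)
			return cpu;
	return 0;
}

/*
 * Same loop shape as cpu_near_dev(): start at the first CPU of the
 * device's node, then step 'offset' times to the next online CPU whose
 * node is within RECLAIM_DISTANCE, wrapping around the CPU index space.
 */
static int cpu_near_node(int localnode, unsigned int offset)
{
	int cpu = first_cpu_of_node(localnode);

	while (offset--) {
		do {
			cpu = (cpu + 1) % NR_CPUS;
		} while (!cpu_is_online[cpu] ||
			 distance[cpu_node[cpu]][localnode] > RECLAIM_DISTANCE);
	}

	return cpu;
}
```

For a device on node 2 in this toy topology, offsets 0, 1, 2 and 3 select CPUs 4, 6, 7 and 4: the offline CPU 5 and the distant nodes 0 and 1 are skipped, and the selection wraps once the near CPUs are exhausted. Note that the inner loop only terminates if at least one online CPU lies within the distance threshold.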
[PATCH 2/2] ixgbe: Use core to device locality interface
Rather than assuming cores starting from 0 are local to the ethernet device, use the introduced interface to find near cores. Not only does this improve performance due to spreading interrupts via near NUMA nodes, it prevents assigning cores on distant NUMA nodes, which aren't reachable by device interrupts due to the 8-bit APIC ID limitation.

With Numascale NumaConnect2 systems with Intel ixgbe cards on non-primary PCI domains, all ixgbe NICs would previously revector interrupts to cores 0 to 63 (cores 0 to 47 would be considered near the primary PCI domain). Now, cores 48 to 95 are used, increasing performance and addressing interrupt delivery issues:

do_IRQ: 79.180 No irq handler for vector (irq -1)
do_IRQ: 78.42 No irq handler for vector (irq -1)
do_IRQ: 71.172 No irq handler for vector (irq -1)
do_IRQ: 70.236 No irq handler for vector (irq -1)
do_IRQ: 69.109 No irq handler for vector (irq -1)
do_IRQ: 68.189 No irq handler for vector (irq -1)
do_IRQ: 72.92 No irq handler for vector (irq -1)
do_IRQ: 73.235 No irq handler for vector (irq -1)
do_IRQ: 66.185 No irq handler for vector (irq -1)
do_IRQ: 67.62 No irq handler for vector (irq -1)
do_IRQ: 197 callbacks suppressed

Signed-off-by: Daniel J Blueman
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f3168bc..12c4ce1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -817,10 +817,8 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	if ((tcs <= 1) && !(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) {
 		u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		if (rss_i > 1 && adapter->atr_sample_rate) {
-			if (cpu_online(v_idx)) {
-				cpu = v_idx;
-				node = cpu_to_node(cpu);
-			}
+			cpu = cpu_near_dev(adapter->pdev, v_idx);
+			node = cpu_to_node(cpu);
 		}
 	}
 
--
2.5.0
Re: [TESTPATCH v2] xhci: fix usb2 resume timing and races.
On 1 December 2015 at 16:26, Mathias Nyman wrote:
> usb2 ports need to signal resume for 20ms before moving to U0 state.
> Both device and host can initiate resume.
>
> On host initiated resume port is set to resume state, sleep 20ms,
> and finally set port to U0 state.
>
> On device initiated resume a port status interrupt with a port in resume
> state is issued. The interrupt handler tags a resume_done[port]
> timestamp with current time + 20ms, and kick roothub timer.
> Root hub timer requests for port status, finds the port in resume state,
> checks if resume_done[port] timestamp passed, and set port to U0 state.
>
> There are a few issues with this approach,
> 1. A host initiated resume will also generate a resume event, the event
>    handler will find the port in resume state, believe it's a device
>    initiated and act accordingly.
>
> 2. A port status request might cut the 20ms resume signalling short if a
>    get_port_status request is handled during the 20ms host resume.
>    The port will be found in resume state. The timestamp is not set leading
>    to time_after_eq(jiffies, timestamp) returning true, as timestamp = 0.
>    get_port_status will proceed with moving the port to U0.
>
> 3. If an error, or anything else happens to the port during device
>    initiated 20ms resume signalling it will leave all device resume
>    parameters hanging uncleared preventing further resume.
>
> Fix this by using the existing resuming_ports bitfield to indicate if
> resume signalling timing is taken care of.
> Also check if the resume_done[port] is set before using it in time
> comparison. Also clear out any resume signalling related variables if port
> is not in U0 or Resume state.
>
> v2. fix parentheses when checking for uncleared resume variables.
> we want: if ((unclear1 OR unclear2 ) AND !in_resume AND !in_U3) { .. }
>
> Signed-off-by: Mathias Nyman

Excellent; this correctly prevents the cyclic chain of suspend attempts, resolving the issue.

Tested-by: Daniel J Blueman

Thanks Mathias!
Daniel
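Issue 2 in the quoted description (an unset timestamp making the time comparison succeed) can be shown with a small userspace sketch. It assumes the usual wrap-safe definition of the comparison, `(long)(a - b) >= 0`, reimplemented here as a plain function; the helper names `resume_finished_unguarded` and `resume_finished_guarded` are hypothetical and are not functions from the xhci driver.

```c
#include <stdbool.h>

/* Userspace stand-in for a wrap-safe "a is at or after b" tick comparison. */
static bool time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Unguarded check: an unset (zero) timestamp looks like an expired one. */
static bool resume_finished_unguarded(unsigned long now,
				      unsigned long resume_done)
{
	return time_after_eq(now, resume_done);
}

/* Guarded check: only compare once a timestamp has actually been recorded. */
static bool resume_finished_guarded(unsigned long now,
				    unsigned long resume_done)
{
	return resume_done != 0 && time_after_eq(now, resume_done);
}
```

With a current tick count of 1000 and no timestamp recorded, the unguarded check reports the resume as finished while the guarded one does not. Once a deadline of 1020 is recorded, both report completion only from tick 1020 onwards.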
RE: overriding ACPI _CRS method
On Mon, Nov 30, 2015 at 11:09 AM, Zheng, Lv wrote:
> Hi,
>
> IMO, if you want the new _CRS to be applied during the Linux early boot stage, you can override the table using initrd override or DSDT override mechanism.
> Please see Documentation/acpi/initrd_table_override.txt or Documentation/acpi/dsdt-override.txt.
>
> If you want the new _CRS to be applied during Linux runtime, you can override it using method customization mechanism.
> Please see Documentation/acpi/method-customizing.txt

The reason I'm trying to adjust this in firmware is to deliver the right behaviour with pre-built/distro kernels, so I can't use that approach.

Thanks Lv,
Daniel
overriding ACPI _CRS method
In firmware that is loaded after the BIOS, I need to trim the root bus resource (0x4000-0xdfff) covering the MMIO window [1], so I can attach further PCI domains.

One strategy is to override the BIOS's DSDT [2] _SB.PCI0._CRS method; even when my firmware appends the bytecode for a new _CRS method [3], alas I see AE_ALREADY_EXISTS [4].

I understood methods were overridable within the same table (e.g. not from an SSDT), but perhaps am missing something? Or any better approach to reduce the scope of the PCI domain root bus?

Thanks!
Daniel

-- [1]

pci_bus :00: root bus resource [io 0x-0x03af window]
pci_bus :00: root bus resource [io 0x03e0-0x0cf7 window]
pci_bus :00: root bus resource [io 0x03b0-0x03bb window]
pci_bus :00: root bus resource [io 0x03c0-0x03df window]
pci_bus :00: root bus resource [io 0x8000-0xdfff window]
pci_bus :00: root bus resource [mem 0x000a-0x000b window]
pci_bus :00: root bus resource [mem 0xf000-0x window]
pci_bus :00: root bus resource [mem 0x000d-0x000d window]
pci_bus :00: root bus resource [mem 0x4000-0xdfff window]
pci_bus :00: root bus resource [bus 00-04]

[2] https://resources.numascale.com/DSDT.dsl
[3] https://resources.numascale.com/DSDT-extra.dsl

-- [4]

ACPI: Core revision 20150930
ACPI Error: [_CRS] Namespace lookup failure, AE_ALREADY_EXISTS (20150930/dswload-378)
ACPI Exception: AE_ALREADY_EXISTS, During name lookup/catalog (20150930/psobject-227)
ACPI Exception: AE_ALREADY_EXISTS, [DSDT] table load failed (20150930/tbxfload-163)
ACPI Error: [\_PR_.P001] Namespace lookup failure, AE_NOT_FOUND (20150930/dswload-210)
ACPI Exception: AE_NOT_FOUND, During name lookup/catalog (20150930/psobject-227)
ACPI Exception: AE_NOT_FOUND, (SSDT:POWERNOW) while loading table (20150930/tbxfload-193)
ACPI Error: 2 table load failures, 0 successful (20150930/tbxfload-214)
--
Daniel J Blueman
Principal Software Engineer, Numascale
Re: [4.3] kworker busy in pm_runtime_work
On 23 November 2015 at 23:52, Alan Stern wrote:
> On Sun, 22 Nov 2015, Daniel J Blueman wrote:
>
>> On 16 November 2015 at 23:22, Alan Stern wrote:
>> > On Mon, 16 Nov 2015, Daniel J Blueman wrote:
>> >
>> >> Tuning USB suspend [1] in 4.3 on a Dell XPS 15 9553 (Skylake), I see a
>> >> kworker thread spinning in rpm_suspend [2].
>> >>
>> >> What is the most useful debug to get here beyond the immediate [3]?
>> >
>> > You can try doing:
>> >
>> >     echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
>>
>> kworker and ksoftirqd spinning occurs when I echo 'auto' to all the
>> USB control entries. Using Alan's excellent tip, we see this being
>> logged repeatedly at a high rate:
>> [ 353.245180] usb usb1-port4: status 0107 change
>> [ 353.245194] usb usb1-port12: status 0507 change
>> [ 353.245202] hub 1-0:1.0: state 7 ports 16 chg evt
>> [ 353.245203] hub 1-0:1.0: hub_suspend
>> [ 353.245205] usb usb1: bus auto-suspend, wakeup 1
>> [ 353.245206] usb usb1: bus suspend fail, err -16
>> [ 353.245207] hub 1-0:1.0: hub_resume
>> ...
>>
>> So, EBUSY. Both the webcam is not open, and the bluetooth interface
>> [1] is rfkill'd; the situation occurs even if I unload all related
>> modules.
>>
>> What further debug would be useful?
>>
>> Thanks!
>> Daniel
>>
>> -- [1]
>>
>> Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
>> Bus 001 Device 002: ID 0a5c:6410 Broadcom Corp.
>> Bus 001 Device 003: ID 1bcf:2b95 Sunplus Innovation Technology Inc.
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>
> Since bus 1 uses an xHCI controller, you should do:
>
>     echo 'module xhci-hcd =p' >/sys/kernel/debug/dynamic_debug/control
>
> I'm reasonably sure this will end up printing "suspend failed
> because a port is resuming", since that's the only place where
> xhci_bus_suspend() fails with -EBUSY, but you should try it to confirm
> this.

I had to use:

    echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control

and indeed we see:

[29172.246221] xhci_hcd :00:14.0: get port status, actual port 11 status = 0xe63
[29172.246222] xhci_hcd :00:14.0: Get port status returned 0x507
[29172.246224] xhci_hcd :00:14.0: get port status, actual port 12 status = 0x2a0
[29172.246228] xhci_hcd :00:14.0: get port status, actual port 13 status = 0x2a0
[29172.246228] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246231] xhci_hcd :00:14.0: get port status, actual port 14 status = 0x2a0
[29172.246232] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246235] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246248] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246254] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246264] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246275] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246285] xhci_hcd :00:14.0: Get port status returned 0x507
[29172.246294] xhci_hcd :00:14.0: get port status, actual port 14 status = 0x2a0
[29172.246302] xhci_hcd :00:14.0: suspend failed because a port is resuming
[29172.246321] xhci_hcd :00:14.0: Get port status returned 0x107
[29172.246332] xhci_hcd :00:14.0: get port status, actual port 6 status = 0x2a0
[29172.246346] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246359] xhci_hcd :00:14.0: get port status, actual port 13 status = 0x2a0
[29172.246364] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246366] xhci_hcd :00:14.0: get port status, actual port 15 status = 0x2a0
[29172.246371] xhci_hcd :00:14.0: suspend failed because a port is resuming
[29172.246380] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246382] xhci_hcd :00:14.0: get port status, actual port 1 status = 0x2a0
[29172.246383] xhci_hcd :00:14.0: Get port status returned 0x100
...

--
Daniel J Blueman
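To make the raw "actual port N status" values in the log above easier to read, here is a minimal user-space sketch that splits an xHCI PORTSC value into a few fields. The bit positions (CCS bit 0, PED bit 1, PLS bits 8:5, PP bit 9, speed bits 13:10) follow my reading of the xHCI specification; the struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the decoded fields; not a kernel type. */
struct portsc_fields {
	unsigned int connected;   /* CCS: bit 0 */
	unsigned int enabled;     /* PED: bit 1 */
	unsigned int link_state;  /* PLS: bits 8:5 */
	unsigned int powered;     /* PP: bit 9 */
	unsigned int speed;       /* Port Speed ID: bits 13:10 */
};

/* Extract the fields of interest from a raw PORTSC value. */
static struct portsc_fields portsc_decode(uint32_t portsc)
{
	struct portsc_fields f;

	f.connected  = portsc & 0x1;
	f.enabled    = (portsc >> 1) & 0x1;
	f.link_state = (portsc >> 5) & 0xf;
	f.powered    = (portsc >> 9) & 0x1;
	f.speed      = (portsc >> 10) & 0xf;
	return f;
}

/* Name a few port link states; all others are lumped together. */
static const char *pls_name(unsigned int pls)
{
	assert(pls <= 0xf);	/* PLS is a 4-bit field */
	switch (pls) {
	case 0:  return "U0";
	case 3:  return "U3";
	case 5:  return "RxDetect";
	case 15: return "Resume";
	default: return "other";
	}
}

/* Print one line per PORTSC value in a greppable form. */
static void portsc_print(uint32_t portsc)
{
	struct portsc_fields f = portsc_decode(portsc);

	printf("0x%03x: connected=%u enabled=%u powered=%u speed=%u link=%s\n",
	       (unsigned int)portsc, f.connected, f.enabled, f.powered,
	       f.speed, pls_name(f.link_state));
}
```

Under these assumptions, 0x2a0 decodes as a powered port with nothing connected, sitting in RxDetect, while 0xe63 (port 11) decodes as a connected, enabled port in U3.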
Re: [4.3] kworker busy in pm_runtime_work
On 16 November 2015 at 23:22, Alan Stern wrote:
> On Mon, 16 Nov 2015, Daniel J Blueman wrote:
>
>> Tuning USB suspend [1] in 4.3 on a Dell XPS 15 9553 (Skylake), I see a
>> kworker thread spinning in rpm_suspend [2].
>>
>> What is the most useful debug to get here beyond the immediate [3]?
>
> You can try doing:
>
>     echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control

kworker and ksoftirqd spinning occurs when I echo 'auto' to all the
USB control entries. Using Alan's excellent tip, we see this being
logged repeatedly at a high rate:

[ 353.245180] usb usb1-port4: status 0107 change
[ 353.245194] usb usb1-port12: status 0507 change
[ 353.245202] hub 1-0:1.0: state 7 ports 16 chg evt
[ 353.245203] hub 1-0:1.0: hub_suspend
[ 353.245205] usb usb1: bus auto-suspend, wakeup 1
[ 353.245206] usb usb1: bus suspend fail, err -16
[ 353.245207] hub 1-0:1.0: hub_resume
...

So, EBUSY. The webcam is not open and the bluetooth interface [1] is
rfkill'd; the situation occurs even if I unload all related modules.

What further debug would be useful?

Thanks!
  Daniel

-- [1]

Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 0a5c:6410 Broadcom Corp.
Bus 001 Device 003: ID 1bcf:2b95 Sunplus Innovation Technology Inc.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

--
Daniel J Blueman
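As a reading aid for the hub messages above, the following sketch names the bits in a USB 2.0 hub port status word such as 0107 or 0507, and shows that error 16 is EBUSY. The bit values are the standard USB 2.0 hub wPortStatus assignments as I understand them; the enum and function names are hypothetical and do not come from the kernel sources.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* USB 2.0 hub wPortStatus bits (names here are illustrative only). */
enum {
	PORT_CONNECTION  = 0x0001,
	PORT_ENABLE      = 0x0002,
	PORT_SUSPEND     = 0x0004,
	PORT_OVERCURRENT = 0x0008,
	PORT_RESET       = 0x0010,
	PORT_POWER       = 0x0100,
	PORT_LOW_SPEED   = 0x0200,
	PORT_HIGH_SPEED  = 0x0400,
};

/* Write a space-separated list of the set flags into buf and return it. */
static char *hub_port_status_str(unsigned int status, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} flags[] = {
		{ PORT_CONNECTION,  "connected" },
		{ PORT_ENABLE,      "enabled" },
		{ PORT_SUSPEND,     "suspended" },
		{ PORT_OVERCURRENT, "overcurrent" },
		{ PORT_RESET,       "reset" },
		{ PORT_POWER,       "powered" },
		{ PORT_LOW_SPEED,   "low-speed" },
		{ PORT_HIGH_SPEED,  "high-speed" },
	};
	size_t i;

	assert(len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(status & flags[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, flags[i].name, len - strlen(buf) - 1);
	}
	return buf;
}

/* Print the symbolic meaning of a negative kernel-style error code. */
static void print_kernel_err(int err)
{
	printf("err %d: %s\n", err, strerror(-err));
}
```

With these definitions, 0x0107 and 0x0507 both describe a connected, enabled, suspended and powered port (0x0507 additionally high-speed), and print_kernel_err(-16) reports the EBUSY text on Linux.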
Re: [lkp] [x86/numachip] db1003a719: BUG: kernel early-boot hang
Hi Ying Huang, On Tue, Nov 10, 2015 at 6:12 AM, kernel test robot wrote: FYI, we noticed the below changes on https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master commit db1003a719d75cebe5843a7906c02c29bec9922c ("x86/numachip: Cleanup Numachip support") Elapsed time: 210 BUG: kernel early-boot hang Linux version 4.3.0-rc2-1-gdb1003a #1 Command line: root=/dev/ram0 user=lkp job=/lkp/scheduled/vm-intel12-yocto-x86_64-14/bisect_boot-1-yocto-minimal-x86_64.cgz-x86_64-allyesdebian-db1003a719d75cebe5843a7906c02c29bec9922c-20151107-100037-1jb4qfh-1.yaml ARCH=x86_64 kconfig=x86_64-allyesdebian branch=sergeh-security/2015-11-05/cgroupns commit=db1003a719d75cebe5843a7906c02c29bec9922c BOOT_IMAGE=/pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a max_uptime=600 RESULT_ROOT=/result/boot/1/vm-intel12-yocto-x86_64/yocto-minimal-x86_64.cgz/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/0 LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw ip=vm-intel12-yocto-x86_64-14::dhcp drbd.minor_count=8 qemu-system-x86_64 -enable-kvm -cpu Nehalem -kernel /pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a -append 'root=/dev/ram0 user=lkp job=/lkp/scheduled/vm-intel12-yocto-x86_64-14/bisect_boot-1-yocto-minimal-x86_64.cgz-x86_64-allyesdebian-db1003a719d75cebe5843a7906c02c29bec9922c-20151107-100037-1jb4qfh-1.yaml ARCH=x86_64 kconfig=x86_64-allyesdebian branch=sergeh-security/2015-11-05/cgroupns commit=db1003a719d75cebe5843a7906c02c29bec9922c BOOT_IMAGE=/pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a max_uptime=600 
RESULT_ROOT=/result/boot/1/vm-intel12-yocto-x86_64/yocto-minimal-x86_64.cgz/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/0 LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0 vga=normal rw ip=vm-intel12-yocto-x86_64-14::dhcp drbd.minor_count=8' -initrd /fs/KVM/initrd-vm-intel12-yocto-x86_64-14 -m 832 -smp 2 -device e1000,netdev=net0 -netdev user,id=net0 -boot order=nc -no-reboot -watchdog i6300esb -rtc base=localtime -drive file=/fs/KVM/disk0-vm-intel12-yocto-x86_64-14,media=disk,if=virtio -drive file=/fs/KVM/disk1-vm-intel12-yocto-x86_64-14,media=disk,if=virtio -pidfile /dev/shm/kboot/pid-vm-intel12-yocto-x86_64-14 -serial file:/dev/shm/kboot/serial-vm-intel12-yocto-x86_64-14 -daemonize -display none -monitor null

Neat; however, after checking out the same kernel tree at "db1003a
x86/numachip: Cleanup Numachip support" and building with the same
config (though with GCC 5.2.1), it boots just peachy with the same args.

The patch itself is conservative, so I can't see how it could cause
early boot hangs. Have you seen this kind of issue before, or is this
the first time?

Thanks!
  Daniel
Re: [PATCH 3/3] x86/apic: Use smaller array for __apicid_to_node[] mapping
On Fri, Oct 9, 2015 at 11:35 PM, Jiang Liu wrote:
> On 2015/10/3 3:12, Denys Vlasenko wrote:
>> From: Daniel J Blueman
>>
>> The Intel x2APIC spec states the upper 16-bits of APIC ID is the
>> cluster ID [1, p2-12], intended for future distributed systems. Beyond
>> the legacy 8-bit APIC ID, Numascale NumaConnect uses 4-bits for the
>> position of a server on each axis of a multi-dimension torus; SGI
>> NUMAlink also structures the APIC ID space.
>>
>> Instead, define an array based on NR_CPUs to achieve a 1:1 mapping and
>> perform linear search; this addresses the binary bloat and the present
>> artificial APIC ID limits. With CONFIG_NR_CPUS=256:
>>
>> $ size vmlinux vmlinux-patched
>>     text    data     bss      dec     hex filename
>> 18232877 1849656 2281472 22364005 1553f65 vmlinux
>> 18233034 1786168 2281472 22300674 1544802 vmlinux-patched
>>
>> That is, ~64 kbytes less data.
>>
>> Works peachy on a 256-core system with a 20-bit APIC ID space, and on a
>> 48-core legacy 8-bit APIC ID system. If we care, I can make numa_cpu_node
>> O(1) lookup for typical cases.
>>
>> Signed-off-by: Daniel J Blueman
>> CC: Ingo Molnar
>> CC: Daniel J Blueman
>> CC: Jiang Liu
>> CC: Thomas Gleixner
>> CC: Len Brown
>> CC: x...@kernel.org
>> CC: linux-kernel@vger.kernel.org
>>
>> [1] http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf
>> ---
>> I added forgotten change in arch/x86/mm/numa_emulation.c (Denys)
>>
>>  arch/x86/include/asm/numa.h  | 13 +++--
>>  arch/x86/kernel/cpu/amd.c    |  8
>>  arch/x86/mm/numa.c           | 31 +++
>>  arch/x86/mm/numa_emulation.c |  6 +++---
>>  4 files changed, 37 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
>> index c2ecfd0..33becb8 100644
>> --- a/arch/x86/include/asm/numa.h
>> +++ b/arch/x86/include/asm/numa.h
>> @@ -17,6 +17,11 @@
>>   */
>>  #define NODE_MIN_SIZE (4*1024*1024)
>>
>> +struct apicid_to_node {
>> +	int apicid;
>> +	s16 node;
>> +};
>> +
>>  extern int numa_off;
>>
>>  /*
>> @@ -27,17 +32,13 @@ extern int numa_off;
>>   * should be accessed by the accessors - set_apicid_to_node() and
>>   * numa_cpu_node().
>>   */
>> -extern s16 __apicid_to_node[MAX_LOCAL_APICID];
>> +extern struct apicid_to_node __apicid_to_node[NR_CPUS];
>
> Hi Denys and Daniel,
> I still have some concerns about limiting the array to NR_CPUS.
> __apicid_to_node are populated according to the order that CPUs are
> listed in ACPI SRAT table. And CPU IDs are allocated according to the
> order that CPUs are listed in ACPI MADT(APIC) order. So it may cause
> trouble if:
> 1) system has more than NR_CPUS CPUs
> 2) CPUs are listed in different order in SRAT and MADT tables.

Another approach, which may be suitable without moving SRAT parsing to
after the memory allocator is up, is to exploit the associativity of the
bottom APIC ID bits. We'd have a searchable static array based on
NUMA_SHIFT and use the bit-shift encoded in the MSRs. That said, this
may run into the issue Jiang cited, albeit with CONFIG_NUMA_SHIFT.

Perhaps the constraints or risk of restructuring SRAT parsing aren't
worth the payoff?

Finally, the only remaining alternative: as the current mapping is
initialised in numa_init, we can drop the static initialisation and move
the 64KB to the BSS to avoid bloating the binary image, but this may not
achieve the initial goal of runtime footprint reduction.

Daniel
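To illustrate the trade-off being discussed, here is a simplified user-space model of a table sized by the CPU count and searched linearly, rather than one indexed directly by APIC ID. It is only a sketch of the idea and not the patch itself: NR_CPUS is set to a small demo value, and map_set(), map_lookup() and nr_mapped are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS      8		/* demo value; the kernel takes this from Kconfig */
#define NUMA_NO_NODE (-1)

/* One entry per CPU seen so far, instead of one per possible APIC ID. */
struct apicid_to_node {
	int apicid;
	int16_t node;
};

static struct apicid_to_node map[NR_CPUS];
static unsigned int nr_mapped;

/*
 * Record apicid -> node. Returns 0 on success, or -1 when the table is
 * full, i.e. more APIC IDs were presented than NR_CPUS allows.
 */
static int map_set(int apicid, int16_t node)
{
	unsigned int i;

	assert(apicid >= 0);

	/* Update in place if this APIC ID was already recorded. */
	for (i = 0; i < nr_mapped; i++) {
		if (map[i].apicid == apicid) {
			map[i].node = node;
			return 0;
		}
	}

	if (nr_mapped >= NR_CPUS)
		return -1;

	map[nr_mapped].apicid = apicid;
	map[nr_mapped].node = node;
	nr_mapped++;
	return 0;
}

/* Linear search; returns NUMA_NO_NODE when the APIC ID is unknown. */
static int16_t map_lookup(int apicid)
{
	unsigned int i;

	for (i = 0; i < nr_mapped; i++) {
		if (map[i].apicid == apicid)
			return map[i].node;
	}
	return NUMA_NO_NODE;
}
```

The table costs NR_CPUS entries regardless of how sparse or wide the APIC ID space is, which is where the size saving comes from. The -1 return from map_set() is the model's version of the first concern raised above: entries beyond NR_CPUS are simply not recorded, whatever order the firmware tables list them in.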
Re: igb: do not re-init SR-IOV during probe
It would be great if the patch "igb: do not re-init SR-IOV during probe"
[1] could be backported from 4.3-rc to stable kernels, since it fixes the
regression introduced by "igb: do a reset on SR-IOV re-init if device is
down" [2].

The regression was introduced in 3.16 and can isolate the IPMI interface
on servers with 82576 NICs if using shared mode (high impact), on around
0.5% of boots.

Many thanks!
  Daniel

-- [1]

commit 6423fc34160939142d72ffeaa2db6408317f54df
Author: Stefan Assmann
Date: Fri Jul 10 15:01:12 2015 +0200

    igb: do not re-init SR-IOV during probe

    During driver probing the following code path is triggered.
    igb_probe
    ->igb_sw_init
    ->igb_probe_vfs
    ->igb_pci_enable_sriov
    ->igb_sriov_reinit

    Doing the SR-IOV re-init is not necessary during probing since we're
    starting from scratch. Here we can call igb_enable_sriov() right away.

    Running igb_sriov_reinit() during igb_probe() also seems to cause
    occasional packet loss on some onboard 82576 NICs. Reproduced on
    Dell and HP servers with onboard 82576 NICs.
    Example:
    Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 01)
    Subsystem: Dell Device [1028:0481]

    Signed-off-by: Stefan Assmann
    Tested-by: Aaron Brown
    Signed-off-by: Jeff Kirsher

-- [2]

commit 76252723e88681628a3dbb9c09c963e095476f73
Author: Stefan Assmann
Date: Thu Jul 10 03:29:39 2014 -0700

    igb: do a reset on SR-IOV re-init if device is down

    To properly re-initialize SR-IOV it is necessary to reset the device
    even if it is already down. Not doing this may result in Tx unit hangs.

    Cc: stable
    Signed-off-by: Stefan Assmann
    Tested-by: Aaron Brown
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller
[PATCH v2] x86/apic: Use smaller array for __apicid_to_node[] mapping
The Intel x2APIC spec states the upper 16-bits of APIC ID is the cluster ID [1, p2-12], intended for future distributed systems. Beyond the legacy 8-bit APIC ID, Numascale NumaConnect uses 4-bits for the position of a server on each axis of a multi-dimension torus; SGI NUMAlink also structures the APIC ID space. Instead, define an array based on NR_CPUs to achieve a 1:1 mapping and perform linear search; we see "ACPI: NR_CPUS/possible_cpus limit of X reached. Processor 8/0x16 ignored." when config-limited. This addresses the binary bloat and the present artificial APIC ID limits. With CONFIG_NR_CPUS=256, we save ~64KB of vmlinux data: $ size vmlinux vmlinux-patched text data bss dec hex filename 18232877 1849656 2281472 22364005 1553f65 vmlinux 18233034 1786168 2281472 22300674 1544802 vmlinux-patched Tested on a 256-core system with a 20-bit APIC ID space, and on a 48-core legacy 8-bit APIC ID system with and without CONFIG_NUMA, CONFIG_NUMA_EMU and CONFIG_AMD_NUMA. v2: Improved readability by moving static variable out; integrated Denys's numa emulation fix Signed-off-by: Daniel J Blueman CC: Denys Vlasenko CC: Ingo Molnar CC: Thomas Gleixner CC: Jiang Liu CC: Len Brown CC: Steffen Persvold CC: linux-kernel@vger.kernel.org CC: x...@kernel.org [1] http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf --- arch/x86/include/asm/numa.h | 13 +++-- arch/x86/kernel/cpu/amd.c| 11 ++- arch/x86/mm/numa.c | 29 + arch/x86/mm/numa_emulation.c | 6 +++--- 4 files changed, 37 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 01b493e..33becb8 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -17,6 +17,11 @@ */ #define NODE_MIN_SIZE (4*1024*1024) +struct apicid_to_node { + int apicid; + s16 node; +}; + extern int numa_off; /* @@ -27,17 +32,13 @@ extern int numa_off; * should be accessed by the accessors - set_apicid_to_node() and * numa_cpu_node(). 
*/ -extern s16 __apicid_to_node[MAX_LOCAL_APIC]; +extern struct apicid_to_node __apicid_to_node[NR_CPUS]; extern nodemask_t numa_nodes_parsed __initdata; extern int __init numa_add_memblk(int nodeid, u64 start, u64 end); extern void __init numa_set_distance(int from, int to, int distance); -static inline void set_apicid_to_node(int apicid, s16 node) -{ - __apicid_to_node[apicid] = node; -} - +extern void set_apicid_to_node(int apicid, s16 node); extern int numa_cpu_node(int cpu); #else /* CONFIG_NUMA */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 4a70fc6..9494f0e 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -277,12 +277,13 @@ static int nearby_node(int apicid) int i, node; for (i = apicid - 1; i >= 0; i--) { - node = __apicid_to_node[i]; + node = __apicid_to_node[i].node; if (node != NUMA_NO_NODE && node_online(node)) return node; } - for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) { - node = __apicid_to_node[i]; + for (i = apicid + 1; i < NR_CPUS; i++) { + node = __apicid_to_node[i].node; + if (node != NUMA_NO_NODE && node_online(node)) return node; } @@ -422,8 +423,8 @@ static void srat_detect_node(struct cpuinfo_x86 *c) int ht_nodeid = c->initial_apicid; if (ht_nodeid >= 0 && - __apicid_to_node[ht_nodeid] != NUMA_NO_NODE) - node = __apicid_to_node[ht_nodeid]; + __apicid_to_node[ht_nodeid].node != NUMA_NO_NODE) + node = __apicid_to_node[ht_nodeid].node; /* Pick a nearby node */ if (!node_online(node)) node = nearby_node(apicid); diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index c3b3f65..849a113 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -26,6 +26,7 @@ nodemask_t numa_nodes_parsed __initdata; struct pglist_data *node_data[MAX_NUMNODES] __read_mostly; EXPORT_SYMBOL(node_data); +static unsigned apicids; static struct numa_meminfo numa_meminfo #ifndef CONFIG_MEMORY_HOTPLUG __initdata @@ -56,16 +57,31 @@ early_param("numa", numa_setup); /* * apicid, cpu, node mappings */ -s16 
__apicid_to_node[MAX_LOCAL_APIC] = { - [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE +struct apicid_to_node __apicid_to_node[NR_CPUS] = { + [0 ... NR_CPUS-1] = {-1, NUMA_NO_NODE} }; +void set_apicid_to_node(int apicid, s16 node) +{ + /* Protect against small kernel on large system */ + if (apicids >= NR_CPUS) + return; + + __apicid_to_node[apicids].apicid = apicid; + __apicid_to_node[apicid
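The quoted saving can be cross-checked against the two array sizes. The following is only a rough user-space sketch: it assumes MAX_LOCAL_APIC is 32*1024 (the value quoted in the RFC thread), that NR_CPUS is 256 as in the test config, and that struct apicid_to_node pads to 8 bytes on x86-64. The helper names are illustrative and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the struct added by the patch; the kernel's s16 is int16_t here. */
struct apicid_to_node {
	int apicid;
	int16_t node;
};

/* Assumed values: MAX_LOCAL_APIC from the RFC thread, NR_CPUS from the test config. */
enum { OLD_MAX_LOCAL_APIC = 32 * 1024, TEST_NR_CPUS = 256 };

/* Size of the original s16 __apicid_to_node[MAX_LOCAL_APIC]. */
static size_t old_table_bytes(void)
{
	return (size_t)OLD_MAX_LOCAL_APIC * sizeof(int16_t);
}

/* Size of the replacement struct array, one slot per possible CPU. */
static size_t new_table_bytes(void)
{
	return (size_t)TEST_NR_CPUS * sizeof(struct apicid_to_node);
}

/* Bytes of initialised data saved by switching tables. */
static size_t table_bytes_saved(void)
{
	assert(old_table_bytes() >= new_table_bytes());
	return old_table_bytes() - new_table_bytes();
}
```

Under these assumptions the saving is 65536 - 2048 = 63488 bytes, which matches the difference between the two data columns in the size output (1849656 - 1786168).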
Re: [PATCH RFC] x86: Reduce MAX_LOCAL_APIC and MAX_IO_APICS
On Saturday, September 26, 2015 at 4:40:07 AM UTC+8, Denys Vlasenko wrote:
> Before this change MAX_LOCAL_APIC had the fixed value of 32*1024.
> Such a big value causes several data arrays to be quite oversized:
>
> phys_cpu_present_map is 4 kbytes (one bit per apic id),
> __apicid_to_node[] is 64 kbytes,
> apic_version[] is 128 kbytes.
>
> On "usual" systems, APIC ids simply go from zero
> to maximum logical CPU number, mirroring CPU ids.
>
> On broken and unusual multi-socket systems
> APIC ids can be non-contiguous.

The Intel x2APIC spec states that the upper 16 bits of the APIC ID are the cluster ID [1, p2-12], intended for future distributed systems. Beyond the legacy 8-bit APIC ID, Numascale NumaConnect uses 4 bits for the position of a server on each axis of a multi-dimension torus; SGI NUMAlink also structures the APIC ID space. Instead of shrinking MAX_LOCAL_APIC, define an array based on NR_CPUS to achieve a 1:1 mapping and perform linear search; this addresses the binary bloat and the present artificial APIC ID limits. With CONFIG_NR_CPUS=256:

$ size vmlinux vmlinux-patched
    text    data     bss      dec     hex filename
18232877 1849656 2281472 22364005 1553f65 vmlinux
18233034 1786168 2281472 22300674 1544802 vmlinux-patched

Works peachy on a 256-core system with a 20-bit APIC ID space, and on a 48-core legacy 8-bit APIC ID system. If we care, I can make numa_cpu_node() an O(1) lookup for typical cases.
Signed-off-by: Daniel J Blueman Daniel [1] http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf --- arch/x86/include/asm/numa.h | 13 +++-- arch/x86/kernel/cpu/amd.c | 8 arch/x86/mm/numa.c | 31 +++ 3 files changed, 34 insertions(+), 18 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 01b493e..33becb8 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -17,6 +17,11 @@ */ #define NODE_MIN_SIZE (4*1024*1024) +struct apicid_to_node { + int apicid; + s16 node; +}; + extern int numa_off; /* @@ -27,17 +32,13 @@ extern int numa_off; * should be accessed by the accessors - set_apicid_to_node() and * numa_cpu_node(). */ -extern s16 __apicid_to_node[MAX_LOCAL_APIC]; +extern struct apicid_to_node __apicid_to_node[NR_CPUS]; extern nodemask_t numa_nodes_parsed __initdata; extern int __init numa_add_memblk(int nodeid, u64 start, u64 end); extern void __init numa_set_distance(int from, int to, int distance); -static inline void set_apicid_to_node(int apicid, s16 node) -{ - __apicid_to_node[apicid] = node; -} - +extern void set_apicid_to_node(int apicid, s16 node); extern int numa_cpu_node(int cpu); #else /* CONFIG_NUMA */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 4a70fc6..e65c01c 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -277,12 +277,12 @@ static int nearby_node(int apicid) int i, node; for (i = apicid - 1; i >= 0; i--) { - node = __apicid_to_node[i]; + node = __apicid_to_node[i].node; if (node != NUMA_NO_NODE && node_online(node)) return node; } for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) { - node = __apicid_to_node[i]; + node = __apicid_to_node[i].node; if (node != NUMA_NO_NODE && node_online(node)) return node; } @@ -422,8 +422,8 @@ static void srat_detect_node(struct cpuinfo_x86 *c) int ht_nodeid = c->initial_apicid; if (ht_nodeid >= 0 && - __apicid_to_node[ht_nodeid] != NUMA_NO_NODE) - node = 
__apicid_to_node[ht_nodeid]; + __apicid_to_node[ht_nodeid].node != NUMA_NO_NODE) + node = __apicid_to_node[ht_nodeid].node; /* Pick a nearby node */ if (!node_online(node)) node = nearby_node(apicid); diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index c3b3f65..70f03a0 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -56,16 +56,34 @@ early_param("numa", numa_setup); /* * apicid, cpu, node mappings */ -s16 __apicid_to_node[MAX_LOCAL_APIC] = { - [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE + +struct apicid_to_node __apicid_to_node[NR_CPUS] = { + [0 ... NR_CPUS-1] = {-1, NUMA_NO_NODE} }; +void set_apicid_to_node(int apicid, s16 node) +{ + static int ent; + + /* Protect against small kernel on large system */ + if (ent >= NR_CPUS) + return; + + __apicid_to_node[ent].apicid = apicid; + __apicid_to_node[ent].node = node; + ent++; +} + int numa_cpu_node(int cpu) { - int apicid = early_per_cpu(x86_cpu_to_apicid, cpu); + int ent, apicid = early_per_cpu(x86_cpu_to_apicid,
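The body of numa_cpu_node() is cut off in this archive, so the following is only a user-space sketch of the append-and-search scheme the patch describes. NR_CPUS is set to a small illustrative value, and apicid_lookup_node() is a hypothetical helper rather than a function from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8            /* small illustrative value, not a real config */
#define NUMA_NO_NODE (-1)

struct apicid_to_node {
	int apicid;
	int16_t node;
};

static struct apicid_to_node table[NR_CPUS];
static unsigned int used;    /* number of populated slots */

/* Reset every slot to "no APIC ID, no node", like the patch's initialiser. */
static void table_init(void)
{
	for (unsigned int i = 0; i < NR_CPUS; i++) {
		table[i].apicid = -1;
		table[i].node = NUMA_NO_NODE;
	}
	used = 0;
}

/* Append a mapping; drop it when the table is full (small kernel, large system). */
static void set_apicid_to_node(int apicid, int16_t node)
{
	if (used >= NR_CPUS)
		return;
	table[used].apicid = apicid;
	table[used].node = node;
	used++;
}

/* Linear search by APIC ID; hypothetical helper, not taken from the patch. */
static int apicid_lookup_node(int apicid)
{
	for (unsigned int i = 0; i < used; i++)
		if (table[i].apicid == apicid)
			return table[i].node;
	return NUMA_NO_NODE;
}
```

Because slots are filled in registration order, sparse or very large APIC IDs cost nothing extra in storage; the price is an O(NR_CPUS) search per lookup.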
Re: [RFC] PCI: Unassigned Expansion ROM BARs
On Thursday, September 24, 2015 at 10:50:07 AM UTC+8, Myron Stowe wrote:
> I've encountered numerous bugzilla reports related to platform BIOS' not
> programming valid values into a PCI device's Type 0 Configuration space
> "Expansion ROM Base Address" field (a.k.a. Expansion ROM BAR). The main
> observed consequence being 'dmesg' entries like the following that get
> customers excited enough to file reports against the kernel.

PCI option ROMs legitimately hold real-mode/EFI code needed to initialise devices; the problem is that we can't guarantee the BIOS has initialised all devices with the option ROM code, so Linux must ensure the ROMs are correctly accessible. In addition to VMs, as Alex points out, hotplugged devices (e.g. Thunderbolt GPUs) and PCI domains which may not be visible to the BIOS at early boot may need the option ROM. Nvidia GPUs in particular have carried a lot of encoder/connector (HDCP?) and product-specific voltage-frequency setup code and tables in the ROM. As such, in my NumaConnect open firmware, which maps the PCI domains of multiple servers into one, I also have to reallocate PCI option ROMs [1] to guarantee GPU VBIOS execution in Linux. That said, option ROMs are a dying trend in favour of shipped binary blobs and open-coded initialisation for cross-platform support, and there are only 10 users of pci_map_rom().

Thanks, Daniel

[1] https://github.com/numascale/nc-utils/blob/master/bootloader/dnc-mmio.c

-- Daniel J Blueman
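For readers unfamiliar with the register under discussion, here is a minimal sketch of decoding an Expansion ROM BAR value as read from config offset 0x30 of a Type 0 header: bit 0 is the decode enable and bits 31:11 hold the base address. The helper names are illustrative, and treating a zero base address as "unassigned" is an assumption made for this example, not the kernel's exact test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROM_BAR_ENABLE    0x1u        /* bit 0: ROM address decode enable */
#define ROM_BAR_ADDR_MASK (~0x7ffu)   /* bits 31:11: ROM base address */

/* True when the device is decoding accesses to its ROM. */
static bool rom_bar_enabled(uint32_t bar)
{
	return (bar & ROM_BAR_ENABLE) != 0;
}

/* Base address programmed into the BAR, with the low control bits masked off. */
static uint32_t rom_bar_address(uint32_t bar)
{
	return bar & ROM_BAR_ADDR_MASK;
}

/* Assumption for this sketch: a zero base means firmware never assigned it. */
static bool rom_bar_unassigned(uint32_t bar)
{
	return rom_bar_address(bar) == 0;
}
```

A BAR that firmware left at zero is the case that would need an address assigned before the ROM can be mapped.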
[tip:x86/apic] x86/numachip: Add Numachip IPI optimisations
Commit-ID: ad03a9c25d258641556c7198e26fd882c741987a Gitweb: http://git.kernel.org/tip/ad03a9c25d258641556c7198e26fd882c741987a Author: Daniel J Blueman AuthorDate: Mon, 21 Sep 2015 01:02:01 +0800 Committer: Thomas Gleixner CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Add Numachip IPI optimisations When sending IPIs, first check if the non-local part of the source and destination APIC IDs match; if so, send via the local APIC for efficiency. Secondly, since the AMD BIOS-kernel developer guide states IPI delivery will occur invariant of prior delivery status, avoid polling the delivery status bit for efficiency. Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold Cc: Daniel Lezcano Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner --- arch/x86/include/asm/numachip/numachip_csr.h | 1 + arch/x86/kernel/apic/apic_numachip.c | 37 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e08b803..e09d845 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -34,6 +34,7 @@ #define NUMACHIP_LCSR_BASE 0x3e00ULL #define NUMACHIP_LCSR_LIM 0x3fffULL #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1) +#define NUMACHIP_LAPIC_BITS 8 static inline void *lcsr_address(unsigned long offset) { diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index 3cb9294..38dd5ef 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -96,9 +96,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip) static void numachip_send_IPI_one(int cpu, int vector) { - int apicid = per_cpu(x86_cpu_to_apicid, cpu); + int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu); unsigned int dmode; + preempt_disable(); + local_apicid = 
__this_cpu_read(x86_cpu_to_apicid); + + /* Send via local APIC where non-local part matches */ + if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) { + unsigned long flags; + + local_irq_save(flags); + __default_send_IPI_dest_field(apicid, vector, + APIC_DEST_PHYSICAL); + local_irq_restore(flags); + preempt_enable(); + return; + } + preempt_enable(); + dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED; numachip_apic_icr_write(apicid, dmode | vector); } @@ -218,6 +234,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id) return 1; } +/* APIC IPIs are queued */ +static void numachip_apic_wait_icr_idle(void) +{ +} + +/* APIC NMI IPIs are queued */ +static u32 numachip_safe_apic_wait_icr_idle(void) +{ + return 0; +} + static const struct apic apic_numachip1 __refconst = { .name = "NumaConnect system", .probe = numachip1_probe, @@ -263,8 +290,8 @@ static const struct apic apic_numachip1 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip1); @@ -314,8 +341,8 @@ static const struct apic apic_numachip2 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip2);
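The local-delivery test in numachip_send_IPI_one() can be tried in isolation. This sketch lifts only the XOR-and-shift comparison into user space; apicid_same_server() is an illustrative name, and unsigned operands are used here to keep the shift well defined.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMACHIP_LAPIC_BITS 8   /* low bits that select a core within one server */

/*
 * XOR leaves a 1 wherever the two IDs differ; shifting out the low
 * NUMACHIP_LAPIC_BITS discards differences in the local part. A zero
 * result means the non-local parts match, so the local APIC can be used.
 */
static bool apicid_same_server(unsigned int a, unsigned int b)
{
	return !((a ^ b) >> NUMACHIP_LAPIC_BITS);
}
```

For example, 0x1203 and 0x12ff differ only in the low 8 bits and so take the local path, while 0x1203 and 0x1303 do not.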
[tip:x86/apic] x86/numachip: Introduce Numachip2 timer mechanisms
Commit-ID: ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7 Gitweb: http://git.kernel.org/tip/ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7 Author: Daniel J Blueman AuthorDate: Mon, 21 Sep 2015 18:02:25 +0800 Committer: Thomas Gleixner CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Introduce Numachip2 timer mechanisms Add 1GHz 64-bit Numachip2 clocksource timer support for accurate system-wide timekeeping, as core TSCs are unsynchronised. Additionally, add a per-core clockevent mechanism that interrupts via the platform IPI vector after a programmed period. [ tglx: Taking it through x86 due to dependencies ] Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold Cc: Daniel Lezcano Link: http://lkml.kernel.org/r/1442829745-29311-1-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner --- arch/x86/include/asm/numachip/numachip_csr.h | 9 +++ drivers/clocksource/Makefile | 1 + drivers/clocksource/numachip.c | 95 3 files changed, 105 insertions(+) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e09d845..29719ee 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) #define NUMACHIP2_LCSR_BASE 0xf000UL #define NUMACHIP2_LCSR_SIZE 0x100UL #define NUMACHIP2_APIC_ICR0x10 +#define NUMACHIP2_TIMER_DEADLINE 0x20 +#define NUMACHIP2_TIMER_INT 0x28 +#define NUMACHIP2_TIMER_NOW 0x200018 +#define NUMACHIP2_TIMER_RESET 0x200020 static inline void __iomem *numachip2_lcsr_address(unsigned long offset) { @@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) writeq(val, numachip2_lcsr_address(offset)); } +static inline unsigned int numachip2_timer(void) +{ + return (smp_processor_id() % 48) << 6; +} + #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile index 
5c00863..57dfad3 100644 --- a/drivers/clocksource/Makefile +++ b/drivers/clocksource/Makefile @@ -62,3 +62,4 @@ obj-$(CONFIG_H8300) += h8300_timer8.o obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o +obj-$(CONFIG_X86_NUMACHIP) += numachip.o diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c new file mode 100644 index 000..088e5fa --- /dev/null +++ b/drivers/clocksource/numachip.c @@ -0,0 +1,95 @@ +/* + * + * Copyright (C) 2015 Numascale AS. All rights reserved. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#include + +#include +#include +#include + +static DEFINE_PER_CPU(struct clock_event_device, cpu_ced); + +static cycles_t numachip2_timer_read(struct clocksource *cs) +{ + return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW); +} + +static struct clocksource numachip2_clocksource = { + .name = "numachip2", + .rating = 295, + .read = numachip2_timer_read, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .mult = 1, + .shift = 0, +}; + +static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced) +{ + numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(), + delta); + return 0; +} + +static struct clock_event_device numachip2_clockevent = { + .name = "numachip2", + .rating = 400, + .set_next_event = numachip2_set_next_event, + .features = CLOCK_EVT_FEAT_ONESHOT, + .mult = 1, + .shift = 0, + .min_delta_ns = 1250, + .max_delta_ns = LONG_MAX, +}; + +static void numachip_timer_interrupt(void) +{ + struct clock_event_device *ced = this_cpu_ptr(&cpu_ced); + + ced->event_handler(ced); +} + +static __init void numachip_timer_each(struct work_struct *work) +{ + unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff; + struct clock_event_device *ced = this_cpu_ptr(&cpu_ced); + + /* Setup IPI vector to local core and relative timing mode */ + numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_
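Two details of the timer driver can be sketched in user space: the per-core register offset computed by numachip2_timer(), and why mult = 1 with shift = 0 suits a 1GHz counter. The kernel's clocksource conversion is ns = (cycles * mult) >> shift, so with these values one tick is one nanosecond. Function names below are illustrative, not kernel symbols, and the CPU number is passed in rather than read via smp_processor_id().

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors numachip2_timer(): 48 timer slots, each 64 bytes (1 << 6) apart. */
static unsigned int timer_slot_offset(unsigned int cpu)
{
	return (cpu % 48) << 6;
}

/* Same arithmetic as the kernel's cycles-to-nanoseconds conversion. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

CPUs 0 and 48 map to the same offset, which fits the modulo in the patch selecting a slot among 48 per-core timers.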
[tip:x86/apic] x86/numachip: Add Numachip2 APIC support
Commit-ID: d9d4dee6cedfa17e5eedcba242dca3091bf73bc3 Gitweb: http://git.kernel.org/tip/d9d4dee6cedfa17e5eedcba242dca3091bf73bc3 Author: Daniel J Blueman AuthorDate: Mon, 21 Sep 2015 01:02:00 +0800 Committer: Thomas Gleixner CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Add Numachip2 APIC support Introduce support for Numachip2 remote interrupts via detecting the right ACPI SRAT signature. Access is performed via a fixed mapping in the x86 physical address space. Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold Cc: Daniel Lezcano Link: http://lkml.kernel.org/r/1442768522-19217-2-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner --- arch/x86/include/asm/numachip/numachip.h | 1 + arch/x86/include/asm/numachip/numachip_csr.h | 35 +++ arch/x86/kernel/apic/apic_numachip.c | 93 3 files changed, 129 insertions(+) diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h index 1c6f7f6..c64373a 100644 --- a/arch/x86/include/asm/numachip/numachip.h +++ b/arch/x86/include/asm/numachip/numachip.h @@ -14,6 +14,7 @@ #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H #define _ASM_X86_NUMACHIP_NUMACHIP_H +extern u8 numachip_system; extern int __init pci_numachip_init(void); #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */ diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index 7469b13..e08b803 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -14,6 +14,7 @@ #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H +#include #include #define CSR_NODE_SHIFT 16 @@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) writel(swab32(val), lcsr_address(offset)); } +/* + * On NumaChip2, local CSR space is 16MB and starts at fixed offset below 4G + */ + +#define NUMACHIP2_LCSR_BASE 0xf000UL +#define NUMACHIP2_LCSR_SIZE 0x100UL +#define NUMACHIP2_APIC_ICR0x10 + 
+static inline void __iomem *numachip2_lcsr_address(unsigned long offset) +{ + return (void __iomem *)__va(NUMACHIP2_LCSR_BASE | + (offset & (NUMACHIP2_LCSR_SIZE - 1))); +} + +static inline u32 numachip2_read32_lcsr(unsigned long offset) +{ + return readl(numachip2_lcsr_address(offset)); +} + +static inline u64 numachip2_read64_lcsr(unsigned long offset) +{ + return readq(numachip2_lcsr_address(offset)); +} + +static inline void numachip2_write32_lcsr(unsigned long offset, u32 val) +{ + writel(val, numachip2_lcsr_address(offset)); +} + +static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) +{ + writeq(val, numachip2_lcsr_address(offset)); +} + #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index eeefbb1..3cb9294 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -22,6 +22,7 @@ u8 numachip_system __read_mostly; static const struct apic apic_numachip1; +static const struct apic apic_numachip2; static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly; static unsigned int numachip1_get_apic_id(unsigned long x) @@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id) return x; } +static unsigned int numachip2_get_apic_id(unsigned long x) +{ + u64 mcfg; + + rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg); + return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24); +} + +static unsigned long numachip2_set_apic_id(unsigned int id) +{ + return id << 24; +} + static int numachip_apic_id_valid(int apicid) { /* Trust what bootloader passes in MADT */ @@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned int val) write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val); } +static void numachip2_apic_icr_write(int apicid, unsigned int val) +{ + numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val); +} + static int numachip_wakeup_secondary(int phys_apicid, unsigned 
long start_rip) { numachip_apic_icr_write(phys_apicid, APIC_DM_INIT); @@ -130,6 +149,11 @@ static int __init numachip1_probe(void) return apic == &apic_numachip1; } +static int __init numachip2_probe(void) +{ + return apic == &apic_numachip2; +} + static void fixup_cpu_id(struct cpuinfo_x86 *c, int node) { u64 val; @@ -155,6 +179,13 @@ static int __init numachip_system_init(void) numachip_apic_icr_write = numachip1_apic_icr_write; x86_init.pci.arch_init = pci_numachip_init; break; + case 2: + init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE); + numachip_api
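The bit arithmetic in numachip2_get_apic_id() and numachip2_set_apic_id() can be exercised outside the kernel. In this sketch the MSR value is passed in as a plain integer instead of being read with rdmsrl(), and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirrors numachip2_get_apic_id(): bits 28 and up of the MMIO config base
 * supply APIC ID bits 8..19 (12 bits kept by the 0xfff00 mask), and the top
 * byte of the local APIC ID register supplies bits 0..7.
 */
static uint32_t compose_apic_id(uint64_t mcfg, uint32_t apic_id_reg)
{
	return (uint32_t)((mcfg >> (28 - 8)) & 0xfff00) | (apic_id_reg >> 24);
}

/* Mirrors numachip2_set_apic_id(): only the low 8 bits survive the shift. */
static uint32_t apic_id_to_reg(uint32_t id)
{
	return id << 24;
}
```

The round trip is deliberately lossy: the register holds just the local 8 bits, and the upper part is recovered from the config base on every read.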
[tip:x86/apic] x86/numachip: Cleanup Numachip support
Commit-ID: db1003a719d75cebe5843a7906c02c29bec9922c Gitweb: http://git.kernel.org/tip/db1003a719d75cebe5843a7906c02c29bec9922c Author: Daniel J Blueman AuthorDate: Mon, 21 Sep 2015 01:01:59 +0800 Committer: Thomas Gleixner CommitDate: Tue, 22 Sep 2015 22:25:32 +0200 x86/numachip: Cleanup Numachip support Drop unused code and includes in Numachip header files and APIC driver. Additionally, use the 'numachip1' prefix on Numachip1-specific functions; this prepares for adding Numachip2 support in later patches. Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold Cc: Daniel Lezcano Link: http://lkml.kernel.org/r/1442768522-19217-1-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner --- arch/x86/include/asm/numachip/numachip_csr.h | 118 +-- arch/x86/kernel/apic/apic_numachip.c | 104 ++- 2 files changed, 44 insertions(+), 178 deletions(-) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index 660f843..7469b13 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -14,12 +14,7 @@ #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H -#include -#include #include -#include -#include -#include #define CSR_NODE_SHIFT 16 #define CSR_NODE_BITS(p) (((unsigned long)(p)) << CSR_NODE_SHIFT) @@ -27,11 +22,8 @@ /* 32K CSR space, b15 indicates geo/non-geo */ #define CSR_OFFSET_MASK0x7fffUL - -/* Global CSR space covers all 4K possible nodes with 64K CSR space per node */ -#define NUMACHIP_GCSR_BASE 0x3fffULL -#define NUMACHIP_GCSR_LIM 0x3fff0fffULL -#define NUMACHIP_GCSR_SIZE (NUMACHIP_GCSR_LIM - NUMACHIP_GCSR_BASE + 1) +#define CSR_G0_NODE_IDS (0x008 + (0 << 12)) +#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12)) /* * Local CSR space starts in global CSR space with "nodeid" = 0xfff0, however @@ -42,28 +34,12 @@ #define NUMACHIP_LCSR_LIM 0x3fffULL #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1) -static 
inline void *gcsr_address(int node, unsigned long offset) -{ - return __va(NUMACHIP_GCSR_BASE | (1UL << 15) | - CSR_NODE_BITS(node & CSR_NODE_MASK) | (offset & CSR_OFFSET_MASK)); -} - static inline void *lcsr_address(unsigned long offset) { return __va(NUMACHIP_LCSR_BASE | (1UL << 15) | CSR_NODE_BITS(0xfff0) | (offset & CSR_OFFSET_MASK)); } -static inline unsigned int read_gcsr(int node, unsigned long offset) -{ - return swab32(readl(gcsr_address(node, offset))); -} - -static inline void write_gcsr(int node, unsigned long offset, unsigned int val) -{ - writel(swab32(val), gcsr_address(node, offset)); -} - static inline unsigned int read_lcsr(unsigned long offset) { return swab32(readl(lcsr_address(offset))); @@ -74,94 +50,4 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) writel(swab32(val), lcsr_address(offset)); } -/* = */ -/* CSR_G0_STATE_CLEAR */ -/* = */ - -#define CSR_G0_STATE_CLEAR (0x000 + (0 << 12)) -union numachip_csr_g0_state_clear { - unsigned int v; - struct numachip_csr_g0_state_clear_s { - unsigned int _state:2; - unsigned int _rsvd_2_6:5; - unsigned int _lost:1; - unsigned int _rsvd_8_31:24; - } s; -}; - -/* = */ -/* CSR_G0_NODE_IDS */ -/* = */ - -#define CSR_G0_NODE_IDS (0x008 + (0 << 12)) -union numachip_csr_g0_node_ids { - unsigned int v; - struct numachip_csr_g0_node_ids_s { - unsigned int _initialid:16; - unsigned int _nodeid:12; - unsigned int _rsvd_28_31:4; - } s; -}; - -/* = */ -/* CSR_G3_EXT_IRQ_GEN */ -/* = */ - -#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12)) -union numachip_csr_g3_ext_irq_gen { - unsigned int v; - struct numachip_csr_g3_ext_irq_gen_s { - unsigned int _vector:8; - unsigned int _msgtype:3; - unsigned int _index:5; - unsigned int _destination_apic_id:16; - } s; -}; - -/* ==
[tip:x86/apic] x86/numachip: Add Numachip IPI optimisations
Commit-ID: ad03a9c25d258641556c7198e26fd882c741987a Gitweb: http://git.kernel.org/tip/ad03a9c25d258641556c7198e26fd882c741987a Author: Daniel J Blueman <dan...@numascale.com> AuthorDate: Mon, 21 Sep 2015 01:02:01 +0800 Committer: Thomas Gleixner <t...@linutronix.de> CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Add Numachip IPI optimisations When sending IPIs, first check if the non-local part of the source and destination APIC IDs match; if so, send via the local APIC for efficiency. Secondly, since the AMD BIOS-kernel developer guide states IPI delivery will occur regardless of prior delivery status, avoid polling the delivery status bit for efficiency. Signed-off-by: Daniel J Blueman <dan...@numascale.com> Acked-by: Steffen Persvold <s...@numascale.com> Cc: Daniel Lezcano <daniel.lezc...@linaro.org> Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner <t...@linutronix.de> --- arch/x86/include/asm/numachip/numachip_csr.h | 1 + arch/x86/kernel/apic/apic_numachip.c | 37 2 files changed, 33 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e08b803..e09d845 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -34,6 +34,7 @@ #define NUMACHIP_LCSR_BASE 0x3e00ULL #define NUMACHIP_LCSR_LIM 0x3fffULL #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1) +#define NUMACHIP_LAPIC_BITS 8 static inline void *lcsr_address(unsigned long offset) { diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index 3cb9294..38dd5ef 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -96,9 +96,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip) static void numachip_send_IPI_one(int cpu, int vector) { - int apicid = per_cpu(x86_cpu_to_apicid, 
cpu); + int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu); unsigned int dmode; + preempt_disable(); + local_apicid = __this_cpu_read(x86_cpu_to_apicid); + + /* Send via local APIC where non-local part matches */ + if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) { + unsigned long flags; + + local_irq_save(flags); + __default_send_IPI_dest_field(apicid, vector, + APIC_DEST_PHYSICAL); + local_irq_restore(flags); + preempt_enable(); + return; + } + preempt_enable(); + dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED; numachip_apic_icr_write(apicid, dmode | vector); } @@ -218,6 +234,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id) return 1; } +/* APIC IPIs are queued */ +static void numachip_apic_wait_icr_idle(void) +{ +} + +/* APIC NMI IPIs are queued */ +static u32 numachip_safe_apic_wait_icr_idle(void) +{ + return 0; +} + static const struct apic apic_numachip1 __refconst = { .name = "NumaConnect system", .probe = numachip1_probe, @@ -263,8 +290,8 @@ static const struct apic apic_numachip1 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip1); @@ -314,8 +341,8 @@ static const struct apic apic_numachip2 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip2);
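The locality test in numachip_send_IPI_one() can be sketched on its own as follows. This is a minimal illustration, assuming the low NUMACHIP_LAPIC_BITS (8) bits of an APIC ID select the core behind one Numachip and the remaining bits select the remote part. The helper name same_non_local_part() is hypothetical and not part of the patch, and unsigned types are used here so the shift is well defined, where the patch itself uses int.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as the NUMACHIP_LAPIC_BITS define added by the patch. */
#define NUMACHIP_LAPIC_BITS 8

/*
 * Hypothetical helper (not in the patch): returns true when the two
 * APIC IDs differ at most in their low NUMACHIP_LAPIC_BITS bits, i.e.
 * when the patch would send the IPI through the local APIC.
 */
static bool same_non_local_part(uint32_t apicid, uint32_t local_apicid)
{
	/* XOR leaves only differing bits; the shift drops the local part. */
	return ((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS) == 0;
}
```

Under these assumptions, IDs 0x105 and 0x1ff would take the local APIC path, while 0x105 and 0x205 would go through numachip_apic_icr_write().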
[tip:x86/apic] x86/numachip: Introduce Numachip2 timer mechanisms
Commit-ID: ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7 Gitweb: http://git.kernel.org/tip/ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7 Author: Daniel J Blueman <dan...@numascale.com> AuthorDate: Mon, 21 Sep 2015 18:02:25 +0800 Committer: Thomas Gleixner <t...@linutronix.de> CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Introduce Numachip2 timer mechanisms Add 1GHz 64-bit Numachip2 clocksource timer support for accurate system-wide timekeeping, as core TSCs are unsynchronised. Additionally, add a per-core clockevent mechanism that interrupts via the platform IPI vector after a programmed period. [ tglx: Taking it through x86 due to dependencies ] Signed-off-by: Daniel J Blueman <dan...@numascale.com> Acked-by: Steffen Persvold <s...@numascale.com> Cc: Daniel Lezcano <daniel.lezc...@linaro.org> Link: http://lkml.kernel.org/r/1442829745-29311-1-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner <t...@linutronix.de> --- arch/x86/include/asm/numachip/numachip_csr.h | 9 +++ drivers/clocksource/Makefile | 1 + drivers/clocksource/numachip.c | 95 3 files changed, 105 insertions(+) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e09d845..29719ee 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) #define NUMACHIP2_LCSR_BASE 0xf0000000UL #define NUMACHIP2_LCSR_SIZE 0x1000000UL #define NUMACHIP2_APIC_ICR 0x100000 +#define NUMACHIP2_TIMER_DEADLINE 0x200000 +#define NUMACHIP2_TIMER_INT 0x200008 +#define NUMACHIP2_TIMER_NOW 0x200018 +#define NUMACHIP2_TIMER_RESET 0x200020 static inline void __iomem *numachip2_lcsr_address(unsigned long offset) { @@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) writeq(val, numachip2_lcsr_address(offset)); } +static inline unsigned int numachip2_timer(void) +{ + return (smp_processor_id() % 48) << 6; +} 
+ #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile index 5c00863..57dfad3 100644 --- a/drivers/clocksource/Makefile +++ b/drivers/clocksource/Makefile @@ -62,3 +62,4 @@ obj-$(CONFIG_H8300) += h8300_timer8.o obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o obj-$(CONFIG_H8300_TPU) += h8300_tpu.o obj-$(CONFIG_CLKSRC_ST_LPC) += clksrc_st_lpc.o +obj-$(CONFIG_X86_NUMACHIP) += numachip.o diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c new file mode 100644 index 0000000..088e5fa --- /dev/null +++ b/drivers/clocksource/numachip.c @@ -0,0 +1,95 @@ +/* + * + * Copyright (C) 2015 Numascale AS. All rights reserved. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + */ + +#include + +#include +#include +#include + +static DEFINE_PER_CPU(struct clock_event_device, cpu_ced); + +static cycles_t numachip2_timer_read(struct clocksource *cs) +{ + return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW); +} + +static struct clocksource numachip2_clocksource = { + .name = "numachip2", + .rating = 295, + .read = numachip2_timer_read, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .mult = 1, + .shift = 0, +}; + +static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced) +{ + numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(), + delta); + return 0; +} + +static struct clock_event_device numachip2_clockevent = { + .name = "numachip2", + .rating = 400, + .set_next_event = numachip2_set_next_event, + .features = CLOCK_EVT_FEAT_ONESHOT, + .mult = 1, + .shift = 0, + .min_delta_ns = 1250, + .max_delta_ns = LONG_MAX, +}; + +static void numachip_timer_interrupt(void) +{ + struct clock_event_device *ced = this_cpu_ptr(&cpu_ced); + + ced->event_handler(ced); +} + +static __init void numachip_timer_each(struct work_struct *work) +{ + unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff; + struct clock_event_device *
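The per-core register addressing used by numachip2_set_next_event() can be illustrated in isolation. The sketch below assumes only what numachip2_timer() shows: each core gets a 64-byte slot, chosen by the CPU number modulo 48, which is added to a register's base offset. The names timer_slot_offset() and timer_reg() are hypothetical, and the CPU number and base offset are passed in as parameters rather than read from the kernel.

```c
#include <stdint.h>

/*
 * Mirrors the arithmetic of numachip2_timer(), with the CPU number
 * supplied by the caller instead of smp_processor_id().
 */
static uint32_t timer_slot_offset(uint32_t cpu)
{
	/* 48 slots per chip, each 1 << 6 = 64 bytes apart. */
	return (cpu % 48) << 6;
}

/*
 * Hypothetical helper: offset of a per-core timer register, given the
 * register's base offset (for example a deadline or interrupt register).
 */
static uint64_t timer_reg(uint64_t base_offset, uint32_t cpu)
{
	return base_offset + timer_slot_offset(cpu);
}
```

Note that CPU 48 wraps back to slot 0, so the scheme presumably relies on at most 48 cores sharing one chip's local CSR space.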
[tip:x86/apic] x86/numachip: Add Numachip2 APIC support
Commit-ID: d9d4dee6cedfa17e5eedcba242dca3091bf73bc3 Gitweb: http://git.kernel.org/tip/d9d4dee6cedfa17e5eedcba242dca3091bf73bc3 Author: Daniel J Blueman <dan...@numascale.com> AuthorDate: Mon, 21 Sep 2015 01:02:00 +0800 Committer: Thomas Gleixner <t...@linutronix.de> CommitDate: Tue, 22 Sep 2015 22:25:33 +0200 x86/numachip: Add Numachip2 APIC support Introduce support for Numachip2 remote interrupts via detecting the right ACPI SRAT signature. Access is performed via a fixed mapping in the x86 physical address space. Signed-off-by: Daniel J Blueman <dan...@numascale.com> Acked-by: Steffen Persvold <s...@numascale.com> Cc: Daniel Lezcano <daniel.lezc...@linaro.org> Link: http://lkml.kernel.org/r/1442768522-19217-2-git-send-email-dan...@numascale.com Signed-off-by: Thomas Gleixner <t...@linutronix.de> --- arch/x86/include/asm/numachip/numachip.h | 1 + arch/x86/include/asm/numachip/numachip_csr.h | 35 +++ arch/x86/kernel/apic/apic_numachip.c | 93 3 files changed, 129 insertions(+) diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h index 1c6f7f6..c64373a 100644 --- a/arch/x86/include/asm/numachip/numachip.h +++ b/arch/x86/include/asm/numachip/numachip.h @@ -14,6 +14,7 @@ #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H #define _ASM_X86_NUMACHIP_NUMACHIP_H +extern u8 numachip_system; extern int __init pci_numachip_init(void); #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */ diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index 7469b13..e08b803 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -14,6 +14,7 @@ #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H +#include #include #define CSR_NODE_SHIFT 16 @@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) writel(swab32(val), lcsr_address(offset)); } +/* + * On NumaChip2, local CSR space is 16MB and starts at fixed 
offset below 4G + */ + +#define NUMACHIP2_LCSR_BASE 0xf0000000UL +#define NUMACHIP2_LCSR_SIZE 0x1000000UL +#define NUMACHIP2_APIC_ICR 0x100000 + +static inline void __iomem *numachip2_lcsr_address(unsigned long offset) +{ + return (void __iomem *)__va(NUMACHIP2_LCSR_BASE | + (offset & (NUMACHIP2_LCSR_SIZE - 1))); +} + +static inline u32 numachip2_read32_lcsr(unsigned long offset) +{ + return readl(numachip2_lcsr_address(offset)); +} + +static inline u64 numachip2_read64_lcsr(unsigned long offset) +{ + return readq(numachip2_lcsr_address(offset)); +} + +static inline void numachip2_write32_lcsr(unsigned long offset, u32 val) +{ + writel(val, numachip2_lcsr_address(offset)); +} + +static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) +{ + writeq(val, numachip2_lcsr_address(offset)); +} + #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index eeefbb1..3cb9294 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -22,6 +22,7 @@ u8 numachip_system __read_mostly; static const struct apic apic_numachip1; +static const struct apic apic_numachip2; static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly; static unsigned int numachip1_get_apic_id(unsigned long x) @@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id) return x; } +static unsigned int numachip2_get_apic_id(unsigned long x) +{ + u64 mcfg; + + rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg); + return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24); +} + +static unsigned long numachip2_set_apic_id(unsigned int id) +{ + return id << 24; +} + static int numachip_apic_id_valid(int apicid) { /* Trust what bootloader passes in MADT */ @@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned int val) write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val); } +static void numachip2_apic_icr_write(int apicid, unsigned int val) +{ + 
numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val); +} + static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip) { numachip_apic_icr_write(phys_apicid, APIC_DM_INIT); @@ -130,6 +149,11 @@ static int __init numachip1_probe(void) return apic == &apic_numachip1; } +static int __init numachip2_probe(void) +{ + return apic == &apic_numachip2; +} + static void fixup_cpu_id(struct cpuinfo_x86 *c, int node) { u64 val; @@ -155,6 +179,13 @@ static int __init numachip_system_init(void) numachip_apic_icr_write = numachip1_apic_icr_write; x86_ini
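The bit manipulation in numachip2_get_apic_id() may be easier to follow as a standalone function. This is a sketch only: the MSR value and the local APIC ID register value are passed in as plain integers instead of being read via rdmsrl(), the name compose_apic_id() is hypothetical, and the register argument is narrowed to 32 bits on the assumption that only bits 24 to 31 carry the local APIC ID.

```c
#include <stdint.h>

/*
 * Bits 28..39 of the MMIO configuration base value become bits 8..19
 * of the extended APIC ID; bits 24..31 of the local APIC ID register
 * supply the low 8 bits.
 */
static uint32_t compose_apic_id(uint64_t mcfg, uint32_t apic_id_reg)
{
	uint32_t upper = (uint32_t)((mcfg >> (28 - 8)) & 0xfff00);
	uint32_t lower = apic_id_reg >> 24;

	return upper | lower;
}
```

The mask 0xfff00 discards whatever lower bits of the MSR value slide into bits 0..7 after the shift, so only the register value contributes the local part.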
[PATCH v2] x86: Introduce Numachip2 timer mechanisms
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate system-wide timekeeping, as core TSCs are unsynchronised. Additionally, add a per-core clockevent mechanism that interrupts via the platform IPI vector after a programmed period. v2: Fix whitespace and wrapping issue Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold --- arch/x86/include/asm/numachip/numachip_csr.h | 9 +++ drivers/clocksource/Makefile | 1 + drivers/clocksource/numachip.c | 95 3 files changed, 105 insertions(+) create mode 100644 drivers/clocksource/numachip.c diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e09d845..29719ee 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) #define NUMACHIP2_LCSR_BASE 0xf0000000UL #define NUMACHIP2_LCSR_SIZE 0x1000000UL #define NUMACHIP2_APIC_ICR 0x100000 +#define NUMACHIP2_TIMER_DEADLINE 0x200000 +#define NUMACHIP2_TIMER_INT 0x200008 +#define NUMACHIP2_TIMER_NOW 0x200018 +#define NUMACHIP2_TIMER_RESET 0x200020 static inline void __iomem *numachip2_lcsr_address(unsigned long offset) { @@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) writeq(val, numachip2_lcsr_address(offset)); } +static inline unsigned int numachip2_timer(void) +{ + return (smp_processor_id() % 48) << 6; +} + #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile index 5c00863..57dfad3 100644 --- a/drivers/clocksource/Makefile +++ b/drivers/clocksource/Makefile @@ -62,3 +62,4 @@ obj-$(CONFIG_H8300) += h8300_timer8.o obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o obj-$(CONFIG_H8300_TPU) += h8300_tpu.o obj-$(CONFIG_CLKSRC_ST_LPC) += clksrc_st_lpc.o +obj-$(CONFIG_X86_NUMACHIP) += numachip.o diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c new file mode 100644 index 
0000000..088e5fa --- /dev/null +++ b/drivers/clocksource/numachip.c @@ -0,0 +1,95 @@ +/* + * + * Copyright (C) 2015 Numascale AS. All rights reserved. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include + +#include +#include +#include + +static DEFINE_PER_CPU(struct clock_event_device, cpu_ced); + +static cycles_t numachip2_timer_read(struct clocksource *cs) +{ + return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW); +} + +static struct clocksource numachip2_clocksource = { + .name = "numachip2", + .rating = 295, + .read = numachip2_timer_read, + .mask = CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .mult = 1, + .shift = 0, +}; + +static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced) +{ + numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(), + delta); + return 0; +} + +static struct clock_event_device numachip2_clockevent = { + .name = "numachip2", + .rating = 400, + .set_next_event = numachip2_set_next_event, + .features = CLOCK_EVT_FEAT_ONESHOT, + .mult = 1, + .shift = 0, + .min_delta_ns = 1250, + .max_delta_ns = LONG_MAX, +}; + +static void numachip_timer_interrupt(void) +{ + struct clock_event_device *ced = this_cpu_ptr(&cpu_ced); + + ced->event_handler(ced); +} + +static __init void numachip_timer_each(struct work_struct *work) +{ + unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff; + struct clock_event_device *ced = this_cpu_ptr(&cpu_ced); + + /* Setup IPI vector to local core and relative timing mode */ + numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(), + (3 
<< 22) | (X86_PLATFORM_IPI_VECTOR << 14) | + (local_apicid << 6)); + + *ced = numachip2_clockevent; + ced->cpumask = cpumask_of(smp_processor_id()); + clockevents_register_device(ced); +} + +static int __init numachip_timer_init(void) +{ + if (numachip_system != 2) + return -ENODEV; + + /* Reset timer */ +
[PATCH 3/4] x86: Add Numachip IPI optimisations
When sending IPIs, first check if the non-local part of the source and destination APIC IDs match; if so, send via the local APIC for efficiency. Secondly, since the AMD BIOS-kernel developer guide states IPI delivery will occur regardless of prior delivery status, avoid polling the delivery status bit for efficiency. Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold --- arch/x86/include/asm/numachip/numachip_csr.h | 1 + arch/x86/kernel/apic/apic_numachip.c | 36 2 files changed, 32 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index c7efc25..75379f6 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -34,6 +34,7 @@ #define NUMACHIP_LCSR_BASE 0x3e00ULL #define NUMACHIP_LCSR_LIM 0x3fffULL #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1) +#define NUMACHIP_LAPIC_BITS 8 static inline void *lcsr_address(unsigned long offset) { diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index dfe2b1c..81bc216 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -95,9 +95,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip) static void numachip_send_IPI_one(int cpu, int vector) { - int apicid = per_cpu(x86_cpu_to_apicid, cpu); + int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu); unsigned int dmode; + preempt_disable(); + local_apicid = __this_cpu_read(x86_cpu_to_apicid); + + /* Send via local APIC where non-local part matches */ + if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) { + unsigned long flags; + + local_irq_save(flags); + __default_send_IPI_dest_field(apicid, vector, + APIC_DEST_PHYSICAL); + local_irq_restore(flags); + preempt_enable(); + return; + } + preempt_enable(); + dmode = (vector == NMI_VECTOR) ? 
APIC_DM_NMI : APIC_DM_FIXED; numachip_apic_icr_write(apicid, dmode | vector); } @@ -217,6 +232,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id) return 1; } +/* APIC IPIs are queued */ +static void numachip_apic_wait_icr_idle(void) +{ +} + +/* APIC NMI IPIs are queued */ +static u32 numachip_safe_apic_wait_icr_idle(void) +{ + return 0; +} + static const struct apic apic_numachip1 __refconst = { .name = "NumaConnect system", .probe = numachip1_probe, @@ -262,8 +288,8 @@ static const struct apic apic_numachip1 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip1); @@ -313,8 +339,8 @@ static const struct apic apic_numachip2 __refconst = { .eoi_write = native_apic_mem_write, .icr_read = native_apic_icr_read, .icr_write = native_apic_icr_write, - .wait_icr_idle = native_apic_wait_icr_idle, - .safe_wait_icr_idle = native_safe_apic_wait_icr_idle, + .wait_icr_idle = numachip_apic_wait_icr_idle, + .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle, }; apic_driver(apic_numachip2); -- 2.5.0
[PATCH 4/4] x86: Introduce Numachip2 timer mechanisms
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate system-wide timekeeping, as core TSCs are unsynchronised. Additionally, add a per-core clockevent mechanism that interrupts via the platform IPI vector after a programmed period. Signed-off-by: Daniel J Blueman Acked-by: Steffen Persvold --- arch/x86/include/asm/numachip/numachip_csr.h | 9 + drivers/clocksource/Makefile | 1 + 2 files changed, 10 insertions(+) diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h index e09d845..29719ee 100644 --- a/arch/x86/include/asm/numachip/numachip_csr.h +++ b/arch/x86/include/asm/numachip/numachip_csr.h @@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val) #define NUMACHIP2_LCSR_BASE 0xf0000000UL #define NUMACHIP2_LCSR_SIZE 0x1000000UL #define NUMACHIP2_APIC_ICR 0x100000 +#define NUMACHIP2_TIMER_DEADLINE 0x200000 +#define NUMACHIP2_TIMER_INT 0x200008 +#define NUMACHIP2_TIMER_NOW 0x200018 +#define NUMACHIP2_TIMER_RESET 0x200020 static inline void __iomem *numachip2_lcsr_address(unsigned long offset) { @@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val) writeq(val, numachip2_lcsr_address(offset)); } +static inline unsigned int numachip2_timer(void) +{ + return (smp_processor_id() % 48) << 6; +} + #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */ diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile index 5c00863..57dfad3 100644 --- a/drivers/clocksource/Makefile +++ b/drivers/clocksource/Makefile @@ -62,3 +62,4 @@ obj-$(CONFIG_H8300) += h8300_timer8.o obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o obj-$(CONFIG_H8300_TPU) += h8300_tpu.o obj-$(CONFIG_CLKSRC_ST_LPC) += clksrc_st_lpc.o +obj-$(CONFIG_X86_NUMACHIP) += numachip.o diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c new file mode 100644 index 0000000..5e4f90e --- /dev/null +++ b/drivers/clocksource/numachip.c @@ -0,0 +1,95 @@ +/* + * + * Copyright (C) 2015 Numascale AS. 
All rights reserved. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include + +#include +#include +#include + +static DEFINE_PER_CPU(struct clock_event_device, cpu_ced); + +static cycles_t numachip2_timer_read(struct clocksource *cs) +{ + return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW); +} + +static struct clocksource numachip2_clocksource = { + .name= "numachip2", + .rating = 295, + .read= numachip2_timer_read, + .mask= CLOCKSOURCE_MASK(64), + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .mult= 1, + .shift = 0, +}; + +static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced) +{ + numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(), + delta); + return 0; +} + +static struct clock_event_device numachip2_clockevent = { + .name= "numachip2", + .rating = 400, + .set_next_event = numachip2_set_next_event, + .features= CLOCK_EVT_FEAT_ONESHOT, + .mult= 1, + .shift = 0, + .min_delta_ns= 1250, + .max_delta_ns= LONG_MAX, +}; + +static void numachip_timer_interrupt(void) +{ + struct clock_event_device *ced = this_cpu_ptr(_ced); + + ced->event_handler(ced); +} + +static __init void numachip_timer_each(struct work_struct *work) +{ + unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff; + struct clock_event_device *ced = this_cpu_ptr(_ced); + + /* Setup IPI vector to local core and relative timing mode */ + numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(), + | (X86_PLATFORM_IPI_VECTOR << 14) | + (local_apicid << 6)); + + *ced = numachip2_clockevent; + ced->cpumask = 
cpumask_of(smp_processor_id()); + clockevents_register_device(ced); +} + +static int __init numachip_timer_init(void) +{ + if (numachip_system != 2) + return -ENODEV; + + /* Reset timer */ + numachip2_write64_lcsr(NUMACHIP2_TIMER_RESET, 0); + clocksource_register_hz(_clocksource, NSEC_PER_SEC); + + /* Setup per-cpu clockevents */ + x86_
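As an aside on the register addressing in numachip2_timer() in the patch: the following is a minimal standalone sketch, assuming only what the patch text shows, namely 48 per-core timer slots spaced 64 bytes (1 << 6) apart. The names timer_slot_offset() and scale_ticks() are hypothetical and do not appear in the patch; scale_ticks() merely illustrates that a mult of 1 and a shift of 0 make the usual (value * mult) >> shift conversion an identity, which is what one would expect for a counter ticking at 1GHz (one tick per nanosecond).

```c
#include <stdint.h>

/* Assumptions taken from the patch text: 48 timer slots, 64-byte stride. */
#define TIMER_SLOTS	48u
#define TIMER_STRIDE_SHIFT	6u

/* Byte offset of the timer register block used by a given CPU number
 * (hypothetical helper mirroring numachip2_timer()). */
static unsigned int timer_slot_offset(unsigned int cpu)
{
	return (cpu % TIMER_SLOTS) << TIMER_STRIDE_SHIFT;
}

/* Generic multiply-and-shift scaling between ticks and nanoseconds;
 * with mult = 1 and shift = 0 the value passes through unchanged. */
static uint64_t scale_ticks(uint64_t value, uint32_t mult, uint32_t shift)
{
	return (value * mult) >> shift;
}
```

Under these assumptions, CPUs 0 and 48 would share slot offset 0, so the modulo only makes sense if at most 48 cores address one Numachip2 timer block; the patch does not spell this out.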
[PATCH 2/4] x86: Add Numachip2 APIC support
Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature. Access is performed via a fixed mapping in the
x86 physical address space.

Signed-off-by: Daniel J Blueman
Acked-by: Steffen Persvold
---
 arch/x86/include/asm/numachip/numachip.h | 1 +
 arch/x86/include/asm/numachip/numachip_csr.h | 34 ++
 arch/x86/kernel/apic/apic_numachip.c | 93
 3 files changed, 128 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h
index 1c6f7f6..c64373a 100644
--- a/arch/x86/include/asm/numachip/numachip.h
+++ b/arch/x86/include/asm/numachip/numachip.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_H
 
+extern u8 numachip_system;
 extern int __init pci_numachip_init(void);
 
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index 7469b13..c7efc25 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
+#include
 #include
 
 #define CSR_NODE_SHIFT 16
@@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
 	writel(swab32(val), lcsr_address(offset));
 }
 
+/*
+ * On NumaChip2, local CSR space is 16MB and starts at fixed offset below 4G
+ */
+
+#define NUMACHIP2_LCSR_BASE 0xf000UL
+#define NUMACHIP2_LCSR_SIZE 0x100UL
+#define NUMACHIP2_APIC_ICR 0x10
+
+static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
+{
+	return (void __iomem *)__va(NUMACHIP2_LCSR_BASE |
+		(offset & (NUMACHIP2_LCSR_SIZE - 1)));
+}
+
+static inline u32 numachip2_read32_lcsr(unsigned long offset)
+{
+	return readl(numachip2_lcsr_address(offset));
+}
+
+static inline u64 numachip2_read64_lcsr(unsigned long offset)
+{
+	return readq(numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write32_lcsr(unsigned long offset, u32 val)
+{
+	writel(val, numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
+{
+	writeq(val, numachip2_lcsr_address(offset));
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 8729249..dfe2b1c 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -22,6 +22,7 @@
 u8 numachip_system __read_mostly;
 
 static const struct apic apic_numachip1;
+static const struct apic apic_numachip2;
 static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly;
 
 static unsigned int numachip1_get_apic_id(unsigned long x)
@@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id)
 	return x;
 }
 
+static unsigned int numachip2_get_apic_id(unsigned long x)
+{
+	u64 mcfg;
+
+	rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg);
+	return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24);
+}
+
+static unsigned long numachip2_set_apic_id(unsigned int id)
+{
+	return id << 24;
+}
+
 static int numachip_apic_id_valid(int apicid)
 {
 	/* Trust what bootloader passes in MADT */
@@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned int val)
 	write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val);
 }
 
+static void numachip2_apic_icr_write(int apicid, unsigned int val)
+{
+	numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
+}
+
 static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 {
 	numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
@@ -129,6 +148,11 @@ static int __init numachip1_probe(void)
 	return apic == &apic_numachip1;
 }
 
+static int __init numachip2_probe(void)
+{
+	return apic == &apic_numachip2;
+}
+
 static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
 {
 	u64 val;
@@ -154,6 +178,13 @@ static int __init numachip_system_init(void)
 		numachip_apic_icr_write = numachip1_apic_icr_write;
 		x86_init.pci.arch_init = pci_numachip_init;
 		break;
+	case 2:
+		init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
+		numachip_apic_icr_write = numachip2_apic_icr_write;
+
+		/* Use MCFG config cycles rather than locked CF8 cycles */
+		raw_pci_ops = &pci_mmcfg;
+		break;
 	default:
 		return 0;
 	}
@@ -175,6 +206,17 @@ static int numachip1_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 	return 1;
 }
 
+static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+	if ((strncmp(oem_id, "NUMASC"
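To make the bit manipulation in numachip2_get_apic_id() and numachip2_set_apic_id() easier to follow, here is a small userspace sketch of the same arithmetic. It assumes, going only by the expressions in the patch, that the core-local ID sits in bits 31:24 of the APIC ID register and that the bits of the MMIO configuration base MSR from bit 28 upwards supply the remaining part of the ID. The function names compose_apic_id() and apic_id_to_reg() are illustrative, and the MSR value is passed in as a plain argument rather than read with rdmsrl().

```c
#include <stdint.h>

/* Combine the upper ID bits taken from the MMIO config base value with the
 * 8-bit local ID held in bits 31:24 of the APIC ID register value. */
static uint32_t compose_apic_id(uint64_t mcfg, uint32_t apic_id_reg)
{
	uint32_t upper = (uint32_t)((mcfg >> (28 - 8)) & 0xfff00);

	return upper | (apic_id_reg >> 24);
}

/* Inverse direction for the local part: place an ID into bits 31:24. */
static uint32_t apic_id_to_reg(uint32_t id)
{
	return id << 24;
}
```

For example, an MMIO base with 0x123 in the bits starting at bit 28 and a register value of 0x05000000 would compose to 0x12305 under these assumptions.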
[PATCH 1/4] x86: Cleanup Numachip support
Drop unused code and includes in Numachip header files and APIC driver.

Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.

Signed-off-by: Daniel J Blueman
Acked-by: Steffen Persvold
---
 arch/x86/include/asm/numachip/numachip_csr.h | 118 +--
 arch/x86/kernel/apic/apic_numachip.c | 103 ++-
 2 files changed, 43 insertions(+), 178 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index 660f843..7469b13 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,12 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
-#include
-#include
 #include
-#include
-#include
-#include
 
 #define CSR_NODE_SHIFT 16
 #define CSR_NODE_BITS(p) (((unsigned long)(p)) << CSR_NODE_SHIFT)
@@ -27,11 +22,8 @@
 
 /* 32K CSR space, b15 indicates geo/non-geo */
 #define CSR_OFFSET_MASK 0x7fffUL
-
-/* Global CSR space covers all 4K possible nodes with 64K CSR space per node */
-#define NUMACHIP_GCSR_BASE 0x3fffULL
-#define NUMACHIP_GCSR_LIM 0x3fff0fffULL
-#define NUMACHIP_GCSR_SIZE (NUMACHIP_GCSR_LIM - NUMACHIP_GCSR_BASE + 1)
+#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
+#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
 
 /*
  * Local CSR space starts in global CSR space with "nodeid" = 0xfff0, however
@@ -42,28 +34,12 @@
 #define NUMACHIP_LCSR_LIM 0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
 
-static inline void *gcsr_address(int node, unsigned long offset)
-{
-	return __va(NUMACHIP_GCSR_BASE | (1UL << 15) |
-		CSR_NODE_BITS(node & CSR_NODE_MASK) | (offset & CSR_OFFSET_MASK));
-}
-
 static inline void *lcsr_address(unsigned long offset)
 {
 	return __va(NUMACHIP_LCSR_BASE | (1UL << 15) |
 		CSR_NODE_BITS(0xfff0) | (offset & CSR_OFFSET_MASK));
 }
 
-static inline unsigned int read_gcsr(int node, unsigned long offset)
-{
-	return swab32(readl(gcsr_address(node, offset)));
-}
-
-static inline void write_gcsr(int node, unsigned long offset, unsigned int val)
-{
-	writel(swab32(val), gcsr_address(node, offset));
-}
-
 static inline unsigned int read_lcsr(unsigned long offset)
 {
 	return swab32(readl(lcsr_address(offset)));
@@ -74,94 +50,4 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
 	writel(swab32(val), lcsr_address(offset));
 }
 
-/* = */
-/* CSR_G0_STATE_CLEAR */
-/* = */
-
-#define CSR_G0_STATE_CLEAR (0x000 + (0 << 12))
-union numachip_csr_g0_state_clear {
-	unsigned int v;
-	struct numachip_csr_g0_state_clear_s {
-		unsigned int _state:2;
-		unsigned int _rsvd_2_6:5;
-		unsigned int _lost:1;
-		unsigned int _rsvd_8_31:24;
-	} s;
-};
-
-/* = */
-/* CSR_G0_NODE_IDS */
-/* = */
-
-#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
-union numachip_csr_g0_node_ids {
-	unsigned int v;
-	struct numachip_csr_g0_node_ids_s {
-		unsigned int _initialid:16;
-		unsigned int _nodeid:12;
-		unsigned int _rsvd_28_31:4;
-	} s;
-};
-
-/* = */
-/* CSR_G3_EXT_IRQ_GEN */
-/* = */
-
-#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
-union numachip_csr_g3_ext_irq_gen {
-	unsigned int v;
-	struct numachip_csr_g3_ext_irq_gen_s {
-		unsigned int _vector:8;
-		unsigned int _msgtype:3;
-		unsigned int _index:5;
-		unsigned int _destination_apic_id:16;
-	} s;
-};
-
-/* = */
-/* CSR_G3_EXT_IRQ_STATUS */
-/* = */
-
-#define CSR_G3_EXT_IRQ_STATUS (0x034 + (3 << 12))
-union numachip_csr_g3_ext_irq_status {
-	unsigned int v;
-	struct numachip_csr_g3_ext_irq_status_s {
-		unsigned int _result:32;
-	} s;
-};
-
-/* ==
[PATCH 3/4] x86: Add Numachip IPI optimisations
When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.

Secondly, since the AMD BIOS-kernel developer guide states IPI delivery
will occur invariant of prior delivery status, avoid polling the delivery
status bit for efficiency.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h | 1 +
 arch/x86/kernel/apic/apic_numachip.c | 36
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index c7efc25..75379f6 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -34,6 +34,7 @@
 #define NUMACHIP_LCSR_BASE 0x3e00ULL
 #define NUMACHIP_LCSR_LIM 0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
+#define NUMACHIP_LAPIC_BITS 8
 
 static inline void *lcsr_address(unsigned long offset)
 {
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index dfe2b1c..81bc216 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -95,9 +95,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 
 static void numachip_send_IPI_one(int cpu, int vector)
 {
-	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	unsigned int dmode;
 
+	preempt_disable();
+	local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+	/* Send via local APIC where non-local part matches */
+	if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__default_send_IPI_dest_field(apicid, vector,
+			APIC_DEST_PHYSICAL);
+		local_irq_restore(flags);
+		preempt_enable();
+		return;
+	}
+	preempt_enable();
+
 	dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
 	numachip_apic_icr_write(apicid, dmode | vector);
 }
@@ -217,6 +232,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 	return 1;
 }
 
+/* APIC IPIs are queued */
+static void numachip_apic_wait_icr_idle(void)
+{
+}
+
+/* APIC NMI IPIs are queued */
+static u32 numachip_safe_apic_wait_icr_idle(void)
+{
+	return 0;
+}
+
 static const struct apic apic_numachip1 __refconst = {
 	.name = "NumaConnect system",
 	.probe = numachip1_probe,
@@ -262,8 +288,8 @@ static const struct apic apic_numachip1 __refconst = {
 	.eoi_write = native_apic_mem_write,
 	.icr_read = native_apic_icr_read,
 	.icr_write = native_apic_icr_write,
-	.wait_icr_idle = native_apic_wait_icr_idle,
-	.safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+	.wait_icr_idle = numachip_apic_wait_icr_idle,
+	.safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip1);
@@ -313,8 +339,8 @@ static const struct apic apic_numachip2 __refconst = {
 	.eoi_write = native_apic_mem_write,
 	.icr_read = native_apic_icr_read,
 	.icr_write = native_apic_icr_write,
-	.wait_icr_idle = native_apic_wait_icr_idle,
-	.safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+	.wait_icr_idle = numachip_apic_wait_icr_idle,
+	.safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip2);
--
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
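The routing test in numachip_send_IPI_one() can be tried outside the kernel. The sketch below is a hedged illustration, not the driver code: it assumes, as the patch's NUMACHIP_LAPIC_BITS value of 8 suggests, that the low 8 bits of an APIC ID identify a core within one server and the remaining bits identify the server. The helper name ipi_is_local() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low APIC ID bits that select a core within one server; 8 in the patch. */
#define LAPIC_BITS 8

/* True when source and destination APIC IDs agree in every bit above
 * LAPIC_BITS, i.e. the XOR of the two IDs has no non-local bits set. */
static bool ipi_is_local(uint32_t src_apicid, uint32_t dst_apicid)
{
	return ((src_apicid ^ dst_apicid) >> LAPIC_BITS) == 0;
}
```

With these assumptions, IDs 0x0105 and 0x01ff differ only in the local part and would take the local APIC path, whereas 0x0105 and 0x0205 differ in the non-local part and would go through the Numachip ICR write. XOR followed by a shift avoids masking both IDs separately.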
Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
Hi Nate,

On Wed, Jun 24, 2015 at 11:50 PM, Nathan Zimmer wrote:
> My apologies for taking so long to get back to this.
>
> I think I did locate two potential sources of slowdown. One is the
> set_cpus_allowed_ptr as I have noted previously. However I only notice
> that on the very largest boxes. I did cobble together a patch that
> seems to help.
>
> The other spot I suspect is the zone lock in free_one_page. I haven't
> been able to give that much thought as of yet though.
>
> Daniel do you mind seeing if the attached patch helps out?

Just got back from travel, so apologies for the delays.

The patch doesn't mitigate the increasing initialisation time; summing
the per-node times for an accurate measure, there was a total of
171.48s before the patch and 175.23s after. I double-checked and got
similar data.

Thanks,
  Daniel
Re: lockup when C1E and high-resolution timers enabled
On 14 June 2015 at 22:49, Christoph Fritz wrote:
> On Sun, 2015-06-14 at 15:54 +0800, Daniel J Blueman wrote:
>> As a workaround, you can probably just disable message triggered C1E
>> (see the BKDG p399 [1]):
>>
>> val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
>
> mhm... $(setpci -s 00:18.4 0xd4.l) returns zero, this can't be right.

Ahh, try:

val=0x$(setpci -s 00:18.3 0xd4.l) # read D18F3xD4
val=$((val & ~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1603 0xd4.l=$(printf %x $val) # write back

--
Daniel J Blueman
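For anyone wanting to sanity-check the register arithmetic in the workaround before writing to hardware, the read-modify-write step amounts to clearing a single bit in a 32-bit value. The C sketch below shows just that step; the bit position 13 (MTC1eEn) is taken from the message, while the helper name clear_mtc1e_en() is made up for illustration and nothing here touches PCI configuration space.

```c
#include <stdint.h>

/* Bit position of MTC1eEn in D18F3xD4, as given in the message. */
#define MTC1E_EN_BIT 13

/* Return the register value with MTC1eEn cleared and all other bits kept. */
static uint32_t clear_mtc1e_en(uint32_t val)
{
	return val & ~(UINT32_C(1) << MTC1E_EN_BIT);
}
```

A value that already has bit 13 clear passes through unchanged, so applying the workaround twice should be harmless as far as this arithmetic goes.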
Re: lockup when C1E and high-resolution timers enabled
On 14 June 2015 at 12:39, Christoph Fritz wrote:
> On Sun, 2015-06-14 at 11:13 +0800, Daniel J Blueman wrote:
>> On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
>> > Hi,
>> >
>> > on following computer configuration, I do get hard lockup under heavy
>> > IO-Load (using rsync):
>> >
>> > - CONFIG_HIGH_RES_TIMERS=y
>> > - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
>> > - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
>> > - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
>> > - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>> >
>> > Tests:
>> > - add kernel parameter "idle=halt" -> system runs fine
>> > - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
>> > - change motherboard and disable C1E -> system runs fine
>> > - change CPU to AMD Phenom II X6 Processor -> system runs fine
>> [..]
>>
>> C1E disconnects HyperTransport links when all cores enter C1 (halt)
>> for a period of time; this is all at the platform level, so isn't due
>> to the kernel. The AMD AGESA code which controls the setup of this
>> mechanism is updated in the F2g BIOS:
>> http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios
>>
>> Did you try both BIOS releases with defaults?
>
> Yes, rechecked both versions: Same bad behaviour.
>
>> If still issues, also try with the current family 10h microcode from
>> http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2
>
> Don't you mean family 15h for 'AMD FX(tm)-8350' ?
>
> already using latest microcode:

As a workaround, you can probably just disable message triggered C1E
(see the BKDG p399 [1]):

val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
val=$((val & ~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1604 0xd4.l=$(printf %x $val) # write back

The chipset setup and behaviour is quite complex, so it's likely
Gigabyte haven't done their homework. The alternative is coreboot of
course.

Thanks,
  Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
--
Daniel J Blueman
Re: lockup when C1E and high-resolution timers enabled
On 14 June 2015 at 22:49, Christoph Fritz chf.fr...@googlemail.com wrote:
> On Sun, 2015-06-14 at 15:54 +0800, Daniel J Blueman wrote:
>> As a workaround, you can probably just disable message triggered C1E
>> (see the BKDG p399 [1]):
>>
>> val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
>
> mhm... $(setpci -s 00:18.4 0xd4.l) returns zero, this can't be right.

Ahh, try:

val=0x$(setpci -s 00:18.3 0xd4.l) # read D18F3xD4
val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1603 0xd4.l=$(printf %x $val) # write back

--
Daniel J Blueman
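The masking step in the workaround can be checked on its own, without touching PCI configuration space. What follows is a minimal sketch, assuming a POSIX shell; `clear_bit` is a hypothetical helper name (it is not part of pciutils) and the sample value 0x2f00 is made up for illustration.

```shell
#!/bin/sh
# clear_bit VALUE BIT: print VALUE (hex, with 0x prefix) with bit BIT cleared.
# Hypothetical helper for illustration only; it performs the arithmetic
# and never reads or writes any hardware register.
clear_bit() {
    printf '%x\n' $(( $1 & ~(1 << $2) ))
}

# Made-up register value with bit 13 (0x2000) set.
clear_bit 0x2f00 13    # prints f00

# A value that already has bit 13 clear comes back unchanged.
clear_bit 0x0f00 13    # prints f00
```

On real hardware the input would be the value returned by the first setpci command in the message, and the output would be handed to the last one.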
Re: lockup when C1E and high-resolution timers enabled
On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
> Hi,
>
> on following computer configuration, I do get hard lockup under heavy
> IO-Load (using rsync):
>
> - CONFIG_HIGH_RES_TIMERS=y
> - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
> - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
> - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
> - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>
> Tests:
> - add kernel parameter "idle=halt" -> system runs fine
> - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
> - change motherboard and disable C1E -> system runs fine
> - change CPU to AMD Phenom II X6 Processor -> system runs fine
[..]

C1E disconnects HyperTransport links when all cores enter C1 (halt)
for a period of time; this is all at the platform level, so isn't due
to the kernel. The AMD AGESA code which controls the setup of this
mechanism is updated in the F2g BIOS:
http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios

Did you try both BIOS releases with defaults?

If still issues, also try with the current family 10h microcode from
http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2

Thanks,
  Daniel
--
Daniel J Blueman
[PATCH -next] iommu: Fix build failure without INTEL_IOMMU
Fix Intel IOMMU build failure in linux-next when CONFIG_INTEL_IOMMU is
not enabled.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
---
 drivers/iommu/intel_irq_remapping.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 24f7a35..ec337e7 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -146,8 +146,10 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
 	set_64bit(&irte->low, irte_modified->low);
 	set_64bit(&irte->high, irte_modified->high);

+#ifdef CONFIG_INTEL_IOMMU
 	if (iommu->pre_enabled_ir)
 		__iommu_update_old_irte(iommu, index);
+#endif

 	__iommu_flush_cache(iommu, irte, sizeof(*irte));

@@ -210,8 +212,10 @@ static int clear_entries(struct irq_2_iommu *irq_iommu)
 	bitmap_release_region(iommu->ir_table->bitmap, index,
 			      irq_iommu->irte_mask);

+#ifdef CONFIG_INTEL_IOMMU
 	if (iommu->pre_enabled_ir)
 		__iommu_update_old_irte(iommu, -1);
+#endif

 	return qi_flush_iec(iommu, index, irq_iommu->irte_mask);
 }
@@ -650,6 +654,7 @@ static int __init intel_enable_irq_remapping(void)
 	 * Setup Interrupt-remapping for all the DRHD's now.
 	 */
 	for_each_iommu(iommu, drhd) {
+#ifdef CONFIG_INTEL_IOMMU
 		if (iommu->pre_enabled_ir) {
 			unsigned long long q;

@@ -660,6 +665,7 @@ static int __init intel_enable_irq_remapping(void)
 				INTR_REMAP_TABLE_ENTRIES*sizeof(struct irte));
 			__iommu_load_old_irte(iommu);
 		} else
+#endif
 			iommu_set_irq_remapping(iommu, eim);

 		setup = true;
@@ -1374,6 +1380,7 @@ static int __iommu_update_old_irte(struct intel_iommu *iommu, int index)

 static void iommu_check_pre_ir_status(struct intel_iommu *iommu)
 {
+#ifdef CONFIG_INTEL_IOMMU
 	u32 sts;

 	sts = readl(iommu->reg + DMAR_GSTS_REG);
@@ -1381,4 +1388,5 @@ static void iommu_check_pre_ir_status(struct intel_iommu *iommu)
 		pr_info("IR is enabled prior to OS.\n");
 		iommu->pre_enabled_ir = 1;
 	}
+#endif
 }
--
2.1.4
Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
--
Daniel J Blueman
Principal Software Engineer, Numascale

On Sat, May 23, 2015 at 1:14 AM, Waiman Long wrote:
> On 05/22/2015 05:33 AM, Mel Gorman wrote:
>> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman wrote:
>>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman wrote:
>>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>>> I am just noticed a hang on my largest box.
>>>>>> I can only reproduce with large core counts, if I turn down the
>>>>>> number of cpus it doesn't have an issue.
>>>>>
>>>>> Odd. The number of core counts should make little a difference as
>>>>> only one CPU per node should be in use. Does sysrq+t give any
>>>>> indication how or where it is hanging?
>>>>
>>>> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
>>>> this suggests either lock contention or O(n) behaviour.
>>>>
>>>> Nathan, can you check with this ordering of patches from Andrew's
>>>> cache [2]? I was getting hanging until I a found them all.
>>>>
>>>> I'll follow up with timing data.
>>>
>>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:
>>>
>>> 1. 2086s with patches 01-19 [1]
>>>
>>> 2. 2026s adding "Take into account that large system caches scale
>>> linearly with memory", which has:
>>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 3. 2442s fixing to:
>>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 4. 2064s adjusting minimum and shift to:
>>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 5. 1934s adjusting minimum and shift to:
>>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 6. 930s #5 with the non-temporal PMD init patch I had earlier
>>> proposed (I'll pursue separately)
>>>
>>> The scaling patch isn't in -mm.
>>
>> That patch was superceded by "mm: meminit: finish initialisation of
>> struct pages before basic setup" and
>> "mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix"
>> so that's ok.
>>
>> FWIW, I think you should still go ahead with the non-temporal patches
>> because there is potential benefit there other than the initialisation.
>> If there was an arch-optional implementation of a non-termporal clear
>> then it would also be worth considering if __GFP_ZERO should use
>> non-temporal stores. At a greater stretch it would be worth considering
>> if kswapd freeing should zero pages to avoid a zero on the allocation
>> side in the general case as it would be more generally useful and a
>> stepping stone towards what the series "Sanitizing freed pages"
>> attempts.

Good tip Mel; I'll take a look when time allows and get some data,
though I guess it'll only be a win where the clearing is on a different
node than the allocation.

> I think the non-temporal patch benefits mainly AMD systems. I have
> tried the patch on both DragonHawk and it actually made it boot up a
> little bit slower. I think the Intel optimized "rep stosb" instruction
> (used in memset) is performing well. I had done similar test on zero
> page code and the performance gain was non-conclusive.

I suspect 'rep stosb' on modern Intel hardware can write whole
cachelines atomically, avoiding the RMW, or that the read part of the
RMW is optimally prefetched. Open-coding it just can't reach the same
level of pipeline saturation that the microcode can.

Daniel
Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I am just noticed a hang on my largest box.
>>> I can only reproduce with large core counts, if I turn down the
>>> number of cpus it doesn't have an issue.
>>
>> Odd. The number of core counts should make little a difference as
>> only one CPU per node should be in use. Does sysrq+t give any
>> indication how or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
> this suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's
> cache [2]? I was getting hanging until I a found them all.
>
> I'll follow up with timing data.

7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:

1. 2086s with patches 01-19 [1]

2. 2026s adding "Take into account that large system caches scale
linearly with memory", which has:
min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

3. 2442s fixing to:
max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

4. 2064s adjusting minimum and shift to:
max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

5. 1934s adjusting minimum and shift to:
max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

6. 930s #5 with the non-temporal PMD init patch I had earlier proposed
(I'll pursue separately)

The scaling patch isn't in -mm. #5 tests out nice on a bunch of other
AMD systems, 64GB and up, so:

Tested-by: Daniel J Blueman <dan...@numascale.com>.

Fine work, Mel!
Daniel -- [1] http://ozlabs.org/~akpm/mmots/broken-out/memblock-introduce-a-for_each_reserved_mem_region-iterator.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-move-page-initialization-into-a-separate-function.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-only-set-page-reserved-in-the-memblock-region.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-make-__early_pfn_to_nid-smp-safe-and-introduce-meminit_pfn_in_nid.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-remaining-struct-pages-in-parallel-with-kswapd.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-minimise-number-of-pfn-page-lookups-during-initialisation.patch http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-enable-deferred-struct-page-initialisation-on-x86-64.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-free-pages-in-large-chunks-where-possible.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-remove-mminit_verify_page_links.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set-fix.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix.patch 
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init-fix.patch http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch
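To make the four min()/max() variants in the timing list above concrete, the expressions can be evaluated for a single node. This is a rough sketch in shell arithmetic, not taken from the patches themselves. It assumes 4KiB pages (PAGE_SHIFT=12) and uses 7732961 as node_spanned_pages; that figure is the per-node page count printed in the boot log quoted elsewhere in this thread and may differ from the real node_spanned_pages. The helper names `min` and `max` are made up here.

```shell
#!/bin/sh
PAGE_SHIFT=12        # assumption: 4KiB pages
SPANNED=7732961      # assumption: per-node page count taken from the boot log

# Hypothetical helpers: print the smaller or larger of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }
max() { if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Variants 2 to 5 from the timing list, evaluated in pages.
v2=$(min $(( 2 << (30 - PAGE_SHIFT) )) $(( SPANNED >> 3 )))
v3=$(max $(( 2 << (30 - PAGE_SHIFT) )) $(( SPANNED >> 3 )))
v4=$(max $(( 512 << (20 - PAGE_SHIFT) )) $(( SPANNED >> 8 )))
v5=$(max $(( 128 << (20 - PAGE_SHIFT) )) $(( SPANNED >> 8 )))

echo "variant 2: $v2 pages"   # 524288
echo "variant 3: $v3 pages"   # 966620
echo "variant 4: $v4 pages"   # 131072
echo "variant 5: $v5 pages"   # 32768
```

With these inputs, variant 3 is decided by the shifted node size, while variants 4 and 5 are decided by the fixed minimum (512MiB and 128MiB worth of 4KiB pages respectively), because 7732961 >> 8 is only 30206 pages.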
Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
On Thu, May 14, 2015 at 12:31 AM, Mel Gorman wrote:
> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>> I am just noticed a hang on my largest box.
>> I can only reproduce with large core counts, if I turn down the
>> number of cpus it doesn't have an issue.
>
> Odd. The number of core counts should make little a difference as
> only one CPU per node should be in use. Does sysrq+t give any
> indication how or where it is hanging?

I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
this suggests either lock contention or O(n) behaviour.

Nathan, can you check with this ordering of patches from Andrew's
cache [2]? I was getting hanging until I a found them all.

I'll follow up with timing data.

Thanks,
  Daniel

-- [1]

[ 73.076117] node 2 initialised, 7732961 pages in 1060ms
[ 73.077184] node 38 initialised, 7732961 pages in 1060ms
[ 73.079626] node 146 initialised, 7732961 pages in 1050ms
[ 73.093488] node 62 initialised, 7732961 pages in 1080ms
[ 73.091557] node 3 initialised, 7732962 pages in 1080ms
[ 73.10] node 186 initialised, 7732961 pages in 1040ms
[ 73.095731] node 4 initialised, 7732961 pages in 1080ms
[ 73.090289] node 50 initialised, 7732961 pages in 1080ms
[ 73.094005] node 158 initialised, 7732961 pages in 1050ms
[ 73.095421] node 159 initialised, 7732962 pages in 1050ms
[ 73.090324] node 52 initialised, 7732961 pages in 1080ms
[ 73.099056] node 5 initialised, 7732962 pages in 1080ms
[ 73.090116] node 160 initialised, 7732961 pages in 1050ms
[ 73.161051] node 157 initialised, 7732962 pages in 1120ms
[ 73.193565] node 161 initialised, 7732962 pages in 1160ms
[ 73.212456] node 26 initialised, 7732961 pages in 1200ms
[ 73.222904] node 0 initialised, 6686488 pages in 1210ms
[ 73.242165] node 140 initialised, 7732961 pages in 1210ms
[ 73.254230] node 156 initialised, 7732961 pages in 1220ms
[ 73.284634] node 1 initialised, 7732962 pages in 1270ms
[ 73.305301] node 141 initialised, 7732962 pages in 1280ms
[ 73.322845] node 28 initialised, 7732961 pages in 1310ms
[ 73.321757] node 142 initialised, 7732961 pages in 1290ms
[ 73.327677] node 138 initialised, 7732961 pages in 1300ms
[ 73.413597] node 176 initialised, 7732961 pages in 1370ms
[ 73.42] node 139 initialised, 7732962 pages in 1420ms
[ 73.475356] node 143 initialised, 7732962 pages in 1440ms
[ 73.547202] node 32 initialised, 7732961 pages in 1530ms
[ 73.579591] node 104 initialised, 7732961 pages in 1560ms
[ 73.618065] node 174 initialised, 7732961 pages in 1570ms
[ 73.624918] node 178 initialised, 7732961 pages in 1580ms
[ 73.649024] node 175 initialised, 7732962 pages in 1610ms
[ 73.654110] node 105 initialised, 7732962 pages in 1630ms
[ 73.670589] node 106 initialised, 7732961 pages in 1650ms
[ 73.739682] node 102 initialised, 7732961 pages in 1720ms
[ 73.769639] node 86 initialised, 7732961 pages in 1750ms
[ 73.775573] node 44 initialised, 7732961 pages in 1760ms
[ 73.772955] node 177 initialised, 7732962 pages in 1740ms
[ 73.804390] node 34 initialised, 7732961 pages in 1790ms
[ 73.819370] node 30 initialised, 7732961 pages in 1810ms
[ 73.847882] node 98 initialised, 7732961 pages in 1830ms
[ 73.867545] node 33 initialised, 7732962 pages in 1860ms
[ 73.877964] node 107 initialised, 7732962 pages in 1860ms
[ 73.906256] node 103 initialised, 7732962 pages in 1880ms
[ 73.945581] node 100 initialised, 7732961 pages in 1930ms
[ 73.947024] node 96 initialised, 7732961 pages in 1930ms
[ 74.186208] node 116 initialised, 7732961 pages in 2170ms
[ 74.220838] node 68 initialised, 7732961 pages in 2210ms
[ 74.252341] node 46 initialised, 7732961 pages in 2240ms
[ 74.274795] node 118 initialised, 7732961 pages in 2260ms
[ 74.337544] node 14 initialised, 7732961 pages in 2320ms
[ 74.350819] node 22 initialised, 7732961 pages in 2340ms
[ 74.350332] node 69 initialised, 7732962 pages in 2340ms
[ 74.362683] node 211 initialised, 7732962 pages in 2310ms
[ 74.360617] node 70 initialised, 7732961 pages in 2340ms
[ 74.369137] node 66 initialised, 7732961 pages in 2360ms
[ 74.378242] node 115 initialised, 7732962 pages in 2360ms
[ 74.404221] node 213 initialised, 7732962 pages in 2350ms
[ 74.420901] node 210 initialised, 7732961 pages in 2370ms
[ 74.430049] node 35 initialised, 7732962 pages in 2420ms
[ 74.436007] node 48 initialised, 7732961 pages in 2420ms
[ 74.480595] node 71 initialised, 7732962 pages in 2460ms
[ 74.485700] node 67 initialised, 7732962 pages in 2480ms
[ 74.502627] node 31 initialised, 7732962 pages in 2490ms
[ 74.542220] node 16 initialised, 7732961 pages in 2530ms
[ 74.547936] node 128 initialised, 7732961 pages in 2520ms
[ 74.634374] node 214 initialised, 7732961 pages in 2580ms
[ 74.654389] node 88 initialised, 7732961 pages in 2630ms
[ 74.722833] node 117 initialised, 7732962 pages in 2700ms
[ 74.735002] node 148 initialised, 7732961 pages in 2700ms
[ 74.742725]
irq_work_sync hangs
[...]
+		if (!(work->flags & ~IRQ_WORK_LAZY))
+			pr_err("run id %lu start flags=0x%lx\n", work->id, work->flags);
 		work->func(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
 		 * no-one else claimed it meanwhile.
 		 */
 		(void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);
+		if (!(work->flags & IRQ_WORK_LAZY))
+			pr_err("run id %lu end flags=0x%lx\n", work->id, work->flags);
 	}
 }

@@ -190,7 +205,13 @@ void irq_work_sync(struct irq_work *work)
 {
 	WARN_ON_ONCE(irqs_disabled());

+	if (!(work->flags & ~IRQ_WORK_LAZY))
+		pr_err("sync id %lu start flags=0x%lx\n", work->id, work->flags);
+
 	while (work->flags & IRQ_WORK_BUSY)
 		cpu_relax();
+
+	if (!(work->flags & ~IRQ_WORK_LAZY))
+		pr_err("sync id %lu end\n", work->id);
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
--
Daniel J Blueman
Principal Software Engineer, Numascale AS
Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup
On Thu, May 14, 2015 at 12:31 AM, Mel Gorman mgor...@suse.de wrote:
> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>> I just noticed a hang on my largest box. I can only reproduce with
>> large core counts; if I turn down the number of cpus it doesn't have
>> an issue.
>
> Odd. The core count should make little difference, as only one CPU
> per node should be in use. Does sysrq+t give any indication how or
> where it is hanging?

I was seeing the same behaviour of 1000ms increasing to 5500ms [1]; this suggests either lock contention or O(n) behaviour.

Nathan, can you check with this ordering of patches from Andrew's cache [2]? I was getting hangs until I found them all.

I'll follow up with timing data.

Thanks,
  Daniel

-- [1]

[ 73.076117] node 2 initialised, 7732961 pages in 1060ms
[ 73.077184] node 38 initialised, 7732961 pages in 1060ms
[ 73.079626] node 146 initialised, 7732961 pages in 1050ms
[ 73.093488] node 62 initialised, 7732961 pages in 1080ms
[ 73.091557] node 3 initialised, 7732962 pages in 1080ms
[ 73.10] node 186 initialised, 7732961 pages in 1040ms
[ 73.095731] node 4 initialised, 7732961 pages in 1080ms
[ 73.090289] node 50 initialised, 7732961 pages in 1080ms
[ 73.094005] node 158 initialised, 7732961 pages in 1050ms
[ 73.095421] node 159 initialised, 7732962 pages in 1050ms
[ 73.090324] node 52 initialised, 7732961 pages in 1080ms
[ 73.099056] node 5 initialised, 7732962 pages in 1080ms
[ 73.090116] node 160 initialised, 7732961 pages in 1050ms
[ 73.161051] node 157 initialised, 7732962 pages in 1120ms
[ 73.193565] node 161 initialised, 7732962 pages in 1160ms
[ 73.212456] node 26 initialised, 7732961 pages in 1200ms
[ 73.222904] node 0 initialised, 6686488 pages in 1210ms
[ 73.242165] node 140 initialised, 7732961 pages in 1210ms
[ 73.254230] node 156 initialised, 7732961 pages in 1220ms
[ 73.284634] node 1 initialised, 7732962 pages in 1270ms
[ 73.305301] node 141 initialised, 7732962 pages in 1280ms
[ 73.322845] node 28 initialised, 7732961 pages in 1310ms
[ 73.321757] node 142 initialised, 7732961 pages in 1290ms
[ 73.327677] node 138 initialised, 7732961 pages in 1300ms
[ 73.413597] node 176 initialised, 7732961 pages in 1370ms
[ 73.42] node 139 initialised, 7732962 pages in 1420ms
[ 73.475356] node 143 initialised, 7732962 pages in 1440ms
[ 73.547202] node 32 initialised, 7732961 pages in 1530ms
[ 73.579591] node 104 initialised, 7732961 pages in 1560ms
[ 73.618065] node 174 initialised, 7732961 pages in 1570ms
[ 73.624918] node 178 initialised, 7732961 pages in 1580ms
[ 73.649024] node 175 initialised, 7732962 pages in 1610ms
[ 73.654110] node 105 initialised, 7732962 pages in 1630ms
[ 73.670589] node 106 initialised, 7732961 pages in 1650ms
[ 73.739682] node 102 initialised, 7732961 pages in 1720ms
[ 73.769639] node 86 initialised, 7732961 pages in 1750ms
[ 73.775573] node 44 initialised, 7732961 pages in 1760ms
[ 73.772955] node 177 initialised, 7732962 pages in 1740ms
[ 73.804390] node 34 initialised, 7732961 pages in 1790ms
[ 73.819370] node 30 initialised, 7732961 pages in 1810ms
[ 73.847882] node 98 initialised, 7732961 pages in 1830ms
[ 73.867545] node 33 initialised, 7732962 pages in 1860ms
[ 73.877964] node 107 initialised, 7732962 pages in 1860ms
[ 73.906256] node 103 initialised, 7732962 pages in 1880ms
[ 73.945581] node 100 initialised, 7732961 pages in 1930ms
[ 73.947024] node 96 initialised, 7732961 pages in 1930ms
[ 74.186208] node 116 initialised, 7732961 pages in 2170ms
[ 74.220838] node 68 initialised, 7732961 pages in 2210ms
[ 74.252341] node 46 initialised, 7732961 pages in 2240ms
[ 74.274795] node 118 initialised, 7732961 pages in 2260ms
[ 74.337544] node 14 initialised, 7732961 pages in 2320ms
[ 74.350819] node 22 initialised, 7732961 pages in 2340ms
[ 74.350332] node 69 initialised, 7732962 pages in 2340ms
[ 74.362683] node 211 initialised, 7732962 pages in 2310ms
[ 74.360617] node 70 initialised, 7732961 pages in 2340ms
[ 74.369137] node 66 initialised, 7732961 pages in 2360ms
[ 74.378242] node 115 initialised, 7732962 pages in 2360ms
[ 74.404221] node 213 initialised, 7732962 pages in 2350ms
[ 74.420901] node 210 initialised, 7732961 pages in 2370ms
[ 74.430049] node 35 initialised, 7732962 pages in 2420ms
[ 74.436007] node 48 initialised, 7732961 pages in 2420ms
[ 74.480595] node 71 initialised, 7732962 pages in 2460ms
[ 74.485700] node 67 initialised, 7732962 pages in 2480ms
[ 74.502627] node 31 initialised, 7732962 pages in 2490ms
[ 74.542220] node 16 initialised, 7732961 pages in 2530ms
[ 74.547936] node 128 initialised, 7732961 pages in 2520ms
[ 74.634374] node 214 initialised, 7732961 pages in 2580ms
[ 74.654389] node 88 initialised, 7732961 pages in 2630ms
[ 74.722833] node 117 initialised, 7732962 pages in 2700ms
[ 74.735002] node 148 initialised, 7732961 pages in 2700ms
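To put numbers on the slowdown in [1], the per-node lines can be parsed and turned into a throughput figure. The helper below is a hypothetical sketch, not part of the patch series; it assumes only the "node N initialised, P pages in Tms" format seen in the log.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One "node N initialised, P pages in Tms" record. */
struct node_init {
	int node;
	unsigned long pages;
	unsigned long ms;
};

/*
 * Parse a dmesg line such as
 *   "[ 73.076117] node 2 initialised, 7732961 pages in 1060ms"
 * The timestamp is skipped by searching for "node ". Returns 1 on success.
 */
static int parse_node_init(const char *line, struct node_init *out)
{
	const char *p = strstr(line, "node ");

	if (!p)
		return 0;
	return sscanf(p, "node %d initialised, %lu pages in %lums",
		      &out->node, &out->pages, &out->ms) == 3;
}

/* Pages initialised per millisecond, rounded down. */
static unsigned long pages_per_ms(const struct node_init *n)
{
	assert(n->ms > 0);
	return n->pages / n->ms;
}
```

Applied to the first and last lines shown, the same 7732961 pages take 1060ms on node 2 and 2700ms on node 148, i.e. roughly 7295 versus 2864 pages/ms. Since every node does the same amount of work, a drop like that fits the lock contention or O(n) suspicion rather than a difference in node size.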
Re: [Patch v3] x86, irq: Allocate CPU vectors from device local CPUs if possible
On Thu, May 7, 2015 at 10:53 AM, Jiang Liu wrote:
> On NUMA systems, an IO device may be associated with a NUMA node.
> It may improve IO performance to allocate resources, such as memory
> and interrupts, from device local node.
>
> This patch introduces a mechanism to support CPU vector allocation
> policies. It tries to allocate CPU vectors from CPUs on device local
> node first, and then fallback to all online(global) CPUs.
>
> This mechanism may be used to support NumaConnect systems to allocate
> CPU vectors from device local node.
>
> Signed-off-by: Jiang Liu
> Cc: Daniel J Blueman
> ---
> Hi Thomas,
> 	I feel this should be simplest version now:)
> Thanks!
> Gerry
> ---
>  arch/x86/kernel/apic/vector.c | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 1c7dd42b98c1..eb65c6b98de0 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -210,6 +210,18 @@ static int assign_irq_vector(int irq, struct apic_chip_data *data,
>  	return err;
>  }
>  
> +static int assign_irq_vector_policy(int irq, int node,
> +				    struct apic_chip_data *data,
> +				    struct irq_alloc_info *info)
> +{
> +	if (info && info->mask)
> +		return assign_irq_vector(irq, data, info->mask);
> +	if (node != NUMA_NO_NODE &&
> +	    assign_irq_vector(irq, data, cpumask_of_node(node)) == 0)
> +		return 0;
> +	return assign_irq_vector(irq, data, apic->target_cpus());
> +}
> +
>  static void clear_irq_vector(int irq, struct apic_chip_data *data)
>  {
>  	int cpu, vector;
> @@ -258,12 +270,6 @@ void copy_irq_alloc_info(struct irq_alloc_info *dst, struct irq_alloc_info *src)
>  		memset(dst, 0, sizeof(*dst));
>  }
>  
> -static inline const struct cpumask *
> -irq_alloc_info_get_mask(struct irq_alloc_info *info)
> -{
> -	return (!info || !info->mask) ? apic->target_cpus() : info->mask;
> -}
> -
>  static void x86_vector_free_irqs(struct irq_domain *domain,
>  				 unsigned int virq, unsigned int nr_irqs)
>  {
> @@ -289,7 +295,6 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  {
>  	struct irq_alloc_info *info = arg;
>  	struct apic_chip_data *data;
> -	const struct cpumask *mask;
>  	struct irq_data *irq_data;
>  	int i, err;
>  
> @@ -300,7 +305,6 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
>  		return -ENOSYS;
>  
> -	mask = irq_alloc_info_get_mask(info);
>  	for (i = 0; i < nr_irqs; i++) {
>  		irq_data = irq_domain_get_irq_data(domain, virq + i);
>  		BUG_ON(!irq_data);
> @@ -318,7 +322,8 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  		irq_data->chip = &lapic_controller;
>  		irq_data->chip_data = data;
>  		irq_data->hwirq = virq + i;
> -		err = assign_irq_vector(virq, data, mask);
> +		err = assign_irq_vector_policy(virq, irq_data->node, data,
> +					       info);
>  		if (err)
>  			goto error;
>  	}

Testing x86/tip/apic with this patch on a 192 core/24 node NumaConnect system, all the PCIe bridge, GPU, SATA, NIC etc. interrupts are allocated on the correct NUMA nodes, so it works great.

Tested-by: Daniel J Blueman

Many thanks!
  Daniel
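The ordering in the quoted assign_irq_vector_policy() (explicit mask first, then device-local node, then all online CPUs) can be exercised outside the kernel with a toy model. Everything below is hypothetical scaffolding: CPU masks are plain 64-bit words, `struct toy_system` stands in for the APIC and NUMA state, and the helpers return the chosen CPU rather than the kernel's 0-on-success convention.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

/* Toy system state; bit i of each mask refers to CPU i. */
struct toy_system {
	uint64_t free_cpus;		/* CPUs that still have a free vector */
	uint64_t online_cpus;		/* stands in for apic->target_cpus() */
	const uint64_t *node_cpus;	/* stands in for cpumask_of_node() */
	int nr_nodes;
};

/*
 * Toy assign_irq_vector(): pick the lowest online CPU in 'mask' that has
 * a free vector. Returns the CPU number, or -1 if there is none.
 */
static int toy_assign(const struct toy_system *s, uint64_t mask)
{
	uint64_t ok = mask & s->free_cpus & s->online_cpus;

	for (int cpu = 0; cpu < 64; cpu++)
		if (ok & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/*
 * Same ordering as the patch: an explicit mask is used as-is with no
 * fallback; otherwise try the device's node, then every online CPU.
 */
static int toy_assign_policy(const struct toy_system *s, int node,
			     uint64_t explicit_mask)
{
	assert(node == NO_NODE || (node >= 0 && node < s->nr_nodes));

	if (explicit_mask)
		return toy_assign(s, explicit_mask);
	if (node != NO_NODE) {
		int cpu = toy_assign(s, s->node_cpus[node]);

		if (cpu >= 0)
			return cpu;
	}
	return toy_assign(s, s->online_cpus);
}
```

With two nodes of four CPUs each and free vectors only on node 1, a device on node 1 lands on CPU 4, and a device on node 0 falls back to the global mask and also gets CPU 4. A caller passing an explicit mask confined to node 0 simply fails, which matches the patch returning the result of the explicit-mask attempt directly.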
Re: [PATCH 0/13] Parallel struct page initialisation v4
On Sat, May 2, 2015 at 4:52 PM, Daniel J Blueman wrote:
On Sat, May 2, 2015 at 8:09 AM, Waiman Long wrote:
On 05/01/2015 06:02 PM, Waiman Long wrote:

Bad news! I tried your patch on a 24-TB DragonHawk and got an out of memory panic. The kernel log messages were:
:
[ 80.126186] CPU 474: hi: 186, btch: 31 usd: 0
[ 80.131457] CPU 475: hi: 186, btch: 31 usd: 0
[ 80.136726] CPU 476: hi: 186, btch: 31 usd: 0
[ 80.141997] CPU 477: hi: 186, btch: 31 usd: 0
[ 80.147267] CPU 478: hi: 186, btch: 31 usd: 0
[ 80.152538] CPU 479: hi: 186, btch: 31 usd: 0
[ 80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
[ 80.157813] active_file:0 inactive_file:0 isolated_file:0
[ 80.157813] unevictable:0 dirty:0 writeback:0 unstable:0
[ 80.157813] free:209 slab_reclaimable:7 slab_unreclaimable:42986
[ 80.157813] mapped:0 shmem:0 pagetables:0 bounce:0
[ 80.157813] free_cma:0
[ 80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:14928kB kernel_stack:400kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.233475] lowmem_reserve[]: 0 0 0 0
[ 80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:101664kB kernel_stack:50176kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.281456] lowmem_reserve[]: 0 0 0 0
[ 80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.328958] lowmem_reserve[]: 0 0 0 0
[ 80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.377256] lowmem_reserve[]: 0 0 0 0
[ 80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.424764] lowmem_reserve[]: 0 0 0 0
[ 80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.472293] lowmem_reserve[]: 0 0 0 0
[ 80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.519803] lowmem_reserve[]: 0 0 0 0
[ 80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 80.567312] lowmem_reserve[]: 0 0 0 0
[ 80.571379] Node 6 Normal free:0kB
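The present: and managed: figures in the zone dump are easier to compare once converted. The two helpers below are a hypothetical sketch doing only unit arithmetic on values copied from the log; the interpretation is a reading of the numbers, not something stated in the thread.

```c
#include <assert.h>

/* Convert a kB figure from the zone dump to whole GiB (1 GiB = 2^20 kB). */
static unsigned long long kb_to_gib(unsigned long long kb)
{
	return kb >> 20;
}

/* Managed memory as parts per million of present memory, rounded down. */
static unsigned long long managed_ppm(unsigned long long present_kb,
				      unsigned long long managed_kb)
{
	assert(present_kb > 0);
	return managed_kb * 1000000ULL / present_kb;
}
```

For "Node 2 Normal", present:1610612736kB is 1536 GiB while managed:2097152kB is 2 GiB, about 0.13% of the node. Nodes 3 to 5 show the same pair of values, so at the time of the panic only a small fixed slice of each node appears to have been handed to the page allocator.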