Re: 4.14.44: BUG_ON(!list_empty(&sem->wait_list));

2018-06-04 Thread Daniel J Blueman
On 5 June 2018 at 05:47,   wrote:
>> -----Original Message-----
>> From: Daniel J Blueman [mailto:dan...@quora.org]
>> Sent: Thursday, May 31, 2018 9:21 PM
>> To: Linux Kernel; linux-a...@vger.kernel.org
>> Cc: Limonciello, Mario; Dominguez, Jared
>> Subject: 4.14.44: BUG_ON(!list_empty(&sem->wait_list));
>>
>> Plugging in a USB-C power source on my Dell XPS 9550 trips an ACPI
>> BUG_ON [1], reproducible with mainline 4.14.44, suggesting other
>> threads are waiting for semaphore acquisition due to
>> "BUG_ON(!list_empty(&sem->wait_list))".
>>
>> This is the current 1.7.0 BIOS with Ubuntu 18.04 userspace, plugging
>> in an LG 27UD88 (also with the current firmware) monitor USB-C
>> connection which apparently advertises 60W charging (x1,
>> PowerDelivery, DisplayPort alternative mode, data). The same issues
>> reproduce on a Dell Precision 5510 with Ubuntu 16.04, the shipped
>> kernel and 4.14.44.
>>
>> I can enable ACPI debugging if useful? Perhaps ACPI_DB_MUTEX or other
>> levels would be appropriate?
>
> I think most useful would be if this can still reproduce with 4.17.

Fair suggestion!

I can achieve 100% reproducibility of the same backtrace on a clean
Ubuntu 18.04 install with 4.17 mainline [1]:

1. disable grub 'quiet' parameter, disconnect charger and power off laptop to S5
2. power on laptop from S5
3. suspend via closing lid
4. resume by opening lid
5. connect LG 27UD88 via USB-C
6. wait 20s
7. disconnect LG 27UD88
8. run 'systemctl poweroff'
9. observe the same backtrace from acpi_os_delete_semaphore

I don't observe the issue when using an Apple 87W USB-C Power Adapter,
so it may reproduce on other monitors advertising USB-C DisplayPort
alternate mode.

Thanks,
  Daniel

[1] 
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17/linux-image-unsigned-4.17.0-041700-generic_4.17.0-041700.201806041953_amd64.deb

>> kernel BUG at /home/kernel/COD/linux/drivers/acpi/osl.c:1201
>> invalid opcode:  [#1] SMP PTI
>> Modules linked in: [...]
>> CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.14.44-041444-generic
>> #201805251612
>> Hardware name: Dell Inc. XPS 15 9550/, BIOS 1.7.0 02/23/2018
>> task: 9bc2ab6b9740 task.stack: b0ca80034000
>> RIP: 0010:acpi_os_delete_semaphore+0x6d/0x70
>> RSP: 0018:b0ca80037be8 EFLAGS: 00010283
>> RAX: b0ca83f8fc40 RBX: 9bc238b5dbe0 RCX: 
>> RDX: 9bc238b5dbe8 RSI:  RDI: 9bc238b5dbe0
>> RBP: 9bc2ad1c0990 R08: 9bc2bdc25f20 R09: 9bc29ee56300
>> R10: e03bd2796440 R11: 9bc2ad183fa0 R12: 9bc22f1321e0
>> R13: 0001 R14: 0001 R15: 9bc22f132eb0
>> FS: 7fc03886f940() GS:9bc2bdc0()
>> knlGS:
>> CS: 0010 DS:  ES:  CR0: 80050033
>> CR2: 7ffc645e70f8 CR3: 00049e120001 CR4: 003606f0
>> Call Trace:
>> acpi_ex_system_reset_event+0x3f/0x65
>> acpi_ex_opcode_1A_OT_0R+0x70/0xfa
>> acpi_ds_exec_end_op+0x15d/0x71b
>> acpi_ps_parse_loop+0x929/0x9d6
>> ? acpi_ds_result_push+0x82/0x1d2
>> acpi_ps_parse_aml+0x1a2/0x4af
>> acpi_ps_execute_method+0x1ef/0x2ab
>> acpi_ns_evaluate+0x2e4/0x41d
>> acpi_evaluate_object+0x1cb/0x38e
>> acpi_enter_sleep_state_prep+0xae/0x13a
>> acpi_sleep_prepare.part.2+0x2e/0x40
>> acpi_power_off_prepare+0xf/0x20
>> [38871.192536] kernel_power_off+0x42/0x70
>> SYSC_reboot+0x12f/0x210
>> ? handle_mm_fault+0xea/0x1e0
>> [38871.192586] ? do_writev+0x5e/0xf0
>> ? do_writev+0x5e/0xf0
>> do_syscall_64+0x6e/0x120
>> entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>> RIP: 0033:0x7fc03839b373
>> RSP: 002b:7ffc645e70f8 EFLAGS: 0202 ORIG_RAX: 00a9
>> RAX: ffda RBX: 4321fedc RCX: 7fc03839b373
>> RDX: 4321fedc RSI: 28121969 RDI: fee1dead
>> RBP: 7ffc645e7160 R08:  R09: 
>> R10: 00000002 R11: 0202 R12: 7ffc645e7168
>> R13:  R14: 001b0004 R15: 7ffc645e7458
>> Code: b8 00 04 00 00 48 c7 c1 c3 91 28 ab 48 c7 c2 20 91 28 ab be 0f
>> 04 00 00 bf 00 00 00 01 03 41 85 04 00 58 eb b0 b8 01 10 00 00 c3 
>> 0b 90 0f 1f 44 00 00 80 3d 74 c0 97 01 00 41 54 55 53 0f 84
>> RIP: acpi_os_delete_semaphore+0x6d/0x70 RSP: b0ca80037be8
-- 
Daniel J Blueman


4.14.44: BUG_ON(!list_empty(&sem->wait_list));

2018-05-31 Thread Daniel J Blueman
Plugging in a USB-C power source on my Dell XPS 9550 trips an ACPI
BUG_ON [1], reproducible with mainline 4.14.44. The failing check is
"BUG_ON(!list_empty(&sem->wait_list))", which suggests other threads
are still waiting to acquire the semaphore when it is deleted.

This is the current 1.7.0 BIOS with Ubuntu 18.04 userspace, plugging
in an LG 27UD88 (also with the current firmware) monitor USB-C
connection which apparently advertises 60W charging (x1,
PowerDelivery, DisplayPort alternative mode, data). The same issues
reproduce on a Dell Precision 5510 with Ubuntu 16.04, the shipped
kernel and 4.14.44.

I can enable ACPI debugging if useful? Perhaps ACPI_DB_MUTEX or other
levels would be appropriate?
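
For readers unfamiliar with the check, here is a minimal user-space
sketch of the invariant it enforces: a semaphore must not be destroyed
while a waiter is still queued on it. All names (toy_sem, toy_list,
toy_waiter and the helpers) are hypothetical stand-ins, not kernel API,
and where the kernel would BUG the sketch simply returns an error.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's circular doubly linked list. */
struct toy_list {
	struct toy_list *next, *prev;
};

/* Hypothetical stand-in for a counting semaphore with a waiter queue. */
struct toy_sem {
	unsigned int count;
	struct toy_list wait_list;
};

struct toy_waiter {
	struct toy_list node;
};

void toy_list_init(struct toy_list *head)
{
	head->next = head;
	head->prev = head;
}

bool toy_list_empty(const struct toy_list *head)
{
	return head->next == head;
}

void toy_list_add_tail(struct toy_list *node, struct toy_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

void toy_list_del(struct toy_list *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node;
	node->prev = node;
}

struct toy_sem *toy_sem_create(unsigned int count)
{
	struct toy_sem *sem = malloc(sizeof(*sem));

	if (!sem)
		return NULL;
	sem->count = count;
	toy_list_init(&sem->wait_list);
	return sem;
}

/*
 * Returns 0 after freeing the semaphore, or -1 (leaving it intact) if a
 * waiter is still queued; the latter is the situation the BUG_ON catches.
 */
int toy_sem_delete(struct toy_sem *sem)
{
	if (!toy_list_empty(&sem->wait_list))
		return -1;
	free(sem);
	return 0;
}
```

In this model, deleting a semaphore with one queued toy_waiter fails,
and succeeds once that waiter has been removed from the list.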

Thanks,
  Daniel

-- [1]

kernel BUG at /home/kernel/COD/linux/drivers/acpi/osl.c:1201
invalid opcode:  [#1] SMP PTI
Modules linked in: [...]
CPU: 0 PID: 1 Comm: systemd-shutdow Not tainted 4.14.44-041444-generic
#201805251612
Hardware name: Dell Inc. XPS 15 9550/, BIOS 1.7.0 02/23/2018
task: 9bc2ab6b9740 task.stack: b0ca80034000
RIP: 0010:acpi_os_delete_semaphore+0x6d/0x70
RSP: 0018:b0ca80037be8 EFLAGS: 00010283
RAX: b0ca83f8fc40 RBX: 9bc238b5dbe0 RCX: 
RDX: 9bc238b5dbe8 RSI:  RDI: 9bc238b5dbe0
RBP: 9bc2ad1c0990 R08: 9bc2bdc25f20 R09: 9bc29ee56300
R10: e03bd2796440 R11: 9bc2ad183fa0 R12: 9bc22f1321e0
R13: 0001 R14: 0001 R15: 9bc22f132eb0
FS: 7fc03886f940() GS:9bc2bdc0() knlGS:
CS: 0010 DS:  ES:  CR0: 80050033
CR2: 7ffc645e70f8 CR3: 00049e120001 CR4: 003606f0
Call Trace:
acpi_ex_system_reset_event+0x3f/0x65
acpi_ex_opcode_1A_OT_0R+0x70/0xfa
acpi_ds_exec_end_op+0x15d/0x71b
acpi_ps_parse_loop+0x929/0x9d6
? acpi_ds_result_push+0x82/0x1d2
acpi_ps_parse_aml+0x1a2/0x4af
acpi_ps_execute_method+0x1ef/0x2ab
acpi_ns_evaluate+0x2e4/0x41d
acpi_evaluate_object+0x1cb/0x38e
acpi_enter_sleep_state_prep+0xae/0x13a
acpi_sleep_prepare.part.2+0x2e/0x40
acpi_power_off_prepare+0xf/0x20
[38871.192536] kernel_power_off+0x42/0x70
SYSC_reboot+0x12f/0x210
? handle_mm_fault+0xea/0x1e0
[38871.192586] ? do_writev+0x5e/0xf0
? do_writev+0x5e/0xf0
do_syscall_64+0x6e/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fc03839b373
RSP: 002b:7ffc645e70f8 EFLAGS: 0202 ORIG_RAX: 00a9
RAX: ffda RBX: 4321fedc RCX: 7fc03839b373
RDX: 4321fedc RSI: 28121969 RDI: fee1dead
RBP: 7ffc645e7160 R08:  R09: 
R10: 0002 R11: 0202 R12: 7ffc645e7168
R13:  R14: 001b0004 R15: 7ffc645e7458
Code: b8 00 04 00 00 48 c7 c1 c3 91 28 ab 48 c7 c2 20 91 28 ab be 0f
04 00 00 bf 00 00 00 01 03 41 85 04 00 58 eb b0 b8 01 10 00 00 c3 
0b 90 0f 1f 44 00 00 80 3d 74 c0 97 01 00 41 54 55 53 0f 84
RIP: acpi_os_delete_semaphore+0x6d/0x70 RSP: b0ca80037be8
-- 
Daniel J Blueman


4.14.34: kernel stack regs has bad 'bp' value

2018-04-18 Thread Daniel J Blueman
880638ad7d20: 8810513d5540 (0x8810513d5540)
880638ad7d28:  ...
880638ad7d48: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7d50: 1100c715afb4 (0x1100c715afb4)
880638ad7d58: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7d60: 0002 (0x2)
880638ad7d68: 8810513d5550 (0x8810513d5550)
880638ad7d70: 8805e358ce30 (0x8805e358ce30)
880638ad7d78: 0004 (0x4)
880638ad7d80: 8810 (0x8810)
880638ad7d88:  ...
880638ad7d90: 0400 (0x400)
880638ad7d98: 880638ad7ce0 (0x880638ad7ce0)
880638ad7da0: 0001 (0x1)
880638ad7da8: 8805e358ce30 (0x8805e358ce30)
880638ad7db0:  ...
880638ad7dc0: 880638ad7e08 (0x880638ad7e08)
880638ad7dc8: 816ac15d (rw_verify_area+0xbd/0x2b0)
880638ad7dd0: 0020 (0x20)
880638ad7dd8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7de0:  ...
880638ad7de8: 0400 (0x400)
880638ad7df0: 8810513d5540 (0x8810513d5540)
880638ad7df8: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7e00: 8810513d5584 (0x8810513d5584)
880638ad7e08: 880638ad7e48 (0x880638ad7e48)
880638ad7e10: 816b01ff (vfs_read+0xef/0x2f0)
880638ad7e18: 880638ad7e90 (0x880638ad7e90)
880638ad7e20: 8810513d5540 (0x8810513d5540)
880638ad7e28: 1100c715afce (0x1100c715afce)
880638ad7e30: 8810513d5540 (0x8810513d5540)
880638ad7e38: 880638ad7ed0 (0x880638ad7ed0)
880638ad7e40: 8810513d55a8 (0x8810513d55a8)
880638ad7e48: 880638ad7ef8 (0x880638ad7ef8)
880638ad7e50: 816b1672 (SyS_read+0xd2/0x1b0)
880638ad7e58: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7e60: 0400 (0x400)
880638ad7e68: dc00 (0xdc00)
880638ad7e70: 41b58ab3 (0x41b58ab3)
880638ad7e78: 8332f7eb (inat_primary_table+0x18292b/0x1d0d97)
880638ad7e80: 816b15a0 (kernel_write+0x130/0x130)
880638ad7e88:  ...
880638ad7e98: 881004cd9f8c (0x881004cd9f8c)
880638ad7ea0: 0002 (0x2)
880638ad7ea8: dc00 (0xdc00)
880638ad7eb0: 880638ad7f58 (0x880638ad7f58)
880638ad7eb8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7ec0: 880638ad7f58 (0x880638ad7f58)
880638ad7ec8: b5d8152ac2b05d00 (0xb5d8152ac2b05d00)
880638ad7ed0: 880638ad7f58 (0x880638ad7f58)
880638ad7ed8: 816b15a0 (kernel_write+0x130/0x130)
880638ad7ee0: 881004cd9f40 (0x881004cd9f40)
880638ad7ee8:  ...
880638ad7ef8: 880638ad7f48 (0x880638ad7f48)
880638ad7f00: 810073d9 (do_syscall_64+0x199/0x4c0)
880638ad7f08:  ...
880638ad7f10: 880638ad7f58 (0x880638ad7f58)
880638ad7f18:  ...
880638ad7f20: 880638ad7f48 (0x880638ad7f48)
880638ad7f28: 8100702e (prepare_exit_to_usermode+0x11e/0x150)
880638ad7f30:  ...
880638ad7f50: 82a00081 (entry_SYSCALL_64_after_hwframe+0x3d/0xa2)
880638ad7f58: 0005 (0x5)
880638ad7f60: 7ffddaab61d0 (0x7ffddaab61d0)
880638ad7f68: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7f70: 0048 (0x48)
880638ad7f78: 7ffddaab6a10 (0x7ffddaab6a10)
880638ad7f80: 0008 (0x8)
880638ad7f88: 0246 (0x246)
880638ad7f90: 7ffddaab60a0 (0x7ffddaab60a0)
880638ad7f98: 4000 (0x4000)
880638ad7fa0: 7ffddaab6050 (0x7ffddaab6050)
880638ad7fa8: ffda (0xffda)
880638ad7fb0: 7f504bfe56f0 (0x7f504bfe56f0)
880638ad7fb8: 0400 (0x400)
880638ad7fc0: 7ffddaab65d0 (0x7ffddaab65d0)
880638ad7fc8: 0005 (0x5)
880638ad7fd0:  ...
880638ad7fd8: 7f504bfe56f0 (0x7f504bfe56f0)
880638ad7fe0: 0033 (0x33)
880638ad7fe8: 0246 (0x246)
880638ad7ff0: 7ffddaab6048 (0x7ffddaab6048)
880638ad7ff8: 002b (0x2b)
-- 
Daniel J Blueman


drm/vc4: false-positive negative cursor position warning

2018-04-07 Thread Daniel J Blueman
Hi Eric et al,

In a number of windowing environments (eg GNOME 3) on a Raspberry Pi 3B
running 4.16.0 arm64, the top-left of the mouse cursor reaches
x,y = -4,-4, tripping WARN_ON_ONCE(plane->state->crtc_x < 0 ||
plane->state->crtc_y < 0) [1], which therefore seems to be a false
positive.

Git history doesn't turn up a reason for the check (eg that a negative
position could cause undefined hardware behaviour, which it doesn't
appear to), so would it be better to drop the warning, or adjust it to
trip on x or y < -4 or so? If so, I'll prepare a patch to adjust it.
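
To make the second option concrete, here is a rough sketch of such a
relaxed check as plain C. cursor_pos_suspicious() and
CURSOR_HOTSPOT_SLACK are hypothetical names, and the slack of 4 pixels
is only the value observed above, not something taken from the vc4
driver or from hardware documentation.

```c
#include <stdbool.h>

/* Hypothetical slack: the most negative coordinate observed was -4. */
#define CURSOR_HOTSPOT_SLACK 4

/*
 * Return true only when the cursor's top-left corner lies further
 * off-screen than the slack allows; a relaxed warning would trip here.
 */
bool cursor_pos_suspicious(int crtc_x, int crtc_y)
{
	return crtc_x < -CURSOR_HOTSPOT_SLACK ||
	       crtc_y < -CURSOR_HOTSPOT_SLACK;
}
```

With this predicate, a position of -4,-4 passes silently while -5 on
either axis would still be flagged.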

[Side note: simply opening the GNOME 3 Activities menu with
libgl1-mesa-dri 17.3.7 is a reliable way to reproduce "[drm] Resetting
GPU"]

Thanks,
  Dan

-- [1]

WARNING: CPU: 3 PID: 966 at drivers/gpu/drm/vc4/vc4_plane.c:771
vc4_plane_async_set_fb+0x98/0xa0
CPU: 3 PID: 966 Comm: Xorg Tainted: G S   4.16.0+ #13
Hardware name: Raspberry Pi 3 Model B Rev 1.2 (DT)
pstate: 0005 (nzcv daif -PAN -UAO)
pc : vc4_plane_async_set_fb+0x98/0xa0
lr : vc4_plane_async_set_fb+0x4c/0xa0
sp : 086ab9b0
x29: 086ab9b0 x28: 
x27: 0009 x26: fffc
x25: a81b36ca8b00 x24: a81b30667c00
x23: 0040 x22: a81b30790400
x21: a81b36ca8b00 x20: a81b30a53018
x19: a81b30667c00 x18: 
x17: b4fcec50 x16: 3447cc0e8588
x15: 3447d14fbf88 x14: 344851bb337f
x13: 3447d1bb338d x12: 3447d153b000
x11: 3447d14fc7f0 x10: 3447ccb3dac8
x9 : ffd0 x8 : 0005
x7 : 3932373639343932 x6 : 056e
x5 :  x4 : 
x3 :  x2 : ed59bd53d8905e00
x1 : fffc x0 : a81b30667c00
Call trace:
 vc4_plane_async_set_fb+0x98/0xa0
 vc4_update_plane+0x124/0x1a0
 __setplane_internal+0x1f4/0x260
 drm_mode_cursor_universal+0xf4/0x220
 drm_mode_cursor_common+0x19c/0x218
 drm_mode_cursor2_ioctl+0x34/0x48
 drm_ioctl_kernel+0x70/0xd8
 drm_ioctl+0x30c/0x438
 do_vfs_ioctl+0xc4/0x880
 SyS_ioctl+0x8c/0xa8
 el0_svc_naked+0x30/0x34
-- 
Daniel J Blueman


[PATCH] drm/vc4: Fix memory leak during BO teardown

2018-04-02 Thread Daniel J Blueman
During BO teardown, the indirect list 'uniform_addr_offsets' wasn't being
freed, leaking many 128B allocations. Fix the memory leak by releasing it
at teardown time.

To: linux-kernel@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Cc: Eric Anholt <e...@anholt.net>
Cc: Dave Airlie <airl...@redhat.com>
Cc: sta...@vger.kernel.org
Signed-off-by: Daniel J Blueman <dan...@quora.org>
---
 drivers/gpu/drm/vc4/vc4_bo.c   | 2 ++
 drivers/gpu/drm/vc4/vc4_validate_shaders.c | 1 +
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/vc4/vc4_bo.c b/drivers/gpu/drm/vc4/vc4_bo.c
index 2decc8e2c79f..add9cc97a3b6 100644
--- a/drivers/gpu/drm/vc4/vc4_bo.c
+++ b/drivers/gpu/drm/vc4/vc4_bo.c
@@ -195,6 +195,7 @@ static void vc4_bo_destroy(struct vc4_bo *bo)
 	vc4_bo_set_label(obj, -1);
 
 	if (bo->validated_shader) {
+		kfree(bo->validated_shader->uniform_addr_offsets);
 		kfree(bo->validated_shader->texture_samples);
 		kfree(bo->validated_shader);
 		bo->validated_shader = NULL;
@@ -591,6 +592,7 @@ void vc4_free_object(struct drm_gem_object *gem_bo)
 	}
 
 	if (bo->validated_shader) {
+		kfree(bo->validated_shader->uniform_addr_offsets);
 		kfree(bo->validated_shader->texture_samples);
 		kfree(bo->validated_shader);
 		bo->validated_shader = NULL;
diff --git a/drivers/gpu/drm/vc4/vc4_validate_shaders.c b/drivers/gpu/drm/vc4/vc4_validate_shaders.c
index d3f15bf60900..7cf82b071de2 100644
--- a/drivers/gpu/drm/vc4/vc4_validate_shaders.c
+++ b/drivers/gpu/drm/vc4/vc4_validate_shaders.c
@@ -942,6 +942,7 @@ vc4_validate_shader(struct drm_gem_cma_object *shader_obj)
 fail:
 	kfree(validation_state.branch_targets);
 	if (validated_shader) {
+		kfree(validated_shader->uniform_addr_offsets);
 		kfree(validated_shader->texture_samples);
 		kfree(validated_shader);
 	}
-- 
2.11.0
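
The ownership rule the patch restores can be shown outside the kernel.
The sketch below assumes a simplified structure (shader_info and its
helpers are hypothetical; the driver's real structure has more fields
and different element types) and uses calloc/free in place of the
kernel allocators. The point is only that every array the structure
owns is released before the structure itself, on both the teardown and
the error path.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the validated shader info. */
struct shader_info {
	uint32_t *uniform_addr_offsets;
	uint32_t *texture_samples;
};

/* Release every owned array first, then the container. NULL is allowed. */
void shader_info_free(struct shader_info *info)
{
	if (!info)
		return;
	free(info->uniform_addr_offsets);
	free(info->texture_samples);
	free(info);
}

/* Both counts are assumed to be non-zero in this sketch. */
struct shader_info *shader_info_alloc(size_t n_offsets, size_t n_samples)
{
	struct shader_info *info = calloc(1, sizeof(*info));

	if (!info)
		return NULL;

	info->uniform_addr_offsets = calloc(n_offsets, sizeof(uint32_t));
	info->texture_samples = calloc(n_samples, sizeof(uint32_t));
	if (!info->uniform_addr_offsets || !info->texture_samples) {
		/* Partial failure: the same teardown path cleans up. */
		shader_info_free(info);
		return NULL;
	}
	return info;
}
```

Routing every exit through one free helper is a design choice of the
sketch; the patch itself adds the missing kfree() at each existing
teardown site instead.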



Re: stack frame unwinding KASAN errors

2017-03-06 Thread Daniel J Blueman
On 7 March 2017 at 00:40, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> On Mon, Mar 06, 2017 at 02:52:01PM +0800, Daniel J Blueman wrote:
>> Thanks Josh!
>>
>> With this patch, the KASAN warning still occurs, but at
>> unwind_get_return_address+0x1d3/0x130 instead; the rest of the trace
>> is identical.
>>
>> (gdb) list *(unwind_get_return_address+0x1d3)
>> 0x8112bca3 is in unwind_get_return_address
>> (./include/linux/compiler.h:243).
>> 238     })
>> 239
>> 240     static __always_inline
>> 241     void __read_once_size(const volatile void *p, void *res, int size)
>> 242     {
>> 243             __READ_ONCE_SIZE;
>
> Looking deeper, I have an idea about what's going on:
>
>   https://quora.org/dmesg.txt
>
> Each of the warnings seems to show an interrupt happening during an EFI
> call.  I'm guessing EFI modified the frame pointer, at least
> temporarily, which confused the unwinder :-(
>
> Would it be possible for you to test again with 4.10?  It has some
> additional unwinder output which should hopefully confirm my suspicions.

Very good; I don't see the KASAN warnings with 4.10 in the same environment.

Thanks,
  Daniel
-- 
Daniel J Blueman
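
Background for the unwinder discussion in this thread: a frame-pointer
unwinder follows a chain of saved frame pointers, so a frame pointer
that was overwritten or temporarily repurposed can mislead it unless
every step is bounds-checked. The following is a toy model only,
assuming a made-up frame layout (an array index for the caller's frame
followed by a return address); toy_unwind() and FRAME_WORDS are
hypothetical and this is not the x86 unwinder.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy frame layout: stack[fp] holds the index of the caller's frame and
 * stack[fp + 1] holds the return address recorded for this frame.
 */
#define FRAME_WORDS 2

/*
 * Walk the frame chain starting at index fp, storing at most max_out
 * return addresses in out. Returns the number of addresses stored.
 */
size_t toy_unwind(const uintptr_t *stack, size_t n_words, size_t fp,
		  uintptr_t *out, size_t max_out)
{
	size_t n = 0;

	while (n < max_out) {
		/* Stop if the frame does not fit inside the stack. */
		if (fp > n_words || n_words - fp < FRAME_WORDS)
			break;

		out[n++] = stack[fp + 1];

		/*
		 * The next frame must lie strictly further along the
		 * stack; this rejects loops and guarantees termination.
		 */
		if (stack[fp] <= fp)
			break;
		fp = (size_t)stack[fp];
	}
	return n;
}
```

A chain whose last link points outside the array ends the walk cleanly
instead of reading out of bounds, and a link that points backwards or
at itself is treated as corruption.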


Re: stack frame unwinding KASAN errors

2017-03-05 Thread Daniel J Blueman
On 27 February 2017 at 23:47, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> On Mon, Feb 27, 2017 at 12:49:59PM +0800, Daniel J Blueman wrote:
>> On 4.9.13 with KASAN enabled [1], we see a number of stack unwinding
>> errors reported [2,3].
>>
>> This seems to occur at half of boots.
>>
>> Let me know for further debug info/patch testing and thanks,
>>   Daniel
>>
>> [1] https://quora.org/config
>> [2] https://quora.org/dmesg.txt
>
> Hi Daniel,
>
> Can you try the following patch?  It's a backport of the following
> upstream commit:
>
>   09ae68dd0a8d ("x86/unwind: Disable KASAN checks for non-current tasks")
>
> If it fixes it then I'll submit it for 4.9 stable.
>
> ---
>
> From: Josh Poimboeuf <jpoim...@redhat.com>
> Subject: [PATCH] x86/unwind: Disable KASAN checks for non-current tasks
>
> There are a handful of callers to save_stack_trace_tsk() and
> show_stack() which try to unwind the stack of a task other than current.
> In such cases, it's remotely possible that the task is running on one
> CPU while the unwinder is reading its stack from another CPU, causing
> the unwinder to see stack corruption.
>
> These cases seem to be mostly harmless.  The unwinder has checks which
> prevent it from following bad pointers beyond the bounds of the stack.
> So it's not really a bug as long as the caller understands that
> unwinding another task will not always succeed.
>
> In such cases, it's possible that the unwinder may read a KASAN-poisoned
> region of the stack.  Account for that by using READ_ONCE_NOCHECK() when
> reading the stack of another task.
>
> Use READ_ONCE() when reading the stack of the current task, since KASAN
> warnings can still be useful for finding bugs in that case.
>
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
> Cc: Andy Lutomirski <l...@amacapital.net>
> Cc: Andy Lutomirski <l...@kernel.org>
> Cc: Borislav Petkov <b...@alien8.de>
> Cc: Brian Gerst <brge...@gmail.com>
> Cc: Dave Jones <da...@codemonkey.org.uk>
> Cc: Denys Vlasenko <dvlas...@redhat.com>
> Cc: H. Peter Anvin <h...@zytor.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Miroslav Benes <mbe...@suse.cz>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Link: http://lkml.kernel.org/r/4c575eb288ba9f73d498dfe0acde2f58674598f1.1483978430.git.jpoim...@redhat.com
> Signed-off-by: Ingo Molnar <mi...@kernel.org>
> ---
>  arch/x86/include/asm/stacktrace.h |  5 ++++-
>  arch/x86/kernel/unwind_frame.c    | 20 ++++++++++++++++++--
>  2 files changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
> index 37f2e0b..4141ead 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -55,13 +55,16 @@ extern int kstack_depth_to_print;
>  static inline unsigned long *
>  get_frame_pointer(struct task_struct *task, struct pt_regs *regs)
>  {
> +       struct inactive_task_frame *frame;
> +
>         if (regs)
>                 return (unsigned long *)regs->bp;
>
>         if (task == current)
>                 return __builtin_frame_address(0);
>
> -       return (unsigned long *)((struct inactive_task_frame *)task->thread.sp)->bp;
> +       frame = (struct inactive_task_frame *)task->thread.sp;
> +       return (unsigned long *)READ_ONCE_NOCHECK(frame->bp);
>  }
>  #else
>  static inline unsigned long *
> diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
> index a2456d4..caff129 100644
> --- a/arch/x86/kernel/unwind_frame.c
> +++ b/arch/x86/kernel/unwind_frame.c
> @@ -6,6 +6,21 @@
>
>  #define FRAME_HEADER_SIZE (sizeof(long) * 2)
>
> +/*
> + * This disables KASAN checking when reading a value from another task's stack,
> + * since the other task could be running on another CPU and could have poisoned
> + * the stack in the meantime.
> + */
> +#define READ_ONCE_TASK_STACK(task, x)                  \
> +({                                                     \
> +       unsigned long val;                              \
> +       if (task == current)                            \
> +               val = READ_ONCE(x);                     \
> +       else                                            \
> +               val = READ_ONCE_NOCHECK(x);             \
> +       val;                                            \
> +})
> +
>  unsigned long unwind_get_return_address(struct unwind_state *state)
>  {
>         unsigned long addr;
> @@ -14,7 +29,8 @@ unsigned long unwind_get_return_address(struct unwind_state *state)
>         if (unwind_done(state))
>                 return 0;
>
> -       addr = ftrace_graph_ret_addr(state->task, &state->graph_idx, *addr_p,
> +       addr = READ_ONCE_TASK_STACK(state->task, *addr_p);
> +       addr = ftrace_graph_ret_addr(state->task, &state->
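
The selection logic in READ_ONCE_TASK_STACK() can be illustrated outside the kernel. What follows is only a userspace sketch, not the kernel implementation: `struct task`, `read_once()`, `read_once_nocheck()` and `read_task_stack()` are hypothetical names, and the compiler attribute `no_sanitize_address` (GCC/Clang) stands in for whatever READ_ONCE_NOCHECK() does internally. It shows one volatile load that the sanitizer may instrument and one that is exempt, with the choice made on whether the stack belongs to the current task.

```c
/* Hypothetical stand-in for task_struct; only pointer identity matters. */
struct task {
	int id;
};

/* Single volatile load; instrumented when built with -fsanitize=address. */
static inline unsigned long read_once(const unsigned long *p)
{
	return *(const volatile unsigned long *)p;
}

/* Same load, but this function is excluded from sanitizer instrumentation. */
static __attribute__((no_sanitize_address, noinline))
unsigned long read_once_nocheck(const unsigned long *p)
{
	return *(const volatile unsigned long *)p;
}

/*
 * Checked read for our own stack (reports are still useful there),
 * unchecked read for another task's stack (it may be poisoned under us).
 */
static unsigned long read_task_stack(const struct task *task,
				     const struct task *current_task,
				     const unsigned long *slot)
{
	if (task == current_task)
		return read_once(slot);
	return read_once_nocheck(slot);
}
```

Both paths return the same value; the only difference is whether a sanitizer report can be raised for the access, which matches the reasoning in the commit message above.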

stack frame unwindind KASAN errors

2017-02-26 Thread Daniel J Blueman
On 4.9.13 with KASAN enabled [1], we see a number of stack unwinding
errors reported [2,3].

This seems to occur on roughly half of boots.

Let me know if you need further debug info or patch testing, and thanks,
  Daniel

[1] https://quora.org/config
[2] https://quora.org/dmesg.txt

-- [3]

BUG: KASAN: stack-out-of-bounds in
unwind_get_return_address+0x11d/0x130 at addr 881034eafa08
Read of size 8 by task systemd/1
page:ea0040d3abc0 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x2f8000()
page dumped because: kasan: bad access detected
CPU: 20 PID: 1 Comm: systemd Not tainted 4.9.13-debug+ #3
Hardware name: Supermicro Super Server/X10DRL-i, BIOS 2.0a 08/25/2016
 881c2f607a60 b0cdb541 881c2f607af8 881034eafa08
 881c2f607ae8 b064dd17 881034ea4f70 0024
  0286 881034ea4fe2 
Call Trace:
 
 [] dump_stack+0x85/0xc4
 [] kasan_report_error+0x4d7/0x500
 [] __asan_report_load8_noabort+0x61/0x70
 [] ? unwind_get_return_address+0x11d/0x130
 [] unwind_get_return_address+0x11d/0x130
 [] ? unwind_next_frame+0x97/0xf0
 [] __save_stack_trace+0x92/0x100
 [] ? file_free_rcu+0x46/0x60
 [] save_stack_trace+0x1b/0x20
 [] save_stack+0x46/0xd0
 [] ? save_stack_trace+0x1b/0x20
 [] ? save_stack+0x46/0xd0
 [] ? kasan_slab_free+0x71/0xb0
 [] ? kmem_cache_free+0xc4/0x350
 [] ? file_free_rcu+0x46/0x60
 [] ? rcu_process_callbacks+0x9d2/0x1220
 [] ? __do_softirq+0x286/0x87d
 [] ? irq_exit+0x160/0x190
 [] ? smp_apic_timer_interrupt+0x80/0xa0
 [] ? apic_timer_interrupt+0x8c/0xa0
 [] ? debug_check_no_locks_freed+0x290/0x290
 [] ? debug_object_deactivate+0xf8/0x320
 [] ? _raw_spin_unlock_irqrestore+0x5f/0x80
 [] ? trace_hardirqs_on_caller+0x19e/0x580
 [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
 [] ? mark_held_locks+0xc8/0x120
 [] ? kmem_cache_free+0xaf/0x350
 [] ? file_free_rcu+0x46/0x60
 [] kasan_slab_free+0x71/0xb0
 [] kmem_cache_free+0xc4/0x350
 [] file_free_rcu+0x46/0x60
 [] rcu_process_callbacks+0x9d2/0x1220
 [] ? rcu_process_callbacks+0x97d/0x1220
 [] ? get_max_files+0x20/0x20
 [] __do_softirq+0x286/0x87d
 [] irq_exit+0x160/0x190
 [] smp_apic_timer_interrupt+0x80/0xa0
 [] apic_timer_interrupt+0x8c/0xa0
 
 [] ? save_stack+0x46/0xd0
 [] ? debug_check_no_locks_freed+0x290/0x290
 [] ? mark_held_locks+0xc8/0x120
 [] ? efi_call+0x58/0x90
 [] ? virt_efi_get_variable+0x9c/0x150
 [] ? efivar_entry_size+0xa4/0x110
 [] ? efivarfs_callback+0x30f/0x4e7
 [] ? efivarfs_evict_inode+0x10/0x10
 [] mark_held_locks+0xc8/0x120
 [] ? _raw_spin_unlock_irqrestore+0x5f/0x80
 [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
 [] ? efivar_init+0x512/0x750
 [] ? efivarfs_evict_inode+0x10/0x10
 [] ? efivar_entry_iter+0x140/0x140
 [] ? debug_lockdep_rcu_enabled+0x77/0x90
 [] ? d_instantiate+0x6f/0x80
 [] ? _raw_spin_unlock+0x31/0x50
 [] ? _raw_spin_unlock+0x31/0x50
 [] ? d_instantiate+0x6f/0x80
 [] ? efivarfs_mount+0x20/0x20
 [] ? efivarfs_fill_super+0x1ea/0x290
 [] ? efivarfs_mount+0x20/0x20
 [] ? mount_single+0xcc/0x130
 [] ? efivarfs_mount+0x18/0x20
 [] ? mount_fs+0x81/0x2c0
 [] ? alloc_vfsmnt+0x451/0x720
 [] ? vfs_kern_mount+0x6b/0x370
 [] ? do_mount+0x355/0x2af0
 [] ? debug_lockdep_rcu_enabled+0x77/0x90
 [] ? copy_mount_string+0x20/0x20
 [] ? __might_fault+0xf6/0x1b0
 [] ? __check_object_size+0x1b4/0x3fe
 [] ? memdup_user+0x6b/0xa0
 [] ? SyS_mount+0x95/0xe0
 [] ? entry_SYSCALL_64_fastpath+0x23/0xc6
Memory state around the buggy address:
 881034eaf900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 881034eaf980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
>881034eafa00: f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 00 f4 f4 f3 f3
   ^
 881034eafa80: f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 881034eafb00: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
Disabling lock debugging due to kernel taint
-- 
Daniel J Blueman
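
For reading the "Memory state around the buggy address" rows, a small decoder can help. This is a sketch under two assumptions: the shadow byte values are those from mm/kasan/kasan.h in 4.9-era kernels, and the addresses are used exactly as they appear in the dump above (the archive seems to have dropped their leading digits). `kasan_shadow_name()` and `shadow_index()` are hypothetical helper names. Each shadow byte describes 8 bytes of memory, so a row of 16 shadow bytes covers 128 bytes.

```c
#include <stdint.h>

/* Map a KASAN shadow byte to its meaning (values as in 4.9-era kernels). */
static const char *kasan_shadow_name(uint8_t b)
{
	if (b == 0x00)
		return "fully addressable";
	if (b < 0x08)
		return "partially addressable"; /* only the first b bytes are valid */
	switch (b) {
	case 0xF1: return "stack left redzone";
	case 0xF2: return "stack mid redzone";
	case 0xF3: return "stack right redzone";
	case 0xF4: return "stack partial redzone";
	case 0xFA: return "global redzone";
	case 0xFB: return "freed slab object";
	case 0xFC: return "slab redzone";
	case 0xFE: return "page redzone";
	case 0xFF: return "freed page";
	default:   return "unknown";
	}
}

/* Index of the shadow byte covering addr within the row starting at row_base. */
static unsigned int shadow_index(uint64_t row_base, uint64_t addr)
{
	return (unsigned int)((addr - row_base) >> 3);
}
```

For the report above, the faulting address ends in ...fa08 and its row starts at ...fa00, so the index is 1 and the shadow byte is f1: the read landed in a stack left redzone, which is consistent with the unwinder walking a stack it does not own.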


Re: [4.9.10] ip_route_me_harder() reading off-slab

2017-02-17 Thread Daniel J Blueman
On 17 February 2017 at 13:36, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Fri, 2017-02-17 at 12:36 +0800, Daniel J Blueman wrote:
>> When booting a VM in libvirt/KVM attached to a local bridge and KASAN
>> enabled on 4.9.10, we see a stream of KASAN warnings about off-slab
>> access [1].
>>
>> Let me know if you'd like more debug.
>
> Could you try the following patch ?
>
> Thanks !
>
> diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
> index b3cc1335adbc1a20dcd225d0501b0a286d27e3c8..18839e59da849f0988924bcbc9873965a3681eb0 100644
> --- a/net/ipv4/netfilter.c
> +++ b/net/ipv4/netfilter.c
> @@ -23,7 +23,8 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
>         struct rtable *rt;
>         struct flowi4 fl4 = {};
>         __be32 saddr = iph->saddr;
> -       __u8 flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0;
> +       struct sock *sk = skb->sk;
> +       __u8 flags = sk && sk_fullsock(sk) ? inet_sk_flowi_flags(sk) : 0;
>         struct net_device *dev = skb_dst(skb)->dev;
>         unsigned int hh_len;
>
> @@ -40,7 +41,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
>         fl4.daddr = iph->daddr;
>         fl4.saddr = saddr;
>         fl4.flowi4_tos = RT_TOS(iph->tos);
> -       fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
> +       fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
>         if (!fl4.flowi4_oif)
>                 fl4.flowi4_oif = l3mdev_master_ifindex(dev);
>         fl4.flowi4_mark = skb->mark;
> @@ -61,7 +62,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff *skb, unsigned int addr_t
>             xfrm_decode_session(skb, flowi4_to_flowi(&fl4), AF_INET) == 0) {
>                 struct dst_entry *dst = skb_dst(skb);
>                 skb_dst_set(skb, NULL);
> -               dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
> +               dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), sk, 0);
>                 if (IS_ERR(dst))
>                         return PTR_ERR(dst);
>                 skb_dst_set(skb, dst);

Fine work! This nicely resolves the issue. I'll test Florian's
proposed fix also.

Tested-by: Daniel J Blueman <dan...@quora.org>

Thanks,
  Dan
-- 
Daniel J Blueman
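
Why the sk_fullsock() guard matters can be shown with a reduced model. This is a sketch only: `struct sock_common`, `struct mini_sock`, `struct full_sock`, `is_fullsock()` and `flow_flags()` are simplified, hypothetical stand-ins for the kernel structures and helpers, assuming the state test mirrors the kernel's (every state except TIME_WAIT and NEW_SYN_RECV is a full socket). Request and timewait sockets are smaller objects that share only the common header, so reading a full-socket field through them runs off the end of the allocation, which fits the request_sock_TCP object named in the KASAN report.

```c
#include <stdbool.h>

/* TCP state numbers in the order used by include/net/tcp_states.h. */
enum {
	TCP_ESTABLISHED = 1, TCP_SYN_SENT, TCP_SYN_RECV, TCP_FIN_WAIT1,
	TCP_FIN_WAIT2, TCP_TIME_WAIT, TCP_CLOSE, TCP_CLOSE_WAIT,
	TCP_LAST_ACK, TCP_LISTEN, TCP_CLOSING, TCP_NEW_SYN_RECV,
};

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct sock_common {
	unsigned char state;
	int bound_dev_if;
};

struct mini_sock {		/* request/timewait-like: header only */
	struct sock_common common;
};

struct full_sock {		/* full socket: header plus extra fields */
	struct sock_common common;
	unsigned char flow_flags;
};

/* True unless the socket is in a state that implies a mini socket. */
static bool is_fullsock(const struct sock_common *sk)
{
	return (1u << sk->state) &
	       ~((1u << TCP_TIME_WAIT) | (1u << TCP_NEW_SYN_RECV));
}

/* Only touch full-socket fields after the state check has passed. */
static unsigned char flow_flags(const struct sock_common *sk)
{
	if (sk && is_fullsock(sk))
		return ((const struct full_sock *)sk)->flow_flags;
	return 0;
}
```

The bound-device lookup in the patch needs no such guard, on the understanding that sk_bound_dev_if lives in the shared header; in this model that corresponds to `common.bound_dev_if`, which is valid for both socket kinds.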


[4.9.10] ip_route_me_harder() reading off-slab

2017-02-16 Thread Daniel J Blueman
[  473.580640]  [] ? nf_hook_slow+0xf6/0x1b0
[  473.580651]  [] ? nf_iterate+0x2d0/0x2d0
[  473.580660]  [] ip_finish_output+0x5a8/0x9b0
[  473.580670]  [] ip_output+0x1d6/0x520
[  473.580679]  [] ? ip_output+0x21d/0x520
[  473.580692]  [] ? ip_mc_output+0xc10/0xc10
[  473.580704]  [] ? ip_fragment.constprop.54+0x220/0x220
[  473.580714]  [] ip_local_out+0x7d/0x130
[  473.580724]  [] ip_queue_xmit+0x7f7/0x1bc0
[  473.580733]  [] ? ip_queue_xmit+0x3e/0x1bc0
[  473.580749]  [] ? __skb_clone+0x97/0x7d0
[  473.580760]  [] tcp_transmit_skb+0x172c/0x3430
[  473.580771]  [] ? kasan_unpoison_shadow+0x36/0x50
[  473.580782]  [] ? __tcp_select_window+0x6b0/0x6b0
[  473.580795]  [] ? fib_table_lookup+0xde2/0x1580
[  473.580808]  [] ? sk_stream_alloc_skb+0x2da/0x770
[  473.580816]  [] ? tcp_mtup_init+0x1af/0x330
[  473.580827]  [] tcp_connect+0x1ffd/0x2e30
[  473.580836]  [] ? trace_hardirqs_on+0xd/0x10
[  473.580850]  [] ? tcp_push_one+0xf0/0xf0
[  473.580862]  [] ? secure_tcp_sequence_number+0x101/0x190
[  473.580873]  [] ? secure_dccpv6_sequence_number+0x440/0x440
[  473.580885]  [] ? ip_rt_update_pmtu+0xd10/0xd10
[  473.580896]  [] ? xfrm_lookup_route+0x21/0x160
[  473.580910]  [] tcp_v4_connect+0xe08/0x1cd0
[  473.580923]  [] __inet_stream_connect+0x64b/0xd70
[  473.580934]  [] ? inet_bind+0x880/0x880
[  473.580946]  [] ? lock_sock_nested+0x90/0x110
[  473.580955]  [] ? trace_hardirqs_on+0xd/0x10
[  473.580965]  [] ? __local_bh_enable_ip+0x70/0xc0
[  473.580980]  [] inet_stream_connect+0x55/0xa0
[  473.580991]  [] SYSC_connect+0x22c/0x2d0
[  473.581000]  [] ? SYSC_bind+0x240/0x240
[  473.581011]  [] ? set_close_on_exec+0xc2/0x170
[  473.581021]  [] ? _raw_spin_unlock+0x27/0x40
[  473.581035]  [] ? set_close_on_exec+0xc2/0x170
[  473.581046]  [] ? SyS_fcntl+0x666/0xde0
[  473.581056]  [] ? f_getown+0xb0/0xb0
[  473.581067]  [] ? trace_hardirqs_on_thunk+0x1a/0x1c
[  473.581078]  [] SyS_connect+0xe/0x10
[  473.581091]  [] entry_SYSCALL_64_fastpath+0x23/0xc6
[  473.581102] Object at 8801e1eb26f8, in cache request_sock_TCP size: 352
[  473.581105] Allocated:
[  473.581109] PID = 0
[  473.581112] (stack is not available)
[  473.581115] Freed:
[  473.581119] PID = 0
[  473.581122] (stack is not available)
[  473.581125] Memory state around the buggy address:
[  473.581134]  8801e1eb2780: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  473.581140]  8801e1eb2800: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  473.581147] >8801e1eb2880: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  473.581151]   ^
[  473.581157]  8801e1eb2900: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
[  473.581164]  8801e1eb2980: fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc fc
-- 
Daniel J Blueman


Re: Dell XPS13: MCE (Hardware Error) reported

2017-01-05 Thread Daniel J Blueman
On 5 January 2017 at 13:00, Daniel J Blueman <dan...@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee40110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as an L2 error, some memory errors can also manifest
>> themselves as cache errors in certain cases.  In this case it looks like
>> some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
>>
>>
>> Do you have another machine which doesn't report these errors? If so, try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model-specific error handy; I will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both CPU and memory,
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [0.00] PM: Registered nosave memory: [mem 0xfee01000-0xfeff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE0:FEE and SPI BIOS at FFE0:, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this recur after boot, at around 300s uptime; there
were no other kernel messages at that point, and it hasn't recurred in the
hours since:

$ dmesg | grep Machine
[0.039072] mce: [Hardware Error]: Machine check events logged
[  300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.

Dan
-- 
Daniel J Blueman
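
The bit decoding quoted above can be reproduced mechanically. This is a sketch assuming the architectural IA32_MCi_STATUS and IA32_MCG_CAP layouts, and assuming the logged "STATUS ee40110a" was originally 0xee0000000040110a (the archive appears to have dropped the run of zeros; that reconstruction is a guess, not something taken from the log). `struct mci_status`, `struct mcg_cap` and both decode functions are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

struct mci_status {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mscod;		/* model-specific error code, bits 31:16 */
	uint16_t mcacod;	/* architectural error code, bits 15:0 */
};

/* Split an IA32_MCi_STATUS value into its architectural fields. */
static struct mci_status decode_mci_status(uint64_t s)
{
	struct mci_status d = {
		.val    = (s >> 63) & 1,
		.over   = (s >> 62) & 1,
		.uc     = (s >> 61) & 1,
		.en     = (s >> 60) & 1,
		.miscv  = (s >> 59) & 1,
		.addrv  = (s >> 58) & 1,
		.pcc    = (s >> 57) & 1,
		.mscod  = (uint16_t)(s >> 16),
		.mcacod = (uint16_t)s,
	};
	return d;
}

struct mcg_cap {
	unsigned int banks;	/* number of reporting banks, bits 7:0 */
	bool cmci_p;		/* bit 10: CMCI supported */
	bool tes_p;		/* bit 11: threshold-based error status */
};

/* Split an IA32_MCG_CAP value into the fields discussed in the thread. */
static struct mcg_cap decode_mcg_cap(uint64_t c)
{
	struct mcg_cap d = {
		.banks  = (unsigned int)(c & 0xff),
		.cmci_p = (c >> 10) & 1,
		.tes_p  = (c >> 11) & 1,
	};
	return d;
}
```

With the assumed status value this yields VAL=1, OVER=1, UC=1, EN=0, MISCV=1, ADDRV=1, PCC=1, MSCOD=0x0040 and MCACOD=0x110a, which agrees with the decoding in the quoted mail; MCG_CAP 0xc08 gives 8 banks with CMCI_P and TES_P set.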


Re: Question regarding power button of Dell XPS13

2017-01-04 Thread Daniel J Blueman
On Monday, December 26, 2016 at 6:30:05 AM UTC+8, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 4:36 AM, Paul Menzel <pmen...@molgen.mpg.de> wrote:
> >
> > I heard that you both have a Dell XPS13. I got the “revision” 9360, and
> > installed Debian Stretch/testing on it with Linux 4.8.15 and Linux 4.9-rc8.
> >
> > When pressing the power button the GNOME dialog, asking what to do (restart,
> > power off, …) doesn’t appear.
>
> Hmm. I don't recall ever seeing such a dialog. But I don't run Debian.
>
> For me it works like all power buttons on my laptops have worked
> lately - it suspends the machine.
>
> Of course, so does just closing the lid.
>
> The only "bug" I've seen in this area is the design bug of the XPS13
> where there is no visible indication of the suspend state (ie the
> traditional slowly pulsing LED showing that it's all nice and
> suspended). But that seems to be intentional, if stupid. I think it's
> the only real beef I have with the XPS13.

I find the 9360 to be a solid laptop (my XPS 15 9550 would fail to
resume from suspend 15% of the time), but did any of you guys run into
bit-depth colour issues [1] on the Skylake/9350 with USB-C to HDMI
adapters?

Dan

[1] https://bugs.freedesktop.org/show_bug.cgi?id=99137
-- 
Daniel J Blueman


Re: Dell XPS13: MCE (Hardware Error) reported

2017-01-04 Thread Daniel J Blueman
On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee40110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as a cache error in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both cpu and memory?
> or just the motherboard swap?

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:
MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:
[0.00] PM: Registered nosave memory: [mem 0xfee01000-0xfeff]

For core-generated cycles, it is between the local APIC space at
FEE0:FEE and SPI BIOS at FFE0:, so will be
subtractively decoded to the PCH, maybe being aborted due to a device
not being enabled (hello TPM3 or new image processor).

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it
anyways.

Dan
-- 
Daniel J Blueman
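
The MCi_STATUS and MCG_CAP decoding quoted in this message can be sketched
in a few lines of C. This is only an illustration, not code from the kernel
or mcelog: the struct and function names are hypothetical, the bit positions
are the architectural ones from Intel's machine-check documentation, and the
full 64-bit status value 0xee0000000040110a is an assumption (the log prints
"ee40110a", apparently with zeros dropped) chosen to match the flags and the
MCACOD/MSCOD fields quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural flag bits of IA32_MCi_STATUS. */
#define MCI_STATUS_VAL   (1ULL << 63)	/* register contents valid */
#define MCI_STATUS_OVER  (1ULL << 62)	/* error overflow */
#define MCI_STATUS_UC    (1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)	/* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59)	/* MCi_MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)	/* MCi_ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57)	/* processor context corrupt */

/* Hypothetical container for the decoded fields. */
struct mci_status {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mscod;		/* model-specific error code, bits 31:16 */
	uint16_t mcacod;	/* architectural error code, bits 15:0 */
};

static void decode_mci_status(uint64_t status, struct mci_status *out)
{
	assert(out != NULL);
	out->val    = (status & MCI_STATUS_VAL) != 0;
	out->over   = (status & MCI_STATUS_OVER) != 0;
	out->uc     = (status & MCI_STATUS_UC) != 0;
	out->en     = (status & MCI_STATUS_EN) != 0;
	out->miscv  = (status & MCI_STATUS_MISCV) != 0;
	out->addrv  = (status & MCI_STATUS_ADDRV) != 0;
	out->pcc    = (status & MCI_STATUS_PCC) != 0;
	out->mscod  = (uint16_t)(status >> 16);
	out->mcacod = (uint16_t)status;
}

/* Hypothetical container for the IA32_MCG_CAP fields discussed above. */
struct mcg_cap {
	unsigned banks;		/* bank count, bits 7:0 */
	bool cmci_p;		/* bit 10: CMCI supported */
	bool tes_p;		/* bit 11: threshold-based error status */
};

static void decode_mcg_cap(uint64_t cap, struct mcg_cap *out)
{
	assert(out != NULL);
	out->banks  = (unsigned)(cap & 0xff);
	out->cmci_p = ((cap >> 10) & 1) != 0;
	out->tes_p  = ((cap >> 11) & 1) != 0;
}
```

Under the assumption above, decoding the status gives VAL, OVER, UC, MISCV,
ADDRV and PCC set with EN clear, which is the combination the quoted
analysis describes.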


FOSSASIA'17 Kernel Track: Call for Speakers

2016-12-23 Thread Daniel J Blueman
Dear Linux Kernel developers,

The FOSSASIA 2017 Kernel Track would like to welcome all interested
speakers to submit abstracts for presentations. You'll have the
opportunity to share your knowledge and discuss with like-minded
individuals, representing a broad range of industries and
technologies.

The topics include, but are not limited to:
- new kernel developments, ideas and limitations
- development process and community
- bringup experience on new platforms or SoCs
- debugging, profiling, tuning tips and experience
- security and vulnerabilities
- new and exciting architectures, features and platforms

There are over 3000 attendees each year and a broad range of other
tracks including the Hardware and Maker track, the Artificial
Intelligence track, the Startup and Business Development track and the
DevOps track.

The deadline for submission has been extended until Jan 20th; for more
details see:
http://blog.fossasia.org/fossasia-summit-2017-singapore-call-for-speakers/

We are looking forward to seeing you at the summit!

Daniel
-- 
Daniel J Blueman


[PATCH] x86/urgent: Fix NumaConnect2 MMCFG PCI access

2015-12-30 Thread Daniel J Blueman
The MMCFG PCI accessors weren't being set up for NumaConnect2
correctly due to over-early assignment; this would create the
potential for the wrong PCI domain to be accessed.

Fix this by using the correct arch-specific PCI init function.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
---
 arch/x86/kernel/apic/apic_numachip.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 38dd5ef..2bd2292 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -193,20 +193,17 @@ static int __init numachip_system_init(void)
 	case 1:
 		init_extra_mapping_uc(NUMACHIP_LCSR_BASE, NUMACHIP_LCSR_SIZE);
 		numachip_apic_icr_write = numachip1_apic_icr_write;
-		x86_init.pci.arch_init = pci_numachip_init;
 		break;
 	case 2:
 		init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
 		numachip_apic_icr_write = numachip2_apic_icr_write;
-
-		/* Use MCFG config cycles rather than locked CF8 cycles */
-		raw_pci_ops = &pci_mmcfg;
 		break;
 	default:
 		return 0;
 	}
 
 	x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+	x86_init.pci.arch_init = pci_numachip_init;
 
 	return 0;
 }
--
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/urgent] x86/numachip: Fix NumaConnect2 MMCFG PCI access

2015-12-30 Thread tip-bot for Daniel J Blueman
Commit-ID:  dd7a5ab495019d424c2b0747892eb2e38a052ba5
Gitweb: http://git.kernel.org/tip/dd7a5ab495019d424c2b0747892eb2e38a052ba5
Author: Daniel J Blueman 
AuthorDate: Thu, 31 Dec 2015 02:06:47 +0800
Committer:  Thomas Gleixner 
CommitDate: Wed, 30 Dec 2015 19:19:03 +0100

x86/numachip: Fix NumaConnect2 MMCFG PCI access

The MMCFG PCI accessors weren't being set up for NumaConnect2
correctly due to over-early assignment; this would create the
potential for the wrong PCI domain to be accessed.

Fix this by using the correct arch-specific PCI init function.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
Cc: Daniel Lezcano 
Cc: Linus Torvalds 
Link: 
http://lkml.kernel.org/r/1451498807-15920-1-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/apic/apic_numachip.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 38dd5ef..2bd2292 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -193,20 +193,17 @@ static int __init numachip_system_init(void)
 	case 1:
 		init_extra_mapping_uc(NUMACHIP_LCSR_BASE, NUMACHIP_LCSR_SIZE);
 		numachip_apic_icr_write = numachip1_apic_icr_write;
-		x86_init.pci.arch_init = pci_numachip_init;
 		break;
 	case 2:
 		init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
 		numachip_apic_icr_write = numachip2_apic_icr_write;
-
-		/* Use MCFG config cycles rather than locked CF8 cycles */
-		raw_pci_ops = &pci_mmcfg;
 		break;
 	default:
 		return 0;
 	}
 
 	x86_cpuinit.fixup_cpu_id = fixup_cpu_id;
+	x86_init.pci.arch_init = pci_numachip_init;
 
 	return 0;
 }


[PATCH 1/2] PCI: Add mechanism to find topologically near cores

2015-12-23 Thread Daniel J Blueman
Some devices (eg ixgbe) make assumptions about device to core locality when
specifying interrupt locality hints and allocate starting from core 0.
Moreover, interrupts may not be routable to distant NUMA nodes due to the
8-bit APIC ID space limitations.

Provide a mechanism drivers can use to find cores with reasonable locality
to a device; use the existing precedent of RECLAIM_DISTANCE (30), wrapping
the offset.

Signed-off-by: Daniel J Blueman 
---
 drivers/pci/pci.c   | 15 +++
 include/linux/pci.h |  1 +
 2 files changed, 16 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 314db8c..d5535d1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4833,6 +4833,22 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
 }
 EXPORT_SYMBOL(pci_fixup_cardbus);
 
+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset)
+{
+	/* Start search from node device is on for optimal locality */
+	int localnode = pcibus_to_node(pdev->bus);
+	int cpu = cpumask_first(cpumask_of_node(localnode));
+
+	while (offset--) {
+		do {
+			cpu = (cpu + 1) % nr_cpu_ids;
+		} while (!cpu_online(cpu) || node_distance(cpu_to_node(cpu),
+						localnode) > RECLAIM_DISTANCE);
+	}
+
+	return cpu;
+}
+
 static int __init pci_setup(char *str)
 {
 	while (str) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6ae25aa..f7491bd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -842,6 +842,7 @@ void pci_stop_root_bus(struct pci_bus *bus);
 void pci_remove_root_bus(struct pci_bus *bus);
 void pci_setup_cardbus(struct pci_bus *bus);
 void pci_sort_breadthfirst(void);
+int cpu_near_dev(const struct pci_dev *pdev, unsigned offset);
 #define dev_is_pci(d) ((d)->bus == &pci_bus_type)
 #define dev_is_pf(d) ((dev_is_pci(d) ? to_pci_dev(d)->is_physfn : false))
 #define dev_num_vf(d) ((dev_is_pci(d) ? pci_num_vf(to_pci_dev(d)) : 0))
--
2.5.0
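
The wrapping search in cpu_near_dev() can be tried outside the kernel with a
small userspace model. This is a sketch under stated assumptions, not the
patch itself: struct topology, cpu_is_near() and cpu_near_node() are
hypothetical names, plain arrays stand in for pcibus_to_node(),
cpumask_of_node(), cpu_online() and node_distance(), and the up-front check
that at least one near CPU exists is an addition here so the loop is
guaranteed to terminate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NEAR_DISTANCE 30	/* mirrors RECLAIM_DISTANCE (30) from the patch */

/* Hypothetical stand-in for the kernel's CPU/NUMA topology helpers. */
struct topology {
	int nr_cpus;
	int nr_nodes;
	const int *cpu_to_node;	/* nr_cpus entries */
	const bool *online;	/* nr_cpus entries */
	const int *distance;	/* nr_nodes * nr_nodes entries, row-major */
};

static bool cpu_is_near(const struct topology *t, int cpu, int node)
{
	int d = t->distance[t->cpu_to_node[cpu] * t->nr_nodes + node];

	return t->online[cpu] && d <= NEAR_DISTANCE;
}

/*
 * Start at the first CPU on local_node, then step 'offset' times to the
 * next online CPU within NEAR_DISTANCE, wrapping at nr_cpus.
 * Returns -1 if the node has no CPUs or no CPU qualifies as near.
 */
static int cpu_near_node(const struct topology *t, int local_node,
			 unsigned offset)
{
	int cpu = -1;
	int near = 0;

	assert(t != NULL && local_node >= 0 && local_node < t->nr_nodes);

	for (int i = 0; i < t->nr_cpus; i++) {
		if (cpu < 0 && t->cpu_to_node[i] == local_node)
			cpu = i;
		if (cpu_is_near(t, i, local_node))
			near++;
	}
	if (cpu < 0 || near == 0)
		return -1;

	while (offset--) {
		do {
			cpu = (cpu + 1) % t->nr_cpus;
		} while (!cpu_is_near(t, cpu, local_node));
	}

	return cpu;
}
```

With four CPUs split over two mutually distant nodes, offsets 0, 1, 2, 3 on
node 1 visit CPUs 2, 3, 2, 3: the offset wraps within the near set instead
of spilling onto the distant node.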



[PATCH 2/2] ixgbe: Use core to device locality interface

2015-12-23 Thread Daniel J Blueman
Rather than assuming cores starting from 0 are local to the ethernet
device, use the introduced interface to find near cores.

Not only does this improve performance due to spreading interrupts via near
NUMA nodes, it prevents assigning cores on distant NUMA nodes, which aren't
reachable by device interrupts due to the 8-bit APIC ID limitation.

With Numascale NumaConnect2 systems with Intel ixgbe cards on
non-primary PCI domains, all ixgbe NICs would previously revector
interrupts to cores 0 to 63 (cores 0 to 47 would be considered
near the primary PCI domain). Now, cores 48 to 95 are used, increasing
performance and addressing interrupt delivery issues:

do_IRQ: 79.180 No irq handler for vector (irq -1)
do_IRQ: 78.42 No irq handler for vector (irq -1)
do_IRQ: 71.172 No irq handler for vector (irq -1)
do_IRQ: 70.236 No irq handler for vector (irq -1)
do_IRQ: 69.109 No irq handler for vector (irq -1)
do_IRQ: 68.189 No irq handler for vector (irq -1)
do_IRQ: 72.92 No irq handler for vector (irq -1)
do_IRQ: 73.235 No irq handler for vector (irq -1)
do_IRQ: 66.185 No irq handler for vector (irq -1)
do_IRQ: 67.62 No irq handler for vector (irq -1)
do_IRQ: 197 callbacks suppressed

Signed-off-by: Daniel J Blueman 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index f3168bc..12c4ce1 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -817,10 +817,8 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	if ((tcs <= 1) && !(adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)) {
 		u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
 		if (rss_i > 1 && adapter->atr_sample_rate) {
-			if (cpu_online(v_idx)) {
-				cpu = v_idx;
-				node = cpu_to_node(cpu);
-			}
+			cpu = cpu_near_dev(adapter->pdev, v_idx);
+			node = cpu_to_node(cpu);
 		}
 	}
 
--
2.5.0



Re: [TESTPATCH v2] xhci: fix usb2 resume timing and races.

2015-12-04 Thread Daniel J Blueman
On 1 December 2015 at 16:26, Mathias Nyman wrote:
> usb2 ports need to signal resume for 20ms before moving to U0 state.
> Both device and host can initiate resume.
>
> On host initiated resume, port is set to resume state, sleep 20ms,
> and finally set port to U0 state.
>
> On device initiated resume, a port status interrupt with a port in resume
> state is issued. The interrupt handler tags a resume_done[port]
> timestamp with current time + 20ms, and kicks the roothub timer.
> Root hub timer requests port status, finds the port in resume state,
> checks if the resume_done[port] timestamp has passed, and sets port to U0 state.
>
> There are a few issues with this approach:
> 1. A host initiated resume will also generate a resume event; the event
>    handler will find the port in resume state, believe it's a device
>    initiated resume and act accordingly.
>
> 2. A port status request might cut the 20ms resume signalling short if a
>    get_port_status request is handled during the 20ms host resume.
>    The port will be found in resume state. The timestamp is not set, leading
>    to time_after_eq(jiffies, timestamp) returning true, as timestamp = 0.
>    get_port_status will proceed with moving the port to U0.
>
> 3. If an error, or anything else, happens to the port during device
>    initiated 20ms resume signalling, it will leave all device resume
>    parameters hanging uncleared, preventing further resume.
>
> Fix this by using the existing resuming_ports bitfield to indicate if
> resume signalling timing is taken care of.
> Also check if the resume_done[port] is set before using it in time
> comparison. Also clear out any resume signalling related variables if port
> is not in U0 or Resume state.
>
> v2. fix parentheses when checking for uncleared resume variables.
> we want: if ((unclear1 OR unclear2 ) AND !in_resume AND !in_U3) { .. }
>
> Signed-off-by: Mathias Nyman 

Excellent; this correctly prevents the cyclic chain of suspend
attempts, resolving the issue.

Tested-by: Daniel J Blueman 

Thanks Mathias!
  Daniel
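
Issue 2 in the quoted description (an unset timestamp of 0 compares as
"already elapsed") is easy to reproduce in isolation. The sketch below is a
userspace illustration under assumptions, not the xhci driver's code:
struct port_resume and both helper functions are hypothetical, and
tick_after_eq() only imitates the wraparound-safe comparison that the
kernel's time_after_eq() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wraparound-safe "a is at or after b", in the style of time_after_eq(). */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Hypothetical per-port resume state. */
struct port_resume {
	bool resuming;			/* resume signalling is being timed */
	unsigned long resume_done;	/* tick when 20ms has elapsed; 0 = unset */
};

/* Naive check: an unset timestamp (0) looks like an expired one. */
static bool resume_elapsed_naive(const struct port_resume *p,
				 unsigned long now)
{
	assert(p != NULL);
	return tick_after_eq(now, p->resume_done);
}

/* Guarded check: only trust the timestamp if it was actually set. */
static bool resume_elapsed_guarded(const struct port_resume *p,
				   unsigned long now)
{
	assert(p != NULL);
	return p->resuming && p->resume_done != 0 &&
	       tick_after_eq(now, p->resume_done);
}
```

For a port with no timestamp recorded, the naive check reports the resume
period as over at any tick, while the guarded check does not; this is the
kind of guard the description asks for before using resume_done[port] in a
time comparison.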


RE: overriding ACPI _CRS method

2015-11-29 Thread Daniel J Blueman

On Mon, Nov 30, 2015 at 11:09 AM, Zheng, Lv wrote:
> Hi,
>
> IMO, if you want the new _CRS to be applied during the Linux early
> boot stage, you can override the table using initrd override or DSDT
> override mechanism.
> Please see Documentation/acpi/initrd_table_override.txt or
> Documentation/acpi/dsdt-override.txt.
>
> If you want the new _CRS to be applied during Linux runtime, you can
> override it using method customization mechanism.
> Please see Documentation/acpi/method-customizing.txt

The reason I'm trying to adjust this in firmware is to deliver the
right behaviour with pre-built/distro kernels, so I can't use that
approach.

Thanks Lv,
  Daniel



overriding ACPI _CRS method

2015-11-29 Thread Daniel J Blueman
In firmware that is loaded after the BIOS, I need to trim the root bus
resource (0x4000-0xdfff) covering the MMIO window [1], so I can
attach further PCI domains.

One strategy is to override the BIOS's DSDT [2] _SB.PCI0._CRS method;
even when my firmware appends the bytecode for a new _CRS method [3],
alas I see AE_ALREADY_EXISTS [4].

I understood methods were overridable within the same table (eg not
from an SSDT), but perhaps I am missing something? Or is there a better
approach to reduce the scope of the PCI domain root bus?


Thanks!
 Daniel

-- [1]

pci_bus :00: root bus resource [io  0x-0x03af window]
pci_bus :00: root bus resource [io  0x03e0-0x0cf7 window]
pci_bus :00: root bus resource [io  0x03b0-0x03bb window]
pci_bus :00: root bus resource [io  0x03c0-0x03df window]
pci_bus :00: root bus resource [io  0x8000-0xdfff window]
pci_bus :00: root bus resource [mem 0x000a-0x000b window]
pci_bus :00: root bus resource [mem 0xf000-0x window]
pci_bus :00: root bus resource [mem 0x000d-0x000d window]
pci_bus :00: root bus resource [mem 0x4000-0xdfff window]
pci_bus :00: root bus resource [bus 00-04]

[2] https://resources.numascale.com/DSDT.dsl
[3] https://resources.numascale.com/DSDT-extra.dsl

-- [4]

ACPI: Core revision 20150930
ACPI Error: [_CRS] Namespace lookup failure, AE_ALREADY_EXISTS (20150930/dswload-378)
ACPI Exception: AE_ALREADY_EXISTS, During name lookup/catalog (20150930/psobject-227)
ACPI Exception: AE_ALREADY_EXISTS, [DSDT] table load failed (20150930/tbxfload-163)
ACPI Error: [\_PR_.P001] Namespace lookup failure, AE_NOT_FOUND (20150930/dswload-210)
ACPI Exception: AE_NOT_FOUND, During name lookup/catalog (20150930/psobject-227)
ACPI Exception: AE_NOT_FOUND, (SSDT:POWERNOW) while loading table (20150930/tbxfload-193)
ACPI Error: 2 table load failures, 0 successful (20150930/tbxfload-214)
--
Daniel J Blueman
Principal Software Engineer, Numascale



RE: overriding ACPI _CRS method

2015-11-29 Thread Daniel J Blueman

On Mon, Nov 30, 2015 at 11:09 AM, Zheng, Lv wrote:
> Hi,
>
> IMO, if you want the new _CRS to be applied during the Linux early
> boot stage, you can override the table using initrd override or DSDT
> override mechanism.
> Please see Documentation/acpi/initrd_table_override.txt or
> Documentation/acpi/dsdt-override.txt.
>
> If you want the new _CRS to be applied during Linux runtime, you can
> override it using method customization mechanism.
> Please see Documentation/acpi/method-customizing.txt


The reason I'm trying to adjust this in firmware, is to deliver the 
right behaviour with pre-built/distro kernels, so I can't use that 
approach.


Thanks Lv,
 Daniel



Re: [4.3] kworker busy in pm_runtime_work

2015-11-24 Thread Daniel J Blueman
On 23 November 2015 at 23:52, Alan Stern <st...@rowland.harvard.edu> wrote:
> On Sun, 22 Nov 2015, Daniel J Blueman wrote:
>
>> On 16 November 2015 at 23:22, Alan Stern <st...@rowland.harvard.edu> wrote:
>> > On Mon, 16 Nov 2015, Daniel J Blueman wrote:
>> >
>> >> Tuning USB suspend [1] in 4.3 on a Dell XPS 15 9553 (Skylake), I see a
>> >> kworker thread spinning in rpm_suspend [2].
>> >>
>> >> What is the most useful debug to get here beyond the immediate [3]?
>> >
>> > You can try doing:
>> >
>> > echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
>>
>> kworker and ksoftirqd spinning occurs when I echo 'auto' to all the
>> USB control entries. Using Alan's excellent tip, we see this being
>> logged repeatedly at a high rate:
>> [  353.245180] usb usb1-port4: status 0107 change 
>> [  353.245194] usb usb1-port12: status 0507 change 
>> [  353.245202] hub 1-0:1.0: state 7 ports 16 chg  evt 
>> [  353.245203] hub 1-0:1.0: hub_suspend
>> [  353.245205] usb usb1: bus auto-suspend, wakeup 1
>> [  353.245206] usb usb1: bus suspend fail, err -16
>> [  353.245207] hub 1-0:1.0: hub_resume
>> ...
>>
>> So, EBUSY. Both the webcam is not open, and the bluetooth interface
>> [1] is rfkill'd; the situation occurs even if I unload all related
>> modules.
>>
>> What further debug would be useful?
>>
>> Thanks!
>>   Daniel
>>
>> -- [1]
>>
>> Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
>> Bus 001 Device 002: ID 0a5c:6410 Broadcom Corp.
>> Bus 001 Device 003: ID 1bcf:2b95 Sunplus Innovation Technology Inc.
>> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
>
> Since bus 1 uses an xHCI controller, you should do:
>
> echo 'module xhci-hcd =p' >/sys/kernel/debug/dynamic_debug/control
>
> I'm reasonably sure this will end up printing "suspend failed
> because a port is resuming", since that's the only place where
> xhci_bus_suspend() fails with -EBUSY, but you should try it to confirm
> this.

I had to use:
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control

and indeed we see:
[29172.246221] xhci_hcd :00:14.0: get port status, actual port 11 status  = 0xe63
[29172.246222] xhci_hcd :00:14.0: Get port status returned 0x507
[29172.246224] xhci_hcd :00:14.0: get port status, actual port 12 status  = 0x2a0
[29172.246228] xhci_hcd :00:14.0: get port status, actual port 13 status  = 0x2a0
[29172.246228] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246231] xhci_hcd :00:14.0: get port status, actual port 14 status  = 0x2a0
[29172.246232] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246235] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246248] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246254] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246264] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246275] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246285] xhci_hcd :00:14.0: Get port status returned 0x507
[29172.246294] xhci_hcd :00:14.0: get port status, actual port 14 status  = 0x2a0
[29172.246302] xhci_hcd :00:14.0: suspend failed because a port is resuming
[29172.246321] xhci_hcd :00:14.0: Get port status returned 0x107
[29172.246332] xhci_hcd :00:14.0: get port status, actual port 6 status  = 0x2a0
[29172.246346] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246359] xhci_hcd :00:14.0: get port status, actual port 13 status  = 0x2a0
[29172.246364] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246366] xhci_hcd :00:14.0: get port status, actual port 15 status  = 0x2a0
[29172.246371] xhci_hcd :00:14.0: suspend failed because a port is resuming
[29172.246380] xhci_hcd :00:14.0: Get port status returned 0x100
[29172.246382] xhci_hcd :00:14.0: get port status, actual port 1 status  = 0x2a0
[29172.246383] xhci_hcd :00:14.0: Get port status returned 0x100
...
-- 
Daniel J Blueman


Re: [4.3] kworker busy in pm_runtime_work

2015-11-22 Thread Daniel J Blueman
On 16 November 2015 at 23:22, Alan Stern <st...@rowland.harvard.edu> wrote:
> On Mon, 16 Nov 2015, Daniel J Blueman wrote:
>
>> Tuning USB suspend [1] in 4.3 on a Dell XPS 15 9553 (Skylake), I see a
>> kworker thread spinning in rpm_suspend [2].
>>
>> What is the most useful debug to get here beyond the immediate [3]?
>
> You can try doing:
>
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control

kworker and ksoftirqd spinning occurs when I echo 'auto' to all the
USB control entries. Using Alan's excellent tip, we see this being
logged repeatedly at a high rate:
[  353.245180] usb usb1-port4: status 0107 change 
[  353.245194] usb usb1-port12: status 0507 change 
[  353.245202] hub 1-0:1.0: state 7 ports 16 chg  evt 
[  353.245203] hub 1-0:1.0: hub_suspend
[  353.245205] usb usb1: bus auto-suspend, wakeup 1
[  353.245206] usb usb1: bus suspend fail, err -16
[  353.245207] hub 1-0:1.0: hub_resume
...
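
The status words in the log above can be read against the USB 2.0 hub
port status bits. What follows is only a minimal sketch, assuming usb1
is the USB 2.0 root hub (as the lsusb listing in [1] suggests) so that
the USB 2.0 bit layout applies. The bit values are the ones named
USB_PORT_STAT_* in the kernel's include/uapi/linux/usb/ch11.h;
decode_port_status() and suspend_err_is_ebusy() are hypothetical helper
names, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* USB 2.0 hub wPortStatus bits (values as in linux/usb/ch11.h). */
#define USB_PORT_STAT_CONNECTION  0x0001
#define USB_PORT_STAT_ENABLE      0x0002
#define USB_PORT_STAT_SUSPEND     0x0004
#define USB_PORT_STAT_OVERCURRENT 0x0008
#define USB_PORT_STAT_RESET       0x0010
#define USB_PORT_STAT_POWER       0x0100
#define USB_PORT_STAT_LOW_SPEED   0x0200
#define USB_PORT_STAT_HIGH_SPEED  0x0400

/* Hypothetical helper: write a space-separated list of the flags set in
 * 'status' into 'buf' (capacity 'len') and return 'buf'. */
static const char *decode_port_status(unsigned int status, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} flags[] = {
		{ USB_PORT_STAT_CONNECTION,  "connection" },
		{ USB_PORT_STAT_ENABLE,      "enable" },
		{ USB_PORT_STAT_SUSPEND,     "suspend" },
		{ USB_PORT_STAT_OVERCURRENT, "overcurrent" },
		{ USB_PORT_STAT_RESET,       "reset" },
		{ USB_PORT_STAT_POWER,       "power" },
		{ USB_PORT_STAT_LOW_SPEED,   "low-speed" },
		{ USB_PORT_STAT_HIGH_SPEED,  "high-speed" },
	};
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		if (!(status & flags[i].bit))
			continue;
		/* Separate names with a single space, never overrunning buf. */
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, flags[i].name, len - strlen(buf) - 1);
	}
	return buf;
}

/* Hypothetical helper: "err -16" in the log is a negated errno value. */
static int suspend_err_is_ebusy(int err)
{
	return err == -EBUSY;
}
```

Under these assumptions, status 0507 reads as connected, enabled,
suspended, powered and high-speed, 0107 is the same without a speed bit
(full speed), and err -16 is -EBUSY on Linux.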

So, EBUSY. The webcam is not open and the bluetooth interface
[1] is rfkill'd; the situation occurs even if I unload all related
modules.

What further debug would be useful?

Thanks!
  Daniel

-- [1]

Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 002: ID 0a5c:6410 Broadcom Corp.
Bus 001 Device 003: ID 1bcf:2b95 Sunplus Innovation Technology Inc.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
-- 
Daniel J Blueman


Re: [lkp] [x86/numachip] db1003a719: BUG: kernel early-boot hang

2015-11-11 Thread Daniel J Blueman

Hi Ying Huang,

On Tue, Nov 10, 2015 at 6:12 AM, kernel test robot wrote:
> FYI, we noticed the below changes on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit db1003a719d75cebe5843a7906c02c29bec9922c ("x86/numachip: Cleanup Numachip support")
>
> Elapsed time: 210
> BUG: kernel early-boot hang
> Linux version 4.3.0-rc2-1-gdb1003a #1
> Command line: root=/dev/ram0 user=lkp
> job=/lkp/scheduled/vm-intel12-yocto-x86_64-14/bisect_boot-1-yocto-minimal-x86_64.cgz-x86_64-allyesdebian-db1003a719d75cebe5843a7906c02c29bec9922c-20151107-100037-1jb4qfh-1.yaml
> ARCH=x86_64 kconfig=x86_64-allyesdebian
> branch=sergeh-security/2015-11-05/cgroupns
> commit=db1003a719d75cebe5843a7906c02c29bec9922c
> BOOT_IMAGE=/pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a
> max_uptime=600
> RESULT_ROOT=/result/boot/1/vm-intel12-yocto-x86_64/yocto-minimal-x86_64.cgz/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/0
> LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug
> apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
> panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic
> load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0
> vga=normal rw ip=vm-intel12-yocto-x86_64-14::dhcp
> drbd.minor_count=8
> qemu-system-x86_64 -enable-kvm -cpu Nehalem -kernel
> /pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a
> -append 'root=/dev/ram0 user=lkp
> job=/lkp/scheduled/vm-intel12-yocto-x86_64-14/bisect_boot-1-yocto-minimal-x86_64.cgz-x86_64-allyesdebian-db1003a719d75cebe5843a7906c02c29bec9922c-20151107-100037-1jb4qfh-1.yaml
> ARCH=x86_64 kconfig=x86_64-allyesdebian
> branch=sergeh-security/2015-11-05/cgroupns
> commit=db1003a719d75cebe5843a7906c02c29bec9922c
> BOOT_IMAGE=/pkg/linux/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/vmlinuz-4.3.0-rc2-1-gdb1003a
> max_uptime=600
> RESULT_ROOT=/result/boot/1/vm-intel12-yocto-x86_64/yocto-minimal-x86_64.cgz/x86_64-allyesdebian/gcc-5/db1003a719d75cebe5843a7906c02c29bec9922c/0
> LKP_SERVER=inn earlyprintk=ttyS0,115200 systemd.log_level=err debug
> apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
> panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic
> load_ramdisk=2 prompt_ramdisk=0 console=ttyS0,115200 console=tty0
> vga=normal rw ip=vm-intel12-yocto-x86_64-14::dhcp
> drbd.minor_count=8'  -initrd
> /fs/KVM/initrd-vm-intel12-yocto-x86_64-14 -m 832 -smp 2 -device
> e1000,netdev=net0 -netdev user,id=net0 -boot order=nc -no-reboot
> -watchdog i6300esb -rtc base=localtime -drive
> file=/fs/KVM/disk0-vm-intel12-yocto-x86_64-14,media=disk,if=virtio
> -drive
> file=/fs/KVM/disk1-vm-intel12-yocto-x86_64-14,media=disk,if=virtio
> -pidfile /dev/shm/kboot/pid-vm-intel12-yocto-x86_64-14 -serial
> file:/dev/shm/kboot/serial-vm-intel12-yocto-x86_64-14 -daemonize
> -display none -monitor null


Neat, however checking out the same kernel tree at "db1003a 
x86/numachip: Cleanup Numachip support", building with the same config 
(though with GCC 5.2.1), it boots just peachy with the same args.


The patch itself is conservative, so I can't see how it could cause 
early boot hangs. Have you seen this kind of issue before, or is this 
the first time?


Thanks!
 Daniel



Re: [PATCH 3/3] x86/apic: Use smaller array for __apicid_to_node[] mapping

2015-10-12 Thread Daniel J Blueman
On Fri, Oct 9, 2015 at 11:35 PM, Jiang Liu <jiang@linux.intel.com> wrote:
> On 2015/10/3 3:12, Denys Vlasenko wrote:
>> From: Daniel J Blueman <dan...@numascale.com>
>>
>> The Intel x2APIC spec states the upper 16-bits of APIC ID is the
>> cluster ID [1, p2-12], intended for future distributed systems. Beyond
>> the legacy 8-bit APIC ID, Numascale NumaConnect uses 4-bits for the
>> position of a server on each axis of a multi-dimension torus; SGI
>> NUMAlink also structures the APIC ID space.
>>
>> Instead, define an array based on NR_CPUs to achieve a 1:1 mapping and
>> perform linear search; this addresses the binary bloat and the present
>> artificial APIC ID limits. With CONFIG_NR_CPUS=256:
>>
>> $ size vmlinux vmlinux-patched
>>     text    data     bss      dec     hex filename
>> 18232877 1849656 2281472 22364005 1553f65 vmlinux
>> 18233034 1786168 2281472 22300674 1544802 vmlinux-patched
>>
>> That is, ~64 kbytes less data.
>>
>> Works peachy on a 256-core system with a 20-bit APIC ID space, and on a
>> 48-core legacy 8-bit APIC ID system. If we care, I can make
>> numa_cpu_node O(1) lookup for typical cases.
>>
>> Signed-off-by: Daniel J Blueman <dan...@numascale.com>
>> CC: Ingo Molnar <mi...@kernel.org>
>> CC: Daniel J Blueman <dan...@numascale.com>
>> CC: Jiang Liu <jiang@linux.intel.com>
>> CC: Thomas Gleixner <t...@linutronix.de>
>> CC: Len Brown <len.br...@intel.com>
>> CC: x...@kernel.org
>> CC: linux-kernel@vger.kernel.org
>>
>> [1] http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf
>>
>> ---
>>
>> I added forgotten change in arch/x86/mm/numa_emulation.c (Denys)
>>
>>  arch/x86/include/asm/numa.h  | 13 +++--
>>  arch/x86/kernel/cpu/amd.c    |  8
>>  arch/x86/mm/numa.c           | 31 +++
>>  arch/x86/mm/numa_emulation.c |  6 +++---
>>  4 files changed, 37 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
>> index c2ecfd0..33becb8 100644
>> --- a/arch/x86/include/asm/numa.h
>> +++ b/arch/x86/include/asm/numa.h
>> @@ -17,6 +17,11 @@
>>   */
>>  #define NODE_MIN_SIZE (4*1024*1024)
>>
>> +struct apicid_to_node {
>> +  int apicid;
>> +  s16 node;
>> +};
>> +
>>  extern int numa_off;
>>
>>  /*
>> @@ -27,17 +32,13 @@ extern int numa_off;
>>   * should be accessed by the accessors - set_apicid_to_node() and
>>   * numa_cpu_node().
>>   */
>> -extern s16 __apicid_to_node[MAX_LOCAL_APICID];
>> +extern struct apicid_to_node __apicid_to_node[NR_CPUS];
>
> Hi Denys and Daniel,
> I still have some concerns about limiting the array to NR_CPUS.
> __apicid_to_node are populated according to the order that CPUs are
> listed in ACPI SRAT table. And CPU IDs are allocated according to the
> order that CPUs are listed in ACPI MADT(APIC) order. So it may cause
> trouble if:
> 1) system has more than NR_CPUS CPUs
> 2) CPUs are listed in different order in SRAT and MADT tables.
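
The first concern can be illustrated with a small userspace model of
the proposed table. This is only a sketch under stated assumptions:
NR_CPUS is shrunk to 4 for the demo, and model_reset(),
model_set_apicid_to_node(), model_numa_cpu_node() and the -1 return on
a full table are hypothetical names and behaviour chosen for
illustration, not the code in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS      4     /* deliberately tiny; hypothetical demo value */
#define NUMA_NO_NODE (-1)

struct apicid_to_node {
	int apicid;
	int16_t node;
};

static struct apicid_to_node table[NR_CPUS];
static int table_used;

/* Forget all recorded mappings. */
static void model_reset(void)
{
	table_used = 0;
}

/* Record apicid -> node. Returns 0 on success, or -1 when every slot is
 * taken, which models an SRAT that lists more than NR_CPUS CPUs. */
static int model_set_apicid_to_node(int apicid, int16_t node)
{
	int i;

	assert(apicid >= 0);
	for (i = 0; i < table_used; i++) {
		if (table[i].apicid == apicid) {
			table[i].node = node;   /* update an existing entry */
			return 0;
		}
	}
	if (table_used == NR_CPUS)
		return -1;                      /* mapping is dropped */
	table[table_used].apicid = apicid;
	table[table_used].node = node;
	table_used++;
	return 0;
}

/* Linear search, as the patch description proposes. */
static int16_t model_numa_cpu_node(int apicid)
{
	int i;

	for (i = 0; i < table_used; i++)
		if (table[i].apicid == apicid)
			return table[i].node;
	return NUMA_NO_NODE;
}
```

In this model a fifth distinct APIC ID is rejected and later looks up
as NUMA_NO_NODE, regardless of whether that CPU is one the kernel
actually brings up.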


Another approach, which may be suitable without moving SRAT parsing to 
after the memory allocator is up, is to exploit the associativity of 
the bottom APIC ID bits.


We'd have a searchable static array based on NUMA_SHIFT and use the 
bit-shift encoded in the MSRs. That said, this may run into the issue 
Jiang cited, albeit with CONFIG_NUMA_SHIFT. Perhaps the constraints or 
risk of restructuring SRAT parsing aren't worth the payoff?


Finally, the only other alternative: as the current mapping is 
initialised in numa_init, we can drop the static initialisation and 
move the 64KB to the BSS to avoid bloating the binary image, but this 
may not achieve the initial goal of runtime footprint reduction.
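
As a rough check of the 64KB figure, the two table sizes can be
computed directly. This is a sketch under assumptions not stated in the
thread: that MAX_LOCAL_APICID is 32768 entries on 64-bit x86, and that
int is 4 bytes so the new struct pads to 8 bytes. The macro and
function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_MAX_LOCAL_APICID 32768  /* assumption, not from the thread */
#define ASSUMED_NR_CPUS          256    /* CONFIG_NR_CPUS used in the patch text */

struct apicid_to_node {
	int apicid;
	int16_t node;
};

/* Size of the original s16 array indexed by APIC ID. */
static size_t old_table_bytes(void)
{
	return ASSUMED_MAX_LOCAL_APICID * sizeof(int16_t);
}

/* Size of the proposed array of (apicid, node) pairs. */
static size_t new_table_bytes(void)
{
	return ASSUMED_NR_CPUS * sizeof(struct apicid_to_node);
}

/* Bytes saved by switching tables. */
static size_t table_savings_bytes(void)
{
	assert(old_table_bytes() >= new_table_bytes());
	return old_table_bytes() - new_table_bytes();
}

/* Difference in the "data" column of the quoted size output. */
static long reported_data_delta(void)
{
	return 1849656L - 1786168L;
}
```

With these assumptions the old table is 65536 bytes and the new one
2048 bytes, and the difference equals the data delta in the size
output, which is where the 64KB comes from.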


Daniel



Re: igb: do not re-init SR-IOV during probe

2015-10-05 Thread Daniel J Blueman
It would be great if the patch "igb: do not re-init SR-IOV during 
probe" [1] can be backported from 4.3-rc to stable kernels, since it 
fixes the regression introduced by "igb: do a reset on SR-IOV re-init 
if device is down" [2].


The regression was introduced in 3.16 and can isolate the IPMI 
interface on servers with 82576 NICs if using shared mode (high 
impact), around 0.5% of times booted.


Many thanks!
 Daniel

-- [1]

commit 6423fc34160939142d72ffeaa2db6408317f54df
Author: Stefan Assmann 
Date:   Fri Jul 10 15:01:12 2015 +0200

   igb: do not re-init SR-IOV during probe

   During driver probing the following code path is triggered.
   igb_probe
   ->igb_sw_init
 ->igb_probe_vfs
   ->igb_pci_enable_sriov
 ->igb_sriov_reinit

   Doing the SR-IOV re-init is not necessary during probing since we're
   starting from scratch. Here we can call igb_enable_sriov() right 
away.


   Running igb_sriov_reinit() during igb_probe() also seems to cause
   occasional packet loss on some onboard 82576 NICs. Reproduced on
   Dell and HP servers with onboard 82576 NICs.
   Example:
   Intel Corporation 82576 Gigabit Network Connection [8086:10c9] (rev 
01)

   Subsystem: Dell Device [1028:0481]

   Signed-off-by: Stefan Assmann 
   Tested-by: Aaron Brown 
   Signed-off-by: Jeff Kirsher 

-- [2]

commit 76252723e88681628a3dbb9c09c963e095476f73
Author: Stefan Assmann 
Date:   Thu Jul 10 03:29:39 2014 -0700

  igb: do a reset on SR-IOV re-init if device is down

  To properly re-initialize SR-IOV it is necessary to reset the device
  even if it is already down. Not doing this may result in Tx unit hangs.

  Cc: stable 
  Signed-off-by: Stefan Assmann 
  Tested-by: Aaron Brown 
  Signed-off-by: Jeff Kirsher 
  Signed-off-by: David S. Miller 




[PATCH v2] x86/apic: Use smaller array for __apicid_to_node[] mapping

2015-10-04 Thread Daniel J Blueman
The Intel x2APIC spec states the upper 16-bits of APIC ID is the
cluster ID [1, p2-12], intended for future distributed systems. Beyond
the legacy 8-bit APIC ID, Numascale NumaConnect uses 4-bits for the
position of a server on each axis of a multi-dimension torus; SGI
NUMAlink also structures the APIC ID space.

Rather than sizing __apicid_to_node[] by MAX_LOCAL_APIC, define an array
based on NR_CPUS to achieve a 1:1 mapping and perform a linear search.
When config-limited, we see "ACPI: NR_CPUS/possible_cpus limit of X
reached.  Processor 8/0x16 ignored.". This addresses the binary bloat
and the present artificial APIC ID limits. With CONFIG_NR_CPUS=256, we
save ~64KB of vmlinux data:

$ size vmlinux vmlinux-patched
    text    data     bss      dec     hex filename
18232877 1849656 2281472 22364005 1553f65 vmlinux
18233034 1786168 2281472 22300674 1544802 vmlinux-patched

Tested on a 256-core system with a 20-bit APIC ID space, and on a
48-core legacy 8-bit APIC ID system with and without CONFIG_NUMA,
CONFIG_NUMA_EMU and CONFIG_AMD_NUMA.
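
The data-segment saving reported by size(1) can be cross-checked by
hand. This is a minimal sketch, not part of the patch. It assumes the
pre-patch MAX_LOCAL_APIC is 32*1024 (the value quoted in the related RFC
thread), that s16 is 2 bytes, and that the compiler pads
struct apicid_to_node to 8 bytes. The macro and helper names are local
stand-ins chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the kernel definitions (assumed values) */
#define OLD_MAX_LOCAL_APIC (32 * 1024)	/* assumed pre-patch MAX_LOCAL_APIC */
#define CFG_NR_CPUS        256		/* CONFIG_NR_CPUS of the test build */

struct apicid_to_node {
	int apicid;
	int16_t node;	/* s16 in the kernel */
};

/* Old table: one s16 per possible APIC ID */
static size_t old_table_bytes(void)
{
	return (size_t)OLD_MAX_LOCAL_APIC * sizeof(int16_t);
}

/* New table: one {apicid, node} pair per configured CPU */
static size_t new_table_bytes(void)
{
	return (size_t)CFG_NR_CPUS * sizeof(struct apicid_to_node);
}

/* Expected reduction in initialised data */
static size_t table_bytes_saved(void)
{
	return old_table_bytes() - new_table_bytes();
}
```

With a typical x86-64 ABI this works out to 65536 - 2048 = 63488 bytes,
which equals the difference between the two data columns
(1849656 - 1786168).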

v2: Improved readability by moving static variable out; integrated Denys's
numa emulation fix

Signed-off-by: Daniel J Blueman 
CC: Denys Vlasenko 
CC: Ingo Molnar 
CC: Thomas Gleixner 
CC: Jiang Liu 
CC: Len Brown 
CC: Steffen Persvold 
CC: linux-kernel@vger.kernel.org
CC: x...@kernel.org

[1] 
http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf
---
 arch/x86/include/asm/numa.h  | 13 +++--
 arch/x86/kernel/cpu/amd.c| 11 ++-
 arch/x86/mm/numa.c   | 29 +
 arch/x86/mm/numa_emulation.c |  6 +++---
 4 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 01b493e..33becb8 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -17,6 +17,11 @@
  */
 #define NODE_MIN_SIZE (4*1024*1024)
 
+struct apicid_to_node {
+   int apicid;
+   s16 node;
+};
+
 extern int numa_off;
 
 /*
@@ -27,17 +32,13 @@ extern int numa_off;
  * should be accessed by the accessors - set_apicid_to_node() and
  * numa_cpu_node().
  */
-extern s16 __apicid_to_node[MAX_LOCAL_APIC];
+extern struct apicid_to_node __apicid_to_node[NR_CPUS];
 extern nodemask_t numa_nodes_parsed __initdata;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
 
-static inline void set_apicid_to_node(int apicid, s16 node)
-{
-   __apicid_to_node[apicid] = node;
-}
-
+extern void set_apicid_to_node(int apicid, s16 node);
 extern int numa_cpu_node(int cpu);
 
 #else  /* CONFIG_NUMA */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 4a70fc6..9494f0e 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -277,12 +277,13 @@ static int nearby_node(int apicid)
int i, node;
 
for (i = apicid - 1; i >= 0; i--) {
-   node = __apicid_to_node[i];
+   node = __apicid_to_node[i].node;
if (node != NUMA_NO_NODE && node_online(node))
return node;
}
-   for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
-   node = __apicid_to_node[i];
+   for (i = apicid + 1; i < NR_CPUS; i++) {
+   node = __apicid_to_node[i].node;
+
if (node != NUMA_NO_NODE && node_online(node))
return node;
}
@@ -422,8 +423,8 @@ static void srat_detect_node(struct cpuinfo_x86 *c)
int ht_nodeid = c->initial_apicid;
 
if (ht_nodeid >= 0 &&
-   __apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
-   node = __apicid_to_node[ht_nodeid];
+   __apicid_to_node[ht_nodeid].node != NUMA_NO_NODE)
+   node = __apicid_to_node[ht_nodeid].node;
/* Pick a nearby node */
if (!node_online(node))
node = nearby_node(apicid);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c3b3f65..849a113 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -26,6 +26,7 @@ nodemask_t numa_nodes_parsed __initdata;
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
+static unsigned apicids;
 static struct numa_meminfo numa_meminfo
 #ifndef CONFIG_MEMORY_HOTPLUG
 __initdata
@@ -56,16 +57,31 @@ early_param("numa", numa_setup);
 /*
  * apicid, cpu, node mappings
  */
-s16 __apicid_to_node[MAX_LOCAL_APIC] = {
-   [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
+struct apicid_to_node __apicid_to_node[NR_CPUS] = {
+   [0 ... NR_CPUS-1] = {-1, NUMA_NO_NODE}
 };
 
+void set_apicid_to_node(int apicid, s16 node)
+{
+   /* Protect against small kernel on large system */
+   if (apicids >= NR_CPUS)
+   return;
+
+   __apicid_to_node[apicids].apicid = apicid;
+   __apicid_to_node[apicid

Re: [PATCH RFC] x86: Reduce MAX_LOCAL_APIC and MAX_IO_APICS

2015-10-02 Thread Daniel J Blueman
On Saturday, September 26, 2015 at 4:40:07 AM UTC+8, Denys Vlasenko wrote:

> Before this change MAX_LOCAL_APIC had the fixed value of 32*1024.
> Such a big value causes several data arrays to be quite oversized:
>
> phys_cpu_present_map is 4 kbytes (one bit per apic id),
> __apicid_to_node[] is 64 kbytes,
> apic_version[] is 128 kbytes.
>
> On "usual" systems, APIC ids simply go from zero
> to maximum logical CPU number, mirroring CPU ids.
>
> On broken and unusual multi-socket systems
> APIC ids can be non-contiguous.

The Intel x2APIC spec states the upper 16-bits of APIC ID is the 
cluster ID [1, p2-12], intended for future distributed systems. Beyond 
the legacy 8-bit APIC ID, Numascale NumaConnect uses 4-bits for the 
position of a server on each axis of a multi-dimension torus; SGI 
NUMAlink also structures the APIC ID space.


Instead, define an array based on NR_CPUS to achieve a 1:1 mapping and
perform a linear search; this addresses the binary bloat and the present
artificial APIC ID limits. With CONFIG_NR_CPUS=256:


$ size vmlinux vmlinux-patched
    text    data     bss      dec     hex filename
18232877 1849656 2281472 22364005 1553f65 vmlinux
18233034 1786168 2281472 22300674 1544802 vmlinux-patched

Works peachy on a 256-core system with a 20-bit APIC ID space, and on a 
48-core legacy 8-bit APIC ID system. If we care, I can make 
numa_cpu_node O(1) lookup for typical cases.
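
An O(1) lookup for typical cases might look roughly like the sketch
below. This is an illustration under assumptions, not the posted patch:
it assumes that on "usual" systems entries are appended in APIC ID
order, so the entry index equals the APIC ID. That slot is tried first
and the linear search is the fallback for sparse APIC ID spaces.
TABLE_SIZE, NO_NODE, table_set() and table_lookup() are hypothetical
names standing in for NR_CPUS, NUMA_NO_NODE and the kernel accessors.

```c
#include <stdint.h>

#define TABLE_SIZE 256		/* stands in for NR_CPUS (assumption) */
#define NO_NODE    (-1)		/* stands in for NUMA_NO_NODE */

struct apicid_to_node {
	int apicid;
	int16_t node;
};

static struct apicid_to_node table[TABLE_SIZE];
static unsigned int entries;	/* number of slots filled so far */

/* Record a mapping, updating in place if the APIC ID is already known */
static void table_set(int apicid, int16_t node)
{
	unsigned int i;

	for (i = 0; i < entries; i++) {
		if (table[i].apicid == apicid) {
			table[i].node = node;
			return;
		}
	}

	/* Protect against more APIC IDs than slots */
	if (entries >= TABLE_SIZE)
		return;

	table[entries].apicid = apicid;
	table[entries].node = node;
	entries++;
}

/* Return the node for an APIC ID, or NO_NODE when it is unknown */
static int table_lookup(int apicid)
{
	unsigned int i;

	/* Fast path: contiguous APIC IDs sit in the slot of the same index */
	if (apicid >= 0 && (unsigned int)apicid < entries &&
	    table[apicid].apicid == apicid)
		return table[apicid].node;

	/* Slow path: sparse APIC ID space, scan the filled slots */
	for (i = 0; i < entries; i++)
		if (table[i].apicid == apicid)
			return table[i].node;

	return NO_NODE;
}
```

The fast path costs one bounds check and one compare, so systems with
structured APIC IDs only pay for the scan on entries that miss it.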


Signed-off-by: Daniel J Blueman 

Daniel

[1] 
http://www.intel.com/content/dam/doc/specification-update/64-architecture-x2apic-specification.pdf


---
arch/x86/include/asm/numa.h | 13 +++--
arch/x86/kernel/cpu/amd.c   |  8 
arch/x86/mm/numa.c  | 31 +++
3 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 01b493e..33becb8 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -17,6 +17,11 @@
*/
#define NODE_MIN_SIZE (4*1024*1024)

+struct apicid_to_node {
+   int apicid;
+   s16 node;
+};
+
extern int numa_off;

/*
@@ -27,17 +32,13 @@ extern int numa_off;
* should be accessed by the accessors - set_apicid_to_node() and
* numa_cpu_node().
*/
-extern s16 __apicid_to_node[MAX_LOCAL_APIC];
+extern struct apicid_to_node __apicid_to_node[NR_CPUS];
extern nodemask_t numa_nodes_parsed __initdata;

extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);

-static inline void set_apicid_to_node(int apicid, s16 node)
-{
-   __apicid_to_node[apicid] = node;
-}
-
+extern void set_apicid_to_node(int apicid, s16 node);
extern int numa_cpu_node(int cpu);

#else  /* CONFIG_NUMA */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 4a70fc6..e65c01c 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -277,12 +277,12 @@ static int nearby_node(int apicid)
  int i, node;

  for (i = apicid - 1; i >= 0; i--) {
-   node = __apicid_to_node[i];
+   node = __apicid_to_node[i].node;
  if (node != NUMA_NO_NODE && node_online(node))
  return node;
  }
  for (i = apicid + 1; i < MAX_LOCAL_APIC; i++) {
-   node = __apicid_to_node[i];
+   node = __apicid_to_node[i].node;
  if (node != NUMA_NO_NODE && node_online(node))
  return node;
  }
@@ -422,8 +422,8 @@ static void srat_detect_node(struct cpuinfo_x86 *c)
  int ht_nodeid = c->initial_apicid;

  if (ht_nodeid >= 0 &&
-   __apicid_to_node[ht_nodeid] != NUMA_NO_NODE)
-   node = __apicid_to_node[ht_nodeid];
+   __apicid_to_node[ht_nodeid].node != NUMA_NO_NODE)
+   node = __apicid_to_node[ht_nodeid].node;
  /* Pick a nearby node */
  if (!node_online(node))
  node = nearby_node(apicid);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c3b3f65..70f03a0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -56,16 +56,34 @@ early_param("numa", numa_setup);
/*
* apicid, cpu, node mappings
*/
-s16 __apicid_to_node[MAX_LOCAL_APIC] = {
-   [0 ... MAX_LOCAL_APIC-1] = NUMA_NO_NODE
+
+struct apicid_to_node __apicid_to_node[NR_CPUS] = {
+   [0 ... NR_CPUS-1] = {-1, NUMA_NO_NODE}
};

+void set_apicid_to_node(int apicid, s16 node)
+{
+   static int ent;
+
+   /* Protect against small kernel on large system */
+   if (ent >= NR_CPUS)
+   return;
+
+   __apicid_to_node[ent].apicid = apicid;
+   __apicid_to_node[ent].node = node;
+   ent++;
+}
+
int numa_cpu_node(int cpu)
{
-   int apicid = early_per_cpu(x86_cpu_to_apicid, cpu);
+   int ent, apicid = early_per_cpu(x86_cpu_to_apicid, 

Re: [RFC] PCI: Unassigned Expansion ROM BARs

2015-09-26 Thread Daniel J Blueman
On Thursday, September 24, 2015 at 10:50:07 AM UTC+8, Myron Stowe wrote:
> I've encountered numerous bugzilla reports related to platform BIOS' not
> programming valid values into a PCI device's Type 0 Configuration space
> "Expansion ROM Base Address" field (a.k.a. Expansion ROM BAR).  The main
> observed consequence being 'dmesg' entries like the following that get
> customers excited enough to file reports against the kernel.

PCI option ROMs legitimately hold real-mode/EFI code needed to
initialise devices; the problem is, we can't guarantee that the BIOS
has initialised all devices with the option ROM code, so linux must
ensure they are correctly accessible.

In addition to VMs as Alex points out, hotplug (e.g. Thunderbolt GPUs)
and PCI domains which may not be visible to the BIOS at early boot,
may need the option ROM. Nvidia GPUs primarily have had a lot of
encoder/connector (HDCP?) and product-specific voltage-frequency setup
code and tables in the ROM.
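
For reference, the real-mode and EFI images in an option ROM can be
told apart by the code-type byte in each image's PCI Data Structure.
The following is a minimal user-space sketch, assuming the standard
expansion ROM layout (0x55 0xAA signature, 16-bit little-endian pointer
to the "PCIR" structure at offset 0x18, code type at PCIR offset 0x14).
rom_image_code_type() is a hypothetical helper, not a kernel API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROM_CODE_TYPE_X86 0x00	/* legacy x86 (real-mode) image */
#define ROM_CODE_TYPE_EFI 0x03	/* EFI image */

/*
 * Return the code type of the ROM image at 'rom', or -1 if the
 * buffer does not hold a well-formed image header.
 */
static int rom_image_code_type(const uint8_t *rom, size_t len)
{
	size_t pcir;

	if (len < 0x1a || rom[0] != 0x55 || rom[1] != 0xaa)
		return -1;

	/* Little-endian pointer to the PCI Data Structure */
	pcir = (size_t)rom[0x18] | ((size_t)rom[0x19] << 8);
	if (pcir > len || len - pcir < 0x18)
		return -1;

	if (memcmp(rom + pcir, "PCIR", 4) != 0)
		return -1;

	return rom[pcir + 0x14];
}
```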

As such, in my NumaConnect open firmware which maps the PCI domains of
multiple servers into one, I have to also reallocate PCI option ROMs
[1] to guarantee GPU VBIOS execution in linux. That said, option ROMs
are a dying trend in favour of shipped binary blobs and open-coded
initialisation for cross-platform support, and there are only 10 users
of pci_map_rom().

Thanks,
  Daniel

[1] https://github.com/numascale/nc-utils/blob/master/bootloader/dnc-mmio.c
-- 
Daniel J Blueman


[tip:x86/apic] x86/numachip: Add Numachip IPI optimisations

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  ad03a9c25d258641556c7198e26fd882c741987a
Gitweb: http://git.kernel.org/tip/ad03a9c25d258641556c7198e26fd882c741987a
Author: Daniel J Blueman 
AuthorDate: Mon, 21 Sep 2015 01:02:01 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 22 Sep 2015 22:25:33 +0200

x86/numachip: Add Numachip IPI optimisations

When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.
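
The locality test can be sketched in isolation as follows. This is a
stand-alone illustration only: LAPIC_BITS mirrors the
NUMACHIP_LAPIC_BITS value of 8 added by this patch, and
apicid_is_local() is a hypothetical helper name.

```c
#include <stdbool.h>

#define LAPIC_BITS 8	/* mirrors NUMACHIP_LAPIC_BITS from the patch */

/* True when both APIC IDs share the same bits above the local part */
static bool apicid_is_local(unsigned int apicid, unsigned int local_apicid)
{
	return ((apicid ^ local_apicid) >> LAPIC_BITS) == 0;
}
```

XOR leaves set bits only where the two IDs differ, so shifting out the
low 8 bits leaves zero exactly when the non-local parts are equal.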

Secondly, since the AMD BIOS-kernel developer guide states IPI delivery
will occur regardless of prior delivery status, avoid polling the
delivery status bit for efficiency.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
Cc: Daniel Lezcano 
Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/include/asm/numachip/numachip_csr.h |  1 +
 arch/x86/kernel/apic/apic_numachip.c | 37 
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index e08b803..e09d845 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -34,6 +34,7 @@
 #define NUMACHIP_LCSR_BASE 0x3e00ULL
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
+#define NUMACHIP_LAPIC_BITS 8
 
 static inline void *lcsr_address(unsigned long offset)
 {
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 3cb9294..38dd5ef 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -96,9 +96,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 
 static void numachip_send_IPI_one(int cpu, int vector)
 {
-	int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	unsigned int dmode;
 
+	preempt_disable();
+	local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+	/* Send via local APIC where non-local part matches */
+	if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__default_send_IPI_dest_field(apicid, vector,
+			APIC_DEST_PHYSICAL);
+		local_irq_restore(flags);
+		preempt_enable();
+		return;
+	}
+	preempt_enable();
+
 	dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
 	numachip_apic_icr_write(apicid, dmode | vector);
 }
@@ -218,6 +234,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
 	return 1;
 }
 
+/* APIC IPIs are queued */
+static void numachip_apic_wait_icr_idle(void)
+{
+}
+
+/* APIC NMI IPIs are queued */
+static u32 numachip_safe_apic_wait_icr_idle(void)
+{
+   return 0;
+}
+
 static const struct apic apic_numachip1 __refconst = {
.name   = "NumaConnect system",
.probe  = numachip1_probe,
@@ -263,8 +290,8 @@ static const struct apic apic_numachip1 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip1);
@@ -314,8 +341,8 @@ static const struct apic apic_numachip2 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip2);


[tip:x86/apic] x86/numachip: Introduce Numachip2 timer mechanisms

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7
Gitweb: http://git.kernel.org/tip/ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7
Author: Daniel J Blueman 
AuthorDate: Mon, 21 Sep 2015 18:02:25 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 22 Sep 2015 22:25:33 +0200

x86/numachip: Introduce Numachip2 timer mechanisms

Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.
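
Because the counter runs at 1GHz, one cycle corresponds to one
nanosecond, which is why the clocksource in this patch can use mult = 1
and shift = 0. A minimal sketch of the scaled conversion, written as a
stand-alone function for illustration (cyc2ns() here is a local name,
not the kernel helper itself):

```c
#include <stdint.h>

/* Scaled-math conversion: ns = (cycles * mult) >> shift */
static uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With mult = 1 and shift = 0 the conversion is the identity, so a raw
counter delta can be read directly as nanoseconds; a slower counter
would need a larger mult/shift pair to express its period.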

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.

[ tglx: Taking it through x86 due to dependencies ]

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
Cc: Daniel Lezcano 
Link: http://lkml.kernel.org/r/1442829745-29311-1-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/include/asm/numachip/numachip_csr.h |  9 +++
 drivers/clocksource/Makefile |  1 +
 drivers/clocksource/numachip.c   | 95 
 3 files changed, 105 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR    0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
 	writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..088e5fa
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   /* Setup IPI vector to local core and relative timing mode */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_

[tip:x86/apic] x86/numachip: Add Numachip2 APIC support

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  d9d4dee6cedfa17e5eedcba242dca3091bf73bc3
Gitweb: http://git.kernel.org/tip/d9d4dee6cedfa17e5eedcba242dca3091bf73bc3
Author: Daniel J Blueman <dan...@numascale.com>
AuthorDate: Mon, 21 Sep 2015 01:02:00 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 22 Sep 2015 22:25:33 +0200

x86/numachip: Add Numachip2 APIC support

Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature.

Access is performed via a fixed mapping in the x86 physical address space.
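
As a rough illustration of the fixed mapping, the sketch below folds an
offset into a local CSR window in the same way the helper in this patch
does (base | (offset & (size - 1))). The names EXAMPLE_LCSR_BASE,
EXAMPLE_LCSR_SIZE and example_lcsr_phys() are hypothetical, and the base
value is only an assumed placeholder for "a 16MB window below 4G"; the
kernel helper additionally converts the result with __va(), which is
left out of this user-space sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed placeholder values for illustration: a 16MB window below 4G. */
#define EXAMPLE_LCSR_BASE 0xf0000000ULL
#define EXAMPLE_LCSR_SIZE 0x1000000ULL

/* Fold an offset into the window: base | (offset & (size - 1)). */
static uint64_t example_lcsr_phys(uint64_t offset)
{
	/* Masking is only valid for a power-of-two size and an aligned base. */
	assert((EXAMPLE_LCSR_SIZE & (EXAMPLE_LCSR_SIZE - 1)) == 0);
	assert((EXAMPLE_LCSR_BASE & (EXAMPLE_LCSR_SIZE - 1)) == 0);

	return EXAMPLE_LCSR_BASE | (offset & (EXAMPLE_LCSR_SIZE - 1));
}
```

Offsets beyond the window size wrap back into it rather than escaping
the mapping, which is the point of the mask.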

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
Cc: Daniel Lezcano <daniel.lezc...@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-2-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/include/asm/numachip/numachip.h |  1 +
 arch/x86/include/asm/numachip/numachip_csr.h | 35 +++
 arch/x86/kernel/apic/apic_numachip.c | 93 
 3 files changed, 129 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip.h b/arch/x86/include/asm/numachip/numachip.h
index 1c6f7f6..c64373a 100644
--- a/arch/x86/include/asm/numachip/numachip.h
+++ b/arch/x86/include/asm/numachip/numachip.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_H
 
+extern u8 numachip_system;
 extern int __init pci_numachip_init(void);
 
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index 7469b13..e08b803 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
+#include 
 #include 
 
 #define CSR_NODE_SHIFT 16
@@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
writel(swab32(val), lcsr_address(offset));
 }
 
+/*
+ * On NumaChip2, local CSR space is 16MB and starts at fixed offset below 4G
+ */
+
+#define NUMACHIP2_LCSR_BASE   0xf000UL
+#define NUMACHIP2_LCSR_SIZE   0x100UL
+#define NUMACHIP2_APIC_ICR0x10
+
+static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
+{
+   return (void __iomem *)__va(NUMACHIP2_LCSR_BASE |
+   (offset & (NUMACHIP2_LCSR_SIZE - 1)));
+}
+
+static inline u32 numachip2_read32_lcsr(unsigned long offset)
+{
+   return readl(numachip2_lcsr_address(offset));
+}
+
+static inline u64 numachip2_read64_lcsr(unsigned long offset)
+{
+   return readq(numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write32_lcsr(unsigned long offset, u32 val)
+{
+   writel(val, numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
+{
+   writeq(val, numachip2_lcsr_address(offset));
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index eeefbb1..3cb9294 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -22,6 +22,7 @@
 
 u8 numachip_system __read_mostly;
 static const struct apic apic_numachip1;
+static const struct apic apic_numachip2;
 static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly;
 
 static unsigned int numachip1_get_apic_id(unsigned long x)
@@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id)
return x;
 }
 
+static unsigned int numachip2_get_apic_id(unsigned long x)
+{
+   u64 mcfg;
+
+   rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg);
+   return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24);
+}
+
+static unsigned long numachip2_set_apic_id(unsigned int id)
+{
+   return id << 24;
+}
+
 static int numachip_apic_id_valid(int apicid)
 {
/* Trust what bootloader passes in MADT */
@@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned int val)
write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val);
 }
 
+static void numachip2_apic_icr_write(int apicid, unsigned int val)
+{
+   numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
+}
+
 static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 {
numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
@@ -130,6 +149,11 @@ static int __init numachip1_probe(void)
    return apic == &apic_numachip1;
 }
 
+static int __init numachip2_probe(void)
+{
+   return apic == &apic_numachip2;
+}
+
 static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
 {
u64 val;
@@ -155,6 +179,13 @@ static int __init numachip_system_init(void)
numachip_apic_icr_write = numachip1_apic_icr_write;
x86_init.pci.arch_init = pci_numachip_init;
break;
+   case 2:
+   init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
+   numachip_api

[tip:x86/apic] x86/numachip: Cleanup Numachip support

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  db1003a719d75cebe5843a7906c02c29bec9922c
Gitweb: http://git.kernel.org/tip/db1003a719d75cebe5843a7906c02c29bec9922c
Author: Daniel J Blueman <dan...@numascale.com>
AuthorDate: Mon, 21 Sep 2015 01:01:59 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 22 Sep 2015 22:25:32 +0200

x86/numachip: Cleanup Numachip support

Drop unused code and includes in Numachip header files and APIC driver.

Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
Cc: Daniel Lezcano <daniel.lezc...@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-1-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/include/asm/numachip/numachip_csr.h | 118 +--
 arch/x86/kernel/apic/apic_numachip.c | 104 ++-
 2 files changed, 44 insertions(+), 178 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index 660f843..7469b13 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,12 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
-#include 
-#include 
 #include 
-#include 
-#include 
-#include 
 
 #define CSR_NODE_SHIFT 16
 #define CSR_NODE_BITS(p)   (((unsigned long)(p)) << CSR_NODE_SHIFT)
@@ -27,11 +22,8 @@
 
 /* 32K CSR space, b15 indicates geo/non-geo */
 #define CSR_OFFSET_MASK 0x7fffUL
-
-/* Global CSR space covers all 4K possible nodes with 64K CSR space per node */
-#define NUMACHIP_GCSR_BASE 0x3fffULL
-#define NUMACHIP_GCSR_LIM  0x3fff0fffULL
-#define NUMACHIP_GCSR_SIZE (NUMACHIP_GCSR_LIM - NUMACHIP_GCSR_BASE + 1)
+#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
+#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
 
 /*
  * Local CSR space starts in global CSR space with "nodeid" = 0xfff0, however
@@ -42,28 +34,12 @@
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
 
-static inline void *gcsr_address(int node, unsigned long offset)
-{
-   return __va(NUMACHIP_GCSR_BASE | (1UL << 15) |
-   CSR_NODE_BITS(node & CSR_NODE_MASK) | (offset & CSR_OFFSET_MASK));
-}
-
 static inline void *lcsr_address(unsigned long offset)
 {
return __va(NUMACHIP_LCSR_BASE | (1UL << 15) |
CSR_NODE_BITS(0xfff0) | (offset & CSR_OFFSET_MASK));
 }
 
-static inline unsigned int read_gcsr(int node, unsigned long offset)
-{
-   return swab32(readl(gcsr_address(node, offset)));
-}
-
-static inline void write_gcsr(int node, unsigned long offset, unsigned int val)
-{
-   writel(swab32(val), gcsr_address(node, offset));
-}
-
 static inline unsigned int read_lcsr(unsigned long offset)
 {
return swab32(readl(lcsr_address(offset)));
@@ -74,94 +50,4 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
writel(swab32(val), lcsr_address(offset));
 }
 
-/* = */
-/*   CSR_G0_STATE_CLEAR  */
-/* = */
-
-#define CSR_G0_STATE_CLEAR (0x000 + (0 << 12))
-union numachip_csr_g0_state_clear {
-   unsigned int v;
-   struct numachip_csr_g0_state_clear_s {
-   unsigned int _state:2;
-   unsigned int _rsvd_2_6:5;
-   unsigned int _lost:1;
-   unsigned int _rsvd_8_31:24;
-   } s;
-};
-
-/* = */
-/*   CSR_G0_NODE_IDS */
-/* = */
-
-#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
-union numachip_csr_g0_node_ids {
-   unsigned int v;
-   struct numachip_csr_g0_node_ids_s {
-   unsigned int _initialid:16;
-   unsigned int _nodeid:12;
-   unsigned int _rsvd_28_31:4;
-   } s;
-};
-
-/* = */
-/*   CSR_G3_EXT_IRQ_GEN  */
-/* = */
-
-#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
-union numachip_csr_g3_ext_irq_gen {
-   unsigned int v;
-   struct numachip_csr_g3_ext_irq_gen_s {
-   unsigned int _vector:8;
-   unsigned int _msgtype:3;
-   unsigned int _index:5;
-   unsigned int _destination_apic_id:16;
-   } s;
-};
-
-/* ==

[tip:x86/apic] x86/numachip: Add Numachip IPI optimisations

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  ad03a9c25d258641556c7198e26fd882c741987a
Gitweb: http://git.kernel.org/tip/ad03a9c25d258641556c7198e26fd882c741987a
Author: Daniel J Blueman <dan...@numascale.com>
AuthorDate: Mon, 21 Sep 2015 01:02:01 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 22 Sep 2015 22:25:33 +0200

x86/numachip: Add Numachip IPI optimisations

When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.

Secondly, since the AMD BIOS-kernel developer guide states that IPI
delivery will occur regardless of prior delivery status, avoid polling
the delivery status bit for efficiency.
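
The locality test described above can be sketched in isolation as
follows. This is only an illustration of the comparison used in
numachip_send_IPI_one(): EXAMPLE_LAPIC_BITS and example_same_node() are
hypothetical names, and the width of 8 bits is taken from the
NUMACHIP_LAPIC_BITS define added by this patch.

```c
#include <stdbool.h>

/* Low APIC ID bits that select a core within a node (assumed to be 8). */
#define EXAMPLE_LAPIC_BITS 8

/*
 * XOR leaves a 1 wherever the two IDs differ; shifting out the low
 * (local) bits leaves zero only when the non-local parts are identical.
 */
static bool example_same_node(unsigned int src_apicid, unsigned int dst_apicid)
{
	return ((src_apicid ^ dst_apicid) >> EXAMPLE_LAPIC_BITS) == 0;
}
```

When the check succeeds the IPI can go through the local APIC;
otherwise it is written to the Numachip ICR as before.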

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
Cc: Daniel Lezcano <daniel.lezc...@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/include/asm/numachip/numachip_csr.h |  1 +
 arch/x86/kernel/apic/apic_numachip.c | 37 
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index e08b803..e09d845 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -34,6 +34,7 @@
 #define NUMACHIP_LCSR_BASE 0x3e00ULL
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
+#define NUMACHIP_LAPIC_BITS 8
 
 static inline void *lcsr_address(unsigned long offset)
 {
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 3cb9294..38dd5ef 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -96,9 +96,25 @@ static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 
 static void numachip_send_IPI_one(int cpu, int vector)
 {
-   int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
unsigned int dmode;
 
+   preempt_disable();
+   local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+   /* Send via local APIC where non-local part matches */
+   if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __default_send_IPI_dest_field(apicid, vector,
+   APIC_DEST_PHYSICAL);
+   local_irq_restore(flags);
+   preempt_enable();
+   return;
+   }
+   preempt_enable();
+
dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
numachip_apic_icr_write(apicid, dmode | vector);
 }
@@ -218,6 +234,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
return 1;
 }
 
+/* APIC IPIs are queued */
+static void numachip_apic_wait_icr_idle(void)
+{
+}
+
+/* APIC NMI IPIs are queued */
+static u32 numachip_safe_apic_wait_icr_idle(void)
+{
+   return 0;
+}
+
 static const struct apic apic_numachip1 __refconst = {
.name   = "NumaConnect system",
.probe  = numachip1_probe,
@@ -263,8 +290,8 @@ static const struct apic apic_numachip1 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip1);
@@ -314,8 +341,8 @@ static const struct apic apic_numachip2 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip2);


[tip:x86/apic] x86/numachip: Introduce Numachip2 timer mechanisms

2015-09-22 Thread tip-bot for Daniel J Blueman
Commit-ID:  ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7
Gitweb: http://git.kernel.org/tip/ce2e572cfe7b2fc3f0e9da4aa7bc61a2c2c51fc7
Author: Daniel J Blueman <dan...@numascale.com>
AuthorDate: Mon, 21 Sep 2015 18:02:25 +0800
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Tue, 22 Sep 2015 22:25:33 +0200

x86/numachip: Introduce Numachip2 timer mechanisms

Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.
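
To make the two mechanisms more concrete, here is a small stand-alone
sketch. example_timer_offset() mirrors the per-core register stride
computed by numachip2_timer() in this patch, and example_cycles_to_ns()
shows the generic clocksource scaling (cycles * mult) >> shift, which
with the mult = 1, shift = 0 values used here means one cycle of the
1GHz counter is one nanosecond. Both function names are hypothetical
and exist only for this illustration.

```c
#include <stdint.h>

/* Per-core timer register offset: 48 timer slots, 64 bytes apart. */
static unsigned int example_timer_offset(unsigned int cpu)
{
	return (cpu % 48) << 6;
}

/* Generic clocksource scaling; identity when mult = 1 and shift = 0. */
static uint64_t example_cycles_to_ns(uint64_t cycles, uint32_t mult,
				     uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

Note that CPU numbers 48 apart map to the same offset; the sketch makes
no claim about how the hardware distinguishes them.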

[ tglx: Taking it through x86 due to dependencies ]

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
Cc: Daniel Lezcano <daniel.lezc...@linaro.org>
Link: http://lkml.kernel.org/r/1442829745-29311-1-git-send-email-dan...@numascale.com
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 arch/x86/include/asm/numachip/numachip_csr.h |  9 +++
 drivers/clocksource/Makefile |  1 +
 drivers/clocksource/numachip.c   | 95 
 3 files changed, 105 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..088e5fa
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *

[PATCH v2] x86: Introduce Numachip2 timer mechanisms

2015-09-21 Thread Daniel J Blueman
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.

v2: Fix whitespace and wrapping issue

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h |  9 +++
 drivers/clocksource/Makefile |  1 +
 drivers/clocksource/numachip.c   | 95 
 3 files changed, 105 insertions(+)
 create mode 100644 drivers/clocksource/numachip.c

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..088e5fa
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   /* Setup IPI vector to local core and relative timing mode */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(),
+   (3 << 22) | (X86_PLATFORM_IPI_VECTOR << 14) |
+   (local_apicid << 6));
+
+   *ced = numachip2_clockevent;
+   ced->cpumask = cpumask_of(smp_processor_id());
+   clockevents_register_device(ced);
+}
+
+static int __init numachip_timer_init(void)
+{
+   if (numachip_system != 2)
+   return -ENODEV;
+
+   /* Reset timer */
+

[PATCH v2] x86: Introduce Numachip2 timer mechanisms

2015-09-21 Thread Daniel J Blueman
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.

v2: Fix whitespace and wrapping issue

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h |  9 +++
 drivers/clocksource/Makefile |  1 +
 drivers/clocksource/numachip.c   | 95 
 3 files changed, 105 insertions(+)
 create mode 100644 drivers/clocksource/numachip.c

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long 
offset, u64 val)
writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..088e5fa
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   /* Setup IPI vector to local core and relative timing mode */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(),
+   (3 << 22) | (X86_PLATFORM_IPI_VECTOR << 14) |
+   (local_apicid << 6));
+
+   *ced = numachip2_clockevent;
+   ced->cpumask = cpumask_of(smp_processor_id());
+   clockevents_register_device(ced);
+}
+
+static int __init numachip_timer_init(void)
+{
+   if (numachip_system != 2)
+   return -
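
The clocksource above declares .mult = 1 and .shift = 0 for a 1GHz counter. A minimal user-space sketch of why that pairing makes sense, assuming the usual clocksource scaling ns = (cycles * mult) >> shift; the helper names cyc_to_ns() and numachip2_cyc_to_ns() are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Generic clocksource scaling: ns = (cycles * mult) >> shift. */
static uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	assert(shift < 64);	/* shifting by >= 64 is undefined in C */
	return (cycles * mult) >> shift;
}

/*
 * A 1GHz counter ticks once per nanosecond, so the identity scaling
 * (mult = 1, shift = 0) already yields nanoseconds.
 */
static uint64_t numachip2_cyc_to_ns(uint64_t cycles)
{
	return cyc_to_ns(cycles, 1, 0);
}
```

With these values a reading of 1250 cycles corresponds to 1250ns, which matches the .min_delta_ns granularity used by the clockevent. Note that the patch also calls clocksource_register_hz(), which derives its own scaling from the frequency it is given.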

[PATCH 3/4] x86: Add Numachip IPI optimisations

2015-09-20 Thread Daniel J Blueman
When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.

Secondly, since the AMD BIOS and Kernel Developer's Guide states that IPI
delivery occurs irrespective of prior delivery status, skip polling the
delivery status bit for efficiency.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
---
 arch/x86/include/asm/numachip/numachip_csr.h |  1 +
 arch/x86/kernel/apic/apic_numachip.c | 36 
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index c7efc25..75379f6 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -34,6 +34,7 @@
 #define NUMACHIP_LCSR_BASE 0x3e00ULL
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
+#define NUMACHIP_LAPIC_BITS8
 
 static inline void *lcsr_address(unsigned long offset)
 {
diff --git a/arch/x86/kernel/apic/apic_numachip.c 
b/arch/x86/kernel/apic/apic_numachip.c
index dfe2b1c..81bc216 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -95,9 +95,25 @@ static int numachip_wakeup_secondary(int phys_apicid, 
unsigned long start_rip)
 
 static void numachip_send_IPI_one(int cpu, int vector)
 {
-   int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
unsigned int dmode;
 
+   preempt_disable();
+   local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+   /* Send via local APIC where non-local part matches */
+   if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __default_send_IPI_dest_field(apicid, vector,
+   APIC_DEST_PHYSICAL);
+   local_irq_restore(flags);
+   preempt_enable();
+   return;
+   }
+   preempt_enable();
+
dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
numachip_apic_icr_write(apicid, dmode | vector);
 }
@@ -217,6 +232,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, 
char *oem_table_id)
return 1;
 }
 
+/* APIC IPIs are queued */
+static void numachip_apic_wait_icr_idle(void)
+{
+}
+
+/* APIC NMI IPIs are queued */
+static u32 numachip_safe_apic_wait_icr_idle(void)
+{
+   return 0;
+}
+
 static const struct apic apic_numachip1 __refconst = {
.name   = "NumaConnect system",
.probe  = numachip1_probe,
@@ -262,8 +288,8 @@ static const struct apic apic_numachip1 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip1);
@@ -313,8 +339,8 @@ static const struct apic apic_numachip2 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip2);
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
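
The local-versus-remote decision in numachip_send_IPI_one() above comes down to one XOR and one shift. A small user-space sketch of that test, assuming only NUMACHIP_LAPIC_BITS = 8 from the patch; the helper name ipi_is_local() is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

#define NUMACHIP_LAPIC_BITS 8	/* from the patch: low 8 bits are the local part */

/*
 * XOR leaves a 1 in every bit where the two APIC IDs differ; shifting
 * out the local bits keeps only differences in the non-local part.
 * A zero result means both cores sit behind the same local APIC domain.
 */
static bool ipi_is_local(unsigned int apicid, unsigned int local_apicid)
{
	return !((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS);
}
```

For example, APIC IDs 0x0105 and 0x01ff differ only in the low 8 bits and would take the local APIC path, while 0x0105 and 0x0205 differ in the upper bits and would go through numachip_apic_icr_write().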


[PATCH 4/4] x86: Introduce Numachip2 timer mechanisms

2015-09-20 Thread Daniel J Blueman
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
---
 arch/x86/include/asm/numachip/numachip_csr.h | 9 +
 drivers/clocksource/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long 
offset, u64 val)
writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..5e4f90e
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   /* Setup IPI vector to local core and relative timing mode */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(),
+   (3 << 22) | (X86_PLATFORM_IPI_VECTOR << 14) |
+   (local_apicid << 6));
+
+   *ced = numachip2_clockevent;
+   ced->cpumask = cpumask_of(smp_processor_id());
+   clockevents_register_device(ced);
+}
+
+static int __init numachip_timer_init(void)
+{
+   if (numachip_system != 2)
+   return -ENODEV;
+
+   /* Reset timer */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_RESET, 0);
+   clocksource_register_hz(&numachip2_clocksource, NSEC_PER_SEC);
+
+   /* Setup per-cpu clockevents */
+   x86_
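
The per-core register addressing and the interrupt setup word used by this patch can be checked in user space. The sketch below mirrors numachip2_timer() and the value written in numachip_timer_each(); the field layout (mode in bits 22-23, vector in bits 14-21, local APIC ID in bits 6-13) is only inferred from the shifts in the patch, and the helper names timer_offset() and timer_int_value() are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors numachip2_timer(): one 64-byte register slot per core, 48 slots. */
static unsigned int timer_offset(unsigned int cpu)
{
	return (cpu % 48) << 6;
}

/*
 * Builds the word written to NUMACHIP2_TIMER_INT. The v2 posting passes
 * mode 3, described in its comment as relative timing mode.
 */
static uint64_t timer_int_value(unsigned int mode, unsigned int vector,
				unsigned int local_apicid)
{
	/* Reject values that would spill into a neighbouring field. */
	assert(mode < 4 && vector < 256 && local_apicid < 256);
	return ((uint64_t)mode << 22) | ((uint64_t)vector << 14) |
	       ((uint64_t)local_apicid << 6);
}
```

Because of the modulo, CPU 49 maps to the same slot offset as CPU 1, so the scheme appears to rely on each group of 48 cores seeing its own local CSR window.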

[PATCH 2/4] x86: Add Numachip2 APIC support

2015-09-20 Thread Daniel J Blueman
Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature.

Access is performed via a fixed mapping in the x86 physical address space.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
---
 arch/x86/include/asm/numachip/numachip.h |  1 +
 arch/x86/include/asm/numachip/numachip_csr.h | 34 ++
 arch/x86/kernel/apic/apic_numachip.c | 93 
 3 files changed, 128 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip.h 
b/arch/x86/include/asm/numachip/numachip.h
index 1c6f7f6..c64373a 100644
--- a/arch/x86/include/asm/numachip/numachip.h
+++ b/arch/x86/include/asm/numachip/numachip.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_H
 
+extern u8 numachip_system;
 extern int __init pci_numachip_init(void);
 
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index 7469b13..c7efc25 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
+#include 
 #include 
 
 #define CSR_NODE_SHIFT 16
@@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
writel(swab32(val), lcsr_address(offset));
 }
 
+/*
+ * On NumaChip2, local CSR space is 16MB and starts at fixed offset below 4G
+ */
+
+#define NUMACHIP2_LCSR_BASE   0xf000UL
+#define NUMACHIP2_LCSR_SIZE   0x100UL
+#define NUMACHIP2_APIC_ICR0x10
+
+static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
+{
+   return (void __iomem *)__va(NUMACHIP2_LCSR_BASE |
+   (offset & (NUMACHIP2_LCSR_SIZE - 1)));
+}
+
+static inline u32 numachip2_read32_lcsr(unsigned long offset)
+{
+   return readl(numachip2_lcsr_address(offset));
+}
+
+static inline u64 numachip2_read64_lcsr(unsigned long offset)
+{
+   return readq(numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write32_lcsr(unsigned long offset, u32 val)
+{
+   writel(val, numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
+{
+   writeq(val, numachip2_lcsr_address(offset));
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/arch/x86/kernel/apic/apic_numachip.c 
b/arch/x86/kernel/apic/apic_numachip.c
index 8729249..dfe2b1c 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -22,6 +22,7 @@
 
 u8 numachip_system __read_mostly;
 static const struct apic apic_numachip1;
+static const struct apic apic_numachip2;
 static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly;
 
 static unsigned int numachip1_get_apic_id(unsigned long x)
@@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id)
return x;
 }
 
+static unsigned int numachip2_get_apic_id(unsigned long x)
+{
+   u64 mcfg;
+
+   rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg);
+   return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24);
+}
+
+static unsigned long numachip2_set_apic_id(unsigned int id)
+{
+   return id << 24;
+}
+
 static int numachip_apic_id_valid(int apicid)
 {
/* Trust what bootloader passes in MADT */
@@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned 
int val)
write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val);
 }
 
+static void numachip2_apic_icr_write(int apicid, unsigned int val)
+{
+   numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
+}
+
 static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 {
numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
@@ -129,6 +148,11 @@ static int __init numachip1_probe(void)
return apic == &apic_numachip1;
 }
 
+static int __init numachip2_probe(void)
+{
+   return apic == &apic_numachip2;
+}
+
 static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
 {
u64 val;
@@ -154,6 +178,13 @@ static int __init numachip_system_init(void)
numachip_apic_icr_write = numachip1_apic_icr_write;
x86_init.pci.arch_init = pci_numachip_init;
break;
+   case 2:
+   init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
+   numachip_apic_icr_write = numachip2_apic_icr_write;
+
+   /* Use MCFG config cycles rather than locked CF8 cycles */
+   raw_pci_ops = _mmcfg;
+   break;
default:
return 0;
}
@@ -175,6 +206,17 @@ static int numachip1_acpi_madt_oem_check(char *oem_id, 
char *oem_table_id)
return 1;
 }
 
+static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
+{
+   if ((strncmp(oem_id, "NUMASC"
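
Two bit manipulations in this patch are easy to verify outside the kernel: the offset masking in numachip2_lcsr_address() and the APIC ID assembly in numachip2_get_apic_id(). In the sketch below the window constants are assumptions chosen to match the comment "16MB and starts at fixed offset below 4G"; they are illustrative, and the helper names lcsr_phys() and compose_apic_id() are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration: a 16MB window located just below 4G. */
#define LCSR_BASE 0xf0000000ULL
#define LCSR_SIZE 0x1000000ULL

/* Wrap the offset into the window, then place it at the window base. */
static uint64_t lcsr_phys(uint64_t offset)
{
	return LCSR_BASE | (offset & (LCSR_SIZE - 1));
}

/*
 * Same expression as numachip2_get_apic_id(): bits 28-39 of the MMIO
 * config base MSR become bits 8-19 of the result, and the top byte of
 * the APIC ID register supplies the low 8 bits.
 */
static unsigned int compose_apic_id(uint64_t mcfg, uint32_t apic_reg)
{
	return (unsigned int)((mcfg >> (28 - 8)) & 0xfff00) | (apic_reg >> 24);
}
```

The mask means an out-of-range offset silently wraps inside the window instead of touching memory outside it.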

[PATCH 1/4] x86: Cleanup Numachip support

2015-09-20 Thread Daniel J Blueman
Drop unused code and includes in Numachip header files and APIC driver.

Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.

Signed-off-by: Daniel J Blueman 
Acked-by: Steffen Persvold 
---
 arch/x86/include/asm/numachip/numachip_csr.h | 118 +--
 arch/x86/kernel/apic/apic_numachip.c | 103 ++-
 2 files changed, 43 insertions(+), 178 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index 660f843..7469b13 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,12 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
-#include 
-#include 
 #include 
-#include 
-#include 
-#include 
 
 #define CSR_NODE_SHIFT 16
 #define CSR_NODE_BITS(p)   (((unsigned long)(p)) << CSR_NODE_SHIFT)
@@ -27,11 +22,8 @@
 
 /* 32K CSR space, b15 indicates geo/non-geo */
 #define CSR_OFFSET_MASK0x7fffUL
-
-/* Global CSR space covers all 4K possible nodes with 64K CSR space per node */
-#define NUMACHIP_GCSR_BASE 0x3fffULL
-#define NUMACHIP_GCSR_LIM  0x3fff0fffULL
-#define NUMACHIP_GCSR_SIZE (NUMACHIP_GCSR_LIM - NUMACHIP_GCSR_BASE + 1)
+#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
+#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
 
 /*
  * Local CSR space starts in global CSR space with "nodeid" = 0xfff0, however
@@ -42,28 +34,12 @@
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
 
-static inline void *gcsr_address(int node, unsigned long offset)
-{
-   return __va(NUMACHIP_GCSR_BASE | (1UL << 15) |
-   CSR_NODE_BITS(node & CSR_NODE_MASK) | (offset & CSR_OFFSET_MASK));
-}
-
 static inline void *lcsr_address(unsigned long offset)
 {
return __va(NUMACHIP_LCSR_BASE | (1UL << 15) |
CSR_NODE_BITS(0xfff0) | (offset & CSR_OFFSET_MASK));
 }
 
-static inline unsigned int read_gcsr(int node, unsigned long offset)
-{
-   return swab32(readl(gcsr_address(node, offset)));
-}
-
-static inline void write_gcsr(int node, unsigned long offset, unsigned int val)
-{
-   writel(swab32(val), gcsr_address(node, offset));
-}
-
 static inline unsigned int read_lcsr(unsigned long offset)
 {
return swab32(readl(lcsr_address(offset)));
@@ -74,94 +50,4 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
writel(swab32(val), lcsr_address(offset));
 }
 
-/* = */
-/*   CSR_G0_STATE_CLEAR  */
-/* = */
-
-#define CSR_G0_STATE_CLEAR (0x000 + (0 << 12))
-union numachip_csr_g0_state_clear {
-   unsigned int v;
-   struct numachip_csr_g0_state_clear_s {
-   unsigned int _state:2;
-   unsigned int _rsvd_2_6:5;
-   unsigned int _lost:1;
-   unsigned int _rsvd_8_31:24;
-   } s;
-};
-
-/* = */
-/*   CSR_G0_NODE_IDS */
-/* = */
-
-#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
-union numachip_csr_g0_node_ids {
-   unsigned int v;
-   struct numachip_csr_g0_node_ids_s {
-   unsigned int _initialid:16;
-   unsigned int _nodeid:12;
-   unsigned int _rsvd_28_31:4;
-   } s;
-};
-
-/* = */
-/*   CSR_G3_EXT_IRQ_GEN  */
-/* = */
-
-#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
-union numachip_csr_g3_ext_irq_gen {
-   unsigned int v;
-   struct numachip_csr_g3_ext_irq_gen_s {
-   unsigned int _vector:8;
-   unsigned int _msgtype:3;
-   unsigned int _index:5;
-   unsigned int _destination_apic_id:16;
-   } s;
-};
-
-/* = */
-/*   CSR_G3_EXT_IRQ_STATUS   */
-/* = */
-
-#define CSR_G3_EXT_IRQ_STATUS (0x034 + (3 << 12))
-union numachip_csr_g3_ext_irq_status {
-   unsigned int v;
-   struct numachip_csr_g3_ext_irq_status_s {
-   unsigned int _result:32;
-   } s;
-};
-
-/* ==

[PATCH 3/4] x86: Add Numachip IPI optimisations

2015-09-20 Thread Daniel J Blueman
When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.

Secondly, since the AMD BIOS and Kernel Developer's Guide states that IPI
delivery occurs irrespective of prior delivery status, skip polling the
delivery status bit for efficiency.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h |  1 +
 arch/x86/kernel/apic/apic_numachip.c | 36 
 2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index c7efc25..75379f6 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -34,6 +34,7 @@
 #define NUMACHIP_LCSR_BASE 0x3e00ULL
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
+#define NUMACHIP_LAPIC_BITS8
 
 static inline void *lcsr_address(unsigned long offset)
 {
diff --git a/arch/x86/kernel/apic/apic_numachip.c 
b/arch/x86/kernel/apic/apic_numachip.c
index dfe2b1c..81bc216 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -95,9 +95,25 @@ static int numachip_wakeup_secondary(int phys_apicid, 
unsigned long start_rip)
 
 static void numachip_send_IPI_one(int cpu, int vector)
 {
-   int apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   int local_apicid, apicid = per_cpu(x86_cpu_to_apicid, cpu);
unsigned int dmode;
 
+   preempt_disable();
+   local_apicid = __this_cpu_read(x86_cpu_to_apicid);
+
+   /* Send via local APIC where non-local part matches */
+   if (!((apicid ^ local_apicid) >> NUMACHIP_LAPIC_BITS)) {
+   unsigned long flags;
+
+   local_irq_save(flags);
+   __default_send_IPI_dest_field(apicid, vector,
+   APIC_DEST_PHYSICAL);
+   local_irq_restore(flags);
+   preempt_enable();
+   return;
+   }
+   preempt_enable();
+
dmode = (vector == NMI_VECTOR) ? APIC_DM_NMI : APIC_DM_FIXED;
numachip_apic_icr_write(apicid, dmode | vector);
 }
@@ -217,6 +232,17 @@ static int numachip2_acpi_madt_oem_check(char *oem_id, 
char *oem_table_id)
return 1;
 }
 
+/* APIC IPIs are queued */
+static void numachip_apic_wait_icr_idle(void)
+{
+}
+
+/* APIC NMI IPIs are queued */
+static u32 numachip_safe_apic_wait_icr_idle(void)
+{
+   return 0;
+}
+
 static const struct apic apic_numachip1 __refconst = {
.name   = "NumaConnect system",
.probe  = numachip1_probe,
@@ -262,8 +288,8 @@ static const struct apic apic_numachip1 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip1);
@@ -313,8 +339,8 @@ static const struct apic apic_numachip2 __refconst = {
.eoi_write  = native_apic_mem_write,
.icr_read   = native_apic_icr_read,
.icr_write  = native_apic_icr_write,
-   .wait_icr_idle  = native_apic_wait_icr_idle,
-   .safe_wait_icr_idle = native_safe_apic_wait_icr_idle,
+   .wait_icr_idle  = numachip_apic_wait_icr_idle,
+   .safe_wait_icr_idle = numachip_safe_apic_wait_icr_idle,
 };
 
 apic_driver(apic_numachip2);
-- 
2.5.0



[PATCH 4/4] x86: Introduce Numachip2 timer mechanisms

2015-09-20 Thread Daniel J Blueman
Add 1GHz 64-bit Numachip2 clocksource timer support for accurate
system-wide timekeeping, as core TSCs are unsynchronised.

Additionally, add a per-core clockevent mechanism that interrupts via the
platform IPI vector after a programmed period.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h | 9 +
 drivers/clocksource/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index e09d845..29719ee 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -59,6 +59,10 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
 #define NUMACHIP2_LCSR_BASE   0xf000UL
 #define NUMACHIP2_LCSR_SIZE   0x100UL
 #define NUMACHIP2_APIC_ICR0x10
+#define NUMACHIP2_TIMER_DEADLINE  0x20
+#define NUMACHIP2_TIMER_INT   0x28
+#define NUMACHIP2_TIMER_NOW   0x200018
+#define NUMACHIP2_TIMER_RESET 0x200020
 
 static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
 {
@@ -86,4 +90,9 @@ static inline void numachip2_write64_lcsr(unsigned long 
offset, u64 val)
writeq(val, numachip2_lcsr_address(offset));
 }
 
+static inline unsigned int numachip2_timer(void)
+{
+   return (smp_processor_id() % 48) << 6;
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 5c00863..57dfad3 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -62,3 +62,4 @@ obj-$(CONFIG_H8300)   += h8300_timer8.o
 obj-$(CONFIG_H8300_TMR16)  += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
+obj-$(CONFIG_X86_NUMACHIP) += numachip.o
diff --git a/drivers/clocksource/numachip.c b/drivers/clocksource/numachip.c
new file mode 100644
index 000..5e4f90e
--- /dev/null
+++ b/drivers/clocksource/numachip.c
@@ -0,0 +1,95 @@
+/*
+ *
+ * Copyright (C) 2015 Numascale AS. All rights reserved.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include 
+
+#include 
+#include 
+#include 
+
+static DEFINE_PER_CPU(struct clock_event_device, cpu_ced);
+
+static cycles_t numachip2_timer_read(struct clocksource *cs)
+{
+   return numachip2_read64_lcsr(NUMACHIP2_TIMER_NOW);
+}
+
+static struct clocksource numachip2_clocksource = {
+   .name= "numachip2",
+   .rating  = 295,
+   .read= numachip2_timer_read,
+   .mask= CLOCKSOURCE_MASK(64),
+   .flags   = CLOCK_SOURCE_IS_CONTINUOUS,
+   .mult= 1,
+   .shift   = 0,
+};
+
+static int numachip2_set_next_event(unsigned long delta, struct clock_event_device *ced)
+{
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_DEADLINE + numachip2_timer(),
+   delta);
+   return 0;
+}
+
+static struct clock_event_device numachip2_clockevent = {
+   .name= "numachip2",
+   .rating  = 400,
+   .set_next_event  = numachip2_set_next_event,
+   .features= CLOCK_EVT_FEAT_ONESHOT,
+   .mult= 1,
+   .shift   = 0,
+   .min_delta_ns= 1250,
+   .max_delta_ns= LONG_MAX,
+};
+
+static void numachip_timer_interrupt(void)
+{
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   ced->event_handler(ced);
+}
+
+static __init void numachip_timer_each(struct work_struct *work)
+{
+   unsigned local_apicid = __this_cpu_read(x86_cpu_to_apicid) & 0xff;
+   struct clock_event_device *ced = this_cpu_ptr(&cpu_ced);
+
+   /* Setup IPI vector to local core and relative timing mode */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_INT + numachip2_timer(),
+   (3 << 22) | (X86_PLATFORM_IPI_VECTOR << 14) |
+   (local_apicid << 6));
+
+   *ced = numachip2_clockevent;
+   ced->cpumask = cpumask_of(smp_processor_id());
+   clockevents_register_device(ced);
+}
+
+static int __init numachip_timer_init(void)
+{
+   if (numachip_system != 2)
+   return -ENODEV;
+
+   /* Reset timer */
+   numachip2_write64_lcsr(NUMACHIP2_TIMER_RESET, 0);
+   clocksource_register_hz(&numachip2_clocksource, NSEC_PER_SEC);
+
+   /* Se

[PATCH 2/4] x86: Add Numachip2 APIC support

2015-09-20 Thread Daniel J Blueman
Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature.

Access is performed via a fixed mapping in the x86 physical address space.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip.h |  1 +
 arch/x86/include/asm/numachip/numachip_csr.h | 34 ++
 arch/x86/kernel/apic/apic_numachip.c | 93 
 3 files changed, 128 insertions(+)

diff --git a/arch/x86/include/asm/numachip/numachip.h 
b/arch/x86/include/asm/numachip/numachip.h
index 1c6f7f6..c64373a 100644
--- a/arch/x86/include/asm/numachip/numachip.h
+++ b/arch/x86/include/asm/numachip/numachip.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_H
 
+extern u8 numachip_system;
 extern int __init pci_numachip_init(void);
 
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_H */
diff --git a/arch/x86/include/asm/numachip/numachip_csr.h 
b/arch/x86/include/asm/numachip/numachip_csr.h
index 7469b13..c7efc25 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,6 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
+#include 
 #include 
 
 #define CSR_NODE_SHIFT 16
@@ -50,4 +51,38 @@ static inline void write_lcsr(unsigned long offset, unsigned 
int val)
writel(swab32(val), lcsr_address(offset));
 }
 
+/*
+ * On NumaChip2, local CSR space is 16MB and starts at fixed offset below 4G
+ */
+
+#define NUMACHIP2_LCSR_BASE   0xf000UL
+#define NUMACHIP2_LCSR_SIZE   0x100UL
+#define NUMACHIP2_APIC_ICR0x10
+
+static inline void __iomem *numachip2_lcsr_address(unsigned long offset)
+{
+   return (void __iomem *)__va(NUMACHIP2_LCSR_BASE |
+   (offset & (NUMACHIP2_LCSR_SIZE - 1)));
+}
+
+static inline u32 numachip2_read32_lcsr(unsigned long offset)
+{
+   return readl(numachip2_lcsr_address(offset));
+}
+
+static inline u64 numachip2_read64_lcsr(unsigned long offset)
+{
+   return readq(numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write32_lcsr(unsigned long offset, u32 val)
+{
+   writel(val, numachip2_lcsr_address(offset));
+}
+
+static inline void numachip2_write64_lcsr(unsigned long offset, u64 val)
+{
+   writeq(val, numachip2_lcsr_address(offset));
+}
+
 #endif /* _ASM_X86_NUMACHIP_NUMACHIP_CSR_H */
diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c
index 8729249..dfe2b1c 100644
--- a/arch/x86/kernel/apic/apic_numachip.c
+++ b/arch/x86/kernel/apic/apic_numachip.c
@@ -22,6 +22,7 @@
 
 u8 numachip_system __read_mostly;
 static const struct apic apic_numachip1;
+static const struct apic apic_numachip2;
 static void (*numachip_apic_icr_write)(int apicid, unsigned int val) __read_mostly;
 
 static unsigned int numachip1_get_apic_id(unsigned long x)
@@ -45,6 +46,19 @@ static unsigned long numachip1_set_apic_id(unsigned int id)
return x;
 }
 
+static unsigned int numachip2_get_apic_id(unsigned long x)
+{
+   u64 mcfg;
+
+   rdmsrl(MSR_FAM10H_MMIO_CONF_BASE, mcfg);
+   return ((mcfg >> (28 - 8)) & 0xfff00) | (x >> 24);
+}
+
+static unsigned long numachip2_set_apic_id(unsigned int id)
+{
+   return id << 24;
+}
+
 static int numachip_apic_id_valid(int apicid)
 {
/* Trust what bootloader passes in MADT */
@@ -66,6 +80,11 @@ static void numachip1_apic_icr_write(int apicid, unsigned int val)
write_lcsr(CSR_G3_EXT_IRQ_GEN, (apicid << 16) | val);
 }
 
+static void numachip2_apic_icr_write(int apicid, unsigned int val)
+{
+   numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val);
+}
+
 static int numachip_wakeup_secondary(int phys_apicid, unsigned long start_rip)
 {
numachip_apic_icr_write(phys_apicid, APIC_DM_INIT);
@@ -129,6 +148,11 @@ static int __init numachip1_probe(void)
return apic == &apic_numachip1;
 }
 
+static int __init numachip2_probe(void)
+{
+   return apic == &apic_numachip2;
+}
+
 static void fixup_cpu_id(struct cpuinfo_x86 *c, int node)
 {
u64 val;
@@ -154,6 +178,13 @@ static int __init numachip_system_init(void)
numachip_apic_icr_write = numachip1_apic_icr_write;
x86_init.pci.arch_init = pci_numachip_init;
break;
+   case 2:
+   init_extra_mapping_uc(NUMACHIP2_LCSR_BASE, NUMACHIP2_LCSR_SIZE);
+   numachip_apic_icr_write = numachip2_apic_icr_write;
+
+   /* Use MCFG config cycles rather than locked CF8 cycles */
+   raw_pci_ops = &pci_mmcfg;
+   break;
default:
return 0;
}
@@ -175,6 +206,17 @@ static int numachip1_acpi_madt_oem_check(char *oem_id, char *oem_table_id)
return 1;
 }
 
+static int numachip2_acpi_madt_oem_check(char *oem_id, char *oem_tab

[PATCH 1/4] x86: Cleanup Numachip support

2015-09-20 Thread Daniel J Blueman
Drop unused code and includes in Numachip header files and APIC driver.

Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
Acked-by: Steffen Persvold <s...@numascale.com>
---
 arch/x86/include/asm/numachip/numachip_csr.h | 118 +--
 arch/x86/kernel/apic/apic_numachip.c | 103 ++-
 2 files changed, 43 insertions(+), 178 deletions(-)

diff --git a/arch/x86/include/asm/numachip/numachip_csr.h b/arch/x86/include/asm/numachip/numachip_csr.h
index 660f843..7469b13 100644
--- a/arch/x86/include/asm/numachip/numachip_csr.h
+++ b/arch/x86/include/asm/numachip/numachip_csr.h
@@ -14,12 +14,7 @@
 #ifndef _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 #define _ASM_X86_NUMACHIP_NUMACHIP_CSR_H
 
-#include 
-#include 
 #include 
-#include 
-#include 
-#include 
 
 #define CSR_NODE_SHIFT 16
 #define CSR_NODE_BITS(p)   (((unsigned long)(p)) << CSR_NODE_SHIFT)
@@ -27,11 +22,8 @@
 
 /* 32K CSR space, b15 indicates geo/non-geo */
 #define CSR_OFFSET_MASK    0x7fffUL
-
-/* Global CSR space covers all 4K possible nodes with 64K CSR space per node */
-#define NUMACHIP_GCSR_BASE 0x3fffULL
-#define NUMACHIP_GCSR_LIM  0x3fff0fffULL
-#define NUMACHIP_GCSR_SIZE (NUMACHIP_GCSR_LIM - NUMACHIP_GCSR_BASE + 1)
+#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
+#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
 
 /*
  * Local CSR space starts in global CSR space with "nodeid" = 0xfff0, however
@@ -42,28 +34,12 @@
 #define NUMACHIP_LCSR_LIM  0x3fffULL
 #define NUMACHIP_LCSR_SIZE (NUMACHIP_LCSR_LIM - NUMACHIP_LCSR_BASE + 1)
 
-static inline void *gcsr_address(int node, unsigned long offset)
-{
-   return __va(NUMACHIP_GCSR_BASE | (1UL << 15) |
-   CSR_NODE_BITS(node & CSR_NODE_MASK) | (offset & CSR_OFFSET_MASK));
-}
-
 static inline void *lcsr_address(unsigned long offset)
 {
return __va(NUMACHIP_LCSR_BASE | (1UL << 15) |
CSR_NODE_BITS(0xfff0) | (offset & CSR_OFFSET_MASK));
 }
 
-static inline unsigned int read_gcsr(int node, unsigned long offset)
-{
-   return swab32(readl(gcsr_address(node, offset)));
-}
-
-static inline void write_gcsr(int node, unsigned long offset, unsigned int val)
-{
-   writel(swab32(val), gcsr_address(node, offset));
-}
-
 static inline unsigned int read_lcsr(unsigned long offset)
 {
return swab32(readl(lcsr_address(offset)));
@@ -74,94 +50,4 @@ static inline void write_lcsr(unsigned long offset, unsigned int val)
writel(swab32(val), lcsr_address(offset));
 }
 
-/* = */
-/*   CSR_G0_STATE_CLEAR  */
-/* = */
-
-#define CSR_G0_STATE_CLEAR (0x000 + (0 << 12))
-union numachip_csr_g0_state_clear {
-   unsigned int v;
-   struct numachip_csr_g0_state_clear_s {
-   unsigned int _state:2;
-   unsigned int _rsvd_2_6:5;
-   unsigned int _lost:1;
-   unsigned int _rsvd_8_31:24;
-   } s;
-};
-
-/* = */
-/*   CSR_G0_NODE_IDS */
-/* = */
-
-#define CSR_G0_NODE_IDS (0x008 + (0 << 12))
-union numachip_csr_g0_node_ids {
-   unsigned int v;
-   struct numachip_csr_g0_node_ids_s {
-   unsigned int _initialid:16;
-   unsigned int _nodeid:12;
-   unsigned int _rsvd_28_31:4;
-   } s;
-};
-
-/* = */
-/*   CSR_G3_EXT_IRQ_GEN  */
-/* = */
-
-#define CSR_G3_EXT_IRQ_GEN (0x030 + (3 << 12))
-union numachip_csr_g3_ext_irq_gen {
-   unsigned int v;
-   struct numachip_csr_g3_ext_irq_gen_s {
-   unsigned int _vector:8;
-   unsigned int _msgtype:3;
-   unsigned int _index:5;
-   unsigned int _destination_apic_id:16;
-   } s;
-};
-
-/* = */
-/*   CSR_G3_EXT_IRQ_STATUS   */
-/* = */
-
-#define CSR_G3_EXT_IRQ_STATUS (0x034 + (3 << 12))
-union numachip_csr_g3_ext_irq_status {
-   unsigned int v;
-   struct numachip_csr_g3_ext_irq_status_s {
-   unsigned int _result:32;
-   } s;
-};
-
-/* ==

Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

2015-07-06 Thread Daniel J Blueman

Hi Nate,

On Wed, Jun 24, 2015 at 11:50 PM, Nathan Zimmer <nzim...@sgi.com> wrote:
> My apologies for taking so long to get back to this.
>
> I think I did locate two potential sources of slowdown.
> One is the set_cpus_allowed_ptr as I have noted previously.
> However I only notice that on the very largest boxes.
> I did cobble together a patch that seems to help.
>
> The other spot I suspect is the zone lock in free_one_page.
> I haven't been able to give that much thought as of yet though.
>
> Daniel do you mind seeing if the attached patch helps out?

Just got back from travel, so apologies for the delays.

The patch doesn't mitigate the increasing initialisation time; summing 
the per-node times for an accurate measure, there was a total of 
171.48s before the patch and 175.23s after. I double-checked and got 
similar data.


Thanks,
 Daniel

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: lockup when C1E and high-resolution timers enabled

2015-06-14 Thread Daniel J Blueman
On 14 June 2015 at 22:49, Christoph Fritz <chf.fr...@googlemail.com> wrote:
> On Sun, 2015-06-14 at 15:54 +0800, Daniel J Blueman wrote:
>> As a workaround, you can probably just disable message triggered C1E
>> (see the BKDG p399 [1]):
>>
>> val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
>
> mhm... $(setpci -s 00:18.4 0xd4.l) returns zero, this can't be right.

Ahh, try:

val=0x$(setpci -s 00:18.3 0xd4.l) # read D18F3xD4
val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1603 0xd4.l=$(printf %x $val) # write back
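
The read-modify-write in the snippet above can also be sketched in C. This only illustrates the bit arithmetic: `clear_bit32`, `disable_mtc1e` and `MTC1E_EN_BIT` are hypothetical names, and the register value is passed in rather than read from PCI config space.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 13 of D18F3xD4 (MTC1eEn), as used in the shell snippet above. */
#define MTC1E_EN_BIT 13

/* Return 'val' with bit 'bit' cleared; all other bits are left untouched. */
static uint32_t clear_bit32(uint32_t val, unsigned int bit)
{
	assert(bit < 32);
	return val & ~(UINT32_C(1) << bit);
}

/* Hypothetical helper: the value that would be written back to D18F3xD4. */
static uint32_t disable_mtc1e(uint32_t d18f3xd4)
{
	return clear_bit32(d18f3xd4, MTC1E_EN_BIT);
}
```

A register value that already has bit 13 clear is returned unchanged, so applying the workaround twice should be harmless.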
-- 
Daniel J Blueman


Re: lockup when C1E and high-resolution timers enabled

2015-06-14 Thread Daniel J Blueman
On 14 June 2015 at 12:39, Christoph Fritz <chf.fr...@googlemail.com> wrote:
> On Sun, 2015-06-14 at 11:13 +0800, Daniel J Blueman wrote:
>> On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
>> > Hi,
>> >
>> >  on following computer configuration, I do get hard lockup under heavy
>> > IO-Load (using rsync):
>> >
>> >  - CONFIG_HIGH_RES_TIMERS=y
>> >  - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
>> >  - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
>> >  - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
>> >  - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>> >
>> > Tests:
>> >  - add kernel parameter "idle=halt" -> system runs fine
>> >  - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
>> >  - change motherboard and disable C1E -> system runs fine
>> >  - change CPU to AMD Phenom II X6 Processor -> system runs fine
>> [..]
>>
>> C1E disconnects HyperTransport links when all cores enter C1 (halt)
>> for a period of time; this is all at the platform level, so isn't due
>> to the kernel. The AMD AGESA code which controls the setup of this
>> mechanism is updated in the F2g BIOS:
>> http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios
>>
>> Did you try both BIOS releases with defaults?
>
> Yes, rechecked both versions: Same bad behaviour.
>
>> If still issues, also try with the current family 10h microcode from
>> http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2
>
> Don't you mean family 15h for 'AMD FX(tm)-8350' ?
>
> already using latest microcode:

As a workaround, you can probably just disable message triggered C1E
(see the BKDG p399 [1]):

val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1604 0xd4.l=$(printf %x $val) # write back

The chipset setup and behaviour is quite complex, so it's likely
Gigabyte haven't done their homework. The alternative is coreboot of
course.

Thanks,
  Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
-- 
Daniel J Blueman


Re: lockup when C1E and high-resolution timers enabled

2015-06-13 Thread Daniel J Blueman
On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
> Hi,
>
>  on following computer configuration, I do get hard lockup under heavy
> IO-Load (using rsync):
>
>  - CONFIG_HIGH_RES_TIMERS=y
>  - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
>  - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
>  - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
>  - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>
> Tests:
>  - add kernel parameter "idle=halt" -> system runs fine
>  - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
>  - change motherboard and disable C1E -> system runs fine
>  - change CPU to AMD Phenom II X6 Processor -> system runs fine
[..]

C1E disconnects HyperTransport links when all cores enter C1 (halt)
for a period of time; this is all at the platform level, so isn't due
to the kernel. The AMD AGESA code which controls the setup of this
mechanism is updated in the F2g BIOS:
http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios

Did you try both BIOS releases with defaults?

If still issues, also try with the current family 10h microcode from
http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2

Thanks,
  Daniel
-- 
Daniel J Blueman


[PATCH -next] iommu: Fix build failure without INTEL_IOMMU

2015-06-04 Thread Daniel J Blueman
Fix Intel IOMMU build failure in linux-next when CONFIG_INTEL_IOMMU is not 
enabled.

Signed-off-by: Daniel J Blueman <dan...@numascale.com>
---
 drivers/iommu/intel_irq_remapping.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 24f7a35..ec337e7 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -146,8 +146,10 @@ static int modify_irte(struct irq_2_iommu *irq_iommu,
set_64bit(&irte->low, irte_modified->low);
set_64bit(&irte->high, irte_modified->high);
 
+#ifdef CONFIG_INTEL_IOMMU
if (iommu->pre_enabled_ir)
__iommu_update_old_irte(iommu, index);
+#endif
 
__iommu_flush_cache(iommu, irte, sizeof(*irte));
 
@@ -210,8 +212,10 @@ static int clear_entries(struct irq_2_iommu *irq_iommu)
bitmap_release_region(iommu->ir_table->bitmap, index,
  irq_iommu->irte_mask);
 
+#ifdef CONFIG_INTEL_IOMMU
if (iommu->pre_enabled_ir)
__iommu_update_old_irte(iommu, -1);
+#endif
 
return qi_flush_iec(iommu, index, irq_iommu->irte_mask);
 }
@@ -650,6 +654,7 @@ static int __init intel_enable_irq_remapping(void)
 * Setup Interrupt-remapping for all the DRHD's now.
 */
for_each_iommu(iommu, drhd) {
+#ifdef CONFIG_INTEL_IOMMU
if (iommu->pre_enabled_ir) {
unsigned long long q;
 
@@ -660,6 +665,7 @@ static int __init intel_enable_irq_remapping(void)
INTR_REMAP_TABLE_ENTRIES*sizeof(struct irte));
__iommu_load_old_irte(iommu);
} else
+#endif
iommu_set_irq_remapping(iommu, eim);
 
setup = true;
@@ -1374,6 +1380,7 @@ static int __iommu_update_old_irte(struct intel_iommu *iommu, int index)
 
 static void iommu_check_pre_ir_status(struct intel_iommu *iommu)
 {
+#ifdef CONFIG_INTEL_IOMMU
u32 sts;
 
sts = readl(iommu->reg + DMAR_GSTS_REG);
@@ -1381,4 +1388,5 @@ static void iommu_check_pre_ir_status(struct intel_iommu *iommu)
pr_info("IR is enabled prior to OS.\n");
iommu->pre_enabled_ir = 1;
}
+#endif
 }
-- 
2.1.4



Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

2015-05-22 Thread Daniel J Blueman



--
Daniel J Blueman
Principal Software Engineer, Numascale

On Sat, May 23, 2015 at 1:14 AM, Waiman Long wrote:
> On 05/22/2015 05:33 AM, Mel Gorman wrote:
>> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman wrote:
>>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman wrote:
>>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>>> I just noticed a hang on my largest box.
>>>>>> I can only reproduce with large core counts, if I turn down the
>>>>>> number of cpus it doesn't have an issue.
>>>>>
>>>>> Odd. The number of core counts should make little difference
>>>>> as only one CPU per node should be in use. Does sysrq+t give any
>>>>> indication how or where it is hanging?
>>>>
>>>> I was seeing the same behaviour of 1000ms increasing to 5500ms
>>>> [1]; this suggests either lock contention or O(n) behaviour.
>>>>
>>>> Nathan, can you check with this ordering of patches from Andrew's
>>>> cache [2]? I was getting hanging until I found them all.
>>>>
>>>> I'll follow up with timing data.
>>>
>>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to
>>> login:
>>>
>>> 1. 2086s with patches 01-19 [1]
>>>
>>> 2. 2026s adding "Take into account that large system caches scale
>>> linearly with memory", which has:
>>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 3. 2442s fixing to:
>>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 4. 2064s adjusting minimum and shift to:
>>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 5. 1934s adjusting minimum and shift to:
>>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 6. 930s #5 with the non-temporal PMD init patch I had earlier
>>> proposed (I'll pursue separately)
>>>
>>> The scaling patch isn't in -mm.
>>
>> That patch was superseded by "mm: meminit: finish
>> initialisation of struct pages before basic setup" and
>> "mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix"
>> so that's ok.
>>
>> FWIW, I think you should still go ahead with the non-temporal patches
>> because there is potential benefit there other than the initialisation.
>> If there was an arch-optional implementation of a non-temporal clear
>> then it would also be worth considering if __GFP_ZERO should use
>> non-temporal stores. At a greater stretch it would be worth considering
>> if kswapd freeing should zero pages to avoid a zero on the allocation
>> side in the general case as it would be more generally useful and a
>> stepping stone towards what the series "Sanitizing freed pages"
>> attempts.

Good tip Mel; I'll take a look when time allows and get some data,
though I guess it'll only be a win where the clearing is on a different
node than the allocation.

> I think the non-temporal patch benefits mainly AMD systems. I have
> tried the patch on both DragonHawk and it actually made it boot up a
> little bit slower. I think the Intel optimized "rep stosb"
> instruction (used in memset) is performing well. I had done a similar
> test on zero page code and the performance gain was inconclusive.

I suspect 'rep stosb' on modern Intel hardware can write whole
cachelines atomically, avoiding the RMW, or that the read part of the
RMW is optimally prefetched. Open-coding it just can't reach the same
level of pipeline saturation that the microcode can.


Daniel



Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

2015-05-22 Thread Daniel J Blueman
On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman <dan...@numascale.com> wrote:
> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <mgor...@suse.de> wrote:
>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>> I just noticed a hang on my largest box.
>>> I can only reproduce with large core counts, if I turn down the
>>> number of cpus it doesn't have an issue.
>>
>> Odd. The number of core counts should make little difference as only
>> one CPU per node should be in use. Does sysrq+t give any indication
>> how or where it is hanging?
>
> I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
> this suggests either lock contention or O(n) behaviour.
>
> Nathan, can you check with this ordering of patches from Andrew's
> cache [2]? I was getting hanging until I found them all.
>
> I'll follow up with timing data.


7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to login:

1. 2086s with patches 01-19 [1]

2. 2026s adding "Take into account that large system caches scale 
linearly with memory", which has:

min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

3. 2442s fixing to:
max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));

4. 2064s adjusting minimum and shift to:
max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

5. 1934s adjusting minimum and shift to:
max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));

6. 930s #5 with the non-temporal PMD init patch I had earlier proposed 
(I'll pursue separately)
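
To make the variants above concrete, the expressions can be evaluated for a single node. This is a rough sketch under assumptions that are not from the thread: 4KB pages (a `PAGE_SHIFT` of 12) and a node spanning exactly 32GB, which is close to 7TB divided over 216 nodes. All function and macro names other than `PAGE_SHIFT` are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT 12UL	/* assumption: 4KB pages */

/* Assumed example node: exactly 32GB, expressed in pages. */
#define NODE_32GB_PAGES (32UL << (30 - PAGE_SHIFT))

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* Convert a page count to MB for easier reading. */
static unsigned long pages_to_mb(unsigned long pages)
{
	return pages >> (20 - PAGE_SHIFT);
}

/* Variant 2: min(2GB, 1/8 of the node's spanned pages). */
static unsigned long variant2_pages(unsigned long spanned)
{
	return min_ul(2UL << (30 - PAGE_SHIFT), spanned >> 3);
}

/* Variant 3: the same operands with max() instead of min(). */
static unsigned long variant3_pages(unsigned long spanned)
{
	return max_ul(2UL << (30 - PAGE_SHIFT), spanned >> 3);
}

/* Variants 4 and 5: max(min_mb megabytes, 1/256 of the node). */
static unsigned long scaled_pages(unsigned long min_mb, unsigned long spanned)
{
	assert(PAGE_SHIFT <= 20);	/* keeps the shift count non-negative */
	return max_ul(min_mb << (20 - PAGE_SHIFT), spanned >> 8);
}
```

Under these assumptions, `pages_to_mb()` applied to variants 2 to 5 for `NODE_32GB_PAGES` gives 2048MB, 4096MB, 512MB and 128MB respectively, so the later variants compute a much smaller per-node value than the first two.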


The scaling patch isn't in -mm. #5 tests out nicely on a bunch of other
AMD systems, 64GB and up, so: Tested-by: Daniel J Blueman
<dan...@numascale.com>.


Fine work, Mel!

Daniel

-- [1]


http://ozlabs.org/~akpm/mmots/broken-out/memblock-introduce-a-for_each_reserved_mem_region-iterator.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-move-page-initialization-into-a-separate-function.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-only-set-page-reserved-in-the-memblock-region.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-page_alloc-pass-pfn-to-__free_pages_bootmem.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-make-__early_pfn_to_nid-smp-safe-and-introduce-meminit_pfn_in_nid.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-remaining-struct-pages-in-parallel-with-kswapd.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-minimise-number-of-pfn-page-lookups-during-initialisation.patch
http://ozlabs.org/~akpm/mmots/broken-out/x86-mm-enable-deferred-struct-page-initialisation-on-x86-64.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-free-pages-in-large-chunks-where-possible.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-remove-mminit_verify_page_links.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-initialise-a-subset-of-struct-pages-if-config_deferred_struct_page_init-is-set-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-reduce-number-of-times-pageblocks-are-set-during-struct-page-init-fix.patch
http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch




http://ozlabs.org/~akpm/mmots/broken-out/mm-meminit-inline-some-helper-functions-fix2.patch




Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

2015-05-22 Thread Daniel J Blueman



--
Daniel J Blueman
Principal Software Engineer, Numascale

On Sat, May 23, 2015 at 1:14 AM, Waiman Long <waiman.l...@hp.com> wrote:
> On 05/22/2015 05:33 AM, Mel Gorman wrote:
>> On Fri, May 22, 2015 at 02:30:01PM +0800, Daniel J Blueman wrote:
>>> On Thu, May 14, 2015 at 6:03 PM, Daniel J Blueman
>>> <dan...@numascale.com> wrote:
>>>> On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <mgor...@suse.de> wrote:
>>>>> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>>>>>> I just noticed a hang on my largest box.
>>>>>> I can only reproduce with large core counts, if I turn down the
>>>>>> number of cpus it doesn't have an issue.
>>>>>
>>>>> Odd. The number of core counts should make little difference
>>>>> as only one CPU per node should be in use. Does sysrq+t give any
>>>>> indication how or where it is hanging?
>>>>
>>>> I was seeing the same behaviour of 1000ms increasing to 5500ms
>>>> [1]; this suggests either lock contention or O(n) behaviour.
>>>>
>>>> Nathan, can you check with this ordering of patches from Andrew's
>>>> cache [2]? I was getting hangs until I found them all.
>>>>
>>>> I'll follow up with timing data.
>>>
>>> 7TB over 216 NUMA nodes, 1728 cores, from kernel 4.0.4 load to
>>> login:
>>>
>>> 1. 2086s with patches 01-19 [1]
>>>
>>> 2. 2026s adding "Take into account that large system caches scale
>>> linearly with memory", which has:
>>> min(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 3. 2442s fixing to:
>>> max(2UL << (30 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 3));
>>>
>>> 4. 2064s adjusting minimum and shift to:
>>> max(512UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 5. 1934s adjusting minimum and shift to:
>>> max(128UL << (20 - PAGE_SHIFT), (pgdat->node_spanned_pages >> 8));
>>>
>>> 6. 930s #5 with the non-temporal PMD init patch I had earlier
>>> proposed (I'll pursue separately)
>>>
>>> The scaling patch isn't in -mm.
>>
>> That patch was superseded by "mm: meminit: finish
>> initialisation of struct pages before basic setup" and
>> mm-meminit-finish-initialisation-of-struct-pages-before-basic-setup-fix
>> so that's ok.
>>
>> FWIW, I think you should still go ahead with the non-temporal patches
>> because there is potential benefit there other than the initialisation.
>> If there was an arch-optional implementation of a non-temporal clear
>> then it would also be worth considering if __GFP_ZERO should use
>> non-temporal stores. At a greater stretch it would be worth considering
>> if kswapd freeing should zero pages to avoid a zero on the allocation
>> side in the general case as it would be more generally useful and a
>> stepping stone towards what the series "Sanitizing freed pages"
>> attempts.

Good tip Mel; I'll take a look when time allows and get some data,
though I guess it'll only be a win where the clearing is on a different
node than the allocation.

> I think the non-temporal patch benefits mainly AMD systems. I have
> tried the patch on both DragonHawk and it actually made it boot up a
> little bit slower. I think the Intel optimized rep stosb
> instruction (used in memset) is performing well. I had done similar
> test on zero page code and the performance gain was non-conclusive.

I suspect 'rep stosb' on modern Intel hardware can write whole
cachelines atomically, avoiding the RMW, or that the read part of the
RMW is optimally prefetched. Open-coding it just can't reach the same
level of pipeline saturation that the microcode can.
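
For anyone wanting to compare the two approaches in userspace, a rough
sketch follows. It is not the PMD init patch and not kernel code: the
function names are invented for illustration, GCC or Clang inline asm and
SSE2 intrinsics are assumed, and both routines fall back to memset() on
other architectures. The non-temporal variant assumes an 8-byte aligned
buffer whose length is a multiple of 8.

```c
#include <stddef.h>
#include <string.h>
#if defined(__x86_64__)
#include <emmintrin.h> /* _mm_stream_si64 (SSE2), also pulls in _mm_sfence */
#endif

/* Clear a buffer with 'rep stosb', leaving the strategy to the microcode. */
static void clear_rep_stosb(void *dst, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);
#endif
}

/* Clear a buffer with open-coded non-temporal (cache-bypassing) stores. */
static void clear_nontemporal(void *dst, size_t len)
{
#if defined(__x86_64__)
	long long *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		_mm_stream_si64(&p[i], 0);
	_mm_sfence(); /* make the streaming stores visible before returning */
#else
	memset(dst, 0, len);
#endif
}
```

Timing both over a buffer much larger than the last-level cache, on local
and on remote NUMA memory, would be one way to check the guess above; the
sketch only shows the two store patterns and makes no claim about which is
faster.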


Daniel



Re: [PATCH] mm: meminit: Finish initialisation of struct pages before basic setup

2015-05-14 Thread Daniel J Blueman

On Thu, May 14, 2015 at 12:31 AM, Mel Gorman <mgor...@suse.de> wrote:
> On Wed, May 13, 2015 at 10:53:33AM -0500, nzimmer wrote:
>> I just noticed a hang on my largest box.
>> I can only reproduce with large core counts, if I turn down the
>> number of cpus it doesn't have an issue.
>
> Odd. The number of core counts should make little difference as only
> one CPU per node should be in use. Does sysrq+t give any indication how
> or where it is hanging?

I was seeing the same behaviour of 1000ms increasing to 5500ms [1];
this suggests either lock contention or O(n) behaviour.

Nathan, can you check with this ordering of patches from Andrew's cache
[2]? I was getting hangs until I found them all.

I'll follow up with timing data.
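
As a way of quantifying the slowdown from the log in [1], the per-node
lines can be parsed and turned into a pages-per-millisecond rate. This is
only an illustrative userspace sketch; the struct and function names are
hypothetical, and the line format is assumed to be exactly the
"node N initialised, P pages in Tms" form shown in [1].

```c
#include <stdio.h>

/* One parsed "node N initialised, P pages in Tms" log line. */
struct node_init {
	int node;
	unsigned long pages;
	unsigned long ms;
};

/* Returns 1 if the line matched the expected format, 0 otherwise. */
static int parse_node_init(const char *line, struct node_init *out)
{
	double timestamp; /* parsed but not used */

	return sscanf(line, " [ %lf ] node %d initialised, %lu pages in %lums",
		      &timestamp, &out->node, &out->pages, &out->ms) == 4;
}

/* Initialisation rate in pages per millisecond (0 if no time elapsed). */
static unsigned long pages_per_ms(const struct node_init *n)
{
	return n->ms ? n->pages / n->ms : 0;
}
```

Fed the first and last complete lines of [1], this gives roughly 7295
pages/ms for node 2 (1060ms) against 2864 pages/ms for node 148 (2700ms),
for the same number of pages per node.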

Thanks,
 Daniel

-- [1]

[   73.076117] node 2 initialised, 7732961 pages in 1060ms
[   73.077184] node 38 initialised, 7732961 pages in 1060ms
[   73.079626] node 146 initialised, 7732961 pages in 1050ms
[   73.093488] node 62 initialised, 7732961 pages in 1080ms
[   73.091557] node 3 initialised, 7732962 pages in 1080ms
[   73.10] node 186 initialised, 7732961 pages in 1040ms
[   73.095731] node 4 initialised, 7732961 pages in 1080ms
[   73.090289] node 50 initialised, 7732961 pages in 1080ms
[   73.094005] node 158 initialised, 7732961 pages in 1050ms
[   73.095421] node 159 initialised, 7732962 pages in 1050ms
[   73.090324] node 52 initialised, 7732961 pages in 1080ms
[   73.099056] node 5 initialised, 7732962 pages in 1080ms
[   73.090116] node 160 initialised, 7732961 pages in 1050ms
[   73.161051] node 157 initialised, 7732962 pages in 1120ms
[   73.193565] node 161 initialised, 7732962 pages in 1160ms
[   73.212456] node 26 initialised, 7732961 pages in 1200ms
[   73.222904] node 0 initialised, 6686488 pages in 1210ms
[   73.242165] node 140 initialised, 7732961 pages in 1210ms
[   73.254230] node 156 initialised, 7732961 pages in 1220ms
[   73.284634] node 1 initialised, 7732962 pages in 1270ms
[   73.305301] node 141 initialised, 7732962 pages in 1280ms
[   73.322845] node 28 initialised, 7732961 pages in 1310ms
[   73.321757] node 142 initialised, 7732961 pages in 1290ms
[   73.327677] node 138 initialised, 7732961 pages in 1300ms
[   73.413597] node 176 initialised, 7732961 pages in 1370ms
[   73.42] node 139 initialised, 7732962 pages in 1420ms
[   73.475356] node 143 initialised, 7732962 pages in 1440ms
[   73.547202] node 32 initialised, 7732961 pages in 1530ms
[   73.579591] node 104 initialised, 7732961 pages in 1560ms
[   73.618065] node 174 initialised, 7732961 pages in 1570ms
[   73.624918] node 178 initialised, 7732961 pages in 1580ms
[   73.649024] node 175 initialised, 7732962 pages in 1610ms
[   73.654110] node 105 initialised, 7732962 pages in 1630ms
[   73.670589] node 106 initialised, 7732961 pages in 1650ms
[   73.739682] node 102 initialised, 7732961 pages in 1720ms
[   73.769639] node 86 initialised, 7732961 pages in 1750ms
[   73.775573] node 44 initialised, 7732961 pages in 1760ms
[   73.772955] node 177 initialised, 7732962 pages in 1740ms
[   73.804390] node 34 initialised, 7732961 pages in 1790ms
[   73.819370] node 30 initialised, 7732961 pages in 1810ms
[   73.847882] node 98 initialised, 7732961 pages in 1830ms
[   73.867545] node 33 initialised, 7732962 pages in 1860ms
[   73.877964] node 107 initialised, 7732962 pages in 1860ms
[   73.906256] node 103 initialised, 7732962 pages in 1880ms
[   73.945581] node 100 initialised, 7732961 pages in 1930ms
[   73.947024] node 96 initialised, 7732961 pages in 1930ms
[   74.186208] node 116 initialised, 7732961 pages in 2170ms
[   74.220838] node 68 initialised, 7732961 pages in 2210ms
[   74.252341] node 46 initialised, 7732961 pages in 2240ms
[   74.274795] node 118 initialised, 7732961 pages in 2260ms
[   74.337544] node 14 initialised, 7732961 pages in 2320ms
[   74.350819] node 22 initialised, 7732961 pages in 2340ms
[   74.350332] node 69 initialised, 7732962 pages in 2340ms
[   74.362683] node 211 initialised, 7732962 pages in 2310ms
[   74.360617] node 70 initialised, 7732961 pages in 2340ms
[   74.369137] node 66 initialised, 7732961 pages in 2360ms
[   74.378242] node 115 initialised, 7732962 pages in 2360ms
[   74.404221] node 213 initialised, 7732962 pages in 2350ms
[   74.420901] node 210 initialised, 7732961 pages in 2370ms
[   74.430049] node 35 initialised, 7732962 pages in 2420ms
[   74.436007] node 48 initialised, 7732961 pages in 2420ms
[   74.480595] node 71 initialised, 7732962 pages in 2460ms
[   74.485700] node 67 initialised, 7732962 pages in 2480ms
[   74.502627] node 31 initialised, 7732962 pages in 2490ms
[   74.542220] node 16 initialised, 7732961 pages in 2530ms
[   74.547936] node 128 initialised, 7732961 pages in 2520ms
[   74.634374] node 214 initialised, 7732961 pages in 2580ms
[   74.654389] node 88 initialised, 7732961 pages in 2630ms
[   74.722833] node 117 initialised, 7732962 pages in 2700ms
[   74.735002] node 148 initialised, 7732961 pages in 2700ms
[   74.742725] 

irq_work_sync hangs

2015-05-14 Thread Daniel J Blueman
->flags, flags);

+		if (!(work->flags & ~IRQ_WORK_LAZY))
+			pr_err("run id %lu start flags=0x%lx\n",
+				work->id, work->flags);

 		work->func(work);
 		/*
 		 * Clear the BUSY bit and return to the free state if
 		 * no-one else claimed it meanwhile.
 		 */
 		(void)cmpxchg(&work->flags, flags, flags & ~IRQ_WORK_BUSY);

+		if (!(work->flags & IRQ_WORK_LAZY))
+			pr_err("run id %lu end flags=0x%lx\n",
+				work->id, work->flags);
 	}
 }

@@ -190,7 +205,13 @@ void irq_work_sync(struct irq_work *work)
 {
 	WARN_ON_ONCE(irqs_disabled());

+	if (!(work->flags & ~IRQ_WORK_LAZY))
+		pr_err("sync id %lu start flags=0x%lx\n", work->id, work->flags);
+
 	while (work->flags & IRQ_WORK_BUSY)
 		cpu_relax();
+
+	if (!(work->flags & ~IRQ_WORK_LAZY))
+		pr_err("sync id %lu end\n", work->id);
 }
 EXPORT_SYMBOL_GPL(irq_work_sync);
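
For readers not familiar with the flag handshake being traced above, here
is a simplified userspace model using C11 atomics. It is a sketch under
assumptions, not the kernel implementation: the names and flag values are
invented for illustration, there is no per-CPU list or interrupt context,
and the LAZY flag is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define WORK_PENDING 1UL /* queued, not yet picked up */
#define WORK_BUSY    2UL /* claimed or currently running */

struct work {
	_Atomic unsigned long flags;
	void (*func)(struct work *);
	int runs; /* illustration only: counts callback invocations */
};

/* Claim the entry for queueing; fails if it is already pending. */
static bool work_claim(struct work *w)
{
	unsigned long old = atomic_load(&w->flags);

	for (;;) {
		if (old & WORK_PENDING)
			return false;
		if (atomic_compare_exchange_weak(&w->flags, &old,
						 old | WORK_PENDING | WORK_BUSY))
			return true;
	}
}

/* Run one claimed entry, then drop BUSY unless it was re-claimed. */
static void work_run(struct work *w)
{
	unsigned long flags = atomic_load(&w->flags) & ~WORK_PENDING;

	atomic_store(&w->flags, flags); /* PENDING cleared, BUSY still set */
	w->func(w);
	atomic_compare_exchange_strong(&w->flags, &flags, flags & ~WORK_BUSY);
}

/* Spin until the entry is no longer busy. */
static void work_sync(struct work *w)
{
	while (atomic_load(&w->flags) & WORK_BUSY)
		;
}

/* Example callback. */
static void count_run(struct work *w)
{
	w->runs++;
}
```

In this model work_sync() only returns once BUSY has been cleared, which
happens at the end of work_run(). If a claimed entry is never run, the
spin never ends; whether that is what happens in the reported hang is not
established here.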
--
Daniel J Blueman
Principal Software Engineer, Numascale AS



Re: [Patch v3] x86, irq: Allocate CPU vectors from device local CPUs if possible

2015-05-08 Thread Daniel J Blueman
On Thu, May 7, 2015 at 10:53 AM, Jiang Liu wrote:
> On NUMA systems, an IO device may be associated with a NUMA node.
> It may improve IO performance to allocate resources, such as memory
> and interrupts, from device local node.
>
> This patch introduces a mechanism to support CPU vector allocation
> policies. It tries to allocate CPU vectors from CPUs on device local
> node first, and then falls back to all online (global) CPUs.
>
> This mechanism may be used to support NumaConnect systems to allocate
> CPU vectors from device local node.
>
> Signed-off-by: Jiang Liu
> Cc: Daniel J Blueman <dan...@numascale.com>
> ---
> Hi Thomas,
> I feel this should be simplest version now:)
> Thanks!
> Gerry
> ---
>  arch/x86/kernel/apic/vector.c |   23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 1c7dd42b98c1..eb65c6b98de0 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -210,6 +210,18 @@ static int assign_irq_vector(int irq, struct apic_chip_data *data,
>  	return err;
>  }
>
> +static int assign_irq_vector_policy(int irq, int node,
> +				    struct apic_chip_data *data,
> +				    struct irq_alloc_info *info)
> +{
> +	if (info && info->mask)
> +		return assign_irq_vector(irq, data, info->mask);
> +	if (node != NUMA_NO_NODE &&
> +	    assign_irq_vector(irq, data, cpumask_of_node(node)) == 0)
> +		return 0;
> +	return assign_irq_vector(irq, data, apic->target_cpus());
> +}
> +
>  static void clear_irq_vector(int irq, struct apic_chip_data *data)
>  {
>  	int cpu, vector;
> @@ -258,12 +270,6 @@ void copy_irq_alloc_info(struct irq_alloc_info *dst, struct irq_alloc_info *src)
>  		memset(dst, 0, sizeof(*dst));
>  }
>
> -static inline const struct cpumask *
> -irq_alloc_info_get_mask(struct irq_alloc_info *info)
> -{
> -	return (!info || !info->mask) ? apic->target_cpus() : info->mask;
> -}
> -
>  static void x86_vector_free_irqs(struct irq_domain *domain,
>  				 unsigned int virq, unsigned int nr_irqs)
>  {
> @@ -289,7 +295,6 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  {
>  	struct irq_alloc_info *info = arg;
>  	struct apic_chip_data *data;
> -	const struct cpumask *mask;
>  	struct irq_data *irq_data;
>  	int i, err;
>
> @@ -300,7 +305,6 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  	if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
>  		return -ENOSYS;
>
> -	mask = irq_alloc_info_get_mask(info);
>  	for (i = 0; i < nr_irqs; i++) {
>  		irq_data = irq_domain_get_irq_data(domain, virq + i);
>  		BUG_ON(!irq_data);
> @@ -318,7 +322,8 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
>  		irq_data->chip = &lapic_controller;
>  		irq_data->chip_data = data;
>  		irq_data->hwirq = virq + i;
> -		err = assign_irq_vector(virq, data, mask);
> +		err = assign_irq_vector_policy(virq, irq_data->node, data,
> +					       info);
>  		if (err)
>  			goto error;
>  	}

Testing x86/tip/apic with this patch on a 192 core/24 node NumaConnect
system, all the PCIe bridge, GPU, SATA, NIC etc interrupts are
allocated on the correct NUMA nodes, so it works great.

Tested-by: Daniel J Blueman <dan...@numascale.com>
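
The three-step fallback in assign_irq_vector_policy() can be modelled in
plain C as below. This is an illustrative sketch only: cpumask_t here is a
plain bitmask rather than the kernel type, and try_assign(), the
free_vectors array and the other names are hypothetical stand-ins for the
real vector allocator.

```c
#define NO_NODE (-1) /* stand-in for NUMA_NO_NODE */

typedef unsigned long cpumask_t; /* hypothetical: one bit per CPU */

/*
 * Hypothetical allocator: return the lowest CPU in 'mask' that still has
 * a free vector, or -1 if none does.
 */
static int try_assign(cpumask_t mask, const int *free_vectors, int ncpus)
{
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		if (((mask >> cpu) & 1) && free_vectors[cpu] > 0)
			return cpu;
	return -1;
}

/*
 * Policy order: an explicit caller-supplied mask wins outright; otherwise
 * try the CPUs of the device's node, then fall back to all online CPUs.
 */
static int assign_policy(const cpumask_t *explicit_mask, int node,
			 const cpumask_t *node_masks, cpumask_t online,
			 const int *free_vectors, int ncpus)
{
	int cpu;

	if (explicit_mask)
		return try_assign(*explicit_mask, free_vectors, ncpus);
	if (node != NO_NODE) {
		cpu = try_assign(node_masks[node], free_vectors, ncpus);
		if (cpu >= 0)
			return cpu;
	}
	return try_assign(online, free_vectors, ncpus);
}
```

As in the patch, an explicit mask is never widened on failure, while a
node-local failure quietly falls back to the global set.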


Many thanks!
 Daniel



Re: [PATCH 0/13] Parallel struct page initialisation v4

2015-05-02 Thread Daniel J Blueman
On Sat, May 2, 2015 at 4:52 PM, Daniel J Blueman  
wrote:
On Sat, May 2, 2015 at 8:09 AM, Waiman Long  
wrote:

On 05/01/2015 06:02 PM, Waiman Long wrote:


Bad news!

I tried your patch on a 24-TB DragonHawk and got an out of memory 
panic. The kernel log messages were:

  :
[   80.126186] CPU  474: hi:  186, btch:  31 usd:   0
[   80.131457] CPU  475: hi:  186, btch:  31 usd:   0
[   80.136726] CPU  476: hi:  186, btch:  31 usd:   0
[   80.141997] CPU  477: hi:  186, btch:  31 usd:   0
[   80.147267] CPU  478: hi:  186, btch:  31 usd:   0
[   80.152538] CPU  479: hi:  186, btch:  31 usd:   0
[   80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
[   80.157813]  active_file:0 inactive_file:0 isolated_file:0
[   80.157813]  unevictable:0 dirty:0 writeback:0 unstable:0
[   80.157813]  free:209 slab_reclaimable:7 slab_unreclaimable:42986
[   80.157813]  mapped:0 shmem:0 pagetables:0 bounce:0
[   80.157813]  free_cma:0
[   80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15988kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB 
mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:14928kB kernel_stack:400kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.233475] lowmem_reserve[]: 0 0 0 0
[   80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:101664kB kernel_stack:50176kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.281456] lowmem_reserve[]: 0 0 0 0
[   80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB 
slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.328958] lowmem_reserve[]: 0 0 0 0
[   80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.377256] lowmem_reserve[]: 0 0 0 0
[   80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.424764] lowmem_reserve[]: 0 0 0 0
[   80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.472293] lowmem_reserve[]: 0 0 0 0
[   80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.519803] lowmem_reserve[]: 0 0 0 0
[   80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.567312] lowmem_reserve[]: 0 0 0 0
[   80.571379] Node 6 Normal free:0kB

Re: [PATCH 0/13] Parallel struct page initialisation v4

2015-05-02 Thread Daniel J Blueman

On Sat, May 2, 2015 at 8:09 AM, Waiman Long  wrote:

On 05/01/2015 06:02 PM, Waiman Long wrote:


Bad news!

I tried your patch on a 24-TB DragonHawk and got an out of memory 
panic. The kernel log messages were:

  :
[   80.126186] CPU  474: hi:  186, btch:  31 usd:   0
[   80.131457] CPU  475: hi:  186, btch:  31 usd:   0
[   80.136726] CPU  476: hi:  186, btch:  31 usd:   0
[   80.141997] CPU  477: hi:  186, btch:  31 usd:   0
[   80.147267] CPU  478: hi:  186, btch:  31 usd:   0
[   80.152538] CPU  479: hi:  186, btch:  31 usd:   0
[   80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
[   80.157813]  active_file:0 inactive_file:0 isolated_file:0
[   80.157813]  unevictable:0 dirty:0 writeback:0 unstable:0
[   80.157813]  free:209 slab_reclaimable:7 slab_unreclaimable:42986
[   80.157813]  mapped:0 shmem:0 pagetables:0 bounce:0
[   80.157813]  free_cma:0
[   80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15988kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB 
mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:14928kB 
kernel_stack:400kB pagetables:0kB unstable:0kB bounce:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes

[   80.233475] lowmem_reserve[]: 0 0 0 0
[   80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:101664kB kernel_stack:50176kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.281456] lowmem_reserve[]: 0 0 0 0
[   80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB 
slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.328958] lowmem_reserve[]: 0 0 0 0
[   80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.377256] lowmem_reserve[]: 0 0 0 0
[   80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.424764] lowmem_reserve[]: 0 0 0 0
[   80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.472293] lowmem_reserve[]: 0 0 0 0
[   80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.519803] lowmem_reserve[]: 0 0 0 0
[   80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.567312] lowmem_reserve[]: 0 0 0 0
[   80.571379] Node 6 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB 

Re: [PATCH 0/13] Parallel struct page initialisation v4

2015-05-02 Thread Daniel J Blueman

On Sat, May 2, 2015 at 4:52 PM, Daniel J Blueman dan...@numascale.com wrote:

On Sat, May 2, 2015 at 8:09 AM, Waiman Long waiman.l...@hp.com wrote:

On 05/01/2015 06:02 PM, Waiman Long wrote:


Bad news!

I tried your patch on a 24-TB DragonHawk and got an out of memory 
panic. The kernel log messages were:

  :
[   80.126186] CPU  474: hi:  186, btch:  31 usd:   0
[   80.131457] CPU  475: hi:  186, btch:  31 usd:   0
[   80.136726] CPU  476: hi:  186, btch:  31 usd:   0
[   80.141997] CPU  477: hi:  186, btch:  31 usd:   0
[   80.147267] CPU  478: hi:  186, btch:  31 usd:   0
[   80.152538] CPU  479: hi:  186, btch:  31 usd:   0
[   80.157813] active_anon:0 inactive_anon:0 isolated_anon:0
[   80.157813]  active_file:0 inactive_file:0 isolated_file:0
[   80.157813]  unevictable:0 dirty:0 writeback:0 unstable:0
[   80.157813]  free:209 slab_reclaimable:7 slab_unreclaimable:42986
[   80.157813]  mapped:0 shmem:0 pagetables:0 bounce:0
[   80.157813]  free_cma:0
[   80.190428] Node 0 DMA free:568kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:15988kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB 
mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:14928kB kernel_stack:400kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.233475] lowmem_reserve[]: 0 0 0 0
[   80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:101664kB kernel_stack:50176kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.281456] lowmem_reserve[]: 0 0 0 0
[   80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:4kB 
slab_unreclaimable:948kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.328958] lowmem_reserve[]: 0 0 0 0
[   80.333031] Node 1 Normal free:248kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612732kB managed:2228220kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB 
slab_unreclaimable:46240kB kernel_stack:3232kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.377256] lowmem_reserve[]: 0 0 0 0
[   80.381325] Node 2 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:612kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.424764] lowmem_reserve[]: 0 0 0 0
[   80.428842] Node 3 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:600kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.472293] lowmem_reserve[]: 0 0 0 0
[   80.476360] Node 4 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:620kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.519803] lowmem_reserve[]: 0 0 0 0
[   80.523875] Node 5 Normal free:0kB min:0kB low:0kB high:0kB 
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
unevictable:0kB isolated(anon):0kB isolated(file):0kB 
present:1610612736kB managed:2097152kB mlocked:0kB dirty:0kB 
writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB 
slab_unreclaimable:584kB kernel_stack:0kB pagetables:0kB 
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? yes

[   80.567312] lowmem_reserve[]: 0 0 0 0
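
The per-zone reports quoted above are hard to compare by eye because each one wraps over several lines. As a reading aid only, here is a minimal sketch that joins the wrapped lines and pulls out the present: and managed: figures for each zone. The names ZONE_RE, parse_zones and SAMPLE are hypothetical, and SAMPLE is abridged from the two Node 0 reports in the log; nothing here is part of the patch series or of any kernel tooling.

```python
import re

# One zone report: "Node <n> <zone> free:<x>kB ... present:<y>kB managed:<z>kB".
# The tempered dot stops a match from running into the next "Node <n>" header
# when a report is truncated before its present/managed fields.
ZONE_RE = re.compile(
    r"\bNode\s+(?P<node>\d+)\s+(?P<zone>\w+)\s+free:(?P<free>\d+)kB"
    r"(?:(?!\bNode\s+\d).)*?"
    r"present:(?P<present>\d+)kB\s+managed:(?P<managed>\d+)kB"
)


def parse_zones(log_text):
    """Return one dict per complete zone report found in log_text."""
    # Collapse all whitespace so fields wrapped onto new lines rejoin their zone.
    flat = " ".join(log_text.split())
    return [
        {
            "node": int(m["node"]),
            "zone": m["zone"],
            "free_kb": int(m["free"]),
            "present_kb": int(m["present"]),
            "managed_kb": int(m["managed"]),
        }
        for m in ZONE_RE.finditer(flat)
    ]


# Abridged from the log above (fields other than present/managed omitted).
SAMPLE = """
[   80.237542] Node 0 DMA32 free:20kB min:0kB low:0kB high:0kB
present:1961924kB managed:1333604kB mlocked:0kB dirty:0kB
[   80.285527] Node 0 Normal free:0kB min:0kB low:0kB high:0kB
present:1608515580kB managed:2097148kB mlocked:0kB dirty:0kB
"""

if __name__ == "__main__":
    for z in parse_zones(SAMPLE):
        share = z["managed_kb"] / z["present_kb"]
        print(f"Node {z['node']} {z['zone']}: "
              f"managed {z['managed_kb']} kB of {z['present_kb']} kB present "
              f"({share:.2%}), free {z['free_kb']} kB")
```

Running the same function over the full log would give one row per node, which makes it easier to see how little of each zone's present memory is reported as managed.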
