PROBLEM: 3.9-rc1 - WARNING: at include/linux/ftrace.h:209 ftrace_ops_control_func+0xde/0xf0()

2013-03-20 Thread WANG Chao
Hi,

On 3.9-rc1, if I run `perf stat -e ftrace:function sleep 1`, a kernel warning 
would happen like below:

root# perf stat -e ftrace:function sleep 1
[   74.339105] WARNING: at include/linux/ftrace.h:209 
ftrace_ops_control_func+0xde/0xf0()
[   74.349522] Hardware name: PRIMERGY RX200 S6
[   74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core 
iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core 
crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss 
nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom 
mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase 
scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[   74.446233] Pid: 1377, comm: perf Tainted: GW3.9.0-rc1 #1
[   74.453458] Call Trace:
[   74.456233]  [] warn_slowpath_common+0x7f/0xc0
[   74.462997]  [] ? rcu_note_context_switch+0xa0/0xa0
[   74.470272]  [] ? __unregister_ftrace_function+0xa2/0x1a0
[   74.478117]  [] warn_slowpath_null+0x1a/0x20
[   74.484681]  [] ftrace_ops_control_func+0xde/0xf0
[   74.491760]  [] ftrace_call+0x5/0x2f
[   74.497511]  [] ? ftrace_call+0x5/0x2f
[   74.503486]  [] ? ftrace_call+0x5/0x2f
[   74.509500]  [] ? synchronize_sched+0x5/0x50
[   74.516088]  [] ? _cond_resched+0x5/0x40
[   74.522268]  [] ? synchronize_sched+0x5/0x50
[   74.528837]  [] ? __unregister_ftrace_function+0xa2/0x1a0
[   74.536696]  [] ? _cond_resched+0x5/0x40
[   74.542878]  [] ? mutex_lock+0x1d/0x50
[   74.548869]  [] unregister_ftrace_function+0x27/0x50
[   74.556243]  [] perf_ftrace_event_register+0x9f/0x140
[   74.563709]  [] ? _cond_resched+0x5/0x40
[   74.569887]  [] ? mutex_lock+0x1d/0x50
[   74.575898]  [] perf_trace_destroy+0x2e/0x50
[   74.582505]  [] tp_perf_event_destroy+0x9/0x10
[   74.589298]  [] free_event+0x70/0x1a0
[   74.595208]  [] perf_event_release_kernel+0x69/0xa0
[   74.602460]  [] ? _cond_resched+0x5/0x40
[   74.608667]  [] put_event+0x90/0xc0
[   74.614373]  [] perf_release+0x10/0x20
[   74.620367]  [] __fput+0xf4/0x280
[   74.625894]  [] fput+0xe/0x10
[   74.631387]  [] task_work_run+0xa7/0xe0
[   74.637452]  [] do_notify_resume+0x71/0xb0
[   74.643843]  [] int_signal+0x12/0x17

 Performance counter stats for 'sleep 1':

 9,325 ftrace:function

   1.004088470 seconds time elapsed
root#

The warning only appears the first time above perf command being triggered. You 
won't see this until a fresh boot.

3.8 doesn't have this issue. Still exists on 3.9-rc3.

-- 
Thanks,
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1 v1] the recommended crash memory reservation is too small for x86_64.

2013-03-25 Thread WANG Chao
On 03/25/2013 06:56 PM, zhouzho...@gmail.com wrote:
> From: root 
> 
> 
>  On Documentation/kdump/kdump.txt, section Boot into  System Kernel: On x86 
> and x86_64, use
>  "crashkernel=64M@16M", but some OSes like ubuntu 12.10 use ram fs larger 
> than 64M, so in these cases the
>  memory reserved for crashkernel should be at least 128M.

People use different initramfs generators for different purpose. That means
the size of initramfs and also its memory consuming can vary very much from
each other. You just can't list all these generators and their recommended
reserved memory here. Though I have to say crashkernel=128M is good choice.

I think it would be better to leave this to user or distribution itself to
determine how much memory should be reserved for crash kernel, then export
this value to kernel in some ways.

Thanks,
WANG Chao

> 
> 
> Signed-off-by: Zhouyi Zhou 
> ---
>  Documentation/kdump/kdump.txt |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 13f1aa0..1e850e0 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -290,7 +290,9 @@ Boot into System Kernel
> "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory
> starting at physical address 0x0100 (16MB) for the dump-capture 
> kernel.
>  
> -   On x86 and x86_64, use "crashkernel=64M@16M".
> +   On x86 and x86_64, use "crashkernel=64M@16M" (some OSes use init ram fs 
> larger
> +than 64M, for example ubuntu-12.10, use crashkernel=128M@16M instead, or 
> dump-capture
> +kernel will out of memory).
>  
> On ppc64, use "crashkernel=128M@32M".
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/1 v1] the recommended crash memory reservation is too small for x86_64.

2013-03-25 Thread WANG Chao
On 03/25/2013 07:52 PM, 周洲仪 wrote:
> Thanks for reviewing
>   Is it sounds good to add following line into section "Boot into System 
> Kernel":
> 
>   The memory reserved for crash kernel should be no less than the unpacked 
> init ram disk size
> that loaded with dump-capture kernel plus wired memory used by kernel itself.

Sorry, that is just basic knowledge for everyone.

> 
>   #
>   After all, I have been trapped into "hang after capture" problem when 
> sticking strictly to this document.
> Cheers
> Zhouyi  
> 
>>  
>>  On 03/25/2013 06:56 PM, zhouzho...@gmail.com wrote:
>>  > From: root 
>>  > 
>>  > 
>>  >  On Documentation/kdump/kdump.txt, section Boot into  System Kernel: On 
>> x86 and x86_64, use
>>  >  "crashkernel=64M@16M", but some OSes like ubuntu 12.10 use ram fs larger 
>> than 64M, so in these cases the
>>  >  memory reserved for crashkernel should be at least 128M.
>>  
>>  People use different initramfs generators for different purpose. That means
>>  the size of initramfs and also its memory consuming can vary very much from
>>  each other. You just can't list all these generators and their recommended
>>  reserved memory here. Though I have to say crashkernel=128M is good choice.
>>  
>>  I think it would be better to leave this to user or distribution itself to
>>  determine how much memory should be reserved for crash kernel, then export
>>  this value to kernel in some ways.
>>  
>>  Thanks,
>>  WANG Chao
>>  
>>  > 
>>  > 
>>  > Signed-off-by: Zhouyi Zhou 
>>  > ---
>>  >  Documentation/kdump/kdump.txt |4 +++-
>>  >  1 file changed, 3 insertions(+), 1 deletion(-)
>>  > 
>>  > diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
>>  > index 13f1aa0..1e850e0 100644
>>  > --- a/Documentation/kdump/kdump.txt
>>  > +++ b/Documentation/kdump/kdump.txt
>>  > @@ -290,7 +290,9 @@ Boot into System Kernel
>>  > "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of 
>> memory
>>  > starting at physical address 0x0100 (16MB) for the dump-capture 
>> kernel.
>>  >  
>>  > -   On x86 and x86_64, use "crashkernel=64M@16M".
>>  > +   On x86 and x86_64, use "crashkernel=64M@16M" (some OSes use init ram 
>> fs larger
>>  > +than 64M, for example ubuntu-12.10, use crashkernel=128M@16M instead, or 
>> dump-capture
>>  > +kernel will out of memory).
>>  >  
>>  > On ppc64, use "crashkernel=128M@32M".
>>  >  
>>  > 
>>  
> 
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] ftrace: Do not call stub functions in control loop

2013-03-28 Thread WANG Chao
On 03/28/2013 08:52 PM, Steven Rostedt wrote:
> From: "Steven Rostedt (Red Hat)" 
> 
> The function tracing control loop used by perf spits out a warning
> if the called function is not a control function. This is because
> the control function references a per cpu allocated data structure
> on struct ftrace_ops that is not allocated for other types of
> functions.
> 
> commit 0a016409e42 "ftrace: Optimize the function tracer list loop"
> 
> Had an optimization done to all function tracing loops to optimize
> for a single registered ops. Unfortunately, this allows for a slight
> race when tracing starts or ends, where the stub function might be
> called after the current registered ops is removed. In this case we
> get the following dump:
> 
> root# perf stat -e ftrace:function sleep 1
> [   74.339105] WARNING: at include/linux/ftrace.h:209 
> ftrace_ops_control_func+0xde/0xf0()
> [   74.349522] Hardware name: PRIMERGY RX200 S6
> [   74.357149] Modules linked in: sg igb iTCO_wdt ptp pps_core 
> iTCO_vendor_support i7core_edac dca lpc_ich i2c_i801 coretemp edac_core 
> crc32c_intel mfd_core ghash_clmulni_intel dm_multipath acpi_power_meter pcspk
> r microcode vhost_net tun macvtap macvlan nfsd kvm_intel kvm auth_rpcgss 
> nfs_acl lockd sunrpc uinput xfs libcrc32c sd_mod crc_t10dif sr_mod cdrom 
> mgag200 i2c_algo_bit drm_kms_helper ttm qla2xxx mptsas ahci drm li
> bahci scsi_transport_sas mptscsih libata scsi_transport_fc i2c_core mptbase 
> scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
> [   74.446233] Pid: 1377, comm: perf Tainted: GW3.9.0-rc1 #1
> [   74.453458] Call Trace:
> [   74.456233]  [] warn_slowpath_common+0x7f/0xc0
> [   74.462997]  [] ? rcu_note_context_switch+0xa0/0xa0
> [   74.470272]  [] ? __unregister_ftrace_function+0xa2/0x1a0
> [   74.478117]  [] warn_slowpath_null+0x1a/0x20
> [   74.484681]  [] ftrace_ops_control_func+0xde/0xf0
> [   74.491760]  [] ftrace_call+0x5/0x2f
> [   74.497511]  [] ? ftrace_call+0x5/0x2f
> [   74.503486]  [] ? ftrace_call+0x5/0x2f
> [   74.509500]  [] ? synchronize_sched+0x5/0x50
> [   74.516088]  [] ? _cond_resched+0x5/0x40
> [   74.522268]  [] ? synchronize_sched+0x5/0x50
> [   74.528837]  [] ? __unregister_ftrace_function+0xa2/0x1a0
> [   74.536696]  [] ? _cond_resched+0x5/0x40
> [   74.542878]  [] ? mutex_lock+0x1d/0x50
> [   74.548869]  [] unregister_ftrace_function+0x27/0x50
> [   74.556243]  [] perf_ftrace_event_register+0x9f/0x140
> [   74.563709]  [] ? _cond_resched+0x5/0x40
> [   74.569887]  [] ? mutex_lock+0x1d/0x50
> [   74.575898]  [] perf_trace_destroy+0x2e/0x50
> [   74.582505]  [] tp_perf_event_destroy+0x9/0x10
> [   74.589298]  [] free_event+0x70/0x1a0
> [   74.595208]  [] perf_event_release_kernel+0x69/0xa0
> [   74.602460]  [] ? _cond_resched+0x5/0x40
> [   74.608667]  [] put_event+0x90/0xc0
> [   74.614373]  [] perf_release+0x10/0x20
> [   74.620367]  [] __fput+0xf4/0x280
> [   74.625894]  [] fput+0xe/0x10
> [   74.631387]  [] task_work_run+0xa7/0xe0
> [   74.637452]  [] do_notify_resume+0x71/0xb0
> [   74.643843]  [] int_signal+0x12/0x17
> 
> To fix this a new ftrace_ops flag is added that denotes the ftrace_list_end
> ftrace_ops stub as just that, a stub. This flag is now checked in the
> control loop and the function is not called if the flag is set.
> 
> Thanks to Jovi for not just reporting the bug, but also pointing out
> where the bug was in the code.
> 
> Link: http://lkml.kernel.org/r/514a8855.7090...@redhat.com
> Link: 
> http://lkml.kernel.org/r/1364377499-1900-15-git-send-email-jovi.zhang...@huawei.com
> 
> Reported-by: WANG Chao 
> Reported-by: zhangwei(Jovi) 
> Signed-off-by: Steven Rostedt 
> ---
>  include/linux/ftrace.h |2 ++
>  kernel/trace/ftrace.c  |5 +++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index e5ca8ef..167abf9 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -89,6 +89,7 @@ typedef void (*ftrace_func_t)(unsigned long ip, unsigned 
> long parent_ip,
>   *that the call back has its own recursion protection. If it does
>   *not set this, then the ftrace infrastructure will add recursion
>   *protection for the caller.
> + * STUB   - The ftrace_ops is just a place holder.
>   */
>  enum {
>   FTRACE_OPS_FL_ENABLED   = 1 << 0,
> @@ -98,6 +99,7 @@ enum {
>   FTRACE_OPS_FL_SAVE_REGS = 1 << 4,
>   FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED= 1 << 5,
>   FTRACE_OPS_FL_RECURSION_SAFE= 1 << 6,
> + FTRACE_OPS_FL_STUB   

vmlinux symtab matching kallsyms fails: diff end addr for aesni_gcm_dec/aesni_gcm_enc

2013-03-07 Thread WANG Chao
Hi, All

On 3.9-rc1, I got following failure running `perf test`:

 1: vmlinux symtab matches kallsyms:
--- start ---
Looking at the vmlinux_path (6 entries long)
Using /lib/modules/3.9.0-rc1/build/vmlinux for symbols
0x810508b0: diff end addr for aesni_gcm_dec v:
0x81053551 k: 0x81050941
0x81053560: diff end addr for aesni_gcm_enc v:
0x8105619a k: 0x810535f1
Maps only in vmlinux:
 81a311f5-81b8c48f 0 [kernel].init.text
 81b8c490-9fff 0 [kernel].exit.text
Maps in vmlinux with a different name in kallsyms:
Maps only in kallsyms:
 end 
vmlinux symtab matches kallsyms: FAILED!

The end addr of aesni_gcm_dec/aesni_gcm_enc in vmlinux symtab and
kallsyms is wrong beyond 4K.

I think that's a problem or am I missing some changes for these days?

-- 
Thanks,
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


3.9-rc1: crash kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer

2013-03-07 Thread WANG Chao
 1
[2.816782] i8042: PNP: PS/2 appears to have AUX port disabled, if this is 
incorrect please boot with i8042.nopnp
[2.827473] serio: i8042 KBD port at 0x60,0x64 irq 1
[2.832534] mousedev: PS/2 mouse device common for all mice
[2.838423] rtc_cmos 00:06: RTC can wake from S4
[2.843292] rtc_cmos 00:06: rtc core: registered rtc_cmos as rtc0
[2.849434] rtc_cmos 00:06: alarms up to one month, y3k, 242 bytes nvram, 
hpet irqs
[2.857324] cpuidle: using governor ladder
[2.861761] cpuidle: using governor menu
[2.866158] EFI Variables Facility v0.08 2004-May-17
[2.871140] hidraw: raw HID events driver (C) Jiri Kosina
[2.873905] input: AT Translated Set 2 keyboard as 
/devices/platform/i8042/serio0/input/input3
[2.885248] usbcore: registered new interface driver usbhid
[2.890818] usbhid: USB HID core driver
[2.894647] usb 1-1: new high-speed USB device number 2 using ehci-pci
[2.894695] drop_monitor: Initializing network drop monitor service
[2.894813] TCP: cubic registered
[2.894815] Initializing XFRM netlink socket
[2.894988] NET: Registered protocol family 10
[2.895269] NET: Registered protocol family 17
[2.895770] Loading module verification certificates
[2.897372] MODSIGN: Loaded cert 'Magrathea: Glacier signing key: 
b14fa6fba81316fe0bb193bbf458deba6d430978'
[2.897384] registered taskstats version 1
[2.897388] IMA: No TPM chip found, activating TPM-bypass!
[2.948076] Kernel panic - not syncing: Can not allocate SWIOTLB buffer 
earlier and can't now provide you with the DMA bounce buffer
[2.959958] Pid: 53, comm: khubd Not tainted 3.9.0-rc1+ #1
[2.965426] Call Trace:
[2.967866]  [] panic+0xc1/0x1d0
[2.972644]  [] swiotlb_tbl_map_single+0x27c/0x280
[2.978991]  [] map_single+0x19/0x20
[2.984115]  [] swiotlb_map_page+0x6e/0x160
[2.989845]  [] usb_hcd_map_urb_for_dma+0x230/0x4a0
[2.996268]  [] usb_hcd_submit_urb+0x295/0x8e0
[3.002258]  [] ? __dequeue_entity+0x2f/0x50
[3.008076]  [] ? __switch_to+0x13e/0x4a0
[3.013632]  [] usb_submit_urb+0xff/0x3d0
[3.019186]  [] ? __schedule+0x3de/0x7e0
[3.024657]  [] usb_start_wait_urb+0x6a/0x160
[3.030560]  [] ? __kmalloc+0x55/0x210
[3.035856]  [] ? usb_alloc_urb+0x1e/0x50
[3.041411]  [] usb_control_msg+0xde/0x140
[3.047056]  [] ? hub_port_init+0x310/0xaf0
[3.052785]  [] ? hub_port_init+0x2eb/0xaf0
[3.058515]  [] hub_port_init+0x338/0xaf0
[3.064071]  [] ? update_autosuspend+0x39/0x60
[3.070062]  [] ? 
pm_runtime_set_autosuspend_delay+0x49/0x70
[3.077264]  [] hub_port_connect_change+0x24a/0xaa0
[3.083684]  [] hub_events+0x2ea/0x910
[3.088981]  [] ? __schedule+0x3de/0x7e0
[3.094451]  [] hub_thread+0x35/0x1e0
[3.099661]  [] ? wake_up_bit+0x40/0x40
[3.105045]  [] ? hub_events+0x910/0x910
[3.110514]  [] kthread+0xc0/0xd0
[3.115378]  [] ? kthread_create_on_node+0x120/0x120
[    3.121887]  [] ret_from_fork+0x7c/0xb0
[3.127271]  [] ? kthread_create_on_node+0x120/0x120

-- 
Thanks,
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.9-rc1: crash kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer

2013-03-07 Thread WANG Chao
On 03/08/2013 02:36 PM, Yinghai Lu wrote:
> On Thu, Mar 7, 2013 at 10:32 PM, Yinghai Lu  wrote:
>> On Thu, Mar 7, 2013 at 10:03 PM, CAI Qian  wrote:
>>> CC'ing kexec ML. Also mentioned that 3.8 has no such issue.
>>>
>>> This message looks suspicious and out of range while 3.8 reservation
>>> looks within the range.
>>>
>>> [0.00] Reserving 128MB of memory at 5216MB for crashkernel
>>> (System RAM: 3977MB)
>>>
>>> Wondering if anything to do with memblock again...
>>
>> that is intended...
>>
>>> - Original Message -
>>>> From: "WANG Chao" 
>>>> To: "LKML" vger.kernel.org>
>>>> Cc: "CAI Qian" 
>>>> Sent: Friday, March 8, 2013 1:54:37 PM
>>>> Subject: 3.9-rc1: crash kernel panic - not syncing: Can not allocate 
>>>> SWIOTLB buffer earlier and can't now provide you
>>>> with the DMA bounce buffer
>>>>
>>>> Hi, All
>>>>
>>>> On 3.9-rc1, I load crash kernel with latest kexec-tools(up to
>>>> 28d413a), but
>>>> 2nd kernel panic at early time:
>>>> [2.948076] Kernel panic - not syncing: Can not allocate SWIOTLB
>>>> buffer earlier and can't now provide you with the DMA bounce buffer
>>>> [2.959958] Pid: 53, comm: khubd Not tainted 3.9.0-rc1+ #1
>>
>> You need to add crashkernel_low=64M in first kernel.
>>
>> As your system does not support DMA remapping.
> 
> looks like your system DO have DMAR table, please enable dmar
> remapping in your kernel config.

I've already got following config:
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
CONFIG_IRQ_REMAP=y

but I don't have intel_iommu=on in kernel cmdline. IIRC, iommu will prevent
2nd kernel from booting ...

I tested crashkernel=128M and crashkernel_low=64M, seems 2nd-kernel/kexec only
works when two params are used in combination.

Thanks,
WANG Chao

> 
> Yinghai
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.9-rc1: crash kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer

2013-03-07 Thread WANG Chao
On 03/08/2013 03:27 PM, Yinghai Lu wrote:
> On Thu, Mar 7, 2013 at 11:20 PM, WANG Chao  wrote:
>>>
>>> looks like your system DO have DMAR table, please enable dmar
>>> remapping in your kernel config.
>>
>> I've already got following config:
>> CONFIG_DMAR_TABLE=y
>> CONFIG_INTEL_IOMMU=y
>> CONFIG_IRQ_REMAP=y
>>
>> but I don't have intel_iommu=on in kernel cmdline. IIRC, iommu will prevent
>> 2nd kernel from booting ...
> 
> Did you put intel_iommu=on on first and second cpu both?

I tried, 2nd kernel didn't boot and keep splitting errors like these:
[2.106939] DMAR: No ATSR found
[2.110121] IOMMU 0 0xfed9: using Queued invalidation
[2.115522] IOMMU 1 0xfed91000: using Queued invalidation
[2.120919] IOMMU: Setting RMRR:
[2.124162] IOMMU: Setting identity map for device :00:02.0 [0xab80
- 0xaf9f]
[2.133099] IOMMU: Setting identity map for device :00:1d.0 [0xaac95000
- 0xaacb2fff]
[2.141305] IOMMU: Setting identity map for device :00:1a.0 [0xaac95000
- 0xaacb2fff]
[2.149503] IOMMU: Setting identity map for device :00:14.0 [0xaac95000
- 0xaacb2fff]
[2.157690] IOMMU: Prepare 0-16MiB unity mapping for LPC
[2.163011] IOMMU: Setting identity map for device :00:1f.0 [0x0 - 
0xff
[Errors, here we go]
[2.170932] dmar: DRHD: handling fault status reg 3
[2.170933] PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
[2.182486] dmar: DMAR:[DMA Write] Request device [00:02.0] fault addr 
e000
[2.182486] DMAR:[fault reason 05] PTE Write access is not set
[2.195705] dmar: DRHD: handling fault status reg 3
[2.200570] dmar: DMAR:[DMA Read] Request device [00:02.0] fault addr 
ff873000
[2.200570] DMAR:[fault reason 06] PTE Read access is not set
[2.213618] dmar: DRHD: handling fault status reg 3
[..]

Thanks,
WANG Chao
> 
> Thanks
> 
> Yinghai
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.9-rc1: crash kernel panic - not syncing: Can not allocate SWIOTLB buffer earlier and can't now provide you with the DMA bounce buffer

2013-03-10 Thread WANG Chao
On 03/09/2013 03:39 AM, Yinghai Lu wrote:
> [ Add more to To list ]
> 
> On Fri, Mar 8, 2013 at 10:24 AM, Yinghai Lu  wrote:
>> On Fri, Mar 8, 2013 at 4:12 AM, WANG Chao  wrote:
>>
>>>> what is 00:02.0 in your system?
>>> This IOMMU issue is related to https://lkml.org/lkml/2012/11/26/814. We can
>>> discuss this IOMMU issue in that thread.
>>> Anyway 00:02.0 is a video card, the box is Ivy Bridge.
>>> # lspci -s 00:02.0 -v
>>> 00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor
>>> Graphics Controller (rev 09) (prog-if 00 [VGA controller])
>>> Subsystem: Intel Corporation Device 2211
>>> Flags: bus master, fast devsel, latency 0, IRQ 44
>>> Memory at afc0 (64-bit, non-prefetchable) [size=4M]
>>> Memory at c000 (64-bit, prefetchable) [size=256M]
>>> I/O ports at 6000 [size=64]
>>> Expansion ROM at  [disabled]
>>> Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
>>> Capabilities: [d0] Power Management version 2
>>> Capabilities: [a4] PCI Advanced Features
>>> Kernel driver in use: i915
>>
>> disable drm for i915 will make your iommu work with dump?
>>
>>>
>>>
>>> Is it expected to intel_iommu=on or crashkernel_low to make 2nd kernel boot 
>>> in
>>> 3.9? Back in 3.8, it works just fine w/ only crashkernel param.
>>
>> Yes, I really do not want to set crashkernel low range like 72M
>> automatically for all.
>> that would have the system with proper iommu support lose 72M under 4G
>> in first kernel.
>> And can not play allocate and return tricks, as first kernel have no
>> idea if iommu will work
>> on second kernel even iommu is working on first kernel.
>>
>> Better to fix iommu support at first.
>>
>> For old system that does not have DMAR or kernel does not have IOMMU
>> support enabled, or
>> user does not pass intel_iommu=on.
>> We could set crashkernel low range to 72M automatically.
> 
> It seem that it is not worthy to check case that does not support
> IOMMU in second kernel.
> 
> Please check attached patch that will just set crashkernel_low auto, and if 
> the
> system DO support iommu with kdump, user can specify crashkernel_low=0
> to save low 72M.

The patch works flawlessly on my box! Thank you, Yinghai!
Let me know if anything else I can help.

WANG Chao

> 
> Thanks
> 
> Yinghai
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

2012-08-14 Thread WANG Chao

On 08/15/2012 06:54 AM, Jacob Shin wrote:

On Tue, Aug 14, 2012 at 04:34:39PM +0800, Dave Young wrote:

On 08/14/2012 05:46 AM, Jacob Shin wrote:


Currently kernel direct mappings are created for all pfns between
[ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
holes, we end up mapping memory ranges that are not backed by physical
DRAM. This is fine for lower memory addresses which can be marked as UC
by fixed/variable range MTRRs, however we run in to trouble with high
addresses.

The following patchset creates direct mappings only for E820_RAM regions
between 0 ~ max_low_pfn and 4GB ~ max_pfn. And leaves non-E820_RAM and
memory holes unmapped.



Hi,

Chaowang did some kdump test in a kvm guest with this patchset, 2nd
kenrel just reboot after some ACPI printk, see below dmesg of 2nd kernel:


Hello, thanks for testing, since I have not tested under KVM .. I also have
not tested passing in user supplied memory maps as your kernel log suggests.

Looking into this, it seems like we get a page fault while trying to set up
fixmap for the APIC. I think the fixmap is set up even before we get to
setup_arch(), and it is sitting in memory that is not marked as usable by
your user supplied e820.

Could you give V3 a try? I just sent it out a minute ago, this version
won't try to remap what has already been mapped as part of the boot process
before we get to setup_arch, it'll just take what its given.



Hi, Jacob

I just tried v3 patchset in my x86_64 kvm guest, it was booting 
successfully and the issue mentioned is gone.


-WANG Chao

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86 / CPU: Avoid unnecessary IPIs in arch_freq_get_on_cpu()

2017-11-12 Thread WANG Chao
On 11/13/17 at 02:15P, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> Even though aperfmperf_snapshot_khz() caches the samples.khz value
> to return if called again in a sufficiently short time, its caller,
> arch_freq_get_on_cpu(), still uses smp_call_function_single() to run
> it which may allow user space to trigger an IPI storm by reading from
> the scaling_cur_freq cpufreq sysfs file in a tight loop.
> 
> To avoid that, move the decision on whether or not to return the
> cached samples.khz value to arch_freq_get_on_cpu().
> 
> Fixes: 4815d3c56d1e (cpufreq: x86: Make scaling_cur_freq behave more as 
> expected)
> Signed-off-by: Rafael J. Wysocki 

Looks good to me.

Reviewed-by: WANG Chao 

> ---
> 
> This change was part of commit 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in
> /proc/cpuinfo), but it was not the reason for the revert and it remains
> applicable.
> 
> Thanks,
> Rafael
> 
> ---
>  arch/x86/kernel/cpu/Makefile |2 +-
>  arch/x86/kernel/cpu/aperfmperf.c |   11 +++
>  arch/x86/kernel/cpu/proc.c   |4 +++-
>  3 files changed, 11 insertions(+), 6 deletions(-)
> 
> Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> ===
> --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
> +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> @@ -42,10 +42,6 @@ static void aperfmperf_snapshot_khz(void
>   s64 time_delta = ktime_ms_delta(now, s->time);
>   unsigned long flags;
>  
> - /* Don't bother re-computing within the cache threshold time. */
> - if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
> - return;
> -
>   local_irq_save(flags);
>   rdmsrl(MSR_IA32_APERF, aperf);
>   rdmsrl(MSR_IA32_MPERF, mperf);
> @@ -74,6 +70,7 @@ static void aperfmperf_snapshot_khz(void
>  
>  unsigned int arch_freq_get_on_cpu(int cpu)
>  {
> + s64 time_delta;
>   unsigned int khz;
>  
>   if (!cpu_khz)
> @@ -82,6 +79,12 @@ unsigned int arch_freq_get_on_cpu(int cp
>   if (!static_cpu_has(X86_FEATURE_APERFMPERF))
>   return 0;
>  
> + /* Don't bother re-computing within the cache threshold time. */
> + time_delta = ktime_ms_delta(ktime_get(), per_cpu(samples.time, cpu));
> + khz = per_cpu(samples.khz, cpu);
> + if (khz && time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
> + return khz;
> +
>   smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
>   khz = per_cpu(samples.khz, cpu);
>   if (khz)
> 



[PATCH] treewide: replace RETPOLINE with CONFIG_RETPOLINE

2018-12-10 Thread WANG Chao
Since commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend
on compiler support"), RETPOLINE has been replaced by CONFIG_RETPOLINE.

Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler 
support")
Signed-off-by: WANG Chao 
---
 arch/x86/kernel/cpu/bugs.c   | 2 +-
 include/linux/compiler-gcc.h | 2 +-
 include/linux/module.h   | 2 +-
 scripts/mod/modpost.c| 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 500278f5308e..e8abe3f6c10e 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -213,7 +213,7 @@ static enum spectre_v2_mitigation spectre_v2_enabled 
__ro_after_init =
 static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
SPECTRE_V2_USER_NONE;
 
-#ifdef RETPOLINE
+#ifdef CONFIG_RETPOLINE
 static bool spectre_v2_bad_module;
 
 bool retpoline_module_ok(bool has_retpoline)
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 2010493e1040..977ddf2774f9 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -68,7 +68,7 @@
  */
 #define uninitialized_var(x) x = x
 
-#ifdef RETPOLINE
+#ifdef CONFIG_RETPOLINE
 #define __noretpoline __attribute__((__indirect_branch__("keep")))
 #endif
 
diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..0c575f51fe57 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -817,7 +817,7 @@ static inline void module_bug_finalize(const Elf_Ehdr *hdr,
 static inline void module_bug_cleanup(struct module *mod) {}
 #endif /* CONFIG_GENERIC_BUG */
 
-#ifdef RETPOLINE
+#ifdef CONFIG_RETPOLINE
 extern bool retpoline_module_ok(bool has_retpoline);
 #else
 static inline bool retpoline_module_ok(bool has_retpoline)
diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
index 0d998c54564d..5a5b3780456f 100644
--- a/scripts/mod/modpost.c
+++ b/scripts/mod/modpost.c
@@ -2157,7 +2157,7 @@ static void add_intree_flag(struct buffer *b, int 
is_intree)
 /* Cannot check for assembler */
 static void add_retpoline(struct buffer *b)
 {
-   buf_printf(b, "\n#ifdef RETPOLINE\n");
+   buf_printf(b, "\n#ifdef CONFIG_RETPOLINE\n");
buf_printf(b, "MODULE_INFO(retpoline, \"Y\");\n");
buf_printf(b, "#endif\n");
 }
-- 
2.19.2



Re: [PATCH] treewide: replace RETPOLINE with CONFIG_RETPOLINE

2018-12-21 Thread WANG Chao
On 12/11/18 at 12:37P, WANG Chao wrote:
> Since commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend
> on compiler support"), RETPOLINE has been replaced by CONFIG_RETPOLINE.
> 
> Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler 
> support")
> Signed-off-by: WANG Chao 

ping ...

> ---
>  arch/x86/kernel/cpu/bugs.c   | 2 +-
>  include/linux/compiler-gcc.h | 2 +-
>  include/linux/module.h   | 2 +-
>  scripts/mod/modpost.c| 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 500278f5308e..e8abe3f6c10e 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -213,7 +213,7 @@ static enum spectre_v2_mitigation spectre_v2_enabled 
> __ro_after_init =
>  static enum spectre_v2_user_mitigation spectre_v2_user __ro_after_init =
>   SPECTRE_V2_USER_NONE;
>  
> -#ifdef RETPOLINE
> +#ifdef CONFIG_RETPOLINE
>  static bool spectre_v2_bad_module;
>  
>  bool retpoline_module_ok(bool has_retpoline)
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index 2010493e1040..977ddf2774f9 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -68,7 +68,7 @@
>   */
>  #define uninitialized_var(x) x = x
>  
> -#ifdef RETPOLINE
> +#ifdef CONFIG_RETPOLINE
>  #define __noretpoline __attribute__((__indirect_branch__("keep")))
>  #endif
>  
> diff --git a/include/linux/module.h b/include/linux/module.h
> index fce6b4335e36..0c575f51fe57 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -817,7 +817,7 @@ static inline void module_bug_finalize(const Elf_Ehdr 
> *hdr,
>  static inline void module_bug_cleanup(struct module *mod) {}
>  #endif   /* CONFIG_GENERIC_BUG */
>  
> -#ifdef RETPOLINE
> +#ifdef CONFIG_RETPOLINE
>  extern bool retpoline_module_ok(bool has_retpoline);
>  #else
>  static inline bool retpoline_module_ok(bool has_retpoline)
> diff --git a/scripts/mod/modpost.c b/scripts/mod/modpost.c
> index 0d998c54564d..5a5b3780456f 100644
> --- a/scripts/mod/modpost.c
> +++ b/scripts/mod/modpost.c
> @@ -2157,7 +2157,7 @@ static void add_intree_flag(struct buffer *b, int 
> is_intree)
>  /* Cannot check for assembler */
>  static void add_retpoline(struct buffer *b)
>  {
> - buf_printf(b, "\n#ifdef RETPOLINE\n");
> + buf_printf(b, "\n#ifdef CONFIG_RETPOLINE\n");
>   buf_printf(b, "MODULE_INFO(retpoline, \"Y\");\n");
>   buf_printf(b, "#endif\n");
>  }
> -- 
> 2.19.2
> 
> 


Re: [PATCH] treewide: replace RETPOLINE with CONFIG_RETPOLINE

2018-12-21 Thread WANG Chao
On 12/21/18 at 11:47P, Borislav Petkov wrote:
> On Fri, Dec 21, 2018 at 05:33:47PM +0800, WANG Chao wrote:
> > On 12/11/18 at 12:37P, WANG Chao wrote:
> > > Since commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend
> > > on compiler support"), RETPOLINE has been replaced by CONFIG_RETPOLINE.
> > > 
> > > Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on 
> > > compiler support")
> > > Signed-off-by: WANG Chao 
> > 
> > ping ...
> 
> This one somehow slipped through the cracks... :-\
> 
> Ok, let's see: this could be relevant for the case when a module built
> with a compiler which doesn't support retpolines, gets loaded on a
> system which is built with retpolines.

You're right.

> 
> Which is pretty seldom as the majority of setups out there should have
> a retpoline-enabled compiler. And should not allow loading external
> modules anyway, but that's a different story.
> 
> So, this close to the merge window, I'm inclined to delay this one after
> it and have it take the x86/urgent fixes path. Unless someone screams
> and says that patch breaks his system. I haven't seen any reports about
> it so far so...

OK. Thanks.

WANG Chao


Re: [PATCH] mm, vmacache: Add kconfig VMACACHE_SHIFT

2015-01-22 Thread WANG Chao
On 01/22/15 at 11:22am, Rik van Riel wrote:
> On 01/22/2015 11:19 AM, Davidlohr Bueso wrote:
> > On Thu, 2015-01-22 at 15:57 +0800, WANG Chao wrote:
> >> Hi, Davidlohr
> >>
> >> On 01/21/15 at 11:46pm, Davidlohr Bueso wrote:
> >>> On Thu, 2015-01-22 at 14:29 +0800, WANG Chao wrote:
> >>>> Add a new kconfig option VMACACHE_SHIFT (as a power of 2) to specify the
> >>>> number of slots vma cache has for each thread. Range is chosen 0-4 (1-16
> >>>> slots) to consider both overhead and performance penalty. Default is 2
> >>>> (4 slots) as it originally is, which provides good enough balance.
> >>>>
> >>>
> >>> Nack. I don't feel comfortable making scalability features of core code
> >>> configurable.
> >>
> >> Out of respect, is this a general rule not making scalability features
> >> of core code configurable?
> > 
> > I doubt its a rule, just common sense. Users have no business
> > configuring such low level details. The optimizations need to
> > transparently work for everyone.
> 
> There may sometimes be a good reason for making this kind of
> thing configurable, but since there were no performance
> numbers in the changelog, I have not seen any such reason for
> this particular change :)

True. I didn't run any kind of benchmark, thus no numbers here. This is
purely hypothetical.

I'm glad to run some tests. For the sake of consistency, could you
please show me a hint how do you measure at the first place? I can do
hit-rate, but I don't know how you measure cpu cycles. Could you
elaborate?

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86, e820: clean up around sanitize_e820_map()

2015-01-06 Thread WANG Chao
The argument 3 of sanitize_e820_map() will only update upon a successful
sanitization. Some of the callers may not be aware of this in the past.
Now clean up these callers.

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/e820.c | 22 ++
 1 file changed, 6 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index dd2f07a..ae70de8 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -185,7 +185,7 @@ void __init e820_print_map(char *who)
  *
  * The integer pointed to by pnr_map must be valid on entry (the
  * current number of valid entries located at biosmap) and will
- * be updated on return, with the new number of valid entries
+ * only be updated on return 0, with the new number of valid entries
  * (something no more than max_nr_map.)
  *
  * The return value from sanitize_e820_map() is zero if it
@@ -561,23 +561,15 @@ u64 __init e820_remove_range(u64 start, u64 size, 
unsigned old_type,
 
 void __init update_e820(void)
 {
-   u32 nr_map;
-
-   nr_map = e820.nr_map;
-   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr_map))
+   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map))
return;
-   e820.nr_map = nr_map;
printk(KERN_INFO "e820: modified physical RAM map:\n");
e820_print_map("modified");
 }
 static void __init update_e820_saved(void)
 {
-   u32 nr_map;
-
-   nr_map = e820_saved.nr_map;
-   if (sanitize_e820_map(e820_saved.map, ARRAY_SIZE(e820_saved.map), 
&nr_map))
-   return;
-   e820_saved.nr_map = nr_map;
+   sanitize_e820_map(e820_saved.map, ARRAY_SIZE(e820_saved.map),
+   &e820_saved.nr_map);
 }
 #define MAX_GAP_END 0x1ull
 /*
@@ -898,11 +890,9 @@ early_param("memmap", parse_memmap_opt);
 void __init finish_e820_parsing(void)
 {
if (userdef) {
-   u32 nr = e820.nr_map;
-
-   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr) < 0)
+   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map),
+   &e820.nr_map) < 0)
early_panic("Invalid user supplied memory map");
-   e820.nr_map = nr;
 
printk(KERN_INFO "e820: user-defined physical RAM map:\n");
e820_print_map("user");
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] x86: early_memremap exact size of struct setup_data

2015-01-07 Thread WANG Chao
When early remapping setup_data, we can remap the structure alone, use
sizeof(struct setup_data). No need to remap a larger area, we never
access setup_data->data at that point.

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/setup.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ab4734e..c4648ada 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -431,15 +431,13 @@ static void __init parse_setup_data(void)
 
pa_data = boot_params.hdr.setup_data;
while (pa_data) {
-   u32 data_len, map_len, data_type;
+   u32 data_len, data_type;
 
-   map_len = max(PAGE_SIZE - (pa_data & ~PAGE_MASK),
- (u64)sizeof(struct setup_data));
-   data = early_memremap(pa_data, map_len);
+   data = early_memremap(pa_data, sizeof(*data));
data_len = data->len + sizeof(struct setup_data);
data_type = data->type;
pa_next = data->next;
-   early_iounmap(data, map_len);
+   early_iounmap(data, sizeof(*data));
 
switch (data_type) {
case SETUP_E820_EXT:
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] x86: add macro for_each_setup_data()

2015-01-07 Thread WANG Chao
A common task for parsing setup_data is to iterate over setup_data's
linked list, remap and do something and unmap. Now add macro
for_each_setup_data() to do that.

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/setup.c | 37 -
 1 file changed, 12 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index c4648ada..1e24c37 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -424,35 +424,30 @@ static void __init reserve_initrd(void)
 }
 #endif /* CONFIG_BLK_DEV_INITRD */
 
+#define for_each_setup_data(pa_data, data) \
+   for (pa_data = boot_params.hdr.setup_data; \
+   pa_data && (data = early_memremap(pa_data, sizeof(*data))); \
+   pa_data = data->next, early_iounmap(data, sizeof(*data)))
+
 static void __init parse_setup_data(void)
 {
struct setup_data *data;
-   u64 pa_data, pa_next;
-
-   pa_data = boot_params.hdr.setup_data;
-   while (pa_data) {
-   u32 data_len, data_type;
-
-   data = early_memremap(pa_data, sizeof(*data));
-   data_len = data->len + sizeof(struct setup_data);
-   data_type = data->type;
-   pa_next = data->next;
-   early_iounmap(data, sizeof(*data));
+   u64 pa_data;
 
-   switch (data_type) {
+   for_each_setup_data(pa_data, data) {
+   switch (data->type) {
case SETUP_E820_EXT:
-   parse_e820_ext(pa_data, data_len);
+   parse_e820_ext(pa_data, data->len + sizeof(*data));
break;
case SETUP_DTB:
add_dtb(pa_data);
break;
case SETUP_EFI:
-   parse_efi_setup(pa_data, data_len);
+   parse_efi_setup(pa_data, data->len + sizeof(*data));
break;
default:
break;
}
-   pa_data = pa_next;
}
 }
 
@@ -462,14 +457,10 @@ static void __init e820_reserve_setup_data(void)
u64 pa_data;
int found = 0;
 
-   pa_data = boot_params.hdr.setup_data;
-   while (pa_data) {
-   data = early_memremap(pa_data, sizeof(*data));
+   for_each_setup_data(pa_data, data) {
e820_update_range(pa_data, sizeof(*data)+data->len,
 E820_RAM, E820_RESERVED_KERN);
found = 1;
-   pa_data = data->next;
-   early_iounmap(data, sizeof(*data));
}
if (!found)
return;
@@ -485,12 +476,8 @@ static void __init 
memblock_x86_reserve_range_setup_data(void)
struct setup_data *data;
u64 pa_data;
 
-   pa_data = boot_params.hdr.setup_data;
-   while (pa_data) {
-   data = early_memremap(pa_data, sizeof(*data));
+   for_each_setup_data(pa_data, data) {
memblock_reserve(pa_data, sizeof(*data) + data->len);
-   pa_data = data->next;
-   early_iounmap(data, sizeof(*data));
}
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/3] RAS/CEC: fix __find_elem

2019-04-17 Thread WANG Chao
A left over pfn (because we don't clear) at ca->array[n] can be a match
in __find_elem. Later it'd cause a memmove size overflow in del_elem.

Signed-off-by: WANG Chao 
---
 drivers/ras/cec.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 2d9ec378a8bc..2e0bf1269c31 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -206,7 +206,7 @@ static int __find_elem(struct ce_array *ca, u64 pfn, 
unsigned int *to)
 
this_pfn = PFN(ca->array[min]);
 
-   if (this_pfn == pfn)
+   if (this_pfn == pfn && ca->n > min)
return min;
 
return -ENOKEY;
-- 
2.21.0



[PATCH 2/3] RAS/CEC: make ces_entered smp safe

2019-04-17 Thread WANG Chao
ces_entered should be put in a critical section to avoid race condition.

Signed-off-by: WANG Chao 
---
 drivers/ras/cec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 2e0bf1269c31..702e4c02c713 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -286,10 +286,10 @@ int cec_add_elem(u64 pfn)
if (!ce_arr.array || ce_arr.disabled)
return -ENODEV;
 
-   ca->ces_entered++;
-
mutex_lock(&ce_mutex);
 
+   ca->ces_entered++;
+
if (ca->n == MAX_ELEMS)
WARN_ON(!del_lru_elem_unlocked(ca));
 
-- 
2.21.0



[PATCH 3/3] RAS/CEC: immediate soft-offline page when count_threshold == 1

2019-04-17 Thread WANG Chao
count_threshol == 1 isn't working as expected. CEC only does soft
offline the second time the same pfn is hit by a correctable error.

Signed-off-by: WANG Chao 
---
 drivers/ras/cec.c | 36 +---
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 702e4c02c713..ac879c45377c 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -272,7 +272,22 @@ static u64 __maybe_unused del_lru_elem(void)
return pfn;
 }
 
+static void cec_valid_soft_offline(u64 pfn)
+{
+   if (!pfn_valid(pfn)) {
+   pr_warn("CEC: Invalid pfn: 0x%llx\n", pfn);
+   } else {
+   /* We have reached max count for this page, soft-offline it. */
+   pr_err("Soft-offlining pfn: 0x%llx\n", pfn);
+   memory_failure_queue(pfn, MF_SOFT_OFFLINE, &cec_chain);
+   ce_arr.pfns_poisoned++;
+   }
+}
 
+/*
+ * Return a >0 value to denote that we've reached the offlining
+ * threshold.
+ */
 int cec_add_elem(u64 pfn)
 {
struct ce_array *ca = &ce_arr;
@@ -295,6 +310,11 @@ int cec_add_elem(u64 pfn)
 
ret = find_elem(ca, pfn, &to);
if (ret < 0) {
+   if (count_threshold == 1) {
+   cec_valid_soft_offline(pfn);
+   ret = 1;
+   goto unlock;
+   }
/*
 * Shift range [to-end] to make room for one more element.
 */
@@ -320,23 +340,9 @@ int cec_add_elem(u64 pfn)
 
ret = 0;
} else {
-   u64 pfn = ca->array[to] >> PAGE_SHIFT;
-
-   if (!pfn_valid(pfn)) {
-   pr_warn("CEC: Invalid pfn: 0x%llx\n", pfn);
-   } else {
-   /* We have reached max count for this page, 
soft-offline it. */
-   pr_err("Soft-offlining pfn: 0x%llx\n", pfn);
-   memory_failure_queue(pfn, MF_SOFT_OFFLINE);
-   ca->pfns_poisoned++;
-   }
-
+   cec_valid_soft_offline(pfn);
del_elem(ca, to);
 
-   /*
-* Return a >0 value to denote that we've reached the offlining
-* threshold.
-*/
ret = 1;
 
goto unlock;
-- 
2.21.0



Re: [PATCH 3/3] RAS/CEC: immediate soft-offline page when count_threshold == 1

2019-04-23 Thread WANG Chao
On 04/20/19 at 01:57P, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 11:41:15AM +0800, WANG Chao wrote:
> > count_threshol == 1 isn't working as expected. CEC only does soft
> > offline the second time the same pfn is hit by a correctable error.
> 
> So this?
> 
> ---
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index b3c377ddf340..750a427e1a73 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -333,6 +333,7 @@ int cec_add_elem(u64 pfn)
>  
>   mutex_lock(&ce_mutex);
>  
> + /* Array full, free the LRU slot. */
>   if (ca->n == MAX_ELEMS)
>   WARN_ON(!del_lru_elem_unlocked(ca));
>  
> @@ -346,14 +347,9 @@ int cec_add_elem(u64 pfn)
>   (void *)&ca->array[to],
>   (ca->n - to) * sizeof(u64));
>  
> - ca->array[to] = (pfn << PAGE_SHIFT) |
> - (DECAY_MASK << COUNT_BITS) | 1;
> + ca->array[to] = (pfn << PAGE_SHIFT) | 1;
>  
>   ca->n++;
> -
> - ret = 0;
> -
> - goto decay;
>   }
>  
>   count = COUNT(ca->array[to]);
> @@ -386,7 +382,6 @@ int cec_add_elem(u64 pfn)
>   goto unlock;
>   }
>  
> -decay:
>   ca->decay_count++;
>  
>   if (ca->decay_count >= CLEAN_ELEMS)

It looks good to me. Thanks for a better fix.


Re: [PATCH 1/3] RAS/CEC: fix __find_elem

2019-04-25 Thread WANG Chao
On 04/18/19 at 11:41P, WANG Chao wrote:
> A left over pfn (because we don't clear) at ca->array[n] can be a match
> in __find_elem. Later it'd cause a memmove size overflow in del_elem.
> 
> Signed-off-by: WANG Chao 
> ---
>  drivers/ras/cec.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index 2d9ec378a8bc..2e0bf1269c31 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -206,7 +206,7 @@ static int __find_elem(struct ce_array *ca, u64 pfn, 
> unsigned int *to)
>  
>   this_pfn = PFN(ca->array[min]);
>  
> - if (this_pfn == pfn)
> + if (this_pfn == pfn && ca->n > min)
>   return min;
>  
>   return -ENOKEY;

Any thought on this one?


Re: [PATCH 1/3] RAS/CEC: fix __find_elem

2019-04-25 Thread WANG Chao
On 04/25/19 at 03:56P, WANG Chao wrote:
> On 04/18/19 at 11:41P, WANG Chao wrote:
> > A left over pfn (because we don't clear) at ca->array[n] can be a match
> > in __find_elem. Later it'd cause a memmove size overflow in del_elem.
> > 
> > Signed-off-by: WANG Chao 
> > ---
> >  drivers/ras/cec.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> > index 2d9ec378a8bc..2e0bf1269c31 100644
> > --- a/drivers/ras/cec.c
> > +++ b/drivers/ras/cec.c
> > @@ -206,7 +206,7 @@ static int __find_elem(struct ce_array *ca, u64 pfn, 
> > unsigned int *to)
> >  
> > this_pfn = PFN(ca->array[min]);
> >  
> > -   if (this_pfn == pfn)
> > +   if (this_pfn == pfn && ca->n > min)
> > return min;
> >  
> > return -ENOKEY;
> 
> Any thought on this one?

Aha, I see there's another fix queued. Thanks.


Re: [PATCH] kbuild: add extra-y to targets-for-modules

2020-12-08 Thread WANG Chao
Sorry for the late reply.

On 11/25/20 at 10:42P, Masahiro Yamada wrote:
> On Tue, Nov 24, 2020 at 12:05 AM WANG Chao  wrote:
> >
> > On 11/23/20 at 02:23P, Masahiro Yamada wrote:
> > > On Tue, Nov 3, 2020 at 3:23 PM WANG Chao  wrote:
> > > >
> > > > extra-y target doesn't build for 'make M=...' since commit 6212804f2d78
> > > > ("kbuild: do not create built-in objects for external module builds").
> > > >
> > > > This especially breaks kpatch, which is using 'extra-y := kpatch.lds'
> > > > and 'make M=...' to build livepatch patch module.
> > > >
> > > > Add extra-y to targets-for-modules so that such kind of build works
> > > > properly.
> > > >
> > > > Signed-off-by: WANG Chao 
> > > > ---
> > > >  scripts/Makefile.build | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> > > > index ae647379b579..0113a042d643 100644
> > > > --- a/scripts/Makefile.build
> > > > +++ b/scripts/Makefile.build
> > > > @@ -86,7 +86,7 @@ ifdef need-builtin
> > > >  targets-for-builtin += $(obj)/built-in.a
> > > >  endif
> > > >
> > > > -targets-for-modules := $(patsubst %.o, %.mod, $(filter %.o, $(obj-m)))
> > > > +targets-for-modules := $(extra-y) $(patsubst %.o, %.mod, $(filter %.o, 
> > > > $(obj-m)))
> > > >
> > > >  ifdef need-modorder
> > > >  targets-for-modules += $(obj)/modules.order
> > > > --
> > > > 2.29.1
> > > >
> > >
> > > NACK.
> > >
> > > Please fix your Makefile.
> > >
> > > Hint:
> > > https://patchwork.kernel.org/project/linux-kbuild/patch/20201123045403.63402-6-masahi...@kernel.org/
> > >
> > >
> > > Probably what you should use is 'targets'.
> >
> > I tried with 'targets' and 'always-y'. Both doesn't work for me.
> >
> > I narraw it down to the following example:
> >
> > cat > Makefile << _EOF_
> > obj-m += foo.o
> >
> > ldflags-y += -T $(src)/kpatch.lds
> > always-y += kpatch.lds
> >
> > foo-objs += bar.o
> >
> > all:
> > make -C /lib/modules/$(shell uname -r)/build M=$(PWD)
> > clean:
> > make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
> > _EOF_
> >
> > Take a look into scripts/Makefile.build:488:
> >
> > __build: $(if $(KBUILD_BUILTIN), $(targets-for-builtin)) \
> >  $(if $(KBUILD_MODULES), $(targets-for-modules)) \
> >  $(subdir-ym) $(always-y)
> > @:
> >
> > 'always-y' is built after 'targets-for-modules'. This makes
> > 'targets-for-modules' fails because kpatch.lds isn't there.
> 
> 
> Heh, you rely on the targets built from left to right,
> and you have never thought Make supports the parallel option -j.

You're right. I missed that.

> 
> 
> You need to specify the dependency if you expect objects
> are built in the particular order.
> 
> However, in this case, using ldflags-y looks wrong
> in the first place.
> 
> The linker script is used when combining the object
> as well as the final link of *.ko

I don't have a clean fix to kpatch right now.

I'm looping kpatch forks in. They're also looking at this right now:

https://github.com/dynup/kpatch/pull/1149

Thanks
WANG Chao

> 
> 
> > For 'targets', in case of OOT, does not seem to be useful.
> >
> > What change do you suggest to make to fix this kind of Makefile?
> >
> > Thanks,
> > WANG Chao
> 
> 
> 
> -- 
> Best Regards
> Masahiro Yamada
> 


Re: [PATCH] kbuild: add extra-y to targets-for-modules

2020-11-23 Thread WANG Chao
On 11/23/20 at 02:23P, Masahiro Yamada wrote:
> On Tue, Nov 3, 2020 at 3:23 PM WANG Chao  wrote:
> >
> > extra-y target doesn't build for 'make M=...' since commit 6212804f2d78
> > ("kbuild: do not create built-in objects for external module builds").
> >
> > This especially breaks kpatch, which is using 'extra-y := kpatch.lds'
> > and 'make M=...' to build livepatch patch module.
> >
> > Add extra-y to targets-for-modules so that such kind of build works
> > properly.
> >
> > Signed-off-by: WANG Chao 
> > ---
> >  scripts/Makefile.build | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/scripts/Makefile.build b/scripts/Makefile.build
> > index ae647379b579..0113a042d643 100644
> > --- a/scripts/Makefile.build
> > +++ b/scripts/Makefile.build
> > @@ -86,7 +86,7 @@ ifdef need-builtin
> >  targets-for-builtin += $(obj)/built-in.a
> >  endif
> >
> > -targets-for-modules := $(patsubst %.o, %.mod, $(filter %.o, $(obj-m)))
> > +targets-for-modules := $(extra-y) $(patsubst %.o, %.mod, $(filter %.o, 
> > $(obj-m)))
> >
> >  ifdef need-modorder
> >  targets-for-modules += $(obj)/modules.order
> > --
> > 2.29.1
> >
> 
> NACK.
> 
> Please fix your Makefile.
> 
> Hint:
> https://patchwork.kernel.org/project/linux-kbuild/patch/20201123045403.63402-6-masahi...@kernel.org/
> 
> 
> Probably what you should use is 'targets'.

I tried with 'targets' and 'always-y'. Both doesn't work for me.

I narraw it down to the following example:

cat > Makefile << _EOF_
obj-m += foo.o

ldflags-y += -T $(src)/kpatch.lds
always-y += kpatch.lds

foo-objs += bar.o

all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD)
clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
_EOF_

Take a look into scripts/Makefile.build:488:

__build: $(if $(KBUILD_BUILTIN), $(targets-for-builtin)) \
 $(if $(KBUILD_MODULES), $(targets-for-modules)) \
 $(subdir-ym) $(always-y)
@:

'always-y' is built after 'targets-for-modules'. This makes
'targets-for-modules' fails because kpatch.lds isn't there.

For 'targets', in case of OOT, does not seem to be useful.

What change do you suggest to make to fix this kind of Makefile?

Thanks,
WANG Chao


Re: [PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try below 4G, then MAXMEM

2013-11-04 Thread WANG Chao
On 10/28/13 at 11:12am, Vivek Goyal wrote:
> On Thu, Oct 24, 2013 at 10:48:29PM -0700, Yinghai Lu wrote:
> > On Thu, Oct 24, 2013 at 12:27 PM, Vivek Goyal  wrote:
> > > On Thu, Oct 24, 2013 at 12:24:57PM -0700, Yinghai Lu wrote:
> > >> On Thu, Oct 24, 2013 at 12:18 PM, Vivek Goyal  wrote:
> > >> > On Thu, Oct 24, 2013 at 12:15:25PM -0700, Yinghai Lu wrote:
> > >> >
> > >> > Also keeping things simple by not trying to *impose* a new crashkernel=
> > >> > syntax on existing crashkernel=xM users.
> > >>
> > >> Existing user that have crashkernel=xM working with their old kernel
> > >> and old kexec-tools, they still could keep their old command line and
> > >> old kexec-tools
> > >> with new updated kernel.
> > >> We should not change semantics to surprise them.
> > >
> > > Old users will get reservation still below 896MB.
> > >
> > > It will go above 896MB only if memory could not be allocated below 896MB.
> > >
> > > In the past reservation will fail and kexec-tools will fail.
> > > Now reservation will succeed but kexec-tools will fail.
> > >
> > > So end result a user sees is that kexec-tools fails. So I don't see how
> > > we are breaking existing installations or user setups.
> > 
> > case could be: if user add more memory and put more pcie cards, and
> > second kernel will need more ram and OOM there.
> 
> Now makedumpfile supports cyclic mode by default. So one does not have
> to necessarily linearly scale reserved memory based on physical memory
> present in the box.
> 
> > so user could just increase crashkernel=512M to crashkernel=1G.
> 
> If user has new makedumpfile, OOM should not happen and one should not
> have to increase memory reservation.
> 
> > 
> > without Cong's patch, kernel will fail to reserve, and user would dig
> > to change it
> > to crashkernel=1G,high, and update kexec-tools.
> > 
> > with Cong's patch, kernel will reserve other range like between 896
> > and 4G, old kexec-tools either
> > fail to load second kernel or hang in purgatory or early stage of
> > second kernel, or other unknown behavior.
> 
> I understand your concern about memory being reserved high and kexec
> just hanging. Only thing I am arguing is that the number of cases where
> it will happen is small.
> 
> - First of all for all old existing kexec-tools memory will stil come
>   from 896MB.
> - Old kexec-tools enforced that purgatory is loaded below 2G. So if 
>   memory is reserved above 2G, kexec-tools will fail.
> 
> So only problematic case seems to be that if user increased crashkernel=X
> value and memory got reserved between 896M and 2G. But this is not same
> as breaking old setups as old setups anyway never worked with this 
> configuration.
> 
> You argument that user will research and upgrade kexec-tools and use
> crashkernel=X,high, then it holds true for the case where memory is
> reserved between 896M and 2G.
> 
> I personally think that it is easier for a user to not change any
> kernel parameters with kernel and kexec-tools upgrade and still be
> able to work with large memory systems. So the benefit of extending
> the semantics of existing parameter seems to be outweighing the downside
> of side, IMHO.
> 
> > 
> > I would think first path is much clear and predicted.
> > 
> > If my memory is right, HPA did not like idea that we try below 896M,
> > and then under 4G and then above 4G.
> 
> hpa, I know you did not like the idea in the past. Is it still the case.

Hi, hpa

Do you still not like this idea? Could you make it more clear to us
please?

Thanks
Chao

> 
> IMHO, I like the fact that existing users still be able to work with
> crashkernel=X and not forced to switch to crashkernel=x,high and also incur
> the penalty of reserving extra memory for software iotlb.
> 
> Thanks
> Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86, calgary: use 8M TCE table size by default

2014-03-11 Thread WANG Chao
On 03/10/14 at 10:26pm, Jon Mason wrote:
> On Mon, Mar 10, 2014 at 8:54 AM, Muli Ben-Yehuda  
> wrote:
> > Patch looks good to me.
> >
> > Acked-by: Muli Ben-Yehuda 
> 
> Me too (just in case you want us both).
> 
> Acked-by: Jon Mason 

Thanks for review and helping push this change, to all of you.

WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How could we get rid of saved_max_pfn for calgary iommu?

2014-03-04 Thread WANG Chao
[ Add Muli, I find his new email in git log]

Hi, Muli

saved_max_pfn is becoming a setback for kexec-tools. Ideally calgary
could get rid of saved_max_pfn at all. But If this can't work, how about
exporting a calgary tce table size to user space, so that kexec-tools
can simply pass calgary=xxx cmdline to 2nd kernel.

What do you think of this problem?

BTW MAINTAINERS file still uses your old email, please update
accordingly.

Thanks
WANG Chao

On 02/19/14 at 02:18pm, WANG Chao wrote:
> Hi, All
> 
> arch/x86/kernel/pci-calgary.c is the only user of saved_max_pfn today:
> 
> int __init detect_calgary(void)
> {
>   [..]
>   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
>   saved_max_pfn : max_pfn) * PAGE_SIZE);
>   [..]
> }
> 
> saved_max_pfn is the real mem size and is calculated by 1st kernel E820
> memmap which is passed in by 2nd kernel's boot_params (done by kexec):
> 
>   saved_max_pfn = e820_end_of_ram_pfn();
> 
> After saved_max_pfn has been set, memmap=exactmap will reset the E820
> provided by boot_params and use the user defined E820 instead.
> 
> Now we want to get rid of memmap=exactmap and directly pass the E820
> memmap by boot_params for some reason (eg. exactmap may exceed the cmdline
> size and also isn't compatible with kaslr).
> 
> However saved_max_pfn becomes the obstacle for obsoleting exactmap.
> Because it needs two conditions: first kernel's E820 map and
> memmap=exactmap cmdline.
> 
> So I'm wondering if it's possible to get rid of saved_max_pfn totally in
> calgary code. Or we can get saved_max_pfn using a different way, for
> example calculated in 1st kernel and passed in to 2nd kernel by cmdline.
> 
> Thanks
> WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] vmcore: continue vmcore initialization if PT_NOTE is found empty

2014-03-20 Thread WANG Chao
Currently when an empty PT_NOTE is detected, vmcore initialization
fails. It sounds too harsh. Because PT_NOTE could be empty, for example,
one offlined a cpu but never restarted kdump service, and after crash,
PT_NOTE program header is there but no data contains. It's better to
warn about the empty PT_NOTE and continue to initialise vmcore.

And ultimately the multiple PT_NOTE are merged into a single one, all
empty PT_NOTE are discarded naturally during the merge. So empty PT_NOTE
is not visible to user space and vmcore is as good as expected.

Signed-off-by: WANG Chao 
---
 fs/proc/vmcore.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 88d4585..dc71bf9 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -484,7 +484,6 @@ static int __init update_note_header_size_elf64(const 
Elf64_Ehdr *ehdr_ptr)
phdr_ptr->p_memsz = real_sz;
if (real_sz == 0) {
pr_warn("Warning: Zero PT_NOTE entries found\n");
-   return -EINVAL;
}
}
 
@@ -671,7 +670,6 @@ static int __init update_note_header_size_elf32(const 
Elf32_Ehdr *ehdr_ptr)
phdr_ptr->p_memsz = real_sz;
if (real_sz == 0) {
pr_warn("Warning: Zero PT_NOTE entries found\n");
-   return -EINVAL;
}
}
 
-- 
1.8.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


How could we get rid of saved_max_pfn for calgary iommu?

2014-02-18 Thread WANG Chao
Hi, All

arch/x86/kernel/pci-calgary.c is the only user of saved_max_pfn today:

int __init detect_calgary(void)
{
[..]
specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
saved_max_pfn : max_pfn) * PAGE_SIZE);
[..]
}

saved_max_pfn is the real mem size and is calculated by 1st kernel E820
memmap which is passed in by 2nd kernel's boot_params (done by kexec):

saved_max_pfn = e820_end_of_ram_pfn();

After saved_max_pfn has been set, memmap=exactmap will reset the E820
provided by boot_params and use the user defined E820 instead.

Now we want to get rid of memmap=exactmap and directly pass the E820
memmap by boot_params for some reason (eg. exactmap may exceed the cmdline
size and also isn't compatible with kaslr).

However saved_max_pfn becomes the obstacle for obsoleting exactmap.
Because it needs two conditions: first kernel's E820 map and
memmap=exactmap cmdline.

So I'm wondering if it's possible to get rid of saved_max_pfn totally in
calgary code. Or we can get saved_max_pfn using a different way, for
example calculated in 1st kernel and passed in to 2nd kernel by cmdline.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How could we get rid of saved_max_pfn for calgary iommu?

2014-02-20 Thread WANG Chao
Remove m...@il.ibm.com from CC, this email isn't valid now.

On 02/19/14 at 09:36pm, Vivek Goyal wrote:
> On Wed, Feb 19, 2014 at 05:04:22PM -0700, Jon Mason wrote:
> > On Tue, Feb 18, 2014 at 11:18 PM, WANG Chao  wrote:
> > > Hi, All
> > >
> > > arch/x86/kernel/pci-calgary.c is the only user of saved_max_pfn today:
> > >
> > > int __init detect_calgary(void)
> > > {
> > > [..]
> > > specified_table_size = 
> > > determine_tce_table_size((is_kdump_kernel() ?
> > > saved_max_pfn : max_pfn) * 
> > > PAGE_SIZE);
> > > [..]
> > > }
> > 
> > IIUC, the purpose of this code is to reuse the TCE table from the
> > previous kernel.  Thus, it needs to be of the same size as the
> > pre-kdump kernel.  It is using the max_pfn to determine the TCE table
> > size in the non-kdump case.  If there is another way to determine the
> > size it used before, then I am fine making the change to use that way.
> 
> How about passing old tce table size on command line to second kernel.
> Given the fact that it is specific to calgary only, we can it very
> specific. Say calgary_iommu_old_tce_table_sz=.

Don't need to introduce a new parameter, this is already there:
calgary=[64k,128k,256k,512k,1M,2M,4M,8M]

> 
> But we will then need to know the size of TCE table in first kernel. Is
> this information exported to user space somewhere?

If this value isn't exported to userspace or even won't in the future, I think
the table size also can be determined by userspace. We can implement something 
like the
kernel does in kexec.

The calgary code:

static inline int __init determine_tce_table_size(u64 ram)
{
int ret;

if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
return specified_table_size;

/*
 * Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
 * TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
 * larger table size has twice as many entries, so shift the
 * max ram address by 13 to divide by 8K and then look at the
 * order of the result to choose between 0-7.
 */
ret = get_order(ram >> 13);
if (ret > TCE_TABLE_SIZE_8M)
ret = TCE_TABLE_SIZE_8M;

return ret;
}

But it still no clear to me how I can determine calgary iommu is in use in 1st 
kernel.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86: make reboot task only run on the appropriate processor

2013-11-10 Thread WANG Chao
On 11/08/13 at 10:14am, Vivek Goyal wrote:
> On Tue, Nov 05, 2013 at 05:16:07PM +0800, Baoquan He wrote:
> > Currently system always reboot after below message when execute "kexec -e".
> > 
> > [0.572119] smpboot: Booting Node   0, Processors  #   1 OK
> > 
> > In commit 1b3a5d02ee070c8f994b9b6370f486601e0f, reboot= handling was
> > moved to kerne/reboot.c. However, the code to migrate current thread to
> > reboot cpu was removed. That cause this incorrect kexec behavior.
> > 
> > Now add that code block back.
> > 
> > Reported-by: Matthew Whitehead 
> > Reported-by: Dave Young 
> > Tested-by: WANG Chao 
> > Signed-off-by: Baoquan He 
> 
> Hi Bao,
> 
> This patch fixes the issue for me too. I noticed that we have generic
> function migrate_to_reboot_cpu() to achieve what we want and rest of
> the reboot paths are using it. So how about using that function. I 
> wrote the new patch below. It works for me. Can you please give it
> a try.

Hi, Vivek

I confirm this patch works for me, even in a nested kexec scenario. Feel
free to add these:

Bisected-by: WANG Chao 
Tested-by: WANG Chao 

Thanks
WANG Chao

> 
> Thanks
> Vivek
> 
> kexec: migrate to reboot cpu
> 
> Commit 1b3a5d02ee070c8f994b9b6370f486601e0f moved reboot= handling to
> generic code. In the process it also removed the code in
> native_machine_shutdown() which are moving reboot process to reboot_cpu/cpu0.
> 
> I guess that thought must have been that all reboot paths are calling
> migrate_to_reboot_cpu(), so we don't need this special handling. But
> kexec reboot path (kernel_kexec()) is not calling migrate_to_reboot_cpu()
> so above change broke kexec. Now reboot can happen on non-boot cpu and when
> INIT is sent in second kerneo to bring up BP, it brings down the machine.
> 
> So start calling migrate_to_reboot_cpu() in kexec reboot path to avoid
> this problem.
> 
> Reported-by: Matthew Whitehead 
> Reported-by: Dave Young 
> Signed-off-by: Vivek Goyal 
> ---
>  include/linux/reboot.h |1 +
>  kernel/kexec.c |1 +
>  kernel/reboot.c|2 +-
>  3 files changed, 3 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/kernel/reboot.c
> ===
> --- linux-2.6.orig/kernel/reboot.c2013-10-16 00:30:50.0 -0400
> +++ linux-2.6/kernel/reboot.c 2013-11-08 21:31:03.379064848 -0500
> @@ -104,7 +104,7 @@ int unregister_reboot_notifier(struct no
>  }
>  EXPORT_SYMBOL(unregister_reboot_notifier);
>  
> -static void migrate_to_reboot_cpu(void)
> +void migrate_to_reboot_cpu(void)
>  {
>   /* The boot cpu is always logical cpu 0 */
>   int cpu = reboot_cpu;
> Index: linux-2.6/include/linux/reboot.h
> ===
> --- linux-2.6.orig/include/linux/reboot.h 2013-11-08 21:33:27.0 
> -0500
> +++ linux-2.6/include/linux/reboot.h  2013-11-08 21:34:29.778073522 -0500
> @@ -43,6 +43,7 @@ extern int unregister_reboot_notifier(st
>   * Architecture-specific implementations of sys_reboot commands.
>   */
>  
> +extern void migrate_to_reboot_cpu(void);
>  extern void machine_restart(char *cmd);
>  extern void machine_halt(void);
>  extern void machine_power_off(void);
> Index: linux-2.6/kernel/kexec.c
> ===
> --- linux-2.6.orig/kernel/kexec.c 2013-10-16 00:30:50.0 -0400
> +++ linux-2.6/kernel/kexec.c  2013-11-08 21:34:02.492072375 -0500
> @@ -1676,6 +1676,7 @@ int kernel_kexec(void)
>  #endif
>   {
>   kernel_restart_prepare(NULL);
> + migrate_to_reboot_cpu();
>   printk(KERN_EMERG "Starting new kernel\n");
>   machine_shutdown();
>   }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try below 4G, then MAXMEM

2013-10-14 Thread WANG Chao
Now crashkernel=X will fail out if there's not enough memory at
low (below 896M). What makes sense for crashkernel=X would be:

 - First try to reserve X below 896M (for being compatible with old
   kexec-tools).
 - If fails, try to reserve X below 4G (swiotlb need to stay below 4G).
 - If fails, try to reserve X from MAXMEM top down.

So that user can easily reserve large memory with crashkernel=X instead
of crashkernel=X,high. It's more transparent and user-friendly.

If crashkernel is large and the reserved is beyond 896M, old kexec-tools
won't be compatible with new kernel for most of time.

kexec will fail out immediately in this case. But the failure could be
expected, because old kexec users should not try to reserve that large
amount of memory at the first place.

On the other hand, old kexec also will fail on old kernel when there's
not enough low memory to reserve a large crash kernel area. So the
failure of old kexec is consistent between old kernel and new kernel.

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/setup.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f0de629..38e6c1f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -593,6 +593,20 @@ static void __init reserve_crashkernel(void)
high ? CRASH_KERNEL_ADDR_HIGH_MAX :
   CRASH_KERNEL_ADDR_LOW_MAX,
crash_size, alignment);
+   /*
+* crashkernel=X reserve below 896M fails? Try below 4G
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(alignment,
+   (1ULL << 32),
+   crash_size, alignment);
+   /*
+* crashkernel=X reserve below 4G fails? Try MAXMEM
+*/
+   if (!high && !crash_base)
+   crash_base = memblock_find_in_range(alignment,
+   CRASH_KERNEL_ADDR_HIGH_MAX,
+   crash_size, alignment);
 
if (!crash_base) {
pr_info("crashkernel reservation failed - No suitable 
area found.\n");
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try below 4G, then MAXMEM

2013-10-14 Thread WANG Chao
On 10/14/13 at 11:54am, Yinghai Lu wrote:
> On Mon, Oct 14, 2013 at 4:46 AM, WANG Chao  wrote:
> > Now crashkernel=X will fail out if there's not enough memory at
> > low (below 896M). What makes sense for crashkernel=X would be:
> >
> >  - First try to reserve X below 896M (for being compatible with old
> >kexec-tools).
> >  - If fails, try to reserve X below 4G (swiotlb need to stay below 4G).
> >  - If fails, try to reserve X from MAXMEM top down.
> >
> > So that user can easily reserve large memory with crashkernel=X instead
> > of crashkernel=X,high. It's more transparent and user-friendly.
> >
> > If crashkernel is large and the reserved is beyond 896M, old kexec-tools
> > won't be compatible with new kernel for most of time.
> >
> > kexec will fail out immediately in this case. But the failure could be
> > expected, because old kexec users should not try to reserve that large
> > amount of memory at the first place.
> >
> > On the other hand, old kexec also will fail on old kernel when there's
> > not enough low memory to reserve a large crash kernel area. So the
> > failure of old kexec is consistent between old kernel and new kernel.
> >
> > Signed-off-by: WANG Chao 
> > ---
> >  arch/x86/kernel/setup.c | 14 ++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index f0de629..38e6c1f 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -593,6 +593,20 @@ static void __init reserve_crashkernel(void)
> > high ? CRASH_KERNEL_ADDR_HIGH_MAX :
> >CRASH_KERNEL_ADDR_LOW_MAX,
> > crash_size, alignment);
> > +   /*
> > +* crashkernel=X reserve below 896M fails? Try below 4G
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(alignment,
> > +   (1ULL << 32),
> > +   crash_size, alignment);
> > +   /*
> > +* crashkernel=X reserve below 4G fails? Try MAXMEM
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(alignment,
> > +   CRASH_KERNEL_ADDR_HIGH_MAX,
> > +   crash_size, alignment);
> >
> > if (!crash_base) {
> > pr_info("crashkernel reservation failed - No 
> > suitable area found.\n");
> > --
> 
> User should change crashkernel=X to crashkernel=X,high.

crashkernel=X is more straightforward. IMHO, It should be a more general
crash kernel reservation parameter and it shouldn't reserve at low only.

> 
> As user could forget to update kexec-tools to utilize ",high" feature, and get
> kdump later fail later.

old kexec-tools should fail to load kernel high anyway, right?

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try below 4G, then MAXMEM

2013-10-27 Thread WANG Chao
On 10/24/13 at 12:04pm, Yinghai Lu wrote:
> On Mon, Oct 14, 2013 at 4:46 AM, WANG Chao  wrote:
> > Now crashkernel=X will fail out if there's not enough memory at
> > low (below 896M). What makes sense for crashkernel=X would be:
> >
> >  - First try to reserve X below 896M (for being compatible with old
> >kexec-tools).
> >  - If fails, try to reserve X below 4G (swiotlb need to stay below 4G).
> >  - If fails, try to reserve X from MAXMEM top down.
> >
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index f0de629..38e6c1f 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -593,6 +593,20 @@ static void __init reserve_crashkernel(void)
> > high ? CRASH_KERNEL_ADDR_HIGH_MAX :
> >CRASH_KERNEL_ADDR_LOW_MAX,
> > crash_size, alignment);
> > +   /*
> > +* crashkernel=X reserve below 896M fails? Try below 4G
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(alignment,
> > +   (1ULL << 32),
> > +   crash_size, alignment);
> 
> Another problem, it would allocate range in [0,4g) for 32bit,
> if the user have crashkernel=512M or plus.

Yes, should use #ifdef CONFIG_X86_64 .. #endif.
Thanks for pointing this out.

WANG Chao

> 
> > +   /*
> > +* crashkernel=X reserve below 4G fails? Try MAXMEM
> > +*/
> > +   if (!high && !crash_base)
> > +   crash_base = memblock_find_in_range(alignment,
> > +   CRASH_KERNEL_ADDR_HIGH_MAX,
> > +   crash_size, alignment);
> >
> > if (!crash_base) {
> > pr_info("crashkernel reservation failed - No 
> > suitable area found.\n");
> 
> 
> Yinghai
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 4/6] kexec: A new system call, kexec_file_load, for in kernel kexec

2013-12-02 Thread WANG Chao
On 11/29/13 at 11:10am, Baoquan He wrote:
> On 11/20/13 at 12:50pm, Vivek Goyal wrote:
> > diff --git a/arch/x86/kernel/machine_kexec_64.c 
> > b/arch/x86/kernel/machine_kexec_64.c
> > index 4eabc16..fb41b73 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -22,6 +22,13 @@
> >  #include 
> >  #include 
> >  
> > +/* arch dependent functionality related to kexec file based syscall */
> > +static struct kexec_file_type kexec_file_type[]={
> > +   {"", NULL, NULL, NULL, NULL},
> > +};
> > +
> > +
> > +void *arch_kexec_kernel_image_load(struct kimage *image, char *kernel,
> > +   unsigned long kernel_len, char *initrd,
> > +   unsigned long initrd_len, char *cmdline,
> > +   unsigned long cmdline_len)
> > +{
> > +   int idx = image->file_handler_idx;
> > +
> > +   if (idx < 0)
> > +   return ERR_PTR(-ENOEXEC);
> > +
> > +   return kexec_file_type[idx].load(image, kernel, kernel_len, initrd,
> > +   initrd_len, cmdline, cmdline_len);
> > +}
> > +
> > +int arch_image_file_post_load_cleanup(struct kimage *image)
> > +{
> 
> Hi Vivek,
> 
> This function is defined as one of arch specific fucntion set, why don't
> we name it in a unified prefix as others.
> 
> And name of the default dummy function in kernel/kexec.c is not consistent
> with the arch specific one, so currently
> arch_image_file_post_load_cleanup of x86 arch is not called. Please
> consider wihch one need be changed.

I think arch_kimage_file_post_load_cleanup should be used here.

And Good catch!

Thanks
WANG Chao

> 
> > +
> > +void __attribute__ ((weak))
> > +arch_kimage_file_post_load_cleanup(struct kimage *image)
> > +{
> > +   return;
> > +}
> > +
> 
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] x86, calgary: use 8M TCE table size by default

2014-03-07 Thread WANG Chao
kexec-tools wants to pass kdump kernel needed memmap via E820 directly,
instead of memmap=exactmap. That'll make saved_max_pfn totally useless.

Muli suggest to hard code TCE table size to max (8M) so that kdump
kernel could use this default size and kexec-tools doesn't need to pass
saved_max_pfn to kdump kernel in any way.

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/pci-calgary_64.c | 25 -
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index 299d493..b26ab94 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -1207,25 +1207,17 @@ error:
return ret;
 }
 
-static inline int __init determine_tce_table_size(u64 ram)
+static inline int __init determine_tce_table_size(void)
 {
-   int ret;
-
if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
return specified_table_size;
+   else
+   /*
+* Use 8M by default to get rid of saved_max_pfn,
+* suggested by Muli
+*/
+   return TCE_TABLE_SIZE_8M;
 
-   /*
-* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
-* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
-* larger table size has twice as many entries, so shift the
-* max ram address by 13 to divide by 8K and then look at the
-* order of the result to choose between 0-7.
-*/
-   ret = get_order(ram >> 13);
-   if (ret > TCE_TABLE_SIZE_8M)
-   ret = TCE_TABLE_SIZE_8M;
-
-   return ret;
 }
 
 static int __init build_detail_arrays(void)
@@ -1418,8 +1410,7 @@ int __init detect_calgary(void)
return -ENOMEM;
}
 
-   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
-   saved_max_pfn : max_pfn) * PAGE_SIZE);
+   specified_table_size = determine_tce_table_size();
 
for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
struct calgary_bus_info *info = &bus_info[bus];
-- 
1.8.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, calgary: use 8M TCE table size by default

2014-03-07 Thread WANG Chao
Hi, Vivek

On 03/07/14 at 09:14am, Vivek Goyal wrote:
> On Fri, Mar 07, 2014 at 04:10:16PM +0800, WANG Chao wrote:
> 
> [..]
> > }
> >  
> > -   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
> > -   saved_max_pfn : max_pfn) * PAGE_SIZE);
> > +   specified_table_size = determine_tce_table_size();
> 
> I don't think you can get rid of saved_max_pfn right away. What if
> somebody is using old first kernel and new second kernel. Then old kernel
> will still be using a table size which is smaller than 8M.

If TCE table size is hard coded to 8M, the new 1st kernel can never be
compatible with the old 2nd kernel. Because old kernel's TCE table size
is depending on saved_max_pfn, which could be from 64K to 8M, not
necessarily the hard coded value 8M.

If we want to drop saved_max_pfn once and for all, that's a trade-off
which is worth doing. Because as Muli pointed out last time, there are
very few people likely running upstream kernel on calgary machine.  And
most of them would use the same kernel as 1st and 2nd kernel. For the
very very few people who suffers from it, we can always encourage them
to use the same kernel for 1st and 2nd kernel or other solutions we
have.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, calgary: use 8M TCE table size by default

2014-03-10 Thread WANG Chao
On 03/07/14 at 11:09am, Vivek Goyal wrote:
> On Fri, Mar 07, 2014 at 11:52:04PM +0800, WANG Chao wrote:
> > Hi, Vivek
> > 
> > On 03/07/14 at 09:14am, Vivek Goyal wrote:
> > > On Fri, Mar 07, 2014 at 04:10:16PM +0800, WANG Chao wrote:
> > > 
> > > [..]
> > > > }
> > > >  
> > > > -   specified_table_size = 
> > > > determine_tce_table_size((is_kdump_kernel() ?
> > > > -   saved_max_pfn : max_pfn) * 
> > > > PAGE_SIZE);
> > > > +   specified_table_size = determine_tce_table_size();
> > > 
> > > I don't think you can get rid of saved_max_pfn right away. What if
> > > somebody is using old first kernel and new second kernel. Then old kernel
> > > will still be using a table size which is smaller than 8M.
> > 
> > If TCE table size is hard coded to 8M, the new 1st kernel can never be
> > compatible with the old 2nd kernel.
> 
> I gave example of old first kernel and new second kernel, and not
> vice-versa.
> 
> So we have two scenarios.
> 
> - Old first kernel and new second kernel.
> - New first kernel and old second kernel.
> 
> If we want to make these two scenarios work with calgary, we need to
> use kexec-tools with --pass-memmap option. And also in new kernel we
> need to retain saved_max_pfn. Only difference will be, that we will
> set saved_max_pfn only if it is kdump kernel and if user passed a 
> memory map on kernel command line.

Old first kernel and new second kernel is easy to handle. Like you said,
we can use exactmap and set saved_max_pfn in kdump kernel, just as the
old way.

But it's still not clear to me how we can handle the case of new first
kernel and old second kernel.

Let's say, a calgary system has 2G memory. The new first kernel TCE
table size are hard coded to 8M. When the old kdump kernel is booting,
memmap=exactmap is parsed and saved_max_pfn is set to 2G/PAGE_SIZE. TCE
table size is determined by 2G ram size. So the size is 4M. We run into
the situation that TCE table is 8M in first kernel and 4M in second
kernel. This inconsistency of TCE table size would cause kdump kernel
fails somehow, that's why we brought in saved_max_pfn in the first place
as you know.

The problem is how to keep TCE table size being consistent between new
first kernel and old second kernel.

I can only think of exporting this value to user space but once we does
that, this knob file could introduce another layer of backward
compatible issue in the future.

> 
> So it is doable if it want to do it.
> 
> Now the real question is, is it worth introducing all this complication
> and confusion given the fact most of the people use same first and second
> kernel and given the fact that calgary is old hardware. What are the
> chances somebody will run into it.

I understand. I'd definitely try to keep things compatible between
the new and old if that's doable. But chances that people would run into
an issue may be lower than both of us think.

> 
> I would say it is not very complicated to maintain backward compatibility
> in this case. So let us keep saved_max_pfn for some time and make
> kexec-tools changes. Some time down the line, one can get rid of
> saved_max_pfn completely.

I'm just not convinced that we can properly handle the case of new first
kernel and old second kernel.

Could you be more specific about how that could be solved please?

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] x86, calgary: use 8M TCE table size by default

2014-03-10 Thread WANG Chao
New kexec-tools wants to pass kdump kernel needed memmap via E820
directly, instead of memmap=exactmap. This makes saved_max_pfn not
be passed down to 2nd kernel. To keep 1st kernel and 2nd kernel using
the same TCE table size, Muli suggest to hard code the size to max (8M).

We can't get rid of saved_max_pfn this time, for backward compatibility
with old first kernel and new second kernel. However new first kernel
and old second kernel can not work unfortunately.

v2->v1:
- retain saved_max_pfn so new 2nd kernel can work with old 1st kernel
  from Vivek

Signed-off-by: WANG Chao 
---
 arch/x86/kernel/pci-calgary_64.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index 299d493..0497f71 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -1207,23 +1207,31 @@ error:
return ret;
 }
 
-static inline int __init determine_tce_table_size(u64 ram)
+static inline int __init determine_tce_table_size(void)
 {
int ret;
 
if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
return specified_table_size;
 
-   /*
-* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
-* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
-* larger table size has twice as many entries, so shift the
-* max ram address by 13 to divide by 8K and then look at the
-* order of the result to choose between 0-7.
-*/
-   ret = get_order(ram >> 13);
-   if (ret > TCE_TABLE_SIZE_8M)
+   if (is_kdump_kernel() && saved_max_pfn) {
+   /*
+* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
+* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
+* larger table size has twice as many entries, so shift the
+* max ram address by 13 to divide by 8K and then look at the
+* order of the result to choose between 0-7.
+*/
+   ret = get_order((saved_max_pfn * PAGE_SIZE) >> 13);
+   if (ret > TCE_TABLE_SIZE_8M)
+   ret = TCE_TABLE_SIZE_8M;
+   } else {
+   /*
+* Use 8M by default (suggested by Muli) if it's not
+* kdump kernel and saved_max_pfn isn't set.
+*/
ret = TCE_TABLE_SIZE_8M;
+   }
 
return ret;
 }
@@ -1418,8 +1426,7 @@ int __init detect_calgary(void)
return -ENOMEM;
}
 
-   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
-   saved_max_pfn : max_pfn) * PAGE_SIZE);
+   specified_table_size = determine_tce_table_size();
 
for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
struct calgary_bus_info *info = &bus_info[bus];
-- 
1.8.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] mm/vmalloc.c: clean up map_vm_area third argument

2014-07-09 Thread WANG Chao
Currently map_vm_area() takes (struct page *** pages) as third argument,
and after mapping, it moves (*pages) to point to (*pages + nr_mappped_pages).

It looks like this kind of increment is useless to its caller these
days. The callers don't care about the increments and actually they're
trying to avoid this by passing another copy to map_vm_area().

The caller can always guarantee all the pages can be mapped into
vm_area as specified in first argument and the caller only cares about
whether map_vm_area() fails or not.

This patch cleans up the pointer movement in map_vm_area() and updates
its callers accordingly.

v2: Fix arch/tile/kernel/module.c::module_alloc().

Signed-off-by: WANG Chao 
---
 arch/tile/kernel/module.c|  2 +-
 drivers/lguest/core.c|  7 ++-
 drivers/staging/android/binder.c |  4 +---
 include/linux/vmalloc.h  |  2 +-
 mm/vmalloc.c | 14 +-
 mm/zsmalloc.c|  2 +-
 6 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/arch/tile/kernel/module.c b/arch/tile/kernel/module.c
index 4918d91..d19b13e 100644
--- a/arch/tile/kernel/module.c
+++ b/arch/tile/kernel/module.c
@@ -58,7 +58,7 @@ void *module_alloc(unsigned long size)
area->nr_pages = npages;
area->pages = pages;
 
-   if (map_vm_area(area, prot_rwx, &pages)) {
+   if (map_vm_area(area, prot_rwx, pages)) {
vunmap(area->addr);
goto error;
}
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
index 0bf1e4e..6590558 100644
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -42,7 +42,6 @@ DEFINE_MUTEX(lguest_lock);
 static __init int map_switcher(void)
 {
int i, err;
-   struct page **pagep;
 
/*
 * Map the Switcher in to high memory.
@@ -110,11 +109,9 @@ static __init int map_switcher(void)
 * This code actually sets up the pages we've allocated to appear at
 * switcher_addr.  map_vm_area() takes the vma we allocated above, the
 * kind of pages we're mapping (kernel pages), and a pointer to our
-* array of struct pages.  It increments that pointer, but we don't
-* care.
+* array of struct pages.
 */
-   pagep = lg_switcher_pages;
-   err = map_vm_area(switcher_vma, PAGE_KERNEL_EXEC, &pagep);
+   err = map_vm_area(switcher_vma, PAGE_KERNEL_EXEC, lg_switcher_pages);
if (err) {
printk("lguest: map_vm_area failed: %i\n", err);
goto free_vma;
diff --git a/drivers/staging/android/binder.c b/drivers/staging/android/binder.c
index a741da7..0ca9785 100644
--- a/drivers/staging/android/binder.c
+++ b/drivers/staging/android/binder.c
@@ -586,7 +586,6 @@ static int binder_update_page_range(struct binder_proc 
*proc, int allocate,
 
for (page_addr = start; page_addr < end; page_addr += PAGE_SIZE) {
int ret;
-   struct page **page_array_ptr;
 
page = &proc->pages[(page_addr - proc->buffer) / PAGE_SIZE];
 
@@ -599,8 +598,7 @@ static int binder_update_page_range(struct binder_proc 
*proc, int allocate,
}
tmp_area.addr = page_addr;
tmp_area.size = PAGE_SIZE + PAGE_SIZE /* guard page? */;
-   page_array_ptr = page;
-   ret = map_vm_area(&tmp_area, PAGE_KERNEL, &page_array_ptr);
+   ret = map_vm_area(&tmp_area, PAGE_KERNEL, page);
if (ret) {
pr_err("%d: binder_alloc_buf failed to map page at %p 
in kernel\n",
   proc->pid, page_addr);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 4b8a891..b87696f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -113,7 +113,7 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
-   struct page ***pages);
+   struct page **pages);
 #ifdef CONFIG_MMU
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f64632b..c36547f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1270,19 +1270,15 @@ void unmap_kernel_range(unsigned long addr, unsigned 
long size)
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
-int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
+int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 {
unsigned long addr = (unsigned long)area->addr;
unsigned long end = addr + get_vm_area_size(area);
int err;
 
-   err = vmap_page_range(addr, end, prot, *pages);
-   if (err > 0) {
-  

[PATCH] mm/vmalloc.c: clean up map_vm_area third argument

2014-07-07 Thread WANG Chao
Currently map_vm_area() takes (struct page *** pages) as third argument,
and after mapping, it moves (*pages) to point to (*pages + nr_mappped_pages).

It looks like this kind of increment is useless to its caller these
days. The callers don't care about the increments and actually they're
trying to avoid this by passing another copy to map_vm_area().

The caller can always guarantee all the pages can be mapped into
vm_area as specified in first argument and the caller only cares about
whether map_vm_area() fails or not.

This patch cleans up the pointer movement in map_vm_area() and updates
its callers accordingly.

Signed-off-by: WANG Chao 
---
 drivers/lguest/core.c|  7 ++-
 drivers/staging/android/binder.c |  4 +---
 include/linux/vmalloc.h  |  2 +-
 mm/vmalloc.c | 14 +-
 mm/zsmalloc.c|  2 +-
 5 files changed, 10 insertions(+), 19 deletions(-)

diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
index 0bf1e4e..6590558 100644
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -42,7 +42,6 @@ DEFINE_MUTEX(lguest_lock);
 static __init int map_switcher(void)
 {
int i, err;
-   struct page **pagep;
 
/*
 * Map the Switcher in to high memory.
@@ -110,11 +109,9 @@ static __init int map_switcher(void)
 * This code actually sets up the pages we've allocated to appear at
 * switcher_addr.  map_vm_area() takes the vma we allocated above, the
 * kind of pages we're mapping (kernel pages), and a pointer to our
-* array of struct pages.  It increments that pointer, but we don't
-* care.
+* array of struct pages.
 */
-   pagep = lg_switcher_pages;
-   err = map_vm_area(switcher_vma, PAGE_KERNEL_EXEC, &pagep);
+   err = map_vm_area(switcher_vma, PAGE_KERNEL_EXEC, lg_switcher_pages);
if (err) {
printk("lguest: map_vm_area failed: %i\n", err);
goto free_vma;
diff --git a/drivers/staging/android/binder.c b/drivers/staging/android/binder.c
index a741da7..0ca9785 100644
--- a/drivers/staging/android/binder.c
+++ b/drivers/staging/android/binder.c
@@ -586,7 +586,6 @@ static int binder_update_page_range(struct binder_proc 
*proc, int allocate,
 
for (page_addr = start; page_addr < end; page_addr += PAGE_SIZE) {
int ret;
-   struct page **page_array_ptr;
 
page = &proc->pages[(page_addr - proc->buffer) / PAGE_SIZE];
 
@@ -599,8 +598,7 @@ static int binder_update_page_range(struct binder_proc 
*proc, int allocate,
}
tmp_area.addr = page_addr;
tmp_area.size = PAGE_SIZE + PAGE_SIZE /* guard page? */;
-   page_array_ptr = page;
-   ret = map_vm_area(&tmp_area, PAGE_KERNEL, &page_array_ptr);
+   ret = map_vm_area(&tmp_area, PAGE_KERNEL, page);
if (ret) {
pr_err("%d: binder_alloc_buf failed to map page at %p 
in kernel\n",
   proc->pid, page_addr);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 4b8a891..b87696f 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -113,7 +113,7 @@ extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
 extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
-   struct page ***pages);
+   struct page **pages);
 #ifdef CONFIG_MMU
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
pgprot_t prot, struct page **pages);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f64632b..c36547f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1270,19 +1270,15 @@ void unmap_kernel_range(unsigned long addr, unsigned 
long size)
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
-int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page ***pages)
+int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
 {
unsigned long addr = (unsigned long)area->addr;
unsigned long end = addr + get_vm_area_size(area);
int err;
 
-   err = vmap_page_range(addr, end, prot, *pages);
-   if (err > 0) {
-   *pages += err;
-   err = 0;
-   }
+   err = vmap_page_range(addr, end, prot, pages);
 
-   return err;
+   return err > 0 ? 0 : err;
 }
 EXPORT_SYMBOL_GPL(map_vm_area);
 
@@ -1548,7 +1544,7 @@ void *vmap(struct page **pages, unsigned int count,
if (!area)
return NULL;
 
-   if (map_vm_area(area, prot, &pages)) {
+   if (map_vm_area(area, prot, pages)) {
vunmap(area->addr);
return NULL;
}
@@ -1604,7 +1600,7 @@ static void *__vmalloc_area_node(struct vm

kaslr relocation incompitable with kernel loaded high

2014-04-21 Thread WANG Chao
Hi, Kees

When I'm testing kaslr with kdump, I find that when 2nd kernel is loaded
high, it doesn't boot.

I reserved 128M memory at high with kernel cmdline
"crashkernel=128M,high crashkernel=0,low", and for which I got:

[0.00] Reserving 128MB of memory at 6896MB for crashkernel (System RAM: 
6013MB)

Then I load kdump kernel into the reserved memory region, using a local
modified kexec-tools which is passing e820 in boot_params.

The e820 map of system RAM passed to 2nd kernel:

E820 memmap (of RAM):
1000-0009e3ff (1)
0001af00-0001b6f5dfff (1)
0001b6fff400-0001b6ff (1)

In which, 2nd kernel is loaded at 0x1b500.

After triggerred a system crash, 2nd kernel doesn't boot even with
"nokaslr" cmdline:

# echo c > /proc/sysrq-trigger
[..]

I'm in purgatory
early console in decompress_kernel
KASLR disabled...

Decompressing Linux... Parsing ELF... Performing relocations... 

32-bit relocation outside of kernel!


 -- System halted


I know 32bit relocation code has recently moved to misc.c and it's a
ground work for kaslr on x86_64. However it doesn't work out on a kernel
high loaded case.

I'm not quite familiar with this. But I'm wondering if it makes sense to
skip handle_relocation() if kaslr doesn't take effects (nokaslr or
kaslr failed to find suitable area).

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kaslr relocation incompitable with kernel loaded high

2014-04-21 Thread WANG Chao
On 04/21/14 at 11:01am, Kees Cook wrote:
> On Mon, Apr 21, 2014 at 10:56 AM, Yinghai Lu  wrote:
> > On Mon, Apr 21, 2014 at 3:52 AM, WANG Chao  wrote:
> >> Hi, Kees
> >>
> >> When I'm testing kaslr with kdump, I find that when 2nd kernel is loaded
> >> high, it doesn't boot.
> >>
> >> I reserved 128M memory at high with kernel cmdline
> >> "crashkernel=128M,high crashkernel=0,low", and for which I got:
> >>
> >> [0.00] Reserving 128MB of memory at 6896MB for crashkernel (System 
> >> RAM: 6013MB)
> >>
> >> Then I load kdump kernel into the reserved memory region, using a local
> >> modified kexec-tools which is passing e820 in boot_params.
> >>
> >> The e820 map of system RAM passed to 2nd kernel:
> >>
> >> E820 memmap (of RAM):
> >> 1000-0009e3ff (1)
> >> 0001af00-0001b6f5dfff (1)
> >> 0001b6fff400-0001b6ff (1)
> >>
> >> In which, 2nd kernel is loaded at 0x1b500.
> >>
> >> After triggerred a system crash, 2nd kernel doesn't boot even with
> >> "nokaslr" cmdline:
> >>
> >> # echo c > /proc/sysrq-trigger
> >> [..]
> >>
> >> I'm in purgatory
> >> early console in decompress_kernel
> >> KASLR disabled...
> >>
> >> Decompressing Linux... Parsing ELF... Performing relocations...
> >>
> >> 32-bit relocation outside of kernel!
> >
> > Interesting, when kernel get at "early console in decompress_kernel"
> > kernel already in 64 bit...
> >
> > what does it mean "32-bit relocation outside of kernel" ?
> >
> > why 32-bit is involved ?
> 
> The 64-bit kernel has both 64 and 32 bit relocations (there are two
> tables at the end of the kernel image). The error means that the
> resulting relocation is believed to be outside the kernel image:
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/boot/compressed/misc.c#n283
> 
> Which means there is likely something wrong with this calculation in
> your situation:
> 
> /*
>  * Calculate the delta between where vmlinux was linked to load
>  * and where it was actually loaded.
>  */
> delta = min_addr - LOAD_PHYSICAL_ADDR;
> 

Probably.

I'm wondering why such relocation is still needed when kaslr
is disabled in 64bit path. I think the relocation work is already
handled in head_64.S.

FWIW, I got these following relocation info:

# arch/x86/tools/relocs --text vmlinux
.section ".data.reloc","a"
.balign 4
.long 0x
.long 0x81004687
[..]
.long 0x81e6b018
.long 0x
.long 0x810a
[..]
.long 0x81e6c9fc


Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kaslr relocation incompitable with kernel loaded high

2014-04-21 Thread WANG Chao
On 04/21/14 at 09:58pm, Yinghai Lu wrote:
> On Mon, Apr 21, 2014 at 8:16 PM, WANG Chao  wrote:
> > On 04/21/14 at 11:01am, Kees Cook wrote:
> >> On Mon, Apr 21, 2014 at 10:56 AM, Yinghai Lu  wrote:
> >> > On Mon, Apr 21, 2014 at 3:52 AM, WANG Chao  wrote:
> >> >> Hi, Kees
> >> >>
> >> >> When I'm testing kaslr with kdump, I find that when 2nd kernel is loaded
> >> >> high, it doesn't boot.
> >> >>
> >> >> I reserved 128M memory at high with kernel cmdline
> >> >> "crashkernel=128M,high crashkernel=0,low", and for which I got:
> >> >>
> >> >> [0.00] Reserving 128MB of memory at 6896MB for crashkernel 
> >> >> (System RAM: 6013MB)
> >> >>
> >> >> Then I load kdump kernel into the reserved memory region, using a local
> >> >> modified kexec-tools which is passing e820 in boot_params.
> >> >>
> >> >> The e820 map of system RAM passed to 2nd kernel:
> >> >>
> >> >> E820 memmap (of RAM):
> >> >> 1000-0009e3ff (1)
> >> >> 0001af00-0001b6f5dfff (1)
> >> >> 0001b6fff400-0001b6ff (1)
> >> >>
> >> >> In which, 2nd kernel is loaded at 0x1b500.
> >> >>
> >> >> After triggerred a system crash, 2nd kernel doesn't boot even with
> >> >> "nokaslr" cmdline:
> >> >>
> >> >> # echo c > /proc/sysrq-trigger
> >> >> [..]
> >> >>
> >> >> I'm in purgatory
> >> >> early console in decompress_kernel
> >> >> KASLR disabled...
> >> >>
> >> >> Decompressing Linux... Parsing ELF... Performing relocations...
> >> >>
> >> >> 32-bit relocation outside of kernel!
> >> >
> >> > Interesting, when kernel get at "early console in decompress_kernel"
> >> > kernel already in 64 bit...
> >> >
> >> > what does it mean "32-bit relocation outside of kernel" ?
> >> >
> >> > why 32-bit is involved ?
> >>
> >> The 64-bit kernel has both 64 and 32 bit relocations (there are two
> >> tables at the end of the kernel image). The error means that the
> >> resulting relocation is believed to be outside the kernel image:
> >>
> >> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/boot/compressed/misc.c#n283
> >>
> >> Which means there is likely something wrong with this calculation in
> >> your situation:
> >>
> >> /*
> >>  * Calculate the delta between where vmlinux was linked to load
> >>  * and where it was actually loaded.
> >>  */
> >> delta = min_addr - LOAD_PHYSICAL_ADDR;
> >>
> >
> > Probably.
> 
> Please check attached that patch that will solve nokaslr.
> 
> Somehow I got "KASLR could not find suitable E820 region..."
> so i only have "No relocation needed"

I think it makes sense. If output from choose_kernel_location() doesn't
change (output == output_orig), we shouldn't call relocation code.

There are two situations that makes output == output_orig:
- "nokaslr" case
- "KASLR could not find suitable E820 region" case.

> 
> will check that later.

> ---
>  arch/x86/boot/compressed/misc.c |   14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/arch/x86/boot/compressed/misc.c
> ===
> --- linux-2.6.orig/arch/x86/boot/compressed/misc.c
> +++ linux-2.6/arch/x86/boot/compressed/misc.c
> @@ -235,8 +235,9 @@ static void error(char *x)
>   asm("hlt");
>  }
>  
> -#if CONFIG_X86_NEED_RELOCS
> -static void handle_relocations(void *output, unsigned long output_len)
> +#ifdef CONFIG_X86_NEED_RELOCS
> +static void handle_relocations(void *output_orig, void *output,
> +unsigned long output_len)
>  {
>   int *reloc;
>   unsigned long delta, map, ptr;
> @@ -247,7 +248,7 @@ static void handle_relocations(void *out
>* Calculate the delta between where vmlinux was linked to load
>* and where it was actually loaded.
>*/
> - delta = min_addr - LOAD_PHYSICAL_ADDR;
> + delta = min_addr - (unsigned long)output_orig;
>   if (!delta) {
>   debug_putstr("No relocation needed

[PATCH] staging: android: ion: fix missing a blank line after declarations

2014-07-23 Thread WANG Chao
Fix checkpatch warning of missing a blank line after declarations,
found under drivers/staging/android/ion/.

Signed-off-by: WANG Chao 
---
 drivers/staging/android/ion/ion.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/staging/android/ion/ion.c 
b/drivers/staging/android/ion/ion.c
index 2703609..cad76ae 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -805,6 +805,7 @@ struct ion_client *ion_client_create(struct ion_device *dev,
client, &debug_client_fops);
if (!client->debug_root) {
char buf[256], *path;
+
path = dentry_path(dev->clients_debug_root, buf, 256);
pr_err("Failed to create client debugfs at %s/%s\n",
path, client->display_name);
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] staging: android: ion: fix missing a blank line after declarations

2014-07-23 Thread WANG Chao
On 07/24/14 at 02:26am, WANG Chao wrote:
> Fix checkpatch warning of missing a blank line after declarations,
> found under drivers/staging/android/ion/.

Sorry I forgot to mention this is for linux-next.

> 
> Signed-off-by: WANG Chao 
> ---
>  drivers/staging/android/ion/ion.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/staging/android/ion/ion.c 
> b/drivers/staging/android/ion/ion.c
> index 2703609..cad76ae 100644
> --- a/drivers/staging/android/ion/ion.c
> +++ b/drivers/staging/android/ion/ion.c
> @@ -805,6 +805,7 @@ struct ion_client *ion_client_create(struct ion_device 
> *dev,
>   client, &debug_client_fops);
>   if (!client->debug_root) {
>   char buf[256], *path;
> +
>   path = dentry_path(dev->clients_debug_root, buf, 256);
>   pr_err("Failed to create client debugfs at %s/%s\n",
>   path, client->display_name);
> -- 
> 1.9.3
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[linux-next PATCH] staging: android: ion: fix prefer kcalloc over kzalloc with multiply

2014-07-23 Thread WANG Chao
Fix checkpatch warning of prefer kcalloc over kzalloc with multiply,
found under drivers/staging/android/ion/.

Signed-off-by: WANG Chao 
---
 drivers/staging/android/ion/ion_dummy_driver.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/android/ion/ion_dummy_driver.c 
b/drivers/staging/android/ion/ion_dummy_driver.c
index 3a45e79..6d0a38a 100644
--- a/drivers/staging/android/ion/ion_dummy_driver.c
+++ b/drivers/staging/android/ion/ion_dummy_driver.c
@@ -68,7 +68,7 @@ static int __init ion_dummy_init(void)
int i, err;
 
idev = ion_device_create(NULL);
-   heaps = kzalloc(sizeof(struct ion_heap *) * dummy_ion_pdata.nr,
+   heaps = kcalloc(dummy_ion_pdata.nr, sizeof(struct ion_heap *),
GFP_KERNEL);
if (!heaps)
return -ENOMEM;
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[linux-next PATCH] staging: android: ion: fix void function return statements

2014-07-23 Thread WANG Chao
Fix checkpatch warning of void function return statements, found under
drivers/staging/android/ion.

Signed-off-by: WANG Chao 
---
 drivers/staging/android/ion/ion.c   | 1 -
 drivers/staging/android/ion/ion_carveout_heap.c | 1 -
 drivers/staging/android/ion/ion_chunk_heap.c| 1 -
 drivers/staging/android/ion/ion_dummy_driver.c  | 2 --
 drivers/staging/android/ion/ion_system_heap.c   | 1 -
 5 files changed, 6 deletions(-)

diff --git a/drivers/staging/android/ion/ion.c 
b/drivers/staging/android/ion/ion.c
index cad76ae..56604f4 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -1057,7 +1057,6 @@ static void *ion_dma_buf_kmap(struct dma_buf *dmabuf, 
unsigned long offset)
 static void ion_dma_buf_kunmap(struct dma_buf *dmabuf, unsigned long offset,
   void *ptr)
 {
-   return;
 }
 
 static int ion_dma_buf_begin_cpu_access(struct dma_buf *dmabuf, size_t start,
diff --git a/drivers/staging/android/ion/ion_carveout_heap.c 
b/drivers/staging/android/ion/ion_carveout_heap.c
index dcb6f21..9156d82 100644
--- a/drivers/staging/android/ion/ion_carveout_heap.c
+++ b/drivers/staging/android/ion/ion_carveout_heap.c
@@ -133,7 +133,6 @@ static struct sg_table *ion_carveout_heap_map_dma(struct 
ion_heap *heap,
 static void ion_carveout_heap_unmap_dma(struct ion_heap *heap,
struct ion_buffer *buffer)
 {
-   return;
 }
 
 static struct ion_heap_ops carveout_heap_ops = {
diff --git a/drivers/staging/android/ion/ion_chunk_heap.c 
b/drivers/staging/android/ion/ion_chunk_heap.c
index 9c3e49a..3e6ec2e 100644
--- a/drivers/staging/android/ion/ion_chunk_heap.c
+++ b/drivers/staging/android/ion/ion_chunk_heap.c
@@ -126,7 +126,6 @@ static struct sg_table *ion_chunk_heap_map_dma(struct 
ion_heap *heap,
 static void ion_chunk_heap_unmap_dma(struct ion_heap *heap,
 struct ion_buffer *buffer)
 {
-   return;
 }
 
 static struct ion_heap_ops chunk_heap_ops = {
diff --git a/drivers/staging/android/ion/ion_dummy_driver.c 
b/drivers/staging/android/ion/ion_dummy_driver.c
index 6d0a38a..f3ea1c3 100644
--- a/drivers/staging/android/ion/ion_dummy_driver.c
+++ b/drivers/staging/android/ion/ion_dummy_driver.c
@@ -152,7 +152,5 @@ static void __exit ion_dummy_exit(void)
dummy_heaps[ION_HEAP_TYPE_CHUNK].size);
chunk_ptr = NULL;
}
-
-   return;
 }
 __exitcall(ion_dummy_exit);
diff --git a/drivers/staging/android/ion/ion_system_heap.c 
b/drivers/staging/android/ion/ion_system_heap.c
index 6b77c51..da2a63c 100644
--- a/drivers/staging/android/ion/ion_system_heap.c
+++ b/drivers/staging/android/ion/ion_system_heap.c
@@ -205,7 +205,6 @@ static struct sg_table *ion_system_heap_map_dma(struct 
ion_heap *heap,
 static void ion_system_heap_unmap_dma(struct ion_heap *heap,
  struct ion_buffer *buffer)
 {
-   return;
 }
 
 static int ion_system_heap_shrink(struct ion_heap *heap, gfp_t gfp_mask,
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration

2014-06-05 Thread WANG Chao
On 06/05/14 at 11:22am, Vivek Goyal wrote:
> On Thu, Jun 05, 2014 at 11:16:39AM -0400, Vivek Goyal wrote:
> > On Thu, Jun 05, 2014 at 05:56:03PM +0800, WANG Chao wrote:
> > 
> > [..]
> > > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > > index c435c5f..a3044e6 100644
> > > > --- a/kernel/kexec.c
> > > > +++ b/kernel/kexec.c
> > > > @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, 
> > > > compat_ulong_t, entry,
> > > >  }
> > > >  #endif
> > > >  
> > > > +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> > > > +   const char __user *, cmdline_ptr, unsigned long,
> > > > +   cmdline_len, unsigned long, flags)
> > > 
> > > initrd is optional for system boot.
> > > 
> > > How about using int *kernel_fd and int *initrd_fd as the argument? Then
> > > if I don't need initrd, in userspace I can do this:
> > 
> > Hi Chao,
> > 
> > I really am not too keen converting plain int fd arguments into pointers.
> > 
> > Given the fact that fd is int, that means all valid values are greater
> > than 0. How about using -1 to denote that initrd is not being loaded?
> > 
> > This does create one little anomaly and that is for all -ve values we
> > will return -EBADF except -1 which we special cased.
> 
> Or we could do.
> 
> - Define extra flag which should be set by user if valid initrd fd is not
>   being passed. Say, KEXEC_FILE_NO_INITRAMFS. And if kernel sees that flag
>   it will not try to parse value passed in argument initrd_fd at all.
> 
> I think I like this better.

Yep, it makes more sense to me as well. Please add this new flag.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load

2014-06-05 Thread WANG Chao
On 06/03/14 at 09:06am, Vivek Goyal wrote:
> Previous patch provided the interface definition and this patch prvides
> implementation of new syscall.
> 
> Previously segment list was prepared in user space. Now user space just
> passes kernel fd, initrd fd and command line and kernel will create a
> segment list internally.
> 
> This patch contains generic part of the code. Actual segment preparation
> and loading is done by arch and image specific loader. Which comes in
> next patch.
> 
> Signed-off-by: Vivek Goyal 

[..]
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a3044e6..1ad4d60 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c

> +static int kimage_file_prepare_segments(struct kimage *image, int kernel_fd,
> + int initrd_fd, const char __user *cmdline_ptr,
> + unsigned long cmdline_len)
> +{
> + int ret = 0;
> + void *ldata;
> +
> + ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> + &image->kernel_buf_len);
> + if (ret)
> + return ret;
> +
> + /* Call arch image probe handlers */
> + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> + image->kernel_buf_len);
> +
> + if (ret)
> + goto out;
> +
> + ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> + &image->initrd_buf_len);
> + if (ret)
> + goto out;
> +
> + image->cmdline_buf = vzalloc(cmdline_len);

You should validate the upper/lower boundary of cmdline_len before
calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
message would be fired.

> + if (!image->cmdline_buf)
> + goto out;
> +
> + ret = copy_from_user(image->cmdline_buf, cmdline_ptr, cmdline_len);
> + if (ret) {
> + ret = -EFAULT;
> + goto out;
> + }
> +
> + image->cmdline_buf_len = cmdline_len;
> +
> + /* command line should be a string with last byte null */
> + if (image->cmdline_buf[cmdline_len - 1] != '\0') {
> + ret = -EINVAL;
> + goto out;
> + }

Given the fact that command line is optional as well as initrd, I think
above chunk of code needs to update a bit for the case cmdline_len is 0
or cmdline_buf is pointing NULL?

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load

2014-06-08 Thread WANG Chao
On 06/09/14 at 10:11am, Dave Young wrote:
> On 06/06/14 at 02:19pm, Vivek Goyal wrote:
> > On Fri, Jun 06, 2014 at 02:56:05PM +0800, WANG Chao wrote:
> > > On 06/03/14 at 09:06am, Vivek Goyal wrote:
> > > > Previous patch provided the interface definition and this patch prvides
> > > > implementation of new syscall.
> > > > 
> > > > Previously segment list was prepared in user space. Now user space just
> > > > passes kernel fd, initrd fd and command line and kernel will create a
> > > > segment list internally.
> > > > 
> > > > This patch contains generic part of the code. Actual segment preparation
> > > > and loading is done by arch and image specific loader. Which comes in
> > > > next patch.
> > > > 
> > > > Signed-off-by: Vivek Goyal 
> > > 
> > > [..]
> > > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > > index a3044e6..1ad4d60 100644
> > > > --- a/kernel/kexec.c
> > > > +++ b/kernel/kexec.c
> > > 
> > > > +static int kimage_file_prepare_segments(struct kimage *image, int 
> > > > kernel_fd,
> > > > +   int initrd_fd, const char __user *cmdline_ptr,
> > > > +   unsigned long cmdline_len)
> > > > +{
> > > > +   int ret = 0;
> > > > +   void *ldata;
> > > > +
> > > > +   ret = copy_file_from_fd(kernel_fd, &image->kernel_buf,
> > > > +   &image->kernel_buf_len);
> > > > +   if (ret)
> > > > +   return ret;
> > > > +
> > > > +   /* Call arch image probe handlers */
> > > > +   ret = arch_kexec_kernel_image_probe(image, image->kernel_buf,
> > > > +   image->kernel_buf_len);
> > > > +
> > > > +   if (ret)
> > > > +   goto out;
> > > > +
> > > > +   ret = copy_file_from_fd(initrd_fd, &image->initrd_buf,
> > > > +   &image->initrd_buf_len);
> > > > +   if (ret)
> > > > +   goto out;
> > > > +
> > > > +   image->cmdline_buf = vzalloc(cmdline_len);
> > > 
> > > You should validate the upper/lower boundary of cmdline_len before
> > > calling vzalloc. When cmdline_len is 0 or too large, vmalloc failure
> > > message would be fired.
> > 
> > What's the upper length of vzalloc(). I think if it is too big to alloc,
> > then vzalloc() should return me an error?

When allocating too large, eg. vzalloc(-1), kernel spits:

[  457.407579] vmalloc: allocation failure: 18446744073709551606 bytes
[  457.413854] kexec: page allocation failure: order:0, mode:0x80d2
[  457.419853] CPU: 3 PID: 2058 Comm: kexec Not tainted
3.15.0-rc8-00096-g3dc85e8 #10
[  457.427408] Hardware name: Dell Inc. OptiPlex 760
/0M860N, BIOS A12 05/23/2011
[  457.435999]  81a2f678 8800bfb03db0 816944fd
80d2
[  457.443422]  8800bfb03e38 8118a31a 81a2f678
8800bfb03dd0
[  457.450851]  88010018 8800bfb03e48 8800bfb03de8
8800bfb03e10
[  457.458278] Call Trace:
[  457.460731]  [] dump_stack+0x45/0x56
[  457.465865]  [] warn_alloc_failed+0xda/0x140
[  457.471693]  [] ? kernel_read+0x41/0x60
[  457.477085]  [] __vmalloc_node_range+0x1b6/0x270
[  457.483256]  [] vzalloc+0x4b/0x50
[  457.488132]  [] ?
kimage_file_prepare_segments.part.10+0x85/0x140
[  457.495774]  []
kimage_file_prepare_segments.part.10+0x85/0x140
[  457.503244]  [] SyS_kexec_file_load+0x38a/0x690
[  457.509330]  [] system_call_fastpath+0x16/0x1b
[..]

I think it's better to do some sane check to prevent such warning when
taking arbitrary argument from user space.

> 
> function __vmalloc_node_range:
> if (!size || (size >> PAGE_SHIFT) > totalram_pages)
> goto fail;
> 
> So I think only checking cmdline_len == 0 is enough.
> 
> For the upper length shouldn't it be stripped to COMMAND_LINE_SIZE?

Yes, COMMAND_LINE_SIZE
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load

2014-06-13 Thread WANG Chao
On 06/13/14 at 09:50am, Borislav Petkov wrote:
> On Mon, Jun 09, 2014 at 11:41:37AM -0400, Vivek Goyal wrote:
> > IIUC, COMMAND_LINE_SIZE gives max limits of running kernel and it does
> > not tell us anything about command line size supported by kernel being
> > loaded.
> 
> Whatever you do, you do need a sane default because even querying the
> boot protocol is not reliable as the to-be-loaded kernel's boot protocol
> might be manipulated too, before signing (who knows what people do
> in the wild).

Make sense.

> 
> So having a sane, unconditional fallback COMMAND_LINE_SIZE from the
> first kernel is a must, methinks.

By greping for COMMAND_LINE_SIZE for different arch, I think 8K being
the fallback, in general, is good for now and the future:

alpha/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 256

arm/include/uapi/asm/setup.h
19:#define COMMAND_LINE_SIZE 1024

avr32/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE 256

cris/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 256

frv/include/uapi/asm/setup.h
15:#define COMMAND_LINE_SIZE   512

ia64/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 2048

m32r/include/uapi/asm/setup.h
8:#define COMMAND_LINE_SIZE   512

m68k/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE 256

mips/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 4096

parisc/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 1024

powerpc/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 2048

s390/include/uapi/asm/setup.h
9:#define COMMAND_LINE_SIZE 4096

um/include/asm/setup.h
8:#define COMMAND_LINE_SIZE 4096

x86/include/asm/setup.h
6:#define COMMAND_LINE_SIZE 2048

xtensa/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE256

c6x/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE   1024

microblaze/include/uapi/asm/setup.h
14:#define COMMAND_LINE_SIZE256

mn10300/include/uapi/asm/param.h
16:#define COMMAND_LINE_SIZE 256

score/include/uapi/asm/setup.h
4:#define COMMAND_LINE_SIZE 256

tile/include/uapi/asm/setup.h
18:#define COMMAND_LINE_SIZE2048

arc/include/asm/setup.h
15:#define COMMAND_LINE_SIZE 256

arm64/include/uapi/asm/setup.h
24:#define COMMAND_LINE_SIZE2048
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 07/13] kexec: Implementation of new syscall kexec_file_load

2014-06-13 Thread WANG Chao
On 06/13/14 at 10:10am, Borislav Petkov wrote:
> On Fri, Jun 13, 2014 at 04:00:28PM +0800, WANG Chao wrote:
> > By greping for COMMAND_LINE_SIZE for different arch, I think 8K being
> > the fallback, in general, is good for now and the future:
> 
> Why - we could simply use the arch default one.

Emm.. I'm not sure, but I think there might be a chance that
COMMAND_LINE_SIZE extend in the future. In general 8K is safe to use,
because the current greatest one is 4K.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 00/13][V3] kexec: A new system call to allow in kernel loading

2014-06-04 Thread WANG Chao
On 06/03/14 at 09:12am, Vivek Goyal wrote:
> On Tue, Jun 03, 2014 at 09:06:49AM -0400, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V3 of the patchset. Previous versions were posted here.
> > 
> > V1: https://lkml.org/lkml/2013/11/20/540
> > V2: https://lkml.org/lkml/2014/1/27/331
> > 
> > Changes since v2:
> > 
> > - Took care of most of the review comments from V2.
> > - Added support for kexec/kdump on EFI systems.
> > - Dropped support for loading ELF vmlinux.
> > 
> > This patch series is generated on top of 3.15.0-rc8. It also requires a
> > two patch cleanup series which is sitting in -tip tree here.
> 
> I used following kexec-tools patches to test kernel changes.
> 
> Thanks
> Vivek
> 
> 
> kexec-tools: Provide an option to make use of new system call
> 
> This patch provides and option --use-kexec2-syscall, to force use of
> new system call for kexec. Default is to continue to use old syscall.
> 
> Signed-off-by: Vivek Goyal 

Hi, Vivek

In your kexec-tools patch, you mentioned '-s' as a short option for
--use-kexec2-syscall in usage(). But it doesn't work.

[..]
> Index: kexec-tools/kexec/kexec.h
> ===
> --- kexec-tools.orig/kexec/kexec.h2014-06-02 14:34:16.719774316 -0400
> +++ kexec-tools/kexec/kexec.h 2014-06-02 14:34:42.010036325 -0400
> @@ -156,6 +156,13 @@ struct kexec_info {
>   unsigned long kexec_flags;
>   unsigned long backup_src_start;
>   unsigned long backup_src_size;
> + /* Set to 1 if we are using kexec2 syscall */
> + unsigned long file_mode :1;
> +
> + /* Filled by kernel image processing code */
> + int initrd_fd;
> + char *command_line;
> + int command_line_len;
>  };
>  
>  struct arch_map_entry {
> @@ -207,6 +214,7 @@ extern int file_types;
>  #define OPT_UNLOAD   'u'
>  #define OPT_TYPE 't'
>  #define OPT_PANIC'p'
> +#define OPT_USE_KEXEC2_SYSCALL   's'
>  #define OPT_MEM_MIN 256
>  #define OPT_MEM_MAX 257
>  #define OPT_REUSE_INITRD 258
> @@ -230,6 +238,7 @@ extern int file_types;
>   { "mem-min",1, 0, OPT_MEM_MIN }, \
>   { "mem-max",1, 0, OPT_MEM_MAX }, \
>   { "reuseinitrd",0, 0, OPT_REUSE_INITRD }, \
> + { "use-kexec2-syscall", 0, 0, OPT_USE_KEXEC2_SYSCALL }, \
>   { "debug",  0, 0, OPT_DEBUG }, \
>  
>  #define KEXEC_OPT_STR "h?vdfxluet:p"

This line,
#define KEXEC_OPT_STR "h?vdfxluet:p"

should be something like,
#define KEXEC_OPT_STR "h?vdfxluet:ps"

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 06/13] kexec: New syscall kexec_file_load() declaration

2014-06-05 Thread WANG Chao
On 06/03/14 at 09:06am, Vivek Goyal wrote:
> This is the new syscall kexec_file_load() declaration/interface. I have
> reserved the syscall number only for x86_64 so far. Other architectures
> (including i386) can reserve syscall number when they enable the support
> for this new syscall.

Hi, Vivek

I have a comment below about the kexec_file_load args.

[..]
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index c435c5f..a3044e6 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1098,6 +1098,13 @@ COMPAT_SYSCALL_DEFINE4(kexec_load, compat_ulong_t, 
> entry,
>  }
>  #endif
>  
> +SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd,
> + const char __user *, cmdline_ptr, unsigned long,
> + cmdline_len, unsigned long, flags)

initrd is optional for system boot.

How about using int *kernel_fd and int *initrd_fd as the argument? Then
if I don't need initrd, in userspace I can do this:

kexec_file_load(&kernel_fd, NULL, ...)

And even you can remove KEXEC_FILE_UNLOAD flag, because you could tell
that one wants to unload if the following is invoked:

kexec_file_load(NULL, NULL, ...)

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] vmcore: copy fractional pages into buffers in the kdump 2nd kernel

2013-12-11 Thread WANG Chao
On 12/09/13 at 05:06pm, HATAYAMA Daisuke wrote:
> This is a patch for fixing mmap failure due to fractional page issue.
> 
> This patch might be still a bit too large as a single patch and might need to 
> split.
> If you think patch refactoring is needed, please suggest.
> 
> Change Log:
> 
> v1 => v2)
> 
> - Copy fractional pages from 1st kernel to 2nd kernel to reduce read
>   to the fractional pages for reliability.
> 
> - Deal with the case where multiple System RAM areas are contained in
>   a single fractional page.
> 
> Test:
> 
> Tested on X86_64. Fractional pages are created using memmap= kernel
> parameter on the kdump 1st kernel.
> 
> From fd6b0aca54caf7f0b5fd3841ef9e5ff081121ab8 Mon Sep 17 00:00:00 2001
> From: HATAYAMA Daisuke 
> Date: Mon, 9 Dec 2013 09:12:32 +0900
> Subject: [PATCH] vmcore: copy fractional pages into buffers in the kdump 2nd 
> kernel
> 
> As Vivek reported in https://lkml.org/lkml/2013/11/13/439, in real
> world there's platform that allocates System RAM area and Reserved
> area in a single same page. As a result, mmap fails at sanity check
> that comapres memory cache types in a given range, causing user-land
> tools to exit abnormally in the middle of crash dumping.
> 
> Although in the current case the data in Reserved area is ACPI data,
> in general, arbitrary data can possibly be located in a single page
> together with System RAM area. If they are, for example, mmio, read or
> write to the area could affect the corresponding devices and so a
> whole system. We should avoid doing such operations as much as
> possible in order to keep reliability.
> 
> To address this issue, we copy fractional pages into buffers in the
> kdump 2nd kernel, and then read data on the fractional pages from the
> buffers in the kdump 2nd kernel, not from the fractional pages on the
> kdump 1st kernel. Similarly, we mmap data on the buffers on the 2nd
> kernel, not on the 1st kernel. These are done just as we've already
> done for ELF note segments.
> 
> Rigorously, we should avoid even mapping pages containing non-System
> RAM area since mapping could cause some platform specific optimization
> that could then lead to some kind of prefetch to the page. However, as
> long as trying to read the System RAM area in the page, we cannot
> avoid mapping the page. Therefore, reliable possible way is to supress
> the number of times of reading the fractional pages to just once by
> buffering System RAM part of the fractional page in the 2nd kerenel.
> 
> To implement this, extend vmcore structure to represent object in
> buffer on the 2nd kernel, i.e., introducing VMCORE_2ND_KERNEL flag;
> for a vmcore object, if it has VMCORE_2ND_KERNEL set, then its data is
> on the buffer on the 2nd kernel that is pointed to by ->buf member.
> 
> Only non-trivial case is where multiple System RAM areas are contained
> in a single page. I want to think there's unlikely to be such system,
> but the issue addressed here is already odd enough, so we should
> consider there would be likely enough to be.
> 
> Reported-by: Vivek Goyal 
> Signed-off-by: HATAYAMA Daisuke 

Hi, HATAMAYA Daisuke

Thanks for the fix.

My workstation has been experiencing the same issue. It has a fractional
page, one part contains system ram and the other contains ACPI data:

# cat /proc/iomem
[..]
0010-bfdffbff : System RAM
  0100-0167b6a5 : Kernel code
  0167b6a6-01d06cbf : Kernel data
  01e6d000-01feafff : Kernel bss
  bb00-bf7f : Crash kernel
bfdffc00-bfe53bff : ACPI Non-volatile Storage

I apply your patch on top of 3.13-rc3. And makedumpfile can successfully
extract dump with mmap().

Thanks,
WANG Chao


> ---
>  fs/proc/vmcore.c  | 271 
> +-
>  include/linux/kcore.h |   4 +
>  2 files changed, 229 insertions(+), 46 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 9100d69..ca79120 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -231,11 +231,20 @@ static ssize_t __read_vmcore(char *buffer, size_t 
> buflen, loff_t *fpos,
>  
>   list_for_each_entry(m, &vmcore_list, list) {
>   if (*fpos < m->offset + m->size) {
> - tsz = min_t(size_t, m->offset + m->size - *fpos, 
> buflen);
> - start = m->paddr + *fpos - m->offset;
> - tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
> - if (tmp < 0)
> - return tmp;
> + tsz = min_t(size_t, m->offset+m->size-*fpos, buflen);
> + if ((m->flags & VMCORE_2ND_KERNEL)) {
> +  

Re: [PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try below 4G, then MAXMEM

2013-10-24 Thread WANG Chao
On 10/23/13 at 11:11pm, Yinghai Lu wrote:
> On Mon, Oct 21, 2013 at 8:16 AM, Vivek Goyal  wrote:
> > On Fri, Oct 18, 2013 at 10:45:43PM -0700, Yinghai Lu wrote:
> >
> >
> > IIUC, you are trying to say that with new kernel old kexec-tools will fail
> > at a different failure point. I don't see why that is a problem. It still
> > fails.
> 
> Yes, that could cause confusion.
> 
> We already knew it would fail possible at most later, we should make
> it skip allocation during first kernel booting.

It's said:

crashkernel=size[KMG][@offset[KMG]]
[KNL] Using kexec, Linux can switch to a 'crash kernel'
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
is selected automatically. Check
Documentation/kdump/kdump.txt for further details.

Nothing mentions that crashkernel=X is restricted to reserve at low.

It could be more confusing that crashkernel=X (say 1G) can not find a
suitable area while actually there's enough memory out there.

Current situation is the combination of crashkernel X/high/low is quite
complicated, it's more like supposed to be used by the pros.

Truth is, most user, after upgrading to new kernel, would want reserve
large by just changing to crashkernel=1G, not be forced to use a new
format craskernel=1G,high.

I think crashkernel=XM,high is really supposed to be used when user indeed
want to reserve from high.

Like Vivek said, failing at different point shouldn't be a problem.
That's an incorrect configuration. When crashkernel=1G,high, old
kexec-tools still fails the same way. That could cause confusion, in
your word.

Let me put it in an example, a user want to utilize this new kernel
feature to reserve 1G for crash kernel but not upgrade kexec-tools,

- W/o this patch:
 First he would try crashkernel=1G, but failed to reserve. Second time,
 he goes with crashkernel=1G,high, reservation is fine but kexec fails
 to load. Upgrading kexec-tools is essential to him.

- W/ this patch:
  First he would try crashkernel=1G, reservation is ok but kexec fails
  to load the same way as the case of crashkernel=1G,high. Upgrading
  kexec-tools is essential to him.

The point is old kexec-tools can't load high, no matter by what kind of
crashkernel cmdline to reserve at high.

> 
> >
> > [..]
> >> > You are not thinking about ease of use here for existing users.
> >>
> >> most existing user don't need to do anything. just with new kernel and
> >> old kexec tools.
> >>
> >> those system that did not work kexec before because XM is too big, they 
> >> have to
> >> update kexec tools, and use ",high"
> >>
> >> Make it simple, less error.
> >
> > No, it is not that simple. Think from a distribution's perspective also.
> > We have the logic to scale reserved memory based on physical memory
> > present in the system. Now we are seeing bigger memory systems (which
> > would not have worked in the past). We still want to retain the existing
> > logic and not switch to crashkernel=x,high. One does not have to. It
> > makes life simpler.
> 
> distribution should go with ",high" for 64 bit kernel and new kexec-tools.
> for 32bit kernel, you still can have ",high" or not, as ",high" is ignored.

Why ",high" is becoming a default one instead of "XM"?

Isn't that the kernel's job to find a most suitable area at best guess
in this case? By this case, I mean kernel have no idea kexec-tools is
new or old.

> 
> 
> >
> > Same logic working both with smaller memory systems as well as large memory
> > systems. One should not have to choose a different command line because
> > there is more physical RAM present in the system.
> 
> ",high" is working even on smaller memory sytem.

Yeah, you're right.

> 
> >
> >>
> >> We already support above 4G, what is point for trying below 4G?
> >
> > Because it is not *required* to reserve memory above 4G. Because we want
> > same command line to work with both small memory systems as well as
> > large memory systems and we don't care whether memory is reserved below
> > 4G or above 4G. What does matter though that we don't have to worry about
> > switching command line option if it is large memory system.
> 
> ",high" will work smaller or large memory system after you install new
> kexec tools.
> 
> Again, for distribution, wh

[PATCH] mm, slub: do not add duplicate sysfs

2014-08-27 Thread WANG Chao
Mergeable slab can be changed to unmergeable after tuning its sysfs
interface, for example echo 1 > trace. But the sysfs kobject with the unique
name will be still there.

When creating a new mergeable slab, the following warning will happen:

(hello.ko is a trivial module to simply create a mergeable slab)

[  408.915029] [ cut here ]
[  408.919641] WARNING: CPU: 3 PID: 2766 at fs/sysfs/dir.c:31 
sysfs_warn_dup+0x64/0x80()
[  408.927449] sysfs: cannot create duplicate filename '/kernel/slab/:t-048'
[  408.934563] Modules linked in: hello(O+) ipt_MASQUERADE iptable_nat 
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack 
xt_CHECKSUM iptable_mangle tun bridge stpi
[  408.980823] CPU: 3 PID: 2766 Comm: modprobe Tainted: G   O 
3.17.0-rc1 #22
[  408.988981] Hardware name: Dell Inc. OptiPlex 760 /0M860N, BIOS A12 
05/23/2011
[  408.997571]  0009 8801a053ba40 816c560e 
8801a053ba88
[  409.004994]  8801a053ba78 810bbb2d 8800b6b6d000 
8800b65bc580
[  409.012414]  8801a04260f0  880035d58b78 
8801a053bad8
[  409.019839] Call Trace:
[  409.022290]  [] dump_stack+0x45/0x56
[  409.027418]  [] warn_slowpath_common+0x7d/0xa0
[  409.033415]  [] warn_slowpath_fmt+0x4c/0x50
[  409.039156]  [] ? kernfs_path+0x48/0x60
[  409.044546]  [] sysfs_warn_dup+0x64/0x80
[  409.050027]  [] sysfs_create_dir_ns+0x8e/0xa0
[  409.055938]  [] kobject_add_internal+0xbf/0x3f0
[  409.062019]  [] kobject_init_and_add+0x60/0x80
[  409.068016]  [] ? sysfs_slab_add+0x146/0x200
[  409.073845]  [] sysfs_slab_add+0x81/0x200
[  409.079409]  [] __kmem_cache_create+0x51b/0x860
[  409.085494]  [] ? 0xc0094000
[  409.090627]  [] ? printk+0x67/0x69
[  409.095584]  [] ? kmem_cache_alloc+0x1c2/0x1f0
[  409.101581]  [] ? do_kmem_cache_create+0x3b/0xf0
[  409.107752]  [] do_kmem_cache_create+0xab/0xf0
[  409.113749]  [] kmem_cache_create+0x1b2/0x2a0
[  409.119661]  [] ? kasprintf+0x3e/0x40
[  409.124881]  [] ? 0xc0094000
[  409.130019]  [] init_hello+0x3a/0x1000 [hello]
[  409.136019]  [] do_one_initcall+0xbc/0x1f0
[  409.141671]  [] ? __vunmap+0xb2/0x100
[  409.146891]  [] load_module+0x1e4e/0x25e0
[  409.152455]  [] ? store_uevent+0x40/0x40
[  409.157934]  [] ? copy_module_from_fd.isra.47+0x121/0x180
[  409.164884]  [] SyS_finit_module+0x86/0xb0
[  409.170539]  [] system_call_fastpath+0x16/0x1b
[  409.176533] ---[ end trace c8eef8076cd27e36 ]---

Now if a unique is taken, we suffix it with an index, for example,
/sys/kernel/slab/:t-048 is already there, but not mergeable. We create
another unique name with index suffix, /sys/kernel/slab/:t-048-1, if
this one is taken too, we increase the index value each time, :t-048-2,
:t-048-3 ... until we find one.

Signed-off-by: WANG Chao 
---
 mm/slub.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc..8b4944e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5099,9 +5099,9 @@ static inline struct kset *cache_kset(struct kmem_cache 
*s)
 
 /* Create a unique string id for a slab cache:
  *
- * Format  :[flags-]size
+ * Format  :[flags-]size[-index]
  */
-static char *create_unique_id(struct kmem_cache *s)
+static char *__create_unique_id(struct kmem_cache *s, int index)
 {
char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);
char *p = name;
@@ -5127,11 +5127,41 @@ static char *create_unique_id(struct kmem_cache *s)
if (p != name + 1)
*p++ = '-';
p += sprintf(p, "%07d", s->size);
+   if (index)
+   p += sprintf(p, "-%d", index);
 
BUG_ON(p > name + ID_STR_LENGTH - 1);
return name;
 }
 
+static char *create_unique_id(struct kmem_cache *s)
+{
+   char *name;
+   struct kmem_cache *k;
+   int index, unique;
+
+
+   for (index = 0, unique = 0; !unique; index++) {
+   name = __create_unique_id(s, index);
+   unique = 1;
+
+   /*
+* Walk through slab_caches to see if name is taken.
+* It happens when mergeables becomes unmergeables.
+*/
+   list_for_each_entry(k, &slab_caches, list) {
+   if (!k->kobj.name)
+   continue;
+
+   if (!strcmp(k->kobj.name, name)) {
+   unique = 0;
+   break;
+   }
+   }
+   }
+   return name;
+}
+
 static int sysfs_slab_add(struct kmem_cache *s)
 {
int err;
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, slub: do not add duplicate sysfs

2014-08-27 Thread WANG Chao
On 08/27/14 at 10:25am, Christoph Lameter wrote:
> On Wed, 27 Aug 2014, WANG Chao wrote:
> 
> > Mergeable slab can be changed to unmergeable after tuning its sysfs
> > interface, for example echo 1 > trace. But the sysfs kobject with the unique
> > name will be still there.
> 
> Hmmm... Merging should be switched off if any debugging features are
> enabled. Maybe we need to disable modifying debug options for an active
> cache? This could cause other issues as well since the debug options will
> then apply to multiple caches.

Yes, currently merging is already switched off if there's any debug flag.

It sounds a bit overkill to me to disable runtime configuration. I don't
know how many people out there would trace a mergeable (multiple)
caches. Well it sounds better if we give them the chance to that...

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, slub: do not add duplicate sysfs

2014-08-27 Thread WANG Chao
On 08/27/14 at 10:32am, Christoph Lameter wrote:
> Maybe something like this may be a proper fix:
> 
> Subject: slub: Disable tracing of mergeable slabs
> 
> Tracing of mergeable slabs is confusing since the objects
> of multiple slab caches will be traced. Moreover this creates
> a situation where a mergeable slab will become unmergeable.
> 
> If tracing is desired then it may be best to switch merging
> off for starters.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-08-08 11:52:30.039681592 -0500
> +++ linux/mm/slub.c   2014-08-27 10:30:16.508108726 -0500
> @@ -4604,6 +4604,14 @@ static ssize_t trace_show(struct kmem_ca
>  static ssize_t trace_store(struct kmem_cache *s, const char *buf,
>   size_t length)
>  {
> + /*
> +  * Tracing a merged cache is going to give confusing results
> +  * as well as cause other issues like converting a mergeable
> +  * cache into an umergeable one.
> +  */
> + if (s->refcount > 1)
> + return -EINVAL;
> +
>   s->flags &= ~SLAB_TRACE;
>   if (buf[0] == '1') {
>   s->flags &= ~__CMPXCHG_DOUBLE;

What about failslab_store()? SLAB_FAILSLAB is also a nomerge flag.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, slub: do not add duplicate sysfs

2014-08-28 Thread WANG Chao
On 08/28/14 at 09:47am, Christoph Lameter wrote:
> On Thu, 28 Aug 2014, WANG Chao wrote:
> 
> > What about failslab_store()? SLAB_FAILSLAB is also a nomerge flag.
> 
> 
> Subject: slub: Disable tracing and failslab for merged slabs
> 
> Tracing of mergeable slabs as well as uses of failslab are
> confusing since the objects of multiple slab caches will be
> affected. Moreover this creates a situation where a mergeable
> slab will become unmergeable.
> 
> If tracing or failslab testing is desired then it may be best to
> switch merging off for starters.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c  2014-08-08 11:52:30.039681592 -0500
> +++ linux/mm/slub.c   2014-08-28 09:45:58.748840392 -0500
> @@ -4604,6 +4604,14 @@ static ssize_t trace_show(struct kmem_ca
>  static ssize_t trace_store(struct kmem_cache *s, const char *buf,
>   size_t length)
>  {
> + /*
> +  * Tracing a merged cache is going to give confusing results
> +  * as well as cause other issues like converting a mergeable
> +  * cache into an umergeable one.
> +  */
> + if (s->refcount > 1)
> + return -EINVAL;
> +
>   s->flags &= ~SLAB_TRACE;
>   if (buf[0] == '1') {
>   s->flags &= ~__CMPXCHG_DOUBLE;
> @@ -4721,6 +4729,9 @@ static ssize_t failslab_show(struct kmem
>  static ssize_t failslab_store(struct kmem_cache *s, const char *buf,
>   size_t length)
>  {
> + if (s->refcount > 1)
> + return -EINVAL;
> +
>   s->flags &= ~SLAB_FAILSLAB;
>   if (buf[0] == '1')
>   s->flags |= SLAB_FAILSLAB;

This works for me. Thanks for the fix.

WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] kernel, add panic_on_warn

2014-11-04 Thread WANG Chao
On 11/04/14 at 10:41am, Prarit Bhargava wrote:
> There have been several times where I have had to rebuild a kernel to
> cause a panic when hitting a WARN() in the code in order to get a crash
> dump from a system.  Sometimes this is easy to do, other times (such as
> in the case of a remote admin) it is not trivial to send new images to the
> user.
> 
> A much easier method would be a switch to change the WARN() over to a
> panic.  This makes debugging easier in that I can now test the actual
> image the WARN() was seen on and I do not have to engage in remote
> debugging.
> 
> This patch adds a panic_on_warn kernel parameter and
> /proc/sys/kernel/panic_on_warn calls panic() in the warn_slowpath_common()
> path.  The function will still print out the location of the warning.
> 
> An example of the panic_on_warn output:
> 
> The first line below is from the WARN_ON() to output the WARN_ON()'s location.
> After that the panic() output is displayed.
> 
> WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 
> init_dummy+0x1f/0x30 [dummy_module]()
> Kernel panic - not syncing: panic_on_warn set ...
> 
> CPU: 30 PID: 11698 Comm: insmod Tainted: GW  OE  3.17.0+ #57
> Hardware name: Intel Corporation S2600CP/S2600CP, BIOS 
> RMLSDP.86I.00.29.D696.131329 11/11/2013
>   8e3f87df 88080f093c38 81665190
>   818aea3d 88080f093cb8 8165e2ec
>  0008 88080f093cc8 88080f093c68 8e3f87df
> Call Trace:
>  [] dump_stack+0x46/0x58
>  [] panic+0xd0/0x204
>  [] ? init_dummy+0x1f/0x30 [dummy_module]
>  [] warn_slowpath_common+0xd0/0xd0
>  [] ? dummy_greetings+0x40/0x40 [dummy_module]
>  [] warn_slowpath_null+0x1a/0x20
>  [] init_dummy+0x1f/0x30 [dummy_module]
>  [] do_one_initcall+0xd4/0x210
>  [] ? __vunmap+0xc2/0x110
>  [] load_module+0x16a9/0x1b30
>  [] ? store_uevent+0x70/0x70
>  [] ? copy_module_from_fd.isra.44+0x129/0x180
>  [] SyS_finit_module+0xa6/0xd0
>  [] system_call_fastpath+0x12/0x17
> 
> Successfully tested by me.
> 
> Cc: Jonathan Corbet 
> Cc: Andrew Morton 
> Cc: Rusty Russell 
> Cc: "H. Peter Anvin" 
> Cc: Andi Kleen 
> Cc: Masami Hiramatsu 
> Cc: Fabian Frederick 
> Cc: vgo...@redhat.com
> Cc: isimatu.yasu...@jp.fujitsu.com
> Cc: jba...@akamai.com
> Cc: linux-...@vger.kernel.org
> Cc: ke...@lists.infradead.org
> Cc: linux-...@vger.kernel.org
> Signed-off-by: Prarit Bhargava 
> 
> [v2]: add /proc/sys/kernel/panic_on_warn, additional documentation, modify
>   !slowpath cases
> [v3]: use proc_dointvec_minmax() in sysctl handler
> [v4]: remove !slowpath cases, and add __read_mostly
> [v5]: change to panic_on_warn, re-alphabetize Documentation/sysctl/kernel.txt
> [v6]: disable on kdump kernel to avoid bogus panicks.
> [v7]: swithch to core param, and remove change from v6

This looks good to me.

Acked-by: WANG Chao 

> ---
>  Documentation/kdump/kdump.txt   |7 ++
>  Documentation/kernel-parameters.txt |3 +++
>  Documentation/sysctl/kernel.txt |   40 
> +++
>  include/linux/kernel.h  |1 +
>  include/uapi/linux/sysctl.h |1 +
>  kernel/panic.c  |   15 -
>  kernel/sysctl.c |9 
>  kernel/sysctl_binary.c  |1 +
>  8 files changed, 62 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
> index 6c0b9f2..bc4bd5a 100644
> --- a/Documentation/kdump/kdump.txt
> +++ b/Documentation/kdump/kdump.txt
> @@ -471,6 +471,13 @@ format. Crash is available on Dave Anderson's site at 
> the following URL:
>  
> http://people.redhat.com/~anderson/
>  
> +Trigger Kdump on WARN()
> +===
> +
> +The kernel parameter, panic_on_warn, calls panic() in all WARN() paths.  This
> +will cause a kdump to occur at the panic() call.  In cases where a user wants
> +to specify this during runtime, /proc/sys/kernel/panic_on_warn can be set to 
> 1
> +to achieve the same behaviour.
>  
>  Contact
>  ===
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 4c81a86..ea5d57c 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2509,6 +2509,9 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   timeout < 0: reboot immediately
>   Format: 
>  
> + panic_on_warn   panic() instead of WARN().  Useful to cause kdump
> + on a WARN().
> +
&

Re: [PATCH resend] staging, lustre: fix a sparse error

2014-10-12 Thread WANG Chao
On 10/12/14 at 08:17pm, Al Viro wrote:
> On Fri, Oct 10, 2014 at 11:21:16AM +0800, WANG Chao wrote:
> 
> > I think __user annotation is for no dereferencing in kernel space. In
> > this case, I think it's fine to override this error by __force. Because
> > they're pointers with identical target types.
> 
> Umm...  The real question seems to be whether iovec is the right type in
> the first place.  Does ->tx_iov ever contain a vector with _userland_
> pointers?  If not, it ought to be struct kvec * instead.

I'm afraid I'm not familiar with lustre.

[CCing lustre maintainers]

Oleg and Andreas:

Could you answer the Al's question?

For your information, my original post was here:
https://lkml.org/lkml/2014/10/9/135

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] arch/x86/purgatory/Makefile: supress kexec-purgatory.c is up to date message

2014-10-13 Thread WANG Chao
Supress this unnecessary message during kernel re-build
(CONFIG_KEXEC_FILE=y):

make[1]: `arch/x86/purgatory/kexec-purgatory.c' is up to date.

Signed-off-by: WANG Chao 
---
 arch/x86/purgatory/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index f52e033..2c835e3 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -24,6 +24,7 @@ quiet_cmd_bin2c = BIN2C   $@
 
 $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
$(call if_changed,bin2c)
+   @:
 
 
 obj-$(CONFIG_KEXEC_FILE)   += kexec-purgatory.o
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] staging, unisys: remove leftover function kmalloc_kernel()

2014-10-14 Thread WANG Chao
kmalloc_kernel() previously declared in timskmodutils.h which recently got
removed. Now also remove kmalloc_kernel(), because it's not used anywhere.

Signed-off-by: WANG Chao 
---
 drivers/staging/unisys/visorutil/visorkmodutils.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/drivers/staging/unisys/visorutil/visorkmodutils.c 
b/drivers/staging/unisys/visorutil/visorkmodutils.c
index d6815f9..64b135d 100644
--- a/drivers/staging/unisys/visorutil/visorkmodutils.c
+++ b/drivers/staging/unisys/visorutil/visorkmodutils.c
@@ -36,16 +36,6 @@
 int unisys_spar_platform;
 EXPORT_SYMBOL_GPL(unisys_spar_platform);
 
-/** Callers to interfaces that set __GFP_NORETRY flag below
- *  must check for a NULL (error) result as we are telling the
- *  kernel interface that it is okay to fail.
- */
-
-void *kmalloc_kernel(size_t siz)
-{
-   return kmalloc(siz, GFP_KERNEL | __GFP_NORETRY);
-}
-
 static __init uint32_t
 visorutil_spar_detect(void)
 {
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] arch/x86/purgatory/Makefile: supress kexec-purgatory.c is up to date message

2014-10-14 Thread WANG Chao
On 10/14/14 at 05:52pm, Vivek Goyal wrote:
> On Tue, Oct 14, 2014 at 12:46:58PM +0800, WANG Chao wrote:
> > Supress this unnecessary message during kernel re-build
> > (CONFIG_KEXEC_FILE=y):
> > 
> > make[1]: `arch/x86/purgatory/kexec-purgatory.c' is up to date.
> > 
> > Signed-off-by: WANG Chao 
> > ---
> >  arch/x86/purgatory/Makefile | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
> > index f52e033..2c835e3 100644
> > --- a/arch/x86/purgatory/Makefile
> > +++ b/arch/x86/purgatory/Makefile
> > @@ -24,6 +24,7 @@ quiet_cmd_bin2c = BIN2C   $@
> >  
> >  $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
> > $(call if_changed,bin2c)
> > +   @:
> 
> Where can I find the description of "@:" and how does it work?

@ is used to suppress echo'ing. See:
https://www.gnu.org/software/make/manual/html_node/Echoing.html

This is a trick to avoid "... is up to date" message". Check out commit
a9358bc ("x86/build: Supress realmode.bin is up to date message").

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] perf test: fix typo in python test

2014-11-12 Thread WANG Chao
Library loading in python syntax should be 'import perf', not 'use perf'.

Signed-off-by: WANG Chao 
---
 tools/perf/tests/builtin-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 162c978..4b7d9ab 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -85,7 +85,7 @@ static struct test {
.func = test__hists_link,
},
{
-   .desc = "Try 'use perf' in python, checking link problems",
+   .desc = "Try 'import perf' in python, checking link problems",
.func = test__python_use,
},
{
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] Fix kdump failures with crashkernel=high

2014-12-02 Thread WANG Chao
> Hi,
> 
> here is a patch-set to fix failed kdump kernel boots when
> the systems was booted with crashkernel=X,high. On those
> systems the kernel allocates only 72MiB of low-memory for
> DMA buffers, which showed to be too low on some systems.

>From your experience It seems like swiotlb isn't working well with
crashkernel=X,high alone. What about using crashkernel=X,low with
crashkernel=X,high? Is there any reason you have to use
crashkernel=X,high alone?

> 
> The problem is that 64MiB of the low-memory is allocated by
> swiotlb, leaving 8MB for the page-allocator. But swiotlb
> tries to allocate DMA memory from the page-allocator first,
> which fails pretty fast in the boot sequence, causing
> warnings. This patch-set removes these warnings.
> 
> But even the 64MiB for swiotlb are eaten up on some systems,
> so that the default of low-memory allocated for the
> crash-kernel is increase from 72MB to 256MB (only changing
> the defaults, can still be overwritten by crashkernel=X,low).

crashkernel=X,high shouldn't automatically reserve 72M low at the first
place. Now it's going insane if you increase it to 256M by default.

Thanks
WANG Chao

> This number comes from experiments on the affected systems,
> 128MiB low-memory was still not enough there, thus I set the
> value to 256MiB to fix the issues.
> 
> Any feedback appreciated.
> 
> Thanks,
> 
>   Joerg
> 
> Joerg Roedel (3):
>   swiotlb: Warn on allocation failure in swiotlb_alloc_coherent
>   x86, swiotlb: Try coherent allocations with __GFP_NOWARN
>   x86, crash: Allocate enough low-mem when crashkernel=high
> 
>  arch/x86/kernel/pci-swiotlb.c |  8 
>  arch/x86/kernel/setup.c   |  5 -
>  lib/swiotlb.c | 11 +--
>  3 files changed, 21 insertions(+), 3 deletions(-)
> 
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/3] Fix kdump failures with crashkernel=high

2014-12-03 Thread WANG Chao
Hi,

On 12/03/14 at 11:35am, Joerg Roedel wrote:
> Hi,
> 
> On Wed, Dec 03, 2014 at 12:01:23PM +0800, WANG Chao wrote:
> >From your experience It seems like swiotlb isn't working well with
> > crashkernel=X,high alone. What about using crashkernel=X,low with
> > crashkernel=X,high? Is there any reason you have to use
> > crashkernel=X,high alone?
> 
> Sure, when I specify an additional crashkernel=X,low then I can get
> things to work this way too. But my patch-set is about changing the
> default, since the failure was seen on common server systems with the
> defaults.

crashkernel=X doesn't work for you?

> 
> > crashkernel=X,high shouldn't automatically reserve 72M low at the first
> > place. Now it's going insane if you increase it to 256M by default.
> 
> How should a kernel without some low memory (which has only memory above
> 4G available) handle any 32bit DMA devices? There would be no way to
> allocate DMA-able memory for those devices.

I mean crashkernel=X,high shouldn't automatically reserve low.
crashkernel=X,high should always be used together with
crashkernel=X,low.

If one is not satisfied with the combination of two parameters,
crashkernel=X should be able to look for a suitable area below 4G. But
right now, crashkernel=X only deals under 896M.

I've sent a patch for this in the past:
[PATCH] x86, kdump: crashkernel=X try to reserve below 896M first, then try 
below 4G, then MAXMEM
- https://lkml.org/lkml/2013/10/14/183

X86 people don't like this idea so I didn't update the patch even
there's minor nit to clean up.

> 
> And as I said, if people prefer it I can change the patch-set so that
> the amount of low-memory allocated is subtracted from the amount of
> high-memory. This way the overall memory usage for kdump would stay the
> same while changing the defaults to work on more systems.

Say you do this, crashkernel=X,high would be reserve (X-n) high and (n)
low? IMHO I think it's better crashkernel=X,high means reserve X high
and high only.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Documentation/ABI/testing/sysfs-ibft: fix a typo

2014-10-17 Thread WANG Chao
Correct a sentence in Documentation/ABI/testing/sysfs-ibft.

Signed-off-by: WANG Chao 
---
 Documentation/ABI/testing/sysfs-ibft | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-ibft 
b/Documentation/ABI/testing/sysfs-ibft
index c2b7d11..cac3930 100644
--- a/Documentation/ABI/testing/sysfs-ibft
+++ b/Documentation/ABI/testing/sysfs-ibft
@@ -20,4 +20,4 @@ Date: November 2007
 Contact:   Konrad Rzeszutek 
 Description:   The /sys/firmware/ibft/ethernetX directory will contain
files that expose the iSCSI Boot Firmware Table NIC data.
-   This can this can the IP address, MAC, and gateway of the NIC.
+   Usually this contains the IP address, MAC, and gateway of the 
NIC.
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] iscsi_ibft: convert printk() to pr_foo()

2014-10-17 Thread WANG Chao
Switch to pr_foo() and a common prefix ("iscsi_ibft: ") to each printk
message it generates. The printk message used to break down into multiple
lines, now squash these into a single string.

Along with this patch, use multiple MODULE_AUTHOR() marco as
recommended.

Signed-off-by: WANG Chao 
---
 drivers/firmware/iscsi_ibft.c | 30 ++
 1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/drivers/firmware/iscsi_ibft.c b/drivers/firmware/iscsi_ibft.c
index 071c2c9..5f9b307 100644
--- a/drivers/firmware/iscsi_ibft.c
+++ b/drivers/firmware/iscsi_ibft.c
@@ -67,6 +67,8 @@
  */
 
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
@@ -87,8 +89,8 @@
 #define IBFT_ISCSI_VERSION "0.5.0"
 #define IBFT_ISCSI_DATE "2010-Feb-25"
 
-MODULE_AUTHOR("Peter Jones  and "
- "Konrad Rzeszutek ");
+MODULE_AUTHOR("Peter Jones ");
+MODULE_AUTHOR("Konrad Rzeszutek ");
 MODULE_DESCRIPTION("sysfs interface to BIOS iBFT information");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(IBFT_ISCSI_VERSION);
@@ -223,15 +225,13 @@ static ssize_t sprintf_string(char *str, int len, char 
*buf)
 static int ibft_verify_hdr(char *t, struct ibft_hdr *hdr, int id, int length)
 {
if (hdr->id != id) {
-   printk(KERN_ERR "iBFT error: We expected the %s " \
-   "field header.id to have %d but " \
-   "found %d instead!\n", t, id, hdr->id);
+   pr_err("iBFT error: We expected the %s field header.id to have 
%d but found %d instead!\n",
+   t, id, hdr->id);
return -ENODEV;
}
if (hdr->length != length) {
-   printk(KERN_ERR "iBFT error: We expected the %s " \
-   "field header.length to have %d but " \
-   "found %d instead!\n", t, length, hdr->length);
+   pr_err("iBFT error: We expected the %s field header.length to 
have %d but found %d instead!\n",
+   t, length, hdr->length);
return -ENODEV;
}
 
@@ -413,8 +413,7 @@ static int __init ibft_check_device(void)
 
/* Sanity checking of iBFT. */
if (ibft_addr->header.revision != 1) {
-   printk(KERN_ERR "iBFT module supports only revision 1, " \
-   "while this is %d.\n",
+   pr_err("iBFT module supports only revision 1, while this is 
%d.\n",
ibft_addr->header.revision);
return -ENOENT;
}
@@ -422,7 +421,7 @@ static int __init ibft_check_device(void)
csum += *pos;
 
if (csum) {
-   printk(KERN_ERR "iBFT has incorrect checksum (0x%x)!\n", csum);
+   pr_err("iBFT has incorrect checksum (0x%x)!\n", csum);
return -ENOENT;
}
 
@@ -646,9 +645,8 @@ static int __init ibft_create_kobject(struct 
acpi_table_ibft *header,
rc = 1;
break;
default:
-   printk(KERN_ERR "iBFT has unknown structure type (%d). " \
-   "Report this bug to %.6s!\n", hdr->id,
-   header->header.oem_id);
+   pr_err("iBFT has unknown structure type (%d). Report this bug 
to %.6s!\n",
+   hdr->id, header->header.oem_id);
rc = 1;
break;
}
@@ -703,7 +701,7 @@ static int __init ibft_register_kobjects(struct 
acpi_table_ibft *header)
/* iBFT table safety checking */
rc |= ((control->hdr.index) ? -ENODEV : 0);
if (rc) {
-   printk(KERN_ERR "iBFT error: Control header is invalid!\n");
+   pr_err("iBFT error: Control header is invalid!\n");
return rc;
}
for (ptr = &control->initiator_off; ptr < end; ptr += sizeof(u16)) {
@@ -809,7 +807,7 @@ static int __init ibft_init(void)
if (rc)
goto out_free;
} else
-   printk(KERN_INFO "No iBFT detected.\n");
+   pr_info("No iBFT detected.\n");
 
return 0;
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH resend] staging, lustre: fix a sparse error

2014-10-09 Thread WANG Chao
This fixes the following sparse error:

drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c:393:9: error:
incompatible types in comparison expression (different address spaces)

Signed-off-by: WANG Chao 
---
 drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c 
b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
index 245c9d7..1510594 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
@@ -390,7 +390,7 @@ ksocknal_lib_csum_tx(ksock_tx_t *tx)
__u32   csum;
void*base;
 
-   LASSERT(tx->tx_iov[0].iov_base == (void *)&tx->tx_msg);
+   LASSERT((void __force *) tx->tx_iov[0].iov_base == (void *)&tx->tx_msg);
LASSERT(tx->tx_conn != NULL);
LASSERT(tx->tx_conn->ksnc_proto == &ksocknal_protocol_v2x);
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH resend] staging, lustre: fix a sparse error

2014-10-09 Thread WANG Chao
On 10/09/14 at 05:58pm, Sudip Mukherjee wrote:
> On Thu, Oct 09, 2014 at 06:25:10PM +0800, WANG Chao wrote:
> > This fixes the following sparse error:
> > 
> > drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c:393:9: error:
> > incompatible types in comparison expression (different address spaces)
> > 
> > Signed-off-by: WANG Chao 
> > ---
> >  drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c 
> > b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
> > index 245c9d7..1510594 100644
> > --- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
> > +++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd_lib-linux.c
> > @@ -390,7 +390,7 @@ ksocknal_lib_csum_tx(ksock_tx_t *tx)
> > __u32   csum;
> > void*base;
> >  
> > -   LASSERT(tx->tx_iov[0].iov_base == (void *)&tx->tx_msg);
> > +   LASSERT((void __force *) tx->tx_iov[0].iov_base == (void *)&tx->tx_msg);
> is it ok? are you not just suppressing the sparse error?
> iov_base is void __user *  

I think __user annotation is for no dereferencing in kernel space. In
this case, I think it's fine to override this error by __force. Because
they're pointers with identical target types.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, e820: clean up around sanitize_e820_map()

2015-01-21 Thread WANG Chao
Hi, David

On 01/21/15 at 02:51pm, David Rientjes wrote:
> On Wed, 7 Jan 2015, WANG Chao wrote:
> 
> > The argument 3 of sanitize_e820_map() will only update upon a successful
> > sanitization. Some of the callers may not be aware of this in the past.
> > Now clean up these callers.
> > 
> > Signed-off-by: WANG Chao 
> 
> Nice, but it's incomplete: it should also cleanup 
> default_machine_specific_memory_setup().

default_machine_specific_memory_setup() is special. Because
boot_params.e820_entries is 8bit unsigned, and sanitize_e820_map() which
takes 32bit unsigned, will step on other member of boot_params next to
e820_entries.

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] mm, vmacache: Add kconfig VMACACHE_SHIFT

2015-01-21 Thread WANG Chao
Add a new kconfig option VMACACHE_SHIFT (as a power of 2) to specify the
number of slots vma cache has for each thread. Range is chosen 0-4 (1-16
slots) to consider both overhead and performance penalty. Default is 2
(4 slots) as it originally is, which provides good enough balance.

Signed-off-by: WANG Chao 
---
 include/linux/sched.h | 2 +-
 mm/Kconfig| 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..56fd96d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -134,7 +134,7 @@ struct perf_event_context;
 struct blk_plug;
 struct filename;
 
-#define VMACACHE_BITS 2
+#define VMACACHE_BITS CONFIG_VMACACHE_SHIFT
 #define VMACACHE_SIZE (1U << VMACACHE_BITS)
 #define VMACACHE_MASK (VMACACHE_SIZE - 1)
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 1d1ae6b..7b82a52 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -618,3 +618,10 @@ config MAX_STACK_SIZE_MB
  changed to a smaller value in which case that is used.
 
  A sane initial value is 80 MB.
+
+config VMACACHE_SHIFT
+   int "Number of slots in per-thread VMA cache (as a power of 2)"
+   range 0 4
+   default 2
+   help
+ This is the number of slots VMA cache has for each thread.
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] mm, vmacache: Add kconfig VMACACHE_SHIFT

2015-01-21 Thread WANG Chao
Hi, Davidlohr

On 01/21/15 at 11:46pm, Davidlohr Bueso wrote:
> On Thu, 2015-01-22 at 14:29 +0800, WANG Chao wrote:
> > Add a new kconfig option VMACACHE_SHIFT (as a power of 2) to specify the
> > number of slots vma cache has for each thread. Range is chosen 0-4 (1-16
> > slots) to consider both overhead and performance penalty. Default is 2
> > (4 slots) as it originally is, which provides good enough balance.
> > 
> 
> Nack. I don't feel comfortable making scalability features of core code
> configurable.

Out of respect, is this a general rule not making scalability features
of core code configurable?

Thanks
WANG Chao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] vmcore: fix PT_NOTE n_namesz, n_descsz overflow issue

2014-12-14 Thread WANG Chao
When updating PT_NOTE header size (ie. p_memsz), an overflow issue
happens with the following bogus note entry:

n_namesz = 0x
n_descsz = 0x0
n_type   = 0x0

This kind of note entry should be dropped during updating p_memsz. But
because n_namesz is 32bit, after (n_namesz + 3) & (~3), it's overflow to
0x0, the note entry size looks sane and reserved.

When user space (eg. crash utility) is trying to access such bogus note,
it could lead to an unexpected behavior (eg. crash utility segment
fault because it's reading bogus address).

The source of bogus note hasn't been identified yet. At least we could
drop the bogus note so user space wouldn't be surprised.

Signed-off-by: WANG Chao 
---
 fs/proc/vmcore.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a90d6d35..4e61388 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -546,8 +546,8 @@ static int __init update_note_header_size_elf64(const 
Elf64_Ehdr *ehdr_ptr)
nhdr_ptr = notes_section;
while (nhdr_ptr->n_namesz != 0) {
sz = sizeof(Elf64_Nhdr) +
-   ((nhdr_ptr->n_namesz + 3) & ~3) +
-   ((nhdr_ptr->n_descsz + 3) & ~3);
+   (((u64)nhdr_ptr->n_namesz + 3) & ~3) +
+   (((u64)nhdr_ptr->n_descsz + 3) & ~3);
if ((real_sz + sz) > max_sz) {
pr_warn("Warning: Exceeded p_memsz, dropping 
PT_NOTE entry n_namesz=0x%x, n_descsz=0x%x\n",
nhdr_ptr->n_namesz, nhdr_ptr->n_descsz);
@@ -732,8 +732,8 @@ static int __init update_note_header_size_elf32(const 
Elf32_Ehdr *ehdr_ptr)
nhdr_ptr = notes_section;
while (nhdr_ptr->n_namesz != 0) {
sz = sizeof(Elf32_Nhdr) +
-   ((nhdr_ptr->n_namesz + 3) & ~3) +
-   ((nhdr_ptr->n_descsz + 3) & ~3);
+   (((u64)nhdr_ptr->n_namesz + 3) & ~3) +
+   (((u64)nhdr_ptr->n_descsz + 3) & ~3);
if ((real_sz + sz) > max_sz) {
pr_warn("Warning: Exceeded p_memsz, dropping 
PT_NOTE entry n_namesz=0x%x, n_descsz=0x%x\n",
nhdr_ptr->n_namesz, nhdr_ptr->n_descsz);
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] treewide: replace RETPOLINE with CONFIG_RETPOLINE

2019-01-08 Thread WANG Chao
On 12/21/18 at 11:47P, Borislav Petkov wrote:
> On Fri, Dec 21, 2018 at 05:33:47PM +0800, WANG Chao wrote:
> > On 12/11/18 at 12:37P, WANG Chao wrote:
> > > Since commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend
> > > on compiler support"), RETPOLINE has been replaced by CONFIG_RETPOLINE.
> > > 
> > > Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on 
> > > compiler support")
> > > Signed-off-by: WANG Chao 
> > 
> > ping ...
> 
> This one somehow slipped through the cracks... :-\
> 
> Ok, let's see: this could be relevant for the case when a module built
> with a compiler which doesn't support retpolines, gets loaded on a
> system which is built with retpolines.
> 
> Which is pretty seldom as the majority of setups out there should have
> a retpoline-enabled compiler. And should not allow loading external
> modules anyway, but that's a different story.
> 
> So, this close to the merge window, I'm inclined to delay this one after
> it and have it take the x86/urgent fixes path. Unless someone screams
> and says that patch breaks his system. I haven't seen any reports about
> it so far so...

Hi, Boris.

Please pick up this one in x86/urgent.

Thanks
Chao


[PATCH v2] sched: unlikely corrupted stack end

2016-06-14 Thread WANG Chao
unlikely() was dropped in commit ce03e4137bb2 ("sched/core: Drop
unlikely behind BUG_ON()"), but commit 29d6455178a0 ("sched: panic on
corrupted stack end") dropped BUG_ON() and called panic directly.

Now we should bring unlikely() back for branch prediction. While we're
at it, it's better and cleaner to turn task_stack_end_corrupted() into
inline function.

Signed-off-by: WANG Chao 
---
 include/linux/sched.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..797ca1975431 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2997,8 +2997,11 @@ static inline unsigned long *end_of_stack(struct 
task_struct *p)
 }
 
 #endif
-#define task_stack_end_corrupted(task) \
-   (*(end_of_stack(task)) != STACK_END_MAGIC)
+
+static inline int task_stack_end_corrupted(struct task_struct *p)
+{
+   return unlikely(*(end_of_stack(p)) != STACK_END_MAGIC);
+}
 
 static inline int object_is_on_stack(void *obj)
 {
-- 
2.8.4



Re: [PATCH v2] sched: unlikely corrupted stack end

2016-06-14 Thread WANG Chao

> 在 2016年6月14日,下午4:56,Ingo Molnar  写道:
> 
> 
> * WANG Chao  wrote:
> 
>> unlikely() was dropped in commit ce03e4137bb2 ("sched/core: Drop
>> unlikely behind BUG_ON()"), but commit 29d6455178a0 ("sched: panic on
>> corrupted stack end") dropped BUG_ON() and called panic directly.
>> 
>> Now we should bring unlikely() back for branch prediction. While we're
>> at it, it's better and cleaner to turn task_stack_end_corrupted() into
>> inline function.
>> 
>> Signed-off-by: WANG Chao 
>> ---
>> include/linux/sched.h | 7 +--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6e42ada26345..797ca1975431 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2997,8 +2997,11 @@ static inline unsigned long *end_of_stack(struct 
>> task_struct *p)
>> }
>> 
>> #endif
>> -#define task_stack_end_corrupted(task) \
>> -(*(end_of_stack(task)) != STACK_END_MAGIC)
>> +
>> +static inline int task_stack_end_corrupted(struct task_struct *p)
>> +{
>> +return unlikely(*(end_of_stack(p)) != STACK_END_MAGIC);
>> +}
> 
> The passed in pointer should be const, and the extra parentheses around the 
> end_of_stack() call are not needed anymore (since it's now proper C code now).

end_of_stack() will discard const and cause an compiler warning.
Should I add const to end_of_stack()?


Re: [PATCH v2] sched: unlikely corrupted stack end

2016-06-14 Thread WANG Chao

> 在 2016年6月14日,下午6:26,Ingo Molnar  写道:
> 
> 
> * WANG Chao  wrote:
> 
>> 
>>> 在 2016年6月14日,下午4:56,Ingo Molnar  写道:
>>> 
>>> 
>>> * WANG Chao  wrote:
>>> 
>>>> unlikely() was dropped in commit ce03e4137bb2 ("sched/core: Drop
>>>> unlikely behind BUG_ON()"), but commit 29d6455178a0 ("sched: panic on
>>>> corrupted stack end") dropped BUG_ON() and called panic directly.
>>>> 
>>>> Now we should bring unlikely() back for branch prediction. While we're
>>>> at it, it's better and cleaner to turn task_stack_end_corrupted() into
>>>> inline function.
>>>> 
>>>> Signed-off-by: WANG Chao 
>>>> ---
>>>> include/linux/sched.h | 7 +--
>>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index 6e42ada26345..797ca1975431 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -2997,8 +2997,11 @@ static inline unsigned long *end_of_stack(struct 
>>>> task_struct *p)
>>>> }
>>>> 
>>>> #endif
>>>> -#define task_stack_end_corrupted(task) \
>>>> -  (*(end_of_stack(task)) != STACK_END_MAGIC)
>>>> +
>>>> +static inline int task_stack_end_corrupted(struct task_struct *p)
>>>> +{
>>>> +  return unlikely(*(end_of_stack(p)) != STACK_END_MAGIC);
>>>> +}
>>> 
>>> The passed in pointer should be const, and the extra parentheses around the 
>>> end_of_stack() call are not needed anymore (since it's now proper C code 
>>> now).
>> 
>> end_of_stack() will discard const and cause an compiler warning.
>> Should I add const to end_of_stack()?
> 
> Yes. Also make sure ia64 still builds and such.

It seems convert task_stack_end_corrupted() into inline isn’t trivial.

In ia64, end_of_stack() is expanded to:

(unsigned long *)((void *)(p) + ((IA64_TASK_SIZE + IA64_THREAD_INFO_SIZE + 31) 
& ~31))

IA64_TASK_SIZE and IA64_THREAD_INFO_SIZE is defined in 
arch/ia64/kernel/asm-offsets.c,
which needs to include linux/sched.h.

So the problem is task_stack_end_corrupted() doesn’t compile before 
asm-offsets.c is compiled.
asm-offsets.c also needs to include linux/sched.h to compile. I think maybe 
that’s why
task_stack_end_corrupted() is introduced as marco, not inline.

Any idea?

Thanks
WANG Chao



[PATCH v3] sched: unlikely corrupted stack end

2016-06-15 Thread WANG Chao
unlikely() was dropped in commit ce03e4137bb2 ("sched/core: Drop
unlikely behind BUG_ON()"), but commit 29d6455178a0 ("sched: panic on
corrupted stack end") dropped BUG_ON() and called panic directly.

Now we should bring unlikely() back for branch prediction. While we're
at it, it's better and cleaner to add unlikely() to
task_stack_end_corrupted() macro.

Signed-off-by: WANG Chao 
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada26345..74a02bf30827 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2998,7 +2998,7 @@ static inline unsigned long *end_of_stack(struct 
task_struct *p)
 
 #endif
 #define task_stack_end_corrupted(task) \
-   (*(end_of_stack(task)) != STACK_END_MAGIC)
+   (unlikely(*(end_of_stack(task)) != STACK_END_MAGIC))
 
 static inline int object_is_on_stack(void *obj)
 {
-- 
2.9.0



[PATCH] sched: unlikely corrupted stack end

2016-06-13 Thread WANG Chao
unlikely() was dropped in commit ce03e41 ("sched/core: Drop unlikely
behind BUG_ON()"), but commit 29d6455 ("sched: panic on corrupted stack
end") dropped BUG_ON() and called panic directly.

Now we should bring unlikely() back for branch prediction.

Signed-off-by: WANG Chao 
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 017d539..7db442c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3170,7 +3170,7 @@ static noinline void __schedule_bug(struct task_struct 
*prev)
 static inline void schedule_debug(struct task_struct *prev)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
-   if (task_stack_end_corrupted(prev))
+   if (unlikely(task_stack_end_corrupted(prev)))
panic("corrupted stack end detected inside scheduler\n");
 #endif
 
-- 
2.8.4



[PATCH] kbuild: add extra-y to targets-for-modules

2020-11-03 Thread WANG Chao
extra-y target doesn't build for 'make M=...' since commit 6212804f2d78
("kbuild: do not create built-in objects for external module builds").

This especially breaks kpatch, which is using 'extra-y := kpatch.lds'
and 'make M=...' to build livepatch patch module.

Add extra-y to targets-for-modules so that such kind of build works
properly.

Signed-off-by: WANG Chao 
---
 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index ae647379b579..0113a042d643 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -86,7 +86,7 @@ ifdef need-builtin
 targets-for-builtin += $(obj)/built-in.a
 endif
 
-targets-for-modules := $(patsubst %.o, %.mod, $(filter %.o, $(obj-m)))
+targets-for-modules := $(extra-y) $(patsubst %.o, %.mod, $(filter %.o, 
$(obj-m)))
 
 ifdef need-modorder
 targets-for-modules += $(obj)/modules.order
-- 
2.29.1



[PATCH] x86/kvm: move kvm_load/put_guest_xcr0 into atomic context

2019-04-12 Thread WANG Chao
guest xcr0 could leak into host when MCE happens in guest mode. Because
do_machine_check() could schedule out at a few places.

For example:

kvm_load_guest_xcr0
...
kvm_x86_ops->run(vcpu) {
  vmx_vcpu_run
vmx_complete_atomic_exit
  kvm_machine_check
do_machine_check
  do_memory_failure
memory_failure
  lock_page

In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).

In __switch_to {
 switch_fpu_finish
   copy_kernel_to_fpregs
 XRSTORS

If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.

Cc: sta...@vger.kernel.org
Signed-off-by: WANG Chao 
---
 arch/x86/kvm/svm.c |  2 ++
 arch/x86/kvm/vmx/vmx.c |  4 
 arch/x86/kvm/x86.c | 10 --
 arch/x86/kvm/x86.h |  2 ++
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e0a791c3d4fc..2bf73076de7f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -5621,6 +5621,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
svm->vmcb->save.cr2 = vcpu->arch.cr2;
 
clgi();
+   kvm_load_guest_xcr0(vcpu);
 
/*
 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
@@ -5766,6 +5767,7 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
kvm_before_interrupt(&svm->vcpu);
 
+   kvm_put_guest_xcr0(vcpu);
stgi();
 
/* Any pending NMI will happen here */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ab432a930ae8..3157598c52f1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6410,6 +6410,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
vmx_set_interrupt_shadow(vcpu, 0);
 
+   kvm_load_guest_xcr0(vcpu);
+
if (static_cpu_has(X86_FEATURE_PKU) &&
kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
vcpu->arch.pkru != vmx->host_pkru)
@@ -6506,6 +6508,8 @@ static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
__write_pkru(vmx->host_pkru);
}
 
+   kvm_put_guest_xcr0(vcpu);
+
vmx->nested.nested_run_pending = 0;
vmx->idt_vectoring_info = 0;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 099b851dabaf..22f66e9a7dc5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -800,7 +800,7 @@ void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
 }
 EXPORT_SYMBOL_GPL(kvm_lmsw);
 
-static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
+void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 {
if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
!vcpu->guest_xcr0_loaded) {
@@ -810,8 +810,9 @@ static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
vcpu->guest_xcr0_loaded = 1;
}
 }
+EXPORT_SYMBOL_GPL(kvm_load_guest_xcr0);
 
-static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
+void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
 {
if (vcpu->guest_xcr0_loaded) {
if (vcpu->arch.xcr0 != host_xcr0)
@@ -819,6 +820,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
vcpu->guest_xcr0_loaded = 0;
}
 }
+EXPORT_SYMBOL_GPL(kvm_put_guest_xcr0);
 
 static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
@@ -7865,8 +7867,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
goto cancel_injection;
}
 
-   kvm_load_guest_xcr0(vcpu);
-
if (req_immediate_exit) {
kvm_make_request(KVM_REQ_EVENT, vcpu);
kvm_x86_ops->request_immediate_exit(vcpu);
@@ -7919,8 +7919,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
 
-   kvm_put_guest_xcr0(vcpu);
-
kvm_before_interrupt(vcpu);
kvm_x86_ops->handle_external_intr(vcpu);
kvm_after_interrupt(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 28406aa1136d..aedc5d0d4989 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -347,4 +347,6 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
*vcpu)
__this_cpu_write(current_vcpu, NULL);
 }
 
+void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu);
+void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu);
 #endif
-- 
2.21.0



Re: [PATCH] x86 / CPU: Always show current CPU frequency in /proc/cpuinfo

2017-11-15 Thread WANG Chao
On 11/15/17 at 02:13P, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> After commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get()
> for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
> on x86 can be either the nominal CPU frequency (which is constant)
> or the frequency most recently requested by a scaling governor in
> cpufreq, depending on the cpufreq configuration.  That is somewhat
> inconsistent and is different from what it was before 4.13, so in
> order to restore the previous behavior, make it report the current
> CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
> 
> To that end, modify the /proc/cpuinfo implementation on x86 to use
> aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
> registers, if available, and use their values to compute the CPU
> frequency to be reported as "cpu MHz".
> 
> However, do that carefully enough to avoid accumulating delays that
> lead to unacceptable access times for /proc/cpuinfo on systems with
> many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
> asynchronously at the /proc/cpuinfo open time, add a single delay
> upfront (if necessary) at that point and simply compute the current
> frequency while running show_cpuinfo() for each individual CPU.

Hi, Rafael

I tested your patch. It's much faster.

But from what I got, calling aperfmperf_snapshot_khz() asynchronously
with 10ms sleep takes much longer than calling aperfmperf_snapshot_khz()
synchronously.

Here's my result on 64 CPUs:

 - async aperfmperf_snapshot_khz() w/ 10ms sleep:

# time cat /proc/cpuinfo > /dev/null
real0m0.014s
user0m0.000s
sys 0m0.002s

 - sync aperfmperf_snapshot_khz() w/o any sleep:

# time cat /proc/cpuinfo > /dev/null
real0m0.002s
user0m0.000s
sys 0m0.002s

Thanks,
WANG Chao

> 
> Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
> the default delay between consecutive APERF and MPERF reads to 10 ms,
> which should be sufficient to get large enough numbers for the
> frequency computation in all cases.
> 
> Fixes: 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get() for 
> /proc/cpuinfo "cpu MHz"")
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> Resent with a changelog & tags.
> 
> I'm going to route it via the linux-pm tree.
> 
> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   74 
> +++
>  arch/x86/kernel/cpu/cpu.h|3 +
>  arch/x86/kernel/cpu/proc.c   |6 ++-
>  fs/proc/cpuinfo.c|6 +++
>  include/linux/cpufreq.h  |1 
>  5 files changed, 67 insertions(+), 23 deletions(-)
> 
> Index: linux-pm/arch/x86/kernel/cpu/proc.c
> ===
> --- linux-pm.orig/arch/x86/kernel/cpu/proc.c
> +++ linux-pm/arch/x86/kernel/cpu/proc.c
> @@ -5,6 +5,8 @@
>  #include 
>  #include 
>  
> +#include "cpu.h"
> +
>  /*
>   *   Get CPU information for use by the procfs.
>   */
> @@ -78,9 +80,11 @@ static int show_cpuinfo(struct seq_file
>   seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
>  
>   if (cpu_has(c, X86_FEATURE_TSC)) {
> - unsigned int freq = cpufreq_quick_get(cpu);
> + unsigned int freq = aperfmperf_get_khz(cpu);
>  
>   if (!freq)
> + freq = cpufreq_quick_get(cpu);
> + if (!freq)
>   freq = cpu_khz;
>   seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
>  freq / 1000, (freq % 1000));
> Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> ===
> --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
> +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> @@ -14,6 +14,8 @@
>  #include 
>  #include 
>  
> +#include "cpu.h"
> +
>  struct aperfmperf_sample {
>   unsigned intkhz;
>   ktime_t time;
> @@ -24,7 +26,7 @@ struct aperfmperf_sample {
>  static DEFINE_PER_CPU(struct aperfmperf_sample, samples);
>  
>  #define APERFMPERF_CACHE_THRESHOLD_MS10
> -#define APERFMPERF_REFRESH_DELAY_MS  20
> +#define APERFMPERF_REFRESH_DELAY_MS  10
>  #define APERFMPERF_STALE_THRESHOLD_MS1000
>  
>  /*
> @@ -38,8 +40,6 @@ static void aperfmperf_snapshot_khz(void
>   u64 aperf, aperf_delta;
>   u64 mperf, mperf_delta;
>   struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> - ktime_t now = ktime_get();
> - s64 time_delta = ktime_ms_delta(now, s->time);
>   unsigned long flags;
>  
>  

Re: [PATCH] x86 / CPU: Always show current CPU frequency in /proc/cpuinfo

2017-11-16 Thread WANG Chao
On 11/16/17 at 01:24P, Rafael J. Wysocki wrote:
> On Wednesday, November 15, 2017 10:33:47 AM CET WANG Chao wrote:
> > On 11/15/17 at 02:13P, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki 
> > > 
> > > After commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get()
> > > for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
> > > on x86 can be either the nominal CPU frequency (which is constant)
> > > or the frequency most recently requested by a scaling governor in
> > > cpufreq, depending on the cpufreq configuration.  That is somewhat
> > > inconsistent and is different from what it was before 4.13, so in
> > > order to restore the previous behavior, make it report the current
> > > CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
> > > 
> > > To that end, modify the /proc/cpuinfo implementation on x86 to use
> > > aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
> > > registers, if available, and use their values to compute the CPU
> > > frequency to be reported as "cpu MHz".
> > > 
> > > However, do that carefully enough to avoid accumulating delays that
> > > lead to unacceptable access times for /proc/cpuinfo on systems with
> > > many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
> > > asynchronously at the /proc/cpuinfo open time, add a single delay
> > > upfront (if necessary) at that point and simply compute the current
> > > frequency while running show_cpuinfo() for each individual CPU.
> > 
> > Hi, Rafael
> > 
> > I tested your patch. It's much faster.
> > 
> > But from what I got, calling aperfmperf_snapshot_khz() asynchronously
> > with 10ms sleep takes much longer than calling aperfmperf_snapshot_khz()
> > synchronously.
> > 
> > Here's my result on 64 CPUs:
> > 
> >  - async aperfmperf_snapshot_khz() w/ 10ms sleep:
> > 
> > # time cat /proc/cpuinfo > /dev/null
> > real0m0.014s
> > user0m0.000s
> > sys 0m0.002s
> > 
> >  - sync aperfmperf_snapshot_khz() w/o any sleep:
> > 
> > # time cat /proc/cpuinfo > /dev/null
> > real0m0.002s
> > user0m0.000s
> > sys 0m0.002s
> 
> Sure, but the delay is there, because without it the computed frequency
> may be way off for at least one of the CPUs.

Thanks, I understand now. In this case, The 10ms delay turns out to be
the interval of measuring aperf and mperf and computing their deltas.

Last question though, is 10ms best practice or can we make it shorter,
say 5ms?

Thanks,
WANG Chao



Re: [PATCH] x86 / CPU: Always show current CPU frequency in /proc/cpuinfo

2017-11-16 Thread WANG Chao
On 11/16/17 at 02:54P, Rafael J. Wysocki wrote:
> On Thursday, November 16, 2017 10:50:36 AM CET WANG Chao wrote:
> > On 11/16/17 at 01:24P, Rafael J. Wysocki wrote:
> > > On Wednesday, November 15, 2017 10:33:47 AM CET WANG Chao wrote:
> > > > On 11/15/17 at 02:13P, Rafael J. Wysocki wrote:
> > > > > From: Rafael J. Wysocki 
> > > > > 
> > > > > After commit 890da9cf0983 (Revert "x86: do not use cpufreq_quick_get()
> > > > > for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
> > > > > on x86 can be either the nominal CPU frequency (which is constant)
> > > > > or the frequency most recently requested by a scaling governor in
> > > > > cpufreq, depending on the cpufreq configuration.  That is somewhat
> > > > > inconsistent and is different from what it was before 4.13, so in
> > > > > order to restore the previous behavior, make it report the current
> > > > > CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
> > > > > 
> > > > > To that end, modify the /proc/cpuinfo implementation on x86 to use
> > > > > aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
> > > > > registers, if available, and use their values to compute the CPU
> > > > > frequency to be reported as "cpu MHz".
> > > > > 
> > > > > However, do that carefully enough to avoid accumulating delays that
> > > > > lead to unacceptable access times for /proc/cpuinfo on systems with
> > > > > many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
> > > > > asynchronously at the /proc/cpuinfo open time, add a single delay
> > > > > upfront (if necessary) at that point and simply compute the current
> > > > > frequency while running show_cpuinfo() for each individual CPU.
> > > > 
> > > > Hi, Rafael
> > > > 
> > > > I tested your patch. It's much faster.
> > > > 
> > > > But from what I got, calling aperfmperf_snapshot_khz() asynchronously
> > > > with 10ms sleep takes much longer than calling aperfmperf_snapshot_khz()
> > > > synchronously.
> > > > 
> > > > Here's my result on 64 CPUs:
> > > > 
> > > >  - async aperfmperf_snapshot_khz() w/ 10ms sleep:
> > > > 
> > > > # time cat /proc/cpuinfo > /dev/null
> > > > real0m0.014s
> > > > user0m0.000s
> > > > sys 0m0.002s
> > > > 
> > > >  - sync aperfmperf_snapshot_khz() w/o any sleep:
> > > > 
> > > > # time cat /proc/cpuinfo > /dev/null
> > > > real0m0.002s
> > > > user0m0.000s
> > > > sys 0m0.002s
> > > 
> > > Sure, but the delay is there, because without it the computed frequency
> > > may be way off for at least one of the CPUs.
> > 
> > Thanks, I understand now. In this case, The 10ms delay turns out to be
> > the interval of measuring aperf and mperf and computing their deltas.
> > 
> > Last question though, is 10ms best practice or can we make it shorter,
> > say 5ms?
> 
> Experimentally, I found 5 ms to be slightly too short.  It all depends on
> how accurate the numbers are expected to be, however, so there is some room
> for adjustments.
> 
> Regardless, I'd prefer to start with 10 ms as that is what has been used in
> intel_pstate for quite a long time at least and adjust later if need be.

Sure. Thanks for your explanation.

This patch looks good to me.

Reviewed-by: WANG Chao 



[PATCH] x86: use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" again

2017-11-09 Thread WANG Chao
Commit 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo) caused
a serious performance issue when reading from /proc/cpuinfo on system
with aperfmperf.

For each cpu, arch_freq_get_on_cpu() sleeps 20ms to get its frequency.
On a system with 64 cpus, it takes 1.5s to finish running `cat
/proc/cpuinfo`, while it previously was done in 15ms.

Some programs use /proc/cpuinfo during startup or runtime, eg. QEMU.
This patch will restore the previous behavior and let user space happy.

Fixes: 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo)
Cc: sta...@kernel.org   # 4.13
Signed-off-by: WANG Chao 
---
 arch/x86/kernel/cpu/Makefile | 2 +-
 arch/x86/kernel/cpu/proc.c   | 4 +---
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 236999c54edc..c60922a66385 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -22,7 +22,7 @@ obj-y += common.o
 obj-y  += rdrand.o
 obj-y  += match.o
 obj-y  += bugs.o
-obj-y  += aperfmperf.o
+obj-$(CONFIG_CPU_FREQ) += aperfmperf.o
 
 obj-$(CONFIG_PROC_FS)  += proc.o
 obj-$(CONFIG_X86_FEATURE_NAMES) += capflags.o powerflags.o
diff --git a/arch/x86/kernel/cpu/proc.c b/arch/x86/kernel/cpu/proc.c
index 4378a729b933..6b7e17bf0b71 100644
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -78,10 +78,8 @@ static int show_cpuinfo(struct seq_file *m, void *v)
seq_printf(m, "microcode\t: 0x%x\n", c->microcode);
 
if (cpu_has(c, X86_FEATURE_TSC)) {
-   unsigned int freq = arch_freq_get_on_cpu(cpu);
+   unsigned int freq = cpufreq_quick_get(cpu);
 
-   if (!freq)
-   freq = cpufreq_quick_get(cpu);
if (!freq)
freq = cpu_khz;
seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
-- 
2.15.0




Re: [PATCH] x86: use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" again

2017-11-09 Thread WANG Chao
On 11/10/17 at 01:06P, Rafael J. Wysocki wrote:
> On Thursday, November 9, 2017 11:30:54 PM CET Rafael J. Wysocki wrote:
> > On Thu, Nov 9, 2017 at 5:06 PM, Rafael J. Wysocki
> >  wrote:
> > > Hi Linus,
> > >
> > > On 11/9/2017 11:38 AM, WANG Chao wrote:
> > >>
> > >> Commit 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo) caused
> > >> a serious performance issue when reading from /proc/cpuinfo on system
> > >> with aperfmperf.
> > >>
> > >> For each cpu, arch_freq_get_on_cpu() sleeps 20ms to get its frequency.
> > >> On a system with 64 cpus, it takes 1.5s to finish running `cat
> > >> /proc/cpuinfo`, while it previously was done in 15ms.
> > >
> > > Honestly, I'm not sure what to do to address this ATM.
> > >
> > > The last requested frequency is only available in the non-HWP case, so it
> > > cannot be used universally.
> > 
> > OK, here's an idea.
> > 
> > c_start() can run aperfmperf_snapshot_khz() on all CPUs upfront (say
> > in parallel), then wait for a while (say 5 ms; the current 20 ms wait
> > is overkill) and then aperfmperf_snapshot_khz() can be run once on
> > each CPU in show_cpuinfo() without taking the "stale cache" threshold
> > into account.
> > 
> > I'm going to try that and see how far I can get with it.
> 
> Below is what I have.
> 
> I ended up using APERFMPERF_REFRESH_DELAY_MS for the delay in
> aperfmperf_snapshot_all(), because 5 ms tended to add too much
> variation to the results on my test box.
> 
> I think it may be reduced to 10 ms, though.
> 
> Chao, can you please try this one and report back?

Hi, Rafael

Thanks for the patch. But it doesn't work for me. lscpu takes 1.5s to
finish on a 64 cpus AMD box with aperfmperf.

You missed the fact that c_start() will also be called by c_next().

But I don't think the overall idea is good enough. I think /proc/cpuinfo
is too general for usespace too be delayed, no matter it's 10ms or 20ms.

My point is cpu MHz is best to use a cached value for quick access. If
people are looking for reliable and accurate cpu frequency,
/proc/cpuinfo is probably a bad idae.

What do you think?

WANG Chao

> 
> 
> ---
>  arch/x86/kernel/cpu/aperfmperf.c |   42 
> ---
>  arch/x86/kernel/cpu/cpu.h|4 +++
>  arch/x86/kernel/cpu/proc.c   |5 +++-
>  3 files changed, 39 insertions(+), 12 deletions(-)
> 
> Index: linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> ===
> --- linux-pm.orig/arch/x86/kernel/cpu/aperfmperf.c
> +++ linux-pm/arch/x86/kernel/cpu/aperfmperf.c
> @@ -14,6 +14,8 @@
>  #include 
>  #include 
>  
> +#include "cpu.h"
> +
>  struct aperfmperf_sample {
>   unsigned intkhz;
>   ktime_t time;
> @@ -38,8 +40,6 @@ static void aperfmperf_snapshot_khz(void
>   u64 aperf, aperf_delta;
>   u64 mperf, mperf_delta;
>   struct aperfmperf_sample *s = this_cpu_ptr(&samples);
> - ktime_t now = ktime_get();
> - s64 time_delta = ktime_ms_delta(now, s->time);
>   unsigned long flags;
>  
>   local_irq_save(flags);
> @@ -57,15 +57,10 @@ static void aperfmperf_snapshot_khz(void
>   if (mperf_delta == 0)
>   return;
>  
> - s->time = now;
> + s->time = ktime_get();
>   s->aperf = aperf;
>   s->mperf = mperf;
> -
> - /* If the previous iteration was too long ago, discard it. */
> - if (time_delta > APERFMPERF_STALE_THRESHOLD_MS)
> - s->khz = 0;
> - else
> - s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
> + s->khz = div64_u64((cpu_khz * aperf_delta), mperf_delta);
>  }
>  
>  unsigned int arch_freq_get_on_cpu(int cpu)
> @@ -82,16 +77,41 @@ unsigned int arch_freq_get_on_cpu(int cp
>   /* Don't bother re-computing within the cache threshold time. */
>   time_delta = ktime_ms_delta(ktime_get(), per_cpu(samples.time, cpu));
>   khz = per_cpu(samples.khz, cpu);
> - if (khz && time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
> + if (time_delta < APERFMPERF_CACHE_THRESHOLD_MS)
>   return khz;
>  
>   smp_call_function_single(cpu, aperfmperf_snapshot_khz, NULL, 1);
>   khz = per_cpu(samples.khz, cpu);
> - if (khz)
> + if (time_delta <= APERFMPERF_STALE_THRESHOLD_MS)
>   return khz;
>  
> + /* If the previous iteration was too long ago, take a new data point. */
>   msleep(APERFMPERF_REFRESH_DELAY

Re: [PATCH] x86: use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" again

2017-11-09 Thread WANG Chao
On 11/10/17 at 12:04P, WANG Chao wrote:
> On 11/10/17 at 01:06P, Rafael J. Wysocki wrote:
> > On Thursday, November 9, 2017 11:30:54 PM CET Rafael J. Wysocki wrote:
> > > On Thu, Nov 9, 2017 at 5:06 PM, Rafael J. Wysocki
> > >  wrote:
> > > > Hi Linus,
> > > >
> > > > On 11/9/2017 11:38 AM, WANG Chao wrote:
> > > >>
> > > >> Commit 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo) 
> > > >> caused
> > > >> a serious performance issue when reading from /proc/cpuinfo on system
> > > >> with aperfmperf.
> > > >>
> > > >> For each cpu, arch_freq_get_on_cpu() sleeps 20ms to get its frequency.
> > > >> On a system with 64 cpus, it takes 1.5s to finish running `cat
> > > >> /proc/cpuinfo`, while it previously was done in 15ms.
> > > >
> > > > Honestly, I'm not sure what to do to address this ATM.
> > > >
> > > > The last requested frequency is only available in the non-HWP case, so 
> > > > it
> > > > cannot be used universally.
> > > 
> > > OK, here's an idea.
> > > 
> > > c_start() can run aperfmperf_snapshot_khz() on all CPUs upfront (say
> > > in parallel), then wait for a while (say 5 ms; the current 20 ms wait
> > > is overkill) and then aperfmperf_snapshot_khz() can be run once on
> > > each CPU in show_cpuinfo() without taking the "stale cache" threshold
> > > into account.
> > > 
> > > I'm going to try that and see how far I can get with it.
> > 
> > Below is what I have.
> > 
> > I ended up using APERFMPERF_REFRESH_DELAY_MS for the delay in
> > aperfmperf_snapshot_all(), because 5 ms tended to add too much
> > variation to the results on my test box.
> > 
> > I think it may be reduced to 10 ms, though.
> > 
> > Chao, can you please try this one and report back?
> 
> Hi, Rafael
> 
> Thanks for the patch. But it doesn't work for me. lscpu takes 1.5s to
> finish on a 64 cpus AMD box with aperfmperf.
> 
> You missed the fact that c_start() will also be called by c_next().
> 
> But I don't think the overall idea is good enough. I think /proc/cpuinfo
> is too general for usespace too be delayed, no matter it's 10ms or 20ms.
> 
> My point is cpu MHz is best to use a cached value for quick access. If
> people are looking for reliable and accurate cpu frequency,
> /proc/cpuinfo is probably a bad idae.
> 
> What do you think?

Could you also explain 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in
/proc/cpuinfo) please? The commit message is not clear for me.

Are there any upstream disscutions? I wasn't following this change in
upstream. Now I can't find any.

Thanks,
WANG Chao



Re: [PATCH] x86: use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz" again

2017-11-10 Thread WANG Chao
On 11/10/17 at 08:25P, Ingo Molnar wrote:
> 
> * Rafael J. Wysocki  wrote:
> 
> > Hi Linus,
> > 
> > On 11/9/2017 11:38 AM, WANG Chao wrote:
> > > Commit 941f5f0f6ef5 (x86: CPU: Fix up "cpu MHz" in /proc/cpuinfo) caused
> > > a serious performance issue when reading from /proc/cpuinfo on system
> > > with aperfmperf.
> > > 
> > > For each cpu, arch_freq_get_on_cpu() sleeps 20ms to get its frequency.
> > > On a system with 64 cpus, it takes 1.5s to finish running `cat
> > > /proc/cpuinfo`, while it previously was done in 15ms.
> > 
> > Honestly, I'm not sure what to do to address this ATM.
> > 
> > The last requested frequency is only available in the non-HWP case, so it
> > cannot be used universally.
> 
> This is a serious regression that needs to be fixed ASAP, because the 
> slowdown is 
> utterly ridiculous on a 120 CPU system:
> 
>   fomalhaut:~> time cat /proc/cpuinfo  >/dev/null
> 
>   real0m2.689s
>   user    0m0.001s
>   sys 0m0.007s

Because 4.14 is about to release, how about adding this patch to 4.14
now? We can work on a more sophisticated solution later.

Thanks,
WANG Chao



[tip:x86/mm] x86, setup: Let early_memremap() handle page alignment

2015-01-23 Thread tip-bot for WANG Chao
Commit-ID:  7389882c81474d635a208726edb22938645ff9ad
Gitweb: http://git.kernel.org/tip/7389882c81474d635a208726edb22938645ff9ad
Author: WANG Chao 
AuthorDate: Wed, 7 Jan 2015 18:55:48 +0800
Committer:  Thomas Gleixner 
CommitDate: Fri, 23 Jan 2015 16:14:26 +0100

x86, setup: Let early_memremap() handle page alignment

early_memremap() takes care of page alignment and map size, so we can
just remap the required data size and get rid of the adjustments in
the setup code.

[tglx: Massaged changelog ]

Signed-off-by: WANG Chao 
Cc: Matt Fleming 
Cc: Borislav Petkov 
Cc: Santosh Shilimkar 
Cc: Bryan O'Donoghue 
Link: 
http://lkml.kernel.org/r/1420628150-16872-1-git-send-email-chaow...@redhat.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/setup.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ab4734e..c4648ada 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -431,15 +431,13 @@ static void __init parse_setup_data(void)
 
pa_data = boot_params.hdr.setup_data;
while (pa_data) {
-   u32 data_len, map_len, data_type;
+   u32 data_len, data_type;
 
-   map_len = max(PAGE_SIZE - (pa_data & ~PAGE_MASK),
- (u64)sizeof(struct setup_data));
-   data = early_memremap(pa_data, map_len);
+   data = early_memremap(pa_data, sizeof(*data));
data_len = data->len + sizeof(struct setup_data);
data_type = data->type;
pa_next = data->next;
-   early_iounmap(data, map_len);
+   early_iounmap(data, sizeof(*data));
 
switch (data_type) {
case SETUP_E820_EXT:
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/mm] x86, e820: Clean up sanitize_e820_map() users

2015-01-23 Thread tip-bot for WANG Chao
Commit-ID:  d574ffa1066003569ed5cdaeabf44597564ce975
Gitweb: http://git.kernel.org/tip/d574ffa1066003569ed5cdaeabf44597564ce975
Author: WANG Chao 
AuthorDate: Wed, 7 Jan 2015 11:37:38 +0800
Committer:  Thomas Gleixner 
CommitDate: Fri, 23 Jan 2015 16:14:27 +0100

x86, e820: Clean up sanitize_e820_map() users

The argument 3 of sanitize_e820_map() will only be updated upon a
successful sanitization. Some of the callers have extra conditionals
for the same purpose. Clean them up.

default_machine_specific_memory_setup() must keep the extra
conditional because boot_params.e820_entries is an u8 and not an u32,
so the direct update would overwrite other fields in boot_params.

[ tglx: Massaged changelog ]

Signed-off-by: WANG Chao 
Acked-by: David Rientjes 
Cc: Grygorii Strashko 
Cc: Pavel Machek 
Cc: Lee Chun-Yi 
Cc: Xishi Qiu 
Link: 
http://lkml.kernel.org/r/1420601859-18439-1-git-send-email-chaow...@redhat.com
Signed-off-by: Thomas Gleixner 
---
 arch/x86/kernel/e820.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index dd2f07a..46201de 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -184,9 +184,9 @@ void __init e820_print_map(char *who)
  * overwritten in the same location, starting at biosmap.
  *
  * The integer pointed to by pnr_map must be valid on entry (the
- * current number of valid entries located at biosmap) and will
- * be updated on return, with the new number of valid entries
- * (something no more than max_nr_map.)
+ * current number of valid entries located at biosmap). If the
+ * sanitizing succeeds the *pnr_map will be updated with the new
+ * number of valid entries (something no more than max_nr_map).
  *
  * The return value from sanitize_e820_map() is zero if it
  * successfully 'sanitized' the map entries passed in, and is -1
@@ -561,23 +561,15 @@ u64 __init e820_remove_range(u64 start, u64 size, 
unsigned old_type,
 
 void __init update_e820(void)
 {
-   u32 nr_map;
-
-   nr_map = e820.nr_map;
-   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr_map))
+   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &e820.nr_map))
return;
-   e820.nr_map = nr_map;
printk(KERN_INFO "e820: modified physical RAM map:\n");
e820_print_map("modified");
 }
 static void __init update_e820_saved(void)
 {
-   u32 nr_map;
-
-   nr_map = e820_saved.nr_map;
-   if (sanitize_e820_map(e820_saved.map, ARRAY_SIZE(e820_saved.map), 
&nr_map))
-   return;
-   e820_saved.nr_map = nr_map;
+   sanitize_e820_map(e820_saved.map, ARRAY_SIZE(e820_saved.map),
+   &e820_saved.nr_map);
 }
 #define MAX_GAP_END 0x1ull
 /*
@@ -898,11 +890,9 @@ early_param("memmap", parse_memmap_opt);
 void __init finish_e820_parsing(void)
 {
if (userdef) {
-   u32 nr = e820.nr_map;
-
-   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map), &nr) < 0)
+   if (sanitize_e820_map(e820.map, ARRAY_SIZE(e820.map),
+   &e820.nr_map) < 0)
early_panic("Invalid user supplied memory map");
-   e820.nr_map = nr;
 
printk(KERN_INFO "e820: user-defined physical RAM map:\n");
e820_print_map("user");
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/platform] x86, calgary: Use 8M TCE table size by default

2014-03-20 Thread tip-bot for WANG Chao
Commit-ID:  936329de1e6bf3dfa8c056074e6fc6b673e7b1d0
Gitweb: http://git.kernel.org/tip/936329de1e6bf3dfa8c056074e6fc6b673e7b1d0
Author: WANG Chao 
AuthorDate: Mon, 10 Mar 2014 22:52:00 +0800
Committer:  H. Peter Anvin 
CommitDate: Thu, 20 Mar 2014 14:51:39 -0700

x86, calgary: Use 8M TCE table size by default

New kexec-tools wants to pass kdump kernel needed memmap via E820
directly, instead of memmap=exactmap. This makes saved_max_pfn not
be passed down to 2nd kernel. To keep 1st kernel and 2nd kernel using
the same TCE table size, Muli suggest to hard code the size to max (8M).

We can't get rid of saved_max_pfn this time, for backward compatibility
with old first kernel and new second kernel. However new first kernel
and old second kernel can not work unfortunately.

v2->v1:
- retain saved_max_pfn so new 2nd kernel can work with old 1st kernel
  from Vivek

Signed-off-by: WANG Chao 
Acked-by: Vivek Goyal 
Acked-by: Muli Ben-Yehuda 
Acked-by: Jon Mason 
Link: 
http://lkml.kernel.org/r/1394463120-26999-1-git-send-email-chaow...@redhat.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/pci-calgary_64.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index 299d493..0497f71 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -1207,23 +1207,31 @@ error:
return ret;
 }
 
-static inline int __init determine_tce_table_size(u64 ram)
+static inline int __init determine_tce_table_size(void)
 {
int ret;
 
if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
return specified_table_size;
 
-   /*
-* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
-* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
-* larger table size has twice as many entries, so shift the
-* max ram address by 13 to divide by 8K and then look at the
-* order of the result to choose between 0-7.
-*/
-   ret = get_order(ram >> 13);
-   if (ret > TCE_TABLE_SIZE_8M)
+   if (is_kdump_kernel() && saved_max_pfn) {
+   /*
+* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
+* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
+* larger table size has twice as many entries, so shift the
+* max ram address by 13 to divide by 8K and then look at the
+* order of the result to choose between 0-7.
+*/
+   ret = get_order((saved_max_pfn * PAGE_SIZE) >> 13);
+   if (ret > TCE_TABLE_SIZE_8M)
+   ret = TCE_TABLE_SIZE_8M;
+   } else {
+   /*
+* Use 8M by default (suggested by Muli) if it's not
+* kdump kernel and saved_max_pfn isn't set.
+*/
ret = TCE_TABLE_SIZE_8M;
+   }
 
return ret;
 }
@@ -1418,8 +1426,7 @@ int __init detect_calgary(void)
return -ENOMEM;
}
 
-   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
-   saved_max_pfn : max_pfn) * PAGE_SIZE);
+   specified_table_size = determine_tce_table_size();
 
for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
struct calgary_bus_info *info = &bus_info[bus];
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/platform] x86, calgary: Use 8M TCE table size by default

2014-04-10 Thread tip-bot for WANG Chao
Commit-ID:  0534af01cca338193abbfdb53650af91e65dbf10
Gitweb: http://git.kernel.org/tip/0534af01cca338193abbfdb53650af91e65dbf10
Author: WANG Chao 
AuthorDate: Mon, 10 Mar 2014 22:52:00 +0800
Committer:  H. Peter Anvin 
CommitDate: Thu, 10 Apr 2014 19:51:32 -0700

x86, calgary: Use 8M TCE table size by default

New kexec-tools wants to pass kdump kernel needed memmap via E820
directly, instead of memmap=exactmap. This makes saved_max_pfn not
be passed down to 2nd kernel. To keep 1st kernel and 2nd kernel using
the same TCE table size, Muli suggest to hard code the size to max (8M).

We can't get rid of saved_max_pfn this time, for backward compatibility
with old first kernel and new second kernel. However new first kernel
and old second kernel can not work unfortunately.

v2->v1:
- retain saved_max_pfn so new 2nd kernel can work with old 1st kernel
  from Vivek

Signed-off-by: WANG Chao 
Acked-by: Vivek Goyal 
Acked-by: Muli Ben-Yehuda 
Acked-by: Jon Mason 
Link: 
http://lkml.kernel.org/r/1394463120-26999-1-git-send-email-chaow...@redhat.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/kernel/pci-calgary_64.c | 31 +++
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index 299d493..0497f71 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -1207,23 +1207,31 @@ error:
return ret;
 }
 
-static inline int __init determine_tce_table_size(u64 ram)
+static inline int __init determine_tce_table_size(void)
 {
int ret;
 
if (specified_table_size != TCE_TABLE_SIZE_UNSPECIFIED)
return specified_table_size;
 
-   /*
-* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
-* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
-* larger table size has twice as many entries, so shift the
-* max ram address by 13 to divide by 8K and then look at the
-* order of the result to choose between 0-7.
-*/
-   ret = get_order(ram >> 13);
-   if (ret > TCE_TABLE_SIZE_8M)
+   if (is_kdump_kernel() && saved_max_pfn) {
+   /*
+* Table sizes are from 0 to 7 (TCE_TABLE_SIZE_64K to
+* TCE_TABLE_SIZE_8M). Table size 0 has 8K entries and each
+* larger table size has twice as many entries, so shift the
+* max ram address by 13 to divide by 8K and then look at the
+* order of the result to choose between 0-7.
+*/
+   ret = get_order((saved_max_pfn * PAGE_SIZE) >> 13);
+   if (ret > TCE_TABLE_SIZE_8M)
+   ret = TCE_TABLE_SIZE_8M;
+   } else {
+   /*
+* Use 8M by default (suggested by Muli) if it's not
+* kdump kernel and saved_max_pfn isn't set.
+*/
ret = TCE_TABLE_SIZE_8M;
+   }
 
return ret;
 }
@@ -1418,8 +1426,7 @@ int __init detect_calgary(void)
return -ENOMEM;
}
 
-   specified_table_size = determine_tce_table_size((is_kdump_kernel() ?
-   saved_max_pfn : max_pfn) * PAGE_SIZE);
+   specified_table_size = determine_tce_table_size();
 
for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
struct calgary_bus_info *info = &bus_info[bus];
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:x86/build] x86/purgatory, build: Suppress kexec-purgatory.c is up to date message

2014-10-15 Thread tip-bot for WANG Chao
Commit-ID:  3ea4b8ee2419e21295cabab66c317612c5a55d26
Gitweb: http://git.kernel.org/tip/3ea4b8ee2419e21295cabab66c317612c5a55d26
Author: WANG Chao 
AuthorDate: Tue, 14 Oct 2014 12:46:58 +0800
Committer:  H. Peter Anvin 
CommitDate: Wed, 15 Oct 2014 08:31:21 -0700

x86/purgatory, build: Suppress kexec-purgatory.c is up to date message

Suppress this unnecessary message during kernel re-build
(CONFIG_KEXEC_FILE=y):

make[1]: `arch/x86/purgatory/kexec-purgatory.c' is up to date.

Signed-off-by: WANG Chao 
Link: 
http://lkml.kernel.org/r/1413262019-3759-1-git-send-email-chaow...@redhat.com
Signed-off-by: H. Peter Anvin 
---
 arch/x86/purgatory/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/purgatory/Makefile b/arch/x86/purgatory/Makefile
index 899dd24..2c41066 100644
--- a/arch/x86/purgatory/Makefile
+++ b/arch/x86/purgatory/Makefile
@@ -23,6 +23,7 @@ quiet_cmd_bin2c = BIN2C   $@
 
 $(obj)/kexec-purgatory.c: $(obj)/purgatory.ro FORCE
$(call if_changed,bin2c)
+   @:
 
 
 obj-$(CONFIG_KEXEC_FILE)   += kexec-purgatory.o
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[tip:perf/core] perf test: fix typo in python test

2014-11-19 Thread tip-bot for WANG Chao
Commit-ID:  887e73d7f42c6a146b572a0577e9875ccca66f37
Gitweb: http://git.kernel.org/tip/887e73d7f42c6a146b572a0577e9875ccca66f37
Author: WANG Chao 
AuthorDate: Wed, 12 Nov 2014 16:27:05 +0800
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Wed, 19 Nov 2014 12:33:47 -0300

perf test: fix typo in python test

Library loading in python syntax should be 'import perf', not 'use perf'.

Signed-off-by: WANG Chao 
Cc: Adrian Hunter 
Cc: Ingo Molnar 
Cc: Jean Pihet 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1415780826-13250-1-git-send-email-chaow...@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/tests/builtin-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index 162c978..4b7d9ab 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -85,7 +85,7 @@ static struct test {
.func = test__hists_link,
},
{
-   .desc = "Try 'use perf' in python, checking link problems",
+   .desc = "Try 'import perf' in python, checking link problems",
.func = test__python_use,
},
{
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  1   2   >