Re: [PATCH] xen/events: don't use chip_data for legacy IRQs
On 30.09.20 11:16, Juergen Gross wrote:
> Since commit c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data
> to store a per interrupt XEN data pointer which contains XEN specific
> information.")
> Xen is using the chip_data pointer for storing IRQ specific data. When
> running as a HVM domain this can result in problems for legacy IRQs, as
> those might use chip_data for their own purposes.
>
> Use a local array for this purpose in case of legacy IRQs, avoiding the
> double use.
>
> Cc: sta...@vger.kernel.org
> Fixes: c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to
> store a per interrupt XEN data pointer which contains XEN specific
> information.")
> Signed-off-by: Juergen Gross

Tested-by: Stefan Bader

> ---
>  drivers/xen/events/events_base.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index 90b8f56fbadb..6f02c18fa65c 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -92,6 +92,8 @@ static bool (*pirq_needs_eoi)(unsigned irq);
>  /* Xen will never allocate port zero for any purpose. */
>  #define VALID_EVTCHN(chn)	((chn) != 0)
>
> +static struct irq_info *legacy_info_ptrs[NR_IRQS_LEGACY];
> +
>  static struct irq_chip xen_dynamic_chip;
>  static struct irq_chip xen_percpu_chip;
>  static struct irq_chip xen_pirq_chip;
> @@ -156,7 +158,18 @@ int get_evtchn_to_irq(evtchn_port_t evtchn)
>  /* Get info for IRQ */
>  struct irq_info *info_for_irq(unsigned irq)
>  {
> -	return irq_get_chip_data(irq);
> +	if (irq < nr_legacy_irqs())
> +		return legacy_info_ptrs[irq];
> +	else
> +		return irq_get_chip_data(irq);
> +}
> +
> +static void set_info_for_irq(unsigned int irq, struct irq_info *info)
> +{
> +	if (irq < nr_legacy_irqs())
> +		legacy_info_ptrs[irq] = info;
> +	else
> +		irq_set_chip_data(irq, info);
>  }
>
>  /* Constructors for packed IRQ information. */
> @@ -377,7 +390,7 @@ static void xen_irq_init(unsigned irq)
>  	info->type = IRQT_UNBOUND;
>  	info->refcnt = -1;
>
> -	irq_set_chip_data(irq, info);
> +	set_info_for_irq(irq, info);
>
>  	list_add_tail(&info->list, &xen_irq_list_head);
>  }
> @@ -426,14 +439,14 @@ static int __must_check xen_allocate_irq_gsi(unsigned gsi)
>
>  static void xen_free_irq(unsigned irq)
>  {
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (WARN_ON(!info))
>  		return;
>
>  	list_del(&info->list);
>
> -	irq_set_chip_data(irq, NULL);
> +	set_info_for_irq(irq, NULL);
>
>  	WARN_ON(info->refcnt > 0);
>
> @@ -603,7 +616,7 @@ EXPORT_SYMBOL_GPL(xen_irq_from_gsi);
>  static void __unbind_from_irq(unsigned int irq)
>  {
>  	evtchn_port_t evtchn = evtchn_from_irq(irq);
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (info->refcnt > 0) {
>  		info->refcnt--;
> @@ -1108,7 +1121,7 @@ int bind_ipi_to_irqhandler(enum ipi_vector ipi,
>
>  void unbind_from_irqhandler(unsigned int irq, void *dev_id)
>  {
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (WARN_ON(!info))
>  		return;
> @@ -1142,7 +1155,7 @@ int evtchn_make_refcounted(evtchn_port_t evtchn)
>  	if (irq == -1)
>  		return -ENOENT;
>
> -	info = irq_get_chip_data(irq);
> +	info = info_for_irq(irq);
>
>  	if (!info)
>  		return -ENOENT;
> @@ -1170,7 +1183,7 @@ int evtchn_get(evtchn_port_t evtchn)
>  	if (irq == -1)
>  		goto done;
>
> -	info = irq_get_chip_data(irq);
> +	info = info_for_irq(irq);
>
>  	if (!info)
>  		goto done;
Re: [PATCH 4.4 50/62] XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.
On 29.09.20 16:05, Jürgen Groß wrote: > On 29.09.20 15:13, Stefan Bader wrote: >> On 01.09.20 17:10, Greg Kroah-Hartman wrote: >>> From: Thomas Gleixner >>> >>> commit c330fb1ddc0a922f044989492b7fcca77ee1db46 upstream. >>> >>> handler data is meant for interrupt handlers and not for storing irq chip >>> specific information as some devices require handler data to store internal >>> per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers. >>> >>> This obviously creates a conflict of interests and crashes the machine >>> because the XEN pointer is overwritten by the driver pointer. >> >> I cannot say whether this applies the same for the vanilla 4.4 stable kernels >> but once this had been applied to our 4.4 based kernels, we observed Xen HVM >> guests crashing on boot with: >> >> [ 0.927538] ACPI: bus type PCI registered >> [ 0.936008] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 >> [ 0.948739] PCI: Using configuration type 1 for base access >> [ 0.960007] PCI: Using configuration type 1 for extended access >> [ 0.984340] ACPI: Added _OSI(Module Device) >> [ 0.988010] ACPI: Added _OSI(Processor Device) >> [ 0.992007] ACPI: Added _OSI(3.0 _SCP Extensions) >> [ 0.996013] ACPI: Added _OSI(Processor Aggregator Device) >> [ 1.002103] BUG: unable to handle kernel paging request at >> ff5f3000 >> [ 1.004000] IP: [] mp_irqdomain_activate+0x5f/0xa0 >> [ 1.004000] PGD 1e0f067 PUD 1e11067 PMD 1e12067 PTE 0 >> [ 1.004000] Oops: 0002 [#1] SMP >> [ 1.004000] Modules linked in: >> [ 1.004000] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.0-191-generic >> #221-Ubuntu >> [ 1.004000] Hardware name: Xen HVM domU, BIOS 4.6.5 04/18/2018 >> [ 1.004000] task: 880107db ti: 880107dac000 task.ti: >> 880107dac000 >> [ 1.004000] RIP: 0010:[] [] >> mp_irqdomain_activate+0x5f/0xa0 >> [ 1.004000] RSP: 0018:880107dafc48 EFLAGS: 00010086 >> [ 1.004000] RAX: 0086 RBX: 8800eb852140 RCX: >> >> [ 1.004000] RDX: ff5f3000 RSI: 0001 RDI: >> 0020c000 >> [ 1.004000] RBP: 
880107dafc50 R08: 81ebdfd0 R09: >> >> [ 1.004000] R10: 0011 R11: 0009 R12: >> 88010880d400 >> [ 1.004000] R13: 0001 R14: 0009 R15: >> 8800eb880080 >> [ 1.004000] FS: () GS:880108ec() >> knlGS: >> [ 1.004000] CS: 0010 DS: ES: CR0: 80050033 >> [ 1.004000] CR2: ff5f3000 CR3: 01e0a000 CR4: >> 06f0 >> [ 1.004000] Stack: >> [ 1.004000] 88010880bc58 880107dafc70 810ea644 >> 88010880bc00 >> [ 1.004000] 88010880bc58 880107dafca0 810e6d88 >> 810e1009 >> [ 1.004000] 88010880bc00 88010880bca0 8800eb880080 >> 880107dafd38 >> [ 1.004000] Call Trace: >> [ 1.004000] [] irq_domain_activate_irq+0x44/0x50 >> [ 1.004000] [] irq_startup+0x38/0x90 >> [ 1.004000] [] ? vprintk_default+0x29/0x40 >> [ 1.004000] [] __setup_irq+0x5a2/0x650 >> [ 1.004000] [] ? kmem_cache_alloc_trace+0x1d4/0x1f0 >> [ 1.004000] [] ? acpi_osi_handler+0xb0/0xb0 >> [ 1.004000] [] request_threaded_irq+0xfb/0x1a0 >> [ 1.004000] [] ? acpi_osi_handler+0xb0/0xb0 >> [ 1.004000] [] ? acpi_ev_sci_dispatch+0x64/0x64 >> [ 1.004000] [] >> acpi_os_install_interrupt_handler+0xaa/0x100 >> [ 1.004000] [] ? acpi_sleep_proc_init+0x28/0x28 >> [ 1.004000] [] acpi_ev_install_sci_handler+0x23/0x25 >> [ 1.004000] [] acpi_ev_install_xrupt_handlers+0x1c/0x6c >> [ 1.004000] [] acpi_enable_subsystem+0x8f/0x93 >> [ 1.004000] [] acpi_init+0x8b/0x2c4 >> [ 1.004000] [] ? kasprintf+0x4e/0x70 >> [ 1.004000] [] ? acpi_sleep_proc_init+0x28/0x28 >> [ 1.004000] [] do_one_initcall+0xb5/0x200 >> [ 1.004000] [] ? parse_args+0x29a/0x4a0 >> [ 1.004000] [] kernel_init_freeable+0x177/0x218 >> [ 1.004000] [] ? rest_init+0x80/0x80 >> [ 1.004000] [] kernel_init+0xe/0xe0 >> [ 1.004000] [] ret_from_fork+0x42/0x80 >> [ 1.004000] [] ? rest_init+0x80/0x80 >> [ 1.004000] Code: 8d 1c d2 8d ba 0b 02 00 00 44 8d 51 11 42 8b 14 dd 74 >> ec 10 >> 82 c1 e7 0c 48 63 ff 81 e2 ff 0f 00 00 48 81 ea 00 10 80 00 48 29 fa <44> 89 >> 12 >> 89 72 10 42 8b 14 dd 74 ec
Re: [PATCH 4.4 50/62] XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.
On 01.09.20 17:10, Greg Kroah-Hartman wrote:
> From: Thomas Gleixner
>
> commit c330fb1ddc0a922f044989492b7fcca77ee1db46 upstream.
>
> handler data is meant for interrupt handlers and not for storing irq chip
> specific information as some devices require handler data to store internal
> per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers.
>
> This obviously creates a conflict of interests and crashes the machine
> because the XEN pointer is overwritten by the driver pointer.

I cannot say whether this applies the same to the vanilla 4.4 stable
kernels, but once this had been applied to our 4.4 based kernels, we
observed Xen HVM guests crashing on boot with:

[    0.927538] ACPI: bus type PCI registered
[    0.936008] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.948739] PCI: Using configuration type 1 for base access
[    0.960007] PCI: Using configuration type 1 for extended access
[    0.984340] ACPI: Added _OSI(Module Device)
[    0.988010] ACPI: Added _OSI(Processor Device)
[    0.992007] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.996013] ACPI: Added _OSI(Processor Aggregator Device)
[    1.002103] BUG: unable to handle kernel paging request at ff5f3000
[    1.004000] IP: [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000] PGD 1e0f067 PUD 1e11067 PMD 1e12067 PTE 0
[    1.004000] Oops: 0002 [#1] SMP
[    1.004000] Modules linked in:
[    1.004000] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.0-191-generic #221-Ubuntu
[    1.004000] Hardware name: Xen HVM domU, BIOS 4.6.5 04/18/2018
[    1.004000] task: 880107db ti: 880107dac000 task.ti: 880107dac000
[    1.004000] RIP: 0010:[] [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000] RSP: 0018:880107dafc48 EFLAGS: 00010086
[    1.004000] RAX: 0086 RBX: 8800eb852140 RCX:
[    1.004000] RDX: ff5f3000 RSI: 0001 RDI: 0020c000
[    1.004000] RBP: 880107dafc50 R08: 81ebdfd0 R09:
[    1.004000] R10: 0011 R11: 0009 R12: 88010880d400
[    1.004000] R13: 0001 R14: 0009 R15: 8800eb880080
[    1.004000] FS: () GS:880108ec() knlGS:
[    1.004000] CS: 0010 DS: ES: CR0: 80050033
[    1.004000] CR2: ff5f3000 CR3: 01e0a000 CR4: 06f0
[    1.004000] Stack:
[    1.004000]  88010880bc58 880107dafc70 810ea644 88010880bc00
[    1.004000]  88010880bc58 880107dafca0 810e6d88 810e1009
[    1.004000]  88010880bc00 88010880bca0 8800eb880080 880107dafd38
[    1.004000] Call Trace:
[    1.004000]  [] irq_domain_activate_irq+0x44/0x50
[    1.004000]  [] irq_startup+0x38/0x90
[    1.004000]  [] ? vprintk_default+0x29/0x40
[    1.004000]  [] __setup_irq+0x5a2/0x650
[    1.004000]  [] ? kmem_cache_alloc_trace+0x1d4/0x1f0
[    1.004000]  [] ? acpi_osi_handler+0xb0/0xb0
[    1.004000]  [] request_threaded_irq+0xfb/0x1a0
[    1.004000]  [] ? acpi_osi_handler+0xb0/0xb0
[    1.004000]  [] ? acpi_ev_sci_dispatch+0x64/0x64
[    1.004000]  [] acpi_os_install_interrupt_handler+0xaa/0x100
[    1.004000]  [] ? acpi_sleep_proc_init+0x28/0x28
[    1.004000]  [] acpi_ev_install_sci_handler+0x23/0x25
[    1.004000]  [] acpi_ev_install_xrupt_handlers+0x1c/0x6c
[    1.004000]  [] acpi_enable_subsystem+0x8f/0x93
[    1.004000]  [] acpi_init+0x8b/0x2c4
[    1.004000]  [] ? kasprintf+0x4e/0x70
[    1.004000]  [] ? acpi_sleep_proc_init+0x28/0x28
[    1.004000]  [] do_one_initcall+0xb5/0x200
[    1.004000]  [] ? parse_args+0x29a/0x4a0
[    1.004000]  [] kernel_init_freeable+0x177/0x218
[    1.004000]  [] ? rest_init+0x80/0x80
[    1.004000]  [] kernel_init+0xe/0xe0
[    1.004000]  [] ret_from_fork+0x42/0x80
[    1.004000]  [] ? rest_init+0x80/0x80
[    1.004000] Code: 8d 1c d2 8d ba 0b 02 00 00 44 8d 51 11 42 8b 14 dd 74 ec 10 82 c1 e7 0c 48 63 ff 81 e2 ff 0f 00 00 48 81 ea 00 10 80 00 48 29 fa <44> 89 12 89 72 10 42 8b 14 dd 74 ec 10 82 83 c1 10 81 e2 ff 0f
[    1.004000] RIP  [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000]  RSP
[    1.004000] CR2: ff5f3000
[    1.004000] ---[ end trace 3201cae5b6bd7be1 ]---
[    1.592027] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
[    1.592027]

This is from a local server, but the same stack trace happens on AWS
instances while initializing the ACPI SCI. mp_irqdomain_activate is
accessing chip_data, expecting ioapic data there.

Oddly enough, more recent kernels seem to do the same but do not crash as
HVM guests (we have seen this neither with our 4.15 nor with our 5.4
kernels). Maybe someone could make sure the 4.4.y stable series is not
also broken.

-Stefan

>
> As the XEN data is not handler specific it should be stored in
> irqdesc::irq_data::chip_data instead.
>
> A simple sed s/irq_[sg]et_handler_data/irq_[sg]et_chip_data/ cures that.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Roman
Re: [PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
On 29.05.19 12:37, Greg KH wrote: > On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote: >> From: Jiri Wiesner >> >> The *_frag_reasm() functions are susceptible to miscalculating the byte >> count of packet fragments in case the truesize of a head buffer changes. >> The truesize member may be changed by the call to skb_unclone(), leaving >> the fragment memory limit counter unbalanced even if all fragments are >> processed. This miscalculation goes unnoticed as long as the network >> namespace which holds the counter is not destroyed. >> >> Should an attempt be made to destroy a network namespace that holds an >> unbalanced fragment memory limit counter the cleanup of the namespace >> never finishes. The thread handling the cleanup gets stuck in >> inet_frags_exit_net() waiting for the percpu counter to reach zero. The >> thread is usually in running state with a stacktrace similar to: >> >> PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4" >> #5 [880621563d48] _raw_spin_lock at 815f5480 >> #6 [880621563d48] inet_evict_bucket at 8158020b >> #7 [880621563d80] inet_frags_exit_net at 8158051c >> #8 [880621563db0] ops_exit_list at 814f5856 >> #9 [880621563dd8] cleanup_net at 814f67c0 >> #10 [880621563e38] process_one_work at 81096f14 >> >> It is not possible to create new network namespaces, and processes >> that call unshare() end up being stuck in uninterruptible sleep state >> waiting to acquire the net_mutex. >> >> The bug was observed in the IPv6 netfilter code by Per Sundstrom. >> I thank him for his analysis of the problem. The parts of this patch >> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures. >> >> Signed-off-by: Jiri Wiesner >> Reported-by: Per Sundstrom >> Acked-by: Peter Oskolkov >> Signed-off-by: David S. 
Miller
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

It turns out that I cannot provide patches for 4.9.y because they are not
needed there. Patches #1 and #2 of the list I sent do not appear there
explicitly. Patch #3, however, does, and it might be possible to do the
changes of the other two implicitly by adjusting the removal of the
functions from their old locations and applying the additions unmodified.
And the final patch, pulling the skb from the list, is already included in
4.9.y by the backports for using rbtrees in ipv6, too. In 4.9.y, however,
the skb_get() still needs to be dropped. Sasha did not apply that in the
end, maybe partially because of my warning that it was not enough for
4.4.y.

So I think there are two options for 4.4.y, which I would defer to the
net-devs to decide:

- either also backport the patches to use rbtrees in ipv6 to 4.4.y
  (including the use of inet_frag_pull_head() in ip6_expire_frag_queue()
  and dropping the skb_get() there),
- or make some limited change to ip6_expire_frag_queue(), probably only
  doing the part not related to rbtrees. This would not require patch #3
  as a prerequisite.

Patches #1 and #2 might be considered separately. They would be unrelated
to the crash in ip6_expire_frag_queue() but could fix other issues (I do
not like the sound of a net namespace teardown possibly getting stuck).

For 4.9.y, please re-consider picking "ip6: fix skb leak in
ip6frag_expire_frag_queue()". Depending on the checks down the path of
icmpv6_send(), it might be a crash there, too. In that case that patch
should be enough, as inet_frag_pull_head() takes the skb off the queue, so
it won't get double freed when the whole queue is released.
-Stefan

> thanks,
>
> greg k-h
Re: [PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
On 29.05.19 12:37, Greg KH wrote: > On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote: >> From: Jiri Wiesner >> >> The *_frag_reasm() functions are susceptible to miscalculating the byte >> count of packet fragments in case the truesize of a head buffer changes. >> The truesize member may be changed by the call to skb_unclone(), leaving >> the fragment memory limit counter unbalanced even if all fragments are >> processed. This miscalculation goes unnoticed as long as the network >> namespace which holds the counter is not destroyed. >> >> Should an attempt be made to destroy a network namespace that holds an >> unbalanced fragment memory limit counter the cleanup of the namespace >> never finishes. The thread handling the cleanup gets stuck in >> inet_frags_exit_net() waiting for the percpu counter to reach zero. The >> thread is usually in running state with a stacktrace similar to: >> >> PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4" >> #5 [880621563d48] _raw_spin_lock at 815f5480 >> #6 [880621563d48] inet_evict_bucket at 8158020b >> #7 [880621563d80] inet_frags_exit_net at 8158051c >> #8 [880621563db0] ops_exit_list at 814f5856 >> #9 [880621563dd8] cleanup_net at 814f67c0 >> #10 [880621563e38] process_one_work at 81096f14 >> >> It is not possible to create new network namespaces, and processes >> that call unshare() end up being stuck in uninterruptible sleep state >> waiting to acquire the net_mutex. >> >> The bug was observed in the IPv6 netfilter code by Per Sundstrom. >> I thank him for his analysis of the problem. The parts of this patch >> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures. >> >> Signed-off-by: Jiri Wiesner >> Reported-by: Per Sundstrom >> Acked-by: Peter Oskolkov >> Signed-off-by: David S. 
Miller
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

I will, once it is clear that a) the backport looks alright and b) it is
ok to be done at all. Alternatively it might be decided that only the
parts necessary for pulling out a frag head should be picked, or the
net-devs might decide they want to send things out themselves.

The problem potentially exists in anything that has some stable support up
to v5.1, and I have no complete overview of where this was backported to.
So this is more a start of a discussion than a request to apply it; stable
was just included to make the stable maintainers aware.

-Stefan

> thanks,
>
> greg k-h
[PATCH 0/4] ipv6: frags: fixups for linux-4.4.y
While this backport proposal is based on the 4.4.y stable tree, it might
also apply in some form to any stable tree which backported

  05c0b86b96: "ipv6: frags: rewrite ip6_expire_frag_queue()"

While this made ip6_expire_frag_queue() similar to ip_expire(), it did not
follow the additional changes to ip_expire() which were also backported:

  fa0f527358: "ip: use rb trees for IP frag queue."
  a4fd284a1f: "ip: process in-order fragments efficiently"

The former of the two not only adds handling for rb trees, but also
modifies ip_expire() to take the first skb off the queue before using it
for sending the ICMP message. This also got rid of the need to protect the
skb by incrementing its reference count (which is the reason for the crash
in ip6_expire_frag_queue()).

My first approach was to do those changes in ip6_expire_frag_queue(), but
only the former of the two can be done without problems. The latter uses
code which is only locally defined in ipv4/ip_fragment.c. This was changed
upstream in 5.1 when moving code around to be shared:

  c23f35d19d: "net: IP defrag: encapsulate rbtree defrag code into callable functions"

And while backporting that I found two other changes which sounded like one
might want them backported, too. Maybe even more so since the second ("ip:
fail fast on IP defrag errors") is already partially included in the
backport of "net: ipv4: do not handle duplicate fragments as overlapping".

Though I do realize that "net: IP defrag: encapsulate rbtree defrag code
into callable functions" is rather large and for that reason maybe not
qualifying as a stable backport. So I would like to ask what the
net-developers think about this.
Thanks,
Stefan

0001: v4.20: ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
0002: v4.20: ip: fail fast on IP defrag errors
0003: v5.1 : net: IP defrag: encapsulate rbtree defrag code into callable functions
0004: n/a  : ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()

Jiri Wiesner (1):
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

Peter Oskolkov (2):
  ip: fail fast on IP defrag errors
  net: IP defrag: encapsulate rbtree defrag code into callable functions

Stefan Bader (1):
  ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()

 include/net/inet_frag.h                 |  16 +-
 net/ipv4/inet_fragment.c                | 293 +++
 net/ipv4/ip_fragment.c                  | 294 +++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |   8 +-
 net/ipv6/reassembly.c                   |  20 +-
 5 files changed, 359 insertions(+), 272 deletions(-)

-- 
2.17.1
[PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
From: Jiri Wiesner

The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4"
  #5 [880621563d48] _raw_spin_lock at 815f5480
  #6 [880621563d48] inet_evict_bucket at 8158020b
  #7 [880621563d80] inet_frags_exit_net at 8158051c
  #8 [880621563db0] ops_exit_list at 814f5856
  #9 [880621563dd8] cleanup_net at 814f67c0
 #10 [880621563e38] process_one_work at 81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner
Reported-by: Per Sundstrom
Acked-by: Peter Oskolkov
Signed-off-by: David S. Miller

(backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
[smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
Signed-off-by: Stefan Bader
---
 net/ipv4/ip_fragment.c                  |  7 +++++++
 net/ipv6/netfilter/nf_conntrack_reasm.c |  8 +++++++-
 net/ipv6/reassembly.c                   |  8 +++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 2c75330b1c71..5387e6ab78d7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -519,6 +519,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	struct rb_node *rbn;
 	int len;
 	int ihlen;
+	int delta;
 	int err;
 	u8 ecn;
 
@@ -560,10 +561,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	if (len > 65535)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC))
 		goto out_nomem;
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(qp->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 664c84e47bab..e5d06ceb2c0c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -353,7 +353,7 @@ static struct sk_buff *
 nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 {
 	struct sk_buff *fp, *op, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	u8 ecn;
 
 	inet_frag_kill(&fq->q);
 
@@ -374,12 +374,18 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		goto out_oversize;
 	}
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC)) {
 		pr_debug("skb is cloned but can't expand head");
 		goto out_oom;
 	}
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(fq->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ec917f58d105..020cf70273c9 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -343,7 +343,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 {
 	struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
 	struct sk_buff *fp, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	unsigned int nhoff;
 	int sum_truesize;
 	u8 ecn;
 
@@ -384,10 +384,16 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	if (payload_len > IPV6_MAXPLEN)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/*
[PATCH 3/4] net: IP defrag: encapsulate rbtree defrag code into callable functions
From: Peter Oskolkov

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov
Cc: Eric Dumazet
Cc: Florian Westphal
Cc: Tom Herbert
Signed-off-by: David S. Miller

(backported from commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f)
[smb: open code skb_mark_not_on_list(), context adjustments,
 use of IP_INC_STATS_BH instead of __IP_INC_STATS]
Signed-off-by: Stefan Bader
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 288 --
 3 files changed, 332 insertions(+), 265 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6260ec146142..7c8b06302ade 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,8 +75,8 @@ struct inet_frag_queue {
 	struct timer_list timer;
 	spinlock_t lock;
 	atomic_t	refcnt;
-	struct sk_buff	*fragments;    /* Used in IPv6. */
-	struct rb_root	rb_fragments;  /* Used in IPv4. */
+	struct sk_buff	*fragments;    /* used in 6lopwpan IPv6. */
+	struct rb_root	rb_fragments;  /* Used in IPv4/IPv6. */
 	struct sk_buff	*fragments_tail;
 	struct sk_buff	*last_run_head;
 	ktime_t stamp;
@@ -152,4 +152,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK	0
+#define IPFRAG_DUP	1
+#define IPFRAG_OVERLAP	2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+			   int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+			      struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+			    void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c03e5f5859e1..5c167efd54bd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
 #include
 #include
 #include
+#include
+#include
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+	union {
+		struct inet_skb_parm	h4;
+		struct inet6_skb_parm	h6;
+	};
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
+};
+
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+				   struct sk_buff *skb)
+{
+	fragcb_clear(skb);
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+	fragcb_clear(skb);
+
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -130,6 +186,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
 	kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)-
[PATCH 4/4] ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()
With frag code shared between IPv4 and IPv6 it is now possible to use
inet_frag_pull_head() in ip6_expire_frag_queue() in order to properly
extract the first skb from the frags queue. Since this really takes the
skb out of the list, it no longer needs to be protected against removal
(which implicitly fixes a crash that will happen somewhere in the call
to icmp6_send() when the use count is forcefully checked to be 1).

kernel BUG at linux-4.4.0/net/core/skbuff.c:1207!
RIP: 0010:[] [] pskb_expand_head+0x243/0x250
[] __pskb_pull_tail+0x50/0x350
[] _decode_session6+0x26a/0x400
[] __xfrm_decode_session+0x39/0x50
[] icmpv6_route_lookup+0xf0/0x1c0
[] icmp6_send+0x5e1/0x940
[] icmpv6_send+0x21/0x30
[] ip6_expire_frag_queue+0xe0/0x120

Signed-off-by: Stefan Bader
---
 net/ipv6/reassembly.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 020cf70273c9..2bd45abc2516 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -110,16 +110,14 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 	IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
 	/* Don't send error if the first segment did not arrive. */
-	head = fq->q.fragments;
-	if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+		goto out;
+
+	head = inet_frag_pull_head(&fq->q);
+	if (!head)
 		goto out;
 
-	/* But use as source device on which LAST ARRIVED
-	 * segment was received. And do not use fq->dev
-	 * pointer directly, device might already disappeared.
-	 */
 	head->dev = dev;
-	skb_get(head);
 	spin_unlock(&fq->q.lock);
 
 	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-- 
2.17.1
[PATCH 2/4] ip: fail fast on IP defrag errors
From: Peter Oskolkov

The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected, the whole
frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see the "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.

Signed-off-by: Peter Oskolkov
Reviewed-by: Eric Dumazet
Reviewed-by: Willem de Bruijn
Signed-off-by: David S. Miller
(backported from commit 0ff89efb524631ac9901b81446b453c29711c376)
[smb: context adjustments and ignoring those changes already done in
 backport for "net: ipv4: do not handle duplicate fragments as
 overlapping"]
Signed-off-by: Stefan Bader
---
 net/ipv4/ip_fragment.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5387e6ab78d7..a53652c8c0fd 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		 */
 		if (end < qp->q.len ||
 		    ((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
-			goto err;
+			goto discard_qp;
 		qp->q.flags |= INET_FRAG_LAST_IN;
 		qp->q.len = end;
 	} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		if (end > qp->q.len) {
 			/* Some bits beyond end -> corruption. */
 			if (qp->q.flags & INET_FRAG_LAST_IN)
-				goto err;
+				goto discard_qp;
 			qp->q.len = end;
 		}
 	}
 	if (end == offset)
-		goto err;
+		goto discard_qp;
 
 	err = -ENOMEM;
 	if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
-		goto err;
+		goto discard_qp;
 
 	err = pskb_trim_rcsum(skb, end - offset);
 	if (err)
-		goto err;
+		goto discard_qp;
 
 	/* Note : skb->rbnode and skb->dev share the same location. */
 	dev = skb->dev;
@@ -434,7 +434,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	/* This is the common case: skb goes to the end. */
 	/* Detect and discard overlaps. */
 	if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
-		goto discard_qp;
+		goto overlap;
 	if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
 		ip4_frag_append_to_last_run(&qp->q, skb);
 	else
@@ -457,7 +457,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 			    end <= skb1_run_end)
 				goto err; /* No new data, potential duplicate */
 			else
-				goto discard_qp; /* Found an overlap */
+				goto overlap; /* Found an overlap */
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
 		 * one of its NULL left/right children. Insert skb.
 		 */
@@ -494,15 +494,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		skb->_skb_refdst = 0UL;
 		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
+		if (err)
+			inet_frag_kill(&qp->q);
 		return err;
 	}
 
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+overlap:
+	IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
 discard_qp:
 	inet_frag_kill(&qp->q);
-	IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
-- 
2.17.1
Re: [PATCH AUTOSEL 5.1 011/375] ip6: fix skb leak in ip6frag_expire_frag_queue()
On 22.05.19 21:15, Sasha Levin wrote:
> From: Eric Dumazet
>
> [ Upstream commit 47d3d7fdb10a21c223036b58bd70ffdc24a472c4 ]
>
> Since ip6frag_expire_frag_queue() now pulls the head skb
> from frag queue, we should no longer use skb_get(), since
> this leads to an skb leak.
>
> Stefan Bader initially reported a problem in 4.4.stable [1] caused
> by the skb_get(), so this patch should also fix this issue.

Just to let everybody know: while changing this has fixed the BUG_ON
problem while sending (in 4.4), it now crashes when releasing just a
little later. Still feels like the right direction but not complete,
yet.

-Stefan

> [296583.091021] kernel BUG at
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749] [] __pskb_pull_tail+0x50/0x350
> [296583.091764] [] _decode_session6+0x26a/0x400
> [296583.091779] [] __xfrm_decode_session+0x39/0x50
> [296583.091795] [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809] [] icmp6_send+0x5e1/0x940
> [296583.091823] [] ? __netif_receive_skb+0x18/0x60
> [296583.091838] [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858] [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876] [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
> [296583.091893] [] icmpv6_send+0x21/0x30
> [296583.091906] [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921] [] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
> [296583.091938] [] call_timer_fn+0x37/0x140
> [296583.091951] [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
> [296583.091968] [] run_timer_softirq+0x234/0x330
> [296583.091982] [] __do_softirq+0x109/0x2b0
>
> Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> Signed-off-by: Eric Dumazet
> Reported-by: Stefan Bader
> Cc: Peter Oskolkov
> Cc: Florian Westphal
> Signed-off-by: David S. Miller
> Signed-off-by: Sasha Levin
> ---
> include/net/ipv6_frag.h | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> index 28aa9b30aecea..1f77fb4dc79df 100644
> --- a/include/net/ipv6_frag.h
> +++ b/include/net/ipv6_frag.h
> @@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
> 	goto out;
>
> 	head->dev = dev;
> -	skb_get(head);
> 	spin_unlock(&fq->q.lock);
>
> 	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
Re: [PATCH 4.4 018/101] netfilter: synproxy: fix conntrackd interaction
On 03.07.2017 15:34, Greg Kroah-Hartman wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.

We found that pulling below patch into stable trees without also pulling

commit 9c3f3794926a997b1cab6c42480ff300efa2d162
Author: Liping Zhang
Date:   Sat Mar 25 16:35:29 2017 +0800

    netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister

will result in a regression, at least in 4.4.y[1]. Stable maintainers who
picked up below patch might want to consider picking up above fix.

-Stefan

[1] http://bugs.launchpad.net/bugs/1709032

> --
>
> From: Eric Leblond
>
> commit 87e94dbc210a720a34be5c1174faee5c84be963e upstream.
>
> This patch fixes the creation of connection tracking entry from
> netlink when synproxy is used. It was missing the addition of
> the synproxy extension.
>
> This was causing kernel crashes when a conntrack entry created by
> conntrackd was used after the switch of traffic from active node
> to the passive node.
>
> Signed-off-by: Eric Leblond
> Signed-off-by: Pablo Neira Ayuso
> Signed-off-by: Greg Kroah-Hartman
>
> ---
> net/netfilter/nf_conntrack_netlink.c | 4
> 1 file changed, 4 insertions(+)
>
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -45,6 +45,8 @@
> #include
> #include
> #include
> +#include
> +#include
> #ifdef CONFIG_NF_NAT_NEEDED
> #include
> #include
> @@ -1798,6 +1800,8 @@ ctnetlink_create_conntrack(struct net *n
> 	nf_ct_tstamp_ext_add(ct, GFP_ATOMIC);
> 	nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC);
> 	nf_ct_labels_ext_add(ct);
> +	nfct_seqadj_ext_add(ct);
> +	nfct_synproxy_ext_add(ct);
>
> 	/* we must add conntrack extensions before confirmation. */
> 	ct->status |= IPS_CONFIRMED;
[PATCH] bcache: Fix bcache device names
When adding partition support to bcache, the name assignment was not
updated, resulting in numbers jumping (bcache0, bcache16, bcache32...).
Fix this by taking BCACHE_MINORS into account when assigning the disk
name.

BugLink: https://bugs.launchpad.net/bugs/1667078
Fixes: b8c0d91 (bcache: partition support: add 16 minors per bcacheN device)
Cc: <sta...@vger.kernel.org> # v4.10
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
---
 drivers/md/bcache/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21..817e155 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -793,7 +793,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 	}
 
 	set_capacity(d->disk, sectors);
-	snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i", minor);
+	snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i",
+		 minor / BCACHE_MINORS);
 
 	d->disk->major		= bcache_major;
 	d->disk->first_minor	= minor;
-- 
2.7.4
Re: bcache super block corruption with non 4k pages
On 28.07.2016 18:27, Stefan Bader wrote:
> On 28.07.2016 07:55, Kent Overstreet wrote:
>> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>>> So here is another attempt which does half the proposed changes. And
>>> before you point out that it looks still ugly, let me explain some
>>> reasons. The goal for me would be to have something as small as
>>> possible to be acceptable as stable change. And the part about
>>> putting a bio on the stack and using submit_bio_wait: I believe you
>>> meant in read_super to replace the __bread. I was thinking about that
>>> but in the end it seemed to make the change unnecessary big. Whether
>>> using __bread or submit_bio_wait, in both cases and without needing
>>> to make more changes on the write side, read_super has to return the
>>> in-memory and on-disk variant of the superblock. So as long as
>>> nothing that is related to __bread is leaked out of read_super, it is
>>> much better than what is there now. And I remain as small as possible
>>> for the delta.
>>
>> I like that approach much better. I suppose it's not _strictly_
>> necessary to rip out the __bread()...
>>
>> Only other comment is that you shouldn't have to pass the buffer to
>> __write_super() - I'd just move the bch_bio_map() call to when the
>> struct cache/cached_dev is being initialized (or rip it out and
>> initialize the bvec by hand) and never touch it after that.
>
> Hm, doing that would save three simple changes to add a new argument to
> that private functions at the cost of having the map call twice and a
> (more) complicated calculation of the
>>
>>> So there is one part of the patch which I find hard to do in a better
>>> manner but is a bit ugly: and that is to sanely free the sb_disk_data
>>> blob on all error paths but not on success when it is assigned to
>>> either cache or cached_dev. Could possibly pass the address of the
>>> pointer and then clear it inside the called functions.
>>> Not sure that would be much less strange...
>>
>> Yeah that is a hassle - that's why in the 4k superblocks patch in
>> bcache-dev I added that "disk_sb" struct - it owns the buffer and
>> random other crap. You could read that patch for ideas if you want,
>> look at how it transfers ownership of the disk_sb.
>>
>
> I had a look but it felt like I could get into too much follow-up
> changes going that path. But I think I got it simpler now. One note
> about that area: both register calls can run into problems but only
> one actually returns that status. And both do not seem to free the
> allocated structures (cache or cache_dev). It is at least not obvious
> whether this is ever done.
> I worked around this by moving the assignment of the buffer page and
> the mapping to a place where an error exit no longer is possible. So
> the release functions will only see a non NULL pointer if things went
> well (reality required to revise that a little bit as one of the
> register calls that can fail is actually doing the write).
>
> So now there is just one oddness: when I am testing this (unfortunately
> right now I only have a normal 4k case), after calling make-bcache with
> cache and backing device, this all looks great and debugging shows the
> __write_super being called. But reading the superblock from userspace
> will return the old data until I stop the bcache device and unregister
> the cache (which does not show any further writes).
> I cannot decide what I should think here...

So it's __bread doing cached operations, and after not (ab)using the
page from there, the writes will never cause the cached old result to go
away until the device gets closed.

So (hopefully) last attempt. Also drop __bread and use submit_bio_wait.
This seems to work. On x64 and ppc64le with 64k pages.

So the only pain with this is that upstream seems to have re-organized
the submit_bio functions in the upcoming v4.7-rc2 kernel. So it is
without rework only applicable to older kernels.
I was using a 4.4 one, 4.6 in theory should be ok...

-Stefan

From 91bbee69d19b59ec3a08a2c4e7c92747a23b3956 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than
 4k page size

There is no guarantee that the superblock which __bread returns in a
buffer_head starts at offset 0 when an architecture has bigger pages
than 4k (the used sector size). This is the attempt to fix this with
the minimum amount of change by having
Re: bcache super block corruption with non 4k pages
On 28.07.2016 07:55, Kent Overstreet wrote:
> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>> So here is another attempt which does half the proposed changes. And
>> before you point out that it looks still ugly, let me explain some
>> reasons. The goal for me would be to have something as small as
>> possible to be acceptable as stable change. And the part about putting
>> a bio on the stack and using submit_bio_wait: I believe you meant in
>> read_super to replace the __bread. I was thinking about that but in
>> the end it seemed to make the change unnecessary big. Whether using
>> __bread or submit_bio_wait, in both cases and without needing to make
>> more changes on the write side, read_super has to return the in-memory
>> and on-disk variant of the superblock. So as long as nothing that is
>> related to __bread is leaked out of read_super, it is much better than
>> what is there now. And I remain as small as possible for the delta.
>
> I like that approach much better. I suppose it's not _strictly_
> necessary to rip out the __bread()...
>
> Only other comment is that you shouldn't have to pass the buffer to
> __write_super() - I'd just move the bch_bio_map() call to when the
> struct cache/cached_dev is being initialized (or rip it out and
> initialize the bvec by hand) and never touch it after that.

Hm, doing that would save three simple changes to add a new argument to
that private functions at the cost of having the map call twice and a
(more) complicated calculation of the
>
>> So there is one part of the patch which I find hard to do in a better
>> manner but is a bit ugly: and that is to sanely free the sb_disk_data
>> blob on all error paths but not on success when it is assigned to
>> either cache or cached_dev. Could possibly pass the address of the
>> pointer and then clear it inside the called functions. Not sure that
>> would be much less strange...
>
> Yeah that is a hassle - that's why in the 4k superblocks patch in
> bcache-dev I added that "disk_sb" struct - it owns the buffer and
> random other crap. You could read that patch for ideas if you want,
> look at how it transfers ownership of the disk_sb.
>

I had a look but it felt like I could get into too much follow-up
changes going that path. But I think I got it simpler now. One note
about that area: both register calls can run into problems but only one
actually returns that status. And both do not seem to free the allocated
structures (cache or cache_dev). It is at least not obvious whether this
is ever done.
I worked around this by moving the assignment of the buffer page and the
mapping to a place where an error exit no longer is possible. So the
release functions will only see a non NULL pointer if things went well
(reality required to revise that a little bit as one of the register
calls that can fail is actually doing the write).

So now there is just one oddness: when I am testing this (unfortunately
right now I only have a normal 4k case), after calling make-bcache with
cache and backing device, this all looks great and debugging shows the
__write_super being called. But reading the superblock from userspace
will return the old data until I stop the bcache device and unregister
the cache (which does not show any further writes).
I cannot decide what I should think here...

-Stefan

From 982a4ff25d4dbd114432b4b2f908182995f402a0 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than
 4k page size

There is no guarantee that the superblock which __bread returns in a
buffer_head starts at offset 0 when an architecture has bigger pages
than 4k (the used sector size).

This is the attempt to fix this with the minimum amount of change by
having a buffer allocated with kmalloc that holds the superblock data
as it is on disk.
This buffer can then be passed to bch_bio_map which will set up the
bio_vec correctly.

Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 58 --
 2 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a5..3c48927 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -295,6 +295,7 @@ struct cached_dev {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
@@ -382,6 +383,7 @@ struct cache {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct ko
Re: bcache super block corruption with non 4k pages
On 28.07.2016 07:55, Kent Overstreet wrote: > On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote: >> So here is another attempt which does half the proposed changes. And before >> you >> point out that it looks still ugly, let me explain some reasons. The goal >> for me >> would be to have something as small as possible to be acceptable as stable >> change. >> And the part about putting a bio on the stack and using submit_bio_wait: I >> believe you meant in read_super to replace the __bread. I was thinking about >> that but in the end it seemed to make the change unnecessary big. Whether >> using >> __bread or submit_bio_wait, in both cases and without needing to make more >> changes on the write side, read_super has to return the in-memory and on-disk >> variant of the superblock. So as long as nothing that is related to __bread >> is >> leaked out of read_super, it is much better than what is there now. And I >> remain >> as small as possible for the delta. > > I like that approach much better. I suppose it's not _strictly_ necessary to > rip > out the __bread()... > > Only other comment is that you shouldn't have to pass the buffer to > __write_super() - I'd just move the bch_bio_map() call to when the struct > cache/cached_dev is being initialized (or rip it out and initialize the bvec > by > hand) and never touch it after that. Hm, doing that would save three simple changes to add a new argument to that private functions at the cost of haven the map call twice and a (more) complicated calculation of the > >> So there is one part of the patch which I find hard to do in a better manner >> but >> is a bit ugly: and that is to sanely free the sb_disk_data blob on all error >> paths but not on success when it is assigned to either cache or cached_dev. >> Could possibly pass the address of the pointer and then clear it inside the >> called functions. Not sure that would be much less strange... 
> > Yeah that is a hassle - that's why in the 4k superblocks patch in bcache-dev I > added that "disk_sb" struct - it owns the buffer and random other crap. You > could read that patch for ideas if you want, look at how it transfers > ownership > of the disk_sb. > I had a look but it felt like I could get into too much follow-up changes going that path. But I think I got it simpler now. One note about that area: both register calls can run into problems but only one actually returns that status. And both do not seem to free the allocated structures (cache or cache_dev). It is at least not obvious whether this is ever done. I working around this by moving the assignment of the buffer page and the mapping to a place where an error exit no longer is possible. So the release functions will only see a non NULL pointer if things went well (reality required to revise that a little bit as one of the register calls that can fail is actually doing the write). So now there is just one oddness: when I am testing this (unfortunately right now I only have a normal 4k case), after calling make-bache with cache and backing device, this all looks great and debugging shows the __write_super being called. But reading the from userspace will return the old data until I stop the bcache device and unregister the cache (which does not show any further writes). I cannot decide what I should think here... -Stefan From 982a4ff25d4dbd114432b4b2f908182995f402a0 Mon Sep 17 00:00:00 2001 From: Stefan Bader Date: Tue, 26 Jul 2016 18:47:21 +0200 Subject: [PATCH] bcache: read_super: handle architectures with more than 4k page size There is no guarantee that the superblock which __bread returns in a buffer_head starts at offset 0 when an architecture has bigger pages than 4k (the used sector size). This is the attempt to fix this with the minimum amount of change by having a buffer allocated with kmalloc that holds the superblock data as it is on disk. 
This buffer can then be passed to bch_bio_map which will set up the bio_vec correctly. Signed-off-by: Stefan Bader --- drivers/md/bcache/bcache.h | 2 ++ drivers/md/bcache/super.c | 58 -- 2 files changed, 37 insertions(+), 23 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 6b420a5..3c48927 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -295,6 +295,7 @@ struct cached_dev { struct cache_sb sb; struct bio sb_bio; struct bio_vec sb_bv[1]; + void *sb_disk_data; struct closure sb_write; struct semaphore sb_write_mutex; @@ -382,6 +383,7 @@ struct cache { struct cache_sb sb; struct bio sb_bio; struct bio_vec sb_bv[1]; + void *sb_disk_data; struct kobject kobj; struct block_device *bdev; diff --git a/drivers/md/
Re: bcache super block corruption with non 4k pages
On 26.07.2016 14:49, Kent Overstreet wrote: > On Tue, Jul 26, 2016 at 02:32:31PM +0200, Stefan Bader wrote: >> On 26.07.2016 12:21, Kent Overstreet wrote: >>> On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote: >>>> On 21.07.2016 10:58, Stefan Bader wrote: >>>>> I was pointed at the thread which seems to address the same after >>>>> I wrote most of below text. Did not want to re-write this so please >>>>> bear with the odd layout. >>>>> >>>>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html >>>>> >>>>> Zhengyuan tries to fix the problem by relocating the superblock on >>>>> disk. But I am not sure whether there is really any guarantee about >>>>> how __bread fills data into the buffer_head. What if there is the next >>>>> odd arch with 128K pages? >>>>> >>>>> So below is an attempt to be more generic. Still I don't feel completely >>>>> happy with the way that a page moves (or is shared) between buffer_head >>>>> and biovec. What I tried to outline below is to let the register functions >>>>> allocate bio+biovec memory and use the in-memory sb_cache data to >>>>> initialize >>>>> the biovec buffer. >>>> >>>> Any opinions here? Also adding LKML as I don't seem to get through >>>> moderation on >>>> dm-devel. >>> >>> The correct solution is to rip out the __bread() and just read the >>> superblock by >>> issuing a bio, the same way all the other IO in bcache is done. >>> >>> This is the way it's done in the bcache-dev branch - unfortunately, the >>> patch >>> that does that in bcache-dev is big and invasive and probably not worth the >>> hassle to backport: >>> >>> https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1 >> >> I agree that this looks better and also rather large. >>> >>> Probably best to just do something small and localized. >>> >> So what did you think about the change I did? It seemed to be ok for 4K and >> 64K >> at least and is rather small. 
And I believe that, compared to Zhengyuan's >> approach this would have the benefit of not changing the superblock sector. >> So >> it would be compatible with previously created superblocks. > > Too ugly to live. Just kmalloc() 4k, allocate a bio on the stack, set it up, > and > submit it with submit_bio_wait(). Use virt_to_page(), don't bother with raw > pages - you want 4k, not whatever the page size is. > So here is another attempt which does half the proposed changes. And before you point out that it looks still ugly, let me explain some reasons. The goal for me would be to have something as small as possible to be acceptable as stable change. And the part about putting a bio on the stack and using submit_bio_wait: I believe you meant in read_super to replace the __bread. I was thinking about that but in the end it seemed to make the change unnecessarily big. Whether using __bread or submit_bio_wait, in both cases and without needing to make more changes on the write side, read_super has to return the in-memory and on-disk variant of the superblock. So as long as nothing that is related to __bread is leaked out of read_super, it is much better than what is there now. And I remain as small as possible for the delta. So there is one part of the patch which I find hard to do in a better manner but is a bit ugly: and that is to sanely free the sb_disk_data blob on all error paths but not on success when it is assigned to either cache or cached_dev. Could possibly pass the address of the pointer and then clear it inside the called functions. Not sure that would be much less strange... 
-Stefan From 391682e2329a6f8608bfc7628b6c8a0afa9a5d98 Mon Sep 17 00:00:00 2001 From: Stefan Bader <stefan.ba...@canonical.com> Date: Tue, 26 Jul 2016 18:47:21 +0200 Subject: [PATCH] bcache: read_super: handle architectures with more than 4k page size There is no guarantee that the superblock which __bread returns in a buffer_head starts at offset 0 when an architecture has bigger pages than 4k (the used sector size). This is the attempt to fix this with the minimum amount of change by having a buffer allocated with kmalloc that holds the superblock data as it is on disk. This buffer can then be passed to bch_bio_map which will set up the bio_vec correctly. Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> --- drivers/md/bcache/bcache.h | 2 ++ drivers/md/bcache/super.c | 61 ++ 2 files chan
Re: bcache super block corruption with non 4k pages
On 26.07.2016 12:21, Kent Overstreet wrote: > On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote: >> On 21.07.2016 10:58, Stefan Bader wrote: >>> I was pointed at the thread which seems to address the same after >>> I wrote most of below text. Did not want to re-write this so please >>> bear with the odd layout. >>> >>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html >>> >>> Zhengyuan tries to fix the problem by relocating the superblock on >>> disk. But I am not sure whether there is really any guarantee about >>> how __bread fills data into the buffer_head. What if there is the next >>> odd arch with 128K pages? >>> >>> So below is an attempt to be more generic. Still I don't feel completely >>> happy with the way that a page moves (or is shared) between buffer_head >>> and biovec. What I tried to outline below is to let the register functions >>> allocate bio+biovec memory and use the in-memory sb_cache data to initialize >>> the biovec buffer. >> >> Any opinions here? Also adding LKML as I don't seem to get through >> moderation on >> dm-devel. > > The correct solution is to rip out the __bread() and just read the superblock > by > issuing a bio, the same way all the other IO in bcache is done. > > This is the way it's done in the bcache-dev branch - unfortunately, the patch > that does that in bcache-dev is big and invasive and probably not worth the > hassle to backport: > > https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1 I agree that this looks better and also rather large. > > Probably best to just do something small and localized. > So what did you think about the change I did? It seemed to be ok for 4K and 64K at least and is rather small. And I believe that, compared to Zhengyuan's approach this would have the benefit of not changing the superblock sector. So it would be compatible with previously created superblocks. 
-Stefan From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001 From: Stefan Bader <stefan.ba...@canonical.com> Date: Wed, 20 Jul 2016 12:06:27 +0200 Subject: [PATCH] bcache: handle non 4k pages returned by __bread On non-x86 arches pages can be bigger than 4k. Currently read_super returns a reference to the page used as buffer in the buffer_head that is returned by __bread(). With 4k page size and a requested read this normally ends up with the super block data starting at offset 0. But as seen on ppc64le with 64k page size, the data can start at an offset (from what I saw the offset was 4k). This causes harm later on when __write_super() maps the super block data at the start of the page acquired before and also does not write back all fields (particularly the super block magic). Try to avoid this by also returning the potential offset within the sb_page from read_super() and fully initialize the single bvec of the bio used for __write_super() calls. Doing that is the same as would have been done in bch_bio_map() which now must not be used in __write_super(). 
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> --- drivers/md/bcache/super.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index e169739..3e0d2de 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq; /* Superblock */ static const char *read_super(struct cache_sb *sb, struct block_device *bdev, - struct page **res) + struct page **res, unsigned int *off) { const char *err; struct cache_sb *s; @@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev, err = NULL; get_page(bh->b_page); + *off = (unsigned int) (bh->b_data - ((char *) page_address(bh->b_page))); *res = bh->b_page; err: put_bh(bh); @@ -202,19 +203,19 @@ static void write_bdev_super_endio(struct bio *bio) { struct cached_dev *dc = bio->bi_private; /* XXX: error checking */ - closure_put(&dc->sb_write); } static void __write_super(struct cache_sb *sb, struct bio *bio) { - struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page); + struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) + + bio->bi_io_vec[0].bv_offset; unsigned i; bio->bi_iter.bi_sector = SB_SECTOR; bio->bi_rw = REQ_SYNC|REQ_META; bio->bi_iter.bi_size = SB_SIZE; - bch_bio_map(bio, NULL); + // bch_bio_map(bio, NULL); out->offset = cpu_to_le64(sb->offset); out->version = cpu_to_le64(sb->version); @@ -1139,6 +1140,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size) /* Cached device - bcac
Re: bcache super block corruption with non 4k pages
On 21.07.2016 10:58, Stefan Bader wrote: > I was pointed at the thread which seems to address the same after > I wrote most of below text. Did not want to re-write this so please > bear with the odd layout. > > https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html > > Zhengyuan tries to fix the problem by relocating the superblock on > disk. But I am not sure whether there is really any guarantee about > how __bread fills data into the buffer_head. What if there is the next > odd arch with 128K pages? > > So below is an attempt to be more generic. Still I don't feel completely > happy with the way that a page moves (or is shared) between buffer_head > and biovec. What I tried to outline below is to let the register functions > allocate bio+biovec memory and use the in-memory sb_cache data to initialize > the biovec buffer. Any opinions here? Also adding LKML as I don't seem to get through moderation on dm-devel. > > -Stefan > > > --- > > This was seen on ppc64le with 64k page size. The problem is that > in that case __bread returns a page where b_data is not at the > start of the page. And I am not really sure that the way bcache > re-purposes the page returned by __bread to be used as biovec > element is really a good idea. > > The way it is now, I saw __bread return a 64k page where b_data > was starting at 4k offset. Then __write_super was modifying some > data at offset 0 but for example not writing the magic number again. > > Not sure why (maybe this messes up flushes, too) but the bad data was not > immediately written back when the devices were registered. So at that time > bcache-super-show would report data as it was written by user-space (like > the cache type still 0 and not 3, and claiming the cache to still be > detached). > > But when stopping the cache and unregistering the cache set this changed > and bcache-super-show was reporting a bad magic number (as expected). 
> > The patch below works around this (tested with 64k and 4k pages) but I > am not really sure this is a clean approach. My gut feeling would be that > the bio structures should be allocated separately (I think there is a way > of allocating a bioset without using a pool). Then that separate page could > be pre-initialized from the super block structure without passing around > a page that feels more private to __bread usage... > > -Stefan > > > > From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001 > From: Stefan Bader <stefan.ba...@canonical.com> > Date: Wed, 20 Jul 2016 12:06:27 +0200 > Subject: [PATCH] bcache: handle non 4k pages returned by __bread > > On non-x86 arches pages can be bigger than 4k. Currently read_super > returns a reference to the page used as buffer in the buffer_head > that is returned by __bread(). > With 4k page size and a requested read this normally ends up with > the super block data starting at offset 0. But as seen on ppc64le > with 64k page size, the data can start at an offset (from what I > saw the offset was 4k). > This causes harm later on when __write_super() maps the super > block data at the start of the page acquired before and also > does not write back all fields (particularly the super block magic). > > Try to avoid this by also returning the potential offset within the > sb_page from read_super() and fully initialize the single bvec of > the bio used for __write_super() calls. Doing that is the same as > would have been done in bch_bio_map() which now must not be used in > __write_super(). 
> > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > > removedebug > > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > --- > drivers/md/bcache/super.c | 22 +++--- > 1 file changed, 15 insertions(+), 7 deletions(-) > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index e169739..3e0d2de 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq; > /* Superblock */ > > static const char *read_super(struct cache_sb *sb, struct block_device *bdev, > - struct page **res) > + struct page **res, unsigned int *off) > { > const char *err; > struct cache_sb *s; > @@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct > block_device *bdev, > err = NULL; > > get_page(bh->b_page); > + *off = (unsigned int) (bh->b_data - ((char *) > page_address(bh->b_page))); > *res = bh->b_page; > err: > put_bh(bh); > @@ -202,19 +203,19 @@ static void write_bdev_super_endio(str
Re: mm: Use phys_addr_t for reserve_bootmem_region arguments
On 17.05.2016 15:20, Stefan Bader wrote: > Re-posting to a hopefully better suited audience. I hit this problem > when trying to boot an i386 dom0 (PAE enabled) on a 64bit Xen host using > a config which would result in a reserved memory range starting at 4MB. Of course that ^ should be "starting at 4GB" not MB... > Due to the usage of unsigned long as arguments for start address and > length, this would wrap and actually mark the lower memory range starting > from 0 as reserved. Between kernel version 4.2 and 4.4 this somehow boots > but starting with 4.4 the result is a panic and reboot. > > Not sure this special Xen case is the only one affected, but in general > it seems more correct to use phys_addr_t as the type for start and end > as that is the type used in the memblock region definitions and those > are 64bit (at least with PAE enabled). > > -Stefan > > > > From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001 > From: Stefan Bader <stefan.ba...@canonical.com> > Date: Tue, 10 May 2016 19:05:16 +0200 > Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments > > Since 92923ca the reserved bit is set on reserved memblock regions. > However start and end address are passed as unsigned long. This is > only 32bit on i386, so it can end up marking the wrong pages reserved > for ranges at 4GB and above. > > This was observed on a 32bit Xen dom0 which was booted with initial > memory set to a value below 4G but allowing to balloon in memory > (dom0_mem=1024M for example). This would define a reserved bootmem > region for the additional memory (for example on an 8GB system there was > a reserved region covering the 4GB-8GB range). But since the addresses > were passed on as unsigned long, this was actually marking all pages > from 0 to 4GB as reserved. 
> > Fixes: 92923ca "mm: meminit: only set page reserved in the memblock region" > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > Cc: <sta...@kernel.org> # 4.2+ > --- > include/linux/mm.h | 2 +- > mm/page_alloc.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index b56ff72..4c1ff62 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1715,7 +1715,7 @@ extern void free_highmem_page(struct page *page); > extern void adjust_managed_page_count(struct page *page, long count); > extern void mem_init_print_info(const char *str); > > -extern void reserve_bootmem_region(unsigned long start, unsigned long end); > +extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > /* Free the reserved page into the buddy system, so it gets managed. */ > static inline void __free_reserved_page(struct page *page) > diff --git a/mm/page_alloc.c > b/mm/page_alloc.c > index c69531a..eb66f89 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -951,7 +951,7 @@ static inline void init_reserved_page(unsigned long pfn) > * marks the pages PageReserved. The remaining valid pages are later > * sent to the buddy page allocator. > */ > -void __meminit reserve_bootmem_region(unsigned long start, unsigned long end) > +void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > { > unsigned long start_pfn = PFN_DOWN(start); > unsigned long end_pfn = PFN_UP(end);
mm: Use phys_addr_t for reserve_bootmem_region arguments
Re-posting to a hopefully better suited audience. I hit this problem when
trying to boot a i386 dom0 (PAE enabled) on a 64bit Xen host using a config
which would result in a reserved memory range starting at 4MB. Due to the
usage of unsigned long as arguments for start address and length, this
would wrap and actually mark the lower memory range starting from 0 as
reserved. Between kernel versions 4.2 and 4.4 this somehow boots but
starting with 4.4 the result is a panic and reboot.

Not sure this special Xen case is the only one affected, but in general it
seems more correct to use phys_addr_t as the type for start and end as that
is the type used in the memblock region definitions and those are 64bit (at
least with PAE enabled).

-Stefan

From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 10 May 2016 19:05:16 +0200
Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments

Since 92923ca the reserved bit is set on reserved memblock regions.
However start and end address are passed as unsigned long. This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G but allowing to balloon in memory
(dom0_mem=1024M for example). This would define a reserved bootmem
region for the additional memory (for example on a 8GB system there was
a reserved region covering the 4GB-8GB range). But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca ("mm: meminit: only set page reserved in the memblock region")
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
Cc: <sta...@kernel.org> # 4.2+
---
 include/linux/mm.h | 2 +-
 mm/page_alloc.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b56ff72..4c1ff62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1715,7 +1715,7 @@ extern void free_highmem_page(struct page *page);
 extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
-extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
 
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c69531a..eb66f89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -951,7 +951,7 @@ static inline void init_reserved_page(unsigned long pfn)
  * marks the pages PageReserved. The remaining valid pages are later
  * sent to the buddy page allocator.
  */
-void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
+void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long end_pfn = PFN_UP(end);
-- 
1.9.1
Re: [Xen-devel] bad page flags booting 32bit dom0 on 64bit hypervisor using dom0_mem (kernel >=4.2)
On 02.05.2016 16:24, Stefan Bader wrote:
> On 02.05.2016 13:41, Juergen Gross wrote:
>> On 02/05/16 12:47, Stefan Bader wrote:
>>> I recently tried to boot 32bit dom0 on 64bit Xen host which I
>>> configured to run with a limited, fixed amount of memory for dom0. It
>>> seems that somewhere between kernel versions 3.19 and 4.2 (sorry that
>>> is still a wide range) the Linux kernel would report bad page flags
>>> for a range of pages (which seem to be around the end of the guest pfn
>>> range). For a 4.2 kernel that was easily missed as the boot finished
>>> ok and dom0 was accessible. However starting with 4.4 (tested 4.5 and
>>> a 4.6-rc) the serial console output freezes after some of those bad
>>> page flag messages and then (unfortunately without any further helpful
>>> output) the host reboots (I assume there is a panic that triggers a
>>> reset).
>>>
>>> I suspect the problem is more a kernel side one. It is just possible
>>> to influence things by variation of dom0_mem=#,max:#. 512M seems ok;
>>> 1024M, 2048M, and 3072M cause bad page flags starting around kernel
>>> 4.2 and reboots around 4.4. Then 4096M and not clamping dom0 memory
>>> seem to be ok again (though not limiting dom0 memory seems to cause
>>> trouble on 32bit dom0 later when a domU tries to balloon memory, but I
>>> think that is a different problem).
>>>
>>> I have not seen this on a 64bit dom0. Below is an example of those bad
>>> page errors. Somehow it looks to be a page marked as reserved.
>>> Initially I wondered whether this could be a problem of not clearing
>>> page flags when moving mappings to match the e820. But I never looked
>>> into i386 memory setup in that detail. So I am posting this, hoping
>>> that someone may have an idea from the detail about where to look
>>> next. PAE is enabled there. Usually it's bpf init that gets hit but
>>> that likely is just because that is doing the first vmallocs.
>>
>> Could you please post the kernel config, Xen and dom0 boot parameters?
>> I'm quite sure this is no common problem as there are standard tests
>> running for each kernel version including 32 bit dom0 with limited
>> memory size.
>
> Hi Jürgen,
>
> sure. Though by doing that I realized where I actually messed the whole
> thing up. I got the max limit syntax completely wrong. :( Instead of the
> correct "dom0_mem=1024M,max:1024M" I am using "dom0_mem=1024M:max=1024M"
> which I guess is like not having max set at all. Not sure whether that
> is a valid use case.
>
> When I actually do the dom0_mem argument right, there are no bad page
> flag errors even in 4.4 with the 1024M limit. I was at least consistent
> in my mis-configuration, so doing the same stupid thing on 64bit seems
> to be handled more gracefully.
>
> Likely false alarm. But at least cutting the config into the mail made
> me spot the problem...

Ok, thinking that "dom0_mem=x" (without a max or min) still is a valid
case, I went ahead and did a bisect for when the bad page flag issue
started. I ended up at:

  92923ca ("mm: meminit: only set page reserved in the memblock region")

And with a few more printks in the new functions I finally realized why
this goes wrong. The new reserve_bootmem_region() is using unsigned long
for start and end addresses, which just isn't working too well for 32bit.
For Xen dom0 the problem with that can just be more easily triggered. When
dom0 memory is limited to a small size but allowed to balloon for more,
the additional system memory is put into reserved regions. In my case (a
host with 8G memory and say 1G initial dom0 memory) this created, apart
from others, one reserved region which started at 4GB and covered the
remaining 4G of host memory. Which reserve_bootmem_region() got as 0-4G
due to the unsigned long conversion. This basically marked *all* memory
below 4G as reserved.

The fix is relatively simple, just use phys_addr_t for start and end. I
tested this on 4.2 and 4.4 kernels. Both now boot without errors and
neither does the 4.4 kernel crash. Maybe still not 100% safe when running
on very large memory systems (if I did not get the math wrong, 16T) but at
least some improvement...

-Stefan

From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 10 May 2016 19:05:16 +0200
Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments

Since 92923ca the reserved bit is set on reserved memblock regions.
However start and end address
[PATCH resend] bcache: prevent crash on changing writeback_running
I sent this out quite a while ago. Back then it was said that it was
applied to the development tree. But it seems it never made its way
upstream. I re-tested with a 4.0.7 kernel and compared this against a 4.0
kernel with this patch applied. Still it is possible to trigger a NULL
pointer violation when writing 0 into writeback_running of a bcache device
that is not attached to a cache set (without the patch).

-Stefan

---
From e949c64fa6acbdaab999410250855a6a4fc6bad1 Mon Sep 17 00:00:00 2001
From: Stefan Bader
Date: Mon, 18 Aug 2014 20:00:13 +0200
Subject: [PATCH] bcache: prevent crash on changing writeback_running

commit a664d0f05a2e ("bcache: fix crash on shutdown in passthrough mode")
added a safeguard in the shutdown case. At least while not being attached
it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to wake
up the thread for that case.

Signed-off-by: Stefan Bader
---
 drivers/md/bcache/writeback.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 0a9dab1..073a042 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
 {
-	wake_up_process(dc->writeback_thread);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+		wake_up_process(dc->writeback_thread);
 }
 
 static inline void bch_writeback_add(struct cached_dev *dc)
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.10
On 18.03.2015 10:18, Paolo Bonzini wrote:
> On 18/03/2015 09:46, Stefan Bader wrote:
>> Regardless of that, I wonder whether the below (this version untested)
>> sounds acceptable for upstream? At least it would make debugging much
>> simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>  	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
>>
>>  	/* Ensure minimum (required) set of control bits are supported. */
>> -	if (ctl_min & ~ctl)
>> +	if (ctl_min & ~ctl) {
>> +		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>  		return -EIO;
>> +	}
>>
>>  	*result = ctl;
>>  	return 0;
>
> Yes, this is nice. Maybe -ENODEV.
>
> Also, a minimal patch for Ubuntu would probably be:
>
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> 		vmx_capability.ept, vmx_capability.vpid);
> 	}
>
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
>
> but I don't think it's a good idea to add it to stable kernels.

Sorry, I got a bit confused on my assumptions. While the change above does
cause guests to fail, the statement that this is caused by all host kernels
before this change was against better knowledge and wrong. The actual
situation is: hosts running 3.2, which (maybe not perfect but at least well
enough) allowed using nested vmx, will break for guest kernels <3.15. But
running 3.13 on the host has no issues.

Comparing the rdmsr values of guests between those two host kernels, I
found that on 3.2 the exit control msr was very sparsely initialized. And
looking at the changes between 3.2 and 3.13 I found

  commit 33fb20c39e98b90813b5ab2d9a0d6faa6300caca
  Author: Jan Kiszka
  Date:   Wed Mar 6 15:44:03 2013 +0100

      KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS

This was added in 3.10. So the range of affected host kernels is <3.10,
back to when nested vmx became somewhat usable. For 3.2 Ben (and obviously
us) would be affected. Apart from that, I believe, it is only 3.4 which
has an active longterm. At least that change looks safer for stable as it
sounds like correcting things and not adding a feature.

I was able to cherry-pick that into a 3.2 kernel and then a 3.16 guest
successfully can load the kvm-intel module again, of course with the same
shortcomings as before.

-Stefan

> Paolo
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
On 18.03.2015 11:27, Paolo Bonzini wrote:
> On 18/03/2015 10:59, Stefan Bader wrote:
>>> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>> 		vmx_capability.ept, vmx_capability.vpid);
>>> 	}
>>>
>>> -	min = 0;
>>> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>>>  #ifdef CONFIG_X86_64
>>> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>>  #endif
>>>
>>> but I don't think it's a good idea to add it to stable kernels.
>>
>> Why is that? Because it has a risk of causing the module failing to
>> load on L0 where it did work before?
>
> Because if we wanted to make 3.14 nested VMX stable-ish we would need
> several more, at least these:
>
>   KVM: nVMX: fix lifetime issues for vmcs02
>   KVM: nVMX: clean up nested_release_vmcs12 and code around it
>   KVM: nVMX: Rework interception of IRQs and NMIs
>   KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
>   KVM: nVMX: Disable preemption while reading from shadow VMCS
>
> and for 3.13:
>
>   KVM: nVMX: Leave VMX mode on clearing of feature control MSR
>
> There are also several L2-crash-L1 bugs too in Nadav Amit's patches.
>
> Basically, nested VMX was never considered stable-worthy. Perhaps that
> can change soon---but not retroactively. So I'd rather avoid giving
> false impressions of the stability of nVMX in 3.14.
>
> Even if we considered nVMX stable, I'd _really_ not want to consider
> the L1<->L2 boundary a secure one for a longer time.
>
>> Which would be something I would rather avoid. Generally I think it
>> would be good to have something that can be generally applied. Given
>> the speed that cloud service providers tend to move forward (ok they
>> may not actively push the ability to go nested).
>
> And if they did, I'd really not want them to do it with a 3.14 kernel.

3.14... you are optimistic. :) But thanks a lot for the detailed info.

-Stefan

> Paolo
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
On 18.03.2015 10:18, Paolo Bonzini wrote:
> On 18/03/2015 09:46, Stefan Bader wrote:
>> Regardless of that, I wonder whether the below (this version untested)
>> sounds acceptable for upstream? At least it would make debugging much
>> simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>  	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
>>
>>  	/* Ensure minimum (required) set of control bits are supported. */
>> -	if (ctl_min & ~ctl)
>> +	if (ctl_min & ~ctl) {
>> +		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>  		return -EIO;
>> +	}
>>
>>  	*result = ctl;
>>  	return 0;
>
> Yes, this is nice. Maybe -ENODEV.

Maybe, though I did not change that. Just added to give some kind of hint
when the module would otherwise fail with just an IO error.

> Also, a minimal patch for Ubuntu would probably be:
>
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> 		vmx_capability.ept, vmx_capability.vpid);
> 	}
>
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
>
> but I don't think it's a good idea to add it to stable kernels.

Why is that? Because it has a risk of causing the module failing to load
on L0 where it did work before? Which would be something I would rather
avoid. Generally I think it would be good to have something that can be
generally applied. Given the speed that cloud service providers tend to
move forward (ok, they may not actively push the ability to go nested).

-Stefan

> Paolo
regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
Someone reported[1] that some of their L1 guests fail to load the kvm-intel
module (without much detail). Turns out that this was (at least) caused by

  KVM: vmx: Allow the guest to run with dirty debug registers

as this adds VM_EXIT_SAVE_DEBUG_CONTROLS to the required
MSR_IA32_VMX_EXIT_CTLS bits. Not sure this should be fixed up in pre-3.15
kernels or the other way round. Maybe naively asked, but would it be
sufficient to add this as required to older kernels' vmcs setup (without
the code to make any use of it)?

Regardless of that, I wonder whether the below (this version untested)
sounds acceptable for upstream? At least it would make debugging much
simpler. :)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
 	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
 
 	/* Ensure minimum (required) set of control bits are supported. */
-	if (ctl_min & ~ctl)
+	if (ctl_min & ~ctl) {
+		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
+		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
 		return -EIO;
+	}
 
 	*result = ctl;
 	return 0;

Thanks,
-Stefan

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1431473
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 3.15
On 18.03.2015 11:27, Paolo Bonzini wrote: On 18/03/2015 10:59, Stefan Bader wrote: @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmx_capability.ept, vmx_capability.vpid); } - min = 0; + min = VM_EXIT_SAVE_DEBUG_CONTROLS; #ifdef CONFIG_X86_64 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; #endif but I don't think it's a good idea to add it to stable kernels. Why is that? Because it has a risk of causing the module failing to load on L0 where it did work before? Because if we wanted to make 3.14 nested VMX stable-ish we would need several more, at least these: KVM: nVMX: fix lifetime issues for vmcs02 KVM: nVMX: clean up nested_release_vmcs12 and code around it KVM: nVMX: Rework interception of IRQs and NMIs KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt KVM: nVMX: Disable preemption while reading from shadow VMCS and for 3.13: KVM: nVMX: Leave VMX mode on clearing of feature control MSR There are also several L2-crash-L1 bugs too in Nadav Amit's patches. Basically, nested VMX was never considered stable-worthy. Perhaps that can change soon---but not retroactively. So I'd rather avoid giving false impressions of the stability of nVMX in 3.14. Even if we considered nVMX stable, I'd _really_ not want to consider the L1-L2 boundary a secure one for a longer time. Which would be something I would rather avoid. Generally I think it would be good to have something that can be generally applied. Given the speed that cloud service providers tend to move forward (ok they may not actively push the ability to go nested). And if they did, I'd really not want them to do it with a 3.14 kernel. 3.14... you are optimistic. :) But thanks a lot for the detailed info. -Stefan Paolo signature.asc Description: OpenPGP digital signature
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 3.15
On 18.03.2015 10:18, Paolo Bonzini wrote: On 18/03/2015 09:46, Stefan Bader wrote: Regardless of that, I wonder whether the below (this version untested) sounds acceptable for upstream? At least it would make debugging much simpler. :) --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct ctl |= vmx_msr_low; /* bit == 1 in low word ==> must be one */ /* Ensure minimum (required) set of control bits are supported. */ - if (ctl_min & ~ctl) + if (ctl_min & ~ctl) { + printk(KERN_ERR "vmx: msr(%08x) does not match requirements. " + "req=%08x cur=%08x\n", msr, ctl_min, ctl); return -EIO; + } *result = ctl; return 0; Yes, this is nice. Maybe -ENODEV. Maybe, though I did not change that. Just added to give some kind of hint when the module would otherwise fail with just an IO error. Also, a minimal patch for Ubuntu would probably be: @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmx_capability.ept, vmx_capability.vpid); } - min = 0; + min = VM_EXIT_SAVE_DEBUG_CONTROLS; #ifdef CONFIG_X86_64 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; #endif but I don't think it's a good idea to add it to stable kernels. Why is that? Because it has a risk of causing the module failing to load on L0 where it did work before? Which would be something I would rather avoid. Generally I think it would be good to have something that can be generally applied. Given the speed that cloud service providers tend to move forward (ok they may not actively push the ability to go nested). -Stefan Paolo
Re: [PATCH 3.18 129/151] x86/xen: Treat SCI interrupt as normal GSI interrupt
On 04.03.2015 07:14, Greg Kroah-Hartman wrote: > 3.18-stable review patch. If anyone has any objections, please let me know. I thought I replied earlier today but I cannot seem to find it coming back via the mailing list. Hope this is not duplicating too much... There was a regression with that patch and it requires the below commit as well to prevent that: commit 1ea76fbadd667b19c4fa4466f3a3b55a505e83d9 Author: Jiang Liu Date: Mon Feb 16 10:11:13 2015 +0800 x86/irq: Fix regression caused by commit b568b8601f05 Commit b568b8601f05 ("Treat SCI interrupt as normal GSI interrupt") accidently removes support of legacy PIC interrupt when fixing a regression for Xen, which causes a nasty regression on HP/Compaq nc6000 where we fail to register the ACPI interrupt, and thus lose eg. thermal notifications leading a potentially overheated machine. -Stefan > > -- > > From: Jiang Liu > > commit b568b8601f05a591a7ff09d8ee1cedb5b2e815fe upstream. > > Currently Xen Domain0 has special treatment for ACPI SCI interrupt, > that is initialize irq for ACPI SCI at early stage in a special way as: > xen_init_IRQ() > ->pci_xen_initial_domain() > ->xen_setup_acpi_sci() > Allocate and initialize irq for ACPI SCI > > Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq > number for ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on > IOAPIC irqdomains through following path > acpi_gsi_to_irq() > ->mp_map_gsi_to_irq() > ->mp_map_pin_to_irq() > ->check IOAPIC irqdomain > > For PV domains, it uses Xen event based interrupt manangement and > doesn't make uses of native IOAPIC, so no irqdomains created for IOAPIC. > This causes Xen domain0 fail to install interrupt handler for ACPI SCI > and all ACPI events will be lost. 
Please refer to: > https://lkml.org/lkml/2014/12/19/178 > > So the fix is to get rid of special treatment for ACPI SCI, just treat > ACPI SCI as normal GSI interrupt as: > acpi_gsi_to_irq() > ->acpi_register_gsi() > ->acpi_register_gsi_xen() > ->xen_register_gsi() > > With above change, there's no need for xen_setup_acpi_sci() anymore. > The above change also works with bare metal kernel too. > > Signed-off-by: Jiang Liu > Tested-by: Sander Eikelenboom > Cc: Tony Luck > Cc: xen-de...@lists.xenproject.org > Cc: Konrad Rzeszutek Wilk > Cc: David Vrabel > Cc: Rafael J. Wysocki > Cc: Len Brown > Cc: Pavel Machek > Cc: Bjorn Helgaas > Link: > http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com > Signed-off-by: Thomas Gleixner > Signed-off-by: Stefan Bader > Signed-off-by: Greg Kroah-Hartman > > --- > arch/x86/kernel/acpi/boot.c | 21 ++- > arch/x86/pci/xen.c | 47 > > 2 files changed, 11 insertions(+), 57 deletions(-) > > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(uns > > int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) > { > - int irq; > + int rc, irq, trigger, polarity; > > - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { > - *irqp = gsi; > - } else { > - irq = mp_map_gsi_to_irq(gsi, > - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); > - if (irq < 0) > - return -1; > - *irqp = irq; > + rc = acpi_get_override_irq(gsi, &trigger, &polarity); > + if (rc == 0) { > + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; > + polarity = polarity ? 
ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH; > + irq = acpi_register_gsi(NULL, gsi, trigger, polarity); > + if (irq >= 0) { > + *irqp = irq; > + return 0; > + } > } > - return 0; > + return -1; > } > EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); > > --- a/arch/x86/pci/xen.c > +++ b/arch/x86/pci/xen.c > @@ -452,52 +452,6 @@ int __init pci_xen_hvm_init(void) > } > > #ifdef CONFIG_XEN_DOM0 > -static __init void xen_setup_acpi_sci(void) > -{ > - int rc; > - int trigger, polarity; > - int gsi = acpi_sci_override_gsi; > - int irq = -1; > - int gsi_override = -1; > - > - if (!gsi) > - return; > - > - rc = acpi_get_override_irq(gsi, &trigger, &polarity); > - if (rc) { > - printk(KERN_WARNING "xen: acpi_get_override_irq failed for acpi" > -
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 20:15, Sander Eikelenboom wrote: > > Monday, February 9, 2015, 5:09:44 PM, you wrote: > >> On 09.02.2015 13:29, Stefan Bader wrote: >>> On 09.02.2015 13:12, Jiang Liu wrote: >>>> On 2015/2/9 17:47, Stefan Bader wrote: >>>>> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>>>>> >>>>>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>>>>> >>>>>>> Hi Thomas, >>>>>>> This patch set includes three hotfixes related Xen IRQ for >>>>>>> v3.19. >>>>>>> Sorry for the long delay to get these two regressions fixed, it really >>>>>>> cost me some time to read and understand Xen IRQ code. >>>>>> >>>>>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>>>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>>>>> too once it reaches the mainstream kernel. >>>>>> >>>>>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>>>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>>>>> it has passed tests too. >>>>>> >>>>>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>>>>> reports related to this issue yet. >>>>>> >>>>>>> Great thanks to Konrad and Sander for testing fixes for these >>>>>>> regressions. >>>>>> >>>>>>> Regards, >>>>>>> Gerry >>>>>> >>>>>> Hi Gerry, >>>>>> >>>>>> Since these patches now are: tested, reviewed and have landed into >>>>>> mainline, >>>>>> could you also provide the backports for 3.17 and 3.18 where required ? >>>>> >>>>> This would be my attempt of backporting those to 3.18 (I have not tried >>>>> whether >>>>> they would apply to 3.17 as well). Those seem to work when test-booting >>>>> and >>>>> Sander said he will give them a try as well. But better check yourself I >>>>> did not >>>>> mess anything up while backporting. One of the three patches seemed not >>>>> to be >>>>> required at all. >>>> I'm backporting them too, but now I will give up my patches:) >>>> >>>> The first one is a must, which should be applied to 3.17 and 3.18. 
>>>> The second one doesn't apply to 3.17 or 3.18. >>>> The third may be applied to v3.17 and v3.18, but it shouldn't be applied >>>> according to the stable-kernel doc because no user reports such a bug yet. >>> >>> Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets >>> around the acpi irq and the usb problems, too. Though interestingly from the >>> history, the acpi irq I saw in 3.17 while the usb devices started to fail >>> with >>> 3.18. I will let you know in a bit. Just want to finish another bisect >>> first. > >> So it looks like patch #1 indeed does fix both the acpi irq and my usb >> devices. >> Sander, do you want to verify that variant, too? > > Hi Stefan / Gerry, > > Just tested with only patch 1, ended up the same .. the reported issues seem > gone. Sounds good. :) Gerry, do you want me to fwd the backport to Greg for 3.18 or do you do that? As far as I know 3.17 has nobody caring about it. -Stefan > > >> The only odd thing left in a 3.18 with the fix (regardless of using patch #1 >> only or patches #1 and #3) is that the ahci driver shows up with its pci bus >> number in /proc/interrupts. But that does not seem to have any relevant >> downside >> (cannot try 3.19 on that box right now for other reasons). > >> -Stefan > >> ... >> # 57: 12441 000 xen-pirq-msi :00:1f.2 >> #(used to be ahci) > > This is also there on baremetal with 3.19 on intel, so not xen related: > 27: 1868582 1282555 IR-PCI-MSI-edge > :00:1f.2 > > -- > Sander > >>> >>> -Stefan >>> >>>> Regards >>>> Gerry >>>>> >>>>> -Stefan >>>>> >>>>>> >>>>>> The number of people running into these (subtle) issues with these stable >>>>>> kernels seems to be steadily increasing. 
>>>>>> >>>>>> -- >>>>>> Sander >>>>>> >>>>>>> Jiang Liu (3): >>>>>>> xen/pci: Kill function xen_setup_acpi_sci() >>>>>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>>>>> cffe0a2b5a34 >>>>>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>>>>> >>>>>>> arch/x86/include/asm/acpi.h |1 + >>>>>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>>>>> arch/x86/pci/xen.c | 49 >>>>>>> ++- >>>>>>> drivers/acpi/pci_irq.c |1 - >>>>>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>>>>> >>>>>> >>>>> >>> >>> > > > signature.asc Description: OpenPGP digital signature
Re: 3.19: device name associates with IRQ's for ahci controllers operating with a single IRQ changed from "ahci?" to ""
On 09.02.2015 20:54, Sander Eikelenboom wrote: > Hi. > > In 3.19 the device name associates with IRQ's for ahci controllers operating > with a single IRQ changed from "ahci?" to "", was this intentional ? > > It's probably commit 18dcf433f3ded61eb140a55e7048ec2fef79e723 (or another one > in that series). Oh, looking at that commit that might make sense. In ahci_host_activate_single_irq it now uses dev_driver_string() instead of dev_name() as it did before (from ata_host_activate). The description sounds like before the driver is registered this will return the bus. And registering an interrupt would likely be before the driver is fully registered... But you probably saw that, too. Whether that really was intentional is still the question. :) -Stefan > > -- > Sander > > > /proc/interrupts of an ahci controller with a single irq: > 52: 13529 0 0 0 xen-pirq-msi > :00:1f.2 > > /proc/interrupts of an ahci controller with multiple irq's: > 114: 412535 0 0 0 0 0 > xen-pirq-msi ahci0 > 115: 0 0 0 0 0 0 > xen-pirq-msi ahci1 > 116: 16717 0 0 0 0 0 > xen-pirq-msi ahci2 > 117: 0 0 0 0 0 0 > xen-pirq-msi ahci3 > 118: 0 0 0 0 0 0 > xen-pirq-msi ahci4 > 119: 0 0 0 0 0 0 > xen-pirq-msi ahci5 > 
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:29, Stefan Bader wrote: > On 09.02.2015 13:12, Jiang Liu wrote: >> On 2015/2/9 17:47, Stefan Bader wrote: >>> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>>> >>>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>>> >>>>> Hi Thomas, >>>>> This patch set includes three hotfixes related Xen IRQ for v3.19. >>>>> Sorry for the long delay to get these two regressions fixed, it really >>>>> cost me some time to read and understand Xen IRQ code. >>>> >>>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>>> too once it reaches the mainstream kernel. >>>> >>>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>>> it has passed tests too. >>>> >>>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>>> reports related to this issue yet. >>>> >>>>> Great thanks to Konrad and Sander for testing fixes for these regressions. >>>> >>>>> Regards, >>>>> Gerry >>>> >>>> Hi Gerry, >>>> >>>> Since these patches now are: tested, reviewed and have landed into >>>> mainline, >>>> could you also provide the backports for 3.17 and 3.18 where required ? >>> >>> This would be my attempt of backporting those to 3.18 (I have not tried >>> whether >>> they would apply to 3.17 as well). Those seem to work when test-booting and >>> Sander said he will give them a try as well. But better check yourself I >>> did not >>> mess anything up while backporting. One of the three patches seemed not to >>> be >>> required at all. >> I'm backporting them too, but now I will give up my patches:) >> >> The first one is a must, which should be applied to 3.17 and 3.18. >> The second one doesn't apply to 3.17 or 3.18. >> The third may be applied to v3.17 and v3.18, but it shouldn't be applied >> according to the stable-kernel doc because no user reports such a bug yet. > > Hm, ok. 
So maybe I should retry with only patch#1 and verify that alone gets > around the acpi irq and the usb problems, too. Though interestingly from the > history, the acpi irq I saw in 3.17 while the usb devices started to fail with > 3.18. I will let you know in a bit. Just want to finish another bisect > first. So it looks like patch #1 indeed does fix both the acpi irq and my usb devices. Sander, do you want to verify that variant, too? The only odd thing left in a 3.18 with the fix (regardless of using patch #1 only or patches #1 and #3) is that the ahci driver shows up with its pci bus number in /proc/interrupts. But that does not seem to have any relevant downside (cannot try 3.19 on that box right now for other reasons). -Stefan ... # 57: 12441 000 xen-pirq-msi :00:1f.2 #(used to be ahci) > > -Stefan > >> Regards >> Gerry >>> >>> -Stefan >>> >>>> >>>> The number of people running into these (subtle) issues with these stable >>>> kernels seems to be steadily increasing. >>>> >>>> -- >>>> Sander >>>> >>>>> Jiang Liu (3): >>>>> xen/pci: Kill function xen_setup_acpi_sci() >>>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>>> cffe0a2b5a34 >>>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>>> >>>>> arch/x86/include/asm/acpi.h |1 + >>>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>>> arch/x86/pci/xen.c | 49 >>>>> ++- >>>>> drivers/acpi/pci_irq.c |1 - >>>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>>> >>>> >>> > >
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:12, Jiang Liu wrote: > On 2015/2/9 17:47, Stefan Bader wrote: >> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>> >>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>> >>>> Hi Thomas, >>>> This patch set includes three hotfixes related Xen IRQ for v3.19. >>>> Sorry for the long delay to get these two regressions fixed, it really >>>> cost me some time to read and understand Xen IRQ code. >>> >>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>> too once it reaches the mainstream kernel. >>> >>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>> it has passed tests too. >>> >>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>> reports related to this issue yet. >>> >>>> Great thanks to Konrad and Sander for testing fixes for these regressions. >>> >>>> Regards, >>>> Gerry >>> >>> Hi Gerry, >>> >>> Since these patches now are: tested, reviewed and have landed into mainline, >>> could you also provide the backports for 3.17 and 3.18 where required ? >> >> This would be my attempt of backporting those to 3.18 (I have not tried >> whether >> they would apply to 3.17 as well). Those seem to work when test-booting and >> Sander said he will give them a try as well. But better check yourself I did >> not >> mess anything up while backporting. One of the three patches seemed not to be >> required at all. > I'm backporting them too, but now I will give up my patches:) > > The first one is a must, which should be applied to 3.17 and 3.18. > The second one doesn't apply to 3.17 or 3.18. > The third may be applied to v3.17 and v3.18, but it shouldn't be applied > according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. 
So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. Just want to finish another bisect first. -Stefan > Regards > Gerry >> >> -Stefan >> >>> >>> The number of people running into these (subtle) issues with these stable >>> kernels seems to be steadily increasing. >>> >>> -- >>> Sander >>> >>>> Jiang Liu (3): >>>> xen/pci: Kill function xen_setup_acpi_sci() >>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>> cffe0a2b5a34 >>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>> >>>> arch/x86/include/asm/acpi.h |1 + >>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>> arch/x86/pci/xen.c | 49 >>>> ++- >>>> drivers/acpi/pci_irq.c |1 - >>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>> >>> >>
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 05.02.2015 21:07, Sander Eikelenboom wrote: > > Tuesday, January 20, 2015, 3:21:04 AM, you wrote: > >> Hi Thomas, >> This patch set includes three hotfixes related Xen IRQ for v3.19. >> Sorry for the long delay to get these two regressions fixed, it really >> cost me some time to read and understand Xen IRQ code. > >> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >> too once it reaches the mainstream kernel. > >> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >> It's a temporary solution, I will send the formal fix for v3.20 and >> it has passed tests too. > >> Patch 3 fixes an issue found when reading code. There's no real bug >> reports related to this issue yet. > >> Great thanks to Konrad and Sander for testing fixes for these regressions. > >> Regards, >> Gerry > > Hi Gerry, > > Since these patches now are: tested, reviewed and have landed into mainline, > could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. -Stefan > > The number of people running into these (subtle) issues with these stable > kernels seems to be steadily increasing. 
> > -- > Sander > >> Jiang Liu (3): >> xen/pci: Kill function xen_setup_acpi_sci() >> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >> cffe0a2b5a34 >> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi > >> arch/x86/include/asm/acpi.h |1 + >> arch/x86/kernel/acpi/boot.c | 26 +++ >> arch/x86/pci/xen.c | 49 >> ++- >> drivers/acpi/pci_irq.c |1 - >> 4 files changed, 16 insertions(+), 61 deletions(-) > > From 8b5b328b62248d95743ca9af7aa71c06dd808dfe Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Tue, 20 Jan 2015 10:21:05 +0800 Subject: [PATCH 1/2] x86/xen: Treat SCI interrupt as normal GSI interrupt Currently Xen Domain0 has special treatment for ACPI SCI interrupt, that is initialize irq for ACPI SCI at early stage in a special way as: xen_init_IRQ() ->pci_xen_initial_domain() ->xen_setup_acpi_sci() Allocate and initialize irq for ACPI SCI Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq number for ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on IOAPIC irqdomains through following path acpi_gsi_to_irq() ->mp_map_gsi_to_irq() ->mp_map_pin_to_irq() ->check IOAPIC irqdomain For PV domains, it uses Xen event based interrupt manangement and doesn't make uses of native IOAPIC, so no irqdomains created for IOAPIC. This causes Xen domain0 fail to install interrupt handler for ACPI SCI and all ACPI events will be lost. Please refer to: https://lkml.org/lkml/2014/12/19/178 So the fix is to get rid of special treatment for ACPI SCI, just treat ACPI SCI as normal GSI interrupt as: acpi_gsi_to_irq() ->acpi_register_gsi() ->acpi_register_gsi_xen() ->xen_register_gsi() With above change, there's no need for xen_setup_acpi_sci() anymore. The above change also works with bare metal kernel too. Signed-off-by: Jiang Liu Tested-by: Sander Eikelenboom Cc: Tony Luck Cc: xen-de...@lists.xenproject.org Cc: Konrad Rzeszutek Wilk Cc: David Vrabel Cc: Rafael J. 
Wysocki Cc: Len Brown Cc: Pavel Machek Cc: Bjorn Helgaas Link: http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com Signed-off-by: Thomas Gleixner Signed-off-by: Stefan Bader --- arch/x86/kernel/acpi/boot.c | 23 +++--- arch/x86/pci/xen.c | 47 - 2 files changed, 12 insertions(+), 58 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index a142e77..460f498 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger) int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) { - int irq; - - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { - *irqp = gsi; - } else { - irq = mp_map_gsi_to_irq(gsi, - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); - if (irq < 0) - return -1; - *irqp = irq; + int rc, irq, trigger, polarity; + + rc = acpi_get_override_irq(gsi, &trigger, &polarity); + if (rc == 0) { + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; + polarity = polarity ? ACPI_ACT
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. 
-- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) From 8b5b328b62248d95743ca9af7aa71c06dd808dfe Mon Sep 17 00:00:00 2001 From: Jiang Liu jiang@linux.intel.com Date: Tue, 20 Jan 2015 10:21:05 +0800 Subject: [PATCH 1/2] x86/xen: Treat SCI interrupt as normal GSI interrupt Currently Xen Domain0 has special treatment for the ACPI SCI interrupt, that is, it initializes the irq for the ACPI SCI at an early stage in a special way: xen_init_IRQ() ->pci_xen_initial_domain() ->xen_setup_acpi_sci() Allocate and initialize irq for ACPI SCI Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq number for the ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on IOAPIC irqdomains through the following path: acpi_gsi_to_irq() ->mp_map_gsi_to_irq() ->mp_map_pin_to_irq() ->check IOAPIC irqdomain PV domains use Xen event based interrupt management and don't make use of the native IOAPIC, so no irqdomains are created for the IOAPIC. This causes Xen domain0 to fail to install an interrupt handler for the ACPI SCI, and all ACPI events will be lost. Please refer to: https://lkml.org/lkml/2014/12/19/178 So the fix is to get rid of the special treatment for the ACPI SCI and just treat the ACPI SCI as a normal GSI interrupt: acpi_gsi_to_irq() ->acpi_register_gsi() ->acpi_register_gsi_xen() ->xen_register_gsi() With the above change, there's no need for xen_setup_acpi_sci() anymore. The change also works on bare metal kernels.
Signed-off-by: Jiang Liu jiang@linux.intel.com Tested-by: Sander Eikelenboom li...@eikelenboom.it Cc: Tony Luck tony.l...@intel.com Cc: xen-de...@lists.xenproject.org Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: David Vrabel david.vra...@citrix.com Cc: Rafael J. Wysocki r...@rjwysocki.net Cc: Len Brown len.br...@intel.com Cc: Pavel Machek pa...@ucw.cz Cc: Bjorn Helgaas bhelg...@google.com Link: http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com Signed-off-by: Thomas Gleixner t...@linutronix.de Signed-off-by: Stefan Bader stefan.ba...@canonical.com --- arch/x86/kernel/acpi/boot.c | 23 +++--- arch/x86/pci/xen.c | 47 - 2 files changed, 12 insertions(+), 58 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index a142e77..460f498 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger) int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) { - int irq; - - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { - *irqp = gsi; - } else { - irq = mp_map_gsi_to_irq(gsi, - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); - if (irq < 0) - return -1; - *irqp = irq; + int rc, irq, trigger, polarity; + + rc = acpi_get_override_irq(gsi, &trigger, &polarity); + if (rc == 0) { + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; + polarity = polarity ? ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH; + irq = acpi_register_gsi(NULL, gsi, trigger, polarity
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:29, Stefan Bader wrote: On 09.02.2015 13:12, Jiang Liu wrote: On 2015/2/9 17:47, Stefan Bader wrote: On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. I'm backporting them too, but now I will give up my patches:) The first one is a must, which should be applied to 3.17 and 3.18. The second one doesn't apply to 3.17 or 3.18. The third may be applied to v3.17 and v3.18, but it shouldn't be applied according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. 
Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. Just want to finish another bisect first. So it looks like patch #1 indeed does fix both the acpi irq and my usb devices. Sander, do you want to verify that variant, too? The only odd thing left in a 3.18 with the fix (regardless of using patch #1 only or patches #1 and #3) is that the ahci driver shows up with its pci bus number in /proc/interrupts. But that does not seem to have any relevant downside (cannot try 3.19 on that box right now for other reasons). -Stefan ... # 57: 12441 000 xen-pirq-msi :00:1f.2 #(used to be ahci) -Stefan Regards Gerry -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. -- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) signature.asc Description: OpenPGP digital signature
Re: 3.19: device name associates with IRQ's for ahci controllers operating with a single IRQ changed from ahci? to BDF
On 09.02.2015 20:54, Sander Eikelenboom wrote: Hi. In 3.19 the device name associates with IRQ's for ahci controllers operating with a single IRQ changed from ahci? to BDF, was this intentional ? It's probably commit 18dcf433f3ded61eb140a55e7048ec2fef79e723 (or another one in that series). Oh, looking at that commit that might make sense. In ahci_host_activate_single_irq it now uses dev_driver_name instead of dev_name as it did before (from ata_host_activate). The description sounds like before the driver is registered this will return the bus. And registering an interrupt would likely be before the driver is fully registered... But you probably saw that, too. Whether that really was intentional is still the question. :) -Stefan -- Sander
/proc/interrupts of an ahci controller with a single irq:
52: 13529 0 0 0 xen-pirq-msi :00:1f.2
/proc/interrupts of an ahci controller with multiple irq's:
114: 412535 0 0 0 0 0 xen-pirq-msi ahci0
115: 0 0 0 0 0 0 xen-pirq-msi ahci1
116: 16717 0 0 0 0 0 xen-pirq-msi ahci2
117: 0 0 0 0 0 0 xen-pirq-msi ahci3
118: 0 0 0 0 0 0 xen-pirq-msi ahci4
119: 0 0 0 0 0 0 xen-pirq-msi ahci5
signature.asc Description: OpenPGP digital signature
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:12, Jiang Liu wrote: On 2015/2/9 17:47, Stefan Bader wrote: On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. I'm backporting them too, but now I will give up my patches:) The first one is a must, which should be applied to 3.17 and 3.18. The second one doesn't apply to 3.17 or 3.18. The third may be applied to v3.17 and v3.18, but it shouldn't be applied according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. 
Just want to finish another bisect first. -Stefan Regards Gerry -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. -- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) signature.asc Description: OpenPGP digital signature
Re: [PATCH] bcache: prevent crash on changing writeback_running
On 19.08.2014 20:03, Slava Pestov wrote: > Thanks, this is now in our development branch. It has been a while but I just went back there and checked linux and linux-next but cannot find this change anywhere. I think it is a rather simple bugfix. So I start to wonder where it went and when it would get pushed. -Stefan > > On Tue, Aug 19, 2014 at 6:01 AM, Stefan Bader > wrote: >> commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 >> bcache: fix crash on shutdown in passthrough mode >> >> added a safeguard in the shutdown case. At least while not being >> attached it is also possible to trigger a kernel bug by writing into >> writeback_running. This change adds the same check before trying to >> wake up the thread for that case. >> >> Signed-off-by: Stefan Bader >> --- >> drivers/md/bcache/writeback.h | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h >> index 0a9dab1..073a042 100644 >> --- a/drivers/md/bcache/writeback.h >> +++ b/drivers/md/bcache/writeback.h >> @@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, >> struct bio *bio, >> >> static inline void bch_writeback_queue(struct cached_dev *dc) >> { >> - wake_up_process(dc->writeback_thread); >> + if (!IS_ERR_OR_NULL(dc->writeback_thread)) >> + wake_up_process(dc->writeback_thread); >> } >> >> static inline void bch_writeback_add(struct cached_dev *dc) >> -- >> 1.9.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] [PATCH] xen-netfront: Fix handling packets on compound pages with skb_linearize
On 01.12.2014 14:59, Zoltan Kiss wrote: > > > On 01/12/14 13:36, David Vrabel wrote: >> On 01/12/14 08:55, Stefan Bader wrote: >>> On 11.08.2014 19:32, Zoltan Kiss wrote: >>>> There is a long known problem with the netfront/netback interface: if the >>>> guest >>>> tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring >>>> slots, >>>> it gets dropped. The reason is that netback maps these slots to a frag in >>>> the >>>> frags array, which is limited by size. Having so many slots can occur since >>>> compound pages were introduced, as the ring protocol slice them up into >>>> individual (non-compound) page aligned slots. The theoretical worst case >>>> scenario looks like this (note, skbs are limited to 64 Kb here): >>>> linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, >>>> using 2 slots >>>> first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at >>>> the >>>> end and the beginning of a page, therefore they use 3 * 15 = 45 slots >>>> last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots >>>> Although I don't think this 51 slots skb can really happen, we need a >>>> solution >>>> which can deal with every scenario. In real life there is only a few slots >>>> overdue, but usually it causes the TCP stream to be blocked, as the retry >>>> will >>>> most likely have the same buffer layout. >>>> This patch solves this problem by linearizing the packet. This is not the >>>> fastest way, and it can fail much easier as it tries to allocate a big >>>> linear >>>> area for the whole packet, but probably easier by an order of magnitude >>>> than >>>> anything else. Probably this code path is not touched very frequently >>>> anyway. >>>> >>>> Signed-off-by: Zoltan Kiss >>>> Cc: Wei Liu >>>> Cc: Ian Campbell >>>> Cc: Paul Durrant >>>> Cc: net...@vger.kernel.org >>>> Cc: linux-kernel@vger.kernel.org >>>> Cc: xen-de...@lists.xenproject.org >>> >>> This does not seem to be marked explicitly as stable. 
Has someone already >>> asked >>> David Miller to put it on his stable queue? IMO it qualifies quite well and >>> the >>> actual change should be simple to pick/backport. >> >> I think it's a candidate, yes. >> >> Can you expand on the user visible impact of the bug this patch fixes? >> I think it results in certain types of traffic not working (because the >> domU always generates skb's with the problematic frag layout), but I >> can't remember the details. > > Yes, this line in the comment talks about it: "In real life there is only a > few > slots overdue, but usually it causes the TCP stream to be blocked, as the > retry > will most likely have the same buffer layout." > Maybe we can add what kind of traffic triggered this so far, AFAIK NFS was one > of them, and Stefan had an another use case. But my memories are blur about > this. We had some report about some web-app hitting packet losses. I suspect that also was streaming something. For a easy trigger we found redis-benchmark (part of the redis keyserver) with a larger (iirc 1kB) payload would trigger the fragmentation/exceeding pages to happen. Though I think it did not fail but showed a performance drop instead (from memory which also suffers from loosing detail). -Stefan > > Zoli signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] [PATCH] xen-netfront: Fix handling packets on compound pages with skb_linearize
On 11.08.2014 19:32, Zoltan Kiss wrote: > There is a long known problem with the netfront/netback interface: if the > guest > tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring > slots, > it gets dropped. The reason is that netback maps these slots to a frag in the > frags array, which is limited by size. Having so many slots can occur since > compound pages were introduced, as the ring protocol slice them up into > individual (non-compound) page aligned slots. The theoretical worst case > scenario looks like this (note, skbs are limited to 64 Kb here): > linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, > using 2 slots > first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the > end and the beginning of a page, therefore they use 3 * 15 = 45 slots > last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots > Although I don't think this 51 slots skb can really happen, we need a solution > which can deal with every scenario. In real life there is only a few slots > overdue, but usually it causes the TCP stream to be blocked, as the retry will > most likely have the same buffer layout. > This patch solves this problem by linearizing the packet. This is not the > fastest way, and it can fail much easier as it tries to allocate a big linear > area for the whole packet, but probably easier by an order of magnitude than > anything else. Probably this code path is not touched very frequently anyway. > > Signed-off-by: Zoltan Kiss > Cc: Wei Liu > Cc: Ian Campbell > Cc: Paul Durrant > Cc: net...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: xen-de...@lists.xenproject.org This does not seem to be marked explicitly as stable. Has someone already asked David Miller to put it on his stable queue? IMO it qualifies quite well and the actual change should be simple to pick/backport. 
-Stefan > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c > index 055222b..23359ae 100644 > --- a/drivers/net/xen-netfront.c > +++ b/drivers/net/xen-netfront.c > @@ -628,9 +628,10 @@ static int xennet_start_xmit(struct sk_buff *skb, struct > net_device *dev) > slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) + > xennet_count_skb_frag_slots(skb); > if (unlikely(slots > MAX_SKB_FRAGS + 1)) { > - net_alert_ratelimited( > - "xennet: skb rides the rocket: %d slots\n", slots); > - goto drop; > + net_dbg_ratelimited("xennet: skb rides the rocket: %d slots, %d > bytes\n", > + slots, skb->len); > + if (skb_linearize(skb)) > + goto drop; > } > > spin_lock_irqsave(&queue->tx_lock, flags); > > ___ > Xen-devel mailing list > xen-de...@lists.xen.org > http://lists.xen.org/xen-devel > signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 13:21, Zoltan Kiss wrote: > > > On 07/11/14 12:15, Stefan Bader wrote: >> On 07.11.2014 12:22, Eric Dumazet wrote: >>> On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: >>> >>> Please do not top post. >>> >>>> Hi, >>>> >>>> AFAIK in this scenario your skb frag is wrong. The page pointer should >>>> point to the original compound page (not a member of it), and offset >>>> should be set accordingly. >>>> For example, if your compound page is 16K (4 pages), then the page >>>> pointer should point to the first page, and if the data starts at the >>>> 3rd page, then offset should be >8K >>> >>> This is not accurate. >>> >>> This BUG_ON() is wrong. >>> >>> It should instead be : >>> >>> BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); >> >> would that not have to be >> >> BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > >> PAGE_SIZE << compound_order(compound_head(page))); > There should be parentheses around "page-compound_head(page)". *sigh* before submitting anything I'll have to make sure I run it past the compiler at least. I never manage to type the beast without some error. But you got the drift... >> >> since offset is adjusted to start from the tail page in that case. >>> >>> splice() code can generate such cases. >>> >> >> signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 12:22, Eric Dumazet wrote: > On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: > > Please do not top post. > >> Hi, >> >> AFAIK in this scenario your skb frag is wrong. The page pointer should >> point to the original compound page (not a member of it), and offset >> should be set accordingly. >> For example, if your compound page is 16K (4 pages), then the page >> pointer should point to the first page, and if the data starts at the >> 3rd page, then offset should be >8K > > This is not accurate. > > This BUG_ON() is wrong. > > It should instead be : > > BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); since offset is adjusted to start from the tail page in that case. > > splice() code can generate such cases. > > signature.asc Description: OpenPGP digital signature
Re: BUG in xennet_make_frags with paged skb data
On 07.11.2014 11:44, David Vrabel wrote: > On 06/11/14 21:49, Seth Forshee wrote: >> We've had several reports of hitting the following BUG_ON in >> xennet_make_frags with 3.2 and 3.13 kernels (I'm currently awaiting >> results of testing with 3.17): >> >> /* Grant backend access to each skb fragment page. */ >> for (i = 0; i < frags; i++) { >> skb_frag_t *frag = skb_shinfo(skb)->frags + i; >> struct page *page = skb_frag_page(frag); >> >> len = skb_frag_size(frag); >> offset = frag->page_offset; >> >> /* Data must not cross a page boundary. */ >> BUG_ON(len + offset > PAGE_SIZE << compound_order(page)); >> >> When this happens the page in question is a "middle" page in a compound >> page (i.e. it's a tail page but not the last tail page), and the data is >> fully contained within the compound page. The data does however cross >> the hardware page boundary, and since compound_order evaluates to 0 for >> tail pages the check fails. >> >> In going over this I've been unable to determine whether the BUG_ON in >> xennet_make_frags is incorrect or the paged skb data is wrong. I can't >> find that it's documented anywhere, and the networking code itself is a >> bit ambiguous when it comes to compound pages. On the one hand >> __skb_fill_page_desc specifically handles adding tail pages as paged >> data, but on the other hand skb_copy_bits kmaps frag->page.p which could >> fail with data that extends into another page. > > netfront will safely handle this case so you can remove this BUG_ON() > (and the one later on). But it would be better to find out where these > funny-looking skbs are coming from and (if necessary) fixing the bug there. Right, in fact it does ignore pages from the start of a compound page in case the offset is big enough. So there is no difference between that case and the one where the page pointer from the frag is adjusted the way it was observed. It really boils down to the question what the real rules for the frags are.
If it is "only" that data may not cross the boundary of a hw or compound page this leaves room for interpretation. The odd skb does not violate that but a check for conformance would become a lot more complex. And netfront is not the only place that expects the frag page to be pointing to the compound head (there is igb for example, though it does only a warn_on upon failing). On the other hand the __skb_fill_page_desc is written in a way that seems to assume the frag page pointer may be pointing to a tail page. So before "blaming" one piece of code or the other we need an authoritative answer to what we are supposed to expect. -Stefan > > David > signature.asc Description: OpenPGP digital signature
Re: BUG in xennet_make_frags with paged skb data
On 07.11.2014 11:44, David Vrabel wrote: On 06/11/14 21:49, Seth Forshee wrote: We've had several reports of hitting the following BUG_ON in xennet_make_frags with 3.2 and 3.13 kernels (I'm currently awaiting results of testing with 3.17): /* Grant backend access to each skb fragment page. */ for (i = 0; i frags; i++) { skb_frag_t *frag = skb_shinfo(skb)-frags + i; struct page *page = skb_frag_page(frag); len = skb_frag_size(frag); offset = frag-page_offset; /* Data must not cross a page boundary. */ BUG_ON(len + offset PAGE_SIZEcompound_order(page)); When this happens the page in question is a middle page in a compound page (i.e. it's a tail page but not the last tail page), and the data is fully contained within the compound page. The data does however cross the hardware page boundary, and since compound_order evaluates to 0 for tail pages the check fails. In going over this I've been unable to determine whether the BUG_ON in xennet_make_frags is incorrect or the paged skb data is wrong. I can't find that it's documented anywhere, and the networking code itself is a bit ambiguous when it comes to compound pages. On the one hand __skb_fill_page_desc specifically handles adding tail pages as paged data, but on the other hand skb_copy_bits kmaps frag-page.p which could fail with data that extends into another page. netfront will safely handle this case so you can remove this BUG_ON() (and the one later on). But it would be better to find out were these funny-looking skbs are coming from and (if necessary) fixing the bug there. Right, in fact it does ignore pages from the start of a compound page in case the offset is big enough. So there is no difference between that case and the one where the page pointer from the frag is adjusted the way it was observed. It really boils down to the question what the real rules for the frags are. If it is only that data may not cross the boundary of a hw or compound page this leaves room for interpretation. 
The odd skb does not violate that but a check for conformance would become a lot more complex. And netfront is not the only place that expects the frag page to be pointing to the compound head (there is igb for example, though it does only a warn_on upon failing). On the other hand the __skb_fill_page_desc is written in a way that seems to assume the frag page pointer may be pointing to a tail page. So before blaming one piece of code or the other we need an authoritative answer to what we are supposed to expect. -Stefan
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 12:22, Eric Dumazet wrote: On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: Please do not top post. Hi, AFAIK in this scenario your skb frag is wrong. The page pointer should point to the original compound page (not a member of it), and offset should be set accordingly. For example, if your compound page is 16K (4 pages), then the page pointer should point to the first page, and if the data starts at the 3rd page, then offset should be 8K. This is not accurate. This BUG_ON() is wrong. It should instead be: BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); since offset is adjusted to start from the tail page in that case. splice() code can generate such cases.
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 13:21, Zoltan Kiss wrote: On 07/11/14 12:15, Stefan Bader wrote: On 07.11.2014 12:22, Eric Dumazet wrote: On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: Please do not top post. Hi, AFAIK in this scenario your skb frag is wrong. The page pointer should point to the original compound page (not a member of it), and offset should be set accordingly. For example, if your compound page is 16K (4 pages), then the page pointer should point to the first page, and if the data starts at the 3rd page, then offset should be 8K. This is not accurate. This BUG_ON() is wrong. It should instead be: BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); There should be parentheses around page-compound_head(page). *sigh* before submitting anything I'll have to make sure I run it past the compiler at least. I never manage to type the beast without some error. But you got the drift... since offset is adjusted to start from the tail page in that case. splice() code can generate such cases.
Re: [Xen-devel] [PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
On 02.09.2014 13:01, David Vrabel wrote: On 01/09/14 18:34, David Vrabel wrote: On 29/08/14 16:17, Stefan Bader wrote: This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. [...] --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's just +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); level2_fixmap_pgt etc. are defined for the benefit of Xen only so I think you should add an extern for level1_fixmap_pgt and fix this up properly. It might not matter now, but it might in the future... I found some time to look into this and test it. 
Including without enabling KASLR, but reproing the vmalloc failure with a large (800 MiB module). I've clarified the change description, can you review my edits? Thanks for investigating and fixing this! Sorry for not having replied earlier (I am on vacation). Without testing it looks about what I had after a few iterations (minus the export of l1). So if you had a kernel booting with that I am happy, too. :) -Stefan David 8<-- x86/xen: don't copy junk into kernel page tables When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded modules exceeds 512 MiB, then loading modules fails with a warning (and hence a vmalloc allocation failure) because the PTEs for the newly-allocated vmalloc address space are not zero. WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() This is caused by xen_setup_kernel_pagetables() copying junk into the page tables (to level2_fixmap_pgt), overwriting many non-present entries. Without KASLR, the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel [256..511]->module [511]->level2_fixmap_pgt[ 0..505]->module This allows 512 MiB of module vmalloc space to be used before having to use the corrupted level2_fixmap_pgt entries. With KASLR enabled, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel [511]->level2_fixmap_pgt[0..505]->module And now no module vmalloc space can be used without using the corrupt level2_fixmap_pgt entries. Fix this by properly converting the level2_fixmap_pgt entries to MFNs, and setting level1_fixmap_pgt as read-only. 
Signed-off-by: Stefan Bader stefan.ba...@canonical.com Cc: sta...@vger.kernel.org Signed-off-by: David Vrabel david.vra...@citrix.com --- arch/x86/include/asm/pgtable_64.h |1 + arch/x86/xen/mmu.c| 27 --- 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 5be9063..3874693 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512]; extern pmd_t level2_kernel_pgt[512]; extern pmd_t level2_fixmap_pgt[512]; extern pmd_t level2_ident_pgt[512]; +extern pte_t level1_fixmap_pgt[512]; extern pgd_t
[PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, found that xen_setup_kernel_pagetable copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel [256..511]->module [511]->level2_fixmap_pgt[ 0..505]->module With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel [511]->level2_fixmap_pgt[0..505]->module And now the incorrect copy of the kernel mapping in that range bites (hard). Causing errors in vmalloc that start with the following: WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() Which is caused by a freshly allocated PTE for an address in the module vspace area that is not uninitialized (pte_none()). The same would happen with the old layout when something causes the initial mappings to cross the 512M boundary. I was told that someone saw the same vmalloc warning with the old layout but 500M initrd. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. 
And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. Signed-off-by: Stefan Bader Cc: sta...@vger.kernel.org --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..145e50f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's just +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_kernel_pgt */ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) addr[1] = (unsigned long)l3; addr[2] = (unsigned long)l2; /* Graft it onto L4[272][0]. Note that we creating an aliasing problem: -* Both L4[272][0] and L4[511][511] have entries that point to the same +* Both L4[272][0] and L4[511][510] have entries that point to the same * L2 (PMD) tables. Meaning that if you modify it in __va space * it will be also modified in the __ka space! 
(But if you just * modify the PMD table to point to other PTE's or none, then you * are OK - which is what cleanup_highmap does) */ copy_page(level2_ident_pgt, l2); - /* Graft it onto L4[511][511] */ + /* Graft it onto L4[511][510] */ copy_page(level2_kernel_pgt, l2); - /* Get [511][510] and graft that in level2_fixmap_pgt */ - l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); - l2 = m2v(l3[pud_index(__START
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:31, David Vrabel wrote: > On 29/08/14 15:27, Stefan Bader wrote: >> >> Ok, I rework the patch and re-send it (freshly, iow not part of this thread). >> And while I am at it, I would add the stable tag. > > Can you use a different title? Perhaps: > > x86/xen: fix 64-bit PV guest kernel page tables for KASLR > > David > I can change the title but would not want to include KASLR. Because that is just indirectly responsible. This is a fix for a problem that always was there but just escaped notice until KASLR caused the layout to change. This could have been done independently before. Or, like I mentioned in the last email, be caused by a large initrd. -Stefan
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:19, Andrew Cooper wrote: > On 29/08/14 09:37, Stefan Bader wrote: >> On 29.08.2014 00:42, Andrew Cooper wrote: >>> On 28/08/2014 19:01, Stefan Bader wrote: >>>>>> So not much further... but then I think I know what I do next. Probably >>>>>> should >>>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>>>> panic >>>>>> and at least get a crash dump of that situation when it occurs. Then I >>>>>> can dig >>>>>> in there with crash (really should have thought of that before)... >>>>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>>>> there >>>>> that screams at me, so I fear I will have to wait until you get the crash >>>>> and get some clues from that. >>>> Ok, what a journey. So after long hours of painful staring at the code... >>>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >>>> in crash, I really would appreaciate :-P) >>> The M2P map lives in the Xen reserved virtual address space in each PV >>> guest, and forms part of the PV ABI. It is mapped read-only, in the >>> native width of the guest. >>> >>> 32bit PV (PAE) at 0xF580 >>> 64bit PV at 0x8000 >>> >>> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen >>> public header files. You should be able to blindly construct a pointer >>> to it (if you have nothing better to hand), as it will be hooked into >>> the guests pagetables before execution starts. Therefore, >>> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch. >> machine_to_phys_mapping is set to that address but its not mapped inside the >> crash dump. Somehow vtop in crash handles translations. I need to have a >> look at >> their code, I guess. >> >> Thanks, >> Stefan > > What context is the crash dump? If it is a Xen+dom0 kexec()d to native > linux, then the m2p should still be accessible given dom0's cr3. If it > is some state copied off-host then you will need to adjust the copy to > include that virtual range. 
No, it's a domU dump of a PV guest taken with "xl dump-core" (or actually the result of an on-crash trigger). > > ~Andrew >
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:08, Konrad Rzeszutek Wilk wrote: > On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote: >>>> So not much further... but then I think I know what I do next. Probably >>>> should >>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>> panic >>>> and at least get a crash dump of that situation when it occurs. Then I can >>>> dig >>>> in there with crash (really should have thought of that before)... >>> >>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>> there >>> that screams at me, so I fear I will have to wait until you get the crash >>> and get some clues from that. >> >> Ok, what a journey. So after long hours of painful staring at the code... >> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >> in crash, I really would appreaciate :-P) >> >> So at some point I realized that level2_fixmap_pgt seemed to contain >> an oddly high number of entries (given that the virtual address that >> failed would be mapped by entry 0). And suddenly I realized that apart >> from entries 506 and 507 (actual fixmap and vsyscalls) the whole list >> actually was the same as level2_kernel_gpt (without the first 16M >> cleared). >> >> And then I realized that xen_setup_kernel_pagetable is wrong to a >> degree which makes one wonder how this ever worked. Adding PMD_SIZE >> as an offset (2M) isn't changing l2 at all. This just copies Xen's >> kernel mapping, AGAIN!. >> >> I guess we all just were lucky that in most cases modules would not >> use more than 512M (which is the correctly cleaned up remainder >> of kernel_level2_pgt).. >> >> I still need to compile a kernel with the patch and the old layout >> but I kind of got exited by the find. At least this is tested with >> the 1G/~1G layout and it comes up without vmalloc errors. > > Woot! Oh yeah! 
:) >> >> -Stefan >> >> >From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 >> From: Stefan Bader >> Date: Thu, 28 Aug 2014 19:17:00 +0200 >> Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables >> >> This seemed to be one of those what-the-heck moments. When trying to >> figure out why changing the kernel/module split (which enabling KASLR >> does) causes vmalloc to run wild on boot of 64bit PV guests, after >> much scratching my head, found that the current Xen code copies the > > s/current Xen/xen_setup_kernel_pagetable/ ok > >> same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, >> but also (due to miscalculating the offset) to level2_fixmap_pgt. >> >> This only worked because the normal kernel image size only covers the >> first half of level2_kernel_pgt and module space starts after that. >> With the split changing, the kernel image uses the full PUD range of >> 1G and module space starts in the level2_fixmap_pgt. So basically: >> >> L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt >> [511]->level2_fixmap_pgt >> > > Perhaps you could add a similar drawing of what it looked like > without the kaslr enabled? As in the 'normal kernel image' scenario? Sure. Btw, someone also contacted me saying they have the same problem without changing the layout but having a really big initrd (500M). While that feels like it should be impossible (if the kernel+initrd+xen stuff has to fit in the 512M kernel image size area then). But if it can happen, then surely it does cause mappings to be where the module space starts then. > >> And now the incorrect copy of the kernel mapping in that range bites >> (hard). > > Want to include the vmalloc warning you got? Yeah, that is a good idea. > >> >> This change might not be the fully correct approach as it basically >> removes the pre-set page table entry for the fixmap that is compile >> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). 
For one the >> level1 page table is not yet declared in C headers (that might be >> fixed). But also with the current bug, it was removed, too. Since >> the Xen mappings for level2_kernel_pgt only covered kernel + initrd >> and some Xen data this did never reach that far. And still, something >> does create entries at level2_fixmap_pgt[506..507]. So it should be >> ok. At least I was able to successfully boot a kernel with 1G kernel >> image size without any vmalloc whinings. >> >> Signed-off-by: Stefan Bader >> --- >> arch/x86/xen/mmu.c | 26 +-
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 00:42, Andrew Cooper wrote: > On 28/08/2014 19:01, Stefan Bader wrote: >>>> So not much further... but then I think I know what I do next. Probably >>>> should >>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>> panic >>>> and at least get a crash dump of that situation when it occurs. Then I can >>>> dig >>>> in there with crash (really should have thought of that before)... >>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>> there >>> that screams at me, so I fear I will have to wait until you get the crash >>> and get some clues from that. >> Ok, what a journey. So after long hours of painful staring at the code... >> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >> in crash, I really would appreaciate :-P) > > The M2P map lives in the Xen reserved virtual address space in each PV > guest, and forms part of the PV ABI. It is mapped read-only, in the > native width of the guest. > > 32bit PV (PAE) at 0xF580 > 64bit PV at 0x8000 > > This is represented by the MACH2PHYS_VIRT_START symbol from the Xen > public header files. You should be able to blindly construct a pointer > to it (if you have nothing better to hand), as it will be hooked into > the guests pagetables before execution starts. Therefore, > "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch. machine_to_phys_mapping is set to that address but it's not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan > > ~Andrew >
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 00:42, Andrew Cooper wrote: On 28/08/2014 19:01, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) The M2P map lives in the Xen reserved virtual address space in each PV guest, and forms part of the PV ABI. It is mapped read-only, in the native width of the guest. 32bit PV (PAE) at 0xF580 64bit PV at 0x8000 This is represented by the MACH2PHYS_VIRT_START symbol from the Xen public header files. You should be able to blindly construct a pointer to it (if you have nothing better to hand), as it will be hooked into the guests pagetables before execution starts. Therefore, MACH2PHYS_VIRT_START[(unsigned long)pfn] ought to do in a pinch. machine_to_phys_mapping is set to that address but its not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan ~Andrew signature.asc Description: OpenPGP digital signature
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:08, Konrad Rzeszutek Wilk wrote: On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) So at some point I realized that level2_fixmap_pgt seemed to contain an oddly high number of entries (given that the virtual address that failed would be mapped by entry 0). And suddenly I realized that apart from entries 506 and 507 (actual fixmap and vsyscalls) the whole list actually was the same as level2_kernel_gpt (without the first 16M cleared). And then I realized that xen_setup_kernel_pagetable is wrong to a degree which makes one wonder how this ever worked. Adding PMD_SIZE as an offset (2M) isn't changing l2 at all. This just copies Xen's kernel mapping, AGAIN!. I guess we all just were lucky that in most cases modules would not use more than 512M (which is the correctly cleaned up remainder of kernel_level2_pgt).. I still need to compile a kernel with the patch and the old layout but I kind of got exited by the find. At least this is tested with the 1G/~1G layout and it comes up without vmalloc errors. Woot! Oh yeah! 
:) -Stefan From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 From: Stefan Bader stefan.ba...@canonical.com Date: Thu, 28 Aug 2014 19:17:00 +0200 Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, found that the current Xen code copies the s/current Xen/xen_setup_kernel_pagetable/ ok same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]-level3_kernel_pgt[510]-level2_kernel_pgt [511]-level2_fixmap_pgt Perhaps you could add a similar drawing of what it looked like without the kaslr enabled? AS in the 'normal kernel image' scenario? Sure. Btw, someone also contacted me saying they have the same problem without changing the layout but having really big initrd (500M). While that feels like it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M kernel image size area then). But if it can happen, then surely it does cause mappings to be where the module space starts then. And now the incorrect copy of the kernel mapping in that range bites (hard). Want to include the vmalloc warning you got? Yeah, that is a good idea. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]-level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). 
But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. Signed-off-by: Stefan Bader stefan.ba...@canonical.com --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..803034c 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] - level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] - level2_kernel_pgt - * L3_i[511] - level2_fixmap_pgt */ + * L3_k[511] - level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + +/* level2_fixmap_pgt contains a single entry for the + * fixmap area at offset 506
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:19, Andrew Cooper wrote: On 29/08/14 09:37, Stefan Bader wrote: On 29.08.2014 00:42, Andrew Cooper wrote: On 28/08/2014 19:01, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) The M2P map lives in the Xen reserved virtual address space in each PV guest, and forms part of the PV ABI. It is mapped read-only, in the native width of the guest. 32bit PV (PAE) at 0xF580 64bit PV at 0x8000 This is represented by the MACH2PHYS_VIRT_START symbol from the Xen public header files. You should be able to blindly construct a pointer to it (if you have nothing better to hand), as it will be hooked into the guests pagetables before execution starts. Therefore, MACH2PHYS_VIRT_START[(unsigned long)pfn] ought to do in a pinch. machine_to_phys_mapping is set to that address but its not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan What context is the crash dump? If it is a Xen+dom0 kexec()d to native linux, then the m2p should still be accessible given dom0's cr3. If it is some state copied off-host then you will need to adjust the copy to include that virtual range. No its a domU dump of a PV guest taken with xl dump-core (or actually the result of on-crash trigger). ~Andrew signature.asc Description: OpenPGP digital signature
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:31, David Vrabel wrote: On 29/08/14 15:27, Stefan Bader wrote: Ok, I rework the patch and re-send it (freshly, iow not part of this thread). And while I am at it, I would add the stable tag. Can you use a different title? Perhaps: x86/xen: fix 64-bit PV guest kernel page tables for KASLR David I can change the title but would not want to include KASLR. Because that is just indirectly responsible. This is a fix for a problem that always was there but just escaped notice until KASLR caused the layout to change. This could have been done independently before. Or, like I mentioned in the last email, be caused by a large initrd. -Stefan
[PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, I found that xen_setup_kernel_pagetable copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now the incorrect copy of the kernel mapping in that range bites (hard). Causing errors in vmalloc that start with the following: WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() Which is caused by a freshly allocated PTE for an address in the module vspace area that is not empty (pte_none()) as expected. The same would happen with the old layout when something causes the initial mappings to cross the 512M boundary. I was told that someone saw the same vmalloc warning with the old layout but a 500M initrd. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one, the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data, this never did reach that far. And still, something does create entries at level2_fixmap_pgt[506..507].
So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whining. Signed-off-by: Stefan Bader stefan.ba...@canonical.com Cc: sta...@vger.kernel.org --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..145e50f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_kernel_pgt */ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) addr[1] = (unsigned long)l3; addr[2] = (unsigned long)l2; /* Graft it onto L4[272][0]. Note that we creating an aliasing problem: -* Both L4[272][0] and L4[511][511] have entries that point to the same +* Both L4[272][0] and L4[511][510] have entries that point to the same * L2 (PMD) tables. Meaning that if you modify it in __va space * it will be also modified in the __ka space!
(But if you just * modify the PMD table to point to other PTE's or none, then you * are OK - which is what cleanup_highmap does) */ copy_page(level2_ident_pgt, l2); - /* Graft it onto L4[511][511] */ + /* Graft it onto L4[511][510] */ copy_page(level2_kernel_pgt, l2); - /* Get [511][510] and graft that in level2_fixmap_pgt */ - l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); - l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud); - copy_page
[PATCH] Solved the Xen PV/KASLR riddle
> > So not much further... but then I think I know what I do next. Probably > > should > > have done before. I'll replace the WARN_ON in vmalloc that triggers by a > > panic > > and at least get a crash dump of that situation when it occurs. Then I can > > dig > > in there with crash (really should have thought of that before)... > > I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there > that screams at me, so I fear I will have to wait until you get the crash > and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreciate it :-P) So at some point I realized that level2_fixmap_pgt seemed to contain an oddly high number of entries (given that the virtual address that failed would be mapped by entry 0). And suddenly I realized that apart from entries 506 and 507 (actual fixmap and vsyscalls) the whole list actually was the same as level2_kernel_pgt (without the first 16M cleared). And then I realized that xen_setup_kernel_pagetable is wrong to a degree which makes one wonder how this ever worked. Adding PMD_SIZE as an offset (2M) isn't changing l2 at all. This just copies Xen's kernel mapping, AGAIN! I guess we all just were lucky that in most cases modules would not use more than 512M (which is the correctly cleaned up remainder of level2_kernel_pgt)... I still need to compile a kernel with the patch and the old layout but I kind of got excited by the find. At least this is tested with the 1G/~1G layout and it comes up without vmalloc errors. -Stefan From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 From: Stefan Bader Date: Thu, 28 Aug 2014 19:17:00 +0200 Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables This seemed to be one of those what-the-heck moments.
When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, I found that the current Xen code copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
                          [511]->level2_fixmap_pgt

And now the incorrect copy of the kernel mapping in that range bites (hard). This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one, the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data, this never did reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whining.
Signed-off-by: Stefan Bader --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..803034c 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set anyway due +* to using the wrong l2 when getting Xen's tables. +* So let's just nuke it. +* This orphans level1_fixmap_pgt, but that should be +* as it always has been. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_ke
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote: On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote: On 21.08.2014 18:03, Kees Cook wrote: On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote: On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 12.08.2014 19:28, Kees Cook wrote: On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 08.08.2014 14:43, David Vrabel wrote: On 08/08/14 12:20, Stefan Bader wrote: Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error). Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash. I have not tried this for a normal PV guest but for dom0 it also does not help to add nokaslr to the kernel command-line. Maybe it's overlapping with regions of the virtual address space reserved for Xen? What is the VA that fails? David Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report). The kernel-command line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area... Right now, the kaslr code just deals with initrd, cmdline, etc.
If there are other reserved regions that aren't listed in the e820, it'll need to locate and skip them. -Kees Making my little steps towards more understanding I figured out that it isn't the code that does the relocation. Even with that completely disabled there were the vmalloc issues. What causes it seems to be the default of the upper limit and that this changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is the reason why nokaslr has no effect. Oh! That's very interesting. There must be some assumption in Xen about the kernel VM layout then? No. I think most of the changes that look at PTE and PMDs are all in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being too aggressive (Sorry I had to cut our chat short at Kernel Summit!) It sounded like there was another region of memory that Xen was setting aside for page tables? But Stefan's investigation seems to show this isn't about layout at boot (since the kaslr=0 case means no relocation is done). Sounds more like the split between kernel and modules area, so I'm not sure how the memory area after the initrd would be part of this. What should next steps be, do you think? Maybe layout, but not about placement of the kernel. Basically leaving KASLR enabled but shrinking the possible range back to the original kernel/module split is fine as well. I am bouncing between feeling close to understanding to being confused. Konrad suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way round. The warning that occurs first indicates that the PTE that was obtained for some vmalloc mapping is not unused (0) as expected. So it feels rather like some cleanup has *not* been done. Let me think aloud a bit... What seems to cause this is the change of the kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of vsyscalls and a 2M hole at the end).
Which in vaddr terms means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is pointed at by: PGD[510]->PUD[510]->PMD[0]->PTE[0] In the old layout the module vaddr area would start in the same PUD area, but with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls and the hole would cover PUD[511]. I think there is a fixmap there too? Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it. So fixmap seems to be in the 2M space before the vsyscalls. Btw, apparently I got the PGD index wrong. It is of course 511, not 510. init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 21.08.2014 18:03, Kees Cook wrote: On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote: On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 12.08.2014 19:28, Kees Cook wrote: On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 08.08.2014 14:43, David Vrabel wrote: On 08/08/14 12:20, Stefan Bader wrote: Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error). Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash. I have not tried this for a normal PV guest but for dom0 it also does not help to add nokaslr to the kernel command-line. Maybe it's overlapping with regions of the virtual address space reserved for Xen? What is the VA that fails? David Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report). The kernel-command line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area... Right now, the kaslr code just deals with initrd, cmdline, etc. If there are other reserved regions that aren't listed in the e820, it'll need to locate and skip them.
-Kees Making my little steps towards more understanding I figured out that it isn't the code that does the relocation. Even with that completely disabled there were the vmalloc issues. What causes it seems to be the default of the upper limit and that this changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is the reason why nokaslr has no effect. Oh! That's very interesting. There must be some assumption in Xen about the kernel VM layout then? No. I think most of the changes that look at PTE and PMDs are all in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being too aggressive (Sorry I had to cut our chat short at Kernel Summit!) It sounded like there was another region of memory that Xen was setting aside for page tables? But Stefan's investigation seems to show this isn't about layout at boot (since the kaslr=0 case means no relocation is done). Sounds more like the split between kernel and modules area, so I'm not sure how the memory area after the initrd would be part of this. What should next steps be, do you think? Maybe layout, but not about placement of the kernel. Basically leaving KASLR enabled but shrinking the possible range back to the original kernel/module split is fine as well. I am bouncing between feeling close to understanding to being confused. Konrad suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way round. The warning that occurs first indicates that the PTE that was obtained for some vmalloc mapping is not unused (0) as expected. So it feels rather like some cleanup has *not* been done. Let me think aloud a bit... What seems to cause this is the change of the kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of vsyscalls and a 2M hole at the end).
Which in vaddr terms means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is pointed at by: PGD[510]->PUD[510]->PMD[0]->PTE[0] In the old layout the module vaddr area would start in the same PUD area, but with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls and the hole would cover PUD[511]. xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit since I am not sure I understand enough details) I believe is the one PMD pointed at by PGD[510]->PUD[510]. That could mean that before the change xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but not after the change. Maybe that also means it always should have covered more but this would not be observed as long as modules would not claim more than 512M? I
Re: [PATCH] bcache: prevent crash on changing writeback_running
On 19.08.2014 20:03, Slava Pestov wrote: > Thanks, this is now in our development branch. Thanks. :) This kind of brings up another question: how are stable patches handled for bcache? Does someone handle them centrally like for the network stack, or are they tagged for stable and, if that was missed, proposed individually? I am asking because it looks like for all kernels between 3.13..3.16 (inclusive) the set of a664d0f05a2ec02c8f042db536d84d15d6e19e81 bcache: fix crash on shutdown in passthrough mode 9e5c353510b26500bd6b8309823ac9ef2837b761 bcache: fix uninterruptible sleep in writeback thread plus the one I did would be useful. Right now I only tested 3.13 with those and at least avoid the lockdep warnings and crashes I saw before. I only see an unexplainable load of 1 when idle while having one bcache device configured in a VM (Xen or KVM). Since writeback should be fixed, maybe gc... -Stefan > > On Tue, Aug 19, 2014 at 6:01 AM, Stefan Bader > wrote: >> commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 >> bcache: fix crash on shutdown in passthrough mode >> >> added a safeguard in the shutdown case. At least while not being >> attached it is also possible to trigger a kernel bug by writing into >> writeback_running. This change adds the same check before trying to >> wake up the thread for that case.
>>
>> Signed-off-by: Stefan Bader
>> ---
>>  drivers/md/bcache/writeback.h | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
>> index 0a9dab1..073a042 100644
>> --- a/drivers/md/bcache/writeback.h
>> +++ b/drivers/md/bcache/writeback.h
>> @@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
>>
>>  static inline void bch_writeback_queue(struct cached_dev *dc)
>>  {
>> -	wake_up_process(dc->writeback_thread);
>> +	if (!IS_ERR_OR_NULL(dc->writeback_thread))
>> +		wake_up_process(dc->writeback_thread);
>>  }
>>
>>  static inline void bch_writeback_add(struct cached_dev *dc)
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] bcache: prevent crash on changing writeback_running
commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 ("bcache: fix crash on shutdown in passthrough mode") added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case.

Signed-off-by: Stefan Bader
---
 drivers/md/bcache/writeback.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 0a9dab1..073a042 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
 {
-	wake_up_process(dc->writeback_thread);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+		wake_up_process(dc->writeback_thread);
 }
 
 static inline void bch_writeback_add(struct cached_dev *dc)
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] x86_32, entry: Clean up sysenter_badsys declaration
Commit-ID: fb21b84e7f809ef04b1e5aed5d463cf0d4866638
Gitweb: http://git.kernel.org/tip/fb21b84e7f809ef04b1e5aed5d463cf0d4866638
Author: Stefan Bader
AuthorDate: Fri, 15 Aug 2014 10:57:46 +0200
Committer: H. Peter Anvin
CommitDate: Fri, 15 Aug 2014 13:45:32 -0700

x86_32, entry: Clean up sysenter_badsys declaration

commit 554086d85e "x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)" introduced a new jump label (sysenter_badsys) but somehow the END statements seem to have gone wrong (at least it feels that way to me). This does not seem to be a fatal problem, but just for the sake of symmetry, change the second syscall_badsys to sysenter_badsys.

Signed-off-by: Stefan Bader
Link: http://lkml.kernel.org/r/1408093066-31021-1-git-send-email-stefan.ba...@canonical.com
Acked-by: Andy Lutomirski
Signed-off-by: H. Peter Anvin
---
 arch/x86/kernel/entry_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 47c410d..4b0e1df 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -683,7 +683,7 @@ END(syscall_badsys)
 sysenter_badsys:
 	movl $-ENOSYS,%eax
 	jmp sysenter_after_call
-END(syscall_badsys)
+END(sysenter_badsys)
 	CFI_ENDPROC
 
 .macro FIXUP_ESPFIX_STACK
[PATCH] x86_32, entry: Clean up sysenter_badsys declaration
commit 554086d85e "x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)" introduced a new jump label (sysenter_badsys) but somehow the END statements seem to have gone wrong (at least it feels that way to me). This does not seem to be a fatal problem, but just for the sake of symmetry, change the second syscall_badsys to sysenter_badsys.

Signed-off-by: Stefan Bader
---
 arch/x86/kernel/entry_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 47c410d..4b0e1df 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -683,7 +683,7 @@ END(syscall_badsys)
 sysenter_badsys:
 	movl $-ENOSYS,%eax
 	jmp sysenter_after_call
-END(syscall_badsys)
+END(sysenter_badsys)
 	CFI_ENDPROC
 
 .macro FIXUP_ESPFIX_STACK
--
1.9.1
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 12.08.2014 19:28, Kees Cook wrote:
> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader wrote:
>> On 08.08.2014 14:43, David Vrabel wrote:
>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>> as a follow up error).
>>>>
>>>> Details can be seen in [1] but basically this is always some portion of a
>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>
>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>> to add "nokaslr" to the kernel command-line.
>>>
>>> Maybe it's overlapping with regions of the virtual address space
>>> reserved for Xen? What is the VA that fails?
>>>
>>> David
>>>
>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>> not sure whether it might be somewhere in the stacktraces in the report).
>>
>> The kernel command-line does not seem to be looked at. It should put something
>> into dmesg and that never shows up. Also today's random feature is other PV
>> guests crashing after a bit somewhere in the check_for_corruption area...
>
> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> there are other reserved regions that aren't listed in the e820, it'll
> need to locate and skip them.
>
> -Kees

Making my little steps towards more understanding, I figured out that it isn't the code that does the relocation.
Even with that completely disabled, the vmalloc issues were still there. What causes them seems to be the default upper limit, which changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is also the reason why nokaslr has no effect.

-Stefan
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 08.08.2014 14:43, David Vrabel wrote:
> On 08/08/14 12:20, Stefan Bader wrote:
>> Unfortunately I have not yet figured out why this happens, but can confirm by
>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>> as a follow up error).
>>
>> Details can be seen in [1] but basically this is always some portion of a
>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>> the dom0 case there is a more fatal error at some point causing a crash.
>>
>> I have not tried this for a normal PV guest but for dom0 it also does not help
>> to add "nokaslr" to the kernel command-line.
>
> Maybe it's overlapping with regions of the virtual address space
> reserved for Xen? What is the VA that fails?
>
> David

Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report).

The kernel command-line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area...

-Stefan
Xen PV domain regression with KASLR enabled (kernel 3.16)
Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error).

Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash.

I have not tried this for a normal PV guest but for dom0 it also does not help to add "nokaslr" to the kernel command-line.

-Stefan

[    2.547049] ------------[ cut here ]------------
[    2.547065] WARNING: CPU: 0 PID: 97 at /build/buildd/linux-3.16.0/mm/vmalloc.c:128 vmap_page_range_noflush+0x2d1/0x370()
[    2.547069] Modules linked in:
[    2.547073] CPU: 0 PID: 97 Comm: systemd-udevd Not tainted 3.16.0-6-generic #11-Ubuntu
[    2.547077] 0009 880002defb98 81755538
[    2.547082] 880002defbd0 8106bb0d 88000400ec88 0001
[    2.547086] 880002fcfb00 c0391000 880002defbe0
[    2.547090] Call Trace:
[    2.547096] [<ffffffff81755538>] dump_stack+0x45/0x56
[    2.547101] [<ffffffff8106bb0d>] warn_slowpath_common+0x7d/0xa0
[    2.547104] [<ffffffff8106bbea>] warn_slowpath_null+0x1a/0x20
[    2.547108] [<ffffffff81197c31>] vmap_page_range_noflush+0x2d1/0x370
[    2.547112] [<ffffffff81197cfe>] map_vm_area+0x2e/0x40
[    2.547115] [<ffffffff8119a058>] __vmalloc_node_range+0x188/0x280
[    2.547120] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547124] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547129] [<ffffffff8104f294>] module_alloc+0x74/0xd0
[    2.547132] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547135] [<ffffffff810e92b4>] module_alloc_update_bounds+0x14/0x70
[    2.547146] [<ffffffff810e9a6c>] layout_and_allocate+0x74c/0xc70
[    2.547149] [<ffffffff810ea063>] load_module+0xd3/0x1b70
[    2.547154] [<ffffffff811cfeb1>] ? vfs_read+0xf1/0x170
[    2.547157] [<ffffffff810e7aa1>] ? copy_module_from_fd.isra.46+0x121/0x180
[    2.547161] [<ffffffff810ebc76>] SyS_finit_module+0x86/0xb0
[    2.547167] [<ffffffff8175de7f>] tracesys+0xe1/0xe6
[    2.547169] ---[ end trace 8a5de7fc66e75fe4 ]---
[    2.547172] vmalloc: allocation failure, allocated 20480 of 24576 bytes
[    2.547175] systemd-udevd: page allocation failure: order:0, mode:0xd2

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350522
Re: fs/stat: Reduce memory requirements for stat_open
On 08.07.2014 15:09, Peter Zijlstra wrote:
> On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
>> When reading from /proc/stat we allocate a large buffer to maximise
>> the chances of the results being from a single run and thus internally
>> consistent. This currently is sized at 128 * num_possible_cpus() which,
>> in the face of kernels sized to handle large configurations (256 cpus
>> plus), results in the buffer being an order-4 allocation or more.
>> When system memory becomes fragmented these cannot be guaranteed,
>> leading to read failures due to allocation failures.
>
>> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>>
>>  static int stat_open(struct inode *inode, struct file *file)
>>  {
>> -	size_t size = 1024 + 128 * num_possible_cpus();
>> +	size_t size = 1024 + 128 * num_online_cpus();
>>  	char *buf;
>>  	struct seq_file *m;
>>  	int res;
>
> Old thread, and already solved in the meantime, but note that
> CONFIG_NR_CPUS _should_ have no reflection on num_possible_cpus().
>
> The arch (x86 does) should detect at boot time the max possible CPUs the
> actual hardware supports and put num_possible_cpus() to that number. So
> your typical laptop will mostly have num_possible_cpus() <= 4, even
> though CONFIG_NR_CPUS could be 4k.
>
> Of course, if you actually do put 256+ cpus in your system, well, then
> the difference between possible and online isn't going to help either.
>
> If on the other hand your 'board' reports it can hold 256 CPUs while in
> fact it cannot, go kick your vendor in the nuts.

Odd. So at least for the reporter the board must be doing that. For the other way round: at least back when we did the change, some of the big hoovers which end up claiming 64 or so CPUs (including threads) would not boot with a too-low CONFIG_NR_CPUS. So that seems to have changed at some point. But at least this would explain why there have not been too many reports.
Thanks for the detail, Stefan