Re: [PATCH] xen/events: don't use chip_data for legacy IRQs
On 30.09.20 11:16, Juergen Gross wrote:
> Since commit c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data
> to store a per interrupt XEN data pointer which contains XEN specific
> information.")
> Xen is using the chip_data pointer for storing IRQ specific data. When
> running as a HVM domain this can result in problems for legacy IRQs, as
> those might use chip_data for their own purposes.
>
> Use a local array for this purpose in case of legacy IRQs, avoiding the
> double use.
>
> Cc: sta...@vger.kernel.org
> Fixes: c330fb1ddc0a ("XEN uses irqdesc::irq_data_common::handler_data to
> store a per interrupt XEN data pointer which contains XEN specific
> information.")
> Signed-off-by: Juergen Gross

Tested-by: Stefan Bader

> ---
>  drivers/xen/events/events_base.c | 29 +++++++++++++++++++++--------
>  1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
> index 90b8f56fbadb..6f02c18fa65c 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -92,6 +92,8 @@ static bool (*pirq_needs_eoi)(unsigned irq);
>  /* Xen will never allocate port zero for any purpose. */
>  #define VALID_EVTCHN(chn)	((chn) != 0)
>
> +static struct irq_info *legacy_info_ptrs[NR_IRQS_LEGACY];
> +
>  static struct irq_chip xen_dynamic_chip;
>  static struct irq_chip xen_percpu_chip;
>  static struct irq_chip xen_pirq_chip;
> @@ -156,7 +158,18 @@ int get_evtchn_to_irq(evtchn_port_t evtchn)
>  /* Get info for IRQ */
>  struct irq_info *info_for_irq(unsigned irq)
>  {
> -	return irq_get_chip_data(irq);
> +	if (irq < nr_legacy_irqs())
> +		return legacy_info_ptrs[irq];
> +	else
> +		return irq_get_chip_data(irq);
> +}
> +
> +static void set_info_for_irq(unsigned int irq, struct irq_info *info)
> +{
> +	if (irq < nr_legacy_irqs())
> +		legacy_info_ptrs[irq] = info;
> +	else
> +		irq_set_chip_data(irq, info);
>  }
>
>  /* Constructors for packed IRQ information. */
> @@ -377,7 +390,7 @@ static void xen_irq_init(unsigned irq)
>  	info->type = IRQT_UNBOUND;
>  	info->refcnt = -1;
>
> -	irq_set_chip_data(irq, info);
> +	set_info_for_irq(irq, info);
>
>  	list_add_tail(&info->list, &xen_irq_list_head);
>  }
> @@ -426,14 +439,14 @@ static int __must_check xen_allocate_irq_gsi(unsigned gsi)
>
>  static void xen_free_irq(unsigned irq)
>  {
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (WARN_ON(!info))
>  		return;
>
>  	list_del(&info->list);
>
> -	irq_set_chip_data(irq, NULL);
> +	set_info_for_irq(irq, NULL);
>
>  	WARN_ON(info->refcnt > 0);
>
> @@ -603,7 +616,7 @@ EXPORT_SYMBOL_GPL(xen_irq_from_gsi);
>  static void __unbind_from_irq(unsigned int irq)
>  {
>  	evtchn_port_t evtchn = evtchn_from_irq(irq);
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (info->refcnt > 0) {
>  		info->refcnt--;
> @@ -1108,7 +1121,7 @@ int bind_ipi_to_irqhandler(enum ipi_vector ipi,
>
>  void unbind_from_irqhandler(unsigned int irq, void *dev_id)
>  {
> -	struct irq_info *info = irq_get_chip_data(irq);
> +	struct irq_info *info = info_for_irq(irq);
>
>  	if (WARN_ON(!info))
>  		return;
> @@ -1142,7 +1155,7 @@ int evtchn_make_refcounted(evtchn_port_t evtchn)
>  	if (irq == -1)
>  		return -ENOENT;
>
> -	info = irq_get_chip_data(irq);
> +	info = info_for_irq(irq);
>
>  	if (!info)
>  		return -ENOENT;
> @@ -1170,7 +1183,7 @@ int evtchn_get(evtchn_port_t evtchn)
>  	if (irq == -1)
>  		goto done;
>
> -	info = irq_get_chip_data(irq);
> +	info = info_for_irq(irq);
>
>  	if (!info)
>  		goto done;
Re: [PATCH 4.4 50/62] XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.
On 29.09.20 16:05, Jürgen Groß wrote: > On 29.09.20 15:13, Stefan Bader wrote: >> On 01.09.20 17:10, Greg Kroah-Hartman wrote: >>> From: Thomas Gleixner >>> >>> commit c330fb1ddc0a922f044989492b7fcca77ee1db46 upstream. >>> >>> handler data is meant for interrupt handlers and not for storing irq chip >>> specific information as some devices require handler data to store internal >>> per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers. >>> >>> This obviously creates a conflict of interests and crashes the machine >>> because the XEN pointer is overwritten by the driver pointer. >> >> I cannot say whether this applies the same for the vanilla 4.4 stable kernels >> but once this had been applied to our 4.4 based kernels, we observed Xen HVM >> guests crashing on boot with: >> >> [ 0.927538] ACPI: bus type PCI registered >> [ 0.936008] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 >> [ 0.948739] PCI: Using configuration type 1 for base access >> [ 0.960007] PCI: Using configuration type 1 for extended access >> [ 0.984340] ACPI: Added _OSI(Module Device) >> [ 0.988010] ACPI: Added _OSI(Processor Device) >> [ 0.992007] ACPI: Added _OSI(3.0 _SCP Extensions) >> [ 0.996013] ACPI: Added _OSI(Processor Aggregator Device) >> [ 1.002103] BUG: unable to handle kernel paging request at >> ff5f3000 >> [ 1.004000] IP: [] mp_irqdomain_activate+0x5f/0xa0 >> [ 1.004000] PGD 1e0f067 PUD 1e11067 PMD 1e12067 PTE 0 >> [ 1.004000] Oops: 0002 [#1] SMP >> [ 1.004000] Modules linked in: >> [ 1.004000] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.0-191-generic >> #221-Ubuntu >> [ 1.004000] Hardware name: Xen HVM domU, BIOS 4.6.5 04/18/2018 >> [ 1.004000] task: 880107db ti: 880107dac000 task.ti: >> 880107dac000 >> [ 1.004000] RIP: 0010:[] [] >> mp_irqdomain_activate+0x5f/0xa0 >> [ 1.004000] RSP: 0018:880107dafc48 EFLAGS: 00010086 >> [ 1.004000] RAX: 0086 RBX: 8800eb852140 RCX: >> >> [ 1.004000] RDX: ff5f3000 RSI: 0001 RDI: >> 0020c000 >> [ 1.004000] RBP: 
880107dafc50 R08: 81ebdfd0 R09: >> >> [ 1.004000] R10: 0011 R11: 0009 R12: >> 88010880d400 >> [ 1.004000] R13: 0001 R14: 0009 R15: >> 8800eb880080 >> [ 1.004000] FS: () GS:880108ec() >> knlGS: >> [ 1.004000] CS: 0010 DS: ES: CR0: 80050033 >> [ 1.004000] CR2: ff5f3000 CR3: 01e0a000 CR4: >> 06f0 >> [ 1.004000] Stack: >> [ 1.004000] 88010880bc58 880107dafc70 810ea644 >> 88010880bc00 >> [ 1.004000] 88010880bc58 880107dafca0 810e6d88 >> 810e1009 >> [ 1.004000] 88010880bc00 88010880bca0 8800eb880080 >> 880107dafd38 >> [ 1.004000] Call Trace: >> [ 1.004000] [] irq_domain_activate_irq+0x44/0x50 >> [ 1.004000] [] irq_startup+0x38/0x90 >> [ 1.004000] [] ? vprintk_default+0x29/0x40 >> [ 1.004000] [] __setup_irq+0x5a2/0x650 >> [ 1.004000] [] ? kmem_cache_alloc_trace+0x1d4/0x1f0 >> [ 1.004000] [] ? acpi_osi_handler+0xb0/0xb0 >> [ 1.004000] [] request_threaded_irq+0xfb/0x1a0 >> [ 1.004000] [] ? acpi_osi_handler+0xb0/0xb0 >> [ 1.004000] [] ? acpi_ev_sci_dispatch+0x64/0x64 >> [ 1.004000] [] >> acpi_os_install_interrupt_handler+0xaa/0x100 >> [ 1.004000] [] ? acpi_sleep_proc_init+0x28/0x28 >> [ 1.004000] [] acpi_ev_install_sci_handler+0x23/0x25 >> [ 1.004000] [] acpi_ev_install_xrupt_handlers+0x1c/0x6c >> [ 1.004000] [] acpi_enable_subsystem+0x8f/0x93 >> [ 1.004000] [] acpi_init+0x8b/0x2c4 >> [ 1.004000] [] ? kasprintf+0x4e/0x70 >> [ 1.004000] [] ? acpi_sleep_proc_init+0x28/0x28 >> [ 1.004000] [] do_one_initcall+0xb5/0x200 >> [ 1.004000] [] ? parse_args+0x29a/0x4a0 >> [ 1.004000] [] kernel_init_freeable+0x177/0x218 >> [ 1.004000] [] ? rest_init+0x80/0x80 >> [ 1.004000] [] kernel_init+0xe/0xe0 >> [ 1.004000] [] ret_from_fork+0x42/0x80 >> [ 1.004000] [] ? rest_init+0x80/0x80 >> [ 1.004000] Code: 8d 1c d2 8d ba 0b 02 00 00 44 8d 51 11 42 8b 14 dd 74 >> ec 10 >> 82 c1 e7 0c 48 63 ff 81 e2 ff 0f 00 00 48 81 ea 00 10 80 00 48 29 fa <44> 89 >> 12 >> 89 72 10 42 8b 14 dd 74 ec
Re: [PATCH 4.4 50/62] XEN uses irqdesc::irq_data_common::handler_data to store a per interrupt XEN data pointer which contains XEN specific information.
On 01.09.20 17:10, Greg Kroah-Hartman wrote:
> From: Thomas Gleixner
>
> commit c330fb1ddc0a922f044989492b7fcca77ee1db46 upstream.
>
> handler data is meant for interrupt handlers and not for storing irq chip
> specific information as some devices require handler data to store internal
> per interrupt information, e.g. pinctrl/GPIO chained interrupt handlers.
>
> This obviously creates a conflict of interests and crashes the machine
> because the XEN pointer is overwritten by the driver pointer.

I cannot say whether this applies the same to the vanilla 4.4 stable
kernels, but once this had been applied to our 4.4 based kernels, we
observed Xen HVM guests crashing on boot with:

[    0.927538] ACPI: bus type PCI registered
[    0.936008] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.948739] PCI: Using configuration type 1 for base access
[    0.960007] PCI: Using configuration type 1 for extended access
[    0.984340] ACPI: Added _OSI(Module Device)
[    0.988010] ACPI: Added _OSI(Processor Device)
[    0.992007] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.996013] ACPI: Added _OSI(Processor Aggregator Device)
[    1.002103] BUG: unable to handle kernel paging request at ff5f3000
[    1.004000] IP: [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000] PGD 1e0f067 PUD 1e11067 PMD 1e12067 PTE 0
[    1.004000] Oops: 0002 [#1] SMP
[    1.004000] Modules linked in:
[    1.004000] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.4.0-191-generic #221-Ubuntu
[    1.004000] Hardware name: Xen HVM domU, BIOS 4.6.5 04/18/2018
[    1.004000] task: 880107db ti: 880107dac000 task.ti: 880107dac000
[    1.004000] RIP: 0010:[] [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000] RSP: 0018:880107dafc48 EFLAGS: 00010086
[    1.004000] RAX: 0086 RBX: 8800eb852140 RCX:
[    1.004000] RDX: ff5f3000 RSI: 0001 RDI: 0020c000
[    1.004000] RBP: 880107dafc50 R08: 81ebdfd0 R09:
[    1.004000] R10: 0011 R11: 0009 R12: 88010880d400
[    1.004000] R13: 0001 R14: 0009 R15: 8800eb880080
[    1.004000] FS: () GS:880108ec() knlGS:
[    1.004000] CS: 0010 DS: ES: CR0: 80050033
[    1.004000] CR2: ff5f3000 CR3: 01e0a000 CR4: 06f0
[    1.004000] Stack:
[    1.004000]  88010880bc58 880107dafc70 810ea644 88010880bc00
[    1.004000]  88010880bc58 880107dafca0 810e6d88 810e1009
[    1.004000]  88010880bc00 88010880bca0 8800eb880080 880107dafd38
[    1.004000] Call Trace:
[    1.004000]  [] irq_domain_activate_irq+0x44/0x50
[    1.004000]  [] irq_startup+0x38/0x90
[    1.004000]  [] ? vprintk_default+0x29/0x40
[    1.004000]  [] __setup_irq+0x5a2/0x650
[    1.004000]  [] ? kmem_cache_alloc_trace+0x1d4/0x1f0
[    1.004000]  [] ? acpi_osi_handler+0xb0/0xb0
[    1.004000]  [] request_threaded_irq+0xfb/0x1a0
[    1.004000]  [] ? acpi_osi_handler+0xb0/0xb0
[    1.004000]  [] ? acpi_ev_sci_dispatch+0x64/0x64
[    1.004000]  [] acpi_os_install_interrupt_handler+0xaa/0x100
[    1.004000]  [] ? acpi_sleep_proc_init+0x28/0x28
[    1.004000]  [] acpi_ev_install_sci_handler+0x23/0x25
[    1.004000]  [] acpi_ev_install_xrupt_handlers+0x1c/0x6c
[    1.004000]  [] acpi_enable_subsystem+0x8f/0x93
[    1.004000]  [] acpi_init+0x8b/0x2c4
[    1.004000]  [] ? kasprintf+0x4e/0x70
[    1.004000]  [] ? acpi_sleep_proc_init+0x28/0x28
[    1.004000]  [] do_one_initcall+0xb5/0x200
[    1.004000]  [] ? parse_args+0x29a/0x4a0
[    1.004000]  [] kernel_init_freeable+0x177/0x218
[    1.004000]  [] ? rest_init+0x80/0x80
[    1.004000]  [] kernel_init+0xe/0xe0
[    1.004000]  [] ret_from_fork+0x42/0x80
[    1.004000]  [] ? rest_init+0x80/0x80
[    1.004000] Code: 8d 1c d2 8d ba 0b 02 00 00 44 8d 51 11 42 8b 14 dd 74 ec 10 82 c1 e7 0c 48 63 ff 81 e2 ff 0f 00 00 48 81 ea 00 10 80 00 48 29 fa <44> 89 12 89 72 10 42 8b 14 dd 74 ec 10 82 83 c1 10 81 e2 ff 0f
[    1.004000] RIP  [] mp_irqdomain_activate+0x5f/0xa0
[    1.004000]  RSP
[    1.004000] CR2: ff5f3000
[    1.004000] ---[ end trace 3201cae5b6bd7be1 ]---
[    1.592027] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009
[    1.592027]

This is from a local server, but the same stack trace happens on AWS
instances while initializing the ACPI SCI. mp_irqdomain_activate is
accessing chip_data, expecting ioapic data there.

Oddly enough, more recent kernels seem to do the same but do not crash as
HVM guests (we have seen this neither with our 4.15 nor with our 5.4
kernels). Maybe someone could make sure the 4.4.y stable series is not
also broken.

-Stefan

>
> As the XEN data is not handler specific it should be stored in
> irqdesc::irq_data::chip_data instead.
>
> A simple sed s/irq_[sg]et_handler_data/irq_[sg]et_chip_data/ cures that.
>
> Cc: sta...@vger.kernel.org
> Reported-by: Roman
Re: [PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
On 29.05.19 12:37, Greg KH wrote: > On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote: >> From: Jiri Wiesner >> >> The *_frag_reasm() functions are susceptible to miscalculating the byte >> count of packet fragments in case the truesize of a head buffer changes. >> The truesize member may be changed by the call to skb_unclone(), leaving >> the fragment memory limit counter unbalanced even if all fragments are >> processed. This miscalculation goes unnoticed as long as the network >> namespace which holds the counter is not destroyed. >> >> Should an attempt be made to destroy a network namespace that holds an >> unbalanced fragment memory limit counter the cleanup of the namespace >> never finishes. The thread handling the cleanup gets stuck in >> inet_frags_exit_net() waiting for the percpu counter to reach zero. The >> thread is usually in running state with a stacktrace similar to: >> >> PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4" >> #5 [880621563d48] _raw_spin_lock at 815f5480 >> #6 [880621563d48] inet_evict_bucket at 8158020b >> #7 [880621563d80] inet_frags_exit_net at 8158051c >> #8 [880621563db0] ops_exit_list at 814f5856 >> #9 [880621563dd8] cleanup_net at 814f67c0 >> #10 [880621563e38] process_one_work at 81096f14 >> >> It is not possible to create new network namespaces, and processes >> that call unshare() end up being stuck in uninterruptible sleep state >> waiting to acquire the net_mutex. >> >> The bug was observed in the IPv6 netfilter code by Per Sundstrom. >> I thank him for his analysis of the problem. The parts of this patch >> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures. >> >> Signed-off-by: Jiri Wiesner >> Reported-by: Per Sundstrom >> Acked-by: Peter Oskolkov >> Signed-off-by: David S. 
Miller
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

It turns out that I cannot provide patches for 4.9.y because they are not
needed there. Patches #1 and #2 of the list I sent do not appear there
explicitly. Patch #3, however, does, and it might be possible to do the
changes of the other two implicitly by adjusting the removal of the
functions from their old locations and applying the additions unmodified.
And the final patch, pulling the skb from the list, is already included in
4.9.y by the backports for using rbtrees in ipv6, too. In 4.9.y, however,
the skb_get() still needs to be dropped. Sasha did not apply that in the
end, maybe partially because of my warning that it was not enough for
4.4.y.

So I think there are two options for 4.4.y, which I would defer to the
net-devs to decide:

- either also backport the patches to use rbtrees in ipv6 to 4.4.y
  (including the use of inet_frag_pull_head() in ip6_expire_frag_queue()
  and dropping the skb_get() there),
- or make some limited change to ip6_expire_frag_queue(), probably only
  doing the part not related to rbtrees. This would not require patch #3
  as a prerequisite.

Patches #1 and #2 might be considered separately. They would be unrelated
to the crash in ip6_expire_frag_queue() but could fix other issues (I do
not like the sound of a net namespace teardown possibly getting stuck).

For 4.9.y, please re-consider picking "ip6: fix skb leak in
ip6frag_expire_frag_queue()". Depending on the checks down the path of
icmpv6_send(), it might be a crash there, too. In that case that patch
should be enough, as inet_frag_pull_head() takes the skb off the queue, so
it won't get double freed when the whole queue is released.
-Stefan

> thanks,
>
> greg k-h
Re: [PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
On 29.05.19 12:37, Greg KH wrote: > On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote: >> From: Jiri Wiesner >> >> The *_frag_reasm() functions are susceptible to miscalculating the byte >> count of packet fragments in case the truesize of a head buffer changes. >> The truesize member may be changed by the call to skb_unclone(), leaving >> the fragment memory limit counter unbalanced even if all fragments are >> processed. This miscalculation goes unnoticed as long as the network >> namespace which holds the counter is not destroyed. >> >> Should an attempt be made to destroy a network namespace that holds an >> unbalanced fragment memory limit counter the cleanup of the namespace >> never finishes. The thread handling the cleanup gets stuck in >> inet_frags_exit_net() waiting for the percpu counter to reach zero. The >> thread is usually in running state with a stacktrace similar to: >> >> PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4" >> #5 [880621563d48] _raw_spin_lock at 815f5480 >> #6 [880621563d48] inet_evict_bucket at 8158020b >> #7 [880621563d80] inet_frags_exit_net at 8158051c >> #8 [880621563db0] ops_exit_list at 814f5856 >> #9 [880621563dd8] cleanup_net at 814f67c0 >> #10 [880621563e38] process_one_work at 81096f14 >> >> It is not possible to create new network namespaces, and processes >> that call unshare() end up being stuck in uninterruptible sleep state >> waiting to acquire the net_mutex. >> >> The bug was observed in the IPv6 netfilter code by Per Sundstrom. >> I thank him for his analysis of the problem. The parts of this patch >> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures. >> >> Signed-off-by: Jiri Wiesner >> Reported-by: Per Sundstrom >> Acked-by: Peter Oskolkov >> Signed-off-by: David S. 
Miller
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?

I will, once it is clear that a) the backport looks alright and b) it is
ok to be done at all. Alternatively it might be decided that only the
parts necessary for pulling out a frag head should be picked, or the
net-devs might decide they want to send things out themselves.

The problem potentially exists in anything that has some stable support up
to v5.1, and I have no complete overview of where this was backported to.
So this is more a start of a discussion than a request to apply it; stable
was just included to make the stable maintainers aware.

-Stefan

> thanks,
>
> greg k-h
[PATCH 0/4] ipv6: frags: fixups for linux-4.4.y
While this backport proposal is based on the 4.4.y stable tree, it might
also apply in some form to any stable tree which backported

  05c0b86b96: "ipv6: frags: rewrite ip6_expire_frag_queue()"

While this made ip6_expire_frag_queue() similar to ip_expire(), it did not
follow the additional changes to ip_expire() which were also backported:

  fa0f527358: "ip: use rb trees for IP frag queue."
  a4fd284a1f: "ip: process in-order fragments efficiently"

The former of the two not only adds handling for rb trees, but also
modifies ip_expire() to take the first skb off the queue before using it
for sending the ICMP message. This also got rid of the need to protect the
skb by incrementing its reference count (which is the reason for the crash
in ip6_expire_frag_queue()).

My first approach was to do those changes in ip6_expire_frag_queue(), but
only the former of the two can be done without problems. The latter uses
code which is only locally defined in ipv4/ip_fragment.c. This was changed
upstream in 5.1 when moving code around to be shared:

  c23f35d19d: "net: IP defrag: encapsulate rbtree defrag code into callable functions"

And while backporting that I found two other changes which sounded like one
might want them backported, too. Maybe even more so since the second ("ip:
fail fast on IP defrag errors") is already partially included in the
backport of "net: ipv4: do not handle duplicate fragments as overlapping".

Though I do realize that "net: IP defrag: encapsulate rbtree defrag code
into callable functions" is rather large and for that reason maybe not
qualifying as a stable backport. So I would like to ask what the
net-developers think about this.
Thanks,
Stefan

0001: v4.20: ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
0002: v4.20: ip: fail fast on IP defrag errors
0003: v5.1 : net: IP defrag: encapsulate rbtree defrag code into callable functions
0004: n/a  : ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()

Jiri Wiesner (1):
  ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes

Peter Oskolkov (2):
  ip: fail fast on IP defrag errors
  net: IP defrag: encapsulate rbtree defrag code into callable functions

Stefan Bader (1):
  ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()

 include/net/inet_frag.h                 |  16 +-
 net/ipv4/inet_fragment.c                | 293 +++
 net/ipv4/ip_fragment.c                  | 294 +++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |   8 +-
 net/ipv6/reassembly.c                   |  20 +-
 5 files changed, 359 insertions(+), 272 deletions(-)

-- 
2.17.1
[PATCH 1/4] ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
From: Jiri Wiesner

The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.

Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:

 PID: 1073 TASK: 880626711440 CPU: 1 COMMAND: "kworker/u48:4"
  #5 [880621563d48] _raw_spin_lock at 815f5480
  #6 [880621563d48] inet_evict_bucket at 8158020b
  #7 [880621563d80] inet_frags_exit_net at 8158051c
  #8 [880621563db0] ops_exit_list at 814f5856
  #9 [880621563dd8] cleanup_net at 814f67c0
 #10 [880621563e38] process_one_work at 81096f14

It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.

The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

Signed-off-by: Jiri Wiesner
Reported-by: Per Sundstrom
Acked-by: Peter Oskolkov
Signed-off-by: David S. Miller

(backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
[smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
Signed-off-by: Stefan Bader
---
 net/ipv4/ip_fragment.c                  |  7 +++++++
 net/ipv6/netfilter/nf_conntrack_reasm.c |  8 +++++++-
 net/ipv6/reassembly.c                   |  8 +++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 2c75330b1c71..5387e6ab78d7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -519,6 +519,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	struct rb_node *rbn;
 	int len;
 	int ihlen;
+	int delta;
 	int err;
 	u8 ecn;
 
@@ -560,10 +561,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 	if (len > 65535)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC))
 		goto out_nomem;
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(qp->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 664c84e47bab..e5d06ceb2c0c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -353,7 +353,7 @@ static struct sk_buff *
 nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 {
 	struct sk_buff *fp, *op, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	u8 ecn;
 
 	inet_frag_kill(&fq->q);
 
@@ -374,12 +374,18 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
 		goto out_oversize;
 	}
 
+	delta = - head->truesize;
+
 	/* Head of list must not be cloned. */
 	if (skb_unclone(head, GFP_ATOMIC)) {
 		pr_debug("skb is cloned but can't expand head");
 		goto out_oom;
 	}
 
+	delta += head->truesize;
+	if (delta)
+		add_frag_mem_limit(fq->q.net, delta);
+
 	/* If the first fragment is fragmented itself, we split
 	 * it to two chunks: the first with data and paged part
 	 * and the second, holding only fragments. */
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ec917f58d105..020cf70273c9 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -343,7 +343,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 {
 	struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
 	struct sk_buff *fp, *head = fq->q.fragments;
-	int    payload_len;
+	int    payload_len, delta;
 	unsigned int nhoff;
 	int sum_truesize;
 	u8 ecn;
 
@@ -384,10 +384,16 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	if (payload_len > IPV6_MAXPLEN)
 		goto out_oversize;
 
+	delta = - head->truesize;
+
 	/*
[PATCH 3/4] net: IP defrag: encapsulate rbtree defrag code into callable functions
From: Peter Oskolkov

This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov
Cc: Eric Dumazet
Cc: Florian Westphal
Cc: Tom Herbert
Signed-off-by: David S. Miller

(backported from commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f)
[smb: open code skb_mark_not_on_list(), context adjustments,
 use of IP_INC_STATS_BH instead of __IP_INC_STATS]
Signed-off-by: Stefan Bader
---
 include/net/inet_frag.h  |  16 ++-
 net/ipv4/inet_fragment.c | 293 +++
 net/ipv4/ip_fragment.c   | 288 --
 3 files changed, 332 insertions(+), 265 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6260ec146142..7c8b06302ade 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,8 +75,8 @@ struct inet_frag_queue {
 	struct timer_list timer;
 	spinlock_t lock;
 	atomic_t	refcnt;
-	struct sk_buff	*fragments;    /* Used in IPv6. */
-	struct rb_root	rb_fragments;  /* Used in IPv4. */
+	struct sk_buff	*fragments;    /* used in 6lopwpan IPv6. */
+	struct rb_root	rb_fragments;  /* Used in IPv4/IPv6. */
 	struct sk_buff	*fragments_tail;
 	struct sk_buff	*last_run_head;
 	ktime_t stamp;
@@ -152,4 +152,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
 
 extern const u8 ip_frag_ecn_table[16];
 
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK	0
+#define IPFRAG_DUP	1
+#define IPFRAG_OVERLAP	2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+			   int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+			      struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+			    void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c03e5f5859e1..5c167efd54bd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
 #include
 #include
 #include
+#include
+#include
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+	union {
+		struct inet_skb_parm	h4;
+		struct inet6_skb_parm	h6;
+	};
+	struct sk_buff		*next_frag;
+	int			frag_run_len;
+};
+
+#define FRAG_CB(skb)		((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+	RB_CLEAR_NODE(&skb->rbnode);
+	FRAG_CB(skb)->next_frag = NULL;
+	FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+				   struct sk_buff *skb)
+{
+	fragcb_clear(skb);
+
+	FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+	FRAG_CB(q->fragments_tail)->next_frag = skb;
+	q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+	BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+	fragcb_clear(skb);
+
+	if (q->last_run_head)
+		rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+			     &q->last_run_head->rbnode.rb_right);
+	else
+		rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+	rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+	q->fragments_tail = skb;
+	q->last_run_head = skb;
+}
 
 /* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
  * Value : 0xff if frame should be dropped.
@@ -130,6 +186,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
 	kmem_cache_free(f->frags_cachep, q);
 }
 
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
+
+	while (p) {
+		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, root);
+		while (skb) {
+			struct sk_buff *next = FRAG_CB(skb)-
[PATCH 4/4] ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()
With frag code shared between IPv4 and IPv6 it is now possible to use
inet_frag_pull_head() in ip6_expire_frag_queue() in order to properly
extract the first skb from the frags queue. Since this really takes the
skb out of the list, it no longer needs to be protected against removal
(which implicitly fixes a crash that will happen somewhere in the call
to icmp6_send() when the use count is forcefully checked to be 1).

kernel BUG at linux-4.4.0/net/core/skbuff.c:1207!
RIP: 0010:[] [] pskb_expand_head+0x243/0x250
[] __pskb_pull_tail+0x50/0x350
[] _decode_session6+0x26a/0x400
[] __xfrm_decode_session+0x39/0x50
[] icmpv6_route_lookup+0xf0/0x1c0
[] icmp6_send+0x5e1/0x940
[] icmpv6_send+0x21/0x30
[] ip6_expire_frag_queue+0xe0/0x120

Signed-off-by: Stefan Bader
---
 net/ipv6/reassembly.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 020cf70273c9..2bd45abc2516 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -110,16 +110,14 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
 	IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
 
 	/* Don't send error if the first segment did not arrive. */
-	head = fq->q.fragments;
-	if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+		goto out;
+
+	head = inet_frag_pull_head(&fq->q);
+	if (!head)
 		goto out;
 
-	/* But use as source device on which LAST ARRIVED
-	 * segment was received. And do not use fq->dev
-	 * pointer directly, device might already disappeared.
-	 */
 	head->dev = dev;
-	skb_get(head);
 	spin_unlock(&fq->q.lock);
 
 	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
-- 
2.17.1
[PATCH 2/4] ip: fail fast on IP defrag errors
From: Peter Oskolkov

The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected, the whole
frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see the "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.

Signed-off-by: Peter Oskolkov
Reviewed-by: Eric Dumazet
Reviewed-by: Willem de Bruijn
Signed-off-by: David S. Miller
(backported from commit 0ff89efb524631ac9901b81446b453c29711c376)
[smb: context adjustments and ignoring those changes already done in
 backport for "net: ipv4: do not handle duplicate fragments as
 overlapping"]
Signed-off-by: Stefan Bader
---
 net/ipv4/ip_fragment.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5387e6ab78d7..a53652c8c0fd 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		 */
 		if (end < qp->q.len ||
 		    ((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
-			goto err;
+			goto discard_qp;
 		qp->q.flags |= INET_FRAG_LAST_IN;
 		qp->q.len = end;
 	} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		if (end > qp->q.len) {
 			/* Some bits beyond end -> corruption. */
 			if (qp->q.flags & INET_FRAG_LAST_IN)
-				goto err;
+				goto discard_qp;
 			qp->q.len = end;
 		}
 	}
 	if (end == offset)
-		goto err;
+		goto discard_qp;
 
 	err = -ENOMEM;
 	if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
-		goto err;
+		goto discard_qp;
 
 	err = pskb_trim_rcsum(skb, end - offset);
 	if (err)
-		goto err;
+		goto discard_qp;
 
 	/* Note : skb->rbnode and skb->dev share the same location. */
 	dev = skb->dev;
@@ -434,7 +434,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	/* This is the common case: skb goes to the end. */
 	/* Detect and discard overlaps. */
 	if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
-		goto discard_qp;
+		goto overlap;
 	if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
 		ip4_frag_append_to_last_run(&qp->q, skb);
 	else
@@ -457,7 +457,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 			    end <= skb1_run_end)
 				goto err; /* No new data, potential duplicate */
 			else
-				goto discard_qp; /* Found an overlap */
+				goto overlap; /* Found an overlap */
 		} while (*rbn);
 		/* Here we have parent properly set, and rbn pointing to
 		 * one of its NULL left/right children. Insert skb.
 		 */
@@ -494,15 +494,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		skb->_skb_refdst = 0UL;
 		err = ip_frag_reasm(qp, skb, prev_tail, dev);
 		skb->_skb_refdst = orefdst;
+		if (err)
+			inet_frag_kill(&qp->q);
 		return err;
 	}
 
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+overlap:
+	IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
 discard_qp:
 	inet_frag_kill(&qp->q);
-	IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
-- 
2.17.1
Re: [PATCH AUTOSEL 5.1 011/375] ip6: fix skb leak in ip6frag_expire_frag_queue()
On 22.05.19 21:15, Sasha Levin wrote:
> From: Eric Dumazet
>
> [ Upstream commit 47d3d7fdb10a21c223036b58bd70ffdc24a472c4 ]
>
> Since ip6frag_expire_frag_queue() now pulls the head skb
> from frag queue, we should no longer use skb_get(), since
> this leads to an skb leak.
>
> Stefan Bader initially reported a problem in 4.4.stable [1] caused
> by the skb_get(), so this patch should also fix this issue.

Just to let everybody know: while changing this has fixed the BUG_ON
problem while sending (in 4.4), it now crashes when releasing just a
little later. Still feels like the right direction but not complete,
yet.

-Stefan

> [296583.091021] kernel BUG at
> /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
> [296583.091734] Call Trace:
> [296583.091749] [] __pskb_pull_tail+0x50/0x350
> [296583.091764] [] _decode_session6+0x26a/0x400
> [296583.091779] [] __xfrm_decode_session+0x39/0x50
> [296583.091795] [] icmpv6_route_lookup+0xf0/0x1c0
> [296583.091809] [] icmp6_send+0x5e1/0x940
> [296583.091823] [] ? __netif_receive_skb+0x18/0x60
> [296583.091838] [] ? netif_receive_skb_internal+0x32/0xa0
> [296583.091858] [] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
> [296583.091876] [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
> [296583.091893] [] icmpv6_send+0x21/0x30
> [296583.091906] [] ip6_expire_frag_queue+0xe0/0x120
> [296583.091921] [] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
> [296583.091938] [] call_timer_fn+0x37/0x140
> [296583.091951] [] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
> [296583.091968] [] run_timer_softirq+0x234/0x330
> [296583.091982] [] __do_softirq+0x109/0x2b0
>
> Fixes: d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")
> Signed-off-by: Eric Dumazet
> Reported-by: Stefan Bader
> Cc: Peter Oskolkov
> Cc: Florian Westphal
> Signed-off-by: David S. Miller
> Signed-off-by: Sasha Levin
> ---
> include/net/ipv6_frag.h | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
> index 28aa9b30aecea..1f77fb4dc79df 100644
> --- a/include/net/ipv6_frag.h
> +++ b/include/net/ipv6_frag.h
> @@ -94,7 +94,6 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
> 	goto out;
>
> 	head->dev = dev;
> -	skb_get(head);
> 	spin_unlock(&fq->q.lock);
>
> 	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
Re: [PATCH 4.4 018/101] netfilter: synproxy: fix conntrackd interaction
On 03.07.2017 15:34, Greg Kroah-Hartman wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.

We found that pulling below patch into stable trees without also pulling

commit 9c3f3794926a997b1cab6c42480ff300efa2d162
Author: Liping Zhang
Date:   Sat Mar 25 16:35:29 2017 +0800

    netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister

will result in a regression, at least in 4.4.y[1]. Stable maintainers who
picked up below patch might want to consider picking up above fix.

-Stefan

[1] http://bugs.launchpad.net/bugs/1709032

> --
>
> From: Eric Leblond
>
> commit 87e94dbc210a720a34be5c1174faee5c84be963e upstream.
>
> This patch fixes the creation of connection tracking entry from
> netlink when synproxy is used. It was missing the addition of
> the synproxy extension.
>
> This was causing kernel crashes when a conntrack entry created by
> conntrackd was used after the switch of traffic from active node
> to the passive node.
>
> Signed-off-by: Eric Leblond
> Signed-off-by: Pablo Neira Ayuso
> Signed-off-by: Greg Kroah-Hartman
>
> ---
> net/netfilter/nf_conntrack_netlink.c | 4
> 1 file changed, 4 insertions(+)
>
> --- a/net/netfilter/nf_conntrack_netlink.c
> +++ b/net/netfilter/nf_conntrack_netlink.c
> @@ -45,6 +45,8 @@
> #include
> #include
> #include
> +#include
> +#include
> #ifdef CONFIG_NF_NAT_NEEDED
> #include
> #include
> @@ -1798,6 +1800,8 @@ ctnetlink_create_conntrack(struct net *n
> 	nf_ct_tstamp_ext_add(ct, GFP_ATOMIC);
> 	nf_ct_ecache_ext_add(ct, 0, 0, GFP_ATOMIC);
> 	nf_ct_labels_ext_add(ct);
> +	nfct_seqadj_ext_add(ct);
> +	nfct_synproxy_ext_add(ct);
>
> 	/* we must add conntrack extensions before confirmation. */
> 	ct->status |= IPS_CONFIRMED;
[PATCH] bcache: Fix bcache device names
When adding partition support to bcache, the name assignment was not
updated, resulting in numbers jumping (bcache0, bcache16, bcache32...).
Fix this by taking BCACHE_MINORS into account when assigning the disk
name.

BugLink: https://bugs.launchpad.net/bugs/1667078
Fixes: b8c0d91 (bcache: partition support: add 16 minors per bcacheN device)
Cc: <sta...@vger.kernel.org> # v4.10
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
---
 drivers/md/bcache/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21..817e155 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -793,7 +793,8 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 	}
 
 	set_capacity(d->disk, sectors);
-	snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i", minor);
+	snprintf(d->disk->disk_name, DISK_NAME_LEN, "bcache%i",
+		 minor / BCACHE_MINORS);
 
 	d->disk->major		= bcache_major;
 	d->disk->first_minor	= minor;
-- 
2.7.4
Re: bcache super block corruption with non 4k pages
On 28.07.2016 18:27, Stefan Bader wrote:
> On 28.07.2016 07:55, Kent Overstreet wrote:
>> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>>> So here is another attempt which does half the proposed changes. And
>>> before you point out that it looks still ugly, let me explain some
>>> reasons. The goal for me would be to have something as small as
>>> possible to be acceptable as stable change. And the part about
>>> putting a bio on the stack and using submit_bio_wait: I believe you
>>> meant in read_super to replace the __bread. I was thinking about that
>>> but in the end it seemed to make the change unnecessary big. Whether
>>> using __bread or submit_bio_wait, in both cases and without needing
>>> to make more changes on the write side, read_super has to return the
>>> in-memory and on-disk variant of the superblock. So as long as
>>> nothing that is related to __bread is leaked out of read_super, it is
>>> much better than what is there now. And I remain as small as possible
>>> for the delta.
>>
>> I like that approach much better. I suppose it's not _strictly_
>> necessary to rip out the __bread()...
>>
>> Only other comment is that you shouldn't have to pass the buffer to
>> __write_super() - I'd just move the bch_bio_map() call to when the
>> struct cache/cached_dev is being initialized (or rip it out and
>> initialize the bvec by hand) and never touch it after that.
>
> Hm, doing that would save three simple changes to add a new argument to
> that private functions at the cost of having the map call twice and a
> (more) complicated calculation of the
>>
>>> So there is one part of the patch which I find hard to do in a better
>>> manner but is a bit ugly: and that is to sanely free the sb_disk_data
>>> blob on all error paths but not on success when it is assigned to
>>> either cache or cached_dev. Could possibly pass the address of the
>>> pointer and then clear it inside the called functions.
>>> Not sure that would be much less strange...
>>
>> Yeah that is a hassle - that's why in the 4k superblocks patch in
>> bcache-dev I added that "disk_sb" struct - it owns the buffer and
>> random other crap. You could read that patch for ideas if you want,
>> look at how it transfers ownership of the disk_sb.
>>
>
> I had a look but it felt like I could get into too much follow-up
> changes going that path. But I think I got it simpler now. One note
> about that area: both register calls can run into problems but only
> one actually returns that status. And both do not seem to free the
> allocated structures (cache or cache_dev). It is at least not obvious
> whether this is ever done.
> I worked around this by moving the assignment of the buffer page and
> the mapping to a place where an error exit no longer is possible. So
> the release functions will only see a non NULL pointer if things went
> well (reality required to revise that a little bit as one of the
> register calls that can fail is actually doing the write).
>
> So now there is just one oddness: when I am testing this (unfortunately
> right now I only have a normal 4k case), after calling make-bcache with
> cache and backing device, this all looks great and debugging shows the
> __write_super being called. But reading the superblock from userspace
> will return the old data until I stop the bcache device and unregister
> the cache (which does not show any further writes).
> I cannot decide what I should think here...

So it's __bread doing cached operations, and after not (ab)using the
page from there, the writes will never cause the cached old result to go
away until the device gets closed.

So (hopefully) last attempt. Also drop __bread and use submit_bio_wait.
This seems to work. On x64 and ppc64le with 64k pages.

So the only pain with this is that upstream seems to have re-organized
the submit_bio functions in the upcoming v4.7-rc2 kernel. So it is
without rework only applicable to older kernels.
I was using a 4.4 one, 4.6 in theory should be ok...

-Stefan

From 91bbee69d19b59ec3a08a2c4e7c92747a23b3956 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than
 4k page size

There is no guarantee that the superblock which __bread returns in a
buffer_head starts at offset 0 when an architecture has bigger pages
than 4k (the used sector size). This is the attempt to fix this with
the minimum amount of change by having
Re: bcache super block corruption with non 4k pages
On 28.07.2016 07:55, Kent Overstreet wrote:
> On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote:
>> So here is another attempt which does half the proposed changes. And
>> before you point out that it looks still ugly, let me explain some
>> reasons. The goal for me would be to have something as small as
>> possible to be acceptable as stable change. And the part about putting
>> a bio on the stack and using submit_bio_wait: I believe you meant in
>> read_super to replace the __bread. I was thinking about that but in
>> the end it seemed to make the change unnecessary big. Whether using
>> __bread or submit_bio_wait, in both cases and without needing to make
>> more changes on the write side, read_super has to return the in-memory
>> and on-disk variant of the superblock. So as long as nothing that is
>> related to __bread is leaked out of read_super, it is much better than
>> what is there now. And I remain as small as possible for the delta.
>
> I like that approach much better. I suppose it's not _strictly_
> necessary to rip out the __bread()...
>
> Only other comment is that you shouldn't have to pass the buffer to
> __write_super() - I'd just move the bch_bio_map() call to when the
> struct cache/cached_dev is being initialized (or rip it out and
> initialize the bvec by hand) and never touch it after that.

Hm, doing that would save three simple changes to add a new argument to
that private functions at the cost of having the map call twice and a
(more) complicated calculation of the
>
>> So there is one part of the patch which I find hard to do in a better
>> manner but is a bit ugly: and that is to sanely free the sb_disk_data
>> blob on all error paths but not on success when it is assigned to
>> either cache or cached_dev. Could possibly pass the address of the
>> pointer and then clear it inside the called functions. Not sure that
>> would be much less strange...
>
> Yeah that is a hassle - that's why in the 4k superblocks patch in
> bcache-dev I added that "disk_sb" struct - it owns the buffer and
> random other crap. You could read that patch for ideas if you want,
> look at how it transfers ownership of the disk_sb.
>

I had a look but it felt like I could get into too much follow-up
changes going that path. But I think I got it simpler now. One note
about that area: both register calls can run into problems but only one
actually returns that status. And both do not seem to free the allocated
structures (cache or cache_dev). It is at least not obvious whether this
is ever done.
I worked around this by moving the assignment of the buffer page and the
mapping to a place where an error exit no longer is possible. So the
release functions will only see a non NULL pointer if things went well
(reality required to revise that a little bit as one of the register
calls that can fail is actually doing the write).

So now there is just one oddness: when I am testing this (unfortunately
right now I only have a normal 4k case), after calling make-bcache with
cache and backing device, this all looks great and debugging shows the
__write_super being called. But reading the superblock from userspace
will return the old data until I stop the bcache device and unregister
the cache (which does not show any further writes).
I cannot decide what I should think here...

-Stefan

From 982a4ff25d4dbd114432b4b2f908182995f402a0 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 26 Jul 2016 18:47:21 +0200
Subject: [PATCH] bcache: read_super: handle architectures with more than
 4k page size

There is no guarantee that the superblock which __bread returns in a
buffer_head starts at offset 0 when an architecture has bigger pages
than 4k (the used sector size).

This is the attempt to fix this with the minimum amount of change by
having a buffer allocated with kmalloc that holds the superblock data
as it is on disk.
This buffer can then be passed to bch_bio_map which will set up the
bio_vec correctly.

Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
---
 drivers/md/bcache/bcache.h |  2 ++
 drivers/md/bcache/super.c  | 58 --
 2 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 6b420a5..3c48927 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -295,6 +295,7 @@ struct cached_dev {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct closure		sb_write;
 	struct semaphore	sb_write_mutex;
@@ -382,6 +383,7 @@ struct cache {
 	struct cache_sb		sb;
 	struct bio		sb_bio;
 	struct bio_vec		sb_bv[1];
+	void			*sb_disk_data;
 	struct ko
Re: bcache super block corruption with non 4k pages
On 28.07.2016 07:55, Kent Overstreet wrote: > On Wed, Jul 27, 2016 at 05:16:36PM +0200, Stefan Bader wrote: >> So here is another attempt which does half the proposed changes. And before >> you >> point out that it looks still ugly, let me explain some reasons. The goal >> for me >> would be to have something as small as possible to be acceptable as stable >> change. >> And the part about putting a bio on the stack and using submit_bio_wait: I >> believe you meant in read_super to replace the __bread. I was thinking about >> that but in the end it seemed to make the change unnecessary big. Whether >> using >> __bread or submit_bio_wait, in both cases and without needing to make more >> changes on the write side, read_super has to return the in-memory and on-disk >> variant of the superblock. So as long as nothing that is related to __bread >> is >> leaked out of read_super, it is much better than what is there now. And I >> remain >> as small as possible for the delta. > > I like that approach much better. I suppose it's not _strictly_ necessary to > rip > out the __bread()... > > Only other comment is that you shouldn't have to pass the buffer to > __write_super() - I'd just move the bch_bio_map() call to when the struct > cache/cached_dev is being initialized (or rip it out and initialize the bvec > by > hand) and never touch it after that. Hm, doing that would save three simple changes to add a new argument to that private functions at the cost of haven the map call twice and a (more) complicated calculation of the > >> So there is one part of the patch which I find hard to do in a better manner >> but >> is a bit ugly: and that is to sanely free the sb_disk_data blob on all error >> paths but not on success when it is assigned to either cache or cached_dev. >> Could possibly pass the address of the pointer and then clear it inside the >> called functions. Not sure that would be much less strange... 
> > Yeah that is a hassle - that's why in the 4k superblocks patch in bcache-dev I > added that "disk_sb" struct - it owns the buffer and random other crap. You > could read that patch for ideas if you want, look at how it transfers > ownership > of the disk_sb. > I had a look but it felt like I could get into too much follow-up changes going that path. But I think I got it simpler now. One note about that area: both register calls can run into problems but only one actually returns that status. And both do not seem to free the allocated structures (cache or cache_dev). It is at least not obvious whether this is ever done. I working around this by moving the assignment of the buffer page and the mapping to a place where an error exit no longer is possible. So the release functions will only see a non NULL pointer if things went well (reality required to revise that a little bit as one of the register calls that can fail is actually doing the write). So now there is just one oddness: when I am testing this (unfortunately right now I only have a normal 4k case), after calling make-bache with cache and backing device, this all looks great and debugging shows the __write_super being called. But reading the from userspace will return the old data until I stop the bcache device and unregister the cache (which does not show any further writes). I cannot decide what I should think here... -Stefan From 982a4ff25d4dbd114432b4b2f908182995f402a0 Mon Sep 17 00:00:00 2001 From: Stefan Bader Date: Tue, 26 Jul 2016 18:47:21 +0200 Subject: [PATCH] bcache: read_super: handle architectures with more than 4k page size There is no guarantee that the superblock which __bread returns in a buffer_head starts at offset 0 when an architecture has bigger pages than 4k (the used sector size). This is the attempt to fix this with the minimum amount of change by having a buffer allocated with kmalloc that holds the superblock data as it is on disk. 
This buffer can then be passed to bch_bio_map which will set up the bio_vec correctly. Signed-off-by: Stefan Bader --- drivers/md/bcache/bcache.h | 2 ++ drivers/md/bcache/super.c | 58 -- 2 files changed, 37 insertions(+), 23 deletions(-) diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h index 6b420a5..3c48927 100644 --- a/drivers/md/bcache/bcache.h +++ b/drivers/md/bcache/bcache.h @@ -295,6 +295,7 @@ struct cached_dev { struct cache_sb sb; struct bio sb_bio; struct bio_vec sb_bv[1]; + void *sb_disk_data; struct closure sb_write; struct semaphore sb_write_mutex; @@ -382,6 +383,7 @@ struct cache { struct cache_sb sb; struct bio sb_bio; struct bio_vec sb_bv[1]; + void *sb_disk_data; struct kobject kobj; struct block_device *bdev; diff --git a/drivers/md/
Re: bcache super block corruption with non 4k pages
On 26.07.2016 14:49, Kent Overstreet wrote: > On Tue, Jul 26, 2016 at 02:32:31PM +0200, Stefan Bader wrote: >> On 26.07.2016 12:21, Kent Overstreet wrote: >>> On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote: >>>> On 21.07.2016 10:58, Stefan Bader wrote: >>>>> I was pointed at the thread which seems to address the same after >>>>> I wrote most of below text. Did not want to re-write this so please >>>>> bear with the odd layout. >>>>> >>>>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html >>>>> >>>>> Zhengyuan tries to fix the problem by relocating the superblock on >>>>> disk. But I am not sure whether there is really any guarantee about >>>>> how __bread fills data into the buffer_head. What if there is the next >>>>> odd arch with 128K pages? >>>>> >>>>> So below is an attempt to be more generic. Still I don't feel completely >>>>> happy with the way that a page moves (or is shared) between buffer_head >>>>> and biovec. What I tried to outline below is to let the register functions >>>>> allocate bio+biovec memory and use the in-memory sb_cache data to >>>>> initialize >>>>> the biovec buffer. >>>> >>>> Any opinions here? Also adding LKML as I don't seem to get through >>>> moderation on >>>> dm-devel. >>> >>> The correct solution is to rip out the __bread() and just read the >>> superblock by >>> issuing a bio, the same way all the other IO in bcache is done. >>> >>> This is the way it's done in the bcache-dev branch - unfortunately, the >>> patch >>> that does that in bcache-dev is big and invasive and probably not worth the >>> hassle to backport: >>> >>> https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1 >> >> I agree that this looks better and also rather large. >>> >>> Probably best to just do something small and localized. >>> >> So what did you think about the change I did? It seemed to be ok for 4K and >> 64K >> at least and is rather small. 
And I believe that, compared to Zhengyuan's >> approach this would have the benefit of not changing the superblock sector. >> So >> it would be compatible with previously created superblocks. > > Too ugly to live. Just kmalloc() 4k, allocate a bio on the stack, set it up, > and > submit it with submit_bio_wait(). Use virt_to_page(), don't bother with raw > pages - you want 4k, not whatever the page size is. > So here is another attempt which does half the proposed changes. And before you point out that it looks still ugly, let me explain some reasons. The goal for me would be to have something as small as possible to be acceptable as stable change. And the part about putting a bio on the stack and using submit_bio_wait: I believe you meant in read_super to replace the __bread. I was thinking about that but in the end it seemed to make the change unnecessarily big. Whether using __bread or submit_bio_wait, in both cases and without needing to make more changes on the write side, read_super has to return the in-memory and on-disk variant of the superblock. So as long as nothing that is related to __bread is leaked out of read_super, it is much better than what is there now. And I remain as small as possible for the delta. So there is one part of the patch which I find hard to do in a better manner but is a bit ugly: and that is to sanely free the sb_disk_data blob on all error paths but not on success when it is assigned to either cache or cached_dev. Could possibly pass the address of the pointer and then clear it inside the called functions. Not sure that would be much less strange... 
-Stefan From 391682e2329a6f8608bfc7628b6c8a0afa9a5d98 Mon Sep 17 00:00:00 2001 From: Stefan Bader <stefan.ba...@canonical.com> Date: Tue, 26 Jul 2016 18:47:21 +0200 Subject: [PATCH] bcache: read_super: handle architectures with more than 4k page size There is no guarantee that the superblock which __bread returns in a buffer_head starts at offset 0 when an architecture has bigger pages than 4k (the used sector size). This is the attempt to fix this with the minimum amount of change by having a buffer allocated with kmalloc that holds the superblock data as it is on disk. This buffer can then be passed to bch_bio_map which will set up the bio_vec correctly. Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> --- drivers/md/bcache/bcache.h | 2 ++ drivers/md/bcache/super.c | 61 ++ 2 files chan
Re: bcache super block corruption with non 4k pages
On 26.07.2016 12:21, Kent Overstreet wrote: > On Tue, Jul 26, 2016 at 11:51:25AM +0200, Stefan Bader wrote: >> On 21.07.2016 10:58, Stefan Bader wrote: >>> I was pointed at the thread which seems to address the same after >>> I wrote most of below text. Did not want to re-write this so please >>> bear with the odd layout. >>> >>> https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html >>> >>> Zhengyuan tries to fix the problem by relocating the superblock on >>> disk. But I am not sure whether there is really any guarantee about >>> how __bread fills data into the buffer_head. What if there is the next >>> odd arch with 128K pages? >>> >>> So below is an attempt to be more generic. Still I don't feel completely >>> happy with the way that a page moves (or is shared) between buffer_head >>> and biovec. What I tried to outline below is to let the register functions >>> allocate bio+biovec memory and use the in-memory sb_cache data to initialize >>> the biovec buffer. >> >> Any opinions here? Also adding LKML as I don't seem to get through >> moderation on >> dm-devel. > > The correct solution is to rip out the __bread() and just read the superblock > by > issuing a bio, the same way all the other IO in bcache is done. > > This is the way it's done in the bcache-dev branch - unfortunately, the patch > that does that in bcache-dev is big and invasive and probably not worth the > hassle to backport: > > https://evilpiepirate.org/git/linux-bcache.git/commit/?h=bcache-dev&id=303eb67bffad57b4d9e71523e7df04bf258e66d1 I agree that this looks better and also rather large. > > Probably best to just do something small and localized. > So what did you think about the change I did? It seemed to be ok for 4K and 64K at least and is rather small. And I believe that, compared to Zhengyuan's approach this would have the benefit of not changing the superblock sector. So it would be compatible with previously created superblocks. 
-Stefan From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001 From: Stefan Bader <stefan.ba...@canonical.com> Date: Wed, 20 Jul 2016 12:06:27 +0200 Subject: [PATCH] bcache: handle non 4k pages returned by __bread On non-x86 arches pages can be bigger than 4k. Currently read_super returns a reference to the page used as buffer in the buffer_head that is returned by __bread(). With 4k page size and a requested read this normally ends up with the super block data starting at offset 0. But as seen on ppc64le with 64k page size, the data can start at an offset (from what I saw the offset was 4k). This causes harm later on when __write_super() maps the super block data at the start of the page acquired before and also does not write back all fields (particularly the super block magic). Try to avoid this by also returning the potential offset within the sb_page from read_super() and fully initialize the single bvec of the bio used for __write_super() calls. Doing that is the same as would have been done in bch_bio_map() which now must not be used in __write_super(). 
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> --- drivers/md/bcache/super.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index e169739..3e0d2de 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq; /* Superblock */ static const char *read_super(struct cache_sb *sb, struct block_device *bdev, - struct page **res) + struct page **res, unsigned int *off) { const char *err; struct cache_sb *s; @@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct block_device *bdev, err = NULL; get_page(bh->b_page); + *off = (unsigned int) (bh->b_data - ((char *) page_address(bh->b_page))); *res = bh->b_page; err: put_bh(bh); @@ -202,19 +203,19 @@ static void write_bdev_super_endio(struct bio *bio) { struct cached_dev *dc = bio->bi_private; /* XXX: error checking */ - closure_put(&dc->sb_write); } static void __write_super(struct cache_sb *sb, struct bio *bio) { - struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page); + struct cache_sb *out = page_address(bio->bi_io_vec[0].bv_page) + + bio->bi_io_vec[0].bv_offset; unsigned i; bio->bi_iter.bi_sector = SB_SECTOR; bio->bi_rw = REQ_SYNC|REQ_META; bio->bi_iter.bi_size = SB_SIZE; - bch_bio_map(bio, NULL); + // bch_bio_map(bio, NULL); out->offset = cpu_to_le64(sb->offset); out->version = cpu_to_le64(sb->version); @@ -1139,6 +1140,7 @@ static int cached_dev_init(struct cached_dev *dc, unsigned block_size) /* Cached device - bcac
Re: bcache super block corruption with non 4k pages
On 21.07.2016 10:58, Stefan Bader wrote: > I was pointed at the thread which seems to address the same after > I wrote most of below text. Did not want to re-write this so please > bear with the odd layout. > > https://www.redhat.com/archives/dm-devel/2016-June/msg00015.html > > Zhengyuan tries to fix the problem by relocating the superblock on > disk. But I am not sure whether there is really any guarantee about > how __bread fills data into the buffer_head. What if there is the next > odd arch with 128K pages? > > So below is an attempt to be more generic. Still I don't feel completely > happy with the way that a page moves (or is shared) between buffer_head > and biovec. What I tried to outline below is to let the register functions > allocate bio+biovec memory and use the in-memory sb_cache data to initialize > the biovec buffer. Any opinions here? Also adding LKML as I don't seem to get through moderation on dm-devel. > > -Stefan > > > --- > > This was seen on ppc64le with 64k page size. The problem is that > in that case __bread returns a page where b_data is not at the > start of the page. And I am not really sure that the way bcache > re-purposes the page returned by __bread to be used as biovec > element is really a good idea. > > The way it is now, I saw __bread return a 64k page where b_data > was starting at 4k offset. Then __write_super was modifying some > data at offset 0 but for example not writing the magic number again. > > Not sure why (maybe this messes up flushes, too) but the bad data was not > immediately written back when the devices were registered. So at that time > bcache-super-show would report data as it was written by user-space (like > the cache type still 0 and not 3, and claiming the cache to still be > detached). > > But when stopping the cache and unregistering the cache set this changed > and bcache-super-show was reporting a bad magic number (as expected). 
> > The patch below works around this (tested with 64k and 4k pages) but I > am not really sure this is a clean approach. My gut feeling would be that > the bio structures should be allocated separately (I think there is a way > of allocating a bioset without using a pool). Then that separate page could > be pre-initialized from the super block structure without passing around > a page that feels more private to __bread usage... > > -Stefan > > > > From 3652e98359b876f3c1e6d7b9b7305e3411178296 Mon Sep 17 00:00:00 2001 > From: Stefan Bader <stefan.ba...@canonical.com> > Date: Wed, 20 Jul 2016 12:06:27 +0200 > Subject: [PATCH] bcache: handle non 4k pages returned by __bread > > On non-x86 arches pages can be bigger than 4k. Currently read_super > returns a reference to the page used as buffer in the buffer_head > that is returned by __bread(). > With 4k page size and a requested read this normally ends up with > the super block data starting at offset 0. But as seen on ppc64le > with 64k page size, the data can start at an offset (from what I > saw the offset was 4k). > This causes harm later on when __write_super() maps the super > block data at the start of the page acquired before and also > does not write back all fields (particularly the super block magic). > > Try to avoid this by also returning the potential offset within the > sb_page from read_super() and fully initialize the single bvec of > the bio used for __write_super() calls. Doing that is the same as > would have been done in bch_bio_map() which now must not be used in > __write_super(). 
> > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > > removedebug > > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > --- > drivers/md/bcache/super.c | 22 +++--- > 1 file changed, 15 insertions(+), 7 deletions(-) > > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c > index e169739..3e0d2de 100644 > --- a/drivers/md/bcache/super.c > +++ b/drivers/md/bcache/super.c > @@ -62,7 +62,7 @@ struct workqueue_struct *bcache_wq; > /* Superblock */ > > static const char *read_super(struct cache_sb *sb, struct block_device *bdev, > - struct page **res) > + struct page **res, unsigned int *off) > { > const char *err; > struct cache_sb *s; > @@ -192,6 +192,7 @@ static const char *read_super(struct cache_sb *sb, struct > block_device *bdev, > err = NULL; > > get_page(bh->b_page); > + *off = (unsigned int) (bh->b_data - ((char *) > page_address(bh->b_page))); > *res = bh->b_page; > err: > put_bh(bh); > @@ -202,19 +203,19 @@ static void write_bdev_super_endio(str
Re: mm: Use phys_addr_t for reserve_bootmem_region arguments
On 17.05.2016 15:20, Stefan Bader wrote: > Re-posting to a hopefully better suited audience. I hit this problem > when trying to boot an i386 dom0 (PAE enabled) on a 64bit Xen host using > a config which would result in a reserved memory range starting at 4MB. Of course that ^ should be "starting at 4GB" not MB... > Due to the usage of unsigned long as arguments for start address and > length, this would wrap and actually mark the lower memory range starting > from 0 as reserved. Between kernel version 4.2 and 4.4 this somehow boots > but starting with 4.4 the result is a panic and reboot. > > Not sure this special Xen case is the only one affected, but in general > it seems more correct to use phys_addr_t as the type for start and end > as that is the type used in the memblock region definitions and those > are 64bit (at least with PAE enabled). > > -Stefan > > > > From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001 > From: Stefan Bader <stefan.ba...@canonical.com> > Date: Tue, 10 May 2016 19:05:16 +0200 > Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments > > Since 92923ca the reserved bit is set on reserved memblock regions. > However start and end address are passed as unsigned long. This is > only 32bit on i386, so it can end up marking the wrong pages reserved > for ranges at 4GB and above. > > This was observed on a 32bit Xen dom0 which was booted with initial > memory set to a value below 4G but allowing to balloon in memory > (dom0_mem=1024M for example). This would define a reserved bootmem > region for the additional memory (for example on an 8GB system there was > a reserved region covering the 4GB-8GB range). But since the addresses > were passed on as unsigned long, this was actually marking all pages > from 0 to 4GB as reserved. 
> > Fixes: 92923ca "mm: meminit: only set page reserved in the memblock region" > Signed-off-by: Stefan Bader <stefan.ba...@canonical.com> > Cc: <sta...@kernel.org> # 4.2+ > --- > include/linux/mm.h | 2 +- > mm/page_alloc.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index b56ff72..4c1ff62 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1715,7 +1715,7 @@ extern void free_highmem_page(struct page *page); > extern void adjust_managed_page_count(struct page *page, long count); > extern void mem_init_print_info(const char *str); > > -extern void reserve_bootmem_region(unsigned long start, unsigned long end); > +extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end); > > /* Free the reserved page into the buddy system, so it gets managed. */ > static inline void __free_reserved_page(struct page *page) > diff --git a/mm/page_alloc.c > b/mm/page_alloc.c > index c69531a..eb66f89 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -951,7 +951,7 @@ static inline void init_reserved_page(unsigned long pfn) > * marks the pages PageReserved. The remaining valid pages are later > * sent to the buddy page allocator. > */ > -void __meminit reserve_bootmem_region(unsigned long start, unsigned long end) > +void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end) > { > unsigned long start_pfn = PFN_DOWN(start); > unsigned long end_pfn = PFN_UP(end);
mm: Use phys_addr_t for reserve_bootmem_region arguments
Re-posting to a hopefully better suited audience. I hit this problem when
trying to boot a i386 dom0 (PAE enabled) on a 64bit Xen host using a config
which would result in a reserved memory range starting at 4MB. Due to the
usage of unsigned long as arguments for start address and length, this
would wrap and actually mark the lower memory range starting from 0 as
reserved. Between kernel versions 4.2 and 4.4 this somehow boots but
starting with 4.4 the result is a panic and reboot.

Not sure this special Xen case is the only one affected, but in general it
seems more correct to use phys_addr_t as the type for start and end as that
is the type used in the memblock region definitions and those are 64bit (at
least with PAE enabled).

-Stefan

From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 10 May 2016 19:05:16 +0200
Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments

Since 92923ca the reserved bit is set on reserved memblock regions.
However start and end address are passed as unsigned long. This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G but allowing to balloon in memory
(dom0_mem=1024M for example). This would define a reserved bootmem
region for the additional memory (for example on a 8GB system there was
a reserved region covering the 4GB-8GB range). But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca ("mm: meminit: only set page reserved in the memblock region")
Signed-off-by: Stefan Bader <stefan.ba...@canonical.com>
Cc: <sta...@kernel.org> # 4.2+
---
 include/linux/mm.h | 2 +-
 mm/page_alloc.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b56ff72..4c1ff62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1715,7 +1715,7 @@ extern void free_highmem_page(struct page *page);
 extern void adjust_managed_page_count(struct page *page, long count);
 extern void mem_init_print_info(const char *str);
 
-extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
 
 /* Free the reserved page into the buddy system, so it gets managed. */
 static inline void __free_reserved_page(struct page *page)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c69531a..eb66f89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -951,7 +951,7 @@ static inline void init_reserved_page(unsigned long pfn)
  * marks the pages PageReserved. The remaining valid pages are later
  * sent to the buddy page allocator.
  */
-void __meminit reserve_bootmem_region(unsigned long start, unsigned long end)
+void __meminit reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long end_pfn = PFN_UP(end);
-- 
1.9.1
Re: [Xen-devel] bad page flags booting 32bit dom0 on 64bit hypervisor using dom0_mem (kernel >=4.2)
On 02.05.2016 16:24, Stefan Bader wrote:
> On 02.05.2016 13:41, Juergen Gross wrote:
>> On 02/05/16 12:47, Stefan Bader wrote:
>>> I recently tried to boot 32bit dom0 on 64bit Xen host which I
>>> configured to run with a limited, fixed amount of memory for dom0. It
>>> seems that somewhere between kernel versions 3.19 and 4.2 (sorry that
>>> is still a wide range) the Linux kernel would report bad page flags
>>> for a range of pages (which seem to be around the end of the guest pfn
>>> range). For a 4.2 kernel that was easily missed as the boot finished
>>> ok and dom0 was accessible. However starting with 4.4 (tested 4.5 and
>>> a 4.6-rc) the serial console output freezes after some of those bad
>>> page flag messages and then (unfortunately without any further helpful
>>> output) the host reboots (I assume there is a panic that triggers a
>>> reset).
>>>
>>> I suspect the problem is more a kernel side one. It is just possible
>>> to influence things by variation of dom0_mem=#,max:#. 512M seems ok;
>>> 1024M, 2048M, and 3072M cause bad page flags starting around kernel
>>> 4.2 and reboots around 4.4. Then 4096M and not clamping dom0 memory
>>> seem to be ok again (though not limiting dom0 memory seems to cause
>>> trouble on 32bit dom0 later when a domU tries to balloon memory, but I
>>> think that is a different problem).
>>>
>>> I have not seen this on a 64bit dom0. Below is an example of those bad
>>> page errors. Somehow it looks to be a page marked as reserved.
>>> Initially I wondered whether this could be a problem of not clearing
>>> page flags when moving mappings to match the e820. But I never looked
>>> into i386 memory setup in that detail. So I am posting this, hoping
>>> that someone may have an idea from the detail about where to look
>>> next. PAE is enabled there. Usually it's bpf init that gets hit but
>>> that likely is just because that is doing the first vmallocs.
>>
>> Could you please post the kernel config, Xen and dom0 boot parameters?
>> I'm quite sure this is no common problem as there are standard tests
>> running for each kernel version including 32 bit dom0 with limited
>> memory size.
>
> Hi Jürgen,
>
> sure. Though by doing that I realized where I actually messed the whole
> thing up. I got the max limit syntax completely wrong. :( Instead of the
> correct "dom0_mem=1024M,max:1024M" I am using "dom0_mem=1024M:max=1024M"
> which I guess is like not having max set at all. Not sure whether that
> is a valid use case.
>
> When I actually do the dom0_mem argument right, there are no bad page
> flag errors even in 4.4 with the 1024M limit. I was at least consistent
> in my mis-configuration, so doing the same stupid thing on 64bit seems
> to be handled more gracefully.
>
> Likely false alarm. But at least cutting the config into the mail made
> me spot the problem...

Ok, thinking that "dom0_mem=x" (without a max or min) still is a valid
case, I went ahead and did a bisect for when the bad page flag issue
started. I ended up at:

  92923ca ("mm: meminit: only set page reserved in the memblock region")

And with a few more printks in the new functions I finally realized why
this goes wrong. The new reserve_bootmem_region() is using unsigned long
for start and end addresses, which just isn't working too well for 32bit.
For Xen dom0 the problem with that can just be more easily triggered. When
dom0 memory is limited to a small size but allowed to balloon for more,
the additional system memory is put into reserved regions. In my case (a
host with 8G memory and say 1G initial dom0 memory) this created, apart
from others, one reserved region which started at 4GB and covered the
remaining 4G of host memory. Which reserve_bootmem_region() got as 0-4G
due to the unsigned long conversion. This basically marked *all* memory
below 4G as reserved.

The fix is relatively simple, just use phys_addr_t for start and end. I
tested this on 4.2 and 4.4 kernels. Both now boot without errors and
neither does the 4.4 kernel crash. Maybe still not 100% safe when running
on very large memory systems (if I did not get the math wrong, 16T) but at
least some improvement...

-Stefan

From 1588a8b3983f63f8e690b91e99fe631902e38805 Mon Sep 17 00:00:00 2001
From: Stefan Bader <stefan.ba...@canonical.com>
Date: Tue, 10 May 2016 19:05:16 +0200
Subject: [PATCH] mm: Use phys_addr_t for reserve_bootmem_region arguments

Since 92923ca the reserved bit is set on reserved memblock regions.
However start and end address
[PATCH resend] bcache: prevent crash on changing writeback_running
I sent this out quite a while ago. Back then it was said that it was
applied to the development tree. But it seems it never made its way
upstream. I re-tested with a 4.0.7 kernel and compared this against a 4.0
kernel with this patch applied. Still it is possible to trigger a NULL
pointer violation when writing 0 into writeback_running of a bcache device
that is not attached to a cache set (without the patch).

-Stefan

---
From e949c64fa6acbdaab999410250855a6a4fc6bad1 Mon Sep 17 00:00:00 2001
From: Stefan Bader
Date: Mon, 18 Aug 2014 20:00:13 +0200
Subject: [PATCH] bcache: prevent crash on changing writeback_running

commit a664d0f05a2e ("bcache: fix crash on shutdown in passthrough mode")
added a safeguard in the shutdown case. At least while not being attached
it is also possible to trigger a kernel bug by writing into
writeback_running. This change adds the same check before trying to wake
up the thread for that case.

Signed-off-by: Stefan Bader
---
 drivers/md/bcache/writeback.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 0a9dab1..073a042 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
 {
-	wake_up_process(dc->writeback_thread);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+		wake_up_process(dc->writeback_thread);
 }
 
 static inline void bch_writeback_add(struct cached_dev *dc)
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.10
On 18.03.2015 10:18, Paolo Bonzini wrote:
> On 18/03/2015 09:46, Stefan Bader wrote:
>> Regardless of that, I wonder whether the below (this version untested)
>> sounds acceptable for upstream? At least it would make debugging much
>> simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>  	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
>>
>>  	/* Ensure minimum (required) set of control bits are supported. */
>> -	if (ctl_min & ~ctl)
>> +	if (ctl_min & ~ctl) {
>> +		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>  		return -EIO;
>> +	}
>>
>>  	*result = ctl;
>>  	return 0;
>
> Yes, this is nice. Maybe -ENODEV.
>
> Also, a minimal patch for Ubuntu would probably be:
>
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> 		vmx_capability.ept, vmx_capability.vpid);
> 	}
>
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
>
> but I don't think it's a good idea to add it to stable kernels.

Sorry, I got a bit confused on my assumptions. While the change above does
cause guests to fail, the statement that this is caused by all host kernels
before this change was against better knowledge and wrong. The actual
situation is: hosts running 3.2, which (maybe not perfect but at least well
enough) allowed using nested vmx, will break for guest kernels <3.15. But
running 3.13 on the host has no issues.

Comparing the rdmsr values of guests between those two host kernels, I
found that on 3.2 the exit control msr was very sparsely initialized. And
looking at the changes between 3.2 and 3.13 I found

  commit 33fb20c39e98b90813b5ab2d9a0d6faa6300caca
  Author: Jan Kiszka
  Date:   Wed Mar 6 15:44:03 2013 +0100

      KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS

This was added in 3.10. So the range of affected host kernels is <3.10,
back to when nested vmx became somewhat usable. For 3.2 Ben (and obviously
us) would be affected. Apart from that, I believe, it is only 3.4 which
has an active longterm. At least that change looks safer for stable as it
sounds like correcting things and not adding a feature.

I was able to cherry-pick that into a 3.2 kernel and then a 3.16 guest
successfully can load the kvm-intel module again, of course with the same
shortcomings as before.

-Stefan

> Paolo
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
On 18.03.2015 11:27, Paolo Bonzini wrote:
> On 18/03/2015 10:59, Stefan Bader wrote:
>>> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
>>> 		vmx_capability.ept, vmx_capability.vpid);
>>> 	}
>>>
>>> -	min = 0;
>>> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>>>  #ifdef CONFIG_X86_64
>>> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>>>  #endif
>>>
>>> but I don't think it's a good idea to add it to stable kernels.
>>
>> Why is that? Because it has a risk of causing the module failing to
>> load on L0 where it did work before?
>
> Because if we wanted to make 3.14 nested VMX stable-ish we would need
> several more, at least these:
>
>   KVM: nVMX: fix lifetime issues for vmcs02
>   KVM: nVMX: clean up nested_release_vmcs12 and code around it
>   KVM: nVMX: Rework interception of IRQs and NMIs
>   KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
>   KVM: nVMX: Disable preemption while reading from shadow VMCS
>
> and for 3.13:
>
>   KVM: nVMX: Leave VMX mode on clearing of feature control MSR
>
> There are also several L2-crash-L1 bugs too in Nadav Amit's patches.
>
> Basically, nested VMX was never considered stable-worthy. Perhaps that
> can change soon---but not retroactively. So I'd rather avoid giving
> false impressions of the stability of nVMX in 3.14.
>
> Even if we considered nVMX stable, I'd _really_ not want to consider
> the L1<->L2 boundary a secure one for a longer time.
>
>> Which would be something I would rather avoid. Generally I think it
>> would be good to have something that can be generally applied. Given
>> the speed that cloud service providers tend to move forward (ok they
>> may not actively push the ability to go nested).
>
> And if they did, I'd really not want them to do it with a 3.14 kernel.

3.14... you are optimistic. :) But thanks a lot for the detailed info.

-Stefan

> Paolo
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
On 18.03.2015 10:18, Paolo Bonzini wrote:
> On 18/03/2015 09:46, Stefan Bader wrote:
>> Regardless of that, I wonder whether the below (this version untested)
>> sounds acceptable for upstream? At least it would make debugging much
>> simpler. :)
>>
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
>>  	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
>>
>>  	/* Ensure minimum (required) set of control bits are supported. */
>> -	if (ctl_min & ~ctl)
>> +	if (ctl_min & ~ctl) {
>> +		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
>> +		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
>>  		return -EIO;
>> +	}
>>
>>  	*result = ctl;
>>  	return 0;
>
> Yes, this is nice. Maybe -ENODEV.

Maybe, though I did not change that. Just added to give some kind of hint
when the module would otherwise fail with just an IO error.

> Also, a minimal patch for Ubuntu would probably be:
>
> @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
> 		vmx_capability.ept, vmx_capability.vpid);
> 	}
>
> -	min = 0;
> +	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
>  #ifdef CONFIG_X86_64
> 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #endif
>
> but I don't think it's a good idea to add it to stable kernels.

Why is that? Because it has a risk of causing the module failing to load
on L0 where it did work before? Which would be something I would rather
avoid. Generally I think it would be good to have something that can be
generally applied. Given the speed that cloud service providers tend to
move forward (ok, they may not actively push the ability to go nested).

-Stefan

> Paolo
regression: nested: L1 3.15+ fails to load kvm-intel on L0 <3.15
Someone reported[1] that some of their L1 guests fail to load the kvm-intel
module (without much detail). Turns out that this was (at least) caused by

  KVM: vmx: Allow the guest to run with dirty debug registers

as this adds VM_EXIT_SAVE_DEBUG_CONTROLS to the required
MSR_IA32_VMX_EXIT_CTLS bits. Not sure this should be fixed up in pre-3.15
kernels or the other way round. Maybe naively asked, but would it be
sufficient to add this as required to older kernels' vmcs setup (without
the code to make any use of it)?

Regardless of that, I wonder whether the below (this version untested)
sounds acceptable for upstream? At least it would make debugging much
simpler. :)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct
 	ctl |= vmx_msr_low;  /* bit == 1 in low word ==> must be one */
 
 	/* Ensure minimum (required) set of control bits are supported. */
-	if (ctl_min & ~ctl)
+	if (ctl_min & ~ctl) {
+		printk(KERN_ERR "vmx: msr(%08x) does not match requirements. "
+		       "req=%08x cur=%08x\n", msr, ctl_min, ctl);
 		return -EIO;
+	}
 
 	*result = ctl;
 	return 0;

Thanks,
-Stefan

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1431473
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 3.15
On 18.03.2015 11:27, Paolo Bonzini wrote: On 18/03/2015 10:59, Stefan Bader wrote: @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmx_capability.ept, vmx_capability.vpid); } - min = 0; + min = VM_EXIT_SAVE_DEBUG_CONTROLS; #ifdef CONFIG_X86_64 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; #endif but I don't think it's a good idea to add it to stable kernels. Why is that? Because it has a risk of causing the module failing to load on L0 where it did work before? Because if we wanted to make 3.14 nested VMX stable-ish we would need several more, at least these: KVM: nVMX: fix lifetime issues for vmcs02 KVM: nVMX: clean up nested_release_vmcs12 and code around it KVM: nVMX: Rework interception of IRQs and NMIs KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt KVM: nVMX: Disable preemption while reading from shadow VMCS and for 3.13: KVM: nVMX: Leave VMX mode on clearing of feature control MSR There are also several L2-crash-L1 bugs too in Nadav Amit's patches. Basically, nested VMX was never considered stable-worthy. Perhaps that can change soon---but not retroactively. So I'd rather avoid giving false impressions of the stability of nVMX in 3.14. Even if we considered nVMX stable, I'd _really_ not want to consider the L1-L2 boundary a secure one for a longer time. Which would be something I would rather avoid. Generally I think it would be good to have something that can be generally applied. Given the speed that cloud service providers tend to move forward (ok they may not actively push the ability to go nested). And if they did, I'd really not want them to do it with a 3.14 kernel. 3.14... you are optimistic. :) But thanks a lot for the detailed info. -Stefan Paolo signature.asc Description: OpenPGP digital signature
Re: regression: nested: L1 3.15+ fails to load kvm-intel on L0 3.15
On 18.03.2015 10:18, Paolo Bonzini wrote: On 18/03/2015 09:46, Stefan Bader wrote: Regardless of that, I wonder whether the below (this version untested) sounds acceptable for upstream? At least it would make debugging much simpler. :) --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -2953,8 +2953,11 @@ static __init int adjust_vmx_controls(u32 ctl_min, u32 ct ctl |= vmx_msr_low; /* bit == 1 in low word ==> must be one */ /* Ensure minimum (required) set of control bits are supported. */ - if (ctl_min & ~ctl) + if (ctl_min & ~ctl) { + printk(KERN_ERR "vmx: msr(%08x) does not match requirements. " + "req=%08x cur=%08x\n", msr, ctl_min, ctl); return -EIO; + } *result = ctl; return 0; Yes, this is nice. Maybe -ENODEV. Maybe, though I did not change that. Just added to give some kind of hint when the module would otherwise fail with just an IO error. Also, a minimal patch for Ubuntu would probably be: @@ -2850,7 +2851,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmx_capability.ept, vmx_capability.vpid); } - min = 0; + min = VM_EXIT_SAVE_DEBUG_CONTROLS; #ifdef CONFIG_X86_64 min |= VM_EXIT_HOST_ADDR_SPACE_SIZE; #endif but I don't think it's a good idea to add it to stable kernels. Why is that? Because it has a risk of causing the module failing to load on L0 where it did work before? Which would be something I would rather avoid. Generally I think it would be good to have something that can be generally applied. Given the speed that cloud service providers tend to move forward (ok they may not actively push the ability to go nested). -Stefan Paolo
Re: [PATCH 3.18 129/151] x86/xen: Treat SCI interrupt as normal GSI interrupt
On 04.03.2015 07:14, Greg Kroah-Hartman wrote: > 3.18-stable review patch. If anyone has any objections, please let me know. I thought I replied earlier today but I cannot seem to find it coming back via the mailing list. Hope this is not duplicating too much... There was a regression with that patch and it requires the below commit as well to prevent that: commit 1ea76fbadd667b19c4fa4466f3a3b55a505e83d9 Author: Jiang Liu Date: Mon Feb 16 10:11:13 2015 +0800 x86/irq: Fix regression caused by commit b568b8601f05 Commit b568b8601f05 ("Treat SCI interrupt as normal GSI interrupt") accidently removes support of legacy PIC interrupt when fixing a regression for Xen, which causes a nasty regression on HP/Compaq nc6000 where we fail to register the ACPI interrupt, and thus lose eg. thermal notifications leading a potentially overheated machine. -Stefan > > -- > > From: Jiang Liu > > commit b568b8601f05a591a7ff09d8ee1cedb5b2e815fe upstream. > > Currently Xen Domain0 has special treatment for ACPI SCI interrupt, > that is initialize irq for ACPI SCI at early stage in a special way as: > xen_init_IRQ() > ->pci_xen_initial_domain() > ->xen_setup_acpi_sci() > Allocate and initialize irq for ACPI SCI > > Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq > number for ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on > IOAPIC irqdomains through following path > acpi_gsi_to_irq() > ->mp_map_gsi_to_irq() > ->mp_map_pin_to_irq() > ->check IOAPIC irqdomain > > For PV domains, it uses Xen event based interrupt manangement and > doesn't make uses of native IOAPIC, so no irqdomains created for IOAPIC. > This causes Xen domain0 fail to install interrupt handler for ACPI SCI > and all ACPI events will be lost. 
Please refer to: > https://lkml.org/lkml/2014/12/19/178 > > So the fix is to get rid of special treatment for ACPI SCI, just treat > ACPI SCI as normal GSI interrupt as: > acpi_gsi_to_irq() > ->acpi_register_gsi() > ->acpi_register_gsi_xen() > ->xen_register_gsi() > > With above change, there's no need for xen_setup_acpi_sci() anymore. > The above change also works with bare metal kernel too. > > Signed-off-by: Jiang Liu > Tested-by: Sander Eikelenboom > Cc: Tony Luck > Cc: xen-de...@lists.xenproject.org > Cc: Konrad Rzeszutek Wilk > Cc: David Vrabel > Cc: Rafael J. Wysocki > Cc: Len Brown > Cc: Pavel Machek > Cc: Bjorn Helgaas > Link: > http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com > Signed-off-by: Thomas Gleixner > Signed-off-by: Stefan Bader > Signed-off-by: Greg Kroah-Hartman > > --- > arch/x86/kernel/acpi/boot.c | 21 ++- > arch/x86/pci/xen.c | 47 > > 2 files changed, 11 insertions(+), 57 deletions(-) > > --- a/arch/x86/kernel/acpi/boot.c > +++ b/arch/x86/kernel/acpi/boot.c > @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(uns > > int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) > { > - int irq; > + int rc, irq, trigger, polarity; > > - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { > - *irqp = gsi; > - } else { > - irq = mp_map_gsi_to_irq(gsi, > - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); > - if (irq < 0) > - return -1; > - *irqp = irq; > + rc = acpi_get_override_irq(gsi, &trigger, &polarity); > + if (rc == 0) { > + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; > + polarity = polarity ? 
ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH; > + irq = acpi_register_gsi(NULL, gsi, trigger, polarity); > + if (irq >= 0) { > + *irqp = irq; > + return 0; > + } > } > - return 0; > + return -1; > } > EXPORT_SYMBOL_GPL(acpi_gsi_to_irq); > > --- a/arch/x86/pci/xen.c > +++ b/arch/x86/pci/xen.c > @@ -452,52 +452,6 @@ int __init pci_xen_hvm_init(void) > } > > #ifdef CONFIG_XEN_DOM0 > -static __init void xen_setup_acpi_sci(void) > -{ > - int rc; > - int trigger, polarity; > - int gsi = acpi_sci_override_gsi; > - int irq = -1; > - int gsi_override = -1; > - > - if (!gsi) > - return; > - > - rc = acpi_get_override_irq(gsi, &trigger, &polarity); > - if (rc) { > - printk(KERN_WARNING "xen: acpi_get_override_irq failed for acpi" > -
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 20:15, Sander Eikelenboom wrote: > > Monday, February 9, 2015, 5:09:44 PM, you wrote: > >> On 09.02.2015 13:29, Stefan Bader wrote: >>> On 09.02.2015 13:12, Jiang Liu wrote: >>>> On 2015/2/9 17:47, Stefan Bader wrote: >>>>> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>>>>> >>>>>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>>>>> >>>>>>> Hi Thomas, >>>>>>> This patch set includes three hotfixes related Xen IRQ for >>>>>>> v3.19. >>>>>>> Sorry for the long delay to get these two regressions fixed, it really >>>>>>> cost me some time to read and understand Xen IRQ code. >>>>>> >>>>>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>>>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>>>>> too once it reaches the mainstream kernel. >>>>>> >>>>>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>>>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>>>>> it has passed tests too. >>>>>> >>>>>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>>>>> reports related to this issue yet. >>>>>> >>>>>>> Great thanks to Konrad and Sander for testing fixes for these >>>>>>> regressions. >>>>>> >>>>>>> Regards, >>>>>>> Gerry >>>>>> >>>>>> Hi Gerry, >>>>>> >>>>>> Since these patches now are: tested, reviewed and have landed into >>>>>> mainline, >>>>>> could you also provide the backports for 3.17 and 3.18 where required ? >>>>> >>>>> This would be my attempt of backporting those to 3.18 (I have not tried >>>>> whether >>>>> they would apply to 3.17 as well). Those seem to work when test-booting >>>>> and >>>>> Sander said he will give them a try as well. But better check yourself I >>>>> did not >>>>> mess anything up while backporting. One of the three patches seemed not >>>>> to be >>>>> required at all. >>>> I'm backporting them too, but now I will give up my patches:) >>>> >>>> The first one is a must, which should be applied to 3.17 and 3.18. 
>>>> The second one doesn't apply to 3.17 or 3.18. >>>> The third may be applied to v3.17 and v3.18, but it shouldn't be applied >>>> according to the stable-kernel doc because no user reports such a bug yet. >>> >>> Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets >>> around the acpi irq and the usb problems, too. Though interestingly from the >>> history, the acpi irq I saw in 3.17 while the usb devices started to fail >>> with >>> 3.18. I will let you know in a bit. Just want to finish another bisect >>> first. > >> So it looks like patch #1 indeed does fix both the acpi irq and my usb >> devices. >> Sander, do you want to verify that variant, too? > > Hi Stefan / Gerry, > > Just tested with only patch 1, ended up the same .. the reported issues seem > gone. Sounds good. :) Gerry, do you want me to fwd the backport to Greg for 3.18 or do you do that? As far as I know 3.17 has nobody caring about it. -Stefan > > >> The only odd thing left in a 3.18 with the fix (regardless of using patch #1 >> only or patches #1 and #3) is that the ahci driver shows up with its pci bus >> number in /proc/interrupts. But that does not seem to have any relevant >> downside >> (cannot try 3.19 on that box right now for other reasons). > >> -Stefan > >> ... >> # 57: 12441 000 xen-pirq-msi :00:1f.2 >> #(used to be ahci) > > This is also there on baremetal with 3.19 on intel, so not xen related: > 27: 1868582 1282555 IR-PCI-MSI-edge > :00:1f.2 > > -- > Sander > >>> >>> -Stefan >>> >>>> Regards >>>> Gerry >>>>> >>>>> -Stefan >>>>> >>>>>> >>>>>> The number of people running into these (subtle) issues with these stable >>>>>> kernels seems to be steadily increasing. 
>>>>>> >>>>>> -- >>>>>> Sander >>>>>> >>>>>>> Jiang Liu (3): >>>>>>> xen/pci: Kill function xen_setup_acpi_sci() >>>>>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>>>>> cffe0a2b5a34 >>>>>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>>>>> >>>>>>> arch/x86/include/asm/acpi.h |1 + >>>>>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>>>>> arch/x86/pci/xen.c | 49 >>>>>>> ++- >>>>>>> drivers/acpi/pci_irq.c |1 - >>>>>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>>>>> >>>>>> >>>>> >>> >>> > > > signature.asc Description: OpenPGP digital signature
Re: 3.19: device name associates with IRQ's for ahci controllers operating with a single IRQ changed from "ahci?" to ""
On 09.02.2015 20:54, Sander Eikelenboom wrote: > Hi. > > In 3.19 the device name associates with IRQ's for ahci controllers operating > with a single IRQ changed from "ahci?" to "", was this intentional ? > > It's probably commit 18dcf433f3ded61eb140a55e7048ec2fef79e723 (or another one > in that series). Oh, looking at that commit that might make sense. In ahci_host_activate_single_irq it now uses dev_driver_string() instead of dev_name() as it did before (from ata_host_activate). The description sounds like before the driver is registered this will return the bus. And registering an interrupt would likely be before the driver is fully registered... But you probably saw that, too. Whether that really was intentional is still the question. :) -Stefan > > -- > Sander > > > /proc/interrupts of an ahci controller with a single irq: > 52: 13529 0 0 0 xen-pirq-msi > :00:1f.2 > > /proc/interrupts of an ahci controller with multiple irq's: > 114: 412535 0 0 0 0 0 > xen-pirq-msi ahci0 > 115: 0 0 0 0 0 0 > xen-pirq-msi ahci1 > 116: 16717 0 0 0 0 0 > xen-pirq-msi ahci2 > 117: 0 0 0 0 0 0 > xen-pirq-msi ahci3 > 118: 0 0 0 0 0 0 > xen-pirq-msi ahci4 > 119: 0 0 0 0 0 0 > xen-pirq-msi ahci5 > 
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:29, Stefan Bader wrote: > On 09.02.2015 13:12, Jiang Liu wrote: >> On 2015/2/9 17:47, Stefan Bader wrote: >>> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>>> >>>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>>> >>>>> Hi Thomas, >>>>> This patch set includes three hotfixes related Xen IRQ for v3.19. >>>>> Sorry for the long delay to get these two regressions fixed, it really >>>>> cost me some time to read and understand Xen IRQ code. >>>> >>>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>>> too once it reaches the mainstream kernel. >>>> >>>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>>> it has passed tests too. >>>> >>>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>>> reports related to this issue yet. >>>> >>>>> Great thanks to Konrad and Sander for testing fixes for these regressions. >>>> >>>>> Regards, >>>>> Gerry >>>> >>>> Hi Gerry, >>>> >>>> Since these patches now are: tested, reviewed and have landed into >>>> mainline, >>>> could you also provide the backports for 3.17 and 3.18 where required ? >>> >>> This would be my attempt of backporting those to 3.18 (I have not tried >>> whether >>> they would apply to 3.17 as well). Those seem to work when test-booting and >>> Sander said he will give them a try as well. But better check yourself I >>> did not >>> mess anything up while backporting. One of the three patches seemed not to >>> be >>> required at all. >> I'm backporting them too, but now I will give up my patches:) >> >> The first one is a must, which should be applied to 3.17 and 3.18. >> The second one doesn't apply to 3.17 or 3.18. >> The third may be applied to v3.17 and v3.18, but it shouldn't be applied >> according to the stable-kernel doc because no user reports such a bug yet. > > Hm, ok. 
So maybe I should retry with only patch#1 and verify that alone gets > around the acpi irq and the usb problems, too. Though interestingly from the > history, the acpi irq I saw in 3.17 while the usb devices started to fail with > 3.18. I will let you know in a bit. Just want to finish another bisect > first. So it looks like patch #1 indeed does fix both the acpi irq and my usb devices. Sander, do you want to verify that variant, too? The only odd thing left in a 3.18 with the fix (regardless of using patch #1 only or patches #1 and #3) is that the ahci driver shows up with its pci bus number in /proc/interrupts. But that does not seem to have any relevant downside (cannot try 3.19 on that box right now for other reasons). -Stefan ... # 57: 12441 000 xen-pirq-msi :00:1f.2 #(used to be ahci) > > -Stefan > >> Regards >> Gerry >>> >>> -Stefan >>> >>>> >>>> The number of people running into these (subtle) issues with these stable >>>> kernels seems to be steadily increasing. >>>> >>>> -- >>>> Sander >>>> >>>>> Jiang Liu (3): >>>>> xen/pci: Kill function xen_setup_acpi_sci() >>>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>>> cffe0a2b5a34 >>>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>>> >>>>> arch/x86/include/asm/acpi.h |1 + >>>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>>> arch/x86/pci/xen.c | 49 >>>>> ++- >>>>> drivers/acpi/pci_irq.c |1 - >>>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>>> >>>> >>> > >
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:12, Jiang Liu wrote: > On 2015/2/9 17:47, Stefan Bader wrote: >> On 05.02.2015 21:07, Sander Eikelenboom wrote: >>> >>> Tuesday, January 20, 2015, 3:21:04 AM, you wrote: >>> >>>> Hi Thomas, >>>> This patch set includes three hotfixes related Xen IRQ for v3.19. >>>> Sorry for the long delay to get these two regressions fixed, it really >>>> cost me some time to read and understand Xen IRQ code. >>> >>>> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >>>> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >>>> too once it reaches the mainstream kernel. >>> >>>> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >>>> It's a temporary solution, I will send the formal fix for v3.20 and >>>> it has passed tests too. >>> >>>> Patch 3 fixes an issue found when reading code. There's no real bug >>>> reports related to this issue yet. >>> >>>> Great thanks to Konrad and Sander for testing fixes for these regressions. >>> >>>> Regards, >>>> Gerry >>> >>> Hi Gerry, >>> >>> Since these patches now are: tested, reviewed and have landed into mainline, >>> could you also provide the backports for 3.17 and 3.18 where required ? >> >> This would be my attempt of backporting those to 3.18 (I have not tried >> whether >> they would apply to 3.17 as well). Those seem to work when test-booting and >> Sander said he will give them a try as well. But better check yourself I did >> not >> mess anything up while backporting. One of the three patches seemed not to be >> required at all. > I'm backporting them too, but now I will give up my patches:) > > The first one is a must, which should be applied to 3.17 and 3.18. > The second one doesn't apply to 3.17 or 3.18. > The third may be applied to v3.17 and v3.18, but it shouldn't be applied > according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. 
So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. Just want to finish another bisect first. -Stefan > Regards > Gerry >> >> -Stefan >> >>> >>> The number of people running into these (subtle) issues with these stable >>> kernels seems to be steadily increasing. >>> >>> -- >>> Sander >>> >>>> Jiang Liu (3): >>>> xen/pci: Kill function xen_setup_acpi_sci() >>>> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >>>> cffe0a2b5a34 >>>> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi >>> >>>> arch/x86/include/asm/acpi.h |1 + >>>> arch/x86/kernel/acpi/boot.c | 26 +++ >>>> arch/x86/pci/xen.c | 49 >>>> ++- >>>> drivers/acpi/pci_irq.c |1 - >>>> 4 files changed, 16 insertions(+), 61 deletions(-) >>> >>> >>
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 05.02.2015 21:07, Sander Eikelenboom wrote: > > Tuesday, January 20, 2015, 3:21:04 AM, you wrote: > >> Hi Thomas, >> This patch set includes three hotfixes related Xen IRQ for v3.19. >> Sorry for the long delay to get these two regressions fixed, it really >> cost me some time to read and understand Xen IRQ code. > >> Patch 1 fixes the failure to register ACPI SCI interrupt on Xen >> domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 >> too once it reaches the mainstream kernel. > >> Patch 2 fixes the regression in Xen PCI device passthrough(pciback). >> It's a temporary solution, I will send the formal fix for v3.20 and >> it has passed tests too. > >> Patch 3 fixes an issue found when reading code. There's no real bug >> reports related to this issue yet. > >> Great thanks to Konrad and Sander for testing fixes for these regressions. > >> Regards, >> Gerry > > Hi Gerry, > > Since these patches now are: tested, reviewed and have landed into mainline, > could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. -Stefan > > The number of people running into these (subtle) issues with these stable > kernels seems to be steadily increasing. 
> > -- > Sander > >> Jiang Liu (3): >> xen/pci: Kill function xen_setup_acpi_sci() >> xen/irq, ACPI: Fix regression in xen PCI passthrough caused by >> cffe0a2b5a34 >> xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi > >> arch/x86/include/asm/acpi.h |1 + >> arch/x86/kernel/acpi/boot.c | 26 +++ >> arch/x86/pci/xen.c | 49 >> ++- >> drivers/acpi/pci_irq.c |1 - >> 4 files changed, 16 insertions(+), 61 deletions(-) > > From 8b5b328b62248d95743ca9af7aa71c06dd808dfe Mon Sep 17 00:00:00 2001 From: Jiang Liu Date: Tue, 20 Jan 2015 10:21:05 +0800 Subject: [PATCH 1/2] x86/xen: Treat SCI interrupt as normal GSI interrupt Currently Xen Domain0 has special treatment for ACPI SCI interrupt, that is initialize irq for ACPI SCI at early stage in a special way as: xen_init_IRQ() ->pci_xen_initial_domain() ->xen_setup_acpi_sci() Allocate and initialize irq for ACPI SCI Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq number for ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on IOAPIC irqdomains through following path acpi_gsi_to_irq() ->mp_map_gsi_to_irq() ->mp_map_pin_to_irq() ->check IOAPIC irqdomain For PV domains, it uses Xen event based interrupt manangement and doesn't make uses of native IOAPIC, so no irqdomains created for IOAPIC. This causes Xen domain0 fail to install interrupt handler for ACPI SCI and all ACPI events will be lost. Please refer to: https://lkml.org/lkml/2014/12/19/178 So the fix is to get rid of special treatment for ACPI SCI, just treat ACPI SCI as normal GSI interrupt as: acpi_gsi_to_irq() ->acpi_register_gsi() ->acpi_register_gsi_xen() ->xen_register_gsi() With above change, there's no need for xen_setup_acpi_sci() anymore. The above change also works with bare metal kernel too. Signed-off-by: Jiang Liu Tested-by: Sander Eikelenboom Cc: Tony Luck Cc: xen-de...@lists.xenproject.org Cc: Konrad Rzeszutek Wilk Cc: David Vrabel Cc: Rafael J. 
Wysocki Cc: Len Brown Cc: Pavel Machek Cc: Bjorn Helgaas Link: http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com Signed-off-by: Thomas Gleixner Signed-off-by: Stefan Bader --- arch/x86/kernel/acpi/boot.c | 23 +++--- arch/x86/pci/xen.c | 47 - 2 files changed, 12 insertions(+), 58 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index a142e77..460f498 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger) int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) { - int irq; - - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { - *irqp = gsi; - } else { - irq = mp_map_gsi_to_irq(gsi, - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); - if (irq < 0) - return -1; - *irqp = irq; + int rc, irq, trigger, polarity; + + rc = acpi_get_override_irq(gsi, &trigger, &polarity); + if (rc == 0) { + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; + polarity = polarity ? ACPI_ACT
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. 
-- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) From 8b5b328b62248d95743ca9af7aa71c06dd808dfe Mon Sep 17 00:00:00 2001 From: Jiang Liu jiang@linux.intel.com Date: Tue, 20 Jan 2015 10:21:05 +0800 Subject: [PATCH 1/2] x86/xen: Treat SCI interrupt as normal GSI interrupt Currently Xen Domain0 has special treatment for the ACPI SCI interrupt, that is, it initializes the irq for the ACPI SCI at an early stage in a special way: xen_init_IRQ() ->pci_xen_initial_domain() ->xen_setup_acpi_sci() Allocate and initialize irq for ACPI SCI Function xen_setup_acpi_sci() calls acpi_gsi_to_irq() to get an irq number for the ACPI SCI. But unfortunately acpi_gsi_to_irq() depends on IOAPIC irqdomains through the following path: acpi_gsi_to_irq() ->mp_map_gsi_to_irq() ->mp_map_pin_to_irq() ->check IOAPIC irqdomain PV domains use Xen event based interrupt management and don't make use of the native IOAPIC, so no irqdomains are created for the IOAPIC. This causes Xen domain0 to fail to install an interrupt handler for the ACPI SCI, and all ACPI events will be lost. Please refer to: https://lkml.org/lkml/2014/12/19/178 So the fix is to get rid of the special treatment for the ACPI SCI and just treat the ACPI SCI as a normal GSI interrupt: acpi_gsi_to_irq() ->acpi_register_gsi() ->acpi_register_gsi_xen() ->xen_register_gsi() With the above change, there's no need for xen_setup_acpi_sci() anymore. The change also works on bare metal kernels.
Signed-off-by: Jiang Liu jiang@linux.intel.com Tested-by: Sander Eikelenboom li...@eikelenboom.it Cc: Tony Luck tony.l...@intel.com Cc: xen-de...@lists.xenproject.org Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: David Vrabel david.vra...@citrix.com Cc: Rafael J. Wysocki r...@rjwysocki.net Cc: Len Brown len.br...@intel.com Cc: Pavel Machek pa...@ucw.cz Cc: Bjorn Helgaas bhelg...@google.com Link: http://lkml.kernel.org/r/1421720467-7709-2-git-send-email-jiang@linux.intel.com Signed-off-by: Thomas Gleixner t...@linutronix.de Signed-off-by: Stefan Bader stefan.ba...@canonical.com --- arch/x86/kernel/acpi/boot.c | 23 +++--- arch/x86/pci/xen.c | 47 - 2 files changed, 12 insertions(+), 58 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index a142e77..460f498 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -604,18 +604,19 @@ void __init acpi_pic_sci_set_trigger(unsigned int irq, u16 trigger) int acpi_gsi_to_irq(u32 gsi, unsigned int *irqp) { - int irq; - - if (acpi_irq_model == ACPI_IRQ_MODEL_PIC) { - *irqp = gsi; - } else { - irq = mp_map_gsi_to_irq(gsi, - IOAPIC_MAP_ALLOC | IOAPIC_MAP_CHECK); - if (irq < 0) - return -1; - *irqp = irq; + int rc, irq, trigger, polarity; + + rc = acpi_get_override_irq(gsi, &trigger, &polarity); + if (rc == 0) { + trigger = trigger ? ACPI_LEVEL_SENSITIVE : ACPI_EDGE_SENSITIVE; + polarity = polarity ? ACPI_ACTIVE_LOW : ACPI_ACTIVE_HIGH; + irq = acpi_register_gsi(NULL, gsi, trigger, polarity
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:29, Stefan Bader wrote: On 09.02.2015 13:12, Jiang Liu wrote: On 2015/2/9 17:47, Stefan Bader wrote: On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. I'm backporting them too, but now I will give up my patches:) The first one is a must, which should be applied to 3.17 and 3.18. The second one doesn't apply to 3.17 or 3.18. The third may be applied to v3.17 and v3.18, but it shouldn't be applied according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. 
Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. Just want to finish another bisect first. So it looks like patch #1 indeed does fix both the acpi irq and my usb devices. Sander, do you want to verify that variant, too? The only odd thing left in a 3.18 with the fix (regardless of using patch #1 only or patches #1 and #3) is that the ahci driver shows up with its pci bus number in /proc/interrupts. But that does not seem to have any relevant downside (cannot try 3.19 on that box right now for other reasons). -Stefan ... # 57: 12441 000 xen-pirq-msi :00:1f.2 #(used to be ahci) -Stefan Regards Gerry -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. -- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) signature.asc Description: OpenPGP digital signature
Re: 3.19: device name associates with IRQ's for ahci controllers operating with a single IRQ changed from ahci? to BDF
On 09.02.2015 20:54, Sander Eikelenboom wrote: Hi. In 3.19 the device name associates with IRQ's for ahci controllers operating with a single IRQ changed from ahci? to BDF, was this intentional ? It's probably commit 18dcf433f3ded61eb140a55e7048ec2fef79e723 (or another one in that series). Oh, looking at that commit that might make sense. In ahci_host_activate_single_irq it now uses dev_driver_name instead of dev_name as it did before (from ata_host_activate). The description sounds like before the driver is registered this will return the bus. And registering an interrupt would likely be before the driver is fully registered... But you probably saw that, too. Whether that really was intentional is still the question. :) -Stefan -- Sander
/proc/interrupts of an ahci controller with a single irq:
52: 13529 0 0 0 xen-pirq-msi :00:1f.2
/proc/interrupts of an ahci controller with multiple irq's:
114: 412535 0 0 0 0 0 xen-pirq-msi ahci0
115: 0 0 0 0 0 0 xen-pirq-msi ahci1
116: 16717 0 0 0 0 0 xen-pirq-msi ahci2
117: 0 0 0 0 0 0 xen-pirq-msi ahci3
118: 0 0 0 0 0 0 xen-pirq-msi ahci4
119: 0 0 0 0 0 0 xen-pirq-msi ahci5
signature.asc Description: OpenPGP digital signature
Re: [Bugfix 0/3] Xen IRQ related hotfixes for v3.19
On 09.02.2015 13:12, Jiang Liu wrote: On 2015/2/9 17:47, Stefan Bader wrote: On 05.02.2015 21:07, Sander Eikelenboom wrote: Tuesday, January 20, 2015, 3:21:04 AM, you wrote: Hi Thomas, This patch set includes three hotfixes related Xen IRQ for v3.19. Sorry for the long delay to get these two regressions fixed, it really cost me some time to read and understand Xen IRQ code. Patch 1 fixes the failure to register ACPI SCI interrupt on Xen domain0 by reworking acpi_gsi_to_irq(). I will backport it to v3.18 too once it reaches the mainstream kernel. Patch 2 fixes the regression in Xen PCI device passthrough(pciback). It's a temporary solution, I will send the formal fix for v3.20 and it has passed tests too. Patch 3 fixes an issue found when reading code. There's no real bug reports related to this issue yet. Great thanks to Konrad and Sander for testing fixes for these regressions. Regards, Gerry Hi Gerry, Since these patches now are: tested, reviewed and have landed into mainline, could you also provide the backports for 3.17 and 3.18 where required ? This would be my attempt of backporting those to 3.18 (I have not tried whether they would apply to 3.17 as well). Those seem to work when test-booting and Sander said he will give them a try as well. But better check yourself I did not mess anything up while backporting. One of the three patches seemed not to be required at all. I'm backporting them too, but now I will give up my patches:) The first one is a must, which should be applied to 3.17 and 3.18. The second one doesn't apply to 3.17 or 3.18. The third may be applied to v3.17 and v3.18, but it shouldn't be applied according to the stable-kernel doc because no user reports such a bug yet. Hm, ok. So maybe I should retry with only patch#1 and verify that alone gets around the acpi irq and the usb problems, too. Though interestingly from the history, the acpi irq I saw in 3.17 while the usb devices started to fail with 3.18. I will let you know in a bit. 
Just want to finish another bisect first. -Stefan Regards Gerry -Stefan The number of people running into these (subtle) issues with these stable kernels seems to be steadily increasing. -- Sander Jiang Liu (3): xen/pci: Kill function xen_setup_acpi_sci() xen/irq, ACPI: Fix regression in xen PCI passthrough caused by cffe0a2b5a34 xen/irq: Override ACPI IRQ management callback __acpi_unregister_gsi arch/x86/include/asm/acpi.h |1 + arch/x86/kernel/acpi/boot.c | 26 +++ arch/x86/pci/xen.c | 49 ++- drivers/acpi/pci_irq.c |1 - 4 files changed, 16 insertions(+), 61 deletions(-) signature.asc Description: OpenPGP digital signature
Re: [PATCH] bcache: prevent crash on changing writeback_running
On 19.08.2014 20:03, Slava Pestov wrote: > Thanks, this is now in our development branch. It has been a while but I just went back there and checked linux and linux-next but cannot find this change anywhere. I think it is a rather simple bugfix. So I start to wonder where it went and when it would get pushed. -Stefan > > On Tue, Aug 19, 2014 at 6:01 AM, Stefan Bader > wrote: >> commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 >> bcache: fix crash on shutdown in passthrough mode >> >> added a safeguard in the shutdown case. At least while not being >> attached it is also possible to trigger a kernel bug by writing into >> writeback_running. This change adds the same check before trying to >> wake up the thread for that case. >> >> Signed-off-by: Stefan Bader >> --- >> drivers/md/bcache/writeback.h | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h >> index 0a9dab1..073a042 100644 >> --- a/drivers/md/bcache/writeback.h >> +++ b/drivers/md/bcache/writeback.h >> @@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, >> struct bio *bio, >> >> static inline void bch_writeback_queue(struct cached_dev *dc) >> { >> - wake_up_process(dc->writeback_thread); >> + if (!IS_ERR_OR_NULL(dc->writeback_thread)) >> + wake_up_process(dc->writeback_thread); >> } >> >> static inline void bch_writeback_add(struct cached_dev *dc) >> -- >> 1.9.1 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] [PATCH] xen-netfront: Fix handling packets on compound pages with skb_linearize
On 01.12.2014 14:59, Zoltan Kiss wrote: > > > On 01/12/14 13:36, David Vrabel wrote: >> On 01/12/14 08:55, Stefan Bader wrote: >>> On 11.08.2014 19:32, Zoltan Kiss wrote: >>>> There is a long known problem with the netfront/netback interface: if the >>>> guest >>>> tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring >>>> slots, >>>> it gets dropped. The reason is that netback maps these slots to a frag in >>>> the >>>> frags array, which is limited by size. Having so many slots can occur since >>>> compound pages were introduced, as the ring protocol slice them up into >>>> individual (non-compound) page aligned slots. The theoretical worst case >>>> scenario looks like this (note, skbs are limited to 64 Kb here): >>>> linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, >>>> using 2 slots >>>> first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at >>>> the >>>> end and the beginning of a page, therefore they use 3 * 15 = 45 slots >>>> last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots >>>> Although I don't think this 51 slots skb can really happen, we need a >>>> solution >>>> which can deal with every scenario. In real life there is only a few slots >>>> overdue, but usually it causes the TCP stream to be blocked, as the retry >>>> will >>>> most likely have the same buffer layout. >>>> This patch solves this problem by linearizing the packet. This is not the >>>> fastest way, and it can fail much easier as it tries to allocate a big >>>> linear >>>> area for the whole packet, but probably easier by an order of magnitude >>>> than >>>> anything else. Probably this code path is not touched very frequently >>>> anyway. >>>> >>>> Signed-off-by: Zoltan Kiss >>>> Cc: Wei Liu >>>> Cc: Ian Campbell >>>> Cc: Paul Durrant >>>> Cc: net...@vger.kernel.org >>>> Cc: linux-kernel@vger.kernel.org >>>> Cc: xen-de...@lists.xenproject.org >>> >>> This does not seem to be marked explicitly as stable. 
Has someone already >>> asked >>> David Miller to put it on his stable queue? IMO it qualifies quite well and >>> the >>> actual change should be simple to pick/backport. >> >> I think it's a candidate, yes. >> >> Can you expand on the user visible impact of the bug this patch fixes? >> I think it results in certain types of traffic not working (because the >> domU always generates skb's with the problematic frag layout), but I >> can't remember the details. > > Yes, this line in the comment talks about it: "In real life there is only a > few > slots overdue, but usually it causes the TCP stream to be blocked, as the > retry > will most likely have the same buffer layout." > Maybe we can add what kind of traffic triggered this so far, AFAIK NFS was one > of them, and Stefan had an another use case. But my memories are blur about > this. We had some report about some web-app hitting packet losses. I suspect that also was streaming something. For a easy trigger we found redis-benchmark (part of the redis keyserver) with a larger (iirc 1kB) payload would trigger the fragmentation/exceeding pages to happen. Though I think it did not fail but showed a performance drop instead (from memory which also suffers from loosing detail). -Stefan > > Zoli signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] [PATCH] xen-netfront: Fix handling packets on compound pages with skb_linearize
On 11.08.2014 19:32, Zoltan Kiss wrote: > There is a long known problem with the netfront/netback interface: if the > guest > tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring > slots, > it gets dropped. The reason is that netback maps these slots to a frag in the > frags array, which is limited by size. Having so many slots can occur since > compound pages were introduced, as the ring protocol slice them up into > individual (non-compound) page aligned slots. The theoretical worst case > scenario looks like this (note, skbs are limited to 64 Kb here): > linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, > using 2 slots > first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the > end and the beginning of a page, therefore they use 3 * 15 = 45 slots > last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots > Although I don't think this 51 slots skb can really happen, we need a solution > which can deal with every scenario. In real life there is only a few slots > overdue, but usually it causes the TCP stream to be blocked, as the retry will > most likely have the same buffer layout. > This patch solves this problem by linearizing the packet. This is not the > fastest way, and it can fail much easier as it tries to allocate a big linear > area for the whole packet, but probably easier by an order of magnitude than > anything else. Probably this code path is not touched very frequently anyway. > > Signed-off-by: Zoltan Kiss > Cc: Wei Liu > Cc: Ian Campbell > Cc: Paul Durrant > Cc: net...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: xen-de...@lists.xenproject.org This does not seem to be marked explicitly as stable. Has someone already asked David Miller to put it on his stable queue? IMO it qualifies quite well and the actual change should be simple to pick/backport. 
-Stefan > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c > index 055222b..23359ae 100644 > --- a/drivers/net/xen-netfront.c > +++ b/drivers/net/xen-netfront.c > @@ -628,9 +628,10 @@ static int xennet_start_xmit(struct sk_buff *skb, struct > net_device *dev) > slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) + > xennet_count_skb_frag_slots(skb); > if (unlikely(slots > MAX_SKB_FRAGS + 1)) { > - net_alert_ratelimited( > - "xennet: skb rides the rocket: %d slots\n", slots); > - goto drop; > + net_dbg_ratelimited("xennet: skb rides the rocket: %d slots, %d > bytes\n", > + slots, skb->len); > + if (skb_linearize(skb)) > + goto drop; > } > > spin_lock_irqsave(&queue->tx_lock, flags); > > ___ > Xen-devel mailing list > xen-de...@lists.xen.org > http://lists.xen.org/xen-devel > signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 13:21, Zoltan Kiss wrote: > > > On 07/11/14 12:15, Stefan Bader wrote: >> On 07.11.2014 12:22, Eric Dumazet wrote: >>> On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: >>> >>> Please do not top post. >>> >>>> Hi, >>>> >>>> AFAIK in this scenario your skb frag is wrong. The page pointer should >>>> point to the original compound page (not a member of it), and offset >>>> should be set accordingly. >>>> For example, if your compound page is 16K (4 pages), then the page >>>> pointer should point to the first page, and if the data starts at the >>>> 3rd page, then offset should be >8K >>> >>> This is not accurate. >>> >>> This BUG_ON() is wrong. >>> >>> It should instead be : >>> >>> BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); >> >> would that not have to be >> >> BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > >> PAGE_SIZE << compound_order(compound_head(page))); > There should be parentheses around "page-compound_head(page)". *sigh* before submitting anything I'll have to make sure I run it past the compiler at least. I never manage to type the beast without some error. But you got the drift... >> >> since offset is adjusted to start from the tail page in that case. >>> >>> splice() code can generate such cases. >>> >> >> signature.asc Description: OpenPGP digital signature
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 12:22, Eric Dumazet wrote: > On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: > > Please do not top post. > >> Hi, >> >> AFAIK in this scenario your skb frag is wrong. The page pointer should >> point to the original compound page (not a member of it), and offset >> should be set accordingly. >> For example, if your compound page is 16K (4 pages), then the page >> pointer should point to the first page, and if the data starts at the >> 3rd page, then offset should be >8K > > This is not accurate. > > This BUG_ON() is wrong. > > It should instead be : > > BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); since offset is adjusted to start from the tail page in that case. > > splice() code can generate such cases. > > signature.asc Description: OpenPGP digital signature
Re: BUG in xennet_make_frags with paged skb data
On 07.11.2014 11:44, David Vrabel wrote: > On 06/11/14 21:49, Seth Forshee wrote: >> We've had several reports of hitting the following BUG_ON in >> xennet_make_frags with 3.2 and 3.13 kernels (I'm currently awaiting >> results of testing with 3.17): >> >> /* Grant backend access to each skb fragment page. */ >> for (i = 0; i < frags; i++) { >> skb_frag_t *frag = skb_shinfo(skb)->frags + i; >> struct page *page = skb_frag_page(frag); >> >> len = skb_frag_size(frag); >> offset = frag->page_offset; >> >> /* Data must not cross a page boundary. */ >> BUG_ON(len + offset > PAGE_SIZE << compound_order(page)); >> >> When this happens the page in question is a "middle" page in a compound >> page (i.e. it's a tail page but not the last tail page), and the data is >> fully contained within the compound page. The data does however cross >> the hardware page boundary, and since compound_order evaluates to 0 for >> tail pages the check fails. >> >> In going over this I've been unable to determine whether the BUG_ON in >> xennet_make_frags is incorrect or the paged skb data is wrong. I can't >> find that it's documented anywhere, and the networking code itself is a >> bit ambiguous when it comes to compound pages. On the one hand >> __skb_fill_page_desc specifically handles adding tail pages as paged >> data, but on the other hand skb_copy_bits kmaps frag->page.p which could >> fail with data that extends into another page. > > netfront will safely handle this case so you can remove this BUG_ON() > (and the one later on). But it would be better to find out where these > funny-looking skbs are coming from and (if necessary) fixing the bug there. Right, in fact it does ignore pages from the start of a compound page in case the offset is big enough. So there is no difference between that case and the one where the page pointer from the frag is adjusted the way it was observed. It really boils down to the question what the real rules for the frags are.
If it is "only" that data may not cross the boundary of a hw or compound page this leaves room for interpretation. The odd skb does not violate that but a check for conformance would become a lot more complex. And netfront is not the only place that expects the frag page to be pointing to the compound head (there is igb for example, though it does only a warn_on upon failing). On the other hand the __skb_fill_page_desc is written in a way that seems to assume the frag page pointer may be pointing to a tail page. So before "blaming" one piece of code or the other we need an authoritative answer to what we are supposed to expect. -Stefan > > David > signature.asc Description: OpenPGP digital signature
Re: BUG in xennet_make_frags with paged skb data
On 07.11.2014 11:44, David Vrabel wrote: On 06/11/14 21:49, Seth Forshee wrote: We've had several reports of hitting the following BUG_ON in xennet_make_frags with 3.2 and 3.13 kernels (I'm currently awaiting results of testing with 3.17): /* Grant backend access to each skb fragment page. */ for (i = 0; i frags; i++) { skb_frag_t *frag = skb_shinfo(skb)-frags + i; struct page *page = skb_frag_page(frag); len = skb_frag_size(frag); offset = frag-page_offset; /* Data must not cross a page boundary. */ BUG_ON(len + offset PAGE_SIZEcompound_order(page)); When this happens the page in question is a middle page in a compound page (i.e. it's a tail page but not the last tail page), and the data is fully contained within the compound page. The data does however cross the hardware page boundary, and since compound_order evaluates to 0 for tail pages the check fails. In going over this I've been unable to determine whether the BUG_ON in xennet_make_frags is incorrect or the paged skb data is wrong. I can't find that it's documented anywhere, and the networking code itself is a bit ambiguous when it comes to compound pages. On the one hand __skb_fill_page_desc specifically handles adding tail pages as paged data, but on the other hand skb_copy_bits kmaps frag-page.p which could fail with data that extends into another page. netfront will safely handle this case so you can remove this BUG_ON() (and the one later on). But it would be better to find out were these funny-looking skbs are coming from and (if necessary) fixing the bug there. Right, in fact it does ignore pages from the start of a compound page in case the offset is big enough. So there is no difference between that case and the one where the page pointer from the frag is adjusted the way it was observed. It really boils down to the question what the real rules for the frags are. If it is only that data may not cross the boundary of a hw or compound page this leaves room for interpretation. 
The odd skb does not violate that but a check for conformance would become a lot more complex. And netfront is not the only place that expects the frag page to be pointing to the compound head (there is igb for example, though it does only a warn_on upon failing). On the other hand the __skb_fill_page_desc is written in a way that seems to assume the frag page pointer may be pointing to a tail page. So before blaming one piece of code or the other we need an authoritative answer to what we are supposed to expect. -Stefan
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 12:22, Eric Dumazet wrote: On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: Please do not top post. Hi, AFAIK in this scenario your skb frag is wrong. The page pointer should point to the original compound page (not a member of it), and offset should be set accordingly. For example, if your compound page is 16K (4 pages), then the page pointer should point to the first page, and if the data starts at the 3rd page, then offset should be 8K. This is not accurate. This BUG_ON() is wrong. It should instead be: BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); since offset is adjusted to start from the tail page in that case. splice() code can generate such cases.
Re: [Xen-devel] BUG in xennet_make_frags with paged skb data
On 07.11.2014 13:21, Zoltan Kiss wrote: On 07/11/14 12:15, Stefan Bader wrote: On 07.11.2014 12:22, Eric Dumazet wrote: On Fri, 2014-11-07 at 09:25 +, Zoltan Kiss wrote: Please do not top post. Hi, AFAIK in this scenario your skb frag is wrong. The page pointer should point to the original compound page (not a member of it), and offset should be set accordingly. For example, if your compound page is 16K (4 pages), then the page pointer should point to the first page, and if the data starts at the 3rd page, then offset should be 8K. This is not accurate. This BUG_ON() is wrong. It should instead be: BUG_ON(len + offset > PAGE_SIZE << compound_order(compound_head(page))); would that not have to be BUG_ON((page-compound_head(page)*PAGE_SIZE)+offset+len > PAGE_SIZE << compound_order(compound_head(page))); There should be parentheses around page-compound_head(page). *sigh* before submitting anything I'll have to make sure I run it past the compiler at least. I never manage to type the beast without some error. But you got the drift... since offset is adjusted to start from the tail page in that case. splice() code can generate such cases.
Re: [Xen-devel] [PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
On 02.09.2014 13:01, David Vrabel wrote: On 01/09/14 18:34, David Vrabel wrote: On 29/08/14 16:17, Stefan Bader wrote: This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. [...] --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's just +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); level2_fixmap_pgt etc. are defined for the benefit of Xen only so I think you should add an extern for level1_fixmap_pgt and fix this up properly. It might not matter now, but it might in the future... I found some time to look into this and test it. 
Including without enabling KASLR, but reproing the vmalloc failure with a large (800 MiB module). I've clarified the change description, can you review my edits? Thanks for investigating and fixing this! Sorry for not having replied earlier (I am on vacation). Without testing it looks about what I had after a few iterations (minus the export of l1). So if you had a kernel booting with that I am happy, too. :) -Stefan David 8<-- x86/xen: don't copy junk into kernel page tables When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded modules exceeds 512 MiB, then loading modules fails with a warning (and hence a vmalloc allocation failure) because the PTEs for the newly-allocated vmalloc address space are not zero. WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() This is caused by xen_setup_kernel_pagetables() copying junk into the page tables (to level2_fixmap_pgt), overwriting many non-present entries. Without KASLR, the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel [256..511]->module [511]->level2_fixmap_pgt[ 0..505]->module This allows 512 MiB of module vmalloc space to be used before having to use the corrupted level2_fixmap_pgt entries. With KASLR enabled, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel [511]->level2_fixmap_pgt[0..505]->module And now no module vmalloc space can be used without using the corrupt level2_fixmap_pgt entries. Fix this by properly converting the level2_fixmap_pgt entries to MFNs, and setting level1_fixmap_pgt as read-only. 
Signed-off-by: Stefan Bader stefan.ba...@canonical.com Cc: sta...@vger.kernel.org Signed-off-by: David Vrabel david.vra...@citrix.com --- arch/x86/include/asm/pgtable_64.h |1 + arch/x86/xen/mmu.c| 27 --- 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h index 5be9063..3874693 100644 --- a/arch/x86/include/asm/pgtable_64.h +++ b/arch/x86/include/asm/pgtable_64.h @@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512]; extern pmd_t level2_kernel_pgt[512]; extern pmd_t level2_fixmap_pgt[512]; extern pmd_t level2_ident_pgt[512]; +extern pte_t level1_fixmap_pgt[512]; extern pgd_t
[PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, found that xen_setup_kernel_pagetable copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[ 0..255]->kernel [256..511]->module [511]->level2_fixmap_pgt[ 0..505]->module With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel [511]->level2_fixmap_pgt[0..505]->module And now the incorrect copy of the kernel mapping in that range bites (hard). Causing errors in vmalloc that start with the following: WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() Which is caused by a freshly allocated PTE for an address in the module vspace area that is not uninitialized (pte_none()). The same would happen with the old layout when something causes the initial mappings to cross the 512M boundary. I was told that someone saw the same vmalloc warning with the old layout but 500M initrd. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. 
And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. Signed-off-by: Stefan Bader Cc: sta...@vger.kernel.org --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..145e50f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's just +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_kernel_pgt */ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) addr[1] = (unsigned long)l3; addr[2] = (unsigned long)l2; /* Graft it onto L4[272][0]. Note that we creating an aliasing problem: -* Both L4[272][0] and L4[511][511] have entries that point to the same +* Both L4[272][0] and L4[511][510] have entries that point to the same * L2 (PMD) tables. Meaning that if you modify it in __va space * it will be also modified in the __ka space! 
(But if you just * modify the PMD table to point to other PTE's or none, then you * are OK - which is what cleanup_highmap does) */ copy_page(level2_ident_pgt, l2); - /* Graft it onto L4[511][511] */ + /* Graft it onto L4[511][510] */ copy_page(level2_kernel_pgt, l2); - /* Get [511][510] and graft that in level2_fixmap_pgt */ - l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); - l2 = m2v(l3[pud_index(__START
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:31, David Vrabel wrote: > On 29/08/14 15:27, Stefan Bader wrote: >> >> Ok, I rework the patch and re-send it (freshly, iow not part of this thread). >> And while I am at it, I would add the stable tag. > > Can you use a different title? Perhaps: > > x86/xen: fix 64-bit PV guest kernel page tables for KASLR > > David > I can change the title but would not want to include KASLR. Because that is just indirectly responsible. This is a fix for a problem that always was there but just escaped notice until KASLR caused the layout to change. This could have been done independently before. Or, like I mentioned in the last email, be caused by a large initrd. -Stefan
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:19, Andrew Cooper wrote: > On 29/08/14 09:37, Stefan Bader wrote: >> On 29.08.2014 00:42, Andrew Cooper wrote: >>> On 28/08/2014 19:01, Stefan Bader wrote: >>>>>> So not much further... but then I think I know what I do next. Probably >>>>>> should >>>>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>>>> panic >>>>>> and at least get a crash dump of that situation when it occurs. Then I >>>>>> can dig >>>>>> in there with crash (really should have thought of that before)... >>>>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>>>> there >>>>> that screams at me, so I fear I will have to wait until you get the crash >>>>> and get some clues from that. >>>> Ok, what a journey. So after long hours of painful staring at the code... >>>> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >>>> in crash, I really would appreaciate :-P) >>> The M2P map lives in the Xen reserved virtual address space in each PV >>> guest, and forms part of the PV ABI. It is mapped read-only, in the >>> native width of the guest. >>> >>> 32bit PV (PAE) at 0xF580 >>> 64bit PV at 0x8000 >>> >>> This is represented by the MACH2PHYS_VIRT_START symbol from the Xen >>> public header files. You should be able to blindly construct a pointer >>> to it (if you have nothing better to hand), as it will be hooked into >>> the guests pagetables before execution starts. Therefore, >>> "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch. >> machine_to_phys_mapping is set to that address but its not mapped inside the >> crash dump. Somehow vtop in crash handles translations. I need to have a >> look at >> their code, I guess. >> >> Thanks, >> Stefan > > What context is the crash dump? If it is a Xen+dom0 kexec()d to native > linux, then the m2p should still be accessible given dom0's cr3. If it > is some state copied off-host then you will need to adjust the copy to > include that virtual range. 
No, it's a domU dump of a PV guest taken with "xl dump-core" (or actually the result of an on-crash trigger). > > ~Andrew >
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:08, Konrad Rzeszutek Wilk wrote: > On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote: >>>> So not much further... but then I think I know what I do next. Probably >>>> should >>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>> panic >>>> and at least get a crash dump of that situation when it occurs. Then I can >>>> dig >>>> in there with crash (really should have thought of that before)... >>> >>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>> there >>> that screams at me, so I fear I will have to wait until you get the crash >>> and get some clues from that. >> >> Ok, what a journey. So after long hours of painful staring at the code... >> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >> in crash, I really would appreaciate :-P) >> >> So at some point I realized that level2_fixmap_pgt seemed to contain >> an oddly high number of entries (given that the virtual address that >> failed would be mapped by entry 0). And suddenly I realized that apart >> from entries 506 and 507 (actual fixmap and vsyscalls) the whole list >> actually was the same as level2_kernel_gpt (without the first 16M >> cleared). >> >> And then I realized that xen_setup_kernel_pagetable is wrong to a >> degree which makes one wonder how this ever worked. Adding PMD_SIZE >> as an offset (2M) isn't changing l2 at all. This just copies Xen's >> kernel mapping, AGAIN!. >> >> I guess we all just were lucky that in most cases modules would not >> use more than 512M (which is the correctly cleaned up remainder >> of kernel_level2_pgt).. >> >> I still need to compile a kernel with the patch and the old layout >> but I kind of got exited by the find. At least this is tested with >> the 1G/~1G layout and it comes up without vmalloc errors. > > Woot! Oh yeah! 
:) >> >> -Stefan >> >> >From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 >> From: Stefan Bader >> Date: Thu, 28 Aug 2014 19:17:00 +0200 >> Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables >> >> This seemed to be one of those what-the-heck moments. When trying to >> figure out why changing the kernel/module split (which enabling KASLR >> does) causes vmalloc to run wild on boot of 64bit PV guests, after >> much scratching my head, found that the current Xen code copies the > > s/current Xen/xen_setup_kernel_pagetable/ ok > >> same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, >> but also (due to miscalculating the offset) to level2_fixmap_pgt. >> >> This only worked because the normal kernel image size only covers the >> first half of level2_kernel_pgt and module space starts after that. >> With the split changing, the kernel image uses the full PUD range of >> 1G and module space starts in the level2_fixmap_pgt. So basically: >> >> L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt >> [511]->level2_fixmap_pgt >> > > Perhaps you could add a similar drawing of what it looked like > without the kaslr enabled? As in the 'normal kernel image' scenario? Sure. Btw, someone also contacted me saying they have the same problem without changing the layout but having a really big initrd (500M). While that feels like it should be impossible (if the kernel+initrd+xen stuff has to fit in the 512M kernel image size area then). But if it can happen, then surely it does cause mappings to be where the module space starts then. > >> And now the incorrect copy of the kernel mapping in that range bites >> (hard). > > Want to include the vmalloc warning you got? Yeah, that is a good idea. > >> >> This change might not be the fully correct approach as it basically >> removes the pre-set page table entry for the fixmap that is compile >> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). 
For one the >> level1 page table is not yet declared in C headers (that might be >> fixed). But also with the current bug, it was removed, too. Since >> the Xen mappings for level2_kernel_pgt only covered kernel + initrd >> and some Xen data this did never reach that far. And still, something >> does create entries at level2_fixmap_pgt[506..507]. So it should be >> ok. At least I was able to successfully boot a kernel with 1G kernel >> image size without any vmalloc whinings. >> >> Signed-off-by: Stefan Bader >> --- >> arch/x86/xen/mmu.c | 26 +-
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 00:42, Andrew Cooper wrote: > On 28/08/2014 19:01, Stefan Bader wrote: >>>> So not much further... but then I think I know what I do next. Probably >>>> should >>>> have done before. I'll replace the WARN_ON in vmalloc that triggers by a >>>> panic >>>> and at least get a crash dump of that situation when it occurs. Then I can >>>> dig >>>> in there with crash (really should have thought of that before)... >>> I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing >>> there >>> that screams at me, so I fear I will have to wait until you get the crash >>> and get some clues from that. >> Ok, what a journey. So after long hours of painful staring at the code... >> (and btw, if someone could tell me how the heck one can do a mfn_to_pfn >> in crash, I really would appreaciate :-P) > > The M2P map lives in the Xen reserved virtual address space in each PV > guest, and forms part of the PV ABI. It is mapped read-only, in the > native width of the guest. > > 32bit PV (PAE) at 0xF580 > 64bit PV at 0x8000 > > This is represented by the MACH2PHYS_VIRT_START symbol from the Xen > public header files. You should be able to blindly construct a pointer > to it (if you have nothing better to hand), as it will be hooked into > the guests pagetables before execution starts. Therefore, > "MACH2PHYS_VIRT_START[(unsigned long)pfn]" ought to do in a pinch. machine_to_phys_mapping is set to that address but it's not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan > > ~Andrew >
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 00:42, Andrew Cooper wrote: On 28/08/2014 19:01, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) The M2P map lives in the Xen reserved virtual address space in each PV guest, and forms part of the PV ABI. It is mapped read-only, in the native width of the guest. 32bit PV (PAE) at 0xF580 64bit PV at 0x8000 This is represented by the MACH2PHYS_VIRT_START symbol from the Xen public header files. You should be able to blindly construct a pointer to it (if you have nothing better to hand), as it will be hooked into the guests pagetables before execution starts. Therefore, MACH2PHYS_VIRT_START[(unsigned long)pfn] ought to do in a pinch. machine_to_phys_mapping is set to that address but its not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan ~Andrew signature.asc Description: OpenPGP digital signature
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:08, Konrad Rzeszutek Wilk wrote: On Thu, Aug 28, 2014 at 08:01:43PM +0200, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) So at some point I realized that level2_fixmap_pgt seemed to contain an oddly high number of entries (given that the virtual address that failed would be mapped by entry 0). And suddenly I realized that apart from entries 506 and 507 (actual fixmap and vsyscalls) the whole list actually was the same as level2_kernel_gpt (without the first 16M cleared). And then I realized that xen_setup_kernel_pagetable is wrong to a degree which makes one wonder how this ever worked. Adding PMD_SIZE as an offset (2M) isn't changing l2 at all. This just copies Xen's kernel mapping, AGAIN!. I guess we all just were lucky that in most cases modules would not use more than 512M (which is the correctly cleaned up remainder of kernel_level2_pgt).. I still need to compile a kernel with the patch and the old layout but I kind of got exited by the find. At least this is tested with the 1G/~1G layout and it comes up without vmalloc errors. Woot! Oh yeah! 
:) -Stefan From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 From: Stefan Bader stefan.ba...@canonical.com Date: Thu, 28 Aug 2014 19:17:00 +0200 Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, found that the current Xen code copies the s/current Xen/xen_setup_kernel_pagetable/ ok same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically: L4[511]-level3_kernel_pgt[510]-level2_kernel_pgt [511]-level2_fixmap_pgt Perhaps you could add a similar drawing of what it looked like without the kaslr enabled? AS in the 'normal kernel image' scenario? Sure. Btw, someone also contacted me saying they have the same problem without changing the layout but having really big initrd (500M). While that feels like it should be impossible (if the kernel+initrd+xen stuff has to fix the 512M kernel image size area then). But if it can happen, then surely it does cause mappings to be where the module space starts then. And now the incorrect copy of the kernel mapping in that range bites (hard). Want to include the vmalloc warning you got? Yeah, that is a good idea. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]-level1_fixmap_pgt). For one the level1 page table is not yet declared in C headers (that might be fixed). 
But also with the current bug, it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data this did never reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whinings. Signed-off-by: Stefan Bader stefan.ba...@canonical.com --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..803034c 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] - level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] - level2_kernel_pgt - * L3_i[511] - level2_fixmap_pgt */ + * L3_k[511] - level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + +/* level2_fixmap_pgt contains a single entry for the + * fixmap area at offset 506
Re: [Xen-devel] [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:19, Andrew Cooper wrote: On 29/08/14 09:37, Stefan Bader wrote: On 29.08.2014 00:42, Andrew Cooper wrote: On 28/08/2014 19:01, Stefan Bader wrote: So not much further... but then I think I know what I do next. Probably should have done before. I'll replace the WARN_ON in vmalloc that triggers by a panic and at least get a crash dump of that situation when it occurs. Then I can dig in there with crash (really should have thought of that before)... nods I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there that screams at me, so I fear I will have to wait until you get the crash and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreaciate :-P) The M2P map lives in the Xen reserved virtual address space in each PV guest, and forms part of the PV ABI. It is mapped read-only, in the native width of the guest. 32bit PV (PAE) at 0xF580 64bit PV at 0x8000 This is represented by the MACH2PHYS_VIRT_START symbol from the Xen public header files. You should be able to blindly construct a pointer to it (if you have nothing better to hand), as it will be hooked into the guests pagetables before execution starts. Therefore, MACH2PHYS_VIRT_START[(unsigned long)pfn] ought to do in a pinch. machine_to_phys_mapping is set to that address but its not mapped inside the crash dump. Somehow vtop in crash handles translations. I need to have a look at their code, I guess. Thanks, Stefan What context is the crash dump? If it is a Xen+dom0 kexec()d to native linux, then the m2p should still be accessible given dom0's cr3. If it is some state copied off-host then you will need to adjust the copy to include that virtual range. No its a domU dump of a PV guest taken with xl dump-core (or actually the result of on-crash trigger). ~Andrew signature.asc Description: OpenPGP digital signature
Re: [PATCH] Solved the Xen PV/KASLR riddle
On 29.08.2014 16:31, David Vrabel wrote: On 29/08/14 15:27, Stefan Bader wrote: Ok, I rework the patch and re-send it (freshly, iow not part of this thread). And while I am at it, I would add the stable tag. Can you use a different title? Perhaps: x86/xen: fix 64-bit PV guest kernel page tables for KASLR David I can change the title but would not want to include KASLR. Because that is just indirectly responsible. This is a fix for a problem that always was there but just escaped notice until KASLR caused the layout to change. This could have been done independently before. Or, like I mentioned in the last email, be caused by a large initrd. -Stefan
[PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
This seemed to be one of those what-the-heck moments. When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, I found that xen_setup_kernel_pagetable copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now the incorrect copy of the kernel mapping in that range bites (hard). Causing errors in vmalloc that start with the following: WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128 vmap_page_range_noflush+0x2a1/0x360() Which is caused by a freshly allocated PTE for an address in the module vspace area that is not empty (pte_none()) as expected. The same would happen with the old layout when something causes the initial mappings to cross the 512M boundary. I was told that someone saw the same vmalloc warning with the old layout but a 500M initrd. This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one, the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data, this never did reach that far. And still, something does create entries at level2_fixmap_pgt[506..507].
So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whining. Signed-off-by: Stefan Bader stefan.ba...@canonical.com Cc: sta...@vger.kernel.org --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..145e50f 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set due to using +* the wrong l2 when getting Xen's tables. So let's +* just nuke it. +* This orphans level1_fixmap_pgt, but that was basically +* done before the change as well. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_kernel_pgt */ l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd); @@ -1913,21 +1927,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) addr[1] = (unsigned long)l3; addr[2] = (unsigned long)l2; /* Graft it onto L4[272][0]. Note that we creating an aliasing problem: -* Both L4[272][0] and L4[511][511] have entries that point to the same +* Both L4[272][0] and L4[511][510] have entries that point to the same * L2 (PMD) tables. Meaning that if you modify it in __va space * it will be also modified in the __ka space!
(But if you just * modify the PMD table to point to other PTE's or none, then you * are OK - which is what cleanup_highmap does) */ copy_page(level2_ident_pgt, l2); - /* Graft it onto L4[511][511] */ + /* Graft it onto L4[511][510] */ copy_page(level2_kernel_pgt, l2); - /* Get [511][510] and graft that in level2_fixmap_pgt */ - l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd); - l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud); - copy_page
[PATCH] Solved the Xen PV/KASLR riddle
> > So not much further... but then I think I know what I do next. Probably > > should > > have done before. I'll replace the WARN_ON in vmalloc that triggers by a > > panic > > and at least get a crash dump of that situation when it occurs. Then I can > > dig > > in there with crash (really should have thought of that before)... > > I dug a bit in the code (arch/x86/xen/mmu.c) but there is nothing there > that screams at me, so I fear I will have to wait until you get the crash > and get some clues from that. Ok, what a journey. So after long hours of painful staring at the code... (and btw, if someone could tell me how the heck one can do a mfn_to_pfn in crash, I really would appreciate it :-P) So at some point I realized that level2_fixmap_pgt seemed to contain an oddly high number of entries (given that the virtual address that failed would be mapped by entry 0). And suddenly I realized that apart from entries 506 and 507 (actual fixmap and vsyscalls) the whole list actually was the same as level2_kernel_pgt (without the first 16M cleared). And then I realized that xen_setup_kernel_pagetable is wrong to a degree which makes one wonder how this ever worked. Adding PMD_SIZE as an offset (2M) isn't changing l2 at all. This just copies Xen's kernel mapping, AGAIN! I guess we all just were lucky that in most cases modules would not use more than 512M (which is the correctly cleaned up remainder of level2_kernel_pgt)... I still need to compile a kernel with the patch and the old layout but I kind of got excited by the find. At least this is tested with the 1G/~1G layout and it comes up without vmalloc errors. -Stefan From 4b9a9a45177284e29d345eb54c545bd1da718e1b Mon Sep 17 00:00:00 2001 From: Stefan Bader Date: Thu, 28 Aug 2014 19:17:00 +0200 Subject: [PATCH] x86/xen: Fix setup of 64bit kernel pagetables This seemed to be one of those what-the-heck moments.
When trying to figure out why changing the kernel/module split (which enabling KASLR does) causes vmalloc to run wild on boot of 64bit PV guests, after much scratching my head, I found that the current Xen code copies the same L2 table not only to the level2_ident_pgt and level2_kernel_pgt, but also (due to miscalculating the offset) to level2_fixmap_pgt. This only worked because the normal kernel image size only covers the first half of level2_kernel_pgt and module space starts after that. With the split changing, the kernel image uses the full PUD range of 1G and module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt
                          [511]->level2_fixmap_pgt

And now the incorrect copy of the kernel mapping in that range bites (hard). This change might not be the fully correct approach as it basically removes the pre-set page table entry for the fixmap that is compile time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one, the level1 page table is not yet declared in C headers (that might be fixed). But also with the current bug it was removed, too. Since the Xen mappings for level2_kernel_pgt only covered kernel + initrd and some Xen data, this never did reach that far. And still, something does create entries at level2_fixmap_pgt[506..507]. So it should be ok. At least I was able to successfully boot a kernel with 1G kernel image size without any vmalloc whining.
Signed-off-by: Stefan Bader --- arch/x86/xen/mmu.c | 26 +- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c index e8a1201..803034c 100644 --- a/arch/x86/xen/mmu.c +++ b/arch/x86/xen/mmu.c @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn) /* L3_i[0] -> level2_ident_pgt */ convert_pfn_mfn(level3_ident_pgt); /* L3_k[510] -> level2_kernel_pgt -* L3_i[511] -> level2_fixmap_pgt */ +* L3_k[511] -> level2_fixmap_pgt */ convert_pfn_mfn(level3_kernel_pgt); + + /* level2_fixmap_pgt contains a single entry for the +* fixmap area at offset 506. The correct way would +* be to convert level2_fixmap_pgt to mfn and set the +* level1_fixmap_pgt (which is completely empty) to RO, +* too. But currently this page table is not declared, +* so it would be a bit of voodoo to get its address. +* And also the fixmap entry was never set anyway due +* to using the wrong l2 when getting Xen's tables. +* So let's just nuke it. +* This orphans level1_fixmap_pgt, but that should be +* as it always has been. +*/ + memset(level2_fixmap_pgt, 0, 512*sizeof(long)); } /* We get [511][511] and have Xen's version of level2_ke
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote: On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote: On 21.08.2014 18:03, Kees Cook wrote: On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote: On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 12.08.2014 19:28, Kees Cook wrote: On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 08.08.2014 14:43, David Vrabel wrote: On 08/08/14 12:20, Stefan Bader wrote: Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error). Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash. I have not tried this for a normal PV guest but for dom0 it also does not help to add nokaslr to the kernel command-line. Maybe it's overlapping with regions of the virtual address space reserved for Xen? What is the VA that fails? David Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report). The kernel-command line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area... Right now, the kaslr code just deals with initrd, cmdline, etc.
If there are other reserved regions that aren't listed in the e820, it'll need to locate and skip them. -Kees Making my little steps towards more understanding I figured out that it isn't the code that does the relocation. Even with that completely disabled there were the vmalloc issues. What causes it seems to be the default of the upper limit and that this changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is the reason why nokaslr has no effect. Oh! That's very interesting. There must be some assumption in Xen about the kernel VM layout then? No. I think most of the changes that look at PTE and PMDs are all in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being too aggressive (Sorry I had to cut our chat short at Kernel Summit!) It sounded like there was another region of memory that Xen was setting aside for page tables? But Stefan's investigation seems to show this isn't about layout at boot (since the kaslr=0 case means no relocation is done). Sounds more like the split between kernel and modules area, so I'm not sure how the memory area after the initrd would be part of this. What should next steps be, do you think? Maybe layout, but not about placement of the kernel. Basically leaving KASLR enabled but shrinking the possible range back to the original kernel/module split is fine as well. I am bouncing between feeling close to understanding to being confused. Konrad suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way round. The warning that occurs first indicates that the PTE that was obtained for some vmalloc mapping is not unused (0) as expected. So it feels rather like some cleanup has *not* been done. Let me think aloud a bit... What seems to cause this is the change of the kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of vsyscalls and a 2M hole at the end).
Which in vaddr terms means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is pointed at by: PGD[510]->PUD[510]->PMD[0]->PTE[0] In the old layout the module vaddr area would start in the same PUD area, but with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls and the hole would cover PUD[511]. I think there is a fixmap there too? Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it. So fixmap seems to be in the 2M space before the vsyscalls. Btw, apparently I got the PGD index wrong. It is of course 511, not 510. init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 21.08.2014 18:03, Kees Cook wrote: On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote: On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 12.08.2014 19:28, Kees Cook wrote: On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader stefan.ba...@canonical.com wrote: On 08.08.2014 14:43, David Vrabel wrote: On 08/08/14 12:20, Stefan Bader wrote: Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error). Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash. I have not tried this for a normal PV guest but for dom0 it also does not help to add nokaslr to the kernel command-line. Maybe it's overlapping with regions of the virtual address space reserved for Xen? What is the VA that fails? David Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report). The kernel-command line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area... Right now, the kaslr code just deals with initrd, cmdline, etc. If there are other reserved regions that aren't listed in the e820, it'll need to locate and skip them.
-Kees Making my little steps towards more understanding I figured out that it isn't the code that does the relocation. Even with that completely disabled there were the vmalloc issues. What causes it seems to be the default of the upper limit and that this changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is the reason why nokaslr has no effect. Oh! That's very interesting. There must be some assumption in Xen about the kernel VM layout then? No. I think most of the changes that look at PTE and PMDs are all in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being too aggressive (Sorry I had to cut our chat short at Kernel Summit!) It sounded like there was another region of memory that Xen was setting aside for page tables? But Stefan's investigation seems to show this isn't about layout at boot (since the kaslr=0 case means no relocation is done). Sounds more like the split between kernel and modules area, so I'm not sure how the memory area after the initrd would be part of this. What should next steps be, do you think? Maybe layout, but not about placement of the kernel. Basically leaving KASLR enabled but shrinking the possible range back to the original kernel/module split is fine as well. I am bouncing between feeling close to understanding to being confused. Konrad suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way round. The warning that occurs first indicates that the PTE that was obtained for some vmalloc mapping is not unused (0) as expected. So it feels rather like some cleanup has *not* been done. Let me think aloud a bit... What seems to cause this is the change of the kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of vsyscalls and a 2M hole at the end).
Which in vaddr terms means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is pointed at by: PGD[510]->PUD[510]->PMD[0]->PTE[0] In the old layout the module vaddr area would start in the same PUD area, but with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls and the hole would cover PUD[511]. xen_cleanhighmap operates only on the kernel_level2_pgt which (speculating a bit since I am not sure I understand enough details) I believe is the one PMD pointed at by PGD[510]->PUD[510]. That could mean that before the change xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but not after the change. Maybe that also means it always should have covered more but this would not be observed as long as modules would not claim more than 512M? I
Re: [PATCH] bcache: prevent crash on changing writeback_running
On 19.08.2014 20:03, Slava Pestov wrote: > Thanks, this is now in our development branch. Thanks. :) This kind of brings up another question: how are stable patches handled for bcache? Does someone handle them centrally like for the network stack, or are they tagged for stable and, if that was missed, proposed individually? I am asking because it looks like for all kernels between 3.13..3.16 (inclusive) the set of a664d0f05a2ec02c8f042db536d84d15d6e19e81 bcache: fix crash on shutdown in passthrough mode 9e5c353510b26500bd6b8309823ac9ef2837b761 bcache: fix uninterruptible sleep in writeback thread plus the one I did would be useful. Right now I only tested 3.13 with those and at least avoid the lockdep warnings and crashes I saw before. I only see an unexplainable load of 1 when idle while having one bcache device configured in a VM (Xen or KVM). Since writeback should be fixed, maybe gc... -Stefan > > On Tue, Aug 19, 2014 at 6:01 AM, Stefan Bader > wrote: >> commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 >> bcache: fix crash on shutdown in passthrough mode >> >> added a safeguard in the shutdown case. At least while not being >> attached it is also possible to trigger a kernel bug by writing into >> writeback_running. This change adds the same check before trying to >> wake up the thread for that case.
>>
>> Signed-off-by: Stefan Bader
>> ---
>>  drivers/md/bcache/writeback.h | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
>> index 0a9dab1..073a042 100644
>> --- a/drivers/md/bcache/writeback.h
>> +++ b/drivers/md/bcache/writeback.h
>> @@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
>>
>>  static inline void bch_writeback_queue(struct cached_dev *dc)
>>  {
>> -	wake_up_process(dc->writeback_thread);
>> +	if (!IS_ERR_OR_NULL(dc->writeback_thread))
>> +		wake_up_process(dc->writeback_thread);
>>  }
>>
>>  static inline void bch_writeback_add(struct cached_dev *dc)
>> --
>> 1.9.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] bcache: prevent crash on changing writeback_running
commit a664d0f05a2ec02c8f042db536d84d15d6e19e81 ("bcache: fix crash on shutdown in passthrough mode") added a safeguard in the shutdown case. At least while not being attached it is also possible to trigger a kernel bug by writing into writeback_running. This change adds the same check before trying to wake up the thread for that case.

Signed-off-by: Stefan Bader
---
 drivers/md/bcache/writeback.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/writeback.h b/drivers/md/bcache/writeback.h
index 0a9dab1..073a042 100644
--- a/drivers/md/bcache/writeback.h
+++ b/drivers/md/bcache/writeback.h
@@ -63,7 +63,8 @@ static inline bool should_writeback(struct cached_dev *dc, struct bio *bio,
 
 static inline void bch_writeback_queue(struct cached_dev *dc)
 {
-	wake_up_process(dc->writeback_thread);
+	if (!IS_ERR_OR_NULL(dc->writeback_thread))
+		wake_up_process(dc->writeback_thread);
 }
 
 static inline void bch_writeback_add(struct cached_dev *dc)
--
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[tip:x86/urgent] x86_32, entry: Clean up sysenter_badsys declaration
Commit-ID: fb21b84e7f809ef04b1e5aed5d463cf0d4866638
Gitweb: http://git.kernel.org/tip/fb21b84e7f809ef04b1e5aed5d463cf0d4866638
Author: Stefan Bader
AuthorDate: Fri, 15 Aug 2014 10:57:46 +0200
Committer: H. Peter Anvin
CommitDate: Fri, 15 Aug 2014 13:45:32 -0700

x86_32, entry: Clean up sysenter_badsys declaration

commit 554086d85e "x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)" introduced a new jump label (sysenter_badsys) but somehow the END statements seem to have gone wrong (at least it feels that way to me). This does not seem to be a fatal problem, but just for the sake of symmetry, change the second syscall_badsys to sysenter_badsys.

Signed-off-by: Stefan Bader
Link: http://lkml.kernel.org/r/1408093066-31021-1-git-send-email-stefan.ba...@canonical.com
Acked-by: Andy Lutomirski
Signed-off-by: H. Peter Anvin
---
 arch/x86/kernel/entry_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 47c410d..4b0e1df 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -683,7 +683,7 @@ END(syscall_badsys)
 sysenter_badsys:
 	movl $-ENOSYS,%eax
 	jmp sysenter_after_call
-END(syscall_badsys)
+END(sysenter_badsys)
 	CFI_ENDPROC
 
 .macro FIXUP_ESPFIX_STACK
[PATCH] x86_32, entry: Clean up sysenter_badsys declaration
commit 554086d85e "x86_32, entry: Do syscall exit work on badsys (CVE-2014-4508)" introduced a new jump label (sysenter_badsys) but somehow the END statements seem to have gone wrong (at least it feels that way to me). This does not seem to be a fatal problem, but just for the sake of symmetry, change the second syscall_badsys to sysenter_badsys.

Signed-off-by: Stefan Bader
---
 arch/x86/kernel/entry_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 47c410d..4b0e1df 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -683,7 +683,7 @@ END(syscall_badsys)
 sysenter_badsys:
 	movl $-ENOSYS,%eax
 	jmp sysenter_after_call
-END(syscall_badsys)
+END(sysenter_badsys)
 	CFI_ENDPROC
 
 .macro FIXUP_ESPFIX_STACK
--
1.9.1
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 12.08.2014 19:28, Kees Cook wrote:
> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader wrote:
>> On 08.08.2014 14:43, David Vrabel wrote:
>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>> as a follow up error).
>>>>
>>>> Details can be seen in [1] but basically this is always some portion of a
>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>
>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>> to add "nokaslr" to the kernel command-line.
>>>
>>> Maybe it's overlapping with regions of the virtual address space
>>> reserved for Xen? What is the VA that fails?
>>>
>>> David
>>>
>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>> not sure whether it might be somewhere in the stacktraces in the report).
>>
>> The kernel command-line does not seem to be looked at. It should put something
>> into dmesg and that never shows up. Also today's random feature is other PV
>> guests crashing after a bit somewhere in the check_for_corruption area...
>
> Right now, the kaslr code just deals with initrd, cmdline, etc. If
> there are other reserved regions that aren't listed in the e820, it'll
> need to locate and skip them.
>
> -Kees

Making my little steps towards more understanding, I figured out that it isn't the code that does the relocation.
Even with that completely disabled, the vmalloc issues were still there. What causes them seems to be the default upper limit, which changes the split between kernel and modules to 1G+1G instead of 512M+1.5G. That is also the reason why nokaslr has no effect.

-Stefan
Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
On 08.08.2014 14:43, David Vrabel wrote:
> On 08/08/14 12:20, Stefan Bader wrote:
>> Unfortunately I have not yet figured out why this happens, but can confirm by
>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>> as a follow up error).
>>
>> Details can be seen in [1] but basically this is always some portion of a
>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>> the dom0 case there is a more fatal error at some point causing a crash.
>>
>> I have not tried this for a normal PV guest but for dom0 it also does not help
>> to add "nokaslr" to the kernel command-line.
>
> Maybe it's overlapping with regions of the virtual address space
> reserved for Xen? What is the VA that fails?
>
> David

Yeah, there is some code to avoid some regions of memory (like initrd). Maybe missing p2m tables? I probably need to add debugging to find the failing VA (iow not sure whether it might be somewhere in the stacktraces in the report).

The kernel command-line does not seem to be looked at. It should put something into dmesg and that never shows up. Also today's random feature is other PV guests crashing after a bit somewhere in the check_for_corruption area...

-Stefan
Xen PV domain regression with KASLR enabled (kernel 3.16)
Unfortunately I have not yet figured out why this happens, but can confirm by compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all is ok, but with it enabled there are issues (actually a dom0 does not even boot as a follow up error).

Details can be seen in [1] but basically this is always some portion of a vmalloc allocation failing after hitting a freshly allocated PTE space not being PTE_NONE (usually from a module load triggered by systemd-udevd). In the non-dom0 case this repeats many times but ends in a guest that allows login. In the dom0 case there is a more fatal error at some point causing a crash.

I have not tried this for a normal PV guest but for dom0 it also does not help to add "nokaslr" to the kernel command-line.

-Stefan

[    2.547049] ------------[ cut here ]------------
[    2.547065] WARNING: CPU: 0 PID: 97 at /build/buildd/linux-3.16.0/mm/vmalloc.c:128 vmap_page_range_noflush+0x2d1/0x370()
[    2.547069] Modules linked in:
[    2.547073] CPU: 0 PID: 97 Comm: systemd-udevd Not tainted 3.16.0-6-generic #11-Ubuntu
[    2.547077] 0009 880002defb98 81755538
[    2.547082] 880002defbd0 8106bb0d 88000400ec88 0001
[    2.547086] 880002fcfb00 c0391000 880002defbe0
[    2.547090] Call Trace:
[    2.547096] [<ffffffff81755538>] dump_stack+0x45/0x56
[    2.547101] [<ffffffff8106bb0d>] warn_slowpath_common+0x7d/0xa0
[    2.547104] [<ffffffff8106bbea>] warn_slowpath_null+0x1a/0x20
[    2.547108] [<ffffffff81197c31>] vmap_page_range_noflush+0x2d1/0x370
[    2.547112] [<ffffffff81197cfe>] map_vm_area+0x2e/0x40
[    2.547115] [<ffffffff8119a058>] __vmalloc_node_range+0x188/0x280
[    2.547120] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547124] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547129] [<ffffffff8104f294>] module_alloc+0x74/0xd0
[    2.547132] [<ffffffff810e92b4>] ? module_alloc_update_bounds+0x14/0x70
[    2.547135] [<ffffffff810e92b4>] module_alloc_update_bounds+0x14/0x70
[    2.547146] [<ffffffff810e9a6c>] layout_and_allocate+0x74c/0xc70
[    2.547149] [<ffffffff810ea063>] load_module+0xd3/0x1b70
[    2.547154] [<ffffffff811cfeb1>] ? vfs_read+0xf1/0x170
[    2.547157] [<ffffffff810e7aa1>] ? copy_module_from_fd.isra.46+0x121/0x180
[    2.547161] [<ffffffff810ebc76>] SyS_finit_module+0x86/0xb0
[    2.547167] [<ffffffff8175de7f>] tracesys+0xe1/0xe6
[    2.547169] ---[ end trace 8a5de7fc66e75fe4 ]---
[    2.547172] vmalloc: allocation failure, allocated 20480 of 24576 bytes
[    2.547175] systemd-udevd: page allocation failure: order:0, mode:0xd2

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1350522
Re: fs/stat: Reduce memory requirements for stat_open
On 08.07.2014 15:09, Peter Zijlstra wrote:
> On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
>> When reading from /proc/stat we allocate a large buffer to maximise
>> the chances of the results being from a single run and thus internally
>> consistent. This currently is sized at 128 * num_possible_cpus() which,
>> in the face of kernels sized to handle large configurations (256 cpus
>> plus), results in the buffer being an order-4 allocation or more.
>> When system memory becomes fragmented these cannot be guaranteed,
>> leading to read failures due to allocation failures.
>
>> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>>
>>  static int stat_open(struct inode *inode, struct file *file)
>>  {
>> -	size_t size = 1024 + 128 * num_possible_cpus();
>> +	size_t size = 1024 + 128 * num_online_cpus();
>>  	char *buf;
>>  	struct seq_file *m;
>>  	int res;
>
> Old thread, and already solved in the meantime, but note that
> CONFIG_NR_CPUS _should_ have no reflection on num_possible_cpus().
>
> The arch (x86 does) should detect at boot time the max possible CPUs the
> actual hardware supports and put num_possible_cpus() to that number. So
> your typical laptop will mostly have num_possible_cpus() <= 4, even
> though CONFIG_NR_CPUS could be 4k.
>
> Of course, if you actually do put 256+ cpus in your system, well, then
> the difference between possible and online isn't going to help either.
>
> If on the other hand your 'board' reports it can hold 256 CPUs while in
> fact it cannot, go kick your vendor in the nuts.

Odd. So at least for the reporter the board must be doing that. For the other way round: at least back when we did the change, some of the big hoovers which end up claiming 64 or so CPUs (including threads) would not boot with a too-low CONFIG_NR_CPUS. So that seems to have changed at some point. But at least this would explain why there have not been too many reports.
Thanks for the detail, Stefan