Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On Thu, 22 Nov 2012 14:26:10 -0800, H. Peter Anvin <h...@zytor.com> wrote:
> Bullshit. This should be a separate domain.

Thanks for top-posting, hpa...

> Andrew Cooper <andrew.coop...@citrix.com> wrote:
>> On 22/11/12 17:47, H. Peter Anvin wrote:
>>> The other thing that should be considered here is how utterly
>>> preposterous the notion of doing in-guest crash dumping is in a
>>> system that contains a hypervisor. The reason for kdump is that on
>>> bare metal there are no other options, but in a hypervisor system
>>> the right thing should be for the hypervisor to do the dump
>>> (possibly spawning a clean I/O domain if the I/O domain is necessary
>>> to access the media). There is absolutely no reason to have a
>>> crashkernel sitting around in each guest, consuming memory, and
>>> possibly getting corrupted.
>>>
>>> 	-hpa
>>
>> I agree that regular guests should not be using kexec/kdump. However,
>> this patch series is required for allowing a pvops kernel to be a
>> crash kernel for Xen, which is very important from dom0/Xen's point
>> of view.

In fact, a normal kernel is used for dumping, so it can handle both
dom0 crashes _and_ hypervisor crashes. If you wanted to address
hypervisor crashes, you'd have to allocate some space for that, too, so
you may view this madness as a way to conserve resources.

The memory area is reserved by the Xen hypervisor, and only the extents
are passed down to the dom0 kernel. In other words, there is indeed no
physical mapping for this area. Having said that, I see no reason why
that physical mapping cannot be created if it is needed.

Petr T

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On 22/11/12 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly
> preposterous the notion of doing in-guest crash dumping is in a system
> that contains a hypervisor. The reason for kdump is that on bare metal
> there are no other options, but in a hypervisor system the right thing
> should be for the hypervisor to do the dump (possibly spawning a clean
> I/O domain if the I/O domain is necessary to access the media). There
> is absolutely no reason to have a crashkernel sitting around in each
> guest, consuming memory, and possibly getting corrupted.
>
> 	-hpa

I agree that regular guests should not be using kexec/kdump. However,
this patch series is required for allowing a pvops kernel to be a crash
kernel for Xen, which is very important from dom0/Xen's point of view.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On 22/11/2012 17:47, H. Peter Anvin wrote:
> The other thing that should be considered here is how utterly
> preposterous the notion of doing in-guest crash dumping is in a system
> that contains a hypervisor. The reason for kdump is that on bare metal
> there are no other options, but in a hypervisor system the right thing
> should be for the hypervisor to do the dump (possibly spawning a clean
> I/O domain if the I/O domain is necessary to access the media). There
> is absolutely no reason to have a crashkernel sitting around in each
> guest, consuming memory, and possibly getting corrupted.
>
> 	-hpa

(Your reply to my email, which I can see on the xen-devel archive,
appears to have gotten lost somewhere inside the Citrix email system,
so apologies for replying out of order.)

The kdump kernel loaded by dom0 is for when Xen crashes, not for when
dom0 crashes (although a dom0 crash does admittedly lead to a Xen
crash).

There is no possible way it could be a separate domain; Xen completely
ceases to function as soon as it jumps to the entry point of the kdump
image.

~Andrew
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
Hi,

On 23/11/2012 02:56, Andrew Cooper wrote:
> For within-guest kexec/kdump functionality, I agree that it is barking
> mad. However, we do see cloud operators interested in the idea so VM
> administrators can look after their crashes themselves.

It's not barking mad when your day job is to investigate and fix other
people's kernel problems. Right now, it's impossible to get a kernel
crash dump of a failing EC2 instance, so every time someone shows up
with a "my kernel crashes in my instance" report, we're left with
mostly unusable backtraces and oops messages.

When I'm able to reproduce someone's kernel panic, I'm quite happy to
be able to use virtualization to run a kernel dump analysis on a
locally reproduced context. It's also quite useful when packaging
things like makedumpfile and kdump-tools to be able to avoid having to
rely on bare metal to test new releases.

So yes, in theory it may look barking mad, but real life is somewhat
different.

Kind regards,

...Louis

-- 
Louis Bouchard
Backline Support Analyst
Canonical Ltd
Ubuntu support: http://landscape.canonical.com
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
>>> On 22.11.12 at 18:37, H. Peter Anvin <h...@zytor.com> wrote:
> I actually talked to Ian Jackson at LCE, and mentioned among other
> things the bogosity of requiring a PUD page for three-level paging in
> Linux -- a bogosity which has spread from Xen into native. It's a page
> wasted for no good reason, since it only contains 32 bytes worth of
> data, *inherently*. Furthermore, contrary to popular belief, it is
> *not* a page table per se. Ian told me: "I didn't know we did that,
> and we shouldn't have to." Here we have suffered this overhead for at
> least six years, ...

Even the Xen kernel only needs the full page when running on a 64-bit
hypervisor (now that we don't have a 32-bit hypervisor anymore, that of
course basically means always). But yes, I too never liked this
enforced over-allocation for native kernels (and was surprised that it
was allowed in at all).

Jan
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> >>> On 23.11.12 at 02:56, Andrew Cooper <andrew.coop...@citrix.com> wrote:
> > On 23/11/2012 01:38, H. Peter Anvin wrote:
> >> I still don't really get why it can't be isolated from dom0, which
> >> would make more sense to me, even for a Xen crash.
> >
> > The crash region (as specified by crashkernel= on the Xen command
> > line) is isolated from dom0.
> >
> > dom0 (using the kexec utility etc.) has the task of locating the Xen
> > crash notes (using the kexec hypercall interface), constructing a
> > binary blob containing kernel, initrd and gubbins, and asking Xen to
> > put this blob in the crash region (again, using the kexec hypercall
> > interface).
> >
> > I do not see how this is very much different from the native case
> > currently (although please correct me if I am misinformed). Linux
> > has extra work to do by populating /proc/iomem with the Xen crash
> > regions at boot (so the kexec utility can reference their physical
> > addresses when constructing the blob), and should just act as a
> > conduit between the kexec system call and the kexec hypercall to
> > load the blob.
>
> But all of this _could_ be done completely independent of the Dom0
> kernel's kexec infrastructure (i.e. fully from user space, invoking
> the necessary hypercalls through the privcmd driver).

No, this is impossible. The kexec/kdump image lives in dom0 kernel
memory until execution. That is why the privcmd driver itself is not a
solution in this case.

> It's just that parts of the kexec infrastructure can be re-used (and
> hence that mechanism probably seemed the easier approach to the
> implementer of the original kexec-on-Xen). If the kernel folks dislike
> that re-use (quite understandably, looking at how much of it needs to
> be re-done), that shouldn't prevent us from looking into the existing
> alternatives.

This is a last-resort option. First, I think we should try to find a
good solution which reuses existing code as much as possible.

Daniel
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
>>> On 23.11.12 at 11:37, Daniel Kiper <daniel.ki...@oracle.com> wrote:
> On Fri, Nov 23, 2012 at 09:53:37AM +0000, Jan Beulich wrote:
> [...]
>> But all of this _could_ be done completely independent of the Dom0
>> kernel's kexec infrastructure (i.e. fully from user space, invoking
>> the necessary hypercalls through the privcmd driver).
>
> No, this is impossible. The kexec/kdump image lives in dom0 kernel
> memory until execution. That is why the privcmd driver itself is not a
> solution in this case.

Even if so, there's no fundamental reason why that kernel image can't
be put into Xen-controlled space instead.

Jan
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
On Fri, Nov 23, 2012 at 10:51:55AM +0000, Jan Beulich wrote:
> >>> On 23.11.12 at 11:37, Daniel Kiper <daniel.ki...@oracle.com> wrote:
> > No, this is impossible. The kexec/kdump image lives in dom0 kernel
> > memory until execution. That is why the privcmd driver itself is not
> > a solution in this case.
>
> Even if so, there's no fundamental reason why that kernel image can't
> be put into Xen-controlled space instead.

Yep, but we must change the Xen kexec interface and/or its behavior
first. If we take that option then we could also move almost all the
needed machinery from the dom0 kernel into Xen. This way we could
simplify the Linux kernel kexec/kdump infrastructure needed to run on
Xen.

Daniel
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
Daniel Kiper <daniel.ki...@oracle.com> writes:

> On Tue, Nov 20, 2012 at 08:40:39AM -0800, ebied...@xmission.com wrote:
>> Daniel Kiper <daniel.ki...@oracle.com> writes:
>>
>>> Some kexec/kdump implementations (e.g. Xen PVOPS) could not use
>>> default functions or require some changes in behavior of kexec/kdump
>>> generic code. To cope with that problem the kexec_ops struct was
>>> introduced. It allows a developer to replace all or some functions
>>> and control some functionality of kexec/kdump generic code. Default
>>> behavior of kexec/kdump generic code is not changed.
>>
>> Ick.
>>
>>> v2 - suggestions/fixes:
>>>    - add comment for kexec_ops.crash_alloc_temp_store member
>>>      (suggested by Konrad Rzeszutek Wilk),
>>>    - simplify kexec_ops usage (suggested by Konrad Rzeszutek Wilk).
>>>
>>> Signed-off-by: Daniel Kiper <daniel.ki...@oracle.com>
>>> ---
>>>  include/linux/kexec.h |  26 ++
>>>  kernel/kexec.c        | 131 +
>>>  2 files changed, 125 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
>>> index d0b8458..c8d0b35 100644
>>> --- a/include/linux/kexec.h
>>> +++ b/include/linux/kexec.h
>>> @@ -116,7 +116,33 @@ struct kimage {
>>>  #endif
>>>  };
>>>
>>> +struct kexec_ops {
>>> +	/*
>>> +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not
>>> +	 * access directly crash kernel memory area. In this situation
>>> +	 * they must allocate memory outside of it and later move contents
>>> +	 * from temporary storage to final resting places (usually done by
>>> +	 * relocate_kernel()). Such behavior could be enforced by setting
>>> +	 * crash_alloc_temp_store member to true.
>>> +	 */
>>
>> Why in the world would Xen not be able to access crash kernel memory?
>> As currently defined it is normal memory that the kernel chooses not
>> to use. If relocate_kernel() can access that memory you definitely
>> can access the memory, so the comment does not make any sense.
>
> Crash kernel memory is reserved by the Xen hypervisor, and only the
> hypervisor has access to it. dom0 does not have any mapping of this
> area. However, relocate_kernel() has access to crash kernel memory
> because it is executed by the Xen hypervisor, and the whole machine
> memory is identity-mapped there.

This is all weird. Doubly so since this code is multi-arch and you
have a set of requirements no other arch has had.

I recall that Xen uses kexec in a unique manner. What is the hypervisor
interface and how is it used? Is this for when the hypervisor crashes
and we want a crash dump of that?

>>> +	bool crash_alloc_temp_store;
>>> +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
>>> +					   unsigned int order,
>>> +					   unsigned long limit);
>>> +	void (*kimage_free_pages)(struct page *page);
>>> +	unsigned long (*page_to_pfn)(struct page *page);
>>> +	struct page *(*pfn_to_page)(unsigned long pfn);
>>> +	unsigned long (*virt_to_phys)(volatile void *address);
>>> +	void *(*phys_to_virt)(unsigned long address);
>>> +	int (*machine_kexec_prepare)(struct kimage *image);
>>> +	int (*machine_kexec_load)(struct kimage *image);
>>> +	void (*machine_kexec_cleanup)(struct kimage *image);
>>> +	void (*machine_kexec_unload)(struct kimage *image);
>>> +	void (*machine_kexec_shutdown)(void);
>>> +	void (*machine_kexec)(struct kimage *image);
>>> +};
>>
>> Ugh. This is a nasty abstraction. You are mixing and matching a bunch
>> of things together here. If you need to override machine_kexec_xxx
>> please do that on a per-architecture basis.
>
> Yes, it is possible, but I think it is worth doing it at this level
> because it could be useful for other archs too (e.g. the Xen ARM port
> is under development). Then we do not need to duplicate that
> functionality in arch code. Additionally, Xen requires the
> machine_kexec_load and machine_kexec_unload hooks, which are not
> available in the current generic kexec/kdump code.

Let me be clear. kexec_ops as you have implemented it is absolutely
unacceptable. Your kexec_ops is not an abstraction but a hack that
enshrines implementation details in stone.

>> Special-case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
>> phys_to_virt, and friends seem completely inappropriate.
>
> They are required in the Xen PVOPS case. If we do not do it that way
> then we at least need to duplicate almost all of the existing generic
> kexec/kdump code in arch-dependent files, not to mention that we would
> need to capture the relevant syscall and other things. I think that
> this is the wrong way.

A different definition of phys_to_virt and page_to_pfn for one specific
function is total nonsense. It may actually be better to have a
completely different code path.

This looks more like code abuse than code reuse. Successful code reuse
depends upon not breaking the assumptions on which the code relies, or
modifying the code so that the new modified assumptions are clear. In
this case you might as well define up as down for all of the sense
kexec_ops makes.

There may be a point to all of these but you are mixing and matching
things badly.

Eric
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
The other thing that should be considered here is how utterly
preposterous the notion of doing in-guest crash dumping is in a system
that contains a hypervisor. The reason for kdump is that on bare metal
there are no other options, but in a hypervisor system the right thing
should be for the hypervisor to do the dump (possibly spawning a clean
I/O domain if the I/O domain is necessary to access the media). There
is absolutely no reason to have a crashkernel sitting around in each
guest, consuming memory, and possibly getting corrupted.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
Bullshit. This should be a separate domain.

Andrew Cooper <andrew.coop...@citrix.com> wrote:
> On 22/11/12 17:47, H. Peter Anvin wrote:
>> The other thing that should be considered here is how utterly
>> preposterous the notion of doing in-guest crash dumping is in a
>> system that contains a hypervisor. The reason for kdump is that on
>> bare metal there are no other options, but in a hypervisor system the
>> right thing should be for the hypervisor to do the dump (possibly
>> spawning a clean I/O domain if the I/O domain is necessary to access
>> the media). There is absolutely no reason to have a crashkernel
>> sitting around in each guest, consuming memory, and possibly getting
>> corrupted.
>>
>> 	-hpa
>
> I agree that regular guests should not be using kexec/kdump. However,
> this patch series is required for allowing a pvops kernel to be a
> crash kernel for Xen, which is very important from dom0/Xen's point
> of view.

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
I still don't really get why it can't be isolated from dom0, which
would make more sense to me, even for a Xen crash.

Andrew Cooper <andrew.coop...@citrix.com> wrote:
> On 22/11/2012 17:47, H. Peter Anvin wrote:
>> The other thing that should be considered here is how utterly
>> preposterous the notion of doing in-guest crash dumping is in a
>> system that contains a hypervisor. The reason for kdump is that on
>> bare metal there are no other options, but in a hypervisor system the
>> right thing should be for the hypervisor to do the dump (possibly
>> spawning a clean I/O domain if the I/O domain is necessary to access
>> the media). There is absolutely no reason to have a crashkernel
>> sitting around in each guest, consuming memory, and possibly getting
>> corrupted.
>>
>> 	-hpa
>
> (Your reply to my email, which I can see on the xen-devel archive,
> appears to have gotten lost somewhere inside the Citrix email system,
> so apologies for replying out of order.)
>
> The kdump kernel loaded by dom0 is for when Xen crashes, not for when
> dom0 crashes (although a dom0 crash does admittedly lead to a Xen
> crash).
>
> There is no possible way it could be a separate domain; Xen completely
> ceases to function as soon as it jumps to the entry point of the kdump
> image.
>
> ~Andrew

-- 
Sent from my mobile phone. Please excuse brevity and lack of formatting.
Re: [PATCH v2 01/11] kexec: introduce kexec_ops struct
Daniel Kiper <daniel.ki...@oracle.com> writes:

> Some kexec/kdump implementations (e.g. Xen PVOPS) could not use
> default functions or require some changes in behavior of kexec/kdump
> generic code. To cope with that problem the kexec_ops struct was
> introduced. It allows a developer to replace all or some functions and
> control some functionality of kexec/kdump generic code. Default
> behavior of kexec/kdump generic code is not changed.

Ick.

> v2 - suggestions/fixes:
>    - add comment for kexec_ops.crash_alloc_temp_store member
>      (suggested by Konrad Rzeszutek Wilk),
>    - simplify kexec_ops usage (suggested by Konrad Rzeszutek Wilk).
>
> Signed-off-by: Daniel Kiper <daniel.ki...@oracle.com>
> ---
>  include/linux/kexec.h |  26 ++
>  kernel/kexec.c        | 131 +
>  2 files changed, 125 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d0b8458..c8d0b35 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -116,7 +116,33 @@ struct kimage {
>  #endif
>  };
>
> +struct kexec_ops {
> +	/*
> +	 * Some kdump implementations (e.g. Xen PVOPS dom0) could not
> +	 * access directly crash kernel memory area. In this situation
> +	 * they must allocate memory outside of it and later move contents
> +	 * from temporary storage to final resting places (usually done by
> +	 * relocate_kernel()). Such behavior could be enforced by setting
> +	 * crash_alloc_temp_store member to true.
> +	 */

Why in the world would Xen not be able to access crash kernel memory?
As currently defined it is normal memory that the kernel chooses not to
use. If relocate_kernel() can access that memory you definitely can
access the memory, so the comment does not make any sense.

> +	bool crash_alloc_temp_store;
> +	struct page *(*kimage_alloc_pages)(gfp_t gfp_mask,
> +					   unsigned int order,
> +					   unsigned long limit);
> +	void (*kimage_free_pages)(struct page *page);
> +	unsigned long (*page_to_pfn)(struct page *page);
> +	struct page *(*pfn_to_page)(unsigned long pfn);
> +	unsigned long (*virt_to_phys)(volatile void *address);
> +	void *(*phys_to_virt)(unsigned long address);
> +	int (*machine_kexec_prepare)(struct kimage *image);
> +	int (*machine_kexec_load)(struct kimage *image);
> +	void (*machine_kexec_cleanup)(struct kimage *image);
> +	void (*machine_kexec_unload)(struct kimage *image);
> +	void (*machine_kexec_shutdown)(void);
> +	void (*machine_kexec)(struct kimage *image);
> +};

Ugh. This is a nasty abstraction. You are mixing and matching a bunch
of things together here. If you need to override machine_kexec_xxx
please do that on a per-architecture basis.

Special-case overrides of page_to_pfn, pfn_to_page, virt_to_phys,
phys_to_virt, and friends seem completely inappropriate.

There may be a point to all of these but you are mixing and matching
things badly.

Eric