Re: [RFC PATCH 4/6] ftrace: Allow stealing pages from pipe buffer
On Tue, 2012-07-24 at 11:37 +0900, Yoshihiro YUNOMAE wrote: > From: Masami Hiramatsu > > Use generic steal operation on pipe buffer to allow stealing > ring buffer's read page from pipe buffer. > > Note that this could reduce the performance of splice on the > splice_write side operation without affinity setting. > Since the ring buffer's read pages are allocated on the > tracing-node, but the splice user does not always execute > splice write side operation on the same node. In this case, > the page will be accessed from the another node. > Thus, it is strongly recommended to assign the splicing > thread to corresponding node. > > Signed-off-by: Masami Hiramatsu Acked-by: Steven Rostedt -- Steve > Cc: Frederic Weisbecker > Cc: Ingo Molnar > --- > > kernel/trace/trace.c |8 +--- > 1 files changed, 1 insertions(+), 7 deletions(-) > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index a120f98..ae01930 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct > pipe_inode_info *pipe, > buf->private = 0; > } > > -static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe, > - struct pipe_buffer *buf) > -{ > - return 1; > -} > - > static void buffer_pipe_buf_get(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > @@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations > buffer_pipe_buf_ops = { > .unmap = generic_pipe_buf_unmap, > .confirm= generic_pipe_buf_confirm, > .release= buffer_pipe_buf_release, > - .steal = buffer_pipe_buf_steal, > + .steal = generic_pipe_buf_steal, > .get= buffer_pipe_buf_get, > }; > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer
On Thu, 2012-08-09 at 18:24 +0900, Masami Hiramatsu wrote: > Yeah, it is really easy to fix that. > But out of curiosity, would that be really a problem? > I guess that host can access any guest page if need. If that > is right, is that really insecure to leak randomly allocated > unused page to the host? Yeah, it's like protecting userspace pages from the kernel ;-) -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/5] trace-cmd: Use TRACE_DIR envrionment variable if defined
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote: > From: Masami Hiramatsu > > Use TRACE_DIR environment variable for setting TRACING_DIR would be better, as we are searching for /debug/tracing and not /debug/trace. Perhaps DEBUG_TRACING_DIR would be even better as to be less of a generic term. -- Steve > debugfs/tracing directory if defined. This is > for controlling guest(or remote) ftrace. > > Signed-off-by: Masami Hiramatsu > Signed-off-by: Yoshihiro YUNOMAE > --- > > trace-util.c |9 + > 1 files changed, 9 insertions(+), 0 deletions(-) > > diff --git a/trace-util.c b/trace-util.c > index e128188..d5a3eb4 100644 > --- a/trace-util.c > +++ b/trace-util.c > @@ -311,6 +311,15 @@ char *tracecmd_find_tracing_dir(void) > char type[100]; > FILE *fp; > > + tracing_dir = getenv("TRACE_DIR"); > + if (tracing_dir) { > + tracing_dir = strdup(tracing_dir); > + if (!tracing_dir) > + die("malloc"); > + warning("Use environmental tracing directory: %s\n", > tracing_dir); > + return tracing_dir; > + } > + > if ((fp = fopen("/proc/mounts","r")) == NULL) { > warning("Can't open /proc/mounts for read"); > return NULL; > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote: > From: Masami Hiramatsu > > Count debugfs/tracing/per_cpu/cpu* to determine the > number of CPUs. I'm curious, do you find that sysconf doesn't return the # of CPUs the system has? I've had boxes where the per_cpu/cpu* had more cpus than the box actually holds. But this was a bug in the kernel, not the tool. This change log needs to have rational instead of just explaining what the patch does. Thanks, -- Steve > > Signed-off-by: Masami Hiramatsu > Signed-off-by: Yoshihiro YUNOMAE > --- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote: > Add read path and control path to use trace-agent of virtio-trace. > When we use trace-agent, trace-cmd will be used as follows: > # AGENT_READ_DIR=/tmp/virtio-trace/tracing \ > AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \ > TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \\ Ha! You used "TRACING_DIR" but patch one introduces TRACE_DIR. Lets change this to DEBUG_TRACING_DIR instead anyway. Also, I don't like the generic environment variables. Perhaps VIRTIO_TRACE_DIR, or AGENT_TRACE_DIR and AGENT_TRACE_CTL. Lets try to keep the environment namespace sparse. > trace-cmd record -e "sched:*" > Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace, > AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path > of a guest. > > Signed-off-by: Yoshihiro YUNOMAE > --- > > trace-cmd.h |1 + > trace-recorder.c | 57 > +- > trace-util.c | 18 + > 3 files changed, 75 insertions(+), 1 deletions(-) > > diff --git a/trace-cmd.h b/trace-cmd.h > index f904dc5..75506ed 100644 > --- a/trace-cmd.h > +++ b/trace-cmd.h > @@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void) > } > > char *tracecmd_find_tracing_dir(void); > +char *guest_agent_tracing_read_dir(void); > > /* --- Opening and Reading the trace.dat file --- */ > > diff --git a/trace-recorder.c b/trace-recorder.c > index 215affc..3b750e9 100644 > --- a/trace-recorder.c > +++ b/trace-recorder.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > > #include "trace-cmd.h" > > @@ -43,6 +44,8 @@ struct tracecmd_recorder { > int page_size; > int cpu; > int stop; > + int ctl_fd; > + boolagent_existing; Thanks for the reminder. I need to convert a lot to use 'bool' instead. > }; > > void tracecmd_free_recorder(struct tracecmd_recorder *recorder) > @@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder > *recorder) > free(recorder); > } > > +static char *use_trace_agent_dir(char *ctl_path, > + struct tracecmd_recorder *recorder) > +{ > + ctl_path = strdup(ctl_path); > + if (!ctl_path) > + die("malloc"); > + warning("Use environmental control path: %s\n", ctl_path); s/Use/Using/ -- Steve > + > + recorder->ctl_fd = open(ctl_path, O_WRONLY); > + if (recorder->ctl_fd < 0) > + return NULL; > + > + recorder->agent_existing = true; > + > + return guest_agent_tracing_read_dir(); > +} > + ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs
On Thu, 2012-08-23 at 12:00 +0900, Masami Hiramatsu wrote: > (2012/08/23 11:01), Masami Hiramatsu wrote: > > (2012/08/22 22:41), Steven Rostedt wrote: > >> On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote: > >>> From: Masami Hiramatsu > >>> > >>> Count debugfs/tracing/per_cpu/cpu* to determine the > >>> number of CPUs. > >> > >> I'm curious, do you find that sysconf doesn't return the # of CPUs the > >> system has? > > > > No, sysconf returns the number of hosts CPUs, not guests. > > > >> I've had boxes where the per_cpu/cpu* had more cpus than the > >> box actually holds. But this was a bug in the kernel, not the tool. This > >> change log needs to have rational instead of just explaining what the > >> patch does. > > > > Ah, I see. Hmm, then this should be enabled by a command line > > option or an environment variable. > > Oops, I misunderstood. I'll add more comment for why this > should be tried instead of sysconf. And now that I understand why you are doing this, why not only do this if the TRACE_AGENT or DEBUG_TRACING_DIR is defined. That is, if we are doing it against a bare metal system, then sysconf should suffice, but if we are tracing against a guest, then it should use the tracing directory to determine the buffers. We could add options to override this, but I would think the default should just Do The Right Thing(tm). -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: Readonly GDT
On Tue, 2013-04-09 at 17:43 -0700, H. Peter Anvin wrote: > OK, thinking about the GDT here. > > The GDT is quite small -- 256 bytes on i386, 128 bytes on x86-64. As > such, we probably don't want to allocate a full page to it for only > that. This means that in order to create a readonly mapping we have to > pack GDTs from different CPUs together in the same pages, *or* we > tolerate that other things on the same page gets reflected in the same > mapping. What about grouping via nodes? > > However, the packing solution has the advantage of reducing address > space consumption which matters on 32 bits: even on i386 we can easily > burn a megabyte of address space for 4096 processors, but burning 16 > megabytes starts to hurt. Having 4096 32 bit processors, you deserve what you get. ;-) -- Steve > > It would be important to measure the performance impact on task switch, > though. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] tracing/events: Add bounce tracing to swiotbl-xen
On Thu, 22 Aug 2013 22:47:28 +0100 Zoltan Kiss wrote: > /* > * Used to do a quick range check in swiotlb_tbl_unmap_single and > * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by > this > @@ -358,6 +362,9 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, > struct page *page, > /* >* Oh well, have to allocate and map a bounce buffer. >*/ > + > + trace_bounced(dev, dev_addr, size, swiotlb_force); Please use a more specific name. "bounce" is too generic. I know tracepoints are grouped by systems, but its easier for tools to just state a tracepoint name than the system:event pair. trace_xen_bounced() or trace_swiotlb_bounced() Something other than just "bounced". Thanks! -- Steve > + > map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir); > if (map == SWIOTLB_MAP_ERROR) > return DMA_ERROR_CODE; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCHv2] tracing/events: Add bounce tracing to swiotbl
On Wed, 25 Sep 2013 13:56:49 -0400 Konrad Rzeszutek Wilk wrote: > .. snip.. > > >+ TP_printk("dev_name: %s dma_mask=%llx dev_addr=%llx " > > >+ "size=%zu swiotlb_force=%x", > > >+ __get_str(dev_name), > > >+ __entry->dma_mask, > > >+ (unsigned long long)__entry->dev_addr, > > >+ __entry->size, > > >+ __entry->swiotlb_force) > > Would it make sense to do something like this: > > __entry->swiotlb_force ? "swiotlb_force" : "") I think that's fine. I do believe that the libtraceevents can parse the "?:" syntax. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH -tip RFC 0/2] kprobes: introduce NOKPROBE_SYMBOL() and prohibit probing on .entry.text
On Tue, 12 Nov 2013 02:18:53 +0900 Masami Hiramatsu wrote: > > > After that we can convert all the rest, probably as part of this series. > > OK, I'll do. :) > BTW, converting all the __kprobes involves many archs, which > kprobes ported. In that case, which mailing-list would better me > to post the series, linux-arch? I would add linux-arch. Note, you may need to support both ways for the current time being, as new __kprobes are being added (I've seen several in patches flying by in LKML). But perhaps at a later -rc we convert the rest and discontinue them. That way, linux-next will break all new __kprobes, and that should get them fixed before we enter 3.14-rc. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH -tip RFC v2 01/22] kprobes: Prohibit probing on .entry.text code
On Fri, 15 Nov 2013 04:53:18 + Masami Hiramatsu wrote: > .entry.text is a code area which is used for interrupt/syscall > entries, and there are many sensitive codes. > Thus, it is better to prohibit probing on all of such codes > instead of a part of that. > Since some symbols are already registered on kprobe blacklist, > this also removes them from the blacklist. This change only works with x86. On other archs, I get this: kernel/built-in.o: In function `register_kprobe': (.kprobes.text+0x9f4): undefined reference to `__entry_text_start' kernel/built-in.o: In function `register_kprobe': (.kprobes.text+0x9f8): undefined reference to `__entry_text_end' make[1]: *** [vmlinux] Error 1 make: *** [sub-make] Error 2 -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH -tip v3 00/23] kprobes: introduce NOKPROBE_SYMBOL() and general cleaning of kprobe blacklist
On Wed, 20 Nov 2013 12:36:00 -0500 "Frank Ch. Eigler" wrote: > Hi - > > > > Does this new blacklist cover enough that the kernel now survives a > > > broadly wildcarded perf-probe, e.g. over e.g. all of its kallsyms? > > > > That's generally the purpose of the annotations - if it doesn't then > > that's a bug. > > AFAIK, no kernel since kprobes was introduced has ever stood up to > that test. perf probe lacks the wildcarding powers of systemtap, so > one needs to resort to something like: > > # cat /proc/kallsyms | grep ' [tT] ' | while read addr type symbol; do >perf probe $symbol > done I'm curious to why one would do that. IIUC, perf now has function tracing support. -- Steve > > then wait for a few hours for that to finish. Then, or while the loop > is still running, run > > # perf record -e 'probe:*' -aR sleep 1 > > to take a kernel down. > > > - FChE ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions
Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up after enabling function tracer. I asked him to bisect the functions within available_filter_functions, which he did and it came down to three: _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64() It was found that this is only an issue when noreplace-paravirt is added to the kernel command line. This means that those functions are most likely called within critical sections of the funtion tracer, and must not be traced. In newer kenels _paravirt_nop() is defined within gcc asm(), and is no longer an issue. But both _paravirt_ident_{32,64}() causes the following splat when they are traced: mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054) mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070) mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054) mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054) NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469] Modules linked in: e1000e CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000 RIP: 0010:[] [] queued_spin_lock_slowpath+0x118/0x1a0 RSP: 0018:8800d4aefb90 EFLAGS: 0246 RAX: RBX: RCX: 88011eb16d40 RDX: 82485760 RSI: 1f288820 RDI: ea008030 RBP: 8800d4aefb90 R08: 000c R09: R10: 821c8e0e R11: R12: 88200fb8 R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0 FS: 7f7a4e3d7840() GS:88011eb0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0 Stack: 8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b 8800d4aefcb0 8800d4aefbd0 811bf4cb 0002 0015 8800d2276050 8000c0fd8867 ea008030 Call Trace: [] _raw_spin_lock+0x27/0x30 [] handle_pte_fault+0x13db/0x16b0 [] ? function_trace_call+0x15b/0x180 [] ? handle_pte_fault+0x5/0x16b0 [] handle_mm_fault+0x312/0x670 [] ? find_vma+0x68/0x70 [] __do_page_fault+0x1b1/0x4e0 [] do_page_fault+0x22/0x30 [] page_fault+0x28/0x30 [] ? copy_user_enhanced_fast_string+0x5/0x10 [] ? seq_read+0x305/0x370 [] __vfs_read+0x28/0xe0 [] ? __vfs_read+0x5/0xe0 [] ? __vfs_read+0x5/0xe0 [] vfs_read+0x86/0x130 [] SyS_read+0x46/0xa0 [] entry_SYSCALL_64_fastpath+0x1e/0xa8 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b Reported-by: Łukasz Daniluk Signed-off-by: Steven Rostedt diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index f08ac28b8136..f975d226be6e 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n" ".popsection"); /* identity function, which can be inlined */ -u32 _paravirt_ident_32(u32 x) +u32 notrace _paravirt_ident_32(u32 x) { return x; } -u64 _paravirt_ident_64(u64 x) +u64 notrace _paravirt_ident_64(u64 x) { return x; } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 21/27] x86/ftrace: Adapt function tracing for PIE support
On Thu, 24 May 2018 13:40:24 +0200 Petr Mladek wrote: > On Wed 2018-05-23 12:54:15, Thomas Garnier wrote: > > When using -fPIE/PIC with function tracing, the compiler generates a > > call through the GOT (call *__fentry__@GOTPCREL). This instruction > > takes 6 bytes instead of 5 on the usual relative call. > > > > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop > > so ftrace can handle the previous 5-bytes as before. > > > > Position Independent Executable (PIE) support will allow to extended the > > KASLR randomization range below the -2G memory limit. > > > > Signed-off-by: Thomas Garnier > > --- > > arch/x86/include/asm/ftrace.h | 6 +++-- > > arch/x86/include/asm/sections.h | 4 > > arch/x86/kernel/ftrace.c| 42 +++-- > > 3 files changed, 48 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h > > index c18ed65287d5..8f2decce38d8 100644 > > --- a/arch/x86/include/asm/ftrace.h > > +++ b/arch/x86/include/asm/ftrace.h > > @@ -25,9 +25,11 @@ extern void __fentry__(void); > > static inline unsigned long ftrace_call_adjust(unsigned long addr) > > { > > /* > > -* addr is the address of the mcount call instruction. > > -* recordmcount does the necessary offset calculation. > > +* addr is the address of the mcount call instruction. PIE has always a > > +* byte added to the start of the function. > > */ > > + if (IS_ENABLED(CONFIG_X86_PIE)) > > + addr -= 1; > > This seems to modify the address even for modules that are _not_ compiled with > PIE, see below. Can one load a module not compiled for PIE in a kernel with PIE? > > > return addr; > > } > > > > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c > > index 01ebcb6f263e..73b3c30cb7a3 100644 > > --- a/arch/x86/kernel/ftrace.c > > +++ b/arch/x86/kernel/ftrace.c > > @@ -135,6 +135,44 @@ ftrace_modify_code_direct(unsigned long ip, unsigned > > const char *old_code, > > return 0; > > } > > > > +/* Bytes before call GOT offset */ > > +const unsigned char got_call_preinsn[] = { 0xff, 0x15 }; > > + > > +static int > > +ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code, > > + unsigned const char *new_code) > > +{ > > + unsigned char replaced[MCOUNT_INSN_SIZE + 1]; > > + > > + ftrace_expected = old_code; > > + > > + /* > > +* If PIE is not enabled or no GOT call was found, default to the > > +* original approach to code modification. > > +*/ > > + if (!IS_ENABLED(CONFIG_X86_PIE) || > > + probe_kernel_read(replaced, (void *)ip, sizeof(replaced)) || > > + memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn))) > > + return ftrace_modify_code_direct(ip, old_code, new_code); > > And this looks like an attempt to handle modules compiled without > PIE. Does it works with the right ip in that case? I'm guessing the || is for the "or no GOT call was found", but it doesn't explain why no GOT would be found. > > I wonder if a better solution would be to update > scripts/recordmcount.c to store the incremented location into the module. If recordmcount.c can handle this, then I think that's the preferred approach. Thanks! -- Steve > > IMPORTANT: I have only vague picture about how this all works. It is > possible that I am completely wrong. The code might be correct, > especially if you tested this situation. > > Best Regards, > Petr ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: net-next boot error
[ Added Thomas Gleixner ] On Thu, 26 Jul 2018 11:34:39 +0200 Dmitry Vyukov wrote: > On Thu, Jul 26, 2018 at 11:29 AM, syzbot > wrote: > > Hello, > > > > syzbot found the following crash on: > > > > HEAD commit:dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg > > git tree: net-next > > console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840 > > kernel config: https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8 > > dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7 > > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > > > Unfortunately, I don't have any reproducer for this crash yet. > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com > > > > possible deadlock in static_key_slow_incsd 0:0:1:0: [sda] Attached SCSI disk > > MACsec IEEE 802.1AE > > tun: Universal TUN/TAP device driver, 1.6 > > > > > > WARNING: possible recursive locking detected > > +Tetsuo, perhaps this boot lockdep problem then disables lockdep for > actual testing. I think lockdep should respect panic_on_warn. > > > > 4.18.0-rc6+ #141 Not tainted > > > > swapper/0/1 is trying to acquire lock: > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: > > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124 > > > > but task is already holding lock: > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus > > include/linux/cpu.h:126 [inline] > > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0xe1a/0x1520 > > drivers/net/virtio_net.c:2777 Here init_vqs() does: get_online_cpus(); virtnet_set_affinity(vi); put_online_cpus(); Which disables cpu hotplug and calls virtnet_set_affinity() Note, get_online_cpus() is no longer recursive. > > > > other info that might help us debug this: > > Possible unsafe locking scenario: > > > >CPU0 > > > > lock(cpu_hotplug_lock.rw_sem); > > lock(cpu_hotplug_lock.rw_sem); > > > > *** DEADLOCK *** > > > > May be due to missing lock nesting notation > > > > 3 locks held by swapper/0/1: > > #0: (ptrval) (&dev->mutex){}, at: device_lock > > include/linux/device.h:1134 [inline] > > #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x15f/0x2f0 > > drivers/base/dd.c:820 > > #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus > > include/linux/cpu.h:126 [inline] > > #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: > > init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777 > > #2: (ptrval) (xps_map_mutex){+.+.}, at: > > __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278 > > > > stack backtrace: > > CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141 > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > Google 01/01/2011 > > Call Trace: > > __dump_stack lib/dump_stack.c:77 [inline] > > dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 > > print_deadlock_bug kernel/locking/lockdep.c:1765 [inline] > > check_deadlock kernel/locking/lockdep.c:1809 [inline] > > validate_chain kernel/locking/lockdep.c:2405 [inline] > > __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435 > > lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924 > > percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline] > > percpu_down_read include/linux/percpu-rwsem.h:59 [inline] > > cpus_read_lock+0x43/0xa0 kernel/cpu.c:289 > > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124 > > __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320 __netif_set_xps_queue() calls static_key_slow_inc() which will also do a get_online_cpus() which will trigger this bug. There's a static_key_slow_inc_cpuslocked() version that should be used when get_online_cpus() is already taken, but I see __netif_set_xps_queue() is called from several places, and I doubt it is always called with get_online_cpus() held. Thus just using the cpuslocked() version is probably not sufficient of a fix. I don't know the code enough to offer other suggestions. -- Steve > > netif_set_xps_queue+0x26/0x30 net/core/dev.c:2455 > > virtnet_set_affinity+0x2ba/0x4b0 drivers/net/virtio_net.c:1944 > > init_vqs+0xe22/0x1520 drivers/net/virtio_net.c:2778 > > virtnet_probe+0x1092/0x2260 drivers/net/virtio_net.c:3016 > > virtio_dev_probe+0x592/0x942 drivers/virtio/virtio.c:245 > > really_probe drivers/base/dd.c:446 [inline] > > driver_probe_device+0x6ad/0x970 drivers/base/dd.c:588 > > __driver_attach+0x28b/0x2f0 drivers/base/dd.c:822 > > bus_for_each_dev+0x15d/0x1f0 drivers/base/bus.c:311 > > driver_attach+0x3d/0x50 drivers/base/dd.c:841 > > bus_add_driver+0x4b2/0x600 drivers/base/bus.c:667 > > driver_register+0x1c8/0x320 drivers/base/driver.c:170 > > register_virtio_driver+0x79/0xd0 drivers/virtio/
[PATCH] x86/vmware: Do not trace vmware_sched_clock()
From: Steven Rostedt (VMware) When running function tracing on a Linux guest running on VMware Workstation, the guest would crash. This is due to tracing of the sched_clock internal call of the VMware vmware_sched_clock(), which causes an infinite recursion within the tracing code (clock calls must not be traced). Make vmware_sched_clock() not traced by ftrace. Cc: sta...@vger.kernel.org Fixes: 80e9a4f21fd7c ("x86/vmware: Add paravirt sched clock") Reported-by: GwanYeong Kim Signed-off-by: Steven Rostedt (VMware) --- diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c index d9ab49bed8af..0eda91f8eeac 100644 --- a/arch/x86/kernel/cpu/vmware.c +++ b/arch/x86/kernel/cpu/vmware.c @@ -77,7 +77,7 @@ static __init int setup_vmw_sched_clock(char *s) } early_param("no-vmw-sched-clock", setup_vmw_sched_clock); -static unsigned long long vmware_sched_clock(void) +static unsigned long long notrace vmware_sched_clock(void) { unsigned long long ns; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable
On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote: > Ingo Molnar wrote: > >> + /* Make sure kernel address space is empty so that a pagetable > >> + will be allocated for it. */ > >> > > > > comment style. > > > > As you've noticed its a comment style I use quite often. I use it for > what's essentially a little local comment, rather than a block comment > which really needs to draw attention to itself. In this case, I think using > Understood, but then you need to make that comment a single line :) The way Linux code prefers comments (like the code itself actually prefers it!) is single line comments stay on a single line, and multiline comments are always block. So it's either: /* my little comment here */ or /* * This comment, although small and to the point, does span multiple * lines and regardless, needs to be a block comment. */ I guess it just makes it easier on the eyes when everything conforms and others need to look at / maintain your code. -- Steve ___ Virtualization mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: todo
[added Ingo since he'd probably be interested in this] On Fri, 2007-03-16 at 21:00 -0700, Jeremy Fitzhardinge wrote: > Chris Wright wrote: > > * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote: > > > >> Chris Wright wrote: > >> > >>> Consistently wrap paravirt ops callsites > >>> "ugh" - mingo > >>> > >> Had a thought. What if we do a kind of reverse/two-way module linkage? > >> Somehow compile each pv-op implementation like a module, and then link > >> the appropriate one in at boot time. > >> > > > > This is very similar to something Steve was chatting with me about > > this morning. The idea he was tossing around was something a bit like > > an initrd that a load_module analog could link up. In a sense, it's > > similar to the VMI ROM, with the exceptions that the ABI is just created > > by the compiler from a normal mutable kernel API and it's linkage with > > symbols available on both sides. > > > > Yeah. It would have to happen a lot earlier than initrd. It would be > more like a multiboot kernel module or something. I was taking a look at kernel/module.c and arch/i386/kernel/module.c and it looks like we can copy/modify the former and use part of the later. > > >> Tricky parts: it would need two-way unresolved references between kernel > >> and module, and it would need to be able to run very early in the > >> kernel's life. > >> > > > > This is the tricky part, and where Steve and I left off. > > > I think it's possible. It still be tricky. We need to handle percpu variables. But I have some ideas on how to do it though. > Fortunately the linker code should be pretty easy to make > self-contained. It shouldn't need to do memory allocations or anything > complex like that, so I think its just a matter of grovelling around and > fixing up linkages. Well, we still need to make sure the kernel knows about this image, so that it doesn't write over it. I guess that can be done like the initrd is. > > > I suspect we could free the unused backends already. > > We could; we just need to make sure they get their own section so > they're easily separable. > > > It also has one > > negative side-effect, which is promoting external module code that links > > with the kernel. IOW, there's much less incentive to get code merged > > if it's just a matter of linking. > > It wouldn't be something we'd promote by making it easy to bind > out-of-tree code to the interface. And it would still be a > kernel-version-specific ABI; no guarantees of stability. It will be very similar to how we handle modules (and out of tree modules). -- Steve ___ Virtualization mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] lguest32 kallsyms backtrace of guest.
This is taken from the work I did on lguest64. When killing a guest, we read the guest stack to do a nice back trace of the guest and send it via printk to the host. So instead of just getting an error message from the lguest launcher of: lguest: bad read address 537012178 len 1 I also get in my dmesg: called from [] show_trace_log_lvl+0x1a/0x2f [] show_trace+0x12/0x14 [] dump_stack+0x16/0x18 [] lguest_dump_lg_regs+0x22/0x13c [lg] [] lgread+0x59/0x90 [lg] [] run_guest+0x26b/0x406 [lg] [] read+0x73/0x7d [lg] [] vfs_read+0xad/0x161 [] sys_read+0x3d/0x61 [] syscall_call+0x7/0xb === [] lgread+0x59/0x90 [lg] Printing LG 0 regs cr3: 021eb000 EIP: 0061: [<e00227d2>] ESP: 0069:c236fe3c EFLAGS: 00010202 EAX: 0004 EBX: e001fb20 ECX: 0008 EDX: 03f2 ESI: e001ee00 EDI: e001fb60 EBP: c236fea0 CR2: 1278000 lguest_data->cr2: 80011380 errcode: 0 trapnum: d Stack Dump: [<c1042b7a>] trace_hardirqs_on+0x125/0x149 [<c123b0ea>] wait_for_completion+0x90/0x98 [<c123bddc>] __mutex_unlock_slowpath+0x129/0x13e [<c1048769>] unlock_cpu_hotplug+0x62/0x64 [<c104b5b6>] sys_init_module+0x14e3/0x162c [<c1081d44>] do_sync_read+0xc2/0xff [<c1004f85>] restore_nocheck+0x12/0x15 [<c1004f34>] syscall_call+0x7/0xb TODO: - Clean up a little (still has stuff from lguest64 in it). - Perhaps make a config option or runtime switch to turn it off. - Send to the launcher the dump instead of printk. - make modules work too. Also I need to change the %u of the bad read print to a %x, because seeing 0x200227d2 is better than seeing 537012178 for addresses. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5-mm2/drivers/lguest/Makefile === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/Makefile +++ linux-2.6.21-rc5-mm2/drivers/lguest/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_LGUEST_GUEST) += lguest.o l # Host requires the other files, which can be a module. obj-$(CONFIG_LGUEST) += lg.o lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \ - segments.o io.o lguest_user.o hypervisor.o + segments.o io.o lguest_user.o hypervisor.o lguest_debug.o Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c +++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c @@ -210,6 +210,28 @@ int lguest_address_ok(const struct lgues } /* Just like get_user, but don't let guest access lguest binary. */ +u8 lgread_u8(struct lguest *lg, u32 addr) +{ + u8 val = 0; + + /* Don't let them access lguest binary */ + if (!lguest_address_ok(lg, addr) + || get_user(val, (u32 __user *)addr) != 0) + kill_guest(lg, "bad read address %u", addr); + return val; +} + +u16 lgread_u16(struct lguest *lg, u32 addr) +{ + u16 val = 0; + + /* Don't let them access lguest binary */ + if (!lguest_address_ok(lg, addr) + || get_user(val, (u32 __user *)addr) != 0) + kill_guest(lg, "bad read address %u", addr); + return val; +} + u32 lgread_u32(struct lguest *lg, u32 addr) { u32 val = 0; Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h +++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h @@ -176,6 +176,8 @@ extern struct mutex lguest_lock; /* core.c: */ /* Entry points in hypervisor */ const unsigned long *__lguest_default_idt_entries(void); +u8 lgread_u8(struct lguest *lg, u32 addr); +u16 lgread_u16(struct lguest *lg, u32 addr); u32 lgread_u32(struct lguest *lg, u32 addr); void lgwrite_u32(struct lguest *lg, u32 val, u32 addr); void lgread(struct lguest *lg, void *buf, u32 addr, unsigned bytes); @@ -238,6 +240,7 @@ int hypercall(struct lguest *info, struc #define kill_guest(lg, fmt...) \ do { \ if (!(lg)->dead) { \ + lguest_dump_lg_regs(lg);\ (lg)->dead = kasprintf(GFP_ATOMIC, fmt);\ if (!(lg)->dead)\ (lg)->dead = (void *)1; \ @@ -248,5 +251,11 @@ static inline unsigned long guest_pa(str { return vaddr - lg->page_offset; } + +/* lguest_debug.c */ +void lguest_print_address(struct lguest *lg, unsigned long address); +void lguest_dump_trace(struct lguest *lg, struct lguest_regs *regs); +void lguest_dump_lg_regs(struct lguest *lg); + #endif /* __ASSEMBLY__ */ #endif /* _LGUEST_H */ Index: linux-2.6.21-rc5-mm2/drivers/lguest/l
[PATCH] Lguest32, use guest page tables to find paddr for emulated instructions
[Bug that was found by my previous patch] This patch allows things like modules, which don't have a direct __pa(EIP) mapping to do emulated instructions. Sure, the emulated instruction probably should be a paravirt_op, but this patch lets you at least boot a kernel that has modules needing emulated instructions. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c +++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c @@ -160,11 +160,14 @@ static int emulate_insn(struct lguest *l { u8 insn; unsigned int insnlen = 0, in = 0, shift = 0; - unsigned long physaddr = guest_pa(lg, lg->regs->eip); + unsigned long physaddr = lguest_find_guest_paddr(lg, lg->regs->eip); - /* This only works for addresses in linear mapping... */ - if (lg->regs->eip < lg->page_offset) + /* FIXME: Handle physaddr's that crosses pages (modules are in VM) */ + + /* did we actually find the physaddr? */ + if (physaddr == (unsigned long)-1UL) return 0; + lgread(lg, &insn, physaddr, 1); /* Operand size prefix means it's actually for ax. */ Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h +++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h @@ -218,6 +218,7 @@ void guest_set_pte(struct lguest *lg, un void map_hypervisor_in_guest(struct lguest *lg, struct lguest_pages *pages); int demand_page(struct lguest *info, unsigned long cr2, int write); void pin_page(struct lguest *lg, unsigned long vaddr); +unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr); /* lguest_user.c: */ int lguest_device_init(void); Index: linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/page_tables.c +++ linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c @@ -105,6 +105,25 @@ static spte_t gpte_to_spte(struct lguest return spte; } +unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr) +{ + gpgd_t gpgd; + gpte_t gpte; + unsigned long gpte_ptr; + + gpgd = mkgpgd(lgread_u32(lg, gpgd_addr(lg, vaddr))); + if (!(gpgd.flags & _PAGE_PRESENT)) + return -1; + + gpte_ptr = gpte_addr(lg, gpgd, vaddr); + gpte = mkgpte(lgread_u32(lg, gpte_ptr)); + + if (!(gpte.flags & _PAGE_PRESENT)) + return -1; + + return (gpte.pfn << PAGE_SHIFT) | (vaddr & (PAGE_SIZE-1)); +} + /* FIXME: We hold reference to pages, which prevents them from being swapped. It'd be nice to have a callback when Linux wants to swap out. */ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] Lguest32 print hex on bad reads and writes
Currently the lguest32 error messages from bad reads and writes prints a decimal integer for addresses. This is pretty annoying. So this patch changes those to be hex outputs. This is applied on top of my debug patch. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c === --- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c +++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c @@ -220,7 +220,7 @@ u8 lgread_u8(struct lguest *lg, u32 addr /* Don't let them access lguest binary */ if (!lguest_address_ok(lg, addr) || get_user(val, (u32 __user *)addr) != 0) - kill_guest(lg, "bad read address %u", addr); + kill_guest(lg, "bad read address %x", addr); return val; } @@ -231,7 +231,7 @@ u16 lgread_u16(struct lguest *lg, u32 ad /* Don't let them access lguest binary */ if (!lguest_address_ok(lg, addr) || get_user(val, (u32 __user *)addr) != 0) - kill_guest(lg, "bad read address %u", addr); + kill_guest(lg, "bad read address %x", addr); return val; } @@ -242,7 +242,7 @@ u32 lgread_u32(struct lguest *lg, u32 ad /* Don't let them access lguest binary */ if (!lguest_address_ok(lg, addr) || get_user(val, (u32 __user *)addr) != 0) - kill_guest(lg, "bad read address %u", addr); + kill_guest(lg, "bad read address %x", addr); return val; } @@ -250,7 +250,7 @@ void lgwrite_u32(struct lguest *lg, u32 { if (!lguest_address_ok(lg, addr) || put_user(val, (u32 __user *)addr) != 0) - kill_guest(lg, "bad write address %u", addr); + kill_guest(lg, "bad write address %x", addr); } void lgread(struct lguest *lg, void *b, u32 addr, unsigned bytes) @@ -259,7 +259,7 @@ void lgread(struct lguest *lg, void *b, || copy_from_user(b, (void __user *)addr, bytes) != 0) { /* copy_from_user should do this, but as we rely on it... */ memset(b, 0, bytes); - kill_guest(lg, "bad read address %u len %u", addr, bytes); + kill_guest(lg, "bad read address %x len %u", addr, bytes); } } @@ -268,7 +268,7 @@ void lgwrite(struct lguest *lg, u32 addr if (addr + bytes < addr || !lguest_address_ok(lg, addr+bytes) || copy_to_user((void __user *)addr, b, bytes) != 0) - kill_guest(lg, "bad write address %u len %u", addr, bytes); + kill_guest(lg, "bad write address %x len %u", addr, bytes); } static void set_ts(unsigned int guest_ts) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] Lguest32 print hex on bad reads and writes
On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote: > > (Erk, I wonder what I was thinking when I wrote that?) Can I ask > > for %#x (or 0x%x)? I'm easily confused. > > How about "%p" for pointers? But that would require casting the numbers to pointers. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] lguest32 kallsyms backtrace of guest.
On Thu, 2007-04-05 at 12:54 +1000, Rusty Russell wrote: > > This is a cool idea, but there are two issues with this patch. The > first is that it's 500 lines of code: that's around +10% on lguest's > total code size! The second is that it conflicts with the medium-term > plan to allow any user to run up lguests: this is why lg.ko never > printk()s about problems with the guest. Not much I can do about the size, but it's in the debug section so hopefully it's not considered too bad :) > > While it is useful for cases where a guest dies mysteriously before it > brings up the console, three alternatives come to mind: > > 1) Modify early_printk so Guests can use it. > 2) Have a separate tool(-set?) for this kind of post-mortem. Then you > just have to implement guest suspend! 8) > 3) Put this in a CONFIG_LGUEST_DEBUG. > > Note that options 1 or 2 make you do more work, but are probably better > in the long term. I'm happy for #3 to sit as a patch in the tree for > the duration, tho! OK, I'll make a #3 patch to send, but the #1 looks best. Not to mention that I still need to make it so that the console can read it. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions
On Thu, 2007-04-05 at 12:59 +1000, Rusty Russell wrote: > On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote: > Yeah, I haven't tried loading random modules but I can imagine this does > happen (what module was it, BTW?) I have no idea of which module it crashed on. I didn't investigate that too much. I could simply send a trap to guest when __pa(addr) != lguest_find_guest_paddr(addr) and see which module it crashed on. My block device I used was basically a copy of a RHEL5 system. I only modified the inittab and fstab to get it working. So on startup and doing the udev init was when it crashed. > > I used to have a function just like this, but managed to get rid of > it. > > Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does > the pgtable walk and read all in one? It won't be efficient, but it'll > be more correct and maybe even fewer lines 8) I forgot that you have a goal to keep lguest small :) Perhaps we can fork, and have lguest and lguest-lite. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH] Lguest launcher, child starving parent
Glauber noticed long delays between hitting a key, and seeing data come up on the virtual console. Looking into this, I found that the wake_parent routine that reads from all devices was actually starving out the parent after sending the parent a signal to wake up. The thing is, the child which takes the console input is recognized by the scheduler as an interactive process. The parent, doesn't do so much, so it is recognized more as a CPU hog. So the child easily gets a higher priority than the parent. So when the child receives data from the console, it sends a signal to the parent and then does another select. The problem is that the select doesn't actually read from the device, and will return immediately since their is still data pending until it is read. But it's the parent that reads the data. So the child actually starves the parent from reading the data by spinning and waiting for it to read the data. The fix I implemented was to have the child wait for a response from the parent before going on. Since there was already communication between the parent and child via a pipe, I used that. This time, the data returned by the pipe is either everything went fine (fd == -1) or the device should be ignored (fd >= 0). Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> Index: linux-2.6-lguest/Documentation/lguest/lguest.c === --- linux-2.6-lguest.orig/Documentation/lguest/lguest.c 2007-04-05 16:13:08.0 -0400 +++ linux-2.6-lguest/Documentation/lguest/lguest.c 2007-04-05 16:16:31.0 -0400 @@ -328,15 +328,15 @@ static void wake_parent(int pipefd, stru for (;;) { fd_set rfds = devices->infds; + int ignorefd; select(devices->max_infd+1, &rfds, NULL, NULL, NULL); - if (FD_ISSET(pipefd, &rfds)) { - int ignorefd; - if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0) - exit(0); - FD_CLR(ignorefd, &devices->infds); - } kill(getppid(), SIGUSR1); + /* wait for parent response */ + if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0) + exit(0); + if (ignorefd >= 0) + FD_CLR(ignorefd, &devices->infds); } } @@ -594,6 +594,7 @@ static void handle_output(int fd, unsign static void handle_input(int fd, int childfd, struct device_list *devices) { struct timeval poll = { .tv_sec = 0, .tv_usec = 0 }; + int fdok = -1; for (;;) { struct device *i; @@ -608,7 +609,9 @@ static void handle_input(int fd, int chi FD_CLR(i->fd, &devices->infds); /* Tell child to ignore it too... */ write(childfd, &i->fd, sizeof(i->fd)); - } + } else + /* Tell child to continue */ + write(childfd, &fdok, sizeof(fdok)); } } } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: lguest over qemu
> > (Note that lguest doesn't support NMIs, but Steven has code for NMI > support for lguest-x86-64 which could be ported across). Rusty, About that. Is there a way to get a NMI only stack in i386? In x86_64 it's part of the TSS. So I can always know I have a good stack for the NMI. I'm not sure i368 has the same thing. Or do we always have a good stack whenever we are in ring0? Oh, and btw, I've just rewrote all of the Lguest64 page table handling. I'm just going over one design change that is really bothering me. In x86_64 we can have 2M or 4k pages (like the PSE in i386). But since 4K pages are used by the shadow page tables, I have to map them like that. But this means that I can have the same guest address as both a PMD and a PTE. Which is breaking some of my code. I'm working on a fix as I write this. But to get you up-to-date to where I'm at. I've implemented a way to have the HV mapped uniquely for all VCPUs. So there's a HV text section (the same for all VCPUs), a HV VCPU Data section (readonly in all rings with guest cr3), and a HV VCPU Scratch Pad section (read/write in rings 0,1,and2). So now the guest kernel runs in ring 1. With this change, I already implemented a syscall trampoline that no longer needs to switch to the host, as well as iretq by the guest kernel goes directly to the guest user space (or kernel). The next version of lguest64 will be much cleaner and faster -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
Hi Andi, Thanks for all the comments, it's greatly appreciated. On Wed, 8 Aug 2007, Andi Kleen wrote: > > > +#define SYSRETQ\ > > + movq%gs:pda_oldrsp,%rsp;\ > > + swapgs; \ > > + sysretq; > > When the macro does more than sysret it should have a different > name Noted. Do you have a better idea? Something like SETSTACK_SWAPGS_SYSRETQ? > > > > */ > > .globl int_ret_from_sys_call > > int_ret_from_sys_call: > > - cli > > + DISABLE_INTERRUPTS(CLBR_ANY) > > ANY? There are certainly some registers alive at this point like rax Glauber will need to address this, this is his code ;-) > > > retint_restore_args: > > - cli > > + DISABLE_INTERRUPTS(CLBR_ANY) > > Similar. > > > > /* > > * The iretq could re-enable interrupts: > > */ > > @@ -566,10 +587,14 @@ retint_restore_args: > > restore_args: > > RESTORE_ARGS 0,8,0 > > iret_label: > > - iretq > > +#ifdef CONFIG_PARAVIRT > > + INTERRUPT_RETURN > > +ENTRY(native_iret) > > ENTRY adds alignment. Why do you need that export anyways? The paravirt ops struct points to it. > > > +#endif > > +1: iretq > > > > .section __ex_table,"a" > > - .quad iret_label,bad_iret > > + .quad 1b, bad_iret > > iret_label seems more expressive to me than 1 The reason for this change is because of the added: #ifdef CONFIG_PARAVIRT INTERRUPT_RETURN ENTRY(native_iret) #endif If we are paravirt ops, we need the iretq in the exception table, not the paravit ops function call. Since that function call may simply call the native_iretq, and if we take a fault at the iretq, it wont be in the exception table. > > > + ENABLE_INTERRUPTS(CLBR_NONE) > > In many of the CLBR_NONEs there are actually some registers free; > but it might be safer to keep it this way. But if some client can get > significantly better code with one or two free registers it might > be worthwhile to investigate. Glauber, that comment is for you. > > > - swapgs > > + SWAPGS_NOSTACK > > There's still stack here OK, bad name then. How about SWAPGS_UNTRUSTED_STACK? >From earlier in the file where SWAPGS_NOSTACK is declared we have: /* Currently paravirt can't handle swapgs nicely when we * don't have a stack. So we either find a way around these * or just fault and emulate if a guest tries to call swapgs * directly. * * Either way, this is a good way to document that we don't * have a reliable stack. */ #define SWAPGS_NOSTACK swapgs > > > paranoid_restore\trace: > > RESTORE_ALL 8 > > - iretq > > + INTERRUPT_RETURN > > I suspect Xen will need much more changes anyways because of its > ring 3 guest. Are these changes sufficient for lguest? Probably not, but this part of the code I don't fully understand. But just doing this doesn't break lguest. But perhaps it only did not break it yet ;-) Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64
On Wed, 8 Aug 2007, Andi Kleen wrote: > Strictly you would at least need a !X86_VSMP dependency, but > with the vsmp change i requested that will be unnecessary > > Is this really synced with the latest version of the i386 code? Glauber started the paravirt ops 64 a second time around, from scratch using the PVOPS of i386 as a base. But since we couldn't just take a the PVOPS patch from i386 and apply it to x86_64, this was mainly done by looking at i386 code and massaging it for x86_64. Somethings may have slipped (and we may have been looking at different versions of PVOPS). It's not easy trying to keep up with a moving target ;-) -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
On Wed, 8 Aug 2007, Andi Kleen wrote: > > > Probably not, but this part of the code I don't fully understand. > > I would suggest to defer all this until at least one example to test it > (except vsmp which is too simple) is around. Who uses that code? NMIs and debug regs? Lguest only has the host handle the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who else uses that part of the code? We can leave the "paranoid check" change out for this series. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
-- On Wed, 8 Aug 2007, Andi Kleen wrote: > Steven Rostedt <[EMAIL PROTECTED]> writes: > > > On Wed, 8 Aug 2007, Andi Kleen wrote: > > > > > > > > > Probably not, but this part of the code I don't fully understand. > > > > > > I would suggest to defer all this until at least one example to test it > > > (except vsmp which is too simple) is around. > > > > Who uses that code? NMIs and debug regs? Lguest only has the host handle > > the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who > > else uses that part of the code? > > I'm not sure I understand your question. You're asking who uses entry.S? > Answer would be everybody. If you asked something else please reformulate. When I said "this part of the code I don't fully understand" I was not talking about entry.S. I understand entry.S very well, but the comment was originally on the paranoid_restore code. Which I thought had to deal with NMIs and such that I didn't worry about that I simply did the default. >> paranoid_restore\trace: >> RESTORE_ALL 8 >> - iretq >> + INTERRUPT_RETURN > >I suspect Xen will need much more changes anyways because of its >ring 3 guest. Are these changes sufficient for lguest? The above was what I was replying to. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
On Wed, 8 Aug 2007, Steven Rostedt wrote: > On Wed, 8 Aug 2007, Andi Kleen wrote: > > >> paranoid_restore\trace: > >> RESTORE_ALL 8 > >> - iretq > >> + INTERRUPT_RETURN > > > >I suspect Xen will need much more changes anyways because of its > >ring 3 guest. Are these changes sufficient for lguest? > > > The above was what I was replying to. If you were talking about the general iretq => INTERRUPT_RETURN, then the answer is "Yes, they are sufficient". The first version of lguest ran the guest kernel in ring 3 (using dual page tables for guest kernel and guest user). The current version I'm pushing runs lguest in ring 1, and the entry.S code worked for both. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
-- On Wed, 8 Aug 2007, Andi Kleen wrote: > > If you were talking about the general iretq => INTERRUPT_RETURN, then the > > answer is "Yes, they are sufficient". The first version of lguest ran the > > guest kernel in ring 3 (using dual page tables for guest kernel and guest > > user). The current version I'm pushing runs lguest in ring 1, and the > > entry.S code worked for both. > > How do you implement system calls then? /me working very hard to get lguest64 ready for public display Here's a snippet from my version of core.c. I've been thinking of ways to optimize it, but for now it works fine. This was done for both ring 3 and ring 1 lguest versions (this is the host running): /* * Update the LSTAR to point to the HV syscall handler. * Also update the fsbase if the guest uses one. */ wrmsrl(MSR_LSTAR, (unsigned long)HV_OFFSET(&lguest_syscall_trampoline)); [...] asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;" /* The stack we pushed is off by 8, due to the previous pushq */ "addq $8, %%rsp" : "=D"(foo), "=a"(bar) : "i" (__KERNEL_DS), "i" (__KERNEL_CS), "0" (vcpu->vcpu), "1"(get_idt_table()), "r" (sw_guest) : "memory", "cc"); [...] /* restore old LSTAR */ wrmsrl(MSR_LSTAR, vcpu->host_syscall); Also in the switcher.S (The Hypervisor): .global lguest_syscall_trampoline .type lguest_syscall_trampoline, @function lguest_syscall_trampoline: /* * Tricky, we don't have much to choose from here. * The only way to get to our stack is with swapgs. * but we need to save the stack too, so we have to play * very carefully. */ swapgs /* now gs points to our VCPU Guest Data */ /* first save the stack! */ movq%rsp, %gs:LGUEST_GUEST_DATA_regs_rsp /* * x86 arch doesn't have an easy way to find out where * gs is located. So we need to read the MSR. But first * we need to save off the rcx, rax and rdx. */ movq%rax, %gs:LGUEST_GUEST_DATA_regs_rax movq%rdx, %gs:LGUEST_GUEST_DATA_regs_rdx movq%rcx, %gs:LGUEST_GUEST_DATA_regs_rcx /* Need to read manual, does rdmsr clear * the top 32 bits of rax? */ xor %rax, %rax movl$MSR_GS_BASE,%ecx rdmsr shl $32, %rdx orq %rax, %rdx movq%rdx, %rsp /* see if we need to disable interrupts */ testq $(1<<9), %gs:LGUEST_GUEST_DATA_SFMASK jz 1f movq$0, %gs:LGUEST_GUEST_DATA_irq_enabled jmp 2f 1: /* Still need to clear bit 10 (just in case) */ /* (see lguest_iretq) */ movq$(1<<10), %rax not %rax andq%rax, %gs:LGUEST_GUEST_DATA_irq_enabled 2: /* put back the generic regs */ movq%gs:LGUEST_GUEST_DATA_regs_rdx, %rdx movq%gs:LGUEST_GUEST_DATA_regs_rcx, %rcx movq%gs:LGUEST_GUEST_DATA_regs_rax, %rax /* Is this a hypercall? */ testq $1, %gs:LGUEST_GUEST_DATA_is_hc jnz handle_hcall /* We have 64 bytes to play with */ addq$LGUEST_GUEST_DATA_regs, %rsp /* do the swapgs if possible */ testq $1, %gs:LGUEST_GUEST_DATA_do_swapgs je 1f /* We now have a stack to use */ /* go back to the guest's gs */ swapgs DO_SWAPGS_USE_STACK /* and then back to HV gs */ swapgs 1: /* * The stack has 64 bytes of playing room. * Which is enough to do a jump to the guest kernel. * We store the guest LSTAR register in the scatch pad * because we don't care if the guest messes with it. * If it is a bad address, we fault from the guest side * and we kill the guest. No harm done to the host. */ pushq $(__KERNEL_DS | 1) pushq %gs:LGUEST_GUEST_DATA_regs_rsp pushfq /* Make sure we have actual interrupts on */ orq $(1<<9), 0(%rsp) pushq $(__KERNEL_CS | 1) pushq %gs:LGUEST_GUEST_DATA_LSTAR swapgs iretq NOTE: This is still under development, since I'm going with a new design change to try to stay more in sync with lguest32. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
-- On Wed, 8 Aug 2007, Andi Kleen wrote: > > When I said "this part of the code I don't fully understand" I was not > > talking about entry.S. I understand entry.S very well, but the comment > > was originally on the paranoid_restore code. Which I thought had to deal > > with NMIs and such that I didn't worry about that I simply did the > > default. > > The paranoid path is used for more than just NMIs; it's also used for MCEs, > stack faults, double faults or debug exceptions. Anything that might > happen with a invalid stack or unknown GS state or system in other unknown > state. > > If you can guarantee your hypervisor never injects any of those it could be > ignored; > but at least losing debug exceptions would be probably not nice. For now it does ignore them. But that's something to work on for later versions of lguest64. I want lguest out in the public and working for the general case. I don't plan on ignoring the paranoid section forever. But it's more on an anomaly for now. But you are right, I want the debug stuff working. But I also need to walk before I can run ;-) > > > > > >> paranoid_restore\trace: > > >> RESTORE_ALL 8 > > >> - iretq > > >> + INTERRUPT_RETURN > > > > > >I suspect Xen will need much more changes anyways because of its > > >ring 3 guest. Are these changes sufficient for lguest? > > This was really a general comment not especially applying to the > paranoid path. Ah, OK, I assumed you were talking about just the previous code. But rereading your post, I see it was more general. Sorry, for the miscommunication. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
-- On Wed, 8 Aug 2007, Andi Kleen wrote: > On Wed, Aug 08, 2007 at 09:47:05AM -0400, Steven Rostedt wrote: > > > > [...] > > asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;" > > /* The stack we pushed is off by 8, due to the > > previous pushq */ > > "addq $8, %%rsp" > > Weird stack inbalance, i'm surprised that works. %6 must be doing strange > things. Heh, no it's very subtle. We do the "push %rsp" after the "push %2". So on return from the call, the stack that was pushed, is really 8 bytes off. Because we didn't save the stack pointer from the entry of the asm, we saved it after doing that "push %2". :-) We get back to this call via a retq from the HV switcher code. So that rsp that was pushed will be the stack. > > > /* Need to read manual, does rdmsr clear > > * the top 32 bits of rax? */ > > xor %rax, %rax > > > > movl$MSR_GS_BASE,%ecx > > rdmsr > > > This seems incredibly slow. Since GS changes are controlled in > the kernel why don't you just cache them in the per cpu area? > The only case where user can reload it is using segment > selector changes, which can be also handled. > > Also why do you need the guest gs value anyways? I did a swapgs here, so we have the HV GS value. I don't know of an easy way to read the GS value to put it into the RSP. > > You could just use your own stack and let the guest > switch to its own. Well, it is quite complex, since we can't save anything in the writable section that the guest can't also write to. Remember, we are in ring 1 for the guest kernel, so we have no protection. All the host related stuff is in read only sections when we are using the guest cr3. What I would like to do is have the GS point to the read only section, and load the RSP from a value there. But first we need to store the old value of RSP. The problem is, where do you store it. Hmm, I guess I can create my own "big" offset from the read only section :) OK this gives me an idea, I *can* point the kernel GS to the vcpu read only section, and have a way to store the value with the GS offset. The read only section is always mapped before the read write section, so it should be easy to calculate the diff! Thanks for the comment, you guided me to this great idea! -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 0/7] Modify lguest32 to make room for lguest64
Hi all, I've been working on lguest64 and in order to do this, I had to move a lot of the i386 specific out of the way. Well, the lguest64 port is still not ready to display, but before Rusty makes too many changes I would like this in upstream so I don't have to keep repeating my changes :-) So this patch series moves lguest32 out of the way for other archs. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 2/7] Added generic lg.h in lguest directory.
Add a generic lg.h file to call the architecture specific one. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/lguest/lg.h |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h new file mode 100644 index 000..4c4356e --- /dev/null +++ b/drivers/lguest/lg.h @@ -0,0 +1,3 @@ +#ifdef CONFIG_X86_32 +#include "i386/lg.h" +#endif -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 5/7] Change lguest launcher to use asm generic include
Have the lguest launcher include e820.h via asm/e820.h instead of explicitly saying i386. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- Documentation/lguest/lguest.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index f791840..eb7a01c 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -46,7 +46,7 @@ typedef uint32_t u32; typedef uint16_t u16; typedef uint8_t u8; #include "../../include/linux/lguest_launcher.h" -#include "../../include/asm-i386/e820.h" +#include "../../include/asm/e820.h" /*:*/ #define PAGE_PRESENT 0x7 /* Present, RW, Execute */ -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 7/7] Move lguest_dma_info into generic lg.h
The lguest_dma_info is also generic across architectures. Move it to the generic lg.h Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/lguest/i386/lg.h | 11 --- drivers/lguest/lg.h | 11 +++ 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h index c5ea14c..ae83bd2 100644 --- a/drivers/lguest/i386/lg.h +++ b/drivers/lguest/i386/lg.h @@ -49,17 +49,6 @@ int init_pagetables(struct page **switcher_page, unsigned int pages); #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00}) #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300}) -struct lguest_dma_info -{ - struct list_head list; - union futex_key key; - unsigned long dmas; - u16 next_dma; - u16 num_dmas; - u16 guestid; - u8 interrupt; /* 0 when not registered */ -}; - /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen. He * reviewed the original code which used "u32" for all page table entries, and * insisted that it would be far clearer with explicit typing. I thought it diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h index 3147bf6..6057e3c 100644 --- a/drivers/lguest/lg.h +++ b/drivers/lguest/lg.h @@ -12,6 +12,17 @@ void release_all_dma(struct lguest *lg); unsigned long get_dma_buffer(struct lguest *lg, unsigned long key, unsigned long *interrupt); +struct lguest_dma_info +{ + struct list_head list; + union futex_key key; + unsigned long dmas; + u16 next_dma; + u16 num_dmas; + u16 guestid; + u8 interrupt; /* 0 when not registered */ +}; + #ifdef CONFIG_X86_32 #include "i386/lg.h" #endif -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 1/7] Move lg.h to the i386 specific lguest directory
Make a i386 directory in the lguest directory, and move the lg.h into it. This will clear the way for other archs to have their own lg.h Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/lguest/i386/lg.h | 300 ++ drivers/lguest/lg.h | 300 -- 2 files changed, 300 insertions(+), 300 deletions(-) diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h new file mode 100644 index 000..64f0abe --- /dev/null +++ b/drivers/lguest/i386/lg.h @@ -0,0 +1,300 @@ +#ifndef _LGUEST_H +#define _LGUEST_H + +#include + +#define GDT_ENTRY_LGUEST_CS10 +#define GDT_ENTRY_LGUEST_DS11 +#define LGUEST_CS (GDT_ENTRY_LGUEST_CS * 8) +#define LGUEST_DS (GDT_ENTRY_LGUEST_DS * 8) + +#ifndef __ASSEMBLY__ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "irq_vectors.h" + +#define GUEST_PL 1 + +struct lguest_regs +{ + /* Manually saved part. */ + unsigned long ebx, ecx, edx; + unsigned long esi, edi, ebp; + unsigned long gs; + unsigned long eax; + unsigned long fs, ds, es; + unsigned long trapnum, errcode; + /* Trap pushed part */ + unsigned long eip; + unsigned long cs; + unsigned long eflags; + unsigned long esp; + unsigned long ss; +}; + +void free_pagetables(void); +int init_pagetables(struct page **switcher_page, unsigned int pages); + +/* Full 4G segment descriptors, suitable for CS and DS. */ +#define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00}) +#define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300}) + +struct lguest_dma_info +{ + struct list_head list; + union futex_key key; + unsigned long dmas; + u16 next_dma; + u16 num_dmas; + u16 guestid; + u8 interrupt; /* 0 when not registered */ +}; + +/*H:310 The page-table code owes a great debt of gratitude to Andi Kleen. He + * reviewed the original code which used "u32" for all page table entries, and + * insisted that it would be far clearer with explicit typing. I thought it + * was overkill, but he was right: it is much clearer than it was before. + * + * We have separate types for the Guest's ptes & pgds and the shadow ptes & + * pgds. There's already a Linux type for these (pte_t and pgd_t) but they + * change depending on kernel config options (PAE). */ + +/* Each entry is identical: lower 12 bits of flags and upper 20 bits for the + * "page frame number" (0 == first physical page, etc). They are different + * types so the compiler will warn us if we mix them improperly. */ +typedef union { + struct { unsigned flags:12, pfn:20; }; + struct { unsigned long val; } raw; +} spgd_t; +typedef union { + struct { unsigned flags:12, pfn:20; }; + struct { unsigned long val; } raw; +} spte_t; +typedef union { + struct { unsigned flags:12, pfn:20; }; + struct { unsigned long val; } raw; +} gpgd_t; +typedef union { + struct { unsigned flags:12, pfn:20; }; + struct { unsigned long val; } raw; +} gpte_t; + +/* We have two convenient macros to convert a "raw" value as handed to us by + * the Guest into the correct Guest PGD or PTE type. */ +#define mkgpte(_val) ((gpte_t){.raw.val = _val}) +#define mkgpgd(_val) ((gpgd_t){.raw.val = _val}) +/*:*/ + +struct pgdir +{ + unsigned long cr3; + spgd_t *pgdir; +}; + +/* This is a guest-specific page (mapped ro) into the guest. */ +struct lguest_ro_state +{ + /* Host information we need to restore when we switch back. */ + u32 host_cr3; + struct Xgt_desc_struct host_idt_desc; + struct Xgt_desc_struct host_gdt_desc; + u32 host_sp; + + /* Fields which are used when guest is running. */ + struct Xgt_desc_struct guest_idt_desc; + struct Xgt_desc_struct guest_gdt_desc; + struct i386_hw_tss guest_tss; + struct desc_struct guest_idt[IDT_ENTRIES]; + struct desc_struct guest_gdt[GDT_ENTRIES]; +}; + +/* We have two pages shared with guests, per cpu. */ +struct lguest_pages +{ + /* This is the stack page mapped rw in guest */ + char spare[PAGE_SIZE - sizeof(struct lguest_regs)]; + struct lguest_regs regs; + + /* This is the host state & guest descriptor page, ro in guest */ + struct lguest_ro_state state; +} __attribute__((aligned(PAGE_SIZE))); + +#define CHANGED_IDT1 +#define CHANGED_GDT2 +#define CHANGED_GDT_TLS4 /* Actually a subset of CHANGED_GDT */ +#define CHANGED_ALL3 + +/* The private info the thread maintains about the guest. */ +struct lguest +{ + /* At end of a page shared mapped over lguest_pages in guest. */ + unsigned long regs_page; + struct lguest_regs *regs; + str
[PATCH 4/7] Moved the io struct up to the generic lg.h
Move the io struct into the lg.h file since the io.c is generic to other architectures. Also added a proper ifdef for the generic lg.h. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/lguest/i386/lg.h | 11 ++- drivers/lguest/lg.h | 16 2 files changed, 18 insertions(+), 9 deletions(-) diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h index 64f0abe..c5ea14c 100644 --- a/drivers/lguest/i386/lg.h +++ b/drivers/lguest/i386/lg.h @@ -20,6 +20,8 @@ #include #include #include "irq_vectors.h" +/* some files include this, some include the parent */ +#include "../lg.h" #define GUEST_PL 1 @@ -245,15 +247,6 @@ void pin_page(struct lguest *lg, unsigned long vaddr); int lguest_device_init(void); void lguest_device_remove(void); -/* io.c: */ -void lguest_io_init(void); -int bind_dma(struct lguest *lg, -unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt); -void send_dma(struct lguest *info, unsigned long key, unsigned long udma); -void release_all_dma(struct lguest *lg); -unsigned long get_dma_buffer(struct lguest *lg, unsigned long key, -unsigned long *interrupt); - /* hypercalls.c: */ void do_hypercalls(struct lguest *lg); void write_timestamp(struct lguest *lg); diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h index 4c4356e..3147bf6 100644 --- a/drivers/lguest/lg.h +++ b/drivers/lguest/lg.h @@ -1,3 +1,19 @@ +#ifndef _LGUEST_LG_GENERIC +#define _LGUEST_LG_GENERIC + +struct lguest; + +/* io.c: */ +void lguest_io_init(void); +int bind_dma(struct lguest *lg, +unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt); +void send_dma(struct lguest *info, unsigned long key, unsigned long udma); +void release_all_dma(struct lguest *lg); +unsigned long get_dma_buffer(struct lguest *lg, unsigned long key, +unsigned long *interrupt); + #ifdef CONFIG_X86_32 #include "i386/lg.h" #endif + +#endif /* _LGUEST_LG_GENERIC */ -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 6/7] Remove __pa() use in hvc_lguest
The hvc_lguest uses __pa in the const initialization. In some architectures, __pa() is not constant so this fails in compiles. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/char/hvc_lguest.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c index feeccba..c7d7d4c 100644 --- a/drivers/char/hvc_lguest.c +++ b/drivers/char/hvc_lguest.c @@ -42,7 +42,6 @@ * use of physical address for the buffer itself. */ static char inbuf[256]; static struct lguest_dma cons_input = { .used_len = 0, - .addr[0] = __pa(inbuf), .len[0] = sizeof(inbuf), .len[1] = 0 }; @@ -114,6 +113,13 @@ static struct hv_ops lguest_cons = { * (0), and the struct hv_ops containing the put_chars() function. */ static int __init cons_init(void) { + /* +* We can't initialize using __pa in const declarations, +* since __pa(inbuf) does not evaluate into a constant on +* all architectures (namely x86_64). +*/ + cons_input.addr[0] = __pa(inbuf); + if (strcmp(paravirt_ops.name, "lguest") != 0) return 0; -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/7] Added generic lg.h in lguest directory.
-- On Thu, 9 Aug 2007, Stephen Rothwell wrote: > On Wed, 08 Aug 2007 20:32:13 -0400 Steven Rostedt <[EMAIL PROTECTED]> > wrote: > > > > Add a generic lg.h file to call the architecture specific one. > > Please combine this with the previous patch so that the tree will build > at each step. Yeah, I wanted to do that, but git got confused when I did that, and did what I didn't want. That was, created a new file in the i386 directory, and diffed the one in this directory. I'm really want to keep git knowing the history of the file on moving. I could just simply cat the two files instead. Would that be acceptible? -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/7] Moved the io struct up to the generic lg.h
-- On Thu, 9 Aug 2007, Stephen Rothwell wrote: > On Wed, 08 Aug 2007 20:32:15 -0400 Steven Rostedt <[EMAIL PROTECTED]> > wrote: > > > > diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h > > index 64f0abe..c5ea14c 100644 > > --- a/drivers/lguest/i386/lg.h > > +++ b/drivers/lguest/i386/lg.h > > @@ -20,6 +20,8 @@ > > #include > > #include > > #include "irq_vectors.h" > > +/* some files include this, some include the parent */ > > +#include "../lg.h" > > Surely everything should just include the parent which allows for easier > consolidation over time. (as you are doing here) Yeah, that sounds better. This was my attempt at not modifing the i386/ stuff too much. But I can have those files change to do a "../lg.h" include. Again, doing the move plus change to keep things building. Hmm, I guess git can handle that. > > > --- a/drivers/lguest/lg.h > > +++ b/drivers/lguest/lg.h > > @@ -1,3 +1,19 @@ > > +#ifndef _LGUEST_LG_GENERIC > > +#define _LGUEST_LG_GENERIC > > Maybe _LGUEST_LG_H? and the arch ones could use _LGUEST_I386_LG_H etc. Noted. Will do in the next release. Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 3/5 -v2] Remove __pa() use in hvc_lguest
The hvc_lguest uses __pa in the const initialization. In some architectures, __pa() is not constant so this fails in compiles. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- drivers/char/hvc_lguest.c |8 +++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c index feeccba..c7d7d4c 100644 --- a/drivers/char/hvc_lguest.c +++ b/drivers/char/hvc_lguest.c @@ -42,7 +42,6 @@ * use of physical address for the buffer itself. */ static char inbuf[256]; static struct lguest_dma cons_input = { .used_len = 0, - .addr[0] = __pa(inbuf), .len[0] = sizeof(inbuf), .len[1] = 0 }; @@ -114,6 +113,13 @@ static struct hv_ops lguest_cons = { * (0), and the struct hv_ops containing the put_chars() function. */ static int __init cons_init(void) { + /* +* We can't initialize using __pa in const declarations, +* since __pa(inbuf) does not evaluate into a constant on +* all architectures (namely x86_64). +*/ + cons_input.addr[0] = __pa(inbuf); + if (strcmp(paravirt_ops.name, "lguest") != 0) return 0; -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 0/5 -v2] Modify lguest32 to make room for lguest64 (version 2)
[ Changes since last version. - Move lg.h to include/asm instead (suggested by Rusty Russel) - All steps of the series compiles (suggested by Stephen Rothwell) - Better ifdef header naming (suggested by Stephen Rothwell) - Added Andi Kleen to CC (forgot to on V1) ] Hi all, I've been working on lguest64 and in order to do this, I had to move a lot of the i386 specific out of the way. Well, the lguest64 port is still not ready to display, but before Rusty makes too many changes I would like this in upstream so I don't have to keep repeating my changes :-) So this patch series moves lguest32 out of the way for other archs. -- Steve -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 4/5 -v2] Added generic lg.h in the include/linux directory
Add a generic lg.h that can be included from lguest files. This file will hold the data that can be shared across archs. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/i386/kernel/asm-offsets.c|2 +- drivers/lguest/core.c |2 +- drivers/lguest/hypercalls.c |2 +- drivers/lguest/interrupts_and_traps.c |2 +- drivers/lguest/io.c |2 +- drivers/lguest/lguest_user.c |2 +- drivers/lguest/page_tables.c |2 +- drivers/lguest/segments.c |2 +- include/asm-i386/lg.h | 24 ++-- include/linux/lg.h| 30 ++ 10 files changed, 40 insertions(+), 30 deletions(-) diff --git a/arch/i386/kernel/asm-offsets.c b/arch/i386/kernel/asm-offsets.c index e426e51..d6d17e5 100644 --- a/arch/i386/kernel/asm-offsets.c +++ b/arch/i386/kernel/asm-offsets.c @@ -21,7 +21,7 @@ #ifdef CONFIG_LGUEST_GUEST #include -#include +#include #endif #define DEFINE(sym, val) \ diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c index e5f441d..5bcd237 100644 --- a/drivers/lguest/core.c +++ b/drivers/lguest/core.c @@ -19,7 +19,7 @@ #include #include #include -#include +#include /* Found in switcher.S */ extern char start_switcher_text[], end_switcher_text[], switch_to_guest[]; diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c index 0ddb133..3c0d41e 100644 --- a/drivers/lguest/hypercalls.c +++ b/drivers/lguest/hypercalls.c @@ -26,7 +26,7 @@ #include #include #include -#include +#include /*H:120 This is the core hypercall routine: where the Guest gets what it * wants. Or gets killed. Or, in the case of LHCALL_CRASH, both. diff --git a/drivers/lguest/interrupts_and_traps.c b/drivers/lguest/interrupts_and_traps.c index 899e681..3daf3e2 100644 --- a/drivers/lguest/interrupts_and_traps.c +++ b/drivers/lguest/interrupts_and_traps.c @@ -12,7 +12,7 @@ * them first, so we also have a way of "reflecting" them into the Guest as if * they had been delivered to it directly. :*/ #include -#include +#include /* The address of the interrupt handler is split into two bits: */ static unsigned long idt_address(u32 lo, u32 hi) diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c index 091a46d..d196121 100644 --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -25,7 +25,7 @@ #include #include #include -#include +#include /*L:300 * I/O diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c index a29d85b..2f477df 100644 --- a/drivers/lguest/lguest_user.c +++ b/drivers/lguest/lguest_user.c @@ -7,7 +7,7 @@ #include #include #include -#include +#include /*L:030 setup_regs() doesn't really belong in this file, but it gives us an * early glimpse deeper into the Host so it's worth having here. diff --git a/drivers/lguest/page_tables.c b/drivers/lguest/page_tables.c index db3dd8b..812d56a 100644 --- a/drivers/lguest/page_tables.c +++ b/drivers/lguest/page_tables.c @@ -13,7 +13,7 @@ #include #include #include -#include +#include /*M:008 We hold reference to pages, which prevents them from being swapped. * It'd be nice to have a callback in the "struct mm_struct" when Linux wants diff --git a/drivers/lguest/segments.c b/drivers/lguest/segments.c index 1d16d5d..351b13f 100644 --- a/drivers/lguest/segments.c +++ b/drivers/lguest/segments.c @@ -9,7 +9,7 @@ * In these modern times, the segment handling code consists of simple sanity * checks, and the worst you'll experience reading this code is butterfly-rash * from frolicking through its parklike serenity. :*/ -#include +#include /*H:600 * We've almost completed the Host; there's just one file to go! diff --git a/include/asm-i386/lg.h b/include/asm-i386/lg.h index 64f0abe..c8e01c3 100644 --- a/include/asm-i386/lg.h +++ b/include/asm-i386/lg.h @@ -1,5 +1,5 @@ -#ifndef _LGUEST_H -#define _LGUEST_H +#ifndef _I386_LGUEST_H +#define _I386_LGUEST_H #include @@ -47,17 +47,6 @@ int init_pagetables(struct page **switcher_page, unsigned int pages); #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00}) #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300}) -struct lguest_dma_info -{ - struct list_head list; - union futex_key key; - unsigned long dmas; - u16 next_dma; - u16 num_dmas; - u16 guestid; - u8 interrupt; /* 0 when not registered */ -}; - /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen. He * reviewed the original code which used "u32" for all page table entries, and * insisted that it would be far clearer with explicit typing. I thought it @@ -245,15 +234,6 @@ void pin_page(struct lguest *lg, unsigned long vaddr); int lguest_device_init(void); void lguest_device_remove(void); -/* io
[PATCH 2/5 -v2] Change lguest launcher to use asm generic include
Have the lguest launcher include e820.h via asm/e820.h instead of explicitly saying i386. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- Documentation/lguest/lguest.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index f791840..eb7a01c 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -46,7 +46,7 @@ typedef uint32_t u32; typedef uint16_t u16; typedef uint8_t u8; #include "../../include/linux/lguest_launcher.h" -#include "../../include/asm-i386/e820.h" +#include "../../include/asm/e820.h" /*:*/ #define PAGE_PRESENT 0x7 /* Present, RW, Execute */ -- 1.4.4.4 -- ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 1/5 -v2] move lg.h to include/asm-
The lg.h file has lots of i386 specific information. Move it over to the asm directory where it belongs. Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> --- arch/i386/kernel/asm-offsets.c|2 +- drivers/lguest/core.c |2 +- drivers/lguest/hypercalls.c |2 +- drivers/lguest/interrupts_and_traps.c |2 +- drivers/lguest/io.c |2 +- drivers/lguest/lg.h | 300 - drivers/lguest/lguest_user.c |2 +- drivers/lguest/page_tables.c |2 +- drivers/lguest/segments.c |2 +- drivers/lguest/switcher.S |2 +- include/asm-i386/lg.h | 300 + 11 files changed, 309 insertions(+), 309 deletions(-) diff --git a/arch/i386/kernel/asm-offsets.c b/arch/i386/kernel/asm-offsets.c index 7288ac8..e426e51 100644 --- a/arch/i386/kernel/asm-offsets.c +++ b/arch/i386/kernel/asm-offsets.c @@ -21,7 +21,7 @@ #ifdef CONFIG_LGUEST_GUEST #include -#include "../../../drivers/lguest/lg.h" +#include #endif #define DEFINE(sym, val) \ diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c index 0a46e88..e5f441d 100644 --- a/drivers/lguest/core.c +++ b/drivers/lguest/core.c @@ -19,7 +19,7 @@ #include #include #include -#include "lg.h" +#include /* Found in switcher.S */ extern char start_switcher_text[], end_switcher_text[], switch_to_guest[]; diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c index db6caac..0ddb133 100644 --- a/drivers/lguest/hypercalls.c +++ b/drivers/lguest/hypercalls.c @@ -26,7 +26,7 @@ #include #include #include -#include "lg.h" +#include /*H:120 This is the core hypercall routine: where the Guest gets what it * wants. Or gets killed. Or, in the case of LHCALL_CRASH, both. diff --git a/drivers/lguest/interrupts_and_traps.c b/drivers/lguest/interrupts_and_traps.c index 49787e9..899e681 100644 --- a/drivers/lguest/interrupts_and_traps.c +++ b/drivers/lguest/interrupts_and_traps.c @@ -12,7 +12,7 @@ * them first, so we also have a way of "reflecting" them into the Guest as if * they had been delivered to it directly. :*/ #include -#include "lg.h" +#include /* The address of the interrupt handler is split into two bits: */ static unsigned long idt_address(u32 lo, u32 hi) diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c index ea68613..091a46d 100644 --- a/drivers/lguest/io.c +++ b/drivers/lguest/io.c @@ -25,7 +25,7 @@ #include #include #include -#include "lg.h" +#include /*L:300 * I/O diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h deleted file mode 100644 index 64f0abe..000 --- a/drivers/lguest/lg.h +++ /dev/null @@ -1,300 +0,0 @@ -#ifndef _LGUEST_H -#define _LGUEST_H - -#include - -#define GDT_ENTRY_LGUEST_CS10 -#define GDT_ENTRY_LGUEST_DS11 -#define LGUEST_CS (GDT_ENTRY_LGUEST_CS * 8) -#define LGUEST_DS (GDT_ENTRY_LGUEST_DS * 8) - -#ifndef __ASSEMBLY__ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include "irq_vectors.h" - -#define GUEST_PL 1 - -struct lguest_regs -{ - /* Manually saved part. */ - unsigned long ebx, ecx, edx; - unsigned long esi, edi, ebp; - unsigned long gs; - unsigned long eax; - unsigned long fs, ds, es; - unsigned long trapnum, errcode; - /* Trap pushed part */ - unsigned long eip; - unsigned long cs; - unsigned long eflags; - unsigned long esp; - unsigned long ss; -}; - -void free_pagetables(void); -int init_pagetables(struct page **switcher_page, unsigned int pages); - -/* Full 4G segment descriptors, suitable for CS and DS. */ -#define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00}) -#define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300}) - -struct lguest_dma_info -{ - struct list_head list; - union futex_key key; - unsigned long dmas; - u16 next_dma; - u16 num_dmas; - u16 guestid; - u8 interrupt; /* 0 when not registered */ -}; - -/*H:310 The page-table code owes a great debt of gratitude to Andi Kleen. He - * reviewed the original code which used "u32" for all page table entries, and - * insisted that it would be far clearer with explicit typing. I thought it - * was overkill, but he was right: it is much clearer than it was before. - * - * We have separate types for the Guest's ptes & pgds and the shadow ptes & - * pgds. There's already a Linux type for these (pte_t and pgd_t) but they - * change depending on kernel config options (PAE). */ - -/* Each entry is identical: lower 12 bits of flags and upper 20 bits for the - * "page frame number" (0 == first physical page, etc). They are different - * types so t
Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S
-- On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote: > Steven Rostedt wrote: > > /* > > * x86 arch doesn't have an easy way to find out where > > * gs is located. So we need to read the MSR. But first > > * we need to save off the rcx, rax and rdx. > > > Why don't you store it in gs? movq %gs:my_gs_base, %rax? Because it can't be trusted. After the swapgs, we are pointing to the RW section of the HV. But by running the guest kernel in ring 1, we have no protection from the guest writing into that area too. So we can't put anything in that section and expect it to be safe after jumping to guest code. The only trusted values, is jumping in after getting there by an interrupt, where the hardware places the values onto the stack. But with Andi's comments, I realized I can point the gs pointer to the RO area. And make a constant offset that will point up into the RW area, so we could save the stack and replace it. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/5 -v2] Added generic lg.h in the include/linux directory
-- On Thu, 9 Aug 2007, Stephen Rothwell wrote: > Hi Steven, > > I am just being pedantic here (and I should have beaten up on Rusty > before now ... :-) > > On Thu, 09 Aug 2007 00:36:30 -0400 Steven Rostedt <[EMAIL PROTECTED]> > wrote: > > > > --- a/include/asm-i386/lg.h > > +++ b/include/asm-i386/lg.h > > @@ -1,5 +1,5 @@ > > -#ifndef _LGUEST_H > > -#define _LGUEST_H > > +#ifndef _I386_LGUEST_H > > +#define _I386_LGUEST_H > > Maybe _ASM_LG_H or _I386_LG_H ? And there is a comment on the #endif at > the end of the file that should match. > > > +++ b/include/linux/lg.h > > @@ -0,0 +1,30 @@ > > +#ifndef _LINUX_LGUEST_H > > +#define _LINUX_LGUEST_H > > Maybe _LINUX_LG_H ? > > Also (this is for Rusty) include/linux/lguest.h is protected by > _ASM_LGUEST_H ... Hah! in my lguest64 tree, I fixed all these. But I forgot to do the changes when making the patches for upstream. I'll make note and fix them in my next patch series (unless Rusty can pull my stuff in now and then I can apply on top of them ;-) Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 5/5 -v2] Move the lguest files that are i386 specific
-- On Thu, 9 Aug 2007, Stephen Rothwell wrote: > Hi Steven, > > On Thu, 09 Aug 2007 00:36:31 -0400 Steven Rostedt <[EMAIL PROTECTED]> > wrote: > > > > Well, some may be merged with x86_64 later, but for now we move them > > out of the way. Later on we can start seeing how we can combine > > some of these files to be arch generic. > > > > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]> > > --- > > drivers/lguest/Makefile|7 +- > > drivers/lguest/core.c | 771 --- > > drivers/lguest/hypercalls.c| 300 > > drivers/lguest/i386/Makefile |7 + > > drivers/lguest/i386/core.c | 771 +++ > > drivers/lguest/i386/hypercalls.c | 300 > > drivers/lguest/i386/interrupts_and_traps.c | 440 +++ > > drivers/lguest/i386/lguest.c | 1097 > > > > drivers/lguest/i386/lguest_asm.S | 93 +++ > > drivers/lguest/i386/lguest_user.c | 382 ++ > > drivers/lguest/i386/page_tables.c | 680 + > > drivers/lguest/i386/segments.c | 229 ++ > > drivers/lguest/i386/switcher.S | 347 + > > drivers/lguest/interrupts_and_traps.c | 440 --- > > drivers/lguest/lguest.c| 1097 > > > > drivers/lguest/lguest_asm.S| 93 --- > > drivers/lguest/lguest_user.c | 382 -- > > drivers/lguest/page_tables.c | 680 - > > drivers/lguest/segments.c | 229 -- > > drivers/lguest/switcher.S | 347 - > > 20 files changed, 4348 insertions(+), 4344 deletions(-) > > if you add -C or -M to "git diff --stat" this looks much better: > > $ git diff --stat -C HEAD^ > drivers/lguest/Makefile |7 ++- > drivers/lguest/i386/Makefile |7 +++ > drivers/lguest/{ => i386}/core.c |0 > drivers/lguest/{ => i386}/hypercalls.c |0 > drivers/lguest/{ => i386}/interrupts_and_traps.c |0 > drivers/lguest/{ => i386}/lguest.c |0 > drivers/lguest/{ => i386}/lguest_asm.S |0 > drivers/lguest/{ => i386}/lguest_user.c |0 > drivers/lguest/{ => i386}/page_tables.c |0 > drivers/lguest/{ => i386}/segments.c |0 > drivers/lguest/{ => i386}/switcher.S |0 > 11 files changed, 9 insertions(+), 5 deletions(-) > /me should read the man pages. Thanks, that's the options I was looking for. I was using the git-format-patch, and thought it would do that automatically. I definitely add those options the next time around! Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64
-- On Thu, 9 Aug 2007, Jeremy Fitzhardinge wrote: > Glauber de Oliveira Costa wrote: > > On 8/9/07, Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote: > > > >>> Does it really matter? > >>> > >>> > >> Well, yes, if alignment is an issue. > >> > > Of course, But the question rises from the context that they are both > > together at the beginning. So they are not making anybody non-aligned. > > Then the question: Why would putting it in the end be different to > > putting them _together_, aligned at the beginning ? > > > > Well, the point is that if you add new ones then alignment may be an > issue. Putting them at the end (with a comment explaining why they're > there) will make it more robust. Though splitting them into their own > sub-structure would probably be better. Glauber, I was thinking of putting them at the end too, and that would make it all work better. But I didn't mention it because I was in the mindset of "well i386 has that there, we should too" :-( > > Hm. So x86-64 doesn't make 64-bit pointers be 64-bit aligned? yeah, it usually does. But it's one of those paranoid things, where you want it to still work even if someone later on throws an __attribute__((packed)) in on paravirt ops ;-) -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 4/7] Moved the io struct up to the generic lg.h
-- On Thu, 9 Aug 2007, Jes Sorensen wrote: > Stephen Rothwell wrote: > > On Wed, 08 Aug 2007 20:32:15 -0400 Steven Rostedt <[EMAIL PROTECTED]> > > wrote: > >> diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h > >> index 64f0abe..c5ea14c 100644 > >> --- a/drivers/lguest/i386/lg.h > >> +++ b/drivers/lguest/i386/lg.h > >> @@ -20,6 +20,8 @@ > >> #include > >> #include > >> #include "irq_vectors.h" > >> +/* some files include this, some include the parent */ > >> +#include "../lg.h" > > > > Surely everything should just include the parent which allows for easier > > consolidation over time. (as you are doing here) > > Whats wrong with introducing an include/asm-/lg.h and have the > parent do #include ? Seems much cleaner to me. Seems cleaner to me to! I think I'll do that. Oh, wait! I did. ;-) -- Steve P.S. Sorry, couldn't resist. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Lguest] Introducing paravirt_ops for x86_64
-- On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote: > > At the moment Xen can't support guests with 2M pages. In 32-bit this > isn't a huge problem, since the kernel doesn't assume it can map itself > with 2M pages. But I think the 64-bit kernel assumes 2M pages are > always available for mapping the kernel. I don't know how pervasive > this assumption is, but it would be nice to parameterize it in pv-ops. > I got 2M pages working in lguest64, but it was a cause of a lot of pain. By removing support for 2M pages in guests, that might clean up a lot of lguest64 shadow paging as well. The only advantage I see for 2M pages or a guest, is that it saves the guest from having extra page tables. The host still needs the map the 2M to 4K, so it's not an advantage there. It doesn't speed anything up, so I'd be all for scrapping 2M pages for guests. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64
-- On Thu, 9 Aug 2007, Andi Kleen wrote: > > > This has to match the normal C calling convention though, doesn't it? > > Native cli/sti/save/restore_flags are all only assembly and can be easily > (in fact more easily than in C) written as pure assembler functions. Then > you can use whatever calling convention you want. I agree. Should we make a paravirt_ops_asm.S file that can implement these native funcions, and so we can get rid of the C functions only doing asm? > > While some paravirt implementations may have more complicated implementations > i guess it's still a reasonable requirement to make them simple enough > in pure assembler. If not they can use a trampoline, but that's hopefully > not needed. It works for lguest64. I'm sure it should be no problem with other HVs. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 18/25][V3] turn priviled operations into macros in entry.S
-- On Wed, 15 Aug 2007, Glauber de Oliveira Costa wrote: > @@ -566,10 +596,15 @@ retint_restore_args: > restore_args: > RESTORE_ARGS 0,8,0 > iret_label: > +#ifdef CONFIG_PARAVIRT > + INTERRUPT_RETURN > +#endif > +.globl do_iretq; Is the .globl really required here? I would think that the reference in the fixup section would be resolved with in the compiling of this object. -- Steve > +do_iretq: > iretq > > .section __ex_table,"a" > - .quad iret_label,bad_iret > + .quad do_iretq, bad_iret > .previous > .section .fixup,"ax" > /* force a signal here? this matches i386 behaviour */ > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[ANNOUNCE] Lguest64 - fatter puppies!
This is a formal announcement of Lguest64. Most are aware of the little puppies (lguest32, or simply lguest, or in some circles "rustyvisor"). But this time the puppies ate a bit too much. No more lean and mean puppies, now we got big fat lazy ones. Running on the hardware that's too lazy to do full virutalization. Yes, lguest now runs on x86_64! As you know, puppies are young, and so is lguest64. And like most new born puppies, lguest64 might crap on your floor. But if it's in a good mood, it will make it to the door and give you a login prompt (or maybe even ssh into it and run firefox! http://rostedt.homelinux.com/pics/firefox-on-lguest64.png ). lguest64 is still going through a bit of growth pains, but its getting better. It's to a point that we are not that afraid to bring it to the dog show. So for those that love puppies, and want even bigger ones, you can download the code at: git://git.et.redhat.com/kernel-lguest-64.git TODO: Many things! but here's what's on the near future list. - SMP first for host then for guest. We had it working a while ago, but we decided to update to catchup to lguest32, and we broke it. - get closer to lguest32. Rusty just likes to annoy us ;-) - optimization, optimization, optimization (did I say it's still slow?) - oh, and hopefully to get it merged! well, there's lots more to do, but we can think about it later. Want to help? test it out, and let us know how big a mess it made on your floor. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] input: Fix interrupt enable in i8042_ctr when enabling interrupt fails
-- On Mon, 10 Sep 2007, Markus Armbruster wrote: > > I believe this possible, but unlikely (perhaps not so unlikely on > virtual machines). Scenarios involve enable succeeding the first > time, failing the second time, and succeeding the third time. I can > provide details, but the point I'd like to make is not that this is > broken (although it is, strictly speaking), but that it is not > obviously correct where it easily could be: just clear the interrupt > enable bits when writing them to the hardware failed, like the old > code did. > I also want to stress that this is more of a clean up for "technically correct" code than a bug fix. This bug probably would never happen on baremetal unless it was running on broken hardware. BUT!!! With more and more systems going to a virtual environment, having a bug or some other anomaly can trigger the error that this patch prevents. The patch will also trigger a print in the case of running on a buggy virtual machine, which would help out the developers of that virtual machine to fix their code. Please apply. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 7/24] consolidate msr.h
On Mon, 19 Nov 2007, Bastian Blank wrote: > On Fri, Nov 09, 2007 at 04:42:48PM -0200, Glauber de Oliveira Costa wrote: > > - wrmsrl(MSR_CSTAR, ia32_cstar_target); > > + wrmsrl(MSR_CSTAR, (u64)ia32_cstar_target); > > Hmm, why do you add explicit casts? The compiler should convert that > correctly on its own. > > > +static inline void wrmsrl(unsigned int msr, unsigned long long val) > > Hmm, long long is 64 bit on all x86, but why not use explicit u64 to > show that? (quick reply) With PVOPS on it gives compiler warnings without that explict cast. Without looking at the code, IIRC with non-PVOPS it is a macro directly into asm, so it didn't matter what the cast was. But with PVOPS as a function, it gave compiler warnings. Take it out and try compiling it for both i386 and x86_64. One of them gave warnings. But maybe it's not a problem now. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 7/24] consolidate msr.h
On Tue, 20 Nov 2007, Ingo Molnar wrote: > * Steven Rostedt <[EMAIL PROTECTED]> wrote: > > > With PVOPS on it gives compiler warnings without that explict cast. > > Without looking at the code, IIRC with non-PVOPS it is a macro > > directly into asm, so it didn't matter what the cast was. But with > > PVOPS as a function, it gave compiler warnings. > > > > Take it out and try compiling it for both i386 and x86_64. One of them > > gave warnings. But maybe it's not a problem now. > > i dont think there's ever any true need (and good cause) to force > integer type casts like that at the callee site. I guess the problem is that we converted a macro to a function, where the macro did no type checking. Now we need to pick between integers and pointers. Some places uses intergers in wrmsrl and some use pointers. So changing this to a typechecking protocol is not going to be nice. Looking at the current code now, we have this: checking_wrmsrl(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS); checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL); checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target); wrmsrl(MSR_CSTAR, ia32_cstar_target); A typecast is already used in that same area. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 7/24] consolidate msr.h
On Tue, 20 Nov 2007, Ingo Molnar wrote: > > i dont think there's ever any true need (and good cause) to force > integer type casts like that at the callee site. Unless you mean we should do something like this: static inline void __wrmsrl(unsigned int msr, unsigned long long val); #define wrmsr(msr, val) __wrmsrl(msr, (unsigned long long)var) -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 02/16] adapt lguest launcher to per-cpuness
On Wed, 26 Dec 2007, Rusty Russell wrote: > On Friday 21 December 2007 00:33:42 Glauber de Oliveira Costa wrote: > > + if (!vcpu_id) { > > + /* > > +* Service input, then unset the BREAK to > > +* release the Waker. > > +*/ > > + handle_input(lguest_fd); > > + if (pwrite(lguest_fd, args, sizeof(args), 0) < 0) > > + err(1, "Resetting break"); > > + } > > I hate winged comments: those two extra lines, wasted! > For multiple lines, wings are a Good Thing (TM). Otherwise it looks sloppy. /* Service input, then unset the BREAK to * release the Waker. */ extra asterisk! ok then /* Service input, then unset the BREAK to release the Waker. */ Yuck, that "release" looks like it can be code, especially with parsers that look for comments that start with some sort of /* or * Those little wings do IMHO make the code look nicer. I know in the Linux community, my weight compared to you is a chihuahua compared to a St. Bernard. But in this case, I believe others think that my collar is prettier than yours. ;-) Some one buy Rusty a bigger hard-drive to store those extra lines. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 01/11] Add basic support for gcc profiler instrumentation
[Added Chris Wright, Rusty and Virt list because they were involved with this issue before] On Thu, 3 Jan 2008, Ingo Molnar wrote: > > * Steven Rostedt <[EMAIL PROTECTED]> wrote: > > > +# function tracing might turn this off: > > +config REGPARM > > + bool > > + depends on !MCOUNT > > + default y > > are you sure -pg really needs this? Nope! Arnaldo and I only carried it because you had it ;-) > I just carried this along the years > and went the path of least resistence, but we should not be > reintroducing the !REGPARM build mode for the kernel. I'd not be > surprised if there were a few issues with REGPARM + mcount, but we have > to figure it out before merging ... Hmm, I know paravirt-ops had an issue with mcount in the RT tree. I can't remember the exact issues, but it did have something to do with the way parameters were passed in. Chris, do you remember what the issues were? I'm also thinking that this is only an i386 issue. Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 01/11] Add basic support for gcc profiler instrumentation
On Thu, 3 Jan 2008, Chris Wright wrote: > > Yes, paravirt ops have a well-specified calling convention (register > based). There was a cleanup that Andi did that caused the problem > because it removed all the "fastcall" annotations since -mregparm=3 > is now always on for i386. Since MCOUNT disables REGPARM the calling > convention changes (caller pushes to stack, callee expects register) > chaos ensues. I sent a patch to fix that quite some months back, but > it went stale and I neglected to update it. Would you like me to dig > it up refresh and resend? Chris, thanks for the refresher. I'm going to see if we can remove the REGPARM hack and change the way mcount does its calls. Maybe this will fix things for us. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/6] Staging: hv: Unify hyper-v device abstractions
Hi! Just an FYI, When sending out patch series, could you send out a [PATCH 0/6] first, and have all other patches a reply to that patch. Both git and quilt have ways to do this. The reason is that the linux kernel mailing list gets over 600 emails a day, and for those of us that skim through, we like to hit [make thread as read] when we are not interested in the thread or patch set. When people send out a set of patches that are all individual, this means I need to hit this command for every patch. Also do not think I'm just picking on you. I've complained to several people about doing this. One just recently sent out 21 patches. Thanks, -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: linux-next: Tree for July 25 (xen)
On Sat, 2011-08-06 at 11:22 -0700, Jeremy Fitzhardinge wrote: > On 08/05/2011 06:58 PM, Konrad Rzeszutek Wilk wrote: > > On Fri, Aug 05, 2011 at 11:14:23PM +0200, Ingo Molnar wrote: > >> * Ingo Molnar wrote: > >> > >>> * Randy Dunlap wrote: > >>> > On 08/04/11 15:40, Konrad Rzeszutek Wilk wrote: > > On Thu, Aug 04, 2011 at 06:32:59PM -0400, Konrad Rzeszutek Wilk wrote: > > These build failures are still triggering upstream: > > > > arch/x86/xen/trace.c:44:2: error: array index in initializer not > > of integer type > > arch/x86/xen/trace.c:44:2: error: (near initialization for > > ‘xen_hypercall_names’) > > arch/x86/xen/trace.c:45:1: error: ‘__HYPERVISOR_arch_4’ undeclared > > here (not in a function) > > arch/x86/xen/trace.c:45:2: error: array index in initializer not > > of integer type > > arch/x86/xen/trace.c:45:2: error: (near initialization for > > ‘xen_hypercall_names’) > Oh, that I haven't seen. Can you send me the .config for that please. > >>> You can't be trying very hard then. I see lots of these (but no, > >> Ah, I am getting it now. Thanks for reporting it. > > This should do the trick: > Acked-by: Randy Dunlap > >>> That patch did the trick here too: > >>> > >>> Acked-by: Ingo Molnar > >> Except that i'm still seeing the occasional build failure - see the > >> error log below. Config attached. > > Much appreciate for the report. I believe this fix (which I think hit > > linux-next yesterday?) should do that: > > > > commit 1e9ea2656b656edd3c8de98675bbc0340211b5bd > > Author: Jeremy Fitzhardinge > > Date: Wed Aug 3 09:43:44 2011 -0700 > > > > xen/tracing: it looks like we wanted CONFIG_FTRACE > > > > Apparently we wanted CONFIG_FTRACE rather the CONFIG_FUNCTION_TRACER. > > > > Reported-by: Sander Eikelenboom > > Tested-by: Sander Eikelenboom > > Signed-off-by: Jeremy Fitzhardinge > > Signed-off-by: Konrad Rzeszutek Wilk > > > > diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile > > index 45e94ac..3326204 100644 > > --- a/arch/x86/xen/Makefile > > +++ b/arch/x86/xen/Makefile > > @@ -15,7 +15,7 @@ obj-y := enlighten.o setup.o multicalls.o > > mmu.o irq.o \ > > grant-table.o suspend.o platform-pci-unplug.o \ > > p2m.o > > > > -obj-$(CONFIG_FUNCTION_TRACER) += trace.o > > +obj-$(CONFIG_FTRACE) += trace.o > > > > I'm not sure this is correct either. Maybe it should be > CONFIG_TRACEPOINTS? Steven? Actually, I believe the correct answer is: CONFIG_EVENT_TRACING -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 6/7] qemu: Implement virtio-pstore device
On Thu, 28 Jul 2016 14:39:53 +0900 Namhyung Kim wrote: > Well, I dont' know. As you know, the kernel oops dump is already sent > to serial device but it's rather slow. As I wrote in the cover > letter, enabling ftrace_dump_on_oops makes it even worse.. Also > pstore saves the (compressed) binary data so I thought it'd be better > to have a dedicated IO channel. BTW, I agree with this. It is better to have a quick way to grab the ftrace buffers when a crash happens, as serial is excruciatingly slow. Although, currently I still use kexec/kdump, but as Namhyung said, it depends on crash being up to date. I tend to be sending in updates every time I have to use it. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions
Hmm, I'm guessing this patch got lost. -- Steve On Wed, 25 May 2016 13:47:26 -0400 Steven Rostedt wrote: > Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up > after enabling function tracer. I asked him to bisect the functions within > available_filter_functions, which he did and it came down to three: > > _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64() > > It was found that this is only an issue when noreplace-paravirt is added to > the kernel command line. > > This means that those functions are most likely called within critical > sections of the funtion tracer, and must not be traced. > > In newer kenels _paravirt_nop() is defined within gcc asm(), and is no > longer an issue. But both _paravirt_ident_{32,64}() causes the following > splat when they are traced: > > mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054) > mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070) > mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054) > mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054) > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469] > Modules linked in: e1000e > CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513 > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 > 05/07/2012 > task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000 > RIP: 0010:[] [] > queued_spin_lock_slowpath+0x118/0x1a0 > RSP: 0018:8800d4aefb90 EFLAGS: 0246 > RAX: RBX: RCX: 88011eb16d40 > RDX: 82485760 RSI: 1f288820 RDI: ea008030 > RBP: 8800d4aefb90 R08: 000c R09: > R10: 821c8e0e R11: R12: 88200fb8 > R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0 > FS: 7f7a4e3d7840() GS:88011eb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0 > Stack: > 8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b > 8800d4aefcb0 8800d4aefbd0 811bf4cb 0002 > 0015 8800d2276050 8000c0fd8867 ea008030 > Call Trace: > [] _raw_spin_lock+0x27/0x30 > [] handle_pte_fault+0x13db/0x16b0 > [] ? function_trace_call+0x15b/0x180 > [] ? handle_pte_fault+0x5/0x16b0 > [] handle_mm_fault+0x312/0x670 > [] ? find_vma+0x68/0x70 > [] __do_page_fault+0x1b1/0x4e0 > [] do_page_fault+0x22/0x30 > [] page_fault+0x28/0x30 > [] ? copy_user_enhanced_fast_string+0x5/0x10 > [] ? seq_read+0x305/0x370 > [] __vfs_read+0x28/0xe0 > [] ? __vfs_read+0x5/0xe0 > [] ? __vfs_read+0x5/0xe0 > [] vfs_read+0x86/0x130 > [] SyS_read+0x46/0xa0 > [] entry_SYSCALL_64_fastpath+0x1e/0xa8 > Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 > c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 > 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b > > Reported-by: Łukasz Daniluk > Signed-off-by: Steven Rostedt > > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index f08ac28b8136..f975d226be6e 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n" > ".popsection"); > > /* identity function, which can be inlined */ > -u32 _paravirt_ident_32(u32 x) > +u32 notrace _paravirt_ident_32(u32 x) > { > return x; > } > > -u64 _paravirt_ident_64(u64 x) > +u64 notrace _paravirt_ident_64(u64 x) > { > return x; > } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions
I just spent half a day bisecting function tracing because I tripped over this again. I thought this was merged, but I guess it was missed again. Can someone please pull this in. And mark it for stable, it goes probably as far back as 2.6.32. Thanks! -- Steve On Wed, 25 May 2016 13:47:26 -0400 Steven Rostedt wrote: > Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up > after enabling function tracer. I asked him to bisect the functions within > available_filter_functions, which he did and it came down to three: > > _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64() > > It was found that this is only an issue when noreplace-paravirt is added to > the kernel command line. > > This means that those functions are most likely called within critical > sections of the funtion tracer, and must not be traced. > > In newer kenels _paravirt_nop() is defined within gcc asm(), and is no > longer an issue. But both _paravirt_ident_{32,64}() causes the following > splat when they are traced: > > mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054) > mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070) > mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054) > mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054) > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469] > Modules linked in: e1000e > CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513 > Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 > 05/07/2012 > task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000 > RIP: 0010:[] [] > queued_spin_lock_slowpath+0x118/0x1a0 > RSP: 0018:8800d4aefb90 EFLAGS: 0246 > RAX: RBX: RCX: 88011eb16d40 > RDX: 82485760 RSI: 1f288820 RDI: ea008030 > RBP: 8800d4aefb90 R08: 000c R09: > R10: 821c8e0e R11: R12: 88200fb8 > R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0 > FS: 7f7a4e3d7840() GS:88011eb0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0 > Stack: > 8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b > 8800d4aefcb0 8800d4aefbd0 811bf4cb 0002 > 0015 8800d2276050 8000c0fd8867 ea008030 > Call Trace: > [] _raw_spin_lock+0x27/0x30 > [] handle_pte_fault+0x13db/0x16b0 > [] ? function_trace_call+0x15b/0x180 > [] ? handle_pte_fault+0x5/0x16b0 > [] handle_mm_fault+0x312/0x670 > [] ? find_vma+0x68/0x70 > [] __do_page_fault+0x1b1/0x4e0 > [] do_page_fault+0x22/0x30 > [] page_fault+0x28/0x30 > [] ? copy_user_enhanced_fast_string+0x5/0x10 > [] ? seq_read+0x305/0x370 > [] __vfs_read+0x28/0xe0 > [] ? __vfs_read+0x5/0xe0 > [] ? __vfs_read+0x5/0xe0 > [] vfs_read+0x86/0x130 > [] SyS_read+0x46/0xa0 > [] entry_SYSCALL_64_fastpath+0x1e/0xa8 > Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14 > c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7 > 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b > > Reported-by: Łukasz Daniluk > Signed-off-by: Steven Rostedt > > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index f08ac28b8136..f975d226be6e 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n" > ".popsection"); > > /* identity function, which can be inlined */ > -u32 _paravirt_ident_32(u32 x) > +u32 notrace _paravirt_ident_32(u32 x) > { > return x; > } > > -u64 _paravirt_ident_64(u64 x) > +u64 notrace _paravirt_ident_64(u64 x) > { > return x; > } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support
On Wed, 4 Oct 2017 14:19:56 -0700 Thomas Garnier wrote: > When using -fPIE/PIC with function tracing, the compiler generates a > call through the GOT (call *__fentry__@GOTPCREL). This instruction > takes 6 bytes instead of 5 on the usual relative call. > > With this change, function tracing supports 6 bytes on traceable > function and can still replace relative calls on the ftrace assembly > functions. > > Position Independent Executable (PIE) support will allow to extended the > KASLR randomization range below the -2G memory limit. Question: This 6 bytes is only the initial call that gcc creates. When function tracing is enabled, the calls are back to the normal call to the ftrace trampoline? -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support
On Thu, 5 Oct 2017 09:01:14 -0700 Thomas Garnier wrote: > On Thu, Oct 5, 2017 at 6:06 AM, Steven Rostedt wrote: > > On Wed, 4 Oct 2017 14:19:56 -0700 > > Thomas Garnier wrote: > > > >> When using -fPIE/PIC with function tracing, the compiler generates a > >> call through the GOT (call *__fentry__@GOTPCREL). This instruction > >> takes 6 bytes instead of 5 on the usual relative call. > >> > >> With this change, function tracing supports 6 bytes on traceable > >> function and can still replace relative calls on the ftrace assembly > >> functions. > >> > >> Position Independent Executable (PIE) support will allow to extended the > >> KASLR randomization range below the -2G memory limit. > > > > Question: This 6 bytes is only the initial call that gcc creates. When > > function tracing is enabled, the calls are back to the normal call to > > the ftrace trampoline? > > That is correct. > Then I think a better idea is to simply nop them out at compile time, and have the code that updates them to nops to know about it. See scripts/recordmcount.c Could we simply add a 5 byte nop followed by a 1 byte nop, and treat it the same as if it didn't exist? This code can be a little complex, and can cause really nasty side effects if things go wrong. I would like to keep from adding more variables to the changes here. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 1/3] MAINTAINERS: Update maintainers for paravirt ops and VMware hypervisor interface
On Thu, 11 Nov 2021 07:50:39 +0100 Greg KH wrote: > > Signed-off-by: Srivatsa S. Bhat (VMware) > > Acked-by: Alexey Makhalov > > Acked-by: Deep Shah > > Acked-by: Juergen Gross > > Cc: sta...@vger.kernel.org > > Why are MAINTAINERS updates needed for stable? That's not normal :( Probably not needed, but does it hurt? And who's normal? -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 9/9] testmmiotrace: eliminate anonymous module_init & module_exit
On Wed, 16 Mar 2022 12:20:10 -0700 Randy Dunlap wrote: > Eliminate anonymous module_init() and module_exit(), which can lead to > confusion or ambiguity when reading System.map, crashes/oops/bugs, > or an initcall_debug log. > > Give each of these init and exit functions unique driver-specific > names to eliminate the anonymous names. > > Example 1: (System.map) > 832fc78c t init > 832fc79e t init > 832fc8f8 t init > > Example 2: (initcall_debug log) > calling init+0x0/0x12 @ 1 > initcall init+0x0/0x12 returned 0 after 15 usecs > calling init+0x0/0x60 @ 1 > initcall init+0x0/0x60 returned 0 after 2 usecs > calling init+0x0/0x9a @ 1 > initcall init+0x0/0x9a returned 0 after 74 usecs > > Fixes: 8b7d89d02ef3 ("x86: mmiotrace - trace memory mapped IO") > Signed-off-by: Randy Dunlap > Cc: Thomas Gleixner > Cc: Steven Rostedt Acked-by: Steven Rostedt (Google) -- Steve > Cc: Ingo Molnar > Cc: Karol Herbst > Cc: Pekka Paalanen > Cc: Dave Hansen > Cc: Andy Lutomirski > Cc: Peter Zijlstra > Cc: Borislav Petkov > Cc: "H. Peter Anvin" > Cc: nouv...@lists.freedesktop.org > Cc: x...@kernel.org > --- > arch/x86/mm/testmmiotrace.c |8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > --- lnx-517-rc8.orig/arch/x86/mm/testmmiotrace.c > +++ lnx-517-rc8/arch/x86/mm/testmmiotrace.c > @@ -113,7 +113,7 @@ static void do_test_bulk_ioremapping(voi > synchronize_rcu(); > } > > -static int __init init(void) > +static int __init testmmiotrace_init(void) > { > unsigned long size = (read_far) ? (8 << 20) : (16 << 10); > int ret = security_locked_down(LOCKDOWN_MMIOTRACE); > @@ -136,11 +136,11 @@ static int __init init(void) > return 0; > } > > -static void __exit cleanup(void) > +static void __exit testmmiotrace_cleanup(void) > { > pr_debug("unloaded.\n"); > } > > -module_init(init); > -module_exit(cleanup); > +module_init(testmmiotrace_init); > +module_exit(testmmiotrace_cleanup); > MODULE_LICENSE("GPL"); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] virtio: Workaround fix for hard hang on guest using fifos
This is more of a workaround patch and probably not the proper fix. But I'm doing some work that is using fifos for guests and this is causing a hang that is quite annoying. I currently have this patch applied to continue my work. I was working on analyzing data transfers between host and guests via virtio sockets (FIFOs on host, dev on guest), vsockets and TCP packets. I wrote a program to test each by passing a 1GB file and timing it. I'm using the splice system call to help move things along. In doing so, I found that my pipe between splice calls originally used "page_size" for data transfer, and that is not as efficient as finding out what the pipe size is. So I changed the code to use pipe_size and while debugging it, the guest locked up hard. I'm attaching the "agent-fifo" that runs on the guest, and the "client-fifo" that runs on the host (the names may be backwards, but makes sense when you add how I test vsockets and network packets). Here's what I did: # ./client-fifo /var/lib/virt/Guest/trace-pipe-cpu0.out /test/bigfile Where the trace-pipe-cpu0.out is the receiving side from the guest's virtio pipe. The /test/bigfile is created when data starts coming in from the guest pipe. # dd if=/dev/urandom of=bigfile bs=1024 count=1048576 # ./agent-fifo /dev/virtio-ports/trace-pipe-cpu0 bigfile With the updates to change the size being passed in the splice from page_size to pipe_size, this never finished (it would copy around a meg or so). And stopped. When I killed the agent-fifo task on the guest, the guest hung hard. Debugging this, I found that the guest is stuck in the loop in drivers/char/virt_console.c: __send_control_msg(): if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) { virtqueue_kick(vq); while (!virtqueue_get_buf(vq, &len) && !virtqueue_is_broken(vq)) cpu_relax(); } It never exits that loop. My workaround (this patch) is to put in a timeout, and exit out if it spins there for more than 5 seconds. This makes the problem go away. Below is my changes, but this is a band-aid, it is not the cure. Workaround-fix-by: Steven Rostedt (Google) --- diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index e3c430539a17..65f259f3f8cb 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -551,6 +551,7 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id, struct scatterlist sg[1]; struct virtqueue *vq; unsigned int len; + u64 end; if (!use_multiport(portdev)) return 0; @@ -567,9 +568,15 @@ static ssize_t __send_control_msg(struct ports_device *portdev, u32 port_id, if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) { virtqueue_kick(vq); + end = jiffies + 5 * HZ; while (!virtqueue_get_buf(vq, &len) - && !virtqueue_is_broken(vq)) + && !virtqueue_is_broken(vq)) { + if (unlikely(end < jiffies)) { + dev_warn(&portdev->vdev->dev, "send_control_msg timed out!\n"); + break; + } cpu_relax(); + } } spin_unlock(&portdev->c_ovq_lock); #include #include #include #include #include #include #include #include #include #include #define _GNU_SOURCE #include #ifndef F_GETPIPE_SZ # define F_GETPIPE_SZ 1032 /* The Linux number for the option */ #endif const char *this_name = "agent-fifo"; static void usage(void) { printf("usage: %s dev bigfile\n" "\n",this_name); exit(-1); } static void __vdie(const char *fmt, va_list ap, int err) { int ret = errno; const char *p = this_name; if (err && errno) perror(p); else ret = -1; fprintf(stderr, " "); vfprintf(stderr, fmt, ap); fprintf(stderr, "\n"); exit(ret); } void die(const char *fmt, ...) { va_list ap; va_start(ap, fmt); __vdie(fmt, ap, 0); va_end(ap); } void pdie(const char *fmt, ...) { va_list ap; va_start(ap, fmt); __vdie(fmt, ap, 1); va_end(ap); } static unsigned long long time() { struct timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec * 100 + tv.tv_usec; } static void print_time(unsigned long long time) { unsigned long long seconds; unsigned long long usecs; seconds = time / 100; usecs = time - (seconds * 100); printf("time: %llu.%06llu\n", seconds, usecs); } int main(int argc, char *argv[]) { unsigned long long start; unsigned long long end; struct stat st; off_t size; char *file; char *port; int page_size; int pipe_size; int brass[2]; int ret
Re: [PATCH] virtio: Workaround fix for hard hang on guest using fifos
On Fri, 25 Mar 2022 17:36:27 -0700 Linus Torvalds wrote: > On Fri, Mar 25, 2022 at 2:20 PM Steven Rostedt wrote: > > > > With the updates to change the size being passed in the splice from > > page_size to pipe_size, this never finished (it would copy around a meg or > > so). And stopped. When I killed the agent-fifo task on the guest, the guest > > hung hard. > > Without knowing (or really caring) at all how virtqueue works, this > sounds very much like the classic pipe deadlock where two processes > communicate over a pair of pipes, sending each other commands, and > replying to each other with status updates. It does look like this. Note, this is not due to my application, since it flows in one direction, but I think the way virtio pipes are implemented can cause this to occur. A virtio pipe on the guest looks like a normal bidirectional char device. You open it, read it, write to it. No problem. On the host, there are two FIFOs that are attached to the guest. One for each direction. That is, the host is using two pipes that convert to a character device on the guest. > > And you absolutely cannot do that if one side can possibly want to up > fill the whole pipe. > > Deadlock: > > - process A is trying to send data to process B (on 'pipe_A'), and > blocks because the pipe is full > > - process B is reads the data and everything is fine, and A gets to continue > I think you left out a step here, where A is blocked again. > - but then process B sends some stratus update the other way (on > 'pipe_B' - you can't use the same pipe for bidirectional, it's why you > use a pair of pipes or a socketpair) and waits for the result. > > - now A and B are both waiting for each other - A is waiting for B to > empty the big bunch of data it's sending, and B is waiting for the > result for the (small) command it sent. > > and neither makes any progress. > > You can find several mentions of these kinds of problems by just > googling for "bidirectional pipe deadlock" or similar. > > The solution is invariably to either > > (a) make sure that nobody writes even remotely close to enough data > to fill a pipe before reading the other pipe (you can still fill up a > pipe, but at least somebody is always going to succeed and make > progress and do the read to make progress). > > (b) make sure everybody who writes to a pipe will use nonblocking IO > (and is willing to do reads in between to satisfy the other end). > > That first case is basically what one of the PIPE_BUF guarantees is > all about (the other one is the atomicity it guarantees, ie you can > write a "command packet" and be guaranteed that readers will see it > without data mixed in from other writes). > > I have no idea what your client/agent does and how it interacts with > the virtio pipes, but it really _sounds_ like a very similar issue, > where it used to work (because PIPE_BUF) and now no longer does > (because pipe filled). So the agent writes to the virtio char device on the guest, while the client is reading the guest's FIFO on the host. One directional. > > And that virtio_console __send_control_msg() pattern very much sounds > like a "send data and wait for ACK" behavior of "process B". Something happens where data stops transferring (the virtio system waiting for status?) This mostly happens when I run the agent under gdb. Which it stops transferring data. Everything is still fine. I hit Ctrl^C twice to get back to the gdb command prompt, and then I kill the process with the gdb "kill" command. It asks me if I'm sure, I type "y", and then boom! The system is hung. The guest locks up. And printk()s show that the exit of the agent is in that loop. watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [agent-test:1165] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep bluetooth ecdh_generic ecc cfg80211 rfkill cirrus drm_shmem_helper drm_kms_helper vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common pcspkr joydev serio_raw vsock virtio_console drm virtio_balloon 8139too i2c_piix4 floppy nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net net_failover failover virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring 8139cp mii ata_generic pata_acpi CPU: 0 PID: 1165 Comm: agent-test Not tainted 5.17.0-test+ #41 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 RIP: 0010:virtqueue_get_buf_ctx_split+0x0/0xd0 [virtio_ring] Code: 75 a6 45 31 e4 39 45 2c 74 e6 0f 0b 83 c6 01 48 83 c2 10 39 c6 75 b4 45 31 e4 39 45 2c 74 d1 0f 0b 66 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 41 54 53 80 7f 3b 00 0f 85 9a 00 00 00 48 8b 47 68 RSP: 0018:b85c012e7c68 EFLAGS: 02
Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage
On Thu, 9 Jun 2022 15:02:20 +0200 Petr Mladek wrote: > > I'm somewhat curious whether we can actually remove that trace event. > > Good question. > > Well, I think that it might be useful. It allows to see trace and > printk messages together. Yes people still use it. I was just asked about it at Kernel Recipes. That is, someone wanted printk mixed in with the tracing, and I told them about this event (which they didn't know about but was happy to hear that it existed). -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: netconsole deadlock with virtnet
On Tue, 17 Nov 2020 12:23:41 +0200 Leon Romanovsky wrote: > Hi, > > Approximately two weeks ago, our regression team started to experience those > netconsole splats. The tested code is Linus's master (-rc4) + netdev net-next > + netdev net-rc. > > Such splats are random and we can't bisect because there is no stable > reproducer. > > Any idea, what is the root cause? > > [ 21.141117] netpoll: netconsole: local port > [ 21.141947] netpoll: netconsole: local IPv4 address 10.236.251.7 > [ 21.143138] netpoll: netconsole: interface 'eth1' > [ 21.143988] netpoll: netconsole: remote port 62517 > [ 21.145127] netpoll: netconsole: remote IPv4 address 10.236.251.1 > [ 21.146206] netpoll: netconsole: remote ethernet address 68:05:ca:aa:99:49 > [ 21.149464] = > [ 21.149466] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected > [ 21.149467] 5.10.0-rc4_for_upstream_min_debug_2020_11_16_13_03 #1 Not > tainted > [ 21.149468] - > [ 21.149469] modprobe/596 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: > [ 21.149471] 000194b4e298 (_xmit_ETHER#2){+.-.}-{2:2}, at: > virtnet_poll_tx+0x84/0x120 > [ 21.149478] and this task is already holding: > [ 21.149478] 893d4018 (target_list_lock){}-{2:2}, at: > write_msg+0x6c/0x120 [netconsole] > [ 21.149483] which would create a new lock dependency: > [ 21.149484] (target_list_lock){}-{2:2} -> (_xmit_ETHER#2){+.-.}-{2:2} > [ 21.149491] but this new dependency connects a HARDIRQ-irq-safe lock: > [ 21.149492] (console_owner){-.-.}-{0:0} > [ 21.149496] ... which became HARDIRQ-irq-safe at: > [ 21.149496] __lock_acquire+0xa78/0x1a94 > [ 21.149498] lock_acquire.part.0+0x170/0x360 > [ 21.149498] lock_acquire+0x68/0x8c > [ 21.149500] console_unlock+0x1e8/0x6a4 > [ 21.149500] vprintk_emit+0x1c4/0x3c4 > [ 21.149501] vprintk_default+0x40/0x4c > [ 21.149502] vprintk_func+0x10c/0x220 > [ 21.149503] printk+0x68/0x90 > [ 21.149504] crng_fast_load+0x1bc/0x1c0 > [ 21.149505] add_interrupt_randomness+0x280/0x290 > [ 21.149506] handle_irq_event+0x80/0x120 > [ 21.149507] handle_fasteoi_irq+0xac/0x200 > [ 21.149508] __handle_domain_irq+0x84/0xf0 > [ 21.149509] gic_handle_irq+0xd4/0x320 > [ 21.149510] el1_irq+0xd0/0x180 > [ 21.149511] arch_cpu_idle+0x24/0x44 > [ 21.149512] default_idle_call+0x48/0xa0 > [ 21.149513] do_idle+0x260/0x300 > [ 21.149514] cpu_startup_entry+0x30/0x6c > [ 21.149515] rest_init+0x1b4/0x288 > [ 21.149515] arch_call_rest_init+0x18/0x24 > [ 21.149516] start_kernel+0x5cc/0x608 > [ 21.149518] to a HARDIRQ-irq-unsafe lock: > [ 21.149519] (_xmit_ETHER#2){+.-.}-{2:2} > [ 21.149523] ... which became HARDIRQ-irq-unsafe at: > [ 21.149524] ... __lock_acquire+0x8bc/0x1a94 > [ 21.149525] lock_acquire.part.0+0x170/0x360 > [ 21.149526] lock_acquire+0x68/0x8c > [ 21.149527] _raw_spin_trylock+0x80/0xd0 > [ 21.149527] virtnet_poll+0xac/0x360 > [ 21.149528] net_rx_action+0x1b0/0x4e0 > [ 21.149529] __do_softirq+0x1f4/0x638 > [ 21.149530] do_softirq+0xb8/0xcc > [ 21.149531] __local_bh_enable_ip+0x18c/0x200 > [ 21.149532] virtnet_napi_enable+0xc0/0xd4 > [ 21.149533] virtnet_open+0x98/0x1c0 > [ 21.149534] __dev_open+0x12c/0x200 > [ 21.149535] __dev_change_flags+0x1a0/0x220 > [ 21.149536] dev_change_flags+0x2c/0x70 > [ 21.149536] do_setlink+0x214/0xe20 > [ 21.149537] __rtnl_newlink+0x514/0x820 > [ 21.149538] rtnl_newlink+0x58/0x84 > [ 21.149539] rtnetlink_rcv_msg+0x184/0x4b4 > [ 21.149540] netlink_rcv_skb+0x60/0x124 > [ 21.149541] rtnetlink_rcv+0x20/0x30 > [ 21.149542] netlink_unicast+0x1b4/0x270 > [ 21.149543] netlink_sendmsg+0x1f0/0x400 > [ 21.149544] sock_sendmsg+0x5c/0x70 > [ 21.149545] sys_sendmsg+0x24c/0x280 > [ 21.149545] ___sys_sendmsg+0x88/0xd0 > [ 21.149546] __sys_sendmsg+0x70/0xd0 > [ 21.149547] __arm64_sys_sendmsg+0x2c/0x40 > [ 21.149548] el0_svc_common.constprop.0+0x84/0x200 > [ 21.149549] do_el0_svc+0x2c/0x90 > [ 21.149550] el0_svc+0x18/0x50 > [ 21.149551] el0_sync_handler+0xe0/0x350 > [ 21.149552] el0_sync+0x158/0x180 > [ 21.149553] other info that might help us debug this: > [ 21.149555] Chain exists of: > [ 21.149556] console_owner --> target_list_lock --> _xmit_ETHER#2 So somewhere we have target_list_lock taken while holding the console_owner lock (which can happen in interrupt). And what this is saying, target_list_lock is held somewhere where _xmit_ETHER#2 is taken (which I'm guessing from the back traces is the xmit_lock taken in virtnet_poll_tx(). Thus, you can have a deadlock with three CPUs (I need to update the lockdep output to make it proper one of these days). CPU0CPU1CPU2
Re: netconsole deadlock with virtnet
[ Adding netdev as perhaps someone there knows ] On Wed, 18 Nov 2020 12:09:59 +0800 Jason Wang wrote: > > This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is > > basically > > soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk() > > > > Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on > > console_owner, which is owned by CPU1? It still looks to me that the target_list_lock is taken in IRQ, (which can be the case because printk calls write_msg() which takes that lock). And someplace there's a: lock(target_list_lock) lock(xmit_lock) which means you can remove the console lock from this scenario completely, and you still have a possible deadlock between target_list_lock and xmit_lock. > > > If this is true, it looks not a virtio-net specific issue but somewhere > else. > > I think all network driver will synchronize through bh instead of hardirq. I think the issue is where target_list_lock is held when we take xmit_lock. Is there anywhere in netconsole.c that can end up taking xmit_lock while holding the target_list_lock? If so, that's the problem. As target_list_lock is something that can be taken in IRQ context, which means *any* other lock that is taking while holding the target_list_lock must also protect against interrupts from happening while it they are held. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: netconsole deadlock with virtnet
On Mon, 23 Nov 2020 13:08:55 +0200 Leon Romanovsky wrote: > [ 10.028024] Chain exists of: > [ 10.028025] console_owner --> target_list_lock --> _xmit_ETHER#2 Note, the problem is that we have a location that grabs the xmit_lock while holding target_list_lock (and possibly console_owner). > [ 10.028028] > [ 10.028028] Possible interrupt unsafe locking scenario: > [ 10.028029] > [ 10.028029]CPU0CPU1 > [ 10.028030] > [ 10.028030] lock(_xmit_ETHER#2); > [ 10.028032]local_irq_disable(); > [ 10.028032]lock(console_owner); > [ 10.028034]lock(target_list_lock); > [ 10.028035] > [ 10.028035] lock(console_owner); > [ 10.028036] > [ 10.028037] *** DEADLOCK *** > [ 10.028037] > [ 10.028107] the dependencies between the lock to be acquired > [ 10.028107] and HARDIRQ-irq-unsafe lock: > [ 10.028108] -> (_xmit_ETHER#2){+.-.}-{2:2} ops: 217 { > [ 10.028110]HARDIRQ-ON-W at: > [ 10.028111]__lock_acquire+0x8bc/0x1a94 > [ 10.028111]lock_acquire.part.0+0x170/0x360 > [ 10.028112]lock_acquire+0x68/0x8c > [ 10.028113]_raw_spin_trylock+0x80/0xd0 > [ 10.028113]virtnet_poll+0xac/0x360 xmit_lock is taken in virtnet_poll() (via virtnet_poll_cleantx()). This is called from the softirq, and interrupts are not disabled. > [ 10.028114]net_rx_action+0x1b0/0x4e0 > [ 10.028115]__do_softirq+0x1f4/0x638 > [ 10.028115]do_softirq+0xb8/0xcc > [ 10.028116]__local_bh_enable_ip+0x18c/0x200 > [ 10.028116]virtnet_napi_enable+0xc0/0xd4 > [ 10.028117]virtnet_open+0x98/0x1c0 > [ 10.028118]__dev_open+0x12c/0x200 > [ 10.028118]__dev_change_flags+0x1a0/0x220 > [ 10.028119]dev_change_flags+0x2c/0x70 > [ 10.028119]do_setlink+0x214/0xe20 > [ 10.028120]__rtnl_newlink+0x514/0x820 > [ 10.028120]rtnl_newlink+0x58/0x84 > [ 10.028121]rtnetlink_rcv_msg+0x184/0x4b4 > [ 10.028122]netlink_rcv_skb+0x60/0x124 > [ 10.028122]rtnetlink_rcv+0x20/0x30 > [ 10.028123]netlink_unicast+0x1b4/0x270 > [ 10.028124]netlink_sendmsg+0x1f0/0x400 > [ 10.028124]sock_sendmsg+0x5c/0x70 > [ 10.028125]sys_sendmsg+0x24c/0x280 > [ 10.028125]___sys_sendmsg+0x88/0xd0 > [ 10.028126]__sys_sendmsg+0x70/0xd0 > [ 10.028127]__arm64_sys_sendmsg+0x2c/0x40 > [ 10.028128]el0_svc_common.constprop.0+0x84/0x200 > [ 10.028128]do_el0_svc+0x2c/0x90 > [ 10.028129]el0_svc+0x18/0x50 > [ 10.028129]el0_sync_handler+0xe0/0x350 > [ 10.028130]el0_sync+0x158/0x180 [..] > [ 10.028171] ... key at: [] > netdev_xmit_lock_key+0x10/0x390 > [ 10.028171] ... acquired at: > [ 10.028172]__lock_acquire+0x134c/0x1a94 > [ 10.028172]lock_acquire.part.0+0x170/0x360 > [ 10.028173]lock_acquire+0x68/0x8c > [ 10.028173]_raw_spin_lock+0x64/0x90 > [ 10.028174]virtnet_poll_tx+0x84/0x120 > [ 10.028174]netpoll_poll_dev+0x12c/0x350 > [ 10.028175]netpoll_send_skb+0x39c/0x400 > [ 10.028175]netpoll_send_udp+0x2b8/0x440 > [ 10.028176]write_msg+0xfc/0x120 [netconsole] > [ 10.028176]console_unlock+0x3ec/0x6a4 The above shows the problem. We have: console_unlock() (which holds the console_owner lock) write_msg() (which holds the target_list_lock) Then we write_msg() calls: netpoll_send_udp() { netpoll_send_skb() { netpoll_poll_dev() { virtnet_poll_tx() (which takes the xmit_lock!) DEADLOCK! In netpoll_send_skb() I see this: /* tickle device maybe there is some cleanup */ netpoll_poll_dev(np->dev); Which looks to me that it will call some code that should only be used in softirq context. It's called with locks held that are taken in interrupt context, and any locks that are taken in netpoll_poll_dev() must always be taken with interrupts disabled. That is, if xmit_lock is taken within netpoll_poll_dev(), then it must always be taken with interrupts disabled. Otherwise you can have the deadlock that lockdep reported. -- Steve > [ 10.028177]register
Re: netconsole deadlock with virtnet
On Mon, 23 Nov 2020 10:52:52 -0800 Jakub Kicinski wrote: > On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote: > > On Mon, 23 Nov 2020 13:08:55 +0200 > > Leon Romanovsky wrote: > > > > > > > [ 10.028024] Chain exists of: > > > [ 10.028025] console_owner --> target_list_lock --> _xmit_ETHER#2 > > > > Note, the problem is that we have a location that grabs the xmit_lock while > > holding target_list_lock (and possibly console_owner). > > Well, it try_locks the xmit_lock. Does lockdep understand try-locks? > > (not that I condone the shenanigans that are going on here) Does it? virtnet_poll_tx() { __netif_tx_lock() { spin_lock(&txq->_xmit_lock); That looks like we can have: CPU0CPU1 lock(xmit_lock) lock(console) lock(target_list_lock) __netif_tx_lock() lock(xmit_lock); [BLOCKED] lock(console) [BLOCKED] DEADLOCK. So where is the trylock here? Perhaps you need the trylock in virtnet_poll_tx()? -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: netconsole deadlock with virtnet
On Tue, 24 Nov 2020 11:22:03 +0800 Jason Wang wrote: > Btw, have a quick search, there are several other drivers that uses tx > lock in the tx NAPI. tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c) calls this polling logic with the target_list_lock held. Are those other drivers called by netconsole? If not, then this is unique to virtio-net. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH v5 19/19] virtio/vsock: update trace event for SEQPACKET
On Thu, 18 Feb 2021 08:42:15 +0300 Arseny Krasnov wrote: Not sure if this was pulled in yet, but I do have a small issue with this patch. > @@ -69,14 +82,19 @@ TRACE_EVENT(virtio_transport_alloc_pkt, > __entry->type = type; > __entry->op = op; > __entry->flags = flags; > + __entry->msg_len = msg_len; > + __entry->msg_cnt = msg_cnt; > ), > - TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x", > + TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x " > + "msg_len=%u msg_cnt=%u", It's considered poor formatting to split strings like the above. This is one of the exceptions for the 80 character limit. Do not break strings just to keep it within 80 characters. -- Steve > __entry->src_cid, __entry->src_port, > __entry->dst_cid, __entry->dst_port, > __entry->len, > show_type(__entry->type), > show_op(__entry->op), > - __entry->flags) > + __entry->flags, > + __entry->msg_len, > + __entry->msg_cnt) > ); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 0/6] use canonical ftrace path whenever possible
On Fri, 10 Mar 2023 12:06:58 -0700 Ross Zwisler wrote: > On Fri, Mar 10, 2023 at 03:29:49AM -0500, Michael S. Tsirkin wrote: > > On Wed, Feb 15, 2023 at 03:33:44PM -0700, Ross Zwisler wrote: > > > Changes in v2: > > > * Dropped patches which were pulled into maintainer trees. > > > * Split BPF patches out into another series targeting bpf-next. > > > * trace-agent now falls back to debugfs if tracefs isn't present. > > > * Added Acked-by from m...@redhat.com to series. > > > * Added a typo fixup for the virtio-trace README. > > > > > > Steven, assuming there are no objections, would you feel comfortable > > > taking this series through your tree? > > > > Acked-by: Michael S. Tsirkin > > > > if you want the virtio patches through my tree after all, let me know. > > Yes, please, that would be great. I'll send out v3 with the few patches that > haven't gotten a response, but I'll drop the virtio patches assuming you've > got them. > Since the last patch 6/6 is not tracing related, I would prefer it to go through the virtio tree. For patches 1-5, please add: Reviewed-by: Steven Rostedt (Google) Thanks! -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 01/20] x86: move prepare_ftrace_return prototype to header
On Tue, 16 May 2023 21:35:30 +0200 Arnd Bergmann wrote: > From: Arnd Bergmann > > On 32-bit builds, the prepare_ftrace_return() function only has a global > definition, but no prototype before it, which causes a warning: > > arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for > ‘prepare_ftrace_return’ [-Wmissing-prototypes] > 625 | void prepare_ftrace_return(unsigned long ip, unsigned long *parent, > > Move the prototype that is already needed for some configurations into > a header file where it can be seen unconditionally. > > Signed-off-by: Arnd Bergmann > --- > arch/x86/include/asm/ftrace.h | 3 +++ > arch/x86/kernel/ftrace.c | 3 --- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h > index 5061ac98ffa1..b8d4a07f9595 100644 > --- a/arch/x86/include/asm/ftrace.h > +++ b/arch/x86/include/asm/ftrace.h Acked-by: Steven Rostedt (Google) -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] x86/xen: move paravirt lazy code
On Wed, 13 Sep 2023 13:38:27 +0200 Juergen Gross wrote: > diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h > index 44a3f565264d..0577f0cdd231 100644 > --- a/include/trace/events/xen.h > +++ b/include/trace/events/xen.h > @@ -6,26 +6,26 @@ > #define _TRACE_XEN_H > > #include > -#include > +#include > #include > > struct multicall_entry; > > /* Multicalls */ > DECLARE_EVENT_CLASS(xen_mc__batch, > - TP_PROTO(enum paravirt_lazy_mode mode), > + TP_PROTO(enum xen_lazy_mode mode), > TP_ARGS(mode), > TP_STRUCT__entry( > - __field(enum paravirt_lazy_mode, mode) > + __field(enum xen_lazy_mode, mode) > ), > TP_fast_assign(__entry->mode = mode), > TP_printk("start batch LAZY_%s", > - (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" : > - (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE") > + (__entry->mode == XEN_LAZY_MMU) ? "MMU" : > + (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE") There's helper functions that make the above easier to implement as well as exports the symbols so that user space can parse this better: TRACE_DEFINE_ENUM(XEN_LAZY_NONE); TRACE_DEFINE_ENUM(XEN_LAZY_MMU); TRACE_DEFINE_ENUM(XEN_LAZY_CPU); [..] TP_printk("start batch LAZY_%s", __print_symbolic(mode, { XEN_LAZY_NONE, "NONE" }, { XEN_LAZY_MMU, "MMU }, { XEN_LAZY_CPU, "CPU" })) Then user space parsers that read the raw data can convert these events into something humans can read. -- Steve > ); > #define DEFINE_XEN_MC_BATCH(name)\ > DEFINE_EVENT(xen_mc__batch, name, \ > - TP_PROTO(enum paravirt_lazy_mode mode), \ > + TP_PROTO(enum xen_lazy_mode mode), \ >TP_ARGS(mode)) > > DEFINE_XEN_MC_BATCH(xen_mc_batch); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks
The new trace-cmd 3.0 (which is almost ready to be released) allows for tracing between host and guests with timestamp synchronization such that the events on the host and the guest can be interleaved in the proper order that they occur. KernelShark now has a plugin that visualizes this interaction. The implementation requires that the guest has a vsock CID assigned, and on the guest a "trace-cmd agent" is running, that will listen on a port for the CID. The on the host a "trace-cmd record -A guest@cid:port -e events" can be called and the host will connect to the guest agent through the cid/port pair and have the agent enable tracing on behalf of the host and send the trace data back down to it. The problem is that there is no sure fire way to find the CID for a guest. Currently, the user must know the cid, or we have a hack that looks for the qemu process and parses the --guest-cid parameter from it. But this is prone to error and does not work on other implementation (was told that crosvm does not use qemu). As I can not find a way to discover CIDs assigned to guests via any kernel interface, I decided to create this one. Note, I'm not attached to it. If there's a better way to do this, I would love to have it. But since I'm not an expert in the networking layer nor virtio, I decided to stick to what I know and add a debugfs interface that simply lists all the registered CIDs and the worker task that they are associated with. The worker task at least has the PID of the task it represents. Now I can find the cid / host process in charge of the guest pair: # cat /sys/kernel/debug/vsock_list 3 vhost-1954:2002 # ps aux | grep 1954 qemu1954 9.9 21.3 1629092 796148 ? Sl 16:22 0:58 /usr/bin/qemu-kvm -name guest=Fedora21,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-Fedora21/master-key.aes -machine pc-1.2,accel=kvm,usb=off,dump-guest-core=off -cpu qemu64 -m 1000 -overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 1eefeeb0-3ac7-07c1-926e-236908313b4c -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=32,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -blockdev {"driver":"host_device","filename":"/dev/mapper/vg_bxtest-GuestFedora","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"} -device ide-hd,bus=ide.0,unit=0,drive=libvirt-1- format,id=ide0-0-0,bootindex=1 -netdev tap,fd=34,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:9f:e9:d5,bus=pci.0,addr=0x3 -netdev tap,fd=35,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ec:dc:6e,bus=pci.0,addr=0x5 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev pipe,id=charchannel0,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu0 -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=trace-pipe-cpu0 -chardev pipe,id=charchannel1,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu1 -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=trace-pipe-cpu1 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -device vhost-vsock-pci,id=vsock0,guest-cid=3,vhostfd=16,bus=pci.0,addr=0x7 -msg timestamp=on root2000 0.0 0.0 0 0 ?S16:22 0:00 [kvm-pit/1954] root 2002 0.0 0.0 0 0 ?S16:22 0:00 [vhost-1954] This is just an example of what I'm looking for. Just a way to find what process is using what cid. Signed-off-by: Steven Rostedt (VMware) --- diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 5e78fb719602..4f03b25b23c1 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include "vhost.h" @@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = { .fops = &vhost_vsock_fops, }; +static struct dentry *vsock_file; + +struct vsock_file_iter { + struct hlist_node *node; + int index; +}; + + +static void *vsock_next(struct seq_file *m, void *v, loff_t *pos) +{ + struct vsock_file_iter *iter = v; + struct vhost_vsock *vsock; + + if (pos) + (*pos)++; + + if (iter->index >
Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks
For kicks, I wrote this program that uses libtracefs to search all CIDS (1-255), and find the kvm guests that are attached to them. It traces the sched_wakeup and kvm_exit, looking for: this_task -> wakeup -> wakeup -> kvm_exit when doing a connect to a cid. When it finds the pid that did a kvm_exit, it knows that's the PID that is woken by the vhost worker task. It's a little slow, and I would really like a better way to do this, but it's at least an option that is available now. -- Steve #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #define MAX_CID 256 static int this_pid; static int open_vsock(unsigned int cid, unsigned int port) { struct sockaddr_vm addr = { .svm_family = AF_VSOCK, .svm_cid = cid, .svm_port = port, }; int sd; sd = socket(AF_VSOCK, SOCK_STREAM, 0); if (sd < 0) return -1; if (connect(sd, (struct sockaddr *)&addr, sizeof(addr))) return -1; return sd; } struct pids { struct pids *next; int pid; }; struct trace_info { struct tracefs_instance *instance; struct tep_handle *tep; struct tep_event *wake_up; struct tep_event *kvm_exit; struct tep_format_field *common_pid; struct tep_format_field *wake_pid; struct pids *pids; intcid; intpid; }; static void tear_down_trace(struct trace_info *info) { tracefs_instance_file_write(info->instance, "events/enable", "0"); tracefs_instance_destroy(info->instance); tracefs_instance_free(info->instance); tep_free(info->tep); } static int setup_trace(struct trace_info *info) { const char *systems[] = { "sched", "kvm", NULL}; char *name; int ret; info->pids = NULL; ret = asprintf(&name, "vsock_find-%d\n", getpid()); if (ret < 0) return ret; info->instance = tracefs_instance_create(name); free(name); if (!info->instance) return -1; tracefs_trace_off(info->instance); info->tep = tracefs_local_events_system(NULL, systems); if (!info->tep) goto fail; info->wake_up = tep_find_event_by_name(info->tep, "sched", "sched_waking"); if (!info->wake_up) { fprintf(stderr, "Failed to find sched_waking\n"); goto fail; } info->kvm_exit = tep_find_event_by_name(info->tep, "kvm", "kvm_exit"); if (!info->kvm_exit) { fprintf(stderr, "Failed to find kvm_exit\n"); goto fail; } info->wake_pid = tep_find_any_field(info->wake_up, "pid"); if (!info->wake_pid) { fprintf(stderr, "Failed to find wake up pid\n"); goto fail; } info->common_pid = tep_find_common_field(info->wake_up, "common_pid"); if (!info->common_pid) { fprintf(stderr, "Failed to find common pid\n"); goto fail; } ret = tracefs_instance_file_write(info->instance, "events/sched/sched_waking/enable", "1"); if (ret < 0) { fprintf(stderr, "Failed to enable sched_waking\n"); goto fail; } ret = tracefs_instance_file_write(info->instance, "events/kvm/kvm_exit/enable", "1"); if (ret < 0) { fprintf(stderr, "Failed to enable kvm_exit\n"); goto fail; } return 0; fail: tear_down_trace(info); return -1; } static void free_pids(struct pids *pids) { struct pids *next; while (pids) { next = pids; pids = pids->next; free(next); } } static void add_pid(struct pids **pids, int pid) { struct pids *new_pid; new_pid = malloc(sizeof(*new_pid)); if (!new_pid) return; new_pid->pid = pid; new_pid->next = *pids; *pids = new_pid; } static bool match_pid(struct pids *pids, int pid) { while (pids) { if (pids->pid == pid) return true; pids = pids->next; } return false; } static int callback(struct tep_event *event, struct tep_record *record, int cpu, void *data) { struct trace_info *info = data; struct tep_handle *tep = info->tep; unsigned long long val; int type; int pid; int ret; ret = tep_read_number_field(info->common_pid, record->data, &val); if (ret < 0) return 0; pid = val; if (!match_pid(info->pids, pid)) return 0; type = tep_data_type(tep, record); if (type == info->kvm_exit->id) { info->pid = pid; return -1; } if (type != info->wake_up->id) return 0; ret = tep_read_number_field(info->wake_pid, record->data, &val); if (ret < 0) return 0; add_pid(&info->pids, (int)val); return 0; } static void print_cid_pid(int cid, int pid) { FILE *fp; char *path; char *buf = NULL; char *save; size_t l = 0; int tgid = -1; if (asprintf(&path, "/proc/%d/status", pid) < 0) return; fp = fopen(path, "r"); free(path); if (!fp) return; while (getline(&buf, &l, fp) > 0) { char *tok; if (strncmp(buf, "Tgid:", 5) != 0) continue; tok = strtok_r(buf, ":", &save); if (!tok) continue; tok = strtok_r(NULL, ":", &save); if (!tok) continue; while (isspace(*tok)) tok++; tgid = strtol(tok, NULL, 0); break; } free(buf); if (tgid >= 0) printf("%d\t%d\n", cid, tgid); } static void find_cid(struct trace_info *info, int cid) { int fd; add_pid(&info->pids, this_pid); tracefs_instance_file_clear(info->instance, "trace");
Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks
On Fri, 7 May 2021 16:11:20 +0200 Stefano Garzarella wrote: > Hi Steven, > > On Wed, May 05, 2021 at 04:38:55PM -0400, Steven Rostedt wrote: > >The new trace-cmd 3.0 (which is almost ready to be released) allows for > >tracing between host and guests with timestamp synchronization such that > >the events on the host and the guest can be interleaved in the proper order > >that they occur. KernelShark now has a plugin that visualizes this > >interaction. > > > >The implementation requires that the guest has a vsock CID assigned, and on > >the guest a "trace-cmd agent" is running, that will listen on a port for > >the CID. The on the host a "trace-cmd record -A guest@cid:port -e events" > >can be called and the host will connect to the guest agent through the > >cid/port pair and have the agent enable tracing on behalf of the host and > >send the trace data back down to it. > > > >The problem is that there is no sure fire way to find the CID for a guest. > >Currently, the user must know the cid, or we have a hack that looks for the > >qemu process and parses the --guest-cid parameter from it. But this is > >prone to error and does not work on other implementation (was told that > >crosvm does not use qemu). > > For debug I think could be useful to link the vhost-vsock kthread to the > CID, but for the user point of view, maybe is better to query the VM > management layer, for example if you're using libvirt, you can easily do: > > $ virsh dumpxml fedora34 | grep cid > We looked into going this route, but then that means trace-cmd host/guest tracing needs a way to handle every layer, as some people use libvirt (myself included), some people use straight qemu, some people us Xen, and some people use crosvm. We need to support all of them. Which is why I'm looking at doing this from the lowest common denominator, and since vsock is a requirement from trace-cmd to do this tracing, getting the thread that's related to the vsock is that lowest denominator. > > > > >As I can not find a way to discover CIDs assigned to guests via any kernel > >interface, I decided to create this one. Note, I'm not attached to it. If > >there's a better way to do this, I would love to have it. But since I'm not > >an expert in the networking layer nor virtio, I decided to stick to what I > >know and add a debugfs interface that simply lists all the registered > >CIDs > >and the worker task that they are associated with. The worker task at > >least has the PID of the task it represents. > > I honestly don't know if it's the best interface, like I said maybe for > debugging it's fine, but if we want to expose it to the user in some > way, we could support devlink/netlink to provide information about the > vsock devices currently in use. Ideally, a devlink/netlink is the right approach. I just had no idea on how to implement that ;-) So I went with what I know, which is debugfs files! > >Signed-off-by: Steven Rostedt (VMware) > >--- > >diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c > >index 5e78fb719602..4f03b25b23c1 100644 > >--- a/drivers/vhost/vsock.c > >+++ b/drivers/vhost/vsock.c > >@@ -15,6 +15,7 @@ > > #include > > #include > > #include > >+#include > > > > #include > > #include "vhost.h" > >@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = { > > .fops = &vhost_vsock_fops, > > }; > > > >+static struct dentry *vsock_file; > >+ > >+struct vsock_file_iter { > >+struct hlist_node *node; > >+int index; > >+}; > >+ > >+ > >+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos) > >+{ > >+struct vsock_file_iter *iter = v; > >+struct vhost_vsock *vsock; > >+ > >+if (pos) > >+(*pos)++; > >+ > >+if (iter->index >= (int)HASH_SIZE(vhost_vsock_hash)) > >+return NULL; > >+ > >+if (iter->node) > >+iter->node = rcu_dereference_raw(hlist_next_rcu(iter->node)); > >+ > >+for (;;) { > >+if (iter->node) { > >+vsock = > >hlist_entry_safe(rcu_dereference_raw(iter->node), > >+ struct vhost_vsock, hash); > >+if (vsock->guest_cid) > >+break; > >+iter->node = > >rcu_dereference_raw(hlist_next_rcu(iter->node)); > >+co
Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks
On Fri, 7 May 2021 17:43:32 +0200 Stefano Garzarella wrote: > >The start/stop of a seq_file() is made for taking locks. I do this with all > >my code in ftrace. Yeah, there's a while loop between the two, but that's > >just to fill the buffer. It's not that long and it never goes to userspace > >between the two. You can even use this for spin locks (but I wouldn't > >recommend doing it for raw ones). > > Ah okay, thanks for the clarification! > > I was worried because building with `make C=2` I had these warnings: > > ../drivers/vhost/vsock.c:944:13: warning: context imbalance in 'vsock_start' > - wrong count at exit > ../drivers/vhost/vsock.c:963:13: warning: context imbalance in 'vsock_stop' - > unexpected unlock > > Maybe we need to annotate the functions somehow. Yep, I it should have been. static void *vsock_start(struct seq_file *m, loff_t *pos) __acquires(rcu) { [...] } static void vsock_stop(struct seq_file *m, void *p) __releases(rcu) { [...] } static int vsock_show(struct seq_file *m, void *v) __must_hold(rcu) { [...] } And guess what? I just copied those annotations from sock_hash_seq_start(), sock_hash_seq_show() and sock_hash_seq_stop() from net/core/sock_map.c which is doing exactly the same thing ;-) So there's definitely precedence for this. > > > > >> > >> >+ > >> >+ iter->index = -1; > >> >+ iter->node = NULL; > >> >+ t = vsock_next(m, iter, NULL); > >> >+ > >> >+ for (; iter->index < HASH_SIZE(vhost_vsock_hash) && l < *pos; > >> >+ t = vsock_next(m, iter, &l)) > >> >+ ; > >> > >> A while() maybe was more readable... > > > >Again, I just cut and pasted from my other code. > > > >If you have a good idea on how to implement this with netlink (something > >that ss or netstat can dislpay), I think that's the best way to go. > > Okay, I'll take a look and get back to you. > If it's too complicated, we can go ahead with this patch. Awesome, thanks! -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks
On Thu, 13 May 2021 16:57:34 +0100 Stefan Hajnoczi wrote: > This approach relies on process hierarchy of the VMM (QEMU). > Multi-process QEMU is in development and will allow VIRTIO devices to > run as separate processes from the main QEMU. It then becomes harder to > correlate a VIRTIO device process with its QEMU process. And we need to know all these mapping regardless, as we need to map each thread / process to the vCPU in order to correlate between host thread and vCPU thread for showing in KernelShark. Thus this mapping to find the main thread/process needs to be done regardless. > > So I think in the end this approach ends up being as fragile as parsing > command-lines. The kernel doesn't really have the concept of a "VM" that > the vhost_vsock is associated with :). Maybe just parse QEMU and crosvm > command-lines? > That's what we do now, and it already broke once, and even parsing the command line wont be enough for the stated reasons above. -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 33/44] ftrace: WARN on rcuidle
Nit, the subject should have "tracing:" an not "ftrace:" as the former encompasses the tracing infrastructure and the latter is for the function hook part of that. On Mon, 19 Sep 2022 12:00:12 +0200 Peter Zijlstra wrote: > CONFIG_GENERIC_ENTRY disallows any and all tracing when RCU isn't > enabled. > > XXX if s390 (the only other GENERIC_ENTRY user as of this writing) > isn't comfortable with this, we could switch to > HAVE_NOINSTR_VALIDATION which is x86_64 only atm. > > Signed-off-by: Peter Zijlstra (Intel) > --- > include/linux/tracepoint.h | 13 - > kernel/trace/trace.c |3 +++ > 2 files changed, 15 insertions(+), 1 deletion(-) > > --- a/include/linux/tracepoint.h > +++ b/include/linux/tracepoint.h > @@ -178,6 +178,16 @@ static inline struct tracepoint *tracepo > #endif /* CONFIG_HAVE_STATIC_CALL */ > > /* > + * CONFIG_GENERIC_ENTRY archs are expected to have sanitized entry and idle > + * code that disallow any/all tracing/instrumentation when RCU isn't > watching. > + */ > +#ifdef CONFIG_GENERIC_ENTRY > +#define RCUIDLE_COND(rcuidle)(rcuidle) > +#else Should probably move the below comment to here: /* srcu can't be used from NMI */ > +#define RCUIDLE_COND(rcuidle)(rcuidle && in_nmi()) > +#endif > + > +/* > * it_func[0] is never NULL because there is at least one element in the > array > * when the array itself is non NULL. > */ > @@ -189,7 +199,8 @@ static inline struct tracepoint *tracepo > return; \ > \ > /* srcu can't be used from NMI */ \ And remove the above. -- Steve > - WARN_ON_ONCE(rcuidle && in_nmi()); \ > + if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\ > + return; \ > \ > /* keep srcu and sched-rcu usage consistent */ \ > preempt_disable_notrace(); \ > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -3104,6 +3104,9 @@ void __trace_stack(struct trace_array *t > return; > } > > + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY))) > + return; > + > /* >* When an NMI triggers, RCU is enabled via ct_nmi_enter(), >* but if the above rcu_is_watching() failed, then the NMI > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled
On Mon, 23 Jan 2023 21:50:12 +0100 Peter Zijlstra wrote: > All RCU disabled code should be noinstr and hence we should never get > here -- when we do, WARN about it and make sure to not actually do > tracing. > > Signed-off-by: Peter Zijlstra (Intel) > --- > arch/x86/kernel/ftrace.c |3 +++ > 1 file changed, 3 insertions(+) > > --- a/arch/x86/kernel/ftrace.c > +++ b/arch/x86/kernel/ftrace.c > @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long > if (unlikely(atomic_read(¤t->tracing_graph_pause))) > return; > > + if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip)) > + return; > + Please add this to after recursion trylock below. Although WARN_ONCE() should not not have recursion issues, as function tracing can do weird things, I rather be safe than sorry, and not have the system triple boot due to some path that might get added in the future. If rcu_is_watching() is false, it will still get by the below recursion check and warn. That is, the below check should be done before this function calls any other function. > bit = ftrace_test_recursion_trylock(ip, *parent); > if (bit < 0) > return; > -- Steve ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled
On Mon, 23 Jan 2023 16:53:04 -0500 Steven Rostedt wrote: > On Mon, 23 Jan 2023 21:50:12 +0100 > Peter Zijlstra wrote: > > > All RCU disabled code should be noinstr and hence we should never get > > here -- when we do, WARN about it and make sure to not actually do > > tracing. > > > > Signed-off-by: Peter Zijlstra (Intel) > > --- > > arch/x86/kernel/ftrace.c |3 +++ > > 1 file changed, 3 insertions(+) > > > > --- a/arch/x86/kernel/ftrace.c > > +++ b/arch/x86/kernel/ftrace.c > > @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long > > if (unlikely(atomic_read(¤t->tracing_graph_pause))) > > return; > > > > + if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip)) > > + return; > > + > > Please add this to after recursion trylock below. Although WARN_ONCE() > should not not have recursion issues, as function tracing can do weird > things, I rather be safe than sorry, and not have the system triple boot > due to some path that might get added in the future. > > If rcu_is_watching() is false, it will still get by the below recursion > check and warn. That is, the below check should be done before this > function calls any other function. > > > bit = ftrace_test_recursion_trylock(ip, *parent); > > if (bit < 0) > > return; > > > Actually, perhaps we can just add this, and all you need to do is create and set CONFIG_NO_RCU_TRACING (or some other name). This should cover all ftrace locations. (Uncompiled). -- Steve diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index c303f7a114e9..10ee3fbb9113 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,6 +135,22 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif +#ifdef CONFIG_NO_RCU_TRACING +# define trace_warn_on_no_rcu(ip) \ + ({ \ + bool __ret = false; \ + if (!trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \ + trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \ + __ret = WARN_ONCE(!rcu_is_watching(), \ + "RCU not on for: %pS\n", (void *)ip); \ + trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \ + } \ + __ret; \ + }) +#else +# define trace_warn_on_no_rcu(ip) false +#endif + /* * Preemption is promised to be disabled when return bit >= 0. */ @@ -144,6 +160,9 @@ static __always_inline int trace_test_and_set_recursion(unsigned long ip, unsign unsigned int val = READ_ONCE(current->trace_recursion); int bit; + if (trace_warn_on_no_rcu(ip)) + return -1; + bit = trace_get_context_bit() + start; if (unlikely(val & (1 << bit))) { /* ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization