Re: [RFC PATCH 4/6] ftrace: Allow stealing pages from pipe buffer

2012-07-30 Thread Steven Rostedt
On Tue, 2012-07-24 at 11:37 +0900, Yoshihiro YUNOMAE wrote:
> From: Masami Hiramatsu 
> 
> Use generic steal operation on pipe buffer to allow stealing
> ring buffer's read page from pipe buffer.
> 
> Note that this could reduce the performance of splice on the
> splice_write side operation without affinity setting.
> Since the ring buffer's read pages are allocated on the
> tracing-node, but the splice user does not always execute
> splice write side operation on the same node. In this case,
> the page will be accessed from the another node.
> Thus, it is strongly recommended to assign the splicing
> thread to corresponding node.
> 
> Signed-off-by: Masami Hiramatsu 

Acked-by: Steven Rostedt 

-- Steve


> Cc: Frederic Weisbecker 
> Cc: Ingo Molnar 
> ---
> 
>  kernel/trace/trace.c |8 +---
>  1 files changed, 1 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index a120f98..ae01930 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -4194,12 +4194,6 @@ static void buffer_pipe_buf_release(struct 
> pipe_inode_info *pipe,
>   buf->private = 0;
>  }
>  
> -static int buffer_pipe_buf_steal(struct pipe_inode_info *pipe,
> -  struct pipe_buffer *buf)
> -{
> - return 1;
> -}
> -
>  static void buffer_pipe_buf_get(struct pipe_inode_info *pipe,
>   struct pipe_buffer *buf)
>  {
> @@ -4215,7 +4209,7 @@ static const struct pipe_buf_operations 
> buffer_pipe_buf_ops = {
>   .unmap  = generic_pipe_buf_unmap,
>   .confirm= generic_pipe_buf_confirm,
>   .release= buffer_pipe_buf_release,
> - .steal  = buffer_pipe_buf_steal,
> + .steal  = generic_pipe_buf_steal,
>   .get= buffer_pipe_buf_get,
>  };
>  
> 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 2/6] virtio/console: Add a failback for unstealable pipe buffer

2012-08-09 Thread Steven Rostedt
On Thu, 2012-08-09 at 18:24 +0900, Masami Hiramatsu wrote:

> Yeah, it is really easy to fix that.
> But out of curiosity, would that be really a problem?
> I guess that host can access any guest page if need. If that
> is right, is that really insecure to leak randomly allocated
> unused page to the host?

Yeah, it's like protecting userspace pages from the kernel ;-)

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/5] trace-cmd: Use TRACE_DIR envrionment variable if defined

2012-08-22 Thread Steven Rostedt
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
> From: Masami Hiramatsu 
> 
> Use TRACE_DIR environment variable for setting

TRACING_DIR would be better, as we are searching for /debug/tracing and
not /debug/trace. Perhaps DEBUG_TRACING_DIR would be even better as to
be less of a generic term.

-- Steve

> debugfs/tracing directory if defined. This is
> for controlling guest(or remote) ftrace.
> 
> Signed-off-by: Masami Hiramatsu 
> Signed-off-by: Yoshihiro YUNOMAE 
> ---
> 
>  trace-util.c |9 +
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/trace-util.c b/trace-util.c
> index e128188..d5a3eb4 100644
> --- a/trace-util.c
> +++ b/trace-util.c
> @@ -311,6 +311,15 @@ char *tracecmd_find_tracing_dir(void)
>   char type[100];
>   FILE *fp;
>   
> + tracing_dir = getenv("TRACE_DIR");
> + if (tracing_dir) {
> + tracing_dir = strdup(tracing_dir);
> + if (!tracing_dir)
> + die("malloc");
> + warning("Use environmental tracing directory: %s\n", 
> tracing_dir);
> + return tracing_dir;
> + }
> +
>   if ((fp = fopen("/proc/mounts","r")) == NULL) {
>   warning("Can't open /proc/mounts for read");
>   return NULL;
> 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs

2012-08-22 Thread Steven Rostedt
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
> From: Masami Hiramatsu 
> 
> Count debugfs/tracing/per_cpu/cpu* to determine the
> number of CPUs.

I'm curious, do you find that sysconf doesn't return the # of CPUs the
system has? I've had boxes where the per_cpu/cpu* had more cpus than the
box actually holds. But this was a bug in the kernel, not the tool. This
change log needs to have rational instead of just explaining what the
patch does.

Thanks,

-- Steve

> 
> Signed-off-by: Masami Hiramatsu 
> Signed-off-by: Yoshihiro YUNOMAE 
> ---


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/5] trace-cmd: Support trace-agent of virtio-trace

2012-08-22 Thread Steven Rostedt
On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
> Add read path and control path to use trace-agent of virtio-trace.
> When we use trace-agent, trace-cmd will be used as follows:
>   # AGENT_READ_DIR=/tmp/virtio-trace/tracing \
> AGENT_CTL=/tmp/virtio-trace/agent-ctl-path.in \
> TRACING_DIR=/tmp/virtio-trace/debugfs/tracing \\

Ha! You used "TRACING_DIR" but patch one introduces TRACE_DIR. Lets
change this to DEBUG_TRACING_DIR instead anyway.

Also, I don't like the generic environment variables. Perhaps
VIRTIO_TRACE_DIR, or AGENT_TRACE_DIR and AGENT_TRACE_CTL. Lets try to
keep the environment namespace sparse.

> trace-cmd record -e "sched:*"
> Here, AGENT_READ_DIR is the path for a reading directory of virtio-trace,
> AGENT_CTL is a control path of trace-agent, and TRACING_DIR is a debugfs path
> of a guest.
> 
> Signed-off-by: Yoshihiro YUNOMAE 
> ---
> 
>  trace-cmd.h  |1 +
>  trace-recorder.c |   57 
> +-
>  trace-util.c |   18 +
>  3 files changed, 75 insertions(+), 1 deletions(-)
> 
> diff --git a/trace-cmd.h b/trace-cmd.h
> index f904dc5..75506ed 100644
> --- a/trace-cmd.h
> +++ b/trace-cmd.h
> @@ -72,6 +72,7 @@ static inline int tracecmd_host_bigendian(void)
>  }
>  
>  char *tracecmd_find_tracing_dir(void);
> +char *guest_agent_tracing_read_dir(void);
>  
>  /* --- Opening and Reading the trace.dat file --- */
>  
> diff --git a/trace-recorder.c b/trace-recorder.c
> index 215affc..3b750e9 100644
> --- a/trace-recorder.c
> +++ b/trace-recorder.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "trace-cmd.h"
>  
> @@ -43,6 +44,8 @@ struct tracecmd_recorder {
>   int page_size;
>   int cpu;
>   int stop;
> + int ctl_fd;
> + boolagent_existing;

Thanks for the reminder. I need to convert a lot to use 'bool' instead.

>  };
>  
>  void tracecmd_free_recorder(struct tracecmd_recorder *recorder)
> @@ -59,11 +62,29 @@ void tracecmd_free_recorder(struct tracecmd_recorder 
> *recorder)
>   free(recorder);
>  }
>  
> +static char *use_trace_agent_dir(char *ctl_path,
> + struct tracecmd_recorder *recorder)
> +{
> + ctl_path = strdup(ctl_path);
> + if (!ctl_path)
> + die("malloc");
> + warning("Use environmental control path: %s\n", ctl_path);

s/Use/Using/

-- Steve

> +
> + recorder->ctl_fd = open(ctl_path, O_WRONLY);
> + if (recorder->ctl_fd < 0)
> + return NULL;
> +
> + recorder->agent_existing = true;
> +
> + return guest_agent_tracing_read_dir();
> +}
> +


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/5] trace-cmd: Use tracing directory to count CPUs

2012-08-23 Thread Steven Rostedt
On Thu, 2012-08-23 at 12:00 +0900, Masami Hiramatsu wrote:
> (2012/08/23 11:01), Masami Hiramatsu wrote:
> > (2012/08/22 22:41), Steven Rostedt wrote:
> >> On Wed, 2012-08-22 at 17:43 +0900, Yoshihiro YUNOMAE wrote:
> >>> From: Masami Hiramatsu 
> >>>
> >>> Count debugfs/tracing/per_cpu/cpu* to determine the
> >>> number of CPUs.
> >>
> >> I'm curious, do you find that sysconf doesn't return the # of CPUs the
> >> system has?
> > 
> > No, sysconf returns the number of hosts CPUs, not guests.
> > 
> >> I've had boxes where the per_cpu/cpu* had more cpus than the
> >> box actually holds. But this was a bug in the kernel, not the tool. This
> >> change log needs to have rational instead of just explaining what the
> >> patch does.
> > 
> > Ah, I see. Hmm, then this should be enabled by a command line
> > option or an environment variable.
> 
> Oops, I misunderstood. I'll add more comment for why this
> should be tried instead of sysconf.

And now that I understand why you are doing this, why not only do this
if the TRACE_AGENT or DEBUG_TRACING_DIR is defined. That is, if we are
doing it against a bare metal system, then sysconf should suffice, but
if we are tracing against a guest, then it should use the tracing
directory to determine the buffers.

We could add options to override this, but I would think the default
should just Do The Right Thing(tm).

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Readonly GDT

2013-04-09 Thread Steven Rostedt
On Tue, 2013-04-09 at 17:43 -0700, H. Peter Anvin wrote:
> OK, thinking about the GDT here.
> 
> The GDT is quite small -- 256 bytes on i386, 128 bytes on x86-64.  As
> such, we probably don't want to allocate a full page to it for only
> that.  This means that in order to create a readonly mapping we have to
> pack GDTs from different CPUs together in the same pages, *or* we
> tolerate that other things on the same page gets reflected in the same
> mapping.

What about grouping via nodes?

> 
> However, the packing solution has the advantage of reducing address
> space consumption which matters on 32 bits: even on i386 we can easily
> burn a megabyte of address space for 4096 processors, but burning 16
> megabytes starts to hurt.

Having 4096 32 bit processors, you deserve what you get. ;-)

-- Steve

> 
> It would be important to measure the performance impact on task switch,
> though.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] tracing/events: Add bounce tracing to swiotbl-xen

2013-08-22 Thread Steven Rostedt
On Thu, 22 Aug 2013 22:47:28 +0100
Zoltan Kiss  wrote:


>  /*
>   * Used to do a quick range check in swiotlb_tbl_unmap_single and
>   * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
> this
> @@ -358,6 +362,9 @@ dma_addr_t xen_swiotlb_map_page(struct device *dev, 
> struct page *page,
>   /*
>* Oh well, have to allocate and map a bounce buffer.
>*/
> +
> + trace_bounced(dev, dev_addr, size, swiotlb_force);

Please use a more specific name. "bounce" is too generic. I know
tracepoints are grouped by systems, but its easier for tools to just
state a tracepoint name than the system:event pair.

trace_xen_bounced()
 or
trace_swiotlb_bounced()

Something other than just "bounced".

Thanks!

-- Steve

> +
>   map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
>   if (map == SWIOTLB_MAP_ERROR)
>   return DMA_ERROR_CODE;
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCHv2] tracing/events: Add bounce tracing to swiotbl

2013-09-25 Thread Steven Rostedt
On Wed, 25 Sep 2013 13:56:49 -0400
Konrad Rzeszutek Wilk  wrote:

> .. snip..
> > >+  TP_printk("dev_name: %s dma_mask=%llx dev_addr=%llx "
> > >+  "size=%zu swiotlb_force=%x",
> > >+  __get_str(dev_name),
> > >+  __entry->dma_mask,
> > >+  (unsigned long long)__entry->dev_addr,
> > >+  __entry->size,
> > >+  __entry->swiotlb_force)
> 
> Would it make sense to do something like this:
> 
>   __entry->swiotlb_force ? "swiotlb_force" : "")

I think that's fine. I do believe that the libtraceevents can parse the
"?:" syntax.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH -tip RFC 0/2] kprobes: introduce NOKPROBE_SYMBOL() and prohibit probing on .entry.text

2013-11-11 Thread Steven Rostedt
On Tue, 12 Nov 2013 02:18:53 +0900
Masami Hiramatsu  wrote:
> 
> > After that we can convert all the rest, probably as part of this series.
> 
> OK, I'll do. :)
> BTW, converting all the __kprobes involves many archs, which
> kprobes ported. In that case, which mailing-list would better me
> to post the series, linux-arch?

I would add linux-arch.

Note, you may need to support both ways for the current time being, as
new __kprobes are being added (I've seen several in patches flying by
in LKML).

But perhaps at a later -rc we convert the rest and discontinue them.
That way, linux-next will break all new __kprobes, and that should get
them fixed before we enter 3.14-rc.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH -tip RFC v2 01/22] kprobes: Prohibit probing on .entry.text code

2013-11-15 Thread Steven Rostedt
On Fri, 15 Nov 2013 04:53:18 +
Masami Hiramatsu  wrote:

> .entry.text is a code area which is used for interrupt/syscall
> entries, and there are many sensitive codes.
> Thus, it is better to prohibit probing on all of such codes
> instead of a part of that.
> Since some symbols are already registered on kprobe blacklist,
> this also removes them from the blacklist.

This change only works with x86. On other archs, I get this:

kernel/built-in.o: In function `register_kprobe':
(.kprobes.text+0x9f4): undefined reference to `__entry_text_start'
kernel/built-in.o: In function `register_kprobe':
(.kprobes.text+0x9f8): undefined reference to `__entry_text_end'
make[1]: *** [vmlinux] Error 1
make: *** [sub-make] Error 2

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH -tip v3 00/23] kprobes: introduce NOKPROBE_SYMBOL() and general cleaning of kprobe blacklist

2013-11-20 Thread Steven Rostedt
On Wed, 20 Nov 2013 12:36:00 -0500
"Frank Ch. Eigler"  wrote:

> Hi -
> 
> > > Does this new blacklist cover enough that the kernel now survives a 
> > > broadly wildcarded perf-probe, e.g. over e.g. all of its kallsyms?
> > 
> > That's generally the purpose of the annotations - if it doesn't then 
> > that's a bug.
> 
> AFAIK, no kernel since kprobes was introduced has ever stood up to
> that test.  perf probe lacks the wildcarding powers of systemtap, so
> one needs to resort to something like:
> 
> # cat /proc/kallsyms | grep ' [tT] ' | while read addr type symbol; do
>perf probe $symbol
> done

I'm curious to why one would do that. IIUC, perf now has function
tracing support.

-- Steve

> 
> then wait for a few hours for that to finish. Then, or while the loop
> is still running, run
> 
> # perf record -e 'probe:*' -aR sleep 1
> 
> to take a kernel down.
> 
> 
> - FChE

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-05-25 Thread Steven Rostedt

Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
after enabling function tracer. I asked him to bisect the functions within
available_filter_functions, which he did and it came down to three:

 _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()

It was found that this is only an issue when noreplace-paravirt is added to
the kernel command line.

This means that those functions are most likely called within critical
sections of the funtion tracer, and must not be traced.

In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
longer an issue. But both _paravirt_ident_{32,64}() causes the following
splat when they are traced:

 mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
 mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
 mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
 mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
 NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
 Modules linked in: e1000e
 CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
05/07/2012
 task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
 RIP: 0010:[]  [] 
queued_spin_lock_slowpath+0x118/0x1a0
 RSP: 0018:8800d4aefb90  EFLAGS: 0246
 RAX:  RBX:  RCX: 88011eb16d40
 RDX: 82485760 RSI: 1f288820 RDI: ea008030
 RBP: 8800d4aefb90 R08: 000c R09: 
 R10: 821c8e0e R11:  R12: 88200fb8
 R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
 FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
 Stack:
  8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
  8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
  0015 8800d2276050 8000c0fd8867 ea008030
 Call Trace:
  [] _raw_spin_lock+0x27/0x30
  [] handle_pte_fault+0x13db/0x16b0
  [] ? function_trace_call+0x15b/0x180
  [] ? handle_pte_fault+0x5/0x16b0
  [] handle_mm_fault+0x312/0x670
  [] ? find_vma+0x68/0x70
  [] __do_page_fault+0x1b1/0x4e0
  [] do_page_fault+0x22/0x30
  [] page_fault+0x28/0x30
  [] ? copy_user_enhanced_fast_string+0x5/0x10
  [] ? seq_read+0x305/0x370
  [] __vfs_read+0x28/0xe0
  [] ? __vfs_read+0x5/0xe0
  [] ? __vfs_read+0x5/0xe0
  [] vfs_read+0x86/0x130
  [] SyS_read+0x46/0xa0
  [] entry_SYSCALL_64_fastpath+0x1e/0xa8
 Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
 c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
 4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b

Reported-by: Łukasz Daniluk 
Signed-off-by: Steven Rostedt 

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f08ac28b8136..f975d226be6e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
  ".popsection");
 
 /* identity function, which can be inlined */
-u32 _paravirt_ident_32(u32 x)
+u32 notrace _paravirt_ident_32(u32 x)
 {
return x;
 }
 
-u64 _paravirt_ident_64(u64 x)
+u64 notrace _paravirt_ident_64(u64 x)
 {
return x;
 }
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v3 21/27] x86/ftrace: Adapt function tracing for PIE support

2018-05-24 Thread Steven Rostedt
On Thu, 24 May 2018 13:40:24 +0200
Petr Mladek  wrote:

> On Wed 2018-05-23 12:54:15, Thomas Garnier wrote:
> > When using -fPIE/PIC with function tracing, the compiler generates a
> > call through the GOT (call *__fentry__@GOTPCREL). This instruction
> > takes 6 bytes instead of 5 on the usual relative call.
> > 
> > If PIE is enabled, replace the 6th byte of the GOT call by a 1-byte nop
> > so ftrace can handle the previous 5-bytes as before.
> > 
> > Position Independent Executable (PIE) support will allow to extended the
> > KASLR randomization range below the -2G memory limit.
> > 
> > Signed-off-by: Thomas Garnier 
> > ---
> >  arch/x86/include/asm/ftrace.h   |  6 +++--
> >  arch/x86/include/asm/sections.h |  4 
> >  arch/x86/kernel/ftrace.c| 42 +++--
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index c18ed65287d5..8f2decce38d8 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -25,9 +25,11 @@ extern void __fentry__(void);
> >  static inline unsigned long ftrace_call_adjust(unsigned long addr)
> >  {
> > /*
> > -* addr is the address of the mcount call instruction.
> > -* recordmcount does the necessary offset calculation.
> > +* addr is the address of the mcount call instruction. PIE has always a
> > +* byte added to the start of the function.
> >  */
> > +   if (IS_ENABLED(CONFIG_X86_PIE))
> > +   addr -= 1;  
> 
> This seems to modify the address even for modules that are _not_ compiled with
> PIE, see below.

Can one load a module not compiled for PIE in a kernel with PIE?

> 
> > return addr;
> >  }
> >  
> > diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> > index 01ebcb6f263e..73b3c30cb7a3 100644
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -135,6 +135,44 @@ ftrace_modify_code_direct(unsigned long ip, unsigned 
> > const char *old_code,
> > return 0;
> >  }
> >  
> > +/* Bytes before call GOT offset */
> > +const unsigned char got_call_preinsn[] = { 0xff, 0x15 };
> > +
> > +static int
> > +ftrace_modify_initial_code(unsigned long ip, unsigned const char *old_code,
> > +  unsigned const char *new_code)
> > +{
> > +   unsigned char replaced[MCOUNT_INSN_SIZE + 1];
> > +
> > +   ftrace_expected = old_code;
> > +
> > +   /*
> > +* If PIE is not enabled or no GOT call was found, default to the
> > +* original approach to code modification.
> > +*/
> > +   if (!IS_ENABLED(CONFIG_X86_PIE) ||
> > +   probe_kernel_read(replaced, (void *)ip, sizeof(replaced)) ||
> > +   memcmp(replaced, got_call_preinsn, sizeof(got_call_preinsn)))
> > +   return ftrace_modify_code_direct(ip, old_code, new_code);  
> 
> And this looks like an attempt to handle modules compiled without
> PIE. Does it works with the right ip in that case?

I'm guessing the || is for the "or no GOT call was found", but it
doesn't explain why no GOT would be found.

> 
> I wonder if a better solution would be to update
> scripts/recordmcount.c to store the incremented location into the module.

If recordmcount.c can handle this, then I think that's the preferred
approach. Thanks!

-- Steve

> 
> IMPORTANT: I have only vague picture about how this all works. It is
> possible that I am completely wrong. The code might be correct,
> especially if you tested this situation.
> 
> Best Regards,
> Petr

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: net-next boot error

2018-07-26 Thread Steven Rostedt


[ Added Thomas Gleixner ]


On Thu, 26 Jul 2018 11:34:39 +0200
Dmitry Vyukov  wrote:

> On Thu, Jul 26, 2018 at 11:29 AM, syzbot
>  wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:dc66fe43b7eb rds: send: Fix dead code in rds_sendmsg
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=127874c840
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=f34ce142a9f5f0e8
> > dashboard link: https://syzkaller.appspot.com/bug?extid=604f8271211546f5b3c7
> > compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> >
> > Unfortunately, I don't have any reproducer for this crash yet.
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+604f8271211546f5b...@syzkaller.appspotmail.com
> >
> > possible deadlock in static_key_slow_incsd 0:0:1:0: [sda] Attached SCSI disk
> > MACsec IEEE 802.1AE
> > tun: Universal TUN/TAP device driver, 1.6
> >
> > 
> > WARNING: possible recursive locking detected  
> 
> +Tetsuo, perhaps this boot lockdep problem then disables lockdep for
> actual testing. I think lockdep should respect panic_on_warn.
> 
> 
> > 4.18.0-rc6+ #141 Not tainted
> > 
> > swapper/0/1 is trying to acquire lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >
> > but task is already holding lock:
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> > (ptrval) (cpu_hotplug_lock.rw_sem){}, at: init_vqs+0xe1a/0x1520
> > drivers/net/virtio_net.c:2777

Here init_vqs() does:

get_online_cpus();
virtnet_set_affinity(vi);
put_online_cpus();

Which disables cpu hotplug and calls virtnet_set_affinity()

Note, get_online_cpus() is no longer recursive.

> >
> > other info that might help us debug this:
> >  Possible unsafe locking scenario:
> >
> >CPU0
> >
> >   lock(cpu_hotplug_lock.rw_sem);
> >   lock(cpu_hotplug_lock.rw_sem);
> >
> >  *** DEADLOCK ***
> >
> >  May be due to missing lock nesting notation
> >
> > 3 locks held by swapper/0/1:
> >  #0: (ptrval) (&dev->mutex){}, at: device_lock
> > include/linux/device.h:1134 [inline]
> >  #0: (ptrval) (&dev->mutex){}, at: __driver_attach+0x15f/0x2f0
> > drivers/base/dd.c:820
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at: get_online_cpus
> > include/linux/cpu.h:126 [inline]
> >  #1: (ptrval) (cpu_hotplug_lock.rw_sem){}, at:
> > init_vqs+0xe1a/0x1520 drivers/net/virtio_net.c:2777
> >  #2: (ptrval) (xps_map_mutex){+.+.}, at:
> > __netif_set_xps_queue+0x243/0x23f0 net/core/dev.c:2278
> >
> > stack backtrace:
> > CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.18.0-rc6+ #141
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
> >  print_deadlock_bug kernel/locking/lockdep.c:1765 [inline]
> >  check_deadlock kernel/locking/lockdep.c:1809 [inline]
> >  validate_chain kernel/locking/lockdep.c:2405 [inline]
> >  __lock_acquire.cold.65+0x1fb/0x486 kernel/locking/lockdep.c:3435
> >  lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
> >  percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 [inline]
> >  percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> >  cpus_read_lock+0x43/0xa0 kernel/cpu.c:289
> >  static_key_slow_inc+0x12/0x30 kernel/jump_label.c:124
> >  __netif_set_xps_queue+0xaac/0x23f0 net/core/dev.c:2320


__netif_set_xps_queue() calls static_key_slow_inc() which will also do
a get_online_cpus() which will trigger this bug.

There's a static_key_slow_inc_cpuslocked() version that should be used
when get_online_cpus() is already taken, but I see
__netif_set_xps_queue() is called from several places, and I doubt it
is always called with get_online_cpus() held. Thus just using the
cpuslocked() version is probably not sufficient of a fix.

I don't know the code enough to offer other suggestions.

-- Steve


> >  netif_set_xps_queue+0x26/0x30 net/core/dev.c:2455
> >  virtnet_set_affinity+0x2ba/0x4b0 drivers/net/virtio_net.c:1944
> >  init_vqs+0xe22/0x1520 drivers/net/virtio_net.c:2778
> >  virtnet_probe+0x1092/0x2260 drivers/net/virtio_net.c:3016
> >  virtio_dev_probe+0x592/0x942 drivers/virtio/virtio.c:245
> >  really_probe drivers/base/dd.c:446 [inline]
> >  driver_probe_device+0x6ad/0x970 drivers/base/dd.c:588
> >  __driver_attach+0x28b/0x2f0 drivers/base/dd.c:822
> >  bus_for_each_dev+0x15d/0x1f0 drivers/base/bus.c:311
> >  driver_attach+0x3d/0x50 drivers/base/dd.c:841
> >  bus_add_driver+0x4b2/0x600 drivers/base/bus.c:667
> >  driver_register+0x1c8/0x320 drivers/base/driver.c:170
> >  register_virtio_driver+0x79/0xd0 drivers/virtio/

[PATCH] x86/vmware: Do not trace vmware_sched_clock()

2018-11-09 Thread Steven Rostedt
From: Steven Rostedt (VMware) 

When running function tracing on a Linux guest running on VMware
Workstation, the guest would crash. This is due to tracing of the
sched_clock internal call of the VMware vmware_sched_clock(), which
causes an infinite recursion within the tracing code (clock calls must
not be traced).

Make vmware_sched_clock() not traced by ftrace.

Cc: sta...@vger.kernel.org
Fixes: 80e9a4f21fd7c ("x86/vmware: Add paravirt sched clock")
Reported-by: GwanYeong Kim 
Signed-off-by: Steven Rostedt (VMware) 
---
diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index d9ab49bed8af..0eda91f8eeac 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -77,7 +77,7 @@ static __init int setup_vmw_sched_clock(char *s)
 }
 early_param("no-vmw-sched-clock", setup_vmw_sched_clock);
 
-static unsigned long long vmware_sched_clock(void)
+static unsigned long long notrace vmware_sched_clock(void)
 {
unsigned long long ns;
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [patch 05/26] Xen-paravirt_ops: paravirt_ops: hooks to set up initial pagetable

2007-03-16 Thread Steven Rostedt
On Fri, 2007-03-16 at 11:39 -0700, Jeremy Fitzhardinge wrote:
> Ingo Molnar wrote:
> >> +  /* Make sure kernel address space is empty so that a pagetable
> >> + will be allocated for it. */
> >> 
> >
> > comment style.
> >   
> 
> As you've noticed its a comment style I use quite often.  I use it for
> what's essentially a little local comment, rather than a block comment
> which really needs to draw attention to itself.  In this case, I think using
> 

Understood, but then you need to make that comment a single line :)

The way Linux code prefers comments (like the code itself actually
prefers it!) is single line comments stay on a single line, and
multiline comments are always block.

So it's either:

   /* my little comment here */

or 

   /*
* This comment, although small and to the point, does span multiple
* lines and regardless, needs to be a block comment.
*/

I guess it just makes it easier on the eyes when everything conforms and
others need to look at / maintain your code.


-- Steve


___
Virtualization mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: todo

2007-03-16 Thread Steven Rostedt
[added Ingo since he'd probably be interested in this]

On Fri, 2007-03-16 at 21:00 -0700, Jeremy Fitzhardinge wrote:
> Chris Wright wrote:
> > * Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
> >   
> >> Chris Wright wrote:
> >> 
> >>> Consistently wrap paravirt ops callsites
> >>>   "ugh" - mingo
> >>>   
> >> Had a thought.  What if we do a kind of reverse/two-way module linkage? 
> >> Somehow compile each pv-op implementation like a module, and then link
> >> the appropriate one in at boot time.
> >> 
> >
> > This is very similar to something Steve was chatting with me about
> > this morning.  The idea he was tossing around was something a bit like
> > an initrd that a load_module analog could link up.  In a sense, it's
> > similar to the VMI ROM, with the exceptions that the ABI is just created
> > by the compiler from a normal mutable kernel API and it's linkage with
> > symbols available on both sides.
> >   
> 
> Yeah.  It would have to happen a lot earlier than initrd.  It would be
> more like a multiboot kernel module or something.

I was taking a look at kernel/module.c and arch/i386/kernel/module.c and
it looks like we can copy/modify the former and use part of the later.

> 
> >> Tricky parts: it would need two-way unresolved references between kernel
> >> and module, and it would need to be able to run very early in the
> >> kernel's life.
> >> 
> >
> > This is the tricky part, and where Steve and I left off.
> >   
> 

I think it's possible. It still be tricky. We need to handle percpu
variables.  But I have some ideas on how to do it though.

> Fortunately the linker code should be pretty easy to make
> self-contained.   It shouldn't need to do memory allocations or anything
> complex like that, so I think its just a matter of grovelling around and
> fixing up linkages.

Well, we still need to make sure the kernel knows about this image, so
that it doesn't write over it.  I guess that can be done like the initrd
is.

> 
> > I suspect we could free the unused backends already.
> 
> We could; we just need to make sure they get their own section so
> they're easily separable.
> 
> >   It also has one
> > negative side-effect, which is promoting external module code that links
> > with the kernel.  IOW, there's much less incentive to get code merged
> > if it's just a matter of linking.
> 
> It wouldn't be something we'd promote by making it easy to bind
> out-of-tree code to the interface.  And it would still be a
> kernel-version-specific ABI; no guarantees of stability.

It will be very similar to how we handle modules (and out of tree
modules).

-- Steve


___
Virtualization mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt
This is taken from the work I did on lguest64.

When killing a guest, we read the guest stack to do a nice back trace of
the guest and send it via printk to the host.

So instead of just getting an error message from the lguest launcher of:

lguest: bad read address 537012178 len 1

I also get in my dmesg:

called from  [] show_trace_log_lvl+0x1a/0x2f
 [] show_trace+0x12/0x14
 [] dump_stack+0x16/0x18
 [] lguest_dump_lg_regs+0x22/0x13c [lg]
 [] lgread+0x59/0x90 [lg]
 [] run_guest+0x26b/0x406 [lg]
 [] read+0x73/0x7d [lg]
 [] vfs_read+0xad/0x161
 [] sys_read+0x3d/0x61
 [] syscall_call+0x7/0xb
 ===
[] lgread+0x59/0x90 [lg]
Printing LG 0 regs cr3: 021eb000
EIP: 0061:  [<e00227d2>]
ESP: 0069:c236fe3c  EFLAGS: 00010202
EAX: 0004 EBX: e001fb20 ECX: 0008 EDX: 03f2
ESI: e001ee00 EDI: e001fb60 EBP: c236fea0
 CR2: 1278000  lguest_data->cr2: 80011380
errcode: 0   trapnum: d
Stack Dump:
 [<c1042b7a>] trace_hardirqs_on+0x125/0x149
 [<c123b0ea>] wait_for_completion+0x90/0x98
 [<c123bddc>] __mutex_unlock_slowpath+0x129/0x13e
 [<c1048769>] unlock_cpu_hotplug+0x62/0x64
 [<c104b5b6>] sys_init_module+0x14e3/0x162c
 [<c1081d44>] do_sync_read+0xc2/0xff
 [<c1004f85>] restore_nocheck+0x12/0x15
 [<c1004f34>] syscall_call+0x7/0xb


TODO:

  - Clean up a little (still has stuff from lguest64 in it).
  - Perhaps make a config option or runtime switch to turn it off.
  - Send to the launcher the dump instead of printk.
  - make modules work too.

Also I need to change the %u of the bad read print to a %x, because
seeing 0x200227d2 is better than seeing 537012178 for addresses.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/Makefile
+++ linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_LGUEST_GUEST) += lguest.o l
 # Host requires the other files, which can be a module.
 obj-$(CONFIG_LGUEST)   += lg.o
 lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
-   segments.o io.o lguest_user.o hypervisor.o
+   segments.o io.o lguest_user.o hypervisor.o lguest_debug.o
Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -210,6 +210,28 @@ int lguest_address_ok(const struct lgues
 }
 
 /* Just like get_user, but don't let guest access lguest binary. */
+u8 lgread_u8(struct lguest *lg, u32 addr)
+{
+   u8 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, "bad read address %u", addr);
+   return val;
+}
+
+u16 lgread_u16(struct lguest *lg, u32 addr)
+{
+   u16 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, "bad read address %u", addr);
+   return val;
+}
+
 u32 lgread_u32(struct lguest *lg, u32 addr)
 {
u32 val = 0;
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h
+++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
@@ -176,6 +176,8 @@ extern struct mutex lguest_lock;
 /* core.c: */
 /* Entry points in hypervisor */
 const unsigned long *__lguest_default_idt_entries(void);
+u8 lgread_u8(struct lguest *lg, u32 addr);
+u16 lgread_u16(struct lguest *lg, u32 addr);
 u32 lgread_u32(struct lguest *lg, u32 addr);
 void lgwrite_u32(struct lguest *lg, u32 val, u32 addr);
 void lgread(struct lguest *lg, void *buf, u32 addr, unsigned bytes);
@@ -238,6 +240,7 @@ int hypercall(struct lguest *info, struc
 #define kill_guest(lg, fmt...) \
 do {   \
if (!(lg)->dead) {  \
+   lguest_dump_lg_regs(lg);\
(lg)->dead = kasprintf(GFP_ATOMIC, fmt);\
if (!(lg)->dead)\
(lg)->dead = (void *)1; \
@@ -248,5 +251,11 @@ static inline unsigned long guest_pa(str
 {
return vaddr - lg->page_offset;
 }
+
+/* lguest_debug.c */
+void lguest_print_address(struct lguest *lg, unsigned long address);
+void lguest_dump_trace(struct lguest *lg, struct lguest_regs *regs);
+void lguest_dump_lg_regs(struct lguest *lg);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LGUEST_H */
Index: linux-2.6.21-rc5-mm2/drivers/lguest/l

[PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Steven Rostedt
[Bug that was found by my previous patch]

This patch allows things like modules, which don't have a direct
__pa(EIP) mapping to do emulated instructions.

Sure, the emulated instruction probably should be a paravirt_op, but
this patch lets you at least boot a kernel that has modules needing
emulated instructions.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -160,11 +160,14 @@ static int emulate_insn(struct lguest *l
 {
u8 insn;
unsigned int insnlen = 0, in = 0, shift = 0;
-   unsigned long physaddr = guest_pa(lg, lg->regs->eip);
+   unsigned long physaddr = lguest_find_guest_paddr(lg, lg->regs->eip);
 
-   /* This only works for addresses in linear mapping... */
-   if (lg->regs->eip < lg->page_offset)
+   /* FIXME: Handle physaddr's that crosses pages (modules are in VM) */
+
+   /* did we actually find the physaddr? */
+   if (physaddr == (unsigned long)-1UL)
return 0;
+
lgread(lg, &insn, physaddr, 1);
 
/* Operand size prefix means it's actually for ax. */
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h
+++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
@@ -218,6 +218,7 @@ void guest_set_pte(struct lguest *lg, un
 void map_hypervisor_in_guest(struct lguest *lg, struct lguest_pages *pages);
 int demand_page(struct lguest *info, unsigned long cr2, int write);
 void pin_page(struct lguest *lg, unsigned long vaddr);
+unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr);
 
 /* lguest_user.c: */
 int lguest_device_init(void);
Index: linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/page_tables.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c
@@ -105,6 +105,25 @@ static spte_t gpte_to_spte(struct lguest
return spte;
 }
 
+unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr)
+{
+   gpgd_t gpgd;
+   gpte_t gpte;
+   unsigned long gpte_ptr;
+
+   gpgd = mkgpgd(lgread_u32(lg, gpgd_addr(lg, vaddr)));
+   if (!(gpgd.flags & _PAGE_PRESENT))
+   return -1;
+
+   gpte_ptr = gpte_addr(lg, gpgd, vaddr);
+   gpte = mkgpte(lgread_u32(lg, gpte_ptr));
+
+   if (!(gpte.flags & _PAGE_PRESENT))
+   return -1;
+
+   return (gpte.pfn << PAGE_SHIFT) | (vaddr & (PAGE_SIZE-1));
+}
+
 /* FIXME: We hold reference to pages, which prevents them from being
swapped.  It'd be nice to have a callback when Linux wants to swap out. */
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt
Currently the lguest32 error messages from bad reads and writes prints a
decimal integer for addresses. This is pretty annoying. So this patch
changes those to be hex outputs.

This is applied on top of my debug patch.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -220,7 +220,7 @@ u8 lgread_u8(struct lguest *lg, u32 addr
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -231,7 +231,7 @@ u16 lgread_u16(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -242,7 +242,7 @@ u32 lgread_u32(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -250,7 +250,7 @@ void lgwrite_u32(struct lguest *lg, u32 
 {
if (!lguest_address_ok(lg, addr)
|| put_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad write address %u", addr);
+   kill_guest(lg, "bad write address %x", addr);
 }
 
 void lgread(struct lguest *lg, void *b, u32 addr, unsigned bytes)
@@ -259,7 +259,7 @@ void lgread(struct lguest *lg, void *b, 
|| copy_from_user(b, (void __user *)addr, bytes) != 0) {
/* copy_from_user should do this, but as we rely on it... */
memset(b, 0, bytes);
-   kill_guest(lg, "bad read address %u len %u", addr, bytes);
+   kill_guest(lg, "bad read address %x len %u", addr, bytes);
}
 }
 
@@ -268,7 +268,7 @@ void lgwrite(struct lguest *lg, u32 addr
if (addr + bytes < addr
|| !lguest_address_ok(lg, addr+bytes)
|| copy_to_user((void __user *)addr, b, bytes) != 0)
-   kill_guest(lg, "bad write address %u len %u", addr, bytes);
+   kill_guest(lg, "bad write address %x len %u", addr, bytes);
 }
 
 static void set_ts(unsigned int guest_ts)


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt
On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:

> > (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> > for %#x (or 0x%x)?  I'm easily confused.
> 
> How about "%p" for pointers?

But that would require casting the numbers to pointers.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:54 +1000, Rusty Russell wrote:

> 
>   This is a cool idea, but there are two issues with this patch.  The
> first is that it's 500 lines of code: that's around +10% on lguest's
> total code size!  The second is that it conflicts with the medium-term
> plan to allow any user to run up lguests: this is why lg.ko never
> printk()s about problems with the guest.

Not much I can do about the size, but it's in the debug section so
hopefully it's not considered too bad :)

> 
> While it is useful for cases where a guest dies mysteriously before it
> brings up the console, three alternatives come to mind:
> 
> 1) Modify early_printk so Guests can use it.
> 2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
> just have to implement guest suspend! 8)
> 3) Put this in a CONFIG_LGUEST_DEBUG.
> 
> Note that options 1 or 2 make you do more work, but are probably better
> in the long term.  I'm happy for #3 to sit as a patch in the tree for
> the duration, tho!

OK, I'll make a #3 patch to send, but the #1 looks best. Not to mention
that I still need to make it so that the console can read it.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:59 +1000, Rusty Russell wrote:
> On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote:

> Yeah, I haven't tried loading random modules but I can imagine this does
> happen (what module was it, BTW?)

I have no idea of which module it crashed on. I didn't investigate that
too much.  I could simply send a trap to guest when 
__pa(addr) != lguest_find_guest_paddr(addr) and see which module it
crashed on.

My block device I used was basically a copy of a RHEL5 system. I only
modified the inittab and fstab to get it working.  So on startup and
doing the udev init was when it crashed.

> 
> I used to have a function just like this, but managed to get rid of
> it.  
> 
> Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does
> the pgtable walk and read all in one?  It won't be efficient, but it'll
> be more correct and maybe even fewer lines 8)

I forgot that you have a goal to keep lguest small :)

Perhaps we can fork, and have lguest and lguest-lite.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] Lguest launcher, child starving parent

2007-04-05 Thread Steven Rostedt
Glauber noticed long delays between hitting a key, and seeing data come
up on the virtual console.  Looking into this, I found that the
wake_parent routine that reads from all devices was actually starving
out the parent after sending the parent a signal to wake up.

The thing is, the child which takes the console input is recognized by
the scheduler as an interactive process.  The parent, doesn't do so
much, so it is recognized more as a CPU hog. So the child easily gets a
higher priority than the parent.

So when the child receives data from the console, it sends a signal to
the parent and then does another select.  The problem is that the select
doesn't actually read from the device, and will return immediately since
their is still data pending until it is read.  But it's the parent that
reads the data.  So the child actually starves the parent from reading
the data by spinning and waiting for it to read the data.

The fix I implemented was to have the child wait for a response from the
parent before going on. Since there was already communication between
the parent and child via a pipe, I used that. This time, the data
returned by the pipe is either everything went fine (fd == -1) or the
device should be ignored (fd >= 0).

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6-lguest/Documentation/lguest/lguest.c
===
--- linux-2.6-lguest.orig/Documentation/lguest/lguest.c 2007-04-05 
16:13:08.0 -0400
+++ linux-2.6-lguest/Documentation/lguest/lguest.c  2007-04-05 
16:16:31.0 -0400
@@ -328,15 +328,15 @@ static void wake_parent(int pipefd, stru
 
for (;;) {
fd_set rfds = devices->infds;
+   int ignorefd;
 
select(devices->max_infd+1, &rfds, NULL, NULL, NULL);
-   if (FD_ISSET(pipefd, &rfds)) {
-   int ignorefd;
-   if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
-   exit(0);
-   FD_CLR(ignorefd, &devices->infds);
-   }
kill(getppid(), SIGUSR1);
+   /* wait for parent response */
+   if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0)
+   exit(0);
+   if (ignorefd >= 0)
+   FD_CLR(ignorefd, &devices->infds);
}
 }
 
@@ -594,6 +594,7 @@ static void handle_output(int fd, unsign
 static void handle_input(int fd, int childfd, struct device_list *devices)
 {
struct timeval poll = { .tv_sec = 0, .tv_usec = 0 };
+   int fdok = -1;
 
for (;;) {
struct device *i;
@@ -608,7 +609,9 @@ static void handle_input(int fd, int chi
FD_CLR(i->fd, &devices->infds);
/* Tell child to ignore it too... */
write(childfd, &i->fd, sizeof(i->fd));
-   }
+   } else
+   /* Tell child to continue */
+   write(childfd, &fdok, sizeof(fdok));
}
}
}


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: lguest over qemu

2007-07-12 Thread Steven Rostedt

>
>   (Note that lguest doesn't support NMIs, but Steven has code for NMI
> support for lguest-x86-64 which could be ported across).

Rusty,

About that.  Is there a way to get a NMI only stack in i386? In x86_64
it's part of the TSS. So I can always know I have a good stack for the
NMI. I'm not sure i368 has the same thing. Or do we always have a good
stack whenever we are in ring0?

Oh, and btw, I've just rewrote all of the Lguest64 page table handling.
I'm just going over one design change that is really bothering me. In
x86_64 we can have 2M or 4k pages (like the PSE in i386). But since 4K
pages are used by the shadow page tables, I have to map them like that.
But this means that I can have the same guest address as both a PMD and a
PTE. Which is breaking some of my code. I'm working on a fix as I write
this.

But to get you up-to-date to where I'm at. I've implemented a way to have
the HV mapped uniquely for all VCPUs.  So there's a HV text section (the
same for all VCPUs), a HV VCPU Data section (readonly in all rings with
guest cr3), and a HV VCPU Scratch Pad section (read/write in rings
0,1,and2).  So now the guest kernel runs in ring 1. With this change, I
already implemented a syscall trampoline that no longer needs to switch to
the host, as well as iretq by the guest kernel goes directly to the guest
user space (or kernel).  The next version of lguest64 will be much cleaner
and faster

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

Hi Andi,

Thanks for all the comments, it's greatly appreciated.

On Wed, 8 Aug 2007, Andi Kleen wrote:

>
> > +#define SYSRETQ\
> > +   movq%gs:pda_oldrsp,%rsp;\
> > +   swapgs; \
> > +   sysretq;
>
> When the macro does more than sysret it should have a different
> name

Noted. Do you have a better idea? Something like SETSTACK_SWAPGS_SYSRETQ?

>
>
> >   */
> > .globl int_ret_from_sys_call
> >  int_ret_from_sys_call:
> > -   cli
> > +   DISABLE_INTERRUPTS(CLBR_ANY)
>
> ANY? There are certainly some registers alive at this point like rax

Glauber will need to address this, this is his code ;-)

>
> >  retint_restore_args:
> > -   cli
> > +   DISABLE_INTERRUPTS(CLBR_ANY)
>
> Similar.
>
>
> > /*
> >  * The iretq could re-enable interrupts:
> >  */
> > @@ -566,10 +587,14 @@ retint_restore_args:
> >  restore_args:
> > RESTORE_ARGS 0,8,0
> >  iret_label:
> > -   iretq
> > +#ifdef CONFIG_PARAVIRT
> > +   INTERRUPT_RETURN
> > +ENTRY(native_iret)
>
> ENTRY adds alignment. Why do you need that export anyways?

The paravirt ops struct points to it.

>
> > +#endif
> > +1: iretq
> >
> > .section __ex_table,"a"
> > -   .quad iret_label,bad_iret
> > +   .quad 1b, bad_iret
>
> iret_label seems more expressive to me than 1

The reason for this change is because of the added:
#ifdef CONFIG_PARAVIRT
INTERRUPT_RETURN
ENTRY(native_iret)
#endif

If we are paravirt ops, we need the iretq in the exception table, not the
paravit ops function call.  Since that function call may simply call the
native_iretq, and if we take a fault at the iretq, it wont be in the
exception table.

>
> > +   ENABLE_INTERRUPTS(CLBR_NONE)
>
> In many of the CLBR_NONEs there are actually some registers free;
> but it might be safer to keep it this way. But if some client can get
> significantly better code with one or two free registers it might
> be worthwhile to investigate.

Glauber, that comment is for you.

>
> > -   swapgs
> > +   SWAPGS_NOSTACK
>
> There's still stack here

OK, bad name then. How about SWAPGS_UNTRUSTED_STACK?

>From earlier in the file where SWAPGS_NOSTACK is declared we have:


/* Currently paravirt can't handle swapgs nicely when we
 * don't have a stack.  So we either find a way around these
 * or just fault and emulate if a guest tries to call swapgs
 * directly.
 *
 * Either way, this is a good way to document that we don't
 * have a reliable stack.
 */
#define SWAPGS_NOSTACK  swapgs

>
> >  paranoid_restore\trace:
> > RESTORE_ALL 8
> > -   iretq
> > +   INTERRUPT_RETURN
>
> I suspect Xen will need much more changes anyways because of its
> ring 3 guest. Are these changes sufficient for lguest?

Probably not, but this part of the code I don't fully understand. But just
doing this doesn't break lguest.  But perhaps it only did not break it yet
;-)


Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

2007-08-08 Thread Steven Rostedt

On Wed, 8 Aug 2007, Andi Kleen wrote:

> Strictly you would at least need a !X86_VSMP dependency, but
> with the vsmp change i requested that will be unnecessary
>
> Is this really synced with the latest version of the i386 code?

Glauber started the paravirt ops 64 a second time around, from scratch
using the PVOPS of i386 as a base. But since we couldn't just take a the
PVOPS patch from i386 and apply it to x86_64, this was mainly done by
looking at i386 code and massaging it for x86_64. Somethings may have
slipped (and we may have been looking at different versions of PVOPS).
It's not easy trying to keep up with a moving target ;-)

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

On Wed, 8 Aug 2007, Andi Kleen wrote:

>
> > Probably not, but this part of the code I don't fully understand.
>
> I would suggest to defer all this until at least one example to test it
> (except vsmp which is too simple) is around.

Who uses that code? NMIs and debug regs?  Lguest only has the host handle
the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
else uses that part of the code?

We can leave the "paranoid check" change out for this series.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> Steven Rostedt <[EMAIL PROTECTED]> writes:
>
> > On Wed, 8 Aug 2007, Andi Kleen wrote:
> >
> > >
> > > > Probably not, but this part of the code I don't fully understand.
> > >
> > > I would suggest to defer all this until at least one example to test it
> > > (except vsmp which is too simple) is around.
> >
> > Who uses that code? NMIs and debug regs?  Lguest only has the host handle
> > the NMIs (doesn't pass to guest). And we haven't gotten to debug regs. Who
> > else uses that part of the code?
>
> I'm not sure I understand your question. You're asking who uses entry.S?
> Answer would be everybody. If you asked something else please reformulate.

When I said "this part of the code I don't fully understand" I was not
talking about entry.S.  I understand entry.S very well, but the comment
was originally on the paranoid_restore code. Which I thought had to deal
with NMIs and such that I didn't worry about that I simply did the
default.

>>  paranoid_restore\trace:
>>   RESTORE_ALL 8
>> - iretq
>> + INTERRUPT_RETURN
>
>I suspect Xen will need much more changes anyways because of its
>ring 3 guest. Are these changes sufficient for lguest?


The above was what I was replying to.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

On Wed, 8 Aug 2007, Steven Rostedt wrote:

> On Wed, 8 Aug 2007, Andi Kleen wrote:
>
> >>  paranoid_restore\trace:
> >>   RESTORE_ALL 8
> >> - iretq
> >> + INTERRUPT_RETURN
> >
> >I suspect Xen will need much more changes anyways because of its
> >ring 3 guest. Are these changes sufficient for lguest?
>
>
> The above was what I was replying to.

If you were talking about the general iretq => INTERRUPT_RETURN, then the
answer is "Yes, they are sufficient". The first version of lguest ran the
guest kernel in ring 3 (using dual page tables for guest kernel and guest
user).  The current version I'm pushing runs lguest in ring 1, and the
entry.S code  worked for both.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> > If you were talking about the general iretq => INTERRUPT_RETURN, then the
> > answer is "Yes, they are sufficient". The first version of lguest ran the
> > guest kernel in ring 3 (using dual page tables for guest kernel and guest
> > user).  The current version I'm pushing runs lguest in ring 1, and the
> > entry.S code  worked for both.
>
> How do you implement system calls then?

/me working very hard to get lguest64 ready for public display

Here's a snippet from my version of core.c.  I've been thinking of ways to
optimize it, but for now it works fine. This was done for both ring 3 and
ring 1 lguest versions (this is the host running):


/*
 * Update the LSTAR to point to the HV syscall handler.
 * Also update the fsbase if the guest uses one.
 */
wrmsrl(MSR_LSTAR, (unsigned long)HV_OFFSET(&lguest_syscall_trampoline));

[...]
asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
  /* The stack we pushed is off by 8, due to the
previous pushq */
  "addq $8, %%rsp"
  : "=D"(foo), "=a"(bar)
  : "i" (__KERNEL_DS), "i" (__KERNEL_CS), "0"
(vcpu->vcpu),
"1"(get_idt_table()),
"r" (sw_guest)
  : "memory", "cc");

[...]
/* restore old LSTAR */
wrmsrl(MSR_LSTAR, vcpu->host_syscall);



Also in the switcher.S (The Hypervisor):

.global lguest_syscall_trampoline
.type lguest_syscall_trampoline, @function
lguest_syscall_trampoline:
/*
 * Tricky, we don't have much to choose from here.
 * The only way to get to our stack is with swapgs.
 * but we need to save the stack too, so we have to play
 * very carefully.
 */
swapgs
/* now gs points to our VCPU Guest Data */

/* first save the stack! */
movq%rsp, %gs:LGUEST_GUEST_DATA_regs_rsp

/*
 * x86 arch doesn't have an easy way to find out where
 * gs is located. So we need to read the MSR. But first
 * we need to save off the rcx, rax and rdx.
 */
movq%rax, %gs:LGUEST_GUEST_DATA_regs_rax
movq%rdx, %gs:LGUEST_GUEST_DATA_regs_rdx
movq%rcx, %gs:LGUEST_GUEST_DATA_regs_rcx

/* Need to read manual, does rdmsr clear
 * the top 32 bits of rax? */
xor %rax, %rax

movl$MSR_GS_BASE,%ecx
rdmsr
shl $32, %rdx
orq %rax, %rdx

movq%rdx, %rsp

/* see if we need to disable interrupts */
testq   $(1<<9),  %gs:LGUEST_GUEST_DATA_SFMASK
jz  1f
movq$0, %gs:LGUEST_GUEST_DATA_irq_enabled
jmp 2f
1:
/* Still need to clear bit 10 (just in case) */
/*  (see lguest_iretq)   */
movq$(1<<10), %rax
not %rax
andq%rax, %gs:LGUEST_GUEST_DATA_irq_enabled
2:
/* put back the generic regs */
movq%gs:LGUEST_GUEST_DATA_regs_rdx, %rdx
movq%gs:LGUEST_GUEST_DATA_regs_rcx, %rcx
movq%gs:LGUEST_GUEST_DATA_regs_rax, %rax

/* Is this a hypercall? */
testq   $1, %gs:LGUEST_GUEST_DATA_is_hc
jnz handle_hcall

/* We have 64 bytes to play with */
addq$LGUEST_GUEST_DATA_regs, %rsp

/* do the swapgs if possible */
testq   $1, %gs:LGUEST_GUEST_DATA_do_swapgs
je  1f

/* We now have a stack to use */
/* go back to the guest's gs */
swapgs
DO_SWAPGS_USE_STACK
/* and then back to HV gs */
swapgs
1:
/*
 * The stack has 64 bytes of playing room.
 * Which is enough to do a jump to the guest kernel.
 * We store the guest LSTAR register in the scatch pad
 * because we don't care if the guest messes with it.
 * If it is a bad address, we fault from the guest side
 * and we kill the guest. No harm done to the host.
 */
pushq   $(__KERNEL_DS | 1)
pushq   %gs:LGUEST_GUEST_DATA_regs_rsp
pushfq
/* Make sure we have actual interrupts on */
orq $(1<<9), 0(%rsp)
pushq   $(__KERNEL_CS | 1)
pushq   %gs:LGUEST_GUEST_DATA_LSTAR

swapgs
iretq



NOTE: This is still under development, since I'm going with a new design
change to try to stay more in sync with lguest32.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> > When I said "this part of the code I don't fully understand" I was not
> > talking about entry.S.  I understand entry.S very well, but the comment
> > was originally on the paranoid_restore code. Which I thought had to deal
> > with NMIs and such that I didn't worry about that I simply did the
> > default.
>
> The paranoid path is used for more than just NMIs; it's also used for MCEs,
> stack faults, double faults or debug exceptions. Anything that might
> happen with a invalid stack or unknown GS state or system in other unknown
> state.
>
> If you can guarantee your hypervisor never injects any of those it could be 
> ignored;
> but at least losing debug exceptions would be probably not nice.

For now it does ignore them. But that's something to work on
for later versions of lguest64.  I want lguest out in the public and
working for the general case. I don't plan on ignoring the paranoid
section forever. But it's more on an anomaly for now. But you are right, I
want the debug stuff working. But I also need to walk before I can run ;-)


>
> >
> > >>  paranoid_restore\trace:
> > >>   RESTORE_ALL 8
> > >> - iretq
> > >> + INTERRUPT_RETURN
> > >
> > >I suspect Xen will need much more changes anyways because of its
> > >ring 3 guest. Are these changes sufficient for lguest?
>
> This was really a general comment not especially applying to the
> paranoid path.

Ah, OK, I assumed you were talking about just the previous code. But
rereading your post, I see it was more general.  Sorry, for the
miscommunication.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-08 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Andi Kleen wrote:

> On Wed, Aug 08, 2007 at 09:47:05AM -0400, Steven Rostedt wrote:
> >
> > [...]
> > asm volatile ("pushq %2; pushq %%rsp; pushfq; pushq %3; call *%6;"
> >   /* The stack we pushed is off by 8, due to the
> > previous pushq */
> >   "addq $8, %%rsp"
>
> Weird stack inbalance, i'm surprised that works. %6 must be doing strange
> things.

Heh, no it's very subtle. We do the "push %rsp" after the "push %2". So on
return from the call, the stack that was pushed, is really 8 bytes off.
Because we didn't save the stack pointer from the entry of the asm, we
saved it after doing that "push %2". :-)

We get back to this call via a retq from the HV switcher code. So that rsp
that was pushed will be the stack.


>
> > /* Need to read manual, does rdmsr clear
> >  * the top 32 bits of rax? */
> > xor %rax, %rax
> >
> > movl$MSR_GS_BASE,%ecx
> > rdmsr
>
>
> This seems incredibly slow. Since GS changes are controlled in
> the kernel why don't you just cache them in the per cpu area?
> The only case where user can reload it is using segment
> selector changes, which can be also handled.
>
> Also why do you need the guest gs value anyways?

I did a swapgs here, so we have the HV GS value. I don't know of an easy
way to read the GS value to put it into the RSP.

>
> You could just use your own stack and let the guest
> switch to its own.

Well, it is quite complex, since we can't save anything in the writable
section that the guest can't also write to. Remember, we are in ring 1 for
the guest kernel, so we have no protection. All the host related stuff is
in read only sections when we are using the guest cr3.

What I would like to do is have the GS point to the read only section, and
load the RSP from a value there. But first we need to store the old value
of RSP.  The problem is, where do you store it. Hmm, I guess I can create
my own "big" offset from the read only section :)

OK this gives me an idea, I *can* point the kernel GS to the vcpu read
only section, and have a way to store the value with the GS offset. The
read only section is always mapped before the read write section, so it
should be easy to calculate the diff!

Thanks for the comment, you guided me to this great idea!

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 0/7] Modify lguest32 to make room for lguest64

2007-08-08 Thread Steven Rostedt
Hi all,

I've been working on lguest64 and in order to do this, I had to move
a lot of the i386 specific out of the way.  Well, the lguest64 port
is still not ready to display, but before Rusty makes too many changes
I would like this in upstream so I don't have to keep repeating my
changes :-)


So this patch series moves lguest32 out of the way for other archs.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 2/7] Added generic lg.h in lguest directory.

2007-08-08 Thread Steven Rostedt
Add a generic lg.h file to call the architecture specific one.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/lguest/lg.h |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
new file mode 100644
index 000..4c4356e
--- /dev/null
+++ b/drivers/lguest/lg.h
@@ -0,0 +1,3 @@
+#ifdef CONFIG_X86_32
+#include "i386/lg.h"
+#endif
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 5/7] Change lguest launcher to use asm generic include

2007-08-08 Thread Steven Rostedt
Have the lguest launcher include e820.h via asm/e820.h instead of explicitly
saying i386.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f791840..eb7a01c 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -46,7 +46,7 @@ typedef uint32_t u32;
 typedef uint16_t u16;
 typedef uint8_t u8;
 #include "../../include/linux/lguest_launcher.h"
-#include "../../include/asm-i386/e820.h"
+#include "../../include/asm/e820.h"
 /*:*/
 
 #define PAGE_PRESENT 0x7   /* Present, RW, Execute */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 7/7] Move lguest_dma_info into generic lg.h

2007-08-08 Thread Steven Rostedt
The lguest_dma_info is also generic across architectures.
Move it to the generic lg.h

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/lguest/i386/lg.h |   11 ---
 drivers/lguest/lg.h  |   11 +++
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
index c5ea14c..ae83bd2 100644
--- a/drivers/lguest/i386/lg.h
+++ b/drivers/lguest/i386/lg.h
@@ -49,17 +49,6 @@ int init_pagetables(struct page **switcher_page, unsigned 
int pages);
 #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
 #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
 
-struct lguest_dma_info
-{
-   struct list_head list;
-   union futex_key key;
-   unsigned long dmas;
-   u16 next_dma;
-   u16 num_dmas;
-   u16 guestid;
-   u8 interrupt;   /* 0 when not registered */
-};
-
 /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
  * reviewed the original code which used "u32" for all page table entries, and
  * insisted that it would be far clearer with explicit typing.  I thought it
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index 3147bf6..6057e3c 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -12,6 +12,17 @@ void release_all_dma(struct lguest *lg);
 unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
 unsigned long *interrupt);
 
+struct lguest_dma_info
+{
+   struct list_head list;
+   union futex_key key;
+   unsigned long dmas;
+   u16 next_dma;
+   u16 num_dmas;
+   u16 guestid;
+   u8 interrupt;   /* 0 when not registered */
+};
+
 #ifdef CONFIG_X86_32
 #include "i386/lg.h"
 #endif
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 1/7] Move lg.h to the i386 specific lguest directory

2007-08-08 Thread Steven Rostedt
Make a i386 directory in the lguest directory, and move the lg.h into
it. This will clear the way for other archs to have their own lg.h

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/lguest/i386/lg.h |  300 ++
 drivers/lguest/lg.h  |  300 --
 2 files changed, 300 insertions(+), 300 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
new file mode 100644
index 000..64f0abe
--- /dev/null
+++ b/drivers/lguest/i386/lg.h
@@ -0,0 +1,300 @@
+#ifndef _LGUEST_H
+#define _LGUEST_H
+
+#include 
+
+#define GDT_ENTRY_LGUEST_CS10
+#define GDT_ENTRY_LGUEST_DS11
+#define LGUEST_CS  (GDT_ENTRY_LGUEST_CS * 8)
+#define LGUEST_DS  (GDT_ENTRY_LGUEST_DS * 8)
+
+#ifndef __ASSEMBLY__
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "irq_vectors.h"
+
+#define GUEST_PL 1
+
+struct lguest_regs
+{
+   /* Manually saved part. */
+   unsigned long ebx, ecx, edx;
+   unsigned long esi, edi, ebp;
+   unsigned long gs;
+   unsigned long eax;
+   unsigned long fs, ds, es;
+   unsigned long trapnum, errcode;
+   /* Trap pushed part */
+   unsigned long eip;
+   unsigned long cs;
+   unsigned long eflags;
+   unsigned long esp;
+   unsigned long ss;
+};
+
+void free_pagetables(void);
+int init_pagetables(struct page **switcher_page, unsigned int pages);
+
+/* Full 4G segment descriptors, suitable for CS and DS. */
+#define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
+#define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
+
+struct lguest_dma_info
+{
+   struct list_head list;
+   union futex_key key;
+   unsigned long dmas;
+   u16 next_dma;
+   u16 num_dmas;
+   u16 guestid;
+   u8 interrupt;   /* 0 when not registered */
+};
+
+/*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
+ * reviewed the original code which used "u32" for all page table entries, and
+ * insisted that it would be far clearer with explicit typing.  I thought it
+ * was overkill, but he was right: it is much clearer than it was before.
+ *
+ * We have separate types for the Guest's ptes & pgds and the shadow ptes &
+ * pgds.  There's already a Linux type for these (pte_t and pgd_t) but they
+ * change depending on kernel config options (PAE). */
+
+/* Each entry is identical: lower 12 bits of flags and upper 20 bits for the
+ * "page frame number" (0 == first physical page, etc).  They are different
+ * types so the compiler will warn us if we mix them improperly. */
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} spgd_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} spte_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} gpgd_t;
+typedef union {
+   struct { unsigned flags:12, pfn:20; };
+   struct { unsigned long val; } raw;
+} gpte_t;
+
+/* We have two convenient macros to convert a "raw" value as handed to us by
+ * the Guest into the correct Guest PGD or PTE type. */
+#define mkgpte(_val) ((gpte_t){.raw.val = _val})
+#define mkgpgd(_val) ((gpgd_t){.raw.val = _val})
+/*:*/
+
+struct pgdir
+{
+   unsigned long cr3;
+   spgd_t *pgdir;
+};
+
+/* This is a guest-specific page (mapped ro) into the guest. */
+struct lguest_ro_state
+{
+   /* Host information we need to restore when we switch back. */
+   u32 host_cr3;
+   struct Xgt_desc_struct host_idt_desc;
+   struct Xgt_desc_struct host_gdt_desc;
+   u32 host_sp;
+
+   /* Fields which are used when guest is running. */
+   struct Xgt_desc_struct guest_idt_desc;
+   struct Xgt_desc_struct guest_gdt_desc;
+   struct i386_hw_tss guest_tss;
+   struct desc_struct guest_idt[IDT_ENTRIES];
+   struct desc_struct guest_gdt[GDT_ENTRIES];
+};
+
+/* We have two pages shared with guests, per cpu.  */
+struct lguest_pages
+{
+   /* This is the stack page mapped rw in guest */
+   char spare[PAGE_SIZE - sizeof(struct lguest_regs)];
+   struct lguest_regs regs;
+
+   /* This is the host state & guest descriptor page, ro in guest */
+   struct lguest_ro_state state;
+} __attribute__((aligned(PAGE_SIZE)));
+
+#define CHANGED_IDT1
+#define CHANGED_GDT2
+#define CHANGED_GDT_TLS4 /* Actually a subset of CHANGED_GDT */
+#define CHANGED_ALL3
+
+/* The private info the thread maintains about the guest. */
+struct lguest
+{
+   /* At end of a page shared mapped over lguest_pages in guest.  */
+   unsigned long regs_page;
+   struct lguest_regs *regs;
+   str

[PATCH 4/7] Moved the io struct up to the generic lg.h

2007-08-08 Thread Steven Rostedt
Move the io struct into the lg.h file since the io.c is generic to other
architectures.

Also added a proper ifdef for the generic lg.h.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/lguest/i386/lg.h |   11 ++-
 drivers/lguest/lg.h  |   16 
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
index 64f0abe..c5ea14c 100644
--- a/drivers/lguest/i386/lg.h
+++ b/drivers/lguest/i386/lg.h
@@ -20,6 +20,8 @@
 #include 
 #include 
 #include "irq_vectors.h"
+/* some files include this, some include the parent */
+#include "../lg.h"
 
 #define GUEST_PL 1
 
@@ -245,15 +247,6 @@ void pin_page(struct lguest *lg, unsigned long vaddr);
 int lguest_device_init(void);
 void lguest_device_remove(void);
 
-/* io.c: */
-void lguest_io_init(void);
-int bind_dma(struct lguest *lg,
-unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt);
-void send_dma(struct lguest *info, unsigned long key, unsigned long udma);
-void release_all_dma(struct lguest *lg);
-unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
-unsigned long *interrupt);
-
 /* hypercalls.c: */
 void do_hypercalls(struct lguest *lg);
 void write_timestamp(struct lguest *lg);
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
index 4c4356e..3147bf6 100644
--- a/drivers/lguest/lg.h
+++ b/drivers/lguest/lg.h
@@ -1,3 +1,19 @@
+#ifndef _LGUEST_LG_GENERIC
+#define _LGUEST_LG_GENERIC
+
+struct lguest;
+
+/* io.c: */
+void lguest_io_init(void);
+int bind_dma(struct lguest *lg,
+unsigned long key, unsigned long udma, u16 numdmas, u8 interrupt);
+void send_dma(struct lguest *info, unsigned long key, unsigned long udma);
+void release_all_dma(struct lguest *lg);
+unsigned long get_dma_buffer(struct lguest *lg, unsigned long key,
+unsigned long *interrupt);
+
 #ifdef CONFIG_X86_32
 #include "i386/lg.h"
 #endif
+
+#endif /* _LGUEST_LG_GENERIC */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 6/7] Remove __pa() use in hvc_lguest

2007-08-08 Thread Steven Rostedt
The hvc_lguest uses __pa in the const initialization.
In some architectures, __pa() is not constant so this fails in compiles.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/char/hvc_lguest.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c
index feeccba..c7d7d4c 100644
--- a/drivers/char/hvc_lguest.c
+++ b/drivers/char/hvc_lguest.c
@@ -42,7 +42,6 @@
  * use of physical address for the buffer itself. */
 static char inbuf[256];
 static struct lguest_dma cons_input = { .used_len = 0,
-   .addr[0] = __pa(inbuf),
.len[0] = sizeof(inbuf),
.len[1] = 0 };
 
@@ -114,6 +113,13 @@ static struct hv_ops lguest_cons = {
  * (0), and the struct hv_ops containing the put_chars() function. */
 static int __init cons_init(void)
 {
+   /*
+* We can't initialize using __pa in const declarations,
+* since __pa(inbuf) does not evaluate into a constant on
+* all architectures (namely x86_64).
+*/
+   cons_input.addr[0] = __pa(inbuf);
+
if (strcmp(paravirt_ops.name, "lguest") != 0)
return 0;
 
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 2/7] Added generic lg.h in lguest directory.

2007-08-08 Thread Steven Rostedt
--
On Thu, 9 Aug 2007, Stephen Rothwell wrote:

> On Wed, 08 Aug 2007 20:32:13 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
> >
> > Add a generic lg.h file to call the architecture specific one.
>
> Please combine this with the previous patch so that the tree will build
> at each step.

Yeah, I wanted to do that, but git got confused when I did that, and did
what I didn't want. That was, created a new file in the i386 directory,
and diffed the one in this directory.

I'm really want to keep git knowing the history of the file on moving. I
could just simply cat the two files instead. Would that be acceptible?

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/7] Moved the io struct up to the generic lg.h

2007-08-08 Thread Steven Rostedt

--
On Thu, 9 Aug 2007, Stephen Rothwell wrote:

> On Wed, 08 Aug 2007 20:32:15 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
> >
> > diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
> > index 64f0abe..c5ea14c 100644
> > --- a/drivers/lguest/i386/lg.h
> > +++ b/drivers/lguest/i386/lg.h
> > @@ -20,6 +20,8 @@
> >  #include 
> >  #include 
> >  #include "irq_vectors.h"
> > +/* some files include this, some include the parent */
> > +#include "../lg.h"
>
> Surely everything should just include the parent which allows for easier
> consolidation over time. (as you are doing here)

Yeah, that sounds better. This was my attempt at not modifing the
i386/ stuff too much.  But I can have those files change to do a "../lg.h"
include.  Again, doing the move plus change to keep things building. Hmm,
I guess git can handle that.

>
> > --- a/drivers/lguest/lg.h
> > +++ b/drivers/lguest/lg.h
> > @@ -1,3 +1,19 @@
> > +#ifndef _LGUEST_LG_GENERIC
> > +#define _LGUEST_LG_GENERIC
>
> Maybe _LGUEST_LG_H?  and the arch ones could use _LGUEST_I386_LG_H etc.

Noted. Will do in the next release.

Thanks,

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 3/5 -v2] Remove __pa() use in hvc_lguest

2007-08-08 Thread Steven Rostedt
The hvc_lguest uses __pa in the const initialization.
In some architectures, __pa() is not constant so this fails in compiles.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 drivers/char/hvc_lguest.c |8 +++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/drivers/char/hvc_lguest.c b/drivers/char/hvc_lguest.c
index feeccba..c7d7d4c 100644
--- a/drivers/char/hvc_lguest.c
+++ b/drivers/char/hvc_lguest.c
@@ -42,7 +42,6 @@
  * use of physical address for the buffer itself. */
 static char inbuf[256];
 static struct lguest_dma cons_input = { .used_len = 0,
-   .addr[0] = __pa(inbuf),
.len[0] = sizeof(inbuf),
.len[1] = 0 };
 
@@ -114,6 +113,13 @@ static struct hv_ops lguest_cons = {
  * (0), and the struct hv_ops containing the put_chars() function. */
 static int __init cons_init(void)
 {
+   /*
+* We can't initialize using __pa in const declarations,
+* since __pa(inbuf) does not evaluate into a constant on
+* all architectures (namely x86_64).
+*/
+   cons_input.addr[0] = __pa(inbuf);
+
if (strcmp(paravirt_ops.name, "lguest") != 0)
return 0;
 
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 0/5 -v2] Modify lguest32 to make room for lguest64 (version 2)

2007-08-08 Thread Steven Rostedt
[
  Changes since last version.
  - Move lg.h to include/asm instead (suggested by Rusty Russel)
  - All steps of the series compiles (suggested by Stephen Rothwell)
  - Better ifdef header naming (suggested by Stephen Rothwell)
  - Added Andi Kleen to CC (forgot to on V1)
]

Hi all,

I've been working on lguest64 and in order to do this, I had to move
a lot of the i386 specific out of the way.  Well, the lguest64 port
is still not ready to display, but before Rusty makes too many changes
I would like this in upstream so I don't have to keep repeating my
changes :-)


So this patch series moves lguest32 out of the way for other archs.

-- Steve

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 4/5 -v2] Added generic lg.h in the include/linux directory

2007-08-08 Thread Steven Rostedt
Add a generic lg.h that can be included from lguest files.
This file will hold the data that can be shared across archs.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/i386/kernel/asm-offsets.c|2 +-
 drivers/lguest/core.c |2 +-
 drivers/lguest/hypercalls.c   |2 +-
 drivers/lguest/interrupts_and_traps.c |2 +-
 drivers/lguest/io.c   |2 +-
 drivers/lguest/lguest_user.c  |2 +-
 drivers/lguest/page_tables.c  |2 +-
 drivers/lguest/segments.c |2 +-
 include/asm-i386/lg.h |   24 ++--
 include/linux/lg.h|   30 ++
 10 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/i386/kernel/asm-offsets.c b/arch/i386/kernel/asm-offsets.c
index e426e51..d6d17e5 100644
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -21,7 +21,7 @@
 
 #ifdef CONFIG_LGUEST_GUEST
 #include 
-#include 
+#include 
 #endif
 
 #define DEFINE(sym, val) \
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
index e5f441d..5bcd237 100644
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /* Found in switcher.S */
 extern char start_switcher_text[], end_switcher_text[], switch_to_guest[];
diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c
index 0ddb133..3c0d41e 100644
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -26,7 +26,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /*H:120 This is the core hypercall routine: where the Guest gets what it
  * wants.  Or gets killed.  Or, in the case of LHCALL_CRASH, both.
diff --git a/drivers/lguest/interrupts_and_traps.c 
b/drivers/lguest/interrupts_and_traps.c
index 899e681..3daf3e2 100644
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -12,7 +12,7 @@
  * them first, so we also have a way of "reflecting" them into the Guest as if
  * they had been delivered to it directly. :*/
 #include 
-#include 
+#include 
 
 /* The address of the interrupt handler is split into two bits: */
 static unsigned long idt_address(u32 lo, u32 hi)
diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c
index 091a46d..d196121 100644
--- a/drivers/lguest/io.c
+++ b/drivers/lguest/io.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /*L:300
  * I/O
diff --git a/drivers/lguest/lguest_user.c b/drivers/lguest/lguest_user.c
index a29d85b..2f477df 100644
--- a/drivers/lguest/lguest_user.c
+++ b/drivers/lguest/lguest_user.c
@@ -7,7 +7,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /*L:030 setup_regs() doesn't really belong in this file, but it gives us an
  * early glimpse deeper into the Host so it's worth having here.
diff --git a/drivers/lguest/page_tables.c b/drivers/lguest/page_tables.c
index db3dd8b..812d56a 100644
--- a/drivers/lguest/page_tables.c
+++ b/drivers/lguest/page_tables.c
@@ -13,7 +13,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 
 /*M:008 We hold reference to pages, which prevents them from being swapped.
  * It'd be nice to have a callback in the "struct mm_struct" when Linux wants
diff --git a/drivers/lguest/segments.c b/drivers/lguest/segments.c
index 1d16d5d..351b13f 100644
--- a/drivers/lguest/segments.c
+++ b/drivers/lguest/segments.c
@@ -9,7 +9,7 @@
  * In these modern times, the segment handling code consists of simple sanity
  * checks, and the worst you'll experience reading this code is butterfly-rash
  * from frolicking through its parklike serenity. :*/
-#include 
+#include 
 
 /*H:600
  * We've almost completed the Host; there's just one file to go!
diff --git a/include/asm-i386/lg.h b/include/asm-i386/lg.h
index 64f0abe..c8e01c3 100644
--- a/include/asm-i386/lg.h
+++ b/include/asm-i386/lg.h
@@ -1,5 +1,5 @@
-#ifndef _LGUEST_H
-#define _LGUEST_H
+#ifndef _I386_LGUEST_H
+#define _I386_LGUEST_H
 
 #include 
 
@@ -47,17 +47,6 @@ int init_pagetables(struct page **switcher_page, unsigned 
int pages);
 #define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
 #define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
 
-struct lguest_dma_info
-{
-   struct list_head list;
-   union futex_key key;
-   unsigned long dmas;
-   u16 next_dma;
-   u16 num_dmas;
-   u16 guestid;
-   u8 interrupt;   /* 0 when not registered */
-};
-
 /*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
  * reviewed the original code which used "u32" for all page table entries, and
  * insisted that it would be far clearer with explicit typing.  I thought it
@@ -245,15 +234,6 @@ void pin_page(struct lguest *lg, unsigned long vaddr);
 int lguest_device_init(void);
 void lguest_device_remove(void);
 
-/* io

[PATCH 2/5 -v2] Change lguest launcher to use asm generic include

2007-08-08 Thread Steven Rostedt
Have the lguest launcher include e820.h via asm/e820.h instead of explicitly
saying i386.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 Documentation/lguest/lguest.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f791840..eb7a01c 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -46,7 +46,7 @@ typedef uint32_t u32;
 typedef uint16_t u16;
 typedef uint8_t u8;
 #include "../../include/linux/lguest_launcher.h"
-#include "../../include/asm-i386/e820.h"
+#include "../../include/asm/e820.h"
 /*:*/
 
 #define PAGE_PRESENT 0x7   /* Present, RW, Execute */
-- 
1.4.4.4

-- 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH 1/5 -v2] move lg.h to include/asm-

2007-08-08 Thread Steven Rostedt
The lg.h file has lots of i386 specific information. Move it over to the asm
directory where it belongs.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
---
 arch/i386/kernel/asm-offsets.c|2 +-
 drivers/lguest/core.c |2 +-
 drivers/lguest/hypercalls.c   |2 +-
 drivers/lguest/interrupts_and_traps.c |2 +-
 drivers/lguest/io.c   |2 +-
 drivers/lguest/lg.h   |  300 -
 drivers/lguest/lguest_user.c  |2 +-
 drivers/lguest/page_tables.c  |2 +-
 drivers/lguest/segments.c |2 +-
 drivers/lguest/switcher.S |2 +-
 include/asm-i386/lg.h |  300 +
 11 files changed, 309 insertions(+), 309 deletions(-)

diff --git a/arch/i386/kernel/asm-offsets.c b/arch/i386/kernel/asm-offsets.c
index 7288ac8..e426e51 100644
--- a/arch/i386/kernel/asm-offsets.c
+++ b/arch/i386/kernel/asm-offsets.c
@@ -21,7 +21,7 @@
 
 #ifdef CONFIG_LGUEST_GUEST
 #include 
-#include "../../../drivers/lguest/lg.h"
+#include 
 #endif
 
 #define DEFINE(sym, val) \
diff --git a/drivers/lguest/core.c b/drivers/lguest/core.c
index 0a46e88..e5f441d 100644
--- a/drivers/lguest/core.c
+++ b/drivers/lguest/core.c
@@ -19,7 +19,7 @@
 #include 
 #include 
 #include 
-#include "lg.h"
+#include 
 
 /* Found in switcher.S */
 extern char start_switcher_text[], end_switcher_text[], switch_to_guest[];
diff --git a/drivers/lguest/hypercalls.c b/drivers/lguest/hypercalls.c
index db6caac..0ddb133 100644
--- a/drivers/lguest/hypercalls.c
+++ b/drivers/lguest/hypercalls.c
@@ -26,7 +26,7 @@
 #include 
 #include 
 #include 
-#include "lg.h"
+#include 
 
 /*H:120 This is the core hypercall routine: where the Guest gets what it
  * wants.  Or gets killed.  Or, in the case of LHCALL_CRASH, both.
diff --git a/drivers/lguest/interrupts_and_traps.c 
b/drivers/lguest/interrupts_and_traps.c
index 49787e9..899e681 100644
--- a/drivers/lguest/interrupts_and_traps.c
+++ b/drivers/lguest/interrupts_and_traps.c
@@ -12,7 +12,7 @@
  * them first, so we also have a way of "reflecting" them into the Guest as if
  * they had been delivered to it directly. :*/
 #include 
-#include "lg.h"
+#include 
 
 /* The address of the interrupt handler is split into two bits: */
 static unsigned long idt_address(u32 lo, u32 hi)
diff --git a/drivers/lguest/io.c b/drivers/lguest/io.c
index ea68613..091a46d 100644
--- a/drivers/lguest/io.c
+++ b/drivers/lguest/io.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 #include 
-#include "lg.h"
+#include 
 
 /*L:300
  * I/O
diff --git a/drivers/lguest/lg.h b/drivers/lguest/lg.h
deleted file mode 100644
index 64f0abe..000
--- a/drivers/lguest/lg.h
+++ /dev/null
@@ -1,300 +0,0 @@
-#ifndef _LGUEST_H
-#define _LGUEST_H
-
-#include 
-
-#define GDT_ENTRY_LGUEST_CS10
-#define GDT_ENTRY_LGUEST_DS11
-#define LGUEST_CS  (GDT_ENTRY_LGUEST_CS * 8)
-#define LGUEST_DS  (GDT_ENTRY_LGUEST_DS * 8)
-
-#ifndef __ASSEMBLY__
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include "irq_vectors.h"
-
-#define GUEST_PL 1
-
-struct lguest_regs
-{
-   /* Manually saved part. */
-   unsigned long ebx, ecx, edx;
-   unsigned long esi, edi, ebp;
-   unsigned long gs;
-   unsigned long eax;
-   unsigned long fs, ds, es;
-   unsigned long trapnum, errcode;
-   /* Trap pushed part */
-   unsigned long eip;
-   unsigned long cs;
-   unsigned long eflags;
-   unsigned long esp;
-   unsigned long ss;
-};
-
-void free_pagetables(void);
-int init_pagetables(struct page **switcher_page, unsigned int pages);
-
-/* Full 4G segment descriptors, suitable for CS and DS. */
-#define FULL_EXEC_SEGMENT ((struct desc_struct){0x, 0x00cf9b00})
-#define FULL_SEGMENT ((struct desc_struct){0x, 0x00cf9300})
-
-struct lguest_dma_info
-{
-   struct list_head list;
-   union futex_key key;
-   unsigned long dmas;
-   u16 next_dma;
-   u16 num_dmas;
-   u16 guestid;
-   u8 interrupt;   /* 0 when not registered */
-};
-
-/*H:310 The page-table code owes a great debt of gratitude to Andi Kleen.  He
- * reviewed the original code which used "u32" for all page table entries, and
- * insisted that it would be far clearer with explicit typing.  I thought it
- * was overkill, but he was right: it is much clearer than it was before.
- *
- * We have separate types for the Guest's ptes & pgds and the shadow ptes &
- * pgds.  There's already a Linux type for these (pte_t and pgd_t) but they
- * change depending on kernel config options (PAE). */
-
-/* Each entry is identical: lower 12 bits of flags and upper 20 bits for the
- * "page frame number" (0 == first physical page, etc).  They are different
- * types so t

Re: [PATCH 18/25] [PATCH] turn priviled operations into macros in entry.S

2007-08-09 Thread Steven Rostedt
--
On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote:

> Steven Rostedt wrote:
> > /*
> >  * x86 arch doesn't have an easy way to find out where
> >  * gs is located. So we need to read the MSR. But first
> >  * we need to save off the rcx, rax and rdx.
> >
> Why don't you store it in gs?  movq %gs:my_gs_base, %rax?

Because it can't be trusted.  After the swapgs, we are pointing to the RW
section of the HV. But by running the guest kernel in ring 1, we have no
protection from the guest writing into that area too. So we can't put
anything in that section and expect it to be safe after jumping to guest
code.  The only trusted values, is jumping in after getting there by an
interrupt, where the hardware places the values onto the stack.

But with Andi's comments, I realized I can point the gs pointer to the
RO area. And make a constant offset that will point up into the RW area,
so we could save the stack and replace it.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/5 -v2] Added generic lg.h in the include/linux directory

2007-08-09 Thread Steven Rostedt
--
On Thu, 9 Aug 2007, Stephen Rothwell wrote:

> Hi Steven,
>
> I am just being pedantic here (and I should have beaten up on Rusty
> before now ...  :-)
>
> On Thu, 09 Aug 2007 00:36:30 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
> >
> > --- a/include/asm-i386/lg.h
> > +++ b/include/asm-i386/lg.h
> > @@ -1,5 +1,5 @@
> > -#ifndef _LGUEST_H
> > -#define _LGUEST_H
> > +#ifndef _I386_LGUEST_H
> > +#define _I386_LGUEST_H
>
> Maybe _ASM_LG_H or _I386_LG_H ? And there is a comment on the #endif at
> the end of the file that should match.
>
> > +++ b/include/linux/lg.h
> > @@ -0,0 +1,30 @@
> > +#ifndef _LINUX_LGUEST_H
> > +#define _LINUX_LGUEST_H
>
> Maybe _LINUX_LG_H ?
>
> Also (this is for Rusty) include/linux/lguest.h is protected by
> _ASM_LGUEST_H ...

Hah! in my lguest64 tree, I fixed all these. But I forgot to do the
changes when making the patches for upstream.  I'll make note and fix them
in my next patch series (unless Rusty can pull my stuff in now and then I
can apply on top of them ;-)

Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 5/5 -v2] Move the lguest files that are i386 specific

2007-08-09 Thread Steven Rostedt


--
On Thu, 9 Aug 2007, Stephen Rothwell wrote:

> Hi Steven,
>
> On Thu, 09 Aug 2007 00:36:31 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> wrote:
> >
> > Well, some may be merged with x86_64 later, but for now we move them
> > out of the way.  Later on we can start seeing how we can combine
> > some of these files to be arch generic.
> >
> > Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>
> > ---
> >  drivers/lguest/Makefile|7 +-
> >  drivers/lguest/core.c  |  771 ---
> >  drivers/lguest/hypercalls.c|  300 
> >  drivers/lguest/i386/Makefile   |7 +
> >  drivers/lguest/i386/core.c |  771 +++
> >  drivers/lguest/i386/hypercalls.c   |  300 
> >  drivers/lguest/i386/interrupts_and_traps.c |  440 +++
> >  drivers/lguest/i386/lguest.c   | 1097 
> > 
> >  drivers/lguest/i386/lguest_asm.S   |   93 +++
> >  drivers/lguest/i386/lguest_user.c  |  382 ++
> >  drivers/lguest/i386/page_tables.c  |  680 +
> >  drivers/lguest/i386/segments.c |  229 ++
> >  drivers/lguest/i386/switcher.S |  347 +
> >  drivers/lguest/interrupts_and_traps.c  |  440 ---
> >  drivers/lguest/lguest.c| 1097 
> > 
> >  drivers/lguest/lguest_asm.S|   93 ---
> >  drivers/lguest/lguest_user.c   |  382 --
> >  drivers/lguest/page_tables.c   |  680 -
> >  drivers/lguest/segments.c  |  229 --
> >  drivers/lguest/switcher.S  |  347 -
> >  20 files changed, 4348 insertions(+), 4344 deletions(-)
>
> if you add -C or -M to "git diff --stat" this looks much better:
>
> $ git diff --stat -C HEAD^
>  drivers/lguest/Makefile  |7 ++-
>  drivers/lguest/i386/Makefile |7 +++
>  drivers/lguest/{ => i386}/core.c |0
>  drivers/lguest/{ => i386}/hypercalls.c   |0
>  drivers/lguest/{ => i386}/interrupts_and_traps.c |0
>  drivers/lguest/{ => i386}/lguest.c   |0
>  drivers/lguest/{ => i386}/lguest_asm.S   |0
>  drivers/lguest/{ => i386}/lguest_user.c  |0
>  drivers/lguest/{ => i386}/page_tables.c  |0
>  drivers/lguest/{ => i386}/segments.c |0
>  drivers/lguest/{ => i386}/switcher.S |0
>  11 files changed, 9 insertions(+), 5 deletions(-)
>

/me should read the man pages.

Thanks, that's the options I was looking for.  I was using the
git-format-patch, and thought it would do that automatically.  I
definitely add those options the next time around!

Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

2007-08-09 Thread Steven Rostedt

--
On Thu, 9 Aug 2007, Jeremy Fitzhardinge wrote:

> Glauber de Oliveira Costa wrote:
> > On 8/9/07, Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> >
> >>> Does it really matter?
> >>>
> >>>
> >> Well, yes, if alignment is an issue.
> >>
> > Of course, But the question rises from the context that they are both
> > together at the beginning. So they are not making anybody non-aligned.
> > Then the question: Why would putting it in the end be different to
> > putting them _together_, aligned at the beginning ?
> >
>
> Well, the point is that if you add new ones then alignment may be an
> issue.  Putting them at the end (with a comment explaining why they're
> there) will make it more robust.  Though splitting them into their own
> sub-structure would probably be better.

Glauber,

I was thinking of putting them at the end too, and that would make it all
work better. But I didn't mention it because I was in the mindset of "well
i386 has that there, we should too" :-(

>
> Hm.  So x86-64 doesn't make 64-bit pointers be 64-bit aligned?

yeah, it usually does. But it's one of those paranoid things, where you
want it to still work even if someone later on throws an
__attribute__((packed)) in on paravirt ops ;-)

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/7] Moved the io struct up to the generic lg.h

2007-08-09 Thread Steven Rostedt

--
On Thu, 9 Aug 2007, Jes Sorensen wrote:

> Stephen Rothwell wrote:
> > On Wed, 08 Aug 2007 20:32:15 -0400 Steven Rostedt <[EMAIL PROTECTED]>
> > wrote:
> >> diff --git a/drivers/lguest/i386/lg.h b/drivers/lguest/i386/lg.h
> >> index 64f0abe..c5ea14c 100644
> >> --- a/drivers/lguest/i386/lg.h
> >> +++ b/drivers/lguest/i386/lg.h
> >> @@ -20,6 +20,8 @@
> >>  #include 
> >>  #include 
> >>  #include "irq_vectors.h"
> >> +/* some files include this, some include the parent */
> >> +#include "../lg.h"
> >
> > Surely everything should just include the parent which allows for easier
> > consolidation over time. (as you are doing here)
>
> Whats wrong with introducing an include/asm-/lg.h and have the
> parent do #include  ? Seems much cleaner to me.

Seems cleaner to me to!  I think I'll do that.

Oh, wait! I did.  ;-)

-- Steve

P.S. Sorry, couldn't resist.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Lguest] Introducing paravirt_ops for x86_64

2007-08-09 Thread Steven Rostedt

--
On Wed, 8 Aug 2007, Jeremy Fitzhardinge wrote:

>
> At the moment Xen can't support guests with 2M pages.  In 32-bit this
> isn't a huge problem, since the kernel doesn't assume it can map itself
> with 2M pages.  But I think the 64-bit kernel assumes 2M pages are
> always available for mapping the kernel.  I don't know how pervasive
> this assumption is, but it would be nice to parameterize it in pv-ops.
>

I got 2M pages working in lguest64, but it was a cause of a lot of pain.
By removing support for 2M pages in guests, that might clean up a lot of
lguest64 shadow paging as well.

The only advantage I see for 2M pages or a guest, is that it saves the
guest from having extra page tables. The host still needs the map the 2M
to 4K, so it's not an advantage there. It doesn't speed anything up, so
I'd be all for scrapping 2M pages for guests.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 25/25] [PATCH] add paravirtualization support for x86_64

2007-08-09 Thread Steven Rostedt
--
On Thu, 9 Aug 2007, Andi Kleen wrote:
>
> > This has to match the normal C calling convention though, doesn't it?
>
> Native cli/sti/save/restore_flags are all only assembly and can be easily
> (in fact more easily than in C) written as pure assembler functions. Then
> you can use whatever calling convention you want.

I agree.
Should we make a paravirt_ops_asm.S file that can implement these native
funcions, and so we can get rid of the C functions only doing asm?

>
> While some paravirt implementations may have more complicated implementations
> i guess it's still a reasonable requirement to make them simple enough
> in pure assembler. If not they can use a trampoline, but that's hopefully
> not needed.

It works for lguest64.  I'm sure it should be no problem with other HVs.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 18/25][V3] turn priviled operations into macros in entry.S

2007-08-15 Thread Steven Rostedt


--

On Wed, 15 Aug 2007, Glauber de Oliveira Costa wrote:

> @@ -566,10 +596,15 @@ retint_restore_args:
>  restore_args:
>   RESTORE_ARGS 0,8,0
>  iret_label:
> +#ifdef CONFIG_PARAVIRT
> + INTERRUPT_RETURN
> +#endif
> +.globl do_iretq;

Is the .globl really required here? I would think that the reference in
the fixup section would be resolved with in the compiling of this object.

-- Steve

> +do_iretq:
>   iretq
>
>   .section __ex_table,"a"
> - .quad iret_label,bad_iret
> + .quad do_iretq, bad_iret
>   .previous
>   .section .fixup,"ax"
>   /* force a signal here? this matches i386 behaviour */
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[ANNOUNCE] Lguest64 - fatter puppies!

2007-09-04 Thread Steven Rostedt
This is a formal announcement of Lguest64.

Most are aware of the little puppies (lguest32, or simply lguest, or in
some circles "rustyvisor").  But this time the puppies ate a bit too
much.  No more lean and mean puppies, now we got big fat lazy ones.
Running on the hardware that's too lazy to do full virutalization. Yes,
lguest now runs on x86_64!

As you know, puppies are young, and so is lguest64. And like most new
born puppies, lguest64 might crap on your floor. But if it's in a good
mood, it will make it to the door and give you a login prompt (or maybe
even ssh into it and run firefox!
http://rostedt.homelinux.com/pics/firefox-on-lguest64.png ).

lguest64 is still going through a bit of growth pains, but its getting
better. It's to a point that we are not that afraid to bring it to the
dog show.

So for those that love puppies, and want even bigger ones, you can
download the code at:

  git://git.et.redhat.com/kernel-lguest-64.git

TODO:
 Many things! but here's what's on the near future list.

 - SMP first for host then for guest. We had it working a while
   ago, but we decided to update to catchup to lguest32, and we
   broke it.

 - get closer to lguest32. Rusty just likes to annoy us ;-)

 - optimization, optimization, optimization (did I say it's still slow?)

 - oh, and hopefully to get it merged!

well, there's lots more to do, but we can think about it later.

Want to help? test it out, and let us know how big a mess it made on
your floor.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] input: Fix interrupt enable in i8042_ctr when enabling interrupt fails

2007-09-10 Thread Steven Rostedt

--
On Mon, 10 Sep 2007, Markus Armbruster wrote:
>
> I believe this possible, but unlikely (perhaps not so unlikely on
> virtual machines).  Scenarios involve enable succeeding the first
> time, failing the second time, and succeeding the third time.  I can
> provide details, but the point I'd like to make is not that this is
> broken (although it is, strictly speaking), but that it is not
> obviously correct where it easily could be: just clear the interrupt
> enable bits when writing them to the hardware failed, like the old
> code did.
>

I also want to stress that this is more of a clean up for "technically
correct" code than a bug fix.  This bug probably would never happen on
baremetal unless it was running on broken hardware.

  BUT!!!

With more and more systems going to a virtual environment, having a bug or
some other anomaly can trigger the error that this patch prevents. The
patch will also trigger a print in the case of running on a buggy virtual
machine, which would help out the developers of that virtual machine to
fix their code.

Please apply.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 7/24] consolidate msr.h

2007-11-19 Thread Steven Rostedt

On Mon, 19 Nov 2007, Bastian Blank wrote:

> On Fri, Nov 09, 2007 at 04:42:48PM -0200, Glauber de Oliveira Costa wrote:
> > -   wrmsrl(MSR_CSTAR, ia32_cstar_target);
> > +   wrmsrl(MSR_CSTAR, (u64)ia32_cstar_target);
>
> Hmm, why do you add explicit casts? The compiler should convert that
> correctly on its own.
>
> > +static inline void wrmsrl(unsigned int msr, unsigned long long val)
>
> Hmm, long long is 64 bit on all x86, but why not use explicit u64 to
> show that?

(quick reply)

With PVOPS on it gives compiler warnings without that explict cast.
Without looking at the code, IIRC with non-PVOPS it is a macro directly
into asm, so it didn't matter what the cast was. But with PVOPS as a
function, it gave compiler warnings.

Take it out and try compiling it for both i386 and x86_64. One of them
gave warnings. But maybe it's not a problem now.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 7/24] consolidate msr.h

2007-11-19 Thread Steven Rostedt


On Tue, 20 Nov 2007, Ingo Molnar wrote:

> * Steven Rostedt <[EMAIL PROTECTED]> wrote:
>
> > With PVOPS on it gives compiler warnings without that explict cast.
> > Without looking at the code, IIRC with non-PVOPS it is a macro
> > directly into asm, so it didn't matter what the cast was. But with
> > PVOPS as a function, it gave compiler warnings.
> >
> > Take it out and try compiling it for both i386 and x86_64. One of them
> > gave warnings. But maybe it's not a problem now.
>
> i dont think there's ever any true need (and good cause) to force
> integer type casts like that at the callee site.

I guess the problem is that we converted a macro to a function, where the
macro did no type checking. Now we need to pick between integers and
pointers. Some places uses intergers in wrmsrl and some use pointers. So
changing this to a typechecking protocol is not going to be nice.

Looking at the current code now, we have this:


checking_wrmsrl(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
checking_wrmsrl(MSR_IA32_SYSENTER_ESP, 0ULL);
checking_wrmsrl(MSR_IA32_SYSENTER_EIP, (u64)ia32_sysenter_target);

wrmsrl(MSR_CSTAR, ia32_cstar_target);


A typecast is already used in that same area.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 7/24] consolidate msr.h

2007-11-19 Thread Steven Rostedt

On Tue, 20 Nov 2007, Ingo Molnar wrote:
>
> i dont think there's ever any true need (and good cause) to force
> integer type casts like that at the callee site.

Unless you mean we should do something like this:

static inline void __wrmsrl(unsigned int msr, unsigned long long val);
#define wrmsr(msr, val) __wrmsrl(msr, (unsigned long long)var)

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 02/16] adapt lguest launcher to per-cpuness

2007-12-26 Thread Steven Rostedt

On Wed, 26 Dec 2007, Rusty Russell wrote:

> On Friday 21 December 2007 00:33:42 Glauber de Oliveira Costa wrote:
> > +   if (!vcpu_id) {
> > +   /*
> > +* Service input, then unset the BREAK to
> > +* release the Waker.
> > +*/
> > +   handle_input(lguest_fd);
> > +   if (pwrite(lguest_fd, args, sizeof(args), 0) < 0)
> > +   err(1, "Resetting break");
> > +   }
>
> I hate winged comments: those two extra lines, wasted!
>

For multiple lines, wings are a Good Thing (TM). Otherwise it looks
sloppy.

  /* Service input, then unset the BREAK to
   * release the Waker. */

extra asterisk! ok then

  /* Service input, then unset the BREAK to
 release the Waker. */

Yuck, that "release" looks like it can be code, especially with parsers
that look for comments that start with some sort of /* or *

Those little wings do IMHO make the code look nicer. I know in the Linux
community, my weight compared to you is a chihuahua compared to a
St. Bernard.  But in this case, I believe others think that my collar is
prettier than yours. ;-)

Some one buy Rusty a bigger hard-drive to store those extra lines.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 01/11] Add basic support for gcc profiler instrumentation

2008-01-03 Thread Steven Rostedt

[Added Chris Wright, Rusty and Virt list because they were involved with
this issue before]

On Thu, 3 Jan 2008, Ingo Molnar wrote:

>
> * Steven Rostedt <[EMAIL PROTECTED]> wrote:
>
> > +# function tracing might turn this off:
> > +config REGPARM
> > +   bool
> > +   depends on !MCOUNT
> > +   default y
>
> are you sure -pg really needs this?

Nope! Arnaldo and I only carried it because you had it ;-)

> I just carried this along the years
> and went the path of least resistence, but we should not be
> reintroducing the !REGPARM build mode for the kernel. I'd not be
> surprised if there were a few issues with REGPARM + mcount, but we have
> to figure it out before merging ...

Hmm, I know paravirt-ops had an issue with mcount in the RT tree. I can't
remember the exact issues, but it did have something to do with the way
parameters were passed in.

Chris, do you remember what the issues were?

I'm also thinking that this is only an i386 issue.

Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH 01/11] Add basic support for gcc profiler instrumentation

2008-01-03 Thread Steven Rostedt

On Thu, 3 Jan 2008, Chris Wright wrote:

>
> Yes, paravirt ops have a well-specified calling convention (register
> based).  There was a cleanup that Andi did that caused the problem
> because it removed all the "fastcall" annotations since -mregparm=3
> is now always on for i386.  Since MCOUNT disables REGPARM the calling
> convention changes (caller pushes to stack, callee expects register)
> chaos ensues.  I sent a patch to fix that quite some months back, but
> it went stale and I neglected to update it.  Would you like me to dig
> it up refresh and resend?

Chris, thanks for the refresher.

I'm going to see if we can remove the REGPARM hack and change the way
mcount does its calls. Maybe this will fix things for us.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH 1/6] Staging: hv: Unify hyper-v device abstractions

2011-02-28 Thread Steven Rostedt
Hi!

Just an FYI,

When sending out patch series, could you send out a [PATCH 0/6] first,
and have all other patches a reply to that patch. Both git and quilt
have ways to do this.

The reason is that the linux kernel mailing list gets over 600 emails a
day, and for those of us that skim through, we like to hit [make thread
as read] when we are not interested in the thread or patch set. When
people send out a set of patches that are all individual, this means I
need to hit this command for every patch.

Also do not think I'm just picking on you. I've complained to several
people about doing this. One just recently sent out 21 patches.

Thanks,

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [Xen-devel] Re: linux-next: Tree for July 25 (xen)

2011-08-08 Thread Steven Rostedt
On Sat, 2011-08-06 at 11:22 -0700, Jeremy Fitzhardinge wrote:
> On 08/05/2011 06:58 PM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Aug 05, 2011 at 11:14:23PM +0200, Ingo Molnar wrote:
> >> * Ingo Molnar  wrote:
> >>
> >>> * Randy Dunlap  wrote:
> >>>
>  On 08/04/11 15:40, Konrad Rzeszutek Wilk wrote:
> > On Thu, Aug 04, 2011 at 06:32:59PM -0400, Konrad Rzeszutek Wilk wrote:
> > These build failures are still triggering upstream:
> >
> >  arch/x86/xen/trace.c:44:2: error: array index in initializer not 
> > of integer type
> >  arch/x86/xen/trace.c:44:2: error: (near initialization for 
> > ‘xen_hypercall_names’)
> >  arch/x86/xen/trace.c:45:1: error: ‘__HYPERVISOR_arch_4’ undeclared 
> > here (not in a function)
> >  arch/x86/xen/trace.c:45:2: error: array index in initializer not 
> > of integer type
> >  arch/x86/xen/trace.c:45:2: error: (near initialization for 
> > ‘xen_hypercall_names’)
>  Oh, that I haven't seen. Can you send me the .config for that please.
> >>> You can't be trying very hard then.  I see lots of these (but no,
> >> Ah, I am getting it now. Thanks for reporting it.
> > This should do the trick:
>  Acked-by: Randy Dunlap 
> >>> That patch did the trick here too:
> >>>
> >>> Acked-by: Ingo Molnar 
> >> Except that i'm still seeing the occasional build failure - see the 
> >> error log below. Config attached.
> > Much appreciate for the report. I believe this fix (which I think hit
> > linux-next yesterday?) should do that:
> >
> > commit 1e9ea2656b656edd3c8de98675bbc0340211b5bd
> > Author: Jeremy Fitzhardinge 
> > Date:   Wed Aug 3 09:43:44 2011 -0700
> >
> >  xen/tracing: it looks like we wanted CONFIG_FTRACE
> > 
> > Apparently we wanted CONFIG_FTRACE rather the CONFIG_FUNCTION_TRACER.
> > 
> > Reported-by: Sander Eikelenboom 
> > Tested-by: Sander Eikelenboom 
> > Signed-off-by: Jeremy Fitzhardinge 
> > Signed-off-by: Konrad Rzeszutek Wilk 
> >
> > diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
> > index 45e94ac..3326204 100644
> > --- a/arch/x86/xen/Makefile
> > +++ b/arch/x86/xen/Makefile
> > @@ -15,7 +15,7 @@ obj-y := enlighten.o setup.o multicalls.o 
> > mmu.o irq.o \
> > grant-table.o suspend.o platform-pci-unplug.o \
> > p2m.o
> >  
> > -obj-$(CONFIG_FUNCTION_TRACER) += trace.o
> > +obj-$(CONFIG_FTRACE) += trace.o
> >  
> 
> I'm not sure this is correct either.  Maybe it should be
> CONFIG_TRACEPOINTS?  Steven?

Actually, I believe the correct answer is:

CONFIG_EVENT_TRACING

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization

Re: [PATCH 6/7] qemu: Implement virtio-pstore device

2016-07-28 Thread Steven Rostedt
On Thu, 28 Jul 2016 14:39:53 +0900
Namhyung Kim  wrote:

> Well, I dont' know.  As you know, the kernel oops dump is already sent
> to serial device but it's rather slow.  As I wrote in the cover
> letter, enabling ftrace_dump_on_oops makes it even worse..  Also
> pstore saves the (compressed) binary data so I thought it'd be better
> to have a dedicated IO channel.

BTW, I agree with this. It is better to have a quick way to grab the
ftrace buffers when a crash happens, as serial is excruciatingly slow.
Although, currently I still use kexec/kdump, but as Namhyung said, it
depends on crash being up to date. I tend to be sending in updates
every time I have to use it.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-08-08 Thread Steven Rostedt

Hmm, I'm guessing this patch got lost.

-- Steve


On Wed, 25 May 2016 13:47:26 -0400
Steven Rostedt  wrote:

> Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
> after enabling function tracer. I asked him to bisect the functions within
> available_filter_functions, which he did and it came down to three:
> 
>  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
> 
> It was found that this is only an issue when noreplace-paravirt is added to
> the kernel command line.
> 
> This means that those functions are most likely called within critical
> sections of the funtion tracer, and must not be traced.
> 
> In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
> longer an issue. But both _paravirt_ident_{32,64}() causes the following
> splat when they are traced:
> 
>  mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
>  mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
>  NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
>  Modules linked in: e1000e
>  CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
> 05/07/2012
>  task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
>  RIP: 0010:[]  [] 
> queued_spin_lock_slowpath+0x118/0x1a0
>  RSP: 0018:8800d4aefb90  EFLAGS: 0246
>  RAX:  RBX:  RCX: 88011eb16d40
>  RDX: 82485760 RSI: 1f288820 RDI: ea008030
>  RBP: 8800d4aefb90 R08: 000c R09: 
>  R10: 821c8e0e R11:  R12: 88200fb8
>  R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
>  FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
>  Stack:
>   8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
>   8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
>   0015 8800d2276050 8000c0fd8867 ea008030
>  Call Trace:
>   [] _raw_spin_lock+0x27/0x30
>   [] handle_pte_fault+0x13db/0x16b0
>   [] ? function_trace_call+0x15b/0x180
>   [] ? handle_pte_fault+0x5/0x16b0
>   [] handle_mm_fault+0x312/0x670
>   [] ? find_vma+0x68/0x70
>   [] __do_page_fault+0x1b1/0x4e0
>   [] do_page_fault+0x22/0x30
>   [] page_fault+0x28/0x30
>   [] ? copy_user_enhanced_fast_string+0x5/0x10
>   [] ? seq_read+0x305/0x370
>   [] __vfs_read+0x28/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x1e/0xa8
>  Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
>  c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
>  4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
> 
> Reported-by: Łukasz Daniluk 
> Signed-off-by: Steven Rostedt 
> 
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index f08ac28b8136..f975d226be6e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
>   ".popsection");
>  
>  /* identity function, which can be inlined */
> -u32 _paravirt_ident_32(u32 x)
> +u32 notrace _paravirt_ident_32(u32 x)
>  {
>   return x;
>  }
>  
> -u64 _paravirt_ident_64(u64 x)
> +u64 notrace _paravirt_ident_64(u64 x)
>  {
>   return x;
>  }

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] x86/paravirt: Do not trace _paravirt_ident_*() functions

2016-09-02 Thread Steven Rostedt

I just spent half a day bisecting function tracing because I tripped
over this again. I thought this was merged, but I guess it was missed
again.

Can someone please pull this in. And mark it for stable, it goes
probably as far back as 2.6.32.

Thanks!

-- Steve

On Wed, 25 May 2016 13:47:26 -0400
Steven Rostedt  wrote:

> Łukasz Daniluk reported that on a RHEL kernel that his machine would lock up
> after enabling function tracer. I asked him to bisect the functions within
> available_filter_functions, which he did and it came down to three:
> 
>  _paravirt_nop(), _paravirt_ident_32() and _paravirt_ident_64()
> 
> It was found that this is only an issue when noreplace-paravirt is added to
> the kernel command line.
> 
> This means that those functions are most likely called within critical
> sections of the funtion tracer, and must not be traced.
> 
> In newer kenels _paravirt_nop() is defined within gcc asm(), and is no
> longer an issue. But both _paravirt_ident_{32,64}() causes the following
> splat when they are traced:
> 
>  mm/pgtable-generic.c:33: bad pmd 8800d2435150(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 8800d3624190(01d00070)
>  mm/pgtable-generic.c:33: bad pmd 8800d36a5110(01d00054)
>  mm/pgtable-generic.c:33: bad pmd 880118eb1450(01d00054)
>  NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [systemd-journal:469]
>  Modules linked in: e1000e
>  CPU: 2 PID: 469 Comm: systemd-journal Not tainted 4.6.0-rc4-test+ #513
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 
> 05/07/2012
>  task: 880118f740c0 ti: 8800d4aec000 task.ti: 8800d4aec000
>  RIP: 0010:[]  [] 
> queued_spin_lock_slowpath+0x118/0x1a0
>  RSP: 0018:8800d4aefb90  EFLAGS: 0246
>  RAX:  RBX:  RCX: 88011eb16d40
>  RDX: 82485760 RSI: 1f288820 RDI: ea008030
>  RBP: 8800d4aefb90 R08: 000c R09: 
>  R10: 821c8e0e R11:  R12: 88200fb8
>  R13: 7f7a4e3f7000 R14: ea000303f600 R15: 8800d4b562e0
>  FS:  7f7a4e3d7840() GS:88011eb0() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 7f7a4e3f7000 CR3: d3e71000 CR4: 001406e0
>  Stack:
>   8800d4aefba0 81cc5f47 8800d4aefc60 8122c15b
>   8800d4aefcb0 8800d4aefbd0 811bf4cb 0002
>   0015 8800d2276050 8000c0fd8867 ea008030
>  Call Trace:
>   [] _raw_spin_lock+0x27/0x30
>   [] handle_pte_fault+0x13db/0x16b0
>   [] ? function_trace_call+0x15b/0x180
>   [] ? handle_pte_fault+0x5/0x16b0
>   [] handle_mm_fault+0x312/0x670
>   [] ? find_vma+0x68/0x70
>   [] __do_page_fault+0x1b1/0x4e0
>   [] do_page_fault+0x22/0x30
>   [] page_fault+0x28/0x30
>   [] ? copy_user_enhanced_fast_string+0x5/0x10
>   [] ? seq_read+0x305/0x370
>   [] __vfs_read+0x28/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] ? __vfs_read+0x5/0xe0
>   [] vfs_read+0x86/0x130
>   [] SyS_read+0x46/0xa0
>   [] entry_SYSCALL_64_fastpath+0x1e/0xa8
>  Code: 12 48 c1 ea 0c 83 e8 01 83 e2 30 48 98 48 81 c2 40 6d 01 00 48 03 14
>  c5 80 6a 5d 82 48 89 0a 8b 41 08 85 c0 75 09 f3 90 8b 41 08 <85> c0 74 f7
>  4c 8b 09 4d 85 c9 74 08 41 0f 18 09 eb 02 f3 90 8b
> 
> Reported-by: Łukasz Daniluk 
> Signed-off-by: Steven Rostedt 
> 
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index f08ac28b8136..f975d226be6e 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,12 +55,12 @@ asm (".pushsection .entry.text, \"ax\"\n"
>   ".popsection");
>  
>  /* identity function, which can be inlined */
> -u32 _paravirt_ident_32(u32 x)
> +u32 notrace _paravirt_ident_32(u32 x)
>  {
>   return x;
>  }
>  
> -u64 _paravirt_ident_64(u64 x)
> +u64 notrace _paravirt_ident_64(u64 x)
>  {
>   return x;
>  }

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-05 Thread Steven Rostedt
On Wed,  4 Oct 2017 14:19:56 -0700
Thomas Garnier  wrote:

> When using -fPIE/PIC with function tracing, the compiler generates a
> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> takes 6 bytes instead of 5 on the usual relative call.
> 
> With this change, function tracing supports 6 bytes on traceable
> function and can still replace relative calls on the ftrace assembly
> functions.
> 
> Position Independent Executable (PIE) support will allow to extended the
> KASLR randomization range below the -2G memory limit.

Question: This 6 bytes is only the initial call that gcc creates. When
function tracing is enabled, the calls are back to the normal call to
the ftrace trampoline?

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC v3 20/27] x86/ftrace: Adapt function tracing for PIE support

2017-10-05 Thread Steven Rostedt
On Thu, 5 Oct 2017 09:01:14 -0700
Thomas Garnier  wrote:

> On Thu, Oct 5, 2017 at 6:06 AM, Steven Rostedt  wrote:
> > On Wed,  4 Oct 2017 14:19:56 -0700
> > Thomas Garnier  wrote:
> >  
> >> When using -fPIE/PIC with function tracing, the compiler generates a
> >> call through the GOT (call *__fentry__@GOTPCREL). This instruction
> >> takes 6 bytes instead of 5 on the usual relative call.
> >>
> >> With this change, function tracing supports 6 bytes on traceable
> >> function and can still replace relative calls on the ftrace assembly
> >> functions.
> >>
> >> Position Independent Executable (PIE) support will allow to extended the
> >> KASLR randomization range below the -2G memory limit.  
> >
> > Question: This 6 bytes is only the initial call that gcc creates. When
> > function tracing is enabled, the calls are back to the normal call to
> > the ftrace trampoline?  
> 
> That is correct.
> 

Then I think a better idea is to simply nop them out at compile time,
and have the code that updates them to nops to know about it.

See scripts/recordmcount.c

Could we simply add a 5 byte nop followed by a 1 byte nop, and treat it
the same as if it didn't exist? This code can be a little complex, and
can cause really nasty side effects if things go wrong. I would like to
keep from adding more variables to the changes here.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 1/3] MAINTAINERS: Update maintainers for paravirt ops and VMware hypervisor interface

2021-11-11 Thread Steven Rostedt
On Thu, 11 Nov 2021 07:50:39 +0100
Greg KH  wrote:

> > Signed-off-by: Srivatsa S. Bhat (VMware) 
> > Acked-by: Alexey Makhalov 
> > Acked-by: Deep Shah 
> > Acked-by: Juergen Gross 
> > Cc: sta...@vger.kernel.org  
> 
> Why are MAINTAINERS updates needed for stable?  That's not normal :(

Probably not needed, but does it hurt?  And who's normal?

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 9/9] testmmiotrace: eliminate anonymous module_init & module_exit

2022-03-16 Thread Steven Rostedt
On Wed, 16 Mar 2022 12:20:10 -0700
Randy Dunlap  wrote:

> Eliminate anonymous module_init() and module_exit(), which can lead to
> confusion or ambiguity when reading System.map, crashes/oops/bugs,
> or an initcall_debug log.
> 
> Give each of these init and exit functions unique driver-specific
> names to eliminate the anonymous names.
> 
> Example 1: (System.map)
>  832fc78c t init
>  832fc79e t init
>  832fc8f8 t init
> 
> Example 2: (initcall_debug log)
>  calling  init+0x0/0x12 @ 1
>  initcall init+0x0/0x12 returned 0 after 15 usecs
>  calling  init+0x0/0x60 @ 1
>  initcall init+0x0/0x60 returned 0 after 2 usecs
>  calling  init+0x0/0x9a @ 1
>  initcall init+0x0/0x9a returned 0 after 74 usecs
> 
> Fixes: 8b7d89d02ef3 ("x86: mmiotrace - trace memory mapped IO")
> Signed-off-by: Randy Dunlap 
> Cc: Thomas Gleixner 
> Cc: Steven Rostedt 

Acked-by: Steven Rostedt (Google) 

-- Steve

> Cc: Ingo Molnar 
> Cc: Karol Herbst 
> Cc: Pekka Paalanen 
> Cc: Dave Hansen 
> Cc: Andy Lutomirski 
> Cc: Peter Zijlstra 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Cc: nouv...@lists.freedesktop.org
> Cc: x...@kernel.org
> ---
>  arch/x86/mm/testmmiotrace.c |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> --- lnx-517-rc8.orig/arch/x86/mm/testmmiotrace.c
> +++ lnx-517-rc8/arch/x86/mm/testmmiotrace.c
> @@ -113,7 +113,7 @@ static void do_test_bulk_ioremapping(voi
>   synchronize_rcu();
>  }
>  
> -static int __init init(void)
> +static int __init testmmiotrace_init(void)
>  {
>   unsigned long size = (read_far) ? (8 << 20) : (16 << 10);
>   int ret = security_locked_down(LOCKDOWN_MMIOTRACE);
> @@ -136,11 +136,11 @@ static int __init init(void)
>   return 0;
>  }
>  
> -static void __exit cleanup(void)
> +static void __exit testmmiotrace_cleanup(void)
>  {
>   pr_debug("unloaded.\n");
>  }
>  
> -module_init(init);
> -module_exit(cleanup);
> +module_init(testmmiotrace_init);
> +module_exit(testmmiotrace_cleanup);
>  MODULE_LICENSE("GPL");

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH] virtio: Workaround fix for hard hang on guest using fifos

2022-03-25 Thread Steven Rostedt
This is more of a workaround patch and probably not the proper fix. But
I'm doing some work that is using fifos for guests and this is causing a
hang that is quite annoying.

I currently have this patch applied to continue my work.

I was working on analyzing data transfers between host and guests via
virtio sockets (FIFOs on host, dev on guest), vsockets and TCP packets.

I wrote a program to test each by passing a 1GB file and timing it. I'm
using the splice system call to help move things along. In doing so, I
found that my pipe between splice calls originally used "page_size" for
data transfer, and that is not as efficient as finding out what the pipe
size is. So I changed the code to use pipe_size and while debugging it, the
guest locked up hard.

I'm attaching the "agent-fifo" that runs on the guest, and the
"client-fifo" that runs on the host (the names may be backwards, but
makes sense when you add how I test vsockets and network packets).

Here's what I did:

 # ./client-fifo /var/lib/virt/Guest/trace-pipe-cpu0.out /test/bigfile

Where the trace-pipe-cpu0.out is the receiving side from the guest's virtio
pipe. The /test/bigfile is created when data starts coming in from the
guest pipe.

 # dd if=/dev/urandom of=bigfile bs=1024 count=1048576
 # ./agent-fifo /dev/virtio-ports/trace-pipe-cpu0 bigfile

With the updates to change the size being passed in the splice from
page_size to pipe_size, this never finished (it would copy around a meg or
so). And stopped. When I killed the agent-fifo task on the guest, the guest
hung hard.

Debugging this, I found that the guest is stuck in the loop in
drivers/char/virt_console.c: __send_control_msg():

if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
while (!virtqueue_get_buf(vq, &len)
&& !virtqueue_is_broken(vq))
cpu_relax();
}

It never exits that loop. My workaround (this patch) is to put in a
timeout, and exit out if it spins there for more than 5 seconds. This
makes the problem go away.

Below is my changes, but this is a band-aid, it is not the cure.

Workaround-fix-by: Steven Rostedt (Google) 
---
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
index e3c430539a17..65f259f3f8cb 100644
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -551,6 +551,7 @@ static ssize_t __send_control_msg(struct ports_device 
*portdev, u32 port_id,
struct scatterlist sg[1];
struct virtqueue *vq;
unsigned int len;
+   u64 end;
 
if (!use_multiport(portdev))
return 0;
@@ -567,9 +568,15 @@ static ssize_t __send_control_msg(struct ports_device 
*portdev, u32 port_id,
 
if (virtqueue_add_outbuf(vq, sg, 1, &portdev->cpkt, GFP_ATOMIC) == 0) {
virtqueue_kick(vq);
+   end = jiffies + 5 * HZ;
while (!virtqueue_get_buf(vq, &len)
-   && !virtqueue_is_broken(vq))
+   && !virtqueue_is_broken(vq)) {
+   if (unlikely(end < jiffies)) {
+   dev_warn(&portdev->vdev->dev, "send_control_msg 
timed out!\n");
+   break;
+   }
cpu_relax();
+   }
}
 
spin_unlock(&portdev->c_ovq_lock);
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#define _GNU_SOURCE
#include 

#ifndef F_GETPIPE_SZ
# define F_GETPIPE_SZ	1032 /* The Linux number for the option */
#endif

const char *this_name = "agent-fifo";

static void usage(void)
{
	printf("usage: %s dev bigfile\n"
	   "\n",this_name);
	exit(-1);
}

static void __vdie(const char *fmt, va_list ap, int err)
{
	int ret = errno;
	const char *p = this_name;

	if (err && errno)
		perror(p);
	else
		ret = -1;

	fprintf(stderr, "  ");
	vfprintf(stderr, fmt, ap);

	fprintf(stderr, "\n");
	exit(ret);
}

void die(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 0);
	va_end(ap);
}

void pdie(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	__vdie(fmt, ap, 1);
	va_end(ap);
}

static unsigned long long time()
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec * 100 + tv.tv_usec;
}

static void print_time(unsigned long long time)
{
	unsigned long long seconds;
	unsigned long long usecs;

	seconds = time / 100;
	usecs = time - (seconds * 100);

	printf("time: %llu.%06llu\n", seconds, usecs);
}

int main(int argc, char *argv[])
{
	unsigned long long start;
	unsigned long long end;
	struct stat st;
	off_t size;
	char *file;
	char *port;
	int page_size;
	int pipe_size;
	int brass[2];
	int ret

Re: [PATCH] virtio: Workaround fix for hard hang on guest using fifos

2022-03-25 Thread Steven Rostedt
On Fri, 25 Mar 2022 17:36:27 -0700
Linus Torvalds  wrote:

> On Fri, Mar 25, 2022 at 2:20 PM Steven Rostedt  wrote:
> >
> > With the updates to change the size being passed in the splice from
> > page_size to pipe_size, this never finished (it would copy around a meg or
> > so). And stopped. When I killed the agent-fifo task on the guest, the guest
> > hung hard.  
> 
> Without knowing (or really caring) at all how virtqueue works, this
> sounds very much like the classic pipe deadlock where two processes
> communicate over a pair of pipes, sending each other commands, and
> replying to each other with status updates.

It does look like this. Note, this is not due to my application, since it
flows in one direction, but I think the way virtio pipes are implemented
can cause this to occur.

A virtio pipe on the guest looks like a normal bidirectional char device.
You open it, read it, write to it. No problem. On the host, there are two
FIFOs that are attached to the guest. One for each direction. That is, the
host is using two pipes that convert to a character device on the guest.

> 
> And you absolutely cannot do that if one side can possibly want to up
> fill the whole pipe.
> 
> Deadlock:
> 
>  - process A is trying to send data to process B (on 'pipe_A'), and
> blocks because the pipe is full
> 
>  - process B is reads the data and everything is fine, and A gets to continue
> 

I think you left out a step here, where A is blocked again.

>  - but then process B sends some stratus update the other way (on
> 'pipe_B' - you can't use the same pipe for bidirectional, it's why you
> use a pair of pipes or a socketpair) and waits for the result.
> 
>  - now A and B are both waiting for each other - A is waiting for B to
> empty the big bunch of data it's sending, and B is waiting for the
> result for the (small) command it sent.
> 
> and neither makes any progress.
> 
> You can find several mentions of these kinds of problems by just
> googling for "bidirectional pipe deadlock" or similar.
> 
> The solution is invariably to either
> 
>  (a) make sure that nobody writes even remotely close to enough data
> to fill a pipe before reading the other pipe (you can still fill up a
> pipe, but at least somebody is always going to succeed and make
> progress and do the read to make progress).
> 
>  (b) make sure everybody who writes to a pipe will use nonblocking IO
> (and is willing to do reads in between to satisfy the other end).
> 
> That first case is basically what one of the PIPE_BUF guarantees is
> all about (the other one is the atomicity it guarantees, ie you can
> write a "command packet" and be guaranteed that readers will see it
> without data mixed in from other writes).
> 
> I have no idea what your client/agent does and how it interacts with
> the virtio pipes, but it really _sounds_ like a very similar issue,
> where it used to work (because PIPE_BUF) and now no longer does
> (because pipe filled).

So the agent writes to the virtio char device on the guest, while the
client is reading the guest's FIFO on the host.

One directional.


> 
> And that virtio_console __send_control_msg() pattern very much sounds
> like a "send data and wait for ACK" behavior of "process B".

Something happens where data stops transferring (the virtio system waiting
for status?) This mostly happens when I run the agent under gdb. Which it
stops transferring data. Everything is still fine. I hit Ctrl^C twice to
get back to the gdb command prompt, and then I kill the process with the
gdb "kill" command. It asks me if I'm sure, I type "y", and then boom! The
system is hung. The guest locks up.

And printk()s show that the exit of the agent is in that loop.

 watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [agent-test:1165]
 Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables bnep 
bluetooth ecdh_generic ecc cfg80211 rfkill cirrus drm_shmem_helper 
drm_kms_helper vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common 
pcspkr joydev serio_raw vsock virtio_console drm virtio_balloon 8139too 
i2c_piix4 floppy nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net 
net_failover failover virtio_pci virtio virtio_pci_legacy_dev 
virtio_pci_modern_dev virtio_ring 8139cp mii ata_generic pata_acpi
 CPU: 0 PID: 1165 Comm: agent-test Not tainted 5.17.0-test+ #41
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 RIP: 0010:virtqueue_get_buf_ctx_split+0x0/0xd0 [virtio_ring]
 Code: 75 a6 45 31 e4 39 45 2c 74 e6 0f 0b 83 c6 01 48 83 c2 10 39 c6 75 b4 45 
31 e4 39 45 2c 74 d1 0f 0b 66 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 41 54 53 
80 7f 3b 00 0f 85 9a 00 00 00 48 8b 47 68
 RSP: 0018:b85c012e7c68 EFLAGS: 02

Re: [PATCH 24/36] printk: Remove trace_.*_rcuidle() usage

2022-06-14 Thread Steven Rostedt
On Thu, 9 Jun 2022 15:02:20 +0200
Petr Mladek  wrote:

> > I'm somewhat curious whether we can actually remove that trace event.  
> 
> Good question.
> 
> Well, I think that it might be useful. It allows to see trace and
> printk messages together.

Yes people still use it. I was just asked about it at Kernel Recipes. That
is, someone wanted printk mixed in with the tracing, and I told them about
this event (which they didn't know about but was happy to hear that it
existed).

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: netconsole deadlock with virtnet

2020-11-17 Thread Steven Rostedt
On Tue, 17 Nov 2020 12:23:41 +0200
Leon Romanovsky  wrote:

> Hi,
> 
> Approximately two weeks ago, our regression team started to experience those
> netconsole splats. The tested code is Linus's master (-rc4) + netdev net-next
> + netdev net-rc.
> 
> Such splats are random and we can't bisect because there is no stable 
> reproducer.
> 
> Any idea, what is the root cause?
> 
> [   21.141117] netpoll: netconsole: local port 
> [   21.141947] netpoll: netconsole: local IPv4 address 10.236.251.7
> [   21.143138] netpoll: netconsole: interface 'eth1'
> [   21.143988] netpoll: netconsole: remote port 62517
> [   21.145127] netpoll: netconsole: remote IPv4 address 10.236.251.1
> [   21.146206] netpoll: netconsole: remote ethernet address 68:05:ca:aa:99:49
> [   21.149464] =
> [   21.149466] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> [   21.149467] 5.10.0-rc4_for_upstream_min_debug_2020_11_16_13_03 #1 Not 
> tainted
> [   21.149468] -
> [   21.149469] modprobe/596 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [   21.149471] 000194b4e298 (_xmit_ETHER#2){+.-.}-{2:2}, at: 
> virtnet_poll_tx+0x84/0x120
> [   21.149478] and this task is already holding:
> [   21.149478] 893d4018 (target_list_lock){}-{2:2}, at: 
> write_msg+0x6c/0x120 [netconsole]
> [   21.149483] which would create a new lock dependency:
> [   21.149484]  (target_list_lock){}-{2:2} -> (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149491] but this new dependency connects a HARDIRQ-irq-safe lock:
> [   21.149492]  (console_owner){-.-.}-{0:0}
> [   21.149496] ... which became HARDIRQ-irq-safe at:
> [   21.149496]   __lock_acquire+0xa78/0x1a94
> [   21.149498]   lock_acquire.part.0+0x170/0x360
> [   21.149498]   lock_acquire+0x68/0x8c
> [   21.149500]   console_unlock+0x1e8/0x6a4
> [   21.149500]   vprintk_emit+0x1c4/0x3c4
> [   21.149501]   vprintk_default+0x40/0x4c
> [   21.149502]   vprintk_func+0x10c/0x220
> [   21.149503]   printk+0x68/0x90
> [   21.149504]   crng_fast_load+0x1bc/0x1c0
> [   21.149505]   add_interrupt_randomness+0x280/0x290
> [   21.149506]   handle_irq_event+0x80/0x120
> [   21.149507]   handle_fasteoi_irq+0xac/0x200
> [   21.149508]   __handle_domain_irq+0x84/0xf0
> [   21.149509]   gic_handle_irq+0xd4/0x320
> [   21.149510]   el1_irq+0xd0/0x180
> [   21.149511]   arch_cpu_idle+0x24/0x44
> [   21.149512]   default_idle_call+0x48/0xa0
> [   21.149513]   do_idle+0x260/0x300
> [   21.149514]   cpu_startup_entry+0x30/0x6c
> [   21.149515]   rest_init+0x1b4/0x288
> [   21.149515]   arch_call_rest_init+0x18/0x24
> [   21.149516]   start_kernel+0x5cc/0x608
> [   21.149518] to a HARDIRQ-irq-unsafe lock:
> [   21.149519]  (_xmit_ETHER#2){+.-.}-{2:2}
> [   21.149523] ... which became HARDIRQ-irq-unsafe at:
> [   21.149524] ...  __lock_acquire+0x8bc/0x1a94
> [   21.149525]   lock_acquire.part.0+0x170/0x360
> [   21.149526]   lock_acquire+0x68/0x8c
> [   21.149527]   _raw_spin_trylock+0x80/0xd0
> [   21.149527]   virtnet_poll+0xac/0x360
> [   21.149528]   net_rx_action+0x1b0/0x4e0
> [   21.149529]   __do_softirq+0x1f4/0x638
> [   21.149530]   do_softirq+0xb8/0xcc
> [   21.149531]   __local_bh_enable_ip+0x18c/0x200
> [   21.149532]   virtnet_napi_enable+0xc0/0xd4
> [   21.149533]   virtnet_open+0x98/0x1c0
> [   21.149534]   __dev_open+0x12c/0x200
> [   21.149535]   __dev_change_flags+0x1a0/0x220
> [   21.149536]   dev_change_flags+0x2c/0x70
> [   21.149536]   do_setlink+0x214/0xe20
> [   21.149537]   __rtnl_newlink+0x514/0x820
> [   21.149538]   rtnl_newlink+0x58/0x84
> [   21.149539]   rtnetlink_rcv_msg+0x184/0x4b4
> [   21.149540]   netlink_rcv_skb+0x60/0x124
> [   21.149541]   rtnetlink_rcv+0x20/0x30
> [   21.149542]   netlink_unicast+0x1b4/0x270
> [   21.149543]   netlink_sendmsg+0x1f0/0x400
> [   21.149544]   sock_sendmsg+0x5c/0x70
> [   21.149545]   sys_sendmsg+0x24c/0x280
> [   21.149545]   ___sys_sendmsg+0x88/0xd0
> [   21.149546]   __sys_sendmsg+0x70/0xd0
> [   21.149547]   __arm64_sys_sendmsg+0x2c/0x40
> [   21.149548]   el0_svc_common.constprop.0+0x84/0x200
> [   21.149549]   do_el0_svc+0x2c/0x90
> [   21.149550]   el0_svc+0x18/0x50
> [   21.149551]   el0_sync_handler+0xe0/0x350
> [   21.149552]   el0_sync+0x158/0x180
> [   21.149553] other info that might help us debug this:
> [   21.149555] Chain exists of:
> [   21.149556]   console_owner --> target_list_lock --> _xmit_ETHER#2

So somewhere we have target_list_lock taken while holding the console_owner
lock (which can happen in interrupt). And what this is saying,
target_list_lock is held somewhere where _xmit_ETHER#2 is taken (which I'm
guessing from the back traces is the xmit_lock taken in virtnet_poll_tx().

Thus, you can have a deadlock with three CPUs (I need to update the lockdep
output to make it proper one of these days).

CPU0CPU1CPU2
   

Re: netconsole deadlock with virtnet

2020-11-18 Thread Steven Rostedt


[ Adding netdev as perhaps someone there knows ]

On Wed, 18 Nov 2020 12:09:59 +0800
Jason Wang  wrote:

> > This CPU0 lock(_xmit_ETHER#2) -> hard IRQ -> lock(console_owner) is
> > basically
> > soft IRQ -> lock(_xmit_ETHER#2) -> hard IRQ -> printk()
> >
> > Then CPU1 spins on xmit, which is owned by CPU0, CPU0 spins on
> > console_owner, which is owned by CPU1?  

It still looks to me that the target_list_lock is taken in IRQ, (which can
be the case because printk calls write_msg() which takes that lock). And
someplace there's a:

lock(target_list_lock)
lock(xmit_lock)

which means you can remove the console lock from this scenario completely,
and you still have a possible deadlock between target_list_lock and
xmit_lock.

> 
> 
> If this is true, it looks not a virtio-net specific issue but somewhere 
> else.
> 
> I think all network driver will synchronize through bh instead of hardirq.

I think the issue is where target_list_lock is held when we take xmit_lock.
Is there anywhere in netconsole.c that can end up taking xmit_lock while
holding the target_list_lock? If so, that's the problem. As
target_list_lock is something that can be taken in IRQ context, which means
*any* other lock that is taking while holding the target_list_lock must
also protect against interrupts from happening while it they are held.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt
On Mon, 23 Nov 2020 13:08:55 +0200
Leon Romanovsky  wrote:


>  [   10.028024] Chain exists of:
>  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2

Note, the problem is that we have a location that grabs the xmit_lock while
holding target_list_lock (and possibly console_owner).


>  [   10.028028]
>  [   10.028028]  Possible interrupt unsafe locking scenario:
>  [   10.028029]
>  [   10.028029]CPU0CPU1
>  [   10.028030]
>  [   10.028030]   lock(_xmit_ETHER#2);
>  [   10.028032]local_irq_disable();
>  [   10.028032]lock(console_owner);
>  [   10.028034]lock(target_list_lock);
>  [   10.028035]   
>  [   10.028035] lock(console_owner);
>  [   10.028036]
>  [   10.028037]  *** DEADLOCK ***
>  [   10.028037]



>  [   10.028107] the dependencies between the lock to be acquired
>  [   10.028107]  and HARDIRQ-irq-unsafe lock:
>  [   10.028108] -> (_xmit_ETHER#2){+.-.}-{2:2} ops: 217 {
>  [   10.028110]HARDIRQ-ON-W at:
>  [   10.028111]__lock_acquire+0x8bc/0x1a94
>  [   10.028111]lock_acquire.part.0+0x170/0x360
>  [   10.028112]lock_acquire+0x68/0x8c
>  [   10.028113]_raw_spin_trylock+0x80/0xd0
>  [   10.028113]virtnet_poll+0xac/0x360

xmit_lock is taken in virtnet_poll() (via virtnet_poll_cleantx()).

This is called from the softirq, and interrupts are not disabled.

>  [   10.028114]net_rx_action+0x1b0/0x4e0
>  [   10.028115]__do_softirq+0x1f4/0x638
>  [   10.028115]do_softirq+0xb8/0xcc
>  [   10.028116]__local_bh_enable_ip+0x18c/0x200
>  [   10.028116]virtnet_napi_enable+0xc0/0xd4
>  [   10.028117]virtnet_open+0x98/0x1c0
>  [   10.028118]__dev_open+0x12c/0x200
>  [   10.028118]__dev_change_flags+0x1a0/0x220
>  [   10.028119]dev_change_flags+0x2c/0x70
>  [   10.028119]do_setlink+0x214/0xe20
>  [   10.028120]__rtnl_newlink+0x514/0x820
>  [   10.028120]rtnl_newlink+0x58/0x84
>  [   10.028121]rtnetlink_rcv_msg+0x184/0x4b4
>  [   10.028122]netlink_rcv_skb+0x60/0x124
>  [   10.028122]rtnetlink_rcv+0x20/0x30
>  [   10.028123]netlink_unicast+0x1b4/0x270
>  [   10.028124]netlink_sendmsg+0x1f0/0x400
>  [   10.028124]sock_sendmsg+0x5c/0x70
>  [   10.028125]sys_sendmsg+0x24c/0x280
>  [   10.028125]___sys_sendmsg+0x88/0xd0
>  [   10.028126]__sys_sendmsg+0x70/0xd0
>  [   10.028127]__arm64_sys_sendmsg+0x2c/0x40
>  [   10.028128]el0_svc_common.constprop.0+0x84/0x200
>  [   10.028128]do_el0_svc+0x2c/0x90
>  [   10.028129]el0_svc+0x18/0x50
>  [   10.028129]el0_sync_handler+0xe0/0x350
>  [   10.028130]el0_sync+0x158/0x180

[..]

>  [   10.028171]  ... key  at: [] 
> netdev_xmit_lock_key+0x10/0x390
>  [   10.028171]  ... acquired at:
>  [   10.028172]__lock_acquire+0x134c/0x1a94
>  [   10.028172]lock_acquire.part.0+0x170/0x360
>  [   10.028173]lock_acquire+0x68/0x8c
>  [   10.028173]_raw_spin_lock+0x64/0x90
>  [   10.028174]virtnet_poll_tx+0x84/0x120
>  [   10.028174]netpoll_poll_dev+0x12c/0x350
>  [   10.028175]netpoll_send_skb+0x39c/0x400
>  [   10.028175]netpoll_send_udp+0x2b8/0x440
>  [   10.028176]write_msg+0xfc/0x120 [netconsole]
>  [   10.028176]console_unlock+0x3ec/0x6a4

The above shows the problem. We have:

console_unlock() (which holds the console_owner lock)
write_msg() (which holds the target_list_lock)

Then we write_msg() calls:

netpoll_send_udp() {
  netpoll_send_skb() {
netpoll_poll_dev() {
  virtnet_poll_tx() (which takes the xmit_lock!)

  DEADLOCK!


In netpoll_send_skb() I see this:

/* tickle device maybe there is some cleanup */
netpoll_poll_dev(np->dev);

Which looks to me that it will call some code that should only be used in
softirq context. It's called with locks held that are taken in interrupt
context, and any locks that are taken in netpoll_poll_dev() must always be
taken with interrupts disabled. That is, if xmit_lock is taken within
netpoll_poll_dev(), then it must always be taken with interrupts disabled.
Otherwise you can have the deadlock that lockdep reported.

-- Steve




>  [   10.028177]register

Re: netconsole deadlock with virtnet

2020-11-23 Thread Steven Rostedt
On Mon, 23 Nov 2020 10:52:52 -0800
Jakub Kicinski  wrote:

> On Mon, 23 Nov 2020 09:31:28 -0500 Steven Rostedt wrote:
> > On Mon, 23 Nov 2020 13:08:55 +0200
> > Leon Romanovsky  wrote:
> > 
> >   
> > >  [   10.028024] Chain exists of:
> > >  [   10.028025]   console_owner --> target_list_lock --> _xmit_ETHER#2
> > 
> > Note, the problem is that we have a location that grabs the xmit_lock while
> > holding target_list_lock (and possibly console_owner).  
> 
> Well, it try_locks the xmit_lock. Does lockdep understand try-locks?
> 
> (not that I condone the shenanigans that are going on here)

Does it?

virtnet_poll_tx() {
__netif_tx_lock() {
spin_lock(&txq->_xmit_lock);

That looks like we can have:


CPU0CPU1

   lock(xmit_lock)

lock(console)
lock(target_list_lock)
__netif_tx_lock()
lock(xmit_lock);

[BLOCKED]

   
   lock(console)

   [BLOCKED]



 DEADLOCK.


So where is the trylock here?

Perhaps you need the trylock in virtnet_poll_tx()?

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: netconsole deadlock with virtnet

2020-11-24 Thread Steven Rostedt
On Tue, 24 Nov 2020 11:22:03 +0800
Jason Wang  wrote:

> Btw, have a quick search, there are several other drivers that uses tx 
> lock in the tx NAPI.

tx NAPI is not the issue. The issue is that write_msg() (in netconsole.c)
calls this polling logic with the target_list_lock held.

Are those other drivers called by netconsole? If not, then this is unique
to virtio-net.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v5 19/19] virtio/vsock: update trace event for SEQPACKET

2021-03-02 Thread Steven Rostedt
On Thu, 18 Feb 2021 08:42:15 +0300
Arseny Krasnov  wrote:

Not sure if this was pulled in yet, but I do have a small issue with this
patch.

> @@ -69,14 +82,19 @@ TRACE_EVENT(virtio_transport_alloc_pkt,
>   __entry->type = type;
>   __entry->op = op;
>   __entry->flags = flags;
> + __entry->msg_len = msg_len;
> + __entry->msg_cnt = msg_cnt;
>   ),
> - TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x",
> + TP_printk("%u:%u -> %u:%u len=%u type=%s op=%s flags=%#x "
> +   "msg_len=%u msg_cnt=%u",

It's considered poor formatting to split strings like the above. This is
one of the exceptions for the 80 character limit. Do not break strings just
to keep it within 80 characters.

-- Steve


> __entry->src_cid, __entry->src_port,
> __entry->dst_cid, __entry->dst_port,
> __entry->len,
> show_type(__entry->type),
> show_op(__entry->op),
> -   __entry->flags)
> +   __entry->flags,
> +   __entry->msg_len,
> +   __entry->msg_cnt)
>  );
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 0/6] use canonical ftrace path whenever possible

2023-03-10 Thread Steven Rostedt
On Fri, 10 Mar 2023 12:06:58 -0700
Ross Zwisler  wrote:

> On Fri, Mar 10, 2023 at 03:29:49AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 15, 2023 at 03:33:44PM -0700, Ross Zwisler wrote:  
> > > Changes in v2:
> > >  * Dropped patches which were pulled into maintainer trees.
> > >  * Split BPF patches out into another series targeting bpf-next.
> > >  * trace-agent now falls back to debugfs if tracefs isn't present.
> > >  * Added Acked-by from m...@redhat.com to series.
> > >  * Added a typo fixup for the virtio-trace README.
> > > 
> > > Steven, assuming there are no objections, would you feel comfortable
> > > taking this series through your tree?  
> > 
> > Acked-by: Michael S. Tsirkin 
> > 
> > if you want the virtio patches through my tree after all, let me know.  
> 
> Yes, please, that would be great.  I'll send out v3 with the few patches that
> haven't gotten a response, but I'll drop the virtio patches assuming you've
> got them.
>

Since the last patch 6/6 is not tracing related, I would prefer it to go
through the virtio tree.

For patches 1-5, please add:

Reviewed-by: Steven Rostedt (Google) 

Thanks!

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 01/20] x86: move prepare_ftrace_return prototype to header

2023-05-16 Thread Steven Rostedt
On Tue, 16 May 2023 21:35:30 +0200
Arnd Bergmann  wrote:

> From: Arnd Bergmann 
> 
> On 32-bit builds, the prepare_ftrace_return() function only has a global
> definition, but no prototype before it, which causes a warning:
> 
> arch/x86/kernel/ftrace.c:625:6: warning: no previous prototype for 
> ‘prepare_ftrace_return’ [-Wmissing-prototypes]
>   625 | void prepare_ftrace_return(unsigned long ip, unsigned long *parent,
> 
> Move the prototype that is already needed for some configurations into
> a header file where it can be seen unconditionally.
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/include/asm/ftrace.h | 3 +++
>  arch/x86/kernel/ftrace.c  | 3 ---
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index 5061ac98ffa1..b8d4a07f9595 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h

Acked-by: Steven Rostedt (Google) 

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Steven Rostedt
On Wed, 13 Sep 2023 13:38:27 +0200
Juergen Gross  wrote:

> diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
> index 44a3f565264d..0577f0cdd231 100644
> --- a/include/trace/events/xen.h
> +++ b/include/trace/events/xen.h
> @@ -6,26 +6,26 @@
>  #define _TRACE_XEN_H
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  struct multicall_entry;
>  
>  /* Multicalls */
>  DECLARE_EVENT_CLASS(xen_mc__batch,
> - TP_PROTO(enum paravirt_lazy_mode mode),
> + TP_PROTO(enum xen_lazy_mode mode),
>   TP_ARGS(mode),
>   TP_STRUCT__entry(
> - __field(enum paravirt_lazy_mode, mode)
> + __field(enum xen_lazy_mode, mode)
>   ),
>   TP_fast_assign(__entry->mode = mode),
>   TP_printk("start batch LAZY_%s",
> -   (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" :
> -   (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE")
> +   (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
> +   (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")

There's helper functions that make the above easier to implement as well as
exports the symbols so that user space can parse this better:

TRACE_DEFINE_ENUM(XEN_LAZY_NONE);
TRACE_DEFINE_ENUM(XEN_LAZY_MMU);
TRACE_DEFINE_ENUM(XEN_LAZY_CPU);

[..]

TP_printk("start batch LAZY_%s",
  __print_symbolic(mode,
   { XEN_LAZY_NONE, "NONE" },
   { XEN_LAZY_MMU,  "MMU   },
   { XEN_LAZY_CPU,  "CPU"  }))

Then user space parsers that read the raw data can convert these events
into something humans can read.

-- Steve

>   );
>  #define DEFINE_XEN_MC_BATCH(name)\
>   DEFINE_EVENT(xen_mc__batch, name,   \
> - TP_PROTO(enum paravirt_lazy_mode mode), \
> + TP_PROTO(enum xen_lazy_mode mode),  \
>TP_ARGS(mode))
>  
>  DEFINE_XEN_MC_BATCH(xen_mc_batch);
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-05 Thread Steven Rostedt
The new trace-cmd 3.0 (which is almost ready to be released) allows for
tracing between host and guests with timestamp synchronization such that
the events on the host and the guest can be interleaved in the proper order
that they occur. KernelShark now has a plugin that visualizes this
interaction.

The implementation requires that the guest has a vsock CID assigned, and on
the guest a "trace-cmd agent" is running, that will listen on a port for
the CID. The on the host a "trace-cmd record -A guest@cid:port -e events"
can be called and the host will connect to the guest agent through the
cid/port pair and have the agent enable tracing on behalf of the host and
send the trace data back down to it.

The problem is that there is no sure fire way to find the CID for a guest.
Currently, the user must know the cid, or we have a hack that looks for the
qemu process and parses the --guest-cid parameter from it. But this is
prone to error and does not work on other implementation (was told that
crosvm does not use qemu).

As I can not find a way to discover CIDs assigned to guests via any kernel
interface, I decided to create this one. Note, I'm not attached to it. If
there's a better way to do this, I would love to have it. But since I'm not
an expert in the networking layer nor virtio, I decided to stick to what I
know and add a debugfs interface that simply lists all the registered CIDs
and the worker task that they are associated with. The worker task at
least has the PID of the task it represents.

Now I can find the cid / host process in charge of the guest pair:

  # cat /sys/kernel/debug/vsock_list
  3 vhost-1954:2002

  # ps aux | grep 1954
  qemu1954  9.9 21.3 1629092 796148 ?  Sl   16:22   0:58  
/usr/bin/qemu-kvm -name guest=Fedora21,debug-threads=on -S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-Fedora21/master-key.aes
 -machine pc-1.2,accel=kvm,usb=off,dump-guest-core=off -cpu qemu64 -m 1000 
-overcommit mem-lock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 
1eefeeb0-3ac7-07c1-926e-236908313b4c -no-user-config -nodefaults -chardev 
socket,id=charmonitor,fd=32,server,nowait -mon 
chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot 
strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -blockdev 
{"driver":"host_device","filename":"/dev/mapper/vg_bxtest-GuestFedora","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}
 -blockdev 
{"node-name":"libvirt-1-format","read-only":false,"driver":"raw","file":"libvirt-1-storage"}
 -device ide-hd,bus=ide.0,unit=0,drive=libvirt-1-
 format,id=ide0-0-0,bootindex=1 -netdev tap,fd=34,id=hostnet0 -device 
rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:9f:e9:d5,bus=pci.0,addr=0x3 
-netdev tap,fd=35,id=hostnet1 -device 
virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:ec:dc:6e,bus=pci.0,addr=0x5 
-chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 
-chardev 
pipe,id=charchannel0,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu0 
-device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=trace-pipe-cpu0
 -chardev 
pipe,id=charchannel1,path=/var/lib/trace-cmd/virt/Fedora21/trace-pipe-cpu1 
-device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=trace-pipe-cpu1
 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 -sandbox 
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -device 
vhost-vsock-pci,id=vsock0,guest-cid=3,vhostfd=16,bus=pci.0,addr=0x7 -msg 
 timestamp=on
  root2000  0.0  0.0  0 0 ?S16:22   0:00 
[kvm-pit/1954]
  root    2002  0.0  0.0  0 0 ?S16:22   0:00 
[vhost-1954]


This is just an example of what I'm looking for. Just a way to find what
process is using what cid.

Signed-off-by: Steven Rostedt (VMware) 
---
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 5e78fb719602..4f03b25b23c1 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "vhost.h"
@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = {
.fops = &vhost_vsock_fops,
 };
 
+static struct dentry *vsock_file;
+
+struct vsock_file_iter {
+   struct hlist_node   *node;
+   int index;
+};
+
+
+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct vsock_file_iter *iter = v;
+   struct vhost_vsock *vsock;
+
+   if (pos)
+   (*pos)++;
+
+   if (iter->index >

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-05 Thread Steven Rostedt
For kicks, I wrote this program that uses libtracefs to search all CIDS
(1-255), and find the kvm guests that are attached to them.

It traces the sched_wakeup and kvm_exit, looking for:

 this_task -> wakeup -> wakeup -> kvm_exit

when doing a connect to a cid.

When it finds the pid that did a kvm_exit, it knows that's the PID that
is woken by the vhost worker task. It's a little slow, and I would
really like a better way to do this, but it's at least an option that
is available now.

-- Steve
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#include 

#define MAX_CID		256

static int this_pid;

static int open_vsock(unsigned int cid, unsigned int port)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = cid,
		.svm_port = port,
	};
	int sd;

	sd = socket(AF_VSOCK, SOCK_STREAM, 0);
	if (sd < 0)
		return -1;

	if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)))
		return -1;

	return sd;
}

struct pids {
	struct pids		*next;
	int			pid;
};

struct trace_info {
	struct tracefs_instance		*instance;
	struct tep_handle		*tep;
	struct tep_event		*wake_up;
	struct tep_event		*kvm_exit;
	struct tep_format_field		*common_pid;
	struct tep_format_field		*wake_pid;
	struct pids			*pids;
	intcid;
	intpid;
};

static void tear_down_trace(struct trace_info *info)
{
	tracefs_instance_file_write(info->instance, "events/enable", "0");
	tracefs_instance_destroy(info->instance);
	tracefs_instance_free(info->instance);
	tep_free(info->tep);
}

static int setup_trace(struct trace_info *info)
{
	const char *systems[] = { "sched", "kvm", NULL};
	char *name;
	int ret;

	info->pids = NULL;

	ret = asprintf(&name, "vsock_find-%d\n", getpid());
	if (ret < 0)
		return ret;

	info->instance = tracefs_instance_create(name);
	free(name);
	if (!info->instance)
		return -1;

	tracefs_trace_off(info->instance);
	info->tep = tracefs_local_events_system(NULL, systems);
	if (!info->tep)
		goto fail;

	info->wake_up = tep_find_event_by_name(info->tep, "sched", "sched_waking");
	if (!info->wake_up) {
		fprintf(stderr, "Failed to find sched_waking\n");
		goto fail;
	}

	info->kvm_exit = tep_find_event_by_name(info->tep, "kvm", "kvm_exit");
	if (!info->kvm_exit) {
		fprintf(stderr, "Failed to find kvm_exit\n");
		goto fail;
	}

	info->wake_pid = tep_find_any_field(info->wake_up, "pid");
	if (!info->wake_pid) {
		fprintf(stderr, "Failed to find wake up pid\n");
		goto fail;
	}

	info->common_pid = tep_find_common_field(info->wake_up,
		 "common_pid");
	if (!info->common_pid) {
		fprintf(stderr, "Failed to find common pid\n");
		goto fail;
	}

	ret = tracefs_instance_file_write(info->instance, "events/sched/sched_waking/enable", "1");
	if (ret < 0) {
		fprintf(stderr, "Failed to enable sched_waking\n");
		goto fail;
	}

	ret = tracefs_instance_file_write(info->instance, "events/kvm/kvm_exit/enable", "1");
	if (ret < 0) {
		fprintf(stderr, "Failed to enable kvm_exit\n");
		goto fail;
	}

	return 0;
fail:
	tear_down_trace(info);
	return -1;
}


static void free_pids(struct pids *pids)
{
	struct pids *next;

	while (pids) {
		next = pids;
		pids = pids->next;
		free(next);
	}
}

static void add_pid(struct pids **pids, int pid)
{
	struct pids *new_pid;

	new_pid = malloc(sizeof(*new_pid));
	if (!new_pid)
		return;

	new_pid->pid = pid;
	new_pid->next = *pids;
	*pids = new_pid;
}

static bool match_pid(struct pids *pids, int pid)
{
	while (pids) {
		if (pids->pid == pid)
			return true;
		pids = pids->next;
	}
	return false;
}

static int callback(struct tep_event *event, struct tep_record *record,
		int cpu, void *data)
{
	struct trace_info *info = data;
	struct tep_handle *tep = info->tep;
	unsigned long long val;
	int type;
	int pid;
	int ret;

	ret = tep_read_number_field(info->common_pid, record->data, &val);
	if (ret < 0)
		return 0;

	pid = val;

	if (!match_pid(info->pids, pid))
		return 0;

	type = tep_data_type(tep, record);
	if (type == info->kvm_exit->id) {
		info->pid = pid;
		return -1;
	}

	if (type != info->wake_up->id)
		return 0;

	ret = tep_read_number_field(info->wake_pid, record->data, &val);
	if (ret < 0)
		return 0;

	add_pid(&info->pids, (int)val);
	return 0;
}

static void print_cid_pid(int cid, int pid)
{
	FILE *fp;
	char *path;
	char *buf = NULL;
	char *save;
	size_t l = 0;
	int tgid = -1;

	if (asprintf(&path, "/proc/%d/status", pid) < 0)
		return;

	fp = fopen(path, "r");
	free(path);
	if (!fp)
		return;

	while (getline(&buf, &l, fp) > 0) {
		char *tok;

		if (strncmp(buf, "Tgid:", 5) != 0)
			continue;
		tok = strtok_r(buf, ":", &save);
		if (!tok)
			continue;
		tok = strtok_r(NULL, ":", &save);
		if (!tok)
			continue;
		while (isspace(*tok))
			tok++;
		tgid = strtol(tok, NULL, 0);
		break;
	}
	free(buf);

	if (tgid >= 0)
		printf("%d\t%d\n", cid, tgid);
}

static void find_cid(struct trace_info *info, int cid)
{
	int fd;

	add_pid(&info->pids, this_pid);

	tracefs_instance_file_clear(info->instance, "trace");

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-07 Thread Steven Rostedt
On Fri, 7 May 2021 16:11:20 +0200
Stefano Garzarella  wrote:

> Hi Steven,
> 
> On Wed, May 05, 2021 at 04:38:55PM -0400, Steven Rostedt wrote:
> >The new trace-cmd 3.0 (which is almost ready to be released) allows for
> >tracing between host and guests with timestamp synchronization such that
> >the events on the host and the guest can be interleaved in the proper order
> >that they occur. KernelShark now has a plugin that visualizes this
> >interaction.
> >
> >The implementation requires that the guest has a vsock CID assigned, and on
> >the guest a "trace-cmd agent" is running, that will listen on a port for
> >the CID. The on the host a "trace-cmd record -A guest@cid:port -e events"
> >can be called and the host will connect to the guest agent through the
> >cid/port pair and have the agent enable tracing on behalf of the host and
> >send the trace data back down to it.
> >
> >The problem is that there is no sure fire way to find the CID for a guest.
> >Currently, the user must know the cid, or we have a hack that looks for the
> >qemu process and parses the --guest-cid parameter from it. But this is
> >prone to error and does not work on other implementation (was told that
> >crosvm does not use qemu).  
> 
> For debug I think could be useful to link the vhost-vsock kthread to the 
> CID, but for the user point of view, maybe is better to query the VM 
> management layer, for example if you're using libvirt, you can easily do:
> 
> $ virsh dumpxml fedora34 | grep cid
>  

We looked into going this route, but then that means trace-cmd host/guest
tracing needs a way to handle every layer, as some people use libvirt
(myself included), some people use straight qemu, some people us Xen, and
some people use crosvm. We need to support all of them. Which is why I'm
looking at doing this from the lowest common denominator, and since vsock
is a requirement from trace-cmd to do this tracing, getting the thread
that's related to the vsock is that lowest denominator.

> 
> >
> >As I can not find a way to discover CIDs assigned to guests via any kernel
> >interface, I decided to create this one. Note, I'm not attached to it. If
> >there's a better way to do this, I would love to have it. But since I'm not
> >an expert in the networking layer nor virtio, I decided to stick to what I
> >know and add a debugfs interface that simply lists all the registered 
> >CIDs
> >and the worker task that they are associated with. The worker task at
> >least has the PID of the task it represents.  
> 
> I honestly don't know if it's the best interface, like I said maybe for 
> debugging it's fine, but if we want to expose it to the user in some 
> way, we could support devlink/netlink to provide information about the 
> vsock devices currently in use.

Ideally, a devlink/netlink is the right approach. I just had no idea on how
to implement that ;-)  So I went with what I know, which is debugfs files!



> >Signed-off-by: Steven Rostedt (VMware) 
> >---
> >diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> >index 5e78fb719602..4f03b25b23c1 100644
> >--- a/drivers/vhost/vsock.c
> >+++ b/drivers/vhost/vsock.c
> >@@ -15,6 +15,7 @@
> > #include 
> > #include 
> > #include 
> >+#include 
> >
> > #include 
> > #include "vhost.h"
> >@@ -900,6 +901,128 @@ static struct miscdevice vhost_vsock_misc = {
> > .fops = &vhost_vsock_fops,
> > };
> >
> >+static struct dentry *vsock_file;
> >+
> >+struct vsock_file_iter {
> >+struct hlist_node   *node;
> >+int index;
> >+};
> >+
> >+
> >+static void *vsock_next(struct seq_file *m, void *v, loff_t *pos)
> >+{
> >+struct vsock_file_iter *iter = v;
> >+struct vhost_vsock *vsock;
> >+
> >+if (pos)
> >+(*pos)++;
> >+
> >+if (iter->index >= (int)HASH_SIZE(vhost_vsock_hash))
> >+return NULL;
> >+
> >+if (iter->node)
> >+iter->node = rcu_dereference_raw(hlist_next_rcu(iter->node));
> >+
> >+for (;;) {
> >+if (iter->node) {
> >+vsock = 
> >hlist_entry_safe(rcu_dereference_raw(iter->node),
> >+ struct vhost_vsock, hash);
> >+if (vsock->guest_cid)
> >+break;
> >+iter->node = 
> >rcu_dereference_raw(hlist_next_rcu(iter->node));
> >+co

Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-07 Thread Steven Rostedt
On Fri, 7 May 2021 17:43:32 +0200
Stefano Garzarella  wrote:

> >The start/stop of a seq_file() is made for taking locks. I do this with all
> >my code in ftrace. Yeah, there's a while loop between the two, but that's
> >just to fill the buffer. It's not that long and it never goes to userspace
> >between the two. You can even use this for spin locks (but I wouldn't
> >recommend doing it for raw ones).  
> 
> Ah okay, thanks for the clarification!
> 
> I was worried because building with `make C=2` I had these warnings:
> 
> ../drivers/vhost/vsock.c:944:13: warning: context imbalance in 'vsock_start' 
> - wrong count at exit
> ../drivers/vhost/vsock.c:963:13: warning: context imbalance in 'vsock_stop' - 
> unexpected unlock
> 
> Maybe we need to annotate the functions somehow.

Yep, I it should have been.

static void *vsock_start(struct seq_file *m, loff_t *pos)
__acquires(rcu)
{
[...]

}

static void vsock_stop(struct seq_file *m, void *p)
__releases(rcu)
{
[...]
}

static int vsock_show(struct seq_file *m, void *v)
__must_hold(rcu)
{
[...]
}


And guess what? I just copied those annotations from sock_hash_seq_start(),
sock_hash_seq_show() and sock_hash_seq_stop() from net/core/sock_map.c
which is doing exactly the same thing ;-)

So there's definitely precedence for this.

> 
> >  
> >>  
> >> >+
> >> >+ iter->index = -1;
> >> >+ iter->node = NULL;
> >> >+ t = vsock_next(m, iter, NULL);
> >> >+
> >> >+ for (; iter->index < HASH_SIZE(vhost_vsock_hash) && l < *pos;
> >> >+  t = vsock_next(m, iter, &l))
> >> >+ ;  
> >>
> >> A while() maybe was more readable...  
> >
> >Again, I just cut and pasted from my other code.
> >
> >If you have a good idea on how to implement this with netlink (something
> >that ss or netstat can dislpay), I think that's the best way to go.  
> 
> Okay, I'll take a look and get back to you.
> If it's too complicated, we can go ahead with this patch.

Awesome, thanks!

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC][PATCH] vhost/vsock: Add vsock_list file to map cid with vhost tasks

2021-05-13 Thread Steven Rostedt
On Thu, 13 May 2021 16:57:34 +0100
Stefan Hajnoczi  wrote:


> This approach relies on process hierarchy of the VMM (QEMU).
> Multi-process QEMU is in development and will allow VIRTIO devices to
> run as separate processes from the main QEMU. It then becomes harder to
> correlate a VIRTIO device process with its QEMU process.

And we need to know all these mapping regardless, as we need to map each
thread / process to the vCPU in order to correlate between host thread and
vCPU thread for showing in KernelShark.

Thus this mapping to find the main thread/process needs to be done
regardless.

> 
> So I think in the end this approach ends up being as fragile as parsing
> command-lines. The kernel doesn't really have the concept of a "VM" that
> the vhost_vsock is associated with :). Maybe just parse QEMU and crosvm
> command-lines?
>

That's what we do now, and it already broke once, and even parsing the
command line wont be enough for the stated reasons above.

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 33/44] ftrace: WARN on rcuidle

2022-09-26 Thread Steven Rostedt


Nit, the subject should have "tracing:" an not "ftrace:" as the former
encompasses the tracing infrastructure and the latter is for the function
hook part of that.

On Mon, 19 Sep 2022 12:00:12 +0200
Peter Zijlstra  wrote:

> CONFIG_GENERIC_ENTRY disallows any and all tracing when RCU isn't
> enabled.
> 
> XXX if s390 (the only other GENERIC_ENTRY user as of this writing)
> isn't comfortable with this, we could switch to
> HAVE_NOINSTR_VALIDATION which is x86_64 only atm.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/tracepoint.h |   13 -
>  kernel/trace/trace.c   |3 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -178,6 +178,16 @@ static inline struct tracepoint *tracepo
>  #endif /* CONFIG_HAVE_STATIC_CALL */
>  
>  /*
> + * CONFIG_GENERIC_ENTRY archs are expected to have sanitized entry and idle
> + * code that disallow any/all tracing/instrumentation when RCU isn't 
> watching.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +#define RCUIDLE_COND(rcuidle)(rcuidle)
> +#else

Should probably move the below comment to here:

 /* srcu can't be used from NMI */

> +#define RCUIDLE_COND(rcuidle)(rcuidle && in_nmi())
> +#endif
> +
> +/*
>   * it_func[0] is never NULL because there is at least one element in the 
> array
>   * when the array itself is non NULL.
>   */
> @@ -189,7 +199,8 @@ static inline struct tracepoint *tracepo
>   return; \
>   \
>   /* srcu can't be used from NMI */   \

And remove the above.

-- Steve

> - WARN_ON_ONCE(rcuidle && in_nmi());  \
> + if (WARN_ON_ONCE(RCUIDLE_COND(rcuidle)))\
> + return; \
>   \
>   /* keep srcu and sched-rcu usage consistent */  \
>   preempt_disable_notrace();  \
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -3104,6 +3104,9 @@ void __trace_stack(struct trace_array *t
>   return;
>   }
>  
> + if (WARN_ON_ONCE(IS_ENABLED(CONFIG_GENERIC_ENTRY)))
> + return;
> +
>   /*
>* When an NMI triggers, RCU is enabled via ct_nmi_enter(),
>* but if the above rcu_is_watching() failed, then the NMI
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt
On Mon, 23 Jan 2023 21:50:12 +0100
Peter Zijlstra  wrote:

> All RCU disabled code should be noinstr and hence we should never get
> here -- when we do, WARN about it and make sure to not actually do
> tracing.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/kernel/ftrace.c |3 +++
>  1 file changed, 3 insertions(+)
> 
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
>   if (unlikely(atomic_read(¤t->tracing_graph_pause)))
>   return;
>  
> + if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> + return;
> +

Please add this to after recursion trylock below. Although WARN_ONCE()
should not not have recursion issues, as function tracing can do weird
things, I rather be safe than sorry, and not have the system triple boot
due to some path that might get added in the future.

If rcu_is_watching() is false, it will still get by the below recursion
check and warn. That is, the below check should be done before this
function calls any other function.

>   bit = ftrace_test_recursion_trylock(ip, *parent);
>   if (bit < 0)
>   return;
> 

-- Steve
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 3/6] ftrace/x86: Warn and ignore graph tracing when RCU is disabled

2023-01-23 Thread Steven Rostedt
On Mon, 23 Jan 2023 16:53:04 -0500
Steven Rostedt  wrote:

> On Mon, 23 Jan 2023 21:50:12 +0100
> Peter Zijlstra  wrote:
> 
> > All RCU disabled code should be noinstr and hence we should never get
> > here -- when we do, WARN about it and make sure to not actually do
> > tracing.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  arch/x86/kernel/ftrace.c |3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > --- a/arch/x86/kernel/ftrace.c
> > +++ b/arch/x86/kernel/ftrace.c
> > @@ -646,6 +646,9 @@ void prepare_ftrace_return(unsigned long
> > if (unlikely(atomic_read(¤t->tracing_graph_pause)))
> > return;
> >  
> > +   if (WARN_ONCE(!rcu_is_watching(), "RCU not on for: %pS\n", (void *)ip))
> > +   return;
> > +  
> 
> Please add this to after recursion trylock below. Although WARN_ONCE()
> should not not have recursion issues, as function tracing can do weird
> things, I rather be safe than sorry, and not have the system triple boot
> due to some path that might get added in the future.
> 
> If rcu_is_watching() is false, it will still get by the below recursion
> check and warn. That is, the below check should be done before this
> function calls any other function.
> 
> > bit = ftrace_test_recursion_trylock(ip, *parent);
> > if (bit < 0)
> > return;
> >   
> 

Actually, perhaps we can just add this, and all you need to do is create
and set CONFIG_NO_RCU_TRACING (or some other name).

This should cover all ftrace locations. (Uncompiled).

-- Steve

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index c303f7a114e9..10ee3fbb9113 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,6 +135,22 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
+#ifdef CONFIG_NO_RCU_TRACING
+# define trace_warn_on_no_rcu(ip)  \
+   ({  \
+   bool __ret = false; \
+   if (!trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
+   trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
+   __ret = WARN_ONCE(!rcu_is_watching(),   \
+ "RCU not on for: %pS\n", (void *)ip); 
\
+   trace_recursion_clear(TRACE_RECORD_RECURSION_BIT); \
+   }   \
+   __ret;  \
+   })
+#else
+# define trace_warn_on_no_rcu(ip)  false
+#endif
+
 /*
  * Preemption is promised to be disabled when return bit >= 0.
  */
@@ -144,6 +160,9 @@ static __always_inline int 
trace_test_and_set_recursion(unsigned long ip, unsign
unsigned int val = READ_ONCE(current->trace_recursion);
int bit;
 
+   if (trace_warn_on_no_rcu(ip))
+   return -1;
+
bit = trace_get_context_bit() + start;
if (unlikely(val & (1 << bit))) {
/*
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization