Re: [PATCH v2 for-8.2?] i386/sev: Avoid SEV-ES crash due to missing MSR_EFER_LMA bit

2023-12-08 Thread Maxim Levitsky
On Wed, 2023-12-06 at 11:42 -0600, Michael Roth wrote:
> On Wed, Dec 06, 2023 at 07:20:14PM +0200, Maxim Levitsky wrote:
> > On Tue, 2023-12-05 at 16:28 -0600, Michael Roth wrote:
> > > Commit 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors")
> > > added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it
> > > exposed a long-running bug in current KVM support for SEV-ES where the
> > > kernel assumes that MSR_EFER_LMA will be set explicitly by the guest
> > > kernel, in which case EFER write traps would result in KVM eventually
> > > seeing MSR_EFER_LMA get set and recording it in such a way that it would
> > > be subsequently visible when accessing it via KVM_GET_SREGS/etc.
> > > 
> > > However, guest kernels currently rely on MSR_EFER_LMA getting set
> > > automatically when MSR_EFER_LME is set and paging is enabled via
> > > CR0_PG_MASK. As a result, the EFER write traps don't actually expose the
> > > MSR_EFER_LMA even though it is set internally, and when QEMU
> > > subsequently tries to pass this EFER value back to KVM via
> > > KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL,
> > > which is now considered fatal due to the aforementioned QEMU commit.
> > > 
> > > This can be addressed by inferring the MSR_EFER_LMA bit being set when
> > > paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure
> > > the expected bits are all present in subsequent handling on the host
> > > side.
> > > 
> > > Ultimately, this handling will be implemented in the host kernel, but to
> > > avoid breaking QEMU's SEV-ES support when using older host kernels, the
> > > same handling can be done in QEMU just after fetching the register
> > > values via KVM_GET_SREGS*. Implement that here.
> > > 
> > > Cc: Paolo Bonzini 
> > > Cc: Marcelo Tosatti 
> > > Cc: Tom Lendacky 
> > > Cc: Akihiko Odaki 
> > > Cc: k...@vger.kernel.org
> > > Fixes: 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors")
> > > Signed-off-by: Michael Roth 
> > > ---
> > > v2:
> > >   - Add handling for KVM_GET_SREGS, not just KVM_GET_SREGS2
> > > 
> > >  target/i386/kvm/kvm.c | 14 ++
> > >  1 file changed, 14 insertions(+)
> > > 
> > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > > index 11b8177eff..8721c1bf8f 100644
> > > --- a/target/i386/kvm/kvm.c
> > > +++ b/target/i386/kvm/kvm.c
> > > @@ -3610,6 +3610,7 @@ static int kvm_get_sregs(X86CPU *cpu)
> > >  {
> > >  CPUX86State *env = &cpu->env;
> > >  struct kvm_sregs sregs;
> > > +target_ulong cr0_old;
> > >  int ret;
> > >  
> > >  ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS, &sregs);
> > > @@ -3637,12 +3638,18 @@ static int kvm_get_sregs(X86CPU *cpu)
> > >  env->gdt.limit = sregs.gdt.limit;
> > >  env->gdt.base = sregs.gdt.base;
> > >  
> > > +cr0_old = env->cr[0];
> > >  env->cr[0] = sregs.cr0;
> > >  env->cr[2] = sregs.cr2;
> > >  env->cr[3] = sregs.cr3;
> > >  env->cr[4] = sregs.cr4;
> > >  
> > >  env->efer = sregs.efer;
> > > +if (sev_es_enabled() && env->efer & MSR_EFER_LME) {
> > > +if (!(cr0_old & CR0_PG_MASK) && env->cr[0] & CR0_PG_MASK) {
> > > +env->efer |= MSR_EFER_LMA;
> > > +}
> > > +}
> > 
> > I think that we should not check that CR0_PG has changed, and should just
> > blindly assume that if EFER.LME is set and CR0.PG is set, then EFER.LMA must
> > be set, as defined in the x86 spec.
> > 
> > Otherwise, suppose qemu calls kvm_get_sregs twice: the first time it will
> > work, but the second time CR0.PG will match the one that is stored in the
> > env, and thus the workaround will not be executed, and instead we will
> > revert back to the wrong EFER value reported by the kernel.
> > 
> > How about something like that:
> > 
> > 
> > if (sev_es_enabled() && env->efer & MSR_EFER_LME && env->cr[0] & CR0_PG_MASK) {
> >     /*
> >      * Work around a KVM bug due to which KVM might not be aware that
> >      * EFER.LMA was toggled by the hardware.
> >      */
> >     env->efer |= MSR_EFER_LMA;
> > }

Re: [PATCH v2 for-8.2?] i386/sev: Avoid SEV-ES crash due to missing MSR_EFER_LMA bit

2023-12-06 Thread Maxim Levitsky
On Tue, 2023-12-05 at 16:28 -0600, Michael Roth wrote:
> Commit 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors")
> added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it
> exposed a long-running bug in current KVM support for SEV-ES where the
> kernel assumes that MSR_EFER_LMA will be set explicitly by the guest
> kernel, in which case EFER write traps would result in KVM eventually
> seeing MSR_EFER_LMA get set and recording it in such a way that it would
> be subsequently visible when accessing it via KVM_GET_SREGS/etc.
> 
> However, guest kernels currently rely on MSR_EFER_LMA getting set
> automatically when MSR_EFER_LME is set and paging is enabled via
> CR0_PG_MASK. As a result, the EFER write traps don't actually expose the
> MSR_EFER_LMA even though it is set internally, and when QEMU
> subsequently tries to pass this EFER value back to KVM via
> KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL,
> which is now considered fatal due to the aforementioned QEMU commit.
> 
> This can be addressed by inferring the MSR_EFER_LMA bit being set when
> paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure
> the expected bits are all present in subsequent handling on the host
> side.
> 
> Ultimately, this handling will be implemented in the host kernel, but to
> avoid breaking QEMU's SEV-ES support when using older host kernels, the
> same handling can be done in QEMU just after fetching the register
> values via KVM_GET_SREGS*. Implement that here.
> 
> Cc: Paolo Bonzini 
> Cc: Marcelo Tosatti 
> Cc: Tom Lendacky 
> Cc: Akihiko Odaki 
> Cc: k...@vger.kernel.org
> Fixes: 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors")
> Signed-off-by: Michael Roth 
> ---
> v2:
>   - Add handling for KVM_GET_SREGS, not just KVM_GET_SREGS2
> 
>  target/i386/kvm/kvm.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 11b8177eff..8721c1bf8f 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -3610,6 +3610,7 @@ static int kvm_get_sregs(X86CPU *cpu)
>  {
>  CPUX86State *env = &cpu->env;
>  struct kvm_sregs sregs;
> +target_ulong cr0_old;
>  int ret;
>  
>  ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS, &sregs);
> @@ -3637,12 +3638,18 @@ static int kvm_get_sregs(X86CPU *cpu)
>  env->gdt.limit = sregs.gdt.limit;
>  env->gdt.base = sregs.gdt.base;
>  
> +cr0_old = env->cr[0];
>  env->cr[0] = sregs.cr0;
>  env->cr[2] = sregs.cr2;
>  env->cr[3] = sregs.cr3;
>  env->cr[4] = sregs.cr4;
>  
>  env->efer = sregs.efer;
> +if (sev_es_enabled() && env->efer & MSR_EFER_LME) {
> +if (!(cr0_old & CR0_PG_MASK) && env->cr[0] & CR0_PG_MASK) {
> +env->efer |= MSR_EFER_LMA;
> +}
> +}

I think that we should not check that CR0_PG has changed, and should just
blindly assume that if EFER.LME is set and CR0.PG is set, then EFER.LMA must be
set, as defined in the x86 spec.

Otherwise, suppose qemu calls kvm_get_sregs twice: the first time it will work,
but the second time CR0.PG will match the one that is stored in the env, and
thus the workaround will not be executed, and instead we will revert back to
the wrong EFER value reported by the kernel.

How about something like that:


if (sev_es_enabled() && env->efer & MSR_EFER_LME && env->cr[0] & CR0_PG_MASK) {
    /*
     * Work around a KVM bug due to which KVM might not be aware that
     * EFER.LMA was toggled by the hardware.
     */
    env->efer |= MSR_EFER_LMA;
}


Best regards,
Maxim Levitsky

>  
>  /* changes to apic base and cr8/tpr are read back via kvm_arch_post_run 
> */
>  x86_update_hflags(env);
> @@ -3654,6 +3661,7 @@ static int kvm_get_sregs2(X86CPU *cpu)
>  {
>  CPUX86State *env = &cpu->env;
>  struct kvm_sregs2 sregs;
> +target_ulong cr0_old;
>  int i, ret;
>  
>  ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs);
> @@ -3676,12 +3684,18 @@ static int kvm_get_sregs2(X86CPU *cpu)
>  env->gdt.limit = sregs.gdt.limit;
>  env->gdt.base = sregs.gdt.base;
>  
> +cr0_old = env->cr[0];
>  env->cr[0] = sregs.cr0;
>  env->cr[2] = sregs.cr2;
>  env->cr[3] = sregs.cr3;
>  env->cr[4] = sregs.cr4;
>  
>  env->efer = sregs.efer;
> +if (sev_es_enabled() && env->efer & MSR_EFER_LME) {
> +if (!(cr0_old & CR0_PG_MASK) && env->cr[0] & CR0_PG_MASK) {
> +env->efer |= MSR_EFER_LMA;
> +}
> +}
>  
>  env->pdptrs_valid = sregs.flags & KVM_SREGS2_FLAGS_PDPTRS_VALID;
>  





Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

2022-10-03 Thread Maxim Levitsky
On Sun, 2022-10-02 at 07:56 -0600, Keith Busch wrote:
> On Sun, Oct 02, 2022 at 11:59:42AM +0300, Maxim Levitsky wrote:
> > On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> > > On 9/29/22 18:39, Christoph Hellwig wrote:
> > > > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > > > I am aware, and I've submitted the fix to qemu here:
> > > > > > 
> > > > > >   
> > > > > > https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > > > > 
> > > > > I don't think so. Memory alignment and length granularity are two 
> > > > > completely
> > > > > different concepts. If anything, the kernel's ABI had been that the 
> > > > > length
> > > > > requirement was also required for the memory alignment, not the other 
> > > > > way
> > > > > around. That usage will continue working with this kernel patch.
> > 
> > Yes, this is how I also understand it - for example, for O_DIRECT on a file
> > which resides on a 4K block device, you have to use page-aligned buffers.
> > 
> > But here, after the patch, a 512-byte-aligned buffer starts working as well -
> > if I understand you correctly, the ABI didn't guarantee that such usage would
> > fail, but rather that it might fail.
> 
> The kernel patch will allow buffer alignment to work with whatever the hardware
> reports it can support. It could even be as low as byte-aligned if that's what
> the hardware can use.
> 
> The patch aligns direct-io with the same criteria blk_rq_map_user() has always
> used to know if the user space buffer is compatible with the hardware's dma
> requirements. Prior to this patch, the direct-io memory alignment was an
> artificial software constraint, and that constraint creates a lot of
> unnecessary memory pressure.
> 
> As has always been the case, each segment needs to be a logical block length
> granularity. QEMU assumed a buffer's page offset also defined the logical block
> size instead of using the actual logical block size that it had previously
> discovered directly.
> 
> > If I understand that correctly, after the patch in question, qemu is able to
> > use just a 512-byte-aligned buffer to read a single 4K block from the disk,
> > which was supposed to fail but wasn't guaranteed to fail.
> > 
> > Later qemu submits an iovec which also reads a 4K block but in two parts,
> > and if I understand that correctly, each part (iov) is considered to be a
> > separate IO operation, and thus each has to be (in my case) 4K in size,
> > and its memory buffer *should* also be 4K aligned.
> > 
> > (but it can work with smaller alignment as well).
> 
> Right. The iov length needs to match the logical block size. The iov's memory
> offset needs to align to the queue's dma_alignment attribute. The memory
> alignment may be smaller than a block size.
>  
> > Assuming that I understand all of this correctly, I agree with Paolo that
> > this is a qemu bug, but I do fear that it can cause quite some problems for
> > users, especially for users that run an outdated qemu version.
> > 
> > It might be too much to ask, but maybe add a Kconfig option to keep the
> > legacy behavior for those that need it?
> 
> Kconfig doesn't sound right.
> 
> The block layer exports all the attributes user space needs to know about for
> direct io.
> 
>   iov length:/sys/block/<dev>/queue/logical_block_size
>   iov mem align: /sys/block/<dev>/queue/dma_alignment
> 
> If you really want to change the behavior, I think maybe we could make the
> dma_alignment attribute writeable (or perhaps add a new attribute specifically
> for dio_alignment) so the user can request something larger.
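
For illustration, a minimal userspace sketch of the rule described above (this is
not QEMU or kernel code; the device node, the image path, and the exact
interpretation of the sysfs values are assumptions): read both attributes, then
issue an O_DIRECT read whose length is a multiple of logical_block_size from a
buffer aligned to the logical block size, which is always at least as strict as
the dma_alignment requirement.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a single numeric sysfs attribute; returns -1 on failure. */
static long sysfs_long(const char *path)
{
    long v = -1;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%ld", &v) != 1) {
            v = -1;
        }
        fclose(f);
    }
    return v;
}

int main(void)
{
    /* Placeholder device and image names. */
    long lbs = sysfs_long("/sys/block/nvme0n1/queue/logical_block_size");
    long dma = sysfs_long("/sys/block/nvme0n1/queue/dma_alignment");
    void *buf;
    int fd;

    printf("iov length granularity: %ld, iov memory alignment: %ld\n", lbs, dma);
    if (lbs <= 0) {
        return 1;
    }

    /*
     * Aligning the buffer to the logical block size is always sufficient,
     * since the memory alignment requirement is no stricter than that.
     */
    if (posix_memalign(&buf, lbs, lbs)) {
        return 1;
    }
    fd = open("/path/to/image.raw", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        return 1;
    }
    printf("pread of one block: %zd\n", pread(fd, buf, lbs, 0));
    close(fd);
    free(buf);
    return 0;
}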
> 
It all makes sense now.

A new attribute could make sense I guess, and it could be set by a udev rule or
something.


Anyway, I won't worry about this for now, and if there are issues I'll see how
we could work around them.

Thanks for everything,
Best regards,
Maxim Levitsky




Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

2022-10-02 Thread Maxim Levitsky
On Thu, 2022-09-29 at 19:35 +0200, Paolo Bonzini wrote:
> On 9/29/22 18:39, Christoph Hellwig wrote:
> > On Thu, Sep 29, 2022 at 10:37:22AM -0600, Keith Busch wrote:
> > > > I am aware, and I've submitted the fix to qemu here:
> > > > 
> > > >   https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> > > 
> > > I don't think so. Memory alignment and length granularity are two 
> > > completely
> > > different concepts. If anything, the kernel's ABI had been that the length
> > > requirement was also required for the memory alignment, not the other way
> > > around. That usage will continue working with this kernel patch.

Yes, this is how I also understand it - for example for O_DIRECT on a file which
resides on 4K block device, you have to use page aligned buffers.

But here after the patch, 512 aligned buffer starts working as well - If I
understand you correctly the ABI didn't guarantee that such usage would fail,
but rather that it might fail.

> > 
> > Well, Linus does treat anything that breaks significant userspace
> > as a regression.  Qemu certainly is significant, but that might depend
> > a bit on how common the configurations hitting this issue are.
> 
> Seeing the QEMU patch, I agree that it's a QEMU bug though.  I'm 
> surprised it has ever worked.
> 
> It requires 4K sectors in the host but not in the guest, and can be 
> worked around (if not migrating) by disabling O_DIRECT.  I think it's 
> not that awful, but we probably should do some extra releases of QEMU 
> stable branches.
> 
> Paolo
> 

I must admit I am out of the loop on the exact requirements of O_DIRECT.


If I understand that correctly, after the patch in question, qemu is able to
use just a 512-byte-aligned buffer to read a single 4K block from the disk,
which was supposed to fail but wasn't guaranteed to fail.



Later qemu submits an iovec which also reads a 4K block but in two parts,
and if I understand that correctly, each part (iov) is considered to be a
separate IO operation, and thus each has to be (in my case) 4K in size,
and its memory buffer *should* also be 4K aligned.

(but it can work with smaller alignment as well).


Assuming that I understand all of this correctly, I agree with Paolo that this
is a qemu bug, but I do fear that it can cause quite some problems for users,
especially for users that run an outdated qemu version.

It might be too much to ask, but maybe add a Kconfig option to keep the legacy
behavior for those that need it?

Best regards,
Maxim Levitsky




Re: Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

2022-09-29 Thread Maxim Levitsky
On Thu, 2022-09-29 at 09:48 -0600, Keith Busch wrote:
> I am aware, and I've submitted the fix to qemu here:
> 
>   https://lists.nongnu.org/archive/html/qemu-block/2022-09/msg00398.html
> 


Thanks for the quick response!

The question is though, isn't this a kernel ABI breakage?

(I myself don't care, I would be happy to patch my qemu),

but I am afraid that this will break *lots* of users who only updated the kernel
and not qemu.

What do you think?

Best regards,
Maxim Levitsky




Commit 'iomap: add support for dma aligned direct-io' causes qemu/KVM boot failures

2022-09-29 Thread Maxim Levitsky
Hi!
 
Recently I noticed that this commit broke the boot of some of the VMs that I 
run on my dev machine.
 
It seems that I am not the first to notice this but in my case it is a bit 
different
 
https://lore.kernel.org/all/e0038866ac54176beeac944c9116f7a9bdec7019.ca...@linux.ibm.com/
 
My VM is a normal x86 VM, and it uses virtio-blk in the guest to access the
virtual disk, which is a qcow2 file stored on an ext4 filesystem, which in turn
lives on an NVMe drive with 4K sectors.
(however, I was also able to reproduce this with a raw file)
 
It seems that the only two things that are needed to reproduce the issue are:
 
1. The qcow2/raw file has to be located on a drive which has a 4K hardware block
size.
2. Qemu needs to use direct IO (both native aio and 'threads' reproduce this).
 
I did some debugging and I isolated the kernel's change in behavior from qemu's
point of view:
 
 
Qemu, when using direct IO, 'probes' the underlying file.
 
It probes two things:
 
1. It probes the minimum block size it can read.
   It does so by trying to read 1, 512, 1024, 2048 and 4096 bytes at offset 0,
   using a 4096-byte-aligned buffer, and notes the first read that works as
   the hardware block size.
 
   (The relevant function is 'raw_probe_alignment' in block/file-posix.c in the
   qemu source code).
 
 
2. It probes the buffer alignment by reading 4096 bytes, also at file offset 0,
   this time using a buffer that is 1-, 512-, 1024-, 2048- and 4096-byte aligned
   (this is done by allocating a buffer which is 4K aligned and adding 1/512
   and so on to its address).
 
   The first successful read is saved as the required buffer alignment.
 
 
Before the patch, both probes would yield 4096 and everything would work fine.
(The file in question is stored on a 4K block device.)
 
 
After the patch, the buffer alignment probe succeeds at 512 bytes.
This means that the kernel now allows reading 4K of data at file offset 0 with
a buffer that is only 512-byte aligned.
 
It is worth noting that the probe was done using the 'pread' syscall.
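
For reference, a stripped-down sketch of the two probes described above (this is
not the actual raw_probe_alignment() code from block/file-posix.c; the image path
is a placeholder and error handling is minimal):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    static const size_t sizes[] = { 1, 512, 1024, 2048, 4096 };
    size_t request_align = 0, buf_align = 0;
    void *mem;
    char *base;
    int fd, i;

    fd = open("/path/to/image.qcow2", O_RDONLY | O_DIRECT);
    if (fd < 0 || posix_memalign(&mem, 4096, 2 * 4096)) {
        return 1;
    }
    base = mem;

    /* Probe 1: smallest read size that works with a 4K-aligned buffer. */
    for (i = 0; i < 5; i++) {
        if (pread(fd, base, sizes[i], 0) >= 0) {
            request_align = sizes[i];
            break;
        }
    }

    /*
     * Probe 2: read 4K into a buffer offset by 1, 512, 1024 or 2048 bytes
     * from a 4K-aligned address; the first offset that works is recorded
     * as the required buffer alignment (4096 if only the aligned read works).
     */
    for (i = 0; i < 5; i++) {
        size_t off = sizes[i] % 4096;   /* 1, 512, 1024, 2048, then 0 */

        if (pread(fd, base + off, 4096, 0) >= 0) {
            buf_align = sizes[i];
            break;
        }
    }

    printf("request alignment: %zu, buffer alignment: %zu\n",
           request_align, buf_align);
    close(fd);
    free(mem);
    return 0;
}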
 
 
Later on, qemu likely reads the first 512-byte sector of the drive.
 
It uses preadv with 2 io vectors:
 
The first one is for 512 bytes and it seems to have a 0xC00 offset into the page
(this likely depends on the debug session but seems to be consistent).
 
The second one is for 3584 bytes and also has a buffer that is not 4K aligned
(a 0x200 page offset this time).
 
This means that qemu does respect the 4K block size but only respects 512-byte
buffer alignment, which is consistent with the result of the probing.
 
And that preadv fails with -EINVAL
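
For reference, the failing call described above looks roughly like the following
sketch (not QEMU code; the 0xC00 and 0x200 page offsets are the ones observed in
the debug session, and the image path is a placeholder). On a file opened with
O_DIRECT on a 4K logical-block-size device, this preadv is the call that returns
-EINVAL:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    struct iovec iov[2];
    void *mem;
    ssize_t r;
    int fd;

    if (posix_memalign(&mem, 4096, 2 * 4096)) {
        return 1;
    }
    /* One 4K block split into two iovecs, neither buffer 4K-aligned. */
    iov[0].iov_base = (char *)mem + 0xC00;          /* 512 bytes  */
    iov[0].iov_len  = 512;
    iov[1].iov_base = (char *)mem + 4096 + 0x200;   /* 3584 bytes */
    iov[1].iov_len  = 3584;

    fd = open("/path/to/image.qcow2", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        return 1;
    }
    r = preadv(fd, iov, 2, 0);
    printf("preadv: %zd (errno %d)\n", r, errno);
    close(fd);
    free(mem);
    return 0;
}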
 
Forcing qemu to use a 4K buffer size fixes the issue, as does reverting the
offending commit.
 
Any patches or suggestions are welcome.

I am using 6.0-rc7, from the mainline master branch as of yesterday.
 
Best regards,
Maxim Levitsky




Re: [PATCH RFC v1 1/2] i386: reset KVM nested state upon CPU reset

2022-08-10 Thread Maxim Levitsky
On Wed, 2022-08-10 at 16:00 +0200, Vitaly Kuznetsov wrote:
> Make sure env->nested_state is cleaned up when a vCPU is reset, it may
> be stale after an incoming migration, kvm_arch_put_registers() may
> end up failing or putting vCPU in a weird state.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  target/i386/kvm/kvm.c | 37 +++--
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index f148a6d52fa4..4f8dacc1d4b5 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1695,6 +1695,30 @@ static void kvm_init_xsave(CPUX86State *env)
>     env->xsave_buf_len);
>  }
>  
> +static void kvm_init_nested_state(CPUX86State *env)
> +{
> +    struct kvm_vmx_nested_state_hdr *vmx_hdr;
> +    uint32_t size;
> +
> +    if (!env->nested_state) {
> +    return;
> +    }
> +
> +    size = env->nested_state->size;
> +
> +    memset(env->nested_state, 0, size);
> +    env->nested_state->size = size;
> +
> +    if (cpu_has_vmx(env)) {
> +    env->nested_state->format = KVM_STATE_NESTED_FORMAT_VMX;
> +    vmx_hdr = &env->nested_state->hdr.vmx;
> +    vmx_hdr->vmxon_pa = -1ull;
> +    vmx_hdr->vmcs12_pa = -1ull;
> +    } else if (cpu_has_svm(env)) {
> +    env->nested_state->format = KVM_STATE_NESTED_FORMAT_SVM;
> +    }
> +}
> +
>  int kvm_arch_init_vcpu(CPUState *cs)
>  {
>  struct {
> @@ -2122,19 +2146,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  assert(max_nested_state_len >= offsetof(struct kvm_nested_state, 
> data));
>  
>  if (cpu_has_vmx(env) || cpu_has_svm(env)) {
> -    struct kvm_vmx_nested_state_hdr *vmx_hdr;
> -
>  env->nested_state = g_malloc0(max_nested_state_len);
>  env->nested_state->size = max_nested_state_len;
>  
> -    if (cpu_has_vmx(env)) {
> -    env->nested_state->format = KVM_STATE_NESTED_FORMAT_VMX;
> -    vmx_hdr = &env->nested_state->hdr.vmx;
> -    vmx_hdr->vmxon_pa = -1ull;
> -    vmx_hdr->vmcs12_pa = -1ull;
> -    } else {
> -    env->nested_state->format = KVM_STATE_NESTED_FORMAT_SVM;
> -    }
> +    kvm_init_nested_state(env);
>      }
>  }
>  
> @@ -2199,6 +2214,8 @@ void kvm_arch_reset_vcpu(X86CPU *cpu)
>  /* enabled by default */
>  env->poll_control_msr = 1;
>  
> +    kvm_init_nested_state(env);
> +
>  sev_es_set_reset_vector(CPU(cpu));
>  }
>  
Makes sense.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [PATCH RFC v1 2/2] i386: reorder kvm_put_sregs2() and kvm_put_nested_state() when vCPU is reset

2022-08-10 Thread Maxim Levitsky
On Wed, 2022-08-10 at 16:00 +0200, Vitaly Kuznetsov wrote:
> Setting nested state upon migration needs to happen after kvm_put_sregs2()
> to e.g. have EFER.SVME set. This, however, doesn't work for vCPU reset:
> when vCPU is in VMX root operation, certain CR bits are locked and
> kvm_put_sregs2() may fail. As nested state is fully cleaned up upon
> vCPU reset (kvm_arch_reset_vcpu() -> kvm_init_nested_state()), calling
> kvm_put_nested_state() before kvm_put_sregs2() is OK, this will ensure
> that vCPU is *not* in VMX root operation.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  target/i386/kvm/kvm.c | 20 ++--
>  1 file changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 4f8dacc1d4b5..73e3880fa57b 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -4529,18 +4529,34 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>  
>  assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
>  
> -    /* must be before kvm_put_nested_state so that EFER.SVME is set */
> +    /*
> + * When resetting a vCPU, make sure to reset nested state first to
> + * e.g clear VMXON state and unlock certain CR4 bits.
> + */
> +    if (level == KVM_PUT_RESET_STATE) {
> +    ret = kvm_put_nested_state(x86_cpu);
> +    if (ret < 0) {
> +    return ret;
> +    }

I should have mentioned this, I actually already debugged the same issue while
trying to reproduce the smm int window bug.
100% my fault.

I also share the same feeling that this might be yet another 'whack a mole' and
break somewhere else, but overall it does make sense.


Reviewed-by: Maxim Levitsky 


Best regards,
Maxim Levitsky

> +    }
> +
>  ret = has_sregs2 ? kvm_put_sregs2(x86_cpu) : kvm_put_sregs(x86_cpu);
>  if (ret < 0) {
>  return ret;
>  }
>  
> -    if (level >= KVM_PUT_RESET_STATE) {
> +    /*
> + * When putting full CPU state, kvm_put_nested_state() must happen after
> + * kvm_put_sregs{,2} so that e.g. EFER.SVME is already set.
> + */
> +    if (level == KVM_PUT_FULL_STATE) {
>  ret = kvm_put_nested_state(x86_cpu);
>  if (ret < 0) {
>  return ret;
>  }
> +    }
>  
> +    if (level >= KVM_PUT_RESET_STATE) {
>  ret = kvm_put_msr_feature_control(x86_cpu);
>  if (ret < 0) {
>  return ret;





Re: Guest reboot issues since QEMU 6.0 and Linux 5.11

2022-07-21 Thread Maxim Levitsky
On Thu, 2022-07-21 at 14:49 +0200, Fabian Ebner wrote:
> Hi,
> since about half a year ago, we're getting user reports about guest
> reboot issues with KVM/QEMU[0].
> 
> The most common scenario is a Windows Server VM (2012R2/2016/2019,
> UEFI/OVMF and SeaBIOS) getting stuck during the screen with the Windows
> logo and the spinning circles after a reboot was triggered from within
> the guest. Quitting the kvm process and booting with a fresh instance
> works. The issue seems to become more likely, the longer the kvm
> instance runs.
> 
> We did not get such reports while we were providing Linux 5.4 and QEMU
> 5.2.0, but we do with Linux 5.11/5.13/5.15 and QEMU 6.x.
> 
> I'm just wondering if anybody has seen this issue before or might have a
> hunch what it's about? Any tips on what to look out for when debugging
> are also greatly appreciated!
> 
> We do have debug access to a user's test VM and the VM state was saved
> before a problematic reboot, but I can't modify the host system there.
> AFAICT QEMU just executes guest code as usual, but I'm really not sure
> what to look out for.
> 
> That VM has CPU type host, and a colleague did have a similar enough CPU
> to load the VM state, but for him, the reboot went through normally. On
> the user's system, it triggers consistently after loading the VM state
> and rebooting.
> 
> So unfortunately, we didn't manage to reproduce the issue locally yet.
> With two other images provided by users, we ran into a boot loop, where
> QEMU resets the CPUs and does a few KVM_RUNs before the exit reason is
> KVM_EXIT_SHUTDOWN (which to my understanding indicates a triple fault)
> and then it repeats. It's not clear if the issues are related.


Does the guest have Hyper-V enabled in it (that is, nested virtualization)?

Intel or AMD?

Does the VM use secure boot / SMM?

Best regards,
Maxim Levitsky

> 
> There are also a few reports about non-Windows VMs, mostly Ubuntu 20.04
> with UEFI/OVMF, but again, it's not clear if the issues are related.
> 
> [0]: https://forum.proxmox.com/threads/100744/
> (the forum thread is a bit chaotic unfortunately).
> 
> Best Regards,
> Fabi
> 
> 





Re: [PATCH] target/i386: do not consult nonexistent host leaves

2022-05-01 Thread Maxim Levitsky
On Fri, 2022-04-29 at 21:26 +0200, Paolo Bonzini wrote:
> When cache_info_passthrough is requested, QEMU passes the host values
> of the cache information CPUID leaves down to the guest.  However,
> it blindly assumes that the CPUID leaf exists on the host, and this
> cannot be guaranteed: for example, KVM has recently started to
> synthesize AMD leaves up to 0x8021 in order to provide accurate
> CPU bug information to guests.
> 
> Querying a nonexistent host leaf fills the output arguments of
> host_cpuid with data that (albeit deterministic) is nonsensical
> as cache information, namely the data in the highest Intel CPUID
> leaf.  If said highest leaf is not ECX-dependent, this can even
> cause an infinite loop when kvm_arch_init_vcpu prepares the input
> to KVM_SET_CPUID2.  The infinite loop is only terminated by an
> abort() when the array gets full.
> 
> Reported-by: Maxim Levitsky 
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Paolo Bonzini 
> ---
>  target/i386/cpu.c | 41 -
>  1 file changed, 36 insertions(+), 5 deletions(-)
> 
> diff --git a/target/i386/cpu.c b/target/i386/cpu.c
> index 99343be926..c5461f7c0b 100644
> --- a/target/i386/cpu.c
> +++ b/target/i386/cpu.c
> @@ -5022,6 +5022,37 @@ uint64_t 
> x86_cpu_get_supported_feature_word(FeatureWord w,
>  return r;
>  }
>  
> +static void x86_cpu_get_cache_cpuid(uint32_t func, uint32_t index,
> +uint32_t *eax, uint32_t *ebx,
> +uint32_t *ecx, uint32_t *edx)
> +{
> +uint32_t level, unused;
> +
> +/* Only return valid host leaves.  */
> +switch (func) {
> +case 2:
> +case 4:
> +host_cpuid(0, 0, &level, &unused, &unused, &unused);
> +break;
> > +case 0x80000005:
> > +case 0x80000006:
> > +case 0x8000001d:
> > +host_cpuid(0x80000000, 0, &level, &unused, &unused, &unused);
> +break;
> +default:
> +return;
> +}
> +
> +if (func > level) {
> +*eax = 0;
> +*ebx = 0;
> +*ecx = 0;
> +*edx = 0;
> +} else {
> +host_cpuid(func, index, eax, ebx, ecx, edx);
> +}
> +}
> +
>  /*
>   * Only for builtin_x86_defs models initialized with 
> x86_register_cpudef_types.
>   */
> @@ -5280,7 +5311,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  case 2:
>  /* cache info: needed for Pentium Pro compatibility */
>  if (cpu->cache_info_passthrough) {
> -host_cpuid(index, 0, eax, ebx, ecx, edx);
> +x86_cpu_get_cache_cpuid(index, 0, eax, ebx, ecx, edx);
>  break;
>  } else if (cpu->vendor_cpuid_only && IS_AMD_CPU(env)) {
>  *eax = *ebx = *ecx = *edx = 0;
> @@ -5300,7 +5331,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  case 4:
>  /* cache info: needed for Core compatibility */
>  if (cpu->cache_info_passthrough) {
> -host_cpuid(index, count, eax, ebx, ecx, edx);
> +x86_cpu_get_cache_cpuid(index, count, eax, ebx, ecx, edx);
>  /* QEMU gives out its own APIC IDs, never pass down bits 31..26. */
>  *eax &= ~0xFC000000;
>  if ((*eax & 31) && cs->nr_cores > 1) {
> @@ -5702,7 +5733,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  case 0x80000005:
>  /* cache info (L1 cache) */
>  if (cpu->cache_info_passthrough) {
> -host_cpuid(index, 0, eax, ebx, ecx, edx);
> +x86_cpu_get_cache_cpuid(index, 0, eax, ebx, ecx, edx);
>  break;
>  }
>  *eax = (L1_DTLB_2M_ASSOC << 24) | (L1_DTLB_2M_ENTRIES << 16) |
> @@ -5715,7 +5746,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  case 0x80000006:
>  /* cache info (L2 cache) */
>  if (cpu->cache_info_passthrough) {
> -host_cpuid(index, 0, eax, ebx, ecx, edx);
> +x86_cpu_get_cache_cpuid(index, 0, eax, ebx, ecx, edx);
>  break;
>  }
>  *eax = (AMD_ENC_ASSOC(L2_DTLB_2M_ASSOC) << 28) |
> @@ -5775,7 +5806,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
> uint32_t count,
>  case 0x8000001D:
>  *eax = 0;
>  if (cpu->cache_info_passthrough) {
> -host_cpuid(index, count, eax, ebx, ecx, edx);
> +x86_cpu_get_cache_cpuid(index, count, eax, ebx, ecx, edx);
>  break;
>  }
>  switch (count) {

Makes sense.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky




Re: [PATCH v3 0/1] Patch to adjust coroutine pool size adaptively

2022-03-17 Thread Maxim Levitsky
On Fri, 2022-01-28 at 17:36 +0900, Hiroki Narukawa wrote:
> Resending patch with decreasing coroutine pool size on device remove
> 
> We encountered random disk IO performance drop since qemu-5.0.0, and this 
> patch fixes it.
> 
> Commit message in c740ad92 implied to adjust coroutine pool size adaptively, 
> so I tried to implement this.
> 
> Changes from v2:
> Decrease coroutine pool size on device remove
> 
> Changes from v1:
> Use qatomic_read properly
> 
> 
> Hiroki Narukawa (1):
>   util: adjust coroutine pool size to virtio block queue
> 
>  hw/block/virtio-blk.c|  5 +
>  include/qemu/coroutine.h | 10 ++
>  util/qemu-coroutine.c| 20 
>  3 files changed, 31 insertions(+), 4 deletions(-)
> 

I just bisected a breakage of my 32-bit qemu test setup to this patch.

L1 is a 32-bit VM with 16 GB of RAM (with PAE) and 16 vCPUs, and
L2 is a 32-bit VM with 1.3 GB of RAM and 14 vCPUs (2 less).


Qemu runs out of memory, because the new number of coroutines is quite high
(14 * 256).
I understand that 32-bit qemu is very limited anyway, so I won't argue
against this patch. Just FYI.

As a workaround I reduced the virtio-blk queue-size to 16
and it seems to work again. I am only keeping this configuration to test
that it boots, so performance is not an issue.
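
For reference, the workaround amounts to something like this on the qemu command
line (the drive/device ids and file name are placeholders):

-drive if=none,id=drive0,file=disk.qcow2,format=qcow2 \
-device virtio-blk-pci,drive=drive0,queue-size=16

Since the patch discussed here ties the coroutine pool size to the virtio-blk
queue size, lowering queue-size also shrinks the pool accordingly.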

An option to override the coroutine pool size would be ideal in this case IMHO,
though.

Best regards,
Maxim Levitsky




Re: [PULL 15/22] x86: Grant AMX permission for guest

2022-03-17 Thread Maxim Levitsky
On Wed, 2022-03-16 at 17:50 +, Daniel P. Berrangé wrote:
> On Wed, Mar 16, 2022 at 05:48:04PM +, David Edmondson wrote:
> > On Wednesday, 2022-03-16 at 16:05:01 GMT, Daniel P. Berrangé wrote:
> > 
> > > On Wed, Mar 16, 2022 at 04:57:39PM +0100, Peter Krempa wrote:
> > > > On Tue, Mar 08, 2022 at 12:34:38 +0100, Paolo Bonzini wrote:
> > > > > From: Yang Zhong 
> > > > > 
> > > > > Kernel allocates 4K xstate buffer by default. For XSAVE features
> > > > > which require large state component (e.g. AMX), Linux kernel
> > > > > dynamically expands the xstate buffer only after the process has
> > > > > acquired the necessary permissions. Those are called dynamically-
> > > > > enabled XSAVE features (or dynamic xfeatures).
> > > > > 
> > > > > There are separate permissions for native tasks and guests.
> > > > > 
> > > > > Qemu should request the guest permissions for dynamic xfeatures
> > > > > which will be exposed to the guest. This only needs to be done
> > > > > once before the first vcpu is created.
> > > > > 
> > > > > KVM implemented one new ARCH_GET_XCOMP_SUPP system attribute API to
> > > > > get host side supported_xcr0 and Qemu can decide if it can request
> > > > > dynamically enabled XSAVE features permission.
> > > > > https://lore.kernel.org/all/20220126152210.3044876-1-pbonz...@redhat.com/
> > > > > 
> > > > > Suggested-by: Paolo Bonzini 
> > > > > Signed-off-by: Yang Zhong 
> > > > > Signed-off-by: Jing Liu 
> > > > > Message-Id: <20220217060434.52460-4-yang.zh...@intel.com>
> > > > > Signed-off-by: Paolo Bonzini 
> > > > > ---
> > > > >  target/i386/cpu.c  |  7 +
> > > > >  target/i386/cpu.h  |  4 +++
> > > > >  target/i386/kvm/kvm-cpu.c  | 12 
> > > > >  target/i386/kvm/kvm.c  | 57 
> > > > > ++
> > > > >  target/i386/kvm/kvm_i386.h |  1 +
> > > > >  5 files changed, 75 insertions(+), 6 deletions(-)
> > > > 
> > > > With this commit qemu crashes for me when invoking the following
> > > > QMP command:
> > > 
> > > It is way worse than that even. If you remove '-S' you get an
> > > immediate kaboom on startup on AMD hosts
> > 
> > Which AMD CPU is in this host?
> 
> AMD EPYC 7302P
> 
> 
> With regards,
> Daniel

my 3970X - same issue.

Best regards,
Maxim Levitsky




[PATCH] KVM: SVM: always set MSR_AMD64_TSC_RATIO to default value

2022-02-23 Thread Maxim Levitsky
Even when the feature is not supported in the guest CPUID,
still set the MSR to the default value, which will
be the only value KVM will accept in this case.

Signed-off-by: Maxim Levitsky 
---
 target/i386/cpu.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6c7ef1099b..3475e9fa46 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5930,9 +5930,7 @@ static void x86_cpu_reset(DeviceState *dev)
 
 x86_cpu_set_sgxlepubkeyhash(env);
 
-if (env->features[FEAT_SVM] & CPUID_SVM_TSCSCALE) {
-env->amd_tsc_scale_msr =  MSR_AMD64_TSC_RATIO_DEFAULT;
-}
+env->amd_tsc_scale_msr =  MSR_AMD64_TSC_RATIO_DEFAULT;
 
 #endif
 }
-- 
2.26.3




[PATCH v2 1/3] KVM: use KVM_{GET|SET}_SREGS2 when supported.

2021-11-01 Thread Maxim Levitsky
This allows making the PDPTRs part of the migration
stream and thus not reloading them after migration, which
would be against the x86 spec.

Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c   |   5 ++
 include/sysemu/kvm.h  |   4 ++
 target/i386/cpu.h |   3 ++
 target/i386/kvm/kvm.c | 107 +-
 target/i386/machine.c |  29 
 5 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index db8d83b137..5558786d3d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -163,6 +163,7 @@ bool kvm_msi_via_irqfd_allowed;
 bool kvm_gsi_routing_allowed;
 bool kvm_gsi_direct_mapping;
 bool kvm_allowed;
+bool kvm_sregs2;
 bool kvm_readonly_mem_allowed;
 bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
@@ -2557,6 +2558,10 @@ static int kvm_init(MachineState *ms)
 kvm_ioeventfd_any_length_allowed =
 (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
 
+
+kvm_sregs2 =
+(kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..b3d4538c55 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -32,6 +32,7 @@
 #ifdef CONFIG_KVM_IS_POSSIBLE
 
 extern bool kvm_allowed;
+extern bool kvm_sregs2;
 extern bool kvm_kernel_irqchip;
 extern bool kvm_split_irqchip;
 extern bool kvm_async_interrupts_allowed;
@@ -139,6 +140,9 @@ extern bool kvm_msi_use_devid;
  */
 #define kvm_gsi_direct_mapping() (kvm_gsi_direct_mapping)
 
+
+#define kvm_supports_sregs2() (kvm_sregs2)
+
 /**
  * kvm_readonly_mem_enabled:
  *
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 3edaad7688..8c39e787f9 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1452,6 +1452,9 @@ typedef struct CPUX86State {
 SegmentCache idt; /* only base and limit are used */
 
 target_ulong cr[5]; /* NOTE: cr1 is unused */
+
+bool pdptrs_valid;
+uint64_t pdptrs[4];
 int32_t a20_mask;
 
 BNDReg bnd_regs[4];
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0eb7a0340c..13be6995a8 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2646,6 +2646,61 @@ static int kvm_put_sregs(X86CPU *cpu)
 return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS, &sregs);
 }
 
+static int kvm_put_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i;
+
+sregs.flags = 0;
+
+if ((env->eflags & VM_MASK)) {
+set_v8086_seg(&sregs.cs, &env->segs[R_CS]);
+set_v8086_seg(&sregs.ds, &env->segs[R_DS]);
+set_v8086_seg(&sregs.es, &env->segs[R_ES]);
+set_v8086_seg(&sregs.fs, &env->segs[R_FS]);
+set_v8086_seg(&sregs.gs, &env->segs[R_GS]);
+set_v8086_seg(&sregs.ss, &env->segs[R_SS]);
+} else {
+set_seg(&sregs.cs, &env->segs[R_CS]);
+set_seg(&sregs.ds, &env->segs[R_DS]);
+set_seg(&sregs.es, &env->segs[R_ES]);
+set_seg(&sregs.fs, &env->segs[R_FS]);
+set_seg(&sregs.gs, &env->segs[R_GS]);
+set_seg(&sregs.ss, &env->segs[R_SS]);
+}
+
+set_seg(&sregs.tr, &env->tr);
+set_seg(&sregs.ldt, &env->ldt);
+
+sregs.idt.limit = env->idt.limit;
+sregs.idt.base = env->idt.base;
+memset(sregs.idt.padding, 0, sizeof sregs.idt.padding);
+sregs.gdt.limit = env->gdt.limit;
+sregs.gdt.base = env->gdt.base;
+memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding);
+
+sregs.cr0 = env->cr[0];
+sregs.cr2 = env->cr[2];
+sregs.cr3 = env->cr[3];
+sregs.cr4 = env->cr[4];
+
+sregs.cr8 = cpu_get_apic_tpr(cpu->apic_state);
+sregs.apic_base = cpu_get_apic_base(cpu->apic_state);
+
+sregs.efer = env->efer;
+
+if (env->pdptrs_valid) {
+for (i = 0; i < 4; i++) {
+sregs.pdptrs[i] = env->pdptrs[i];
+}
+sregs.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
+}
+
+return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS2, &sregs);
+}
+
+
 static void kvm_msr_buf_reset(X86CPU *cpu)
 {
 memset(cpu->kvm_msr_buf, 0, MSR_BUF_SIZE);
@@ -3322,6 +3377,53 @@ static int kvm_get_sregs(X86CPU *cpu)
 return 0;
 }
 
+static int kvm_get_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i, ret;
+
+ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs);
+if (ret < 0) {
+return ret;
+}
+
+get_seg(&env->segs[R_CS], &sregs.cs);
+get_seg(&env->segs[R_DS], &sregs.ds);
+get_seg(&env->segs[R_ES], &sregs.es);
+get_seg(&env->segs[R_FS], &sregs.fs);
+get_seg(&env->segs[R_GS], &sregs.gs);
+get_seg(&env->segs[R_SS], &sreg

[PATCH v2 3/3] KVM: SVM: add migration support for nested TSC scaling

2021-11-01 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 target/i386/cpu.c |  5 +
 target/i386/cpu.h |  4 
 target/i386/kvm/kvm.c | 15 +++
 target/i386/machine.c | 23 +++
 4 files changed, 47 insertions(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 598d451dcf..53a23ca006 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5928,6 +5928,11 @@ static void x86_cpu_reset(DeviceState *dev)
 }
 
 x86_cpu_set_sgxlepubkeyhash(env);
+
+if (env->features[FEAT_SVM] & CPUID_SVM_TSCSCALE) {
+env->amd_tsc_scale_msr =  MSR_AMD64_TSC_RATIO_DEFAULT;
+}
+
 #endif
 }
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 8c39e787f9..9911d7c871 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -499,6 +499,9 @@ typedef enum X86Seg {
#define MSR_GSBASE  0xc0000101
#define MSR_KERNELGSBASE    0xc0000102
#define MSR_TSC_AUX 0xc0000103
+#define MSR_AMD64_TSC_RATIO 0xc0000104
+
+#define MSR_AMD64_TSC_RATIO_DEFAULT 0x100000000ULL
 
 #define MSR_VM_HSAVE_PA 0xc0010117
 
@@ -1539,6 +1542,7 @@ typedef struct CPUX86State {
 uint32_t tsx_ctrl;
 
 uint64_t spec_ctrl;
+uint64_t amd_tsc_scale_msr;
 uint64_t virt_ssbd;
 
 /* End of state preserved by INIT (dummy marker).  */
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 13be6995a8..e2dad8e168 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -105,6 +105,7 @@ static bool has_msr_hv_reenlightenment;
 static bool has_msr_xss;
 static bool has_msr_umwait;
 static bool has_msr_spec_ctrl;
+static bool has_tsc_scale_msr;
 static bool has_msr_tsx_ctrl;
 static bool has_msr_virt_ssbd;
 static bool has_msr_smi_count;
@@ -2216,6 +2217,9 @@ static int kvm_get_supported_msrs(KVMState *s)
 case MSR_IA32_SPEC_CTRL:
 has_msr_spec_ctrl = true;
 break;
+case MSR_AMD64_TSC_RATIO:
+has_tsc_scale_msr = true;
+break;
 case MSR_IA32_TSX_CTRL:
 has_msr_tsx_ctrl = true;
 break;
@@ -3027,6 +3031,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (has_msr_spec_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_SPEC_CTRL, env->spec_ctrl);
 }
+if (has_tsc_scale_msr) {
+kvm_msr_entry_add(cpu, MSR_AMD64_TSC_RATIO, env->amd_tsc_scale_msr);
+}
+
 if (has_msr_tsx_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_TSX_CTRL, env->tsx_ctrl);
 }
@@ -3479,6 +3487,10 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_spec_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_SPEC_CTRL, 0);
 }
+if (has_tsc_scale_msr) {
+kvm_msr_entry_add(cpu, MSR_AMD64_TSC_RATIO, 0);
+}
+
 if (has_msr_tsx_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_TSX_CTRL, 0);
 }
@@ -3890,6 +3902,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 case MSR_IA32_SPEC_CTRL:
 env->spec_ctrl = msrs[i].data;
 break;
+case MSR_AMD64_TSC_RATIO:
+env->amd_tsc_scale_msr = msrs[i].data;
+break;
 case MSR_IA32_TSX_CTRL:
 env->tsx_ctrl = msrs[i].data;
 break;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 14dd5ed79c..5dea1ef30d 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1280,6 +1280,28 @@ static const VMStateDescription vmstate_spec_ctrl = {
 }
 };
 
+
+static bool amd_tsc_scale_msr_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+
+return env->amd_tsc_scale_msr &&
+   env->amd_tsc_scale_msr != MSR_AMD64_TSC_RATIO_DEFAULT;
+}
+
+static const VMStateDescription amd_tsc_scale_msr_ctrl = {
+.name = "cpu/amd_tsc_scale_msr",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = amd_tsc_scale_msr_needed,
+.fields = (VMStateField[]){
+VMSTATE_UINT64(env.amd_tsc_scale_msr, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
+
 static bool intel_pt_enable_needed(void *opaque)
 {
 X86CPU *cpu = opaque;
@@ -1586,6 +1608,7 @@ const VMStateDescription vmstate_x86_cpu = {
 &vmstate_pkru,
 &vmstate_pkrs,
 &vmstate_spec_ctrl,
+&amd_tsc_scale_msr_ctrl,
 &vmstate_mcg_ext_ctl,
 &vmstate_msr_intel_pt,
 &vmstate_msr_virt_ssbd,
-- 
2.26.3




[PATCH v2 0/3] KVM: qemu patches for few KVM features I developed

2021-11-01 Thread Maxim Levitsky
These patches implement the qemu side logic to support
the KVM features I developed recently.

All 3 patches are for features that are accepted upstream in KVM.

V2: rebased and fixed patch 2 to compile without kvm

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  KVM: use KVM_{GET|SET}_SREGS2 when supported.
  gdbstub: implement NOIRQ support for single step on KVM
  KVM: SVM: add migration support for nested TSC scaling

 accel/kvm/kvm-all.c   |  30 +++
 gdbstub.c |  62 +
 include/sysemu/kvm.h  |  19 +++
 target/i386/cpu.c |   5 ++
 target/i386/cpu.h |   7 +++
 target/i386/kvm/kvm.c | 122 +-
 target/i386/machine.c |  52 ++
 7 files changed, 284 insertions(+), 13 deletions(-)

-- 
2.26.3





[PATCH v2 2/3] gdbstub: implement NOIRQ support for single step on KVM

2021-11-01 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c  | 25 ++
 gdbstub.c| 62 
 include/sysemu/kvm.h | 15 +++
 3 files changed, 91 insertions(+), 11 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5558786d3d..a92c12da6f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -169,6 +169,8 @@ bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
 bool kvm_ioeventfd_any_length_allowed;
 bool kvm_msi_use_devid;
+bool kvm_has_guest_debug;
+int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static hwaddr kvm_max_slot_size = ~0;
 
@@ -2562,6 +2564,25 @@ static int kvm_init(MachineState *ms)
 kvm_sregs2 =
 (kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
 
+kvm_has_guest_debug =
+(kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
+
+kvm_sstep_flags = 0;
+
+if (kvm_has_guest_debug) {
+/* Assume that single stepping is supported */
+kvm_sstep_flags = SSTEP_ENABLE;
+
+int guest_debug_flags =
+kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);
+
+if (guest_debug_flags > 0) {
+if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
+kvm_sstep_flags |= SSTEP_NOIRQ;
+}
+}
+}
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
@@ -3191,6 +3212,10 @@ int kvm_update_guest_debug(CPUState *cpu, unsigned long 
reinject_trap)
 
 if (cpu->singlestep_enabled) {
 data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
+
+if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
+data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
+}
 }
 kvm_arch_update_guest_debug(cpu, &data.dbg);
 
diff --git a/gdbstub.c b/gdbstub.c
index 36b85aa50e..ad088fbd3e 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -368,12 +368,11 @@ typedef struct GDBState {
 gdb_syscall_complete_cb current_syscall_cb;
 GString *str_buf;
 GByteArray *mem_buf;
+int sstep_flags;
+int supported_sstep_flags;
 } GDBState;
 
-/* By default use no IRQs and no timers while single stepping so as to
- * make single stepping like an ICE HW step.
- */
-static int sstep_flags = SSTEP_ENABLE|SSTEP_NOIRQ|SSTEP_NOTIMER;
+static GDBState gdbserver_state;
 
 /* Retrieves flags for single step mode. */
 static int get_sstep_flags(void)
@@ -385,11 +384,10 @@ static int get_sstep_flags(void)
 if (replay_mode != REPLAY_MODE_NONE) {
 return SSTEP_ENABLE;
 } else {
-return sstep_flags;
+return gdbserver_state.sstep_flags;
 }
 }
 
-static GDBState gdbserver_state;
 
 static void init_gdbserver_state(void)
 {
@@ -399,6 +397,23 @@ static void init_gdbserver_state(void)
 gdbserver_state.str_buf = g_string_new(NULL);
 gdbserver_state.mem_buf = g_byte_array_sized_new(MAX_PACKET_LENGTH);
 gdbserver_state.last_packet = g_byte_array_sized_new(MAX_PACKET_LENGTH + 
4);
+
+
+if (kvm_enabled()) {
+gdbserver_state.supported_sstep_flags = 
kvm_get_supported_sstep_flags();
+} else {
+gdbserver_state.supported_sstep_flags =
+SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+}
+
+/*
+ * By default use no IRQs and no timers while single stepping so as to
+ * make single stepping like an ICE HW step.
+ */
+
+gdbserver_state.sstep_flags = SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+gdbserver_state.sstep_flags &= gdbserver_state.supported_sstep_flags;
+
 }
 
 #ifndef CONFIG_USER_ONLY
@@ -505,7 +520,7 @@ static int gdb_continue_partial(char *newstates)
 CPU_FOREACH(cpu) {
 if (newstates[cpu->cpu_index] == 's') {
 trace_gdbstub_op_stepping(cpu->cpu_index);
-cpu_single_step(cpu, sstep_flags);
+cpu_single_step(cpu, get_sstep_flags());
 }
 }
 gdbserver_state.running_state = 1;
@@ -2017,24 +2032,44 @@ static void handle_v_commands(GArray *params, void 
*user_ctx)
 
 static void handle_query_qemu_sstepbits(GArray *params, void *user_ctx)
 {
-g_string_printf(gdbserver_state.str_buf, "ENABLE=%x,NOIRQ=%x,NOTIMER=%x",
-SSTEP_ENABLE, SSTEP_NOIRQ, SSTEP_NOTIMER);
+g_string_printf(gdbserver_state.str_buf, "ENABLE=%x", SSTEP_ENABLE);
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOIRQ) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOIRQ=%x",
+   SSTEP_NOIRQ);
+}
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOTIMER) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOTIMER=%x",
+   SSTEP_NOTIMER);
+}
+
 put_strbuf();
 }
 
 static void handle_set_qemu_sstep(GArray *params, void *user_ctx)
 {
+int new_sstep_flags;
+
 if (!params->len) {
 return;
 }
 
-sstep_flags = get_param(p

Re: TCP/IP connections sometimes stop retransmitting packets (in nested virtualization case)

2021-10-18 Thread Maxim Levitsky
On Mon, 2021-10-18 at 16:49 -0400, Michael S. Tsirkin wrote:
> On Mon, Oct 18, 2021 at 11:05:23AM -0700, Eric Dumazet wrote:
> > 
> > On 10/17/21 3:50 AM, Maxim Levitsky wrote:
> > > Hi!
> > >  
> > > This is a follow up mail to my mail about NFS client deadlock I was 
> > > trying to debug last week:
> > > https://lore.kernel.org/all/e10b46b04fe4427fa50901dda71fb5f5a26af33e.ca...@redhat.com/T/#u
> > >  
> > > I strongly believe now that this is not related to NFS, but rather to 
> > > some issue in networking stack and maybe
> > > to somewhat non standard .config I was using for the kernels which has 
> > > many advanced networking options disabled
> > > (to cut on compile time).
> > > This is why I choose to start a new thread about it.
> > >  
> > > Regarding the custom .config file, in particular I disabled 
> > > CONFIG_NET_SCHED and CONFIG_TCP_CONG_ADVANCED. 
> > > Both host and the fedora32 VM run the same kernel with those options 
> > > disabled.
> > > 
> > > 
> > > My setup is a VM (fedora32) which runs Win10 HyperV VM inside, nested, 
> > > which in turn runs a fedora32 VM
> > > (but I was able to reproduce it with ordinary HyperV disabled VM running 
> > > in the same fedora 32 VM)
> > >  
> > > The host is running a NFS server, and the fedora32 VM runs a NFS client 
> > > which is used to read/write to a qcow2 file
> > > which contains the disk of the nested Win10 VM. The L3 VM which windows 
> > > VM optionally
> > > runs, is contained in the same qcow2 file.
> > > 
> > > 
> > > I managed to capture (using wireshark) packets around the failure in both 
> > > L0 and L1.
> > > The trace shows fair number of lost packets, a bit more than I would 
> > > expect from communication that is running on the same host, 
> > > but they are retransmitted and don't cause any issues until the moment of 
> > > failure.
> > > 
> > > 
> > > The failure happens when one packet which is sent from host to the guest,
> > > is not received by the guest (as evident by the L1 trace, and by the 
> > > following SACKS from the guest which exclude this packet), 
> > > and then the host (on which the NFS server runs) never attempts to 
> > > re-transmit it.
> > > 
> > > 
> > > The host keeps on sending further TCP packets with replies to previous 
> > > RPC calls it received from the fedora32 VM,
> > > with an increasing sequence number, as evident from both traces, and the 
> > > fedora32 VM keeps on SACK'ing those received packets, 
> > > patiently waiting for the retransmission.
> > >  
> > > After around 12 minutes (!), the host RSTs the connection.
> > > 
> > > It is worth mentioning that while all of this is happening, the fedora32 
> > > VM can become hung if one attempts to access the files 
> > > on the NFS share because effectively all NFS communication is blocked on 
> > > TCP level.
> > > 
> > > I attached an extract from the two traces  (in L0 and L1) around the 
> > > failure up to the RST packet.
> > > 
> > > In this trace the second packet with TCP sequence number 1736557331 
> > > (first one was empty without data) is not received by the guest
> > > and then never retransmitted by the host.
> > > 
> > > Also worth noting that to ease on storage I captured only 512 bytes of 
> > > each packet, but wireshark
> > > notes how many bytes were in the actual packet.
> > >  
> > > Best regards,
> > >   Maxim Levitsky
> > 
> > TCP has special logic not attempting a retransmit if it senses the prior
> > packet has not been consumed yet.
> > 
> > Usually, the consume part is done from NIC drivers at TX completion time,
> > when NIC signals packet has been sent to the wire.
> > 
> > It seems one skb is essentially leaked somewhere, and leaked (not freed)
> 
> Thanks Eric!
> 
> Maxim since the packets that leak are transmitted on the host,
> the question then is what kind of device do you use on the host
> to talk to the guest? tap?
> 
> 
Yes, tap with a bridge, similar to how libvirt does 'bridge' networking for VMs.
I use my own set of scripts to run qemu directly.

Usually vhost is used in both L0 and L1, and it 'seems' to help to reproduce it,
but I did reproduce this with vhost disabled on both L0 and L1.

The capture was done on the bridge interface on L0, and on a virtual network 
card in L1.

It does seem that I am unable to make it fail again (maybe luck?) with 
CONFIG_NET_SCHED (and its suboptions)
and CONFIG_TCP_CONG_ADVANCED set back to defaults (everything 'm')

Also, just to avoid going down the wrong path, note that I did once reproduce
this on an e1000e virtual NIC, thus virtio is likely not to blame here.


Thanks,
Best regards,
Maxim Levitsky




Re: [PATCH 2/3] gdbstub: implement NOIRQ support for single step on KVM

2021-10-13 Thread Maxim Levitsky
On Wed, 2021-10-13 at 16:50 +0100, Alex Bennée wrote:
> Maxim Levitsky  writes:
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  accel/kvm/kvm-all.c  | 25 ++
> >  gdbstub.c| 60 
> >  include/sysemu/kvm.h | 13 ++
> >  3 files changed, 88 insertions(+), 10 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index 6b187e9c96..e141260796 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -169,6 +169,8 @@ bool kvm_vm_attributes_allowed;
> >  bool kvm_direct_msi_allowed;
> >  bool kvm_ioeventfd_any_length_allowed;
> >  bool kvm_msi_use_devid;
> > +bool kvm_has_guest_debug;
> > +int kvm_sstep_flags;
> >  static bool kvm_immediate_exit;
> >  static hwaddr kvm_max_slot_size = ~0;
> >  
> > @@ -2559,6 +2561,25 @@ static int kvm_init(MachineState *ms)
> >  kvm_sregs2 =
> >  (kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
> >  
> > +kvm_has_guest_debug =
> > +(kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
> > +
> > +kvm_sstep_flags = 0;
> > +
> > +if (kvm_has_guest_debug) {
> > +/* Assume that single stepping is supported */
> > +kvm_sstep_flags = SSTEP_ENABLE;
> > +
> > +int guest_debug_flags =
> > +kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);
> > +
> > +if (guest_debug_flags > 0) {
> > +if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
> > +kvm_sstep_flags |= SSTEP_NOIRQ;
> > +}
> > +}
> > +}
> > +
> >  kvm_state = s;
> >  
> >  ret = kvm_arch_init(ms, s);
> > @@ -3188,6 +3209,10 @@ int kvm_update_guest_debug(CPUState *cpu, unsigned 
> > long reinject_trap)
> >  
> >  if (cpu->singlestep_enabled) {
> >  data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
> > +
> > +if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
> > +data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
> > +}
> >  }
> >  kvm_arch_update_guest_debug(cpu, &data.dbg);
> >  
> > diff --git a/gdbstub.c b/gdbstub.c
> > index 5d8e6ae3cd..48bb803bae 100644
> > --- a/gdbstub.c
> > +++ b/gdbstub.c
> > @@ -368,12 +368,11 @@ typedef struct GDBState {
> >  gdb_syscall_complete_cb current_syscall_cb;
> >  GString *str_buf;
> >  GByteArray *mem_buf;
> > +int sstep_flags;
> > +int supported_sstep_flags;
> >  } GDBState;
> >  
> > -/* By default use no IRQs and no timers while single stepping so as to
> > - * make single stepping like an ICE HW step.
> > - */
> > -static int sstep_flags = SSTEP_ENABLE|SSTEP_NOIRQ|SSTEP_NOTIMER;
> > +static GDBState gdbserver_state;
> >  
> >  /* Retrieves flags for single step mode. */
> >  static int get_sstep_flags(void)
> > @@ -385,11 +384,10 @@ static int get_sstep_flags(void)
> >  if (replay_mode != REPLAY_MODE_NONE) {
> >  return SSTEP_ENABLE;
> >  } else {
> > -return sstep_flags;
> > +return gdbserver_state.sstep_flags;
> >  }
> >  }
> >  
> > -static GDBState gdbserver_state;
> >  
> >  static void init_gdbserver_state(void)
> >  {
> > @@ -399,6 +397,23 @@ static void init_gdbserver_state(void)
> >  gdbserver_state.str_buf = g_string_new(NULL);
> >  gdbserver_state.mem_buf = g_byte_array_sized_new(MAX_PACKET_LENGTH);
> >  gdbserver_state.last_packet = g_byte_array_sized_new(MAX_PACKET_LENGTH 
> > + 4);
> > +
> > +
> > +if (kvm_enabled()) {
> > +gdbserver_state.supported_sstep_flags = 
> > kvm_get_supported_sstep_flags();
> > +} else {
> > +gdbserver_state.supported_sstep_flags =
> > +SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
> > +}
> 
> This fails to build:
> 
> o -c ../../gdbstub.c
> ../../gdbstub.c: In function ‘init_gdbserver_state’:
> ../../gdbstub.c:403:49: error: implicit declaration of function 
> ‘kvm_get_supported_sstep_flags’ [-Werror=implicit-function-declaration]
>   403 | gdbserver_state.supported_sstep_flags = 
> kvm_get_supported_sstep_flags();
>   | 
> ^
> ../../gdbstub.c:403:49: error: nested extern declaration of 
> ‘kvm_get_supported_sstep_flags’ [-Werror=nested-externs]
> ../../gdbstub.c: I

Re: [PATCH 0/3] KVM: qemu patches for few KVM features I developed

2021-10-13 Thread Maxim Levitsky
On Tue, 2021-09-14 at 18:52 +0300, Maxim Levitsky wrote:
> These patches implement the qemu side logic to support
> the KVM features I developed recently.
> 
> First two patches are for features that are already accepted
> upstream, and I already posted them on the qemu mailing list once.
> 
> And the 3rd patch is for nested TSC scaling on SVM
> which isn't yet accepted in KVM but can already be added to qemu since
> it is conditional on KVM supporting it, and ABI wise it only relies
> on SVM spec.
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (3):
>   KVM: use KVM_{GET|SET}_SREGS2 when supported.
>   gdbstub: implement NOIRQ support for single step on KVM
>   KVM: SVM: add migration support for nested TSC scaling
> 
>  accel/kvm/kvm-all.c   |  30 +++
>  gdbstub.c |  60 +
>  include/sysemu/kvm.h  |  17 ++
>  target/i386/cpu.c |   5 ++
>  target/i386/cpu.h |   7 +++
>  target/i386/kvm/kvm.c | 122 +-
>  target/i386/machine.c |  53 ++
>  7 files changed, 282 insertions(+), 12 deletions(-)
> 
> -- 
> 2.26.3
> 
> 
Kind ping on these patches.

Best regards,
Maxim Levitsky




Re: [PATCH 0/3] KVM: qemu patches for few KVM features I developed

2021-09-30 Thread Maxim Levitsky
On Tue, 2021-09-14 at 18:52 +0300, Maxim Levitsky wrote:
> These patches implement the qemu side logic to support
> the KVM features I developed recently.
> 
> First two patches are for features that are already accepted
> upstream, and I already posted them on the qemu mailing list once.
> 
> And the 3rd patch is for nested TSC scaling on SVM
> which isn't yet accepted in KVM but can already be added to qemu since
> it is conditional on KVM supporting it, and ABI wise it only relies
> on SVM spec.
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (3):
>   KVM: use KVM_{GET|SET}_SREGS2 when supported.
>   gdbstub: implement NOIRQ support for single step on KVM
>   KVM: SVM: add migration support for nested TSC scaling
> 
>  accel/kvm/kvm-all.c   |  30 +++
>  gdbstub.c |  60 +
>  include/sysemu/kvm.h  |  17 ++
>  target/i386/cpu.c |   5 ++
>  target/i386/cpu.h |   7 +++
>  target/i386/kvm/kvm.c | 122 +-
>  target/i386/machine.c |  53 ++
>  7 files changed, 282 insertions(+), 12 deletions(-)
> 
> -- 
> 2.26.3
> 
> 
Very polite ping on these patches.

Best regards,
   Maxim Levitsky




Re: [PATCH] qemu-img: Add -F shorthand to convert

2021-09-15 Thread Maxim Levitsky
to out;
>  }
> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> index b3620f29e50c..4c4d94ab2267 100644
> --- a/qemu-img-cmds.hx
> +++ b/qemu-img-cmds.hx
> @@ -46,7 +46,7 @@ SRST
>  ERST
> 
>  DEF("convert", img_convert,
> -"convert [--object objectdef] [--image-opts] [--target-image-opts] 
> [--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t 
> cache] [-T src_cache] [-O output_fmt] [-B backing_file] [-o options] [-l 
> snapshot_param] [-S sparse_size] [-r rate_limit] [-m num_coroutines] [-W] 
> [--salvage] filename [filename2 [...]] output_filename")
> +"convert [--object objectdef] [--image-opts] [--target-image-opts] 
> [--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] [-n] [-f fmt] [-t 
> cache] [-T src_cache] [-O output_fmt] [-B backing_file [-F backing_fmt]] [-o 
> options] [-l snapshot_param] [-S sparse_size] [-r rate_limit] [-m 
> num_coroutines] [-W] [--salvage] filename [filename2 [...]] output_filename")
>  SRST
>  .. option:: convert [--object OBJECTDEF] [--image-opts] 
> [--target-image-opts] [--target-is-zero] [--bitmaps] [-U] [-C] [-c] [-p] [-q] 
> [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE] [-o 
> OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m 
> NUM_COROUTINES] [-W] [--salvage] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
>  ERST
> diff --git a/tests/qemu-iotests/122 b/tests/qemu-iotests/122
> index 5d550ed13ea3..efb260d822db 100755
> --- a/tests/qemu-iotests/122
> +++ b/tests/qemu-iotests/122
> @@ -67,7 +67,7 @@ echo
>  _make_test_img -b "$TEST_IMG".base -F $IMGFMT
> 
>  $QEMU_IO -c "write -P 0 0 3M" "$TEST_IMG" 2>&1 | _filter_qemu_io | 
> _filter_testdir
> -$QEMU_IMG convert -O $IMGFMT -B "$TEST_IMG".base -o backing_fmt=$IMGFMT \
> +$QEMU_IMG convert -O $IMGFMT -B "$TEST_IMG".base -F $IMGFMT \
>  "$TEST_IMG" "$TEST_IMG".orig
>  $QEMU_IO -c "read -P 0 0 3M" "$TEST_IMG".orig 2>&1 | _filter_qemu_io | 
> _filter_testdir
>  $QEMU_IMG convert -O $IMGFMT -c -B "$TEST_IMG".base -o backing_fmt=$IMGFMT \

I have seen that warning a few times already in my scripts, so it is good to
have this option.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky






[PATCH 2/3] gdbstub: implement NOIRQ support for single step on KVM

2021-09-14 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c  | 25 ++
 gdbstub.c| 60 
 include/sysemu/kvm.h | 13 ++
 3 files changed, 88 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6b187e9c96..e141260796 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -169,6 +169,8 @@ bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
 bool kvm_ioeventfd_any_length_allowed;
 bool kvm_msi_use_devid;
+bool kvm_has_guest_debug;
+int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static hwaddr kvm_max_slot_size = ~0;
 
@@ -2559,6 +2561,25 @@ static int kvm_init(MachineState *ms)
 kvm_sregs2 =
 (kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
 
+kvm_has_guest_debug =
+(kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
+
+kvm_sstep_flags = 0;
+
+if (kvm_has_guest_debug) {
+/* Assume that single stepping is supported */
+kvm_sstep_flags = SSTEP_ENABLE;
+
+int guest_debug_flags =
+kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);
+
+if (guest_debug_flags > 0) {
+if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
+kvm_sstep_flags |= SSTEP_NOIRQ;
+}
+}
+}
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
@@ -3188,6 +3209,10 @@ int kvm_update_guest_debug(CPUState *cpu, unsigned long 
reinject_trap)
 
 if (cpu->singlestep_enabled) {
 data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
+
+if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
+data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
+}
 }
 kvm_arch_update_guest_debug(cpu, &data.dbg);
 
diff --git a/gdbstub.c b/gdbstub.c
index 5d8e6ae3cd..48bb803bae 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -368,12 +368,11 @@ typedef struct GDBState {
 gdb_syscall_complete_cb current_syscall_cb;
 GString *str_buf;
 GByteArray *mem_buf;
+int sstep_flags;
+int supported_sstep_flags;
 } GDBState;
 
-/* By default use no IRQs and no timers while single stepping so as to
- * make single stepping like an ICE HW step.
- */
-static int sstep_flags = SSTEP_ENABLE|SSTEP_NOIRQ|SSTEP_NOTIMER;
+static GDBState gdbserver_state;
 
 /* Retrieves flags for single step mode. */
 static int get_sstep_flags(void)
@@ -385,11 +384,10 @@ static int get_sstep_flags(void)
 if (replay_mode != REPLAY_MODE_NONE) {
 return SSTEP_ENABLE;
 } else {
-return sstep_flags;
+return gdbserver_state.sstep_flags;
 }
 }
 
-static GDBState gdbserver_state;
 
 static void init_gdbserver_state(void)
 {
@@ -399,6 +397,23 @@ static void init_gdbserver_state(void)
 gdbserver_state.str_buf = g_string_new(NULL);
 gdbserver_state.mem_buf = g_byte_array_sized_new(MAX_PACKET_LENGTH);
 gdbserver_state.last_packet = g_byte_array_sized_new(MAX_PACKET_LENGTH + 
4);
+
+
+if (kvm_enabled()) {
+gdbserver_state.supported_sstep_flags = 
kvm_get_supported_sstep_flags();
+} else {
+gdbserver_state.supported_sstep_flags =
+SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+}
+
+/*
+ * By default use no IRQs and no timers while single stepping so as to
+ * make single stepping like an ICE HW step.
+ */
+
+gdbserver_state.sstep_flags = SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+gdbserver_state.sstep_flags &= gdbserver_state.supported_sstep_flags;
+
 }
 
 #ifndef CONFIG_USER_ONLY
@@ -2017,24 +2032,44 @@ static void handle_v_commands(GArray *params, void 
*user_ctx)
 
 static void handle_query_qemu_sstepbits(GArray *params, void *user_ctx)
 {
-g_string_printf(gdbserver_state.str_buf, "ENABLE=%x,NOIRQ=%x,NOTIMER=%x",
-SSTEP_ENABLE, SSTEP_NOIRQ, SSTEP_NOTIMER);
+g_string_printf(gdbserver_state.str_buf, "ENABLE=%x", SSTEP_ENABLE);
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOIRQ) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOIRQ=%x",
+   SSTEP_NOIRQ);
+}
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOTIMER) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOTIMER=%x",
+   SSTEP_NOTIMER);
+}
+
 put_strbuf();
 }
 
 static void handle_set_qemu_sstep(GArray *params, void *user_ctx)
 {
+int new_sstep_flags;
+
 if (!params->len) {
 return;
 }
 
-sstep_flags = get_param(params, 0)->val_ul;
+new_sstep_flags = get_param(params, 0)->val_ul;
+
+if (new_sstep_flags  & ~gdbserver_state.supported_sstep_flags) {
+put_packet("E22");
+return;
+}
+
+gdbserver_state.sstep_flags = new_sstep_flags;
 put_packet("OK");
 }
 
 static void handle_query_qemu_sstep(GArray *params, void *user_ctx)
 {
-g_

[PATCH 1/3] KVM: use KVM_{GET|SET}_SREGS2 when supported.

2021-09-14 Thread Maxim Levitsky
This allows making the PDPTRs part of the migration
stream, and thus avoids reloading them after migration,
which is against the x86 spec.

Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c   |   5 ++
 include/sysemu/kvm.h  |   4 ++
 target/i386/cpu.h |   3 ++
 target/i386/kvm/kvm.c | 107 +-
 target/i386/machine.c |  30 
 5 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 0125c17edb..6b187e9c96 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -163,6 +163,7 @@ bool kvm_msi_via_irqfd_allowed;
 bool kvm_gsi_routing_allowed;
 bool kvm_gsi_direct_mapping;
 bool kvm_allowed;
+bool kvm_sregs2;
 bool kvm_readonly_mem_allowed;
 bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
@@ -2554,6 +2555,10 @@ static int kvm_init(MachineState *ms)
 kvm_ioeventfd_any_length_allowed =
 (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
 
+
+kvm_sregs2 =
+(kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..b3d4538c55 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -32,6 +32,7 @@
 #ifdef CONFIG_KVM_IS_POSSIBLE
 
 extern bool kvm_allowed;
+extern bool kvm_sregs2;
 extern bool kvm_kernel_irqchip;
 extern bool kvm_split_irqchip;
 extern bool kvm_async_interrupts_allowed;
@@ -139,6 +140,9 @@ extern bool kvm_msi_use_devid;
  */
 #define kvm_gsi_direct_mapping() (kvm_gsi_direct_mapping)
 
+
+#define kvm_supports_sregs2() (kvm_sregs2)
+
 /**
  * kvm_readonly_mem_enabled:
  *
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 71ae3141c3..9adae12426 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1436,6 +1436,9 @@ typedef struct CPUX86State {
 SegmentCache idt; /* only base and limit are used */
 
 target_ulong cr[5]; /* NOTE: cr1 is unused */
+
+bool pdptrs_valid;
+uint64_t pdptrs[4];
 int32_t a20_mask;
 
 BNDReg bnd_regs[4];
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 500d2e0e68..841b3b98f7 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2587,6 +2587,61 @@ static int kvm_put_sregs(X86CPU *cpu)
 return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS, &sregs);
 }
 
+static int kvm_put_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i;
+
+sregs.flags = 0;
+
+if ((env->eflags & VM_MASK)) {
+set_v8086_seg(&sregs.cs, &env->segs[R_CS]);
+set_v8086_seg(&sregs.ds, &env->segs[R_DS]);
+set_v8086_seg(&sregs.es, &env->segs[R_ES]);
+set_v8086_seg(&sregs.fs, &env->segs[R_FS]);
+set_v8086_seg(&sregs.gs, &env->segs[R_GS]);
+set_v8086_seg(&sregs.ss, &env->segs[R_SS]);
+} else {
+set_seg(&sregs.cs, &env->segs[R_CS]);
+set_seg(&sregs.ds, &env->segs[R_DS]);
+set_seg(&sregs.es, &env->segs[R_ES]);
+set_seg(&sregs.fs, &env->segs[R_FS]);
+set_seg(&sregs.gs, &env->segs[R_GS]);
+set_seg(&sregs.ss, &env->segs[R_SS]);
+}
+
+set_seg(&sregs.tr, &env->tr);
+set_seg(&sregs.ldt, &env->ldt);
+
+sregs.idt.limit = env->idt.limit;
+sregs.idt.base = env->idt.base;
+memset(sregs.idt.padding, 0, sizeof sregs.idt.padding);
+sregs.gdt.limit = env->gdt.limit;
+sregs.gdt.base = env->gdt.base;
+memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding);
+
+sregs.cr0 = env->cr[0];
+sregs.cr2 = env->cr[2];
+sregs.cr3 = env->cr[3];
+sregs.cr4 = env->cr[4];
+
+sregs.cr8 = cpu_get_apic_tpr(cpu->apic_state);
+sregs.apic_base = cpu_get_apic_base(cpu->apic_state);
+
+sregs.efer = env->efer;
+
+if (env->pdptrs_valid) {
+for (i = 0; i < 4; i++) {
+sregs.pdptrs[i] = env->pdptrs[i];
+}
+sregs.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
+}
+
+return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS2, &sregs);
+}
+
+
 static void kvm_msr_buf_reset(X86CPU *cpu)
 {
 memset(cpu->kvm_msr_buf, 0, MSR_BUF_SIZE);
@@ -3252,6 +3307,53 @@ static int kvm_get_sregs(X86CPU *cpu)
 return 0;
 }
 
+static int kvm_get_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i, ret;
+
+ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs);
+if (ret < 0) {
+return ret;
+}
+
+get_seg(&env->segs[R_CS], &sregs.cs);
+get_seg(&env->segs[R_DS], &sregs.ds);
+get_seg(&env->segs[R_ES], &sregs.es);
+get_seg(&env->segs[R_FS], &sregs.fs);
+get_seg(&env->segs[R_GS], &sregs.gs);
+get_seg(&env->segs[R_SS], &sreg

[PATCH 3/3] KVM: SVM: add migration support for nested TSC scaling

2021-09-14 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 target/i386/cpu.c |  5 +
 target/i386/cpu.h |  4 
 target/i386/kvm/kvm.c | 15 +++
 target/i386/machine.c | 23 +++
 4 files changed, 47 insertions(+)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 6b029f1bdf..0870b53509 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -5770,6 +5770,11 @@ static void x86_cpu_reset(DeviceState *dev)
 if (kvm_enabled()) {
 kvm_arch_reset_vcpu(cpu);
 }
+
+if (env->features[FEAT_SVM] & CPUID_SVM_TSCSCALE) {
+env->amd_tsc_scale_msr =  MSR_AMD64_TSC_RATIO_DEFAULT;
+}
+
 #endif
 }
 
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9adae12426..b9e1a3b7db 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -491,6 +491,9 @@ typedef enum X86Seg {
 #define MSR_GSBASE  0xc101
 #define MSR_KERNELGSBASE0xc102
 #define MSR_TSC_AUX 0xc103
+#define MSR_AMD64_TSC_RATIO 0xc104
+
+#define MSR_AMD64_TSC_RATIO_DEFAULT 0x1ULL
 
 #define MSR_VM_HSAVE_PA 0xc0010117
 
@@ -1522,6 +1525,7 @@ typedef struct CPUX86State {
 uint32_t tsx_ctrl;
 
 uint64_t spec_ctrl;
+uint64_t amd_tsc_scale_msr;
 uint64_t virt_ssbd;
 
 /* End of state preserved by INIT (dummy marker).  */
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 841b3b98f7..bd53a55148 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -106,6 +106,7 @@ static bool has_msr_hv_reenlightenment;
 static bool has_msr_xss;
 static bool has_msr_umwait;
 static bool has_msr_spec_ctrl;
+static bool has_tsc_scale_msr;
 static bool has_msr_tsx_ctrl;
 static bool has_msr_virt_ssbd;
 static bool has_msr_smi_count;
@@ -2157,6 +2158,9 @@ static int kvm_get_supported_msrs(KVMState *s)
 case MSR_IA32_SPEC_CTRL:
 has_msr_spec_ctrl = true;
 break;
+case MSR_AMD64_TSC_RATIO:
+has_tsc_scale_msr = true;
+break;
 case MSR_IA32_TSX_CTRL:
 has_msr_tsx_ctrl = true;
 break;
@@ -2968,6 +2972,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
 if (has_msr_spec_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_SPEC_CTRL, env->spec_ctrl);
 }
+if (has_tsc_scale_msr) {
+kvm_msr_entry_add(cpu, MSR_AMD64_TSC_RATIO, env->amd_tsc_scale_msr);
+}
+
 if (has_msr_tsx_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_TSX_CTRL, env->tsx_ctrl);
 }
@@ -3409,6 +3417,10 @@ static int kvm_get_msrs(X86CPU *cpu)
 if (has_msr_spec_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_SPEC_CTRL, 0);
 }
+if (has_tsc_scale_msr) {
+kvm_msr_entry_add(cpu, MSR_AMD64_TSC_RATIO, 0);
+}
+
 if (has_msr_tsx_ctrl) {
 kvm_msr_entry_add(cpu, MSR_IA32_TSX_CTRL, 0);
 }
@@ -3813,6 +3825,9 @@ static int kvm_get_msrs(X86CPU *cpu)
 case MSR_IA32_SPEC_CTRL:
 env->spec_ctrl = msrs[i].data;
 break;
+case MSR_AMD64_TSC_RATIO:
+env->amd_tsc_scale_msr = msrs[i].data;
+break;
 case MSR_IA32_TSX_CTRL:
 env->tsx_ctrl = msrs[i].data;
 break;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 154666e7c0..39c8faf0ce 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1280,6 +1280,28 @@ static const VMStateDescription vmstate_spec_ctrl = {
 }
 };
 
+
+static bool amd_tsc_scale_msr_needed(void *opaque)
+{
+X86CPU *cpu = opaque;
+CPUX86State *env = &cpu->env;
+
+return env->amd_tsc_scale_msr &&
+   env->amd_tsc_scale_msr != MSR_AMD64_TSC_RATIO_DEFAULT;
+}
+
+static const VMStateDescription amd_tsc_scale_msr_ctrl = {
+.name = "cpu/amd_tsc_scale_msr",
+.version_id = 1,
+.minimum_version_id = 1,
+.needed = amd_tsc_scale_msr_needed,
+.fields = (VMStateField[]){
+VMSTATE_UINT64(env.amd_tsc_scale_msr, X86CPU),
+VMSTATE_END_OF_LIST()
+}
+};
+
+
 static bool intel_pt_enable_needed(void *opaque)
 {
 X86CPU *cpu = opaque;
@@ -1568,6 +1590,7 @@ const VMStateDescription vmstate_x86_cpu = {
 &vmstate_pkru,
 &vmstate_pkrs,
 &vmstate_spec_ctrl,
+&amd_tsc_scale_msr_ctrl,
 &vmstate_mcg_ext_ctl,
 &vmstate_msr_intel_pt,
 &vmstate_msr_virt_ssbd,
-- 
2.26.3




[PATCH 0/3] KVM: qemu patches for few KVM features I developed

2021-09-14 Thread Maxim Levitsky
These patches implement the qemu side logic to support
the KVM features I developed recently.

First two patches are for features that are already accepted
upstream, and I already posted them on the qemu mailing list once.

And the 3rd patch is for nested TSC scaling on SVM
which isn't yet accepted in KVM but can already be added to qemu since
it is conditional on KVM supporting it, and ABI wise it only relies
on SVM spec.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  KVM: use KVM_{GET|SET}_SREGS2 when supported.
  gdbstub: implement NOIRQ support for single step on KVM
  KVM: SVM: add migration support for nested TSC scaling

 accel/kvm/kvm-all.c   |  30 +++
 gdbstub.c |  60 +
 include/sysemu/kvm.h  |  17 ++
 target/i386/cpu.c |   5 ++
 target/i386/cpu.h |   7 +++
 target/i386/kvm/kvm.c | 122 +-
 target/i386/machine.c |  53 ++
 7 files changed, 282 insertions(+), 12 deletions(-)

-- 
2.26.3





Re: About two-dimensional page translation (e.g., Intel EPT) and shadow page table in Linux QEMU/KVM

2021-07-12 Thread Maxim Levitsky
On Mon, 2021-07-12 at 08:02 -0500, harry harry wrote:
> Dear Maxim,
> 
> Thanks for your reply. I knew, in our current design/implementation,
> EPT/NPT is enabled by a module param. I think it is possible to modify
> the QEMU/KVM code to let it support EPT/NPT and shadow page table (SPT)
> simultaneously (e.g., for an 80-core server, 40 cores use EPT/NPT and
> the other 40 cores use SPT). What do you think? Thanks!
> 
> Best regards,
> Harry
> 
> On Mon, Jul 12, 2021 at 4:49 AM Maxim Levitsky  wrote:
> > On Sun, 2021-07-11 at 15:13 -0500, harry harry wrote:
> > > Hi all,
> > > 
> > > I hope you are very well! May I know whether it is possible to enable
> > > two-dimensional page translation (e.g., Intel EPT) mechanisms and
> > > shadow page table mechanisms in Linux QEMU/KVM at the same time on a
> > > physical server? For example, if the physical server has 80 cores, is
> > > it possible to let 40 cores use Intel EPT mechanisms for page
> > > translation and the other 40 cores use shadow page table mechanisms?
> > > Thanks!
> > 
> > Nope sadly. EPT/NPT is enabled by a module param.
> > 
> > Best regards,
> > Maxim Levitsky
> > 
> > > Best,
> > > Harry
> > > 
For the same VM, I don't think it is feasible.

For multiple VMs, making some use NPT/EPT and some not
should be possible to implement.

Why do you need it?

Best regards,
Maxim Levitsky




Re: About two-dimensional page translation (e.g., Intel EPT) and shadow page table in Linux QEMU/KVM

2021-07-12 Thread Maxim Levitsky
On Sun, 2021-07-11 at 15:13 -0500, harry harry wrote:
> Hi all,
> 
> I hope you are very well! May I know whether it is possible to enable
> two-dimensional page translation (e.g., Intel EPT) mechanisms and
> shadow page table mechanisms in Linux QEMU/KVM at the same time on a
> physical server? For example, if the physical server has 80 cores, is
> it possible to let 40 cores use Intel EPT mechanisms for page
> translation and the other 40 cores use shadow page table mechanisms?
> Thanks!

Nope sadly. EPT/NPT is enabled by a module param.

Best regards,
Maxim Levitsky

> 
> Best,
> Harry
> 





Re: [PATCH v2] docs: Add '-device intel-iommu' entry

2021-07-07 Thread Maxim Levitsky
On Wed, 2021-07-07 at 11:41 -0400, Peter Xu wrote:
> The parameters of intel-iommu device are non-trivial to understand.  Add an
> entry for it so that people can reference to it when using.
> 
> There're actually a few more options there, but I hide them explicitly because
> they shouldn't be used by normal QEMU users.
> 
> Cc: Chao Yang 
> Cc: Lei Yang 
> Cc: Jing Zhao 
> Cc: Jason Wang 
> Cc: Michael S. Tsirkin 
> Cc: Alex Williamson 
> Reviewed-by: Jason Wang 
> Reviewed-by: Yi Liu 
> Signed-off-by: Peter Xu 
> ---
> v2:
> - Drop "in the guest" in intremap entry [Jason]
> - Explain how the default value of intremap is chosen [Eric]
> - Add r-bs for Jason and Yi
> Signed-off-by: Peter Xu 
> ---
>  qemu-options.hx | 33 +
>  1 file changed, 33 insertions(+)
> 
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 8965dabc83..0fcc8973dd 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -926,6 +926,39 @@ SRST
>  
>  ``-device pci-ipmi-bt,bmc=id``
>  Like the KCS interface, but defines a BT interface on the PCI bus.
> +
> +``-device intel-iommu[,option=...]``
> +This is only supported by ``-machine q35``, which will enable Intel VT-d
> +emulation within the guest.  It supports below options:
> +
> +``intremap=on|off`` (default: auto)
> +This enables interrupt remapping feature.  It's required to enable
> +complete x2apic.  Currently it only supports kvm kernel-irqchip modes
> +``off`` or ``split``, while full kernel-irqchip is not yet supported.
> +The default value is "auto", which will be decided by the mode of
> +kernel-irqchip.
> +
> +``caching-mode=on|off`` (default: off)
> +This enables caching mode for the VT-d emulated device.  When
> +caching-mode is enabled, each guest DMA buffer mapping will generate 
> an
> +IOTLB invalidation from the guest IOMMU driver to the vIOMMU device 
> in
> +a synchronous way.  It is required for ``-device vfio-pci`` to work
> +with the VT-d device, because host assigned devices requires to setup
> +the DMA mapping on the host before guest DMA starts.
> +
> +``device-iotlb=on|off`` (default: off)
> +This enables device-iotlb capability for the emulated VT-d device.  
> So
> +far virtio/vhost should be the only real user for this parameter,
> +paired with ats=on configured for the device.
> +
> +``aw-bits=39|48`` (default: 39)
> +This decides the address width of IOVA address space.  The address
> +space has 39 bits width for 3-level IOMMU page tables, and 48 bits 
> for
> +4-level IOMMU page tables.
> +
> +Please also refer to the wiki page for general scenarios of VT-d
> +emulation in QEMU: https://wiki.qemu.org/Features/VT-d.
> +
>  ERST
>  
>  DEF("name", HAS_ARG, QEMU_OPTION_name,

As far as I know this looks very good.
Thanks for doing this!

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky





Re: [PATCH] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-06-17 Thread Maxim Levitsky
On Mon, 2021-06-14 at 18:03 +0200, Philippe Mathieu-Daudé wrote:
> On 6/11/21 1:46 PM, Philippe Mathieu-Daudé wrote:
> > When the NVMe block driver was introduced (see commit bdd6a90a9e5,
> > January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
> > -ENOMEM in case of error. The driver was correctly handling the
> > error path to recycle its volatile IOVA mappings.
> > 
> > To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
> > DMA mappings per container", April 2019) added the -ENOSPC error to
> > signal the user exhausted the DMA mappings available for a container.
> 
> Hmm this commit has been added before v5.1-rc4.
> 
> So while this fixes the behavior of v5.1-rc4+ kernels,
> older kernels using this fix will have the same problem...


Hi!

I wonder why not check for both -ENOMEM and -ENOSPC
and recycle the mappings in both cases?

I think that would work on both old and new kernels.
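
Something like this is what I have in mind (untested, just to illustrate:
the same hunk as quoted below, with only the condition changed):

    -if (r == -ENOMEM && retry) {
    +if ((r == -ENOMEM || r == -ENOSPC) && retry) {
         retry = false;
         trace_nvme_dma_flush_queue_wait(s);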

What do you think?

Best regards,
Maxim Levitsky

> 
> Should I check uname(2)'s utsname.release[]? Is it reliable?
> 
> > The block driver started to mis-behave:
> > 
> >   qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
> >   (qemu)
> >   (qemu) info status
> >   VM status: paused (io-error)
> >   (qemu) c
> >   VFIO_MAP_DMA failed: No space left on device
> >   qemu-system-x86_64: block/block-backend.c:1968: blk_get_aio_context: 
> > Assertion `ctx == blk->ctx' failed.
> > 
> > Fix by handling the -ENOSPC error when DMA mappings are exhausted;
> > other errors (such -ENOMEM) are still handled later in the same
> > function.
> > 
> > An easy way to reproduce this bug is to restrict the DMA mapping
> > limit (65535 by default) when loading the VFIO IOMMU module:
> > 
> >   # modprobe vfio_iommu_type1 dma_entry_limit=666
> > 
> > Cc: qemu-sta...@nongnu.org
> > Reported-by: Michal Prívozník 
> > Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
> > Buglink: https://bugs.launchpad.net/qemu/+bug/186
> > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/65
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> > Michal, is it still possible for you to test this (old bug)?
> > 
> > A functional test using viommu & nested VM is planned (suggested by
> > Stefan and Maxim).
> > ---
> >  block/nvme.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index 2b5421e7aa6..12f9dd5cce3 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -1030,7 +1030,7 @@ try_map:
> >  r = qemu_vfio_dma_map(s->vfio,
> >qiov->iov[i].iov_base,
> >len, true, &iova);
> > -if (r == -ENOMEM && retry) {
> > +if (r == -ENOSPC && retry) {
> >  retry = false;
> >  trace_nvme_dma_flush_queue_wait(s);
> >  if (s->dma_map_count) {
> > 





Re: [PATCH v4 3/7] block: add max_hw_transfer to BlockLimits

2021-06-09 Thread Maxim Levitsky
On Tue, 2021-06-08 at 15:16 +0200, Paolo Bonzini wrote:
> For block host devices, I/O can happen through either the kernel file
> descriptor I/O system calls (preadv/pwritev, io_submit, io_uring)
> or the SCSI passthrough ioctl SG_IO.
> 
> In the latter case, the size of each transfer can be limited by the
> HBA, while for file descriptor I/O the kernel is able to split and
> merge I/O in smaller pieces as needed.  Applying the HBA limits to
> file descriptor I/O results in more system calls and suboptimal
> performance, so this patch splits the max_transfer limit in two:
> max_transfer remains valid and is used in general, while max_hw_transfer
> is limited to the maximum hardware size.  max_hw_transfer can then be
> included by the scsi-generic driver in the block limits page, to ensure
> that the stricter hardware limit is used.
> 
> Signed-off-by: Paolo Bonzini 

This is mostly the same as my patch 

https://www.mail-archive.com/qemu-devel@nongnu.org/msg768264.html

I called this max_ioctl_transfer, since this limit is only relevant
to the .ioctl, but max_hw_transfer is fine as well.

So this patch looks OK, but I might have missed something
as I haven't touched this area for a long time.

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky


> ---
>  block/block-backend.c  | 12 
>  block/file-posix.c |  2 +-
>  block/io.c |  1 +
>  hw/scsi/scsi-generic.c |  2 +-
>  include/block/block_int.h  |  7 +++
>  include/sysemu/block-backend.h |  1 +
>  6 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 15f1ea4288..2ea1412a54 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1953,6 +1953,18 @@ uint32_t blk_get_request_alignment(BlockBackend *blk)
>  return bs ? bs->bl.request_alignment : BDRV_SECTOR_SIZE;
>  }
>  
> +/* Returns the maximum hardware transfer length, in bytes; guaranteed 
> nonzero */
> +uint64_t blk_get_max_hw_transfer(BlockBackend *blk)
> +{
> +BlockDriverState *bs = blk_bs(blk);
> +uint64_t max = INT_MAX;
> +
> +if (bs) {
> +max = MIN_NON_ZERO(bs->bl.max_hw_transfer, bs->bl.max_transfer);
> +}
> +return max;
> +}
> +
>  /* Returns the maximum transfer length, in bytes; guaranteed nonzero */
>  uint32_t blk_get_max_transfer(BlockBackend *blk)
>  {
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 670c577bfe..c9746d3eb6 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1234,7 +1234,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  int ret = sg_get_max_transfer_length(s->fd);
>  
>  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> -bs->bl.max_transfer = pow2floor(ret);
> +bs->bl.max_hw_transfer = pow2floor(ret);
>  }
>  
>  ret = sg_get_max_segments(s->fd);
> diff --git a/block/io.c b/block/io.c
> index 323854d063..089b99bb0c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -127,6 +127,7 @@ static void bdrv_merge_limits(BlockLimits *dst, const 
> BlockLimits *src)
>  {
>  dst->opt_transfer = MAX(dst->opt_transfer, src->opt_transfer);
>  dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer);
> +dst->max_hw_transfer = MIN_NON_ZERO(dst->max_hw_transfer, 
> src->max_hw_transfer);
>  dst->opt_mem_alignment = MAX(dst->opt_mem_alignment,
>   src->opt_mem_alignment);
>  dst->min_mem_alignment = MAX(dst->min_mem_alignment,
> diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> index 82e1e2ee79..3762dce749 100644
> --- a/hw/scsi/scsi-generic.c
> +++ b/hw/scsi/scsi-generic.c
> @@ -179,7 +179,7 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, 
> SCSIDevice *s)
>  (r->req.cmd.buf[1] & 0x01)) {
>  page = r->req.cmd.buf[2];
>  if (page == 0xb0) {
> -uint32_t max_transfer = blk_get_max_transfer(s->conf.blk);
> +uint64_t max_transfer = blk_get_max_hw_transfer(s->conf.blk);
>  uint32_t max_iov = blk_get_max_iov(s->conf.blk);
>  
>  assert(max_transfer);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 057d88b1fc..f1a54db0f8 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -695,6 +695,13 @@ typedef struct BlockLimits {
>   * clamped down. */
>  uint32_t max_transfer;
>  
> +/* Maximal hardware transfer length in bytes.  Applies whenever
> + * transfers to the device b

Re: [PATCH v4 2/7] scsi-generic: pass max_segments via max_iov field in BlockLimits

2021-06-09 Thread Maxim Levitsky
On Tue, 2021-06-08 at 15:16 +0200, Paolo Bonzini wrote:
> I/O to a disk via read/write is not limited by the number of segments allowed
> by the host adapter; the kernel can split requests if needed, and the limit
> imposed by the host adapter can be very low (256k or so) to avoid that SG_IO
> returns EINVAL if memory is heavily fragmented.
> 
> Since this value is only interesting for SG_IO-based I/O, do not include
> it in the max_transfer and only take it into account when patching the
> block limits VPD page in the scsi-generic device.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/file-posix.c | 3 +--
>  hw/scsi/scsi-generic.c | 6 --
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 536998a1d6..670c577bfe 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1239,8 +1239,7 @@ static void raw_refresh_limits(BlockDriverState *bs, 
> Error **errp)
>  
>  ret = sg_get_max_segments(s->fd);
>  if (ret > 0) {
> -bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> -  ret * qemu_real_host_page_size);
> +bs->bl.max_iov = ret;

Actually I think that both the max transfer size and the max segment count
are only relevant for SCSI passthrough, since the kernel (I think) emulates
both for regular I/O, so we shouldn't expose them to qemu at all.

In my version of the patches I removed both the bl.max_transfer and the
bl.max_iov setup from the file-posix driver and replaced it with
bs->bl.max_ioctl_transfer (you call it max_hw_transfer).

In my version, bl.max_ioctl_transfer is a merged limit of the max transfer
size and the max iovec number.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg768264.html
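
Roughly, the merge boils down to something like this (a simplified, untested
sketch; sg_get_max_transfer_length()/sg_get_max_segments() are the existing
helpers in block/file-posix.c, and the field name is the one I used in my
series):

    int max_bytes = sg_get_max_transfer_length(s->fd);
    int max_segments = sg_get_max_segments(s->fd);

    /* byte limit reported by the device/driver */
    if (max_bytes > 0 && max_bytes <= BDRV_REQUEST_MAX_BYTES) {
        bs->bl.max_ioctl_transfer = pow2floor(max_bytes);
    }
    /* further clamp by how much the sg driver can map in one SG_IO */
    if (max_segments > 0) {
        bs->bl.max_ioctl_transfer = MIN_NON_ZERO(bs->bl.max_ioctl_transfer,
                                                 max_segments *
                                                 qemu_real_host_page_size);
    }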


Best regards,
Maxim Levitsky


>  }
>  }
>  
> diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
> index 98c30c5d5c..82e1e2ee79 100644
> --- a/hw/scsi/scsi-generic.c
> +++ b/hw/scsi/scsi-generic.c
> @@ -179,10 +179,12 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq 
> *r, SCSIDevice *s)
>  (r->req.cmd.buf[1] & 0x01)) {
>  page = r->req.cmd.buf[2];
>  if (page == 0xb0) {
> -uint32_t max_transfer =
> -blk_get_max_transfer(s->conf.blk) / s->blocksize;
> +uint32_t max_transfer = blk_get_max_transfer(s->conf.blk);
> +uint32_t max_iov = blk_get_max_iov(s->conf.blk);
>  
>  assert(max_transfer);
> +max_transfer = MIN_NON_ZERO(max_transfer, max_iov * 
> qemu_real_host_page_size)
> +/ s->blocksize;
>  stl_be_p(&r->buf[8], max_transfer);
>  /* Also take care of the opt xfer len. */
>  stl_be_p(&r->buf[12],







Re: [PATCH v4 1/7] file-posix: fix max_iov for /dev/sg devices

2021-06-09 Thread Maxim Levitsky
On Tue, 2021-06-08 at 22:14 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 08.06.2021 16:16, Paolo Bonzini wrote:
> > Even though it was only called for devices that have bs->sg set (which
> > must be character devices), sg_get_max_segments looked at /sys/dev/block
> > which only works for block devices.
> > 
> > On Linux the sg driver has its own way to provide the maximum number of
> > iovecs in a scatter/gather list, so add support for it.  The block device
> > path is kept because it will be reinstated in the next patches.
> > 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >   block/file-posix.c | 11 +++
> >   1 file changed, 11 insertions(+)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index f37dfc10b3..536998a1d6 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1180,6 +1180,17 @@ static int sg_get_max_segments(int fd)
> >   goto out;
> >   }
> >   
> > +if (S_ISCHR(st.st_mode)) {
> 
> Why not check "if (bs->sg) {" instead? It seems to be more consistent with 
> issuing SG_ ioctl. Or what I miss?

I also think so. Actually, 'hdev_is_sg' has a check for a character device as
well, in addition to a few more checks that make sure we are really
dealing with the quirky /dev/sg character device.

> 
> > +if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
> > +return ret;
> > +}
> > +return -ENOTSUP;
> > +}
> > +
> > +if (!S_ISBLK(st.st_mode)) {
> > +return -ENOTSUP;
> > +}
> > +
> >   sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> >   major(st.st_rdev), minor(st.st_rdev));
> >   sysfd = open(sysfspath, O_RDONLY);
> > 
> 
> 

Other than that, this is the same as the patch from Tom Yan:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg768262.html

In this version he does check whether SG_GET_SG_TABLESIZE is defined, so
you might want to do this as well.
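
I.e. guard the new ioctl, roughly like this (the same code as in the hunk
above, just wrapped in the #ifdef):

    if (S_ISCHR(st.st_mode)) {
    #ifdef SG_GET_SG_TABLESIZE
        if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
            return ret;
        }
    #endif
        return -ENOTSUP;
    }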


Best regards,
Maxim Levitsky






Re: [PATCH v4 4/7] file-posix: try BLKSECTGET on block devices too, do not round to power of 2

2021-06-09 Thread Maxim Levitsky
On Tue, 2021-06-08 at 15:16 +0200, Paolo Bonzini wrote:
> bs->sg is only true for character devices, but block devices can also
> be used with scsi-block and scsi-generic.  Unfortunately BLKSECTGET
> returns bytes in an int for /dev/sgN devices, and sectors in a short
> for block devices, so account for that in the code.
> 
> The maximum transfer also need not be a power of 2 (for example I have
> seen disks with 1280 KiB maximum transfer) so there's no need to pass
> the result through pow2floor.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block/file-posix.c | 44 
>  1 file changed, 24 insertions(+), 20 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index c9746d3eb6..1439293f63 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -1149,22 +1149,27 @@ static void raw_reopen_abort(BDRVReopenState *state)
>  s->reopen_state = NULL;
>  }
>  
> -static int sg_get_max_transfer_length(int fd)
> +static int hdev_get_max_hw_transfer(int fd, struct stat *st)
>  {
>  #ifdef BLKSECTGET
> -int max_bytes = 0;
> -
> -if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
> -return max_bytes;
> +if (S_ISBLK(st->st_mode)) {
> +unsigned short max_sectors = 0;
> +if (ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
> +return max_sectors * 512;
> +}
>  } else {
> -return -errno;
> +int max_bytes = 0;
> +if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {

Again I would use bs->sg for that (see the sketch further down in this mail).

> +return max_bytes;
> +}
>  }
> +return -errno;
>  #else
>  return -ENOSYS;
>  #endif
>  }
>  
> -static int sg_get_max_segments(int fd)
> +static int hdev_get_max_segments(int fd, struct stat *st)
>  {
>  #ifdef CONFIG_LINUX
>  char buf[32];
> @@ -1173,26 +1178,20 @@ static int sg_get_max_segments(int fd)
>  int ret;
>  int sysfd = -1;
>  long max_segments;
> -struct stat st;
>  
> -if (fstat(fd, &st)) {
> -ret = -errno;
> -goto out;
> -}
> -
> -if (S_ISCHR(st.st_mode)) {
> +if (S_ISCHR(st->st_mode)) {
>  if (ioctl(fd, SG_GET_SG_TABLESIZE, &ret) == 0) {
>  return ret;
>  }
>  return -ENOTSUP;
>  }
>  
> -if (!S_ISBLK(st.st_mode)) {
> +if (!S_ISBLK(st->st_mode)) {
>  return -ENOTSUP;
>  }
>  
>  sysfspath = g_strdup_printf("/sys/dev/block/%u:%u/queue/max_segments",
> -major(st.st_rdev), minor(st.st_rdev));
> +major(st->st_rdev), minor(st->st_rdev));
>  sysfd = open(sysfspath, O_RDONLY);
>  if (sysfd == -1) {
>  ret = -errno;
> @@ -1229,15 +1228,20 @@ out:
>  static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRawState *s = bs->opaque;
> +struct stat st;
> +
> +if (fstat(s->fd, &st)) {
> +return;
> +}
>  
> -if (bs->sg) {
> -int ret = sg_get_max_transfer_length(s->fd);
> +if (bs->sg || S_ISBLK(st.st_mode)) {
> +int ret = hdev_get_max_hw_transfer(s->fd, &st);
>  
>  if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> -bs->bl.max_hw_transfer = pow2floor(ret);
> +bs->bl.max_hw_transfer = ret;
>  }
>  
> -ret = sg_get_max_segments(s->fd);
> +ret = hdev_get_max_segments(s->fd, &st);
>  if (ret > 0) {
>  bs->bl.max_iov = ret;
>  }


Roughly speaking this looks correct, but I might have missed something as well.

This is roughly the same as patches from Tom Yan which I carried in my series

https://www.mail-archive.com/qemu-devel@nongnu.org/msg768258.html
https://www.mail-archive.com/qemu-devel@nongnu.org/msg768262.html


I slightly prefer how he created separate functions for /dev/sg and for all
other block devices.
Please take a look.
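
Regarding my bs->sg comment above, I mean roughly the following (sketch only,
untested; it assumes the helper can see bs, or at least the bs->sg flag,
instead of deciding based on S_ISBLK):

    static int hdev_get_max_hw_transfer(BlockDriverState *bs, int fd)
    {
    #ifdef BLKSECTGET
        if (bs->sg) {
            /* the sg driver reports the limit in bytes */
            int max_bytes = 0;
            if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
                return max_bytes;
            }
        } else {
            /* regular block devices report it in 512-byte sectors */
            unsigned short max_sectors = 0;
            if (ioctl(fd, BLKSECTGET, &max_sectors) == 0) {
                return max_sectors * 512;
            }
        }
        return -errno;
    #else
        return -ENOSYS;
    #endif
    }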

Also, not related to this patch: you are missing the fix I did to the VPD
limit emulation; please consider taking it into the series:

https://www.mail-archive.com/qemu-devel@nongnu.org/msg768260.html


Best regards,
Maxim Levitsky






Re: [PATCH v3 0/7] block: file-posix queue

2021-06-07 Thread Maxim Levitsky
On Thu, 2021-06-03 at 15:37 +0200, Paolo Bonzini wrote:
> Hi Kevin,
> 
> this is a combination of two series that both affect host block device
> support in block/file-posix.c.  Joelle's series is unchanged, while
> mine was adjusted according to your review of v2.
> 
> v1->v2: add missing patch
> 
> v2->v3: add max_hw_transfer to BlockLimits
> 
> 
> Joelle van Dyne (3):
>   block: feature detection for host block support
>   block: check for sys/disk.h
>   block: detect DKIOCGETBLOCKCOUNT/SIZE before use
> 
> Paolo Bonzini (4):
>   file-posix: fix max_iov for /dev/sg devices
>   scsi-generic: pass max_segments via max_iov field in BlockLimits
>   block: add max_hw_transfer to BlockLimits
>   file-posix: try BLKSECTGET on block devices too, do not round to power
> of 2
> 
>  block.c|   2 +-
>  block/block-backend.c  |  12 
>  block/file-posix.c | 104 -
>  block/io.c |   1 +
>  hw/scsi/scsi-generic.c |   6 +-
>  include/block/block_int.h  |   7 +++
>  include/sysemu/block-backend.h |   1 +
>  meson.build|   7 ++-
>  qapi/block-core.json   |  10 +++-
>  9 files changed, 102 insertions(+), 48 deletions(-)
> 
Hi Paolo and everyone!

I used to have a patch series that aimed to fix the block limits of
scsi-block, which I think is similar to this patch series.

Sorry that I kind of forgot about it for so long.

https://www.mail-archive.com/qemu-devel@nongnu.org/msg768261.html

I'll need some time to swap this area back in so that I can compare our
patches and see if we missed something.
 
Best regards,
Maxim Levitsky
 




Re: [PATCH 2/2] gdbstub: implement NOIRQ support for single step on KVM, when kvm's KVM_GUESTDBG_BLOCKIRQ debug flag is supported.

2021-05-05 Thread Maxim Levitsky
On Mon, 2021-04-19 at 17:29 +0100, Alex Bennée wrote:
> Maxim Levitsky  writes:
> 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  accel/kvm/kvm-all.c  | 25 +++
> >  gdbstub.c| 59 
> >  include/sysemu/kvm.h | 13 ++
> >  3 files changed, 87 insertions(+), 10 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index b6d9f92f15..bc7955fb19 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -147,6 +147,8 @@ bool kvm_vm_attributes_allowed;
> >  bool kvm_direct_msi_allowed;
> >  bool kvm_ioeventfd_any_length_allowed;
> >  bool kvm_msi_use_devid;
> > +bool kvm_has_guest_debug;
> > +int kvm_sstep_flags;
> >  static bool kvm_immediate_exit;
> >  static hwaddr kvm_max_slot_size = ~0;
> >  
> > @@ -2186,6 +2188,25 @@ static int kvm_init(MachineState *ms)
> >  kvm_ioeventfd_any_length_allowed =
> >  (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
> >  
> > +kvm_has_guest_debug =
> > +(kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
> > +
> > +kvm_sstep_flags = 0;
> > +
> > +if (kvm_has_guest_debug) {
> > +/* Assume that single stepping is supported */
> > +kvm_sstep_flags = SSTEP_ENABLE;
> > +
> > +int guest_debug_flags =
> > +kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);
> > +
> > +if (guest_debug_flags > 0) {
> > +if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
> > +kvm_sstep_flags |= SSTEP_NOIRQ;
> > +}
> > +}
> > +}
> > +
> >  kvm_state = s;
> >  
> >  ret = kvm_arch_init(ms, s);
> > @@ -2796,6 +2817,10 @@ int kvm_update_guest_debug(CPUState *cpu, unsigned 
> > long reinject_trap)
> >  
> >  if (cpu->singlestep_enabled) {
> >  data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
> > +
> > +if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
> > +data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
> > +}
> >  }
> >  kvm_arch_update_guest_debug(cpu, &data.dbg);
> >  
> > diff --git a/gdbstub.c b/gdbstub.c
> > index 054665e93e..f789ded99d 100644
> > --- a/gdbstub.c
> > +++ b/gdbstub.c
> > @@ -369,12 +369,11 @@ typedef struct GDBState {
> >  gdb_syscall_complete_cb current_syscall_cb;
> >  GString *str_buf;
> >  GByteArray *mem_buf;
> > +int sstep_flags;
> > +int supported_sstep_flags;
> >  } GDBState;
> >  
> > -/* By default use no IRQs and no timers while single stepping so as to
> > - * make single stepping like an ICE HW step.
> > - */
> > -static int sstep_flags = SSTEP_ENABLE|SSTEP_NOIRQ|SSTEP_NOTIMER;
> > +static GDBState gdbserver_state;
> >  
> >  /* Retrieves flags for single step mode. */
> >  static int get_sstep_flags(void)
> > @@ -386,11 +385,10 @@ static int get_sstep_flags(void)
> >  if (replay_mode != REPLAY_MODE_NONE) {
> >  return SSTEP_ENABLE;
> >  } else {
> > -return sstep_flags;
> > +return gdbserver_state.sstep_flags;
> >  }
> >  }
> >  
> > -static GDBState gdbserver_state;
> >  
> >  static void init_gdbserver_state(void)
> >  {
> > @@ -400,6 +398,23 @@ static void init_gdbserver_state(void)
> >  gdbserver_state.str_buf = g_string_new(NULL);
> >  gdbserver_state.mem_buf = g_byte_array_sized_new(MAX_PACKET_LENGTH);
> >  gdbserver_state.last_packet = g_byte_array_sized_new(MAX_PACKET_LENGTH 
> > + 4);
> > +
> > +
> > +if (kvm_enabled()) {
> > +gdbserver_state.supported_sstep_flags =
> >  kvm_get_supported_sstep_flags();
> 
> This falls over as soon as you build something without KVM support (like
> a TCG only build or an emulation only target):

This is something I'll check from now on before sending patches.
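
One way to avoid the implicit declaration on builds without KVM would be to
declare kvm_get_supported_sstep_flags() in include/sysemu/kvm.h and give it a
trivial stub when KVM is not possible; just a sketch of the idea, not
necessarily the final shape of the fix:

    /* include/sysemu/kvm.h, sketch */
    #ifdef CONFIG_KVM_IS_POSSIBLE
    int kvm_get_supported_sstep_flags(void);
    #else
    static inline int kvm_get_supported_sstep_flags(void)
    {
        /* never reached: gdbstub only calls this under kvm_enabled() */
        return 0;
    }
    #endif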


> 
>   [10/1152] Compiling C object libqemu-riscv32-softmmu.fa.p/gdbstub.c.o
>   FAILED: libqemu-riscv32-softmmu.fa.p/gdbstub.c.o 
>   cc -Ilibqemu-riscv32-softmmu.fa.p -I. -I../.. -Itarget/riscv 
> -I../../target/riscv -Idtc/libfdt -I../../dtc/libfdt 
> -I../../capstone/include/capstone -Iqapi -Itrace -Iui -Iui/shader 
> -I/usr/include/pixman-1 -I/usr/include/libdrm -I/usr/include/spice-server 
> -I/usr/include/spice-1 -I/usr/include/glib-2.0 
> -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -fdiagnostics-

Re: [PATCH 1/2] kvm: update kernel headers for KVM_GUESTDBG_BLOCKEVENTS

2021-05-05 Thread Maxim Levitsky
On Mon, 2021-04-19 at 17:22 +0100, Alex Bennée wrote:
> Maxim Levitsky  writes:
> 
> > Signed-off-by: Maxim Levitsky 
> 
> Generally it's a good idea to reference where these are coming from, is
> it a current kernel patch in flight or from an release we haven't synced
> up to yet?
Hi!

As Paolo explained to me, qemu syncs the kernel headers every once in a while;
so when I submit a feature to qemu that uses a new KVM feature, I should still
submit a patch to add it to the kernel headers, but that patch is only for
reference.

In this particular case, I first updated qemu's kernel headers to match the
kvm/queue branch, then added my feature to the kernel, and updated the qemu
kernel headers again. This patch is the diff between the first and second
update, to make it more readable.

Best regards,
Maxim Levitsky

> 
> Usually linux header updates are done with semi-regular runs on
> ./scripts/update-linux-headers.sh but obviously it's OK to include
> standalone patches during the review process.
> 
> > ---
> >  linux-headers/asm-x86/kvm.h | 2 ++
> >  linux-headers/linux/kvm.h   | 1 +
> >  2 files changed, 3 insertions(+)
> > 
> > diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
> > index 8e76d3701d..33878cdc34 100644
> > --- a/linux-headers/asm-x86/kvm.h
> > +++ b/linux-headers/asm-x86/kvm.h
> > @@ -281,6 +281,8 @@ struct kvm_debug_exit_arch {
> >  #define KVM_GUESTDBG_USE_HW_BP 0x0002
> >  #define KVM_GUESTDBG_INJECT_DB 0x0004
> >  #define KVM_GUESTDBG_INJECT_BP 0x0008
> > +#define KVM_GUESTDBG_BLOCKIRQ  0x0010
> > +
> >  
> >  /* for KVM_SET_GUEST_DEBUG */
> >  struct kvm_guest_debug_arch {
> > diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> > index 020b62a619..2ded7a0630 100644
> > --- a/linux-headers/linux/kvm.h
> > +++ b/linux-headers/linux/kvm.h
> > @@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
> >  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
> >  #define KVM_CAP_SYS_HYPERV_CPUID 191
> >  #define KVM_CAP_DIRTY_LOG_RING 192
> > +#define KVM_CAP_SET_GUEST_DEBUG2 195
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> 
> 





[PATCH v2 2/2] KVM: use KVM_{GET|SET}_SREGS2 when supported.

2021-04-26 Thread Maxim Levitsky
This allows making the PDPTRs part of the migration
stream, and thus avoids reloading them after migration,
which is against the x86 spec.

Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c   |   5 ++
 include/sysemu/kvm.h  |   4 ++
 target/i386/cpu.h |   3 ++
 target/i386/kvm/kvm.c | 107 +-
 target/i386/machine.c |  30 
 5 files changed, 147 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b6d9f92f15..0397b3cb2b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -142,6 +142,7 @@ bool kvm_msi_via_irqfd_allowed;
 bool kvm_gsi_routing_allowed;
 bool kvm_gsi_direct_mapping;
 bool kvm_allowed;
+bool kvm_sregs2;
 bool kvm_readonly_mem_allowed;
 bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
@@ -2186,6 +2187,10 @@ static int kvm_init(MachineState *ms)
 kvm_ioeventfd_any_length_allowed =
 (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
 
+
+kvm_sregs2 =
+(kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..b3d4538c55 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -32,6 +32,7 @@
 #ifdef CONFIG_KVM_IS_POSSIBLE
 
 extern bool kvm_allowed;
+extern bool kvm_sregs2;
 extern bool kvm_kernel_irqchip;
 extern bool kvm_split_irqchip;
 extern bool kvm_async_interrupts_allowed;
@@ -139,6 +140,9 @@ extern bool kvm_msi_use_devid;
  */
 #define kvm_gsi_direct_mapping() (kvm_gsi_direct_mapping)
 
+
+#define kvm_supports_sregs2() (kvm_sregs2)
+
 /**
  * kvm_readonly_mem_enabled:
  *
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 570f916878..ac877097d4 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1422,6 +1422,9 @@ typedef struct CPUX86State {
 SegmentCache idt; /* only base and limit are used */
 
 target_ulong cr[5]; /* NOTE: cr1 is unused */
+
+bool pdptrs_valid;
+uint64_t pdptrs[4];
 int32_t a20_mask;
 
 BNDReg bnd_regs[4];
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7fe9f52710..93570706e1 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2514,6 +2514,61 @@ static int kvm_put_sregs(X86CPU *cpu)
 return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS, &sregs);
 }
 
+static int kvm_put_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i;
+
+sregs.flags = 0;
+
+if ((env->eflags & VM_MASK)) {
+set_v8086_seg(&sregs.cs, &env->segs[R_CS]);
+set_v8086_seg(&sregs.ds, &env->segs[R_DS]);
+set_v8086_seg(&sregs.es, &env->segs[R_ES]);
+set_v8086_seg(&sregs.fs, &env->segs[R_FS]);
+set_v8086_seg(&sregs.gs, &env->segs[R_GS]);
+set_v8086_seg(&sregs.ss, &env->segs[R_SS]);
+} else {
+set_seg(&sregs.cs, &env->segs[R_CS]);
+set_seg(&sregs.ds, &env->segs[R_DS]);
+set_seg(&sregs.es, &env->segs[R_ES]);
+set_seg(&sregs.fs, &env->segs[R_FS]);
+set_seg(&sregs.gs, &env->segs[R_GS]);
+set_seg(&sregs.ss, &env->segs[R_SS]);
+}
+
+set_seg(&sregs.tr, &env->tr);
+set_seg(&sregs.ldt, &env->ldt);
+
+sregs.idt.limit = env->idt.limit;
+sregs.idt.base = env->idt.base;
+memset(sregs.idt.padding, 0, sizeof sregs.idt.padding);
+sregs.gdt.limit = env->gdt.limit;
+sregs.gdt.base = env->gdt.base;
+memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding);
+
+sregs.cr0 = env->cr[0];
+sregs.cr2 = env->cr[2];
+sregs.cr3 = env->cr[3];
+sregs.cr4 = env->cr[4];
+
+sregs.cr8 = cpu_get_apic_tpr(cpu->apic_state);
+sregs.apic_base = cpu_get_apic_base(cpu->apic_state);
+
+sregs.efer = env->efer;
+
+if (env->pdptrs_valid) {
+for (i = 0; i < 4; i++) {
+sregs.pdptrs[i] = env->pdptrs[i];
+}
+sregs.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
+}
+
+return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS2, &sregs);
+}
+
+
 static void kvm_msr_buf_reset(X86CPU *cpu)
 {
 memset(cpu->kvm_msr_buf, 0, MSR_BUF_SIZE);
@@ -3175,6 +3230,53 @@ static int kvm_get_sregs(X86CPU *cpu)
 return 0;
 }
 
+static int kvm_get_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i, ret;
+
+ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs);
+if (ret < 0) {
+return ret;
+}
+
+get_seg(&env->segs[R_CS], &sregs.cs);
+get_seg(&env->segs[R_DS], &sregs.ds);
+get_seg(&env->segs[R_ES], &sregs.es);
+get_seg(&env->segs[R_FS], &sregs.fs);
+get_seg(&env->segs[R_GS], &sregs.gs);
+get_seg(&env->segs[R_SS], &sreg

[PATCH v2 1/2] kvm: update kernel headers for KVM_{GET|SET}_SREGS2

2021-04-26 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 linux-headers/asm-x86/kvm.h | 13 +
 linux-headers/linux/kvm.h   |  5 +
 2 files changed, 18 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 8e76d3701d..d61dc76e24 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -158,6 +158,19 @@ struct kvm_sregs {
__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
 };
 
+struct kvm_sregs2 {
+   /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+   struct kvm_segment cs, ds, es, fs, gs, ss;
+   struct kvm_segment tr, ldt;
+   struct kvm_dtable gdt, idt;
+   __u64 cr0, cr2, cr3, cr4, cr8;
+   __u64 efer;
+   __u64 apic_base;
+   __u64 flags;
+   __u64 pdptrs[4];
+};
+#define KVM_SREGS2_FLAGS_PDPTRS_VALID 1
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
__u8  fpr[8][16];
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 020b62a619..22ea29a243 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_SREGS2 199
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1563,6 +1564,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS  _IO(KVMIO, 0xc7)
 
+
+#define KVM_GET_SREGS2 _IOR(KVMIO,  0xcc, struct kvm_sregs2)
+#define KVM_SET_SREGS2 _IOW(KVMIO,  0xcd, struct kvm_sregs2)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
/* Guest initialization commands */
-- 
2.26.2




[PATCH v2 0/2] kvm: use KVM_{GET|SET}_SREGS2 when available

2021-04-26 Thread Maxim Levitsky
This implements support for using these new
IOCTLs to migrate the PDPTRs.

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  kvm: update kernel headers for KVM_{GET|SET}_SREGS2
  KVM: use KVM_{GET|SET}_SREGS2 when supported.

 accel/kvm/kvm-all.c |   5 ++
 include/sysemu/kvm.h|   4 ++
 linux-headers/asm-x86/kvm.h |  13 +
 linux-headers/linux/kvm.h   |   5 ++
 target/i386/cpu.h   |   3 +
 target/i386/kvm/kvm.c   | 107 +++-
 target/i386/machine.c   |  30 ++
 7 files changed, 165 insertions(+), 2 deletions(-)

-- 
2.26.2





Re: [PATCH 2/2] KVM: use KVM_{GET|SET}_SREGS2 when supported by kvm.

2021-04-01 Thread Maxim Levitsky
On Thu, 2021-04-01 at 18:09 +0200, Paolo Bonzini wrote:
> On 01/04/21 16:45, Maxim Levitsky wrote:
> > +
> > +for (i = 0; i < 4; i++) {
> > +sregs.pdptrs[i] = env->pdptrs[i];
> > +}
> > +
> > +sregs.flags = 0;
> > +sregs.padding = 0;
> > +
> > +return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS2, &sregs);
> > +}
> > +
> 
> This breaks when migrating from old to new kernel, because in that case 
> the PDPTRs are not initialized.

True, I haven't thought about it!
I'll fix this in the next version.
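
The fix will likely boil down to only passing the PDPTRs when they were
actually read back from the kernel, i.e. something along these lines
(assuming the kernel side also grows a validity flag for them):

    if (env->pdptrs_valid) {
        for (i = 0; i < 4; i++) {
            sregs.pdptrs[i] = env->pdptrs[i];
        }
        sregs.flags |= KVM_SREGS2_FLAGS_PDPTRS_VALID;
    }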

Best regards,
Maxim Levitsky


> 
> Paolo
> 





[PATCH 0/2] kvm: use KVM_{GET|SET}_SREGS2 when available

2021-04-01 Thread Maxim Levitsky
clone of "starship_unstable"

Maxim Levitsky (2):
  kvm: update kernel headers for KVM_{GET|SET}_SREGS2
  KVM: use KVM_{GET|SET}_SREGS2 when supported by kvm.

 accel/kvm/kvm-all.c |   4 ++
 include/sysemu/kvm.h|   4 ++
 linux-headers/asm-x86/kvm.h |  13 +
 linux-headers/linux/kvm.h   |   5 ++
 target/i386/cpu.h   |   1 +
 target/i386/kvm/kvm.c   | 101 +++-
 target/i386/machine.c   |  33 
 7 files changed, 159 insertions(+), 2 deletions(-)

-- 
2.26.2





[PATCH 2/2] KVM: use KVM_{GET|SET}_SREGS2 when supported by kvm.

2021-04-01 Thread Maxim Levitsky
This allows qemu to make the PDPTRs part of the migration
stream, and thus avoid reloading them after a migration,
which is against the x86 spec.

Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c   |   4 ++
 include/sysemu/kvm.h  |   4 ++
 target/i386/cpu.h |   1 +
 target/i386/kvm/kvm.c | 101 +-
 target/i386/machine.c |  33 ++
 5 files changed, 141 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b6d9f92f15..082b791b01 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -142,6 +142,7 @@ bool kvm_msi_via_irqfd_allowed;
 bool kvm_gsi_routing_allowed;
 bool kvm_gsi_direct_mapping;
 bool kvm_allowed;
+bool kvm_sregs2;
 bool kvm_readonly_mem_allowed;
 bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
@@ -2186,6 +2187,9 @@ static int kvm_init(MachineState *ms)
 kvm_ioeventfd_any_length_allowed =
 (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
 
+kvm_sregs2 =
+(kvm_check_extension(s, KVM_CAP_SREGS2) > 0);
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index a1ab1ee12d..b3d4538c55 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -32,6 +32,7 @@
 #ifdef CONFIG_KVM_IS_POSSIBLE
 
 extern bool kvm_allowed;
+extern bool kvm_sregs2;
 extern bool kvm_kernel_irqchip;
 extern bool kvm_split_irqchip;
 extern bool kvm_async_interrupts_allowed;
@@ -139,6 +140,9 @@ extern bool kvm_msi_use_devid;
  */
 #define kvm_gsi_direct_mapping() (kvm_gsi_direct_mapping)
 
+
+#define kvm_supports_sregs2() (kvm_sregs2)
+
 /**
  * kvm_readonly_mem_enabled:
  *
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 570f916878..4595d47409 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1422,6 +1422,7 @@ typedef struct CPUX86State {
 SegmentCache idt; /* only base and limit are used */
 
 target_ulong cr[5]; /* NOTE: cr1 is unused */
+uint64_t pdptrs[4];
 int32_t a20_mask;
 
 BNDReg bnd_regs[4];
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7fe9f52710..71769f82ae 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2514,6 +2514,59 @@ static int kvm_put_sregs(X86CPU *cpu)
 return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS, &sregs);
 }
 
+static int kvm_put_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i;
+
+if ((env->eflags & VM_MASK)) {
+set_v8086_seg(&sregs.cs, &env->segs[R_CS]);
+set_v8086_seg(&sregs.ds, &env->segs[R_DS]);
+set_v8086_seg(&sregs.es, &env->segs[R_ES]);
+set_v8086_seg(&sregs.fs, &env->segs[R_FS]);
+set_v8086_seg(&sregs.gs, &env->segs[R_GS]);
+set_v8086_seg(&sregs.ss, &env->segs[R_SS]);
+} else {
+set_seg(&sregs.cs, &env->segs[R_CS]);
+set_seg(&sregs.ds, &env->segs[R_DS]);
+set_seg(&sregs.es, &env->segs[R_ES]);
+set_seg(&sregs.fs, &env->segs[R_FS]);
+set_seg(&sregs.gs, &env->segs[R_GS]);
+set_seg(&sregs.ss, &env->segs[R_SS]);
+}
+
+set_seg(&sregs.tr, &env->tr);
+set_seg(&sregs.ldt, &env->ldt);
+
+sregs.idt.limit = env->idt.limit;
+sregs.idt.base = env->idt.base;
+memset(sregs.idt.padding, 0, sizeof sregs.idt.padding);
+sregs.gdt.limit = env->gdt.limit;
+sregs.gdt.base = env->gdt.base;
+memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding);
+
+sregs.cr0 = env->cr[0];
+sregs.cr2 = env->cr[2];
+sregs.cr3 = env->cr[3];
+sregs.cr4 = env->cr[4];
+
+sregs.cr8 = cpu_get_apic_tpr(cpu->apic_state);
+sregs.apic_base = cpu_get_apic_base(cpu->apic_state);
+
+sregs.efer = env->efer;
+
+for (i = 0; i < 4; i++) {
+sregs.pdptrs[i] = env->pdptrs[i];
+}
+
+sregs.flags = 0;
+sregs.padding = 0;
+
+return kvm_vcpu_ioctl(CPU(cpu), KVM_SET_SREGS2, &sregs);
+}
+
+
 static void kvm_msr_buf_reset(X86CPU *cpu)
 {
 memset(cpu->kvm_msr_buf, 0, MSR_BUF_SIZE);
@@ -3175,6 +3228,49 @@ static int kvm_get_sregs(X86CPU *cpu)
 return 0;
 }
 
+static int kvm_get_sregs2(X86CPU *cpu)
+{
+CPUX86State *env = &cpu->env;
+struct kvm_sregs2 sregs;
+int i, ret;
+
+ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs);
+if (ret < 0) {
+return ret;
+}
+
+get_seg(&env->segs[R_CS], &sregs.cs);
+get_seg(&env->segs[R_DS], &sregs.ds);
+get_seg(&env->segs[R_ES], &sregs.es);
+get_seg(&env->segs[R_FS], &sregs.fs);
+get_seg(&env->segs[R_GS], &sregs.gs);
+get_seg(&env->segs[R_SS], &sregs.ss);
+
+get_seg(&env->tr, &sregs.tr);
+get_seg(&env->ldt, &sregs.ldt
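
[The remainder of this patch is truncated in the archive; in particular the
target/i386/machine.c hunk that actually adds the PDPTRs to the migration
stream is not shown. Purely as an illustration of the idea described in the
commit message, a vmstate subsection for the new pdptrs array could look
roughly like the sketch below. The subsection name, the needed() condition
and the version numbers are assumptions, not the author's actual code.]

/* Illustrative sketch only -- not the original target/i386/machine.c hunk.
 * Assumes the usual target/i386/machine.c context (vmstate.h, kvm headers). */
static bool pdptrs_needed(void *opaque)
{
    /* Assumption: only send the PDPTRs when the SREGS2 path is in use. */
    return kvm_enabled() && kvm_supports_sregs2();
}

static const VMStateDescription vmstate_pdptrs = {
    .name = "cpu/pdptrs",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = pdptrs_needed,
    .fields = (VMStateField[]) {
        VMSTATE_UINT64_ARRAY(env.pdptrs, X86CPU, 4),
        VMSTATE_END_OF_LIST()
    }
};

/* The subsection would then be listed in vmstate_x86_cpu.subsections. */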

[PATCH 1/2] kvm: update kernel headers for KVM_{GET|SET}_SREGS2

2021-04-01 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 linux-headers/asm-x86/kvm.h | 13 +
 linux-headers/linux/kvm.h   |  5 +
 2 files changed, 18 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 8e76d3701d..8c604e6bb1 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -158,6 +158,19 @@ struct kvm_sregs {
__u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
 };
 
+struct kvm_sregs2 {
+   /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+   struct kvm_segment cs, ds, es, fs, gs, ss;
+   struct kvm_segment tr, ldt;
+   struct kvm_dtable gdt, idt;
+   __u64 cr0, cr2, cr3, cr4, cr8;
+   __u64 efer;
+   __u64 apic_base;
+   __u64 flags; /* must be zero*/
+   __u64 pdptrs[4];
+   __u64 padding;
+};
+
 /* for KVM_GET_FPU and KVM_SET_FPU */
 struct kvm_fpu {
__u8  fpr[8][16];
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 020b62a619..a97f0f2d03 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_SREGS2 196
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1563,6 +1564,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS  _IO(KVMIO, 0xc7)
 
+
+#define KVM_GET_SREGS2 _IOR(KVMIO,  0xca, struct kvm_sregs2)
+#define KVM_SET_SREGS2 _IOW(KVMIO,  0xcb, struct kvm_sregs2)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
/* Guest initialization commands */
-- 
2.26.2
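
[For context, this is how a userspace VMM is expected to consume the new
ioctls, which is essentially what patch 2/2 does inside QEMU. The standalone
sketch below, including the fd names and the minimal error handling, is
illustrative only and assumes the patched headers above are in use.]

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdio.h>

/* kvm_fd is the /dev/kvm (or VM) fd; vcpu_fd comes from KVM_CREATE_VCPU. */
static int dump_pdptrs(int kvm_fd, int vcpu_fd)
{
    struct kvm_sregs2 sregs2;
    int i;

    /* Probe the capability first; older kernels don't know KVM_GET_SREGS2. */
    if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SREGS2) <= 0) {
        return -1;              /* caller should fall back to KVM_GET_SREGS */
    }

    if (ioctl(vcpu_fd, KVM_GET_SREGS2, &sregs2) < 0) {
        return -1;
    }

    for (i = 0; i < 4; i++) {
        printf("PDPTE%d = 0x%llx\n", i, (unsigned long long)sregs2.pdptrs[i]);
    }
    return 0;
}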




[PATCH 0/2] gdbstub: implement support for blocking interrupts on single stepping

2021-04-01 Thread Maxim Levitsky
clone of "starship_unstable"

Maxim Levitsky (2):
  kvm: update kernel headers for KVM_GUESTDBG_BLOCKEVENTS
  gdbstub: implement NOIRQ support for single step on KVM, when kvm's
KVM_GUESTDBG_BLOCKIRQ debug flag is supported.

 accel/kvm/kvm-all.c | 25 
 gdbstub.c   | 59 ++---
 include/sysemu/kvm.h| 13 
 linux-headers/asm-x86/kvm.h |  2 ++
 linux-headers/linux/kvm.h   |  1 +
 5 files changed, 90 insertions(+), 10 deletions(-)

-- 
2.26.2





[PATCH 1/2] kvm: update kernel headers for KVM_GUESTDBG_BLOCKEVENTS

2021-04-01 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 linux-headers/asm-x86/kvm.h | 2 ++
 linux-headers/linux/kvm.h   | 1 +
 2 files changed, 3 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 8e76d3701d..33878cdc34 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -281,6 +281,8 @@ struct kvm_debug_exit_arch {
 #define KVM_GUESTDBG_USE_HW_BP 0x0002
 #define KVM_GUESTDBG_INJECT_DB 0x0004
 #define KVM_GUESTDBG_INJECT_BP 0x0008
+#define KVM_GUESTDBG_BLOCKIRQ  0x0010
+
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 020b62a619..2ded7a0630 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_SET_GUEST_DEBUG2 195
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.26.2




[PATCH 2/2] gdbstub: implement NOIRQ support for single step on KVM, when kvm's KVM_GUESTDBG_BLOCKIRQ debug flag is supported.

2021-04-01 Thread Maxim Levitsky
Signed-off-by: Maxim Levitsky 
---
 accel/kvm/kvm-all.c  | 25 +++
 gdbstub.c| 59 
 include/sysemu/kvm.h | 13 ++
 3 files changed, 87 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index b6d9f92f15..bc7955fb19 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -147,6 +147,8 @@ bool kvm_vm_attributes_allowed;
 bool kvm_direct_msi_allowed;
 bool kvm_ioeventfd_any_length_allowed;
 bool kvm_msi_use_devid;
+bool kvm_has_guest_debug;
+int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static hwaddr kvm_max_slot_size = ~0;
 
@@ -2186,6 +2188,25 @@ static int kvm_init(MachineState *ms)
 kvm_ioeventfd_any_length_allowed =
 (kvm_check_extension(s, KVM_CAP_IOEVENTFD_ANY_LENGTH) > 0);
 
+kvm_has_guest_debug =
+(kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
+
+kvm_sstep_flags = 0;
+
+if (kvm_has_guest_debug) {
+/* Assume that single stepping is supported */
+kvm_sstep_flags = SSTEP_ENABLE;
+
+int guest_debug_flags =
+kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG2);
+
+if (guest_debug_flags > 0) {
+if (guest_debug_flags & KVM_GUESTDBG_BLOCKIRQ) {
+kvm_sstep_flags |= SSTEP_NOIRQ;
+}
+}
+}
+
 kvm_state = s;
 
 ret = kvm_arch_init(ms, s);
@@ -2796,6 +2817,10 @@ int kvm_update_guest_debug(CPUState *cpu, unsigned long 
reinject_trap)
 
 if (cpu->singlestep_enabled) {
 data.dbg.control |= KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;
+
+if (cpu->singlestep_enabled & SSTEP_NOIRQ) {
+data.dbg.control |= KVM_GUESTDBG_BLOCKIRQ;
+}
 }
 kvm_arch_update_guest_debug(cpu, &data.dbg);
 
diff --git a/gdbstub.c b/gdbstub.c
index 054665e93e..f789ded99d 100644
--- a/gdbstub.c
+++ b/gdbstub.c
@@ -369,12 +369,11 @@ typedef struct GDBState {
 gdb_syscall_complete_cb current_syscall_cb;
 GString *str_buf;
 GByteArray *mem_buf;
+int sstep_flags;
+int supported_sstep_flags;
 } GDBState;
 
-/* By default use no IRQs and no timers while single stepping so as to
- * make single stepping like an ICE HW step.
- */
-static int sstep_flags = SSTEP_ENABLE|SSTEP_NOIRQ|SSTEP_NOTIMER;
+static GDBState gdbserver_state;
 
 /* Retrieves flags for single step mode. */
 static int get_sstep_flags(void)
@@ -386,11 +385,10 @@ static int get_sstep_flags(void)
 if (replay_mode != REPLAY_MODE_NONE) {
 return SSTEP_ENABLE;
 } else {
-return sstep_flags;
+return gdbserver_state.sstep_flags;
 }
 }
 
-static GDBState gdbserver_state;
 
 static void init_gdbserver_state(void)
 {
@@ -400,6 +398,23 @@ static void init_gdbserver_state(void)
 gdbserver_state.str_buf = g_string_new(NULL);
 gdbserver_state.mem_buf = g_byte_array_sized_new(MAX_PACKET_LENGTH);
 gdbserver_state.last_packet = g_byte_array_sized_new(MAX_PACKET_LENGTH + 
4);
+
+
+if (kvm_enabled()) {
+gdbserver_state.supported_sstep_flags = 
kvm_get_supported_sstep_flags();
+} else {
+gdbserver_state.supported_sstep_flags =
+SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+}
+
+/*
+ * By default use no IRQs and no timers while single stepping so as to
+ * make single stepping like an ICE HW step.
+ */
+
+gdbserver_state.sstep_flags = SSTEP_ENABLE | SSTEP_NOIRQ | SSTEP_NOTIMER;
+gdbserver_state.sstep_flags &= gdbserver_state.supported_sstep_flags;
+
 }
 
 #ifndef CONFIG_USER_ONLY
@@ -2023,24 +2038,43 @@ static void handle_v_commands(GdbCmdContext *gdb_ctx, 
void *user_ctx)
 
 static void handle_query_qemu_sstepbits(GdbCmdContext *gdb_ctx, void *user_ctx)
 {
-g_string_printf(gdbserver_state.str_buf, "ENABLE=%x,NOIRQ=%x,NOTIMER=%x",
-SSTEP_ENABLE, SSTEP_NOIRQ, SSTEP_NOTIMER);
+g_string_printf(gdbserver_state.str_buf, "ENABLE=%x", SSTEP_ENABLE);
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOIRQ) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOIRQ=%x",
+   SSTEP_NOIRQ);
+}
+
+if (gdbserver_state.supported_sstep_flags & SSTEP_NOTIMER) {
+g_string_append_printf(gdbserver_state.str_buf, ",NOTIMER=%x",
+   SSTEP_NOTIMER);
+}
+
 put_strbuf();
 }
 
 static void handle_set_qemu_sstep(GdbCmdContext *gdb_ctx, void *user_ctx)
 {
+int new_sstep_flags;
 if (!gdb_ctx->num_params) {
 return;
 }
 
-sstep_flags = gdb_ctx->params[0].val_ul;
+new_sstep_flags = gdb_ctx->params[0].val_ul;
+
+if (new_sstep_flags  & ~gdbserver_state.supported_sstep_flags) {
+put_packet("E22");
+return;
+}
+
+gdbserver_state.sstep_flags = new_sstep_flags;
 put_packet("OK");
 }
 
 static void handle

Re: [PATCH 2/2] qemu-img: align next status sector on destination alignment.

2021-03-18 Thread Maxim Levitsky
On Thu, 2020-11-12 at 17:04 +0200, Maxim Levitsky wrote:
> On Thu, 2020-11-12 at 07:45 -0600, Eric Blake wrote:
> > On 11/12/20 6:40 AM, Peter Lieven wrote:
> > 
> > > >  /*
> > > > - * Avoid that s->sector_next_status becomes unaligned to the 
> > > > source
> > > > - * request alignment and/or cluster size to avoid unnecessary 
> > > > read
> > > > - * cycles.
> > > > + * Avoid that s->sector_next_status becomes unaligned to the
> > > > + * source/destination request alignment and/or cluster size to 
> > > > avoid
> > > > + * unnecessary read/write cycles.
> > > >   */
> > > > -tail = (sector_num - src_cur_offset + n) % 
> > > > s->src_alignment[src_cur];
> > > > +alignment = MAX(s->src_alignment[src_cur], s->alignment);
> > > > +assert(is_power_of_2(alignment));
> > > > +
> > > > +tail = (sector_num - src_cur_offset + n) % alignment;
> > > >  if (n > tail) {
> > > >  n -= tail;
> > > >  }
> > > 
> > > I was also considering including the s->alignment when adding this 
> > > change. However, you need the least common multiple of both alignments, 
> > > not the maximum, otherwise
> > > 
> > > you might get misaligned to either source or destination.
> > > 
> > > 
> > > Why exactly do you need the power of two requirement?
> > 
> > The power of two requirement ensures that you have the least common
> > multiple of both alignments ;)
> -
> Yes, and in addition to that, both alignments are already powers of two
> because that is how we compute them. The assert I added is just in case.
> 
> This is how we calculate the destination alignment:
>  
> s.alignment = MAX(pow2floor(s.min_sparse),
>   DIV_ROUND_UP(out_bs->bl.request_alignment,
>BDRV_SECTOR_SIZE));
>  
>  
> This is how we calculate the source alignments (it is possible to have 
> several source files)
>  
> s.src_alignment[bs_i] = DIV_ROUND_UP(src_bs->bl.request_alignment,
>  BDRV_SECTOR_SIZE);
> if (!bdrv_get_info(src_bs, &bdi)) {
> s.src_alignment[bs_i] = MAX(s.src_alignment[bs_i], bdi.cluster_size / 
> BDRV_SECTOR_SIZE);
> }
> 
> 
> The bl.request_alignment comment mentions that it must be power of 2,
> and cluster sizes are also very likely to be power of 2 in all drivers
> we support. An assert for s.src_alignment won't hurt though.
> 
>  
> Note though that we don't really read the discard alignment.
> We have max_pdiscard, and pdiscard_alignment, but max_pdiscard
> is only used to split 'too large' discard requests, and pdiscard_alignment
> is advisory and used only in a couple of places.
> Neither are set by file-posix.
>  
> We do have request_alignment, which file-posix tries to align to the logical 
> block
> size which is still often 512 for backward compatibility on many devices 
> (even nvme)
>  
> Now both source and destination alignments in qemu-img convert are based on
> request_alignment
> and on min_sparse (which is by default 4K, controlled by the qemu-img -S parameter,
> which otherwise controls the minimum number of blocks that must be zero to consider
> discarding them in convert).
>  
> 
> This means that there is no guarantee to have 4K alignment, and besides,
> with a sufficiently fragmented source file, bdrv_block_status_above
> can return arbitrarily short and unaligned extents which can't be aligned;
> thus, as I said, this patch alone can't guarantee that we won't have
> a write and a discard sharing the same page.
>  
> Another thing that can be discussed is why is_allocated_sectors was patched
> to convert short discards to writes. Probably because the underlying hardware
> ignores them or something? In theory we should detect this and fall
> back to regular zeroing in this case.
> Again though, while in the case of converting an empty file this is the only
> source of writes (so eliminating it also 'fixes' this case), with a sufficiently
> fragmented source file we can get a write and a discard sharing a page
> even without this code.
> 
> 
> Best regards,
>   Maxim Levitsky
> 
> > However, there ARE devices in practice that have advertised a
> > non-power-of-2 discard granularity, such as 15MiB (see commits 430b26a8,
> > 63188c24).  Which means you probably don't want to assert power-of-2,
> > and in turn need to worry about least common multiple.
> > 

Any update on this patch?

Best regards,
Maxim Levitsky
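
[A small standalone illustration of the point debated above: for two powers of
two the larger one is automatically a multiple of the smaller, so MAX() and the
least common multiple coincide; with a non-power-of-two granularity such as the
15 MiB example they can differ. The 2 MiB counterpart value below is an
arbitrary example, not taken from qemu-img.]

#include <assert.h>
#include <stdint.h>

static uint64_t gcd(uint64_t a, uint64_t b)
{
    while (b) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

static uint64_t lcm(uint64_t a, uint64_t b)
{
    return a / gcd(a, b) * b;
}

int main(void)
{
    /* Powers of two: the larger value already satisfies both alignments. */
    assert(lcm(512, 4096) == 4096);

    /* 15 MiB discard granularity vs. a 2 MiB alignment: MAX() would pick
     * 15 MiB, but only the 30 MiB LCM is aligned for both sides. */
    assert(lcm(15ULL << 20, 2ULL << 20) == (30ULL << 20));
    return 0;
}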




Re: [PATCH] KVM: x86: do not fail if software breakpoint has already been removed

2021-03-12 Thread Maxim Levitsky
On Fri, 2021-03-12 at 12:59 +0100, Stefano Garzarella wrote:
> On Thu, Mar 11, 2021 at 02:53:15PM +0200, Maxim Levitsky wrote:
> > On Thu, 2021-03-04 at 11:15 +0100, Stefano Garzarella wrote:
> > > On Wed, Mar 03, 2021 at 02:07:24PM +0200, Maxim Levitsky wrote:
> > > > On Tue, 2021-03-02 at 15:52 +0100, Stefano Garzarella wrote:
> > > > > On Mon, Mar 01, 2021 at 02:56:40PM +0200, Maxim Levitsky wrote:
> > > > > > On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote:
> > > > > > > If kvm_arch_remove_sw_breakpoint finds that a software breakpoint 
> > > > > > > does not
> > > > > > > have an INT3 instruction, it fails.  This can happen if one sets a
> > > > > > > software breakpoint in a kernel module and then reloads it.  gdb 
> > > > > > > then
> > > > > > > thinks the breakpoint cannot be deleted and there is no way to 
> > > > > > > add it
> > > > > > > back.
> > > > > > > 
> > > > > > > Suggested-by: Maxim Levitsky 
> > > > > > > Signed-off-by: Paolo Bonzini 
> > > > > > > ---
> > > > > > >  target/i386/kvm/kvm.c | 9 +++--
> > > > > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > > > > > > index 0b5755e42b..c8d61daf68 100644
> > > > > > > --- a/target/i386/kvm/kvm.c
> > > > > > > +++ b/target/i386/kvm/kvm.c
> > > > > > > @@ -4352,8 +4352,13 @@ int kvm_arch_remove_sw_breakpoint(CPUState 
> > > > > > > *cs, struct kvm_sw_breakpoint *bp)
> > > > > > >  {
> > > > > > >  uint8_t int3;
> > > > > > > 
> > > > > > > -if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0) || int3 != 
> > > > > > > 0xcc ||
> > > > > > > -cpu_memory_rw_debug(cs, bp->pc, (uint8_t 
> > > > > > > *)&bp->saved_insn, 1, 1)) {
> > > > > > > +if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0)) {
> > > > > > > +return -EINVAL;
> > > > > > > +}
> > > > > > > +if (int3 != 0xcc) {
> > > > > > > +return 0;
> > > > > > > +}
> > > > > > > +if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t 
> > > > > > > *)&bp->saved_insn, 1, 1)) {
> > > > > > >  return -EINVAL;
> > > > > > >  }
> > > > > > >  return 0;
> > > > > > 
> > > > > > There still remains a philosophical question if 
> > > > > > kvm_arch_remove_sw_breakpoint
> > > > > > should always return 0, since for the usual case of kernel 
> > > > > > debugging where
> > > > > > a breakpoint is in unloaded module, the above will probably still 
> > > > > > fail
> > > > > > as paging for this module is removed as well by the kernel.
> > > > > > It is still better though so:
> > > > > > 
> > > > > > Reviewed-by: Maxim Levitsky 
> > > > > > 
> > > > > > Note that I managed to make lx-symbols to work in a very stable way
> > > > > > with attached WIP patch I hacked on this Sunday.
> > > > > > I will send a cleaned up version of it to upstream when I have time.
> > > > > > 
> > > > > > Since I make gdb unload the symbols, it works even without this 
> > > > > > patch.
> > > > > > 
> > > > > > Added Stefano Garzarella to CC as I know that he tried to make this 
> > > > > > work as well.
> > > > > > https://lkml.org/lkml/2020/10/5/514
> > > > > 
> > > > > Thanks Maxim for CCing me!
> > > > > 
> > > > > Just a report when I tried these patches, but I'm not sure they are
> > > > > related.
> > > > > 
> > > > > I found that gdb 10 has some problem with QEMU:
> > > > > 
> > > > >  $ gdb --version
> > > > >  GNU gdb (GDB) Fedora 10.1-2.fc33
> > > > > 
> > > > >  (gdb) lx-s

Re: [PATCH] KVM: x86: do not fail if software breakpoint has already been removed

2021-03-11 Thread Maxim Levitsky
On Thu, 2021-03-04 at 11:15 +0100, Stefano Garzarella wrote:
> On Wed, Mar 03, 2021 at 02:07:24PM +0200, Maxim Levitsky wrote:
> > On Tue, 2021-03-02 at 15:52 +0100, Stefano Garzarella wrote:
> > > On Mon, Mar 01, 2021 at 02:56:40PM +0200, Maxim Levitsky wrote:
> > > > On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote:
> > > > > If kvm_arch_remove_sw_breakpoint finds that a software breakpoint 
> > > > > does not
> > > > > have an INT3 instruction, it fails.  This can happen if one sets a
> > > > > software breakpoint in a kernel module and then reloads it.  gdb then
> > > > > thinks the breakpoint cannot be deleted and there is no way to add it
> > > > > back.
> > > > > 
> > > > > Suggested-by: Maxim Levitsky 
> > > > > Signed-off-by: Paolo Bonzini 
> > > > > ---
> > > > >  target/i386/kvm/kvm.c | 9 +++--
> > > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > > > > index 0b5755e42b..c8d61daf68 100644
> > > > > --- a/target/i386/kvm/kvm.c
> > > > > +++ b/target/i386/kvm/kvm.c
> > > > > @@ -4352,8 +4352,13 @@ int kvm_arch_remove_sw_breakpoint(CPUState 
> > > > > *cs, struct kvm_sw_breakpoint *bp)
> > > > >  {
> > > > >  uint8_t int3;
> > > > > 
> > > > > -if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0) || int3 != 0xcc 
> > > > > ||
> > > > > -cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 
> > > > > 1, 1)) {
> > > > > +if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0)) {
> > > > > +return -EINVAL;
> > > > > +}
> > > > > +if (int3 != 0xcc) {
> > > > > +return 0;
> > > > > +}
> > > > > +if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 
> > > > > 1, 1)) {
> > > > >  return -EINVAL;
> > > > >  }
> > > > >  return 0;
> > > > 
> > > > There still remains a philosophical question if 
> > > > kvm_arch_remove_sw_breakpoint
> > > > should always return 0, since for the usual case of kernel debugging 
> > > > where
> > > > a breakpoint is in unloaded module, the above will probably still fail
> > > > as paging for this module is removed as well by the kernel.
> > > > It is still better though so:
> > > > 
> > > > Reviewed-by: Maxim Levitsky 
> > > > 
> > > > Note that I managed to make lx-symbols to work in a very stable way
> > > > with attached WIP patch I hacked on this Sunday.
> > > > I will send a cleaned up version of it to upstream when I have time.
> > > > 
> > > > Since I make gdb unload the symbols, it works even without this patch.
> > > > 
> > > > Added Stefano Garzarella to CC as I know that he tried to make this 
> > > > work as well.
> > > > https://lkml.org/lkml/2020/10/5/514
> > > 
> > > Thanks Maxim for CCing me!
> > > 
> > > Just a report when I tried these patches, but I'm not sure they are
> > > related.
> > > 
> > > I found that gdb 10 has some problem with QEMU:
> > > 
> > >  $ gdb --version
> > >  GNU gdb (GDB) Fedora 10.1-2.fc33
> > > 
> > >  (gdb) lx-symbols
> > >  loading vmlinux
> > >  scanning for modules in linux/build
> > >  ../../gdb/dwarf2/frame.c:1085: internal-error: Unknown CFA rule.
> > > 
> > > With gdb 9 'lx-symbols' works well, but I still have some issue when I
> > > put a breakpoint to a symbol not yet loaded (vsock_core_register in this
> > > example), then I load the module (vsock_loopback in this example) in the
> > > guest.
> > > 
> > > With your patch gdb gets stuck after loading the symbols of the first new
> > > module:
> > >  (gdb) b vsock_core_register
> > >  Function "vsock_core_register" not defined.
> > >  Make breakpoint pending on future shared library load? (y or [n]) y
> > >  Breakpoint 1 (vsock_core_register) pending.
> > >  (gdb) c
> > >  Continuing.
> > >  loading @0xc00

Re: [PATCH] KVM: x86: do not fail if software breakpoint has already been removed

2021-03-03 Thread Maxim Levitsky
On Tue, 2021-03-02 at 15:52 +0100, Stefano Garzarella wrote:
> On Mon, Mar 01, 2021 at 02:56:40PM +0200, Maxim Levitsky wrote:
> > On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote:
> > > If kvm_arch_remove_sw_breakpoint finds that a software breakpoint does not
> > > have an INT3 instruction, it fails.  This can happen if one sets a
> > > software breakpoint in a kernel module and then reloads it.  gdb then
> > > thinks the breakpoint cannot be deleted and there is no way to add it
> > > back.
> > > 
> > > Suggested-by: Maxim Levitsky 
> > > Signed-off-by: Paolo Bonzini 
> > > ---
> > >  target/i386/kvm/kvm.c | 9 +++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > > index 0b5755e42b..c8d61daf68 100644
> > > --- a/target/i386/kvm/kvm.c
> > > +++ b/target/i386/kvm/kvm.c
> > > @@ -4352,8 +4352,13 @@ int kvm_arch_remove_sw_breakpoint(CPUState *cs, 
> > > struct kvm_sw_breakpoint *bp)
> > >  {
> > >  uint8_t int3;
> > > 
> > > -if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0) || int3 != 0xcc ||
> > > -cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 1, 
> > > 1)) {
> > > +if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0)) {
> > > +return -EINVAL;
> > > +}
> > > +if (int3 != 0xcc) {
> > > +return 0;
> > > +}
> > > +if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 1, 
> > > 1)) {
> > >  return -EINVAL;
> > >  }
> > >  return 0;
> > 
> > There still remains a philosophical question if kvm_arch_remove_sw_breakpoint
> > should always return 0, since for the usual case of kernel debugging where
> > a breakpoint is in unloaded module, the above will probably still fail
> > as paging for this module is removed as well by the kernel.
> > It is still better though so:
> > 
> > Reviewed-by: Maxim Levitsky 
> > 
> > Note that I managed to make lx-symbols to work in a very stable way
> > with attached WIP patch I hacked on this Sunday.
> > I will send a cleaned up version of it to upstream when I have time.
> > 
> > Since I make gdb unload the symbols, it works even without this patch.
> > 
> > Added Stefano Garzarella to CC as I know that he tried to make this work as 
> > well.
> > https://lkml.org/lkml/2020/10/5/514
> 
> Thanks Maxim for CCing me!
> 
> Just a report when I tried these patches, but I'm not sure they are 
> related.
> 
> I found that gdb 10 has some problem with QEMU:
> 
>  $ gdb --version
>  GNU gdb (GDB) Fedora 10.1-2.fc33
> 
>  (gdb) lx-symbols
>  loading vmlinux
>  scanning for modules in linux/build
>  ../../gdb/dwarf2/frame.c:1085: internal-error: Unknown CFA rule.
> 
> With gdb 9 'lx-symbols' works well, but I still have some issue when I 
> put a breakpoint to a symbol not yet loaded (vsock_core_register in this 
> example), then I load the module (vsock_loopback in this example) in the 
> guest.
> 
> With your patch gdb gets stuck after loading the symbols of the first new 
> module:
>  (gdb) b vsock_core_register
>  Function "vsock_core_register" not defined.
>  Make breakpoint pending on future shared library load? (y or [n]) y
>  Breakpoint 1 (vsock_core_register) pending.
>  (gdb) c
>  Continuing.
>  loading @0xc00a1000: linux/build/net/vmw_vsock/vsock.ko
> 
> Without your patch, gdb loops infinitely reloading the new module:
>  refreshing all symbols to reload module 'vsock'
>  loading vmlinux
>  loading @0xc00a1000: linux/build/net/vmw_vsock/vsock.ko
>  loading @0xc00ad000: linux/build/drivers/net/tun.ko
>  loading @0xc007e000: linux/build/net/bridge/bridge.ko
>  loading @0xc0077000: linux/build/net/802/stp.ko
>  loading @0xc0007000: linux/build/net/llc/llc.ko
>  loading @0xc0013000: linux/build/net/sunrpc/sunrpc.ko
>  loading @0xc000d000: linux/build/net/ipv4/netfilter/ip_tables.ko
>  loading @0xc000: linux/build/net/netfilter/x_tables.ko
>  refreshing all symbols to reload module 'vsock'
>  loading vmlinux
>  loading @0xc00a1000: linux/build/net/vmw_vsock/vsock.ko
>  loading @0xc00ad000: linux/build/drivers/net/tun.ko
>  ...
> 
> I'll try to get a bette

Re: [PATCH] KVM: x86: deprecate -M kernel-irqchip=off except for -M isapc

2021-03-01 Thread Maxim Levitsky
On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote:
> The userspace local APIC is basically untested and does not support many
> features such as TSC deadline timer, x2APIC or PV spinlocks.  On the
> other hand, the PIT and IOAPIC are okay as they are not tied to
> the processor and are tested with -M kernel-irqchip=split.
> 
> Therefore, deprecate the local APIC and, with it, limit
> -M kernel-irqchip=off to the ISA PC machine type, which does not
> have a local APIC at all.
> 
> Signed-off-by: Paolo Bonzini 

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky

> ---
>  docs/system/deprecated.rst | 7 +++
>  hw/intc/apic.c | 5 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/docs/system/deprecated.rst b/docs/system/deprecated.rst
> index 561c916da2..ae180dc887 100644
> --- a/docs/system/deprecated.rst
> +++ b/docs/system/deprecated.rst
> @@ -153,6 +153,13 @@ The ``-writeconfig`` option is not able to serialize the 
> entire contents
>  of the QEMU command line.  It is thus considered a failed experiment
>  and deprecated, with no current replacement.
>  
> +Userspace local APIC with KVM (x86, since 6.0)
> +'''''''''''''''''''''''''''''''''''''''''
> +
> +Using ``-M kernel-irqchip=off`` with x86 machine types that include a local
> +APIC is deprecated.  The ``split`` setting is supported, as is using
> +``-M kernel-irqchip=off`` with the ISA PC machine type.
> +
>  QEMU Machine Protocol (QMP) commands
>  
>  
> diff --git a/hw/intc/apic.c b/hw/intc/apic.c
> index 3ada22f427..7e9601b89d 100644
> --- a/hw/intc/apic.c
> +++ b/hw/intc/apic.c
> @@ -875,6 +875,11 @@ static void apic_realize(DeviceState *dev, Error **errp)
>  return;
>  }
>  
> +if (kvm_enabled()) {
> +warn_report("Userspace local APIC is deprecated for KVM.");
> +warn_report("Do not use kernel-irqchip except for the -M isapc 
> machine type.");
> +}
> +
>  memory_region_init_io(&s->io_memory, OBJECT(s), &apic_io_ops, s, 
> "apic-msi",
>APIC_SPACE_SIZE);
>  





Re: [PATCH] KVM: x86: do not fail if software breakpoint has already been removed

2021-03-01 Thread Maxim Levitsky
On Mon, 2021-03-01 at 12:17 +0100, Paolo Bonzini wrote:
> If kvm_arch_remove_sw_breakpoint finds that a software breakpoint does not
> have an INT3 instruction, it fails.  This can happen if one sets a
> software breakpoint in a kernel module and then reloads it.  gdb then
> thinks the breakpoint cannot be deleted and there is no way to add it
> back.
> 
> Suggested-by: Maxim Levitsky 
> Signed-off-by: Paolo Bonzini 
> ---
>  target/i386/kvm/kvm.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 0b5755e42b..c8d61daf68 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -4352,8 +4352,13 @@ int kvm_arch_remove_sw_breakpoint(CPUState *cs, struct 
> kvm_sw_breakpoint *bp)
>  {
>  uint8_t int3;
>  
> -if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0) || int3 != 0xcc ||
> -cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 1, 1)) {
> +if (cpu_memory_rw_debug(cs, bp->pc, &int3, 1, 0)) {
> +return -EINVAL;
> +}
> +if (int3 != 0xcc) {
> +return 0;
> +}
> +if (cpu_memory_rw_debug(cs, bp->pc, (uint8_t *)&bp->saved_insn, 1, 1)) {
>  return -EINVAL;
>  }
>  return 0;

There still remains a philosophical question of whether kvm_arch_remove_sw_breakpoint
should always return 0, since for the usual case of kernel debugging where
a breakpoint is in an unloaded module, the above will probably still fail
as paging for this module is removed as well by the kernel.
It is still better though, so:

Reviewed-by: Maxim Levitsky 

Note that I managed to make lx-symbols to work in a very stable way
with attached WIP patch I hacked on this Sunday.
I will send a cleaned up version of it to upstream when I have time.

Since I make gdb unload the symbols, it works even without this patch. 

Added Stefano Garzarella to CC as I know that he tried to make this work as 
well.
https://lkml.org/lkml/2020/10/5/514


Best regards,
Maxim Levitsky

commit 86f0d5a08a40528350589ed54126f06d4ac293a8
Author: Maxim Levitsky 
Date:   Sun Feb 28 23:52:08 2021 +0200

gdb: rework gdb debug script

Signed-off-by: Maxim Levitsky 

diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py
index 1be9763cf8bb..13f21afc2059 100644
--- a/scripts/gdb/linux/symbols.py
+++ b/scripts/gdb/linux/symbols.py
@@ -17,6 +17,24 @@ import re
 
 from linux import modules, utils
 
+def save_state():
+breakpoints = []
+if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None:
+for bp in gdb.breakpoints():
+breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled})
+bp.enabled = False
+
+show_pagination = gdb.execute("show pagination", to_string=True)
+pagination = show_pagination.endswith("on.\n")
+gdb.execute("set pagination off")
+
+return {"breakpoints": breakpoints, "show_pagination": pagination}
+
+def load_state(state):
+for breakpoint in state["breakpoints"]:
+breakpoint['breakpoint'].enabled = breakpoint['enabled']
+gdb.execute("set pagination %s" % ("on" if state["show_pagination"] else "off"))
+
 
 if hasattr(gdb, 'Breakpoint'):
 class LoadModuleBreakpoint(gdb.Breakpoint):
@@ -30,26 +48,38 @@ if hasattr(gdb, 'Breakpoint'):
 module_name = module['name'].string()
 cmd = self.gdb_command
 
+# module already loaded, false alarm
+if module_name in cmd.loaded_modules:
+return False
+
 # enforce update if object file is not found
 cmd.module_files_updated = False
 
 # Disable pagination while reporting symbol (re-)loading.
 # The console input is blocked in this context so that we would
 # get stuck waiting for the user to acknowledge paged output.
-show_pagination = gdb.execute("show pagination", to_string=True)
-pagination = show_pagination.endswith("on.\n")
-gdb.execute("set pagination off")
+state = save_state()
+cmd.load_module_symbols(module)
+load_state(state)
+return False
 
-if module_name in cmd.loaded_modules:
-gdb.write("refreshing all symbols to reload module "
-  "'{0}'\n".format(module_name))
-cmd.load_all_symbols()
-else:
-cmd.load_module_symbols(module)
+class UnLoadModuleBreakpoint(gdb.Breakpoint):
+def __init__(self, spec, gdb_command):
+sup

Re: [PATCH] event_notifier: Set ->initialized earlier in event_notifier_init()

2021-02-17 Thread Maxim Levitsky
On Tue, 2021-02-16 at 13:02 +0100, Greg Kurz wrote:
> Otherwise the call to event_notifier_set() is a nop, which causes
> the SLOF firmware on POWER to hang when booting from a virtio-scsi
> device:
> 
> virtio_scsi_dataplane_start()
>  virtio_scsi_vring_init()
>   virtio_bus_set_host_notifier() <- assign == true
>event_notifier_init() <- active == 1
> event_notifier_set() <- fails right away if !e->initialized
> 
> Fixes: e34e47eb28c0 ("event_notifier: handle initialization failure better")
> Cc: mlevi...@redhat.com
> Signed-off-by: Greg Kurz 
> ---
>  util/event_notifier-posix.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/event_notifier-posix.c b/util/event_notifier-posix.c
> index 5b2110e86198..8307013c5df0 100644
> --- a/util/event_notifier-posix.c
> +++ b/util/event_notifier-posix.c
> @@ -66,10 +66,10 @@ int event_notifier_init(EventNotifier *e, int active)
>  e->rfd = fds[0];
>  e->wfd = fds[1];
>  }
> +e->initialized = true;
>  if (active) {
>  event_notifier_set(e);
>  }
> -e->initialized = true;
>  return 0;
>  
>  fail:
Sorry about that!

Best regards,
Maxim levitsky
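
[To illustrate what the reordering above fixes: an EventNotifier created with
active == 1 is supposed to come back already signalled, but with the old
ordering event_notifier_set() bailed out on !e->initialized and silently did
nothing. A minimal sketch against QEMU's EventNotifier API follows; the
poll()-based check itself is just for demonstration and is not part of the
patch.]

#include "qemu/osdep.h"
#include "qemu/event_notifier.h"
#include <poll.h>

/* Returns true if an "active" notifier is readable right after init. */
static bool notifier_starts_signalled(void)
{
    EventNotifier e;
    struct pollfd pfd;
    bool ok;

    if (event_notifier_init(&e, 1) < 0) {       /* active == 1 */
        return false;
    }

    pfd.fd = event_notifier_get_fd(&e);
    pfd.events = POLLIN;

    /* Before the fix this poll() reported no data, because the internal
     * event_notifier_set() call was skipped while initialized was false. */
    ok = poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN);

    event_notifier_cleanup(&e);
    return ok;
}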




Re: [PATCH v2 0/2] Quality of life patches for qemu coroutine debugging

2021-01-07 Thread Maxim Levitsky
On Thu, 2021-01-07 at 19:05 +0100, Paolo Bonzini wrote:
> On 07/01/21 11:24, Maxim Levitsky wrote:
> > On Thu, 2020-12-17 at 17:54 +0200, Maxim Levitsky wrote:
> > > These are two patches that improve a bit the qemu gdb scripts in regard to
> > > coroutines.
> > > 
> > > First patch fixes a bug in 'qemu coroutine' in regard to usage on non 
> > > topmost
> > > stack frame.
> > > 
> > > Second patch implements 'qemu bt' as Stefan suggested.
> > > 
> > > V2, because after hitting 'send' I noticed a few bugs:
> > > 1. Switch to try/catch around bt, so that when regular bt fails, we
> > >print the same message as original bt.
Note that I forgot to 'git add' this bit, but it is only a cosmetic thing,
so I guess I'll just leave it like that.

> > > 2. Print the correct co-routine address
> > > 
> > > Best regards,
> > >  Maxim Levitsky
> > > 
> > > Maxim Levitsky (2):
> > >scripts/gdb: fix 'qemu coroutine' when users selects a non topmost
> > >  stack frame
> > >scripts/gdb: implement 'qemu bt'
> > > 
> > >   scripts/qemu-gdb.py  |  1 +
> > >   scripts/qemugdb/coroutine.py | 35 ++++++-
> > >   2 files changed, 35 insertions(+), 1 deletion(-)
> > > 
> > > -- 
> > > 2.26.2
> > > 
> > > 
> > > 
> > Any update on this?
> > Best regards,
> > Maxim Levitsky
> > 
> > 
> 
> I'll queue them myself, thanks.

Thanks!

Best regards,
Maxim Levitsky
> 
> Paolo
> 





Re: [PATCH 1/2] file-posix: allow -EBUSY errors during write zeros on raw block devices

2021-01-07 Thread Maxim Levitsky
On Mon, 2020-11-16 at 15:48 +0100, Kevin Wolf wrote:
> Am 11.11.2020 um 16:39 hat Maxim Levitsky geschrieben:
> > On Linux, fallocate(fd, FALLOC_FL_PUNCH_HOLE) when it is used on a block 
> > device,
> > without O_DIRECT can return -EBUSY if it races with another write to the 
> > same page.
> > 
> > Since this is rare and discard is not a critical operation, ignore this 
> > error
> > 
> > Signed-off-by: Maxim Levitsky 
> 
> I'm applying this one for 5.2, it certainly shouldn't hurt and makes
> things work at least, even if possibly not in the optimal way.
> 
> Patch 2 seems to be a bit less obvious and discussion is ongoing, so
> that's probably more 6.0 material.
> 
> Kevin

Any feedback on patch 2?

Best regards,
Maxim Levitsky




Re: [PATCH v6 0/3] qcow2: don't leave partially initialized file on image creation

2021-01-07 Thread Maxim Levitsky
On Thu, 2020-12-17 at 19:09 +0200, Maxim Levitsky wrote:
> Use the bdrv_co_delete_file interface to delete the underlying
> file if qcow2 initialization fails (e.g due to bad encryption secret)
> 
> This makes the qcow2 driver behave the same way as the luks driver behaves.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1845353
> 
> V3: addressed review feedback and reworked commit messages
> 
> V4: got rid of code duplication by adding bdrv_co_delete_file_noerr
> and made the qcow2 driver use this function to delete
> both the main and the data file.
> 
> V5: addressed review feedback on the reworked version.
> 
> V6: addressed most of the review feedback.
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (3):
>   crypto: luks: Fix tiny memory leak
>   block: add bdrv_co_delete_file_noerr
>   block: qcow2: remove the created file on initialization error
> 
>  block.c   | 22 ++
>  block/crypto.c| 13 ++---
>  block/qcow2.c |  8 +---
>  include/block/block.h |  1 +
>  4 files changed, 30 insertions(+), 14 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
> 
Any update on this?

Best regards,
Maxim Levitsky




Re: [PATCH 0/3] RFC: few random hacks to improve eventfd fallback path

2021-01-07 Thread Maxim Levitsky
On Thu, 2020-12-17 at 17:00 +0200, Maxim Levitsky wrote:
> These few patches are the result of a random hacking I did to make the qemu
> cope with eventfd allocation failure, when using an iothread,
> as it happened in bz #1897550.
> 
> I am not 100% sure which patches in this series are worth merging, or if
> this can be fixed in a better way.
> 
> After this patch series is applied, qemu still hangs while running the reproducer for
> this BZ due to an ABBA lock inversion, which needs some heavy rework to get rid 
> of.
> I explained all the (gory) details in the bugzilla.
> 
> This patch series was (lightly) tested with make check, iotests and with
> the reproducer.
> 
> Best regards,
>   Maxim Levitsky
> 
> Maxim Levitsky (3):
>   scsi: virtio-scsi: don't process IO on fenced dataplane
>   virtio-scsi: don't uninitialize queues that we didn't initialize
>   event_notifier: handle initialization failure better
> 
>  hw/scsi/virtio-scsi-dataplane.c | 26 +++---
>  include/qemu/event_notifier.h   |  1 +
>  util/event_notifier-posix.c | 16 
>  3 files changed, 36 insertions(+), 7 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
Any update on this?
Best regards,
Maxim Levitsky




Re: [PATCH v2 0/2] Quality of life patches for qemu coroutine debugging

2021-01-07 Thread Maxim Levitsky
On Thu, 2020-12-17 at 17:54 +0200, Maxim Levitsky wrote:
> These are two patches that improve a bit the qemu gdb scripts in regard to
> coroutines.
> 
> First patch fixes a bug in 'qemu coroutine' in regard to usage on non topmost
> stack frame.
> 
> Second patch implements 'qemu bt' as Stefan suggested.
> 
> V2, because after hitting 'send' I noticed a few bugs:
>1. Switch to try/catch around bt, so that when regular bt fails, we
>   print the same message as original bt.
>2. Print the correct co-routine address
> 
> Best regards,
> Maxim Levitsky
> 
> Maxim Levitsky (2):
>   scripts/gdb: fix 'qemu coroutine' when users selects a non topmost
> stack frame
>   scripts/gdb: implement 'qemu bt'
> 
>  scripts/qemu-gdb.py  |  1 +
>  scripts/qemugdb/coroutine.py | 35 ++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> -- 
> 2.26.2
> 
> 
> 
Any update on this?
Best regards,
Maxim Levitsky





Re: [PATCH v3 0/5] SCSI: fix transfer limits for SCSI passthrough

2021-01-07 Thread Maxim Levitsky
On Thu, 2020-12-17 at 18:56 +0200, Maxim Levitsky wrote:
> This patch series attempts to provide a solution to the problem of the 
> transfer
> limits of the raw file driver (host_device/file-posix), some of which I
> already tried to fix in the past.
> 
> I included 2 patches from Tom Yan which fix two issues with reading the limits
> correctly from the */dev/sg* character devices in the first place.
> 
> The only change to these patches is that I tweaked a bit the comments in the
> source to better document the /dev/sg quirks.
> 
> The other two patches in this series split the max transfer limits that qemu
> block devices expose in two:
> One limit is for the regular IO, and another is for the SG_IO (aka 
> bdrv_*_ioctl),
> and the two device drivers (scsi-block and scsi-generic) that use the later
> are switched to the new interface.
> 
> This should ensure that the raw driver can still advertise the unlimited
> transfer  length, unless it is used for SG_IO, because that yields the highest
> performance.
> 
> Also I include a somewhat unrelated fix to a bug I found in qemu's
> SCSI passthrough while testing this:
> When qemu emulates the VPD block limit page, for a SCSI device that doesn't
> implement it, it doesn't really advertise the emulated page to the guest.
> 
> I tested this by doing both regular and SG_IO passthrough of my
> USB SD card reader.
> 
> That device turned out to be a perfect device for the task, since it has max
> transfer size of 1024 blocks (512K), and it enforces it.
> 
> Also it didn't implement the VPD block limits page,
> (transfer size limit probably comes from something USB related) which 
> triggered
> the unrelated bug.
> 
> I was able to see IO errors without the patches, and the wrong max transfer
> size in the guest, and with patches both issues were gone.
> 
> I also found an unrelated issue in /dev/sg passthrough in the kernel.
> It turns out that in-kernel driver has a limitation of 16 requests in flight,
> regardless of what underlying device supports.
> 
> With a large multi-threaded fio job and a debug print in qemu, it is easy to
> see it, although the errors don't do much harm to the guest as it retries the
> IO, and eventually succeeds.
> It is an open question if this should be solved.
> 
> V2: fixed an issue in a patch from Tom Yan (thanks), and removed
> refactoring from the last patch according to Paolo's request.
> 
> V3: a few cosmetic changes due to the review feedback.
> 
> Maxim Levitsky (3):
>   block: add max_ioctl_transfer to BlockLimits
>   block: use blk_get_max_ioctl_transfer for SCSI passthrough
>   block/scsi: correctly emulate the VPD block limits page
> 
> Tom Yan (2):
>   file-posix: split hdev_refresh_limits from raw_refresh_limits
>   file-posix: add sg_get_max_segments that actually works with sg
> 
>  block/block-backend.c  | 12 ++
>  block/file-posix.c | 76 +++---
>  block/io.c |  2 +
>  block/iscsi.c  |  1 +
>  hw/scsi/scsi-generic.c | 13 --
>  include/block/block_int.h  |  4 ++
>  include/sysemu/block-backend.h |  1 +
>  7 files changed, 90 insertions(+), 19 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
> 
Any update on this?
Best regards,
Maxim Levitsky




Re: [Bug 1873769] Re: SB16 audio playback freezes emulation in Windows 95 guest

2021-01-07 Thread Maxim Levitsky
On Wed, 2021-01-06 at 15:50 +, Vladislav K. Valtchev wrote:
> P.S.: sorry for the terribly broken lines. I didn't expect launchpad to
> add additional line breaks that way :-(
> 
FYI, since I am into retro gaming with kvm on win98, I ended up using
-device AC97 with old realtek_3_41 win95 (I think) drivers. I didn't play
with win95 though.

I do notice the exact same issue with SB16.

Best regards,
    Maxim Levitsky




[PATCH v6 0/3] qcow2: don't leave partially initialized file on image creation

2020-12-17 Thread Maxim Levitsky
Use the bdrv_co_delete_file interface to delete the underlying
file if qcow2 initialization fails (e.g due to bad encryption secret)

This makes the qcow2 driver behave the same way as the luks driver behaves.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1845353

V3: addressed review feedback and reworked commit messages

V4: got rid of code duplication by adding bdrv_co_delete_file_noerr
and made the qcow2 driver use this function to delete
both the main and the data file.

V5: addressed review feedback on the reworked version.

V6: addressed most of the review feedback.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  crypto: luks: Fix tiny memory leak
  block: add bdrv_co_delete_file_noerr
  block: qcow2: remove the created file on initialization error

 block.c   | 22 ++
 block/crypto.c| 13 ++---
 block/qcow2.c |  8 +---
 include/block/block.h |  1 +
 4 files changed, 30 insertions(+), 14 deletions(-)

-- 
2.26.2





[PATCH v6 1/3] crypto: luks: Fix tiny memory leak

2020-12-17 Thread Maxim Levitsky
When the underlying block device doesn't support the
bdrv_co_delete_file interface, an 'Error' object is leaked.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/crypto.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/crypto.c b/block/crypto.c
index aef5a5721a..b3a5275132 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -735,6 +735,8 @@ fail:
  */
 if ((r_del < 0) && (r_del != -ENOTSUP)) {
 error_report_err(local_delete_err);
+} else {
+error_free(local_delete_err);
 }
 }
 
-- 
2.26.2




[PATCH v6 3/3] block: qcow2: remove the created file on initialization error

2020-12-17 Thread Maxim Levitsky
If the qcow2 initialization fails, we should remove the file if it was
already created, to avoid leaving stale files around.

We already do this for luks raw images.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block/qcow2.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 3a90ef2786..a8638d250a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3847,12 +3847,14 @@ static int coroutine_fn 
qcow2_co_create_opts(BlockDriver *drv,
 
 /* Create the qcow2 image (format layer) */
 ret = qcow2_co_create(create_options, errp);
+finish:
 if (ret < 0) {
-goto finish;
+bdrv_co_delete_file_noerr(bs);
+bdrv_co_delete_file_noerr(data_bs);
+} else {
+ret = 0;
 }
 
-ret = 0;
-finish:
 qobject_unref(qdict);
 bdrv_unref(bs);
 bdrv_unref(data_bs);
-- 
2.26.2




[PATCH v6 2/3] block: add bdrv_co_delete_file_noerr

2020-12-17 Thread Maxim Levitsky
This function wraps bdrv_co_delete_file for the common case of removing a file
that was just created by a format driver, on an error condition.

It hides the -ENOTSUP error and reports all other errors.

Use it in the luks driver.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block.c   | 22 ++
 block/crypto.c| 15 ++-
 include/block/block.h |  1 +
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/block.c b/block.c
index f1cedac362..b9edea3b97 100644
--- a/block.c
+++ b/block.c
@@ -704,6 +704,28 @@ int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, 
Error **errp)
 return ret;
 }
 
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
+{
+Error *local_err = NULL;
+int ret;
+
+if (!bs) {
+return;
+}
+
+ret = bdrv_co_delete_file(bs, &local_err);
+/*
+ * ENOTSUP will happen if the block driver doesn't support
+ * the 'bdrv_co_delete_file' interface. This is a predictable
+ * scenario and shouldn't be reported back to the user.
+ */
+if (ret == -ENOTSUP) {
+error_free(local_err);
+} else if (ret < 0) {
+error_report_err(local_err);
+}
+}
+
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz struct and return 0.
diff --git a/block/crypto.c b/block/crypto.c
index b3a5275132..1d30fde38e 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -725,19 +725,8 @@ fail:
  * If an error occurred, delete 'filename'. Even if the file existed
  * beforehand, it has been truncated and corrupted in the process.
  */
-if (ret && bs) {
-Error *local_delete_err = NULL;
-int r_del = bdrv_co_delete_file(bs, &local_delete_err);
-/*
- * ENOTSUP will happen if the block driver doesn't support
- * the 'bdrv_co_delete_file' interface. This is a predictable
- * scenario and shouldn't be reported back to the user.
- */
-if ((r_del < 0) && (r_del != -ENOTSUP)) {
-error_report_err(local_delete_err);
-} else {
-error_free(local_delete_err);
-}
+if (ret) {
+bdrv_co_delete_file_noerr(bs);
 }
 
 bdrv_unref(bs);
diff --git a/include/block/block.h b/include/block/block.h
index c9d7c58765..af03022723 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -428,6 +428,7 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, 
BlockDriverState *base,
   Error **errp);
 void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
 int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
 
 
 typedef struct BdrvCheckResult {
-- 
2.26.2




[PATCH v3 4/5] block: use blk_get_max_ioctl_transfer for SCSI passthrough

2020-12-17 Thread Maxim Levitsky
Switch file-posix to expose only the max_ioctl_transfer limit.

Let the iscsi driver work as it did before, since it is bound by the transfer
limit in both the regular read/write and the SCSI passthrough cases.

Switch the scsi-disk and scsi-block drivers to read the SG max transfer limits
using the new blk_get_max_ioctl_transfer interface.


Fixes: 867eccfed8 ("file-posix: Use max transfer length/segment count only for 
SCSI passthrough")
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 7 ---
 block/iscsi.c  | 1 +
 hw/scsi/scsi-generic.c | 4 ++--
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2bf4d095a7..c34a7a9599 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1282,13 +1282,14 @@ static void hdev_refresh_limits(BlockDriverState *bs, 
Error **errp)
get_max_transfer_length(s->fd);
 
 if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
-bs->bl.max_transfer = pow2floor(ret);
+bs->bl.max_ioctl_transfer = pow2floor(ret);
 }
 
 ret = bs->sg ? sg_get_max_segments(s->fd) : get_max_segments(s->fd);
 if (ret > 0) {
-bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
-   ret * qemu_real_host_page_size);
+bs->bl.max_ioctl_transfer =
+MIN_NON_ZERO(bs->bl.max_ioctl_transfer,
+ ret * qemu_real_host_page_size);
 }
 
 raw_refresh_limits(bs, errp);
diff --git a/block/iscsi.c b/block/iscsi.c
index 7d4b3b56d5..b74fdf149d 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -2063,6 +2063,7 @@ static void iscsi_refresh_limits(BlockDriverState *bs, 
Error **errp)
 
 if (max_xfer_len * block_size < INT_MAX) {
 bs->bl.max_transfer = max_xfer_len * iscsilun->block_size;
+bs->bl.max_ioctl_transfer = bs->bl.max_transfer;
 }
 
 if (iscsilun->lbp.lbpu) {
diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 2cb23ca891..6df67bf889 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -167,7 +167,7 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, 
SCSIDevice *s)
 page = r->req.cmd.buf[2];
 if (page == 0xb0) {
 uint32_t max_transfer =
-blk_get_max_transfer(s->conf.blk) / s->blocksize;
+blk_get_max_ioctl_transfer(s->conf.blk) / s->blocksize;
 
 assert(max_transfer);
 stl_be_p(&r->buf[8], max_transfer);
@@ -210,7 +210,7 @@ static int scsi_generic_emulate_block_limits(SCSIGenericReq 
*r, SCSIDevice *s)
 uint8_t buf[64];
 
 SCSIBlockLimits bl = {
-.max_io_sectors = blk_get_max_transfer(s->conf.blk) / s->blocksize
+.max_io_sectors = blk_get_max_ioctl_transfer(s->conf.blk) / 
s->blocksize
 };
 
 memset(r->buf, 0, r->buflen);
-- 
2.26.2




[PATCH v3 2/5] file-posix: add sg_get_max_segments that actually works with sg

2020-12-17 Thread Maxim Levitsky
From: Tom Yan 

sg devices have different major/minor than their corresponding
block devices. Using sysfs to get max segments never really worked
for them.

Fortunately the sg driver provides an ioctl to get sg_tablesize,
which is apparently equivalent to max segments.

Signed-off-by: Tom Yan 
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index cbf1271773..2bf4d095a7 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1179,6 +1179,26 @@ static int sg_get_max_transfer_length(int fd)
 #endif
 }
 
+static int sg_get_max_segments(int fd)
+{
+/*
+ * /dev/sg* character devices report 'max_segments' via
+ * SG_GET_SG_TABLESIZE ioctl
+ */
+
+#ifdef SG_GET_SG_TABLESIZE
+long max_segments = 0;
+
+if (ioctl(fd, SG_GET_SG_TABLESIZE, &max_segments) == 0) {
+return max_segments;
+} else {
+return -errno;
+}
+#else
+return -ENOSYS;
+#endif
+}
+
 static int get_max_transfer_length(int fd)
 {
 #if defined(BLKSECTGET)
@@ -1265,7 +1285,7 @@ static void hdev_refresh_limits(BlockDriverState *bs, 
Error **errp)
 bs->bl.max_transfer = pow2floor(ret);
 }
 
-ret = get_max_segments(s->fd);
+ret = bs->sg ? sg_get_max_segments(s->fd) : get_max_segments(s->fd);
 if (ret > 0) {
 bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
ret * qemu_real_host_page_size);
-- 
2.26.2




[PATCH v3 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-12-17 Thread Maxim Levitsky
This patch series attempts to provide a solution to the problem of the transfer
limits of the raw file driver (host_device/file-posix), some of which I
already tried to fix in the past.

I included 2 patches from Tom Yan which fix two issues with reading the limits
correctly from the */dev/sg* character devices in the first place.

The only change to these patches is that I tweaked a bit the comments in the
source to better document the /dev/sg quirks.

The other two patches in this series split the max transfer limits that qemu
block devices expose in two:
One limit is for the regular IO, and another is for the SG_IO (aka 
bdrv_*_ioctl),
and the two device drivers (scsi-block and scsi-generic) that use the later
are switched to the new interface.

This should ensure that the raw driver can still advertise the unlimited
transfer  length, unless it is used for SG_IO, because that yields the highest
performance.

Also I include a somewhat unrelated fix to a bug I found in qemu's
SCSI passthrough while testing this:
When qemu emulates the VPD block limit page, for a SCSI device that doesn't
implement it, it doesn't really advertise the emulated page to the guest.

I tested this by doing both regular and SG_IO passthrough of my
USB SD card reader.

That device turned out to be a perfect device for the task, since it has max
transfer size of 1024 blocks (512K), and it enforces it.

Also it didn't implement the VPD block limits page,
(transfer size limit probably comes from something USB related) which triggered
the unrelated bug.

I was able to see IO errors without the patches, and the wrong max transfer
size in the guest, and with patches both issues were gone.

I also found an unrelated issue in /dev/sg passthrough in the kernel.
It turns out that in-kernel driver has a limitation of 16 requests in flight,
regardless of what underlying device supports.

With a large multi-threaded fio job and a debug print in qemu, it is easy to
see it, although the errors don't do much harm to the guest as it retries the
IO, and eventually succeeds.
It is an open question if this should be solved.

V2: fixed an issue in a patch from Tom Yan (thanks), and removed
refactoring from the last patch according to Paolo's request.

V3: a few cosmetic changes due to the review feedback.

Maxim Levitsky (3):
  block: add max_ioctl_transfer to BlockLimits
  block: use blk_get_max_ioctl_transfer for SCSI passthrough
  block/scsi: correctly emulate the VPD block limits page

Tom Yan (2):
  file-posix: split hdev_refresh_limits from raw_refresh_limits
  file-posix: add sg_get_max_segments that actually works with sg

 block/block-backend.c  | 12 ++
 block/file-posix.c | 76 +++---
 block/io.c |  2 +
 block/iscsi.c  |  1 +
 hw/scsi/scsi-generic.c | 13 --
 include/block/block_int.h  |  4 ++
 include/sysemu/block-backend.h |  1 +
 7 files changed, 90 insertions(+), 19 deletions(-)

-- 
2.26.2





[PATCH v3 5/5] block/scsi: correctly emulate the VPD block limits page

2020-12-17 Thread Maxim Levitsky
When the device doesn't support the VPD block limits page, we emulate it even
for SCSI passthrough.

As a part of the emulation we need to add it to the 'Supported VPD Pages' page.

The code that does this adds it to the page, but it doesn't increase the length
of the data to be copied to the guest, so the guest never sees the VPD block
limits page as supported.

Bump the transfer size by 1 in this case.

Signed-off-by: Maxim Levitsky 
---
 hw/scsi/scsi-generic.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/hw/scsi/scsi-generic.c b/hw/scsi/scsi-generic.c
index 6df67bf889..8f43979cbc 100644
--- a/hw/scsi/scsi-generic.c
+++ b/hw/scsi/scsi-generic.c
@@ -134,7 +134,7 @@ static int execute_command(BlockBackend *blk,
 return 0;
 }
 
-static void scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s)
+static int scsi_handle_inquiry_reply(SCSIGenericReq *r, SCSIDevice *s, int len)
 {
 uint8_t page, page_idx;
 
@@ -200,8 +200,13 @@ static void scsi_handle_inquiry_reply(SCSIGenericReq *r, 
SCSIDevice *s)
 r->buf[page_idx] = 0xb0;
 }
 stw_be_p(r->buf + 2, lduw_be_p(r->buf + 2) + 1);
+
+if (len < r->buflen) {
+len++;
+}
 }
 }
+return len;
 }
 
 static int scsi_generic_emulate_block_limits(SCSIGenericReq *r, SCSIDevice *s)
@@ -316,7 +321,7 @@ static void scsi_read_complete(void * opaque, int ret)
 }
 }
 if (r->req.cmd.buf[0] == INQUIRY) {
-scsi_handle_inquiry_reply(r, s);
+len = scsi_handle_inquiry_reply(r, s, len);
 }
 
 req_complete:
-- 
2.26.2




[PATCH v3 3/5] block: add max_ioctl_transfer to BlockLimits

2020-12-17 Thread Maxim Levitsky
The maximum transfer size when accessing a kernel block device is only relevant
when using SCSI passthrough (the SG_IO ioctl), since only in this case are the
requests passed directly to the underlying hardware with no pre-processing.
The same is true when using /dev/sg* character devices (which only support
SG_IO).

Therefore split the block driver's advertised max transfer size into the
regular max transfer size and the max transfer size for SCSI passthrough
(the new max_ioctl_transfer field).

In the next patch, the qemu block drivers that support SCSI passthrough will
set the max_ioctl_transfer field, and at the same time the block devices that
implement SCSI passthrough will switch to 'blk_get_max_ioctl_transfer' to
query it and pass it to the guest.
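
As a rough illustration of the driver side (the real changes land in the
following patch of the series), a driver that supports SG_IO would publish
both limits from its refresh_limits callback; the function and values below
are made up:

    #include "block/block_int.h"

    /* Hypothetical .bdrv_refresh_limits callback, for illustration only. */
    static void example_refresh_limits(BlockDriverState *bs, Error **errp)
    {
        /* limit for the regular read/write path (0 means "no limit") */
        bs->bl.max_transfer = 0;
        /* limit for SG_IO passthrough, e.g. as reported by the hardware */
        bs->bl.max_ioctl_transfer = 512 * 1024;    /* assumed value */
    }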

Signed-off-by: Maxim Levitsky 
---
 block/block-backend.c  | 12 
 block/io.c |  2 ++
 include/block/block_int.h  |  4 
 include/sysemu/block-backend.h |  1 +
 4 files changed, 19 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index ce78d30794..c1d149a755 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -1938,6 +1938,18 @@ uint32_t blk_get_max_transfer(BlockBackend *blk)
 return MIN_NON_ZERO(max, INT_MAX);
 }
 
+/* Returns the maximum transfer length, for SCSI passthrough */
+uint32_t blk_get_max_ioctl_transfer(BlockBackend *blk)
+{
+BlockDriverState *bs = blk_bs(blk);
+uint32_t max = 0;
+
+if (bs) {
+max = bs->bl.max_ioctl_transfer;
+}
+return MIN_NON_ZERO(max, INT_MAX);
+}
+
 int blk_get_max_iov(BlockBackend *blk)
 {
 return blk->root->bs->bl.max_iov;
diff --git a/block/io.c b/block/io.c
index 24205f5168..ac5aea435e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -126,6 +126,8 @@ static void bdrv_merge_limits(BlockLimits *dst, const 
BlockLimits *src)
 {
 dst->opt_transfer = MAX(dst->opt_transfer, src->opt_transfer);
 dst->max_transfer = MIN_NON_ZERO(dst->max_transfer, src->max_transfer);
+dst->max_ioctl_transfer = MIN_NON_ZERO(dst->max_ioctl_transfer,
+src->max_ioctl_transfer);
 dst->opt_mem_alignment = MAX(dst->opt_mem_alignment,
  src->opt_mem_alignment);
 dst->min_mem_alignment = MAX(dst->min_mem_alignment,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1eeafc118c..c59b0aefc4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -686,6 +686,10 @@ typedef struct BlockLimits {
  * clamped down. */
 uint32_t max_transfer;
 
+/* Maximal transfer length for SCSI passthrough (ioctl interface) */
+uint32_t max_ioctl_transfer;
+
+
 /* memory alignment, in bytes so that no bounce buffer is needed */
 size_t min_mem_alignment;
 
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8203d7f6f9..b019a37b7a 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -203,6 +203,7 @@ void blk_eject(BlockBackend *blk, bool eject_flag);
 int blk_get_flags(BlockBackend *blk);
 uint32_t blk_get_request_alignment(BlockBackend *blk);
 uint32_t blk_get_max_transfer(BlockBackend *blk);
+uint32_t blk_get_max_ioctl_transfer(BlockBackend *blk);
 int blk_get_max_iov(BlockBackend *blk);
 void blk_set_guest_block_size(BlockBackend *blk, int align);
 void *blk_try_blockalign(BlockBackend *blk, size_t size);
-- 
2.26.2




[PATCH v3 1/5] file-posix: split hdev_refresh_limits from raw_refresh_limits

2020-12-17 Thread Maxim Levitsky
From: Tom Yan 

We can and should get max transfer length and max segments for all host
devices / cdroms (on Linux).

Also use MIN_NON_ZERO instead of MIN when we clamp the max transfer length
against max segments.

Signed-off-by: Tom Yan 
Signed-off-by: Maxim Levitsky 
---
 block/file-posix.c | 57 +-
 1 file changed, 41 insertions(+), 16 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 9804681d5c..cbf1271773 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1166,6 +1166,10 @@ static int sg_get_max_transfer_length(int fd)
 int max_bytes = 0;
 
 if (ioctl(fd, BLKSECTGET, &max_bytes) == 0) {
+/*
+ * BLKSECTGET for /dev/sg* character devices incorrectly returns
+ * the max transfer size in bytes (rather than in blocks).
+ */
 return max_bytes;
 } else {
 return -errno;
@@ -1175,7 +1179,22 @@ static int sg_get_max_transfer_length(int fd)
 #endif
 }
 
-static int sg_get_max_segments(int fd)
+static int get_max_transfer_length(int fd)
+{
+#if defined(BLKSECTGET)
+int sect = 0;
+
+if (ioctl(fd, BLKSECTGET, &sect) == 0) {
+return sect << 9;
+} else {
+return -errno;
+}
+#else
+return -ENOSYS;
+#endif
+}
+
+static int get_max_segments(int fd)
 {
 #ifdef CONFIG_LINUX
 char buf[32];
@@ -1230,23 +1249,29 @@ static void raw_refresh_limits(BlockDriverState *bs, 
Error **errp)
 {
 BDRVRawState *s = bs->opaque;
 
-if (bs->sg) {
-int ret = sg_get_max_transfer_length(s->fd);
+raw_probe_alignment(bs, s->fd, errp);
+bs->bl.min_mem_alignment = s->buf_align;
+bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
+}
+
+static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
+{
+BDRVRawState *s = bs->opaque;
+
+int ret = bs->sg ? sg_get_max_transfer_length(s->fd) :
+   get_max_transfer_length(s->fd);
 
-if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
-bs->bl.max_transfer = pow2floor(ret);
-}
+if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
+bs->bl.max_transfer = pow2floor(ret);
+}
 
-ret = sg_get_max_segments(s->fd);
-if (ret > 0) {
-bs->bl.max_transfer = MIN(bs->bl.max_transfer,
-  ret * qemu_real_host_page_size);
-}
+ret = get_max_segments(s->fd);
+if (ret > 0) {
+bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
+   ret * qemu_real_host_page_size);
 }
 
-raw_probe_alignment(bs, s->fd, errp);
-bs->bl.min_mem_alignment = s->buf_align;
-bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
+raw_refresh_limits(bs, errp);
 }
 
 static int check_for_dasd(int fd)
@@ -3600,7 +3625,7 @@ static BlockDriver bdrv_host_device = {
 .bdrv_co_pdiscard   = hdev_co_pdiscard,
 .bdrv_co_copy_range_from = raw_co_copy_range_from,
 .bdrv_co_copy_range_to  = raw_co_copy_range_to,
-.bdrv_refresh_limits = raw_refresh_limits,
+.bdrv_refresh_limits = hdev_refresh_limits,
 .bdrv_io_plug = raw_aio_plug,
 .bdrv_io_unplug = raw_aio_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
@@ -3724,7 +3749,7 @@ static BlockDriver bdrv_host_cdrom = {
 .bdrv_co_preadv = raw_co_preadv,
 .bdrv_co_pwritev= raw_co_pwritev,
 .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
-.bdrv_refresh_limits = raw_refresh_limits,
+.bdrv_refresh_limits = hdev_refresh_limits,
 .bdrv_io_plug = raw_aio_plug,
 .bdrv_io_unplug = raw_aio_unplug,
 .bdrv_attach_aio_context = raw_aio_attach_aio_context,
-- 
2.26.2




Re: [PATCH v2 1/5] file-posix: split hdev_refresh_limits from raw_refresh_limits

2020-12-17 Thread Maxim Levitsky
On Thu, 2020-12-10 at 18:36 +0800, Tom Yan wrote:
> On Wed, 9 Dec 2020 at 21:54, Maxim Levitsky  wrote:
> > From: Tom Yan 
> > 
> > We can and should get max transfer length and max segments for all host
> > devices / cdroms (on Linux).
> > 
> > Also use MIN_NON_ZERO instead when we clamp max transfer length against
> > max segments.
> > 
> > Signed-off-by: Tom Yan 
> > Signed-off-by: Maxim Levitsky 
> > ---
> >  block/file-posix.c | 59 +-
> >  1 file changed, 43 insertions(+), 16 deletions(-)
> > 
> > diff --git a/block/file-posix.c b/block/file-posix.c
> > index d5fd1dbcd2..226ddbbdad 100644
> > --- a/block/file-posix.c
> > +++ b/block/file-posix.c
> > @@ -1162,6 +1162,12 @@ static void raw_reopen_abort(BDRVReopenState *state)
> > 
> >  static int sg_get_max_transfer_length(int fd)
> >  {
> > +/*
> > + * BLKSECTGET for /dev/sg* character devices incorrectly returns
> > + * the max transfer size in bytes (rather than in blocks).
> > + * Also note that /dev/sg* doesn't support BLKSSZGET ioctl.
> > + */
> The second statement should be removed. Also maybe it's better to have
> the first one right above the line `return max_bytes;`.
Done, thanks.

Best regards,
Maxim Levitsky

> > +
> >  #ifdef BLKSECTGET
> >  int max_bytes = 0;
> > 
> > @@ -1175,7 +1181,22 @@ static int sg_get_max_transfer_length(int fd)
> >  #endif
> >  }
> > 
> > -static int sg_get_max_segments(int fd)
> > +static int get_max_transfer_length(int fd)
> > +{
> > +#if defined(BLKSECTGET)
> > +int sect = 0;
> > +
> > +if (ioctl(fd, BLKSECTGET, &sect) == 0) {
> > +return sect << 9;
> > +} else {
> > +return -errno;
> > +}
> > +#else
> > +return -ENOSYS;
> > +#endif
> > +}
> > +
> > +static int get_max_segments(int fd)
> >  {
> >  #ifdef CONFIG_LINUX
> >  char buf[32];
> > @@ -1230,23 +1251,29 @@ static void raw_refresh_limits(BlockDriverState 
> > *bs, Error **errp)
> >  {
> >  BDRVRawState *s = bs->opaque;
> > 
> > -if (bs->sg) {
> > -int ret = sg_get_max_transfer_length(s->fd);
> > +raw_probe_alignment(bs, s->fd, errp);
> > +bs->bl.min_mem_alignment = s->buf_align;
> > +bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
> > +}
> > 
> > -if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> > -bs->bl.max_transfer = pow2floor(ret);
> > -}
> > +static void hdev_refresh_limits(BlockDriverState *bs, Error **errp)
> > +{
> > +BDRVRawState *s = bs->opaque;
> > 
> > -ret = sg_get_max_segments(s->fd);
> > -if (ret > 0) {
> > -bs->bl.max_transfer = MIN(bs->bl.max_transfer,
> > -  ret * qemu_real_host_page_size);
> > -}
> > +int ret = bs->sg ? sg_get_max_transfer_length(s->fd) :
> > +   get_max_transfer_length(s->fd);
> > +
> > +if (ret > 0 && ret <= BDRV_REQUEST_MAX_BYTES) {
> > +bs->bl.max_transfer = pow2floor(ret);
> >  }
> > 
> > -raw_probe_alignment(bs, s->fd, errp);
> > -bs->bl.min_mem_alignment = s->buf_align;
> > -bs->bl.opt_mem_alignment = MAX(s->buf_align, qemu_real_host_page_size);
> > +ret = get_max_segments(s->fd);
> > +if (ret > 0) {
> > +bs->bl.max_transfer = MIN_NON_ZERO(bs->bl.max_transfer,
> > +   ret * qemu_real_host_page_size);
> > +}
> > +
> > +raw_refresh_limits(bs, errp);
> >  }
> > 
> >  static int check_for_dasd(int fd)
> > @@ -3601,7 +3628,7 @@ static BlockDriver bdrv_host_device = {
> >  .bdrv_co_pdiscard   = hdev_co_pdiscard,
> >  .bdrv_co_copy_range_from = raw_co_copy_range_from,
> >  .bdrv_co_copy_range_to  = raw_co_copy_range_to,
> > -.bdrv_refresh_limits = raw_refresh_limits,
> > +.bdrv_refresh_limits = hdev_refresh_limits,
> >  .bdrv_io_plug = raw_aio_plug,
> >  .bdrv_io_unplug = raw_aio_unplug,
> >  .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > @@ -3725,7 +3752,7 @@ static BlockDriver bdrv_host_cdrom = {
> >  .bdrv_co_preadv = raw_co_preadv,
> >  .bdrv_co_pwritev= raw_co_pwritev,
> >  .bdrv_co_flush_to_disk  = raw_co_flush_to_disk,
> > -.bdrv_refresh_limits = raw_refresh_limits,
> > +.bdrv_refresh_limits = hdev_refresh_limits,
> >  .bdrv_io_plug = raw_aio_plug,
> >  .bdrv_io_unplug = raw_aio_unplug,
> >  .bdrv_attach_aio_context = raw_aio_attach_aio_context,
> > --
> > 2.26.2
> > 





[PATCH v2 2/2] scripts/gdb: implement 'qemu bt'

2020-12-17 Thread Maxim Levitsky
This script first runs gdb's regular 'bt' command, and then, if we are in a
coroutine, it prints the coroutines' backtraces in the order in which they
were called.
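
A minimal usage sketch, assuming gdb is already attached to a QEMU binary
built with debug info (the command itself is added by this patch):

    (gdb) source scripts/qemu-gdb.py
    (gdb) qemu bt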

Signed-off-by: Maxim Levitsky 
---
 scripts/qemu-gdb.py  |  1 +
 scripts/qemugdb/coroutine.py | 28 +++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/scripts/qemu-gdb.py b/scripts/qemu-gdb.py
index e0bfa7b5a4..4d2a9f6c43 100644
--- a/scripts/qemu-gdb.py
+++ b/scripts/qemu-gdb.py
@@ -40,6 +40,7 @@ timers.TimersCommand()
 
 coroutine.CoroutineSPFunction()
 coroutine.CoroutinePCFunction()
+coroutine.CoroutineBt()
 
 # Default to silently passing through SIGUSR1, because QEMU sends it
 # to itself a lot.
diff --git a/scripts/qemugdb/coroutine.py b/scripts/qemugdb/coroutine.py
index e1399211e6..7db46d4b68 100644
--- a/scripts/qemugdb/coroutine.py
+++ b/scripts/qemugdb/coroutine.py
@@ -88,8 +88,11 @@ def bt_jmpbuf(jmpbuf):
 
 selected_frame.select()
 
+def co_cast(co):
+return co.cast(gdb.lookup_type('CoroutineUContext').pointer())
+
 def coroutine_to_jmpbuf(co):
-coroutine_pointer = co.cast(gdb.lookup_type('CoroutineUContext').pointer())
+coroutine_pointer = co_cast(co)
 return coroutine_pointer['env']['__jmpbuf']
 
 
@@ -107,6 +110,29 @@ class CoroutineCommand(gdb.Command):
 
 bt_jmpbuf(coroutine_to_jmpbuf(gdb.parse_and_eval(argv[0])))
 
+class CoroutineBt(gdb.Command):
+'''Display backtrace including coroutine switches'''
+def __init__(self):
+gdb.Command.__init__(self, 'qemu bt', gdb.COMMAND_STACK,
+ gdb.COMPLETE_NONE)
+
+def invoke(self, arg, from_tty):
+
+gdb.execute("bt")
+
+if gdb.parse_and_eval("qemu_in_coroutine()") == False:
+return
+
+co_ptr = gdb.parse_and_eval("qemu_coroutine_self()")
+
+while True:
+co = co_cast(co_ptr)
+co_ptr = co["base"]["caller"]
+if co_ptr == 0:
+break
+gdb.write("Coroutine at " + str(co_ptr) + ":\n")
+bt_jmpbuf(coroutine_to_jmpbuf(co_ptr))
+
 class CoroutineSPFunction(gdb.Function):
 def __init__(self):
 gdb.Function.__init__(self, 'qemu_coroutine_sp')
-- 
2.26.2




[PATCH v2 1/2] scripts/gdb: fix 'qemu coroutine' when users selects a non topmost stack frame

2020-12-17 Thread Maxim Levitsky
The code that dumps a coroutine's stack frames works like this:
* save the current registers
* overwrite the current registers (including rip/rsp) with the coroutine
  snapshot stored in the jmpbuf
* print the backtrace
* restore the saved registers

If the user has currently selected a non-topmost stack frame in gdb,
the above code will still restore the selected frame's registers,
but gdb will then lose the selected frame index, which makes it impossible
to switch back to frame 0 and continue debugging the executable.

Therefore, temporarily switch to the topmost frame of the stack
while running the above code.

Signed-off-by: Maxim Levitsky 
---
 scripts/qemugdb/coroutine.py | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/scripts/qemugdb/coroutine.py b/scripts/qemugdb/coroutine.py
index db61389022..e1399211e6 100644
--- a/scripts/qemugdb/coroutine.py
+++ b/scripts/qemugdb/coroutine.py
@@ -70,6 +70,11 @@ def bt_jmpbuf(jmpbuf):
 regs = get_jmpbuf_regs(jmpbuf)
 old = dict()
 
+# remember current stack frame and select the topmost
+# so that register modifications don't wreck it
+selected_frame = gdb.selected_frame()
+gdb.newest_frame().select()
+
 for i in regs:
 old[i] = gdb.parse_and_eval('(uint64_t)$%s' % i)
 
@@ -81,6 +86,8 @@ def bt_jmpbuf(jmpbuf):
 for i in regs:
 gdb.execute('set $%s = %s' % (i, old[i]))
 
+selected_frame.select()
+
 def coroutine_to_jmpbuf(co):
 coroutine_pointer = co.cast(gdb.lookup_type('CoroutineUContext').pointer())
 return coroutine_pointer['env']['__jmpbuf']
-- 
2.26.2




[PATCH v2 0/2] Quality of life patches for qemu coroutine debugging

2020-12-17 Thread Maxim Levitsky
These are two patches that slightly improve the qemu gdb scripts with regard to
coroutines.

The first patch fixes a bug in 'qemu coroutine' when it is used on a
non-topmost stack frame.

The second patch implements 'qemu bt', as Stefan suggested.

V2, because after hitting 'send' I noticed a few bugs:
   1. Switch to try/catch around bt, so that when the regular bt fails, we
      print the same message as the original bt.
   2. Print the correct coroutine address

Best regards,
Maxim Levitsky

Maxim Levitsky (2):
  scripts/gdb: fix 'qemu coroutine' when users selects a non topmost
stack frame
  scripts/gdb: implement 'qemu bt'

 scripts/qemu-gdb.py  |  1 +
 scripts/qemugdb/coroutine.py | 35 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

-- 
2.26.2





[PATCH 0/2] Quality of life patches for qemu coroutine debugging

2020-12-17 Thread Maxim Levitsky
These are two patches that slightly improve the qemu gdb scripts with regard to
coroutines.

The first patch fixes a bug in 'qemu coroutine' when it is used on a
non-topmost stack frame.

The second patch implements 'qemu bt', as Stefan suggested.

Best regards,
    Maxim Levitsky

Maxim Levitsky (2):
  scripts/gdb: fix 'qemu coroutine' when users selects a non topmost
stack frame
  scripts/gdb: implement 'qemu bt'

 scripts/qemu-gdb.py  |  1 +
 scripts/qemugdb/coroutine.py | 39 +++-
 2 files changed, 39 insertions(+), 1 deletion(-)

-- 
2.26.2





[PATCH 1/3] scsi: virtio-scsi: don't process IO on fenced dataplane

2020-12-17 Thread Maxim Levitsky
If virtio_scsi_dataplane_start fails, there is a small window in which it drops
the aio lock (in aio_wait_bh_oneshot) and the dataplane's AIO handlers can
still run during that window.

This happens after the dataplane was already marked as fenced, so use that flag
to avoid the handlers doing any IO.

Signed-off-by: Maxim Levitsky 
---
 hw/scsi/virtio-scsi-dataplane.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index b995bab3a2..2824a25b65 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -56,8 +56,10 @@ static bool virtio_scsi_data_plane_handle_cmd(VirtIODevice 
*vdev,
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
-assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_cmd_vq(s, vq);
+if (!s->dataplane_fenced) {
+assert(s->ctx && s->dataplane_started);
+progress = virtio_scsi_handle_cmd_vq(s, vq);
+}
 virtio_scsi_release(s);
 return progress;
 }
@@ -69,8 +71,10 @@ static bool virtio_scsi_data_plane_handle_ctrl(VirtIODevice 
*vdev,
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
-assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_ctrl_vq(s, vq);
+if (!s->dataplane_fenced) {
+assert(s->ctx && s->dataplane_started);
+progress = virtio_scsi_handle_ctrl_vq(s, vq);
+}
 virtio_scsi_release(s);
 return progress;
 }
@@ -82,8 +86,10 @@ static bool virtio_scsi_data_plane_handle_event(VirtIODevice 
*vdev,
 VirtIOSCSI *s = VIRTIO_SCSI(vdev);
 
 virtio_scsi_acquire(s);
-assert(s->ctx && s->dataplane_started);
-progress = virtio_scsi_handle_event_vq(s, vq);
+if (!s->dataplane_fenced) {
+assert(s->ctx && s->dataplane_started);
+progress = virtio_scsi_handle_event_vq(s, vq);
+}
 virtio_scsi_release(s);
 return progress;
 }
-- 
2.26.2




[PATCH 2/2] scripts/gdb: implement 'qemu bt'

2020-12-17 Thread Maxim Levitsky
This script first runs gdb's regular 'bt' command, and then, if we are in a
coroutine, it prints the coroutines' backtraces in the order in which they
were called.

Signed-off-by: Maxim Levitsky 
---
 scripts/qemu-gdb.py  |  1 +
 scripts/qemugdb/coroutine.py | 32 +++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/scripts/qemu-gdb.py b/scripts/qemu-gdb.py
index e0bfa7b5a4..4d2a9f6c43 100644
--- a/scripts/qemu-gdb.py
+++ b/scripts/qemu-gdb.py
@@ -40,6 +40,7 @@ timers.TimersCommand()
 
 coroutine.CoroutineSPFunction()
 coroutine.CoroutinePCFunction()
+coroutine.CoroutineBt()
 
 # Default to silently passing through SIGUSR1, because QEMU sends it
 # to itself a lot.
diff --git a/scripts/qemugdb/coroutine.py b/scripts/qemugdb/coroutine.py
index e1399211e6..39d53a102e 100644
--- a/scripts/qemugdb/coroutine.py
+++ b/scripts/qemugdb/coroutine.py
@@ -88,8 +88,11 @@ def bt_jmpbuf(jmpbuf):
 
 selected_frame.select()
 
+def co_cast(co):
+return co.cast(gdb.lookup_type('CoroutineUContext').pointer())
+
 def coroutine_to_jmpbuf(co):
-coroutine_pointer = co.cast(gdb.lookup_type('CoroutineUContext').pointer())
+coroutine_pointer = co_cast(co)
 return coroutine_pointer['env']['__jmpbuf']
 
 
@@ -107,6 +110,33 @@ class CoroutineCommand(gdb.Command):
 
 bt_jmpbuf(coroutine_to_jmpbuf(gdb.parse_and_eval(argv[0])))
 
+class CoroutineBt(gdb.Command):
+'''Display backtrace including coroutine switches'''
+def __init__(self):
+gdb.Command.__init__(self, 'qemu bt', gdb.COMMAND_STACK,
+ gdb.COMPLETE_NONE)
+
+def invoke(self, arg, from_tty):
+
+gdb.execute("bt")
+
+thread = gdb.selected_thread()
+if thread == None or not thread.is_stopped():
+raise gdb.GdbError("No currrent thread")
+
+if gdb.parse_and_eval("qemu_in_coroutine()") == False:
+return
+
+co_ptr = gdb.parse_and_eval("qemu_coroutine_self()")
+gdb.write("Coroutine at " + str(co_ptr) + ":\n")
+
+while True:
+co = co_cast(co_ptr)
+co_ptr = co["base"]["caller"]
+if co_ptr == 0:
+break
+bt_jmpbuf(coroutine_to_jmpbuf(co_ptr))
+
 class CoroutineSPFunction(gdb.Function):
 def __init__(self):
 gdb.Function.__init__(self, 'qemu_coroutine_sp')
-- 
2.26.2




[PATCH 0/3] RFC: few random hacks to improve eventfd fallback path

2020-12-17 Thread Maxim Levitsky
These few patches are the result of some random hacking I did to make qemu
cope with an eventfd allocation failure when using an iothread,
as happened in bz #1897550.

I am not 100% sure which patches in this series are worth merging, or whether
this can be fixed in a better way.

Even with this patch series applied, qemu still hangs while running the
reproducer for this BZ, due to an ABBA lock inversion that needs some heavy
rework to get rid of. I explained all the (gory) details in the bugzilla.

This patch series was (lightly) tested with make check, the iotests and
the reproducer.

Best regards,
Maxim Levitsky

Maxim Levitsky (3):
  scsi: virtio-scsi: don't process IO on fenced dataplane
  virtio-scsi: don't uninitialize queues that we didn't initialize
  event_notifier: handle initialization failure better

 hw/scsi/virtio-scsi-dataplane.c | 26 +++---
 include/qemu/event_notifier.h   |  1 +
 util/event_notifier-posix.c | 16 
 3 files changed, 36 insertions(+), 7 deletions(-)

-- 
2.26.2





[PATCH 1/2] scripts/gdb: fix 'qemu coroutine' when users selects a non topmost stack frame

2020-12-17 Thread Maxim Levitsky
The code that dumps a coroutine's stack frames works like this:
* save the current registers
* overwrite the current registers (including rip/rsp) with the coroutine
  snapshot stored in the jmpbuf
* print the backtrace
* restore the saved registers

If the user has currently selected a non-topmost stack frame in gdb,
the above code will still restore the selected frame's registers,
but gdb will then lose the selected frame index, which makes it impossible
to switch back to frame 0 and continue debugging the executable.

Therefore, temporarily switch to the topmost frame of the stack
while running the above code.

Signed-off-by: Maxim Levitsky 
---
 scripts/qemugdb/coroutine.py | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/scripts/qemugdb/coroutine.py b/scripts/qemugdb/coroutine.py
index db61389022..e1399211e6 100644
--- a/scripts/qemugdb/coroutine.py
+++ b/scripts/qemugdb/coroutine.py
@@ -70,6 +70,11 @@ def bt_jmpbuf(jmpbuf):
 regs = get_jmpbuf_regs(jmpbuf)
 old = dict()
 
+# remember current stack frame and select the topmost
+# so that register modifications don't wreck it
+selected_frame = gdb.selected_frame()
+gdb.newest_frame().select()
+
 for i in regs:
 old[i] = gdb.parse_and_eval('(uint64_t)$%s' % i)
 
@@ -81,6 +86,8 @@ def bt_jmpbuf(jmpbuf):
 for i in regs:
 gdb.execute('set $%s = %s' % (i, old[i]))
 
+selected_frame.select()
+
 def coroutine_to_jmpbuf(co):
 coroutine_pointer = co.cast(gdb.lookup_type('CoroutineUContext').pointer())
 return coroutine_pointer['env']['__jmpbuf']
-- 
2.26.2




[PATCH 2/3] virtio-scsi: don't uninitialize queues that we didn't initialize

2020-12-17 Thread Maxim Levitsky
Count the number of queues that we initialized and only deinitialize those
that we initialized successfully.

Signed-off-by: Maxim Levitsky 
---
 hw/scsi/virtio-scsi-dataplane.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index 2824a25b65..cd1e72d3f8 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -132,6 +132,7 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 {
 int i;
 int rc;
+int vq_init_count = 0;
 BusState *qbus = qdev_get_parent_bus(DEVICE(vdev));
 VirtioBusClass *k = VIRTIO_BUS_GET_CLASS(qbus);
 VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(vdev);
@@ -159,17 +160,22 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 if (rc) {
 goto fail_vrings;
 }
+
+vq_init_count++;
 rc = virtio_scsi_vring_init(s, vs->event_vq, 1,
 virtio_scsi_data_plane_handle_event);
 if (rc) {
 goto fail_vrings;
 }
+
+vq_init_count++;
 for (i = 0; i < vs->conf.num_queues; i++) {
 rc = virtio_scsi_vring_init(s, vs->cmd_vqs[i], i + 2,
 virtio_scsi_data_plane_handle_cmd);
 if (rc) {
 goto fail_vrings;
 }
+vq_init_count++;
 }
 
 s->dataplane_starting = false;
@@ -180,7 +186,7 @@ int virtio_scsi_dataplane_start(VirtIODevice *vdev)
 fail_vrings:
 aio_wait_bh_oneshot(s->ctx, virtio_scsi_dataplane_stop_bh, s);
 aio_context_release(s->ctx);
-for (i = 0; i < vs->conf.num_queues + 2; i++) {
+for (i = 0; i < vq_init_count; i++) {
 virtio_bus_set_host_notifier(VIRTIO_BUS(qbus), i, false);
 virtio_bus_cleanup_host_notifier(VIRTIO_BUS(qbus), i);
 }
-- 
2.26.2




[PATCH 3/3] event_notifier: handle initialization failure better

2020-12-17 Thread Maxim Levitsky
Add an 'initialized' field and use it to avoid touching event notifiers which
are either not initialized or whose initialization failed.

This is somewhat of a hack, but it seems the least intrusive way to make the
virtio code deal with event notifiers that failed initialization.
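
As a hypothetical illustration of what the flag buys (assuming the
EventNotifier lives in zero-initialized memory, which is the usual case for
these callers): error paths no longer need to track which notifiers were
actually created, because cleanup becomes a no-op for the ones that were not:

    #include "qemu/event_notifier.h"

    /* Illustration only, not part of the patch. */
    static void example_error_path(void)
    {
        EventNotifier n = { 0 };   /* zeroed, so 'initialized' starts false */

        if (event_notifier_init(&n, 0) < 0) {
            /* init failed: 'initialized' was never set to true */
        }

        /* Safe either way after this patch: a no-op unless init succeeded. */
        event_notifier_cleanup(&n);
    }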

Signed-off-by: Maxim Levitsky 
---
 include/qemu/event_notifier.h |  1 +
 util/event_notifier-posix.c   | 16 
 2 files changed, 17 insertions(+)

diff --git a/include/qemu/event_notifier.h b/include/qemu/event_notifier.h
index 3380b662f3..b79add035d 100644
--- a/include/qemu/event_notifier.h
+++ b/include/qemu/event_notifier.h
@@ -24,6 +24,7 @@ struct EventNotifier {
 #else
 int rfd;
 int wfd;
+bool initialized;
 #endif
 };
 
diff --git a/util/event_notifier-posix.c b/util/event_notifier-posix.c
index 00d93204f9..5b2110e861 100644
--- a/util/event_notifier-posix.c
+++ b/util/event_notifier-posix.c
@@ -29,6 +29,7 @@ void event_notifier_init_fd(EventNotifier *e, int fd)
 {
 e->rfd = fd;
 e->wfd = fd;
+e->initialized = true;
 }
 #endif
 
@@ -68,6 +69,7 @@ int event_notifier_init(EventNotifier *e, int active)
 if (active) {
 event_notifier_set(e);
 }
+e->initialized = true;
 return 0;
 
 fail:
@@ -78,12 +80,18 @@ fail:
 
 void event_notifier_cleanup(EventNotifier *e)
 {
+if (!e->initialized) {
+return;
+}
+
 if (e->rfd != e->wfd) {
 close(e->rfd);
 }
+
 e->rfd = -1;
 close(e->wfd);
 e->wfd = -1;
+e->initialized = false;
 }
 
 int event_notifier_get_fd(const EventNotifier *e)
@@ -96,6 +104,10 @@ int event_notifier_set(EventNotifier *e)
 static const uint64_t value = 1;
 ssize_t ret;
 
+if (!e->initialized) {
+return -1;
+}
+
 do {
 ret = write(e->wfd, &value, sizeof(value));
 } while (ret < 0 && errno == EINTR);
@@ -113,6 +125,10 @@ int event_notifier_test_and_clear(EventNotifier *e)
 ssize_t len;
 char buffer[512];
 
+if (!e->initialized) {
+return 0;
+}
+
 /* Drain the notify pipe.  For eventfd, only 8 bytes will be read.  */
 value = 0;
 do {
-- 
2.26.2




[PATCH 1/1] scsi: fix device removal race vs IO restart callback on resume

2020-12-10 Thread Maxim Levitsky
There is a (mostly theoretical) race between the removal of a scsi device and
scsi_dma_restart_bh.

It used to be easier to hit this race prior to my / Paolo's patch series
that added RCU to the scsi bus device handling code, but IMHO this race
should still be possible to hit, at least in theory.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1854811

Fix it anyway with the patch that Paolo proposed in the above bugzilla.

Suggested-by: Paolo Bonzini 
Signed-off-by: Maxim Levitsky 
---
 hw/scsi/scsi-bus.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index b901e701f0..edb5c3492a 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -170,6 +170,8 @@ static void scsi_dma_restart_bh(void *opaque)
 scsi_req_unref(req);
 }
 aio_context_release(blk_get_aio_context(s->conf.blk));
+/* Drop the reference that was acquired in scsi_dma_restart_cb */
+object_unref(OBJECT(s));
 }
 
 void scsi_req_retry(SCSIRequest *req)
@@ -188,6 +190,8 @@ static void scsi_dma_restart_cb(void *opaque, int running, 
RunState state)
 }
 if (!s->bh) {
 AioContext *ctx = blk_get_aio_context(s->conf.blk);
+/* The reference is dropped in scsi_dma_restart_bh.*/
+object_ref(OBJECT(s));
 s->bh = aio_bh_new(ctx, scsi_dma_restart_bh, s);
 qemu_bh_schedule(s->bh);
 }
-- 
2.26.2




[PATCH 0/1] Fix for one more race on scsi device removal

2020-12-10 Thread Maxim Levitsky
This is a patch that Paolo suggested to fix bz #1854811
(https://bugzilla.redhat.com/show_bug.cgi?id=1854811).

I think that the race described in this bug can still happen (in theory),
so it is better to plug it with a refcount as suggested.

Best regards,
Maxim Levitsky

Maxim Levitsky (1):
  scsi: fix device removal race vs IO restart callback on resume

 hw/scsi/scsi-bus.c | 4 
 1 file changed, 4 insertions(+)

-- 
2.26.2





Re: [PATCH v5 2/4] block: add bdrv_co_delete_file_noerr

2020-12-10 Thread Maxim Levitsky
On Thu, 2020-12-10 at 13:54 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.12.2020 23:33, Maxim Levitsky wrote:
> > This function wraps bdrv_co_delete_file for the common case of removing a 
> > file,
> > which was just created by format driver, on an error condition.
> > 
> > It hides the -ENOTSUPP error, and reports all other errors otherwise.
> 
> I've looked at original commit added this logic, and didn't find exactly, why 
> should we hide it..
> 
> > Signed-off-by: Maxim Levitsky 
> > Reviewed-by: Alberto Garcia 
> > ---
> >   block.c   | 24 
> >   include/block/block.h |  1 +
> >   2 files changed, 25 insertions(+)
> > 
> > diff --git a/block.c b/block.c
> > index f1cedac362..5d35ba2fb8 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -704,6 +704,30 @@ int coroutine_fn bdrv_co_delete_file(BlockDriverState 
> > *bs, Error **errp)
> >   return ret;
> >   }
> >   
> > +void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
> > +{
> > +Error *local_err = NULL;
> > +int ret;
> > +
> > +if (!bs) {
> > +return;
> > +}
> > +
> > +ret = bdrv_co_delete_file(bs, &local_err);
> > +/*
> > + * ENOTSUP will happen if the block driver doesn't support
> > + * the 'bdrv_co_delete_file' interface. This is a predictable
> > + * scenario and shouldn't be reported back to the user.
> > + */
> 
> It's not predictable for user: user doesn't know internal processes and 
> interfaces.
> 
> The problem: we've left extra file on failure scenario and can't remove it. 
> We want to warn the user that there is a wrong file left. Why not to warn, 
> when the removement is unsupported internally by the driver?
> 
> I think, it's better to report it in any case.

This is a good point, but it is not something I changed.
I prefer this to be changed in a separate patch.


> 
> Another reason: less code and logic on failure case. Optimizing failure 
> scenarios in different manner is seldom a good thing to do.
> 
> And finally: why to report the error at all? If we have errp parameter, it's 
> better to add the info to it using error_append.. something like 
> error_append(errp, " (also failed to remove unfinished file %s: %s)", 
> file_name, error_get_pretty(local_err))

The idea behind this was to reduce code duplication in the callers of this 
function,
and this is why this function doesn't even have the 'errp' parameter.
I'll think about it.

> 
> 
> Probably I'm making mountains out of molehills. It should work, so take my 
> r-b anyway:
> 
> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> 
> Side note: I'd not split patches 02 and 03: this way it would be simpler to 
> see the code movement.
> 
> > +if (ret == -ENOTSUP) {
> > +error_free(local_err);
> > +} else if (ret < 0) {
> > +error_report_err(local_err);
> > +}
> > +}
> > +
> > +
> > +
> 
> three empty lines..

Will fix.
> 
> >   /**
> >* Try to get @bs's logical and physical block size.
> >* On success, store them in @bsz struct and return 0.
> > diff --git a/include/block/block.h b/include/block/block.h
> > index c9d7c58765..af03022723 100644
> > --- a/include/block/block.h
> > +++ b/include/block/block.h
> > @@ -428,6 +428,7 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, 
> > BlockDriverState *base,
> > Error **errp);
> >   void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState 
> > *base);
> >   int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
> > +void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
> >   
> >   
> >   typedef struct BdrvCheckResult {
> > 

Thanks a lot for the review!

Best regards,
Maxim Levitsky

> 
> 





Re: [PATCH v5 4/4] block: qcow2: remove the created file on initialization error

2020-12-10 Thread Maxim Levitsky
On Thu, 2020-12-10 at 14:00 +0300, Vladimir Sementsov-Ogievskiy wrote:
> 09.12.2020 23:33, Maxim Levitsky wrote:
> > If the qcow initialization fails, we should remove the file if it was
> > already created, to avoid leaving stale files around.
> > 
> > We already do this for luks raw images.
> > 
> > Signed-off-by: Maxim Levitsky 
> > Reviewed-by: Alberto Garcia 
> > ---
> >   block/qcow2.c | 6 --
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/qcow2.c b/block/qcow2.c
> > index 3a90ef2786..68c9182f92 100644
> > --- a/block/qcow2.c
> > +++ b/block/qcow2.c
> > @@ -3847,12 +3847,14 @@ static int coroutine_fn 
> > qcow2_co_create_opts(BlockDriver *drv,
> >   
> >   /* Create the qcow2 image (format layer) */
> >   ret = qcow2_co_create(create_options, errp);
> > +finish:
> >   if (ret < 0) {
> > -goto finish;
> > +bdrv_co_delete_file_noerr(bs);
> > +bdrv_co_delete_file_noerr(data_bs);
> >   }
> >   
> >   ret = 0;
> 
> Ooops :) After this patch the function always returns zero :)

Indeed :-(

These small last-minute changes are the most dangerous.

Best regards,
Maxim Levitsky
> 
> > -finish:
> > +
> >   qobject_unref(qdict);
> >   bdrv_unref(bs);
> >   bdrv_unref(data_bs);
> > 
> 
> 





Re: [PATCH v2 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-12-10 Thread Maxim Levitsky
On Thu, 2020-12-10 at 01:09 +0100, Paolo Bonzini wrote:
> On 09/12/20 15:03, no-re...@patchew.org wrote:
> > Patchew URL: 
> > https://patchew.org/QEMU/20201209135355.561745-1-mlevi...@redhat.com/
> > 
> > 
> > 
> > Hi,
> > 
> > This series seems to have some coding style problems. See output below for
> > more information:
> > 
> > Type: series
> > Message-id: 20201209135355.561745-1-mlevi...@redhat.com
> > Subject: [PATCH v2 0/5] SCSI: fix transfer limits for SCSI passthrough
> > 
> > === TEST SCRIPT BEGIN ===
> > #!/bin/bash
> > git rev-parse base > /dev/null || exit 0
> > git config --local diff.renamelimit 0
> > git config --local diff.renames True
> > git config --local diff.algorithm histogram
> > ./scripts/checkpatch.pl --mailback base..
> > === TEST SCRIPT END ===
> > 
> > Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> >  From https://github.com/patchew-project/qemu
> >   * [new tag] patchew/20201209135355.561745-1-mlevi...@redhat.com 
> > -> patchew/20201209135355.561745-1-mlevi...@redhat.com
> > Switched to a new branch 'test'
> > 77c9000 block/scsi: correctly emulate the VPD block limits page
> > 61f49e1 block: use blk_get_max_ioctl_transfer for SCSI passthrough
> > 35c66d6 block: add max_ioctl_transfer to BlockLimits
> > 08ba263 file-posix: add sg_get_max_segments that actually works with sg
> > e9fd749 file-posix: split hdev_refresh_limits from raw_refresh_limits
> > 
> > === OUTPUT BEGIN ===
> > 1/5 Checking commit e9fd7498060c (file-posix: split hdev_refresh_limits 
> > from raw_refresh_limits)
> > 2/5 Checking commit 08ba263f565d (file-posix: add sg_get_max_segments that 
> > actually works with sg)
> > 3/5 Checking commit 35c66d636d83 (block: add max_ioctl_transfer to 
> > BlockLimits)
> > 4/5 Checking commit 61f49e1c953b (block: use blk_get_max_ioctl_transfer for 
> > SCSI passthrough)
> > 5/5 Checking commit 77c9000b7c30 (block/scsi: correctly emulate the VPD 
> > block limits page)
> > ERROR: braces {} are necessary for all arms of this statement
> > #39: FILE: hw/scsi/scsi-generic.c:204:
> > +if (len < r->buflen)
> > [...]
> > 
> > total: 1 errors, 0 warnings, 28 lines checked
> > 
> > Patch 5/5 has style problems, please review.  If any of these errors
> > are false positives report them to the maintainer, see
> > CHECKPATCH in MAINTAINERS.
> > 
> > === OUTPUT END ===
> > 
> > Test command exited with code: 1
> > 
> > 
> > The full log is available at
> > http://patchew.org/logs/20201209135355.561745-1-mlevi...@redhat.com/testing.checkpatch/?type=message.
> > ---
> > Email generated automatically by Patchew [https://patchew.org/].
> > Please send your feedback to patchew-de...@redhat.com
> > 
> 
> Time for v3?

I am waiting a bit to see if anything else pops up,
to avoid making too much noise.


Best regards,
Maxim Levitsky
> 
> Paolo
> 





[PATCH v5 3/4] crypto: luks: use bdrv_co_delete_file_noerr

2020-12-09 Thread Maxim Levitsky
This refactoring is now possible thanks to the new bdrv_co_delete_file_noerr
function.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block/crypto.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index b3a5275132..1d30fde38e 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -725,19 +725,8 @@ fail:
  * If an error occurred, delete 'filename'. Even if the file existed
  * beforehand, it has been truncated and corrupted in the process.
  */
-if (ret && bs) {
-Error *local_delete_err = NULL;
-int r_del = bdrv_co_delete_file(bs, &local_delete_err);
-/*
- * ENOTSUP will happen if the block driver doesn't support
- * the 'bdrv_co_delete_file' interface. This is a predictable
- * scenario and shouldn't be reported back to the user.
- */
-if ((r_del < 0) && (r_del != -ENOTSUP)) {
-error_report_err(local_delete_err);
-} else {
-error_free(local_delete_err);
-}
+if (ret) {
+bdrv_co_delete_file_noerr(bs);
 }
 
 bdrv_unref(bs);
-- 
2.26.2




[PATCH v5 2/4] block: add bdrv_co_delete_file_noerr

2020-12-09 Thread Maxim Levitsky
This function wraps bdrv_co_delete_file for the common case of removing a file
that was just created by a format driver, on an error condition.

It hides the -ENOTSUP error and reports all other errors.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block.c   | 24 
 include/block/block.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/block.c b/block.c
index f1cedac362..5d35ba2fb8 100644
--- a/block.c
+++ b/block.c
@@ -704,6 +704,30 @@ int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, 
Error **errp)
 return ret;
 }
 
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
+{
+Error *local_err = NULL;
+int ret;
+
+if (!bs) {
+return;
+}
+
+ret = bdrv_co_delete_file(bs, &local_err);
+/*
+ * ENOTSUP will happen if the block driver doesn't support
+ * the 'bdrv_co_delete_file' interface. This is a predictable
+ * scenario and shouldn't be reported back to the user.
+ */
+if (ret == -ENOTSUP) {
+error_free(local_err);
+} else if (ret < 0) {
+error_report_err(local_err);
+}
+}
+
+
+
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz struct and return 0.
diff --git a/include/block/block.h b/include/block/block.h
index c9d7c58765..af03022723 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -428,6 +428,7 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, 
BlockDriverState *base,
   Error **errp);
 void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
 int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
 
 
 typedef struct BdrvCheckResult {
-- 
2.26.2




[PATCH v5 1/4] crypto: luks: Fix tiny memory leak

2020-12-09 Thread Maxim Levitsky
When the underlying block device doesn't support the
bdrv_co_delete_file interface, an 'Error' object is leaked.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block/crypto.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/crypto.c b/block/crypto.c
index aef5a5721a..b3a5275132 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -735,6 +735,8 @@ fail:
  */
 if ((r_del < 0) && (r_del != -ENOTSUP)) {
 error_report_err(local_delete_err);
+} else {
+error_free(local_delete_err);
 }
 }
 
-- 
2.26.2




[PATCH v5 0/4] qcow2: don't leave partially initialized file on image creation

2020-12-09 Thread Maxim Levitsky
Use the bdrv_co_delete_file interface to delete the underlying
file if qcow2 initialization fails (e.g. due to a bad encryption secret).

This makes the qcow2 driver behave the same way as the luks driver does.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1845353

V3: addressed review feedback and reworked commit messages

V4: got rid of code duplication by adding bdrv_co_delete_file_noerr
and made the qcow2 driver use this function to delete
both the main and the data file.

V5: addressed review feedback on the reworked version.

Best regards,
Maxim Levitsky

Maxim Levitsky (4):
  crypto: luks: Fix tiny memory leak
  block: add bdrv_co_delete_file_noerr
  crypto: luks: use bdrv_co_delete_file_noerr
  block: qcow2: remove the created file on initialization error

 block.c   | 24 
 block/crypto.c| 13 ++---
 block/qcow2.c |  6 --
 include/block/block.h |  1 +
 4 files changed, 31 insertions(+), 13 deletions(-)

-- 
2.26.2





[PATCH v5 4/4] block: qcow2: remove the created file on initialization error

2020-12-09 Thread Maxim Levitsky
If the qcow2 initialization fails, we should remove the file if it was
already created, to avoid leaving stale files around.

We already do this for luks raw images.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block/qcow2.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 3a90ef2786..68c9182f92 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3847,12 +3847,14 @@ static int coroutine_fn 
qcow2_co_create_opts(BlockDriver *drv,
 
 /* Create the qcow2 image (format layer) */
 ret = qcow2_co_create(create_options, errp);
+finish:
 if (ret < 0) {
-goto finish;
+bdrv_co_delete_file_noerr(bs);
+bdrv_co_delete_file_noerr(data_bs);
 }
 
 ret = 0;
-finish:
+
 qobject_unref(qdict);
 bdrv_unref(bs);
 bdrv_unref(data_bs);
-- 
2.26.2




Re: [PATCH v4 4/4] block: qcow2: remove the created file on initialization error

2020-12-09 Thread Maxim Levitsky
On Wed, 2020-12-09 at 18:41 +0100, Alberto Garcia wrote:
> On Wed 09 Dec 2020 05:44:41 PM CET, Maxim Levitsky wrote:
> > @@ -3847,12 +3847,13 @@ static int coroutine_fn 
> > qcow2_co_create_opts(BlockDriver *drv,
> >  
> >  /* Create the qcow2 image (format layer) */
> >  ret = qcow2_co_create(create_options, errp);
> > +
> > +finish:
> >  if (ret < 0) {
> > -goto finish;
> > +bdrv_co_delete_file_noerr(bs);
> > +bdrv_co_delete_file_noerr(data_bs);
> >  }
> >  
> > -ret = 0;
> 
> Many/most functions in qcow2.c force ret to be 0 on success, we could
> also keep that here (although in practice I don't think that ret can be
> greater than 0 in this case, or that the caller would care).

I also noticed this when I was sending the patches, and I wasn't sure
whether I wanted to keep that 'ret = 0' or not.
I will add it back.

Best regards,
Maxim Levitsky

> 
> Either way,
> 
> Reviewed-by: Alberto Garcia 
> 
> Berto
> 





Re: [PATCH v4 2/4] block: add bdrv_co_delete_file_noerr

2020-12-09 Thread Maxim Levitsky
On Wed, 2020-12-09 at 18:34 +0100, Alberto Garcia wrote:
> On Wed 09 Dec 2020 05:44:39 PM CET, Maxim Levitsky wrote:
> > +void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
> > +{
> > +Error *local_err = NULL;
> > +
> > +if (!bs) {
> > +return;
> > +}
> > +
> > +int ret = bdrv_co_delete_file(bs, &local_err);
>^^^
> 
> According to the QEMU coding style we should not have declarations in
> the middle of a block.

Oops!

I will send the next version now.

Thanks a lot for the review!

Best regards,
Maxim Levitsky

> 
> The patch looks otherwise fine.
> 
> Reviewed-by: Alberto Garcia 
> 
> Berto
> 





[PATCH v4 3/4] crypto: luks: use bdrv_co_delete_file_noerr

2020-12-09 Thread Maxim Levitsky
This refactoring is now possible thanks to the new bdrv_co_delete_file_noerr
function.

Signed-off-by: Maxim Levitsky 
---
 block/crypto.c | 15 ++-
 1 file changed, 2 insertions(+), 13 deletions(-)

diff --git a/block/crypto.c b/block/crypto.c
index b3a5275132..1d30fde38e 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -725,19 +725,8 @@ fail:
  * If an error occurred, delete 'filename'. Even if the file existed
  * beforehand, it has been truncated and corrupted in the process.
  */
-if (ret && bs) {
-Error *local_delete_err = NULL;
-int r_del = bdrv_co_delete_file(bs, &local_delete_err);
-/*
- * ENOTSUP will happen if the block driver doesn't support
- * the 'bdrv_co_delete_file' interface. This is a predictable
- * scenario and shouldn't be reported back to the user.
- */
-if ((r_del < 0) && (r_del != -ENOTSUP)) {
-error_report_err(local_delete_err);
-} else {
-error_free(local_delete_err);
-}
+if (ret) {
+bdrv_co_delete_file_noerr(bs);
 }
 
 bdrv_unref(bs);
-- 
2.26.2




[PATCH v4 2/4] block: add bdrv_co_delete_file_noerr

2020-12-09 Thread Maxim Levitsky
This function wraps bdrv_co_delete_file for the common case of removing a file
that was just created by a format driver, on an error condition.

It hides the -ENOTSUP error and reports all other errors.

Signed-off-by: Maxim Levitsky 
---
 block.c   | 23 +++
 include/block/block.h |  1 +
 2 files changed, 24 insertions(+)

diff --git a/block.c b/block.c
index f1cedac362..57e6d9750a 100644
--- a/block.c
+++ b/block.c
@@ -704,6 +704,29 @@ int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, 
Error **errp)
 return ret;
 }
 
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs)
+{
+Error *local_err = NULL;
+
+if (!bs) {
+return;
+}
+
+int ret = bdrv_co_delete_file(bs, &local_err);
+/*
+ * ENOTSUP will happen if the block driver doesn't support
+ * the 'bdrv_co_delete_file' interface. This is a predictable
+ * scenario and shouldn't be reported back to the user.
+ */
+if (ret == -ENOTSUP) {
+error_free(local_err);
+} else if (ret < 0) {
+error_report_err(local_err);
+}
+}
+
+
+
 /**
  * Try to get @bs's logical and physical block size.
  * On success, store them in @bsz struct and return 0.
diff --git a/include/block/block.h b/include/block/block.h
index c9d7c58765..af03022723 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -428,6 +428,7 @@ int bdrv_freeze_backing_chain(BlockDriverState *bs, 
BlockDriverState *base,
   Error **errp);
 void bdrv_unfreeze_backing_chain(BlockDriverState *bs, BlockDriverState *base);
 int coroutine_fn bdrv_co_delete_file(BlockDriverState *bs, Error **errp);
+void coroutine_fn bdrv_co_delete_file_noerr(BlockDriverState *bs);
 
 
 typedef struct BdrvCheckResult {
-- 
2.26.2




[PATCH v4 4/4] block: qcow2: remove the created file on initialization error

2020-12-09 Thread Maxim Levitsky
If the qcow2 initialization fails, we should remove the file if it was
already created, to avoid leaving stale files around.

We already do this for luks raw images.

Signed-off-by: Maxim Levitsky 
---
 block/qcow2.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 3a90ef2786..b5169b7cad 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3847,12 +3847,13 @@ static int coroutine_fn 
qcow2_co_create_opts(BlockDriver *drv,
 
 /* Create the qcow2 image (format layer) */
 ret = qcow2_co_create(create_options, errp);
+
+finish:
 if (ret < 0) {
-goto finish;
+bdrv_co_delete_file_noerr(bs);
+bdrv_co_delete_file_noerr(data_bs);
 }
 
-ret = 0;
-finish:
 qobject_unref(qdict);
 bdrv_unref(bs);
 bdrv_unref(data_bs);
-- 
2.26.2




[PATCH v4 1/4] crypto: luks: Fix tiny memory leak

2020-12-09 Thread Maxim Levitsky
When the underlying block device doesn't support the
bdrv_co_delete_file interface, an 'Error' object is leaked.

Signed-off-by: Maxim Levitsky 
Reviewed-by: Alberto Garcia 
---
 block/crypto.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/crypto.c b/block/crypto.c
index aef5a5721a..b3a5275132 100644
--- a/block/crypto.c
+++ b/block/crypto.c
@@ -735,6 +735,8 @@ fail:
  */
 if ((r_del < 0) && (r_del != -ENOTSUP)) {
 error_report_err(local_delete_err);
+} else {
+error_free(local_delete_err);
 }
 }
 
-- 
2.26.2




[PATCH v4 0/4] qcow2: don't leave partially initialized file on image creation

2020-12-09 Thread Maxim Levitsky
Use the bdrv_co_delete_file interface to delete the underlying
file if qcow2 initialization fails (e.g. due to a bad encryption secret).

This makes the qcow2 driver behave the same way as the luks driver does.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1845353

V3: addressed review feedback and reworked commit messages

V4: got rid of code duplication by adding bdrv_co_delete_file_noerr
and made the qcow2 driver use this function to delete
both the main and the data file.

Best regards,
Maxim Levitsky

Maxim Levitsky (4):
  crypto: luks: Fix tiny memory leak
  block: add bdrv_co_delete_file_noerr
  crypto: luks: use bdrv_co_delete_file_noerr
  block: qcow2: remove the created file on initialization error

 block.c   | 23 +++
 block/crypto.c| 13 ++---
 block/qcow2.c |  7 ---
 include/block/block.h |  1 +
 4 files changed, 30 insertions(+), 14 deletions(-)

-- 
2.26.2





Re: [PATCH v2 0/5] SCSI: fix transfer limits for SCSI passthrough

2020-12-09 Thread Maxim Levitsky
On Wed, 2020-12-09 at 06:03 -0800, no-re...@patchew.org wrote:
> Patchew URL: 
> https://patchew.org/QEMU/20201209135355.561745-1-mlevi...@redhat.com/
> 
> 
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 20201209135355.561745-1-mlevi...@redhat.com
> Subject: [PATCH v2 0/5] SCSI: fix transfer limits for SCSI passthrough
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> git rev-parse base > /dev/null || exit 0
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> ./scripts/checkpatch.pl --mailback base..
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> From https://github.com/patchew-project/qemu
>  * [new tag] patchew/20201209135355.561745-1-mlevi...@redhat.com -> 
> patchew/20201209135355.561745-1-mlevi...@redhat.com
> Switched to a new branch 'test'
> 77c9000 block/scsi: correctly emulate the VPD block limits page
> 61f49e1 block: use blk_get_max_ioctl_transfer for SCSI passthrough
> 35c66d6 block: add max_ioctl_transfer to BlockLimits
> 08ba263 file-posix: add sg_get_max_segments that actually works with sg
> e9fd749 file-posix: split hdev_refresh_limits from raw_refresh_limits
> 
> === OUTPUT BEGIN ===
> 1/5 Checking commit e9fd7498060c (file-posix: split hdev_refresh_limits from 
> raw_refresh_limits)
> 2/5 Checking commit 08ba263f565d (file-posix: add sg_get_max_segments that 
> actually works with sg)
> 3/5 Checking commit 35c66d636d83 (block: add max_ioctl_transfer to 
> BlockLimits)
> 4/5 Checking commit 61f49e1c953b (block: use blk_get_max_ioctl_transfer for 
> SCSI passthrough)
> 5/5 Checking commit 77c9000b7c30 (block/scsi: correctly emulate the VPD block 
> limits page)
> ERROR: braces {} are necessary for all arms of this statement
> #39: FILE: hw/scsi/scsi-generic.c:204:
> +if (len < r->buflen)
+1 Good bot :-)

Best regards,
Maxim Levitsky

> [...]
> 
> total: 1 errors, 0 warnings, 28 lines checked
> 
> Patch 5/5 has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 
> === OUTPUT END ===
> 
> Test command exited with code: 1
> 
> 
> The full log is available at
> http://patchew.org/logs/20201209135355.561745-1-mlevi...@redhat.com/testing.checkpatch/?type=message.
> ---
> Email generated automatically by Patchew [https://patchew.org/].
> Please send your feedback to patchew-de...@redhat.com




