Re: driver type raw-xz supports discard=unmap?

2022-07-25 Thread Chris Murphy



On Mon, Jul 25, 2022, at 9:53 AM, Daniel P. Berrangé wrote:
> On Mon, Jul 25, 2022 at 08:51:42AM -0400, Chris Murphy wrote:

>> Huh, interesting. I have no idea then. I just happened to notice it in the 
>> (libvirt) XML config that's used by oz.
>> https://kojipkgs.fedoraproject.org//packages/Fedora-Workstation/Rawhide/20220721.n.0/images/libvirt-raw-xz-aarch64.xml
>
> I don't see 'raw-xz' mentioned anywhere in the Oz code at
>
>   https://github.com/clalancette/oz
>
> was it a fork that's being used ?

Must be. I'm not seeing it in the oz or imagefactory source either.

>> I've got no idea what happens if an invalid type is specified in
>> the config. The VMs are definitely running despite this. I'll ask
>> the oz devs.
>
> This is pretty surprising if they're actually running as it should
> cause a fatal error message
>
> error: unsupported configuration: unknown driver format value 'raw-xz'

Yep, I'm lost. I guess it's rabbit hole or yak shaving time.

-- 
Chris Murphy



Re: driver type raw-xz supports discard=unmap?

2022-07-25 Thread Chris Murphy



On Mon, Jul 25, 2022, at 5:13 AM, Daniel P. Berrangé wrote:
> On Fri, Jul 22, 2022 at 04:03:52PM -0400, Chris Murphy wrote:
>> Is this valid?
>> 
>> `<driver name="qemu" type="raw-xz" discard="unmap"/>`
>> 
>> I know type="raw" works fine; I'm wondering if there'd be any problem
>> with type "raw-xz" combined with discards?
>
> This is libvirt configuration, so libvirt-us...@redhat.com is the better
> list in general. That said, what is this 'raw-xz' type you refer to ?
>
> AFAIK, that is not a disk driver type that exists in either libvirt or
> QEMU releases.

Huh, interesting. I have no idea then. I just happened to notice it in the 
(libvirt) XML config that's used by oz.
https://kojipkgs.fedoraproject.org//packages/Fedora-Workstation/Rawhide/20220721.n.0/images/libvirt-raw-xz-aarch64.xml

When I manually modify a virt-manager-created config to change "raw" to 
"raw-xz", I get an error:

# virsh edit uefivm
error: XML document failed to validate against schema: Unable to validate doc against /usr/share/libvirt/schemas/domain.rng
Extra element devices in interleave
Element domain failed to validate content

Failed. Try again? [y,n,i,f,?]: 
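
As an aside, the same schema check can be run against a standalone XML file
with libvirt's virt-xml-validate tool, e.g. against the oz-generated file
from the koji URL above (a rough sketch, assuming libvirt's client tools
are installed):

$ virt-xml-validate libvirt-raw-xz-aarch64.xml

It picks the matching schema under /usr/share/libvirt/schemas/ based on the
document's root element.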

I've got no idea what happens if an invalid type is specified in the config. 
The VMs are definitely running despite this. I'll ask the oz devs.



driver type raw-xz supports discard=unmap?

2022-07-22 Thread Chris Murphy
Is this valid?

`<driver name="qemu" type="raw-xz" discard="unmap"/>`

I know type="raw" works fine; I'm wondering if there'd be any problem with
type "raw-xz" combined with discards?

Thanks,

Chris Murphy

Re: dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13

2021-12-08 Thread Chris Murphy
On Tue, Dec 7, 2021 at 5:25 PM Sean Christopherson  wrote:
>
> On Tue, Dec 07, 2021, Chris Murphy wrote:
> > cc: qemu-devel
> >
> > Hi,
> >
> > I'm trying to help progress a very troublesome and so far elusive bug
> > we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
> > VMs simultaneously, eventually they become unresponsive, as well as
> > new processes as we try to extract information from the host about
> > what's gone wrong.
>
> Have you tried bisecting?  IIUC, the issues showed up between v5.11 and
> v5.12.12, bisecting should be relatively straightforward.
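
(For reference, a plain bisect over that range would look roughly like the
following, assuming the hang can be reproduced on demand on a test box:

$ git bisect start
$ git bisect bad v5.12.12
$ git bisect good v5.11
  # build and boot each commit git suggests, try to reproduce, then mark it:
$ git bisect good    # or: git bisect bad

...which is exactly what is hard for us to do, as explained below.)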

We haven't tried bisecting. It's a production machine with limited access,
and the people who do have access have limited time, so I think the chances
of a true bisect are low, but I've asked. We could do something of a
faux-bisect by running kernels already built in Fedora infrastructure:
start with the x.y.0 releases to find the series where the problem first
appears, then work through rc1, rc2, ... within that series. We also have
approximately daily git builds in between those rcs. That might be enough
to deduce a culprit, but I'm not sure; at the least it would narrow things
to a ~1-3 day window between two rcs for a proper bisect.
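
A rough sketch of one round of that faux-bisect, assuming we pull the
candidate builds from koji (the NVR and arch below are placeholders, not a
specific build we've picked out):

$ koji download-build --arch=x86_64 kernel-5.13.0-0.rc2.fc35
$ sudo dnf install ./kernel-*.rpm
$ sudo reboot
  # then re-run the openQA worker load and record good/bad for that build

Repeating that while halving the remaining range should converge on two
adjacent builds that disagree.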

>
> > Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
> > state where forking does not work correctly, breaking most things
> > https://bugzilla.redhat.com/show_bug.cgi?id=2009585
> >
> > In subsequent testing, we used newer kernels with lockdep and other
> > debug stuff enabled, and managed to capture a hung task with a bunch
> > of locks listed, including kvm and qemu processes. But I can't parse
> > it.
> >
> > 5.15-rc7
> > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
> > 5.15+
> > https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939
> >
> > If anyone can take a glance at those kernel messages, and/or give
> > hints how we can extract more information for debugging, it'd be
> > appreciated. Maybe all of that is normal and the actual problem isn't
> > in any of these traces.
>
> All the instances of
>
>   (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x77/0x720 [kvm]
>
> are uninteresting and expected, that's just each vCPU task taking its
> associated vcpu->mutex, likely for KVM_RUN.
>
> At a glance, the XFS stuff looks far more interesting/suspect.

Thanks for the reply.

-- 
Chris Murphy



dozens of qemu/kvm VMs getting into stuck states since kernel ~5.13

2021-12-07 Thread Chris Murphy
cc: qemu-devel

Hi,

I'm trying to help progress a very troublesome and so far elusive bug
we're seeing in Fedora infrastructure. When running dozens of qemu-kvm
VMs simultaneously, eventually they become unresponsive, as well as
new processes as we try to extract information from the host about
what's gone wrong.

Systems (Fedora openQA worker hosts) on kernel 5.12.12+ wind up in a
state where forking does not work correctly, breaking most things
https://bugzilla.redhat.com/show_bug.cgi?id=2009585

In subsequent testing, we used newer kernels with lockdep and other
debug stuff enabled, and managed to capture a hung task with a bunch
of locks listed, including kvm and qemu processes. But I can't parse
it.

5.15-rc7
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840941
5.15+
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1840939

If anyone can take a glance at those kernel messages, and/or give
hints how we can extract more information for debugging, it'd be
appreciated. Maybe all of that is normal and the actual problem isn't
in any of these traces.

Thanks,

--
Chris Murphy