Re: [linux-lvm] LVM performance vs direct dm-thin

2022-02-02 Thread Demi Marie Obenour
On Mon, Jan 31, 2022 at 10:29:04PM +0100, Marian Csontos wrote:
> On Sun, Jan 30, 2022 at 11:17 PM Demi Marie Obenour <
> d...@invisiblethingslab.com> wrote:
> 
>> On Sun, Jan 30, 2022 at 04:39:30PM -0500, Stuart D. Gathman wrote:
>>> Your VM usage is different from ours - you seem to need to clone and
>>> activate a VM quickly (like a vps provider might need to do).  We
>>> generally have to buy more RAM to add a new VM :-), so performance of
>>> creating a new LV is the least of our worries.
>>
>> To put it mildly, yes :).  Ideally we could get VM boot time down to
>> 100ms or lower.
>>
> 
> Out of curiosity, is snapshot creation the main obstacle to booting a VM in
> under 100ms? Does Qubes OS use tweaked Linux distributions to achieve the
> desired boot time?

The goal is 100ms from user action until PID 1 starts in the guest.
After that, it’s the job of whatever distro the guest is running.
Storage management is one area that needs to be optimized to achieve
this, though it is not the only one.

> Back to business. Perhaps I missed an answer to this question: are the
> Qubes OS VMs throwaway?  Throwaway in the sense that many containers are -
> just a runtime which can be "easily" reconstructed. If so, you can ignore
> the safety belts and try to squeeze out more performance by sacrificing
> (meta)data integrity.

Why does a trade-off need to be made here?  More specifically, why is it
not possible to be reasonably fast (a few ms) AND safe?

> And the answer to that question seems to be both yes and no. The classic
> pets vs. cattle distinction.
> 
> As I understand it, except for the system VMs, there are at least two kinds
> of user domains, and these have different requirements:
> 
> 1. a few permanent pet VMs (Work, Personal, Banking, ...), in Qubes OS
> called AppVMs,
> 2. and many transient cattle VMs (e.g. for opening an attachment from
> email, browsing the web, or batch processing of received files), called
> Disposable VMs.
> 
> For AppVMs, there are only a "few" of those, and they are running most of
> the time, so start time may be less important than data safety. Certainly
> creation is only a once-in-a-while operation, so I would say use LVM for
> these. And where snapshots are not required, use plain linear LVs: one less
> thing that could go wrong. However, AppVMs are created from Template VMs,
> so snapshots seem to be part of the system.

Snapshots are used and required *everywhere*.  Qubes OS offers
copy-on-write cloning support, and users expect it to be cheap, not
least because renaming a qube is implemented using it.  By default,
AppVM private and TemplateVM root volumes always have at least one
snapshot, to support `qvm-volume revert`.  Start time really matters
too; a user may not wish to have every qube running at once.
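
To make that concrete: reverting boils down to swapping in a snapshot of the
saved state.  A rough lvm2-level sketch (the volume and snapshot names below
are invented for illustration, not what the storage driver actually
generates):

    # a snapshot of the private volume is kept from the previous shutdown
    lvcreate -s -kn -n vm-work-private-1643760000-back qubes_dom0/vm-work-private
    # revert = discard the live volume and bring back a snapshot of the saved one
    lvremove -f qubes_dom0/vm-work-private
    lvcreate -s -kn -ay -n vm-work-private qubes_dom0/vm-work-private-1643760000-back

Both lvcreate calls are thin snapshots, so they are cheap in space, but each
one still pays for a full lvm2 metadata commit, which is part of why snapshot
cost shows up everywhere for us.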

In short, performance and safety *both* matter, and data AND metadata
operations are performance-critical.

> But the data may be on linear LVs
> anyway, as these are not shared and are the most important part of the
> system. And you can still use old-style snapshots for backing up the data
> (and by backup I mean snapshot, copy, delete snapshot - not a long-term
> snapshot, and definitely not multiple snapshots).

Creating a qube is intended to be a cheap operation, so thin
provisioning of storage is required.  Qubes OS also relies heavily
on over-provisioning of storage, so linear LVs and old style snapshots
won’t fly.  Qubes OS does have a storage driver that uses dm-snapshot on
top of loop devices, but that is deprecated, since it cannot provide the
features Qubes OS requires.  As just one example, the default private
volume size is 2GiB, but many qubes use nowhere near this amount of disk
space.
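
To make the over-provisioning point concrete (the LV and pool names below are
made up, not the real layout):

    # a 2 GiB private volume that consumes almost nothing until written to
    lvcreate -T -V 2G -n vm-work-private qubes_dom0/vm-pool
    lvs -o lv_name,lv_size,data_percent qubes_dom0/vm-work-private

LSize reports 2.00g while Data% stays near zero until the qube actually
writes something; multiplied across dozens of qubes, that is the
over-provisioning we rely on.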

> Now I realize there is a third kind of user domain - Template VMs.
> Similar to AppVMs, there are only a few of those, and creating them
> requires downloading an image, upgrading the system on an existing
> template, or even installing the system, so any LVM overhead is
> insignificant for these. Use thin volumes.
> 
> For Disposable VMs it is the creation + startup time that matters. Use
> whatever the fastest method is. These are created from Template VMs too.
> What LVM/DM has to offer here is external origin. So the templates
> themselves could be managed by LVM, and Qubes OS could use them as an
> external origin for Disposable VMs using device mapper directly. These
> could be held in a disposable thin pool which can be reinitialized from
> scratch on host reboot, after a crash, or on a problem with the pool. As a
> bonus, this would also address the absence of thin pool shrinking.

That is an interesting idea I had not considered, but it would add
substantial complexity to the storage management system.  More
generally, the same approach could be used for all volatile volumes,
which are intended to be thrown away after qube shutdown.  Qubes OS even
supports encrypting volatile volumes with an ephemeral key to guarantee
they are unrecoverable.  (Disposable VM private 
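
If I understand the proposal correctly, at the device-mapper level it would
look roughly like this (untested sketch; device names, sizes, and IDs are
invented for illustration):

    # a throwaway pool, recreated from scratch on every host boot
    size=$(blockdev --getsz /dev/qubes_dom0/disp-data)
    dmsetup create disp-pool --table \
      "0 $size thin-pool /dev/qubes_dom0/disp-meta /dev/qubes_dom0/disp-data 128 32768"
    # a thin device whose unprovisioned blocks fall through to the read-only
    # template root volume acting as the external origin
    dmsetup message /dev/mapper/disp-pool 0 "create_thin 0"
    origin=/dev/qubes_dom0/vm-fedora-34-root
    dmsetup create disp1234-root --table \
      "0 $(blockdev --getsz $origin) thin /dev/mapper/disp-pool 0 $origin"

Tearing it all down at host shutdown would then be a plain dmsetup remove of
the thin devices and the pool, since nothing in it needs to survive a reboot.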

Re: [linux-lvm] LVM performance vs direct dm-thin

2022-02-02 Thread Demi Marie Obenour
On Wed, Feb 02, 2022 at 11:04:37AM +0100, Zdenek Kabelac wrote:
> Dne 02. 02. 22 v 3:09 Demi Marie Obenour napsal(a):
> > On Sun, Jan 30, 2022 at 06:43:13PM +0100, Zdenek Kabelac wrote:
> > > Dne 30. 01. 22 v 17:45 Demi Marie Obenour napsal(a):
> > > > On Sun, Jan 30, 2022 at 11:52:52AM +0100, Zdenek Kabelac wrote:
> > > > > Dne 30. 01. 22 v 1:32 Demi Marie Obenour napsal(a):
> > > > > > On Sat, Jan 29, 2022 at 10:32:52PM +0100, Zdenek Kabelac wrote:
> > > > > > > Dne 29. 01. 22 v 21:34 Demi Marie Obenour napsal(a):
> > > My biased advice would be to stay with lvm2. There is a lot of work, many
> > > things are not well documented, and getting everything running correctly
> > > will take a lot of effort. (Docker in fact did not manage to do it well
> > > and was incapable of providing any recoverability.)
> > 
> > What did Docker do wrong?  Would it be possible for a future version of
> > lvm2 to be able to automatically recover from off-by-one thin pool
> > transaction IDs?
> 
> Ensuring all steps in the state machine are always correct is not exactly
> simple.  But since I've not heard about the off-by-one problem for a long
> while, I believe we've managed to close all the holes and bugs in the
> double-commit system and metadata handling by thin-pool and lvm2 (for
> recent lvm2 & kernel)

How recent are you talking about?  Are there fixes that can be
cherry-picked?  I somewhat recently triggered this issue on a test
machine, so I would like to know.
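
(For reference, my understanding of the manual recovery in the simple case
where the kernel's ID is one behind lvm2's expectation is roughly the
following; the pool name and IDs are placeholders, and vgcfgbackup plus a
clean thin_check run come first.)

    # the first field of the thin-pool status line is the kernel's transaction id
    dmsetup status qubes_dom0-pool00-tpool
    # nudge the kernel-side id forward by one so it matches the lvm2 metadata
    dmsetup message qubes_dom0-pool00-tpool 0 "set_transaction_id 1234 1235"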

> > > It's difficult - if you were distributing lvm2 with an exact kernel
> > > version & udev & systemd within a single Linux distro, it would reduce
> > > a huge set of troubles...
> > 
> > Qubes OS comes close to this in practice.  systemd and udev versions are
> > known and fixed, and Qubes OS ships its own kernels.
> 
> Systemd/udev evolves - so fixed today doesn't really mean the same version
> will be there tomorrow.  And unfortunately systemd is known to introduce
> backward-incompatible changes from time to time...

Thankfully, in Qubes OS’s dom0, the version of systemd is frozen and
will never change throughout an entire release.

> > > A chain of filesystem->block_layer->filesystem->block_layer is something
> > > you most likely do not want to use for any well-performing solution...
> > > But it's ok for testing...
> > 
> > How much of this is due to the slow loop driver?  How much of it could
> > be mitigated if btrfs supported an equivalent of zvols?
> 
> Here you are missing the core of the problem from the kernel POV, i.e. how
> memory allocation works and what approximations the kernel makes in buffer
> handling, and so on.  So whoever is using 'loop' devices in production
> systems in the way described above has never really tested any corner-case
> logic.

In Qubes OS the loop device is always passed through to a VM or used as
the base device for an old-style device-mapper snapshot.  It is never
mounted on the host.  Are there known problems with either of these
configurations?
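
For reference, the layout in question is roughly the following (paths and
names are illustrative rather than copied from the driver):

    # template root image and a per-VM copy-on-write file, both attached via loop
    losetup -f --show /var/lib/qubes/vm-templates/fedora-34/root.img   # -> /dev/loop0
    losetup -f --show /var/lib/qubes/appvms/work/root-cow.img          # -> /dev/loop1
    # old-style dm-snapshot over the two loop devices; the result is handed to the VM
    dmsetup create vm-work-root --table \
      "0 $(blockdev --getsz /dev/loop0) snapshot /dev/loop0 /dev/loop1 N 128"

The backing files do live on a dom0 filesystem, but the filesystem on top of
the snapshot device is only ever mounted inside the guest.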

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab



Re: [linux-lvm] LVM performance vs direct dm-thin

2022-02-02 Thread Zdenek Kabelac

Dne 02. 02. 22 v 3:09 Demi Marie Obenour napsal(a):
> On Sun, Jan 30, 2022 at 06:43:13PM +0100, Zdenek Kabelac wrote:
> > Dne 30. 01. 22 v 17:45 Demi Marie Obenour napsal(a):
> > > On Sun, Jan 30, 2022 at 11:52:52AM +0100, Zdenek Kabelac wrote:
> > > > Dne 30. 01. 22 v 1:32 Demi Marie Obenour napsal(a):
> > > > > On Sat, Jan 29, 2022 at 10:32:52PM +0100, Zdenek Kabelac wrote:
> > > > > > Dne 29. 01. 22 v 21:34 Demi Marie Obenour napsal(a):
> > My biased advice would be to stay with lvm2. There is a lot of work, many
> > things are not well documented, and getting everything running correctly
> > will take a lot of effort. (Docker in fact did not manage to do it well
> > and was incapable of providing any recoverability.)
> 
> What did Docker do wrong?  Would it be possible for a future version of
> lvm2 to be able to automatically recover from off-by-one thin pool
> transaction IDs?

Ensuring all steps in the state machine are always correct is not exactly
simple.  But since I've not heard about the off-by-one problem for a long
while, I believe we've managed to close all the holes and bugs in the
double-commit system and metadata handling by thin-pool and lvm2 (for
recent lvm2 & kernel)

> > It's difficult - if you were distributing lvm2 with an exact kernel
> > version & udev & systemd within a single Linux distro, it would reduce
> > a huge set of troubles...
> 
> Qubes OS comes close to this in practice.  systemd and udev versions are
> known and fixed, and Qubes OS ships its own kernels.

Systemd/udev evolves - so fixed today doesn't really mean the same version
will be there tomorrow.  And unfortunately systemd is known to introduce
backward-incompatible changes from time to time...

> > I'm not familiar with Qubes OS - but in many cases in the real world we
> > can't push the latest to our users, so we need to live with bugs and
> > add workarounds...
> 
> Qubes OS is more than capable of shipping fixes for kernel bugs.  Is
> that what you are referring to?

not going to start discussing this topic ;)

> > A chain of filesystem->block_layer->filesystem->block_layer is something
> > you most likely do not want to use for any well-performing solution...
> > But it's ok for testing...
> 
> How much of this is due to the slow loop driver?  How much of it could
> be mitigated if btrfs supported an equivalent of zvols?

Here you are missing the core of the problem from the kernel POV, i.e. how
memory allocation works and what approximations the kernel makes in buffer
handling, and so on.  So whoever is using 'loop' devices in production
systems in the way described above has never really tested any corner-case
logic.


Regards

Zdenek
