Re: [Users] ZFS vs ploop

Kir Kolyshkin Thu, 23 Jul 2015 00:48:43 -0700

On 07/22/2015 11:59 PM, Сергей Мамонов wrote:

>1. creating then removing data (vzctl compact takes care of that)
>So, #1 is solved


Only partially in fact.
1. Compact eats a lot of resources because it uses the disk heavily.
2. You need to compact your ploop images very, very regularly.

On our nodes, even when we run compact every day, the daily delta on a 3-5 TB /vz/ is about 4-20% of the space!

Every day it has to clean up 300-500+ GB.

And it does not clean everything, for example:

[root@evo12 ~]# vzctl compact 75685
Trying to find free extents bigger than 0 bytes
Waiting
Call FITRIM, for minlen=33554432
Call FITRIM, for minlen=16777216
Call FITRIM, for minlen=8388608
Call FITRIM, for minlen=4194304
Call FITRIM, for minlen=2097152
Call FITRIM, for minlen=1048576
0 clusters have been relocated
[root@evo12 ~]# ls -lhat /vz/private/75685/root.hdd/root.hdd

-rw------- 1 root root 43G Июл 20 20:45 /vz/private/75685/root.hdd/root.hdd

[root@evo12 ~]# vzctl exec 75685 df -h /
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop32178p1   50G   26G   21G  56% /
[root@evo12 ~]# vzctl --version
vzctl version 4.9.2


This is either #2 or #3 from my list, or both.

>My point was, the feature works fine for many people despite this bug.
Not fine, but we need it very much, for migration and not only for that. So we use it anyway; in fact we have no alternative. And this is only one of the bugs. Live migration regularly fails because vzctl cannot restore the container correctly after suspend.


You really need to file bugs in case you want fixes.

CPT is a pain, in fact. But I want to believe that CRIU will fix everything =)

And having ext4 as the only option with ploop is not a good situation, and not a modern one either.

For example, on some big nodes we have several /vz/ partitions, because the RAID controller cannot put all the disks into one RAID10 logical device. And having several /vz/ partitions is not convenient.

And it is less flexible than a single zpool, for example.

2015-07-23 5:44 GMT+03:00 Kir Kolyshkin <k...@openvz.org>:




    On 07/22/2015 10:08 AM, Gena Makhomed wrote:

        On 22.07.2015 8:39, Kir Kolyshkin wrote:

                1) currently even suspend/resume does not work reliably:
                https://bugzilla.openvz.org/show_bug.cgi?id=2470
                - I can't suspend and resume containers without bugs,
                and as a result I also can't use it for live migration.


            Valid point, we need to figure it out. What I don't understand
            is how lots of users are enjoying live migration despite
            this bug.
            Me, personally, I never came across this.


        Nevertheless, steps that reproduce the bug 100% of the time are
        provided in the bug report.


    I was not saying anything about the bug report being bad/incomplete.
    My point was, the feature works fine for many people despite this bug.


                2) I see many bug reports about this feature on Google:
                "openvz live migration kernel panic" - so I prefer to
                schedule planned downtime of containers at night instead
                of unexpected and very painful kernel panics and
                complete reboots in the middle of the working day
                (with data loss, data corruption and other "amenities").


            Unlike the previous item, which is valid, this is pure FUD.


        Compare two situations:

        1) Live migration is not used at all

        2) Live migration is used and containers are migrated between
        hardware nodes

        In which situation is the probability of a kernel panic higher?

        If you say "the probabilities are equal", this means
        that the OpenVZ live migration code has no errors at all.

        Is that plausible? Especially given the volume and complexity
        of the OpenVZ live migration code and the scale of this task.

        If you say "for (1) the probability is lower and for (2)
        the probability is higher" - this is the same as what I think.

        I don't use live migration because I don't want kernel panics.


    Following your logic, if you don't want kernel panics, you might want
    to not use advanced filesystems such as ZFS, not use containers,
    cgroups, namespaces, etc. The ultimate solution here, of course,
    is to not use the kernel at all -- this will totally guarantee no
    kernel panics at all, ever.

    On a serious note, I find your logic flawed.


        And you say that "this is pure FUD"? Why?


    Because it is not based on your experience or correct statistics,
    but rather on something you saw on Google followed by some
    flawed logic.




                4) from a technical point of view, it is possible
                to do live migration using ZFS, so "live migration"
                is currently the only advantage of ploop over ZFS


            I wouldn't say so. If you have some real world comparison
            of zfs vs ploop, feel free to share. Like density or
            performance measurements, done in a controlled environment.


        Ok.

        My experience with ploop:

        DISKSPACE was limited to 256 GiB, and the real data used inside
        the container was near 40-50% of that 256 GiB limit, but the ploop
        image was a lot bigger: it used near 256 GiB of space on the
        hardware node. Overhead ~ 50-60%

        I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
        via cron every night, and now the ploop image has less overhead.
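
A minimal sketch of such a nightly job, assuming every running container on the node should be compacted. `vzctl compact` is the command quoted above and `vzlist -H -o ctid` prints container IDs without a header; the function name `compact_all`, the `VZLIST`/`VZCTL` override variables and the `--run` switch are hypothetical names introduced only for this illustration.

```shell
#!/bin/sh
# Nightly ploop compaction sketch (illustrative, not an official tool).
# VZLIST and VZCTL are hypothetical hooks so the commands can be stubbed out.

compact_all() {
    rc=0
    # vzlist without -a lists running containers; -H drops the header line
    for ct in $("${VZLIST:-vzlist}" -H -o ctid); do
        # keep going if one container fails, but remember the failure
        "${VZCTL:-vzctl}" compact "$ct" || {
            echo "compact failed for CT $ct" >&2
            rc=1
        }
    done
    return "$rc"
}

# Only act when asked to, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ]; then
    compact_all
fi
```

Saved under some path of your choosing and called with `--run` from a root crontab entry, this would run once per night. Since compact is I/O heavy, it probably makes sense to schedule it for the quietest hours.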

        current state:

        on hardware node:

        # du -b /vz/private/155/root.hdd
        205963399961    /vz/private/155/root.hdd

        inside container:

        # df -B1
        Filesystem           1B-blocks         Used    Available Use% Mounted on
        /dev/ploop38149p1 270426705920 163129053184  94928560128  64% /


        ====================================

        used space, bytes: 163129053184

        image size, bytes: 205963399961

        The disk space overhead of the "ext4 over ploop over ext4"
        solution is near 26%, or near 40 GiB in absolute numbers.

        This is the main disadvantage of ploop.

        And this disadvantage can't be avoided - it is "by design".
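
For anyone who wants to check the arithmetic, here is a small sketch that recomputes the overhead from the two byte counts quoted above (image size from `du -b`, used space from `df -B1`). The function names are made up for this example, and integer division truncates, so the GiB figure comes out as 39 rather than the rounded "near 40".

```shell
#!/bin/sh
# overhead_pct IMAGE_BYTES USED_BYTES
# Extra space taken by the image, as a whole percent of the used data.
overhead_pct() {
    echo $(( ($1 - $2) * 100 / $2 ))
}

# overhead_gib IMAGE_BYTES USED_BYTES
# The same difference in whole GiB (truncated, not rounded).
overhead_gib() {
    echo $(( ($1 - $2) / (1024 * 1024 * 1024) ))
}

# Numbers from the message above.
overhead_pct 205963399961 163129053184   # prints 26
overhead_gib 205963399961 163129053184   # prints 39
```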


    To anyone reading this, there are a few things here worth noting.

    a. Such overhead is caused by three things:
    1. creating then removing data (vzctl compact takes care of that)
    2. filesystem fragmentation (we have some experimental patches to ext4
       plus an ext4 defragmenter to solve it, but currently it's still in
       research stage)
    3. initial filesystem layout (which depends on initial ext4 fs size,
       including inode requirement)

    So, #1 is solved, #2 is solvable, and #3 is a limitation of the used
    file system and can be mitigated by properly choosing the initial
    size of a newly created ploop.

    An example of the #3 effect is this: if you create a very large
    filesystem initially (say, 16TB) and then downsize it (say, to 1TB),
    filesystem metadata overhead will be quite big. The same thing happens
    if you ask for lots of inodes (here "lots" means more than the default
    value, which is 1 inode per 16K of disk space). This happens because
    the ext4 filesystem is not designed to shrink. Therefore, to have the
    lowest possible overhead you have to choose the initial filesystem
    size carefully. Yes, this is not a solution but a workaround.
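
To give a feel for the inode part of this, here is a rough back-of-the-envelope sketch. It assumes 256-byte on-disk inodes (an assumption about the mkfs defaults, not something stated above) and only counts inode tables, ignoring all other metadata, so treat the result as an order of magnitude. The function name is hypothetical.

```shell
#!/bin/sh
# inode_table_bytes FS_BYTES BYTES_PER_INODE [INODE_SIZE]
# Space reserved for inode tables: one inode per BYTES_PER_INODE of disk,
# each taking INODE_SIZE bytes (256 is an assumed default).
inode_table_bytes() {
    fs_bytes=$1
    ratio=$2
    inode_size=${3:-256}
    echo $(( fs_bytes / ratio * inode_size ))
}

GIB=$(( 1024 * 1024 * 1024 ))

# 256 GiB filesystem, default ratio of 1 inode per 16K
inode_table_bytes $(( 256 * GIB )) 16384

# Same filesystem, four times as many inodes (1 inode per 4K)
inode_table_bytes $(( 256 * GIB )) 4096
```

Under these assumptions the first case reserves 4 GiB and the second 16 GiB before a single file is written, which is why asking for extra inodes shows up directly as image overhead.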

    Also note that ploop was not designed with any specific filesystem
    in mind; it is universal, so #3 can be solved by moving to a
    different fs in the future.

    Next thing, you can actually use shared base deltas for containers.
    Although it is not enabled by default, it is quite possible and works
    in practice. The key is to create a base delta and use it for
    multiple containers (via hardlinks).

    Here is a quick and dirty example:

    SRCID=50 # "Donor" container ID
    vztmpl-dl centos-7-x86_64 # to make sure we use the latest
    vzctl create $SRCID --ostemplate centos-7-x86_64
    vzctl snapshot $SRCID
    for CT in $(seq 1000 2000); do \
          mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT; \
          ln /vz/private/$SRCID/root.hdd/root.hdd \
             /vz/private/$CT/root.hdd/root.hdd; \
          cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/; \
          cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf; \
       done
    vzctl set $SRCID --disabled yes --save # make sure we don't use it

    This will create 1000 containers (so make sure your host has enough
    RAM), each having about 650MB of files, so 650GB in total. Host disk
    space used will be about 650 + 1000*1 MB before start (i.e. about
    2GB), or about 650 + 1000*30 MB after start (i.e. about 32GB). So:

    real data used inside containers near 650 GB
    real space used on hard disk is near 32 GB

    So, 20x disk space savings, and this result is reproducible. Surely
    it will get worse over time etc., and this way of using ploop is
    neither official nor supported/recommended, but that's not the point
    here. The points are:
     - this is a demonstration of what you could do with ploop
     - this shows why you shouldn't trust any numbers
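
The estimate above can be redone as plain arithmetic. This is only a sketch using the per-container figures quoted in this message (about 1 MB before start, about 30 MB after start); the function name is invented for the example and nothing here measures a real host.

```shell
#!/bin/sh
# shared_delta_mb BASE_MB COUNT PER_CT_MB
# Host space when COUNT containers hardlink one base delta of BASE_MB
# and each adds PER_CT_MB of its own top delta.
shared_delta_mb() {
    echo $(( $1 + $2 * $3 ))
}

before=$(shared_delta_mb 650 1000 1)    # 1650 MB, i.e. about 2GB
after=$(shared_delta_mb 650 1000 30)    # 30650 MB, i.e. about 32GB
apparent=$(( 650 * 1000 ))              # data visible inside containers

echo "before start: ${before} MB"
echo "after start:  ${after} MB"
echo "savings:      $(( apparent / after ))x"
```

With unrounded numbers the ratio comes out at 21x rather than 20x, which is within the precision of an estimate like this.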

        =======================================================================

        My experience with ZFS:

        real data used inside container near 62 GiB,
        real space used on hard disk is near 11 GiB.


    So, you are not even comparing apples to apples here. You just took
    two different containers, certainly of different sizes, probably also
    different data sets and usage history. Not saying it's invalid, but
    if you want to have a meaningful (rather than anecdotal) comparison,
    you need to use same data sets, same operations on data etc., try to
    optimize each case, and compare




    _______________________________________________
    Users mailing list
    Users@openvz.org
    https://lists.openvz.org/mailman/listinfo/users



