On 07/22/2015 11:59 PM, Сергей Мамонов wrote:
>1. creating then removing data (vzctl compact takes care of that)
>So, #1 is solved
Only partially in fact.
1. Compact "eats" a lot of resources because of the heavy disk use.
2. You need to compact your ploop images very, very regularly.
On our nodes, where we run compact every day on a 3-5T /vz/, the daily
delta is about 4-20% of the space!
Every day it has to clean 300 - 500+ GB.
And it does not clean everything. For example:
[root@evo12 ~]# vzctl compact 75685
Trying to find free extents bigger than 0 bytes
Waiting
Call FITRIM, for minlen=33554432
Call FITRIM, for minlen=16777216
Call FITRIM, for minlen=8388608
Call FITRIM, for minlen=4194304
Call FITRIM, for minlen=2097152
Call FITRIM, for minlen=1048576
0 clusters have been relocated
[root@evo12 ~]# ls -lhat /vz/private/75685/root.hdd/root.hdd
-rw------- 1 root root 43G Июл 20 20:45 /vz/private/75685/root.hdd/root.hdd
[root@evo12 ~]# vzctl exec 75685 df -h /
Filesystem         Size  Used Avail Use% Mounted on
/dev/ploop32178p1   50G   26G   21G  56% /
[root@evo12 ~]# vzctl --version
vzctl version 4.9.2
This is either #2 or #3 from my list, or both.
>My point was, the feature works fine for many people despite this bug.
Not fine, but we need it very much, and not only for migration. So we
use it anyway; in fact we have no alternative.
And this is just one of the bugs. Live migration regularly fails because
vzctl cannot restore the container correctly after suspend.
You really need to file bugs in case you want fixes.
CPT is a pain, in fact. But I want to believe that CRIU will fix everything =)
And ploop working only with ext4 is not a good situation, and not a modern one either.
For example, on some big nodes we have several /vz/ partitions, because the RAID
controller cannot put all the disks into one RAID10 logical device. And several
/vz/ partitions are inconvenient.
It is also less flexible than, for example, a single zpool.
2015-07-23 5:44 GMT+03:00 Kir Kolyshkin <k...@openvz.org>:
On 07/22/2015 10:08 AM, Gena Makhomed wrote:
On 22.07.2015 8:39, Kir Kolyshkin wrote:
1) currently even suspend/resume does not work reliably:
https://bugzilla.openvz.org/show_bug.cgi?id=2470
- I can't suspend and resume containers without bugs,
and as a result I also can't use it for live migration.
Valid point, we need to figure it out. What I don't understand
is how lots of users are enjoying live migration despite
this bug.
Me, personally, I never came across this.
Nevertheless, steps that reproduce the bug 100% of the time are provided in the bug report.
I was not saying anything about the bug report being bad/incomplete.
My point was, the feature works fine for many people despite this bug.
2) I see many bug reports about this feature on Google:
"openvz live migration kernel panic" - so I prefer to schedule
planned downtime of containers at night instead
of unexpected and very painful kernel panics and
complete reboots in the middle of the working day
(with data loss, data corruption and other "amenities").
Unlike the previous item, which is valid, this is pure FUD.
Compare two situations:
1) Live migration is not used at all
2) Live migration is used and containers are migrated between hardware nodes (HN)
In which situation is the probability of a kernel panic higher?
If you say "the probabilities are equal", this means
that the OpenVZ live migration code has no errors at all.
Is that plausible? Especially given the volume and complexity of
the OpenVZ live migration code and the sheer scale of this task.
If you say "for (1) the probability is lower and for (2)
the probability is higher" - that is exactly what I think.
I don't use live migration because I don't want kernel panics.
Following your logic, if you don't want kernel panics, you might want
to not use advanced filesystems such as ZFS, not use containers,
cgroups, namespaces, etc. The ultimate solution here, of course,
is to not use the kernel at all -- this will totally guarantee no
kernel panics at all, ever.
On a serious note, I find your logic flawed.
And you say that "this is pure FUD"? Why?
Because it is not based on your experience or correct statistics,
but rather on something you saw on Google followed by some
flawed logic.
4) from a technical point of view, it is possible
to do live migration using ZFS, so "live migration"
is currently the only advantage of ploop over ZFS
I wouldn't say so. If you have some real world comparison
of zfs vs ploop, feel free to share. Like density or
performance measurements, done in a controlled environment.
Ok.
My experience with ploop:
DISKSPACE was limited to 256 GiB and the real data used inside the container
was about 40-50% of that 256 GiB limit, but the ploop image was a lot bigger:
it used nearly 256 GiB of space on the hardware node. Overhead ~ 50-60%.
I found a workaround for this: run "/usr/sbin/vzctl compact $CT"
via cron every night, and now the ploop image has less overhead.
current state:
on hardware node:
# du -b /vz/private/155/root.hdd
205963399961 /vz/private/155/root.hdd
inside container:
# df -B1
Filesystem            1B-blocks         Used    Available Use% Mounted on
/dev/ploop38149p1  270426705920 163129053184  94928560128  64% /
====================================
used space, bytes: 163129053184
image size, bytes: 205963399961
The disk space overhead of the "ext4 over ploop over ext4" solution is
about 26%, or about 40 GiB in absolute numbers.
This is the main disadvantage of ploop.
And this disadvantage can't be avoided - it is "by design".
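As a quick check of the 26% / 40 GiB figures, here is a minimal sketch that recomputes the overhead from the du and df numbers quoted above. The variable names are illustrative only and not part of vzctl or ploop, and the percentage is taken relative to the used space, which is an assumption about how the 26% was derived.

```shell
#!/bin/sh
# Recompute the ploop overhead from the two figures quoted above.
# Variable names are illustrative only; they are not part of vzctl or ploop.
used_bytes=163129053184    # "Used" column from df -B1 inside the container
image_bytes=205963399961   # du -b of /vz/private/155/root.hdd on the node

# Bytes the image occupies beyond the data the container actually stores.
overhead_bytes=$((image_bytes - used_bytes))

# awk does the floating-point part; the shell only has integer arithmetic.
# Assumption: the percentage is relative to the used space.
overhead_pct=$(awk -v o="$overhead_bytes" -v u="$used_bytes" \
    'BEGIN { printf "%.1f", o * 100 / u }')
overhead_gib=$(awk -v o="$overhead_bytes" \
    'BEGIN { printf "%.1f", o / (1024 * 1024 * 1024) }')

echo "overhead: ${overhead_bytes} bytes (${overhead_pct}% of used space, ${overhead_gib} GiB)"
```

With these inputs the overhead comes out at roughly 26% and just under 40 GiB, in line with the numbers given in the message.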
To anyone reading this, there are a few things here worth noting.
a. Such overhead is caused by three things:
1. creating then removing data (vzctl compact takes care of that)
2. filesystem fragmentation (we have some experimental patches to ext4
plus an ext4 defragmenter to solve it, but currently it's
still in research stage)
3. initial filesystem layout (which depends on initial ext4 fs
size, including inode requirement)
So, #1 is solved, #2 is solvable, and #3 is a limitation of the
file system used and can be mitigated
by properly choosing the initial size of a newly created ploop.
An example of the #3 effect is this: if you create a very large
filesystem initially (say, 16TB) and then downsize it (say, to 1TB),
filesystem metadata overhead will be quite big. The same thing happens
if you ask for lots of inodes (here "lots" means more than the
default value, which is 1 inode per 16K of disk space). This happens
because the ext4 filesystem is not designed to shrink.
Therefore, to have lowest possible overhead you have to choose the
initial filesystem size
carefully. Yes, this is not a solution but a workaround.
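To get a feel for the inode part of #3, here is a rough sketch of how much space the inode tables alone take at a given bytes-per-inode ratio. It assumes 256-byte inodes (a common mke2fs default for ext4, but check your own mke2fs.conf) and ignores every other kind of metadata. The helper name inode_table_gib is made up for this example and is not an e2fsprogs command.

```shell
#!/bin/sh
# Rough estimate of ext4 inode table size. Assumptions (not from this thread):
# 256-byte inodes, and inode tables being the only metadata counted.
# inode_table_gib is a hypothetical helper, not an e2fsprogs command.
inode_table_gib() {
    fs_bytes=$1          # filesystem size in bytes
    bytes_per_inode=$2   # one inode per this many bytes (the mke2fs -i value)
    inode_size=256       # assumed on-disk inode size in bytes
    inodes=$((fs_bytes / bytes_per_inode))
    echo $((inodes * inode_size / 1073741824))
}

one_tib=$((1024 * 1024 * 1024 * 1024))
default_gib=$(inode_table_gib "$one_tib" 16384)   # default: 1 inode per 16K
dense_gib=$(inode_table_gib "$one_tib" 4096)      # 4x more inodes requested

echo "1 TiB at 1 inode/16K: ${default_gib} GiB of inode tables"
echo "1 TiB at 1 inode/4K:  ${dense_gib} GiB of inode tables"
```

Under these assumptions, quadrupling the inode count quadruples the inode table space, which is why asking for "lots" of inodes shows up directly as image overhead.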
Also note that ploop was not designed with any specific
filesystem in mind; it is universal, so #3 can be solved by moving
to a different fs in the future.
Next thing, you can actually use shared base deltas for
containers. Although this is not enabled by default, it is quite
possible and works in practice. The key is to create a base delta
and use it for multiple containers (via hardlinks).
Here is a quick and dirty example:
SRCID=50 # "Donor" container ID
vztmpl-dl centos-7-x86_64 # to make sure we use the latest
vzctl create $SRCID --ostemplate centos-7-x86_64
vzctl snapshot $SRCID
for CT in $(seq 1001 2000); do \
    mkdir -p /vz/private/$CT/root.hdd /vz/root/$CT; \
    ln /vz/private/$SRCID/root.hdd/root.hdd \
        /vz/private/$CT/root.hdd/root.hdd; \
    cp -nr /vz/private/$SRCID/root.hdd /vz/private/$CT/; \
    cp /etc/vz/conf/$SRCID.conf /etc/vz/conf/$CT.conf; \
done
vzctl set $SRCID --disabled yes --save # make sure we don't use it
This will create 1000 containers (so make sure your host has
enough RAM), each having about 650MB of files, so 650GB in total.
Host disk space used will be about 650 + 1000*1 MB before start
(i.e. about 2GB), or about 650 + 1000*30 MB after start
(i.e. about 32GB). So:
real data used inside containers is near 650 GB
real space used on hard disk is near 32 GB
So, 20x disk space savings, and this result is reproducible.
Surely it will get worse over time etc., and this way of using
ploop is neither official nor supported/recommended,
but it's not the point here. The points are:
- this is a demonstration of what you could do with ploop
- this shows why you shouldn't trust any numbers
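The arithmetic behind the shared-base-delta estimate above can be sketched as follows. The inputs are the approximate figures from the message (650MB base, about 1MB per container before start, about 30MB after); the variable names are illustrative.

```shell
#!/bin/sh
# Back-of-the-envelope check of the shared-base-delta numbers above.
# All inputs are the approximate figures from the message, in MB.
containers=1000
base_mb=650            # data in the shared (hardlinked) base delta
delta_before_mb=1      # per-container top delta before start
delta_after_mb=30      # per-container top delta after start

logical_mb=$((containers * base_mb))                        # what the containers see
host_before_mb=$((base_mb + containers * delta_before_mb))  # host usage before start
host_after_mb=$((base_mb + containers * delta_after_mb))    # host usage after start
savings=$((logical_mb / host_after_mb))                     # integer ratio

echo "logical data:       ${logical_mb} MB"
echo "host, before start: ${host_before_mb} MB"
echo "host, after start:  ${host_after_mb} MB"
echo "savings after start: about ${savings}x"
```

The integer ratio comes out slightly above the 20x quoted, because the message rounds the host usage up to 32GB.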
=======================================================================
My experience with ZFS:
real data used inside container near 62 GiB,
real space used on hard disk is near 11 GiB.
So, you are not even comparing apples to apples here. You just
took two
different containers, certainly of different sizes, probably also
different data sets
and usage history. Not saying it's invalid, but if you want to
have a meaningful
(rather than anecdotal) comparison, you need to use same data
sets, same
operations on data etc., try to optimize each case, and compare
_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users