On 07/10/2014 11:35 AM, Pavel Odintsov wrote:
>> Not true, IO limits are working as they should (if we're talking vzctl
>> set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
>> accounting support, so it is there.
> 
> Can you share tests with us? For standard folders, like simfs, these
> limits work badly in a big number of cases.

If you can give me concrete tests to run, sure, I'm curious to see if
you're right - then we'd have something concrete to fix :)
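
For the record, this is roughly what I'd run to check it myself - the
CT ID, sizes and limits below are arbitrary examples:

vzctl set 101 --iolimit 10M --iopslimit 300 --save
# throughput, as seen from inside the CT:
vzctl exec 101 dd if=/dev/zero of=/testfile bs=1M count=512 conv=fdatasync
# lots of small writes, to exercise the iops limit:
vzctl exec 101 dd if=/dev/zero of=/testfile bs=4k count=20000 conv=fdatasync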

> 
>> How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)
> 
> Is it OK when your customer creates 1 billion small files on a 10GB VPS
> and you try to archive it for backup? On a slow disk system it's a
> real nightmare, because the sheer number of disk operations kills your
> I/O.

zfs snapshot <dataset>@<snapname>
zfs send <dataset>@<snapname> > your-file
# or stream it straight to the backup host:
zfs send <dataset>@<snapname> | ssh backuper zfs recv <backupdataset>

That's done on the block level. No need to run rsync anymore; it's a lot
faster this way.
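
Once the first full stream is on the backup side, the follow-ups can be
sent incrementally, which is where most of the time is saved - the
snapshot names here are just an example:

zfs snapshot <dataset>@monday
zfs send -i @sunday <dataset>@monday | ssh backuper zfs recv <backupdataset>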

> 
>> Why? ZFS send/receive is able to do bit-by-bit identical copy of the FS,
>> I thought the point of migration is to don't have the CT notice any
>> change, I don't see why the inode numbers should change.
> 
> Do you have really working zero downtime vzmigrate on ZFS?

Nope, vzmigrate isn't zero downtime. Since vzctl/vzmigrate don't
support ZFS, we're implementing this our own way in vpsAdmin, which
in its 2.0 re-implementation will go open source under the GPL.

> 
>> How exactly? I haven't seen a problem with any userspace software, other
>> than MySQL default setting to AIO (it fallbacks to older method), which
>> ZFS doesn't support (*yet*, they have it in their plans).
> 
> I'm speaking about MySQL primarily. I have thousands of containers and I
> can't tune MySQL to another mode for all customers; it's impossible.

As I said, this is under development and will improve.

> 
>> L2ARC cache really smart
> 
> Yep, fine, I know. But can you account for L2ARC cache usage per customer?
> OpenVZ can do it via a flag:
> sysctl -a | grep pagecache_isola
> ubc.pagecache_isolation = 0

I can't account for caches per CT, but I haven't had any need to do so.

L2ARC != ARC. ARC lives in system RAM; L2ARC is intended to sit on an SSD
and hold the content of ARC that is the least significant when memory
runs low - that data gets pushed from ARC out to L2ARC.

ARC keeps two primary lists of cached data - most frequently used (MFU)
and most recently used (MRU) - and these two lists are divided by a
boundary marking which data can be evicted in a low-memory situation.

Unlike the Linux VFS cache, copying one big file doesn't push out all
of the other useful cached data.

Thanks to this distinction between MRU and MFU, ARC achieves far better
hit rates.
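
On ZoL you can watch this for yourself, the kernel exports the ARC
counters - the awk one-liner below is just a quick sketch to get the
overall hitrate:

egrep '^(hits|misses|mru_hits|mfu_hits) ' /proc/spl/kstat/zfs/arcstats
awk '/^hits /{h=$3} /^misses /{m=$3} END{printf "%.2f%%\n", h*100/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats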

> 
> But one customer can eat almost all of the L2ARC cache and displace
> other customers' data.

Yes, but ZFS keeps track of what's being used, so useful data can't be
pushed away that easily; things naturally balance themselves due to the
way the ARC mechanism works.

> 
> I'm not against ZFS, but I'm against using ZFS as the underlying system
> for containers. We caught ~100 kernel bugs with simfs on ext4 when
> customers did some strange things.

I haven't encountered any problems, especially with vzquota disabled (no
need for it; ZFS has its own quotas, which never need to be recalculated
the way vzquota's do).
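
Per-CT disk limits are then just dataset properties - the dataset name
matches the example further down the thread, the sizes are arbitrary:

zfs set refquota=60G vz/private/101   # space the CT itself may use
zfs set quota=80G vz/private/101      # the same, including its snapshots
zfs get used,refquota,quota vz/private/101
# vzquota itself stays off via DISK_QUOTA=no in the vz config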

> 
> But ext4 has a few thousand developers and they fix these issues
> ASAP, while ZFS on Linux has only 3-5 developers, which is VERY slow.
> Because of this I recommend using ext4 with ploop, because that
> solution is rock stable, or ext4 on top of ZFS ZVOLs, because that
> solution is more reliable and more predictable than placing
> containers directly on ZFS filesystems.

ZFS itself is a stable and mature filesystem; it first shipped in
production with Solaris in 2006.
And it's still being developed upstream as OpenZFS; that code is shared
between the primary version - Illumos - and the ports: FreeBSD, OS X, Linux.

So what really needed and still is being developed is the way ZFS is
run under the Linux kernel, but with the recent 0.6.3 release, things have
gotten mature enough to be used in production without any fears. Of
course, no software is without bugs, but I can say with absolute
certainty that ZFS will never eat your data; the only problem you can
encounter is with memory management, which is done very
differently in Linux than in ZFS's original habitat - Solaris.
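
If you're worried about ARC fighting with the CTs over RAM, you can
simply cap it with a module parameter - the 64G value below is just an
example for a 256G box:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=68719476736

# or change it at runtime:
echo 68719476736 > /sys/module/zfs/parameters/zfs_arc_max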

/snajpa

> 
> 
> On Thu, Jul 10, 2014 at 1:08 PM, Pavel Snajdr <li...@snajpa.net> wrote:
>> On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
>>> Hello!
>>>
>>> Your scheme is fine, but you can't divide I/O load with cgroup blkio
>>> (ioprio/iolimit/iopslimit) between different folders, whereas between
>>> different ZVOLs you can.
>>
>> Not true, IO limits are working as they should (if we're talking vzctl
>> set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO
>> accounting support, so it is there.
>>
>>>
>>> I could imagine the following problems for the per-folder scheme:
>>> 1) You can't limit the number of inodes in different folders (there is
>>> no inode limit in ZFS like there is in ext4, but a big number of files
>>> in a container could break the node;
>>
>> How? ZFS doesn't have a limit on number of files (2^48 isn't a limit really)
>>
>>> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
>>> 2) Problems with the system cache, which is used by all containers on the HWN together
>>
>> This exactly isn't a problem, but a *HUGE* benefit, you'd need to see it
>> in practice :) Linux VFS cache is really dumb in comparison to ARC.
>> ARC's hitrates just can't be done with what linux currently offers.
>>
>>> 3) Problems with live migration because you _should_ change inode
>>> numbers on different nodes
>>
>> Why? ZFS send/receive is able to do bit-by-bit identical copy of the FS,
>> I thought the point of migration is to don't have the CT notice any
>> change, I don't see why the inode numbers should change.
>>
>>> 4) ZFS behaviour with linux software in some cases is very STRANGE 
>>> (DIRECT_IO)
>>
>> How exactly? I haven't seen a problem with any userspace software, other
>> than MySQL default setting to AIO (it fallbacks to older method), which
>> ZFS doesn't support (*yet*, they have it in their plans).
>>
>>> 5) ext4 has good support from vzctl (fsck, resize2fs)
>>
>> Yeah, but ext4 sucks big time. At least in my use-case.
>>
>> We've implemented most of vzctl create/destroy/etc. functionality in our
>> vpsAdmin software instead.
>>
>> Guys, can I ask you to keep your mind open instead of fighting with
>> pointless arguments? :) Give ZFS a try and then decide for yourselves.
>>
>> I think the community would benefit greatly if ZFS wouldn't be fought as
>> something alien in the Linux world, which in my experience is what
>> every Linux zealot I talk to about ZFS is doing.
>> This is just not fair. It's primarily about technology, primarily about
>> the best tool for the job. If we can implement something like this in
>> Linux but without having ties to CDDL and possibly Oracle patents, that
>> would be awesome, yet nobody has done such a thing yet. BTRFS is nowhere
>> near ZFS when it comes to running larger scale deployments and in some
>> regards I don't think it will ever match ZFS, just looking at the way
>> it's been designed.
>>
>> I'm not trying to flame here, I'm trying to open you guys to the fact,
>> that there really is a better alternative than you're currently seeing.
>> And if it has some technological drawbacks like these that you're trying
>> to point out, instead of pointing at them as something, which can't be
>> changed and thus everyone should use "your best solution(tm)", try to
>> think of ways how to change it for the better.
>>
>>>
>>> My ideas like simfs vs ploop comparison:
>>> http://openvz.org/images/f/f3/Ct_in_a_file.pdf
>>
>> Again, you have to see ZFS doing its magic in production under a really
>> heavy load, otherwise you won't understand. Any arbitrary benchmarks
>> I've seen show ZFS is slower than ext4, but these are not tuned for such
>> use cases as I'm talking about.
>>
>> /snajpa
>>
>>>
>>> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <li...@snajpa.net> wrote:
>>>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>>>> Greetings,
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> (offtopic) We can not use ZFS. Unfortunately, NAS with something like
>>>>>>>> Nexenta is to expensive for us.
>>>>>>> From what I've gathered from a few presentations, ZFS on Linux 
>>>>>>> (http://zfsonlinux.org/) is as stable but more performant than it is on 
>>>>>>> the OpenSolaris forks... so you can build your own if you can spare the 
>>>>>>> people to learn the best practices.
>>>>>>>
>>>>>>> I don't have a use for ZFS myself so I'm not really advocating it.
>>>>>>>
>>>>>>> TYL,
>>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> we run tens of OpenVZ nodes (bigger boxes, 256G RAM, 12cores+, 90 CTs at
>>>>>> least). We used to run ext4+flashcache, but ext4 has proven to be a
>>>>>> bottleneck. That was the primary motivation behind ploop as far as I 
>>>>>> know.
>>>>>>
>>>>>> We've switched to ZFS on Linux around the time Ploop was announced and I
>>>>>> didn't have second thoughts since. ZFS really *is* in my experience the
>>>>>> best filesystem there is at the moment for this kind of deployment  -
>>>>>> especially if you use dedicated SSDs for ZIL and L2ARC, although the
>>>>>> latter is less important. You will know what I'm talking about when you
>>>>>> try this on boxes with lots of CTs doing LAMP load - databases and their
>>>>>> synchronous writes are the real problem, which ZFS with dedicated ZIL
>>>>>> device solves.
>>>>>>
>>>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache -
>>>>>> we're able to achieve about 99% of hitrate at about 99% of the time,
>>>>>> even under high loads.
>>>>>>
>>>>>> Having said all that, I recommend everyone to give ZFS a chance, but I'm
>>>>>> aware this is yet another out-of-mainline code and that doesn't suit
>>>>>> everyone that well.
>>>>>>
>>>>>
>>>>> Are you using per-container ZVOL or something else?
>>>>
>>>> That would mean I'd need to do another filesystem on top of ZFS, which
>>>> would in turn mean I'd add another unnecessary layer of indirection. ZFS
>>>> is a pooled storage like BTRFS is, we're giving one dataset to each
>>>> container.
>>>>
>>>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one
>>>> more directory to put the VE_PRIVATE data into (see the first ls).
>>>>
>>>> Example from production:
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # zpool status vz
>>>>   pool: vz
>>>>  state: ONLINE
>>>>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul  8 16:22:17 2014
>>>> config:
>>>>
>>>>         NAME        STATE     READ WRITE CKSUM
>>>>         vz          ONLINE       0     0     0
>>>>           mirror-0  ONLINE       0     0     0
>>>>             sda     ONLINE       0     0     0
>>>>             sdb     ONLINE       0     0     0
>>>>           mirror-1  ONLINE       0     0     0
>>>>             sde     ONLINE       0     0     0
>>>>             sdf     ONLINE       0     0     0
>>>>           mirror-2  ONLINE       0     0     0
>>>>             sdg     ONLINE       0     0     0
>>>>             sdh     ONLINE       0     0     0
>>>>         logs
>>>>           mirror-3  ONLINE       0     0     0
>>>>             sdc3    ONLINE       0     0     0
>>>>             sdd3    ONLINE       0     0     0
>>>>         cache
>>>>           sdc5      ONLINE       0     0     0
>>>>           sdd5      ONLINE       0     0     0
>>>>
>>>> errors: No known data errors
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # zfs list
>>>> NAME              USED  AVAIL  REFER  MOUNTPOINT
>>>> vz                432G  2.25T    36K  /vz
>>>> vz/private        427G  2.25T   111K  /vz/private
>>>> vz/private/101   17.7G  42.3G  17.7G  /vz/private/101
>>>> <snip>
>>>> vz/root           104K  2.25T   104K  /vz/root
>>>> vz/template      5.38G  2.25T  5.38G  /vz/template
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # zfs get compressratio vz/private/101
>>>> NAME            PROPERTY       VALUE  SOURCE
>>>> vz/private/101  compressratio  1.38x  -
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # ls /vz/private/101
>>>> private
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # ls /vz/private/101/private/
>>>> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
>>>> <snip>
>>>>
>>>> [r...@node2.prg.vpsfree.cz]
>>>>  ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>>>> VE_ROOT="/vz/root/101"
>>>> VE_PRIVATE="/vz/private/101/private"
>>>>
>>>>

_______________________________________________
Users mailing list
Users@openvz.org
https://lists.openvz.org/mailman/listinfo/users
