On 07/10/2014 12:50 PM, Pavel Snajdr wrote:
> On 07/10/2014 12:32 PM, Pavel Odintsov wrote:
>> Could you share your patches to vzmigrate and vzctl?
>
> We don't have any. Where vzctl/vzmigrate didn't satisfy our needs, we went around these utilities and let vpsAdmin on the hwnode manage things.
>
> You can take a look here:
>
> https://github.com/vpsfreecz/vpsadmind
>
> I wouldn't recommend that anyone outside of our organization use vpsAdmin yet, as the 2.0 transition to a self-describing RESTful API is still underway. As soon as it's finished and well documented, I'll post a note here as well.
>
> The 2.0 version will be primarily controlled via a CLI tool, which autogenerates itself from the API description.
>
> A running version of the API can be seen here:
>
> https://api.vpsfree.cz/v1/
>
> GitHub repos:
>
> https://github.com/vpsfreecz/vpsadminapi (the API)
> https://github.com/vpsfreecz/vpsadminctl (the CLI tool)
>
> https://github.com/vpsfreecz/vpsadmind (daemon run on the hwnode)
> https://github.com/vpsfreecz/vpsadmindctl (CLI tool to control the daemon)
>
> https://github.com/vpsfreecz/vpsadmin
>
> The last repo is vpsAdmin 1.x, which all the 2.0 components still require to run. It's a pain to get this running yourself, but stay tuned: once we get rid of 1.x and document 2.0 properly, it's going to be a great thing.
>
> /snajpa
>
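By the way, since the API above is self-describing, you can already poke at it without any client - e.g. fetch the root document (assuming here that it lists the available resources, which is what the self-description is for):

curl -s https://api.vpsfree.cz/v1/
# the exact output is whatever the API's self-description defines;
# the vpsadminctl CLI mentioned above generates itself from that same description
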
Though, if you don't mind managing things via a web interface, vpsAdmin 1.x can be installed through these scripts:

https://github.com/vpsfreecz/vpsadmininstall

/snajpa

>>
>> On Thu, Jul 10, 2014 at 2:25 PM, Pavel Odintsov <[email protected]> wrote:
>>> Thank you for your answers! It's really useful information.
>>>
>>> On Thu, Jul 10, 2014 at 2:08 PM, Pavel Snajdr <[email protected]> wrote:
>>>> On 07/10/2014 11:35 AM, Pavel Odintsov wrote:
>>>>>> Not true, IO limits are working as they should (if we're talking vzctl set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO accounting support, so it is there.
>>>>>
>>>>> Can you share tests with us? For standard folders like simfs, these limits work badly in a large number of cases.
>>>>
>>>> If you can give me concrete tests to run, sure, I'm curious to see if you're right - then we'd have something concrete to fix :)
>>>>
>>>>>
>>>>>> How? ZFS doesn't have a limit on the number of files (2^48 isn't really a limit).
>>>>>
>>>>> Is it OK when your customer creates a billion small files on a 10 GB VPS and you then try to archive it for backup? On a slow disk system it's a real nightmare, because of the huge number of disk operations, which kills your I/O.
>>>>
>>>> zfs snapshot <dataset>@<snapname>
>>>> zfs send <dataset>@<snapname> > your-file, or | ssh backuper zfs recv <backupdataset>
>>>>
>>>> That's done on the block level. No need to run rsync anymore; it's a lot faster this way.
>>>>
>>>>>
>>>>>> Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS. I thought the point of migration is for the CT not to notice any change; I don't see why the inode numbers should change.
>>>>>
>>>>> Do you have a really working zero-downtime vzmigrate on ZFS?
>>>>
>>>> Nope, vzmigrate isn't zero downtime. Since vzctl/vzmigrate don't support ZFS, we're implementing this our own way in vpsAdmin, which in its 2.0 re-implementation will go open source under the GPL.
>>>>
>>>>>
>>>>>> How exactly? I haven't seen a problem with any userspace software, other than MySQL defaulting to AIO (it falls back to an older method), which ZFS doesn't support (*yet*, they have it in their plans).
>>>>>
>>>>> I'm speaking about MySQL primarily. I have thousands of containers and I can't tune MySQL to another mode for all customers; it's impossible.
>>>>
>>>> As I said, this is under development and will improve.
>>>>
>>>>>
>>>>>> The L2ARC cache is really smart.
>>>>>
>>>>> Yep, fine, I knew. But can you account L2ARC cache usage per customer? OpenVZ can do it via a flag:
>>>>> sysctl -a | grep pagecache_isola
>>>>> ubc.pagecache_isolation = 0
>>>>
>>>> I can't account for caches per CT, but I haven't had any need to do so.
>>>>
>>>> L2ARC != ARC. ARC is in system RAM; L2ARC is intended to be on SSD for the content of ARC that is the least significant in case of low memory - it gets pushed from ARC to L2ARC.
>>>>
>>>> ARC has two primary lists of cached data - most frequently used and most recently used - and these two lists are divided by a boundary marking which data can be pushed away in a low-memory situation.
>>>>
>>>> It doesn't happen, as with the Linux VFS cache, that you copy one big file and it pushes out all of the other useful data.
>>>>
>>>> Thanks to this distinction between MRU and MFU, ARC achieves far better hitrates.
>>>>
>>>>>
>>>>> But one customer can eat almost all of the L2ARC cache and displace other customers' data.
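A note on watching this in practice: ZFS on Linux exposes the global ARC and L2ARC counters in /proc/spl/kstat/zfs/arcstats, so you can at least see the overall hit rates and the MRU/MFU split - as said above, there is no per-CT breakdown. A rough example; the field names are from the 0.6.x releases:

# overall ARC efficiency and the MRU/MFU split
grep -E '^(hits|misses|mru_hits|mfu_hits) ' /proc/spl/kstat/zfs/arcstats

# L2ARC usage
grep -E '^l2_(hits|misses|size) ' /proc/spl/kstat/zfs/arcstats
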
>>>>
>>>> Yes, but ZFS keeps track of what's being used, so useful data can't be pushed away that easily; things naturally balance themselves due to the way the ARC mechanism works.
>>>>
>>>>>
>>>>> I'm not against ZFS, but I'm against using ZFS as the underlying system for containers. We caught ~100 kernel bugs with simfs on EXT4 when customers did some strange things.
>>>>
>>>> I haven't encountered any problems, especially with vzquota disabled (no need for it; ZFS has its own quotas, which never need to be recalculated the way vzquota's do).
>>>>
>>>>>
>>>>> But ext4 has a few thousand developers and they fix these issues ASAP, while ZFS on Linux has only 3-5 developers, which is VERY slow. Because of this I recommend using ext4 with ploop, because that solution is rock stable, or ZFS with ZVOLs and ext4 on top, because that solution is more reliable and more predictable than placing containers directly on ZFS datasets.
>>>>
>>>> ZFS itself is a stable and mature filesystem; it first shipped in production with Solaris in 2006. And it's still being developed upstream as OpenZFS; that code is shared between the primary version - Illumos - and the ports - FreeBSD, OS X, Linux.
>>>>
>>>> So what really needed, and still is being, developed is the way ZFS runs under the Linux kernel, but with the recent release of 0.6.3 things have gotten mature enough to be used in production without any fears. Of course, no software is without bugs, but I can say with absolute certainty that ZFS will never eat your data; the only problem you can encounter is with memory management, which is done really differently in Linux than in ZFS's original habitat - Solaris.
>>>>
>>>> /snajpa
>>>>
>>>>>
>>>>> On Thu, Jul 10, 2014 at 1:08 PM, Pavel Snajdr <[email protected]> wrote:
>>>>>> On 07/10/2014 10:34 AM, Pavel Odintsov wrote:
>>>>>>> Hello!
>>>>>>>
>>>>>>> Your scheme is fine, but you can't divide I/O load with the blkio cgroup (ioprio/iolimit/iopslimit) between different folders; between different ZVOLs you can.
>>>>>>
>>>>>> Not true, IO limits are working as they should (if we're talking vzctl set --iolimit/--iopslimit). I've kicked the ZoL guys around to add IO accounting support, so it is there.
>>>>>>
>>>>>>>
>>>>>>> I can imagine the following problems for a per-folder scheme:
>>>>>>> 1) You can't limit the number of inodes in different folders (ZFS doesn't have an inode limit like ext4, but a huge number of files in a container could break the node;
>>>>>>
>>>>>> How? ZFS doesn't have a limit on the number of files (2^48 isn't really a limit).
>>>>>>
>>>>>>> http://serverfault.com/questions/503658/can-you-set-inode-quotas-in-zfs)
>>>>>>> 2) Problems with the system cache, which is shared by all containers on the HWN.
>>>>>>
>>>>>> This exactly isn't a problem but a *HUGE* benefit; you'd need to see it in practice :) The Linux VFS cache is really dumb in comparison to ARC. ARC's hitrates just can't be achieved with what Linux currently offers.
>>>>>>
>>>>>>> 3) Problems with live migration, because you _would_ have to change inode numbers on different nodes.
>>>>>>
>>>>>> Why? ZFS send/receive is able to do a bit-by-bit identical copy of the FS. I thought the point of migration is for the CT not to notice any change; I don't see why the inode numbers should change.
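To make the send/receive part above concrete - this is roughly what a block-level copy of one CT's dataset to another node looks like (dataset and host names are placeholders; this shows only the plain zfs mechanism, not how vpsAdmin orchestrates it):

# initial full copy while the CT keeps running
zfs snapshot vz/private/101@copy-1
zfs send vz/private/101@copy-1 | ssh node2 zfs recv -F vz/private/101

# later, send only what changed since the first snapshot
zfs snapshot vz/private/101@copy-2
zfs send -i @copy-1 vz/private/101@copy-2 | ssh node2 zfs recv -F vz/private/101

The same stream can of course go into a file instead of over ssh, which is the backup case mentioned above.
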
>>>>>>
>>>>>>> 4) ZFS behaviour with Linux software is very STRANGE in some cases (DIRECT_IO).
>>>>>>
>>>>>> How exactly? I haven't seen a problem with any userspace software, other than MySQL defaulting to AIO (it falls back to an older method), which ZFS doesn't support (*yet*, they have it in their plans).
>>>>>>
>>>>>>> 5) ext4 has good support from vzctl (fsck, resize2fs).
>>>>>>
>>>>>> Yeah, but ext4 sucks big time. At least in my use case.
>>>>>>
>>>>>> We've implemented most of the vzctl create/destroy/etc. functionality in our vpsAdmin software instead.
>>>>>>
>>>>>> Guys, can I ask you to keep your minds open instead of fighting with pointless arguments? :) Give ZFS a try and then decide for yourselves.
>>>>>>
>>>>>> I think the community would benefit greatly if ZFS weren't fought as something alien in the Linux world, which in my experience is what every Linux zealot I talk to about ZFS does. That's just not fair. It's primarily about technology, about the best tool for the job. If we could implement something like this in Linux without ties to CDDL and possibly Oracle patents, that would be awesome, yet nobody has done such a thing yet. BTRFS is nowhere near ZFS when it comes to running larger-scale deployments, and in some regards I don't think it will ever match ZFS, just looking at the way it's been designed.
>>>>>>
>>>>>> I'm not trying to flame here; I'm trying to open you guys up to the fact that there really is a better alternative than what you're currently seeing. And if it has some technological drawbacks like the ones you're trying to point out, instead of treating them as something that can't be changed (and thus everyone should use "your best solution(tm)"), try to think of ways to change them for the better.
>>>>>>
>>>>>>>
>>>>>>> My thinking here is like the simfs vs. ploop comparison: http://openvz.org/images/f/f3/Ct_in_a_file.pdf
>>>>>>
>>>>>> Again, you have to see ZFS doing its magic in production under really heavy load, otherwise you won't understand. The arbitrary benchmarks I've seen show ZFS is slower than ext4, but they aren't tuned for the kind of use case I'm talking about.
>>>>>>
>>>>>> /snajpa
>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 10, 2014 at 12:06 PM, Pavel Snajdr <[email protected]> wrote:
>>>>>>>> On 07/09/2014 06:58 PM, Kir Kolyshkin wrote:
>>>>>>>>> On 07/08/2014 11:54 PM, Pavel Snajdr wrote:
>>>>>>>>>> On 07/08/2014 07:52 PM, Scott Dowdle wrote:
>>>>>>>>>>> Greetings,
>>>>>>>>>>>
>>>>>>>>>>> ----- Original Message -----
>>>>>>>>>>>> (offtopic) We cannot use ZFS. Unfortunately, a NAS with something like Nexenta is too expensive for us.
>>>>>>>>>>>
>>>>>>>>>>> From what I've gathered from a few presentations, ZFS on Linux (http://zfsonlinux.org/) is as stable as, but more performant than, it is on the OpenSolaris forks... so you can build your own if you can spare the people to learn the best practices.
>>>>>>>>>>>
>>>>>>>>>>> I don't have a use for ZFS myself, so I'm not really advocating it.
>>>>>>>>>>>
>>>>>>>>>>> TYL,
>>>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> we run tens of OpenVZ nodes (bigger boxes: 256G RAM, 12+ cores, at least 90 CTs). We used to run ext4+flashcache, but ext4 has proven to be a bottleneck.
That was the primary motivation behind ploop as far as I know.
>>>>>>>>>>
>>>>>>>>>> We switched to ZFS on Linux around the time ploop was announced, and I haven't had second thoughts since. ZFS really *is*, in my experience, the best filesystem there is at the moment for this kind of deployment - especially if you use dedicated SSDs for the ZIL and L2ARC, although the latter is less important. You will know what I'm talking about when you try this on boxes with lots of CTs doing LAMP load - databases and their synchronous writes are the real problem, which ZFS with a dedicated ZIL device solves.
>>>>>>>>>>
>>>>>>>>>> Also there is the ARC caching, which is smarter than the Linux VFS cache - we're able to achieve about a 99% hitrate about 99% of the time, even under high loads.
>>>>>>>>>>
>>>>>>>>>> Having said all that, I recommend that everyone give ZFS a chance, but I'm aware this is yet more out-of-mainline code and that doesn't suit everyone that well.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Are you using per-container ZVOLs or something else?
>>>>>>>>
>>>>>>>> That would mean I'd need to put another filesystem on top of ZFS, which would in turn mean adding another unnecessary layer of indirection. ZFS is pooled storage, like BTRFS; we're giving one dataset to each container.
>>>>>>>>
>>>>>>>> vzctl tries to move the VE_PRIVATE folder around, so we had to add one more directory to put the VE_PRIVATE data into (see the first ls).
>>>>>>>>
>>>>>>>> Example from production:
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # zpool status vz
>>>>>>>>   pool: vz
>>>>>>>>  state: ONLINE
>>>>>>>>   scan: scrub repaired 0 in 1h24m with 0 errors on Tue Jul 8 16:22:17 2014
>>>>>>>> config:
>>>>>>>>
>>>>>>>>         NAME          STATE     READ WRITE CKSUM
>>>>>>>>         vz            ONLINE       0     0     0
>>>>>>>>           mirror-0    ONLINE       0     0     0
>>>>>>>>             sda       ONLINE       0     0     0
>>>>>>>>             sdb       ONLINE       0     0     0
>>>>>>>>           mirror-1    ONLINE       0     0     0
>>>>>>>>             sde       ONLINE       0     0     0
>>>>>>>>             sdf       ONLINE       0     0     0
>>>>>>>>           mirror-2    ONLINE       0     0     0
>>>>>>>>             sdg       ONLINE       0     0     0
>>>>>>>>             sdh       ONLINE       0     0     0
>>>>>>>>         logs
>>>>>>>>           mirror-3    ONLINE       0     0     0
>>>>>>>>             sdc3      ONLINE       0     0     0
>>>>>>>>             sdd3      ONLINE       0     0     0
>>>>>>>>         cache
>>>>>>>>           sdc5        ONLINE       0     0     0
>>>>>>>>           sdd5        ONLINE       0     0     0
>>>>>>>>
>>>>>>>> errors: No known data errors
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # zfs list
>>>>>>>> NAME             USED  AVAIL  REFER  MOUNTPOINT
>>>>>>>> vz               432G  2.25T    36K  /vz
>>>>>>>> vz/private       427G  2.25T   111K  /vz/private
>>>>>>>> vz/private/101  17.7G  42.3G  17.7G  /vz/private/101
>>>>>>>> <snip>
>>>>>>>> vz/root          104K  2.25T   104K  /vz/root
>>>>>>>> vz/template     5.38G  2.25T  5.38G  /vz/template
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # zfs get compressratio vz/private/101
>>>>>>>> NAME            PROPERTY       VALUE  SOURCE
>>>>>>>> vz/private/101  compressratio  1.38x  -
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # ls /vz/private/101
>>>>>>>> private
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # ls /vz/private/101/private/
>>>>>>>> aquota.group  aquota.user  b  bin  boot  dev  etc  git  home  lib
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>> [[email protected]]
>>>>>>>> ~ # cat /etc/vz/conf/101.conf | grep -P "PRIVATE|ROOT"
>>>>>>>> VE_ROOT="/vz/root/101"
>>>>>>>> VE_PRIVATE="/vz/private/101/private"
>>>>>>>>
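To complete the picture, creating a new CT on such a layout boils down to roughly the following. This is only a sketch - CTID 102, the template name and the quota value are made up, and in production vpsAdmin drives these steps rather than plain vzctl:

# per-CT dataset with its own quota and compression (the ZFS quota replaces vzquota)
zfs create -o quota=60G -o compression=lz4 vz/private/102

# the extra 'private' subdirectory keeps vzctl from moving the dataset's mountpoint around
vzctl create 102 --layout simfs --private /vz/private/102/private \
    --root /vz/root/102 --ostemplate centos-6-x86_64

# run with vzquota disabled for this CT, as mentioned earlier in the thread
echo 'DISK_QUOTA="no"' >> /etc/vz/conf/102.conf
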
>>>
>>> --
>>> Sincerely yours, Pavel Odintsov
>>
>
_______________________________________________
Users mailing list
[email protected]
https://lists.openvz.org/mailman/listinfo/users
