Re: [gentoo-user] Re: btrfs fails to balance
On 21/01/15 00:03, Rich Freeman wrote:
> On Tue, Jan 20, 2015 at 10:07 AM, James wrote:
>> Bill Kenworthy iinet.net.au> writes:
>>
>>> You can turn off COW and go single on btrfs to speed it up but bugs in
>>> ceph and btrfs lose data real fast!
>>
>> Interesting idea, since I'll have raid1 underneath each node. I'll need to
>> dig into this idea a bit more.
>>
>
> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways. In general adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance. I think the right
> solution is going to depend on the circumstances.
>
> If ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely. It isn't going to add any additional value,
> and it does have a performance cost. If I had mirroring at the ceph
> level I'd probably just run them on ext4 on lvm with no
> mdadm/btrfs/whatever below that. Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load. If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.
>
> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot. However, right
> now there is no way to turn off COW while leaving checksumming
> enabled. It would be nice if you could leave the checksumming on.
> Then if there was bitrot, btrfs would just return an error when you
> tried to read the file, and then ceph would handle it like any other
> disk error and use a mirrored copy on another node. The problem with
> ceph+ext4 is that if there is bitrot, neither layer will detect it.
>
> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph? I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies). It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.
>

Answer to both you and James here: I think it was pre 8.0 when I
dropped out. It's Ceph that suffers from bitrot - I use the "golden
master" approach to generating the VMs, so corruption was obvious. I
did report one bug in the early days that turned out to be btrfs, but I
think it was largely ceph, which has been borne out by consolidating
the ceph trial hardware and using it with btrfs and the same storage -
rare problems, and I can point to hardware/power when they happened.

The performance hit was not due to lack of horsepower (cpu, ram etc)
but due to I/O - both network bandwidth and the internal bus on the
hosts. That is why a small number of systems, no matter how powerful,
won't work well. For real performance, I saw people using SSDs and
large numbers of hosts in order to distribute the data flows - this
does work, and I saw some insane numbers posted. It also requires
multiple networks (internal and external) to separate the flows (not
VLANs but dedicated pipes) due to the extreme burstiness of the
traffic.

As well as VM images, I had backups (using dirvish) and thousands of
security camera images. Deletes of a directory with a lot of files
would take many hours. The same goes for using ceph as a mail store (it
came up on the ceph list under "why is it so slow") - as a chunk server
it's just not suitable for lots of small files. Towards the end of my
use, I stopped seeing bitrot on systems that held data but were idle;
it was limited to periods of heavy use.
My overall conclusion is that lots of small hosts with no more than a
couple of drives each, plus multiple networks with lots of bandwidth,
is what it's designed for. I had two reasons for looking at ceph:
distributed storage where data in use was held close to the user but
could be redistributed easily with multiple copies (think two small
data stores with an intermittent WAN link storing high- and
low-priority data), and high performance with high availability on HW
failure. Ceph was not the answer for me at the scale I have.

BillK
Re: [gentoo-user] Re: btrfs fails to balance
On Tue, Jan 20, 2015 at 12:27 PM, James wrote:
>
> Raid 1 with btrfs can not only protect the ceph fs files but the gentoo
> node installation itself.

Agree 100%. Like I said, the right solution depends on your situation.

If you're using the server doing ceph storage only for file serving,
then protecting the OS installation isn't very important. Heck, you
could just run the OS off of a USB stick.

If you're running nodes that do a combination of application and
storage, then obviously you need to worry about both, which probably
means not relying on ceph as your sole source of protection. That
applies to a lot of "kitchen sink" setups where hosts don't have a
single role.

--
Rich
Re: [gentoo-user] Re: btrfs fails to balance
On Tue, Jan 20, 2015 at 10:07 AM, James wrote:
> Bill Kenworthy iinet.net.au> writes:
>
>> You can turn off COW and go single on btrfs to speed it up but bugs in
>> ceph and btrfs lose data real fast!
>
> Interesting idea, since I'll have raid1 underneath each node. I'll need to
> dig into this idea a bit more.
>

So, btrfs and ceph solve an overlapping set of problems in an
overlapping set of ways. In general adding data security often comes at
the cost of performance, and obviously adding it at multiple layers can
come at the cost of additional performance. I think the right solution
is going to depend on the circumstances.

If ceph provided that protection against bitrot I'd probably avoid a
COW filesystem entirely. It isn't going to add any additional value,
and it does have a performance cost. If I had mirroring at the ceph
level I'd probably just run them on ext4 on lvm with no
mdadm/btrfs/whatever below that. Availability is already ensured by
ceph - if you lose a drive then other nodes will pick up the load. If I
didn't have robust mirroring at the ceph level then having mirroring of
some kind at the individual node level would improve availability.

On the other hand, ceph currently has some gaps, so having it on top of
zfs/btrfs could provide protection against bitrot. However, right now
there is no way to turn off COW while leaving checksumming enabled. It
would be nice if you could leave the checksumming on. Then if there was
bitrot, btrfs would just return an error when you tried to read the
file, and then ceph would handle it like any other disk error and use a
mirrored copy on another node. The problem with ceph+ext4 is that if
there is bitrot, neither layer will detect it.

Does btrfs+ceph really have a performance hit that is larger than btrfs
without ceph? I fully expect it to be slower than ext4+ceph.
Btrfs in general performs fairly poorly right now - that is expected to
improve in the future, but I doubt that it will ever outperform ext4
other than for specific operations that benefit from it (like reflink
copies). It will always be faster to just overwrite one block in the
middle of a file than to write the block out to unallocated space and
update all the metadata.

--
Rich
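[For anyone following along: a reflink copy is the CoW operation Rich
mentions - the new file shares the original's extents until one side is
modified. A quick sketch using coreutils cp; the /tmp path is just an
example, and --reflink=auto falls back to an ordinary copy on
filesystems without reflink support, so it runs anywhere:]

```shell
# Create a small file, then make a CoW (reflink) copy of it. On btrfs
# the copy is near-instant and consumes no extra data space until one
# of the two files is written to.
mkdir -p /tmp/reflink-demo
printf 'hello\n' > /tmp/reflink-demo/orig

# --reflink=auto degrades gracefully to a normal copy on ext4/tmpfs etc.
cp --reflink=auto /tmp/reflink-demo/orig /tmp/reflink-demo/copy

# Both files have identical contents regardless of how the copy was made.
cmp /tmp/reflink-demo/orig /tmp/reflink-demo/copy
```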
Re: [gentoo-user] Re: btrfs fails to balance
On 20/01/15 05:10, Rich Freeman wrote:
> On Mon, Jan 19, 2015 at 11:50 AM, James wrote:
>> Bill Kenworthy iinet.net.au> writes:
>>
>> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
>> btrfs.
>
> From mine:
> /dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342 /     btrfs noatime,ssd,compress=none
> /dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959 /data btrfs noatime,compress=none
>
> The first is a single disk, the second is 5-drive raid1.
>
> I disabled compression due to some bugs a few kernels ago. I need to
> look into whether those were fixed - normally I'd use lzo.
>
> I use dracut - obviously you need to use some care when running root
> on a disk identified by uuid since this isn't a kernel feature. With
> btrfs, as long as you identify one device in an array it will find the
> rest. They all have the same UUID though.
>
> Probably also worth noting that if you try to run btrfs on top of lvm
> and then create an lvm snapshot, btrfs can cause spectacular breakage
> when it sees two devices whose metadata identify them as being the
> same - I don't know where it went, but there was talk of trying to use
> a generation id/etc to keep track of which ones are old vs recent in
> this scenario.
>
>>
>> Eventually, I want to run CephFS on several of these raid one btrfs
>> systems for some clustering code experiments. I'm not sure how that
>> will affect, if at all, the raid 1-btrfs-uuid setup.
>>
>
> Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.
>
> The main thing keeping me away from CephFS is that it has no mechanism
> for resolving silent corruption. Btrfs underneath it would obviously
> help, though not for failure modes that involve CephFS itself. I'd
> feel a lot better if CephFS had some way of determining which copy was
> the right one other than "the master server always wins."
>

Forget ceph on btrfs for the moment - the COW kills it stone dead after
real use.
When running a small handful of VMs on a raid1 with ceph - slow :)

You can turn off COW and go single on btrfs to speed it up, but bugs in
ceph and btrfs lose data real fast! Ceph itself (my last setup trashed
itself 6 months ago and I've given up!) will only work under real
use/heavy loads with lots of discrete systems, ideally a 10G network,
and small disks to spread the failure domain. Using 3 hosts and 2x2g
disks per host wasn't near big enough :( Its design means that small
scale trials just won't work. It's not designed for small scale/low end
hardware, no matter how attractive the idea is :(

BillK
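[A sketch of what "turn off COW" looks like in practice - the UUID and
mount point below are placeholders, not from this thread, and note the
caveat Rich raises: nodatacow also disables btrfs checksumming and
compression for the affected files.]

```
# Hypothetical fstab line: nodatacow disables copy-on-write for newly
# created files on this mount (and with it, checksumming/compression).
UUID=<fs-uuid-here>  /var/lib/ceph  btrfs  noatime,nodatacow  0 0

# Alternative for a single directory (e.g. a VM image store): the +C
# attribute only takes effect on files created after it is set, so set
# it while the directory is still empty.
#   chattr +C /var/lib/ceph
```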
Re: [gentoo-user] Re: btrfs fails to balance
On 20/01/15 00:50, James wrote:
> Bill Kenworthy iinet.net.au> writes:
>
>>> On 19.01.2015 at 09:32, Bill Kenworthy wrote:
> Can someone suggest what is causing a balance on this raid 1
>
> Interesting.
> I am about to test (reboot) a btrfs, raid one installation.
>
>> Brilliant, you have hit on the answer! - The ancient 300GB system disk
>> was sda at one point and moved to sdb - possibly at the time I changed
>> to using UUIDs. I've just resized all the disks and it's now moved past
>> 300G for the first time, as well as the other two falling in step with
>> the data moving.
>
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.
>
> Could you post your /etc/fstab and any other modifications you made to
> your installation related to the btrfs, raid 1 uuid setup?
>
> I'm just using (2) identical 2T disks for my new gentoo workstation.
>
>> I moved to UUIDs as the machine has a number of sata ports and a PCI-e
>> sata adaptor and the sd* drive numbering kept moving around when I added
>> the WD red.
>
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
>
> TIA,
> James

Sorry about the line wrap:

rattus backups # lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0   1.8T  0 disk
sdb        8:16   0 279.5G  0 disk
├─sdb1     8:17   0   100M  0 part
├─sdb2     8:18   0     8G  0 part [SWAP]
└─sdb3     8:19   0 271.4G  0 part /
sdc        8:32   0   1.8T  0 disk /mnt/vm
sdd        8:48   0   1.8T  0 disk
sde        8:64   0   1.8T  0 disk
rattus backups #
rattus backups # blkid
/dev/sda:  UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="9003b772-3487-447a-9794-50cf9880a9c0" TYPE="btrfs" PTTYPE="dos"
/dev/sdc:  UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="20523d9d-3d90-439e-ad68-62def0824198" TYPE="btrfs"
/dev/sdb1: UUID="cc5f4bf7-28fc-4661-9d24-a0c9d0048f40" TYPE="ext2"
/dev/sdb2: UUID="dddb7e60-89a9-40d4-bf6b-ff4644e079e9" TYPE="swap"
/dev/sdb3: UUID="04d8ff4f-fe19-4530-ab45-d82fcd647515" UUID_SUB="72134593-8c9f-436f-98ce-fbb07facbf35" TYPE="btrfs"
/dev/sdd:  UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="2ca026f7-e5c9-4ece-bba1-809ddb03979b" TYPE="btrfs"
rattus backups #
rattus backups # cat /etc/fstab
UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40 /boot           ext2  noauto,noatime 1 2
UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515 /               btrfs defaults,noatime,compress=lzo,space_cache 0 0
UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9 none            swap  sw 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /mnt/btrfs-root btrfs defaults,noatime,compress=lzo,space_cache 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /home/wdk       btrfs defaults,noatime,compress=lzo,space_cache,subvolid=258 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /mnt/backups    btrfs defaults,noatime,compress=lzo,space_cache,subvolid=365 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /mnt/vm         btrfs defaults,noatime,compress=lzo,space_cache,subvolid=14916 0 0
rattus backups #
Re: [gentoo-user] Re: btrfs fails to balance
On Mon, Jan 19, 2015 at 11:50 AM, James wrote:
> Bill Kenworthy iinet.net.au> writes:
>
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.

From mine:

/dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342 /     btrfs noatime,ssd,compress=none
/dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959 /data btrfs noatime,compress=none

The first is a single disk, the second is 5-drive raid1.

I disabled compression due to some bugs a few kernels ago. I need to
look into whether those were fixed - normally I'd use lzo.

I use dracut - obviously you need to use some care when running root on
a disk identified by uuid since this isn't a kernel feature. With
btrfs, as long as you identify one device in an array it will find the
rest. They all have the same UUID though.

Probably also worth noting that if you try to run btrfs on top of lvm
and then create an lvm snapshot, btrfs can cause spectacular breakage
when it sees two devices whose metadata identify them as being the same
- I don't know where it went, but there was talk of trying to use a
generation id/etc to keep track of which ones are old vs recent in this
scenario.

>
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
>

Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.

The main thing keeping me away from CephFS is that it has no mechanism
for resolving silent corruption. Btrfs underneath it would obviously
help, though not for failure modes that involve CephFS itself. I'd feel
a lot better if CephFS had some way of determining which copy was the
right one other than "the master server always wins."

--
Rich
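[The checksum-plus-mirror behaviour Rich wishes for is something btrfs
raid1 can already exercise proactively with a periodic scrub. A minimal
cron sketch - the mount point /data and the weekly schedule are
assumptions for illustration, not from this thread:]

```
#!/bin/sh
# Hypothetical /etc/cron.weekly/btrfs-scrub: read every block on the
# raid1 pool and verify its checksum. On raid1, blocks that fail the
# check are rewritten from the good mirror; -B runs in the foreground,
# -d reports per-device statistics.
btrfs scrub start -B -d /data
```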