Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Bill Kenworthy
On 21/01/15 00:03, Rich Freeman wrote:
> On Tue, Jan 20, 2015 at 10:07 AM, James  wrote:
>> Bill Kenworthy <...@iinet.net.au> writes:
>>
>>> You can turn off COW and go single on btrfs to speed it up but bugs in
>>> ceph and btrfs lose data real fast!
>>
>> Interesting idea, since I'll have raid1 underneath each node. I'll need to
>> dig into this idea a bit more.
>>
> 
> So, btrfs and ceph solve an overlapping set of problems in an
> overlapping set of ways.  In general adding data security often comes
> at the cost of performance, and obviously adding it at multiple layers
> can come at the cost of additional performance.  I think the right
> solution is going to depend on the circumstances.
> 
> If ceph provided that protection against bitrot I'd probably avoid a
> COW filesystem entirely.  It isn't going to add any additional value,
> and COW filesystems do have a performance cost.  If I had mirroring at the ceph
> level I'd probably just run them on ext4 on lvm with no
> mdadm/btrfs/whatever below that.  Availability is already ensured by
> ceph - if you lose a drive then other nodes will pick up the load.  If
> I didn't have robust mirroring at the ceph level then having mirroring
> of some kind at the individual node level would improve availability.
> 
> On the other hand, ceph currently has some gaps, so having it on top
> of zfs/btrfs could provide protection against bitrot.  However, right
> now there is no way to turn off COW while leaving checksumming
> enabled.  It would be nice if you could leave the checksumming on.
> Then if there was bitrot btrfs would just return an error when you
> tried to read the file, and then ceph would handle it like any other
> disk error and use a mirrored copy on another node.  The problem with
> ceph+ext4 is that if there is bitrot neither layer will detect it.
> 
> Does btrfs+ceph really have a performance hit that is larger than
> btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
> Btrfs in general performs fairly poorly right now - that is expected
> to improve in the future, but I doubt that it will ever outperform
> ext4 other than for specific operations that benefit from it (like
> reflink copies).  It will always be faster to just overwrite one block
> in the middle of a file than to write the block out to unallocated
> space and update all the metadata.
> 

answer to both you and James here:

I think it was pre 8.0 when I dropped out.  It's ceph that suffers from
the bitrot - I used the "golden master" approach to generating the VMs,
so corruption was obvious.  I did report one bug in the early days that
turned out to be btrfs, but I think it was largely ceph.  That has been
borne out by consolidating the ceph trial hardware and using it with
btrfs on the same storage - problems are now rare, and when they did
happen I could point to hardware/power as the cause.

The performance hit was not due to lack of horsepower (cpu, ram etc) but
due to I/O - both network bandwidth and the internal bus on the hosts.
That is why a small number of systems, no matter how powerful, won't
work well.  For real performance, I saw people using SSDs and large
numbers of hosts in order to distribute the data flows - this does work,
and I saw some insane numbers posted.  It also requires multiple
networks (internal and external) to separate the flows (not VLANs but
dedicated pipes) due to the extreme burstiness of the traffic.  As well
as VM images, I had backups (using dirvish) and thousands of security
camera images.  Deleting a directory with a lot of files would take many
hours.  Same with using ceph for a mail store (it came up on the ceph
list under "why is it so slow") - as a chunk server it's just not
suitable for lots of small files.
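
(For reference, separating the flows is just a couple of lines in
ceph.conf - the addresses below are made up, adjust to your own subnets:

[global]
    # client <-> cluster traffic
    public network = 192.168.1.0/24
    # OSD <-> OSD replication and recovery traffic
    cluster network = 10.0.0.0/24

The point is that replication bursts never compete with client I/O for
the same pipe.)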

Towards the end of my use, I stopped seeing bitrot on a system that held
data but sat idle - it was limited to occurring during heavy use.  My
overall conclusion is that lots of small hosts with no more than a
couple of drives each, and multiple networks with lots of bandwidth, is
what it's designed for.

I had two reasons for looking at ceph: distributed storage where data
in use is held close to the user but can be redistributed easily, with
multiple copies (think two small data stores over an intermittent WAN
link, storing high and low priority data); and high performance with
high availability on HW failure.

Ceph was not the answer for me at the scale I have.

BillK




Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Rich Freeman
On Tue, Jan 20, 2015 at 12:27 PM, James  wrote:
>
> Raid 1 with btrfs can protect not only the ceph fs files but also the
> gentoo node installation itself.

Agree 100%.  Like I said, the right solution depends on your situation.

If the server doing ceph storage is used only for file serving, then
protecting the OS installation isn't very important.  Heck, you could
just run the OS off of a USB stick.

If you're running nodes that do a combination of application and
storage, then obviously you need to worry about both, which probably
means not relying on ceph as your sole source of protection.  That
applies to a lot of "kitchen sink" setups where hosts don't have a
single role.

--
Rich



Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Rich Freeman
On Tue, Jan 20, 2015 at 10:07 AM, James  wrote:
> Bill Kenworthy <...@iinet.net.au> writes:
>
>> You can turn off COW and go single on btrfs to speed it up but bugs in
>> ceph and btrfs lose data real fast!
>
> Interesting idea, since I'll have raid1 underneath each node. I'll need to
> dig into this idea a bit more.
>

So, btrfs and ceph solve an overlapping set of problems in an
overlapping set of ways.  In general adding data security often comes
at the cost of performance, and obviously adding it at multiple layers
can come at the cost of additional performance.  I think the right
solution is going to depend on the circumstances.

If ceph provided that protection against bitrot I'd probably avoid a
COW filesystem entirely.  It isn't going to add any additional value,
and COW filesystems do have a performance cost.  If I had mirroring at the ceph
level I'd probably just run them on ext4 on lvm with no
mdadm/btrfs/whatever below that.  Availability is already ensured by
ceph - if you lose a drive then other nodes will pick up the load.  If
I didn't have robust mirroring at the ceph level then having mirroring
of some kind at the individual node level would improve availability.

On the other hand, ceph currently has some gaps, so having it on top
of zfs/btrfs could provide protection against bitrot.  However, right
now there is no way to turn off COW while leaving checksumming
enabled.  It would be nice if you could leave the checksumming on.
Then if there was bitrot btrfs would just return an error when you
tried to read the file, and then ceph would handle it like any other
disk error and use a mirrored copy on another node.  The problem with
ceph+ext4 is that if there is bitrot neither layer will detect it.
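
(For what it's worth, exercising the btrfs-side detection is easy
enough - roughly, from memory:

# read every block and verify it against its checksum, reporting errors
btrfs scrub start -B /data
# running per-device counters of csum/read/write errors
btrfs device stats /data

A csum error on an OSD's filesystem then surfaces to ceph as an ordinary
read error, which is exactly the behaviour you'd want.)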

Does btrfs+ceph really have a performance hit that is larger than
btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
Btrfs in general performs fairly poorly right now - that is expected
to improve in the future, but I doubt that it will ever outperform
ext4 other than for specific operations that benefit from it (like
reflink copies).  It will always be faster to just overwrite one block
in the middle of a file than to write the block out to unallocated
space and update all the metadata.
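
(The reflink case being things along the lines of:

# instant copy - both files share extents until one of them is modified
cp --reflink=always image.qcow2 image-clone.qcow2

where ext4 has to copy every block and btrfs only writes new metadata.)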

-- 
Rich



Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Bill Kenworthy
On 20/01/15 05:10, Rich Freeman wrote:
> On Mon, Jan 19, 2015 at 11:50 AM, James  wrote:
>> Bill Kenworthy <...@iinet.net.au> writes:
>>
>> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
>> btrfs.
> 
> From mine:
> /dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342  /      btrfs  noatime,ssd,compress=none
> /dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959  /data  btrfs  noatime,compress=none
> 
> The first is a single disk, the second is 5-drive raid1.
> 
> I disabled compression due to some bugs a few kernels ago.  I need to
> look into whether those were fixed - normally I'd use lzo.
> 
> I use dracut - obviously you need to use some care when running root
> on a disk identified by uuid since this isn't a kernel feature.  With
> btrfs as long as you identify one device in an array it will find the
> rest.  They all have the same UUID though.
> 
> Probably also worth noting that if you try to run btrfs on top of lvm
> and then create an lvm snapshot, btrfs can cause spectacular breakage
> when it sees two devices whose metadata identify them as being the
> same - I don't know where that went, but there was talk of trying to use
> a generation id/etc to keep track of which ones are old vs recent in
> this scenario.
> 
>>
>> Eventually, I want to run CephFS on several of these raid one btrfs
>> systems for some clustering code experiments. I'm not sure how that
>> will affect, if at all, the raid 1-btrfs-uuid setup.
>>
> 
> Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.
> 
> The main thing keeping me away from CephFS is that it has no mechanism
> for resolving silent corruption.  Btrfs underneath it would obviously
> help, though not for failure modes that involve CephFS itself.  I'd
> feel a lot better if CephFS had some way of determining which copy was
> the right one other than "the master server always wins."
> 

Forget ceph on btrfs for the moment - the COW kills it stone dead under
real use.  When running a small handful of VMs on a raid1 with ceph -
slow :)

You can turn off COW and go single on btrfs to speed it up but bugs in
ceph and btrfs lose data real fast!
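
(If anyone wants to try it anyway, it's roughly this - untested, and the
paths are only examples:

# per-directory nodatacow: affects files created afterwards only, and
# note that it also turns off checksumming for those files
chattr +C /var/lib/ceph/osd

# or mount the whole filesystem with -o nodatacow

# convert existing data and metadata to the single profile
btrfs balance start -dconvert=single -mconvert=single /mnt/osd

but as I said, when either layer hiccups you lose data real fast.)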

Ceph itself (my last setup trashed itself 6 months ago and I've given
up!) will only work under real use/heavy loads with lots of discrete
systems, ideally a 10G network, and small disks to spread the failure
domain.  Using 3 hosts and 2x2g disks per host wasn't anywhere near big
enough :(  Its design means that small scale trials just won't work.

It's not designed for small scale/low-end hardware, no matter how
attractive the idea is :(

BillK




Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Bill Kenworthy
On 20/01/15 00:50, James wrote:
> Bill Kenworthy <...@iinet.net.au> writes:
> 
> 
>>> On 19.01.2015 at 09:32, Bill Kenworthy wrote:
> 
>>>> Can someone suggest what is causing a balance on this raid 1
> 
> Interesting.
> I am about to test (reboot) a btrfs, raid one installation.
> 
>> Brilliant, you have hit on the answer! - The ancient 300GB system disk
>> was sda at one point and moved to sdb - possibly at the time I changed
>> to using UUIDs.  I've just resized all the disks and it's now moved past
>> 300G for the first time, as well as the other two falling into step with
>> the data moving.
> 
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.
> 
> Could you post your /etc/fstab and any other modifications you made to
> your installation related to the btrfs, raid 1 uuid setup?
> 
> I'm just using (2) identical 2T disks for my new gentoo workstation.
> 
>> I moved to UUIDs as the machine has a number of sata ports and a PCI-e
>> sata adaptor, and the sd* drive numbering kept moving around when I added
>> the WD red.
> 
> 
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
> 
> 
> TIA,
> James
> 
> 
> 
> 

Sorry about the line wrap:

rattus backups # lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda        8:0    0   1.8T  0 disk
sdb        8:16   0 279.5G  0 disk
├─sdb1     8:17   0   100M  0 part
├─sdb2     8:18   0     8G  0 part [SWAP]
└─sdb3     8:19   0 271.4G  0 part /
sdc        8:32   0   1.8T  0 disk /mnt/vm
sdd        8:48   0   1.8T  0 disk
sde        8:64   0   1.8T  0 disk
rattus backups #

rattus backups # blkid
/dev/sda: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="9003b772-3487-447a-9794-50cf9880a9c0" TYPE="btrfs" PTTYPE="dos"
/dev/sdc: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="20523d9d-3d90-439e-ad68-62def0824198" TYPE="btrfs"
/dev/sdb1: UUID="cc5f4bf7-28fc-4661-9d24-a0c9d0048f40" TYPE="ext2"
/dev/sdb2: UUID="dddb7e60-89a9-40d4-bf6b-ff4644e079e9" TYPE="swap"
/dev/sdb3: UUID="04d8ff4f-fe19-4530-ab45-d82fcd647515" UUID_SUB="72134593-8c9f-436f-98ce-fbb07facbf35" TYPE="btrfs"
/dev/sdd: UUID="f5a284b6-442f-4b3d-aa1a-8d6296f517b1" UUID_SUB="2ca026f7-e5c9-4ece-bba1-809ddb03979b" TYPE="btrfs"
rattus backups #


rattus backups # cat /etc/fstab

UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40  /boot            ext2   noauto,noatime                                            1 2
UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515  /                btrfs  defaults,noatime,compress=lzo,space_cache                 0 0
UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9  none             swap   sw                                                        0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/btrfs-root  btrfs  defaults,noatime,compress=lzo,space_cache                 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /home/wdk        btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=258    0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/backups     btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=365    0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1  /mnt/vm          btrfs  defaults,noatime,compress=lzo,space_cache,subvolid=14916  0 0

rattus backups #




Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Rich Freeman
On Mon, Jan 19, 2015 at 11:50 AM, James  wrote:
> Bill Kenworthy <...@iinet.net.au> writes:
>
> I was wondering what my /etc/fstab should look like using uuids, raid 1 and
> btrfs.

From mine:
/dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342  /      btrfs  noatime,ssd,compress=none
/dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959  /data  btrfs  noatime,compress=none

The first is a single disk, the second is 5-drive raid1.
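
(Creating that sort of array is just something like this - device names
made up:

mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

and you can then mount it by any one member or by the filesystem UUID.)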

I disabled compression due to some bugs a few kernels ago.  I need to
look into whether those were fixed - normally I'd use lzo.

I use dracut - obviously you need to use some care when running root
on a disk identified by uuid since this isn't a kernel feature.  With
btrfs as long as you identify one device in an array it will find the
rest.  They all have the same UUID though.
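
(Concretely, that means something along these lines on the kernel
command line - adjust to taste:

root=UUID=7d9f3772-a39c-408b-9be0-5fa26eec8342 rootfstype=btrfs rootflags=noatime,ssd,compress=none

and "btrfs filesystem show /data" will list every member of the raid1
under that single filesystem UUID, each with its own devid.)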

Probably also worth noting that if you try to run btrfs on top of lvm
and then create an lvm snapshot, btrfs can cause spectacular breakage
when it sees two devices whose metadata identify them as being the
same - I don't know where that went, but there was talk of trying to use
a generation id/etc to keep track of which ones are old vs recent in
this scenario.

>
> Eventually, I want to run CephFS on several of these raid one btrfs
> systems for some clustering code experiments. I'm not sure how that
> will affect, if at all, the raid 1-btrfs-uuid setup.
>

Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.

The main thing keeping me away from CephFS is that it has no mechanism
for resolving silent corruption.  Btrfs underneath it would obviously
help, though not for failure modes that involve CephFS itself.  I'd
feel a lot better if CephFS had some way of determining which copy was
the right one other than "the master server always wins."

-- 
Rich