Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Jeffrey E Altman

On 3/22/2023 3:47 PM, spacefrogg-open...@spacefrogg.net wrote:

> > OpenAFS does not maintain checksums.  Checksums are neither transmitted in
> > the RXAFS_FetchData and RXAFS_StoreData RPC messages nor are checksums
> > stored and compared when reading and writing to the vice partition.
>
> Thanks for clearing this up. So, volume inconsistencies are just detected on
> the metadata level?


The salvager can update Volume metadata to be consistent with what is
stored in the vice partition: it can attach orphaned vnodes, remove
directory entries for missing vnodes, correct vnode size information,
rebuild directories, and perform a few other repairs. However, it has
no ability to repair a damaged block or restore a missing vnode.


btrfs and zfs have the benefit of being able to recover from damaged
disk sectors by maintaining multiple copies of each data block (if that
functionality is configured). However, they each have downsides as
well. Both require vast amounts of memory compared to other file
systems. Both have very inconsistent performance characteristics,
especially as free space falls below 40% and as memory pressure
increases. Both can fail under memory pressure or when their storage
volumes are full.


AuriStor recommends that AuriStorFS vice partitions be deployed using
xfs. There are several AuriStorFS end users that build xfs filesystems
on zfs-exported block devices when zfs is desired for additional
reliability.
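
A minimal sketch of that layout (pool, dataset and size are hypothetical):

    # create a 500G zvol and put xfs on top of it
    zfs create -V 500G tank/vicepa
    mkfs.xfs /dev/zvol/tank/vicepa
    mount /dev/zvol/tank/vicepa /vicepa

zfs then provides checksumming and self-healing at the block-device
layer while the fileserver sees an ordinary xfs vice partition.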


If zfs is going to be installed on Linux, I recommend using a Linux
distribution that packages zfs, to ensure that an incompatible kernel
update is never issued. I have observed sites lose vice partitions
hosted on zfs on RHEL because of subtle kernel incompatibilities. This
is a risk with all out-of-tree filesystems that are not tested against
every released kernel version, but especially with out-of-tree non-GPL
filesystems.
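
On Debian-family systems, one hedged way to sketch this (the package
name varies by distribution) is to hold the kernel until the zfs module
is known to build against it:

    apt-mark hold linux-image-amd64
    # ... after confirming zfs-dkms builds against the new kernel:
    apt-mark unhold linux-image-amd64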


Jeffrey Altman







Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread spacefrogg-openafs
> OpenAFS does not maintain checksums.  Checksums are neither transmitted in
> the RXAFS_FetchData and RXAFS_StoreData RPC messages nor are checksums
> stored and compared when reading and writing to the vice partition.

Thanks for clearing this up. So, volume inconsistencies are just detected on 
the metadata level?

–Michael


Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Jeffrey E Altman

On 3/22/2023 9:34 AM, Ciprian Craciun (ciprian.crac...@gmail.com) wrote:

> On Wed, Mar 22, 2023 at 10:30 AM  wrote:
>
> > OpenAFS implements its own CoW and using CoW below that again has no
> > benefits and disturbs the fileserver's "free-space" assumptions. It knows
> > when it makes in-place updates and does not expect to run out of space
> > in that situation.
>
> At what level does OpenAFS implement CoW?  Is it implemented at
> whole-file level, i.e. changing a file that is part of a replicated /
> backup volume will copy the entire file, or is it implemented at
> some range or smaller granularity, i.e. it will change only that
> range but share the rest?

OpenAFS performs CoW on whole files within a Volume on the first update
after a Volume clone is created.  The clone can be a ROVOL, BACKVOL or
untyped clone.

OpenAFS CoW is limited to sharing a vnode between multiple Volume instances.
Once a vnode can no longer be shared between two or more Volumes, the vnode
is copied.

OpenAFS does not perform CoW at a byte range level.

On Linux, btrfs and xfs both support CoW at the block level.  A one-byte
change to a 1GB file on btrfs will result in one block being copied and
modified, whereas OpenAFS will copy the entire 1GB.
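
A quick way to observe btrfs block-level CoW (file names hypothetical):

    cp --reflink=always big.img big-clone.img     # clone shares all extents
    printf 'x' | dd of=big-clone.img bs=1 count=1 conv=notrunc  # change 1 byte
    filefrag -v big.img big-clone.img             # compare extent maps: only
                                                  # the touched extent unshares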


> > > Unfortunately (at least for my use-case) losing the checksumming and
> > > compression is a no-go, because these were exactly the features that
> > > made BTRFS appealing versus Ext4.
> >
> > If you say so...
> > AFS does its own data checksumming.

OpenAFS does not maintain checksums.  Checksums are neither transmitted in
the RXAFS_FetchData and RXAFS_StoreData RPC messages nor are checksums
stored and compared when reading and writing to the vice partition.

> Granted, RAID is not a backup solution, but it should instead protect
> one from faulty hardware.  Which is exactly what it doesn't do 100%,
> because if one of the drives in the array returns corrupted data, the
> RAID system can't say which one it is (based purely on the returned
> data).  Granted, disks don't just return random data without any other
> failure or symptom.

Bit flips occur more frequently than we would like, which was the
rationale behind adding checksums, multiple copies, and self-healing to ZFS.
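
Those checks can also be exercised explicitly (pool name hypothetical):

    zpool scrub tank        # read and verify every block against its checksum
    zpool status -v tank    # report any corruption found or repaired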

> With regard to file-system scrubbing, to my knowledge, only those that
> actually have checksumming can do this, which currently is either
> BTRFS or ZFS.

There are some other examples, but these are the only two regularly
available as a local Linux filesystem.

Jeffrey Altman






Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Dirk Heinrichs

Ciprian Craciun:

> Well, I base this supposition on my simple observation with OpenAFS's
> own client which is also out-of-tree and requires custom module builds
> (via DKMS or equivalent).
>
> For example I use OpenSUSE Tumbleweed (rolling release), and sometimes
> I need to delay my updates until the distribution manages to get the
> modules ready (with the latest Linux kernel).


Ah, OK, I see. Yeah, I also sometimes see this with the OpenAFS module
on Debian *testing*, where it can happen that the kernel is too new, so
that the module doesn't build until a compatibility fix is released. I
usually switch to the in-kernel AFS module temporarily in these cases.


However, this never happened on Debian *stable*, neither for OpenAFS, 
nor for ZFS.


Bye...

    Dirk

--
Dirk Heinrichs 
Matrix-Adresse: @heini:chat.altum.de
GPG Public Key: 80F1540E03A3968F3D79C382853C32C427B48049
Privacy Handbuch: https://www.privacy-handbuch.de





Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread spacefrogg-openafs
> At what level does OpenAFS implement CoW?  Is it implemented at
> whole-file level, i.e. changing a file that is part of a replicated /
> backup volume will copy the entire file, or is it implemented at
> some range or smaller granularity, i.e. it will change only that
> range but share the rest?

I don't know. I suppose at the file level.

> Can one force OpenAFS to do a verification of these checksums and
> report back any issues?

That is what happens by default. FileLog and SalvageLog should have
info on that. In case of failure detection, the volume is taken offline
(and not brought online again), which manifests as access errors.
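
A hedged example of triggering the salvager by hand (hostname
hypothetical; salvaging a whole partition takes its volumes offline
while it runs):

    bos salvage -server fs1.example.com -partition /vicepa -showlog -localauth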

> What kind of checksums are these?  Cryptographic ones like
> MD/SHA/newer or CRC-ones?

I don't know. I would start by investigating the fileserver and
salvager documentation.

> Granted, RAID is not a backup solution, but it should instead protect
> one from faulty hardware.  Which is exactly what it doesn't do 100%,
> because if one of the drives in the array returns corrupted data, the
> RAID system can't say which one it is (based purely on the returned
> data).  Granted, disks don't just return random data without any other
> failure or symptom.

If you have faulty hardware, only a backup and new hardware will save
you, but do what you must.

> With regard to file-system scrubbing, to my knowledge, only those that
> actually have checksumming can do this, which currently is either
> BTRFS or ZFS.

You only lose data checksumming. Metadata checksumming (and CoW for metadata 
changes) is still used and gives you most of the relevant properties (because 
AFS does its own data checksumming).
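
With 'nodatacow' in place a scrub can still verify that metadata (data
blocks have no checksums to check); a sketch, using the mountpoint from
this thread:

    btrfs scrub start -B /vicepa    # -B: run in the foreground
    btrfs scrub status /vicepa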

> I think that barriers have other implications especially to journaled
> file-systems.

They don't. The ones you have in mind relate to local devices. A NAS
will simply always report write success as soon as possible (for the
sake of covering up the huge network latencies). Your local FS driver
will never know the truth... That said, a journaling FS can be less
safe over a network than a non-journaling FS.

> This is true.  It is true even of OpenAFS backup volumes.  :)

That's not true. AFS knows about its backup volumes but not about BTRFS 
snapshots. At least in principle...

Kind regards,
–Michael


Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Ciprian Craciun
On Wed, Mar 22, 2023 at 10:30 AM  wrote:
> OpenAFS implements its own CoW and using CoW below that again has no benefits 
> and disturbs the fileservers "free-space" assumptions. It knows when it makes 
> in-place updates and does not expect to run out of space in that situation.


At what level does OpenAFS implement CoW?  Is it implemented at
whole-file level, i.e. changing a file that is part of a replicated /
backup volume will copy the entire file, or is it implemented at
some range or smaller granularity, i.e. it will change only that
range but share the rest?

I'm asking this because I've assumed (based on empirical observations)
that all files stored in OpenAFS (via the proper `afsd`) will end up
somewhere in `/vicepX` as individual files.  (I.e. if I were to
`md5sum` all the files from `/afs/some-cell`, and then `md5sum` all
the files in `/vicepX`, the first set of checksums (`/afs/...`) would
be a subset of the second (`/vicepX`).)
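
A rough sketch of that subset check (cell name as above; the namei
files under `/vicepX` have opaque names, so only the checksum sets,
not the file names, can be compared):

    find /afs/some-cell -type f -exec md5sum -- {} + | cut -d' ' -f1 | sort -u > /tmp/afs.sums
    find /vicepa -type f -exec md5sum -- {} + | cut -d' ' -f1 | sort -u > /tmp/vicep.sums
    comm -23 /tmp/afs.sums /tmp/vicep.sums   # empty output = subset holds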



> > Unfortunately (at least for my use-case) losing the checksumming and
> > compression is a no-go, because these were exactly the features that
> > made BTRFS appealing versus Ext4.
>
> If you say so...
> AFS does its own data checksumming.


Can one force OpenAFS to do a verification of these checksums and
report back any issues?

What kind of checksums are these?  Cryptographic ones like
MD/SHA/newer or CRC-ones?



> > Also, regarding RAID scrubbing, it doesn't cover the issue of
> > checksumming, because (for example with RAID5) it can only detect that
> > one of the disks has corrupted data, but couldn't say which.
>
> Do not use RAID to prevent data loss! That's what backups are for. RAID is
> for operative redundancy. Scrubbing also tells you about the state of your
> FS metadata. So, it's not that it has no use without checksumming. I only
> use RAID 1 and 1-0. They have lower data-loss probabilities than RAID 5.


Granted, RAID is not a backup solution, but it should instead protect
one from faulty hardware.  Which is exactly what it doesn't do 100%,
because if one of the drives in the array returns corrupted data, the
RAID system can't say which one it is (based purely on the returned
data).  Granted, disks don't just return random data without any other
failure or symptom.

With regard to file-system scrubbing, to my knowledge, only those that
actually have checksumming can do this, which currently is either
BTRFS or ZFS.



> All -sync properties are ineffective with NAS, because the network layer and 
> far-end OS decide on actual commit strategies. So you might as well stop 
> deceiving yourself and disable write barriers.

I think that barriers have other implications especially to journaled
file-systems.



> You will use subvolumes the moment you start making snapshots. So be careful 
> to not deceive yourself. A forgotten snapshot can easily get you into trouble 
> the moment you move off some volumes to make room for a large addition, just 
> to realise no space opened up at all.

This is true.  It is true even of OpenAFS backup volumes.  :)


Ciprian.


Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Ciprian Craciun
On Wed, Mar 22, 2023 at 3:11 PM Dirk Heinrichs  wrote:
> > it's not in-kernel; which means sooner or later one would encounter
> > problems.
>
> Can you please elaborate? I run two ZFS systems @home where one is an
> OpenAFS fileserver and client, the other one a client only. They both
> started as Debian Stretch and have been updated to Buster and then
> Bullseye and I've never had any problems because of ZFS being
> out-of-tree. The Debian DKMS system does quite a good job.


Well, I base this supposition on my simple observation with OpenAFS's
own client which is also out-of-tree and requires custom module builds
(via DKMS or equivalent).

For example I use OpenSUSE Tumbleweed (rolling release), and sometimes
I need to delay my updates until the distribution manages to get the
modules ready (with the latest Linux kernel).

Granted, this doesn't usually happen on OpenSUSE Leap, although (for
some reason) the package manager from time to time decides to remove
the old `libafs.ko` for the currently running kernel, which (in case
`afsd` must be restarted, for example during updates) requires me to
reboot the system.

I bet the same applies to ZFS also.

Ciprian.


Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Dirk Heinrichs

Ciprian Craciun:

> it's not in-kernel; which means sooner or later one would encounter
> problems.


Can you please elaborate? I run two ZFS systems @home where one is an 
OpenAFS fileserver and client, the other one a client only. They both 
started as Debian Stretch and have been updated to Buster and then 
Bullseye and I've never had any problems because of ZFS being 
out-of-tree. The Debian DKMS system does quite a good job.


The OpenAFS client module is out-of-tree too, BTW...

Bye...

    Dirk

--
Dirk Heinrichs 
Matrix-Adresse: @heini:chat.altum.de
GPG Public Key: 80F1540E03A3968F3D79C382853C32C427B48049
Privacy Handbuch: https://www.privacy-handbuch.de





Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread spacefrogg-openafs
> What is the reason behind disabling copy-on-write for BTRFS?  Does it
> break OpenAFS in some way, or is it only the out-of-space issue?

OpenAFS implements its own CoW and using CoW below that again has no benefits
and disturbs the fileserver's "free-space" assumptions. It knows when it makes
in-place updates and does not expect to run out of space in that situation.
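
A sketch of the two usual ways to disable it (device name hypothetical;
note that 'nodatacow' only affects newly created files):

    # /etc/fstab
    /dev/sdb1  /vicepa  btrfs  nodatacow,noatime  0  0

    # or, per-directory, before any data lands there:
    chattr +C /vicepa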

For context, AFS did most of the cool FS stuff that the young kids are doing 
nowadays before it was cool. So, the tricky thing with other CoW filesystems is 
that they might work against each other.

Anyways, AFS deals badly with out-of-space situations, because it needs space
to move off a volume to make space, like any other CoW filesystem. It has its
mechanisms to prevent that from happening, but a CoW FS underneath counteracts
these mechanisms.

>
> Unfortunately (at least for my use-case) losing the checksumming and
> compression is a no-go, because these were exactly the features that
> made BTRFS appealing versus Ext4.

If you say so...
AFS does its own data checksumming.
Compression is problematic on nearly full disks, because you can never properly 
judge whether the data will fit after compression or not. The gains were not 
worth it for me.

> Also, regarding RAID scrubbing, it doesn't cover the issue of
> checksumming, because (for example with RAID5) it can only detect that
> one of the disks has corrupted data, but couldn't say which.

Do not use RAID to prevent data loss! That's what backups are for. RAID is for
operative redundancy. Scrubbing also tells you about the state of your FS
metadata. So, it's not that it has no use without checksumming. I only use
RAID 1 and 1-0. They have lower data-loss probabilities than RAID 5.

> Could you elaborate more on this?  I guess it doesn't apply to
> directly attached disks.  Is this in order to increase write
> performance, or?

All -sync properties are ineffective with NAS, because the network layer and 
far-end OS decide on actual commit strategies. So you might as well stop 
deceiving yourself and disable write barriers.

>
> Have you also changed the `-sync` file-server option?
>
> I'm using `-sync onclose` to be sure that my data is actually stored
> on the disk.  The write performance does suffer, especially for
> use-cases like Git where some simple operations (like repacking) take
> forever (because for some reason Git tries to touch each and every
> `.git/objects/XX` folders...)

True, I left it on 'onclose' as well IIRC.

> (In my case I intend to use a dedicated BTRFS disk, over RAID, without
> any subsolumes.)

You will use subvolumes the moment you start making snapshots. So be careful to 
not deceive yourself. A forgotten snapshot can easily get you into trouble the 
moment you move off some volumes to make room for a large addition, just to 
realise no space opened up at all.
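
A quick way to check for forgotten snapshots, and (if quotas are
enabled) what space they pin, on the mountpoint from this thread:

    btrfs subvolume list /vicepa
    btrfs qgroup show /vicepa    # requires 'btrfs quota enable' beforehand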

Kind regards,
–Michael


Re: [OpenAFS] Advice on using BTRFS for vicep partitions on Linux

2023-03-22 Thread Ciprian Craciun
On Tue, Mar 21, 2023 at 9:32 PM  wrote:
> The main ingredient on BTRFS is to disable Copy-on-Write for the respective
> vice partition. This also somewhat mitigates surprising out-of-space issues.


What is the reason behind disabling copy-on-write for BTRFS?  Does it
break OpenAFS in some way, or is it only the out-of-space issue?



> You need to provide the 'nodatacow' mount option.
> You lose data checksumming and compression on BTRFS. So, reasonable RAID 
> config and scrubbing may be more important, now.


Unfortunately (at least for my use-case) losing the checksumming and
compression is a no-go, because these were exactly the features that
made BTRFS appealing versus Ext4.

Also, regarding RAID scrubbing, it doesn't cover the issue of
checksumming, because (for example with RAID5) it can only detect that
one of the disks has corrupted data, but couldn't say which.

(As an alternative to file-system-provided checksumming, at least on
Linux, there is `dm-integrity`, configured via `integritysetup`, which
can provide checksumming at the block level;  but at the moment I'm
still experimenting with it for other use-cases.)
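
For reference, a minimal `dm-integrity` sketch (device and mapping
names hypothetical; the 'format' step wipes the device):

    integritysetup format /dev/sdb1
    integritysetup open /dev/sdb1 vicepa-int
    mkfs.ext4 /dev/mapper/vicepa-int
    mount /dev/mapper/vicepa-int /vicepa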



> Additionally, depending on your exact setup, you may want to disable write 
> barriers (e.g. for network attached storage, 'nobarrier') when it is without 
> effect.


Could you elaborate more on this?  I guess it doesn't apply to
directly attached disks.  Is this in order to increase write
performance, or?

Have you also changed the `-sync` file-server option?

I'm using `-sync onclose` to be sure that my data is actually stored
on the disk.  The write performance does suffer, especially for
use-cases like Git where some simple operations (like repacking) take
forever (because for some reason Git tries to touch each and every
`.git/objects/XX` folders...)
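
For reference, a hedged sketch of where that flag lives on a DAFS
server (paths and hostname hypothetical), set when creating the bos
instance:

    bos create fs1.example.com dafs dafs \
        "/usr/afs/bin/dafileserver -sync onclose" \
        /usr/afs/bin/davolserver \
        /usr/afs/bin/salvageserver \
        /usr/afs/bin/dasalvager -localauth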



> Last remark. BTRFS, to my knowledge, does not support reservations. You MUST 
> make sure to use a pre-allocated storage for the /vicepX mountpoint or the 
> ugly day of failing AFS writes will come during your next overseas vacation.


You mean in the case `/vicepX` is a separate volume, but on the same
actual disk with other volumes, right?

(In my case I intend to use a dedicated BTRFS disk, over RAID, without
any subvolumes.)



> ZFS, although you don't want to go that way, works fine as well. Again, make 
> sure to create a filesystem (i.e. subvolume) with a fixed reservation. AFAIK 
> the FS takes care of providing enough space although you cannot disable COW. 
> You keep all the goodies, duplication, deduplication, checksumming. I would 
> suggest reading on ZFS setups for heavy database loads, should I have got you 
> interested.
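
A sketch of the kind of reservation being described there (pool,
dataset and size hypothetical):

    zfs create -o reservation=200G -o refreservation=200G \
        -o mountpoint=/vicepa tank/vicepa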


Thanks for the ZFS suggestions, however for me ZFS is a complete no-go
due to one main reason:  it's not in-kernel;  which means sooner or
later one would encounter problems.  The other reason is complexity:
I use OpenAFS for my own "self-hosted" / "home" needs, thus I want
something I can easily debug in case something goes wrong.  ZFS
doesn't give me much peace of mind;  too complex, too many options...

Thanks,
Ciprian.