[ceph-users] Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files

2023-11-30 Thread Sebastian Knust

Hi Patrick,

On 30.11.23 03:58, Patrick Donnelly wrote:


I've not yet fully reviewed the logs but it seems there is a bug in
the detection logic which causes a spurious abort. This does not
appear to be actually new damage.


We are accessing the metadata (read-only) daily. The issue only popped 
up after updating to 17.2.7. Of course, this does not mean that there 
was no damage there before, only that it was not detected.


Are you using postgres?

Not on top of CephFS, no. We do use postgres on some RBD volumes.



If you can share details about your snapshot
workflow and general workloads that would be helpful (privately if
desired).


Our CephFS root looks like this:
/archive
/homes
/no-snapshot
/other-snapshot
/scratch

We are running snapshots on /homes and /other-snapshot with the same 
schedule. We mount the filesystem with a kernel client on one of the 
Ceph hosts (not running the MDS) and mkdir / rmdir snapshots as needed 
(a simplified sketch of the rotation follows below):
- daily between 06:00 and 19:45 UTC (inclusive): create a snapshot every 
15 minutes; delete it one hour later unless it is an hourly (xx:00) snapshot
- daily on the full hour: create a snapshot; delete the snapshot from 24 
hours ago unless it is the midnight one
- daily at midnight: delete the snapshot from 14 days ago unless that day 
was a Sunday
- every Sunday at midnight: delete the snapshot from 8 weeks ago
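
For illustration, a minimal sketch of such a rotation script (paths, 
snapshot names and retention below are placeholders, not our exact cron 
jobs):

  #!/bin/bash
  # create a snapshot in each snapshotted subtree
  # (run from cron on the host with the kernel mount)
  ts=$(date -u +%Y-%m-%d_%H%M)
  for dir in /mnt/cephfs/homes /mnt/cephfs/other-snapshot; do
      mkdir "$dir/.snap/scheduled-$ts"
  done
  # expiring a snapshot is just an rmdir of its .snap entry, e.g. the
  # quarter-hourly one from an hour ago (the real jobs keep the
  # hourly/daily/weekly snapshots; that logic is omitted here)
  old=$(date -u -d '1 hour ago' +%Y-%m-%d_%H%M)
  for dir in /mnt/cephfs/homes /mnt/cephfs/other-snapshot; do
      rmdir "$dir/.snap/scheduled-$old" 2>/dev/null || true
  done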

The workload is two main Samba servers (one sharing only a subdirectory 
which is generally not accessed via the other). Client access to those 
servers is limited to 1 GBit/s each. Until Tuesday, we also had a 
mailserver with Dovecot running on top of CephFS. This was migrated on 
Tuesday to an RBD volume as we had some issues with hanging access to 
some files / directories (interestingly, only in the main tree; access in 
snapshots was fine). Additionally, we have a Nextcloud instance with ~200 
active users storing data in CephFS, as well as some other kernel clients 
with little / sporadic traffic, some running Samba, some NFS, some 
interactive SSH / x2go servers with direct user access, and some 
specialised web applications (notably OMERO).


We run daily incremental backups of most of the CephFS content with 
Bareos running on a dedicated server which has the whole CephFS tree 
mounted read-only. For most data a full backup is performed every two 
months, for some data only every six months. The affected area is 
contained in this "every six months" full backup portion of the file 
system tree.



Two weeks ago we deleted a folder structure of about 6 TB with an average 
file size in the range of 1 GB. The structure was under /other-snapshot 
as well. This led to severe load on the MDS, especially starting at 
midnight. In conjunction with the Ubuntu kernel mount, we also had issues 
with non-released capabilities preventing read access to the 
/other-snapshot part.


To combat these lingering problems, we deleted all snapshots in 
/other-snapshot, which left half a dozen PGs stuck in the snaptrim state 
(and a few hundred in snaptrim_wait). Updating from 17.2.6 to 17.2.7 
solved that issue quickly: the affected PGs became unstuck and the whole 
cluster was active+clean a few hours later.






For now, I'll hold off on running first-damage.py to try to remove the
affected files / inodes. Ultimately however, this seems to be the most
sensible solution to me, at least with regards to cluster downtime.


Please give me another day to review then feel free to use
first-damage.py to cleanup. If you see new damage please upload the
logs.

We are in no hurry and will probably run first-damage.py sometime next 
week. I will report new damage if it comes in.


Cheers
Sebastian

--
Dr. Sebastian Knust  | Bielefeld University
IT Administrator | Faculty of Physics
Office: D2-110   | Universitätsstr. 25
Phone: +49 521 106 5234  | 33615 Bielefeld
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files

2023-11-29 Thread Sebastian Knust

Hello Patrick,

On 27.11.23 19:05, Patrick Donnelly wrote:


I would **really** love to see the debug logs from the MDS. Please
upload them using ceph-post-file [1]. If you can reliably reproduce,
turn on more debugging:


ceph config set mds debug_mds 20
ceph config set mds debug_ms 1


[1] https://docs.ceph.com/en/reef/man/8/ceph-post-file/



Uploaded debug log and core dump, see ceph-post-file: 
02f78445-7136-44c9-a362-410de37a0b7d
Unfortunately, we cannot easily shut down normal access to the cluster 
for these tests, so there is quite some clutter in the logs. The logs 
show three crashes, the last one with core dumping enabled (ulimit set 
to unlimited).
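
For reference, the upload boils down to something like the following 
(paths are illustrative, not the exact files we sent):

  ceph-post-file -d "MDS abort on unlink of damaged dentry" \
      /var/log/ceph/ceph-mds.cephfs.*.log /path/to/core-dump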


A note on reproducibility: to recreate the crash, reading the contents 
of the file prior to removal seems to be necessary. Simply calling stat 
on the file and then removing it also yields an Input/output error but 
does not crash the MDS.
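
In shell terms, the sequence that triggers the abort looks roughly like 
this (the path is a placeholder):

  stat /mnt/cephfs/path/to/damaged-file                # first call may return EIO, repeat until it succeeds
  cat /mnt/cephfs/path/to/damaged-file > /dev/null     # read the contents
  rm /mnt/cephfs/path/to/damaged-file                  # this unlink crashes the active MDS

whereas stat followed directly by rm only returns the Input/output error.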


Interestingly, the MDS_DAMAGE flag is reset on restart of the MDS and 
only comes back once the files in question are accessed (stat call is 
sufficient).



For now, I'll hold off on running first-damage.py to try to remove the 
affected files / inodes. Ultimately however, this seems to be the most 
sensible solution to me, at least with regards to cluster downtime.


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS_DAMAGE in 17.2.7 / Cannot delete affected files

2023-11-24 Thread Sebastian Knust

Hi,

After updating from 17.2.6 to 17.2.7 with cephadm, our cluster went into 
MDS_DAMAGE state. We had some prior issues with faulty kernel clients 
not releasing capabilities, therefore the update might just be a 
coincidence.


`ceph tell mds.cephfs:0 damage ls` lists 56 affected files all with 
these general details:


{
"damage_type": "dentry",
"id": 123456,
"ino": 1234567890,
"frag": "*",
"dname": "some-filename.ext",
"snap_id": "head",
"path": "/full/path/to/file"
}

The behaviour when trying to access file information in the 
(kernel-mounted) filesystem is a bit inconsistent. Generally, the first 
`stat` call seems to result in "Input/output error"; the next call 
provides all `stat` data as expected from an undamaged file. Once the 
stat call succeeds, the file can be read with `cat` with full and 
correct content (verified against backup).


Scrubbing the affected subdirectories with `ceph tell mds.cephfs:0 scrub 
start /path/to/dir/ recursive,repair,force` does not fix the issue.


Trying to delete the file results in an "Input/output error". If the 
stat calls beforehand succeeded, this also crashes the active MDS with 
these messages in the system journal:

Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: 
mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be 
committed: [dentry 
#0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
 [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 
state=1073741824 | inodepin=1 0x56413e1e2780]
Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: log_channel(cluster) log 
[ERR] : MDS abort because newly corrupt dentry to be committed: [dentry 
#0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
 [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 
state=1073741824 | inodepin=1 0x56413e1e2780]
Nov 24 14:21:15 iceph-18.servernet 
ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 
2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 mds.0.cache.den(0x10012271195 
DisplaySettings.json) newly corrupt dentry to be committed: [dentry 
#0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
 [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x1001>
Nov 24 14:21:15 iceph-18.servernet 
ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 
2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 log_channel(cluster) log [ERR] : MDS 
abort because newly corrupt dentry to be committed: [dentry 
#0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json
 [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012>
Nov 24 14:21:15 iceph-18.servernet 
ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc:
 In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 
2023-11-24T13:21:15.655088+
Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc:
 In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 
2023-11-24T13:21:15.655088+
  
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: 937: ceph_abort_msg("abort() called")

   ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
   1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xd7) [0x7f3fe5a1cb03]
   2: (MDSRank::abort(std::basic_string_view >)+0x7d) [0x5640f2e6fa2d]
   3: (CDentry::check_corruption(bool)+0x740) [0x5640f30e4820]
   4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x5640f2f41877]
   5: (EOpen::add_clean_in

[ceph-users] Re: Centos 7 Kernel clients on ceph Quincy -- experiences??

2022-09-20 Thread Sebastian Knust

Hi Christoph,

I can reproducibly kernel-panic CentOS 7 clients with the native kernel 
(3.10.0-1160.76.1.el7) when accessing CephFS snapshots via SMB with 
vfs_shadow_copy2. This occurs on a Pacific cluster. IIRC accessing the 
snapshots directly on the server also led to a kernel panic, but I'm not 
sure.


Running a mainline kernel from elrepo prevents this issue.

I imagine that you might run into these issues with a Quincy cluster as 
well, if you are using CephFS snapshots at all.


Cheers
Sebastian

On 20.09.22 13:34, Ackermann, Christoph wrote:

Hello all,

I would like to upgrade our well-running Rocky 8.6 based bare-metal cluster
from Octopus to Quincy in the next few days. But there are some CentOS 7
kernel-based clients mapping RBDs or mounting CephFS in our environment.

Is there someone here who can confirm CentOS 7 clients
(3.10.0-1160.76.1.el7.x86_64) working with Quincy?

Best regards,

Christoph



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snapshots with samba shadowcopy

2022-07-13 Thread Sebastian Knust

Hi,

I am providing CephFS snapshots via Samba with the shadow_copy2 VFS 
object. I am running CentOS 7 with smbd 4.10.16 for which ceph_snapshots 
is not available AFAIK.


Snapshots are created by a cronjob above the root of my shares with
  export TZ=GMT
  mkdir /cephfs/path/.snap/`date +@GMT-%Y.%m.%d-%H.%M.%S`
i.e. the exported shares are subfolders of the folder in which I create 
snapshots.
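
A matching cleanup is just an rmdir of old entries in .snap; one possible 
sketch (not my exact cronjob) that keeps the newest 24 snapshots:

  cd /cephfs/path/.snap
  ls -d @GMT-* | head -n -24 | xargs -r rmdir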


Samba configuration is:
  [global]
  ...
  shadow:snapdir = .snap
  shadow:snapdirseverywhere = yes
  shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_some-inode-number
  ...
  [sharename]
  ...
  path = /cephfs/path_to_main_root/share
  vfs object = shadow_copy2
  ...
  [other_share_with_different_root]
  ...
  path = /cephfs/path_to_different_root/other_share
  vfs object = shadow_copy2
  shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_other-inode-number

The inode numbers in the configuration are of course the inode numbers 
of the directory containing the snapshots.
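
The inode number can be looked up with stat, e.g. (using the placeholder 
path from the share configuration above):

  stat -c %i /cephfs/path_to_main_root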


Cheers
Sebastian

On 13.07.22 02:08, Bailey Allison wrote:

Hi All,

  


Curious if anyone is making use of samba shadowcopy with CephFS snapshots
using the vfs object ceph_snapshots?

  


I've had wildly different results on an Ubuntu 20.04 LTS samba server where
the snaps just do not appear at all within shadowcopy, and a Rocky Linux
samba server where the snaps do appear within shadowcopy but when opening
them they contain absolutely no files at all.

  


Both the Ubuntu and Rocky samba server are sharing out kernel cephfs mount
via samba, ceph version is 17.2.1 and samba version is 4.13.7 for Ubuntu
20.04 and 4.15.5 for Rocky Linux.

  


I have also tried using a samba fuse mount with vfs_ceph with the same
results.

  


More so just curious to see if anyone on the list has had success with
making use of the ceph_snapshots vfs object and if they can share how it has
worked for them.

  


Included below is the share config for both Ubuntu and Rocky if anyone is
curious:

  


Ubuntu 20.04 LTS

  


[public]

 force group = nogroup

 force user = nobody

 guest ok = Yes

 path = /mnt/cephfs/public

 read only = No

 vfs objects = ceph_snapshots

  


Rocky Linux

  


[public]

 force group = nogroup

 force user = nobody

 guest ok = Yes

 path = /mnt/cephfs/public

 read only = No

 vfs objects = ceph_snapshots

  


Regards,

  


Bailey

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Dr. Sebastian Knust  | Bielefeld University
IT Administrator | Faculty of Physics
Office: D2-110   | Universitätsstr. 25
Phone: +49 521 106 5234  | 33615 Bielefeld
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs quota used

2021-12-16 Thread Sebastian Knust

Hi Jesper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:

Now, I want to access the usage information of folders with quotas from root 
level of the cephfs.
I have failed to find this information through getfattr commands, only quota 
limits are shown here, and du-command on individual folders is a suboptimal 
solution.


`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a 
given path.
`getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you 
would usually get with du for conventional file systems.


As an example, I am using this script for weekly utilisation reports:

for i in /ceph-path-to-home-dirs/*; do
  if [ -d "$i" ]; then
    SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
    QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 2>/dev/null || echo 0)
    PERC=$(echo "$SIZE*100/$QUOTA" | bc 2>/dev/null)
    if [ -z "$PERC" ]; then PERC="--"; fi
    printf "%-30s %8s %8s %8s%%\n" "$i" "$(numfmt --to=iec "$SIZE")" "$(numfmt --to=iec "$QUOTA")" "$PERC"
  fi
done



Note that you can also mount CephFS with the "rbytes" mount option. IIRC 
the fuse client defaults to it; for the kernel client you have to specify 
it in the mount command or fstab entry.
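
For example, a kernel mount with rbytes might look like this (monitor 
address, client name and secret file are placeholders):

  mount -t ceph mon1.example.com:/ /mnt/cephfs \
      -o name=backup,secretfile=/etc/ceph/backup.secret,rbytes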


The rbytes option returns the recursive path size (i.e. the 
ceph.dir.rbytes xattr) in stat calls on directories, so you will see it 
with ls immediately. I really like it!


Just beware that some software might have issues with this behaviour - 
alpine is the only example that I know of (a bug report and patch 
proposal have been submitted).


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs kernel client + snapshots slowness

2021-12-10 Thread Sebastian Knust
2973] ceph:  queue_realm_cap_snaps d313f1e4 
1001af8c8e7 inodes
[3199649.882974] ceph:  queue_cap_snap 171f11a7 nothing 
dirty|writing
[3199649.882975] ceph:  queue_cap_snap 938b9cd2 nothing 
dirty|writing
[3199649.882976] ceph:  queue_cap_snap 615cf4dd nothing 
dirty|writing
[3199649.882977] ceph:  queue_cap_snap 0027e295 nothing 
dirty|writing
[3199649.882979] ceph:  queue_cap_snap ba18b2f8 nothing 
dirty|writing
[3199649.882980] ceph:  queue_cap_snap 7c9c80de nothing 
dirty|writing
[3199649.882981] ceph:  queue_cap_snap 629b4b0e nothing 
dirty|writing
[3199649.882982] ceph:  queue_cap_snap ab330b37 nothing 
dirty|writing
[3199649.882983] ceph:  queue_cap_snap c7dbc320 nothing 
dirty|writing
[3199649.882985] ceph:  queue_cap_snap 70a0598f nothing 
dirty|writing
[3199649.882986] ceph:  queue_cap_snap 915b9e2e nothing 
dirty|writing

... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against 
the default cap limit) - so potentially this loop is over all the caps 
(?), which could mean tens/hundreds of milliseconds? Indeed, reducing 
mds_max_caps_per_client by an order of magnitude does improve the lstat 
times by about an order of magnitude (which is still pretty slow - but 
supports this hypothesis).


The ceph cluster is Nautilus 14.2.20.  There are a total of 7 snapshots 
in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 
previous daily snapshots).


I've tested this with a few kernels: two LTS ones, and one more recent 
stable one: 5.4.114, 5.10.73 and 5.14.16 with the same result.  Any 
ideas/suggestions?


Andras

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Dr. Sebastian Knust  | Bielefeld University
IT Administrator | Faculty of Physics
Office: D2-110   | Universitätsstr. 25
Phone: +49 521 106 5234  | 33615 Bielefeld
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD repeatedly marked down

2021-12-01 Thread Sebastian Knust

Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:

In "ceph -s", they "2 osds down"
message disappears, and the number of degraded objects steadily decreases.
However, after some time the number of degraded objects starts going up
and down again, and osds appear to be down (and then up again). After 5 minutes
the OSDs are kicked out from the cluster, and the ceph-osd daemons stop
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by 
pthread_kill(), raise(), abort(), alarm() ) UID: 0
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 
7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) 
***



Do you have enough memory on your host? You might want to look for OOM 
messages in dmesg / the journal and monitor your memory usage throughout 
the recovery.


If the osd processes are indeed killed by the OOM killer, you have a few 
options. Adding more memory would probably be best to future-proof the 
system. Maybe you could also work with some Ceph config settings, e.g. 
lowering osd_max_backfills (although I'm definitely not an expert on 
which parameters would give you the best results). Adding swap will most 
likely only produce other issues, but might be a method of last resort.
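
A quick way to check, and one possible knob, as a sketch (the backfill 
value is just an example):

  journalctl -k | grep -i 'out of memory'   # any OOM kills?
  ceph config set osd osd_max_backfills 1   # throttle concurrent backfills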


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Kworker 100% with ceph-msgr (after upgrade to 14.2.6?)

2021-09-07 Thread Sebastian Knust

Hi,

I too am still suffering from the same issue (snapshots lead to 100% 
ceph-msgr usage on the client during metadata-intensive operations like 
backup and rsync) and had previously reported it to this list. The issue 
is also tracked at https://tracker.ceph.com/issues/44100


My current observations:
- approx. 20 total snapshots in the filesystem are sufficient to 
reliably cause the issue
- in my observation there is no linear relationship between slowdown and 
number of snapshots. Once you reach a critical snapshot number (which 
might actually be 1, I have not tested this extensively) and perform the 
necessary operations to induce the error (for me, Bareos backups are a 
reliable reproducer), metadata operations on that client grind to a 
near-halt
- memory on the MDS is not a limiting / causing factor: I now have a 
dedicated MDS server with 160 GB memory and adjusted 
mds_cache_memory_limit accordingly and saw the issue occurring at 30GB 
MDS memory usage
- fuse mounts don't show the issue but are much slower on metadata 
operations overall and therefore not a solution for daily backups, as 
they slow down the backup too much


I'm running Ceph Octopus 15.2.13 on CentOS8. Client is CentOS8 with 
elrepo kernel 5.12. My workaround is to not use cephfs snapshots at all, 
although I really would like to use them.


Cheers
Sebastian

On 07.09.21 14:12, Frank Schilder wrote:

Hi Marc,

did you ever get a proper solution for this problem? We are having exactly the 
same issue, having snapshots on a file system leads to incredible performance 
degradation. I'm reporting some observations here (latest reply):

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HKEBXXRMX5WA5Y6JFM34WFPMWTCMPFCG/#6S5GTKGGBI2Y3QE4E5XJJY2KSSLLX64H

The problem is almost certainly that the ceph kernel client executes 
ceph_update_snap_trace over and over again over the exact same data. I see that 
the execution time of ceph fs IO increases roughly with the number of snapshots 
present, N snapshots means ~N times slower.

I'm testing this on kernel version 5.9.9-1.el7.elrepo.x86_64. It is even worse 
on older kernels.

Best regards,
=
Frank Schilder

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS Octopus mv: Invalid cross-device link [Errno 18] / slow move

2021-08-18 Thread Sebastian Knust
Hi Luís,

On 18.08.2021 19:02, Luis Henriques wrote:
> Sebastian Knust  writes:
> 
>> Hi,
>>
>> I am running a Ceph Octopus (15.2.13) cluster mainly for CephFS. Moving 
>> (with
>> mv) a large directory (mail server backup, so a few million small files) 
>> within
>> the cluster takes multiple days, even though both source and destination 
>> share
>> the same (default) file layout and - at least on the client I am performing 
>> the
>> move on - are located within the same mount point.
>>
>> I also see that the move is done by recursive copying and later deletion, as 
>> I
>> would only expect between different file systems / mount points.
> 
> A reason for that to happen could be the usage of quotas in the
> filesystem.  If you have quotas set in any of the source or destination
> hierarchies the rename(2) syscall will fail with -EXDEV (the "Invalid
> cross-device link" error).  And I guess that 'mv' will then revert to
> the less efficient recursive copy.
> 
> A possible solution would be to temporarily remove the quotas
> (i.e. setting them to '0'), and setting them back after the rename.
> 
> Cheers,

That's it! Setting quota temporarily to 0 allows for immediate move by
rename.

Thanks a lot.

Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS Octopus mv: Invalid cross-device link [Errno 18] / slow move

2021-08-18 Thread Sebastian Knust

Hi,

I am running a Ceph Octopus (15.2.13) cluster mainly for CephFS. Moving 
(with mv) a large directory (mail server backup, so a few million small 
files) within the cluster takes multiple days, even though both source 
and destination share the same (default) file layout and - at least on 
the client I am performing the move on - are located within the same 
mount point.


I also see that the move is done by recursive copying and later 
deletion, as I would only expect between different file systems / mount 
points.



Checking with cephfs-shell (16.2.5), the move fails with the "Invalid 
cross-device link [Errno 18]" error. However, stat shows the same device 
ID for source and destination:


CephFS:~/>>> mv /source/foo /dest/foo
cephfs.OSError: error in rename /source/foo to /dest/foo: Invalid 
cross-device link [Errno 18]


CephFS:~/>>> stat /source/foo
Device: 18446744073709551614    Inode: 1099620656366

CephFS:~/>>> stat /dest/
Device: 18446744073709551614    Inode: 1099570814227

Full output at https://pastebin.com/9V6FZ6hP


Any ideas why this happens?

The /source was originally created by ceph fs subvolume create ..., 
however I was not using the volume/subvolume features and reorganised 
the data - the directory inode is still the same.


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Docker container snapshots accumulate until disk full failure?

2021-08-12 Thread Sebastian Knust

Dear Harry,

`docker image prune -a` removes all dangling images as well as all 
images not referenced by any running container. I successfully used it 
in my setups to remove old versions.


In RHEL/CentOS, podman is used and thus you should use `podman image 
prune -a` instead.
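
A minimal sketch of the cleanup (review the image list before pruning):

  podman image ls          # see which images are present
  podman image prune -a    # remove images not used by any container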


HTH, Cheers
Sebastian

On 11.08.21 15:35, Harry G. Coin wrote:

Does ceph remove container subvolumes holding previous revisions of
daemon images after upgrades?

I have a couple servers using btrfs to hold the containers.   The number
of docker related sub-volumes just keeps growing, way beyond the number
of daemons running.  If I ignore this, I'll get disk-full related system
failures.

Is there a command to 'erase all non-live docker image subvolumes'?  Or
a way to at least get a list of what I need to delete manually ( !! )

Thanks

Harry Coin


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Wrong hostnames in "ceph mgr services" (Octopus)

2021-07-08 Thread Sebastian Knust

Hi,

After upgrading from 15.2.8 to 15.2.13 with cephadm on CentOS 8 
(containerised installation done by cephadm), Grafana no longer shows 
new data. Additionally, when accessing the dashboard URL on a host 
currently not hosting the dashboard, I am redirected to a wrong hostname 
(as shown in ceph mgr services).


I assume that this is caused by the same reason which leads to this 
output of `ceph mgr services`:

{
"dashboard": "https://ceph--mgr.iceph-11.tsmsqs:8443/",
"prometheus": "http://ceph--mgr.iceph-11.tsmsqs:9283/"
}

The correct hostname is iceph-11 (without the tsmsqs part), FQDN is 
iceph-11.servernet. The hosts use DNS, the names (iceph-11 and 
iceph-11.servernet) are resolvable both from the hosts as well as from 
within the Podman containers.



I have determined that podman by default sets the container name as a 
hostname alias (visible with `hostname -a` within the container), which 
somehow leads to Ceph mgr picking it up as the primary name?


My workaround is to modify 
/var/lib/ceph//mgr../unit.run, 
adding --no-hosts as an additional argument to the "podman run" command.


I could probably use a system-wide containers.conf as well.
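
Such a system-wide setting would presumably be a sketch along these lines 
(untested on my side):

  # /etc/containers/containers.conf
  [containers]
  no_hosts = true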

With this workaround and after restarting the Ceph mgr container (via 
systemctl) and then restarting Prometheus and Grafana (with ceph orch 
redeploy), I once again get data in Grafana and the correct redirect for 
the dashboard. `ceph mgr services` also shows expected and correct values.



I am wondering if this kind of issue is known or whether there is 
something wrong with my setup. I expected Ceph mgr to use the primary 
hostname and not some seemingly random hostname alias. Maybe this issue 
can also be discussed in a troubleshooting section of the monitoring 
stack documentation.



Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OT: How to Build a poor man's storage with ceph

2021-06-08 Thread Sebastian Knust

Hi Michael,

On 08.06.21 11:38, Ml Ml wrote:

Now I was asked if I could also build a cheap 200-500 TB cluster
storage, which should also scale, just for data storage such as
NextCloud/OwnCloud.


With similar requirements (server primarily for Samba and NextCloud, 
some RBD use, very limited budget) I am using HDD for data and SSD for 
system and CephFS metadata.


Note that I am running NextCloud on CephFS storage. If you want to go 
with RGW/S3 as a storage backend instead, the following might not apply 
to your use case.


My nodes (bought end of 2020) are:
- 2U chassis with 12 3.5" SATA slots
- Intel Xeon Silver 4208
- 128 GB RAM
- 2 x 480 GB Samsung PM883 SSD
  -> 50 GB in MD-RAID1 for system
  -> 430 GB OSD (one per SSD)
- initially 6 x 14 TB Enterprise HDD
- 4 x 10 GBase-T (active/passive bonded, dedicated backend network)

Each node with this configuration cost about 4k EUR net at the end of 
2020. Due to increasing prices for storage, it will be a bit more 
expensive now. I am running five nodes now and have added a few more 
disks (ranging 8-14 TB), nearly filling up the nodes.


My experience so far:
- I had to throttle scrubbing (see below for details)
- For purely NextCloud and Samba performance is sufficient for a few 
hundred concurrent users with a handful of power users
- Migration of the mail server to this cluster was a disaster due to 
limited IOPS, had to add some more SSDs and place the mail server in an 
SSD-only pool.
- MDS needs a lot of memory for larger CephFS installs, I will move it 
to a dedicated server probably next year. 128 GB per node works but I 
would not recommend any less.
- Rebalancing takes an eternity (2-3 weeks), so make sure that your PG 
nums are okay from the start
- I have all but given up on snapshots with CephFS due to severe 
performance degradation with kernel client during backup


My scrubbing config looks like this:
osd_backfill_scan_max   16
osd_backfill_scan_min   4
osd_deep_scrub_interval 2592000.00
osd_deep_scrub_randomize_ratio  0.03
osd_recovery_max_active_hdd 1
osd_recovery_max_active_ssd 5
osd_recovery_sleep_hdd  0.05
osd_scrub_begin_hour18
osd_scrub_end_hour  7
osd_scrub_chunk_max 1
osd_scrub_chunk_min 1
osd_scrub_max_interval  2419200.00
osd_scrub_min_interval  172800.00
osd_scrub_sleep 0.10
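
These values can be applied cluster-wide with ceph config set, for 
example:

  ceph config set osd osd_scrub_sleep 0.1
  ceph config set osd osd_deep_scrub_randomize_ratio 0.03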


My data is in a replicated pool with n=3 without compression. You might 
also consider EC and then want to aim for more nodes.


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs metadta pool suddenly full (100%) !

2021-06-01 Thread Sebastian Knust

Hi Hervé,

On 01.06.21 14:00, Hervé Ballans wrote:
I'm aware with your points, and maybe I was not really clear in my 
previous email (written in a hurry!)
The problematic pool is the metadata one. All its OSDs (x3) are full. 
The associated data pool is OK and no OSD is full on the data pool.


Are you saying that you only have 3 OSDs for your metadata pool, and 
those are the full ones? Alright, then you can - at least for this 
specific issue - disregard my previous comment.


The problem is that metadata pool suddenly increases a lot and 
continiously from 3% to 100% in 5 hours (from 5 am to 10 am, then crash)


724 GiB stored in the metadata pool with only 11 TiB of CephFS data does 
seem huge at first glance. For reference, I have about 160 TiB of CephFS 
data with only 31 GiB stored in the metadata pool.


I don't have an explanation for this behaviour, as I am relatively new 
to Ceph. Maybe the list can chime in?




And we don't understand the reason, since there was no specific 
activity on the data pool.
This cluster has run perfectly with the current configuration for many 
years.


Probably unrelated to your issues: I noticed that your STORED and USED 
column in `ceph df` output are identical. Is that because of Nautilus (I 
myself am running Octopus, where USED is the expected multiple of STORED 
depending on replication factor / EC configuration in the pool) or are 
you running a specific configuration that might cause that?


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephfs metadta pool suddenly full (100%) !

2021-06-01 Thread Sebastian Knust

Hi Hervé,

On 01.06.21 13:15, Hervé Ballans wrote:

# ceph status
   cluster:
     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
     health: HEALTH_ERR
     1 filesystem is degraded
     3 full osd(s)
     1 pool(s) full
     1 daemons have recently crashed
You have full OSDs and therefore a full pool. The "fullness" of a pool 
is limited by the fullest OSD, i.e. a single full OSD can block your 
pool. Take a look at `ceph osd df` and you will notice a very 
non-uniform OSD usage (both in numbers of PGs / size as well as usage %).



     osd: 126 osds: 126 up (since 5m), 126 in (since 5M)
     pgs: 1662 active+clean


The PG/OSD ratio seems very low to me. The general recommendation is 
100 PGs per OSD post-replication (and a power of 2 for each pool). In my 
cluster I actually run with ~200 PGs per OSD on the SSDs which contain 
the CephFS metadata.




Thanks a lot if you have some ways for trying to solve this...


You have to get your OSDs to rebalance, which probably includes 
increasing the number of PGs in some pools. Details depend on which Ceph 
version you are running and your CRUSH rules (maybe your cephfs metadata 
pool is residing only on NVMe?). Take a look at the balancer module [1] 
and the autoscaler [2] (`ceph osd pool autoscale-status` is most 
interesting).
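
As a sketch, the current state can be checked and the balancer enabled 
like this (check the output before enabling anything):

  ceph osd pool autoscale-status
  ceph balancer status
  ceph balancer mode upmap
  ceph balancer on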


Theoretically, you could (temporarily!) increase the full_ratio. 
However, this is a very dangerous operation which you should not do 
unless you know *exactly* what you are doing.


Cheers & Best of luck
Sebastian

[1] https://docs.ceph.com/en/latest/rados/operations/balancer/
[2] https://docs.ceph.com/en/latest/rados/operations/placement-groups/
Replace latest in the URIs with your Ceph version string (i.e. octopus, 
nautilus) for version specific documentation

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: XFS on RBD on EC painfully slow

2021-05-28 Thread Sebastian Knust

Hi Reed,

To add to this comment by Weiwen:

On 28.05.21 13:03, 胡 玮文 wrote:

Have you tried just starting multiple rsync processes simultaneously to 
transfer different directories? Distributed systems like Ceph often 
benefit from more parallelism.


When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a 
few months ago, I used msrsync [1] and was quite happy with the speed. 
For your use case, I would start with -p 12 but might experiment with up 
to -p 24 (as you only have 6C/12T in your CPU). With many small files, 
you also might want to increase -s from the default 1000.


Note that msrsync does not work with the --delete rsync flag. As I was 
syncing a live system, I ended up with this workflow:


- Initial sync with msrsync (something like ./msrsync -p 12 --progress 
--stats --rsync "-aS --numeric-ids" ...)

- Second sync with msrsync (to sync changes during the first sync)
- Take old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount cephfs at location of old storage, adjust /etc/exports with fsid 
entries where necessary, turn system back on-line / read-write
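
Fully spelled out, the initial sync then looks roughly like this (source 
and target paths are placeholders):

  ./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" \
      /mnt/old-storage/ /mnt/cephfs/target/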


Cheers
Sebastian

[1] https://github.com/jbd/msrsync
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS: side effects of not using ceph-mgr volumes / subvolumes

2021-03-03 Thread Sebastian Knust

Hi,

Assuming a cluster (currently octopus, might upgrade to pacific once 
released) serving only CephFS and that only to a handful of kernel and 
fuse-clients (no OpenStack, CSI or similar): Are there any side effects 
of not using the ceph-mgr volumes module abstractions [1], namely 
subvolumes and subvolume groups, that I have to consider?


I would still only mount subtrees of the whole (single) CephFS file 
system and have some clients which mount multiple disjunct subtrees. 
Quotas would only be set on the subtree level which I am mounting, 
likewise file layouts. Snapshots (via mkdir in .snap) would be used on 
the mounting level or one level above.



Background: I don't require the abstraction features per se. Some 
restrictions (e.g. subvolume group snapshots not being supported) seem 
to me to be caused only by the abstraction layer and not the underlying 
CephFS. For my specific use case I require snapshots on the subvolume 
group layer. It therefore seems better to just forego the abstraction as 
a whole and work on bare CephFS.



Cheers
Sebastian

[1] https://docs.ceph.com/en/octopus/cephfs/fs-volumes/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS Octopus snapshots / kworker at 100% / kernel vs. fuse client

2021-02-05 Thread Sebastian Knust

Hi,

I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS. 
Metadata is stored on SSD, data is stored in three different pools on 
HDD. Currently, I use 22 subvolumes.


I am rotating snapshots on 16 subvolumes, all in the same pool, which is 
the primary data pool for CephFS. Currently I have 41 snapshots per 
subvolume. The goal is 50 snapshots (see bottom of mail for details). 
Snapshots are only placed in the root subvolume directory, i.e. 
/volumes/_nogroup/subvolname/hex-id/.snap


I create the snapshots from one of the nodes: the complete CephFS is 
mounted, mkdir and rmdir are performed for each relevant subvolume, then 
CephFS is unmounted again. All PGs are active+clean most of the time, 
with only a few in snaptrim for 1-2 minutes after snapshot deletion. I 
therefore assume that snaptrim is not a limiting factor.


Obviously, the total number of snapshots is more than the 400 and 100 I 
see mentioned in some documentation. I am unsure if that is an issue 
here, as the snapshots are all in disjunct subvolumes.




When mounting the subvolumes with the kernel client (ranging from the 
CentOS 7 supplied 3.10 up to 5.4.93), after some time and for some 
subvolumes the kworker process begins to hog 100% CPU and stat 
operations become very slow (even slower than with the fuse client). I 
can mostly replicate this by starting specific rsync operations (with 
many small files, e.g. CTAN, CentOS, Debian mirrors) and by running a 
Bareos backup. The kworker process seems to be stuck even after 
terminating the causing operation, i.e. rsync or bareos-fd.


Interestingly, I can even trigger these issues on a host that has only a 
single CephFS subvolume without any snapshots mounted, as long as that 
subvolume is in the same pool as other subvolumes with snapshots.


I don't see any abnormal behaviour on the cluster nodes or on other 
clients during these kworker hanging phases.




With fuse client, in normal operation stat calls are about 10-20x slower 
than with the kernel client. However, I don't encounter the extreme 
slowdown behaviour. I am therefore currently mounting some 
known-problematic subvolumes with fuse and non-problematic subvolumes 
with the kernel client.




My questions are:
- Is this known or expected behaviour?
- I could move the subvolumes with snapshots into a subvolumegroup and 
snapshot the whole group instead of each subvolume. Will this be likely 
to solve the issues?
- What is the current recommendation regarding CephFS and max number of 
snapshots?




Cluster setup:
5 nodes with a total of 56 OSDs
Each node has a Xeon Silver 4208 and 128 GB RAM
Each node has two 480GB Samsung PM883 SSD used for CephFS metadata pool
HDDs are ranging from 8TB to 14TB, majority is 14TB
10 GbE internal network and 10 GbE client network, no Jumbo frames

$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    520 TiB  141 TiB  378 TiB  379 TiB   72.88
ssd    3.9 TiB  3.8 TiB  1.7 GiB  97 GiB    2.46
TOTAL  524 TiB  145 TiB  378 TiB  379 TiB   72.36

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1     1   66 MiB       57  198 MiB      0     23 TiB
cephfs.cephfs.meta      2  1024   26 GiB    2.29M   77 GiB   2.06    1.2 TiB
cephfs.cephfs.data      3  1024   70 TiB   54.95M  213 TiB  75.19     23 TiB
lofar                   4   512   77 TiB   21.41M  154 TiB  68.68     35 TiB
proxmox                 6    64  526 GiB  158.60k  1.6 TiB   2.16     23 TiB
archive                 7    32  7.3 TiB    5.42M   10 TiB  12.57     56 TiB

Snapshots are only on the cephfs.cephfs.data pool.


Intended snapshot rotation:
4 quarter-hourly snapshots
24 hourly snapshots
14 daily snapshots
8 weekly snapshots


Cheers
Sebastian
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io