[ceph-users] Re: force-create-pg not working
Hi Josh,

Thanks for your reply, but I already tried that, with no luck. The primary OSD goes down and hangs forever upon the "mark_unfound_lost delete" command. I guess the PG is too damaged to salvage, unless one really starts deleting individual corrupt objects?

Anyway, as I said, the files in the PG are identified and under backup, so I just want it healthy, no matter what ;-)

I actually discovered that removing the PG's shards with ceph-objectstore-tool does indeed get the PG back to active+clean (containing 0 objects, though). One just needs to run a final remove - start/stop OSD - repair - mark-complete on the primary OSD. A scrub tells me that the "active+clean" state is for real.

I also found out that the more automated "force-create-pg" command only works on PGs that are in a down state.

Best,
Jesper

----------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 20 Sep 2022, at 15.40, Josh Baergen wrote:
>
> Hi Jesper,
>
> Given that the PG is marked recovery_unfound, I think you need to follow
> https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
>
> Josh
>
> On Tue, Sep 20, 2022 at 12:56 AM Jesper Lykkegaard Karlsen wrote:
>>
>> Dear all,
>>
>> System: latest Octopus, 8+3 erasure CephFS
>>
>> I have a PG that has been driving me crazy.
>> It had gotten into a bad state after heavy backfilling, combined with OSDs
>> going down in turn.
>>
>> State is:
>>
>> active+recovery_unfound+undersized+degraded+remapped
>>
>> I have tried repairing it with ceph-objectstore-tool, but no luck so far.
>> Given the time recovery takes this way, and since the data are under backup, I
>> thought that I would take the "easy" approach instead and:
>>
>> * scan pg_files with cephfs-data-scan
>> * delete the data belonging to that pool
>> * recreate the PG with "ceph osd force-create-pg"
>> * restore the data
>>
>> However, this has turned out not to be so easy after all.
>>
>> ceph osd force-create-pg 20.13f --yes-i-really-mean-it
>>
>> seems to be accepted well enough with "pg 20.13f now creating, ok", but then
>> nothing happens.
>> Issuing the command again just gives a "pg 20.13f already creating" response.
>>
>> If I restart the primary OSD, the pending force-create-pg disappears.
>>
>> I read that this could be due to a CRUSH map issue, but I have checked, and
>> that does not seem to be the case.
>>
>> Would it, for instance, be possible to do the force-create-pg manually with
>> something like this?:
>>
>> * set the nobackfill and norecover flags
>> * delete the PG's shards one by one
>> * unset nobackfill and norecover
>>
>> Any idea on how to proceed from here is most welcome.
>>
>> Thanks,
>> Jesper
>>
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>>
>> E-mail: je...@mbg.au.dk
>> Tlf: +45 50906203
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
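The remove / restart / repair / mark-complete sequence described in the reply above can be sketched as shell helpers. This is a hedged sketch, not a verified procedure: the OSD data path and PG id are examples, the OSD daemon must be stopped before ceph-objectstore-tool touches its store, and the tool name is injectable so the calls can be dry-run first.

```shell
#!/usr/bin/env bash
# Hedged sketch of the shard-removal sequence described above. The data
# path and PG id below are examples; the tool argument is injectable so
# the commands can be dry-run (pass `echo`) before touching a real OSD.

remove_pg_shard() {
    local data_path=$1 pgid=$2 tool=${3:-ceph-objectstore-tool}
    # Remove this OSD's shard of the PG (the OSD daemon must be stopped).
    "$tool" --data-path "$data_path" --pgid "$pgid" --op remove --force
}

mark_pg_complete() {
    local data_path=$1 pgid=$2 tool=${3:-ceph-objectstore-tool}
    # Run on the primary OSD only, once the damaged shards are gone.
    "$tool" --data-path "$data_path" --pgid "$pgid" --op mark-complete
}
```

A dry run such as `remove_pg_shard /var/lib/ceph/osd/ceph-1 20.13f echo` prints the command line instead of executing it.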
[ceph-users] force-create-pg not working
Dear all,

System: latest Octopus, 8+3 erasure CephFS

I have a PG that has been driving me crazy.
It had gotten into a bad state after heavy backfilling, combined with OSDs going down in turn.

State is:

active+recovery_unfound+undersized+degraded+remapped

I have tried repairing it with ceph-objectstore-tool, but no luck so far.

Given the time recovery takes this way, and since the data are under backup, I thought that I would take the "easy" approach instead and:

* scan pg_files with cephfs-data-scan
* delete the data belonging to that pool
* recreate the PG with "ceph osd force-create-pg"
* restore the data

However, this has turned out not to be so easy after all.

ceph osd force-create-pg 20.13f --yes-i-really-mean-it

seems to be accepted well enough with "pg 20.13f now creating, ok", but then nothing happens.
Issuing the command again just gives a "pg 20.13f already creating" response.

If I restart the primary OSD, the pending force-create-pg disappears.

I read that this could be due to a CRUSH map issue, but I have checked, and that does not seem to be the case.

Would it, for instance, be possible to do the force-create-pg manually with something like this?:

* set the nobackfill and norecover flags
* delete the PG's shards one by one
* unset nobackfill and norecover

Any idea on how to proceed from here is most welcome.

Thanks,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
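The manual approach proposed above (pause recovery, remove the shards, unpause) can be sketched as a small wrapper. A hedged sketch: note that the real cluster flag is spelled `norecover`, not "norecovery", and the ceph command is injectable so the wrapper can be dry-run before use on a live cluster.

```shell
#!/usr/bin/env bash
# Hedged sketch of the proposed manual procedure: pause backfill and
# recovery, run an arbitrary action (e.g. a shard-removal loop), then
# unpause. The ceph command is injectable for a dry run with `echo`.

with_recovery_paused() {
    local ceph_cmd=$1; shift
    "$ceph_cmd" osd set nobackfill
    "$ceph_cmd" osd set norecover   # the flag is norecover, not norecovery
    "$@"                            # the action to run while paused
    "$ceph_cmd" osd unset norecover
    "$ceph_cmd" osd unset nobackfill
}
```

For example, `with_recovery_paused ceph my_shard_removal_loop` (where `my_shard_removal_loop` is a hypothetical helper) would bracket the removal with the flag changes.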
[ceph-users] Re: Remove corrupt PG
Well, not the total solution after all.

There is still some metadata and header structure left that I cannot delete with ceph-objectstore-tool --op remove. It makes a core dump.

I think I need to declare the OSD lost anyway to get through this. Unless somebody has a better suggestion?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 1 Sep 2022, at 22.01, Jesper Lykkegaard Karlsen wrote:
>
> To answer my own question.
>
> The removal of the corrupt PG could be fixed by using the ceph-objectstore-tool
> fuse mount.
> Then, from the mount point, delete everything in the PG's head directory.
>
> This took only a few seconds (compared to 7.5 days), and after unmounting and
> restarting the OSD, it came back online.
>
> Best,
> Jesper
>
> ------
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf: +45 50906203
>
>> On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen wrote:
>>
>> Hi all,
>>
>> I wanted to move a PG to an empty OSD, so I could do repairs on it without the
>> whole OSD, which is full of other PGs, being affected with extensive downtime.
>>
>> Thus, I exported the PG with ceph-objectstore-tool, and after a successful
>> export I removed it. Unfortunately, the remove command was interrupted midway.
>> This resulted in a PG that could not be removed with "ceph-objectstore-tool
>> --op remove ...", since the header is gone.
>> Worse is that the OSD does not boot, because it can see objects from the
>> removed PG but cannot access them.
>>
>> I have tried to remove the individual objects in that PG (also with
>> ceph-objectstore-tool), but this process is extremely slow.
>> When looping over the >65,000 objects, each remove takes ~10 seconds and is
>> very compute-intensive, which adds up to approximately 7.5 days.
>>
>> Is there a faster way to get around this?
>>
>> Best regards,
>> Jesper
>>
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>>
>> E-mail: je...@mbg.au.dk
>> Tlf: +45 50906203
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Remove corrupt PG
To answer my own question.

The removal of the corrupt PG could be fixed by using the ceph-objectstore-tool fuse mount.
Then, from the mount point, delete everything in the PG's head directory.

This took only a few seconds (compared to 7.5 days), and after unmounting and restarting the OSD, it came back online.

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 31 Aug 2022, at 20.53, Jesper Lykkegaard Karlsen wrote:
>
> Hi all,
>
> I wanted to move a PG to an empty OSD, so I could do repairs on it without
> the whole OSD, which is full of other PGs, being affected with extensive
> downtime.
>
> Thus, I exported the PG with ceph-objectstore-tool, and after a successful
> export I removed it. Unfortunately, the remove command was interrupted
> midway.
> This resulted in a PG that could not be removed with "ceph-objectstore-tool
> --op remove ...", since the header is gone.
> Worse is that the OSD does not boot, because it can see objects from the
> removed PG but cannot access them.
>
> I have tried to remove the individual objects in that PG (also with
> ceph-objectstore-tool), but this process is extremely slow.
> When looping over the >65,000 objects, each remove takes ~10 seconds and is
> very compute-intensive, which adds up to approximately 7.5 days.
>
> Is there a faster way to get around this?
>
> Best regards,
> Jesper
>
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf: +45 50906203
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
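The fuse-based cleanup described above can be sketched like this. A hedged sketch: the data path and mountpoint are examples, the directory layout under the fuse mount can differ between Ceph versions, and the tool name is injectable so the mount command can be dry-run.

```shell
#!/usr/bin/env bash
# Hedged sketch of the fuse-based cleanup described above. The OSD must
# be stopped first; --op fuse then exposes its object store as a FUSE
# filesystem. Paths are examples; the tool is injectable for a dry run.

mount_osd_fuse() {
    local data_path=$1 mountpoint=$2 tool=${3:-ceph-objectstore-tool}
    "$tool" --data-path "$data_path" --op fuse --mountpoint "$mountpoint"
}

# Once mounted, the PG's head directory can be emptied from the mount
# point (the exact layout may differ by version), e.g.:
#   rm -rf /mnt/osd-fuse/20.13f_head/*
# then unmount and restart the OSD.
```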
[ceph-users] Remove corrupt PG
Hi all,

I wanted to move a PG to an empty OSD, so I could do repairs on it without the whole OSD, which is full of other PGs, being affected with extensive downtime.

Thus, I exported the PG with ceph-objectstore-tool, and after a successful export I removed it. Unfortunately, the remove command was interrupted midway.
This resulted in a PG that could not be removed with "ceph-objectstore-tool --op remove ...", since the header is gone.
Worse is that the OSD does not boot, because it can see objects from the removed PG but cannot access them.

I have tried to remove the individual objects in that PG (also with ceph-objectstore-tool), but this process is extremely slow.
When looping over the >65,000 objects, each remove takes ~10 seconds and is very compute-intensive, which adds up to approximately 7.5 days.

Is there a faster way to get around this?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
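The slow per-object loop described above presumably looks something like the sketch below. Hedged: `--op list` printing one JSON object descriptor per line, and feeding each descriptor back with `remove`, follows the tool's documented usage, but the exact invocation may vary by release; the tool is injectable for a dry run.

```shell
#!/usr/bin/env bash
# Hedged sketch of the per-object removal loop described above. --op list
# prints one JSON object descriptor per line for the PG; each descriptor
# can be passed back to the tool with the `remove` operation. The tool
# argument is injectable so the loop can be dry-run with `echo`.

remove_pg_objects() {
    local data_path=$1 pgid=$2 tool=${3:-ceph-objectstore-tool}
    "$tool" --data-path "$data_path" --pgid "$pgid" --op list |
    while IFS= read -r obj; do
        "$tool" --data-path "$data_path" "$obj" remove
    done
}
```

At ~10 seconds per object this loop is exactly the 7.5-day path; the fuse-mount approach in the follow-up mail avoids it entirely.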
[ceph-users] Re: Potential bug in cephfs-data-scan?
Actually, it might have worked better if the PG had stayed down while running cephfs-data-scan, as it would then only get the file structure from the metadata pool and not touch each file/link in the data pool? This would at least probably have given the list of files in (only) the affected PG?

//Jesper

From: Jesper Lykkegaard Karlsen
Sent: 19 August 2022 22:49
To: Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Potential bug in cephfs-data-scan?

From: Patrick Donnelly
Sent: 19 August 2022 16:16
To: Jesper Lykkegaard Karlsen
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Potential bug in cephfs-data-scan?

On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen wrote:
>>
>> Hi,
>>
>> I have recently been scanning the files in a PG with "cephfs-data-scan
>> pg_files ...".

> Why?

I had an incident where a PG went down+incomplete after some OSDs crashed, combined with heavy load and ongoing snap trimming. I got it back up again with ceph-objectstore-tool by marking it complete.

Then I wanted to list the possibly affected files in the unfortunate PG with cephfs-data-scan, so I could recover potential loss from backup.

>> Although, after a long time the scan was still running and the list of files
>> consumed 44 GB, I stopped it, as something obviously was very wrong.
>>
>> It turns out some users had symlinks that looped, and one user even had a
>> symlink to "/".

> Symlinks are not stored in the data pool. This should be irrelevant.

Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan pg_files" trying to follow any global or local symlink in the file structure, which leads to many more files registered than could possibly be in that PG, and even endless loops in some cases.

If symlinks are not stored in the data pool, how can cephfs-data-scan follow the link?

And how do I get "cephfs-data-scan" to just show the symlinks as links and not follow them up or down in the directory structure?
Best,
Jesper

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Potential bug in cephfs-data-scan?
From: Patrick Donnelly
Sent: 19 August 2022 16:16
To: Jesper Lykkegaard Karlsen
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Potential bug in cephfs-data-scan?

On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen wrote:
>>
>> Hi,
>>
>> I have recently been scanning the files in a PG with "cephfs-data-scan
>> pg_files ...".

> Why?

I had an incident where a PG went down+incomplete after some OSDs crashed, combined with heavy load and ongoing snap trimming. I got it back up again with ceph-objectstore-tool by marking it complete.

Then I wanted to list the possibly affected files in the unfortunate PG with cephfs-data-scan, so I could recover potential loss from backup.

>> Although, after a long time the scan was still running and the list of files
>> consumed 44 GB, I stopped it, as something obviously was very wrong.
>>
>> It turns out some users had symlinks that looped, and one user even had a
>> symlink to "/".

> Symlinks are not stored in the data pool. This should be irrelevant.

Okay, it may be a case of me "holding it wrong", but I do see "cephfs-data-scan pg_files" trying to follow any global or local symlink in the file structure, which leads to many more files registered than could possibly be in that PG, and even endless loops in some cases.

If symlinks are not stored in the data pool, how can cephfs-data-scan follow the link?

And how do I get "cephfs-data-scan" to just show the symlinks as links and not follow them up or down in the directory structure?

Best,
Jesper

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Potential bug in cephfs-data-scan?
Hi,

I have recently been scanning the files in a PG with "cephfs-data-scan pg_files ...".

After a long time the scan was still running and the list of files had consumed 44 GB, so I stopped it, as something obviously was very wrong.

It turns out some users had symlinks that looped, and one user even had a symlink to "/".

It does not make sense that cephfs-data-scan follows symlinks, as this will give a wrong picture of which files are in the target PG.

I have looked through Ceph's bug reports, but I do not see anyone mentioning this. Although I am still on the recently deprecated Octopus, I suspect that this bug is also present in Pacific and Quincy?

It might be related to this bug?

https://tracker.ceph.com/issues/46166

But the symptoms are different.

Or, maybe there is a way to disable the following of symlinks in "cephfs-data-scan pg_files ..."?

Best,
Jesper

----------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
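The behaviour being asked for (record symlinks as links, never descend through them) is what a traversal like the one below gives. A minimal sketch, not cephfs-data-scan's actual code: `os.walk` with `followlinks=False` (the default) never descends into symlinked directories, so a link loop or a link to "/" is harmless.

```python
import os

def list_entries(root):
    """Walk `root`, recording symlinks as links instead of following them."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(path + " -> " + os.readlink(path))
            else:
                found.append(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # A symlinked directory: record it, but os.walk will not
                # descend into it, so loops cannot occur.
                found.append(path + " -> " + os.readlink(path))
    return found
```

Even with a symlink pointing back at `root` itself, the traversal terminates and the link is reported once.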
[ceph-users] Re: replacing OSD nodes
Cool, thanks a lot! I will definitely put it in my toolbox.

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 29 Jul 2022, at 00.35, Josh Baergen wrote:
>
>> I know the balancer will reach a well-balanced PG landscape eventually, but
>> I am not sure that it will prioritise backfill by "most available
>> location" first.
>
> Correct, I don't believe it prioritizes in this way.
>
>> Have you tried pgremapper yourself, Josh?
>
> My team wrote and maintains pgremapper and we've used it extensively,
> but I'd always recommend trying it in test environments first. Its
> effect on the system isn't much different than what you're proposing
> (it simply manipulates the upmap exception table).
>
> Josh

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: replacing OSD nodes
Thank you for your suggestions, Josh, it is really appreciated.

Pgremapper looks interesting and is definitely something I will look into.

I know the balancer will reach a well-balanced PG landscape eventually, but I am not sure that it will prioritise backfill by "most available location" first. Then I might end up in the same situation, where some of the old (but not retired) OSDs start getting full.

Then there is the "undo-upmaps" script left, or maybe even the script that I propose in combination with "cancel-backfill", as it just moves what Ceph was planning to move anyway, just in a prioritised manner.

Have you tried pgremapper yourself, Josh? Is it safe to use? And do the Ceph developers vouch for this method?

Status now: ~1,600,000,000 objects have been moved, which is about half of all the planned backfills. I have been reweighting OSDs down as they get too close to maximum usage, which works to some extent.

The monitors, on the other hand, are now complaining about using a lot of disk space, due to the long-running backfill. There is still plenty of disk space on the mons, but I feel that the backfill is getting slower and slower, although the same number of PGs are still backfilling.

Can large disk usage on the mons slow down backfill and other operations? Is it dangerous?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 28 Jul 2022, at 22.26, Josh Baergen wrote:
>
> I don't have many comments on your proposed approach, but just wanted
> to note that how I would have approached this, assuming that you have
> the same number of old hosts, would be to:
> 1. Swap-bucket the hosts.
> 2. Downweight the OSDs on the old hosts to 0.001. (Marking them out
> (i.e. weight 0) prevents upmaps from being applied.)
> 3. Add the old hosts back to the CRUSH map in their old racks or whatever.
> 4. Use https://github.com/digitalocean/pgremapper#cancel-backfill.
> 5. Then run https://github.com/digitalocean/pgremapper#undo-upmaps in
> a loop to drain the old OSDs.
>
> This gives you the maximum concurrency and efficiency of movement, but
> doesn't necessarily solve your balance issue if it's the new OSDs that
> are getting full (that wasn't clear to me). It's still possible to
> apply steps 2, 4, and 5 if the new hosts are in place. If you're not
> in a rush, you could actually use the balancer instead of undo-upmaps in
> step 5 to perform the rest of the data migration, and then you wouldn't
> have full OSDs.
>
> Josh
>
> On Fri, Jul 22, 2022 at 1:57 AM Jesper Lykkegaard Karlsen wrote:
>>
>> It seems like low-hanging fruit to fix?
>> There must be a reason why the developers have not made a prioritized order
>> of backfilling PGs.
>> Or maybe the prioritization is based on something other than available space?
>>
>> The question remains unanswered, as does whether my suggested approach/script
>> would work or not.
>>
>> Summer vacation?
>>
>> Best,
>> Jesper
>>
>> --
>> Jesper Lykkegaard Karlsen
>> Scientific Computing
>> Centre for Structural Biology
>> Department of Molecular Biology and Genetics
>> Aarhus University
>> Universitetsbyen 81
>> 8000 Aarhus C
>>
>> E-mail: je...@mbg.au.dk
>> Tlf: +45 50906203
>>
>> From: Janne Johansson
>> Sent: 20 July 2022 19:39
>> To: Jesper Lykkegaard Karlsen
>> Cc: ceph-users@ceph.io
>> Subject: Re: [ceph-users] replacing OSD nodes
>>
>> On Wed, 20 Jul 2022 at 11:22, Jesper Lykkegaard Karlsen wrote:
>>> Thanks for your answer, Janne.
>>> Yes, I am also running "ceph osd reweight" on the "nearfull" OSDs, once
>>> they get too close for comfort.
>>>
>>> But I just thought a continuous prioritization of rebalancing PGs could
>>> make this process smoother, with less/no need for manual intervention.
>> You are absolutely right there, just wanted to chip in with my
>> experience of "it nags at me, but it will still work out", so other
>> people finding these mails later on can feel a bit relieved, knowing
>> that a few toofull warnings aren't a major disaster and that it
>> sometimes happens, because Ceph looks for all possible moves, even
>> those that will run late in the rebalancing.
>>
>> --
>> May the most significant bit of your life be positive.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
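The drain sequence quoted above (steps 2, 4 and 5) can be sketched as one helper. Hedged: both commands are injectable so the sequence can be dry-run; the pgremapper subcommand names come from the linked README, but their exact flags should be checked against the installed version before real use.

```shell
#!/usr/bin/env bash
# Hedged sketch of steps 2, 4 and 5 quoted above. Pass real commands
# ("ceph" and "pgremapper") for real use, or `echo echo` for a dry run.

drain_old_osds() {
    local ceph_cmd=$1 remap_cmd=$2; shift 2
    for osd in "$@"; do
        # Downweight instead of marking out, so upmaps still apply.
        "$ceph_cmd" osd crush reweight "osd.$osd" 0.001
    done
    # Freeze the backfill the downweighting just scheduled...
    "$remap_cmd" cancel-backfill --yes
    # ...then drain the old OSDs in a controlled loop.
    for osd in "$@"; do
        "$remap_cmd" undo-upmaps "osd.$osd" --yes
    done
}
```

For example, `drain_old_osds echo echo 11 12` prints the planned command lines for OSDs 11 and 12 without touching a cluster.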
[ceph-users] Re: cannot set quota on ceph fs root
Hi Frank,

I guess there is always the possibility to set quotas at the pool level with "target_max_objects" and "target_max_bytes".

The CephFS quotas set through attributes are only for sub-directories, as far as I recall.

Best,
Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 28 Jul 2022, at 17.22, Frank Schilder wrote:
>
> Hi Gregory,
>
> thanks for your reply. It should be possible to set a quota on the root;
> other vattribs can be set as well despite it being a mount point. There must
> be something on the Ceph side (or another bug in the kclient) preventing it.
>
> By the way, I can't seem to find cephfs tools like cephfs-shell. I'm using
> the image quay.io/ceph/ceph:v15.2.16 and it's not installed in the image. A
> "yum provides cephfs-shell" returns no candidate and I can't find
> installation instructions. Could you help me out here?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Gregory Farnum
> Sent: 28 July 2022 16:59:50
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] cannot set quota on ceph fs root
>
> On Thu, Jul 28, 2022 at 1:01 AM Frank Schilder wrote:
>>
>> Hi all,
>>
>> I'm trying to set a quota on the CephFS file system root, but it fails with
>> "setfattr: /mnt/adm/cephfs: Invalid argument". I can set quotas on any
>> sub-directory. Is this intentional? The documentation
>> (https://docs.ceph.com/en/octopus/cephfs/quota/#quotas) says
>>
>>> CephFS allows quotas to be set on any directory in the system.
>>
>> "Any" includes the fs root. Is the documentation incorrect, or is this a bug?
>
> I'm not immediately seeing why we can't set quota on the root, but the
> root inode is special in a lot of ways, so this doesn't surprise me.
> I'd probably regard it as a docs bug.
> That said, there's also a good chance that the setfattr is getting
> intercepted before Ceph ever sees it, since by setting it on the root
> you're necessarily interacting with a mount point in Linux, and those
> can also be finicky... You could see if it works by using cephfs-shell.
> -Greg
>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
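For sub-directories (as opposed to the root, which is the problem discussed above), a CephFS quota is set through the documented `ceph.quota.max_bytes` / `ceph.quota.max_files` vattrs. A hedged sketch with an injectable command so it can be dry-run; the path and size are examples.

```shell
#!/usr/bin/env bash
# Hedged sketch: setting a CephFS directory quota via the documented
# ceph.quota.max_bytes vattr. Path and size are examples; the command
# is injectable so this can be dry-run with `echo`.

set_cephfs_quota() {
    local path=$1 max_bytes=$2 cmd=${3:-setfattr}
    "$cmd" -n ceph.quota.max_bytes -v "$max_bytes" "$path"
}

# A quota is removed again by setting the vattr back to 0.
```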
[ceph-users] Re: PG does not become active
Ah, I see, I should have looked at the "raw" data instead ;-)

Then I agree, this is very weird?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 28 Jul 2022, at 12.45, Frank Schilder wrote:
>
> Hi Jesper,
>
> thanks for looking at this. The failure domain is OSD and not host. I typed
> it wrong in the text; the copy of the crush rule shows it right: step choose
> indep 0 type osd.
>
> I'm trying to reproduce the observation to file a tracker item, but it is
> more difficult than expected. It might be a race condition; so far I haven't
> seen it again. I hope I can figure out when and why this is happening.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Jesper Lykkegaard Karlsen
> Sent: 28 July 2022 12:02:51
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] PG does not become active
>
> Hi Frank,
>
> I think you need at least 6 OSD hosts to make EC 4+2 with failure domain
> host.
>
> I do not know how it was possible for you to create that configuration in
> the first place?
> Could it be that you have multiple names for the OSD hosts?
> That would at least explain the one OSD down being shown as two OSDs down.
>
> Also, I believe that min_size should never be smaller than the number of
> data shards (k), which is 4 in this case.
>
> You can either make a new test setup with your three test OSD hosts using EC
> 2+1, or make e.g. 4+2 but with the failure domain set to OSD.
>
> Best,
> Jesper
>
> --
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
>
> E-mail: je...@mbg.au.dk
> Tlf: +45 50906203
>
>> On 27 Jul 2022, at 17.32, Frank Schilder wrote:
>>
>> Update: the inactive PG got recovered and active after a long wait. The
>> middle question is now answered. However, these two questions are still of
>> great worry:
>>
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - If the PG should recover, why is it not prioritised, considering its severe
>> degradation compared with all other PGs?
>>
>> I don't understand how a PG can lose 2 shards if 1 OSD goes down. That
>> looks really, really bad to me (did Ceph lose track of data??).
>>
>> The second is of no less importance. The inactive PG was holding back client
>> IO, leading to further warnings about slow OPS/requests/... Why are such
>> critically degraded PGs not scheduled for recovery first? There is a service
>> outage, but only a health warning?
>>
>> Thanks and best regards.
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> From: Frank Schilder
>> Sent: 27 July 2022 17:19:05
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] PG does not become active
>>
>> I'm testing Octopus 15.2.16 and ran into a problem right away. I'm filling
>> up a small test cluster with 3 hosts, 3x3 OSDs, and killed one OSD to see how
>> recovery works. I have one 4+2 EC pool with failure domain host, and on 1 PG
>> of this pool 2 (!!!) shards are missing. This most degraded PG is not
>> becoming active; it's stuck inactive but peered.
>>
>> Questions:
>>
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - Wasn't there an important code change to allow recovery for an EC PG with
>> at least k shards present, even if min_size > k? Do I have to set something?
>> - If the PG should recover, why is it not prioritised, considering its severe
>> degradation compared with all other PGs?
>>
>> I have already increased these crush tunables and executed a pg repeer, to no
>> avail:
>>
>> tunable choose_total_tries 250 <-- default 100
>> rule fs-data {
>>     id 1
>>     type erasure
>>     min_size 3
>>     max_size 6
>>     step set_chooseleaf_tries 50 <-- default 5
>>     step set_choose_tries 200 <-- default 100
>>     step take default
>>     step choose indep 0 type osd
>>     step emit
>> }
>>
>> Ceph health detail says about that:
[ceph-users] Re: PG does not become active
Hi Frank,

I think you need at least 6 OSD hosts to make EC 4+2 with failure domain host.

I do not know how it was possible for you to create that configuration in the first place?
Could it be that you have multiple names for the OSD hosts?
That would at least explain the one OSD down being shown as two OSDs down.

Also, I believe that min_size should never be smaller than the number of data shards (k), which is 4 in this case.

You can either make a new test setup with your three test OSD hosts using EC 2+1, or make e.g. 4+2 but with the failure domain set to OSD.

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203

> On 27 Jul 2022, at 17.32, Frank Schilder wrote:
>
> Update: the inactive PG got recovered and active after a long wait. The
> middle question is now answered. However, these two questions are still of
> great worry:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - If the PG should recover, why is it not prioritised, considering its severe
> degradation compared with all other PGs?
>
> I don't understand how a PG can lose 2 shards if 1 OSD goes down. That looks
> really, really bad to me (did Ceph lose track of data??).
>
> The second is of no less importance. The inactive PG was holding back client
> IO, leading to further warnings about slow OPS/requests/... Why are such
> critically degraded PGs not scheduled for recovery first? There is a service
> outage, but only a health warning?
>
> Thanks and best regards.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> From: Frank Schilder
> Sent: 27 July 2022 17:19:05
> To: ceph-users@ceph.io
> Subject: [ceph-users] PG does not become active
>
> I'm testing Octopus 15.2.16 and ran into a problem right away. I'm filling up
> a small test cluster with 3 hosts, 3x3 OSDs, and killed one OSD to see how
> recovery works.
> I have one 4+2 EC pool with failure domain host, and on 1 PG of this pool
> 2 (!!!) shards are missing. This most degraded PG is not becoming active;
> it's stuck inactive but peered.
>
> Questions:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - Wasn't there an important code change to allow recovery for an EC PG with
> at least k shards present, even if min_size > k? Do I have to set something?
> - If the PG should recover, why is it not prioritised, considering its severe
> degradation compared with all other PGs?
>
> I have already increased these crush tunables and executed a pg repeer, to no
> avail:
>
> tunable choose_total_tries 250 <-- default 100
> rule fs-data {
>     id 1
>     type erasure
>     min_size 3
>     max_size 6
>     step set_chooseleaf_tries 50 <-- default 5
>     step set_choose_tries 200 <-- default 100
>     step take default
>     step choose indep 0 type osd
>     step emit
> }
>
> Ceph health detail says about that:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>     pg 4.32 is stuck inactive for 37m, current state
>     recovery_wait+undersized+degraded+remapped+peered, last acting
>     [1,2147483647,2147483647,4,5,2]
>
> I don't want to cheat and set min_size=k on this pool. It should work by
> itself.
>
> Thanks for any pointers!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
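On the min_size point discussed in this thread: for an erasure-coded k+m pool, the usual guidance (and the default modern Ceph applies to EC pools) is min_size = k + 1, so that one additional failure during recovery cannot drop a PG below k readable shards. A small sketch of that rule:

```python
def ec_min_size(k, m):
    """Recommended min_size for an EC k+m pool: the k data shards plus
    one, so a single extra failure during recovery cannot make the PG
    unreadable. Setting min_size=k is the 'cheat' the mail warns against."""
    if k < 1 or m < 1:
        raise ValueError("k and m must be positive")
    return k + 1
```

For the 4+2 profile discussed above this gives 5, which is why a min_size of 3 in the rule dump looks suspicious.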
[ceph-users] Re: replacing OSD nodes
It seems like low-hanging fruit to fix? There must be a reason why the developers have not made a prioritized order for backfilling PGs. Or maybe the prioritization is based on something other than available space? The question remains unanswered, as does whether my suggested approach/script would work or not. Summer vacation? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Janne Johansson Sendt: 20. juli 2022 19:39 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] replacing OSD nodes Den ons 20 juli 2022 kl 11:22 skrev Jesper Lykkegaard Karlsen : > Thanks for your answer Janne. > Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once they > get too close for comfort. > > But I just thought a continuous prioritization of rebalancing PGs could make > this process smoother, with less/no need for manual operations. You are absolutely right there; I just wanted to chip in with my experience of "it nags at me, but it will still work out", so other people finding these mails later on can feel a bit relieved knowing that a few toofull warnings aren't a major disaster and that it sometimes happens, because ceph looks for all possible moves, even those that will run late in the rebalancing. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: replacing OSD nodes
Thanks for your answer, Janne. Yes, I am also running "ceph osd reweight" on the "nearfull" osds, once they get too close for comfort. But I just thought a continuous prioritization of rebalancing PGs could make this process smoother, with less (or no) need for manual operations. Best, Jesper ---------- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Janne Johansson Sendt: 20. juli 2022 10:47 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] replacing OSD nodes Den tis 19 juli 2022 kl 13:09 skrev Jesper Lykkegaard Karlsen : > > Hi all, > Setup: Octopus - erasure 8-3 > I had gotten to the point where I had some rather old OSD nodes, that I > wanted to replace with new ones. > The procedure was planned like this: > > * add new replacement OSD nodes > * set all OSDs on the retiring nodes to out. > * wait for everything to rebalance > * remove retiring nodes > At around 50% misplaced objects remaining, the OSDs started to complain > about backfillfull OSDs and nearfull OSDs. > A bit of a surprise to me, as RAW size is only 47% used. > It seems that rebalancing does not happen in a prioritized manner, where > planned backfill starts with the OSD with the most available space, but > "alphabetically" according to pg-name. > Is this really true? I don't know if it does it in any particular order, just that it certainly doesn't fire off requests to the least-filled OSD first. When I have gotten into similar situations, it just tried to run as many moves as possible given max_backfill and all that; some/most might get stuck in toofull, but as the rest of the slots progress, space becomes available and at some point those toofull ones get handled. It delays the completion but hasn't caused me any other specific problems. 
Though I will admit I have used "ceph osd reweight osd.123 " at times to force emptying of some OSDs, but that was more my impatience than anything else. -- May the most significant bit of your life be positive. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] replacing OSD nodes
5 377 46 322 24 306 53 200 240 338 #1.9TiB bytes available on most full OSD (306)
ceph osd pg-upmap-items 20.6c5 334 371 30 340 70 266 241 407 3 233 186 356 40 312 294 391 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b4 344 338 226 389 319 362 309 411 85 379 248 233 121 318 0 254 #1.9TiB bytes available on most full OSD (233)
ceph osd pg-upmap-items 20.6b1 325 292 35 371 347 153 146 390 12 343 88 327 27 355 54 250 192 408 #1.9TiB bytes available on most full OSD (153)
ceph osd pg-upmap-items 20.57 82 389 282 356 103 165 62 284 67 408 252 366 #1.9TiB bytes available on most full OSD (165)
ceph osd pg-upmap-items 20.50 244 355 319 228 154 397 63 317 113 378 97 276 288 150 #1.9TiB bytes available on most full OSD (228)
ceph osd pg-upmap-items 20.47 343 351 107 283 81 332 76 398 160 410 26 378 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3e 56 322 31 283 330 377 107 360 199 309 190 385 78 406 #1.9TiB bytes available on most full OSD (283)
ceph osd pg-upmap-items 20.3b 91 349 312 414 268 386 45 244 125 371 #1.9TiB bytes available on most full OSD (244)
ceph osd pg-upmap-items 20.3a 277 371 290 359 91 415 165 392 107 167 #1.9TiB bytes available on most full OSD (167)
ceph osd pg-upmap-items 20.39 74 175 18 302 240 393 3 269 224 374 194 408 173 364 #1.9TiB bytes available on most full OSD (302)
...
...
If I were to set this into effect, I would first set norecover and nobackfill, then run the script, and unset norecover and nobackfill again. But I am uncertain whether it would work, or even whether this is a good idea? It would be nice if Ceph did something similar automatically 🙂 Or maybe Ceph already does something similar, and I have just not been able to find it? If Ceph were to do this, it could be nice if the priority of backfill_wait PGs was recomputed, perhaps every 24 hours, as the OSD availability landscape of course changes during backfill. 
I imagine this could especially help stabilize recovery/rebalance on systems where space is a little tight. Best regards, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Universitetsbyen 81 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
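The ordering the script above computes can be sketched in a few lines of Python. This is a toy model, not Ceph code: the PG ids, target OSDs, and free-byte figures are invented, and the minimum over each PG's targets stands in for the "bytes available on most full OSD" comment in the generated commands.

```python
def prioritize_backfills(pending, osd_free):
    """Order PGs so the one whose most-full (least-free) target OSD
    has the MOST headroom is backfilled first."""
    def headroom(pg):
        # the tightest target OSD dictates how risky this backfill is
        return min(osd_free[o] for o in pending[pg])
    return sorted(pending, key=headroom, reverse=True)

# toy data: pg -> target OSDs, osd -> free bytes (all invented)
pending = {"20.6c5": [371, 340], "20.6b4": [338, 389], "20.57": [356, 371]}
osd_free = {371: 2_000_000, 340: 500_000, 338: 1_500_000,
            389: 900_000, 356: 1_200_000}

order = prioritize_backfills(pending, osd_free)
print(order)  # PGs whose tightest target has the most free space come first
```

Re-running this whenever OSD utilisation changes would give the periodic re-prioritization suggested above.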
[ceph-users] Re: cephfs quota used
Thanks Konstantin, Actually, I went a bit further and made the script more universal in usage: ceph_du_dir:

#!/bin/bash
# usage: ceph_du_dir $DIR1 [$DIR2 ...]
for i in "$@"; do
    if [[ -d $i && ! -L $i ]]; then
        echo "$(numfmt --to=iec --suffix=B --padding=7 "$(getfattr --only-values -n ceph.dir.rbytes "$i" 2>/dev/null)" | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g') $i"
    fi
done

The above can be run as: ceph_du_dir $DIR, with multiple directories: ceph_du_dir $DIR1 $DIR2 $DIR3 .., or even with a wildcard: ceph_du_dir $DIR/* Best, Jesper ---------- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Konstantin Shalygin Sendt: 17. december 2021 09:17 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] cephfs quota used Or you can mount with the 'dirstat' option and use 'cat .' to determine CephFS stats: alias fsdf="cat . | grep rbytes | awk '{print \$2}' | numfmt --to=iec --suffix=B" [root@host catalog]# fsdf 245GB [root@host catalog]# Cheers, k On 17 Dec 2021, at 00:25, Jesper Lykkegaard Karlsen <je...@mbg.au.dk> wrote: Anyway, I just made my own ceph-fs version of "du". ceph_du_dir:

#!/bin/bash
# usage: ceph_du_dir $DIR
SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null | grep "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g)
numfmt --to=iec-i --suffix=B --padding=7 $SIZE

Prints out the ceph-fs dir size in "human-readable" form. It works like a charm and my god it is fast! Tools like that could be very useful, if provided by the development team 🙂 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Not to spam, but to make the output prettier, one can also separate the number from the byte-size prefix. numfmt --to=iec --suffix=B --padding=7 $(getfattr --only-values -n ceph.dir.rbytes $1 2>/dev/null) | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g' //Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ____ Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 23:07 Til: Jean-Francois GUILLAUME Cc: Robert Gallop ; ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Brilliant, thanks Jean-François Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jean-Francois GUILLAUME Sendt: 16. december 2021 23:03 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used Hi, You can avoid using awk by passing --only-values to getfattr. This should look something like this : > #!/bin/bash > numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n > ceph.dir.rbytes $1 2>/dev/null) Best, --- Cordialement, Jean-François GUILLAUME Plateforme Bioinformatique BiRD Tél. : +33 (0)2 28 08 00 57 www.pf-bird.univ-nantes.fr Inserm UMR 1087/CNRS UMR 6291 IRS-UN - 8 quai Moncousu - BP 70721 44007 Nantes Cedex 1 Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit : > To answer my own question. 
> It seems Frank Schilder asked a similar question two years ago: > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ > > listxattr() was aparrently removed and not much have happen since then > it seems. > > Anyway, I just made my own ceph-fs version of "du". > > ceph_du_dir: > > #!/bin/bash > # usage: ceph_du_dir $DIR > SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep > "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) > numfmt --to=iec-i --suffix=B --padding=7 $SIZE > > Prints out ceph-fs dir size in "human-readble" > It works like a charm and my god it is fast!. > > Tools like that could be very useful, if provided by the development > team 🙂 > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 14:37 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Woops, wrong copy/pasta: > > getfattr -n ceph.dir.rbytes $DIR > > works on all distributions I have tested. > > It is: > > getfattr -d -m 'ceph.*' $DIR > > that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS > 7. > > Best, > Jesper > ------ > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 13:57 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Just tested: > > getfattr -n ceph.dir.rbytes $DIR > > Works on CentOS 7, but not on Ubuntu 18.04 eighter. > Weird? 
> > Best, > Jesper > ---------- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Robert Gallop > Sendt: 16. december 2021 13:42 > Til: Jesper Lykkegaard Karlsen > Cc: ceph-users@ceph.io > Emne: Re: [ceph-users] Re: cephfs quota used > > From what I understand you used to be able to do that but cannot on > later kernels? > > Seems there would be a list somewhere,
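The prettifying pipeline from the top of this message can be tried standalone by feeding numfmt a fixed byte count instead of a CephFS xattr, so it runs on any box with GNU coreutils and sed; the 1073741824 (1 GiB) input is just an example value.

```shell
# Separate the number from the byte-size prefix, as in the thread:
# "1.0GB" becomes "1.0 GB" (LC_ALL=C pins the decimal point).
pretty=$(LC_ALL=C numfmt --to=iec --suffix=B --padding=7 1073741824 \
    | sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g')
echo "$pretty"
```

On a CephFS mount, replacing the fixed number with `$(getfattr --only-values -n ceph.dir.rbytes "$DIR")` reproduces the one-liner above.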
[ceph-users] Re: cephfs quota used
Brilliant, thanks Jean-François Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jean-Francois GUILLAUME Sendt: 16. december 2021 23:03 Til: Jesper Lykkegaard Karlsen Cc: Robert Gallop ; ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used Hi, You can avoid using awk by passing --only-values to getfattr. This should look something like this : > #!/bin/bash > numfmt --to=iec-i --suffix=B --padding=7 $(getfattr --only-values -n > ceph.dir.rbytes $1 2>/dev/null) Best, --- Cordialement, Jean-François GUILLAUME Plateforme Bioinformatique BiRD Tél. : +33 (0)2 28 08 00 57 www.pf-bird.univ-nantes.fr<http://www.pf-bird.univ-nantes.fr> Inserm UMR 1087/CNRS UMR 6291 IRS-UN - 8 quai Moncousu - BP 70721 44007 Nantes Cedex 1 Le 2021-12-16 22:25, Jesper Lykkegaard Karlsen a écrit : > To answer my own question. > It seems Frank Schilder asked a similar question two years ago: > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ > > listxattr() was aparrently removed and not much have happen since then > it seems. > > Anyway, I just made my own ceph-fs version of "du". > > ceph_du_dir: > > #!/bin/bash > # usage: ceph_du_dir $DIR > SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null| grep > "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g) > numfmt --to=iec-i --suffix=B --padding=7 $SIZE > > Prints out ceph-fs dir size in "human-readble" > It works like a charm and my god it is fast!. 
> > Tools like that could be very useful, if provided by the development > team 🙂 > > Best, > Jesper > > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 14:37 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Woops, wrong copy/pasta: > > getfattr -n ceph.dir.rbytes $DIR > > works on all distributions I have tested. > > It is: > > getfattr -d -m 'ceph.*' $DIR > > that does not work on Rocky Linux 8, Ubuntu 18.04, but works on CentOS > 7. > > Best, > Jesper > -- > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Jesper Lykkegaard Karlsen > Sendt: 16. december 2021 13:57 > Til: Robert Gallop > Cc: ceph-users@ceph.io > Emne: [ceph-users] Re: cephfs quota used > > Just tested: > > getfattr -n ceph.dir.rbytes $DIR > > Works on CentOS 7, but not on Ubuntu 18.04 eighter. > Weird? > > Best, > Jesper > ------ > Jesper Lykkegaard Karlsen > Scientific Computing > Centre for Structural Biology > Department of Molecular Biology and Genetics > Aarhus University > Gustav Wieds Vej 10 > 8000 Aarhus C > > E-mail: je...@mbg.au.dk > Tlf:+45 50906203 > > > Fra: Robert Gallop > Sendt: 16. december 2021 13:42 > Til: Jesper Lykkegaard Karlsen > Cc: ceph-users@ceph.io > Emne: Re: [ceph-users] Re: cephfs quota used > > From what I understand you used to be able to do that but cannot on > later kernels? > > Seems there would be a list somewhere, but I can’t find it, maybe > it’s changing too often depending on the kernel your using or > something. 
> > But yeah, these attrs are one of the major reasons we are moving from > traditional appliance NAS to ceph, the many other benefits come with > it. > > On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen > mailto:je...@mbg.au.dk>> wrote: > Thanks everybody, > > That was a quick answer. > > getfattr -n ceph.dir.rbytes $DIR > > Was the answer that worked for me. So getfattr was the solution after > all. > > Is there some way I can display all attributes, without knowing them > in forehand? > > I have tried: > > getfattr -d -m 'ceph.*' $DIR > > which gives me no output. Should that not list all atributes? > > This
[ceph-users] Re: cephfs quota used
To answer my own question. It seems Frank Schilder asked a similar question two years ago: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/6ENI42ZMHTTP2OONBRD7FDP7LQBC4P2E/ listxattr() was apparently removed, and not much seems to have happened since then. Anyway, I just made my own ceph-fs version of "du". ceph_du_dir:

#!/bin/bash
# usage: ceph_du_dir $DIR
SIZE=$(getfattr -n ceph.dir.rbytes $1 2>/dev/null | grep "ceph\.dir\.rbytes" | awk -F\= '{print $2}' | sed s/\"//g)
numfmt --to=iec-i --suffix=B --padding=7 $SIZE

Prints out the ceph-fs dir size in "human-readable" form. It works like a charm and my god it is fast! Tools like that could be very useful, if provided by the development team 🙂 Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ________ Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 14:37 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Woops, wrong copy/pasta: getfattr -n ceph.dir.rbytes $DIR works on all distributions I have tested. It is: getfattr -d -m 'ceph.*' $DIR that does not work on Rocky Linux 8 or Ubuntu 18.04, but works on CentOS 7. Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ____ Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 13:57 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 either. Weird? 
Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ____ Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used From what I understand you used to be able to do that but cannot on later kernels? Seems there would be a list somewhere, but I can’t find it, maybe it’s changing too often depending on the kernel your using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph, the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them in forehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all atributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk> Tlf:+45 50906203 Fra: Sebastian Knust mailto:skn...@physik.uni-bielefeld.de>> Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. 
> I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi >
[ceph-users] Re: cephfs quota used
Woops, wrong copy/pasta: getfattr -n ceph.dir.rbytes $DIR works on all distributions I have tested. It is: getfattr -d -m 'ceph.*' $DIR that does not work on Rocky Linux 8 or Ubuntu 18.04, but works on CentOS 7. Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Jesper Lykkegaard Karlsen Sendt: 16. december 2021 13:57 Til: Robert Gallop Cc: ceph-users@ceph.io Emne: [ceph-users] Re: cephfs quota used Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 either. Weird? Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used From what I understand you used to be able to do that, but cannot on later kernels? Seems there would be a list somewhere, but I can’t find it; maybe it’s changing too often depending on the kernel you're using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph; the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen <je...@mbg.au.dk> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them beforehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all attributes? 
This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper ---------- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk> Tlf:+45 50906203 Fra: Sebastian Knust mailto:skn...@physik.uni-bielefeld.de>> Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi > printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt > --to=iec $QUOTA` $PERC > fi > done Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse clients defaults to it, for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! 
Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Just tested: getfattr -n ceph.dir.rbytes $DIR Works on CentOS 7, but not on Ubuntu 18.04 either. Weird? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Robert Gallop Sendt: 16. december 2021 13:42 Til: Jesper Lykkegaard Karlsen Cc: ceph-users@ceph.io Emne: Re: [ceph-users] Re: cephfs quota used From what I understand you used to be able to do that, but cannot on later kernels? Seems there would be a list somewhere, but I can’t find it; maybe it’s changing too often depending on the kernel you're using or something. But yeah, these attrs are one of the major reasons we are moving from traditional appliance NAS to ceph; the many other benefits come with it. On Thu, Dec 16, 2021 at 5:38 AM Jesper Lykkegaard Karlsen <je...@mbg.au.dk> wrote: Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them beforehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all attributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper ---------- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Sebastian Knust <skn...@physik.uni-bielefeld.de> Sendt: 16. 
december 2021 13:01 Til: Jesper Lykkegaard Karlsen mailto:je...@mbg.au.dk>>; ceph-users@ceph.io<mailto:ceph-users@ceph.io> mailto:ceph-users@ceph.io>> Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and du-command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports: > for i in /ceph-path-to-home-dirs/*; do > if [ -d "$i" ]; then > SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i") > QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" > 2>/dev/null || echo 0) > PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null) > if [ -z "$PERC" ]; then PERC="--"; fi > printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt > --to=iec $QUOTA` $PERC > fi > done Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse clients defaults to it, for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. 
Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io<mailto:ceph-users@ceph.io> To unsubscribe send an email to ceph-users-le...@ceph.io<mailto:ceph-users-le...@ceph.io> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: cephfs quota used
Thanks everybody, That was a quick answer. getfattr -n ceph.dir.rbytes $DIR Was the answer that worked for me. So getfattr was the solution after all. Is there some way I can display all attributes, without knowing them beforehand? I have tried: getfattr -d -m 'ceph.*' $DIR which gives me no output. Should that not list all attributes? This is on Rocky Linux kernel 4.18.0-348.2.1.el8_5.x86_64 Best, Jesper ------ Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 Fra: Sebastian Knust Sendt: 16. december 2021 13:01 Til: Jesper Lykkegaard Karlsen ; ceph-users@ceph.io Emne: Re: [ceph-users] cephfs quota used Hi Jasper, On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote: > Now, I want to access the usage information of folders with quotas from root > level of the cephfs. > I have failed to find this information through getfattr commands, only quota > limits are shown here, and the du command on individual folders is a suboptimal > solution. `getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems. As an example, I am using this script for weekly utilisation reports:

> for i in /ceph-path-to-home-dirs/*; do
>     if [ -d "$i" ]; then
>         SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
>         QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 2>/dev/null || echo 0)
>         PERC=$(echo $SIZE*100/$QUOTA | bc 2>/dev/null)
>         if [ -z "$PERC" ]; then PERC="--"; fi
>         printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt --to=iec $QUOTA` $PERC
>     fi
> done

Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse client defaults to it; for the kernel client you have to specify it in the mount command or fstab entry. 
The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example (bug report and patch proposal have been submitted) that I know of. Cheers Sebastian ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
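On the "list all attributes" question earlier in the thread: the CephFS virtual xattrs are not returned by listxattr() on recent kernels, so `getfattr -d -m 'ceph.*'` prints nothing and each attribute has to be requested by name. A sketch that loops over the recursive-stat attribute names; the directory path is invented, and this needs a CephFS mount:

```shell
DIR=/cephfs/somedir   # hypothetical CephFS directory
for a in entries files subdirs rentries rfiles rsubdirs rbytes rctime; do
    printf '%-18s %s\n' "ceph.dir.$a" \
        "$(getfattr --only-values -n "ceph.dir.$a" "$DIR" 2>/dev/null)"
done
```

The same pattern works for the quota attributes ceph.quota.max_bytes and ceph.quota.max_files.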
[ceph-users] cephfs quota used
Hi all, CephFS quotas work really well for me. A cool feature is that if one mounts a folder which has quotas enabled, then the mountpoint will show up as a partition of the quota size together with how much is used (e.g. with the df command), nice! Now, I want to access the usage information of folders with quotas from the root level of the cephfs. I have failed to find this information through getfattr commands, only quota limits are shown there, and the du command on individual folders is a suboptimal solution. The usage information must be somewhere in the ceph metadata/mon db, but where, and how do I read it? Best, Jesper -- Jesper Lykkegaard Karlsen Scientific Computing Centre for Structural Biology Department of Molecular Biology and Genetics Aarhus University Gustav Wieds Vej 10 8000 Aarhus C E-mail: je...@mbg.au.dk Tlf:+45 50906203 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Recover data from Cephfs snapshot
Hi Ceph'ers,

I love the possibility to make snapshots on CephFS systems. Although there is one thing that puzzles me.

Creating a snapshot takes no time at all, and deleting snapshots can bring PGs into the snaptrim state for some hours. Recovering data from a snapshot, however, will always invoke a full data transfer, where data are "physically" copied back into place. This can make recovering from snapshots on CephFS a rather heavy procedure. I have even tried the "mv" command, but that also starts transferring real data instead of just moving metadata pointers.

Am I missing some "ceph snapshot recover" command that can move metadata pointers and make recovery much lighter, or is this just the way it is?

Best regards, Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203
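For context on how restores work today: CephFS snapshots are exposed through the hidden .snap directory inside each folder, and recovering a file is an ordinary copy from there, which is exactly the full data transfer described above. A sketch with hypothetical paths and snapshot name:

```shell
# Hypothetical paths; .snap is the invisible per-directory snapshot entry point.
mkdir /mnt/cephfs/projects/.snap/before-cleanup     # create a snapshot
ls /mnt/cephfs/projects/.snap/                      # list snapshots for this dir

# Recovery is a plain copy, i.e. a full data transfer back into place:
cp -a /mnt/cephfs/projects/.snap/before-cleanup/lostfile /mnt/cephfs/projects/
```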
[ceph-users] Cephfs metadata and MDS on same node
Dear Ceph'ers,

I am about to upgrade the MDS nodes for CephFS in the Ceph cluster (erasure code 8+3) I am administrating. Since they will get plenty of memory and CPU cores, I was wondering if it would be a good idea to move the metadata OSDs (NVMes, currently on OSD nodes together with the cephfs_data OSDs (HDD)) to the MDS nodes?

Configured as: 4 x MDS, each with a metadata OSD, and with 4 x replication, so each metadata OSD would hold a complete copy of the metadata.

I know the MDS stores a lot of metadata in RAM, but if the metadata OSDs were on the MDS nodes, would that not bring down latency?

Anyway, I am just asking for your opinion on this. Pros and cons, or even better, somebody who has actually tried this?

Best regards, Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203
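One way to express the proposed layout is a dedicated CRUSH root containing only the MDS hosts, with a replicated rule pinned to it. This is a hedged sketch with hypothetical bucket, host, rule and device-class names; moving hosts between CRUSH roots triggers data movement, so it should be rehearsed on a test cluster first:

```shell
# Hedged sketch: pin a 4x-replicated metadata pool to the four MDS hosts.
# All names are hypothetical; assumes the metadata OSDs carry the "nvme" class.
ceph osd crush add-bucket mds-root root
ceph osd crush move mds1 root=mds-root        # repeat for mds2..mds4
ceph osd crush rule create-replicated mds-meta-rule mds-root host nvme
ceph osd pool set cephfs_metadata crush_rule mds-meta-rule
ceph osd pool set cephfs_metadata size 4
ceph osd pool set cephfs_metadata min_size 2
```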
[ceph-users] Healthy objects trapped in incomplete pgs
Dear Cephers,

A few days ago disaster struck the Ceph cluster (erasure-coded) I am administrating, as UPS power was pulled from the cluster, causing a power outage. After rebooting the system, 6 OSDs were lost (spread over 5 OSD nodes), as they could not mount anymore; several others were damaged. This was more than the host-failure domain was set up to handle, auto-recovery failed, and OSDs started going down in a cascading manner.

When the dust settled, there were 8 PGs (of 2048) inactive and a bunch of OSDs down. I managed to recover 5 PGs, mainly with ceph-objectstore-tool export/import/repair commands, but now I am left with 3 PGs that are inactive and incomplete.

One of the PGs seems un-salvageable, as I cannot get it to become active at all (repair/import/export/lowering min_size), but the two others I can get active if I export/import one of the PG shards and restart the OSD. Rebuilding then starts, but after a while one of the OSDs holding the PGs goes down, with a "FAILED ceph_assert(clone_size.count(clone))" message in the log.

If I set the OSDs to noout nodown, then I can see that it is only rather few objects, e.g. 161 of a PG of >10, that are failing to be remapped. Since most of the objects in the two PGs seem intact, it would be sad to delete the whole PG (force-create-pg) and lose all that data. Is there a way to show and delete the failing objects?

I have thought of a recovery plan and want to share it with you, so you can comment on whether it sounds doable or not:

* Stop OSDs from recovering: ceph osd set norecover
* Bring back PGs active: ceph-objectstore-tool export/import and restart OSD
* Find files in PGs: cephfs-data-scan pg_files
* Pull out as many as possible of those files to another location
* Recreate PGs: ceph osd force-create-pg
* Restart recovery: ceph osd unset norecover
* Copy back in the recovered files

Would that work, or do you have a better suggestion?
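The plan above could be sketched as the following sequence. The PG id, OSD numbers and paths are placeholders, and force-create-pg is destructive, so this is only an outline, not a validated procedure:

```shell
# Placeholder ids/paths throughout; destructive -- outline only.
ceph osd set norecover                                    # 1. pause recovery
systemctl stop ceph-osd@42                                # 2. export a shard (OSD must be stopped)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
    --pgid 20.13fs0 --op export --file /root/pg-shard.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 \
    --pgid 20.13fs0 --op import --file /root/pg-shard.export
systemctl start ceph-osd@42 ceph-osd@57
cephfs-data-scan pg_files / 20.13f > /root/affected-files.txt   # 3. list files in the PG
# 4. copy the listed files to another location, then:
ceph osd force-create-pg 20.13f --yes-i-really-mean-it    # 5. recreate the PG (data loss!)
ceph osd unset norecover                                  # 6. resume recovery
# 7. restore the saved files into place
```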
Cheers, Jesper

------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tlf: +45 50906203