[ceph-users] Invalid crush class

2022-10-08 Thread Michael Thomas
In 15.2.7, how can I remove an invalid crush class?  I'm surprised that 
I was able to create it in the first place:


[root@ceph1 bin]# ceph osd crush class ls
[
"ssd",
"JBOD.hdd",
"nvme",
"hdd"
]


[root@ceph1 bin]# ceph osd crush class ls-osd JBOD.hdd
Invalid command: invalid chars . in JBOD.hdd
osd crush class ls-osd <class> :  list all osds belonging to the specific <class>
>
Error EINVAL: invalid command

There are no devices mapped to this class:

[root@ceph1 bin]# ceph osd crush tree | grep JBOD | wc -l
0
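
A possible workaround (untested here) would be to round-trip the CRUSH map 
through crushtool, since recompiling the map from its text form may drop a 
class that no device or rule references:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  # confirm nothing in crushmap.txt still references JBOD.hdd, then:
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new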

--Mike


[ceph-users] Re: Rebalance after draining - why?

2022-05-28 Thread Michael Thomas

Try this:

ceph osd crush reweight osd.XX 0
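
Setting the CRUSH weight to 0 (rather than the reweight override) drains the 
OSD from its final position in the CRUSH map, so stopping and removing it 
later shouldn't trigger a second rebalance.  A rough sketch of the full 
retirement sequence, where XX is a placeholder for the OSD id and a 
non-containerized deployment is assumed:

  ceph osd crush reweight osd.XX 0
  # wait for 'ceph -s' to report all PGs active+clean, then:
  ceph osd out osd.XX
  systemctl stop ceph-osd@XX          # on the OSD's host
  ceph osd purge osd.XX --yes-i-really-mean-it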

--Mike

On 5/28/22 15:02, Nico Schottelius wrote:


Good evening dear fellow Ceph'ers,

when removing OSDs from a cluster, we sometimes use

 ceph osd reweight osd.XX 0

and wait until the OSD's content has been redistributed. However, when
then finally stopping and removing it, Ceph is again rebalancing.

I assume this is due to a position that is removed in the CRUSH map and
thus the logical placement is "wrong". (Am I wrong about that?)

I wonder, is there a way to tell ceph properly that a particular OSD is
planned to leave the cluster and to remove the data to the "correct new
position" instead of doing the rebalance dance twice?

Best regards,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch




[ceph-users] Re: managed block storage stopped working

2022-02-09 Thread Michael Thomas

On 1/7/22 16:49, Marc wrote:





Where else can I look to find out why the managed block storage isn't
accessible anymore?



ceph -s? I guess it is not showing any errors, and there is probably nothing 
wrong with ceph. You can do an rbdmap and see if you can just map an image. 
Then try mapping an image with the user credentials ovirt is using; maybe some 
auth key has been deleted.
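
(A minimal sketch of that test; the pool, image, and client names below are 
hypothetical:)

  rbd ls ovirt-volumes --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring
  rbd map ovirt-volumes/test-image --id cinder --keyring /etc/ceph/ceph.client.cinder.keyring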


Finally figured out the problem.  A routing change on our core switch 
was preventing the OSDs from being able to talk to the ovirt engine. 
For some reason I thought the engine delegated all disk creation/attach 
operations to the ovirt hosts, but it seems that the engine still needs 
to be able to reach the OSDs.


Access to the rbd images from the hosts was working fine, but access to 
the rbd images (and OSDs) from the engine was failing.


After adding back the missing route, I'm able to create and attach new 
ceph rbd volumes.


Thanks for the nudge in the right direction,

--Mike


[ceph-users] managed block storage stopped working

2022-01-07 Thread Michael Thomas
...sorta.  I have an ovirt-4.4.2 system installed a couple of years ago 
and set up managed block storage using ceph Octopus[1].  This has been 
working well since it was originally set up.


In late November we had some network issues on one of our ovirt hosts, 
as well as a separate network issue that took many ceph OSDs offline.  This 
was eventually recovered, and 2 of the 3 VMs that use managed block 
storage started working again.  The third did not.


We eventually discovered that ovirt was not able to access the ceph rbd 
images, which is odd because two VMs are actively reading and writing to 
ceph block devices.  We are also no longer able to create new ovirt 
disks using the managed block driver.


/var/log/cinderlib/cinderlib.log on the ovirt-engine is empty.

/var/log/ovirt-engine/engine.log shows the attempt to connect to the 
storage, which eventually errors out with no helpful message:


2022-01-07 11:36:47,398-06 INFO 
[org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] (default 
task-1) [6613fac6-dd2f-4d22-993b-d805b2b572cd] Running command: 
AttachDiskToVmCommand internal: false. Entities affected :  ID: 
804b259a-c580-436b-a5ba-decdd0a2ccbd Type: VMAction group 
CONFIGURE_VM_STORAGE with role type USER,  ID: 
32c537e9-42cf-4648-b33b-2723374416e1 Type: DiskAction group ATTACH_DISK 
with role type USER
2022-01-07 11:36:47,415-06 INFO 
[org.ovirt.engine.core.bll.storage.disk.managedblock.ConnectManagedBlockStorageDeviceCommand] 
(default task-1) [46265b18] Running command: 
ConnectManagedBlockStorageDeviceCommand internal: true.
2022-01-07 11:39:00,248-06 INFO 
[org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] 
(EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) 
[] Thread pool 'default' is using 0 threads out of 1, 5 threads waiting 
for tasks.
2022-01-07 11:39:00,248-06 INFO 
[org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] 
(EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) 
[] Thread pool 'engine' is using 0 threads out of 500, 32 threads 
waiting for tasks and 0 tasks in queue.
2022-01-07 11:39:00,248-06 INFO 
[org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] 
(EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) 
[] Thread pool 'engineScheduledThreadPool' is using 0 threads out of 1, 
100 threads waiting for tasks.
2022-01-07 11:39:00,248-06 INFO 
[org.ovirt.engine.core.bll.utils.ThreadPoolMonitoringService] 
(EE-ManagedScheduledExecutorService-engineThreadMonitoringThreadPool-Thread-1) 
[] Thread pool 'engineThreadMonitoringThreadPool' is using 1 threads out 
of 1, 0 threads waiting for tasks.
2022-01-07 11:41:19,774-06 INFO 
[org.ovirt.engine.core.bll.aaa.LoginOnBehalfCommand] (default task-6) 
[103222ef] Running command: LoginOnBehalfCommand internal: true.
2022-01-07 11:41:19,832-06 INFO 
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(default task-6) [103222ef] EVENT_ID: USER_LOGIN_ON_BEHALF(1,401), 
Executed login on behalf - for user admin.
2022-01-07 11:41:19,848-06 INFO 
[org.ovirt.engine.core.bll.aaa.LogoutSessionCommand] (default task-6) 
[32106489] Running command: LogoutSessionCommand internal: true.
2022-01-07 11:41:19,853-06 INFO 
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(default task-6) [32106489] EVENT_ID: USER_VDC_LOGOUT(31), User SYSTEM 
connected from 'UNKNOWN' using session 
'pSzmWpAZSakSozpj4HQF2bic6EKUClj5wni+i9GPIlmdLIqfnAG9LYqb2MbO34fOuskBvjmTPbe4WRGFWUfmbQ==' 
logged out.
2022-01-07 11:41:47,405-06 ERROR 
[org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand] 
(Transaction Reaper Worker 0) [] Transaction rolled-back for command 
'org.ovirt.engine.core.bll.storage.disk.AttachDiskToVmCommand'.


Where else can I look to find out why the managed block storage isn't 
accessible anymore?


--Mike

[1]https://lists.ovirt.org/archives/list/us...@ovirt.org/thread/KHCLXVOCELHOR3G7SH3GDPGRKITCW7UY/


[ceph-users] Re: [External Email] Re: ceph-objectstore-tool core dump

2021-10-04 Thread Michael Thomas
On 10/4/21 11:57 AM, Dave Hall wrote:
> I also had a delay on the start of the repair scrub when I was dealing with
> this issue.  I ultimately increased the number of simultaneous scrubs, but
> I think you could also temporarily disable scrubs and then re-issue the 'pg
> repair'.  (But I'm not one of the experts on this.)
> 
> My perception is that between EC pools, large HDDs, and the overall OSD
> count, there might need to be some tuning to assure that scrubs can get
> scheduled:  A large HDD contains pieces of more PGs.  Each PG in an EC pool
> is spread across more disks than a replication pool.  Thus, especially if
> the number of OSDs is not large, there is an increased chance that more
> than one scrub will want to read the same OSD.   Scheduling nightmare if
> the number of simultaneous scrubs is low and client traffic is given
> priority.
> 
> -Dave

That seemed to be the case.  After ~24 hours, 1 of the 8 repair tasks
had completed.  Unfortunately, it found another error that wasn't
present before.

After checking the SMART logs, it looks like this particular disk is
failing.  No sense in pursuing this any further; I'll be replacing it
with a spare instead.

I'll look into disabling scrubs the next time I need to schedule a
repair.  Hopefully it will run the repair jobs a bit sooner.
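
For reference, a sketch of what I have in mind, either pausing regular scrubs 
around the repair or raising the per-OSD scrub limit that Dave mentioned:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph pg repair <pgid>
  # ...after the repair completes:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub

  # or, alternatively, allow more concurrent scrubs per OSD:
  ceph config set osd osd_max_scrubs 2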

Regards,

--Mike


[ceph-users] Re: ceph-objectstore-tool core dump

2021-10-03 Thread Michael Thomas

On 10/3/21 12:08, 胡 玮文 wrote:



On 2021-10-04 at 00:53, Michael Thomas wrote:

I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph 
cluster.  I was able to determine that they are all coming from the same OSD: 
osd.143.  This host recently suffered from an unplanned power loss, so I'm not 
surprised that there may be some corruption.  This PG is part of an EC 8+2 pool.

The OSD logs from the PG's primary OSD show this and similar errors from the 
PG's most recent deep scrub:

2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log [ERR] : 
23.1fa shard 143(1) soid 23:5f8c3d4e:::1179969.0168:head : candidate 
had a read error

In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the PG. This 
accomplished nothing.  Next I ran a shallow fsck on the OSD:


I expect this ‘ceph pg repair’ command could handle this kind of error. After 
issuing this command, the pg should enter a state like 
“active+clean+scrubbing+deep+inconsistent+repair”, then you wait for the repair 
to finish (this can take hours), and you should be able to recover from the 
inconsistent state. What do you mean by “This accomplished nothing”?


The PG never entered the 'repair' state, nor did anything appear in the 
primary OSD logs about a request for repair.  After more than 24 hours, 
the PG remained listed as 'inconsistent'.
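
For what it's worth, the errors the deep scrub recorded can be listed 
directly, which at least shows what a repair would have to fix:

  rados list-inconsistent-obj 23.1fa --format=json-pretty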


--Mike


[ceph-users] ceph-objectstore-tool core dump

2021-10-03 Thread Michael Thomas
I recently started getting inconsistent PGs in my Octopus (15.2.14) ceph 
cluster.  I was able to determine that they are all coming from the same 
OSD: osd.143.  This host recently suffered from an unplanned power loss, 
so I'm not surprised that there may be some corruption.  This PG is part 
of an EC 8+2 pool.


The OSD logs from the PG's primary OSD show this and similar errors from 
the PG's most recent deep scrub:


2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log 
[ERR] : 23.1fa shard 143(1) soid 23:5f8c3d4e:::1179969.0168:head 
: candidate had a read error


In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the PG. 
This accomplished nothing.  Next I ran a shallow fsck on the OSD:


# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-143
fsck success

I estimated that a deep fsck will take ~24 hours to run on this mostly 
full 16TB HDD.  Before doing that, I wanted to see if I could simply 
remove the offending object and let ceph recover itself.  Unfortunately, 
ceph-objectstore-tool core dumps when I try to remove this object:


# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-143 --pgid 
23.1fa 
'{"oid":"1179969.0168","key":"","snapid":-2,"hash":1924936186,"max":0,"pool":23,"namespace":"","shard_id":1,"max":0}' 
remove

*** Caught signal (Segmentation fault) **
 in thread 7fdc491a88c0 thread_name:ceph-objectstor
 ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) 
octopus (stable)

 1: (()+0xf630) [0x7fdc3e62a630]
 2: (__pthread_rwlock_rdlock()+0xb) [0x7fdc3e62614b]
 3: 
(BlueStore::collection_bits(boost::intrusive_ptr&)+0x148) 
[0x5583c8fa7878]

 4: (main()+0x4b50) [0x5583c8a85270]
 5: (__libc_start_main()+0xf5) [0x7fdc3cfe7555]
 6: (()+0x39d3a0) [0x5583c8ab03a0]
Segmentation fault (core dumped)

As a last resort, I know that I can map this OID back to the cephfs file 
and simply remove/restore the offending file to fix the object.  But 
before I do that, I'm running a deep fsck to see if that can fix this 
and the other inconsistent objects.  In the meantime, I wondered if 
there was anything else I could do to clean up this inconsistent PG?
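
For reference, that OID-to-file mapping uses the hex prefix of the object name 
as the cephfs inode number; a sketch, assuming the filesystem is mounted under 
/ceph:

  oid=1179969                    # hex prefix of object 1179969.0168
  printf '%d\n' 0x${oid}         # -> 18323817, the cephfs inode number
  find /ceph -inum 18323817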


--Mike


[ceph-users] cephfs auditing

2021-05-27 Thread Michael Thomas
Is there a way to log or track which cephfs files are being accessed? 
This would help us in planning where to place certain datasets based on 
popularity, eg on a EC HDD pool or a replicated SSD pool.


I know I can run inotify on the ceph clients, but I was hoping that the 
MDS would have a way to log this information centrally.


--Mike


[ceph-users] Re: HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Michael Thomas
I recently had a similar issue when reducing the number of PGs on a 
pool.  A few OSDs became backfillfull even though there was enough space; 
the OSDs were just not balanced well.


To fix, I reweighted the most-full OSDs:

ceph osd reweight-by-utilization 120

After it finished (~1 hour), I had fewer backfillfull OSDs.  I repeated 
this 2 more times, after which the OSDs were no longer backfillfull and 
recovery data movement resumed.


Once the recovery was complete, I reweighted all OSDs back to 1.0, and 
all was fine.
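
For the cautious, there is also a dry-run variant that reports what would 
change without adjusting any weights:

  ceph osd test-reweight-by-utilization 120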


--Mike

On 4/12/21 12:30 PM, Ml Ml wrote:

Hello,

I kind of ran out of disk space, so I added another host with osd.37.
But it does not seem to move much data on it. (85MB in 2h)

Any idea why the recovery process seems to be stuck? Should I fix the
4 backfillfull OSDs first (by changing the weight)?

root@ceph01:~# ceph -s
   cluster:
 id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
 health: HEALTH_WARN
 4 backfillfull osd(s)
 9 nearfull osd(s)
 Low space hindering backfill (add storage if this doesn't
resolve itself): 1 pg backfill_toofull
 4 pool(s) backfillfull

   services:
 mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
 mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
 mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
 osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs

   task status:
 scrub status:
 mds.backup.ceph06.hdjehi: idle

   data:
 pools:   4 pools, 1185 pgs
 objects: 24.69M objects, 45 TiB
 usage:   149 TiB used, 42 TiB / 191 TiB avail
 pgs: 5388809/74059569 objects misplaced (7.276%)
  950 active+clean
  232 active+remapped+backfill_wait
  2   active+remapped+backfilling
  1   active+remapped+backfill_wait+backfill_toofull

   io:
 recovery: 0 B/s, 171 keys/s, 16 objects/s

   progress:
 Rebalancing after osd.37 marked in (2h)
   [] (remaining: 6d)



root@ceph01:~# ceph health detail
HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space
hindering backfill (add storage if this doesn't resolve itself): 1 pg
backfill_toofull; 4 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
 osd.28 is backfill full
 osd.32 is backfill full
 osd.66 is backfill full
 osd.68 is backfill full
[WRN] OSD_NEARFULL: 9 nearfull osd(s)
 osd.11 is near full
 osd.24 is near full
 osd.27 is near full
 osd.39 is near full
 osd.40 is near full
 osd.42 is near full
 osd.43 is near full
 osd.45 is near full
 osd.69 is near full
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
this doesn't resolve itself): 1 pg backfill_toofull
 pg 23.295 is active+remapped+backfill_wait+backfill_toofull,
acting [8,67,32]
[WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
 pool 'backurne-rbd' is backfillfull
 pool 'device_health_metrics' is backfillfull
 pool 'cephfs.backup.meta' is backfillfull
 pool 'cephfs.backup.data' is backfillfull


root@ceph01:~# ceph osd df tree
ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
META AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
  -1 182.59897 -  191 TiB  149 TiB  149 TiB35 GiB
503 GiB   42 TiB  77.96  1.00-  root default
  -2  24.62473 -   29 TiB   22 TiB   22 TiB   5.0 GiB
80 GiB  7.1 TiB  75.23  0.96-  host ceph01
   0hdd2.3   1.0  2.7 TiB  2.2 TiB  2.2 TiB   665 MiB
8.0 GiB  480 GiB  82.43  1.06   53  up  osd.0
   1hdd2.2   1.0  2.7 TiB  2.1 TiB  2.1 TiB   446 MiB
7.5 GiB  590 GiB  78.44  1.01   49  up  osd.1
   4hdd2.67029   0.91066  2.7 TiB  2.2 TiB  2.2 TiB   484 MiB
7.9 GiB  440 GiB  83.90  1.08   53  up  osd.4
   8hdd2.3   1.0  2.7 TiB  2.1 TiB  2.1 TiB   490 MiB
7.9 GiB  533 GiB  80.49  1.03   51  up  osd.8
  11hdd1.71660   1.0  1.7 TiB  1.5 TiB  1.5 TiB   406 MiB
5.5 GiB  200 GiB  88.60  1.14   36  up  osd.11
  12hdd1.2   1.0  2.7 TiB  1.2 TiB  1.2 TiB   366 MiB
4.9 GiB  1.5 TiB  43.89  0.56   28  up  osd.12
  14hdd2.2   1.0  2.7 TiB  2.0 TiB  2.0 TiB   418 MiB
7.1 GiB  693 GiB  74.66  0.96   47  up  osd.14
  18hdd2.2   1.0  2.7 TiB  2.0 TiB  1.9 TiB   434 MiB
7.3 GiB  737 GiB  73.05  0.94   47  up  osd.18
  22hdd1.0   1.0  1.7 TiB  890 GiB  886 GiB   110 MiB
3.6 GiB  868 GiB  50.62  0.65   20  up  osd.22
  30hdd1.5   1.0  1.7 TiB  1.4 TiB  1.3 TiB   361 MiB
4.9 GiB  370 GiB  78.93  1.01   32  up  osd.30
  33hdd1.5   0.97437  1.6 TiB  1.4 TiB  1.4 TiB   397 MiB
5.4 GiB  213 GiB  87.20  1.12   34  up  osd.33
  64hdd3.33789   0.89752  3.3 TiB  2.7 TiB 

[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

2021-04-09 Thread Michael Thomas

Hi Joshua,

I'll dig into this output a bit more later, but here are my thoughts 
right now.  I'll preface this by saying that I've never had to clean up 
from unrecoverable incomplete PGs, so some of what I suggest may not 
work/apply or be the ideal fix in your case.


Correct me if I'm wrong, but you are willing to throw away all of the 
data on this pool?  This should make it easier because we don't have to 
worry about recovering any lost data.


If this is the case, then I think the general strategy would be:

1) Identify and remove any files/directories in cephfs that are located 
on this pool (based on ceph.file.layout.pool=claypool and 
ceph.dir.layout.pool=claypool).  Use 'unlink' instead of 'rm' to remove 
the files; it should be less prone to hanging.


2) Wait a bit for ceph to clean up any unreferenced objects.  Watch the 
output of 'ceph df' to see how many objects are listed for the pool.


3) Use 'rados -p claypool ls' to identify the remaining objects.  Use 
the OID identifier to calculate the inode number of each file, then 
search cephfs to identify which files these belong to.  I would expect 
there to be none, as you already deleted the files in step 1.


4) With nothing in the cephfs metadata referring to the objects anymore, 
it should be safe to remove them with 'rados -p claypool rm' (see the sketch after this list).


5) Remove the now-empty pool from cephfs

6) Remove the now-empty pool from ceph
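
For steps 3 and 4, a rough sketch of the loops I have in mind, assuming the 
pool is claypool and the filesystem is mounted under /ceph:

  rados -p claypool ls > objs.txt
  # the hex prefix of each object name is the cephfs inode number
  awk -F. '{print $1}' objs.txt | sort -u | while read oid ; do printf '%d\n' 0x${oid} ; done > inums.txt
  cat inums.txt | while read inum ; do echo -n "${inum} " ; find /ceph -inum ${inum} ; done
  # once nothing in cephfs references the objects:
  cat objs.txt | while read obj ; do rados -p claypool rm ${obj} ; done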

Can you also include the output of 'ceph df'?

--Mike

On 4/9/21 7:31 AM, Joshua West wrote:

Thank you Mike!

This is honestly a way more detailed reply than I was expecting.
You've equipped me with new tools to work with.  Thank you!

I don't actually have any unfound pgs... only "incomplete" ones, which
limits the usefulness of:
`grep recovery_unfound`
`ceph pg $pg list_unfound`
`ceph pg $pg mark_unfound_lost delete`

I don't seem to see equivalent commands for incomplete pgs, save for
grep of course.

This does make me slightly more hopeful that recovery might be
possible if the pgs are incomplete and stuck, but not unfound..? Not
going to get my hopes too high.

Going to attach a few items just to keep from bugging me, if anyone
can take a glance, it would be appreciated.

In the meantime, in the absence of the above commands, what's the best
way to clean this up under the assumption that the data is lost?

~Joshua


Joshua West
President
403-456-0072
CAYK.ca


On Thu, Apr 8, 2021 at 6:15 PM Michael Thomas  wrote:


Hi Joshua,

I have had a similar issue three different times on one of my cephfs
pools (15.2.10). The first time this happened I had lost some OSDs.  In
all cases I ended up with degraded PGs with unfound objects that could
not be recovered.

Here's how I recovered from the situation.  Note that this will
permanently remove the affected files from ceph.  Restoring them from
backup is an exercise left to the reader.

* Make a list of the affected PGs:
ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected objects (OIDs):
cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg
$pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' >
oid.txt

* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and
put the results in a file called 'inum.txt'

* On a ceph client, find the files that correspond to the affected inodes:
cat inum.txt | while read inum ; do echo -n "${inum} " ; find
/ceph/frames/O3/raw -inum ${inum} ; done > files.txt

* It may be helpful to put this table of PG, OID, inum, and files into a
spreadsheet to keep track of what's been done.

* On the ceph client, use 'unlink' to remove the files from the
filesystem.  Do not use 'rm', as it will hang while calling 'stat()' on
each file.  Even unlink may hang when you first try it.  If it does
hang, do the following to get it unstuck:
- Reboot the client
- Restart each mon and the mgr.  I rebooted each mon/mgr, but it may
be sufficient to restart the services without a reboot.
- Try using 'unlink' again

* After all of the affected files have been removed, go through the list
of PGs and remove the unfound OIDs:
ceph pg $pgid mark_unfound_lost delete

...or if you're feeling brave, delete them all at once:
cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg
$pg mark_unfound_lost delete ; done

* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.

* Restore the deleted files from backup, or decide that you don't care
about them and don't do anything.

This procedure lets you fix the problem without deleting the affected
pool.  To be honest, the first time it happened, my solution was to
first copy all of the data off of the affected pool and onto a new pool.
   I later found this to be unnecessary.  But if you want to pursue this,
here's what I suggest:

* Follow the steps above to get rid of the affected files.  I feel this
should still be done even though you don't care 

[ceph-users] Re: Abandon incomplete (damaged EC) pgs - How to manage the impact on cephfs?

2021-04-08 Thread Michael Thomas

Hi Joshua,

I have had a similar issue three different times on one of my cephfs 
pools (15.2.10). The first time this happened I had lost some OSDs.  In 
all cases I ended up with degraded PGs with unfound objects that could 
not be recovered.


Here's how I recovered from the situation.  Note that this will 
permanently remove the affected files from ceph.  Restoring them from 
backup is an exercise left to the reader.


* Make a list of the affected PGs:
  ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected objects (OIDs):
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg 
$pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > 
oid.txt


* Convert the OID numbers to inodes using 'printf "%d\n" 0x${oid}' and 
put the results in a file called 'inum.txt'


* On a ceph client, find the files that correspond to the affected inodes:
  cat inum.txt | while read inum ; do echo -n "${inum} " ; find 
/ceph/frames/O3/raw -inum ${inum} ; done > files.txt


* It may be helpful to put this table of PG, OID, inum, and files into a 
spreadsheet to keep track of what's been done.


* On the ceph client, use 'unlink' to remove the files from the 
filesystem.  Do not use 'rm', as it will hang while calling 'stat()' on 
each file.  Even unlink may hang when you first try it.  If it does 
hang, do the following to get it unstuck:

  - Reboot the client
  - Restart each mon and the mgr.  I rebooted each mon/mgr, but it may 
be sufficient to restart the services without a reboot.

  - Try using 'unlink' again

* After all of the affected files have been removed, go through the list 
of PGs and remove the unfound OIDs:

  ceph pg $pgid mark_unfound_lost delete

...or if you're feeling brave, delete them all at once:
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg 
$pg mark_unfound_lost delete ; done


* Watch the output of 'ceph -s' to see the health of the pools/pgs recover.

* Restore the deleted files from backup, or decide that you don't care 
about them and don't do anything.

This procedure lets you fix the problem without deleting the affected 
pool.  To be honest, the first time it happened, my solution was to 
first copy all of the data off of the affected pool and onto a new pool. 
 I later found this to be unnecessary.  But if you want to pursue this, 
here's what I suggest:


* Follow the steps above to get rid of the affected files.  I feel this 
should still be done even though you don't care about saving the data, 
to prevent corruption in the cephfs metadata.


* Go through the entire filesystem and look for:
  - files that are located on the pool (ceph.file.layout.pool = $pool_name)
  - directories that are set to write files to the pool 
(ceph.dir.layout.pool = $pool_name)


* After you confirm that no files or directories are pointing at the 
pool anymore, run 'ceph df' and look at the number of objects in the 
pool.  Ideally, it would be zero.  But more than likely it isn't.  This 
could be a simple mismatch in the object count in cephfs (harmless), or 
there could be clients with open filehandles on files that have been 
removed.  such objects will still appear in the rados listing of the 
pool[1]:

  rados -p $pool_name ls
  for obj in $(rados -p $pool_name ls); do echo $obj; rados -p 
$pool_name getxattr $obj parent | strings; done


* To check for clients with access to these stray objects, dump the mds 
cache:

  ceph daemon mds.ceph1 dump cache /tmp/cache.txt

* Look for lines that refer to the stray objects, like this:
  [inode 0x1020fbc [2,head] ~mds0/stray6/1020fbc auth v7440537 
s=252778863 nl=0 n(v0 rc2020-12-11T21:17:59.454863-0600 b252778863 
1=1+0) (iversion lock) caps={9541437=pAsLsXsFscr/pFscr@2},l=9541437 | 
caps=1 authpin=0 0x563a7e52a000]


* The 'caps' field in the output above contains the client session id 
(eg 9541437).  Search the MDS for sessions that match to identify the 
client:

  ceph daemon mds.ceph1 session ls > session.txt
  Search through 'session.txt' for matching entries.  This will give 
you the IP address of the client:

"id": 9541437,
"entity": {
"name": {
"type": "client",
"num": 9541437
},
"addr": {
"type": "v1",
"addr": "10.13.5.48:0",
"nonce": 2011077845
}
},

* Restart the client's connection to ceph to get it to drop the cap.  I 
did this by rebooting the client, but there may be gentler ways to do it.


* Once you've done this clean up, it should be safe to remove the pool 
from cephfs:

  ceph fs rm_data_pool $fs_name $pool_name

* Once the pool has been detached from cephfs, you can remove it from 
ceph altogether:

  ceph osd pool rm $pool_name $pool_name --yes-i-really-really-mean-it

Hope this helps,

--Mike
[1]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005234.html



On 4/8/21 5:41 PM, 

[ceph-users] Re: Removing secondary data pool from mds

2021-03-12 Thread Michael Thomas

Hi Frank,

I finally got around to removing the data pool.  It went without a hitch.

Ironically, about a week before I got around to removing the pool, I 
suffered the same problem as before, except this time it wasn't a power 
glitch that took out the OSDs, it was my own careless self who decided 
to reboot too many OSD hosts at the same time.  Multiple OSDs went 
down while I was copying a lot of data into ceph.  And as before, this 
left a bunch of corrupted files that caused stat() and unlink() to hang.


I recovered it the same as before, by removing the files from the 
filesystem, then removing the lost objects from the PGs.  Unlike last 
time, I did not try to copy the good files into a new pool. 
Fortunately, this cleanup process worked fine.


For those watching from home, here are the steps I took to clean up:

* Restart all mons (I rebooted all of them, but it may have been enough 
to simply restart the mds).  Reboot the client that is experiencing the 
hang.  This didn't fix the problem with stat() hanging, but did allow 
unlink() (and /usr/bin/unlink) to remove the files without hanging.  I'm 
not sure which of these steps is the necessary one, as I did all of them 
before I was able to proceed.


* Make a list of the affected PGs:
  ceph pg dump_stuck  | grep recovery_unfound > pg.txt

* Make a list of the affected OIDs:
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg 
$pg list_unfound | jq '.objects[].oid.oid' ; done | sed -e 's/"//g' > 
oid.txt


* Convert the OID numbers to inodes:
  cat oid.txt | awk '{print $2}' | sed -e 's/\..*//' | while read oid ; 
do  printf "%d\n" 0x${oid} ; done > inum.txt


* Find the filenames corresponding to the affected inodes (requires the 
/ceph filesystem to be mounted):
  cat inum.txt | while read inum ; do echo -n "${inum} " ; find 
/ceph/frames/O3/raw -inum ${inum} ; done > files.txt


* Call /usr/bin/unlink on each of the files in files.txt.  Don't use 
/usr/bin/rm, as it will hang when calling stat() before unlink().


* Remove the unfound objects:
  cat pg.txt | awk '{print $1}' | while read pg ; do echo $pg ; ceph pg 
$pg mark_unfound_lost delete ; done


* Watch the output of 'ceph -s' to see the cluster become healthy again

--Mike

On 2/12/21 4:55 PM, Frank Schilder wrote:

Hi Michael,

I also think it would be safe to delete. The object count might be an incorrect 
reference count of lost objects that didn't get decremented. This might be 
fixed by running a deep scrub over all PGs in that pool.

I don't know rados well enough to find out where such an object count comes 
from. However, ceph df is known to be imperfect. Maybe its just an accounting 
bug there. I think there were a couple of cases where people deleted all 
objects in a pool and ceph df would still report non-zero usage.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Michael Thomas 
Sent: 12 February 2021 22:35:25
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Removing secondary data pool from mds

Hi Frank,

We're not using snapshots.

I was able to run:
  ceph daemon mds.ceph1 dump cache /tmp/cache.txt

...and scan for the stray object to find the cap id that was accessing
the object.  I matched this with the entity name in:
  ceph daemon mds.ceph1 session ls

...to determine the client host.  The strays went away after I rebooted
the offending client.

With all access to the objects now cleared, I ran:

  ceph pg X.Y mark_unfound_lost delete

...on any remaining rados objects.

At this point (at long last) the pool was able to return to the
'HEALTHY' status.  However, there is one remaining bit that I don't
understand.  'ceph df' returns 355 objects for the pool
(fs.data.archive.frames):

https://pastebin.com/vbZLhQmC

...but 'rados -p fs.data.archive.frames ls --all' returns no objects.
So I'm not sure what these 355 objects were.  Because of that, I haven't
removed the pool from cephfs quite yet, even though I think it would be
safe to do so.

--Mike


On 2/10/21 4:20 PM, Frank Schilder wrote:

Hi Michael,

out of curiosity, did the pool go away or did it put up a fight?

I don't remember exactly, it's a long time ago, but I believe stray objects on 
fs pools come from files that are still in snapshots but were deleted on the fs 
level. Such files are moved to special stray pools until the snapshot containing 
them is deleted as well. Not sure if this applies here though; there might be 
other occasions when objects go to stray.

I updated the case concerning the underlying problem, but not too much progress 
either: https://tracker.ceph.com/issues/46847#change-184710 . I had PG 
degradation even using the recovery technique with before- and after crush 
maps. I was just lucky that I lost only 1 shard per object and ordinary 
recovery could fix it.

Best regards,
=
Frank Schilder

[ceph-users] Re: Removing secondary data pool from mds

2021-02-12 Thread Michael Thomas

Hi Frank,

We're not using snapshots.

I was able to run:
ceph daemon mds.ceph1 dump cache /tmp/cache.txt

...and scan for the stray object to find the cap id that was accessing 
the object.  I matched this with the entity name in:

ceph daemon mds.ceph1 session ls

...to determine the client host.  The strays went away after I rebooted 
the offending client.
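
In hindsight, a gentler option than rebooting the client might have been to 
evict just that session from the MDS; a sketch, with a hypothetical session id:

  ceph tell mds.ceph1 client evict id=9541437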


With all access to the objects now cleared, I ran:

ceph pg X.Y mark_unfound_lost delete

...on any remaining rados objects.

At this point (at long last) the pool was able to return to the 
'HEALTHY' status.  However, there is one remaining bit that I don't 
understand.  'ceph df' returns 355 objects for the pool 
(fs.data.archive.frames):


https://pastebin.com/vbZLhQmC

...but 'rados -p fs.data.archive.frames ls --all' returns no objects. 
So I'm not sure what these 355 objects were.  Because of that, I haven't 
removed the pool from cephfs quite yet, even though I think it would be 
safe to do so.


--Mike


On 2/10/21 4:20 PM, Frank Schilder wrote:

Hi Michael,

out of curiosity, did the pool go away or did it put up a fight?

I don't remember exactly, it's a long time ago, but I believe stray objects on 
fs pools come from files that are still in snapshots but were deleted on the fs 
level. Such files are moved to special stray pools until the snapshot containing 
them is deleted as well. Not sure if this applies here though; there might be 
other occasions when objects go to stray.

I updated the case concerning the underlying problem, but not too much progress 
either: https://tracker.ceph.com/issues/46847#change-184710 . I had PG 
degradation even using the recovery technique with before- and after crush 
maps. I was just lucky that I lost only 1 shard per object and ordinary 
recovery could fix it.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 December 2020 23:12:09
To: ceph-users@ceph.io
Subject: [ceph-users] Removing secondary data pool from mds

I have a cephfs secondary (non-root) data pool with unfound and degraded
objects that I have not been able to recover[1].  I created an
additional data pool and used "setfattr -n ceph.dir.layout.pool' and a
very long rsync to move the files off of the degraded pool and onto the
new pool.  This has completed, and using find + 'getfattr -n
ceph.file.layout.pool', I verified that no files are using the old pool
anymore.  No ceph.dir.layout.pool attributes point to the old pool either.

However, the old pool still reports that there are objects in the old
pool, likely the same ones that were unfound/degraded from before:
https://pastebin.com/qzVA7eZr

Based on a old message from the mailing list[2], I checked the MDS for
stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray
file.txt) and found 36 stray entries in the cache:
https://pastebin.com/MHkpw3DV.  However, I'm not certain how to map
these stray cache objects to clients that may be accessing them.

'rados -p fs.data.archive.frames ls' shows 145 objects.  Looking at the
parent of each object shows 2 strays:

for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p
fs.data.archive.frames getxattr $obj parent | strings ; done


[...]
1020fa1.
1020fa1
stray6
1020fbc.
1020fbc
stray6
[...]

...before getting stuck on one object for over 5 minutes (then I gave up):

105b1af.0083

What can I do to make sure this pool is ready to be safely deleted from
cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?

--Mike

[1]https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF

[2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html




[ceph-users] Removing secondary data pool from mds

2020-12-21 Thread Michael Thomas
I have a cephfs secondary (non-root) data pool with unfound and degraded 
objects that I have not been able to recover[1].  I created an 
additional data pool and used "setfattr -n ceph.dir.layout.pool' and a 
very long rsync to move the files off of the degraded pool and onto the 
new pool.  This has completed, and using find + 'getfattr -n 
ceph.file.layout.pool', I verified that no files are using the old pool 
anymore.  No ceph.dir.layout.pool attributes point to the old pool either.


However, the old pool still reports that there are objects in the old 
pool, likely the same ones that were unfound/degraded from before: 
https://pastebin.com/qzVA7eZr


Based on a old message from the mailing list[2], I checked the MDS for 
stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray 
file.txt) and found 36 stray entries in the cache: 
https://pastebin.com/MHkpw3DV.  However, I'm not certain how to map 
these stray cache objects to clients that may be accessing them.


'rados -p fs.data.archive.frames ls' shows 145 objects.  Looking at the 
parent of each object shows 2 strays:


for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p 
fs.data.archive.frames getxattr $obj parent | strings ; done



[...]
1020fa1.
1020fa1
stray6
1020fbc.
1020fbc
stray6
[...]

...before getting stuck on one object for over 5 minutes (then I gave up):

105b1af.0083

What can I do to make sure this pool is ready to be safely deleted from 
cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?


--Mike

[1]https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF

[2]http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.html


[ceph-users] Re: multiple OSD crash, unfound objects

2020-12-15 Thread Michael Thomas

Hi Frank,

I was able to migrate the data off of the "broken" pool 
(fs.data.archive.frames) and onto the new one 
(fs.data.archive.newframes).  I verified that no useful data is left on 
the "broken" pool:


* 'find + getfattr -n ceph.file.layout.pool' shows no files on the bad pool

* 'find + getfattr -n ceph.dir.layout.pool' shows no future files will 
land on the bad pool


* 'ceph -s' shows some misplaced/degraded/unfound objects on the bad pool:
  data:
pools:   14 pools, 3492 pgs
objects: 111.94M objects, 425 TiB
usage:   587 TiB used, 525 TiB / 1.1 PiB avail
pgs: 68/893408279 objects degraded (0.000%)
 35/893408279 objects misplaced (0.000%)
 24/111943463 objects unfound (0.000%)
 3480 active+clean
 5active+recovery_unfound+degraded+remapped
 4active+clean+scrubbing+deep
 2active+recovery_unfound+undersized+degraded+remapped
 1active+recovery_unfound+degraded

* 'rados ls --pool fs.data.archive.frames' shows these orphaned objects. 
 I extracted the first component of the rados object names (eg 
1020fa1.0030) and ran 'find /ceph -inum XXX' to verify that none 
of these objects maps back to a known file in the cephfs filesystem.


Here are the next steps that I plan to perform:

* 'rados rm --pool fs.data.archive.frames ' on a couple of 
objects to see how ceph handles it.


* 'rados purge fs.data.archive.frames' to purge all objects in the 
"broken" pool


* ceph fs rm_data_pool archive fs.data.archive.frames
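
For the purge step, the CLI wants an explicit confirmation flag; a sketch of 
the exact commands I plan to run:

  rados purge fs.data.archive.frames --yes-i-really-really-mean-it
  ceph fs rm_data_pool archive fs.data.archive.frames
  ceph osd pool rm fs.data.archive.frames fs.data.archive.frames --yes-i-really-really-mean-it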

Is there anything else you think I ought to check before finalizing the 
removal of this broken pool?


--Mike

On 11/22/20 1:59 PM, Frank Schilder wrote:

Dear Michael,

yes, your plan will work if the temporary space requirement can be addressed. 
Good luck!

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Michael Thomas 
Sent: 22 November 2020 20:14:09
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Hi Frank,

  From my understanding, with my current filesystem layout, I should be
able to remove the "broken" pool once the data has been moved off of it.
   This is because the "broken" pool is not the default data pool.
According to the documentation[1]:

fs rm_data_pool <fs_name> <pool>

"This command removes the specified pool from the list of data pools for
the file system. If any files have layouts for the removed data pool,
the file data will become unavailable. The default data pool (when
creating the file system) cannot be removed."

My default data pool (triply replicated on SSD) is still healthy.  The
"broken" pool is EC on HDD, and while it holds a majority of the
filesystem data (~400TB), it is not the root of the filesystem.

My plan would be:

* Create a new data pool matching the "broken" pool
* Create a parallel directory tree matching the directories that are
mapped to the "broken" pool.  eg Broken: /ceph/frames/..., New:
/ceph/frames.new/...
* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree
to map the content to the new data pool
* Use parallel+rsync to copy data from the broken pool to the new pool.
* After each directory gets filled in the new pool, mv/rename the old
and new directories so that users start accessing the data from the new
pool.
* Delete data from the renamed old pool directories as they are
replaced, to keep the OSDs from filling up
* After all data is moved off of the old pool (verified by checking
ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs,
as well as rados ls, ceph df), remove the pool from the fs.

This is effectively the same strategy I did when moving frequently
accessed directories from the EC pool to a replicated SSD pool, except
that in the previous situation I didn't need to remove any pools at the
end.  It's time consuming, because every file on the "broken" pool needs
to be copied, but it minimizes downtime.  Being able to add some
temporary new OSDs to the new pool (but not the "broken" pool) would
reduce some pressure of filling up the OSDs.  If the old and new pools
use the same crush rule, would disabling backfilling+rebalancing keep
the OSDs from being used in the old pool until the old pool is deleted
(with the exception of the occasional new file)?

--Mike
[1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems



On 11/22/20 12:19 PM, Frank Schilder wrote:

Dear Michael,

I was also wondering whether deleting the broken pool could clean up 
everything. The difficulty is, that while migrating a pool to new devices is 
easy via a crush rule change, migrating data between pools is not so easy. In 
particular, if you can't afford downtime.

In case you can afford some downtime, it might be possible to migrate fas

[ceph-users] Re: Whether removing device_health_metrics pool is ok or not

2020-12-03 Thread Michael Thomas

On 12/3/20 6:47 PM, Satoru Takeuchi wrote:

Hi,

Could you tell me whether it's ok to remove device_health_metrics pool
after disabling device monitoring feature?

I don't use device monitoring feature because I capture hardware
information from other way.
However, after disabling this feature, device_health_metrics pool stll
exists.
I don't want to concern HEALTH_WARN caused by problems in PGs of this pool.

As a result of reading the source code of device monitoring module,
it seems to be safe to remove this pool. Is my understanding correct?


On my Octopus cluster I was running with a broken device_health_metrics 
pool (due to PG issues) for over a month with no other obvious ill 
effects.  I finally removed and recreated it to make ceph stop 
complaining about it.


Since then I have not seen any data written to the pool.

I know this doesn't answer your question directly, but in my case 
removing the pool temporarily did not cause any harm.


--Mike


[ceph-users] Prometheus monitoring

2020-11-24 Thread Michael Thomas
I am gathering prometheus metrics from my (unhealthy) Octopus (15.2.4) 
cluster and notice a discrepency (or misunderstanding) with the ceph 
dashboard.


In the dashboard, and with ceph -s, it reports 807 million objects objects:

pgs: 169747/807333195 objects degraded (0.021%)
 78570293/807333195 objects misplaced (9.732%)
 24/101158245 objects unfound (0.000%)

But in the prometheus metrics (and in ceph df), it reports almost a 
factor of 10 fewer objects (dominated by pool 7):


# HELP ceph_pool_objects DF pool objects
# TYPE ceph_pool_objects gauge
ceph_pool_objects{pool_id="4"} 3920.0
ceph_pool_objects{pool_id="5"} 372743.0
ceph_pool_objects{pool_id="7"} 86972464.0
ceph_pool_objects{pool_id="8"} 9287431.0
ceph_pool_objects{pool_id="13"} 8961.0
ceph_pool_objects{pool_id="15"} 0.0
ceph_pool_objects{pool_id="17"} 4.0
ceph_pool_objects{pool_id="18"} 206.0
ceph_pool_objects{pool_id="19"} 8.0
ceph_pool_objects{pool_id="20"} 7.0
ceph_pool_objects{pool_id="21"} 22.0
ceph_pool_objects{pool_id="22"} 203.0
ceph_pool_objects{pool_id="23"} 4415522.0

Why are these two values different?  How can I get the total number of 
objects (807 million) from the prometheus metrics?
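
For reference, the per-pool gauge can at least be aggregated on the Prometheus 
side; a sketch of the query against a hypothetical Prometheus host (note that 
this sums the per-pool counts, i.e. it matches the ~101M denominator on the 
'objects unfound' line rather than the 807 million figure):

  curl -s 'http://prometheus.example.com:9090/api/v1/query?query=sum(ceph_pool_objects)'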


--Mike


[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Michael Thomas

Hi Frank,

From my understanding, with my current filesystem layout, I should be 
able to remove the "broken" pool once the data has been moved off of it. 
 This is because the "broken" pool is not the default data pool. 
According to the documentation[1]:


  fs rm_data_pool <fs_name> <pool>

"This command removes the specified pool from the list of data pools for 
the file system. If any files have layouts for the removed data pool, 
the file data will become unavailable. The default data pool (when 
creating the file system) cannot be removed."


My default data pool (triply replicated on SSD) is still healthy.  The 
"broken" pool is EC on HDD, and while it holds a majority of the 
filesystem data (~400TB), it is not the root of the filesystem.


My plan would be:

* Create a new data pool matching the "broken" pool
* Create a parallel directory tree matching the directories that are 
mapped to the "broken" pool.  eg Broken: /ceph/frames/..., New: 
/ceph/frames.new/...
* Use 'setfattr -n ceph.dir.layout.pool' on this parallel directory tree 
to map the content to the new data pool (see the sketch after this list)

* Use parallel+rsync to copy data from the broken pool to the new pool.
* After each directory gets filled in the new pool, mv/rename the old 
and new directories so that users start accessing the data from the new 
pool.
* Delete data from the renamed old pool directories as they are 
replaced, to keep the OSDs from filling up
* After all data is moved off of the old pool (verified by checking 
ceph.dir.layout.pool and ceph.file.layout.pool on all files in the fs, 
as well as rados ls, ceph df), remove the pool from the fs.
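
For the new-pool and layout steps, a minimal sketch, assuming the new pool has 
already been created and is named fs.data.archive.newframes and the filesystem 
is named archive:

  ceph fs add_data_pool archive fs.data.archive.newframes
  mkdir /ceph/frames.new
  setfattr -n ceph.dir.layout.pool -v fs.data.archive.newframes /ceph/frames.new
  getfattr -n ceph.dir.layout.pool /ceph/frames.new   # confirm new files will land on the new pool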


This is effectively the same strategy I did when moving frequently 
accessed directories from the EC pool to a replicated SSD pool, except 
that in the previous situation I didn't need to remove any pools at the 
end.  It's time consuming, because every file on the "broken" pool needs 
to be copied, but it minimizes downtime.  Being able to add some 
temporary new OSDs to the new pool (but not the "broken" pool) would 
reduce some pressure of filling up the OSDs.  If the old and new pools 
use the same crush rule, would disabling backfilling+rebalancing keep 
the OSDs from being used in the old pool until the old pool is deleted 
(with the exception of the occasional new file)?


--Mike
[1]https://docs.ceph.com/en/latest/cephfs/administration/#file-systems



On 11/22/20 12:19 PM, Frank Schilder wrote:

Dear Michael,

I was also wondering whether deleting the broken pool could clean up 
everything. The difficulty is, that while migrating a pool to new devices is 
easy via a crush rule change, migrating data between pools is not so easy. In 
particular, if you can't afford downtime.

In case you can afford some downtime, it might be possible to migrate fast by 
creating a new pool and use the pool copy command to migrate the data (rados 
cppool ...). Its important that the FS is shutdown (no MDS active) during this 
copy process. After copy, one could either rename the pools to have the copy 
match the fs data pool name, or change the data pool at the top level 
directory. You might need to set some pool meta data by hand, notably, the fs 
tag.

Having said that, I have no idea how a ceph fs reacts if presented with a 
replacement data pool. Although I don't believe that meta data contains the 
pool IDs, I cannot exclude that complication. The copy pool variant should be 
tested with an isolated FS first.

The other option is what you describe, create a new data pool, make the fs root 
placed on this pool and copy every file onto itself. This should also do the 
trick. However, with this method you will not be able to get rid of the broken 
pool. After the copy, you could, however, reduce the number of PGs to below the 
unhealthy one and the broken PG(s) might get deleted cleanly. Then you still 
have a surplus pool, but at least all PGs are clean.

I hope one of these will work. Please post your experience here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 22 November 2020 18:29:16
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/23/20 3:07 AM, Frank Schilder wrote:

Hi Michael.


I still don't see any traffic to the pool, though I'm also unsure how much 
traffic is to be expected.


Probably not much. If ceph df shows that the pool contains some objects, I 
guess that's sorted.

That osdmaptool crashes indicates that your cluster runs with corrupted 
internal data. I tested your crush map and you should get complete PGs for the 
fs data pool. That you don't and that osdmaptool crashes points at a corruption 
of internal data. I'm afraid this is the point where you need support from ceph 
developers and should file a tracker report 
(https://tracker.cep

[ceph-users] Re: multiple OSD crash, unfound objects

2020-11-22 Thread Michael Thomas

On 10/23/20 3:07 AM, Frank Schilder wrote:

Hi Michael.


I still don't see any traffic to the pool, though I'm also unsure how much 
traffic is to be expected.


Probably not much. If ceph df shows that the pool contains some objects, I 
guess that's sorted.

That osdmaptool crashes indicates that your cluster runs with corrupted 
internal data. I tested your crush map and you should get complete PGs for the 
fs data pool. That you don't and that osdmaptool crashes points at a corruption 
of internal data. I'm afraid this is the point where you need support from ceph 
developers and should file a tracker report 
(https://tracker.ceph.com/projects/ceph/issues). A short description of the 
origin of the situation with the osdmaptool output and a reference to this 
thread linked in should be sufficient. Please post a link to the ticket here.


https://tracker.ceph.com/issues/48059


In parallel, you should probably open a new thread focussed on the osd map 
corruption. Maybe there are low-level commands to repair it.


Will do.


You should wait with trying to clean up the unfound objects until this is 
resolved. Not sure about adding further storage either. To me, this sounds 
quite serious.


Another approach that I'm considering is to create a new pool using the 
same set of OSDs, adding it to the set of cephfs data pools, and 
migrating the data from the "broken" pool to the new pool.


I have some additional unused storage that I could add to this new pool, 
if I can figure out the right crush rules to make sure they don't get 
used for the "broken" pool too.


--Mike


[ceph-users] Re: safest way to re-crush a pool

2020-11-10 Thread Michael Thomas
Yes, of course this works.  For some reason I recall having trouble when 
I tried this on my first ceph install.  But I think in that case I 
didn't change the crush tree, but instead I had changed the device 
classes without changing the crush tree.


In any case, the re-crush worked fine.

--Mike

On 11/10/20 4:20 PM, dhils...@performair.com wrote:

Michael;

I run a Nautilus cluster, but all I had to do was change the rule associated 
with the pool, and ceph moved the data.

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: Michael Thomas [mailto:w...@caltech.edu]
Sent: Tuesday, November 10, 2020 1:32 PM
To: ceph-users@ceph.io
Subject: [ceph-users] safest way to re-crush a pool

I'm setting up a radosgw for my ceph Octopus cluster.  As soon as I
started the radosgw service, I noticed that it created a handful of new
pools.  These pools were assigned the 'replicated_data' crush rule
automatically.

I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush
rule spans all device types.  I would like radosgw to use a replicated
SSD pool and avoid the HDDs.  What is the recommended way to change the
crush device class for these pools without risking the loss of any data
in the pools?  I will note that I have not yet written any user data to
the pools.  Everything in them was added by the radosgw process
automatically.

--Mike




[ceph-users] safest way to re-crush a pool

2020-11-10 Thread Michael Thomas
I'm setting up a radosgw for my ceph Octopus cluster.  As soon as I 
started the radosgw service, I noticed that it created a handful of new 
pools.  These pools were assigned the 'replicated_data' crush rule 
automatically.


I have a mixed hdd/ssd/nvme cluster, and this 'replicated_data' crush 
rule spans all device types.  I would like radosgw to use a replicated 
SSD pool and avoid the HDDs.  What is the recommended way to change the 
crush device class for these pools without risking the loss of any data 
in the pools?  I will note that I have not yet written any user data to 
the pools.  Everything in them was added by the radosgw process 
automatically.


--Mike


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas

On 10/22/20 3:22 AM, Frank Schilder wrote:

Could you also execute (and post the output of)

   # osdmaptool osd.map --test-map-pgs-dump --pool 7


osdmaptool dumped core.  Here is stdout:

https://pastebin.com/HPtSqcS1

The PG map for 7.39d matches the pg dump, with the expected difference 
of 2147483647 -> NONE.


...and here is stderr:

https://pastebin.com/CrtwE54r

Regards,

--Mike


with the osd map you pulled out (pool 7 should be the fs data pool)? Please 
check what mapping is reported for PG 7.39d? Just checking if osd map and pg 
dump agree here.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 22 October 2020 09:32:07
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:

Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign, the crush rule is OK. That 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

# osdmaptool osd.map --test-map-pgs-dump --pool 1


https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.


or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there lead to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.


That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.


In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

# ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.


I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


I have OSDs ready to add to the pool, in case you think we should try.


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).


I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike
____

From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: F

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-22 Thread Michael Thomas
Done.  I gave it 4 PGs (I read somewhere that PG counts should be 
a power of 2), and restarted the mgr.  I still don't see any traffic 
to the pool, though I'm also unsure how much traffic is to be expected.
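
For the record, what I ran was roughly the following (rule and mgr names 
taken from the earlier outputs):

   # ceph osd pool create device_health_metrics 4 4 replicated replicated_host_nvme
   # systemctl restart ceph-mgr@ceph3
   # ceph osd pool stats device_health_metrics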


--Mike

On 10/22/20 2:32 AM, Frank Schilder wrote:

Sounds good. Did you re-create the pool again? If not, please do to give the 
devicehealth manager module its storage. In case you can't see any IO, it might 
be necessary to restart the MGR to flush out a stale rados connection. I would 
probably give the pool 10 PGs instead of 1, but that's up to you.

I hope I find time today to look at the incomplete PG.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 21 October 2020 22:58:47
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/21/20 6:47 AM, Frank Schilder wrote:

Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign, the crush rule is OK. That 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

# osdmaptool osd.map --test-map-pgs-dump --pool 1


https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.


or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there lead to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.


That is correct.  The original OSDs 0 and 41 were removed and redeployed
on new disks.


In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

# ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


Unfortunately, in Octopus you can not disable the devicehealth manager
module, and the manager is required for operation.  So I just went ahead
and removed the pool with everything still running.  Fortunately, this
did not appear to cause any problems, and the single unknown PG has
disappeared from the ceph health output.


I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


I have OSDs ready to add to the pool, in case you think we should try.


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).


I found the fs client and rebooted it.  The MDS still reports the slow
OPs, but according to the mds logs the offending ops were established
before the client was rebooted, and the offending client session (now
defunct) has been blacklisted.  I'll check back later to see if the slow
OPS get cleared from 'ceph status'.

Regards,

--Mike
____

From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:

Dear Michael,


Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD 
mapping?


I meant here with crush rule replicated_host_nvme. Sorry, forgot.


Seems to have worked fine:

https://pastebin.com/PFgDE4J1


Yes, the OSD was still out when the previous health report was created.


Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:


from https://pastebin.co

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-21 Thread Michael Thomas

On 10/21/20 6:47 AM, Frank Schilder wrote:

Hi Michael,

some quick thoughts.


That you can create a pool with 1 PG is a good sign, the crush rule is OK. That 
pg query says it doesn't have PG 1.0 points in the right direction. There is an 
inconsistency in the cluster. This is also indicated by the fact that no upmaps 
seem to exist (the clean-up script was empty). With the osd map you extracted, 
you could check what the osd map believes the mapping of the PGs of pool 1 are:

   # osdmaptool osd.map --test-map-pgs-dump --pool 1


https://pastebin.com/seh6gb7R

As I suspected, it thinks that OSDs 0, 41 are the acting set.


or if it also claims the PG does not exist. It looks like something went wrong 
during pool creation and you are not the only one having problems with this 
particular pool: https://www.spinics.net/lists/ceph-users/msg52665.html . 
Sounds a lot like a bug in cephadm.

In principle, it looks like the idea to delete and recreate the health metrics 
pool is a way forward. Please look at the procedure mentioned in the thread 
quoted above. Deletion of the pool there lead to some crashes and some surgery 
on some OSDs was necessary. However, in your case it might just work, because 
you redeployed the OSDs in question already - if I remember correctly.


That is correct.  The original OSDs 0 and 41 were removed and redeployed 
on new disks.



In order to do so cleanly, however, you will probably want to shut down all 
clients accessing this pool. Note that clients accessing the health metrics 
pool are not FS clients, so the mds cannot tell you anything about them. The 
only command that seems to list all clients is

   # ceph daemon mon.MON-ID sessions

that needs to be executed on all mon hosts. On the other hand, you could also 
just go ahead and see if something crashes (an MGR module probably) or disable 
all MGR modules during this recovery attempt. I found some info that cephadm 
creates this pool and starts an MGR module.

If you google "device_health_metric pool" you should find descriptions of 
similar cases. It looks solvable.


Unfortunately, in Octopus you can not disable the devicehealth manager 
module, and the manager is required for operation.  So I just went ahead 
and removed the pool with everything still running.  Fortunately, this 
did not appear to cause any problems, and the single unknown PG has 
disappeared from the ceph health output.
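
For anyone hitting the same thing, removing the pool amounted to something 
like this (pool deletion is normally disabled, so the guard has to be 
lifted and re-set):

   # ceph config set mon mon_allow_pool_delete true
   # ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it
   # ceph config set mon mon_allow_pool_delete false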



I will look at the incomplete PG issue. I hope this is just some PG tuning. At 
least pg query didn't complain :)


I have OSDs ready to add to the pool, in case you think we should try.


The stuck MDS request could be an attempt to access an unfound object. It 
should be possible to locate the fs client and find out what it was trying to 
do. I see this sometimes when people are too impatient. They manage to trigger 
a race condition and an MDS operation gets stuck (there are MDS bugs and in my 
case it was an ls command that got stuck). Usually, evicting the client 
temporarily solves the issue (but tell the user :).


I found the fs client and rebooted it.  The MDS still reports the slow 
OPs, but according to the mds logs the offending ops were established 
before the client was rebooted, and the offending client session (now 
defunct) has been blacklisted.  I'll check back later to see if the slow 
OPS get cleared from 'ceph status'.


Regards,

--Mike
____

From: Michael Thomas 
Sent: 20 October 2020 23:48:36
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

On 10/20/20 1:18 PM, Frank Schilder wrote:

Dear Michael,


Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD 
mapping?


I meant here with crush rule replicated_host_nvme. Sorry, forgot.


Seems to have worked fine:

https://pastebin.com/PFgDE4J1


Yes, the OSD was still out when the previous health report was created.


Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:


from https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
[osd.0,osd.41] have slow ops.


Not sure what to make of that. It looks almost like you have a ghost osd.41.


I think (some of) the slow ops you are seeing are directed to the 
health_metrics pool and can be ignored. If it is too annoying, you could try to 
find out who runs the client with IDs client.7524484 and disable it. Might be 
an MGR module.


I'm also pretty certain that the slow ops are related to the health
metrics pool, which is why I've been ignoring them.

What I'm not sure about is whether re-creating the device_health_metrics
pool will cause any problems in the ceph cluster.


Looking at the data you provided and also some older threads of yours 
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I start 
considering that we are looking at the fall-out of a past admin

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-20 Thread Michael Thomas

On 10/20/20 1:18 PM, Frank Schilder wrote:

Dear Michael,


Can you create a test pool with pg_num=pgp_num=1 and see if the PG gets an OSD 
mapping?


I meant here with crush rule replicated_host_nvme. Sorry, forgot.


Seems to have worked fine:

https://pastebin.com/PFgDE4J1


Yes, the OSD was still out when the previous health report was created.


Hmm, this is odd. If this is correct, then it did report a slow op even though 
it was out of the cluster:


from https://pastebin.com/3G3ij9ui:
[WRN] SLOW_OPS: 2 slow ops, oldest one blocked for 8133 sec, daemons 
[osd.0,osd.41] have slow ops.


Not sure what to make of that. It looks almost like you have a ghost osd.41.


I think (some of) the slow ops you are seeing are directed to the 
health_metrics pool and can be ignored. If it is too annoying, you could try to 
find out who runs the client with IDs client.7524484 and disable it. Might be 
an MGR module.


I'm also pretty certain that the slow ops are related to the health 
metrics pool, which is why I've been ignoring them.


What I'm not sure about is whether re-creating the device_health_metrics 
pool will cause any problems in the ceph cluster.



Looking at the data you provided and also some older threads of yours 
(https://www.mail-archive.com/ceph-users@ceph.io/msg05842.html), I start 
considering that we are looking at the fall-out of a past admin operation. A 
possibility is, that an upmap for PG 1.0 exists that conflicts with the crush 
rule replicated_host_nvme and, hence, prevents the assignment of OSDs to PG 
1.0. For example, the upmap specifies HDDs, but the crush rule required NVMEs. 
This result is an empty set.


So far I've been unable to locate the client with the ID 7524484.  It's 
not showing up in the manager dashboard -> Filesystems page, nor in the 
output of 'ceph tell mds.ceph1 client ls'.


I'm digging through the compressed logs for the past week to see if I can 
find the culprit.



I couldn't really find a simple command to list up-maps. The only 
non-destructive way seems to be to extract the osdmap and create a clean-up 
command file. The cleanup file should contain a command for every PG with an 
upmap. To check this, you can execute (see also 
https://docs.ceph.com/en/latest/man/8/osdmaptool/)

   # ceph osd getmap > osd.map
   # osdmaptool osd.map --upmap-cleanup cleanup.cmd

If you do this, could you please post as usual the contents of cleanup.cmd?


It was empty:

[root@ceph1 ~]# ceph osd getmap > osd.map
got osdmap epoch 52833

[root@ceph1 ~]# osdmaptool osd.map --upmap-cleanup cleanup.cmd
osdmaptool: osdmap file 'osd.map'
writing upmap command output to: cleanup.cmd
checking for upmap cleanups

[root@ceph1 ~]# wc cleanup.cmd
0 0 0 cleanup.cmd
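
As a quick cross-check, any pg_upmap entries would also show up directly in 
the osd dump, without extracting the map first:

   # ceph osd dump | grep upmap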


Also, with the OSD map of your cluster, you can simulate certain admin 
operations and check resulting PG mappings for pools and other things without 
having to touch the cluster; see 
https://docs.ceph.com/en/latest/man/8/osdmaptool/.


To dig a little bit deeper, could you please post as usual the output of:

- ceph pg 1.0 query
- ceph pg 7.39d query


Oddly, it claims that it doesn't have pgid 1.0.

https://pastebin.com/pHh33Dq7


It would also be helpful if you could post the decoded crush map. You can get 
the map as a txt-file as follows:

   # ceph osd getcrushmap -o crush-orig.bin
   # crushtool -d crush-orig.bin -o crush.txt

and post the contents of file crush.txt.


https://pastebin.com/EtEGpWy3


Did the slow MDS request complete by now?


Nope.

--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-19 Thread Michael Thomas

Hi Frank,

I'll give both of these a try and let you know what happens.

Thanks again for your help,

--Mike

On 10/16/20 12:35 PM, Frank Schilder wrote:

Dear Michael,

this is a bit of a nut. I can't see anything obvious. I have two hypotheses 
that you might consider testing.

1) Problem with 1 incomplete PG.

In the shadow hierarchy for your cluster I can see quite a lot of nodes like

 {
 "id": -135,
 "name": "node229~hdd",
 "type_id": 1,
 "type_name": "host",
 "weight": 0,
 "alg": "straw2",
 "hash": "rjenkins1",
 "items": []
 },

I would have expected that hosts without a device of a certain device class are 
*excluded* completely from a tree instead of having weight 0. I'm wondering if 
this could lead to the crush algorithm fail in the way described here: 
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
 . This might be a long shot, but could you export your crush map and play with 
the tunables as described under this link to see if more tries lead to a valid 
mapping? Note that testing this is harmless and does not change anything on the 
cluster.

>

The hypothesis here is that buckets with weight 0 are not excluded from drawing 
a-priori, but a-posteriori. If there are too many draws of an empty bucket, a 
mapping fails. Allowing more tries should then lead to success. We should at 
least rule out this possibility.
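
A rough sketch of such an offline test (the rule id and replica count are 
placeholders, take them from your EC rule and profile; nothing here touches 
the cluster):

   # ceph osd getcrushmap -o crush-orig.bin
   # crushtool -d crush-orig.bin -o crush.txt
   (edit crush.txt and add, e.g., "step set_choose_tries 100" to the EC rule)
   # crushtool -c crush.txt -o crush-new.bin
   # crushtool -i crush-new.bin --test --rule <EC-rule-id> --num-rep 8 --show-bad-mappings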

2) About the incomplete PG.

I'm wondering if the problem is that the pool has exactly 1 PG. I don't have a 
test pool with Nautilus and cannot try this out. Can you create a test pool 
with pg_num=pgp_num=1 and see if the PG gets an OSD mapping? If not, can you 
then increase pg_num and pgp_num to, say, 10 and see if this has any effect?

I'm wondering here if there needs to be a minimum number >1 of PGs in a pool. 
Again, this is more about ruling out a possibility than expecting success. As an 
extension to this test, you could increase pg_num and pgp_num of the pool 
device_health_metrics to see if this has any effect.
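
Something like this would do (the pool name is a placeholder; delete the 
pool again afterwards):

   # ceph osd pool create test-one-pg 1 1 replicated replicated_host_nvme
   # ceph pg ls-by-pool test-one-pg
   # ceph osd pool set test-one-pg pg_num 10
   # ceph osd pool set test-one-pg pgp_num 10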


The crush rules and crush tree look OK to me. I can't really see why the 
missing OSDs are not assigned to the two PGs 1.0 and 7.39d.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 16 October 2020 15:41:29
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,


Please mark OSD 41 as "in" again and wait for some slow ops to show up.


I forgot. "wait for some slow ops to show up" ... and then what?

Could you please go to the host of the affected OSD and look at the output of "ceph daemon 
osd.ID ops" or "ceph daemon osd.ID dump_historic_slow_ops" and check what type of 
operations get stuck? I'm wondering if its administrative, like peering attempts.

Best regards,
=========
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder
Sent: 16 October 2020 15:09:20
To: Michael Thomas; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,

thanks for this initial work. I will need to look through the files you posted 
in more detail. In the meantime:

Please mark OSD 41 as "in" again and wait for some slow ops to show up. As far as I can 
see, marking it "out" might have cleared hanging slow ops (there were 1000 before), but 
they then started piling up again. From the OSD log it looks like an operation that is sent to/from 
PG 1.0, which doesn't respond because it is inactive. Hence, getting PG 1.0 active should resolve 
this issue (later).

Its a bit strange that I see slow ops for OSD 41 in the latest health detail 
(https://pastebin.com/3G3ij9ui). Was the OSD still out when this health report 
was created?

I think we might have misunderstood my question 6. My question was whether or 
not each host bucket corresponds to a physical host and vice versa, that is, 
each physical host has exactly 1 host bucket. I'm asking because it is possible 
to have multiple host buckets assigned to a single physical host and this has 
implications on how to manage things.

Coming back to PG 1.0 (the only PG in pool device_health_metrics as far as I 
can see), the problem is that is has no OSDs assigned. I need to look a bit 
longer at the data you uploaded to find out why. I can't see anything obvious.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 16 October 2020 02:08:01
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: multiple OSD 

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-15 Thread Michael Thomas

On 10/14/20 3:49 PM, Frank Schilder wrote:

Hi Michael,

it doesn't look too bad. All degraded objects are due to the undersized PG. If 
this is an EC pool with m>=2, data is currently not in danger.

I see a few loose ends to pick up, let's hope this is something simple. For any 
of the below, before attempting the next step, please wait until all induced 
recovery IO has completed before continuing.

1) Could you please paste the output of the following commands to pastebin 
(bash syntax):

   ceph osd pool get device_health_metrics all


https://pastebin.com/6D83mjsV


   ceph osd pool get fs.data.archive.frames all


https://pastebin.com/7XAaQcpC


   ceph pg dump |& grep -i -e PG_STAT -e "^7.39d"


https://pastebin.com/tBLaq63Q


   ceph osd crush rule ls


https://pastebin.com/6f5B778G


   ceph osd erasure-code-profile ls


https://pastebin.com/uhAaMH1c


   ceph osd crush dump # this is a big one, please be careful with copy-paste 
(see point 3 below)


https://pastebin.com/u92D23jV


2) I don't see any IO reported (neither user nor recovery). Could you please 
confirm that the command outputs were taken during a zero-IO period?


That's correct, there was no activity at this time.  Access to the 
cephfs filesystem is very bursty, varying from completely idle to 
multiple GB/s (read).



3) Something is wrong with osd.41. Can you check its health status with smartctl? If it 
is reported healthy, give it one more clean restart. If the slow ops do not disappear, it 
could be a disk fail that is not detected by health monitoring. You could set it to 
"out" and see if the cluster recovers to a healthy state (modulo the currently 
degraded objects) with no slow ops. If so, I would replace the disk.


smartctl reports no problems.

osd.41 (and osd.0) was one of the original OSDs used for the 
device_health_metrics pool.  Early on, before I knew better, I had 
removed this OSD (and osd.0) from the cluster, and the OSD ids got 
recycled when new disks were later added.  This is when the slow ops on 
osd.0 and osd.41 started getting reported.  On advice from another user 
on ceph-users, I updated my crush map to remap the device_health_metrics 
pool to a different set of OSDs (and the slow ops persisted).


osd.0 usually also shows slow ops.  I was a little surprised that it 
didn't when I took this snapshot, but now it does.


I have now run 'ceph osd out 41', and the recovery I/O has finished. 
With the exception of one less OSD marked in, the output of 'ceph 
status' looks the same.


The last few lines of the osd.41 logfile are here:

https://pastebin.com/k06aArW4

How long does it take for ceph to clear the slow ops status?


4) In the output of "df tree" node141 shows up twice. Could you confirm that this is a 
copy-paste error or is this node indeed twice in the output? This is easiest to see in the pastebin 
when switching to "raw" view.


This was a copy/paste error.


5) The crush tree contains an empty host bucket (node308). Please delete this 
host bucket (ceph osd crush rm node308) for now and let me know if this caused 
any data movements (recovery IO).


This did not cause any data movement, according to 'ceph status'.


6) The crush tree looks a bit exotic. Do the nodes with a single OSD correspond 
to a physical host with 1 OSD disk? If not, could you please state how the host 
buckets are mapped onto physical hosts?


Each OSD corresponds to a single physical disk.  Hosts may have 1, 2 or 
3 OSDs of varying types (HDD, SSD, or SSD+NVME).  There are a few 
different crush types used in the cluster:


3 x replicated nvme - used for cephfs metadata
3 x replicated SSD - used for ovirt block storage
EC HDD - used for the bulk of the experiment data
EC SSD - used for frequently accessed experiment data


7) In case there was a change to the health status, could you please include an updated 
"ceph health detail"?


Looks like the only difference is a new slow MDS op, and one PG that 
hasn't been deep scrubbed in the last week:


https://pastebin.com/3G3ij9ui

--Mike


I don't expect to get the incomplete PG resolved with the above, but it will 
move some issues out of the way before proceeding.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 14 October 2020 20:52:10
To: Andreas John; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Hello,

The original cause of the OSD instability has already been fixed.  It
was due to user jobs (via condor) consuming too much memory and causing
the machine to swap.  The OSDs didn't actually crash, but weren't
responding in time and were being flagged as down.

In most cases, the problematic OSD servers were also not responding on
the console and had to be physically power cycled to recover.

Since adding additional memory limits to user jobs, we have only had 1
or

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-14 Thread Michael Thomas
Hello,

The original cause of the OSD instability has already been fixed.  It
was due to user jobs (via condor) consuming too much memory and causing
the machine to swap.  The OSDs didn't actually crash, but weren't
responding in time and were being flagged as down.

In most cases, the problematic OSD servers were also not responding on
the console and had to be physically power cycled to recover.

Since adding additional memory limits to user jobs, we have only had 1
or 2 unstable OSDs that were fixed by killing the remaining rogue user jobs.

Regards,

--Mike

On 10/10/20 9:22 AM, Andreas John wrote:
> Hello Mike,
> 
> do your OSDs go down from time to time? I once had an issue with
> unrecoverable objects, because I had only n+1 (size 2) redundancy and
> ceph wasn't able to decide what the correct copy of the object was. In my
> case there were half-deleted snapshots in one of the copies. I used
> ceph-objectstore-tool to remove the "wrong" part. Did you check your OSD
> logs? Do the OSDs go down with an obscure stacktrace (and maybe they are
> restarted by systemd ...)
> 
> rgds,
> 
> j.
> 
> 
> 
> On 09.10.20 22:33, Michael Thomas wrote:
>> Hi Frank,
>>
>> That was a good tip.  I was able to move the broken files out of the
>> way and restore them for users.  However, after 2 weeks I'm still left
>> with unfound objects.  Even more annoying, I now have 82k objects
>> degraded (up from 74), which hasn't changed in over a week.
>>
>> I'm ready to claim that the auto-repair capabilities of ceph are not
>> able to fix my particular issues, and will have to continue to
>> investigate alternate ways to clean this up, including a pg
>> export/import (as you suggested) and perhaps a mds backward scrub
>> (after testing in a junk pool first).
>>
>> I have other tasks I need to perform on the filesystem (removing OSDs,
>> adding new OSDs, increasing PG count), but I feel like I need to
>> address these degraded/lost objects before risking any more damage.
>>
>> One particular PG is in a curious state:
>>
>> 7.39d    82163 82165 246734    1  344060777807    0
>>   0   2139  active+recovery_unfound+undersized+degraded+remapped 23m 
>> 50755'112549   50766:960500   [116,72,122,48,45,131,73,81]p116
>>   [71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500
>> 2020-08-07T11:01:45.657036-0500
>>
>> Note the 'NONE' in the acting set.  I do not know which OSD this may
>> have been, nor how to find out.  I suspect (without evidence) that
>> this is part of the cause of no action on the degraded and misplaced
>> objects.
>>
>> --Mike
>>
>> On 9/18/20 11:26 AM, Frank Schilder wrote:
>>> Dear Michael,
>>>
>>> maybe there is a way to restore access for users and solve the issues
>>> later. Someone else with a lost/unfound object was able to move the
>>> affected file (or directory containing the file) to a separate
>>> location and restore the now missing data from backup. This will
>>> "park" the problem of cluster health for later fixing.
>>>
>>> Best regards,
>>> =
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> 
>>> From: Frank Schilder 
>>> Sent: 18 September 2020 15:38:51
>>> To: Michael Thomas; ceph-users@ceph.io
>>> Subject: [ceph-users] Re: multiple OSD crash, unfound objects
>>>
>>> Dear Michael,
>>>
>>>> I disagree with the statement that trying to recover health by deleting
>>>> data is a contradiction.  In some cases (such as mine), the data in
>>>> ceph
>>>> is backed up in another location (eg tape library).  Restoring a few
>>>> files from tape is a simple and cheap operation that takes a minute, at
>>>> most.
>>>
>>> I would agree with that if the data was deleted using the appropriate
>>> high-level operation. Deleting an unfound object is like marking a
>>> sector on a disk as bad with smartctl. How should the file system
>>> react to that? Purging an OSD is like removing a disk from a raid
>>> set. Such operations increase inconsistencies/degradation rather than
>>> resolving them. Cleaning this up also requires to execute other
>>> operations to remove all references to the object and, finally, the
>>> file inode itself.
>>>
>>> The ls on a dir with corrupted file(s) hangs if ls calls stat on
>>> every file. For example, when coloring is enabled, ls will

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-14 Thread Michael Thomas
Hi Frank,

Thanks for taking the time to help out with this.  Here is the output
you requested:

ceph status: https://pastebin.com/v8cJJvjm

ceph health detail: https://pastebin.com/w9wWLGiv

ceph osd pool stats: https://pastebin.com/dcJTsXE1

ceph osd df tree: https://pastebin.com/LaZcBemC

I removed one object following a troubleshooting guide, and removed one
OSD (weeks ago) as part of a server upgrade.  I have not removed any PGs.

A couple of notes about some of the output:

The '1 pg inactive' is from the 'device_health_metrics' pool.  It has
been broken since the very beginning of my ceph deployment.  Fixing this
would be nice, but not the focus of my current issues.

The 8 PGs that have not been deep-scrubbed are the same ones that are
marked "recovery_unfound".  I suspect that ceph won't deep scrub these
until they are active+clean.

I have restarted (systemctl restart ceph-osd@XXX) and rebooted (init 6)
all OSDs (one at a time) for PG 7.39d, with no change in the number of
degraded objects.

The only difference in the 'ceph status' output before and after the
object removal is the number of degraded objects (went down by 1) and
degraded PGs (went down by 1).

Regards,

--Mike

On 10/10/20 5:14 AM, Frank Schilder wrote:
> Dear Michael,
> 
>> I have other tasks I need to perform on the filesystem (removing OSDs,
>> adding new OSDs, increasing PG count), but I feel like I need to address
>> these degraded/lost objects before risking any more damage.
> 
> I would probably not attempt any such maintenance before there was a period 
> of at least 1 day with HEALTH_OK. The reason is that certain historical 
> information is not trimmed unless the cluster is in HEALTH_OK. The more such 
> information is accumulated, the more risk one runs that a cluster becomes 
> unstable.
> 
> Can you post the output of ceph status, ceph health detail, ceph osd pool 
> stats and ceph osd df tree (on pastebin.com)? If I remember correctly, you 
> removed OSDs/PGs following a trouble-shooting guide? I suspect that the 
> removal has left something in an inconsistent state that requires manual 
> clean up for recovery to proceed.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Michael Thomas 
> Sent: 09 October 2020 22:33:46
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: multiple OSD crash, unfound objects
> 
> Hi Frank,
> 
> That was a good tip.  I was able to move the broken files out of the way
> and restore them for users.  However, after 2 weeks I'm still left with
> unfound objects.  Even more annoying, I now have 82k objects degraded
> (up from 74), which hasn't changed in over a week.
> 
> I'm ready to claim that the auto-repair capabilities of ceph are not
> able to fix my particular issues, and will have to continue to
> investigate alternate ways to clean this up, including a pg
> export/import (as you suggested) and perhaps a mds backward scrub (after
> testing in a junk pool first).
> 
> I have other tasks I need to perform on the filesystem (removing OSDs,
> adding new OSDs, increasing PG count), but I feel like I need to address
> these degraded/lost objects before risking any more damage.
> 
> One particular PG is in a curious state:
> 
> 7.39d82163 82165 2467341  3440607778070
> 
>0   2139  active+recovery_unfound+undersized+degraded+remapped
> 23m  50755'112549   50766:960500   [116,72,122,48,45,131,73,81]p116
>[71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500
> 2020-08-07T11:01:45.657036-0500
> 
> Note the 'NONE' in the acting set.  I do not know which OSD this may
> have been, nor how to find out.  I suspect (without evidence) that this
> is part of the cause of no action on the degraded and misplaced objects.
> 
> --Mike
> 
> On 9/18/20 11:26 AM, Frank Schilder wrote:
>> Dear Michael,
>>
>> maybe there is a way to restore access for users and solve the issues later. 
>> Someone else with a lost/unfound object was able to move the affected file 
>> (or directory containing the file) to a separate location and restore the 
>> now missing data from backup. This will "park" the problem of cluster health 
>> for later fixing.
>>
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Frank Schilder 
>> Sent: 18 September 2020 15:38:51
>> To: Michael Thomas; ceph-users@ceph.io
>> Subject: [ceph-users] Re: multiple OSD crash, unfound objects
>>
>> Dear Michael,
>>
>>> I 

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-09 Thread Michael Thomas

Hi Frank,

That was a good tip.  I was able to move the broken files out of the way 
and restore them for users.  However, after 2 weeks I'm still left with 
unfound objects.  Even more annoying, I now have 82k objects degraded 
(up from 74), which hasn't changed in over a week.


I'm ready to claim that the auto-repair capabilities of ceph are not 
able to fix my particular issues, and will have to continue to 
investigate alternate ways to clean this up, including a pg 
export/import (as you suggested) and perhaps a mds backward scrub (after 
testing in a junk pool first).


I have other tasks I need to perform on the filesystem (removing OSDs, 
adding new OSDs, increasing PG count), but I feel like I need to address 
these degraded/lost objects before risking any more damage.


One particular PG is in a curious state:

7.39d82163 82165 2467341  3440607778070 

  0   2139  active+recovery_unfound+undersized+degraded+remapped 
23m  50755'112549   50766:960500   [116,72,122,48,45,131,73,81]p116 
  [71,109,99,48,45,90,73,NONE]p71  2020-08-13T23:02:34.325887-0500 
2020-08-07T11:01:45.657036-0500


Note the 'NONE' in the acting set.  I do not know which OSD this may 
have been, nor how to find out.  I suspect (without evidence) that this 
is part of the cause of no action on the degraded and misplaced objects.


--Mike

On 9/18/20 11:26 AM, Frank Schilder wrote:

Dear Michael,

maybe there is a way to restore access for users and solve the issues later. Someone else 
with a lost/unfound object was able to move the affected file (or directory containing 
the file) to a separate location and restore the now missing data from backup. This will 
"park" the problem of cluster health for later fixing.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 18 September 2020 15:38:51
To: Michael Thomas; ceph-users@ceph.io
Subject: [ceph-users] Re: multiple OSD crash, unfound objects

Dear Michael,


I disagree with the statement that trying to recover health by deleting
data is a contradiction.  In some cases (such as mine), the data in ceph
is backed up in another location (eg tape library).  Restoring a few
files from tape is a simple and cheap operation that takes a minute, at
most.


I would agree with that if the data was deleted using the appropriate 
high-level operation. Deleting an unfound object is like marking a sector on a 
disk as bad with smartctl. How should the file system react to that? Purging an 
OSD is like removing a disk from a raid set. Such operations increase 
inconsistencies/degradation rather than resolving them. Cleaning this up also 
requires to execute other operations to remove all references to the object 
and, finally, the file inode itself.

The ls on a dir with corrupted file(s) hangs if ls calls stat on every file. For example, when 
coloring is enabled, ls will stat every file in the dir to be able to choose the color according to 
permissions. If one then disables coloring, a plain "ls" will return all names while an 
"ls -l" will hang due to stat calls.

An "rm" or "rm -f" should succeed if the folder permissions allow that. It should not stat the file 
itself, so it sounds a bit odd that its hanging. I guess in some situations it does, like "rm -i", which will 
ask before removing read-only files. How does "unlink FILE" behave?
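
Concretely, the difference should look something like this, assuming the fs 
is mounted at /ceph as in your find command (whether unlink hangs is exactly 
what I'd like to know):

   # ls --color=never /ceph/frames/postO3/hoft      (no per-file stat, should return)
   # ls -l /ceph/frames/postO3/hoft                 (stats every entry, may hang)
   # unlink /ceph/frames/postO3/hoft/FILE-IN-QUESTION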

Most admin commands on ceph are asynchronous. A command like "pg repair" or "osd scrub" 
only schedules an operation. The command "ceph pg 7.1fb mark_unfound_lost delete" does probably 
just the same. Unfortunately, I don't know how to check that a scheduled operation has 
started/completed/succeeded/failed. I asked this in an earlier thread (about PG repair) and didn't get an 
answer. On our cluster, the actual repair happened ca. 6-12 hours after scheduling (on a healthy cluster!). I 
would conclude that (some of) these operations have very low priority and will not start at least as long as 
there is recovery going on. One might want to consider the possibility that some of the scheduled commands 
have not been executed yet.

The output of "pg query" contains the IDs of the missing objects (in mimic) and each of these 
objects is on one of the peer OSDs of the PG (I think object here refers to shard or copy). It should be 
possible to find the corresponding OSD (or at least obtain confirmation that the object is really gone) and 
move the object to a place where it is expected to be found. This can probably be achieved with "PG 
export" and "PG import". I don't know of any other way(s).

I guess, in the current situation, sitting it out a bit longer might be a good 
strategy. I don't know how many asynchronous commands you executed and giving 
the cluster time to complete the

[ceph-users] Re: multiple OSD crash, unfound objects

2020-09-17 Thread Michael Thomas

Hi Frank,

Yes, it does sounds similar to your ticket.

I've tried a few things to restore the failed files:

* Locate a missing object with 'ceph pg $pgid list_unfound'

* Convert the hex oid to a decimal inode number

* Identify the affected file with 'find /ceph -inum $inode'
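
In one-liner form, the lookup went along these lines (the hex value below is 
just an illustration):

   # ceph pg 7.39d list_unfound
   (object names look like <hex-inode>.<stripe-index>; take the part before the dot)
   # ino=$(printf '%d' 0x10005a1b2c3)
   # find /ceph -inum "$ino"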

At this point, I know which file is affected by the missing object.  As 
expected, attempts to read the file simply hang.  Unexpectedly, attempts 
to 'ls' the file or its containing directory also hang.  I presume from 
this that the stat() system call needs some information that is 
contained in the missing object, and is waiting for the object to become 
available.


Next I tried to remove the affected object with:

* ceph pg $pgid mark_unfound_lost delete

Now 'ceph status' shows one fewer missing objects, but attempts to 'ls' 
or 'rm' the affected file continue to hang.


Finally, I ran a scrub over the part of the filesystem containing the 
affected file:


ceph tell mds.ceph4 scrub start /frames/postO3/hoft recursive

Nothing seemed to come up during the scrub:

2020-09-17T14:56:15.208-0500 7f39bca24700  1 mds.ceph4 asok_command: 
scrub status {prefix=scrub status} (starting...)
2020-09-17T14:58:58.013-0500 7f39bca24700  1 mds.ceph4 asok_command: 
scrub start {path=/frames/postO3/hoft,prefix=scrub 
start,scrubops=[recursive]} (starting...)
2020-09-17T14:58:58.013-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub summary: active
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub queued for path: /frames/postO3/hoft
2020-09-17T14:58:58.014-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub summary: active [paths:/frames/postO3/hoft]
2020-09-17T14:59:02.535-0500 7f39bca24700  1 mds.ceph4 asok_command: 
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:00:12.520-0500 7f39bca24700  1 mds.ceph4 asok_command: 
scrub status {prefix=scrub status} (starting...)
2020-09-17T15:02:32.944-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub summary: idle
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub complete with tag '1405e5c7-3ecf-4754-918e-129e9d101f7a'
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub completed for path: /frames/postO3/hoft
2020-09-17T15:02:32.945-0500 7f39b5215700  0 log_channel(cluster) log 
[INF] : scrub summary: idle



After the scrub completed, access to the file (ls or rm) continue to 
hang.  The MDS reports slow reads:


2020-09-17T15:11:05.654-0500 7f39b9a1e700  0 log_channel(cluster) log 
[WRN] : slow request 481.867381 seconds old, received at 
2020-09-17T15:03:03.788058-0500: client_request(client.451432:11309 
getattr pAsLsXsFs #0x105b1c0 2020-09-17T15:03:03.787602-0500 
caller_uid=0, caller_gid=0{}) currently dispatched


Does anyone have any suggestions on how else to clean up from a 
permanently lost object?


--Mike

On 9/16/20 2:03 AM, Frank Schilder wrote:

Sounds similar to this one: https://tracker.ceph.com/issues/46847

If you have or can reconstruct the crush map from before adding the OSDs, you 
might be able to discover everything with the temporary reversal of the crush 
map method.

Not sure if there is another method, i never got a reply to my question in the 
tracker.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Michael Thomas 
Sent: 16 September 2020 01:27:19
To: ceph-users@ceph.io
Subject: [ceph-users] multiple OSD crash, unfound objects

Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time.  The OSDs are part of
an erasure coded pool.  At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster.  After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects.  There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.

The pool with the missing objects is a cephfs pool.  The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).

I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects.
   I found a number of OSDs that are still 'not queried'.  Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.

I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed.  I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).

The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations."  I'm no

[ceph-users] multiple OSD crash, unfound objects

2020-09-15 Thread Michael Thomas
Over the weekend I had multiple OSD servers in my Octopus cluster 
(15.2.4) crash and reboot at nearly the same time.  The OSDs are part of 
an erasure coded pool.  At the time the cluster had been busy with a 
long-running (~week) remapping of a large number of PGs after I 
incrementally added more OSDs to the cluster.  After bringing all of the 
OSDs back up, I have 25 unfound objects and 75 degraded objects.  There 
are other problems reported, but I'm primarily concerned with these 
unfound/degraded objects.


The pool with the missing objects is a cephfs pool.  The files stored in 
the pool are backed up on tape, so I can easily restore individual files 
as needed (though I would not want to restore the entire filesystem).


I tried following the guide at 
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg/#unfound-objects. 
 I found a number of OSDs that are still 'not queried'.  Restarting a 
sampling of these OSDs changed the state from 'not queried' to 'already 
probed', but that did not recover any of the unfound or degraded objects.
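
For reference, the per-OSD probe status comes from the PG query output; a 
quick way to pull it out for one of the affected PGs (assuming jq is 
available):

   # ceph pg 7.39d query | jq '.recovery_state[] | select(.might_have_unfound) | .might_have_unfound'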


I have also tried 'ceph pg deep-scrub' on the affected PGs, but never 
saw them get scrubbed.  I also tried doing a 'ceph pg force-recovery' on 
the affected PGs, but only one seems to have been tagged accordingly 
(see ceph -s output below).


The guide also says "Sometimes it simply takes some time for the cluster 
to query possible locations."  I'm not sure how long "some time" might 
take, but it hasn't changed after several hours.


My questions are:

* Is there a way to force the cluster to query the possible locations 
sooner?


* Is it possible to identify the files in cephfs that are affected, so 
that I could delete only the affected files and restore them from backup 
tapes?


--Mike

ceph -s:

  cluster:
id: 066f558c-6789-4a93-aaf1-5af1ba01a3ad
health: HEALTH_ERR
1 clients failing to respond to capability release
1 MDSs report slow requests
25/78520351 objects unfound (0.000%)
2 nearfull osd(s)
Reduced data availability: 1 pg inactive
Possible data damage: 9 pgs recovery_unfound
Degraded data redundancy: 75/626645098 objects degraded 
(0.000%), 9 pgs degraded

1013 pgs not deep-scrubbed in time
1013 pgs not scrubbed in time
2 pool(s) nearfull
1 daemons have recently crashed
4 slow ops, oldest one blocked for 77939 sec, daemons 
[osd.0,osd.41] have slow ops.


  services:
mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
mds: archive:1 {0=ceph4=up:active} 3 up:standby
osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

  task status:
scrub status:
mds.ceph4: idle

  data:
pools:   9 pools, 2433 pgs
objects: 78.52M objects, 298 TiB
usage:   412 TiB used, 545 TiB / 956 TiB avail
pgs: 0.041% pgs unknown
 75/626645098 objects degraded (0.000%)
 135224/626645098 objects misplaced (0.022%)
 25/78520351 objects unfound (0.000%)
 2421 active+clean
 5active+recovery_unfound+degraded
 3active+recovery_unfound+degraded+remapped
 2active+clean+scrubbing+deep
 1unknown
 1active+forced_recovery+recovery_unfound+degraded

  progress:
PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
  []
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg stuck in unknown state

2020-08-11 Thread Michael Thomas

On 8/11/20 2:52 AM, Wido den Hollander wrote:



On 11/08/2020 00:40, Michael Thomas wrote:
On my relatively new Octopus cluster, I have one PG that has been 
perpetually stuck in the 'unknown' state.  It appears to belong to the 
device_health_metrics pool, which was created automatically by the mgr 
daemon(?).


The OSDs that the PG maps to are all online and serving other PGs.  
But when I list the PGs that belong to the OSDs from 'ceph pg map', 
the offending PG is not listed.


# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0    0   0 0  0    0 
0    0   0  0 0   unknown 
2020-08-08T09:30:33.251653-0500 0'0 0:0 []  
-1 []  -1  0'0 
2020-08-08T09:30:33.251653-0500  0'0 
2020-08-08T09:30:33.251653-0500  0


# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
   nothing is going on

# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]

What can be done to fix the PG?  I tried doing a 'ceph pg repair 1.0', 
but that didn't seem to do anything.


Is it safe to try to update the crush_rule for this pool so that the 
PG gets mapped to a fresh set of OSDs?


Yes, it would be. But still, it's weird. Mainly as the acting set is so 
different from the up-set.


You have different CRUSH rules I think?

Marking those OSDs down might work, but otherwise change the crush_rule 
and see how that goes.


Yes, I do have different crush rules to help map certain types of data 
to different classes of hardware (EC HDDs, replicated SSDs, replicated 
nvme).  The default crush rule for the device_health_metrics pool was to 
use replication across any storage device.  I changed it to use the 
replicated nvme crush rule, and now the map looks different:


# ceph pg map 1.0
osdmap e7256 pg 1.0 (1.0) -> up [24,22,12] acting [41,0]

However, the acting set of OSDs has not changed.
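
For reference, the rule change itself amounts to a single command:

   # ceph osd pool set device_health_metrics crush_rule replicated_host_nvme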

--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg stuck in unknown state

2020-08-10 Thread Michael Thomas
On my relatively new Octopus cluster, I have one PG that has been 
perpetually stuck in the 'unknown' state.  It appears to belong to the 
device_health_metrics pool, which was created automatically by the mgr 
daemon(?).


The OSDs that the PG maps to are all online and serving other PGs.  But 
when I list the PGs that belong to the OSDs from 'ceph pg map', the 
offending PG is not listed.


# ceph pg dump pgs | grep ^1.0
dumped pgs
1.00   0 0  00 
 00   0  0 0   unknown 
2020-08-08T09:30:33.251653-0500 0'0 0:0 
   []  -1 []  -1 
 0'0  2020-08-08T09:30:33.251653-0500  0'0 
2020-08-08T09:30:33.251653-0500  0


# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
  nothing is going on

# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]

What can be done to fix the PG?  I tried doing a 'ceph pg repair 1.0', 
but that didn't seem to do anything.


Is it safe to try to update the crush_rule for this pool so that the PG 
gets mapped to a fresh set of OSDs?


--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io