Re: [ceph-users] PG Balancer Upmap mode not working
> How is that possible? I don't know how much more proof I need to present
> that there's a bug.

I also think there's a bug in the balancer plugin, as it seems to have stopped working for me too. I'm on Luminous though, so I'm not sure whether it's the same bug. The balancer used to work flawlessly, giving me a very even distribution with about 1% variance. Some time between 12.2.7 (maybe) and 12.2.12 it stopped working. Here's a small selection of my osd's showing a 47%-62% spread:

210 hdd 7.27739 1.0 7.28TiB 3.43TiB 3.84TiB 47.18 0.74 104
211 hdd 7.27739 1.0 7.28TiB 3.96TiB 3.32TiB 54.39 0.85 118
212 hdd 7.27739 1.0 7.28TiB 4.50TiB 2.77TiB 61.88 0.97 136
213 hdd 7.27739 1.0 7.28TiB 4.06TiB 3.21TiB 55.85 0.87 124
214 hdd 7.27739 1.0 7.28TiB 4.30TiB 2.98TiB 59.05 0.92 130
215 hdd 7.27739 1.0 7.28TiB 4.41TiB 2.87TiB 60.54 0.95 134
TOTAL 1.26PiB 825TiB 463TiB 64.01
MIN/MAX VAR: 0.74/1.10 STDDEV: 3.22

$ sudo ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}

I'm happy to add debugging data or test things to get this bug fixed.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
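For what it's worth, the spread in the excerpt above can be quantified directly from the `ceph osd df` columns. A small sketch (Python, purely for illustration; the utilization numbers are copied from the six osd's above, and the VAR column is simply each OSD's utilization divided by the cluster-wide average):

```python
# %USE for osd.210..osd.215, copied from the `ceph osd df` excerpt above.
util = [47.18, 54.39, 61.88, 55.85, 59.05, 60.54]
cluster_mean = 64.01  # cluster-wide %RAW USED, from the TOTAL line

# VAR in `ceph osd df` is each OSD's utilization relative to the cluster mean.
var = [round(u / cluster_mean, 2) for u in util]
print(var)  # osd.210's 0.74 and osd.212's 0.97 match the excerpt

# Standard deviation of just these six OSDs (the cluster-wide STDDEV of 3.22
# covers all OSDs, so it differs from this subset).
mean = sum(util) / len(util)
stddev = (sum((u - mean) ** 2 for u in util) / len(util)) ** 0.5
print(round(stddev, 2))
```

With a working upmap balancer you'd expect the VAR column to sit very close to 1.00 on every OSD, so a 0.74-0.97 range on one host is a reasonable way to demonstrate the regression.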
[ceph-users] Upgrade Documentation: Wait for recovery
Hi Everyone,

Recently we moved a bunch of our servers from one rack to another. In the late stages of this we hit a point where some requests were blocked due to one pg being in the "peered" state. This was unexpected to us, but after discussing it with Wido we understand why it happened. However, it's brought up another point: we believed we were following the instructions in the upgrade documentation, and we've done our upgrades this way in the past without hitting this "peered" state.

The documentation says this:
"Ensure each upgraded Ceph OSD Daemon has rejoined the cluster"

We read this as meaning you can go through and restart all the osd's one by one in the whole cluster without waiting for recovery to happen. Whereas it seems it should be more like: "Ensure each upgraded Ceph OSD Daemon has rejoined the cluster" and "ensure recovery has completed before moving on to the next {failure domain}", where failure domain is host, rack, etc. depending on what is in your crush map.

Thoughts? Should the documentation be clearer on this, to help people such as myself avoid making this mistake?

Rich
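The stricter procedure suggested above (restart one failure domain, then wait for recovery before touching the next) can be sketched roughly as below. This is only an illustration, not anyone's official tooling: `restart_host` and `get_health` are placeholders you'd implement yourself, e.g. by shelling out to your service manager and to `ceph health`.

```python
import time

def wait_for_health_ok(get_health, poll_seconds=10):
    # Block until the cluster reports HEALTH_OK. get_health is a callable
    # returning the current health string, e.g.
    # lambda: subprocess.check_output(["ceph", "health"]).decode()
    while not get_health().startswith("HEALTH_OK"):
        time.sleep(poll_seconds)

def rolling_restart(failure_domains, restart_host, get_health, poll_seconds=10):
    # Restart one failure domain (host, rack, ...) at a time, and wait for
    # recovery to complete before moving on to the next one.
    for domain in failure_domains:
        for host in domain:
            restart_host(host)
        wait_for_health_ok(get_health, poll_seconds)
```

The key point is simply that the health check sits between failure domains, which is exactly the step the current documentation wording lets you skip.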
Re: [ceph-users] cephfs compression?
Oh, also: because the compression happens at the osd level you don't see it in ceph df. You just see that your RAW usage is not increasing as much as you'd expect. E.g.

$ sudo ceph df
GLOBAL:
    SIZE  AVAIL  RAW USED  %RAW USED
    785T  300T   485T      61.73
POOLS:
    NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
    cephfs-metadata  11  185M  0      68692G     178
    cephfs-data      12  408T  75.26  134T       132641159

You can see that we've used 408TB in the pool but only 485TB RAW, rather than the ~600TB RAW that I'd expect for my k=4, m=2 pool settings.

On Fri, 29 Jun 2018 at 17:08, Richard Bade wrote:
>
> I'm using compression on a cephfs-data pool in luminous. I didn't do
> anything special:
>
> $ sudo ceph osd pool get cephfs-data all | grep ^compression
> compression_mode: aggressive
> compression_algorithm: zlib
>
> You can check how much compression you're getting on the osd's:
> $ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd perf dump | grep 'bluestore_compressed'; done
> osd.0
> "bluestore_compressed": 686487948225,
> "bluestore_compressed_allocated": 788659830784,
> "bluestore_compressed_original": 1660064620544,
>
> osd.11
> "bluestore_compressed": 700999601387,
> "bluestore_compressed_allocated": 808854355968,
> "bluestore_compressed_original": 1752045551616,
>
> I can't say for mimic, but definitely for luminous v12.2.5 compression
> is working well with mostly default options.
>
> -Rich
>
> > For RGW, compression works very well. We use rgw to store crash dumps; in
> > most cases the compression ratio is about 2.0 ~ 4.0.
> > I tried to enable compression for the cephfs data pool:
> > # ceph osd pool get cephfs_data all | grep ^compression
> > compression_mode: force
> > compression_algorithm: lz4
> > compression_required_ratio: 0.95
> > compression_max_blob_size: 4194304
> > compression_min_blob_size: 4096
> > (we built ceph packages and enabled lz4.)
> > It doesn't seem to work. I copied an 8.7GB folder to cephfs, and ceph df
> > says it used 8.7GB:
> > root@ceph-admin:~# ceph df
> > GLOBAL:
> >     SIZE    AVAIL   RAW USED  %RAW USED
> >     16 TiB  16 TiB  111 GiB   0.69
> > POOLS:
> >     NAME             ID  USED     %USED  MAX AVAIL  OBJECTS
> >     cephfs_data      1   8.7 GiB  0.17   5.0 TiB    360545
> >     cephfs_metadata  2   221 MiB  0      5.0 TiB    77707
> > I know this folder can be compressed to ~4.0GB under zfs lz4 compression.
> > Am I missing anything? How do I make cephfs compression work? Is there
> > any trick?
> > By the way, I am evaluating ceph mimic v13.2.0.
> > Thanks in advance,
> > --Youzhong
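Since the perf counters are all you get, here's a quick sketch (Python, purely illustrative) of turning the numbers quoted in this thread into an effective ratio. My reading of the counters, which you should verify against your own version's docs, is that `bluestore_compressed_original` is the logical data that went through the compressor and `bluestore_compressed_allocated` is what it occupies on disk:

```python
# Effective compression ratio for osd.0, from the perf counters quoted
# above (values are bytes).
compressed_allocated = 788659830784   # bluestore_compressed_allocated
compressed_original = 1660064620544   # bluestore_compressed_original

ratio = compressed_original / compressed_allocated
print(round(ratio, 2))  # roughly 2x for this osd

# Sanity check on the pool-level numbers: 408T stored in a k=4, m=2 EC pool
# would be ~612T raw if nothing compressed, vs the 485T RAW actually used.
expected_raw = 408 * (4 + 2) / 4
print(expected_raw)
```

That gap between expected and actual raw usage is exactly the "RAW not increasing as much as you'd expect" effect described above.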
Re: [ceph-users] cephfs compression?
I'm using compression on a cephfs-data pool in luminous. I didn't do anything special:

$ sudo ceph osd pool get cephfs-data all | grep ^compression
compression_mode: aggressive
compression_algorithm: zlib

You can check how much compression you're getting on the osd's:

$ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd perf dump | grep 'bluestore_compressed'; done
osd.0
"bluestore_compressed": 686487948225,
"bluestore_compressed_allocated": 788659830784,
"bluestore_compressed_original": 1660064620544,

osd.11
"bluestore_compressed": 700999601387,
"bluestore_compressed_allocated": 808854355968,
"bluestore_compressed_original": 1752045551616,

I can't say for mimic, but definitely for luminous v12.2.5 compression is working well with mostly default options.

-Rich

> For RGW, compression works very well. We use rgw to store crash dumps; in
> most cases the compression ratio is about 2.0 ~ 4.0.
> I tried to enable compression for the cephfs data pool:
> # ceph osd pool get cephfs_data all | grep ^compression
> compression_mode: force
> compression_algorithm: lz4
> compression_required_ratio: 0.95
> compression_max_blob_size: 4194304
> compression_min_blob_size: 4096
> (we built ceph packages and enabled lz4.)
> It doesn't seem to work. I copied an 8.7GB folder to cephfs, and ceph df
> says it used 8.7GB:
> root@ceph-admin:~# ceph df
> GLOBAL:
>     SIZE    AVAIL   RAW USED  %RAW USED
>     16 TiB  16 TiB  111 GiB   0.69
> POOLS:
>     NAME             ID  USED     %USED  MAX AVAIL  OBJECTS
>     cephfs_data      1   8.7 GiB  0.17   5.0 TiB    360545
>     cephfs_metadata  2   221 MiB  0      5.0 TiB    77707
> I know this folder can be compressed to ~4.0GB under zfs lz4 compression.
> Am I missing anything? How do I make cephfs compression work? Is there
> any trick?
> By the way, I am evaluating ceph mimic v13.2.0.
> Thanks in advance,
> --Youzhong
Re: [ceph-users] Luminous Bluestore performance, bcache
Hi Andrei,

These are good questions. We have another cluster with filestore and bcache, but for this particular one I was interested in testing out bluestore. So I have used bluestore both with and without bcache.

For my synthetic load on the vm's I'm using this fio command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randwrite --rate_iops=50

Currently on bluestore with my synthetic load I'm getting a 7% hit ratio (cat /sys/block/bcache*/bcache/stats_total/cache_hit_ratio). On our filestore cluster with ~700 vm's of varied workload we're getting about a 30-35% hit ratio. In the hourly hit ratio I have seen as high as 50% on some osd's in our filestore cluster, but only 25% on my synthetic load on bluestore so far; I hadn't actually been checking this stat until now.

I hope that helps.

Regards,
Richard

> Hi Richard,
> It is an interesting test for me too, as I am planning to migrate to
> Bluestore storage and was considering repurposing the ssd disks
> that we currently use for journals.
> I was wondering if you are using filestore or bluestore
> for the osds?
> Also, when you perform your testing, how good is the hit ratio
> that you have on the bcache?
> Are you using a lot of random data for your benchmarks? How
> large is your test file for each vm?
> We played around with a few caching scenarios a few years back
> (EnhanceIO and a few more which I can't remember now) and we saw
> a very poor hit ratio from the caching system. Was wondering if
> you see a different picture?
> Cheers
[ceph-users] Luminous Bluestore performance, bcache
Hi Everyone,

There have been a few threads go past around this, but I haven't seen any that pointed me in the right direction. We've recently set up a new luminous (12.2.5) cluster with 5 hosts, each with twelve 4TB Seagate Constellation ES spinning disks for osd's. We also have 2x 400GB Intel DC P3700's per node. We're using this for rbd storage for VM's running under Proxmox VE.

I firstly set these up with the DB partition (approx 60GB per osd) on nvme and data directly onto the spinning disk, using ceph-deploy create. This worked great and was very simple. However, performance wasn't great. I fired up 20 vm's, each running fio trying to attain 50 iops. Ceph was only just able to keep up with the 1000 iops this generated, and vm's started to have trouble hitting their 50 iops target.

So I rebuilt all the osd's, halving the DB space (~30GB per osd) and adding a 200GB bcache partition shared between 6 osd's. Again this worked great with ceph-deploy create and was very simple. I have had a vast improvement in my synthetic test: I can now run 100 test vm's at 50 iops each, generating a constant 5000 iops load, and each one keeps up without any trouble.

The question I have is whether the poor performance out of the box is expected? Or is there some kind of tweaking I should be doing to make this usable for rbd images? Are others able to work ok with this kind of config at a small scale like my 60 osd's, or is it only workable at a larger scale?

Regards,
Rich
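As a rough sanity check on whether the spinning disks themselves were the bottleneck in the first configuration, here's a back-of-the-envelope sketch. The replica count of 3 is my assumption (the post doesn't state the pool size), and this ignores metadata traffic and WAL/DB write amplification, so the real per-disk load would be higher:

```python
# Back-of-the-envelope backend write load per spindle for the first test.
client_iops = 20 * 50    # 20 VMs each doing 50 write IOPS
replica_count = 3        # assumed pool size (not stated in the post)
num_osds = 5 * 12        # 5 hosts x 12 spinning disks

ops_per_disk = client_iops * replica_count / num_osds
print(ops_per_disk)  # random 4k writes per second per disk
```

Even before amplification, that is already a substantial fraction of what a single 7200rpm disk can do for random writes, which would be consistent with the cluster sitting right at its limit under the 1000 iops load.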
[ceph-users] Ceph ObjectCacher FAILED assert (qemu/kvm)
Hi Everyone,

We run some hosts with Proxmox 4.4 connected to our ceph cluster for RBD storage. Occasionally a vm suddenly stops with no real explanation. The last time this happened to one particular vm I turned on some qemu logging via the Proxmox Monitor tab for the vm, and got this dump when the vm stopped again:

osdc/ObjectCacher.cc: In function 'void ObjectCacher::Object::discard(loff_t, loff_t)' thread 7f1c6ebfd700 time 2018-05-08 07:00:47.816114
osdc/ObjectCacher.cc: 533: FAILED assert(bh->waitfor_read.empty())
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (()+0x2d0712) [0x7f1c8e093712]
2: (()+0x52c107) [0x7f1c8e2ef107]
3: (()+0x52c45f) [0x7f1c8e2ef45f]
4: (()+0x82107) [0x7f1c8de45107]
5: (()+0x83388) [0x7f1c8de46388]
6: (()+0x80e74) [0x7f1c8de43e74]
7: (()+0x86db0) [0x7f1c8de49db0]
8: (()+0x2c0ddf) [0x7f1c8e083ddf]
9: (()+0x2c1d00) [0x7f1c8e084d00]
10: (()+0x8064) [0x7f1c804e0064]
11: (clone()+0x6d) [0x7f1c8021562d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We're using virtio-scsi for the disk with the discard option and writeback cache enabled. The vm is Win2012r2.

Has anyone seen this before? Is there a resolution? I couldn't find any mention of this while googling for various key words in the dump.

Regards,
Richard
Re: [ceph-users] Safe to delete data, metadata pools?
Thanks John, I removed these pools on Friday and, as you suspected, there was no impact.

Regards,
Rich

On 8 January 2018 at 23:15, John Spray <jsp...@redhat.com> wrote:
> On Mon, Jan 8, 2018 at 2:55 AM, Richard Bade <hitr...@gmail.com> wrote:
>> Hi Everyone,
>> I've got a couple of pools that I don't believe are being used but
>> have a reasonably large number of pg's (approx 50% of our total pg's).
>> I'd like to delete them, but as they were pre-existing when I inherited
>> the cluster, I wanted to make sure they aren't needed for anything
>> first.
>> Here's the details:
>> POOLS:
>>     NAME      ID  USED  %USED  MAX AVAIL  OBJECTS
>>     data      0   0     0      88037G     0
>>     metadata  1   0     0      88037G     0
>>
>> We don't run cephfs and I believe these are meant for that, but may
>> have been created by default when the cluster was set up (back on
>> dumpling or bobtail I think).
>> As far as I can tell there is no data in them. Do they need to exist
>> for some ceph function?
>> The pool names worry me a little, as they sound important.
>
> The data and metadata pools were indeed created by default in older
> versions of Ceph, for use by CephFS. Since you're not using CephFS,
> and nobody is using the pools for anything else either (they're
> empty), you can go ahead and delete them.
>
>> They have 3136 pg's each, so I'd like to be rid of those so I can
>> increase the number of pg's in my actual data pools without getting
>> over 300 pg's per osd.
>> Here's the osd dump:
>> pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash
>> rjenkins pg_num 3136 pgp_num 3136 last_change 1 crash_replay_interval
>> 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1
>> stripe_width 0
>> pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1
>> object_hash rjenkins pg_num 3136 pgp_num 3136 last_change 1
>> min_read_recency_for_promote 1 min_write_recency_for_promote 1
>> stripe_width 0
>>
>> Also, what performance impact am I likely to see when ceph removes the
>> empty pg's, considering it's approx 50% of my total pg's on my 180
>> osd's?
>
> Given that they're empty, I'd expect little if any noticeable impact.
>
> John
>
>> Thanks,
>> Rich
[ceph-users] Safe to delete data, metadata pools?
Hi Everyone,

I've got a couple of pools that I don't believe are being used but have a reasonably large number of pg's (approx 50% of our total pg's). I'd like to delete them, but as they were pre-existing when I inherited the cluster, I wanted to make sure they aren't needed for anything first. Here's the details:

POOLS:
    NAME      ID  USED  %USED  MAX AVAIL  OBJECTS
    data      0   0     0      88037G     0
    metadata  1   0     0      88037G     0

We don't run cephfs and I believe these are meant for that, but may have been created by default when the cluster was set up (back on dumpling or bobtail I think). As far as I can tell there is no data in them. Do they need to exist for some ceph function? The pool names worry me a little, as they sound important.

They have 3136 pg's each, so I'd like to be rid of those so I can increase the number of pg's in my actual data pools without getting over 300 pg's per osd. Here's the osd dump:

pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 3136 pgp_num 3136 last_change 1 crash_replay_interval 45 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 3136 pgp_num 3136 last_change 1 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0

Also, what performance impact am I likely to see when ceph removes the empty pg's, considering it's approx 50% of my total pg's on my 180 osd's?

Thanks,
Rich
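To put the "approx 50% of our total pg's" in concrete terms, the arithmetic from the osd dump above works out as below. This is just the usual PG-per-OSD accounting (pg_num times replica size, summed over pools, divided across osds), shown as a sketch:

```python
# PG replicas contributed by the two empty pools (numbers from the message).
pg_num = 3136      # pg_num of each of 'data' and 'metadata'
pool_size = 2      # replicated size 2, per the osd dump
num_pools = 2
num_osds = 180

pg_replicas = pg_num * pool_size * num_pools
per_osd = pg_replicas / num_osds
print(pg_replicas, round(per_osd, 1))  # total PG replicas, per-OSD share
```

So deleting the two pools frees up roughly 70 PGs per OSD of headroom under the 300-per-osd guideline, which is why the deletion is worthwhile even though the pools are empty.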
Re: [ceph-users] Inconsistent PG won't repair
For anyone that encounters this in the future: I was able to resolve the issue by finding the three osd's that the object is on. One by one, I stopped the osd, flushed the journal, and used the objectstore tool to remove the data:

sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-19 --journal-path /dev/disk/by-partlabel/journal19 --pool tier3-rbd-3X rbd_data.19cdf512ae8944a.0001bb56 remove

Then I started the osd again and let it recover before moving on to the next osd. After the object was deleted from all three osd's I ran a scrub on the PG (ceph pg scrub 3.f05). Once the scrub was finished, the inconsistency went away.

Note: the object in question was empty (size of zero bytes) before I started this process. I emptied the object by moving the rbd image to another pool.

Rich

On 24 October 2017 at 14:34, Richard Bade <hitr...@gmail.com> wrote:
> What I'm thinking about trying is using the ceph-objectstore-tool to
> remove the offending clone metadata. From the help, the syntax is this:
> ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>
> i.e. something like this for my object and expected clone from the log message:
> ceph-objectstore-tool rbd_data.19cdf512ae8944a.0001bb56 remove-clone-metadata 148d2
> Has anyone had experience with this? I'm not 100% sure whether this will
> resolve the issue or cause much the same situation (since it's already
> expecting a clone that's not there currently).
>
> Rich
>
> On 21 October 2017 at 14:13, Brad Hubbard <bhubb...@redhat.com> wrote:
>> On Sat, Oct 21, 2017 at 1:59 AM, Richard Bade <hitr...@gmail.com> wrote:
>>> Hi Lincoln,
>>> Yes, the object is 0-bytes on all OSD's. It has the same filesystem
>>> date/time too. Before I removed the rbd image (migrated the disk to a
>>> different pool) it was 4MB on all the OSD's and the md5 checksum was
>>> the same on all, so it seems that only the metadata is inconsistent.
>>> Thanks for your suggestion. I just looked into this, as I thought maybe
>>> I could delete the object (since it's empty anyway), but I just get
>>> file not found:
>>> ~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
>>> error stat-ing tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such file or directory
>>
>> Maybe try downing the osds involved?
>>
>>> Regards,
>>> Rich
>>>
>>> On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
>>>> Hi Rich,
>>>>
>>>> Is the object inconsistent and 0-bytes on all OSDs?
>>>>
>>>> We ran into a similar issue on Jewel, where an object was empty across
>>>> the board but had inconsistent metadata. Ultimately it was resolved by
>>>> doing a "rados get" and then a "rados put" on the object. *However*,
>>>> that was a last-ditch effort after I couldn't get any other repair
>>>> option to work, and I have no idea if it will cause any issues down
>>>> the road :)
>>>>
>>>> --Lincoln
>>>>
>>>>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>>>>
>>>>> Hi Everyone,
>>>>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>>>>> during scrub. Previously when this has happened, running ceph pg repair
>>>>> [pg_num] has resolved the problem. This time the repair runs but it
>>>>> remains inconsistent.
>>>>> ~$ ceph health detail
>>>>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>>>>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>>>>> 1 scrub errors
>>>>>
>>>>> The error in the logs is:
>>>>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>>>>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>>>>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>>>>
>>>>> Now, I've tried several things to resolve this. I've tried stopping
>>>>> each of the osd's in turn and running a repair. I've located the rbd
>>>>> image and removed it to empty out the object. The object is now zero
>>>>> bytes but still inconsistent. I've tried stopping each osd, removing
>>>>> the object and starting the osd again. It correctly identifies the
>>>>> object as missing, and repair works to fix this, but it still remains
>>>>> inconsistent.
>>>>> I've run out of ideas.
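The per-osd procedure described at the top of this message (stop the osd, flush its journal, remove the object replica with ceph-objectstore-tool, restart, let it recover) can be captured as a dry-run helper that only builds the command strings for review before you run anything. This is a sketch: the systemctl lines are an assumption and vary by distro/init system, and the paths and names are the ones from this thread.

```python
def removal_commands(osd_id, journal_dev, pool, obj):
    """Build the per-OSD command sequence described above.

    Returns command strings for review; nothing is executed here. The
    systemctl lines are an assumption -- substitute whatever starts and
    stops OSDs on your system (upstart, sysvinit, ...).
    """
    return [
        f"systemctl stop ceph-osd@{osd_id}",
        f"ceph-osd -i {osd_id} --flush-journal",
        (f"sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-{osd_id} "
         f"--journal-path {journal_dev} --pool {pool} {obj} remove"),
        f"systemctl start ceph-osd@{osd_id}",
    ]

for cmd in removal_commands(19, "/dev/disk/by-partlabel/journal19",
                            "tier3-rbd-3X",
                            "rbd_data.19cdf512ae8944a.0001bb56"):
    print(cmd)
```

Between OSDs, remember the wait-for-recovery step from the message: only move on once the cluster is healthy again, and finish with the scrub of the affected PG.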
Re: [ceph-users] Inconsistent PG won't repair
What I'm thinking about trying is using the ceph-objectstore-tool to remove the offending clone metadata. From the help, the syntax is this:

ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>

i.e. something like this for my object and expected clone from the log message:

ceph-objectstore-tool rbd_data.19cdf512ae8944a.0001bb56 remove-clone-metadata 148d2

Has anyone had experience with this? I'm not 100% sure whether this will resolve the issue or cause much the same situation (since it's already expecting a clone that's not there currently).

Rich

On 21 October 2017 at 14:13, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Sat, Oct 21, 2017 at 1:59 AM, Richard Bade <hitr...@gmail.com> wrote:
>> Hi Lincoln,
>> Yes, the object is 0-bytes on all OSD's. It has the same filesystem
>> date/time too. Before I removed the rbd image (migrated the disk to a
>> different pool) it was 4MB on all the OSD's and the md5 checksum was
>> the same on all, so it seems that only the metadata is inconsistent.
>> Thanks for your suggestion. I just looked into this, as I thought maybe
>> I could delete the object (since it's empty anyway), but I just get
>> file not found:
>> ~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
>> error stat-ing tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such file or directory
>
> Maybe try downing the osds involved?
>
>> Regards,
>> Rich
>>
>> On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
>>> Hi Rich,
>>>
>>> Is the object inconsistent and 0-bytes on all OSDs?
>>>
>>> We ran into a similar issue on Jewel, where an object was empty across
>>> the board but had inconsistent metadata. Ultimately it was resolved by
>>> doing a "rados get" and then a "rados put" on the object. *However*,
>>> that was a last-ditch effort after I couldn't get any other repair
>>> option to work, and I have no idea if it will cause any issues down
>>> the road :)
>>>
>>> --Lincoln
>>>
>>>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>>>
>>>> Hi Everyone,
>>>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>>>> during scrub. Previously when this has happened, running ceph pg repair
>>>> [pg_num] has resolved the problem. This time the repair runs but it
>>>> remains inconsistent.
>>>> ~$ ceph health detail
>>>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>>>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>>>> 1 scrub errors
>>>>
>>>> The error in the logs is:
>>>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>>>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>>>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>>>
>>>> Now, I've tried several things to resolve this. I've tried stopping
>>>> each of the osd's in turn and running a repair. I've located the rbd
>>>> image and removed it to empty out the object. The object is now zero
>>>> bytes but still inconsistent. I've tried stopping each osd, removing
>>>> the object and starting the osd again. It correctly identifies the
>>>> object as missing, and repair works to fix this, but it still remains
>>>> inconsistent.
>>>> I've run out of ideas.
>>>> The object is now zero bytes:
>>>> ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name "*19cdf512ae8944a.0001bb56*" -ls
>>>> 537598582 0 -rw-r--r-- 1 root root 0 Oct 21 03:54 /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3
>>>>
>>>> How can I resolve this? Is there some way to remove the empty object
>>>> completely? I saw reference to ceph-objectstore-tool, which has some
>>>> options to remove-clone-metadata, but I don't know how to use it.
>>>> Will using this to remove the mentioned 148d2 expected clone resolve
>>>> this? Or would it do the opposite, as it would seem that it can't
>>>> find that clone?
>>>> Documentation on this tool is sparse.
>>>>
>>>> Any help here would be appreciated.
>>>>
>>>> Regards,
>>>> Rich
>
> --
> Cheers,
> Brad
Re: [ceph-users] Inconsistent PG won't repair
Hi Lincoln,

Yes, the object is 0-bytes on all OSD's. It has the same filesystem date/time too. Before I removed the rbd image (migrated the disk to a different pool) it was 4MB on all the OSD's and the md5 checksum was the same on all, so it seems that only the metadata is inconsistent.

Thanks for your suggestion. I just looked into this, as I thought maybe I could delete the object (since it's empty anyway), but I just get file not found:

~$ rados stat rbd_data.19cdf512ae8944a.0001bb56 --pool=tier3-rbd-3X
error stat-ing tier3-rbd-3X/rbd_data.19cdf512ae8944a.0001bb56: (2) No such file or directory

Regards,
Rich

On 21 October 2017 at 04:32, Lincoln Bryant <linco...@uchicago.edu> wrote:
> Hi Rich,
>
> Is the object inconsistent and 0-bytes on all OSDs?
>
> We ran into a similar issue on Jewel, where an object was empty across
> the board but had inconsistent metadata. Ultimately it was resolved by
> doing a "rados get" and then a "rados put" on the object. *However*,
> that was a last-ditch effort after I couldn't get any other repair
> option to work, and I have no idea if it will cause any issues down
> the road :)
>
> --Lincoln
>
>> On Oct 20, 2017, at 10:16 AM, Richard Bade <hitr...@gmail.com> wrote:
>>
>> Hi Everyone,
>> In our cluster running 0.94.10 we had a pg pop up as inconsistent
>> during scrub. Previously when this has happened, running ceph pg repair
>> [pg_num] has resolved the problem. This time the repair runs but it
>> remains inconsistent.
>> ~$ ceph health detail
>> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
>> pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
>> 1 scrub errors
>>
>> The error in the logs is:
>> cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub
>> 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir
>> expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2
>>
>> Now, I've tried several things to resolve this. I've tried stopping
>> each of the osd's in turn and running a repair. I've located the rbd
>> image and removed it to empty out the object. The object is now zero
>> bytes but still inconsistent. I've tried stopping each osd, removing
>> the object and starting the osd again. It correctly identifies the
>> object as missing, and repair works to fix this, but it still remains
>> inconsistent.
>> I've run out of ideas.
>> The object is now zero bytes:
>> ~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name "*19cdf512ae8944a.0001bb56*" -ls
>> 537598582 0 -rw-r--r-- 1 root root 0 Oct 21 03:54 /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3
>>
>> How can I resolve this? Is there some way to remove the empty object
>> completely? I saw reference to ceph-objectstore-tool, which has some
>> options to remove-clone-metadata, but I don't know how to use it.
>> Will using this to remove the mentioned 148d2 expected clone resolve
>> this? Or would it do the opposite, as it would seem that it can't
>> find that clone?
>> Documentation on this tool is sparse.
>>
>> Any help here would be appreciated.
>>
>> Regards,
>> Rich
[ceph-users] Inconsistent PG won't repair
Hi Everyone,

In our cluster running 0.94.10 we had a pg pop up as inconsistent during scrub. Previously when this has happened, running ceph pg repair [pg_num] has resolved the problem. This time the repair runs but it remains inconsistent.

~$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors; noout flag(s) set
pg 3.f05 is active+clean+inconsistent, acting [171,23,131]
1 scrub errors

The error in the logs is:
cstor01 ceph-mon: osd.171 10.233.202.21:6816/12694 45 : deep-scrub 3.f05 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/snapdir expected clone 3/68ab5f05/rbd_data.19cdf512ae8944a.0001bb56/148d2

Now, I've tried several things to resolve this. I've tried stopping each of the osd's in turn and running a repair. I've located the rbd image and removed it to empty out the object. The object is now zero bytes but still inconsistent. I've tried stopping each osd, removing the object and starting the osd again. It correctly identifies the object as missing, and repair works to fix this, but it still remains inconsistent. I've run out of ideas.

The object is now zero bytes:
~$ find /var/lib/ceph/osd/ceph-23/current/3.f05_head/ -name "*19cdf512ae8944a.0001bb56*" -ls
537598582 0 -rw-r--r-- 1 root root 0 Oct 21 03:54 /var/lib/ceph/osd/ceph-23/current/3.f05_head/DIR_5/DIR_0/DIR_F/DIR_5/DIR_B/rbd\\udata.19cdf512ae8944a.0001bb56__snapdir_68AB5F05__3

How can I resolve this? Is there some way to remove the empty object completely? I saw reference to ceph-objectstore-tool, which has some options to remove-clone-metadata, but I don't know how to use it. Will using this to remove the mentioned 148d2 expected clone resolve this? Or would it do the opposite, as it would seem that it can't find that clone? Documentation on this tool is sparse.

Any help here would be appreciated.

Regards,
Rich
Re: [ceph-users] Objects Stuck Degraded
Hi Everyone,

Just an update to this in case anyone has the same issue. This seems to have been caused by ceph osd reweight-by-utilization. Because we have two pools that map to two separate sets of disks, and one pool was more full than the other, reweight-by-utilization had reduced the weight of the osd's in one pool down to around 0.3. This seems to have made the crush map unable to find a suitable osd for the 2nd copy. Changing the reweight weights back up to near 1 has resolved the issue.

Regards,
Richard

On 25 January 2017 at 10:58, Richard Bade <hitr...@gmail.com> wrote:
> Hi Everyone,
> I've got a strange one. After doing a reweight of some osd's the other
> night, our cluster is showing 1 pg stuck unclean.
>
> 2017-01-25 09:48:41 : 1 pgs stuck unclean | recovery 140/71532872
> objects degraded (0.000%) | recovery 2553/71532872 objects misplaced
> (0.004%)
>
> When I query the pg it shows one of the osd's is not up:
>
> "state": "active+remapped",
> "snap_trimq": "[]",
> "epoch": 231928,
> "up": [ 155 ],
> "acting": [ 155, 105 ],
> "actingbackfill": [ "105", "155" ],
>
> I've tried restarting the osd's, ceph pg repair, ceph pg 4.559
> list_missing, and ceph pg 4.559 mark_unfound_lost revert.
> Nothing works.
> I've just tried setting osd.105 out, waiting for backfill to evacuate
> the osd, and stopping the osd process to see if it'll recreate the 2nd
> set of data, but no luck.
> It would seem that the primary copy of the data on osd.155 is fine but
> the 2nd copy on osd.105 isn't there.
>
> Any ideas how I can force rebuilding the 2nd copy? Or any other ideas
> to resolve this?
>
> We're running Hammer:
> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>
> Regards,
> Richard
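Since the root cause here was reweight-by-utilization dragging some reweights down to ~0.3, one way to find and fix the affected osd's is to build the restore commands from a dump of current reweights. A sketch follows; the reweight values in the example are made up for illustration, and in practice you'd parse them from `ceph osd df --format json` or the osd tree rather than typing them in:

```python
def restore_reweight_commands(reweights, threshold=0.9, target=1.0):
    """Return `ceph osd reweight` command strings for any osd whose
    reweight has been pushed below the threshold. Nothing is executed;
    review the output before running anything."""
    return [
        f"ceph osd reweight {osd_id} {target}"
        for osd_id, rw in sorted(reweights.items())
        if rw < threshold
    ]

# Hypothetical reweight values, for illustration only:
cmds = restore_reweight_commands({105: 0.3114, 155: 1.0, 160: 0.8503})
print(cmds)
```

Note that raising reweights triggers data movement, so you'd likely want to step them up gradually rather than jumping every osd straight back to 1.0 at once.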
[ceph-users] Objects Stuck Degraded
Hi Everyone,

I've got a strange one. After doing a reweight of some osd's the other night, our cluster is showing 1 pg stuck unclean.

2017-01-25 09:48:41 : 1 pgs stuck unclean | recovery 140/71532872 objects degraded (0.000%) | recovery 2553/71532872 objects misplaced (0.004%)

When I query the pg it shows one of the osd's is not up:

"state": "active+remapped",
"snap_trimq": "[]",
"epoch": 231928,
"up": [ 155 ],
"acting": [ 155, 105 ],
"actingbackfill": [ "105", "155" ],

I've tried restarting the osd's, ceph pg repair, ceph pg 4.559 list_missing, and ceph pg 4.559 mark_unfound_lost revert. Nothing works. I've just tried setting osd.105 out, waiting for backfill to evacuate the osd, and stopping the osd process to see if it'll recreate the 2nd set of data, but no luck. It would seem that the primary copy of the data on osd.155 is fine but the 2nd copy on osd.105 isn't there.

Any ideas how I can force rebuilding the 2nd copy? Or any other ideas to resolve this?

We're running Hammer:
ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)

Regards,
Richard
[ceph-users] mark_unfound_lost revert|delete behaviour
Hi Everyone, can anyone tell me how the ceph pg x.x mark_unfound_lost revert|delete command is meant to work? Due to some not fully known, strange circumstances I have 1 unfound object in one of my pools. I've read through http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects and it seems pretty clear that the object is lost and needs to be reverted or deleted.

However, when I run the revert it returns quickly and doesn't seem to do anything. The first time I tried this it also kicked the osd that was the primary for this pg. This seemed bad, so I restarted the osd. I tried reverting again and nothing happened. Later I tried deleting the unfound object and a similar thing happened: the osd which was primary went down. This time, though, the command didn't return straight away and the load average on that box skyrocketed to around 1800. I restarted the OSD, but noticed that it didn't seem to have caused a lot of pg's to require recovery, as it normally would when an osd is down.

So I'm wondering: is the osd meant to go down? Can anyone confirm the sequence of events that are expected when issuing mark_unfound_lost? I have not managed to find any info when googling.

Regards,
Richard
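One way to check whether a revert or delete actually did anything is to compare the unfound count from `ceph pg x.x list_missing` before and after issuing the command. The sketch below parses a hypothetical, trimmed listing; the object name and version fields are made-up examples, not real output from this cluster.

```python
import json

# Hypothetical, trimmed `ceph pg x.x list_missing` output.
listing = json.loads("""
{
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
    {"oid": {"oid": "rbd_data.abc.0001"}, "need": "231928'42", "have": "0'0"}
  ],
  "more": 0
}
""")

# If mark_unfound_lost worked, a fresh listing should show 0 unfound.
print("unfound objects:", listing["num_unfound"])
for obj in listing["objects"]:
    print("  still unfound:", obj["oid"]["oid"])
```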
Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10
Hi Everyone, Thanks for your input on this. I know it's been a long time but I just wanted to report back that this issue has been resolved. We added two more monitors which happened to be on Ubuntu 14.04 (rather than 12.04) and these had no issues. So we upgraded every host to 14.04. Since the OS update we have not had any Monitor crashes. It's now been over two months and the Mon's have been stable. Thanks again, Richard On 17 October 2015 at 07:26, Richard Bade <hitr...@gmail.com> wrote: > Ok, debugging increased > ceph tell mon.[abc] injectargs --debug-mon 20 > ceph tell mon.[abc] injectargs --debug-ms 1 > > Regards, > Richard > > On 17 October 2015 at 01:38, Sage Weil <s...@newdream.net> wrote: >> >> This doesn't look familiar. Are you able to enable a higher log level so >> that if it happens again we'll have more info? >> >> debug mon = 20 >> debug ms = 1 >> >> Thanks! >> sage >> >> On Fri, 16 Oct 2015, Dan van der Ster wrote: >> >> > Hmm, that's strange. I didn't see anything in the tracker that looks >> > related. Hopefully an expert can chime in... >> > >> > Cheers, Dan >> > >> > On Fri, Oct 16, 2015 at 1:38 PM, Richard Bade <hitr...@gmail.com> wrote: >> > > Thanks for your quick response Dan, but no. All the ceph-mon.*.log >> > > files are >> > > empty. 
>> > > I did track this down in syslog though, in case it helps: >> > > ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal >> > > (Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph >> > > version >> > > 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1: >> > > /usr/bin/ceph-mon() >> > > [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3: >> > > (get_str_map_key(std::map<std::string, std::string, >> > > std::less, >> > > std::allocator<std::pair > > const&, >> > > std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4: >> > > (LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5: >> > > (PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6: >> > > (Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7: >> > > (Paxos::do_refresh()+0x4c) [0x5c465c]#012 8: >> > > (Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9: >> > > (Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10: >> > > (Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012 >> > > 11: >> > > (Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12: >> > > (Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13: >> > > (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14: >> > > (DispatchQueue::entry()+0x44a) [0x8a9b2a]#012 15: >> > > (DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16: >> > > (()+0x7e9a) >> > > [0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a >> > > copy of >> > > the executable, or `objdump -rdS ` is needed to interpret >> > > this. >> > > >> > > Regards, >> > > Richard >> > > >> > > On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com> >> > > wrote: >> > >> >> > >> Hi, >> > >> Is there a backtrace in /var/log/ceph/ceph-mon.*.log ? 
>> > >> Cheers, Dan >> > >> >> > >> On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com> >> > >> wrote: >> > >> > Hi Everyone, >> > >> > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and >> > >> > today >> > >> > we've >> > >> > had one monitor crash twice and another one once. We have 3 >> > >> > monitors >> > >> > total >> > >> > and have been running Firefly 0.80.10 for quite some time without >> > >> > any >> > >> > monitor issues. >> > >> > When the monitor crashes it leaves a core file and a crash file in >> > >> > /var/crash >> > >> > I can't see anything obviously the same goolging about. >> > >> > Has anyone seen anything like this? >> > >> > Any suggestions? What other info would be useful to help track down >> > >> > the >> > >> > issue. >> > >> > >> > >> > Regards, >> > >> > Richard >> > >> > >> > >> > ___ >> > >> > ceph-users mailing list >> > >> > ceph-users@lists.ceph.com >> > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > >> > >> > > >> > > >> > ___ >> > ceph-users mailing list >> > ceph-users@lists.ceph.com >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > >> > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10
Ok, debugging increased ceph tell mon.[abc] injectargs --debug-mon 20 ceph tell mon.[abc] injectargs --debug-ms 1 Regards, Richard On 17 October 2015 at 01:38, Sage Weil <s...@newdream.net> wrote: > This doesn't look familiar. Are you able to enable a higher log level so > that if it happens again we'll have more info? > > debug mon = 20 > debug ms = 1 > > Thanks! > sage > > On Fri, 16 Oct 2015, Dan van der Ster wrote: > > > Hmm, that's strange. I didn't see anything in the tracker that looks > > related. Hopefully an expert can chime in... > > > > Cheers, Dan > > > > On Fri, Oct 16, 2015 at 1:38 PM, Richard Bade <hitr...@gmail.com> wrote: > > > Thanks for your quick response Dan, but no. All the ceph-mon.*.log > files are > > > empty. > > > I did track this down in syslog though, in case it helps: > > > ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal > > > (Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph version > > > 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1: > /usr/bin/ceph-mon() > > > [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3: > > > (get_str_map_key(std::map<std::string, std::string, > std::less, > > > std::allocator<std::pair > > const&, > > > std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4: > > > (LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5: > > > (PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6: > > > (Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7: > > > (Paxos::do_refresh()+0x4c) [0x5c465c]#012 8: > > > (Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9: > > > (Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10: > > > (Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012 > 11: > > > (Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12: > > > (Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13: > > > (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14: > > > (DispatchQueue::entry()+0x44a) 
[0x8a9b2a]#012 15: > > > (DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16: > (()+0x7e9a) > > > [0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a > copy of > > > the executable, or `objdump -rdS ` is needed to interpret > this. > > > > > > Regards, > > > Richard > > > > > > On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com> > wrote: > > >> > > >> Hi, > > >> Is there a backtrace in /var/log/ceph/ceph-mon.*.log ? > > >> Cheers, Dan > > >> > > >> On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com> > wrote: > > >> > Hi Everyone, > > >> > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and > today > > >> > we've > > >> > had one monitor crash twice and another one once. We have 3 monitors > > >> > total > > >> > and have been running Firefly 0.80.10 for quite some time without > any > > >> > monitor issues. > > >> > When the monitor crashes it leaves a core file and a crash file in > > >> > /var/crash > > >> > I can't see anything obviously the same goolging about. > > >> > Has anyone seen anything like this? > > >> > Any suggestions? What other info would be useful to help track down > the > > >> > issue. > > >> > > > >> > Regards, > > >> > Richard > > >> > > > >> > ___ > > >> > ceph-users mailing list > > >> > ceph-users@lists.ceph.com > > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > >> > > > > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
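One note on the `injectargs` approach above: injected settings only last until the monitor restarts. To keep the higher log levels across restarts, the same settings (as suggested by Sage) can be made persistent in ceph.conf; a sketch:

```ini
[mon]
debug mon = 20
debug ms = 1
```

With this in place, a monitor that crashes and is restarted still logs at the higher level, so the next crash is captured in full.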
[ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10
Hi Everyone, I upgraded our cluster to Hammer 0.94.3 a couple of days ago and today we've had one monitor crash twice and another one once. We have 3 monitors total and have been running Firefly 0.80.10 for quite some time without any monitor issues.

When a monitor crashes it leaves a core file and a crash file in /var/crash. I can't see anything obviously the same googling about.

Has anyone seen anything like this? Any suggestions? What other info would be useful to help track down the issue?

Regards,
Richard
Re: [ceph-users] ceph-mon crash after update to Hammer 0.94.3 from Firefly 0.80.10
Thanks for your quick response Dan, but no. All the ceph-mon.*.log files are empty. I did track this down in syslog though, in case it helps: ceph-mon: 2015-10-16 21:25:00.117115 7f4c9f458700 -1 *** Caught signal (Segmentation fault) **#012 in thread 7f4c9f458700#012#012 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)#012 1: /usr/bin/ceph-mon() [0x928b05]#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]#012 3: (get_str_map_key(std::map<std::string, std::string, std::less, std::allocator<std::pair > > const&, std::string const&, std::string const*)+0x37) [0x87d8e7]#012 4: (LogMonitor::update_from_paxos(bool*)+0x801) [0x6846e1]#012 5: (PaxosService::refresh(bool*)+0x3c6) [0x5dc326]#012 6: (Monitor::refresh_from_paxos(bool*)+0x36b) [0x588aab]#012 7: (Paxos::do_refresh()+0x4c) [0x5c465c]#012 8: (Paxos::handle_commit(MMonPaxos*)+0x243) [0x5cb2d3]#012 9: (Paxos::dispatch(PaxosServiceMessage*)+0x22b) [0x5d3fbb]#012 10: (Monitor::dispatch(MonSession*, Message*, bool)+0x864) [0x5ab0d4]#012 11: (Monitor::_ms_dispatch(Message*)+0x2c9) [0x5a8a19]#012 12: (Monitor::ms_dispatch(Message*)+0x32) [0x5c3952]#012 13: (Messenger::ms_deliver_dispatch(Message*)+0x77) [0x8ac987]#012 14: (DispatchQueue::entry()+0x44a) [0x8a9b2a]#012 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x79e4ad]#012 16: (()+0x7e9a) [0x7f4ca50d8e9a]#012 17: (clone()+0x6d) [0x7f4ca3dca38d]#012 NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. Regards, Richard On 17 October 2015 at 00:33, Dan van der Ster <d...@vanderster.com> wrote: > Hi, > Is there a backtrace in /var/log/ceph/ceph-mon.*.log ? > Cheers, Dan > > On Fri, Oct 16, 2015 at 12:46 PM, Richard Bade <hitr...@gmail.com> wrote: > > Hi Everyone, > > I upgraded our cluster to Hammer 0.94.3 a couple of days ago and today > we've > > had one monitor crash twice and another one once. We have 3 monitors > total > > and have been running Firefly 0.80.10 for quite some time without any > > monitor issues. 
> > When the monitor crashes it leaves a core file and a crash file in > > /var/crash > > I can't see anything obviously the same goolging about. > > Has anyone seen anything like this? > > Any suggestions? What other info would be useful to help track down the > > issue. > > > > Regards, > > Richard > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
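As an aside, the `#012` runs in the syslog copy of the backtrace are rsyslog's octal escape for newline characters, which is why the trace arrives as one long line. Undoing the escape restores the readable backtrace; the sketch below uses a shortened piece of the trace quoted above.

```python
# syslog flattens multi-line messages, encoding each newline as the
# octal escape "#012". Replacing it restores the ceph-mon backtrace.
raw = ("*** Caught signal (Segmentation fault) **#012 in thread 7f4c9f458700"
       "#012 1: /usr/bin/ceph-mon() [0x928b05]"
       "#012 2: (()+0xfcb0) [0x7f4ca50e0cb0]")

expanded = raw.replace("#012", "\n")
print(expanded)
```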
Re: [ceph-users] XFS and nobarriers on Intel SSD
Hi Everyone, I updated the firmware on 3 S3710 drives (one host) last Tuesday and have not seen any ATA resets or Task Aborts on that host in the 5 days since. I also set nobarriers on another host on Wednesday and have only seen one Task Abort, and that was on an S3710. I have seen 18 ATA resets or Task Aborts on the two hosts that I made no changes on.

It looks like this firmware has fixed my issues, and it looks like nobarriers also improves the situation significantly, which seems to correlate with your experience, Christian.

Thanks everyone for the info in this thread. I plan to update the firmware on the remainder of the S3710 drives this week and also set nobarriers.

Regards,
Richard

On 8 September 2015 at 14:27, Richard Bade <hitr...@gmail.com> wrote:
> Hi Christian,
>
> On 8 September 2015 at 14:02, Christian Balzer <ch...@gol.com> wrote:
>>
>> Indeed. But first a word about the setup where I'm seeing this.
>> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
>> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
>> with the Samsung DC SSDs, one with the Intel S3610.
>> 2 of these chassis to be precise:
>> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm
>
> We are using the same box, but DC0R (no infiniband) so I guess not
> surprising we're seeing the same thing happening.
>
>>
>> Of course latest firmware and I tried this with any kernel from Debian
>> 3.16 to stock 4.1.6.
>>
>> With nobarrier I managed to trigger the error only once yesterday on the
>> DRBD replication target, not the machine that actually has the FS mounted.
>> Usually I'd be able to trigger it quite a bit more often during those tests.
>>
>> So this morning I updated the firmware of all S3610s on one node and
>> removed the nobarrier flag. It took a lot of punishment, but eventually
>> this happened:
>> ---
>> Sep 8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task
>> abort! scmd(880fdc85b680)
>> Sep 8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB:
>> Write(10) 2a 00 0e 9a fb b8 00 00 08 00
>> Sep 8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1:
>> handle(0x000a), sas_address(0x443322110100), phy(1)
>> Sep 8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1:
>> enclosure_logical_id(0x5003048019e98d00), slot(1)
>> Sep 8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort:
>> SUCCESS scmd(880fdc85b680)
>> ---
>> Note that on the un-patched node (DRBD replication target) I managed to
>> trigger this bug 3 times in the same period.
>>
>> So unless Intel has something to say (and given that this happens with
>> Samsungs as well), I'd still look beady eyed at LSI/Avago...
>
> Yes, I think there may be more than one issue here. The reduction in
> occurrences seems to prove there is an issue fixed by the Intel firmware,
> but something is still happening.
> Once I have updated the firmware on the drives on one of our hosts
> tonight, hopefully I can get some more statistics and pinpoint whether
> there is another issue specifically with the LSI3008.
> I'd be interested to know if the combination of nobarriers and the updated
> firmware fixes the issue.
>
> Regards,
> Richard
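A quick way to compare hosts before and after the firmware change is to tally the task abort and ATA reset events per host from syslog. The sketch below uses hypothetical log lines modelled on the formats quoted in this thread; hostnames and timestamps are made up.

```python
import re
from collections import Counter

# Hypothetical kernel log lines, following the formats quoted above.
lines = [
    "Sep  8 10:43:47 mbx09 kernel: sd 0:0:1:0: attempting task abort!",
    "Sep  8 11:02:13 mbx10 kernel: ata5: hard resetting link",
    "Sep  8 11:15:40 mbx09 kernel: sd 0:0:1:0: attempting task abort!",
]

# Count abort/reset events per host so hosts with updated firmware
# can be compared against unchanged ones.
counts = Counter()
for line in lines:
    m = re.match(r"\S+\s+\d+ \S+ (\S+) kernel: (.*)", line)
    if m and ("task abort" in m.group(2) or "hard resetting link" in m.group(2)):
        counts[m.group(1)] += 1

print(dict(counts))
```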
Re: [ceph-users] XFS and nobarriers on Intel SSD
Thanks guys for the pointers to this Intel thread: https://communities.intel.com/thread/77801

It looks promising. I intend to update the firmware on disks in one node tonight and will report back after a few days to a week on my findings. I've also posted to that forum and will update there too.

Regards,
Richard

On 5 September 2015 at 07:55, Richard Bade <hitr...@gmail.com> wrote:
> Hi Everyone,
>
> We have a Ceph pool that is entirely made up of Intel S3700/S3710
> enterprise SSD's.
>
> We are seeing some significant I/O delays on the disks causing a "SCSI
> Task Abort" from the OS. This seems to be triggered by the drive receiving
> a "Synchronize cache command".
>
> My current thinking is that setting nobarriers in XFS will stop the drive
> receiving a sync command and therefore stop the I/O delay associated with
> it.
>
> In the XFS FAQ it looks like the recommendation is that if you have a
> Battery Backed raid controller you should set nobarriers for performance
> reasons.
>
> Our LSI card doesn't have battery backed cache as it's configured in HBA
> mode (IT) rather than Raid (IR). Our Intel s37xx SSD's do have a capacitor
> backed cache though.
>
> So is it recommended that barriers are turned off as the drive has a safe
> cache (I am confident that the cache will write out to disk on power
> failure)?
>
> Has anyone else encountered this issue?
>
> Any info or suggestions about this would be appreciated.
>
> Regards,
>
> Richard
Re: [ceph-users] XFS and nobarriers on Intel SSD
Hi Christian,

On 8 September 2015 at 14:02, Christian Balzer wrote:
>
> Indeed. But first a word about the setup where I'm seeing this.
> These are 2 mailbox server clusters (2 nodes each), replicating via DRBD
> over Infiniband (IPoIB at this time), LSI 3008 controller. One cluster
> with the Samsung DC SSDs, one with the Intel S3610.
> 2 of these chassis to be precise:
> https://www.supermicro.com/products/system/2U/2028/SYS-2028TP-DC0FR.cfm

We are using the same box, but DC0R (no infiniband) so I guess not surprising we're seeing the same thing happening.

>
> Of course latest firmware and I tried this with any kernel from Debian
> 3.16 to stock 4.1.6.
>
> With nobarrier I managed to trigger the error only once yesterday on the
> DRBD replication target, not the machine that actually has the FS mounted.
> Usually I'd be able to trigger it quite a bit more often during those tests.
>
> So this morning I updated the firmware of all S3610s on one node and
> removed the nobarrier flag. It took a lot of punishment, but eventually
> this happened:
> ---
> Sep 8 10:43:47 mbx09 kernel: [ 1743.358329] sd 0:0:1:0: attempting task
> abort! scmd(880fdc85b680)
> Sep 8 10:43:47 mbx09 kernel: [ 1743.358339] sd 0:0:1:0: [sdb] CDB:
> Write(10) 2a 00 0e 9a fb b8 00 00 08 00
> Sep 8 10:43:47 mbx09 kernel: [ 1743.358345] scsi target0:0:1:
> handle(0x000a), sas_address(0x443322110100), phy(1)
> Sep 8 10:43:47 mbx09 kernel: [ 1743.358348] scsi target0:0:1:
> enclosure_logical_id(0x5003048019e98d00), slot(1)
> Sep 8 10:43:47 mbx09 kernel: [ 1743.387951] sd 0:0:1:0: task abort:
> SUCCESS scmd(880fdc85b680)
> ---
> Note that on the un-patched node (DRBD replication target) I managed to
> trigger this bug 3 times in the same period.
>
> So unless Intel has something to say (and given that this happens with
> Samsungs as well), I'd still look beady eyed at LSI/Avago...

Yes, I think there may be more than one issue here. The reduction in occurrences seems to prove there is an issue fixed by the Intel firmware, but something is still happening. Once I have updated the firmware on the drives on one of our hosts tonight, hopefully I can get some more statistics and pinpoint whether there is another issue specifically with the LSI3008. I'd be interested to know if the combination of nobarriers and the updated firmware fixes the issue.

Regards,
Richard
Re: [ceph-users] XFS and nobarriers on Intel SSD
Hi Christian,

Thanks for the info. I'm just wondering, have you updated your S3610's with the new firmware that was released on 21/08, as referred to in the thread? We thought we weren't seeing the issue on the intel controller to start with, but after further investigation it turned out we were; it was just reported as a different log item, such as this:

ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x6 frozen
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/10:a0:18:ca:ca/00:00:32:00:00/40 tag 20 ncq 8192 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5.00: failed command: READ FPDMA QUEUED
ata5.00: cmd 60/40:a8:48:ca:ca/00:00:32:00:00/40 tag 21 ncq 32768 in
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata5.00: configured for UDMA/133
ata5.00: device reported invalid CHS sector 0
ata5.00: device reported invalid CHS sector 0
ata5: EH complete
ata5.00: Enabling discard_zeroes_data

I believe this to be the same thing as the LSI3008, which gives these log messages:

sd 0:0:6:0: attempting task abort! scmd(8804cac00600)
sd 0:0:6:0: [sdg] CDB: Read(10): 28 00 1c e7 76 a0 00 01 30 00
scsi target0:0:6: handle(0x000f), sas_address(0x443322110600), phy(6)
scsi target0:0:6: enclosure_logical_id(0x50030480), slot(6)
sd 0:0:6:0: task abort: SUCCESS scmd(8804cac00600)
sd 0:0:6:0: attempting task abort! scmd(8804cac03780)

I appreciate your info with regards to nobarriers. I assume by "alleviate it, but didn't fix" you mean the number of occurrences is reduced?

Regards,
Richard

On 8 September 2015 at 11:43, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> Note that I see exactly your errors (in a non-Ceph environment) with both
> Samsung 845DC EVO and Intel DC S3610.
> Though I need to stress things quite a bit to make it happen.
> > Also setting nobarrier did alleviate it, but didn't fix it 100%, so I > guess something still issues flushes at some point. > > From where I stand LSI/Avago are full of it. > Not only does this problem NOT happen with any onboard SATA chipset I have > access to, their task abort and reset is what actually impacts things > (several seconds to recover), not whatever insignificant delay caused by > the SSDs. > > Christian > On Tue, 8 Sep 2015 11:35:38 +1200 Richard Bade wrote: > > > Thanks guys for the pointers to this Intel thread: > > > > https://communities.intel.com/thread/77801 > > > > It looks promising. I intend to update the firmware on disks in one > > node tonight and will report back after a few days to a week on my > > findings. > > > > I've also posted to that forum and will update there too. > > > > Regards, > > > > Richard > > > > > > On 5 September 2015 at 07:55, Richard Bade <hitr...@gmail.com> wrote: > > > > > Hi Everyone, > > > > > > We have a Ceph pool that is entirely made up of Intel S3700/S3710 > > > enterprise SSD's. > > > > > > We are seeing some significant I/O delays on the disks causing a “SCSI > > > Task Abort” from the OS. This seems to be triggered by the drive > > > receiving a “Synchronize cache command”. > > > > > > My current thinking is that setting nobarriers in XFS will stop the > > > drive receiving a sync command and therefore stop the I/O delay > > > associated with it. > > > > > > In the XFS FAQ it looks like the recommendation is that if you have a > > > Battery Backed raid controller you should set nobarriers for > > > performance reasons. > > > > > > Our LSI card doesn’t have battery backed cache as it’s configured in > > > HBA mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a > > > capacitor backed cache though. > > > > > > So is it recommended that barriers are turned off as the drive has a > > > safe cache (I am confident that the cache will write out to disk on > > > power failure)? 
> > > Has anyone else encountered this issue?
> > >
> > > Any info or suggestions about this would be appreciated.
> > >
> > > Regards,
> > >
> > > Richard
>
> --
> Christian Balzer    Network/Systems Engineer
> ch...@gol.com    Global OnLine Japan/Fusion Communications
> http://www.gol.com/
[ceph-users] XFS and nobarriers on Intel SSD
Hi Everyone,

We have a Ceph pool that is entirely made up of Intel S3700/S3710 enterprise SSD's.

We are seeing some significant I/O delays on the disks causing a "SCSI Task Abort" from the OS. This seems to be triggered by the drive receiving a "Synchronize cache command".

My current thinking is that setting nobarriers in XFS will stop the drive receiving a sync command and therefore stop the I/O delay associated with it.

In the XFS FAQ it looks like the recommendation is that if you have a Battery Backed raid controller you should set nobarriers for performance reasons.

Our LSI card doesn't have battery backed cache as it's configured in HBA mode (IT) rather than Raid (IR). Our Intel s37xx SSD's do have a capacitor backed cache though.

So is it recommended that barriers are turned off as the drive has a safe cache (I am confident that the cache will write out to disk on power failure)?

Has anyone else encountered this issue?

Any info or suggestions about this would be appreciated.

Regards,
Richard
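For reference, barriers are controlled per-filesystem by a mount option. A hypothetical /etc/fstab entry for an XFS OSD data partition with barriers disabled might look like the sketch below; the device and mount point are examples only, and this is only appropriate if the drive's cache is genuinely power-safe:

```
# /etc/fstab - hypothetical OSD data mount with barriers disabled
/dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,nobarrier  0  2
```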
Re: [ceph-users] XFS and nobarriers on Intel SSD
Hi Jan,

Thanks for your response.

> How exactly do you know this is the cause? This is usually just an effect
> of something going wrong and part of error recovery process.
> Preceding this event should be the real error/root cause...

We have been working with LSI/Avago to resolve this. We get a bunch of these types of log events:

2015-09-04T14:58:59.169677+12:00 ceph-osd: - ceph-osd: 2015-09-04 14:58:59.168444 7fbc5ec71700 0 log [WRN] : slow request 30.894936 seconds old, received at 2015-09-04 14:58:28.272976: osd_op(client.42319583.0:1185218039 rbd_data.1d8a5a92eb141f2.56a0 [read 3579392~8192] 4.f9f016cb ack+read e66603) v4 currently no flag points reached

Followed by the task abort I mentioned:

sd 11:0:4:0: attempting task abort! scmd(8804c07d0480)
sd 11:0:4:0: [sdf] CDB: Write(10): 2a 00 24 6f 01 a8 00 00 08 00
scsi target11:0:4: handle(0x000d), sas_address(0x443322110400), phy(4)
scsi target11:0:4: enclosure_logical_id(0x50030480), slot(4)
sd 11:0:4:0: task abort: SUCCESS scmd(8804c07d0480)

LSI had us enable debugging on our card and send them many logs and debugging data. Their response was:

> Please do not send in the Synchronize cache command (35h). That's the one
> causing the drive from not responding to Read/write commands quick enough.
> A Synchronize cache command instructs the ATA device to flush the cache
> contents to medium and so while the disk is in the process of doing it,
> it's probably causing the read/write commands to take longer time to
> complete.

LSI/Avago believe this to be the root cause of the IO delay based on the debugging info.

> and from what I've seen it is not necessary with fast drives (such as
> S3700).

While I agree with you that it should not be necessary, as the S3700's should be very fast, our current experience does not show this to be the case.

Just a little more about our setup: we're using Ceph Firefly (0.80.10) on Ubuntu 14.04. We see this same thing on every S3700/10 on four hosts.
We do not see this happening on the spinning disks in the same cluster but different pool on similar hardware. If you know of any other reason this may be happening, we would appreciate it. Otherwise we will need to continue investigating the possibility of setting nobarriers. Regards, Richard On 5 September 2015 at 09:32, Jan Schermer <j...@schermer.cz> wrote: > We are seeing some significant I/O delays on the disks causing a “SCSI > Task Abort” from the OS. This seems to be triggered by the drive receiving > a “Synchronize cache command”. > > > How exactly do you know this is the cause? This is usually just an effect > of something going wrong and part of error recovery process. > Preceding this event should be the real error/root cause... > > It is _supposedly_ safe to disable barriers in this scenario, but IMO the > assumptions behind that are deeply flawed, and from what I've seen it is > not necessary with fast drives (such as S3700). > > Take a look in the mailing list archives, I elaborated on this quite a bit > in the past, including my experience with Kingston drives + XFS + LSI (and > the effect is present even on Intels, but because they are much faster it > shouldn't cause any real problems). > > Jan > > > On 04 Sep 2015, at 21:55, Richard Bade <hitr...@gmail.com> wrote: > > Hi Everyone, > > We have a Ceph pool that is entirely made up of Intel S3700/S3710 > enterprise SSD's. > > We are seeing some significant I/O delays on the disks causing a “SCSI > Task Abort” from the OS. This seems to be triggered by the drive receiving > a “Synchronize cache command”. > > My current thinking is that setting nobarriers in XFS will stop the drive > receiving a sync command and therefore stop the I/O delay associated with > it. > > In the XFS FAQ it looks like the recommendation is that if you have a > Battery Backed raid controller you should set nobarriers for performance > reasons. 
> > Our LSI card doesn’t have battery backed cache as it’s configured in HBA > mode (IT) rather than Raid (IR). Our Intel s37xx SSD’s do have a capacitor > backed cache though. > > So is it recommended that barriers are turned off as the drive has a safe > cache (I am confident that the cache will write out to disk on power > failure)? > > Has anyone else encountered this issue? > > Any info or suggestions about this would be appreciated. > > Regards, > > Richard > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
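The blocked-I/O time can be pulled straight out of the OSD's "slow request" lines, which is a simple way to quantify how long the stalls around the task aborts actually last. A small sketch, using the log line quoted earlier in this message:

```python
import re

# OSD warning line taken from the message above.
line = ("ceph-osd: 2015-09-04 14:58:59.168444 7fbc5ec71700 0 log [WRN] : "
        "slow request 30.894936 seconds old, received at "
        "2015-09-04 14:58:28.272976")

# Extract how long the request has been blocked.
m = re.search(r"slow request (\d+\.\d+) seconds old", line)
if m:
    blocked = float(m.group(1))
    print("request blocked for %.1f s" % blocked)
```

Run over a whole log, the same pattern gives a distribution of stall durations per OSD, which can then be lined up against the task abort timestamps.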