Re: [ceph-users] Deadly slow Ceph cluster revisited
What does "ceph status" say? I had a problem with similar symptoms some months ago that was accompanied by OSDs getting marked out for no apparent reason and the cluster going into a HEALTH_WARN state intermittently. Ultimately the root of the problem turned out to be a faulty NIC. Once I took that out of the picture, everything started flying right.

QH

On Fri, Jul 17, 2015 at 8:21 AM, Mark Nelson <mnel...@redhat.com> wrote:
> On 07/17/2015 08:38 AM, J David wrote:
>> This is the same cluster I posted about back in April. Since then, the
>> situation has gotten significantly worse. [snip: iostat output]
>> So, between 0 and 15 reads per second, no write activity, a constant
>> queue depth of 3+, wait times in seconds, and 100% I/O utilization,
>> all for read performance of 100-200K/sec. Even trivial writes can hang
>> for 15-60 seconds before completing. Sometimes this behavior will go
>> away for awhile and it will go back to what we saw in April: 50 IOPS
>> (read or write) and 5-20MB/sec of I/O throughput. But it always
[ceph-users] OSD RAM usage values
Hi all,

I've read in the documentation that OSDs use around 512MB on a healthy
cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram).
Now, our OSDs are all using around 2GB of RAM while the cluster is healthy:

  PID   USER  PR  NI    VIRT     RES    SHR   S  %CPU  %MEM     TIME+ COMMAND
29784   root  20   0  6081276  2.535g   4740  S   0.7   8.1   1346:55 ceph-osd
32818   root  20   0  5417212  2.164g  24780  S  16.2   6.9   1238:55 ceph-osd
25053   root  20   0  5386604  2.159g  27864  S   0.7   6.9   1192:08 ceph-osd
33875   root  20   0  5345288  2.092g   3544  S   0.7   6.7   1188:53 ceph-osd
30779   root  20   0  5474832  2.090g  28892  S   1.0   6.7   1142:29 ceph-osd
22068   root  20   0  5191516  2.000g  28664  S   0.7   6.4  31:56.72 ceph-osd
34932   root  20   0  5242656  1.994g   4536  S   0.3   6.4   1144:48 ceph-osd
26883   root  20   0  5178164  1.938g   6164  S   0.3   6.2   1173:01 ceph-osd
31796   root  20   0  5193308  1.916g  27000  S  16.2   6.1 923:14.87 ceph-osd
25958   root  20   0  5193436  1.901g   2900  S   0.7   6.1   1039:53 ceph-osd
27826   root  20   0  5225764  1.845g   5576  S   1.0   5.9   1031:15 ceph-osd
36011   root  20   0  5111660  1.823g  20512  S  15.9   5.8   1093:01 ceph-osd
19736   root  20   0  2134680  0.994g      0  S   0.3   3.2  46:13.47 ceph-osd

[root@osd003 ~]# ceph status
2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous and experimental features are enabled: keyvaluestore
    cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
     health HEALTH_OK
     monmap e1: 3 mons at {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
            election epoch 58, quorum 0,1,2 mds01,mds02,mds03
     mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
     osdmap e25542: 258 osds: 258 up, 258 in
      pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
            270 TB used, 549 TB / 819 TB avail
                4152 active+clean
                   8 active+clean+scrubbing+deep

We are using erasure code on most of our OSDs, so maybe that is a reason.
But the cache-pool filestore OSDs on 200GB SSDs are also using 2GB of RAM.
Our erasure code pool (16*14 OSDs) has a pg_num of 2048; our cache pool
(2*14 OSDs) has a pg_num of 1024.

Are these normal values for this configuration and is the documentation a
bit outdated, or should we look into something else?

Thank you!
Kenneth
Re: [ceph-users] RGW Malformed Headers
Test complete. Civet still shows the same problem:
https://gist.github.com/spjmurray/88203f564389294b3774

/admin/user?uid=admin is fine; /admin/user?quota&uid=admin&quota-type=user
is not so good. Upgrading to 0.94.2 didn't solve the problem, nor did
9.0.2. Unless anyone knows anything more, I'll go a-hunting and prepare a
patch so it can get fixed for 9.1.x.

On 17 July 2015 at 07:26, Simon Murray <simon.mur...@datacentred.co.uk> wrote:
> The basics are: 14.04, giant, apache with the ceph version of fastCGI.
> I'll spin up a test system in openstack with Civet and see if it
> misbehaves the same way, or whether I need to narrow it down further.
> Chances are if you haven't heard of it I'll need to crack out g++ and
> get my hands dirty.

--
DataCentred Limited registered in England and Wales no. 05611763
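For anyone wanting to reproduce the comparison, a minimal sketch; the
gateway address and civetweb's default port 7480 are assumptions, and the
admin-API auth signing is omitted for brevity (an unsigned request
normally gets 403, but a query-string parse failure shows up regardless):

    curl -v "http://localhost:7480/admin/user?uid=admin"
    curl -v "http://localhost:7480/admin/user?quota&uid=admin&quota-type=user"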
Re: [ceph-users] OSD RAM usage values
On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:
> Hi all,
> I've read in the documentation that OSDs use around 512MB on a healthy
> cluster. Now, our OSDs are all using around 2GB of RAM while the cluster
> is healthy. [snip: top and ceph status output, quoted in full in the
> original post]
> Are these normal values for this configuration and is the documentation
> a bit outdated, or should we look into something else?

2GB of RSS is larger than I would have expected, but not unreasonable. In
particular, I don't think we've gathered numbers on either EC pools or on
the effects of the caching processes.
-Greg
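For collecting comparable numbers, a rough sketch (the process name and
tcmalloc heap profiling support are assumptions):

    # Average resident memory across all ceph-osd processes on a node
    # (RSS is reported in KB, hence the /1024):
    ps -C ceph-osd -o pid=,rss= | \
        awk '{ n++; total += $2 } END { printf "%d OSDs, avg RSS %.0f MB\n", n, total/n/1024 }'
    # tcmalloc heap stats for one OSD, to separate live heap from
    # freed-but-unreturned memory:
    ceph tell osd.0 heap stats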
Re: [ceph-users] Deadly slow Ceph cluster revisited
On 07/17/2015 08:38 AM, J David wrote:
> This is the same cluster I posted about back in April. Since then, the
> situation has gotten significantly worse. [snip: iostat output and
> hardware details, quoted in full in the original post]
> The ceph hardware is all borderline idle. The CPU is 3-5% utilized and
> iostat reports the individual disks hover around 4-7% utilized at any
> given time. It does appear to be using
Re: [ceph-users] 10d
A bit of progress: rm'ing everything from inside current/36.10d_head/
actually let the OSD start and continue deleting other PGs.

Cheers, Dan

On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster <d...@vanderster.com> wrote:
> Thanks for the quick reply. We /could/ just wipe these OSDs and start
> from scratch (the only other pools were 4+2 EC, and recovery already
> brought us to 100% active+clean). But it'd be good to understand and
> prevent this kind of crash...
> Cheers, Dan
>
> On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum <g...@gregs42.com> wrote:
>> I think you'll need to use the ceph-objectstore-tool to remove the
>> PG/data consistently, but I've not done this — David or Sam will need
>> to chime in.
>> -Greg
>>
>> On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster <d...@vanderster.com> wrote:
>>> Hi Greg + list,
>>> Sorry to reply to this old'ish thread, but today one of these PGs bit
>>> us in the ass. Running hammer 0.94.2, we are deleting pool 36, and
>>> OSDs 30, 171, and 69 all crash when trying to delete pg 36.10d. They
>>> all crash with "ENOTEMPTY suggests garbage data in osd data dir".
>>> [snip: crash details and June scrub-mismatch history, quoted in full
>>> in the rest of this thread]
[ceph-users] Deadly slow Ceph cluster revisited
This is the same cluster I posted about back in April. Since then, the
situation has gotten significantly worse. Here is what iostat looks like
for the one active RBD image on this cluster:

Device: rrqm/s wrqm/s   r/s  w/s   rkB/s wkB/s avgrq-sz avgqu-sz   await r_await w_away  svctm  %util
vdb       0.00   0.00 14.10 0.00  685.65  0.00    97.26     3.43  299.40  299.40   0.00   70.92 100.00
vdb       0.00   0.00  1.10 0.00  140.80  0.00   256.00     3.00 2753.09 2753.09   0.00  909.09 100.00
vdb       0.00   0.00 17.40 0.00 2227.20  0.00   256.00     3.00  178.78  178.78   0.00   57.47 100.00
vdb       0.00   0.00  1.30 0.00  166.40  0.00   256.00     3.00 2256.62 2256.62   0.00  769.23 100.00
vdb       0.00   0.00  8.20 0.00 1049.60  0.00   256.00     3.00  362.10  362.10   0.00  121.95 100.00
vdb       0.00   0.00  1.10 0.00  140.80  0.00   256.00     3.00 2517.45 2517.45   0.00  909.45 100.04
vdb       0.00   0.00  1.10 0.00  140.66  0.00   256.00     3.00 2863.64 2863.64   0.00  909.09  99.90
vdb       0.00   0.00  0.70 0.00   89.60  0.00   256.00     3.00 3898.86 3898.86   0.00 1428.57 100.00
vdb       0.00   0.00  0.60 0.00   76.80  0.00   256.00     3.00 5093.33 5093.33   0.00 1666.67 100.00
vdb       0.00   0.00  1.20 0.00  153.60  0.00   256.00     3.00 2568.33 2568.33   0.00  833.33 100.00
vdb       0.00   0.00  1.30 0.00  166.40  0.00   256.00     3.00 2457.85 2457.85   0.00  769.23 100.00
vdb       0.00   0.00 13.90 0.00 1779.20  0.00   256.00     3.00  220.95  220.95   0.00   71.94 100.00
vdb       0.00   0.00  1.00 0.00  128.00  0.00   256.00     3.00 2250.40 2250.40   0.00 1000.00 100.00
vdb       0.00   0.00  1.30 0.00  166.40  0.00   256.00     3.00 2798.77 2798.77   0.00  769.23 100.00
vdb       0.00   0.00  0.90 0.00  115.20  0.00   256.00     3.00 3304.00 3304.00   0.00 1111.11 100.00
vdb       0.00   0.00  0.90 0.00  115.20  0.00   256.00     3.00 3425.33 3425.33   0.00 1111.11 100.00
vdb       0.00   0.00  1.30 0.00  166.40  0.00   256.00     3.00 2290.77 2290.77   0.00  769.23 100.00
vdb       0.00   0.00  4.30 0.00  550.40  0.00   256.00     3.00  721.30  721.30   0.00  232.56 100.00
vdb       0.00   0.00  1.60 0.00  204.80  0.00   256.00     3.00 1894.75 1894.75   0.00  625.00 100.00
vdb       0.00   0.00  1.20 0.00  153.60  0.00   256.00     3.00 2375.00 2375.00   0.00  833.33 100.00
vdb       0.00   0.00  0.90 0.00  115.20  0.00   256.00     3.00 3036.44 3036.44   0.00 1111.11 100.00
vdb       0.00   0.00  1.10 0.00  140.80  0.00   256.00     3.00 3086.18 3086.18   0.00  909.09 100.00
vdb       0.00   0.00  0.90 0.00  115.20  0.00   256.00     3.00 2480.44 2480.44   0.00 1111.11 100.00
vdb       0.00   0.00  1.20 0.00  153.60  0.00   256.00     3.00 3124.33 3124.33   0.00  833.67 100.04
vdb       0.00   0.00  0.80 0.00  102.40  0.00   256.00     3.00 3228.00 3228.00   0.00 1250.00 100.00
vdb       0.00   0.00  1.20 0.00  153.60  0.00   256.00     3.00 2439.33 2439.33   0.00  833.33 100.00
vdb       0.00   0.00  1.30 0.00  166.40  0.00   256.00     3.00 2567.08 2567.08   0.00  769.23 100.00
vdb       0.00   0.00  0.80 0.00  102.40  0.00   256.00     3.00 3023.00 3023.00   0.00 1250.00 100.00
vdb       0.00   0.00  4.80 0.00  614.40  0.00   256.00     3.00  712.50  712.50   0.00  208.33 100.00
vdb       0.00   0.00  1.30 0.00  118.75  0.00   182.69     3.00 2003.69 2003.69   0.00  769.23 100.00
vdb       0.00   0.00 10.50 0.00 1344.00  0.00   256.00     3.00  344.46  344.46   0.00   95.24 100.00

So, between 0 and 15 reads per second, no write activity, a constant queue
depth of 3+, wait times in seconds, and 100% I/O utilization, all for read
performance of 100-200K/sec. Even trivial writes can hang for 15-60
seconds before completing.

Sometimes this behavior will go away for awhile and it will go back to
what we saw in April: 50 IOPS (read or write) and 5-20MB/sec of I/O
throughput. But it always comes back.

The hardware of the ceph cluster is:
- Three ceph nodes
- Two of the ceph nodes have 64GiB RAM and 12 5TB SATA drives
- One of the ceph nodes has 32GiB RAM and 4 5TB SATA drives
- All ceph nodes have Intel E5-2609 v2 (2.50GHz quad-core) CPUs
- Everything is 10GBase-T
- All three nodes running Ceph 0.80.9

The ceph hardware is all borderline idle. The CPU is 3-5% utilized and
iostat reports the individual disks hover around 4-7% utilized at any
given time. It does appear to be using most of the available RAM for OSD
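For reference, output like the above comes from something along these
lines (the device name is taken from the report; intervals are a choice):

    # Extended per-device statistics at 1-second intervals:
    iostat -x vdb 1
    # await/r_await are in milliseconds, so values of 2000-5000 mean
    # each read is taking 2-5 seconds to complete.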
Re: [ceph-users] 10d
Hi Greg + list,

Sorry to reply to this old'ish thread, but today one of these PGs bit us
in the ass. Running hammer 0.94.2, we are deleting pool 36, and OSDs 30,
171, and 69 all crash when trying to delete pg 36.10d. They all crash with
"ENOTEMPTY suggests garbage data in osd data dir" (full log below). There
is indeed some garbage in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9

Do you have any suggestion how to get these OSDs back running? We already
tried manually moving 36.10d_head to 36.10d_head.bak, but then the OSD
crashes for a different reason:

    -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid 36.10d coll 36.10d_head
     0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t, ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17 15:07:42.442902
osd/PG.cc: 2839: FAILED assert(r > 0)

Any clues?

Cheers, Dan

2015-07-17 14:40:54.493935 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) error (39) Directory not empty not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700  0 filestore(/var/lib/ceph/osd/ceph-30) transaction dump:
{ "ops": [
    { "op_num": 0, "op_name": "remove", "collection": "36.10d_head", "oid": "10d\/\/head\/\/36" },
    { "op_num": 1, "op_name": "rmcoll", "collection": "36.10d_head" } ] }
2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17 14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]

On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster <d...@vanderster.com> wrote:
> On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum <g...@gregs42.com> wrote:
>> On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster <d...@vanderster.com> wrote:
>>> Hi,
>>> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
>>> PGs go inconsistent.
>>> First, immediately after we updated the OSDs, PG 34.10d went inconsistent:
>>> 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 : cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes, 0/0 hit_set_archive bytes.
>>> Second, an hour later, 55.10d went inconsistent:
>>> 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 : cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 hit_set_archive bytes.
>>> Then last night 36.10d suffered the same fate:
>>> 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 : cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes, 0/0 hit_set_archive bytes.
>>> In all cases, one object is missing. In all cases, the PG id is 10d.
>>> Is this an epic coincidence or could something else be going on here?
>> I'm betting on something else. What OSDs is each PG mapped to? It
>> looks like each of them is missing one object on some of the OSDs;
>> what are the objects?
> 34.10d: [52,202,218]
> 55.10d: [303,231,65]
> 36.10d: [30,171,69]
> So no common OSDs. I've already repaired all of these PGs, and the logs
> have nothing interesting, so I can't say more about the objects.
> Cheers, Dan
Re: [ceph-users] 10d
I think you'll need to use the ceph-objectstore-tool to remove the PG/data
consistently, but I've not done this — David or Sam will need to chime in.
-Greg

On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster <d...@vanderster.com> wrote:
> Hi Greg + list,
> Sorry to reply to this old'ish thread, but today one of these PGs bit us
> in the ass. Running hammer 0.94.2, we are deleting pool 36, and OSDs 30,
> 171, and 69 all crash when trying to delete pg 36.10d. They all crash
> with "ENOTEMPTY suggests garbage data in osd data dir". [snip: crash
> details and June scrub-mismatch history, quoted in full in the original
> post of this thread]
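A sketch of what that offline removal might look like; the paths are the
defaults and the exact flags should be verified against the installed
hammer version before running anything (stop the OSD first):

    service ceph stop osd.30
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
        --journal-path /var/lib/ceph/osd/ceph-30/journal \
        --pgid 36.10d --op remove
    service ceph start osd.30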
Re: [ceph-users] 10d
Thanks for the quick reply. We /could/ just wipe these OSDs and start from
scratch (the only other pools were 4+2 EC, and recovery already brought us
to 100% active+clean). But it'd be good to understand and prevent this
kind of crash...

Cheers, Dan

On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum <g...@gregs42.com> wrote:
> I think you'll need to use the ceph-objectstore-tool to remove the
> PG/data consistently, but I've not done this — David or Sam will need to
> chime in.
> -Greg
>
> On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster <d...@vanderster.com> wrote:
>> [snip: crash report and June scrub-mismatch history, quoted in full in
>> the earlier messages of this thread]
[ceph-users] Problem re-running dpkg-buildpackages with '-nc' option
Hi all,

I'm trying to rebuild ceph deb packages using 'dpkg-buildpackage -nc'.
Without '-nc' the compilation works fine but obviously takes a long time.
When I add the '-nc' option, I end up with the following issues:

..
./check_version ./.git_version
./.git_version is up to date.
  CXXLD    ceph_rgw_jsonparser
./.libs/libglobal.a(json_spirit_reader.o): In function '~thread_specific_ptr':
/usr/include/boost/thread/tss.hpp:79: undefined reference to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
/usr/include/boost/thread/tss.hpp:79: undefined reference to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
/usr/include/boost/thread/tss.hpp:79: undefined reference to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
/usr/include/boost/thread/tss.hpp:79: undefined reference to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
/usr/include/boost/thread/tss.hpp:79: undefined reference to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79: more undefined references to 'boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
./.libs/libglobal.a(json_spirit_reader.o): In function 'call_once<void (*)()>':
/usr/include/boost/thread/pthread/once_atomic.hpp:145: undefined reference to 'boost::thread_detail::enter_once_region(boost::once_flag&)'
/usr/include/boost/thread/pthread/once_atomic.hpp:157: undefined reference to 'boost::thread_detail::commit_once_region(boost::once_flag&)'
/usr/include/boost/thread/pthread/once_atomic.hpp:153: undefined reference to 'boost::thread_detail::rollback_once_region(boost::once_flag&)'
/usr/include/boost/thread/pthread/once_atomic.hpp:145: undefined reference to 'boost::thread_detail::enter_once_region(boost::once_flag&)'
.

To me this looks like some boost libraries not being linked in. Has
anybody seen something like this?

--
Bartlomiej Swiecki
bartlomiej.swie...@corp.ovh.com
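One way to narrow this down is to confirm which library defines the
missing symbols and what the failing link line actually pulls in. A
sketch; the multiarch library path assumes Ubuntu:

    # Does libboost_thread provide set_tss_data?
    nm -D --defined-only /usr/lib/x86_64-linux-gnu/libboost_thread.so* \
        | c++filt | grep set_tss_data
    # Which -lboost_* flags does the generated Makefile use?
    grep -oh '\-lboost[a-z_]*' src/Makefile | sort -u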
Re: [ceph-users] Deadly slow Ceph cluster revisited
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
<qhart...@direwolfdigital.com> wrote:
> What does ceph status say?

Usually it says everything is cool. However, just now it gave this:

    cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
     health HEALTH_WARN 2 requests are blocked > 32 sec
     monmap e3: 3 mons at {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
            election epoch 176, quorum 0,1,2 f16,f17,f18
     osdmap e10705: 28 osds: 28 up, 28 in
      pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects
            43344 GB used, 86870 GB / 127 TB avail
                4224 active+clean
  client io 368 kB/s rd, 444 kB/s wr, 10 op/s

How do I find out what requests are blocked and why? Resolving whatever
that is seems like a very good next step to troubleshooting this issue.

> I had a problem with similar symptoms some months ago that was
> accompanied by OSDs getting marked out for no apparent reason and the
> cluster going into a HEALTH_WARN state intermittently.

There's no record of any OSDs being marked out at the time of the
problems. (E.g., as shown above, it's currently angry about something,
but all OSDs are up and in.)

Here's the slow-requests-related info from ceph.log from the last 90 minutes:

2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.051626 secs
2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 : [WRN] slow request 30.051626 seconds old, received at 2015-07-17 06:24:44.757909: osd_op(client.32913524.0:3016955 rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size 4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe ack+ondisk+write e10705) v4 currently waiting for subops from 13
2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.105786 secs
2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 : [WRN] slow request 30.105786 seconds old, received at 2015-07-17 06:24:50.174399: osd_op(client.32913524.0:3016970 rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 2.66b14219 ack+ondisk+write e10705) v4 currently waiting for subops from 4
2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 33.389677 secs
2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 : [WRN] slow request 33.389677 seconds old, received at 2015-07-17 06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55 53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v 10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent
2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.255906 secs
2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 : [WRN] slow request 30.255906 seconds old, received at 2015-07-17 06:24:54.967416: osd_op(client.32913524.0:3016982 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3683840] 2.53500e55 ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.574590 secs
2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 : [WRN] slow request 30.574590 seconds old, received at 2015-07-17 06:24:59.649652: osd_op(client.32913524.0:3017153 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size 4194304 write_size 4194304,write 3683840~510464] 2.53500e55 ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:34.624203 osd.23 192.168.19.218:6801/26435 4154 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.596777 secs
2015-07-17 06:25:34.624215 osd.23 192.168.19.218:6801/26435 4155 : [WRN] slow request 30.596777 seconds old, received at 2015-07-17 06:25:04.027362: osd_op(client.32913524.0:3017169 rbd_data.15322ae8944a.0011f35e [set-alloc-hint object_size 4194304 write_size 4194304,write 2963456~1230848] 2.dad14b4 ack+ondisk+write e10705) v4 currently waiting for subops from 8
2015-07-17 06:26:00.932314 osd.15 192.168.19.217:6800/25977 3491 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.549624 secs
2015-07-17 06:26:00.932328 osd.15 192.168.19.217:6800/25977 3492 : [WRN] slow request 30.549624 seconds old, received at 2015-07-17 06:25:30.382612: osd_op(client.32913524.0:3017343 rbd_data.15322ae8944a.0011f381 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~758784] 2.85df8e7d ack+ondisk+write e10705) v4 currently waiting for subops from 16
2015-07-17 06:26:00.817039 osd.16 192.168.19.218:6823/29360 4139 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.433143 secs
2015-07-17 06:26:00.817050 osd.16
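To answer the "which requests and why" question directly, a starting
point (default admin-socket path assumed; osd.16 is taken from the log
above):

    ceph health detail | grep -i blocked
    ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok dump_ops_in_flight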
Re: [ceph-users] Deadly slow Ceph cluster revisited
That looks a lot like what I was seeing initially. The OSDs getting marked
out was relatively rare, and it took a bit before I saw it. I ended up
digging into the logs on the OSDs themselves to discover that they were
getting marked out. The messages were like "So-and-so incorrectly marked
us out", IIRC.

I don't remember the commands exactly, but there are tools for digging
into the specifics of the health of individual PGs. It's something like
"ceph pg dump summary". Someone else may chime in with more details, but
that should be a good google seed. That should help you isolate the OSDs
and PGs that are being problematic.

On the off chance you are having a NIC problem like I was, use ifconfig to
check the error rates on your interfaces. You will likely have some
errors, but if one of them has way more than the others, it's worth
investigating the NIC / cabling. If nothing else it will help eliminate
this as the root of the problem.

On Fri, Jul 17, 2015 at 9:06 AM, J David <j.david.li...@gmail.com> wrote:
> Usually it says everything is cool. However, just now it gave this:
> [snip: ceph status output and slow-request log, quoted in full in the
> previous message]
> There's no record of any OSDs being marked out at the time of the
> problems. (E.g., as shown above, it's currently angry about something,
> but all OSDs are up and in.)
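Concretely, something like this covers both suggestions (interface names
vary per host):

    # Per-PG / per-OSD statistics:
    ceph pg dump summary
    ceph pg dump pgs_brief | head -20
    # Interface error counters; one interface with far more RX/TX errors
    # than the rest points at a NIC or cable:
    ip -s link show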
[ceph-users] RGW Malformed Headers
The basics are: 14.04, giant, apache with the ceph version of fastCGI.
I'll spin up a test system in openstack with Civet and see if it
misbehaves the same way, or whether I need to narrow it down further.
Chances are if you haven't heard of it I'll need to crack out g++ and get
my hands dirty.

--
DataCentred Limited registered in England and Wales no. 05611763
Re: [ceph-users] Deadly slow Ceph cluster revisited
On 07/17/2015 09:55 AM, J David wrote:
> On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson <mnel...@redhat.com> wrote:
>> rados -p <pool> bench 30 write, just to see how it handles 4MB object
>> writes.
> Here's that, from the VM host:
>
> Total time run:         52.062639
> Total writes made:      66
> Write size:             4194304
> Bandwidth (MB/sec):     5.071

Yep, awfully slow!

> Stddev Bandwidth:       11.6312
> Max bandwidth (MB/sec): 80
> Min bandwidth (MB/sec): 0
> Average Latency:        12.436

12 second average latency! Yikes. That does sound like either the network
or one of the disks being very slow. Do you see faster performance during
the first second or two of the rados bench run? That might indicate that
you are backing up on a specific OSD.

> Stddev Latency:         13.6272
> Max latency:            51.6924
> Min latency:            0.073353
>
> Unfortunately I don't know much about how to parse this (other than that
> 5MB/sec writes does match up with our best-case performance in the VM
> guest).

>> If rados bench is also terribly slow, then you might want to start
>> looking for evidence of IO getting hung up on a specific disk or node.
> Thus far, no evidence of that has presented itself. iostat looks good
> on every drive and the nodes are all equally loaded.

Ok. Maybe try some iperf tests between the different OSD nodes in your
cluster, and also from the client to the OSDs.

Thanks!
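A minimal iperf sequence for that (iperf2 syntax; the hostnames are taken
from this thread and are assumptions about where it runs):

    iperf -s                 # on one OSD node, e.g. f16
    iperf -c f16 -t 30 -P 4  # from each other OSD node and from the client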
Re: [ceph-users] Deadly slow Ceph cluster revisited
David - I'm new to Ceph myself, so I can't point out any smoking guns, but
your problem feels like a network issue. I suggest you check all of your
OSD/Mon/Client network interfaces. Check for errors, and check that they
are negotiating the same link speed/type with your switches (if you have
LLDP enabled on your switches, this will help). Most importantly, check
that your MTUs match -- i.e., if you are using jumbo frames (e.g. 9000
MTU) on your hosts, make sure your switches also support that with the
appropriate packet overhead (e.g. 9128). If you have your hosts set to
9000 and your switches to 1500, you'll see this exact behavior...

Hopefully that helps some...

~~shane

On 7/17/15, 8:57 AM, "ceph-users on behalf of J David"
<ceph-users-boun...@lists.ceph.com on behalf of j.david.li...@gmail.com> wrote:

> On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman
> <qhart...@direwolfdigital.com> wrote:
>> That looks a lot like what I was seeing initially. The OSDs getting
>> marked out was relatively rare and it took a bit before I saw it.
>
> Our problem is present most of the time and does not appear confined to
> a specific ceph cluster node or OSD:
>
> $ sudo fgrep 'waiting for subops' ceph.log | sed -e 's/.* v4 //' | sort | uniq -c | sort -n
>       1 currently waiting for subops from 0
>       1 currently waiting for subops from 10
>       1 currently waiting for subops from 11
>       1 currently waiting for subops from 12
>       1 currently waiting for subops from 3
>       1 currently waiting for subops from 7
>       2 currently waiting for subops from 13
>       2 currently waiting for subops from 16
>       2 currently waiting for subops from 4
>       3 currently waiting for subops from 15
>       4 currently waiting for subops from 6
>       4 currently waiting for subops from 8
>       7 currently waiting for subops from 2
>
> Node f16: 0, 2, and 3 (3 out of 4)
> Node f17: 4, 6, 7, 8, 10, 11, 12, 13 and 15 (9 out of 12)
> Node f18: 16 (1 out of 12)
>
> So f18 seems like the odd man out, in that it has *fewer* problems than
> the other two. There are a grand total of 2 RX errors across all the
> interfaces on all three machines. (Each one has dual 10G interfaces
> bonded together as active/failover.)
>
> The OSD log for the worst offender above (2) says:
>
> 2015-07-17 08:52:05.441607 7f562ea0c700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.119568 secs
> 2015-07-17 08:52:05.441622 7f562ea0c700 0 log [WRN] : slow request 30.119568 seconds old, received at 2015-07-17 08:51:35.321991: osd_sub_op(client.32913524.0:3149584 2.249 2792c249/rbd_data.15322ae8944a.0011b487/head//2 [] v 10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started
> 2015-07-17 08:52:43.229770 7f560833f700 0 -- 192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653 pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0 c=0x22be4c60).fault with nothing to send, going to standby
>
> There are a bunch of those "fault with nothing to send, going to
> standby" messages.
>
>> The messages were like "So-and-so incorrectly marked us out" IIRC.
>
> Nothing like that. Nor, with "ceph -w" running constantly, any reference
> to anything being marked out at any point, even when problems are
> severe.
>
> Thanks!
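A quick way to test the MTU theory end-to-end (the interface name and
peer hostname here are assumptions):

    ip link show bond0 | grep mtu
    # Force a non-fragmentable, jumbo-sized ping; 8972 = 9000 minus 28
    # bytes of IP+ICMP headers. 100% loss here while small pings work
    # means something in the path is not passing jumbo frames:
    ping -M do -s 8972 -c 5 f17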
Re: [ceph-users] Dont used fqdns in monmaptool and ceph-mon --mkfs
On 7/16/15, 9:51 PM, "ceph-users on behalf of Goncalo Borges"
<ceph-users-boun...@lists.ceph.com on behalf of gonc...@physics.usyd.edu.au> wrote:
> Once I substituted the fqdn by simply the hostname (without the domain)
> it worked.

Goncalo, I ran into the same problems too, and ended up bailing on the
ceph-deploy tools and manually building my clusters... I eventually
tripped over a piece of documentation on the ceph site that mentions this:

http://ceph.com/docs/v0.71/radosgw/config/#add-a-gateway-configuration-to-ceph

The text: "Note: host must be your machine hostname, not FQDN."

Incidentally, this note doesn't seem to appear in the master docs
anymore... There are a couple of other places where that note is
referenced, but it's definitely something that could be spelled out in
both the Quick and Manual install procedures early on, in big red WARNING
text, IMO.

I'm hoping to get more involved with giving back to the Ceph community,
and will start with documentation -- I hope to submit patches for that
very issue soon...

~~shane
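In ceph.conf terms, the note boils down to something like this (the
section name and hostnames are illustrative):

    [client.radosgw.gateway]
    host = gateway1                 # short name, as printed by `hostname -s`
    # host = gateway1.example.com   # FQDN -- this is what breaks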
Re: [ceph-users] Deadly slow Ceph cluster revisited
Disclaimer: I'm relatively new to ceph and haven't moved into production
with it.

Did you run your bench for 30 seconds? For reference, my bench from a VM
bridged to a 10Gig card, backed by 90x4TB drives, at 30 seconds is:

Total time run:         30.766596
Total writes made:      1979
Write size:             4194304
Bandwidth (MB/sec):     257.292
Stddev Bandwidth:       106.78
Max bandwidth (MB/sec): 420
Min bandwidth (MB/sec): 0
Average Latency:        0.248238
Stddev Latency:         0.723444
Max latency:            10.5275
Min latency:            0.0346015

Latency seems like a huge factor if your 30-second test took 52 seconds.
What kind of 10Gig NICs are you using? I have Mellanox ConnectX-3 cards,
and one node was using an older driver version. I started to experience
the "osd in..out..in.." and "incorrectly marked out from..." behavior
mentioned by Quentin, as well as poor performance. After installing the
newest version of the Mellanox driver, all is running well again.

On Fri, Jul 17, 2015 at 7:55 AM, J David <j.david.li...@gmail.com> wrote:
> On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson <mnel...@redhat.com> wrote:
>> rados -p <pool> bench 30 write, just to see how it handles 4MB object
>> writes.
> Here's that, from the VM host:
> [snip: bench output, quoted in full in the previous message]
> Unfortunately I don't know much about how to parse this (other than that
> 5MB/sec writes does match up with our best-case performance in the VM
> guest).
>
>> If rados bench is also terribly slow, then you might want to start
>> looking for evidence of IO getting hung up on a specific disk or node.
> Thus far, no evidence of that has presented itself. iostat looks good on
> every drive and the nodes are all equally loaded.
> Thanks!
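Checking the installed NIC driver version is a one-liner (the interface
name is an assumption):

    ethtool -i eth2   # shows driver, version, and firmware-version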
Re: [ceph-users] Deadly slow Ceph cluster revisited
I would say use admin socket to find out which part is causing most of the latencies, don't rule out disk anomalies. Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David Sent: Friday, July 17, 2015 8:07 AM To: Quentin Hartman Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Deadly slow Ceph cluster revisited On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: What does ceph status say? Usually it says everything is cool. However just now it gave this: cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7 health HEALTH_WARN 2 requests are blocked 32 sec monmap e3: 3 mons at {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0}, election epoch 176, quorum 0,1,2 f16,f17,f18 osdmap e10705: 28 osds: 28 up, 28 in pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects 43344 GB used, 86870 GB / 127 TB avail 4224 active+clean client io 368 kB/s rd, 444 kB/s wr, 10 op/s How do I find out what requests are blocked and why? Resolving whatever that is seems like a very good next step to troubleshooting this issue. I had a problem with similar symptoms some months ago that was accompanied by OSDs getting marked out for no apparent reason and the cluster going into a HEALTH_WARN state intermittently. There's no record of any OSD's being marked out at the time of problems. (E.g. as shown above, it's currently angry about something, but all OSD's are up in.) Here's the slow requests related info from ceph.log from the last 90 minutes: 2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 : [WRN] 1 slow requests, 1 included below; oldest blocked for 30.051626 secs 2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 : [WRN] slow request 30.051626 seconds old, received at 2015-07-17 06:24:44.757909: osd_op(client.32913524.0:3016955 rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size 4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe ack+ondisk+write e10705) v4 currently waiting for subops from 13 2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 : [WRN] 1 slow requests, 1 included below; oldest blocked for 30.105786 secs 2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 : [WRN] slow request 30.105786 seconds old, received at 2015-07-17 06:24:50.174399: osd_op(client.32913524.0:3016970 rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 2.66b14219 ack+ondisk+write e10705) v4 currently waiting for subops from 4 2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 : [WRN] 1 slow requests, 1 included below; oldest blocked for 33.389677 secs 2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 : [WRN] slow request 33.389677 seconds old, received at 2015-07-17 06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55 53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v 10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent 2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 : [WRN] 1 slow requests, 1 included below; oldest blocked for 30.255906 secs 2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 : [WRN] slow request 30.255906 seconds old, received at 2015-07-17 06:24:54.967416: osd_op(client.32913524.0:3016982 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~3683840] 2.53500e55 ack+ondisk+write e10705) v4 currently waiting for subops from 2 
2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.574590 secs
2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 : [WRN] slow request 30.574590 seconds old, received at 2015-07-17 06:24:59.649652: osd_op(client.32913524.0:3017153 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size 4194304 write_size 4194304,write 3683840~510464] 2.53500e55 ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:34.624203 osd.23 192.168.19.218:6801/26435 4154 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.596777 secs
2015-07-17 06:25:34.624215 osd.23 192.168.19.218:6801/26435 4155 : [WRN] slow request 30.596777 seconds old, received at 2015-07-17 06:25:04.027362: osd_op(client.32913524.0:3017169 rbd_data.15322ae8944a.0011f35e [set-alloc-hint object_size 4194304 write_size 4194304,write 2963456~1230848] 2.dad14b4 ack+ondisk+write e10705) v4 currently waiting for subops from 8
2015-07-17 06:26:00.932314 osd.15 192.168.19.217:6800/25977 3491 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.549624 secs
2015-07-17 06:26:00.932328 osd.15 192.168.19.217:6800/25977 3492 : [WRN] slow request 30.549624 seconds old, received at 2015-07-17 06:25:30.382612:
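[To answer the "how do I find out what requests are blocked" question above: the OSDs named in those [WRN] lines can be interrogated over the admin socket, as Somnath suggests. A minimal sketch, run on the node hosting the OSD; the long --admin-daemon form works on older releases:

    # Which OSDs currently have blocked requests?
    ceph health detail

    # Ops currently in flight on a flagged OSD, with how long each has
    # been waiting and in which state (e.g. "waiting for subops"):
    ceph daemon osd.16 dump_ops_in_flight

    # The slowest recently completed ops, with per-stage timestamps:
    ceph daemon osd.16 dump_historic_ops

    # Equivalent long form against the socket file:
    ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok dump_historic_ops

The per-stage timestamps in dump_historic_ops are what separate "slow journal/disk" from "slow replica over the network".]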
Re: [ceph-users] OSD latency inaccurate reports?
Also, by running "ceph osd perf", I see that fs_apply_latency is larger than fs_commit_latency. Shouldn't that be the opposite? Apply latency is, AFAIK, the time it takes to apply updates to the file system in page cache, and commitcycle latency is the time it takes to flush the cache to disk, right? Again, the statistics fetched from the sockets seem more reasonable: the commitcycle latency is substantially larger than the apply latency, which seems normal to me. Is this a bug or a misunderstanding on my part? Regards, Kostis

On 13 July 2015 at 13:27, Kostis Fardelas dante1...@gmail.com wrote:
Hello, I noticed that the commit/apply latency reported using "ceph pg dump -f json-pretty" is very different from the values reported when querying the OSD sockets. What is your opinion? What are the targets that I should fetch metrics from in order to be as precise as possible?
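[For comparing the two sources side by side, a quick sketch (the counter names are from the filestore section of perf dump; exact layout can vary between releases):

    # Cluster-wide per-OSD snapshot, as used above:
    ceph osd perf

    # The same OSD's own counters over the admin socket; apply_latency
    # and commitcycle_latency each carry an avgcount and a sum:
    ceph daemon osd.0 perf dump | grep -A 3 -E 'apply_latency|commitcycle_latency'

One plausible source of the discrepancy Kostis describes: "ceph osd perf" shows recent values the OSD reported to the monitors, while the socket counters are cumulative averages, so some disagreement between them is expected.]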
Re: [ceph-users] Unsetting osd_crush_chooseleaf_type = 0
Yes, you will need to change "osd" to "host" as you thought, so that copies will be separated between hosts. You will keep running into the problems you are seeing until that is changed. It will cause data movement.
- Robert LeBlanc

On Thu, Jul 16, 2015 at 11:45 AM, Steve Dainard wrote:
I originally built a single-node cluster and added 'osd_crush_chooseleaf_type = 0 #0 is for one node cluster' to ceph.conf (which is now commented out). I've now added a 2nd node; where can I set this value to 1? I see in the crush map that the OSDs are under 'host' buckets and don't see any reference to leaf. Would the cluster automatically rebalance when the 2nd host was added? How can I verify this? The issue right now is that with two hosts, copies = 2, min copies = 1, I cannot access data from client machines when one of the two hosts goes down.

    # begin crush map
    tunable choose_local_tries 0
    tunable choose_local_fallback_tries 0
    tunable choose_total_tries 50
    tunable chooseleaf_descend_once 1

    # devices
    device 0 osd.0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3

    # types
    type 0 osd
    type 1 host
    type 2 chassis
    type 3 rack
    type 4 row
    type 5 pdu
    type 6 pod
    type 7 room
    type 8 datacenter
    type 9 region
    type 10 root

    # buckets
    host ceph1 {
        id -2           # do not change unnecessarily
        # weight 163.350
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 3.630
        item osd.1 weight 3.630
    }
    host ceph2 {
        id -3           # do not change unnecessarily
        # weight 163.350
        alg straw
        hash 0          # rjenkins1
        item osd.2 weight 3.630
        item osd.3 weight 3.630
    }
    root default {
        id -1           # do not change unnecessarily
        # weight 326.699
        alg straw
        hash 0          # rjenkins1
        item ceph1 weight 163.350
        item ceph2 weight 163.350
    }

    # rules
    rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd    <-- should this line be "step chooseleaf firstn 0 type host"?
        step emit
    }
    # end crush map
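[Robert's fix, spelled out as commands — a sketch of the standard decompile/edit/recompile cycle; the file names are arbitrary:

    # Export and decompile the live CRUSH map
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt

    # In crush.txt, change the rule step to:
    #   step chooseleaf firstn 0 type host

    # Recompile and inject; as noted above, this triggers data movement
    crushtool -c crush.txt -o crush-new.bin
    ceph osd setcrushmap -i crush-new.bin

    # Verify the active rule afterwards
    ceph osd crush rule dump replicated_ruleset
]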
Re: [ceph-users] Slow requests during ceph osd boot
Thanks for your answers; we will also experiment with osd recovery max active / threads and will come back to you. Regards, Kostis

On 16 July 2015 at 12:29, Jan Schermer j...@schermer.cz wrote:
For me, setting recovery_delay_start helps during the OSD bootup _sometimes_, but it clearly does something different than what's in the docs. The docs say: "After peering completes, Ceph will delay for the specified number of seconds before starting to recover objects." However, what I see is greatly slowed recovery, not a delayed start of recovery. It seems to basically sleep between recovering the PGs. AFAIK peering is already done unless I was remapping the PGs at the same moment, so I'm not sure what's happening there in reality. We had this set to 20 for some time and recovering after a host restart took close to two hours. With this parameter set to 0, it recovered in less than 30 seconds (and caused no slow requests or anything). So what I usually do is set this to a high number (like 200), and after all the OSDs are started I set it to 0. This does not completely prevent slow requests from happening, but it does somewhat help. Jan

On 15 Jul 2015, at 11:52, Andrey Korolyov and...@xdel.ru wrote:

On Wed, Jul 15, 2015 at 12:15 PM, Jan Schermer j...@schermer.cz wrote:
We have the same problems; we need to start the OSDs slowly. The problem seems to be CPU congestion. A booting OSD will use all the CPU power you give it, and if it doesn't have enough, nasty stuff happens (this might actually be the manifestation of some kind of problem in our setup as well). It doesn't always do that - I was restarting our hosts this weekend and most of them came up fine with a simple "service ceph start", but some just sat there spinning the CPU and not doing any real work (and the cluster was not very happy about that). Jan

On 15 Jul 2015, at 10:53, Kostis Fardelas dante1...@gmail.com wrote:
Hello, after some trial and error we concluded that if we start the 6 stopped OSD daemons with a delay of 1 minute, we do not experience slow requests (the threshold is set at 30 sec), although there are some ops that last up to 10 s, which is already high enough. I assume that if we spread the delay more, the slow requests will vanish. The possibility of not having tuned our setup to the finest detail is not ruled out, but I wonder if we are missing some tuning in terms of Ceph configuration. We run the latest stable firefly version. Regards, Kostis

On 13 July 2015 at 13:28, Kostis Fardelas dante1...@gmail.com wrote:
Hello, after rebooting a ceph node and the OSDs start booting and joining the cluster, we experience slow requests that get resolved immediately after the cluster recovers. It is important to note that before the node reboot we set the noout flag in order to prevent recovery - so there are only degraded PGs when OSDs shut down - and let the cluster handle the OSDs down/up in the lightest way. Is there any tunable we should consider in order to avoid service degradation for our ceph clients? Regards, Kostis

As far as I've seen this problem, the main issue for regular disk-backed OSDs is IOPS starvation during some interval after reading maps from the filestore and marking itself as 'in' - even if in-memory caches are still hot, I/O will significantly degrade for a short period.
The possible workaround for an otherwise healthy cluster and a node-wide restart is to set the norecover flag; it greatly reduces the chance of hitting slow operations. Of course, it is applicable only to a non-empty cluster with tens of percent of average utilization on rotating media. I first pointed out this issue a couple of years ago (it *does* break a 30s I/O SLA for a returning OSD, whereas refilling the same OSDs from scratch would not violate that SLA, at the cost of a far longer completion time for the refill). From the UX side, it would be great to introduce some kind of recovery throttler for newly started OSDs, as recovery_delay_start does not prevent immediate recovery procedures.
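[Putting the thread's suggestions together, a concrete restart recipe might look like this — a sketch assuming an otherwise healthy cluster; the flags and options are standard, the values illustrative:

    # Before rebooting the node: prevent re-replication and recovery
    ceph osd set noout
    ceph osd set norecover

    # Optionally throttle recovery so rejoining OSDs are not starved
    # for IOPS when it is switched back on:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # Reboot the node; once all OSDs are back up and in:
    ceph osd unset norecover
    ceph osd unset noout
]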
Re: [ceph-users] Deadly slow Ceph cluster revisited
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman qhart...@direwolfdigital.com wrote:
That looks a lot like what I was seeing initially. The OSDs getting marked out was relatively rare and it took a bit before I saw it.

Our problem is present most of the time and does not appear confined to a specific ceph cluster node or OSD:

    $ sudo fgrep 'waiting for subops' ceph.log | sed -e 's/.* v4 //' | sort | uniq -c | sort -n
          1 currently waiting for subops from 0
          1 currently waiting for subops from 10
          1 currently waiting for subops from 11
          1 currently waiting for subops from 12
          1 currently waiting for subops from 3
          1 currently waiting for subops from 7
          2 currently waiting for subops from 13
          2 currently waiting for subops from 16
          2 currently waiting for subops from 4
          3 currently waiting for subops from 15
          4 currently waiting for subops from 6
          4 currently waiting for subops from 8
          7 currently waiting for subops from 2

Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13 and 15 (9 out of 12)
Node f18: 16 (1 out of 12)

So f18 seems like the odd man out, in that it has *fewer* problems than the other two. There are a grand total of 2 RX errors across all the interfaces on all three machines. (Each one has dual 10G interfaces bonded together as active/failover.) The OSD log for the worst offender above (2) says:

2015-07-17 08:52:05.441607 7f562ea0c700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.119568 secs
2015-07-17 08:52:05.441622 7f562ea0c700 0 log [WRN] : slow request 30.119568 seconds old, received at 2015-07-17 08:51:35.321991: osd_sub_op(client.32913524.0:3149584 2.249 2792c249/rbd_data.15322ae8944a.0011b487/head//2 [] v 10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started
2015-07-17 08:52:43.229770 7f560833f700 0 -- 192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653 pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0 c=0x22be4c60).fault with nothing to send, going to standby

There are a bunch of those "fault with nothing to send, going to standby" messages.

The messages were like "So-and-so incorrectly marked us out" IIRC.

Nothing like that. Nor, with ceph -w running constantly, any reference to anything being marked out at any point, even when problems are severe. Thanks!
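[To map those OSD ids back to hosts without doing it by hand, something along these lines works (a sketch; "ceph osd tree" alone already shows the host buckets):

    # Host placement of every OSD at a glance
    ceph osd tree

    # Or, for just the implicated OSDs, "ceph osd find" reports each
    # id's CRUSH location, including its host:
    for id in 2 6 8 15; do
        echo "osd.$id:"; ceph osd find $id
    done
]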
Re: [ceph-users] Deadly slow Ceph cluster revisited
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson mnel...@redhat.com wrote:
Maybe try some iperf tests between the different OSD nodes in your cluster and also the client to the OSDs.

This proved to be an excellent suggestion. One of these is not like the others:

    f16 inbound:  6 Gbps
    f16 outbound: 6 Gbps
    f17 inbound:  6 Gbps
    f17 outbound: 6 Gbps
    f18 inbound:  6 Gbps
    f18 outbound: 1.2 Mbps

There is flatly no explanation for the outbound performance on f18. There are no errors in ifconfig/netstat, nothing logged on the switch, etc. Even with tcpdump running during iperf, there aren't retransmits or anything. It's just slow.

ifconfig'ing the primary bond interface down immediately resolved the problem. The iostat running in the virtual machine immediately surged to 500+ IOPS and 40-60 MB/sec. Weirdly, ifconfig'ing the primary device back up did not bring the problem back. It switched back to that interface, but everything is still fine (and iperf gives 6 Gbps) at the moment. There's no way of telling if that will last, but it's a solid lead either way.

It's an onboard Intel dual-port X540 using the ixgbe driver. If it were a driver problem, we've got tons of these, so I'd expect to see this problem elsewhere. If it's a hardware problem, ifconfig down/up doesn't seem like it would fix it. Very mysterious! Thanks!
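[For anyone repeating the test, the iperf pattern is simple — a sketch using this thread's node names; the bond check is specific to an active/backup setup like this one:

    # On the node under test (e.g. f18):
    iperf -s

    # From each peer, measure f18 inbound:
    iperf -c f18 -t 30

    # Then run the server on the peer and connect from f18, to measure
    # f18 outbound - the direction that collapsed to 1.2 Mbps here.

    # With active/backup bonding, also note which slave is active,
    # since ifconfig down/up on the primary changes it:
    cat /proc/net/bonding/bond0
]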
Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?
Other than those errors, do you find RBDs will not be unmapped on system restart/shutdown on a machine using systemd, leaving the system hanging, without network connections, trying to unmap RBDs? That's been my experience thus far, so I wrote an (overly simple) systemd file to handle this on a per-RBD basis.

On Tue, Jul 14, 2015 at 1:15 PM, Bruce McFarland bruce.mcfarl...@taec.toshiba.com wrote:
When starting the rbdmap.service to provide map/unmap of rbd devices across boot/shutdown cycles, /etc/init.d/rbdmap includes /lib/lsb/init-functions. This is not a problem, except that the rbdmap script makes calls to the log_daemon_*, log_progress_*, and log_action_* functions that are included in the Ubuntu 14.04 distro but are not in the RHEL 7.1/RHCS 1.3 distro. Are there any recommended workarounds for boot-time startup on RHEL/CentOS 7.1 clients?
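[Steve's per-RBD unit is not shown; a minimal sketch of the idea might look like this — a hypothetical file where the pool, image, and credentials are made-up placeholders:

    # /etc/systemd/system/rbd-myimage.service (hypothetical example)
    [Unit]
    Description=Map RBD rbd/myimage
    After=network-online.target
    Before=remote-fs-pre.target

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/bin/rbd map rbd/myimage --id admin
    ExecStop=/usr/bin/rbd unmap /dev/rbd/rbd/myimage

    [Install]
    WantedBy=multi-user.target

Because units stop in reverse dependency order, ExecStop runs while the network is still up, which is what avoids the shutdown hang described above.]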
Re: [ceph-users] Deadly slow Ceph cluster revisited
May I suggest also checking the error counters on your network switch? Check speed and duplex. Is bonding in use? Is flow control on? Can you swap the network cable? Can you swap a NIC with another node, and does the problem follow? Hth, Alex

On Friday, July 17, 2015, Steve Thompson s...@vgersoft.com wrote:

On Fri, 17 Jul 2015, J David wrote:

    f16 inbound:  6 Gbps
    f16 outbound: 6 Gbps
    f17 inbound:  6 Gbps
    f17 outbound: 6 Gbps
    f18 inbound:  6 Gbps
    f18 outbound: 1.2 Mbps

Unless the network was very busy when you did this, I think that 6 Gb/s may not be very good either. Usually iperf will give you much more than that. For example, between two of my OSDs, I get 9.4 Gb/s, or up to 9.9 Gb/s when nothing else is happening. Steve
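[Host-side, most of Alex's checklist can be covered with ethtool — a sketch; substitute the actual bond slave names for eth0:

    # Negotiated speed and duplex on each bond slave
    ethtool eth0 | egrep 'Speed|Duplex'

    # Flow control (pause frame) settings
    ethtool -a eth0

    # NIC-side error and drop counters; the switch-side counters
    # still have to be read on the switch itself
    ethtool -S eth0 | egrep -i 'err|drop|pause'
]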
Re: [ceph-users] backing Hadoop with Ceph ??
On 07/15/2015 11:48 AM, Shane Gibson wrote:
Somnath - thanks for the reply ... :-) Haven't tried anything yet - just starting to gather info/input/direction for this solution. Looking at the S3 API info [2], there is no mention of support for the S3a API extensions - namely, rename support. The problem with backing via the S3 API: if you need to rename a large (say multi-GB) data object, you have to copy to the new name and delete the original - a very IO-expensive operation, and something we do a lot of. That in and of itself might be a deal breaker ... Any idea/input/intention of supporting the S3a extensions within the RadosGW S3 API implementation?

I see you're trying out cephfs now, and I think that makes sense. I just wanted to mention that at CDS a couple of weeks ago Yehuda noted that RGW's rename is cheap, since it does not require copying the data, just updating its location [1]. Josh

[1] http://pad.ceph.com/p/hadoop-over-rgw

Plus - it seems like it's considered a bad idea to back Hadoop via S3 (and indirectly Ceph via RGW) [3]; though I'm not sure if the architectural differences between Amazon's S3 implementation and the far superior Ceph make it more palatable? ~~shane

[2] http://ceph.com/docs/master/radosgw/s3/
[3] https://wiki.apache.org/hadoop/AmazonS3

On 7/15/15, 9:50 AM, Somnath Roy somnath@sandisk.com wrote:
Did you try to integrate ceph+rgw+s3 with Hadoop?

On Jul 15, 2015, at 8:58 AM, Shane Gibson shane_gib...@symantec.com wrote:
We are in the (very) early stages of considering testing backing Hadoop via Ceph - as opposed to HDFS. I've seen a few very vague references to doing that, but haven't found any concrete info (architecture, configuration recommendations, gotchas, lessons learned, etc...). I did find the ceph.com docs [1] which discuss use of CephFS for backing Hadoop - but this would be foolish for production clusters, given that CephFS isn't yet considered production quality/grade. Does anyone in the ceph-users community have experience with this that they'd be willing to share? Preferably via use of Ceph - not via CephFS... but I am interested in any CephFS-related experiences too. If we were to do this, and Ceph proved out as a backing store to Hadoop, there is the potential to be creating a fairly large multi-petabyte (100s ??) class backing store for Ceph. We do a very large amount of analytics on a lot of data sets for security trending correlations, etc... Our current Ceph experience is limited to a few small (90 x 4TB OSD size) clusters - which we are working towards putting into production for Glance/Cinder backing and for block storage for various large-storage-need platforms (e.g. software and package repo/mirrors, etc...). Thanks in advance for any input, thoughts, or pointers ... ~~shane

[1] http://ceph.com/docs/master/cephfs/hadoop/
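[For the CephFS route Shane is trying, the setup from the docs he cites [1] boils down to a few core-site.xml properties — a sketch from memory of that page, so verify the property names against [1]; the monitor address is a placeholder:

    <!-- core-site.xml (illustrative; see http://ceph.com/docs/master/cephfs/hadoop/) -->
    <property>
      <name>fs.default.name</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>

This also requires the cephfs-hadoop bindings (the libcephfs Java jar) on the Hadoop classpath.]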
Re: [ceph-users] Deadly slow Ceph cluster revisited
Glad we were able to point you in the right direction! I would suspect a borderline cable at this point. Did you happen to notice if the interface had negotiated down to some dumb speed? If it had: I've seen cases where a dodgy cable caused an intermittent problem that made the link negotiate its speed downward, and then it never tried to come back up until the interface was restarted. FWIW, I'm using the same chipset (not onboard) and driver, and they have been rock solid so far, so I would be skeptical of a driver bug as well. QH

On Fri, Jul 17, 2015 at 2:34 PM, J David j.david.li...@gmail.com wrote: [...]
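[Two quick checks for the negotiated-down theory — a sketch; eth2/eth3 stand in for the real bond slaves:

    # Current negotiated link speed on each slave; a 10GbE port stuck
    # at 1000Mb/s or 100Mb/s would support the bad-cable theory
    ethtool eth2 | grep Speed
    ethtool eth3 | grep Speed

    # ixgbe logs link transitions, so flaps or downshifts show up here
    dmesg | grep -i 'ixgbe.*link'
]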
Re: [ceph-users] Deadly slow Ceph cluster revisited
On Fri, 17 Jul 2015, J David wrote:

    f16 inbound:  6 Gbps
    f16 outbound: 6 Gbps
    f17 inbound:  6 Gbps
    f17 outbound: 6 Gbps
    f18 inbound:  6 Gbps
    f18 outbound: 1.2 Mbps

Unless the network was very busy when you did this, I think that 6 Gb/s may not be very good either. Usually iperf will give you much more than that. For example, between two of my OSDs, I get 9.4 Gb/s, or up to 9.9 Gb/s when nothing else is happening. Steve
Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?
Yes, the RBDs are not remapped at system boot time. I haven't run into a VM or system hang because of this, since I only hit it while investigating RHEL 7.1 as a client distro. And yes, remapping the RBDs in a startup script worked around the issue.

-----Original Message-----
From: Steve Dainard [mailto:sdain...@spd1.com]
Sent: Friday, July 17, 2015 1:59 PM
To: Bruce McFarland
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

Other than those errors, do you find RBDs will not be unmapped on system restart/shutdown on a machine using systemd, leaving the system hanging, without network connections, trying to unmap RBDs? That's been my experience thus far, so I wrote an (overly simple) systemd file to handle this on a per-RBD basis.

On Tue, Jul 14, 2015 at 1:15 PM, Bruce McFarland bruce.mcfarl...@taec.toshiba.com wrote: [...]
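[Since the underlying complaint is just the missing LSB helpers, one blunt workaround is to give the init script no-op fallbacks — a hypothetical sketch; take the exact function names from the warnings the service prints, as they are inferred here from the message above:

    # Add near the top of /etc/init.d/rbdmap, after it sources
    # /lib/lsb/init-functions, so RHEL gets stand-in definitions:
    if ! type log_daemon_msg >/dev/null 2>&1; then
        log_daemon_msg()   { echo -n "$*"; }
        log_progress_msg() { echo -n " $*"; }
        log_action_msg()   { echo "$*"; }
        log_end_msg()      {
            # Mirror the Debian helper: 0 means success
            if [ "$1" -eq 0 ]; then echo " ... done."; else echo " ... failed!"; fi
            return "$1"
        }
    fi
]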