Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
What does ceph status say? I had a problem with similar symptoms some
months ago that was accompanied by OSDs getting marked out for no apparent
reason and the cluster going into a HEALTH_WARN state intermittently.
Ultimately the root of the problem ended up being a faulty NIC. Once I took
that out of the picture everything started flying right.

QH

On Fri, Jul 17, 2015 at 8:21 AM, Mark Nelson mnel...@redhat.com wrote:

 On 07/17/2015 08:38 AM, J David wrote:

 This is the same cluster I posted about back in April.  Since then,
 the situation has gotten significantly worse.

 Here is what iostat looks like for the one active RBD image on this
 cluster:

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 vdb   0.00 0.00   14.100.00   685.65 0.00
 97.26 3.43  299.40  299.400.00  70.92 100.00
 vdb   0.00 0.001.100.00   140.80 0.00
 256.00 3.00 2753.09 2753.090.00 909.09 100.00
 vdb   0.00 0.00   17.400.00  2227.20 0.00
 256.00 3.00  178.78  178.780.00  57.47 100.00
 vdb   0.00 0.001.300.00   166.40 0.00
 256.00 3.00 2256.62 2256.620.00 769.23 100.00
 vdb   0.00 0.008.200.00  1049.60 0.00
 256.00 3.00  362.10  362.100.00 121.95 100.00
 vdb   0.00 0.001.100.00   140.80 0.00
 256.00 3.00 2517.45 2517.450.00 909.45 100.04
 vdb   0.00 0.001.100.00   140.66 0.00
 256.00 3.00 2863.64 2863.640.00 909.09  99.90
 vdb   0.00 0.000.700.0089.60 0.00
 256.00 3.00 3898.86 3898.860.00 1428.57 100.00
 vdb   0.00 0.000.600.0076.80 0.00
 256.00 3.00 5093.33 5093.330.00 1666.67 100.00
 vdb   0.00 0.001.200.00   153.60 0.00
 256.00 3.00 2568.33 2568.330.00 833.33 100.00
 vdb   0.00 0.001.300.00   166.40 0.00
 256.00 3.00 2457.85 2457.850.00 769.23 100.00
 vdb   0.00 0.00   13.900.00  1779.20 0.00
 256.00 3.00  220.95  220.950.00  71.94 100.00
 vdb   0.00 0.001.000.00   128.00 0.00
 256.00 3.00 2250.40 2250.400.00 1000.00 100.00
 vdb   0.00 0.001.300.00   166.40 0.00
 256.00 3.00 2798.77 2798.770.00 769.23 100.00
 vdb   0.00 0.000.900.00   115.20 0.00
 256.00 3.00 3304.00 3304.000.00 1111.11 100.00
 vdb   0.00 0.000.900.00   115.20 0.00
 256.00 3.00 3425.33 3425.330.00 1111.11 100.00
 vdb   0.00 0.001.300.00   166.40 0.00
 256.00 3.00 2290.77 2290.770.00 769.23 100.00
 vdb   0.00 0.004.300.00   550.40 0.00
 256.00 3.00  721.30  721.300.00 232.56 100.00
 vdb   0.00 0.001.600.00   204.80 0.00
 256.00 3.00 1894.75 1894.750.00 625.00 100.00
 vdb   0.00 0.001.200.00   153.60 0.00
 256.00 3.00 2375.00 2375.000.00 833.33 100.00
 vdb   0.00 0.000.900.00   115.20 0.00
 256.00 3.00 3036.44 3036.440.00 1111.11 100.00
 vdb   0.00 0.001.100.00   140.80 0.00
 256.00 3.00 3086.18 3086.180.00 909.09 100.00
 vdb   0.00 0.000.900.00   115.20 0.00
 256.00 3.00 2480.44 2480.440.00 1111.11 100.00
 vdb   0.00 0.001.200.00   153.60 0.00
 256.00 3.00 3124.33 3124.330.00 833.67 100.04
 vdb   0.00 0.000.800.00   102.40 0.00
 256.00 3.00 3228.00 3228.000.00 1250.00 100.00
 vdb   0.00 0.001.200.00   153.60 0.00
 256.00 3.00 2439.33 2439.330.00 833.33 100.00
 vdb   0.00 0.001.300.00   166.40 0.00
 256.00 3.00 2567.08 2567.080.00 769.23 100.00
 vdb   0.00 0.000.800.00   102.40 0.00
 256.00 3.00 3023.00 3023.000.00 1250.00 100.00
 vdb   0.00 0.004.800.00   614.40 0.00
 256.00 3.00  712.50  712.500.00 208.33 100.00
 vdb   0.00 0.001.300.00   118.75 0.00
 182.69 3.00 2003.69 2003.690.00 769.23 100.00
 vdb   0.00 0.00   10.500.00  1344.00 0.00
 256.00 3.00  344.46  344.460.00  95.24 100.00

 So, between 0 and 15 reads per second, no write activity, a constant
 queue depth of 3+, wait times in seconds, and 100% I/O utilization,
 all for read performance of 100-200K/sec.  Even trivial writes can
 hang for 15-60 seconds before completing.

 Sometimes this behavior will go away for awhile and it will go back
 to what we saw in April: 50IOPS (read or write) and 5-20MB/sec of I/O
 throughput.  But it always 

[ceph-users] OSD RAM usage values

2015-07-17 Thread Kenneth Waegeman

Hi all,

I've read in the documentation that OSDs use around 512 MB on a healthy
cluster (http://ceph.com/docs/master/start/hardware-recommendations/#ram).
Now our OSDs are all using around 2 GB of RAM while the cluster
is healthy.



  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29784 root  20   0 6081276 2.535g   4740 S   0.7  8.1   1346:55 ceph-osd
32818 root  20   0 5417212 2.164g  24780 S  16.2  6.9   1238:55 ceph-osd
25053 root  20   0 5386604 2.159g  27864 S   0.7  6.9   1192:08 ceph-osd
33875 root  20   0 5345288 2.092g   3544 S   0.7  6.7   1188:53 ceph-osd
30779 root  20   0 5474832 2.090g  28892 S   1.0  6.7   1142:29 ceph-osd
22068 root  20   0 5191516 2.000g  28664 S   0.7  6.4  31:56.72 ceph-osd
34932 root  20   0 5242656 1.994g   4536 S   0.3  6.4   1144:48 ceph-osd
26883 root  20   0 5178164 1.938g   6164 S   0.3  6.2   1173:01 ceph-osd
31796 root  20   0 5193308 1.916g  27000 S  16.2  6.1 923:14.87 ceph-osd
25958 root  20   0 5193436 1.901g   2900 S   0.7  6.1   1039:53 ceph-osd
27826 root  20   0 5225764 1.845g   5576 S   1.0  5.9   1031:15 ceph-osd
36011 root  20   0 5111660 1.823g  20512 S  15.9  5.8   1093:01 ceph-osd
19736 root  20   0 2134680 0.994g      0 S   0.3  3.2  46:13.47 ceph-osd




[root@osd003 ~]# ceph status
2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following 
dangerous and experimental features are enabled: keyvaluestore
2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following 
dangerous and experimental features are enabled: keyvaluestore

cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
 health HEALTH_OK
 monmap e1: 3 mons at 
{mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}

election epoch 58, quorum 0,1,2 mds01,mds02,mds03
 mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
 osdmap e25542: 258 osds: 258 up, 258 in
  pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
270 TB used, 549 TB / 819 TB avail
4152 active+clean
   8 active+clean+scrubbing+deep


We are using erasure code on most of our OSDs, so maybe that is a
reason. But the cache-pool filestore OSDs on 200 GB SSDs are also using
2 GB of RAM.
Our erasure-coded pool (16*14 OSDs) has a pg_num of 2048; our cache pool
(2*14 OSDs) has a pg_num of 1024.


Are these normal values for this configuration, and is the documentation 
a bit outdated, or should we look into something else?


Thank you!

Kenneth
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Malformed Headers

2015-07-17 Thread Simon Murray
Test complete.  Civet still shows the same problem:

https://gist.github.com/spjmurray/88203f564389294b3774

/admin/user?uid=admin is fine
/admin/user?quota&uid=admin&quota-type=user is not so good.  Upgrading to
0.94.2 didn't solve the problem, nor did 9.0.2.  Unless anyone knows anything
more I'll go a hunting and prepare a patch so it can get fixed for 9.1.x
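
For anyone who just needs to read or adjust quotas while that REST path is broken, the radosgw-admin CLI may work as an interim workaround (a hedged sketch; the uid is an assumption and the exact output fields depend on the version):

  radosgw-admin user info --uid=admin                                   # user_quota / bucket_quota appear in the JSON output
  radosgw-admin quota set --uid=admin --quota-scope=user --max-objects=-1 --max-size=-1

The REST endpoint itself of course still needs the query-string parsing fixed.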

On 17 July 2015 at 07:26, Simon Murray simon.mur...@datacentred.co.uk
wrote:

 The basics are, 14.04, giant, apache with the ceph version of fastCGI.
 I'll spin up a test system in openstack with Civet and see if it misbehaves
 the same way or I need to narrow it down further. Chances are if you
 haven't heard of it I'll need to crack out g++ and get my hands dirty


-- 
DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD RAM usage values

2015-07-17 Thread Gregory Farnum
On Fri, Jul 17, 2015 at 1:13 PM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:
 Hi all,

 I've read in the documentation that OSDs use around 512MB on a healthy
 cluster.(http://ceph.com/docs/master/start/hardware-recommendations/#ram)
 Now, our OSD's are all using around 2GB of RAM memory while the cluster is
 healthy.


   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
 29784 root  20   0 6081276 2.535g   4740 S   0.7  8.1   1346:55 ceph-osd
 32818 root  20   0 5417212 2.164g  24780 S  16.2  6.9   1238:55 ceph-osd
 25053 root  20   0 5386604 2.159g  27864 S   0.7  6.9   1192:08 ceph-osd
 33875 root  20   0 5345288 2.092g   3544 S   0.7  6.7   1188:53 ceph-osd
 30779 root  20   0 5474832 2.090g  28892 S   1.0  6.7   1142:29 ceph-osd
 22068 root  20   0 5191516 2.000g  28664 S   0.7  6.4  31:56.72 ceph-osd
 34932 root  20   0 5242656 1.994g   4536 S   0.3  6.4   1144:48 ceph-osd
 26883 root  20   0 5178164 1.938g   6164 S   0.3  6.2   1173:01 ceph-osd
 31796 root  20   0 5193308 1.916g  27000 S  16.2  6.1 923:14.87 ceph-osd
 25958 root  20   0 5193436 1.901g   2900 S   0.7  6.1   1039:53 ceph-osd
 27826 root  20   0 5225764 1.845g   5576 S   1.0  5.9   1031:15 ceph-osd
 36011 root  20   0 5111660 1.823g  20512 S  15.9  5.8   1093:01 ceph-osd
 19736 root  20   0 2134680 0.994g  0 S   0.3  3.2  46:13.47 ceph-osd



 [root@osd003 ~]# ceph status
 2015-07-17 14:03:13.865063 7f1fde5f0700 -1 WARNING: the following dangerous
 and experimental features are enabled: keyvaluestore
 2015-07-17 14:03:13.887087 7f1fde5f0700 -1 WARNING: the following dangerous
 and experimental features are enabled: keyvaluestore
 cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
  health HEALTH_OK
  monmap e1: 3 mons at
 {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
 election epoch 58, quorum 0,1,2 mds01,mds02,mds03
  mdsmap e17218: 1/1/1 up {0=mds03=up:active}, 1 up:standby
  osdmap e25542: 258 osds: 258 up, 258 in
   pgmap v2460163: 4160 pgs, 4 pools, 228 TB data, 154 Mobjects
 270 TB used, 549 TB / 819 TB avail
 4152 active+clean
8 active+clean+scrubbing+deep


 We are using erasure code on most of our OSDs, so maybe that is a reason.
 But also the cache-pool filestore OSDS on 200GB SSDs are using 2GB of RAM.
 Our erasure code pool (16*14 osds) have a pg_num of 2048; our cache pool
 (2*14 OSDS) has a pg_num of 1024.

 Are these normal values for this configuration, and is the documentation a
 bit outdated, or should we look into something else?

2GB of RSS is larger than I would have expected, but not unreasonable.
In particular I don't think we've gathered numbers on either EC pools
or on the effects of the caching processes.
-Greg
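
If you want to see where the resident memory is actually going, and the OSDs are built against tcmalloc, the built-in heap commands may help (a hedged sketch; the OSD id is an assumption):

  ceph tell osd.0 heap stats      # tcmalloc heap summary for that OSD
  ceph tell osd.0 heap release    # ask tcmalloc to hand freed pages back to the OS

If RSS drops noticeably after a heap release, much of the 2 GB is freed-but-retained allocator memory rather than live data.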
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Mark Nelson

On 07/17/2015 08:38 AM, J David wrote:

This is the same cluster I posted about back in April.  Since then,
the situation has gotten significantly worse.

Here is what iostat looks like for the one active RBD image on this cluster:

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdb   0.00 0.00   14.100.00   685.65 0.00
97.26 3.43  299.40  299.400.00  70.92 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 2753.09 2753.090.00 909.09 100.00
vdb   0.00 0.00   17.400.00  2227.20 0.00
256.00 3.00  178.78  178.780.00  57.47 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2256.62 2256.620.00 769.23 100.00
vdb   0.00 0.008.200.00  1049.60 0.00
256.00 3.00  362.10  362.100.00 121.95 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 2517.45 2517.450.00 909.45 100.04
vdb   0.00 0.001.100.00   140.66 0.00
256.00 3.00 2863.64 2863.640.00 909.09  99.90
vdb   0.00 0.000.700.0089.60 0.00
256.00 3.00 3898.86 3898.860.00 1428.57 100.00
vdb   0.00 0.000.600.0076.80 0.00
256.00 3.00 5093.33 5093.330.00 1666.67 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2568.33 2568.330.00 833.33 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2457.85 2457.850.00 769.23 100.00
vdb   0.00 0.00   13.900.00  1779.20 0.00
256.00 3.00  220.95  220.950.00  71.94 100.00
vdb   0.00 0.001.000.00   128.00 0.00
256.00 3.00 2250.40 2250.400.00 1000.00 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2798.77 2798.770.00 769.23 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3304.00 3304.000.00 1111.11 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3425.33 3425.330.00 1111.11 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2290.77 2290.770.00 769.23 100.00
vdb   0.00 0.004.300.00   550.40 0.00
256.00 3.00  721.30  721.300.00 232.56 100.00
vdb   0.00 0.001.600.00   204.80 0.00
256.00 3.00 1894.75 1894.750.00 625.00 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2375.00 2375.000.00 833.33 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3036.44 3036.440.00 1111.11 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 3086.18 3086.180.00 909.09 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 2480.44 2480.440.00 1111.11 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 3124.33 3124.330.00 833.67 100.04
vdb   0.00 0.000.800.00   102.40 0.00
256.00 3.00 3228.00 3228.000.00 1250.00 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2439.33 2439.330.00 833.33 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2567.08 2567.080.00 769.23 100.00
vdb   0.00 0.000.800.00   102.40 0.00
256.00 3.00 3023.00 3023.000.00 1250.00 100.00
vdb   0.00 0.004.800.00   614.40 0.00
256.00 3.00  712.50  712.500.00 208.33 100.00
vdb   0.00 0.001.300.00   118.75 0.00
182.69 3.00 2003.69 2003.690.00 769.23 100.00
vdb   0.00 0.00   10.500.00  1344.00 0.00
256.00 3.00  344.46  344.460.00  95.24 100.00

So, between 0 and 15 reads per second, no write activity, a constant
queue depth of 3+, wait times in seconds, and 100% I/O utilization,
all for read performance of 100-200K/sec.  Even trivial writes can
hang for 15-60 seconds before completing.

Sometimes this behavior will go away for awhile and it will go back
to what we saw in April: 50IOPS (read or write) and 5-20MB/sec of I/O
throughput.  But it always comes back.

The hardware of the ceph cluster is:
- Three ceph nodes
- Two of the ceph nodes have 64GiB RAM and 12 5TB SATA drives
- One of the ceph nodes has 32GiB RAM and 4 5TB SATA drives
- All ceph nodes have Intel E5-2609 v2 (2.50Ghz quad-core) CPUs
- Everything is 10GBase-T
- All three nodes running Ceph 0.80.9

The ceph hardware is all borderline idle.  The CPU is 3-5% utilized
and iostat reports the individual disks hover around 4-7% utilized at
any given time.  It does appear to be using 

Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
A bit of progress: rm'ing everything from inside current/36.10d_head/
actually let the OSD start and continue deleting other PGs.

Cheers, Dan
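
For reference, a hedged sketch of that cleanup step (OSD path assumed; only reasonable here because pool 36 is being deleted anyway, so the PG's contents are disposable):

  # with the OSD stopped
  find /var/lib/ceph/osd/ceph-30/current/36.10d_head/ -mindepth 1 -delete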

On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster d...@vanderster.com wrote:
 Thanks for the quick reply.

 We /could/ just wipe these OSDs and start from scratch (the only other
 pools were 4+2 ec and recovery already brought us to 100%
 active+clean).

 But it'd be good to understand and prevent this kind of crash...

 Cheers, Dan




 On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum g...@gregs42.com wrote:
 I think you'll need to use the ceph-objectstore-tool to remove the
 PG/data consistently, but I've not done this — David or Sam will need
 to chime in.
 -Greg

 On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi Greg + list,

 Sorry to reply to this old'ish thread, but today one of these PGs bit
 us in the ass.

 Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171,
 and 69 all crash when trying to delete pg 36.10d. They all crash with

ENOTEMPTY suggests garbage data in osd data dir

 (full log below). There is indeed some garbage in there:

 # find 36.10d_head/
 36.10d_head/
 36.10d_head/DIR_D
 36.10d_head/DIR_D/DIR_0
 36.10d_head/DIR_D/DIR_0/DIR_1
 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
 36.10d_head/DIR_D/DIR_0/DIR_9


 Do you have any suggestion how to get these OSDs back running? We
 already tried manually moving 36.10d_head to 36.10d_head.bak but then
 the OSD crashes for a different reason:

 -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
 36.10d coll 36.10d_head
  0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
 function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
 ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
 15:07:42.442902
 osd/PG.cc: 2839: FAILED assert(r  0)


 Any clues?

 Cheers, Dan

 2015-07-17 14:40:54.493935 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
 not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
 from 0)
 2015-07-17 14:40:54.494019 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
 in osd data dir
 2015-07-17 14:40:54.494021 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
 {
ops: [
{
op_num: 0,
op_name: remove,
collection: 36.10d_head,
oid: 10d\/\/head\/\/36
},
{
op_num: 1,
op_name: rmcoll,
collection: 36.10d_head
}
]
 }

 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
 14:40:54.502996
 os/FileStore.cc: 2757: FAILED assert(0 == unexpected error)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long,
 ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*,
 ThreadPool::TPHandle)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]

 On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs PG 34.10d went inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one object is missing. In all cases, the PG id is 10d.
 Is this an epic coincidence or could something else going on here?

 I'm betting on something else. What OSDs is each PG mapped 

[ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
This is the same cluster I posted about back in April.  Since then,
the situation has gotten significantly worse.

Here is what iostat looks like for the one active RBD image on this cluster:

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdb   0.00 0.00   14.100.00   685.65 0.00
97.26 3.43  299.40  299.400.00  70.92 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 2753.09 2753.090.00 909.09 100.00
vdb   0.00 0.00   17.400.00  2227.20 0.00
256.00 3.00  178.78  178.780.00  57.47 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2256.62 2256.620.00 769.23 100.00
vdb   0.00 0.008.200.00  1049.60 0.00
256.00 3.00  362.10  362.100.00 121.95 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 2517.45 2517.450.00 909.45 100.04
vdb   0.00 0.001.100.00   140.66 0.00
256.00 3.00 2863.64 2863.640.00 909.09  99.90
vdb   0.00 0.000.700.0089.60 0.00
256.00 3.00 3898.86 3898.860.00 1428.57 100.00
vdb   0.00 0.000.600.0076.80 0.00
256.00 3.00 5093.33 5093.330.00 1666.67 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2568.33 2568.330.00 833.33 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2457.85 2457.850.00 769.23 100.00
vdb   0.00 0.00   13.900.00  1779.20 0.00
256.00 3.00  220.95  220.950.00  71.94 100.00
vdb   0.00 0.001.000.00   128.00 0.00
256.00 3.00 2250.40 2250.400.00 1000.00 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2798.77 2798.770.00 769.23 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3304.00 3304.000.00 1111.11 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3425.33 3425.330.00 1111.11 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2290.77 2290.770.00 769.23 100.00
vdb   0.00 0.004.300.00   550.40 0.00
256.00 3.00  721.30  721.300.00 232.56 100.00
vdb   0.00 0.001.600.00   204.80 0.00
256.00 3.00 1894.75 1894.750.00 625.00 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2375.00 2375.000.00 833.33 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 3036.44 3036.440.00 1111.11 100.00
vdb   0.00 0.001.100.00   140.80 0.00
256.00 3.00 3086.18 3086.180.00 909.09 100.00
vdb   0.00 0.000.900.00   115.20 0.00
256.00 3.00 2480.44 2480.440.00 1111.11 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 3124.33 3124.330.00 833.67 100.04
vdb   0.00 0.000.800.00   102.40 0.00
256.00 3.00 3228.00 3228.000.00 1250.00 100.00
vdb   0.00 0.001.200.00   153.60 0.00
256.00 3.00 2439.33 2439.330.00 833.33 100.00
vdb   0.00 0.001.300.00   166.40 0.00
256.00 3.00 2567.08 2567.080.00 769.23 100.00
vdb   0.00 0.000.800.00   102.40 0.00
256.00 3.00 3023.00 3023.000.00 1250.00 100.00
vdb   0.00 0.004.800.00   614.40 0.00
256.00 3.00  712.50  712.500.00 208.33 100.00
vdb   0.00 0.001.300.00   118.75 0.00
182.69 3.00 2003.69 2003.690.00 769.23 100.00
vdb   0.00 0.00   10.500.00  1344.00 0.00
256.00 3.00  344.46  344.460.00  95.24 100.00

So, between 0 and 15 reads per second, no write activity, a constant
queue depth of 3+, wait times in seconds, and 100% I/O utilization,
all for read performance of 100-200K/sec.  Even trivial writes can
hang for 15-60 seconds before completing.

Sometimes this behavior will go away for awhile and it will go back
to what we saw in April: 50IOPS (read or write) and 5-20MB/sec of I/O
throughput.  But it always comes back.

The hardware of the ceph cluster is:
- Three ceph nodes
- Two of the ceph nodes have 64GiB RAM and 12 5TB SATA drives
- One of the ceph nodes has 32GiB RAM and 4 5TB SATA drives
- All ceph nodes have Intel E5-2609 v2 (2.50Ghz quad-core) CPUs
- Everything is 10GBase-T
- All three nodes running Ceph 0.80.9

The ceph hardware is all borderline idle.  The CPU is 3-5% utilized
and iostat reports the individual disks hover around 4-7% utilized at
any given time.  It does appear to be using most of the available RAM
for OSD 

Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Hi Greg + list,

Sorry to reply to this old'ish thread, but today one of these PGs bit
us in the ass.

Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171,
and 69 all crash when trying to delete pg 36.10d. They all crash with

   ENOTEMPTY suggests garbage data in osd data dir

(full log below). There is indeed some garbage in there:

# find 36.10d_head/
36.10d_head/
36.10d_head/DIR_D
36.10d_head/DIR_D/DIR_0
36.10d_head/DIR_D/DIR_0/DIR_1
36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
36.10d_head/DIR_D/DIR_0/DIR_9


Do you have any suggestion how to get these OSDs back running? We
already tried manually moving 36.10d_head to 36.10d_head.bak but then
the OSD crashes for a different reason:

-1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
36.10d coll 36.10d_head
 0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
15:07:42.442902
osd/PG.cc: 2839: FAILED assert(r  0)


Any clues?

Cheers, Dan

2015-07-17 14:40:54.493935 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
from 0)
2015-07-17 14:40:54.494019 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
in osd data dir
2015-07-17 14:40:54.494021 7f0ba60f4700  0
filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "36.10d_head",
            "oid": "10d\/\/head\/\/36"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "36.10d_head"
        }
    ]
}

2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int,
ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
14:40:54.502996
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
std::allocator<ObjectStore::Transaction*> >&, unsigned long,
ThreadPool::TPHandle*)+0x64) [0x97d794]
3: (FileStore::_do_op(FileStore::OpSequencer*,
ThreadPool::TPHandle&)+0x2a0) [0x97da50]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
6: /lib64/libpthread.so.0() [0x3fbec079d1]
7: (clone()+0x6d) [0x3fbe8e88fd]

On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com wrote:
 On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs PG 34.10d went inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one object is missing. In all cases, the PG id is 10d.
 Is this an epic coincidence or could something else going on here?

 I'm betting on something else. What OSDs is each PG mapped to?
 It looks like each of them is missing one object on some of the OSDs,
 what are the objects?

 34.10d: [52,202,218]
 55.10d: [303,231,65]
 36.10d: [30,171,69]

 So no common OSDs. I've already repaired all of these PGs, and logs
 have nothing interesting, so I can't say more about the objects.

 Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10d

2015-07-17 Thread Gregory Farnum
I think you'll need to use the ceph-objectstore-tool to remove the
PG/data consistently, but I've not done this — David or Sam will need
to chime in.
-Greg
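
An untested sketch of the ceph-objectstore-tool invocation Greg is referring to (data and journal paths are assumptions; run it only with the OSD stopped):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
      --journal-path /var/lib/ceph/osd/ceph-30/journal \
      --op remove --pgid 36.10d

That should drop the PG and its objects in one consistent operation instead of hand-editing the filestore directories.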

On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com wrote:
 Hi Greg + list,

 Sorry to reply to this old'ish thread, but today one of these PGs bit
 us in the ass.

 Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171,
 and 69 all crash when trying to delete pg 36.10d. They all crash with

ENOTEMPTY suggests garbage data in osd data dir

 (full log below). There is indeed some garbage in there:

 # find 36.10d_head/
 36.10d_head/
 36.10d_head/DIR_D
 36.10d_head/DIR_D/DIR_0
 36.10d_head/DIR_D/DIR_0/DIR_1
 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
 36.10d_head/DIR_D/DIR_0/DIR_9


 Do you have any suggestion how to get these OSDs back running? We
 already tried manually moving 36.10d_head to 36.10d_head.bak but then
 the OSD crashes for a different reason:

 -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
 36.10d coll 36.10d_head
  0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
 function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
 ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
 15:07:42.442902
 osd/PG.cc: 2839: FAILED assert(r  0)


 Any clues?

 Cheers, Dan

 2015-07-17 14:40:54.493935 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
 not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
 from 0)
 2015-07-17 14:40:54.494019 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
 in osd data dir
 2015-07-17 14:40:54.494021 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
 {
ops: [
{
op_num: 0,
op_name: remove,
collection: 36.10d_head,
oid: 10d\/\/head\/\/36
},
{
op_num: 1,
op_name: rmcoll,
collection: 36.10d_head
}
]
 }

 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
 14:40:54.502996
 os/FileStore.cc: 2757: FAILED assert(0 == unexpected error)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long,
 ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*,
 ThreadPool::TPHandle)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]

 On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs PG 34.10d went inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one object is missing. In all cases, the PG id is 10d.
 Is this an epic coincidence or could something else going on here?

 I'm betting on something else. What OSDs is each PG mapped to?
 It looks like each of them is missing one object on some of the OSDs,
 what are the objects?

 34.10d: [52,202,218]
 55.10d: [303,231,65]
 36.10d: [30,171,69]

 So no common OSDs. I've already repaired all of these PGs, and logs
 have nothing interesting, so I can't say more about the objects.

 Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10d

2015-07-17 Thread Dan van der Ster
Thanks for the quick reply.

We /could/ just wipe these OSDs and start from scratch (the only other
pools were 4+2 ec and recovery already brought us to 100%
active+clean).

But it'd be good to understand and prevent this kind of crash...

Cheers, Dan




On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum g...@gregs42.com wrote:
 I think you'll need to use the ceph-objectstore-tool to remove the
 PG/data consistently, but I've not done this — David or Sam will need
 to chime in.
 -Greg

 On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com wrote:
 Hi Greg + list,

 Sorry to reply to this old'ish thread, but today one of these PGs bit
 us in the ass.

 Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171,
 and 69 all crash when trying to delete pg 36.10d. They all crash with

ENOTEMPTY suggests garbage data in osd data dir

 (full log below). There is indeed some garbage in there:

 # find 36.10d_head/
 36.10d_head/
 36.10d_head/DIR_D
 36.10d_head/DIR_D/DIR_0
 36.10d_head/DIR_D/DIR_0/DIR_1
 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
 36.10d_head/DIR_D/DIR_0/DIR_9


 Do you have any suggestion how to get these OSDs back running? We
 already tried manually moving 36.10d_head to 36.10d_head.bak but then
 the OSD crashes for a different reason:

 -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
 36.10d coll 36.10d_head
  0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
 function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
 ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
 15:07:42.442902
 osd/PG.cc: 2839: FAILED assert(r  0)


 Any clues?

 Cheers, Dan

 2015-07-17 14:40:54.493935 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
 not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
 from 0)
 2015-07-17 14:40:54.494019 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
 in osd data dir
 2015-07-17 14:40:54.494021 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
 {
ops: [
{
op_num: 0,
op_name: remove,
collection: 36.10d_head,
oid: 10d\/\/head\/\/36
},
{
op_num: 1,
op_name: rmcoll,
collection: 36.10d_head
}
]
 }

 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
 14:40:54.502996
 os/FileStore.cc: 2757: FAILED assert(0 == unexpected error)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long,
 ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*,
 ThreadPool::TPHandle)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]

 On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs PG 34.10d went inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one object is missing. In all cases, the PG id is 10d.
 Is this an epic coincidence or could something else going on here?

 I'm betting on something else. What OSDs is each PG mapped to?
 It looks like each of them is missing one object on some of the OSDs,
 what are the objects?

 34.10d: [52,202,218]
 55.10d: [303,231,65]
 36.10d: [30,171,69]

 So no common OSDs. I've already repaired all of these PGs, and 

[ceph-users] Problem re-running dpkg-buildpackages with '-nc' option

2015-07-17 Thread Bartłomiej Święcki
Hi all,

I'm trying to rebuild the Ceph deb packages using 'dpkg-buildpackage -nc'.
Without '-nc' the compilation works fine but obviously takes a long time.
When I add the '-nc' option, I end up with the following issues:

 ..
 ./check_version ./.git_version
 ./.git_version is up to date.
   CXXLDceph_rgw_jsonparser
 ./.libs/libglobal.a(json_spirit_reader.o): In function `~thread_specific_ptr':
 /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 /usr/include/boost/thread/tss.hpp:79: undefined reference to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)'
 ./.libs/libglobal.a(json_spirit_reader.o):/usr/include/boost/thread/tss.hpp:79: more undefined references to `boost::detail::set_tss_data(void const*, boost::shared_ptr<boost::detail::tss_cleanup_function>, void*, bool)' follow
 ./.libs/libglobal.a(json_spirit_reader.o): In function `call_once<void (*)()>':
 /usr/include/boost/thread/pthread/once_atomic.hpp:145: undefined reference to `boost::thread_detail::enter_once_region(boost::once_flag&)'
 /usr/include/boost/thread/pthread/once_atomic.hpp:157: undefined reference to `boost::thread_detail::commit_once_region(boost::once_flag&)'
 /usr/include/boost/thread/pthread/once_atomic.hpp:153: undefined reference to `boost::thread_detail::rollback_once_region(boost::once_flag&)'
 /usr/include/boost/thread/pthread/once_atomic.hpp:145: undefined reference to `boost::thread_detail::enter_once_region(boost::once_flag&)'
 .

To me this looks like some Boost libraries not being linked in.

Has anybody seen something like that ?
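
The missing symbols are all from Boost.Thread, so a first check is whether they exist in the installed library and are simply absent from the link line (a hedged sketch; the library path is an assumption):

  nm -D /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.* 2>/dev/null | grep -E 'set_tss_data|once_region'

If they show up there, the resumed '-nc' build is most likely just not passing -lboost_thread (and -lboost_system) when linking ceph_rgw_jsonparser.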


-- 
Bartlomiej Swiecki bartlomiej.swie...@corp.ovh.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 What does ceph status say?

Usually it says everything is cool.  However just now it gave this:

cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
 health HEALTH_WARN 2 requests are blocked > 32 sec
 monmap e3: 3 mons at
{f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
election epoch 176, quorum 0,1,2 f16,f17,f18
 osdmap e10705: 28 osds: 28 up, 28 in
  pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects
43344 GB used, 86870 GB / 127 TB avail
4224 active+clean
  client io 368 kB/s rd, 444 kB/s wr, 10 op/s

How do I find out what requests are blocked and why?  Resolving
whatever that is seems like a very good next step to troubleshooting
this issue.
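
A couple of firefly-era commands that may answer that (a hedged sketch; the OSD id and socket path are assumptions):

  ceph health detail                                                      # names the OSDs that currently have blocked/slow requests
  ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok dump_ops_in_flight   # shows what those ops are waiting on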

 I had a problem with similar symptoms some
 months ago that was accompanied by OSDs getting marked out for no apparent
 reason and the cluster going into a HEALTH_WARN state intermittently.

There's no record of any OSDs being marked out at the time of
problems.  (E.g. as shown above, it's currently angry about something,
but all OSDs are up & in.)

Here's the slow requests related info from ceph.log from the last 90 minutes:

2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.051626 secs
2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 :
[WRN] slow request 30.051626 seconds old, received at 2015-07-17
06:24:44.757909: osd_op(client.32913524.0:3016955
rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size
4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe
ack+ondisk+write e10705) v4 currently waiting for subops from 13
2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.105786 secs
2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 :
[WRN] slow request 30.105786 seconds old, received at 2015-07-17
06:24:50.174399: osd_op(client.32913524.0:3016970
rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size
4194304 write_size 4194304,write 0~4194304] 2.66b14219
ack+ondisk+write e10705) v4 currently waiting for subops from 4
2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
33.389677 secs
2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 :
[WRN] slow request 33.389677 seconds old, received at 2015-07-17
06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55
53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v
10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent
2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.255906 secs
2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 :
[WRN] slow request 30.255906 seconds old, received at 2015-07-17
06:24:54.967416: osd_op(client.32913524.0:3016982
rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
4194304 write_size 4194304,write 0~3683840] 2.53500e55
ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.574590 secs
2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 :
[WRN] slow request 30.574590 seconds old, received at 2015-07-17
06:24:59.649652: osd_op(client.32913524.0:3017153
rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
4194304 write_size 4194304,write 3683840~510464] 2.53500e55
ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:34.624203 osd.23 192.168.19.218:6801/26435 4154 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.596777 secs
2015-07-17 06:25:34.624215 osd.23 192.168.19.218:6801/26435 4155 :
[WRN] slow request 30.596777 seconds old, received at 2015-07-17
06:25:04.027362: osd_op(client.32913524.0:3017169
rbd_data.15322ae8944a.0011f35e [set-alloc-hint object_size
4194304 write_size 4194304,write 2963456~1230848] 2.dad14b4
ack+ondisk+write e10705) v4 currently waiting for subops from 8
2015-07-17 06:26:00.932314 osd.15 192.168.19.217:6800/25977 3491 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.549624 secs
2015-07-17 06:26:00.932328 osd.15 192.168.19.217:6800/25977 3492 :
[WRN] slow request 30.549624 seconds old, received at 2015-07-17
06:25:30.382612: osd_op(client.32913524.0:3017343
rbd_data.15322ae8944a.0011f381 [set-alloc-hint object_size
4194304 write_size 4194304,write 0~758784] 2.85df8e7d ack+ondisk+write
e10705) v4 currently waiting for subops from 16
2015-07-17 06:26:00.817039 osd.16 192.168.19.218:6823/29360 4139 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.433143 secs
2015-07-17 06:26:00.817050 osd.16 

Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
That looks a lot like what I was seeing initially. The OSDs getting marked
out was relatively rare and it took a bit before I saw it. I ended up
digging into the logs on the OSDs themselves to discover that they were
getting marked out. The messages were like So-and-so incorrectly marked us
out IIRC.

I don't remember the commands exactly, but there are tools for digging into
specifics of the health of individual pgs. It's something like ceph pg
dump summary. Someone else may chime in more details, but that should be a
good google seed. That should help you isolate the OSDs and PGs that are
being problematic.
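
A hedged sketch of the commands Quentin is thinking of (the pgid is a placeholder):

  ceph pg dump_stuck           # any stuck / unclean / stale PGs
  ceph pg dump summary         # cluster-wide PG and I/O totals
  ceph pg <pgid> query         # detailed state and recent events for one placement group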

On the off chance you are having a NIC problem like I was, use ifconfig to
check the error rates on your interfaces. You will likely have some errors,
but if one of them has way more than the others, it's worth investigating
the NIC / cabling. If nothing else it will help eliminate this as the root
of the problem.
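
For the interface check, something like this on each node (a hedged sketch; interface names are assumptions):

  ip -s link show eth0                     # RX/TX error and drop counters
  ethtool -S eth0 | grep -iE 'err|drop'    # NIC/driver-level counters, where the driver exposes them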

On Fri, Jul 17, 2015 at 9:06 AM, J David j.david.li...@gmail.com wrote:

 On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman
 qhart...@direwolfdigital.com wrote:
  What does ceph status say?

 Usually it says everything is cool.  However just now it gave this:

 cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
  health HEALTH_WARN 2 requests are blocked > 32 sec
  monmap e3: 3 mons at
 {f16=
192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
 election epoch 176, quorum 0,1,2 f16,f17,f18
  osdmap e10705: 28 osds: 28 up, 28 in
   pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects
 43344 GB used, 86870 GB / 127 TB avail
 4224 active+clean
   client io 368 kB/s rd, 444 kB/s wr, 10 op/s

 How do I find out what requests are blocked and why?  Resolving
 whatever that is seems like a very good next step to troubleshooting
 this issue.

  I had a problem with similar symptoms some
  months ago that was accompanied by OSDs getting marked out for no
apparent
  reason and the cluster going into a HEALTH_WARN state intermittently.

 There's no record of any OSD's being marked out at the time of
 problems.  (E.g. as shown above, it's currently angry about something,
 but all OSD's are up  in.)

 Here's the slow requests related info from ceph.log from the last 90
minutes:

 2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 30.051626 secs
 2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 :
 [WRN] slow request 30.051626 seconds old, received at 2015-07-17
 06:24:44.757909: osd_op(client.32913524.0:3016955
 rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size
 4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe
 ack+ondisk+write e10705) v4 currently waiting for subops from 13
 2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 30.105786 secs
 2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 :
 [WRN] slow request 30.105786 seconds old, received at 2015-07-17
 06:24:50.174399: osd_op(client.32913524.0:3016970
 rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size
 4194304 write_size 4194304,write 0~4194304] 2.66b14219
 ack+ondisk+write e10705) v4 currently waiting for subops from 4
 2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 33.389677 secs
 2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 :
 [WRN] slow request 33.389677 seconds old, received at 2015-07-17
 06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55
 53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v
 10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent
 2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 30.255906 secs
 2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 :
 [WRN] slow request 30.255906 seconds old, received at 2015-07-17
 06:24:54.967416: osd_op(client.32913524.0:3016982
 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
 4194304 write_size 4194304,write 0~3683840] 2.53500e55
 ack+ondisk+write e10705) v4 currently waiting for subops from 2
 2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 30.574590 secs
 2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 :
 [WRN] slow request 30.574590 seconds old, received at 2015-07-17
 06:24:59.649652: osd_op(client.32913524.0:3017153
 rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
 4194304 write_size 4194304,write 3683840~510464] 2.53500e55
 ack+ondisk+write e10705) v4 currently waiting for subops from 2
 2015-07-17 06:25:34.624203 osd.23 192.168.19.218:6801/26435 4154 :
 [WRN] 1 slow requests, 1 included below; oldest blocked for 
 

[ceph-users] RGW Malformed Headers

2015-07-17 Thread Simon Murray
The basics are: 14.04, Giant, Apache with the Ceph version of FastCGI.
I'll spin up a test system in OpenStack with Civet and see if it misbehaves
the same way, or whether I need to narrow it down further. Chances are, if you
haven't heard of it, I'll need to crack out g++ and get my hands dirty.

-- 
DataCentred Limited registered in England and Wales no. 05611763
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Mark Nelson



On 07/17/2015 09:55 AM, J David wrote:

On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson mnel...@redhat.com wrote:

rados -p <pool> bench 30 write

just to see how it handles 4MB object writes.


Here's that, from the VM host:

  Total time run: 52.062639
Total writes made:  66
Write size: 4194304
Bandwidth (MB/sec): 5.071


Yep, awfully slow!



Stddev Bandwidth:   11.6312
Max bandwidth (MB/sec): 80
Min bandwidth (MB/sec): 0
Average Latency:12.436


12 second average latency!  Yikes.  That does either sound like network 
or one of the disks is very slow.  Do you see faster performance during 
the first second or two of the rados bench run?  That might indicate 
that you are backing up on a specific OSD.
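
One cheap way to look for a single OSD backing everything up (hedged; this should be available on firefly):

  ceph osd perf     # fs_commit_latency / fs_apply_latency per OSD; one large outlier often means a sick disk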



Stddev Latency: 13.6272
Max latency:51.6924
Min latency:0.073353

Unfortunately I don't know much about how to parse this (other than
5MB/sec writes does match up with our best-case performance in the VM
guest).


If rados bench is
also terribly slow, then you might want to start looking for evidence of IO
getting hung up on a specific disk or node.


Thusfar, no evidence of that has presented itself.  iostat looks good
on every drive and the nodes are all equally loaded.


Ok.  Maybe try some iperf tests between the different OSD nodes in your 
cluster and also the client to the OSDs.
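
A hedged sketch of the iperf test (host names and IPs are placeholders):

  # on one OSD node
  iperf -s
  # from each of the other OSD nodes, and from the VM host
  iperf -c <osd-node-ip> -t 30

Anything far below ~9 Gbit/s on a 10GbE link, or wildly asymmetric results between node pairs, would point at the network.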




Thanks!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Shane Gibson


David - I'm new to Ceph myself, so I can't point out any smoking guns - but
your problem feels like a network issue.  I suggest you check all of
your OSD/mon/client network interfaces.  Check for errors, check that
they are negotiating the same link speed/type with your switches (if you
have LLDP enabled on your switches, this will help), and most importantly
check that the MTU matches end to end.  I.e., if you are using jumbo frames
(e.g. 9000 MTU) on your hosts, make sure your switches *also* support that
with appropriate packet overhead (e.g. 9128).  If you have your hosts set to
9000 - and your switches to 1500 - you'll see this exact behavior...
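
A quick way to test for exactly that mismatch (hedged; interface name and peer IP are assumptions):

  ip link show eth0 | grep mtu       # confirm the configured MTU
  ping -M do -s 8972 <peer-ip>       # 9000-byte MTU minus 28 bytes of IP/ICMP header; if this fails
                                     # while small pings work, jumbo frames are being dropped somewhere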

Hopefully that helps some ...

~~shane 




On 7/17/15, 8:57 AM, ceph-users on behalf of J David
ceph-users-boun...@lists.ceph.com on behalf of j.david.li...@gmail.com
wrote:

On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 That looks a lot like what I was seeing initially. The OSDs getting
marked
 out was relatively rare and it took a bit before I saw it.

Our problem is most of the time and does not appear confined to a
specific ceph cluster node or OSD:

$ sudo fgrep 'waiting for subops' ceph.log  | sed -e 's/.* v4 //' |
sort | uniq -c | sort -n
  1 currently waiting for subops from 0
  1 currently waiting for subops from 10
  1 currently waiting for subops from 11
  1 currently waiting for subops from 12
  1 currently waiting for subops from 3
  1 currently waiting for subops from 7
  2 currently waiting for subops from 13
  2 currently waiting for subops from 16
  2 currently waiting for subops from 4
  3 currently waiting for subops from 15
  4 currently waiting for subops from 6
  4 currently waiting for subops from 8
  7 currently waiting for subops from 2

Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13 and 15 (9 out of 12)
Node f18: 16 (1 out of 12)

So f18 seems like the odd man out, in that it has *less* problems than
the other two.

There are a grand total of 2 RX errors across all the interfaces on
all three machines. (Each one has dual 10G interfaces bonded together
as active/failover.)

The OSD log for the worst offender above (2) says:

2015-07-17 08:52:05.441607 7f562ea0c700  0 log [WRN] : 1 slow
requests, 1 included below; oldest blocked for > 30.119568 secs

2015-07-17 08:52:05.441622 7f562ea0c700  0 log [WRN] : slow request
30.119568 seconds old, received at 2015-07-17 08:51:35.321991:
osd_sub_op(client.32913524.0:3149584 2.249
2792c249/rbd_data.15322ae8944a.0011b487/head//2 [] v
10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started

2015-07-17 08:52:43.229770 7f560833f700  0 --
192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653
pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0
c=0x22be4c60).fault with nothing to send, going to standby

There are a bunch of those "fault with nothing to send, going to
standby" messages.

 The messages were like So-and-so incorrectly marked us
 out IIRC.

Nothing like that. Nor, with ceph -w running constantly, any
reference to anything being marked out at any point, even when
problems are severe.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dont used fqdns in monmaptool and ceph-mon --mkfs

2015-07-17 Thread Shane Gibson
On 7/16/15, 9:51 PM, ceph-users on behalf of Goncalo Borges
ceph-users-boun...@lists.ceph.com on behalf of
gonc...@physics.usyd.edu.au wrote:



Once I substituted the fqdn by simply the hostname (without the domain)
it worked.

Goncalo, 

I ran into the same problems too - and ended up bailing on the
ceph-deploy tools and manually building my clusters ... eventually
tripped over a piece of documentation on the ceph site that mentions this:

  
http://ceph.com/docs/v0.71/radosgw/config/#add-a-gateway-configuration-to-ceph

The text:

  Note host must be your machine hostname, not FQDN.

Incidentally - this note doesn't seem to appear in the master docs
anymore ... There are a couple of other places where that note is
referenced... but it's definitely something that could be spelled out in
both the Quick and Manual install procedures early on, in big red WARNING
text, IMO.
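
For anyone who hits this, the point is simply that the host key in the gateway section of ceph.conf has to be the short hostname (a hedged sketch; the section name and hostname are examples from the docs, not this cluster):

  [client.radosgw.gateway]
      host = gw01        # output of `hostname -s`, not gw01.example.com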

I'm hoping to get more involved with providing back to the Ceph community,
and will start with documentation - and hope to submit patches myself soon
for that very issue ...

~~shane 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Steve Dainard
Disclaimer: I'm relatively new to ceph, and haven't moved into
production with it.

Did you run your bench for 30 seconds?

For reference, my bench from a VM bridged to a 10Gig card, with 90 x 4TB
OSDs, at 30 seconds is:

 Total time run: 30.766596
Total writes made:  1979
Write size: 4194304
Bandwidth (MB/sec): 257.292

Stddev Bandwidth:   106.78
Max bandwidth (MB/sec): 420
Min bandwidth (MB/sec): 0
Average Latency:0.248238
Stddev Latency: 0.723444
Max latency:10.5275
Min latency:0.0346015

Seems like latency is a huge factor if your 30 second test took 52 seconds.
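
For reference, the sort of invocation I compare against is something like
this (pool name is a placeholder and the options are the usual ones;
--no-cleanup just keeps the objects around so a read bench can follow):

   rados -p <pool> bench 30 write --no-cleanup
   rados -p <pool> bench 30 seq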

What kind of 10Gig NICs are you using? I have Mellanox Connectx-3 and
one node was using an older driver version. I started to experience
the OSD "in..out..in.." and "incorrectly marked out from..." as
mentioned by Quentin, as well as poor performance. Installed the newest
version of the Mellanox driver and all is running well again.
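
If you want to rule the driver out quickly, something along these lines
shows what is actually loaded (interface name is an example):

   ethtool -i eth2                   # driver, version, firmware-version
   modinfo ixgbe | grep -i ^version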

On Fri, Jul 17, 2015 at 7:55 AM, J David j.david.li...@gmail.com wrote:
 On Fri, Jul 17, 2015 at 10:21 AM, Mark Nelson mnel...@redhat.com wrote:
 rados -p <pool> bench 30 write

 just to see how it handles 4MB object writes.

 Here's that, from the VM host:

  Total time run: 52.062639
 Total writes made:  66
 Write size: 4194304
 Bandwidth (MB/sec): 5.071

 Stddev Bandwidth:   11.6312
 Max bandwidth (MB/sec): 80
 Min bandwidth (MB/sec): 0
 Average Latency:12.436
 Stddev Latency: 13.6272
 Max latency:51.6924
 Min latency:0.073353

 Unfortunately I don't know much about how to parse this (other than
 5MB/sec writes does match up with our best-case performance in the VM
 guest).

 If rados bench is
 also terribly slow, then you might want to start looking for evidence of IO
 getting hung up on a specific disk or node.

 Thus far, no evidence of that has presented itself.  iostat looks good
 on every drive and the nodes are all equally loaded.

 Thanks!
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Somnath Roy
I would say use the admin socket to find out which part is causing most of the
latency; don't rule out disk anomalies.
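
Something along these lines on the OSD node, for example (default socket path
assumed; adjust the osd id):

   ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_historic_ops
   ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok dump_ops_in_flight

The per-event timestamps in dump_historic_ops should show which stage
(journal, sub-op wait, etc.) is eating the time.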

Thanks & Regards
Somnath


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: Friday, July 17, 2015 8:07 AM
To: Quentin Hartman
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Deadly slow Ceph cluster revisited

On Fri, Jul 17, 2015 at 10:47 AM, Quentin Hartman 
qhart...@direwolfdigital.com wrote:
 What does ceph status say?

Usually it says everything is cool.  However just now it gave this:

cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
 health HEALTH_WARN 2 requests are blocked > 32 sec
 monmap e3: 3 mons at
{f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
election epoch 176, quorum 0,1,2 f16,f17,f18
 osdmap e10705: 28 osds: 28 up, 28 in
  pgmap v26213984: 4224 pgs, 3 pools, 21637 GB data, 5409 kobjects
43344 GB used, 86870 GB / 127 TB avail
4224 active+clean
  client io 368 kB/s rd, 444 kB/s wr, 10 op/s

How do I find out what requests are blocked and why?  Resolving whatever that 
is seems like a very good next step to troubleshooting this issue.

 I had a problem with similar symptoms some months ago that was
 accompanied by OSDs getting marked out for no apparent reason and the
 cluster going into a HEALTH_WARN state intermittently.

There's no record of any OSD's being marked out at the time of problems.  (E.g. 
as shown above, it's currently angry about something, but all OSD's are up &
in.)

Here's the slow requests related info from ceph.log from the last 90 minutes:

2015-07-17 06:25:14.809606 osd.16 192.168.19.218:6823/29360 4137 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.051626 secs
2015-07-17 06:25:14.809618 osd.16 192.168.19.218:6823/29360 4138 :
[WRN] slow request 30.051626 seconds old, received at 2015-07-17
06:24:44.757909: osd_op(client.32913524.0:3016955 
rbd_data.15322ae8944a.0011f34f [set-alloc-hint object_size
4194304 write_size 4194304,write 1041920~3152384] 2.20c2afbe
ack+ondisk+write e10705) v4 currently waiting for subops from 13
2015-07-17 06:25:20.280232 osd.24 192.168.19.218:6826/31177 4095 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.105786 secs
2015-07-17 06:25:20.280239 osd.24 192.168.19.218:6826/31177 4096 :
[WRN] slow request 30.105786 seconds old, received at 2015-07-17
06:24:50.174399: osd_op(client.32913524.0:3016970
rbd_data.15322ae8944a.0011f355 [set-alloc-hint object_size
4194304 write_size 4194304,write 0~4194304] 2.66b14219
ack+ondisk+write e10705) v4 currently waiting for subops from 4
2015-07-17 06:25:28.476583 osd.2 192.168.19.216:6808/29552 12827 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
33.389677 secs
2015-07-17 06:25:28.476594 osd.2 192.168.19.216:6808/29552 12828 :
[WRN] slow request 33.389677 seconds old, received at 2015-07-17
06:24:55.086843: osd_sub_op(client.32913524.0:3016982 2.e55
53500e55/rbd_data.15322ae8944a.0011f359/head//2 [] v
10705'851608 snapset=0=[]:[] snapc=0=[]) v11 currently commit sent
2015-07-17 06:25:25.223395 osd.20 192.168.19.218:6810/30518 4133 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.255906 secs
2015-07-17 06:25:25.223406 osd.20 192.168.19.218:6810/30518 4134 :
[WRN] slow request 30.255906 seconds old, received at 2015-07-17
06:24:54.967416: osd_op(client.32913524.0:3016982
rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
4194304 write_size 4194304,write 0~3683840] 2.53500e55
ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:30.224304 osd.20 192.168.19.218:6810/30518 4135 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.574590 secs
2015-07-17 06:25:30.224312 osd.20 192.168.19.218:6810/30518 4136 :
[WRN] slow request 30.574590 seconds old, received at 2015-07-17
06:24:59.649652: osd_op(client.32913524.0:3017153
rbd_data.15322ae8944a.0011f359 [set-alloc-hint object_size
4194304 write_size 4194304,write 3683840~510464] 2.53500e55
ack+ondisk+write e10705) v4 currently waiting for subops from 2
2015-07-17 06:25:34.624203 osd.23 192.168.19.218:6801/26435 4154 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.596777 secs
2015-07-17 06:25:34.624215 osd.23 192.168.19.218:6801/26435 4155 :
[WRN] slow request 30.596777 seconds old, received at 2015-07-17
06:25:04.027362: osd_op(client.32913524.0:3017169 
rbd_data.15322ae8944a.0011f35e [set-alloc-hint object_size
4194304 write_size 4194304,write 2963456~1230848] 2.dad14b4
ack+ondisk+write e10705) v4 currently waiting for subops from 8
2015-07-17 06:26:00.932314 osd.15 192.168.19.217:6800/25977 3491 :
[WRN] 1 slow requests, 1 included below; oldest blocked for >
30.549624 secs
2015-07-17 06:26:00.932328 osd.15 192.168.19.217:6800/25977 3492 :
[WRN] slow request 30.549624 seconds old, received at 2015-07-17
06:25:30.382612: 

Re: [ceph-users] OSD latency inaccurate reports?

2015-07-17 Thread Kostis Fardelas
Also, by running ceph osd perf, I see that fs_apply_latency is larger
than fs_commit_latency. Shouldn't that be the opposite? Apply latency
is, afaik, the time it takes to apply updates to the file system
in page cache. Commitcycle latency is the time it takes to flush the cache
to disk, right? Again, the statistics fetched from the sockets seem
more reasonable: the commitcycle latency is substantially larger than
the apply latency, which seems normal to me.

Is this a bug or a misunderstanding on my part?
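
For reference, the two views I am comparing are roughly these (osd id and
socket path are just examples):

   ceph osd perf
   ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \
       python -m json.tool | grep -A 2 -i latency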

Regards,
Kostis

On 13 July 2015 at 13:27, Kostis Fardelas dante1...@gmail.com wrote:
 Hello,
 I noticed that commit/apply latency reported using:
 ceph pg dump -f json-pretty

 is very different from the values reported when querying the OSD sockets.
 What is your opinion? What are the targets that I should fetch metrics
 from in order to be as precise as possible?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unsetting osd_crush_chooseleaf_type = 0

2015-07-17 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Yes, you will need to change 'osd' to 'host' as you thought, so that copies
will be separated between hosts. You will keep running into the problems you
are seeing until that is changed. It will cause data movement.
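
A minimal sketch of the edit cycle (file names are arbitrary; the rule change
is the one you already spotted):

   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt
   # edit crush.txt: in replicated_ruleset change
   #   step choose firstn 0 type osd
   # to
   #   step chooseleaf firstn 0 type host
   crushtool -c crush.txt -o crush.new
   ceph osd setcrushmap -i crush.new
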
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jul 16, 2015 at 11:45 AM, Steve Dainard  wrote:
 I originally built a single node cluster, and added
 'osd_crush_chooseleaf_type = 0 #0 is for one node cluster' to ceph.conf
 (which is now commented out).

 I've now added a 2nd node, where can I set this value to 1? I see in the
 crush map that the osd's are under 'host' buckets and don't see any
 reference to leaf.

 Would the cluster automatically rebalance when the 2nd host was added? How
 can I verify this?

 The issue right now, is with two host, copies = 2, min copies = 1, I cannot
 access data from client machines when one of the two hosts goes down.

 # begin crush map
 tunable choose_local_tries 0
 tunable choose_local_fallback_tries 0
 tunable choose_total_tries 50
 tunable chooseleaf_descend_once 1

 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3

 # types
 type 0 osd
 type 1 host
 type 2 chassis
 type 3 rack
 type 4 row
 type 5 pdu
 type 6 pod
 type 7 room
 type 8 datacenter
 type 9 region
 type 10 root

 # buckets
 host ceph1 {
 id -2   # do not change unnecessarily
 # weight 163.350
 alg straw
 hash 0  # rjenkins1
 item osd.0 weight 3.630
 item osd.1 weight 3.630
 }
 host ceph2 {
 id -3   # do not change unnecessarily
 # weight 163.350
 alg straw
 hash 0  # rjenkins1
 item osd.2 weight 3.630
 item osd.3 weight 3.630
 }
 root default {
 id -1   # do not change unnecessarily
 # weight 326.699
 alg straw
 hash 0  # rjenkins1
 item ceph1 weight 163.350
 item ceph2 weight 163.350
 }

 # rules
 rule replicated_ruleset {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step choose firstn 0 type osd  -- should this line be 'step
 chooseleaf firstn 0 type host'?
 step emit
 }

 # end crush map

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-BEGIN PGP SIGNATURE-
Version: Mailvelope v0.13.1
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJVqTieCRDmVDuy+mK58QAASToQAJCnJW4xqiRQRiUiJK9s
tF//xmrODAI3CrJENjtJ1wekNPq8YMWfX/CcxU0v7DdEx4JusVPxfv1BcEO3
OIKYHq3IXhWeMg20um1jGXINNI9ENR1Wpmp8G1gENjSRG+F3UOBG5jvZtoUn
LEZ0OmYbzIQcHmgOGhcCICp/vXlTbHfSkOMRz34mlwpoKkKlwuQoQKd0o1gg
MT3XvVBgIhZcQANiwu0sWxNTfNCtYq0/jVPcQyd3Wfzh9XtqHmZlgOYb92E1
uqYFTBpNc06vCYcBt2C93DhzUbJXdYWFxqG3HIkDu5GDJ1jL8WHcHMhoEdRX
61wTm5uku4+XTwHqoIJnr29f70z0pVBqZaY9nqz8XLcNdW7wYRbFvKnbravq
q32DJMVM8uhN5daRlJb0dRKMRCGjUhEItS/omHdriijmxu61SOjJzZvJN+w4
eEivnLeglqPGq20zGXHuSQT9KXcQlCGzvnCDblDGDky+UuDpwtYqRO3ez3xH
fOcBpjjslzXp09YGUo7UT9n9flonGnnOOi18lhgweF/zSIRw1lP350rdolbg
EnKSynkIYyoYjIf8cM8bw1mbxQXt2b3pIu2tTRYZG5bEe58aK2+8IKv0yN2I
M88VAOSXxOpzF0hxUyJiGYU2PK45lK5k+KQgQR2nuICk6Qc59/kuXjK8uZWX
LDRh
=Ay3V
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests during ceph osd boot

2015-07-17 Thread Kostis Fardelas
Thanks for your answers,
we will also experiment with osd recovery max active / threads and
will come back to you.
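
For the record, we plan to poke at these at runtime with injectargs - the
values below are just a conservative first guess on our side, not a
recommendation:

   ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1'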

Regards,
Kostis

On 16 July 2015 at 12:29, Jan Schermer j...@schermer.cz wrote:
 For me setting recovery_delay_start helps during the OSD bootup _sometimes_, 
 but it clearly does something different than what’s in the docs.

 Docs say:
 After peering completes, Ceph will delay for the specified number of seconds 
 before starting to recover objects.

 However, what I see is greatly slowed recovery, not a delayed start of 
 recovery. It seems to basically sleep between recovering the PGs. AFAIK 
 peering is already done unless I was remapping the PGs at the same moment, so 
 not sure what’s happening there in reality.

 We had this set to 20 for some time and recovering after host restart took 
 close to two hours.
 With this parameter set to 0, it recovered in less than 30 seconds (and 
 caused no slow requests or anything).

 So what I usually do is set this to a high number (like 200), and after all 
 the OSDs are started I set it to 0. This does not completely prevent slow 
 requests from happening, but does somewhat help…

 Jan

 On 15 Jul 2015, at 11:52, Andrey Korolyov and...@xdel.ru wrote:

 On Wed, Jul 15, 2015 at 12:15 PM, Jan Schermer j...@schermer.cz wrote:
 We have the same problems, we need to start the OSDs slowly.
 The problem seems to be CPU congestion. A booting OSD will use all 
 available CPU power you give it, and if it doesn’t have enough nasty stuff 
 happens (this might actually be the manifestation of some kind of problem 
 in our setup as well).
 It doesn’t do that always - I was restarting our hosts this weekend and 
 most of them came up fine with simple “service ceph start”, some just sat 
 there spinning the CPU and not doing any real world (and the cluster was 
 not very happy about that).

 Jan


 On 15 Jul 2015, at 10:53, Kostis Fardelas dante1...@gmail.com wrote:

 Hello,
 after some trial and error we concluded that if we start the 6 stopped
 OSD daemons with a delay of 1 minute, we do not experience slow
 requests (the threshold is set at 30 sec), although there are some ops
 that last up to 10s, which is already high enough. I assume that if we
 spread the delay more, the slow requests will vanish. We haven't ruled
 out the possibility that our setup isn't tuned to the finest detail, but
 I wonder whether we are missing some ceph tuning in terms of
 configuration.

 We run firefly latest stable version.

 Regards,
 Kostis

 On 13 July 2015 at 13:28, Kostis Fardelas dante1...@gmail.com wrote:
 Hello,
 after rebooting a ceph node, as the OSDs start booting and joining
 the cluster, we experience slow requests that get resolved immediately
 after the cluster recovers. It is important to note that before the node
 reboot, we set the noout flag in order to prevent recovery - so there are
 only degraded PGs when the OSDs shut down - and let the cluster handle the
 OSDs going down/up in the lightest way.

 Is there any tunable we should consider in order to avoid service
 degradation for our ceph clients?

 Regards,
 Kostis
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 As far as I've seen this problem, the main issue for regular
 disk-backed OSDs is an IOPS starvation during some interval after
 reading maps from filestore and marking itself as 'in' - even if
 in-memory caches are still hot, I/O will significantly degrade for a
 short period. The possible workaround for an otherwise healthy cluster
 and node-wide restart is to set norecover flag, it would greatly
 reduce a chance of hitting slow operations. Of course it is applicable
 only to non-empty cluster with tens of percents of an average
 utilization for rotating media. I first pointed out this issue a couple
 of years ago (it *does* break the 30s I/O SLA for a returning OSD, but
 refilling the same OSDs from scratch would not violate the same SLA,
 giving out way bigger completion time for a refill). From UX side, it
 would be great to introduce some kind of recovery throttler for newly
 started OSDs, as recovery_delay_start does not prevent immediate
 recovery procedures.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 11:15 AM, Quentin Hartman
qhart...@direwolfdigital.com wrote:
 That looks a lot like what I was seeing initially. The OSDs getting marked
 out was relatively rare and it took a bit before I saw it.

Our problem is present most of the time and does not appear confined to a
specific ceph cluster node or OSD:

$ sudo fgrep 'waiting for subops' ceph.log  | sed -e 's/.* v4 //' |
sort | uniq -c | sort -n
  1 currently waiting for subops from 0
  1 currently waiting for subops from 10
  1 currently waiting for subops from 11
  1 currently waiting for subops from 12
  1 currently waiting for subops from 3
  1 currently waiting for subops from 7
  2 currently waiting for subops from 13
  2 currently waiting for subops from 16
  2 currently waiting for subops from 4
  3 currently waiting for subops from 15
  4 currently waiting for subops from 6
  4 currently waiting for subops from 8
  7 currently waiting for subops from 2

Node f16: 0, 2, and 3 (3 out of 4)
Node f17: 4, 6, 7, 8, 10, 11, 12, 13 and 15 (9 out of 12)
Node f18: 16 (1 out of 12)

So f18 seems like the odd man out, in that it has *fewer* problems than
the other two.

There are a grand total of 2 RX errors across all the interfaces on
all three machines. (Each one has dual 10G interfaces bonded together
as active/failover.)

The OSD log for the worst offender above (2) says:

2015-07-17 08:52:05.441607 7f562ea0c700  0 log [WRN] : 1 slow
requests, 1 included below; oldest blocked for > 30.119568 secs

2015-07-17 08:52:05.441622 7f562ea0c700  0 log [WRN] : slow request
30.119568 seconds old, received at 2015-07-17 08:51:35.321991:
osd_sub_op(client.32913524.0:3149584 2.249
2792c249/rbd_data.15322ae8944a.0011b487/head//2 [] v
10705'944603 snapset=0=[]:[] snapc=0=[]) v11 currently started

2015-07-17 08:52:43.229770 7f560833f700  0 --
192.168.2.216:6813/16029552 >> 192.168.2.218:6810/7028653
pipe(0x25265180 sd=25 :6813 s=2 pgs=23894 cs=41 l=0
c=0x22be4c60).fault with nothing to send, going to standby

There are a bunch of those "fault with nothing to send, going to
standby" messages.

 The messages were like "So-and-so incorrectly marked us
 out" IIRC.

Nothing like that. Nor, with ceph -w running constantly, is there any
reference to anything being marked out at any point, even when
problems are severe.

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread J David
On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson mnel...@redhat.com wrote:
 Maybe try some iperf tests between the different OSD nodes in your
 cluster and also the client to the OSDs.

This proved to be an excellent suggestion.  One of these is not like the others:

f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
f17 outbound: 6Gbps
f18 inbound: 6Gbps
f18 outbound: 1.2Mbps

There is flatly no explanation for the outbound performance on f18.
There are no errors in ifconfig/netstat, nothing logged on the switch,
etc.  Even with tcpdump running during iperf, there aren't retransmits
or anything.  It's just slow.
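
(For completeness, the per-direction numbers above came from plain pairwise
iperf runs, roughly like this - host name and options are just the obvious
ones:)

   iperf -s              # on the node being tested, e.g. f18
   iperf -c f18 -t 30    # from one of the other nodes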

ifconfig'ing the primary bond interface down immediately resolved the
problem.  The iostat running in the virtual machine immediately surged
to 500+ IOPS and 40M-60M/sec.

Weirdly, ifconfig'ing the primary device back up did not bring the
problem back.  It switched back to that interface, but everything is
still fine (and iperf gives 6Gbps) at the moment.  There's no way of
telling if that will last, but it's a solid lead either way.

It's an onboard Intel dual-port X540 using the ixgbe driver.  If it
were a driver problem, we've got tons of these so I'd expect to see
this problem elsewhere.  If it's a hardware problem, ifconfig down/up
doesn't seem like it would fix it.  Very mysterious!

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

2015-07-17 Thread Steve Dainard
Other than those errors, do you find RBD's will not be unmapped on
system restart/shutdown on a machine using systemd? Leaving the system
hanging without network connections trying to unmap RBD's?

That's been my experience thus far, so I wrote an (overly simple)
systemd file to handle this on a per RBD basis.
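
Roughly along these lines - the unit name, pool/image, and device path are
placeholders, and auth/ordering details are glossed over, so treat it as a
sketch rather than something battle-tested:

   # /etc/systemd/system/rbd-myimage.service
   [Unit]
   Description=Map rbd/myimage
   Wants=network-online.target
   After=network-online.target

   [Service]
   Type=oneshot
   RemainAfterExit=yes
   ExecStart=/usr/bin/rbd map rbd/myimage
   ExecStop=/usr/bin/rbd unmap /dev/rbd/rbd/myimage

   [Install]
   WantedBy=multi-user.target

followed by a systemctl daemon-reload and systemctl enable rbd-myimage.service
to wire it in.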

On Tue, Jul 14, 2015 at 1:15 PM, Bruce McFarland
bruce.mcfarl...@taec.toshiba.com wrote:
 When starting the rbdmap.service to provide map/unmap of rbd devices across
 boot/shutdown cycles the /etc/init.d/rbdmap includes
 /lib/lsb/init-functions. This is not a problem except that the rbdmap script
 is making calls to the log_daemon_* log_progress_* log_action_* functions
 that are included in Ubuntu 14.04 distro's, but are not in the RHEL 7.1/RHCS
 1.3 distro. Are there any recommended workarounds for boot time startup in
 RHEL/Centos 7.1 clients?


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Alex Gorbachev
May I suggest checking also the error counters on your network switch?

Check speed and duplex. Is bonding in use?  Is flow control on?  Can you
swap the network cable?  Can you swap a NIC with another node and does the
problem follow?
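
All of that can also be eyeballed from the hosts themselves, e.g. (interface
and bond names assumed):

   ethtool eth2                   # negotiated speed and duplex
   ethtool -a eth2                # pause / flow control settings
   ethtool -S eth2 | grep -i err  # NIC error and drop counters
   cat /proc/net/bonding/bond0    # active slave and per-slave link state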

Hth, Alex

On Friday, July 17, 2015, Steve Thompson s...@vgersoft.com wrote:

 On Fri, 17 Jul 2015, J David wrote:

  f16 inbound: 6Gbps
 f16 outbound: 6Gbps
 f17 inbound: 6Gbps
 f17 outbound: 6Gbps
 f18 inbound: 6Gbps
 f18 outbound: 1.2Mbps


 Unless the network was very busy when you did this, I think that 6 Gb/s
 may not be very good either. Usually iperf will give you much more than
 that. For example, between two of my OSD's, I get 9.4 Gb/s, or up to 9.9
 Gb/s when nothing else is happening.

 Steve
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] backing Hadoop with Ceph ??

2015-07-17 Thread Josh Durgin

On 07/15/2015 11:48 AM, Shane Gibson wrote:


Somnath - thanks for the reply ...

:-)  Haven't tried anything yet - just starting to gather
info/input/direction for this solution.

Looking at the S3 API info [2] - there is no mention of support for the
S3a API extensions - namely rename support.  The problem with
backing via S3 API - if you need to rename a large (say multi GB) data
object - you have to copy to new name and delete - this is a very IO
expensive operation - and something we do a lot of.  That in and of
itself might be a deal breaker ...   Any idea/input/intention of
supporting the S3a extensions within the RadosGW S3 API implementation?


I see you're trying out cephfs now, and I think that makes sense.

I just wanted to mention that at CDS a couple weeks ago Yehuda noted
that RGW's rename is cheap, since it does not require copying the data,
just updating its location [1].

Josh

[1] http://pad.ceph.com/p/hadoop-over-rgw


Plus - it seems like it's considered a bad idea to back Hadoop via S3
(and indirectly Ceph via RGW) [3]; though not sure if the architectural
differences from Amazon's S3 implementation and the far superior Ceph
make it more palatable?

~~shane

[2] http://ceph.com/docs/master/radosgw/s3/
[3] https://wiki.apache.org/hadoop/AmazonS3




On 7/15/15, 9:50 AM, Somnath Roy somnath@sandisk.com wrote:

Did you try to integrate ceph +rgw+s3 with Hadoop?

Sent from my iPhone

On Jul 15, 2015, at 8:58 AM, Shane Gibson shane_gib...@symantec.com wrote:




We are in the (very) early stages of considering testing backing
Hadoop via Ceph - as opposed to HDFS.  I've seen a few very vague
references to doing that, but haven't found any concrete info
(architecture, configuration recommendations, gotchas, lessons
learned, etc...).   I did find the ceph.com/docs/ info [1] which
discusses use of CephFS for
backing Hadoop - but this would be foolish for production clusters
given that CephFS isn't yet considered production quality/grade.

Does anyone in the ceph-users community have experience with this
that they'd be willing to share?   Preferably ... via use of Ceph
- not via CephFS...but I am interested in any CephFS related
experiences too.

If we were to do this, and Ceph proved out as a backing store to
Hadoop - there is the potential to be creating a fairly large
multi-Petabyte (100s ??) class backing store for Ceph.  We do a
very large amount of analytics on a lot of data sets for security
trending correlations, etc...

Our current Ceph experience is limited to a few small (90 x 4TB
OSD size) clusters - which we are working towards putting in
production for Glance/Cinder backing and for Block storage for
various large storage need platforms (eg software and package
repo/mirrors, etc...).

Thanks in  advance for any input, thoughts, or pointers ...

~~shane

[1] http://ceph.com/docs/master/cephfs/hadoop/



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Quentin Hartman
Glad we were able to point you in the right direction! I would suspect a
borderline cable at this point. Did you happen to notice if the interface
had negotiated down to some dumb speed? If it had, I've seen cases where a
dodgy cable has caused an intermittent problem that causes it to negotiate
the speed downward, but then it never tries to come back up until the
interface is restarted.

fwiw, I'm using the same chipset (not onboard) and driver, and they have
been rock solid so far. I would be skeptical of a driver bug as well.

QH

On Fri, Jul 17, 2015 at 2:34 PM, J David j.david.li...@gmail.com wrote:

 On Fri, Jul 17, 2015 at 12:19 PM, Mark Nelson mnel...@redhat.com wrote:
  Maybe try some iperf tests between the different OSD nodes in your
  cluster and also the client to the OSDs.

 This proved to be an excellent suggestion.  One of these is not like the
 others:

 f16 inbound: 6Gbps
 f16 outbound: 6Gbps
 f17 inbound: 6Gbps
 f17 outbound: 6Gbps
 f18 inbound: 6Gbps
 f18 outbound: 1.2Mbps

 There is flatly no explanation for the outbound performance on f18.
 There are no errors in ifconfig/netstat, nothing logged on the switch,
 etc.  Even with tcpdump running during iperf, there aren't retransmits
 or anything.  It's just slow.

 ifconfig'ing the primary bond interface down immediately resolved the
 problem.  The iostat running in the virtual machine immediately surged
 to 500+ IOPS and 40M-60M/sec.

 Weirdly, ifconfig'ing the primary device back up did not bring the
 problem back.  It switched back to that interface, but everything is
 still fine (and iperf gives 6Gbps) at the moment.  There's no way of
 telling if that will last, but it's a solid lead either way.

 It's an onboard Intel dual-port X540 using the ixgbe driver.  If it
 were a driver problem, we've got tons of these so I'd expect to see
 this problem elsewhere.  If it's a hardware problem, ifconfig down/up
 doesn't seem like it would fix it.  Very mysterious!

 Thanks!
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deadly slow Ceph cluster revisited

2015-07-17 Thread Steve Thompson

On Fri, 17 Jul 2015, J David wrote:


f16 inbound: 6Gbps
f16 outbound: 6Gbps
f17 inbound: 6Gbps
f17 outbound: 6Gbps
f18 inbound: 6Gbps
f18 outbound: 1.2Mbps


Unless the network was very busy when you did this, I think that 6 Gb/s 
may not be very good either. Usually iperf will give you much more than 
that. For example, between two of my OSD's, I get 9.4 Gb/s, or up to 9.9 
Gb/s when nothing else is happening.


Steve
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?

2015-07-17 Thread Bruce McFarland
Yes, the rbd's are not remapped at system boot time. I haven't run into a VM or
system hang because of this, since I only ran into it as part of investigating
RHEL 7.1 as a client distro. Yes, remapping the rbd's in a startup script worked
around the issue. 

 -Original Message-
 From: Steve Dainard [mailto:sdain...@spd1.com]
 Sent: Friday, July 17, 2015 1:59 PM
 To: Bruce McFarland
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service
 start warnings?
 
 Other than those errors, do you find RBD's will not be unmapped on system
 restart/shutdown on a machine using systemd? Leaving the system hanging
 without network connections trying to unmap RBD's?
 
 That's been my experience thus far, so I wrote an (overly simple) systemd
 file to handle this on a per RBD basis.
 
 On Tue, Jul 14, 2015 at 1:15 PM, Bruce McFarland
 bruce.mcfarl...@taec.toshiba.com wrote:
  When starting the rbdmap.service to provide map/unmap of rbd devices
  across boot/shutdown cycles the /etc/init.d/rbdmap includes
  /lib/lsb/init-functions. This is not a problem except that the rbdmap
  script is making calls to the log_daemon_* log_progress_*
  log_action_* functions that are included in Ubuntu 14.04 distro's,
  but are not in the RHEL 7.1/RHCS
  1.3 distro. Are there any recommended workarounds for boot time startup
  in RHEL/Centos 7.1 clients?
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com