Hi JC,

In answer to your question, iostat shows high wait times on the RBD, but not on the underlying medium. For example:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz     await r_await    w_await    svctm  %util
rbd50             0.00     0.01    0.00    0.00     0.01     0.03    15.81     5.08 892493.16    1.47 1692367.78 85088.96  48.41
xvdk              0.00    32.18    0.20    4.55     1.00   314.30   132.83     0.40     83.70    1.10      87.35     0.82   0.39


In this case, rbd50 is the RBD that is blocking, while xvdk is the physical disk where the OSD data is stored. xvdk looks completely normal, whereas rbd50 shows absurdly high wait times. This leads me to think that the problem is a bug or misconfiguration in ceph, rather than the RBD actually being blocked by slow I/O.
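For reference, this is ordinary extended iostat output; I just picked out the rbd50 and xvdk rows from a run along the lines of:

    iostat -x 5

(the interval is arbitrary).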

This information is also reflected in vmstat:

    procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
     0  1 1386024  79596 263552 468136    1   11    26   297  213  247  0  1 57 42
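That's from an ordinary vmstat run with an arbitrary interval, e.g.:

    vmstat 5

The wa column (percentage of CPU time spent waiting on I/O) sitting around 42 is the part that stands out to me.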


Finally, one can see the blocked processes, and the function they're blocked in, with ps. This is pretty typical:

    2  6750 root     D     0.0 jbd2/rbd47-8    wait_on_buffer
    2 17551 root     D     0.0 jbd2/rbd51-8    wait_on_buffer
    2 19019 root     D     0.0 kworker/u30:3   get_write_access
22369 22374 root     D     0.0 shutdown        sync_inodes_sb
22372 22381 root     D     0.0 shutdown        sync_inodes_sb

Frequently mkfs blocks as well:

 1468 12329 root     D     0.0 mkfs.ext4       wait_on_page_bit
 1468 12332 root     D     0.0 mkfs.ext4       wait_on_buffer
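(For reference, that listing comes from ps with a custom field list, along these lines (from memory, so treat the exact fields as approximate):

    ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 | awk '$4 ~ /D/'

i.e. every process currently in uninterruptible sleep, plus the kernel symbol it's waiting in.)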


I haven't seen anything obviously unusual in ceph -w, but I'm also not completely sure what I'm looking for.
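For what it's worth, I've mostly just been eyeballing the stream, plus occasionally filtering it with something like:

    ceph -w | grep -iE 'slow|blocked|stuck|warn'

but nothing interesting has shown up so far.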

The network connection between our nodes is provided by Amazon AWS, and has always been sufficient for our production needs until now. If there's a specific issue of concern related to ceph that I should be investigating, please let me know.
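If it would help to rule out the network, I'm happy to run a quick throughput test between the client and the OSD hosts, e.g. with iperf:

    iperf -s                  # on one of the OSD hosts
    iperf -c <osd-host-addr>  # on the client, pointing at that host

Just let me know if that (or something else) would be useful.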

Here's a pastebin of the log from an OSD experiencing the problem I described, with debug_osd set to 5/5: http://pastebin.com/kLSwbVRb If you can provide any insight, I'd be grateful. Also, if you have any more suggestions on how I can collect potentially interesting debug info, please let me know. Thanks.
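(In case it matters for interpreting the log: I bumped the OSD debug level with injectargs, something like

    ceph tell osd.* injectargs '--debug-osd 5/5'

though the exact invocation above is from memory.)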

Jeff

On 04/10/2015 12:24 AM, LOPEZ Jean-Charles wrote:
Hi Jeff,

have you tried gathering iostat output on the OSD side, to see how your OSD drives behave?

The RBD side shows you what the client is experiencing (the symptom) but will 
not help you find the problem.

Can you grab this iostat output on the OSD VMs (district-1 or district-2, depending on which test you did last)? Don’t forget to indicate which devices are the OSD devices on your VMs when you post the iostat output.

Have you also investigated the network between your client and the OSDs? While the test is running, do you see any unusual messages in a « ceph -w » output?

Pastebin and we’ll see if we can spot something.

As for the too few PGs, once we’ve found the root cause of why it’s slow, 
you’ll be able to adjust and increase the number of PGs per pool.

Cheers
JC

On 9 Apr 2015, at 20:25, Jeff Epstein <jeff.epst...@commerceguys.com> wrote:

As a follow-up to this issue, I'd like to point out some other things I've 
noticed.

First, per suggestions posted here, I've reduced the number of pgs per pool. 
This results in the following ceph status:

     cluster e96e10d3-ad2b-467f-9fe4-ab5269b70206
      health HEALTH_WARN too few pgs per osd (14 < min 20)
      monmap e1: 3 mons at {a=192.168.224.4:6789/0,b=192.168.232.4:6789/0,c=192.168.240.4:6789/0}, election epoch 8, quorum 0,1,2 a,b,c
      osdmap e238: 6 osds: 6 up, 6 in
       pgmap v1107: 86 pgs, 23 pools, 2511 MB data, 801 objects
             38288 MB used, 1467 GB / 1504 GB avail
                   86 active+clean

I'm not sure if I should be concerned about the HEALTH_WARN.
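(In case it's relevant how I got there: the PG counts were chosen when the pools were created, i.e. something along the lines of

    ceph osd pool create <pool-name> <pg_num> <pgp_num>

with small pg_num/pgp_num values; the placeholders are just illustrative.)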

However, this has not helped the performance issues. I've dug deeper to try to 
understand what is actually happening. It's curious because there isn't much 
data: our pools are about 5GB, so it really shouldn't take 30 minutes to an 
hour to run mkfs. Here are some results taken from disk analysis tools while this 
delay is in progress:

 From pt-diskstats:

   #ts device    rd_s rd_avkb rd_mb_s rd_mrg rd_cnc   rd_rt    wr_s wr_avkb wr_mb_s wr_mrg wr_cnc   wr_rt busy in_prg    io_s  qtime stime
   1.0 rbd0       0.0     0.0     0.0     0%    0.0     0.0     0.0     0.0     0.0     0%    0.0     0.0 100%      6     0.0    0.0   0.0
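(That sample was collected with pt-diskstats from the Percona toolkit, filtered to the rbd device, roughly:

    pt-diskstats --devices-regex '^rbd0$'

the exact option spelling is from memory.)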

 From iostat:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz    await r_await  w_await   svctm  %util
rbd0              0.00     0.03    0.03    0.04     0.13    10.73   310.78     3.31 19730.41    0.40 37704.35 7073.59  49.47

These results correspond with my experience: the device is busy, as witnessed by the "busy" column in pt-diskstats and the "await" column in iostat. But both tools also attest to the fact that there isn't much reading or writing going on; according to pt-diskstats, there isn't any. So my question is: what is ceph doing? It clearly isn't just blocking as a result of excess I/O load; something else is going on. Can anyone please explain?

Jeff


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
