I would start by defragmenting the drives. The good part is that you can just run the defrag with the time parameter and it will take all available XFS drives.

On 4 Oct 2015 6:13 pm, "Robert LeBlanc" <rob...@leblancnet.us> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> These are Toshiba MG03ACA400 drives.
>
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
>
> There is probably some performance optimization that we can do in this area; however, unless I'm missing something, I don't see anything that should cause I/O to take 30-60+ seconds to complete from a disk standpoint.
>
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; xfs_db -c frag -r /dev/sd${i}1; done
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
>
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)  10/04/2015  _x86_64_  (16 CPU)
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.09 2.06 9.24 36.18 527.28 1743.71 100.00 8.96 197.32 17.50 243.23 4.07 18.47
> sdb 0.17 3.61 16.70 74.44 949.65 2975.30 86.13 6.74 73.95 23.94 85.16 4.31 39.32
> sdc 0.14 4.67 15.69 87.80 818.02 3860.11 90.41 9.56 92.38 26.73 104.11 4.44 45.91
> sdd 0.17 3.43 7.16 69.13 480.96 2847.42 87.25 4.80 62.89 30.00 66.30 4.33 33.00
> sde 0.01 1.13 0.34 0.99 8.35 12.01 30.62 0.01 7.37 2.64 9.02 1.64 0.22
> sdj 0.00 1.22 0.01 348.22 0.03 11302.65 64.91 0.23 0.66 0.14 0.66 0.15 5.15
> sdk 0.00 1.99 0.01 369.94 0.03 12876.74 69.61 0.26 0.71 0.13 0.71 0.16 5.75
> sdf 0.01 1.79 1.55 31.12 39.64 1431.37 90.06 4.07 124.67 16.25 130.05 3.11 10.17
> sdi 0.22 3.17 23.92 72.90 1386.45 2676.28 83.93 7.75 80.00 24.31 98.27 4.31 41.77
> sdm 0.16 3.10 17.63 72.84 986.29 2767.24 82.98 6.57 72.64 23.67 84.50 4.23 38.30
> sdl 0.11 3.01 12.10 55.14 660.85 2361.40 89.89 17.87 265.80 21.64 319.36 4.08 27.45
> sdg 0.08 2.45 9.75 53.90 489.67 1929.42 76.01 17.27 271.30 20.77 316.61 3.98 25.33
> sdh 0.10 2.76 11.28 60.97 600.10 2114.48 75.14 1.70 23.55 22.92 23.66 4.10 29.60
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.50 0.00 146.00 0.00 584.00 0.01 16.00 16.00 0.00 16.00 0.80
> sdb 0.00 0.50 9.00 119.00 2036.00 2578.00 72.09 0.68 5.50 7.06 5.39 2.36 30.25
> sdc 0.00 4.00 34.00 129.00 494.00 6987.75 91.80 1.70 10.44 17.00 8.72 4.44 72.40
> sdd 0.00 1.50 1.50 95.50 74.00 2396.50 50.94 0.85 8.75 23.33 8.52 7.53 73.05
> sde 0.00 37.00 11.00 1.00 46.00 152.00 33.00 0.01 1.00 0.64 5.00 0.54 0.65
> sdj 0.00 0.50 0.00 970.50 0.00 12594.00 25.95 0.09 0.09 0.00 0.09 0.08 8.20
> sdk 0.00 0.00 0.00 977.50 0.00 12016.00 24.59 0.10 0.10 0.00 0.10 0.09 8.90
> sdf 0.00 0.50 0.50 37.50 2.00 230.25 12.22 9.63 10.58 8.00 10.61 1.79 6.80
> sdi 2.00 0.00 10.50 0.00 2528.00 0.00 481.52 0.10 9.33 9.33 0.00 7.76 8.15
> sdm 0.00 0.50 15.00 116.00 546.00 833.25 21.06 0.94 7.17 14.03 6.28 4.13 54.15
> sdl 0.00 0.00 3.00 0.00 26.00 0.00 17.33 0.02 7.50 7.50 0.00 7.50 2.25
> sdg 0.00 3.50 1.00 64.50 4.00 2929.25 89.56 0.40 6.04 9.00 5.99 3.42 22.40
> sdh 0.50 0.50 4.00 64.00 770.00 1105.00 55.15 4.96 189.42 21.25 199.93 4.21 28.60
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 8.50 0.00 110.00 0.00 25.88 0.01 1.59 1.59 0.00 1.53 1.30
> sdb 0.00 4.00 6.50 117.50 494.00 4544.50 81.27 0.87 6.99 11.62 6.73 3.28 40.70
> sdc 0.00 0.50 5.50 202.50 526.00 4123.00 44.70 1.80 8.66 18.73 8.39 2.08 43.30
> sdd 0.00 3.00 2.50 227.00 108.00 6952.00 61.53 46.10 197.44 30.20 199.29 3.86 88.60
> sde 0.00 0.00 0.00 1.50 0.00 6.00 8.00 0.00 2.33 0.00 2.33 1.33 0.20
> sdj 0.00 0.00 0.00 834.00 0.00 9912.00 23.77 0.08 0.09 0.00 0.09 0.08 6.75
> sdk 0.00 0.00 0.00 777.00 0.00 12318.00 31.71 0.12 0.15 0.00 0.15 0.10 7.70
> sdf 0.00 1.00 4.50 117.00 198.00 693.25 14.67 34.86 362.88 84.33 373.60 3.59 43.65
> sdi 0.00 0.00 1.50 0.00 6.00 0.00 8.00 0.01 9.00 9.00 0.00 9.00 1.35
> sdm 0.50 3.00 3.50 143.00 1014.00 4205.25 71.25 0.93 5.95 20.43 5.59 3.08 45.15
> sdl 0.50 0.00 8.00 148.50 1578.00 2128.50 47.37 0.82 5.27 6.44 5.21 3.40 53.20
> sdg 1.50 2.00 10.50 100.50 2540.00 2039.50 82.51 0.77 7.00 14.19 6.25 5.42 60.20
> sdh 0.50 0.00 5.00 0.00 1050.00 0.00 420.00 0.04 7.10 7.10 0.00 7.10 3.55
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 6.00 0.00 604.00 0.00 201.33 0.03 5.58 5.58 0.00 5.58 3.35
> sdb 0.00 6.00 7.00 236.00 132.00 8466.00 70.77 45.48 186.59 31.79 191.18 1.62 39.45
> sdc 2.00 0.00 19.50 46.50 6334.00 686.00 212.73 0.39 5.96 7.97 5.12 3.57 23.55
> sdd 0.00 1.00 3.00 20.00 72.00 1527.25 139.07 0.31 47.67 6.17 53.90 3.11 7.15
> sde 0.00 17.00 0.00 4.50 0.00 184.00 81.78 0.01 2.33 0.00 2.33 2.33 1.05
> sdj 0.00 0.00 0.00 805.50 0.00 12760.00 31.68 0.21 0.27 0.00 0.27 0.09 7.35
> sdk 0.00 0.00 0.00 438.00 0.00 14300.00 65.30 0.24 0.54 0.00 0.54 0.13 5.65
> sdf 0.00 0.00 1.00 0.00 6.00 0.00 12.00 0.00 2.50 2.50 0.00 2.50 0.25
> sdi 0.00 5.50 14.50 27.50 394.00 6459.50 326.36 0.86 20.18 11.00 25.02 7.42 31.15
> sdm 0.00 1.00 9.00 175.00 554.00 3173.25 40.51 1.12 6.38 7.22 6.34 2.41 44.40
> sdl 0.00 2.00 2.50 100.50 26.00 2483.00 48.72 0.77 7.47 11.80 7.36 2.10 21.65
> sdg 0.00 4.50 9.00 214.00 798.00 7417.00 73.68 66.56 298.46 28.83 309.80 3.35 74.70
> sdh 0.00 0.00 16.50 0.00 344.00 0.00 41.70 0.09 5.61 5.61 0.00 4.55 7.50
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 1.00 0.00 9.00 0.00 3162.00 0.00 702.67 0.07 8.06 8.06 0.00 6.06 5.45
> sdb 0.50 0.00 12.50 13.00 1962.00 298.75 177.31 0.63 30.00 4.84 54.19 9.96 25.40
> sdc 0.00 0.50 3.50 131.00 18.00 1632.75 24.55 0.87 6.48 16.86 6.20 3.51 47.25
> sdd 0.00 0.00 4.00 0.00 72.00 16.00 44.00 0.26 10.38 10.38 0.00 23.38 9.35
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 843.50 0.00 16334.00 38.73 0.19 0.23 0.00 0.23 0.11 9.10
> sdk 0.00 0.00 0.00 803.00 0.00 10394.00 25.89 0.07 0.08 0.00 0.08 0.08 6.25
> sdf 0.00 4.00 11.00 90.50 150.00 2626.00 54.70 0.59 5.84 3.82 6.08 4.06 41.20
> sdi 0.00 3.50 17.50 130.50 2132.00 6309.50 114.07 1.84 12.55 25.60 10.80 5.76 85.30
> sdm 0.00 4.00 2.00 139.00 44.00 1957.25 28.39 0.89 6.28 14.50 6.17 3.55 50.10
> sdl 0.00 0.50 12.00 101.00 334.00 1449.75 31.57 0.94 8.28 10.17 8.06 2.11 23.85
> sdg 0.00 0.00 2.50 3.00 204.00 17.00 80.36 0.02 5.27 4.60 5.83 3.91 2.15
> sdh 0.00 0.50 9.50 32.50 1810.00 199.50 95.69 0.28 6.69 3.79 7.54 5.12 21.50
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.50 0.50 25.00 24.50 1248.00 394.25 66.35 0.76 15.30 11.62 19.06 5.25 26.00
> sdb 1.50 0.00 13.50 30.00 2628.00 405.25 139.46 0.27 5.94 8.19 4.93 5.31 23.10
> sdc 0.00 6.00 3.00 163.00 60.00 9889.50 119.87 1.66 9.83 28.67 9.48 5.95 98.70
> sdd 0.00 11.00 5.50 353.50 50.00 2182.00 12.43 118.42 329.26 30.27 333.91 2.78 99.90
> sde 0.00 5.50 0.00 1.50 0.00 28.00 37.33 0.00 2.33 0.00 2.33 2.33 0.35
> sdj 0.00 0.00 0.00 1227.50 0.00 22064.00 35.95 0.50 0.41 0.00 0.41 0.10 12.50
> sdk 0.00 0.50 0.00 1073.50 0.00 19248.00 35.86 0.24 0.23 0.00 0.23 0.10 10.40
> sdf 0.00 4.00 0.00 109.00 0.00 4145.00 76.06 0.59 5.44 0.00 5.44 3.63 39.55
> sdi 0.00 1.00 8.50 95.50 218.00 2091.75 44.42 1.06 9.70 18.71 8.90 7.00 72.80
> sdm 0.00 0.00 8.00 177.50 82.00 3173.00 35.09 1.24 6.65 14.31 6.30 3.53 65.40
> sdl 0.00 3.50 3.00 187.50 32.00 2175.25 23.17 1.47 7.68 18.50 7.50 3.85 73.35
> sdg 0.00 0.00 1.00 0.00 12.00 0.00 24.00 0.00 1.50 1.50 0.00 1.50 0.15
> sdh 0.50 1.00 14.00 169.50 2364.00 4568.00 75.55 1.50 8.12 21.25 7.03 4.91 90.10
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 4.00 3.00 60.00 212.00 2542.00 87.43 0.58 8.02 15.50 7.64 7.95 50.10
> sdb 0.00 0.50 2.50 98.00 682.00 1652.00 46.45 0.51 5.13 6.20 5.10 3.05 30.65
> sdc 0.00 2.50 4.00 146.00 16.00 4623.25 61.86 1.07 7.33 13.38 7.17 2.22 33.25
> sdd 0.00 0.50 9.50 30.00 290.00 358.00 32.81 0.84 32.22 49.16 26.85 12.28 48.50
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.50 0.00 530.00 0.00 7138.00 26.94 0.06 0.11 0.00 0.11 0.09 4.65
> sdk 0.00 0.00 0.00 625.00 0.00 8254.00 26.41 0.07 0.12 0.00 0.12 0.09 5.75
> sdf 0.00 0.00 0.00 4.00 0.00 18.00 9.00 0.01 3.62 0.00 3.62 3.12 1.25
> sdi 0.00 2.50 8.00 61.00 836.00 2681.50 101.96 0.58 9.25 15.12 8.48 6.71 46.30
> sdm 0.00 4.50 11.00 273.00 2100.00 8562.00 75.08 13.49 47.53 24.95 48.44 1.83 52.00
> sdl 0.00 1.00 0.50 49.00 2.00 1038.00 42.02 0.23 4.83 14.00 4.73 2.45 12.15
> sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdh 1.00 1.00 9.00 109.00 2082.00 2626.25 79.80 0.85 7.34 7.83 7.30 3.83 45.20
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 1.50 10.00 177.00 284.00 4857.00 54.98 36.26 194.27 21.85 204.01 3.53 66.00
> sdb 1.00 0.50 39.50 119.50 1808.00 2389.25 52.80 1.58 9.96 12.32 9.18 2.42 38.45
> sdc 0.00 2.00 15.00 200.50 116.00 4951.00 47.03 14.37 66.70 73.87 66.16 2.23 47.95
> sdd 0.00 3.50 6.00 54.50 180.00 2360.50 83.98 0.69 11.36 20.42 10.36 7.99 48.35
> sde 0.00 7.50 0.00 32.50 0.00 160.00 9.85 1.64 50.51 0.00 50.51 1.48 4.80
> sdj 0.00 0.00 0.00 835.00 0.00 10198.00 24.43 0.07 0.09 0.00 0.09 0.08 6.50
> sdk 0.00 0.00 0.00 802.00 0.00 12534.00 31.26 0.23 0.29 0.00 0.29 0.10 8.05
> sdf 0.00 2.50 2.00 133.50 14.00 5272.25 78.03 4.37 32.21 4.50 32.63 1.73 23.40
> sdi 0.00 4.50 17.00 125.50 2676.00 8683.25 159.43 1.86 13.02 27.97 11.00 4.95 70.55
> sdm 0.00 0.00 7.00 0.50 540.00 32.00 152.53 0.05 7.07 7.57 0.00 7.07 5.30
> sdl 0.00 7.00 27.00 276.00 2374.00 11955.50 94.58 25.87 85.36 15.20 92.23 1.84 55.90
> sdg 0.00 0.00 45.00 0.00 828.00 0.00 36.80 0.07 1.62 1.62 0.00 0.68 3.05
> sdh 0.00 0.50 0.50 65.50 2.00 1436.25 43.58 0.51 7.79 16.00 7.73 3.61 23.80
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 8.00 14.50 150.00 122.00 929.25 12.78 20.65 70.61 7.55 76.71 1.46 24.05
> sdb 0.00 5.00 8.00 283.50 86.00 2757.50 19.51 69.43 205.40 51.75 209.73 2.40 69.85
> sdc 0.00 0.00 12.50 1.50 350.00 48.25 56.89 0.25 17.75 17.00 24.00 4.75 6.65
> sdd 0.00 3.50 36.50 141.00 394.00 2338.75 30.79 1.50 8.42 16.16 6.41 4.56 80.95
> sde 0.00 1.50 0.00 1.00 0.00 10.00 20.00 0.00 2.00 0.00 2.00 2.00 0.20
> sdj 0.00 0.00 0.00 1059.00 0.00 18506.00 34.95 0.19 0.18 0.00 0.18 0.10 10.75
> sdk 0.00 0.00 0.00 1103.00 0.00 14220.00 25.78 0.09 0.08 0.00 0.08 0.08 8.35
> sdf 0.00 5.50 2.00 19.50 8.00 5158.75 480.63 0.17 8.05 6.50 8.21 6.95 14.95
> sdi 0.00 5.50 28.00 224.50 2210.00 8971.75 88.57 122.15 328.47 27.43 366.02 3.71 93.70
> sdm 0.00 0.00 13.00 4.00 718.00 16.00 86.35 0.15 3.76 4.23 2.25 3.62 6.15
> sdl 0.00 0.00 16.50 0.00 832.00 0.00 100.85 0.02 1.12 1.12 0.00 1.09 1.80
> sdg 0.00 2.50 17.00 23.50 1032.00 3224.50 210.20 0.25 6.25 2.56 8.91 3.41 13.80
> sdh 0.00 10.50 4.50 241.00 66.00 7252.00 59.62 23.00 91.66 4.22 93.29 2.11 51.85
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.50 3.50 91.00 92.00 552.75 13.65 36.27 479.41 81.57 494.71 5.65 53.35
> sdb 0.00 1.00 6.00 168.00 224.00 962.50 13.64 83.35 533.92 62.00 550.77 5.75 100.00
> sdc 0.00 1.00 3.00 171.00 16.00 1640.00 19.03 1.08 6.18 11.83 6.08 3.15 54.80
> sdd 0.00 5.00 5.00 107.50 132.00 6576.75 119.27 0.79 7.06 18.80 6.51 5.13 57.70
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 1111.50 0.00 22346.00 40.21 0.27 0.24 0.00 0.24 0.11 12.10
> sdk 0.00 0.00 0.00 1022.00 0.00 33040.00 64.66 0.68 0.67 0.00 0.67 0.13 13.60
> sdf 0.00 5.50 2.50 91.00 12.00 4977.25 106.72 2.29 24.48 14.40 24.76 2.42 22.60
> sdi 0.00 0.00 10.00 69.50 368.00 858.50 30.86 7.40 586.41 5.50 669.99 4.21 33.50
> sdm 0.00 4.00 8.00 210.00 944.00 5833.50 62.18 1.57 7.62 18.62 7.20 4.57 99.70
> sdl 0.00 0.00 7.50 22.50 104.00 253.25 23.82 0.14 4.82 5.07 4.73 4.03 12.10
> sdg 0.00 4.00 1.00 84.00 4.00 3711.75 87.43 0.58 6.88 12.50 6.81 5.75 48.90
> sdh 0.00 3.50 7.50 44.00 72.00 2954.25 117.52 1.54 39.50 61.73 35.72 6.40 32.95
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 1.00 0.00 20.00 0.00 40.00 0.01 14.50 14.50 0.00 14.50 1.45
> sdb 0.00 7.00 10.50 198.50 2164.00 6014.75 78.27 1.94 9.29 28.90 8.25 4.77 99.75
> sdc 0.00 2.00 4.00 95.50 112.00 5152.25 105.81 0.94 9.46 24.25 8.84 4.68 46.55
> sdd 0.00 1.00 2.00 131.00 10.00 7167.25 107.93 4.55 34.23 83.25 33.48 2.52 33.55
> sde 0.00 0.00 0.00 0.50 0.00 2.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 541.50 0.00 6468.00 23.89 0.05 0.10 0.00 0.10 0.09 5.00
> sdk 0.00 0.00 0.00 509.00 0.00 7704.00 30.27 0.07 0.14 0.00 0.14 0.10 4.85
> sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdi 0.00 0.00 3.50 0.00 90.00 0.00 51.43 0.04 10.14 10.14 0.00 10.14 3.55
> sdm 0.00 2.00 5.00 102.50 1186.00 4583.00 107.33 0.81 7.56 23.20 6.80 2.78 29.85
> sdl 0.00 14.00 10.00 216.00 112.00 3645.50 33.25 73.45 311.05 46.30 323.31 3.51 79.35
> sdg 0.00 1.00 0.00 52.50 0.00 240.00 9.14 0.25 4.76 0.00 4.76 4.48 23.50
> sdh 0.00 0.00 3.50 0.00 18.00 0.00 10.29 0.02 7.00 7.00 0.00 7.00 2.45
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.01 14.50 14.50 0.00 14.50 1.45
> sdb 0.00 9.00 2.00 292.00 192.00 10925.75 75.63 36.98 100.27 54.75 100.58 2.95 86.60
> sdc 0.00 9.00 10.50 151.00 78.00 6771.25 84.82 36.06 94.60 26.57 99.33 3.77 60.85
> sdd 0.00 0.00 5.00 1.00 74.00 24.00 32.67 0.03 5.00 6.00 0.00 5.00 3.00
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 787.50 0.00 9418.00 23.92 0.07 0.10 0.00 0.10 0.09 6.70
> sdk 0.00 0.00 0.00 766.50 0.00 9400.00 24.53 0.08 0.11 0.00 0.11 0.10 7.70
> sdf 0.00 0.00 0.50 41.50 6.00 391.00 18.90 0.24 5.79 9.00 5.75 5.50 23.10
> sdi 0.00 10.00 9.00 268.00 92.00 1618.75 12.35 68.20 150.90 15.50 155.45 2.36 65.30
> sdm 0.00 11.50 10.00 330.50 72.00 3201.25 19.23 68.83 139.38 37.45 142.46 1.84 62.80
> sdl 0.00 2.50 2.50 228.50 14.00 2526.00 21.99 90.42 404.71 242.40 406.49 4.33 100.00
> sdg 0.00 5.50 7.50 298.00 68.00 5275.25 34.98 75.31 174.85 26.73 178.58 2.67 81.60
> sdh 0.00 0.00 2.50 2.00 28.00 24.00 23.11 0.01 2.78 5.00 0.00 2.78 1.25
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson wrote:
> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine
are very bad at doing writes at the same time as reads, especially small reads and writes.
>
> This has an absurdly bad effect when doing maintenance on Ceph. That being said, we see a lot of difference between dumpling and hammer in performance on these drives, most likely due to hammer being able to read/write degraded PGs.
>
> We have run into two different problems along the way. The first was blocked requests, where we had to upgrade from 64GB mem on each node to 256GB. We thought that it was the only safe buy to make things better.
>
> I believe it worked because more reads were cached, so we had less mixed read/write on the nodes, thus giving the spindles more room to breathe. Now this was a shot in the dark then, but the price is not that high even to just try it out... compared to 6 people working on it. I believe the IO on disk was not huge either, but what kills the disk is high latency. How much bandwidth are the disks using? We had very low... 3-5MB/s.
>
> The second problem was defragmentation hitting 70%; lowering that to 6% made a lot of difference. Depending on the I/O pattern, this builds up differently.
>
> TL;DR read kills the 4TB spindles.
>
> Hope you guys clear out of the woods.
> /Josef
>
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc" wrote:
> - -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different things. Unfortunately, Inktank (now Red Hat) no longer provides consulting services for non-Red Hat systems. If there are certified Ceph consultants in the US with whom we can do both remote and on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the kernel. We have tried increasing the network and TCP buffers, the number of TCP sockets, and reduced the FIN_WAIT2 state. There is about 25% idle on the boxes; the disks are busy, but not constantly at 100% (they cycle from <10% up to 100%, but not 100% for more than a few seconds at a time). There seems to be no reasonable explanation why I/O is blocked pretty frequently for longer than 30 seconds. We have verified jumbo frames by pinging from/to each node with 9000 byte packets. The network admins have verified that packets are not being dropped in the switches for these nodes. We have tried different kernels, including the recent Google patch to cubic. This is showing up on three clusters (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie (from CentOS 7.1) with similar results.
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 : cluster [WRN] 14 slow requests, 1 included below; oldest blocked for > 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 : cluster [WRN] slow request 30.041999 seconds old, received at 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862 rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096] 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1 of 13); we will have hundreds of I/Os blocked, sometimes up to 300 seconds. It takes a good 15 minutes for things to settle down. The production cluster is very busy, doing normally 8,000 I/O and peaking at 15,000. This is all 4TB spindles with SSD journals, and the disks are between 25-50% full. We are currently splitting PGs to distribute the load better across the disks, but we are having to do this 10 PGs at a time as we get blocked I/O. We have max_backfills and max_recovery set to 1; client op priority is set higher than recovery priority. We tried increasing the number of op threads, but this didn't seem to help. It seems that as soon as PGs are finished being checked they become active, and that could be the cause of slow I/O while the other PGs are being checked.
>
> What I don't understand is why the messages are delayed. As soon as the message is received by the Ceph OSD process, it is very quickly committed to the journal and a response is sent back to the primary OSD, which is received very quickly as well. I've adjusted min_free_kbytes and it seems to keep the OSDs from crashing, but it doesn't solve the main problem. We don't have swap, and there is 64 GB of RAM per node for 10 OSDs.
>
> Is there something that could cause the kernel to get a packet but not be able to dispatch it to Ceph, which could explain why we are seeing these blocked I/Os for 30+ seconds? Are there some pointers to tracing Ceph messages from the network buffer through the kernel to the Ceph process?
>
> We can really use some pointers, no matter how outrageous. We have had over 6 people looking into this for weeks now and just can't think of anything else.
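[Editorial aside: Josef's defrag advice from the top of the thread, "run the defrag with the time parameter and it will take all available XFS drives", maps to xfs_fsr, whose -t flag bounds the run time in seconds and which, invoked with no target, walks every mounted XFS filesystem and resumes where the previous pass stopped. The schedule and log path below are illustrative assumptions, not from the thread:]

```
# Illustrative crontab entry: defragment all mounted XFS filesystems for at
# most 2 hours each night. xfs_fsr records where it left off (in
# /var/tmp/.fsrlast_xfs) and picks up from there on the next run.
0 2 * * * root /usr/sbin/xfs_fsr -v -t 7200 >> /var/log/xfs_fsr.log 2>&1
```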
>
> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc wrote:
> > We dropped the replication on our cluster from 4 to 3, and it looks like all the blocked I/O has stopped (no entries in the log for the last 12 hours). This makes me believe that there is some issue with the number of sockets or some other TCP issue. We have not messed with ephemeral ports and TIME_WAIT at this point. There are 130 OSDs and 8 KVM hosts hosting about 150 VMs. Open files is set at 32K for the OSD processes and 16K system-wide.
> >
> > Does this seem like the right spot to be looking? What are some configuration items we should be looking at?
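[Editorial aside: the recovery throttling Robert describes earlier in the thread (max_backfills and max_recovery at 1, client op priority above recovery priority) corresponds to ceph.conf options along these lines; "max_recovery" is assumed here to mean osd recovery max active. The values simply restate what the thread says was already set, not a recommendation:]

```
[osd]
# one backfill / one recovery op in flight per OSD, as described in the thread
osd max backfills = 1
osd recovery max active = 1
# favor client ops over recovery ops (priority range is 1-63)
osd client op priority = 63
osd recovery op priority = 1
```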
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We were able to get only ~17Gb out of the XL710 (heavily tweaked) until we went to the 4.x kernel, where we got ~36Gb (no tweaking). It seems that there were some major reworks in the network handling in the kernel to efficiently handle that network rate. If I remember right, we also saw a drop in CPU utilization. I'm starting to think that we did see packet loss while congesting our ISLs in our initial testing, but we could not tell where the dropping was happening. We saw some on the switches, but it didn't seem to be bad if we weren't trying to congest things. We probably already saw this issue, just didn't know it.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance cluster on a Mellanox 40GbE switch that appear (knock on wood) to be running fine with 3.10.0-229.7.2.el7.x86_64. We did get feedback from Intel that older drivers might cause problems, though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163 mtu 1500
> >>>         inet 10.0.10.101 netmask 255.255.255.0 broadcast 10.0.10.255
> >>>         inet6 fe80::6a05:caff:fe2b:7ea1 prefixlen 64 scopeid 0x20
> >>>         ether 68:05:ca:2b:7e:a1 txqueuelen 1000 (Ethernet)
> >>>         RX packets 169232242875 bytes 229346261232279 (208.5 TiB)
> >>>         RX errors 0 dropped 0 overruns 0 frame 0
> >>>         TX packets 153491686361 bytes 203976410836881 (185.5 TiB)
> >>>         TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >>>
> >>> Mark
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of the blocked I/Os and it seems that communication between two hosts was worse than between others. I did a two-way ping flood between the two hosts using max packet sizes (1500). After 1.5M packets, no lost pings. I then had the ping flood running while I put Ceph load on the cluster, and the dropped pings started increasing; after stopping the Ceph workload, the pings stopped dropping.
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that ruled out Ceph to a large degree. I then booted into the 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the kernel to see where this issue is introduced. Both of the clusters with this issue are running 4.x; other than that, they are pretty differing hardware and network configs.
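[Editorial aside on the ping tests discussed in this thread: a "9000 byte" jumbo-frame check has to account for header overhead, since the largest ICMP payload that fits in a single frame is the MTU minus the IPv4 and ICMP headers. The arithmetic is just this (assuming IPv4 without options):]

```python
IP_HEADER = 20    # IPv4 header without options, in bytes
ICMP_HEADER = 8   # ICMP echo request header, in bytes

def max_ping_payload(mtu):
    """Largest `ping -s` payload that fits one frame of the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER

# With DF set (ping -M do), anything larger must fragment or be dropped.
for mtu in (1500, 9000):
    print(f"MTU {mtu}: ping -M do -s {max_ping_payload(mtu)} <peer>")
```

So a node with MTU 9000 should pass `ping -M do -s 8972` end to end; if only 1472 goes through, jumbo frames are not actually working on that path.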
> >>>>
> >>>> Thanks,
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc wrote:
> >>>>>
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues pinging hosts with "No buffer space available" (hosts are currently configured for 4GB to test SSD caching rather than page cache). I found that an MTU under 32K worked reliably for ping, but we still had the blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing the blocked I/O.
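[Editorial aside: since the number of TCP sockets and FIN_WAIT2 come up as suspects earlier in the thread, one cheap check on an OSD node is tallying socket states straight from /proc/net/tcp. A rough sketch; the state codes are the kernel's own (01 = ESTABLISHED, 05 = FIN_WAIT2, 06 = TIME_WAIT, 0A = LISTEN):]

```python
from collections import Counter

# Kernel TCP state codes as they appear in the `st` column of /proc/net/tcp
# (see include/net/tcp_states.h in the kernel source).
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

def count_tcp_states(proc_net_tcp_text):
    """Tally socket states from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:  # fields[3] is the hex state code
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts

if __name__ == "__main__":
    import os
    if os.path.exists("/proc/net/tcp"):
        with open("/proc/net/tcp") as f:
            print(count_tcp_states(f.read()))
```

A pile-up of FIN_WAIT2 or TIME_WAIT entries here would lend weight to the socket theory; `ss -s` gives a similar summary.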
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs; it looks like there was a 53 second delay between when osd.17 started sending the osd_repop message and when osd.13 started reading it, which is pretty weird. Sage, didn't we once see a kernel issue which caused some messages to be mysteriously delayed for many 10s of seconds?
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild, it has been a network misconfiguration. Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc wrote:
> >>>>>>>>
> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>> Hash: SHA256
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've extracted what I think are important entries from the logs for the first blocked request. NTP is running on all the servers, so the logs should be close in terms of time.
Logs for 12:50 to 13:00 are available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and osd.16 almost simultaneously. osd.16 seems to get the I/O right away, but for some reason osd.13 doesn't get the message until 53 seconds later. osd.17 seems happy to just wait and doesn't resend the data (well, I'm not 100% sure how to tell which entries are the actual data transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the communication with osd.13, but the op is not acknowledged until almost a minute later. To me it seems that the message is getting received but not passed to another thread right away or something. This test was done with an idle cluster and a single fio client (rbd engine) with a single thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O requests. I think I'm at the end of my troubleshooting, so I can use some help.
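[Editorial aside: the 53-second gap is easy to confirm from the timeline quoted above, for example:]

```python
from datetime import datetime

def delta_secs(t0, t1, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Seconds elapsed between two Ceph log timestamps."""
    return (datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)).total_seconds()

# Timestamps from the timeline in this thread: osd.17 submits the repop
# to osd.13, and osd.13 first sees the I/O much later.
sent = "2015-09-22 12:55:06.557160"
seen = "2015-09-22 12:55:59.790591"
print(f"osd.13 saw the message {delta_secs(sent, seen):.1f}s after osd.17 sent it")
```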
Single test started about 2015-09-22 12:52:36

2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.439150 secs
2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 : cluster [WRN] slow request 30.439150 seconds old, received at 2015-09-22 12:55:06.487451: osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,16
2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster [WRN] 2 slow requests, 2 included below; oldest blocked for > 30.379680 secs
2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster [WRN] slow request 30.291520 seconds old, received at 2015-09-22 12:55:06.406303: osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,17
2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster [WRN] slow request 30.379680 seconds old, received at 2015-09-22 12:55:06.318144: osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,14
2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.954212 secs
2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 : cluster [WRN] slow request 30.954212 seconds old, received at 2015-09-22 12:57:33.044003: osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 16,17
2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.704367 secs
2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 : cluster [WRN] slow request 30.704367 seconds old, received at 2015-09-22 12:57:33.055404: osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e [set-alloc-hint object_size 4194304 write_size 4194304,write 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785) currently waiting for subops from 13,17

Server   IP addr         OSD
nodev  - 192.168.55.11 - 12
nodew  - 192.168.55.12 - 13
nodex  - 192.168.55.13 - 16
nodey  - 192.168.55.14 - 17
nodez  - 192.168.55.15 - 14
nodezz - 192.168.55.16 - 15

fio job:
[rbd-test]
readwrite=write
blocksize=4M
#runtime=60
name=rbd-test
#readwrite=randwrite
#bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
#rwmixread=72
#norandommap
#size=1T
#blocksize=4k
ioengine=rbd
rbdname=test2
pool=rbd
clientname=admin
iodepth=8
#numjobs=4
#thread
#group_reporting
#time_based
#direct=1
#ramp_time=60

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum wrote:
> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote:
>> Is there some way to tell in the logs that this is happening?
>
> You can search for the (mangled) name _split_collection.
>
>> I'm not seeing much I/O or CPU usage during these times. Is there some
>> way to prevent the splitting? Is there a negative side effect to doing so?
> Bump up the split and merge thresholds. You can search the list for this;
> it was discussed not too long ago.
>
>> We've had I/O block for over 900 seconds, and as soon as the sessions
>> are aborted, they are reestablished and complete immediately.
>>
>> The fio test is just a sequential write, and starting it over (rewriting
>> from the beginning) still causes the issue. I suspected that it would not
>> have to create new files and therefore would not split collections. This
>> is on my test cluster with no other load.
>
> Hmm, that does make it seem less likely if you're really not creating new
> objects, i.e. if you're actually running fio in such a way that it's not
> allocating new FS blocks (this is probably hard to set up?).
>
>> I'll be doing a lot of testing today. Which log options and depths would
>> be the most helpful for tracking this issue down?
>
> If you want to go log diving, "debug osd = 20", "debug filestore = 20",
> and "debug ms = 1" are what the OSD guys like to see. That should spit
> out everything you need to track exactly what each Op is doing.
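For reference, the split/merge and debug settings mentioned above would go in the [osd] section of ceph.conf. A minimal sketch, with the caveat that the numeric threshold values below are illustrative assumptions, not recommendations; tune them for your own object counts:

```ini
[osd]
# Raising these delays directory splitting; a collection is split when it
# holds roughly (split multiple) * (merge threshold) * 16 objects.
# Example values only -- not recommendations.
filestore merge threshold = 40
filestore split multiple = 8

# Log levels suggested for diving into a blocked op
debug osd = 20
debug filestore = 20
debug ms = 1
```

The debug levels can also be applied to a running OSD without a restart, e.g. `ceph tell osd.17 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'` (exact injectargs syntax may vary slightly between releases).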
> -Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com