Ad 1) Thanks for the information.

Ad 2) Cluster information: 5x Dell R720, each with 256 GB RAM and 2x 6-core CPUs with HT. We use 10 Gb Ethernet for networking: one interface for the public network and the other for the cluster network.
Three of the servers run Ubuntu Server 12.04 with a 3.16.0-031600-generic kernel; the 2 new servers we want to attach run CentOS 7 with a 3.14.4 kernel. The old servers are node-01, -02 and -03; the new servers we need to attach are node-04 and -05. The HDDs are mixed (see http://pastebin.com/ga2Qp4we): OSDs with weight 0.27 are 300 GB 10k SAS drives, OSDs with weight 0.55 are 600 GB 10k SAS drives, and OSDs with weight 0.82 are 900 GB 10k SAS drives. Every OSD has a 10 GB journal partition on SSD (depending on the server, either 3x Intel S3500 SSDs (old nodes) or 2x OCZ RevoDrive 3 X2).

With the new node-05 server down, rados bench gives us this:

  sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0        0         0         0         0         -         0
    1      16       82        66   263.704       264  0.034511  0.157596
    2      16      194       178   355.768       448  0.144852   0.16572
    3      16      298       282   375.816       416  0.075267   0.15512
    4      16      419       403   402.835       484  0.073001  0.151483
    5      15      531       516   412.653       452   0.05382  0.153122
    6      16      652       636   423.861       480  0.045246  0.141938
    7      16      776       760   434.154       496  0.094384    0.1461
    8      16      869       853   426.377       372  0.055912  0.138176

And if I turn on the OSDs of the new server we get:

ceph@node-02:/home/leni$ rados -p test bench 10 write --no-cleanup
 Maintaining 16 concurrent writes of 4194304 bytes for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_node-02_45166
  sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0        0         0         0         0         -         0
    1      16       60        44   175.943       176   0.32255  0.254007
    2      16      105        89   177.954       180  0.172481  0.225552
    3      16      168       152   202.622       252  0.192577  0.305603
    4      16      223       207   206.958       220  0.353051   0.29058
    5      15      263       248   198.362       164  0.330949  0.293684
    6      16      307       291   193.964       172  0.192487  0.289606
    7      16      354       338   193.108       188  0.288342   0.27043
    8      16      393       377   188.465       156  0.388039  0.327652
    9      16      423       407   180.855       120  0.309964  0.337049
   10      16      481       465   185.967       232  0.090552   0.32017
   11      15      482       467   169.788         8  0.053557  0.319134
   12      15      482       467   155.639         0         -  0.319134

The same for sequential reads:

ceph@node-02:/home/leni$ rados -p test bench 10 seq
  sec Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0        0         0         0         0         -         0
    1      16       51        35   139.942       140   0.37242  0.308696
    2      16       89        73   144.524       152  0.331963  0.386161
    3      16      126       110   145.666       148   0.59192   0.40105
    4      16      161       145   144.249       140  0.170732   0.40284
    5      16      191       175   139.414       120  0.516386   0.42865
    6      16      223       207   137.515       128  0.333633  0.444945
    7      16      251       235   133.791       112  0.521015  0.453296
    8      15      286       271   135.056       144  0.188018  0.456516
    9      16      315       299   132.499       112   1.29282  0.465136
   10      16      342       326    129.75       108  0.889321  0.476148
 Total time run:        10.260278
Total reads made:       342
Read size:              4194304
Bandwidth (MB/sec):     133.330
Average Latency:        0.47688
Max latency:            1.69694
Min latency:            0.028467

I also sometimes see messages like these in the logs, for every OSD on the new server:

2014-10-22 14:20:59.142970 7ff415d7a700  0 -- 172.100.0.25:6800/44707 submit_message osd_op_reply(321 benchmark_data_node-02_13320_object320 [write 0~4194304] v9387'2846 uv2846 ondisk = 0) v6 remote, 172.100.0.22:0/1013320, failed lossy con, dropping message 0x2876900
2014-10-22 14:20:59.143126 7ff415d7a700  0 -- 172.100.0.25:6800/44707 submit_message osd_op_reply(322 benchmark_data_node-02_13320_object321 [write 0~4194304] v9387'3227 uv3227 ondisk = 0) v6 remote, 172.100.0.22:0/1013320, failed lossy con, dropping message 0x17c40c80

The network doesn't show any errors, and iperf gives us 7-8 Gb/s. When I bring the node-05 OSDs in, the cluster becomes so slow that we cannot use any of the services hosted on it. We had only 128 PGs for our pool, and adding even one OSD made the cluster "unworkable", so we extended it to 1024 PGs. But we still have other problems with adding new hosts.
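For reference, this is roughly the gradual approach we are trying when growing the pool and bringing a new OSD in (a minimal sketch: the pool name "test" is the one from the benchmarks above, while osd.26 and the weight steps are just illustrative values, not our exact ids):

    # grow the pool's placement groups; pgp_num has to follow pg_num
    # before the new PGs actually start rebalancing
    ceph osd pool set test pg_num 1024
    ceph osd pool set test pgp_num 1024

    # throttle backfill/recovery at runtime on all OSDs before adding a host
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

    # bring a new OSD in at a low CRUSH weight and raise it in steps,
    # instead of jumping straight to its full weight (0.82 for the 900 GB drives)
    ceph osd crush reweight osd.26 0.1
    # ...then e.g. 0.3, 0.5, 0.82 once each backfill round settles

Note that injectargs values are runtime-only, so they reset when an OSD restarts.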
2014-10-22 2:57 GMT+02:00 Christian Balzer <ch...@gol.com>:
> On Mon, 20 Oct 2014 11:07:43 +0200 Leszek Master wrote:
>
> > 1) If I want to use a cache tier, should I use it with SSD journaling, or
> > can I get better performance using more SSD GB for the cache tier?
>
> From reading what others on this ML experienced and what Robert already
> pointed out, cache tiering is definitely too unpolished at this point in
> time and not particularly helpful. Given the right changes and more tuning
> abilities I'd expect it to be useful in the future (1-2 releases out
> maybe?) though.
>
> > 2) I've got a cluster made of 26x 900 GB SAS disks with SSD journaling.
> > The placement-group count I've got is 1024. When I add a new OSD to the
> > cluster, my VMs get I/O errors and get stuck even with osd_max_backfills
> > set to 1. If I change PGs from 1024 to 4096, would the cluster be less
> > affected by backfilling and recovery?
>
> You're not telling us enough about your cluster by far, starting with
> Ceph and OS/kernel versions.
> What are your storage nodes like (all the specs: CPU, memory, network,
> what type of SSDs, journal-to-OSD ratio, etc.)?
>
> If your replica size is 2 (risky!) then your PG and PGP count should be
> 2048; with a replica of 3 your current number is fine when it comes to the
> formula, but it might still be better for data distribution at 2048 as well.
>
> But changing those values from what you have already should have little
> effect on your data-migration impact, as in the end the same amount of
> data needs to be moved if an OSD is added or lost, and your current PG
> count isn't horribly wrong.
> If your cluster is running close to capacity (monitor with atop!) during
> normal usage and with all the tunables already set to lowest impact, your
> only way forward is to address its shortcomings, whatever they are (CPU,
> IOPS, etc).
>
> Too high (way too high, usually) PG counts will cost you in performance
> due to CPU resource exhaustion caused by Ceph-internal locking/protocol
> overhead.
> Too few PGs, on the other hand, will not only cause uneven data
> distribution but ALSO cost you in performance, as the same cause is prone
> to creating hotspots.
>
> > 3) When I was adding my last 6 drives to the cluster, I noticed that the
> > recovery speed dropped from 500-1000 MB/s to 10-50 MB/s. When I restarted
> > the OSD that I was adding, the transfers went back to normal. I've also
> > noticed that when I then run a rados benchmark, transfers drop to 0 MB/s,
> > even a few times in a row. Restarting the OSDs that I was adding, or
> > restarting the ones already in the cluster one by one, solved the
> > problem. What can it be? There isn't anything weird in the logs. The
> > whole cluster is stuck till I restart them or even recreate their
> > journals. How do I solve this?
>
> That is very odd; maybe some of the Ceph developers have an idea or a
> recollection of seeing this before.
>
> In general you will want to monitor all your cluster nodes with something
> like atop in a situation like this, to spot potential problems like slow
> disks, CPU or network starvation, etc.
>
> Christian
>
> > Please help me.
> >
> > Best regards!
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
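For reference, the rule of thumb behind the PG counts Christian quotes above (the standard sizing formula from the Ceph placement-group docs, worked through here for this cluster's 26 OSDs):

    total PGs ≈ (number of OSDs * 100) / replica size, rounded up to a power of two

    replica 3:  26 * 100 / 3 ≈  867  ->  1024 (the current count)
    replica 2:  26 * 100 / 2 = 1300  ->  2048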