[ceph-users] RGW multisite sync issue

2019-05-28 Thread Matteo Dacrema
Hi All,

I’ve configured a multisite deployment on Ceph Nautilus 14.2.1 with one zone 
group “eu", one master zone and two secondary zone.

If I upload ( on the master zone ) for 200 objects of 80MB each and I delete 
all of them without waiting for the replication to finish I end up with one 
zone empty and the other two that still have objects.

It’s seems like secondary zones tries to synchronize between themselves infact 
they have same files. Sometimes happens also that objects have a different size 
of the original one.

Can anyone help?

Thank you
Matteo


This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Multisite RGW

2019-05-27 Thread Matteo Dacrema
Hi all,

I’m planning to replace a swift multi-region deployment using Ceph.
Right now Swift is deployed across 3 region in Europe and the data is 
replicated across this 3 regions.

Is it possible to configure Ceph to do the same? 
I think I need to go with multiple zone group with a single realm right? 
I also noticed that if I lose the master zone group the whole object storage 
service might stop working. Is it right?

Can Ceph right now compete with Swift in terms of distributed multi-region 
object storage?

Another thing is: in Swift I’m placing one replica per single region. If I lose 
one HDD in that region Swift recover the object by reading from other regions. 
Do Ceph act at the same mode recovering from other regions?


Thank you
Regards
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
Ok, thank you guys

The version is 10.2.10

Matteo

> Il giorno 20 nov 2017, alle ore 23:15, Christian Balzer  ha 
> scritto:
> 
> On Mon, 20 Nov 2017 10:35:36 -0800 Chris Taylor wrote:
> 
>> On 2017-11-20 3:39 am, Matteo Dacrema wrote:
>>> Yes I mean the existing Cluster.
>>> SSDs are on a fully separate pool.
>>> Cluster is not busy during recovery and deep scrubs but I think it’s
>>> better to limit replication in some way when switching to replica 3.
>>> 
>>> My question is to understand if I need to set some options parameters
>>> to limit the impact of the creation of new objects.I’m also concerned
>>> about disk filling up during recovery because of inefficient data
>>> balancing.  
>> 
>> You can try using osd_recovery_sleep to slow down the backfilling so it 
>> does not cause the client io to hang.
>> 
>> ceph tell osd.* injectargs "--osd_recovery_sleep 0.1"
>> 
> 
> Which is one of the things that is version specific and we don't know the
> version yet.
> 
> The above will work with Hammer and should again with Luminous, but not so
> much with the unified queue bits inbetween. 
> 
> Christian
> 
>> 
>>> 
>>> Here osd tree
>>> 
>>> ID  WEIGHTTYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>>> -10  19.69994 root ssd
>>> -11   5.06998 host ceph101
>>> 166   0.98999 osd.166   up  1.0  1.0
>>> 167   1.0 osd.167   up  1.0  1.0
>>> 168   1.0 osd.168   up  1.0  1.0
>>> 169   1.07999 osd.169   up  1.0  1.0
>>> 170   1.0 osd.170   up  1.0  1.0
>>> -12   4.92998 host ceph102
>>> 171   0.98000 osd.171   up  1.0  1.0
>>> 172   0.92999 osd.172   up  1.0  1.0
>>> 173   0.98000 osd.173   up  1.0  1.0
>>> 174   1.0 osd.174   up  1.0  1.0
>>> 175   1.03999 osd.175   up  1.0  1.0
>>> -13   4.69998 host ceph103
>>> 176   0.84999 osd.176   up  1.0  1.0
>>> 177   0.84999 osd.177   up  1.0  1.0
>>> 178   1.0 osd.178   up  1.0  1.0
>>> 179   1.0 osd.179   up  1.0  1.0
>>> 180   1.0 osd.180   up  1.0  1.0
>>> -14   5.0 host ceph104
>>> 181   1.0 osd.181   up  1.0  1.0
>>> 182   1.0 osd.182   up  1.0  1.0
>>> 183   1.0 osd.183   up  1.0  1.0
>>> 184   1.0 osd.184   up  1.0  1.0
>>> 185   1.0 osd.185   up  1.0  1.0
>>> -1 185.19835 root default
>>> -2  18.39980 host ceph001
>>> 63   0.7 osd.63up  1.0  1.0
>>> 64   0.7 osd.64up  1.0  1.0
>>> 65   0.7 osd.65up  1.0  1.0
>>> 146   0.7 osd.146   up  1.0  1.0
>>> 147   0.7 osd.147   up  1.0  1.0
>>> 148   0.90999 osd.148   up  1.0  1.0
>>> 149   0.7 osd.149   up  1.0  1.0
>>> 150   0.7 osd.150   up  1.0  1.0
>>> 151   0.7 osd.151   up  1.0  1.0
>>> 152   0.7 osd.152   up  1.0  1.0
>>> 153   0.7 osd.153   up  1.0  1.0
>>> 154   0.7 osd.154   up  1.0  1.0
>>> 155   0.8 osd.155   up  1.0  1.0
>>> 156   0.84999 osd.156   up  1.0  1.0
>>> 157   0.7 osd.157   up  1.0  1.0
>>> 158   0.7 osd.158   up  1.0  1.0
>>> 159   0.84999 osd.159   up  1.0  1.0
>>> 160   0.90999 osd.160   up  1.0  1.0
>>> 161   0.90999 osd.161   up  1.0  1.0
>>> 162   0.90999 osd.162   up  1.0  1.0
>>> 163   0.7 osd.163   up  1.0  1.0
>>> 164   0.90999 osd.164   up  1.0 

Re: [ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
   up  1.0  1.0



> Il giorno 20 nov 2017, alle ore 12:17, Christian Balzer  ha 
> scritto:
> 
> 
> Hello,
> 
> On Mon, 20 Nov 2017 11:56:31 +0100 Matteo Dacrema wrote:
> 
>> Hi,
>> 
>> I need to switch a cluster of over 200 OSDs from replica 2 to replica 3
> I presume this means the existing cluster and not adding 100 OSDs...
> 
>> There are two different crush maps for HDD and SSDs also mapped to two 
>> different pools.
>> 
>> Is there a best practice to use? Can this provoke troubles?
>> 
> Are your SSDs a cache-tier or are they a fully separate pool?
> 
> As for troubles, how busy is your cluster during the recovery of failed
> OSDs or deep scrubs?
> 
> There are 2 things to consider here:
> 
> 1. The re-balancing and additional replication of all the data, which you
> can control/ease by the various knobs present. Ceph version matters to
> which are relevant/useful. It shouldn't impact things too much, unless
> your cluster was at the very edge of it's capacity anyway.
> 
> 2. The little detail that after 1) is done, your cluster will be
> noticeably slower than before, especially in the latency department. 
> In short, you don't just need to have the disk space to go 3x, but also
> enough IOPS/bandwidth reserves.
> 
> Christian
> 
>> Thank you
>> Matteo
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten Communications
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=524464756E.A33EC
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Switch to replica 3

2017-11-20 Thread Matteo Dacrema
Hi,

I need to switch a cluster of over 200 OSDs from replica 2 to replica 3
There are two different crush maps for HDD and SSDs also mapped to two 
different pools.

Is there a best practice to use? Can this provoke troubles?

Thank you
Matteo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Active+clean PGs reported many times in log

2017-11-20 Thread Matteo Dacrema
I was running 10.2.7 but I’ve upgraded to 10.2.10 few days ago.

Here Pg dump:

https://owncloud.enter.it/index.php/s/AaD5Fc5tA6c8i1G 
<https://owncloud.enter.it/index.php/s/AaD5Fc5tA6c8i1G>



> Il giorno 19 nov 2017, alle ore 11:15, Gregory Farnum  ha 
> scritto:
> 
> On Tue, Nov 14, 2017 at 1:09 AM Matteo Dacrema  <mailto:mdacr...@enter.eu>> wrote:
> Hi,
> I noticed that sometimes the monitors start to log active+clean pgs many 
> times in the same line. For example I have 18432 and the logs shows " 2136 
> active+clean, 28 active+clean, 2 active+clean+scrubbing+deep, 16266 
> active+clean;”
> After a minute monitor start to log correctly again.
> 
> Is it normal ?
> 
> That definitely looks weird to me, but I can imagine a few ways for it to 
> occur. What version of Ceph are you running? Can you extract the pgmap and 
> post the binary somewhere?
>  
> 
> 2017-11-13 11:05:08.876724 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797105: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 40596 kB/s 
> rd, 89723 kB/s wr, 4899 op/s
> 2017-11-13 11:05:09.911266 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797106: 18432 pgs: 2 active+clean+scrubbing+deep, 18430 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 45931 kB/s 
> rd, 114 MB/s wr, 6179 op/s
> 2017-11-13 11:05:10.751378 7fb359cfb700  0 mon.controller001@0(leader) e1 
> handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v 
> 0) v1
> 2017-11-13 11:05:10.751599 7fb359cfb700  0 log_channel(audit) log [DBG] : 
> from='client.? MailScanner warning: numerical links are often malicious: 
> 10.16.24.127:0/547552484 <http://10.16.24.127:0/547552484>' 
> entity='client.telegraf' cmd=[{"prefix": "osd pool stats", "format": 
> "json"}]: dispatch
> 2017-11-13 11:05:10.926839 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797107: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 47617 kB/s 
> rd, 134 MB/s wr, 7414 op/s
> 2017-11-13 11:05:11.921115 7fb35d17d700  1 mon.controller001@0(leader).osd 
> e120942 e120942: 216 osds: 216 up, 216 in
> 2017-11-13 11:05:11.926818 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> osdmap e120942: 216 osds: 216 up, 216 in
> 2017-11-13 11:05:11.984732 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797109: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 
> active+clean; 59520 GB data, 129 TB used, 110 TB / 239 TB avail; 54110 kB/s 
> rd, 115 MB/s wr, 7827 op/s
> 2017-11-13 11:05:13.085799 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797110: 18432 pgs: 973 active+clean, 12 active+clean, 3 
> active+clean+scrubbing+deep, 17444 active+clean; 59520 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 115 MB/s rd, 90498 kB/s wr, 8490 op/s
> 2017-11-13 11:05:14.181219 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797111: 18432 pgs: 2136 active+clean, 28 active+clean, 2 
> active+clean+scrubbing+deep, 16266 active+clean; 59520 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 136 MB/s rd, 94461 kB/s wr, 10237 op/s
> 2017-11-13 11:05:15.324630 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797112: 18432 pgs: 3179 active+clean, 44 active+clean, 2 
> active+clean+scrubbing+deep, 15207 active+clean; 59519 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 184 MB/s rd, 81743 kB/s wr, 13786 op/s
> 2017-11-13 11:05:16.381452 7fb35d17d700  0 log_channel(cluster) log [INF] : 
> pgmap v99797113: 18432 pgs: 3600 active+clean, 52 active+clean, 2 
> active+clean+scrubbing+deep, 14778 active+clean; 59518 GB data, 129 TB used, 
> 110 TB / 239 TB avail; 208 MB/s rd, 77342 kB/s wr, 14382 op/s
> 2017-11-13 11:05:17.272757 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
> started
> 2017-11-13 11:05:17.390808 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
> 18281928 bytes OK
> 2017-11-13 11:05:17.392636 7fb3570f2700  1 leveldb: Delete type=0 #26314647
> 
> 2017-11-13 11:05:17.397516 7fb3570f2700  1 leveldb: Manual compaction at 
> level-0 from 'pgmap\x0099796362' @ 72057594037927935 : 1 .. 
> 'pgmap\x0099796613' @ 0 : 0; will stop at 'pgmap_pg\x006.ff' @ 29468156273 : 1
> 
> 
> Thank you
> Matteo
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> -- 
> Questo mess

[ceph-users] Active+clean PGs reported many times in log

2017-11-13 Thread Matteo Dacrema
Hi, 
I noticed that sometimes the monitors start to log active+clean pgs many times 
in the same line. For example I have 18432 and the logs shows " 2136 
active+clean, 28 active+clean, 2 active+clean+scrubbing+deep, 16266 
active+clean;”
After a minute monitor start to log correctly again.

Is it normal ?

2017-11-13 11:05:08.876724 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797105: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 active+clean; 
59520 GB data, 129 TB used, 110 TB / 239 TB avail; 40596 kB/s rd, 89723 kB/s 
wr, 4899 op/s
2017-11-13 11:05:09.911266 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797106: 18432 pgs: 2 active+clean+scrubbing+deep, 18430 active+clean; 
59520 GB data, 129 TB used, 110 TB / 239 TB avail; 45931 kB/s rd, 114 MB/s wr, 
6179 op/s
2017-11-13 11:05:10.751378 7fb359cfb700  0 mon.controller001@0(leader) e1 
handle_command mon_command({"prefix": "osd pool stats", "format": "json"} v 0) 
v1
2017-11-13 11:05:10.751599 7fb359cfb700  0 log_channel(audit) log [DBG] : 
from='client.? 10.16.24.127:0/547552484' entity='client.telegraf' 
cmd=[{"prefix": "osd pool stats", "format": "json"}]: dispatch
2017-11-13 11:05:10.926839 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797107: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 active+clean; 
59520 GB data, 129 TB used, 110 TB / 239 TB avail; 47617 kB/s rd, 134 MB/s wr, 
7414 op/s
2017-11-13 11:05:11.921115 7fb35d17d700  1 mon.controller001@0(leader).osd 
e120942 e120942: 216 osds: 216 up, 216 in
2017-11-13 11:05:11.926818 7fb35d17d700  0 log_channel(cluster) log [INF] : 
osdmap e120942: 216 osds: 216 up, 216 in
2017-11-13 11:05:11.984732 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797109: 18432 pgs: 3 active+clean+scrubbing+deep, 18429 active+clean; 
59520 GB data, 129 TB used, 110 TB / 239 TB avail; 54110 kB/s rd, 115 MB/s wr, 
7827 op/s
2017-11-13 11:05:13.085799 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797110: 18432 pgs: 973 active+clean, 12 active+clean, 3 
active+clean+scrubbing+deep, 17444 active+clean; 59520 GB data, 129 TB used, 
110 TB / 239 TB avail; 115 MB/s rd, 90498 kB/s wr, 8490 op/s
2017-11-13 11:05:14.181219 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797111: 18432 pgs: 2136 active+clean, 28 active+clean, 2 
active+clean+scrubbing+deep, 16266 active+clean; 59520 GB data, 129 TB used, 
110 TB / 239 TB avail; 136 MB/s rd, 94461 kB/s wr, 10237 op/s
2017-11-13 11:05:15.324630 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797112: 18432 pgs: 3179 active+clean, 44 active+clean, 2 
active+clean+scrubbing+deep, 15207 active+clean; 59519 GB data, 129 TB used, 
110 TB / 239 TB avail; 184 MB/s rd, 81743 kB/s wr, 13786 op/s
2017-11-13 11:05:16.381452 7fb35d17d700  0 log_channel(cluster) log [INF] : 
pgmap v99797113: 18432 pgs: 3600 active+clean, 52 active+clean, 2 
active+clean+scrubbing+deep, 14778 active+clean; 59518 GB data, 129 TB used, 
110 TB / 239 TB avail; 208 MB/s rd, 77342 kB/s wr, 14382 op/s
2017-11-13 11:05:17.272757 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
started
2017-11-13 11:05:17.390808 7fb3570f2700  1 leveldb: Level-0 table #26314650: 
18281928 bytes OK
2017-11-13 11:05:17.392636 7fb3570f2700  1 leveldb: Delete type=0 #26314647

2017-11-13 11:05:17.397516 7fb3570f2700  1 leveldb: Manual compaction at 
level-0 from 'pgmap\x0099796362' @ 72057594037927935 : 1 .. 'pgmap\x0099796613' 
@ 0 : 0; will stop at 'pgmap_pg\x006.ff' @ 29468156273 : 1


Thank you
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster hang (deep scrub bug? "waiting for scrub")

2017-11-13 Thread Matteo Dacrema
I’ve seen that only one time and noticed that there’s a bug fixed in 10.2.10 (  
http://tracker.ceph.com/issues/20041 <http://tracker.ceph.com/issues/20041> ) 
Yes I use snapshots.

As I can see in my case the PG was scrubbing since 20 days but I’ve only 7 days 
logs so I’m not able to identify the affected PG.



> Il giorno 10 nov 2017, alle ore 14:05, Peter Maloney 
>  ha scritto:
> 
> I have often seen a problem where a single osd in an eternal deep scrup
> will hang any client trying to connect. Stopping or restarting that
> single OSD fixes the problem.
> 
> Do you use snapshots?
> 
> Here's what the scrub bug looks like (where that many seconds is 14 hours):
> 
>> ceph daemon "osd.$osd_number" dump_blocked_ops
> 
>>  {
>>  "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
>> rbd_data.46820b238e1f29.aa70 [set-alloc-hint object_size
>> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
>> ack+ondisk+write+known_if_redirected e148441)",
>>  "initiated_at": "2017-09-12 20:04:27.987814",
>>  "age": 49315.666393,
>>  "duration": 49315.668515,
>>  "type_data": [
>>  "delayed",
>>  {
>>  "client": "client.6480719",
>>  "tid": 2000419292
>>  },
>>  [
>>  {
>>  "time": "2017-09-12 20:04:27.987814",
>>  "event": "initiated"
>>  },
>>  {
>>  "time": "2017-09-12 20:04:27.987862",
>>  "event": "queued_for_pg"
>>  },
>>      {
>>  "time": "2017-09-12 20:04:28.004142",
>>  "event": "reached_pg"
>>  },
>>  {
>>  "time": "2017-09-12 20:04:28.004219",
>>  "event": "waiting for scrub"
>>  }
>>  ]
>>  ]
>>  }
> 
> 
> 
> 
> 
> 
> On 11/09/17 17:20, Matteo Dacrema wrote:
>> Update:  I noticed that there was a pg that remained scrubbing from the 
>> first day I found the issue to when I reboot the node and problem 
>> disappeared.
>> Can this cause the behaviour I described before?
>> 
>> 
>>> Il giorno 09 nov 2017, alle ore 15:55, Matteo Dacrema  
>>> ha scritto:
>>> 
>>> Hi all,
>>> 
>>> I’ve experienced a strange issue with my cluster.
>>> The cluster is composed by 10 HDDs nodes with 20 nodes + 4 journal each 
>>> plus 4 SSDs nodes with 5 SSDs each.
>>> All the nodes are behind 3 monitors and 2 different crush maps.
>>> All the cluster is on 10.2.7 
>>> 
>>> About 20 days ago I started to notice that long backups hangs with "task 
>>> jbd2/vdc1-8:555 blocked for more than 120 seconds” on the HDD crush map.
>>> About few days ago another VM start to have high iowait without doing iops 
>>> also on the HDD crush map.
>>> 
>>> Today about a hundreds VMs wasn’t able to read/write from many volumes all 
>>> of them on HDD crush map. Ceph health was ok and no significant log entries 
>>> were found.
>>> Not all the VMs experienced this problem and in the meanwhile the iops on 
>>> the journal and HDDs was very low even if I was able to do significant iops 
>>> on the working VMs.
>>> 
>>> After two hours of debug I decided to reboot one of the OSD nodes and the 
>>> cluster start to respond again. Now the OSD node is back in the cluster and 
>>> the problem is disappeared.
>>> 
>>> Can someone help me to understand what happened?
>>> I see strange entries in the log files like:
>>> 
>>> accept replacing existing (lossy) channel (new one lossy=1)
>>> fault with nothing to send, going to standby
>>> leveldb manual compact 
>>> 
>>> I can share all the logs that can help to identify the issue.
>>> 
>>> Thank you.
>>> Regards,
>>> 
>>> Matteo
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>

Re: [ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Update:  I noticed that there was a pg that remained scrubbing from the first 
day I found the issue to when I reboot the node and problem disappeared.
Can this cause the behaviour I described before?


> Il giorno 09 nov 2017, alle ore 15:55, Matteo Dacrema  ha 
> scritto:
> 
> Hi all,
> 
> I’ve experienced a strange issue with my cluster.
> The cluster is composed by 10 HDDs nodes with 20 nodes + 4 journal each plus 
> 4 SSDs nodes with 5 SSDs each.
> All the nodes are behind 3 monitors and 2 different crush maps.
> All the cluster is on 10.2.7 
> 
> About 20 days ago I started to notice that long backups hangs with "task 
> jbd2/vdc1-8:555 blocked for more than 120 seconds” on the HDD crush map.
> About few days ago another VM start to have high iowait without doing iops 
> also on the HDD crush map.
> 
> Today about a hundreds VMs wasn’t able to read/write from many volumes all of 
> them on HDD crush map. Ceph health was ok and no significant log entries were 
> found.
> Not all the VMs experienced this problem and in the meanwhile the iops on the 
> journal and HDDs was very low even if I was able to do significant iops on 
> the working VMs.
> 
> After two hours of debug I decided to reboot one of the OSD nodes and the 
> cluster start to respond again. Now the OSD node is back in the cluster and 
> the problem is disappeared.
> 
> Can someone help me to understand what happened?
> I see strange entries in the log files like:
> 
> accept replacing existing (lossy) channel (new one lossy=1)
> fault with nothing to send, going to standby
> leveldb manual compact 
> 
> I can share all the logs that can help to identify the issue.
> 
> Thank you.
> Regards,
> 
> Matteo
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=12EAC4481A.A6F60
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster hang

2017-11-09 Thread Matteo Dacrema
Hi all,

I’ve experienced a strange issue with my cluster.
The cluster is composed by 10 HDDs nodes with 20 nodes + 4 journal each plus 4 
SSDs nodes with 5 SSDs each.
All the nodes are behind 3 monitors and 2 different crush maps.
All the cluster is on 10.2.7 

About 20 days ago I started to notice that long backups hangs with "task 
jbd2/vdc1-8:555 blocked for more than 120 seconds” on the HDD crush map.
About few days ago another VM start to have high iowait without doing iops also 
on the HDD crush map.

Today about a hundreds VMs wasn’t able to read/write from many volumes all of 
them on HDD crush map. Ceph health was ok and no significant log entries were 
found.
Not all the VMs experienced this problem and in the meanwhile the iops on the 
journal and HDDs was very low even if I was able to do significant iops on the 
working VMs.

After two hours of debug I decided to reboot one of the OSD nodes and the 
cluster start to respond again. Now the OSD node is back in the cluster and the 
problem is disappeared.

Can someone help me to understand what happened?
I see strange entries in the log files like:

accept replacing existing (lossy) channel (new one lossy=1)
fault with nothing to send, going to standby
leveldb manual compact 

I can share all the logs that can help to identify the issue.

Thank you.
Regards,

Matteo


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Matteo Dacrema

In the meanwhile I find out why this happened.
For some reason the 3 osds was not marked out of the cluster as the other and 
this caused the cluster to not reassing PGs to other OSDs.

This is strange because I leaved the 3 osds down for two days.


> Il giorno 16 ott 2017, alle ore 10:21, Matteo Dacrema  ha 
> scritto:
> 
> Hi all,
> 
> I’m testing Ceph Luminous 12.2.1 installed with ceph ansible.
> 
> Doing some failover tests I noticed that when I kill an osd or and hosts Ceph 
> doesn’t recover automatically remaining in this state until I bring OSDs or 
> host back online.
> I’ve 3 pools volumes, cephfs_data and cephfs_metadata with size 3 and 
> min_size 1.
> 
> Is there something I’m missing ?
> 
> Below some cluster info.
> 
> Thank you all
> Regards
> 
> Matteo
> 
> 
>  cluster:
>id: ab7cb890-ee21-484e-9290-14b9e5e85125
>health: HEALTH_WARN
>3 osds down
>Degraded data redundancy: 2842/73686 objects degraded (3.857%), 
> 318 pgs unclean, 318 pgs degraded, 318 pgs undersized
> 
>  services:
>mon: 3 daemons, quorum controller001,controller002,controller003
>mgr: controller001(active), standbys: controller002, controller003
>mds: cephfs-1/1/1 up  {0=controller002=up:active}, 2 up:standby
>osd: 77 osds: 74 up, 77 in
> 
>  data:
>pools:   3 pools, 4112 pgs
>objects: 36843 objects, 142 GB
>usage:   470 GB used, 139 TB / 140 TB avail
>pgs: 2842/73686 objects degraded (3.857%)
> 3794 active+clean
> 318  active+undersized+degraded
> 
> 
> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
> -1   140.02425 root default
> -920.00346 host storage001
>  0   hdd   1.81850 osd.0   up  1.0 1.0
>  6   hdd   1.81850 osd.6   up  1.0 1.0
>  8   hdd   1.81850 osd.8   up  1.0 1.0
> 11   hdd   1.81850 osd.11  up  1.0 1.0
> 14   hdd   1.81850 osd.14  up  1.0 1.0
> 18   hdd   1.81850 osd.18  up  1.0 1.0
> 24   hdd   1.81850 osd.24  up  1.0 1.0
> 28   hdd   1.81850 osd.28  up  1.0 1.0
> 33   hdd   1.81850 osd.33  up  1.0 1.0
> 40   hdd   1.81850 osd.40  up  1.0 1.0
> 45   hdd   1.81850 osd.45  up  1.0 1.0
> -720.00346 host storage002
>  1   hdd   1.81850 osd.1   up  1.0 1.0
>  5   hdd   1.81850 osd.5   up  1.0 1.0
>  9   hdd   1.81850 osd.9   up  1.0 1.0
> 21   hdd   1.81850 osd.21  up  1.0 1.0
> 22   hdd   1.81850 osd.22  up  1.0 1.0
> 23   hdd   1.81850 osd.23  up  1.0 1.0
> 35   hdd   1.81850 osd.35  up  1.0 1.0
> 36   hdd   1.81850 osd.36  up  1.0 1.0
> 38   hdd   1.81850 osd.38  up  1.0 1.0
> 42   hdd   1.81850 osd.42  up  1.0 1.0
> 49   hdd   1.81850 osd.49  up  1.0 1.0
> -1120.00346 host storage003
> 27   hdd   1.81850 osd.27  up  1.0 1.0
> 31   hdd   1.81850 osd.31  up  1.0 1.0
> 32   hdd   1.81850 osd.32  up  1.0 1.0
> 37   hdd   1.81850 osd.37  up  1.0 1.0
> 44   hdd   1.81850 osd.44  up  1.0 1.0
> 46   hdd   1.81850 osd.46  up  1.0 1.0
> 48   hdd   1.81850 osd.48  up  1.0 1.0
> 53   hdd   1.81850 osd.53  up  1.0 1.0
> 54   hdd   1.81850 osd.54  up  1.0 1.0
> 56   hdd   1.81850 osd.56  up  1.0 1.0
> 59   hdd   1.81850 osd.59  up  1.0 1.0
> -320.00346 host storage004
>  2   hdd   1.81850 osd.2   up  1.0 1.0
>  4   hdd   1.81850 osd.4   up  1.0 1.0
> 10   hdd   1.81850 osd.10  up  1.0 1.0
> 16   hdd   1.81850 osd.16  up  1.0 1.0
> 17   hdd   1.81850 osd.17  up  1.0 1.0
> 19   hdd   1.81850 osd.19  up  1.0 1.0
> 26   hdd   1.81850 osd.26  up  1.0 1.0
> 29   hdd   1.81850 osd.29  up  1.0 1.0
> 39   hdd   1.81850 osd.39  up  1.0 1.0
> 43   hdd   1.81850 osd.43  up  1.0 1.0
> 50   hdd   1.81850 osd.50  up  1.0 1.0
> -520.00346   

[ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Matteo Dacrema
Hi all,

I’m testing Ceph Luminous 12.2.1 installed with ceph ansible.

Doing some failover tests I noticed that when I kill an osd or and hosts Ceph 
doesn’t recover automatically remaining in this state until I bring OSDs or 
host back online.
I’ve 3 pools volumes, cephfs_data and cephfs_metadata with size 3 and min_size 
1.

Is there something I’m missing ?

Below some cluster info.

Thank you all
Regards

Matteo


  cluster:
id: ab7cb890-ee21-484e-9290-14b9e5e85125
health: HEALTH_WARN
3 osds down
Degraded data redundancy: 2842/73686 objects degraded (3.857%), 318 
pgs unclean, 318 pgs degraded, 318 pgs undersized

  services:
mon: 3 daemons, quorum controller001,controller002,controller003
mgr: controller001(active), standbys: controller002, controller003
mds: cephfs-1/1/1 up  {0=controller002=up:active}, 2 up:standby
osd: 77 osds: 74 up, 77 in

  data:
pools:   3 pools, 4112 pgs
objects: 36843 objects, 142 GB
usage:   470 GB used, 139 TB / 140 TB avail
pgs: 2842/73686 objects degraded (3.857%)
 3794 active+clean
 318  active+undersized+degraded


ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
 -1   140.02425 root default
 -920.00346 host storage001
  0   hdd   1.81850 osd.0   up  1.0 1.0
  6   hdd   1.81850 osd.6   up  1.0 1.0
  8   hdd   1.81850 osd.8   up  1.0 1.0
 11   hdd   1.81850 osd.11  up  1.0 1.0
 14   hdd   1.81850 osd.14  up  1.0 1.0
 18   hdd   1.81850 osd.18  up  1.0 1.0
 24   hdd   1.81850 osd.24  up  1.0 1.0
 28   hdd   1.81850 osd.28  up  1.0 1.0
 33   hdd   1.81850 osd.33  up  1.0 1.0
 40   hdd   1.81850 osd.40  up  1.0 1.0
 45   hdd   1.81850 osd.45  up  1.0 1.0
 -720.00346 host storage002
  1   hdd   1.81850 osd.1   up  1.0 1.0
  5   hdd   1.81850 osd.5   up  1.0 1.0
  9   hdd   1.81850 osd.9   up  1.0 1.0
 21   hdd   1.81850 osd.21  up  1.0 1.0
 22   hdd   1.81850 osd.22  up  1.0 1.0
 23   hdd   1.81850 osd.23  up  1.0 1.0
 35   hdd   1.81850 osd.35  up  1.0 1.0
 36   hdd   1.81850 osd.36  up  1.0 1.0
 38   hdd   1.81850 osd.38  up  1.0 1.0
 42   hdd   1.81850 osd.42  up  1.0 1.0
 49   hdd   1.81850 osd.49  up  1.0 1.0
-1120.00346 host storage003
 27   hdd   1.81850 osd.27  up  1.0 1.0
 31   hdd   1.81850 osd.31  up  1.0 1.0
 32   hdd   1.81850 osd.32  up  1.0 1.0
 37   hdd   1.81850 osd.37  up  1.0 1.0
 44   hdd   1.81850 osd.44  up  1.0 1.0
 46   hdd   1.81850 osd.46  up  1.0 1.0
 48   hdd   1.81850 osd.48  up  1.0 1.0
 53   hdd   1.81850 osd.53  up  1.0 1.0
 54   hdd   1.81850 osd.54  up  1.0 1.0
 56   hdd   1.81850 osd.56  up  1.0 1.0
 59   hdd   1.81850 osd.59  up  1.0 1.0
 -320.00346 host storage004
  2   hdd   1.81850 osd.2   up  1.0 1.0
  4   hdd   1.81850 osd.4   up  1.0 1.0
 10   hdd   1.81850 osd.10  up  1.0 1.0
 16   hdd   1.81850 osd.16  up  1.0 1.0
 17   hdd   1.81850 osd.17  up  1.0 1.0
 19   hdd   1.81850 osd.19  up  1.0 1.0
 26   hdd   1.81850 osd.26  up  1.0 1.0
 29   hdd   1.81850 osd.29  up  1.0 1.0
 39   hdd   1.81850 osd.39  up  1.0 1.0
 43   hdd   1.81850 osd.43  up  1.0 1.0
 50   hdd   1.81850 osd.50  up  1.0 1.0
 -520.00346 host storage005
  3   hdd   1.81850 osd.3   up  1.0 1.0
  7   hdd   1.81850 osd.7   up  1.0 1.0
 12   hdd   1.81850 osd.12  up  1.0 1.0
 13   hdd   1.81850 osd.13  up  1.0 1.0
 15   hdd   1.81850 osd.15  up  1.0 1.0
 20   hdd   1.81850 osd.20  up  1.0 1.0
 25   hdd   1.81850 osd.25  up  1.0 1.0
 30   hdd   1.81850 osd.30  up  1.0 1.0
 34   hdd   1.81850 osd.34  up  1.0 1.0
 41   hdd   1.81850 osd.41  up  1.0 1.0
 47   hdd   1.81850 osd.47  up  1.0 1.0
-1320.0034

Re: [ceph-users] MySQL and ceph volumes

2017-03-08 Thread Matteo Dacrema
Ok, thank you guys.

I changed the innodb flush method to O_DIRECT and seems to performs quite 
better.

Regards
Matteo



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 08 mar 2017, alle ore 09:08, Wido den Hollander  ha 
> scritto:
> 
>> 
>> Op 8 maart 2017 om 0:35 schreef Matteo Dacrema > <mailto:mdacr...@enter.eu>>:
>> 
>> 
>> Thank you Adrian!
>> 
>> I’ve forgot this option and I can reproduce the problem.
>> 
>> Now, what could be the problem on ceph side with O_DSYNC writes?
>> 
> 
> As mentioned nothing, but what you can do with MySQL is provide it multiple 
> RBD disks, eg:
> 
> - Disk for Operating System
> - Disk for /var/lib/mysql
> - Disk for InnoDB data
> - Disk for InnoDB log
> - Disk for /var/log/mysql (binary logs)
> 
> That way you can send in more parallel I/O into the Ceph cluster and gain 
> more performance.
> 
> Wido
> 
>> Regards
>> Matteo
>> 
>> 
>> 
>> This email and any files transmitted with it are confidential and intended 
>> solely for the use of the individual or entity to whom they are addressed. 
>> If you have received this email in error please notify the system manager. 
>> This message contains confidential information and is intended only for the 
>> individual named. If you are not the named addressee you should not 
>> disseminate, distribute or copy this e-mail. Please notify the sender 
>> immediately by e-mail if you have received this e-mail by mistake and delete 
>> this e-mail from your system. If you are not the intended recipient you are 
>> notified that disclosing, copying, distributing or taking any action in 
>> reliance on the contents of this information is strictly prohibited.
>> 
>>> Il giorno 08 mar 2017, alle ore 00:25, Adrian Saul 
>>>  ha scritto:
>>> 
>>> 
>>> Possibly MySQL is doing sync writes, where as your FIO could be doing 
>>> buffered writes.
>>> 
>>> Try enabling the sync option on fio and compare results.
>>> 
>>> 
>>>> -Original Message-
>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>>> Matteo Dacrema
>>>> Sent: Wednesday, 8 March 2017 7:52 AM
>>>> To: ceph-users
>>>> Subject: [ceph-users] MySQL and ceph volumes
>>>> 
>>>> Hi All,
>>>> 
>>>> I have a galera cluster running on openstack with data on ceph volumes
>>>> capped at 1500 iops for read and write ( 3000 total ).
>>>> I can’t understand why with fio I can reach 1500 iops without IOwait and
>>>> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>>>> 
>>>> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) 
>>>> and I
>>>> can’t reproduce the problem.
>>>> 
>>>> Anyone can tell me where I’m wrong?
>>>> 
>>>> Thank you
>>>> Regards
>>>> Matteo
>>>> 
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> Confidentiality: This email and any attachments are confidential and may be 
>>> subject to copyright, legal or some other professional privilege. They are 
>>> intended solely for the attention and use of the named addressee(s). They 
>>> may only be copied, distributed or disclosed with the consent of the 
>>> copyright owner. If you have received this email by mistake or by breach of 
>>> the confidentiality clause, please notify the sender immediately by return 
>>> email and delete or destroy all copies of the email. Any confidentiality, 
>>> privilege or copyright is not waived o

Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Thank you Adrian!

I’ve forgot this option and I can reproduce the problem.

Now, what could be the problem on ceph side with O_DSYNC writes?

Regards
Matteo



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 08 mar 2017, alle ore 00:25, Adrian Saul 
>  ha scritto:
> 
> 
> Possibly MySQL is doing sync writes, where as your FIO could be doing 
> buffered writes.
> 
> Try enabling the sync option on fio and compare results.
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Matteo Dacrema
>> Sent: Wednesday, 8 March 2017 7:52 AM
>> To: ceph-users
>> Subject: [ceph-users] MySQL and ceph volumes
>> 
>> Hi All,
>> 
>> I have a galera cluster running on openstack with data on ceph volumes
>> capped at 1500 iops for read and write ( 3000 total ).
>> I can’t understand why with fio I can reach 1500 iops without IOwait and
>> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>> 
>> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I
>> can’t reproduce the problem.
>> 
>> Anyone can tell me where I’m wrong?
>> 
>> Thank you
>> Regards
>> Matteo
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=13CCD402D0.AA534
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replica questions

2017-03-07 Thread Matteo Dacrema
Hi,

thank you all.

I’m using Mellanox switches with connectX-3 40 gbit pro NIC.
Bond balance-xor with policy layer3+4 

It’s a bit expensive but it’s very hard to saturate.
I’m using one single nic for both replica and access network. 


> Il giorno 03 mar 2017, alle ore 14:52, Vy Nguyen Tan 
>  ha scritto:
> 
> Hi,
> 
> You should read email from Wido den Hollander:
> "Hi,
> 
> As a Ceph consultant I get numerous calls throughout the year to help people 
> with getting their broken Ceph clusters back online.
> 
> The causes of downtime vary vastly, but one of the biggest causes is that 
> people use replication 2x. size = 2, min_size = 1.
> 
> In 2016 the amount of cases I have where data was lost due to these settings 
> grew exponentially.
> 
> Usually a disk failed, recovery kicks in and while recovery is happening a 
> second disk fails. Causing PGs to become incomplete.
> 
> There have been to many times where I had to use xfs_repair on broken disks 
> and use ceph-objectstore-tool to export/import PGs.
> 
> I really don't like these cases, mainly because they can be prevented easily 
> by using size = 3 and min_size = 2 for all pools.
> 
> With size = 2 you go into the danger zone as soon as a single disk/daemon 
> fails. With size = 3 you always have two additional copies left thus keeping 
> your data safe(r).
> 
> If you are running CephFS, at least consider running the 'metadata' pool with 
> size = 3 to keep the MDS happy.
> 
> Please, let this be a big warning to everybody who is running with size = 2. 
> The downtime and problems caused by missing objects/replicas are usually big 
> and it takes days to recover from those. But very often data is lost and/or 
> corrupted which causes even more problems.
> 
> I can't stress this enough. Running with size = 2 in production is a SERIOUS 
> hazard and should not be done imho.
> 
> To anyone out there running with size = 2, please reconsider this!
> 
> Thanks,
> 
> Wido"
> 
> Btw, could you please share your experience about HA network for Ceph ? What 
> type of bonding do you have? are you using stackable switches?
> 
> 
> 
> On Fri, Mar 3, 2017 at 6:24 PM, Maxime Guyot  <mailto:maxime.gu...@elits.com>> wrote:
> Hi Henrik and Matteo,
> 
>  
> 
> While I agree with Henrik: increasing your replication factor won’t improve 
> recovery or read performance on its own. If you are changing from replica 2 
> to replica 3, you might need to scale-out your cluster to have enough space 
> for the additional replica, and that would improve the recovery and read 
> performance.
> 
>  
> 
> Cheers,
> 
> Maxime
> 
>  
> 
> From: ceph-users  <mailto:ceph-users-boun...@lists.ceph.com>> on behalf of Henrik Korkuc 
> mailto:li...@kirneh.eu>>
> Date: Friday 3 March 2017 11:35
> To: "ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>" 
> mailto:ceph-users@lists.ceph.com>>
> Subject: Re: [ceph-users] replica questions
> 
>  
> 
> On 17-03-03 12:30, Matteo Dacrema wrote:
> 
> Hi All,
> 
>  
> 
> I’ve a production cluster made of 8 nodes, 166 OSDs and 4 Journal SSD every 5 
> OSDs with replica 2 for a total RAW space of 150 TB.
> 
> I’ve few question about it:
> 
>  
> 
> It’s critical to have replica 2? Why?
> Replica size 3 is highly recommended. I do not know exact numbers but it 
> decreases chance of data loss as 2 disk failures appear to be quite frequent 
> thing, especially in larger clusters.
> 
> 
> Does replica 3 makes recovery faster?
> no
> 
> 
> Does replica 3 makes rebalancing and recovery less heavy for customers? If I 
> lose 1 node does replica 3 reduce the IO impact respect a replica 2?
> no
> 
> 
> Does read performance increase with replica 3?
> no
> 
> 
>  
> 
> Thank you
> 
> Regards
> 
> Matteo
> 
>  
> 
> 
> 
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and delete 
> this e-mail from your system. If you are not the intended recipient you are 
> notified that disclosing, copying, distributing or taking any action in 
> reliance on the contents

Re: [ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Hi Deepak,

thank you.

Here an example of iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   5.160.002.64   15.740.00   76.45

Device: rrqm/s  wrqm/s  r/s w/s rkB/s   
wkB/s   avgrq-szavgqu-szawait   r_await 
w_await svctm   %util
vda   0.00  0.000.000.00
0.000.000.000.00
0.000.000.000.000.00
vdb   0.00  1.0096.00   292.00  4944.00 
14065 2.00  750.49  17.39   43.89   
17.79   52.47   2.58100.00

vdb is the ceph volumes with xfs fs.


Disk /dev/vdb: 2199.0 GB, 219902322 bytes
255 heads, 63 sectors/track, 267349 cylinders, total 4294967296 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x

   Device Boot  Start End  Blocks   Id  System
/dev/vdb1   1  4294967295  2147483647+  ee  GPT

Regards
Matteo

> Il giorno 07 mar 2017, alle ore 22:08, Deepak Naidu  ha 
> scritto:
> 
> My response is without any context to ceph or any SDS, purely how to check 
> the IO bottleneck. You can then determine if its Ceph or any other process or 
> disk.
>  
> >> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
> Lower IOPS is not issue with itself as your block size might be higher. But 
> MySQL doing higher block not sure.  You can check below iostat metrics to see 
> why is the IO wait higher.
>  
> *  avgqu-sz(Avg queue length)à  Higher the Q length 
> more the IO wait
> * avgrq-sz[The average size (in sectors)] à  Shows IOblock size( check 
> this when using mysql). [ you need to calculate this based on your FS block 
> size in KB & don’t just you the avgrq-sz # ]
>  
>  
> --
> Deepak
>  
>  
>  
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of Matteo Dacrema
> Sent: Tuesday, March 07, 2017 12:52 PM
> To: ceph-users
> Subject: [ceph-users] MySQL and ceph volumes
>  
> Hi All,
>  
> I have a galera cluster running on openstack with data on ceph volumes capped 
> at 1500 iops for read and write ( 3000 total ).
> I can’t understand why with fio I can reach 1500 iops without IOwait and 
> MySQL can reach only 150 iops both read or writes showing 30% of IOwait.
>  
> I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I 
> can’t reproduce the problem.
>  
> Anyone can tell me where I’m wrong?
>  
> Thank you
> Regards
> Matteo
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> This email message is for the sole use of the intended recipient(s) and may 
> contain confidential information.  Any unauthorized review, use, disclosure 
> or distribution is prohibited.  If you are not the intended recipient, please 
> contact the sender by reply email and destroy all copies of the original 
> message. 
> 
> -- 
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto. 
> Clicca qui per segnalarlo come spam. 
> <http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=DCF01401CF.AE456> 
> Clicca qui per metterlo in blacklist 
> <http://mx01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=DCF01401CF.AE456>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MySQL and ceph volumes

2017-03-07 Thread Matteo Dacrema
Hi All,

I have a galera cluster running on openstack with data on ceph volumes capped 
at 1500 iops for read and write ( 3000 total ).
I can’t understand why with fio I can reach 1500 iops without IOwait and MySQL 
can reach only 150 iops both read or writes showing 30% of IOwait.

I tried with fio 64k block size and various io depth ( 1.2.4.8.16….128) and I 
can’t reproduce the problem.

Anyone can tell me where I’m wrong?

Thank you
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mix HDDs and SSDs togheter

2017-03-03 Thread Matteo Dacrema
Hi all,

Does anyone run a production cluster with a modified crush map for create two 
pools belonging one to HDDs and one to SSDs.
What’s the best method? Modify the crush map via ceph CLI or via text editor? 
Will the modification to the crush map be persistent across reboots and 
maintenance operations?
There’s something to consider when doing upgrades or other operations or 
different by having “original” crush map?

Thank you
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] replica questions

2017-03-03 Thread Matteo Dacrema
Hi All,

I’ve a production cluster made of 8 nodes, 166 OSDs and 4 Journal SSD every 5 
OSDs with replica 2 for a total RAW space of 150 TB.
I’ve few question about it:

It’s critical to have replica 2? Why?
Does replica 3 makes recovery faster?
Does replica 3 makes rebalancing and recovery less heavy for customers? If I 
lose 1 node does replica 3 reduce the IO impact respect a replica 2?
Does read performance increase with replica 3?

Thank you
Regards
Matteo


This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cluster pause - possible consequences

2017-01-02 Thread Matteo Dacrema
Increasing pg_num will lead to several slow requests and cluster freeze, but  
due to creating pgs operation , for what I’ve seen until now.
During the creation period all the request are frozen , and the creation period 
take a lot of time even for 128 pgs.

I’ve observed that during creation period most of the OSD goes at 100% of their 
performance capacity. I think that without operation running in the cluster 
I’ll be able to upgrade pg_num quickly without causing down time several times.

Matteo

> Il giorno 02 gen 2017, alle ore 15:02, c...@jack.fr.eu.org ha scritto:
> 
> Well, as the doc said:
>> Set or clear the pause flags in the OSD map. If set, no IO requests will be 
>> sent to any OSD. Clearing the flags via unpause results in resending pending 
>> requests.
> If you do that on a production cluster, that means your cluster will no
> longer be in production :)
> 
> Depending on your needs, but ..
> Maybe you want do this operation as fast as possible
> Or maybe you want to make that operation as transparent as possible,
> from a user point of view
> 
> You may have a look at osd_recovery_op_priority &
> osd_client_op_priority, they might be interesting for you
> 
> On 02/01/2017 14:37, Matteo Dacrema wrote:
>> Hi All,
>> 
>> what happen if I set pause flag on a production cluster?
>> I mean, will all the request remain pending/waiting or all the volumes 
>> attached to the VMs will become read-only?
>> 
>> I need to quickly upgrade placement group number from 3072 to 8192 or better 
>> to 165336 and I think doing it without client operations will be much faster.
>> 
>> Thanks
>> Regards
>> Matteo
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=9F3C956B85.A333A
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cluster pause - possible consequences

2017-01-02 Thread Matteo Dacrema
Hi All,

what happen if I set pause flag on a production cluster?
I mean, will all the request remain pending/waiting or all the volumes attached 
to the VMs will become read-only?

I need to quickly upgrade placement group number from 3072 to 8192 or better to 
165336 and I think doing it without client operations will be much faster.

Thanks
Regards
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sandisk SSDs

2016-12-02 Thread Matteo Dacrema
Hi All,

Did someone ever used or tested Sandisk Cloudspeed Eco II 1,92TB with Ceph?
I know they have 0,6 DWPD that with Journal will be only 0,3 DPWD which means 
560GB of data per day over 5 years.
I need to know the performance side.

Thanks
Matteo



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph and container

2016-11-15 Thread Matteo Dacrema
Hi,

does anyone ever tried to run ceph monitors in containers?
Could it lead to performance issues?
Can I run monitor containers on the OSD nodes?

I don’t want to buy 3 dedicated servers. Is there any other solution?

Thanks
Best regards

Matteo Dacrema

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

2016-11-11 Thread Matteo Dacrema
Hi,

after your tips and consideration I’ve planned to use this hardware 
configuration:

- 4x OSD ( for starting the project):
1x Intel E5-1630v4 @ 4.00 Ghz with turbo 4 core, 8 thread , 10MB cache
128GB RAM ( does frequency matter in terms of performance ? ) 
4x Intel P3700 2TB NVME
2x Mellanox Connect-X 3 Pro 40gbit/s

- 3 x MON:
1x Intel E5-1630v4
64GB RAM
2 x Intel S3510 SSD
2x Mellanox Connect-X 3 Pro 10gbit/s

What do you think about?
I don’t know if this CPU works well with ceph workload and if it’s better to 
use 4x Samsung SM863 1.92TB rather than Intel P3700.
I’ve considered to place the Journal inline.

Thanks
Matteo 

> Il giorno 11 ott 2016, alle ore 03:04, Christian Balzer  ha 
> scritto:
> 
> 
> Hello,
> 
> On Mon, 10 Oct 2016 14:56:40 +0200 Matteo Dacrema wrote:
> 
>> Hi,
>> 
>> I’m planning a similar cluster.
>> Because it’s a new project I’ll start with only 2 node cluster witch each:
>> 
> As Wido said, that's a very dense and risky proposition for a first time
> cluster. 
> Never mind the lack of 3rd node for 3 MONs is begging for Murphy to come
> and smite you.
> 
> While I understand the need/wish to save money and space by maximizing
> density, that only works sort of when you have plenty of such nodes to
> begin with.
> 
> Your proposed setup isn't cheap to begin with, consider alternatives like
> the one I'm pointing out below.
> 
>> 2x E5-2640v4 with 40 threads total @ 3.40Ghz with turbo
> Spendy and still potentially overwhelmed when dealing with small write
> IOPS.
> 
>> 24x 1.92 TB Samsung SM863 
> Should be fine, but keep in mind that with inline journals they will only
> have about a 1.5 DWPD endurance.
> At about 5.7GB/s write bandwidth not a total mismatch to your 4GB/s
> network link (unless those 2 ports are MC-LAG, giving you 8GB/s).
> 
>> 128GB RAM
>> 3x LSI 3008 in IT mode / HBA for OSD - 1 each 8 OSD/SDDs
> Also not free, they need to be on the latest FW and kernel version to work
> reliably with SSDs.
> 
>> 2x SSD for OS
>> 2x 40Gbit/s NIC
>> 
>> 
> Consider basing your cluster on two of these 2U 4node servers:
> https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm 
> <https://www.supermicro.com.tw/products/system/2U/2028/SYS-2028TP-HTTR.cfm>
> 
> Built-in dual 10Gb/s, the onboard SATA works nicely with SSDs, you can get
> better matched CPU(s).
> 
> 10Gb/s MC-LAG (white box) switches are also widely available and
> affordable.
> 
> So 8 nodes instead of 2, in the same space.
> 
> Of course running a cluster (even with well monitored and reliable SSDs)
> with a replication of 2 has risks (and that risk increases with the size of
> the SSDs), so you may want to reconsider that.
> 
> Christian
> 
>> What about this hardware configuration? Is that wrong or I’m missing 
>> something ?
>> 
>> Regards
>> Matteo
>> 
>>> Il giorno 06 ott 2016, alle ore 13:52, Denny Fuchs  ha 
>>> scritto:
>>> 
>>> God morning,
>>> 
>>>>> * 2 x SN2100 100Gb/s Switch 16 ports
>>>> Which incidentally is a half sized (identical HW really) Arctica 3200C.
>>> 
>>> really never heart from them :-) (and didn't find any price €/$ region)
>>> 
>>> 
>>>>> * 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
>>> [...]
>>> 
>>>> You haven't commented on my rather lengthy mail about your whole design,
>>>> so to reiterate:
>>> 
>>> maybe accidentally skipped, so much new input  :-) sorry
>>> 
>>>> The above will give you a beautiful, fast (but I doubt you'll need the
>>>> bandwidth for your DB transactions), low latency and redundant network
>>>> (these switches do/should support MC-LAG). 
>>> 
>>> Jepp, they do MLAG (with the 25Gbit version of the cx4 NICs)
>>> 
>>>> In more technical terms, your network as depicted above can handle under
>>>> normal circumstances around 5GB/s, while your OSD nodes can't write more
>>>> than 1GB/s.
>>>> Massive, wasteful overkill.
>>> 
>>> before we started with planing Ceph / new hypervisor design, we where sure 
>>> that our network would be more powerful, than we need in the near future. 
>>> Our applications / DB never used the full 1GBs in any way ...  we loosing 
>>> speed on the plain (painful LANCOM) switches and the applications (mostly 
>>> Perl written in the beginning of the 2005).
>>> But anyway, the network should be have enough capacity for the next years, 
>>>

Re: [ceph-users] 6 Node cluster with 24 SSD per node: Hardwareplanning/ agreement

2016-10-10 Thread Matteo Dacrema
Hi,

I’m planning a similar cluster.
Because it’s a new project I’ll start with only 2 node cluster witch each:

2x E5-2640v4 with 40 threads total @ 3.40Ghz with turbo
24x 1.92 TB Samsung SM863 
128GB RAM
3x LSI 3008 in IT mode / HBA for OSD - 1 each 8 OSD/SDDs
2x SSD for OS
2x 40Gbit/s NIC


What about this hardware configuration? Is that wrong or I’m missing something ?

Regards
Matteo

> Il giorno 06 ott 2016, alle ore 13:52, Denny Fuchs  ha 
> scritto:
> 
> God morning,
> 
>>> * 2 x SN2100 100Gb/s Switch 16 ports
>> Which incidentally is a half sized (identical HW really) Arctica 3200C.
>  
> really never heart from them :-) (and didn't find any price €/$ region)
>  
> 
>>> * 10 x ConnectX 4LX-EN 25Gb card for hypervisor and OSD nodes
> [...]
> 
>> You haven't commented on my rather lengthy mail about your whole design,
>> so to reiterate:
>  
> maybe accidentally skipped, so much new input  :-) sorry
> 
>> The above will give you a beautiful, fast (but I doubt you'll need the
>> bandwidth for your DB transactions), low latency and redundant network
>> (these switches do/should support MC-LAG). 
>  
> Jepp, they do MLAG (with the 25Gbit version of the cx4 NICs)
>  
>> In more technical terms, your network as depicted above can handle under
>> normal circumstances around 5GB/s, while your OSD nodes can't write more
>> than 1GB/s.
>> Massive, wasteful overkill.
>  
> before we started with planing Ceph / new hypervisor design, we where sure 
> that our network would be more powerful, than we need in the near future. Our 
> applications / DB never used the full 1GBs in any way ...  we loosing speed 
> on the plain (painful LANCOM) switches and the applications (mostly Perl 
> written in the beginning of the 2005).
> But anyway, the network should be have enough capacity for the next years, 
> because it is much more complicated to change network (design) components, 
> than to kick a node.
>  
>> With a 2nd NVMe in there you'd be at 2GB/s, or simple overkill.
>  
> We would buy them ... so that in the end, every 12 disk has a separated NVMe
> 
>> With decent SSDs and in-line journals (400GB DC S3610s) you'd be at 4.8
>> GB/s, a perfect match.
>  
> What about the worst case, two nodes are broken, fixed and replaced ? I red 
> (a lot) that some Ceph users had massive problems, while the rebuild runs. 
>  
> 
>> Of course if your I/O bandwidth needs are actually below 1GB/s at all times
>> and all your care about is reducing latency, a single NVMe journal will be
>> fine (but also be a very obvious SPoF).
> 
> Very happy  to put the finger in the wound, SPof ... is a very hard thing ... 
> so we try to plan everything redundant  :-)
>  
> The bad side of life: the SSD itself. A consumer SSD lays round about 70/80€, 
> a DC SSD jumps up to 120-170€. My nightmare is: a lot of SSDs are jumping 
> over the bridge at the same time  -> arghh 
>  
> But, we are working on it :-)
>  
> I've searching an alternative for the Asus board with more PCIe slots and 
> maybe some components; better CPU with 3.5Ghz-> ; maybe a mix from the SSDs 
> ...
>  
> At this time, I've found the X10DRi:
>  
> https://www.supermicro.com/products/motherboard/xeon/c600/x10dri.cfm 
> 
>  
> and I think we use the E5-2637v4 :-)
>  
>  cu denny
>  
> 
> -- 
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto. 
> Clicca qui per segnalarlo come spam. 
>  
> Clicca qui per metterlo in blacklist 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on different OS version

2016-09-22 Thread Matteo Dacrema
To be more precise, the node with different OS are only the OSD nodes.

Thanks
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 22 set 2016, alle ore 15:56, Lenz Grimmer  ha 
> scritto:
> 
> Hi,
> 
> On 09/22/2016 03:03 PM, Matteo Dacrema wrote:
> 
>> someone have ever tried to run a ceph cluster on two different version
>> of the OS?
>> In particular I’m running a ceph cluster half on Ubuntu 12.04 and half
>> on Ubuntu 14.04 with Firefly version.
>> I’m not seeing any issues.
>> Are there some kind of risks?
> 
> I could be wrong, but as long as the Ceph version running on these nodes
> is the same, I doubt the underlying OS version makes much of a
> difference, if we're talking about "userland" Ceph components like MONs,
> OSDs or RGW nodes.
> 
> Lenz
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph on different OS version

2016-09-22 Thread Matteo Dacrema
Hi,

someone have ever tried to run a ceph cluster on two different version of the 
OS?
In particular I’m running a ceph cluster half on Ubuntu 12.04 and half on 
Ubuntu 14.04 with Firefly version.
I’m not seeing any issues.
Are there some kind of risks?

Thanks
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increase PG number

2016-09-20 Thread Matteo Dacrema
Thanks a lot guys.

I’ll try to do as you told me.

Best Regards
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 20 set 2016, alle ore 12:20, Vincent Godin  
> ha scritto:
> 
> Hi, 
> 
> In fact, when you increase your pg number, the new pgs will have to peer 
> first and during this time, a lot a pg will be unreachable. The best way to 
> upgrade the number of PG of a cluster (you 'll need to adjust the number of 
> PGP too) is :
> 
> Don't forget to apply Goncalo advices to keep your cluster responsive for 
> client operations. Otherwise, all the IO and CPU will be used for the 
> recovery operations and your cluster will be unreachable. Be sure that all 
> these new parameters are in place before upgrading your cluster
> stop and wait for scrub and deep-scrub operations
> ceph osd set noscrub
> ceph osd set nodeep-scrub
> 
> set you cluster in maintenance mode with :
> ceph osd set norecover
> ceph osd set nobackfill
> ceph osd set nodown
> ceph osd set noout
> 
> wait for your cluster not have scrub or deep-scrub opration anymore
> upgrade the pg number with a small increment like 256
> wait for the cluster to create and peer the new pgs (about 30 seconds)
> upgrade the pgp number with the same increment
> wait for the cluster to create and peer (about 30 seconds)
> (Repeat the last 4 operations until you reach the number of pg and pgp you 
> want
> 
> At this time, your cluster is still functionnal. 
> 
> Now you have to unset the maintenance mode
> ceph osd unset noout
> ceph osd unset nodown
> ceph osd unset nobackfill
> ceph osd unset norecover
> 
> It will take some time to replace all the pgs but at the end you will have a 
> cluster with all pgs active+clean.During all the operation,your cluster will 
> still be functionnal if you have respected Goncalo parameters
> 
> When all the pgs are active+clean, you can re-enable the scrub and deep-scrub 
> operations
> ceph osd unset noscrub
> ceph osd unset nodeep-scrub
> 
> Vincent
> 
> 
> 
> -- 
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto. 
> Clicca qui per segnalarlo come spam. 
>  
> Clicca qui per metterlo in blacklist 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] capacity planning - iops

2016-09-19 Thread Matteo Dacrema
Hi All,

I’m trying to estimate how many iops ( 4k direct random write )  my ceph 
cluster should deliver.
I’ve Journal on SSDs and SATA 7.2k drives for OSD.

The question is: does journal on SSD increase the number of maximum write iops 
or I need to consider only the IOPS provided by SATA drives divided by replica 
count?

Regards
M.



This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: Increase PG number

2016-09-19 Thread Matteo Dacrema
Hi,

I’ve 3 different cluster.
The first I’ve been able to upgrade from 1024 to 2048 pgs with 10 minutes of 
"io freeze”.
The second I’ve been able to upgrade from 368 to 512 in a sec without any 
performance issue, but from 512 to 1024 it take over 20 minutes to create pgs.
The third I’ve to upgrade is now 2048 pgs and I’ve to take it to 16384. So what 
I’m wondering is how to do it with minimum performance impact.

Maybe the best way is to upgrade by 256 to 256 pg and pgp num each time letting 
the cluster to rebalance every time.

Thanks
Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 19 set 2016, alle ore 05:22, Will.Boege  ha 
> scritto:
> 
> How many PGs do you have - and how many are you increasing it to? 
> 
> Increasing PG counts can be disruptive if you are increasing by a large 
> proportion of the initial count because all the PG peering involved.  If you 
> are doubling the amount of PGs it might be good to do it in stages to 
> minimize peering.  For example if you are going from 1024 to 2048 - consider 
> 4 increases of 256, allowing the cluster to stabilize in-between, rather that 
> one event that doubles the number of PGs. 
> 
> If you expect this cluster to grow, overshoot the recommended PG count by 50% 
> or so.  This will allow you to minimize the PG increase events, and thusly 
> impact to your users.  
> 
> From: ceph-users  <mailto:ceph-users-boun...@lists.ceph.com>> on behalf of Matteo Dacrema 
> mailto:mdacr...@enter.eu>>
> Date: Sunday, September 18, 2016 at 3:29 PM
> To: Goncalo Borges  <mailto:goncalo.bor...@sydney.edu.au>>, "ceph-users@lists.ceph.com 
> <mailto:ceph-users@lists.ceph.com>"  <mailto:ceph-users@lists.ceph.com>>
> Subject: [EXTERNAL] Re: [ceph-users] Increase PG number
> 
> Hi , thanks for your reply.
> 
> Yes, I’don’t any near full osd.
> 
> The problem is not the rebalancing process but the process of creation of new 
> pgs.
> 
> I’ve only 2 host running Ceph Firefly version with 3 SSDs for journaling each.
> During the creation of new pgs all the volumes attached stop to read or write 
> showing high iowait.
> Ceph -s tell me that there are thousand of slow requests.
> 
> When all the pgs are created slow request begin to decrease and the cluster 
> start rebalancing process.
> 
> Matteo
> 
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and delete 
> this e-mail from your system. If you are not the intended recipient you are 
> notified that disclosing, copying, distributing or taking any action in 
> reliance on the contents of this information is strictly prohibited.
> 
>> Il giorno 18 set 2016, alle ore 13:08, Goncalo Borges 
>> mailto:goncalo.bor...@sydney.edu.au>> ha 
>> scritto:
>> 
>> Hi
>> I am assuming that you do not have any near full osd  (either before or 
>> along the pg splitting process) and that your cluster is healthy. 
>> 
>> To minimize the impact on the clients during recover or operations like pg 
>> splitting, it is good to set the following configs. Obviously the whole 
>> operation will take longer to recover but the impact on clients will be 
>> minimized.
>> 
>> #  ceph daemon mon.rccephmon1 config show | egrep 
>> "(osd_max_backfills|osd_recovery_threads|osd_recovery_op_priority|osd_client_op_priority|osd_recovery_max_active)"
>>"osd_max_backfills": "1",
>>"osd_recovery_threads": "1",
>>"osd_recovery_max_active": "1"
>>"osd_client_op_priority": &

Re: [ceph-users] Increase PG number

2016-09-18 Thread Matteo Dacrema
Hi , thanks for your reply.

Yes, I’don’t any near full osd.

The problem is not the rebalancing process but the process of creation of new 
pgs.

I’ve only 2 host running Ceph Firefly version with 3 SSDs for journaling each.
During the creation of new pgs all the volumes attached stop to read or write 
showing high iowait.
Ceph -s tell me that there are thousand of slow requests.

When all the pgs are created slow request begin to decrease and the cluster 
start rebalancing process.

Matteo

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

> Il giorno 18 set 2016, alle ore 13:08, Goncalo Borges 
>  ha scritto:
> 
> Hi
> I am assuming that you do not have any near full osd  (either before or along 
> the pg splitting process) and that your cluster is healthy. 
> 
> To minimize the impact on the clients during recover or operations like pg 
> splitting, it is good to set the following configs. Obviously the whole 
> operation will take longer to recover but the impact on clients will be 
> minimized.
> 
> #  ceph daemon mon.rccephmon1 config show | egrep 
> "(osd_max_backfills|osd_recovery_threads|osd_recovery_op_priority|osd_client_op_priority|osd_recovery_max_active)"
>"osd_max_backfills": "1",
>"osd_recovery_threads": "1",
>"osd_recovery_max_active": "1"
>"osd_client_op_priority": "63",
>"osd_recovery_op_priority": "1"
> 
> Cheers
> G.
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Matteo 
> Dacrema [mdacr...@enter.eu]
> Sent: 18 September 2016 03:42
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Increase PG number
> 
> Hi All,
> 
> I need to expand my ceph cluster and I also need to increase pg number.
> In a test environment I see that during pg creation all read and write 
> operations are stopped.
> 
> Is that a normal behavior ?
> 
> Thanks
> Matteo
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual or entity to whom they are addressed. If 
> you have received this email in error please notify the system manager. This 
> message contains confidential information and is intended only for the 
> individual named. If you are not the named addressee you should not 
> disseminate, distribute or copy this e-mail. Please notify the sender 
> immediately by e-mail if you have received this e-mail by mistake and delete 
> this e-mail from your system. If you are not the intended recipient you are 
> notified that disclosing, copying, distributing or taking any action in 
> reliance on the contents of this information is strictly prohibited.
> 
> 
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non 
> infetto.
> Seguire il link qui sotto per segnalarlo come spam: 
> http://mx01.enter.it/cgi-bin/learn-msg.cgi?id=D6CF2401EE.A1426
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increase PG number

2016-09-17 Thread Matteo Dacrema
Hi All,

I need to expand my ceph cluster and I also need to increase pg number.
In a test environment I see that during pg creation all read and write 
operations are stopped.

Is that a normal behavior ?

Thanks
Matteo
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. If 
you have received this email in error please notify the system manager. This 
message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail. Please notify the sender 
immediately by e-mail if you have received this e-mail by mistake and delete 
this e-mail from your system. If you are not the intended recipient you are 
notified that disclosing, copying, distributing or taking any action in 
reliance on the contents of this information is strictly prohibited.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BAD nvme SSD performance

2015-10-27 Thread Matteo Dacrema
Hi,

thanks for all the replies.

I've found the issue: 
The Samsung nvme SSD has poor performance with sync=1. It reach only 4/5 k iops 
with randwrite ops.

Using Intel DC S3700 SSDs I'm able to saturate the CPU.

I'm using hammer v 0.94.5 on Ubuntu 14.04 and 3.19.0-31 kernel

What do you think about Intel 750 series : 
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html

I plan to use it for cache layer ( one for host - is it a problem? )
Behind the cache layer I plan to use Mechanical HDD with Journal on SSD drives.

What do you think about it?

Thanks
Regards,
Matteo

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: lunedì 26 ottobre 2015 17:45
To: Christian Balzer ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

Another point,
As Christian mentioned, try to evaluate O_DIRECT|O_DSYNC performance of a SSD 
before choosing that for Ceph..
Try to run with direct=1 and sync =1 with fio to a raw ssd drive..

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

One thing, *don't* trust iostat disk util% in case of SSDs..100% doesn't mean 
you are saturating SSDs there..I have seen a large performance delta even if 
iostat is reporting 100% disk util in both the cases.
Also, the ceph.conf file you are using is not optimal..Try to add these..

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your cpu, considering you have powerful cpu 
complex for SSDs tweak this to high number of shards..It also depends on number 
of OSDs per box..

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer and newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and write per 
> > seconds from iostat match fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME 
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is re-affirmed by seeing Samsung listing them as a Client SSDs, 
not data center one.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVM-E SSDs work best with very high queue depths and 
> parallel IOps.
>
> Also, be aware that Ceph add multiple layers to the whole I/O 
> subsystem and that there will be a performance impact when Ceph is used in 
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* 

Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Matteo Dacrema
Hi Nick,

I also tried to increase iodepth but nothing has changed.

With iostat I noticed that the disk is fully utilized and write per seconds 
from iostat match fio output.

Matteo

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: lunedì 26 ottobre 2015 13:06
To: Matteo Dacrema ; ceph-us...@ceph.com
Subject: RE: BAD nvme SSD performance

Hi Matteo,

Ceph introduces latency into the write path and so what you are seeing is 
typical. If you increase the iodepth of the fio test you should get higher 
results though, until you start maxing out your CPU.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Matteo 
Dacrema
Sent: 26 October 2015 11:20
To: ceph-us...@ceph.com<mailto:ceph-us...@ceph.com>
Subject: [ceph-users] BAD nvme SSD performance

Hi all,

I've recently buy two Samsung SM951 256GB nvme PCIe SSDs and built a 2 OSD ceph 
cluster with min_size = 1.
I've tested them with fio ad I obtained two very different results with these 
two situations with fio.
This is the command : fio  --ioengine=libaio --direct=1  --name=test 
--filename=test --bs=4k  --size=100M --readwrite=randwrite  --numjobs=200  
--group_reporting

On the OSD host I've obtained this result:
bw=575493KB/s, iops=143873

On the client host with a mounted volume I've obtained this result:

Fio executed on the client osd with a mounted volume:
bw=9288.1KB/s, iops=2322

I've obtained this results with Journal and data on the same disk and also with 
Journal on separate SSD.

I've two OSD host with 64GB of RAM and 2x Intel Xeon E5-2620 @ 2.00GHz and one 
MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @ 2.00 GHz.
I'm using 10G mellanox NIC and Switch with jumbo frames.

I also did other test with this configuration ( see attached Excel workbook )
Hardware configuration for each of the two OSD nodes:
3x  100GB Intel SSD DC S3700 with 3 * 30 GB partition for every 
SSD
9x  1TB Seagate HDD
Results: about 12k IOPS with 4k bs and same fio test.

I can't understand where is the problem with nvme SSDs.
Anyone can help me?

Here the ceph.conf:
[global]
fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
mon_initial_members = mon1
mon_host = 192.168.1.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 2
mon_client_hung_interval = 1.0
mon_client_ping_interval = 5.0
public_network = 192.168.1.0/24
cluster_network = 192.168.1.0/24
mon_osd_full_ratio = .90
mon_osd_nearfull_ratio = .85

[mon]
mon_warn_on_legacy_crush_tunables = false

[mon.1]
host = mon1
mon_addr = 192.168.1.3:6789

[osd]
osd_journal_size = 3
journal_dio = true
journal_aio = true
osd_op_threads = 24
osd_op_thread_timeout = 60
osd_disk_threads = 8
osd_recovery_threads = 2
osd_recovery_max_active = 1
osd_max_backfills = 2
osd_mkfs_type = xfs
osd_mkfs_options_xfs = "-f -i size=2048"
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
filestore_xattr_use_omap = false
filestore_max_inline_xattr_size = 512
filestore_max_sync_interval = 10
filestore_merge_threshold = 40
filestore_split_multiple = 8
filestore_flusher = false
filestore_queue_max_ops = 2000
filestore_queue_max_bytes = 536870912
filestore_queue_committing_max_ops = 500
filestore_queue_committing_max_bytes = 268435456
filestore_op_threads = 2

Best regards,
Matteo



--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=326E9400C6.A1DC9>
Clicca qui per metterlo in 
blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=326E9400C6.A1DC9>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] BAD nvme SSD performance

2015-10-26 Thread Matteo Dacrema
Hi all,

I've recently buy two Samsung SM951 256GB nvme PCIe SSDs and built a 2 OSD ceph 
cluster with min_size = 1.
I've tested them with fio ad I obtained two very different results with these 
two situations with fio.
This is the command : fio  --ioengine=libaio --direct=1  --name=test 
--filename=test --bs=4k  --size=100M --readwrite=randwrite  --numjobs=200  
--group_reporting

On the OSD host I've obtained this result:
bw=575493KB/s, iops=143873

On the client host with a mounted volume I've obtained this result:

Fio executed on the client osd with a mounted volume:
bw=9288.1KB/s, iops=2322

I've obtained this results with Journal and data on the same disk and also with 
Journal on separate SSD.

I've two OSD host with 64GB of RAM and 2x Intel Xeon E5-2620 @ 2.00GHz and one 
MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @ 2.00 GHz.
I'm using 10G mellanox NIC and Switch with jumbo frames.

I also did other test with this configuration ( see attached Excel workbook )
Hardware configuration for each of the two OSD nodes:
3x  100GB Intel SSD DC S3700 with 3 * 30 GB partition for every 
SSD
9x  1TB Seagate HDD
Results: about 12k IOPS with 4k bs and same fio test.

I can't understand where is the problem with nvme SSDs.
Anyone can help me?

Here the ceph.conf:
[global]
fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
mon_initial_members = mon1
mon_host = 192.168.1.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd_pool_default_size = 2
mon_client_hung_interval = 1.0
mon_client_ping_interval = 5.0
public_network = 192.168.1.0/24
cluster_network = 192.168.1.0/24
mon_osd_full_ratio = .90
mon_osd_nearfull_ratio = .85

[mon]
mon_warn_on_legacy_crush_tunables = false

[mon.1]
host = mon1
mon_addr = 192.168.1.3:6789

[osd]
osd_journal_size = 3
journal_dio = true
journal_aio = true
osd_op_threads = 24
osd_op_thread_timeout = 60
osd_disk_threads = 8
osd_recovery_threads = 2
osd_recovery_max_active = 1
osd_max_backfills = 2
osd_mkfs_type = xfs
osd_mkfs_options_xfs = "-f -i size=2048"
osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
filestore_xattr_use_omap = false
filestore_max_inline_xattr_size = 512
filestore_max_sync_interval = 10
filestore_merge_threshold = 40
filestore_split_multiple = 8
filestore_flusher = false
filestore_queue_max_ops = 2000
filestore_queue_max_bytes = 536870912
filestore_queue_committing_max_ops = 500
filestore_queue_committing_max_bytes = 268435456
filestore_op_threads = 2

Best regards,
Matteo



Test_ceph_benchmark.xlsx
Description: Test_ceph_benchmark.xlsx
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] metadata server rejoin time

2015-07-02 Thread Matteo Dacrema
Hi all,

I'm using CephFS on Hammer and I've 1.5 million files , 2 metadata servers in 
active/standby configuration with 8 GB of RAM , 20 clients with 2 GB of RAM 
each and 2 OSD nodes with 4 80GB osd and 4GB of RAM.
?I've noticed that if I kill the active metadata server the second one took 
about 10 to 30 minutes to switch from rejoin to active state.On the rejoin 
server while that is in rejoin state I can see ceph allocating RAM.


Here my configuration:

[global]
fsid = 2de7b17f-0a3e-4109-b878-c035dd2f7735
mon_initial_members = cephmds01
mon_host = 10.29.81.161
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.29.81.0/24
tcp nodelay = true
tcp rcvbuf = 0
ms tcp read timeout = 600

#Capacity
mon osd full ratio = .95
mon osd nearfull ratio = .85


[osd]
osd journal size = 1024
journal dio = true
journal aio = true

osd op threads = 2
osd op thread timeout = 60
osd disk threads = 2
osd recovery threads = 1
osd recovery max active = 1
osd max backfills = 2


# Pool
osd pool default size = 2

#XFS
osd mkfs type = xfs
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog"

#FileStore Settings
filestore xattr use omap = false
filestore max inline xattr size = 512
filestore max sync interval = 10
filestore merge threshold = 40
filestore split multiple = 8
filestore flusher = false
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 500
filestore queue committing max bytes = 268435456
filestore op threads = 2

[mds]
max mds = 1
mds cache size = 25
client cache size = 1024
mds dir commit ratio = 0.5

Best regards,
Matteo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client issue

2015-06-16 Thread Matteo Dacrema
Hello,

you're right. 
I misunderstood the meaning of the two configuration params: size and min_size.

Now it works correctly.

Thanks,
Matteo  

Da: Christian Balzer 
Inviato: martedì 16 giugno 2015 09:42
A: ceph-users
Cc: Matteo Dacrema
Oggetto: Re: [ceph-users] CephFS client issue

Hello,

On Tue, 16 Jun 2015 07:21:54 + Matteo Dacrema wrote:

> Hi,
>
> I've shutoff the node without take any cautions for simulate a real case.
>
Normal shutdown (as opposed to simulating a crash by pulling cables)
should not result in any delays due to Ceph timeouts.

> The  osd_pool_default_min_size is 2 .
>
This on the other is most likely your problem.
It would have to be "1" for things to work in your case.
Verify it with "ceph osd pool get  min_size" for your actual
pool(s).

Christian

> Regards,
> Matteo
>
> 
> Da: Christian Balzer 
> Inviato: martedì 16 giugno 2015 01:44
> A: ceph-users
> Cc: Matteo Dacrema
> Oggetto: Re: [ceph-users] CephFS client issue
>
> Hello,
>
> On Mon, 15 Jun 2015 23:11:07 + Matteo Dacrema wrote:
>
> > With 3.16.3 kernel it seems to be stable but I've discovered one new
> > issue.
> >
> > If I take down one of the two osd node all the client stop to respond.
> >
> How did you take the node down?
>
> What is your "osd_pool_default_min_size"?
>
> Penultimately, you wouldn't deploy a cluster with just 2 storage nodes in
> production anyway.
>
> Christian
> >
> > Here the output of ceph -s
> >
> > ceph -s
> > cluster 2de7b17f-0a3e-4109-b878-c035dd2f7735
> >  health HEALTH_WARN
> > 256 pgs degraded
> > 127 pgs stuck inactive
> > 127 pgs stuck unclean
> > 256 pgs undersized
> > recovery 1457662/2915324 objects degraded (50.000%)
> > 4/8 in osds are down
> > clock skew detected on mon.cephmds01, mon.ceph-mon1
> >  monmap e5: 3 mons at
> > {ceph-mon1=10.29.81.184:6789/0,cephmds01=10.29.81.161:6789/0,cephmds02=10.29.81.160:6789/0}
> > election epoch 64, quorum 0,1,2 cephmds02,cephmds01,ceph-mon1 mdsmap
> > e176: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby osdmap e712: 8
> > osds: 4 up, 8 in pgmap v420651: 256 pgs, 2 pools, 133 GB data, 1423
> > kobjects 289 GB used, 341 GB / 631 GB avail
> > 1457662/2915324 objects degraded (50.000%)
> >  256 undersized+degraded+peered
> >   client io 86991 B/s wr, 0 op/s
> >
> >
> > When I take UP the node all clients resume to work.
> >
> > Thanks,
> > Matteo
> >
> > ?
> >
> >
> > 
> > Da: ceph-users  per conto di Matteo
> > Dacrema  Inviato: luned? 15 giugno 2015 12:37
> > A: John Spray; Lincoln Bryant; ceph-users
> > Oggetto: Re: [ceph-users] CephFS client issue
> >
> >
> > Ok, I'll update kernel to 3.16.3 version and let you know.
> >
> >
> > Thanks,
> >
> > Matteo
> >
> > 
> > Da: John Spray 
> > Inviato: luned? 15 giugno 2015 10:51
> > A: Matteo Dacrema; Lincoln Bryant; ceph-users
> > Oggetto: Re: [ceph-users] CephFS client issue
> >
> >
> >
> > On 14/06/15 20:00, Matteo Dacrema wrote:
> >
> > Hi Lincoln,
> >
> >
> > I'm using the kernel client.
> >
> > Kernel version is: 3.13.0-53-generic?
> >
> > That's old by CephFS standards.  It's likely that the issue you're
> > seeing is one of the known bugs (which were actually the motivation for
> > adding the warning message you're seeing).
> >
> > John
> >
> > --
> > Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non
> > infetto. Clicca qui per segnalarlo come
> > spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=637D840263.A210F>
> > Clicca qui per metterlo in
> > blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=637D840263.A210F>
> >
> > --
> > Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non
> > infetto. Clicca qui per segnalarlo come
> > spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=DBA3140262.AE1FB>
> > Clicca qui per metterlo in
> > blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=DBA3140262.AE1FB>
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine

Re: [ceph-users] CephFS client issue

2015-06-16 Thread Matteo Dacrema
Hi,

I've shutoff the node without take any cautions for simulate a real case.

The  osd_pool_default_min_size is 2 .

Regards,
Matteo


Da: Christian Balzer 
Inviato: martedì 16 giugno 2015 01:44
A: ceph-users
Cc: Matteo Dacrema
Oggetto: Re: [ceph-users] CephFS client issue

Hello,

On Mon, 15 Jun 2015 23:11:07 + Matteo Dacrema wrote:

> With 3.16.3 kernel it seems to be stable but I've discovered one new
> issue.
>
> If I take down one of the two osd node all the client stop to respond.
>
How did you take the node down?

What is your "osd_pool_default_min_size"?

Penultimately, you wouldn't deploy a cluster with just 2 storage nodes in
production anyway.

Christian
>
> Here the output of ceph -s
>
> ceph -s
> cluster 2de7b17f-0a3e-4109-b878-c035dd2f7735
>  health HEALTH_WARN
> 256 pgs degraded
> 127 pgs stuck inactive
> 127 pgs stuck unclean
> 256 pgs undersized
> recovery 1457662/2915324 objects degraded (50.000%)
> 4/8 in osds are down
> clock skew detected on mon.cephmds01, mon.ceph-mon1
>  monmap e5: 3 mons at
> {ceph-mon1=10.29.81.184:6789/0,cephmds01=10.29.81.161:6789/0,cephmds02=10.29.81.160:6789/0}
> election epoch 64, quorum 0,1,2 cephmds02,cephmds01,ceph-mon1 mdsmap
> e176: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby osdmap e712: 8
> osds: 4 up, 8 in pgmap v420651: 256 pgs, 2 pools, 133 GB data, 1423
> kobjects 289 GB used, 341 GB / 631 GB avail
> 1457662/2915324 objects degraded (50.000%)
>  256 undersized+degraded+peered
>   client io 86991 B/s wr, 0 op/s
>
>
> When I take UP the node all clients resume to work.
>
> Thanks,
> Matteo
>
> ?
>
>
> 
> Da: ceph-users  per conto di Matteo
> Dacrema  Inviato: luned? 15 giugno 2015 12:37
> A: John Spray; Lincoln Bryant; ceph-users
> Oggetto: Re: [ceph-users] CephFS client issue
>
>
> Ok, I'll update kernel to 3.16.3 version and let you know.
>
>
> Thanks,
>
> Matteo
>
> ____
> Da: John Spray 
> Inviato: luned? 15 giugno 2015 10:51
> A: Matteo Dacrema; Lincoln Bryant; ceph-users
> Oggetto: Re: [ceph-users] CephFS client issue
>
>
>
> On 14/06/15 20:00, Matteo Dacrema wrote:
>
> Hi Lincoln,
>
>
> I'm using the kernel client.
>
> Kernel version is: 3.13.0-53-generic?
>
> That's old by CephFS standards.  It's likely that the issue you're
> seeing is one of the known bugs (which were actually the motivation for
> adding the warning message you're seeing).
>
> John
>
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non
> infetto. Clicca qui per segnalarlo come
> spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=637D840263.A210F>
> Clicca qui per metterlo in
> blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=637D840263.A210F>
>
> --
> Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non
> infetto. Clicca qui per segnalarlo come
> spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=DBA3140262.AE1FB>
> Clicca qui per metterlo in
> blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=DBA3140262.AE1FB>


--
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/

--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Seguire il link qui sotto per segnalarlo come spam:
http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=03A5340264.ADC79

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client issue

2015-06-15 Thread Matteo Dacrema
With 3.16.3 kernel it seems to be stable but I've discovered one new issue.

If I take down one of the two osd node all the client stop to respond.


Here the output of ceph -s

ceph -s
cluster 2de7b17f-0a3e-4109-b878-c035dd2f7735
 health HEALTH_WARN
256 pgs degraded
127 pgs stuck inactive
127 pgs stuck unclean
256 pgs undersized
recovery 1457662/2915324 objects degraded (50.000%)
4/8 in osds are down
clock skew detected on mon.cephmds01, mon.ceph-mon1
 monmap e5: 3 mons at 
{ceph-mon1=10.29.81.184:6789/0,cephmds01=10.29.81.161:6789/0,cephmds02=10.29.81.160:6789/0}
election epoch 64, quorum 0,1,2 cephmds02,cephmds01,ceph-mon1
 mdsmap e176: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
 osdmap e712: 8 osds: 4 up, 8 in
  pgmap v420651: 256 pgs, 2 pools, 133 GB data, 1423 kobjects
289 GB used, 341 GB / 631 GB avail
1457662/2915324 objects degraded (50.000%)
 256 undersized+degraded+peered
  client io 86991 B/s wr, 0 op/s


When I take UP the node all clients resume to work.

Thanks,
Matteo

?



Da: ceph-users  per conto di Matteo Dacrema 

Inviato: luned? 15 giugno 2015 12:37
A: John Spray; Lincoln Bryant; ceph-users
Oggetto: Re: [ceph-users] CephFS client issue


Ok, I'll update kernel to 3.16.3 version and let you know.


Thanks,

Matteo


Da: John Spray 
Inviato: luned? 15 giugno 2015 10:51
A: Matteo Dacrema; Lincoln Bryant; ceph-users
Oggetto: Re: [ceph-users] CephFS client issue



On 14/06/15 20:00, Matteo Dacrema wrote:

Hi Lincoln,


I'm using the kernel client.

Kernel version is: 3.13.0-53-generic?

That's old by CephFS standards.  It's likely that the issue you're seeing is 
one of the known bugs (which were actually the motivation for adding the 
warning message you're seeing).

John

--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=637D840263.A210F>
Clicca qui per metterlo in 
blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=637D840263.A210F>

--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=DBA3140262.AE1FB>
Clicca qui per metterlo in 
blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=DBA3140262.AE1FB>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client issue

2015-06-15 Thread Matteo Dacrema
Ok, I'll update kernel to 3.16.3 version and let you know.


Thanks,

Matteo


Da: John Spray 
Inviato: luned? 15 giugno 2015 10:51
A: Matteo Dacrema; Lincoln Bryant; ceph-users
Oggetto: Re: [ceph-users] CephFS client issue



On 14/06/15 20:00, Matteo Dacrema wrote:

Hi Lincoln,


I'm using the kernel client.

Kernel version is: 3.13.0-53-generic?

That's old by CephFS standards.  It's likely that the issue you're seeing is 
one of the known bugs (which were actually the motivation for adding the 
warning message you're seeing).

John

--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=637D840263.A210F>
Clicca qui per metterlo in 
blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=637D840263.A210F>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client issue

2015-06-14 Thread Matteo Dacrema
Hi Lincoln,


I'm using the kernel client.

Kernel version is: 3.13.0-53-generic?


Thanks,

Matteo


Da: Lincoln Bryant 
Inviato: domenica 14 giugno 2015 19:31
A: Matteo Dacrema; ceph-users
Oggetto: Re: [ceph-users] CephFS client issue

Hi Matteo,

Are your clients using the FUSE client or the kernel client? If the latter, 
what kernel version?

--Lincoln

On 6/14/2015 10:26 AM, Matteo Dacrema wrote:

?Hi all,


I'm using CephFS on Hammer and sometimes I need to reboot one or more clients 
because , as ceph -s tells me, it's "failing to respond to capability 
release".After tha?t all clients stop to respond: can't access files or 
mount/umont cephfs.

I've 1.5 million files , 2 metadata servers in active/standby configuration 
with 8 GB of RAM , 20 clients with 2 GB of RAM each and 2 OSD nodes with 4 80GB 
osd and 4GB of RAM.



Here my configuration:


[global]
fsid = 2de7b17f-0a3e-4109-b878-c035dd2f7735
mon_initial_members = cephmds01
mon_host = 10.29.81.161
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.29.81.0/24
tcp nodelay = true
tcp rcvbuf = 0
ms tcp read timeout = 600

#Capacity
mon osd full ratio = .95
mon osd nearfull ratio = .85


[osd]
osd journal size = 1024
journal dio = true
journal aio = true

osd op threads = 2
osd op thread timeout = 60
osd disk threads = 2
osd recovery threads = 1
osd recovery max active = 1
osd max backfills = 2


# Pool
osd pool default size = 2

#XFS
osd mkfs type = xfs
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog"

#FileStore Settings
filestore xattr use omap = false
filestore max inline xattr size = 512
filestore max sync interval = 10
filestore merge threshold = 40
filestore split multiple = 8
filestore flusher = false
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 500
filestore queue committing max bytes = 268435456
filestore op threads = 2

[mds]
max mds = 1
mds cache size = 75
client cache size = 2048
mds dir commit ratio = 0.5



Here ceph -s output:


root@service-new:~# ceph -s
cluster 2de7b17f-0a3e-4109-b878-c035dd2f7735
 health HEALTH_WARN
mds0: Client 94102 failing to respond to cache pressure
 monmap e2: 2 mons at 
{cephmds01=10.29.81.161:6789/0,cephmds02=10.29.81.160:6789/0}
election epoch 34, quorum 0,1 cephmds02,cephmds01
 mdsmap e79: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
 osdmap e669: 8 osds: 8 up, 8 in
  pgmap v339741: 256 pgs, 2 pools, 132 GB data, 1417 kobjects
288 GB used, 342 GB / 631 GB avail
 256 active+clean
  client io 3091 kB/s rd, 342 op/s

Thank you.
Regards,
Matteo









___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Questo messaggio e' stato analizzato con Libra ESVA ed e' risultato non infetto.
Clicca qui per segnalarlo come 
spam.<http://esva01.enter.it/cgi-bin/learn-msg.cgi?id=4F9DA41B1A.A3369>
Clicca qui per metterlo in 
blacklist<http://esva01.enter.it/cgi-bin/learn-msg.cgi?blacklist=1&id=4F9DA41B1A.A3369>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS client issue

2015-06-14 Thread Matteo Dacrema
?Hi all,


I'm using CephFS on Hammer and sometimes I need to reboot one or more clients 
because , as ceph -s tells me, it's "failing to respond to capability 
release".After tha?t all clients stop to respond: can't access files or 
mount/umont cephfs.

I've 1.5 million files , 2 metadata servers in active/standby configuration 
with 8 GB of RAM , 20 clients with 2 GB of RAM each and 2 OSD nodes with 4 80GB 
osd and 4GB of RAM.



Here my configuration:


[global]
fsid = 2de7b17f-0a3e-4109-b878-c035dd2f7735
mon_initial_members = cephmds01
mon_host = 10.29.81.161
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.29.81.0/24
tcp nodelay = true
tcp rcvbuf = 0
ms tcp read timeout = 600

#Capacity
mon osd full ratio = .95
mon osd nearfull ratio = .85


[osd]
osd journal size = 1024
journal dio = true
journal aio = true

osd op threads = 2
osd op thread timeout = 60
osd disk threads = 2
osd recovery threads = 1
osd recovery max active = 1
osd max backfills = 2


# Pool
osd pool default size = 2

#XFS
osd mkfs type = xfs
osd mkfs options xfs = "-f -i size=2048"
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog"

#FileStore Settings
filestore xattr use omap = false
filestore max inline xattr size = 512
filestore max sync interval = 10
filestore merge threshold = 40
filestore split multiple = 8
filestore flusher = false
filestore queue max ops = 2000
filestore queue max bytes = 536870912
filestore queue committing max ops = 500
filestore queue committing max bytes = 268435456
filestore op threads = 2

[mds]
max mds = 1
mds cache size = 75
client cache size = 2048
mds dir commit ratio = 0.5



Here ceph -s output:


root@service-new:~# ceph -s
cluster 2de7b17f-0a3e-4109-b878-c035dd2f7735
 health HEALTH_WARN
mds0: Client 94102 failing to respond to cache pressure
 monmap e2: 2 mons at 
{cephmds01=10.29.81.161:6789/0,cephmds02=10.29.81.160:6789/0}
election epoch 34, quorum 0,1 cephmds02,cephmds01
 mdsmap e79: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
 osdmap e669: 8 osds: 8 up, 8 in
  pgmap v339741: 256 pgs, 2 pools, 132 GB data, 1417 kobjects
288 GB used, 342 GB / 631 GB avail
 256 active+clean
  client io 3091 kB/s rd, 342 op/s

Thank you.
Regards,
Matteo




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com