Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
Here is the result:


root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
{
"message": "",
"return_code": 0
}
root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 1
{
"success": "mds_cache_size = '1' (not observed, change may require restart) "
}

wait ...


root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
2018-05-25 07:44:02.185911 7f4cad7fa700  0 client.50748489 ms_handle_reset on 10.5.0.88:6804/994206868
2018-05-25 07:44:02.196160 7f4cae7fc700  0 client.50792764 ms_handle_reset on 10.5.0.88:6804/994206868
mds.ceph4-2.odiso.net tcmalloc heap stats:
MALLOC:    13175782328 (12565.4 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +   1774628488 ( 1692.4 MiB) Bytes in central cache freelist
MALLOC: +     34274608 (   32.7 MiB) Bytes in transfer cache freelist
MALLOC: +     57260176 (   54.6 MiB) Bytes in thread cache freelists
MALLOC: +    120582336 (  115.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  15162527936 (14460.1 MiB) Actual memory used (physical + swap)
MALLOC: +   4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  20136595648 (19203.8 MiB) Virtual address space used
MALLOC:
MALLOC:        1852388              Spans in use
MALLOC:             18              Thread heaps in use
MALLOC:           8192              Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.


root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
{
"success": "mds_cache_size = '0' (not observed, change may require restart) "
}
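
For reference, the stats above already show ~4.7 GiB handed back to the OS; a hedged follow-up (assuming the same daemon name and the tcmalloc/admin-socket behaviour of 12.2.x) is to ask tcmalloc to release its freelists as well and then compare RSS against the mds_co mempool again:

# Hedged sketch: release tcmalloc freelists, then re-check heap stats and RSS vs. mempool.
MDS=mds.ceph4-2.odiso.net

ceph tell "$MDS" heap release          # return freelist memory to the OS (madvise)
ceph tell "$MDS" heap stats            # "Actual memory used" should drop

ceph daemon "$MDS" perf dump | jq '.mds_mem.rss'        # RSS (KB) as reported by the MDS
ceph daemon "$MDS" dump_mempools | jq '.mds_co.bytes'   # bytes accounted to the cache mempool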

- Original Message -
From: "Zheng Yan" 
To: "aderumier" 
Sent: Friday, 25 May 2018 05:56:31
Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER 
 wrote: 
>>>Still can't find any clue. Does the cephfs have an idle period? If it 
>>>has, could you decrease the mds's cache size and check what happens. For 
>>>example, run the following commands during the idle period. 
> 
>>>ceph daemon mds.xx flush journal 
>>>ceph daemon mds.xx config set mds_cache_size 1; 
>>>"wait a minute" 
>>>ceph tell mds.xx heap stats 
>>>ceph daemon mds.xx config set mds_cache_size 0 
> 
> OK, thanks. I'll try it tonight. 
> 
> I already have mds_cache_memory_limit = 5368709120, 
> 
> does it need to be removed first before setting mds_cache_size to 1 ? 

no 
> 
> 
> 
> 
> - Original Message - 
> From: "Zheng Yan"  
> To: "aderumier"  
> Cc: "ceph-users"  
> Sent: Thursday, 24 May 2018 16:27:21 
> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
> 
> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER  
> wrote: 
>> Thanks! 
>> 
>> 
>> here the profile.pdf 
>> 
>> 10-15min profiling, I can't do it longer because my clients were lagging. 
>> 
>> but I think it should be enough to observe the rss memory increase. 
>> 
>> 
> 
> Still can't find any clue. Does the cephfs have an idle period? If it 
> has, could you decrease the mds's cache size and check what happens. For 
> example, run the following commands during the idle period. 
> 
> ceph daemon mds.xx flush journal 
> ceph daemon mds.xx config set mds_cache_size 1; 
> "wait a minute" 
> ceph tell mds.xx heap stats 
> ceph daemon mds.xx config set mds_cache_size 0 
> 
> 
>> 
>> 
>> - Original Message - 
>> From: "Zheng Yan"  
>> To: "aderumier"  
>> Cc: "ceph-users"  
>> Sent: Thursday, 24 May 2018 11:34:20 
>> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
>> 
>> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
>> wrote: 
>>> Hi, some new stats: mds memory is now 16G, 
>>> 
>>> I have almost same number of items and bytes in cache vs some weeks ago 
>>> when mds was using 8G. (ceph 12.2.5) 
>>> 
>>> 
>>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
>>> dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | 
>>> jq -c '.mds_co'; done 
>>> 16905052 
>>> {"items":43350988,"bytes":5257428143} 
>>> 16905052 
>>> {"items":43428329,"bytes":5283850173} 
>>> 16905052 
>>> {"items":43209167,"bytes":5208578149} 
>>> 16905052 
>>> {"items":43177631,"bytes":5198833577} 
>>> 16905052 
>>> {"items":43312734,"bytes":5252649462} 
>>> 16905052 
>>> {"items":43355753,"bytes":5277197972} 
>>> 16905052 
>>> {"items":43700693,"bytes":5303376141} 
>>> 16905052 
>>> {"items":43115809,"bytes":5156628138} 
>>> ^C 
>>> 
>>> 
>>> 
>>> 
>>> root@ceph4-2:~# ceph status 
>>> cluster: 
>>> 

Re: [ceph-users] Some OSDs never get any data or PGs

2018-05-24 Thread David Turner
I had noticed the imbalance, but I haven't seen it leave an OSD completely
empty. It must be a function of the tree algorithm. Glad you figured it out.
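
For anyone hitting the same symptom, a hedged way to spot it without touching the cluster is to run the compiled CRUSH map through crushtool and look for devices that never receive any mappings (rule number and replica count below are illustrative):

# Hedged sketch: test how a CRUSH rule maps PGs across OSDs, offline.
ceph osd getcrushmap -o /tmp/crushmap.bin

# Per-device utilization for rule 0 with 3 replicas; OSDs that get no
# mappings at all point at a tree/weight imbalance like the one described here.
crushtool -i /tmp/crushmap.bin --test --rule 0 --num-rep 3 --show-utilization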

On Thu, May 24, 2018, 9:01 PM Pardhiv Karri  wrote:

> Finally figured out that it was happening because of an unbalanced rack
> structure. When we moved the host/OSD to another rack they worked just fine.
> We have now balanced the racks by moving hosts; some rebalancing happened due
> to that, but everything is fine now.
>
> Thanks,
> Pardhiv Karri
>
>
> On Tue, May 22, 2018 at 11:34 AM, Pardhiv Karri 
> wrote:
>
>> Hi,
>>
>> Here is our complete crush map that is being  used.
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable straw_calc_version 1
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> device 3 osd.3
>> device 4 osd.4
>> device 5 osd.5
>> device 6 osd.6
>> device 7 osd.7
>> device 8 osd.8
>> device 9 osd.9
>> device 10 osd.10
>> device 11 osd.11
>> device 12 osd.12
>> device 13 osd.13
>> device 14 osd.14
>> device 15 osd.15
>> device 16 osd.16
>> device 17 osd.17
>> device 18 osd.18
>> device 19 osd.19
>> device 20 osd.20
>> device 21 osd.21
>> device 22 osd.22
>> device 23 osd.23
>> device 24 osd.24
>> device 25 osd.25
>> device 26 osd.26
>> device 27 osd.27
>> device 28 osd.28
>> device 29 osd.29
>> device 30 osd.30
>> device 31 osd.31
>> device 32 osd.32
>> device 33 osd.33
>> device 34 osd.34
>> device 35 osd.35
>> device 36 osd.36
>> device 37 osd.37
>> device 38 osd.38
>> device 39 osd.39
>>
>> # types
>> type 0 osd
>> type 1 host
>> type 2 chassis
>> type 3 rack
>> type 4 row
>> type 5 pdu
>> type 6 pod
>> type 7 room
>> type 8 datacenter
>> type 9 region
>> type 10 root
>>
>> # buckets
>> host or1010051251040 {
>> id -3 # do not change unnecessarily
>> # weight 20.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item osd.0 weight 2.000 pos 0
>> item osd.1 weight 2.000 pos 1
>> item osd.2 weight 2.000 pos 2
>> item osd.3 weight 2.000 pos 3
>> item osd.4 weight 2.000 pos 4
>> item osd.5 weight 2.000 pos 5
>> item osd.6 weight 2.000 pos 6
>> item osd.7 weight 2.000 pos 7
>> item osd.8 weight 2.000 pos 8
>> item osd.9 weight 2.000 pos 9
>> }
>> host or1010051251044 {
>> id -8 # do not change unnecessarily
>> # weight 20.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item osd.30 weight 2.000 pos 0
>> item osd.31 weight 2.000 pos 1
>> item osd.32 weight 2.000 pos 2
>> item osd.33 weight 2.000 pos 3
>> item osd.34 weight 2.000 pos 4
>> item osd.35 weight 2.000 pos 5
>> item osd.36 weight 2.000 pos 6
>> item osd.37 weight 2.000 pos 7
>> item osd.38 weight 2.000 pos 8
>> item osd.39 weight 2.000 pos 9
>> }
>> rack rack_A1 {
>> id -2 # do not change unnecessarily
>> # weight 40.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item or1010051251040 weight 20.000 pos 0
>> item or1010051251044 weight 20.000 pos 1
>> }
>> host or1010051251041 {
>> id -5 # do not change unnecessarily
>> # weight 20.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item osd.10 weight 2.000 pos 0
>> item osd.11 weight 2.000 pos 1
>> item osd.12 weight 2.000 pos 2
>> item osd.13 weight 2.000 pos 3
>> item osd.14 weight 2.000 pos 4
>> item osd.15 weight 2.000 pos 5
>> item osd.16 weight 2.000 pos 6
>> item osd.17 weight 2.000 pos 7
>> item osd.18 weight 2.000 pos 8
>> item osd.19 weight 2.000 pos 9
>> }
>> host or1010051251045 {
>> id -9 # do not change unnecessarily
>> # weight 0.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> }
>> rack rack_B1 {
>> id -4 # do not change unnecessarily
>> # weight 20.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item or1010051251041 weight 20.000 pos 0
>> item or1010051251045 weight 0.000 pos 1
>> }
>> host or1010051251042 {
>> id -7 # do not change unnecessarily
>> # weight 20.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> item osd.20 weight 2.000 pos 0
>> item osd.21 weight 2.000 pos 1
>> item osd.22 weight 2.000 pos 2
>> item osd.23 weight 2.000 pos 3
>> item osd.24 weight 2.000 pos 4
>> item osd.25 weight 2.000 pos 5
>> item osd.26 weight 2.000 pos 6
>> item osd.27 weight 2.000 pos 7
>> item osd.28 weight 2.000 pos 8
>> item osd.29 weight 2.000 pos 9
>> }
>> host or1010051251046 {
>> id -10 # do not change unnecessarily
>> # weight 0.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> }
>> host or1010051251023 {
>> id -11 # do not change unnecessarily
>> # weight 0.000
>> alg tree # do not change pos for existing items unnecessarily
>> hash 0 # rjenkins1
>> }
>> rack rack_C1 {
>> id -6 # do not 

Re: [ceph-users] Can Bluestore work with 2 replicas or still need 3 for data integrity?

2018-05-24 Thread Linh Vu
You can use erasure coding for your SSDs in Luminous if you're worried about cost 
per TB.
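
For reference, a minimal sketch of that on Luminous (profile name and k/m values are illustrative); RBD or CephFS data on an EC pool additionally needs overwrites enabled, which requires BlueStore OSDs:

# Hedged sketch: k=4, m=2 erasure-coded pool (names and values are illustrative).
ceph osd erasure-code-profile set ec-42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 128 128 erasure ec-42

# Needed for RBD/CephFS data on EC pools (Luminous and later, BlueStore only).
ceph osd pool set ecpool allow_ec_overwrites true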


From: ceph-users  on behalf of Pardhiv Karri 

Sent: Friday, 25 May 2018 11:16:07 AM
To: ceph-users
Subject: [ceph-users] Can Bluestore work with 2 replicas or still need 3 for 
data integrity?

Hi,

Can Ceph BlueStore in Luminous work with 2 replicas, given that its crc32c 
checksumming is stronger than the hashing in filestore versions, or do we still 
need 3 replicas for data integrity?

In our current Hammer/filestore environment we are using 3 replicas with HDDs, 
but we are planning to move to BlueStore on Luminous, all SSD. Due to the cost 
of SSDs we want to know if 2 replicas are good enough or if we still need 3.

Thanks,
Pardhiv Karri


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Privacy Statement for the Ceph Project

2018-05-24 Thread Leonardo Vaz
Hi,

As part of Ceph's commitment to addressing the EU General Data Privacy
Regulation (GDPR), among other reasons, the Ceph privacy statement has
been revised. The new statement has been reviewed and approved by the
Ceph Advisory Board.

You should read the statement fully, but the key updates include:

* Clearer descriptions of how Ceph uses your personal data to provide services;
* Information about the GDPR;
* Additional detail on how your personal data may be shared to serve
the public interest and that of the open source community; and
* More focus on how you can review, modify, and update your personal data.

You can find the new statement at the following URL on Ceph Website:

  https://ceph.com/privacy

The new privacy statement takes effect on Friday, May 25, 2018. By
using Ceph's resources mentioned on the privacy stated on or after
that date, you are agreeing to these updates.

Respectfully,

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can Bluestore work with 2 replicas or still need 3 for data integrity?

2018-05-24 Thread Pardhiv Karri
Hi,

Can Ceph BlueStore in Luminous work with 2 replicas, given that its crc32c
checksumming is stronger than the hashing in filestore versions, or do we still
need 3 replicas for data integrity?

In our current Hammer/filestore environment we are using 3 replicas with HDDs,
but we are planning to move to BlueStore on Luminous, all SSD. Due to the cost
of SSDs we want to know if 2 replicas are good enough or if we still need 3.
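
(On the checksum half of the question: BlueStore's checksum is a per-OSD setting, crc32c by default, and it protects against silent corruption independently of the replica count; it does not change the availability/min_size maths. A hedged way to confirm what a running OSD uses, daemon id illustrative:)

# Hedged sketch: show the BlueStore checksum settings on a running OSD.
ceph daemon osd.0 config show | grep bluestore_csum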

Thanks,
Pardhiv Karri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some OSDs never get any data or PGs

2018-05-24 Thread Pardhiv Karri
Finally figured out that it was happening because of an unbalanced rack
structure. When we moved the host/OSD to another rack they worked just fine.
We have now balanced the racks by moving hosts; some rebalancing happened due
to that, but everything is fine now.

Thanks,
Pardhiv Karri


On Tue, May 22, 2018 at 11:34 AM, Pardhiv Karri 
wrote:

> Hi,
>
> Here is our complete crush map that is being  used.
>
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable straw_calc_version 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
> device 28 osd.28
> device 29 osd.29
> device 30 osd.30
> device 31 osd.31
> device 32 osd.32
> device 33 osd.33
> device 34 osd.34
> device 35 osd.35
> device 36 osd.36
> device 37 osd.37
> device 38 osd.38
> device 39 osd.39
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host or1010051251040 {
> id -3 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item osd.0 weight 2.000 pos 0
> item osd.1 weight 2.000 pos 1
> item osd.2 weight 2.000 pos 2
> item osd.3 weight 2.000 pos 3
> item osd.4 weight 2.000 pos 4
> item osd.5 weight 2.000 pos 5
> item osd.6 weight 2.000 pos 6
> item osd.7 weight 2.000 pos 7
> item osd.8 weight 2.000 pos 8
> item osd.9 weight 2.000 pos 9
> }
> host or1010051251044 {
> id -8 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item osd.30 weight 2.000 pos 0
> item osd.31 weight 2.000 pos 1
> item osd.32 weight 2.000 pos 2
> item osd.33 weight 2.000 pos 3
> item osd.34 weight 2.000 pos 4
> item osd.35 weight 2.000 pos 5
> item osd.36 weight 2.000 pos 6
> item osd.37 weight 2.000 pos 7
> item osd.38 weight 2.000 pos 8
> item osd.39 weight 2.000 pos 9
> }
> rack rack_A1 {
> id -2 # do not change unnecessarily
> # weight 40.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item or1010051251040 weight 20.000 pos 0
> item or1010051251044 weight 20.000 pos 1
> }
> host or1010051251041 {
> id -5 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item osd.10 weight 2.000 pos 0
> item osd.11 weight 2.000 pos 1
> item osd.12 weight 2.000 pos 2
> item osd.13 weight 2.000 pos 3
> item osd.14 weight 2.000 pos 4
> item osd.15 weight 2.000 pos 5
> item osd.16 weight 2.000 pos 6
> item osd.17 weight 2.000 pos 7
> item osd.18 weight 2.000 pos 8
> item osd.19 weight 2.000 pos 9
> }
> host or1010051251045 {
> id -9 # do not change unnecessarily
> # weight 0.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> }
> rack rack_B1 {
> id -4 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item or1010051251041 weight 20.000 pos 0
> item or1010051251045 weight 0.000 pos 1
> }
> host or1010051251042 {
> id -7 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item osd.20 weight 2.000 pos 0
> item osd.21 weight 2.000 pos 1
> item osd.22 weight 2.000 pos 2
> item osd.23 weight 2.000 pos 3
> item osd.24 weight 2.000 pos 4
> item osd.25 weight 2.000 pos 5
> item osd.26 weight 2.000 pos 6
> item osd.27 weight 2.000 pos 7
> item osd.28 weight 2.000 pos 8
> item osd.29 weight 2.000 pos 9
> }
> host or1010051251046 {
> id -10 # do not change unnecessarily
> # weight 0.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> }
> host or1010051251023 {
> id -11 # do not change unnecessarily
> # weight 0.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> }
> rack rack_C1 {
> id -6 # do not change unnecessarily
> # weight 20.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # rjenkins1
> item or1010051251042 weight 20.000 pos 0
> item or1010051251046 weight 0.000 pos 1
> item or1010051251023 weight 0.000 pos 2
> }
> host or1010051251048 {
> id -12 # do not change unnecessarily
> # weight 0.000
> alg tree # do not change pos for existing items unnecessarily
> hash 0 # 

Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Jack
On 05/24/2018 11:40 PM, Stefan Kooman wrote:
>> What are your thoughts, would you run 2x replication factor in
>> Production and in what scenarios?
Me neither, mostly because I have yet to read a technical point of view
from someone who has read and understands the code.

I do not buy Janne's "trust me, I am an engineer", who btw confirmed
that the "replica 3" stuff is subject to probability and is a function of
the cluster size, and thus is not a generic "always-true" rule.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Stefan Kooman
Quoting Anthony Verevkin (anth...@verevkin.ca):
> My thoughts on the subject are that even though checksums do allow to
> find which replica is corrupt without having to figure which 2 out of
> 3 copies are the same, this is not the only reason min_size=2 was
> required. Even if you are running all SSD which are more reliable than
> HDD and are keeping the disk size small so you could backfill quickly
> in case of a single disk failure, you would still occasionally have
> longer periods of degraded operation. To name a couple - a full node
> going down; or operator deliberately wiping an OSD to rebuild it.
> min_size=1 in this case would leave you running with no redundancy at
> all. DR scenario with pool-to-pool mirroring probably means that you
> can not just replace the lost or incomplete PGs in your main site from
> your DR, cause DR is likely to have a different PG layout, so full
> resync from DR would be required in case of one disk lost during such
> unprotected times.

... "min_size=1 in this case would leave you running with no redundancy
at all.". Exactly. And that would be the reason not to do it. DR is
asynchronous. What if the PG that gets lost has ACK'ed a WRITE but has
not been synchronised? Doing a "full resync" would bring you back in
time.

The DR site is not for free either, so I doubt that you actually really
win a lot here. I would opt for three datacenters: size=3, min_size=2
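
For reference, a minimal sketch of pinning that on an existing pool (pool name illustrative):

# Hedged sketch: keep three copies, stop serving I/O when fewer than two are available.
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2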

> 
> What are your thoughts, would you run 2x replication factor in
> Production and in what scenarios?

Not for me.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs no space on device error

2018-05-24 Thread Doug Bell

I am new to Ceph and have built a small Ceph instance on 3 servers.  I realize 
the configuration is probably not ideal but I’d like to understand an error I’m 
getting.

Ceph hosts are cm1, cm2, cm3.  Cephfs is mounted with ceph.fuse on a server c1. 
I am attempting to perform a simple cp -rp from one directory tree already in 
cephfs to another directory also inside of cephfs.  The directory tree is 2740 
files totaling 93G.  Approximately 3/4 of the way through the copy, the 
following error occurs:  "cp: failed to close ‘': No space left on 
device”  The odd thing is that it seems to finish the copy, as the final 
directory sizes are the same.  But scripts attached to the process see an error 
so it is causing a problem.

Any idea what is happening?  I have watched all of the ceph logs on one of the 
ceph servers and haven’t seen anything.
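
A hedged first check: a common cause of ENOSPC on CephFS is the data pool's fullest OSD crossing the full ratio, or a pool/filesystem quota being hit, rather than the pool-wide "avail" figure alone, so per-OSD fullness is worth looking at (commands as in Luminous):

# Hedged sketch: check per-OSD fullness, pool usage/quotas and the full ratios.
ceph osd df tree            # per-OSD %USE; the fullest OSD gates writes
ceph df detail              # per-pool usage, max avail and any quotas
ceph osd dump | grep ratio  # full / backfillfull / nearfull ratios in the OSDMap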

Here is some of the configuration.  The names actually aren’t obfuscated, they 
really are that generic.  IP Addresses are altered though.

# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

# ceph status
  cluster:
id: c14e77f1-9898-48d8-8a52-cd1f1c5bf689
health: HEALTH_WARN
1 MDSs behind on trimming

  services:
mon: 3 daemons, quorum cm1,cm3,cm2
mgr: cm3(active), standbys: cm2, cm1
mds: cephfs-1/1/1 up  {0=cm1=up:active}, 1 up:standby-replay, 1 up:standby
osd: 7 osds: 7 up, 7 in

  data:
pools:   2 pools, 256 pgs
objects: 377k objects, 401 GB
usage:   1228 GB used, 902 GB / 2131 GB avail
pgs: 256 active+clean

  io:
client:   852 B/s rd, 2 op/s rd, 0 op/s wr

# ceph osd status
++--+---+---++-++-+---+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
++--+---+---++-++-+---+
| 0  | cm1  |  134G |  165G |0   | 0   |0   | 0   | exists,up |
| 1  | cm1  |  121G |  178G |0   | 0   |0   | 0   | exists,up |
| 2  | cm2  |  201G | 98.3G |0   | 0   |1   |90   | exists,up |
| 3  | cm2  |  207G | 92.1G |0   | 0   |0   | 0   | exists,up |
| 4  | cm3  |  217G | 82.8G |0   | 0   |0   | 0   | exists,up |
| 5  | cm3  |  192G |  107G |0   | 0   |0   | 0   | exists,up |
| 6  | cm1  |  153G |  177G |0   | 0   |1   |16   | exists,up |
++--+---+---++-++-+---+

# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL   %USE  VAR  PGS
 0   ssd 0.29300  1.0     299G  134G    165G 44.74 0.78  79
 1   ssd 0.29300  1.0     299G  121G    178G 40.64 0.70  75
 6   ssd 0.32370  1.0     331G  153G    177G 46.36 0.80 102
 2   ssd 0.29300  1.0     299G  201G 100754M 67.20 1.17 129
 3   ssd 0.29300  1.0     299G  207G  94366M 69.28 1.20 127
 4   ssd 0.29300  1.0     299G  217G  84810M 72.39 1.26 131
 5   ssd 0.29300  1.0     299G  192G    107G 64.15 1.11 125
                    TOTAL 2131G 1228G    902G 57.65
MIN/MAX VAR: 0.70/1.26  STDDEV: 12.36

# ceph fs get cephfs
Filesystem 'cephfs' (1)
fs_name cephfs
epoch   1047
flags   c
created 2018-03-20 13:58:51.860813
modified    2018-03-20 13:58:51.860813
tableserver 0
root0
session_timeout 60
session_autoclose   300
max_file_size   1099511627776
last_failure0
last_failure_osd_epoch  98
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout 
v2}
max_mds 1
in  0
up  {0=74127}
failed
damaged
stopped
data_pools  [1]
metadata_pool   2
inline_data disabled
balancer
standby_count_wanted1
74127:  10.1.2.157:6800/3141645279 'cm1' mds.0.36 up:active seq 5 (standby for 
rank 0)
64318:  10.1.2.194:6803/2623342769 'cm2' mds.0.0 up:standby-replay seq 497658 
(standby for rank 0)

# ceph fs status
cephfs - 9 clients
==
+--++-+---+---+---+
| Rank | State  | MDS |Activity   |  dns  |  inos |
+--++-+---+---+---+
|  0   | active | cm1 | Reqs:0 /s |  295k |  292k |
| 0-s  | standby-replay | cm2 | Evts:0 /s |0  |0  |
+--++-+---+---+---+
+-+--+---+---+
|   Pool  |   type   |  used | avail |
+-+--+---+---+
| cephfs_metadata | metadata |  167M |  160G |
|   cephfs_data   |   data   |  401G |  160G |
+-+--+---+---+

+-+
| Standby MDS |
+-+
| cm3 |
+-+
+--+-+
| version   
   | daemons |

Re: [ceph-users] Ceph tech talk on deploy ceph with rook on kubernetes

2018-05-24 Thread Bryan Banister
Hi Sage,

Please provide a link to the youtube video once it's posted, thanks!!
-Bryan

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Sage 
Weil
Sent: Thursday, May 24, 2018 12:04 PM
To: ceph-de...@vger.kernel.org; ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph tech talk on deploy ceph with rook on kubernetes


Starting now!

https://redhat.bluejeans.com/967991495/

It'll be recorded and go up on youtube shortly as well.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pool nicely

2018-05-24 Thread Simon Ironside


On 22/05/18 18:28, David Turner wrote:
From my experience, that would cause you some trouble, as it would 
throw the entire pool into the deletion queue to be processed while it 
cleans up the disks and everything.  I would suggest taking a pool 
listing from `rados -p .rgw.buckets ls` and iterating on that with some 
scripts around the `rados -p .rgw.buckets rm <object>` command, which 
you could stop, restart at a faster pace, slow down, etc.  Once the 
objects in the pool are gone, you can delete the empty pool without any 
problems.  I like this option because it makes it simple to stop if 
you're impacting your VM traffic.
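
A minimal sketch of that throttled loop (pool name as above; the sleep is the pacing knob):

#!/bin/bash
# Hedged sketch: remove objects one at a time so the cleanup can be paused or slowed at will.
POOL=.rgw.buckets

rados -p "$POOL" ls | while read -r obj; do
    rados -p "$POOL" rm "$obj"
    sleep 0.05    # raise to slow down, lower or drop it to speed up
done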


Brilliant, thanks David. That's exactly the kind of answer I needed.

Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-05-24 Thread David Turner
I have some bluestore DC S4500's in my 3 node home cluster.  I haven't ever
had any problems with it.  I've used them with an EC cache tier, cephfs
metadata, and VM RBDs.

On Thu, May 24, 2018 at 2:21 PM Lionel Bouton 
wrote:

> Hi,
>
> On 22/02/2018 23:32, Mike Lovell wrote:
> > hrm. intel has, until a year ago, been very good with ssds. the
> > description of your experience definitely doesn't inspire confidence.
> > intel also dropping the entire s3xxx and p3xxx series last year before
> > having a viable replacement has been driving me nuts.
> >
> > i don't know that i have the luxury of being able to return all of the
> > ones i have or just buying replacements. i'm going to need to at least
> > try them in production. it'll probably happen with the s4600 limited
> > to a particular fault domain. these are also going to be filestore
> > osds so maybe that will result in a different behavior. i'll try to
> > post updates as i have them.
>
> Sorry for the deep digging into the archives. I might be in a situation
> where I could get S4600 (with filestore initially but I would very much
> like them to support Bluestore without bursting into flames).
>
> To expand a Ceph cluster and test EPYC in our context we have ordered a
> server based on a Supermicro EPYC motherboard and SM863a SSDs. For
> reference :
> https://www.supermicro.nl/Aplus/motherboard/EPYC7000/H11DSU-iN.cfm
>
> Unfortunately I just learned that Supermicro found an incompatibility
> between this motherboard and SM863a SSDs (I don't have more information
> yet) and they proposed S4600 as an alternative. I immediately remembered
> that there were problems and asked for a delay/more information and dug
> out this old thread.
>
> Has anyone successfully used Ceph with S4600 ? If so could you share if
> you used filestore or bluestore, which firmware was used and
> approximately how much data was written on the most used SSDs ?
>
> Best regards,
>
> Lionel
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-05-24 Thread Lionel Bouton
Hi,

On 22/02/2018 23:32, Mike Lovell wrote:
> hrm. intel has, until a year ago, been very good with ssds. the
> description of your experience definitely doesn't inspire confidence.
> intel also dropping the entire s3xxx and p3xxx series last year before
> having a viable replacement has been driving me nuts.
>
> i don't know that i have the luxury of being able to return all of the
> ones i have or just buying replacements. i'm going to need to at least
> try them in production. it'll probably happen with the s4600 limited
> to a particular fault domain. these are also going to be filestore
> osds so maybe that will result in a different behavior. i'll try to
> post updates as i have them.

Sorry for the deep digging into the archives. I might be in a situation
where I could get S4600 (with filestore initially but I would very much
like them to support Bluestore without bursting into flames).

To expand a Ceph cluster and test EPYC in our context we have ordered a
server based on a Supermicro EPYC motherboard and SM863a SSDs. For
reference :
https://www.supermicro.nl/Aplus/motherboard/EPYC7000/H11DSU-iN.cfm

Unfortunately I just learned that Supermicro found an incompatibility
between this motherboard and SM863a SSDs (I don't have more information
yet) and they proposed S4600 as an alternative. I immediately remembered
that there were problems and asked for a delay/more information and dug
out this old thread.

Has anyone successfully used Ceph with S4600 ? If so could you share if
you used filestore or bluestore, which firmware was used and
approximately how much data was written on the most used SSDs ?

Best regards,

Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] samba gateway experiences with cephfs ?

2018-05-24 Thread David Disseldorp
On Thu, 24 May 2018 15:13:09 +0200, Daniel Baumann wrote:

> On 05/24/2018 02:53 PM, David Disseldorp wrote:
> >> [ceph_test]
> >> path = /ceph-kernel
> >> guest ok = no
> >> delete readonly = yes
> >> oplocks = yes
> >> posix locking = no  
> 
> jftr, we use the following to disable all locking (on samba 4.8.2):
> 
>   oplocks = False
>   level2 oplocks = False
>   kernel oplocks = no

oplocks aren't locks per se - they allow the client to cache data
locally (leases in SMB2+), often allowing for improved application
performance. That said, if the same share path is accessible via NFS or
native CephFS then oplocks / leases should be disabled, until proper
vfs_ceph lease support is implemented via the delegation API.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph tech talk on deploy ceph with rook on kubernetes

2018-05-24 Thread Sage Weil
Starting now!

https://redhat.bluejeans.com/967991495/

It'll be recorded and go up on youtube shortly as well.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] nfs-ganesha HA with Cephfs

2018-05-24 Thread nigel davies
Hey all,

I am trying to set up an HA NFS cluster of two servers.

I have been reading and want the RecoveryBackend to be entirely in Ceph; as I
understand it, I need to

create a new pool and place it in the config,

but when I try to start nfs-ganesha with RecoveryBackend set to rados_ng
or rados_kv I get an error saying

nfs4_recovery_init :CLIENT ID :CRIT :Unknown recovery backend

The version I am using is 2.5.5, but from reading I think I need version
2.6, and I can't find it on the Ceph repos for Ubuntu, just an empty folder.
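
(A hedged first step, since the rados_* recovery backends are newer than the 2.5 series: confirm the version the packaged binary reports, and pre-create the RADOS pool the recovery configuration will point at. Pool name and pg count below are illustrative.)

# Hedged sketch: check the running nfs-ganesha version and pre-create the recovery pool.
ganesha.nfsd -v                        # 2.5.x does not know the rados_kv/rados_ng backends
ceph osd pool create nfs-ganesha 8 8   # pool later referenced from the Ganesha RADOS recovery settings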

Am I doing something really wrong?

thanks

nigdav007
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-osd@ service keeps restarting after removing osd

2018-05-24 Thread Michael Burk
Hello,

I'm trying to replace my OSDs with higher capacity drives. I went through
the steps to remove the OSD on the OSD node:
# ceph osd out osd.2
# ceph osd down osd.2
# ceph osd rm osd.2
Error EBUSY: osd.2 is still up; must be down before removal.
# systemctl stop ceph-osd@2
# ceph osd rm osd.2
removed osd.2
# ceph osd crush rm osd.2
removed item id 2 name 'osd.2' from crush map
# ceph auth del osd.2
updated

umount /var/lib/ceph/osd/ceph-2

It no longer shows in the crush map, and I am ready to remove the drive.
However, the ceph-osd@ service keeps restarting and mounting the disk in
/var/lib/ceph/osd. I do "systemctl stop ceph-osd@2" and umount the disk,
but then the service starts again and mounts the drive.

# systemctl stop ceph-osd@2
# umount /var/lib/ceph/osd/ceph-2

/dev/sdb1 on /var/lib/ceph/osd/ceph-2 type xfs
(rw,noatime,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)

ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous
(stable)

What am I missing?
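
(A hedged guess at the missing piece: `ceph osd rm` only edits the cluster map; the systemd unit on the host is still enabled, and ceph-disk's udev rules re-activate the partition on the next trigger. A sketch of the host-side steps, assuming OSD 2 on /dev/sdb as above:)

# Hedged sketch: keep the unit from coming back, then clean up the device.
systemctl disable --now ceph-osd@2      # stop it and prevent automatic restarts
umount /var/lib/ceph/osd/ceph-2

# Luminous can fold crush rm / auth del / osd rm into a single step:
ceph osd purge osd.2 --yes-i-really-mean-it

# Destructive: wipe the old OSD partitions so udev stops re-activating them.
ceph-disk zap /dev/sdb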

Thanks,
Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
>>Still can't find any clue. Does the cephfs have an idle period? If it 
>>has, could you decrease the mds's cache size and check what happens. For 
>>example, run the following commands during the idle period. 

>>ceph daemon mds.xx flush journal 
>>ceph daemon mds.xx config set mds_cache_size 1; 
>>"wait a minute" 
>>ceph tell mds.xx heap stats 
>>ceph daemon mds.xx config set mds_cache_size 0 

OK, thanks. I'll try it tonight.

I already have mds_cache_memory_limit = 5368709120,

does it need to be removed first before setting mds_cache_size to 1 ?




- Original Message -
From: "Zheng Yan" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Thursday, 24 May 2018 16:27:21
Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER  
wrote: 
> Thanks! 
> 
> 
> here the profile.pdf 
> 
> 10-15min profiling, I can't do it longer because my clients were lagging. 
> 
> but I think it should be enough to observe the rss memory increase. 
> 
> 

Still can't find any clue. Does the cephfs have an idle period? If it 
has, could you decrease the mds's cache size and check what happens. For 
example, run the following commands during the idle period. 

ceph daemon mds.xx flush journal 
ceph daemon mds.xx config set mds_cache_size 1; 
"wait a minute" 
ceph tell mds.xx heap stats 
ceph daemon mds.xx config set mds_cache_size 0 


> 
> 
> - Original Message - 
> From: "Zheng Yan"  
> To: "aderumier"  
> Cc: "ceph-users"  
> Sent: Thursday, 24 May 2018 11:34:20 
> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
> 
> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
> wrote: 
>> Hi, some new stats: mds memory is now 16G, 
>> 
>> I have almost same number of items and bytes in cache vs some weeks ago when 
>> mds was using 8G. (ceph 12.2.5) 
>> 
>> 
>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
>> dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | 
>> jq -c '.mds_co'; done 
>> 16905052 
>> {"items":43350988,"bytes":5257428143} 
>> 16905052 
>> {"items":43428329,"bytes":5283850173} 
>> 16905052 
>> {"items":43209167,"bytes":5208578149} 
>> 16905052 
>> {"items":43177631,"bytes":5198833577} 
>> 16905052 
>> {"items":43312734,"bytes":5252649462} 
>> 16905052 
>> {"items":43355753,"bytes":5277197972} 
>> 16905052 
>> {"items":43700693,"bytes":5303376141} 
>> 16905052 
>> {"items":43115809,"bytes":5156628138} 
>> ^C 
>> 
>> 
>> 
>> 
>> root@ceph4-2:~# ceph status 
>> cluster: 
>> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca 
>> health: HEALTH_OK 
>> 
>> services: 
>> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3 
>> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, 
>> ceph4-3.odiso.net 
>> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby 
>> osd: 18 osds: 18 up, 18 in 
>> rgw: 3 daemons active 
>> 
>> data: 
>> pools: 11 pools, 1992 pgs 
>> objects: 75677k objects, 6045 GB 
>> usage: 20579 GB used, 6246 GB / 26825 GB avail 
>> pgs: 1992 active+clean 
>> 
>> io: 
>> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr 
>> 
>> 
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status 
>> { 
>> "pool": { 
>> "items": 44523608, 
>> "bytes": 5326049009 
>> } 
>> } 
>> 
>> 
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump 
>> { 
>> "AsyncMessenger::Worker-0": { 
>> "msgr_recv_messages": 798876013, 
>> "msgr_send_messages": 825999506, 
>> "msgr_recv_bytes": 7003223097381, 
>> "msgr_send_bytes": 691501283744, 
>> "msgr_created_connections": 148, 
>> "msgr_active_connections": 146, 
>> "msgr_running_total_time": 39914.832387470, 
>> "msgr_running_send_time": 13744.704199430, 
>> "msgr_running_recv_time": 32342.160588451, 
>> "msgr_running_fast_dispatch_time": 5996.336446782 
>> }, 
>> "AsyncMessenger::Worker-1": { 
>> "msgr_recv_messages": 429668771, 
>> "msgr_send_messages": 414760220, 
>> "msgr_recv_bytes": 5003149410825, 
>> "msgr_send_bytes": 396281427789, 
>> "msgr_created_connections": 132, 
>> "msgr_active_connections": 132, 
>> "msgr_running_total_time": 23644.410515392, 
>> "msgr_running_send_time": 7669.068710688, 
>> "msgr_running_recv_time": 19751.610043696, 
>> "msgr_running_fast_dispatch_time": 4331.023453385 
>> }, 
>> "AsyncMessenger::Worker-2": { 
>> "msgr_recv_messages": 1312910919, 
>> "msgr_send_messages": 1260040403, 
>> "msgr_recv_bytes": 5330386980976, 
>> "msgr_send_bytes": 3341965016878, 
>> "msgr_created_connections": 143, 
>> "msgr_active_connections": 138, 
>> "msgr_running_total_time": 61696.635450100, 
>> "msgr_running_send_time": 23491.027014598, 
>> "msgr_running_recv_time": 53858.409319734, 
>> "msgr_running_fast_dispatch_time": 4312.451966809 
>> }, 
>> "finisher-PurgeQueue": { 
>> "queue_len": 0, 
>> "complete_latency": { 
>> "avgcount": 1889416, 
>> 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Yan, Zheng
On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER  wrote:
> Thanks!
>
>
> here the profile.pdf
>
> 10-15min profiling, I can't do it longer because my clients were lagging.
>
> but I think it should be enough to observe the rss memory increase.
>
>

Still can't find any clue. Does the cephfs have an idle period? If it
has, could you decrease the mds's cache size and check what happens. For
example, run the following commands during the idle period.

ceph daemon mds.xx flush journal
ceph daemon mds.xx config set mds_cache_size 1;
"wait a minute"
ceph tell mds.xx heap stats
ceph daemon mds.xx config set mds_cache_size 0


>
>
> - Original Message -
> From: "Zheng Yan" 
> To: "aderumier" 
> Cc: "ceph-users" 
> Sent: Thursday, 24 May 2018 11:34:20
> Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
> wrote:
>> Hi, some new stats: mds memory is now 16G,
>>
>> I have almost same number of items and bytes in cache vs some weeks ago when 
>> mds was using 8G. (ceph 12.2.5)
>>
>>
>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
>> dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | 
>> jq -c '.mds_co'; done
>> 16905052
>> {"items":43350988,"bytes":5257428143}
>> 16905052
>> {"items":43428329,"bytes":5283850173}
>> 16905052
>> {"items":43209167,"bytes":5208578149}
>> 16905052
>> {"items":43177631,"bytes":5198833577}
>> 16905052
>> {"items":43312734,"bytes":5252649462}
>> 16905052
>> {"items":43355753,"bytes":5277197972}
>> 16905052
>> {"items":43700693,"bytes":5303376141}
>> 16905052
>> {"items":43115809,"bytes":5156628138}
>> ^C
>>
>>
>>
>>
>> root@ceph4-2:~# ceph status
>> cluster:
>> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca
>> health: HEALTH_OK
>>
>> services:
>> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3
>> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, 
>> ceph4-3.odiso.net
>> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby
>> osd: 18 osds: 18 up, 18 in
>> rgw: 3 daemons active
>>
>> data:
>> pools: 11 pools, 1992 pgs
>> objects: 75677k objects, 6045 GB
>> usage: 20579 GB used, 6246 GB / 26825 GB avail
>> pgs: 1992 active+clean
>>
>> io:
>> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status
>> {
>> "pool": {
>> "items": 44523608,
>> "bytes": 5326049009
>> }
>> }
>>
>>
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump
>> {
>> "AsyncMessenger::Worker-0": {
>> "msgr_recv_messages": 798876013,
>> "msgr_send_messages": 825999506,
>> "msgr_recv_bytes": 7003223097381,
>> "msgr_send_bytes": 691501283744,
>> "msgr_created_connections": 148,
>> "msgr_active_connections": 146,
>> "msgr_running_total_time": 39914.832387470,
>> "msgr_running_send_time": 13744.704199430,
>> "msgr_running_recv_time": 32342.160588451,
>> "msgr_running_fast_dispatch_time": 5996.336446782
>> },
>> "AsyncMessenger::Worker-1": {
>> "msgr_recv_messages": 429668771,
>> "msgr_send_messages": 414760220,
>> "msgr_recv_bytes": 5003149410825,
>> "msgr_send_bytes": 396281427789,
>> "msgr_created_connections": 132,
>> "msgr_active_connections": 132,
>> "msgr_running_total_time": 23644.410515392,
>> "msgr_running_send_time": 7669.068710688,
>> "msgr_running_recv_time": 19751.610043696,
>> "msgr_running_fast_dispatch_time": 4331.023453385
>> },
>> "AsyncMessenger::Worker-2": {
>> "msgr_recv_messages": 1312910919,
>> "msgr_send_messages": 1260040403,
>> "msgr_recv_bytes": 5330386980976,
>> "msgr_send_bytes": 3341965016878,
>> "msgr_created_connections": 143,
>> "msgr_active_connections": 138,
>> "msgr_running_total_time": 61696.635450100,
>> "msgr_running_send_time": 23491.027014598,
>> "msgr_running_recv_time": 53858.409319734,
>> "msgr_running_fast_dispatch_time": 4312.451966809
>> },
>> "finisher-PurgeQueue": {
>> "queue_len": 0,
>> "complete_latency": {
>> "avgcount": 1889416,
>> "sum": 29224.227703697,
>> "avgtime": 0.015467333
>> }
>> },
>> "mds": {
>> "request": 1822420924,
>> "reply": 1822420886,
>> "reply_latency": {
>> "avgcount": 1822420886,
>> "sum": 5258467.616943274,
>> "avgtime": 0.002885429
>> },
>> "forward": 0,
>> "dir_fetch": 116035485,
>> "dir_commit": 1865012,
>> "dir_split": 17,
>> "dir_merge": 24,
>> "inode_max": 2147483647,
>> "inodes": 1600438,
>> "inodes_top": 210492,
>> "inodes_bottom": 100560,
>> "inodes_pin_tail": 1289386,
>> "inodes_pinned": 1299735,
>> "inodes_expired": 3476046,
>> "inodes_with_caps": 1299137,
>> "caps": 2211546,
>> "subtrees": 2,
>> "traverse": 1953482456,
>> "traverse_hit": 1127647211,
>> "traverse_forward": 0,
>> "traverse_discover": 0,
>> "traverse_dir_fetch": 105833969,
>> "traverse_remote_ino": 31686,
>> "traverse_lock": 4344,
>> "load_cent": 182244014474,
>> "q": 104,
>> "exported": 0,
>> "exported_inodes": 0,
>> "imported": 0,
>> 

[ceph-users] Ceph luminous packages for Ubuntu 18.04 LTS (bionic)?

2018-05-24 Thread Stefan Kooman
Hi List,

Will there be, at some point in time, Ceph Luminous packages for Ubuntu
18.04 LTS (bionic)? Or are we supposed to upgrade to "Mimic" / 18.04 LTS
in one go?

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] samba gateway experiences with cephfs ?

2018-05-24 Thread Jake Grimmett
Hi David,

Many thanks for your help :)

I'm using Scientific Linux 7.5
thus samba-4.7.1-6.el7.x86_64

I've added these settings to the share:

aio read size = 1
aio write size = 1

...and after restarting samba, Helios LanTest didn't show any real changes.
I will test from a Linux machine later and see if I/O improves there.

Glad to hear CTDB will work with posix locks off, I will start testing
this next week.

Oh, the RADOS object lock is definitely worth investigating... thanks
for this too :)

all the best,

Jake

On 24/05/18 13:53, David Disseldorp wrote:
> Hi Jake,
> 
> On Thu, 24 May 2018 13:17:16 +0100, Jake Grimmett wrote:
> 
>> Hi Daniel, David,
>>
>> Many thanks for both of your advice.
>>
>> Sorry not to reply to the list, but I'm subscribed to the digest and my
>> mail client will not reply to individual threads - I've switched back to
>> regular.
> 
> No worries, cc'ing the list in this response.
> 
>> As to this issue, I've turned off posix locking, which has improved
>> write speeds - here are the old benchmarks plus new figures.
>>
>> i.e. Using Helios LanTest 6.0.0 on Osx.
>>
>> Create 300 Files
>>  Cephfs (kernel) > samba (no Posix locks)
>>   average  3600 ms
>>  Cephfs (kernel) > samba. average 5100 ms
>>  Isilon  > CIFS  average 2600 ms
>>  ZFS > samba  average  121 ms
>>
>> Remove 300 files
>>  Cephfs (kernel) > samba (no Posix locks)
>>   average  2200 ms
>>  Cephfs (kernel) > samba. average 2100 ms
>>  Isilon  > CIFS  average  900 ms
>>  ZFS > samba  average  421 ms
>>
>> Write 300MB to file
>>  Cephfs (kernel) > samba (no Posix locks)
>>   average  53 MB/s
>>  Cephfs (kernel) > samba. average 25 MB/s
>>  Isilon  > CIFS  average  17.9 MB/s
>>  ZFS > samba  average  64.4 MB/s
>>
>>
>> Settings as follows:
>> [global]
>> (snip)
>> smb2 leases = yes
>>
>>
>> [ceph_test]
>> path = /ceph-kernel
>> guest ok = no
>> delete readonly = yes
>> oplocks = yes
>> posix locking = no
> 
> Which version of Samba are you using here? If it's relatively recent
> (4.6+), please rerun with asynchronous I/O enabled via:
>   [share]
>   aio read size = 1
>   aio write size = 1
> 
> ...these settings are the default with Samba 4.8+. AIO won't help the
> file creation / deletion benchmarks, but there should be a positive
> effect on read/write performance.
> 
>> Disabling all locking (locking = no) gives some further speed improvements.
>>
>> File locking hopefully will not be an issue...
>>
>> We are not exporting this share via NFS. The shares will only be used by
>> single clients (Windows or OSX Desktops) as a backup location.
>>
>> Specifically, each machine has a separate smb mounted folder, to which
>> they either use ChronoSync or Max SyncUp to write to.
>>
>> One other point...
>> Will CTDB work with "posix locking = no"?
>> It would be great if CTDB works, as I'd like to have a several SMB heads
>> to load-balance the clients
> 
> Yes, it shouldn't affect CTDB. Clustered FS POSIX locks are used by CTDB
> for split-brain avoidance, and are separate to Samba's
> client-lock <-> POSIX-lock mapping.
> (https://wiki.samba.org/index.php/Configuring_the_CTDB_recovery_lock)
> FYI, CTDB is now also capable of using RADOS objects for the recovery
> lock:
> https://ctdb.samba.org/manpages/ctdb_mutex_ceph_rados_helper.7.html
> 
> Cheers, David
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] samba gateway experiences with cephfs ?

2018-05-24 Thread Daniel Baumann
Hi,

On 05/24/2018 02:53 PM, David Disseldorp wrote:
>> [ceph_test]
>> path = /ceph-kernel
>> guest ok = no
>> delete readonly = yes
>> oplocks = yes
>> posix locking = no

jftr, we use the following to disable all locking (on samba 4.8.2):

  oplocks = False
  level2 oplocks = False
  kernel oplocks = no

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] samba gateway experiences with cephfs ?

2018-05-24 Thread David Disseldorp
Hi Jake,

On Thu, 24 May 2018 13:17:16 +0100, Jake Grimmett wrote:

> Hi Daniel, David,
> 
> Many thanks for both of your advice.
> 
> Sorry not to reply to the list, but I'm subscribed to the digest and my
> mail client will not reply to individual threads - I've switched back to
> regular.

No worries, cc'ing the list in this response.

> As to this issue, I've turned off posix locking, which has improved
> write speeds - here are the old benchmarks plus new figures.
> 
> i.e. Using Helios LanTest 6.0.0 on Osx.
> 
> Create 300 Files
>  Cephfs (kernel) > samba (no Posix locks)
>average  3600 ms
>  Cephfs (kernel) > samba. average 5100 ms
>  Isilon   > CIFS  average 2600 ms
>  ZFS > samba   average  121 ms
> 
> Remove 300 files
>  Cephfs (kernel) > samba (no Posix locks)
>average  2200 ms
>  Cephfs (kernel) > samba. average 2100 ms
>  Isilon   > CIFS  average  900 ms
>  ZFS > samba   average  421 ms
> 
> Write 300MB to file
>  Cephfs (kernel) > samba (no Posix locks)
>average  53 MB/s
>  Cephfs (kernel) > samba. average 25 MB/s
>  Isilon   > CIFS  average  17.9 MB/s
>  ZFS > samba   average  64.4 MB/s
> 
> 
> Settings as follows:
> [global]
> (snip)
> smb2 leases = yes
> 
> 
> [ceph_test]
> path = /ceph-kernel
> guest ok = no
> delete readonly = yes
> oplocks = yes
> posix locking = no

Which version of Samba are you using here? If it's relatively recent
(4.6+), please rerun with asynchronous I/O enabled via:
[share]
aio read size = 1
aio write size = 1

...these settings are the default with Samba 4.8+. AIO won't help the
file creation / deletion benchmarks, but there should be a positive
effect on read/write performance.

> Disabling all locking (locking = no) gives some further speed improvements.
> 
> File locking hopefully will not be an issue...
> 
> We are not exporting this share via NFS. The shares will only be used by
> single clients (Windows or OSX Desktops) as a backup location.
> 
> Specifically, each machine has a separate smb mounted folder, to which
> they either use ChronoSync or Max SyncUp to write to.
> 
> One other point...
> Will CTDB work with "posix locking = no"?
> It would be great if CTDB works, as I'd like to have a several SMB heads
> to load-balance the clients

Yes, it shouldn't affect CTDB. Clustered FS POSIX locks are used by CTDB
for split-brain avoidance, and are separate to Samba's
client-lock <-> POSIX-lock mapping.
(https://wiki.samba.org/index.php/Configuring_the_CTDB_recovery_lock)
FYI, CTDB is now also capable of using RADOS objects for the recovery
lock:
https://ctdb.samba.org/manpages/ctdb_mutex_ceph_rados_helper.7.html
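
For anyone trying that helper, a hedged sketch of what it can look like; the helper path, cephx user, pool and object names are illustrative and the exact configuration variable depends on the CTDB version, so treat the man page above as authoritative:

# Hedged sketch: RADOS-backed CTDB recovery lock (illustrative values).
# In the CTDB daemon configuration, the file-based recovery lock is replaced by the helper, e.g.:
#   CTDB_RECOVERY_LOCK="!/usr/libexec/ctdb/ctdb_mutex_ceph_rados_helper ceph client.samba ctdb ctdb_reclock"

# On the Ceph side the pool only needs to exist and be accessible to that cephx user:
ceph osd pool create ctdb 8 8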

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph - Xen accessing RBDs through libvirt

2018-05-24 Thread thg
Hi Eugen, hi all,

thank you very much for your answer!

>> So "something" goes wrong:
>>
>> # cat /var/log/libvirt/libxl/libxl-driver.log
>> -> ...
>> 2018-05-20 15:28:15.270+: libxl:
>> libxl_bootloader.c:634:bootloader_finished: bootloader failed - consult
>> logfile /var/log/xen/bootloader.7.log
>> 2018-05-20 15:28:15.270+: libxl:
>> libxl_exec.c:118:libxl_report_child_exitstatus: bootloader [26640]
>> exited with error status 1
>> 2018-05-20 15:28:15.271+: libxl:
>> libxl_create.c:1259:domcreate_rebuild_done: cannot (re-)build domain: -3
>>
>> # cat /var/log/xen/bootloader.7.log
>> ->
>> Traceback (most recent call last):
>>   File "/usr/lib64/xen/bin/pygrub", line 896, in 
>>     part_offs = get_partition_offsets(file)
>>   File "/usr/lib64/xen/bin/pygrub", line 113, in get_partition_offsets
>>     image_type = identify_disk_image(file)
>>   File "/usr/lib64/xen/bin/pygrub", line 56, in identify_disk_image
>>     fd = os.open(file, os.O_RDONLY)
>> OSError: [Errno 2] No such file or directory:
>> 'rbd:devel-pool/testvm3.rbd:id=libvirt:key=AQBThwFbGFRYFx==:auth_supported=cephx\\;none:mon_host=10.20.30.1\\:6789\\;10.20.30.2\\:6789\\;10.20.30.3\\:6789'
>>
> 
> we used to work with Xen hypervisors before we switched to KVM, all the
> VMs are within OpenStack. There was one thing we had to configure for
> Xen instances: the base image needed two image properties,
> "hypervisor_type = xen" and "kernel_id = " where the image for
> the kernel_id was uploaded from /usr/lib/grub2/x86_64-xen/grub.xen.
> For VMs independent from openstack we had to provide the kernel like this:
> 
> # kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
> kernel="/usr/lib/grub2/i386-xen/grub.xen"
> 
> I'm not sure if this is all that's required in your environment but we
> managed to run Xen VMs with Ceph backend.

I don't think that this is the cause, because as far as I understand the
error, Xen does not even try to look for the kernel or whatever.

On the first access to its image, located on the RBD, it says "file not
found"; otherwise it would say something like "kernel not found".
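
One hedged way to narrow this down from the dom0 is to check whether the image is reachable at all with the libvirt cephx user (assuming its keyring is available to the rbd CLI); if it is, the "No such file or directory" points at pygrub treating the rbd: string as a local path rather than at Ceph, and booting the domU with an explicit kernel (as Eugen suggested) avoids the bootloader entirely:

# Hedged sketch: verify the image and the client.libvirt credentials outside of Xen.
rbd -p devel-pool ls --id libvirt
rbd -p devel-pool info testvm3.rbd --id libvirt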


So any other ideas?
-- 

kind regards,

thg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
Thanks!


here the profile.pdf 

10-15min profiling, I can't do it longer because my clients were lagging.

but I think it should be enough to observe the rss memory increase.


 

- Original Message -
From: "Zheng Yan" 
To: "aderumier" 
Cc: "ceph-users" 
Sent: Thursday, 24 May 2018 11:34:20
Subject: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER  
wrote: 
> Hi, some new stats: mds memory is now 16G, 
> 
> I have almost same number of items and bytes in cache vs some weeks ago when 
> mds was using 8G. (ceph 12.2.5) 
> 
> 
> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf dump 
> | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | jq -c 
> '.mds_co'; done 
> 16905052 
> {"items":43350988,"bytes":5257428143} 
> 16905052 
> {"items":43428329,"bytes":5283850173} 
> 16905052 
> {"items":43209167,"bytes":5208578149} 
> 16905052 
> {"items":43177631,"bytes":5198833577} 
> 16905052 
> {"items":43312734,"bytes":5252649462} 
> 16905052 
> {"items":43355753,"bytes":5277197972} 
> 16905052 
> {"items":43700693,"bytes":5303376141} 
> 16905052 
> {"items":43115809,"bytes":5156628138} 
> ^C 
> 
> 
> 
> 
> root@ceph4-2:~# ceph status 
> cluster: 
> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca 
> health: HEALTH_OK 
> 
> services: 
> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3 
> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, 
> ceph4-3.odiso.net 
> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby 
> osd: 18 osds: 18 up, 18 in 
> rgw: 3 daemons active 
> 
> data: 
> pools: 11 pools, 1992 pgs 
> objects: 75677k objects, 6045 GB 
> usage: 20579 GB used, 6246 GB / 26825 GB avail 
> pgs: 1992 active+clean 
> 
> io: 
> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr 
> 
> 
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status 
> { 
> "pool": { 
> "items": 44523608, 
> "bytes": 5326049009 
> } 
> } 
> 
> 
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump 
> { 
> "AsyncMessenger::Worker-0": { 
> "msgr_recv_messages": 798876013, 
> "msgr_send_messages": 825999506, 
> "msgr_recv_bytes": 7003223097381, 
> "msgr_send_bytes": 691501283744, 
> "msgr_created_connections": 148, 
> "msgr_active_connections": 146, 
> "msgr_running_total_time": 39914.832387470, 
> "msgr_running_send_time": 13744.704199430, 
> "msgr_running_recv_time": 32342.160588451, 
> "msgr_running_fast_dispatch_time": 5996.336446782 
> }, 
> "AsyncMessenger::Worker-1": { 
> "msgr_recv_messages": 429668771, 
> "msgr_send_messages": 414760220, 
> "msgr_recv_bytes": 5003149410825, 
> "msgr_send_bytes": 396281427789, 
> "msgr_created_connections": 132, 
> "msgr_active_connections": 132, 
> "msgr_running_total_time": 23644.410515392, 
> "msgr_running_send_time": 7669.068710688, 
> "msgr_running_recv_time": 19751.610043696, 
> "msgr_running_fast_dispatch_time": 4331.023453385 
> }, 
> "AsyncMessenger::Worker-2": { 
> "msgr_recv_messages": 1312910919, 
> "msgr_send_messages": 1260040403, 
> "msgr_recv_bytes": 5330386980976, 
> "msgr_send_bytes": 3341965016878, 
> "msgr_created_connections": 143, 
> "msgr_active_connections": 138, 
> "msgr_running_total_time": 61696.635450100, 
> "msgr_running_send_time": 23491.027014598, 
> "msgr_running_recv_time": 53858.409319734, 
> "msgr_running_fast_dispatch_time": 4312.451966809 
> }, 
> "finisher-PurgeQueue": { 
> "queue_len": 0, 
> "complete_latency": { 
> "avgcount": 1889416, 
> "sum": 29224.227703697, 
> "avgtime": 0.015467333 
> } 
> }, 
> "mds": { 
> "request": 1822420924, 
> "reply": 1822420886, 
> "reply_latency": { 
> "avgcount": 1822420886, 
> "sum": 5258467.616943274, 
> "avgtime": 0.002885429 
> }, 
> "forward": 0, 
> "dir_fetch": 116035485, 
> "dir_commit": 1865012, 
> "dir_split": 17, 
> "dir_merge": 24, 
> "inode_max": 2147483647, 
> "inodes": 1600438, 
> "inodes_top": 210492, 
> "inodes_bottom": 100560, 
> "inodes_pin_tail": 1289386, 
> "inodes_pinned": 1299735, 
> "inodes_expired": 3476046, 
> "inodes_with_caps": 1299137, 
> "caps": 2211546, 
> "subtrees": 2, 
> "traverse": 1953482456, 
> "traverse_hit": 1127647211, 
> "traverse_forward": 0, 
> "traverse_discover": 0, 
> "traverse_dir_fetch": 105833969, 
> "traverse_remote_ino": 31686, 
> "traverse_lock": 4344, 
> "load_cent": 182244014474, 
> "q": 104, 
> "exported": 0, 
> "exported_inodes": 0, 
> "imported": 0, 
> "imported_inodes": 0 
> }, 
> "mds_cache": { 
> "num_strays": 14980, 
> "num_strays_delayed": 7, 
> "num_strays_enqueuing": 0, 
> "strays_created": 1672815, 
> "strays_enqueued": 1659514, 
> "strays_reintegrated": 666, 
> "strays_migrated": 0, 
> "num_recovering_processing": 0, 
> "num_recovering_enqueued": 0, 
> "num_recovering_prioritized": 0, 
> "recovery_started": 2, 
> "recovery_completed": 2, 
> "ireq_enqueue_scrub": 0, 
> "ireq_exportdir": 0, 
> "ireq_flush": 0, 
> "ireq_fragmentdir": 


Re: [ceph-users] Luminous: resilience - private interface down , no read/write

2018-05-24 Thread nokia ceph
Hi ,

We changed mon_osd_min_down_reporters to 69, and when the cluster
network is down, read/write is completely blocked and none of the OSDs move
to the down state in the mon status.
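(A quick way to see the mons' view while the cluster network is cut; a sketch
only, the example output is paraphrased:)

ceph -s              # overall health plus the up/in OSD counts the mons believe
ceph osd stat        # e.g. "68 osds: 68 up, 68 in" even while peers report failures
ceph health detail   # lists any OSDs the mons have actually marked down
ceph -w              # watch the cluster log for "reported failed by" messages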

We have set mon_osd_down_out_subtree_limit to host, which is our failure
domain, instead of the default of rack.

Could you please suggest other options we can try?

thanks,
Muthu

On Wed, May 23, 2018 at 4:51 PM, nokia ceph 
wrote:

> yes, it is 68 disks, and will this mon_osd_reporter_subtree_level = host
> have any impact on mon_osd_min_down_reporters?
>
> And related to min_size: yes, there were many suggestions for us to move to
> 2; due to storage efficiency concerns we still stay with 1 and are trying to
> convince customers to go with 2 for better data integrity.
>
> thanks,
> Muthu
>
> On Wed, May 23, 2018 at 3:31 PM, David Turner 
> wrote:
>
>> How many disks in each node? 68? If yes, then change it to 69. Also,
>> running with EC 4+1 is bad for the same reason as running with size=2
>> min_size=1, which has been mentioned and discussed multiple times on the ML.
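(For reference, a pool's replication settings can be checked and raised like
this; the pool name is just an example:)

ceph osd pool get mypool size        # number of replicas
ceph osd pool get mypool min_size    # copies required before accepting I/O
ceph osd pool set mypool min_size 2  # stop accepting I/O with only one copy left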
>>
>>
>> On Wed, May 23, 2018, 3:39 AM nokia ceph 
>> wrote:
>>
>>> Hi David Turner,
>>>
>>> This is our ceph config under the [mon] section; we have EC 4+1 and set the
>>> failure domain to host and mon_osd_min_down_reporters to 4 (OSDs from 4
>>> different hosts).
>>>
>>> [mon]
>>> mon_compact_on_start = True
>>> mon_osd_down_out_interval = 86400
>>> mon_osd_down_out_subtree_limit = host
>>> mon_osd_min_down_reporters = 4
>>> mon_osd_reporter_subtree_level = host
>>>
>>> We have 68 disks; can we increase mon_osd_min_down_reporters to 68?
>>>
>>> Thanks,
>>> Muthu
>>>
>>> On Tue, May 22, 2018 at 5:46 PM, David Turner 
>>> wrote:
>>>
 What happens when a storage node loses its cluster network but not its
 public network is that all the other OSDs in the cluster see that it is down
 and report that to the mons, but the node can still talk to the mons, telling
 the mons that it is up and that, in fact, everything else is down.

 The setting mon_osd_min_down_reporters (I think that's the name of it off the
 top of my head) is designed to help with this scenario. Its default is 1,
 which means any OSD on either side of the network problem will be trusted
 by the mons to mark OSDs down. What you want to do with this setting is to
 set it to at least 1 more than the number of OSDs in your failure domain.
 If the failure domain is host and each node has 32 OSDs, then setting it to
 33 will prevent a full problematic node from being able to cause havoc.
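(A sketch of that setting for the layout discussed in this thread: 68 OSDs per
host, so 69 reporters cannot all come from the single node that lost its
cluster network. Values are illustrative only; note that
mon_osd_reporter_subtree_level = host makes the mons count reporting hosts
rather than individual OSDs, so the two options should not be combined
blindly.)

# ceph.conf on the mons:
[mon]
mon_osd_min_down_reporters = 69

# or try it at runtime; if the mons answer "not observed", set it in
# ceph.conf and restart them:
ceph tell mon.* injectargs '--mon_osd_min_down_reporters 69'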

 The OSDs will still try to mark themselves as up, and this will still
 cause problems for reads until the OSD process stops or the network comes
 back up. There might be a setting for how long an OSD will keep telling the
 mons it's up, but this isn't really a situation I've come across after
 initial testing and installation of nodes.

 On Tue, May 22, 2018, 1:47 AM nokia ceph 
 wrote:

> Hi Ceph users,
>
> We have a cluster with 5 nodes (67 disks) and an EC 4+1 configuration with
> min_size set to 4.
> Ceph version: 12.2.5
> While executing one of our resilience use cases, taking the private
> interface down on one of the nodes, up to Kraken we saw less outage in rados
> (60s).
>
> Now with luminous, we see a rados read/write outage of
> more than 200s. In the logs we can see that peer OSDs report
> that one node's OSDs are down; however, the OSDs insist they are
> wrongly marked down and do not move to the down state for a long time.
>
> 2018-05-22 05:37:17.871049 7f6ac71e6700  0 log_channel(cluster) log
> [WRN] : Monitor daemon marked osd.1 down, but it is still running
> 2018-05-22 05:37:17.871072 7f6ac71e6700  0 log_channel(cluster) log
> [DBG] : map e35690 wrongly marked me down at e35689
> 2018-05-22 05:37:17.878347 7f6ac71e6700  0 osd.1 35690 crush map has
> features 1009107927421960192, adjusting msgr requires for osds
> 2018-05-22 05:37:18.296643 7f6ac71e6700  0 osd.1 35691 crush map has
> features 1009107927421960192, adjusting msgr requires for osds
>
>
> Only when all 67 OSDs have moved to the down state does the read/write
> traffic resume.
>
> Could you please help us resolve this issue? If it is a bug, we
> will create a corresponding ticket.
>
> Thanks,
> Muthu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

>>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD-primary crush rule doesn't work as intended

2018-05-24 Thread Peter Linder
It will also only work reliably if you use a single-level tree structure
with failure domain "host". If you want, say, separate data center
failure domains, you need extra steps to make sure an SSD host and an HDD
host do not get selected from the same DC.


I have done such a layout, so it is possible (see my older posts), but you
need to be careful when you construct the additional trees that are
needed in order to force the correct selections.


In reality, however, even if you force all reads to the SSDs using primary
affinity, you will soon run out of write IOPS on the HDDs. To keep up
with the SSDs you will need so many HDDs for an average workload that,
in order to keep up performance, you will not save any money.
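(A sketch of the primary-affinity approach mentioned above: make the HDD OSDs
ineligible to act as primaries so reads are served by the SSDs. The OSD ids are
only examples, and pre-Luminous clusters may additionally need
"mon osd allow primary affinity = true".)

# HDD OSDs get primary affinity 0, SSD OSDs keep the default of 1:
for osd in 0 1 2 3 12 15; do
    ceph osd primary-affinity osd.$osd 0
done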


Regards,

Peter



Den 2018-05-23 kl. 14:37, skrev Paul Emmerich:

You can't mix HDDs and SSDs in a server if you want to use such a rule.
The new selection step after "emit" can't know what server was 
selected previously.


Paul

2018-05-23 11:02 GMT+02:00 Horace:


To add to the info: I have a slightly modified rule that takes advantage
of the new storage class.

rule ssd-hybrid {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}

Regards,
Horace Ng

- Original Message -
From: "horace" >
To: "ceph-users" >
Sent: Wednesday, May 23, 2018 3:56:20 PM
Subject: [ceph-users] SSD-primary crush rule doesn't work as intended

I've set up the rule according to the doc, but some of the PGs are
still being assigned to the same host.

http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/


  rule ssd-primary {
              ruleset 5
              type replicated
              min_size 5
              max_size 10
              step take ssd
              step chooseleaf firstn 1 type host
              step emit
              step take platter
              step chooseleaf firstn -1 type host
              step emit
      }

Crush tree:

[root@ceph0 ~]#    ceph osd crush tree
ID CLASS WEIGHT   TYPE NAME
-1       58.63989 root default
-2       19.55095     host ceph0
 0   hdd  2.73000         osd.0
 1   hdd  2.73000         osd.1
 2   hdd  2.73000         osd.2
 3   hdd  2.73000         osd.3
12   hdd  4.54999         osd.12
15   hdd  3.71999         osd.15
18   ssd  0.2         osd.18
19   ssd  0.16100         osd.19
-3       19.55095     host ceph1
 4   hdd  2.73000         osd.4
 5   hdd  2.73000         osd.5
 6   hdd  2.73000         osd.6
 7   hdd  2.73000         osd.7
13   hdd  4.54999         osd.13
16   hdd  3.71999         osd.16
20   ssd  0.16100         osd.20
21   ssd  0.2         osd.21
-4       19.53799     host ceph2
 8   hdd  2.73000         osd.8
 9   hdd  2.73000         osd.9
10   hdd  2.73000         osd.10
11   hdd  2.73000         osd.11
14   hdd  3.71999         osd.14
17   hdd  4.54999         osd.17
22   ssd  0.18700         osd.22
23   ssd  0.16100         osd.23

#ceph pg ls-by-pool ssd-hybrid

27.8       1051                  0        0         0    0
4399733760 1581     1581               active+clean 2018-05-23
06:20:56.088216 27957'185553 27959:368828 [23,1,11]         23 
[23,1,11]             23 27953'182582 2018-05-23 06:20:56.088172 
  27843'162478 2018-05-20 18:28:20.118632

With osd.23 and osd.11 being assigned on the same host.
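(A quick way to reproduce such mappings offline and check a rule before relying
on it; file names are arbitrary:)

ceph osd getcrushmap -o crushmap.bin
# print sample PG -> OSD mappings for rule id 2 (ssd-hybrid) with 3 replicas:
crushtool -i crushmap.bin --test --rule 2 --num-rep 3 --show-mappings | head
# decompile for a human-readable osd -> host cross-check:
crushtool -d crushmap.bin -o crushmap.txt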

Regards,
Horace Ng
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list

Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Alexandre DERUMIER
Hi,

>>My thoughts on the subject are that even though checksums do allow to find 
>>which replica is corrupt without having to figure which 2 out of 3 copies are 
>>the same, this is not the only reason min_size=2 was required.

AFAIK, 

comparing copies (like "2 out of 3 copies are the same") has never been implemented.
pg repair, for example, still copies the primary PG to the replicas (even if it's
corrupt).


An old topic about this:
http://ceph-users.ceph.narkive.com/zS2yZ2FL/how-safe-is-ceph-pg-repair-these-days
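(For context, the usual repair workflow looks roughly like this, which is why
the caveat above matters; the PG id is a made-up example:)

ceph health detail                                      # e.g. "pg 2.5f is active+clean+inconsistent"
rados list-inconsistent-obj 2.5f --format=json-pretty   # see which copies differ
ceph pg repair 2.5f                                     # per the caveat above, driven from the primary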

- Mail original -
De: "Janne Johansson" 
À: c...@jack.fr.eu.org
Cc: "ceph-users" 
Envoyé: Jeudi 24 Mai 2018 08:33:32
Objet: Re: [ceph-users] Ceph replication factor of 2

Den tors 24 maj 2018 kl 00:20 skrev Jack <c...@jack.fr.eu.org>: 


Hi, 

I have to say, this is a common yet worthless argument 
If I have 3000 OSD, using 2 or 3 replica will not change much : the 
probability of losing 2 devices is still "high" 
On the other hand, if I have a small cluster, less than a hundred OSD, 
that same probability become "low" 



It's about losing the 2 or 3 OSDs that any particular PG is on that matters, 
not if there are 1000 other OSDs in the next rack. 
Losing data is rather binary; it's not a 0.0 -> 1.0 scale. Either a piece 
of data is lost because its storage units are not there, 
or it's not. Murphy's law will make it so that this lost piece of data is rather 
important to you. And Murphy will of course pick the 
2-3 OSDs that are the worst case for you. 

BQ_BEGIN

I do not buy the "if someone is doing maintenance and a device fails" argument 
either: this is a no-limit goal: what if X servers burn at the same 
time? What if an admin makes a mistake and drops 5 OSDs? What if some 
network ToR switches or routers blow away? 
Should we do one replica per OSD? 


BQ_END

From my viewpoint, maintenance must happen. Unplanned maintenance will happen 
even if I wish it not to. 
So the 2-vs-3 is about what situation you end up in when one replica is under 
(planned or not) maintenance. 
Is this a "any surprise makes me lose data now" mode, or is it "many surprises 
need to occur"? 

BQ_BEGIN

I would like people, especially the Ceph devs and other people who 
know how it works deeply (read the code!), to give us their advice 

BQ_END

How about listening to people who have lost data during 20+ year long careers 
in storage? 
They will know a lot more on how the very improbable or "impossible" still 
happened to them 
at the most unfortunate moment, regardless of what the code readers say. 

This is all about weighing risks. If the risk for you is "ok, then I have to 
redownload that lost ubuntu-ISO again", it's fine 
to stick to keeping data in only one place. 

If the company goes out of business or at least faces 2 days total stop while 
some sleep-deprived admin tries the 
bare metal restores for the first time of her life then the price of SATA disks 
to cover 3 replicas will be literally 
nothing compared to that. 

To me it sounds like you are chasing some kind of validation of an answer you 
already have while asking the questions, 
so if you want to go 2-replicas, then just do it. But you don't get to complain 
to ceph or ceph-users when you also figure out 
that the Mean-Time-Between-Failure ratings on the stickers of the disks are 
bogus and what you really needed was 
"mean time between surprises", and that's always less than MTBF. 

-- 
May the most significant bit of your life be positive. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Daniel Baumann
Hi,

I couldn't agree more, but just to re-emphasize what others already said:

  the point of replica 3 is not to have extra safety for
  (human|software|server) failures, but to have enough data around to
  allow rebalancing the cluster when disks fail.

after a certain number of disks in a cluster, you're going to get disk
failures all the time. if you don't pay extra attention (and waste
lots and lots of time/money) to carefully arrange/choose disks from
different vendors' production lines/dates, simultaneous disk failures
can happen within minutes.


example from our past:

on our (at that time small) cluster of 72 disks spread over 6 storage
nodes, half of them were seagate enterprise capacity disks, the other
half western digital red pro. for each disk manufacturer, we bought
only half of the disks from the same production batch. so.. we had..

  * 18 disks wd, production batch A
  * 18 disks wd, production batch B
  * 18 disks seagate, production batch C
  * 18 disks seagate, production batch D

one day, 6 disks failed simultaneously, spread over two storage nodes.
had we had replica 2, we couldn't have recovered and would have lost data.
instead, because of replica 3, we didn't lose any data and ceph
automatically rebalanced all data before further disks failed.


so: if the data stored on the cluster is valuable (because it
costs much time and effort to 're-collect' it, or you can't accept the
time it takes to restore from backup, or, worse, to re-create it from
scratch), you have to assume that whatever manufacturer/production
batch of HDDs you're using, they *can* all fail at the same time, because
you could have hit a faulty production run.

the only way out here is replica >=3.

(of course, the whole MTBF and why raid doesn't scale applies as well)

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Janne Johansson
Den tors 24 maj 2018 kl 00:20 skrev Jack :

> Hi,
>
> I have to say, this is a common yet worthless argument
> If I have 3000 OSDs, using 2 or 3 replicas will not change much: the
> probability of losing 2 devices is still "high".
> On the other hand, if I have a small cluster, fewer than a hundred OSDs,
> that same probability becomes "low".
>

It's about losing the 2 or 3 OSDs that any particular PG is on that
matters, not if there are 1000 other OSDs in the next rack.
Losing data is rather binary; it's not a 0.0 -> 1.0 scale. Either a
piece of data is lost because its storage units are not there,
or it's not. Murphy's law will make it so that this lost piece of data is
rather important to you. And Murphy will of course pick the
2-3 OSDs that are the worst case for you.


>
> I do not buy the "if someone is doing maintenance and a device fails" argument
> either: this is a no-limit goal: what if X servers burn at the same
> time? What if an admin makes a mistake and drops 5 OSDs? What if some
> network ToR switches or routers blow away?
> Should we do one replica per OSD?
>
>
From my viewpoint, maintenance must happen. Unplanned maintenance will
happen even if I wish it not to.
So the 2-vs-3 is about what situation you end up in when one replica is
under (planned or not) maintenance.
Is this a "any surprise makes me lose data now" mode, or is it "many
surprises need to occur"?


>
> I would like people, especially the Ceph devs and other people who
> know how it works deeply (read the code!), to give us their advice
>

How about listening to people who have lost data during 20+ year long
careers in storage?
They will know a lot more on how the very improbable or "impossible" still
happened to them
at the most unfortunate moment, regardless of what the code readers say.

This is all about weighing risks. If the risk for you is "ok, then I have
to redownload that lost ubuntu-ISO again", it's fine
to stick to keeping data in only one place.

If the company goes out of business or at least faces 2 days total stop
while some sleep-deprived admin tries the
bare metal restores for the first time of her life then the price of SATA
disks to cover 3 replicas will be literally
nothing compared to that.

To me it sounds like you are chasing some kind of validation of an answer
you already have while asking the questions,
so if you want to go 2-replicas, then just do it. But you don't get to
complain to ceph or ceph-users when you also figure out
that the Mean-Time-Between-Failure ratings on the stickers of the disks are
bogus and what you really needed was
"mean time between surprises", and that's always less than MTBF.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com