Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Jelle de Jong
 seconds or 0 objects

Object prefix: benchmark_data_ceph02_396172
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -          0
    1      16        77        61   243.956       244     0.200718   0.227978
    2      16       151       135   269.946       296     0.327927     0.2265
    3      16       215       199   265.281       256    0.0875193   0.225989
    4      16       288       272   271.951       292     0.184617   0.227921
    5      16       358       342   273.553       280     0.140823    0.22683
    6      16       426       410   273.286       272     0.118436   0.226586
    7      16       501       485   277.094       300     0.224887   0.226209
    8      16       573       557   278.452       288     0.200903   0.226424
    9      16       643       627   278.619       280     0.214474   0.227003
   10      16       711       695   277.952       272     0.259724   0.226849

Total time run: 10.146720
Total writes made:  712
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 280.682
Stddev Bandwidth:   17.7138
Max bandwidth (MB/sec): 300
Min bandwidth (MB/sec): 244
Average IOPS:   70
Stddev IOPS:4
Max IOPS:   75
Min IOPS:   61
Average Latency(s):     0.227538
Stddev Latency(s):  0.0843661
Max latency(s): 0.48464
Min latency(s): 0.0467124
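
(For context, output like the above typically comes from a plain rados
bench run along these lines; the pool name and runtime are placeholders,
not taken from the message:)

rados bench -p <pool> 10 write -t 16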


On 2020-01-06 20:44, Jelle de Jong wrote:

Hello everybody,

I have issues with very slow requests on a simple three-node cluster here:
four WDC enterprise disks and an Intel Optane NVMe journal on identical
high-memory nodes, with 10Gb networking.


It was all working fine with Ceph Hammer on Debian Wheezy, but I wanted
to upgrade to a supported version and test out bluestore as well. So I
upgraded to Luminous on Debian Stretch and used ceph-volume to create
bluestore OSDs; everything went downhill from there.


I went back to filestore on all nodes but I still have slow requests and
I cannot pinpoint a good reason. I tried to debug and gathered
information to look at:


https://paste.debian.net/hidden/acc5d204/

First I thought it was the balancing that was making things slow, then I
thought it might be the LVM layer, so I recreated the nodes without LVM
by switching from ceph-volume to ceph-disk: no difference, still slow
requests. Then I changed back from bluestore to filestore, but the
cluster is still very slow. Then I thought it was a CPU scheduling issue
and downgraded the 5.x kernel, and CPU performance is at full speed
again. I thought maybe there was something weird with an OSD and tried
taking them out one by one, but slow requests still show up and client
performance from the VMs is really poor.


It feels like a burst of small requests keeps blocking for a while and
then recovering again.
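
(For reference, the blocked requests themselves can usually be inspected
on the implicated OSDs with something like the following; the osd id is
only an example:)

ceph health detail
ceph daemon osd.3 dump_ops_in_flight
ceph daemon osd.3 dump_historic_ops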


Many thanks for helping out and looking at the URL.

If there are options I should tune for an HDD with NVMe journal setup,
please share.


Jelle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Random slow requests without any load

2020-01-06 Thread Jelle de Jong

Hi,

What are the full commands you used to set up this iptables config?

iptables --table raw --append OUTPUT --jump NOTRACK
iptables --table raw --append PREROUTING --jump NOTRACK

That alone does not create the same output; it needs something more.
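
(Best guess at a minimal set of commands that would reproduce the
ruleset shown below -- the default policies plus the raw-table rules;
an untested sketch, not confirmed by the thread:)

iptables --table filter --policy FORWARD DROP
iptables --table raw --append OUTPUT --jump NOTRACK
iptables --table raw --append PREROUTING --jump NOTRACK
ip6tables --table filter --policy FORWARD DROP
ip6tables --table raw --append OUTPUT --jump NOTRACK
ip6tables --table raw --append PREROUTING --jump NOTRACK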

Kind regards,

Jelle de Jong



On 2019-07-17 14:59, Kees Meijs wrote:

Hi,

Experienced similar issues. Our cluster internal network (completely
separated) now has NOTRACK (no connection state tracking) iptables rules.

In full:


# iptables-save
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:57:38 2019
*filter
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:INPUT ACCEPT [0:0]
COMMIT
# Completed on Wed Jul 17 14:57:38 2019
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:57:38 2019
*raw
:OUTPUT ACCEPT [0:0]
:PREROUTING ACCEPT [0:0]
-A OUTPUT -j NOTRACK
-A PREROUTING -j NOTRACK
COMMIT
# Completed on Wed Jul 17 14:57:38 2019


Ceph uses IPv4 in our case, but to be complete:


# ip6tables-save
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:58:20 2019
*filter
:OUTPUT ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
COMMIT
# Completed on Wed Jul 17 14:58:20 2019
# Generated by xtables-save v1.8.2 on Wed Jul 17 14:58:20 2019
*raw
:OUTPUT ACCEPT [0:0]
:PREROUTING ACCEPT [0:0]
-A OUTPUT -j NOTRACK
-A PREROUTING -j NOTRACK
COMMIT
# Completed on Wed Jul 17 14:58:20 2019


Using this configuration, the connection state tables can never fill up
and cause dropped connections as a result.

Cheers,
Kees

On 17-07-2019 11:27, Maximilien Cuony wrote:

Just a quick update about this in case somebody else gets the same issue:

The problem was with the firewall. The port range and established
connections are allowed, but for some reason the tracking of connections
gets lost, leading to a strange state where one machine refuses data
(RSTs are replied) and the sender never gets the RST packet (even with
'related' packets allowed).

There was a similar post on this list in February ("Ceph and TCP
States") where the loss of connections in conntrack created issues, but
that fix, net.netfilter.nf_conntrack_tcp_be_liberal=1, did not improve
this particular case.

As a workaround, we installed lighter rules for the firewall (allowing
all packets from machines inside the cluster by default) and that
"fixed" the issue :)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-06 Thread Jelle de Jong

Hello everybody,

I have issues with very slow requests on a simple three-node cluster here:
four WDC enterprise disks and an Intel Optane NVMe journal on identical
high-memory nodes, with 10Gb networking.


It was all working fine with Ceph Hammer on Debian Wheezy, but I wanted
to upgrade to a supported version and test out bluestore as well. So I
upgraded to Luminous on Debian Stretch and used ceph-volume to create
bluestore OSDs; everything went downhill from there.


I went back to filestore on all nodes but I still have slow requests and
I cannot pinpoint a good reason. I tried to debug and gathered
information to look at:


https://paste.debian.net/hidden/acc5d204/

First I thought it was the balancing that was making things slow, then I
thought it might be the LVM layer, so I recreated the nodes without LVM
by switching from ceph-volume to ceph-disk: no difference, still slow
requests. Then I changed back from bluestore to filestore, but the
cluster is still very slow. Then I thought it was a CPU scheduling issue
and downgraded the 5.x kernel, and CPU performance is at full speed
again. I thought maybe there was something weird with an OSD and tried
taking them out one by one, but slow requests still show up and client
performance from the VMs is really poor.


It feels like a burst of small requests keeps blocking for a while and
then recovering again.


Many thanks for helping out and looking at the URL.

If there are options I should tune for an HDD with NVMe journal setup,
please share.


Jelle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-12 Thread Jelle de Jong

Hello everybody,

I have a three-node Ceph cluster made of E3-1220v3 machines, 24GB RAM,
6 HDD OSDs with a 32GB Intel Optane NVMe journal, and 10Gb networking.


I wanted to move to bluestore because filestore support is being
dropped; our cluster was working fine with filestore and we could take
complete nodes out for maintenance without issues.


root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices;
we recreated the OSDs as bluestore and used a small 5GB partition as a
RocksDB device instead of a journal for all OSDs.


I saw the cluster suffer with inactive pgs and slow requests.

I tried setting the following on all nodes, but it made no difference:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

It took three days to recover and during this time clients were not 
responsive.


How can I migrate to bluestore without inactive pgs or slow requests? I
have several more filestore clusters and I would like to know how to
migrate them without inactive pgs and slow requests.
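
(One generic approach, sketched here for reference only -- these are
standard flags and throttles, not specific advice from this thread:)

ceph osd set noout        # keep OSDs from being marked out during the rebuild
ceph osd set norebalance  # hold off data movement until the new OSDs are up
# ...zap and recreate the node's OSDs as bluestore...
ceph osd unset norebalance
ceph osd unset noout
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'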


As a side question: I optimized our cluster for filestore, and the Intel
Optane NVMe journals showed good fio dsync write tests. Does bluestore
also use dsync writes for RocksDB caching, or can we select NVMe devices
on other specifications? My tests with filestore showed that the Optane
NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a
few GB for filestore journals, but with bluestore RocksDB caching the
situation is different and I can't find documentation on how to
speed-test NVMe devices for bluestore.
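
(For what it's worth, the usual journal-style sync-write test is
something like the fio run below; the device path is only an example and
the test overwrites data on it:)

fio --name=journal-test --filename=/dev/nvme0n1 --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based

Whether this is still the right figure of merit for a bluestore block.db
device is exactly the open question above.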


Kind regards,

Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   60.04524 root default
-2   20.01263 host ceph04
 0   hdd  2.72899 osd.0   up  1.0 1.0
 1   hdd  2.72899 osd.1   up  1.0 1.0
 2   hdd  5.45799 osd.2   up  1.0 1.0
 3   hdd  2.72899 osd.3   up  1.0 1.0
14   hdd  3.63869 osd.14  up  1.0 1.0
15   hdd  2.72899 osd.15  up  1.0 1.0
-3   20.01263 host ceph05
 4   hdd  5.45799 osd.4   up  1.0 1.0
 5   hdd  2.72899 osd.5   up  1.0 1.0
 6   hdd  2.72899 osd.6   up  1.0 1.0
13   hdd  3.63869 osd.13  up  1.0 1.0
16   hdd  2.72899 osd.16  up  1.0 1.0
18   hdd  2.72899 osd.18  up  1.0 1.0
-4   20.01997 host ceph06
 8   hdd  5.45999 osd.8   up  1.0 1.0
 9   hdd  2.73000 osd.9   up  1.0 1.0
10   hdd  2.73000 osd.10  up  1.0 1.0
11   hdd  2.73000 osd.11  up  1.0 1.0
12   hdd  3.64000 osd.12  up  1.0 1.0
17   hdd  2.73000 osd.17  up  1.0 1.0


root@ceph04:~# ceph status
  cluster:
id: 85873cda-4865-4147-819d-8deda5345db5
health: HEALTH_WARN
18962/11801097 objects misplaced (0.161%)
1/3933699 objects unfound (0.000%)
Reduced data availability: 42 pgs inactive
Degraded data redundancy: 3645135/11801097 objects degraded 
(30.888%), 959 pgs degraded, 960 pgs undersized

110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
mon: 3 daemons, quorum ceph04,ceph05,ceph06
mgr: ceph04(active), standbys: ceph06, ceph05
osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
pools:   1 pools, 1024 pgs
objects: 3.93M objects, 15.0TiB
usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
pgs: 4.102% pgs not active
 3645135/11801097 objects degraded (30.888%)
 18962/11801097 objects misplaced (0.161%)
 1/3933699 objects unfound (0.000%)
 913 active+undersized+degraded+remapped+backfill_wait
 60  active+clean
 41  activating+undersized+degraded+remapped
 4   active+remapped+backfill_wait
 4   active+undersized+degraded+remapped+backfilling
 1   undersized+degraded+remapped+backfilling+peered
 1   active+recovery_wait+undersized+remapped

  io:
recovery: 197MiB/s, 49objects/s


root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects 
unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded 
data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs 
degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. 
Implicated osds 3,10,11

OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs i

[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-06 Thread Jelle de Jong

Hello everybody,

[fix confusing typo]

I have a three-node Ceph cluster made of E3-1220v3 machines, 24GB RAM,
6 HDD OSDs with a 32GB Intel Optane NVMe journal, and 10Gb networking.


I wanted to move to bluestore because filestore support is being
dropped; our cluster was working fine with filestore and we could take
complete nodes out for maintenance without issues.


root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices;
we recreated the OSDs as bluestore and used a small 5GB partition as the
block.db device instead of a journal for all OSDs.


I saw the cluster suffer with inactive pgs and slow requests.

I tried setting the following on all nodes, but it made no difference:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

How can I migrate to bluestore without inactive pgs or slow requests? I
have several more filestore clusters and I would like to know how to
migrate them without inactive pgs and slow requests.


As a side question: I optimized our cluster for filestore, and the Intel
Optane NVMe journals showed good fio dsync write tests. Does bluestore
also use dsync writes for RocksDB caching, or can we select NVMe devices
on other specifications? My tests with filestore showed that the Optane
NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a
few GB for filestore journals, but with bluestore RocksDB caching the
situation is different and I can't find documentation on how to
speed-test NVMe devices for bluestore.


Kind regards,

Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   60.04524 root default
-2   20.01263 host ceph04
 0   hdd  2.72899 osd.0   up  1.0 1.0
 1   hdd  2.72899 osd.1   up  1.0 1.0
 2   hdd  5.45799 osd.2   up  1.0 1.0
 3   hdd  2.72899 osd.3   up  1.0 1.0
14   hdd  3.63869 osd.14  up  1.0 1.0
15   hdd  2.72899 osd.15  up  1.0 1.0
-3   20.01263 host ceph05
 4   hdd  5.45799 osd.4   up  1.0 1.0
 5   hdd  2.72899 osd.5   up  1.0 1.0
 6   hdd  2.72899 osd.6   up  1.0 1.0
13   hdd  3.63869 osd.13  up  1.0 1.0
16   hdd  2.72899 osd.16  up  1.0 1.0
18   hdd  2.72899 osd.18  up  1.0 1.0
-4   20.01997 host ceph06
 8   hdd  5.45999 osd.8   up  1.0 1.0
 9   hdd  2.73000 osd.9   up  1.0 1.0
10   hdd  2.73000 osd.10  up  1.0 1.0
11   hdd  2.73000 osd.11  up  1.0 1.0
12   hdd  3.64000 osd.12  up  1.0 1.0
17   hdd  2.73000 osd.17  up  1.0 1.0


root@ceph04:~# ceph status
  cluster:
id: 85873cda-4865-4147-819d-8deda5345db5
health: HEALTH_WARN
18962/11801097 objects misplaced (0.161%)
1/3933699 objects unfound (0.000%)
Reduced data availability: 42 pgs inactive
Degraded data redundancy: 3645135/11801097 objects degraded 
(30.888%), 959 pgs degraded, 960 pgs undersized

110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
mon: 3 daemons, quorum ceph04,ceph05,ceph06
mgr: ceph04(active), standbys: ceph06, ceph05
osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
pools:   1 pools, 1024 pgs
objects: 3.93M objects, 15.0TiB
usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
pgs: 4.102% pgs not active
 3645135/11801097 objects degraded (30.888%)
 18962/11801097 objects misplaced (0.161%)
 1/3933699 objects unfound (0.000%)
 913 active+undersized+degraded+remapped+backfill_wait
 60  active+clean
 41  activating+undersized+degraded+remapped
 4   active+remapped+backfill_wait
 4   active+undersized+degraded+remapped+backfilling
 1   undersized+degraded+remapped+backfilling+peered
 1   active+recovery_wait+undersized+remapped

  io:
recovery: 197MiB/s, 49objects/s


root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects 
unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded 
data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs 
degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. 
Implicated osds 3,10,11

OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs inactive
pg 3.26 is stuck inactive for 19268.231084, curren

[ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-06 Thread Jelle de Jong

Hello everybody,

I have a three-node Ceph cluster made of E3-1220v3 machines, 24GB RAM,
6 HDD OSDs with a 32GB Intel Optane NVMe journal, and 10Gb networking.


I wanted to move to bluestore because filestore support is being
dropped; our cluster was working fine with filestore and we could take
complete nodes out for maintenance without issues.


root@ceph04:~# ceph osd pool get libvirt-pool size
size: 3
root@ceph04:~# ceph osd pool get libvirt-pool min_size
min_size: 2

I removed all OSDs from one node, zapping the OSD and journal devices;
we recreated the OSDs as bluestore and used a small 5GB partition as the
block.db device instead of a journal for all OSDs.


I saw the cluster suffer with inactive pgs and slow requests.

I tried setting the following on all nodes, but it made no difference:
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_op_priority 1'
ceph tell osd.* injectargs '--osd_recovery_sleep 0.3'
systemctl restart ceph-osd.target

How can I migrate to bluestore without inactive pgs or slow requests? I
have several more filestore clusters and I would like to know how to
migrate them without inactive pgs and slow requests.


As a side question: I optimized our cluster for filestore, and the Intel
Optane NVMe journals showed good fio dsync write tests. Does bluestore
also use dsync writes for RocksDB caching, or can we select NVMe devices
on other specifications? My tests with filestore showed that the Optane
NVMe SSD was faster than the Samsung NVMe SSD 970 Pro, and I only need a
few GB for filestore journals, but with bluestore RocksDB caching the
situation is different and I can't find documentation on how to
speed-test NVMe devices for bluestore.


Kind regards,

Jelle

root@ceph04:~# ceph osd tree
ID CLASS WEIGHT   TYPE NAME   STATUS REWEIGHT PRI-AFF
-1   60.04524 root default
-2   20.01263 host ceph04
 0   hdd  2.72899 osd.0   up  1.0 1.0
 1   hdd  2.72899 osd.1   up  1.0 1.0
 2   hdd  5.45799 osd.2   up  1.0 1.0
 3   hdd  2.72899 osd.3   up  1.0 1.0
14   hdd  3.63869 osd.14  up  1.0 1.0
15   hdd  2.72899 osd.15  up  1.0 1.0
-3   20.01263 host ceph05
 4   hdd  5.45799 osd.4   up  1.0 1.0
 5   hdd  2.72899 osd.5   up  1.0 1.0
 6   hdd  2.72899 osd.6   up  1.0 1.0
13   hdd  3.63869 osd.13  up  1.0 1.0
16   hdd  2.72899 osd.16  up  1.0 1.0
18   hdd  2.72899 osd.18  up  1.0 1.0
-4   20.01997 host ceph06
 8   hdd  5.45999 osd.8   up  1.0 1.0
 9   hdd  2.73000 osd.9   up  1.0 1.0
10   hdd  2.73000 osd.10  up  1.0 1.0
11   hdd  2.73000 osd.11  up  1.0 1.0
12   hdd  3.64000 osd.12  up  1.0 1.0
17   hdd  2.73000 osd.17  up  1.0 1.0


root@ceph04:~# ceph status
  cluster:
id: 85873cda-4865-4147-819d-8deda5345db5
health: HEALTH_WARN
18962/11801097 objects misplaced (0.161%)
1/3933699 objects unfound (0.000%)
Reduced data availability: 42 pgs inactive
Degraded data redundancy: 3645135/11801097 objects degraded 
(30.888%), 959 pgs degraded, 960 pgs undersized

110 slow requests are blocked > 32 sec. Implicated osds 3,10,11

  services:
mon: 3 daemons, quorum ceph04,ceph05,ceph06
mgr: ceph04(active), standbys: ceph06, ceph05
osd: 18 osds: 18 up, 18 in; 964 remapped pgs

  data:
pools:   1 pools, 1024 pgs
objects: 3.93M objects, 15.0TiB
usage:   31.2TiB used, 28.8TiB / 60.0TiB avail
pgs: 4.102% pgs not active
 3645135/11801097 objects degraded (30.888%)
 18962/11801097 objects misplaced (0.161%)
 1/3933699 objects unfound (0.000%)
 913 active+undersized+degraded+remapped+backfill_wait
 60  active+clean
 41  activating+undersized+degraded+remapped
 4   active+remapped+backfill_wait
 4   active+undersized+degraded+remapped+backfilling
 1   undersized+degraded+remapped+backfilling+peered
 1   active+recovery_wait+undersized+remapped

  io:
recovery: 197MiB/s, 49objects/s


root@ceph04:~# ceph health detail
HEALTH_WARN 18962/11801097 objects misplaced (0.161%); 1/3933699 objects 
unfound (0.000%); Reduced data availability: 42 pgs inactive; Degraded 
data redundancy: 3643636/11801097 objects degraded (30.875%), 959 pgs 
degraded, 960 pgs undersized; 110 slow requests are blocked > 32 sec. 
Implicated osds 3,10,11

OBJECT_MISPLACED 18962/11801097 objects misplaced (0.161%)
OBJECT_UNFOUND 1/3933699 objects unfound (0.000%)
pg 3.361 has 1 unfound objects
PG_AVAILABILITY Reduced data availability: 42 pgs inactive
pg 3.26 is stuck inactive for 19268.231084, current state 
activating+und

Re: [ceph-users] Scaling out

2019-11-21 Thread Alfredo De Luca
Thanks heaps Nathan. That's what we thought and wanted to implement, but
I wanted to double-check with the community.


Cheers


On Thu, Nov 21, 2019 at 2:42 PM Nathan Fish  wrote:

> The default crush rule uses "host" as the failure domain, so in order
> to deploy on one host you will need to make a crush rule that
> specifies "osd". Then simply adding more hosts with osds will result
> in automatic rebalancing. Once you have enough hosts to satisfy the
> crush rule ( 3 for replicated size = 3) you can change the pool(s)
> back to the default rule.
>
> On Thu, Nov 21, 2019 at 7:46 AM Alfredo De Luca
>  wrote:
> >
> > Hi all.
> > We are doing some tests on how to scale out nodes on Ceph Nautilus.
> > Basically we want to try to install Ceph on one node and scale up to 2+
> nodes. How to do so?
> >
> > Every nodes has 6 disks and maybe  we can use the crushmap to achieve
> this?
> >
> > Any thoughts/ideas/recommendations?
> >
> >
> > Cheers
> >
> >
> > --
> > Alfredo
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
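
(For reference, the rule change Nathan describes might look roughly like
this; the rule name is an example and <pool> is a placeholder:)

ceph osd crush rule create-replicated rep-by-osd default osd
ceph osd pool set <pool> crush_rule rep-by-osd
# later, with three or more hosts available, switch back:
ceph osd pool set <pool> crush_rule replicated_rule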


-- 
*Alfredo*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scaling out

2019-11-21 Thread Alfredo De Luca
Hi all.
We are doing some tests on how to scale out nodes on Ceph Nautilus.
Basically we want to try to install Ceph on one node and scale up to 2+
nodes. How to do so?

Every node has 6 disks; maybe we can use the crushmap to achieve this?

Any thoughts/ideas/recommendations?


Cheers


-- 
*Alfredo*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD

2019-11-07 Thread Eugene de Beste
Hi, does anyone have any feedback for me regarding this?

Here's the log I get when trying to restart the OSD via systemctl: 
https://pastebin.com/tshuqsLP
On Mon, 4 Nov 2019 at 12:42, Eugene de Beste  wrote:
> Hi everyone
>
> I have a cluster that was initially set up with bad defaults in Luminous. 
> After upgrading to Nautilus I've had a few OSDs crash on me, due to errors 
> seemingly related to https://tracker.ceph.com/issues/42223 and 
> https://tracker.ceph.com/issues/22678.
> One of my pools has been running with min_size 1 (yes, I know) and I am now
> stuck with incomplete pgs due to the aforementioned OSD crash.
> When trying to use ceph-objectstore-tool to get the pgs out of the OSD,
> I'm running into the same issue as when trying to start the OSD, namely the
> crashes. ceph-objectstore-tool core dumps and I can't retrieve the pg.
> Does anyone have any input on this? I would like to be able to retrieve that 
> data if possible.
> Here's the log for ceph-objectstore-tool --debug --data-path 
> /var/lib/ceph/osd/ceph-22 --skip-journal-replay --skip-mount-omap --op info 
> --pgid 2.9f -- https://pastebin.com/9aGtAfSv
> Regards and thanks,
> Eugene

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-objectstore-tool crash when trying to recover pg from OSD

2019-11-04 Thread Eugene de Beste
Hi everyone

I have a cluster that was initially set up with bad defaults in Luminous. After 
upgrading to Nautilus I've had a few OSDs crash on me, due to errors seemingly 
related to https://tracker.ceph.com/issues/42223 and 
https://tracker.ceph.com/issues/22678.
One of my pools has been running with min_size 1 (yes, I know) and I am now
stuck with incomplete pgs due to the aforementioned OSD crash.
When trying to use ceph-objectstore-tool to get the pgs out of the OSD, I'm
running into the same issue as when trying to start the OSD, namely the
crashes. ceph-objectstore-tool core dumps and I can't retrieve the pg.
Does anyone have any input on this? I would like to be able to retrieve that 
data if possible.
Here's the log for ceph-objectstore-tool --debug --data-path 
/var/lib/ceph/osd/ceph-22 --skip-journal-replay --skip-mount-omap --op info 
--pgid 2.9f -- https://pastebin.com/9aGtAfSv
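
(For reference, the matching export attempt would be along these lines,
with the same data path as above and an example output file; whether it
hits the same crash is exactly the open question:)

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 \
    --skip-journal-replay --skip-mount-omap \
    --op export --pgid 2.9f --file /root/pg-2.9f.export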
Regards and thanks,
Eugene
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ssd requirements for wal/db

2019-10-04 Thread Stijn De Weirdt
hi all,

maybe to clarify a bit, e.g.
https://indico.cern.ch/event/755842/contributions/3243386/attachments/1784159/2904041/2019-jcollet-openlab.pdf
clearly shows that the db+wal disks are not saturated,
but we are wondering what is really needed/acceptable wrt throughput and
latency (eg is 6gbps sata enough or is 12gbps sas needed); we are
thinking of combining 4 or 5 7.2k rpm disks with one ssd.

similar question with the read-intensive drives: how much is actually
written to the db+wal compared to the data disk? is that 1-to-1?
do people see eg 1 DWPD on their db+wal devices? (i guess it depends;)
if so, what kind of daily averages does that come to in terms of volume?
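
(one rough way to answer the daily-volume question on a running box --
field 7 of /sys/block/<dev>/stat is sectors written since boot; the
device name is only an example:)

awk '{printf "%.1f GiB written since boot\n", $7*512/2^30}' /sys/block/nvme0n1/stat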

thanks for pointing out the capacitor issue, something to definitely
double check for the (cheaper) read intensive ssd.


stijn

On 10/4/19 7:29 PM, Vitaliy Filippov wrote:
> WAL/DB isn't "read intensive". It's more "write intensive" :) use server
> SSDs with capacitors to get adequate write performance.
> 
>> Hi all,
>>
>> We are thinking about putting the wal/db of our hdds on ssds. If we would
>> put the wal&db of 4 HDDs on 1 SSD as recommended, what type of SSD would
>> suffice?
>> We were thinking of using SATA Read Intensive 6Gbps 1DWPD SSDs.
>>
>> Does someone have experience with this configuration? Would we need
>> SAS ssds instead of SATA? And Mixed Use 3DWPD instead of Read Intensive?
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] process stuck in D state on cephfs kernel mount

2019-01-21 Thread Stijn De Weirdt
hi marc,

> - how to prevent the D state process to accumulate so much load?
you can't. in linux, uninterruptible tasks themselves count as "load";
this does not mean you eg ran out of cpu resources.
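
(to see which tasks are actually sitting in D state, something like:)

ps -eo state,pid,wchan:32,cmd | awk '$1=="D"'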

stijn

> 
> Thanks,
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption questions

2019-01-11 Thread Sergio A. de Carvalho Jr.
Thanks for the answers, guys!

Am I right to assume msgr2 (http://docs.ceph.com/docs/mimic/dev/msgr2/)
will provide encryption between Ceph daemons as well as between clients and
daemons?

Does anybody know if it will be available in Nautilus?


On Fri, Jan 11, 2019 at 8:10 AM Tobias Florek  wrote:

> Hi,
>
> as others pointed out, traffic in ceph is unencrypted (internal traffic
> as well as client traffic).  I usually advise to set up IPSec or
> nowadays wireguard connections between all hosts.  That takes care of
> any traffic going over the wire, including ceph.
>
> Cheers,
>  Tobias Florek
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
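
(For reference, a minimal sketch of the kind of host-to-host WireGuard
tunnel mentioned above; keys, addresses and endpoint are placeholders,
not taken from this thread:)

ip link add dev wg0 type wireguard
ip addr add 10.10.10.1/24 dev wg0
wg set wg0 listen-port 51820 private-key /etc/wireguard/private.key \
    peer <peer-public-key> allowed-ips 10.10.10.2/32 \
    endpoint peer.example.com:51820
ip link set wg0 up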
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Encryption questions

2019-01-10 Thread Sergio A. de Carvalho Jr.
Hi everyone, I have some questions about encryption in Ceph.

1) Are RBD connections encrypted or is there an option to use encryption
between clients and Ceph? From reading the documentation, I have the
impression that the only option to guarantee encryption in transit is to
force clients to encrypt volumes via dmcrypt. Is there another option? I
know I could encrypt the OSDs but that's not going to solve the problem of
encryption in transit.

2) I'm also struggling to understand whether communication between Ceph
daemons (monitors and OSDs) is encrypted or not. I came across a few
references to msgr2 but I couldn't tell if it is already implemented.
Can anyone confirm this?

I'm currently starting a new project using Ceph Mimic but if there's
something new in this space expected for Nautilus, it would be good to know
as well.

Regards,

Sergio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost machine with MON and MDS

2018-10-26 Thread Maiko de Andrade
Hi,

I have 3 machines with a Ceph config with CephFS. But I lost one
machine, the only one with the mon and mds. Is it possible to recover
CephFS? If yes, how?

ceph: Ubuntu 16.05.5 (lost this machine)
- mon
- mds
- osd

ceph-osd-1: Ubuntu 16.05.5
- osd

ceph-osd-2: Ubuntu 16.05.5
- osd



[]´s
Maiko de Andrade
MAX Brasil
Desenvolvedor de Sistemas
+55 51 91251756
http://about.me/maiko
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-10-04 Thread Webert de Souza Lima
Hi, bringing this up again to ask one more question:

what would be the best recommended locking strategy for dovecot against
cephfs? This is a balanced setup using independent director instances,
but all dovecot instances on each node share the same storage system
(cephfs).
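
(For reference, the dovecot settings usually involved when several nodes
share one mail filesystem are roughly the following; illustrative values
only, not a recommendation from this thread:)

mmap_disable = yes   # avoid mmap on network/cluster filesystems
mail_fsync = always  # stricter flushing when nodes share storage
lock_method = fcntl  # fcntl locking, which cephfs supports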

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 5:15 PM Webert de Souza Lima 
wrote:

> Thanks Jack.
>
> That's good to know. It is definitely something to consider.
> In a distributed storage scenario we might build a dedicated pool for that
> and tune the pool as more capacity or performance is needed.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
>
> On Wed, May 16, 2018 at 4:45 PM Jack  wrote:
>
>> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
>> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
>> > backend.
>> > We'll have to do some some work on how to simulate user traffic, for
>> writes
>> > and readings. That seems troublesome.
>> I would appreciate seeing these results !
>>
>> > Thanks for the plugins recommendations. I'll take the change and ask you
>> > how is the SIS status? We have used it in the past and we've had some
>> > problems with it.
>>
>> I am using it since Dec 2016 with mdbox, with no issue at all (I am
>> currently using Dovecot 2.2.27-3 from Debian Stretch)
>> The only config I use is mail_attachment_dir, the rest lies as default
>> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
>> ail_attachment_hash = %{sha1})
>> The backend storage is a local filesystem, and there is only one Dovecot
>> instance
>>
>> >
>> > Regards,
>> >
>> > Webert Lima
>> > DevOps Engineer at MAV Tecnologia
>> > *Belo Horizonte - Brasil*
>> > *IRC NICK - WebertRLZ*
>> >
>> >
>> > On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
>> >
>> >> Hi,
>> >>
>> >> Many (most ?) filesystems does not store multiple files on the same
>> block
>> >>
>> >> Thus, with sdbox, every single mail (you know, that kind of mail with
>> 10
>> >> lines in it) will eat an inode, and a block (4k here)
>> >> mdbox is more compact on this way
>> >>
>> >> Another difference: sdbox removes the message, mdbox does not : a
>> single
>> >> metadata update is performed, which may be packed with others if many
>> >> files are deleted at once
>> >>
>> >> That said, I do not have experience with dovecot + cephfs, nor have
>> made
>> >> tests for sdbox vs mdbox
>> >>
>> >> However, and this is a bit out of topic, I recommend you look at the
>> >> following dovecot's features (if not already done), as they are awesome
>> >> and will help you a lot:
>> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
>> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
>> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>> >>
>> >> Regards,
>> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
>> >>> I'm sending this message to both dovecot and ceph-users ML so please
>> >> don't
>> >>> mind if something seems too obvious for you.
>> >>>
>> >>> Hi,
>> >>>
>> >>> I have a question for both dovecot and ceph lists and below I'll
>> explain
>> >>> what's going on.
>> >>>
>> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
>> >> when
>> >>> using sdbox, a new file is stored for each email message.
>> >>> When using mdbox, multiple messages are appended to a single file
>> until
>> >> it
>> >>> reaches/passes the rotate limit.
>> >>>
>> >>> I would like to understand better how the mdbox format impacts on IO
>> >>> performance.
>> >>> I think it's generally expected that fewer larger file translate to
>> less
>> >> IO
>> >>> and more troughput when compared to more small files, but how does
>> >> dovecot
>> >>> handle that with mdbox?
>> >>> If dovecot does flush data to storage upon each and every n

Re: [ceph-users] rados rm objects, still appear in rados ls

2018-09-28 Thread Frank de Bot (lists)
John Spray wrote:
> On Fri, Sep 28, 2018 at 2:25 PM Frank (lists)  wrote:
>>
>> Hi,
>>
>> On my cluster I tried to clear all objects from a pool. I used the
>> command "rados -p bench ls | xargs rados -p bench rm". (rados -p bench
>> cleanup doesn't clean everything, because there was a lot of other
>> testing going on here).
>>
>> Now 'rados -p bench ls' returns a list of objects, which don't exists:
>> [root@ceph01 yum.repos.d]# rados -p bench stat
>> benchmark_data_ceph01.example.com_1805226_object32453
>>   error stat-ing
>> bench/benchmark_data_ceph01.example.com_1805226_object32453: (2) No such
>> file or directory
>>
>> I've tried scrub and deepscrub the pg the object is in, but the problem
>> persists. What causes this?
> 
> Are you perhaps using a cache tier pool?

The pool had 2 snaps. After removing those, the ls command returned no
'non-existing' objects. I expected ls to only return objects of the
current contents; I did not specify -s for working with snaps of the
pool.
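
(For reference, pool snapshots can be listed and removed with the plain
rados tool; the snapshot name below is an example:)

rados -p bench lssnap
rados -p bench rmsnap snap1
rados -p bench ls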

> 
> John
> 
>>
>> I use Centos 7.5 with mimic 13.2.2
>>
>>
>> regards,
>>
>> Frank de Bot
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-09-04 Thread Jones de Andrade
Hi Eugen.

Just tried everything again here, removing the /dev/sda4 partitions and
leaving it so that either salt-run proposal-populate or salt-run
state.orch ceph.stage.configure could try to find the free space on the
partitions to work with: unsuccessful again. :(

Just to make things clear: are you telling me that it is completely
impossible to have a ceph "volume" on non-dedicated devices, sharing
space with, for instance, the node's swap, boot or main partition?

And so the only possible way to have a functioning ceph distributed
filesystem would be to have, in each node, at least one disk dedicated
to the operating system and another, independent disk dedicated to the
ceph filesystem?

That would be an awful drawback for our plans if true, but if there is
no other way, we will just have to give up. Please just answer these two
questions clearly before we capitulate? :(

Anyway, thanks a lot, once again,

Jones

On Mon, Sep 3, 2018 at 5:39 AM Eugen Block  wrote:

> Hi Jones,
>
> I still don't think creating an OSD on a partition will work. The
> reason is that SES creates an additional partition per OSD resulting
> in something like this:
>
> vdb   253:16   05G  0 disk
> ├─vdb1253:17   0  100M  0 part /var/lib/ceph/osd/ceph-1
> └─vdb2253:18   0  4,9G  0 part
>
> Even with external block.db and wal.db on additional devices you would
> still need two partitions for the OSD. I'm afraid with your setup this
> can't work.
>
> Regards,
> Eugen
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-31 Thread Jones de Andrade
:18.787469-03:00 polar kernel: [3.036222] ata2.00:
configured for UDMA/133
2018-08-30T10:21:18.787469-03:00 polar kernel: [3.043916] scsi 1:0:0:0:
CD-ROMPLDS DVD+-RW DU-8A5LH 6D1M PQ: 0 ANSI: 5
2018-08-30T10:21:18.787470-03:00 polar kernel: [3.052087] usb 1-6: new
low-speed USB device number 2 using xhci_hcd
2018-08-30T10:21:18.787471-03:00 polar kernel: [3.063179] scsi 1:0:0:0:
Attached scsi generic sg1 type 5
2018-08-30T10:21:18.787472-03:00 polar kernel: [3.083566]  sda: sda1
sda2 sda3 sda4
2018-08-30T10:21:18.787472-03:00 polar kernel: [3.084238] sd 0:0:0:0:
[sda] Attached SCSI disk
2018-08-30T10:21:18.787473-03:00 polar kernel: [3.113065] sr 1:0:0:0:
[sr0] scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
2018-08-30T10:21:18.787475-03:00 polar kernel: [3.113068] cdrom:
Uniform CD-ROM driver Revision: 3.20
2018-08-30T10:21:18.787476-03:00 polar kernel: [3.113272] sr 1:0:0:0:
Attached scsi CD-ROM sr0
2018-08-30T10:21:18.787477-03:00 polar kernel: [3.213133] usb 1-6: New
USB device found, idVendor=413c, idProduct=2113
###

I'm trying to run deploy again here, but I'm having some connection
issues today (possibly due to the heavy rain) affecting its initial
stages. If it succeeds, I'll send the outputs from /var/log/messages on
the minions right away.

Thanks a lot,

Jones

On Fri, Aug 31, 2018 at 4:00 AM Eugen Block  wrote:

> Hi,
>
> I'm not sure if there's a misunderstanding. You need to track the logs
> during the osd deployment step (stage.3), that is where it fails, and
> this is where /var/log/messages could be useful. Since the deployment
> failed you have no systemd-units (ceph-osd@.service) to log
> anything.
>
> Before running stage.3 again try something like
>
> grep -C5 ceph-disk /var/log/messages (or messages-201808*.xz)
>
> or
>
> grep -C5 sda4 /var/log/messages (or messages-201808*.xz)
>
> If that doesn't reveal anything run stage.3 again and watch the logs.
>
> Regards,
> Eugen
>
>
> Zitat von Jones de Andrade :
>
> > Hi Eugen.
> >
> > Ok, edited the file /etc/salt/minion, uncommented the "log_level_logfile"
> > line and set it to "debug" level.
> >
> > Turned off the computer, waited a few minutes so that the time frame
> would
> > stand out in the /var/log/messages file, and restarted the computer.
> >
> > Using vi I "greped out" (awful wording) the reboot section. From that, I
> > also removed most of what it seemed totally unrelated to ceph, salt,
> > minions, grafana, prometheus, whatever.
> >
> > I got the lines below. It does not seem to complain about anything that I
> > can see. :(
> >
> > 
> > 2018-08-30T15:41:46.455383-03:00 torcello systemd[1]: systemd 234 running
> > in system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT
> +UTMP
> > +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID -ELFUTILS
> > +KMOD -IDN2 -IDN default-hierarchy=hybrid)
> > 2018-08-30T15:41:46.456330-03:00 torcello systemd[1]: Detected
> architecture
> > x86-64.
> > 2018-08-30T15:41:46.456350-03:00 torcello systemd[1]: nss-lookup.target:
> > Dependency Before=nss-lookup.target dropped
> > 2018-08-30T15:41:46.456357-03:00 torcello systemd[1]: Started Load Kernel
> > Modules.
> > 2018-08-30T15:41:46.456369-03:00 torcello systemd[1]: Starting Apply
> Kernel
> > Variables...
> > 2018-08-30T15:41:46.457230-03:00 torcello systemd[1]: Started
> Alertmanager
> > for prometheus.
> > 2018-08-30T15:41:46.457237-03:00 torcello systemd[1]: Started Monitoring
> > system and time series database.
> > 2018-08-30T15:41:46.457403-03:00 torcello systemd[1]: Starting NTP
> > client/server...
> >
> >
> >
> >
> >
> >
> > *2018-08-30T15:41:46.457425-03:00 torcello systemd[1]: Started Prometheus
> > exporter for machine metrics.2018-08-30T15:41:46.457706-03:00 torcello
> > prometheus[695]: level=info ts=2018-08-30T18:41:44.797896888Z
> > caller=main.go:225 msg="Starting Prometheus" version="(version=2.1.0,
> > branch=non-git, revision=non-git)"2018-08-30T15:41:46.457712-03:00
> torcello
> > prometheus[695]: level=info ts=2018-08-30T18:41:44.797969232Z
> > caller=main.go:226 build_context="(go=go1.9.4, user=abuild@lamb69,
> > date=20180513-03:46:03)"2018-08-30T15:41:46.457719-03:00 torcello
> > prometheus[695]: level=info ts=2018-08-30T18:41:44.798008802Z
> > caller=main.go:227 host_details="(Linux 4.12.14-lp150.12.4-default #1 SMP
> > Tue May 22 05:17:22 UTC 2018 (66b2eda) x86_64 torcello
> > (none))"2018-08-3

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-30 Thread Jones de Andrade
511493-03:00 torcello systemd[2295]: Reached target
Timers.
2018-08-30T15:44:15.511664-03:00 torcello systemd[2295]: Reached target
Paths.
2018-08-30T15:44:15.517873-03:00 torcello systemd[2295]: Listening on D-Bus
User Message Bus Socket.
2018-08-30T15:44:15.518060-03:00 torcello systemd[2295]: Reached target
Sockets.
2018-08-30T15:44:15.518216-03:00 torcello systemd[2295]: Reached target
Basic System.
2018-08-30T15:44:15.518373-03:00 torcello systemd[2295]: Reached target
Default.
2018-08-30T15:44:15.518501-03:00 torcello systemd[2295]: Startup finished
in 31ms.
2018-08-30T15:44:15.518634-03:00 torcello systemd[1]: Started User Manager
for UID 1000.
2018-08-30T15:44:15.518759-03:00 torcello systemd[1792]: Received
SIGRTMIN+24 from PID 2300 (kill).
2018-08-30T15:44:15.537634-03:00 torcello systemd[1]: Stopped User Manager
for UID 464.
2018-08-30T15:44:15.538422-03:00 torcello systemd[1]: Removed slice User
Slice of sddm.
2018-08-30T15:44:15.613246-03:00 torcello systemd[2295]: Started D-Bus User
Message Bus.
2018-08-30T15:44:15.623989-03:00 torcello dbus-daemon[2311]: [session
uid=1000 pid=2311] Successfully activated service 'org.freedesktop.systemd1'
2018-08-30T15:44:16.447162-03:00 torcello kapplymousetheme[2350]:
kcm_input: Using X11 backend
2018-08-30T15:44:16.901642-03:00 torcello node_exporter[807]:
time="2018-08-30T15:44:16-03:00" level=error msg="ERROR: ntp collector
failed after 0.000205s: couldn't get SNTP reply: read udp 127.0.0.1:53434->
127.0.0.1:123: read: connection refused" source="collector.go:123"


Any ideas?

Thanks a lot,

Jones

On Thu, Aug 30, 2018 at 4:14 AM Eugen Block  wrote:

> Hi,
>
> > So, it only contains logs concerning the node itself (is it correct?
> sincer
> > node01 is also the master, I was expecting it to have logs from the other
> > too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have
> > available, and nothing "shines out" (sorry for my poor english) as a
> > possible error.
>
> the logging is not configured to be centralised per default, you would
> have to configure that yourself.
>
> Regarding the OSDs, if there are OSD logs created, they're created on
> the OSD nodes, not on the master. But since the OSD deployment fails,
> there probably are no OSD specific logs yet. So you'll have to take a
> look into the syslog (/var/log/messages), that's where the salt-minion
> reports its attempts to create the OSDs. Chances are high that you'll
> find the root cause in here.
>
> If the output is not enough, set the log-level to debug:
>
> osd-1:~ # grep -E "^log_level" /etc/salt/minion
> log_level: debug
>
>
> Regards,
> Eugen
>
>
> Zitat von Jones de Andrade :
>
> > Hi Eugen.
> >
> > Sorry for the delay in answering.
> >
> > Just looked in the /var/log/ceph/ directory. It only contains the
> following
> > files (for example on node01):
> >
> > ###
> > # ls -lart
> > total 3864
> > -rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz
> > drwxr-xr-x 1 root root 898 ago 28 10:07 ..
> > -rw-r--r-- 1 ceph ceph  189464 ago 28 23:59
> ceph-mon.node01.log-20180829.xz
> > -rw--- 1 ceph ceph   24360 ago 28 23:59 ceph.log-20180829.xz
> > -rw-r--r-- 1 ceph ceph   48584 ago 29 00:00
> ceph-mgr.node01.log-20180829.xz
> > -rw--- 1 ceph ceph   0 ago 29 00:00 ceph.audit.log
> > drwxrws--T 1 ceph ceph 352 ago 29 00:00 .
> > -rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log
> > -rw--- 1 ceph ceph  175229 ago 29 12:48 ceph.log
> > -rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log
> > ###
> >
> > So, it only contains logs concerning the node itself (is it correct?
> sincer
> > node01 is also the master, I was expecting it to have logs from the other
> > too) and, moreover, no ceph-osd* files. Also, I'm looking the logs I have
> > available, and nothing "shines out" (sorry for my poor english) as a
> > possible error.
> >
> > Any suggestion on how to proceed?
> >
> > Thanks a lot in advance,
> >
> > Jones
> >
> >
> > On Mon, Aug 27, 2018 at 5:29 AM Eugen Block  wrote:
> >
> >> Hi Jones,
> >>
> >> all ceph logs are in the directory /var/log/ceph/, each daemon has its
> >> own log file, e.g. OSD logs are named ceph-osd.*.
> >>
> >> I haven't tried it but I don't think SUSE Enterprise Storage deploys
> >> OSDs on partitioned disks. Is there a way to attach a second disk to
> >> the OSD nodes, maybe via USB or something?
> >>
> >> Although th

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-29 Thread Jones de Andrade
Hi Eugen.

Sorry for the delay in answering.

Just looked in the /var/log/ceph/ directory. It only contains the following
files (for example on node01):

###
# ls -lart
total 3864
-rw--- 1 ceph ceph 904 ago 24 13:11 ceph.audit.log-20180829.xz
drwxr-xr-x 1 root root 898 ago 28 10:07 ..
-rw-r--r-- 1 ceph ceph  189464 ago 28 23:59 ceph-mon.node01.log-20180829.xz
-rw--- 1 ceph ceph   24360 ago 28 23:59 ceph.log-20180829.xz
-rw-r--r-- 1 ceph ceph   48584 ago 29 00:00 ceph-mgr.node01.log-20180829.xz
-rw--- 1 ceph ceph   0 ago 29 00:00 ceph.audit.log
drwxrws--T 1 ceph ceph 352 ago 29 00:00 .
-rw-r--r-- 1 ceph ceph 1908122 ago 29 12:46 ceph-mon.node01.log
-rw--- 1 ceph ceph  175229 ago 29 12:48 ceph.log
-rw-r--r-- 1 ceph ceph 1599920 ago 29 12:49 ceph-mgr.node01.log
###

So, it only contains logs concerning the node itself (is that correct?
since node01 is also the master, I was expecting it to have logs from
the others too) and, moreover, no ceph-osd* files. Also, I'm looking at
the logs I have available, and nothing "shines out" (sorry for my poor
English) as a possible error.

Any suggestion on how to proceed?

Thanks a lot in advance,

Jones


On Mon, Aug 27, 2018 at 5:29 AM Eugen Block  wrote:

> Hi Jones,
>
> all ceph logs are in the directory /var/log/ceph/, each daemon has its
> own log file, e.g. OSD logs are named ceph-osd.*.
>
> I haven't tried it but I don't think SUSE Enterprise Storage deploys
> OSDs on partitioned disks. Is there a way to attach a second disk to
> the OSD nodes, maybe via USB or something?
>
> Although this thread is ceph related it is referring to a specific
> product, so I would recommend to post your question in the SUSE forum
> [1].
>
> Regards,
> Eugen
>
> [1] https://forums.suse.com/forumdisplay.php?99-SUSE-Enterprise-Storage
>
> Zitat von Jones de Andrade :
>
> > Hi Eugen.
> >
> > Thanks for the suggestion. I'll look for the logs (since it's our first
> > attempt with ceph, I'll have to discover where they are, but no problem).
> >
> > One thing called my attention on your response however:
> >
> > I haven't made myself clear, but one of the failures we encountered were
> > that the files now containing:
> >
> > node02:
> >--
> >storage:
> >--
> >osds:
> >--
> >/dev/sda4:
> >--
> >format:
> >bluestore
> >standalone:
> >True
> >
> > Were originally empty, and we filled them by hand following a model found
> > elsewhere on the web. It was necessary, so that we could continue, but
> the
> > model indicated that, for example, it should have the path for /dev/sda
> > here, not /dev/sda4. We chosen to include the specific partition
> > identification because we won't have dedicated disks here, rather just
> the
> > very same partition as all disks were partitioned exactly the same.
> >
> > While that was enough for the procedure to continue at that point, now I
> > wonder if it was the right call and, if it indeed was, if it was done
> > properly.  As such, I wonder: what you mean by "wipe" the partition here?
> > /dev/sda4 is created, but is both empty and unmounted: Should a different
> > operation be performed on it, should I remove it first, should I have
> > written the files above with only /dev/sda as target?
> >
> > I know that probably I wouldn't run in this issues with dedicated discks,
> > but unfortunately that is absolutely not an option.
> >
> > Thanks a lot in advance for any comments and/or extra suggestions.
> >
> > Sincerely yours,
> >
> > Jones
> >
> > On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:
> >
> >> Hi,
> >>
> >> take a look into the logs, they should point you in the right direction.
> >> Since the deployment stage fails at the OSD level, start with the OSD
> >> logs. Something's not right with the disks/partitions, did you wipe
> >> the partition from previous attempts?
> >>
> >> Regards,
> >> Eugen
> >>
> >> Zitat von Jones de Andrade :
> >>
> >>> (Please forgive my previous email: I was using another message and
> >>> completely forget to update the subject)
> >>>
> >>> Hi all.
> >>>
> >>> I'm new to ceph, and after having serious problems in ceph stages 0, 1
> >> and
> >>> 2 that

Re: [ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-26 Thread Jones de Andrade
Hi Eugen.

Thanks for the suggestion. I'll look for the logs (since it's our first
attempt with ceph, I'll have to discover where they are, but no problem).

One thing called my attention on your response however:

I haven't made myself clear, but one of the failures we encountered was
that the files now containing:

node02:
--
storage:
--
osds:
--
/dev/sda4:
--
format:
bluestore
standalone:
True

Were originally empty, and we filled them by hand following a model
found elsewhere on the web. It was necessary so that we could continue,
but the model indicated that, for example, it should have the path for
/dev/sda here, not /dev/sda4. We chose to include the specific partition
identification because we won't have dedicated disks here, rather just
the very same partition, as all disks were partitioned exactly the same.

While that was enough for the procedure to continue at that point, now I
wonder if it was the right call and, if it indeed was, whether it was
done properly. As such, I wonder: what do you mean by "wipe" the
partition here? /dev/sda4 is created, but is both empty and unmounted.
Should a different operation be performed on it, should I remove it
first, or should I have written the files above with only /dev/sda as
the target?
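
(For reference, "wiping" a partition in this context usually just means
clearing leftover partition/filesystem signatures, e.g. something like
the commands below -- destructive, and the device name is only an
example:)

wipefs -a /dev/sda4
dd if=/dev/zero of=/dev/sda4 bs=1M count=100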

I know that I probably wouldn't run into these issues with dedicated
disks, but unfortunately that is absolutely not an option.

Thanks a lot in advance for any comments and/or extra suggestions.

Sincerely yours,

Jones

On Sat, Aug 25, 2018 at 5:46 PM Eugen Block  wrote:

> Hi,
>
> take a look into the logs, they should point you in the right direction.
> Since the deployment stage fails at the OSD level, start with the OSD
> logs. Something's not right with the disks/partitions, did you wipe
> the partition from previous attempts?
>
> Regards,
> Eugen
>
> Zitat von Jones de Andrade :
>
> > (Please forgive my previous email: I was using another message and
> > completely forget to update the subject)
> >
> > Hi all.
> >
> > I'm new to ceph, and after having serious problems in ceph stages 0, 1
> and
> > 2 that I could solve myself, now it seems that I have hit a wall harder
> > than my head. :)
> >
> > When I run salt-run state.orch ceph.stage.deploy, i monitor I see it
> going
> > up to here:
> >
> > ###
> > [14/71]   ceph.sysctl on
> >   node01... ✓ (0.5s)
> >   node02 ✓ (0.7s)
> >   node03... ✓ (0.6s)
> >   node04. ✓ (0.5s)
> >   node05... ✓ (0.6s)
> >   node06.. ✓ (0.5s)
> >
> > [15/71]   ceph.osd on
> >   node01.. ❌ (0.7s)
> >   node02 ❌ (0.7s)
> >   node03... ❌ (0.7s)
> >   node04. ❌ (0.6s)
> >   node05... ❌ (0.6s)
> >   node06.. ❌ (0.7s)
> >
> > Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s
> >
> > Failures summary:
> >
> > ceph.osd (/srv/salt/ceph/osd):
> >   node02:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node02 for cephdisks.list
> >   node03:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node03 for cephdisks.list
> >   node01:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node01 for cephdisks.list
> >   node04:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node04 for cephdisks.list
> >   node05:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node05 for cephdisks.list
> >   node06:
> > deploy OSDs: Module function osd.deploy threw an exception.
> Exception:
> > Mine on node06 for cephdisks.list
> > ###
> >
> > Since this is a first attempt in 6 simple test machines, we are going to
> > put the mon, osds, etc, in all nodes at first. Only the master is left
> in a
> > single machine (node01) by now.
> >
> > As they are simple machin

[ceph-users] Ceph-Deploy error on 15/71 stage

2018-08-24 Thread Jones de Andrade
(Please forgive my previous email: I was using another message and
completely forget to update the subject)

Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1 and
2 that I could solve myself, now it seems that I have hit a wall harder
than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see
it going up to here:

###
[14/71]   ceph.sysctl on
  node01... ✓ (0.5s)
  node02 ✓ (0.7s)
  node03... ✓ (0.6s)
  node04. ✓ (0.5s)
  node05... ✓ (0.6s)
  node06.. ✓ (0.5s)

[15/71]   ceph.osd on
  node01.. ❌ (0.7s)
  node02 ❌ (0.7s)
  node03... ❌ (0.7s)
  node04. ❌ (0.6s)
  node05... ❌ (0.6s)
  node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node02 for cephdisks.list
  node03:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node03 for cephdisks.list
  node01:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node01 for cephdisks.list
  node04:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node04 for cephdisks.list
  node05:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node05 for cephdisks.list
  node06:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to
put the mon, osds, etc. on all nodes at first. Only the master is kept
on a single machine (node01) for now.

As they are simple machines, they have a single hdd, which is
partitioned as follows (the sda4 partition is unmounted and left for the
ceph system):

###
# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda  8:00 465,8G  0 disk
├─sda1   8:10   500M  0 part /boot/efi
├─sda2   8:2016G  0 part [SWAP]
├─sda3   8:30  49,3G  0 part /
└─sda4   8:40   400G  0 part
sr0 11:01   3,7G  0 rom

# salt -I 'roles:storage' cephdisks.list
node01:
node02:
node03:
node04:
node05:
node06:

# salt -I 'roles:storage' pillar.get ceph
node02:
--
storage:
--
osds:
--
/dev/sda4:
--
format:
bluestore
standalone:
True
(and so on for all 6 machines)
##

Finally and just in case, my policy.cfg file reads:

#
#cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/*.sls
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*yml
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
role-master/cluster/node01.sls
role-admin/cluster/*.sls
role-mon/cluster/*.sls
role-mgr/cluster/*.sls
role-mds/cluster/*.sls
role-ganesha/cluster/*.sls
role-client-nfs/cluster/*.sls
role-client-cephfs/cluster/*.sls
##

Please, could someone help me and shed some light on this issue?

Thanks a lot in advance,

Regards,

Jones
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic prometheus plugin -no socket could be created

2018-08-24 Thread Jones de Andrade
Hi all.

I'm new to ceph, and after having serious problems in ceph stages 0, 1 and
2 that I could solve myself, now it seems that I have hit a wall harder
than my head. :)

When I run salt-run state.orch ceph.stage.deploy and monitor it, I see it going
up to here:

###
[14/71]   ceph.sysctl on
  node01... ✓ (0.5s)
  node02 ✓ (0.7s)
  node03... ✓ (0.6s)
  node04. ✓ (0.5s)
  node05... ✓ (0.6s)
  node06.. ✓ (0.5s)

[15/71]   ceph.osd on
  node01.. ❌ (0.7s)
  node02 ❌ (0.7s)
  node03... ❌ (0.7s)
  node04. ❌ (0.6s)
  node05... ❌ (0.6s)
  node06.. ❌ (0.7s)

Ended stage: ceph.stage.deploy succeeded=14/71 failed=1/71 time=624.7s

Failures summary:

ceph.osd (/srv/salt/ceph/osd):
  node02:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node02 for cephdisks.list
  node03:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node03 for cephdisks.list
  node01:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node01 for cephdisks.list
  node04:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node04 for cephdisks.list
  node05:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node05 for cephdisks.list
  node06:
deploy OSDs: Module function osd.deploy threw an exception. Exception:
Mine on node06 for cephdisks.list
###

Since this is a first attempt on 6 simple test machines, we are going to
put the mon, osds, etc. on all nodes at first. Only the master is left on a
single machine (node01) for now.

As they are simple machines, they have a single hdd, which is partitioned
as follows (the sda4 partition is unmounted and left for the ceph system):

###
# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda  8:00 465,8G  0 disk
├─sda1   8:10   500M  0 part /boot/efi
├─sda2   8:2016G  0 part [SWAP]
├─sda3   8:30  49,3G  0 part /
└─sda4   8:40   400G  0 part
sr0 11:01   3,7G  0 rom

# salt -I 'roles:storage' cephdisks.list
node01:
node02:
node03:
node04:
node05:
node06:

# salt -I 'roles:storage' pillar.get ceph
node02:
--
storage:
--
osds:
--
/dev/sda4:
--
format:
bluestore
standalone:
True
(and so on for all 6 machines)
##

Finally and just in case, my policy.cfg file reads:

#
#cluster-unassigned/cluster/*.sls
cluster-ceph/cluster/*.sls
profile-default/cluster/*.sls
profile-default/stack/default/ceph/minions/*yml
config/stack/default/global.yml
config/stack/default/ceph/cluster.yml
role-master/cluster/node01.sls
role-admin/cluster/*.sls
role-mon/cluster/*.sls
role-mgr/cluster/*.sls
role-mds/cluster/*.sls
role-ganesha/cluster/*.sls
role-client-nfs/cluster/*.sls
role-client-cephfs/cluster/*.sls
##

Please, could someone help me and shed some light on this issue?

Thanks a lot in advance,

Regards,

Jones



On Thu, Aug 23, 2018 at 2:46 PM John Spray  wrote:

> On Thu, Aug 23, 2018 at 5:18 PM Steven Vacaroaia  wrote:
> >
> > Hi All,
> >
> > I am trying to enable prometheus plugin with no success due to "no
> socket could be created"
> >
> > The instructions for enabling the plugin are very straightforward and
> simple
> >
> > Note
> > My ultimate goal is to use Prometheus with Cephmetrics
> > Some of you suggested to deploy ceph-exporter but why do we need to do
> that when there is a plugin already ?
> >
> >
> > How can I troubleshoot this further ?
> >
> > Unhandled exception from module 'prometheus' while running on mgr.mon01:
> error('No socket could be created',)
> > Aug 23 12:03:06 mon01 ceph-mgr: 2018-08-23 12:03:06.615 7fadab50e700 -1
> prometheus.serve:
> > Aug 23 12:03:06 mon01 ceph-mgr: 2018-08-23 12:03:06.615 7fadab50e700 -1
> Traceback (most recent call last):
> > Aug 23 12:03:06 mon01 ceph-mgr: File
> "/usr/lib64/ceph/mgr/prometheus/module.py", line 720, in serve
> > Aug 23 12:03:06 mon01 ceph-mgr: cherrypy.engine.start()
> > Aug 23 12:03:06 mon01 ceph-mgr: File
> "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 250, in
> start
> > Aug 23 12:03:06 mon01 ceph-mgr: raise e_info
> > Aug 23 12:03:06 mon01 ceph-mgr: ChannelFailures: error('No socket could
> be created',)
>
> The things I usually check if a process can't create a socket are:
>  - is th

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
You can only try to remount the cephfs dir. It will probably not work,
giving you I/O errors, so the fallback would be to use a fuse mount.

If I recall correctly you could do a lazy umount on the current dir (umount
-fl /mountdir) and remount it using the FUSE client.
It will work for new sessions, but the currently hanging ones will still be
hanging.

With fuse you'll only be able to mount the cephfs root dir, so if you have
multiple directories, you'll need to:
 - mount the root cephfs dir in another directory
 - mount each subdir (after the root is mounted) to the desired directory via
bind mount (see the sketch below).
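
Roughly, a sketch of the idea (the mount points, subdir name and monitor
address here are made up, adjust them to your setup):

~# umount -fl /mnt/cephfs                            # lazy umount of the hung kernel mount
~# mkdir -p /mnt/cephfs-root
~# ceph-fuse -m mon01:6789 /mnt/cephfs-root          # mount the cephfs root via FUSE
~# mount --bind /mnt/cephfs-root/maildir /mnt/cephfs # bind the subdir back where it was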


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 11:46 AM Zhenshi Zhou  wrote:

> Hi,
> Is there any other way except rebooting the server when the client hangs?
> If the server is in a production environment, I can't restart it every time.
>
> Webert de Souza Lima  于2018年8月8日周三 下午10:33写道:
>
>> Hi Zhenshi,
>>
>> if you still have the client mount hanging but no session is connected,
>> you probably have some PID waiting with blocked IO from cephfs mount.
>> I face that now and then and the only solution is to reboot the server,
>> as you won't be able to kill a process with pending IO.
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou  wrote:
>>
>>> Hi Webert,
>>> That command shows the current sessions, whereas the server from which I got
>>> the files (osdc, mdsc, monc) has been disconnected for a long time.
>>> So I cannot get useful information from the command you provided.
>>>
>>> Thanks
>>>
>>> Webert de Souza Lima  于2018年8月8日周三 下午10:10写道:
>>>
>>>> You could also see open sessions at the MDS server by issuing  `ceph
>>>> daemon mds.XX session ls`
>>>>
>>>> Regards,
>>>>
>>>> Webert Lima
>>>> DevOps Engineer at MAV Tecnologia
>>>> *Belo Horizonte - Brasil*
>>>> *IRC NICK - WebertRLZ*
>>>>
>>>>
>>>> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou 
>>>> wrote:
>>>>
>>>>> Hi, I find an old server which mounted cephfs and has the debug files.
>>>>> # cat osdc
>>>>> REQUESTS 0 homeless 0
>>>>> LINGER REQUESTS
>>>>> BACKOFFS
>>>>> # cat monc
>>>>> have monmap 2 want 3+
>>>>> have osdmap 3507
>>>>> have fsmap.user 0
>>>>> have mdsmap 55 want 56+
>>>>> fs_cluster_id -1
>>>>> # cat mdsc
>>>>> 194 mds0getattr  #1036ae3
>>>>>
>>>>> What does it mean?
>>>>>
>>>>> Zhenshi Zhou  于2018年8月8日周三 下午1:58写道:
>>>>>
>>>>>> I restarted the client server so that there's no file in that
>>>>>> directory. I will take care of it if the client hangs next time.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Yan, Zheng  于2018年8月8日周三 上午11:23写道:
>>>>>>
>>>>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> > I check all my ceph servers and they are not mount cephfs on each
>>>>>>> of them(maybe I umount after testing). As a result, the cluster didn't
>>>>>>> encounter a memory deadlock. Besides, I check the monitoring system and 
>>>>>>> the
>>>>>>> memory and cpu usage were at common level while the clients hung.
>>>>>>> > Back to my question, there must be something else cause the client
>>>>>>> hang.
>>>>>>> >
>>>>>>>
>>>>>>> Check if there are hang requests in
>>>>>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>>>>>
>>>>>>> > Zhenshi Zhou  于2018年8月8日周三 上午4:16写道:
>>>>>>> >>
>>>>>>> >> Hi, I'm not sure if it just mounts the cephfs without using or
>>>>>>> doing any operation within the mounted directory would be affected by
>>>>>>> flushing cache. I mounted cephfs on osd servers only for testing and 
>>>>>>> then
>>>>>>&

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
Hi Zhenshi,

if you still have the client mount hanging but no session is connected, you
probably have some PID waiting with blocked IO from cephfs mount.
I face that now and then and the only solution is to reboot the server, as
you won't be able to kill a process with pending IO.
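
A quick way to spot those stuck PIDs, by the way (processes blocked on IO sit
in uninterruptible sleep, state D):

~# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'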

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 11:17 AM Zhenshi Zhou  wrote:

> Hi Webert,
> That command shows the current sessions, whereas the server from which I got
> the files (osdc, mdsc, monc) has been disconnected for a long time.
> So I cannot get useful information from the command you provided.
>
> Thanks
>
> Webert de Souza Lima  于2018年8月8日周三 下午10:10写道:
>
>> You could also see open sessions at the MDS server by issuing  `ceph
>> daemon mds.XX session ls`
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou  wrote:
>>
>>> Hi, I find an old server which mounted cephfs and has the debug files.
>>> # cat osdc
>>> REQUESTS 0 homeless 0
>>> LINGER REQUESTS
>>> BACKOFFS
>>> # cat monc
>>> have monmap 2 want 3+
>>> have osdmap 3507
>>> have fsmap.user 0
>>> have mdsmap 55 want 56+
>>> fs_cluster_id -1
>>> # cat mdsc
>>> 194 mds0getattr  #1036ae3
>>>
>>> What does it mean?
>>>
>>> Zhenshi Zhou  于2018年8月8日周三 下午1:58写道:
>>>
>>>> I restarted the client server so that there's no file in that
>>>> directory. I will take care of it if the client hangs next time.
>>>>
>>>> Thanks
>>>>
>>>> Yan, Zheng  于2018年8月8日周三 上午11:23写道:
>>>>
>>>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>>>> wrote:
>>>>> >
>>>>> > Hi,
>>>>> > I check all my ceph servers and they are not mount cephfs on each of
>>>>> them(maybe I umount after testing). As a result, the cluster didn't
>>>>> encounter a memory deadlock. Besides, I check the monitoring system and 
>>>>> the
>>>>> memory and cpu usage were at common level while the clients hung.
>>>>> > Back to my question, there must be something else cause the client
>>>>> hang.
>>>>> >
>>>>>
>>>>> Check if there are hang requests in
>>>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>>>
>>>>> > Zhenshi Zhou  于2018年8月8日周三 上午4:16写道:
>>>>> >>
>>>>> >> Hi, I'm not sure if it just mounts the cephfs without using or
>>>>> doing any operation within the mounted directory would be affected by
>>>>> flushing cache. I mounted cephfs on osd servers only for testing and then
>>>>> left it there. Anyway I will umount it.
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> John Spray 于2018年8月8日 周三03:37写道:
>>>>> >>>
>>>>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier 
>>>>> wrote:
>>>>> >>> >
>>>>> >>> > This is the first I am hearing about this as well.
>>>>> >>>
>>>>> >>> This is not a Ceph-specific thing -- it can also affect similar
>>>>> >>> systems like Lustre.
>>>>> >>>
>>>>> >>> The classic case is when under some memory pressure, the kernel
>>>>> tries
>>>>> >>> to free memory by flushing the client's page cache, but doing the
>>>>> >>> flush means allocating more memory on the server, making the memory
>>>>> >>> pressure worse, until the whole thing just seizes up.
>>>>> >>>
>>>>> >>> John
>>>>> >>>
>>>>> >>> > Granted, I am using ceph-fuse rather than the kernel client at
>>>>> this point, but that isn’t etched in stone.
>>>>> >>> >
>>>>> >>> > Curious if there is more to share.
>>>>> >>> >
>>>>> >>> > Reed
>>>>> >>> >
>>>>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima <
>>>>> webert.b...@gmail.c

Re: [ceph-users] cephfs kernel client hangs

2018-08-08 Thread Webert de Souza Lima
You could also see open sessions at the MDS server by issuing  `ceph daemon
mds.XX session ls`
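
For example (mds.a is just a placeholder for your daemon name; the output is
JSON, so you can eyeball the client address and the number of caps held):

~# ceph daemon mds.a session ls | grep -E '"id"|"inst"|"num_caps"'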

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 5:08 AM Zhenshi Zhou  wrote:

> Hi, I find an old server which mounted cephfs and has the debug files.
> # cat osdc
> REQUESTS 0 homeless 0
> LINGER REQUESTS
> BACKOFFS
> # cat monc
> have monmap 2 want 3+
> have osdmap 3507
> have fsmap.user 0
> have mdsmap 55 want 56+
> fs_cluster_id -1
> # cat mdsc
> 194 mds0getattr  #1036ae3
>
> What does it mean?
>
> Zhenshi Zhou  于2018年8月8日周三 下午1:58写道:
>
>> I restarted the client server so that there's no file in that directory.
>> I will take care of it if the client hangs next time.
>>
>> Thanks
>>
>> Yan, Zheng  于2018年8月8日周三 上午11:23写道:
>>
>>> On Wed, Aug 8, 2018 at 11:02 AM Zhenshi Zhou 
>>> wrote:
>>> >
>>> > Hi,
>>> > I check all my ceph servers and they are not mount cephfs on each of
>>> them(maybe I umount after testing). As a result, the cluster didn't
>>> encounter a memory deadlock. Besides, I check the monitoring system and the
>>> memory and cpu usage were at common level while the clients hung.
>>> > Back to my question, there must be something else cause the client
>>> hang.
>>> >
>>>
>>> Check if there are hang requests in
>>> /sys/kernel/debug/ceph//{osdc,mdsc},
>>>
>>> > Zhenshi Zhou  于2018年8月8日周三 上午4:16写道:
>>> >>
>>> >> Hi, I'm not sure if it just mounts the cephfs without using or doing
>>> any operation within the mounted directory would be affected by flushing
>>> cache. I mounted cephfs on osd servers only for testing and then left it
>>> there. Anyway I will umount it.
>>> >>
>>> >> Thanks
>>> >>
>>> >> John Spray 于2018年8月8日 周三03:37写道:
>>> >>>
>>> >>> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier 
>>> wrote:
>>> >>> >
>>> >>> > This is the first I am hearing about this as well.
>>> >>>
>>> >>> This is not a Ceph-specific thing -- it can also affect similar
>>> >>> systems like Lustre.
>>> >>>
>>> >>> The classic case is when under some memory pressure, the kernel tries
>>> >>> to free memory by flushing the client's page cache, but doing the
>>> >>> flush means allocating more memory on the server, making the memory
>>> >>> pressure worse, until the whole thing just seizes up.
>>> >>>
>>> >>> John
>>> >>>
>>> >>> > Granted, I am using ceph-fuse rather than the kernel client at
>>> this point, but that isn’t etched in stone.
>>> >>> >
>>> >>> > Curious if there is more to share.
>>> >>> >
>>> >>> > Reed
>>> >>> >
>>> >>> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>> >>> >
>>> >>> >
>>> >>> > Yan, Zheng  于2018年8月7日周二 下午7:51写道:
>>> >>> >>
>>> >>> >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou 
>>> wrote:
>>> >>> >> this can cause memory deadlock. you should avoid doing this
>>> >>> >>
>>> >>> >> > Yan, Zheng 于2018年8月7日 周二19:12写道:
>>> >>> >> >>
>>> >>> >> >> did you mount cephfs on the same machines that run ceph-osd?
>>> >>> >> >>
>>> >>> >
>>> >>> >
>>> >>> > I didn't know about this. I run this setup in production. :P
>>> >>> >
>>> >>> > Regards,
>>> >>> >
>>> >>> > Webert Lima
>>> >>> > DevOps Engineer at MAV Tecnologia
>>> >>> > Belo Horizonte - Brasil
>>> >>> > IRC NICK - WebertRLZ
>>> >>> >
>>> >>> > ___
>>> >>> > ceph-users mailing list
>>> >>> > ceph-users@lists.ceph.com
>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>> >
>>> >>> >
>>> >>> > ___
>>> >>> > ceph-users mailing list
>>> >>> > ceph-users@lists.ceph.com
>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >>> ___
>>> >>> ceph-users mailing list
>>> >>> ceph-users@lists.ceph.com
>>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Whole cluster flapping

2018-08-08 Thread Webert de Souza Lima
So your OSDs are really too busy to respond to heartbeats.
You'll be facing this for some time until the cluster load gets lower.

I would set `ceph osd set nodeep-scrub` until the heavy disk IO stops.
Maybe you can schedule it, enabling deep-scrub during the night and disabling
it in the morning.
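
Something like this in root's crontab would do it (a sketch only; the times
are arbitrary):

# allow deep-scrub only at night
0 22 * * * ceph osd unset nodeep-scrub
0 7 * * *  ceph osd set nodeep-scrub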

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Aug 8, 2018 at 9:18 AM CUZA Frédéric  wrote:

> Thanks for the command line. I did take a look at it but I don’t really
> know what to search for, my bad….
>
> All this flapping is due to deep-scrub: when it starts on an OSD, things
> start to go bad.
>
>
>
> I set out all the OSDs that were flapping the most (1 by 1 after
> rebalancing) and it looks better even if some osds keep going down/up with
> the same message in logs :
>
>
>
> 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdabd897700' had
> timed out after 90
>
>
>
> (I updated it to 90s instead of 15s)
>
>
>
> Regards,
>
>
>
>
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 07 August 2018 16:28
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> oops, my bad, you're right.
>
>
>
> I don't know how much you can see, but maybe you can dig around the performance
> counters and see what's happening on those OSDs. Try these:
>
>
>
> ~# ceph daemonperf osd.XX
>
> ~# ceph daemon osd.XX perf dump
>
>
>
> change XX to your OSD numbers.
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric 
> wrote:
>
> Pool is already deleted and no longer present in stats.
>
>
>
> Regards,
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 07 August 2018 15:08
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> Frédéric,
>
>
>
> see if the number of objects is decreasing in the pool with `ceph df
> [detail]`
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:
>
> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 31 July 2018 16:25
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.

Re: [ceph-users] cephfs kernel client hangs

2018-08-07 Thread Webert de Souza Lima
That's good to know, thanks for the explanation.
Fortunately we are in the process of cluster redesign and we can definitely
fix that scenario.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 4:37 PM John Spray  wrote:

> On Tue, Aug 7, 2018 at 5:42 PM Reed Dier  wrote:
> >
> > This is the first I am hearing about this as well.
>
> This is not a Ceph-specific thing -- it can also affect similar
> systems like Lustre.
>
> The classic case is when under some memory pressure, the kernel tries
> to free memory by flushing the client's page cache, but doing the
> flush means allocating more memory on the server, making the memory
> pressure worse, until the whole thing just seizes up.
>
> John
>
> > Granted, I am using ceph-fuse rather than the kernel client at this
> point, but that isn’t etched in stone.
> >
> > Curious if there is more to share.
> >
> > Reed
> >
> > On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima 
> wrote:
> >
> >
> > Yan, Zheng  于2018年8月7日周二 下午7:51写道:
> >>
> >> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou 
> wrote:
> >> this can cause memory deadlock. you should avoid doing this
> >>
> >> > Yan, Zheng 于2018年8月7日 周二19:12写道:
> >> >>
> >> >> did you mount cephfs on the same machines that run ceph-osd?
> >> >>
> >
> >
> > I didn't know about this. I run this setup in production. :P
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client hangs

2018-08-07 Thread Webert de Souza Lima
Yan, Zheng  于2018年8月7日周二 下午7:51写道:

> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou  wrote:
> this can cause memory deadlock. you should avoid doing this
>
> > Yan, Zheng 于2018年8月7日 周二19:12写道:
> >>
> >> did you mount cephfs on the same machines that run ceph-osd?
> >>


I didn't know about this. I run this setup in production. :P

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread Webert de Souza Lima
oops, my bad, you're right.

I don't know how much you can see, but maybe you can dig around the performance
counters and see what's happening on those OSDs. Try these:

~# ceph daemonperf osd.XX
~# ceph daemon osd.XX perf dump

change XX to your OSD numbers.
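
The perf dump output is JSON and quite long; to pick out just the op latency
counters you can do something like (osd.12 is only an example):

~# ceph daemon osd.12 perf dump | grep -A 3 '"op_latency"'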

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 10:47 AM CUZA Frédéric  wrote:

> Pool is already deleted and no longer present in stats.
>
>
>
> Regards,
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 07 August 2018 15:08
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> Frédéric,
>
>
>
> see if the number of objects is decreasing in the pool with `ceph df
> [detail]`
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:
>
> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 31 July 2018 16:25
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] 

Re: [ceph-users] Whole cluster flapping

2018-08-07 Thread Webert de Souza Lima
Frédéric,

see if the number of objects is decreasing in the pool with `ceph df
[detail]`
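
For instance, to watch it over time (the pool name here is hypothetical):

~# watch -n 60 'ceph df detail | grep my_big_pool'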

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Aug 7, 2018 at 5:46 AM CUZA Frédéric  wrote:

> It’s been over a week now and the whole cluster keeps flapping, it is
> never the same OSDs that go down.
>
> Is there a way to get the progress of this recovery? (The pool that I
> deleted is no longer present (for a while now))
>
> In fact, there is a lot of i/o activity on the server where osds go down.
>
>
>
> Regards,
>
>
>
> *De :* ceph-users  *De la part de*
> Webert de Souza Lima
> *Envoyé :* 31 July 2018 16:25
> *À :* ceph-users 
> *Objet :* Re: [ceph-users] Whole cluster flapping
>
>
>
> The pool deletion might have triggered a lot of IO operations on the disks
> and the process might be too busy to respond to heartbeats, so the mons mark
> them as down due to no response.
>
> Check also the OSD logs to see if they are actually crashing and
> restarting, and disk IO usage (i.e. iostat).
>
>
>
> Regards,
>
>
>
> Webert Lima
>
> DevOps Engineer at MAV Tecnologia
>
> *Belo Horizonte - Brasil*
>
> *IRC NICK - WebertRLZ*
>
>
>
>
>
> On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric 
> wrote:
>
> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects mispla

Re: [ceph-users] Whole cluster flapping

2018-07-31 Thread Webert de Souza Lima
The pool deletion might have triggered a lot of IO operations on the disks
and the process might be too busy to respond to heartbeats, so the mons mark
them as down due to no response.
Check also the OSD logs to see if they are actually crashing and
restarting, and disk IO usage (i.e. iostat).
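
For example (the OSD id and the default log path are assumptions, adjust to
your layout):

~# iostat -x 5                                         # look for %util near 100 and high await
~# grep -iE 'heartbeat|suicide|abort' /var/log/ceph/ceph-osd.97.log | tail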

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Jul 31, 2018 at 7:23 AM CUZA Frédéric  wrote:

> Hi Everyone,
>
>
>
> I just upgrade our cluster to Luminous 12.2.7 and I delete a quite large
> pool that we had (120 TB).
>
> Our cluster is made of 14 Nodes with each composed of 12 OSDs (1 HDD -> 1
> OSD), we have SDD for journal.
>
>
>
> After I deleted the large pool my cluster started to flapping on all OSDs.
>
> Osds are marked down and then marked up as follow :
>
>
>
> 2018-07-31 10:42:51.504319 mon.ceph_monitor01 [INF] osd.97
> 172.29.228.72:6800/95783 boot
>
> 2018-07-31 10:42:55.330993 mon.ceph_monitor01 [WRN] Health check update:
> 5798/5845200 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:42:55.331065 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 221365/5845200 objects degraded (3.787%), 98 pgs
> degraded, 317 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:42:55.331093 mon.ceph_monitor01 [WRN] Health check update:
> 81 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:42:55.548385 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 13 pgs inactive, 4 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:42:55.610556 mon.ceph_monitor01 [INF] osd.96
> 172.29.228.72:6803/95830 boot
>
> 2018-07-31 10:43:00.331787 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:00.331930 mon.ceph_monitor01 [WRN] Health check update:
> 5782/5845401 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:00.331950 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 167757/5845401 objects degraded (2.870%), 77 pgs
> degraded, 223 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:00.331966 mon.ceph_monitor01 [WRN] Health check update:
> 76 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:01.729891 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 7 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:01.753867 mon.ceph_monitor01 [INF] osd.4
> 172.29.228.246:6812/3144542 boot
>
> 2018-07-31 10:43:05.332624 mon.ceph_monitor01 [WRN] Health check update: 4
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:05.332691 mon.ceph_monitor01 [WRN] Health check update:
> 5767/5845569 objects misplaced (0.099%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:05.332718 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 130565/5845569 objects degraded (2.234%), 67 pgs
> degraded, 220 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:05.332736 mon.ceph_monitor01 [WRN] Health check update:
> 83 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:07.004993 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 5 pgs inactive, 5 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:10.333548 mon.ceph_monitor01 [WRN] Health check update:
> 5752/5845758 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:10.333593 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107805/5845758 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:10.333608 mon.ceph_monitor01 [WRN] Health check update:
> 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334451 mon.ceph_monitor01 [WRN] Health check update:
> 5738/5845923 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:15.334494 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 107807/5845923 objects degraded (1.844%), 59 pgs
> degraded, 197 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:15.334510 mon.ceph_monitor01 [WRN] Health check update:
> 98 slow requests are blocked > 32 sec (REQUEST_SLOW)
>
> 2018-07-31 10:43:15.334865 mon.ceph_monitor01 [INF] osd.18 failed
> (root=default,room=,host=) (8 reporters from different host after
> 54.650576 >= grace 54.300663)
>
> 2018-07-31 10:43:15.336552 mon.ceph_monitor01 [WRN] Health check update: 5
> osds down (OSD_DOWN)
>
> 2018-07-31 10:43:17.357747 mon.ceph_monitor01 [WRN] Health check update:
> Reduced data availability: 6 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
>
> 2018-07-31 10:43:20.339495 mon.ceph_monitor01 [WRN] Health check update:
> 5724/5846073 objects misplaced (0.098%) (OBJECT_MISPLACED)
>
> 2018-07-31 10:43:20.339543 mon.ceph_monitor01 [WRN] Health check update:
> Degraded data redundancy: 122901/5846073 objects degraded (2.102%), 65 pgs
> degraded, 201 pgs undersized (PG_DEGRADED)
>
> 2018-07-31 10:43:20.339559 mon.ceph_monitor01 [WRN] Health check update:
> 78 slow reques

Re: [ceph-users] MDS damaged

2018-07-13 Thread Alessandro De Salvo

Hi Dan,

you're right, I was following the mimic instructions (which indeed 
worked on my mimic testbed), but luminous is different and I missed the 
additional step.
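
For the record, the step I had missed on luminous was deactivating the extra
rank after lowering max_mds. Roughly, assuming rank 1 is the one to drop (the
exact syntax is in the docs Dan linked below):

ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:1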


Works now, thanks!


    Alessandro


Il 13/07/18 17:51, Dan van der Ster ha scritto:

On Fri, Jul 13, 2018 at 4:07 PM Alessandro De Salvo
 wrote:

However, I cannot reduce the number of mdses anymore; I used to do
that with e.g.:

ceph fs set cephfs max_mds 1

Trying this with 12.2.6 has apparently no effect, I am left with 2
active mdses. Is this another bug?

Are you following this procedure?
http://docs.ceph.com/docs/luminous/cephfs/multimds/#decreasing-the-number-of-ranks
i.e. you need to deactivate after decreasing max_mds.

(Mimic does this automatically, OTOH).

-- dan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-13 Thread Alessandro De Salvo

Thanks all,

100..inode, mds_snaptable and 1..inode were not 
corrupted, so I left them as they were. I have re-injected all the bad 
objects, for all mdses (2 per filesystem) and all filesystems I had 
(2), and after setting the mdses as repaired my filesystems are back!
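
In case it is useful to others, a quick (if slow on large pools) way to sweep
the metadata pool for any remaining unreadable objects is simply to try to
read each one, e.g.:

for obj in $(rados -p cephfs_metadata ls); do
  rados -p cephfs_metadata get "$obj" /dev/null 2>/dev/null || echo "unreadable: $obj"
done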


However, I cannot reduce the number of mdses anymore; I used to do 
that with e.g.:



ceph fs set cephfs max_mds 1


Trying this with 12.2.6 has apparently no effect, I am left with 2 
active mdses. Is this another bug?


Thanks,


    Alessandro



Il 13/07/18 15:54, Yan, Zheng ha scritto:

On Thu, Jul 12, 2018 at 11:39 PM Alessandro De Salvo
 wrote:

Some progress, and more pain...

I was able to recover the 200. using the ceph-objectstore-tool for one 
of the OSDs (all identical copies) but trying to re-inject it just with rados 
put was giving no error while the get was still giving the same I/O error. So 
the solution was to rm the object and then put it again; that worked.

However, after restarting one of the MDSes and setting it to repaired, I've hit 
another, similar problem:


2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] : 
error reading table object 'mds0_inotable' -5 ((5) Input/output error)


Can I safely try to do the same as for object 200.? Should I check 
something before trying it? Again, checking the copies of the object, they have 
identical md5sums on all the replicas.


Yes, it should be safe. You also need to do the same for several other
objects. The full object list is:

200.
mds0_inotable
100..inode
mds_snaptable
1..inode

The first three objects are per-mds-rank.  If you have enabled
multi-active mds, you also need to update objects of other ranks. For
mds.1, object names are 201., mds1_inotable and
101..inode.




Thanks,


 Alessandro


Il 12/07/18 16:46, Alessandro De Salvo ha scritto:

Unfortunately yes, all the OSDs were restarted a few times, but no change.

Thanks,


 Alessandro


Il 12/07/18 15:55, Paul Emmerich ha scritto:

This might seem like a stupid suggestion, but: have you tried to restart the 
OSDs?

I've also encountered some random CRC errors that only showed up when trying to 
read an object,
but not on scrubbing, that magically disappeared after restarting the OSD.

However, in my case it was clearly related to 
https://tracker.ceph.com/issues/22464 which doesn't
seem to be the issue here.

Paul

2018-07-12 13:53 GMT+02:00 Alessandro De Salvo 
:


Il 12/07/18 11:20, Alessandro De Salvo ha scritto:



Il 12/07/18 10:58, Dan van der Ster ha scritto:

On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  wrote:

On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
 wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may
help.

No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in syslogs, the hosts are
just fine. I cannot exclude an error on the RAID controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on a different one, so I
would tend to exclude they both had (silent) errors at the same time.


That's fairly distressing. At this point I'd probably try extracting the object 
using ceph-objectstore-tool and seeing if it decodes properly as an mds 
journal. If it does, you might risk just putting it back in place to overwrite 
the crc.


Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep scrub, but I'm not 
sure it could repair in case all replicas are bad, as it seems to be the case.


I finally managed (with the help of Dan), to perform the deep-scrub on pg 
10.14, but the deep scrub did not detect anything wrong. Also trying to repair 
10.14 has no effect.
Still, trying to access the object I get in the OSDs:

2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log [ERR] : 
10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:h

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo

Some progress, and more pain...

I was able to recover the 200. using the ceph-objectstore-tool 
for one of the OSDs (all identical copies) but trying to re-inject it 
just with rados put was giving no error while the get was still giving 
the same I/O error. So the solution was to rm the object and then put it 
again; that worked.
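
Roughly, the sequence was along these lines (a sketch only: the OSD id, file
name and paths are illustrative, the object name is abbreviated here as in the
logs, and a filestore OSD may also need --journal-path):

systemctl stop ceph-osd@23
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 --pgid 10.14 '200.' get-bytes /tmp/200.header
systemctl start ceph-osd@23
rados -p cephfs_metadata rm 200.
rados -p cephfs_metadata put 200. /tmp/200.header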


However, after restarting one of the MDSes and setting it to repaired, 
I've hit another, similar problem:



2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log 
[ERR] : error reading table object 'mds0_inotable' -5 ((5) Input/output 
error)



Can I safely try to do the same as for object 200.? Should I 
check something before trying it? Again, checking the copies of the 
object, they have identical md5sums on all the replicas.


Thanks,


    Alessandro


Il 12/07/18 16:46, Alessandro De Salvo ha scritto:


Unfortunately yes, all the OSDs were restarted a few times, but no change.

Thanks,


    Alessandro


Il 12/07/18 15:55, Paul Emmerich ha scritto:
This might seem like a stupid suggestion, but: have you tried to 
restart the OSDs?


I've also encountered some random CRC errors that only showed up when 
trying to read an object,
but not on scrubbing, that magically disappeared after restarting the 
OSD.


However, in my case it was clearly related to 
https://tracker.ceph.com/issues/22464 which doesn't

seem to be the issue here.

Paul

2018-07-12 13:53 GMT+02:00 Alessandro De Salvo 
<mailto:alessandro.desa...@roma1.infn.it>>:



    Il 12/07/18 11:20, Alessandro De Salvo ha scritto:



Il 12/07/18 10:58, Dan van der Ster ha scritto:

On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum
mailto:gfar...@redhat.com>> wrote:

On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
mailto:alessandro.desa...@roma1.infn.it>> wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object
'200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23)
acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in
fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but
I'm not sure if it may
help.

No SMART errors (the fileservers are SANs, in
RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in
syslogs, the hosts are
just fine. I cannot exclude an error on the RAID
controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one
on a different one, so I
would tend to exclude they both had (silent)
errors at the same time.


That's fairly distressing. At this point I'd probably
try extracting the object using ceph-objectstore-tool
and seeing if it decodes properly as an mds journal.
If it does, you might risk just putting it back in
place to overwrite the crc.

Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep
scrub, but I'm not sure it could repair in case all replicas
are bad, as it seems to be the case.


I finally managed (with the help of Dan), to perform the
deep-scrub on pg 10.14, but the deep scrub did not detect
anything wrong. Also trying to repair 10.14 has no effect.
Still, trying to access the object I get in the OSDs:

2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster)
log [ERR]

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo

Unfortunately yes, all the OSDs were restarted a few times, but no change.

Thanks,


    Alessandro


Il 12/07/18 15:55, Paul Emmerich ha scritto:
This might seem like a stupid suggestion, but: have you tried to 
restart the OSDs?


I've also encountered some random CRC errors that only showed up when 
trying to read an object,
but not on scrubbing, that magically disappeared after restarting the 
OSD.


However, in my case it was clearly related to 
https://tracker.ceph.com/issues/22464 which doesn't

seem to be the issue here.

Paul

2018-07-12 13:53 GMT+02:00 Alessandro De Salvo 
<mailto:alessandro.desa...@roma1.infn.it>>:



Il 12/07/18 11:20, Alessandro De Salvo ha scritto:



Il 12/07/18 10:58, Dan van der Ster ha scritto:

On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum
mailto:gfar...@redhat.com>> wrote:

On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
mailto:alessandro.desa...@roma1.infn.it>> wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object
'200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting
([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact
I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1
log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 !=
expected 0x9ef2b41b on
10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but
I'm not sure if it may
help.

No SMART errors (the fileservers are SANs, in
RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in
syslogs, the hosts are
just fine. I cannot exclude an error on the RAID
controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on
a different one, so I
would tend to exclude they both had (silent)
errors at the same time.


That's fairly distressing. At this point I'd probably
try extracting the object using ceph-objectstore-tool
and seeing if it decodes properly as an mds journal.
If it does, you might risk just putting it back in
place to overwrite the crc.

Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep
scrub, but I'm not sure it could repair in case all replicas
are bad, as it seems to be the case.


I finally managed (with the help of Dan), to perform the
deep-scrub on pg 10.14, but the deep scrub did not detect anything
wrong. Also trying to repair 10.14 has no effect.
Still, trying to access the object I get in the OSDs:

2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster)
log [ERR] : 10.14 full-object read crc 0x976aefc5 != expected
0x9ef2b41b on 10:292cf221:::200.:head

Was deep-scrub supposed to detect the wrong crc? If yes, them it
sounds like a bug.
Can I force the repair someway?
Thanks,

   Alessandro



Alessandro, did you already try a deep-scrub on pg 10.14?


I'm waiting for the cluster to do that, I've sent it earlier
this morning.

  I expect
it'll show an inconsistent object. Though, I'm unsure if
repair will
correct the crc given that in this case *all* replicas
have a bad crc.


Exactly, this is what I wonder too.
Cheers,

    Alessandro


--Dan

However, I'm also quite curious how it ended up that
way, with a checksum mismatch b

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo


Il 12/07/18 11:20, Alessandro De Salvo ha scritto:



Il 12/07/18 10:58, Dan van der Ster ha scritto:
On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  
wrote:
On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
 wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 
0x9ef2b41b on

10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if 
it may

help.

No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), 
and
no disk problems anywhere. No relevant errors in syslogs, the hosts 
are

just fine. I cannot exclude an error on the RAID controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on a different one, 
so I

would tend to exclude they both had (silent) errors at the same time.


That's fairly distressing. At this point I'd probably try extracting 
the object using ceph-objectstore-tool and seeing if it decodes 
properly as an mds journal. If it does, you might risk just putting 
it back in place to overwrite the crc.



Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep scrub, but 
I'm not sure it could repair in case all replicas are bad, as it seems 
to be the case.


I finally managed (with the help of Dan), to perform the deep-scrub on 
pg 10.14, but the deep scrub did not detect anything wrong. Also trying 
to repair 10.14 has no effect.

Still, trying to access the object I get in the OSDs:

2018-07-12 13:40:32.711732 7efbee672700 -1 log_channel(cluster) log 
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:head


Was deep-scrub supposed to detect the wrong crc? If yes, them it sounds 
like a bug.

Can I force the repair someway?
Thanks,

   Alessandro




Alessandro, did you already try a deep-scrub on pg 10.14?


I'm waiting for the cluster to do that, I've sent it earlier this 
morning.



  I expect
it'll show an inconsistent object. Though, I'm unsure if repair will
correct the crc given that in this case *all* replicas have a bad crc.


Exactly, this is what I wonder too.
Cheers,

    Alessandro



--Dan

However, I'm also quite curious how it ended up that way, with a 
checksum mismatch but identical data (and identical checksums!) 
across the three replicas. Have you previously done some kind of 
scrub repair on the metadata pool? Did the PG perhaps get backfilled 
due to cluster changes?

-Greg



Thanks,


  Alessandro



Il 11/07/18 18:56, John Spray ha scritto:

On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
 wrote:

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error

Next step would be to go look for corresponding errors on your OSD
logs, system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John




Can this be recovered someway?

Thanks,


   Alessandro


Il 11/07/18 18:33, John Spray ha scritto:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have 
been

marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 
MDSes each.


I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 
mds.1.journaler.mdlog(ro)

error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) 
log

[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary. The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) u

Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo



Il 12/07/18 10:58, Dan van der Ster ha scritto:

On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  wrote:

On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
 wrote:

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)


So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on
10:292cf221:::200.:head


So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may
help.

No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and
no disk problems anywhere. No relevant errors in syslogs, the hosts are
just fine. I cannot exclude an error on the RAID controllers, but 2 of
the OSDs with 10.14 are on a SAN system and one on a different one, so I
would tend to exclude they both had (silent) errors at the same time.


That's fairly distressing. At this point I'd probably try extracting the object 
using ceph-objectstore-tool and seeing if it decodes properly as an mds 
journal. If it does, you might risk just putting it back in place to overwrite 
the crc.


Wouldn't it be easier to scrub repair the PG to fix the crc?


this is what I already instructed the cluster to do, a deep scrub, but 
I'm not sure it could repair in case all replicas are bad, as it seems 
to be the case.
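
(For reference, the usual command sequence here would be along these lines,
assuming a luminous-era CLI; the PG id is the one from this thread. Note that
repair follows the primary's copy, so with all three replicas reporting the
same unexpected crc it may not resolve the mismatch.)

   ceph pg deep-scrub 10.14
   rados list-inconsistent-obj 10.14 --format=json-pretty
   ceph pg repair 10.14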




Alessandro, did you already try a deep-scrub on pg 10.14?


I'm waiting for the cluster to do that, I've sent it earlier this morning.


  I expect
it'll show an inconsistent object. Though, I'm unsure if repair will
correct the crc given that in this case *all* replicas have a bad crc.


Exactly, this is what I wonder too.
Cheers,

    Alessandro



--Dan


However, I'm also quite curious how it ended up that way, with a checksum 
mismatch but identical data (and identical checksums!) across the three 
replicas. Have you previously done some kind of scrub repair on the metadata 
pool? Did the PG perhaps get backfilled due to cluster changes?
-Greg



Thanks,


  Alessandro



On 11/07/18 18:56, John Spray wrote:

On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
 wrote:

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error

Next step would be to go look for corresponding errors on your OSD
logs, system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John




Can this be recovered someway?

Thanks,


   Alessandro


On 11/07/18 18:33, John Spray wrote:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 MDSes each.

I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary.  The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John




Any attempt of running the journal export results in errors, like this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
Header 200. is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`


Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200. is unreadable
Errors:
0


Re: [ceph-users] MDS damaged

2018-07-12 Thread Alessandro De Salvo


> On 11 Jul 2018, at 23:25, Gregory Farnum wrote:
> 
>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo 
>>  wrote:
>> OK, I found where the object is:
>> 
>> 
>> ceph osd map cephfs_metadata 200.
>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 
>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>> 
>> 
>> So, looking at the osds 23, 35 and 18 logs in fact I see:
>> 
>> 
>> osd.23:
>> 
>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> osd.35:
>> 
>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> osd.18:
>> 
>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log 
>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
>> 10:292cf221:::200.:head
>> 
>> 
>> So, basically the same error everywhere.
>> 
>> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may 
>> help.
>> 
>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and 
>> no disk problems anywhere. No relevant errors in syslogs, the hosts are 
>> just fine. I cannot exclude an error on the RAID controllers, but 2 of 
>> the OSDs with 10.14 are on a SAN system and one on a different one, so I 
>> would tend to exclude they both had (silent) errors at the same time.
> 
> That's fairly distressing. At this point I'd probably try extracting the 
> object using ceph-objectstore-tool and seeing if it decodes properly as an 
> mds journal. If it does, you might risk just putting it back in place to 
> overwrite the crc.
> 

Ok, I guess I know how to extract the object from a given OSD, but I'm not sure
how to check if it decodes as an mds journal, is there a procedure for this?
However, if exporting all the copies from all the OSDs brings the same object
md5sum, I believe I can try directly to overwrite the object, as it cannot get
worse than this, correct?
Also I'd need a confirmation of the procedure to follow in this case, when
possibly all copies of an object are wrong. I would try the following:

- set the noout
- bring down all the osd where the object is present
- replace the object in all stores
- bring the osds up again
- unset the noout

Correct?
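
(A rough command-level sketch of that sequence, not from the original thread.
OSD ids and paths follow the examples above, the object spec is a placeholder
taken from the --op list output, and the exact ceph-objectstore-tool syntax
should be double-checked against --help for your release; filestore OSDs may
also need --journal-path, and the OSD must be stopped while the tool runs.)

   ceph osd set noout
   systemctl stop ceph-osd@23        # unit name assumes systemd packaging
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
       --op list --pgid 10.14        # find the exact object spec
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
       --pgid 10.14 '<object-spec>' get-bytes /tmp/200.header.osd23
   md5sum /tmp/200.header.osd*       # compare the copies from all three OSDs
   # only if overwriting with a copy known to be good:
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-23 \
       --pgid 10.14 '<object-spec>' set-bytes /tmp/200.header.good
   systemctl start ceph-osd@23
   ceph osd unset noout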


> However, I'm also quite curious how it ended up that way, with a checksum 
> mismatch but identical data (and identical checksums!) across the three 
> replicas. Have you previously done some kind of scrub repair on the metadata 
> pool?

No, at least not on this pg; I only remember a repair, but it was on a
different pool.

> Did the PG perhaps get backfilled due to cluster changes?

That might be the case, as we sometimes have to reboot the OSDs when they
crash. Also, yesterday we rebooted all of them, but this always happens in
sequence, one by one, not all at the same time.
Thanks for the help,

   Alessandro

> -Greg
>  
>> 
>> Thanks,
>> 
>> 
>>  Alessandro
>> 
>> 
>> 
>> On 11/07/18 18:56, John Spray wrote:
>> > On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
>> >  wrote:
>> >> Hi John,
>> >>
>> >> in fact I get an I/O error by hand too:
>> >>
>> >>
>> >> rados get -p cephfs_metadata 200. 200.
>> >> error getting cephfs_metadata/200.: (5) Input/output error
>> > Next step would be to go look for corresponding errors on your OSD
>> > logs, system logs, and possibly also check things like the SMART
>> > counters on your hard drives for possible root causes.
>> >
>> > John
>> >
>> >
>> >
>> >>
>> >> Can this be recovered someway?
>> >>
>> >> Thanks,
>> >>
>> >>
>> >>   Alessandro
>> >>
>> >>
>> >> On 11/07/18 18:33, John Spray wrote:
>> >>> On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
>> >>>  wrote:
>> >>>> Hi,
>> >>>>
>> >>>> after the upgrade to luminous 12.2.6 today, all our MDSes have been
>> >>>> marked as damaged. Trying to restart the instances only result in
>> >>>> 

Re: [ceph-users] v10.2.11 Jewel released

2018-07-11 Thread Webert de Souza Lima
Cheers!

Thanks for all the backports and fixes.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jul 11, 2018 at 1:46 PM Abhishek Lekshmanan 
wrote:

>
> We're glad to announce v10.2.11 release of the Jewel stable release
> series. This point release brings a number of important bugfixes and
> has a few important security fixes. This is most likely going to be the
> final Jewel release (shine on you crazy diamond). We thank everyone in
> the community for contributing towards this release and particularly
> want to thank Nathan and Yuri for their relentless efforts in
> backporting and testing this release.
>
> We recommend that all Jewel 10.2.x users upgrade.
>
> Notable Changes
> ---
>
> * CVE 2018-1128: auth: cephx authorizer subject to replay attack
> (issue#24836 http://tracker.ceph.com/issues/24836, Sage Weil)
>
> * CVE 2018-1129: auth: cephx signature check is weak (issue#24837
> http://tracker.ceph.com/issues/24837, Sage Weil)
>
> * CVE 2018-10861: mon: auth checks not correct for pool ops (issue#24838
> http://tracker.ceph.com/issues/24838, Jason Dillaman)
>
> * The RBD C API's rbd_discard method and the C++ API's Image::discard
> method
>   now enforce a maximum length of 2GB. This restriction prevents overflow
> of
>   the result code.
>
> * New OSDs will now use rocksdb for omap data by default, rather than
>   leveldb. omap is used by RGW bucket indexes and CephFS directories,
>   and when a single leveldb grows to 10s of GB with a high write or
>   delete workload, it can lead to high latency when leveldb's
>   single-threaded compaction cannot keep up. rocksdb supports multiple
>   threads for compaction, which avoids this problem.
>
> * The CephFS client now catches failures to clear dentries during startup
>   and refuses to start as consistency and untrimmable cache issues may
>   develop. The new option client_die_on_failed_dentry_invalidate (default:
>   true) may be turned off to allow the client to proceed (dangerous!).
>
> * In 10.2.10 and earlier releases, keyring caps were not checked for
> validity,
>   so the caps string could be anything. As of 10.2.11, caps strings are
>   validated and providing a keyring with an invalid caps string to, e.g.,
>   "ceph auth add" will result in an error.
>
> The changelog and the full release notes are at the release blog entry
> at https://ceph.com/releases/v10-2-11-jewel-released/
>
> Getting Ceph
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-10.2.11.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: e4b061b47f07f583c92a050d9e84b1813a35671e
>
>
> Best,
> Abhishek
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo

OK, I found where the object is:


ceph osd map cephfs_metadata 200.
osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg 
10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)



So, looking at the osds 23, 35 and 18 logs in fact I see:


osd.23:

2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log 
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:head



osd.35:

2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log 
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:head



osd.18:

2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log 
[ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b on 
10:292cf221:::200.:head



So, basically the same error everywhere.

I'm trying to issue a repair of the pg 10.14, but I'm not sure if it may 
help.


No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes), and 
no disk problems anywhere. No relevant errors in syslogs, the hosts are 
just fine. I cannot exclude an error on the RAID controllers, but 2 of 
the OSDs with 10.14 are on a SAN system and one on a different one, so I 
would tend to exclude they both had (silent) errors at the same time.


Thanks,


    Alessandro



On 11/07/18 18:56, John Spray wrote:

On Wed, Jul 11, 2018 at 4:49 PM Alessandro De Salvo
 wrote:

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error

Next step would be to go look for corresponding errors on your OSD
logs, system logs, and possibly also check things like the SMART
counters on your hard drives for possible root causes.

John





Can this be recovered someway?

Thanks,


  Alessandro


On 11/07/18 18:33, John Spray wrote:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 MDSes each.

I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary.  The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John




Any attempt of running the journal export results in errors, like this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
Header 200. is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`


Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200. is unreadable
Errors:
0

Is there something I could try to do to have the cluster back?

I was able to dump the contents of the metadata pool with rados export
-p cephfs_metadata  and I'm currently trying the procedure
described in
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
but I'm not sure if it will work as it's apparently doing nothing at the
moment (maybe it's just very slow).

Any help is appreciated, thanks!


   Alessandro

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo

Hi John,

in fact I get an I/O error by hand too:


rados get -p cephfs_metadata 200. 200.
error getting cephfs_metadata/200.: (5) Input/output error


Can this be recovered someway?

Thanks,


    Alessandro


On 11/07/18 18:33, John Spray wrote:

On Wed, Jul 11, 2018 at 4:10 PM Alessandro De Salvo
 wrote:

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 MDSes each.

I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error

An EIO reading the journal header is pretty scary.  The MDS itself
probably can't tell you much more about this: you need to dig down
into the RADOS layer.  Try reading the 200. object (that
happens to be the rank 0 journal header, every CephFS filesystem
should have one) using the `rados` command line tool.

John





Any attempt of running the journal export results in errors, like this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1
Header 200. is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`


Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200. is unreadable
Errors:
0

Is there something I could try to do to have the cluster back?

I was able to dump the contents of the metadata pool with rados export
-p cephfs_metadata  and I'm currently trying the procedure
described in
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery
but I'm not sure if it will work as it's apparently doing nothing at the
moment (maybe it's just very slow).

Any help is appreciated, thanks!


  Alessandro

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo

Hi Gregory,

thanks for the reply. I have the dump of the metadata pool, but I'm not 
sure what to check there. Is it what you mean?


The cluster was operational until today at noon, when a full restart of 
the daemons was issued, like many other times in the past. I was trying 
to issue the repaired command to get a real error in the logs, but it 
was apparently not the case.


Thanks,


    Alessandro


On 11/07/18 18:22, Gregory Farnum wrote:
Have you checked the actual journal objects as the "journal export" 
suggested? Did you identify any actual source of the damage before 
issuing the "repaired" command?

What is the history of the filesystems on this cluster?

On Wed, Jul 11, 2018 at 8:10 AM Alessandro De Salvo 
<mailto:alessandro.desa...@roma1.infn.it>> wrote:


Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been
marked as damaged. Trying to restart the instances only result in
standby MDSes. We currently have 2 filesystems active and 2 MDSes
each.

I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired
: I get something like this in the MDS logs:


2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
[ERR] : Error recovering journal 0x201: (5) Input/output error


Any attempt of running the journal export results in errors, like
this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571
7f94354fff00 -1
Header 200. is unreadable

2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal
not
readable, attempt object-by-object dump with `rados`


Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header
200. is unreadable
Errors:
0

Is there something I could try to do to have the cluster back?

I was able to dump the contents of the metadata pool with rados
export
-p cephfs_metadata  and I'm currently trying the procedure
described in

http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery

but I'm not sure if it will work as it's apparently doing nothing
at the
moment (maybe it's just very slow).

Any help is appreciated, thanks!


 Alessandro

___
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS damaged

2018-07-11 Thread Alessandro De Salvo

Hi,

after the upgrade to luminous 12.2.6 today, all our MDSes have been 
marked as damaged. Trying to restart the instances only result in 
standby MDSes. We currently have 2 filesystems active and 2 MDSes each.


I found the following error messages in the mon:


mds.0 :6800/2412911269 down:damaged
mds.1 :6800/830539001 down:damaged
mds.0 :6800/4080298733 down:damaged


Whenever I try to force the repaired state with ceph mds repaired 
: I get something like this in the MDS logs:



2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro) 
error getting journal off disk
2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log 
[ERR] : Error recovering journal 0x201: (5) Input/output error



Any attempt of running the journal export results in errors, like this one:


cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00 -1 
Header 200. is unreadable


2018-07-11 17:01:30.631584 7f94354fff00 -1 journal_export: Journal not 
readable, attempt object-by-object dump with `rados`



Same happens for recover_dentries

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
Events by type:2018-07-11 17:04:19.770779 7f05429fef00 -1 Header 
200. is unreadable

Errors:
0

Is there something I could try to do to have the cluster back?

I was able to dump the contents of the metadata pool with rados export 
-p cephfs_metadata  and I'm currently trying the procedure 
described in 
http://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery 
but I'm not sure if it will work as it's apparently doing nothing at the 
moment (maybe it's just very slow).


Any help is appreciated, thanks!


    Alessandro

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Looking for some advise on distributed FS: Is Ceph the right option for me?

2018-07-10 Thread Jones de Andrade
Hi all.

I'm looking for some information on several distributed filesystems for our
application.

It looks like it finally came down to two candidates, Ceph being one of
them. But there are still a few questions about it that I would really like
to clarify, if possible.

Our plan, initially on 6 workstations, is to have it hosting a distributed
file system that can withstand two simultaneous computer failures without
data loss (something reminiscent of RAID 6, but over the network).
This file system will also need to be remotely mounted (NFS server
with fallbacks) by 5+ other computers. Students will be working on all 11+
computers at the same time (different requirements from different software:
some use many small files, others a few really big files, 100s of GB), and
absolutely no hardware modifications are allowed. This initial test bed is
for undergraduate student usage, but if successful it will also be employed
for our small clusters. The connection is a simple GbE network.

Our actual concerns are:
1) Data Resilience: It seems that double copy of each block is the standard
setting, is it correct? As such, it will strip-parity data among three
computers for each block?

2) Metadata resilience: We've seen that we can now have more than a single
metadata server (which was a show-stopper in previous versions). However,
do they have to be dedicated boxes, or can they share boxes with the data
servers? Can it be configured in such a way that even if two metadata
server computers fail, the whole system's data will still be accessible from
the remaining computers without interruption, or do they serve different
data aiming only at performance?

3) Other software compatibility: We've seen that there is NFS incompatibility,
is that correct? Also, any POSIX issues?

4) No single (or double) point of failure: every single possible instance has
to be able to endure a *double* failure (yes, things can take time to be
fixed here). Does Ceph need a single master server for any of its
activities? Can it endure a double failure? How long would it take for any
sort of "fallback" to complete, and would users need to wait to regain
access?

I think that covers the initial questions we have. Sorry if this is the
wrong list, however.

Looking forward for any answer or suggestion,

Regards,

Jones
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD for bluestore

2018-07-09 Thread Webert de Souza Lima
bluestore doesn't have a journal like filestore does, but there is the
WAL (Write-Ahead Log), which looks like a journal but works differently.
You can (or must, depending on your needs) have SSDs to serve this WAL (and
the RocksDB).
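
A minimal example of what that looks like with ceph-volume (device names are
placeholders; if the DB and WAL share the same fast device, specifying only
--block.db is enough, since the WAL is then kept inside the DB device):

   ceph-volume lvm create --bluestore --data /dev/sdb \
       --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2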

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Sun, Jul 8, 2018 at 11:58 AM Satish Patel  wrote:

> Folks,
>
> I'm just reading from multiple posts that bluestore doesn't need an SSD
> journal, is that true?
>
> I'm planning to build a 5 node cluster so depending on that I'll purchase an
> SSD for the journal.
>
> If it does require an SSD for the journal then what would be the best vendor
> and model which lasts long? Any recommendations?
>
> Sent from my iPhone
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD Initiator with Ceph iscsi

2018-06-30 Thread Frank de Bot (lists)
I've crossposted the problem to the freebsd-stable mailing list. There is
no ALUA support at the initiator side. There were 2 options for
multipathing:

1. Export your LUNs via two (or more) different paths (for example
   via two different target portal IP addresses), on the initiator
   side set up both iSCSI sessions in the usual way (like without
   multipathing), add kern.iscsi.fail_on_disconnection=1 to
   /etc/sysctl.conf, and set up gmultipath on top of LUNs reachable
   via those sessions

2. Set up the target so it redirects (sends "Target moved temporarily"
   login responses) to the target portal it considers active.  Then
   set up the initiator (single session) to either one; the target
   will "bounce it" to the right place.  You don't need gmultipath
   in this case, because from the initiator point of view there's only
   one iSCSI session at any time.

Would either of those 2 options be possible to configure on the ceph iscsi
gateway solution?


Regards,

Frank

Jason Dillaman wrote:
> Conceptually, I would assume it should just work if configured correctly
> w/ multipath (to properly configure the ALUA settings on the LUNs). I
> don't run FreeBSD, but any particular issue you are seeing?
> 
> On Tue, Jun 26, 2018 at 6:06 PM Frank de Bot (lists)  <mailto:li...@searchy.net>> wrote:
> 
> Hi,
> 
> In my test setup I have a ceph iscsi gateway (configured as in
> http://docs.ceph.com/docs/luminous/rbd/iscsi-overview/ )
> 
> I would like to use this with a FreeBSD (11.1) initiator, but I fail to
> make a working setup in FreeBSD. Is it known if the FreeBSD initiator
> (with gmultipath) can work with this gateway setup?
> 
> 
> Regards,
> 
> Frank
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FreeBSD Initiator with Ceph iscsi

2018-06-26 Thread Frank de Bot (lists)
Hi,

In my test setup I have a ceph iscsi gateway (configured as in
http://docs.ceph.com/docs/luminous/rbd/iscsi-overview/ )

I would like to use this with a FreeBSD (11.1) initiator, but I fail to
make a working setup in FreeBSD. Is it known if the FreeBSD initiator
(with gmultipath) can work with this gateway setup?


Regards,

Frank
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel SSD DC P3520 PCIe for OSD 1480 TBW good idea?

2018-06-25 Thread Jelle de Jong

Hello everybody,

I am thinking about making a production three-node Ceph cluster with 3x
1.2TB Intel SSD DC P3520 PCIe storage devices: 10.8TB raw (7.2TB / 66% for
production).


I am not planning on a journal on a separate SSD. I assume there is no
advantage to this when using PCIe storage?


Network connection to a Cisco SG550XG-8F8T 10GbE switch with Intel
X710-DA2 (if someone knows a good mainline Linux budget replacement).


https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p3520-series/dc-p3520-1-2tb-aic-3d1.html

Is this a good storage setup?

Mainboard: Intel® Server Board S2600CW2R
CPU: 2x Intel® Xeon® Processor E5-2630 v4 (25M Cache, 2.20 GHz)
Memory:  1x 64GB DDR4 ECC KVR24R17D4K4/64
Disk: 2x WD Gold 4TB 7200rpm 128MB SATA3
Storage: 3x Intel SSD DC P3520 1.2TB PCIe
Adapter: Intel Ethernet Converged Network Adapter X710-DA2

I want to try using NUMA to also run KVM guests besides the OSDs. I
should have enough cores and only have a few OSD processes.


Kind regards,

Jelle de Jong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Frequent slow requests

2018-06-19 Thread Frank de Bot (lists)
Frank (lists) wrote:
> Hi,
> 
> On a small cluster (3 nodes) I frequently have slow requests. When
> dumping the inflight ops from the hanging OSD, it seems it doesn't get a
> 'response' for one of the subops. The events always look like:
> 

I've done some further testing; all slow requests are blocked by OSDs on
a single host. How can I debug this problem further? I can't find any
errors or other strange things on the host with the OSDs that are seemingly
not sending a response to an op.
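
(For reference, a few commands that usually help narrow this down, assuming
the OSD admin sockets are reachable on that host; not from the original
thread.)

   ceph health detail                       # names the OSDs with slow requests
   ceph osd perf                            # per-OSD commit/apply latency
   ceph daemon osd.<id> dump_ops_in_flight  # run on the host owning osd.<id>
   ceph daemon osd.<id> dump_historic_ops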


Regards,

Frank de Bot

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Minimal MDS for CephFS on OSD hosts

2018-06-19 Thread Webert de Souza Lima
Keep in mind that the MDS server is CPU-bound, so during heavy workloads it
will eat up CPU, so the OSD daemons can affect or be affected by the
MDS daemon. But it does work well. We've been running a few clusters with
MON, MDS and OSDs sharing the same hosts for a couple of years now.
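
As a sketch, capping the MDS cache when colocating could look like this (the
4 GB value is only an illustration, not a recommendation from this thread):

   [mds]
       mds cache memory limit = 4294967296

   # or at runtime:
   ceph tell mds.* injectargs '--mds_cache_memory_limit=4294967296'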

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Tue, Jun 19, 2018 at 11:03 AM Paul Emmerich 
wrote:

> Just co-locate them with your OSDs. You can control how much RAM the
> MDSs use with the "mds cache memory limit" option (default 1 GB).
> Note that the cache should be large enough to keep the active working
> set in the mds cache, but 1 million files is not really a lot.
> As a rule of thumb: ~1GB of MDS cache per ~100k files.
>
> 64GB of RAM for 12 OSDs and an MDS is enough in most cases.
>
> Paul
>
> 2018-06-19 15:34 GMT+02:00 Denny Fuchs :
>
>> Hi,
>>
>> Am 19.06.2018 15:14, schrieb Stefan Kooman:
>>
>> Storage doesn't matter for MDS, as they won't use it to store ceph data
>>> (but instead use the (meta)data pool to store meta data).
>>> I would not colocate the MDS daemons with the OSDS, but instead create a
>>> couple of VMs (active / standby) and give them as much RAM as you
>>> possibly can.
>>>
>>
>> thanks a lot. I think we would start with roughly 8GB and see what
>> happens.
>>
>> cu denny
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster

2018-06-14 Thread Alessandro De Salvo

Hi,


On 14/06/18 06:13, Yan, Zheng wrote:

On Wed, Jun 13, 2018 at 9:35 PM Alessandro De Salvo
 wrote:

Hi,


On 13/06/18 14:40, Yan, Zheng wrote:

On Wed, Jun 13, 2018 at 7:06 PM Alessandro De Salvo
 wrote:

Hi,

I'm trying to migrate a cephfs data pool to a different one in order to
reconfigure with new pool parameters. I've found some hints but no
specific documentation to migrate pools.

I'm currently trying with rados export + import, but I get errors like
these:

Write #-9223372036854775808::::11e1007.:head#
omap_set_header failed: (95) Operation not supported

The command I'm using is the following:

rados export -p cephfs_data | rados import -p cephfs_data_new -

So, I have a few questions:


1) would it work to swap the cephfs data pools by renaming them while
the fs cluster is down?

2) how can I copy the old data pool into a new one without errors like
the ones above?


This won't work as you expected. Some cephfs metadata records the ID of the data pool.

This is what I was suspecting too, hence the question, so thanks for confirming it.
Basically, once a cephfs filesystem is created, the pool and structure
are immutable. This is not good, though.


3) plain copy from a fs to another one would also work, but I didn't
find a way to tell the ceph fuse clients how to mount different
filesystems in the same cluster, any documentation on it?


ceph-fuse /mnt/ceph --client_mds_namespace=cephfs_name

In the meantime I also found the same option for fuse and tried it. It
works with fuse, but it seems it's not possible to export multiple
filesystems via nfs-ganesha.


put the client_mds_namespace option in the client section of ceph.conf (on
the machine that runs ganesha)


Yes, that would work but then I need a (set of) exporter(s) for every 
cephfs filesystem. That sounds reasonable though, as it's the same 
situation as for the mds services.
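
(For reference, the ceph.conf snippet Zheng suggests would look roughly like
this on each ganesha host; "cephfs2" is a placeholder for the second
filesystem's name.)

   [client]
       client mds namespace = cephfs2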

Thanks for the hint,

    Alessandro





Anyone tried it?



4) even if I found a way to mount via fuse different filesystems
belonging to the same cluster, is this feature stable enough or is it
still super-experimental?


very stable

Very good!

Thanks,


  Alessandro


Thanks,


   Alessandro


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Got it Gregory, sounds good enough for us.

Thank you all for the help provided.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 2:20 PM Gregory Farnum  wrote:

> Nah, I would use one Filesystem unless you can’t. The backtrace does
> create another object but IIRC it’s a maximum one IO per create/rename (on
> the file).
> On Wed, Jun 13, 2018 at 1:12 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> Thanks for clarifying that, Gregory.
>>
>> As said before, we use the file layout to resolve the difference of
>> workloads in those 2 different directories in cephfs.
>> Would you recommend using 2 filesystems instead? By doing so, each fs
>> would have it's default data pool accordingly.
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum 
>> wrote:
>>
>>> The backtrace object Zheng referred to is used only for resolving hard
>>> links or in disaster recovery scenarios. If the default data pool isn’t
>>> available you would stack up pending RADOS writes inside of your mds but
>>> the rest of the system would continue unless you manage to run the mds out
>>> of memory.
>>> -Greg
>>> On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima <
>>> webert.b...@gmail.com> wrote:
>>>
>>>> Thank you Zheng.
>>>>
>>>> Does that mean that, when using such feature, our data integrity relies
>>>> now on both data pools'  integrity/availability?
>>>>
>>>> We currently use such feature in production for dovecot's index files,
>>>> so we could store this directory on a pool of SSDs only. The main data pool
>>>> is made of HDDs and stores the email files themselves.
>>>>
>>>> There ain't too many files created, it's just a few files per email
>>>> user, and basically one directory per user's mailbox.
>>>> Each mailbox has a index file that is updated upon every new email
>>>> received or moved, deleted, read, etc.
>>>>
>>>> I think in this scenario the overhead may be acceptable for us.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Webert Lima
>>>> DevOps Engineer at MAV Tecnologia
>>>> *Belo Horizonte - Brasil*
>>>> *IRC NICK - WebertRLZ*
>>>>
>>>>
>>>> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:
>>>>
>>>>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>>>>>  wrote:
>>>>> >
>>>>> > hello,
>>>>> >
>>>>> > is there any performance impact on cephfs for using file layouts to
>>>>> bind a specific directory in cephfs to a given pool? Of course, such pool
>>>>> is not the default data pool for this cephfs.
>>>>> >
>>>>>
>>>>> For each file, no matter which pool the file data are stored in, the mds
>>>>> always creates an object in the default data pool. The object in the
>>>>> default data pool is used for storing the backtrace. So files stored in a
>>>>> non-default pool have extra overhead on file creation. For large files,
>>>>> the overhead can be neglected. But for lots of small files, the overhead
>>>>> may affect performance.
>>>>>
>>>>>
>>>>> > Regards,
>>>>> >
>>>>> > Webert Lima
>>>>> > DevOps Engineer at MAV Tecnologia
>>>>> > Belo Horizonte - Brasil
>>>>> > IRC NICK - WebertRLZ
>>>>> > ___
>>>>> > ceph-users mailing list
>>>>> > ceph-users@lists.ceph.com
>>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Thanks for clarifying that, Gregory.

As said before, we use the file layout to resolve the difference of
workloads in those 2 different directories in cephfs.
Would you recommend using 2 filesystems instead? By doing so, each fs would
have its default data pool accordingly.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 11:33 AM Gregory Farnum  wrote:

> The backtrace object Zheng referred to is used only for resolving hard
> links or in disaster recovery scenarios. If the default data pool isn’t
> available you would stack up pending RADOS writes inside of your mds but
> the rest of the system would continue unless you manage to run the mds out
> of memory.
> -Greg
> On Wed, Jun 13, 2018 at 9:25 AM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> Thank you Zheng.
>>
>> Does that mean that, when using such feature, our data integrity relies
>> now on both data pools'  integrity/availability?
>>
>> We currently use such feature in production for dovecot's index files, so
>> we could store this directory on a pool of SSDs only. The main data pool is
>> made of HDDs and stores the email files themselves.
>>
>> There ain't too many files created, it's just a few files per email user,
>> and basically one directory per user's mailbox.
>> Each mailbox has a index file that is updated upon every new email
>> received or moved, deleted, read, etc.
>>
>> I think in this scenario the overhead may be acceptable for us.
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>>
>> On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:
>>
>>> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>>>  wrote:
>>> >
>>> > hello,
>>> >
>>> > is there any performance impact on cephfs for using file layouts to
>>> bind a specific directory in cephfs to a given pool? Of course, such pool
>>> is not the default data pool for this cephfs.
>>> >
>>>
>>> For each file, no matter which pool the file data are stored in, the mds
>>> always creates an object in the default data pool. The object in the
>>> default data pool is used for storing the backtrace. So files stored in a
>>> non-default pool have extra overhead on file creation. For large files,
>>> the overhead can be neglected. But for lots of small files, the overhead
>>> may affect performance.
>>>
>>>
>>> > Regards,
>>> >
>>> > Webert Lima
>>> > DevOps Engineer at MAV Tecnologia
>>> > Belo Horizonte - Brasil
>>> > IRC NICK - WebertRLZ
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster

2018-06-13 Thread Alessandro De Salvo

Hi,


On 13/06/18 14:40, Yan, Zheng wrote:

On Wed, Jun 13, 2018 at 7:06 PM Alessandro De Salvo
 wrote:

Hi,

I'm trying to migrate a cephfs data pool to a different one in order to
reconfigure with new pool parameters. I've found some hints but no
specific documentation to migrate pools.

I'm currently trying with rados export + import, but I get errors like
these:

Write #-9223372036854775808::::11e1007.:head#
omap_set_header failed: (95) Operation not supported

The command I'm using is the following:

   rados export -p cephfs_data | rados import -p cephfs_data_new -

So, I have a few questions:


1) would it work to swap the cephfs data pools by renaming them while
the fs cluster is down?

2) how can I copy the old data pool into a new one without errors like
the ones above?


This won't work as you expected. Some cephfs metadata records the ID of the data pool.


This is what I was suspecting too, hence the question, so thanks for confirming it.
Basically, once a cephfs filesystem is created, the pool and structure
are immutable. This is not good, though.
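
(A sketch of the workaround usually suggested instead of swapping pools; pool
names and the mount point are placeholders, and files written before the
layout change keep their data in the old pool until they are rewritten.)

   ceph fs add_data_pool cephfs cephfs_data_new
   setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs
   # new files now land in cephfs_data_new; existing files have to be
   # copied/renamed so their data is rewritten into the new pool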





3) plain copy from a fs to another one would also work, but I didn't
find a way to tell the ceph fuse clients how to mount different
filesystems in the same cluster, any documentation on it?


ceph-fuse /mnt/ceph --client_mds_namespace=cephfs_name


In the meantime I also found the same option for fuse and tried it. It
works with fuse, but it seems it's not possible to export multiple
filesystems via nfs-ganesha.


Anyone tried it?





4) even if I found a way to mount via fuse different filesystems
belonging to the same cluster, is this feature stable enough or is it
still super-experimental?


very stable


Very good!

Thanks,


    Alessandro




Thanks,


  Alessandro


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs: bind data pool via file layout

2018-06-13 Thread Webert de Souza Lima
Thank you Zheng.

Does that mean that, when using such a feature, our data integrity now relies
on both data pools' integrity/availability?

We currently use such a feature in production for dovecot's index files, so
we could store this directory on a pool of SSDs only. The main data pool is
made of HDDs and stores the email files themselves.

There ain't too many files created, it's just a few files per email user,
and basically one directory per user's mailbox.
Each mailbox has an index file that is updated upon every new email received
or moved, deleted, read, etc.

I think in this scenario the overhead may be acceptable for us.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, Jun 13, 2018 at 9:51 AM Yan, Zheng  wrote:

> On Wed, Jun 13, 2018 at 3:34 AM Webert de Souza Lima
>  wrote:
> >
> > hello,
> >
> > is there any performance impact on cephfs for using file layouts to bind
> a specific directory in cephfs to a given pool? Of course, such pool is not
> the default data pool for this cephfs.
> >
>
> For each file, no matter which pool the file data are stored in, the mds
> always creates an object in the default data pool. The object in the
> default data pool is used for storing the backtrace. So files stored in a
> non-default pool have extra overhead on file creation. For large files,
> the overhead can be neglected. But for lots of small files, the overhead
> may affect performance.
>
>
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > Belo Horizonte - Brasil
> > IRC NICK - WebertRLZ
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating cephfs data pools and/or mounting multiple filesystems belonging to the same cluster

2018-06-13 Thread Alessandro De Salvo

Hi,

I'm trying to migrate a cephfs data pool to a different one in order to 
reconfigure with new pool parameters. I've found some hints but no 
specific documentation to migrate pools.


I'm currently trying with rados export + import, but I get errors like 
these:


Write #-9223372036854775808::::11e1007.:head#
omap_set_header failed: (95) Operation not supported

The command I'm using is the following:

 rados export -p cephfs_data | rados import -p cephfs_data_new -

So, I have a few questions:


1) would it work to swap the cephfs data pools by renaming them while 
the fs cluster is down?


2) how can I copy the old data pool into a new one without errors like 
the ones above?


3) plain copy from a fs to another one would also work, but I didn't 
find a way to tell the ceph fuse clients how to mount different 
filesystems in the same cluster, any documentation on it?


4) even if I found a way to mount via fuse different filesystems 
belonging to the same cluster, is this feature stable enough or is it 
still super-experimental?



Thanks,


    Alessandro


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs: bind data pool via file layout

2018-06-12 Thread Webert de Souza Lima
hello,

is there any performance impact on cephfs for using file layouts to bind a
specific directory in cephfs to a given pool? Of course, such pool is not
the default data pool for this cephfs.
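
(For context, the binding referred to here is the directory layout xattr,
roughly as below; the pool name and mount point are hypothetical, and the
layout only applies to files created after it is set.)

   setfattr -n ceph.dir.layout.pool -v cephfs_index_ssd /mnt/cephfs/INDEX
   getfattr -n ceph.dir.layout /mnt/cephfs/INDEX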

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-19 Thread Webert de Souza Lima
Hi Daniel,

Thanks for clarifying.
I'll have a look at dirfrag option.

Regards,
Webert Lima

On Sat, 19 May 2018, 01:18, Daniel Baumann wrote:

> On 05/19/2018 01:13 AM, Webert de Souza Lima wrote:
> > New question: will it make any difference in the balancing if instead of
> > having the MAIL directory in the root of cephfs and the domains's
> > subtrees inside it, I discard the parent dir and put all the subtress
> right in cephfs root?
>
> the balancing between the MDS is influenced by which directories are
> accessed, the currently accessed directory-trees are divided between the
> MDS's (also check the dirfrag option in the docs). assuming you have the
> same access pattern, the "fragmentation" between the MDS's happens at
> these "target-directories", so it doesn't matter if these directories
> are further up or down in the same filesystem tree.
>
> in the multi-MDS scenario where the MDS serving rank 0 fails, the
> effects in the moment of the failure for any cephfs client accessing a
> directory/file are the same (as described in an earlier mail),
> regardless on which level the directory/file is within the filesystem.
>
> Regards,
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Webert de Souza Lima
Hi Patrick

On Fri, May 18, 2018 at 6:20 PM Patrick Donnelly 
wrote:

> Each MDS may have multiple subtrees they are authoritative for. Each
> MDS may also replicate metadata from another MDS as a form of load
> balancing.


Ok, it's good to know that it actually does some load balancing. Thanks.
New question: will it make any difference in the balancing if, instead of
having the MAIL directory in the root of cephfs and the domains' subtrees
inside it,
I discard the parent dir and put all the subtrees right in the cephfs root?


> standby-replay daemons are not available to take over for ranks other
> than the one it follows. So, you would want to have a standby-replay
> daemon for each rank or just have normal standbys. It will likely
> depend on the size of your MDS (cache size) and available hardware.
>
> It's best if y ou see if the normal balancer (especially in v12.2.6
> [1]) can handle the load for you without trying to micromanage things
> via pins. You can use pinning to isolate metadata load from other
> ranks as a stop-gap measure.
>

Ok, I will start with the simplest way. This can be changed after deployment
if it turns out to be necessary.
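
(For reference, on luminous that "simplest way" boils down to something like
the following; the filesystem name and rank count are placeholders, and any
extra MDS daemons beyond max_mds automatically become standbys.)

   ceph fs set cephfs allow_multimds true   # may be required on luminous
   ceph fs set cephfs max_mds 3
   ceph fs status cephfs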

On Fri, May 18, 2018 at 6:38 PM Daniel Baumann 
wrote:

> jftr, having 3 active mds and 3 standby-replay resulted in May 2017 in a
> longer downtime for us due to http://tracker.ceph.com/issues/21749
>
> we're not using standby-replay MDS's anymore but only "normal" standby,
> and didn't have had any problems anymore (running kraken then, upgraded
> to luminous last fall).
>

Thank you very much for your feedback Daniel. I'll go for the regular
standby daemons, then.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] (yet another) multi active mds advise needed

2018-05-18 Thread Webert de Souza Lima
Hi,

We're migrating from a Jewel / filestore based cephfs architecture to a
Luminous / bluestore based one.

One MUST HAVE is multiple Active MDS daemons. I'm still lacking knowledge
of how it actually works.
After reading the docs and ML we learned that they work by sort of dividing
the responsibilities, each with its own and only directory subtree (please
correct me if I'm wrong).

Question 1: I'd like to know if it is viable to have 4 MDS daemons, being 3
Active and 1 Standby (or Standby-Replay if that's still possible with
multi-mds).

Basically, what we have is 2 subtrees used by dovecot: INDEX and MAIL.
Their tree is almost identical but INDEX stores all dovecot metadata with
heavy IO going on and MAIL stores actual email files, with much more writes
than reads.

I don't know by now which one could bottleneck the MDS servers most so I
wonder if I can take metrics on MDS usage per pool when it's deployed.
Question 2: If the metadata workloads are very different I wonder if I can
isolate them, like pinning MDS servers X and Y to one of the directories.
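
(For what it's worth, the pinning mentioned in Question 2 is done per
directory with an xattr, roughly as below; ranks and paths are illustrative
only, and -v -1 restores the default balancer behaviour.)

   setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/INDEX   # pin subtree to rank 0
   setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/MAIL    # pin subtree to rank 1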

Cache Tier is deprecated so,
Question 3: how can I think of a read cache mechanism in Luminous with
bluestore, mainly to keep newly created files (emails that just arrived and
will probably be fetched by the user in a few seconds via IMAP/POP3).

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-MDS Failover

2018-05-18 Thread Webert de Souza Lima
Hello,


On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann 
wrote:

> additionally: if rank 0 is lost, the whole FS stands still (no new
> client can mount the fs; no existing client can change a directory, etc.).
>
> my guess is that the root of a cephfs (/; which is always served by rank
> 0) is needed in order to do traversals/lookups of any directories on the
> top-level (which then can be served by ranks 1-n).
>

Could someone confirm if this is actually how it works? Thanks.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Thanks Jack.

That's good to know. It is definitely something to consider.
In a distributed storage scenario we might build a dedicated pool for that
and tune the pool as more capacity or performance is needed.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:45 PM Jack  wrote:

> On 05/16/2018 09:35 PM, Webert de Souza Lima wrote:
> > We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
> > backend.
> > We'll have to do some work on how to simulate user traffic, for writes
> > and reads. That seems troublesome.
> I would appreciate seeing these results !
>
> > Thanks for the plugins recommendations. I'll take the change and ask you
> > how is the SIS status? We have used it in the past and we've had some
> > problems with it.
>
> I am using it since Dec 2016 with mdbox, with no issue at all (I am
> currently using Dovecot 2.2.27-3 from Debian Stretch)
> The only config I use is mail_attachment_dir, the rest lies as default
> (mail_attachment_min_size = 128k, mail_attachment_fs = sis posix,
> mail_attachment_hash = %{sha1})
> The backend storage is a local filesystem, and there is only one Dovecot
> instance
>
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > *Belo Horizonte - Brasil*
> > *IRC NICK - WebertRLZ*
> >
> >
> > On Wed, May 16, 2018 at 4:19 PM Jack  wrote:
> >
> >> Hi,
> >>
> >> Many (most?) filesystems do not store multiple files on the same
> >> block
> >>
> >> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> >> lines in it) will eat an inode, and a block (4k here)
> >> mdbox is more compact on this way
> >>
> >> Another difference: sdbox removes the message, mdbox does not : a single
> >> metadata update is performed, which may be packed with others if many
> >> files are deleted at once
> >>
> >> That said, I do not have experience with dovecot + cephfs, nor have made
> >> tests for sdbox vs mdbox
> >>
> >> However, and this is a bit out of topic, I recommend you look at the
> >> following dovecot's features (if not already done), as they are awesome
> >> and will help you a lot:
> >> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> >> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> >> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
> >>
> >> Regards,
> >> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> >>> I'm sending this message to both dovecot and ceph-users ML so please
> >> don't
> >>> mind if something seems too obvious for you.
> >>>
> >>> Hi,
> >>>
> >>> I have a question for both dovecot and ceph lists and below I'll
> explain
> >>> what's going on.
> >>>
> >>> Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> >> when
> >>> using sdbox, a new file is stored for each email message.
> >>> When using mdbox, multiple messages are appended to a single file until
> >> it
> >>> reaches/passes the rotate limit.
> >>>
> >>> I would like to understand better how the mdbox format impacts on IO
> >>> performance.
> >>> I think it's generally expected that fewer larger file translate to
> less
> >> IO
> >>> and more troughput when compared to more small files, but how does
> >> dovecot
> >>> handle that with mdbox?
> >>> If dovecot does flush data to storage upon each and every new email is
> >>> arrived and appended to the corresponding file, would that mean that it
> >>> generate the same ammount of IO as it would do with one file per
> message?
> >>> Also, if using mdbox many messages will be appended to a said file
> >> before a
> >>> new file is created. That should mean that a file descriptor is kept
> open
> >>> for sometime by dovecot process.
> >>> Using cephfs as backend, how would this impact cluster performance
> >>> regarding MDS caps and inodes cached when files from thousands of users
> >> are
> >>> opened and appended all over?
> >>>
> >>> I would like to understand this better.
> >>>
> >>> Why?
> >>> We are a small Business Email Hosting provider with ba

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Danny,

I actually saw that thread and I was very excited about it. I thank you all
for that idea and all the effort being put in it.
I haven't yet tried to play around with your plugin but I intend to, and to
contribute back. I think when it's ready for production it will be
unbeatable.

I have watched your talk at Cephalocon (on YouTube). I'll see your slides,
maybe they'll give me more insights on our infrastructure architecture.

As you can see our business is still taking baby steps compared to Deutsche
Telekom's, but we have been facing infrastructure challenges every day since
the beginning.
For now, I think we can still fit with cephfs, but we definitely need some
improvements.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:42 PM Danny Al-Gaaf 
wrote:

> Hi,
>
> some time back we had similar discussions when we, as an email provider,
> discussed to move away from traditional NAS/NFS storage to Ceph.
>
> The problem with POSIX file systems and dovecot is that e.g. with mdbox
> only around ~20% of the IO operations are READ/WRITE, the rest are
> metadata IOs. You will not change this with using CephFS since it will
> basically behave the same way as e.g. NFS.
>
> We decided to develop librmb to store emails as objects directly in
> RADOS instead of CephFS. The project is still under development, so you
> should not use it in production, but you can try it to run a POC.
>
> For more information check out my slides from Ceph Day London 2018:
> https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page
>
> The project can be found on github:
> https://github.com/ceph-dovecot/
>
> -Danny
>
> Am 16.05.2018 um 20:37 schrieb Webert de Souza Lima:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can 

Re: [ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
Hello Jack,

yes, I imagine I'll have to do some work on tuning the block size on
cephfs. Thanks for the advice.
I knew that with mdbox messages are not removed, but I thought that was
true for sdbox too. Thanks again.
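For reference, the kind of knob I mean is the CephFS directory layout, e.g.
(values and path are just placeholders, I still have to test what makes sense):

    # adjust object size / stripe unit for the directories holding mdbox files
    setfattr -n ceph.dir.layout.object_size -v 1048576 /mnt/cephfs/mail
    setfattr -n ceph.dir.layout.stripe_unit -v 1048576 /mnt/cephfs/mail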

We'll soon do benchmarks of sdbox vs mdbox over cephfs with bluestore
backend.
We'll have to do some work on how to simulate user traffic, for writes
and reads. That seems troublesome.

Thanks for the plugin recommendations. I'll take the chance and ask you:
how is the SIS status? We used it in the past and had some problems with
it.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 16, 2018 at 4:19 PM Jack  wrote:

> Hi,
>
> Many (most ?) filesystems does not store multiple files on the same block
>
> Thus, with sdbox, every single mail (you know, that kind of mail with 10
> lines in it) will eat an inode, and a block (4k here)
> mdbox is more compact on this way
>
> Another difference: sdbox removes the message, mdbox does not : a single
> metadata update is performed, which may be packed with others if many
> files are deleted at once
>
> That said, I do not have experience with dovecot + cephfs, nor have made
> tests for sdbox vs mdbox
>
> However, and this is a bit out of topic, I recommend you look at the
> following dovecot's features (if not already done), as they are awesome
> and will help you a lot:
> - Compression (classic, https://wiki.dovecot.org/Plugins/Zlib)
> - Single-Instance-Storage (aka sis, aka "attachment deduplication" :
> https://www.dovecot.org/list/dovecot/2013-December/094276.html)
>
> Regards,
> On 05/16/2018 08:37 PM, Webert de Souza Lima wrote:
> > I'm sending this message to both dovecot and ceph-users ML so please
> don't
> > mind if something seems too obvious for you.
> >
> > Hi,
> >
> > I have a question for both dovecot and ceph lists and below I'll explain
> > what's going on.
> >
> > Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox),
> when
> > using sdbox, a new file is stored for each email message.
> > When using mdbox, multiple messages are appended to a single file until
> it
> > reaches/passes the rotate limit.
> >
> > I would like to understand better how the mdbox format impacts on IO
> > performance.
> > I think it's generally expected that fewer larger file translate to less
> IO
> > and more troughput when compared to more small files, but how does
> dovecot
> > handle that with mdbox?
> > If dovecot does flush data to storage upon each and every new email is
> > arrived and appended to the corresponding file, would that mean that it
> > generate the same ammount of IO as it would do with one file per message?
> > Also, if using mdbox many messages will be appended to a said file
> before a
> > new file is created. That should mean that a file descriptor is kept open
> > for sometime by dovecot process.
> > Using cephfs as backend, how would this impact cluster performance
> > regarding MDS caps and inodes cached when files from thousands of users
> are
> > opened and appended all over?
> >
> > I would like to understand this better.
> >
> > Why?
> > We are a small Business Email Hosting provider with bare metal, self
> hosted
> > systems, using dovecot for servicing mailboxes and cephfs for email
> storage.
> >
> > We are currently working on dovecot and storage redesign to be in
> > production ASAP. The main objective is to serve more users with better
> > performance, high availability and scalability.
> > * high availability and load balancing is extremely important to us *
> >
> > On our current model, we're using mdbox format with dovecot, having
> > dovecot's INDEXes stored in a replicated pool of SSDs, and messages
> stored
> > in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
> > All using cephfs / filestore backend.
> >
> > Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
> > (10.2.9-4).
> >  - ~25K users from a few thousands of domains per cluster
> >  - ~25TB of email data per cluster
> >  - ~70GB of dovecot INDEX [meta]data per cluster
> >  - ~100MB of cephfs metadata per cluster
> >
> > Our goal is to build a single ceph cluster for storage that could expand
> in
> > capacity, be highly available and perform well enough. I know, that's
> what
> > everyone wants.
> >
> > Cephfs is an important choise because:
> >  - there can be multiple mountpoints, thus multip

[ceph-users] dovecot + cephfs - sdbox vs mdbox

2018-05-16 Thread Webert de Souza Lima
I'm sending this message to both dovecot and ceph-users ML so please don't
mind if something seems too obvious for you.

Hi,

I have a question for both dovecot and ceph lists and below I'll explain
what's going on.

Regarding dbox format (https://wiki2.dovecot.org/MailboxFormat/dbox), when
using sdbox, a new file is stored for each email message.
When using mdbox, multiple messages are appended to a single file until it
reaches/passes the rotate limit.

I would like to understand better how the mdbox format impacts on IO
performance.
I think it's generally expected that fewer, larger files translate to less IO
and more throughput compared to many small files, but how does dovecot
handle that with mdbox?
If dovecot flushes data to storage every time a new email arrives and is
appended to the corresponding file, would that mean it generates the same
amount of IO as it would with one file per message?
Also, with mdbox many messages will be appended to a given file before a
new file is created. That should mean a file descriptor is kept open
for some time by the dovecot process.
Using cephfs as backend, how would this impact cluster performance
regarding MDS caps and inodes cached when files from thousands of users are
opened and appended all over?

I would like to understand this better.

Why?
We are a small Business Email Hosting provider with bare-metal, self-hosted
systems, using dovecot for servicing mailboxes and cephfs for email storage.

We are currently working on dovecot and storage redesign to be in
production ASAP. The main objective is to serve more users with better
performance, high availability and scalability.
* high availability and load balancing is extremely important to us *

On our current model, we're using mdbox format with dovecot, having
dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored
in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).
All using cephfs / filestore backend.

Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel
(10.2.9-4).
 - ~25K users from a few thousand domains per cluster
 - ~25TB of email data per cluster
 - ~70GB of dovecot INDEX [meta]data per cluster
 - ~100MB of cephfs metadata per cluster

Our goal is to build a single ceph cluster for storage that could expand in
capacity, be highly available and perform well enough. I know, that's what
everyone wants.

Cephfs is an important choice because:
 - there can be multiple mountpoints, thus multiple dovecot instances on
different hosts
 - the same storage backend is used for all dovecot instances
 - no need of sharding domains
 - dovecot is easily load balanced (with director sticking users to the
same dovecot backend)

On the upcoming upgrade we intent to:
 - upgrade ceph to 12.X (Luminous)
 - drop the SSD Cache Tier (because it's deprecated)
 - use bluestore engine

I was told on freenode/#dovecot that there are many cases where SDBOX would
perform better with NFS sharing.
In the case of cephfs, at first I wouldn't think that would be true because
more files == more generated IO, but considering what I said at the
beginning regarding sdbox vs mdbox, that could be wrong.

Any thoughts will be highly appreciated.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-15 Thread Webert de Souza Lima
I'm sorry, I wouldn't know; I'm on Jewel.
Is your cluster HEALTH_OK now?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Sun, May 13, 2018 at 6:29 AM Marc Roos  wrote:

>
> In luminous
> osd_recovery_threads = osd_disk_threads ?
> osd_recovery_sleep = osd_recovery_sleep_hdd ?
>
> Or is this speeding up recovery, a lot different in luminous?
>
> [@~]# ceph daemon osd.0 config show | grep osd | grep thread
> "osd_command_thread_suicide_timeout": "900",
> "osd_command_thread_timeout": "600",
> "osd_disk_thread_ioprio_class": "",
> "osd_disk_thread_ioprio_priority": "-1",
> "osd_disk_threads": "1",
> "osd_op_num_threads_per_shard": "0",
> "osd_op_num_threads_per_shard_hdd": "1",
> "osd_op_num_threads_per_shard_ssd": "2",
> "osd_op_thread_suicide_timeout": "150",
> "osd_op_thread_timeout": "15",
>     "osd_peering_wq_threads": "2",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_remove_thread_suicide_timeout": "36000",
> "osd_remove_thread_timeout": "3600",
>
> -Original Message-
> From: Webert de Souza Lima [mailto:webert.b...@gmail.com]
> Sent: vrijdag 11 mei 2018 20:34
> To: ceph-users
> Subject: Re: [ceph-users] Node crash, filesytem not usable
>
> This message seems to be very concerning:
>  >mds0: Metadata damage detected
>
>
> but for the rest, the cluster still seems to be recovering. you could
> try to speed things up with ceph tell, like:
>
> ceph tell osd.* injectargs --osd_max_backfills=10
>
> ceph tell osd.* injectargs --osd_recovery_sleep=0.0
>
> ceph tell osd.* injectargs --osd_recovery_threads=2
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
>
> On Fri, May 11, 2018 at 3:06 PM Daniel Davidson
>  wrote:
>
>
> Below id the information you were asking for.  I think they are
> size=2, min size=1.
>
> Dan
>
> # ceph status
> cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>  health HEALTH_ERR
> 140 pgs are stuck inactive for more than 300 seconds
> 64 pgs backfill_wait
> 76 pgs backfilling
> 140 pgs degraded
> 140 pgs stuck degraded
> 140 pgs stuck inactive
> 140 pgs stuck unclean
> 140 pgs stuck undersized
> 140 pgs undersized
> 210 requests are blocked > 32 sec
> recovery 38725029/695508092 objects degraded (5.568%)
> recovery 10844554/695508092 objects misplaced (1.559%)
> mds0: Metadata damage detected
> mds0: Behind on trimming (71/30)
> noscrub,nodeep-scrub flag(s) set
>  monmap e3: 4 mons at
> {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:
> 6789/0,ceph-3=172.16.31.4:6789/0}
> election epoch 824, quorum 0,1,2,3
> ceph-0,ceph-1,ceph-2,ceph-3
>   fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>  osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> flags
> noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> 1444 TB used, 1011 TB / 2455 TB avail
> 38725029/695508092 objects degraded (5.568%)
> 10844554/695508092 objects misplaced (1.559%)
> 1396 active+clean
>   76
> undersized+degraded+remapped+backfilling+peered
>   64
> undersized+degraded+remapped+wait_backfill+peered
> recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159 host ceph-0
>   0   81.86079 osd.0 up  1.0  1.0
>   1   81.86079 osd.1 up  1.0  1.0
>  -3  163.72159 host ceph-1
>   2   81.86079 osd.2

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-14 Thread Webert de Souza Lima
On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER 
wrote:

> The documentation (luminous) say:
>


> >mds cache size
> >
> >Description:The number of inodes to cache. A value of 0 indicates an
> unlimited number. It is recommended to use mds_cache_memory_limit to limit
> the amount of memory the MDS cache uses.
> >Type:   32-bit Integer
> >Default:0
> >

and, my mds_cache_memory_limit is currently at 5GB.


yeah I have only suggested that because the high memory usage seemed to
trouble you and it might be a bug, so it's more of a workaround.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
Thanks David.
Although you mentioned this was introduced with Luminous, it's working with
Jewel.

~# ceph osd pool stats

Fri May 11 17:41:39 2018

pool rbd id 5
  client io 505 kB/s rd, 3801 kB/s wr, 46 op/s rd, 27 op/s wr

pool rbd_cache id 6
  client io 2538 kB/s rd, 3070 kB/s wr, 601 op/s rd, 758 op/s wr
  cache tier io 12225 kB/s flush, 0 op/s promote, 3 PG(s) flushing

pool cephfs_metadata id 7
  client io 2233 kB/s rd, 2260 kB/s wr, 95 op/s rd, 587 op/s wr

pool cephfs_data_ssd id 8
  client io 1126 kB/s rd, 94897 B/s wr, 33 op/s rd, 42 op/s wr

pool cephfs_data id 9
  client io 0 B/s rd, 11203 kB/s wr, 12 op/s rd, 12 op/s wr

pool cephfs_data_cache id 10
  client io 4383 kB/s rd, 550 kB/s wr, 57 op/s rd, 39 op/s wr
  cache tier io 7012 kB/s flush, 4399 kB/s evict, 11 op/s promote
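It also seems to accept a single pool name here, e.g.:

    ceph osd pool stats cephfs_metadata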


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 5:14 PM David Turner  wrote:

> `ceph osd pool stats` with the option to specify the pool you are
> interested in should get you the breakdown of IO per pool.  This was
> introduced with luminous.
>
> On Fri, May 11, 2018 at 2:39 PM Webert de Souza Lima <
> webert.b...@gmail.com> wrote:
>
>> I think ceph doesn't have IO metrics will filters by pool right? I see IO
>> metrics from clients only:
>>
>> ceph_client_io_ops
>> ceph_client_io_read_bytes
>> ceph_client_io_read_ops
>> ceph_client_io_write_bytes
>> ceph_client_io_write_ops
>>
>> and pool "byte" metrics, but not "io":
>>
>> ceph_pool(write/read)_bytes(_total)
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> *Belo Horizonte - Brasil*
>> *IRC NICK - WebertRLZ*
>>
>> On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Hey Jon!
>>>
>>> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>>>
>>>> It depends on the metadata intensity of your workload.  It might be
>>>> quite interesting to gather some drive stats on how many IOPS are
>>>> currently hitting your metadata pool over a week of normal activity.
>>>>
>>>
>>> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
>>> sure what I should be looking at).
>>> My current SSD disks have 2 partitions.
>>>  - One is used for cephfs cache tier pool,
>>>  - The other is used for both:  cephfs meta-data pool and cephfs
>>> data-ssd (this is an additional cephfs data pool with only ssds with file
>>> layout for a specific direcotory to use it)
>>>
>>> Because of this, iostat shows me peaks of 12k IOPS in the metadata
>>> partition, but this could definitely be IO for the data-ssd pool.
>>>
>>>
>>>> If you are doing large file workloads, and the metadata mostly fits in
>>>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>>>> the other hand, if you're doing random metadata reads from a small
>>>> file workload where the metadata does not fit in RAM, almost every
>>>> client read could generate a read operation, and each MDS could easily
>>>> generate thousands of ops per second.
>>>>
>>>
>>> I have yet to measure it the right way but I'd assume my metadata fits
>>> in RAM (a few 100s of MB only).
>>>
>>> This is an email hosting cluster with dozens of thousands of users so
>>> there are a lot of random reads and writes, but not too many small files.
>>> Email messages are concatenated together in files up to 4MB in size
>>> (when a rotation happens).
>>> Most user operations are dovecot's INDEX operations and I will keep
>>> index directory in a SSD-dedicaded pool.
>>>
>>>
>>>
>>>> Isolating metadata OSDs is useful if the data OSDs are going to be
>>>> completely saturated: metadata performance will be protected even if
>>>> clients are hitting the data OSDs hard.
>>>>
>>>
>>> This seems to be the case.
>>>
>>>
>>>> If "heavy write" means completely saturating the cluster, then sharing
>>>> the OSDs is risky.  If "heavy write" just means that there are more
>>>> writes than reads, then it may be fine if the metadata workload is not
>>>> heavy enough to make good use of SSDs.
>>>>
>>>
>>> Saturarion will only happen in peak workloads, not often. By heavy write
>>> 

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-11 Thread Webert de Souza Lima
I think ceph doesn't have IO metrics with filters by pool, right? I see IO
metrics from clients only:

ceph_client_io_ops
ceph_client_io_read_bytes
ceph_client_io_read_ops
ceph_client_io_write_bytes
ceph_client_io_write_ops

and pool "byte" metrics, but not "io":

ceph_pool(write/read)_bytes(_total)
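One workaround I'm considering (only a sketch; field names are from memory,
so they may need checking) is scraping the per-pool client IO as JSON myself:

    ceph osd pool stats -f json | jq '.[] | {pool: .pool_name, io: .client_io_rate}'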

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Wed, May 9, 2018 at 2:23 PM Webert de Souza Lima 
wrote:

> Hey Jon!
>
> On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:
>
>> It depends on the metadata intensity of your workload.  It might be
>> quite interesting to gather some drive stats on how many IOPS are
>> currently hitting your metadata pool over a week of normal activity.
>>
>
> Any ceph built-in tool for this? maybe ceph daemonperf (altoght I'm not
> sure what I should be looking at).
> My current SSD disks have 2 partitions.
>  - One is used for cephfs cache tier pool,
>  - The other is used for both:  cephfs meta-data pool and cephfs data-ssd
> (this is an additional cephfs data pool with only ssds with file layout for
> a specific direcotory to use it)
>
> Because of this, iostat shows me peaks of 12k IOPS in the metadata
> partition, but this could definitely be IO for the data-ssd pool.
>
>
>> If you are doing large file workloads, and the metadata mostly fits in
>> RAM, then the number of IOPS from the MDS can be very, very low.  On
>> the other hand, if you're doing random metadata reads from a small
>> file workload where the metadata does not fit in RAM, almost every
>> client read could generate a read operation, and each MDS could easily
>> generate thousands of ops per second.
>>
>
> I have yet to measure it the right way but I'd assume my metadata fits in
> RAM (a few 100s of MB only).
>
> This is an email hosting cluster with dozens of thousands of users so
> there are a lot of random reads and writes, but not too many small files.
> Email messages are concatenated together in files up to 4MB in size (when
> a rotation happens).
> Most user operations are dovecot's INDEX operations and I will keep index
> directory in a SSD-dedicaded pool.
>
>
>
>> Isolating metadata OSDs is useful if the data OSDs are going to be
>> completely saturated: metadata performance will be protected even if
>> clients are hitting the data OSDs hard.
>>
>
> This seems to be the case.
>
>
>> If "heavy write" means completely saturating the cluster, then sharing
>> the OSDs is risky.  If "heavy write" just means that there are more
>> writes than reads, then it may be fine if the metadata workload is not
>> heavy enough to make good use of SSDs.
>>
>
> Saturarion will only happen in peak workloads, not often. By heavy write I
> mean there are much more writes than reads, yes.
> So I think I can start sharing the OSDs, if I think this is impacting
> performance I can just change the ruleset and move metadata to a SSD-only
> pool, right?
>
>
>> The way I'd summarise this is: in the general case, dedicated SSDs are
>> the safe way to go -- they're intrinsically better suited to metadata.
>> However, in some quite common special cases, the overall number of
>> metadata ops is so low that the device doesn't matter.
>
>
>
> Thank you very much John!
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node crash, filesytem not usable

2018-05-11 Thread Webert de Souza Lima
This message seems to be very concerning:
 >mds0: Metadata damage detected

but for the rest, the cluster still seems to be recovering. You could try
to speed things up with ceph tell, like:

ceph tell osd.* injectargs --osd_max_backfills=10
ceph tell osd.* injectargs --osd_recovery_sleep=0.0
ceph tell osd.* injectargs --osd_recovery_threads=2


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:06 PM Daniel Davidson 
wrote:

> Below id the information you were asking for.  I think they are size=2,
> min size=1.
>
> Dan
>
> # ceph status
> cluster
> 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
>
>  health
> HEALTH_ERR
>
> 140 pgs are stuck inactive for more than 300 seconds
> 64 pgs backfill_wait
> 76 pgs backfilling
> 140 pgs degraded
> 140 pgs stuck degraded
> 140 pgs stuck inactive
> 140 pgs stuck unclean
> 140 pgs stuck undersized
> 140 pgs undersized
> 210 requests are blocked > 32 sec
> recovery 38725029/695508092 objects degraded (5.568%)
> recovery 10844554/695508092 objects misplaced (1.559%)
> mds0: Metadata damage detected
> mds0: Behind on trimming (71/30)
> noscrub,nodeep-scrub flag(s) set
>  monmap e3: 4 mons at {ceph-0=
> 172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0
> }
> election epoch 824, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
>   fsmap e144928: 1/1/1 up {0=ceph-0=up:active}, 1 up:standby
>  osdmap e35814: 32 osds: 30 up, 30 in; 140 remapped pgs
> flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v43142427: 1536 pgs, 2 pools, 762 TB data, 331 Mobjects
> 1444 TB used, 1011 TB / 2455 TB avail
> 38725029/695508092 objects degraded (5.568%)
> 10844554/695508092 objects misplaced (1.559%)
> 1396 active+clean
>   76 undersized+degraded+remapped+backfilling+peered
>   64 undersized+degraded+remapped+wait_backfill+peered
> recovery io 1244 MB/s, 1612 keys/s, 705 objects/s
>
> ID  WEIGHT TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -1 2619.54541 root default
>  -2  163.72159 host ceph-0
>   0   81.86079 osd.0 up  1.0  1.0
>   1   81.86079 osd.1 up  1.0  1.0
>  -3  163.72159 host ceph-1
>   2   81.86079 osd.2 up  1.0  1.0
>   3   81.86079 osd.3 up  1.0  1.0
>  -4  163.72159 host ceph-2
>   8   81.86079 osd.8 up  1.0  1.0
>   9   81.86079 osd.9 up  1.0  1.0
>  -5  163.72159 host ceph-3
>  10   81.86079 osd.10up  1.0  1.0
>  11   81.86079 osd.11up  1.0  1.0
>  -6  163.72159 host ceph-4
>   4   81.86079 osd.4 up  1.0  1.0
>   5   81.86079 osd.5 up  1.0  1.0
>  -7  163.72159 host ceph-5
>   6   81.86079 osd.6 up  1.0  1.0
>   7   81.86079 osd.7 up  1.0  1.0
>  -8  163.72159 host ceph-6
>  12   81.86079 osd.12up  0.7  1.0
>  13   81.86079 osd.13up  1.0  1.0
>  -9  163.72159 host ceph-7
>  14   81.86079 osd.14up  1.0  1.0
>  15   81.86079 osd.15up  1.0  1.0
> -10  163.72159 host ceph-8
>  16   81.86079 osd.16up  1.0  1.0
>  17   81.86079 osd.17up  1.0  1.0
> -11  163.72159 host ceph-9
>  18   81.86079 osd.18up  1.0  1.0
>  19   81.86079 osd.19up  1.0  1.0
> -12  163.72159 host ceph-10
>  20   81.86079 osd.20up  1.0  1.0
>  21   81.86079 osd.21up  1.0  1.0
> -13  163.72159 host ceph-11
>  22   81.86079 osd.22up  1.0  1.0
>  23   81.86079 osd.23up  1.0  1.0
> -14  163.72159 host ceph-12
>  24   81.86079 osd.24up  1.0  1.0
>  25   81.86079 osd.25up  1.0  1.0
> -15  163.72159 host ceph-13
>  26   81.86079 osd.26  down0  1.0
>  27   81.86079 osd.27  down0  1.0
> -16  163.72159 host ceph-14
>  28   81.86079 osd.28up  1.0  1.0
>  29   81.86079 osd.29up  1.0  1.0
> -17  163.72159 host ceph-15
>  30   81.86079 osd.30up 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-11 Thread Webert de Souza Lima
You could use "mds_cache_size" to limit the number of caps until you have this
fixed, but I'd say for your number of caps and inodes, 20GB is normal.
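If you want to try it, something along these lines (daemon name is a
placeholder and the value is only an example):

    ceph tell mds.<name> injectargs '--mds_cache_size=1500000'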

this mds (jewel) here is consuming 24GB RAM:

{
"mds": {
"request": 7194867047,
"reply": 7194866688,
"reply_latency": {
"avgcount": 7194866688,
"sum": 27779142.611775008
},
"forward": 0,
"dir_fetch": 179223482,
"dir_commit": 1529387896,
"dir_split": 0,
"inode_max": 300,
"inodes": 3001264,
"inodes_top": 160517,
"inodes_bottom": 226577,
"inodes_pin_tail": 2614170,
"inodes_pinned": 2770689,
"inodes_expired": 2920014835,
"inodes_with_caps": 2743194,
"caps": 2803568,
"subtrees": 2,
"traverse": 8255083028,
"traverse_hit": 7452972311,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 180547123,
"traverse_remote_ino": 122257,
"traverse_lock": 5957156,
"load_cent": 18446743934203149911,
"q": 54,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*


On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER 
wrote:

> Hi,
>
> I'm still seeing memory leak with 12.2.5.
>
> seem to leak some MB each 5 minutes.
>
> I'll try to resent some stats next weekend.
>
>
> - Mail original -
> De: "Patrick Donnelly" 
> À: "Brady Deetz" 
> Cc: "Alexandre Derumier" , "ceph-users" <
> ceph-users@lists.ceph.com>
> Envoyé: Jeudi 10 Mai 2018 21:11:19
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?
>
> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz  wrote:
> > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> > ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32
> > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup
> ceph
> >
> >
> > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> > {
> > "pool": {
> > "items": 173261056,
> > "bytes": 76504108600
> > }
> > }
> >
> > So, 80GB is my configured limit for the cache and it appears the mds is
> > following that limit. But, the mds process is using over 100GB RAM in my
> > 128GB host. I thought I was playing it safe by configuring at 80. What
> other
> > things consume a lot of RAM for this process?
> >
> > Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade
> ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
>
> --
> Patrick Donnelly
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] howto: multiple ceph filesystems

2018-05-11 Thread Webert de Souza Lima
Basically what we're trying to figure out looks like what is being done
here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020958.html

But instead of using LIBRADOS to store emails directly in RADOS, we're
still using CEPHFS for it; we're just figuring out whether it makes sense
to separate them into different workloads.
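If we stay on CephFS, the kind of separation I'm thinking of is directory
pinning once we're on Luminous with multiple active MDSs, roughly like this
(paths and ranks are only illustrative):

    # keep the dovecot index tree on one MDS rank, the mail tree on another
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/index
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/mail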


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, May 11, 2018 at 2:07 AM, Marc Roos  wrote:

>
>
> If I would like to use an erasurecode pool for a cephfs directory how
> would I create these placement rules?
>
>
>
>
> -Original Message-
> From: David Turner [mailto:drakonst...@gmail.com]
> Sent: vrijdag 11 mei 2018 1:54
> To: João Paulo Sacchetto Ribeiro Bastos
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] howto: multiple ceph filesystems
>
> Another option you could do is to use a placement rule. You could create
> a general pool for most data to go to and a special pool for specific
> folders on the filesystem. Particularly I think of a pool for replica vs
> EC vs flash for specific folders in the filesystem.
>
> If the pool and OSDs wasn't the main concern for multiple filesystems
> and the mds servers are then you could have multiple active mds servers
> and pin the metadata for the indexes to one of them while the rest is
> served by the other active mds servers.
>
> I really haven't come across a need for multiple filesystems in ceph
> with the type of granularity you can achieve with mds pinning, folder
> placement rules, and cephx authentication to limit a user to a specific
> subfolder.
>
>
> On Thu, May 10, 2018, 5:10 PM João Paulo Sacchetto Ribeiro Bastos
>  wrote:
>
>
> Hey John, thanks for you answer. For sure the hardware robustness
> will be nice enough. My true concern was actually the two FS ecosystem
> coexistence. In fact I realized that we may not use this as well because
> it may be represent a high overhead, despite the fact that it's a
> experiental feature yet.
>
> On Thu, 10 May 2018 at 15:48 John Spray  wrote:
>
>
> On Thu, May 10, 2018 at 7:38 PM, João Paulo Sacchetto
> Ribeiro
> Bastos
>  wrote:
> > Hello guys,
> >
> > My company is about to rebuild its whole infrastructure,
> so
> I was called in
> > order to help on the planning. We are essentially an
> corporate mail
> > provider, so we handle daily lots of clients using
> dovecot
> and roundcube and
> > in order to do so we want to design a better plant of
> our
> cluster. Today,
> > using Jewel, we have a single cephFS for both index and
> mail
> from dovecot,
> > but we want to split it into an index_FS and a mail_FS
> to
> handle the
> > workload a little better, is it profitable nowadays?
> From my
> research I
> > realized that we will need data and metadata individual
> pools for each FS
> > such as a group of MDS for each of then, also.
> >
> > The one thing that really scares me about all of this
> is: we
> are planning to
> > have four machines at full disposal to handle our MDS
> instances. We started
> > to think if an idea like the one below is valid, can
> anybody
> give a hint on
> > this? We basically want to handle two MDS instances on
> each
> machine (one for
> > each FS) and wonder if we'll be able to have them
> swapping
> between active
> > and standby simultaneously without any trouble.
> >
> > index_FS: (active={machines 1 and 3}, standby={machines
> 2
> and 4})
> > mail_FS: (active={machines 2 and 4}, standby={machines 1
> and
> 3})
>
> Nothing wrong with that setup, but remember that those
> servers
> are
> going to have to be well-resourced enough to run all four
> at
> once
> (when a failure occurs), so it might not matter very much
> exactly
> which servers are running which daemons.
>
> With a filesystem's MDS daemons (i.e. daemons with the same
> standby_for_fscid setting), Ceph will activate whichever
> daemon comes
> up first, so if it's important to you to have particular
> daemons
> active then you would need to take care of that at the
> point
> you're
> starting them up.
>
> John
>
> >
> > Regards,
> > --
> >
> > João Paulo Sacchetto Ribeiro Bastos
> > +55 31 99279-7092
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-us

Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hey Jon!

On Wed, May 9, 2018 at 12:11 PM, John Spray  wrote:

> It depends on the metadata intensity of your workload.  It might be
> quite interesting to gather some drive stats on how many IOPS are
> currently hitting your metadata pool over a week of normal activity.
>

Any ceph built-in tool for this? Maybe ceph daemonperf (although I'm not
sure what I should be looking at).
My current SSD disks have 2 partitions.
 - One is used for the cephfs cache tier pool,
 - The other is used for both the cephfs metadata pool and cephfs data-ssd
(an additional cephfs data pool with only SSDs, with a file layout for
a specific directory to use it)

Because of this, iostat shows me peaks of 12k IOPS in the metadata
partition, but this could definitely be IO for the data-ssd pool.


> If you are doing large file workloads, and the metadata mostly fits in
> RAM, then the number of IOPS from the MDS can be very, very low.  On
> the other hand, if you're doing random metadata reads from a small
> file workload where the metadata does not fit in RAM, almost every
> client read could generate a read operation, and each MDS could easily
> generate thousands of ops per second.
>

I have yet to measure it the right way but I'd assume my metadata fits in
RAM (a few 100s of MB only).

This is an email hosting cluster with tens of thousands of users, so there
are a lot of random reads and writes, but not too many small files.
Email messages are concatenated together in files up to 4MB in size (when a
rotation happens).
Most user operations are dovecot's INDEX operations, and I will keep the
index directory in an SSD-dedicated pool.



> Isolating metadata OSDs is useful if the data OSDs are going to be
> completely saturated: metadata performance will be protected even if
> clients are hitting the data OSDs hard.
>

This seems to be the case.


> If "heavy write" means completely saturating the cluster, then sharing
> the OSDs is risky.  If "heavy write" just means that there are more
> writes than reads, then it may be fine if the metadata workload is not
> heavy enough to make good use of SSDs.
>

Saturation will only happen in peak workloads, not often. By heavy write I
mean there are many more writes than reads, yes.
So I think I can start by sharing the OSDs; if I find this is impacting
performance I can just change the ruleset and move metadata to an SSD-only
pool, right?
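(For the record, the move I have in mind would be roughly this once we're on
Luminous with device classes; the rule name is a placeholder:)

    ceph osd crush rule create-replicated metadata-ssd default host ssd
    ceph osd pool set cephfs_metadata crush_rule metadata-ssd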


> The way I'd summarise this is: in the general case, dedicated SSDs are
> the safe way to go -- they're intrinsically better suited to metadata.
> However, in some quite common special cases, the overall number of
> metadata ops is so low that the device doesn't matter.



Thank you very much John!
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
I'm sorry I have mixed up some information. The actual ratio I have now
is 0,0005% (*100MB for 20TB data*).


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Wed, May 9, 2018 at 11:32 AM, Webert de Souza Lima  wrote:

> Hello,
>
> Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used
> for cephfs-metadata, and HDD-only pools for cephfs-data. The current
> metadata/data ratio is something like 0,25% (50GB metadata for 20TB data).
>
> Regarding bluestore architecture, assuming I have:
>
>  - SSDs for WAL+DB
>  - Spinning Disks for bluestore data.
>
> would you recommend still store metadata in SSD-Only OSD nodes?
> If not, is it recommended to *dedicate* some OSDs (Spindle+SSD for
> WAL/DB) for cephfs-metadata?
>
> If I just have 2 pools (metadata and data) all sharing the same OSDs in
> the cluster, would it be enough for heavy-write cases?
>
> Assuming min_size=2, size=3.
>
> Thanks for your thoughts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question: CephFS + Bluestore

2018-05-09 Thread Webert de Souza Lima
Hello,

Currently, I run Jewel + Filestore for cephfs, with SSD-only pools used for
cephfs-metadata, and HDD-only pools for cephfs-data. The current
metadata/data ratio is something like 0,25% (50GB metadata for 20TB data).

Regarding bluestore architecture, assuming I have:

 - SSDs for WAL+DB
 - Spinning Disks for bluestore data.

would you still recommend storing metadata on SSD-only OSD nodes?
If not, is it recommended to *dedicate* some OSDs (Spindle+SSD for WAL/DB)
for cephfs-metadata?

If I just have 2 pools (metadata and data) all sharing the same OSDs in the
cluster, would it be enough for heavy-write cases?

Assuming min_size=2, size=3.

Thanks for your thoughts.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can't get MDS running after a power outage

2018-03-29 Thread Webert de Souza Lima
I'd also try booting up only one MDS until it's fully up and running, not
both of them.
Sometimes they keep switching states between each other.
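For example, something like this on the second MDS host (the unit name
depends on your mds id, so treat it as a placeholder):

    systemctl stop ceph-mds@node02
    ceph -s        # wait until the fsmap shows the remaining MDS as up:active
    systemctl start ceph-mds@node02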


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Mar 29, 2018 at 7:32 AM, John Spray  wrote:

> On Thu, Mar 29, 2018 at 8:16 AM, Zhang Qiang 
> wrote:
> > Hi,
> >
> > Ceph version 10.2.3. After a power outage, I tried to start the MDS
> > deamons, but they stuck forever replaying journals, I had no idea why
> > they were taking that long, because this is just a small cluster for
> > testing purpose with only hundreds MB data. I restarted them, and the
> > error below was encountered.
>
> Usually if an MDS is stuck in replay, it's because it's waiting for
> the OSDs to service the reads of the journal.  Are all your PGs up and
> healthy?
>
> >
> > Any chance I can restore them?
> >
> > Mar 28 14:20:30 node01 systemd: Started Ceph metadata server daemon.
> > Mar 28 14:20:30 node01 systemd: Starting Ceph metadata server daemon...
> > Mar 28 14:20:30 node01 ceph-mds: 2018-03-28 14:20:30.796255
> > 7f0150c8c180 -1 deprecation warning: MDS id 'mds.0' is invalid and
> > will be forbidden in a future version.  MDS names may not start with a
> > numeric digit.
>
> If you're really using "0" as an MDS name, now would be a good time to
> fix that -- most people use a hostname or something like that.  The
> reason that numeric MDS names are invalid is that it makes commands
> like "ceph mds fail 0" ambiguous (do we mean the name 0 or the rank
> 0?).
>
> > Mar 28 14:20:30 node01 ceph-mds: starting mds.0 at :/0
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: In function 'const
> > entity_inst_t MDSMap::get_inst(mds_rank_t)' thread 7f014ac6c700 time
> > 2018-03-28 14:20:30.942480
> > Mar 28 14:20:30 node01 ceph-mds: ./mds/MDSMap.h: 582: FAILED
> assert(up.count(m))
> > Mar 28 14:20:30 node01 ceph-mds: ceph version 10.2.3
> > (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> > Mar 28 14:20:30 node01 ceph-mds: 1: (ceph::__ceph_assert_fail(char
> > const*, char const*, int, char const*)+0x85) [0x7f01512aba45]
> > Mar 28 14:20:30 node01 ceph-mds: 2: (MDSMap::get_inst(int)+0x20f)
> > [0x7f0150ee5e3f]
> > Mar 28 14:20:30 node01 ceph-mds: 3:
> > (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x7b9)
> > [0x7f0150ed6e49]
>
> This is a weird assertion.  I can't see how it could be reached :-/
>
> John
>
> > Mar 28 14:20:30 node01 ceph-mds: 4:
> > (MDSDaemon::handle_mds_map(MMDSMap*)+0xe3d) [0x7f0150eb396d]
> > Mar 28 14:20:30 node01 ceph-mds: 5:
> > (MDSDaemon::handle_core_message(Message*)+0x7b3) [0x7f0150eb4eb3]
> > Mar 28 14:20:30 node01 ceph-mds: 6:
> > (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x7f0150eb514b]
> > Mar 28 14:20:30 node01 ceph-mds: 7: (DispatchQueue::entry()+0x78a)
> > [0x7f01513ad4aa]
> > Mar 28 14:20:30 node01 ceph-mds: 8:
> > (DispatchQueue::DispatchThread::entry()+0xd) [0x7f015129098d]
> > Mar 28 14:20:30 node01 ceph-mds: 9: (()+0x7dc5) [0x7f0150095dc5]
> > Mar 28 14:20:30 node01 ceph-mds: 10: (clone()+0x6d) [0x7f014eb61ced]
> > Mar 28 14:20:30 node01 ceph-mds: NOTE: a copy of the executable, or
> > `objdump -rdS ` is needed to interpret this.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS very unstable with many small files

2018-02-25 Thread Stijn De Weirdt
hi,

can you give some more details on the setup? number and size of osds.
are you using EC or not? and if so, what EC parameters?

thanks,

stijn

On 02/26/2018 08:15 AM, Linh Vu wrote:
> Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the 
> OSD nodes have 128GB each. Networking is 2x25Gbe.
> 
> 
> We are on luminous 12.2.1, bluestore, and use CephFS for HPC, with about 
> 500-ish compute nodes. We have done stress testing with small files up to 2M 
> per directory as part of our acceptance testing, and encountered no problem.
> 
> 
> From: ceph-users  on behalf of Oliver 
> Freyermuth 
> Sent: Monday, 26 February 2018 3:45:59 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] CephFS very unstable with many small files
> 
> Dear Cephalopodians,
> 
> in preparation for production, we have run very successful tests with large 
> sequential data,
> and just now a stress-test creating many small files on CephFS.
> 
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 6 
> hosts with 32 OSDs each, running in EC k=4 m=2.
> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 
> 12.2.3.
> There are (at the moment) only two MDS's, one is active, the other standby.
> 
> For the test, we had 1120 client processes on 40 client machines (all 
> cephfs-fuse!) extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) each into a 
> separate subdirectory.
> 
> Things started out rather well (but expectedly slow), we had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
> 
> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for 
> metadata) and so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
> 
> After a while, we tested MDS failover and realized we entered a flip-flop 
> situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
> 
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, 
> which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably fast, 
> but join took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
> 
> I stopped half of the stress-test tar's, which did not help - then I rebooted 
> half of the clients, which did help and let the MDS recover just fine.
> So it seems the client caps have been too many for the MDS to handle. I'm 
> unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
> Now, I only lost some "stress test data", but later, it might be user's 
> data...
> 
> 
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, <24 hours later, before backfill has 
> finished, several OSD hosts enter OOM condition.
> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
> default bluestore cache size of 1 GB. However, it seems the processes are 
> using much more,
> up to several GBs until memory is exhausted. They then become sluggish, are 
> kicked out of the cluster, come back, and finally at some point they are 
> OOMed.
> 
> Now, I have restarted some OSD processes and hosts which helped to reduce the 
> memory usage - but now I have some OSDs crashing continously,
> leading to PG unavailability, and preventing recovery from completion.
> I have reported a ticket about that, with stacktrace and log:
> http://tracker.ceph.com/issues/23120
> This might well be a consequence of a previous OOM killer condition.
> 
> However, my final question after these ugly experiences is:
> Did somebody ever stresstest CephFS for many small files?
> Are those issues known? Can special configuration help?
> Are the memory issues known? Are there solutions?
> 
> We don't plan to use Ceph for many small files, but we don't have full 
> control of our users, which is why we wanted to test this "worst case" 
> scenario.
> It would be really bad if we lost a production filesystem due to such a 
> situation, so the plan was to test now to know what happens before we enter 
> production.
> As of now, this looks really bad, and I'm not sure the cluster will ever 
> recover.
> I'll give it some more time, but we'll likely kill off all remaining clients 
> next week and see what happens, and worst case recreate the Ceph cluster.
> 
> Cheers,
> Oliver
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/

Re: [ceph-users] CephFS very unstable with many small files

2018-02-25 Thread Stijn De Weirdt
hi oliver,

>>> in preparation for production, we have run very successful tests with large 
>>> sequential data,
>>> and just now a stress-test creating many small files on CephFS. 
>>>
>>> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool with 
>>> 6 hosts with 32 OSDs each, running in EC k=4 m=2. 
>>> Compression is activated (aggressive, snappy). All Bluestore, LVM, Luminous 
>>> 12.2.3. 
(this is all afaik;) so with EC k=4, small files get cut into 4 smaller
parts. i'm not sure when the compression is applied, but your small
files might be very small files before they get cut into 4 tiny parts. this
might become pure iops wrt performance.
with filestore (and without compression), this was quite awful. we have
not retested with bluestore yet, but in the end a disk is just a disk.
writing 1 file results in 6 disk writes, so you need a lot of iops and/or
disks.
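(as a rough illustration of the numbers: with k=4 m=2 every file write fans
out into 6 chunk writes, so e.g. 1000 small files/s from the clients already
means on the order of 6000 write ops/s hitting the hdds, before any metadata
or compaction traffic)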

<...>

>>> In parallel, I had reinstalled one OSD host. 
>>> It was backfilling well, but now, <24 hours later, before backfill has 
>>> finished, several OSD hosts enter OOM condition. 
>>> Our OSD-hosts have 64 GB of RAM for 32 OSDs, which should be fine with the 
>>> default bluestore cache size of 1 GB. However, it seems the processes are 
>>> using much more,
>>> up to several GBs until memory is exhausted. They then become sluggish, are 
>>> kicked out of the cluster, come back, and finally at some point they are 
>>> OOMed. 
>>
>> 32GB RAM for MDS, 64GB RAM for 32 OSDs per node looks very low on memory 
>> requirements for the scale you are trying. what are the size of each osd 
>> device?
>> Could you also dump osd tree + more cluster info in the tracker you raised, 
>> so that one could try to recreate at a lower scale and check.
> 
> Done! 
> All HDD-OSDs have 4 TB, while the SSDs used for the metadata pool have 240 
> GB. 
the rule of thumb is 1GB per 1 TB. that is a lot (and imho one of the
bad things about ceph, but i'm not complaining ;)
most of the time this memory will not be used except for cache, but eg
recovery is one of the cases where it is used, and thus needed.

i have no idea what the real requirements are (i assume there's some
fixed amount per OSD and the rest is linear(?) with volume). so you can
try to use some softraid on the disks to reduce the number of OSDs per
host; but i doubt that the fixed part is over 50%, so you will probably
end up with having to add some memory or not use certain disks. i don't
know if you can limit the amount of volume per disk, eg only use 2TB of
a 4TB disk, because then you can keep the iops.
stijn

> We had initially planned to use something more lightweight on CPU and RAM 
> (BeeGFS or Lustre),
> but since we encountered serious issues with BeeGFS, have some bad past 
> experience with Lustre (but it was an old version)
> and were really happy with the self-healing features of Ceph which also 
> allows us to reinstall OSD-hosts if we do an upgrade without having a 
> downtime,
> we have decided to repurpose the hardware. For this reason, the RAM is not 
> really optimized (yet) for Ceph. 
> We will try to adapt hardware now as best as possible. 
> 
> Are there memory recommendations for a setup of this size? Anything's 
> welcome. 
> 
> Cheers and thanks!
>   Oliver
> 
>>
>>>
>>> Now, I have restarted some OSD processes and hosts which helped to reduce 
>>> the memory usage - but now I have some OSDs crashing continously,
>>> leading to PG unavailability, and preventing recovery from completion. 
>>> I have reported a ticket about that, with stacktrace and log:
>>> http://tracker.ceph.com/issues/23120
>>> This might well be a consequence of a previous OOM killer condition. 
>>>
>>> However, my final question after these ugly experiences is: 
>>> Did somebody ever stresstest CephFS for many small files? 
>>> Are those issues known? Can special configuration help? 
>>> Are the memory issues known? Are there solutions? 
>>>
>>> We don't plan to use Ceph for many small files, but we don't have full 
>>> control of our users, which is why we wanted to test this "worst case" 
>>> scenario. 
>>> It would be really bad if we lost a production filesystem due to such a 
>>> situation, so the plan was to test now to know what happens before we enter 
>>> production. 
>>> As of now, this looks really bad, and I'm not sure the cluster will ever 
>>> recover. 
>>> I'll give it some more time, but we'll likely kill off all remaining 
>>> clients next week and see what happens, and worst case recreate the Ceph 
>>> cluster. 
>>>
>>> Cheers,
>>> Oliver
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users

Re: [ceph-users] Ceph Bluestore performance question

2018-02-18 Thread Stijn De Weirdt
hi oliver,

the IPoIB network is not 56gb, it's probably a lot less (20gb or so).
the ib_write_bw test is verbs/rdma based. do you have iperf tests
between hosts, and if so, can you share those results?
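something like this between two of the hosts would already help (hostname is
a placeholder; plain iperf works the same way):

    # on host A
    iperf3 -s
    # on host B
    iperf3 -c hostA -P 4 -t 30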

stijn

> we are just getting started with our first Ceph cluster (Luminous 12.2.2) and 
> doing some basic benchmarking. 
> 
> We have two pools:
> - cephfs_metadata, living on 4 SSD devices (each is a bluestore OSD, 240 GB) 
> on 2 hosts (i.e. 2 SSDs each), setup as:
>   - replicated, min size 2, max size 4
>   - 128 PGs
> - cephfs_data, living on 6 hosts each of which has the following setup:
>   - 32 HDD drives (4 TB) each of which is a bluestore OSD, the LSI controller 
> to which they are attached is in JBOD personality
>   - 2 SSD drives, each has 16 partitions with 7 GB per partition, used as 
> block-db by the bluestore OSDs living on the HDDs. 
>   - Created with:
> ceph osd erasure-code-profile set cephfs_data k=4 m=2 
> crush-device-class=hdd crush-failure-domain=host
> ceph osd pool create cephfs_data 2048 2048 erasure cephfs_data
>   - So to summarize: 192 OSDs, 2048 PGs, each OSD has 4 TB data + 7 GB 
> block-db
> 
> The interconnect (public and cluster network) 
> is made via IP over Infiniband (56 GBit bandwidth), using the software stack 
> that comes with CentOS 7. 
> 
> This leaves us with the possibility that one of the metadata-hosts can fail, 
> and still one of the disks can fail. 
> For the data hosts, up to two machines total can fail. 
> 
> We have 40 clients connected to this cluster. We now run something like:
> dd if=/dev/zero of=some_file bs=1M count=1
> on each CPU core of each of the clients, yielding a total of 1120 writing 
> processes (all 40 clients have 28+28HT cores),
> using the ceph-fuse client. 
> 
> This yields a write throughput of a bit below 1 GB/s (capital B), which is 
> unexpectedly low. 
> Running a BeeGFS on the same cluster before (disks were in RAID 6 in that 
> case) yielded throughputs of about 12 GB/s,
> but came with other issues (e.g. it's not FOSS...), so we'd love to run Ceph 
> :-). 
> 
> I performed some basic tests to try to understand the bottleneck for Ceph:
> # rados bench -p cephfs_data 10 write --no-cleanup -t 40
> Bandwidth (MB/sec): 695.952
> Stddev Bandwidth:   295.223
> Max bandwidth (MB/sec): 1088
> Min bandwidth (MB/sec): 76
> Average IOPS:   173
> Stddev IOPS:73
> Max IOPS:   272
> Min IOPS:   19
> Average Latency(s): 0.220967
> Stddev Latency(s):  0.305967
> Max latency(s): 2.88931
> Min latency(s): 0.0741061
> 
> => This agrees mostly with our basic dd benchmark. 
> 
> Reading is a bit faster:
> # rados bench -p cephfs_data 10 rand
> => Bandwidth (MB/sec):   1108.75
> 
> However, the disks are reasonably quick:
> # ceph tell osd.0 bench
> {
> "bytes_written": 1073741824,
> "blocksize": 4194304,
> "bytes_per_sec": 331850403
> }
> 
> I checked and the OSD-hosts peaked at a load average of about 22 (they have 
> 24+24HT cores) in our dd benchmark,
> but stayed well below that (only about 20 % per OSD daemon) in the rados 
> bench test. 
> One idea would be to switch from jerasure to ISA, since the machines are all 
> Intel CPUs only anyways. 
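
[As an illustration of the ISA route: the profile and pool names below are made
up, and since an existing pool's EC profile cannot be changed in place, data
would have to be copied into a new pool.]

ceph osd erasure-code-profile set cephfs_data_isa k=4 m=2 plugin=isa \
    crush-device-class=hdd crush-failure-domain=host
ceph osd pool create cephfs_data_isa 2048 2048 erasure cephfs_data_isa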
> 
> Already tried: 
> - TCP stack tuning (wmem, rmem), no huge effect. 
> - changing the block sizes used by dd, no effect. 
> - Testing network throughput with ib_write_bw, this revealed something like:
>  #bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
>  2       5000         19.73            19.30               10.118121
>  4       5000         52.79            51.70               13.553412
>  8       5000         101.23           96.65               12.668371
>  16      5000         243.66           233.42              15.297583
>  32      5000         350.66           344.73              11.296089
>  64      5000         909.14           324.85              5.322323
>  128     5000         1424.84          1401.29             11.479374
>  256     5000         2865.24          2801.04             11.473055
>  512     5000         5169.98          5095.08             10.434733
>  1024    5000         10022.75         9791.42             10.026410
>  2048    5000         10988.64         10628.83            5.441958
>  4096    5000         11401.40         11399.14            2.918180
> [...]
> 
> So it seems the IP-over-Infiniband is not the bottleneck (BeeGFS was using 
> RDMA). 
> Other ideas that come to mind:
> - Testing with Ceph-RDMA, but that does not seem production-ready yet, if I 
> read the list correctly. 
> - Increasing osd_pool_erasure_code_stripe_width. 
> - Using ISA as EC plugin. 
> - Reducing the bluestore_cache_size_hdd, it seems when recovery + benchmark 
> is ongoing, swap is used (but not when perfo

Re: [ceph-users] Luminous 12.2.2 OSDs with Bluestore crashing randomly

2018-01-31 Thread Alessandro De Salvo

Hi Greg,

many thanks. This is a new cluster created initially with luminous 
12.2.0. I'm not sure the instructions on jewel really apply to my case 
too, and all the machines have ntp enabled, but I'll have a look, many 
thanks for the link. All machines are set to CET, although I'm running 
in docker containers which are using UTC internally, but they are all 
consistent.


At the moment, after setting 5 of the osds out the cluster resumed, and 
now I'm recreating those osds to be on the safe side.
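
For reference, a rough sketch of such an out/recreate cycle for a single OSD
(the osd id and device are placeholders, and exact ceph-volume usage depends
on the 12.2.x minor release):

ceph osd out 2
systemctl stop ceph-osd@2
ceph osd purge 2 --yes-i-really-mean-it      # removes the OSD from crush, auth and the osdmap
ceph-volume lvm zap /dev/sdX                 # wipe the old data device
ceph-volume lvm create --bluestore --data /dev/sdX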


Thanks,


    Alessandro


On 31/01/18 19:26, Gregory Farnum wrote:
On Tue, Jan 30, 2018 at 5:49 AM Alessandro De Salvo 
<alessandro.desa...@roma1.infn.it> wrote:


Hi,

we have several times a day different OSDs running Luminous 12.2.2 and
Bluestore crashing with errors like this:


starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2
/var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082
log_to_monitors
{default=true}

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
In function 'void
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
int)'
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
12819: FAILED assert(obc)
  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556c6df51550]
  2:
(PrimaryLogPG::hit_set_trim(std::unique_ptr >&, unsigned int)+0x3b6)
[0x556c6db5e106]
  3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389)
[0x556c6db78d39]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
  6: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9)
[0x556c6d9c0899]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x57) [0x556c6dc38897]
  8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
[0x556c6df57069]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
[0x556c6df59000]
  11: (()+0x7e25) [0x7f1e16c17e25]
  12: (clone()+0x6d) [0x7f1e15d0b34d]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
2018-01-30 13:45:29.505317 7f1dfd734700 -1

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
In function 'void
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
int)'
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc:
12819: FAILED assert(obc)

  ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba)
luminous (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x110) [0x556c6df51550]
  2:
(PrimaryLogPG::hit_set_trim(std::unique_ptr >&, unsigned int)+0x3b6)
[0x556c6db5e106]
  3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
  4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389)
[0x556c6db78d39]
  5: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
  6: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9)
[0x556c6d9c0899]
  7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x57) [0x556c6dc38897]
  8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
[0x556c6df57069]
  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
[0x556c6df59000]
  11: (()+0x7e25) [0x7f1e16c17e25]
  12: (clone()+0x6d) [0x7f1e15d0b34d]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.


Is it a known issue? How can we fix that?



Hmm, it looks a lot like http://tracker.ceph.com/issues/19185, but 
that wasn't suppo

[ceph-users] Luminous 12.2.2 OSDs with Bluestore crashing randomly

2018-01-30 Thread Alessandro De Salvo

Hi,

we have several times a day different OSDs running Luminous 12.2.2 and 
Bluestore crashing with errors like this:



starting osd.2 at - osd_data /var/lib/ceph/osd/ceph-2 
/var/lib/ceph/osd/ceph-2/journal
2018-01-30 13:45:28.440883 7f1e193cbd00 -1 osd.2 107082 log_to_monitors 
{default=true}
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 
In function 'void 
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' 
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 
12819: FAILED assert(obc)
 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x556c6df51550]
 2: 
(PrimaryLogPG::hit_set_trim(std::unique_ptrstd::default_delete >&, unsigned int)+0x3b6) 
[0x556c6db5e106]

 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) 
[0x556c6db78d39]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
 6: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) 
[0x556c6d9c0899]
 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x57) [0x556c6dc38897]
 8: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) 
[0x556c6df57069]

 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
 11: (()+0x7e25) [0x7f1e16c17e25]
 12: (clone()+0x6d) [0x7f1e15d0b34d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.
2018-01-30 13:45:29.505317 7f1dfd734700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 
In function 'void 
PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' 
thread 7f1dfd734700 time 2018-01-30 13:45:29.498133
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/osd/PrimaryLogPG.cc: 
12819: FAILED assert(obc)


 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) 
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x110) [0x556c6df51550]
 2: 
(PrimaryLogPG::hit_set_trim(std::unique_ptrstd::default_delete >&, unsigned int)+0x3b6) 
[0x556c6db5e106]

 3: (PrimaryLogPG::hit_set_persist()+0xb67) [0x556c6db61fb7]
 4: (PrimaryLogPG::do_op(boost::intrusive_ptr&)+0x2389) 
[0x556c6db78d39]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0xeba) [0x556c6db368aa]
 6: (OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9) 
[0x556c6d9c0899]
 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr 
const&)+0x57) [0x556c6dc38897]
 8: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0xfce) [0x556c6d9ee43e]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) 
[0x556c6df57069]

 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x556c6df59000]
 11: (()+0x7e25) [0x7f1e16c17e25]
 12: (clone()+0x6d) [0x7f1e15d0b34d]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.



Is it a known issue? How can we fix that?

Thanks,


    Alessandro

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-22 Thread Webert de Souza Lima
Hi,

On Fri, Jan 19, 2018 at 8:31 PM, zhangbingyin 
 wrote:

> 'MAX AVAIL' in the 'ceph df' output represents the amount of data that can
> be used before the first OSD becomes full, and not the sum of all free
> space across a set of OSDs.
>

Thank you very much. I figured this out by the end of the day. That is the
answer. I'm not sure this is in ceph.com docs though.
Now I know the problem is indeed solved (by doing proper reweight).
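
For the record, a hedged sketch of such a reweight step (the 110% threshold is
only an example; the test- variant previews the change without applying it):

ceph osd test-reweight-by-utilization 110   # dry run: report which OSDs would be reweighted
ceph osd reweight-by-utilization 110        # lower the override weight of OSDs above 110% of mean utilization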

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-19 Thread Webert de Souza Lima
While it seemed to be solved yesterday, today the %USED has grown a lot
again. See:

~# ceph osd df tree
http://termbin.com/0zhk

~# ceph df detail
http://termbin.com/thox

94% USED while there is about 21TB worth of data, size = 2 means ~42TB RAW
Usage, but the OSDs in that root sum to ~70TB of available space.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Jan 18, 2018 at 8:21 PM, Webert de Souza Lima  wrote:

> With the help of robbat2 and llua on IRC channel I was able to solve this
> situation by taking down the 2-OSD only hosts.
> After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0,
> ceph df showed the expected storage capacity usage (about 70%)
>
>
> With this in mind, those guys have told me that it is due to the cluster
> being uneven and unable to balance properly. It makes sense and it worked.
> But for me it is still a very unexpected behaviour for ceph to say that
> the pools are 100% full and Available Space is 0.
>
> There were 3 hosts and repl. size = 2, if the host with only 2 OSDs were
> full (it wasn't), ceph could still use space from OSDs from the other hosts.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
> *IRC NICK - WebertRLZ*
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
With the help of robbat2 and llua on IRC channel I was able to solve this
situation by taking down the 2-OSD only hosts.
After crush reweighting OSDs 8 and 23 from host mia1-master-fe02 to 0, ceph
df showed the expected storage capacity usage (about 70%)
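
For reference, the reweighting itself was presumably something along these
lines (sketch only):

ceph osd crush reweight osd.8 0
ceph osd crush reweight osd.23 0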


With this in mind, those guys have told me that it is due to the cluster
being uneven and unable to balance properly. It makes sense and it worked.
But for me it is still a very unexpected behaviour for ceph to say that the
pools are 100% full and Available Space is 0.

There were 3 hosts and repl. size = 2, if the host with only 2 OSDs were
full (it wasn't), ceph could still use space from OSDs from the other hosts.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Hi David, thanks for replying.


On Thu, Jan 18, 2018 at 5:03 PM David Turner  wrote:

> You can have overall space available in your cluster because not all of
> your disks are in the same crush root.  You have multiple roots
> corresponding to multiple crush rulesets.  All pools using crush ruleset 0
> are full because all of the osds in that crush rule are full.
>
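
[A hedged sketch for cross-checking which rule and root a pool maps to;
command names as in Jewel, and the pool name is a placeholder:]

ceph osd pool get <poolname> crush_ruleset   # which rule the pool uses
ceph osd crush rule dump                     # which root/bucket each rule takes
ceph osd df tree                             # per-OSD usage grouped by root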


So I did check this. The usage of the OSDs that belonged to that root
(default) was about 60%.
All the pools using crush ruleset 0 were being shown as 100% full, yet there
was only 1 near-full OSD in that crush rule. That's what is so weird about it.

On Thu, Jan 18, 2018 at 8:05 PM, David Turner  wrote:

> `ceph osd df` is a good command for you to see what's going on.  Compare
> the osd numbers with `ceph osd tree`.
>

I am sorry I forgot to send this output, here it is. I have added 2 OSDs to
that crush root, borrowed from the host mia1-master-ds05, to see if the
available space would go higher, but it didn't.
So adding new OSDs to this didn't take any effect.

ceph osd df tree

ID  WEIGHT   REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS TYPE NAME
 -9 13.5- 14621G  2341G 12279G 16.02 0.31   0 root
databases
 -8  6.5-  7182G   835G  6346G 11.64 0.22   0 host
mia1-master-ds05
 20  3.0  1.0  3463G   380G  3082G 10.99 0.21 260
osd.20
 17  3.5  1.0  3719G   455G  3263G 12.24 0.24 286
osd.17
-10  7.0-  7438G  1505G  5932G 20.24 0.39   0 host
mia1-master-fe01
 21  3.5  1.0  3719G   714G  3004G 19.22 0.37 269
osd.21
 22  3.5  1.0  3719G   791G  2928G 21.27 0.41 295
osd.22
 -3  2.39996-  2830G  1647G  1182G 58.22 1.12   0 root
databases-ssd
 -5  1.19998-  1415G   823G   591G 58.22 1.12   0 host
mia1-master-ds02-ssd
 24  0.3  1.0   471G   278G   193G 58.96 1.14 173
osd.24
 25  0.3  1.0   471G   276G   194G 58.68 1.13 172
osd.25
 26  0.3  1.0   471G   269G   202G 57.03 1.10 167
osd.26
 -6  1.19998-  1415G   823G   591G 58.22 1.12   0 host
mia1-master-ds03-ssd
 27  0.3  1.0   471G   244G   227G 51.87 1.00 152
osd.27
 28  0.3  1.0   471G   281G   190G 59.63 1.15 175
osd.28
 29  0.3  1.0   471G   297G   173G 63.17 1.22 185
osd.29
 -1 71.69997- 76072G 44464G 31607G 58.45 1.13   0 root default
 -2 26.59998- 29575G 17334G 12240G 58.61 1.13   0 host
mia1-master-ds01
  0  3.2  1.0  3602G  1907G  1695G 52.94 1.02  90
osd.0
  1  3.2  1.0  3630G  2721G   908G 74.97 1.45 112
osd.1
  2  3.2  1.0  3723G  2373G  1349G 63.75 1.23  98
osd.2
  3  3.2  1.0  3723G  1781G  1941G 47.85 0.92 105
osd.3
  4  3.2  1.0  3723G  1880G  1843G 50.49 0.97  95
osd.4
  5  3.2  1.0  3723G  2465G  1257G 66.22 1.28 111
osd.5
  6  3.7  1.0  3723G  1722G  2001G 46.25 0.89 109
osd.6
  7  3.7  1.0  3723G  2481G  1241G 66.65 1.29 126
osd.7
 -4  8.5-  9311G  8540G   770G 91.72 1.77   0 host
mia1-master-fe02
  8  5.5  0.7  5587G  5419G   167G 97.00 1.87 189
osd.8
 23  3.0  1.0  3724G  3120G   603G 83.79 1.62 128
osd.23
 -7 29.5- 29747G 17821G 11926G 59.91 1.16   0 host
mia1-master-ds04
  9  3.7  1.0  3718G  2493G  1224G 67.07 1.29 114
osd.9
 10  3.7  1.0  3718G  2454G  1264G 66.00 1.27  90
osd.10
 11  3.7  1.0  3718G  2202G  1516G 59.22 1.14 116
osd.11
 12  3.7  1.0  3718G  2290G  1427G 61.61 1.19 113
osd.12
 13  3.7  1.0  3718G  2015G  1703G 54.19 1.05 112
osd.13
 14  3.7  1.0  3718G  1264G  2454G 34.00 0.66 101
osd.14
 15  3.7  1.0  3718G  2195G  1522G 59.05 1.14 104
osd.15
 16  3.7  1.0  3718G  2905G   813G 78.13 1.51 130
osd.16
-11  7.0-  7438G   768G  6669G 10.33 0.20   0 host
mia1-master-ds05-borrowed-osds
 18  3.5  1.0  3719G   393G  3325G 10.59 0.20 262
osd.18
 19  3.5  1.0  3719G   374G  3344G 10.07 0.19 256
osd.19
TOTAL 93524G 48454G 45069G 51.81
MIN/MAX VAR: 0.19/1.87  STDDEV: 22.02



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Thu, Jan 18, 2018 at 8:05 PM, David Turner  wrote:

> `ceph osd df` is a good command for you to see what's going on.  Compare
> the osd numbers with `ceph osd tree`.
>
>
>>
>> On Thu, Jan 18, 2018 at 3:34 PM Webert de Souza Lima <
>> webert.b...@gmail.com> wrote:
>>
>>> Sorry I forgot, this is a ceph jewel 10.2.10
>>>
>>>
>>> Regards,
>>>
>>> Webert Lima
>>> DevOps Engineer at MAV Tecnologia
>>> *Belo Horizonte - Brasil*
>>> *IRC NICK - WebertRLZ*
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Sorry I forgot, this is a ceph jewel 10.2.10


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Also, there is no quota set for the pools

Here is "ceph osd pool get xxx all": http://termbin.com/ix0n


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph df shows 100% used

2018-01-18 Thread Webert de Souza Lima
Hello,

I'm running a nearly out-of-service radosgw (very slow to write new objects)
and I suspect it's because ceph df is showing 100% usage in some pools,
though I don't know where that information comes from.

Pools:
#~ ceph osd pool ls detail  -> http://termbin.com/lsd0

Crush Rules (important is rule 0)
~# ceph osd crush rule dump ->  http://termbin.com/wkpo

OSD Tree:
~# ceph osd tree -> http://termbin.com/87vt

Ceph DF, which shows 100% Usage:
~# ceph df detail -> http://termbin.com/15mz

Ceph Status, which shows 45600 GB / 93524 GB avail:
~# ceph -s -> http://termbin.com/wycq


Any thoughts?

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-11 Thread Alessandro De Salvo
Hi,
took quite some time to recover the pgs, and indeed the problem with the
mds instances was due to the activating pgs. Once they were cleared the
fs went back to the original state.
I had to restart some OSDs a few times though, in order to get all the
pgs activated. I didn't hit the limit on the max pgs, but I'm close to
it, so I have set it to 300 just to be safe (AFAIK that was the limit
in prior releases of ceph, not sure why it was lowered to 200 now).
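
For reference, a minimal sketch of that override as a ceph.conf entry; on
12.2.x it has to be visible to the mon, mgr and osd daemons, which then need
a restart:

[global]
mon_max_pg_per_osd = 300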
Thanks,

Alessandro

On Tue, 2018-01-09 at 09:01 +0100, Burkhard Linke wrote:
> Hi,
> 
> 
> On 01/08/2018 05:40 PM, Alessandro De Salvo wrote:
> > Thanks Lincoln,
> >
> > indeed, as I said the cluster is recovering, so there are pending ops:
> >
> >
> > pgs: 21.034% pgs not active
> >  1692310/24980804 objects degraded (6.774%)
> >  5612149/24980804 objects misplaced (22.466%)
> >  458 active+clean
> >  329 active+remapped+backfill_wait
> >  159 activating+remapped
> >  100 active+undersized+degraded+remapped+backfill_wait
> >  58  activating+undersized+degraded+remapped
> >  27  activating
> >  22  active+undersized+degraded+remapped+backfilling
> >  6   active+remapped+backfilling
> >  1   active+recovery_wait+degraded
> >
> >
> > If it's just a matter of waiting for the system to complete the recovery 
> > it's fine, I'll deal with that, but I was wondering if there is a 
> > more subtle problem here.
> >
> > OK, I'll wait for the recovery to complete and see what happens, thanks.
> 
> The blocked MDS might be caused by the 'activating' PGs. Do you have a 
> warning about too many PGs per OSD? If that is the case, 
> activating/creating/peering/whatever on the affected OSDs is blocked, 
> which leads to blocked requests etc.
> 
> You can resolve this by increasing the number of allowed PGs per OSD 
> ('mon_max_pg_per_osd'). AFAIK it needs to be set for mon, mgr and osd 
> instances. There has also been some discussion about this setting on the 
> mailing list in the last few weeks.
> 
> Regards,
> Burkhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous: HEALTH_ERR full ratio(s) out of order

2018-01-10 Thread Webert de Souza Lima
Good to know. I don't think this should trigger HEALTH_ERR though, but
HEALTH_WARN makes sense.
It makes sense to keep the backfillfull_ratio greater than nearfull_ratio
as one might need backfilling to avoid OSD getting full on reweight
operations.
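
As far as I know, on Luminous these ratios are kept in the OSDMap rather than
being re-read from ceph.conf, so a sketch of bringing them in line with the
values quoted below would be:

ceph osd set-nearfull-ratio 0.95
ceph osd set-backfillfull-ratio 0.96
ceph osd set-full-ratio 0.97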


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Wed, Jan 10, 2018 at 12:11 PM, Stefan Priebe - Profihost AG <
s.pri...@profihost.ag> wrote:

> Hello,
>
> since upgrading to luminous i get the following error:
>
> HEALTH_ERR full ratio(s) out of order
> OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
> backfillfull_ratio (0.9) < nearfull_ratio (0.95), increased
>
> but ceph.conf has:
>
> mon_osd_full_ratio = .97
> mon_osd_nearfull_ratio = .95
> mon_osd_backfillfull_ratio = .96
> osd_backfill_full_ratio = .96
> osd_failsafe_full_ratio = .98
>
> Any ideas?  i already restarted:
> * all osds
> * all mons
> * all mgrs
>
> Greets,
> Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'lost' cephfs filesystem?

2018-01-10 Thread Webert de Souza Lima
On Wed, Jan 10, 2018 at 12:44 PM, Mark Schouten  wrote:

> > Thanks, that's a good suggestion. Just one question, will this affect
> RBD-
> > access from the same (client)host?


i'm sorry that this didn't help. No, it does not affect rbd clients, as MDS
is related only to cephfs.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'lost' cephfs filesystem?

2018-01-10 Thread Webert de Souza Lima
try to kick out (evict) that cephfs client from the mds node, see
http://docs.ceph.com/docs/master/cephfs/eviction/
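
roughly along these lines (the session id below is a made-up example taken
from a client ls listing):

ceph tell mds.0 client ls                # list sessions and note the "id" of the stuck client
ceph tell mds.0 client evict id=4305     # 4305 is a placeholder session id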


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Wed, Jan 10, 2018 at 12:59 AM, Mark Schouten  wrote:

> Hi,
>
> While upgrading a server with a CephFS mount tonight, it stalled on
> installing
> a new kernel, because it was waiting for `sync`.
>
> I'm pretty sure it has something to do with the CephFS filesystem which
> caused some issues last week. I think the kernel still has a reference to
> the probably lazily-unmounted CephFS filesystem.
> Unmounting the filesystem 'works', which means it is no longer available,
> but
> the unmount-command seems to be waiting for sync() as well. Mounting the
> filesystem again doesn't work either.
>
> I know the simple solution is to just reboot the server, but the server
> holds
> quite a lot of VM's and Containers, so I'd prefer to fix this without a
> reboot.
>
> Anybody with some clever ideas? :)
>
> --
> Kerio Operator in de Cloud? https://www.kerioindecloud.nl/
> Mark Schouten  | Tuxis Internet Engineering
> KvK: 61527076  | http://www.tuxis.nl/
> T: 0318 200208 | i...@tuxis.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-08 Thread Alessandro De Salvo

Thanks Lincoln,

indeed, as I said the cluster is recovering, so there are pending ops:


    pgs: 21.034% pgs not active
 1692310/24980804 objects degraded (6.774%)
 5612149/24980804 objects misplaced (22.466%)
 458 active+clean
 329 active+remapped+backfill_wait
 159 activating+remapped
 100 active+undersized+degraded+remapped+backfill_wait
 58  activating+undersized+degraded+remapped
 27  activating
 22  active+undersized+degraded+remapped+backfilling
 6   active+remapped+backfilling
 1   active+recovery_wait+degraded


If it's just a matter of waiting for the system to complete the recovery 
it's fine, I'll deal with that, but I was wondering if there is a 
more subtle problem here.


OK, I'll wait for the recovery to complete and see what happens, thanks.

Cheers,


    Alessandro


On 08/01/18 17:36, Lincoln Bryant wrote:

Hi Alessandro,

What is the state of your PGs? Inactive PGs have blocked CephFS
recovery on our cluster before. I'd try to clear any blocked ops and
see if the MDSes recover.
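
For example, something along these lines (the osd id is a placeholder):

ceph pg dump_stuck inactive              # PGs that are not active (activating, peering, ...)
ceph health detail                       # shows which requests/OSDs are slow or blocked
ceph daemon osd.$ID dump_ops_in_flight   # inspect the ops stuck on a given OSD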

--Lincoln

On Mon, 2018-01-08 at 17:21 +0100, Alessandro De Salvo wrote:

Hi,

I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded.

I have 2 active mds instances and 1 standby. All the active
instances
are now in replay state and show the same error in the logs:


 mds1 

2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),
process
(unknown), pid 164
starting mds.mds1 at -
2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore
empty
--pid-file
2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map
standby
2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map
i
am now mds.1.20635
2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635
handle_mds_map
state change up:boot --> up:replay
2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635  recovery set
is 0
2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635  waiting for
osdmap 21467 (which blacklists prior instance)


 mds2 

2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),
process
(unknown), pid 295
starting mds.mds2 at -
2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore
empty
--pid-file
2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map
standby
2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map
i
am now mds.0.20637
2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637
handle_mds_map
state change up:boot --> up:replay
2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637  recovery set
is 1
2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637  waiting for
osdmap 21467 (which blacklists prior instance)
2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating
system
inode with ino:0x100
2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating
system
inode with ino:0x1

The cluster is recovering as we are changing some of the osds, and
there
are a few slow/stuck requests, but I'm not sure if this is the cause,
as
there is apparently no data loss (until now).

How can I force the MDSes to quit the replay state?

Thanks for any help,


  Alessandro


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs degraded on ceph luminous 12.2.2

2018-01-08 Thread Alessandro De Salvo

Hi,

I'm running on ceph luminous 12.2.2 and my cephfs suddenly degraded.

I have 2 active mds instances and 1 standby. All the active instances 
are now in replay state and show the same error in the logs:



 mds1 

2018-01-08 16:04:15.765637 7fc2e92451c0  0 ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process 
(unknown), pid 164

starting mds.mds1 at -
2018-01-08 16:04:15.785849 7fc2e92451c0  0 pidfile_write: ignore empty 
--pid-file

2018-01-08 16:04:20.168178 7fc2e1ee1700  1 mds.mds1 handle_mds_map standby
2018-01-08 16:04:20.278424 7fc2e1ee1700  1 mds.1.20635 handle_mds_map i 
am now mds.1.20635
2018-01-08 16:04:20.278432 7fc2e1ee1700  1 mds.1.20635 handle_mds_map 
state change up:boot --> up:replay

2018-01-08 16:04:20.278443 7fc2e1ee1700  1 mds.1.20635 replay_start
2018-01-08 16:04:20.278449 7fc2e1ee1700  1 mds.1.20635  recovery set is 0
2018-01-08 16:04:20.278458 7fc2e1ee1700  1 mds.1.20635  waiting for 
osdmap 21467 (which blacklists prior instance)



 mds2 

2018-01-08 16:04:16.870459 7fd8456201c0  0 ceph version 12.2.2 
(cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process 
(unknown), pid 295

starting mds.mds2 at -
2018-01-08 16:04:16.881616 7fd8456201c0  0 pidfile_write: ignore empty 
--pid-file

2018-01-08 16:04:21.274543 7fd83e2bc700  1 mds.mds2 handle_mds_map standby
2018-01-08 16:04:21.314438 7fd83e2bc700  1 mds.0.20637 handle_mds_map i 
am now mds.0.20637
2018-01-08 16:04:21.314459 7fd83e2bc700  1 mds.0.20637 handle_mds_map 
state change up:boot --> up:replay

2018-01-08 16:04:21.314479 7fd83e2bc700  1 mds.0.20637 replay_start
2018-01-08 16:04:21.314492 7fd83e2bc700  1 mds.0.20637  recovery set is 1
2018-01-08 16:04:21.314517 7fd83e2bc700  1 mds.0.20637  waiting for 
osdmap 21467 (which blacklists prior instance)
2018-01-08 16:04:21.393307 7fd837aaf700  0 mds.0.cache creating system 
inode with ino:0x100
2018-01-08 16:04:21.397246 7fd837aaf700  0 mds.0.cache creating system 
inode with ino:0x1


The cluster is recovering as we are changing some of the osds, and there 
are a few slow/stuck requests, but I'm not sure if this is the cause, as 
there is apparently no data loss (until now).


How can I force the MDSes to quit the replay state?

Thanks for any help,


    Alessandro


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-05 Thread Stijn De Weirdt
or do it live https://access.redhat.com/articles/3311301

# echo 0 > /sys/kernel/debug/x86/pti_enabled
# echo 0 > /sys/kernel/debug/x86/ibpb_enabled
# echo 0 > /sys/kernel/debug/x86/ibrs_enabled

stijn

On 01/05/2018 12:54 PM, David wrote:
> Hi!
> 
> nopti or pti=off in kernel options should disable some of the kpti.
> I haven't tried it yet though, so give it a whirl.
> 
> https://en.wikipedia.org/wiki/Kernel_page-table_isolation 
> <https://en.wikipedia.org/wiki/Kernel_page-table_isolation>
> 
> Kind Regards,
> 
> David Majchrzak
> 
> 
>> On 5 Jan 2018, at 11:03, Xavier Trilla wrote:
>>
>> Hi Nick,
>>
>> I'm actually wondering about exactly the same. Regarding OSDs, I agree, 
>> there is no reason to apply the security patch to the machines running the 
>> OSDs -if they are properly isolated in your setup-.
>>
>> But I'm worried about the hypervisors, as I don't know how meltdown or 
>> Spectre patches -AFAIK, only Spectre patch needs to be applied to the host 
>> hypervisor, Meltdown patch only needs to be applied to guest- will affect 
>> librbd performance in the hypervisors. 
>>
>> Does anybody have some information about how Meltdown or Spectre affect ceph 
>> OSDs and clients? 
>>
>> Also, regarding Meltdown patch, seems to be a compilation option, meaning 
>> you could build a kernel without it easily.
>>
>> Thanks,
>> Xavier. 
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Nick 
>> Fisk
>> Sent: Thursday, 4 January 2018 17:30
>> To: 'ceph-users' 
>> Subject: [ceph-users] Linux Meltdown (KPTI) fix and how it affects 
>> performance?
>>
>> Hi All,
>>
>> As the KPTI fix largely only affects performance where there are a large 
>> number of syscalls made, which Ceph does a lot of, I was wondering if 
>> anybody has had a chance to perform any initial tests. I suspect small write 
>> latencies will be the worst affected?
>>
>> Although I'm thinking the backend Ceph OSDs shouldn't really be at risk 
>> from these vulnerabilities, due to them not being directly user-facing, and 
>> could have this workaround disabled?
>>
>> Nick
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs stuck in "active+undersized+degraded+remapped+backfill_wait", recovery speed is extremely slow

2018-01-03 Thread ignaqui de la fila
Hello all,

I have a ceph Luminous setup with filestore and bluestore OSDs. This cluster
was deployed initially as Hammer, then I upgraded it to Jewel and
eventually to Luminous. It's heterogeneous: we have SSDs, SAS 15K and 7.2K
HDDs in it (see crush map attached). Earlier I converted the 7.2K HDDs from
filestore to bluestore without any problem. After converting two SSDs from
filestore to bluestore I ended up with the following warning:

   cluster:
id: 089d3673-5607-404d-9351-2d4004043966
health: HEALTH_WARN
Degraded data redundancy: 12566/4361616 objects degraded
(0.288%), 6 pgs unclean,
6 pgs degraded, 6 pgs undersized
10 slow requests are blocked > 32 sec

  services:
mon: 3 daemons, quorum 2,1,0
mgr: tw-dwt-prx-03(active), standbys: tw-dwt-prx-05, tw-dwt-prx-07
osd: 92 osds: 92 up, 92 in; 6 remapped pgs

  data:
pools:   3 pools, 1024 pgs
objects: 1419k objects, 5676 GB
usage:   17077 GB used, 264 TB / 280 TB avail
pgs: 12566/4361616 objects degraded (0.288%)
 1018 active+clean
 4active+undersized+degraded+remapped+backfill_wait
 2active+undersized+degraded+remapped+backfilling
  io:

client:   1567 kB/s rd, 2274 kB/s wr, 67 op/s rd, 186 op/s wr

# rados df
POOL_NAME  USED  OBJECTS CLONES  COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED   RD_OPS    RD    WR_OPS     WR
sas_sata    556G  142574      0  427722                  0       0        0 48972431  478G 207803733  3035G
sata_only  1939M     491      0    1473                  0       0        0     3302 5003k     17170  2108M
ssd_sata   5119G 1311028      0 3933084                  0       0    12549 46982011 2474G 620926839 24962G

total_objects1454093
total_used   17080G
total_avail  264T
total_space  280T

# ceph pg dump_stuck
ok
PG_STAT STATE                                              UP        UP_PRIMARY ACTING  ACTING_PRIMARY
22.ac   active+undersized+degraded+remapped+backfilling   [6,28,62]          6 [28,62]             28
22.85   active+undersized+degraded+remapped+backfilling   [7,43,62]          7 [43,62]             43
22.146  active+undersized+degraded+remapped+backfill_wait [7,48,46]          7 [46,48]             46
22.4f   active+undersized+degraded+remapped+backfill_wait [7,59,58]          7 [58,59]             58
22.d8   active+undersized+degraded+remapped+backfill_wait [7,48,46]          7 [46,48]             46
22.60   active+undersized+degraded+remapped+backfill_wait [7,50,34]          7 [34,50]             34

 The pool I have a problem with has replicas on SSDs and 7.2K HDDs, with
primary affinity set to 1 for SSD and 0 for HDD.
 All clients eventually ceased to operate; recovery speed is 1-2 objects
per minute (which would take more than a week to recover 12500 objects).
Another pool works fine.

How can I speed up the recovery process?
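
One common, hedged approach is to temporarily raise the backfill/recovery
limits and dial them back once recovery finishes; the values here are only
illustrative:

ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'
# verify on one daemon:
ceph daemon osd.$ID config show | grep -E 'osd_max_backfills|osd_recovery_max_active'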

Thank you,
Ignaqui
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)

2017-12-22 Thread Webert de Souza Lima
On Thu, Dec 21, 2017 at 12:52 PM, shadow_lin  wrote:
>
> After 18:00 suddenly the write throughput dropped and the osd latency
> increased. TCMalloc started reclaiming the page heap freelist much more
> frequently. All of this happened very fast and every osd had the identical
> pattern.
>
Could that be caused by OSD scrub?  Check your "osd_scrub_begin_hour"

  ceph daemon osd.$ID config show | grep osd_scrub


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS locations

2017-12-22 Thread Webert de Souza Lima
it depends on how you use it. for me, it runs fine on the OSD hosts but the
mds server consumes loads of RAM, so be aware of that.
if the system load average goes too high due to osd disk utilization, the
MDS server might run into trouble too, as a delayed response from the host
could cause the MDS to be marked as down.
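
if colocating, a hedged ceph.conf sketch for capping the MDS cache (the
option exists from Luminous on, the value is just an example, and actual RSS
will still run somewhat above it):

[mds]
mds_cache_memory_limit = 4294967296   # ~4 GiB of metadata cache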


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*

On Fri, Dec 22, 2017 at 5:24 AM, nigel davies  wrote:

> Hey all
>
> Is it ok to set up mds on the same servers that host the osd's, or should
> they be on different servers?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

