[ceph-users] v12.2.0 bluestore - OSD down/crash "internal heartbeat not healthy, dropping ping request"

2017-09-19 Thread nokia ceph
Hello,

Env: RHEL 7.2, 3.10.0-327.el7.x86_64, EC 4+1, bluestore

We are writing to Ceph via the librados C API. Testing with rados directly shows no issues.

We tested the same workload with Jewel/Kraken without any issue. Could you advise on
how to debug this issue?


OSD.log
==

~~~

2017-09-18 14:51:59.895746 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.068824 seconds old, received at 2017-09-18 14:50:59.826849:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.895749 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.068737 seconds old, received at 2017-09-18 14:50:59.826936:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=0'0, committed=0, applied=1)) currently queued_for_pg
2017-09-18 14:51:59.895754 7f1e744e0700  0 log_channel(cluster) log [WRN] :
slow request 60.067539 seconds old, received at 2017-09-18 14:50:59.828134:
MOSDECSubOpWriteReply(1.132s0 1350/1344 ECSubWriteReply(tid=971,
last_complete=1350'153, committed=1, applied=0)) currently queued_for_pg
2017-09-18 14:51:59.923825 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
2017-09-18 14:51:59.923835 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
2017-09-18 14:51:59.923837 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
2017-09-18 14:51:59.923840 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
2017-09-18 14:51:59.923842 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
2017-09-18 14:51:59.940780 7f1e77ca5700 20 osd.181 1350 share_map_peer
0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:51:59.940855 7f1e78ca7700 20 osd.181 1350 share_map_peer
0x7f1e8dbf2800 already has epoch 1350
2017-09-18 14:52:00.081390 7f1e6f572700 20 osd.181 1350 OSD::ms_dispatch:
ping magic: 0 v1
2017-09-18 14:52:00.081393 7f1e6f572700 10 osd.181 1350 do_waiters -- start
2017-09-18 14:52:00.081394 7f1e6f572700 10 osd.181 1350 do_waiters -- finish
2017-09-18 14:52:00.081395 7f1e6f572700 20 osd.181 1350 _dispatch
0x7f1e90923a40 ping magic: 0 v1
2017-09-18 14:52:00.081397 7f1e6f572700 10 osd.181 1350 ping from
client.414556
2017-09-18 14:52:00.123908 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1359 k (1083 k + 276 k)
2017-09-18 14:52:00.123926 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1066 k (1066 k + 0 )
2017-09-18 14:52:00.123932 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 643 k (643 k + 0 )
2017-09-18 14:52:00.123937 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 1049 k (1049 k + 0 )
2017-09-18 14:52:00.123942 7f1e71cdb700 10 trim shard target 102 M
meta/data ratios 0.5 + 0 (52428 k + 0 ),  current 896 k (896 k + 0 )
2017-09-18 14:52:00.145445 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.145450 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60
2017-09-18 14:52:00.145496 7f1e784a6700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e63cbf700' had timed out after 60
2017-09-18 14:52:00.145534 7f1e784a6700 10 osd.181 1350 internal heartbeat
not healthy, dropping ping request
2017-09-18 14:52:00.146224 7f1e78ca7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e61cbb700' had timed out after 60
2017-09-18 14:52:00.146226 7f1e78ca7700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f1e624bc700' had timed out after 60

~~~
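For reference, a minimal sketch of commands that can help here: checking which
op-thread timeouts are in effect and capturing more detail while the hang happens.
osd.181 is taken from the log above; adjust the id to your environment.

~~~
# Check the op-thread timeout settings currently in effect on the OSD
# (these are the timeouts behind the "had timed out after 60" messages):
ceph daemon osd.181 config show | egrep 'osd_op_thread_timeout|osd_op_thread_suicide_timeout'

# Look at the slowest recent operations to see where they get stuck:
ceph daemon osd.181 dump_historic_ops

# Temporarily raise OSD/BlueStore debug levels while reproducing the hang:
ceph tell osd.181 injectargs '--debug_osd 20 --debug_bluestore 20'
~~~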

 thread apply all bt

Thread 54 (LWP 479360):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 53 (LWP 484888):
#0  0x7f1e7a644b7d in ?? ()
#1  0x in ?? ()

Thread 52 (LWP 484177):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x000a in ?? ()
#2  0x7f1e88d8df98 in ?? ()
#3  0x7f1e88d8df48 in ?? ()
#4  0x000a in ?? ()
#5  0x7f1e5ccaf7f8 in ?? ()
#6  0x7f1e7e45b9ee in ?? ()
#7  0x7f1e88d8d860 in ?? ()
#8  0x7f1e8e6e5500 in ?? ()
#9  0x7f1e889881c0 in ?? ()
#10 0x7f1e7e3e9ea0 in ?? ()
#11 0x in ?? ()

Thread 51 (LWP 484176):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 50 (LWP 484175):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

Thread 49 (LWP 484174):
#0  0x7f1e7b5606d5 in ?? ()
#1  0x in ?? ()

~~~

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD assert hit suicide timeout

2017-09-19 Thread Jordan Share
We had suicide timeouts, but unfortunately I can't remember the specific
root cause at this point.

It was definitely one of two things:
   * too low of net.netfilter.nf_conntrack_max (preventing the osds from
opening new connections to each other)
   * too low of kernel.pid_max or kernel.threads-max (preventing new
threads from starting)

I am pretty sure we hit the pid_max early on, but the conntrack_max didn't
cause trouble until we had enough VMs running to push the total number of
connections past the default limit.
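For anyone hitting the same limits, a rough sketch of how they could be checked
and raised follows; the file name and values are illustrative, not recommendations.

~~~
# Check the current limits:
sysctl net.netfilter.nf_conntrack_max kernel.pid_max kernel.threads-max

# Raise them persistently (size the numbers for your own cluster):
cat > /etc/sysctl.d/90-ceph-limits.conf <<'EOF'
net.netfilter.nf_conntrack_max = 1048576
kernel.pid_max = 4194303
kernel.threads-max = 2097152
EOF
sysctl --system
~~~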

Our cluster is approximately the same size as yours.

Jordan

On Tue, Sep 19, 2017 at 3:05 PM, Stanley Zhang 
wrote:

> We don't use EC pools, but my experience with similar slow requests on
> RGW+replicated_pools is that in the logs you need to find out the first
> slow request and identify where it's from, for example, is it deep-scrub,
> or some client accessing corrupted objects, disk errors etc.
>
> On 20/09/17 8:13 AM, David Turner wrote:
>
> Just starting 3 nights ago we started seeing OSDs randomly going down in
> our cluster (Jewel 10.2.7).  At first I saw that each OSD that was recently
> marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s' |
> sort -nrk11` sorted list of OSDs by which OSDs have been marked down in the
> most recent OSD map epochs) and all of them had been wrongly marked down.
> Prior to this there is a lot of evidence of slow requests, OSD op thread
> timing out, Filestore op thread timing out, and other errors.  At the
> bottom of the email is an excerpt of such errors.  It is definitely not a
> comprehensive log of these errors.
>
> The map epoch of the last time the OSD is marked down in the OSD map
> matches when it logs that it was wrongly marked down.  After thousands more
> lines of op threads timing out and slow requests, the OSD finally asserts
> with "hit suicide timeout".  The first 2 nights this was localized to 2
> osds on the same host, but last night this happened on an osd that is on a
> second host.  2 of the 3 OSDs have hit this 3 times and the other has hit
> it twice.
>
> There are 15 OSD nodes and 224 total OSDs in the cluster.  So far only
> these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas of
> what to do next?  Or have any information that is missing here to be able
> to understand things further?  It's too intermittent to run the OSD with
> log level 20.
>
> 2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
> 2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
> 2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : 891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
> 2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 150.985487 seconds old, received at 2017-09-19
> 02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
> 4267, reqid=client.24351228.0:139832730, at_version=48121'1955383,
> trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
> 2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 150.510718 seconds old, received at 2017-09-19
> 02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
> 7590, reqid=client.24892528.0:36410074, at_version=48121'1960640,
> trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
> 2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 149.177285 seconds old, received at 2017-09-19
> 02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
> 93253, reqid=client.24892528.0:36410582, at_version=48121'1964351,
> trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 147.359320 seconds old, received at 2017-09-19
> 02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
> 587, reqid=client.18422767.0:329511916, at_version=48121'1965073,
> trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log [WRN]
> : slow request 146.404241 seconds old, received at 2017-09-19
> 02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
> 4291, reqid=client.18422767.0:329512018, at_version=48121'1955389,
> trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently
> queued_for_pg
> 2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 10.10.113.29:6826/3332721
> >> 10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268
> cs=1 l=0 c=0x7fe01b44be00).connect got RESETSESSION
> 2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721 >>
> 10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 

Re: [ceph-users] OSD assert hit suicide timeout

2017-09-19 Thread Stanley Zhang
We don't use EC pools, but my experience with similar slow requests on 
RGW+replicated_pools is that in the logs you need to find out the first 
slow request and identify where it's from, for example, is it 
deep-scrub, or some client accessing corrupted objects, disk errors etc.



On 20/09/17 8:13 AM, David Turner wrote:
Just starting 3 nights ago we started seeing OSDs randomly going down 
in our cluster (Jewel 10.2.7).  At first I saw that each OSD that was 
recently marked down in the cluster (`ceph osd dump | grep -E 
'^osd\.[0-9]+\s' | sort -nrk11` sorted list of OSDs by which OSDs have 
been marked down in the most recent OSD map epochs) and all of them 
had been wrongly marked down. Prior to this there is a lot of evidence 
of slow requests, OSD op thread timing out, Filestore op thread timing 
out, and other errors.  At the bottom of the email is an excerpt of 
such errors.  It is definitely not a comprehensive log of these errors.


The map epoch of the last time the OSD is marked down in the OSD map 
matches when it logs that it was wrongly marked down.  After thousands 
more lines of op threads timing out and slow requests, the OSD finally 
asserts with "hit suicide timeout".  The first 2 nights this was 
localized to 2 osds on the same host, but last night this happened on 
an osd that is on a second host.  2 of the 3 OSDs have hit this 3 
times and the other has hit it twice.


There are 15 OSD nodes and 224 total OSDs in the cluster. So far only 
these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas 
of what to do next?  Or have any information that is missing here to 
be able to understand things further?  It's too intermittent to run 
the OSD with log level 20.


2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy 
'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : 891 slow requests, 5 included below; oldest blocked for > 
150.987639 secs
2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 150.985487 seconds old, received at 2017-09-19 
02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4267, reqid=client.24351228.0:139832730, at_version=48121'1955383, 
trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 150.510718 seconds old, received at 2017-09-19 
02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
7590, reqid=client.24892528.0:36410074, at_version=48121'1960640, 
trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 149.177285 seconds old, received at 2017-09-19 
02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
93253, reqid=client.24892528.0:36410582, at_version=48121'1964351, 
trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently 
queued_for_pg
2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 147.359320 seconds old, received at 2017-09-19 
02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
587, reqid=client.18422767.0:329511916, at_version=48121'1965073, 
trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently 
queued_for_pg
2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log 
[WRN] : slow request 146.404241 seconds old, received at 2017-09-19 
02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4291, reqid=client.18422767.0:329512018, at_version=48121'1955389, 
trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently 
queued_for_pg
2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 
10.10.113.29:6826/3332721  >> 
10.10.13.32:6822/130263  
pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268 cs=1 l=0 
c=0x7fe01b44be00).connect got RESETSESSION
2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721 
 >> 10.10.13.33:6802/2651380 
 pipe(0x7fe008140800 sd=70 :61127 s=1 
pgs=7059 cs=1 l=0 c=0x7fe01be72600).connect got RESETSESSION
2017-09-19 02:06:55.974657 7fdf78e4e700  0 -- 10.10.13.29:6826/3332721 
 >> 10.10.13.35:6827/477826 
 pipe(0x7fe004aef400 sd=91 :19464 s=1 
pgs=7316 cs=1 l=0 c=0x7fe01b179080).connect got RESETSESSION
2017-09-19 02:06:55.977120 7fdf95310700  0 -- 10.10.13.29:6826/3332721 
 >> 10.10.13.22:6812/2276428 
 pipe(0x7fe003d15400 sd=86 :58892 s=1 
pgs=15908 cs=1 l=0 c=0x7fe01b2e1500).connect got RESETSESSION
2017-09-19 02:06:55.979722 

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Stanley Zhang
I like this; there are some similar ideas we could probably borrow from
Cassandra's handling of disk failure:

# policy for data disk failures:
# die: shut down gossip and Thrift and kill the JVM for any fs errors or
#      single-sstable errors, so the node can be replaced.
# stop_paranoid: shut down gossip and Thrift even for single-sstable errors.
# stop: shut down gossip and Thrift, leaving the node effectively dead, but
#       can still be inspected via JMX.
# best_effort: stop using the failed disk and respond to requests based on
#              remaining available sstables.  This means you WILL see obsolete
#              data at CL.ONE!
# ignore: ignore fatal errors and let requests fail, as in pre-1.2 Cassandra

disk_failure_policy: stop_paranoid

Regards

Stanley


On 19/09/17 9:16 PM, Manuel Lausch wrote:

On Tue, 19 Sep 2017 08:24:48 +,
Adrian Saul wrote:


I understand what you mean and it's indeed dangerous, but see:
https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service

Looking at the systemd docs it's difficult though:
https://www.freedesktop.org/software/systemd/man/systemd.service.ht
ml

If the OSD crashes due to another bug you do want it to restart.

But for systemd it's not possible to see whether the crash was due to a
disk I/O error, a bug in the OSD itself, or maybe the OOM killer
or something.

Perhaps using something like RestartPreventExitStatus and defining a
specific exit code for the OSD to exit on when it is exiting due to
an IO error.

Another idea: the OSD daemon keeps running in a defined error state
and only shuts down its connections to other OSDs and to clients.
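As a purely hypothetical sketch of the RestartPreventExitStatus idea above: it
assumes the OSD were changed to exit with a dedicated status code (42 here) on
fatal disk I/O errors, which is not something ceph-osd does today.

~~~
# Hypothetical: only useful if ceph-osd exited with a dedicated code on EIO.
mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/no-restart-on-eio.conf <<'EOF'
[Service]
RestartPreventExitStatus=42
EOF
systemctl daemon-reload
~~~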




--

*Stanley Zhang | * Senior Operations Engineer
*Telephone:* +64 9 302 0515 *Fax:* +64 9 302 0518
*Mobile:* +64 22 318 3664 *Freephone:* 0800 SMX SMX (769 769)
*SMX Limited:* Level 15, 19 Victoria Street West, Auckland, New Zealand
*Web:* http://smxemail.com
SMX | Cloud Email Hosting & Security


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph fails to recover

2017-09-19 Thread David Turner
Can you please provide the output of `ceph status`, `ceph osd tree`, and
`ceph health detail`?  Thank you.
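For reference, a sketch of the commands typically used to see which PGs are stuck
and why; the PG id below is made up.

~~~
ceph status
ceph health detail
ceph osd tree
ceph pg dump_stuck unclean
ceph pg 1.2f query    # inspect one stuck PG in detail
~~~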

On Tue, Sep 19, 2017 at 2:59 PM Jonas Jaszkowic <
jonasjaszkowic.w...@gmail.com> wrote:

> Hi all,
>
> I have setup a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD
> of size 320GB per host) and 16 clients which are reading
> and writing to the cluster. I have one erasure coded pool (shec plugin)
> with k=8, m=4, c=3 and pg_num=256. Failure domain is host.
> I am able to reach a HEALTH_OK state and everything is working as
> expected. The pool was populated with
> 114048 files of different sizes ranging from 1kB to 4GB. Total amount of
> data in the pool was around 3TB. The capacity of the
> pool was around 10TB.
>
> I want to evaluate how Ceph is rebalancing data in case of an OSD loss
> while clients are still reading. To do so, I am killing one OSD on purpose
> via *ceph osd out  *without adding a new one, i.e. I have 31 OSDs
> left. Ceph seems to notice this failure and starts to rebalance data
> which I can observe with the *ceph -w *command.
>
> However, Ceph failed to rebalance the data. The recovering process seemed
> to be stuck at a random point. I waited more than 12h but the
> number of degraded objects did not reduce and some PGs were stuck. Why is
> this happening? Based on the number of OSDs and the k,m,c values
> there should be enough hosts and OSDs to be able to recover from a single
> OSD failure?
>
> Thank you in advance!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD assert hit suicide timeout

2017-09-19 Thread David Turner
Just starting 3 nights ago we started seeing OSDs randomly going down in
our cluster (Jewel 10.2.7).  At first I saw that each OSD that was recently
marked down in the cluster (`ceph osd dump | grep -E '^osd\.[0-9]+\s' |
sort -nrk11` sorted list of OSDs by which OSDs have been marked down in the
most recent OSD map epochs) and all of them had been wrongly marked down.
Prior to this there is a lot of evidence of slow requests, OSD op thread
timing out, Filestore op thread timing out, and other errors.  At the
bottom of the email is an excerpt of such errors.  It is definitely not a
comprehensive log of these errors.

The map epoch of the last time the OSD is marked down in the OSD map
matches when it logs that it was wrongly marked down.  After thousands more
lines of op threads timing out and slow requests, the OSD finally asserts
with "hit suicide timeout".  The first 2 nights this was localized to 2
osds on the same host, but last night this happened on an osd that is on a
second host.  2 of the 3 OSDs have hit this 3 times and the other has hit
it twice.

There are 15 OSD nodes and 224 total OSDs in the cluster.  So far only
these 3 OSDs have hit this FAILED assert.  Does anyone have any ideas of
what to do next?  Or have any information that is missing here to be able
to understand things further?  It's too intermittent to run the OSD with
log level 20.
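For anyone following along, a sketch of the checks described above; osd.12 is only
an example id.

~~~
# OSDs most recently marked down (same command as above):
ceph osd dump | grep -E '^osd\.[0-9]+\s' | sort -nrk11 | head

# Heartbeat/suicide timeouts currently in effect on a suspect OSD:
ceph daemon osd.12 config show | egrep 'suicide|thread_timeout'

# Slowest recent ops on that OSD, to see where they spend their time:
ceph daemon osd.12 dump_historic_ops
~~~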

2017-09-19 02:06:19.712334 7fdfeae86700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fdfc7235700' had timed out after 15
2017-09-19 02:06:19.712358 7fdfeae86700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fdfde58f700' had timed out after 60
2017-09-19 02:06:55.974099 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
891 slow requests, 5 included below; oldest blocked for > 150.987639 secs
2017-09-19 02:06:55.974114 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
slow request 150.985487 seconds old, received at 2017-09-19
02:04:24.987412: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4267, reqid=client.24351228.0:139832730, at_version=48121'1955383,
trim_to=48111'1952339, trim_rollback_to=48121'1955373)) currently started
2017-09-19 02:06:55.974123 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
slow request 150.510718 seconds old, received at 2017-09-19
02:04:25.462180: MOSDECSubOpWrite(97.4es1 48121 ECSubWrite(tid=540
7590, reqid=client.24892528.0:36410074, at_version=48121'1960640,
trim_to=48111'1957612, trim_rollback_to=48121'1960635)) currently started
2017-09-19 02:06:55.974128 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
slow request 149.177285 seconds old, received at 2017-09-19
02:04:26.795614: MOSDECSubOpWrite(97.1f4s1 48121 ECSubWrite(tid=87
93253, reqid=client.24892528.0:36410582, at_version=48121'1964351,
trim_to=48111'1961282, trim_rollback_to=48121'1964344)) currently
queued_for_pg
2017-09-19 02:06:55.974134 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
slow request 147.359320 seconds old, received at 2017-09-19
02:04:28.613578: MOSDECSubOpWrite(97.8s5 48121 ECSubWrite(tid=3228
587, reqid=client.18422767.0:329511916, at_version=48121'1965073,
trim_to=48111'1962055, trim_rollback_to=48121'1965066)) currently
queued_for_pg
2017-09-19 02:06:55.974139 7fdfe5b11700  0 log_channel(cluster) log [WRN] :
slow request 146.404241 seconds old, received at 2017-09-19
02:04:29.568657: MOSDECSubOpWrite(97.f8s5 48121 ECSubWrite(tid=420
4291, reqid=client.18422767.0:329512018, at_version=48121'1955389,
trim_to=48111'1952339, trim_rollback_to=48121'1955377)) currently
queued_for_pg
2017-09-19 02:06:55.974276 7fdf7aa6a700  0 -- 10.10.113.29:6826/3332721 >>
10.10.13.32:6822/130263 pipe(0x7fe003b1c800 sd=67 :24892 s=1 pgs=7268 cs=1
l=0 c=0x7fe01b44be00).connect got RESETSESSION
2017-09-19 02:06:55.974626 7fdf78141700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.33:6802/2651380 pipe(0x7fe008140800 sd=70 :61127 s=1 pgs=7059 cs=1
l=0 c=0x7fe01be72600).connect got RESETSESSION
2017-09-19 02:06:55.974657 7fdf78e4e700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.35:6827/477826 pipe(0x7fe004aef400 sd=91 :19464 s=1 pgs=7316 cs=1
l=0 c=0x7fe01b179080).connect got RESETSESSION
2017-09-19 02:06:55.977120 7fdf95310700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.22:6812/2276428 pipe(0x7fe003d15400 sd=86 :58892 s=1 pgs=15908
cs=1 l=0 c=0x7fe01b2e1500).connect got RESETSESSION
2017-09-19 02:06:55.979722 7fdf9500d700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.27:6830/2018590 pipe(0x7fe003a06800 sd=98 :42191 s=1 pgs=12697
cs=1 l=0 c=0x7fe01b44d780).connect got RESETSESSION
2017-09-19 02:06:56.106436 7fdfba1dc700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=1 pgs=11500
cs=1 l=0 c=0x7fe005820880).connect got RESETSESSION
2017-09-19 02:06:56.107146 7fdfbbaf5700  0 -- 10.10.13.29:6826/3332721 >>
10.10.13.27:6811/2018593 pipe(0x7fe009e79400 sd=137 :54582 s=2 pgs=11602
cs=1 l=0 c=0x7fe005820880).fault, initiating reconnect
---
2017-09-19 02:06:56.213980 7fdfdd58d700  0 log_channel(cluster) log [WRN] :

[ceph-users] Ceph fails to recover

2017-09-19 Thread Jonas Jaszkowic
Hi all, 

I have setup a Ceph cluster consisting of one monitor, 32 OSD hosts (1 OSD of 
size 320GB per host) and 16 clients which are reading
and writing to the cluster. I have one erasure coded pool (shec plugin) with 
k=8, m=4, c=3 and pg_num=256. Failure domain is host.
I am able to reach a HEALTH_OK state and everything is working as expected. The 
pool was populated with
114048 files of different sizes ranging from 1kB to 4GB. Total amount of data 
in the pool was around 3TB. The capacity of the
pool was around 10TB.

I want to evaluate how Ceph is rebalancing data in case of an OSD loss while 
clients are still reading. To do so, I am killing one OSD on purpose
via ceph osd out  without adding a new one, i.e. I have 31 OSDs left. 
Ceph seems to notice this failure and starts to rebalance data
which I can observe with the ceph -w command.

However, Ceph failed to rebalance the data. The recovering process seemed to be 
stuck at a random point. I waited more than 12h but the
number of degraded objects did not reduce and some PGs were stuck. Why is this 
happening? Based on the number of OSDs and the k,m,c values 
there should be enough hosts and OSDs to be able to recover from a single OSD 
failure?

Thank you in advance!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds: failed to decode message of type 43 v7: buffer::end_of_buffer

2017-09-19 Thread Gregory Farnum
You've probably run in to http://tracker.ceph.com/issues/16010 — do you
have very large directories? (Or perhaps just a whole bunch of unlinked
files which the MDS hasn't managed to trim yet?)
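One way to check for this is to count the omap keys on a suspect directory object
in the metadata pool; the object name below is illustrative, pick one that the
osd.049 debug log keeps touching.

~~~
rados -p cephfs_metadata listomapkeys 10000000000.00000000 | wc -l
~~~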

On Tue, Sep 19, 2017 at 11:51 AM Christian Salzmann-Jäckel <
christian.salzm...@fu-berlin.de> wrote:

> Hi,
>
> we run cephfs  (10.2.9 on Debian jessie; 108 OSDs on 9 nodes) as scratch
> filesystem for a HPC cluster using IPoIB interconnect with kernel client
> (Debian backports kernel version 4.9.30).
>
> Our clients started blocking on file system access.
> Logs show 'mds0: Behind on trimming' and slow requests to one osd
> (osd.049).
> Replacing the disk of osd.049 didn't show any effect. Cluster health is OK.
>
> 'ceph daemon mds.cephmon1 dump_ops_in_flight' shows ops from client
> sessions which are no longer present according to 'ceph daemon mds.cephmon1
> session ls'.
>
> We observe traffic of ~200 Mbps on the mds node and this OSD (osd.049).
> Stopping the mds process ends the traffic (of course).
> Stopping osd.049 shifts traffic to the next OSD (osd.095).
> ceph logs show 'slow requests' even after stopping almost all clients.
>
> Debug log on osd.049 show zillions of lines of a single pg (4.22e) of the
> cephfs_metadata pool which resides on OSDs [49, 95, 9].
>
> 2017-09-19 12:20:08.535383 7fd6b98c3700 20 osd.49 pg_epoch: 240725
> pg[4.22e( v 240141'1432046 (239363'1429042,240141'1432046] local-les=240073
> n=4848 ec=451 les/c/f 240073/240073/0 239916/240072/240072) [49,95,9] r=0
> lpr=240072 crt=240129'1432044 lcod 240130'1432045 mlcod 240130'1432045
> active+clean] Found key .chunk_4761369_head
>
> Is there anything we can do to get the mds back into operation?
>
> ciao
> Christian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mds: failed to decode message of type 43 v7: buffer::end_of_buffer

2017-09-19 Thread Christian Salzmann-Jäckel
Hi,

we run cephfs  (10.2.9 on Debian jessie; 108 OSDs on 9 nodes) as scratch 
filesystem for a HPC cluster using IPoIB interconnect with kernel client 
(Debian backports kernel version 4.9.30).

Our clients started blocking on file system access.
Logs show 'mds0: Behind on trimming' and slow requests to one osd (osd.049).
Replacing the disk of osd.049 didn't show any effect. Cluster health is OK.

'ceph daemon mds.cephmon1 dump_ops_in_flight' shows ops from client sessions 
which are no longer present according to 'ceph daemon mds.cephmon1 session ls'.

We observe traffic of ~200 Mbps on the mds node and this OSD (osd.049).
Stopping the mds process ends the traffic (of course).
Stopping osd.049 shifts traffic to the next OSD (osd.095).
ceph logs show 'slow requests' even after stopping almost all clients.

Debug log on osd.049 show zillions of lines of a single pg (4.22e) of the 
cephfs_metadata pool which resides on OSDs [49, 95, 9].

2017-09-19 12:20:08.535383 7fd6b98c3700 20 osd.49 pg_epoch: 240725 pg[4.22e( v 
240141'1432046 (239363'1429042,240141'1432046] local-les=240073 n=4848 ec=451 
les/c/f 240073/240073/0 239916/240072/240072) [49,95,9] r=0 lpr=240072 
crt=240129'1432044 lcod 240130'1432045 mlcod 240130'1432045 active+clean] Found 
key .chunk_4761369_head

Is there anything we can do to get the mds back into operation?

ciao
Christian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Jake Young
On Tue, Sep 19, 2017 at 9:38 AM Kees Meijs  wrote:

> Hi Jake,
>
> On 19-09-17 15:14, Jake Young wrote:
> > Ideally you actually want fewer disks per server and more servers.
> > This has been covered extensively in this mailing list. Rule of thumb
> > is that each server should have 10% or less of the capacity of your
> > cluster.
>
> That's very true, but let's focus on the HBA.
>
> > I didn't do extensive research to decide on this HBA, it's simply what
> > my server vendor offered. There are probably better, faster, cheaper
> > HBAs out there. A lot of people complain about LSI HBAs, but I am
> > comfortable with them.
>
> Given a configuration our vendor offered it's about LSI/Avago 9300-8i
> with 8 drives connected individually using SFF8087 on a backplane (e.g.
> not an expander). Or, 24 drives using three HBAs (6xSFF8087 in total)
> when using a 4HE SuperMicro chassis with 24 drive bays.
>
> But, what are the LSI complaints about? Or, are the complaints generic
> to HBAs and/or cryptic CLI tools and not LSI specific?


Typically people rant about how much Megaraid/LSI support sucks. I've been
using LSI or MegaRAID for years and haven't had any big problems.

I had some performance issues with Areca onboard SAS chips (non-Ceph setup,
4 disks in a RAID10) and after about 6 months of troubleshooting with the
server vendor and Areca support they did patch the firmware and resolve the
issue.


>
> > There is a management tool called storcli that can fully configure the
> > HBA in one or two command lines.  There's a command that configures
> > all attached disks as individual RAID0 disk groups. That command gets
> > run by salt when I provision a new osd server.
>
> The thread I read was about Areca in JBOD but still able to utilise the
> cache, if I'm not mistaken. I'm not sure anymore if there was something
> mentioned about BBU.


JBOD with WB cache would be nice so you can get smart data directly from
the disks instead of having to interrogate the HBA for the data.  This becomes
more important once your cluster is stable and in production.

IMHO if there is unwritten data in a RAM chip, like when you enable WB
cache, you really, really need a BBU. This is another nice thing about
using SSD journals instead of HBAs in WB mode, the journaled data is safe
on the SSD before the write is acknowledged.


>
> >
> > What many other people are doing is using the least expensive JBOD HBA
> > or the on board SAS controller in JBOD mode and then using SSD
> > journals. Save the money you would have spent on the fancy HBA for
> > fast, high endurance SSDs.
>
> Thanks! And obviously I'm very interested in other comments as well.
>
> Regards,
> Kees
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD crash starting up

2017-09-19 Thread David Turner
Adding the old OSD back in with its data shouldn't help you at all.  Your
cluster has finished backfilling and has the proper amount of copies of all
of its data.  The time you would want to add a removed OSD back to a
cluster is when you have unfound objects.

The scrub errors and inconsistent PGs are what you need to focus on and
where your current problem is.  The message with too many PGs per OSD is
just a warning and not causing any issues at this point as long as your OSD
nodes aren't having any OOM messages.  Once you add in a 6th OSD, that will
go away on its own.

There are several threads on the Mailing List that you should be able to
find about recovering from these and the potential dangers of some of the
commands.  Googling for `ceph-users scrub errors inconsistent pgs` is a
good place to start.
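As a rough sketch of the usual starting point (the PG id is made up, and the repair
step in particular should only be run after reading up on the risks):

~~~
ceph health detail | grep inconsistent
rados list-inconsistent-obj 4.1a --format=json-pretty
ceph pg repair 4.1a
~~~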

On Tue, Sep 19, 2017 at 11:28 AM Gonzalo Aguilar Delgado <
gagui...@aguilardelgado.com> wrote:

> Hi David,
>
> What I want is to add the OSD back with its data yes. But avoiding any
> troubles that can happen from the time it was out.
>
> Is it possible? I suppose that some pg has been updated after. Will ceph
> manage it gracefully?
>
> Ceph status is getting worse every day.
>
> ceph status
> cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>  health HEALTH_ERR
> 6 pgs inconsistent
> 31 scrub errors
> too many PGs per OSD (305 > max 300)
>  monmap e12: 2 mons at {blue-compute=
> 172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
> election epoch 4328, quorum 0,1 red-compute,blue-compute
>   fsmap e881: 1/1/1 up {0=blue-compute=up:active}
>  osdmap e7120: 5 osds: 5 up, 5 in
> flags require_jewel_osds
>   pgmap v66976120: 764 pgs, 6 pools, 555 GB data, 140 kobjects
>  GB used, 3068 GB / 4179 GB avail
>  758 active+clean
>6 active+clean+inconsistent
>   client io 384 kB/s wr, 0 op/s rd, 83 op/s wr
>
>
> I want to add the old OSD, rebalance copies across more hosts/OSDs, and then
> remove it again.
>
>
> Best regards,
>
> On 19/09/17 14:47, David Turner wrote:
>
> Are you asking to add the osd back with its data or add it back in as a
> fresh osd.  What is your `ceph status`?
>
> On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado <
> gagui...@aguilardelgado.com> wrote:
>
>> Hi David,
>>
>> Thank you for the great explanation of the weights, I thought that ceph
>> was adjusting them based on disk. But it seems it's not.
>>
> But I don't think the problem was that; the node was failing because of a
> software bug, since the disk was not full by any means.
>>
>> /dev/sdb1 976284608 172396756   803887852  18%
>> /var/lib/ceph/osd/ceph-1
>>
>> Now the question is to know if I can add again this osd safely. Is it
>> possible?
>>
>> Best regards,
>>
>>
>>
>> On 14/09/17 23:29, David Turner wrote:
>>
>> Your weights should more closely represent the size of the OSDs.  OSD3
>> and OSD6 are weighted properly, but your other 3 OSDs have the same weight
>> even though OSD0 is twice the size of OSD2 and OSD4.
>>
>> Your OSD weights is what I thought you were referring to when you said
>> you set the crush map to 1.  At some point it does look like you set all of
>> your OSD weights to 1, which would apply to OSD1.  If the OSD was too small
>> for that much data, it would have filled up and be too full to start.  Can
>> you mount that disk and see how much free space is on it?
>>
>> Just so you understand what that weight is, it is how much data the
>> cluster is going to put on it.  The default is for the weight to be the
>> size of the OSD in TiB (1024 based instead of TB which is 1000).  If you
>> set the weight of a 1TB disk and a 4TB disk both to 1, then the cluster
>> will try and give them the same amount of data.  If you set the 4TB disk to
>> a weight of 4, then the cluster will try to give it 4x more data than the
>> 1TB drive (usually what you want).
>>
>> In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2 has a
>> weight of 1 so the cluster thinks they should each receive the same amount
>> of data (which it did, they each have ~275GB of data).  OSD3 has a weight
>> of 1.36380 (its size in TiB) and OSD6 has a weight of 0.90919 and they have
>> basically the same %used space (17%) as opposed to the same amount of data
>> because the weight is based on their size.
>>
>> As long as you had enough replicas of your data in the cluster for it to
>> recover from you removing OSD1 such that your cluster is health_ok without
>> any missing objects, then there is nothing that you need off of OSD1 and
>> ceph recovered from the lost disk successfully.
>>
>> On Thu, Sep 14, 2017 at 4:39 PM Gonzalo Aguilar Delgado <
>> gagui...@aguilardelgado.com> wrote:
>>
>>> Hello,
>>>
>>> I was on an old version of ceph. And it showed a warning saying:
>>>
>>> *crush map* has straw_calc_version=*0*
>>>
>>> I read that adjusting it will only rebalance all, so the admin should 

Re: [ceph-users] Ceph OSD crash starting up

2017-09-19 Thread Gonzalo Aguilar Delgado

Hi David,

What I want is to add the OSD back with its data yes. But avoiding any 
troubles that can happen from the time it was out.


Is it possible? I suppose that some pg has been updated after. Will ceph 
manage it gracefully?


Ceph status is getting worse every day.

ceph status
cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
 health HEALTH_ERR
6 pgs inconsistent
31 scrub errors
too many PGs per OSD (305 > max 300)
 monmap e12: 2 mons at 
{blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}

election epoch 4328, quorum 0,1 red-compute,blue-compute
  fsmap e881: 1/1/1 up {0=blue-compute=up:active}
 osdmap e7120: 5 osds: 5 up, 5 in
flags require_jewel_osds
  pgmap v66976120: 764 pgs, 6 pools, 555 GB data, 140 kobjects
 GB used, 3068 GB / 4179 GB avail
 758 active+clean
   6 active+clean+inconsistent
  client io 384 kB/s wr, 0 op/s rd, 83 op/s wr


I want to add the old OSD, rebalance copies across more hosts/OSDs, and then
remove it again.



Best regards,


On 19/09/17 14:47, David Turner wrote:


Are you asking to add the osd back with its data or add it back in as 
a fresh osd.  What is your `ceph status`?



On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado 
> wrote:


Hi David,

Thank you for the great explanation of the weights, I thought that
ceph was adjusting them based on disk. But it seems it's not.

But I don't think the problem was that; the node was failing because of
a software bug, since the disk was not full by any means.

/dev/sdb1 976284608 172396756 803887852  18%
/var/lib/ceph/osd/ceph-1

Now the question is to know if I can add again this osd safely. Is
it possible?

Best regards,



On 14/09/17 23:29, David Turner wrote:
Your weights should more closely represent the size of the OSDs. 
OSD3 and OSD6 are weighted properly, but your other 3 OSDs have

the same weight even though OSD0 is twice the size of OSD2 and OSD4.

Your OSD weights is what I thought you were referring to when you
said you set the crush map to 1.  At some point it does look like
you set all of your OSD weights to 1, which would apply to OSD1. 
If the OSD was too small for that much data, it would have filled

up and be too full to start.  Can you mount that disk and see how
much free space is on it?

Just so you understand what that weight is, it is how much data
the cluster is going to put on it.  The default is for the weight
to be the size of the OSD in TiB (1024 based instead of TB which
is 1000).  If you set the weight of a 1TB disk and a 4TB disk
both to 1, then the cluster will try and give them the same
amount of data.  If you set the 4TB disk to a weight of 4, then
the cluster will try to give it 4x more data than the 1TB drive
(usually what you want).

In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2
has a weight of 1 so the cluster thinks they should each receive
the same amount of data (which it did, they each have ~275GB of
data).  OSD3 has a weight of 1.36380 (its size in TiB) and OSD6
has a weight of 0.90919 and they have basically the same %used
space (17%) as opposed to the same amount of data because the
weight is based on their size.

As long as you had enough replicas of your data in the cluster
for it to recover from you removing OSD1 such that your cluster
is health_ok without any missing objects, then there is nothing
that you need off of OSD1 and ceph recovered from the lost disk
successfully.

On Thu, Sep 14, 2017 at 4:39 PM Gonzalo Aguilar Delgado
> wrote:

Hello,

I was on an old version of ceph. And it showed a warning saying:

crush map has straw_calc_version=0

I read that adjusting it will only rebalance all, so the admin
should select when to do it. So I went straight ahead and ran:


ceph osd crush tunables optimal

It rebalanced as it said but then I started to have lots of
pg wrong. I discovered that it was because my OSD1. I thought
it was a disk failure so I added a new OSD6 and the system started
to rebalance. Anyway the OSD was not starting.

I thought to wipe it all. But I preferred to leave disk as it
was, and journal intact, in case I can recover and get data
from it. (See mail: [ceph-users] Scrub failing all the time,
new inconsistencies keep appearing).


So here's the information. But it has OSD1 replaced by OSD3,
sorry.

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR PGS
 0 1.0  1.0  926G  271G  654G 29.34 1.10 369
 2 1.0  1.0  460G  284G  

Re: [ceph-users] Clarification on sequence of recovery and client ops after OSDs rejoin cluster (also, slow requests)

2017-09-19 Thread Florian Haas
On Mon, Sep 18, 2017 at 2:02 PM, Christian Theune  wrote:
> Hi,
>
> and here’s another update which others might find quite interesting.
>
> Florian and I spend some time discussing the issue further, face to face. I 
> had one switch that I brought up again (—osd-recovery-start-delay) which I 
> looked at a few weeks ago but came to the conclusion that its rules are 
> underdocumented and from the appearance it didn’t seem to do anything.
>
> After stepping through what we learned about prioritized recovery, I brought 
> this up again and we started to experiment with this further - and it turns 
> out this switch might be quite helpful.
>
> Here’s what I found and maybe others can chime in whether this is going in 
> the right direction or not:
>
> 1. Setting --osd-recovery-start-delay (e.g. 60 seconds) causes no PG
>to start its recovery when the OSD boots and goes from ‘down/in’ to
>‘up/in’.

Just a minor point of correction here, for the people grabbing this
thread from the archives: the option is osd_recovery_delay_start.
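For completeness, a sketch of how that option can be set; the 60-second value is
just the one used in the experiment above.

~~~
# At runtime, on all OSDs:
ceph tell osd.* injectargs '--osd_recovery_delay_start 60'

# Or persistently in ceph.conf under [osd]:
#   osd recovery delay start = 60
~~~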

> 2. Client traffic starts getting processed immediately.
>
> 3. Writes from client traffic cause individual objects to require a
>(prioritized) recovery. As no other recovery is happening, everything
>is pretty relaxed and the recovery happens quickly and no slow
>requests appear. (Even when pushing the complaint time to 15s)
>
> 4. When an object from a PG gets recovered this way, the PG is marked as
>‘active+recovering+degraded’. In my test cluster this went up to ~37
>and made me wonder, because it exceeded my ‘--osd-recovery-max-
>active’ setting. Looking at the recovery rate you can see that no
>objects are recovering, and only every now and then an object
>gets recovered.

Again for future interested parties following this thread I think this
is worth highlighting, as it's rather unexpected (albeit somewhat
logical): the PG will go to the "recovering" state even though it's
not undergoing a full recovery. There are just a handful of objects
*in* the PG that are being recovered, even though (a) recovery is
deferred via osd_recovery_delay_start, *and* (b) concurrent recovery
of that many PGs technically isn't allowed, thanks to
osd_recovery_max_active. A clear indicator of this situation is ceph
-w showing a nontrivial number of PGs recovering, but the recovery
rate being in the single-digit objects per second.

> 5. After two minutes, no sudden “everyone else please start recovering”
>thundering happens. I scratch my head. I think.
>
>My conclusion is, that the “active+recovering+degraded” marker is
>actually just that: a marker. The organic writes now (implicitly)
>signal Ceph that there is a certain amount organic traffic that
>requires recovery and pushes the recovering PGs beyond the point
>where “real” recovery would start, because my limits are 3 PGs per
>OSD recovering.

Josh, can you confirm this? And if so, can you elaborate on the
reasoning behind it?

> 6. After a while your "hot set" of objects that get written to (I used
>two VMs with a random write fio[1]) is recovered by organic means and
>the 'recovering' PGs count goes down.
>
> 7. Once an OSD’s “recovering” count falls below the limit, it begins
>to start “real” recoveries. However, the hot set is now already
>recovered, so slow requests due to prioritized recoveries
>become unlikely.
>
> This actually feels like a quite nice way to handle this. Yes, recovery time 
> will be longer, but with a size=3/min_size=2 this still feels fast enough. 
> (In my test setup it took about 1h to recover fully from a 30% failure with 
> heavy client traffic).

Again, for the benefit to third parties we should probably mention
that recovery otherwise completed in a matter of minutes, albeit at
the cost of making client I/O almost unworkably slow. Just so people
can decide for themselves whether or not they want to go down that
route.

Also (Josh, please correct me if I'm wrong here), I think people need
to understand that using this strategy makes recovery last longer when
their clients are very busy, and wrap up quickly when they are not
doing much. Client activity is not something that the Ceph cluster
operator necessarily has much control over, so keeping tabs on average
and max recovery time would be a good idea here.

What do others think? In particular, does anyone think what Christian
is suggesting is a bad idea? It seems like a sound approach to me.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Kees Meijs
Hi Jake,

On 19-09-17 15:14, Jake Young wrote:
> Ideally you actually want fewer disks per server and more servers.
> This has been covered extensively in this mailing list. Rule of thumb
> is that each server should have 10% or less of the capacity of your
> cluster.

That's very true, but let's focus on the HBA.

> I didn't do extensive research to decide on this HBA, it's simply what
> my server vendor offered. There are probably better, faster, cheaper
> HBAs out there. A lot of people complain about LSI HBAs, but I am
> comfortable with them.

Given a configuration our vendor offered it's about LSI/Avago 9300-8i
with 8 drives connected individually using SFF8087 on a backplane (e.g.
not an expander). Or, 24 drives using three HBAs (6xSFF8087 in total)
when using a 4HE SuperMicro chassis with 24 drive bays.

But, what are the LSI complaints about? Or, are the complaints generic
to HBAs and/or cryptic CLI tools and not LSI specific?

> There is a management tool called storcli that can fully configure the
> HBA in one or two command lines.  There's a command that configures
> all attached disks as individual RAID0 disk groups. That command gets
> run by salt when I provision a new osd server.

The thread I read was about Areca in JBOD but still able to utilise the
cache, if I'm not mistaken. I'm not sure anymore if there was something
mentioned about BBU.

>
> What many other people are doing is using the least expensive JBOD HBA
> or the on board SAS controller in JBOD mode and then using SSD
> journals. Save the money you would have spent on the fancy HBA for
> fast, high endurance SSDs.

Thanks! And obviously I'm very interested in other comments as well.

Regards,
Kees

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Jake Young
On Tue, Sep 19, 2017 at 7:34 AM Kees Meijs  wrote:

> Hi list,
>
> It's probably something to discuss over coffee in Ede tomorrow but I'll
> ask anyway: what HBA is best suitable for Ceph nowadays?
>
> In an earlier thread I read some comments about some "dumb" HBAs running
> in IT mode but still being able to use cache on the HBA. Does it make
> sense? Or, is this dangerous similar to RAID solutions* without BBU?



Yes, that would be dangerous without a BBU.



>
> (On a side note, we're planning on not using SAS expanders any-more but
> to "wire" each individual disk e.g. using SFF8087 per four disks
> minimising risk of bus congestion and/or lock-ups.)
>
> Anyway, in short I'm curious about opinions on brand, type and
> configuration of HBA to choose.
>
> Cheers,
> Kees
>
> *: apologies for cursing.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

It depends a lot on how many disks you want per server.

Ideally you actually want fewer disks per server and more servers. This has
been covered extensively in this mailing list. Rule of thumb is that each
server should have 10% or less of the capacity of your cluster.

In my cluster I use the LSI 3108 HBA with 4GB of RAM, BBU and 9 3.5" 2TB
disks in 2U servers. Each disk is configured as a RAID0 disk group so I can
use the write back cache. I chose to use the HBA for write coalescing
rather than using SSD journals. It isn't as fast as SSD journals could be,
but it is cheaper and simpler to install and maintain.

I didn't do extensive research to decide on this HBA, it's simply what my
server vendor offered. There are probably better, faster, cheaper HBAs out
there. A lot of people complain about LSI HBAs, but I am comfortable with
them.

There is a management tool called storcli that can fully configure the HBA
in one or two command lines.  There's a command that configures all
attached disks as individual RAID0 disk groups. That command gets run by
salt when I provision a new osd server.
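As a rough sketch only, since StorCLI syntax varies by release and the
enclosure:slot ids below are made up for a 9-disk chassis:

~~~
storcli /c0 show                          # list controller, enclosures, drives
for slot in $(seq 0 8); do
    storcli /c0 add vd type=raid0 drives=252:${slot}
done
~~~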

What many other people are doing is using the least expensive JBOD HBA or
the on board SAS controller in JBOD mode and then using SSD journals. Save
the money you would have spent on the fancy HBA for fast, high endurance
SSDs.

Jake
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Wido den Hollander

> On 19 September 2017 at 13:34, Kees Meijs wrote:
> 
> 
> Hi list,
> 
> It's probably something to discuss over coffee in Ede tomorrow but I'll
> ask anyway: what HBA is best suitable for Ceph nowadays?
> 

I still prefer LSI (Avago) in most systems. An 8-port or 16-port controller and 
a bunch of disks connected to them.

> In an earlier thread I read some comments about some "dumb" HBAs running
> in IT mode but still being able to use cache on the HBA. Does it make
> sense? Or, is this dangerous similar to RAID solutions* without BBU?
> 

I wouldn't trust an HBA with any of my data in its cache; they are black boxes
without me knowing what happens in there.

My preference is to put them in IT mode and that's it.

Wido

> (On a side note, we're planning on not using SAS expanders any-more but
> to "wire" each individual disk e.g. using SFF8087 per four disks
> minimising risk of bus congestion and/or lock-ups.)
> 
> Anyway, in short I'm curious about opinions on brand, type and
> configuration of HBA to choose.
> 
> Cheers,
> Kees
> 
> *: apologies for cursing.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD crash starting up

2017-09-19 Thread David Turner
Are you asking to add the osd back with its data or add it back in as a
fresh osd.  What is your `ceph status`?

On Tue, Sep 19, 2017, 5:23 AM Gonzalo Aguilar Delgado <
gagui...@aguilardelgado.com> wrote:

> Hi David,
>
> Thank you for the great explanation of the weights, I thought that ceph
> was adjusting them based on disk. But it seems it's not.
>
> But I don't think the problem was that; the node was failing because of a
> software bug, since the disk was not full by any means.
>
> /dev/sdb1 976284608 172396756   803887852  18%
> /var/lib/ceph/osd/ceph-1
>
> Now the question is to know if I can add again this osd safely. Is it
> possible?
>
> Best regards,
>
>
>
> On 14/09/17 23:29, David Turner wrote:
>
> Your weights should more closely represent the size of the OSDs.  OSD3 and
> OSD6 are weighted properly, but your other 3 OSDs have the same weight even
> though OSD0 is twice the size of OSD2 and OSD4.
>
> Your OSD weights is what I thought you were referring to when you said you
> set the crush map to 1.  At some point it does look like you set all of
> your OSD weights to 1, which would apply to OSD1.  If the OSD was too small
> for that much data, it would have filled up and be too full to start.  Can
> you mount that disk and see how much free space is on it?
>
> Just so you understand what that weight is, it is how much data the
> cluster is going to put on it.  The default is for the weight to be the
> size of the OSD in TiB (1024 based instead of TB which is 1000).  If you
> set the weight of a 1TB disk and a 4TB disk both to 1, then the cluster
> will try and give them the same amount of data.  If you set the 4TB disk to
> a weight of 4, then the cluster will try to give it 4x more data than the
> 1TB drive (usually what you want).
>
> In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2 has a
> weight of 1 so the cluster thinks they should each receive the same amount
> of data (which it did, they each have ~275GB of data).  OSD3 has a weight
> of 1.36380 (its size in TiB) and OSD6 has a weight of 0.90919 and they have
> basically the same %used space (17%) as opposed to the same amount of data
> because the weight is based on their size.
>
> As long as you had enough replicas of your data in the cluster for it to
> recover from you removing OSD1 such that your cluster is health_ok without
> any missing objects, then there is nothing that you need off of OSD1 and
> ceph recovered from the lost disk successfully.
>
> On Thu, Sep 14, 2017 at 4:39 PM Gonzalo Aguilar Delgado <
> gagui...@aguilardelgado.com> wrote:
>
>> Hello,
>>
>> I was on an old version of ceph. And it showed a warning saying:
>>
>> *crush map* has straw_calc_version=*0*
>>
>> I read that adjusting it will only rebalance all, so the admin should select
>> when to do it. So I went straight ahead and ran:
>>
>>
>> ceph osd crush tunables optimal
>>
>>
>> It rebalanced as it said but then I started to have lots of pg wrong. I
>> discovered that it was because my OSD1. I thought it was a disk failure so I
>> added a new OSD6 and the system started to rebalance. Anyway the OSD was not
>> starting.
>>
>> I thought to wipe it all. But I preferred to leave disk as it was, and
>> journal intact, in case I can recover and get data from it. (See mail:
>> [ceph-users] Scrub failing all the time, new inconsistencies keep
>> appearing).
>>
>>
>> So here's the information. But it has OSD1 replaced by OSD3, sorry.
>>
>> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>>  0 1.0  1.0  926G  271G  654G 29.34 1.10 369
>>  2 1.0  1.0  460G  284G  176G 61.67 2.32 395
>>  4 1.0  1.0  465G  151G  313G 32.64 1.23 214
>>  3 1.36380  1.0 1396G  239G 1157G 17.13 0.64 340
>>  6 0.90919  1.0  931G  164G  766G 17.70 0.67 210
>>   TOTAL 4179G G 3067G 26.60
>> MIN/MAX VAR: 0.64/2.32  STDDEV: 16.99
>>
>> As I said I still have OSD1 intact so I can do whatever you need except
>> readding to the cluster. Since I don't know what It will do, maybe cause
>> havok.
>> Best regards,
>>
>>
>> On 14/09/17 17:12, David Turner wrote:
>>
>> What do you mean by "updated crush map to 1"?  Can you please provide a
>> copy of your crush map and `ceph osd df`?
>>
>> On Wed, Sep 13, 2017 at 6:39 AM Gonzalo Aguilar Delgado <
>> gagui...@aguilardelgado.com> wrote:
>>
>>> Hi,
>>>
>>> I recently updated the crush map to 1 and did all the relocation of the PGs. At
>>> the end I found that one of the OSDs is not starting.
>>>
>>> This is what it shows:
>>>
>>>
>>> 2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal (Aborted) **
>>>  in thread 7f49cbe12700 thread_name:filestore_sync
>>>
>>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>>  1: (()+0x9616ee) [0xa93c6ef6ee]
>>>  2: (()+0x11390) [0x7f49d9937390]
>>>  3: (gsignal()+0x38) [0x7f49d78d3428]
>>>  4: (abort()+0x16a) [0x7f49d78d502a]
>>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x26b) [0xa93c7ef43b]
>>>  6: 

[ceph-users] What HBA to choose? To expand or not to expand?

2017-09-19 Thread Kees Meijs
Hi list,

It's probably something to discuss over coffee in Ede tomorrow, but I'll
ask anyway: what HBA is best suited for Ceph nowadays?

In an earlier thread I read some comments about "dumb" HBAs running in IT
mode but still being able to use the cache on the HBA. Does that make
sense? Or is this dangerous, similar to RAID solutions* without a BBU?

(On a side note, we're planning on not using SAS expanders anymore, but to
"wire" each individual disk directly, e.g. using one SFF-8087 connector per
four disks, minimising the risk of bus congestion and/or lock-ups.)

Anyway, in short I'm curious about opinions on brand, type and
configuration of HBA to choose.

Cheers,
Kees

*: apologies for cursing.



Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-19 Thread Sean Purdy
On Tue, 19 Sep 2017, Yoann Moulin said:
> Hello,
> 
> Has anyone tested s3cmd or other tools to manage ACLs on a luminous
> radosGW?

Don't know about ACLs, but s3cmd works for me for other things (version 1.6.1).


My config file includes (but is not limited to):

access_key = yourkey
secret_key = yoursecret
host_bucket = %(bucket)s.host.yourdomain
host_base = host.yourdomain
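
If v4 signatures turn out to be the problem, forcing v2 in the config may be
worth a try as well (a sketch; this is an assumption about what could help,
not something I have verified against luminous):

signature_v2 = True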

$ s3cmd -c s3cfg-ceph ls s3://test/148671665
2017-08-02 21:39 18218   s3://test/1486716654.15214271.docx.gpg.97
2017-08-02 22:10 18218   s3://test/1486716654.15214271.docx.gpg.98
2017-08-02 22:48 18218   s3://test/1486716654.15214271.docx.gpg.99

I have not tried rclone or ACL futzing.


Sean Purdy
 
> I have opened an issue on s3cmd too
> 
> https://github.com/s3tools/s3cmd/issues/919
> 
> Thanks for your help
> 
> Yoann
> 

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Wido den Hollander

> On 19 September 2017 at 10:24, Adrian Saul wrote:
> 
> 
> > I understand what you mean and it's indeed dangerous, but see:
> > https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
> >
> > Looking at the systemd docs it's difficult though:
> > https://www.freedesktop.org/software/systemd/man/systemd.service.html
> >
> > If the OSD crashes due to another bug you do want it to restart.
> >
> > But for systemd it's not possible to see if the crash was due to a disk I/O-
> > error or a bug in the OSD itself or maybe the OOM-killer or something.
> 
> Perhaps using something like RestartPreventExitStatus and defining a specific 
> exit code for the OSD to exit on when it is exiting due to an IO error.
> 

That's a very, very good idea! I didn't know that one existed.

That would prevent restarts in case of I/O error indeed.
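
As a sketch of what that could look like once such an exit code existed (the
drop-in path follows normal systemd practice, but the exit code 99 and the
file name are purely hypothetical; ceph-osd does not define such a code today):

# /etc/systemd/system/ceph-osd@.service.d/no-restart-on-io-error.conf
[Service]
# do not restart when the OSD exits with the (hypothetical) dedicated I/O-error code
RestartPreventExitStatus=99

followed by a "systemctl daemon-reload".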

Wido



Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-19 Thread jwil...@stads.net

On 19-9-2017 at 11:24, Yoann Moulin wrote:

> Hello,
> 
> Has anyone tested s3cmd or other tools to manage ACLs on a luminous radosGW?
> 
> I have opened an issue on s3cmd too
> 
> https://github.com/s3tools/s3cmd/issues/919

Just an extra option: have you tried --signature-v2 on s3cmd?
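
For example, forcing v2 signatures while re-testing the ACL calls could look
like this (a sketch, reusing the bucket, object and config file from the
earlier examples):

s3cmd --signature-v2 -c ~/.s3cfg-test-rgwadmin setacl s3://image-net/LICENSE --acl-public
s3cmd --signature-v2 -c ~/.s3cfg-test-rgwadmin info s3://image-net/LICENSE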



Re: [ceph-users] s3cmd not working with luminous radosgw

2017-09-19 Thread Yoann Moulin
Hello,

Has anyone tested s3cmd or other tools to manage ACLs on a luminous radosGW?

I have opened an issue on s3cmd too

https://github.com/s3tools/s3cmd/issues/919

Thanks for your help

Yoann

> I have a fresh luminous cluster in test and I made a copy of a bucket (4 TB,
> 1.5M files) with rclone. I'm able to list/copy files with rclone, but s3cmd
> does not work at all: it can only give the bucket list, and I can't list
> files or update ACLs.
> 
> Has anyone already tested this?
> 
> root@iccluster012:~# rclone --version
> rclone v1.37
> 
> root@iccluster012:~# s3cmd --version
> s3cmd version 2.0.0
> 
> 
> ### rclone ls files ###
> 
> root@iccluster012:~# rclone ls testadmin:image-net/LICENSE
>  1589 LICENSE
> root@iccluster012:~#
> 
> nginx (as reverse proxy) log:
> 
>> 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
>> HTTP/1.1" 200 0 "-" "rclone/v1.37"
>> 10.90.37.13 - - [15/Sep/2017:10:30:02 +0200] "GET 
>> /image-net?delimiter=%2F=1024= HTTP/1.1" 200 779 "-" 
>> "rclone/v1.37"
> 
> rgw logs :
> 
>> 2017-09-15 10:30:02.620266 7ff1f58f7700  1 == starting new request 
>> req=0x7ff1f58f11f0 =
>> 2017-09-15 10:30:02.622245 7ff1f58f7700  1 == req done 
>> req=0x7ff1f58f11f0 op status=0 http_status=200 ==
>> 2017-09-15 10:30:02.622324 7ff1f58f7700  1 civetweb: 0x56061584b000: 
>> 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "HEAD /image-net/LICENSE 
>> HTTP/1.0" 1 0 - rclone/v1.37
>> 2017-09-15 10:30:02.623361 7ff1f50f6700  1 == starting new request 
>> req=0x7ff1f50f01f0 =
>> 2017-09-15 10:30:02.689632 7ff1f50f6700  1 == req done 
>> req=0x7ff1f50f01f0 op status=0 http_status=200 ==
>> 2017-09-15 10:30:02.689719 7ff1f50f6700  1 civetweb: 0x56061585: 
>> 127.0.0.1 - - [15/Sep/2017:10:30:02 +0200] "GET 
>> /image-net?delimiter=%2F=1024= HTTP/1.0" 1 0 - rclone/v1.37
> 
> 
> 
> ### s3cmd ls files ###
> 
> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls 
> s3://image-net/LICENSE
> root@iccluster012:~#
> 
> nginx (as reverse proxy) log:
> 
>> 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
>> http://test.iccluster.epfl.ch/image-net/?location HTTP/1.1" 200 127 "-" "-"
>> 10.90.37.13 - - [15/Sep/2017:10:30:04 +0200] "GET 
>> http://image-net.test.iccluster.epfl.ch/?delimiter=%2F=LICENSE 
>> HTTP/1.1" 200 318 "-" "-"
> 
> rgw logs :
> 
>> 2017-09-15 10:30:04.295355 7ff1f48f5700  1 == starting new request 
>> req=0x7ff1f48ef1f0 =
>> 2017-09-15 10:30:04.295913 7ff1f48f5700  1 == req done 
>> req=0x7ff1f48ef1f0 op status=0 http_status=200 ==
>> 2017-09-15 10:30:04.295977 7ff1f48f5700  1 civetweb: 0x560615855000: 
>> 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET /image-net/?location 
>> HTTP/1.0" 1 0 - -
>> 2017-09-15 10:30:04.299303 7ff1f40f4700  1 == starting new request 
>> req=0x7ff1f40ee1f0 =
>> 2017-09-15 10:30:04.300993 7ff1f40f4700  1 == req done 
>> req=0x7ff1f40ee1f0 op status=0 http_status=200 ==
>> 2017-09-15 10:30:04.301070 7ff1f40f4700  1 civetweb: 0x56061585a000: 
>> 127.0.0.1 - - [15/Sep/2017:10:30:04 +0200] "GET 
>> /?delimiter=%2F=LICENSE HTTP/1.0" 1 0 - 
> 
> 
> 
> ### s3cmd : list bucket ###
> 
> root@iccluster012:~# s3cmd -v -c ~/.s3cfg-test-rgwadmin ls s3://
> 2017-08-28 12:27  s3://image-net
> root@iccluster012:~#
> 
> nginx (as reverse proxy) log:
> 
>> ==> nginx/access.log <==
>> 10.90.37.13 - - [15/Sep/2017:10:36:10 +0200] "GET 
>> http://test.iccluster.epfl.ch/ HTTP/1.1" 200 318 "-" "-"
> 
> rgw logs :
> 
>> 2017-09-15 10:36:10.645354 7ff1f38f3700  1 == starting new request 
>> req=0x7ff1f38ed1f0 =
>> 2017-09-15 10:36:10.647419 7ff1f38f3700  1 == req done 
>> req=0x7ff1f38ed1f0 op status=0 http_status=200 ==
>> 2017-09-15 10:36:10.647488 7ff1f38f3700  1 civetweb: 0x56061585f000: 
>> 127.0.0.1 - - [15/Sep/2017:10:36:10 +0200] "GET / HTTP/1.0" 1 0 - -
> 
> 
> 
> ### rclone : list bucket ###
> 
> 
> root@iccluster012:~# rclone lsd testadmin:
>   -1 2017-08-28 12:27:33-1 image-net
> root@iccluster012:~#
> 
> nginx (as reverse proxy) log:
> 
>> ==> nginx/access.log <==
>> 10.90.37.13 - - [15/Sep/2017:10:37:53 +0200] "GET / HTTP/1.1" 200 318 "-" 
>> "rclone/v1.37"
> 
> rgw logs :
> 
>> ==> ceph/luminous-rgw-iccluster015.log <==
>> 2017-09-15 10:37:53.005424 7ff1f28f1700  1 == starting new request 
>> req=0x7ff1f28eb1f0 =
>> 2017-09-15 10:37:53.007192 7ff1f28f1700  1 == req done 
>> req=0x7ff1f28eb1f0 op status=0 http_status=200 ==
>> 2017-09-15 10:37:53.007282 7ff1f28f1700  1 civetweb: 0x56061586e000: 
>> 127.0.0.1 - - [15/Sep/2017:10:37:53 +0200] "GET / HTTP/1.0" 1 0 - 
>> rclone/v1.37


-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] Ceph OSD crash starting up

2017-09-19 Thread Gonzalo Aguilar Delgado

Hi David,

Thank you for the great explanation of the weights. I thought that ceph
was adjusting them based on disk size, but it seems it's not.


But I don't think the problem was that; the node seems to have been failing
because of a software bug, since the disk was not full by any means.


/dev/sdb1 976284608 172396756   803887852 18% 
/var/lib/ceph/osd/ceph-1


Now the question is whether I can safely add this OSD back to the cluster.
Is that possible?


Best regards,



On 14/09/17 23:29, David Turner wrote:
Your weights should more closely represent the size of the OSDs.  OSD3 
and OSD6 are weighted properly, but your other 3 OSDs have the same 
weight even though OSD0 is twice the size of OSD2 and OSD4.


Your OSD weights are what I thought you were referring to when you said 
you set the crush map to 1.  At some point it does look like you set 
all of your OSD weights to 1, which would apply to OSD1.  If the OSD 
was too small for that much data, it would have filled up and be too 
full to start.  Can you mount that disk and see how much free space is 
on it?
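
For instance, checking the old OSD's data partition could look like this (a
sketch; the device and mount point are placeholders for your setup):

mkdir -p /mnt/osd1
mount /dev/sdb1 /mnt/osd1
df -h /mnt/osd1
umount /mnt/osd1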


Just so you understand what that weight is, it is how much data the 
cluster is going to put on it.  The default is for the weight to be 
the size of the OSD in TiB (1024 based instead of TB which is 1000).  
If you set the weight of a 1TB disk and a 4TB disk both to 1, then the 
cluster will try and give them the same amount of data.  If you set 
the 4TB disk to a weight of 4, then the cluster will try to give it 4x 
more data than the 1TB drive (usually what you want).


In your case, your 926G OSD0 has a weight of 1 and your 460G OSD2 has 
a weight of 1 so the cluster thinks they should each receive the same 
amount of data (which it did, they each have ~275GB of data).  OSD3 
has a weight of 1.36380 (its size in TiB) and OSD6 has a weight of 
0.90919 and they have basically the same %used space (17%) as opposed 
to the same amount of data because the weight is based on their size.
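
As a sketch of how that reweighting looks in practice (the OSD id and value
are placeholders; a 1 TB drive is roughly 0.909 TiB):

# set the CRUSH weight to the disk size expressed in TiB
ceph osd crush reweight osd.0 0.909
# check the result
ceph osd df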


As long as you had enough replicas of your data in the cluster for it 
to recover from you removing OSD1 such that your cluster is health_ok 
without any missing objects, then there is nothing that you need off 
of OSD1 and ceph recovered from the lost disk successfully.


On Thu, Sep 14, 2017 at 4:39 PM Gonzalo Aguilar Delgado 
> wrote:


Hello,

I was on an old version of ceph, and it showed a warning saying:

crush map has straw_calc_version=0

I read that adjusting it would only trigger a rebalance, so the admin
should choose when to do it. So I went straight ahead and ran:


ceph osd crush tunables optimal

It rebalanced as expected, but then I started to see lots of PGs in a
wrong state. I discovered that it was because of my OSD1. I thought it
was a disk failure, so I added a new OSD6 and the system started to
rebalance. In any case, the OSD was not starting.
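
The tunables in effect (including straw_calc_version) can be checked
before and after such a change with something like (a sketch):

ceph osd crush show-tunables
ceph osd crush tunables optimal   # the change that was run
ceph osd crush show-tunables      # verify afterwards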

I thought about wiping it all, but I preferred to leave the disk as it
was, with the journal intact, in case I can recover data from it. (See
mail: [ceph-users] Scrub failing all the time, new inconsistencies keep
appearing.)


So here's the information. But it has OSD1 replaced by OSD3, sorry.

ID WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
 0 1.0  1.0  926G  271G  654G 29.34 1.10 369
 2 1.0  1.0  460G  284G  176G 61.67 2.32 395
 4 1.0  1.0  465G  151G  313G 32.64 1.23 214
 3 1.36380  1.0 1396G  239G 1157G 17.13 0.64 340
 6 0.90919  1.0  931G  164G  766G 17.70 0.67 210
  TOTAL 4179G G 3067G 26.60
MIN/MAX VAR: 0.64/2.32  STDDEV: 16.99

As I said, I still have OSD1 intact, so I can do whatever you need with
it except re-adding it to the cluster, since I don't know what that
would do; it might cause havoc.
Best regards,


On 14/09/17 17:12, David Turner wrote:

What do you mean by "updated crush map to 1"?  Can you please
provide a copy of your crush map and `ceph osd df`?

On Wed, Sep 13, 2017 at 6:39 AM Gonzalo Aguilar Delgado
> wrote:

Hi,

I recently updated the crush map to 1 and did all the relocation of
the PGs. At the end I found that one of the OSDs is not starting.

This is what it shows:


2017-09-13 10:37:34.287248 7f49cbe12700 -1 *** Caught signal
(Aborted) **
 in thread 7f49cbe12700 thread_name:filestore_sync

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9616ee) [0xa93c6ef6ee]
 2: (()+0x11390) [0x7f49d9937390]
 3: (gsignal()+0x38) [0x7f49d78d3428]
 4: (abort()+0x16a) [0x7f49d78d502a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int,
char const*)+0x26b) [0xa93c7ef43b]
 6: (FileStore::sync_entry()+0x2bbb) [0xa93c47fcbb]
 7: (FileStore::SyncThread::entry()+0xd) [0xa93c4adcdd]
 8: (()+0x76ba) 

Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Manuel Lausch
On Tue, 19 Sep 2017 08:24:48, Adrian Saul wrote:

> > I understand what you mean and it's indeed dangerous, but see:
> > https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
> >
> > Looking at the systemd docs it's difficult though:
> > https://www.freedesktop.org/software/systemd/man/systemd.service.html
> >
> > If the OSD crashes due to another bug you do want it to restart.
> >
> > But for systemd it's not possible to see if the crash was due to a
> > disk I/O- error or a bug in the OSD itself or maybe the OOM-killer
> > or something.
> 
> Perhaps using something like RestartPreventExitStatus and defining a
> specific exit code for the OSD to exit on when it is exiting due to
> an IO error.

Another idea: the OSD daemon keeps running in a defined error state and
only stops listening to the other OSDs and the clients.


-- 
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 |
76135 Karlsruhe | Germany Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Thomas Ludwig, Jan Oetjen


Member of United Internet



Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Adrian Saul
> I understand what you mean and it's indeed dangerous, but see:
> https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service
>
> Looking at the systemd docs it's difficult though:
> https://www.freedesktop.org/software/systemd/man/systemd.service.html
>
> If the OSD crashes due to another bug you do want it to restart.
>
> But for systemd it's not possible to see if the crash was due to a disk I/O-
> error or a bug in the OSD itself or maybe the OOM-killer or something.

Perhaps using something like RestartPreventExitStatus and defining a specific 
exit code for the OSD to exit on when it is exiting due to an IO error.



Re: [ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Wido den Hollander

> On 19 September 2017 at 10:02, Manuel Lausch wrote:
> 
> 
> Hi,
> 
> I see an issue with systemd's restart behaviour and disk I/O errors.
> If a disk fails with I/O errors, ceph-osd stops running. Systemd detects
> this and starts the daemon again. In our cluster I saw some loops of OSD
> crashes caused by disk failures and restarts triggered by systemd, every
> time with peering impact and timeouts to our application, until systemd
> gave up.
> 
> Obviously ceph needs the restart feature (at least with dmcrypt) to
> avoid race conditions in the startup process. But in the case of
> disk-related failures this is counterproductive.
> 
> What do you think about this? Is this a bug which should be fixed?
> 

I understand what you mean and it's indeed dangerous, but see: 
https://github.com/ceph/ceph/blob/master/systemd/ceph-osd%40.service

Looking at the systemd docs it's difficult though: 
https://www.freedesktop.org/software/systemd/man/systemd.service.html

If the OSD crashes due to another bug you do want it to restart.

But for systemd it's not possible to see if the crash was due to a disk 
I/O-error or a bug in the OSD itself or maybe the OOM-killer or something.

Wido

> We use ceph jewel (10.2.9)
> 
> 
> Regards
> Manuel 
> 
> 


[ceph-users] ceph-osd restartd via systemd in case of disk error

2017-09-19 Thread Manuel Lausch
Hi,

I see an issue with systemd's restart behaviour and disk I/O errors.
If a disk fails with I/O errors, ceph-osd stops running. Systemd detects
this and starts the daemon again. In our cluster I saw some loops of OSD
crashes caused by disk failures and restarts triggered by systemd, every
time with peering impact and timeouts to our application, until systemd
gave up.

Obviously ceph needs the restart feature (at least with dmcrypt) to
avoid race conditions in the startup process. But in the case of
disk-related failures this is counterproductive.

What do you think about this? Is this a bug which should be fixed?

We use ceph jewel (10.2.9)
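
For reference, the restart policy systemd currently applies to an OSD can be
inspected like this (a sketch; substitute your own OSD id):

systemctl cat ceph-osd@2.service
systemctl show ceph-osd@2.service | grep -iE 'Restart|StartLimit'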


Regards
Manuel 


-- 
Manuel Lausch

Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 |
76135 Karlsruhe | Germany Phone: +49 721 91374-1847
E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452

Geschäftsführer: Thomas Ludwig, Jan Oetjen


Member of United Internet
