Re: [ceph-users] Calamari or Alternative

2017-01-12 Thread Tu Holmes
I'll give ceph-dash a look.

Thanks!
On Thu, Jan 12, 2017 at 9:19 PM John Petrini  wrote:

> I used Calamari before making the move to Ubuntu 16.04 and upgrading to
> Jewel. At the time I tried to install it on 16.04 but couldn't get it
> working.
>
> I'm now using ceph-dash  along
> with the nagios plugin check_ceph_dash
>  and I've found that this
> gets me everything I need. A nice looking dashboard, graphs and alerting on
> the most important stats.
>
> Another plus is that it's incredibly easy to setup; you can have the
> dashboard up and running in five minutes.
>
> On Fri, Jan 13, 2017 at 12:06 AM, Tu Holmes  wrote:
>
> Hey Cephers.
>
> Question for you.
>
> Do you guys use Calamari or an alternative?
>
> If so, why has the installation of Calamari not really gotten much better
> recently.
>
> Are you still building the vagrant installers and building packages?
>
> Just wondering what you are all doing.
>
> Thanks.
>
> //Tu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Calamari or Alternative

2017-01-12 Thread John Petrini
I used Calamari before making the move to Ubuntu 16.04 and upgrading to
Jewel. At the time I tried to install it on 16.04 but couldn't get it
working.

I'm now using ceph-dash  along with
the nagios plugin check_ceph_dash
 and I've found that this
gets me everything I need. A nice looking dashboard, graphs and alerting on
the most important stats.

Another plus is that it's incredibly easy to set up; you can have the
dashboard up and running in five minutes.
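For anyone wiring this into their own monitoring, here is a minimal sketch of
polling a ceph-dash instance for cluster health. It assumes the dashboard
serves its status as JSON and that DASH_URL is whatever endpoint your
check_ceph_dash configuration already queries; both are placeholders to
adjust for your installation.

import json
import urllib.request

DASH_URL = "http://ceph-dash.example.com:5000/"   # placeholder URL

def cluster_health(url=DASH_URL):
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        status = json.loads(resp.read().decode("utf-8"))
    # Field names depend on the ceph-dash version; 'health' is typical for
    # the "ceph status" output it wraps.
    return status.get("health", status)

if __name__ == "__main__":
    print(cluster_health())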


On Fri, Jan 13, 2017 at 12:06 AM, Tu Holmes  wrote:

> Hey Cephers.
>
> Question for you.
>
> Do you guys use Calamari or an alternative?
>
> If so, why has the installation of Calamari not really gotten much better
> recently.
>
> Are you still building the vagrant installers and building packages?
>
> Just wondering what you are all doing.
>
> Thanks.
>
> //Tu
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Calamari or Alternative

2017-01-12 Thread Tu Holmes
Hey Cephers.

Question for you.

Do you guys use Calamari or an alternative?

If so, why hasn't the installation of Calamari really gotten much better
recently?

Are you still building the vagrant installers and building packages?

Just wondering what you are all doing.

Thanks.

//Tu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: Pipe "deadlock" in Hammer, 0.94.5

2017-01-12 Thread 许雪寒
Thank you for your continuous help☺.

We are using the Hammer 0.94.5 release, and that is the version of the
source code I have been reading.
However, if Pipe::do_recv does act as blocked, is it reasonable for
Pipe::reader_thread to block threads calling SimpleMessenger::submit_message
by holding Connection::lock?

I think maybe a different mutex should be used in Pipe::read_message rather 
than Connection::lock.
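
To make the interaction concrete, here is a small toy model (Python, not
Ceph code) of the pattern under discussion: a reader thread that holds the
per-connection lock while it sits in a blocking receive stalls any thread
that needs the same lock to submit a message, which is the shape of the
backtrace quoted further down (the peering thread stuck in Mutex::Lock
inside PipeConnection::try_get_pipe). All names are illustrative.

import threading
import time

connection_lock = threading.Lock()    # stands in for Connection::lock
data_arrived = threading.Event()      # stands in for data showing up on the socket

def reader_thread():
    with connection_lock:             # lock held across the "recv"
        # Blocks as long as no data arrives, like do_recv() effectively
        # waiting while the peer keeps dropping packets.
        data_arrived.wait(timeout=10)

def submit_message():
    start = time.time()
    with connection_lock:             # the same lock the reader is holding
        pass                          # the message would be queued here
    print("submit_message waited %.1fs for the connection lock"
          % (time.time() - start))

r = threading.Thread(target=reader_thread)
w = threading.Thread(target=submit_message)
r.start(); time.sleep(0.5); w.start()
r.join(); w.join()

If the wait in the reader lasted long enough, a watchdog like the OSD's
heartbeat/suicide timeout would fire on the stalled submitter, which matches
the HeartbeatMap assert in the log below.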

From: jiajia zhong [mailto:zhong2p...@gmail.com]
Sent: January 13, 2017 11:50
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: Re: [ceph-users] Pipe "deadlock" in Hammer, 0.94.5

Yes, but that depends.

That might have changed on the master branch.

2017-01-13 10:47 GMT+08:00 许雪寒 >:
Thanks for your reply☺

Indeed, Pipe::do_recv would act just as blocked when errno is EAGAIN, however, 
in Pipe::read_message method, it first checks if there is pending msg on the 
socket by “Pipe::tcp_read_wait”. So, I think, when Pipe::do_recv is called, it 
shouldn’t get an EAGAIN, which means it wouldn’t act as blocked. Is this so?
This really confuses me.


From: jiajia zhong [mailto:zhong2p...@gmail.com]
Sent: January 12, 2017 18:22
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Pipe "deadlock" in Hammer, 0.94.5

If errno is EAGAIN for recv, Pipe::do_recv just acts as blocked.

2017-01-12 16:34 GMT+08:00 许雪寒 >:
Hi, everyone.

Recently, we did some experiments to test the stability of the Ceph cluster.
We used the Hammer version, which is the version most widely used in our
online clusters. One of the scenarios that we simulated is poor network
connectivity, in which we used iptables to drop TCP/IP packets with some
probability. Sometimes we see the following phenomenon: while one machine is
running iptables to drop packets going in and out, OSDs on other machines can
be brought down, sometimes more than one OSD.

We used gdb to debug the core dumped by Linux. We found that the thread that
hit the suicide timeout threshold is a peering thread that is trying to send
a pg_notify message; the ceph-osd log file and gdb output are as follows:

Log file:
-3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
-2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after 150
-1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 
10.160.132.157:6818/10014122 <== osd.20 
10.160.132.156:0/24908 163  osd_ping(ping 
e4030 stamp 2017-01-10 17:02:13.450374) v2  47+0+0 (3247646131 0 0) 
0x7fd418ca8600 con 0x7fd413c89700
 0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' 
thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

GDB OUTPUT:
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  
0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x01a54ae4 in Mutex::Lock (this=0x7fd426453598, no_lockdep=false) 
at common/Mutex.cc:96
#4  0x01409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, m=...) at 
common/Mutex.h:115
#5  0x01c46446 in PipeConnection::try_get_pipe (this=0x7fd426453580, 
p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
#6  0x01c05809 in SimpleMessenger::submit_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580, dest_addr=..., dest_type=4, 
already_locked=false) at msg/simple/SimpleMessenger.cc:443
#7  0x01c033fa in SimpleMessenger::_send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.cc:136
#8  0x01c467c7 in SimpleMessenger::send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.h:139
#9  0x01c466a1 in PipeConnection::send_message (this=0x7fd426453580, 
m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
#10 0x013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
#11 0x013b45e7 in OSDService::send_incremental_map 
(this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
#12 0x013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8, 
peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) 
at osd/OSD.cc:887
#13 

Re: [ceph-users] Re: Pipe "deadlock" in Hammer, 0.94.5

2017-01-12 Thread jiajia zhong
Yes, but that depends.

That might have changed on the master branch.

2017-01-13 10:47 GMT+08:00 许雪寒 :

> Thanks for your reply☺
>
> Indeed, Pipe::do_recv would act just as blocked when errno is EAGAIN,
> however, in Pipe::read_message method, it first checks if there is pending
> msg on the socket by “Pipe::tcp_read_wait”. So, I think, when Pipe::do_recv
> is called, it shouldn’t get an EAGAIN, which means it wouldn’t act as
> blocked. Is this so?
> This really confuses me.
>
>
> From: jiajia zhong [mailto:zhong2p...@gmail.com]
> Sent: January 12, 2017 18:22
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Pipe "deadlock" in Hammer, 0.94.5
>
> If errno is EAGAIN for recv, Pipe::do_recv just acts as blocked.
>
> 2017-01-12 16:34 GMT+08:00 许雪寒 :
> Hi, everyone.
>
> Recently, we did some experiments to test the stability of the Ceph
> cluster. We used the Hammer version, which is the version most widely used
> in our online clusters. One of the scenarios that we simulated is poor
> network connectivity, in which we used iptables to drop TCP/IP packets with
> some probability. Sometimes we see the following phenomenon: while one
> machine is running iptables to drop packets going in and out, OSDs on other
> machines can be brought down, sometimes more than one OSD.
>
> We used gdb to debug the core dumped by Linux. We found that the thread
> that hit the suicide timeout threshold is a peering thread that is trying
> to send a pg_notify message; the ceph-osd log file and gdb output are as
> follows:
>
> Log file:
> -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
> -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after
> 150
> -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
> osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2  47+0+0
> (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>  0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
> common/HeartbeatMap.cc: In function 'bool 
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
> 0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x01a54ae4 in Mutex::Lock (this=0x7fd426453598,
> no_lockdep=false) at common/Mutex.cc:96
> #4  0x01409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
> m=...) at common/Mutex.h:115
> #5  0x01c46446 in PipeConnection::try_get_pipe
> (this=0x7fd426453580, p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
> #6  0x01c05809 in SimpleMessenger::submit_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, dest_addr=...,
> dest_type=4, already_locked=false) at msg/simple/SimpleMessenger.cc:443
> #7  0x01c033fa in SimpleMessenger::_send_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.cc:136
> #8  0x01c467c7 in SimpleMessenger::send_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.h:139
> #9  0x01c466a1 in PipeConnection::send_message
> (this=0x7fd426453580, m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
> #10 0x013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8,
> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x013b45e7 in OSDService::send_incremental_map
> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580,
> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
> #12 0x013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8,
> peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49)
> 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x013f43cc in OSD::do_notifies (this=0x7fd4821e6000,
> notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr
> (count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
> #14 0x013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000,
> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480,
> handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x0140043e in OSD::process_peering_events
> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
> #16 0x0141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070,
> pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
> #17 

[ceph-users] Re: Pipe "deadlock" in Hammer, 0.94.5

2017-01-12 Thread 许雪寒
Thanks for your reply☺

Indeed, Pipe::do_recv would act just as blocked when errno is EAGAIN.
However, in the Pipe::read_message method, it first checks whether there is a
pending message on the socket via "Pipe::tcp_read_wait". So, I think, when
Pipe::do_recv is called it shouldn't get an EAGAIN, which means it wouldn't
act as blocked. Is this so?
This really confuses me.


From: jiajia zhong [mailto:zhong2p...@gmail.com]
Sent: January 12, 2017 18:22
To: 许雪寒
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Pipe "deadlock" in Hammer, 0.94.5

If errno is EAGAIN for recv, Pipe::do_recv just acts as blocked.

2017-01-12 16:34 GMT+08:00 许雪寒 :
Hi, everyone.

Recently, we did some experiments to test the stability of the Ceph cluster.
We used the Hammer version, which is the version most widely used in our
online clusters. One of the scenarios that we simulated is poor network
connectivity, in which we used iptables to drop TCP/IP packets with some
probability. Sometimes we see the following phenomenon: while one machine is
running iptables to drop packets going in and out, OSDs on other machines can
be brought down, sometimes more than one OSD.

We used gdb to debug the core dumped by Linux. We found that the thread that
hit the suicide timeout threshold is a peering thread that is trying to send
a pg_notify message; the ceph-osd log file and gdb output are as follows:

Log file:
    -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
    -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after 150
    -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 
10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163  
osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2  47+0+0 
(3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
 0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' 
thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

GDB OUTPUT:
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  
0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x01a54ae4 in Mutex::Lock (this=0x7fd426453598, no_lockdep=false) 
at common/Mutex.cc:96
#4  0x01409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, m=...) at 
common/Mutex.h:115
#5  0x01c46446 in PipeConnection::try_get_pipe (this=0x7fd426453580, 
p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
#6  0x01c05809 in SimpleMessenger::submit_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580, dest_addr=..., dest_type=4, 
already_locked=false) at msg/simple/SimpleMessenger.cc:443
#7  0x01c033fa in SimpleMessenger::_send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.cc:136
#8  0x01c467c7 in SimpleMessenger::send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.h:139
#9  0x01c466a1 in PipeConnection::send_message (this=0x7fd426453580, 
m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
#10 0x013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
#11 0x013b45e7 in OSDService::send_incremental_map 
(this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
#12 0x013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8, 
peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) 
at osd/OSD.cc:887
#13 0x013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr 
(count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
#14 0x013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, ctx=..., 
pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480, 
handle=0x7fd440becb40) at osd/OSD.cc:7198
#15 0x0140043e in OSD::process_peering_events (this=0x7fd4821e6000, 
pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
#16 0x0141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070, 
pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
#17 0x014b94bf in ThreadPool::BatchWorkQueue::_void_process 
(this=0x7fd4821e7070, p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
#18 0x01b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0, 
wt=0x7fd4761db430) at common/WorkQueue.cc:128
#19 

Re: [ceph-users] HEALTH_OK when one server crashed?

2017-01-12 Thread John Spray
On Fri, Jan 13, 2017 at 12:21 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Thu, 12 Jan 2017 14:35:32 + Matthew Vernon wrote:
>
>> Hi,
>>
>> One of our ceph servers froze this morning (no idea why, alas). Ceph
>> noticed, moved things around, and when I ran ceph -s, said:
>>
>> root@sto-1-1:~# ceph -s
>> cluster 049fc780-8998-45a8-be12-d3b8b6f30e69
>>  health HEALTH_OK
>>  monmap e2: 3 mons at
>> {sto-1-1=172.27.6.11:6789/0,sto-2-1=172.27.6.14:6789/0,sto-3-1=172.27.6.17:6789/0}
>> election epoch 250, quorum 0,1,2 sto-1-1,sto-2-1,sto-3-1
>>  osdmap e9899: 540 osds: 480 up, 480 in
>> flags sortbitwise
>>   pgmap v4549229: 20480 pgs, 25 pools, 7559 GB data, 1906 kobjects
>> 22920 GB used, 2596 TB / 2618 TB avail
>>20480 active+clean
>>   client io 5416 kB/s rd, 6598 kB/s wr, 44 op/s rd, 53 op/s wr
>>
>> Is it intentional that it says HEALTH_OK when an entire server's worth
>> of OSDs are dead? you have to look quite hard at the output to notice
>> that 60 OSDs are unaccounted for.
>>
> What Wido said.
> Though there have been several discussions and nodding of heads that the
> current states of Ceph are pitifully limited and for many people simply
> inaccurate.
> As in, separating them in something like OK, INFO, WARN, ERR and having
> configuration options to determine what situation equates what state.

If anyone is interested in working on this, I'd recommend tidying up
the existing health reporting as a first step:
http://tracker.ceph.com/issues/7192

Currently, the health messages are just a string and a severity: the
first step to being able to selectively silence them would be to
formalize the definitions and give each possible health condition a
unique ID.

John
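
For illustration, a toy sketch (Python, not Ceph code) of the direction
described above: give every health condition a stable ID and a severity so
operators can mute specific conditions instead of string-matching free-form
messages. The IDs and checks below are made up for the example.

from collections import namedtuple

HealthCheck = namedtuple("HealthCheck", "id severity summary")

checks = [
    HealthCheck("OSD_DOWN", "WARN", "60 osds down"),
    HealthCheck("PG_DEGRADED", "WARN", "1024 pgs degraded"),
    HealthCheck("MON_CLOCK_SKEW", "WARN", "clock skew detected on mon.b"),
]

muted = {"MON_CLOCK_SKEW"}            # the operator chooses what to silence

def overall_status(checks, muted):
    active = [c for c in checks if c.id not in muted]
    if any(c.severity == "ERR" for c in active):
        return "HEALTH_ERR", active
    if any(c.severity == "WARN" for c in active):
        return "HEALTH_WARN", active
    return "HEALTH_OK", active

status, active = overall_status(checks, muted)
print(status)
for c in active:
    print("  %s [%s]: %s" % (c.id, c.severity, c.summary))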

>
> Of course you should be monitoring your cluster with other tools like
> nagios, from general availability on all network ports, disk usage, SMART
> wear out levels of SSDs down to the individual processes you'd expect to
> see running on a node:
> "PROCS OK: 8 processes with command name 'ceph-osd' "
>
> I lost single OSDs a few times and didn't notice either by looking at
> Nagios as the recovery was so quick.
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Christian Balzer
On Thu, 12 Jan 2017 13:59:12 -0800 Samuel Just wrote:

> That would work.
> -Sam
> 
Having seen similar behavior in the past, I made it a habit to manually
shut down services before a reboot.

This is not limited to Ceph, and these race conditions have definitely gotten
worse with systemd in general.

Christian

> On Thu, Jan 12, 2017 at 1:40 PM, Gregory Farnum  wrote:
> > On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just  wrote:
> >> Oh, this is basically working as intended.  What happened is that the
> >> mon died before the pending map was actually committed.  The OSD has a
> >> timeout (5s) after which it stops trying to mark itself down and just
> >> dies (so that OSDs don't hang when killed).  It took a bit longer than
> >> 5s for the remaining 2 mons to form a new quorum, so they never got
> >> the MOSDMarkMeDown message so we had to do it the slow way.  I would
> >> prefer this behavior to changing the mon shutdown process or making
> >> the OSDs wait longer, so I think that's it.  If you want to avoid
> >> disruption with colocated mons and osds, stop the osds first
> >
> > We can probably make our systemd scripts do this automatically? Or at
> > least, there's a Ceph super-task thingy and I bet we can order the
> > shutdown so it waits to kill the monitor until all the OSDs processes
> > have ended.
> >
> >> and then
> >> reboot.
> >
> >
> >
> >> -Sam
> >>
> >> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke  wrote:
> >>> Hi Sam,
> >>>
> >>> the webfrontend of an external ceph-dash was interrupted till the node
> >>> was up again. The reboot took app. 5 min.
> >>>
> >>> But  the ceph -w output shows some IO much faster. I will look tomorrow
> >>> at the output again and create an ticket.
> >>>
> >>>
> >>> Thanks
> >>>
> >>>
> >>> Udo
> >>>
> >>>
> >>> On 12.01.2017 20:02, Samuel Just wrote:
>  How long did it take for the cluster to recover?
>  -Sam
> 
>  On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  
>  wrote:
> > On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
> >> Hi all,
> >> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
> >> ceph-cluster. All nodes are mons and have two OSDs.
> >> During reboot of one node, ceph stucks longer than normaly and I look 
> >> in the
> >> "ceph -w" output to find the reason.
> >>
> >> This is not the reason, but I'm wonder why "osd marked itself down" 
> >> will not
> >> recognised by the mons:
> >> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
> >> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
> >> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor 
> >> election
> >> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor 
> >> election
> >> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
> >> quorum 0,2
> >> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, 
> >> quorum 0,2
> >> 0,2
> >> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
> >> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
> >> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
> >> kB/s
> >> wr, 15 op/s
> >> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
> >> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
> >> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 
> >> 73018 B/s
> >> wr, 12 op/s
> >> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 
> >> B/s
> >> rd, 135 kB/s wr, 15 op/s
> >> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
> >> B/s
> >> rd, 189 kB/s wr, 7 op/s
> >> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 
> >> failed (2
> >> reporters from different host after 21.222945 >= grace 20.388836)
> >> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 
> >> failed (2
> >> reporters from different host after 21.222970 >= grace 20.388836)
> >> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
> >> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> >> 2017-01-12 

Re: [ceph-users] HEALTH_OK when one server crashed?

2017-01-12 Thread Christian Balzer

Hello,

On Thu, 12 Jan 2017 14:35:32 + Matthew Vernon wrote:

> Hi,
> 
> One of our ceph servers froze this morning (no idea why, alas). Ceph
> noticed, moved things around, and when I ran ceph -s, said:
> 
> root@sto-1-1:~# ceph -s
> cluster 049fc780-8998-45a8-be12-d3b8b6f30e69
>  health HEALTH_OK
>  monmap e2: 3 mons at
> {sto-1-1=172.27.6.11:6789/0,sto-2-1=172.27.6.14:6789/0,sto-3-1=172.27.6.17:6789/0}
> election epoch 250, quorum 0,1,2 sto-1-1,sto-2-1,sto-3-1
>  osdmap e9899: 540 osds: 480 up, 480 in
> flags sortbitwise
>   pgmap v4549229: 20480 pgs, 25 pools, 7559 GB data, 1906 kobjects
> 22920 GB used, 2596 TB / 2618 TB avail
>20480 active+clean
>   client io 5416 kB/s rd, 6598 kB/s wr, 44 op/s rd, 53 op/s wr
> 
> Is it intentional that it says HEALTH_OK when an entire server's worth
> of OSDs are dead? you have to look quite hard at the output to notice
> that 60 OSDs are unaccounted for.
> 
What Wido said.
Though there have been several discussions, and nodding of heads, that the
current states of Ceph are pitifully limited and for many people simply
inaccurate.
As in, separating them into something like OK, INFO, WARN, ERR and having
configuration options to determine which situation equates to which state.

Of course you should be monitoring your cluster with other tools like
nagios, from general availability on all network ports, disk usage, SMART
wear out levels of SSDs down to the individual processes you'd expect to
see running on a node:
"PROCS OK: 8 processes with command name 'ceph-osd' "

I lost single OSDs a few times and didn't notice either by looking at
Nagios as the recovery was so quick. 

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-data-scan scan_links cross version from master on jewel ?

2017-01-12 Thread Gregory Farnum
On Thu, Jan 12, 2017 at 4:10 PM, Kjetil Jørgensen  wrote:
> Hi,
>
> I want/need cephfs-data-scan scan_links, it's in master, although we're
> currently on jewel (10.2.5). Am I better off cherry-picking the relevant
> commit onto the jewel branch rather than just using master ?

Almost certainly. I didn't check what changed but we routinely create
new (versioned) disk formats and while it would probably work, you
don't want your recovery tools writing newer versions than the
software which will be reading them.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan scan_links cross version from master on jewel ?

2017-01-12 Thread Kjetil Jørgensen
Hi,

I want/need cephfs-data-scan scan_links, it's in master, although we're
currently on jewel (10.2.5). Am I better off cherry-picking the relevant
commit onto the jewel branch rather than just using master ?

Cheers,
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs ata1.00: status: { DRDY }

2017-01-12 Thread Oliver Dzombic
Hi,

So I extended IO capability by adding spinning disks (+10%) and I stopped
scrubbing completely.

But the problem keeps coming back:

2017-01-12 21:19:18.275826 7f5d93e58700  0 log_channel(cluster) log
[WRN] : 19 slow requests, 5 included below; oldest blocked for >
202.408648 secs
2017-01-12 21:19:18.275839 7f5d93e58700  0 log_channel(cluster) log
[WRN] : slow request 60.008335 seconds old, received at 2017-01-12
21:18:18.267397: osd_op(client.245117.1:639159942 13.21d2b510
rbd_data.320282ae8944a.000a0058 [set-alloc-hint object_size
4194304 write_size 4194304,write 765952~4096] snapc 0=[] ondisk+write
e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275847 7f5d93e58700  0 log_channel(cluster) log
[WRN] : slow request 60.143672 seconds old, received at 2017-01-12
21:18:18.132060: osd_op(client.245117.1:639158909 13.caf24910
rbd_data.320282ae8944a.00067db7 [set-alloc-hint object_size
4194304 write_size 4194304,write 741376~4096] snapc 0=[] ondisk+write
e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275858 7f5d93e58700  0 log_channel(cluster) log
[WRN] : slow request 60.164862 seconds old, received at 2017-01-12
21:18:18.110870: osd_op(client.245117.1:639158730 13.c9d74f90
rbd_data.320282ae8944a.0008f18e [set-alloc-hint object_size
4194304 write_size 4194304,write 897024~4096] snapc 0=[] ondisk+write
e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275863 7f5d93e58700  0 log_channel(cluster) log
[WRN] : slow request 60.127854 seconds old, received at 2017-01-12
21:18:18.147878: osd_op(client.245117.1:639159079 13.a2efa410
rbd_data.320282ae8944a.0008e5cf [set-alloc-hint object_size
4194304 write_size 4194304,write 1703936~4096] snapc 0=[] ondisk+write
e5148) currently waiting for subops from 15
2017-01-12 21:19:18.275867 7f5d93e58700  0 log_channel(cluster) log
[WRN] : slow request 60.183234 seconds old, received at 2017-01-12
21:18:18.092498: osd_op(client.245117.1:639158607 13.b56e4190
rbd_data.320282ae8944a.000f45eb [set-alloc-hint object_size
4194304 write_size 4194304,write 2850816~8192] snapc 0=[] ondisk+write
e5148) currently waiting for subops from 15


At this time, the spinning disks were around 10-20% busy, while the SSD
caching disks (writeback config) were around 2% busy.

So to me it does not look like the problem is a lack of IO power.

Any idea how to find out more?

Thank you !

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Hanau district court
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


Am 06.01.2017 um 01:56 schrieb Christian Balzer:
> 
> Hello,
> 
> On Thu, 5 Jan 2017 23:02:51 +0100 Oliver Dzombic wrote:
> 
> 
> I've never seen hung qemu tasks, slow/hung I/O tasks inside VMs with a
> broken/slow cluster I've seen.
> That's because mine are all RBD librbd backed.
> 
> I think your approach with cephfs probably isn't the way forward.
> Also with cephfs you probably want to run the latest and greatest kernel
> there is (4.8?).
> 
> Is your cluster logging slow request warnings during that time?
> 
>>
>> In the night, thats when this issues occure primary/(only?), we run the
>> scrubs and deep scrubs.
>>
>> In this time the HDD Utilization of the cold storage peaks to 80-95%.
>>
> Never a good thing, if they are also expected to do something useful.
> HDD OSDs have their journals inline?
> 
>> But we have a SSD hot storage in front of this, which is buffering
>> writes and reads.
>>
> With that you mean cache-tier in writeback mode?
>  
>> In our ceph.conf we already have this settings active:
>>
>> osd max scrubs = 1
>> osd scrub begin hour = 20
>> osd scrub end hour = 7
>> osd op threads = 16
>> osd client op priority = 63
>> osd recovery op priority = 1
>> osd op thread timeout = 5
>>
>> osd disk thread ioprio class = idle
>> osd disk thread ioprio priority = 7
>>
> You're missing the most powerful scrub dampener there is:
> osd_scrub_sleep = 0.1
> 
>>
>>
>> All in all i do not think that there is not enough IO for the clients on
>> the cold storage ( even it looks like that on the first view ).
>>
> I find that one of the best ways to understand and thus manage your
> cluster is to run something like collectd with graphite (or grafana or
> whatever cranks your tractor).
> 
> This should in combination with detailed spot analysis by atop or similar
> give a very good idea of what is going on.
> 
> So in this case, watch cache-tier promotions and flushes, see if your
> clients I/Os really are covered by the cache or if during the night your
> VMs may do log rotates or access other cold data and thus have to go to
> the HDD based OSDs...
>  
>> And if its really as simple as too view IO for the clients, my question
>> would be, how to avoid it ?
>>
>> Turning off scrub/deep scrub 

Re: [ceph-users] slow requests break performance

2017-01-12 Thread Brad Hubbard
Check the latency figures in a "perf dump". High numbers in a
particular area may help you nail it.

I suspect though, that it may come down to enabling debug logging and
tracking a slow request through the logs.
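
One quick way to see where a slow request spends its time is to compute the
gaps between consecutive events in the dump_historic_ops output. A rough
sketch (the exact JSON layout differs between releases, so this just walks
the structure looking for lists of time/event entries):

import json
import sys
from datetime import datetime

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")

def event_lists(node):
    # Yield every list that looks like a sequence of op events.
    if isinstance(node, list):
        if node and all(isinstance(e, dict) and "time" in e and "event" in e
                        for e in node):
            yield node
        for item in node:
            yield from event_lists(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from event_lists(value)

def biggest_gaps(dump, top=5):
    gaps = []
    for events in event_lists(dump):
        for prev, cur in zip(events, events[1:]):
            delta = (parse(cur["time"]) - parse(prev["time"])).total_seconds()
            gaps.append((delta, prev["event"], cur["event"]))
    return sorted(gaps, reverse=True)[:top]

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        dump = json.load(f)
    for secs, frm, to in biggest_gaps(dump):
        print("%8.3fs  %s -> %s" % (secs, frm, to))

Feed it a file captured with "ceph daemon osd.N dump_historic_ops" and the
longest stretches (for example repeated "waiting for rw locks" periods)
float to the top.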

On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block  wrote:
> Hi,
>
>> Looking at the output of dump_historic_ops and dump_ops_in_flight
>
>
> I waited for new slow request messages and dumped the historic_ops into a
> file. The reporting OSD shows lots of "waiting for rw locks" messages and a
> duration of more than 30 secs:
>
>  "age": 366.044746,
> "duration": 32.491506,
> "type_data": [
> "commit sent; apply or cleanup",
> {
> "client": "client.9664429",
> "tid": 130439910
> },
> [
> {
> "time": "2017-01-12 10:38:15.227649",
> "event": "initiated"
> },
> {
> "time": "2017-01-12 10:38:15.232310",
> "event": "reached_pg"
> },
> {
> "time": "2017-01-12 10:38:15.232341",
> "event": "waiting for rw locks"
> },
> {
> "time": "2017-01-12 10:38:15.268819",
> "event": "reached_pg"
> },
> [
> .
> .
> .
> ]
> {
> "time": "2017-01-12 10:38:45.515055",
> "event": "waiting for rw locks"
> },
> {
> "time": "2017-01-12 10:38:46.921095",
> "event": "reached_pg"
> },
> {
> "time": "2017-01-12 10:38:46.921157",
> "event": "started"
> },
> {
> "time": "2017-01-12 10:38:46.921342",
> "event": "waiting for subops from 9,15"
> },
> {
> "time": "2017-01-12 10:38:46.921724",
> "event": "commit_queued_for_journal_write"
> },
> {
> "time": "2017-01-12 10:38:46.922186",
> "event": "write_thread_in_journal_buffer"
> },
> {
> "time": "2017-01-12 10:38:46.931103",
> "event": "sub_op_commit_rec"
> },
> {
> "time": "2017-01-12 10:38:46.968730",
> "event": "sub_op_commit_rec"
> },
> {
> "time": "2017-01-12 10:38:47.717770",
> "event": "journaled_completion_queued"
> },
> {
> "time": "2017-01-12 10:38:47.718280",
> "event": "op_commit"
> },
> {
> "time": "2017-01-12 10:38:47.718359",
> "event": "commit_sent"
> },
> {
> "time": "2017-01-12 10:38:47.718890",
> "event": "op_applied"
> },
> {
> "time": "2017-01-12 10:38:47.719154",
> "event": "done"
> }
>
>
> There were about 70 events "waiting for rw locks", I truncated the output.
> Based on the message "waiting for subops from 9,15" I also dumped the
> historic_ops for these two OSDs.
>
> Duration on OSD.9
>
> "initiated_at": "2017-01-12 10:38:29.258221",
> "age": 54.069919,
> "duration": 20.831568,
>
> Duration on OSD.15
>
> "initiated_at": "2017-01-12 10:38:23.695098",
> "age": 112.118210,
> "duration": 26.452526,
>
> They also contain many "waiting for rw locks" messages, but not as much as
> the dump from the reporting OSD.
> To me it seems that because two OSDs take a lot of time to process their
> requests (only slightly less than 30 secs), it sums up to more than 30 secs
> on the reporting (primary?) OSD. Is the reporting OSD always the primary?
>
> How can I debug this further? I searched the web for "waiting for rw locks",
> I also found Wido's blog [1] about my exact problem, but I'm not sure how to
> continue. Our admin says our network should be fine, but what can I do to
> rule that out?
>
> I don't think I have provided information about our cluster yet:
>
> 4 nodes, 3 mons, 20 OSDs on
> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
>
> [1]
> 

Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Matt Benjamin
Hi,

- Original Message -
> From: "Yehuda Sadeh-Weinraub" 
> To: "Sage Weil" 
> Cc: "Gregory Farnum" , "Jason Dillaman" 
> , "Piotr Dałek"
> , "ceph-devel" , 
> "ceph-users" 
> Sent: Thursday, January 12, 2017 3:22:06 PM
> Subject: Re: [ceph-users] Any librados C API users out there?
> 
> On Thu, Jan 12, 2017 at 12:08 PM, Sage Weil  wrote:
> > On Thu, 12 Jan 2017, Gregory Farnum wrote:
> >> On Thu, Jan 12, 2017 at 5:54 AM, Jason Dillaman 
> >> wrote:
> >> > There is option (3) which is to have a new (or modified)
> >> > "buffer::create_static" take an optional callback to invoke when the
> >> > buffer::raw object is destructed. The raw pointer would be destructed
> >> > when the last buffer::ptr / buffer::list containing it is destructed,
> >> > so you know it's no longer being referenced.
> >> >
> >> > You could then have the new C API methods that wrap the C buffer in a
> >> > bufferlist and set a new flag in the librados::AioCompletion to delay
> >> > its completion until after it's both completed and the memory is
> >> > released. When the buffer is freed, the callback would unblock the
> >> > librados::AioCompltion completion callback.
> >>
> >> I much prefer an approach like this: it's zero-copy; it's not a lot of
> >> user overhead; but it requires them to explicitly pass memory off to
> >> Ceph and keep it immutable until Ceph is done (at which point they are
> >> told so explicitly).
> >
> > Yeah, this is simpler.  I still feel like we should provide a way to
> > revoke buffers, though, because otherwise it's possible for calls to block
> > semi-indefinitely if, say, an old MOSDOp is queued for another OSD and that
> > OSD is not reading data off the socket but has not failed (e.g., due to
> > its rx throttling).
> >
> 
> We need to provide some way to cancel requests (at least from the
> client's aspect), that would guarantee that buffers are not going to
> be used (and no completion callback is going to be called).

Is the client/consumer cancellation async with respect to completion? A
cancellation in that case could ensure that, if it succeeds, those guarantees
are met, or else fail (because the callback and completion have raced
cancellation)?

Matt
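
For readers following along, a toy model (Python, not the librados API) of
the gating Jason describes earlier in the thread: the user's completion
callback fires only once the operation has completed and the library has
dropped its last reference to the caller's buffer, in whichever order those
two things happen. Class and method names are made up for the sketch.

import threading

class GatedCompletion:
    # Fires the user callback only after the op is done AND the buffer has
    # been released, regardless of which notification arrives first.
    def __init__(self, callback):
        self._callback = callback
        self._lock = threading.Lock()
        self._op_done = False
        self._buf_released = False
        self._fired = False

    def _maybe_fire(self):
        with self._lock:
            if self._fired or not (self._op_done and self._buf_released):
                return
            self._fired = True
        self._callback()            # safe: the caller may now reuse the buffer

    def op_complete(self):          # called when the write is acknowledged
        with self._lock:
            self._op_done = True
        self._maybe_fire()

    def buffer_released(self):      # called when the last internal ref is dropped
        with self._lock:
            self._buf_released = True
        self._maybe_fire()

# The two notifications can come from different threads in either order:
c = GatedCompletion(lambda: print("completion delivered; buffer reusable"))
c.op_complete()
c.buffer_released()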

> 
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Shinobu Kinjo
Now I'm totally clear.

Regards,

On Fri, Jan 13, 2017 at 6:59 AM, Samuel Just  wrote:
> That would work.
> -Sam
>
> On Thu, Jan 12, 2017 at 1:40 PM, Gregory Farnum  wrote:
>> On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just  wrote:
>>> Oh, this is basically working as intended.  What happened is that the
>>> mon died before the pending map was actually committed.  The OSD has a
>>> timeout (5s) after which it stops trying to mark itself down and just
>>> dies (so that OSDs don't hang when killed).  It took a bit longer than
>>> 5s for the remaining 2 mons to form a new quorum, so they never got
>>> the MOSDMarkMeDown message so we had to do it the slow way.  I would
>>> prefer this behavior to changing the mon shutdown process or making
>>> the OSDs wait longer, so I think that's it.  If you want to avoid
>>> disruption with colocated mons and osds, stop the osds first
>>
>> We can probably make our systemd scripts do this automatically? Or at
>> least, there's a Ceph super-task thingy and I bet we can order the
>> shutdown so it waits to kill the monitor until all the OSDs processes
>> have ended.
>>
>>> and then
>>> reboot.
>>
>>
>>
>>> -Sam
>>>
>>> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke  wrote:
 Hi Sam,

 the webfrontend of an external ceph-dash was interrupted till the node
 was up again. The reboot took app. 5 min.

 But  the ceph -w output shows some IO much faster. I will look tomorrow
 at the output again and create an ticket.


 Thanks


 Udo


 On 12.01.2017 20:02, Samuel Just wrote:
> How long did it take for the cluster to recover?
> -Sam
>
> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  
> wrote:
>> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>>> Hi all,
>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>> ceph-cluster. All nodes are mons and have two OSDs.
>>> During reboot of one node, ceph stucks longer than normaly and I look 
>>> in the
>>> "ceph -w" output to find the reason.
>>>
>>> This is not the reason, but I'm wonder why "osd marked itself down" 
>>> will not
>>> recognised by the mons:
>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor 
>>> election
>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor 
>>> election
>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>> quorum 0,2
>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 
>>> 0,2
>>> 0,2
>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
>>> kB/s
>>> wr, 15 op/s
>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 
>>> 73018 B/s
>>> wr, 12 op/s
>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 
>>> B/s
>>> rd, 135 kB/s wr, 15 op/s
>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
>>> B/s
>>> rd, 189 kB/s wr, 7 op/s
>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 
>>> failed (2
>>> reporters from different host after 21.222945 >= grace 20.388836)
>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 
>>> failed (2
>>> reporters from different host after 21.222970 >= grace 20.388836)
>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
That would work.
-Sam

On Thu, Jan 12, 2017 at 1:40 PM, Gregory Farnum  wrote:
> On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just  wrote:
>> Oh, this is basically working as intended.  What happened is that the
>> mon died before the pending map was actually committed.  The OSD has a
>> timeout (5s) after which it stops trying to mark itself down and just
>> dies (so that OSDs don't hang when killed).  It took a bit longer than
>> 5s for the remaining 2 mons to form a new quorum, so they never got
>> the MOSDMarkMeDown message so we had to do it the slow way.  I would
>> prefer this behavior to changing the mon shutdown process or making
>> the OSDs wait longer, so I think that's it.  If you want to avoid
>> disruption with colocated mons and osds, stop the osds first
>
> We can probably make our systemd scripts do this automatically? Or at
> least, there's a Ceph super-task thingy and I bet we can order the
> shutdown so it waits to kill the monitor until all the OSDs processes
> have ended.
>
>> and then
>> reboot.
>
>
>
>> -Sam
>>
>> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke  wrote:
>>> Hi Sam,
>>>
>>> the webfrontend of an external ceph-dash was interrupted till the node
>>> was up again. The reboot took app. 5 min.
>>>
>>> But  the ceph -w output shows some IO much faster. I will look tomorrow
>>> at the output again and create an ticket.
>>>
>>>
>>> Thanks
>>>
>>>
>>> Udo
>>>
>>>
>>> On 12.01.2017 20:02, Samuel Just wrote:
 How long did it take for the cluster to recover?
 -Sam

 On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  
 wrote:
> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>> Hi all,
>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>> ceph-cluster. All nodes are mons and have two OSDs.
>> During reboot of one node, ceph stucks longer than normaly and I look in 
>> the
>> "ceph -w" output to find the reason.
>>
>> This is not the reason, but I'm wonder why "osd marked itself down" will 
>> not
>> recognised by the mons:
>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>> quorum 0,2
>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 
>> 0,2
>> 0,2
>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
>> kB/s
>> wr, 15 op/s
>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
>> B/s
>> wr, 12 op/s
>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>> rd, 135 kB/s wr, 15 op/s
>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
>> B/s
>> rd, 189 kB/s wr, 7 op/s
>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 
>> failed (2
>> reporters from different host after 21.222945 >= grace 20.388836)
>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 
>> failed (2
>> reporters from different host after 21.222970 >= grace 20.388836)
>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>
>> Why trust the mon not the osd? In this case the osdmap will be right 
>> app. 26
>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>
>> ceph version 10.2.5 

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Gregory Farnum
On Thu, Jan 12, 2017 at 1:37 PM, Samuel Just  wrote:
> Oh, this is basically working as intended.  What happened is that the
> mon died before the pending map was actually committed.  The OSD has a
> timeout (5s) after which it stops trying to mark itself down and just
> dies (so that OSDs don't hang when killed).  It took a bit longer than
> 5s for the remaining 2 mons to form a new quorum, so they never got
> the MOSDMarkMeDown message so we had to do it the slow way.  I would
> prefer this behavior to changing the mon shutdown process or making
> the OSDs wait longer, so I think that's it.  If you want to avoid
> disruption with colocated mons and osds, stop the osds first

We can probably make our systemd scripts do this automatically? Or at
least, there's a Ceph super-task thingy and I bet we can order the
shutdown so it waits to kill the monitor until all the OSD processes
have ended.

> and then
> reboot.



> -Sam
>
> On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke  wrote:
>> Hi Sam,
>>
>> the webfrontend of an external ceph-dash was interrupted till the node
>> was up again. The reboot took app. 5 min.
>>
>> But  the ceph -w output shows some IO much faster. I will look tomorrow
>> at the output again and create an ticket.
>>
>>
>> Thanks
>>
>>
>> Udo
>>
>>
>> On 12.01.2017 20:02, Samuel Just wrote:
>>> How long did it take for the cluster to recover?
>>> -Sam
>>>
>>> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
 On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
> Hi all,
> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
> ceph-cluster. All nodes are mons and have two OSDs.
> During reboot of one node, ceph stucks longer than normaly and I look in 
> the
> "ceph -w" output to find the reason.
>
> This is not the reason, but I'm wonder why "osd marked itself down" will 
> not
> recognised by the mons:
> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
> quorum 0,2
> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 
> 0,2
> 0,2
> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
> kB/s
> wr, 15 op/s
> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
> B/s
> wr, 12 op/s
> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
> rd, 135 kB/s wr, 15 op/s
> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
> rd, 189 kB/s wr, 7 op/s
> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed 
> (2
> reporters from different host after 21.222945 >= grace 20.388836)
> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed 
> (2
> reporters from different host after 21.222970 >= grace 20.388836)
> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>
> Why trust the mon not the osd? In this case the osdmap will be right app. 
> 26
> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 That's not what anybody intended to have happen. It's possible the
 simultaneous loss of a monitor and the OSDs is triggering a case
 that's not behaving correctly. Can you create a ticket at
 

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
Oh, this is basically working as intended.  What happened is that the
mon died before the pending map was actually committed.  The OSD has a
timeout (5s) after which it stops trying to mark itself down and just
dies (so that OSDs don't hang when killed).  It took a bit longer than
5s for the remaining 2 mons to form a new quorum, so they never got
the MOSDMarkMeDown message so we had to do it the slow way.  I would
prefer this behavior to changing the mon shutdown process or making
the OSDs wait longer, so I think that's it.  If you want to avoid
disruption with colocated mons and osds, stop the osds first and then
reboot.
-Sam
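
A rough sketch of that "stop the OSDs first, then reboot" sequence for a
node that colocates mons and OSDs. It assumes systemd-managed daemons with
the usual ceph-osd.target unit; adjust the unit name and the process check
to your deployment.

import subprocess
import time

def wait_for_no_osds(timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        # pgrep exits non-zero when no ceph-osd process remains
        if subprocess.run(["pgrep", "-x", "ceph-osd"],
                          stdout=subprocess.DEVNULL).returncode != 0:
            return True
        time.sleep(2)
    return False

if __name__ == "__main__":
    # Stop every OSD on this host, but leave the monitor running until the
    # OSDs have had a chance to mark themselves down cleanly.
    subprocess.check_call(["systemctl", "stop", "ceph-osd.target"])
    if wait_for_no_osds():
        subprocess.check_call(["systemctl", "reboot"])
    else:
        print("ceph-osd processes still running; not rebooting")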

On Thu, Jan 12, 2017 at 1:24 PM, Udo Lembke  wrote:
> Hi Sam,
>
> the webfrontend of an external ceph-dash was interrupted till the node
> was up again. The reboot took app. 5 min.
>
> But  the ceph -w output shows some IO much faster. I will look tomorrow
> at the output again and create an ticket.
>
>
> Thanks
>
>
> Udo
>
>
> On 12.01.2017 20:02, Samuel Just wrote:
>> How long did it take for the cluster to recover?
>> -Sam
>>
>> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
>>> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
 Hi all,
 I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
 ceph-cluster. All nodes are mons and have two OSDs.
 During reboot of one node, ceph stucks longer than normaly and I look in 
 the
 "ceph -w" output to find the reason.

 This is not the reason, but I'm wonder why "osd marked itself down" will 
 not
 recognised by the mons:
 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
 quorum 0,2
 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
 0,2
 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
 {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
 wr, 15 op/s
 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
 B/s
 wr, 12 op/s
 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
 rd, 135 kB/s wr, 15 op/s
 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
 rd, 189 kB/s wr, 7 op/s
 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed 
 (2
 reporters from different host after 21.222945 >= grace 20.388836)
 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed 
 (2
 reporters from different host after 21.222970 >= grace 20.388836)
 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
 active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in

 Why trust the mon not the osd? In this case the osdmap will be right app. 
 26
 seconds earlier (the pgmap at 10:18:27.953410 is wrong).

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>>> That's not what anybody intended to have happen. It's possible the
>>> simultaneous loss of a monitor and the OSDs is triggering a case
>>> that's not behaving correctly. Can you create a ticket at
>>> tracker.ceph.com with your logs and what steps you took and symptoms
>>> observed?
>>> -Greg
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Udo Lembke
Hi Sam,

the web frontend of an external ceph-dash was interrupted until the node
was up again. The reboot took approx. 5 min.

But the "ceph -w" output showed some I/O again much sooner than that. I
will look tomorrow at the output again and create a ticket.


Thanks


Udo


On 12.01.2017 20:02, Samuel Just wrote:
> How long did it take for the cluster to recover?
> -Sam
>
> On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
>> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>>> Hi all,
>>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>>> ceph-cluster. All nodes are mons and have two OSDs.
>>> During reboot of one node, ceph stucks longer than normaly and I look in the
>>> "ceph -w" output to find the reason.
>>>
>>> This is not the reason, but I'm wonder why "osd marked itself down" will not
>>> recognised by the mons:
>>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>>> quorum 0,2
>>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>>> 0,2
>>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>>> wr, 15 op/s
>>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
>>> wr, 12 op/s
>>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>>> rd, 135 kB/s wr, 15 op/s
>>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>>> rd, 189 kB/s wr, 7 op/s
>>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
>>> reporters from different host after 21.222945 >= grace 20.388836)
>>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
>>> reporters from different host after 21.222970 >= grace 20.388836)
>>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>>
>>> Why trust the mon not the osd? In this case the osdmap will be right app. 26
>>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>>
>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> That's not what anybody intended to have happen. It's possible the
>> simultaneous loss of a monitor and the OSDs is triggering a case
>> that's not behaving correctly. Can you create a ticket at
>> tracker.ceph.com with your logs and what steps you took and symptoms
>> observed?
>> -Greg
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Sage Weil
On Thu, 12 Jan 2017, Yehuda Sadeh-Weinraub wrote:
> On Thu, Jan 12, 2017 at 12:08 PM, Sage Weil  wrote:
> > On Thu, 12 Jan 2017, Gregory Farnum wrote:
> >> On Thu, Jan 12, 2017 at 5:54 AM, Jason Dillaman  
> >> wrote:
> >> > There is option (3) which is to have a new (or modified)
> >> > "buffer::create_static" take an optional callback to invoke when the
> >> > buffer::raw object is destructed. The raw pointer would be destructed
> >> > when the last buffer::ptr / buffer::list containing it is destructed,
> >> > so you know it's no longer being referenced.
> >> >
> >> > You could then have the new C API methods that wrap the C buffer in a
> >> > bufferlist and set a new flag in the librados::AioCompletion to delay
> >> > its completion until after it's both completed and the memory is
> >> > released. When the buffer is freed, the callback would unblock the
> >> > librados::AioCompltion completion callback.
> >>
> >> I much prefer an approach like this: it's zero-copy; it's not a lot of
> >> user overhead; but it requires them to explicitly pass memory off to
> >> Ceph and keep it immutable until Ceph is done (at which point they are
> >> told so explicitly).
> >
> > Yeah, this is simpler.  I still feel like we should provide a way to
> > revoke buffers, though, because otherwise it's possible for calls to block
> > semi-indefinitey if, say, an old MOSDOp is quueed for another OSD and that
> > OSD is not reading data off the socket but has not failed (e.g., due to
> > it's rx throttling).
> >
> 
> We need to provide some way to cancel requests (at least from the
> client's aspect), that would guarantee that buffers are not going to
> be used (and no completion callback is going to be called).

Yeah.  It's a bit more work but I think this is the best path.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Sage Weil
On Thu, 12 Jan 2017, Gregory Farnum wrote:
> On Thu, Jan 12, 2017 at 5:54 AM, Jason Dillaman  wrote:
> > There is option (3) which is to have a new (or modified)
> > "buffer::create_static" take an optional callback to invoke when the
> > buffer::raw object is destructed. The raw pointer would be destructed
> > when the last buffer::ptr / buffer::list containing it is destructed,
> > so you know it's no longer being referenced.
> >
> > You could then have the new C API methods that wrap the C buffer in a
> > bufferlist and set a new flag in the librados::AioCompletion to delay
> > its completion until after it's both completed and the memory is
> > released. When the buffer is freed, the callback would unblock the
> > librados::AioCompltion completion callback.
> 
> I much prefer an approach like this: it's zero-copy; it's not a lot of
> user overhead; but it requires them to explicitly pass memory off to
> Ceph and keep it immutable until Ceph is done (at which point they are
> told so explicitly). 

Yeah, this is simpler.  I still feel like we should provide a way to 
revoke buffers, though, because otherwise it's possible for calls to block 
semi-indefinitey if, say, an old MOSDOp is quueed for another OSD and that 
OSD is not reading data off the socket but has not failed (e.g., due to 
it's rx throttling).

sage

> Even if we were very careful about not returning
> to users until operations are done, just taking buffers into a
> multi-threaded application without having explicit markers about
> ownership is a recipe for misuse.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Gregory Farnum
On Thu, Jan 12, 2017 at 5:54 AM, Jason Dillaman  wrote:
> There is option (3) which is to have a new (or modified)
> "buffer::create_static" take an optional callback to invoke when the
> buffer::raw object is destructed. The raw pointer would be destructed
> when the last buffer::ptr / buffer::list containing it is destructed,
> so you know it's no longer being referenced.
>
> You could then have the new C API methods that wrap the C buffer in a
> bufferlist and set a new flag in the librados::AioCompletion to delay
> its completion until after it's both completed and the memory is
> released. When the buffer is freed, the callback would unblock the
> librados::AioCompltion completion callback.

I much prefer an approach like this: it's zero-copy; it's not a lot of
user overhead; but it requires them to explicitly pass memory off to
Ceph and keep it immutable until Ceph is done (at which point they are
told so explicitly). Even if we were very careful about not returning
to users until operations are done, just taking buffers into a
multi-threaded application without having explicit markers about
ownership is a recipe for misuse.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Samuel Just
How long did it take for the cluster to recover?
-Sam

On Thu, Jan 12, 2017 at 10:54 AM, Gregory Farnum  wrote:
> On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
>> Hi all,
>> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
>> ceph-cluster. All nodes are mons and have two OSDs.
>> During reboot of one node, ceph stucks longer than normaly and I look in the
>> "ceph -w" output to find the reason.
>>
>> This is not the reason, but I'm wonder why "osd marked itself down" will not
>> recognised by the mons:
>> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
>> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
>> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
>> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
>> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
>> quorum 0,2
>> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
>> 0,2
>> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
>> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
>> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
>> wr, 15 op/s
>> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
>> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
>> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
>> wr, 12 op/s
>> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
>> rd, 135 kB/s wr, 15 op/s
>> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
>> rd, 189 kB/s wr, 7 op/s
>> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
>> reporters from different host after 21.222945 >= grace 20.388836)
>> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
>> reporters from different host after 21.222970 >= grace 20.388836)
>> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
>> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
>> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
>> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>>
>> Why trust the mon not the osd? In this case the osdmap will be right app. 26
>> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> That's not what anybody intended to have happen. It's possible the
> simultaneous loss of a monitor and the OSDs is triggering a case
> that's not behaving correctly. Can you create a ticket at
> tracker.ceph.com with your logs and what steps you took and symptoms
> observed?
> -Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Gregory Farnum
On Thu, Jan 12, 2017 at 2:03 AM,   wrote:
> Hi all,
> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
> ceph-cluster. All nodes are mons and have two OSDs.
> During reboot of one node, ceph stucks longer than normaly and I look in the
> "ceph -w" output to find the reason.
>
> This is not the reason, but I'm wonder why "osd marked itself down" will not
> recognised by the mons:
> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
> quorum 0,2
> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
> 0,2
> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
> wr, 15 op/s
> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
> wr, 12 op/s
> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
> rd, 135 kB/s wr, 15 op/s
> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
> rd, 189 kB/s wr, 7 op/s
> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
> reporters from different host after 21.222945 >= grace 20.388836)
> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
> reporters from different host after 21.222970 >= grace 20.388836)
> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>
> Why trust the mon not the osd? In this case the osdmap will be right app. 26
> seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

That's not what anybody intended to have happen. It's possible the
simultaneous loss of a monitor and the OSDs is triggering a case
that's not behaving correctly. Can you create a ticket at
tracker.ceph.com with your logs and what steps you took and symptoms
observed?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD v1 image format ...

2017-01-12 Thread Shinobu Kinjo
It would be appreciated if QA provided users with evaluation results
for the migration and recovery tools, to avoid any disaster on
production environments, and to get agreement with them beforehand,

 e.g.,
 #1 scenarios we test
 #2 image specs we use
 and so on

Does that make sense, or is it too much?

Regards,
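
For anyone wanting to check how exposed they are, something along these
lines lists the format of every image (just a sketch; 'rbd' is a
placeholder pool name):

    for img in $(rbd ls rbd); do
        printf '%s: ' "$img"
        rbd info "rbd/$img" | grep format
    done
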


On Thu, Jan 12, 2017 at 1:01 PM, Jason Dillaman  wrote:
> On Wed, Jan 11, 2017 at 10:43 PM, Shinobu Kinjo  wrote:
>> +2
>>  * Reduce manual operation as much as possible.
>>  * A recovery tool in case that we break something which would not
>> appear to us initially.
>
> I definitely agree that this is an overdue tool and we have an
> upstream feature ticket for tracking a possible solution for this [1].
> We won't remove the support for interacting with v1 images before we
> provide a path for migration. The Ceph core development team would
> really like to drop internal support for tmap operations, which are
> only utilized by RBD v1.
>
> [1] http://tracker.ceph.com/issues/18430
>
> --
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD key permission to unprotect a rbd snapshot

2017-01-12 Thread Jason Dillaman
The "rbd snap unprotect" action needs to scan the "rbd_children"
object of all pools to ensure that the image doesn't have any children
attached. Therefore, you need to ensure that the user that will
perform the "snap unprotect" has the "allow class-read object_prefix
rbd_children" on all pools [1].

[1] http://docs.ceph.com/docs/master/man/8/ceph-authtool/#capabilities
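
For the key in this example that could look something like the following
(a sketch; 'client.vms' is a placeholder for whatever user the key
belongs to, and note that "ceph auth caps" replaces all caps, so the
existing mon/osd grants have to be repeated):

    ceph auth caps client.vms \
        mon 'allow r' \
        osd 'allow class-read object_prefix rbd_children, allow rwx pool=vms'
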

On Thu, Jan 12, 2017 at 10:56 AM, Martin Palma  wrote:
> Hi all,
>
> what permissions do I need to unprotect a protected rbd snapshot?
>
> Currently the key interacting with the pool containing the rbd image
> has the following permissions:
>
> mon 'allow r'
> osd 'allow rwx pool=vms'
>
> When I try to unprotect a snaphost with the following command "rbd
> snap unprotect vms/ubuntu@snap" I get the following error:
>
> 2017-01-12 16:45:15.385212 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: cannot get children for pool 'vms'
> 2017-01-12 16:45:15.385343 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: cannot get children for pool 'data'
> 2017-01-12 16:45:15.386220 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: cannot get children for pool
> 'cephfs_data'
> 2017-01-12 16:45:15.386332 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: cannot get children for pool
> 'cephfs_metadata'
> 2017-01-12 16:45:15.386845 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: encountered error: (1) Operation not
> permitted
> 2017-01-12 16:45:15.386870 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: 0x7fab6376a4a0
> should_complete_error: ret_val=-1
> 2017-01-12 16:45:15.389819 7fab38ee4700 -1
> librbd::SnapshotUnprotectRequest: 0x7fab6376a4a0
> should_complete_error: ret_val=-1
> rbd: unprotecting snap failed: (1) Operation not permitted
>
> What additional permission does the key need? And why does the command
> try to get children on all pools?
>
> Best,
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD key permission to unprotect a rbd snapshot

2017-01-12 Thread Martin Palma
Hi all,

what permissions do I need to unprotect a protected rbd snapshot?

Currently the key interacting with the pool containing the rbd image
has the following permissions:

mon 'allow r'
osd 'allow rwx pool=vms'

When I try to unprotect a snaphost with the following command "rbd
snap unprotect vms/ubuntu@snap" I get the following error:

2017-01-12 16:45:15.385212 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: cannot get children for pool 'vms'
2017-01-12 16:45:15.385343 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: cannot get children for pool 'data'
2017-01-12 16:45:15.386220 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: cannot get children for pool
'cephfs_data'
2017-01-12 16:45:15.386332 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: cannot get children for pool
'cephfs_metadata'
2017-01-12 16:45:15.386845 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: encountered error: (1) Operation not
permitted
2017-01-12 16:45:15.386870 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: 0x7fab6376a4a0
should_complete_error: ret_val=-1
2017-01-12 16:45:15.389819 7fab38ee4700 -1
librbd::SnapshotUnprotectRequest: 0x7fab6376a4a0
should_complete_error: ret_val=-1
rbd: unprotecting snap failed: (1) Operation not permitted

What additional permission does the key need? And why does the command
try to get children on all pools?

Best,
Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Network question

2017-01-12 Thread Sivaram Kannan
Hi,

Thanks for the reply. The public network I am talking about is an
isolated network with no access to the internet, but it does carry a
lot of compute traffic. If the concern is mainly security, I would try
setting up both on the same network. My worry is more about performance
issues (due to re-balancing between the nodes) if both control and data
are configured on the same network.

Thanks,
./Siva.
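
For reference, the split itself would only be a ceph.conf change (plus
restarting the daemons) -- a minimal sketch with placeholder subnets,
one per VLAN if I go that route:

    [global]
        public network  = 192.168.10.0/24
        cluster network = 192.168.20.0/24
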

On Thu, Jan 12, 2017 at 9:35 AM, Oliver Humpage  wrote:
>
>> I do recommend separating your public and cluster networks but there's not a 
>> whole lot of benefit to it unless they are using physically separate links 
>> with dedicated bandwidth.
>
> I thought a large part of it was security, in that it’s possible to DOS the 
> cluster by disrupting intra-OSD traffic. Even with message signatures turned 
> on, it’s unwise to bet on there not being any security bugs.
>
> If you only have one 10Gb connection, perhaps consider separate VLANs?
>
> Oliver.
>



-- 
ever tried. ever failed. no matter.
try again. fail again. fail better.
-- Samuel Beckett
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Jason Dillaman
There is option (3) which is to have a new (or modified)
"buffer::create_static" take an optional callback to invoke when the
buffer::raw object is destructed. The raw pointer would be destructed
when the last buffer::ptr / buffer::list containing it is destructed,
so you know it's no longer being referenced.

You could then have the new C API methods that wrap the C buffer in a
bufferlist and set a new flag in the librados::AioCompletion to delay
its completion until after it's both completed and the memory is
released. When the buffer is freed, the callback would unblock the
librados::AioCompltion completion callback.

On Thu, Jan 12, 2017 at 8:48 AM, Sage Weil  wrote:
> On Thu, 12 Jan 2017, Piotr Dałek wrote:
>> On 01/11/2017 07:01 PM, Sage Weil wrote:
>> > On Wed, 11 Jan 2017, Jason Dillaman wrote:
>> > > On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek 
>> > > wrote:
>> > > > As the subject says - are here any users/consumers of librados C API?
>> > > > I'm
>> > > > asking because we're researching if this PR:
>> > > > https://github.com/ceph/ceph/pull/12216 will be actually beneficial for
>> > > > larger group of users. This PR adds a bunch of new APIs that perform
>> > > > object
>> > > > writes without intermediate data copy, which will reduce cpu and memory
>> > > > load
>> > > > on clients. If you're using librados C API for object writes, feel free
>> > > > to
>> > > > comment here or in the pull request.
>> > > +1
>> > >
>> > > I'd be happy to tweak the internals of librbd to support pass-through
>> > > of C buffers all the way to librados. librbd clients like QEMU use the
>> > > C API and this currently results in several extra copies (in librbd
>> > > and librados).
>> >
>> > +1 from me too.
>> >
>> > The caveat is that we have to be very careful with buffers that are
>> > provided by users.  Currently the userspace messenger code doesn't provide
>> > a way to manage the provenance of references to the buffer::raw_static
>> > buffers, which means that even if the write has completed, there may be
>> > ways for an MOSDOp to still be alive that references that memory.
>> >
>> > Either (1) we have to audit the code to be sure that by the time the
>> > Objecter request completes we know that all messages and their bufferlists
>> > are cleared (tricky/fragile), or (2) introduce some buffer management
>> > interface in librados so that the buffer lifecycle is independent of the
>> > request.  I would prefer (2), but it means the interfaces would be
>> > something like
>> >
>> >  rados_buffer_create(...)
>> >  copy your data into that buffer
>> >  rados_write(...) or whatever
>> >  rados_buffer_release(...)
>> >
>> > and then rados can do the proper refcounting and only deallocate the
>> > memory when all refs have gone away.  Unfortunately, I suspect that there
>> > is a largish category of users where this isn't sufficient... e.g., if
>> > some existing C user has its own buffer and it isn't practical to
>> > allocate/release via rados_buffer_* calls instead of malloc/free (or
>> > whatever).
>>
>> Personally I vote for (1). That way is time consuming, but may help us find
>> other ways to optimize resource consumption of API itself. In particular, if
>> librbd, as you all wrote, does several copies of user buffers, then maybe 
>> it's
>> worth spending some time on figuring out the actual lifetime of buffers?
>> Maybe, for example, buffers in MOSDOp are actually copies of already copied
>> buffers, so the extra copy done by librados is unnecessary?
>> Regardless, the idea behind my PR wasn't to make librados/librbd/libwhatever
>> 100% optimal right away, but to make one step towards it being efficient -
>> recduce cpu and memory usage a bit, have it go through testing, if nothing
>> breaks, try harder until we're close to perfection.
>> even if few percent decrease in memory consumption doesn't look interesting 
>> on
>> single-client level, few percents in large scale deployments may mean 
>> hundreds
>> of gigabytes of memory that could be put to better use.
>>
>> (2), and particularly the requirement for API changes, along with need to put
>> user data into specific RADOS structure doesn't look good because
>> - that requires way more changes to user code, and I don't expect any of
>> current user to be interested in that (unless we force it with API change)
>> - most of devs are simply lazy and instead of using "rados_buffer_create" to
>> put their data directly into them, they'll keep using their own buffers and
>> *then* copy them into RADOS buffers, just like we're doing it right now (even
>> if behind the scenes).
>
> Yeah, I think you're right.  The bad news is just that (1) is hard.  It's
> going to require a few things in order to address the problem of sending
> multiple MOSDOp requests for the same operation (e.g., after peering).  I
> think it will mean
>
> - The Message superclass is going to require a new Mutex that is taken by
> the 

Re: [ceph-users] Any librados C API users out there?

2017-01-12 Thread Sage Weil
On Thu, 12 Jan 2017, Piotr Dałek wrote:
> On 01/11/2017 07:01 PM, Sage Weil wrote:
> > On Wed, 11 Jan 2017, Jason Dillaman wrote:
> > > On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek 
> > > wrote:
> > > > As the subject says - are here any users/consumers of librados C API?
> > > > I'm
> > > > asking because we're researching if this PR:
> > > > https://github.com/ceph/ceph/pull/12216 will be actually beneficial for
> > > > larger group of users. This PR adds a bunch of new APIs that perform
> > > > object
> > > > writes without intermediate data copy, which will reduce cpu and memory
> > > > load
> > > > on clients. If you're using librados C API for object writes, feel free
> > > > to
> > > > comment here or in the pull request.
> > > +1
> > > 
> > > I'd be happy to tweak the internals of librbd to support pass-through
> > > of C buffers all the way to librados. librbd clients like QEMU use the
> > > C API and this currently results in several extra copies (in librbd
> > > and librados).
> > 
> > +1 from me too.
> > 
> > The caveat is that we have to be very careful with buffers that are
> > provided by users.  Currently the userspace messenger code doesn't provide
> > a way to manage the provenance of references to the buffer::raw_static
> > buffers, which means that even if the write has completed, there may be
> > ways for an MOSDOp to still be alive that references that memory.
> > 
> > Either (1) we have to audit the code to be sure that by the time the
> > Objecter request completes we know that all messages and their bufferlists
> > are cleared (tricky/fragile), or (2) introduce some buffer management
> > interface in librados so that the buffer lifecycle is independent of the
> > request.  I would prefer (2), but it means the interfaces would be
> > something like
> > 
> >  rados_buffer_create(...)
> >  copy your data into that buffer
> >  rados_write(...) or whatever
> >  rados_buffer_release(...)
> > 
> > and then rados can do the proper refcounting and only deallocate the
> > memory when all refs have gone away.  Unfortunately, I suspect that there
> > is a largish category of users where this isn't sufficient... e.g., if
> > some existing C user has its own buffer and it isn't practical to
> > allocate/release via rados_buffer_* calls instead of malloc/free (or
> > whatever).
> 
> Personally I vote for (1). That way is time consuming, but may help us find
> other ways to optimize resource consumption of API itself. In particular, if
> librbd, as you all wrote, does several copies of user buffers, then maybe it's
> worth spending some time on figuring out the actual lifetime of buffers?
> Maybe, for example, buffers in MOSDOp are actually copies of already copied
> buffers, so the extra copy done by librados is unnecessary?
> Regardless, the idea behind my PR wasn't to make librados/librbd/libwhatever
> 100% optimal right away, but to make one step towards it being efficient -
> recduce cpu and memory usage a bit, have it go through testing, if nothing
> breaks, try harder until we're close to perfection.
> even if few percent decrease in memory consumption doesn't look interesting on
> single-client level, few percents in large scale deployments may mean hundreds
> of gigabytes of memory that could be put to better use.
> 
> (2), and particularly the requirement for API changes, along with need to put
> user data into specific RADOS structure doesn't look good because
> - that requires way more changes to user code, and I don't expect any of
> current user to be interested in that (unless we force it with API change)
> - most of devs are simply lazy and instead of using "rados_buffer_create" to
> put their data directly into them, they'll keep using their own buffers and
> *then* copy them into RADOS buffers, just like we're doing it right now (even
> if behind the scenes).

Yeah, I think you're right.  The bad news is just that (1) is hard.  It's 
going to require a few things in order to address the problem of sending 
multiple MOSDOp requests for the same operation (e.g., after peering).  I 
think it will mean

- The Message superclass is going to require a new Mutex that is taken by 
the messenger implementation while accessing the encoded message payload 
buffer

- It will also need a revoke_buffers() method (that takes the mutex) so 
that Objecter can yank references to buffers for any messages in flight 
that it no longer cares about

- Objecter will need to keep a ref of the in-flight request of record.  If 
it needs to resend to a different OSD, it can use that ref to revoke the 
buffers.

- Once the operation completes, Objecter can drop it's ref to that 
Message.

Until we do that, any zero-copy changes like those in the PR will mostly 
work but trigger use-after-free is various cases where the PG mappings are 
changing.

sage___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Write back cache removal

2017-01-12 Thread Wido den Hollander

> Op 10 januari 2017 om 22:05 schreef Nick Fisk :
> 
> 
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Stuart Harland
> Sent: 10 January 2017 11:58
> To: Wido den Hollander 
> Cc: ceph new ; n...@fisk.me.uk
> Subject: Re: [ceph-users] Write back cache removal
> 
>  
> 
> Yes Wido, you are correct. There is a RBD pool in the cluster, but is not 
> currently running with a cache attached. The Pool I’m trying to manage here 
> is only used by Librados to write objects directly to the pool as opposed to 
> any of the other niceties that ceph provides.
> 
>  
> 
> Specifically I ran:
> 
>  
> 
> `ceph osd tier cache-mode  forward`
> 
>  
> 
> which returned `Error EPERM: 'forward' is not a well-supported cache mode and 
> may corrupt your data.  pass --yes-i-really-mean-it to force.`
> 
>  
> 
> Currently we are running 10.2.5. I suspect that it’s fine in our use case, 
> however given the sparsity of the documentation I didn’t like to assume 
> anything.
> 
>  
> 
>  
> 
> Regards
> 
>  
> 
> Stuart
> 
>  
> 
>  
> 
> Yep, sorry, I got this post mixed up with the one from Daznis yesterday who 
> was using RBD’s. I think that warning was introduced as some bugs were found 
> that corrupted some users data after frequently switching between writeback 
> and forward modes. As it is very rarely used mode and so wasn’t worth the 
> testing I believe the decision was taken to just implement the warning. If 
> you are using it as part of removing a cache tier and you have already 
> flushed the tier, then I believe it should be fine to use. 
> 
>  

I suggest that you stop writes if possible so that nothing changes.

Then drain the cache and set the mode to forward.
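
Roughly like this (a sketch; 'cache' and 'base' are placeholder pool
names, and the ordering follows the Jewel cache-tier removal docs):

    ceph osd tier cache-mode cache forward --yes-i-really-mean-it
    rados -p cache cache-flush-evict-all
    # once "rados -p cache ls" comes back empty:
    ceph osd tier remove-overlay base
    ceph osd tier remove base cache
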

Wido

> 
> Another way would probably be to set the min promote thresholds to higher 
> than your hit set counts, this will abuse the tiering logic but should also 
> stop anything getting promoted into your cache tier.
> 
>  
> 
>  
> 
>  
> 
>  
> 
> On 10 Jan 2017, at 09:52, Wido den Hollander   > wrote:
> 
>  
> 
> 
> Op 10 januari 2017 om 9:52 schreef Nick Fisk   >:
> 
> 
> 
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: 10 January 2017 07:54
> To: ceph new  >; 
> Stuart Harland   >
> Subject: Re: [ceph-users] Write back cache removal
> 
> 
> 
> 
> 
> Op 9 januari 2017 om 13:02 schreef Stuart Harland 
>  >:
> 
> 
> Hi,
> 
> We’ve been operating a ceph storage system storing files using librados 
> (using a replicated pool on rust disks). We implemented a
> 
> cache over the top of this with SSDs, however we now want to turn this off.
> 
> 
> 
> 
> The documentation suggests setting the cache mode to forward before draining 
> the pool, however the ceph management
> 
> controller spits out an error about this saying that it is unsupported and 
> hence dangerous.
> 
> 
> 
>  
> 
> 
> What version of Ceph are you running?
> 
> And can you paste the exact command and the output?
> 
> Wido
> 
> 
> Hi Wido,
> 
> I think this has been discussed before and looks like it might be a current 
> limitation. Not sure if it's on anybody's radar to fix.
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg24472.html
> 
> 
> Might be, but afaik they are using their own application which writes to 
> RADOS using librados, not RBD.
> 
> Is that correct Stuart?
> 
> Wido
> 
> 
> 
> 
> Nick
> 
> 
> 
> 
> 
> 
> 
> 
> The thing is I cannot really locate any documentation as to why it’s 
> considered unsupported and under what conditions it is expected
> 
> to fail: I have read a passing comment about EC pools having data corruption, 
> but we are using replicated pools.
> 
> 
> 
> 
> Is this something that is safe to do?
> 
> Otherwise I have noted the read proxy mode of cache tiers which is documented 
> as a mechanism to transition from write back to
> 
> disabled, however the documentation is even sparser on this than forward 
> mode. Would this be a better approach if there is some
> unsupported behaviour in the forward mode cache option?
> 
> 
> 
> 
> Any thoughts would be appreciated - we really cannot afford to corrupt the 
> data, and I really do not want to have to do some
> 
> manual software based eviction on this data.
> 
> 
> 
> 
> regards
> 
> Stuart
> 
> 
> − Stuart Harland:
> Infrastructure Engineer
> Email: s.harl...@livelinktechnology.net 
>   
> 
> 
> 
> 
> LiveLink Technology Ltd
> McCormack House
> 56A East Street
> Havant
> PO9 1BS
> 

[ceph-users] Ceph Network question

2017-01-12 Thread Sivaram Kannan
Hi,

CEPH first-time user here. I am trying to set up a ceph cluster. The
documentation (http://ceph.com/planet/bootstrap-your-ceph-cluster-in-docker/)
recommends a separate network for the control and data planes.

1. Can I configure both the control and data planes on the same network?
2. Is it so bad to configure it that way even if I have a 10G pipe?

Thanks,
./Siva.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs of EC pool stuck in peering state

2017-01-12 Thread george.vasilakakos
Hi Ceph folks,

I’ve just posted a bug report http://tracker.ceph.com/issues/18508 

I have a cluster (Jewel 10.2.3, SL7) that has trouble creating PGs in EC pools. 
Essentially, I’ll get a lot of CRUSH_ITEM_NONE (2147483647) in there and PGs 
will stay in peering states. This sometimes affects other pools (EC and rep.) 
where their PGs fall into peering states too.

Restarting the primary OSD for a PG will get it to peer.
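
In practice that is something like the following for me (pg and OSD ids
here are placeholders):

    ceph health detail | grep -i peering
    ceph pg 1.7f query              # inspect recovery_state of the stuck PG
    systemctl restart ceph-osd@12   # 12 = the stuck PG's primary OSD
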

Has anyone run into this issue before, if so what did you do to fix it?


Cheers,

George


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread ulembke

Hi,

Am 2017-01-12 11:38, schrieb Shinobu Kinjo:

Sorry, I don't get your question.

Generally speaking, the MON maintains maps of the cluster state:

 * Monitor map
 * OSD map
 * PG map
 * CRUSH map
yes - and if an OSD says "osd.5 marked itself down" the mon can update
the OSD map (and PG map) immediately and need not wait for reporters
(waiting longer than the grace time).


Or am I missing something?

Regards,

Udo



Regards,


On Thu, Jan 12, 2017 at 7:03 PM,   wrote:

Hi all,
I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
ceph-cluster. All nodes are mons and have two OSDs.
During reboot of one node, ceph stucks longer than normaly and I look 
in the

"ceph -w" output to find the reason.

This is not the reason, but I'm wonder why "osd marked itself down" 
will not

recognised by the mons:
2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor 
election
2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor 
election
2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election 
with

quorum 0,2
2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, 
quorum 0,2

0,2
2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
{0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
kB/s

wr, 15 op/s
2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 
in

2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 
73018 B/s

wr, 12 op/s
2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 
B/s

rd, 135 kB/s wr, 15 op/s
2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
B/s

rd, 189 kB/s wr, 7 op/s
2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 
failed (2

reporters from different host after 21.222945 >= grace 20.388836)
2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 
failed (2

reporters from different host after 21.222970 >= grace 20.388836)
2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 
in

2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 
in


Why trust the mon not the osd? In this case the osdmap will be right 
app. 26

seconds earlier (the pgmap at 10:18:27.953410 is wrong).

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore activation error on Ubuntu Xenial/Ceph Jewel

2017-01-12 Thread Peter Maloney
Hey there... resurrecting a dead apparently unanswered question. I had
issues with this, and nobody online had any answers, and I accidentally
ran into the solution. So I hope this helps someone.

> Hello,
>
> I have been trying to deploy bluestore OSDs in a test cluster of 2x OSDs
> and 3x mon (xen1,2,3) on Ubuntu Xenial and Jewel 10.2.1.
>
> Activating the OSDs gives an error in systemd as follows. the culprit is
> the command "ceph-osd --get-device-fsid" which fails to get fsid.
...
> root at xen2 :/# 
> /usr/bin/ceph-osd --get-device-fsid /dev/sdb2
> 2016-06-02 19:03:50.960521 7f203b2928c0 -1 bluestore(/dev/sdb2)
> _read_bdev_label unable to decode label at offset 62:
> buffer::malformed_input: void
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) unknown
> encoding version > 1
> 2016-06-02 19:03:50.963348 7f203b2928c0 -1 journal read_header error
> decoding journal header
> failed to get device fsid for /dev/sdb2: (22) Invalid argument


To fix that, you have to run `ceph-osd ... --mkfs ...` (e.g. `ceph-osd
--cluster "${cluster}" -i "${osd_number}" --mkfs --mkkey --osd-uuid
"${osd_uuid}"`) on the OSD data dir, which requires that either a
symlink exists or ceph.conf says where the block device is. Before that
point, the block device just has old garbage from whatever was on it
before, and not a bluestore block header.
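
A rough sketch of that ordering (device, id and uuid are illustrative;
the usual surrounding steps -- ceph osd create, auth, crush add -- are
assumed, and ceph.conf must have "osd objectstore = bluestore" set):

    OSD_UUID=$(uuidgen)
    mkdir -p /var/lib/ceph/osd/ceph-0
    # either symlink the block device into the data dir...
    ln -s /dev/sdb2 /var/lib/ceph/osd/ceph-0/block
    # ...or tell ceph.conf where it lives, then:
    ceph-osd --cluster ceph -i 0 --mkfs --mkkey --osd-uuid "$OSD_UUID"
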

I accidentally found the cause/solution for this by running the manual
filestore OSD creation procedure while accidentally leaving "osd
objectstore = bluestore" in the ceph.conf, which created a file, not a
block device. That worked for some reason, but only allocated half the
space of the OSD and ran slower.

And I tested it in the latest kraken on Ubuntu 14.04 with kernel 4.4
from xenial.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Path Restriction, can still read all files

2017-01-12 Thread John Spray
On Thu, Jan 12, 2017 at 9:27 AM, Boris Mattijssen
 wrote:
> John,
>
> Do you know which kernel version I need? It seems to be not working with
> 4.8.15 on coreos (4.8.15-coreos) (I also tested on 4.7.3).
> I can confirm that it works using the ceph-fuse client, but I need the
> kernel client to work since I want to mount using Kubernetes ;)

The ticket (links to commits) was:
http://tracker.ceph.com/issues/17191

Looks like it's in 4.9.
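
With a 4.9+ kernel and the restricted cap ('allow rw path=/boris' only),
mounting the subtree directly should then work, roughly like this
(monitor address, secret file and paths are placeholders):

    mount -t ceph 10.0.0.1:6789:/boris /mnt/boris \
        -o name=boris,secretfile=/etc/ceph/boris.secret
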

John



>
> Btw, this is the error I get:
> mount: x.x.x.x:6789:/boris is write-protected, mounting read-only
> mount: cannot mount x.x.x.x:6789:/boris read-only
>
> Thanks,
> Boris
>
> On Wed, Jan 11, 2017 at 3:05 PM Boris Mattijssen
>  wrote:
>>
>> Ah right, I was using the the kernel client on kernel 3.x
>> Thanks for the answer. I'll try updating tomorrow and will let you know if
>> it works!
>>
>> Cheers,
>> Boris
>>
>>
>> On Wed, Jan 11, 2017 at 1:03 PM John Spray  wrote:
>>>
>>> On Wed, Jan 11, 2017 at 11:39 AM, Boris Mattijssen
>>>  wrote:
>>> > Hi Brukhard,
>>> >
>>> > Thanks for your answer. I've tried two things now:
>>> > * ceph auth get-or-create client.boris mon 'allow r' mds 'allow r
>>> > path=/,
>>> > allow rw path=/boris' osd 'allow rw pool=cephfs_data'. This is
>>> > according to
>>> > your suggestion. I am however now still able to mount the root path and
>>> > read
>>> > all containing subdirectories.
>>> > * ceph auth get-or-create client.boris mon 'allow r' mds 'allow rw
>>> > path=/boris' osd 'allow rw pool=cephfs_data'. So now I disallowed
>>> > reading
>>> > the root at all. I am however now not able to mount the fs (even when
>>> > using
>>> > the -r /boris) flag.
>>>
>>> The second one is correct, but some older clients (notably the kernel
>>> client before it was fixed in 4.x recently) don't work properly with
>>> it -- the older client code always tries to read the root inode, so
>>> fails to mount if it can't access it.
>>>
>>> John
>>>
>>> >
>>> > So to make it clear, I want to limit a given client (boris in this
>>> > case) to
>>> > only read an write to a given subdirectory of the root (/boris in this
>>> > case).
>>> >
>>> > Thanks,
>>> > Boris
>>> >
>>> > On Wed, Jan 11, 2017 at 11:30 AM Burkhard Linke
>>> >  wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >>
>>> >> On 01/11/2017 11:02 AM, Boris Mattijssen wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I'm trying to use path restriction on CephFS, running a Ceph Jewel
>>> >> (ceph
>>> >> version 10.2.5) cluster.
>>> >> For this I'm using the command specified in the official docs
>>> >> (http://docs.ceph.com/docs/jewel/cephfs/client-auth/):
>>> >> ceph auth get-or-create client.boris mon 'allow r' mds 'allow r, allow
>>> >> rw
>>> >> path=/boris' osd 'allow rw pool=cephfs_data'
>>> >>
>>> >> When I mount the fs with boris user and the generated secret I can
>>> >> still
>>> >> see all files in the fs (not just the files in /boris).
>>> >> l am restricted to write to anything but /boris, so the problem is
>>> >> that I
>>> >> can still read anything outside of /boris.
>>> >>
>>> >> Can someone please clarify what's going on?
>>> >>
>>> >>
>>> >> As far as I understand the mds caps, mds 'allow r' allows read-only
>>> >> access
>>> >> to all files; 'allow rw path=/boris' restricts write access to the
>>> >> given
>>> >> path. So your observations reflect the given permissions.
>>> >>
>>> >> You can configure ceph-fuse and kcephfs to use a given directory as
>>> >> 'root'
>>> >> directory of the mount point (e.g. ceph-fuse -r /boris). But I'm not
>>> >> sure
>>> >> whether
>>> >>
>>> >> - you need access to the root directory to mount with -r option
>>> >> - you can restrict the read-only access to the root directory without
>>> >> sub
>>> >> directories
>>> >>   (e.g. 'allow r path=/, allow rw path=/boris' to allow mounting a sub
>>> >> directory only)
>>> >>
>>> >> Unfortunately the -r option is a client side option, so you have to
>>> >> trust
>>> >> your clients.
>>> >>
>>> >> Regards,
>>> >> Burkhard
>>> >> ___
>>> >> ceph-users mailing list
>>> >> ceph-users@lists.ceph.com
>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>> >
>>> > ___
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests break performance

2017-01-12 Thread Eugen Block

Hi,


Looking at the output of dump_historic_ops and dump_ops_in_flight


I waited for new slow request messages and dumped the historic_ops  
into a file. The reporting OSD shows lots of "waiting for rw locks"  
messages and a duration of more than 30 secs:


 "age": 366.044746,
"duration": 32.491506,
"type_data": [
"commit sent; apply or cleanup",
{
"client": "client.9664429",
"tid": 130439910
},
[
{
"time": "2017-01-12 10:38:15.227649",
"event": "initiated"
},
{
"time": "2017-01-12 10:38:15.232310",
"event": "reached_pg"
},
{
"time": "2017-01-12 10:38:15.232341",
"event": "waiting for rw locks"
},
{
"time": "2017-01-12 10:38:15.268819",
"event": "reached_pg"
},
[
.
.
.
]
{
"time": "2017-01-12 10:38:45.515055",
"event": "waiting for rw locks"
},
{
"time": "2017-01-12 10:38:46.921095",
"event": "reached_pg"
},
{
"time": "2017-01-12 10:38:46.921157",
"event": "started"
},
{
"time": "2017-01-12 10:38:46.921342",
"event": "waiting for subops from 9,15"
},
{
"time": "2017-01-12 10:38:46.921724",
"event": "commit_queued_for_journal_write"
},
{
"time": "2017-01-12 10:38:46.922186",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2017-01-12 10:38:46.931103",
"event": "sub_op_commit_rec"
},
{
"time": "2017-01-12 10:38:46.968730",
"event": "sub_op_commit_rec"
},
{
"time": "2017-01-12 10:38:47.717770",
"event": "journaled_completion_queued"
},
{
"time": "2017-01-12 10:38:47.718280",
"event": "op_commit"
},
{
"time": "2017-01-12 10:38:47.718359",
"event": "commit_sent"
},
{
"time": "2017-01-12 10:38:47.718890",
"event": "op_applied"
},
{
"time": "2017-01-12 10:38:47.719154",
"event": "done"
}


There were about 70 events "waiting for rw locks", I truncated the output.
Based on the message "waiting for subops from 9,15" I also dumped the  
historic_ops for these two OSDs.


Duration on OSD.9

"initiated_at": "2017-01-12 10:38:29.258221",
"age": 54.069919,
"duration": 20.831568,

Duration on OSD.15

"initiated_at": "2017-01-12 10:38:23.695098",
"age": 112.118210,
"duration": 26.452526,

They also contain many "waiting for rw locks" messages, but not as
many as the dump from the reporting OSD.
To me it seems that because two OSDs take a lot of time to process  
their requests (only slightly less than 30 secs), it sums up to more  
than 30 secs on the reporting (primary?) OSD. Is the reporting OSD  
always the primary?


How can I debug this further? I searched the web for "waiting for rw  
locks", I also found Wido's blog [1] about my exact problem, but I'm  
not sure how to continue. Our admin says our network should be fine,  
but what can I do to rule that out?


I don't think I have provided information about our cluster yet:

4 nodes, 3 mons, 20 OSDs on
ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)

[1]  
https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/


Thanks!
Eugen
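
For reference, the dumps above come from the OSD admin sockets, e.g.
(run on the node hosting the OSD; the osd id is just an example):

    ceph daemon osd.9 dump_ops_in_flight
    ceph daemon osd.9 dump_historic_ops > /tmp/osd.9.historic_ops.json
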


Zitat von Brad Hubbard :


On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block  wrote:

Hi,

I simply grepped for "slow request" in ceph.log. What exactly do you mean by
"effective OSD"?

If I have this log line:
2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141 seconds
old, received at 2017-01-11 [...] ack+ondisk+write+known_if_redirected
e12440) currently 

Re: [ceph-users] Why would "osd marked itself down" will not recognised?

2017-01-12 Thread Shinobu Kinjo
Sorry, I don't get your question.

Generally speaking, the MON maintains maps of the cluster state:

 * Monitor map
 * OSD map
 * PG map
 * CRUSH map
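
Each of those can be inspected directly if you want to compare what the
MON thinks with what the OSDs report, e.g.:

    ceph mon dump
    ceph osd dump | less
    ceph pg dump | less
    ceph osd crush dump | less
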

Regards,


On Thu, Jan 12, 2017 at 7:03 PM,   wrote:
> Hi all,
> I had just reboot all 3 nodes (one after one) of an small Proxmox-VE
> ceph-cluster. All nodes are mons and have two OSDs.
> During reboot of one node, ceph stucks longer than normaly and I look in the
> "ceph -w" output to find the reason.
>
> This is not the reason, but I'm wonder why "osd marked itself down" will not
> recognised by the mons:
> 2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
> 2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
> 2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor election
> 2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor election
> 2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with
> quorum 0,2
> 2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,2
> 0,2
> 2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at
> {0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
> 2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 kB/s
> wr, 15 op/s
> 2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
> 2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
> 2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 B/s
> wr, 12 op/s
> 2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 B/s
> rd, 135 kB/s wr, 15 op/s
> 2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 B/s
> rd, 189 kB/s wr, 7 op/s
> 2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 failed (2
> reporters from different host after 21.222945 >= grace 20.388836)
> 2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 failed (2
> reporters from different host after 21.222970 >= grace 20.388836)
> 2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
> 2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392
> active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
> 2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in
>
> Why trust the mon and not the osd? In this case the osdmap would have been right
> approx. 26 seconds earlier (the pgmap at 10:18:27.953410 is wrong).
>
> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
>
> regards
>
> Udo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pipe "deadlock" in Hammer, 0.94.5

2017-01-12 Thread jiajia zhong
If errno is EAGAIN for recv, Pipe::do_recv just acts as if it were blocked, so

2017-01-12 16:34 GMT+08:00 许雪寒 :

> Hi, everyone.
>
> Recently, we ran some experiments to test the stability of the ceph
> cluster. We used the Hammer version, which is the version most widely used in
> our online clusters. One of the scenarios we simulated was poor network
> connectivity, in which we used iptables to drop TCP/IP packets with some
> probability. Sometimes we saw the following phenomenon: while one machine
> was running iptables to drop its inbound and outbound packets, OSDs on other
> machines could be brought down, and sometimes more than one OSD.
>
> We used gdb to debug the core dumped by Linux. We found that the thread
> that hit the suicide timeout is a peering thread that was trying to
> send a pg_notify message; the ceph-osd log and gdb output are as
> follows:
>
> Log file:
> -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
> -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after
> 150
> -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
> osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2  47+0+0
> (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>  0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
> common/HeartbeatMap.cc: In function 'bool 
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
> 0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #3  0x01a54ae4 in Mutex::Lock (this=0x7fd426453598,
> no_lockdep=false) at common/Mutex.cc:96
> #4  0x01409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
> m=...) at common/Mutex.h:115
> #5  0x01c46446 in PipeConnection::try_get_pipe
> (this=0x7fd426453580, p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
> #6  0x01c05809 in SimpleMessenger::submit_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, dest_addr=...,
> dest_type=4, already_locked=false) at msg/simple/SimpleMessenger.cc:443
> #7  0x01c033fa in SimpleMessenger::_send_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.cc:136
> #8  0x01c467c7 in SimpleMessenger::send_message
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.h:139
> #9  0x01c466a1 in PipeConnection::send_message
> (this=0x7fd426453580, m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
> #10 0x013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8,
> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x013b45e7 in OSDService::send_incremental_map
> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580,
> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
> #12 0x013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8,
> peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49)
> 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x013f43cc in OSD::do_notifies (this=0x7fd4821e6000,
> notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr
> (count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
> #14 0x013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000,
> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480,
> handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x0140043e in OSD::process_peering_events
> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
> #16 0x0141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070,
> pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
> #17 0x014b94bf in ThreadPool::BatchWorkQueue::_void_process
> (this=0x7fd4821e7070, p=0x7fd425419040, handle=...) at
> common/WorkQueue.h:107
> #18 0x01b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0,
> wt=0x7fd4761db430) at common/WorkQueue.cc:128
> #19 0x01b313f7 in ThreadPool::WorkThread::entry
> (this=0x7fd4761db430) at common/WorkQueue.h:318
> #20 0x01b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) at
> common/Thread.cc:61
> #21 0x01b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at
> common/Thread.cc:45
> #22 0x003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #23 0x003c5d4e8aad in clone () from /lib64/libc.so.6
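
(For anyone reproducing this kind of analysis, a minimal sketch of pulling
per-thread backtraces out of an OSD core file; the paths are placeholders and
the matching ceph debug symbols need to be installed:)

  gdb -batch -ex 'thread apply all bt' \
      /usr/bin/ceph-osd /var/crash/core.ceph-osd.12345 > osd-threads.txt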

[ceph-users] Why would "osd marked itself down" not be recognised?

2017-01-12 Thread ulembke

Hi all,
I had just rebooted all 3 nodes (one after another) of a small Proxmox-VE 
ceph cluster. All nodes are mons and have two OSDs each.
During the reboot of one node, ceph got stuck longer than normal, and I looked 
at the "ceph -w" output to find the reason.


This is not the reason, but I wonder why "osd marked itself down" is not 
recognised by the mons:

2017-01-12 10:18:13.584930 mon.0 [INF] osd.5 marked itself down
2017-01-12 10:18:13.585169 mon.0 [INF] osd.4 marked itself down
2017-01-12 10:18:22.809473 mon.2 [INF] mon.2 calling new monitor 
election
2017-01-12 10:18:22.847548 mon.0 [INF] mon.0 calling new monitor 
election
2017-01-12 10:18:27.879341 mon.0 [INF] mon.0@0 won leader election with 
quorum 0,2
2017-01-12 10:18:27.889797 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 
0,2 0,2
2017-01-12 10:18:27.952672 mon.0 [INF] monmap e3: 3 mons at 
{0=10.132.7.11:6789/0,1=10.132.7.12:6789/0,2=10.132.7.13:6789/0}
2017-01-12 10:18:27.953410 mon.0 [INF] pgmap v4800799: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 239 
kB/s wr, 15 op/s

2017-01-12 10:18:27.953453 mon.0 [INF] fsmap e1:
2017-01-12 10:18:27.953787 mon.0 [INF] osdmap e2053: 6 osds: 6 up, 6 in
2017-01-12 10:18:29.013968 mon.0 [INF] pgmap v4800800: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 73018 
B/s wr, 12 op/s
2017-01-12 10:18:30.086787 mon.0 [INF] pgmap v4800801: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 59 
B/s rd, 135 kB/s wr, 15 op/s
2017-01-12 10:18:34.559509 mon.0 [INF] pgmap v4800802: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail; 184 
B/s rd, 189 kB/s wr, 7 op/s
2017-01-12 10:18:35.623838 mon.0 [INF] pgmap v4800803: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.580770 mon.0 [INF] pgmap v4800804: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail
2017-01-12 10:18:39.681058 mon.0 [INF] osd.4 10.132.7.12:6800/4064 
failed (2 reporters from different host after 21.222945 >= grace 
20.388836)
2017-01-12 10:18:39.681221 mon.0 [INF] osd.5 10.132.7.12:6802/4163 
failed (2 reporters from different host after 21.222970 >= grace 
20.388836)
2017-01-12 10:18:40.612401 mon.0 [INF] pgmap v4800805: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail

2017-01-12 10:18:40.670801 mon.0 [INF] osdmap e2054: 6 osds: 4 up, 6 in
2017-01-12 10:18:40.689302 mon.0 [INF] pgmap v4800806: 392 pgs: 392 
active+clean; 567 GB data, 1697 GB used, 9445 GB / 11142 GB avail

2017-01-12 10:18:41.730006 mon.0 [INF] osdmap e2055: 6 osds: 4 up, 6 in

Why trust the mon and not the osd? In this case the osdmap would have been 
right approx. 26 seconds earlier (the pgmap at 10:18:27.953410 is wrong).


ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
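
(The reporter count and grace period that appear in the "failed (2 reporters
from different host after ... >= grace ...)" lines above can be inspected on
the node running the daemon via its admin socket; a rough sketch, assuming
default socket paths and example daemon names:)

  ceph daemon mon.0 config show | grep -E 'mon_osd_min_down|mon_osd_report'
  ceph daemon osd.0 config get osd_heartbeat_grace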


regards

Udo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Path Restriction, can still read all files

2017-01-12 Thread Boris Mattijssen
John,

Do you know which kernel version I need? It does not seem to work with
4.8.15 on CoreOS (4.8.15-coreos); I also tested 4.7.3.
I can confirm that it works using the ceph-fuse client, but I need the
kernel client to work since I want to mount using Kubernetes ;)

Btw, this is the error I get:
mount: x.x.x.x:6789:/boris is write-protected, mounting read-only
mount: cannot mount x.x.x.x:6789:/boris read-only
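
(For reference, a minimal sketch of the combination discussed in the quoted
thread below: restricted caps plus mounting the subdirectory directly with the
kernel client. The monitor address, mount point and secret file path are
placeholders:)

  ceph auth get-or-create client.boris mon 'allow r' mds 'allow rw path=/boris' \
      osd 'allow rw pool=cephfs_data' -o /etc/ceph/ceph.client.boris.keyring
  mount -t ceph x.x.x.x:6789:/boris /mnt/boris \
      -o name=boris,secretfile=/etc/ceph/boris.secret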

Thanks,
Boris

On Wed, Jan 11, 2017 at 3:05 PM Boris Mattijssen 
wrote:

> Ah right, I was using the kernel client on kernel 3.x.
> Thanks for the answer. I'll try updating tomorrow and will let you know if
> it works!
>
> Cheers,
> Boris
>
>
> On Wed, Jan 11, 2017 at 1:03 PM John Spray  wrote:
>
> On Wed, Jan 11, 2017 at 11:39 AM, Boris Mattijssen
>  wrote:
> > Hi Burkhard,
> >
> > Thanks for your answer. I've tried two things now:
> > * ceph auth get-or-create client.boris mon 'allow r' mds 'allow r path=/,
> > allow rw path=/boris' osd 'allow rw pool=cephfs_data'. This is according
> > to your suggestion. I am however still able to mount the root path and
> > read all containing subdirectories.
> > * ceph auth get-or-create client.boris mon 'allow r' mds 'allow rw
> > path=/boris' osd 'allow rw pool=cephfs_data'. So now I disallowed reading
> > the root at all. I am however now not able to mount the fs (even when
> > using the -r /boris flag).
>
> The second one is correct, but some older clients (notably the kernel
> client before it was fixed in 4.x recently) don't work properly with
> it -- the older client code always tries to read the root inode, so
> fails to mount if it can't access it.
>
> John
>
> >
> > So to make it clear: I want to limit a given client (boris in this case)
> > to only read and write to a given subdirectory of the root (/boris in
> > this case).
> >
> > Thanks,
> > Boris
> >
> > On Wed, Jan 11, 2017 at 11:30 AM Burkhard Linke
> >  wrote:
> >>
> >> Hi,
> >>
> >>
> >> On 01/11/2017 11:02 AM, Boris Mattijssen wrote:
> >>
> >> Hi all,
> >>
> >> I'm trying to use path restriction on CephFS, running a Ceph Jewel (ceph
> >> version 10.2.5) cluster.
> >> For this I'm using the command specified in the official docs
> >> (http://docs.ceph.com/docs/jewel/cephfs/client-auth/):
> >> ceph auth get-or-create client.boris mon 'allow r' mds 'allow r, allow
> >> rw path=/boris' osd 'allow rw pool=cephfs_data'
> >>
> >> When I mount the fs with the boris user and the generated secret I can still
> >> see all files in the fs (not just the files in /boris).
> >> Writes are restricted to /boris as intended, but the problem is that
> >> I can still read anything outside of /boris.
> >>
> >> Can someone please clarify what's going on?
> >>
> >>
> >> As far as I understand the mds caps, mds 'allow r' allows read-only
> >> access to all files; 'allow rw path=/boris' restricts write access to
> >> the given path. So your observations reflect the given permissions.
> >>
> >> You can configure ceph-fuse and kcephfs to use a given directory as the
> >> 'root' directory of the mount point (e.g. ceph-fuse -r /boris). But I'm
> >> not sure whether
> >>
> >> - you need access to the root directory to mount with the -r option
> >> - you can restrict the read-only access to the root directory without
> >>   its subdirectories (e.g. 'allow r path=/, allow rw path=/boris' to
> >>   allow mounting a sub directory only)
> >>
> >> Unfortunately the -r option is a client-side option, so you have to
> >> trust your clients.
> >>
> >> Regards,
> >> Burkhard
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Using the Hammer version, does radosgw support FastCGI long connections?

2017-01-12 Thread yaozongyou
Hi everybody,

I am using the Hammer version. Does radosgw in this version support FastCGI 
long (persistent) connections? The corresponding FastCGI keep-alive setting for 
nginx is fastcgi_keep_conn on.


Best wishes,
yaozongyou
2017/1/12
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pipe "deadlock" in Hammer, 0.94.5

2017-01-12 Thread 许雪寒
Hi, everyone.

Recently, we ran some experiments to test the stability of the ceph cluster. We 
used the Hammer version, which is the version most widely used in our online 
clusters. One of the scenarios we simulated was poor network connectivity, in 
which we used iptables to drop TCP/IP packets with some probability. Sometimes 
we saw the following phenomenon: while one machine was running iptables to drop 
its inbound and outbound packets, OSDs on other machines could be brought down, 
and sometimes more than one OSD.

We used gdb to debug the core dumped by Linux. We found that the thread that 
hit the suicide timeout is a peering thread that was trying to send a 
pg_notify message; the ceph-osd log and gdb output are as follows:

Log file:
    -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
    -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map is_healthy 
'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out after 150
    -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 -- 
10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163  
osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2  47+0+0 
(3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
 0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' 
thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

GDB OUTPUT:
(gdb) thread 8
[Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0  
0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) bt
#0  0x003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x003c5d8094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x01a54ae4 in Mutex::Lock (this=0x7fd426453598, no_lockdep=false) 
at common/Mutex.cc:96
#4  0x01409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0, m=...) at 
common/Mutex.h:115
#5  0x01c46446 in PipeConnection::try_get_pipe (this=0x7fd426453580, 
p=0x7fd440beb908) at msg/simple/PipeConnection.cc:38
#6  0x01c05809 in SimpleMessenger::submit_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580, dest_addr=..., dest_type=4, 
already_locked=false) at msg/simple/SimpleMessenger.cc:443
#7  0x01c033fa in SimpleMessenger::_send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.cc:136
#8  0x01c467c7 in SimpleMessenger::send_message (this=0x7fd482029400, 
m=0x7fd425538d00, con=0x7fd426453580) at msg/simple/SimpleMessenger.h:139
#9  0x01c466a1 in PipeConnection::send_message (this=0x7fd426453580, 
m=0x7fd425538d00) at msg/simple/PipeConnection.cc:78
#10 0x013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
#11 0x013b45e7 in OSDService::send_incremental_map 
(this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:1087
#12 0x013b215f in OSDService::share_map_peer (this=0x7fd4821e76c8, 
peer=9, con=0x7fd426453580, map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) 
at osd/OSD.cc:887
#13 0x013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
notify_list=std::map with 7 elements = {...}, curmap=std::tr1::shared_ptr 
(count 49) 0x7fd426c0f480) at osd/OSD.cc:7246
#14 0x013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, ctx=..., 
pg=0x0, curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480, 
handle=0x7fd440becb40) at osd/OSD.cc:7198
#15 0x0140043e in OSD::process_peering_events (this=0x7fd4821e6000, 
pgs=std::list = {...}, handle=...) at osd/OSD.cc:8539
#16 0x0141e094 in OSD::PeeringWQ::_process (this=0x7fd4821e7070, 
pgs=std::list = {...}, handle=...) at osd/OSD.h:1601
#17 0x014b94bf in ThreadPool::BatchWorkQueue::_void_process 
(this=0x7fd4821e7070, p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
#18 0x01b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0, 
wt=0x7fd4761db430) at common/WorkQueue.cc:128
#19 0x01b313f7 in ThreadPool::WorkThread::entry (this=0x7fd4761db430) 
at common/WorkQueue.h:318
#20 0x01b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) at 
common/Thread.cc:61
#21 0x01b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at 
common/Thread.cc:45
#22 0x003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
#23 0x003c5d4e8aad in clone () from /lib64/libc.so.6

As shown, the thread is waiting for a mutex which we believe is 
Connection::lock. The thread holding this mutex, which the waiting thread is 
trying to acquire, is a Pipe reader thread that is trying to read a 
full message that