Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
Actually, the lag is between "sub_op_committed" and "commit_sent". Is
there any pattern to these slow requests? Do they involve the same
osd, or set of osds?
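
A quick way to check that, for example on a mon host (the log path and the
message wording below are the usual defaults, adjust for your setup):

ceph health detail | grep -i slow
grep -i 'slow request' /var/log/ceph/ceph.log | tail -n 50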

On Thu, Mar 21, 2019 at 3:37 PM Brad Hubbard  wrote:
>
> On Thu, Mar 21, 2019 at 3:20 PM Glen Baars  
> wrote:
> >
> > Thanks for that - we seem to be experiencing the wait in this section of 
> > the ops.
> >
> > {
> > "time": "2019-03-21 14:12:42.830191",
> > "event": "sub_op_committed"
> > },
> > {
> > "time": "2019-03-21 14:12:43.699872",
> > "event": "commit_sent"
> > },
> >
> > Does anyone know what that section is waiting for?
>
> Hi Glen,
>
> These are documented, to some extent, here.
>
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>
> It looks like it may be taking a long time to communicate the commit
> message back to the client? Are these slow ops always the same client?
>
> >
> > Kind regards,
> > Glen Baars
> >
> > -Original Message-
> > From: Brad Hubbard 
> > Sent: Thursday, 21 March 2019 8:23 AM
> > To: Glen Baars 
> > Cc: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Slow OPS
> >
> > On Thu, Mar 21, 2019 at 12:11 AM Glen Baars  
> > wrote:
> > >
> > > Hello Ceph Users,
> > >
> > >
> > >
> > > Does anyone know what the flag point ‘Started’ is? Is that ceph osd 
> > > daemon waiting on the disk subsystem?
> >
> > This is set by "mark_started()" and is roughly set when the pg starts 
> > processing the op. Might want to capture dump_historic_ops output after the 
> > op completes.
> >
> > >
> > >
> > >
> > > Ceph 13.2.4 on centos 7.5
> > >
> > >
> > >
> > > "description": "osd_op(client.1411875.0:422573570 5.18ds0
> > > 5:b1ed18e5:::rbd_data.6.cf7f46b8b4567.0046e41a:head [read
> > >
> > > 1703936~16384] snapc 0=[] ondisk+read+known_if_redirected e30622)",
> > >
> > > "initiated_at": "2019-03-21 01:04:40.598438",
> > >
> > > "age": 11.340626,
> > >
> > > "duration": 11.342846,
> > >
> > > "type_data": {
> > >
> > > "flag_point": "started",
> > >
> > > "client_info": {
> > >
> > > "client": "client.1411875",
> > >
> > > "client_addr": "10.4.37.45:0/627562602",
> > >
> > > "tid": 422573570
> > >
> > > },
> > >
> > > "events": [
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598438",
> > >
> > > "event": "initiated"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598438",
> > >
> > > "event": "header_read"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598439",
> > >
> > > "event": "throttled"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598450",
> > >
> > > "event": "all_read"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598499",
> > >
> > > "event": "dispatched"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598504",
> > >
> > > "event": "queued_for_pg"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598883",
> > >
> > > "event": "reached_pg"
> > >
> > > },
> > >
> > > {
> > >
> > > "time": "2019-03-21 01:04:40.598905",
> > >
> > > "event": "started"
> > >
> > > }
> > >
> > > ]
> > >
> > > }
> > >
> > > }
> > >
> > > ],
> > >
> > >
> > >
> > > Glen
> > >
> > > This e-mail is intended solely for the benefit of the addressee(s) and 
> > > any other named recipient. It is confidential and may contain legally 
> > > privileged or confidential information. If you are not the recipient, any 
> > > use, distribution, disclosure or copying of this e-mail is prohibited. 
> > > The confidentiality and legal privilege attached to this communication is 
> > > not waived or lost by reason of the mistaken transmission or delivery to 
> > > you. If you have received this e-mail in error, please notify us 
> > > immediately.
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > 

Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
On Thu, Mar 21, 2019 at 3:20 PM Glen Baars  wrote:
>
> Thanks for that - we seem to be experiencing the wait in this section of the 
> ops.
>
> {
> "time": "2019-03-21 14:12:42.830191",
> "event": "sub_op_committed"
> },
> {
> "time": "2019-03-21 14:12:43.699872",
> "event": "commit_sent"
> },
>
> Does anyone know what that section is waiting for?

Hi Glen,

These are documented, to some extent, here.

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

It looks like it may be taking a long time to communicate the commit
message back to the client? Are these slow ops always the same client?
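
One way to check is to pull the client address out of the historic ops, for
example (osd.0 is a placeholder here, and the exact JSON layout of
dump_historic_ops can differ between releases):

ceph daemon osd.0 dump_historic_ops | jq -r '.ops[].type_data.client_info.client_addr' | sort | uniq -c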

>
> Kind regards,
> Glen Baars
>
> -Original Message-
> From: Brad Hubbard 
> Sent: Thursday, 21 March 2019 8:23 AM
> To: Glen Baars 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Slow OPS
>
> On Thu, Mar 21, 2019 at 12:11 AM Glen Baars  
> wrote:
> >
> > Hello Ceph Users,
> >
> >
> >
> > Does anyone know what the flag point ‘Started’ is? Is that ceph osd daemon 
> > waiting on the disk subsystem?
>
> This is set by "mark_started()" and is roughly set when the pg starts 
> processing the op. Might want to capture dump_historic_ops output after the 
> op completes.
>
> >
> >
> >
> > Ceph 13.2.4 on centos 7.5
> >
> >
> >
> > "description": "osd_op(client.1411875.0:422573570 5.18ds0
> > 5:b1ed18e5:::rbd_data.6.cf7f46b8b4567.0046e41a:head [read
> >
> > 1703936~16384] snapc 0=[] ondisk+read+known_if_redirected e30622)",
> >
> > "initiated_at": "2019-03-21 01:04:40.598438",
> >
> > "age": 11.340626,
> >
> > "duration": 11.342846,
> >
> > "type_data": {
> >
> > "flag_point": "started",
> >
> > "client_info": {
> >
> > "client": "client.1411875",
> >
> > "client_addr": "10.4.37.45:0/627562602",
> >
> > "tid": 422573570
> >
> > },
> >
> > "events": [
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598438",
> >
> > "event": "initiated"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598438",
> >
> > "event": "header_read"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598439",
> >
> > "event": "throttled"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598450",
> >
> > "event": "all_read"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598499",
> >
> > "event": "dispatched"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598504",
> >
> > "event": "queued_for_pg"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598883",
> >
> > "event": "reached_pg"
> >
> > },
> >
> > {
> >
> > "time": "2019-03-21 01:04:40.598905",
> >
> > "event": "started"
> >
> > }
> >
> > ]
> >
> > }
> >
> > }
> >
> > ],
> >
> >
> >
> > Glen
> >
> > This e-mail is intended solely for the benefit of the addressee(s) and any 
> > other named recipient. It is confidential and may contain legally 
> > privileged or confidential information. If you are not the recipient, any 
> > use, distribution, disclosure or copying of this e-mail is prohibited. The 
> > confidentiality and legal privilege attached to this communication is not 
> > waived or lost by reason of the mistaken transmission or delivery to you. 
> > If you have received this e-mail in error, please notify us immediately.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Cheers,
> Brad
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, 

Re: [ceph-users] Slow OPS

2019-03-20 Thread Glen Baars
Thanks for that - we seem to be experiencing the wait in this section of the 
ops.

{
"time": "2019-03-21 14:12:42.830191",
"event": "sub_op_committed"
},
{
"time": "2019-03-21 14:12:43.699872",
"event": "commit_sent"
},

Does anyone know what that section is waiting for?

Kind regards,
Glen Baars

-Original Message-
From: Brad Hubbard 
Sent: Thursday, 21 March 2019 8:23 AM
To: Glen Baars 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Slow OPS

On Thu, Mar 21, 2019 at 12:11 AM Glen Baars  wrote:
>
> Hello Ceph Users,
>
>
>
> Does anyone know what the flag point ‘Started’ is? Is that ceph osd daemon 
> waiting on the disk subsystem?

This is set by "mark_started()" and is roughly set when the pg starts 
processing the op. Might want to capture dump_historic_ops output after the op 
completes.

>
>
>
> Ceph 13.2.4 on centos 7.5
>
>
>
> "description": "osd_op(client.1411875.0:422573570 5.18ds0
> 5:b1ed18e5:::rbd_data.6.cf7f46b8b4567.0046e41a:head [read
>
> 1703936~16384] snapc 0=[] ondisk+read+known_if_redirected e30622)",
>
> "initiated_at": "2019-03-21 01:04:40.598438",
>
> "age": 11.340626,
>
> "duration": 11.342846,
>
> "type_data": {
>
> "flag_point": "started",
>
> "client_info": {
>
> "client": "client.1411875",
>
> "client_addr": "10.4.37.45:0/627562602",
>
> "tid": 422573570
>
> },
>
> "events": [
>
> {
>
> "time": "2019-03-21 01:04:40.598438",
>
> "event": "initiated"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598438",
>
> "event": "header_read"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598439",
>
> "event": "throttled"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598450",
>
> "event": "all_read"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598499",
>
> "event": "dispatched"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598504",
>
> "event": "queued_for_pg"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598883",
>
> "event": "reached_pg"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598905",
>
> "event": "started"
>
> }
>
> ]
>
> }
>
> }
>
> ],
>
>
>
> Glen
>
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Cheers,
Brad
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Brent Kennedy
Lots of good info there, thank you!  I tend to get options fatigue when trying 
to pick out a new system.  This should help narrow that focus greatly.  

 

-Brent

 

From: Reed Dier  
Sent: Wednesday, March 20, 2019 12:48 PM
To: Brent Kennedy 
Cc: ceph-users 
Subject: Re: [ceph-users] SSD Recovery Settings

 

Grafana is the web frontend for creating the graphs.

InfluxDB holds the time series data that Grafana pulls from.

To collect data, I am using collectd daemons running on each ceph node
(mon,mds,osd), as this was my initial way of ingesting metrics.

I am also now using the influx plugin in ceph-mgr to have ceph-mgr directly
report statistics to InfluxDB.

I know two other popular methods of collecting data are Telegraf and
Prometheus, both of which are popular, both of which have ceph-mgr plugins
as well here and here.

Influx Data also has a Grafana-like graphing front end, Chronograf, which
some prefer to Grafana.

Hopefully that's enough to get you headed in the right direction.

I would recommend not going down the CollectD path, as the project doesn't move
as quickly as Telegraf and Prometheus, and the majority of the metrics I am
pulling from these days are provided from the ceph-mgr plugin.

 

Hope that helps,

Reed





On Mar 20, 2019, at 11:30 AM, Brent Kennedy  wrote:

 

Reed:  If you don’t mind me asking, what was the graphing tool you had in the 
post?  I am using the ceph health web panel right now but it doesn’t go that 
deep.

 

Regards,

Brent

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS performance improved in 13.2.5?

2019-03-20 Thread Sergey Malinin
Hello,
Yesterday I upgraded from 13.2.2 to 13.2.5 and so far I have only seen 
significant improvements in MDS operations. Needless to say I'm happy, but I 
didn't notice anything related in release notes. Am I missing something, 
possibly new configuration settings?

Screenshots below:
https://prnt.sc/n0qzfp
https://prnt.sc/n0qzd5

And yes, ceph nodes and clients had kernel upgraded to v5.0.3
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow OPS

2019-03-20 Thread Brad Hubbard
On Thu, Mar 21, 2019 at 12:11 AM Glen Baars  wrote:
>
> Hello Ceph Users,
>
>
>
> Does anyone know what the flag point ‘Started’ is? Is that ceph osd daemon 
> waiting on the disk subsystem?

This is set by "mark_started()" and is roughly set when the pg starts
processing the op. Might want to capture dump_historic_ops output
after the op completes.
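
For example, on the host of the OSD that logged the slow op (osd.0 is a
placeholder; historic ops age out, so capture them shortly after the op
completes):

ceph daemon osd.0 dump_historic_ops > /tmp/osd.0-historic-ops.json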

>
>
>
> Ceph 13.2.4 on centos 7.5
>
>
>
> "description": "osd_op(client.1411875.0:422573570 5.18ds0 
> 5:b1ed18e5:::rbd_data.6.cf7f46b8b4567.0046e41a:head [read
>
> 1703936~16384] snapc 0=[] ondisk+read+known_if_redirected e30622)",
>
> "initiated_at": "2019-03-21 01:04:40.598438",
>
> "age": 11.340626,
>
> "duration": 11.342846,
>
> "type_data": {
>
> "flag_point": "started",
>
> "client_info": {
>
> "client": "client.1411875",
>
> "client_addr": "10.4.37.45:0/627562602",
>
> "tid": 422573570
>
> },
>
> "events": [
>
> {
>
> "time": "2019-03-21 01:04:40.598438",
>
> "event": "initiated"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598438",
>
> "event": "header_read"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598439",
>
> "event": "throttled"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598450",
>
> "event": "all_read"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598499",
>
> "event": "dispatched"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598504",
>
> "event": "queued_for_pg"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598883",
>
> "event": "reached_pg"
>
> },
>
> {
>
> "time": "2019-03-21 01:04:40.598905",
>
> "event": "started"
>
> }
>
> ]
>
> }
>
> }
>
> ],
>
>
>
> Glen
>
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd 14.2.0 won't start: Failed to pick public address on IPv6 only cluster

2019-03-20 Thread Simon Ironside

On 20/03/2019 19:53, Ricardo Dias wrote:

Make sure you have the following option in ceph.conf:

ms_bind_ipv4 = false

That will prevent the OSD from trying to find an IPv4 address.



Thank you! I've only ever used ms_bind_ipv6 = true on its own. Adding 
your line solved my problem.


Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-osd 14.2.0 won't start: Failed to pick public address on IPv6 only cluster

2019-03-20 Thread Ricardo Dias
Hi,

Make sure you have the following option in ceph.conf:

ms_bind_ipv4 = false

That will prevent the OSD from trying to find an IPv4 address.
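
A minimal sketch of the relevant ceph.conf section for an IPv6-only cluster
(2001:db8:1::/64 is just an example prefix, substitute your own):

[global]
public_network = 2001:db8:1::/64
ms_bind_ipv6 = true
ms_bind_ipv4 = false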

Cheers,
Ricardo Dias

> On 20 Mar 2019, at 19:41, Simon Ironside  wrote:
> 
> Hi Everyone,
> 
> I'm upgrading an IPv6 only cluster from 13.2.5 Mimic to 14.2.0 Nautilus. The 
> mon and mgr upgrades went fine, the first OSD node unfortunately fails to 
> restart after updating the packages.
> 
> The affected ceph-osd logs show the lines:
> 
> Unable to find any IPv4 address in networks 'MY /64' interfaces ''
> Failed to pick public address.
> 
> Where MY/64 is the correct IPv6 public subnet from ceph.conf.
> Should the single quotes after interfaces be blank? It would be on bond0 in 
> my case, just in case that's relevant.
> 
> The upgrade from 13.2.4 to 13.2.5 went without a hitch.
> I've obviously not gone any further but any suggestions for how to proceed?
> 
> Thanks,
> Simon.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-20 Thread Ronny Aasen



With Debian Buster frozen, if there are issues with Ceph on Debian that
would best be fixed in Debian, now is the last chance to get anything
into Buster before the next release.


It is also important to get Mimic and Luminous packages built for
Buster, since you want to avoid a situation where you have to upgrade
both the OS and Ceph at the same time.


kind regards
Ronny Aasen



On 20.03.2019 07:09, Alfredo Deza wrote:

There aren't any Debian packages built for this release because we
haven't updated the infrastructure to build (and test) Debian packages
yet.

On Tue, Mar 19, 2019 at 10:24 AM Sean Purdy  wrote:

Hi,


Will debian packages be released?  I don't see them in the nautilus repo.  I 
thought that Nautilus was going to be debian-friendly, unlike Mimic.


Sean

On Tue, 19 Mar 2019 14:58:41 +0100
Abhishek Lekshmanan  wrote:


We're glad to announce the first release of Nautilus v14.2.0 stable
series. There have been a lot of changes across components from the
previous Ceph releases, and we advise everyone to go through the release
and upgrade notes carefully.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-osd 14.2.0 won't start: Failed to pick public address on IPv6 only cluster

2019-03-20 Thread Simon Ironside

Hi Everyone,

I'm upgrading an IPv6 only cluster from 13.2.5 Mimic to 14.2.0 Nautilus. 
The mon and mgr upgrades went fine, the first OSD node unfortunately 
fails to restart after updating the packages.


The affected ceph-osd logs show the lines:

Unable to find any IPv4 address in networks 'MY /64' interfaces ''
Failed to pick public address.

Where MY/64 is the correct IPv6 public subnet from ceph.conf.
Should the single quotes after interfaces be blank? It would be on bond0 
in my case, just in case that's relevant.


The upgrade from 13.2.4 to 13.2.5 went without a hitch.
I've obviously not gone any further but any suggestions for how to proceed?

Thanks,
Simon.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Looking up buckets in multi-site radosgw configuration

2019-03-20 Thread David Coles
On Tue, Mar 19, 2019 at 7:51 AM Casey Bodley  wrote:

> Yeah, correct on both points. The zonegroup redirects would be the only
> way to guide clients between clusters.

Awesome. Thank you for the clarification.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Reed Dier
Grafana is the web frontend for creating the graphs.

InfluxDB holds the time series data that Grafana pulls from.

To collect data, I am using collectd daemons running on each ceph node
(mon,mds,osd), as this was my initial way of ingesting metrics.
I am also now using the influx plugin in ceph-mgr to have ceph-mgr directly
report statistics to InfluxDB.
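
In case it is useful, enabling the influx module is roughly this (the
hostname, database and credentials below are placeholders, and the
config-key names are as of Luminous/Mimic, so check the module docs for
your release):

ceph mgr module enable influx
ceph config-key set mgr/influx/hostname influxdb.example.com
ceph config-key set mgr/influx/database ceph
ceph config-key set mgr/influx/username ceph
ceph config-key set mgr/influx/password secret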

I know two other popular methods of collecting data are Telegraf and
Prometheus, both of which are popular, both of which have ceph-mgr plugins
as well here and here.
Influx Data also has a Grafana-like graphing front end, Chronograf, which
some prefer to Grafana.

Hopefully that's enough to get you headed in the right direction.
I would recommend not going down the CollectD path, as the project doesn't move 
as quickly as Telegraf and Prometheus, and the majority of the metrics I am 
pulling from these days are provided from the ceph-mgr plugin.

Hope that helps,
Reed

> On Mar 20, 2019, at 11:30 AM, Brent Kennedy  wrote:
> 
> Reed:  If you don’t mind me asking, what was the graphing tool you had in the 
> post?  I am using the ceph health web panel right now but it doesn’t go that 
> deep.
>  
> Regards,
> Brent



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Brent Kennedy
Seems both of you are spot on.  I injected the change and it's now moving at
.080 instead of .002.  I did fix the label on the drives from HDD to SSD, but
I didn't restart the OSDs due to the recovery process.  Seeing it fly now.
I also restarted the stuck OSDs, but I know they are where the data is, so
they keep going back to slow.  I imagine this is part of the pg creation
process and my failure to adjust these settings when I created them.  Thanks!

 

I wasn't able to pull the config from the daemon (directory not found), but I
used the web panel to look at the setting.  I also found that
"bluestore_bdev_type" was set to "hdd", so I am going to see if there is a
way to change that, because when I restarted some of the stuck OSDs, the tag
change I made doesn't seem to affect this setting.  I use ceph-deploy to do
the deployment (after ansible server setup), so it could also be a switch I
need to be using.  This is our first SSD cluster.
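
As a stopgap while the OSDs are still (mis)detected as rotational, something
like this should force the SSD-style recovery sleep (osd.0 and the values
here are just examples, untested on this cluster):

ceph daemon osd.0 config get osd_recovery_sleep_hdd
ceph tell 'osd.*' injectargs '--osd-recovery-sleep-hdd 0 --osd-recovery-sleep-hybrid 0'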

 

Reed:  If you don't mind me asking, what was the graphing tool you had in
the post?  I am using the ceph health web panel right now but it doesn't go
that deep.

 

Regards,

Brent

 

From: Reed Dier  
Sent: Wednesday, March 20, 2019 11:01 AM
To: Brent Kennedy 
Cc: ceph-users 
Subject: Re: [ceph-users] SSD Recovery Settings

 

Not sure what your OSD config looks like,

 

When I was moving from Filestore to Bluestore on my SSD OSD's (and NVMe FS
journal to NVMe Bluestore block.db),

I had an issue where the OSD was incorrectly being reported as rotational in
some part of the chain.

Once I overcame that, I had a huge boost in recovery performance (repaving
OSDs).

Might be something useful in there.

 

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025039.html

 

Reed





On Mar 19, 2019, at 11:29 PM, Konstantin Shalygin  wrote:

 

 

I setup an SSD Luminous 12.2.11 cluster and realized after data had been
added that pg_num was not set properly on the default.rgw.buckets.data pool
( where all the data goes ).  I adjusted the settings up, but recovery is
going really slow ( like 56-110MiB/s ) ticking down at .002 per log
entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
objects ) added to the cluster before I noticed the red blinking lights.
 
 
 
I tried adjusting the recovery to:
 
ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
 
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'
 
 
 
Which did help a little, but didn't seem to have the impact I was looking
for.  I have used the settings on HDD clusters before to speed things up (
using 8 backfills and 4 max active though ).  Did I miss something or is
this part of the pg expansion process.  Should I be doing something else
with SSD clusters?
 
 
 
Regards,
 
-Brent
 
 
 
Existing Clusters:
 
Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
virtual on SSD )
 
US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
behind haproxy LB
 
UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
gateways behind haproxy LB
 
US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
gateways behind haproxy LB

 

Try to lower `osd_recovery_sleep*` options.

You can get your current values from ceph admin socket like this:

```

ceph daemon osd.0 config show | jq 'to_entries[] | if
(.key|test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'

```

 

k

___
ceph-users mailing list
  ceph-users@lists.ceph.com
 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs manila snapshots best practices

2019-03-20 Thread Dan van der Ster
Hi all,

We're currently upgrading our cephfs (managed by OpenStack Manila)
clusters to Mimic, and want to start enabling snapshots of the file
shares.
There are different ways to approach this, and I hope someone can
share their experiences with:

1. Do you give users the 's' flag in their cap, so that they can
create snapshots themselves? We're currently planning *not* to do this
-- we'll create snapshots for the users.
2. We want to create periodic snaps for all cephfs volumes. I can see
pros/cons to creating the snapshots in /volumes/.snap or in
/volumes/_nogroup//.snap. Any experience there? Or maybe even
just an fs-wide snap in /.snap is the best approach ?
3. I found this simple cephfs-snap script which should do the job:
http://images.45drives.com/ceph/cephfs/cephfs-snap  Does anyone have a
different recommendation?

Thanks!

Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Reed Dier
Not sure what your OSD config looks like,

When I was moving from Filestore to Bluestore on my SSD OSD's (and NVMe FS 
journal to NVMe Bluestore block.db),
I had an issue where the OSD was incorrectly being reported as rotational in 
some part of the chain.
Once I overcame that, I had a huge boost in recovery performance (repaving 
OSDs).
Might be something useful in there.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025039.html 


Reed

> On Mar 19, 2019, at 11:29 PM, Konstantin Shalygin  wrote:
> 
> 
>> I setup an SSD Luminous 12.2.11 cluster and realized after data had been
>> added that pg_num was not set properly on the default.rgw.buckets.data pool
>> ( where all the data goes ).  I adjusted the settings up, but recovery is
>> going really slow ( like 56-110MiB/s ) ticking down at .002 per log
>> entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
>> ) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
>> servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
>> objects ) added to the cluster before I noticed the red blinking lights.
>> 
>>  
>> 
>> I tried adjusting the recovery to:
>> 
>> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
>> 
>> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'
>> 
>>  
>> 
>> Which did help a little, but didn't seem to have the impact I was looking
>> for.  I have used the settings on HDD clusters before to speed things up (
>> using 8 backfills and 4 max active though ).  Did I miss something or is
>> this part of the pg expansion process.  Should I be doing something else
>> with SSD clusters?
>> 
>>  
>> 
>> Regards,
>> 
>> -Brent
>> 
>>  
>> 
>> Existing Clusters:
>> 
>> Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
>> virtual on SSD )
>> 
>> US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
>> behind haproxy LB
>> 
>> UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
>> gateways behind haproxy LB
>> 
>> US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
>> gateways behind haproxy LB
> 
> Try to lower `osd_recovery_sleep*` options.
> 
> You can get your current values from ceph admin socket like this:
> 
> ```
> 
> ceph daemon osd.0 config show | jq 'to_entries[] | if 
> (.key|test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'
> 
> ```
> 
> 
> 
> k
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Slow OPS

2019-03-20 Thread Glen Baars
Hello Ceph Users,

Does anyone know what the flag point 'Started' is? Is that ceph osd daemon 
waiting on the disk subsystem?

Ceph 13.2.4 on centos 7.5

"description": "osd_op(client.1411875.0:422573570 5.18ds0 
5:b1ed18e5:::rbd_data.6.cf7f46b8b4567.0046e41a:head [read
1703936~16384] snapc 0=[] ondisk+read+known_if_redirected e30622)",
"initiated_at": "2019-03-21 01:04:40.598438",
"age": 11.340626,
"duration": 11.342846,
"type_data": {
"flag_point": "started",
"client_info": {
"client": "client.1411875",
"client_addr": "10.4.37.45:0/627562602",
"tid": 422573570
},
"events": [
{
"time": "2019-03-21 01:04:40.598438",
"event": "initiated"
},
{
"time": "2019-03-21 01:04:40.598438",
"event": "header_read"
},
{
"time": "2019-03-21 01:04:40.598439",
"event": "throttled"
},
{
"time": "2019-03-21 01:04:40.598450",
"event": "all_read"
},
{
"time": "2019-03-21 01:04:40.598499",
"event": "dispatched"
},
{
"time": "2019-03-21 01:04:40.598504",
"event": "queued_for_pg"
},
{
"time": "2019-03-21 01:04:40.598883",
"event": "reached_pg"
},
{
"time": "2019-03-21 01:04:40.598905",
"event": "started"
}
]
}
}
],

Glen
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread jesper
> `cpupower idle-set -D 0` will help you a lot, yes.
>
> However it seems that not only the bluestore makes it slow. >= 50% of the
> latency is introduced by the OSD itself. I'm just trying to understand
> WHAT parts of it are doing so much work. For example in my current case
> (with cpupower idle-set -D 0 of course) when I was testing a single OSD on
> a very good drive (Intel NVMe, capable of 4+ single-thread sync write
> iops) it was delivering me only 950-1000 iops. It's roughly 1 ms latency,
> and only 50% of it comes from bluestore (you can see it `ceph daemon osd.x
> perf dump`)! I've even tuned bluestore a little, so that now I'm getting
> ~1200 iops from it. It means that the bluestore's latency dropped by 33%
> (it was around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the
> overall improvement is only 20% - everything else is eaten by the OSD
> itself.


Thanks for the insight - that means that the SSD numbers for read/write
performance are roughly OK - I guess.

It still puzzles me why the bluestore caching does not benefit
the read side.

Is the cache not an LRU cache on the block device, or is it actually used for
something else?
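
For what it's worth, a rough way to see what the bluestore cache is actually
holding is the mempool dump, e.g. (osd.0 is a placeholder, run on the OSD
host; counter names may vary slightly between releases):

ceph daemon osd.0 dump_mempools
ceph daemon osd.0 config get bluestore_cache_size_ssd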

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-20 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 9:43 AM Erwin Bogaard  wrote:
>
> Hi,
>
>
>
> For a number of application we use, there is a lot of file duplication. This 
> wastes precious storage space, which I would like to avoid.
>
> When using a local disk, I can use a hard link to let all duplicate files 
> point to the same inode (use “rdfind”, for example).
>
>
>
> As there isn’t any deduplication in Ceph(FS) I’m wondering if I can use hard 
> links on CephFS in the same way as I use for ‘regular’ file systems like ext4 
> and xfs.
>
> 1. Is it advisible to use hard links on CephFS? (It isn’t in the ‘best 
> practices’: http://docs.ceph.com/docs/master/cephfs/app-best-practices/)
>
> 2. Is there any performance (dis)advantage?
>
> 3. When using hard links, is there an actual space savings, or is there some 
> trickery happening?
>
> 4. Are there any issues (other than the regular hard link ‘gotcha’s’) I need 
> to keep in mind combining hard links with CephFS?

The only issue we've seen is if you hardlink b to a, then rm a, then
never stat b, the inode is added to the "stray" directory. By default
there is a limit of 1 million stray entries -- so if you accumulate
files in this state eventually users will be unable to rm any files,
until you stat the `b` files.
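
If it helps, the stray count can be watched via the MDS perf counters, e.g.
(mds.a is a placeholder for your active MDS daemon name):

ceph daemon mds.a perf dump mds_cache | grep -i stray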

-- dan




>
>
>
> Thanks
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: effects of using hard links

2019-03-20 Thread Yan, Zheng


On 3/20/19 11:54 AM, Gregory Farnum wrote:
On Tue, Mar 19, 2019 at 2:13 PM Erwin Bogaard  wrote:


Hi,

For a number of application we use, there is a lot of file
duplication. This wastes precious storage space, which I would
like to avoid.

When using a local disk, I can use a hard link to let all
duplicate files point to the same inode (use “rdfind”, for example).

As there isn’t any deduplication in Ceph(FS) I’m wondering if I
can use hard links on CephFS in the same way as I use for
‘regular’ file systems like ext4 and xfs.

1. Is it advisible to use hard links on CephFS? (It isn’t in the
‘best practices’:
http://docs.ceph.com/docs/master/cephfs/app-best-practices/)


This should be okay now. Hard links have changed a few times so Zheng 
can correct me if I've gotten something wrong, but the differences 
between regular files from a user/performance perspective are:
* if you take snapshots and have hard links, hard-linked files are 
special and will be a member of *every* snapshot in the system (which 
only matters if you actually write to them during all those snapshots)
* opening a hard-linked file may behave as if you were doing two file 
opens instead of one, from a performance perspective. But this might 
have changed? (In the past, you would need to look up the file name 
you open, and then do another lookup on the authoritative location of 
the file.)


This hasn't changed. A hard link in cephfs is a magic symbolic link. Its main
overhead is at open.



Regards

Yan, Zheng




2. Is there any performance (dis)advantage?


Generally not once the file is open.

3. When using hard links, is there an actual space savings, or is
there some trickery happening?


If you create a hard link, there is a single copy of the file data in 
RADOS that all the file names refer to. I think that's what you're asking?


4. Are there any issues (other than the regular hard link
‘gotcha’s’) I need to keep in mind combining hard links with CephFS?


Not other than above.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Mark Nelson

On 3/20/19 3:12 AM, Vitaliy Filippov wrote:

`cpupower idle-set -D 0` will help you a lot, yes.

However it seems that not only the bluestore makes it slow. >= 50% of 
the latency is introduced by the OSD itself. I'm just trying to 
understand WHAT parts of it are doing so much work. For example in my 
current case (with cpupower idle-set -D 0 of course) when I was 
testing a single OSD on a very good drive (Intel NVMe, capable of 
4+ single-thread sync write iops) it was delivering me only 
950-1000 iops. It's roughly 1 ms latency, and only 50% of it comes 
from bluestore (you can see it `ceph daemon osd.x perf dump`)! I've 
even tuned bluestore a little, so that now I'm getting ~1200 iops from 
it. It means that the bluestore's latency dropped by 33% (it was 
around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the 
overall improvement is only 20% - everything else is eaten by the OSD 
itself.




I'd suggest looking in the direction of pglog.  See:


https://www.spinics.net/lists/ceph-devel/msg38975.html


Back around that time I hacked pglog updates out of the code when I was 
testing a custom version of the memstore backend and saw some pretty 
dramatic reductions in CPU usage (and at least somewhat an increase in 
performance).  Unfortunately I think fixing it is going to be a big job, 
but it's high on my list of troublemakers.



Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm batch OSD replacement

2019-03-20 Thread Dan van der Ster
On Tue, Mar 19, 2019 at 12:25 PM Dan van der Ster  wrote:
>
> On Tue, Mar 19, 2019 at 12:17 PM Alfredo Deza  wrote:
> >
> > On Tue, Mar 19, 2019 at 7:00 AM Alfredo Deza  wrote:
> > >
> > > On Tue, Mar 19, 2019 at 6:47 AM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > We've just hit our first OSD replacement on a host created with
> > > > `ceph-volume lvm batch` with mixed hdds+ssds.
> > > >
> > > > The hdd /dev/sdq was prepared like this:
> > > ># ceph-volume lvm batch /dev/sd[m-r] /dev/sdac --yes
> > > >
> > > > Then /dev/sdq failed and was then zapped like this:
> > > >   # ceph-volume lvm zap /dev/sdq --destroy
> > > >
> > > > The zap removed the pv/vg/lv from sdq, but left behind the db on
> > > > /dev/sdac (see P.S.)
> > >
> > > That is correct behavior for the zap command used.
> > >
> > > >
> > > > Now we're replaced /dev/sdq and we're wondering how to proceed. We see
> > > > two options:
> > > >   1. reuse the existing db lv from osd.240 (Though the osd fsid will
> > > > change when we re-create, right?)
> > >
> > > This is possible but you are right that in the current state, the FSID
> > > and other cluster data exist in the LV metadata. To reuse this LV for
> > > a new (replaced) OSD
> > > then you would need to zap the LV *without* the --destroy flag, which
> > > would clear all metadata on the LV and do a wipefs. The command would
> > > need the full path to
> > > the LV associated with osd.240, something like:
> > >
> > > ceph-volume lvm zap /dev/ceph-osd-lvs/db-lv-240
> > >
> > > >   2. remove the db lv from sdac then run
> > > > # ceph-volume lvm batch /dev/sdq /dev/sdac
> > > >  which should do the correct thing.
> > >
> > > This would also work if the db lv is fully removed with --destroy
> > >
> > > >
> > > > This is all v12.2.11 btw.
> > > > If (2) is the prefered approached, then it looks like a bug that the
> > > > db lv was not destroyed by lvm zap --destroy.
> > >
> > > Since /dev/sdq was passed in to zap, just that one device was removed,
> > > so this is working as expected.
> > >
> > > Alternatively, zap has the ability to destroy or zap LVs associated
> > > with an OSD ID. I think this is not released yet for Luminous but
> > > should be in the next release (which seems to be what you want)
> >
> > Seems like 12.2.11 was released with the ability to zap by OSD ID. You
> > can also zap by OSD FSID; both ways will zap (and optionally destroy if
> > using --destroy)
> > all LVs associated with the OSD.
> >
> > Full examples on this can be found here:
> >
> > http://docs.ceph.com/docs/luminous/ceph-volume/lvm/zap/#removing-devices
> >
> >
>
> Ohh that's an improvement! (Our goal is outsourcing the failure
> handling to non-ceph experts, so this will help simplify things.)
>
> In our example, the operator needs to know the osd id, then can do:
>
> 1. ceph-volume lvm zap --destroy --osd-id 240 (wipes sdq and removes
> the lvm from sdac for osd.240)
> 2. replace the hdd
> 3. ceph-volume lvm batch /dev/sdq /dev/sdac --osd-ids 240
>
> But I just remembered that the --osd-ids flag hasn't been backported
> to luminous, so we can't yet do that. I guess we'll follow the first
> (1) procedure to re-use the existing db lv.

Hmm... re-using the db lv didn't work.

We zapped it (see https://pastebin.com/N6PwpbYu) then got this error
when trying to create:

# ceph-volume lvm create --data /dev/sdq --block.db
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
--osd-id 240
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
9f63b457-37e0-4e33-971e-c0fc24658b65 240
Running command: vgcreate --force --yes
ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45 /dev/sdq
 stdout: Physical volume "/dev/sdq" successfully created.
 stdout: Volume group "ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45"
successfully created
Running command: lvcreate --yes -l 100%FREE -n
osd-block-9f63b457-37e0-4e33-971e-c0fc24658b65
ceph-8ef05e54-8909-49f8-951d-0f9d37aeba45
 stdout: Logical volume
"osd-block-9f63b457-37e0-4e33-971e-c0fc24658b65" created.
--> blkid could not detect a PARTUUID for device:
/dev/ceph-094c06db-98dc-47f6-a7e5-1092b099b372/osd-block-db-fa0e7927-dc3e-44d0-a8ce-1d8202fa75dd
--> Was unable to complete a new OSD, will rollback changes
--> OSD will be destroyed, keeping the ID because it was provided with --osd-id
Running command: ceph osd destroy osd.240 --yes-i-really-mean-it
 stderr: destroyed osd.240
-->  RuntimeError: unable to use device


Any idea?

-- dan



>
> -- dan
>
> > >
> > > >
> > > > Once we sort this out, we'd be happy to contribute to the ceph-volume
> > > > lvm batch doc.
> > > >
> > > > Thanks!
> > > >
> > > > Dan
> > > >
> > > > P.S:
> > > >
> > > 

Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Vitaliy Filippov

`cpupower idle-set -D 0` will help you a lot, yes.

However it seems that not only the bluestore makes it slow. >= 50% of the  
latency is introduced by the OSD itself. I'm just trying to understand  
WHAT parts of it are doing so much work. For example in my current case  
(with cpupower idle-set -D 0 of course) when I was testing a single OSD on  
a very good drive (Intel NVMe, capable of 4+ single-thread sync write  
iops) it was delivering me only 950-1000 iops. It's roughly 1 ms latency,  
and only 50% of it comes from bluestore (you can see it `ceph daemon osd.x  
perf dump`)! I've even tuned bluestore a little, so that now I'm getting  
~1200 iops from it. It means that the bluestore's latency dropped by 33%  
(it was around 1/1000 = 500 us, now it is 1/1200 = ~330 us). But still the  
overall improvement is only 20% - everything else is eaten by the OSD  
itself.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fio test rbd - single thread - qd1

2019-03-20 Thread Maged Mokhtar




On 19/03/2019 16:17, jes...@krogh.cc wrote:

Hi All.

I'm trying to get head and tails into where we can stretch our Ceph cluster
into what applications. Parallism works excellent, but baseline throughput
it - perhaps - not what I would expect it to be.

Luminous cluster running bluestore - all OSD-daemons have 16GB of cache.

Fio files attached - 4KB random read and 4KB random write - test file is
"only" 1GB
In this i ONLY care about raw IOPS numbers.

I have 2 pools, both 3x replicated .. one backed with SSDs S4510's
(14x1TB) and one with HDD's 84x10TB.

Network latency from rbd mount to one of the osd-hosts.
--- ceph-osd01.nzcorp.net ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9189ms
rtt min/avg/max/mdev = 0.084/0.108/0.146/0.022 ms

SSD:
randr:
# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  38   1727.07   2033.66   1954.71 1949.4789 46.592401
randw:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  36400.05455.26436.58 433.91417 12.468187

The double (or triple) network penalty of-course kicks in and delivers a
lower throughput here.
Are these performance numbers in the ballpark of what we'd expect?

With 1GB of test file .. I would really expect this to be memory cached in
the OSD/bluestore cache
and thus deliver a read IOPS closer to theoretical max: 1s/0.108ms => 9.2K
IOPS

Again on the write side - all OSDs are backed by Battery-Backed write
cache, thus writes should go directly
into memory of the controller .. still slower than reads - due to
having to visit 3 hosts.. but not this low?

Suggestions for improvements? Are other people seeing similar results?

For the HDD tests I get similar - surprisingly slow numbers:
# grep iops write*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  38 36.91 118.8 69.14 72.926842  21.75198

This should have the same performance characteristics as the SSD's as the
writes should be hitting BBWC.

# grep iops read*json | grep -v 0.00  | perl -ane'print $F[-1] . "\n"' |
cut -d\, -f1 | ministat -n
x 
 N   Min   MaxMedian   AvgStddev
x  39 26.18181.51 48.16 50.574872  24.01572

Same here - should be cached in the bluestore cache as it is 16GB x 84
OSD's  .. with a 1GB testfile.

Any thoughts - suggestions - insights ?

Jesper


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Cannot comment on the cache issue, hope someone will.

The ssd read latency of 0.5 ms and write latency of 2 ms are in the
ballpark. With Bluestore it is difficult to get below 1 ms for a write.


As suggested, make sure your cpu has at most 1 c-state and the p-state min
freq is 100%. Also, a cpu with higher GHz would give a better 1 qd /
latency value than a cpu with high cores but less GHz
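
For example, something along these lines on each OSD host (assuming the
cpupower tool is installed; treat it as a sketch and verify the settings for
your hardware):

cpupower idle-set -D 0                  # disable deep C-states
cpupower frequency-set -g performance   # keep the CPU at its highest frequency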


Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH ISCSI LIO multipath change delay

2019-03-20 Thread Maged Mokhtar




On 20/03/2019 07:43, li jerry wrote:

Hi,ALL

I’ve deployed mimic(13.2.5) cluster on 3 CentOS 7.6 servers, then 
configured iscsi-target and created a LUN, referring to 
http://docs.ceph.com/docs/mimic/rbd/iscsi-target-cli/.


I have another server which is CentOS 7.4, configured and mounted the 
LUN I’ve just created, referring to 
http://docs.ceph.com/docs/mimic/rbd/iscsi-initiator-linux/.


I’m trying to do some HA testing:

1. Perform a WRITE test with DD command

2. Stop one ‘Activate’ iscsi-target node (ini 0); DD IO hangs for over 25
seconds until the iscsi-target switches to another node


3. DD IO goes back to normal

My question is, why does it take so long for the iscsi-target to switch? Are
there any settings I’ve misconfigured?


Usually it only takes a few seconds to switch on enterprise storage
products.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



If you mean you shut down the entire host: from your description that host is
also running OSDs, so you also took out some OSDs serving IO.


If a primary OSD is not responding, client IO (in this case your iSCSI
target) will block until Ceph marks the OSD down and issues a new epoch map
mapping the PG to another OSD. This process is controlled by
osd_heartbeat_interval (5) and osd_heartbeat_grace (20), a total of 25 sec,
which is what you observe. I do not recommend you lower them, else your
cluster will be over sensitive and osds could flap under load.
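
To confirm the values in effect on a running OSD, something like this should
work (osd.0 is just a placeholder for one of your OSDs):

ceph daemon osd.0 config get osd_heartbeat_interval
ceph daemon osd.0 config get osd_heartbeat_grace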


Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New Bluestore Cluster Hardware Questions

2019-03-20 Thread Ariel S
Hey all, we've been running Ceph for a while and I'm in the process of
providing hardware for a new site. At the current site we run filestore Ceph,
and for the new site I'm going ahead with a bluestore Ceph cluster.

This is what I've come up for hardware specification for the server,
which we will starts at 3 of these:

    MB: Intel Server Board S1200SPLR, Intel® Xeon® Processor E3 v6 Series 
LGA1151
    CPU: Intel Xeon E3-1230v6, 3.5Ghz, Cache 8MB, LGA1151
    LAN: Intel X550-T1, 10G-Base-T
    RAM: 4x SK Hynix 16GB DDR4-2400 ECC UDIMM
    HDD, OS: Samsung SSD 860 Pro 256GB
    HDD, Journal: Samsung SSD SM863a 240GiB (MZ7KM240HMHQ-5)
    HDD, OSD1: Seagate 3.5" 6 TB SATA3 Exos 7E8
    HDD, OSD2: Western Digital 3.5" 6 TB SATA3 Ultrastar DC HC310

    Networking will use bonding active-backup of 10G and 1G link, then split
with VLAN for private and public network.

These will be custom built; unfortunately, Supermicro, HP, Dell, etc. hardware
is a no-go (yet) because (apparently) we're (still) sensitive about budget. We
are going to use Ceph predominantly for capacity, using rbd to store PostgreSQL
instances' WAL and snapshots for retention purposes, as well as mapping
rbd disks on VMs as additional block device(s) should it become necessary.


So here are my questions:

    Supposedly, we are going to use 2U 8-bay chassis and populate the available
slots with 4/6TB disks. That would be 48TB of storage for a node with 4 cores /
8 threads. A slide from Nick Fisk [1] demonstrates a low-latency Ceph setup with
a similar CPU and slightly bigger total capacity; is that a good selection for
the CPU?

    Correct me if I'm wrong here: the documentation mentions a block.db
partition for metadata that should be allocated as large as possible [2], with
the recommendation of at least 4% of the block-data size. Again, I'm going to
use 6TB disks for storage; does that mean I should have a 245G partition on SSD
just for one block.db? Does anyone using rbd actually follow these numbers?
I've read that one can use either 4G or 30G for block.db [3]; if I under-size
this block.db partition, am I going to see faster writes/reads until this
"metadata" is completely full?

    Any recommendation on my hardware list?

[1] https://www.slideshare.net/ShapeBlue/nick-fisk-low-latency-ceph
[2] 
http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#sizing
[3] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-March/033692.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-20 Thread Alfredo Deza
On Tue, Mar 19, 2019 at 2:53 PM Benjamin Cherian
 wrote:
>
> Hi,
>
> I'm getting an error when trying to use the APT repo for Ubuntu bionic. Does 
> anyone else have this issue? Is the mirror sync actually still in progress? 
> Or was something setup incorrectly?
>
> E: Failed to fetch 
> https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
>   File has unexpected size (15515 != 15488). Mirror sync in progress? [IP: 
> 158.69.68.124 443]
>Hashes of expected file:
> - Filesize:15488 [weak]
> - SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
> - SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
> - MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
>Release file created at: Mon, 11 Mar 2019 18:44:46 +
>

This has now been fixed, let me know if you have any more issues.


>
> Thanks,
> Ben
>
>
> On Tue, Mar 19, 2019 at 7:24 AM Sean Purdy  wrote:
>>
>> Hi,
>>
>>
>> Will debian packages be released?  I don't see them in the nautilus repo.  I 
>> thought that Nautilus was going to be debian-friendly, unlike Mimic.
>>
>>
>> Sean
>>
>> On Tue, 19 Mar 2019 14:58:41 +0100
>> Abhishek Lekshmanan  wrote:
>>
>> >
>> > We're glad to announce the first release of Nautilus v14.2.0 stable
>> > series. There have been a lot of changes across components from the
>> > previous Ceph releases, and we advise everyone to go through the release
>> > and upgrade notes carefully.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-20 Thread Alfredo Deza
There aren't any Debian packages built for this release because we
haven't updated the infrastructure to build (and test) Debian packages
yet.

On Tue, Mar 19, 2019 at 10:24 AM Sean Purdy  wrote:
>
> Hi,
>
>
> Will debian packages be released?  I don't see them in the nautilus repo.  I 
> thought that Nautilus was going to be debian-friendly, unlike Mimic.
>
>
> Sean
>
> On Tue, 19 Mar 2019 14:58:41 +0100
> Abhishek Lekshmanan  wrote:
>
> >
> > We're glad to announce the first release of Nautilus v14.2.0 stable
> > series. There have been a lot of changes across components from the
> > previous Ceph releases, and we advise everyone to go through the release
> > and upgrade notes carefully.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com