[ceph-users] Re: moving small production cluster to different datacenter

2020-01-30 Thread Wido den Hollander



On 1/31/20 12:09 AM, Nigel Williams wrote:
> Did you end up having all new IPs for your MONs? I've wondered how
> a large KVM deployment should be handled when the instance metadata
> has a hard-coded list of MON IPs for the cluster. How are they changed
> en masse for running VMs? Or do these moves always result in at least
> one MON keeping an original IP?
> 

Any running VM will get the updated monmap and will handle that properly.

However, if you stop and start the VM with the same XML definition, it
will not start anymore.

Therefore, I've always used round-robin DNS pointing to the monitors.

Wido
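
For reference, a minimal sketch of such a setup, with made-up names (this
assumes the clients resolve the name when they connect):

# ceph.conf on the clients
[global]
mon_host = mon.ceph.example.com

where mon.ceph.example.com has one A/AAAA record per monitor. In the libvirt
disk XML the <host name='mon.ceph.example.com' port='6789'/> element can then
carry the DNS name instead of a literal monitor IP.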

> e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1314099
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recovering monitor failure

2020-01-30 Thread vishal
Thanks folks for the replies. Now I feel confident to test this out in my 
cluster.

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: moving small production cluster to different datacenter

2020-01-30 Thread Nigel Williams
Did you end up having all new IPs for your MONs? I've wondered how
a large KVM deployment should be handled when the instance metadata
has a hard-coded list of MON IPs for the cluster. How are they changed
en masse for running VMs? Or do these moves always result in at least
one MON keeping an original IP?

e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1314099
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph-iscsi create RBDs on erasure coded data pools

2020-01-30 Thread Wesley Dillingham
Is it possible to create an EC-backed RBD via the ceph-iscsi tools (gwcli,
rbd-target-api)? It appears that a pre-existing RBD created with the rbd
command can be imported, but there is no means to directly create an
EC-backed RBD. The API seems to expect a single pool field in the body to
work with.

Perhaps there is a lower-level construct where you can set the metadata of
a particular RADOS pool to always use pool X as the data pool when pool Y
holds the RBD headers and metadata. That way clients (in our case
ceph-iscsi) needn't be modified or concerned with the dual-pool situation
unless it is explicitly specified.

For our particular use case we expose limited functionality of
rbd-target-api to clients, and it would be helpful for them to keep track
of a single pool rather than two, but if a data pool and a "main" pool
could be passed via the API that would be okay too.

Thanks a lot.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: moving small production cluster to different datacenter

2020-01-30 Thread Marc Roos


I have OSD nodes combined with MDS, MGR and MONs. A few VMs are also
running on them with libvirt. Client and cluster traffic are on IPv4 (and
I have no experience with IPv6). The cluster network is on a switch not
connected to the internet.

- I should enable IPv6 again
- enable forwarding so cluster communication is routed through the client
interfaces?
- test if the connection works between the new and old locations
- then add two VMs with monitors, bringing the total to 5
- then move one OSD node with a mon to the new location
- wait for recovery
- then move the two VM mons to an OSD node at the new location, so there
are 3 there
- move an OSD node to the new location
- wait for recovery
Etc.

Something like this? What is the idea behind having 5 monitors during this
migration?
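
A rough sketch of the per-node part of such a move, assuming the 'noout'
approach from the quoted reply below and a systemd deployment:

ceph osd set noout                 # keep data from being rebalanced while a node is down
systemctl stop ceph-osd.target     # on the node being moved
# ... physically move the node and bring it back up ...
ceph -s                            # wait until all PGs are active+clean again
ceph osd unset noout               # once the moves are done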






-Original Message-
To: ceph-users
Subject: Re: [ceph-users] moving small production cluster to different 
datacenter



On 1/28/20 11:19 AM, Marc Roos wrote:
> 
> Say one is forced to move a production cluster (4 nodes) to a 
> different datacenter. What options do I have, other than just turning 
> it off at the old location and on at the new location?
> 
> Maybe buying some extra nodes and moving one node at a time?

I did this once. This cluster was running IPv6-only (still is) and thus
I had the flexibility of new IPs.

First I temporarily moved the MONs from hardware to virtual machines. The
monmap went from 3 to 5 MONs.

Then I moved the MONs one by one to the new DC and then removed the 2
additional VMs.

Then I set the 'noout' flag and moved the OSD nodes one by one. These
datacenters were located very close together, so each node could be moved
within 20 minutes.

Wait for recovery to finish and then move the next node.

Keep in mind that there is/might be additional latency between the two 
datacenters.

Wido

> 
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can Ceph Do The Job?

2020-01-30 Thread Adam Boyhan
It's my understanding that pool snapshots would basically put us in an
all-or-nothing situation where we would have to revert all RBDs in a pool.
If we could clone a pool snapshot for filesystem-level access the way we
can with an RBD snapshot, that would help a ton.
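
(For reference, the per-RBD snapshot/clone path referred to here looks
roughly like this; pool and image names are made up:)

rbd snap create rbd/vm01@before-change
rbd snap protect rbd/vm01@before-change
rbd clone rbd/vm01@before-change rbd/vm01-restore
rbd map rbd/vm01-restore    # then mount the mapped device to pull files out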

Thanks, 
Adam Boyhan 
System & Network Administrator 
MEDENT(EMR/EHR) 
15 Hulbert Street - P.O. Box 980 
Auburn, New York 13021 
www.medent.com 
Phone: (315)-255-1751 
Fax: (315)-255-3539 
Cell: (315)-729-2290 
ad...@medent.com 

This message and any attachments may contain information that is protected by 
law as privileged and confidential, and is transmitted for the sole use of the 
intended recipient(s). If you are not the intended recipient, you are hereby 
notified that any use, dissemination, copying or retention of this e-mail or 
the information contained herein is strictly prohibited. If you received this 
e-mail in error, please immediately notify the sender by e-mail, and 
permanently delete this e-mail. 


From: "Janne Johansson"  
To: "adamb"  
Cc: "ceph-users"  
Sent: Thursday, January 30, 2020 10:06:14 AM 
Subject: Re: [ceph-users] Can Ceph Do The Job? 

On Thu, 30 Jan 2020 at 15:29, Adam Boyhan <ad...@medent.com> wrote:


We are looking to roll out an all-flash Ceph cluster as storage for our
cloud solution. The OSDs will be on slightly slower Micron 5300 PROs, with
WAL/DB on Micron 7300 MAX NVMes.
My main concern with Ceph being able to fit the bill is its snapshot
abilities.
For each RBD we would like the following snapshots:
8x 30 minute snapshots (latest 4 hours)
With our current solution (HPE Nimble) we simply pause all write IO on the
10 minute mark for roughly 2 seconds and then we take a snapshot of the
entire Nimble volume. Each VM within the Nimble volume sits on a Linux
logical volume, so it's easy for us to take one big snapshot and only get
access to a specific client's data.
Are there any options for automating management/retention of snapshots
within Ceph besides some bash scripts? Is there any way to take snapshots
of all RBDs within a pool at a given time?




You could make a snapshot of the whole pool; that would cover all RBDs in
it, I gather?
https://docs.ceph.com/docs/nautilus/rados/operations/pools/#make-a-snapshot-of-a-pool

But if you need to work in parallel with each snapshot from different times
and clone them one by one and so forth, doing it per-RBD would be better.

https://docs.ceph.com/docs/nautilus/rbd/rbd-snapshot/

-- 
May the most significant bit of your life be positive. 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can Ceph Do The Job?

2020-01-30 Thread Janne Johansson
On Thu, 30 Jan 2020 at 15:29, Adam Boyhan wrote:

> We are looking to roll out an all-flash Ceph cluster as storage for our
> cloud solution. The OSDs will be on slightly slower Micron 5300 PROs,
> with WAL/DB on Micron 7300 MAX NVMes.
> My main concern with Ceph being able to fit the bill is its snapshot
> abilities.
> For each RBD we would like the following snapshots:
> 8x 30 minute snapshots (latest 4 hours)
> With our current solution (HPE Nimble) we simply pause all write IO on the
> 10 minute mark for roughly 2 seconds and then we take a snapshot of the
> entire Nimble volume. Each VM within the Nimble volume sits on a Linux
> logical volume, so it's easy for us to take one big snapshot and only
> get access to a specific client's data.
> Are there any options for automating management/retention of snapshots
> within Ceph besides some bash scripts? Is there any way to take snapshots
> of all RBDs within a pool at a given time?
>

You could make a snapshot of the whole pool; that would cover all RBDs in
it, I gather?

https://docs.ceph.com/docs/nautilus/rados/operations/pools/#make-a-snapshot-of-a-pool

But if you need to work in parallel with each snapshot from different times
and clone them one by one and so forth, doing it per-RBD would be better.

https://docs.ceph.com/docs/nautilus/rbd/rbd-snapshot/
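
In command form the two approaches look roughly like this (pool, image and
snapshot names are made up):

# whole-pool snapshot
ceph osd pool mksnap rbd nightly-20200130

# per-image snapshot, which can later be protected and cloned
rbd snap create rbd/vm01@nightly-20200130

Note that, as far as I know, a pool can use either pool snapshots or RBD
self-managed snapshots, not both, so the two approaches don't mix on the
same pool.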

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can Ceph Do The Job?

2020-01-30 Thread Phil Regnauld
Bastiaan Visser (bastiaan) writes:
> We are making hourly snapshots of ~400 rbd drives in one (spinning-rust)
> cluster. The snapshots are made one by one.
> Total size of the base images is around 80TB. The entire process takes a
> few minutes.
> We do not experience any problems doing this.

Out of curiosity - which version of Ceph? I've read that I/O to RBD
images for which snapshots exist is impacted. Is that your experience?

And do you maintain the snapshots, or are they discarded afterwards (i.e.
are they kept as potential point-in-time recovery points, or used to take
a backup and then discarded)?

Cheers,
Phil
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can Ceph Do The Job?

2020-01-30 Thread Bastiaan Visser
We are making hourly snapshots of ~400 RBD images in one (spinning-rust)
cluster. The snapshots are made one by one.
The total size of the base images is around 80 TB. The entire process takes
a few minutes.
We do not experience any problems doing this.
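
A one-by-one pass like that can be as simple as the following sketch (the
pool name and snapshot naming scheme are assumptions):

POOL=rbd
SNAP="hourly-$(date +%Y%m%d-%H%M)"
for img in $(rbd ls "$POOL"); do
    rbd snap create "$POOL/$img@$SNAP"
done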


On Thu, 30 Jan 2020 at 15:30, Adam Boyhan wrote:

> We are looking to roll out an all-flash Ceph cluster as storage for our
> cloud solution. The OSDs will be on slightly slower Micron 5300 PROs,
> with WAL/DB on Micron 7300 MAX NVMes.
>
> My main concern with Ceph being able to fit the bill is its snapshot
> abilities.
>
> For each RBD we would like the following snapshots:
>
> 8x 30 minute snapshots (latest 4 hours)
>
> With our current solution (HPE Nimble) we simply pause all write IO on the
> 10 minute mark for roughly 2 seconds and then we take a snapshot of the
> entire Nimble volume. Each VM within the Nimble volume sits on a Linux
> logical volume, so it's easy for us to take one big snapshot and only
> get access to a specific client's data.
>
> Are there any options for automating management/retention of snapshots
> within Ceph besides some bash scripts? Is there any way to take snapshots
> of all RBDs within a pool at a given time?
>
> Is there anyone successfully running this many snapshots? If anyone
> is running a similar setup, we would love to hear how you're doing it.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recovering monitor failure

2020-01-30 Thread Gregory Farnum
On Thu, Jan 30, 2020 at 1:38 PM Wido den Hollander  wrote:
>
>
>
> On 1/30/20 1:34 PM, vis...@denovogroup.org wrote:
> > I am testing failure scenarios for my cluster. I have 3 monitors. Let's
> > say mons 1 and 2 go down and the monitors can't form a quorum; how can I
> > recover?
> >
> > Are the instructions at the following link valid for deleting mons 1 and 2
> > from the monmap?
> > https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.2.3/html/red_hat_ceph_administration_guide/remove_a_monitor#removing-monitors-from-an-unhealthy-cluster
> >
> > One more question - let's say I delete mons 1 and 2 from the monmap, and
> > the cluster has only mon 3 remaining, so mon 3 has quorum. Now what happens
> > if mons 1 and 2 come up? Do they join mon 3 so that there will again be 3
> > monitors in the cluster?
> >
>
> The epoch of the monmap has increased by removing mons 1 and 2 from the
> map. Only mon 3 has this new map with the new epoch.
>
> Therefore, if mons 1 and 2 boot, they will see that mon 3's epoch is newer
> and thus won't be able to join.

If you delete the monitors from the map using ceph commands, so that
they KNOW they've been removed, this is fine. But you don't want to do
that to a cluster using offline tools: if monitor 3 dies before mons 1
and 2 turn on, they will find each other, not see another peer, and
say "hey, we are 2 of the 3 monitors in the map, let's form a quorum!"
-Greg
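
(For reference, the online removal path described above is just the
following, assuming mon IDs a and b:)

ceph mon remove a
ceph mon remove b
ceph quorum_status    # confirm the surviving monitor holds quorum on the new monmap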

>
> Wido
>
> > Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recovering monitor failure

2020-01-30 Thread Wido den Hollander



On 1/30/20 1:34 PM, vis...@denovogroup.org wrote:
> I am testing failure scenarios for my cluster. I have 3 monitors. Let's say
> mons 1 and 2 go down and the monitors can't form a quorum; how can I recover?
>
> Are the instructions at the following link valid for deleting mons 1 and 2
> from the monmap?
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.2.3/html/red_hat_ceph_administration_guide/remove_a_monitor#removing-monitors-from-an-unhealthy-cluster
>
> One more question - let's say I delete mons 1 and 2 from the monmap, and the
> cluster has only mon 3 remaining, so mon 3 has quorum. Now what happens if
> mons 1 and 2 come up? Do they join mon 3 so that there will again be 3
> monitors in the cluster?
> 

The epoch of the monmap has increased by removing mons 1 and 2 from the
map. Only mon 3 has this new map with the new epoch.

Therefore, if mons 1 and 2 boot, they will see that mon 3's epoch is newer
and thus won't be able to join.

Wido
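
(The linked procedure for an unhealthy cluster boils down to an offline
monmap edit along these lines; this sketch assumes the surviving monitor is
mon.c and that it is stopped while you edit the map:)

ceph-mon -i c --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm a
monmaptool /tmp/monmap --rm b
ceph-mon -i c --inject-monmap /tmp/monmap
# start mon.c again; it now forms a quorum of one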

> Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] recovering monitor failure

2020-01-30 Thread vishal
I am testing failure scenarios for my cluster. I have 3 monitors. Let's say
mons 1 and 2 go down and the monitors can't form a quorum; how can I recover?

Are the instructions at the following link valid for deleting mons 1 and 2
from the monmap?
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.2.3/html/red_hat_ceph_administration_guide/remove_a_monitor#removing-monitors-from-an-unhealthy-cluster

One more question - let's say I delete mons 1 and 2 from the monmap, and the
cluster has only mon 3 remaining, so mon 3 has quorum. Now what happens if
mons 1 and 2 come up? Do they join mon 3 so that there will again be 3
monitors in the cluster?

Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Servicing multiple OpenStack clusters from the same Ceph cluster [EXT]

2020-01-30 Thread Stefan Kooman
Hi,

Quoting Paul Browne (pf...@cam.ac.uk):
> On Wed, 29 Jan 2020 at 16:52, Matthew Vernon  wrote:
> 
> > Hi,
> >
> > On 29/01/2020 16:40, Paul Browne wrote:
> >
> > > Recently we deployed a brand new Stein cluster however, and I'm curious
> > > whether the idea of pointing the new OpenStack cluster at the same RBD
> > > pools for Cinder/Glance/Nova as the Luminous cluster would be considered
> > > bad practice, or even potentially dangerous.
> >
> > I think that would be pretty risky - here we have a Ceph cluster that
> > provides backing for our OpenStacks, and each OpenStack has its own set
> > of pools -metrics, -images, -volumes, -vms (and its own credential).
> >
> 
> Hi Matthew,
> 
> I think I've come around to that thinking now too.
> 
> Despite using different keys, the 2 sets of clients in different OpenStack
> clusters would require the same capabilities on the shared pools, which
> widens the blast radius a bit too far for me, I think (unless there were
> also a capability to restrict the sets of clients' keys to specific
> namespaces within the shared pools similar to the caps given out to CephFS
> clients)

This has been supported since Nautilus: namespace support for librbd. I do
not know, however, whether there is already support for this in
qemu/libvirt/OpenStack. OpenNebula support is pending [1].
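
(A rough sketch of what that looks like on the Ceph side, with made-up pool,
namespace and client names; whether the OpenStack/libvirt side can consume it
is exactly the open question above:)

rbd namespace create --pool volumes --namespace stein
ceph auth get-or-create client.cinder-stein \
    mon 'profile rbd' \
    osd 'profile rbd pool=volumes namespace=stein'
rbd create --pool volumes --namespace stein --size 10G test-image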

Gr. Stefan

[1]: https://github.com/OpenNebula/one/issues/3141

-- 
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Servicing multiple OpenStack clusters from the same Ceph cluster [EXT]

2020-01-30 Thread Paul Browne
On Wed, 29 Jan 2020 at 16:52, Matthew Vernon  wrote:

> Hi,
>
> On 29/01/2020 16:40, Paul Browne wrote:
>
> > Recently we deployed a brand new Stein cluster however, and I'm curious
> > whether the idea of pointing the new OpenStack cluster at the same RBD
> > pools for Cinder/Glance/Nova as the Luminous cluster would be considered
> > bad practice, or even potentially dangerous.
>
> I think that would be pretty risky - here we have a Ceph cluster that
> provides backing for our OpenStacks, and each OpenStack has its own set
> of pools -metrics, -images, -volumes, -vms (and its own credential).
>

Hi Matthew,

I think I've come around to that thinking now too.

Despite using different keys, the 2 sets of clients in different OpenStack
clusters would require the same capabilities on the shared pools, which
widens the blast radius a bit too far for me, I think (unless there were
also a capability to restrict the sets of clients' keys to specific
namespaces within the shared pools similar to the caps given out to CephFS
clients)

Thanks,
Paul



>
> Regards,
>
> Matthew
>
>

-- 
***
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pf...@cam.ac.uk
Tel: 0044-1223-746548
***
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: General question CephFS or RBD

2020-01-30 Thread Stefan Kooman
Hi,

Quoting Willi Schiegel (willi.schie...@technologit.de):
> Hello All,
> 
> I have a HW RAID based 240 TB data pool with about 200 million files for
> users in a scientific institution. Data sizes range from tiny parameter
> files for scientific calculations and experiments to huge images of brain
> scans. There are group directories, home directories, Windows roaming
> profile directories organized in ZFS pools on Solaris operating systems,
> exported via NFS and Samba to Linux, macOS, and Windows clients.
> 
> I would like to switch to CephFS because of the flexibility and
> expandability but I cannot find any recommendations for which storage
> backend would be suitable for all the functionality we have.
> 
> Since I like the features of ZFS like immediate snapshots of very large data
> pools, quotas for each file system within hierarchical data trees and
> dynamic expandability by simply adding new disks or disk images without
> manual resizing would it be a good idea to create RBD images, map them onto
> the file servers and create zpools on the mapped images? I know that ZFS
> best works with raw disks but maybe a RBD image is close enough to a raw
> disk?

Some of the features also exist within Ceph. Ceph would rebalance your
data across the cluster immediately, as opposed to ZFS (where only new data
is written to new disks). You can make snapshots of filesystems, and you can
set quotas, but with different caveats than with ZFS [1]. You would have
to set up Ceph with Samba (vfs_ceph) for your Windows / macOS clients [2],
which is not (yet) HA (or you would build something yourself with Samba
CTDB, its Ceph object locking helper and vfs_ceph). If you need to
support NFS you would be able to do so in an HA fashion with nfs-ganesha
(Nautilus has fixed most caveats). You would need one or more MDS
servers. And it would really depend on the workload (and the number of
clients) whether it would work out for you. It might require tuning of the
MDSs. It's one of the more difficult interfaces of Ceph to comprehend.

ZFS on RBD would work (IIRC I have seen a presentation at Cephalocon by a
user running both ZFS and RBD). I have certainly done it myself. As far
as ZFS is concerned it's just another disk (it depends on whether you map
it with rbd-nbd, krbd or as a virtual disk, but all three should work).
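
(A minimal sketch of the krbd variant, with made-up names; on older kernels
some image features may need to be disabled before the map succeeds:)

rbd create --size 100G rbd/tank-disk01
rbd map rbd/tank-disk01          # prints the device it mapped, e.g. /dev/rbd0
zpool create tank /dev/rbd0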


> Or would CephFS be the way to go? Can there be multiple CephFS pools for the
> group data folders and for the user's home directory folders for example or
> do I have to have everything in one single file space?

You can have multiple data pools and use xattrs to set the appropriate
attributes so a directory uses the correct pool. You can use namespaces
(Ceph, not Linux) to "prefix" all objects stored by CephFS. That is
primarily there to be able to set permissions per "namespace" instead of on
a whole pool; that way you can restrict which objects a user is allowed to
access. You can also restrict the permission to create CephFS snapshots
(from Nautilus onwards) on a per-user basis.
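
(In practice that looks roughly like this on a mounted CephFS; the directory
paths and pool name are made up:)

# send new files under this directory to a different data pool
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/groups/imaging

# per-directory quota, in bytes (here 10 TiB)
setfattr -n ceph.quota.max_bytes -v 10995116277760 /mnt/cephfs/home/alice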

It would probably require a whole bunch of servers to get the right sizing
for Ceph, as opposed to your ZFS setup, for which you probably have one or
more nodes with a bunch of disks. But it would definitely scale better and
provide higher availability when done right, even if you only used Ceph for
"block" devices. It would also require quite an investment of time to get
used to the quirks of running Ceph in production for large CephFS
workloads.


Gr. Stefan

[1]: https://docs.ceph.com/docs/master/cephfs/quota/
> 
> Thank you very much.
> 
> Best regards
> Willi

-- 
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network performance checks

2020-01-30 Thread Stefan Kooman
Hi,

Quoting Massimo Sgaravatto (massimo.sgarava...@gmail.com):
> Thanks for your answer
> 
> MON-MGR hosts have a mgmt network and a public network.
> OSD nodes have instead a mgmt network, a  public network. and a cluster
> network
> This is what I have in ceph.conf:
> 
> public network = 192.168.61.0/24
> cluster network = 192.168.222.0/24
> 
> 
> public and cluster networks are 10 Gbps networks (actually there is a
> single 10 Gbps NIC on each node used for both the public and the cluster
> networks).

In that case there is no advantage to using a separate cluster network,
as it would only be beneficial if replication traffic between OSDs were on
a separate interface. Is the cluster heavily loaded? Do you have metrics
on bandwidth usage / switch port statistics? If you see many "discards"
(and/or errors) this might impact the ping times as well.

> The mgmt network is a 1 Gbps network, but this one shouldn't be used for
> such pings among the OSDs ...

I doubt Ceph will use the mgmt network, but I am not sure whether it does
a hostname lookup (which might resolve to the mgmt network in your case)
or uses the IPs configured for Ceph.


You can dump osd network info per OSD on the storage nodes themselves by
this command:

ceph daemon osd.$id dump_osd_network

You would have to do that for every OSD and see which ones report
"entries".

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] General question CephFS or RBD

2020-01-30 Thread Willi Schiegel

Hello All,

I have a HW RAID based 240 TB data pool with about 200 million files for 
users in a scientific institution. Data sizes range from tiny parameter 
files for scientific calculations and experiments to huge images of 
brain scans. There are group directories, home directories, Windows 
roaming profile directories organized in ZFS pools on Solaris operating 
systems, exported via NFS and Samba to Linux, macOS, and Windows clients.


I would like to switch to CephFS because of the flexibility and 
expandability but I cannot find any recommendations for which storage 
backend would be suitable for all the functionality we have.


Since I like the features of ZFS like immediate snapshots of very large 
data pools, quotas for each file system within hierarchical data trees 
and dynamic expandability by simply adding new disks or disk images 
without manual resizing would it be a good idea to create RBD images, 
map them onto the file servers and create zpools on the mapped images? I 
know that ZFS best works with raw disks but maybe a RBD image is close 
enough to a raw disk?


Or would CephFS be the way to go? Can there be multiple CephFS pools for 
the group data folders and for the user's home directory folders for 
example or do I have to have everything in one single file space?


Maybe someone can share his or her field experience?

Thank you very much.

Best regards
Willi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network performance checks

2020-01-30 Thread Massimo Sgaravatto
Thanks for your answer

MON-MGR hosts have a mgmt network and a public network.
OSD nodes instead have a mgmt network, a public network, and a cluster
network.
This is what I have in ceph.conf:

public network = 192.168.61.0/24
cluster network = 192.168.222.0/24


The public and cluster networks are 10 Gbps networks (actually there is a
single 10 Gbps NIC on each node, used for both the public and the cluster
networks).
The mgmt network is a 1 Gbps network, but that one shouldn't be used for
such pings among the OSDs ...

Cheers, Massimo


On Thu, Jan 30, 2020 at 9:26 AM Stefan Kooman  wrote:

> Quoting Massimo Sgaravatto (massimo.sgarava...@gmail.com):
> > After having upgraded my Ceph cluster from Luminous to Nautilus 14.2.6,
> > from time to time "ceph health detail" complains about "Long heartbeat
> > ping times on front/back interface seen".
> >
> > As far as I can understand (after having read
> > https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
> > means that  the ping from one OSD to another one exceeded 1 s.
> >
> > I have some questions on these network performance checks
> >
> > 1) What is meant exactly with front and back interface ?
>
> Do you have a "public" and a "cluster" network? I would expect that the
> "back" interface is a "cluster" network interface.
>
> > 2) I can see the involved OSDs only in the output of "ceph health detail"
> > (when there is the problem) but I can't find this information  in the log
> > files. In the mon log file I can only see messages such as:
> >
> >
> > 2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] :
> > Health check failed: Long heartbeat ping times on back interface seen,
> > longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
> >
> > but the involved OSDs are not reported in this log.
> > Do I just need to increase the verbosity of the mon log ?
> >
> > 3) Is 1 s a reasonable value for this threshold ? How could this value be
> > changed ? What is the relevant configuration variable ?
>
> Not sure how much priority Ceph gives to this ping check. But if you're
> on a 10 Gb/s network I would start complaining when things take longer
> than 1 ms ... a ping should not take much longer than 0.05 ms so if it
> would take an order of magnitude longer than expected latency is not
> optimal.
>
> For Gigabit networks I would bump above values by an order of magnitude.
>
> Gr. Stefan
>
> --
> | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High CPU usage by ceph-mgr in 14.2.6

2020-01-30 Thread Janek Bevendorff
I can report similar results, although it's probably not just due to 
cluster size.


Our cluster has 1248 OSDs at the moment and we have three active MDSs to
spread the metadata operations evenly. However, I noticed that the load isn't
spread evenly at all. Usually it's just one MDS (in our case mds.1)
that handles most of the load, slowing down the others as a result.
What we see is a significantly higher latency curve for this one MDS 
than for the other two. All MDSs operate at 100-150% CPU utilisation 
when multiple clients (we have up to 320) are actively reading or 
writing data (note: we have quite an uneven data distribution, so 
directory pinning isn't really an option).


In the end, it turned out that some clients were running updatedb 
processes which tried to index the CephFS. After fixing that, the 
constant request load went down and with it the CPU load on the MDSs, 
but the underlying problem isn't solved of course. We just don't have 
any clients constantly operating on some of our largest directories anymore.
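
(For anyone hitting the same thing: on most distros the fix is to tell
mlocate's updatedb to skip CephFS in /etc/updatedb.conf; the exact values
below are assumptions and depend on the distro, filesystem type and mount
point:)

# /etc/updatedb.conf -- keep the existing entries and append
PRUNEFS="... ceph fuse.ceph-fuse"
PRUNEPATHS="... /mnt/cephfs"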



On 29/01/2020 20:28, Neha Ojha wrote:

Hi Joe,

Can you grab a wallclock profiler dump from the mgr process and share
it with us? This was useful for us to get to the root cause of the
issue in 14.2.5.

Quoting Mark's suggestion from "[ceph-users] High CPU usage by
ceph-mgr in 14.2.5" below.

If you can get a wallclock profiler on the mgr process we might be able
to figure out the specifics of what's taking so much time (i.e. processing
pg_summary or something else). Assuming you have gdb with the Python
bindings and the ceph debug packages installed, if you (or anyone)
could try gdbpmp on the 100% mgr process that would be fantastic.


https://github.com/markhpc/gdbpmp


gdbpmp.py -p`pidof ceph-mgr` -n 1000 -o mgr.gdbpmp


If you want to view the results:


gdbpmp.py -i mgr.gdbpmp -t 1

Thanks,
Neha



On Wed, Jan 29, 2020 at 7:35 AM  wrote:

Modules that are normally enabled:

ceph mgr module ls | jq -r '.enabled_modules'
[
   "dashboard",
   "prometheus",
   "restful"
]

We did test with all modules disabled, restarted the mgrs and saw no difference.

Joe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Network performance checks

2020-01-30 Thread Stefan Kooman
Quoting Massimo Sgaravatto (massimo.sgarava...@gmail.com):
> After having upgraded my Ceph cluster from Luminous to Nautilus 14.2.6,
> from time to time "ceph health detail" complains about "Long heartbeat
> ping times on front/back interface seen".
> 
> As far as I can understand (after having read
> https://docs.ceph.com/docs/nautilus/rados/operations/monitoring/), this
> means that  the ping from one OSD to another one exceeded 1 s.
> 
> I have some questions on these network performance checks
> 
> 1) What is meant exactly with front and back interface ?

Do you have a "public" and a "cluster" network? I would expect that the
"back" interface is a "cluster" network interface.

> 2) I can see the involved OSDs only in the output of "ceph health detail"
> (when there is the problem) but I can't find this information  in the log
> files. In the mon log file I can only see messages such as:
> 
> 
> 2020-01-28 11:14:07.641 7f618e644700  0 log_channel(cluster) log [WRN] :
> Health check failed: Long heartbeat ping times on back interface seen,
> longest is 1416.618 msec (OSD_SLOW_PING_TIME_BACK)
> 
> but the involved OSDs are not reported in this log.
> Do I just need to increase the verbosity of the mon log ?
> 
> 3) Is 1 s a reasonable value for this threshold ? How could this value be
> changed ? What is the relevant configuration variable ?

Not sure how much priority Ceph gives to this ping check. But if you're
on a 10 Gb/s network I would start complaining when things take longer
than 1 ms ... a ping should not take much longer than 0.05 ms, so if it
takes an order of magnitude longer than expected, latency is not
optimal.

For gigabit networks I would bump the above values by an order of magnitude.
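
(For what it's worth, the 1 s default appears to be derived from
osd_heartbeat_grace * mon_warn_on_slow_ping_ratio in Nautilus, and the linked
monitoring docs describe an absolute override; assuming those option names,
something like:)

# warn at 100 ms instead of the derived 1000 ms default
ceph config set global mon_warn_on_slow_ping_time 100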

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io