At this stage the RGW component is further down the line - pretty much just a 
concept while we build out the RBD side first.

What I wanted to get out of EC was distributing the data across multiple DCs 
so that we were not simply replicating data - which would give us much better 
storage efficiency and redundancy.  Some of what I had read in the past was 
around using EC to spread data over multiple DCs so as to be able to sustain 
the loss of multiple sites.  Most of this was implied fairly clearly in the 
documentation under "CHEAP MULTIDATACENTER STORAGE":

http://docs.ceph.com/docs/hammer/dev/erasure-coded-pool/

Although I note that section appears to have disappeared from the later 
documentation versions.
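For reference, the recipe that section sketched was roughly the following 
(illustrative only - the profile and pool names here are mine, the k/m values 
and PG counts are examples, and post-hammer releases renamed 
ruleset-failure-domain to crush-failure-domain):

```shell
# Hammer-era sketch: an EC profile that places each chunk in a separate
# datacenter, so the pool survives the loss of up to m=2 whole DCs.
# Requires at least k+m=6 datacenter buckets in the CRUSH map.
ceph osd erasure-code-profile set multidc-profile \
    k=4 m=2 \
    ruleset-failure-domain=datacenter

# Create an EC pool backed by that profile (PG counts are illustrative).
ceph osd pool create multidc-archive 1024 1024 erasure multidc-profile
```

With k=4/m=2 that is 1.5x storage overhead versus 3x for triple replication, 
which is the efficiency argument - at the cost of cross-DC traffic on every 
write and recovery.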

It seems a little disheartening that much of this promised capability for 
Ceph appears to be just not there in practice.

> -----Original Message-----
> From: Maxime Guyot [mailto:maxime.gu...@elits.com]
> Sent: Tuesday, 12 April 2016 5:49 PM
> To: Adrian Saul; Christian Balzer; 'ceph-users@lists.ceph.com'
> Subject: Re: [ceph-users] Mon placement over wide area
>
> Hi Adrian,
>
> Looking at the documentation RadosGW has multi region support with the
> “federated gateways”
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters.”
>
> Maybe that could do the trick for your multi metro EC pools?
>
> Disclaimer: I haven't tested the federated gateways RadosGW.
>
> Best Regards
>
> Maxime Guyot
> System Engineer
>
>
> On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul" <ceph-users-
> boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
> wrote:
>
> >Hello again Christian :)
> >
> >
> >> > We are close to being given approval to deploy a 3.5PB Ceph cluster
> >> > that will be distributed over every major capital in Australia.  The
> >> > config will be dual sites in each city that will be coupled as HA
> >> > pairs - 12 sites in total.  The vast majority of CRUSH rules will
> >> > place data either locally to the individual site, or replicated to
> >> > the other HA site in that city.  However there are future use cases
> >> > where I think we could use EC to distribute data wider or have some
> >> > replication that puts small data sets across multiple cities.
> >> This will very, very, VERY much depend on the data (use case) in question.
> >
> >The EC use case would be using RGW and to act as an archival backup
> >store
> >
> >> > The concern I have is around the placement of mons.  In the current
> >> > design there would be two monitors in each site, running separate to
> the
> >> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> >> > will also be a "tiebreaker" mon placed on a separate host which
> >> > will house some management infrastructure for the whole platform.
> >> >
> >> Yes, that's the preferable way, might want to up this to 5 mons so
> >> you can lose one while doing maintenance on another one.
> >> But if that would be a coupled, national cluster you're looking both
> >> at significant MON traffic, interesting "split-brain" scenarios and
> >> latencies as well (MONs get chosen randomly by clients AFAIK).
> >
> >In the case I am setting up it would be 2 per site plus the extra, so 25 -
> >but I fear that would make the mon syncing too heavy.  Once we build up
> >to multiple sites, though, we can maybe reduce to one per site to reduce
> >the workload of keeping the mons in sync.
> >
> >> > Obviously a concern is latency - the east coast to west coast
> >> > latency is around 50ms, and on the east coast it is 12ms between
> >> > Sydney and the other two sites, and 24ms Melbourne to Brisbane.
> >> In any situation other than "write speed doesn't matter at all"
> >> combined with "large writes, not small ones" and "read-mostly" you're
> >> going to be in severe pain.
> >
> >For data yes, but the main case for that would be backup data, where it
> >would be large writes, read rarely, and as long as streaming performance
> >keeps up, latency won't matter.  My concern with the latency would be how
> >it impacts the monitors having to keep in sync, and how that would impact
> >client operations, especially with the rate of change that would occur
> >with the predominant RBD use in most sites.
> >
> >> > Most of the data
> >> > traffic will remain local but if we create a single national
> >> > cluster then how much of an impact will it be having all the mons
> >> > needing to keep in sync, as well as monitor and communicate with
> >> > all OSDs (in the end goal design there will be some 2300+ OSDs).
> >> >
> >> Significant.
> >> I wouldn't suggest it, but even if you deploy differently I'd suggest
> >> a test run/setup and sharing the experience with us. ^.^
> >
> >Someone has to be the canary right :)
> >
> >> > The other options I  am considering:
> >> > - split into east and west coast clusters, most of the cross city
> >> > need is in the east coast, any data moves between clusters can be
> >> > done with snap replication
> >> > - city based clusters (tightest latency) but lose the multi-DC EC
> >> > option, do cross city replication using snapshots
> >> >
> >> The latter - I seem to remember that there was work in progress to do
> >> this (snapshot replication) in an automated fashion.
> >>
> >> > Just want to get a feel for what I need to consider when we start
> >> > building at this scale.
> >> >
> >> I know you're set on iSCSI/NFS (have you worked out the iSCSI
> >> kinks?), but the only well known/supported way to do geo-replication
> >> with Ceph is via RGW.
> >
> >iSCSI is working fairly well.  We have decided not to use Ceph for the
> >latency-sensitive workloads, so while we are still working to keep latency
> >low, we won't be putting the heavier IOPS or latency-sensitive workloads
> >onto it until we get a better feel for how it behaves at scale and can be
> >sure of the performance.
> >
> >As above - for the most part we are going to have local site pools
> >(replicated at the application level), a few metro-replicated pools and a
> >couple of very small multi-metro replicated pools, with the geo-redundant
> >EC stuff a future plan.  It would just be a shame to lock the design into
> >a setup that won't let us do some of these wider options down the track.
> >
> >Thanks.
> >
> >Adrian
> >
> >Confidentiality: This email and any attachments are confidential and may be
> subject to copyright, legal or some other professional privilege. They are
> intended solely for the attention and use of the named addressee(s). They
> may only be copied, distributed or disclosed with the consent of the
> copyright owner. If you have received this email by mistake or by breach of
> the confidentiality clause, please notify the sender immediately by return
> email and delete or destroy all copies of the email. Any confidentiality,
> privilege or copyright is not waived or lost because this email has been sent
> to you by mistake.
> >_______________________________________________
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com