Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Max A. Krasilnikov
Hello!

On Tue, Apr 12, 2016 at 07:48:58AM +, Maxime.Guyot wrote:

> Hi Adrian,

> Looking at the documentation, RadosGW has multi-region support with the 
> “federated gateways” 
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical locales, 
> configuring Ceph Object Gateway regions and metadata synchronization agents 
> enables the service to maintain a global namespace, even though Ceph Object 
> Gateway instances run in different geographic locales and potentially on 
> different Ceph Storage Clusters.”

> Maybe that could do the trick for your multi metro EC pools?

> Disclaimer: I haven't tested the federated gateways RadosGW.

As I can see in the docs, Jewel should be able to perform per-image async mirroring:

There is new support for mirroring (asynchronous replication) of RBD images
across clusters. This is implemented as a per-RBD image journal that can be
streamed across a WAN to another site, and a new rbd-mirror daemon that performs
the cross-cluster replication.

© http://docs.ceph.com/docs/master/release-notes/
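
From the docs, the per-image setup should look roughly like the following (I have
not tried it yet; pool, image and cluster names below are only placeholders):

  # enable journaling on the image (needs exclusive-lock as well)
  rbd feature enable rbd/vm-disk-1 exclusive-lock journaling
  # per-image mirroring mode on the pool, then enable it for the image
  rbd mirror pool enable rbd image
  rbd mirror image enable rbd/vm-disk-1
  # on the receiving cluster: register the source cluster as a peer
  rbd mirror pool peer add rbd client.admin@source
  # and run the rbd-mirror daemon there (normally via its init/systemd unit)
  rbd-mirror --cluster backup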

I will test it 1-2 months later this year :)

-- 
WBR, Max A. Krasilnikov


Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Adrian Saul

At this stage the RGW component is down the line - pretty much just a concept 
while we build out the RBD side first.

What I wanted to get out of EC was distributing the data across multiple DCs 
such that we were not simply replicating data - which would give us much better 
storage efficiency and redundancy.  Some of what I had read in the past was 
around using EC to spread data over multiple DCs to be able to sustain loss of 
multiple sites.  Most of this was implied fairly clearly in the documentation 
under "CHEAP MULTIDATACENTER STORAGE":

http://docs.ceph.com/docs/hammer/dev/erasure-coded-pool/

Although I note that section appears to have disappeared in later documentation 
versions.
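
From memory the gist of that section was simply an erasure-code profile with the 
failure domain set to datacenter, something along these lines (the k/m values, 
names and parameter spelling are from my recollection of the Hammer docs, so 
treat it as a sketch only):

  # assumes datacenter buckets exist in the CRUSH map; 4+2 is just an example
  ceph osd erasure-code-profile set multidc k=4 m=2 \
      ruleset-failure-domain=datacenter
  ceph osd pool create ecbackup 1024 1024 erasure multidc

which in theory lets you lose up to m datacenters and still read the data.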

It seems a little disheartening that much of this promise and capability for 
Ceph appears to be just not there in practice.






> -Original Message-
> From: Maxime Guyot [mailto:maxime.gu...@elits.com]
> Sent: Tuesday, 12 April 2016 5:49 PM
> To: Adrian Saul; Christian Balzer; 'ceph-users@lists.ceph.com'
> Subject: Re: [ceph-users] Mon placement over wide area
>
> Hi Adrian,
>
> Looking at the documentation, RadosGW has multi-region support with the
> “federated gateways”
> (http://docs.ceph.com/docs/master/radosgw/federated-config/):
> "When you deploy a Ceph Object Store service that spans geographical
> locales, configuring Ceph Object Gateway regions and metadata
> synchronization agents enables the service to maintain a global namespace,
> even though Ceph Object Gateway instances run in different geographic
> locales and potentially on different Ceph Storage Clusters.”
>
> Maybe that could do the trick for your multi metro EC pools?
>
> Disclaimer: I haven't tested the federated gateways RadosGW.
>
> Best Regards
>
> Maxime Guyot
> System Engineer
>
>
>
>
>
>
>
>
>
> On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul"
> <ceph-users-boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
> wrote:
>
> >Hello again Christian :)
> >
> >
> >> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> >> > will be distributed over every major capital in Australia.  The config
> >> > will be dual sites in each city that will be coupled as HA pairs - 12
> >> > sites in total.   The vast majority of CRUSH rules will place data
> >> > either locally to the individual site, or replicated to the other HA
> >> > site in that city.   However there are future use cases where I think we
> >> > could use EC to distribute data wider or have some replication that
> >> > puts small data sets across multiple cities.
> >> This will very, very, VERY much depend on the data (use case) in question.
> >
> >The EC use case would be using RGW to act as an archival backup
> >store
> >
> >> > The concern I have is around the placement of mons.  In the current
> >> > design there would be two monitors in each site, running separate to
> the
> >> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> >> > will also be a "tiebreaker" mon placed on a separate host which
> >> > will house some management infrastructure for the whole platform.
> >> >
> >> Yes, that's the preferable way, might want to up this to 5 mons so
> >> you can lose one while doing maintenance on another one.
> >> But if that would be a coupled, national cluster you're looking both
> >> at significant MON traffic, interesting "split-brain" scenarios and
> >> latencies as well (MONs get chosen randomly by clients AFAIK).
> >
> >In the case I am setting up it would be 2 per site plus the extra so 25 - 
> >but I
> am fearing that would make the mon syncing become too heavy.  Once we
> build up to multiple sites though we can maybe reduce to one per site to
> reduce the workload on keeping the mons in sync.
> >
> >> > Obviously a concern is latency - the east coast to west coast
> >> > latency is around 50ms, and on the east coast it is 12ms between
> >> > Sydney and the other two sites, and 24ms Melbourne to Brisbane.
> >> In any situation other than "write speed doesn't matter at all"
> >> combined with "large writes, not small ones" and "read-mostly" you're
> >> going to be in severe pain.
> >
> >For data yes, but the main case for that would be backup data where it
> would be large writes, read rarely and as long as streaming performance
> keeps up latency won't matter.   My concern with the latency would be how
> that impacts the monitors having to keep in sync and how that would impact
> client operations, especially with the rate of change that would occur with
> the predominant RBD use in most sites.

Re: [ceph-users] Mon placement over wide area

2016-04-12 Thread Maxime Guyot
Hi Adrian,

Looking at the documentation, RadosGW has multi-region support with the 
“federated gateways” 
(http://docs.ceph.com/docs/master/radosgw/federated-config/):
"When you deploy a Ceph Object Store service that spans geographical locales, 
configuring Ceph Object Gateway regions and metadata synchronization agents 
enables the service to maintain a global namespace, even though Ceph Object 
Gateway instances run in different geographic locales and potentially on 
different Ceph Storage Clusters.”

Maybe that could do the trick for your multi metro EC pools?

Disclaimer: I haven't tested the federated gateways RadosGW.
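
If you want to experiment with it, the setup in that doc boils down to defining a 
region and its zones and then running the sync agent between the zone gateways, 
roughly like this (region/zone names and the config path are placeholders, the 
real JSON layouts are in the linked page):

  radosgw-admin region set --infile au.json
  radosgw-admin zone set --rgw-zone=au-east --infile au-east.json
  radosgw-admin zone set --rgw-zone=au-west --infile au-west.json
  radosgw-admin regionmap update
  # metadata/data sync from the master zone to the secondary
  radosgw-agent -c /etc/ceph/radosgw-agent/default.conf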

Best Regards 

Maxime Guyot
System Engineer









On 12/04/16 03:28, "ceph-users on behalf of Adrian Saul"
<ceph-users-boun...@lists.ceph.com on behalf of adrian.s...@tpgtelecom.com.au>
wrote:

>Hello again Christian :)
>
>
>> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
>> > will be distributed over every major capital in Australia.  The config
>> > will be dual sites in each city that will be coupled as HA pairs - 12
>> > sites in total.   The vast majority of CRUSH rules will place data
>> > either locally to the individual site, or replicated to the other HA
>> > site in that city.   However there are future use cases where I think we
>> > could use EC to distribute data wider or have some replication that puts
>> > small data sets across multiple cities.
>> This will very, very, VERY much depend on the data (use case) in question.
>
>The EC use case would be using RGW to act as an archival backup store
>
>> > The concern I have is around the placement of mons.  In the current
>> > design there would be two monitors in each site, running separate to the
>> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
>> > will also be a "tiebreaker" mon placed on a separate host which will
>> > house some management infrastructure for the whole platform.
>> >
>> Yes, that's the preferable way, might want to up this to 5 mons so you can
>> lose one while doing maintenance on another one.
>> But if that would be a coupled, national cluster you're looking both at
>> significant MON traffic, interesting "split-brain" scenarios and latencies as
>> well (MONs get chosen randomly by clients AFAIK).
>
>In the case I am setting up it would be 2 per site plus the extra so 25 - but 
>I am fearing that would make the mon syncing become too heavy.  Once we build 
>up to multiple sites though we can maybe reduce to one per site to reduce the 
>workload on keeping the mons in sync.
>
>> > Obviously a concern is latency - the east coast to west coast latency
>> > is around 50ms, and on the east coast it is 12ms between Sydney and
>> > the other two sites, and 24ms Melbourne to Brisbane.
>> In any situation other than "write speed doesn't matter at all" combined with
>> "large writes, not small ones" and "read-mostly" you're going to be in severe
>> pain.
>
>For data yes, but the main case for that would be backup data where it would 
>be large writes, read rarely and as long as streaming performance keeps up 
>latency won't matter.   My concern with the latency would be how that impacts 
>the monitors having to keep in sync and how that would impact client 
>operations, especially with the rate of change that would occur with the 
>predominant RBD use in most sites.
>
>> > Most of the data
>> > traffic will remain local but if we create a single national cluster
>> > then how much of an impact will it be having all the mons needing to
>> > keep in sync, as well as monitor and communicate with all OSDs (in the
>> > end goal design there will be some 2300+ OSDs).
>> >
>> Significant.
>> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
>> run/setup and sharing the experience with us. ^.^
>
>Someone has to be the canary right :)
>
>> > The other options I  am considering:
>> > - split into east and west coast clusters, most of the cross city need
>> > is in the east coast, any data moves between clusters can be done with
>> > snap replication
>> > - city based clusters (tightest latency) but lose the multi-DC EC
>> > option, do cross city replication using snapshots
>> >
>> The latter, I seem to remember that there was work in progress to do this
>> (snapshot replication) in an automated fashion.
>>
>> > Just want to get a feel for what I need to consider when we start
>> > building at this scale.
>> >
>> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
>> the only well known/supported way to do geo-replication with Ceph is via
>> RGW.
>
>iSCSI is working fairly well.  We have decided to not use Ceph for the latency 
>sensitive workloads so while we are still working to keep that low, we won't be 
>putting the heavier IOP or latency sensitive workloads onto it until we get a 
>better feel for how it behaves at scale and can be sure of the performance.
>
>As above - for the most part we are going to have local site pools (replicate 
>at application level), a few metro replicated pools and a couple of very small 
>multi-metro replicated pools, with the geo-redundant EC stuff a future plan.

Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul
Hello again Christian :)


> > We are close to being given approval to deploy a 3.5PB Ceph cluster that
> > will be distributed over every major capital in Australia.  The config
> > will be dual sites in each city that will be coupled as HA pairs - 12
> > sites in total.   The vast majority of CRUSH rules will place data
> > either locally to the individual site, or replicated to the other HA
> > site in that city.   However there are future use cases where I think we
> > could use EC to distribute data wider or have some replication that puts
> > small data sets across multiple cities.
> This will very, very, VERY much depend on the data (use case) in question.

The EC use case would be using RGW to act as an archival backup store

> > The concern I have is around the placement of mons.  In the current
> > design there would be two monitors in each site, running separate to the
> > OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> > will also be a "tiebreaker" mon placed on a separate host which will
> > house some management infrastructure for the whole platform.
> >
> Yes, that's the preferable way, might want to up this to 5 mons so you can
> lose one while doing maintenance on another one.
> But if that would be a coupled, national cluster you're looking both at
> significant MON traffic, interesting "split-brain" scenarios and latencies as
> well (MONs get chosen randomly by clients AFAIK).

In the case I am setting up it would be 2 per site plus the extra so 25 - but I 
am fearing that would make the mon syncing become too heavy.  Once we build up 
to multiple sites though we can maybe reduce to one per site to reduce the 
workload on keeping the mons in sync.

> > Obviously a concern is latency - the east coast to west coast latency
> > is around 50ms, and on the east coast it is 12ms between Sydney and
> > the other two sites, and 24ms Melbourne to Brisbane.
> In any situation other than "write speed doesn't matter at all" combined with
> "large writes, not small ones" and "read-mostly" you're going to be in severe
> pain.

For data yes, but the main case for that would be backup data where it would be 
large writes, read rarely and as long as streaming performance keeps up latency 
won't matter.   My concern with the latency would be how that impacts the 
monitors having to keep in sync and how that would impact client operations, 
especially with the rate of change that would occur with the predominant RBD 
use in most sites.

> > Most of the data
> > traffic will remain local but if we create a single national cluster
> > then how much of an impact will it be having all the mons needing to
> > keep in sync, as well as monitor and communicate with all OSDs (in the
> > end goal design there will be some 2300+ OSDs).
> >
> Significant.
> I wouldn't suggest it, but even if you deploy differently I'd suggest a test
> run/setup and sharing the experience with us. ^.^

Someone has to be the canary right :)

> > The other options I  am considering:
> > - split into east and west coast clusters, most of the cross city need
> > is in the east coast, any data moves between clusters can be done with
> > snap replication
> > - city based clusters (tightest latency) but lose the multi-DC EC
> > option, do cross city replication using snapshots
> >
> The latter, I seem to remember that there was work in progress to do this
> (snapshot replication) in an automated fashion.
>
> > Just want to get a feel for what I need to consider when we start
> > building at this scale.
> >
> I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
> the only well known/supported way to do geo-replication with Ceph is via
> RGW.

iSCSI is working fairly well.  We have decided to not use Ceph for the latency 
sensitive workloads so while we are still working to keep that low, we won't be 
putting the heavier IOP or latency sensitive workloads onto it until we get a 
better feel for how it behaves at scale and can be sure of the performance.

As above - for the most part we are going to have local site pools (replicate 
at application level), a few metro replicated pools and a 
couple of very small multi-metro replicated pools, with the geo-redundant EC 
stuff a future plan.  It would just be a shame to lock the design into a setup 
that won't let us do some of these wider options down the track.

Thanks.

Adrian


Re: [ceph-users] Mon placement over wide area

2016-04-11 Thread Christian Balzer

Hello (again),

On Tue, 12 Apr 2016 00:46:29 + Adrian Saul wrote:

> 
> We are close to being given approval to deploy a 3.5PB Ceph cluster that
> will be distributed over every major capital in Australia.  The config
> will be dual sites in each city that will be coupled as HA pairs - 12
> sites in total.   The vast majority of CRUSH rules will place data
> either locally to the individual site, or replicated to the other HA
> site in that city.   However there are future use cases where I think we
> could use EC to distribute data wider or have some replication that puts
> small data sets across multiple cities.   
This will very, very, VERY much depend on the data (use case) in question.

>All of this will be tied
> together with a dedicated private IP network.
> 
> The concern I have is around the placement of mons.  In the current
> design there would be two monitors in each site, running separate to the
> OSDs as part of some hosts acting as RBD to iSCSI/NFS gateways.   There
> will also be a "tiebreaker" mon placed on a separate host which will
> house some management infrastructure for the whole platform.
> 
Yes, that's the preferable way, might want to up this to 5 mons so you can
lose one while doing maintenance on another one.
But if that would be a coupled, national cluster you're looking both at
significant MON traffic, interesting "split-brain" scenarios and latencies
as well (MONs get chosen randomly by clients AFAIK).

> Obviously a concern is latency - the east coast to west coast latency is
> around 50ms, and on the east coast it is 12ms between Sydney and the
> other two sites, and 24ms Melbourne to Brisbane.  
In any situation other than "write speed doesn't matter at all" combined
with "large writes, not small ones" and "read-mostly" you're going to be in
severe pain.

> Most of the data
> traffic will remain local but if we create a single national cluster
> then how much of an impact will it be having all the mons needing to
> keep in sync, as well as monitor and communicate with all OSDs (in the
> end goal design there will be some 2300+ OSDs).
> 
Significant. 
I wouldn't suggest it, but even if you deploy differently I'd suggest a
test run/setup and sharing the experience with us. ^.^

> The other options I  am considering:
> - split into east and west coast clusters, most of the cross city need
> is in the east coast, any data moves between clusters can be done with
> snap replication
> - city based clusters (tightest latency) but lose the multi-DC EC
> option, do cross city replication using snapshots
> 
The latter, I seem to remember that there was work in progress to do this
(snapshot replication) in an automated fashion.
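
Until that lands, the usual manual form is a snapshot plus export-diff/import-diff
pipe, e.g. (pool, image, snapshot and host names are placeholders):

  rbd snap create rbd/vol1@snap2
  rbd export-diff --from-snap snap1 rbd/vol1@snap2 - | \
      ssh backup-site rbd import-diff - rbd/vol1

wrapped in whatever scheduling and snapshot cleanup scripting you need.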

> Just want to get a feel for what I need to consider when we start
> building at this scale.
> 
I know you're set on iSCSI/NFS (have you worked out the iSCSI kinks?), but
the only well known/supported way to do geo-replication with Ceph is via
RGW.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


[ceph-users] Mon placement over wide area

2016-04-11 Thread Adrian Saul

We are close to being given approval to deploy a 3.5PB Ceph cluster that will 
be distributed over every major capital in Australia.  The config will be 
dual sites in each city that will be coupled as HA pairs - 12 sites in total.   
The vast majority of CRUSH rules will place data either locally to the 
individual site, or replicated to the other HA site in that city.   However 
there are future use cases where I think we could use EC to distribute data 
wider or have some replication that puts small data sets across multiple 
cities.   All of this will be tied together with a dedicated private IP network.
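
As an illustration of the metro-replicated case, the kind of CRUSH rule I have in 
mind would look something like this in the decompiled map (bucket names are made 
up, and "site" assumes we add it as a bucket type under each city):

  rule sydney_metro_replicated {
          ruleset 10
          type replicated
          min_size 2
          max_size 4
          step take sydney
          step chooseleaf firstn 0 type site
          step emit
  }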

The concern I have is around the placement of mons.  In the current design 
there would be two monitors in each site, running separate to the OSDs as part 
of some hosts acting as RBD to iSCSI/NFS gateways.   There will also be a 
"tiebreaker" mon placed on a separate host which will house some management 
infrastructure for the whole platform.
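
In ceph.conf terms the initial monitor layout for one metro pair plus the 
tiebreaker would be something like this (names and addresses are placeholders):

  [global]
  mon initial members = syd1-mon1, syd1-mon2, syd2-mon1, syd2-mon2, mgmt-mon
  mon host = 10.1.1.10, 10.1.1.11, 10.1.2.10, 10.1.2.11, 10.1.0.5

with the usual caveat that quorum needs a strict majority (3 of the 5 here).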

Obviously a concern is latency - the east coast to west coast latency is around 
50ms, and on the east coast it is 12ms between Sydney and the other two sites, 
and 24ms Melbourne to Brisbane.  Most of the data traffic will remain local but 
if we create a single national cluster then how much of an impact will it be 
having all the mons needing to keep in sync, as well as monitor and communicate 
with all OSDs (in the end goal design there will be some 2300+ OSDs).

The other options I  am considering:
- split into east and west coast clusters, most of the cross city need is in 
the east coast, any data moves between clusters can be done with snap 
replication
- city based clusters (tightest latency) but lose the multi-DC EC option, do 
cross city replication using snapshots

Just want to get a feel for what I need to consider when we start building at 
this scale.

Cheers,
 Adrian





