Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 09:19:39 - Nick Fisk wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Christian Balzer
> > Sent: 17 February 2016 02:41
> > To: ceph-users 
> > Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> > 
> > 
> > Hello,
> > 
> > On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:
> > 
> > > Nick, Tyler, many thanks for very helpful feedback!
> > > I spent many hours meditating on the following two links:
> > > http://www.supermicro.com/solutions/storage_ceph.cfm
> > > http://s3s.eu/cephshop
> > >
> > > 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> > > CPUs (even the fastest ones) be enough to handle Erasure Coding?
> > >
> > Depends.
> > Since you're doing sequential writes (and reads I assume as you're
> > dealing with videos), CPU usage is going to be a lot lower than with
> > random, small 4KB block I/Os.
> > So most likely, yes.
> 
> That was my initial thought, but reading that paper I linked, the 4MB
> tests are the ones that bring the CPU's to their knees. I think the
> erasure calculation is a large part of the overall CPU usage and more
> data with the larger IO's causes a significant increase in CPU
> requirements.
> 
This is clearly where my total lack of EC exposure and experience is
showing, but it certainly makes sense as well.

> Correct me if I'm wrong, but I recall Christian, that your cluster is a
> full SSD cluster? 
No, but we talked back when I was building our 2nd production cluster and
while waiting for parts I did make a temporary all-SSD one using all the
prospective journal SSDs.

And definitely maxed out on CPU long before the SSDs got busy when doing
4KB rados benches or similar.

OTOH that same machine only uses about 4 cores out of 16 when doing the
same thing in its current configuration with 8 HDDs and 4 journal SSDs.

> I think we touched on this before, that the ghz per
> OSD is probably more like 100mhz per IOP. In a spinning disk cluster,
> you effectively have a cap on the number of IOs you can serve before the
> disks max out. So the difference between large and small IO's is not
> that great. But on a SSD cluster there is no cap and so you just end up
> with more IO's, hence the higher CPU.
> 
Yes and that number is a good baseline (still).

My own rule of thumb is 1GHz or slightly less per OSD for pure HDD based
clusters and about 1.5GHz for ones with SSD journals. 
Round up for OS and (in my case frequently) MON usage.

Of course for purely SSD based OSDs, throw the kitchen sink at it, if
your wallet allows for it.
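The rules of thumb above can be turned into a quick sizing check. A rough sketch only; the per-OSD GHz figures are the heuristics from this thread, not benchmarked values, and the 2 GHz overhead reserve is an assumption:

```python
# Rough CPU sizing check based on the rules of thumb from this thread:
# ~1 GHz per OSD for pure-HDD nodes, ~1.5 GHz per OSD with SSD journals.
GHZ_PER_OSD = {"hdd": 1.0, "hdd_ssd_journal": 1.5}

def node_cpu_needed(num_osds, osd_type, overhead_ghz=2.0):
    """Total CPU GHz a node should have, with a small reserve for the
    OS and a possibly co-located monitor (overhead is an assumption)."""
    return num_osds * GHZ_PER_OSD[osd_type] + overhead_ghz

# The 29-OSD, SSD-journal node proposed earlier in the thread:
print(node_cpu_needed(29, "hdd_ssd_journal"))  # 45.5
```

By this yardstick the proposed 29-OSD nodes would want something like dual 12-core CPUs at ~2 GHz.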
 
Christian
> > 
> > > Also as Nick stated with 4-5 nodes I cannot use high-M "K+M"
> > > combinations. I've done some calculations and found that the most
> > > efficient and safe configuration is to use 10 nodes with 29*6TB SATA
> > > and 7*200GB S3700 for journals. Assuming 6+3 EC profile that will
> > > give me
> > > 1.16 PB of effective space. Also I prefer not to use precious NVMe
> > > drives. Don't see any reason to use them.
> > >
> > This is probably your best way forward, dense is nice and cost saving,
> > but comes with a lot of potential gotchas.
> > Dense and large clusters can work, dense and small not so much.
> > 
> > > But what about RAM? Can I go with 64GB per node with above config?
> > > I've seen OSDs are consuming not more than 1GB RAM for replicated
> > > pools (even 6TB ones). But what is the typical memory usage of EC
> > > pools? Does anybody know that?
> > >
> > Above config (29 OSDs) that would be just about right.
> > I always go with at least 2GB RAM per OSD, since during a full node
> > restart and the subsequent peering, OSDs will grow large, a LOT larger
> > than their usual steady-state size.
> > RAM isn't that expensive these days and additional RAM comes in very
> > handy when used for pagecache and SLAB (dentry) stuff.
> > 
> > Something else to think about in your specific use case is to have
> > RAID'ed OSDs.
> > It's a bit of zero sum game probably, but compare the above config
> > with this. 11 nodes, each with:
> > 34 6TB SATAs (2x 17HDDs RAID6)
> > 2 200GB S3700 SSDs (journal/OS)
> > Just 2 OSDs per node.
> > Ceph with replication of 2.
> > Just shy of 1PB of effective space.
> > 
> > Minus: More physical space, less efficient HDD usage (replication vs.
> > EC).
> > 
> > Plus: A lot less expensive SSDs, less CPU and RAM requirements, smaller
> > impact in case of node failure/maintenance.
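For reference, the arithmetic behind the two layouts compared above works out as follows. A sketch in decimal TB, ignoring filesystem overhead and near-full ratios:

```python
TB = 6  # drive size in TB

# Layout 1 (quoted above): 10 nodes x 29 HDDs, EC 6+3 -> efficiency 6/9
raw_ec = 10 * 29 * TB
effective_ec = raw_ec * 6 / 9
print(effective_ec)    # 1160.0 TB, i.e. the 1.16 PB quoted above

# RAM guideline for the same nodes: 2 GB per OSD vs the 64 GB proposed
print(29 * 2)          # 58 GB -> 64 GB is "just about right"

# Layout 2 (Christian's alternative): 11 nodes x 2 RAID6 OSDs
# (17 HDDs per set, 15 of them holding data), replication 2
usable_per_node = 2 * 15 * TB
effective_repl = 11 * usable_per_node / 2
print(effective_repl)  # 990.0 TB, "just shy of 1 PB"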

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Nick Fisk
Ah, typo: I meant to say 10MHz per IO. So a 7.2k disk does around 80 IOPS =
~800MHz, which is close to the 1GHz figure.
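That arithmetic, spelled out using only the figures from the thread:

```python
MHZ_PER_IO = 10   # Nick's corrected rule of thumb
HDD_IOPS = 80     # typical for a 7.2k rpm SATA disk

print(MHZ_PER_IO * HDD_IOPS)  # 800 MHz per OSD, near the ~1 GHz rule
```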

 

From: John Hogenmiller [mailto:j...@hogenmiller.net] 
Sent: 17 February 2016 13:15
To: Nick Fisk 
Cc: Василий Ангапов ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code

 

I hadn't come across this ratio prior, but now that I've read that PDF you 
linked and I've narrowed my search in the mailing list, I think that the 0.5 - 
1ghz per OSD ratio is pretty spot on. The 100Mhz per IOP is also pretty 
interesting, and we do indeed use 7200 RPM drives. 

 

I'll look up a few more things, but based on what I've seen so far, the 
hardware we're using will most likely not be suitable, which is unfortunate as 
that adds some more complexity at OSI Level 8. :D

 

 

On Wed, Feb 17, 2016 at 4:14 AM, Nick Fisk <n...@fisk.me.uk> wrote:

Thanks for posting your experiences John, very interesting read. I think the 
golden rule of around 1Ghz is still a realistic goal to aim for. It looks like 
you probably have around 16ghz for 60OSD's, or 0.26Ghz per OSD. Do you have any 
idea on how much CPU you think you would need to just be able to get away with 
it?

I have 24Ghz for 12 OSD's (2x2620v2) and I typically don't see CPU usage over 
about 20%, which indicates to me the bare minimum for a replicated pool is 
probably around 0.5Ghz per 7.2k rpm OSD. The next nodes we have will certainly 
have less CPU.
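A sketch of where that bare-minimum estimate comes from; the 24 GHz and 20% figures are Nick's own, above:

```python
total_ghz = 24.0   # 2x E5-2620 v2, as stated above
osds = 12
peak_util = 0.20   # highest CPU usage Nick reports seeing

# GHz actually consumed at peak, spread across the OSDs
ghz_per_osd = round(total_ghz * peak_util / osds, 2)
print(ghz_per_osd)  # 0.4, which Nick rounds up to ~0.5 GHz per OSD
```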

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> John Hogenmiller
> Sent: 17 February 2016 03:04
> To: Василий Ангапов <anga...@gmail.com>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
>

> Turns out i didn't do reply-all.
>
> On Tue, Feb 16, 2016 at 9:18 AM, John Hogenmiller <j...@hogenmiller.net>
> wrote:
> > And again - is dual Xeon's power enough for 60-disk node and Erasure
> Code?
>
>
> This is something I've been attempting to determine as well.
> I'm testing with some white-label hardware, but essentially supermicro
> 2twinu's with a pair of E5-2609 Xeons and 64GB of
> memory.  (http://www.supermicro.com/products/system/2U/6028/SYS-
> 6028TR-HTFR.cfm). This is attached to DAEs with 60 x 6TB drives, in JBOD.
>
> Conversely, Supermicro sells a 72-disk OSD node, which Redhat considers a
> supported "reference architecture" device. The processors in those nodes
> are E5-269 12-core, vs what I have which is quad-core.
> http://www.supermicro.com/solutions/storage_ceph.cfm  (SSG-
> 6048R-OSD432). I would highly recommend reflecting on the supermicro
> hardware and using that as your reference as well. If you could get an eval
> unit, use that to compare with the hardware you're working with.
>
> I currently have mine setup with 7 nodes, 60 OSDs each, radosgw running
> one each node, and 5 ceph monitors. I plan to move the monitors to their
> own dedicated hardware, and in reading, I may only need 3 to manage the
> 420 OSDs.   I am currently just set up for replication instead of EC, though I
> want to redo this cluster to use EC. Also, I am still trying to work out how
> much of an impact placement groups have on performance, and I may have a
> performance-hampering amount.
>
> We test the system using locust speaking S3 to the radosgw. Transactions are
> distributed equally across all 7 nodes and we track the statistics. We started
> first emulating 1000 users and got over 4Gbps, but load average on all nodes
> was in the mid-100s, and after 15 minutes we started getting socket
> timeouts. We stopped the test, let load settle, and started back at 100
> users.  We've been running this test about 5 days now.  Load average on all
> nodes floats between 40 and 70. The nodes with ceph-mon running on them
> do not appear to be taxed any more than the ones without. The radosgw
> itself seems to take up a decent amount of cpu (running civetweb, no
> ssl).  iowait is non existent, everything appears to be cpu bound.
>
> At 1000 users, we had 4.3Gbps of PUTs and 2.2Gbps of GETs. Did not capture
> the TPS on that short test.
> At 100 users, we're pushing 2Gbps in PUTs and 1.24Gbps in GETs. Averaging
> 115 TPS.
>
> All in all, the speeds are not bad for a single rack, but the CPU utilization 
> is a
> big concern. We're currently using other (proprietary) object storage
> platforms on this hardware configuration. They have their own set of issues,
> but CPU utilization is typically not the problem, even at higher utilization.

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Tyler Bishop
I'm using 2x replica on that pool for storing RBD volumes. Our workload is
pretty heavy; I'd imagine an object workload on EC would be light in comparison.

Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 




From: "John Hogenmiller"  
To: "Tyler Bishop"  
Cc: "Nick Fisk" , ceph-users@lists.ceph.com 
Sent: Wednesday, February 17, 2016 7:50:11 AM 
Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code 

Tyler, 
E5-2660 V2 is a 10-core, 2.2Ghz, giving you roughly 44Ghz or 0.78Ghz per OSD. 
That seems to fall in line with Nick's "golden rule" or 0.5Ghz - 1Ghz per OSD. 
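John's per-OSD figure can be reproduced directly. A sketch; the clock is the E5-2660 V2 base frequency he cites, and his 0.78 truncates where this rounds:

```python
cpus, cores, clock_ghz = 2, 10, 2.2   # dual E5-2660 V2, per the thread
osds = 56

total_ghz = round(cpus * cores * clock_ghz, 1)  # 44.0
per_osd = round(total_ghz / osds, 2)
print(total_ghz, per_osd)  # 44.0 0.79 -> inside the 0.5-1 GHz band
```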

Are you doing EC or Replication? If EC, what profile? Could you also provide an 
average of CPU utilization? 

I'm still researching, but so far, the ratio seems to be pretty realistic. 

-John 

On Tue, Feb 16, 2016 at 9:22 AM, Tyler Bishop <tyler.bis...@beyondhosting.net> wrote:


We use dual E5-2660 V2 with 56 6T and performance has not been an issue. It 
will easily saturate the 40G interfaces and saturate the spindle io. 

And yes, you can run dual servers attached to 30 disk each. This gives you lots 
of density. Your failure domain will remain as individual servers. The only 
thing shared is the quad power supplies. 

Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 



- Original Message - 
From: "Nick Fisk" < n...@fisk.me.uk > 
To: "Василий Ангапов" < anga...@gmail.com >, "Tyler Bishop" < 
tyler.bis...@beyondhosting.net > 
Cc: ceph-users@lists.ceph.com 
Sent: Tuesday, February 16, 2016 8:24:33 AM 
Subject: RE: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code 

> -Original Message- 
> From: Василий Ангапов [mailto: anga...@gmail.com ] 
> Sent: 16 February 2016 13:15 
> To: Tyler Bishop < tyler.bis...@beyondhosting.net > 
> Cc: Nick Fisk <n...@fisk.me.uk>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with 
> Erasure Code 
> 
> 2016-02-16 17:09 GMT+08:00 Tyler Bishop 
> < tyler.bis...@beyondhosting.net >: 
> > With ucs you can run dual server and split the disk. 30 drives per node. 
> > Better density and easier to manage. 
> I don't think I got your point. Can you please explain it in more details? 

I think he means that the 60 bays can be zoned, so you end up with one physical
JBOD split into two logical 30-disk JBODs, each connected to a different server.
What this does to your failure domains is another question.

> 
> And again - is dual Xeon's power enough for 60-disk node and Erasure Code? 

I would imagine yes, but you would most likely need to go for the 12-18 core
versions with a high clock. These are serious money. I don't know at what point
this becomes more expensive than 12-disk nodes with "cheap" Xeon-D's or Xeon
E3's.
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 








Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 17 February 2016 02:41
> To: ceph-users 
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> 
> Hello,
> 
> On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:
> 
> > Nick, Tyler, many thanks for very helpful feedback!
> > I spent many hours meditating on the following two links:
> > http://www.supermicro.com/solutions/storage_ceph.cfm
> > http://s3s.eu/cephshop
> >
> > 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> > CPUs (even the fastest ones) be enough to handle Erasure Coding?
> >
> Depends.
> Since you're doing sequential writes (and reads I assume as you're dealing
> with videos), CPU usage is going to be a lot lower than with random, small
> 4KB block I/Os.
> So most likely, yes.

That was my initial thought, but reading that paper I linked, the 4MB tests are 
the ones that bring the CPU's to their knees. I think the erasure calculation 
is a large part of the overall CPU usage and more data with the larger IO's 
causes a significant increase in CPU requirements.

Correct me if I'm wrong, but I recall Christian, that your cluster is a full 
SSD cluster? I think we touched on this before, that the ghz per OSD is 
probably more like 100mhz per IOP. In a spinning disk cluster, you effectively 
have a cap on the number of IOs you can serve before the disks max out. So the 
difference between large and small IO's is not that great. But on a SSD cluster 
there is no cap and so you just end up with more IO's, hence the higher CPU.

> 
> > Also as Nick stated with 4-5 nodes I cannot use high-M "K+M"
> > combinations. I've done some calculations and found that the most
> > efficient and safe configuration is to use 10 nodes with 29*6TB SATA
> > and 7*200GB S3700 for journals. Assuming 6+3 EC profile that will give
> > me
> > 1.16 PB of effective space. Also I prefer not to use precious NVMe
> > drives. Don't see any reason to use them.
> >
> This is probably your best way forward, dense is nice and cost saving, but
> comes with a lot of potential gotchas.
> Dense and large clusters can work, dense and small not so much.
> 
> > But what about RAM? Can I go with 64GB per node with above config?
> > I've seen OSDs are consuming not more than 1GB RAM for replicated
> > pools (even 6TB ones). But what is the typical memory usage of EC
> > pools? Does anybody know that?
> >
> Above config (29 OSDs) that would be just about right.
> I always go with at least 2GB RAM per OSD, since during a full node restart
> and the subsequent peering, OSDs will grow large, a LOT larger than their
> usual steady-state size.
> RAM isn't that expensive these days and additional RAM comes in very
> handy when used for pagecache and SLAB (dentry) stuff.
> 
> Something else to think about in your specific use case is to have RAID'ed
> OSDs.
> It's a bit of zero sum game probably, but compare the above config with this.
> 11 nodes, each with:
> 34 6TB SATAs (2x 17HDDs RAID6)
> 2 200GB S3700 SSDs (journal/OS)
> Just 2 OSDs per node.
> Ceph with replication of 2.
> Just shy of 1PB of effective space.
> 
> Minus: More physical space, less efficient HDD usage (replication vs. EC).
> 
> Plus: A lot less expensive SSDs, less CPU and RAM requirements, smaller
> impact in case of node failure/maintenance.
> 
> No ideas about the stuff below.
> 
> Christian
> > Also, am I right that for 6+3 EC profile i need at least 10 nodes to
> > feel comfortable (one extra node for redundancy)?
> >
> > And finally can someone recommend what EC plugin to use in my case? I
> > know it's a difficult question but anyway?
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > 2016-02-16 16:12 GMT+08:00 Nick Fisk :
> > >
> > >
> > >> -Original Message-
> > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > >> Behalf Of Tyler Bishop
> > >> Sent: 16 February 2016 04:20
> > >> To: Василий Ангапов 
> > >> Cc: ceph-users 
> > >> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW
> > >> with Erasure Code
> > >>
> > >> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
> > >>
> > >> We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
> > >> Performance is excellent.
> > >
> > > Only t


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread John Hogenmiller
Turns out i didn't do reply-all.

On Tue, Feb 16, 2016 at 9:18 AM, John Hogenmiller 
wrote:

> > And again - is dual Xeon's power enough for 60-disk node and Erasure
> Code?
>
>
> > This is something I've been attempting to determine as well.
> I'm testing with some white-label hardware, but essentially supermicro
> 2twinu's with a pair of E5-2609 Xeons and 64GB of memory.  (
> http://www.supermicro.com/products/system/2U/6028/SYS-6028TR-HTFR.cfm).
> This is attached to DAEs with 60 x 6TB drives, in JBOD.
>
> Conversely, Supermicro sells a 72-disk OSD node, which Redhat considers a
> supported "reference architecture" device. The processors in those nodes
> are E5-269 12-core, vs what I have which is quad-core.
> http://www.supermicro.com/solutions/storage_ceph.cfm  (SSG-6048R-OSD432).
> I would highly recommend reflecting on the supermicro hardware and using
> that as your reference as well. If you could get an eval unit, use that
> to compare with the hardware you're working with.
>
> I currently have mine setup with 7 nodes, 60 OSDs each, radosgw running
> one each node, and 5 ceph monitors. I plan to move the monitors to their
> own dedicated hardware, and in reading, I may only need 3 to manage the 420
> OSDs.   I am currently just set up for replication instead of EC, though
> I want to redo this cluster to use EC. Also, I am still trying to work
> out how much of an impact placement groups have on performance, and I may
> have a performance-hampering amount.
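For what it's worth, the PG concern can be ballparked with the guideline that often appears in Ceph sizing discussions, roughly 100 PGs per OSD divided by the replica count, rounded up to a power of two. A sketch only, assuming replica size 3, not authoritative tuning advice:

```python
# Rough placement-group budget check: ~100 PGs per OSD / replica count,
# rounded up to the next power of two (a common rule of thumb).
def pg_target(osds, replicas, pgs_per_osd=100):
    raw = osds * pgs_per_osd / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(pg_target(419, 3))  # 16384
# The cluster above reports 33848 PGs across 14 pools, roughly double
# this target, so the "performance-hampering amount" suspicion is plausible.
```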
>
> We test the system using locust speaking S3 to the radosgw. Transactions
> are distributed equally across all 7 nodes and we track the statistics. We
> started first emulating 1000 users and got over 4Gbps, but load average on
> all nodes was in the mid-100s, and after 15 minutes we started getting
> socket timeouts. We stopped the test, let load settle, and started back at
> 100 users.  We've been running this test about 5 days now.  Load average on
> all nodes floats between 40 and 70. The nodes with ceph-mon running on them
> do not appear to be taxed any more than the ones without. The radosgw
> itself seems to take up a decent amount of cpu (running civetweb, no ssl).
>  iowait is non existent, everything appears to be cpu bound.
>
> At 1000 users, we had 4.3Gbps of PUTs and 2.2Gbps of GETs. Did not capture
> the TPS on that short test.
> At 100 users, we're pushing 2Gbps in PUTs and 1.24Gbps in GETs. Averaging
> 115 TPS.
>
> All in all, the speeds are not bad for a single rack, but the CPU
> utilization is a big concern. We're currently using other (proprietary)
> object storage platforms on this hardware configuration. They have their
> own set of issues, but CPU utilization is typically not the problem, even
> at higher utilization.
>
>
>
> root@ljb01:/home/ceph/rain-cluster# ceph status
> cluster 4ebe7995-6a33-42be-bd4d-20f51d02ae45
>  health HEALTH_OK
>  monmap e5: 5 mons at {hail02-r01-06=
> 172.29.4.153:6789/0,hail02-r01-08=172.29.4.155:6789/0,rain02-r01-01=172.29.4.148:6789/0,rain02-r01-03=172.29.4.150:6789/0,rain02-r01-04=172.29.4.151:6789/0
> }
> election epoch 86, quorum 0,1,2,3,4
> rain02-r01-01,rain02-r01-03,rain02-r01-04,hail02-r01-06,hail02-r01-08
>  osdmap e2543: 423 osds: 419 up, 419 in
> flags sortbitwise
>   pgmap v676131: 33848 pgs, 14 pools, 50834 GB data, 29660 kobjects
> 149 TB used, 2134 TB / 2284 TB avail
>33848 active+clean
>   client io 129 MB/s rd, 182 MB/s wr, 1562 op/s
>
>
>
>  # ceph-osd + ceph-mon + radosgw
> top - 13:29:22 up 40 days, 22:05,  1 user,  load average: 47.76, 47.33,
> 47.08
> Tasks: 1001 total,   7 running, 994 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 39.2 us, 44.7 sy,  0.0 ni,  9.9 id,  2.4 wa,  0.0 hi,  3.7 si,
>  0.0 st
> KiB Mem:  65873180 total, 64818176 used,  1055004 free, 9324 buffers
> KiB Swap:  8388604 total,  7801828 used,   586776 free. 17610868 cached Mem
>
> PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
>
>  178129 ceph  20   0 3066452 618060   5440 S  54.6  0.9   2678:49
> ceph-osd
>  218049 ceph  20   0 6261880 179704   2872 S  33.4  0.3   1852:14
> radosgw
>  165529 ceph  20   0 2915332 579064   4308 S  19.7  0.9 530:12.65
> ceph-osd
>  185193 ceph  20   0 2932696 585724   4412 S  19.1  0.9 545:20.31
> ceph-osd
>   52334 ceph  20   0 3030300 618868   4328 S  15.8  0.9 543:53.64
> ceph-osd
>   23124 ceph  20   0 3037740 607088   4440 S  15.2  0.9 461:03.98
> ceph-osd
>  154031 ceph  20   0 2982344 525428   4044 S  14.9  0.8 587:17.62
> ceph-osd
>  191278 ceph  20   0 2835208 570100   4700 S  14.9  0.9 547:11.66
> ceph-osd
>
>  # ceph-osd + radosgw (no ceph-mon)
>
>  top - 13:31:22 up 40 days, 22:06,  1 user,  load average: 64.25, 59.76,
> 58.17
> Tasks: 1015 total,   4 running, 1011 sleeping,   0 stopped,   0 zombie
> %Cpu0  : 24.2 us, 48.5 sy,  0.0 ni, 10.9 id,  1.2 wa,  0.0 hi, 15.2 si,
>  0.0 st
> %Cpu

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Christian Balzer

Hello,

On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:

> Nick, Tyler, many thanks for very helpful feedback!
> I spent many hours meditating on the following two links:
> http://www.supermicro.com/solutions/storage_ceph.cfm
> http://s3s.eu/cephshop
> 
> 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> CPUs (even the fastest ones) be enough to handle Erasure Coding?
>
Depends. 
Since you're doing sequential writes (and reads I assume as you're dealing
with videos), CPU usage is going to be a lot lower than with random, small
4KB block I/Os.
So most likely, yes.

> > Also, as Nick stated, with 4-5 nodes I cannot use high-M "K+M"
> > combinations. I did some calculations and found that the most
> > efficient and safe configuration is to use 10 nodes with 29*6TB SATA and
> > 7*200GB S3700 for journals. Assuming a 6+3 EC profile, that will give me
> > 1.16 PB of effective space. Also I prefer not to use precious NVMe
> > drives; I don't see any reason to use them.
> 
This is probably your best way forward, dense is nice and cost saving, but
comes with a lot of potential gotchas. 
Dense and large clusters can work, dense and small not so much.
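As a quick check of the 6+3 capacity figure quoted above (a sketch; raw TB taken at face value, ignoring Ceph overhead and near-full reserves):

```python
# Effective capacity of a k+m erasure-coded pool: raw * k / (k + m).
def ec_effective_tb(nodes, osds_per_node, drive_tb, k, m):
    raw_tb = nodes * osds_per_node * drive_tb
    return raw_tb * k / (k + m)

# 10 nodes x 29 x 6TB SATA with a 6+3 profile:
print(ec_effective_tb(10, 29, 6, 6, 3))  # -> 1160.0 TB, i.e. ~1.16 PB
```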

> But what about RAM? Can I go with 64GB per node with the above config?
> I've seen OSDs consuming no more than 1GB of RAM for replicated
> pools (even 6TB ones). But what is the typical memory usage of EC
> pools? Does anybody know?
> 
With the above config (29 OSDs), that would be just about right.
I always go with at least 2GB RAM per OSD, since during a full node
restart and the subsequent peering, OSDs will grow a LOT larger
than their usual steady-state size.
RAM isn't that expensive these days, and additional RAM comes in very handy
for pagecache and SLAB (dentry) caching.
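Applying that 2GB-per-OSD rule to the proposed 29-OSD nodes (a trivial check):

```python
# Peering after a full-node restart balloons OSD memory, hence 2 GB/OSD
# as a floor rather than the ~1 GB steady-state figure.
osds_per_node = 29
peak_gb_per_osd = 2
print(osds_per_node * peak_gb_per_osd)  # -> 58 GB, so 64 GB just fits
```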

Something else to think about in your specific use case is RAID'ed
OSDs.
It's probably a bit of a zero-sum game, but compare the above config with
this:
11 nodes, each with:
34 6TB SATAs (2x 17-HDD RAID6 sets)
2 200GB S3700 SSDs (journal/OS)
Just 2 OSDs per node.
Ceph with replication of 2.
Just shy of 1PB of effective space.

Minus: More physical space, less efficient HDD usage (replication vs. EC).

Plus: A lot less expensive SSDs, less CPU and RAM requirements, smaller
impact in case of node failure/maintenance.
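The arithmetic behind that alternative works out as follows (a sketch; RAID6 costs two parity drives per set, and a size=2 pool halves the remainder):

```python
# 11 nodes, two 17-drive RAID6 sets per node, 6TB drives, replication 2.
nodes, sets_per_node, drives_per_set, drive_tb = 11, 2, 17, 6
usable_per_set_tb = (drives_per_set - 2) * drive_tb   # RAID6 parity loss
raw_tb = nodes * sets_per_node * usable_per_set_tb
print(raw_tb / 2)  # -> 990.0 TB, "just shy of 1PB" after replication
```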

No ideas about the stuff below.

Christian
> Also, am I right that for a 6+3 EC profile I need at least 10 nodes to
> feel comfortable (one extra node for redundancy)?
> 
> And finally can someone recommend what EC plugin to use in my case? I
> know it's a difficult question but anyway?
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 2016-02-16 16:12 GMT+08:00 Nick Fisk :
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> >> Of Tyler Bishop
> >> Sent: 16 February 2016 04:20
> >> To: Василий Ангапов 
> >> Cc: ceph-users 
> >> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> >> Erasure Code
> >>
> >> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
> >>
> >> We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
> >> Performance is excellent.
> >
> > Only thing I will say to the OP is that if you only need 1PB, then
> > likely 4-5 of these will give you enough capacity. Personally I would
> > prefer to spread the capacity around more nodes. If you are doing
> > anything serious with Ceph it's normally a good idea to try to make
> > each node no more than 10% of total capacity. Also, with EC pools you
> > will be limited in the K+M combos you can achieve with smaller numbers
> > of nodes.
> >
> >>
> >> I would recommend a cache tier for sure if your data is busy for
> >> reads.
> >>
> >> Tyler Bishop
> >> Chief Technical Officer
> >> 513-299-7108 x10
> >>
> >>
> >>
> >> tyler.bis...@beyondhosting.net
> >>
> >>
> >> If you are not the intended recipient of this transmission you are
> >> notified that disclosing, copying, distributing or taking any action
> >> in reliance on the contents of this information is strictly
> >> prohibited.
> >>
> >> - Original Message -
> >> From: "Василий Ангапов" 
> >> To: "ceph-users" 
> >> Sent: Friday, February 12, 2016 7:44:07 AM
> >> Subject: [ceph-users] Recomendations for building 1PB RadosGW with
> >> Erasure   Code
> >>
> >> Hello,
> >>
>> We are planning to build a 1PB Ceph cluster for RadosGW with Erasure Code.

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Tyler Bishop
We use dual E5-2660 v2 with 56 6TB drives, and performance has not been an 
issue.  It will easily saturate the 40G interfaces and the spindle IO.

And yes, you can run dual servers attached to 30 disks each.  This gives you 
lots of density.  Your failure domain will remain individual servers.  The 
only thing shared is the quad power supplies.

Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited.

- Original Message -
From: "Nick Fisk" 
To: "Василий Ангапов" , "Tyler Bishop" 

Cc: ceph-users@lists.ceph.com
Sent: Tuesday, February 16, 2016 8:24:33 AM
Subject: RE: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code

> -Original Message-
> From: Василий Ангапов [mailto:anga...@gmail.com]
> Sent: 16 February 2016 13:15
> To: Tyler Bishop 
> Cc: Nick Fisk ;   us...@lists.ceph.com>
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> 2016-02-16 17:09 GMT+08:00 Tyler Bishop
> :
> > With ucs you can run dual server and split the disk.  30 drives per node.
> > Better density and easier to manage.
> I don't think I got your point. Can you please explain it in more details?

I think he means that the 60 bays can be zoned, so you end up with one physical 
JBOD split into two logical 30-disk JBODs, each connected to a different 
server. What this does to your failure domains is another question.

> 
> And again - is dual Xeon's power enough for 60-disk node and Erasure Code?

I would imagine yes, but you would most likely need to go for the 12-18 core 
versions with a high clock. Those are serious money. I don't know at what point 
this becomes more expensive than 12-disk nodes with "cheap" Xeon-Ds or Xeon 
E3s.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Nick Fisk


> -Original Message-
> From: Василий Ангапов [mailto:anga...@gmail.com]
> Sent: 16 February 2016 13:15
> To: Tyler Bishop 
> Cc: Nick Fisk ;   us...@lists.ceph.com>
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> 2016-02-16 17:09 GMT+08:00 Tyler Bishop
> :
> > With ucs you can run dual server and split the disk.  30 drives per node.
> > Better density and easier to manage.
> I don't think I got your point. Can you please explain it in more details?

I think he means that the 60 bays can be zoned, so you end up with one physical 
JBOD split into two logical 30-disk JBODs, each connected to a different 
server. What this does to your failure domains is another question.

> 
> And again - is dual Xeon's power enough for 60-disk node and Erasure Code?

I would imagine yes, but you would most likely need to go for the 12-18 core 
versions with a high clock. Those are serious money. I don't know at what point 
this becomes more expensive than 12-disk nodes with "cheap" Xeon-Ds or Xeon 
E3s.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Василий Ангапов
2016-02-16 17:09 GMT+08:00 Tyler Bishop :
> With ucs you can run dual server and split the disk.  30 drives per node.
> Better density and easier to manage.
I don't think I got your point. Can you please explain it in more details?

And again - is dual Xeon's power enough for 60-disk node and Erasure Code?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Tyler Bishop
With UCS you can run dual servers and split the disks: 30 drives per node.  
Better density and easier to manage. 

Sent from TypeApp



On Feb 16, 2016, at 3:39 AM, "Василий Ангапов"  
wrote:
>Nick, Tyler, many thanks for very helpful feedback!
>I spent many hours meditating on the following two links:
>http://www.supermicro.com/solutions/storage_ceph.cfm
>http://s3s.eu/cephshop
>
>60- or even 72-disk nodes are very capacity-efficient, but will the 2
>CPUs (even the fastest ones) be enough to handle Erasure Coding?
>Also, as Nick stated, with 4-5 nodes I cannot use high-M "K+M"
>combinations.
>I did some calculations and found that the most efficient and safe
>configuration is to use 10 nodes with 29*6TB SATA and 7*200GB S3700
>for journals. Assuming a 6+3 EC profile, that will give me 1.16 PB of
>effective space. Also I prefer not to use precious NVMe drives; I don't
>see any reason to use them.
>
>But what about RAM? Can I go with 64GB per node with the above config?
>I've seen OSDs consuming no more than 1GB of RAM for replicated
>pools (even 6TB ones). But what is the typical memory usage of EC
>pools? Does anybody know?
>
>Also, am I right that for a 6+3 EC profile I need at least 10 nodes to
>feel comfortable (one extra node for redundancy)?
>
>And finally can someone recommend what EC plugin to use in my case? I
>know it's a difficult question but anyway?
>
>
>
>
>
>
>
>
>
>2016-02-16 16:12 GMT+08:00 Nick Fisk :
>>
>>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>Behalf Of
>>> Tyler Bishop
>>> Sent: 16 February 2016 04:20
>>> To: Василий Ангапов 
>>> Cc: ceph-users 
>>> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW
>with
>>> Erasure Code
>>>
>>> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
>>>
>>> We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
>>> Performance is excellent.
>>
>> Only thing I will say to the OP is that if you only need 1PB, then
>> likely 4-5 of these will give you enough capacity. Personally I would
>> prefer to spread the capacity around more nodes. If you are doing
>> anything serious with Ceph it's normally a good idea to try to make
>> each node no more than 10% of total capacity. Also, with EC pools you
>> will be limited in the K+M combos you can achieve with smaller numbers
>> of nodes.
>>
>>>
>>> I would recommend a cache tier for sure if your data is busy for
>reads.
>>>
>>> Tyler Bishop
>>> Chief Technical Officer
>>> 513-299-7108 x10
>>>
>>>
>>>
>>> tyler.bis...@beyondhosting.net
>>>
>>>
>>> If you are not the intended recipient of this transmission you are
>notified
>>> that disclosing, copying, distributing or taking any action in
>reliance on the
>>> contents of this information is strictly prohibited.
>>>
>>> - Original Message -
>>> From: "Василий Ангапов" 
>>> To: "ceph-users" 
>>> Sent: Friday, February 12, 2016 7:44:07 AM
>>> Subject: [ceph-users] Recomendations for building 1PB RadosGW with
>>> Erasure   Code
>>>
>>> Hello,
>>>
>>> We are planning to build a 1PB Ceph cluster for RadosGW with Erasure
>>> Code. It will be used for storing online videos.
>>> We do not expect outstanding write performance, something like 200-
>>> 300MB/s of sequential write will be quite enough, but data safety is
>>> very important.
>>> What are the most popular hardware and software recomendations?
>>> 1) What EC profile is best to use? What values of K/M do you
>recommend?
>>
>> The higher you go in total k+m, the more CPU you will require, and
>> sequential performance will degrade slightly as the IOs going to the
>> disks get smaller. However, larger numbers allow you to be more creative
>> with failure scenarios and "replication" efficiency.
>>
>>> 2) Do I need to use Cache Tier for RadosGW or it is only needed for
>RBD? Is it
>>
>> Only needed for RBD, but depending on workload it may still benefit.
>If you are mostly doing large IO's, the gains will be a lot smaller.
>>
>>> still an overall good practice to use Cache Tier for RadosGW?
>>> 3) What hardware is recommended for EC? I assume higher-clocked CPUs
>are
>>> needed? What about RAM?
>>
>> Total GHz is more important (i.e. GHz x cores). Go with the cheapest/most
>> power-efficient you can get. Aim for somewhere around 1GHz per disk.

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Василий Ангапов
Nick, Tyler, many thanks for very helpful feedback!
I spent many hours meditating on the following two links:
http://www.supermicro.com/solutions/storage_ceph.cfm
http://s3s.eu/cephshop

60- or even 72-disk nodes are very capacity-efficient, but will the 2
CPUs (even the fastest ones) be enough to handle Erasure Coding?
Also, as Nick stated, with 4-5 nodes I cannot use high-M "K+M" combinations.
I did some calculations and found that the most efficient and safe
configuration is to use 10 nodes with 29*6TB SATA and 7*200GB S3700
for journals. Assuming a 6+3 EC profile, that will give me 1.16 PB of
effective space. Also I prefer not to use precious NVMe drives; I don't
see any reason to use them.

But what about RAM? Can I go with 64GB per node with the above config?
I've seen OSDs consuming no more than 1GB of RAM for replicated
pools (even 6TB ones). But what is the typical memory usage of EC
pools? Does anybody know?

Also, am I right that for a 6+3 EC profile I need at least 10 nodes to
feel comfortable (one extra node for redundancy)?

And finally, can someone recommend which EC plugin to use in my case? I
know it's a difficult question, but anyway?









2016-02-16 16:12 GMT+08:00 Nick Fisk :
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Tyler Bishop
>> Sent: 16 February 2016 04:20
>> To: Василий Ангапов 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
>> Erasure Code
>>
>> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
>>
>> We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
>> Performance is excellent.
>
> Only thing I will say to the OP is that if you only need 1PB, then likely 
> 4-5 of these will give you enough capacity. Personally I would prefer to 
> spread the capacity around more nodes. If you are doing anything serious with 
> Ceph it's normally a good idea to try to make each node no more than 10% of 
> total capacity. Also, with EC pools you will be limited in the K+M combos you 
> can achieve with smaller numbers of nodes.
>
>>
>> I would recommend a cache tier for sure if your data is busy for reads.
>>
>> Tyler Bishop
>> Chief Technical Officer
>> 513-299-7108 x10
>>
>>
>>
>> tyler.bis...@beyondhosting.net
>>
>>
>> If you are not the intended recipient of this transmission you are notified
>> that disclosing, copying, distributing or taking any action in reliance on 
>> the
>> contents of this information is strictly prohibited.
>>
>> - Original Message -
>> From: "Василий Ангапов" 
>> To: "ceph-users" 
>> Sent: Friday, February 12, 2016 7:44:07 AM
>> Subject: [ceph-users] Recomendations for building 1PB RadosGW with
>> Erasure   Code
>>
>> Hello,
>>
>> We are planning to build 1PB Ceph cluster for RadosGW with Erasure Code. It
>> will be used for storing online videos.
>> We do not expect outstanding write performance, something like 200-
>> 300MB/s of sequential write will be quite enough, but data safety is very
>> important.
>> What are the most popular hardware and software recomendations?
>> 1) What EC profile is best to use? What values of K/M do you recommend?
>
> The higher you go in total k+m, the more CPU you will require, and sequential 
> performance will degrade slightly as the IOs going to the disks get smaller. 
> However, larger numbers allow you to be more creative with failure scenarios 
> and "replication" efficiency.
>
>> 2) Do I need to use Cache Tier for RadosGW or it is only needed for RBD? Is 
>> it
>
> Only needed for RBD, but depending on workload it may still benefit. If you 
> are mostly doing large IO's, the gains will be a lot smaller.
>
>> still an overall good practice to use Cache Tier for RadosGW?
>> 3) What hardware is recommended for EC? I assume higher-clocked CPUs are
>> needed? What about RAM?
>
> Total GHz is more important (i.e. GHz x cores). Go with the cheapest/most 
> power-efficient you can get. Aim for somewhere around 1GHz per disk.
>
>> 4) What SSDs for Ceph journals are the best?
>
> Intel S3700 or P3700 (if you can stretch)
>
> By all means explore other options, but you can't go wrong by buying these. 
> Think "You can't get fired for buying Cisco" quote!!!
>
>>
>> Thanks a lot!
>>
>> Regards, Vasily.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Nick Fisk
Just to add, check out this excellent paper by Mark

http://www.spinics.net/lists/ceph-users/attachments/pdf6QGsF7Xi1G.pdf

Unfortunately his test hardware at the time didn't have enough horsepower to 
give an accurate view on required CPU for EC pools over all the tests. But you 
should get a fairly good idea about the hardware requirements from this.



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Nick Fisk
> Sent: 16 February 2016 08:12
> To: 'Tyler Bishop' ; 'Василий Ангапов'
> 
> Cc: 'ceph-users' 
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Tyler Bishop
> > Sent: 16 February 2016 04:20
> > To: Василий Ангапов 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> >
> > You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
> >
> > We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
> > Performance is excellent.
> 
> Only thing I will say to the OP is that if you only need 1PB, then likely
> 4-5 of these will give you enough capacity. Personally I would prefer to
> spread the capacity around more nodes. If you are doing anything serious
> with Ceph it's normally a good idea to try to make each node no more than
> 10% of total capacity. Also, with EC pools you will be limited in the K+M
> combos you can achieve with smaller numbers of nodes.
> 
> >
> > I would recommend a cache tier for sure if your data is busy for reads.
> >
> > Tyler Bishop
> > Chief Technical Officer
> > 513-299-7108 x10
> >
> >
> >
> > tyler.bis...@beyondhosting.net
> >
> >
> > If you are not the intended recipient of this transmission you are
> > notified that disclosing, copying, distributing or taking any action
> > in reliance on the contents of this information is strictly prohibited.
> >
> > - Original Message -
> > From: "Василий Ангапов" 
> > To: "ceph-users" 
> > Sent: Friday, February 12, 2016 7:44:07 AM
> > Subject: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> >
> > Hello,
> >
> > We are planning to build 1PB Ceph cluster for RadosGW with Erasure
> > Code. It will be used for storing online videos.
> > We do not expect outstanding write performance, something like 200-
> > 300MB/s of sequential write will be quite enough, but data safety is
> > very important.
> > What are the most popular hardware and software recomendations?
> > 1) What EC profile is best to use? What values of K/M do you recommend?
> 
> The higher you go in total k+m, the more CPU you will require, and sequential
> performance will degrade slightly as the IOs going to the disks get smaller.
> However, larger numbers allow you to be more creative with failure scenarios
> and "replication" efficiency.
> 
> > 2) Do I need to use Cache Tier for RadosGW or it is only needed for
> > RBD? Is it
> 
> Only needed for RBD, but depending on workload it may still benefit. If you
> are mostly doing large IO's, the gains will be a lot smaller.
> 
> > still an overall good practice to use Cache Tier for RadosGW?
> > 3) What hardware is recommended for EC? I assume higher-clocked CPUs
> > are needed? What about RAM?
> 
> Total GHz is more important (i.e. GHz x cores). Go with the cheapest/most
> power-efficient you can get. Aim for somewhere around 1GHz per disk.
> 
> > 4) What SSDs for Ceph journals are the best?
> 
> Intel S3700 or P3700 (if you can stretch)
> 
> By all means explore other options, but you can't go wrong by buying these.
> Think "You can't get fired for buying Cisco" quote!!!
> 
> >
> > Thanks a lot!
> >
> > Regards, Vasily.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-16 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Tyler Bishop
> Sent: 16 February 2016 04:20
> To: Василий Ангапов 
> Cc: ceph-users 
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
> 
> We run 4 systems at 56x6tB with dual E5-2660 v2 and 256gb ram.
> Performance is excellent.

Only thing I will say to the OP is that if you only need 1PB, then likely 4-5 
of these will give you enough capacity. Personally I would prefer to spread the 
capacity around more nodes. If you are doing anything serious with Ceph it's 
normally a good idea to try to make each node no more than 10% of total 
capacity. Also, with EC pools you will be limited in the K+M combos you can 
achieve with smaller numbers of nodes. 
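That limitation can be sketched numerically (a toy enumeration of my own, assuming a host-level failure domain so each chunk needs its own node, plus one spare node for re-healing):

```python
# Profiles (k, m) that fit on a given node count when every chunk must
# live on a distinct host and one spare host is kept for recovery.
def feasible_profiles(nodes, max_m=4):
    return [(k, m) for m in range(1, max_m + 1)
            for k in range(2, nodes) if k + m + 1 <= nodes]

print(feasible_profiles(5))             # -> [(2, 1), (3, 1), (2, 2)]
print((6, 3) in feasible_profiles(10))  # -> True: 6+3 wants 10 nodes
```

This matches the original poster's estimate that a 6+3 profile calls for at least 10 nodes to feel comfortable.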

> 
> I would recommend a cache tier for sure if your data is busy for reads.
> 
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
> 
> 
> 
> tyler.bis...@beyondhosting.net
> 
> 
> If you are not the intended recipient of this transmission you are notified
> that disclosing, copying, distributing or taking any action in reliance on the
> contents of this information is strictly prohibited.
> 
> - Original Message -
> From: "Василий Ангапов" 
> To: "ceph-users" 
> Sent: Friday, February 12, 2016 7:44:07 AM
> Subject: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure   Code
> 
> Hello,
> 
> We are planning to build 1PB Ceph cluster for RadosGW with Erasure Code. It
> will be used for storing online videos.
> We do not expect outstanding write performance, something like 200-
> 300MB/s of sequential write will be quite enough, but data safety is very
> important.
> What are the most popular hardware and software recomendations?
> 1) What EC profile is best to use? What values of K/M do you recommend?

The higher you go in total k+m, the more CPU you will require, and sequential 
performance will degrade slightly as the IOs going to the disks get smaller. 
However, larger numbers allow you to be more creative with failure scenarios and 
"replication" efficiency.

> 2) Do I need to use Cache Tier for RadosGW or it is only needed for RBD? Is it

Only needed for RBD, but depending on workload it may still benefit. If you are 
mostly doing large IO's, the gains will be a lot smaller.

> still an overall good practice to use Cache Tier for RadosGW?
> 3) What hardware is recommended for EC? I assume higher-clocked CPUs are
> needed? What about RAM?

Total GHz is more important (i.e. GHz x cores). Go with the cheapest/most 
power-efficient you can get. Aim for somewhere around 1GHz per disk.
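Applied to the dense chassis discussed in this thread, that rule reads as follows (illustrative numbers; the clock and core counts are assumptions for the sketch, not specs quoted anywhere in the thread):

```python
# "~1 GHz per disk": does a dual-socket box keep up with 60 spinners?
disks = 60
sockets, cores_per_socket, base_clock_ghz = 2, 12, 2.5  # hypothetical CPUs
total_ghz = sockets * cores_per_socket * base_clock_ghz
print(total_ghz / disks)  # -> 1.0 GHz per disk, right on the guideline
```

EC pools would push the requirement above that baseline, which is why the heavier 12-18 core parts keep coming up.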

> 4) What SSDs for Ceph journals are the best?

Intel S3700 or P3700 (if you can stretch)

By all means explore other options, but you can't go wrong by buying these. 
Think "You can't get fired for buying Cisco" quote!!!

> 
> Thanks a lot!
> 
> Regards, Vasily.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-15 Thread Tyler Bishop
You should look at a 60 bay 4U chassis like a Cisco UCS C3260.

We run 4 systems at 56x6TB with dual E5-2660 v2 and 256GB RAM.  Performance is 
excellent.

I would recommend a cache tier for sure if your data is busy for reads.

Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited.

- Original Message -
From: "Василий Ангапов" 
To: "ceph-users" 
Sent: Friday, February 12, 2016 7:44:07 AM
Subject: [ceph-users] Recomendations for building 1PB RadosGW with Erasure  
Code

Hello,

We are planning to build a 1PB Ceph cluster for RadosGW with Erasure
Code. It will be used for storing online videos.
We do not expect outstanding write performance; something like
200-300MB/s of sequential write will be quite enough, but data safety
is very important.
What are the most popular hardware and software recommendations?
1) What EC profile is best to use? What values of K/M do you recommend?
2) Do I need to use Cache Tier for RadosGW or it is only needed for
RBD? Is it still an overall good practice to use Cache Tier for
RadosGW?
3) What hardware is recommended for EC? I assume higher-clocked CPUs
are needed? What about RAM?
4) What SSDs for Ceph journals are the best?

Thanks a lot!

Regards, Vasily.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com