Re: [ceph-users] OSD node memory sizing

Christian Balzer Thu, 19 May 2016 04:27:55 -0700

Hello,

On Thu, 19 May 2016 10:51:20 +0200 Dietmar Rieder wrote:


> Hello,
> 
> On 05/19/2016 03:36 AM, Christian Balzer wrote:
> > 
> > Hello again,
> > 
> > On Wed, 18 May 2016 15:32:50 +0200 Dietmar Rieder wrote:
> > 
> >> Hello Christian,
> >>
> >>> Hello,
> >>>
> >>> On Wed, 18 May 2016 13:57:59 +0200 Dietmar Rieder wrote:
> >>>
> >>>> Dear Ceph users,
> >>>>
> >>>> I've a question regarding the memory recommendations for an OSD
> >>>> node.
> >>>>
> >>>> The official Ceph hardware recommendations say that an OSD node
> >>>> should have 1GB Ram / TB OSD [1]
> >>>>
> >>>> The "Reference Architecture" whitepaper from Red Hat & Supermicro
> >>>> says that "typically" 2GB of memory per OSD on an OSD node is used.
> >>>> [2]
> >>>>
> >>> This question has been asked and answered here countless times.
> >>>
> >>> Maybe something a bit more detailed ought to be placed in the first
> >>> location, or simply a reference to the 2nd one. 
> >>> But then again, that would detract from the RH added value.
> >>
> >> thanks for replying, nonetheless.
> >> I checked the list before but I failed to find a definitive answer,
> >> maybe I was not looking hard enough. Anyway, thanks!
> >>
> > They tend to be hidden in other threads sometimes, but there really are a
> > lot.
> 
> It seems so, have to dig deeper into the available discussions...
>
See the recent thread "journal or cache tier on SSDs ?" started by
another academic, slightly to your west for some insights, more below.

> > 
> >>>  
> >>>> According to the recommendation in [1] an OSD node with 24x 8TB OSD
> >>>> disks is "underpowered" when it is equipped with 128GB of RAM.
> >>>> However, following the "recommendation" in [2] 128GB should be
> >>>> plenty enough.
> >>>>
> >>> It's fine per se, the OSD processes will not consume all of that even
> >>> in extreme situations.
> >>
> >> Ok, if I understood this correctly, then 128GB should be enough also
> >> during rebalancing or backfilling.
> >>
> > Definitely, but realize that during this time of high memory
> > consumption caused by backfilling your system is also under strain from
> > objects moving in and out, so as per the high-density thread you will
> > want all your dentry and other important SLAB objects to stay in RAM.
> > 
> > That's a lot of objects potentially with 8TB, so when choosing DIMMs
> > pick ones that leave you with the option to go to 256GB later if need
> > be.
> 
> Good point, I'll keep this in mind
> 
> > 
> > Also you'll probably have loads of fun playing with CRUSH weights to
> > keep the utilization of these 8TB OSDs within 100GB of each other. 
> 
> I'm afraid that finding the "optimal" settings will demand a lot of
> testing/playing
> 

Optimal settings are another topic; this is just making tiny adjustments to
your CRUSH weights so that the OSDs stay within a few percent of usage of
each other.
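
To illustrate the kind of nudging meant here, a minimal sketch. It is not a
Ceph tool; the function name, the damping factor and the sample numbers are
all made up for the example. It only proposes new weights, lowering them a
little for OSDs that are fuller than average and raising them for emptier
ones.

```python
def suggest_crush_weights(weights, utilization, damping=0.5):
    """Propose slightly adjusted CRUSH weights (illustrative only).

    weights:     {osd_id: current CRUSH weight}
    utilization: {osd_id: fraction of the OSD in use, 0.0 to 1.0}
    damping:     hypothetical factor below 1 so each step stays small
    """
    mean_util = sum(utilization.values()) / len(utilization)
    if mean_util == 0:
        # Empty cluster: nothing to balance, keep the weights as they are.
        return dict(weights)

    suggestions = {}
    for osd, weight in weights.items():
        # Relative deviation from the mean; positive means fuller than average.
        deviation = (utilization[osd] - mean_util) / mean_util
        # Fuller OSDs get a slightly lower weight, emptier ones a higher one.
        suggestions[osd] = round(weight * (1.0 - damping * deviation), 3)
    return suggestions


# Made-up sample: three 8TB OSDs with the same weight but uneven usage.
current = {0: 7.27, 1: 7.27, 2: 7.27}
used = {0: 0.70, 1: 0.66, 2: 0.62}
print(suggest_crush_weights(current, used))
```

A suggested value would then be applied by hand with
"ceph osd crush reweight osd.N <weight>", one small step at a time, checking
utilization again after the data has moved.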

> > 
> >>>
> >>> Very large OSDs and high density storage nodes have other issues and
> >>> challenges, tuning and memory wise.
> >>> There are several threads about these recently, including today.
> >>
> >> Thanks, I'll study these...
> >>
> >>>> I'm wondering which of the two is good enough for a Ceph cluster
> >>>> with 10 nodes using EC (6+3)
> >>>>
> >>> I would spend more time pondering about the CPU power of these
> >>> machines (EC needs more) and what cache tier to get.
> >>
> >> We are planning to equip the OSD nodes with 2x2650v4 CPUs (24 cores @
> >> 2.2GHz), that is 1 core/OSD. For the cache tier each OSD node gets two
> >> 800GB NVMes. We hope this setup will give reasonable performance with
> >> EC.
> >>
> > So you have actually 26 OSDs per node then.
> > I'd say the CPUs are fine, but EC and the NVMes will eat a fair share
> > of it.
> 
> You're right, it is 26 OSDs, but still I assume that with these CPUs we
> will not be completely underpowered.
>
Since you have now stated your use case, I'll say the same; I would be less
sure if this were to be the storage for lots of high-IOPS VMs.
 
> > That's why I prefer to have dedicated cache tier nodes with fewer but
> > faster cores, unless the cluster is going to be very large.
> > With Hammer a 800GB DC S3160 SSD based OSD can easily saturate a 
> > "E5-2623 v3" core @3.3GHz (nearly 2 cores to be precise) and Jewel has
> > optimizations that will both make it faster by itself AND enable it to
> > use more CPU resources as well.
> > 
> 
> That's probably the best solution, but it will not fit within our budget
> and rackspace limits for the first setup. However, when expanding later
> on it will definitely be something to consider, also depending on the
> performance that we obtain with this first setup.
> 
Well, if you're gonna grow this cluster your shared setup will become more
and more effective (but still remain harder to design/specify just right).

> > The NVMes (DC P3700 one presumes?) just for cache tiering, no SSD
> > journals for the OSDs?
> 
> For now we have an offer for HPE 800GB NVMe MU (mixed use), 880MB/s
> write 2600MB/s read, 3 DW/D. So they are as fast as the DC P3700; we will
> probably also check other options.
> 
Hmm, they seem not to be re-branded Intels then.
And they're closer to the P3600s, which are actually slightly faster in
sequential writes and a bit slower in write IOPS.

Note that if these were for journals, I'd definitely go for the Intels as
they are 120MB/s faster in writes and that is all that counts for journals.

With a cache tier, your journals will be (most likely) on the same device,
so it matters as well, but not quite as much.

And while I have a cache tier based on 800GB DC S3600s with 3 DWPD, I also
have a very stable and predictable write load in my use case. 

Read the thread mentioned above and consider the impact of huge writes on
your cluster and cache tier life expectancy.
As I wrote in that thread, clever configuration of Jewel can avoid
needless cache promotion and thus SSD wearout. 
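
A back-of-the-envelope sketch of that life expectancy, under loud
assumptions: the DWPD rating is taken to apply over a 5 year period (check
the data sheet of the actual device), write amplification is ignored, and
the daily write volumes below are invented for illustration.

```python
def years_until_worn_out(capacity_gb, dwpd, rated_years, daily_writes_gb):
    """Estimate how long an SSD lasts at a given daily write volume.

    capacity_gb:     device capacity in GB
    dwpd:            rated drive writes per day
    rated_years:     period the DWPD rating applies to (assumed, see above)
    daily_writes_gb: actual data written to the device per day, in GB
    """
    # Total write volume the device is rated for over the rated period.
    rated_total_gb = capacity_gb * dwpd * 365 * rated_years
    # Years until that budget is used up at the actual write rate.
    return rated_total_gb / (daily_writes_gb * 365)


# 800GB device at 3 DWPD: the rated budget is 2400 GB of writes per day.
for daily_gb in (1200, 2400, 4800):  # hypothetical write loads
    years = years_until_worn_out(800, 3, 5, daily_gb)
    print(f"{daily_gb} GB/day -> about {years:.1f} years")
```

Every needless promotion into the cache tier counts against that daily
budget, which is why a bursty or very large write load changes the picture
compared to a stable one.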
 
> > What are your network plans then, as in is your node storage bandwidth
> > a good match for your network bandwidth? 
> >
> 
> As network we will have 2x10GBit bonded cluster internal and 2x10GBit
> bonded towards the clients, 1GBit for administration
>
Sounds fine.
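
For reference, the kind of rough comparison behind that question, as a
sketch. The 880MB/s NVMe write figure is the one quoted above; the 150MB/s
per HDD is purely an assumed sequential rate, and protocol overhead,
replication/EC traffic and the fact that a bond rarely gives a single flow
the full 2x are all ignored.

```python
def network_mb_per_s(links, gbit_per_link):
    """Theoretical bonded bandwidth in MB/s (1 Gbit/s = 125 MB/s)."""
    return links * gbit_per_link * 125.0


def storage_mb_per_s(devices):
    """Sum of sequential rates; devices is a list of (count, MB/s each)."""
    return sum(count * rate for count, rate in devices)


net = network_mb_per_s(2, 10)           # 2x10GBit bonded
nvme = storage_mb_per_s([(2, 880)])     # two NVMes, quoted write speed
hdds = storage_mb_per_s([(24, 150)])    # 24 HDDs, assumed 150MB/s each

print(f"network: {net:.0f} MB/s")
print(f"NVMe writes: {nvme:.0f} MB/s, HDDs sequential: {hdds:.0f} MB/s")
```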
 
> 
> >>> That is, if performance is a requirement in your use case.
> >>
> >> Always, who wouldn't care about performance?  :-)
> >>
> > "Good enough" sometimes really is good enough.
> > 
> > Since you're going for 8TB OSDs, EC and 10 nodes it feels that for you
> > space is important, so something like archival, not RBD images for high
> > performance VMs.
> > 
> > What is your use case?
> 
> 
> You're right, space is most important. Our use case is not serving RBD
> for VMs.
> We will mainly store genomic data on cephfs volumes and access it from a
> computing cluster
> for analysis. This computing cluster is not very large (will grow); right
> now it consists of 6 nodes and 288 cores.
> 
Right, I guessed as much from the Bioinformatics bit.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/