[ceph-users] Beginner questions

2020-01-16 Thread Dave Hall

Hello all.

Sorry for the beginner questions...

I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster 
to store some research data.  It is expected that this cluster will grow 
significantly in the next year, possibly to multiple petabytes and 10s 
of nodes.  At this time I'm expecting a relatively small number of 
clients, with only one or two actively writing collected data - albeit 
at a high volume per day.


Currently I'm deploying on Debian 9 via ceph-ansible.

Before I put this cluster into production I have a couple of questions 
based on my experience to date:


Luminous, Mimic, or Nautilus?  I need stability for this deployment, so 
I am sticking with Debian 9 since Debian 10 is fairly new, and I have 
been hesitant to go with Nautilus.  Yet Mimic seems to have had a hard 
road on Debian but for the efforts at Croit.


 * Statements on the Releases page are now making more sense to me, but
   I would like to confirm: is Nautilus the right choice at this time?

Bluestore DB size:  My nodes currently have 8 x 12TB drives (plus 4 
empty bays) and a PCIe NVMe drive.  If I understand the suggested 
calculation correctly, the DB size for a 12 TB Bluestore OSD would be 
480GB.  If my NVMe isn't big enough to provide this size, should I skip 
provisioning the DBs on the NVMe, or should I give each OSD 1/12th of 
what I have available?  Also, should I try to shift budget a bit to get 
more NVMe as soon as I can, and redo the OSDs when sufficient NVMe is 
available?


Thanks.

-Dave

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Beginner questions

2020-01-16 Thread Bastiaan Visser
I would definitely go for Nautilus; quite a few optimizations went in
after Mimic.

Bluestore DB size usually ends up at either 30 or 60 GB.
30 GB is one of the sweet spots during normal operation, but during
compaction Ceph writes the new data before removing the old, hence the
60 GB.
The next sweet spot is 300/600 GB; any size between 60 and 300 GB will
never be fully used.

DB usage also depends on how the cluster is used; object storage is known
to use a lot more DB space than RBD images, for example.
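The sweet-spot numbers above can be turned into a quick sizing check. A minimal sketch, assuming only the 30/60/300/600 GB figures quoted in this thread (these are informal community guidance, not official Ceph documentation), with a hypothetical helper name:

```python
def db_partition_gb(nvme_gb, num_osds):
    """Pick the largest DB sweet spot (in GB) that fits per OSD, or None.

    Sizes between the sweet spots are wasted: RocksDB won't use the extra
    space, per the discussion in this thread.
    """
    per_osd = nvme_gb / num_osds
    for spot in (600, 300, 60, 30):
        if per_osd >= spot:
            return spot
    return None  # NVMe too small to hold even a 30 GB DB per OSD

# A 1.6 TB NVMe shared by 8 OSDs leaves 200 GB each: 60 GB is the
# largest sweet spot that is actually usable.
print(db_partition_gb(1600, 8))   # -> 60
print(db_partition_gb(1600, 12))  # 12 OSDs, ~133 GB each -> still 60
```

Anything between 60 and 300 GB per OSD ends up as unusable headroom, which is why the choice collapses to a handful of discrete sizes.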



Re: [ceph-users] Beginner questions

2020-01-16 Thread Paul Emmerich
Don't use Mimic; support for it is far worse than for Nautilus or Luminous.
I think we were the only company that built a product around Mimic; both
Red Hat's and SUSE's enterprise storage offerings were based on Luminous
and then Nautilus, skipping Mimic entirely.

We only offered Mimic as a default for a limited time and moved to Nautilus
as soon as it became available, and Nautilus + Debian 10 has been great
for us.
Mimic and Debian 9 was... well, hacked together, due to the gcc backport
issues. That's not to say that it doesn't work; in fact, Mimic (> 13.2.2)
on Debian 9 worked perfectly fine for us.

Our Debian 10 + Nautilus packages are just so much better and more stable
than the Debian 9 + Mimic ones because we don't need to do weird things
with Debian. Check the mailing list for my old posts around the Mimic
release to see how we did that build. It's not pretty, but it was the only
way to run Ceph >= Mimic on Debian 9.
All that mess has been eliminated with Debian 10.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90




Re: [ceph-users] Beginner questions

2020-01-16 Thread DHilsbos
Dave;

I'd like to expand on this answer, briefly...

The information in the docs is wrong.  There have been many discussions about 
changing it, but no good alternative has been suggested, thus it hasn't been 
changed.

The 3rd-party project that Ceph's BlueStore uses for its database (RocksDB) 
apparently only makes effective use of DB sizes of 3GB, 30GB, and 300GB.  As 
Bastiaan mentioned, when RocksDB executes a compact operation, it creates a 
new blob of the same target size and writes the compacted data into it, 
doubling the necessary space.  In addition, BlueStore places its Write-Ahead 
Log (WAL) into the fastest storage available to the OSD daemon, i.e. NVMe if 
available.  Since this is done before the first compaction is requested, the 
WAL can force compaction onto slower storage.

Thus, the numbers I've had floating around in my head for our next cluster are: 
7GB, 66GB, and 630GB.  From all the discussion I've seen around RocksDB, those 
seem like good, common sense targets.  Pick the largest one that works for your 
setup.

All that said... You would really want to pair a 600GB+ NVMe with 12TB drives, 
otherwise your DB is almost guaranteed to overflow onto the spinning drive, and 
affect performance.

I became aware of most of this after we planned our clusters, so I haven't 
tried it, YMMV.

One final note: more hosts and more spindles usually translate into better 
cluster-wide performance.  I can't predict how the relatively low client 
counts you're suggesting would affect that.

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com




Re: [ceph-users] Beginner questions

2020-01-16 Thread Dave Hall

Bastiaan,

Regarding EC pools:  Our concern at 3 nodes is that 2-way replication 
seems risky - if the two copies don't match, which one is corrupted?  
However, 3-way replication on a 3-node cluster triples the price per 
TB.  Doing EC pools that are the equivalent of RAID-5 (2+1) seems like 
the right place to start as far as maximizing capacity is concerned, 
although I do understand the potential time involved in rebuilding a 12 
TB drive.  Early on I'd be more concerned about a drive failure than 
about a node failure.
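The capacity trade-off above is simple arithmetic. A minimal sketch, assuming the 288 TB raw figure from the original post and ignoring Ceph's own overhead (DB/WAL space, nearfull ratios):

```python
def usable_tb(raw_tb, k, m):
    """Usable capacity of raw_tb under a scheme with k data shares
    and m redundancy shares (3-way replication is k=1, m=2)."""
    return raw_tb * k / (k + m)

raw = 288  # 3 nodes x 8 x 12 TB
print(usable_tb(raw, 1, 2))  # 3-way replication -> 96.0 TB
print(usable_tb(raw, 2, 1))  # EC 2+1            -> 192.0 TB
```

This is the doubling of usable space that makes 2+1 tempting on 3 nodes; the replies below weigh it against the loss of redundancy.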


Regarding the hardware, our nodes are single socket EPYC 7302 (16 core, 
32 thread) with 128GB RAM.  From what I recall reading I think the RAM, 
at least, is a bit higher than recommended.


Question:  Does a PG (EC or replicated) span multiple drives per node?  
I haven't got to the point of understanding this part yet, so pardon the 
totally naive question.  I'll probably be conversant on this by Monday.


-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 1/16/2020 4:27 PM, Bastiaan Visser wrote:
Dave made a good point - WAL + DB might end up a little over 60G, so I 
would probably go with ~70GiB partitions/LVs per OSD in your case 
(if the NVMe drive is smart enough to spread the writes over all 
available capacity; most recent NVMes are). I have not yet seen a WAL 
larger than, or even close to, a gigabyte.


We don't even think about EC-coded pools on clusters with fewer than 6 
nodes (spindles; full SSD is another story).
EC pools need more processing resources.  We usually settle for 1 GB of 
RAM per TB of storage on replication-only clusters, but when EC pools 
are involved we add at least 50% to that. Also make sure your processors 
are up for it.


Do not base your calculations on a healthy cluster - build to fail.
How long are you willing to be in a degraded state after a node failure? 
Especially with many large spindles, recovery time might be way longer 
than you think. 12 x 12TB is 144TB of storage; on a 4+2 EC pool you 
might end up with over 200 TB of traffic, which on a 10Gbit network 
is roughly two and a half days to recover - if your processors are 
not a bottleneck due to EC parity calculations and all capacity is 
available for recovery (which is usually not the case; there is still 
production traffic that will eat up resources).
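That recovery estimate is easy to sanity-check. A back-of-envelope sketch, assuming the 200 TB traffic figure above and a fully utilized 10 Gbit/s link (which, as noted, is optimistic):

```python
def recovery_hours(data_tb, link_gbit):
    """Hours to move data_tb terabytes over a link_gbit Gbit/s network
    at full line rate (no protocol overhead, no competing traffic)."""
    bytes_total = data_tb * 1e12
    bytes_per_sec = link_gbit / 8 * 1e9  # 10 Gbit/s ~= 1.25 GB/s
    return bytes_total / bytes_per_sec / 3600

# ~44 hours at line rate; with production traffic and EC parity
# computation competing for resources, 2.5+ days is plausible.
print(round(recovery_hours(200, 10)))  # -> 44
```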


On Thu 16 Jan 2020 at 21:30, dhils...@performair.com wrote:


Dave;

I don't like reading inline responses, so...

I have zero experience with EC pools, so I won't pretend to give
advice in that area.

I would think that small NVMe for DB would be better than nothing,
but I don't know.

Once I got the hang of building clusters, it was relatively easy
to wipe a cluster out and rebuild it.  Perhaps you could take some
time, and benchmark different configurations?

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com


-Original Message-
-Original Message-
From: Dave Hall [mailto:kdh...@binghamton.edu]
Sent: Thursday, January 16, 2020 1:04 PM
To: Dominic Hilsbos; ceph-users@lists.ceph.com
Subject: Re: [External Email] RE: [ceph-users] Beginner questions

Dominic,

We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this
worked out to a DB size of something like 163GB per OSD.  Allowing for
expansion to 12 drives brings it down to 124GB.  So maybe just put the
WALs on NVMe and leave the DBs on the platters?

Understood that we will want to move to more nodes rather than more
drives per node, but our funding is grant- and donation-based, so we
may end up adding drives in the short term.  The long-term plan is to
get to separate MON/MGR/MDS nodes and 10s of OSD nodes.

Due to our current low node count, we are considering erasure-coded
PGs rather than replicated in order to maximize usable space.  Any
guidelines or suggestions on this?

Also, sorry for not replying inline.  I haven't done this much in a
while - I'll figure it out.

Thanks.

-Dave


Re: [ceph-users] Beginner questions

2020-01-16 Thread Bastiaan Visser
There is no difference in allocation between replication and EC. If the
failure domain is host, one OSD per host is used for a PG. So if you use a
2+1 EC profile with a host failure domain, you need 3 hosts for a healthy
cluster. The pool will go read-only when you have a failure (host or disk),
or when you are doing maintenance on a node (reboot). On a node failure
there will be no rebuilding, since there is no place to find a 3rd OSD for
a PG, so you'll have to fix/replace the node before any writes are accepted.

So yes, you can do a 2+1 EC pool on 3 nodes, but you are paying the price
in reliability, flexibility, and maybe performance. The only way to really
know the latter is benchmarking with your setup.

I think you will be fine on the hardware side. Memory recommendations swing
between 512MB and 1GB per TB of storage; I usually go with 1 GB, but I
never use disks larger than 4TB. On the CPU side I always try to have a few
more cores than I have OSDs in a machine, so 16 is fine in your case.
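A quick sketch of these rules of thumb (RAM per TB, cores per OSD, plus the ~50% EC surcharge mentioned earlier in the thread); the thresholds are informal community guidance, not Ceph requirements, and the helper name is hypothetical:

```python
def node_fits(osds, tb_per_osd, ram_gb, cores, ec=False, gb_per_tb=1.0):
    """True if a node meets the rules of thumb from this thread:
    gb_per_tb GB of RAM per TB of storage (x1.5 when EC pools are
    involved) and more CPU cores than OSDs."""
    ram_needed = osds * tb_per_osd * gb_per_tb * (1.5 if ec else 1.0)
    return ram_gb >= ram_needed and cores > osds

# Dave's nodes fully populated: 12 x 12 TB, 128 GB RAM, 16 cores.
print(node_fits(12, 12, 128, 16, gb_per_tb=0.5))  # replicated, 0.5 GB/TB -> True
print(node_fits(12, 12, 128, 16, ec=True))        # EC at 1 GB/TB + 50%   -> False
```

At the low end of the RAM recommendation the nodes are fine; at 1 GB/TB with the EC surcharge, 128 GB falls short of the ~216 GB this rule suggests.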



Re: [ceph-users] Beginner questions

2020-01-17 Thread Frank Schilder
I would strongly advise against 2+1 EC pools for production if stability is 
your main concern. There was a discussion towards the end of last year 
addressing this in more detail. Short story: if you don't have at least 8-10 
nodes (in the short run), EC is not suitable - you cannot maintain a cluster 
with such EC pools.

Reasoning: k+1 is a no-go in production. You can set min_size to k, but 
whenever a node is down (maintenance or whatever), new writes are 
non-redundant. Losing just one more disk means data loss. This is not a 
problem with replication x3 and min_size=2. Be aware that maintenance more 
often than not takes more than a day: parts may need to be shipped, an 
upgrade goes wrong and requires lengthy support to fix, etc.

In addition, admins make mistakes. You need to build your cluster so that it 
can survive mistakes (shutting down the wrong host, etc.) in a degraded 
state. Redundancy m=1 means zero tolerance for errors. The recommendation 
therefore is often m=3, while m=2 is the bare minimum. Note that EC 1+2 is 
equal in redundancy to replication x3 but uses more compute (hence it's 
useless). In your situation, I would start with replicated pools and move to 
EC once enough nodes are at hand.
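The k+1 argument boils down to simple counting. A minimal sketch of the redundancy arithmetic (m is the number of redundancy shares: 1 for EC 2+1, 2 for EC 4+2 and for replication x3):

```python
def failures_tolerated(m, hosts_down=0):
    """Further failures a pool with m redundancy shares can survive
    without data loss, given hosts_down hosts already unavailable."""
    return max(m - hosts_down, 0)

# During maintenance on one node:
for label, m in (("EC 2+1", 1), ("EC 4+2", 2), ("replication x3", 2)):
    print(label, failures_tolerated(m, hosts_down=1))
# EC 2+1 drops to 0: losing any disk while a node is down means data loss.
```

This is why m=1 schemes are described above as zero tolerance for errors: routine maintenance alone consumes the entire safety margin.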

If you want to use the benefits of EC, you need to build large clusters. 
Starting with 3 nodes and failure domain disk will be a horrible experience. 
You will not be able to maintain, upgrade or fix anything without downtime.

Plan for sleeping well in worst-case situations.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14



Re: [ceph-users] Beginner questions

2020-01-17 Thread Dave Hall

Frank,

Thank you for your input.  It is good to know that the cluster will go 
read-only if a node goes down.  Our circumstance is probably a bit 
unusual, which is why I'm considering the 2+1 solution.  We have a 
researcher who will be collecting extremely large amounts of data in 
real time, requiring both high write and high read bandwidth, but it's 
pretty much going to be a single user or a small research group.  Right 
now we have 3 physical storage hosts and we need to get into 
production.  We also need to maximize the available storage on these 3 
nodes.


We chose Ceph due to its scalability.  As the research (and the funding) 
progresses we expect to add many more Ceph nodes, and to move the 
MONs/MGRs/MDSs off onto dedicated systems.  At that time I'd likely lay 
out more rational pools and be more thoughtful about resiliency, 
understanding, of course, that I'd have to play games and migrate data 
around.


But for now we have to make the most of the hardware we have.  I'm 
thinking 2+1 because it gives me more usable storage than keeping 2 
copies, and much more than keeping 3 copies.


-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)

