[Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
A while back I was asked to make a blog or something similar to discuss
the Gluster use cases of the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete
this and move on.

I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the past.
Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.



We currently use gluster in two parts of cluster management.

In fact, gluster in our management node infrastructure is helping us to
provide scaling and consistency to some of the largest clusters in the
world, clusters in the TOP100 list. While I can get into trouble by
sharing too much, I will just say that trends are continuing and there
may be some exciting announcements on where certain new giant systems
end up on the TOP100 in the coming 1-2 years.

At HPE, HPCM is the "traditional cluster manager." There is another team
that develops a more cloud-like solution and I am not discussing that
solution here.


Use Case #1: Leader Nodes and Scale Out
--
- Why?
  * Scale out
  * Redundancy (combined with CTDB, any leader can fail)
  * Consistency (All servers and compute agree on what the content is)

- Cluster manager has an admin or head node and zero or more leader nodes

- Leader nodes are provisioned in groups of 3 to use distributed
  replica-3 volumes (although at least one customer has interest
  in replica-5)
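
For reference, a minimal sketch of what a distributed replica-3 volume
across six leaders can look like (hostnames, brick paths, and the volume
name are illustrative, not our exact tooling):

  # leaders must already be gluster peers (gluster peer probe ...)
  gluster volume create cm_shared replica 3 \
      leader1:/data/brick_cm_shared leader2:/data/brick_cm_shared \
      leader3:/data/brick_cm_shared leader4:/data/brick_cm_shared \
      leader5:/data/brick_cm_shared leader6:/data/brick_cm_shared
  gluster volume start cm_shared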

- We configure a few different volumes for different use cases

- We still use Gluster NFS because, over a year ago, Ganesha was not
  working with our workload and we haven't had time to re-test and
  engage with the community. No blame; we would also need to make sure
  our settings are right.

- We use CTDB for a measure of HA and IP alias management. We use this
  instead of pacemaker to reduce complexity.
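
For reference, a bare-bones sketch of the CTDB side (file contents,
addresses, and the interface name are illustrative; the recovery lock
lives on the small shared gluster "ctdb lock" volume listed below):

  # /etc/ctdb/nodes -- one internal leader IP per line
  10.1.0.5
  10.1.0.6
  10.1.0.7

  # /etc/ctdb/public_addresses -- IP aliases CTDB floats across the leaders
  10.1.0.100/24 eth0
  10.1.0.101/24 eth0
  10.1.0.102/24 eth0

  # /etc/ctdb/ctdb.conf -- recovery lock file on the shared lock volume
  [cluster]
      recovery lock = /mnt/ctdb/.ctdb.lock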

- The volume use cases are:
  * Image sharing for diskless compute nodes (sometimes 6,000 nodes)
-> Normally squashfs image files, exported over NFS for speed/efficiency
-> Expanded ("chrootable") traditional NFS trees for people who
   prefer that, but they don't scale as well and are slower to boot
-> Squashfs images sit on a sharded volume while a traditional gluster
   volume is used for the expanded trees (see the option sketch after
   this list)
  * TFTP/HTTP for network boot/PXE including miniroot
-> Spread across leaders too so one node is not saturated with
   PXE/DHCP requests
-> Miniroot is a "fatter initrd" that has our CM toolchain
  * Logs/consoles
-> For traditional logs and consoles (HPCM also uses
   elasticsearch/kafka/friends but we don't put that in gluster)
-> Separate volume so we can use settings friendlier to non-cached access
  * 4 total volumes used (one sharded, one heavily optimized for
caching, one for ctdb lock, and one traditional for logging/etc)
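
As a rough illustration of how those volumes differ (the option names are
real gluster options, but the values and volume names here are examples,
not our exact production settings):

  # sharded volume holding the squashfs image objects
  gluster volume set cm_obj_sharded features.shard on
  gluster volume set cm_obj_sharded features.shard-block-size 64MB

  # keep Gluster NFS (gNFS) enabled on the volumes we export
  gluster volume set cm_shared nfs.disable off

  # caching-heavy volume for the expanded NFS trees
  gluster volume set cm_shared performance.cache-size 8GB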

- Leader Setup
  * Admin node installs the leaders like any other compute nodes
  * A setup tool operates that configures gluster volumes and CTDB
  * When ready, an admin/head node can be engaged with the leaders
  * At that point, certain paths on the admin become gluster fuse mounts
and bind mounts to gluster fuse mounts.

- How images are deployed (squashfs mode)
  * User creates an image using image creation tools that make a
chrootable tree style image on the admin/head node
  * mksquashfs will generate a squashfs image file onto a shared
storage gluster mount
  * Nodes will mount the filesystem with the squashfs images and then
loop mount the squashfs as part of the boot process.
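
A rough sketch of that flow (paths and image names are invented for
illustration; our actual tooling wraps this):

  # on the admin/head node: build the image object onto the shared gluster mount
  mksquashfs /var/lib/images/rhel8-compute \
      /gluster/images/rhel8-compute.squashfs -comp xz -noappend

  # on the compute node during boot (conceptually): NFS-mount the image share
  # from its leader IP alias, then loop-mount the squashfs read-only
  mount -t nfs -o ro,nolock 10.1.0.100:/gluster/images /images
  mount -t squashfs -o loop,ro /images/rhel8-compute.squashfs /rootfs.ro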

- How are compute nodes tied to leaders
  * We simply have a variable in our database where human or automated
discovery tools can assign a given node to a given IP alias. This
works better for us than trying to play routing tricks or load
balance tricks
  * When compute nodes PXE boot, the DHCP response includes next-server,
and the node uses the leader IP alias for TFTP/HTTP to get the boot
loader (illustrated in the sketch after this list). DHCP config files
are on shared storage to facilitate future scaling of DHCP services.
  * ipxe or grub2 network config files then fetch the kernel, initrd
  * initrd has a small update to load a miniroot (install environment)
 which has more tooling
  * Node is installed (for nodes with root disks) or does a network boot
cycle.
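
Purely as an illustration of the next-server handoff, in generic ISC
dhcpd syntax (addresses and paths are made up; this is not our generated
configuration):

  subnet 10.1.0.0 netmask 255.255.255.0 {
    # point the booting node at its assigned leader IP alias
    next-server 10.1.0.100;
    # boot loader pulled over TFTP from the shared-storage tftpboot area
    filename "grub2/grubx64.efi";
  }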

- Gluster sizing
  * We typically state compute nodes per leader but this is not for
gluster per-se. Squashfs image objects are very efficient and
probably would be fine for 2k nodes per leader. Leader nodes provide
other services including console logs, system logs, and monitoring
services.
  * Our biggest deployment at a customer site right now has 24 leader
nodes. Bigger systems are coming.

- Startup scripts - Getting all the gluster mounts and many bind mounts
  used in the solution, as well 

Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Ewen Chan
Erik:

Thank you for sharing your insights into how Gluster is used in a
professional, production environment (or at least how HPE is using it,
internally and/or for your clients).

I really appreciated reading this.

I am just a home lab user and I have a very tiny 4-node micro cluster for
HPC/CAE applications, and where I run into issues with storage is that NAND
flash based SSDs have finite write endurance limits, which gets expensive
for a home lab user.

But I've also tested using tmpfs (allocating half of the RAM per compute node) 
and exporting that as a distributed striped GlusterFS volume over NFS over
RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high 
speed "scratch disk space" that doesn't have the write endurance limits that 
NAND based flash memory SSDs have.

Yes, it isn't as reliable and certainly not highly available (if power goes
down and the battery backup is exhausted, the data is lost because it sat in
RAM), but it's meant to solve the problems that mechanically rotating hard
drives are too slow, NAND flash based SSDs have finite write endurance limits,
and RAM drives, while faster in theory, are also the most expensive on a $/GB
basis compared to the other storage solutions.

It's rather unfortunate that you have these different "tiers" of storage, and
there's really nothing in between that can help address all of these issues
simultaneously.

Thank you for sharing your thoughts.

Sincerely,

Ewen Chan


Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
> - Gluster sizing
>   * We typically state compute nodes per leader but this is not for
> gluster per-se. Squashfs image objects are very efficient and
> probably would be fine for 2k nodes per leader. Leader nodes provide
> other services including console logs, system logs, and monitoring
> services.

I tried to avoid typos and mistakes but I missed something above. Argues
for wiki right? :)  I missed "512" :)

  * We typically state 512 compute nodes per leader but this is not for
gluster per-se. Squashfs image objects are very efficient and
probably would be fine for 2k nodes per leader. Leader nodes provide
other services including console logs, system logs, and monitoring
services.







Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Erik Jacobson
> But I've also tested using tmpfs (allocating half of the RAM per compute node)
> and exporting that as a distributed striped GlusterFS volume over NFS over
> RDMA to the 100 Gbps IB network so that the "ramdrives" can be used as a high
> speed "scratch disk space" that doesn't have the write endurance limits that
> NAND based flash memory SSDs have.

In my world, we leave the high speed networks to jobs so I don't have
much to offer. In our test SU Leader setup where we may not have disks,
we do carve gluster bricks out of TMPFS mounts. However, in that test
case, designed to test the tooling and not the workload, I use iSCSI to
emulate disks to test the true solution.

I will just mention that the cluster manager use of squashfs image
objects sitting on NFS mounts is very fast even on top of 20G (2x10G)
mgmt infrastructure. If you combine it with a TMPFS overlay, which is
our default, you will have a writable area in TMPFS that doesn't
persist, and memory usage will stay low.
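
Conceptually, the overlay looks something like this (paths are
illustrative, not our exact boot scripts):

  # read-only NFS root served from a leader IP alias
  mount -t nfs -o ro,nolock 10.1.0.100:/images/rhel8-compute /rootfs.ro

  # small tmpfs for the writable layer (lost on reboot, low memory use)
  mount -t tmpfs -o size=512m tmpfs /rootfs.rw
  mkdir -p /rootfs.rw/upper /rootfs.rw/work

  # overlay: writes land in tmpfs, reads fall through to the NFS image
  mount -t overlay overlay \
      -o lowerdir=/rootfs.ro,upperdir=/rootfs.rw/upper,workdir=/rootfs.rw/work \
      /newroot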

For a 4-node cluster, you probably don't need to bother with squashfs
even and just mount the directory tree for the image at the right time.

By using tmpfs overlay and some post-boot configuration, you can perhaps
avoid the memory usage of what you are doing. As long as you don't need
to beat the crap out of root, an NFS root is fine and using gluster
backed disks is fine. Note that if you use exported trees with gnfs
instead of image objects, there are lots of volume tweaks you can make
to push efficiency up. For squashfs, I used a sharded volume.

It's easy for me to write this since we have the install environment.
While nothing is "Hard" in there, it's a bunch of code developed over
time. That said, if you wanted to experiment, I can share some pieces of
what we do. I just fear it's too complicated.

I will note that some customers advocate for a tiny root - say 1.5G --
that could fit in TMPFS easily and then attach in workloads (other
filesystems with development environments over the network, or container
environments, etc). That would be another way to keep memory use low for
a diskless cluster.

(we use gnfs because we're not ready to switch to ganesha yet. It's on
our list to move if we can get it working for our load).


Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-19 Thread Ewen Chan
Erik:

My apologies for not being more clear originally.

What I meant to say was that I was using GlusterFS for HPC jobs because my
understanding is that most HPC environments tend to use, for example, NVMe
SSDs for their high speed storage tier, but even those have a finite write
endurance limit.

And whilst, for a large corporation, consuming the write endurance limit of
the NVMe SSDs and replacing them would just be a normal cost of doing
business, for a home lab I can't afford to spend that kind of money whenever
the drives wear out like that.

And this is what drove me to testing a GlusterFS distributed striped volume
exported over NFS over RDMA, so that the RAM was used both for the execution
of the jobs and as the high speed scratch disk space during job execution,
such that it wouldn't be subject to the write endurance limits of NAND flash
SSDs (NVMe or otherwise), nor the significantly slower performance of
mechanically rotating hard disk drives.

So, I was talking about using GlusterFS for HPC as well, but in the context
of job execution rather than the "management" tasks/operations that you
described in your message.

Thank you.

Sincerely,
Ewen



Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Diego Zuccato

Il 19/03/2021 16:03, Erik Jacobson ha scritto:


A while back I was asked to make a blog or something similar to discuss
the use cases the team I work on (HPCM cluster management) at HPE.

Tks for the article.

I just miss a bit of information: how are you sizing CPU/RAM for pods?

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Erik Jacobson
The stuff I work on doesn't use containers much (unlike a different
system also at HPE).

Leaders are over-sized but the sizing largely is associated with all the
other stuff leaders do, not just for gluster. That said, my gluster
settings for the expanded nfs tree (as opposed to squashfs image files on
nfs) method use heavy caching; I believe the max was 8G.

I don't have a recipe, they've just always been beefy enough for
gluster. Sorry I don't have a more scientific answer.







Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Diego Zuccato
Il 22/03/21 14:45, Erik Jacobson ha scritto:

> The stuff I work on doesn't use containers much (unlike a different
> system also at HPE).
By "pods" I meant "glusterd instance", a server hosting a collection of
bricks.

> I don't have a recipe, they've just always been beefy enough for
> gluster. Sorry I don't have a more scientific answer.
Seems that 64GB RAM are not enough for a pod with 26 glusterfsd
instances and no other services (except sshd for management). What do
you mean by "beefy enough"? 128GB RAM or 1TB?

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Gionatan Danti

Il 2021-03-19 16:03 Erik Jacobson ha scritto:

A while back I was asked to make a blog or something similar to discuss
the use cases the team I work on (HPCM cluster management) at HPE.

If you are not interested in reading about what I'm up to, just delete
this and move on.

I really don't have a public blogging mechanism so I'll just describe
what we're up to here. Some of this was posted in some form in the 
past.

Since this contains the raw materials, I could make a wiki-ized version
if there were a public place to put it.


Very interesting post, thank you so much for sharing!

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-22 Thread Erik Jacobson
> > The stuff I work on doesn't use containers much (unlike a different
> > system also at HPE).
> By "pods" I meant "glusterd instance", a server hosting a collection of
> bricks.

Oh ok. The term is overloaded in my world.

> > I don't have a recipe, they've just always been beefy enough for
> > gluster. Sorry I don't have a more scientific answer.
> Seems that 64GB RAM are not enough for a pod with 26 glusterfsd
> instances and no other services (except sshd for management). What do
> you mean by "beefy enough"? 128GB RAM or 1TB?

We are currently using replica-3 but may also support replica-5 in the
future.

So if you had 24 leaders like HLRS, there would be 8 replica-3 subvolumes at
the bottom layer, with data distributed across them (replicated/distributed
volumes).

So we would have 24 leader nodes, each leader would have a disk serving
4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
one is for logs, and one is heavily optimized for non-object expanded
tree NFS). The term "disk" is loose.
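
As a rough picture, the dedicated gluster disk on each leader ends up with
one brick directory per volume (directory names here are illustrative):

  /data/brick_ctdb             # CTDB lock volume (a single lock file)
  /data/brick_cm_obj_sharded   # sharded volume for squashfs image objects
  /data/brick_cm_logs          # logs/consoles volume
  /data/brick_cm_shared        # expanded-tree volume, caching-optimized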

So each SU Leader (or gluster server) serving the 4 volumes in the 8x3
configuration has, in our world, some differences in CPU type, memory, and
storage depending on order and preferences and timing (things always
move forward).

On an SU Leader, we typically do 2 RAID10 volumes with a RAID
controller including cache. However, we have moved to RAID1 in some cases with
better disks. Leaders store a lot of non-gluster stuff on "root" and
then gluster has a dedicated disk/LUN. We have been trying to improve
our helper tools to 100% wheel out a bad leader (say it melted into the
floor) and replace it. Once we have that solid, and because our
monitoring data on the "root" drive is already redundant, we plan to
move newer servers to two NVMe drives without RAID: one for gluster and
one for OS. If a leader melts into the floor, we have a procedure to
discover a new node for that, install the base OS including
gluster/CTDB/etc, and then run a tool to re-integrate it into the
cluster as an SU Leader node again and do the healing. Separately,
monitoring data outside of gluster will heal.
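
The gluster-side steps for that re-integration are roughly the standard
ones (sketch only; our helper tools wrap a lot of checks around this, and
the hostnames/paths are invented -- repeat per volume):

  # after the replacement leader is installed and probed back into the pool
  gluster volume replace-brick cm_shared \
      dead-leader:/data/brick_cm_shared new-leader:/data/brick_cm_shared \
      commit force

  # then let self-heal repopulate the new brick and watch it drain
  gluster volume heal cm_shared full
  gluster volume heal cm_shared info summary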

PS: I will note that I have a mini-SU-leader cluster on my desktop
(qemu/libvirt) for development. It is a 1x3 set of SU Leaders, one head node,
and one compute node. I make an adjustment to reduce the gluster cache to fit
in the memory space. Works fine. Not real fast but good enough for development.


Specs of a leader node at a customer site:
 * 256G RAM
 * Storage: 
   - MR9361-8i controller
   - 7681GB root LUN (RAID1)
   - 15.4 TB for gluster bricks (RAID10)
   - 6 SATA SSD MZ7LH7T6HMLA-5
 * AMD EPYC 7702 64-Core Processor
   - CPU(s):  128
   - On-line CPU(s) list: 0-127
   - Thread(s) per core:  2
   - Core(s) per socket:  64
   - Socket(s):   1
   - NUMA node(s):4
 * Management Ethernet
   - Gluster and cluster management co-mingled
   - 2x40G (but 2x10G would be fine)






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Diego Zuccato
Il 22/03/21 16:54, Erik Jacobson ha scritto:

> So if you had 24 leaders like HLRS, there would be 8 replica-3 at the
> bottom layer, and then distributed across. (replicated/distributed
> volumes)
I still have to grasp the "leader node" concept.
Weren't gluster nodes "peers"? Or by "leader" you mean that it's
mentioned in the fstab entry like
/l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
while the peer list includes l1,l2,l3 and a bunch of other nodes?

> So we would have 24 leader nodes, each leader would have a disk serving
> 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> one is for logs, and one is heavily optimized for non-object expanded
> tree NFS). The term "disk" is loose.
That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
36 bricks per node).

> Specs of a leader node at a customer site:
>  * 256G RAM
Glip! 256G for 4 bricks... No wonder I have had troubles running 26
bricks in 64GB RAM... :)

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786






Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Yaniv Kaul
On Tue, Mar 23, 2021 at 10:02 AM Diego Zuccato 
wrote:

> Il 22/03/21 16:54, Erik Jacobson ha scritto:
>
> > So if you had 24 leaders like HLRS, there would be 8 replica-3 at the
> > bottom layer, and then distributed across. (replicated/distributed
> > volumes)
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?
>
> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).
>
> > Specs of a leader node at a customer site:
> >  * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)
>

If you can recompile Gluster, you may want to experiment with disabling
memory pools - this should save you some memory.
Y.







Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Erik Jacobson
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into a 3x24
replicated/distributed setup for the volumes. They also have entries
for themselves as clients in /etc/fstab. I'll dump some volume info
at the end of this.
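
The client-side fstab entries look roughly like this (volume name and
mount point are illustrative):

  # on leader1 (each leader mounts from itself, with peers as fallback volfile servers)
  leader1:/cm_shared  /gluster/cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0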


> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica3arbiter1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be disk, raid lun, single ssd) and
4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just
for the lock and has a single file.

> 
> > Specs of a leader node at a customer site:
> >  * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more
peers. I had to do a little research and I think what you're after is
if I can run gluster volume status cm_shared mem on a real cluster
that has a decent node count. I will see if I can do that.
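
For reference, the command in question is just:

  # per-brick mallinfo / memory-pool stats for one volume
  gluster volume status cm_shared mem

  # or for every volume at once
  gluster volume status all mem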


TEST ENV INFO for those who care

Here is some info on my own test environment which you can skip.

I have the environment duplicated on my desktop using virtual machines and it
runs fine (slow but fine). It's a 3x1. I take out my giant 8GB cache
from the optimized volumes but other than that it is fine. In my
development environment, the gluster disk is a 40G qcow2 image.

Cache sizes changed from 8G to 100M to fit in the VM.

XML snips for memory, cpus:

  <name>cm-leader1</name>
  <uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
  <memory unit='KiB'>3268608</memory>
  <currentMemory unit='KiB'>3268608</currentMemory>
  <vcpu>2</vcpu>
  ...


I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test
compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new
desktop:

Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
Address sizes:   46 bits physical, 48 bits virtual
CPU(s):  16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):   1
NUMA node(s):1
Vendor ID:   GenuineIntel
CPU family:  6
Model:   79
Model name:  Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:1
CPU MHz: 2594.333
CPU max MHz: 3000.
CPU min MHz: 1200.
BogoMIPS:4190.22
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:20480K
NUMA node0 CPU(s):   0-15



(Not that it matters but this is a HP Z640 Workstation)

128G memory (good for a desktop I know, but I think 64G would work since
I also run windows10 vm environment for unrelated reasons)

I was able to find a MegaRAID in the lab a few years ago and so I have 4
drives in a MegaRAID and carve off a separate volume for the VM disk
images. It has a cache. So that's also more beefy than a normal desktop.
(on the other hand, I have no SSDs. May experiment with that some day
but things work so well now I'm tempted to leave it until something
croaks :)

I keep all VMs for the test cluster with "Unsafe cache mode" since there
is no true data to worry about and it makes the test cases faster.

So I am able to test a complete cluster management stack including
3-leader-gluster servers, an admin, and compute all on my desktop using
virtual machines and shared networks within libvirt/qemu.

It is so much easier to do development when you don't have to reserve
scarce test clusters and compete with people. I can do 90% of my cluster
development work this way. Things fall over when I need to care about
BMCs/ILOs or need to do performance testing of course. Then I move to
real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that the beefy servers are not needed
nor the memory usage high. I'm not continually swapping or anything like
that.




Configuration Info from Real Machine


Some info on an active 3x3 cluster. 2738 compute nodes.

The most active volume here is "cm_obj_sharded". It is where the image
objects live and this cluster uses image objects for compute node root
filesystems. I changed the IP addresses by hand (in case I made an
error doing that).


Memory status for volume : cm_obj_sharded
--
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo

Arena    : 20676608
Ordblks  : 2077
Smblks   : 518
Hblks    : 17
Hblkhd   : 173506

Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Ewen Chan
Erik:

I just want to say that I really appreciate you sharing this information with 
us.

I don't think that my personal home lab micro cluster environment will get
complicated enough to need a virtualized testing/Gluster development setup
like you have, but on the other hand, as I mentioned before, I am running 100
Gbps Infiniband, so what I am trying to do/use Gluster for is quite different
from what and how most people deploy/install Gluster for production systems.

If I wanted to splurge, I'd get a second set of IB cables so that the high 
speed interconnect layer can be split so that jobs will run on one layer of the 
Infiniband fabric whilst storage/Gluster may run on another layer.

But for that, I'll have to revamp my entire microcluster, so there are no plans 
to do that just yet.

Thank you.

Sincerely,
Ewen



Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-03-23 Thread Zeeshan Ali Shah
Just to add on: we are using gluster beside our main Lustre storage for our
k8s cluster.


Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2021-04-05 Thread Olaf Buitelaar
Hi Erik,

Thanks for sharing your unique use-case and solution. It was very
interesting to read your write-up.

I agree with the last point of your use case #1: "* Growing volumes,
replacing bricks, and replacing servers work. However, the process is very
involved and quirky for us. I have."

I do seem to suffer from similar issues where glusterd just doesn't want to
start up correctly the first time; maybe also see
https://lists.gluster.org/pipermail/gluster-users/2021-February/039134.html
for one possible cause of not starting up correctly.

And secondly, indeed, for a "molten to the floor" server it would be great if
gluster, instead of the current replace-brick commands, had something to
replace a complete host: it would just recreate all the bricks it expects on
that host (or better yet only the missing bricks, in case some survived on
another RAID/disk) and heal them all. Right now this process is quite
involved, and sometimes feels a bit like performing black magic.

Best Olaf


Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

2022-08-22 Thread Zeeshan Ali Shah
Hi All,
Adding my two cents: we have two kinds of storage, the first based on SSD
(600 TB) and the second on spinning disks (20 PB).

For the SSD tier I tried GlusterFS and BeeGFS and compared them with Lustre.
BeeGFS failed because of ln (hard link) issues at that time (2019). GlusterFS
was very promising, but somehow Lustre with ZFS showed good performance.

On our 20 PB system we are running Lustre based on ZFS and are happy with it.

I am trying to write a whitepaper on the different parameters to tweak for
Lustre performance, from the disks to ZFS to Lustre itself. This could help
the HPC community, especially in life science.

/Zee
Section head of IT Infrastructure,
Centre for Genomic Medicine, KFSHRC, Riyadh

