Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
Just to add on: we are using Gluster alongside our main storage (Lustre) for a k8s cluster.

On Wed, Mar 24, 2021 at 4:33 AM Ewen Chan wrote:
> Erik:
>
> I just want to say that I really appreciate you sharing this information
> with us.
[...]
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
Erik:

I just want to say that I really appreciate you sharing this information with us.

I don't think my personal home-lab micro cluster will get complicated enough to need a virtualized testing/Gluster development setup like yours, but on the other hand, as I mentioned before, I am running 100 Gbps InfiniBand, so what I am trying to do with Gluster is quite different from what and how most people deploy Gluster for production systems.

If I wanted to splurge, I'd get a second set of IB cables so that the high-speed interconnect layer can be split: jobs would run on one layer of the InfiniBand fabric whilst storage/Gluster runs on another.

But for that I'd have to revamp my entire micro cluster, so there are no plans to do that just yet.

Thank you.

Sincerely,
Ewen

From: gluster-users-boun...@gluster.org on behalf of Erik Jacobson
Sent: March 23, 2021 10:43 AM
To: Diego Zuccato
Cc: gluster-users@gluster.org
Subject: Re: [Gluster-users] Gluster usage scenarios in HPC cluster management

[...]
[Gluster-users] remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
Hi,

I have just configured a gluster volume. I can mount it, and copy and read files. But from time to time, even with no user operations, I get lots of errors in the log files (pasted below). Can anyone please help me figure out what's wrong?

Thanks a lot.
RM

root@srv-31:~# tail -f /var/log/glusterfs/glusterd.log /var/log/glusterfs/glustershd.log /var/log/glusterfs/bricks/data-glusterfs-ssd-brick[12]-brick.log

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick1-brick.log <==
[2021-03-19 11:45:09.525079 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]
[2021-03-19 11:45:09.525212 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=11817}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:4b47408f-323c-4c6a-9a20-2ae2a3a2cdb8-GRAPH_ID:3-PID:2291-HOST:srv-32-PC_NAME:ssd-volume-client-0-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick2-brick.log <==
[2021-03-19 11:45:13.525581 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]
[2021-03-19 11:45:13.525711 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=10055}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:4b47408f-323c-4c6a-9a20-2ae2a3a2cdb8-GRAPH_ID:3-PID:2291-HOST:srv-32-PC_NAME:ssd-volume-client-3-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick1-brick.log <==
[2021-03-19 11:45:38.829803 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=11820}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:974f4637-64ef-42e6-afad-1dc9c67c4a43-GRAPH_ID:3-PID:2062-HOST:srv-33-PC_NAME:ssd-volume-client-0-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick2-brick.log <==
[2021-03-19 11:45:38.829970 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=10058}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:974f4637-64ef-42e6-afad-1dc9c67c4a43-GRAPH_ID:3-PID:2062-HOST:srv-33-PC_NAME:ssd-volume-client-3-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick1-brick.log <==
[2021-03-19 11:45:38.829721 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick2-brick.log <==
[2021-03-19 11:45:38.829883 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]
[2021-03-19 11:45:50.005995 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]
[2021-03-19 11:45:50.006115 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=10082}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:4945546f-f368-4fa7-8bfc-3dd7abda5d1b-GRAPH_ID:3-PID:2486-HOST:srv-31-PC_NAME:ssd-volume-client-3-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/bricks/data-glusterfs-ssd-brick1-brick.log <==
[2021-03-19 11:45:50.006096 +0000] E [MSGID: 113002] [posix-entry-ops.c:682:posix_mkdir] 0-ssd-volume-posix: gfid is null for (null) [Invalid argument]
[2021-03-19 11:45:50.006212 +0000] E [MSGID: 115056] [server-rpc-fops_v2.c:497:server4_mkdir_cbk] 0-ssd-volume-server: MKDIR info [{frame=11844}, {MKDIR_path=}, {uuid_utoa=00000000-0000-0000-0000-000000000001}, {bname=}, {client=CTX_ID:4945546f-f368-4fa7-8bfc-3dd7abda5d1b-GRAPH_ID:3-PID:2486-HOST:srv-31-PC_NAME:ssd-volume-client-0-RECON_NO:-0}, {error-xlator=ssd-volume-posix}, {errno=22}, {error=Invalid argument}]

==> /var/log/glusterfs/glustershd.log <==
[2021-03-19 11:45:50.006255 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 3-ssd-volume-client-3: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-19 11:45:50.006352 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:client4_0_mkdir_cbk] 3-ssd-volume-client-0: remote operation failed. [{path=(null)}, {errno=22}, {error=Invalid argument}]
[2021-03-19 11:45:50.006408 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:214:cl
[Gluster-users] Updated invitation: Gluster Community Meeting @ Tue Mar 23, 2021 2:30pm - 3:30pm (IST) (gluster-users@gluster.org)
[ICS calendar attachment: Gluster Community Meeting, Tue Mar 23, 2021, 14:30-15:30 IST; organizer sajmo...@redhat.com; bridge updated to Google Meet: meet.google.com/cpu-eiue-hvk; invitees include gluster-users@gluster.org and gluster-de...@gluster.org. Full attendee list omitted.]
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
> I still have to grasp the "leader node" concept.
> Weren't gluster nodes "peers"? Or by "leader" you mean that it's
> mentioned in the fstab entry like
> /l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0
> while the peer list includes l1,l2,l3 and a bunch of other nodes?

Right, it's a list of 24 peers. The 24 peers are split into a replicated/distributed (8 x 3 = 24) setup for the volumes. They also have entries for themselves as clients in /etc/fstab. I'll dump some volume info at the end of this.

> > So we would have 24 leader nodes, each leader would have a disk serving
> > 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> > one is for logs, and one is heavily optimized for non-object expanded
> > tree NFS). The term "disk" is loose.
> That's a system way bigger than ours (3 nodes, replica 3 arbiter 1, up to
> 36 bricks per node).

I have one dedicated "disk" (could be a disk, a RAID LUN, or a single SSD) and 4 directories for volumes ("bricks"). Of course, the "ctdb" volume is just for the lock and has a single file.

> > Specs of a leader node at a customer site:
> > * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

I'm not an expert in memory pools or how they would be impacted by more peers. I had to do a little research, and I think what you're after is whether I can run "gluster volume status cm_shared mem" on a real cluster that has a decent node count. I will see if I can do that.

TEST ENV INFO for those who care

Here is some info on my own test environment, which you can skip.

I have the environment duplicated on my desktop using virtual machines and it runs fine (slow, but fine). It's a 3x1. I take out my giant 8GB cache from the optimized volumes, but other than that it is fine. In my development environment, the gluster disk is a 40G qcow2 image.

Cache sizes changed from 8G to 100M to fit in the VM.

XML snips for memory, cpus:

<name>cm-leader1</name>
<uuid>99d5a8fc-a32c-b181-2f1a-2929b29c3953</uuid>
<memory>3268608</memory>
<currentMemory>3268608</currentMemory>
<vcpu>2</vcpu>
..

I have 1 admin (head) node VM, 3 VM leader nodes like above, and one test compute node for my development environment.

My desktop where I test this cluster stack is a beefy but not brand new desktop:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:            1
CPU MHz:             2594.333
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4190.22
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-15

(Not that it matters, but this is an HP Z640 Workstation.)

128G memory (good for a desktop, I know, but I think 64G would work since I also run a Windows 10 VM environment for unrelated reasons).

I was able to find a MegaRAID in the lab a few years ago, so I have 4 drives in a MegaRAID and carve off a separate volume for the VM disk images. It has a cache, so that's also more beefy than a normal desktop. (On the other hand, I have no SSDs. I may experiment with that some day, but things work so well now I'm tempted to leave it until something croaks. :)

I keep all VMs for the test cluster in "unsafe" cache mode since there is no true data to worry about and it makes the test cases faster.
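To make the volume layout Erik describes at the top of this message concrete: a distributed-replicate volume across 24 leaders could be created roughly as follows. This is only a sketch, not Erik's actual commands; the hostnames, brick paths, and mount options are invented placeholders:

    # 24 bricks with replica 3: each consecutive group of three bricks forms
    # a replica set, giving 8 replica-3 subvolumes distributed across.
    gluster volume create cm_shared replica 3 \
        $(for i in $(seq 1 24); do printf 'leader%d:/data/brick_cm_shared ' "$i"; done)
    gluster volume start cm_shared

A client fstab entry naming one volfile server plus fallbacks (similar in spirit to the /l1,l2,l3:gv0 line quoted above) might then look like:

    leader1:/cm_shared  /cm_shared  glusterfs  defaults,_netdev,backup-volfile-servers=leader2:leader3  0 0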
So I am able to test a complete cluster management stack, including 3 leader (gluster) servers, an admin node, and a compute node, all on my desktop using virtual machines and shared networks within libvirt/qemu.

It is so much easier to do development when you don't have to reserve scarce test clusters and compete with people. I can do 90% of my cluster development work this way. Things fall over when I need to care about BMCs/iLOs or need to do performance testing, of course. Then I move to real hardware and play the hunger-games-of-internal-test-resources :) :)

I mention all this just to show that beefy servers are not needed, nor is the memory usage high. I'm not continually swapping or anything like that.

Configuration Info from Real Machine

Some info on an active 3x3 cluster with 2738 compute nodes. The most active volume here is "cm_obj_sharded". It is where the image objects live, and this cluster uses image objects for compute node root filesystems. I changed the IP addresses by hand (in case I made an error doing that).

Memory status for volume : cm_obj_sharded
--
Brick : 10.1.0.5:/data/brick_cm_obj_sharded
Mallinfo
Arena   : 20676608
Ordblks : 2077
Smblks  : 518
Hblks   : 17
Hblkhd  : 173506
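For reference, the per-brick Mallinfo dump above is the kind of output Gluster's status subcommand prints. Based on the command Erik mentions earlier in the thread, the invocation would be along the lines of:

    # per-brick memory statistics (mallinfo and mem-pool usage) for a volume
    gluster volume status cm_obj_sharded mem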
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
On Tue, Mar 23, 2021 at 10:02 AM Diego Zuccato wrote:

[...]
> > Specs of a leader node at a customer site:
> > * 256G RAM
> Glip! 256G for 4 bricks... No wonder I have had troubles running 26
> bricks in 64GB RAM... :)

If you can recompile Gluster, you may want to experiment with disabling memory pools - this should save you some memory.

Y.
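For anyone who wants to try Yaniv's suggestion, here is a minimal build sketch. It assumes the GF_DISABLE_MEMPOOL compile-time switch is present in the release being built (check libglusterfs/src/mem-pool.c in your source tree first; the define and the exact configure invocation may differ between versions):

    # rebuild glusterfs with memory pools compiled out (hedged sketch)
    git clone https://github.com/gluster/glusterfs.git
    cd glusterfs
    ./autogen.sh
    CFLAGS="-O2 -DGF_DISABLE_MEMPOOL" ./configure
    make -j"$(nproc)"
    sudo make install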
Re: [Gluster-users] Gluster usage scenarios in HPC cluster management
On 22/03/21 16:54, Erik Jacobson wrote:
> So if you had 24 leaders like HLRS, there would be 8 replica-3 at the
> bottom layer, and then distributed across. (replicated/distributed
> volumes)

I still have to grasp the "leader node" concept. Weren't gluster nodes "peers"? Or by "leader" do you mean that it's mentioned in the fstab entry like

/l1,l2,l3:gv0 /mnt/gv0 glusterfs defaults 0 0

while the peer list includes l1, l2, l3 and a bunch of other nodes?

> So we would have 24 leader nodes, each leader would have a disk serving
> 4 bricks (one of which is simply a lock FS for CTDB, one is sharded,
> one is for logs, and one is heavily optimized for non-object expanded
> tree NFS). The term "disk" is loose.

That's a system way bigger than ours (3 nodes, replica 3 arbiter 1, up to 36 bricks per node).

> Specs of a leader node at a customer site:
> * 256G RAM

Glip! 256G for 4 bricks... No wonder I have had troubles running 26 bricks in 64GB RAM... :)

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users
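A note on the "replica 3 arbiter 1" shorthand in Diego's message: in that layout the third brick of each replica set is an arbiter that stores only file names and metadata, acting as a quorum tie-breaker without consuming data-sized disk space. A minimal sketch of creating such a volume (hostnames and brick paths invented):

    # two data bricks plus one arbiter brick per replica set
    gluster volume create gv_arb replica 3 arbiter 1 \
        node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/arb1
    gluster volume start gv_arb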