Re: [ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-05-25 Thread Tyler Phillippe via Users
Hey all,

Finally got some time to check this on our servers and to build a separate test 
cluster, and I found the issue, no debugging required. It turns out we were using 
IP addresses instead of names in corosync.conf. I replicated that on the 
separate test cluster and saw the exact same behaviour. Thanks for all the 
support!! I really appreciate it!
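
For reference, a minimal nodelist sketch using names rather than raw IPs (the hostnames and node IDs below are placeholders; each name must be resolvable from /etc/hosts at boot):

```
nodelist {
    node {
        ring0_addr: node1.example.com   # resolvable from /etc/hosts, not a raw IP
        name: node1
        nodeid: 1
    }
    node {
        ring0_addr: node2.example.com
        name: node2
        nodeid: 2
    }
}
```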

Respectfully,
 Tyler Phillippe



Apr 25, 2023, 3:23 AM by jfrie...@redhat.com:

> On 24/04/2023 22:16, Tyler Phillippe via Users wrote:
>
>> Hello all,
>>
>> We are currently using RHEL9 and have set up a PCS cluster. When restarting 
>> the servers, we noticed Corosync 3.1.5 doesn't start properly with the below 
>> error message:
>>
>> Parse error in config: No valid name found for local host
>> Corosync Cluster Engine exiting with status 8 at main.c:1445.
>> Corosync.service: Main process exited, code=exited, status=8/n/a
>>
>> These are physical, blade machines that are using a 2x Fibre Channel NIC in 
>> a Mode 6 bond as their networking interface for the cluster; other than 
>> that, there is really nothing special about these machines. We have ensured 
>> the names of the machines exist in /etc/hosts and that they can resolve 
>> those names via the hosts file first. The strange
>>
>
> This is really weird. All of the described symptoms point to the name service 
> (DNS/NIS/...) not being available during boot and only becoming available 
> later. But if /etc/hosts really contains static entries, it should just work.
>
> Could you please try to set debug: trace in corosync.conf like
> ```
> ...
> logging {
>  to_syslog: yes
>  to_stderr: yes
>  timestamp: on
>  to_logfile: yes
>  logfile: /var/log/cluster/corosync.log
>
>  debug: trace
> }
> ...
> ```
>
> and observe very beginning output of corosync (either in syslog or in 
> /var/log/cluster/corosync.log)? There should be something like
>
> totemip_parse: IPv4 address of NAME resolved as IPADDR
>
> Also compare the difference between corosync started on boot and later after 
> multi-user.target.
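
One way to make that comparison (a sketch; node1 stands in for the local node name):

```
# Confirm the name resolves from /etc/hosts even before DNS is reachable
getent hosts node1

# Pull corosync's earliest messages from the current boot and look for the resolution step
journalctl -b -u corosync -o short-precise | grep -i -e totemip_parse -e 'no valid name'
```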
>
>> thing is if we start Corosync manually after we can SSH into the machines, 
>> Corosync starts immediately and without issue. We did manage to get Corosync 
>> to autostart properly by modifying the service file and changing the 
>> After=network-online.target to After=multi-user.target. In doing this, at 
>> first, Pacemaker complains about mismatching dependencies in the service 
>> between Corosync and Pacemaker. Changing the Pacemaker service to 
>> After=multi-user.target fixes that self-caused issue. Any ideas on this one? 
>> Mostly checking to see if changing the After dependency will harm us in the 
>> future.
>
> That's questionable. It's always best if the resolver uses /etc/hosts reliably, 
> which is not the case now, so IMHO it's better to find the reason why /etc/hosts 
> doesn't work rather than "work around" it.
>
> Regards,
>  Honza
>
>>
>> Thanks!
>>
>> Respectfully,
>>   Tyler Phillippe
>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Corosync 3.1.5 Fails to Autostart

2023-04-24 Thread Tyler Phillippe via Users
Hello all,

We are currently using RHEL9 and have set up a PCS cluster. When restarting the 
servers, we noticed Corosync 3.1.5 doesn't start properly with the below error 
message:

Parse error in config: No valid name found for local host
Corosync Cluster Engine exiting with status 8 at main.c:1445.
Corosync.service: Main process exited, code=exited, status=8/n/a

These are physical, blade machines that are using a 2x Fibre Channel NIC in a 
Mode 6 bond as their networking interface for the cluster; other than that, 
there is really nothing special about these machines. We have ensured the names 
of the machines exist in /etc/hosts and that they can resolve those names via 
the hosts file first. The strange thing is if we start Corosync manually after 
we can SSH into the machines, Corosync starts immediately and without issue. We 
did manage to get Corosync to autostart properly by modifying the service file 
and changing the After=network-online.target to After=multi-user.target. In 
doing this, at first, Pacemaker complains about mismatching dependencies in the 
service between Corosync and Pacemaker. Changing the Pacemaker service to 
After=multi-user.target fixes that self-caused issue. Any ideas on this one? 
Mostly checking to see if changing the After dependency will harm us in the 
future.
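
A sketch of that ordering change expressed as a systemd drop-in, rather than editing the packaged unit file directly (illustrative only; the empty After= resets the packaged list, and the same kind of override would be needed for pacemaker.service):

```
# /etc/systemd/system/corosync.service.d/override.conf
# (e.g. created with: systemctl edit corosync.service, or placed manually
#  followed by: systemctl daemon-reload)
[Unit]
After=
After=multi-user.target
```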

Thanks!

Respectfully,
 Tyler Phillippe
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Filesystem Resource Device Naming Convention

2023-04-21 Thread Tyler Phillippe via Users
LVM (currently) isn't an option for us since most of the team is unfamiliar 
with it. We use Puppet to push out multipath.conf and are trying to guard 
against a badly written or changed config file being pushed to the PCS servers 
- that's what I meant by corruption, more so than actual bit corruption. The 
thinking was that if the Filesystem resource pointed to the WWID, which can only 
change on the SAN box, then even if multipath.conf were wrong or the aliases 
changed, the resource wouldn't know/care/fail.
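
For illustration, that would look something like the following (a sketch; the resource name, WWID, mount point, and filesystem type are placeholders):

```
pcs resource create nfs_data_fs ocf:heartbeat:Filesystem \
    device=/dev/disk/by-id/scsi-36001405abcdef0123456789abcdef012 \
    directory=/srv/nfs/data fstype=xfs \
    op monitor interval=20s
```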

Thanks!!

Respectfully,
 Tyler Phillippe



Apr 20, 2023, 9:36 PM by nw...@redhat.com:

> On Thu, Apr 20, 2023 at 1:49 PM Tyler Phillippe via Users
>  wrote:
>
>>
>> Hello all,
>>
>> In my position, we are running several PCS clusters that host NFS shares and 
>> their backing disks are SAN LUNs. We have been using the 
>> /dev/mapper/ name as the actual device when defining a PCS 
>> Filesystem resource; however, it was brought up that potentially the 
>> multipath configuration file could be corrupted in any number of accidental 
>> ways. It was then proposed to use the actual SCSI WWID as the device, under 
>> /dev/disk/by-id/scsi-. There has been discussion back and forth on 
>> which is better - mostly from a peace of mind perspective. I know Linux has 
>> changed a lot and mounting disks by WWID/UUID may not strictly be necessary 
>> any more, but I was wondering what is preferred, especially as nodes are 
>> added to the cluster and more people are brought on to the team. Thanks all!
>>
>
> I almost always see users configure LVM logical volumes (whose volume
> groups are managed by LVM-activate resources) as the device for
> Filesystem resources, unless they're mounting an NFS share.
>
> I'm not aware of the ways that the multipath config file could become
> corrupted (aside from generalized data corruption, which is a much
> larger problem). It seems fairly unlikely, but I'm open to other
> perspectives.
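
For context, the pattern described above usually looks something like this (a sketch; the volume group, logical volume, group name, and mount point are placeholders, and vg_access_mode depends on how the shared VG is protected):

```
pcs resource create data_vg ocf:heartbeat:LVM-activate \
    vgname=shared_vg vg_access_mode=system_id --group nfs_group
pcs resource create data_fs ocf:heartbeat:Filesystem \
    device=/dev/shared_vg/data_lv directory=/srv/nfs/data fstype=xfs \
    --group nfs_group
```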
>
>> Respectfully,
>>  Tyler Phillippe
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
>
>
> -- 
> Regards,
>
> Reid Wahl (He/Him)
> Senior Software Engineer, Red Hat
> RHEL High Availability - Pacemaker
>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] Filesystem Resource Device Naming Convention

2023-04-20 Thread Tyler Phillippe via Users
Hello all,

In my position, we are running several PCS clusters that host NFS shares and 
their backing disks are SAN LUNs. We have been using the 
/dev/mapper/ name as the actual device when defining a PCS 
Filesystem resource; however, it was brought up that potentially the multipath 
configuration file could be corrupted in any number of accidental ways. It was 
then proposed to use the actual SCSI WWID as the device, under 
/dev/disk/by-id/scsi-. There has been discussion back and forth on which 
is better - mostly from a peace of mind perspective. I know Linux has changed a 
lot and mounting disks by WWID/UUID may not strictly be necessary any more, but 
I was wondering what is preferred, especially as nodes are added to the cluster 
and more people are brought on to the team. Thanks all!

Respectfully,
 Tyler Phillippe
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD Dual Primary Write Speed Extremely Slow

2022-11-14 Thread Tyler Phillippe via Users
Actually, I think I just figured it out - it's the Microsoft Continuous 
Availability. It slows everything down to a crawl - sounds like that's almost 
by design for some reason.

Thanks!

Respectfully,
 Tyler



Nov 14, 2022, 9:48 AM by tylerphilli...@tutamail.com:

> Hi Vladislav,
>
> If I don't use the Scale-Out File Server, I don't have any issues with iSCSI 
> speeds: if I directly connect the LUN(s) to the individual servers, I get 
> 'full' speed - it just seems the Scale-Out File Server is causing the issue. 
> Very strange of Microsoft (though not totally unexpected!). I can't find too 
> much online about Scale-Out File Server, other than generic setup information.
>
> Thanks!
>
> Respectfully,
>  Tyler
>
>
>
> Nov 14, 2022, 9:20 AM by bub...@hoster-ok.com:
>
>> Hi
>>
>> On Mon, 2022-11-14 at 15:00 +0100, Tyler Phillippe via Users wrote:
>>
>>> Good idea! I setup a RAM disk on both of those systems, let them sync, 
>>> added it to the cluster. 
>>>
>>> One thing I left out (which didn't hit me until yesterday as a possibility) 
>>> is that I have the iSCSI LUN attached to two Windows servers that are 
>>> acting as a Scale-Out File Server. When I copied a file over to the new 
>>> RAMdisk LUN via Scale-Out File Server, I am still getting 10-20MB/s; 
>>> however, when I create a large file to the underlying, shared DRBD on those 
>>> CentOS machines, I am getting about 700+MB/s, which I watched via iostat. 
>>> So, I guess it's the Scale-Out File Server causing the issue. Not sure why 
>>> Microsoft and the Scale-Out File Server is causing the issue - guess 
>>> Microsoft really doesn't like non-Microsoft backing disks
>>>
>>>
>>
>>
>> Not with Microsoft, but with overall iSCSI performance. For the older iSCSI 
>> target - IET - I used to use the following settings:
>> InitialR2T=No 
>> ImmediateData=Yes 
>> MaxRecvDataSegmentLength=65536 
>> MaxXmitDataSegmentLength=65536 
>> MaxBurstLength=262144 
>> FirstBurstLength=131072 
>> MaxOutstandingR2T=2 
>> Wthreads=128 
>> QueuedCommands=32
>>
>> Without that iSCSI LUNs were very slow independently of backing device speed.
>> Probably LIO provides a way to set them up as well.
>>
>> Best,
>> Vladislav
>>
>>
>>> Does anyone have any experience with that, perhaps? Thanks!!
>>>
>>> Respectfully,
>>>  Tyler
>>>
>>>
>>>
>>> Nov 14, 2022, 2:30 AM by ulrich.wi...@rz.uni-regensburg.de:
>>>
>>>> Hi!
>>>>
>>>> If you have plenty of RAM, you could configure an iSCSI disk using a RAM 
>>>> disk and try how much I/O you get from there.
>>>> Maybe your issue is not so much DRBD-related. However, when my local 
>>>> MD-RAID1 resyncs at about 120MB/s (spinning disks), the system is also 
>>>> hardly usable.
>>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>> Tyler Phillippe via Users wrote on 13.11.2022 at 19:26:
>>>>
>>>>> Hello all,
>>>>>
>>>>> I have setup a Linux cluster on 2x CentOS 8 Stream machines - it has 
>>>>> resources to manage a dual primary, GFS2 DRBD setup. DRBD and the cluster 
>>>>> have a diskless witness. Everything works fine - I have the dual primary 
>>>>> DRBD 
>>>>> working and it is able to present an iSCSI LUN out to my LAN. However, 
>>>>> the 
>>>>> DRBD write speed is terrible. The backing DRBD disks (HDD) are RAID10 
>>>>> using 
>>>>> mdadm and they (re)sync at around 150MB/s. DRBD verify has been limited 
>>>>> to 
>>>>> 100MB/s, but left untethered, it will get to around 140MB/s. If I write 
>>>>> data 
>>>>> to the iSCSI LUN, I only get about 10-15MB/s. Here's the DRBD 
>>>>> global_common.conf - these are exactly the same on both machines:
>>>>>
>>>>> global {
>>>>> usage-count no;
>>>>> udev-always-use-vnr;
>>>>> }
>>>>>
>>>>> common {
>>>>> handlers {
>>>>> }
>>>>>
>>>>> startup {
>>>>> wfc-timeout 5;
>>>>> degr-wfc-timeout 5;
>>>>> }
>>>>>
>>>>> options {
>>>>> auto-promote yes;

Re: [ClusterLabs] Antw: [EXT] DRBD Dual Primary Write Speed Extremely Slow

2022-11-14 Thread Tyler Phillippe via Users
Hi Vladislav,

If I don't use the Scale-Out File Server, I don't have any issues with iSCSI 
speeds: if I directly connect the LUN(s) to the individual servers, I get 
'full' speed - it just seems the Scale-Out File Server is causing the issue. 
Very strange of Microsoft (though not totally unexpected!). I can't find too 
much online about Scale-Out File Server, other than generic setup information.

Thanks!

Respectfully,
 Tyler



Nov 14, 2022, 9:20 AM by bub...@hoster-ok.com:

> Hi
>
> On Mon, 2022-11-14 at 15:00 +0100, Tyler Phillippe via Users wrote:
>
>> Good idea! I setup a RAM disk on both of those systems, let them sync, added 
>> it to the cluster. 
>>
>> One thing I left out (which didn't hit me until yesterday as a possibility) 
>> is that I have the iSCSI LUN attached to two Windows servers that are acting 
>> as a Scale-Out File Server. When I copied a file over to the new RAMdisk LUN 
>> via Scale-Out File Server, I am still getting 10-20MB/s; however, when I 
>> create a large file to the underlying, shared DRBD on those CentOS machines, 
>> I am getting about 700+MB/s, which I watched via iostat. So, I guess it's 
>> the Scale-Out File Server causing the issue. Not sure why Microsoft and the 
>> Scale-Out File Server is causing the issue - guess Microsoft really doesn't 
>> like non-Microsoft backing disks
>>
>>
>
>
> Not with Microsoft, but with overall iSCSI performance. For the older iSCSI 
> target - IET - I used to use the following settings:
> InitialR2T=No 
> ImmediateData=Yes 
> MaxRecvDataSegmentLength=65536 
> MaxXmitDataSegmentLength=65536 
> MaxBurstLength=262144 
> FirstBurstLength=131072 
> MaxOutstandingR2T=2 
> Wthreads=128 
> QueuedCommands=32
>
> Without that iSCSI LUNs were very slow independently of backing device speed.
> Probably LIO provides a way to set them up as well.
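
For what it's worth, LIO exposes most of these as per-TPG iSCSI negotiation parameters in targetcli (a sketch; the IQN is a placeholder, and IET-only knobs such as Wthreads and QueuedCommands have no direct LIO equivalent):

```
targetcli /iscsi/iqn.2022-11.lan.example:drbd-lun/tpg1 set parameter \
    InitialR2T=No ImmediateData=Yes \
    MaxRecvDataSegmentLength=65536 MaxXmitDataSegmentLength=65536 \
    MaxBurstLength=262144 FirstBurstLength=131072 MaxOutstandingR2T=2
```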
>
> Best,
> Vladislav
>
>
>> Does anyone have any experience with that, perhaps? Thanks!!
>>
>> Respectfully,
>>  Tyler
>>
>>
>>
>> Nov 14, 2022, 2:30 AM by ulrich.wi...@rz.uni-regensburg.de:
>>
>>> Hi!
>>>
>>> If you have plenty of RAM, you could configure an iSCSI disk using a RAM 
>>> disk and try how much I/O you get from there.
>>> Maybe your issue is not so much DRBD-related. However, when my local MD-RAID1 
>>> resyncs at about 120MB/s (spinning disks), the system is also hardly usable.
>>>
>>> Regards,
>>> Ulrich
>>>
>>> Tyler Phillippe via Users wrote on 13.11.2022 at 19:26:
>>>
>>>> Hello all,
>>>>
>>>> I have setup a Linux cluster on 2x CentOS 8 Stream machines - it has 
>>>> resources to manage a dual primary, GFS2 DRBD setup. DRBD and the cluster 
>>>> have a diskless witness. Everything works fine - I have the dual primary 
>>>> DRBD 
>>>> working and it is able to present an iSCSI LUN out to my LAN. However, the 
>>>> DRBD write speed is terrible. The backing DRBD disks (HDD) are RAID10 
>>>> using 
>>>> mdadm and they (re)sync at around 150MB/s. DRBD verify has been limited to 
>>>> 100MB/s, but left untethered, it will get to around 140MB/s. If I write 
>>>> data 
>>>> to the iSCSI LUN, I only get about 10-15MB/s. Here's the DRBD 
>>>> global_common.conf - these are exactly the same on both machines:
>>>>
>>>> global {
>>>> usage-count no;
>>>> udev-always-use-vnr;
>>>> }
>>>>
>>>> common {
>>>> handlers {
>>>> }
>>>>
>>>> startup {
>>>> wfc-timeout 5;
>>>> degr-wfc-timeout 5;
>>>> }
>>>>
>>>> options {
>>>> auto-promote yes;
>>>> quorum 1;
>>>> on-no-data-accessible suspend-io;
>>>> on-no-quorum suspend-io;
>>>> }
>>>>
>>>> disk {
>>>> al-extents 4096;
>>>> al-updates yes;
>>>> no-disk-barrier;
>>>> disk-flushes;
>>>> on-io-error detach;
>>>> c-plan-ahead 0;
>>>> resync-rate 100M;
>>>> }
>>>>
>>>> net {
>>>> protocol C;
>>>> allow-two-primaries yes;
>>>> cram-hmac-alg "sha256";
>>>> csums-alg "sha256";
>>>> verify-alg "sha256";
>>>> shared-secret "secret123";
>>>> max-buffers 36864;
>>>> rcvbuf-size 5242880;
>>>> sndbuf-size 5242880;
>>>> }
>>>> }
>>>>
>>>> Respectfully,
>>>> Tyler
>>>>
>>>
>>>
>>>
>>>
>>> ___
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>>
>>
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/
>>
>
>
>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] Antw: [EXT] DRBD Dual Primary Write Speed Extremely Slow

2022-11-14 Thread Tyler Phillippe via Users
Good idea! I set up a RAM disk on both of those systems, let them sync, and added 
it to the cluster.

One thing I left out (which didn't hit me until yesterday as a possibility) is 
that I have the iSCSI LUN attached to two Windows servers that are acting as a 
Scale-Out File Server. When I copied a file over to the new RAM disk LUN via the 
Scale-Out File Server, I am still getting 10-20MB/s; however, when I write a 
large file to the underlying, shared DRBD on those CentOS machines, I get about 
700+MB/s, which I watched via iostat. So, I guess it's the Scale-Out File Server 
causing the issue. Not sure why Microsoft and the Scale-Out File Server are 
causing the issue - I guess Microsoft really doesn't like non-Microsoft backing 
disks.

Does anyone have any experience with that, perhaps? Thanks!!
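
For anyone reproducing the per-layer comparison above (copying via the Scale-Out File Server versus writing straight to the GFS2/DRBD mount), it amounts to something like the following (a sketch; the mount point is a placeholder, and oflag=direct bypasses the page cache so the numbers reflect the storage stack):

```
# Write 4 GiB directly to the GFS2 filesystem sitting on DRBD
dd if=/dev/zero of=/mnt/gfs2/ddtest bs=1M count=4096 oflag=direct status=progress

# In another terminal, watch per-device throughput while the write runs
iostat -xm 2
```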

Respectfully,
 Tyler



Nov 14, 2022, 2:30 AM by ulrich.wi...@rz.uni-regensburg.de:

> Hi!
>
> If you have plenty of RAM, you could configure an iSCSI disk using a RAM disk 
> and try how much I/O you get from there.
> Maybe your issue is not so much DRBD-related. However, when my local MD-RAID1 
> resyncs at about 120MB/s (spinning disks), the system is also hardly usable.
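
One quick way to stand up such a RAM-backed test LUN (a sketch; sizes and names are placeholders, and the backstore still has to be mapped as a LUN under the existing iSCSI target):

```
# Create a 4 GiB RAM block device at /dev/ram0 (rd_size is in KiB)
modprobe brd rd_nr=1 rd_size=4194304

# Register it with LIO as a block backstore; map it into the target's TPG afterwards
targetcli /backstores/block create name=ramtest dev=/dev/ram0
```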
>
> Regards,
> Ulrich
>
> Tyler Phillippe via Users wrote on 13.11.2022 at 19:26:
>
>> Hello all,
>>
>> I have setup a Linux cluster on 2x CentOS 8 Stream machines - it has 
>> resources to manage a dual primary, GFS2 DRBD setup. DRBD and the cluster 
>> have a diskless witness. Everything works fine - I have the dual primary 
>> DRBD 
>> working and it is able to present an iSCSI LUN out to my LAN. However, the 
>> DRBD write speed is terrible. The backing DRBD disks (HDD) are RAID10 using 
>> mdadm and they (re)sync at around 150MB/s. DRBD verify has been limited to 
>> 100MB/s, but left untethered, it will get to around 140MB/s. If I write data 
>> to the iSCSI LUN, I only get about 10-15MB/s. Here's the DRBD 
>> global_common.conf - these are exactly the same on both machines:
>>
>> global {
>>  usage-count no;
>>  udev-always-use-vnr;
>> }
>>
>> common {
>>  handlers {
>>  }
>>
>>  startup {
>>  wfc-timeout 5;
>>  degr-wfc-timeout 5;
>>  }
>>
>>  options {
>>  auto-promote yes;
>>  quorum 1;
>>  on-no-data-accessible suspend-io;
>>  on-no-quorum suspend-io;
>>  }
>>
>>  disk {
>>  al-extents 4096;
>>  al-updates yes;
>>  no-disk-barrier;
>>  disk-flushes;
>>  on-io-error detach;
>>  c-plan-ahead 0;
>>  resync-rate 100M;
>>  }
>>
>>  net {
>>  protocol C;
>>  allow-two-primaries yes;
>>  cram-hmac-alg "sha256";
>>  csums-alg "sha256";
>>  verify-alg "sha256";
>>  shared-secret "secret123";
>>  max-buffers 36864;
>>  rcvbuf-size 5242880;
>>  sndbuf-size 5242880;
>>  }
>> }
>>
>> Respectfully,
>>  Tyler
>>
>
>
>
>
> ___
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


[ClusterLabs] DRBD Dual Primary Write Speed Extremely Slow

2022-11-13 Thread Tyler Phillippe via Users
Hello all,

I have set up a Linux cluster on 2x CentOS 8 Stream machines - it has resources 
to manage a dual-primary, GFS2 DRBD setup. DRBD and the cluster have a diskless 
witness. Everything works fine - I have the dual-primary DRBD working and it is 
able to present an iSCSI LUN out to my LAN. However, the DRBD write speed is 
terrible. The backing DRBD disks (HDDs) are RAID10 using mdadm and they (re)sync 
at around 150MB/s. DRBD verify has been limited to 100MB/s, but left 
unthrottled it will reach around 140MB/s. If I write data to the iSCSI LUN, I 
only get about 10-15MB/s. Here's the DRBD global_common.conf - it is 
exactly the same on both machines:

global {
    usage-count no;
    udev-always-use-vnr;
}

common {
    handlers {
    }

    startup {
    wfc-timeout 5;
    degr-wfc-timeout 5;
    }

    options {
    auto-promote yes;
    quorum 1;
    on-no-data-accessible suspend-io;
    on-no-quorum suspend-io;
    }

    disk {
    al-extents 4096;
    al-updates yes;
    no-disk-barrier;
    disk-flushes;
    on-io-error detach;
    c-plan-ahead 0;
    resync-rate 100M;
    }

    net {
    protocol C;
    allow-two-primaries yes;
    cram-hmac-alg "sha256";
    csums-alg "sha256";
    verify-alg "sha256";
    shared-secret "secret123";
    max-buffers 36864;
    rcvbuf-size 5242880;
    sndbuf-size 5242880;
    }
}

Respectfully,
 Tyler
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/