Re: [DRBD-user] Really, Really Slow or Misunderstanding?

2019-07-30 Thread knebb
Hi all,


I take it all back!

I just realized the line does not offer 19 Mbit/s but only 4 Mbit/s, so the
observed value is to be expected.
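
(Quick sanity check, assuming 8 bits per byte and ignoring protocol overhead:
4 Mbit/s / 8 is roughly 500 KB/s, which lines up well enough with the 444 K/sec
shown in /proc/drbd.)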


Sorry for the confusion!


/Christian


On 31.07.2019 at 06:48, kn...@knebb.de wrote:
> Hi all,
>
>
> I have a drbd device set up and it looks like it is running fine so far.
> But when watching /proc/drbd, the resync seems to be really, really slow:
>
> version: 8.4.11-1 (api:1/proto:86-101)
> GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55
>
>  1: cs:SyncTarget ro:Secondary/Secondary ds:Inconsistent/UpToDate A r-----
>     ns:0 nr:285164 dw:285044 dr:0 al:0 bm:0 lo:2 pe:2 ua:2 ap:0 ep:1 wo:f oos:314127936
>     [>...................] sync'ed:  0.1% (306764/307004)M
>     finish: 188:23:50 speed: 444 (424) want: 1,160 K/sec
>
>
> I have not added any special settings to my DRBD setup; the *.conf file
> just has all settings commented out with "#". And the resource file has
> the following:
>
> resource drbd1 {
>   protocol A;
>   startup {become-primary-on backuppc2;}
>   on backuppc1 {
>     device /dev/drbd1;
>     disk /dev/vdb;
>     address 192.168.9.1:7789;
>     meta-disk internal;
>   }
>   on backuppc2 {
>     device /dev/drbd1;
>     disk /dev/sdb;
>     address 192.168.1.1:7789;
>     meta-disk internal;
>   }
> }
>
> As it runs over a slow WAN connection (the line is 10 Mbit/s) I do not
> expect the transfer rate to be very high. But most of the time it is just
> 444 K/sec? There is no significant other traffic on the line, and a
> bandwidth calculator tells me the sync (approx. 307,000 MB) should be done
> within close to three days. But 188 hours is triple the calculated time!
>
> Oh, and on the device itself there is nearly no IO at this stage.
>
> Is there some miscalculation or misreading of the values from
> /proc/drbd? How can I speed up the sync?
>
>
>
> Thanks a lot!
>
>
> /Christian
>
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


[DRBD-user] Really, Really Slow or Misunderstanding?

2019-07-30 Thread knebb
Hi all,


I have a drbd device set up and it looks like it is running fine so far.
But when watching /proc/drbd, the resync seems to be really, really slow:

version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55

 1: cs:SyncTarget ro:Secondary/Secondary ds:Inconsistent/UpToDate A r-----
    ns:0 nr:285164 dw:285044 dr:0 al:0 bm:0 lo:2 pe:2 ua:2 ap:0 ep:1 wo:f oos:314127936
    [>...................] sync'ed:  0.1% (306764/307004)M
    finish: 188:23:50 speed: 444 (424) want: 1,160 K/sec


I have not added any special settings to my DRBD setup; the *.conf file
just has all settings commented out with "#". And the resource file has
the following:

resource drbd1 {
  protocol A;
  startup {become-primary-on backuppc2;}
  on backuppc1 {
    device /dev/drbd1;
    disk /dev/vdb;
    address 192.168.9.1:7789;
    meta-disk internal;
  }
  on backuppc2 {
    device /dev/drbd1;
    disk /dev/sdb;
    address 192.168.1.1:7789;
    meta-disk internal;
  }
}

As it runs over a slow WAN connection (the line is 10 Mbit/s) I do not
expect the transfer rate to be very high. But most of the time it is just
444 K/sec? There is no significant other traffic on the line, and a
bandwidth calculator tells me the sync (approx. 307,000 MB) should be done
within close to three days. But 188 hours is triple the calculated time!

Oh, and on the device itself there is nearly no IO at this stage.

Is there some miscalculation or misreading of the values from
/proc/drbd? How can I speed up the sync?
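
(For the archives, one knob that is usually involved here: on DRBD 8.4 the
resync rate can be set statically in the disk section of the resource, or
changed at runtime with drbdadm disk-options. This is only a sketch, with
made-up values and the drbd1 resource name from above:

    # in the resource file
    disk {
        c-plan-ahead 0;      # disable the dynamic resync-rate controller
        resync-rate 1M;      # static rate; keep it below the real line speed
    }

    # or at runtime
    drbdadm disk-options --c-plan-ahead=0 --resync-rate=1M drbd1

No setting can push the sync faster than the line itself, of course.)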



Thanks a lot!


/Christian

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Managed to corrupt a drbd9 resource, but don't fully understand how

2019-07-30 Thread Eddie Chapman

On 29/07/2019 10:34, Eddie Chapman wrote:

Hello,

I've managed to corrupt one of the drbd9 resources on one of my 
production servers, and now I'm trying to figure out exactly what happened 
so I can try to recover the data. I wonder if anyone might understand 
what went wrong here (apart from the fact that PEBCAK :-) )?


That was a very long email from me yesterday, and most people don't have 
time for that, so let me boil it down to just this question:


Should I be able to run lvextend on the backing device of a live drbd 
resource, and then safely abort everything and "down" the resource, 
without telling DRBD to resize it or do anything to it? Or is it expected 
that doing something like that will lead to data corruption, so just 
"Don't Do That"?
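
(For reference, the documented way to grow a DRBD device online is the other
way around: grow the backing device on every node first, then explicitly tell
DRBD about it with drbdadm resize. A sketch with made-up VG/LV and resource
names, not a description of what was done here:

    lvextend -L +10G /dev/vg0/lv_backing    # on each node
    drbdadm resize r0                       # once, after all backing devices are grown

The question above is about the opposite case: growing the backing device and
then backing out without ever calling resize.)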


It's an important question for me to understand, as I use drbd on a lot 
of servers and often do very advanced things with the layers below drbd. 
If I find myself in a similar situation again, it would be good to know 
for sure what is and is not expected behaviour.


Thanks,
Eddie
___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] DRBD Delay

2019-07-30 Thread Banks, David (db2d)
This did not work.


On Jul 24, 2019, at 9:03 AM, Banks, David (db2d) <d...@virginia.edu> wrote:

Tried adding the new drbd.service file below. I'm in the middle of a data
transfer now but will try a reboot when done.

[Unit]
Description=DRBD -- please disable. Unless you are NOT using a cluster manager.
Wants=network.target sshd.service network-online.target zfs-import.target
After=network.target sshd.service network-online.target zfs-import.target

[Service]
Type=oneshot
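
(A common alternative to replacing the whole unit is a systemd drop-in that
only adds the ordering. A sketch, assuming the stock drbd.service and the
standard ZFS units zfs-import.target / zfs-mount.service:

    # systemctl edit drbd.service
    # -> creates /etc/systemd/system/drbd.service.d/override.conf
    [Unit]
    After=zfs-import.target zfs-mount.service
    Wants=zfs-import.target zfs-mount.service

    # then reload the unit files
    systemctl daemon-reload

Whether that ordering is enough depends on how the ZFS volumes backing DRBD
are actually brought up on this system.)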


---
Dav Banks
School of Architecture
University of Virginia
Campbell Hall, Rm 138
(o) 434.243.8883
(e) dav.ba...@virginia.edu

On Jul 24, 2019, at 2:28 AM, Gianni Milo <gianni.mil...@gmail.com> wrote:

It might be worth having a look at the drbd.service systemd unit file...

dpkg -l | grep drbd
ii  drbd-dkms    9.0.19-1   all     RAID 1 over TCP/IP for Linux module source
ii  drbd-utils   9.10.0-1   amd64   RAID 1 over TCP/IP for Linux (user utilities)

dpkg -L drbd-utils | grep drbd.service
/lib/systemd/system/drbd.service

Gianni

On Tue, 23 Jul 2019 at 17:07, Banks, David (db2d) <d...@virginia.edu> wrote:
Thanks Veit, that’s what I use to start it manually. I’m looking for the 
automated approach so that it starts on its own reliably after system start.

---
Dav Banks
School of Architecture
University of Virginia
Campbell Hall, Rm 138
(o) 434.243.8883
(e) dav.ba...@virginia.edu

On Jul 23, 2019, at 3:49 AM, Veit Wahlich <cru.li...@zodia.de> wrote:

Hi David,

have a look at the documentation of drbdadm; the commands up, down and
adjust might be what you are looking for.

Best regards,
// Veit


On Monday, 22.07.2019 at 16:34 +, Banks, David (db2d) wrote:
Hello,

Is there a way to delay the loading of DRBD resources until after the
underlying block system has made the devices available?

I’ve looked in systemd but didn’t see any drbd services and wanted to
ask before monkeying with that.

System: Ubuntu 18.04
DRBD: 8.9.10

After reboot DRBD starts before the zfs volumes that I use are
available, so I have to do a 'drbdadm adjust all' each time. I’d like
it to just wait until the zfs-mount.service is done.

Thanks!

___
Star us on GITHUB: https://github.com/LINBIT
drbd-user mailing list
drbd-user@lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


Re: [DRBD-user] Auto-promote hangs when 3rd node is gracefully taken offline

2019-07-30 Thread Lars Ellenberg
On Sat, Jul 27, 2019 at 02:43:31PM -0400, Digimer wrote:
> Hi all,
> 
>   I hit an odd issue. I have a 3-node DRBD 9 setup (see below).
> Basically, two nodes are protocol C and in pacemaker, the third node is
> protocol A and is outside pacemaker (it's a DR and will periodically
> connect -> sync -> disconnect, mostly living outside the cluster).
> 
>   I down'ed the "DR" resource a day or two ago. Today, I stopped a
> server on node 2 and then tried to boot it on node 1. It hung, because
> DRBD wouldn't promote to primary. I tried manually promoting it, and the
> 'drbdadm up ' never returned.

the "drbdadm primary" command, I guess?

That would be the corresponding log line, probably:

Jul 27 18:16:39 el8-a01n01.digimer.ca kernel: drbd test_server:
Auto-promote failed: Timeout in operation

>   Eventually, I brought the DR node back up, and after a delay with a
> flurry of log entries, it connected and resync'ed. After that,
> auto-promotion worked again and I could boot the server.

So we should figure out why the auto-promote ran into a timeout
(and why some command "never" returned for you)

What's the latency to your DR node? And between the other nodes?
What's the typical run time of your fence handler,
and does it try to "talk to DRBD" while it is running?
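
(Roughly how one could answer those, with placeholder host names:

    ping -c 20 dr-node.example.com      # round-trip time to the DR site
    ping -c 20 peer-node.example.com    # and between the two cluster nodes

The fence handler's run time can be read from the log by comparing the
timestamps of the "helper command: ..." line with its matching "... exit code"
line.)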

Then there is the question of why DRBD is later "busy looping" through
"preparing/aborting/..." cluster-wide state transitions;
that seems ... a waste of effort ;-)

>   Attached are the full logs from the three nodes, going back to the 24th
> (three days ago). I'm not including them inline as they add up to 4500 lines.

compressed attachments are fine.

>   I assume I am right to assume that gracefully down'ing a resource
> should not cause problems, as it worked OK for a little while, and failed
> after it was left alone for a day or so.


Even ungracefully disconnecting some DR node
should not affect main cluster operation.

Just saying: I'm not a fan of auto-promote, btw.
But it sure makes "live migration" easier.

Some other log lines I want to highlight, even though they are several
days before the "interesting" auto-promote timeout failure:

> Jul 24 21:47:20 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 6
> Jul 24 21:47:27 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 5
> Jul 24 21:47:33 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 4
> Jul 24 21:47:39 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 3
> Jul 24 21:47:45 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 2
> Jul 24 21:47:51 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: [drbd_s_test_ser/18629] sending time expired, ko = 1

Uh? Seriously congested on the network, several times?

pattern repeats 18 times, finally:

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: sock was shut down by peer
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server: susp-io( no 
> -> fencing)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: fence-peer helper broken, returned 9

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server: susp-io( 
> fencing -> no)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server/1 drbd1 
> el8-a01n02.digimer.ca: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server/0 drbd0 
> el8-a01n02.digimer.ca: Resync done (total 1 sec; paused 0 sec; 2052 K/sec)

> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: helper command: /sbin/drbdadm unfence-peer
> Jul 24 22:00:57 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: helper command: /sbin/drbdadm unfence-peer exit code 0 
> (0x0)

> Jul 26 19:44:47 el8-a01n01.digimer.ca kernel: drbd test_server: role( Primary 
> -> Secondary )
> Jul 26 19:45:03 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: peer( Secondary -> Primary )

> Jul 27 17:52:41 el8-a01n01.digimer.ca kernel: drbd test_server 
> el8-a01n02.digimer.ca: peer( Primary -> Secondary )


Here, auto-promote fails with a timeout after trying three times,
each try hitting the 30-second timeout:

> Jul 27 18:14:48 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing 
> cluster-wide state change 3188437928 (0->-1 3/1)
> Jul 27 18:15:03 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting 
> cluster-wide state change 3188437928 (30341ms) rv = -23
> Jul 27 18:15:34 el8-a01n01.digimer.ca kernel: drbd test_server: Preparing 
> cluster-wide state change 216414519 (0->-1 3/1)
> Jul 27 18:15:34 el8-a01n01.digimer.ca kernel: drbd test_server: Aborting