ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-20 Thread Matthias Petermann

Hello all,

I have a network problem here, and I'm not sure what part Xen plays in it.

There is one Dom0 and several DomUs. The DomUs are connected via a
bridge to the Dom0 and the LAN.


The filesystems of the DomUs are backed up to a USB disk attached to the 
host. To do this, Dom0 calls the dump command[1] in the DomUs via ssh 
and has the data written to stdout. In Dom0, the data is then redirected 
to the file on the USB disk.


When the VMs were not yet particularly busy, this worked without any
problems. Now that there is more load on the system, I get irregular
but reproducible SSH disconnects (client_loop: send disconnect: Broken
pipe). I have already tried all possible combinations of
ClientAliveInterval and ServerAliveInterval (and the corresponding
CountMax options), as well as TCPKeepAlive. None of this changed the
situation. Even if I run the SSH server in the DomUs with -d -d -d
(debug level 3), I see no evidence of a cause around the time of the
abort.


Interesting: if I initiate the SSH connection from an external host on the
LAN (i.e. not the Dom0), it works without any connection aborts.


So I'm just wondering if there might be any peculiarities, setting 
options or known errors related to long-running SSH connections from 
Dom0 into a DomU on the same host. If anyone has any ideas here, I would 
be very very grateful.


Kind regards
Matthias


[1] 
https://forge.petermann-it.de/mpeterma/vmtools/src/commit/c0f89b3b7610da25fd073a0cebf4e11788934a4b/vmbackup#L193


smime.p7s
Description: S/MIME Cryptographic Signature


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread RVP

On Tue, 20 Jun 2023, Matthias Petermann wrote:

> problems. Since there is a bit more steam on the system, I get irregular but
> predictable SSH connection disconnects (ssh client loop send disconnect:
> Broken pipe). I have already tried all possible combinations of
> ClientAliveInterval and ServerAliveInterval (and -CountMax), also
> TCPKeepAlive. Nothing has changed this situation. Even if I run the Ssh
> server in the DomUs with -d -d -d (debug level 3), I see no evidence of a
> cause around the time of the abort.




A `Broken pipe' from ssh means the RHS of the pipeline exited prematurely.
Does dd report anything? media errors? filesystem full? O_DIRECT constraint
violations (offsets and sizes not aligned to block-size because dd(1) is
reading from a (network) pipe--though this is not a problem on UFS)?

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Matthias Petermann

Hello,

On 21.06.23 09:31, RVP wrote:

> On Tue, 20 Jun 2023, Matthias Petermann wrote:
> > problems. Since there is a bit more steam on the system, I get
> > irregular but predictable SSH connection disconnects (ssh client loop
> > send disconnect: Broken pipe). I have already tried all possible
> > combinations of ClientAliveInterval and ServerAliveInterval (and
> > -CountMax), also TCPKeepAlive. Nothing has changed this situation.
> > Even if I run the Ssh server in the DomUs with -d -d -d (debug level
> > 3), I see no evidence of a cause around the time of the abort.
>
> A `Broken pipe' from ssh means the RHS of the pipeline exited prematurely.
> Does dd report anything? media errors? filesystem full? O_DIRECT constraint
> violations (offsets and sizes not aligned to block-size because dd(1) is
> reading from a (network) pipe--though this is not a problem on UFS)?



Thanks for your response and for pointing me to dd. I did not think at
first that this could contribute to the issue, but it sounds like
something important to consider. Basically I use dd to get at least some
kind of return code on the host, e.g. if the target media is full or not
writable for some reason.


Before I had dd in place, I used a redirection > $dumpname, which resulted
in the same kind of broken pipe issues. I have just verified this by
repeating it as an isolated test case.


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Matthias Petermann

On 21.06.23 10:22, RVP wrote:

> On Wed, 21 Jun 2023, Matthias Petermann wrote:
> > Before I had dd in place, I used a redirection > $dumpname which
> > results in the same kind of broken pipe issues. I just did verify this
> > by repeating this as an isolated test case.
>
> I don't get that: there's no pipe there when you do `> file'. So how come
> a Broken pipe still?



My mistake... the error message was probably slightly different but still
related to the ssh client_loop. I will repeat the test to catch the
exact message and will attach the exact command.


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread RVP

On Wed, 21 Jun 2023, Matthias Petermann wrote:

> Before I had dd in place, I used a redirection > $dumpname which results in
> the same kind of broken pipe issues. I just did verify this by repeating this
> as an isolated test case.




I don't get that: there's no pipe there when you do `> file'. So how come
a Broken pipe still?

```
$ ssh arpa.sdf.org 'dd if=/dev/zero bs=1m count=10 msgfmt=quiet' > /dev/full || echo FAIL
FAIL
$
```

Can you give the exact command you tested with?

-RVP

PS. Any `ProxyCommand' set in `~/.ssh/config'?


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread RVP

On Wed, 21 Jun 2023, Matthias Petermann wrote:

> My mistake... the error message probably was slightly different but still
> related to the ssh_client_loop.




Aah! ssh is stuffing errno in _many_ places, so it's definitely possible.
See, for example:

src/crypto/external/bsd/openssh/dist/sshbuf-misc.c:

```
	} else if (rr == 0) {
		errno = EPIPE;
		return SSH_ERR_SYSTEM_ERROR;
```

And, that's on a _read_ operation. :(

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread RVP

On Wed, 21 Jun 2023, RVP wrote:


> A `Broken pipe' from ssh means the RHS of the pipeline exited prematurely.



Is what I said, but, I see that ssh ignores SIGPIPE (network I/O--duh!),
so that error message is even odder.

Do a `2>log.txt ssh -vvv ...' and post the `log.txt' file when you send
the command-line.

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Michael van Elst
r...@sdf.org (RVP) writes:

>I don't get that: there's no pipe there when you do `> file'. So how come
>a Broken pipe still?

It's the communication between ssh and sshd where ssh can no longer write
to a network connection closed by sshd. The problem is to find out why
the connection got closed.
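
The mechanism described above can be reproduced without ssh at all. The following is only a minimal sketch, assuming a local socket pair stands in for the TCP connection between ssh and sshd; the function name is made up for illustration. Python, like ssh, ignores SIGPIPE, so the failed write surfaces as an errno rather than killing the process.

```python
import errno
import socket


def write_after_peer_close() -> int:
    """Return the errno seen when writing to a stream socket whose peer has closed."""
    # A connected pair of local stream sockets stands in for the ssh <-> sshd link.
    ours, theirs = socket.socketpair()
    theirs.close()  # the "sshd" side goes away without any goodbye
    try:
        ours.send(b"data")  # the "ssh" side keeps writing
    except OSError as exc:
        return exc.errno
    finally:
        ours.close()
    return 0


if __name__ == "__main__":
    code = write_after_peer_close()
    print(errno.errorcode.get(code, code), "-", "write failed" if code else "write succeeded")
```

On common Unix systems the returned value is EPIPE, i.e. the "Broken pipe" text that ssh prints, even though no shell pipe is involved.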




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Matthias Petermann

Hello,

On 21.06.23 11:22, RVP wrote:

> On Wed, 21 Jun 2023, RVP wrote:
> > A `Broken pipe' from ssh means the RHS of the pipeline exited
> > prematurely.
>
> Is what I said, but, I see that ssh ignores SIGPIPE (network I/O--duh!),
> so that error message is even odder.
>
> Do a `2>log.txt ssh -vvv ...' and post the `log.txt' file when you send
> the command-line.
>
> -RVP


Thanks for your patience. It took me some hours to get ready for the
test. The command that I issued is:


```
vhost2$ 2>log.txt ssh user@srv-net -vvv doas /sbin/dump -X -h 0 -b 64 \
    -0auf - /data/119455aa-6ef8-49e0-b71a-9c87e84014cb > /mnt/test.dump
```

The log output at the time of the disconnect was:

```
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1933312 sent adjust 163840
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1933312 sent adjust 163840
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1966080 sent adjust 131072
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 6/7/8 sock -1 cc -1 io 0x01/0x00)


Connection to srv-net closed by remote host.
Transferred: sent 1119040, received 8256466228 bytes, in 374.4 seconds
Bytes per second: sent 2988.7, received 22051060.6
debug1: Exit status -1
```

I would like to apologize for being imprecise in the initial wording of 
the error message. Without the pipe - i.e. with the redirect - I can't 
currently see a "client_loop send disconnect" message at all, but 
"Connection to srv-net closed by remote host.". However, the result is 
the same - the connection was closed remotely.


By the way, until now I assumed that the problem only occurs from Dom0
to DomU. In the meantime, however, I have encountered it at least once
with the variant that previously worked: from another physical host on
the LAN to the DomU. Since all packets going to and from a DomU also pass
through Dom0, it is of course conceivable that this is a problem in Dom0.
I have also noticed there that the virtual network interface of the DomU
in question reports transmission errors:


```
netbsd_netif_rx_bytes{interface="xvif1i0"} 14371724
netbsd_netif_tx_bytes{interface="xvif1i0"} 29567916243
netbsd_netif_errors{interface="xvif1i0"} 26
```

For your reference you can find the full log at:

https://paste.petermann-it.de/?0a1f841ca7c27c63#DTzA3mJMN4fXqGTNVERaWuLJxLTNU1BByhs3DoEn5i9d

Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Matthias Petermann

On 21.06.23 19:54, Matthias Petermann wrote:
> 2>log.txt ssh user@srv-net -vvv doas /sbin/dump -X -h 0 -b 64 -0auf -
> /data/119455aa-6ef8-49e0-b71a-9c87e84014cb > /mnt/test.dump


...just noticed another variation, this time client_loop: send
disconnect occurred:


https://paste.petermann-it.de/?880b721698a2bedc#8pTM5QsrDoojU5tL9xmKoLgRw4Zi96BPrQiKQX6GtAaZ

Best regards,
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread Matthias Petermann

On 22.06.23 07:52, Matthias Petermann wrote:

> On 21.06.23 19:54, Matthias Petermann wrote:
> > 2>log.txt ssh user@srv-net -vvv doas /sbin/dump -X -h 0 -b 64 -0auf -
> > /data/119455aa-6ef8-49e0-b71a-9c87e84014cb > /mnt/test.dump
>
> ...just noticed another variation, this time client_loop: send
> disconnect occurred:
>
> https://paste.petermann-it.de/?880b721698a2bedc#8pTM5QsrDoojU5tL9xmKoLgRw4Zi96BPrQiKQX6GtAaZ



```
debug2: channel 0: window 1966080 sent adjust 131072
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1933312 sent adjust 163840
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1900544 sent adjust 196608
debug3: send packet: type 1
client_loop: send disconnect: Broken pipe
```

...and the error count of the interface did not increase:

```
netbsd_netif_rx_bytes{interface="xvif1i0"} 29121457
netbsd_netif_tx_bytes{interface="xvif1i0"} 30167032238
netbsd_netif_errors{interface="xvif1i0"} 26
```

So the disconnection event and the error count do not seem to be
connected directly(?)
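
The comparison of the two counter samples can be done mechanically. This is a small sketch, assuming the exporter output always has the `name{interface="..."} value` shape shown in this thread; the function names are invented for illustration, and the two samples are the xvif1i0 values quoted in the messages above.

```python
import re

# One metric line as it appears in the thread, e.g.
# netbsd_netif_errors{interface="xvif1i0"} 26
METRIC_RE = re.compile(r'^(\w+)\{interface="([^"]+)"\}\s+(\d+)$')


def parse_metrics(text):
    """Map (metric, interface) -> integer value for each well-formed line."""
    values = {}
    for line in text.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            values[(match.group(1), match.group(2))] = int(match.group(3))
    return values


def counter_deltas(before, after):
    """Return after - before for every counter present in both samples."""
    old, new = parse_metrics(before), parse_metrics(after)
    return {key: new[key] - old[key] for key in old if key in new}


BEFORE = '''
netbsd_netif_rx_bytes{interface="xvif1i0"} 14371724
netbsd_netif_tx_bytes{interface="xvif1i0"} 29567916243
netbsd_netif_errors{interface="xvif1i0"} 26
'''

AFTER = '''
netbsd_netif_rx_bytes{interface="xvif1i0"} 29121457
netbsd_netif_tx_bytes{interface="xvif1i0"} 30167032238
netbsd_netif_errors{interface="xvif1i0"} 26
'''

if __name__ == "__main__":
    for (metric, iface), delta in sorted(counter_deltas(BEFORE, AFTER).items()):
        print(f"{iface} {metric}: +{delta}")
```

With these two samples the error counter delta is zero while roughly 600 MB were transmitted, which matches the observation that the disconnect did not coincide with new interface errors.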


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-21 Thread RVP

On Wed, 21 Jun 2023, Matthias Petermann wrote:


> The log output at the time of the disconnect was:

```
debug2:  tcpwinsz: 197420 for connection: 3
debug2:  channel 0: window 1933312 sent adjust 163840
debug2:  tcpwinsz: 197420 for connection: 3
debug2:  channel 0: window 1933312 sent adjust 163840
debug2:  tcpwinsz: 197420 for connection: 3
debug2:  channel 0: window 1966080 sent adjust 131072
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
 #0 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 6/7/8 sock -1 cc -1 io 0x01/0x00)


Connection to srv-net closed by remote host.
Transferred: sent 1119040, received 8256466228 bytes, in 374.4 seconds
Bytes per second: sent 2988.7, received 22051060.6
debug1: Exit status -1
```



Hmm. From the log, it looks like sshd just exited without even saying goodbye.
_One_ way to reproduce this is by killing the sshd instance handling the
transfer:

```
$ ssh -Elog.txt -vvv localhost 'dd if=/dev/zero bs=1m count=1 msgfmt=quiet' >/dev/null &
$ su -l
/root# ps -Auxd | fgrep sshd
root 26071  0.0  0.0  11844  1536 pts/1 S+   5:58AM 0:00.00 |   `-- fgrep sshd
root 16367  0.0  0.1  24712  3300 ?     Ss   5:56AM 0:00.01 |-- sshd: /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
root  5818  0.0  0.2  23404  8060 ?     Ss   5:58AM 0:00.04 | `-- sshd: rvp [priv]
rvp  16542 85.9  0.2  27628  9616 ?     R    5:58AM 0:14.08 |   `-- sshd: rvp@notty
/root# kill 16542
```

results in:

```
debug2: channel 0: window 1966080 sent adjust 131072
debug2: tcpwinsz: 81920 for connection: 4
debug2: channel 0: window 1966080 sent adjust 131072
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 5/6/7 sock -1 cc -1 io 0x01/0x00)

Transferred: sent 142108, received 911247684 bytes, in 8.5 seconds
Bytes per second: sent 16800.5, received 107730568.0
debug1: Exit status -1
```

or, even, as mlelstv@ pointed out, if the client attempts to write to the
socket:

```
debug2: channel 0: window 1966080 sent adjust 131072
debug2: tcpwinsz: 81920 for connection: 4
debug2: channel 0: window 1966080 sent adjust 131072
debug3: send packet: type 1
client_loop: send disconnect: Broken pipe
```

A normal closure (even if the program being run exits abnormally) looks like:

```
[...]
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 0 rtype eow@openssh.com reply 0
debug2: channel 0: rcvd eow
debug2: chan_shutdown_read: channel 0: (i0 o0 sock -1 wfd 6 efd 8 [write])
debug2: channel 0: input open -> closed
[...]
debug3: receive packet: type 96
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug3: receive packet: type 97
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug3: channel 0: will not send data after close
debug2: channel 0: obuf empty
debug2: chan_shutdown_write: channel 0: (i3 o1 sock -1 wfd 7 efd 8 [write])
debug2: channel 0: output drain -> closed
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug3: send packet: type 97
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i3/0 o3/0 e[write]/0 fd -1/-1/8 sock -1 cc -1 io 0x00/0x00)

debug3: send packet: type 1
Transferred: sent 19696, received 104925684 bytes, in 546.4 seconds
Bytes per second: sent 36.0, received 192032.2
debug1: Exit status 0
```

You can see the messages being exchanged to close the channel.
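
The three endings shown above can be told apart by a few marker strings. What follows is a rough sketch, assuming the markers that appear in the logs in this thread are representative; the function name and the category labels are made up, and the rule is a heuristic, not anything defined by OpenSSH.

```python
def classify_client_log(text: str) -> str:
    """Classify how an `ssh -vvv` client log ended, using markers seen in this thread."""
    # An orderly shutdown includes the channel close handshake.
    if "rcvd close" in text:
        return "orderly close"
    # The client tried to write to a connection that was already gone.
    if "send disconnect: Broken pipe" in text:
        return "abrupt: write to dead connection"
    # The session ended, but no close handshake was ever logged.
    if "Exit status" in text or "closed by remote host" in text:
        return "abrupt: no close handshake"
    return "incomplete log"


KILLED = """debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
Transferred: sent 142108, received 911247684 bytes, in 8.5 seconds
debug1: Exit status -1
"""

BROKEN_PIPE = """debug3: send packet: type 1
client_loop: send disconnect: Broken pipe
"""

NORMAL = """debug2: channel 0: rcvd eof
debug2: channel 0: rcvd close
debug3: send packet: type 1
debug1: Exit status 0
"""

if __name__ == "__main__":
    for name, sample in (("killed", KILLED), ("broken pipe", BROKEN_PIPE), ("normal", NORMAL)):
        print(name, "->", classify_client_log(sample))
```

Note that a remote command which itself fails still produces the close handshake, so a non-zero exit status alone does not indicate a dropped connection.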

Can you see any errors from sshd(8) in the logs on the DomU?
If not, run the sshd server standalone like this:

```
/usr/sbin/sshd -Dddd -E/tmp/s.log
```

then post the `s.log' file after you run something like:

```
$ ssh -E/tmp/c.log -vvv XXX.NET 'dd if=/dev/zero bs=1m count=1 msgfmt=quiet' >/dev/null
```

on the Dom0.

Thanks,
-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-22 Thread Matthias Petermann

Hi,

On 22.06.23 08:36, RVP wrote:

> Can you see any errors from sshd(8) in the logs on the DomU?
> If not, run the sshd server standalone like this:
>
> ```
> /usr/sbin/sshd -Dddd -E/tmp/s.log
> ```
>
> then post the `s.log' file after you run something like:
>
> ```
> $ ssh -E/tmp/c.log -vvv XXX.NET 'dd if=/dev/zero bs=1m count=1 msgfmt=quiet' >/dev/null
> ```
>
> on the Dom0.



Thanks for all the good points and suggestions. I have picked up the
last one and temporarily ran sshd on the DomU in debug mode, while
repeating the exact same test with ssh and dump.


Here are the logs:

 - Dom0: 
https://paste.petermann-it.de/?0c56870be0c9e8b1#9bCJT4C2hTBMUWzgDu84o6ipccjC7kNQzdofQtCaLUyz


 - DomU: 
https://paste.petermann-it.de/?8185479d13b60bbd#98azwgVmfsz5aQ8dPswwCD2yn2mTnM9kstVLvNv8JUKy


During this test, the error count on the affected interfaces did not 
increase further:


```
netbsd_netif_rx_bytes{interface="re0"} 64759716902
netbsd_netif_tx_bytes{interface="re0"} 1017162986389
netbsd_netif_errors{interface="re0"} 0
netbsd_netif_rx_bytes{interface="lo0"} 99500058
netbsd_netif_tx_bytes{interface="lo0"} 99500058
netbsd_netif_errors{interface="lo0"} 0
netbsd_netif_rx_bytes{interface="bridge0"} 1047528461564
netbsd_netif_tx_bytes{interface="bridge0"} 1049530231106
netbsd_netif_errors{interface="bridge0"} 0
netbsd_netif_rx_bytes{interface="xvif1i0"} 33842998
netbsd_netif_tx_bytes{interface="xvif1i0"} 31012168535
netbsd_netif_errors{interface="xvif1i0"} 26
```

The relevant part from the client log...

```
debug2: channel 0: window 1900544 sent adjust 196608
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1966080 sent adjust 131072
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1966080 sent adjust 131072
debug2: tcpwinsz: 197420 for connection: 3
debug2: channel 0: window 1966080 sent adjust 131072
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i0/0 o0/0 e[write]/0 fd 6/7/8 sock -1 cc -1 io 0x01/0x00)

Connection to srv-net closed by remote host.
Transferred: sent 1086580, received 7869273016 bytes, in 393.9 seconds
Bytes per second: sent 2758.8, received 19980159.5
debug1: Exit status -1
```

...and from the server log...

```
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd adjust 196608
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd adjust 131072
debug2: channel 0: rcvd adjust 131072
process_output: ssh_packet_write_poll: Connection from user user 192.168.2.50 port 60196: Host is down
debug1: do_cleanup
debug3: PAM: sshpam_thread_cleanup entering
debug3: mm_request_receive: entering
debug1: do_cleanup
debug1: PAM: cleanup
debug1: PAM: closing session
debug1: PAM: deleting credentials
debug3: PAM: sshpam_thread_cleanup entering
```

...it looks a bit like both parties are blaming each other: the client
says the remote host has closed the connection, and the remote host
complains about the client being down.



Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-22 Thread Brian Buhrow
Hello. Actually, on the server side, where you get the "host is down"
message, that is a system error from the network stack itself. I've seen it
when the arp cache times out and can't be refreshed in a timely manner. What
happens if you run an extended ping session between the dom0 and domu hosts,
starting the ping from the dom0 side? By extended session, I mean running a
ping session for an hour or two, capturing all the output in a log file. Do
you get any packet loss during that interval? If so, what errors does ping
show when the loss is occurring?
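
A log from such a long ping run is easier to review when summarized. The sketch below assumes each reply line contains an `icmp_seq=N` field and that ping's own error messages start with `ping:`; both are assumptions about the output format, the function name is invented, and the sample log is made up for illustration rather than taken from the thread.

```python
import re

SEQ_RE = re.compile(r"\bicmp_seq=(\d+)")


def summarize_ping_log(text):
    """Report answered sequence numbers, gaps between them, and error lines."""
    seqs = sorted({int(m.group(1)) for m in SEQ_RE.finditer(text)})
    present = set(seqs)
    # Every sequence number between the first and last reply that never showed up.
    missing = [n for n in range(seqs[0], seqs[-1] + 1) if n not in present] if seqs else []
    errors = [line.strip() for line in text.splitlines() if line.startswith("ping:")]
    return {"replies": len(seqs), "missing": missing, "errors": errors}


# Hypothetical sample, not real output from the systems discussed here.
SAMPLE = """64 bytes from 192.168.2.51: icmp_seq=0 ttl=255 time=0.4 ms
64 bytes from 192.168.2.51: icmp_seq=1 ttl=255 time=0.4 ms
ping: sendto: Host is down
64 bytes from 192.168.2.51: icmp_seq=4 ttl=255 time=0.5 ms
"""

if __name__ == "__main__":
    print(summarize_ping_log(SAMPLE))
```

The sequence counter is 16 bits wide, so runs much longer than the suggested hour or two at one packet per second would wrap around and need extra handling.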

-thanks
-Brian



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-22 Thread RVP

On Thu, 22 Jun 2023, Matthias Petermann wrote:


> ...and from the server log...
>
> ```
> debug2: channel 0: rcvd adjust 131072
> debug2: channel 0: rcvd adjust 131072
> debug2: channel 0: rcvd adjust 196608
> debug2: channel 0: rcvd adjust 131072
> debug2: channel 0: rcvd adjust 131072
> debug2: channel 0: rcvd adjust 131072
>
> process_output: ssh_packet_write_poll: Connection from user user 192.168.2.50 port 60196: Host is down
> ```
>
> ...appears a bit like both parties blaming each other... client tells the
> remote host has closed the connection, and the remote host complains about
> the client being down.




So, the server tries to write data into the socket; write() fails with
errno = EHOSTDOWN which sshd(8) treats as a fatal error and it exits.
The client tries to read/write to a closed connection, and it too quits.

The part which doesn't make sense is the EHOSTDOWN error. Clearly the
other end isn't down. Can't say I understand what's happening here. You
need a Xen guru now, Matthias :)
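
For reference, the text in the sshd log is simply the system's message for that errno. A tiny sketch to confirm the mapping on the local system (the helper name is made up):

```python
import errno
import os


def describe(code: int) -> str:
    """Return 'SYMBOL: message' for an errno value on the local system."""
    return f"{errno.errorcode[code]}: {os.strerror(code)}"


if __name__ == "__main__":
    # EHOSTDOWN is what write() returned to sshd; EHOSTUNREACH is a distinct error.
    print(describe(errno.EHOSTDOWN))
    print(describe(errno.EHOSTUNREACH))
```

This only shows where the wording comes from; it says nothing about why the kernel returned the error.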

On Thu, 22 Jun 2023, Brian Buhrow wrote:


> hello.  Actually, on the server side, where you get the "host is down"
> message, that is a system error from the network stack itself.  I've seen
> it when the arp cache times out and can't be refreshed in a timely manner.



But, does ARP make any sense for Xen IFs? I thought MAC addresses were
ginned up for Xen IFs...

-RVP



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Matthias Petermann

Hi,

On 23.06.23 02:45, RVP wrote:

> So, the server tries to write data into the socket; write() fails with
> errno = EHOSTDOWN which sshd(8) treats as a fatal error and it exits.
> The client tries to read/write to a closed connection, and it too quits.
>
> The part which doesn't make sense is the EHOSTDOWN error. Clearly the
> other end isn't down. Can't say I understand what's happening here. You
> need a Xen guru now, Matthias :)


I will still try the tips from yesterday (long-running ping test) and
collect some more data. And yes - I think only someone with a strong Xen
background can really help me :-) I will follow up as soon as I have
completed my current tests.




> On Thu, 22 Jun 2023, Brian Buhrow wrote:
> > hello.  Actually, on the server side, where you get the "host is
> > down" message, that is a system error from the network stack itself.
> > I've seen it when the arp cache times out and can't be refreshed in a
> > timely manner.
>
> But, does ARP make any sense for Xen IFs? I thought MAC addresses were
> ginned up for Xen IFs...


At the moment, I manually set the MAC addresses for all DomUs in the
domain configuration file (in the network interface specification), for example:


```
name="srv-net"
type="pv"
kernel="/netbsd-XEN3_DOMU.gz"
memory=512
vcpus=2
vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]
disk = [
    'file:/data/vhd/srv-net_root.img,0x01,rw',
    'file:/data/vhd/srv-net_data1.img,0x02,rw',
    'file:/data/vhd/srv-net_data2.img,0x03,rw',
    'file:/data/vhd/srv-net_data3.img,0x04,rw',
]
```

I have made sure that there are no duplicate MAC addresses in my
network. The reason I decided to set them manually was to avoid
accidental duplicates when operating multiple Xen hosts on the same
network (my understanding is that if the mac= parameter is left out,
the Xen tooling picks a MAC address from the 00:16:3E... range).


Actually I don't believe this would make a difference - but should I try
leaving out the manual MAC address specification here as a test?


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Manuel Bouyer
On Fri, Jun 23, 2023 at 03:52:21PM +0200, Matthias Petermann wrote:
> Hi,
> 
> On 23.06.23 02:45, RVP wrote:
> > So, the server tries to write data into the socket; write() fails with
> > errno = EHOSTDOWN which sshd(8) treats as a fatal error and it exits.
> > The client tries to read/write to a closed connection, and it too quits.
> > 
> > The part which doesn't make sense is the EHOSTDOWN error. Clearly the
> > other end isn't down. Can't say I understand what's happening here. You
> > need a Xen guru now, Matthias :)
> 
> I will still try the tips from yesterday (long time ping test) and collect
> some more data. And yes - I think only someone with a strong Xen background
> can really help me :-) I will followup as soon I completed my recent tests.

I'm not sure it's Xen-specific, there have been changes in the network stack
between -9 and -10 affecting the way ARP and duplicate addresses are managed.

> 
> > 
> > On Thu, 22 Jun 2023, Brian Buhrow wrote:
> > 
> > >   hello.  Actually, on the server side, where you get the "host
> > > is down" message, that is a
> > > system error from the network stack itself.  I've seen it when the
> > > arp cache times out and
> > > can't be refreshed in a timely manner.
> > > 
> > 
> > But, does ARP make any sense for Xen IFs? I thought MAC addresses were
> > ginned up for Xen IFs...
> 
> At the moment, I manually set the MAC adresses for all DomUs in the Domain
> configuration file (at the network interface specification), example:


> 
> ```
> name="srv-net"
> type="pv"
> kernel="/netbsd-XEN3_DOMU.gz"
> memory=512
> vcpus=2
> vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]

the ip= part is not used by NetBSD.
A fixed mac address shouldn't make a difference, it's the xl tool which
generates one if needed and the domU doesn't know if it's fixed or
auto-generated.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Brian Buhrow
Hello. My understanding is that the arp caching mechanism works regardless of
whether you use static MAC addresses or dynamically generated ones. The reason
is that arp bridges the gap between the layer 2 network, i.e. the MAC
addresses, and the layer 3 network, i.e. the IP addresses those MAC addresses
map to. You can demonstrate this interaction by shutting down the vif
interface to your domu, then deleting the MAC address from the arp cache for
that vif by using arp -d , then trying to ping your domu from dom0. After
about 20 seconds, you should see the host is down message. Then use arp -a to
look for your domu's IP address. What you'll see in the MAC field is the word
"incomplete". If you then run brconfig on the bridge containing the domu,
you'll see the MAC address you assigned, or which was assigned dynamically,
alive and well.

My guess is that you're running into some sort of short term memory crunch
inside the dom0's network stack. The long term ping test should provide more
details about where this memory crunch might be. The long time favorite
variable for this issue is the good ole nmbclusters value, tunable in the
kernel config and visible through:
	/sbin/sysctl kern.mbuf.nmbclusters
Although it's a blunt instrument, the output from:
	netstat -m
might be helpful as well, specifically the value listed as the number of
calls to protocol drain routines.

Yet another possibility is a firewall, either on the dom0 or on the domu in
question. If you're running into some rule that restricts access or bandwidth
on the path between the dom0 and the domu, you might see this kind of
behavior. Unfortunately, in my experience, when one runs into a firewall
issue of this nature, the error messaging around it is very misleading. It's
important to remember that the IP stacks on the dom0 and the domu don't know
that the machine at the other end of the connection is actually running on
the same hardware. Consequently, if there are firewall rules set up on either
the dom0 or the domu in question, or possibly both, be sure they provide full
access between the dom0 and the domu, just as you would if you were writing
rules for remote machines.

The fact that you're only seeing this problem when communicating between the
dom0 and the domu, and not between the domu and the rest of the world,
suggests to me the problem is on the dom0, so I would start by looking there.

Hope these notes help.
-Brian




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Matthias Petermann

Hello Manuel,

On 23.06.23 16:17, Manuel Bouyer wrote:


> I'm not sure it's Xen-specific, there have been changes in the network stack
> between -9 and -10 affecting the way ARP and duplicate addresses are managed.



Thanks for your attention. I remember you are one of the Xen gurus RVP
recommended I call on :-) At the moment I am also far from sure this is
a Xen issue, but I try to follow up on every suspicion I become aware of.
Do you have more details on the network stack changes (maybe a link to a
change log or files I should take a look at)?




> > name="srv-net"
> > type="pv"
> > kernel="/netbsd-XEN3_DOMU.gz"
> > memory=512
> > vcpus=2
> > vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]
>
> the ip= part is not used by NetBSD.
> A fixed mac address shouldn't make a difference, it's the xl tool which
> generates one if needed and the domU doesn't know if it's fixed or
> auto-generated.


Thanks for the clarification... so then it's a one-time setup thing only.

Btw, I forgot to mention the ip= part - I agree this is not part of the
official NetBSD configuration :-) Actually I use it as a way to assign
IP addresses to my DomUs from the Dom0 config. It's part of the custom
image builder[1] I created to standardize my setups. Xen in NetBSD 10
feels (and measures...) so much more performant - I hope there is some
way to address this network issue. For the weekend I plan to reproduce
the exact same setup on higher-quality hardware to find out whether there
is some hardware-related factor. At the moment I run it on a low-cost
NUC7CJYB with a Celeron J4025 CPU and a Realtek NIC. I got some reports
that the NIC in particular could be the source of the trouble (although
I still don't understand whether it is relevant for the bridge between
Dom0 and the DomUs).


Kind regards
Matthias


[1] 
https://forge.petermann-it.de/mpeterma/vmtools/src/commit/95d55f184b9fd1d74c931abfc7d44c58f00c0c32/lib/install.sh#L81




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread RVP

On Fri, 23 Jun 2023, Brian Buhrow wrote:


> hello.  My understanding is that the arp caching mechanism works
> regardless of whether you use static MAC addresses or dynamically
> generated ones.
> [...]
> If you then run brconfig on the bridge containing the domu, you'll see the
> MAC address you assigned, or which was assigned dynamically, alive and well.



Right, but, caching implies a timeout, and is there a timeout for the MAC
addresses on Xen IFs? Does an `arp -an' indicate this? (I can't test this--
no Xen set up.)

-RVP



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-23 Thread Matthias Petermann

Hello,

On 24.06.23 01:37, RVP wrote:

> On Fri, 23 Jun 2023, Brian Buhrow wrote:
> > hello.  My understanding is that the arp caching mechanism works
> > regardless of whether you use static MAC addresses or dynamically
> > generated ones.
> > [...]
> > If you then run brconfig on the bridge containing the domu, you'll see
> > the MAC address you assigned, or which was assigned dynamically, alive
> > and well.
>
> Right, but, caching implies a timeout, and is there a timeout for the MAC
> addresses on Xen IFs? Does an `arp -an' indicate this (I can't test this--
> no Xen set up.)


On my Dom0, it looks like there is a timeout for the MAC addresses. The
lines below are random but consecutive samples of the "arp -an" command
on the Dom0 (192.168.2.50) within a timespan of ~5 minutes. What caught
my eye so far:


 - there seem to be expirations, that resolve / renew (*1)
 - there are very long timeouts (23h+) that shortly later seem to be 
reset to a shorter value (*2)


So I am wondering what the expectation should be. Are the MAC address 
timeouts supposed to be long-lived (hours...) or are they usually 
short-lived (seconds)? Does the output below indicate some issue?


Kind regards
Matthias


```
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 23h59m52s S(*2)
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 20s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 2s R
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 2s R
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 1s R
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 8s R
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 13s R
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 23h59m52s S(*2)
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 20s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 2s R
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 2s R
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 1s R
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 8s R
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 13s R
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 23h59m51s S
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 19s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 1s R
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 1s R
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 expired R   (*1)
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 7s R
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 12s R
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 16s R
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 29s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 26s R
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 26s R
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 25s R
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 2s D
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 23h59m52s S (*2)
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 10s R
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 23s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 20s R
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 20s R
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 19s R
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 26s R
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 1s D
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 29s R
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 10s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 23h59m52s S(*2)
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 23h59m52s S(*2)
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 23h59m51s S(*2)
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 23h59m58s S(*2)
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 3s R
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 25s R
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 6s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 3s D
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 3s D
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 2s D
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 23h59m54s S(*2)
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 23h59m59s S(*2)
vhost2$ doas arp -an
? (192.168.2.254) at e0:28:6d:25:44:6c on re0 23s R
? (192.168.2.191) at 98:ee:cb:f0:3c:b8 on re0 4s R
? (192.168.2.51) at 00:16:3e:00:00:01 on re0 1s D
? (192.168.2.54) at 00:16:3e:00:00:04 on re0 1s D
? (192.168.2.55) at 00:16:3e:00:00:05 on re0 30s R
? (192.168.2.52) at 00:16:3e:00:00:02 on re0 23h59m52s S(*2)
? (192.168.2.53) at 00:16:3e:00:00:03 on re0 23h59m57s S(*2)

```
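
To watch for entries like the ones marked (*1) over a longer period, the samples above could be filtered automatically. The following is only a sketch: it assumes the `arp -an` column layout shown above (expiry in the seventh field, state letter in the eighth), the function name `arp_unstable` is made up for illustration, and reading `D` as a delay/probing state is likewise an assumption.

```sh
# Print ARP entries that are expired or in the D state.
# Reads `arp -an` output (layout as in the samples above) on stdin.
arp_unstable() {
    awk '$3 == "at" && ($7 == "expired" || $8 == "D") {
        gsub(/[()]/, "", $2)      # strip the parentheses around the IP
        print $2, $4, $7, $8      # IP, MAC, expiry, state
    }'
}

# Hypothetical usage on the Dom0: one timestamped sample per second.
# while sleep 1; do date +%T; arp -an | arp_unstable; done
```

Logging only the unstable entries keeps the output small enough to correlate with the time of an ssh disconnect.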




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread RVP

On Sat, 24 Jun 2023, Matthias Petermann wrote:

On my Dom0, it looks like there is a timeout for the MAC adresses. The lines 
below are random but subsequent samples of the "arp -an" command on the Dom0 
(192.168.2.50) within a timespan of ~5 minutes. What catched my eye so far:


 - there seem to be expirations, that resolve / renew (*1)
 - there are very long timeouts (23h+) that shortly later seem to be 
reset to a shorter value (*2)




Can you do the test after increasing the `nd_reachable' timeout to 20 mins?

```
$ sudo sysctl -w net.inet.arp.nd_reachable=$((20*60*1000))
```

Increasing `net.inet.arp.nd_bmaxtries' and/or `net.inet.arp.nd_umaxtries' is
also something to try.
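
The multiplication in the command above implies that `nd_reachable' is given in milliseconds. A small helper can make that unit conversion explicit; this is a sketch only, `nd_reachable_ms` is a hypothetical name, and the script merely prints the command so it can be reviewed before being run as root.

```sh
# Convert minutes to the millisecond value used in the sysctl command above.
nd_reachable_ms() {
    echo $(( $1 * 60 * 1000 ))
}

# Print (do not run) the command for a 20-minute reachable time.
echo "sysctl -w net.inet.arp.nd_reachable=$(nd_reachable_ms 20)"
# prints: sysctl -w net.inet.arp.nd_reachable=1200000
```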

So I am wondering what the expectation should be. Are the MAC address 
timeouts supposed to be long-lived (hours...) or are they usually short-lived 
(seconds)? Does the output below indicate some issue?




ARP-derived MAC-addresses typically had a lifetime of 20 mins (1200
secs). I dunno why it is so short now (30 secs.)

Can you also show what `arp -an' shows from inside a DomU?

-RVP



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Manuel Bouyer
On Fri, Jun 23, 2023 at 11:37:23PM +, RVP wrote:
> On Fri, 23 Jun 2023, Brian Buhrow wrote:
> 
> > hello.  My understanding is that the arp caching mechanism works 
> > regardless of whether
> > you use static MAC addresses or dynamically generated ones.
> > [...]
> > If you then run brconfig on the bridge containing the domu, you'll see the 
> > MAC  address you
> > assigned, or which was assigned dynamically, alive and well.
> > 
> 
> Right, but, cacheing implies a timeout, and is there a timeout for the MAC
> addresses on Xen IFs? Does an `arp -an' indicate this (I can't test this--
> no Xen set up.)

Xen IFs are no different from regular Ethernet interfaces.

-- 
Manuel Bouyer 
 NetBSD: 26 ans d'experience feront toujours la difference
--


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Brian Buhrow
Hello.  The ARP cache timeout used to be 1200 seconds (20 minutes),
hard coded.  Now it looks like it's either 1200 seconds or 300 seconds; I'm
not sure after a quick romp through the kernel source.  In any case, the
fact that you're getting regular delays on your pings suggests there is a
delay between the time when the ARP cache times out and when it gets
refreshed.  As a consequence of that delay, if you have a high speed stream
running when the cache times out, it's possible the send buffer of the
sending process, i.e. sshd, is filling up before that cache gets refreshed
and the packets can flow again.

What is the value of net.inet.tcp.sendbuf_max on your dom0?  Also, is
net.inet.tcp.sendbuf_auto set to 1?  If not, try setting that to 1 with
sysctl(8) and see if that changes the behavior at all.

-thanks
-Brian



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Greg Troxel
Brian Buhrow  writes:

>   Hello.  The ARP cache timeout used to be 1200 seconds or 20 minutes, 
> hard coded.  Now, it
> looks like it's either 1200 seconds or 300 seconds, I'm not sure after a 
> quick romp through the
> kernel source.  In any case, The fact that you're getting regular delays on 
> your pings suggests
> there is a delay between the time when the arp cache times out and when it 
> gets refreshed.  As

However, a missing arp cache entry should result in at most a 1 RTT
delay over the local net, and it really should not be a big deal.
However, tcpdump and analysis is always a good idea, to turn theories
into observations.


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread RVP

On Sat, 24 Jun 2023, Brian Buhrow wrote:


In any case, The fact that you're getting regular delays on your pings suggests
there is a delay between the time when the arp cache times out and when it gets 
refreshed.



This would be determined by `net.inet.arp.nd_delay' I think (on
-HEAD).


As a consequence of that delay, if you have a high speed stream running when 
the cache times out,
it's possible the send buffer of the sending process, i.e. sshd, is filling up 
before that
cache gets refreshed and the packets can flow again.



In this case, the kernel would either block the sshd process or
return EAGAIN, which is handled. The kernel should only return an
EHOSTDOWN if `net.inet.arp.nd_bmaxtries' * `net.inet.arp.nd_retrans'
(i.e. 3 * 1000ms) has passed without getting an ARP response. Even
on a LAN, this is pretty unlikely (even with that peculiarly short
30-second ARP-address cache timeout). Smells like a Xen+load+timing
issue (not hand-wavy at all there, RVP!). It would be interesting
to see the tcpdump capture from the DomU.
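
The give-up window mentioned above is just the product of the two sysctl values. As a rough sketch (the helper name `arp_give_up_ms` is invented here, and the values 3 and 1000 are the ones quoted above, not measurements from the affected host):

```sh
# Time the kernel waits for an ARP response before giving up, in ms:
# number of tries multiplied by the retransmit interval.
arp_give_up_ms() {
    echo $(( $1 * $2 ))
}

arp_give_up_ms 3 1000    # prints 3000

# On the affected host the live values could be read instead:
# arp_give_up_ms "$(sysctl -n net.inet.arp.nd_bmaxtries)" \
#                "$(sysctl -n net.inet.arp.nd_retrans)"
```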

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Matthias Petermann

Hello,

On 25.06.23 03:48, RVP wrote:

On Sat, 24 Jun 2023, Brian Buhrow wrote:

In any case, The fact that you're getting regular delays on your pings 
suggests
there is a delay between the time when the arp cache times out and 
when it gets refreshed.




This would be determined by `net.inet.arp.nd_delay' I think (on
-HEAD).

As a consequence of that delay, if you have a high speed stream 
running when the cache times out,
it's possible the send buffer of the sending process, i.e. sshd, is 
filling up before that

cache gets refreshed and the packets can flow again.



In this case, the kernel would either block the sshd process or
return EAGAIN--which is handled. The kernel should only return a
EHOSTDOWN if `net.inet.arp.nd_bmaxtries' * `net.inet.arp.nd_retrans'
(ie. 3 * 1000ms) has passed without getting an ARP response. Even
on a LAN, this is pretty unlikely (even with that peculiarly short
30-second ARP-address cache timeout). Smells like a Xen+load+timing
issue (not hand-wavy at all there, RVP!). It would be interesting
to see the tcpdump capture from the DomU.

-RVP


Over the last day I did some further tests and tried out all the hints I 
got in this thread. Here is a short summary:


1) Ran a ping overnight from DomU to Dom0 -> no dropouts

2) Increased the ARP cache timeout  net.inet.arp.nd_reachable=120
   on both Dom0 and DomU -> this seemed to have an effect at first,
   but the problem still exists (it's not a measured fact but a feeling
   that it now happens a bit less often and later)

3) Checked send/receive buffer configuration

```
srv-net$ sysctl net.inet.tcp.sendbuf_auto
net.inet.tcp.sendbuf_auto = 1
srv-net$ sysctl net.inet.tcp.recvbuf_auto
net.inet.tcp.recvbuf_auto = 1
srv-net$ sysctl net.inet.tcp.sendbuf_max
net.inet.tcp.sendbuf_max = 262144
srv-net$ sysctl net.inet.tcp.recvbuf_max
net.inet.tcp.recvbuf_max = 262144
```

These samples are from DomU, but Dom0 has an identical configuration.

4) Ran the test with tcpdump from DomU -> this is currently ongoing. I 
will follow up as soon as I have the results.



Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread Brian Buhrow
Hello.  Here are the network configuration settings I've been using for
a number of years, all the way through -current.

net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216

-thanks
-Brian



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-24 Thread RVP

On Sun, 25 Jun 2023, Matthias Petermann wrote:


2) increased the ARP cache timeout  net.inet.arp.nd_reachable=120
   on both, Dom0 and DomU  -> this seemed to have an effect at first,
   but the problem still exists (its not a measured fact but a feeling,
   that it happens now a bit less often and later)



Yeah, I had a feeling that we had just delayed that error, so I've cooked
up another scheme to get an ARP-cache entry with no expiry-time. Try this:

1. Enable broadcast ICMP-Echo replies on Dom0 and all the DomUs:

sysctl -w net.inet.icmp.bmcastecho=1

2. Prime ARP-cache by doing a broadcast echo:

ping -nc10 192.168.2.255

On -HEAD, this creates ARP-address entries with no expiration-time. Don't
know if it'll work on 9.x. (On FreeBSD, the expiration-time is set to the
default of 20 mins, so YMMV.)

HTH,

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-25 Thread Matthias Petermann

Hello,

On 25.06.23 07:49, Matthias Petermann wrote:


4) Run the test with tcpdump from DomU -> this is currently ongoing. I 
will followup as soon I have the results.


This is the follow-up I promised. I was lucky this morning to catch one 
occurrence of the issue while tcpdump was running in the DomU. Because of 
the huge volume, I captured only the metadata (default output of 
tcpdump to stdout), and even so the resulting log quickly grew to 5 
GB. So I cut it down to the relevant time window and uploaded it here:


https://paste.petermann-it.de/?2ea9787bbff024f4#71N5aXYoQTdDq3tXVxBXjfmDAuw9Wdof3Dkyim99xcYG

For better classification, here is the rough timeline of the events, 
which I comment on below:


1) 08:52:07.595169

Begin active monitoring
Continuous ssh packet flow from srv-net.lan (DomU) -> vhost2.lan (Dom0)

2) 08:52:07.595831

Noticed a lot of ARP-related traffic
ssh packet flow seems to be slowed down / paused
Client (ssh) reported "Connection to srv-net.lan closed by remote host."

3) 08:52:21.xx

Client (backup script) reported that it created a new ssh connection to
the remote host and started the next dump.


Somewhere between 2) and 3) there should be the answer to the question. 
Apologies for the noise in the log file; this host is quite busy, 
and I fear that removing lines that I consider unrelated might result in 
unintentional misdirection of the analysis.
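
One way to cut down such a log without hand-picking lines is to keep only the ARP packets and the TCP segments that end a connection. This is a sketch under assumptions: it relies on tcpdump's default text output (as in the uploaded log), where RST and FIN appear as `R` and `F` inside `Flags [...]`, and the name `tcpdump_events` and the file name `capture.log` are hypothetical.

```sh
# Keep only ARP lines and TCP segments carrying RST or FIN from a
# tcpdump text log read on stdin.
tcpdump_events() {
    grep -E 'ARP, |Flags \[[^]]*[RF][^]]*\]'
}

# Hypothetical usage on a saved text log:
# tcpdump_events < capture.log
```

Ordinary data segments (`[P.]`) and plain ACKs (`[.]`) are dropped, so the moment of the disconnect stands out next to the surrounding ARP traffic.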


So far, thanks for all your time and valuable support - it helps 
a lot to understand the system even better.


Kind regards
Matthias





Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-26 Thread RVP

On Sun, 25 Jun 2023, Matthias Petermann wrote:

Somewhere between 2) and 3) there should be the answer to the question. 



```
08:52:07.595831 ARP, Request who-has vhost2.lan tell srv-net.lan, length 28
08:52:07.595904 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
08:52:07.595919 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
08:52:07.595921 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
08:52:07.595921 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
08:52:07.595926 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
[...]
08:52:07.627118 IP srv-net.lan.ssh > vhost2.lan.54243: Flags [R], seq 3177171235, win 0, length 0
```

Well, this doesn't look like an ARP timeout issue. The DomU does the ARP-query
and gets back an answer from the Dom0 right away. In fact the Dom0 sends multiple
replies to the query (I don't know what that means nor if it's relevant to your
issue...); then sshd on the DomU gets an EHOSTDOWN and exits, and the kernel sends
a reset TCP packet in response to more data coming to that socket.

I may have to replicate your setup to dig into this. Maybe this weekend. Send
instructions on how to set-up Xen. In the meantime, can you:

1. post the output of `ifconfig' on all your DomUs
2. tell me if `dhcpcd' is running on the DomUs?

-RVP


Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-26 Thread Matthias Petermann

Hi,

On 26.06.23 10:41, RVP wrote:

On Sun, 25 Jun 2023, Matthias Petermann wrote:


Somewhere between 2) and 3) there should be the answer to the question.


```
08:52:07.595831 ARP, Request who-has vhost2.lan tell srv-net.lan, length 28
08:52:07.595904 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui 
Unknown), length 28
08:52:07.595919 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui 
Unknown), length 28
08:52:07.595921 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui 
Unknown), length 28
08:52:07.595921 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui 
Unknown), length 28
08:52:07.595926 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui 
Unknown), length 28

[...]
08:52:07.627118 IP srv-net.lan.ssh > vhost2.lan.54243: Flags [R], seq 
3177171235, win 0, length 0

```

Well, this doesn't look like an ARP timeout issue. The DomU does the 
ARP-query
and gets back an answer from the Dom0 right away. In fact the Dom0 sends 
multiple
replies to the query (I don't know what that means nor if it's relevant 
to your
issue...); then sshd on the DomU gets a EHOSTDOWN and exits, and the 
kernel sends

a reset TCP packet in response to more data coming to that socket.



Could it still be an ARP related issue? I did a simplified version of 
the test this morning:


```
ssh user@srv-net /bin/dd if=/dev/zero > test.img
```

while running tcpdump in the DomU. Exactly at the time when I got the 
"Connection to srv-net closed by remote host." on the client side, 
tcpdump shows a pattern very similar to the tcpdump from yesterday:


```
14:02:39.132635 IP srv-net.lan.ssh > vhost2.lan.56867: Flags [P.], seq 1107922413:1107922961, ack 2414700, win 4197, options [nop,nop,TS val 7788 ecr 7786], length 548
14:02:39.132678 IP vhost2.lan.56867 > srv-net.lan.ssh: Flags [.], ack 1107922961, win 24609, options [nop,nop,TS val 7786 ecr 7788], length 0
14:02:39.132758 IP srv-net.lan.ssh > vhost2.lan.56867: Flags [P.], seq 1107922961:1107923509, ack 2414700, win 4197, options [nop,nop,TS val 7788 ecr 7786], length 548
14:02:39.132823 ARP, Request who-has vhost2.lan tell srv-net.lan, length 28
14:02:39.133234 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133237 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133238 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133239 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133240 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133241 ARP, Reply vhost2.lan is-at 88:ae:dd:02:a4:03 (oui Unknown), length 28
14:02:39.133251 IP srv-net.lan.ssh > vhost2.lan.56867: Flags [P.], seq 1107923509:1107924057, ack 2414700, win 4197, options [nop,nop,TS val 7788 ecr 7786], length 548
14:02:39.133289 IP vhost2.lan.56867 > srv-net.lan.ssh: Flags [.], ack 1107924057, win 24609, options [nop,nop,TS val 7786 ecr 7788], length 0
14:02:39.137375 IP srv-net.lan.ssh > vhost2.lan.56867: Flags [F.], seq 1107924057, ack 2414700, win 4197, options [nop,nop,TS val 7788 ecr 7786], length 0
14:02:39.137437 IP vhost2.lan.56867 > srv-net.lan.ssh: Flags [.], ack 1107924058, win 24677, options [nop,nop,TS val 7786 ecr 7788], length 0
14:02:39.137568 IP vhost2.lan.56867 > srv-net.lan.ssh: Flags [P.], seq 2414700:2414760, ack 1107924058, win 24677, options [nop,nop,TS val 7786 ecr 7788], length 60
14:02:39.137588 IP srv-net.lan.ssh > vhost2.lan.56867: Flags [R], seq 645276183, win 0, length 0
```

> I may have to replicate your setup to dig into this. Maybe this weekend.
> Send
> instructions on how to set-up Xen. In the meantime, can you:
>
> 1. post the output of `ifconfig' on all your DomUs

```
❯ for i in srv-net srv-iot srv-mail srv-app srv-extra;do echo "--\n-- ifconfig of DomU $i\n--"; ssh user@$i /sbin/ifconfig -a;done

--
-- ifconfig of DomU srv-net
--
xennet0: flags=0x8843 mtu 1500
capabilities=0x3fc00
capabilities=0x3fc00
enabled=0
ec_capabilities=0x5
ec_enabled=0
address: 00:16:3e:00:00:01
inet6 fe80::216:3eff:fe00:1%xennet0/64 flags 0 scopeid 0x1
inet 192.168.2.51/24 broadcast 192.168.2.255 flags 0
lo0: flags=0x8049 mtu 33624
status: active
inet6 ::1/128 flags 0x20
inet6 fe80::1%lo0/64 flags 0 scopeid 0x2
inet 127.0.0.1/8 flags 0
--
-- ifconfig of DomU srv-iot
--
xennet0: flags=0x8843 mtu 1500
capabilities=0x3fc00
capabilities=0x3fc00
enabled=0
ec_capabilities=0x5
ec_enabled=0
address: 00:16:3e:00:00:02
inet6 fe80::216:3eff:fe00:2%xennet0/64 flags 0 scopeid 0x1
inet 192.168.2.52/24 broadcast 192.168.2.255 flags 0
lo0: flags=0x8049 mtu 33624
status: active
inet6 ::1/128 flags 0x20
inet6 fe80::1%lo0/64 flags 0 scopeid 0x2
inet 127.0.0.1/8 flags 0
--
-- ifconfig of DomU srv-mail
--
xennet0: flags=0x8843 mtu 1500
capabilitie
[...]
```

Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-26 Thread RVP

On Mon, 26 Jun 2023, Matthias Petermann wrote:

Could it still be an ARP related issue? I did a simplified version of the 
test this morning:




Try this test: since you have static IP- & MAC-addresses everywhere in
your setup, just add them as static ARP entries (skip own address):

On each of your DomUs and the Dom0:

arp -d -a   # delete ARP-cache
arp -s IP-addr1 MAC-addr1
arp -s IP-addr2 MAC-addr2

etc.

On the Dom0, add the addrs. of the DomUs. On each of the DomUs, the addrs.
of Dom0 and _other_ DomUs.

Do your tests.
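
For more than a couple of hosts, the `arp -s` lines can be generated from a single list of "IP MAC" pairs. This is only a sketch: `static_arp_cmds` is a made-up helper name, and it prints the commands for review rather than running them.

```sh
# Read "IP MAC" pairs on stdin and print one `arp -s` command per pair,
# skipping the host's own address (given as the first argument).
static_arp_cmds() {
    own_ip=$1
    while read -r ip mac; do
        if [ -z "$ip" ] || [ "$ip" = "$own_ip" ]; then
            continue
        fi
        echo "arp -s $ip $mac"
    done
}

# Example for the DomU 192.168.2.51, using addresses from this thread:
printf '%s\n' \
    '192.168.2.50 88:ae:dd:02:a4:03' \
    '192.168.2.51 00:16:3e:00:00:01' \
    '192.168.2.52 00:16:3e:00:00:02' | static_arp_cmds 192.168.2.51
```

The same list can be reused on every host; only the "own address" argument changes, which avoids a host adding a static entry for itself.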

-RVP




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-26 Thread Brian Buhrow
Hello.  A couple of quick questions based on the conversation and the
snippets of logs shown in the e-mails.

1.  Is the MAC address shown in the ARP replies the correct one for the dom0?
No reason it should be wrong, but it's worth verifying, just in case there is
an unknown host replying on the network.

2. Can you capture the same tcpdumps using the -e flag?  The -e flag will print
the source and destination MAC addresses, as well as the source and destination
IP addresses or host names, depending on whether you use the -n flag.  This
might provide additional insight into what's happening on the network.
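
A capture along those lines could also be narrowed with a filter so that the log stays small. The following is a sketch, not a tested recipe for this setup: the interface name is passed in by the caller, the function name `capture_arp_ssh` is hypothetical, and the filter deliberately drops ordinary ssh data segments, which may hide details that turn out to matter.

```sh
# Capture filter: ARP plus SSH segments that carry RST or FIN.
CAPTURE_FILTER='arp or (tcp port 22 and tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)'

# Start the capture on the given interface. -e prints the link-level
# (MAC) headers, -n disables name resolution. Defined only, not run here.
capture_arp_ssh() {
    tcpdump -e -n -i "$1" "$CAPTURE_FILTER"
}

# Example (as root in the DomU): capture_arp_ssh xennet0
```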

-thanks
-Brian



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-29 Thread Matthias Petermann

Hi,

On 26.06.23 15:37, RVP wrote:

On Mon, 26 Jun 2023, Matthias Petermann wrote:

Could it still be an ARP related issue? I did a simplified version of 
the test this morning:




Try this test: since you have static IP- & MAC-addresses everywhere in
your setup, just add them as static ARP entries (skip own address):

On each of your DomUs and the Dom0:

arp -d -a    # delete ARP-cache
arp -s IP-addr1 MAC-addr1
arp -s IP-addr2 MAC-addr2

etc.

On the Dom0, add the addrs. of the DomUs. On each of the DomUs, the addrs.
of Dom0 and _other_ DomUs.

Do your tests.

-RVP


While I do not want to praise the day before the evening, you deserve 
some feedback. Both the synthetic test with ssh/dd and my real payload 
with ssh/dump have been running for a good 6 hours without interruption 
this morning. I took the advice and first made static entries in the ARP 
tables of the two partners directly involved (Dom0 and the DomU 
concerned), each pointing at the other. I will continue to monitor this, 
but it looks much better now than in the days before.


If this proves to be a reproducible solution, my next question would 
be how it could be made persistent (apart from hard-coding the arp -d -a / 
-s calls into rc.local etc.). The former proposal you sent me 
(net.inet.icmp.bmcastecho=1  and ping -nc10) did not create ARP addresses 
with no expiration time on my NetBSD 10.0_BETA system. You mentioned 
this might be a feature of -HEAD - not sure about 10...


I also wanted to mention - and I don't know how this contributes - that 
mDNSd is enabled on all involved hosts. I had originally planned this so 
that the hosts can also find each other via the .local suffix if the 
local domain .lan cannot be resolved - for example if the DNS server is 
down.


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-29 Thread Matthias Petermann

Hi Brian,

On 26.06.23 16:17, Brian Buhrow wrote:

hello.  A couple of quick questions based on the convrsation and the 
snippets of logs
shown in the e-mails.

1.  Is the MAC address shown in the ARP replies the correct one for the dom0?  
No reason it
should be wrong, but it's worth verifying, just in case there is an unknown 
host replying on
the network.


The addresses match - I just verified this. Anyway, thanks for the pointer.



2. Can you capture the same tcpdumps using the -e flag?  The -e flag will print 
the source and
destination MAC addresses, as wel as the source and destination IP addresses or 
host names,
depending on whether you use the -n flag.  This might provide additional 
insight into what's
happening on the network.


Since I added static ARP records, the problem did not occur another 
time. I did stop tcpdump for now to save space, but I will consider the 
-e flag next time.


Kind regards
Matthias





Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-29 Thread Matthias Petermann

Hello,

On 29.06.23 11:58, Matthias Petermann wrote:
While I do not want to praise the evening before the dayyou deserve 
some feedback. Both the synthetic test with ssh/dd and my real payload 
with ssh/dump have been running for easily 6 hours without interruption 
this morning. I took the advice and first made static entries in the ARP 
table for each other for the two partners directly involved (Dom0 and 
the DomU concerned). I will continue to monitor this but it looks much 
better now than the days before.


In case this proves as a reproduceable solution, my next question would 
be how this could be persisted (apart from hard-coding the arp -d -a / 
-s calls into rc.local etc.). The former proposal you sent me 
(net.inet.icmp.bmcastecho=1  and ping -nc10) did not create ARP-adresses 
with no expiration time on my NetBSD 10.0_BETA system. You mentioned 
this might be a feature of -HEAD - not sure about 10...


I also wanted to mention - and I don't know how this contributes - that 
mDNSd is enabled on all involved hosts. I had originally planned this so 
that the hosts can also find each other via the .local suffix if the 
local domain .lan cannot be resolved - for example if the DNS server is 
down.


Kind regards
Matthias


With the assignment of permanent ARP entries, everything worked stably 
for the whole day yesterday. It seems to be due to the ARP entries. I've 
done some work on how to make this persistent at least as a workaround 
and found /etc/ethers in combination with /usr/sbin/arp -f /etc/ethers 
to be suitable.


Anyway, while applying this change and doing further testing, something 
weird came to my attention. Is this expected?


Please see the MAC address configured in the DomU config file (on Dom0):

```
name="srv-net"
type="pv"
kernel="/netbsd-XEN3_DOMU.gz"
memory=512
vcpus=2
vif = ['mac=00:16:3E:00:00:01,bridge=bridge0,ip=192.168.2.51' ]
disk = [
    'file:/data/vhd/srv-net_root.img,0x01,rw',
    'file:/data/vhd/srv-net_data1.img,0x02,rw',
    'file:/data/vhd/srv-net_data2.img,0x03,rw',
    'file:/data/vhd/srv-net_data3.img,0x04,rw',
]
```

In the DomU, this configured MAC address matches the MAC of the virtual 
network interface:


```
srv-net$ ifconfig xennet0
xennet0: flags=0x8843 mtu 1500
capabilities=0x3fc00
capabilities=0x3fc00
enabled=0
ec_capabilities=0x5
ec_enabled=0
address: 00:16:3e:00:00:01
inet6 fe80::216:3eff:fe00:1%xennet0/64 flags 0 scopeid 0x1
inet 192.168.2.51/24 broadcast 192.168.2.255 flags 0
```

In contrast, on the Dom0 the related Xen backend network 
interface has a slightly different MAC:


```
xvif1i0: flags=0x8943 mtu 1500
capabilities=0x3fc00
capabilities=0x3fc00
enabled=0
ec_capabilities=0x5
ec_enabled=0x1
address: 00:16:3e:01:00:01
inet6 fe80::216:3eff:fe01:1%xvif1i0/64 flags 0 scopeid 0x4
```

It differs in the 4th octet, and I am wondering whether this is intended.
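
To check this for every DomU without comparing by eye, the two addresses can be compared octet by octet. A minimal sketch follows; `mac_diff_octets` is a made-up helper, and it assumes colon-separated six-octet MAC addresses as shown above.

```sh
# Print the 1-based positions of the octets in which two MAC addresses
# differ (case-insensitive), separated by spaces.
mac_diff_octets() {
    a=$(echo "$1" | tr 'A-F' 'a-f')
    b=$(echo "$2" | tr 'A-F' 'a-f')
    i=1
    diff_pos=""
    while [ "$i" -le 6 ]; do
        oa=$(echo "$a" | cut -d: -f"$i")
        ob=$(echo "$b" | cut -d: -f"$i")
        [ "$oa" = "$ob" ] || diff_pos="$diff_pos${diff_pos:+ }$i"
        i=$((i + 1))
    done
    echo "$diff_pos"
}

# MAC from the DomU config vs. MAC of the backend interface on the Dom0:
mac_diff_octets 00:16:3E:00:00:01 00:16:3e:01:00:01    # prints 4
```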

Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-29 Thread Brian Buhrow
Hello.  Yes, this behavior is expected.  It ensures that there is no
conflict between the device on the domu end of the vif port and the device
on the dom0 end.  This is more sane behavior than FreeBSD, which zeros out
the MAC address on the dom0 side of the vif.

-thanks
-Brian



Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-29 Thread Matthias Petermann

Hi,

On 30.06.23 07:07, Brian Buhrow wrote:

hello.  Yes, this behavior is expected.  It ensures that there is no 
conflict between the
device on the domu end of the vif port and the device on the dom0 end.  This is 
more
sane behavior than FreeBSD, which zeros out the MAC address on the dom0 side of 
the vif.

-thanks
-Brian



Thanks for the clarification. Good to know this is not a bug.

Overall the topic is now a lot clearer, and with the static ARP 
entries in place I can finally return to having daily backups of my VMs :-)


The root cause, I assume, requires further investigation. 
To make this independent of the system where it originally occurred, I 
plan to create a minimal setup to reproduce it on a similarly sized 
system, and will support the investigation as much as I can.


Kind regards
Matthias




Re: ssh client_loop send disconnnect from Dom0 -> DomU (NetBSD 10.0_BETA/Xen)

2023-06-30 Thread RVP

On Thu, 29 Jun 2023, Matthias Petermann wrote:

The former proposal you sent me 
(net.inet.icmp.bmcastecho=1  and ping -nc10) did not create ARP-adresses with 
no expiration time on my NetBSD 10.0_BETA system. You mentioned this might be 
a feature of -HEAD - not sure about 10...




I should have mentioned that you had to do `arp -d -a' first before
starting the broadcast ping. Otherwise, the existing entries are not
replaced. (Even with that, this method is a bit fiddly...)
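
Put together, the order of the steps would look roughly like the sketch below. The helper names `run` and `prime_arp_cache` and the `DRY_RUN` switch are invented for illustration; with `DRY_RUN=1` the commands are only printed. Note that the `bmcastecho` sysctl has to be set on every host that should answer, not only on the one sending the ping.

```sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flush the ARP cache first, enable replies to broadcast pings, then
# ping the broadcast address given as the first argument.
prime_arp_cache() {
    run arp -d -a
    run sysctl -w net.inet.icmp.bmcastecho=1
    run ping -nc10 "$1"
}

DRY_RUN=1
prime_arp_cache 192.168.2.255
```

In dry-run mode this prints the three commands in the order they would be executed, which makes it easy to confirm that the cache flush comes before the ping.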

Well, I've got another partition running 10.0_BETA set up for the Xen
tests. Can't find any binary package for xentools415... Might have to
compile from pkgsrc...

-RVP