Re: [Simh] Cluster communications errors

2018-07-24 Thread Hunter Goatley

On 7/20/2018 3:53 PM, Johnny Billquist wrote:


Well, I can at least confirm that on a real 8650, the network also briefly 
goes down and comes back up when DECnet is started. If I remember correctly, 
it even happens independently of whether the machine is in a cluster or not.


Thanks.

As a followup, the SIMH instance has been running without incident, 
other than the single drop when DECnet is started, for over four days 
now. Everything has been rock-solid. Apparently, that Intel network card 
had some issue that was causing what I was seeing.


Thanks again for all the replies. It was a most enlightening discussion, 
and I have a much better handle on how SIMH works now.


Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Larry Baker
> On 20 Jul 2018, at 6:06:31 PM, simh-requ...@trailing-edge.com wrote:
> 
> Message: 1
> Date: Fri, 20 Jul 2018 22:53:33 +0200
> From: Johnny Billquist <b...@softjar.se>
> To: simh@trailing-edge.com
> Subject: Re: [Simh] Cluster communications errors
> Message-ID: <ee193165-438d-a2ac-f867-991ad44b8...@softjar.se>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> On 2018-07-20 16:20, Mark Pizzolato wrote:
>> On Friday, July 20, 2018 at 6:05 AM, Hunter Goatley wrote:
>>> On 7/20/2018 7:58 AM, Paul Koning wrote:
>>>> Is that the Ethernet interface down/up that happens when DECnet sets the
>>>> MAC address?  I assume you don't have a card that supports multiple MAC
>>>> addresses.
>>> 
>>> It probably is. That makes total sense, and I should have realized that.
>> 
>> I'm quite sure that the MAC address change happens much earlier than
>> the DECnet startup when a VMS cluster is configured on the system.  As
>> soon as the booting system gets its SYSGEN parameters and knows its
>> SCSSYSTEMID it 1) has enough info to set the DECnet MAC address and
>> 2) is capable of engaging in cluster communications using this ID (and MAC).
> 
> Well, I can at least confirm that on a real 8650, the network also briefly 
> goes down and comes back up when DECnet is started. If I remember correctly, 
> it even happens independently of whether the machine is in a cluster or not.

Yeah, when I read Mark's comments I was thinking DECnet would still do what 
DECnet always does, which is change the MAC address to match the DECnet node 
number.  There is no particular advantage in checking first whether that has 
already been done.  All that's required is that the SCSSYSTEMID match the 
DECnet node number; whether one or both actually set up the hardware MAC 
address is not specified, from what I recall.
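
For reference, the Phase IV MAC address is derived mechanically from the 
DECnet address: AA-00-04-00 followed by the 16-bit value area*1024 + node, 
low byte first. A quick shell sketch, using a hypothetical node 1.13:

   # DECnet 1.13 -> 1*1024 + 13 = 1037 = 0x040D
   printf 'AA-00-04-00-%02X-%02X\n' $((1037 & 0xFF)) $((1037 >> 8))
   # prints AA-00-04-00-0D-04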

> If someone is really interested I can boot the machine up to VMS and 
> capture the output. I can also do some other tests and checks if anyone 
> is interested.
> 
>   Johnny
> 
> -- 
> Johnny Billquist  || "I'm on a bus
>   ||  on a psychedelic trip
> email: b...@softjar.se ||  Reading murder books
> pdp is alive! ||  tryin' to stay hip" - B. Idol
> 

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Johnny Billquist

On 2018-07-20 16:20, Mark Pizzolato wrote:

On Friday, July 20, 2018 at 6:05 AM, Hunter Goatley wrote:

On 7/20/2018 7:58 AM, Paul Koning wrote:

Is that the Ethernet interface down/up that happens when DECnet sets the
MAC address?  I assume you don't have a card that supports multiple MAC
addresses.


It probably is. That makes total sense, and I should have realized that.


I'm quite sure that the MAC address change happens much earlier than
the DECnet startup when a VMS cluster is configured on the system.  As
soon as the booting system gets its SYSGEN parameters and knows its
SCSSYSTEMID it 1) has enough info to set the DECnet MAC address and
2) is capable of engaging in cluster communications using this ID (and MAC).


Well, I can at least confirm that on a real 8650, the network also briefly 
goes down and comes back up when DECnet is started. If I remember correctly, 
it even happens independently of whether the machine is in a cluster or not.


If someone is really interested I can boot the machine up to VMS and 
capture the output. I can also do some other tests and checks if anyone 
is interested.


  Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Hunter Goatley

On 7/20/2018 9:20 AM, Mark Pizzolato wrote:

I agree with Paul completely here, and I wonder, at least for the sake of
proving whether the USB device is a factor, why not merely share the host
system's primary LAN.  Nothing special is needed to get this to work except
changing the ATTACH XQ argument in the configuration file.  When using that
LAN interface, without jumping through hoops configuring internal bridging,
the host won't be able to talk to the simh VAX instance, but I suspect that
may not be a high priority.


No, it's not, and I hadn't done that at first because I didn't remember 
seeing the host system's primary LAN device when I first started. I must 
have just overlooked it, because it's there now, and I just booted using 
it (attach xq eth0). (I was probably so bent on using the dedicated 
device that I overlooked the primary device when I did SHOW ETHER.)


I was also mistaken about the dedicated device. It's not a USB device. 
It's actually an Intel PCI-X card that's in the host system, which 
now makes me even more confused. ;-)


The system booted using the host's primary device and is running fine, 
though it still had the drop/re-establish when DECnet was started. But 
everything else is working just fine: no subsequent drops, and DECnet, 
TCP/IP, and clustering are all working as expected.


We're just going to pull the second card and run off the primary LAN device.

Thanks again for all your help!

Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Mark Pizzolato
On Friday, July 20, 2018 at 5:31 AM, Hunter Goatley wrote:
> On 7/19/2018 10:34 PM, Mark Pizzolato wrote:
[...]
> So I took it down again and did SET THROTTLE 80%.  Still considerably 
> slower, but workable. And as soon as DECnet started, it lost 
> communication and re-established it. It's now two minutes farther 
> into the boot with no further drops.

FYI: Throttling is merely part of identifying what is causing the problem.

Unless your simh VAX cluster member is VERY busy, throttling at 80%
will probably use at least 15 times more host system CPU cycles than
idling.  The 80% number will really use 80% of one CPU core continuously 
even when nothing at all is going on in the running VMS environment.

- Mark
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Mark Pizzolato
On Friday, July 20, 2018 at 6:05 AM, Hunter Goatley wrote:
> On 7/20/2018 7:58 AM, Paul Koning wrote:
> > Is that the Ethernet interface down/up that happens when DECnet sets the
> > MAC address?  I assume you don't have a card that supports multiple MAC
> > addresses.
> 
> It probably is. That makes total sense, and I should have realized that.

I'm quite sure that the MAC address change happens much earlier than 
the DECnet startup when a VMS cluster is configured on the system.  As 
soon as the booting system gets its SYSGEN parameters and knows its 
SCSSYSTEMID it 1) has enough info to set the DECnet MAC address and
2) is capable of engaging in cluster communications using this ID (and MAC).

> > On the USB thing: USB bridge things are often consumer grade devices, and
> > while they may "work" in the sense that you can get a packet in and out, I
> > would not necessarily expect them to behave sanely under any nontrivial 
> > load.
> > The same way I would not expect to run a cluster on a $50 Ethernet switch.
> 
> True. We have some USB dongles we've used with CHARON-VAX for years
> without incident, but I don't even know if these are the same brand
> dongles. Even if they are, that doesn't mean anything, of course.
> 
> I'm not a hardware kind of guy, so I tend to miss some of the obvious
> things, like remembering that the "dedicated card" is a USB dongle of
> unknown make. ;-)

I agree with Paul completely here, and I wonder, at least for the sake of
proving whether the USB device is a factor, why not merely share the host
system's primary LAN.  Nothing special is needed to get this to work except
changing the ATTACH XQ argument in the configuration file.  When using that
LAN interface, without jumping through hoops configuring internal bridging,
the host won't be able to talk to the simh VAX instance, but I suspect that
may not be a high priority.
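
Concretely, that is a one-line change in the simh configuration file; a 
sketch, assuming the host's primary interface shows up as eth0 in SHOW 
ETHERNET:

   ;attach xq tap:tap0      ; previous dedicated/bridged attachment
   attach xq eth0           ; share the host's primary LAN instead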

> So I took it down again and did SET THROTTLE 80%.  Still considerably slower, 
> but workable. And as soon as DECnet started, it lost communication and 
> re-established it. It's now two minutes farther into the boot with no further 
> drops.
>
> It drops between the "Starting DECnet" OPCOM message and the first 
> "adjacency up" OPCOM message. After that, all is well.
 
I would be quite surprised if a USB LAN device actually provided reliable 
status/statistic information to the host it is connected to beyond the
basics of link connection state and/or speed settings.

- Mark

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Hunter Goatley

On 7/20/2018 7:58 AM, Paul Koning wrote:

Is that the Ethernet interface down/up that happens when DECnet sets the MAC 
address?  I assume you don't have a card that supports multiple MAC addresses.


It probably is. That makes total sense, and I should have realized that.


On the USB thing: USB bridge things are often consumer grade devices, and while they may 
"work" in the sense that you can get a packet in and out, I would not 
necessarily expect them to behave sanely under any nontrivial load.  The same way I would 
not expect to run a cluster on a $50 Ethernet switch.


True. We have some USB dongles we've used with CHARON-VAX for years 
without incident, but I don't even know if these are the same brand 
dongles. Even if they are, that doesn't mean anything, of course.


I'm not a hardware kind of guy, so I tend to miss some of the obvious 
things, like remembering that the "dedicated card" is a USB dongle of 
unknown make. ;-)



Good to hear things are looking better now.


Thank you all for your help!

Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Paul Koning


> On Jul 19, 2018, at 11:18 PM, Hunter Goatley  wrote:
> 
> Another data point. After more playing around and several reboots, I can 
> confirm that with tunneling using the host system's Ethernet device, 
> communications with other cluster members only drops when DECnet is started.
> %%%  OPCOM  19-JUL-2018 23:14:55.58  %%%
> Message from user DECNET on DARTH
> DECnet starting
> 
> %CNXMAN,  lost connection to system QUEST
> %CNXMAN,  lost connection to system GALAXY
> %CNXMAN,  re-established connection to system FASTER
> %CNXMAN,  quorum lost, blocking activity
> %CNXMAN,  re-established connection to system VADER
> %CNXMAN,  re-established connection to system QUEST
> %CNXMAN,  quorum regained, resuming activity
> 
> That's not a full log, but as soon as I see the OPCOM message about DECnet 
> starting, I get the "lost connection" messages, then the "re-established" 
> messages, and then everything is fine afterward.

Is that the Ethernet interface down/up that happens when DECnet sets the MAC 
address?  I assume you don't have a card that supports multiple MAC addresses.

On the USB thing: USB bridge things are often consumer grade devices, and while 
they may "work" in the sense that you can get a packet in and out, I would not 
necessarily expect them to behave sanely under any nontrivial load.  The same 
way I would not expect to run a cluster on a $50 Ethernet switch.

Good to hear things are looking better now.

paul

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Hunter Goatley

Hi, Dave.

I run an ESXi host on a USDT system and use a USB3 LAN dongle to give 
me a separate network for user/management traffic so I can use the 
onboard one for iSCSI. This was done following the article here:


https://www.virtuallyghetto.com/2016/03/working-usb-ethernet-adapter-nic-for-esxi.html


Thanks for the link!

I note that that USB interface can be dropping packets all the time; it's 
not a big problem if the protocols can handle that, and RDP etc. suffer 
no real issues. But running something like TotalNetworkMonitor on a VM 
there, you do see that up to 50% or so of ping packets are lost in 
its probes.


Could be that you are seeing a similar behaviour where the protocol 
doesn't handle lost packets too well...


Yeah, that's what it sounds like. I'll try to run some other tests on 
the USB dongle to see if I see anything else odd.
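
For a crude loss measurement, plain ping through the dongle works; a sketch, 
assuming the dongle shows up as eth1 and a cluster member answers ICMP 
(target address is a placeholder):

   # 500 pings at 5/second; the summary line reports the % packet loss
   ping -I eth1 -i 0.2 -c 500 192.168.1.10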


Thanks!

Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Hunter Goatley

On 7/19/2018 10:34 PM, Mark Pizzolato wrote:

The improvement by setting the port speed to 10Mbit suggests
that packet loss/overruns are happening and they are reduced
by limiting the wire speed.


Agreed, though nothing ever indicated any errors or overruns: not the 
switch, not NCP or LANCP on any nodes.



The arrival of DECnet's traffic might be causing a burst of traffic
that still ends up overrunning another system's ability to receive
it.  Do things change if you throttle the simh VAX down?

   sim> SET CPU NOIDLE
   sim> SET THROTTLE 25%


Wow. That was a flashback to 1987, when I was working on a VAX 11/730 
with four other developers at the same time. ;-) We all got lots of 
pleasure-reading done waiting for product builds.


Continued this morning: I ended up going to bed, it was taking so long. 
I woke this morning to find that the startup took about four hours to 
complete, and it had spent the next three hours losing and 
re-establishing communications every 40 seconds. I'm guessing the system 
was /so/ slow that it didn't respond fast enough to suit the other members.


So I took it down again and did SET THROTTLE 80%.  Still considerably 
slower, but workable. And as soon as DECnet started, it lost 
communication and re-established it. It's now two minutes farther into 
the boot with no further drops.


It drops between the "Starting DECnet" OPCOM message and the first 
"adjacency up" OPCOM message. After that, all is well.


Thanks.

Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-20 Thread Dave L

Hi Hunter

I run an ESXi host on a USDT system and use a USB3 LAN dongle to give me a  
separate network for user/management traffic so I can use the onboard one 
for iSCSI. This was done following the article here:


https://www.virtuallyghetto.com/2016/03/working-usb-ethernet-adapter-nic-for-esxi.html

I note that that USB interface can be dropping packets all the time; it's not a 
big problem if the protocols can handle that, and RDP etc. suffer no real 
issues. But running something like TotalNetworkMonitor on a VM there, you 
do see that up to 50% or so of ping packets are lost in its probes.


Could be that you are seeing a similar behaviour where the protocol  
doesn't handle lost packets too well...


regards
Dave



On Fri, 20 Jul 2018 03:15:34 +0100, Hunter Goatley  
 wrote:


Here's where we stand on our cluster communications errors: nothing we
did worked. We tried different ports on the switch. We tried forcing
1Gbps. We tried forcing the port down to 10 Mbps. That actually seemed
to help slightly, in that we only lost communications every 63 seconds
or so, instead of every 15--60 seconds. But it would lose and
re-establish connection to the cluster every 63 seconds.


So I decided to try setting up and using a TAP device, just to see what
would happen.


Using the dedicated Ethernet card, it made no difference. It still lost
communications every 63 seconds.


When I say dedicated Ethernet card, I probably should have stated
earlier that it's a USB -> Ethernet device plugged into the system. I
don't know what brand or model, but I can find out, if anyone wants to
know.


So I decided to try tunneling through the "real" Ethernet port used by
the Linux system. After figuring out what to do for the missing tunctl
command under CentOS, I was able to set up a tunnel, and I did "attach
xq tap:tap0". I then booted the system and wonder of wonders, miracle of
miracles, it was seven minutes into the boot (yes, it takes a long
time, mounting a slew of disks that needed to be rebuilt) before it lost
communications. But it re-established them immediately, and as of my
typing this, it has been twenty-nine minutes since that happened. No
further drops. Normally, I wouldn't think twenty-nine minutes is enough
to prove anything, but when it was dropping every 15--63 seconds for two
solid days, this sounds like a fix to me.


So what does it mean? One thing it suggests is that the USB Ethernet
device may be buggy or bad. I mean, it seems to work OK for TCP/IP
communications, etc., but it sure sounds like it may be the part
responsible for the problems. Especially since tunneling through the
built-in Ethernet card seems to work and tunneling through the USB
device did not.


These are the commands I used to set up the tap device for CentOS:

brctl addbr br0
ifconfig eno1 0.0.0.0          # eno1 is the host's Ethernet device
ifconfig br0 XXX.XX.XX.XX up   # the IP address of the host system
brctl addif br0 eno1
brctl setfd br0 0
#tunctl -t tap0
ip tuntap add tap0 mode tap    # Replacement for tunctl on CentOS 7
brctl addif br0 tap0
ifconfig tap0 up

I then just did "attach xq tap:tap0" in the init file. I guess I should
set up a special MAC address, but I haven't yet, and so far, nothing
seems amiss.


While I thought having a dedicated Ethernet device would be the simplest
thing, I can live with tunneling it through the shared Ethernet device,
especially since it works and the former does not. ;-)


Thank you for all of your input over the past couple of days, and thank
you for all of your work on SIMH!


Hunter





___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-19 Thread Mark Pizzolato
On Thursday, July 19, 2018 at 8:18 PM, Hunter Goatley wrote:
> Another data point. After more playing around and several reboots, 
> I can confirm that with tunneling using the host system's Ethernet 
> device, communications with other cluster members only drops 
> when DECnet is started.
> %%%  OPCOM  19-JUL-2018 23:14:55.58  %%%
> Message from user DECNET on DARTH
> DECnet starting
>
> %CNXMAN,  lost connection to system QUEST
> %CNXMAN,  lost connection to system GALAXY
> %CNXMAN,  re-established connection to system FASTER
> %CNXMAN,  quorum lost, blocking activity
> %CNXMAN,  re-established connection to system VADER
> %CNXMAN,  re-established connection to system QUEST
> %CNXMAN,  quorum regained, resuming activity
> That's not a full log, but as soon as I see the OPCOM message about 
> DECnet starting, I get the "lost connection" messages, then the 
> "re-established"
>  messages, and then everything is fine afterward.

The improvement by setting the port speed to 10Mbit suggests 
that packet loss/overruns are happening and they are reduced
by limiting the wire speed.

If this weren't a cluster, I'd say that DECnet starting might have 
caused the XQ device's MAC address to be changed around that 
time, reflecting the switch to the DECnet Phase IV address. 
That might then have some effect on the switch's learning 
of MAC addresses...  However, in a cluster this change is done 
when the LAN device is first brought online, using the info in the 
SYSGEN parameter SCSSYSTEMID.

The arrival of DECnet's traffic might be causing a burst of traffic 
that still ends up overrunning another system's ability to receive 
it.  Do things change if you throttle the simh VAX down?

  sim> SET CPU NOIDLE
  sim> SET THROTTLE 25%

- Mark
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-19 Thread Hunter Goatley
Here's where we stand on our cluster communications errors: nothing we 
did worked. We tried different ports on the switch. We tried forcing 
1Gbps. We tried forcing the port down to 10 Mbps. That actually seemed 
to help slightly, in that we only lost communications every 63 seconds 
or so, instead of every 15--60 seconds. But it would lose and 
re-establish connection to the cluster every 63 seconds.


So I decided to try setting up and using a TAP device, just to see what 
would happen.


Using the dedicated Ethernet card, it made no difference. It still lost 
communications every 63 seconds.


When I say dedicated Ethernet card, I probably should have stated 
earlier that it's a USB -> Ethernet device plugged into the system. I 
don't know what brand or model, but I can find out, if anyone wants to know.


So I decided to try tunneling through the "real" Ethernet port used by 
the Linux system. After figuring out what to do for the missing tunctl 
command under CentOS, I was able to set up a tunnel, and I did "attach 
xq tap:tap0". I then booted the system and wonder of wonders, miracle of 
miracles, it was seven minutes into the boot (yes, it takes a long time, 
mounting a slew of disks that needed to be rebuilt) before it lost 
communications. But it re-established them immediately, and as of my 
typing this, it has been twenty-nine minutes since that happened. No 
further drops. Normally, I wouldn't think twenty-nine minutes is enough 
to prove anything, but when it was dropping every 15--63 seconds for two 
solid days, this sounds like a fix to me.


So what does it mean? One thing it suggests is that the USB Ethernet 
device may be buggy or bad. I mean, it seems to work OK for TCP/IP 
communications, etc., but it sure sounds like it may be the part 
responsible for the problems. Especially since tunneling through the 
built-in Ethernet card seems to work and tunneling through the USB 
device did not.


These are the commands I used to set up the tap device for CentOS:

   brctl addbr br0
   ifconfig eno1 0.0.0.0          # eno1 is the host's Ethernet device
   ifconfig br0 XXX.XX.XX.XX up   # the IP address of the host system
   brctl addif br0 eno1
   brctl setfd br0 0
   #tunctl -t tap0
   ip tuntap add tap0 mode tap    # Replacement for tunctl on CentOS 7
   brctl addif br0 tap0
   ifconfig tap0 up

I then just did "attach xq tap:tap0" in the init file. I guess I should 
set up a special MAC address, but I haven't yet, and so far, nothing 
seems amiss.
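
For what it's worth, the same bridge can be built with iproute2 alone on 
newer distributions; an untested sketch assuming the same names (eno1, br0, 
tap0) and a /24 prefix:

   ip link add br0 type bridge
   ip link set br0 type bridge forward_delay 0
   ip addr flush dev eno1                # move the host's IP off the NIC
   ip link set eno1 master br0
   ip addr add XXX.XX.XX.XX/24 dev br0   # the host's IP address
   ip tuntap add tap0 mode tap
   ip link set tap0 master br0
   ip link set eno1 up
   ip link set br0 up
   ip link set tap0 up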


While I thought having a dedicated Ethernet device would be the simplest 
thing, I can live with tunneling it through the shared Ethernet device, 
especially since it works and the former does not. ;-)


Thank you for all of your input over the past couple of days, and thank 
you for all of your work on SIMH!


Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-19 Thread Hunter Goatley
Another data point. After more playing around and several reboots, I can 
confirm that with tunneling using the host system's Ethernet device, 
communications with other cluster members /only/ drops when DECnet is 
started.


   %%%  OPCOM  19-JUL-2018 23:14:55.58  %%%
   Message from user DECNET on DARTH
   DECnet starting

   %CNXMAN,  lost connection to system QUEST
   %CNXMAN,  lost connection to system GALAXY
   %CNXMAN,  re-established connection to system FASTER
   %CNXMAN,  quorum lost, blocking activity
   %CNXMAN,  re-established connection to system VADER
   %CNXMAN,  re-established connection to system QUEST
   %CNXMAN,  quorum regained, resuming activity

That's not a full log, but as soon as I see the OPCOM message about 
DECnet starting, I get the "lost connection" messages, then the 
"re-established" messages, and then everything is fine afterward.


Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Johnny Billquist

On 2018-07-19 02:29, Paul Koning wrote:




On Jul 18, 2018, at 8:22 PM, Johnny Billquist  wrote:

On 2018-07-19 02:07, Paul Koning wrote:

On Jul 18, 2018, at 7:18 PM, Johnny Billquist  wrote:


...


It's probably worth pointing out that the reason I implemented that was not 
because of hardware problems, but because of software problems. DECnet can 
degenerate pretty badly when packets are lost. And if you shove packets fast 
enough at the interface, the interface will (obviously) eventually run out of 
buffers, at which point packets will be dropped.
This is especially noticeable in DECnet/RSX at least. I think I know how to 
improve that software, but I have not had enough time to actually try fixing 
it. And it is especially noticeable when doing file transfers over DECnet.

All ARQ protocols suffer dramatically with packet loss.  The other day I was reading a 
recent paper about high speed long distance TCP.  It showed a graph of throughput vs. 
packet loss rate.  I forgot the exact numbers, but it was something like 0.01% packet 
loss rate causes a 90% throughput drop.  Compare that with the old (1970s) ARPAnet rule 
of thumb that 1% packet loss means 90% loss of throughput.  Those both make sense; the 
old one was for "high speed" links running at 56 kbps, rather than the 
multi-Gbps of current links.
The other thing with nontrivial packet loss is that any protocol with 
congestion control algorithms triggered by packet loss (such as recent versions 
of DECnet), the flow control machinery will severely throttle the link under 
such conditions.
So yes, anything you can do in the infrastructure to keep the packet loss well 
under 1% is going to be very helpful indeed.


Right. That said, TCP behaves much better than DECnet here. At least 
if we talk about TCP with the ability to deal with out-of-order packets (which 
most should do) and DECnet under RSX. The problem with DECnet under RSX is that 
recovering from a lost packet because of congestion essentially guarantees that 
congestion will happen again, while TCP pretty quickly settles into a steady 
working state.


Out of order packet handling isn't involved in that.  Congestion doesn't 
reorder packets.  If you drop a packet, TCP and DECnet both force the 
retransmission of all packets starting with the dropped one.  (At least, I 
don't think selective ACK is used in TCP.)  DECnet described out of order 
packet caching for the same reason TCP does: to work efficiently in packet 
topologies that have multiple paths in which the routers do equal cost path 
splitting.  In DECnet, that support is optional; it's not in DECnet/E and I 
wouldn't expect it in other 16-bit platforms either.


This is maybe getting too technical, so let me know if we should take 
this off list.


Yes, congestion does not reorder packets. However, if you cannot handle 
out of order packets, you have to retransmit everything from the point 
where a packet was lost.
If you can deal with packets out of order, you can keep the packets you 
received, even though there is a hole, and once that hole is plugged, 
you can ACK everything. And this is pretty normal in TCP, even without 
selective ACK.


So, in TCP, what normally happens is that a node is spraying packets as 
fast as it can. Some packets are lost, but not all of them, leaving 
some holes in the sequence of received packets.
After some time, or based on other heuristics, TCP will start retransmitting 
from the point where packets were lost, and as soon as the receiving end has 
plugged the hole, it will jump forward with the ACKs, meaning the sender 
does not need to retransmit everything. Even more, if the sender does 
retransmit everything, losing some of those retransmitted packets will 
not matter, since the receiver already has them anyway. At some point, 
you will get to a state where the receiver has no window open, so the 
transmitter is blocked, and every time the receiver opens up a 
window, which usually is just a packet or two in size, the transmitter 
can send that much data. But this much data is usually less than the 
number of buffers the hardware has, so there are no problems receiving 
those packets, and TCP gets into a steady state where the transmitter 
can transmit packets as fast as the receiver can consume them; apart 
from a few lost packets in the early stages, no packets are lost.
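
Concretely, with plain cumulative ACKs the exchange looks something like this 
(sequence numbers are illustrative):

   sender:    1 2 3 4 5 6 7 8       # packet 4 is lost in flight
   receiver:  ACK 3, ACK 3, ...     # 5-8 held out of order; hole at 4
   sender:    retransmit 4 only
   receiver:  ACK 8                 # hole plugged, the ACK jumps forward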


DECnet (at least in RSX), on the other hand, will transmit a whole bunch 
of packets. The first few will get through, but at some point one or 
several are lost. After some time, DECnet decides that packets were 
lost, and will back up and start transmitting again from the point where 
the packets were lost. Once more it will soon blast more packets than 
the receiver can process, and you will once more get a timeout 
situation. DECnet backs off on the timeouts every time this 
happens, and soon you are at a horrendous 127s timeout for pretty much 
every other packet sent, meaning in effect that the transfer slows to a crawl.

Re: [Simh] Cluster communications errors

2018-07-18 Thread Paul Koning


> On Jul 18, 2018, at 8:22 PM, Johnny Billquist  wrote:
> 
> On 2018-07-19 02:07, Paul Koning wrote:
>>> On Jul 18, 2018, at 7:18 PM, Johnny Billquist  wrote:
>>> 
 ...
>>> 
>>> It's probably worth pointing out that the reason I implemented that was not 
>>> because of hardware problems, but because of software problems. DECnet can 
>>> degenerate pretty badly when packets are lost. And if you shove packets 
>>> fast enough at the interface, the interface will (obviously) eventually run 
>>> out of buffers, at which point packets will be dropped.
>>> This is especially noticeable in DECnet/RSX at least. I think I know how to 
>>> improve that software, but I have not had enough time to actually try 
>>> fixing it. And it is especially noticeable when doing file transfers over 
>>> DECnet.
>> All ARQ protocols suffer dramatically with packet loss.  The other day I was 
>> reading a recent paper about high speed long distance TCP.  It showed a 
>> graph of throughput vs. packet loss rate.  I forgot the exact numbers, but 
>> it was something like 0.01% packet loss rate causes a 90% throughput drop.  
>> Compare that with the old (1970s) ARPAnet rule of thumb that 1% packet loss 
>> means 90% loss of throughput.  Those both make sense; the old one was for 
>> "high speed" links running at 56 kbps, rather than the multi-Gbps of current 
>> links.
>> The other thing with nontrivial packet loss is that any protocol with 
>> congestion control algorithms triggered by packet loss (such as recent 
>> versions of DECnet), the flow control machinery will severely throttle the 
>> link under such conditions.
>> So yes, anything you can do in the infrastructure to keep the packet loss 
>> well under 1% is going to be very helpful indeed.
> 
> Right. That said, TCP behaves much better than DECnet here. At 
> least if we talk about TCP with the ability to deal with out-of-order packets 
> (which most should do) and DECnet under RSX. The problem with DECnet under 
> RSX is that recovering from a lost packet because of congestion essentially 
> guarantees that congestion will happen again, while TCP pretty quickly 
> settles into a steady working state.

Out of order packet handling isn't involved in that.  Congestion doesn't 
reorder packets.  If you drop a packet, TCP and DECnet both force the 
retransmission of all packets starting with the dropped one.  (At least, I 
don't think selective ACK is used in TCP.)  DECnet described out of order 
packet caching for the same reason TCP does: to work efficiently in packet 
topologies that have multiple paths in which the routers do equal cost path 
splitting.  In DECnet, that support is optional; it's not in DECnet/E and I 
wouldn't expect it in other 16-bit platforms either.

> I have not analyzed other DECnet implementations enough to tell for sure if 
> they also exhibit the same problem.

Another consideration is that TCP has seen another 20 years of work on 
congestion control since DECnet Phase IV.  But in any case, it may well be that 
VMS handles these things better.  It's also possible that DECnet/OSI does, 
since it is newer and was designed right around the time that DEC very 
seriously got into congestion control algorithm research.  Phase IV isn't so 
well developed; it largely predates that work.

paul

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Johnny Billquist

On 2018-07-19 02:07, Paul Koning wrote:




On Jul 18, 2018, at 7:18 PM, Johnny Billquist  wrote:


...


It's probably worth pointing out that the reason I implemented that was not 
because of hardware problems, but because of software problems. DECnet can 
degenerate pretty badly when packets are lost. And if you shove packets fast 
enough at the interface, the interface will (obviously) eventually run out of 
buffers, at which point packets will be dropped.
This is especially noticeable in DECnet/RSX at least. I think I know how to 
improve that software, but I have not had enough time to actually try fixing 
it. And it is especially noticeable when doing file transfers over DECnet.


All ARQ protocols suffer dramatically with packet loss.  The other day I was reading a 
recent paper about high speed long distance TCP.  It showed a graph of throughput vs. 
packet loss rate.  I forgot the exact numbers, but it was something like 0.01% packet 
loss rate causes a 90% throughput drop.  Compare that with the old (1970s) ARPAnet rule 
of thumb that 1% packet loss means 90% loss of throughput.  Those both make sense; the 
old one was for "high speed" links running at 56 kbps, rather than the 
multi-Gbps of current links.

The other thing with nontrivial packet loss is that any protocol with 
congestion control algorithms triggered by packet loss (such as recent versions 
of DECnet), the flow control machinery will severely throttle the link under 
such conditions.

So yes, anything you can do in the infrastructure to keep the packet loss well 
under 1% is going to be very helpful indeed.


Right. That said, TCP behaves much better than DECnet here. At 
least if we talk about TCP with the ability to deal with out-of-order 
packets (which most should do) and DECnet under RSX. The problem with 
DECnet under RSX is that recovering from a lost packet because of 
congestion essentially guarantees that congestion will happen again, 
while TCP pretty quickly settles into a steady working state.


I have not analyzed other DECnet implementations enough to tell for sure 
if they also exhibit the same problem.


  Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Paul Koning


> On Jul 18, 2018, at 7:18 PM, Johnny Billquist  wrote:
> 
>> ...
> 
> It's probably worth pointing out that the reason I implemented that was not 
> because of hardware problems, but because of software problems. DECnet can 
> degenerate pretty badly when packets are lost. And if you shove packets fast 
> enough at the interface, the interface will (obviously) eventually run out of 
> buffers, at which point packets will be dropped.
> This is especially noticeable in DECnet/RSX at least. I think I know how to 
> improve that software, but I have not had enough time to actually try fixing 
> it. And it is especially noticeable when doing file transfers over DECnet.

All ARQ protocols suffer dramatically with packet loss.  The other day I was 
reading a recent paper about high speed long distance TCP.  It showed a graph 
of throughput vs. packet loss rate.  I forgot the exact numbers, but it was 
something like 0.01% packet loss rate causes a 90% throughput drop.  Compare 
that with the old (1970s) ARPAnet rule of thumb that 1% packet loss means 90% 
loss of throughput.  Those both make sense; the old one was for "high speed" 
links running at 56 kbps, rather than the multi-Gbps of current links.
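
For a sense of why the cliff is that steep: the usual loss-based estimate 
(Mathis et al.) puts steady-state TCP throughput at roughly

   throughput  ~  MSS / (RTT * sqrt(p))

   worked example (the RTT is an assumed illustrative value):
   MSS = 1460 bytes, RTT = 100 ms, p = 0.0001 (0.01% loss)
   -> 1460 / (0.1 * 0.01) bytes/s  =  ~1.46 MB/s  =  ~12 Mbit/s

On a multi-Gbps path that cap is well over a 90% drop, while a 56 kbps link 
never comes near it until the loss rate is far higher.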

The other thing with nontrivial packet loss is that any protocol with 
congestion control algorithms triggered by packet loss (such as recent versions 
of DECnet), the flow control machinery will severely throttle the link under 
such conditions.

So yes, anything you can do in the infrastructure to keep the packet loss well 
under 1% is going to be very helpful indeed.

paul

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Johnny Billquist

On 2018-07-18 23:36, Mark Pizzolato wrote:

On Wednesday, July 18, 2018 at 2:32 PM, Paul Koning wrote:

It might be better to hard set the Linux simh host system's port to
10Mbit on the switch.  That would help with the potential for
overrunning the original DEC hardware...


DEC hardware tends to handle line rate traffic; a lot of other Ethernet
hardware does not, especially not earlier models.  I remember arguing with the
DECnet/DOS folks that no, we would not modify the DECnet architecture to
handle the single buffer "design" of the 3c501.

But if you have speed mismatches, you're likely to have congestion loss, unless
the bursts are less than the switch buffer quota.  Some switches have
thousands of buffers; other (inexpensive) ones have only a surprisingly small
number and can easily give you congestion loss.


Well, not all systems and hardware can actually handle back-to-back
packets even at 10Mbits.   The XQ THROTTLING is based on the throttling that
Johnny Billquist implemented in his bridge, which was needed to allow his
physical systems to communicate with simulated systems without
crazy packet loss...


It's probably worth pointing out that the reason I implemented that was 
not because of hardware problems, but because of software problems. 
DECnet can degenerate pretty badly when packets are lost. And if you 
shove packets fast enough at the interface, the interface will 
(obviously) eventually run out of buffers, at which point packets will 
be dropped.
This is especially noticeable in DECnet/RSX at least. I think I know how 
to improve that software, but I have not had enough time to actually try 
fixing it. And it is especially noticeable when doing file transfers 
over DECnet.


  Johnny

--
Johnny Billquist  || "I'm on a bus
  ||  on a psychedelic trip
email: b...@softjar.se ||  Reading murder books
pdp is alive! ||  tryin' to stay hip" - B. Idol
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Kevin Handy
When you are looking for packet loss/errors, are you just looking inside
the simulator, or are you also checking the host machine?
Your host OS may be hiding errors, giving you "cleaned up" traffic.
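
On a Linux host, the raw interface counters are a quick way to check; a 
sketch, assuming the interface is eno1:

   # kernel-level RX/TX statistics, including drops and overruns
   ip -s link show eno1
   # driver/NIC counters, where the hardware exposes them
   ethtool -S eno1 | grep -iE 'drop|err|fifo'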

On Wed, Jul 18, 2018 at 3:36 PM, Mark Pizzolato  wrote:

> On Wednesday, July 18, 2018 at 2:32 PM, Paul Koning wrote:
> > > On Jul 18, 2018, at 5:27 PM, Mark Pizzolato  wrote:
> > >
> > > On Wednesday, July 18, 2018 at 2:19 PM, Paul Koning wrote:
> > >>> On Jul 18, 2018, at 5:03 PM, Hunter Goatley 
> > wrote:
> > >>>
> > >>> On 7/18/2018 3:38 PM, Hunter Goatley wrote:
> >  I know it's currently set to autosense. I'll try forcing the
> >  speed and duplex.
> > 
> > >>>
> > >>> I was told:
> > >>> The router is reporting that the port auto-sensed 1Gbit duplex, but
> > >>> I just manually forced it to that to be sure.
> > >
> > > It might be better to hard set the Linux simh host system's port to
> > > 10Mbit on the switch.  That would help with the potential for
> > > overrunning the original DEC hardware...
> >
> > DEC hardware tends to handle line rate traffic; a lot of other Ethernet
> > hardware does not, especially not earlier models.  I remember arguing
> with the
> > DECnet/DOS folks that no, we would not modify the DECnet architecture to
> > handle the single buffer "design" of the 3c501.
> >
> > But if you have speed mismatches, you're likely to have congestion loss,
> unless
> > the bursts are less than the switch buffer quota.  Some switches have
> > thousands of buffers; other (inexpensive) ones have only a surprisingly
> small
> > number and can easily give you congestion loss.
>
> Well, not all systems and hardware can actually handle back-to-back
> packets even at 10Mbits.   The XQ THROTTLING is based on the throttling
> that
> Johnny Billquist implemented in his bridge, which was needed to allow his
> physical systems to communicate with simulated systems without
> crazy packet loss...
>
> - Mark
>
> ___
> Simh mailing list
> Simh@trailing-edge.com
> http://mailman.trailing-edge.com/mailman/listinfo/simh
>
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Mark Pizzolato
There are no deliberate buffering delays in the Ethernet layer.  There merely 
is one thread which receives (and filters) packets and queues the potentially 
interesting ones.  The available queue gets drained as fast as the simulated 
system happens to read the available data.  This might affect some worst-case 
situations, but overruns due to speed mismatches and the limited capacity of 
the old physical hardware are much more likely to blame.  Like I said, I've 
got multiple simulated LAVC nodes that can all talk just fine without the 
errors Hunter is seeing; if bufferbloat were a factor, it might be worse there…


-  Mark

From: Simh [mailto:simh-boun...@trailing-edge.com] On Behalf Of Warren Young
Sent: Wednesday, July 18, 2018 2:33 PM
To: simh@trailing-edge.com
Subject: Re: [Simh] Cluster communications errors

On Wed, Jul 18, 2018 at 12:21 PM Mark Pizzolato <m...@infocomm.com> wrote:

The simh Ethernet layer has dramatically more internal packet buffering (maybe 
50 X) than anything real DEC hardware ever had.  This might account for the 
relatively smooth behavior I’m seeing.

More buffering can also mean more delay in the feedback loop that controls the 
underlying protocols, leading to *worse* performance as buffer space goes up.

This is called Buffer Bloat in the TCP sphere:

https://www.bufferbloat.net/

Perhaps the low-level protocols involved in VAX clustering have the same issue? 
They may be expecting to get some kind of feedback response, which is getting 
delayed through the buffering, which causes the real VAXen to kick the fake one 
out, thinking it's gone MIA.
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Mark Pizzolato
On Wednesday, July 18, 2018 at 2:32 PM, Paul Koning wrote:
> > On Jul 18, 2018, at 5:27 PM, Mark Pizzolato  wrote:
> >
> > On Wednesday, July 18, 2018 at 2:19 PM, Paul Koning wrote:
> >>> On Jul 18, 2018, at 5:03 PM, Hunter Goatley 
> wrote:
> >>>
> >>> On 7/18/2018 3:38 PM, Hunter Goatley wrote:
>  I know it's currently set to autosense. I'll try forcing the
>  speed and duplex.
> 
> >>>
> >>> I was told:
> >>> The router is reporting that the port auto-sensed 1Gbit duplex, but
> >>> I just manually forced it to that to be sure.
> >
> > It might be better to hard set the Linux simh host system's port to
> > 10Mbit on the switch.  That would help with the potential for
> > overrunning the original DEC hardware...
> 
> DEC hardware tends to handle line rate traffic; a lot of other Ethernet
> hardware does not, especially not earlier models.  I remember arguing with the
> DECnet/DOS folks that no, we would not modify the DECnet architecture to
> handle the single buffer "design" of the 3c501.
>
> But if you have speed mismatches, you're likely to have congestion loss, 
> unless
> the bursts are less than the switch buffer quota.  Some switches have
> thousands of buffers; other (inexpensive) ones have only a surprisingly small
> number and can easily give you congestion loss.

Well, not all systems and hardware can actually handle back-to-back 
packets even at 10Mbits.   The XQ THROTTLING is based on the throttling that
Johnny Billquist implemented in his bridge, which was needed to allow his
physical systems to communicate with simulated systems without
crazy packet loss...

- Mark

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Warren Young
On Wed, Jul 18, 2018 at 12:21 PM Mark Pizzolato  wrote:

>
>
> The simh Ethernet layer has dramatically more internal packet buffering
> (maybe 50 X) than anything real DEC hardware ever had.  This might account
> for the relatively smooth behavior I’m seeing.
>

More buffering can also mean more delay in the feedback loop that controls
the underlying protocols, leading to *worse* performance as buffer space
goes up.

This is called Buffer Bloat in the TCP sphere:

https://www.bufferbloat.net/

Perhaps the low-level protocols involved in VAX clustering have the same
issue? They may be expecting to get some kind of feedback response, which
is getting delayed through the buffering, which causes the real VAXen to
kick the fake one out, thinking it's gone MIA.
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Paul Koning


> On Jul 18, 2018, at 5:27 PM, Mark Pizzolato  wrote:
> 
> On Wednesday, July 18, 2018 at 2:19 PM, Paul Koning wrote:
>>> On Jul 18, 2018, at 5:03 PM, Hunter Goatley  wrote:
>>> 
>>> On 7/18/2018 3:38 PM, Hunter Goatley wrote:
 I know it's currently set to autosense. I'll try forcing the speed and
 duplex.
 
>>> 
>>> I was told:
>>> The router is reporting that the port auto-sensed 1Gbit duplex, but I just
>>> manually forced it to that to be sure.
> 
> It might be better to hard set the Linux simh host system's port to 10Mbit 
> on the switch.  That would help with the potential for overrunning the 
> original 
> DEC hardware...

DEC hardware tends to handle line rate traffic; a lot of other Ethernet 
hardware does not, especially not earlier models.  I remember arguing with the 
DECnet/DOS folks that no, we would not modify the DECnet architecture to handle 
the single buffer "design" of the 3c501.

But if you have speed mismatches, you're likely to have congestion loss, unless 
the bursts are less than the switch buffer quota.  Some switches have thousands 
of buffers; other (inexpensive) ones have only a surprisingly small number and 
can easily give you congestion loss.
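
The arithmetic behind the buffer requirement: with a 1 Gbps sender feeding a 
10 Mbps port, frames arrive roughly 100 times faster than they drain, so 
nearly the whole burst has to be buffered. A rough worked example:

   burst of 50 full-size frames (50 x 1518 bytes = ~76 KB):
     arrives at 1 Gbps in   ~0.6 ms
     drains at 10 Mbps in  ~61 ms
   so the switch must hold ~49 of the 50 frames, or start dropping.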

paul

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley

On 7/18/2018 4:18 PM, Paul Koning wrote:


You mentioned that some of this is real hardware and some is simulated.  It 
might be helpful to post a map showing the setup, including interface models, 
link speeds, and switch models.


I'll have to see about getting that. I think I mentioned that I'm not 
physically located with the equipment.



Are the interface speeds all the same?  LAVC was built for 10 Mbps Ethernet, 
and while running it faster should be ok, running mixed speeds may create more 
congestion than the protocol is comfortable with.


Good point.


Is there any way to show packet loss counts?  Can you run DECnet, and if you 
put a significant load on DECnet connections, do the DECnet counters show any 
errors?

The counters I've checked via NCP and LANCP show no errors, no 
collisions, no overruns.


Mark wrote:

   It might be better to hard set the Linux simh host system's port to 10Mbit
   on the switch.  That would help with the potential for overrunning the 
original
   DEC hardware...

I just asked my colleague to try forcing that.

And I take that back about turning on throttling not making a 
difference. It has made a difference---the system is no longer coming 
all the way up. I'm not sure why, as the reasons long ago scrolled away 
because of all of the "lost connection" messages. I didn't think to 
record them.


Thanks!

Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Mark Pizzolato
On Wednesday, July 18, 2018 at 2:19 PM, Paul Koning wrote:
> > On Jul 18, 2018, at 5:03 PM, Hunter Goatley  wrote:
> >
> > On 7/18/2018 3:38 PM, Hunter Goatley wrote:
> >> I know it's currently set to autosense. I'll try forcing the speed and
> >> duplex.
> >>
> >
> > I was told:
> > The router is reporting that the port auto-sensed 1Gbit duplex, but I just
> > manually forced it to that to be sure.

It might be better to hard set the Linux simh host system's port to 10Mbit 
on the switch.  That would help with the potential for overrunning the original 
DEC hardware...
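
On the Linux host side, the same experiment can usually be run with ethtool; 
a sketch, assuming the NIC shows up as eno1 and the driver supports forcing:

   ethtool -s eno1 speed 10 duplex full autoneg off
   ethtool eno1      # verify the forced speed/duplex took effect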

- Mark
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Paul Koning


> On Jul 18, 2018, at 5:03 PM, Hunter Goatley  wrote:
> 
> On 7/18/2018 3:38 PM, Hunter Goatley wrote:
>> I know it's currently set to autosense. I'll try forcing the speed and 
>> duplex.
>> 
> 
> I was told:
> The router is reporting that the port auto-sensed 1Gbit duplex, but I just 
> manually forced it to that to be sure.
> No change in behavior, unfortunately.
> 
> Hunter

You mentioned that some of this is real hardware and some is simulated.  It 
might be helpful to post a map showing the setup, including interface models, 
link speeds, and switch models.

Are the interface speeds all the same?  LAVC was built for 10 Mbps Ethernet, 
and while running it faster should be ok, running mixed speeds may create more 
congestion than the protocol is comfortable with.  While any Ethernet protocol 
has to handle packet loss, some protocols assume packet loss is rare.  DECnet 
wouldn't, but the cluster protocols (and LAT, for that matter) do.  

Is there any way to show packet loss counts?  Can you run DECnet, and if you 
put a significant load on DECnet connections, do the DECnet counters show any 
errors?

paul

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley

On 7/18/2018 3:38 PM, Hunter Goatley wrote:


I know it's currently set to autosense. I'll try forcing the speed 
and duplex.




I was told:

   The router is reporting that the port auto-sensed 1Gbit duplex, but
   I just manually forced it to that to be sure.

No change in behavior, unfortunately.

Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley

Hi, Mark.

Meanwhile, from what you’ve mentioned it seems you’ve got a simh 
instance talking to real DEC hardware.




Yes, the switch isn't DEC, but most of the other nodes in the cluster 
are real DEC hardware.


Using the 4.0 Current codebase, you might want to look at: HELP XQ 
CONFIG SET THROTTLE




Interesting. Thanks. I just read that and enabled throttling, but just 
with a sample SET XQ THROTTLE=ON. I'll see what that does.


Not enough. It has lost connection several times during the boot. I 
haven't studied the timings enough to have any idea what I might specify 
for TIME, BURST, or DELAY.
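
For reference, the parameterized form documented under HELP XQ CONFIG looks 
something like the following; the numbers here are illustrative placeholders, 
not tuned values:

   sim> SET XQ THROTTLE=TIME=5;BURST=4;DELAY=10
   sim> SHOW XQ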


You may also want to show us the simh VAX configuration file you are 
using…




Something else I meant to include:

   load -r /usr/local/vax/data/ka655x.bin
   attach nvr /usr/local/vax/data/nvram.bin
   set cpu 256m
   set rq0 ra92
   attach rq0 /usr/local/vax/data/darth.vdisk
   set rl disable
   set ts disable
   attach xq eth0
   set xq throttle=on
   dep bdr 0
   boot cpu

Thanks!

Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley
Thanks, Dave. I meant to try to double-check the settings. I don't have 
physical access to the system, so I'll ask someone to double-check the card 
and the switch.


I know it's currently set to autosense. I'll try forcing the speed and 
duplex.


Thanks!

Hunter
---
Hunter Goatley, goathun...@goatley.com



On July 18, 2018 3:34:59 PM Dave Wade  wrote:


Hunter,

Is it set to Autosense Speed and Duplex? Is it getting confused? Can it be 
set to a fixed speed?


Dave



From: Simh  On Behalf Of Hunter Goatley
Sent: 18 July 2018 16:58
To: Simh 
Subject: Re: [Simh] Cluster communications errors



My mistake. I'm not running V4.0, I'm running V3.10-0 RC1.

After posting, it dawned on me that I should have tried SIMH V3.9.0, but it 
fails to boot:


(BOOT/R5:0 DUA0



  2..
-DUA0
  1..0..

HALT instruction, PC: 4C02 (HALT)
sim>

I'm not sure why. I'm using the KA655x.bin that came with V3.9.0 and a new 
nvram.bin file, but everything else is the same as the V3.10-0 RC1 instance.


I just downloaded the current GitHub sources and compiled them (15fd71b, 
https://github.com/simh/simh/commit/15fd71b97c8aaec29dc1bbbd3473c3f0d582c9ff). 
It boots, but I see the same behavior of losing connection to the 
cluster.


I also should have mentioned that this dedicated Ethernet card is plugged 
into the same switch as all of the other cluster members, so that shouldn't 
be an issue.


Thanks.

Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Mark Pizzolato
Hi Hunter,

Maybe Dave has something, but maybe not.

I just booted a diskless simh VAX instance from a simh VAX system running on 
the same LAN.

The Second phase of the boot (after the MOP load of the Secondary boot loader) 
had a few retries until the boot succeeded and then the node came up and stayed 
up without issue.  The booting system had this minimal configuration:

  sim> set cpu 256m
  sim> set XQ mac=08-00-2b-11-22-44
  sim> attach XQ eth2
  sim> BOOT

   Then entered BOOT XQ at the >>> prompt.

The simh host system in this case was running Windows.  Just for grins, I tried 
the same thing from a Ubuntu 18.04 Linux system running in a VirtualBox VM on 
that same Windows host.

The simh Ethernet layer has dramatically more internal packet buffering (maybe 
50 X) than anything real DEC hardware ever had.  This might account for the 
relatively smooth behavior I’m seeing.
Meanwhile, from what you’ve mentioned it seems you’ve got a simh instance 
talking to real DEC hardware.

Using the 4.0 Current codebase, you might want to look at: HELP XQ CONFIG SET 
THROTTLE

You may also want to show us the simh VAX configuration file you are using…


-  Mark

From: Simh [mailto:simh-boun...@trailing-edge.com] On Behalf Of Dave Wade
Sent: Wednesday, July 18, 2018 10:14 AM
To: 'Hunter Goatley' ; 'Simh' 
Subject: Re: [Simh] Cluster communications errors

Hunter,
Is it set to Autosense Speed and Duplex? Is it getting confused? Can it be set 
to a fixed speed?
Dave

From: Simh <simh-boun...@trailing-edge.com> On Behalf Of Hunter Goatley
Sent: 18 July 2018 16:58
To: Simh <simh@trailing-edge.com>
Subject: Re: [Simh] Cluster communications errors

My mistake. I'm not running V4.0, I'm running V3.10-0 RC1.

After posting, it dawned on me that I should have tried SIMH V3.9.0, but it 
fails to boot:

(BOOT/R5:0 DUA0







  2..

-DUA0

  1..0..



HALT instruction, PC: 4C02 (HALT)

sim>
I'm not sure why. I'm using the KA655x.bin that came with V3.9.0 and a new 
nvram.bin file, but everything else is the same as the V3.10-0 RC1 instance.

I just downloaded the current GitHub sources and compiled them (15fd71b, 
https://github.com/simh/simh/commit/15fd71b97c8aaec29dc1bbbd3473c3f0d582c9ff). 
It boots, but I see the same behavior of losing connection to the cluster.

I also should have mentioned that this dedicated Ethernet card is plugged into 
the same switch as all of the other cluster members, so that shouldn't be an 
issue.

Thanks.

Hunter


___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Dave Wade
Hunter,

Is it set to Autosense Speed and Duplex? Is it getting confused? Can it be set 
to a fixed speed?

Dave

 

From: Simh  On Behalf Of Hunter Goatley
Sent: 18 July 2018 16:58
To: Simh 
Subject: Re: [Simh] Cluster communications errors

 

My mistake. I'm not running V4.0, I'm running V3.10-0 RC1.

After posting, it dawned on me that I should have tried SIMH V3.9.0, but it 
fails to boot:

(BOOT/R5:0 DUA0
 
 
 
  2..
-DUA0
  1..0..
 
HALT instruction, PC: 4C02 (HALT)
sim>

I'm not sure why. I'm using the KA655x.bin that came with V3.9.0 and a new 
nvram.bin file, but everything else is the same as the V3.10-0 RC1 instance.

I just downloaded the current GitHub sources and compiled them (15fd71b, 
https://github.com/simh/simh/commit/15fd71b97c8aaec29dc1bbbd3473c3f0d582c9ff). 
It boots, but I see the same behavior of losing connection to the 
cluster.

I also should have mentioned that this dedicated Ethernet card is plugged into 
the same switch as all of the other cluster members, so that shouldn't be an 
issue.

Thanks.

Hunter

 
___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

Re: [Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley

My mistake. I'm not running V4.0, I'm running V3.10-0 RC1.

After posting, it dawned on me that I should have tried SIMH V3.9.0, but 
it fails to boot:


   (BOOT/R5:0 DUA0



  2..
   -DUA0
  1..0..

   HALT instruction, PC: 4C02 (HALT)
   sim>

I'm not sure why. I'm using the KA655x.bin that came with V3.9.0 and a 
new nvram.bin file, but everything else is the same as the V3.10-0 RC1 
instance.


I just downloaded the current GitHub sources and compiled them (15fd71b, 
https://github.com/simh/simh/commit/15fd71b97c8aaec29dc1bbbd3473c3f0d582c9ff). 
It boots, but I see the same behavior of losing connection to the cluster.


I also should have mentioned that this dedicated Ethernet card is 
plugged into the same switch as all of the other cluster members, so 
that /shouldn't/ be an issue.


Thanks.

Hunter

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh

[Simh] Cluster communications errors

2018-07-18 Thread Hunter Goatley

Good morning.

I recently set up SIMH running under Linux to replace some aging VAX 
hardware. The SIMH instance is about 30% faster than the actual 
hardware, which is a nice win. I'm running the current code from GitHub, 
which I downloaded on Monday.


I have a dedicated Ethernet device on the Linux system for the SIMH 
instance.


It's in a cluster of other machines, and all is working well except for 
one thing. Every 15--60 seconds, it loses and re-establishes contact 
with the cluster:


   %CNXMAN,  lost connection to system VADER
   %CNXMAN,  re-established connection to system VADER

And these OPCOM messages from VADER:

   %%%  OPCOM  18-JUL-2018 11:33:01.26  %%%(from node VADER 
 a)
   11:32:46.71 Node VADER (csid 00010078) lost connection to node DARTH

   %%%  OPCOM  18-JUL-2018 11:33:01.26  %%%(from node VADER 
 a)
   11:32:49.21 Node VADER (csid 00010078) re-established connection to node 
DARTH

It recovers every time, but everything hangs briefly while connectivity 
is re-established, and, of course, it's generating a ton of OPCOM 
messages, since this happens every 15--60 seconds.


Has anyone else seen this issue or have any suggestions?

Thanks!

--
Hunter
--
Hunter Goatley, Process Software, http://www.process.com/
goathun...@goatley.com   http://hunter.goatley.com/

___
Simh mailing list
Simh@trailing-edge.com
http://mailman.trailing-edge.com/mailman/listinfo/simh