Hi,

OK, I need to share some additional information :-)

First, I observed a huge number of dropped packets on one server. So we ran some tests, including iperf, which showed nearly the full physical bandwidth. We also took a tcpdump and handed it to our network people, who didn't see anything conclusive apart from some broadcast traffic that might cause the dropped packets. Starting from an LACP trunk, we broke it up and tried both NICs separately, swapped the SFP modules between the ports, and also tested completely different ones.

Then we switched to a dedicated private network for the ISP/TSM server-to-server traffic only, i.e. both servers and one dedicated switch -- on this connection no packets were lost or dropped. For this setup we also took traces, but got no helpful answer.
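To keep an eye on the drop counters during such tests, a small script can parse the same numbers "ip -s link" shows. This is only a sketch (assumptions: Linux, the standard /proc/net/dev column layout); it runs here against an embedded sample, but on a live server you would feed it the real file contents:

```python
# Minimal sketch (assumption: Linux /proc/net/dev column layout) for
# extracting per-NIC RX/TX drop counters, the same values "ip -s link"
# and "ifconfig" report.

def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_dropped, tx_dropped)}."""
    stats = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        iface, _, counters = line.partition(":")
        fields = counters.split()
        # RX fields: bytes packets errs drop ... (drop = index 3)
        # TX fields start at index 8: bytes packets errs drop ... (drop = index 11)
        stats[iface.strip()] = (int(fields[3]), int(fields[11]))
    return stats

# Embedded sample so the sketch is self-contained; on a server, replace
# SAMPLE with open("/proc/net/dev").read() and compare two snapshots.
SAMPLE = (
    "Inter-|   Receive                |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|"
    "bytes packets errs drop fifo colls carrier compressed\n"
    "  eth0: 100 10 0 7 0 0 0 0 200 20 0 3 0 0 0 0\n"
)

for iface, (rx_drop, tx_drop) in parse_net_dev(SAMPLE).items():
    print(f"{iface}: rx_dropped={rx_drop} tx_dropped={tx_drop}")
```

Taking two snapshots a few minutes apart during a replication attempt shows whether the drops actually grow while the sessions hang.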

The last approach was to connect both servers directly by crossing the fiber cables, but the problem still remains.

By now, the ordinary client traffic is handled on each server using one NIC, and for the server-to-server connection we have this crosslink on a second NIC.

I do wonder, because the export of the nodes ran as expected: it got suspended when not enough drives were available or when the staging pool on the destination server was full, but it finished without any problems, for small nodes as well as for large ones (> 10 TB primary data and/or > 10 million files). Even stranger: some replications do work, including replications of larger nodes. If we had a problem with the driver, the exports shouldn't finish and all replications should fail with an error, shouldn't they?


My problem is that I have no idea why the replication fails. The error messages are not clear to me.
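For what it's worth, the numeric codes in the ANR8213E messages quoted below ("send error 104" and "send error 32") are plain Linux errno values, which Python's errno module can decode. This only names the codes -- it doesn't explain why the peer resets the connection:

```python
import errno
import os

# "send error 104" / "send error 32" from ANR8213E are Linux errno values.
# On Linux, 104 is ECONNRESET ("Connection reset by peer") and 32 is
# EPIPE ("Broken pipe") -- i.e. the other side closed or reset the socket
# while this side was still sending.
for code in (104, 32):
    print(code, errno.errorcode[code], os.strerror(code))
```

So the target server is not diagnosing its own failure here; it is reporting that the sending side tore the connection down, which matches the "terminated - forced by administrator" message on the source.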


APAR IC92088 (https://www-01.ibm.com/support/docview.wss?uid=swg1IC92088) says it's caused by network timeouts, but this note is about TSM 6.3 -- and states "This problem was fixed". Unfortunately the corresponding technote (http://www-01.ibm.com/support/docview.wss?uid=swg1642715) isn't available any more.

Well, we also increased different timeout settings on the target server:

CommTimeOut 600
AdminIdleTimeOut 180
AdminCommTimeOut 180

Now I will increase IdleTimeOut to 600, too.

But given "KeepAliveTime 300" and "KeepAliveInterval 30" -- set on both servers -- I expect idle connections to be refreshed after 5 minutes at the latest, so well within the idle timeouts.
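As a quick sanity check of that arithmetic, here is a sketch that takes all values in seconds, as in the mail above, and assumes the usual TCP keepalive semantics these options map to (KeepAliveTime = idle seconds before the first probe, KeepAliveInterval = seconds between subsequent probes):

```python
# Sketch of the keepalive-vs-timeout arithmetic, all values in seconds.
# Assumption: KeepAliveTime = idle time before the first TCP keepalive
# probe, KeepAliveInterval = spacing of further probes (standard TCP
# keepalive semantics).

KEEPALIVE_TIME = 300        # first probe after 5 min of idleness
KEEPALIVE_INTERVAL = 30     # subsequent probes every 30 s
IDLE_TIMEOUT = 600          # planned IdleTimeOut
COMM_TIMEOUT = 600          # CommTimeOut already set

def probe_times(horizon):
    """Seconds at which keepalive probes fire on a connection idle until 'horizon'."""
    t, times = KEEPALIVE_TIME, []
    while t <= horizon:
        times.append(t)
        t += KEEPALIVE_INTERVAL
    return times

# The first probe (t = 300) lands well before either 600 s timeout, so an
# idle-but-healthy connection should never look dead to either server.
print(probe_times(IDLE_TIMEOUT))
```

If that assumption holds, the timeouts should not be the culprit -- which would fit the APAR's "network timeout" diagnosis not applying here.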


thanks & best,

Bjørn



Stefan Folkerts wrote:
Did you use something like iperf with a long and heavy load? A bad NIC or
driver might cause this, so it might still be the network.

On Mon, May 13, 2019 at 4:15 PM Bjørn Nachtwey<bjoern.nacht...@gwdg.de>
wrote:

Hi all,

we planned to switch from COPYPOOL to Replication for having a second
copy of the data, therefore we bought a new server that should become
the primary TSM/ISP server and then make the old one holding the
replicates.

what we did:

we started by exporting the nodes, which worked well. But as the
"incremental" exports even took some time, we set up a replication from
old server "A" to the new one "B". For all nodes already exported we set
up the replication vice versa: TSM "B" replicates them to TSM "A".

Well, the replication jobs did not finish; some data and files were
missing as long as we replicated using a node group. Now we run
replication for each single node and it works -- for most of them :-(

When replicating the "bad" nodes from "TSM A" to "TSM B", first the
sessions hang for many minutes, sometimes even hours, then they get
"terminated - forced by administrator" (ANR0483W), e.g.:

05/13/2019 15:23:16    ANR2017I Administrator GK issued command:
REPLICATE NODE vsbck  (SESSION: 26128)
05/13/2019 15:23:16    ANR1626I The previous message (message number
2017) was repeated 1 times.
05/13/2019 15:23:16    ANR0984I Process 494 for Replicate Node started
in the BACKGROUND at 15:23:16. (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR2110I REPLICATE NODE started as process 494.
(SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26184 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26185 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:16    ANR0408I Session 26186 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26187 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26188 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26189 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26190 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)
05/13/2019 15:23:17    ANR0408I Session 26191 started for server SM283
(Linux/x86_64) (Tcp/Ip) for replication.  (SESSION: 26128, PROCESS: 494)

05/13/2019 15:24:57    ANR0483W Session 26187 for node SM283
(Linux/x86_64) terminated - forced by administrator. (SESSION: 26128,
PROCESS: 494)

on the target server we observe at that time:

13.05.2019 15:25:51 ANR8213E Socket 34 aborted due to send error; error
104.
13.05.2019 15:25:51 ANR3178E A communication error occurred during
session 65294 with replication server TSM.
13.05.2019 15:25:51 ANR0479W Session 65294 for server TSM (Windows)
terminated - connection with server severed.
13.05.2019 15:25:51 ANR8213E Socket 34 aborted due to send error; error 32.

=> Any idea why this replication aborts?

=> why is there a "socket abortion error"?


Well, we already opened an SR case and sent lots of logs and traces. As
IBM suspects a network problem, both servers now use a cross-link
connection with nothing but NICs/GBICs, plugs, and fibre in between.

thanks & best

Bjørn
