On 30.05.11 21:42, Mikolaj Golub wrote:
> DK> One strange thing is that there is never an established TCP connection
> DK> between both nodes:
> DK> tcp4 0 0 10.2.101.11.48939 10.2.101.12.8457 FIN_WAIT_2
> DK> tcp4 0 1288 10.2.101.11.57008 10.2.101.12.8457 CLOSE_WAIT
> DK> tcp4 0 0 10.2.101.11.46346 10.2.101.12.8457 FIN_WAIT_2
> DK> tcp4 0 90648 10.2.101.11.13916 10.2.101.12.8457 CLOSE_WAIT
> DK> tcp4 0 0 10.2.101.11.8457 *.* LISTEN
> It is normal. hastd uses each connection in only one direction, so it calls
> shutdown(2) to close the unused direction.
So the TCP connections are all so short-lived that I never see a
single one in the ESTABLISHED state? 10Gbit Ethernet is indeed fast, so this
might well be possible...
> I suppose that when the checksum is enabled the bottleneck is the CPU, the
> traffic rate is lower, and the problem is not triggered.
I was thinking something like this. My later tests seem to suggest that
this gets triggered when the network transfer rate is much higher than
the disk transfer rate.
> The "Hash mismatch" message suggests that you actually were using checksums
> then, weren't you?
Yes, this occurs only when checksums are enabled. It happens with both
crc32 and sha256.
> I would like to look at full logs for a rather long period, with several
> cases, from both primary and secondary (and be sure the time is synchronized).
I have made sure the clocks are synchronized and am currently running on
freshly rebooted nodes (with two additional SATA drives in each node).
So far there are some interesting findings; for example, I get hash
errors and disconnects much more frequently now. I will post the results
when a bonnie++ run on the ZFS filesystem on top of the HAST resources
finishes.
> Also, it might be worth checking that there is no network packet corruption
> (some strange things in netstat -di or netstat -s; maybe copy large files via
> the net and compare checksums).
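For reference, a sketch of such a check (the addresses are the ones from the netstat output above; file sizes, paths, and the use of scp/ssh between the nodes are assumptions, not something from this thread):

```shell
# Per-interface error/drop counters and protocol-level statistics
netstat -di
netstat -s | grep -i -E 'bad|discard|corrupt'

# Copy a large file over the same link and compare digests on both ends
dd if=/dev/random of=/tmp/hast-test bs=1m count=1024
scp /tmp/hast-test 10.2.101.12:/tmp/hast-test
sha256 /tmp/hast-test                      # on the primary
ssh 10.2.101.12 sha256 /tmp/hast-test      # on the secondary
```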
I will post these as well; however, so far there has been no indication
of any network problems, no interface errors, etc. It might also be that
the ix driver simply does not report such errors, of course.
One additional note: while playing with this setup, I tried to simulate
the local disk going away, in the hope that HAST would switch to using
the remote disk. Instead of asking someone at the site to pull out the
drive, I just issued on the primary

hastctl role init data0

which resulted in a kernel panic. Unfortunately, there was not enough
dump space for 48GB. I will re-run this with more drives for the crash
dump. Anything you want me to look for in particular? (The kernels have
no KDB compiled in yet.)
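For capturing the next panic, something along these lines should help (the device name is a placeholder, not from this setup); note that with FreeBSD minidumps only the kernel's pages are written, so the dump is far smaller than a full 48GB memory image:

```shell
# /etc/rc.conf: direct crash dumps to a swap/dump partition
dumpdev="/dev/ada2p2"     # assumed device name; pick a real swap partition
dumpdir="/var/crash"      # where savecore(8) stores the dump on boot

# Or enable immediately without a reboot:
dumpon /dev/ada2p2
sysctl debug.minidump     # 1 = minidumps enabled (kernel pages only)
```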
Daniel
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"