Re: HAST instability
On 14.06.11 17:56, Mikolaj Golub wrote:

MG> It has turned out that automatic receive buffer sizing works only for
MG> connections in ESTABLISHED state. And with a small receive buffer the
MG> connection might get stuck, sending data only via TCP window probes --
MG> one byte every few seconds (see "Scenario to make recv(MSG_WAITALL)
MG> stuck" in net@ for details).

I have tried some TCP/IP tuning to help utilize the faster network, but for the moment it is likely the local disks limit throughput to a peak of about 230 MB/sec. The peaks now are the same as before, but the total performance is better.

However, it may turn out that a single TCP/IP session across a 10Gbit network cannot achieve very high throughput. It may be beneficial to support multiple parallel TCP/IP connections between primary and secondary in order to utilize faster networks.

Daniel

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
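[Editorial note: the "multiple parallel TCP/IP connections" idea above can be sketched roughly as below. This is a hypothetical illustration only, not hastd code; all names (`send_striped`, `recv_striped`, the 4-byte stripe-index framing) are invented for the example.]

```python
# Sketch: stripe one logical transfer across several parallel TCP
# connections so a single session's window/buffer limits matter less.
import socket
import threading

NCONN = 4

def send_striped(host, port, payload, nconn=NCONN):
    # Split the payload into nconn stripes; each stripe goes over its own
    # connection, prefixed with a 4-byte stripe index for reassembly.
    step = (len(payload) + nconn - 1) // nconn
    def worker(i):
        with socket.create_connection((host, port)) as s:
            s.sendall(i.to_bytes(4, "big") + payload[i * step:(i + 1) * step])
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nconn)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def recv_striped(server, nconn=NCONN):
    # Accept nconn connections and put the stripes back in index order.
    stripes = {}
    for _ in range(nconn):
        conn, _ = server.accept()
        with conn:
            data = b""
            while chunk := conn.recv(65536):
                data += chunk
        stripes[int.from_bytes(data[:4], "big")] = data[4:]
    return b"".join(stripes[i] for i in range(nconn))

# Self-contained demo over localhost.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(NCONN)
payload = bytes(range(256)) * 150          # 38400 bytes of test data
sender = threading.Thread(
    target=send_striped, args=("127.0.0.1", server.getsockname()[1], payload))
sender.start()
reassembled = recv_striped(server)
sender.join()
server.close()
```

Real striping would also have to preserve write ordering guarantees, which this toy framing ignores.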
Re: HAST instability
On Tue, 14 Jun 2011 16:39:11 +0300 Daniel Kalchev wrote:

DK> On 10.06.11 20:07, Mikolaj Golub wrote:
>> On Fri, 10 Jun 2011 20:05:43 +0300 Mikolaj Golub wrote to Daniel Kalchev:
>>
>> MG> Could you please try this patch?
>> MG> http://people.freebsd.org/~trociny/hastd.no_shutdown.patch
>>
>> Sure, you still have to have your kernel patched with uipc_socket.c.patch :-)

DK> It is now running for about a day with both patches applied, without
DK> disconnects.

DK> Also, now TCP/IP connections always stay in ESTABLISHED state. As I
DK> believe they should. Primary to secondary connections drain quickly on
DK> switching from init to primary etc. No troubles without checksums as
DK> well. Kernel is as of

Thanks!

It has turned out that automatic receive buffer sizing works only for connections in ESTABLISHED state. And with a small receive buffer the connection might get stuck, sending data only via TCP window probes -- one byte every few seconds (see "Scenario to make recv(MSG_WAITALL) stuck" in net@ for details).

hastd.no_shutdown.patch disables closing of unused directions, so the connections remain in ESTABLISHED state and automatic receive buffer sizing works again.

uipc_socket.c.patch has been committed to CURRENT and I am going to MFC it soon.

DK> FreeBSD b1a 8.2-STABLE FreeBSD 8.2-STABLE #1: Mon Jun 13 11:32:38 EEST
DK> 2011 root@b1a:/usr/obj/usr/src/sys/GENERIC amd64

DK> Daniel

--
Mikolaj Golub
Re: HAST instability
On 10.06.11 20:07, Mikolaj Golub wrote:

>> On Fri, 10 Jun 2011 20:05:43 +0300 Mikolaj Golub wrote to Daniel Kalchev:
>>
>> MG> Could you please try this patch?
>> MG> http://people.freebsd.org/~trociny/hastd.no_shutdown.patch
>>
>> Sure, you still have to have your kernel patched with uipc_socket.c.patch :-)

It is now running for about a day with both patches applied, without disconnects.

Also, now TCP/IP connections always stay in ESTABLISHED state, as I believe they should. Primary to secondary connections drain quickly on switching from init to primary etc. No troubles without checksums as well. Kernel is as of:

FreeBSD b1a 8.2-STABLE FreeBSD 8.2-STABLE #1: Mon Jun 13 11:32:38 EEST 2011 root@b1a:/usr/obj/usr/src/sys/GENERIC amd64

Daniel
Re: HAST instability
On Fri, 10 Jun 2011 20:05:43 +0300 Mikolaj Golub wrote to Daniel Kalchev:

MG> Could you please try this patch?
MG> http://people.freebsd.org/~trociny/hastd.no_shutdown.patch

Sure, you still have to have your kernel patched with uipc_socket.c.patch :-)

--
Mikolaj Golub
Re: HAST instability
On Fri, 03 Jun 2011 19:18:29 +0300 Daniel Kalchev wrote:

DK> Well, apparently my HAST joy was short. On a second run, I got stuck with
DK>
DK> Jun 3 19:08:16 b1a hastd[1900]: [data2] (primary) Unable to receive
DK> reply header: Operation timed out.
DK>
DK> on the primary. No messages on the secondary.
DK>
DK> On primary:
DK>
DK> # netstat -an | grep 8457
DK> tcp4    0      0  10.2.101.11.42659  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0      0  10.2.101.11.62058  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.34646  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0      0  10.2.101.11.11419  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.37773  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0      0  10.2.101.11.21911  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0      0  10.2.101.11.40169  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0  97749  10.2.101.11.44360  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.8457   *.*               LISTEN
DK>
DK> on secondary
DK>
DK> # netstat -an | grep 8457
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.42659  CLOSE_WAIT
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.62058  FIN_WAIT_2
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.34646  CLOSE_WAIT
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.11419  FIN_WAIT_2
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.37773  CLOSE_WAIT
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.21911  CLOSE_WAIT
DK> tcp4      0  0  10.2.101.12.8457  10.2.101.11.40169  FIN_WAIT_2
DK> tcp4  66415  0  10.2.101.12.8457  10.2.101.11.44360  FIN_WAIT_2
DK> tcp4      0  0  10.2.101.12.8457  *.*                LISTEN
DK>
DK> on primary
DK>
DK> # hastctl status
DK> data0:
DK>   role: primary
DK>   provname: data0
DK>   localpath: /dev/gpt/data0
DK>   extentsize: 2097152 (2.0MB)
DK>   keepdirty: 64
DK>   remoteaddr: 10.2.101.12
DK>   sourceaddr: 10.2.101.11
DK>   replication: fullsync
DK>   status: complete
DK>   dirty: 0 (0B)
DK> data1:
DK>   role: primary
DK>   provname: data1
DK>   localpath: /dev/gpt/data1
DK>   extentsize: 2097152 (2.0MB)
DK>   keepdirty: 64
DK>   remoteaddr: 10.2.101.12
DK>   sourceaddr: 10.2.101.11
DK>   replication: fullsync
DK>   status: complete
DK>   dirty: 0 (0B)
DK> data2:
DK>   role: primary
DK>   provname: data2
DK>   localpath: /dev/gpt/data2
DK>   extentsize: 2097152 (2.0MB)
DK>   keepdirty: 64
DK>   remoteaddr: 10.2.101.12
DK>   sourceaddr: 10.2.101.11
DK>   replication: fullsync
DK>   status: complete
DK>   dirty: 6291456 (6.0MB)
DK> data3:
DK>   role: primary
DK>   provname: data3
DK>   localpath: /dev/gpt/data3
DK>   extentsize: 2097152 (2.0MB)
DK>   keepdirty: 64
DK>   remoteaddr: 10.2.101.12
DK>   sourceaddr: 10.2.101.11
DK>   replication: fullsync
DK>   status: complete
DK>   dirty: 0 (0B)
DK>
DK> Sits in this state for over 10 minutes.
DK>
DK> Unfortunately, no KDB in kernel. Any ideas what else to look for?

Could you please try this patch?

http://people.freebsd.org/~trociny/hastd.no_shutdown.patch

After patching you need to rebuild hastd and restart it (I expect restarting only on the secondary is enough, but it is better to do this on both nodes). No server restart is needed.

--
Mikolaj Golub
Re: HAST instability
Well, apparently my HAST joy was short. On a second run, I got stuck with

Jun 3 19:08:16 b1a hastd[1900]: [data2] (primary) Unable to receive reply header: Operation timed out.

on the primary. No messages on the secondary.

On primary:

# netstat -an | grep 8457
tcp4    0      0  10.2.101.11.42659  10.2.101.12.8457  FIN_WAIT_2
tcp4    0      0  10.2.101.11.62058  10.2.101.12.8457  CLOSE_WAIT
tcp4    0      0  10.2.101.11.34646  10.2.101.12.8457  FIN_WAIT_2
tcp4    0      0  10.2.101.11.11419  10.2.101.12.8457  CLOSE_WAIT
tcp4    0      0  10.2.101.11.37773  10.2.101.12.8457  FIN_WAIT_2
tcp4    0      0  10.2.101.11.21911  10.2.101.12.8457  FIN_WAIT_2
tcp4    0      0  10.2.101.11.40169  10.2.101.12.8457  CLOSE_WAIT
tcp4    0  97749  10.2.101.11.44360  10.2.101.12.8457  CLOSE_WAIT
tcp4    0      0  10.2.101.11.8457   *.*               LISTEN

on secondary

# netstat -an | grep 8457
tcp4      0  0  10.2.101.12.8457  10.2.101.11.42659  CLOSE_WAIT
tcp4      0  0  10.2.101.12.8457  10.2.101.11.62058  FIN_WAIT_2
tcp4      0  0  10.2.101.12.8457  10.2.101.11.34646  CLOSE_WAIT
tcp4      0  0  10.2.101.12.8457  10.2.101.11.11419  FIN_WAIT_2
tcp4      0  0  10.2.101.12.8457  10.2.101.11.37773  CLOSE_WAIT
tcp4      0  0  10.2.101.12.8457  10.2.101.11.21911  CLOSE_WAIT
tcp4      0  0  10.2.101.12.8457  10.2.101.11.40169  FIN_WAIT_2
tcp4  66415  0  10.2.101.12.8457  10.2.101.11.44360  FIN_WAIT_2
tcp4      0  0  10.2.101.12.8457  *.*                LISTEN

on primary

# hastctl status
data0:
  role: primary
  provname: data0
  localpath: /dev/gpt/data0
  extentsize: 2097152 (2.0MB)
  keepdirty: 64
  remoteaddr: 10.2.101.12
  sourceaddr: 10.2.101.11
  replication: fullsync
  status: complete
  dirty: 0 (0B)
data1:
  role: primary
  provname: data1
  localpath: /dev/gpt/data1
  extentsize: 2097152 (2.0MB)
  keepdirty: 64
  remoteaddr: 10.2.101.12
  sourceaddr: 10.2.101.11
  replication: fullsync
  status: complete
  dirty: 0 (0B)
data2:
  role: primary
  provname: data2
  localpath: /dev/gpt/data2
  extentsize: 2097152 (2.0MB)
  keepdirty: 64
  remoteaddr: 10.2.101.12
  sourceaddr: 10.2.101.11
  replication: fullsync
  status: complete
  dirty: 6291456 (6.0MB)
data3:
  role: primary
  provname: data3
  localpath: /dev/gpt/data3
  extentsize: 2097152 (2.0MB)
  keepdirty: 64
  remoteaddr: 10.2.101.12
  sourceaddr: 10.2.101.11
  replication: fullsync
  status: complete
  dirty: 0 (0B)

Sits in this state for over 10 minutes.

Unfortunately, no KDB in kernel. Any ideas what else to look for?

Daniel
Re: HAST instability
Decided to apply the patch proposed in -current by Mikolaj Golub:

http://people.freebsd.org/~trociny/uipc_socket.c.patch

This apparently fixed my issue as well. Running without checksums for a full bonnie++ run (~100GB write/rewrite) produced no disconnects, no stalls, and generated up to 280MB/sec (4 drives in a striped zpool). Interestingly, the hast devices' write latency as observed by gstat was under 30ms. I believe this fix should be committed.

Here are the accumulated netstat -s from both hosts, for comparison with previous runs. Retransmits etc. are much lower.

http://news.digsys.bg/~admin/hast/test3jun-fix/b1a-netstat-s
http://news.digsys.bg/~admin/hast/test3jun-fix/b1b-netstat-s
http://news.digsys.bg/~admin/hast/test3jun-fix/b1b-systat-if-fix

Before applying the patch I verified there are no network problems. Created a 1TB file from /dev/random on the first host. Copied it over to the second host with ftp. Transfer speed was low, at 80MB/sec -- ftp would utilize one CPU core 100% at the receiving node. Then calculated md5 checksums on both sides; they matched.

Daniel
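[Editorial note: the end-to-end verification described above -- hash the same large file on both hosts and compare -- can be done in constant memory by hashing in chunks. `file_md5` below is a generic helper for illustration, not a tool from the thread; a small temporary file stands in for the 1TB test file.]

```python
# Chunked MD5 of a file, so even a 1TB file is hashed in constant memory.
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1 << 20):
    # Stream the file through MD5 one chunk (default 1 MiB) at a time.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Self-check with a ~3 MiB temporary file of random data.
data = os.urandom(3 * (1 << 20) + 17)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
digest = file_md5(tmp.name)
os.unlink(tmp.name)
```

Run the same function on both hosts and compare the hex digests; any mismatch indicates corruption in transit.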
Re: HAST instability
Here goes the second run, without checksums.

# systat -if

  Interface       Traffic            Peak            Total
  lo0   in      0.000 KB/s     71.666 KB/s      361.825 KB
        out     0.000 KB/s     71.666 KB/s      361.825 KB
  ix1   in      0.021 KB/s    816.608 MB/s      625.751 GB
        out     0.016 KB/s      7.384 MB/s       23.032 GB
  igb0  in      0.025 KB/s      1.507 KB/s       11.547 MB
        out     0.069 KB/s      1.765 KB/s       17.140 MB

This time it managed to achieve 800MB/s, wow! Anyway, no idea when this happened, as during my observation it didn't manage to push much data, due to frequent disconnects. The typical "good" rate was lower than with checksums, just over 100MB/s.

From primary:
messages: http://news.digsys.bg/~admin/hast/test31may-2/b1a-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may-2/b1a-netstat-in
netstat -s: http://news.digsys.bg/~admin/hast/test31may-2/b1a-netstat-s

From secondary:
messages: http://news.digsys.bg/~admin/hast/test31may-2/b1b-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may-2/b1b-netstat-in
netstat -s: http://news.digsys.bg/~admin/hast/test31may-2/b1b-netstat-s

Daniel
Re: HAST instability
On 31.05.11 17:08, Mikolaj Golub wrote:

MG> As I wrote privately, it would be nice to see both netstat and hast logs
MG> (from both nodes) for the same rather long period, when several cases
MG> occurred. It would be good to place them somewhere on the web so other
MG> guys could access them too, as I will be offline for 7-10 days and will
MG> not be able to help you until I am back.

The test finished running for almost three hours, and so here is the collected data (for the duration of the test, on the secondary node):

# systat -if

  Interface       Traffic            Peak            Total
  lo0   in      0.000 KB/s      0.000 KB/s        1.126 KB
        out     0.000 KB/s      0.000 KB/s        1.126 KB
  ix1   in      0.003 KB/s    230.590 MB/s      614.688 GB
        out     0.054 KB/s      7.425 MB/s       19.910 GB
  igb0  in      0.025 KB/s      3.636 KB/s      566.897 KB
        out     0.072 KB/s      4.296 KB/s        1.091 MB

The primary node is b1a, the secondary node is b1b. Kernel (built just after csup update):

FreeBSD b1a 8.2-STABLE FreeBSD 8.2-STABLE #1: Mon May 30 14:17:50 EEST 2011 root@b1a:/usr/obj/usr/src/sys/GENERIC amd64

From primary:
messages: http://news.digsys.bg/~admin/hast/test31may/b1a-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may/b1a-netstat-in
netstat -s: http://news.digsys.bg/~admin/hast/test31may/b1a-netstat-s

From secondary:
messages: http://news.digsys.bg/~admin/hast/test31may/b1b-messages
netstat -in: http://news.digsys.bg/~admin/hast/test31may/b1b-netstat-in
netstat -s: http://news.digsys.bg/~admin/hast/test31may/b1b-netstat-s

DK> One additional note: while playing with this setup, I tried to
DK> simulate local disk going away in the hope HAST will switch to using
DK> the remote disk. Instead of asking someone at the site to pull out the
DK> drive, I just issued on the primary
DK>
DK> hastctl role init data0
DK>
DK> which resulted in kernel panic. Unfortunately, there was no sufficient
DK> dump space for 48GB. I will re-run this again with more drives for the
DK> crash dump. Anything you want me to look for in particular? (kernels
DK> have no KDB compiled in yet)

MG> Well, removing a physical disk (the device /dev/gpt/data0 consumed by
MG> hastd disappears) and switching a resource to the init role (the device
MG> /dev/hast/data0 consumed by the FS disappears) are two different things.
MG> Surely you should not normally change the resource role (destroy the
MG> hast device) before unmounting (exporting) the FS.

Then how do I proceed with a failed drive? Or a flaky drive that is still visible to the OS, that I want to remove from HAST and replace with a different one? How do I ask HAST to switch I/O to the secondary? Is there another way to get a drive out of HAST?

In any case, even if this is not an allowed operation, it should not panic.

I am now going to reboot and run the same tests without checksums.

Daniel
Re: HAST instability
On Tue, 31 May 2011 15:51:07 +0300 Daniel Kalchev wrote:

DK> On 30.05.11 21:42, Mikolaj Golub wrote:
>> DK> One strange thing is that there is never established TCP connection
>> DK> between both nodes:
>>
>> DK> tcp4    0      0  10.2.101.11.48939  10.2.101.12.8457  FIN_WAIT_2
>> DK> tcp4    0   1288  10.2.101.11.57008  10.2.101.12.8457  CLOSE_WAIT
>> DK> tcp4    0      0  10.2.101.11.46346  10.2.101.12.8457  FIN_WAIT_2
>> DK> tcp4    0  90648  10.2.101.11.13916  10.2.101.12.8457  CLOSE_WAIT
>> DK> tcp4    0      0  10.2.101.11.8457   *.*               LISTEN
>>
>> It is normal. hastd uses the connections only in one direction so it calls
>> shutdown to close unused directions.

DK> So the TCP connections are all too short-lived that I can never see a
DK> single one in ESTABLISHED state? 10Gbit Ethernet is indeed fast, so
DK> this might well be possible...

No, the connections are persistent; just one (unused) direction of communication is closed. See shutdown(2) for further info.

>> I would like to look at full logs for some rather large period, with
>> several cases, from both primary and secondary (and be sure about
>> synchronized time).

DK> I have made sure clocks are synchronized and am currently running on
DK> freshly rebooted nodes (with two additional SATA drives at each node) --
DK> so far some interesting findings, like I get hash errors and
DK> disconnects much more frequently now. Will post when a bonnie++ run on
DK> the ZFS filesystem on top of the HAST resources finishes.

As I wrote privately, it would be nice to see both netstat and hast logs (from both nodes) for the same rather long period, when several cases occurred. It would be good to place them somewhere on the web so other guys could access them too, as I will be offline for 7-10 days and will not be able to help you until I am back.

DK> One additional note: while playing with this setup, I tried to
DK> simulate local disk going away in the hope HAST will switch to using
DK> the remote disk. Instead of asking someone at the site to pull out the
DK> drive, I just issued on the primary
DK>
DK> hastctl role init data0
DK>
DK> which resulted in kernel panic. Unfortunately, there was no sufficient
DK> dump space for 48GB. I will re-run this again with more drives for the
DK> crash dump. Anything you want me to look for in particular? (kernels
DK> have no KDB compiled in yet)

Well, removing a physical disk (the device /dev/gpt/data0 consumed by hastd disappears) and switching a resource to the init role (the device /dev/hast/data0 consumed by the FS disappears) are two different things. Surely you should not normally change the resource role (destroy the hast device) before unmounting (exporting) the FS.

--
Mikolaj Golub
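[Editorial note: the half-close behavior Mikolaj describes -- each connection used in one direction only, with shutdown(2) closing the unused direction -- is what produces the FIN_WAIT_2/CLOSE_WAIT pairs in the netstat output above. A minimal sketch over localhost, in Python for brevity:]

```python
# Each side shuts down the direction it never uses; the remaining
# direction keeps working, even though netstat would no longer show the
# connection as fully open in both directions.
import socket

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
sender = socket.create_connection(server.getsockname())
receiver, _ = server.accept()

# sender will only ever write, so it shuts down its read side;
# receiver will only ever read, so it shuts down its write side
# (this sends a FIN, half-closing the connection).
sender.shutdown(socket.SHUT_RD)
receiver.shutdown(socket.SHUT_WR)

sender.sendall(b"WRITE request")    # the remaining direction still works
reply = receiver.recv(64)

sender.close()
receiver.close()
server.close()
```

The sockets stay usable one way indefinitely; only the automatic receive buffer sizing issue discussed later in the thread made this pattern problematic.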
Re: HAST instability
On 30.05.11 21:42, Mikolaj Golub wrote:

DK> One strange thing is that there is never established TCP connection
DK> between both nodes:
DK>
DK> tcp4    0      0  10.2.101.11.48939  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0   1288  10.2.101.11.57008  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.46346  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0  90648  10.2.101.11.13916  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.8457   *.*               LISTEN

MG> It is normal. hastd uses the connections only in one direction so it
MG> calls shutdown to close unused directions.

So the TCP connections are all too short-lived that I can never see a single one in ESTABLISHED state? 10Gbit Ethernet is indeed fast, so this might well be possible...

MG> I suppose when checksum is enabled the bottleneck is the CPU, the
MG> traffic rate is lower and the problem is not triggered.

I was thinking something like this. My later tests seem to suggest that this gets triggered when the network transfer rate is much higher than the disk transfer rate.

MG> The "Hash mismatch" message suggests that actually you were using a
MG> checksum then, weren't you?

Yes, this occurs only when checksums are enabled. Happens with both crc32 and sha256.

MG> I would like to look at full logs for some rather large period, with
MG> several cases, from both primary and secondary (and be sure about
MG> synchronized time).

I have made sure clocks are synchronized and am currently running on freshly rebooted nodes (with two additional SATA drives at each node) -- so far some interesting findings, like I get hash errors and disconnects much more frequently now. Will post when a bonnie++ run on the ZFS filesystem on top of the HAST resources finishes.

MG> Also, it might be worth checking that there is no network packet
MG> corruption (some strange things in netstat -di, netstat -s; maybe
MG> copying large files via the net and comparing checksums).

I will post these as well; however, so far no indication of any network problems was seen, no interface errors etc. Might be the ix driver is not reporting such, of course.

One additional note: while playing with this setup, I tried to simulate a local disk going away, in the hope HAST will switch to using the remote disk. Instead of asking someone at the site to pull out the drive, I just issued on the primary

hastctl role init data0

which resulted in a kernel panic. Unfortunately, there was not sufficient dump space for 48GB. I will re-run this again with more drives for the crash dump. Anything you want me to look for in particular? (The kernels have no KDB compiled in yet.)

Daniel
Re: HAST instability
On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

DK> tcp4    0      0  10.2.101.11.48939  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0   1288  10.2.101.11.57008  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.46346  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0  90648  10.2.101.11.13916  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.8457   *.*               LISTEN

Also, it might be useful to see if you normally have full receive buffers like above, or only when the issue is observed, by running netstat in a loop, something like below:

while sleep 5; do
    t=`date '+%F %H:%M:%S'`
    netstat -na | grep 8457 | while read l; do echo "$t $l"; done
done > /tmp/netstat.log

--
Mikolaj Golub
Re: HAST instability
On Mon, 30 May 2011 17:43:04 +0300 Daniel Kalchev wrote:

DK> Some further investigation:
DK>
DK> The HAST nodes do not disconnect when checksum is enabled (either
DK> crc32 or sha256).
DK>
DK> One strange thing is that there is never established TCP connection
DK> between both nodes:
DK>
DK> tcp4    0      0  10.2.101.11.48939  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0   1288  10.2.101.11.57008  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.46346  10.2.101.12.8457  FIN_WAIT_2
DK> tcp4    0  90648  10.2.101.11.13916  10.2.101.12.8457  CLOSE_WAIT
DK> tcp4    0      0  10.2.101.11.8457   *.*               LISTEN

It is normal. hastd uses the connections only in one direction, so it calls shutdown to close the unused directions.

DK> When using sha256 one CPU core is 100% utilized by each hastd process,
DK> while 70-80MB/sec per HAST resource is being transferred (total of up
DK> to 140 MB/sec traffic for both);
DK> When using crc32 each CPU core is at 22% utilization;
DK> When using none as checksum, CPU usage is under 10%

I suppose when checksum is enabled the bottleneck is the CPU, the traffic rate is lower, and the problem is not triggered.

DK> Eventually after many hours, got corrupted communication:
DK>
DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.

The "Hash mismatch" message suggests that actually you were using a checksum then, weren't you?

DK> May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive
DK> request data: No such file or directory.
DK> May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process
DK> exited ungracefully (pid=9827, exitcode=75).
DK>
DK> and
DK>
DK> May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive
DK> reply header: Operation timed out.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from
DK> 10.2.101.12.
DK> May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send
DK> request (Broken pipe): WRITE(99128470016, 131072).

It looks a little different than in your first message. Do you have the clocks in sync on both nodes?

I would like to look at full logs for some rather large period, with several cases, from both primary and secondary (and be sure about synchronized time). Also, it might be worth checking that there is no network packet corruption (some strange things in netstat -di, netstat -s; maybe copying large files via the net and comparing checksums).

--
Mikolaj Golub
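[Editorial note: the CPU-utilization difference quoted above (100% per core with sha256 vs ~22% with crc32) reflects the per-byte cost of the two checksums. A rough, machine-dependent illustration using Python's stdlib implementations -- not hastd's code paths, just the same algorithms over an arbitrary buffer:]

```python
# Time crc32 and sha256 over the same buffer; sha256 costs far more per
# byte, which matches the hastd CPU observations in the thread.
import hashlib
import time
import zlib

def time_checksum(fn, buf, rounds=10):
    # Total wall-clock time for `rounds` passes over `buf`.
    start = time.perf_counter()
    for _ in range(rounds):
        fn(buf)
    return time.perf_counter() - start

buf = b"\x5a" * (4 << 20)                  # 4 MiB test buffer
t_crc = time_checksum(lambda b: zlib.crc32(b), buf)
t_sha = time_checksum(lambda b: hashlib.sha256(b).digest(), buf)
print(f"crc32: {t_crc:.4f}s  sha256: {t_sha:.4f}s for 40 MiB each")
```

Absolute numbers vary by machine and library; the point is only the relative gap, which is why disabling checksums raises the traffic rate enough to expose the buffer-sizing problem.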
Re: HAST instability
Some further investigation:

The HAST nodes do not disconnect when checksum is enabled (either crc32 or sha256).

One strange thing is that there is never an established TCP connection between both nodes:

tcp4    0      0  10.2.101.11.48939  10.2.101.12.8457  FIN_WAIT_2
tcp4    0   1288  10.2.101.11.57008  10.2.101.12.8457  CLOSE_WAIT
tcp4    0      0  10.2.101.11.46346  10.2.101.12.8457  FIN_WAIT_2
tcp4    0  90648  10.2.101.11.13916  10.2.101.12.8457  CLOSE_WAIT
tcp4    0      0  10.2.101.11.8457   *.*               LISTEN

When using sha256, one CPU core is 100% utilized by each hastd process, while 70-80MB/sec per HAST resource is being transferred (a total of up to 140 MB/sec traffic for both);
When using crc32, each CPU core is at 22% utilization;
When using none as checksum, CPU usage is under 10%.

Eventually, after many hours, communication got corrupted:

May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Hash mismatch.
May 30 17:32:35 b1b hastd[9827]: [data0] (secondary) Unable to receive request data: No such file or directory.
May 30 17:32:38 b1b hastd[9397]: [data0] (secondary) Worker process exited ungracefully (pid=9827, exitcode=75).

and

May 30 17:32:27 b1a hastd[1837]: [data0] (primary) Unable to receive reply header: Operation timed out.
May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Disconnected from 10.2.101.12.
May 30 17:32:30 b1a hastd[1837]: [data0] (primary) Unable to send request (Broken pipe): WRITE(99128470016, 131072).

Daniel