I am trying to get a basic HAST setup working on 8-stable (as of today). The hardware is two Supermicro blades, each with two Xeon E5620 processors, 48GB RAM, an integrated LSI2008 controller, two 600GB SAS2 Toshiba drives, two Intel gigabit interfaces, and two Intel 10Gbit interfaces.

Each drive has a GPT partition intended for use by HAST. Each host thus has two HAST resources, data0 and data1 respectively. HAST runs over the 10Gbit interfaces, connected via the blade chassis's 10Gbit switch.

/etc/hast.conf is:

resource data0 {
        on b1a {
                local /dev/gpt/data0
                remote 10.2.101.12
        }
        on b1b {
                local /dev/gpt/data0
                remote 10.2.101.11
        }
}

resource data1 {
        on b1a {
                local /dev/gpt/data1
                remote 10.2.101.12
        }
        on b1b {
                local /dev/gpt/data1
                remote 10.2.101.11
        }
}
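For completeness, the resources were initialized and brought up in the usual way with hastctl (a sketch of the commands I mean; b1a acts as primary here, b1b mirrors it with the roles reversed):

    # on both nodes, once, to initialize the on-disk metadata:
    hastctl create data0
    hastctl create data1

    # on b1a (b1b gets "role secondary"):
    hastctl role primary data0
    hastctl role primary data1

    # check connection state and dirty counters:
    hastctl status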

On top of data0 and data1 I run a ZFS mirror, although this doesn't seem to be relevant here.
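Concretely, the pool is built on the /dev/hast/* providers that appear on the primary once the resources connect (pool name "tank" is just an example, not what matters here):

    # on the primary only; hast/data0 and hast/data1 exist only there
    zpool create tank mirror hast/data0 hast/data1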

What I am observing is very jumpy performance; the two nodes often disconnect. On the primary:

May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to receive reply header: Socket is not connected.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to send request (Broken pipe): WRITE(60470853632, 131072).
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Disconnected from 10.2.101.11.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to write synchronization data: Socket is not connected.

On the secondary:

May 29 03:03:14 b1a hastd[28357]: [data1] (secondary) Unable to receive request header: RPC version wrong.
May 29 03:03:19 b1a hastd[11659]: [data1] (secondary) Worker process exited ungracefully (pid=28357, exitcode=75).
May 29 03:05:31 b1a hastd[35535]: [data0] (secondary) Unable to receive request header: RPC version wrong.
May 29 03:05:36 b1a hastd[11659]: [data0] (secondary) Worker process exited ungracefully (pid=35535, exitcode=75).

When it works, the replication rate observed with 'systat -if' is over 140MB/sec (perhaps limited by the drives' write throughput).

The only reference to these error messages I found is in http://lists.freebsd.org/pipermail/freebsd-stable/2010-November/059817.html, and that thread indicated the fix was committed.

About the only tuning on these machines is kern.ipc.nmbclusters=51200; with the default value the 10Gbit interfaces would not work and the system would run out of mbufs anyway.
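That is, the only non-default setting is this one tunable (set at boot; the value 51200 is just what I picked to keep the 10Gbit interfaces fed with clusters):

    # /boot/loader.conf
    kern.ipc.nmbclusters=51200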

Has anyone observed something similar? Any ideas how to fix it?

Daniel
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable