I am trying to get a basic HAST setup working on 8-stable (as of today). The hardware is two Supermicro blades, each with two Xeon E5620 processors, 48GB RAM, an integrated LSI2008 controller, two 600GB SAS2 Toshiba drives, two Intel gigabit interfaces, and two Intel 10Gbit interfaces.

Each drive has a GPT partition intended for use by HAST. Each host thus has two HAST resources, data0 and data1 respectively. HAST runs over the 10Gbit interfaces, connected via the blade chassis's 10Gbit switch.

/etc/hast.conf is:

resource data0 {
        on b1a {
                local /dev/gpt/data0
                remote 10.2.101.12
        }
        on b1b {
                local /dev/gpt/data0
                remote 10.2.101.11
        }
}

resource data1 {
        on b1a {
                local /dev/gpt/data1
                remote 10.2.101.12
        }
        on b1b {
                local /dev/gpt/data1
                remote 10.2.101.11
        }
}
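For completeness, the resources were initialized and brought up in the usual way with hastctl (a sketch of the commands I mean; b1a acts as primary here, b1b mirrors it with the roles reversed):

    # on both nodes, once, to initialize the on-disk metadata:
    hastctl create data0
    hastctl create data1

    # on b1a (b1b gets "role secondary"):
    hastctl role primary data0
    hastctl role primary data1

    # check connection state and dirty counters:
    hastctl status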

On top of data0 and data1 I run a ZFS mirror, although this doesn't seem to be relevant here.
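Concretely, the pool is built on the /dev/hast/* providers that appear on the primary once the resources connect (pool name "tank" is just an example, not what matters here):

    # on the primary only; hast/data0 and hast/data1 exist only there
    zpool create tank mirror hast/data0 hast/data1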

What I am observing is very jumpy performance; the two nodes often disconnect. On the primary:

May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to receive reply header: Socket is not connected.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to send request (Broken pipe): WRITE(60470853632, 131072).
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Disconnected from 10.2.101.11.
May 29 13:06:33 b1b hastd[2372]: [data0] (primary) Unable to write synchronization data: Socket is not connected.

On the secondary:

May 29 03:03:14 b1a hastd[28357]: [data1] (secondary) Unable to receive request header: RPC version wrong.
May 29 03:03:19 b1a hastd[11659]: [data1] (secondary) Worker process exited ungracefully (pid=28357, exitcode=75).
May 29 03:05:31 b1a hastd[35535]: [data0] (secondary) Unable to receive request header: RPC version wrong.
May 29 03:05:36 b1a hastd[11659]: [data0] (secondary) Worker process exited ungracefully (pid=35535, exitcode=75).

When it works, the replication rate observed with 'systat -if' is over 140MB/sec (perhaps limited by the drives' write throughput).

The only reference to these error messages I found is in http://lists.freebsd.org/pipermail/freebsd-stable/2010-November/059817.html, and that thread indicated the fix was committed.

About the only tuning on these machines is kern.ipc.nmbclusters=51200; with the default value the 10Gbit interfaces would not work and the system would run out of mbufs anyway.
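That is, the only non-default setting is this one tunable (set at boot; the value 51200 is just what I picked to keep the 10Gbit interfaces fed with clusters):

    # /boot/loader.conf
    kern.ipc.nmbclusters=51200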

Has anyone observed something similar? Any ideas how to fix it?

Daniel
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable