Re: Issue with hast replication

2012-03-17 Thread Phil Regnauld
Mikolaj Golub (to.my.trociny) writes:
> 
> I just tried to reproduce this and failed. For me a new resource was added
> without problems on reload.
> 
> Mar 17 20:04:24 kopusha hastd[52678]: Reloading configuration...
> Mar 17 20:04:24 kopusha hastd[52678]: Keep listening on address 0.0.0.0:7771.
> Mar 17 20:04:24 kopusha hastd[52678]: Resource rtest added.
> Mar 17 20:04:24 kopusha hastd[52678]: Configuration reloaded successfully.
> 
> You sent SIGHUP to the master process, and on both hosts, didn't you?

Nope :-| Duh.
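
For reference, hastd rereads hast.conf on SIGHUP, and the signal has to
reach the main hastd process on each node. A minimal sketch, assuming the
default pidfile /var/run/hastd.pid and ssh access to the peer (the function
name reload_hastd is made up for illustration):

```sh
# Send SIGHUP to the hastd master process so that it rereads hast.conf.
# Assumption: the pidfile path below is the hastd default; pass another
# path as the first argument if hastd was started with -P.
reload_hastd() {
    pidfile="${1:-/var/run/hastd.pid}"
    if [ ! -r "$pidfile" ]; then
        echo "reload_hastd: cannot read $pidfile" >&2
        return 1
    fi
    kill -HUP "$(cat "$pidfile")"
}

# Typical use, once per node (h2 is the peer host from this thread):
#   reload_hastd
#   ssh h2 'kill -HUP "$(cat /var/run/hastd.pid)"'
```

The log lines quoted above ("Reloading configuration...", "Resource rtest
added.") are what to look for afterwards on both hosts.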

> Could you please provide more details if you still fail to add new resources
> on the fly (configuration, log messages).

I'll look. Right now, I need to try and reproduce the original hast-over-zvol
problem.

Thanks,
Phil
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: Issue with hast replication

2012-03-17 Thread Mikolaj Golub

On Tue, 13 Mar 2012 00:22:23 +0100 Phil Regnauld wrote:

 PR>   (side note: hastd doesn't pick up configuration changes even with SIGHUP,
 PR>    which makes it hard to provision new resources on the fly)

I just tried to reproduce this and failed. For me a new resource was added
without problems on reload.

Mar 17 20:04:24 kopusha hastd[52678]: Reloading configuration...
Mar 17 20:04:24 kopusha hastd[52678]: Keep listening on address 0.0.0.0:7771.
Mar 17 20:04:24 kopusha hastd[52678]: Resource rtest added.
Mar 17 20:04:24 kopusha hastd[52678]: Configuration reloaded successfully.

You sent SIGHUP to the master process, and on both hosts, didn't you?

Could you please provide more details if you still fail to add new resources
on the fly (configuration, log messages).

-- 
Mikolaj Golub


Re: Issue with hast replication

2012-03-13 Thread Phil Regnauld
Mikolaj Golub (to.my.trociny) writes:
> 
> 
> What about failed counters like mbuf_alloc_failed_count,
> dma_map_addr_rx_failed_count, dma_map_addr_tx_failed_count?

dev.bce.0.l2fhdr_error_count: 0
dev.bce.0.mbuf_alloc_failed_count: 0
dev.bce.0.mbuf_frag_count: 0
dev.bce.0.dma_map_addr_rx_failed_count: 0
dev.bce.0.dma_map_addr_tx_failed_count: 0
dev.bce.0.unexpected_attention_count: 0


Re: Issue with hast replication

2012-03-13 Thread Mikolaj Golub

On Tue, 13 Mar 2012 22:19:28 +0100 Phil Regnauld wrote:

 PR> dev.bce.0.l2fhdr_error_count: 0
 PR> dev.bce.0.stat_emac_tx_stat_dot3statsinternalmactransmiterrors: 0
 PR> dev.bce.0.stat_Dot3StatsCarrierSenseErrors: 0
 PR> dev.bce.0.stat_Dot3StatsFCSErrors: 0
 PR> dev.bce.0.stat_Dot3StatsAlignmentErrors: 0

What about failed counters like mbuf_alloc_failed_count,
dma_map_addr_rx_failed_count, dma_map_addr_tx_failed_count?

-- 
Mikolaj Golub


Re: Issue with hast replication

2012-03-13 Thread Phil Regnauld
Mikolaj Golub (to.my.trociny) writes:
> 
> Ok. So it is send(2). I suppose the network driver could generate the
> error. Did you say which network adapter you have?

Not yet.

bce0:  mem 0xf400-0xf5ff irq 16 at device 0.0 on pci2
bce0: ASIC (0x57092003); Rev (C0); Bus (PCIe x2, 2.5Gbps); B/C (4.6.4); Bufs (RX:2;TX:2;PG:8); Flags (SPLT|MSI|MFW); MFW (NCSI 1.0.3)

>  PR> No obvious errors there either, but again what should I look out for ?
> 
> I would look at the sysctl -a dev. statistics and check whether there is a
> correlation between the ENOMEM failures and growth in the error counters.

0 errors:

dev.bce.0.l2fhdr_error_count: 0
dev.bce.0.stat_emac_tx_stat_dot3statsinternalmactransmiterrors: 0
dev.bce.0.stat_Dot3StatsCarrierSenseErrors: 0
dev.bce.0.stat_Dot3StatsFCSErrors: 0
dev.bce.0.stat_Dot3StatsAlignmentErrors: 0

> Looking at buffer usage in 'netstat -nax' output captured during synchronization
> (on both hosts) could show where the bottleneck is. top -HS
> output might be useful too.

Good point.

I'll have to attempt to recreate the problem, as the volume has replicated
without errors. Typical.

Cheers,
Phil


Re: Issue with hast replication

2012-03-13 Thread Mikolaj Golub

On Tue, 13 Mar 2012 00:22:23 +0100 Phil Regnauld wrote:

 PR> Mikolaj Golub (to.my.trociny) writes:
 >> 
 >> It looks like in the case of hastd it was send(2) that returned ENOMEM, but
 >> it would be good to check. Could you please start synchronization again,
 >> ktrace the primary worker process when ENOMEM errors are observed and show
 >> the output here?

 PR> Ok, took a little while, as running ktrace on the hastd does slow it down
 PR> significantly, and the error normally occurs at 30-90 sec intervals.

 PR>    0x0f90 b2f3 3ad5 e657 7f0f 3e50 698f 5deb 12af  |..:..W..>Pi.]...|
 PR>    0x0fa0 740d c343 6e80 75f3 e1a7 bfdf a4c1 f6a6  |t..Cn.u.|
 PR>    0x0fb0 ea85 655d e423 bd5e 42f7 7e9a 05d2 363a  |..e].#.^B.~...6:|
 PR>    0x0fc0 025e a7b5 0956 417c f31c a6eb 2cd9 d073  |.^...VA|,..s|
 PR>    0x0fd0 2589 e8c0 d76a 889f 8345 eeaf f2a0 c2d6  |%j...E..|
 PR>    0x0fe0 b89e aaef fee2 6593 e515 7271 88aa cf66  |..e...rq...f|
 PR>    0x0ff0 d272 411a 7289 d6c9 6643 bdbe 3c8c 8ae8  |.rA.r...fC..<...|
 PR>  50959 hastd    RET   sendto 32768/0x8000
 PR>  50959 hastd    CALL  sendto(0x6,0x8024bf000,0x8000,0x2,0,0)
 PR>  50959 hastd    RET   sendto -1 errno 12 Cannot allocate memory
 PR>  50959 hastd    CALL  clock_gettime(0xd,0x7f3f86f0)
 PR>  50959 hastd    RET   clock_gettime 0
 PR>  50959 hastd    CALL  getpid
 PR>  50959 hastd    RET   getpid 50959/0xc70f
 PR>  50959 hastd    CALL  sendto(0x3,0x7f3f8780,0x84,0,0,0)
 PR>  50959 hastd    GIO   fd 3 wrote 132 bytes
 PR>    "<27>Mar 12 23:42:43 hastd[50959]: [hvol] (primary) Unable to sen\
 PR>    d request (Cannot allocate memory): WRITE(8626634752, 131072)."
 PR>  50959 hastd    RET   sendto 132/0x84
 PR>  50959 hastd    CALL  close(0x7)
 PR>  50959 hastd    RET   close 0

Ok. So it is send(2). I suppose the network driver could generate the
error. Did you say which network adapter you have?

 >> If it is send(2) that fails, then monitoring netstat and network driver
 >> statistics might be helpful. Something like
 >> 
 >> netstat -nax
 >> netstat -naT
 >> netstat -m
 >> netstat -nid

 PR> I could run this in a loop, but that would be a lot of data, and might
 PR> not be appropriate to paste here.

 PR> I didn't see any obvious errors, but I'm not sure what I'm looking for.
 PR> netstat -m didn't show anything close to running out of buffers or
 PR> clusters...

 >> sysctl -a dev.
 >>
 >> And maybe
 >> 
 >> vmstat -m
 >> vmstat -z

 PR> No obvious errors there either, but again what should I look out for ?

I would look at the sysctl -a dev. statistics and check whether there is a
correlation between the ENOMEM failures and growth in the error counters.
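
One way to do that check is to diff two sysctl snapshots, one taken before
and one after an ENOMEM burst. A rough sketch; the helper name
changed_counters and the snapshot file names are placeholders, and it
assumes the usual "name: value" output format of sysctl:

```sh
# Print every counter whose value differs between two snapshots.
# Each snapshot is a file of "name: value" lines, as printed by sysctl.
changed_counters() {
    awk -F': ' '
        NR == FNR { old[$1] = $2; next }     # first file: remember values
        ($1 in old) && old[$1] != $2 {       # second file: report changes
            print $1 ": " old[$1] " -> " $2
        }
    ' "$1" "$2"
}

# Typical use on the primary (file names are placeholders):
#   sysctl dev.bce.0 > /tmp/bce.before
#   ... wait for "Cannot allocate memory" to show up in the hastd log ...
#   sysctl dev.bce.0 > /tmp/bce.after
#   changed_counters /tmp/bce.before /tmp/bce.after
```

Ordinary traffic counters will always show up in the output; the interesting
lines are error or failure counters that move only when the ENOMEM errors
occur.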

 PR> In the meantime, I've also experimented with a few different scenarios,
 PR> and I'm quite puzzled.

 PR> For instance, I configured one of the other gigabit cards on each host to
 PR> provide a dedicated replication network. The main difference is that up
 PR> until now this has been running using tagged vlans. To be on the safe side,
 PR> I decided to use an untagged interface (the second gigabit adapter in each
 PR> machine).
 PR> 
 PR> Here's what I observed, and it is very odd:
 PR> 
 PR> - doing a dd ... | ssh dd fails in the same fashion as before

 PR> - I created a second zvol + hast resource of just 1 GB, and it replicated
 PR>   without any problems, peaking at 75 MB / sec (!) - maybe 1GB is too small
 PR>   ?
 PR> 
 PR>   (side note: hastd doesn't pick up configuration changes even with SIGHUP,
 PR>    which makes it hard to provision new resources on the fly)

 PR> - I restarted replication on the 100 G hast resource, and it's currently
 PR>   replicating without any problems over the second ethernet, but it's
 PR>   dragging along at 9-10 MB/sec, peaking at 29 MB/sec occasionally.

Looking at buffer usage in 'netstat -nax' output captured during synchronization
(on both hosts) could show where the bottleneck is. top -HS
output might be useful too.

 PR>   Earlier, I was observing peaks at 65-70 MB sec in between failures...

 PR> So I don't really know what to conclude :-| 

-- 
Mikolaj Golub


Re: Issue with hast replication

2012-03-12 Thread Phil Regnauld
Mikolaj Golub (to.my.trociny) writes:
> 
> It looks like in the case of hastd it was send(2) that returned ENOMEM, but
> it would be good to check. Could you please start synchronization again,
> ktrace the primary worker process when ENOMEM errors are observed and show
> the output here?

Ok, took a little while, as running ktrace on the hastd does slow it down
significantly, and the error normally occurs at 30-90 sec intervals.

   0x0f90 b2f3 3ad5 e657 7f0f 3e50 698f 5deb 12af  |..:..W..>Pi.]...|
   0x0fa0 740d c343 6e80 75f3 e1a7 bfdf a4c1 f6a6  |t..Cn.u.|
   0x0fb0 ea85 655d e423 bd5e 42f7 7e9a 05d2 363a  |..e].#.^B.~...6:|
   0x0fc0 025e a7b5 0956 417c f31c a6eb 2cd9 d073  |.^...VA|,..s|
   0x0fd0 2589 e8c0 d76a 889f 8345 eeaf f2a0 c2d6  |%j...E..|
   0x0fe0 b89e aaef fee2 6593 e515 7271 88aa cf66  |..e...rq...f|
   0x0ff0 d272 411a 7289 d6c9 6643 bdbe 3c8c 8ae8  |.rA.r...fC..<...|
 50959 hastd    RET   sendto 32768/0x8000
 50959 hastd    CALL  sendto(0x6,0x8024bf000,0x8000,0x2,0,0)
 50959 hastd    RET   sendto -1 errno 12 Cannot allocate memory
 50959 hastd    CALL  clock_gettime(0xd,0x7f3f86f0)
 50959 hastd    RET   clock_gettime 0
 50959 hastd    CALL  getpid
 50959 hastd    RET   getpid 50959/0xc70f
 50959 hastd    CALL  sendto(0x3,0x7f3f8780,0x84,0,0,0)
 50959 hastd    GIO   fd 3 wrote 132 bytes
   "<27>Mar 12 23:42:43 hastd[50959]: [hvol] (primary) Unable to sen\
   d request (Cannot allocate memory): WRITE(8626634752, 131072)."
 50959 hastd    RET   sendto 132/0x84
 50959 hastd    CALL  close(0x7)
 50959 hastd    RET   close 0

> If it is send(2) that fails, then monitoring netstat and network driver
> statistics might be helpful. Something like
> 
> netstat -nax
> netstat -naT
> netstat -m
> netstat -nid

I could run this in a loop, but that would be a lot of data, and might
not be appropriate to paste here.

I didn't see any obvious errors, but I'm not sure what I'm looking for.
netstat -m didn't show anything close to running out of buffers or
clusters...

> sysctl -a dev.
>
> And maybe
> 
> vmstat -m
> vmstat -z

No obvious errors there either, but again what should I look out for ?

In the meantime, I've also experimented with a few different scenarios, and
I'm quite puzzled.

For instance, I configured one of the other gigabit cards on each host to
provide a dedicated replication network. The main difference is that up
until now this has been running using tagged vlans. To be on the safe side,
I decided to use an untagged interface (the second gigabit adapter in each
machine).

Here's what I observed, and it is very odd:

- doing a dd ... | ssh dd fails in the same fashion as before

- I created a second zvol + hast resource of just 1 GB, and it replicated
  without any problems, peaking at 75 MB / sec (!) - maybe 1GB is too small
  ?

  (side note: hastd doesn't pick up configuration changes even with SIGHUP,
   which makes it hard to provision new resources on the fly) 

- I restarted replication on the 100 G hast resource, and it's currently
  replicating without any problems over the second ethernet, but it's
  dragging along at 9-10 MB/sec, peaking at 29 MB/sec occasionally.

  Earlier, I was observing peaks at 65-70 MB sec in between failures...

So I don't really know what to conclude :-| 


Re: Issue with hast replication

2012-03-12 Thread Mikolaj Golub

On Mon, 12 Mar 2012 15:31:27 +0100 Phil Regnauld wrote:

 PR> Phil Regnauld (regnauld) writes:
 >> 
 >> 7) ktrace on the destination dd:
 >> 
 >> fstat(0,{ mode=p- ,inode=5,size=16384,blksize=4096 }) = 0 (0x0)
 >> lseek(0,0x0,SEEK_CUR)                      ERR#29 'Illegal seek'

 PR> [...]

 >> Illegal seek, eh ? Any clues ?
 >> 
 >> The boxes are identical (HP DL380 G6), though the RAM config is different.
 >> 
 >> Summary:
 >> 
 >> - ssh works fine
 >> - h1 zvol to h2 zvol over ssh fails
 >> - h1 zvol to h2 /tmp/x over ssh is fine
 >> - h2 /dev/zero locally to h2 zvol is fine
 >> - h2 /tmp/x locally to h2 zvol fails at first, but works afterwards...

 PR> A few more data points: dd from a local zvol to a local zvol on either
 PR> machine works fine.

 PR> Using nc instead of ssh, this time it's the sender nc dying:

 PR> ktrace on the sender:

 PR> 47704 nc   CALL  write(0x3,0x7fff5450,0x800)
 PR> 47704 nc   RET   write -1 errno 32 Broken pipe
 PR> 47704 nc   PSIG  SIGPIPE SIG_DFL code=0x10006

 PR> truss on the sender:

 PR> poll({3/POLLIN 0/POLLIN},2,-1)                   = 2 (0x2)
 PR> read(3,0x7fff5450,2048)                          ERR#54 'Connection reset by peer'
 PR> close(3)                                         = 0 (0x0)


 PR> On tcpdump, I do see the receiver send a FIN when using nc.
 PR> When using ssh, the sender is sending the FIN.

 PR> Anything else I can look for ?

It looks like in the case of hastd it was send(2) that returned ENOMEM, but
it would be good to check. Could you please start synchronization again,
ktrace the primary worker process when ENOMEM errors are observed and show
the output here?

If it is send(2) that fails, then monitoring netstat and network driver
statistics might be helpful. Something like

netstat -nax
netstat -naT
netstat -m
netstat -nid

sysctl -a dev.

And maybe

vmstat -m
vmstat -z
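
These commands can be sampled in a loop with timestamps, so that the
snapshot taken closest to a failure can be found later. A minimal sketch;
the function name collect_once, the log directory and the 5-second interval
are arbitrary choices for illustration:

```sh
# Run each command given as an argument once and append its output,
# prefixed with a timestamp header, to a per-command log under the
# directory given as the first argument.
collect_once() {
    logdir="$1"; shift
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    for cmd in "$@"; do
        # Log name derived from the command line, e.g. "netstat_-m.log".
        log="$logdir/$(printf '%s' "$cmd" | tr ' /' '__').log"
        { printf '=== %s  %s\n' "$ts" "$cmd"; sh -c "$cmd" 2>&1; } >> "$log"
    done
}

# Typical use while synchronization is running (stop with ^C):
#   mkdir -p /var/tmp/hast-debug
#   while :; do
#       collect_once /var/tmp/hast-debug \
#           'netstat -nax' 'netstat -naT' 'netstat -m' 'netstat -nid' \
#           'vmstat -m' 'vmstat -z'
#       sleep 5
#   done
```

The timestamps in the "===" headers can then be matched against the
"Cannot allocate memory" lines in the hastd log.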

-- 
Mikolaj Golub


Re: Issue with hast replication

2012-03-12 Thread Phil Regnauld
Phil Regnauld (regnauld) writes:
> 
> 7) ktrace on the destination dd:
> 
> fstat(0,{ mode=p- ,inode=5,size=16384,blksize=4096 }) = 0 (0x0)
> lseek(0,0x0,SEEK_CUR)                      ERR#29 'Illegal seek'

[...]

> Illegal seek, eh ? Any clues ?
> 
> The boxes are identical (HP DL380 G6), though the RAM config is different.
> 
> Summary:
> 
> - ssh works fine
> - h1 zvol to h2 zvol over ssh fails
> - h1 zvol to h2 /tmp/x over ssh is fine
> - h2 /dev/zero locally to h2 zvol is fine
> - h2 /tmp/x locally to h2 zvol fails at first, but works afterwards...

A few more data points: dd from a local zvol to a local zvol on either
machine works fine.

Using nc instead of ssh, this time it's the sender nc dying:

ktrace on the sender:

47704 nc   CALL  write(0x3,0x7fff5450,0x800)
47704 nc   RET   write -1 errno 32 Broken pipe
47704 nc   PSIG  SIGPIPE SIG_DFL code=0x10006

truss on the sender:

poll({3/POLLIN 0/POLLIN},2,-1)                   = 2 (0x2)
read(3,0x7fff5450,2048)                          ERR#54 'Connection reset by peer'
close(3)                                         = 0 (0x0)


On tcpdump, I do see the receiver send a FIN when using nc.
When using ssh, the sender is sending the FIN.

Anything else I can look for ?



Re: Issue with hast replication

2012-03-11 Thread Phil Regnauld
Mikolaj Golub (trociny) writes:
> 
> 
>  PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
>  PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
>  PR> Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31642091520, 131072).
> 
> 31642091520 looks like a rather large offset for a 10 GB volume...

Sorry, that should have been 100G - I typed from memory instead of copy-pasting.

> Just to be more confident that this is a HAST issue could you please try the
> following experiment?
> 
> 1) Stop hastd on h2.
> 
> 2) On h1 run something like below:
> 
>   dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/dev/zvol/zfs/hvol
> 
> (copy hvol from h1 to h2 without hastd to see if it will succeed).
> 
> Note: you will need to recreate HAST provider on secondary after this.

Ok this is interesting.

(For debugging purposes I've renamed the target zvol as "junk", you'll see
why below).

1) As you suggested:

h1# dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/dev/zvol/zfs/junk
dd: /dev/zvol/zfs/junk: Invalid argument
0+6 records in
0+5 records out
131072 bytes transferred in 0.002344 secs (55920640 bytes/sec)

To be certain which dd was complaining, I renamed the target zvol.

2) Tried repeatedly, sometimes the number of bytes is a bit different:

0+7 records in
0+6 records out
147456 bytes transferred in 0.002448 secs (60233277 bytes/sec)

And yes, hastd is stopped on h2.

3) I tried dd'ing zero to the zvol locally on h2:

h2# dd if=/dev/zero of=/dev/zvol/zfs/junk bs=131072
^C1817+0 records in
1816+0 records out
238026752 bytes transferred in 1.582006 secs (150458820 bytes/sec)

That works, until I ^C it.

4) I tried redirecting the output of the dd | ssh to a file on the h2 side:

h1# dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/tmp/x
^C653+0 records in
652+0 records out
85458944 bytes transferred in 2.408074 secs (35488506 bytes/sec)

That works too, until I ^C it.

5) Things get even weirder - if I then go over to h2 and dd the
"/tmp/x" test file over to the zvol:

h2# dd if=x bs=131072 of=/dev/zvol/zfs/junk 
dd: /dev/zvol/zfs/junk: Invalid argument
652+1 records in
652+0 records out
85458944 bytes transferred in 0.444571 secs (192227879 bytes/sec)

Note that the file /tmp/x is 86917120 bytes long.

6) I try to copy more data into /tmp/x - it's now 291946496 (~280 MB)

h2# dd if=x bs=131072 of=/dev/zvol/zfs/junk
2227+1 records in
2227+1 records out
291946496 bytes transferred in 3.564129 secs (81912441 bytes/sec)

No more "invalid argument"...

7) ktrace on the destination dd:

[...]
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
\0"
  5807 dd   RET   read 17992/0x4648
  5807 dd   CALL  write(0x3,0x800c09000,0x4648)
  5807 dd   RET   write -1 errno 22 Invalid argument
  5807 dd   CALL  write(0x2,0x7fffd300,0x4)
  5807 dd   GIO   fd 2 wrote 4 bytes
 "dd: "
  5807 dd   RET   write 4
  5807 dd   CALL  write(0x2,0x7fffd3e0,0x12)
  5807 dd   GIO   fd 2 wrote 18 bytes
   "/dev/zvol/zfs/junk"

truss is a bit more informative:

fstat(0,{ mode=p- ,inode=5,size=16384,blksize=4096 }) = 0 (0x0)
lseek(0,0x0,SEEK_CUR)                      ERR#29 'Illegal seek'

Illegal seek, eh ? Any clues ?
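
Two observations on the traces above, offered as a possible reading rather
than a confirmed diagnosis. The lseek() failure is expected: stdin of the
destination dd is a pipe (note the mode=p in the fstat output), and seeking
on a pipe always fails with "Illegal seek", so it is probably harmless. The
call that actually fails is the write of 0x4648 (17992) bytes to the zvol.
Disk-like devices generally reject transfers that are not a whole number of
sectors with "Invalid argument", and a dd reading from a pipe can get short
reads and pass them on as short writes (hence the "0+6 records in"). A
small check of the sizes seen here, assuming 512-byte sectors for the zvol
(an assumption, not something shown in the traces):

```sh
# Succeed if a byte count is a whole number of sectors.
# Assumption: 512-byte sectors unless a second argument says otherwise.
is_sector_aligned() {
    size=$(( $1 ))
    sector=${2:-512}
    [ $(( size % sector )) -eq 0 ]
}

# 0x4648 is the failing write from the ktrace, 131072 is the dd block size.
for n in 0x4648 131072; do
    if is_sector_aligned "$n"; then
        echo "$(( n )) bytes: aligned"
    else
        echo "$(( n )) bytes: not a multiple of the sector size"
    fi
done
```

If this reading is right, it would also explain why writing to a regular
file such as /tmp/x works, since files accept writes of any length.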

The boxes are identical (HP DL380 G6), though the RAM config is different.

Summary:

- ssh works fine
- h1 zvol to h2 zvol over ssh fails
- h1 zvol to h2 /tmp/x over ssh is fine
- h2 /dev/zero locally to h2 zvol is fine
- h2 /tmp/x locally to h2 zvol fails at first, but works afterwards...




Re: Issue with hast replication

2012-03-11 Thread Mikolaj Golub

On Sun, 11 Mar 2012 19:54:57 +0100 Phil Regnauld wrote:

 PR> Hi,

 PR> I've got a fairly simple setup: two hosts running 9.0-R (will upgrade to
 PR> stable if told to, but want to check here first), ZFS and HAST. HAST is
 PR> configured to run on top of zvols configured on each host, as illustrated:

 PR>       FS                    FS
 PR>    +------+             +------+
 PR>    | hvol | <- hastd -> | hvol |
 PR>    +------+             +------+
 PR>    | zvol |             | zvol |
 PR>    +------+             +------+
 PR>    | zfs  |             | zfs  |
 PR>    +------+             +------+
 PR>       h1                    h2

 PR> Connection is gigabit to the same switch. No issues with large TCP
 PR> transfers such as SCP/FTP.

 PR> Config is vanilla:

 PR> # zfs create -V 10G zfs/hvol

 PR> hast.conf:

 PR> resource hvol {
 PR>         on h1 {
 PR>                 local /dev/zvol/zfs/hvol
 PR>                 remote tcp4://192.168.1.100
 PR>         }
 PR>         on h2 {
 PR>                 local /dev/zvol/zfs/hvol
 PR>                 remote tcp4://192.168.1.200
 PR>         }
 PR> }


 PR> h1 is behaving fine as primary, either with h2 turned off or in init -
 PR> but as soon as I set the role to secondary for h2, the receiver
 PR> repeatedly crashes and restarts - see the traces below.

 PR> Primary:

 PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Disconnected from tcp4://192.168.1.200.
 PR> Mar 11 02:02:30 h1 hastd[2282]: [hvol] (primary) Unable to write synchronization data: Cannot allocate memory.
 PR> Mar 11 02:02:41 h1 hastd[2282]: [hvol] (primary) Unable to send request (Cannot allocate memory): WRITE(31642091520, 131072).

31642091520 looks like a rather large offset for a 10 GB volume...

Just to be more confident that this is a HAST issue could you please try the
following experiment?

1) Stop hastd on h2.

2) On h1 run something like below:

  dd if=/dev/zvol/zfs/hvol bs=131072 | ssh h2 dd bs=131072 of=/dev/zvol/zfs/hvol

(copy hvol from h1 to h2 without hastd to see if it will succeed).

Note: you will need to recreate HAST provider on secondary after this.

-- 
Mikolaj Golub