Re: zpool can't bring online disk2 ----I screwed up

2012-09-26 Thread Mikolaj Golub
On Sun, Sep 23, 2012 at 10:50:28PM -0700, Jose A. Lombera wrote:

> This is the error I got when I run the failover script.
> 
>  
> 
> Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is 
> not part of resource disk3.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited 
> ungracefully (pid=3404, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is 
> not part of resource disk6.
> 
> Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited 
> ungracefully (pid=3413, exitcode=66).
> 
> Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open 
> /dev/mfid10: No such file or directory.
> 
> Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is 
> not part of resource disk4.

It looks like your disk numbering has changed, and your other email
confirms this. You should update the provider paths in hast.conf to match.
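
As a rough sketch of how one might check this (not from the thread; the
function names are made up for illustration, and it assumes the usual
hast.conf syntax in which each provider is named by a "local" line), the
following lists the provider paths that hast.conf refers to and flags any
whose device node no longer exists, which would correspond to the
"Unable to open /dev/mfid10" message above:

```shell
#!/bin/sh
# Hypothetical helpers, not part of HAST itself.

# Print every "local" provider path named in a hast.conf-style file.
hast_local_paths() {
    awk '$1 == "local" { print $2 }' "$1" | sort -u
}

# Report which of those paths exist as device nodes on this host.
check_hast_providers() {
    conf=${1:-/etc/hast.conf}
    hast_local_paths "$conf" | while read -r path; do
        if [ -e "$path" ]; then
            printf '%-8s%s\n' ok "$path"
        else
            printf '%-8s%s\n' MISSING "$path"
        fi
    done
}
```

Usage would be "check_hast_providers /etc/hast.conf". The sketch does not
distinguish between the "on <host>" sections, and a node that exists but now
holds a different disk (the "is not part of resource" case) still shows as
ok. For those, "hastctl dump <resource>" prints the metadata stored on the
configured local provider.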

> Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID 
> mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!
> 
> Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID 
> mismatch (primary=2635341666474957411, secondary=5944493181984227803).
> 
> Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Split-brain can only be fixed manually: decide which host holds the
current data, then recreate the HAST resources (disk1 and disk2 in this
case) on the other host.

The simplest way to recover from your situation is as follows.

Suppose host A is the host where the disk was replaced and things got
messed up, and host B is the "good" host.

1) Disable automatic failover, if you have any.
2) On host A, set all HAST resources to init.
3) On host B, set all HAST resources to primary.
4) On host B, import the pool and check that it works there and that you
   have your data.
5) On host A, recreate the HAST resources (hastctl create disk1...)
6) On host A, change the role to secondary for all HAST
   resources. A synchronization process should start.
7) Wait until the synchronization is complete, checking hastctl status on
   host B (the primary).

After this you can switch the pool back to host A if you want and
re-enable automatic failover.

Actually, you can switch to host A without waiting for the
synchronization to complete. It will work, but read requests will go
to the remote host B until the synchronization is complete, so I would
not do this unless there is a good reason.
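
The steps above could be scripted roughly as follows. This is only a sketch
under assumptions: the pool is named tank (as in Jose's zpool status
output), the resource names are passed in by hand, and the function names
are invented for illustration. It prints the commands by default
(DRY_RUN=1) instead of running them, because hastctl create rewrites the
HAST metadata on host A.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 only after reviewing the printed commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 2, on host A: release all resources.
host_a_demote() {
    run hastctl role init all
}

# Steps 3-4, on host B: become primary and import the pool ($1).
host_b_promote() {
    run hastctl role primary all
    run zpool import "$1"
    run zpool status "$1"
}

# Steps 5-6, on host A: recreate the named resources, then start syncing.
host_a_rejoin() {
    run hastctl create "$@"
    run hastctl role secondary all
}

# Step 7, on host B: inspect synchronization progress.
host_b_sync_status() {
    run hastctl status all
}
```

For example, "host_a_rejoin disk1 disk2" on host A prints the create and
role commands for the two split-brain resources. Step 1 (disabling
failover) depends on the failover script in use and is left out.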

It might be possible to recover faster, without recreating/resyncing
all devices, depending on how badly things are messed up, by fixing the
disk numbering in hast.conf and recreating/resyncing only the resources
in split-brain state. But that would require more manual work, careful
investigation of the logs, and a good understanding of what you are doing.

-- 
Mikolaj Golub
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"


RE: zpool can't bring online disk2 ----I screwed up

2012-09-23 Thread Jose A. Lombera
This is the error I get when I run the failover script.

 

Sep 24 06:43:39 san1 hastd[3404]: [disk3] (primary) Provider /dev/mfid3 is not 
part of resource disk3.

Sep 24 06:43:39 san1 hastd[3343]: [disk3] (primary) Worker process exited 
ungracefully (pid=3404, exitcode=66).

Sep 24 06:43:39 san1 hastd[3413]: [disk6] (primary) Provider /dev/mfid6 is not 
part of resource disk6.

Sep 24 06:43:39 san1 hastd[3343]: [disk6] (primary) Worker process exited 
ungracefully (pid=3413, exitcode=66).

Sep 24 06:43:39 san1 hastd[3425]: [disk10] (primary) Unable to open 
/dev/mfid10: No such file or directory.

Sep 24 06:43:39 san1 hastd[3407]: [disk4] (primary) Provider /dev/mfid4 is not 
part of resource disk4.

Sep 24 06:43:39 san1 hastd[3343]: [disk10] (primary) Worker process exited 
ungracefully (pid=3425, exitcode=66).

Sep 24 06:43:39 san1 hastd[3410]: [disk5] (primary) Provider /dev/mfid5 is not 
part of resource disk5.

Sep 24 06:43:39 san1 hastd[3343]: [disk4] (primary) Worker process exited 
ungracefully (pid=3407, exitcode=66).

Sep 24 06:43:39 san1 hastd[3416]: [disk7] (primary) Provider /dev/mfid7 is not 
part of resource disk7.

Sep 24 06:43:39 san1 hastd[3422]: [disk9] (primary) Provider /dev/mfid9 is not 
part of resource disk9.

Sep 24 06:43:39 san1 hastd[3419]: [disk8] (primary) Provider /dev/mfid8 is not 
part of resource disk8.

Sep 24 06:43:39 san1 hastd[3343]: [disk5] (primary) Worker process exited 
ungracefully (pid=3410, exitcode=66).

Sep 24 06:43:40 san1 hastd[3343]: [disk9] (primary) Worker process exited 
ungracefully (pid=3422, exitcode=66).

Sep 24 06:43:40 san1 hastd[3343]: [disk8] (primary) Worker process exited 
ungracefully (pid=3419, exitcode=66).

Sep 24 06:43:40 san1 hastd[3343]: [disk7] (primary) Worker process exited 
ungracefully (pid=3416, exitcode=66).

Sep 24 06:43:40 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch 
(primary=2635341666474957411, secondary=5944493181984227803).

Sep 24 06:43:45 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Sep 24 06:43:50 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch 
(primary=2635341666474957411, secondary=5944493181984227803).

Sep 24 06:43:55 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Sep 24 06:44:00 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch 
(primary=2635341666474957411, secondary=5944493181984227803).

Sep 24 06:44:05 san1 hastd[3348]: [disk1] (primary) Split-brain condition!

Sep 24 06:44:10 san1 hastd[3351]: [disk2] (primary) Resource unique ID mismatch 
(primary=2635341666474957411, secondary=5944493181984227803)

 

 

Is there any patch I need to run to fix this issue?

 

 

 


RE: zpool can't bring online disk2 ----I screwed up

2012-09-23 Thread Jose A. Lombera
Every time I run this for any of disks 3, 4, 5, 6, 7, 8, 9 or 10
(only disks 1 and 2 show up in /dev/hast):

 

[root@san2 /usr/home/jose]# hastctl role primary disk3

[root@san2 /usr/home/jose]#

 

I got this in the logs.

 

Sep 23 21:58:13 san2 hastd[2793]: [disk3] (primary) Provider /dev/mfid3 is not 
part of resource disk3.

 

Please help.

 

Thanks.

 

 

 


RE: zpool can't bring online disk2 ----I screwed up

2012-09-23 Thread Jose A. Lombera
Please, someone help me!

I screwed up big time.

I was running

hastctl create disk2

but since I got some input/output errors, I decided to stop hastd
(/etc/rc.d/hastd stop). Since it couldn't stop disk1 and disk9, I killed it
and restarted both servers.

Now /dev/hast shows nothing, and the pool is lost.

I was able to create disk2. I have restarted both servers, but the pool is
not coming up.

Any suggestions? Please help. I know the data is still there, since I only
ran "hastctl create disk2"; I haven't done it for the other disks.

 

 

 

 

 

From: Jose A. Lombera [mailto:j...@lajni.com] 
Sent: Sunday, September 23, 2012 8:10 PM
To: 'Freddie Cash'
Cc: freebsd-current@freebsd.org
Subject: RE: zpool can't bring online disk2

 

Freddie,

 

Thanks for your great help; it makes so much more sense now.

I still have a small problem, and I'm not sure if it is because hastd is
running: I can't initialize disk2 (hastctl create disk2).

 

This is what I did.

 

1. zpool offline tank /dev/dsk/hast/disk2

2. zpool status -x

[root@san /usr/home/jose]# zpool status -x
  pool: tank
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 12h4m with 0 errors on Sun Sep 23 19:14:19 2012
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1-0                DEGRADED     0     0     0
            hast/disk1            ONLINE       0     0     0
            11919832608590631234  OFFLINE      0     0     0  was /dev/dsk/hast/disk2
            hast/disk3            ONLINE       0     0     0
            hast/disk4            ONLINE       0     0     0
            hast/disk5            ONLINE       0     0     0
            hast/disk6            ONLINE       0     0     0
            hast/disk7            ONLINE       0     0     0
            hast/disk8            ONLINE       0     0     0
            hast/disk9            ONLINE       0     0     0
            hast/disk10           ONLINE       0     0     0

errors: No known data errors

 

3. Removed the disk and inserted a new one.

4. Initialized it:

hastctl role init disk2

[root@san /usr/home/jose]# hastctl status disk2
disk2:
  role: init
  provname: disk2
  localpath: /dev/mfid2
  extentsize: 0 (0B)
  keepdirty: 0
  remoteaddr: san1
  replication: fullsync
  dirty: 0 (0B)
  statistics:
    reads: 0
    writes: 0
    deletes: 0
    flushes: 0
    activemap updates: 0
[root@san /usr/home/jose]#
[root@san /usr/home/jose]#
[root@san /usr/home/jose]# hastctl create disk2
[ERROR] [disk2] Unable to write metadata: Input/output error.

 

 

 

I don't want to stop hastd, since that would shut down the connection to
my SAN.

Do you have any suggestions?

 

Thanks

 

 

--jose

 

 

-----Original Message-----
From: owner-freebsd-curr...@freebsd.org 
[mailto:owner-freebsd-curr...@freebsd.org] On Behalf Of Freddie Cash
Sent: Sunday, September 23, 2012 6:30 PM
To: compufutura -the computer of the future
Cc: yaneg...@gmail.com; freebsd-current@freebsd.org
Subject: RE: zpool can't bring online disk2

 

Since it's a HAST device, you have to initialise the disk via hastctl. Once 
that is done, the /dev/hast/disk2 GEOM device node will be created.

 

Then you can 'zpool replace' it.
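
A rough sketch of that sequence, using the pool and resource names from
this thread (tank, disk2). The function names are invented for
illustration, and the "hastctl role primary" step is an assumption added
here because the /dev/hast node only exists while the resource is in the
primary role. Commands are printed by default (DRY_RUN=1) rather than
executed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Before pulling the defective disk: take it out of the pool.
# $1 = pool name, $2 = HAST resource name.
before_swap() {
    run zpool offline "$1" "hast/$2"
}

# After inserting the new disk: initialise HAST on it, then let ZFS
# resilver onto the recreated /dev/hast node.
after_swap() {
    run hastctl role init "$2"
    run hastctl create "$2"
    run hastctl role primary "$2"   # assumption: makes /dev/hast/$2 appear
    run zpool replace "$1" "hast/$2"
    run zpool status "$1"
}
```

For example, "before_swap tank disk2", then the physical swap, then
"after_swap tank disk2". If "hastctl create" fails with an I/O error as it
did above, the remaining commands should not be run.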

 

One step at a time. :)  And you've skipped a few.

 

1. 'zpool offline' the defective disk
2. Physically remove the defective disk
3. Physically insert the new disk
4. Initialise it as a HAST resource via 'hastctl'
5. 'zpool replace' it using the /dev/hast node
6. Wait for the pool (and HAST) to resilver it
7. Carry on as per normal

On Sep 23, 2012 2:28 PM, "compufutura -the computer of the future"
<j...@compufutura.com> wrote:

 

> Yanegomi,
>
> I tried that. As you can see below, FreeBSD doesn't have the cfgadm
> utility to unconfigure the device that is described in
> http://docs.oracle.com/cd/E19253-01/819-5461/gbcet/index.html. I
> looked in ports, but there is no utility like that.
>
> Pardon me, my knowledge is limited.
>
> Can you please type the command I will need? Or, if I need cfgadm, do I
> have to look for it and install it on my FreeBSD box?
>
> Thanks.

> 

> 

> 

> 

> 

> [root@san1 /usr/home/jose]# zpool offline tank hast/disk2
> [root@san1 /usr/home/jose]#
> [root@san1 /usr/home/jose]#
> [root@san1 /usr/home/jose]# zpool status -x
>   pool: tank
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.

> 

> Sufficient replicas exist