Re: [Gluster-users] Replacing a failed brick

2013-08-18 Thread David Gibbons
Joe,

Now I understand what is going on here. It makes a lot more sense that it's a
bug in the sanity-checking code. Thanks so much!

Dave


On Fri, Aug 16, 2013 at 11:19 AM, Joe Julian  wrote:

> This tells you that this brick isn't running. That's probably because it
> was formatted and lost its volume-id extended attribute. See
> http://www.joejulian.name/blog/replacing-a-brick-on-glusterfs-340/
>
> Once that's fixed, on 10.250.4.65:
>
>
>   gluster volume start test-a force
>
>
> On 08/16/2013 08:03 AM, David Gibbons wrote:
>
>> Brick 10.250.4.65:/localmnt/g2lv5   N/A N   N/A
>>
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Replacing a failed brick

2013-08-16 Thread David Gibbons
Ok, it appears that the following worked. Thanks for the nudge in the right
direction:

volume replace-brick test-a 10.250.4.65:/localmnt/g2lv5
10.250.4.65:/localmnt/g2lv6
commit force

then
volume heal test-a full

and monitor the progress with
volume heal test-a info

However, that does not solve my problem of what to do when a brick is
corrupted somehow and I don't have enough space to first heal it and then
replace it.

That did get me thinking, though: "what if I replace the brick, forgo the
heal, replace it again, and then do a heal?" That seems to work.

So if I lose one brick, here is the process that I used to recover it:
1) create a directory that exists just to temporarily trick gluster and allow
us to maintain the correct replica count: mkdir /localmnt/garbage
2) replace the dead brick with our garbage directory: volume replace-brick
test-a 10.250.4.65:/localmnt/g2lv5 10.250.4.65:/localmnt/garbage commit
force
3) fix our dead brick using whatever process is required. In this case, for
testing, we had to remove some gluster bits or it throws the "already part
of a volume" error:
setfattr -x trusted.glusterfs.volume-id /localmnt/g2lv5
setfattr -x trusted.gfid /localmnt/g2lv5
4) now that our dead brick is fixed, swap it for the garbage/temporary
brick: volume replace-brick test-a 10.250.4.65:/localmnt/garbage
10.250.4.65:/localmnt/g2lv5 commit force
5) now all that we have to do is let gluster heal the volume: volume heal
test-a full
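For convenience, the five steps can be collected into a single shell sketch. This is illustrative only: the host, volume name, and brick paths are the ones from this test setup, and the run() wrapper just prints each command so the sequence can be reviewed before anything destructive happens (remove the echo inside run() to execute for real).

```shell
#!/bin/sh
# Dry-run sketch of the brick-recovery procedure above.
# run() only prints each command; drop the echo to execute for real.
VOL=test-a
HOST=10.250.4.65
DEAD=/localmnt/g2lv5
TMP=/localmnt/garbage

run() { echo "+ $*"; }

# 1) temporary directory, purely to keep the replica count intact
run mkdir -p "$TMP"
# 2) swap the dead brick out for the placeholder
run gluster volume replace-brick "$VOL" "$HOST:$DEAD" "$HOST:$TMP" commit force
# 3) repair the dead brick; here, clear the leftover xattrs so gluster
#    no longer considers it "already part of a volume"
run setfattr -x trusted.glusterfs.volume-id "$DEAD"
run setfattr -x trusted.gfid "$DEAD"
# 4) swap the repaired brick back in
run gluster volume replace-brick "$VOL" "$HOST:$TMP" "$HOST:$DEAD" commit force
# 5) trigger a full self-heal and monitor it
run gluster volume heal "$VOL" full
run gluster volume heal "$VOL" info
```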

Is there anything wrong with this procedure?

Cheers,
Dave




On Fri, Aug 16, 2013 at 11:03 AM, David Gibbons
wrote:

> Ravi,
>
> Thanks for the tips. When I run a volume status:
> gluster> volume status test-a
> Status of volume: test-a
> Gluster process PortOnline  Pid
> --
> Brick 10.250.4.63:/localmnt/g1lv2   49152   Y   8072
> Brick 10.250.4.65:/localmnt/g2lv2   49152   Y   3403
> Brick 10.250.4.63:/localmnt/g1lv3   49153   Y   8081
> Brick 10.250.4.65:/localmnt/g2lv3   49153   Y   3410
> Brick 10.250.4.63:/localmnt/g1lv4   49154   Y   8090
> Brick 10.250.4.65:/localmnt/g2lv4   49154   Y   3417
> Brick 10.250.4.63:/localmnt/g1lv5   49155   Y   8099
> Brick 10.250.4.65:/localmnt/g2lv5   N/A N   N/A
> Brick 10.250.4.63:/localmnt/g1lv1   49156   Y   8576
> Brick 10.250.4.65:/localmnt/g2lv1   49156   Y   3431
> NFS Server on localhost 2049Y   3440
> Self-heal Daemon on localhost   N/A Y   3445
> NFS Server on 10.250.4.63   2049Y   8586
> Self-heal Daemon on 10.250.4.63 N/A Y   8593
>
> There are no active volume tasks
> --
>
> Attempting to start the volume results in:
> gluster> volume start test-a force
> volume start: test-a: failed: Failed to get extended attribute
> trusted.glusterfs.volume-id for brick dir /localmnt/g2lv5. Reason : No data
> available
> --
>
> It doesn't like it when I try to fire off a heal either:
> gluster> volume heal test-a
> Launching Heal operation on volume test-a has been unsuccessful
> --
>
> Although that did lead me to this:
> gluster> volume heal test-a info
> Gathering Heal info on volume test-a has been successful
>
> Brick 10.250.4.63:/localmnt/g1lv2
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv2
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv3
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv3
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv4
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv4
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv5
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv5
> Status: Brick is Not connected
> Number of entries: 0
>
> Brick 10.250.4.63:/localmnt/g1lv1
> Number of entries: 0
>
> Brick 10.250.4.65:/localmnt/g2lv1
> Number of entries: 0
> --
>
> So perhaps I need to re-connect the brick?
>
> Cheers,
> Dave
>
>
>
> On Fri, Aug 16, 2013 at 12:43 AM, Ravishankar N wrote:
>
>>  On 08/15/2013 10:05 PM, David Gibbons wrote:
>>
>> Hi There,
>>
>>  I'm currently testing Gluster for possible production use. I haven't
>> been able to find the answer to this question in the forum arch or in the
>> public docs. It's possible that I don't know which keywords to search for.
>>
>>  Here's the question (more details below): let's say that one of my
>> bricks "fails" -- *not* a whole node failure but a single brick failure
>> within the node. How do I replace a single brick on a node and force a sync
>> from one of the replicas?
>>
>>  I have two nodes with 5 bricks each:
>>  gluster> volume info test-a
>>
>>  Volume Name: test-a
>> Type: Distributed-Replicate
>> Volume ID: e8957773-dd36-44ae-b80a-01e22c7

Re: [Gluster-users] Replacing a failed brick

2013-08-16 Thread Joe Julian
This tells you that this brick isn't running. That's probably because it 
was formatted and lost its volume-id extended attribute. See 
http://www.joejulian.name/blog/replacing-a-brick-on-glusterfs-340/


Once that's fixed, on 10.250.4.65:

  gluster volume start test-a force
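The fix in that post amounts to writing the trusted.glusterfs.volume-id extended attribute back onto the replacement brick directory. A minimal sketch, assuming the volume ID and brick path from this thread (take the real ID from `gluster volume info test-a`); the setfattr command is echoed rather than run, since it must be executed on the actual brick:

```shell
# setfattr wants the volume ID as raw hex, i.e. the UUID shown by
# "gluster volume info" with the dashes stripped and a 0x prefix.
VOL_ID="e8957773-dd36-44ae-b80a-01e22c78a8b4"
BRICK=/localmnt/g2lv5
HEX_ID="0x$(echo "$VOL_ID" | tr -d '-')"

# echoed as a dry run; drop the echo to stamp the brick for real
echo setfattr -n trusted.glusterfs.volume-id -v "$HEX_ID" "$BRICK"
# verify afterwards with:
#   getfattr -n trusted.glusterfs.volume-id -e hex "$BRICK"
```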


On 08/16/2013 08:03 AM, David Gibbons wrote:

Brick 10.250.4.65:/localmnt/g2lv5   N/A N   N/A




Re: [Gluster-users] Replacing a failed brick

2013-08-16 Thread David Gibbons
Ravi,

Thanks for the tips. When I run a volume status:
gluster> volume status test-a
Status of volume: test-a
Gluster process PortOnline  Pid
--
Brick 10.250.4.63:/localmnt/g1lv2   49152   Y   8072
Brick 10.250.4.65:/localmnt/g2lv2   49152   Y   3403
Brick 10.250.4.63:/localmnt/g1lv3   49153   Y   8081
Brick 10.250.4.65:/localmnt/g2lv3   49153   Y   3410
Brick 10.250.4.63:/localmnt/g1lv4   49154   Y   8090
Brick 10.250.4.65:/localmnt/g2lv4   49154   Y   3417
Brick 10.250.4.63:/localmnt/g1lv5   49155   Y   8099
Brick 10.250.4.65:/localmnt/g2lv5   N/A N   N/A
Brick 10.250.4.63:/localmnt/g1lv1   49156   Y   8576
Brick 10.250.4.65:/localmnt/g2lv1   49156   Y   3431
NFS Server on localhost 2049Y   3440
Self-heal Daemon on localhost   N/A Y   3445
NFS Server on 10.250.4.63   2049Y   8586
Self-heal Daemon on 10.250.4.63 N/A Y   8593

There are no active volume tasks
--

Attempting to start the volume results in:
gluster> volume start test-a force
volume start: test-a: failed: Failed to get extended attribute
trusted.glusterfs.volume-id for brick dir /localmnt/g2lv5. Reason : No data
available
--

It doesn't like it when I try to fire off a heal either:
gluster> volume heal test-a
Launching Heal operation on volume test-a has been unsuccessful
--

Although that did lead me to this:
gluster> volume heal test-a info
Gathering Heal info on volume test-a has been successful

Brick 10.250.4.63:/localmnt/g1lv2
Number of entries: 0

Brick 10.250.4.65:/localmnt/g2lv2
Number of entries: 0

Brick 10.250.4.63:/localmnt/g1lv3
Number of entries: 0

Brick 10.250.4.65:/localmnt/g2lv3
Number of entries: 0

Brick 10.250.4.63:/localmnt/g1lv4
Number of entries: 0

Brick 10.250.4.65:/localmnt/g2lv4
Number of entries: 0

Brick 10.250.4.63:/localmnt/g1lv5
Number of entries: 0

Brick 10.250.4.65:/localmnt/g2lv5
Status: Brick is Not connected
Number of entries: 0

Brick 10.250.4.63:/localmnt/g1lv1
Number of entries: 0

Brick 10.250.4.65:/localmnt/g2lv1
Number of entries: 0
--

So perhaps I need to re-connect the brick?

Cheers,
Dave



On Fri, Aug 16, 2013 at 12:43 AM, Ravishankar N wrote:

>  On 08/15/2013 10:05 PM, David Gibbons wrote:
>
> Hi There,
>
>  I'm currently testing Gluster for possible production use. I haven't
> been able to find the answer to this question in the forum arch or in the
> public docs. It's possible that I don't know which keywords to search for.
>
>  Here's the question (more details below): let's say that one of my
> bricks "fails" -- *not* a whole node failure but a single brick failure
> within the node. How do I replace a single brick on a node and force a sync
> from one of the replicas?
>
>  I have two nodes with 5 bricks each:
>  gluster> volume info test-a
>
>  Volume Name: test-a
> Type: Distributed-Replicate
> Volume ID: e8957773-dd36-44ae-b80a-01e22c78a8b4
> Status: Started
> Number of Bricks: 5 x 2 = 10
> Transport-type: tcp
> Bricks:
> Brick1: 10.250.4.63:/localmnt/g1lv2
> Brick2: 10.250.4.65:/localmnt/g2lv2
> Brick3: 10.250.4.63:/localmnt/g1lv3
> Brick4: 10.250.4.65:/localmnt/g2lv3
> Brick5: 10.250.4.63:/localmnt/g1lv4
> Brick6: 10.250.4.65:/localmnt/g2lv4
> Brick7: 10.250.4.63:/localmnt/g1lv5
> Brick8: 10.250.4.65:/localmnt/g2lv5
> Brick9: 10.250.4.63:/localmnt/g1lv1
> Brick10: 10.250.4.65:/localmnt/g2lv1
>
>  I formatted 10.250.4.65:/localmnt/g2lv5 (to simulate a "failure"). What
> is the next step? I have tried various combinations of removing and
> re-adding the brick, replacing the brick, etc. I read in a previous message
> to this list that replace-brick was for planned changes which makes sense,
> so that's probably not my next step.
>
> You must first check if the 'formatted' brick 10.250.4.65:/localmnt/g2lv5
> is online using the `gluster volume status` command. If not, start the
> volume using `gluster volume start <volname> force`. You can then use the
> `gluster volume heal <volname>` command, which would copy the data from the
> other replica brick into your formatted brick.
> Hope this helps.
> -Ravi
>
>
>  Cheers,
> Dave
>
>

Re: [Gluster-users] Replacing a failed brick

2013-08-15 Thread Ravishankar N

On 08/15/2013 10:05 PM, David Gibbons wrote:

Hi There,

I'm currently testing Gluster for possible production use. I haven't 
been able to find the answer to this question in the forum arch or in 
the public docs. It's possible that I don't know which keywords to 
search for.


Here's the question (more details below): let's say that one of my 
bricks "fails" -- /not/ a whole node failure but a single brick 
failure within the node. How do I replace a single brick on a node and 
force a sync from one of the replicas?


I have two nodes with 5 bricks each:
gluster> volume info test-a

Volume Name: test-a
Type: Distributed-Replicate
Volume ID: e8957773-dd36-44ae-b80a-01e22c78a8b4
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: 10.250.4.63:/localmnt/g1lv2
Brick2: 10.250.4.65:/localmnt/g2lv2
Brick3: 10.250.4.63:/localmnt/g1lv3
Brick4: 10.250.4.65:/localmnt/g2lv3
Brick5: 10.250.4.63:/localmnt/g1lv4
Brick6: 10.250.4.65:/localmnt/g2lv4
Brick7: 10.250.4.63:/localmnt/g1lv5
Brick8: 10.250.4.65:/localmnt/g2lv5
Brick9: 10.250.4.63:/localmnt/g1lv1
Brick10: 10.250.4.65:/localmnt/g2lv1

I formatted 10.250.4.65:/localmnt/g2lv5 (to simulate a "failure"). 
What is the next step? I have tried various combinations of removing 
and re-adding the brick, replacing the brick, etc. I read in a 
previous message to this list that replace-brick was for planned 
changes which makes sense, so that's probably not my next step.
You must first check if the 'formatted' brick 
10.250.4.65:/localmnt/g2lv5 is online using the `gluster volume status` 
command. If not, start the volume using `gluster volume start <volname> 
force`. You can then use the `gluster volume heal <volname>` command, which 
would copy the data from the other replica brick into your formatted brick.
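
Ravi's check-then-heal sequence, spelled out with the names used later in this thread (a sketch: the commands are printed rather than executed, so it can be reviewed first and then run on the node that owns the brick):

```shell
# Order of operations from the advice above, using this thread's names.
VOL=test-a
CMDS="gluster volume status $VOL
gluster volume start $VOL force
gluster volume heal $VOL full
gluster volume heal $VOL info"

# print the sequence for review before running it on the affected node
printf '%s\n' "$CMDS"
```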

Hope this helps.
-Ravi



Cheers,
Dave



[Gluster-users] Replacing a failed brick

2013-08-15 Thread David Gibbons
Hi There,

I'm currently testing Gluster for possible production use. I haven't been
able to find the answer to this question in the forum arch or in the public
docs. It's possible that I don't know which keywords to search for.

Here's the question (more details below): let's say that one of my bricks
"fails" -- *not* a whole node failure but a single brick failure within the
node. How do I replace a single brick on a node and force a sync from one
of the replicas?

I have two nodes with 5 bricks each:
gluster> volume info test-a

Volume Name: test-a
Type: Distributed-Replicate
Volume ID: e8957773-dd36-44ae-b80a-01e22c78a8b4
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: 10.250.4.63:/localmnt/g1lv2
Brick2: 10.250.4.65:/localmnt/g2lv2
Brick3: 10.250.4.63:/localmnt/g1lv3
Brick4: 10.250.4.65:/localmnt/g2lv3
Brick5: 10.250.4.63:/localmnt/g1lv4
Brick6: 10.250.4.65:/localmnt/g2lv4
Brick7: 10.250.4.63:/localmnt/g1lv5
Brick8: 10.250.4.65:/localmnt/g2lv5
Brick9: 10.250.4.63:/localmnt/g1lv1
Brick10: 10.250.4.65:/localmnt/g2lv1

I formatted 10.250.4.65:/localmnt/g2lv5 (to simulate a "failure"). What is
the next step? I have tried various combinations of removing and re-adding
the brick, replacing the brick, etc. I read in a previous message to this
list that replace-brick was for planned changes which makes sense, so
that's probably not my next step.

Cheers,
Dave