Re: [Gluster-users] self-heal trouble after changing arbiter brick

2018-02-08 Thread Karthik Subrahmanya
On Fri, Feb 9, 2018 at 11:46 AM, Karthik Subrahmanya 
wrote:

> Hey,
>
> Has the heal completed, or do you still have some entries pending heal?
> If yes, can you provide the following information to debug the issue?
> 1. Which version of Gluster you are running
> 2. Output of gluster volume heal <volname> info summary or gluster volume
> heal <volname> info
> 3. getfattr -d -e hex -m . <file-path-on-brick> output of any one file
> that is pending heal, from all the bricks
>
> Regards,
> Karthik
>
> On Thu, Feb 8, 2018 at 12:48 PM, Seva Gluschenko 
> wrote:
>
>> Hi folks,
>>
>> I've run into trouble after moving an arbiter brick to another server
>> because of I/O load issues. My setup is as follows:
>>
>> # gluster volume info
>>
>> Volume Name: myvol
>> Type: Distributed-Replicate
>> Volume ID: 43ba517a-ac09-461e-99da-a197759a7dc8
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 3 x (2 + 1) = 9
>> Transport-type: tcp
>> Bricks:
>> Brick1: gv0:/data/glusterfs
>> Brick2: gv1:/data/glusterfs
>> Brick3: gv4:/data/gv01-arbiter (arbiter)
>> Brick4: gv2:/data/glusterfs
>> Brick5: gv3:/data/glusterfs
>> Brick6: gv1:/data/gv23-arbiter (arbiter)
>> Brick7: gv4:/data/glusterfs
>> Brick8: gv5:/data/glusterfs
>> Brick9: pluto:/var/gv45-arbiter (arbiter)
>> Options Reconfigured:
>> nfs.disable: on
>> transport.address-family: inet
>> storage.owner-gid: 1000
>> storage.owner-uid: 1000
>> cluster.self-heal-daemon: enable
>>
>> The gv23-arbiter brick was recently moved from another server (chronos)
>> using the following command:
>>
>> # gluster volume replace-brick myvol chronos:/mnt/gv23-arbiter
>> gv1:/data/gv23-arbiter commit force
>> volume replace-brick: success: replace-brick commit force operation
>> successful
>>
>> It's not the first time I've moved an arbiter brick, and the heal-count
>> was zero for all the bricks before the change, so I didn't expect much
>> trouble. What probably went wrong is that I then forced chronos out of the
>> cluster with the gluster peer detach command. Ever since then, over the
>> course of the last 3 days, I have seen this:
>>
>> # gluster volume heal myvol statistics heal-count
>> Gathering count of entries to be healed on volume myvol has been
>> successful
>>
>> Brick gv0:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv1:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv4:/data/gv01-arbiter
>> Number of entries: 0
>>
>> Brick gv2:/data/glusterfs
>> Number of entries: 64999
>>
>> Brick gv3:/data/glusterfs
>> Number of entries: 64999
>>
>> Brick gv1:/data/gv23-arbiter
>> Number of entries: 0
>>
>> Brick gv4:/data/glusterfs
>> Number of entries: 0
>>
>> Brick gv5:/data/glusterfs
>> Number of entries: 0
>>
>> Brick pluto:/var/gv45-arbiter
>> Number of entries: 0
>>
>> According to /var/log/glusterfs/glustershd.log, self-healing is in
>> progress, so it might be worth just sitting and waiting, but I'm wondering
>> why this 64999 heal-count persists (a limitation of the counter? The gv2
>> and gv3 bricks actually contain roughly 30 million files), and I'm
>> bothered by the following output:
>>
>> # gluster volume heal myvol info heal-failed
>> Gathering list of heal failed entries on volume myvol has been
>> unsuccessful on bricks that are down. Please check if all brick processes
>> are running.
>>
>> I attached the chronos server back to the cluster, with no noticeable
>> effect. Any comments and suggestions would be much appreciated.
>>
>> --
>> Best Regards,
>>
>> Seva Gluschenko
>> CTO @ http://webkontrol.ru
>>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] self-heal trouble after changing arbiter brick

2018-02-08 Thread Karthik Subrahmanya
Hey,

Has the heal completed, or do you still have some entries pending heal?
If yes, can you provide the following information to debug the issue?
1. Which version of Gluster you are running
2. Output of gluster volume heal <volname> info summary or gluster volume
heal <volname> info
3. getfattr -d -e hex -m . <file-path-on-brick> output of any one file
that is pending heal, from all the bricks (an example follows below)
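
For example, something along these lines should capture all three, assuming
the volume name myvol from your volume info output, and with
<file-path-on-brick> being one of the entries reported as pending heal
(adjust the paths to your setup):

# gluster --version
# gluster volume heal myvol info summary
# gluster volume heal myvol info
# getfattr -d -e hex -m . /data/glusterfs/<file-path-on-brick>

Run the getfattr command for the same file on all three bricks of the
affected replica set (gv2:/data/glusterfs, gv3:/data/glusterfs and the
arbiter gv1:/data/gv23-arbiter).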

Regards,
Karthik

On Thu, Feb 8, 2018 at 12:48 PM, Seva Gluschenko  wrote:

> Hi folks,
>
> I've run into trouble after moving an arbiter brick to another server
> because of I/O load issues. My setup is as follows:
>
> # gluster volume info
>
> Volume Name: myvol
> Type: Distributed-Replicate
> Volume ID: 43ba517a-ac09-461e-99da-a197759a7dc8
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 3 x (2 + 1) = 9
> Transport-type: tcp
> Bricks:
> Brick1: gv0:/data/glusterfs
> Brick2: gv1:/data/glusterfs
> Brick3: gv4:/data/gv01-arbiter (arbiter)
> Brick4: gv2:/data/glusterfs
> Brick5: gv3:/data/glusterfs
> Brick6: gv1:/data/gv23-arbiter (arbiter)
> Brick7: gv4:/data/glusterfs
> Brick8: gv5:/data/glusterfs
> Brick9: pluto:/var/gv45-arbiter (arbiter)
> Options Reconfigured:
> nfs.disable: on
> transport.address-family: inet
> storage.owner-gid: 1000
> storage.owner-uid: 1000
> cluster.self-heal-daemon: enable
>
> The gv23-arbiter brick was recently moved from another server (chronos)
> using the following command:
>
> # gluster volume replace-brick myvol chronos:/mnt/gv23-arbiter
> gv1:/data/gv23-arbiter commit force
> volume replace-brick: success: replace-brick commit force operation
> successful
>
> It's not the first time I've moved an arbiter brick, and the heal-count
> was zero for all the bricks before the change, so I didn't expect much
> trouble. What probably went wrong is that I then forced chronos out of the
> cluster with the gluster peer detach command. Ever since then, over the
> course of the last 3 days, I have seen this:
>
> # gluster volume heal myvol statistics heal-count
> Gathering count of entries to be healed on volume myvol has been successful
>
> Brick gv0:/data/glusterfs
> Number of entries: 0
>
> Brick gv1:/data/glusterfs
> Number of entries: 0
>
> Brick gv4:/data/gv01-arbiter
> Number of entries: 0
>
> Brick gv2:/data/glusterfs
> Number of entries: 64999
>
> Brick gv3:/data/glusterfs
> Number of entries: 64999
>
> Brick gv1:/data/gv23-arbiter
> Number of entries: 0
>
> Brick gv4:/data/glusterfs
> Number of entries: 0
>
> Brick gv5:/data/glusterfs
> Number of entries: 0
>
> Brick pluto:/var/gv45-arbiter
> Number of entries: 0
>
> According to /var/log/glusterfs/glustershd.log, self-healing is in
> progress, so it might be worth just sitting and waiting, but I'm wondering
> why this 64999 heal-count persists (a limitation of the counter? The gv2
> and gv3 bricks actually contain roughly 30 million files), and I'm
> bothered by the following output:
>
> # gluster volume heal myvol info heal-failed
> Gathering list of heal failed entries on volume myvol has been
> unsuccessful on bricks that are down. Please check if all brick processes
> are running.
>
> I attached the chronos server back to the cluster, with no noticeable
> effect. Any comments and suggestions would be much appreciated.
>
> --
> Best Regards,
>
> Seva Gluschenko
> CTO @ http://webkontrol.ru
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Thousands of EPOLLERR - disconnecting now

2018-02-08 Thread Atin Mukherjee
Are you running a Gluster version <= 3.12?

Did you happen to start seeing this flood after a rebalance? I'm just trying
to rule out that you're hitting
https://bugzilla.redhat.com/show_bug.cgi?id=1484885.
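
If you are not sure, something like this, run on any node, should show both
(the volume name below is a placeholder):

# gluster --version
# gluster volume rebalance <volname> status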

On Fri, Feb 9, 2018 at 4:45 AM, Vijay Bellur  wrote:

>
> On Thu, Feb 8, 2018 at 2:04 PM, Gino Lisignoli 
> wrote:
>
>> Hello
>>
>> I have a large cluster in which every node is logging:
>>
>> I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR -
>> disconnecting now
>>
>> At a rate of around 4 or 5 per second per node, which is adding up to
>> a lot of messages. This seems to happen while my cluster is idle.
>>
>
>
> This log message is normally seen repeatedly when there are problems in
> the network layer. Can you please verify whether the network layer is
> healthy?
>
> Regards,
> Vijay
>
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] trusted.ec.dirty attribute

2018-02-08 Thread Ashish Pandey

It is not good at all.
The file should be healed, and the dirty xattr should be reset to all zeros.

Please provide all the xattrs of all the fragments of this file (a sketch of
the commands follows below).
Provide the output of gluster v heal <volname> info.
Run gluster v heal <volname> and provide the glustershd.log files.
Provide the output of gluster v status <volname>.
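
A rough sketch of how to collect that, assuming a hypothetical volume name
ec-vol and brick root /data/brick (substitute your own), with the getfattr
command repeated on every server that holds a fragment of the file:

# getfattr -d -e hex -m . /data/brick/<relative-path-of-the-file>
# gluster v heal ec-vol info
# gluster v heal ec-vol
# gluster v status ec-vol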

--- 
Ashish 


- Original Message -

From: "Dmitri Chebotarov" <4dim...@gmail.com> 
To: "gluster-users"  
Sent: Friday, February 9, 2018 1:12:39 AM 
Subject: [Gluster-users] trusted.ec.dirty attribute 

Hi 

I've got a problem on an EC volume where heal doesn't seem to work for just a
few files (heal info shows no progress for a few days, and a warning on the
client mount).

I ran 

getfattr -m . -d -e hex / 

across all servers in the cluster, and the 'trusted.ec.dirty' attr is
non-zero on all the files which don't heal.

Is this normal? Is there any way to correct it?
It's my understanding that trusted.ec.dirty should be all zeros for 'good'
files.

Out of 12 copies of the file, 11 have the same non-zero value
(0x00111592), and the 12th has a different non-zero value.

Thank you. 


___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Thousands of EPOLLERR - disconnecting now

2018-02-08 Thread Vijay Bellur
On Thu, Feb 8, 2018 at 2:04 PM, Gino Lisignoli  wrote:

> Hello
>
> I have a large cluster in which every node is logging:
>
> I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR -
> disconnecting now
>
> At a rate of around 4 or 5 per second per node, which is adding up to a
> lot of messages. This seems to happen while my cluster is idle.
>


This log message is normally seen repeatedly when there are problems in
the network layer. Can you please verify whether the network layer is
healthy?
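
A minimal first pass could look something like the following, run from each
node (the peer hostname is a placeholder; 24007 is glusterd's management
port, and gluster volume status also lists the brick ports worth checking):

# ping -c 3 <peer-hostname>
# nc -zv <peer-hostname> 24007
# gluster peer status
# gluster volume status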

Regards,
Vijay
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Thousands of EPOLLERR - disconnecting now

2018-02-08 Thread Gino Lisignoli
Hello

I have a large cluster in which every node is logging:

I [socket.c:2474:socket_event_handler] 0-transport: EPOLLERR -
disconnecting now

At a rate of around 4 or 5 per second per node, which is adding up to a
lot of messages. This seems to happen while my cluster is idle.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] trusted.ec.dirty attribute

2018-02-08 Thread Dmitri Chebotarov
Hi

I've got a problem on an EC volume where heal doesn't seem to work for just a
few files (heal info shows no progress for a few days, and a warning on the
client mount).

I ran

getfattr -m . -d -e hex /

across all servers in the cluster, and the 'trusted.ec.dirty' attr is
non-zero on all the files which don't heal.
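
One way to run that across all servers is a loop along these lines (the
server names, brick root, and file path below are placeholders):

for h in server1 server2 server3; do
    echo "== $h =="
    ssh "$h" "getfattr -d -e hex -m . /data/brick/some/dir/file"
done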

Is this normal? Is there any way to correct it?
It's my understanding that trusted.ec.dirty should be all zeros for 'good'
files.

Out of 12 copies of the file, 11 have the same non-zero value
(0x00111592), and the 12th has a different non-zero value.

Thank you.
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] How to fix an out-of-sync node?

2018-02-08 Thread Karthik Subrahmanya
Hi,

From the information you provided, I am guessing that you have a replica 3
volume configured.
In that case you can run "gluster volume heal <volname>", which should do
the trick for you.
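
For example, with the volume name from your create command (the second
command is just to watch the pending-entry count go down afterwards):

# gluster volume heal myBrick
# gluster volume heal myBrick info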

Regards,
Karthik

On Thu, Feb 8, 2018 at 6:16 AM, Frizz  wrote:

> I have a setup with 3 nodes running GlusterFS.
>
> gluster volume create myBrick replica 3 node01:/mnt/data/myBrick
> node02:/mnt/data/myBrick node03:/mnt/data/myBrick
>
> Unfortunately node1 seemed to stop syncing with the other nodes, but this
> was undetected for weeks!
>
> When I noticed it, I did a "service glusterd restart" on node1, hoping the
> three nodes would sync again.
>
> But this did not happen; only the CPU load went up on all three nodes, and
> the access time went up as well.
>
> When I look into the physical storage of the bricks, node1 is very
> different
> node01:/mnt/data/myBrick : 9GB data
> node02:/mnt/data/myBrick : 12GB data
> node03:/mnt/data/myBrick : 12GB data
>
> How do I sync data from the healthy nodes Node2/Node3 back to Node1?
>
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] How to fix an out-of-sync node?

2018-02-08 Thread Frizz
I have a setup with 3 nodes running GlusterFS.

gluster volume create myBrick replica 3 node01:/mnt/data/myBrick
node02:/mnt/data/myBrick node03:/mnt/data/myBrick

Unfortunately node1 seemed to stop syncing with the other nodes, but this
was undetected for weeks!

When I noticed it, I did a "service glusterd restart" on node1, hoping the
three nodes would sync again.

But this did not happen; only the CPU load went up on all three nodes, and
the access time went up as well.

When I look into the physical storage of the bricks, node1 is very different
node01:/mnt/data/myBrick : 9GB data
node02:/mnt/data/myBrick : 12GB data
node03:/mnt/data/myBrick : 12GB data

How do I sync data from the healthy nodes Node2/Node3 back to Node1?
___
Gluster-users mailing list
Gluster-users@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users